# App Popularity Analysis

This project is designed to help app developers identify common traits among the most popular apps.

We will be analyzing two data sets. The first set is from August 2018 and contains data from approximately 10,000 Android apps on the Google Play Store. The second set is from July 2017 and contains data from approximately 7,000 iOS apps from the App Store.

### Exploring the Data

To begin, we will create a function that allows us to explore these two data sets in more detail. This function will take a data set, print the rows we'd like to see, and (optionally) print the number of rows and columns in the data set passed to it.

In [1]:
from csv import reader

# Slices rows based on start and end arguments, then prints each row in the slice

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # Adds blank lines between rows

    # If rows_and_columns argument is True, prints number of rows and columns in the dataset
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


Now that we have a working function that explores a data set, we can import the two sets that we're analyzing for this project.

In [2]:
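# Opening and reading App Store data, then converting to list of lists
# (encoding is set explicitly because app names contain non-ASCII characters;
# the with statement closes each file once it has been read)

with open('AppleStore.csv', encoding='utf8') as opened_ios:
    read_ios = reader(opened_ios)
    list_ios = list(read_ios)

# Opening and reading Google Play data, then converting to list of lists

with open('googleplaystore.csv', encoding='utf8') as opened_android:
    read_android = reader(opened_android)
    list_android = list(read_android)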

With our data imported and formatted as a list of lists, we can use the exploratory function we created earlier to view the contents of each data set.

In [3]:
explore_data(list_ios[1:], 0, 3, True)
print('\n')
explore_data(list_android[1:], 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1,

Let's also print the header row of each data set to see which columns may be useful in the next stages of our analysis.

In [4]:
print(list_android[0])
print('\n')
print(list_ios[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Identifying Relevant Fields

Based on the header rows of each data set, the following fields look relevant to our analysis. From the Google Play data: Category, Reviews, Installs, Type, Price. From the App Store data: price, rating_count_tot, cont_rating, prime_genre, sup_devices.num, ipadSc_urls.num, vpp_lic.
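The cells in this notebook address columns by position (for example, row[3] for "Reviews"). As a minimal sketch, the positions can be looked up from the header rows instead of being counted by hand. The header lists below are copied from the output above; the helper name column_indexes is hypothetical and not part of the original notebook.

```python
# Header rows copied from the printed output above
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
                  'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
                  'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot',
              'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating',
              'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_indexes(header, names):
    """Hypothetical helper: map each column name to its position in the header row."""
    return {name: header.index(name) for name in names}


android_cols = column_indexes(android_header, ['App', 'Category', 'Reviews', 'Type', 'Price'])
ios_cols = column_indexes(ios_header, ['track_name', 'price', 'prime_genre'])

print(android_cols)
print(ios_cols)
```

With a lookup like this, an expression such as row[android_cols['Reviews']] states which field is being read, at the cost of slightly longer lines.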

Here is the supporting documentation for the two data sets, which provides more context for each of the column fields:

[AppleStore.csv](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) 

[googleplaystore.csv](https://www.kaggle.com/lava18/google-play-store-apps/home)

### Cleaning the Data

Before we jump into analyzing our data sets, we must ensure that there are no errors that could skew our results.

When we read through the documentation of the Google Play data set, we can see that other users have flagged an issue with row 10473. It appears that the "Category" value is missing, which causes the subsequent values to shift one column to the left. We will remove this entry. Because deleting a row moves every later row up by one position, the cell below should be run only once.

In [5]:
print(list_android[10473])
del list_android[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
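The row printed above has 12 fields, while the Google Play header has 13. As a rough sketch, a shifted row like this can also be detected programmatically by comparing each row's length to the header's length. The helper name find_malformed_rows and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Hypothetical helper: return (index, row) pairs whose length differs from the header's."""
    expected_length = len(dataset[0])  # The first row is assumed to be the header
    return [(index, row) for index, row in enumerate(dataset)
            if len(row) != expected_length]


# Made-up sample data: the last row is missing its 'Category' value
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

print(find_malformed_rows(sample))
```

This check only catches rows with a missing or extra field; a row with the right number of fields but wrong values would still need to be inspected by hand.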


### Removing Duplicate Entries

Upon inspection of the Google Play dataset, we can see that there are a number of duplicate entries. To filter out any duplicates, we can loop through each row and inspect the "App" column. For each loop, we need to do two things: 

1) Check to see if we've already encountered the app. We can do this by creating an empty list that will store the names of apps we've encountered for the first time. When we encounter an app name, we'll check to see if the app exists in this list. If it doesn't, it means it's the first time we've encountered that app in the data set, so we need to add it.

2) If the app name already exists in the list we've created, we need to record that app as a duplicate. We'll create a second list to store these duplicate apps.

In [6]:
# Looping through Play Store dataset, identifying if a row is unique or a duplicate, and storing in the applicable list

unique_list = []
duplicate_list = []

for row in list_android:
    app_name = row[0]
    if app_name in unique_list:
        duplicate_list.append(app_name)
    else:
        unique_list.append(app_name)

print(duplicate_list[:3])  
print(len(duplicate_list))

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
1181
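The loop above checks membership in a list, which means each check scans the list from the start. A possible variant, sketched below under the assumption that only app names are being compared, keeps a set alongside the lists so that each membership check is fast. The function name and the sample names are illustrative.

```python
def split_unique_and_duplicates(names):
    """Return (unique, duplicates) lists, preserving the order in which names appear."""
    seen = set()  # Set membership checks do not slow down as the data grows
    unique = []
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
            unique.append(name)
    return unique, duplicates


# Made-up sample names
sample_names = ['Box', 'Instagram', 'Box', 'Slack', 'Box']
unique_names, duplicate_names = split_unique_and_duplicates(sample_names)

print(unique_names)
print(duplicate_names)
```

For a data set of roughly 10,000 rows either approach finishes quickly, so this is mainly a readability and scaling consideration.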


We've identified 1,181 duplicate entries in our dataset. We need to establish some logic for determining which entry to keep for each app. It makes sense to retain the most recent entry. Looking at the column fields, we can see a field called 'Reviews', which gives the total number of reviews for the app. Since review counts grow over time, we can assume that the entry with the highest number of reviews is the most recent one.

In [7]:
# Looping through Play Store dataset to find max number of reviews for each app, then adding to dictionary for reference

reviews_max = {}
for row in list_android[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
# Printing length of newly created dictionary to ensure code above successful

print(len(reviews_max))


9659


Now that we have a dictionary that contains the maximum number of reviews for each app, we can use it to determine which rows we should keep in our "clean" list. Below we will loop through our dataset, compare the number of reviews in each row to our "reviews_max" dictionary, and add rows to our "clean" list whose number of reviews matches the dictionary value for the applicable app name. The "already_added" list guards against cases where several duplicate rows share the same maximum number of reviews.

In [8]:
# Creating a new list with no duplicates

android_clean = []
already_added = []

for row in list_android[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

# Checking the length and first rows of our new cleaned list

print(android_clean[:10])
print(len(android_clean))
    
    

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '
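The two cells above make two passes over the data: one to find the maximum review count per app, and one to collect the matching rows. As an alternative sketch, the same idea can be expressed in a single pass by storing the best row seen so far for each app name. The function name keep_most_reviewed and the sample rows are made up, and the output order (order of first appearance of each name) may differ from the two-pass version.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per app name, the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict comparison keeps the first row when several rows tie for the maximum
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: name, category, rating, reviews
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159904'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```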

### Removing Non-English Apps from the Data Sets

Now that we've successfully removed duplicate entries, we need to further cleanse our data by finding and removing apps that are not built for an English-speaking audience (as they aren't relevant to the market we're studying). To accomplish this task, we can use character codes. Every character in a string has a corresponding number (its Unicode code point), which Python's built-in ord() function returns. The characters commonly used in English text belong to the ASCII range, which runs from 0 to 127.

To find non-English apps, we can loop through the characters of each app name and count those that fall outside this range. However, since characters such as emojis and the ™ symbol also fall outside the range, we need to establish a reasonable threshold so we're not needlessly discarding data. Thus, we will build a function that discards any app whose name contains more than three characters outside the range.

In [9]:
# This function takes in a string and returns False if more than three of its characters fall outside the ASCII range (0 to 127). Otherwise it returns True.

def ascii_check(name):
    out_of_bounds_total = 0
    for char in name:
        if ord(char) > 127:
            out_of_bounds_total += 1
        if out_of_bounds_total > 3:
            return False
    return True

# Checking the results to make sure our function works

print(ascii_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(ascii_check('Instagram'))
print(ascii_check('Docs To Go™ Free Office Suite'))
print(ascii_check('Instachat 😜'))

False
True
True
True
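To see why a threshold is needed, it can help to count how many characters in each test name fall outside the ASCII range. The sketch below uses a small helper, count_non_ascii, which is illustrative and not part of the original notebook.

```python
def count_non_ascii(name):
    """Count the characters in name whose code point is above 127."""
    return sum(1 for char in name if ord(char) > 127)


test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in test_names:
    print(name, '->', count_non_ascii(name))
```

The names containing a single symbol or emoji stay well under the threshold, while the name written mostly in Chinese characters exceeds it.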


Next we'll loop through both data sets and use our newly created "ascii_check" function to filter out any non-English apps. The remaining apps will be added to new "clean" lists. Finally we'll print the lengths of our new lists to see how many apps are left.

In [10]:
# Using the new function to filter out non-English apps, then checking the length of the newly created lists

eng_list_ios = []
eng_android_clean = []

for row in list_ios[1:]:
    app_name = row[1]
    if ascii_check(app_name):
        eng_list_ios.append(row)

for row in android_clean:
    app_name = row[0]
    if ascii_check(app_name):
        eng_android_clean.append(row)

print(len(eng_list_ios))
print(len(eng_android_clean))


6183
9614


### Removing Paid Apps

The company for which we're analyzing the data only builds free apps. Therefore we need to remove any paid apps from our data sets.

To accomplish this, we'll loop through each data set, identify which apps are free, and add those apps to a new list.

In [14]:
free_list_ios = []
free_list_android = []

for row in eng_android_clean:
    app_type = row[6]  # Named app_type to avoid shadowing the built-in type()
    if app_type == 'Free':
        free_list_android.append(row)
        
for row in eng_list_ios:
    price = float(row[4])
    if price == 0.0:
        free_list_ios.append(row)
        
print(len(free_list_android))
print(len(free_list_ios))

8863
3222


### Validation Strategy

Our data is now in sufficient shape to move forward with our analysis. We mentioned at the outset that our goal is to identify the kinds of apps that are likely to attract the most users. For any new app idea, we have a validation strategy that consists of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.
2) If the app has a good response from users, we develop it further.
3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Therefore it's imperative that we identify what makes an app successful on both Google Play and the App Store.

To start, we need to build a frequency table for the most common genres for each market. We'll use the "Genres" and "Category" columns from the Google Play data, and "prime_genre" column from the App Store data. 

To do this, we'll create a reusable function that builds a frequency table for any column of a data set.

In [None]:
def freq_table(dataset, index):
    # Counts how many times each value appears in the column at the given index
    freq_table_dict = {}
    for row in dataset:
        index_value = row[index]
        if index_value not in freq_table_dict:
            freq_table_dict[index_value] = 1
        else:
            freq_table_dict[index_value] += 1
    return freq_table_dict
            
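Raw counts are hard to compare across two data sets of different sizes, so one possible next step is to express each count as a percentage and sort the results. The sketch below is an assumption about how that could look, using collections.Counter from the standard library. The function name freq_table_percent and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return (value, percentage) pairs for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up sample rows: app name, genre
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Social Networking'],
    ['App D', 'Games'],
]

for genre, percent in freq_table_percent(sample_rows, 1):
    print(genre, ':', round(percent, 2))
```

The same function could be called with the index of "prime_genre" for the App Store data and the index of "Genres" or "Category" for the Google Play data.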