# Profitable Apps Profiles for the App Store and Google Play Markets

Our company builds Android and iOS mobile apps that are free to download from the Google Play and App store. Since the apps developed are free, the main source of revenue for the company is in-app ads, therefore revenue is mostly influenced by the number of users who use the app.The goal of this project is to help app developers understand which apps attract the most users.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

To avoid spending resources on collecting new data ourselves, we will use the following two data sets that are freely available on Kaggle:


1. A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link.](https://www.kaggle.com/lava18/google-play-store-apps) 

2. A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link.](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) 

# Exploring the Data

To start off, we define the function `explore_data`, with the following parameters:

1. `dataset`: a set of data in a list of lists without the header row.

2. `start`: the beginning index of the "slice" of data we want to explore.

3. `end` : the ending index of the "slice" of data we want to explore.

4. `rows_and_columns`: a parameter set to `False` by default that when set to `True` will print the number of rows and number of columns in the data set.

In [1]:
from csv import reader
apple_opened_file = open('AppleStore.csv',encoding='utf8')
apple_read_file = reader(apple_opened_file)
apple_data_set = list(apple_read_file)
apple_rows = apple_data_set[1:] #Exlude the header row

google_opened_file = open('googleplaystore.csv',encoding='utf8')
google_read_file = reader(google_opened_file)
google_data_set = list(google_read_file)
google_rows = google_data_set[1:] #Exlude the header row

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # Adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Using the function `explore_data`, let's examine the first couple rows of each data set and find the number of rows and columns in each dataset.

In [2]:
explore_data(apple_rows,0,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


In [3]:
explore_data(google_rows,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Below is a list of a the columns found in the App store and Google Play store data sets. 

**App Store Data Set Columns**

|Column|Index|Definition|
|:---|:---:|:---|
|id|0|App ID|
|track_name|1|App Name|
|size_bytes|2|Size (in Bytes)|
|currency|3|Currency Type|
|price|4|Price amount|
|ratingcounttot|5|User Rating counts (for all version)|
|ratingcountver|6|User Rating counts (for current version)|
|user_rating|7|Average User Rating value (for all version)|
|userratingver|8|Average User Rating value (for current version)|
|ver|9|Latest version code|
|cont_rating|10|Content Rating|
|prime_genre|11|Primary Genre|
|sup_devices.num|12|Number of supporting devices|
|ipadSc_urls.num|13|Number of screenshots showed for display|
|lang.num|14|Number of supported languages|
|vpp_lic|15|Vpp Device Based Licensing Enabled|

**Google Play Store Data Set Columns**

|Column|Index|Definition|
|:---|:---:|:---|
|App|0|Application Name|
|Category|1|Category the app belongs to|
|Rating|2|Overall user rating of the app (as when scraped)|
|Reviews|3|Number of user reviews for the app (as when scraped)|
|Size|4|Size of the app (as when scraped)|
|Installs|5|Number of user downloads/installs for the app (as when scraped)|
|Type|6|Paid or Free|
|Price|7|Price of the app (as when scraped)|
|Content Rating|8|Age group the app is targeted at - Children / Mature 21+ / Adult|
|Genres|9|An app can belong to multiple genres (apart from its main category).|

# Ensuring Accuracy in our Data

## Deleting entries with missing values

From a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on Kaggle for the Google Play dataset we learn that entry 10472 is missing data for the column `category`. In the next step we delete this row, since the missing value offsets the indexes of the remaining values in the row, which could later cause an error in our analysis. Alternatively we could have added the missing value for this row to the data set, however since there are 10,841 rows, deleting one row will not effect our analysis significantly.

In [4]:
del(google_rows[10472])

## Removing Duplicate Entries

The Google Play dataset has some duplicate entries for certain apps. We discover this by looping through the data set and saving unique app names (index 0) to a list called `unique_apps`. Any app name that is already in `unique_apps` is appeneded to the list `duplicate_apps`.By printing the length of the list `duplicate_apps` we see there are 1,181 duplicate entries. 

In [5]:
unique_apps = []
duplicate_apps = []
for app in google_rows:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


One example of an app with duplicate entries is 'Slack'.By examining the duplicate data for 'Slack' printed below, we can see the main difference between the entries is the number of reviews (index 3). Rather than deleting the duplicate entries at random, for any app with duplicate entries we will keep the entry with the highest number of reviews, since this is most likely the most recent version of the data for the app. In the case below, we would keep line 3. If two lines have the same highest number of reviews like lines 1 & 2, the second entry will be removed (line 2).

In [6]:
for app in google_rows:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


To apply the logic described to remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app

- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (the entry with the highest number of reviews)

In [7]:
reviews_max = {}
for app in google_rows:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] > n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(reviews_max['Slack'])


51507.0


The expected length of the dictionary is the total number of rows in the Google Play data set (10841) minus the number of duplicate entries (1181).

In [8]:
print('Expected length: ',len(google_rows)-len(duplicate_apps))
print('Actual length: ', len(reviews_max))

Expected length:  9659
Actual length:  9659


The next step is to use the dictionary `reviews_max` to remove the duplicate rows. We create two empty lists: `android_clean`, a list that will store the data set with the duplicates removed, and `already_added`, a list that will store the names of apps added to the clean dataset.

Next we loop through the Google Play store data set, saving the app name to the variable `name` and the number of reviews an app has to the variable `n_reviews`. As we loop through the rows of data, whenever we get to the row we want to keep (where `n_reviews` is equal to the max `n_reviews` saved in our dictionary), we append that entire row of data to the list `android_clean` and append the name of the app to the list `already_added` as long as the name of the app isn't in `already_added`. It is important to add the name of the app to `already_added` because of the case like Slack where two entries contain the same `n_reviews` value, and that value is the highest number of reviews. We only need to save one of these rows, so we save the first row the loop encounters. The second entry will not be appended to `android_clean`.

Next let's inspect the data for Slack in `android_clean` to confirm there is only one entry, and that the entry contains the value 51507, the max number of reviews from the duplicates. We should also confirm that the length of the clean data set matches the expected length: 9659.

In [9]:
android_clean = [] #A list that will store our new cleaned data set.
already_added = [] # A list that will just store app names.

for app in google_rows:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print('Length of data set without duplicated: ', len(android_clean))
print('\n')
for row in android_clean:
    name = row[0]
    if name == 'Slack':
        print(row)

Length of data set without duplicated:  9659


['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


Now that we have removed the duplicate data from the Google Play Store data set, we confirm that the App Store dataset does not have any duplicate entries by following the same exercise we used for the Google Play store data set. There are no duplicate entries in the App Store data set.

In [10]:
unique_apps_2 = []
duplicate_apps_2 = []
for row in apple_rows:
    id_num = row[0]
    if id_num in unique_apps_2:
        duplicate_apps_2.append(id_num)
    else:
        unique_apps_2.append(id_num)
        
        
print('Number of duplicate apps:', len(duplicate_apps_2))

Number of duplicate apps: 0


## Removing irrelevant data

Since our company only develops English apps, we want to remove any apps that are not directed toward an English-speaking audience like the ones printed below.

In [11]:
print(apple_rows[813][1])

爱奇艺PPS -《欢乐颂2》电视剧热播


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. We can get the corresponding number of each character using the `ord()` built-in function. We will use this information to identify app names with foreign characters by iterating through the name of each app looking for non-English characters.

The function `language_checker` will take in a string and return either `True` or `False` depending on if that string contains non-English characters (where `ord()` function returns a value greater than 127).

In [12]:
def language_checker(string):
    for letter in string:
        if ord(letter) > 127:
            return False
    return True

Now we can test if the following app names are detected correctly:
- 'Instagram'
- '爱奇艺PPS -《欢乐颂2》电视剧热播'
- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

Below we see that the function has classified some English apps as non-English because they contain special characters like emojis or ™ that fall outside the ASCII range and have corresponding numbers over 127. 

In [13]:
name_1 = 'Instagram'
name_2 = '爱奇艺PPS -《欢乐颂2》电视剧热播'
name_3 = 'Docs To Go™ Free Office Suite'
name_4 = 'Instachat 😜'

print(language_checker(name_1))
print(language_checker(name_2))
print(language_checker(name_3))
print(language_checker(name_4))

print('\n')

print(ord('™')) #Some English special characters fall outside the ASCII range
print(ord('😜'))

True
False
False
False


8482
128540


Since we don't want to exclude English apps with special characters like emojis, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.This means all English apps with up to three emojis or other special characters will still be labeled as English.While this solution isn't perfect, it will minimize data loss and be fairly effective.

In [14]:
def language_checker(string):
    char_count = 0
    for letter in string:
        if ord(letter) > 127:
            char_count += 1
    if char_count>3:
        return False
    return True

Our new `language_checker` function now correctly identifies the English and non-English apps.

In [15]:
name_1 = 'Instagram'
name_2 = '爱奇艺PPS -《欢乐颂2》电视剧热播'
name_3 = 'Docs To Go™ Free Office Suite'
name_4 = 'Instachat 😜'

print(language_checker(name_1))
print(language_checker(name_2))
print(language_checker(name_3))
print(language_checker(name_4))

True
False
True
True


We'll now use the `language_checker` function to filter out non-English apps from both data sets. We'll loop through each data set and append rows with English app names to a new list, `apple_new` or `google_new`.

In [16]:
#Loop through App Store Data:

apple_new = []
for row in apple_rows:
    name = row[1]
    if language_checker(name) == True:
        apple_new.append(row)

#Loop through Google Store Data:

google_new = []

for row in android_clean:
    name = row[0]
    if language_checker(name) == True:
        google_new.append(row)

Next we'll explore the data sets and see how many rows we have remainining for each data set. The App store data set started off with 7197 rows of data. We now have 6183 rows which means we removed 1014 rows with non-English app names. 

The Google Play store data set started off with 10841 rows of data. We removed one row that had a missing field leaving us with 10840 rows. We then removed 1181 duplicate rows, keeping the row with the maximum number of reviews, leaving us with 9659 rows. We now have 9614 rows which means we removed 45 rows with non-English app names. 

We did not remove any columns from either data sets.

In [17]:
explore_data(apple_new,0,2,True)
print('\n')
explore_data(google_new,0,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis. To do this we will loop through each data set to isolate rows with free apps in a separate list (`ios` and `android`). In the Google Play store data set, the `price` column has index 6, and free apps have a price equal to the string 'Free'. In the App store data set, the `price` column has index 4, and free apps have a price equal to the integer 0.

In [18]:
ios = []

for row in apple_new:
    price = float(row[4])
    if price == 0:
        ios.append(row)

android = []

for row in google_new:
    price = str(row[6])
    if price == 'Free':
         android.append(row)

The final length of our data sets from the App store and Google Play store are 4056 and 8861 rows respectively. The lists `ios` and `android` isolate free apps from the App store and Google Play store directed toward an English-speaking audience. We can now use these lists to perform our analysis.

In [19]:
explore_data(ios,0,1,True)
print('\n')
explore_data(android,0,1,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3222
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8861
Number of columns: 13


# Analyzing App Genres and Categories

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

We'll begin our analysis by getting a sense of what the most common genres are for each market. By inspecting the column definitions, the useful column in the App Store data set will be `prime_genre`. We'll use the columns `Genres` and `Category` from the Google Play Store data set. Using these columns we will generate frequency tables to determine the most common genres of app in each market.

|Data Set|Column|Index|Column Definition|
|:---|:---|:---|:---|
|App Store|prime_genre|11|Primary Genre|
|Google Play Store|Category|1|Category the app belongs to|
|Google Play Store|Genres|9|An app can belong to multiple genres (apart from its main category)


We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

The function `freq_table()` takes in two inputs: `dataset` (which is expected to be a list of lists) and `index` (which is expected to be an integer). The function returns the frequency table (as a dictionary) for any column we want. First, we find the count of each key in the dataset (which will be genre in our case), and then we divide each value at that key by the total number of apps (the length of the dataset) to express the frequency of that genre as a percentage.

In [20]:
def freq_table(dataset,index): # Dataset:a list of lists; index:integer
    freq_dict = {}
    for row in dataset:
        key = row[index]
        if key in freq_dict:
            freq_dict[key] += 1
        else:
            freq_dict[key] = 1
    for key in freq_dict:
        freq_dict[key] /= len(dataset)
    return freq_dict

Next we'll define a second function to display the percentages in our frequency dictionary in a descending order. The `display_table()` function you see below:

- Takes in two parameters: `dataset` and `index`. `dataset` is expected to be a list of lists, and `index` is expected to be an integer.
- Generates a frequency table using the `freq_table()` function (defined above).
- Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
- Prints the entries of the frequency table in descending order.

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

## Analysis of prime_genres in the App Store (English, Free)

The most common `prime_genre` amoung free, English apps is 'Games' with over half of all apps falling into this genre (56%). The runner-up genre is 'Entertainment', but it is significantly less frequent and makes up just 8% of apps. The remaining 21 apps have a frequency below 4%. The top 4 genres (Games, Entertainment, Photo & Video, and Social Networking) seem to be designed for fun and pleasure versus Productivity/Utilities which we see at lower frequencies. Even though Games make up 56% of apps in the App Store, this does not necessarily mean these apps have the most users. We will look into app usage later on. Since we want to deploy our app to the Google Play store first, it's important we also take a look at the dominant genres in that market as well.

In [22]:
print('ios prime_genre Frequency Table:')
print('Number of genres: ', len(freq_table(ios,11)))
print('\n')
print(display_table(ios,11))

ios prime_genre Frequency Table:
Number of genres:  23


Games : 0.5816263190564867
Entertainment : 0.07883302296710118
Photo & Video : 0.04965859714463067
Education : 0.03662321539416512
Social Networking : 0.032898820608317815
Shopping : 0.0260707635009311
Utilities : 0.025139664804469275
Sports : 0.021415270018621976
Music : 0.020484171322160148
Health & Fitness : 0.020173805090006207
Productivity : 0.01738050900062073
Lifestyle : 0.015828677839851025
News : 0.01334574798261949
Travel : 0.012414649286157667
Finance : 0.0111731843575419
Weather : 0.008690254500310366
Food & Drink : 0.008069522036002483
Reference : 0.00558659217877095
Business : 0.005276225946617008
Book : 0.004345127250155183
Navigation : 0.00186219739292365
Medical : 0.00186219739292365
Catalogs : 0.0012414649286157666
None


## Analysis of Category and Genres in the Google Play Store (English, Free)

The most common app category in the Google Play Store is 'Family' with 18% of apps falling into this cateogory. Similar to the App Store `prime_genres`, after the top 3 categories, the frequency drops off to less than 4% for the remaining categories.

In [23]:
print('Android Category Frequency Table:')
print('Number of Categories: ', len(freq_table(android,1)))
print('\n')
print(display_table(android,1))
print('\n')

Android Category Frequency Table:
Number of Categories:  33


FAMILY : 0.18778918857916713
GAME : 0.09637738404243314
TOOLS : 0.08441485159688522
BUSINESS : 0.04581875634804198
LIFESTYLE : 0.039047511567543165
PRODUCTIVITY : 0.03893465748786819
FINANCE : 0.03701613813339352
MEDICAL : 0.03532332693826882
SPORTS : 0.03419478614151902
PERSONALIZATION : 0.03317909942444419
COMMUNICATION : 0.03250197494639431
HEALTH_AND_FITNESS : 0.03069630967159463
PHOTOGRAPHY : 0.029454914795169845
NEWS_AND_MAGAZINES : 0.027987811759395104
SOCIAL : 0.02663356280329534
TRAVEL_AND_LOCAL : 0.023360794492720913
SHOPPING : 0.02245796185532107
BOOKS_AND_REFERENCE : 0.021442275138246248
DATING : 0.018620923146371742
VIDEO_PLAYERS : 0.01783094458864688
MAPS_AND_NAVIGATION : 0.013993905879697552
EDUCATION : 0.012526802843922808
FOOD_AND_DRINK : 0.012413948764247828
ENTERTAINMENT : 0.010382575330098183
LIBRARIES_AND_DEMO : 0.009366888613023362
AUTO_AND_VEHICLES : 0.00925403453334838
HOUSE_AND_HOME : 0.0083512018959

Since it isn't immediately clear what kind of apps fall into the category 'Family', let's print the names of the apps with the top 10 highest number of reviews, leveraging code from the display_table function we created earlier to create a dictionary where the keys are app names, the values are the number of reviews an app has, and the dictionary is sorted in descending number of reviews. 

The top 10 'FAMILY' apps include mostly games, with some directed at teens like 'Clash of Clans' and 'Clash Royale', and others directed at children like 'My Talking Tom' and 'Minion Rush'.

In [24]:
categories_dict = {}
for row in android:
    category = str(row[1])
    name = row[0]
    n_reviews = float(row[3])
    if category == 'FAMILY':
        categories_dict[name] = n_reviews
        
table_display = []
for key in categories_dict:
    key_val_as_tuple = (categories_dict[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    entry[1], ':', entry[0] 
    
for i in range(10):
    print(str(i+1),'.',table_sorted[i][1])

1 . Clash of Clans
2 . Clash Royale
3 . Candy Crush Saga
4 . My Talking Tom
5 . Pou
6 . Minion Rush: Despicable Me Official Game
7 . Hay Day
8 . My Talking Angela
9 . Boom Beach
10 . Netflix


While there are 33 Categories of apps in the Google Play Store, there are 114 genres of app. This is because multiple genres can be used to tag an app. The number of genre combinations has driven up the number of genres for the Google Play Store to 114.

In [25]:
print('Android Genres Frequency Table:')
print('Number of Genres: ', len(freq_table(android,9)))
print('\n')
print(display_table(android,9))

Android Genres Frequency Table:
Number of Genres:  114


Tools : 0.08430199751721025
Entertainment : 0.06071549486513937
Education : 0.05349283376594064
Business : 0.04581875634804198
Productivity : 0.03893465748786819
Lifestyle : 0.03893465748786819
Finance : 0.03701613813339352
Medical : 0.03532332693826882
Sports : 0.03464620246021894
Personalization : 0.03317909942444419
Communication : 0.03250197494639431
Action : 0.031034871910619568
Health & Fitness : 0.03069630967159463
Photography : 0.029454914795169845
News & Magazines : 0.027987811759395104
Social : 0.02663356280329534
Travel & Local : 0.023247940413045932
Shopping : 0.02245796185532107
Books & Reference : 0.021442275138246248
Simulation : 0.020426588421171427
Dating : 0.018620923146371742
Arcade : 0.018620923146371742
Video Players & Editors : 0.01783094458864688
Casual : 0.01749238234962194
Maps & Navigation : 0.013993905879697552
Food & Drink : 0.012413948764247828
Puzzle : 0.011285407967498025
Racing : 0.0099311590113982

The most frequent genre is 'Tools', however this genre only makes up 8% of apps. Since 'Tools' is only the third most frequent category, you may be wondering why it's the most frequent genre. One reason is that there is no genre called 'Family'. Let's print the genres of the ten most reviewed apps we listed above. The top 10 most reviewed 'FAMILY' games fall into a few different genres like 'Strategy', 'Casual', and 'Entertainment'. Since there is no genre called 'Games' (other than Brain Games), 'Family' apps end up falling into multiple genres, with children's games labelled as 'Casual' for our top 10 apps. Since we are only looking for a high-level classification of what type of app to develop, we should be ok to focus on the `Category` column instead of the `genre` column for the Google Play store data set. 

In [26]:
top_10 = []
top_10_genres = []

for i in range(10):
    top_10.append(table_sorted[i][1])
    
for i in range(len(top_10)):
    for row in android:
        name = row[0]
        genre = row[9]
        if top_10[i] == name:
            top_10_genres.append(genre)
            
names_genres = dict(zip(top_10,top_10_genres))
for name in names_genres:
    print(name, ":",names_genres[name])

Clash of Clans : Strategy
Clash Royale : Strategy
Candy Crush Saga : Casual
My Talking Tom : Casual
Pou : Casual
Minion Rush: Despicable Me Official Game : Casual;Action & Adventure
Hay Day : Casual
My Talking Angela : Casual
Boom Beach : Strategy
Netflix : Entertainment


Overall, our analysis of the above frequency tables shows us that the App store landscape is dominated by Gaming apps, whereas the Google Play Store landscape has more of a spread where the most common app cateogry ('Family') only represents 18% of apps. Going forward we will use the `prime_genre` and `Category` columns from the App Store and Google Play store data, leaving out the `genre` column from the Google Play Store data because it is a bit too granular for our purposes.

## Most Popular Apps by Genre in the App Store

Now, we'd like to get an idea about the kind of apps with the most users.One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let's start with calculating the average number of user ratings per app genre in the App Store. To do that, we'll need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre

We'll start by generating a frequency table for the `prime_genre` column to get the unique app genres.

In [27]:
genres_dict = freq_table(ios,11)

Below, we calculate the average number of user ratings per app genre in the App Store, saving the genres and average number of user ratings to the dictionary `genres_dict` (where the values are first the frequency of the genre app, and are then overwritten with the average user ratings). 

In [28]:
for genre in genres_dict:
    total = 0
    len_genre = 0
    for row in ios:
        genre_app = row[11]
        if genre_app == genre:
            user_rating = float(row[5]) 
            total += user_rating
            len_genre += 1
    avg_user_ratings = total/len_genre
    genres_dict[genre] = avg_user_ratings

table_display = []
for key in genres_dict:
    key_val_as_tuple = (genres_dict[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


The ios app genre with the highest number of average user ratings is 'Reference'. Let's take a look at the apps that fall into this category. The most popular app is the Bible, followed by apps for Dictionary.com, other religious texts, and then apps that provide reference material (for example cheat sheets) for various versions of Minecraft. Based on this data, we may want to propose creating an app that provides reference material for a particular kind of popular game, a language dictionary for travellers abroad, or some type of niche Encyclopedia.

In [29]:
for app in ios:
    name = app[1]
    ratings = app[5]
    if app[11] == 'Reference':
        print(name,':', ratings)

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


## Most Popular Apps by Category in the Google Play Store

Now let's analyze the Google Play market. We came up with an app profile recommendation for the App Store based on the number of user ratings. We have data about the number of installs for the Google Play market. The number of installs is not a precise number, but rather a rounded number that represents a 'floor' number of installs (Ex. 10,000+, 1,000,000+). Although this value isn't precise, we can still use it to determine the popularity of an app category.

In [30]:
display_table(android,5) # The Installs column

1,000,000+ : 0.15754429522627242
100,000+ : 0.11556257758717978
10,000,000+ : 0.10506714817740662
10,000+ : 0.1022457961855321
1,000+ : 0.08407628935786028
100+ : 0.0691795508407629
5,000,000+ : 0.06816386412368808
500,000+ : 0.055749915359440246
50,000+ : 0.047737275702516645
5,000+ : 0.0451416318699921
10+ : 0.035436181017943796
500+ : 0.03250197494639431
50,000,000+ : 0.022796524094346012
100,000,000+ : 0.021216566978896286
50+ : 0.019185193544746643
5+ : 0.007899785577248618
1+ : 0.005078433585374111
500,000,000+ : 0.002708497912199526
1,000,000,000+ : 0.002257081593499605
0+ : 0.000451416318699921


We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float.

To calculate the average number of installs per app category for the Google Play data set, we'll start by generating a frequency table for the `Category` column of the Google Play data set to get the unique app genres using the `freq_table()` function defined in a previous step of our analysis. We save the categories and average number of installs to the dictionary (where the values are first the frequency of the category app, and are then overwritten with the average number of installs).

Below we see that the category with the highest average user installs is 'COMMUNICATION.'

In [31]:
categories = freq_table(android,1)

for category in categories:
    total = 0
    len_category = 0
    for app in android:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total/len_category
    categories[category] = avg_installs
    
table_display = []
for key in categories:
    key_val_as_tuple = (categories[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38326063.197916664
VIDEO_PLAYERS : 24790074.17721519
SOCIAL : 23253652.127118643
ENTERTAINMENT : 19428913.04347826
PHOTOGRAPHY : 17805627.643678162
PRODUCTIVITY : 16772838.591304347
TRAVEL_AND_LOCAL : 13984077.710144928
GAME : 13006872.892271662
TOOLS : 10695245.286096256
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
FAMILY : 4374336.352163462
SPORTS : 4274688.722772277
HEALTH_AND_FITNESS : 4167457.3602941176
MAPS_AND_NAVIGATION : 4056941.7741935486
EDUCATION : 3057207.207207207
FOOD_AND_DRINK : 1924897.7363636363
ART_AND_DESIGN : 1905351.6666666667
BUSINESS : 1704192.3399014778
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1313681.9054054054
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.

Let's take a look at the apps that fall into the 'COMMUNICATION' category. This category contains instant messaging apps, internet browsers, caller ID, and call blocking apps amoung other things. Instant messaging and internet browsers would have a long development time, and that market is already dominated by services like WhatsApp and Safari. The fourth most popular category is 'ENTERTAINMENT'. Perhaps this would go along better with the App Store genre of 'Reference', like some type of entertaining 'Reference' material.

In [32]:
for app in android:
    name = app[0]
    installs = app[5]
    if app[1] == 'COMMUNICATION':
        print(name,':', installs)

Messenger – Text and Video Chat for Free : 1,000,000,000+
Messenger for SMS : 10,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
My Tele2 : 5,000,000+
Firefox Browser fast & private : 100,000,000+
Yahoo Mail – Stay Organized : 100,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Opera Mini - fast web browser : 100,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
Opera Browser: Fast and Secure : 100,000,000+
TracFone My Account : 1,000,000+
Firefox Focus: The privacy browser : 1,000,000+
Google Voice : 10,000,000+
Chrome Dev : 5,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Who : 100,000,000+
Skype Lite - Free

Some of the facts we know right now are that:

- 'Reference' apps have the highest average number of user ratings in the App Store, followed by Music, Social Networking, and Weather.

- 'Communication' apps have the highest average number of installs in the Google Play store, followed by Video Players, Social, and Entertainment apps.

At this stage I like the profile of an Entertainment/Reference app. As mentioned in our introduction, since this app is free, our revenue source is in-app ads viewed by the user. An interesting [study](https://www.bankmycell.com/blog/cell-phone-usage-in-toilet-survey) by BankMyCell suggests that 3 in 4 Americans admit to using their phone whilst on the toilet. Also 88% of Android owners and 76% of iPhone owners use their device on the toilet. Based on these stats along with the data we have analyzed, the app we decide to pitch is called 'Toilet Trivia'. It is an app that will tell you entertaining fun facts while you're on the toilet. The app might also include trivia quizzes to make it more engaging. This type of app would result in users spending time in the app (depending on how long they're in the bathroom). While people surveyed in the BankMyCell study listed using social media, messaging, listening to music, watching videos, and mobile gaming as activities they do while on the toilet, a trivia app might put a more productive spin on this time, and therefore might be appealing to users who want to do something less mindless.

# Conclusion
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

When we combined our analysis with a study that found that 88% of Android owners and 76% of iPhone owners use their device on the toilet, we decided on creating an app called "Toilet Trivia" with fun facts and trivia quizzes meant to be used while in the loo could be profitable for both the Google Play and the App Store markets.