# Profitable app profiles for the App Store and Google Play markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Data Sources
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. We use two data sets:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [1]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
    else:
        print('Number of rows: ', len(dataset)-1)
        print('Number of columns: ', len(dataset[0]))

In [2]:
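from csv import reader

# Read each CSV into a list of rows; the first row of each list is the header.
# The encoding is stated explicitly because some app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_list = list(reader(apple_file))

with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_list = list(reader(google_file))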

In [3]:
explore_data(apple_list, 0, 1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows:  7197
Number of columns:  16


In [4]:
explore_data(google_list, 0, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows:  10841
Number of columns:  13


From the above output, we notice that both data sets have a header row. With the default `rows_and_columns = False`, the `explore_data` function does not count the header as part of the data, so it reports the number of data rows and the number of columns.

Detailed documentation on the App Store data set can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# Cleaning the data

The Google Play data has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10472 *(10473 if you count the header as a row)*.

Let's verify that the data is really missing with the code below.

In [5]:
google_header = google_list[0]
google_data = google_list[1:]

print('Header Length: ', len(google_header))
print('10472 Row Length: ', len(google_data[10472]))

Header Length:  13
10472 Row Length:  12


The header has a length of 13, but row 10472 has only 12 fields. Let's find out which column is missing.

In [6]:
print('Header: ')
print(google_header)
print('\n')
print('Row 10472: ')
print(google_data[10472])

Header: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Row 10472: 
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Comparing the two printed lists, the `Category` column should contain text, but the row shows **'1.9'**, which clearly belongs to the numeric **Rating** field. The category value is missing, so every later value has shifted one column to the left.

Using this row for analysis is not ideal, so let's remove it.

**Do not run the statement below more than once: each additional run deletes a row that is clean.**

In [7]:
del google_data[10472]

Let us double-check that all the other rows in the Google Play data are complete. We know that each row in googleplaystore.csv should have a length of 13, so we will iterate through the rows and remove any row that has fewer than 13 fields. We will also keep a backup of the removed rows in case we need them in the future.

In [8]:
incorrect_google_data = []
incorrect_google_data.append(google_header)
print(incorrect_google_data)

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]


In [9]:
print("Length of dataset prior to removing incomplete data: ",len(google_data))

# Iterate over a copy so that removing rows from google_data does not skip elements
for row in google_data[:]:
    if len(row) < 13:
        incorrect_google_data.append(row)
        google_data.remove(row)

Length of dataset prior to removing incomplete data:  10840


In [10]:
print("Length of dataset after removing incomplete data: ",len(google_data))

Length of dataset after removing incomplete data:  10840


After attempting to delete incomplete rows, the length of `google_data` is unchanged at 10840, which is one fewer than the original 10841 because of the row we deleted manually. So apart from row 10472, all the other rows have the full 13 fields.
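Removing items from a list while looping over it can cause Python to skip elements, so one alternative is to build new lists and leave the input untouched. The sketch below uses a hypothetical helper, `split_by_length`, and toy rows that are not part of the original notebook; it only illustrates the idea.

```python
def split_by_length(rows, expected_length):
    """Return (complete, incomplete) lists without mutating `rows`."""
    complete = []
    incomplete = []
    for row in rows:
        if len(row) == expected_length:
            complete.append(row)
        else:
            incomplete.append(row)
    return complete, incomplete


# Toy rows standing in for google_data (hypothetical values).
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one field short, like row 10472
    ['App C', 'GAME', '4.5'],
]

complete_rows, incomplete_rows = split_by_length(sample_rows, 3)
print(len(complete_rows), len(incomplete_rows))
```

Because the input list is never modified, this version is also safe to run more than once.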

Let's run the same check on the App Store data set.

In [11]:
apple_header = apple_list[0]
apple_data = apple_list[1:]
print("Total number of fields: ", len(apple_header))

Total number of fields:  16


We have a total of 16 fields in the AppleStore data set. Let's iterate through each row to eliminate incomplete records.

In [12]:
incorrect_apple_data = []
incorrect_apple_data.append(apple_header)
print(incorrect_apple_data)

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']]


In [13]:
print("Length of dataset prior to removing incomplete data: ",len(apple_data))

# Iterate over a copy so that removing rows from apple_data does not skip elements
for row in apple_data[:]:
    if len(row) < 16:
        incorrect_apple_data.append(row)
        apple_data.remove(row)

print("Length of dataset after removing incomplete data: ",len(apple_data))

Length of dataset prior to removing incomplete data:  7197
Length of dataset after removing incomplete data:  7197


The number of records is the same after attempting to remove incomplete records, so every row in the Apple data set has all 16 fields.

Now that every row is complete, the next cleaning step is to find duplicates.

In [14]:
duplicate_google_apps = []
unique_google_apps = []

for app in google_data:
    app_name = app[0]
    if app_name in unique_google_apps:
        duplicate_google_apps.append(app_name)
    else:
        unique_google_apps.append(app_name)

print("Number of Unique apps: ", len(unique_google_apps))        
print("Number of duplicate apps: ", len(duplicate_google_apps))
print("Sample duplicate apps: \n", duplicate_google_apps[:10])

Number of Unique apps:  9659
Number of duplicate apps:  1181
Sample duplicate apps: 
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


There are a total of 1181 duplicate app entries in the googleplaystore data set. These entries should be removed according to a clear criterion rather than at random, so that we don't lose potentially useful information.
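Checking membership in a list gets slower as the list grows, so on larger data sets a set is a common way to track names already seen. The following is a minimal sketch under that assumption; `find_duplicates` and the sample names are illustrative and not part of the original notebook.

```python
def find_duplicates(names):
    """Return (unique_names, duplicate_names), preserving first-seen order."""
    seen = set()
    unique_names = []
    duplicate_names = []
    for name in names:
        if name in seen:
            # Every repeat after the first occurrence counts as a duplicate.
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names


sample_names = ['Box', 'Slack', 'Box', 'Google Ads', 'Box', 'Slack']
unique_names, duplicate_names = find_duplicates(sample_names)
print(len(unique_names), len(duplicate_names))
```

As in the cell above, the duplicate count is the number of extra entries, not the number of distinct names that repeat.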

Let us look at a few of these apps to see if we can find a pattern.

In [15]:
for app in google_data:
    app_name = app[0]
    if app_name == 'Google Ads':
        print(app)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


For 'Google Ads', the first two entries are identical, and the third differs only in the number of reviews (29331 versus 29313). Apart from that, it makes little difference which record we keep. Let's look at another app to confirm the pattern.

In [16]:
for app in google_data:
    app_name = app[0]
    if app_name == '365Scores - Live Scores':
        print(app)

['365Scores - Live Scores', 'SPORTS', '4.6', '666521', '25M', '10,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 29, 2018', '5.5.9', '4.1 and up']
['365Scores - Live Scores', 'SPORTS', '4.6', '666246', '25M', '10,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 29, 2018', '5.5.9', '4.1 and up']


For this app, the records differ only in the number of `Reviews`: the first record has more reviews, with the remaining fields being the same. So it makes sense to keep, for each app, the record that has the most reviews.

## Removing duplicates

In [17]:
google_reviews_max = {}

# Record the highest review count seen for each app name
for row in google_data:
    app_name = row[0]
    app_reviews = float(row[3])
    if app_name not in google_reviews_max or app_reviews > google_reviews_max[app_name]:
        google_reviews_max[app_name] = app_reviews

We found a total of 1181 duplicate entries and 9659 unique apps, so our dictionary should have a length of 9659.

In [18]:
print(len(google_reviews_max))

9659


In [19]:
google_data_clean = []
google_data_already_added = []

for rows in google_data:
    app_name = rows[0]
    n_reviews = float(rows[3])
    if n_reviews == google_reviews_max[app_name] and app_name not in google_data_already_added:
        google_data_clean.append(rows)
        google_data_already_added.append(app_name)
        
print(len(google_data_clean))        

9659


The `google_data_clean` list has the expected number of rows, one per unique app. We will use it from now on for further analysis.
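The two-step approach above (first find the maximum review count, then collect matching rows) can also be done in a single pass by storing the best row per app. This is only a sketch: `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened to four columns for readability.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened rows (name, category, rating, reviews) based on the examples above.
sample_rows = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
    ['365Scores - Live Scores', 'SPORTS', '4.6', '666521'],
    ['365Scores - Live Scores', 'SPORTS', '4.6', '666246'],
]

for row in keep_most_reviewed(sample_rows):
    print(row[0], ':', row[3])
```

One difference worth noting: the result is ordered by each app's first appearance, whereas the notebook's version keeps the position of the row with the most reviews.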

Let's check whether the App Store data is clean and, if not, clean it before moving further.

In [20]:
duplicate_apple_app_id = []
unique_apple_app_id = []

duplicate_apple_app_name = []
unique_apple_app_name = []

for app in apple_data:
    app_id = app[0]
    app_name = app[1]
    
    if app_id in unique_apple_app_id:
        duplicate_apple_app_id.append(app_id)
    else:
        unique_apple_app_id.append(app_id)
        
    if app_id in unique_apple_app_name:
        duplicate_apple_app_name.append(app_id)
    else:
        unique_apple_app_name.append(app_id)

print("Number of Unique apps: ", len(unique_apple_app_id))        
print("Number of duplicate apps: ", len(duplicate_apple_app_id))

print("Number of Unique apps: ", len(unique_apple_app_name))        
print("Number of duplicate apps: ", len(duplicate_apple_app_name))


Number of Unique apps:  7197
Number of duplicate apps:  0
Number of Unique apps:  7197
Number of duplicate apps:  0


There are no duplicate app IDs in `apple_data`. Note that the second check in the cell above also compares `app_id` values rather than `app_name`, so it confirms the ID result and does not test app names independently. Browsing through the data set, however, we can see some apps whose names are not in English.

Since we develop apps in English, we need to filter both data sets so that they contain only English-language apps.

In [21]:
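def isEnglishApp(app_name):
    # Count the characters that fall outside the ASCII range (code points above 127)
    non_ascii_count = 0

    for char in app_name:
        if ord(char) > 127:
            non_ascii_count += 1

    # Allow up to three non-ASCII characters (for example emoji or symbols
    # such as ™) before treating the name as non-English
    return non_ascii_count <= 3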

In [22]:
print('Length of google apps before English filter: ', len(google_data_clean))
print('Length of apple apps before English filter: ', len(apple_data))

google_data_clean_english = []
apple_data_clean_english = []

# Creating list of google apps which are unique and English
for row in google_data_clean:
    app_name = row[0]
    if isEnglishApp(app_name):
        google_data_clean_english.append(row)
        
# Creating list of apple which are unique and English
for row in apple_data:
    app_name = row[1]
    if isEnglishApp(app_name):
        apple_data_clean_english.append(row)
        

Length of google apps before English filter:  9659
Length of apple apps before English filter:  7197


In [23]:
print('Length of google apps after English filter: ', len(google_data_clean_english))
print('Length of apple apps after English filter: ', len(apple_data_clean_english))

Length of google apps after English filter:  9614
Length of apple apps after English filter:  6183
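To get a feel for how this kind of filter behaves, the sketch below applies the same rule (at most three characters with a code point above 127) to a few names. `is_probably_english` is a hypothetical stand-in written for this example, and apart from `教えて!goo`, which appears in the App Store data, the names are made up for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Heuristic: treat a name as English if it has few non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


examples = [
    'Instagram',                        # no non-ASCII characters
    'Docs To Go™ Free Office Suite',    # one symbol
    'Instachat 😜',                     # one emoji
    '教えて!goo',                        # exactly three non-ASCII characters
    '爱奇艺PPS -《欢乐颂2》电视剧热播',     # many non-ASCII characters
]

for name in examples:
    print(name, ':', is_probably_english(name))
```

The heuristic is not perfect: a name with exactly three non-ASCII characters still passes even when it is not English, which is the trade-off for keeping English names that contain emoji or trademark symbols.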


We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [25]:
print(google_header)
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In the Google Play data, the price is in the `Price` column at index 7; in the App Store data, it is in the `price` column at index 4.

In [42]:
print(type(apple_data_clean_english[1][4]))

<class 'str'>


In [48]:
google_data_clean_english_free = []
apple_data_clean_english_free = []

for row in google_data_clean_english:
    if row[7] == '0.0' or row[7] == '0':
        google_data_clean_english_free.append(row)
        
for row in apple_data_clean_english:
    if row[4] == '0.0' or row[4] == '0':
        apple_data_clean_english_free.append(row)        

In [49]:
print(len(google_data_clean_english_free))
print(len(apple_data_clean_english_free))

8864
3222


We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
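Comparing against the exact strings `'0'` and `'0.0'` works for these files, but a numeric comparison is less sensitive to formatting. Below is a minimal sketch with a hypothetical helper, `is_free`; it assumes that paid prices may carry a leading `$` and that any value that cannot be parsed as a number should be treated as not free.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    cleaned = price_text.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Assumption: an unparseable price is treated as not free.
        return False


for price in ['0', '0.0', '$4.99', '2.99', 'Everyone']:
    print(repr(price), ':', is_free(price))
```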

# Most Common Apps by Genre
Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [51]:
#Creating a dictionary which has the Genre and Frequency
apple_counts_by_genre = {}
for row in apple_data_clean_english_free:
    genre = row[11]
    
    if genre in apple_counts_by_genre:
        apple_counts_by_genre[genre] += 1
    else:
        apple_counts_by_genre[genre] = 1
        
print(apple_counts_by_genre)        

{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1874, 'Music': 66, 'Reference': 18, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 81, 'Travel': 40, 'Shopping': 84, 'News': 43, 'Navigation': 6, 'Lifestyle': 51, 'Entertainment': 254, 'Food & Drink': 26, 'Sports': 69, 'Book': 14, 'Finance': 36, 'Education': 118, 'Productivity': 56, 'Business': 17, 'Catalogs': 4, 'Medical': 6}


In [67]:
#Storing the app frequency into a list
apple_freq_list = []
total_apple_apps = len(apple_data_clean_english_free)
for key in apple_counts_by_genre:
    list_tuple = (round((apple_counts_by_genre[key]/total_apple_apps)*100,2), key)
    apple_freq_list.append(list_tuple)

sorted_apple_freq_list = sorted(apple_freq_list, reverse=True)    

In [68]:
#Printing the Frequency table ordered by Number of repetitions
for lists in sorted_apple_freq_list:
    print(lists[1], ':', lists[0])

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


The top five genres on the App Store are Games, Entertainment, Photo & Video, Education, and Social Networking.

In [69]:
#Creating a dictionary which has the Genre and Frequency
google_counts_by_genre = {}
for row in google_data_clean_english_free:
    genre = row[9]
    
    if genre in google_counts_by_genre:
        google_counts_by_genre[genre] += 1
    else:
        google_counts_by_genre[genre] = 1
        
print(google_counts_by_genre)  

{'Art & Design': 53, 'Art & Design;Creativity': 6, 'Auto & Vehicles': 82, 'Beauty': 53, 'Books & Reference': 190, 'Business': 407, 'Comics': 54, 'Comics;Creativity': 1, 'Communication': 287, 'Dating': 165, 'Education': 474, 'Education;Creativity': 4, 'Education;Education': 30, 'Education;Pretend Play': 5, 'Education;Brain Games': 3, 'Entertainment': 538, 'Entertainment;Brain Games': 7, 'Entertainment;Creativity': 3, 'Entertainment;Music & Video': 15, 'Events': 63, 'Finance': 328, 'Food & Drink': 110, 'Health & Fitness': 273, 'House & Home': 73, 'Libraries & Demo': 83, 'Lifestyle': 345, 'Lifestyle;Pretend Play': 1, 'Card': 40, 'Arcade': 164, 'Puzzle': 100, 'Racing': 88, 'Sports': 307, 'Casual': 156, 'Simulation': 181, 'Adventure': 60, 'Trivia': 37, 'Action': 275, 'Word': 23, 'Role Playing': 83, 'Strategy': 81, 'Board': 34, 'Music': 18, 'Action;Action & Adventure': 9, 'Casual;Brain Games': 12, 'Educational;Creativity': 3, 'Puzzle;Brain Games': 15, 'Educational;Education': 35, 'Casual;Pre

In [70]:
#Storing the app frequency into a list
google_freq_list = []
total_google_apps = len(google_data_clean_english_free)
for key in google_counts_by_genre:
    list_tuple = (round((google_counts_by_genre[key]/total_google_apps)*100,2), key)
    google_freq_list.append(list_tuple)

sorted_google_freq_list = sorted(google_freq_list, reverse=True) 

#Printing the Frequency table ordered by Number of repetitions
for lists in sorted_google_freq_list:
    print(lists[1], ':', lists[0])

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

The top five genres on Google Play are Tools, Entertainment, Education, Business, and Productivity (tied with Lifestyle at 3.89%).

Now, let's create a function to ease the process of generating a frequency table for any data set.

In [85]:
def freq_table(dataset, index):
    counts_by_genre = {}
    total = 0
    for row in dataset:        
        genre = row[index]
        total += 1
        if genre in counts_by_genre:
            counts_by_genre[genre] += 1
        else:
            counts_by_genre[genre] = 1
            
    table_percentages = {}
    for key in counts_by_genre:
        percentage = (counts_by_genre[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    # Sort by percentage, highest first
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [86]:
print("Apple Genre frequency table:")
print('\n')
display_table(apple_data_clean_english_free, 11)

Apple Genre frequency table:


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [87]:
print("Google Genre frequency table:")
print('\n')
display_table(google_data_clean_english_free, 9)

Google Genre frequency table:


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
A

In [88]:
print("Google Category frequency table:")
print('\n')
display_table(google_data_clean_english_free, 1)

Google Category frequency table:


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0

The frequencies differ hugely between the iOS and Google Play apps. While Games leads the genres on iOS, Tools leads on Google Play.

However, the Google Play genre frequencies are not separated by large gaps. So we also generated a table for the Google Play **Category** column, which shows that **FAMILY** leads by a wide margin (about 18.9% versus 9.7% for GAME).
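The frequency tables in this section can also be built with `collections.Counter` from the standard library. The sketch below is one possible shape for such a helper; `percentage_table` and the toy rows are assumptions made for the example and do not appear in the notebook.

```python
from collections import Counter


def percentage_table(rows, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy rows (name, genre) with hypothetical values.
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Games'],
    ['App D', 'Education'],
]

for genre, percentage in percentage_table(sample_rows, 1):
    print(genre, ':', round(percentage, 2))
```

`Counter.most_common()` already returns entries sorted by count, so no separate sorting step is needed.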

## Most Popular Apps by Genre on the App Store
The frequency tables above showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [94]:
genres_ios = freq_table(apple_data_clean_english_free, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in apple_data_clean_english_free:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [95]:
for app in apple_data_clean_english_free:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


A similar pattern applies to other iOS genres, where the averages are skewed by huge players such as Facebook, Instagram, Pandora, and Spotify.
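One way to see how strongly a few large apps pull the average is to compare the mean with the median, which is far less sensitive to extreme values. The median is not used in the original analysis; the sketch below simply applies it to the six Navigation rating counts printed above.

```python
from statistics import mean, median

# Rating counts for the six free English Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('mean  :', mean(navigation_ratings))
print('median:', median(navigation_ratings))
```

The mean matches the Navigation average reported earlier (about 86,090), while the median is 8196.5, which is closer to what a typical app in the genre receives.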

Let us analyze the next most popular genre, **Reference**:

In [96]:
for app in apple_data_clean_english_free:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Even in the **Reference** genre, the average is skewed, here by the Bible app with close to a million ratings. However, it is the only app in the genre with such a large lead.

With the information gathered so far, a reasonable working hypothesis is that turning a popular book into an app could be profitable on iOS.
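The per-genre averages in this section were computed by looping over the whole data set once for every genre. On larger data sets, a single pass that accumulates totals and counts per group does the same job. This is a sketch only; `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows (genre, rating count) with hypothetical values.
sample_rows = [
    ['Weather', '10'],
    ['Weather', '30'],
    ['Music', '5'],
]

for genre, average in average_by_group(sample_rows, 0, 1).items():
    print(genre, ':', average)
```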

## Most Popular Apps by Genre on Google Play

Unlike the App Store, Google Play reports the number of installs for each app, although only as open-ended brackets such as `1,000,000+` rather than exact counts. With that information, we can get a good idea of each category's popularity.
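The `Installs` values are stored as text such as `1,000,000+`, so they need to be converted to numbers before they can be averaged. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper, not part of the notebook.

```python
def parse_installs(installs_text):
    """Convert an install bracket such as '1,000,000+' to its lower bound."""
    return int(installs_text.replace(',', '').replace('+', ''))


for text in ['1,000+', '5,000,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```

The result is the lower bound of each bracket, so an app listed at `1,000,000+` is counted as exactly one million installs.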

In [100]:
categories_google= freq_table(google_data_clean_english_free, 1)

for category in categories_google:
    total = 0
    len_category = 0
    for app in google_data_clean_english_free:
        category_app = app[1]
        if category_app == category:            
            # Strip ',' and '+' so the install bracket (e.g. '1,000,000+') can be
            # converted to a number. We don't have exact install counts, so using
            # the lower bound of each bracket seems to be the best approach.
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

From the above output, the **COMMUNICATION** category has the highest average number of installs. Let's take a closer look at it.

In [101]:
for app in google_data_clean_english_free:
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

Several apps have only a few thousand installs, such as `Mail1Click - Secure Mail`, `SolMail - All-in-One email app`, and `LokLok: Draw on a Lock Screen`.

These numbers are tiny compared with hugely installed applications such as `WhatsApp Messenger` and `Android Messages`.

Let's filter for the apps with the most installs.

In [102]:
for app in google_data_clean_english_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The data appears to be skewed by a few apps with 1,000,000,000+ installs.
Let's look at the average number of installs for the apps with fewer than 100M installs.

In [104]:
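under_100_m = []

for app in google_data_clean_english_free:
    # Strip the thousands separators and the trailing '+' so the value converts to a number
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    # Keep only COMMUNICATION apps below the 100M install bucket
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))

# Average installs for COMMUNICATION apps under 100M
sum(under_100_m) / len(under_100_m)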

3603485.3884615386
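
To see how strongly a few giants pull the average, it can help to compare the mean with the median and with a capped mean. The following is a minimal sketch rather than part of the original analysis: `parse_installs` and `installs_for` are hypothetical helper names, and `sample` is a small hand-picked list whose app names and install buckets are copied from the output above (the unused columns are blank placeholders so that the name, category, and installs sit at indexes 0, 1, and 5).

```python
from statistics import mean, median

def parse_installs(raw):
    """Convert an install bucket such as '10,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

def installs_for(dataset, category, cap=None):
    """Collect install counts for one category, optionally keeping only values below cap."""
    values = []
    for app in dataset:
        if app[1] != category:
            continue
        n = parse_installs(app[5])
        if cap is None or n < cap:
            values.append(n)
    return values

# Illustrative sample only: a handful of rows, not the full data set.
sample = [
    ['WhatsApp Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Google Duo - High Quality Video Calls', 'COMMUNICATION', '', '', '', '500,000,000+'],
    ['Android Messages', 'COMMUNICATION', '', '', '', '100,000,000+'],
    ['Contacts', 'COMMUNICATION', '', '', '', '50,000,000+'],
    ['GMX Mail', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['My Tele2', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['Seznam.cz', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Antillean Gold Telegram (original version)', 'COMMUNICATION', '', '', '', '100,000+'],
]

all_values = installs_for(sample, 'COMMUNICATION')
capped = installs_for(sample, 'COMMUNICATION', cap=100_000_000)

print('mean (all):', mean(all_values))
print('median (all):', median(all_values))
print('mean (< 100M):', mean(capped))
```

On this small sample the mean of all values is several times larger than the median, which is the same kind of skew seen in the full category. Running the same helpers on `google_data_clean_english_free` would be the natural next step.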

The average for COMMUNICATION apps with fewer than 100M installs (about 3.6 million) is far lower than the install counts of the apps with 100M+ installs. Let's move on to another category, **BOOKS_AND_REFERENCE**.

In [105]:
for app in google_data_clean_english_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in ('1,000,000,000+', '500,000,000+', '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


As with the iOS apps, where a few hugely popular apps heavily skewed the numbers, the Google Play data has the same problem. Let's set these apps aside and look at the apps with between 1,000,000+ and 50,000,000+ installs to see whether any promising book apps emerge.

In [106]:
for app in google_data_clean_english_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in ('1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

Google Play seems to be crowded with software that facilitates reading books. However, there are also some apps built around a single book, such as the Quran apps listed above. This pattern overlaps with what we saw for the iOS apps.
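
The mid-range filter used above matches install buckets by their exact string. A possible alternative is to convert the bucket to a number and filter on a numeric range, which avoids listing every bucket by hand. This is only a sketch under stated assumptions: `parse_installs`, `apps_in_range`, and `row` are hypothetical helpers, and `sample` reuses a few app names and install buckets from the outputs above, with blank placeholders for the unused columns.

```python
def parse_installs(raw):
    """Turn an install bucket such as '5,000,000+' into an int."""
    return int(raw.replace(',', '').replace('+', ''))

def apps_in_range(dataset, category, low, high):
    """Return (name, installs) pairs for a category with low <= installs <= high."""
    matches = []
    for app in dataset:
        n = parse_installs(app[5])
        if app[1] == category and low <= n <= high:
            matches.append((app[0], n))
    return matches

def row(name, category, installs):
    # Hypothetical helper: pads the unused columns so that name, category,
    # and installs land at indexes 0, 1, and 5, as in the notebook's rows.
    return [name, category, '', '', '', installs]

# Illustrative sample only: a handful of rows, not the full data set.
sample = [
    row('Google Play Books', 'BOOKS_AND_REFERENCE', '1,000,000,000+'),
    row('Bible', 'BOOKS_AND_REFERENCE', '100,000,000+'),
    row('Wikipedia', 'BOOKS_AND_REFERENCE', '10,000,000+'),
    row('Ebook Reader', 'BOOKS_AND_REFERENCE', '5,000,000+'),
    row('Book store', 'BOOKS_AND_REFERENCE', '1,000,000+'),
    row('WhatsApp Messenger', 'COMMUNICATION', '1,000,000,000+'),
]

mid_range = apps_in_range(sample, 'BOOKS_AND_REFERENCE', 1_000_000, 50_000_000)
for name, n in mid_range:
    print(name, ':', n)
```

With numeric bounds, tightening or widening the range means changing two numbers instead of editing a chain of string comparisons.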

# Conclusion
After analyzing the trends and patterns in the iOS and Google Play data sets, we see a common pattern around books. However, the app we build shouldn't be yet another book-reading utility; it should be an app based on a specific book.