# Profitable App Profiles for the App Store and Google Play Markets

This is the final project suggested in the _Python for Data Science: Fundamentals_ course hosted by [Dataquest.io](https://www.dataquest.io/).

In this project we will analyze apps from Apple's App Store and Google Play, in order to understand which kinds of apps are likely to attract more users.

We pretend to work with a team of app developers, and our job is to enable the team to make data-driven decisions with respect to how to build the apps.
At our fictitious company, we only build free apps, and our source of revenue is ads, which is why the number of users is essential. Our job is to make sure this number is high.

## Opening and Exploring the Data

As of September 2018, there were approximately 2.1 million apps on Google Play and 2 million apps on the App Store.

![](py1m8_statista.png)

For the purpose of this project, we will analyze two sample data sets instead, obtained from [Kaggle.com](https://www.kaggle.com/):

- The [first data set](https://www.kaggle.com/lava18/google-play-store-apps/home) contains data about approximately 10,000 Android apps, collected in August 2018
- The [second one](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) contains data about approximately 7,000 iOS apps, collected in July 2017

First, we open the two data sets.

In [1]:
from csv import reader

# Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
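
The two loading blocks above repeat the same steps, so they could be folded into one helper. The following is only a sketch: `load_dataset` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes the first row of each file is the header. Using `with` also closes the file once it has been read.

```python
from csv import reader

def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # `with` closes the file even if reading fails part-way through.
    # newline='' is the setting the csv module documentation recommends for reader().
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # Assumption: the first row holds the column names.
    return rows[0], rows[1:]
```

With this helper, the cell above would reduce to `android_header, android = load_dataset('googleplaystore.csv')` and `ios_header, ios = load_dataset('AppleStore.csv')`.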

We create a function called `explore_data()` to make it easier to explore the two data sets. This function also takes an optional parameter to display the number of rows and columns of the dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_cols=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_cols:
        print('Number of rows:    ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [3]:
print(android_header, '\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:     10841
Number of columns:  13


We start by looking into the Google Play data set. It has 10841 rows and 13 columns. The columns that might be important for our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price' and 'Genres'.

Now we take a look at the App Store data set:

In [4]:
print(ios_header, '\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:     7197
Number of columns:  17


As we can see, there are 7197 iOS apps, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver' and 'prime_genre'.

Not all column names are self-explanatory, especially in the second case. Details about each column can be found on the [Google Play data set documentation](https://www.kaggle.com/lava18/google-play-store-apps/home) and the [App Store data set documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Deleting Wrong Data

Reading the [discussion of the Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps/discussion), we find a top-rated [entry](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) that reports an error in row 10472. Let's print this row and compare it with the header and the first row, to check whether it is actually incorrect or whether the data set has since been fixed.

In [5]:
print(android[10472], '\n') # incorrect row
print(android_header, '\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


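As we can see, the 'Category' value is missing from this row, so every later value is shifted one column to the left: the 'Rating' column holds 19 and the 'Reviews' column holds '3.0M'. We have to delete this row so that it doesn't lead to errors.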
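
The faulty row printed above has 12 values while the header has 13, so a length comparison is one way to look for rows of this kind. The sketch below uses a hypothetical helper, `find_malformed_rows`, and a small made-up sample; it only catches rows with a missing or extra field, not rows whose values are wrong but complete.

```python
def find_malformed_rows(header, rows):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample data, for illustration only.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Shifted app', '1.9'],          # 'Category' is missing, so the row is short
    ['Box', 'BUSINESS', '4.2'],
]
print(find_malformed_rows(sample_header, sample_rows))  # [(1, 2)]
```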

In [6]:
print(len(android))
del android[10472] # don't run this more than once
print(len(android))

10841
10840


We also check the [App Store data set discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion), and we find a report of possible duplicate apps. Reading the comment section of the [entry](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409), we find out that these are just two different apps with the same name (`track_name`). This leads us into the next section.

## Removing Duplicate Entries

### Part One

Reading the discussion section of the Google Play data set, and looking into the data set itself, we notice some apps have duplicate entries. For example, we check that Instagram has four entries:

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let's check how many duplicate apps there are.

In [8]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps), '\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We don't want to count some apps more than once when we analyze data, so we need to remove the duplicate entries, and keep only one per app. 
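
As a cross-check on the duplicate count, the same figure can be derived with `collections.Counter` from the standard library. This is a sketch on made-up sample rows; `count_duplicates` is a hypothetical helper, and its first return value corresponds to `len(duplicate_apps)` in the loop above (every occurrence after the first counts once).

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of surplus rows, {name: occurrences}) for repeated names."""
    counts = Counter(row[name_index] for row in rows)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Each repeated name contributes (occurrences - 1) surplus rows.
    surplus = sum(n - 1 for n in repeated.values())
    return surplus, repeated

# Made-up sample data, for illustration only.
sample = [['Instagram'], ['Box'], ['Instagram'], ['Slack'], ['Box'], ['Instagram']]
surplus, repeated = count_duplicates(sample)
print(surplus)   # 3
print(repeated)  # {'Instagram': 3, 'Box': 2}
```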

Let's examine the Instagram duplicates data alongside the header to see if we find some criterion for selecting the entry we will keep.

In [9]:
print(android_header, '\n')

for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can see that the 'Rating' is the same in all of them, and the 'Installs' and 'Last Updated' too. We will focus on the 'Reviews' column, because the more reviews an app has, the more recent the data should be.

Thus, for each app we will keep the entry with the highest number of reviews. To do so, we will create a dictionary where each key is a unique app name and each value is the highest number of reviews recorded for that app, and then we will use it to create a new data set with one entry per app and no duplicates.

### Part Two

First we create the dictionary:

In [10]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous cell we checked that there are 1181 duplicate apps. To make sure everything was correct, we can check if the length of our `reviews_max` dictionary is equal to the difference between the length of the data set and 1181.

In [11]:
print('Expected length: ', len(android) - 1181)
print('Actual length:   ', len(reviews_max))

Expected length:  9659
Actual length:    9659


Now we use the `reviews_max` dictionary to remove the duplicates, and keep only the entries with the highest number of reviews in a list.

In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
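
The two passes above (first `reviews_max`, then `android_clean`) could also be collapsed into a single pass that stores the best row per app directly. This is a sketch under the same column layout ('App' at index 0, 'Reviews' at index 3); `keep_most_reviewed` is a hypothetical name and the sample rows are shortened for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when several tie for the maximum.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows, for illustration only.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version orders apps by the position of the row that is kept. The set of rows is the same.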

Now, let's explore our new clean Google Play data set, and confirm that everything is correct.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:     9659
Number of columns:  13


As we expected, the number of rows is 9659.

## Removing Non-English Apps

### Part One

Remember we use English for the apps we develop at our company, so we only want to analyze apps that are directed towards an English-speaking audience.

If we explore the two data sets long enough, we will find that both contain apps with names that suggest they are not directed to an English-speaking audience.

In [14]:
print(android_clean[4412][0])
print(android_clean[7940][0], '\n')
print(ios[70][2])
print(ios[236][2])

中国語 AQリスニング
لعبة تقدر تربح DZ 

新浪新闻-阅读最新时事热门头条资讯视频
优酷视频


We don't want to keep these apps, so we'll remove them. To do so, we can remove each app whose name contains a character that is not commonly used in English text. English text normally uses Latin letters, the digits 0 to 9, punctuation marks and other symbols such as + or /.

All these characters are encoded using the ASCII standard. Each ASCII character has an associated number between 0 and 127.

We can take advantage of this to build a function that checks whether an app name contains non-ASCII characters, using the built-in `ord()` function to find each character's corresponding number.

In [15]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

Now, let's test it with some app names:

In [16]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


Looking at the first two strings, our function seems to work. But it misclassifies the third and fourth ones, probably because of the ™ symbol and the emoji. Let's check that their associated numbers fall outside the ASCII range.

In [17]:
print(ord('™'))
print(ord('😜'))

8482
128540


If we use our function, we will remove some useful apps.

### Part Two

We will now rewrite our previous function so it returns False if the input string contains more than three characters that fall outside the ASCII range.

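This is not perfect: some English apps with several non-ASCII characters in their names will still be removed, and non-English apps whose names contain three or fewer non-ASCII characters will slip through. For the purpose of this project it is good enough.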

In [18]:
def is_english(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    return True

Using our new function should work properly with the previous app names.

In [19]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


It does, indeed.
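
The threshold of three is a judgment call, so it can be convenient to expose it as a parameter. The sketch below is a variant of the function above under a hypothetical name, `is_mostly_ascii`, with a hypothetical `max_non_ascii` parameter; with the default value it is meant to behave like the second `is_english()`.

```python
def is_mostly_ascii(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Docs To Go™ Free Office Suite'))   # True
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))   # False
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

Setting `max_non_ascii=0` reproduces the strict first version, which makes it easy to compare how many apps each threshold removes.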

Now, let's filter out the non-English apps in both data sets.

In [20]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[2]
    if is_english(name):
        ios_english.append(app)

Now let's explore our data and check how many rows we have remaining for each data set.

In [21]:
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:     9614
Number of columns:  13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


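['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:     6183
Number of columns:  17
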

We are left with 9614 Android apps and 6183 iOS apps.

## Isolating the Free Apps

Our company only develops free apps, so we only want to analyze apps that are free to download. The next step is to isolate them.

In [22]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[5]
    if price == '0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222


We are left with 8864 Android apps and 3222 iOS ones to analyze.
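
The comparison `price == '0'` works here because both files store the price of a free app as exactly `'0'`. A more defensive check could parse the value instead. The sketch below assumes, as an illustration only, that other exports of the data might write `'0.0'` or `'$0.00'`; `is_free` and `free_apps` are hypothetical helpers.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example 'Everyone' in a shifted row) are not treated as free.
        return False

def free_apps(rows, price_index):
    """Return the rows whose price column parses to zero."""
    return [row for row in rows if is_free(row[price_index])]

# Made-up sample rows with the price in column 1, for illustration only.
sample = [['Free app', '0'], ['Paid app', '$4.99'], ['Also free', '0.0'], ['Broken', 'Everyone']]
print(free_apps(sample, 1))  # [['Free app', '0'], ['Also free', '0.0']]
```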

## Most Common Apps by Genre

### Part One

As we mentioned in the introduction, we want to determine the kinds of apps that attract more users, because our revenue is directly influenced by the number of users of our apps.

To minimize risks, the validation strategy of the development team is as follows:
1. Build a minimal Android version of the app, and publish it on Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable in six months, we build an iOS version of it too.

Since our final goal is to publish the app on both Google Play and the App Store, we want to find app profiles that are successful on both.

We will begin our analysis by looking for the most common genres for each market. For that, we will build frequency tables for the `Genres` and `Category` columns of the Google Play data set, and the `prime_genre` column of the App Store one. 

### Part Two

We will now build two functions, one to generate frequency tables that show percentages, and another to display the percentages in a descending order.

In [23]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
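
For comparison, the same percentages can be produced with `collections.Counter` from the standard library. This is a sketch rather than a drop-in replacement: `freq_table_counter` is a hypothetical name, it returns a sorted list instead of a dictionary, and values with equal counts come out in first-seen order, which may differ from the tie order of `display_table()`.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return [(value, percentage)] sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs from the most to the least frequent.
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample rows with the genre in column 1, for illustration only.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
for value, percentage in freq_table_counter(sample, 1):
    print(value, ':', percentage)
```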

### Part Three

We will start by examining the frequency table for the `prime_genre` column in the App Store data set.

In [24]:
display_table(ios_final, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can clearly see that more than half of the free English iOS apps are games. They are followed at a distance by entertainment (~7.9%), photo and video (~5.0%), education (~3.7%) and social networking (~3.3%) apps.

This means that the App Store (the part that concerns us) is dominated by apps designed for fun, while apps with more practical purposes, like productivity, business or medical ones, are rarer (around 1.7%, 0.5% and 0.2% respectively).

But we can't draw any conclusions at this point, because supply may be much higher than demand, and these apps might not have the greatest number of users.

Now, let's examine the `Category` and `Genres` columns of the Google Play data set, which seem to be pretty similar.

In [25]:
display_table(android_final, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

We can see an interesting difference: there are not that many game apps, and a lot more practical apps, like family and tools.
However, if we further investigate the family category, we will see that it consists mostly of games for kids.

In [26]:
for app in android_final[:1430]:
    category = app[1]
    if category == 'FAMILY':
        print(app)

['Jewels Crush- Match 3 Puzzle', 'FAMILY', '4.4', '14774', '19M', '1,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 23, 2018', '1.9.3901', '4.0.3 and up']
['Coloring & Learn', 'FAMILY', '4.4', '12753', '51M', '5,000,000+', 'Free', '0', 'Everyone', 'Educational;Creativity', 'July 17, 2018', '1.49', '4.0.3 and up']
['Mahjong', 'FAMILY', '4.5', '33983', '22M', '5,000,000+', 'Free', '0', 'Everyone', 'Puzzle;Brain Games', 'August 2, 2018', '1.24.3181', '4.0.3 and up']
['Super ABC! Learning games for kids! Preschool apps', 'FAMILY', '4.6', '20267', '46M', '1,000,000+', 'Free', '0', 'Everyone', 'Educational;Education', 'July 16, 2018', '1.1.6.7', '4.1 and up']
['Toy Pop Cubes', 'FAMILY', '4.5', '5761', '21M', '1,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '1.8.3181', '4.0.3 and up']
['Educational Games 4 Kids', 'FAMILY', '4.3', '11618', '39M', '5,000,000+', 'Free', '0', 'Everyone', 'Educational;Education', 'April 3, 2018', '2.4', '4.1 and up']
['

Anyway, practical apps have a better representation on Google Play than on the App Store. This is confirmed by the following frequency table.

In [27]:
display_table(android_final, 9) # Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The difference between `Category` and `Genres` isn't entirely clear, although we can see that `Genres` is much more granular and has many more values. For the purpose of this project the bigger picture is enough, so we will only work with the `Category` column from now on.

Now that we know that the App Store is dominated by apps designed for fun and Google Play shows a better balance of both practical and fun apps, we have to find out which kinds of apps have the most users in both markets.

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular is to calculate the average number of installs for each of them. For the Google Play data set we can find this in the `Installs` column, but the App Store data set doesn't have this information. As a workaround, we will use the total number of user ratings, which we can find in the `rating_count_tot` column.

Let's calculate the average number of user ratings per app genre.

In [28]:
genres_ios = freq_table(ios_final, 12)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ': ', avg_n_ratings)    

Productivity :  21028.410714285714
Weather :  52279.892857142855
Shopping :  26919.690476190477
Reference :  74942.11111111111
Finance :  31467.944444444445
Music :  57326.530303030304
Utilities :  18684.456790123455
Travel :  28243.8
Social Networking :  71548.34905660378
Sports :  23008.898550724636
Health & Fitness :  23298.015384615384
Games :  22788.6696905016
Food & Drink :  33333.92307692308
News :  21248.023255813954
Book :  39758.5
Photo & Video :  28441.54375
Entertainment :  14029.830708661417
Business :  7491.117647058823
Lifestyle :  16485.764705882353
Education :  7003.983050847458
Navigation :  86090.33333333333
Medical :  612.0
Catalogs :  4004.0


Navigation has the highest average number of ratings per app. Let's look into it to see what is really happening.

In [29]:
for app in ios_final:
    if app[12] == 'Navigation':
        print(app[2], ': ', app[6])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Geocaching® :  12811
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5
CoPilot GPS – Car Navigation & Offline Maps :  3582
Google Maps - Navigation & Transit :  154911


As we expected, two apps, Waze and Google Maps, account for almost half a million ratings between them, so the average here doesn't mean much. The same happens in genres like social networking and music.
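
Because a few very large apps can pull the mean up, the median is a useful companion figure. The sketch below groups rating counts by genre in one pass and reports both statistics; `ratings_summary` is a hypothetical helper, and the sample reuses the six Navigation rating counts printed above in a simplified three-column layout.

```python
from collections import defaultdict
from statistics import mean, median

def ratings_summary(rows, genre_index, ratings_index):
    """Return {genre: (mean, median)} of the rating counts, grouped in one pass."""
    by_genre = defaultdict(list)
    for row in rows:
        by_genre[row[genre_index]].append(float(row[ratings_index]))
    return {genre: (mean(values), median(values)) for genre, values in by_genre.items()}

# The six Navigation apps listed above, as simplified (name, ratings, genre) rows.
navigation = [
    ['Waze', '345046', 'Navigation'],
    ['Geocaching', '12811', 'Navigation'],
    ['ImmobilienScout24', '187', 'Navigation'],
    ['Railway Route Search', '5', 'Navigation'],
    ['CoPilot GPS', '3582', 'Navigation'],
    ['Google Maps', '154911', 'Navigation'],
]
summary = ratings_summary(navigation, genre_index=2, ratings_index=1)
nav_mean, nav_median = summary['Navigation']
print(round(nav_mean, 1), nav_median)  # 86090.3 8196.5
```

For this genre the median (8196.5) is roughly a tenth of the mean, which shows how strongly the two largest apps skew the average.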

Looking at other genres, we find the `Reference` one. Let's look into it to see what happens.

In [30]:
for app in ios_final:
    if app[12] == 'Reference':
        print(app[2], ': ', app[6])

Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
Merriam-Webster Dictionary :  16849
Google Translate :  26786
Night Sky :  12122
WWDC :  762
Jishokun-Japanese English Dictionary & Translator :  0
教えて!goo :  0
VPN Express :  14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693
Guides for Pokémon GO - Pokemon GO News and Cheats :  826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free :  718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
GUNS MODS for Minecraft PC Edition - Mods Tools :  1497
Real Bike Traffic Rider Virtual Reality Glasses :  8


The average is 74942, but the Bible and Dictionary.com alone have about 986,000 and 200,000 ratings respectively. Even though the Bible dominates this genre, this could be a good niche for our development team to focus on. There aren't many apps, and it's a broad genre, so there are more possibilities for us to, let's say, choose another popular book and make an interactive app about it.

Recall that the App Store is saturated with free apps designed for fun, so this idea could be a good way to stand out.

We also find other popular genres: book (in which our idea also fits), weather, food and drink, and finance. But the last three aren't really interesting for us, for the following reasons:
- Weather apps: people don't spend much time on them, and this is crucial for our ad-based revenue strategy.
- Food and drink: these apps are based on actual cooking and delivering services, which we don't have.
- Finance apps: we would need a lot of domain knowledge, and maybe some association with banks or other financial institutions.

Now, let's analyze the Google Play data set.

## Most Popular Apps by Genre on Google Play

For this data set, we have actual data about the number of installs. However, the numbers are open-ended buckets (5,000+, 100,000+, etc.), which isn't that precise.

In [31]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The problem is that we don't know whether an app with 100,000+ installs has 100,000, 200,000 or 350,000 installs. For the purpose of this project we don't need that much precision; we only want to get an idea of which app genres attract the most users. So we are going to leave the numbers as they are: an app with 100+ installs is counted as 100, and an app with 100,000+ installs is counted as 100,000.

For our computations, we need to convert each install number from string to float. To do so, we have to remove the + symbol and the commas.
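
The conversion just described can be isolated in a small helper so that it is written only once. This is a sketch; `parse_installs` is a hypothetical name that the cells in this notebook do not use.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```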

In [32]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ': ',avg_n_installs)

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 

We can see that on average, communication apps have the most installs, around 38 million. We suspect it's all about a few huge apps again, so let's look into it.

In [33]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ': ', app[5])

WhatsApp Messenger :  1,000,000,000+
imo beta free calls and text :  100,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
Who :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji :  100,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
Firefox Browser fast & private :  100,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Messenger Lite: Free Calls & Messages :  100,000,000+
Kik :  100,000,000+
KakaoTalk: Free Calls & Text :  100,000,000+
Opera Mini - fast web browser :  100,000,000+
Opera Browser: Fast and Secure :  100,000,000+
Telegram :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure :  

If we removed all the communication apps with 100 million or more installs, the average would drop by a factor of more than ten.

In [34]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
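
As a quick arithmetic check of how much the largest apps inflate the average, the two averages printed in this section can be divided directly. This is only a sketch: the variable names are ours, and the values are copied from the outputs above.

```python
# Averages copied from the outputs printed earlier in this section.
avg_all_communication = 38456119.167247385  # every COMMUNICATION app in android_final
avg_under_100m = 3603485.3884615386         # only apps with fewer than 100 million installs

# How many times smaller the average becomes once the giants are excluded.
ratio = avg_all_communication / avg_under_100m
print(round(ratio, 1))
```

The ratio comes out a little above 10, which is consistent with the average shrinking by more than a factor of ten.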

The same pattern likely holds for video player, social, and photography apps, among others.
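
To check other categories without copying the loop each time, the filtering step could be wrapped in a small helper. The following is a minimal sketch, assuming the same column layout as `android_final` (category at index 1, installs at index 5). The names `parse_installs` and `average_installs`, and the sample rows, are hypothetical and only there for illustration.

```python
def parse_installs(raw):
    """Turn an installs label such as '1,000,000+' into a float."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs(dataset, category, below=None):
    """Average installs for one category.

    If `below` is given, apps with that many installs or more are ignored.
    """
    values = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = parse_installs(app[5])
        if below is None or n_installs < below:
            values.append(n_installs)
    # Return 0.0 for a category with no matching apps instead of dividing by zero.
    return sum(values) / len(values) if values else 0.0


# Hypothetical rows for illustration only; the real input would be android_final.
sample_rows = [
    ['Giant Video App', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Small Player', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Tiny Player', 'VIDEO_PLAYERS', '', '', '', '500,000+'],
    ['Photo Thing', 'PHOTOGRAPHY', '', '', '', '10,000,000+'],
]

for category in ['VIDEO_PLAYERS', 'PHOTOGRAPHY']:
    with_giants = average_installs(sample_rows, category)
    without_giants = average_installs(sample_rows, category, below=100_000_000)
    print(category, ':', with_giants, '->', without_giants)
```

Calling `average_installs(android_final, 'SOCIAL', below=100_000_000)` in the notebook would then give the average for social apps without the giants, under the assumptions above.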

These are popular genres dominated by a few giants, which are hard to compete against. Also, there's the game genre, but we previously found out that it is oversaturated on the App Store and a little saturated on Google Play, so we have to look for something else.

Since we liked the book and reference genres on the App Store, let's look at them here.

In [35]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This genre includes a wide variety of apps, from software for reading ebooks to libraries, tutorials or dictionaries. But it still seems to be dominated by a small number of hugely popular apps.
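
One way to put numbers on this impression is to tally how many apps fall into each installs bracket. The sketch below assumes the same column layout as `android_final` (category at index 1, installs at index 5). The function name `installs_distribution` and the sample rows are hypothetical.

```python
from collections import Counter


def installs_distribution(dataset, category):
    """Count the apps of `category` per installs bracket, largest bracket first."""
    counts = Counter(app[5] for app in dataset if app[1] == category)

    def as_number(bracket):
        # '1,000,000+' -> 1000000, so brackets sort by size rather than alphabetically.
        return int(bracket.replace(',', '').replace('+', ''))

    return sorted(counts.items(), key=lambda item: as_number(item[0]), reverse=True)


# Hypothetical rows for illustration only; the real input would be android_final.
sample_books = [
    ['Mega Reader', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Pocket Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['City Library', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Grammar Notes', 'BOOKS_AND_REFERENCE', '', '', '', '50,000+'],
    ['Chat App', 'COMMUNICATION', '', '', '', '500,000,000+'],
]

for bracket, n_apps in installs_distribution(sample_books, 'BOOKS_AND_REFERENCE'):
    print(bracket, ':', n_apps)
```

If only a handful of apps sit in the top brackets while most sit far lower, the category average is being pulled up by those few apps.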

In [36]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ': ', app[5])

Google Play Books :  1,000,000,000+
Bible :  100,000,000+
Amazon Kindle :  100,000,000+
Wattpad 📖 Free Books :  100,000,000+
Audiobooks from Audible :  100,000,000+
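
The chained string comparisons used to pick out these apps work because installs are stored as bracket labels. A possible alternative is to convert each label to a number and filter by range. This is a minimal sketch, assuming the same column layout as `android_final` (name at index 0, category at index 1, installs at index 5). The function `apps_in_range` and the sample rows are hypothetical.

```python
def apps_in_range(dataset, category, low, high=None):
    """Return (name, installs label) pairs with low <= installs, and installs < high if given."""
    selected = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = int(app[5].replace(',', '').replace('+', ''))
        if n_installs >= low and (high is None or n_installs < high):
            selected.append((app[0], app[5]))
    return selected


# Hypothetical rows for illustration only; the real input would be android_final.
sample_apps = [
    ['Mega Reader', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Pocket Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['City Library', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Grammar Notes', 'BOOKS_AND_REFERENCE', '', '', '', '50,000+'],
]

# Giants: 100 million installs or more.
for name, installs in apps_in_range(sample_apps, 'BOOKS_AND_REFERENCE', 100_000_000):
    print(name, ': ', installs)

# Mid-sized apps: at least 1 million but fewer than 100 million installs.
for name, installs in apps_in_range(sample_apps, 'BOOKS_AND_REFERENCE', 1_000_000, 100_000_000):
    print(name, ': ', installs)
```

A numeric range avoids listing every bracket label by hand, so a bracket cannot be left out of the condition by accident.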


It is indeed dominated by a few apps, but we believe there is still room to succeed, since there are many smaller but still popular apps with between 1 million and 100 million installs. Let's get some insight from these apps.

In [37]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ': ', app[5])

Wikipedia :  10,000,000+
Cool Reader :  10,000,000+
Book store :  1,000,000+
FBReader: Favorite Book Reader :  10,000,000+
Free Books - Spirit Fanfiction and Stories :  1,000,000+
AlReader -any text book reader :  5,000,000+
FamilySearch Tree :  1,000,000+
Cloud of Books :  1,000,000+
ReadEra – free ebook reader :  1,000,000+
Ebook Reader :  5,000,000+
Read books online :  5,000,000+
eBoox: book reader fb2 epub zip :  1,000,000+
All Maths Formulas :  1,000,000+
Ancestry :  5,000,000+
HTC Help :  10,000,000+
Moon+ Reader :  10,000,000+
English-Myanmar Dictionary :  1,000,000+
Golden Dictionary (EN-AR) :  1,000,000+
All Language Translator Free :  1,000,000+
Aldiko Book Reader :  10,000,000+
Dictionary - WordWeb :  5,000,000+
50000 Free eBooks & Free AudioBooks :  5,000,000+
Al-Quran (Free) :  10,000,000+
Al Quran Indonesia :  10,000,000+
Al'Quran Bahasa Indonesia :  10,000,000+
Al Quran Al karim :  1,000,000+
Al Quran : EAlim - Translations & MP3 Offline :  5,000,000+
Koran Read &MP3 30

We can see that this niche is dominated by ebook readers, libraries, and dictionaries, so we won't build apps like those, in order to avoid the strong competition.

We also observe that several apps are built around the Quran, which supports our earlier idea of building a profitable app around a popular book. This seems to be a good idea for both the Google Play market and the App Store.

However, we should add some features to the raw book, like an audio version, quotes, several translations of the book with built-in dictionary, etc., if we want to make it stand out among the other apps of this genre.

## Conclusions

In this project we analyzed data about apps on Google Play and the App Store, to determine an app profile that could be profitable in both markets. Our fictitious development team only builds free, English-language apps, and its main source of revenue is ads.

We conclude that taking a popular book and turning it into an app could be our best bet. But we should add features such as quotes, an audio version, a built-in dictionary, and translations if we want the app to succeed and stand out from the competition.