# Profitable App Profiles for the App Store and Google Play Markets

Many apps are free to download and install. Therefore, their main source of revenue consists of in-app ads. This means the revenue for any given free app is mostly influenced by the number of users - the more users that see and engage with the ads, the better.

Our goal in this project is to analyse data from the Apple App Store and Google Play markets to help developers understand which types of free apps are likely to attract more users. We will do so only for apps that are free and directed towards an English-speaking audience.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we have found existing relevant data at no cost:

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) from Apple's App Store, which can be downloaded directly [here.](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download)
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) from Google Play Store, which can be downloaded directly [here.](https://www.kaggle.com/lava18/google-play-store-apps/download)

Let's start by opening and exploring our data sets.


In [1]:
from csv import reader

# App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
opened_file.close()

# Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
opened_file.close()
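
The two loading steps above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch rather than the notebook's own code: the name `load_csv` is hypothetical, and it assumes the first row of each file is the header. The `with` block closes the file even if reading fails part-way.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file whose first row is the header.

    `load_csv` is a hypothetical helper name, not part of the original notebook.
    """
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, loading would read `ios_header, ios = load_csv('AppleStore.csv')` and `android_header, android = load_csv('googleplaystore.csv')`.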

To make it easier to get a feel for our data sets, we will create a function named `explore_data` that allows us to repeatedly print samples of our data in a readable way. We will also include an option to show the number of rows and columns within any dataset.

In [2]:
def explore_data(dataset: list, start: int, end: int, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now, let's print the header row along with the first few data rows from both the App Store and Google Play Store data sets. 

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We can see that the App Store data set has 7,197 rows and 16 columns. At first glance, the columns that could help us with our analysis are `track_name`, `currency`, `price`, `rating_count_tot`, `user_rating`, `cont_rating` and `prime_genre`.

Not all column names are self-explanatory, so this [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) may help.

In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We can see that the Google Play Store data set has 10,841 rows and 13 columns. At first glance, the columns that could help us with our analysis are `App`, `Category`, `Rating`, `Installs`, `Price`, `Content Rating` and `Genres`.

# Deleting Wrong Data

Before beginning our analysis, we need to make sure the data we analyse is accurate. Otherwise, the results of our analysis will be wrong.
In order to clean our data, we need to:

- Detect inaccurate data, and correct or remove it
- Detect duplicate data, and remove the duplicates

As we are also only carrying out our analysis on free apps that are directed towards an English-speaking audience, we need to ensure all rows of data we focus on reflect this.

The Google Play Store data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) highlights an error with row 10472, where a datapoint is missing and a column shift happened as a result.

Let's print this row to check the data ourselves.

In [5]:
print(android_header)
print('\n')
print(android[10472]) # Incorrect row
print('\n')
print(android[0]) # Row we know to be correct

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


We can see that this row is indeed missing a value: it has only 12 entries, whereas the Google Play Store data set has 13 columns. Comparing it with the row we know to be correct makes it clear that the `Category` value is missing.

We will delete the incorrect row to ensure it doesn't affect our analysis.

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
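
Instead of relying on the index reported in the discussion thread, rows with a shifted column can also be found by comparing each row's length with the header's. This is a small sketch under that assumption; `find_malformed_rows` is a hypothetical name and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Tiny illustrative data, not taken from the real file
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' missing, so the row is short
    ['App C', 'GAME', '3.9'],
]
bad = find_malformed_rows(sample_rows, sample_header)
print(bad)  # [1]

# Delete from the end so the earlier indexes stay valid
for i in reversed(bad):
    del sample_rows[i]
```

A length check only catches rows that lost or gained a field; a row with the right number of fields but wrong values would still need a separate check.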


# Removing Duplicate Entries

## Part One

The Google Play data set may have duplicate entries. We will write code to find out which apps, if any, have duplicate entries and how many apps this affects.

In [7]:
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name) # If the name is a duplicate of an entry in unique_apps, add it to the duplicate_apps list
    else:
        unique_apps.append(name) # If the name has not occurred before, add it to the unique_apps list

print(duplicate_apps[:5])
print('\n')
print(len(duplicate_apps))


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


1181
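
The check `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can tally the names in one pass. The function name `count_duplicates` and the sample rows are hypothetical.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return (number of surplus rows, {name: occurrences} for repeated names)."""
    counts = Counter(row[name_index] for row in dataset)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence after the first one counts as a duplicate entry
    surplus = sum(n - 1 for n in repeated.values())
    return surplus, repeated


sample = [['Box'], ['Box'], ['Zoom'], ['Box'], ['Maps']]
surplus, repeated = count_duplicates(sample)
print(surplus)   # 2
print(repeated)  # {'Box': 3}
```

The `surplus` value is counted the same way as `len(duplicate_apps)` above: one for each repeated occurrence of a name.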


The Google Play data set has 1,181 duplicate entries. Let's have a closer look at the duplicate rows for 'Quick PDF Scanner + OCR FREE'.

In [8]:
for app in android:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


We can see that the difference between the sample rows is the fourth column, which corresponds to the number of reviews. As we wish to work with the rows that include the richest data, we will not remove the duplicates randomly. Instead, we will keep the entry with the highest number of reviews as the average rating will be more accurate.

In order to delete the duplicate entries, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

## Part Two

Let's start by creating the dictionary `reviews_max`.

In [9]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

Let's check that the length of `reviews_max` matches the expected length when we deduct the 1,181 duplicate entries.

In [10]:
print('Expected length: ', len(android) - 1181)
print('Actual length: ', len(reviews_max))

Expected length:  9659
Actual length:  9659


Now, let's use `reviews_max` to remove duplicate entries. In the case of duplicates, we'll only keep the entry with the highest number of reviews. In the code cell below:

- We'll start by initialising two empty lists: `android_clean` and `already_added`
- We'll then loop through the Google Play data set and for each iteration:
    - Isolate the name of the app and number of reviews
    - If the number of reviews is the same as the number of maximum reviews for the app (found in the `reviews_max` dictionary) **and** the name is not already in the list `already_added`:
        - We'll add the entire row to the `android_clean` list and;
        - Add the name to the `already_added` list, the purpose of which is to make sure we don't add duplicate entries where more than one entry for an app has the maximum reviews number.


In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)

Now let's quickly explore the new data set and make sure the number of rows matches the expected length, 9,659.

In [12]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
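
The two loops above could also be merged into a single pass that stores the best row per app directly, without the separate `already_added` list. This is a sketch of that idea, not the notebook's method; `keep_most_reviewed` is a hypothetical name and the sample rows are illustrative.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative rows only: name, category, rating, reviews
sample = [
    ['Scanner', 'BUSINESS', '4.2', '80805'],
    ['Scanner', 'BUSINESS', '4.2', '80805'],
    ['Scanner', 'BUSINESS', '4.2', '80804'],
    ['Other App', 'BUSINESS', '4.0', '120'],
]
deduplicated = keep_most_reviewed(sample, 0, 3)
print(len(deduplicated))  # 2
```

One difference to keep in mind: the result is ordered by the first appearance of each name, so the row order may not match `android_clean` exactly even though the number of rows should be the same.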


We know our code works as intended, so let's do the same for the App Store data set. Recall from the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) that the columns we need are `track_name` (index 1) and `rating_count_tot` (index 5).

In [13]:
# Overwrite the reviews_max dictionary with the results from the App Store data set.
reviews_max = {}

for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Create a new cleaned data set for ios
ios_clean = []
already_added = []

for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    
    if name not in already_added and n_reviews == reviews_max[name]:
        ios_clean.append(app)
        already_added.append(name)

# Removing Non-English Apps

## Part One

Remember, we'd like to analyse only the apps that are directed towards an English-speaking audience. However, it is possible that both the App Store and Google Play Store data sets have apps with names that contain characters which are not commonly used in English text. This would suggest that they are not directed to an English-speaking audience.

We're not interested in keeping these apps, so we'll remove them.

Each character in a string has a corresponding encoding number associated with it behind the scenes. We can find these numbers with the built-in `ord()` function.
The characters commonly used in English text are all part of the [ASCII](https://en.wikipedia.org/wiki/ASCII) system. Each ASCII character has an encoding number in the range 0 to 127.
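
As a quick illustration of `ord()`, the snippet below prints the number for a few characters. The selection of characters is arbitrary and chosen only for this example.

```python
codes = {character: ord(character) for character in ['a', 'A', '5', '+', '爱', '™']}
for character, code in codes.items():
    print(character, code)

# English letters, digits and common punctuation stay at or below 127
print(all(ord(c) <= 127 for c in 'Instagram'))  # True
# The trademark sign used in some English app names does not
print(codes['™'] > 127)                          # True
```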

Below, we'll build a function that takes an app name and ascertains whether it contains any non-ASCII characters.

In [14]:
def is_english(string: str):
    
    for character in string:
        if ord(character) > 127:
            return False
        
    return True
        

Let's check that our function works by passing through some test strings.

In [15]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


The `is_english` function that we created does detect non-English characters. However, as we can see, it couldn't correctly identify certain English app names that use non-ASCII symbols or emojis.
If we use the function in its current form, we may lose useful data.

## Part Two

To minimize data loss, we'll modify the function to only label a string as non-English if it contains more than three non-ASCII characters.

In [16]:
def is_english(string: str):
    
    non_english = 0
    
    for character in string:
        if ord(character) > 127:
            non_english += 1
        
    if non_english > 3:
        return False
            
    return True

print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

False
True
True


While the function isn't perfect, and a few non-English apps may get past our filter, it has now correctly labelled the two English app names in our test.

So, we'll use the function to filter out non-English apps from both our data sets.

In [17]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios_clean:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We can see that we are left with 9,614 Android apps and 6,181 iOS apps.

# Isolating the Free Apps

Our data sets contain both free and non-free apps; we'll isolate only the free apps for our analysis.

In [18]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [19]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We now have 8,864 Android apps and 3,220 iOS apps left in our data sets.
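
The comparisons against `'0'` and `'0.0'` depend on exactly how each file writes a zero price. A slightly more tolerant check would parse the price as a number first. The sketch below assumes prices may carry a leading `$` (an assumption about the file format, not something shown in the output above), and `is_free` is a hypothetical name.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means zero.

    Assumes an optional leading '$'; any other format raises ValueError.
    """
    return float(price.strip().lstrip('$')) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With a helper like this, the same condition could be used for both data sets, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.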

# Most Common Apps by Genre

## Part One

As we mentioned in the introduction, we wish to find out which kind of free apps attract the most users and, therefore, have the highest revenue-generating potential. For a company to minimize risk and overheads when developing an app, it's important that they have a validation strategy.

A good validation strategy might look like this:

1. Build a minimal Android app, and add it to Google Play.

2. If the app has a good response from users, develop it further.

3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because the end goal here is to add the app to both Google Play and the App Store, there's a need to find app profiles that are successful across both markets.

We'll begin our analysis by getting a sense of the most common genres in each market. Let's explore the data sets to find out which columns might be useful for us.


In [20]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


We can use the `Genres` and `Category` columns from the Google Play data set, and the `prime_genre` column from the App Store data set to help us find the most common app genres.

## Part Two

We'll build two functions that will allow us to analyse the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in descending order


In [21]:
def freq_table(dataset: list, index: int):
    frequencies = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequencies:
            frequencies[value] += 1
        else:
            frequencies[value] = 1
            
    frequencies_percentages = {}
    for key in frequencies:
        percentage = (frequencies[key] / total) * 100
        frequencies_percentages[key] = percentage
        
    return frequencies_percentages


def display_table(dataset: list, index: int):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
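
For comparison, the same table can be built with `collections.Counter`, whose `most_common()` method already returns values in descending order of count. This is an alternative sketch, not the version used in the rest of the notebook; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} pairs, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count
    return {value: count / total * 100 for value, count in counts.most_common()}


sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
sample_table = freq_table_counter(sample, 1)
for genre, percentage in sample_table.items():
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```

Values with equal counts may come out in a different order than in `display_table`, which breaks ties by sorting on the key.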

## Part Three

We start by examining the frequency table for the `prime_genre` column of the App Store data set.

In [22]:
display_table(ios_final, -5)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


We can see that in the App Store data set, over half (58.14%) of the free English apps are in the 'Games' genre, giving it a clear majority. The next most common genres are 'Entertainment' (7.89%) and 'Photo & Video' (4.97%).

Overall, it's clear that the most common apps on the App Store are designed for entertainment (games, photo and video, social networking, etc.). Apps that are more geared towards practical purposes (education, shopping, utilities, productivity, lifestyle) generally fall lower in the frequency table.

While this frequency table gives us an idea of the kind of apps that make up the App Store market, it does not tell us which kind of apps generally have the largest number of users. Therefore, we cannot recommend an app profile based on this alone.

Let's now examine the frequency table for the `Category` column of the Google Play data set.

In [23]:
display_table(android_final, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The most common category in the Google Play Store data set is 'Family' with 18.91%. This is followed by 'Game' (9.72%) and 'Tools' (8.46%).
While 'Family' takes the top spot, it is not immediately clear what the category consists of. It is possible that it is a combination of games, video and education apps.

Overall, we can see that there is a much more even mix of practical/lifestyle and entertainment apps among the most common app types in the Google Play Store data set than in the App Store data set.

Finally, we'll examine the frequency table for the `Genres` column of the Google Play data set.

In [24]:
display_table(android_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The main difference between the `Genres` and `Category` columns in the Google Play data set is that `Genres` is much more granular: it contains many more distinct values.
As we are focussing on the bigger picture in our analysis, we'll stick with the broader `Category` column.

So far, we have seen that the App Store is more geared towards fun apps while the Google Play Store has more of a mix between fun and practical apps.
However, in order to recommend an app profile for maximum revenues, we need to investigate which kind of apps generally have the most users.

# Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of ratings as a proxy, which we can find in the `rating_count_tot` column.

Below, we'll calculate the average number of user ratings per app genre on the App Store:


In [25]:
ios_genres = freq_table(ios_final, -5)

for genre in ios_genres:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
            
    avg_ratings = total / len_genre
    
    print(genre, ':', avg_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22812.92467948718
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
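
The nested loop above walks through the whole data set once per genre. The averages can also be accumulated in a single pass, as in the sketch below. It is an alternative formulation rather than the notebook's code; `average_by_group` is a hypothetical name and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


sample = [['Games', '10'], ['Games', '30'], ['Music', '5']]
sample_averages = average_by_group(sample, 0, 1)
print(sample_averages)  # {'Games': 20.0, 'Music': 5.0}
```

For the App Store data, the equivalent call would be `average_by_group(ios_final, -5, 5)`.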


Based on the average number of user ratings for any given app within a certain genre on the App Store, we can see that the most popular apps fall into the 'Navigation', 'Reference' and 'Social Networking' genres.
Although 'Navigation' came out on top (86,090 average ratings per app), this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [26]:
for app in ios_final:
    genre = app[-5]
    if genre == 'Navigation':
        print(app[1], ':', app[5]) # Print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
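
One way to see how strongly the two largest apps pull the average up is to compare the mean with the median for the six rating counts printed above. The median is offered here only as a side note; it is not used elsewhere in this analysis.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)
print(nav_mean)    # about 86090.33
print(nav_median)  # 8196.5

# Share of all Navigation ratings held by the two largest apps
top_two_share = (345046 + 154911) / sum(navigation_ratings)
print(round(top_two_share * 100, 1))  # 96.8
```

The mean matches the 86,090 figure in the table, while the median is roughly ten times smaller, which is what we would expect when a couple of apps hold most of the ratings.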


If we look closer into the 'Reference' genre (74,942 average ratings per app), a similar pattern occurs in that the Bible and Dictionary.com apps account for a huge proportion of user ratings.

In [27]:
for app in ios_final:
    genre = app[-5]
    if genre == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The 'Social Networking' genre also has a number of apps that dominate the statistics, such as Facebook and Pinterest:

In [28]:
for app in ios_final:
    genre = app[-5]
    if genre == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

However, the 'Social Networking' genre shows some potential in certain niches. Apps that analyse or add functionality to an existing app with a large user base can attract many users themselves: Followers - Social Analytics For Instagram, Timehop and Quick Reposter all follow this pattern.

Furthermore, if an app offers much needed functionality to a big social networking platform, there's a chance of investment from and partnership with the platform itself. 

Now let's analyse the Google Play market further.

# Most Popular Apps by Genre on Google Play

Unlike the App Store data set, the Google Play data set has data about the number of installs per app, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough - we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [29]:
display_table(android_final, 5) # The Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Although we don't know the exact number of installs per app, we don't need very precise data for our purposes - we only want to find out which app genres attract the most users overall.
With this in mind, we are going to leave the numbers as they are. 

However, to perform computations, we'll need to convert each install number from a string to a float. This means we'll need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this within a loop below and also compute the average number of installs for each genre (category).
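
Before running the full loop, the cleaning step can be tried on a single value. The helper below is only a minimal sketch: the name `parse_installs` is introduced here for illustration and is not part of the notebook's existing code.

```python
def parse_installs(installs):
    """Convert an install label such as '1,000,000+' to a float.

    Hypothetical helper, shown for illustration only.
    """
    # Remove the trailing plus sign and the thousands separators
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```

Calling `float()` directly on `'1,000,000+'` would raise a `ValueError`, which is why both characters have to be stripped first.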

In [30]:
# freq_table (defined earlier) gives us the categories found in column 1
android_categories = freq_table(android_final, 1)

for category in android_categories:
    total = 0         # sum of installs for this category
    len_category = 0  # number of apps in this category
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            # Strip '+' and ',' so the string can be converted to a float
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1

    avg_installs = total / len_category
    print(category, ':', avg_installs)


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Communication apps have the highest average number of installs (38,456,119). However, this number is heavily influenced by a few hugely popular apps such as WhatsApp Messenger and Skype:

In [31]:
for app in android_final:
    genre = app[1]
    if genre == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
        

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess
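
To see how much a handful of giants can pull an average upwards, we could compare a category's average with and without its largest apps. The snippet below is a sketch under stated assumptions: `sample_apps` holds made-up rows that only mimic the column layout of `android_final` (name at index 0, category at index 1, installs at index 5), and `average_installs` and its `cap` parameter are hypothetical names introduced for this example.

```python
# Made-up rows for illustration; these are NOT values from the data set.
sample_apps = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '500,000+'],
    ['App D', 'SOCIAL', '', '', '', '100,000+'],
]


def average_installs(apps, category, cap=None):
    """Average installs for a category, skipping apps at or above cap if given."""
    counts = []
    for app in apps:
        if app[1] != category:
            continue
        n_installs = float(app[5].replace('+', '').replace(',', ''))
        if cap is None or n_installs < cap:
            counts.append(n_installs)
    # Return 0.0 for an empty selection instead of dividing by zero
    return sum(counts) / len(counts) if counts else 0.0


# Average over every communication app in the toy sample
print(average_installs(sample_apps, 'COMMUNICATION'))

# Average after leaving out apps with 100 million installs or more
print(average_installs(sample_apps, 'COMMUNICATION', cap=100000000))  # 750000.0
```

With the real data, `android_final` would take the place of `sample_apps`. A large gap between the two averages would suggest that the category looks more popular than a typical app within it really is.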

This pattern is also seen in the social app category, where Facebook and Instagram have over a billion installs each:

In [32]:
for app in android_final:
    genre = app[1]
    if genre == 'SOCIAL' and (app[5] == '1,000,000,000+'
                              or app[5] == '500,000,000+'
                              or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


However, we can try to get some app ideas from the social apps that sit somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [33]:
for app in android_final:
    genre = app[1]
    if genre == 'SOCIAL' and (app[5] == '1,000,000+'
                              or app[5] == '5,000,000+'
                              or app[5] == '10,000,000+'
                              or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

TextNow - free text + calls : 10,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
Telegram X : 5,000,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
LiveMe - Video chat, new friends, and make money : 10,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
SPARK - Live random video chat & meet new people : 5,000,000+
Facebook Local : 1,000,000+
Meet – Talk to Strangers Using Random Video Chat : 5,000,000+
MobilePatrol Public Safety App : 1,000,000+
💘 WhatsLov: Smileys of love, stickers and GIF : 1,000,000+
HTC Social Plugin - Facebook : 10,000,000+
Quora : 10,000,000+
Kate Mobile for VK : 10,000,000+
Family GPS tracker KidControl + GPS by SMS Locator : 1,000,000+
Moment : 1,000,000+
Text Me: Text Free, Call Free, Second Phone Number : 10,000,000+
Text Free: 
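
The chained equality checks in the cell above could also be expressed as a numeric range test, which avoids listing every install label by hand. This is a sketch only: `installs_between` is a hypothetical helper, and `toy_apps` contains made-up rows that copy the column layout of `android_final` rather than real entries.

```python
# Made-up rows for illustration; these are NOT values from the data set.
toy_apps = [
    ['App A', 'SOCIAL', '', '', '', '500,000+'],
    ['App B', 'SOCIAL', '', '', '', '1,000,000+'],
    ['App C', 'SOCIAL', '', '', '', '50,000,000+'],
    ['App D', 'SOCIAL', '', '', '', '100,000,000+'],
    ['App E', 'TOOLS', '', '', '', '5,000,000+'],
]


def installs_between(apps, category, low, high):
    """Return (name, installs) pairs where low <= installs < high."""
    matches = []
    for app in apps:
        n_installs = float(app[5].replace('+', '').replace(',', ''))
        if app[1] == category and low <= n_installs < high:
            matches.append((app[0], app[5]))
    return matches


for name, installs in installs_between(toy_apps, 'SOCIAL', 1000000, 100000000):
    print(name, ':', installs)
```

Because the upper bound is exclusive, apps labelled `100,000,000+` are left out, which matches the four labels checked in the original cell.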

This niche seems to be dominated by messaging apps, dating apps and apps that complement large social media platforms. There's a lot of competition in these areas, so a new app would have to offer functionality or services that are unique.

As we have seen, a number of successful apps on both Google Play and the App Store either complement existing social media platforms or offer dating services, so a new app that combines both of these functions could be profitable on both stores.
There also don't seem to be many apps directed towards social media dating.

# Conclusions

In this project our goal was to analyse data from the Apple App Store and Google Play Markets to help developers understand what type of free apps are likely to attract more users, and therefore generate more revenue through advertising.

We concluded that an app that combines dating with a user's social media profiles could be profitable for both the Google Play and App Store markets. Alternatively, an app that adds functionality to an existing social media platform could also do well. These apps tend to attract a large number of users, and even though there is a fair bit of competition, an app that provides a genuinely useful service is more likely to stand out.