# Analyzing Profitable Android and iOS Mobile Apps

This project analyzes a Kaggle dataset to find profitable mobile apps in the App Store and Google Play Store. Apps are free of cost for an English-speaking audience and the main source of revenue is through in-app ads, which implies that revenue is mostly influenced by the number of app users. Our goal is to help developers identify apps that attract more users using a data-driven approach.


## Concepts Used

* The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
* List and for loops
* Conditional statements
* Dictionaries and frequency tables
* Functions
* Jupyter Notebook

##  Data Overview

As of September 2018, there are over 4 million apps in the App Store and Google Play Store combined. We will be looking at two sample datasets which look at a subset of these apps from both stores.

[Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps): A dataset containing information for approximately 10,000 Android apps collected in 2018 by Lavanya Gupta. Here's a link to download the [dataset](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

[Mobile App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): A dataset containing information for over 7,000 Apple iOS apps collected in 2017 by Ramanathan. Here's a link to download the [dataset](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Data Exploration

We will load both datasets and look at the first couple of rows for each dataset. We will also print, describe and identify columns that can help us with our analysis.

In [1]:
from csv import reader

### iOS App Dataset ###

open_AppStore = open("AppleStore.csv")
read_AppStore = reader(open_AppStore)
ios = list(read_AppStore)

for row in ios[:5]:
    print(row)
    print('\n')
    
print('Number of rows:', len(ios))
print('Number of columns:', len(ios[0]))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


For the iOS dataset, we have data describing 7197 apps (first row is for the header) with 13 columns. Here's a description for each of the columns from the Kaggle website.

* "id" : App ID

* "track_name": App Name

* "size_bytes": Size (in Bytes)

* "currency": Currency Type

* "price": Price amount

* "rating_count_tot": User Rating counts (for all versions)

* "rating_count_ver": User Rating counts (for current version)

* "user_rating" : Average User Rating value (for all versions)

* "user_rating_ver": Average User Rating value (for current version)

* "ver" : Latest version code

* "cont_rating": Content Rating

* "prime_genre": Primary Genre

* "sup_devices.num": Number of supporting devices

* "ipadSc_urls.num": Number of screenshots showed for display

* "lang.num": Number of supported languages

* "vpp_lic": Vpp Device Based Licensing Enabled

At first glance, `track_name, price, rating_count, user_rating, cont_rating` and `prime_genre` seem to be the most useful columns.

In [2]:
### Google App Store Dataset ###

open_GoogleStore = open("googleplaystore.csv")
read_GoogleStore = reader(open_GoogleStore)
android = list(read_GoogleStore)

for row in android[:5]:
    print(row)
    print('\n')
    
print('Number of rows:', len(android))
print('Number of columns:', len(android[0]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


For the android dataset, we have data describing 10841 apps (first row is for the header) with 13 columns. Here's a description for each of the columns from the Kaggle website.

* "App": Application name

* "Category":Category the app belongs to

* "Rating": Overall user rating of the app (as when scraped)

* "Reviews": Number of user reviews for the app (as when scraped)

* "Size": Size of the app (as when scraped)

* "Installs": Number of user downloads/installs for the app (as when scraped)

* "Type": Paid or Free

* "Price": Price of the app (as when scraped)

* "Content Rating": Age group the app is targeted at - Children / Mature 21+ / Adult

* "Genres": An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to INCOMPLETE

At first glance, `App, Category, Rating, Reviews, Installs, Type and Genres` seem to be the most useful columns.

## Data Cleanup

We will now continue with the data cleanup process. Specifically, we will:
1. Deal with inaccurate data.
2. Remove duplicate data.
3. Remove non-English apps.
4. Remove non-free apps.

The Android dataset has a discussion section, which mentions an error for row 10473. Let's see if the data is inaccurate.

In [3]:
print(android[10473]) 
print('\n')
print(android[0])
print('\n')
print(android[1])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


We can see that the `Category` columns seems to be off as well as the `Rating` which equals 19 (max is 5) for row 10473. It seems that the `Category` column is missing its value. Hence, we will delete this row.

In [4]:
del android[10473]

According to the discussion section, there is duplicate data in the android dataset. Let's count them.

In [5]:
dup_apps = []
unq_apps = []

for app in android:
    if app[0] in unq_apps:
        dup_apps.append(app[0])
    else:
        unq_apps.append(app[0])
        
print(len(dup_apps)) # Number of duplicate apps
print(dup_apps[:3]) # A sample of the duplicate app names

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']


 We have 1181 duplicate apps, and we will need a method of choosing which rows to keep instead of randomly deleting rows. Let's print one of the duplicate apps and see the difference between rows.

In [6]:
print(android[0]) # Header columns
print('\n')

for app in android:
    if app[0] == 'Instagram':
        print(app)
        print('\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




We can see that the only difference between the duplicates is in the 4th column, `Reviews`. We will keep the app with the highest number of reviews, as a higher number indicates more recent data. To remove the duplicates:
* Create a dictionary with a key-value pair of app_name: highest_number_of_reviews
* Use the information in the dictionary to create a new dataset, with one entry per app

In [7]:
print("Expected number of apps after we remove duplicates:", len(android) - 1181 - 1) # the 1 is for the header row

dict_review = {}

for app in android[1:]:
    n_review = float(app[3])
    app_name = app[0]
    if app_name in dict_review and n_review > dict_review[app_name]:
        dict_review[app_name] = n_review
    elif app_name not in dict_review:
        dict_review[app_name] = n_review
        
print(len(dict_review))

Expected number of apps after we remove duplicates: 9659
9659


Since the length of our dictionary matches the length of the dataset after we remove the duplicates, it is correct. Now we will use this dictionary to filter our dataset. We will:

* Create two lists: android_clean & already_added
* Run a for loop, to append apps with the highest reviews to our android_clean list

In [8]:
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    reviews = float(app[3])
    if reviews == dict_review[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659


Now, our new dataset had 9659 unique apps. Let's check if the iOS dataset has duplicate apps by checking the `id` column.

In [9]:
dup_apps_ios = []
unq_apps_ios = []

for app in ios:
    if app[0] in unq_apps_ios:
        dup_apps_ios.append(app[0])
    else:
        unq_apps_ios.append(app[0])
        
print(len(dup_apps_ios)) # Number of duplicate apps

0


So, there are no duplicate apps within the iOS dataset. Now, let's check for non-English speaking apps.

In [10]:
print(ios[814][1],ios[6732][1])
print('\n')
print(android_clean[4412][0],android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播 【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング لعبة تقدر تربح DZ


Since we are not interested in these apps, we will remove them. We'll remove apps whose names contain symbols not commonly used in English text. Since each character in a string has a corresponding number associated with it, we will use the number via the `ord()` function. The numbers corresponding to the characters used in English text are all in the range of 0 to 127 according to ASCII (American Standard Code for Information Interchange)

In [11]:
def check_string(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

print(check_string('Instagram'))
print(check_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_string('Docs To Go™ Free Office Suite'))
print(check_string('Instachat 😜'))
print(ord('😜'))
print(ord('™'))

True
False
False
False
128540
8482


For the last two examples, we get false, due to emojis and ™ logo. Since these symbols fall outside the ASCII range, we could potentially remove English speaking apps. So, in order to avoid this, we will keep apps with upto three non-English characters. This is still not perfect, but much more effective.

In [12]:
def check_string_v1(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
    if count > 3:
        return False
    return True

print(check_string_v1('Docs To Go™ Free Office Suite'))
print(check_string_v1('Instachat 😜'))

True
True


Now, let's use this function to filter both datasets.

In [13]:
android_eng = []
ios_eng = []

for row in android_clean:
    if check_string_v1(row[0]) == True:
        android_eng.append(row)
        
for row in ios[1:]:
    if check_string_v1(row[1]) == True:
        ios_eng.append(row)

print("Number of ios apps:", len(ios_eng))
print('\n')
print("Number of android apps:", len(android_eng))

Number of iOS apps: 6183


Number of android apps: 9614


In the last step of our data cleaning process, we will isolate free apps.

In [14]:
android_final = []
ios_final = []

for row in android_eng:
    if row[7] == '0':
        android_final.append(row)
        
for row in ios_eng:
    if row[4] == '0.0':
        ios_final.append(row)

print("Number of ios apps:", len(ios_final))
print('\n')
print("Number of android apps:", len(android_final))

Number of ios apps: 3222


Number of android apps: 8864


## Data Analysis

A developer's validation strategy is to:

* Build a minimal app and add it to the Google Play Store
* Develop the app further if it's received positively
* If the app is profitable for 6 months, build an iOS version and add it to the App Store

Since we would like to add apps on both stores, we need to find app profiles that are successful in both markets. We will look at the most important genres in both markets by building a frequency table. Next, we'll look at the `prime_genre` column for the iOS dataset, and the `Genres` and `Category` columns for the android dataset.

In [15]:
### Frequency Table Function ###

def freq_table(dataset,index):
    # creating the frequency dictionary
    freq_dict = {}
    total = 0
    for row in dataset:
        total += 1
        if row[index] in freq_dict:
            freq_dict[row[index]] += 1
        else:
            freq_dict[row[index]] = 1
    # creating the percentage dictionary
    freq_percents = {}
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        freq_percents[key] = percentage 
    # creating a list of tuples from the percentage dictionary
    display_table = []
    for key in freq_percents:
        display_table.append((freq_percents[key],key))
    # sorting the list in ascending order and printing percentages
    display_table = sorted(display_table, reverse = True)
    for item in display_table:
        print(item[1],':',item[0])

In [16]:
### prime_genre column frequency table in the iOS dataset ###
freq_table(ios_final,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In the iOS dataset, `Games` make up the majority of apps (58.16%), followed by `Entertainment` (7.88%) and `Photo & Video` (4.97%). It seems that in the App Store, the free apps section is dominated by leisure apps (`Games`, `Entertainment`, `Photo & Video`) while practical/productivity apps (`Education`, `Shopping`, `Utilities` etc.) are less dominant. However, making the assumption that leisure apps have the most users is not correct - demand might not match supply.

In [17]:
### Category column frequency table in the android dataset ###
freq_table(android_final,1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In the android dataset, `FAMILY` makes up the majority of apps (18.91%), followed by `GAME` (9.72%) and `TOOLS` (8.46%). It seems that in the Google Play Store, the distribution of apps is more balanced. However, upon further investigation, the `FAMILY` category is mostly games for kids.

![Image](https://play.google.com/store/apps/category/FAMILY?hl=en)

Still, practical apps are better represented in Google Play Store. Let's take a look at the `Genres` column in the android dataset. 

In [18]:
### Genres column frequency table in the android dataset ###
freq_table(android_final,-4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Looking at the distribution of the `Genres` column, we can confirm that the distribution of apps in the android dataset is much better compared to the iOS dataset. Although the difference between the `Genres` and `Category` column is not clear, the `Genres` column seems much more granular (more categories).

Now, let's look at the apps that have the highest number of users in each dataset. We will:

* Calculate the average number of installations for each app. For the android dataset, we have the `Installs` column, but there is no such column in the iOS dataset.
* We will use the total number of user ratings as a proxy for the iOS dataset, using the `rating_count_tot` column.

In [19]:
### Average number of user rating per app genre for the iOS dataset ###

# using a for loop to get unique genres

ios_genres = []
for row in ios_final:
    if row[-5] not in ios_genres:
        ios_genres.append(row[-5])

# using a for loop to count average user ratings for each genre
genre_average = []
genre_dict = {}

for genre in ios_genres:
    total = 0
    genre_len = 0
    for row in ios_final:
        if row[-5] == genre:
            total += float(row[5])
            genre_len += 1
    genre_dict[genre] = total/genre_len
    genre_average.append((genre_dict[genre],genre))
    
# sorting and displaying average number of user rating for each genre in descending order
genre_sorted = sorted(genre_average,reverse=True)

for item in genre_sorted:
    print(item[1],':',item[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


`Navigation` apps have the highest number of user ratings. Let's look at the distribution to see which apps have the highest rating count.

In [22]:
# Use a for loop to display the rating for each app in the Navigation genre

ios_navigation = []
ios_navigation_dict = {}

for row in ios_final:
    if row[-5] == 'Navigation':
        ios_navigation_dict[row[1]] = row[5]
        ios_navigation.append((ios_navigation_dict[row[1]],row[1]))

# Sort and display results in ascending order
ios_navigation_sorted = sorted(ios_navigation,reverse=True)

print("Rating for Apps in Navigation genre")
print('\n')

for item in ios_navigation_sorted:
    print(item[1],':',item[0])

Rating for Apps in Navigation genre


Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
ImmobilienScout24: Real Estate Search in Germany : 187
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811


It seems that **Waze - GPS Navigation, Maps & Real-time Traffic** and **Google Maps - Navigation & Transit** dominate in the `Navigation` category. Let's take a look at the `Reference` and `Social Networking` category.

In [21]:
# Use a for loop to display the rating for each app in the Reference genre

ios_reference = []

for row in ios_final:
    if row[-5] == 'Reference':
        ios_reference.append((row[5],row[1]))

# Sort and display results in ascending order
ios_reference_sorted = sorted(ios_reference,reverse=True)

print("Rating for Apps in Reference genre")
print('\n')

for item in ios_reference_sorted:
    print(item[1],':',item[0])
    
# Use a for loop to display the rating for each app in the Social Networking genre

ios_sn = []

for row in ios_final:
    if row[-5] == 'Social Networking':
        ios_sn.append((row[5],row[1]))

# Sort and display results in ascending order
ios_sn_sorted = sorted(ios_sn,reverse=True)

print('\n')
print("Rating for Apps in Social Networking genre")
print('\n')

for item in ios_sn_sorted:
    print(item[1],':',item[0])

Rating for Apps in Reference genre


Bible : 985920
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Real Bike Traffic Rider Virtual Reality Glasses : 8
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
Dictionary.com Dictionary & Thesaurus for iPad : 54175
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Google Translate : 26786
Dictionary.com Dictionary & Thesaurus : 200047
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
VPN Express : 14
Night Sky : 12122
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Rating for Apps in Social Networking genre


MeetMe - Chat and Meet New People : 97

It seems that the pattern repeats for the `Reference` and `Social Networking` category, where a few apps dominate. INCOMPLETE?