## Determining Profitable Mobile App Profiles

The aim of this project is to determine what mobile app profiles are profitable based on data from the App Store and Google Play markets.  

In the context of this project, I am working at a company that builds mobile apps that are free to download.  The company's main source of revenue is in-app ads, so the apps of greatest interest are those that attract a large audience.

## Preparing and Exploring the Data

As of March 2017, there were 2.8 million apps available on Google Play and 2.2 million apps available in Apple's App Store.  Given the limited scope of this project, I will be analyzing a sample of the available app data.  

[A data set containing data about approximately ten thousand Android apps from Google Play](https://www.kaggle.com/lava18/google-play-store-apps/home)  
[A data set containing data about approximately seven thousand iOS apps from the App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

In [1]:
from csv import reader

# Prepare iOS App Store data
with open("AppleStore.csv", encoding="utf8") as opened_ios:
    ios_data = list(reader(opened_ios))

# Prepare Android Google Play data
with open("googleplaystore.csv", encoding="utf8") as opened_android:
    android_data = list(reader(opened_android))

In [2]:
# Function that prints a slice of a dataset so it can be used repeatedly to explore the data in a readable format.
# It can optionally print the number of rows and columns as well.
def explore_data(dataset, start, end, rows_and_columns = True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds an empty line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(ios_data, 0, 5)
explore_data(android_data, 0, 5)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7198
Number of columns: 17
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'P

## Cleaning the Data: Handling Incorrect Data and Duplicate Entries

The discussion forum for the Google Play dataset indicates that row 10473 (counting the header row as row 0) is erroneous.  

Let's compare row 10473 to the header row to figure out if this is true.

In [4]:
print(android_data[0])
print("\n")
print(android_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Based on the above output, row 10473 is clearly incorrect: its rating is 19, while the maximum rating on Google Play is 5.  The row also has 12 values against the header's 13, which suggests that the Category value is missing and every later field has shifted one column to the left.

In [5]:
del android_data[10473]
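
A structural check can catch rows like this without relying on the forum. The sketch below is only an illustration: `find_malformed_rows` is a hypothetical helper and `sample` is made-up data, not the real file. It flags any row whose length differs from the header's, which is the kind of mismatch a missing field produces.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the last row is missing its category value.
sample = [
    ["App", "Category", "Rating"],
    ["Some App", "TOOLS", "4.1"],
    ["Broken App", "1.9"],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

Because `del` changes the data in place, running a check like this before deleting also guards against removing the wrong row if the cell is executed twice.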

We should also check for duplicate entries and remove those.

In [6]:
# Function to find duplicate entries 
def find_dups(dataset):
    duplicate_apps = []
    unique_apps = []
    
    for each_app in dataset:
        if each_app[0] in unique_apps:
            duplicate_apps.append(each_app[0])
        else:
            unique_apps.append(each_app[0])
            
    print("Number of duplicate entries in: ", len(duplicate_apps))
    print("Number of unique entries: ", len(unique_apps))

In [7]:
find_dups(ios_data)
find_dups(android_data)

Number of duplicate entries in:  0
Number of unique entries:  7198
Number of duplicate entries in:  1181
Number of unique entries:  9660


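The Google Play dataset contains many duplicate entries.  Let's keep only the most recent entry for each app and remove the others.  Because duplicate rows can also share the same "Last Updated" value, we will use the total number of reviews as a proxy for recency: the more reviews a row records, the more recently it was likely collected.  

Note that find_dups compares the first column of each row.  In the App Store data that column is a row index rather than the app name, so the zero reported above does not rule out duplicate app names there.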

To start, we will build a dictionary that pairs unique app names with the largest number of reviews.

In [8]:
reviews_max = {}

for each_app in android_data[1:]:
    app_name = each_app[0]
    app_reviews = float(each_app[3])
    
    if app_name in reviews_max and reviews_max[app_name] < app_reviews:
        reviews_max[app_name] = app_reviews
        
    elif app_name not in reviews_max:
        reviews_max[app_name] = app_reviews

We know that there are 1,181 duplicate entries.  Therefore, as an error check, the length of our new dictionary should equal the length of our original dataset minus 1,182 (we subtract one additional row to account for the header).

In [9]:
print(len(reviews_max))
print(len(android_data) - 1181 - 1)

9659
9659


Rather than change the original underlying data, let's use the reviews_max dictionary we generated to create a clean list of unique apps.

As we loop through the Google Play dataset:
1. We isolate the name of the app and the number of reviews
2. We add the current iteration's entire row to the clean list of apps and the app name to the android_already_added list if:
  * The number of reviews of the current iteration row matches the maximum number of reviews in the reviews_max dictionary and...
  * The name of the app is not already in the android_already_added list.  

As a note, we use the android_already_added list to account for the edge cases where the highest number of reviews of a duplicate app is the same for more than one entry.  

In [10]:
android_clean = []
android_already_added = []

for each_app in android_data[1:]:
    app_name = each_app[0]
    app_reviews = float(each_app[3])
    
    if (app_reviews == reviews_max[app_name]) and (app_name not in android_already_added):
        android_clean.append(each_app)
        android_already_added.append(app_name)
        
print(len(android_clean))
print(len(android_already_added))

9659
9659
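
The two passes above (build reviews_max, then filter) can also be collapsed into one. The following is a sketch under stated assumptions, not a replacement for the cells above: `keep_most_reviewed` is a hypothetical helper, the sample rows are made up, and the column positions default to those of the Google Play data (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    On ties, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows (header already removed).
rows = [
    ["App A", "TOOLS", "4.0", "100"],
    ["App B", "GAME", "4.5", "50"],
    ["App A", "TOOLS", "4.0", "120"],
    ["App B", "GAME", "4.5", "50"],
]

unique_rows = keep_most_reviewed(rows)
for row in unique_rows:
    print(row)
```

One difference to keep in mind: the result is ordered by each app's first appearance, so the row order may differ from android_clean even though the same rows are kept.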


Let's explore the newly cleaned data to make sure everything looks okay.  

In [11]:
explore_data(android_clean, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Cleaning the Data: Removing Non-English Apps

After further exploration of the dataset, some app names suggest they are not directed toward an English-speaking audience.  Given that our company's primary audience speaks English, we want to exclude non-English apps.

According to the ASCII standard, the characters commonly used in English text all have codes in the range 0 to 127.  With this in mind, we can create a function that determines whether or not an app name is likely to be in English.

Unfortunately, some symbols that appear in English app names (emojis, the trademark sign, etc.) have code points greater than 127.  To avoid discarding too many legitimate apps, we will retain all app names that contain three or fewer characters with a code point above 127.

In [12]:
# Function to determine whether or not an app name is likely to be English
def is_english_name(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

In [13]:
ios_english_apps = []
android_english_apps = []

for each_app in ios_data[1:]:
    app_name = each_app[2]
    if is_english_name(app_name):
        ios_english_apps.append(each_app)
        
for each_app in android_clean:
    app_name = each_app[0]
    if is_english_name(app_name):
        android_english_apps.append(each_app)
        
explore_data(ios_english_apps, 0, 3)
explore_data(android_english_apps, 0, 3)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6183
Number of columns: 17
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Te

## Determining Which Apps are Free

Because our company is only interested in the profiles of free apps, we need to separate the free apps from the paid ones.

We can do this by iterating through our English-only app lists and keeping only the apps whose price is recorded as "0".

In [14]:
ios_free_apps = []
android_free_apps = []

for each_app in ios_english_apps:
    price = each_app[5]
    if price == "0":
        ios_free_apps.append(each_app)

for each_app in android_english_apps:
    price = each_app[7]
    if price == "0":
        android_free_apps.append(each_app)       
        
print(len(ios_free_apps))
print(len(android_free_apps))

3222
8864
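
The string comparison above only works if every free app is recorded as exactly "0". As a more defensive sketch, the hypothetical helper `is_free` below parses the price as a number first. It assumes that prices might also appear in forms such as "0.0" or "$4.99"; that is an assumption for illustration, not something confirmed by the output shown here.

```python
def is_free(price_text):
    """Return True if a price string represents zero.

    Assumes an optional leading "$" and an otherwise numeric value.
    """
    cleaned = price_text.strip().lstrip("$")
    return float(cleaned) == 0.0

for price in ["0", "0.0", "3.99", "$4.99"]:
    print(repr(price), "->", is_free(price))
```

If the price column contained non-numeric text, `float` would raise a `ValueError`, which is arguably preferable to silently treating a malformed row as paid.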


## Determining Profitable App Profiles: Common Genres

In order to maximize the results of our effort, the company's validation strategy for an app idea follows three steps:

1. Build a minimal Android version and release it to Google Play
2. If the app receives good user response, we continue developing it
3. If the app is profitable after six months, we build an iOS version and add it to the App Store

Because our ultimate goal is to add the app to both the iOS and Android markets, we want to find an app profile that is successful in both.

To begin, we will determine what the most common app genres are in each market.  We will do this by using two functions.  The first will generate a frequency table for a user-specified column, while the second will create an ordered display of the frequency tables.

In [15]:
# Function to generate a frequency table (as percentages) for the column at the given index
def freq_table(dataset, index):
    table =  {}
    count = 0
    
    for row in dataset:
        count += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentage = {}
    for key in table:
        percentage = (table[key] / count) * 100
        table_percentage[key] = percentage
    
    return table_percentage

# Function to create an ordered display for frequency tables
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
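
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used in this notebook; `freq_table_counter` is a hypothetical name and `sample` is made-up data.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample: three games and one music app.
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]

percentages = freq_table_counter(sample, 1)
for value, percentage in percentages.items():
    print(value, ":", percentage)
```

Since `most_common()` already returns values in descending order of frequency, no separate sorting step is needed for display.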

In [16]:
display_table(ios_free_apps, 12)
print("\n")
display_table(android_free_apps, 1)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
C

In the iOS market, "games" are by far the most common type of app (~58%).  It's important to keep in mind that while games are most common, they may not necessarily attract the most customers.  We will need to examine additional data.  

In the Android market, "family" is the most common type of app (~19%).  Based on the frequency table, it appears that the app landscape is vastly different on Google Play compared to the iOS App Store.  Generally, the Google Play market seems to contain more practical apps that are focused on "tools", "business", and "productivity".

## Determining Profitable App Profiles: Popular Genres

In addition to figuring out which app genres are most common, we also want to figure out which genres are most popular so our company has an idea of where to direct development efforts.

For the iOS App Store, we can proxy popularity with the average number of user ratings for each genre.  

In [17]:
print(ios_free_apps[0])

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


In [24]:
genre_freq_table = freq_table(ios_free_apps, 12)

for genre in genre_freq_table:
    total = 0
    len_genre = 0
    
    for app in ios_free_apps:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ":", avg_n_ratings)

print("\n")  

for app in ios_free_apps:
    if app[12] == "Navigation":
        print(app[2], ":", app[6])

print("\n")  

for app in ios_free_apps:
    if app[12] == "Music":
        print(app[2], ":", app[6])
        
print("\n")  

for app in ios_free_apps:
    if app[12] == "Social Networking":
        print(app[2], ":", app[6])  

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


Pandora - Music & Radio : 1126879
Shazam - Discover music, artists, videos & lyrics : 402925
iHe

Based on the table above, navigation apps have the highest average number of user ratings.  However, drilling into the navigation genre reveals that Waze and Google Maps together account for roughly 0.5 million of those ratings, which heavily skews the average.  

We see a similar phenomenon with social networking and music apps, where a small number of big players account for a disproportionately high share of user ratings.
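
One way to quantify this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six navigation rating counts printed above; applying the median more broadly is a suggestion, not part of the analysis in this notebook.

```python
from statistics import mean, median

# Rating counts of the six free navigation apps listed in the output above.
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

avg = mean(navigation_ratings)
mid = median(navigation_ratings)

print("Mean:", avg)    # about 86090.33, matching the genre average above
print("Median:", mid)  # 8196.5
```

The median is roughly a tenth of the mean, which suggests that a typical navigation app receives far fewer ratings than the genre average implies.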

Let's take a look at the Google Play market to see what profiles stand out there.  To analyze the Google Play market, we will use the average number of installs as our comparison metric.

In [30]:
display_table(android_free_apps, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Based on the above output, install counts are reported in open-ended buckets (for example, 100,000+) rather than as exact figures.  For the purposes of this analysis, we don't require a precise number of installs, so this will not be a problem.  However, we will need to remove the commas and plus signs from the bucket labels before converting them to numbers.  
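
The conversion can be isolated in a small helper so that it is written only once. `parse_installs` is a hypothetical name used for illustration.

```python
def parse_installs(bucket):
    """Convert an install bucket label such as '1,000,000+' to a float."""
    return float(bucket.replace(",", "").replace("+", ""))

for label in ["1,000,000+", "500+", "0+", "0"]:
    print(label, "->", parse_installs(label))
```

Because each label is the lower bound of its bucket, any average computed from these values is best read as a lower bound on the true average.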

In [42]:
category_freq_table = freq_table(android_free_apps, 1)

for category in category_freq_table:
    total = 0
    len_category = 0
    
    for app in android_free_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(",", "")
            n_installs = n_installs.replace("+", "")
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    
    avg_n_installs = total / len_category
    print(category, ":", avg_n_installs)
    
print("\n")

for app in android_free_apps:
    if (app[1] == "COMMUNICATION") and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+" or app[5] == "100,000,000+"):
        print(app[0], ":", app[5])
        
print("\n")

under_100mm = []
for app in android_free_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(",", "")
    n_installs = n_installs.replace("+", "")
    n_installs = float(n_installs)
    if (n_installs < 100000000) and (app[1] == "COMMUNICATION"):
        under_100mm.append(n_installs)

print("Avg Communication Installs < 100mm:", sum(under_100mm) / len(under_100mm))

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Looking at our output, the communication category has the highest average number of installs.  As in the iOS market, the average is heavily skewed by a small number of very popular apps.  If we remove the apps with 100 million or more installs, the average number of installs is roughly 10 times smaller.  This pattern of a few major competitors owning a large portion of the market persists across many app categories, much as it does in the iOS market.  

## Conclusions

During this project, we analyzed data about apps in the App Store and Google Play markets.  The goal was to determine which mobile app profiles are likely to be profitable, using audience size as a proxy for ad revenue.

We concluded that the genres with the largest average number of installs and reviews tend to be dominated by a small number of large companies that would be difficult to compete with.

We also observed that certain genres, like games, are oversaturated, meaning it may be tough to garner a large enough audience.

Therefore, we propose developing a full-featured alternative to a common type of app in a genre that is not yet saturated.  Determining which genre to target will require additional analysis.