Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We'll import the csv module and read the AppleStore and googleplaystore data into separate lists of lists.

In [13]:
import csv

# Read the App Store data; the first row is the header
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios = list(csv.reader(open_file))
ios_header = ios[0]
ios = ios[1:]

# Read the Google Play data the same way
with open('googleplaystore.csv', encoding='utf8') as open_file:
    android = list(csv.reader(open_file))
android_header = android[0]
android = android[1:]

Next, we'll explore both data sets with our explore_data function, which takes four parameters: dataset, a list of lists; start and end, integers giving the starting and ending indices of a slice of the data set; and rows_and_columns, a boolean that defaults to False and controls whether the number of rows and columns is printed.

In [14]:
explore_data(ios,0,3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [15]:
explore_data(android, 0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The iOS data set has 7,197 rows and 16 columns.
The Android data set has 10,841 rows and 13 columns.

In [16]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [17]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Let's take a minute to see which columns will help us determine what types of apps are the most profitable. Price, ratings, content ratings, and genres all seem useful for deciding what type of app our company should create.

In [20]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In the Google Play data set, the entry above is missing its Category value, so every later value in that row is shifted one column to the left (the rating '1.9' sits where the category should be, and the row has 12 values instead of 13). We're going to delete this entry.

In [21]:
del android[10472]

In [22]:
# check to make sure we removed it correctly
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
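
The faulty row was located by inspection. A more general check, sketched here on a small made-up data set (the `header` and `rows` values and the `find_malformed_rows` helper are illustrative assumptions, not part of the original notebook), is to flag every row whose length differs from the header's.

```python
# Toy data standing in for the real header and rows (illustrative only)
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],            # missing 'Category', so later values shift left
    ['App C', 'GAME', '4.5'],
]

def find_malformed_rows(dataset, expected_length):
    """Return the indices of rows whose column count is not expected_length."""
    bad_indices = []
    for index, row in enumerate(dataset):
        if len(row) != expected_length:
            bad_indices.append(index)
    return bad_indices

malformed = find_malformed_rows(rows, len(header))
print(malformed)
```

On the real data, the same idea would be `find_malformed_rows(android, len(android_header))`, run before any row is deleted.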


Next, we're going to check whether there are any duplicate entries in either data set.

In [29]:
duplicate_ios_apps = []
unique_ios_apps = []

for v in ios:
    # index 0 of the iOS data is the app id, which identifies each app
    app_id = v[0]
    if app_id in unique_ios_apps:
        duplicate_ios_apps.append(app_id)
    else:
        unique_ios_apps.append(app_id)

In [30]:
# print(duplicate_ios_apps)
print(len(duplicate_ios_apps))

0


In [33]:
duplicate_android_apps = []
unique_android_apps = []

for v in android:
    name = v[0]
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)

In [34]:
#print(duplicate_android_apps)
print(len(duplicate_android_apps))

1181


As we can see above, the Google Play data set contains 1,181 duplicate entries. The App Store data set has no duplicate entries.
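
Checking membership in a list gets slow as the list grows. As an alternative, here is a minimal sketch that counts duplicates with `collections.Counter`; the `names` list is made up for illustration and is not taken from either data set.

```python
from collections import Counter

# Toy app names (illustrative only)
names = ['Maps', 'Chat', 'Maps', 'Notes', 'Chat', 'Maps']

name_counts = Counter(names)

# Every occurrence after the first counts as a duplicate entry,
# which matches what the loop-based check above appends
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print(n_duplicates)
print(duplicated_names)
```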

Next, we need to remove these duplicate entries, but not at random. We'll assume the entry with the highest number of reviews is the most recent one and keep it. To do this, we'll create a dictionary where each key is a unique app name and the value is the highest number of reviews for that app.

In [36]:
highest_reviews = {}

for v in android:
    name = v[0]
    reviews = float(v[3])
    
    if name in highest_reviews and highest_reviews[name] < reviews:
        highest_reviews[name] = reviews
    elif name not in highest_reviews:
        highest_reviews[name] = reviews

In [38]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(highest_reviews))

Expected length: 9659
Actual length: 9659


In [39]:
android_clean = []
already_added = []

for v in android:
    name = v[0]
    reviews = float(v[3])
    
    if (highest_reviews[name] == reviews) and (name not in already_added):
        android_clean.append(v)
        already_added.append(name)



The code cell above uses the highest_reviews dictionary to remove the duplicates. For the duplicate cases, we only keep the entries with the highest number of reviews. In that cell:

    We start by initializing two empty lists, android_clean and already_added.
    We loop through the android data set, and for every iteration:
        We isolate the name of the app and the number of reviews.
        We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
            The number of reviews of the current app matches the number of reviews of that app as described in the highest_reviews dictionary; and
            The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for highest_reviews[name] == reviews, we'll still end up with duplicate entries for some apps.
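
To see why the second condition matters, here is a small sketch on made-up rows in which one app ties on its highest review count. The rows and review numbers are illustrative assumptions, and a set is swapped in for the already_added list only because set lookups are faster.

```python
# Toy rows: [name, category, rating, reviews]; 'Box' ties on its highest count
toy_android = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Slack', 'BUSINESS', '4.4', '55'],
]

# Highest review count per app name
highest = {}
for row in toy_android:
    name, reviews = row[0], float(row[3])
    if name not in highest or highest[name] < reviews:
        highest[name] = reviews

# Keep exactly one row per app
clean = []
added = set()
for row in toy_android:
    name, reviews = row[0], float(row[3])
    if highest[name] == reviews and name not in added:
        clean.append(row)
        added.add(name)

print(len(clean))
```

Without the `name not in added` check, both 'Box' rows would pass the review comparison and the result would hold three rows instead of two.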



In [40]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


If we look at our data sets, we can see that some apps have names that are not in English. Since we develop our apps in English, we only want to analyze apps aimed at an English-speaking audience. Next, we'll create a function named is_english that checks whether every character in the app name we pass to it belongs to the American Standard Code for Information Interchange (ASCII) range. The built-in ord() function returns the numeric code of a character; if that value is greater than 127, the character falls outside the ASCII range and the name is treated as non-English.
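
To make the ASCII range concrete, here is a small sketch of what ord() returns for a few characters. The sample characters are chosen purely for illustration.

```python
# A few sample characters (illustrative only)
samples = ['a', 'Z', '~', 'é', '爱']

# Map each character to its numeric code point
code_points = {character: ord(character) for character in samples}

for character, code_point in code_points.items():
    # ascii() escapes non-ASCII characters so the output prints on any console
    print(ascii(character), code_point, code_point <= 127)
```

Only the first three samples fall at or below 127; the accented letter and the Chinese character do not.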

In [53]:
def is_english(string):
    for letter in string:
        if ord(letter) > 127:
            return False
    # only reached when every character is within the ASCII range
    return True

In [55]:
#Testing our is_english function is working properly
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


We need to update our function because some English apps have emojis or other characters with codes above 127 in their names. For example, 'Docs To Go™ Free Office Suite' and 'Instachat 😜' would be classified as non-English. To avoid losing valuable data, we'll only return False when a name contains more than 3 characters outside the ASCII range.

In [56]:
def is_english(string):
    non_english = 0
    
    for letter in string:
        if ord(letter) > 127:
            non_english += 1
    
    if non_english > 3:
        return False
    else:
        return True

In [57]:
#Testing our updated is_english function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


The two apps we mentioned in the previous markdown cell now return True. Next, we'll use our new function to filter out the non-English apps. We'll iterate through both data sets, storing only the English apps in new lists.

In [70]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

In [71]:
explore_data(android_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [72]:
print(len(ios_english))

6183


After using the is_english function on both data sets we're left with 9614 android apps and 6183 ios apps.

Since our company is only interested in creating free apps we're going to go through each data set and isolate the free apps. Just as a reminder the ios price is located at index 4 and the android price is index 7.
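
The filter used in this notebook compares the price strings exactly ('0.0' for iOS and '0' for Android). As a hedged alternative, the sketch below compares prices numerically; `is_free` is a hypothetical helper, and the assumption that paid prices may carry a '$' prefix is mine, not something shown in the output above.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example an empty string) are not treated as free
        return False

for sample in ['0', '0.0', '$4.99', '']:
    print(repr(sample), is_free(sample))
```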

In [76]:
free_ios = []
free_android = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)
        
for app in android_english:
    price = app[7]
    if price == '0':
        free_android.append(app)

In [78]:
print(len(free_android))
print(len(free_ios))

8864
3222


After isolating the free apps in both data sets we're down to 8864 android and 3222 ios apps.

As mentioned before, our company is focused on building free apps, and it wants each app to be available on both Google Play and the App Store. The company's development strategy has three steps: build a minimal Android version of the app and add it to Google Play; develop the app further if it gets a good response from users; and, if the app is profitable after six months, build an iOS version for the App Store. To understand which types of apps would work best, we're going to get a sense of which genres are most common in each market.

In [81]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [83]:
# android genre column index : 1
# ios genre column index : 11

We'll build two functions: freq_table, which generates a frequency table of percentages for a given column, and display_table, which prints those percentages in descending order.

In [90]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
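
As a quick check of the arithmetic, here is a sketch of the same idea on a toy data set. `freq_table_counter` is a hypothetical Counter-based variant, not the function defined above, and the rows are made up.

```python
from collections import Counter

# Toy rows: [name, genre] (illustrative only)
toy_apps = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Weather'],
    ['App D', 'Games'],
]

def freq_table_counter(dataset, index):
    """Map each value in the given column to its percentage of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

percentages = freq_table_counter(toy_apps, 1)

# Sort by percentage, highest first
for genre, percentage in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```

Three of the four toy rows are games, so 'Games' comes out at 75.0 and 'Weather' at 25.0.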



In [91]:
display_table(free_ios, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see above that Games dominates the free English apps in the App Store with a whopping 58% share, while Entertainment is a distant second at 7.88%.

In [92]:
display_table(free_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The Google Play store is not really dominated by one category. The Family category leads the way with 18.9%, the Game category comes next with 9.7%, followed closely by Tools at 8.5%.

After reviewing the tables from both data sets, games are a leading category in the app store as well as the second highest category on the googleplay store.

We can find out which genres are most popular by calculating the average number of users per genre. The App Store data set has no install count, so as a proxy we'll use the rating_count_tot column, located at index 5. For the Google Play data set we'll use the Installs column, also located at index 5.

In [94]:
genres_ios = freq_table(free_ios, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in free_ios:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Lifestyle : 16485.764705882353
Productivity : 21028.410714285714
Finance : 31467.944444444445
Travel : 28243.8
Health & Fitness : 23298.015384615384
Reference : 74942.11111111111
Shopping : 26919.690476190477
Education : 7003.983050847458
Photo & Video : 28441.54375
Navigation : 86090.33333333333
Medical : 612.0
Utilities : 18684.456790123455
Sports : 23008.898550724636
Entertainment : 14029.830708661417
Music : 57326.530303030304
News : 21248.023255813954
Games : 22788.6696905016
Food & Drink : 33333.92307692308
Business : 7491.117647058823
Book : 39758.5
Social Networking : 71548.34905660378
Catalogs : 4004.0
Weather : 52279.892857142855


Navigation apps have the highest average number of user ratings, at about 86,090. Let's view the navigation apps.

In [95]:
for app in free_ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Waze and Google Maps each have over 100,000 more ratings than any of their smaller competitors. This raises a question: are these categories really that popular, or are a few giant apps inflating the numbers for the entire category?
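
One way to see how strongly a couple of large apps pull up the average is to compare the mean with the median. This is a minimal sketch using the six Navigation rating counts printed above; the comparison is an addition for illustration, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts for the free Navigation apps, copied from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

average = mean(navigation_ratings)
middle = median(navigation_ratings)

print('Mean:', average)
print('Median:', middle)
```

The mean matches the 86,090 figure reported for the genre, while the median is 8196.5, roughly a tenth of that, which suggests the typical navigation app has far fewer ratings than the average implies.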

Another good example of inflated numbers is the average number of user ratings for Reference apps:

In [97]:
for app in free_ios:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


We can see that Bible and Dictionary.com have far more ratings than even Google Translate, which we can assume is used often yet has fewer than 30,000 ratings. Bible alone is closing in on one million.

After reviewing the genres, a game may not be the best option. Some popular categories we could pivot to are weather, book, food and drink, or finance apps.

In [98]:
# let's review the Installs column (index 5) of the android data set
display_table(free_android, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The install counts are strings such as '1,000,000+', so we'll strip the commas and plus signs, convert the values to floats, and then compute the average number of installs for each category.
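
The string cleaning can be sketched in isolation as follows. `parse_installs` is a hypothetical helper introduced for illustration; the notebook itself does the same replacements inline.

```python
def parse_installs(installs_text):
    """Convert an install bucket such as '1,000,000+' to a float."""
    cleaned = installs_text.replace(',', '').replace('+', '')
    return float(cleaned)

for sample in ['1,000,000+', '500+', '0']:
    print(sample, '->', parse_installs(sample))
```

Note that a value like '1,000,000+' is a lower bound for its bucket, so averages built from these numbers are approximate.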

In [101]:
categories_android = freq_table(free_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)



WEATHER : 5074486.197183099
ENTERTAINMENT : 11640705.88235294
AUTO_AND_VEHICLES : 647317.8170731707
NEWS_AND_MAGAZINES : 9549178.467741935
FINANCE : 1387692.475609756
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
LIFESTYLE : 1437816.2687861272
HOUSE_AND_HOME : 1331540.5616438356
BOOKS_AND_REFERENCE : 8767811.894736841
TRAVEL_AND_LOCAL : 13984077.710144928
DATING : 854028.8303030303
PHOTOGRAPHY : 17840110.40229885
LIBRARIES_AND_DEMO : 638503.734939759
PERSONALIZATION : 5201482.6122448975
ART_AND_DESIGN : 1986335.0877192982
EVENTS : 253542.22222222222
GAME : 15588015.603248259
TOOLS : 10801391.298666667
PARENTING : 542603.6206896552
FAMILY : 3695641.8198090694
VIDEO_PLAYERS : 24727872.452830188
SPORTS : 3638640.1428571427
SHOPPING : 7036877.311557789
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
FOOD_AND_DRINK : 1924897.7363636363
PRODUCTIVITY : 16787331.344927534
MAPS_AND_NAVIGATION : 4056941.7741935486
BEAUTY : 513151.88679245283
COMMUNICATION : 38456119.167247385