# Determining App Profitability

The goal of this project is to analyze data from two different markets, the Google Play store and Apple's App Store, to determine what makes an app popular.

Companies hoping to make free apps for the English-speaking market (and therefore relying on user downloads and ad revenue for income) can use this analysis to decide which category or genre a new app should target.

This project can be broken down into three parts: acquiring the relevant data, cleaning the data, and analyzing the data.


## Data Acquisition

Analyzing all of the millions of apps available on both markets is impractical, so I used a sample dataset for each market.

The Google Play database was sourced from [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

The App Store database was sourced from [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

I begin by opening both of the relevant files and familiarizing myself with both databases.

In [111]:
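from csv import reader

# Google Play data: the first row is the header, the rest are apps
with open('googleplaystore.csv', encoding='utf8', newline='') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# App Store data: same layout (header row, then one row per app)
with open('AppleStore.csv', encoding='utf8', newline='') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple = apple[1:]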
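To get familiar with the two datasets, a small printing helper is handy. The block below is a minimal sketch of one; the function name `explore_data` and the tiny `sample_header` / `sample_rows` lists are illustrative stand-ins I made up, not values from either dataset.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print the rows dataset[start:end] and, optionally, the dataset's shape."""
    for row in dataset[start:end]:
        print(row)
        print()  # blank line between rows for readability

    if rows_and_columns and dataset:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


# Made-up stand-in data, only to show how the helper is called
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Example App A', 'TOOLS', '4.1'],
    ['Example App B', 'GAME', '4.5'],
]

print(sample_header)
explore_data(sample_rows, 0, 2, rows_and_columns=True)
```

With the real data, the same call would look like `explore_data(android, 0, 3, True)` after printing `android_header`.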

## Data Cleaning
#### Deleting inaccurate data, duplicate apps, non-English apps, and paid apps

The datasets must first be cleaned to provide only relevant apps for analysis. As the target of this project is for companies that are aiming to develop free apps for the English market, the data will be cleaned accordingly for the sample to be representative.


First, a quick look at the discussion page of the Google Play dataset reveals that row 10472 lacks a category entry, which pushes the remaining values in that row out of their proper columns. This can be fixed by deleting the row.

In [112]:
del android[10472] # <-- run code once to delete row with wrong data (was missing a category)
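A row like this can also be located programmatically rather than from the discussion page. The sketch below assumes that a missing entry leaves the row with a different number of fields than the header; `find_malformed_rows` and the sample rows are hypothetical names and values used only for illustration.

```python
def find_malformed_rows(header, dataset):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up stand-in data: the second row is missing its category
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Example App A', 'TOOLS', '4.1'],
    ['Example App B', '4.5'],
    ['Example App C', 'GAME', '3.9'],
]

print(find_malformed_rows(sample_header, sample_rows))
```

Running `find_malformed_rows(android_header, android)` before the `del` statement should point at the same row, provided the assumption about field counts holds for this file.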

#### Removing Duplicate Apps

To continue with data cleaning, there are duplicate entries in the Google Play dataset that must be removed (the Apple App Store does not have duplicate rows). We can confirm Android duplicates using the code below:

In [31]:
duplicate_apps = []
unique_apps = []

for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('There are ', len(duplicate_apps), ' duplicate apps.')
print('There are ', len(unique_apps), ' unique apps.')
print('\n')
print('Duplicate app examples: ', duplicate_apps[:5])

There are  1181  duplicate apps.
There are  9659  unique apps.


Duplicate app examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


I notice that each duplicate app differs on the "Reviews" column, defined as how many reviews each app has. These differences can be attributed to reviews data from these apps being collected at different points in time. As such, it would be most accurate to keep the entry with the most recent number of reviews (the highest number) and delete the rest of the duplicates.

We do this by building a dictionary called `reviews_max` that maps each app name to the highest number of reviews recorded for that app. We then loop through the dataset again and add a row to a list called `android_clean` only if its number of reviews matches the value in `reviews_max` and the app has not already been added.


In [41]:
reviews_max = {}

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max)) 
# as expected, there are now 9659 rows in the dictionary, corresponding to the number of unique apps

android_clean = []
already_added = []

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))
# as expected, there are still 9659 rows in this list without duplicates

9659
9659
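The same de-duplication can be done in a single pass by storing the best row itself rather than only its review count. This is a sketch of that alternative, not the method used above; `keep_most_reviewed` and the sample rows are hypothetical. One difference to be aware of: the result is ordered by each app's first appearance, not by the position of its most-reviewed row.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up stand-in rows: name, category, rating, reviews
sample_rows = [
    ['App A', 'X', '4.0', '100'],
    ['App B', 'X', '4.0', '50'],
    ['App A', 'X', '4.0', '120'],
    ['App B', 'X', '4.0', '50'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

Because lookups in a dictionary do not slow down as it grows, this also avoids the repeated `name not in already_added` list scans.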


#### Removing Non-English Apps

The next step of data cleaning is removing apps whose names suggest they are aimed at a non-English audience. This can be done by iterating over the name of each app and checking whether each character belongs to the standard English character set. Each character corresponds to a number, and, according to the [American Standard Code for Information Interchange](https://en.wikipedia.org/wiki/ASCII), numbers between 0 and 127 represent the characters commonly used in English text. Characters outside this range likely denote a non-English app.

We begin by defining a function that treats a name as non-English if it contains more than three characters outside this range. The threshold of three allows for English apps whose names include special characters such as em dashes; this heuristic is not perfect, but it is still effective at filtering out the majority of non-English apps.

In [62]:
def is_english(input_string):
    non_english_count = 0
    
    for character in input_string:
        if ord(character) > 127:
            non_english_count += 1
            
    if non_english_count > 3:
        return False
        
    return True

android_clean_english = []
apple_clean_english = []

for row in android_clean:
    name = row[0]
    if is_english(name) is True:
        android_clean_english.append(row)
        
for row in apple:
    name = row[1]
    if is_english(name) is True:
        apple_clean_english.append(row)

print(len(android_clean_english)) # 9614 Android apps now
print(len(apple_clean_english)) # 6183 Apple apps now

9614
6183


#### Removing Paid Apps

Lastly, we target only the free apps in the dataset. We can isolate free apps by examining the pricing column in each dataset and keeping only those whose prices are 0.

In [68]:
android_final = []
apple_final = []

for row in android_clean_english:
    price = row[7]
    if price == '0':
        android_final.append(row)
        
for row in apple_clean_english:
    price = row[4]
    if price == '0.0':
        apple_final.append(row)
        
print(len(android_final)) # 8864 apps in final list for analysis
print(len(apple_final)) # 3222 apps in final list for analysis
    

8864
3222
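The comparisons above depend on the exact strings `'0'` and `'0.0'`. A slightly more defensive option is to compare prices numerically. The sketch below assumes that paid prices may carry a leading `$`; that format, the helper name `is_free`, and the sample strings are assumptions for illustration.

```python
def is_free(price_text):
    """Return True if the price string parses to zero, False otherwise."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as "not free"
        return False


for text in ['0', '0.0', '$4.99', '1.99', 'Everyone']:
    print(repr(text), '->', is_free(text))
```

With this helper, both loops could use the same test, for example `if is_free(row[7]):` for Google Play and `if is_free(row[4]):` for the App Store.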


## Data Analysis

A potential strategy for developing successful apps is to build a prototype Android version and release it on the Google Play store; depending on the app's performance and profitability after six months, it can then be added to the Apple App Store. This strategy minimizes the risk of investing in unprofitable apps.

However, any decisions to develop an app must be data-backed. As such, I conducted data analysis to determine the most common and most popular apps in each market as a starting point before diving deeper into what specific genres have potential.

#### Most Common Apps

To begin analyzing which types of apps are most profitable, we can first build a frequency table of app genres in both markets. The relevant Google Play column is `Category`, while the relevant App Store column is `prime_genre`. After building a frequency table by genre for each market, the tables can be sorted by descending percentage to show which genres are most common.


In [71]:
def freq_table(dataset, index):
    table = {}
    table_percent = {}
    counter = 0
    
    for row in dataset:   
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
        counter += 1
        
    for key in table:
        percent = table[key] * 100 / counter
        table_percent[key] = percent
        
    return table_percent

# helper function from Dataquest
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
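For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, not a replacement for the functions above; `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows holding that value in column `index`}."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count * 100 / total for value, count in counts.items()}


# Made-up stand-in rows: name, category
sample_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]

table = freq_table_counter(sample_rows, 1)
# Sort by percentage, largest first, as display_table does
for value, percent in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', percent)
```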

With the frequency table by percentage function created, the data can now be analyzed with tables based on the relevant genre columns. We begin with free English apps on the App Store.

In [73]:
display_table(apple_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common genre of free English apps on the App Store is, by far, games, with 58% of the share. Entertainment follows at a distant second with 7.9%. The general impression is that fun apps, rather than practical and productive ones, take up the vast majority of the market share.

We continue our analysis with Google Play apps.

In [76]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

There appears to be a big difference between the most common free English apps on the App Store and those on the Google Play store; the latter is much less dominated by games, which rank as the second most common category at 9.7%, behind family at 18.9%. In general, the Google Play store's most common apps tend to be much more focused on productivity and practicality, with fewer fun and entertainment apps than the App Store.

#### Most Popular Apps

Although the previous analysis showed which free English apps are most common in each market, it does not necessarily reflect which apps are most popular and have the most users. To estimate this, we can calculate the average number of installs, or a proxy for the number of users, for each genre, beginning with the App Store. The App Store code below uses each app's number of user ratings (column index 5) as that proxy. This can be done with a nested for loop that sums the number of ratings of all the apps in a genre and then divides by the number of apps in that genre to find the average.

##### App Store

In [91]:
apple_genres = freq_table(apple_final, 11)

for genre in apple_genres:
    total = 0
    counter = 0
    
    for row in apple_final:
        genre_app = row[11]
        if genre_app == genre:
            num_ratings = float(row[5])
            total += num_ratings
            counter += 1

    average_num_ratings = total / counter
    print(genre, ': ', average_num_ratings)
    

Social Networking :  71548.34905660378
Photo & Video :  28441.54375
Games :  22788.6696905016
Music :  57326.530303030304
Reference :  74942.11111111111
Health & Fitness :  23298.015384615384
Weather :  52279.892857142855
Utilities :  18684.456790123455
Travel :  28243.8
Shopping :  26919.690476190477
News :  21248.023255813954
Navigation :  86090.33333333333
Lifestyle :  16485.764705882353
Entertainment :  14029.830708661417
Food & Drink :  33333.92307692308
Sports :  23008.898550724636
Book :  39758.5
Finance :  31467.944444444445
Education :  7003.983050847458
Productivity :  21028.410714285714
Business :  7491.117647058823
Catalogs :  4004.0
Medical :  612.0


Although some genres may seem extremely popular, many of the averages are heavily skewed by one or two apps; for example, the Navigation category, which has the highest average number of user ratings, is skewed by the two apps Waze and Google Maps. This was determined by printing each app's name and number of ratings within a genre, as in the following code:

In [92]:
for row in apple_final:
    if row[11] == 'Navigation':
        print(row[1], ': ', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5


A similar trend follows with other categories such as Social Networking:

In [114]:
for row in apple_final:
    if row[11] == 'Social Networking':
        print(row[1], ': ', row[5])

Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373519
Messenger :  351466
Tumblr :  334293
WhatsApp Messenger :  287589
Kik :  260965
ooVoo – Free Video Call, Text and Voice :  177501
TextNow - Unlimited Text + Calls :  164963
Viber Messenger – Text & Call :  164249
Followers - Social Analytics For Instagram :  112778
MeetMe - Chat and Meet New People :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos :  90414
InsTrack for Instagram - Analytics Plus More :  85535
Tango - Free Video Call, Voice and Chat :  75412
LinkedIn :  71856
Match™ - #1 Dating App. :  60659
Skype for iPad :  60163
POF - Best Dating App for Conversations :  52642
Timehop :  49510
Find My Family, Friends & iPhone - Life360 Locator :  43877
Whisper - Share, Express, Meet :  39819
Hangouts :  36404
LINE PLAY - Your Avatar World :  34677
WeChat :  34584
Badoo - Meet New People, Chat, Socialize. :  34428
Followers + for Instagram - Follower Analytics :  28633
GroupMe :  28260
Marco Polo Video Walki

It would not make practical sense to break into these giant-dominated genres, so running the above code for different categories and removing those with one or two apps that make up the majority of the market share will reveal more niche genres that have potential for entry.

As it turns out, genres such as Shopping have plenty of room for growth, with many apps from outside the corporate giants gaining tens of thousands of user ratings. A similar story can be found in the Finance, Book, and Weather genres (among others).

Given that the App Store tends to be dominated by fun apps, developing a shopping app with some element of gamification, such as an online shopping challenge or a social platform where users share their recent buys, makes sense: the genre is big, it has proven to allow new entries to reach large numbers of users, and it fits the general theme of the apps that are prevalent in the App Store.

In [99]:
for row in apple_final:
    if row[11] == 'Shopping':
        print(row[1], ': ', row[5])

Groupon - Deals, Coupons & Discount Shopping App :  417779
eBay: Best App to Buy, Sell, Save! Online Shopping :  262241
Wish - Shopping Made Fun :  141960
shopkick - Shopping Rewards & Discounts :  130823
Amazon App: shop, scan, compare, and read reviews :  126312
Target :  108131
Zappos: shop shoes & clothes, fast free shipping :  103655
Walgreens – Pharmacy, Photo, Coupons and Shopping :  88885
Best Buy :  80424
Walmart: Free 2-Day Shipping,* Easy Store Shopping :  70286
OfferUp - Buy. Sell. Simple. :  57348
Apple Store :  55171
Shop Savvy Barcode Scanner - Price Compare & Deals :  54630
Ibotta: Cash Back App, Grocery Coupons & Shopping :  44313
letgo: Buy & Sell Second Hand Stuff :  38424
CVS Pharmacy :  35981
Victoria’s Secret – The Sexiest Bras & Lingerie :  34507
Etsy: Shop Handmade, Vintage & Creative Goods :  30434
Gilt :  26353
Mercari: Shopping Marketplace to Buy & Sell Stuff :  24244
Shopular Coupons, Weekly Deals for Target, Walmart :  22729
RetailMeNot Shopping Deals, Coup

##### Google Play

We can use similar code to determine the most popular apps on the Google Play store. Although the number of installs is given as an open-ended range (e.g. 10,000+), we do not require extremely precise analysis, so we can simply treat each value as its lower bound and convert it to a number (e.g. replacing '100,000+' with 100000).

In [105]:
android_genres = freq_table(android_final, 1)

# we loop the Category column (index number 1)
for genre in android_genres:
    total = 0
    counter = 0
    
    for row in android_final:
        genre_app = row[1]
        if genre_app == genre:
            num_installs = row[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            num_installs = float(num_installs)
            total += num_installs
            counter += 1
            
    average_num_installs = total / counter
    print(genre, ': ', average_num_installs)


ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 
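The install-count conversion used in the loop above can be isolated into a small helper, which makes it easy to check on individual strings. This is a minimal sketch; `parse_installs` is a hypothetical name, and the result should be read as a lower bound on the true install count.

```python
def parse_installs(text):
    """Convert an install string such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


for text in ['10,000+', '100,000+', '100,000,000+', '0']:
    print(repr(text), '->', parse_installs(text))
```

Since every app in the same bracket gets the same value, averages built from these numbers are coarse estimates, not exact figures.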

Just like with the most popular genres of the App Store, the most popular categories in the Google Play store tend to be dominated by a few large corporate giants; it would not make sense to enter these genres and compete with the large, established companies. Looking at a sample of a few more niche categories yields more potential:

In [110]:
for row in android_final:
    if row[1] == 'SHOPPING':
        print(row[0], ': ', row[5])

Amazon for Tablets :  10,000,000+
OfferUp - Buy. Sell. Offer Up :  10,000,000+
Shopee - No. 1 Online Shopping :  10,000,000+
Shopee: No.1 Online Shopping :  10,000,000+
Kroger :  5,000,000+
Walmart :  10,000,000+
eBay: Buy & Sell this Summer - Discover Deals Now! :  100,000,000+
letgo: Buy & Sell Used Stuff, Cars & Real Estate :  50,000,000+
Amazon Shopping :  100,000,000+
Lazada - Online Shopping & Deals :  50,000,000+
OLX - Buy and Sell :  50,000,000+
The wall :  1,000,000+
Flipp - Weekly Shopping :  10,000,000+
Shrimp skin shopping: spend less, buy better :  5,000,000+
Lotte Home Shopping LOTTE Homeshopping :  5,000,000+
Horn, free country requirements :  1,000,000+
Jiji.ng :  1,000,000+
GS SHOP :  10,000,000+
The birth :  50,000,000+
Home & Shopping - Only in apps. 10% off + 10% off :  10,000,000+
EHS Dongsen Shopping :  1,000,000+
bigbasket - online grocery :  5,000,000+
Bukalapak - Buy and Sell Online :  10,000,000+
Life market :  1,000,000+
Jabong Online Shopping App :  10,000,0

After running the above code to take a deeper dive into the various categories on Google Play, a few niche categories stand out. For example, the Shopping category, although it still features a few big companies, includes many smaller apps that manage to gain significant traction. Many of these apps also appear to employ elements that are popular on the Google Play store; for example, the apps "Tophatter - 90 Second Auctions" and "Shopfully - Weekly Ads & Deals" both have over ten million downloads and seem to take advantage of gamification by making online shopping more fun, feeding into the Google Play store's second most common category, games. As such, the recommendation made for the App Store also applies here: the shopping category has room for niche apps without being saturated by giants, and building on the emerging theme of gamification found in some of its apps gives a new app the potential to gain a lot of traction.

## Conclusion

This project analyzed what the most popular genres and categories of apps are in both the App Store and Google Play store to aid any companies aiming to develop ad-based apps in the free and English markets. The analysis examined the general sentiment of what types of apps are common in each market, finding that both stores had high prevalences of fun and entertaining apps (the App Store especially), before taking a deeper look at rating counts and installation numbers for specific genres.

Although plenty of categories such as books, finance, and weather appear to be suitable for entry, the shopping category particularly stood out as one where emerging developers are taking advantage of the preference for gamification in apps. In both markets, the shopping category is not dominated by large corporate giants but instead features plenty of successful apps from smaller developers. To make a new app stand out in this category, I encourage companies to consider implementing elements of gamification, such as an online shopping challenge or a way for users to share their newly bought items with friends. This data-backed conclusion should help developers build profitable apps for their target audience.