# Analyzing App Profitability on the Google Play and App Store Markets

The goal of this project is to analyze data from two different markets—the Google Play store and the App Store from Apple—to determine what makes an app popular and, thus, profitable. The data will then be used to arrive at a recommendation for developers.

Many companies and developers make free apps for the English market, relying on user downloads and ad revenue for money. Such companies will find this data valuable in determining the most profitable categories and genres in which to develop new apps.

After considering various factors such as general app market sentiments, the most popular and highly rated genres, and individual app trends within each category, I arrive at the recommendation that an innovative, gamified take on the **Shopping** genre is most likely to bring a positive return on investment for developers and companies alike.

This project can be broken down into three parts: acquiring the relevant data, cleaning the data, and analyzing the data.


## Data Acquisition

Working with the full set of millions of apps available on both markets is impractical, so a sample dataset for each market is used instead.

The Google Play database was sourced from [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

The App Store database was sourced from [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

I begin by opening both of the relevant files and reading each into a list, with the header row stored separately from the data rows.

In [1]:
from csv import reader

def load_csv(path):
    # reads a CSV file and returns the header row and the data rows separately;
    # the with-statement closes the file once it has been read
    with open(path, encoding='utf8', newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

android_header, android = load_csv('googleplaystore.csv')
apple_header, apple = load_csv('AppleStore.csv')

## Data Cleaning
#### Deleting inaccurate data, duplicate apps, non-English apps, and paid apps

The databases must first be cleaned to provide only relevant apps for analysis. As the target of this project is companies that are aiming to develop free apps for the English market, the data will be cleaned accordingly for the sample to be representative.


First, a quick look at the discussion page of the Google Play database reveals an immediate data error: row 10472 lacks a category entry, which shifts the following columns. This can be fixed by deleting this row.

In [2]:
del android[10472] # run code only once to delete row with wrong data
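
As a cross-check that does not rely on the discussion page, the sketch below flags any row whose column count differs from the header's. It assumes the missing field makes the row shorter than the header; the sample rows and the `find_malformed_rows` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical miniature dataset; the real rows come from googleplaystore.csv.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Entry', '1.9'],  # category missing, so the row is one column short
    ['Coloring Book', 'ART_AND_DESIGN', '3.9'],
]

def find_malformed_rows(dataset, header_row):
    """Return the indexes of rows whose column count differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header_row)]

bad_indexes = find_malformed_rows(rows, header)
print(bad_indexes)

# Delete from the end so that earlier indexes stay valid during removal.
for i in reversed(bad_indexes):
    del rows[i]
print(len(rows))
```

Unlike a hard-coded `del`, a check of this kind can be rerun safely, because it finds nothing to delete the second time.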

#### Removing Duplicate Apps

To continue with data cleaning, there are duplicate entries in the Google Play database that must be removed (the Apple App Store database does not have duplicate rows). We can confirm Android duplicates using the code below:

In [3]:
duplicate_apps = []
unique_apps = []

# sorts duplicate and unique apps based on name into separate lists
for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('There are ', len(duplicate_apps), ' duplicate apps.')
print('There are ', len(unique_apps), ' unique apps.')
print('\n')
print('Duplicate app examples: ', duplicate_apps[:5])

There are  1181  duplicate apps.
There are  9659  unique apps.


Duplicate app examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Examining the Google Play database, I notice that each duplicate app differs on the "Reviews" column (index number 3), defined as how many user reviews each app has. These differences can likely be attributed to reviews data from these apps being collected at different points in time. As such, it would be most accurate to keep the entry with the most recent number of reviews (the highest number) and delete the rest of the duplicates.

I can do this by adding all the apps, keeping only the duplicates with the highest number of reviews, to a dictionary called `reviews_max`, with the app names as keys and the number of user reviews as values. I can then sort through the database to add each data row to a list called `android_clean` only if the row's app name and number of reviews match those in the `reviews_max` dictionary.


In [4]:
reviews_max = {}

# adding each app's name and highest number of reviews to a dictionary
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max)) 
# as expected, there are now 9659 entries in the dictionary, corresponding to the number of unique apps

android_clean = []
already_added = []

# compares each row's review count to the dictionary value and adds the matching row to a new list once per app
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))
# as expected, there are still 9659 rows in this list without duplicates

9659
9659
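
The two-step approach above checks `name not in already_added` against a list, which gets slower as the list grows. As a possible alternative, the sketch below keeps the best row per app in a single pass by storing whole rows in a dictionary. The `keep_most_reviewed` helper and the sample rows are hypothetical, and the review counts are made up for illustration.

```python
# Hypothetical sample rows: [name, category, rating, reviews]
sample = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Notes', 'PRODUCTIVITY', '4.0', '1200'],
    ['Box', 'BUSINESS', '4.2', '159904'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row[0], ':', row[3])
```

Because dictionaries preserve insertion order, the result lists apps in the order they first appear in the data.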


#### Removing Non-English Apps

The next step of data cleaning is removing apps with names that suggest they are for a non-English audience. This can be done by iterating over each character in an app's name and checking whether it belongs to the standard English character set. Each character corresponds to a number and, according to the [American Standard Code for Information Interchange](https://en.wikipedia.org/wiki/ASCII), the numbers 0 through 127 cover the characters commonly used in English text. If an app name contains characters with numbers outside this range, the app is likely a non-English app.

We begin by defining a function that flags names with more than one non-ASCII character for removal. Allowing a single such character accounts for English apps whose names contain a special character, such as an em dash, an emoji, or a trademark symbol. This filter is not perfect, but it is still effective in sorting out the majority of non-English apps.

In [5]:
# function to check if a string contains more than 1 non-English character
def is_english(input_string):
    non_english_count = 0
    
    for character in input_string:
        if ord(character) > 127: # compares the integer value of each character to the English standard range
            non_english_count += 1
            
    if non_english_count > 1:
        return False
        
    return True

android_clean_english = []
apple_clean_english = []

# adding only English apps to a new Android list
for row in android_clean:
    name = row[0]
    if is_english(name):
        android_clean_english.append(row)
     
# adding only English apps to a new Apple list
for row in apple:
    name = row[1]
    if is_english(name):
        apple_clean_english.append(row)

print(len(android_clean_english)) # 9523 Android apps now
print(len(apple_clean_english)) # 6100 Apple apps now

9523
6100


#### Removing Paid Apps

Lastly, we target only the free apps in the databases. We can isolate free apps by examining the pricing column in each dataset and keeping only the rows whose prices are 0.

In [6]:
android_final = []
apple_final = []

for row in android_clean_english:
    price = row[7]
    if price == '0':
        android_final.append(row)
        
for row in apple_clean_english:
    price = row[4]
    if price == '0.0':
        apple_final.append(row)
        
print(len(android_final)) # 8781 apps in final list for analysis
print(len(apple_final)) # 3169 apps in final list for analysis
    

8781
3169
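
The comparison above depends on the exact strings `'0'` and `'0.0'`, so it would miss a free app whose price is written differently. A more defensive option is to convert the price to a number first. The sketch below assumes that paid Google Play prices carry a leading dollar sign; `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(price_text):
    """Convert a price string such as '0', '0.0', or '$4.99' to a float."""
    return float(price_text.replace('$', '').strip())

def is_free(price_text):
    """Return True when the parsed price equals zero."""
    return parse_price(price_text) == 0.0

# Illustrative price strings in the formats the two price columns appear to use.
for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), '->', is_free(price))
```

With a helper like this, the same test could be applied to both datasets instead of one string comparison per market.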


## Data Analysis

Developers may wish to look for a genre or category in both the Google Play and App Store that sees high demand for free English apps and that is not dominated by major companies. Such a genre should be consistent across both markets to minimize upfront costs of developing and optimizing two different apps to make a profit.

However, any decisions to develop an app must be data-backed. As such, I analyzed the cleaned data to determine the most common and most popular apps in each market as a starting point before diving deeper into what specific genres have potential for the target developers to break into.

#### Most Common Apps

To begin analyzing data to determine which types of apps are most profitable, I can first build a frequency table based on the most common app genres in both markets. The relevant Google Play column is `Category` (index number 1), while the relevant App Store column is `prime_genre` (index number 11). After building a frequency table by genre for each market, the table can then be sorted by descending percentages to show which genres are most prevalent in their respective market.


In [7]:
# function to build a frequency table based on genre 
def freq_table(dataset, index):
    table = {}
    table_percent = {}
    counter = 0
    
    for row in dataset:   
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
        counter += 1
    
    # converts frequency table to show percentage of total market
    for key in table:
        percent = table[key] * 100 / counter
        table_percent[key] = percent
        
    return table_percent

# helper function from Dataquest to sort the above frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0], 4))
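
For comparison, the standard library's `collections.Counter` can build and sort the same kind of table in a few lines. This is a sketch of an alternative, not the approach used in the rest of the notebook; `freq_table_percent` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count * 100 / total) for value, count in counts.most_common()]

# Hypothetical rows: [name, genre]
sample = [
    ['A', 'Games'], ['B', 'Games'], ['C', 'Games'],
    ['D', 'Shopping'],
]
for genre, percent in freq_table_percent(sample, 1):
    print(genre, ':', round(percent, 4))
```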

With the frequency table by percentage function created, the data can now be analyzed by generating tables for the most common genres in each market. We begin with free English apps on the App Store.

In [8]:
display_table(apple_final, 11)

Games : 58.5358
Entertainment : 7.8258
Photo & Video : 5.0489
Education : 3.7236
Social Networking : 3.2818
Shopping : 2.5245
Utilities : 2.3982
Sports : 2.1773
Music : 2.0511
Health & Fitness : 1.988
Productivity : 1.704
Lifestyle : 1.5462
News : 1.3253
Travel : 1.136
Finance : 1.1044
Weather : 0.852
Food & Drink : 0.8204
Reference : 0.5364
Business : 0.5364
Book : 0.3787
Navigation : 0.1893
Medical : 0.1893
Catalogs : 0.1262


The most common genre of free English apps on the App Store is, by far, games, which account for 58.5% of the apps. This is followed by the entertainment category at a distant second with 7.8% of the market. The general impression is that fun apps, rather than practical and productive ones, take up the vast majority of the market share, as some of the most common genres are games, social networking, shopping, and other entertainment genres.

We continue our analysis with Google Play apps.

In [9]:
display_table(android_final, 1)

FAMILY : 18.95
GAME : 9.6572
TOOLS : 8.4615
BUSINESS : 4.635
PRODUCTIVITY : 3.9289
LIFESTYLE : 3.9062
FINANCE : 3.7126
MEDICAL : 3.5417
SPORTS : 3.3254
PERSONALIZATION : 3.3026
COMMUNICATION : 3.257
HEALTH_AND_FITNESS : 3.0976
PHOTOGRAPHY : 2.9723
NEWS_AND_MAGAZINES : 2.8015
SOCIAL : 2.6876
TRAVEL_AND_LOCAL : 2.3346
SHOPPING : 2.2549
BOOKS_AND_REFERENCE : 2.1524
DATING : 1.8563
VIDEO_PLAYERS : 1.7993
MAPS_AND_NAVIGATION : 1.378
FOOD_AND_DRINK : 1.2299
EDUCATION : 1.173
ENTERTAINMENT : 0.9566
LIBRARIES_AND_DEMO : 0.9338
AUTO_AND_VEHICLES : 0.9224
HOUSE_AND_HOME : 0.7972
WEATHER : 0.7858
EVENTS : 0.7175
ART_AND_DESIGN : 0.6491
PARENTING : 0.6377
BEAUTY : 0.6036
COMICS : 0.5808


There appears to be a noticeable difference between the most frequent free English apps on the App Store vs the Google Play store; the latter is much less dominated by Games, which rank as the second most common at 9.7% behind Family at 19.0% of the market. In general, the Google Play store's most common apps tend to be much more focused on productivity and practicality, with fewer entertainment apps than the App Store. Nonetheless, fun categories like Games, Sports, Social, and Shopping are still well represented in the market.

#### Most Popular Apps

Although the previous analysis showed which free English apps are most common in each market, it does not necessarily reflect which apps are most popular, that is, which have the most users or the highest ratings. To find out, I can calculate each genre's average number of app installations (for the Google Play store) or average number of user ratings (for the App Store database, which is missing installation data). This can be done with a nested for loop that sums the installations/ratings of all the apps in a genre and then divides the sum by the number of apps in that genre to find the average. I begin with the App Store.

##### App Store

In [10]:
apple_genres = freq_table(apple_final, 11)

# first loop iterates over all genres
for genre in apple_genres:
    total = 0 # will be used to sum total user ratings in genre
    counter = 0 # will be used to count how many apps in genre
    
    # nested loop iterates over all individual apps and their data
    for row in apple_final:
        genre_app = row[11]
        if genre_app == genre:
            num_ratings = float(row[5]) # index 5 is the total user ratings count column
            total += num_ratings
            counter += 1

    average_num_ratings = round(total / counter, 4)
    print(genre, ': ', average_num_ratings)
    

Social Networking :  72916.5481
Photo & Video :  28441.5438
Games :  22985.2113
Music :  58205.0308
Reference :  79350.4706
Health & Fitness :  24037.6349
Weather :  54215.2963
Utilities :  19900.4737
Travel :  31358.5
Shopping :  27816.2
News :  21750.0714
Navigation :  86090.3333
Lifestyle :  16739.3469
Entertainment :  14364.7742
Food & Drink :  33333.9231
Sports :  23008.8986
Book :  46384.9167
Finance :  32367.0286
Education :  7003.9831
Productivity :  21799.1481
Business :  7491.1176
Catalogs :  4004.0
Medical :  612.0
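
The nested loop above rescans the whole dataset once per genre. As a possible alternative, the sketch below accumulates totals and counts in a single pass and then sorts the averages, which makes the ranking easier to read. `average_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average a numeric column per group in one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: [name, genre, rating_count]
sample = [
    ['A', 'Navigation', '300'],
    ['B', 'Navigation', '100'],
    ['C', 'Weather', '50'],
]
averages = average_by_group(sample, 1, 2)
# Sort by average, highest first.
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(average, 4))
```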


Although some genres may seem extremely popular, many are heavily skewed by one or two apps; for example, the Navigation category, which has the highest average number of user ratings, is skewed by the two apps Waze and Google Maps. This was determined by printing the app names and number of user ratings within each genre, as in the following code:

In [11]:
print('App Name : Number of User Ratings')
for row in apple_final:
    if row[11] == 'Navigation':
        print(row[1], ': ', row[5])

App Name : Number of User Ratings
Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5
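
One way to put a number on this skew is to compare the mean with the median and to measure how much of the total the largest apps contribute. The sketch below uses the six Navigation rating counts printed above; the `top_n_share` helper is a hypothetical addition.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above.
navigation_counts = [345046, 154911, 12811, 3582, 187, 5]

def top_n_share(values, n=2):
    """Percentage of the total contributed by the n largest values."""
    ordered = sorted(values, reverse=True)
    return sum(ordered[:n]) * 100 / sum(ordered)

print('Mean:', round(mean(navigation_counts), 4))
print('Median:', median(navigation_counts))
print('Top-2 share (%):', round(top_n_share(navigation_counts), 2))
```

The median sits far below the mean, and the top two apps account for nearly all of the ratings, which is the pattern the text describes for giant-dominated genres.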


A similar trend follows with other categories such as Social Networking, which is dominated by big social media companies:

In [12]:
print('App Name : Number of User Ratings')
for row in apple_final:
    if row[11] == 'Social Networking':
        print(row[1], ': ', row[5])

App Name : Number of User Ratings
Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373519
Messenger :  351466
Tumblr :  334293
WhatsApp Messenger :  287589
Kik :  260965
ooVoo – Free Video Call, Text and Voice :  177501
TextNow - Unlimited Text + Calls :  164963
Viber Messenger – Text & Call :  164249
Followers - Social Analytics For Instagram :  112778
MeetMe - Chat and Meet New People :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos :  90414
InsTrack for Instagram - Analytics Plus More :  85535
Tango - Free Video Call, Voice and Chat :  75412
LinkedIn :  71856
Match™ - #1 Dating App. :  60659
Skype for iPad :  60163
POF - Best Dating App for Conversations :  52642
Timehop :  49510
Find My Family, Friends & iPhone - Life360 Locator :  43877
Whisper - Share, Express, Meet :  39819
Hangouts :  36404
LINE PLAY - Your Avatar World :  34677
WeChat :  34584
Badoo - Meet New People, Chat, Socialize. :  34428
Followers + for Instagram - Follower Analytics :  28633
Group

It would not make practical sense to break into these giant-dominated genres, so running the above code for different categories and removing those with one or two apps that make up the majority of the market share will reveal more niche genres that have potential for entry.

To supplement my search for the most popular genres that new developers could feasibly break into, I determine the average user rating of apps—representing how well-liked each app is—in each genre to help develop a stronger recommendation for which genre would be profitable to enter.

In [13]:
apple_genres_ratings = freq_table(apple_final, 11)
genres_ratings_list = [] # list to store all genres and their average ratings

# iterate over all App Store genres
for genre in apple_genres_ratings:
    total = 0 # sums up the ratings for each genre
    counter = 0 # counts how many apps in each genre
    
    # sum up ratings and how many apps in each genre
    for row in apple_final:
        genre_app = row[11]
        if genre_app == genre:
            rating = float(row[8])
            total += rating
            counter += 1

    average_ratings = round(total / counter, 4)
    print(genre, ': ', average_ratings)
    
    genres_ratings_list.append([genre, average_ratings])

total_genre_averages = 0 # sums the average ratings for *all* genres
counter_genres = 0 # counts how many genres there are

for genre in genres_ratings_list:
    rating = genre[1]
    total_genre_averages += rating
    counter_genres += 1

average_genre_ratings = round(total_genre_averages / counter_genres, 4) 
    
print('\n')
print('The average rating of App Store genres is: ' + str(average_genre_ratings))
    

Social Networking :  3.0433
Photo & Video :  3.3844
Games :  3.9329
Music :  3.9154
Reference :  4.0882
Health & Fitness :  3.7381
Weather :  2.9444
Utilities :  3.1645
Travel :  2.9861
Shopping :  3.5312
News :  2.6071
Navigation :  2.25
Lifestyle :  2.949
Entertainment :  3.3226
Food & Drink :  3.25
Sports :  2.6812
Book :  3.6667
Finance :  2.9286
Education :  3.1102
Productivity :  4.0833
Business :  3.0588
Catalogs :  4.0
Medical :  3.25


The average rating of App Store genres is: 3.2994


Many high-rated genres can be ruled out: Games is too saturated, Music is too dominated by a few giants, and Productivity does not fit the prevalent 'fun' sentiment of the App Store. Upon removing these genres from consideration, the remaining list will include potentially profitable genres.

After further synthesizing the data from both the average number of user ratings and average ratings of each genre, I found that genres such as **Shopping** have plenty of room for growth, with many non-corporate-giant apps able to gain tens of thousands of user ratings. A similar story can be found with the Finance, Book, and Weather genres (among others); however, Shopping remains one of the few genres whose average rating (3.5312) is above the average across all genres (3.2994), indicating that users tend to be more satisfied with apps made in this category.
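
The rating comparison can be sketched as a simple filter. The values below are copied from the average-rating output above for the four genres named in this paragraph; the filtering step itself is only an illustration of the reasoning, not a cell from the original analysis.

```python
# Average ratings copied from the output above for the genres named in the text.
genre_ratings = {'Shopping': 3.5312, 'Finance': 2.9286, 'Book': 3.6667, 'Weather': 2.9444}
overall_average = 3.2994  # average across all App Store genres, from the output above

# Keep only the genres rated above the overall average.
above_average = {
    genre: rating for genre, rating in genre_ratings.items() if rating > overall_average
}
for genre, rating in sorted(above_average.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', rating)
```

Of these four candidates, only Shopping and Book clear the overall average.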

Given that the App Store is dominated by fun apps, building some element of gamification into a shopping app, such as an online shopping challenge or a social platform for users to share their recent buys, makes sense: the genre is big, has proven open to new entries that attract high numbers of user ratings, and fits the general theme of what apps are prevalent in the App Store.

In [14]:
print('App Name : Number of User Ratings')
for row in apple_final:
    if row[11] == 'Shopping':
        print(row[1], ': ', row[5])

App Name : Number of User Ratings
Groupon - Deals, Coupons & Discount Shopping App :  417779
eBay: Best App to Buy, Sell, Save! Online Shopping :  262241
Wish - Shopping Made Fun :  141960
shopkick - Shopping Rewards & Discounts :  130823
Amazon App: shop, scan, compare, and read reviews :  126312
Target :  108131
Zappos: shop shoes & clothes, fast free shipping :  103655
Walgreens – Pharmacy, Photo, Coupons and Shopping :  88885
Best Buy :  80424
Walmart: Free 2-Day Shipping,* Easy Store Shopping :  70286
OfferUp - Buy. Sell. Simple. :  57348
Apple Store :  55171
Shop Savvy Barcode Scanner - Price Compare & Deals :  54630
Ibotta: Cash Back App, Grocery Coupons & Shopping :  44313
letgo: Buy & Sell Second Hand Stuff :  38424
CVS Pharmacy :  35981
Etsy: Shop Handmade, Vintage & Creative Goods :  30434
Gilt :  26353
Mercari: Shopping Marketplace to Buy & Sell Stuff :  24244
Shopular Coupons, Weekly Deals for Target, Walmart :  22729
RetailMeNot Shopping Deals, Coupons, Savings :  18544
P

##### Google Play

We can use similar code to determine the most popular apps of the Google Play store. Although the number of installs is represented as a range (e.g., 10,000+), we do not require extremely precise analysis, so we can simply treat the lower bound of each range as the install count by stripping the plus sign and commas (e.g., converting '100,000+' to 100000).

In [15]:
android_genres = freq_table(android_final, 1)

# loop over each category found in the Category column (index number 1)
for genre in android_genres:
    total = 0 # will be used to sum total installations in genre
    counter = 0 # will be used to count how many apps in genre
    
    # loop through individual apps and their data
    for row in android_final:
        genre_app = row[1]
        if genre_app == genre:
            # converting installations to readable floats
            num_installs = row[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            num_installs = float(num_installs)
            total += num_installs
            counter += 1
            
    average_num_installs = round(total / counter, 4)
    print(genre, ': ', average_num_installs)


ART_AND_DESIGN :  1986335.0877
AUTO_AND_VEHICLES :  654074.8272
BEAUTY :  513151.8868
BOOKS_AND_REFERENCE :  8814199.7884
BUSINESS :  1712290.1474
COMICS :  859042.1569
COMMUNICATION :  38590581.0874
DATING :  861409.5521
EDUCATION :  1833495.1456
ENTERTAINMENT :  11767380.9524
EVENTS :  253542.2222
FINANCE :  1365500.4049
FOOD_AND_DRINK :  1951283.8056
HEALTH_AND_FITNESS :  4204220.2279
HOUSE_AND_HOME :  1380033.7286
LIBRARIES_AND_DEMO :  645070.8537
LIFESTYLE :  1447458.9767
GAME :  15593824.6934
FAMILY :  3717297.5841
MEDICAL :  121161.8778
SOCIAL :  23253652.1271
SHOPPING :  7072366.5909
PHOTOGRAPHY :  17840110.4023
SPORTS :  3750580.6438
TRAVEL_AND_LOCAL :  14120454.078
TOOLS :  10902378.8345
PERSONALIZATION :  5273184.0966
PRODUCTIVITY :  16787331.3449
PARENTING :  552875.1786
WEATHER :  5212877.1014
VIDEO_PLAYERS :  24878048.8608
NEWS_AND_MAGAZINES :  9626407.3577
MAPS_AND_NAVIGATION :  4115374.2149
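
The install-count conversion inside the loop can also be pulled out into a small function, which makes it easy to check on its own. `parse_installs` is a hypothetical helper that mirrors the two `replace` calls above.

```python
def parse_installs(installs_text):
    """Convert an installs label such as '10,000+' to the float 10000.0."""
    return float(installs_text.replace('+', '').replace(',', ''))

for label in ['10,000+', '100,000+', '1,000,000+']:
    print(label, '->', parse_installs(label))
```

Because each label is a lower bound, averages built from these values will tend to understate the true install counts.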


Just like with the most popular genres of the App Store, the most popular categories in the Google Play store tend to be dominated by a few large corporate giants; it would not make sense to try to break into these categories and compete with the large, established companies. Sorting through the categories like I did for the App Store and examining a few more niche categories yields more potential:

In [16]:
print('App Name : Number of Downloads')
for row in android_final:
    if row[1] == 'SHOPPING':
        print(row[0], ': ', row[5])
        

App Name : Number of Downloads
Amazon for Tablets :  10,000,000+
OfferUp - Buy. Sell. Offer Up :  10,000,000+
Shopee - No. 1 Online Shopping :  10,000,000+
Shopee: No.1 Online Shopping :  10,000,000+
Kroger :  5,000,000+
Walmart :  10,000,000+
eBay: Buy & Sell this Summer - Discover Deals Now! :  100,000,000+
letgo: Buy & Sell Used Stuff, Cars & Real Estate :  50,000,000+
Amazon Shopping :  100,000,000+
Lazada - Online Shopping & Deals :  50,000,000+
OLX - Buy and Sell :  50,000,000+
The wall :  1,000,000+
Flipp - Weekly Shopping :  10,000,000+
Shrimp skin shopping: spend less, buy better :  5,000,000+
Lotte Home Shopping LOTTE Homeshopping :  5,000,000+
Horn, free country requirements :  1,000,000+
Jiji.ng :  1,000,000+
GS SHOP :  10,000,000+
The birth :  50,000,000+
Home & Shopping - Only in apps. 10% off + 10% off :  10,000,000+
EHS Dongsen Shopping :  1,000,000+
bigbasket - online grocery :  5,000,000+
Bukalapak - Buy and Sell Online :  10,000,000+
Life market :  1,000,000+
Jabong 

After running the above code to take a deeper dive into the various categories on Google Play, a few niche categories stand out. For example, the Shopping category, although still featuring a few big companies, is composed of many more smaller apps that still manage to gain significant traction. Many of these apps appear to also employ elements popular to the Google Play store; for example, the apps "Tophatter - 90 Second Auctions" and "Shopfully - Weekly Ads & Deals" both have over ten million downloads and take advantage of gamification by making online shopping more fun, feeding into the Google Play store's second-most common category of gaming.

As such, I make a similar recommendation for the Google Play store as made for the App Store: the Shopping category has niche apps without being overly saturated by giants, and building on the emerging theme of gamification found in some apps in this category has potential for a new app to gain a lot of traction and generate a lot of installations, thus leading to high ad revenue and profitability.

## Conclusion

This project analyzed which genres and categories of apps are most popular in both the App Store and Google Play store to aid any companies aiming to develop ad revenue-based apps in the free English markets. The analysis examined which types of apps are common in each market, finding that both stores have a high prevalence of fun and entertaining apps (the App Store more so), before taking a deeper look at rating counts, ratings, and installation numbers for specific genres.

With all the above data, I arrive at a final recommendation for a developer or company to break into the **Shopping category**, which I believe holds the most potential. Although plenty of categories such as Finance, Books, and Weather appear suitable for entry, Shopping particularly stood out as one where emerging developers are taking advantage of the preference for gamification in apps. In both markets, Shopping is not dominated by a few large corporate giants but instead features plenty of successful apps from smaller developers. Moreover, the category does not appear overly saturated in either market, and on the App Store the average rating of the Shopping genre is above the average for all genres, raising the possibility of additional revenue sources with in-app purchases or subscriptions should customers be satisfied enough.

To make a new app stand out in this category, I encourage companies to consider implementing elements of gamification, such as an online shopping challenge or a fun way for users to share newly bought items with friends. This is supported by the fact that both markets, especially the App Store, show a high prevalence of and demand for fun and entertaining apps. Overall, this data-backed recommendation should help developers target a profitable category for their audience.