# Defining the most profitable apps on Google Play and the App Store through Data Analysis

**by Gerard Tieng**


Today, we will assume the role of a data analyst whose task is to sift through metadata from the App Store and Google Play to identify the app categories that would be most profitable to develop. This notebook covers the following skills and tasks in the analysis process:

 - Opening, reading, and saving data from CSV files (CSV library)
 - General inspection of data structure
 - Removing invalid data entries
 - Identifying duplicate entries (frequency tables)
 - Data filtering by category and non-English apps
 - Percentage calculations of categories from large datasets

## Importing the Data

Data for this project is provided by two public datasets from Kaggle. [This dataset](https://www.kaggle.com/lava18/google-play-store-apps) contains data scraped from roughly 10,000 apps on the Google Play store, while [this dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) covers roughly 7,000 apps from Apple's App Store.

Let's begin by loading the CSVs for our two datasets: AppleStore.csv and googleplaystore.csv. We'll use Python's csv module to read each file into a list of rows, where each row is itself a list of strings.

In [1]:
import csv

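# Open each file in a `with` block so it is closed once it has been read.
# encoding='utf8' is an assumption about how the CSV files were saved.
with open('AppleStore.csv', encoding='utf8', newline='') as apple_file:
    apple_data = list(csv.reader(apple_file))

with open('googleplaystore.csv', encoding='utf8', newline='') as google_file:
    google_data = list(csv.reader(google_file))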

Here's a simple function we created to print the number of records in the dataset, as well as the column names from the CSV header.

In [2]:
def explore_data(dataset):
    print("The length of this dataset is " + str(len(dataset)) + ", with the following column names:")
    for columns in dataset[0]:
        print(columns)

In [3]:
explore_data(apple_data)

The length of this dataset is 7198, with the following column names:
id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


In [4]:
explore_data(google_data)

The length of this dataset is 10842, with the following column names:
App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver


## Cleaning the Data

### Part One: Eliminating Errors

According to [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), one row in the Google Play set, at index 10473, has a missing value. Printing it shows only 12 values against the 13 column names, with the rating '1.9' sitting where the category should be. Let's delete that row.

In [5]:
google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [6]:
del google_data[10473]
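
The row above was located through a forum discussion. As a minimal sketch, rows like it could also be found programmatically by comparing each row's length with the header's. The helper name `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the last row is missing one value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Frame', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

On the real data, the same call would be `find_malformed_rows(google_data)`, run before any rows are deleted.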

### Part Two: Handling Duplicates

On further inspection of the Google Play dataset, we also find many duplicate entries. Instagram, for example, is listed four times. The following function loops through the dataset and prints every row for an app named "Instagram".

In [7]:
def insta_dupes():
    # Print every Google Play row whose app name is exactly 'Instagram'
    for row in google_data[1:]:
        if row[0] == 'Instagram':
            print(row)

insta_dupes()

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let's see how far this goes throughout the whole dataset. The following code will loop through the Google Play dataset to add the first instance of an app name to a list. If any further instances appear after being checked against the unique list, that app name will be identified as a duplicate and added to a separate list.

In [8]:
duplicate_apps = []
unique_apps = []

for i in google_data[1:]:
    if i[0] in unique_apps:
        duplicate_apps.append(i[0])
    else:
        unique_apps.append(i[0])

print("There are a total of " + str(len(duplicate_apps)) + " duplicate apps.")
print("")
print("Examples of duplicate apps:", duplicate_apps[:10])      


There are a total of 1181 duplicate apps.

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
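
Checking membership in a list gets slower as the list grows. As an alternative sketch, `collections.Counter` from the standard library can tally every app name in one pass. The helper `count_duplicates` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each app name that appears more than once to its number of rows."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample rows holding only the app name.
sample_rows = [['Box'], ['Slack'], ['Box'], ['Instagram'], ['Box']]

dupes = count_duplicates(sample_rows)
# Number of rows beyond the first occurrence, the same quantity
# that len(duplicate_apps) measures in the loop above.
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```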


For duplicates, we will keep only the entry with the highest number of reviews. The following code loops through the Google Play dataset and builds a dictionary that maps each unique app name to its review count. When the loop comes across a duplicate, it updates the stored value if the new review count is higher.

In [9]:
reviews_max = {}

for i in google_data[1:]:
    name = i[0]
    n_reviews = float(i[3])
    
    # Record the count for a new name, or raise it when a duplicate has more reviews
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

The total number of unique apps is shown below:

In [10]:
len(reviews_max)

9659

Time to compile. The following code will produce a clean dataset with no duplicate entries. If the review count for a row matches the highest review count recorded in our dictionary, the row is added to the list. The already_added check ensures that only one row is kept when several duplicates share the same maximum review count.

In [11]:
android_clean = []
already_added = []

for i in google_data[1:]:
    name = i[0]
    n_reviews = float(i[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(i)
        already_added.append(name)

The clean dataset is a verified match to the maximum reviews dataset!

In [12]:
len(android_clean)

9659
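
The two passes above can also be folded into one. The following is a sketch under the assumption that the app name is at index 0 and the review count at index 3, as in the Google Play rows shown earlier. The function name `keep_most_reviewed` and the sample rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduped = keep_most_reviewed(sample)
print(len(deduped), deduped[0][3])
```

The result holds the same set of rows as the two-pass approach, though the row order can differ because the dictionary preserves the order in which names were first seen.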

### Part Three: Filtering Out Non-English & Paid Apps

Now that our dataset has no duplicates, we'll clean it further by filtering out non-English apps. Python's built-in ord() function returns the Unicode code point of a character. The characters used in standard English text (the ASCII set) all have code points from 0 to 127.

In [13]:
def english_check(word):
    for i in word:
        if ord(i) > 127:
            return False
    return True

In [14]:
english_check("instagram")

True

For the purposes of this project, we will flag any app name that includes more than 3 non-ASCII characters as a non-English app, so that names with an occasional symbol or emoji are still kept. Here is a revised function that counts the characters in a string that fail our ord() test.

In [15]:
def english_check(string):
    symbol_counter = 0
    for character in string:
        if ord(character) > 127:
            symbol_counter += 1
    
    if symbol_counter > 3:
        return False
    
    return True

In [16]:
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
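
To see why the three names above come out as they do, it helps to count the non-ASCII characters in each. This is a small illustrative sketch, and `count_non_ascii` is a hypothetical helper rather than part of the original notebook.

```python
def count_non_ascii(string):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in string if ord(character) > 127)


samples = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

# Only the third name exceeds the threshold of 3.
counts = [count_non_ascii(text) for text in samples]
print(counts)

# Code points of an ASCII letter and of the trademark sign.
print(ord('a'), ord('™'))
```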


Now, let's use our english_check() function to filter our App Store and Google Play datasets. The following code loops through each dataset and adds every record that passes the check to a list of English-titled apps.

In [17]:
apple_en = []
google_en = []

for i in apple_data[1:]:
    name = i[0]
    if english_check(name):
        apple_en.append(i)

for i in google_data[1:]:
    name = i[0]
    if english_check(name):
        google_en.append(i)
        
print("Number of English Apple apps: " + str(len(apple_en)))
print("Number of English Google apps: " + str(len(google_en)))

Number of English Apple apps: 7197
Number of English Google apps: 10795
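
One thing worth checking in the cell above: according to the column names printed earlier, index 0 of an App Store row is `id`, while the app name (`track_name`) sits at index 1, and the Google loop reads `google_data` rather than the deduplicated `android_clean`. A minimal sketch of the filter with a per-dataset name index is shown here on made-up rows. `filter_english` and the sample rows are assumptions for illustration, and running this on the real data would give counts that differ from those printed above.

```python
def english_check(string, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


def filter_english(rows, name_index):
    """Keep rows whose app name (at `name_index`) passes english_check."""
    return [row for row in rows if english_check(row[name_index])]


# Made-up sample rows following the column order printed earlier.
sample_apple = [
    ['1', 'Instagram', '100'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播', '200'],
]
sample_google = [
    ['Docs To Go™ Free Office Suite', 'BUSINESS'],
    ['中国語 AQリスニング', 'FAMILY'],
]

apple_en_sample = filter_english(sample_apple, name_index=1)    # track_name
google_en_sample = filter_english(sample_google, name_index=0)  # App
print(len(apple_en_sample), len(google_en_sample))
```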


We'll use a similar method to keep only the apps in each dataset that cost $0. The outcome will be the final version of each dataset to use for analysis.

In [18]:
apple_free = []
google_free = []

for i in apple_en:
    if i[4] == '0.0':
        apple_free.append(i)
        
for i in google_en:
    if i[7] == '0':
        google_free.append(i)

print("Number of free Apple apps: " + str(len(apple_free)))
print("Number of free Google apps: " + str(len(google_free)))

Number of free Apple apps: 4056
Number of free Google apps: 9999
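
The comparisons above rely on the exact strings '0.0' and '0'. As a hedged alternative, the price can be converted to a number first, so that different spellings of zero are treated alike. The helper `is_free` is hypothetical, and the assumption that paid prices may carry a leading dollar sign (for example '$4.99') is not shown in this notebook.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    return float(price.replace('$', '')) == 0.0


for price in ['0', '0.0', '$4.99']:
    print(price, is_free(price))
```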


## Analyzing the Data

With our final datasets, we'll begin to analyze the data to identify the most common app genres on Google Play and the App Store. To do this, we'll build a frequency table function that counts the number of times each genre appears in a dataset. It also keeps a running total of records so that each genre count can be converted to a percentage of the total.

In [20]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset[1:]:
        genre = row[index]
        total += 1
        
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    
    table_percent = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percent[key] = percentage
        
    return table_percent
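
The same idea can be sketched with `collections.Counter`. Unlike `freq_table` above, which starts at `dataset[1:]`, this version counts every row, which suits lists that no longer include the header row. The function name `freq_table_percent` and the sample rows are illustrative assumptions.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Return each distinct value at `index` as a percentage of all rows."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Made-up sample rows: name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

shares = freq_table_percent(sample, 1)
print(shares)
```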

The code below calls the frequency table function from above, then sorts the genre percentages and displays them in descending order.

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
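
Building (value, key) tuples works because tuples sort by their first element. An equivalent sketch sorts the dictionary items directly with a `key` function. The name `sorted_by_value` and the sample dictionary are made up for illustration.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_shares = {'Music': 25.0, 'Games': 75.0}

ranked = sorted_by_value(sample_shares)
for key, value in ranked:
    print(f'{key} : {value:.2f}')
```

Formatting the value with `:.2f` also keeps the printed percentages to two decimal places.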

The genre percentages from the App Store dataset show that Games dominate the market with a share of about 55.7%. The next three genres are also fun-oriented: Entertainment, Photo & Video, and Social Networking.

In [22]:
display_table(apple_free, 11)

Games : 55.659679408138096
Entertainment : 8.236744759556105
Photo & Video : 4.1183723797780525
Social Networking : 3.501849568434032
Education : 3.255240443896424
Shopping : 2.9839704069050557
Utilities : 2.688039457459926
Lifestyle : 2.318125770653514
Finance : 2.0715166461159065
Sports : 1.9482120838471024
Health & Fitness : 1.8742293464858202
Music : 1.6522811344019728
Book : 1.627620221948212
Productivity : 1.528976572133169
News : 1.4303329223181258
Travel : 1.381011097410604
Food & Drink : 1.060419235511714
Weather : 0.7644882860665845
Reference : 0.4932182490752158
Navigation : 0.4932182490752158
Business : 0.4932182490752158
Catalogs : 0.22194821208384713
Medical : 0.19728729963008632


Family takes the No. 1 spot among Google Play categories, though it consists mostly of games for kids. Still, the Google Play set shows a more even distribution across categories than the App Store.

In [24]:
display_table(google_free, 1) #Category

FAMILY : 17.67353470694139
GAME : 10.592118423684736
TOOLS : 7.641528305661133
BUSINESS : 4.450890178035607
PRODUCTIVITY : 3.9507901580316065
SPORTS : 3.6007201440288057
LIFESTYLE : 3.590718143628726
COMMUNICATION : 3.590718143628726
MEDICAL : 3.540708141628326
FINANCE : 3.4906981396279257
HEALTH_AND_FITNESS : 3.2506501300260053
PHOTOGRAPHY : 3.120624124824965
PERSONALIZATION : 3.080616123224645
SOCIAL : 2.9205841168233646
NEWS_AND_MAGAZINES : 2.7705541108221645
SHOPPING : 2.570514102820564
TRAVEL_AND_LOCAL : 2.4604920984196843
DATING : 2.2704540908181636
BOOKS_AND_REFERENCE : 1.990398079615923
VIDEO_PLAYERS : 1.7003400680136025
EDUCATION : 1.5103020604120825
ENTERTAINMENT : 1.4702940588117623
MAPS_AND_NAVIGATION : 1.300260052010402
FOOD_AND_DRINK : 1.250250050010002
HOUSE_AND_HOME : 0.8801760352070414
LIBRARIES_AND_DEMO : 0.8401680336067214
AUTO_AND_VEHICLES : 0.8201640328065612
WEATHER : 0.7401480296059212
EVENTS : 0.630126025205041
ART_AND_DESIGN : 0.6001200240048009
COMICS : 0.5901

In [23]:
display_table(google_free, -4) # Genre

Tools : 7.631526305261052
Entertainment : 6.00120024004801
Education : 5.1310262052410485
Business : 4.450890178035607
Productivity : 3.9507901580316065
Sports : 3.7407481496299257
Communication : 3.590718143628726
Lifestyle : 3.580716143228646
Medical : 3.540708141628326
Finance : 3.4906981396279257
Action : 3.410682136427286
Health & Fitness : 3.2506501300260053
Photography : 3.120624124824965
Personalization : 3.080616123224645
Social : 2.9205841168233646
News & Magazines : 2.7705541108221645
Shopping : 2.570514102820564
Travel & Local : 2.450490098019604
Dating : 2.2704540908181636
Arcade : 2.000400080016003
Books & Reference : 1.990398079615923
Simulation : 1.8803760752150431
Casual : 1.8403680736147228
Video Players & Editors : 1.6803360672134429
Maps & Navigation : 1.300260052010402
Food & Drink : 1.250250050010002
Puzzle : 1.2102420484096819
Racing : 0.9501900380076015
Strategy : 0.9301860372074415
House & Home : 0.8801760352070414
Role Playing : 0.8701740348069614
Libraries & 

By another metric, we can also rank popular apps by their install counts. In the Google Play dataset, installs are recorded as open-ended ranges such as 1,000+ and 1,000,000,000+. We'll use Python's str.replace() method to strip the commas and plus signs before converting each value to a number and averaging per category.
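
As a small worked example of that cleanup step, the sketch below converts an install string to an integer. `parse_installs` is a hypothetical helper name, and the result is a lower bound, since the plus sign means "at least this many".

```python
def parse_installs(value):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(value.replace(',', '').replace('+', ''))


for raw in ['1,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```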

As expected, Game, Social, and Photography rank high on this list, but Communication stands out as the category with the largest average install base, at roughly 90.9 million installs per app.

In [26]:
google_category = freq_table(google_free, 1)

for category in google_category:
    total = 0
    len_category = 0
    
    for row in google_free[1:]:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            total += float(installs)
            len_category += 1
    avg_installs = total / len_category
    
    print(category, ':', avg_installs)

GAME : 33111302.596789423
MAPS_AND_NAVIGATION : 5569698.307692308
COMICS : 950443.220338983
BUSINESS : 2250454.1348314607
SPORTS : 4860918.563888889
WEATHER : 5747142.162162162
HOUSE_AND_HOME : 1917187.0568181819
EDUCATION : 5760596.026490066
LIBRARIES_AND_DEMO : 749950.119047619
TRAVEL_AND_LOCAL : 27921561.32520325
DATING : 1164270.7356828193
BEAUTY : 513151.88679245283
SHOPPING : 12637504.221789883
AUTO_AND_VEHICLES : 647317.8170731707
HEALTH_AND_FITNESS : 4869225.852307692
FOOD_AND_DRINK : 2190710.008
LIFESTYLE : 1479956.6267409471
MEDICAL : 147563.28813559323
BOOKS_AND_REFERENCE : 9655197.28643216
FAMILY : 5784094.900962083
PERSONALIZATION : 7533233.402597402
COMMUNICATION : 90935671.86908078
TOOLS : 14988276.79842932
ENTERTAINMENT : 19516734.69387755
PARENTING : 542603.6206896552
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 36599010.11764706
FINANCE : 2511355.6790830945
NEWS_AND_MAGAZINES : 27058831.263537906
SOCIAL : 48184458.56849315
PHOTOGRAPHY : 32321374.407051284
PRODUCTIVITY 

The App Store dataset does not include an installs column, so we'll use "rating_count_tot" (the total number of user ratings) as the next best proxy for installations.

By average number of ratings per app, Reference is the front-runner in this dataset at about 67,448, with Music (about 56,482) and Weather (about 47,221) as runners-up and Social Networking next at about 32,504.

In [25]:
apple_genres = freq_table(apple_free, 11)

for genre in apple_genres:
    total = 0
    len_genre = 0
    
    for row in apple_free[1:]:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])  # rating_count_tot
            len_genre += 1
    print(genre, ":", total / len_genre)

Travel : 20216.01785714286
Weather : 47220.93548387097
Education : 6266.333333333333
Navigation : 25972.05
Food & Drink : 20179.093023255813
Social Networking : 32503.563380281692
Entertainment : 10822.961077844311
Finance : 13522.261904761905
Book : 8498.333333333334
Reference : 67447.9
Medical : 459.75
News : 15892.724137931034
Utilities : 14010.100917431193
Lifestyle : 8978.308510638299
Shopping : 18746.677685950413
Catalogs : 1779.5555555555557
Music : 56482.02985074627
Health & Fitness : 19952.315789473683
Photo & Video : 27249.892215568863
Sports : 20128.974683544304
Productivity : 19053.887096774193
Business : 6367.8
Games : 18924.68896765618
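
Both averaging loops above rescan the whole dataset once per category, and they print results in dictionary order, which makes the top entries easy to misread. A sketch of a single-pass version that also sorts the averages is given here. `average_by_group`, `installs_to_float`, and the sample rows are hypothetical names and data for illustration only.

```python
from collections import defaultdict


def installs_to_float(value):
    """Strip commas and plus signs from an install string, then convert."""
    return float(value.replace(',', '').replace('+', ''))


def average_by_group(rows, group_index, value_index, parse=float):
    """Average the parsed value per group and sort from highest to lowest."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += parse(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows: name, category, installs.
sample = [
    ['A', 'GAME', '1,000+'],
    ['B', 'GAME', '3,000+'],
    ['C', 'SOCIAL', '10,000+'],
]

ranked_installs = average_by_group(sample, 1, 2, parse=installs_to_float)
for category, average in ranked_installs:
    print(category, ':', average)
```

For the App Store rows, the default `parse=float` would be enough, with the genre index and rating-count index passed in place of 1 and 2.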


## Conclusion

Judging by the trends we've identified in this project, games may seem like a sure moneymaker, but it would be hard to break through in such a saturated market.

There does appear to be room to capitalize on an app that combines heavily used genres like music, social networking, photography, and communication and is released in a smaller genre category.

TikTok (formerly Musical.ly) is one successful instance of this, and it could be said that a simple analysis of the market could have predicted its rise.