# Profitable App Profiles for the Google Play Store
### Analysis by Lee Benson

One way mobile apps can turn a profit is to offer free downloads and earn revenue from in-app ads. Under that model, more users generally means more ad exposure and more revenue. The purpose of this project is to analyze app-store data and draw actionable insights that help developers understand what kinds of apps are likely to attract more users in the Google Play Store and therefore capture more of the Android market.

## Exploration of the Play Store dataset

#### The dataset contains data on approximately ten thousand Android apps from the Google Play Store and was scraped in August 2018. Below I open the dataset in Python, read it into a list of rows, check that it has loaded, and show the header and the first row:

In [1]:
opened_file_android = open('googleplaystore.csv')
from csv import reader
read_file_android = reader(opened_file_android)
android_list = list(read_file_android)
android_header = android_list[0]
android = android_list[1:]

print(opened_file_android)
print(android_list[0:2])

<_io.TextIOWrapper name='googleplaystore.csv' mode='r' encoding='UTF-8'>
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']]
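#### As an aside, the same load could be written so that the file is always closed afterwards. The sketch below is not part of the original run: the helper name load_csv is hypothetical, and it assumes the file is UTF-8 encoded and starts with a header row.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`.

    Hypothetical helper; assumes the first line of the file is a header.
    """
    # The with-block closes the file even if reading raises an error.
    # newline="" is what the csv module documentation recommends for reader().
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Usage (assumes googleplaystore.csv is in the working directory):
# android_header, android = load_csv('googleplaystore.csv')
```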


#### To make exploration of the data easier, I create a function that prints a slice of the data, with parameters for the starting and ending rows and an optional flag for printing the row and column counts. Using the function we see below an organized view of the headers in the dataset, the first two rows, the total number of rows, and the total number of columns:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(android_list, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


### Reviewing the dataset to identify columns that can aid in our analysis

In [3]:
print(android_list[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


#### Reviewing the columns above, the App, Category, Rating, Reviews, Installs, Price, and Genres columns look like they could be useful in the analysis of the data toward our goal of finding profitable app profiles.

#### See [here](https://www.kaggle.com/lava18/google-play-store-apps) for the documentation for each column

## Data cleaning

#### The next step is to clean the data. Sifting through the data, there is a column shift at index 10473 of android_list: the row is missing its 'Category' value, so every later value sits one column to the left. See below:

In [4]:
explore_data(android_list, 0, 2) #show header and 1 row of correct values

print(android_list[10473]) # bad row that is missing 'category' column

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


#### I remove the entire row to cleanse the data of that error:

In [5]:
del android_list[10473]
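
#### Rather than hard-coding the index, a malformed row like this one could also be found by comparing each row's length with the header's. The following is only a sketch on made-up sample rows; find_malformed_rows is a hypothetical helper name, and it only catches rows with the wrong number of columns.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Tiny made-up sample: the second data row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)  # [1]

# Delete from the end so the earlier indexes stay valid.
for i in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))  # 2
```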

#### Further, the Google Play data set appears to have duplicates. The program below looks for repeats of the same app name in the dataset:

In [6]:
duplicate_apps = []
unique_apps = []

for app in android_list[1:]:  # skip the header row
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print(len(duplicate_apps))
print(duplicate_apps[:15])

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


#### Above we see that there are 1,181 duplicate entries. Quick PDF Scanner + OCR FREE is one of the repeated apps; I print all of its rows to see what differs, if anything:

In [7]:
for app in android_list:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


#### It appears that the only difference between the rows is the number of reviews. To remove these duplicates I will loop through the data and build a dictionary, where each key is a unique app name and the value is the highest number of reviews recorded for that app. Using the dictionary I will then create a new dataset with only one entry per app (the row with the highest number of reviews):

In [8]:
reviews_max = {}

for app in android_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Expected length: ', len(android_list[1:])-1181) #number we should have after duplicates are removed
print('Real/dictionary length: ', len(reviews_max))
#reviews_max

Expected length:  9659
Real/dictionary length:  9659


#### Since the numbers look correct, I will make a new list and store it as android_clean:

In [9]:
android_clean = []
already_added = []

for app in android_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


#### The new list android_clean has 9,659 rows - as expected
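
#### The two passes above (building reviews_max, then filtering) could also be collapsed into a single pass that stores the best row per app directly. This is a sketch on made-up rows, not the code that produced the numbers above; dedupe_by_max_reviews is a hypothetical name. Note that the result is ordered by each app's first appearance, which can differ slightly from the order of android_clean.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    # Dictionaries preserve insertion order (Python 3.7+).
    return list(best.values())

# Made-up sample with the same column positions as the dataset (reviews at index 3).
sample = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'TOOLS', '4.0', '5'],
    ['App A', 'TOOLS', '4.0', '12'],
    ['App A', 'TOOLS', '4.0', '12'],
]
deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```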

## Remove non-English apps

#### The apps that we are looking to make are in English, so all non-English apps should be removed from the data set. 

#### First I will create a function that returns False if an app name contains more than three non-ASCII characters, and True otherwise. I allow up to three because an English app name can contain an emoji, a dash, or a trademark symbol, and a zero-tolerance rule would wrongly remove those apps. See below for the new function isEnglish and tests with different potential strings:

In [10]:
def isEnglish(string):
    count = 0
    for character in string:
        if ord(character) > 127:  # outside the ASCII range
            count += 1
    if count > 3:
        return False
    return True

print(isEnglish('Instagram'))
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

True
False
True
True
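
#### The threshold of three is a judgment call, so it can help to make it a parameter and see how the results change. The variant below is a sketch only; the name is_english and the parameter max_non_ascii are hypothetical and are not used elsewhere in this analysis.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instachat 😜'))                    # True: one non-ASCII character
print(is_english('Instachat 😜', max_non_ascii=0))   # False under a zero-tolerance rule
```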


#### Using the function created above (isEnglish), I loop through the dataset and create new lists for apps that are most likely to be English based, then compare the length of the English list to the prior list (before non-English apps were removed):

In [11]:
android_foreignapps = []
android_englishapps = []

for app in android_clean:
    name = app[0]
    if isEnglish(name):
        android_englishapps.append(app)
    else:
        android_foreignapps.append(app)
        
#explore_data(android_foreignapps, 0, 5, True)
explore_data(android_englishapps, 0, 3, True)
print('Android English & non-English row count: ', len(android_clean))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
Android English & non-English row count:  9659


## Isolate free apps
#### The developers only build apps that are free to download and install, with revenue generated through in-app advertising. The dataset contains both free and paid apps; I will isolate only the free apps for this analysis:

In [12]:
free_android_apps = []

for app in android_englishapps:
    price = app[7]
    if price == '0':
        free_android_apps.append(app)

print('We are left with: ', len(free_android_apps), ' Android apps')
    

We are left with:  8864  Android apps


#### I stored the list of free only apps as the list free_android_apps

## Data Strategy
#### At this point I will analyze the data to find app profiles that are successful in the Play Store and so show good potential for a new app.

#### The first step is to get a sense of the most common app categories. I will start by constructing a function that generates a frequency table (freq_table) showing the share of free English apps in each category, and a function (display_table) that prints that table sorted from highest to lowest:

In [13]:
def freq_table (dataset, index):
    genre_freqs = {}
    total = 0
    for app in dataset:
        total += 1
        genre = app[index]
        if genre in genre_freqs:
            genre_freqs[genre] += 1
        else:
            genre_freqs[genre] = 1
            
    table_percentages = {}
    for key in genre_freqs:
        percentage = (genre_freqs[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
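
#### For comparison, the same percentage table could be built with collections.Counter from the standard library. This is a sketch with hypothetical function names (freq_table_counter, sorted_percentages) on a made-up sample; it is not the code used for the tables in this analysis.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_percentages(table):
    """Return (value, percentage) pairs sorted from highest to lowest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Made-up sample: category at index 1.
sample = [['a', 'TOOLS'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'TOOLS']]
for category, pct in sorted_percentages(freq_table_counter(sample, 1)):
    print(category, ':', pct)
```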

#### The Android dataset contains two potentially relevant columns for determining the category of an app: Genres and Category. It is unclear what the difference between the two is; reviewing the frequency tables below may shed some light:

In [14]:
display_table(free_android_apps, -4)  # Genres (display_table prints the table itself)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

For the Genres column, Tools holds the largest share of free English apps (8.4%), followed by Entertainment at 6.1% and Education at 5.3%. Here, games are broken down into more specific genres (Action, Arcade, Puzzle, and so on), which splits their combined market share across many small entries.

In [15]:
display_table(free_android_apps, 1)  # Category (display_table prints the table itself)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

#### The Category column tells a similar story. Family (mostly games for kids) is the most common app category at 18.9% of the available apps, followed by Game at 9.7% and Tools at 8.5%.

#### It is not clear exactly what the difference between the Genres and Category columns is. The Genres column appears to be more granular, and the Category column more general. Thus I will use the Category column moving forward.

#### Reviewing the frequency tables, the Play Store appears to contain slightly more entertainment-oriented apps than practical ones. In other words, apps whose purpose is fun, through interaction or otherwise, are slightly more numerous in the Play Store.

## Analyzing install/review frequency

#### The Google Play Store dataset contains an Installs column, but as we see below it is not precise. The column uses open-ended buckets (100,000+, 1,000,000+, and so on), so one cannot see whether an app has, for example, 1,000,000 installs or 2,000,000 installs. See below:

In [16]:
display_table(free_android_apps, 5) #installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


#### We do not need exact numbers to get an idea of the landscape. For the purposes of this analysis I will remove all of the extra characters, so 1,000+ will simply be 1,000. The numbers below show the average number of "installs" per app category in the Google Play Store:

In [17]:
prime_genre_freqs = freq_table(free_android_apps, 1)
dic_freqs = {}

for genre in prime_genre_freqs:
    total = 0
    len_genre = 0
    for app in free_android_apps:
        genre_app = app[1]
        if genre_app == genre:
            user_ratings = app[5] #installs column
            user_ratings = user_ratings.replace('+','')
            user_ratings = user_ratings.replace(',','')
            user_ratings = float(user_ratings)
            total += user_ratings
            len_genre += 1
    average = total/len_genre
    dic_freqs.update({genre : average})

sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
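
The cell above rescans the whole dataset once per category. As a sketch of an alternative (hypothetical helper names parse_installs and average_installs_by_category, made-up sample rows), the cleaning step can be factored out and the averages accumulated in a single pass:

```python
def parse_installs(raw):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))

def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return {category: average installs}, visiting each row once."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_idx]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_idx])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows: category at index 1, installs at index 5, as in the dataset.
sample = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'GAME', '', '', '', '10+'],
]
averages = average_installs_by_category(sample)
for category in sorted(averages, key=averages.get, reverse=True):
    print("{} : {}".format(category, averages[category]))
```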

In the Google Play Store the Communication category appears to have the highest average installs per app, followed by Video Players and Social apps. Let's take a deeper dive and see which apps are the most popular in each of these three categories:

In [18]:
#Communication Genre
dic_freqs = {}

for app in free_android_apps:
    if app[1] == 'COMMUNICATION':
        user_ratings = app[5] #installs column
        user_ratings = user_ratings.replace('+','')
        user_ratings = user_ratings.replace(',','')
        user_ratings = float(user_ratings)
        name = app[0]
        #print(name, ':', user_ratings)
        dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

WhatsApp Messenger : 1000000000.0
Messenger – Text and Video Chat for Free : 1000000000.0
Skype - free IM & video calls : 1000000000.0
Google Chrome: Fast & Secure : 1000000000.0
Gmail : 1000000000.0
Hangouts : 1000000000.0
Google Duo - High Quality Video Calls : 500000000.0
imo free video calls and chat : 500000000.0
LINE: Free Calls & Messages : 500000000.0
UC Browser - Fast Download Private & Secure : 500000000.0
Viber Messenger : 500000000.0
imo beta free calls and text : 100000000.0
Android Messages : 100000000.0
Who : 100000000.0
GO SMS Pro - Messenger, Free Themes, Emoji : 100000000.0
Firefox Browser fast & private : 100000000.0
Messenger Lite: Free Calls & Messages : 100000000.0
Kik : 100000000.0
KakaoTalk: Free Calls & Text : 100000000.0
Opera Mini - fast web browser : 100000000.0
Opera Browser: Fast and Secure : 100000000.0
Telegram : 100000000.0
Truecaller: Caller ID, SMS spam blocking & Dialer : 100000000.0
UC Browser Mini -Tiny Fast Private & Secure : 100000000.0
WeChat : 

In [19]:
#Video Players Genre
dic_freqs = {}

for app in free_android_apps:
    if app[1] == 'VIDEO_PLAYERS':
        user_ratings = app[5] #installs column
        user_ratings = user_ratings.replace('+','')
        user_ratings = user_ratings.replace(',','')
        user_ratings = float(user_ratings)
        name = app[0]
        #print(name, ':', user_ratings)
        dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

YouTube : 1000000000.0
Google Play Movies & TV : 1000000000.0
MX Player : 500000000.0
Motorola Gallery : 100000000.0
VLC for Android : 100000000.0
Dubsmash : 100000000.0
VivaVideo - Video Editor & Photo Movie : 100000000.0
VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000.0
Motorola FM Radio : 100000000.0
Vote for : 50000000.0
Vigo Video : 50000000.0
MiniMovie - Free Video and Slideshow Editor : 50000000.0
Samsung Video Library : 50000000.0
LIKE – Magic Video Maker & Community : 50000000.0
DU Recorder – Screen Recorder, Video Editor, Live : 50000000.0
KineMaster – Pro Video Editor : 50000000.0
VMate : 50000000.0
HD Video Downloader : 2018 Best video mate : 50000000.0
Ringdroid : 50000000.0
Video Downloader : 10000000.0
Video Player All Format : 10000000.0
Code : 10000000.0
Music - Mp3 Player : 10000000.0
YouTube Studio : 10000000.0
video player for android : 10000000.0
HTC Service － DLNA : 10000000.0
HTC Gallery : 10000000.0
PowerDirector Video Editor App: 4K, Slow Mo & Mo

In [20]:
#Social Genre
dic_freqs = {}

for app in free_android_apps:
    if app[1] == 'SOCIAL':
        user_ratings = app[5] #installs column
        user_ratings = user_ratings.replace('+','')
        user_ratings = user_ratings.replace(',','')
        user_ratings = float(user_ratings)
        name = app[0]
        #print(name, ':', user_ratings)
        dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

Facebook : 1000000000.0
Google+ : 1000000000.0
Instagram : 1000000000.0
Facebook Lite : 500000000.0
Snapchat : 500000000.0
Tumblr : 100000000.0
Pinterest : 100000000.0
Badoo - Free Chat & Dating App : 100000000.0
Tango - Live Video Broadcast : 100000000.0
LinkedIn : 100000000.0
Tik Tok - including musical.ly : 100000000.0
BIGO LIVE - Live Stream : 100000000.0
VK : 100000000.0
ooVoo Video Calls, Messaging & Stories : 50000000.0
MeetMe: Chat & Meet New People : 50000000.0
Zello PTT Walkie Talkie : 50000000.0
POF Free Dating App : 50000000.0
SKOUT - Meet, Chat, Go Live : 50000000.0
TextNow - free text + calls : 10000000.0
LiveMe - Video chat, new friends, and make money : 10000000.0
HTC Social Plugin - Facebook : 10000000.0
Quora : 10000000.0
Kate Mobile for VK : 10000000.0
Text Me: Text Free, Call Free, Second Phone Number : 10000000.0
Text free - Free Text + Call : 10000000.0
YouNow: Live Stream Video Chat : 10000000.0
We Heart It : 10000000.0
Path : 10000000.0
SayHi Chat, Meet New Peop

#### As we see above, some of these categories are dominated by a few apps that skew the average for that category. Below I redo the category calculations, this time excluding apps that have 100,000,000 or more installs:

In [21]:
prime_genre_freqs = freq_table(free_android_apps, 1)
dic_freqs = {}

for genre in prime_genre_freqs:
    total = 0
    len_genre = 0
    for app in free_android_apps:
        user_ratings = app[5] #installs column
        user_ratings = user_ratings.replace('+','')
        user_ratings = user_ratings.replace(',','')
        genre_app = app[1]
        if (genre_app == genre) and  (float(user_ratings) < 100000000):
            user_ratings = float(user_ratings)
            total += user_ratings
            len_genre += 1
    average = total/len_genre
    dic_freqs.update({genre : average})

sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

PHOTOGRAPHY : 7670532.29338843
GAME : 6272564.694894147
ENTERTAINMENT : 6118250.0
VIDEO_PLAYERS : 5544878.133333334
WEATHER : 5074486.197183099
SHOPPING : 4640920.541237113
COMMUNICATION : 3603485.3884615386
PRODUCTIVITY : 3379657.318885449
TOOLS : 3191461.128987517
SOCIAL : 3084582.5201793723
SPORTS : 2994082.551839465
TRAVEL_AND_LOCAL : 2944079.6336633665
PERSONALIZATION : 2549775.832167832
MAPS_AND_NAVIGATION : 2484104.7540983604
FAMILY : 2342897.527075812
HEALTH_AND_FITNESS : 2005713.6605166052
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
NEWS_AND_MAGAZINES : 1502841.8775510204
BOOKS_AND_REFERENCE : 1437212.2162162163
HOUSE_AND_HOME : 1331540.5616438356
BUSINESS : 1226918.7407407407
LIFESTYLE : 1152128.779710145
FINANCE : 1086125.7859327218
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.88679245283
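
#### One edge case in the cell above: if every app in a category were at or above the cutoff, len_genre would be 0 and the division would fail. That does not happen with this dataset, but a sketch that guards against it (hypothetical name capped_average_installs, made-up sample rows) might look like this:

```python
def capped_average_installs(rows, cap, category_idx=1, installs_idx=5):
    """Average installs per category, ignoring apps with `cap` or more installs.

    Categories in which every app is at or above the cap are left out
    of the result instead of causing a division by zero.
    """
    totals, counts = {}, {}
    for row in rows:
        installs = float(row[installs_idx].replace('+', '').replace(',', ''))
        if installs >= cap:
            continue  # treated as an outlier, same cutoff rule as above
        category = row[category_idx]
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows: category at index 1, installs at index 5.
sample = [
    ['App A', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['App B', 'SOCIAL', '', '', '', '1,000+'],
    ['App C', 'SOCIAL', '', '', '', '3,000+'],
    ['App D', 'MEGA', '', '', '', '500,000,000+'],
]
capped = capped_average_installs(sample, cap=100000000)
print(capped)
```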


#### Photography, Games, and Entertainment top the category list once outlier apps are removed. Let's do a deeper dive into these categories:

In [22]:
#Photography Genre
dic_freqs = {}

for app in free_android_apps:
    user_ratings = app[5] #installs column
    user_ratings = user_ratings.replace('+','')
    user_ratings = user_ratings.replace(',','')
    user_ratings = float(user_ratings)
    name = app[0]
    if (app[1] == 'PHOTOGRAPHY') and  (float(user_ratings) < 100000000):
            #print(name, ':', user_ratings)
            dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

Motorola Camera : 50000000.0
InstaBeauty -Makeup Selfie Cam : 50000000.0
Selfie Camera - Photo Editor & Filter & Sticker : 50000000.0
ASUS Gallery : 50000000.0
Square InPic - Photo Editor & Collage Maker : 50000000.0
VSCO : 50000000.0
PhotoWonder: Pro Beauty Photo Editor Collage Maker : 50000000.0
Photo Effects Pro : 50000000.0
Photo Editor Selfie Camera Filter & Mirror Image : 50000000.0
Pic Collage - Photo Editor : 50000000.0
Photo Editor by Aviary : 50000000.0
Video Editor Music,Cut,No Crop : 50000000.0
Pixlr – Free Photo Editor : 50000000.0
Adobe Photoshop Express:Photo Editor Collage Maker : 50000000.0
InstaSize Photo Filters & Collage Editor : 50000000.0
Snapseed : 50000000.0
Keepsafe Photo Vault: Hide Private Photos & Videos : 50000000.0
MakeupPlus - Your Own Virtual Makeup Artist : 50000000.0
SNOW - AR Camera : 50000000.0
Boomerang from Instagram : 50000000.0
Photo Lab Picture Editor: face effects, art frames : 50000000.0
MomentCam Cartoons & Stickers : 50000000.0
LightX Photo 

#### The Photography category seems promising. The development time required to make an app like a photo frame or photo collage tool is likely low, and such apps draw many installs.

In [23]:
#Game Genre
dic_freqs = {}

for app in free_android_apps:
    user_ratings = app[5] #installs column
    user_ratings = user_ratings.replace('+','')
    user_ratings = user_ratings.replace(',','')
    user_ratings = float(user_ratings)
    name = app[0]
    if (app[1] == 'GAME') and  (float(user_ratings) < 100000000):
            #print(name, ':', user_ratings)
            dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

Bubble Witch 3 Saga : 50000000.0
Block Craft 3D: Building Simulator Games For Free : 50000000.0
Love Balls : 50000000.0
Snake VS Block : 50000000.0
PUBG MOBILE : 50000000.0
Summoners War : 50000000.0
Lords Mobile: Battle of the Empires - Strategy RPG : 50000000.0
Toy Blast : 50000000.0
Gardenscapes : 50000000.0
Magic Tiles 3 : 50000000.0
Granny : 50000000.0
Doodle Jump : 50000000.0
Anger of stick 5 : zombie : 50000000.0
CATS: Crash Arena Turbo Stars : 50000000.0
War Robots : 50000000.0
GUNSHIP BATTLE: Helicopter 3D : 50000000.0
Earn to Die 2 : 50000000.0
Candy Crush Jelly Saga : 50000000.0
Swamp Attack : 50000000.0
Bowmasters : 50000000.0
DEAD TARGET: FPS Zombie Apocalypse Survival Games : 50000000.0
Word Search : 50000000.0
UNO ™ & Friends : 50000000.0
Rolling Sky : 50000000.0
► MultiCraft ― Free Miner! 👍 : 50000000.0
Pixel Gun 3D: Survival shooter & Battle Royale : 50000000.0
Perfect Piano : 50000000.0
Beach Buggy Blitz : 50000000.0
Geometry Dash Meltdown : 50000000.0
Gangstar Vegas 

DEER HUNTER RELOADED : 5000000.0
DEER HUNTER CHALLENGE : 5000000.0
NARUTO X BORUTO NINJA VOLTAGE : 5000000.0
Does not Commute : 5000000.0
Dr. Gomoku : 5000000.0
Wheelie Challenge : 5000000.0
Peggle Blast : 5000000.0
SCRABBLE : 5000000.0
Guess The Emoji : 5000000.0
HAWK – Force of an Arcade Shooter. Shoot 'em up : 5000000.0
Texas Holdem Poker Pro : 5000000.0
Governor of Poker 2 - OFFLINE POKER GAME : 5000000.0
Classic Words Solo : 5000000.0
Oggy : 5000000.0
I Know Stuff : 5000000.0
Rescue Robots Survival Games : 5000000.0
ETERNITY WARRIORS 2 : 5000000.0
Snes9x EX+ : 5000000.0
Golden HoYeah Slots - Real Casino Slots : 5000000.0
Angry Birds Space HD : 5000000.0
Crazy Bike attack Racing New: motorcycle racing : 5000000.0
PRO MX MOTOCROSS 2 : 5000000.0
Farm Fruit Pop: Party Time : 1000000.0
Woody Puzzle : 1000000.0
Bricks n Balls : 1000000.0
The Fish Master! : 1000000.0
Looper! : 1000000.0
Mad Skills BMX 2 : 1000000.0
MMX Hill Dash 2 – Offroad Truck, Car & Bike Racing : 1000000.0
Offroad Ou

Monster Ride Pro : 10.0
Ay Vamos - PJ. Balvin - Piano : 5.0
Brick Breaker BR : 5.0


#### The Game category is not as promising, because games require significantly more development. However, if the game is great, then there is potential for significant exposure.

In [24]:
#Entertainment Genre
dic_freqs = {}

for app in free_android_apps:
    user_ratings = app[5] #installs column
    user_ratings = user_ratings.replace('+','')
    user_ratings = user_ratings.replace(',','')
    user_ratings = float(user_ratings)
    name = app[0]
    if (app[1] == 'ENTERTAINMENT') and  (float(user_ratings) < 100000000):
            #print(name, ':', user_ratings)
            dic_freqs.update({name : user_ratings})
        
sorted_freqs = sorted(dic_freqs, key=lambda x: dic_freqs[x], reverse = True)
for value in sorted_freqs:
    print("{} : {}".format(value, dic_freqs[value]))

Talking Ginger 2 : 50000000.0
Amazon Prime Video : 50000000.0
Twitch: Livestream Multiplayer Games & Esports : 50000000.0
PlayStation App : 50000000.0
Mobile TV : 10000000.0
Motorola Spotlight Player™ : 10000000.0
MEGOGO - Cinema and TV : 10000000.0
ivi - movies and TV shows in HD : 10000000.0
Movies by Flixster, with Rotten Tomatoes : 10000000.0
BBC Media Player : 10000000.0
Fandango Movies - Times + Tickets : 10000000.0
Crackle - Free TV & Movies : 10000000.0
CBS - Full Episodes & Live TV : 10000000.0
STARZ : 10000000.0
Tubi TV - Free Movies & TV : 10000000.0
Crunchyroll - Everything Anime : 10000000.0
WWE : 10000000.0
FOX : 10000000.0
Vudu Movies & TV : 10000000.0
Viki: Asian TV Dramas & Movies : 10000000.0
Redbox : 10000000.0
Imgur: Find funny GIFs, memes & watch viral videos : 10000000.0
SketchBook - draw and paint : 10000000.0
Colorfy: Coloring Book for Adults - Free : 10000000.0
TV+ : 5000000.0
Digital TV : 5000000.0
Vigo Lite : 5000000.0
Peers.TV: broadcast TV channels First, M

#### The Entertainment category is also not as promising: the apps with a lot of "market share" seem to require a partnership with a content provider.

#### Reviewing these modified frequency tables, Photography, Games, and Entertainment have the highest average installs once outliers are removed. Of the three, Photography looks like the most promising category to target.
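
#### The per-category cells above repeat the same loop with a different category name each time. As a sketch, that loop could be wrapped in one helper; top_apps and its parameters are hypothetical names, and the sample rows are made up.

```python
def top_apps(rows, category, max_installs=None, n=5,
             name_idx=0, category_idx=1, installs_idx=5):
    """Return up to `n` (name, installs) pairs for `category`, highest first.

    If `max_installs` is given, apps with that many installs or more are skipped.
    """
    matches = []
    for row in rows:
        if row[category_idx] != category:
            continue
        installs = float(row[installs_idx].replace('+', '').replace(',', ''))
        if max_installs is not None and installs >= max_installs:
            continue
        matches.append((row[name_idx], installs))
    # list.sort is stable, so apps with equal installs keep their dataset order.
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:n]

# Made-up rows: name at index 0, category at index 1, installs at index 5.
sample = [
    ['Cam A', 'PHOTOGRAPHY', '', '', '', '50,000,000+'],
    ['Cam B', 'PHOTOGRAPHY', '', '', '', '100,000,000+'],
    ['Game A', 'GAME', '', '', '', '1,000+'],
    ['Cam C', 'PHOTOGRAPHY', '', '', '', '10,000+'],
]
for name, installs in top_apps(sample, 'PHOTOGRAPHY', max_installs=100000000):
    print("{} : {}".format(name, installs))
```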

## Conclusion

#### The purpose of this analysis was to find actionable insights that a developer could use to build a free app funded by in-app ad revenue. Having cleaned the data, limited it to English-based apps, and removed duplicates and outliers, we see a clearer picture of which categories to target with a free app in order to capture more of the Android market. Reviewing the data, I would recommend that a developer make an app in the Photography space. An app in this category could take relatively little development for the exposure that it could gain compared to other categories.