# Profitable App Profiles for the App Store and Google Play Markets

Using a data set of mobile applications and their associated data, this project aims to identify which kinds of free apps are likely to be profitable through in-app ads. With this information, team leads and managers can focus development on profit-driven work. Using Python, we drill down into the available tables to extract usable data.

The end result should be a cleaned data set and a set of conclusions that help developers focus on the types of free applications and features that users find most desirable.

## Opening and Exploring the Data

Near the end of 2018, there were approximately 2 million iOS apps on the App Store and roughly 1.2 million Android apps on Google Play.

[Here is the Google Play app list, a sample of about 10,000 apps that we will be using](https://www.kaggle.com/lava18/google-play-store-apps)

[Here is the App Store app list, a sample of about 7,000 apps that we will also use](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)


First, we must open and inspect the data before we can begin to use it.

In [1]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
Google = list(read_file)
Google_header = Google[0]
Google = Google[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
Apple = list(read_file)
Apple_header = Apple[0]
Apple = Apple[1:]
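
The two loading steps above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not part of the original notebook: `load_csv` and `split_header` are hypothetical names, and the demo parses a made-up in-memory sample rather than the real files. Using a `with` block also closes the file, which the cell above does not do.

```python
import csv
import io

def split_header(rows):
    """Split parsed CSV rows into (header, data_rows)."""
    return rows[0], rows[1:]

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, data_rows); the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as f:
        return split_header(list(csv.reader(f)))

# Demo on an in-memory sample (made-up rows) instead of the real files
sample = io.StringIO('App,Category\nInstagram,SOCIAL\nBox,BUSINESS\n')
demo_header, demo_rows = split_header(list(csv.reader(sample)))
print(demo_header)
print(len(demo_rows))
```

With the real files, a call such as `Google_header, Google = load_csv('googleplaystore.csv')` would be the assumed equivalent of the first half of the cell above.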

To speed up exploration, the function below prints a slice of rows and, optionally, the number of rows and columns in a data set.

The row counts it reports also give us a baseline for the number of apps in each list, which we will use as a reference for every change we make.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print('Apple Store Columns:', '\n', Apple_header)
print('\n')
explore_data(Apple, 1, 4, True)  # explore_data only prints; it returns None
print('\n')
print('Google Store Columns:', '\n', Google_header)
print('\n')
explore_data(Google, 1, 4, True)

Apple Store Columns: 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


Google Store Columns: 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art 

DESCRIPTION OF DATA

We can see that the Google Play data has 10841 apps and 13 columns, while the App Store data has 7197 apps and 16 columns. At first glance, columns that look useful for the analysis include "Installs", "Rating", "Reviews", "Price", and "Type" for Google Play and "track_name", "price", "rating_count_tot", "prime_genre", and "user_rating" for the App Store. [Here is a description of the Apple Store Data column meanings](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

We don't yet know whether the data sets contain duplicates or unusable rows, so the next section focuses on cleaning.

## Data Cleaning

Data rarely arrives in workable shape, so now is the time to clean the data sets to ensure usability.

This process includes:
- Eliminating duplicates
- Sorting between Null and NaN
- Ensuring columns are uniform
- Isolating English only apps

To start, we know that one row in the Google Play data is missing a value, and the most straightforward approach is to delete it.

In [3]:
print(Google[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


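This row has only 12 values instead of 13: the "Category" value is missing, so every later value is shifted one column to the left. As a result, the "Rating" column reads 19, above the maximum of 5. We will delete the row.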

In [9]:
# Run the deletion only once: uncomment the `del` line below for the first run,
# then comment it out again so that re-running this cell does not delete a valid row.
#del Google[10472]
print(Google[10472])
print('\n')
print('New total apps:', len(Google))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


New total apps: 10840


Keeping a record of every change is imperative. Here the malformed row is deleted and the row now at that index is printed. We started with 10841 rows and now have 10840.
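
Deleting by index is fragile, because running the cell twice removes a valid row. One possible alternative is to filter on row length, which can be re-run safely. This is a sketch under the assumption that a malformed row is one whose number of fields differs from the header; `drop_malformed_rows` is a hypothetical helper and the demo rows are made up.

```python
def drop_malformed_rows(rows, header):
    """Return (kept, removed), where removed holds rows whose length differs from the header's."""
    expected = len(header)
    kept, removed = [], []
    for row in rows:
        (kept if len(row) == expected else removed).append(row)
    return kept, removed

# Made-up miniature data: the second row is missing one field
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['Good App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
    ['Other App', 'SOCIAL', '4.5'],
]
kept, removed = drop_malformed_rows(demo_rows, demo_header)
print(len(kept), len(removed))
```

Because the filter depends on the content of each row rather than its position, applying it a second time to `kept` removes nothing further.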

### REMOVING DUPLICATES

#### PART ONE:

The [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) section of the data source establishes that there are many duplicate apps. Here we demonstrate the duplicates and count them.

In [37]:
for app in Google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


This shows "Instagram" appearing 4 times in the Google Play data.

Next, we separate unique names from duplicates with a loop.

In [38]:
duplicate_apps = []
unique_apps = []

for app in Google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps)) 
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


It would be rash to delete these duplicates at random, so we need a criterion for choosing which row to keep. The number of reviews (the "Reviews" column) serves this purpose. The only field that varies across the "Instagram" rows is the review count, and the row with the highest count is presumably the most recent.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [39]:
print('Expected length:', len(Google)-1181)

Expected length: 9659


The value above is the number of rows we should end up with once the duplicates are removed.

In [40]:
reviews_max = {}
for row in Google:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
                    

In [41]:
print('Expected length:', len(Google)-1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


The dictionary has the expected number of entries, one per unique app, so we can use it to filter the data set.

#### PART TWO:

Now, let's use the reviews_max dictionary to remove the duplicates. For each duplicated app, we keep only the entry with the highest number of reviews. The code cell below does the following:

- Create two empty lists, google_clean and already_added.
- Loop through the Google Play data set and, for each row:
    * Find the app name and its number of reviews.
    * Add the row to the google_clean list, and the app name to the already_added list, if:
        - The review count matches the maximum stored for that app in the reviews_max dictionary; and
        - The app name is not already in the already_added list. This prevents extra duplicates from being added once the row with the highest count has been collected. Without this condition, duplicate rows that share the same maximum review count would each be added.

If we only checked reviews_max[name] == n_reviews, we would still end up with duplicate entries for some apps.

In [76]:
google_clean = []
already_added = []

for row in Google:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(row)
        already_added.append(name)

In [43]:
explore_data(google_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


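The row count above confirms the expected total of 9659 apps.

Lastly, we run the same loop on the App Store data to show that it has no duplicates. Note that column 0 of the App Store data is "id", so this check compares app IDs rather than names.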

In [44]:
duplicate_apps = []
unique_apps = []

for app in Apple:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps)) 
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 0


Examples of duplicate apps: []


## Removing Non-English Apps

#### Part One:

Many of the apps in both data sets are not aimed at an English-speaking audience. Since we want to create an English-language app, we should not let data from apps outside our target market influence the final decision.

In [45]:
print('Non-English AppleStore Apps:', '\n')
print(Apple[813][1],'\n',Apple[6731][1])
print('\n')
print('Non-English GooglePlay Apps:', '\n')
print(google_clean[4412][0],'\n',google_clean[7940][0])



Non-English AppleStore Apps: 

爱奇艺PPS -《欢乐颂2》电视剧热播 
 【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


Non-English GooglePlay Apps: 

中国語 AQリスニング 
 لعبة تقدر تربح DZ


The names above are examples of such apps.

We want apps that use the English language, and knowing what English text contains lets us determine what to exclude.
("English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.)").

Each character has a corresponding number, and the characters commonly used in English text fall in the ASCII range (0-127). We can therefore check the number associated with each character to decide whether it belongs to that set.

The function below uses the built-in ord() function to get the code point number of each character in the app name.


In [46]:
def english_only(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True


print(english_only('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_only('Instagram'))
    

False
True


This initially appears to work, but as the cell below demonstrates, a strict ASCII check also rejects emojis and symbols (such as ™) that appear in English app names. Relying on this approach alone would remove more apps than necessary.

In [47]:
print(english_only('Docs To Go™ Free Office Suite'))
print(english_only('Instachat 😜'))

print(ord('😜'))
print(ord('™'))

False
False
128540
8482


#### Part Two:

To limit this data loss, we will only remove an app if its name contains more than 3 non-ASCII characters.

In [48]:
def english_primary(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127 :
            non_ascii += 1
    if non_ascii > 3:
        
        return False
    else:
        return True
 
print(english_primary('Docs To Go™ Free Office Suite'))
print(english_primary('Instachat 😜'))
print(english_primary('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


This method achieves the goal, even if it is not perfect.
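
One case where the rule falls short is a short non-English name: with three or fewer non-ASCII characters it still passes. The sketch below restates the same rule with the threshold exposed as a parameter so the trade-off is easy to try out. `count_non_ascii` and `is_probably_english` are hypothetical names, not part of the original notebook.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Same rule as english_primary, with the threshold exposed as a parameter."""
    return count_non_ascii(text) <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))  # True: one non-ASCII character
print(is_probably_english('中国語'))  # True: only 3 non-ASCII characters, so it slips through
print(is_probably_english('中国語 AQリスニング'))  # False: 8 non-ASCII characters
```

Lowering `max_non_ascii` would catch more short non-English names at the cost of rejecting more English names that contain emojis or symbols.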

Next, we filter both data sets down to English-language apps.

In [49]:
google_english = []
apple_english = []

for row in google_clean:
    name = row[0]
    if english_primary(name):    
        google_english.append(row)
        

for row in Apple:
    name = row[1]
    if english_primary(name):    
        apple_english.append(row)
 

print('Former size of GooglePlay App List:', '\n', len(google_clean))
print('\n')
print('Current GooglePlay App List')
print('\n')
explore_data(google_english,0,3,True)
print('\n')
print('Number of Apps removed:', len(google_clean)-(len(google_english)))
print('\n')
print('Former size of AppleStore App List:', '\n',len(Apple))
print('\n')
print('Current AppleStore App List')
print('\n')
explore_data(apple_english,0,3,True)
print('\n')
print('Number of Apps removed:', len(Apple)-(len(apple_english)))

Former size of GooglePlay App List: 
 9659


Current GooglePlay App List


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


Number of Apps removed: 45


Former size of AppleStore App List: 
 7197


Current AppleStore App List


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', 

We now have clean data sets to work with:

- App Store: 6183 apps remaining, 1014 removed
- Google Play: 9614 apps remaining, 45 removed

## Isolating Free Apps

As mentioned in the introduction, ad revenue is the primary profit driver, so we build free apps that attract advertisers through a high volume of downloads and high ratings.

The next cell collects all the free apps in both data sets.

In [50]:
free_google_english = []
free_apple_english = []

for row in google_english:
    price = row[7]
    if price == '0':    
        free_google_english.append(row)
        

for row in apple_english:
    price = row[4]
    if price == '0.0':    
        free_apple_english.append(row)
        
explore_data(free_apple_english,0,3,True)
print('\n')
explore_data(free_google_english,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

After cleaning the data and isolating the free apps, we are left with 8864 Google Play apps and 3222 App Store apps.
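
The two loops compare the price against different string literals ('0' for Google Play and '0.0' for the App Store). A numeric comparison would cover both with one check. The following is a sketch only: `is_free` is a hypothetical helper, and it assumes that paid prices may carry a leading "$" sign.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable prices are treated as not free

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```

Treating unparseable values as not free is a conservative choice here; it keeps a misaligned row from being counted as a free app.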

## Most Common App Genre

#### Part One:

From the start, the goal has been to find the types of applications that attract the most users, so that in-app ad revenue can be maximized. We must now isolate the most popular genres.

To reduce risk and overhead, our "validation strategy" for developing an app is as follows:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

To be attractive in both markets, we must focus on genres that are popular in both stores.

Next, we will create frequency tables for the "prime_genre" column of the App Store data and compare them with the "Genres" and "Category" columns of the Google Play data.

#### Part Two:

We will now create two functions to analyze these columns:

  - The first function builds a frequency table expressed as percentages
  - The second displays the table sorted in descending order

In [51]:
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequency_table:
            frequency_table[value] +=1
        else:
            frequency_table[value] = 1
            
    table_percentage = {}
    for key in frequency_table:
        percentage = (frequency_table[key]/total) * 100
        table_percentage[key] = percentage
        
    return table_percentage
    

def display_table(dataset, index):
    frequency_table = freq_table(dataset, index)
    table_display = []
    for key in frequency_table:
        key_val_as_tuple = (frequency_table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
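
For comparison, the standard library's `collections.Counter` can produce the same kind of table in a few lines. This is a hedged sketch rather than a replacement for the functions above: `freq_table_pct` is a hypothetical name and the demo rows are made up. One difference to be aware of is that ties are ordered by first appearance here, whereas display_table orders tied percentages by key in reverse.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows in the form [name, genre]
demo = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, pct in freq_table_pct(demo, 1):
    print(value, ':', pct)
```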
        

#### Part Three:

To begin, we will analyze the frequency table for the "prime_genre" column of the App Store data set.

In [64]:
display_table(free_apple_english, -5) #prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


With more than half of the free English apps being games (58%), the App Store is clearly skewed towards gaming. The next largest genres are Entertainment (8%) and Photo & Video (5%). Beyond those, no genre reaches the 4% mark, and the remaining genres are all in the same ballpark. Any app created for this project should therefore consider combining some form of gamification with a less crowded genre to best break into the market.

Next, we review the "Genres" and "Category" columns of the Google Play data set.

In [65]:
display_table(free_google_english, -4) #genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [66]:
display_table(free_google_english, 1) #Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The issue with the "Genres" column is the sheer diversity of its values. The broader groupings in the "Category" column appear far more useful for comparing the App Store and Google Play. From here on, we will focus on the "Category" column to guide our choices.

With the Family and Game categories holding ~29% of all free English apps on Google Play (18.9% + 9.7%), children and play still form the largest grouping. With games at ~58% on the App Store and Family plus Game at ~29% on Google Play, it would be wise to include child-friendly gaming in our new app.

Since non-gaming apps still have a larger presence on Google Play, it should be prudent to start with a practical app that has family-friendly gaming features before crossing over to the iOS platform.

## Most Popular Apps by Genre on the App Store

The number of installs is a good way to determine which apps are most popular, but only the Google Play data has an "Installs" column. For the App Store, we can use the total number of user ratings, stored in the rating_count_tot column, as a proxy for how widely an app has been used.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [67]:
genres_apple = freq_table(free_apple_english, -5)

for genre in genres_apple:
    total = 0
    len_genre = 0
    for row in free_apple_english:
        genre_app = row[-5]
        if genre_app == genre:
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)
            

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
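
The nested loop above scans the whole data set once per genre. A single pass that accumulates totals and counts per group gives the same averages. The sketch below is an illustration with made-up rows; `average_by_group` is a hypothetical helper, and the assumed equivalent call on the real data would be `average_by_group(free_apple_english, -5, 5)`.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column} using a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows in the form [name, genre, rating count]
demo = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]
averages = average_by_group(demo, 1, 2)
print(averages)
```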


Navigation has the highest average number of ratings, but two apps (Waze and Google Maps) account for almost all of them. This figure is misleading and should be dismissed:

In [68]:
for app in free_apple_english:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Certain other genres follow a similar pattern. Music and Social Networking stand out because of a few heavyweights: Facebook, Skype, and Instagram skew the Social Networking average, and Spotify and Shazam do the same for Music.

These genres are of little use to us unless we remove the juggernauts, which we will do later.
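
One way to see how much the heavyweights distort an average, without removing any rows, is to compare the mean with the median. This comparison is an addition for illustration and not the notebook's own method; it uses the six Navigation rating counts printed earlier and the standard library's `statistics` module.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed in the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)
print('mean:', nav_mean)
print('median:', nav_median)
```

The mean matches the Navigation average reported earlier (about 86,090), while the median is 8196.5, which gives a sense of how strongly two apps pull the average upward.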

Health & Fitness apps have 23,298 user ratings on average. Although MyFitnessPal holds the top spot, the gap between it and the next tier down is not as large. This would be an avenue to investigate.

In [69]:
for app in free_apple_english:
    if app[-5] == 'Health & Fitness':
        print(app[1], ':', app[5])

Calorie Counter & Diet Tracker by MyFitnessPal : 507706
Lose It! – Weight Loss Program and Calorie Counter : 373835
Weight Watchers : 136833
Sleep Cycle alarm clock : 104539
Fitbit : 90496
Period Tracker Lite : 53620
Nike+ Training Club - Workouts & Fitness Plans : 33969
Plant Nanny - Water Reminder with Cute Plants : 27421
Sworkit - Custom Workouts for Exercise & Fitness : 16819
Clue Period Tracker: Period & Ovulation Tracker : 13436
Headspace : 12819
Fooducate - Lose Weight, Eat Healthy,Get Motivated : 11875
Runtastic Running, Jogging and Walking Tracker : 10298
WebMD for iPad : 9142
8fit - Workouts, meal plans and personal trainer : 8730
Garmin Connect™ Mobile : 8341
Record by Under Armour, connects with UA HealthBox : 7754
Fitstar Personal Trainer : 7496
My Cycles Period and Ovulation Tracker : 7469
Seven - 7 Minute Workout Training Challenge : 6808
RUNNING for weight loss: workout & meal plans : 6407
Lifesum – Inspiring healthy lifestyle app : 5795
Waterlogged - Daily Hydration Tr

There is no clear winner in this genre below MyFitnessPal and Lose It!. If we build on the success of exercise- and alarm-based apps, we might have a good direction. A fitness app with alerts for daily events, a social connection through the app, and points to collect for each goal met could be the solution.

By letting users spend points on avatars and earn rankings, we gamify the exercise. By letting them connect with friends and peers to compare rankings and leave words of encouragement, we add the social aspect. This approach would also be family friendly, and it seems to tick all the boxes: practical, social, and fun. Ad revenue would be a logical choice for such an app.

If GooglePlay gives us a similar conclusion, we should consider this angle.

## Most Popular Apps by Genre on Google Play

With "Installs" available in this data set, it is far easier to estimate how many users have these apps. The column does not give us daily-use statistics, however, so many installs could be one-time downloads or repeat downloads onto different phones.

Beyond that, install counts are grouped into bins (such as 100,000+), so the precise number is not available. This is something to be aware of.

In [70]:
display_table(free_google_english, 5) # the Installs columns

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Without further information, we must take these values at face value, which also keeps later calculations simple. "1,000,000+" will be treated as 1,000,000: we drop the "+" and "," characters and convert the result to a float.
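
The conversion described above can be isolated in a small helper so that it is easy to check on individual values. This is a minimal sketch; `parse_installs` is a hypothetical name, and the sample strings are bins that appear in the table above.

```python
def parse_installs(text):
    """Convert an Installs bin such as '1,000,000+' to a float (1000000.0)."""
    return float(text.replace(',', '').replace('+', ''))

for raw in ['1,000,000+', '500+', '0+', '0']:
    print(raw, '->', parse_installs(raw))
```

Because each bin is a lower bound, the resulting numbers understate the true install counts; they are only suitable for rough comparisons between categories.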

The next loop converts the column to numeric values and computes the average number of installs per category:

In [71]:
categories_android = freq_table(free_google_english, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_google_english:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Communication apps have the highest average number of installs, but as in the App Store data, a few giants (Skype, Messenger, WhatsApp) with over a billion installs are large outliers. To get a more balanced picture of what is being used, it is best to remove them and other apps with very high install counts.

In [72]:
for app in free_google_english:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The average drops roughly tenfold once the skewing values from the dominating apps are removed:

In [73]:
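under_100_m = []

for app in free_google_english:
    # Strip the thousands separators and the trailing '+' so the count parses as a number.
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))

# Average installs for communication apps below 100 million installs.
sum(under_100_m) / len(under_100_m)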

3603485.3884615386
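The same capped average could be wrapped in a reusable function so other categories and thresholds can be tested without copying the loop. This is only a sketch: `parse_installs`, `average_installs_below`, and the `sample_apps` rows are hypothetical, and the rows assume the notebook's column layout (category at index 1, installs at index 5).

```python
def parse_installs(raw):
    """Turn an install string such as '1,000,000+' into the float 1000000.0."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs_below(apps, category, cap):
    """Average installs for one category, ignoring apps at or above cap."""
    counts = [parse_installs(app[5]) for app in apps
              if app[1] == category and parse_installs(app[5]) < cap]
    # Guard against an empty selection to avoid dividing by zero.
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical rows: only index 1 (category) and index 5 (installs) matter here.
sample_apps = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App D', 'TOOLS', '', '', '', '5,000,000+'],
]

capped_average = average_installs_below(sample_apps, 'COMMUNICATION', 100_000_000)
print(capped_average)
```

With the real data, passing `free_google_english` in place of `sample_apps` should reproduce the figure above, assuming the column positions match.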

Finally, let's target the group we had already identified from the AppleStore data. With an average of 4,188,821 installs, the Health and Fitness category doesn't seem to be oversaturated.

Let's make sure that this figure isn't skewed as well by looking at the number of installs in this category:

In [74]:
for app in free_google_english:
    if app[1] == 'HEALTH_AND_FITNESS':
        print(app[0], ':', app[5])

Step Counter - Calorie Counter : 500,000+
Lose Belly Fat in 30 Days - Flat Stomach : 5,000,000+
Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Six Pack in 30 Days - Abs Workout : 10,000,000+
Lose Weight in 30 Days : 10,000,000+
Pedometer : 10,000,000+
LG Health : 10,000,000+
Step Counter - Pedometer Free & Calorie Counter : 10,000,000+
Pedometer, Step Counter & Weight Loss Tracker App : 10,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
30 Day Fitness Challenge - Workout at Home : 10,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Fat Burning Workout - Home Weight lose : 100,000+
Buttocks and Abdomen : 500,000+
Walking for Weight Loss - Walk Tracker : 100,000+
Running & Jogging : 500,000+
Sleep Sounds : 1,000,000+
Fitbit : 10,000,000+
Lose Belly Fat-Home Abs Fitness Workout : 50,000+
Cycling - Bike Tracker : 500,000+
Abs Training-Burn belly fat : 100,000+
Calorie Counter - EasyFit free : 1,000,000+
Aunjai i lert u : 500,000+
Garmin Connect

Burn Your Fat With Me! FG : 1,000,000+
FH Calculator : 500+
Restaurant Inspections - FL : 10,000+
Florida Blue : 100,000+
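
Reading a long printout is one way to check for skew; comparing the mean with the median is another, since a few very large apps pull the mean up while leaving the median largely untouched. The sketch below uses hypothetical install strings, not the real category data, and `parse_installs` is an assumed helper name.

```python
from statistics import mean, median


def parse_installs(raw):
    """Turn an install string such as '1,000,000+' into a float."""
    return float(raw.replace(',', '').replace('+', ''))


# Hypothetical install strings with one very large outlier.
sample_installs = ['1,000,000+', '500,000+', '10,000,000+',
                   '100,000+', '1,000,000,000+']

values = [parse_installs(raw) for raw in sample_installs]
sample_mean = mean(values)
sample_median = median(values)

# A mean far above the median suggests a few large apps dominate the average.
print('mean  :', sample_mean)
print('median:', sample_median)
```

If the two numbers sit close together for a category, its average is probably a fair summary; a large gap is a hint to inspect the top apps.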


This is a good sign: only 1 or 2 apps are above the 10+ million download mark, and the apps with low install counts seem to be regional, health-related apps. We can leverage this to drill down into the popular, universal Health and Fitness apps by removing the skew from the top and bottom of the list:

In [75]:
# Keep only the Health and Fitness apps in the mid-range install brackets.
mid_brackets = ('1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+')
for app in free_google_english:
    if app[1] == 'HEALTH_AND_FITNESS' and app[5] in mid_brackets:
        print(app[0], ':', app[5])

Lose Belly Fat in 30 Days - Flat Stomach : 5,000,000+
Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Six Pack in 30 Days - Abs Workout : 10,000,000+
Lose Weight in 30 Days : 10,000,000+
Pedometer : 10,000,000+
LG Health : 10,000,000+
Step Counter - Pedometer Free & Calorie Counter : 10,000,000+
Pedometer, Step Counter & Weight Loss Tracker App : 10,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
30 Day Fitness Challenge - Workout at Home : 10,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Sleep Sounds : 1,000,000+
Fitbit : 10,000,000+
Calorie Counter - EasyFit free : 1,000,000+
Garmin Connect™ : 10,000,000+
BetterMe: Weight Loss Workouts : 5,000,000+
Bike Computer - GPS Cycling Tracker : 1,000,000+
Running Distance Tracker + : 1,000,000+
Runkeeper - GPS Track Run Walk : 10,000,000+
Walking: Pedometer diet : 1,000,000+
8fit Workouts & Meal Planner : 10,000,000+
Keep Trainer - Workout Trainer & Fitness Coach : 1,000,000+
PumpUp — Fitness Co

We can see that the middle of the popularity range is crowded in the Health and Fitness category of GooglePlay. Many apps offer similar ideas, with things like "Pedometer" and "Abs" doing extremely well.
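
How crowded each bracket is can also be counted rather than judged from the printout. The sketch below tallies install brackets with `collections.Counter`. The list holds only the first six brackets from the output above, so the counts are illustrative, and `parse_installs` is an assumed helper name.

```python
from collections import Counter


def parse_installs(raw):
    """Turn an install string such as '1,000,000+' into an int for sorting."""
    return int(raw.replace(',', '').replace('+', ''))


# Install brackets of the first six apps in the output above.
sample_brackets = ['5,000,000+', '1,000,000+', '10,000,000+',
                   '10,000,000+', '10,000,000+', '10,000,000+']

bracket_counts = Counter(sample_brackets)

# Print the brackets from smallest to largest with the number of apps in each.
for bracket, count in sorted(bracket_counts.items(),
                             key=lambda item: parse_installs(item[0])):
    print(bracket, ':', count)
```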

Many of these apps are tied to companies as well. If we differentiate from them on the social aspect of our program, a timed workout social ranking system would likely fare well in the GooglePlay store too.

### Conclusion

In conclusion, based on the information analyzed from free English apps in both the GooglePlayStore and the AppleStore, building a free Fitness app is a good decision.

Incorporating a gamified ranking system will appeal to the lion's share of both markets, and adding a social context gives users what they prefer on Android.

Setting this program apart with interactivity and design should be enough to attract lucrative ad revenue contracts.