# Free Mobile App Usage

In this project, our goal is to identify profitable mobile app profiles that can inform our development team's future efforts. Our company only builds apps that are free to download and install, and our main source of revenue is advertiser payments for in-app ads. This means that the number of active users of our apps determines our revenue: the more users who see and engage with the ads, the better. Other considerations, such as how likely a new app is to break into a given market category, can help us balance our recommendations.

We will use two data sources for our analysis: one containing Apple App Store data and the other containing Google Play Store data. They can be downloaded from the links below.
* [Apple App Store Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps)

### Summary of Results

Categories/genres in the middle to mid-high range for both average installs and number of apps are the most attractive. A few categories/genres meet these criteria across both marketplaces:
* Video Players & Editors
* Photography
* Entertainment
* Books and Reference

## Data Exploration

In [1]:
# read in datasets
from csv import reader

# Apple App Store data (the first row holds the column names)
with open('AppleStore.csv', encoding='utf8') as handle_apple:
    apple_app_data = list(reader(handle_apple))
apple_headers = apple_app_data[0]

# Google Play data (the first row holds the column names)
with open('googleplaystore.csv', encoding='utf8') as handle_google:
    google_app_data = list(reader(handle_google))
google_headers = google_app_data[0]

To make re-querying our data simpler, let's create a function that prints a slice of rows from a dataset and, optionally, the number of rows and columns in that dataset.

In [2]:
# function that prints a slice of rows from a list of lists and, optionally, the number of rows and columns
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)

    print('\n')  # adds empty lines after the printed rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

___
Test pull of the datasets using the custom function.

In [3]:
# print the column names and a sample of the data from our two datasets
print('Column Names')

print('Apple:')
explore_data(apple_app_data, 0, 1, False)
print('Google:')
explore_data(google_app_data, 0, 1, False)

print('Data')
print('Apple Store')
explore_data(apple_app_data, 1, 5, True)
print('\n')

print('Google Play')
explore_data(google_app_data, 1, 5, True)
print('\n')

Column Names
Apple:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Data
Apple Store
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows

The Apple App Store dataset consists of 16 columns and 7198 rows, including the header row. There is a mixture of numeric and descriptive columns.

The Google Play Store dataset consists of 13 columns and 10842 rows, including the header row. There is a mixture of numeric and descriptive columns.

There is an overlap in the columns of data that will allow us to compare the rows across the datasets. Those that will likely be the most important for this analysis are those that tell us the name of the app, (sub)categories, price, number of users and rating/reviews.

## Data Clean-up
We will want to ensure that the datasets are cleaned before we conduct an analysis. 

### Handling Missing Data - Google Play

Let's first evaluate and clean the Google Play dataset. We found online discussion on the Google dataset about [missing data](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). Let's recreate this and decide on a course of action.

In [4]:
explore_data(google_app_data, 10473, 10474, False)  # malformed row at index 10473

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




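This row has only 12 values instead of 13: the `Category` value is missing, so the remaining values are shifted one column to the left (the rating `1.9` sits in the `Category` position). Let's delete the entire row.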
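
Rather than relying on a row index found in an online discussion, rows like this one could also be located programmatically. The following is a minimal sketch, assuming a malformed row can be recognized by having a different number of values than the header row. The helper name `find_malformed_rows` and the sample data are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny stand-in for the real data: a header plus two rows, one missing a value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Called on `google_app_data`, a helper like this would list every row whose length does not match the header, which can then be inspected before deleting anything.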

In [5]:
# delete row
del google_app_data[10473]

# verify the app row was removed
explore_data(google_app_data,10472,10475,False)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']




### Handling Duplicate Rows - Google Play
There may be rows that are duplicates of each other. Let's first identify which rows those are and then decide how to handle them.

In [6]:
# loop through the names of the apps and add them to a list if they are a duplicate

unique_apps = []
duplicate_apps = []

for app in google_app_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else :
        unique_apps.append(name)
        
print('Total Duplicate Apps: ', len(duplicate_apps))
print('\n')
print('Sample Duplicate Apps: ', duplicate_apps[:15])
        

Total Duplicate Apps:  1181


Sample Duplicate Apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
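
The duplicate check above tests `name in unique_apps` against a list, which rescans the list for every row. As a hedged alternative, the same count can be produced in a single pass with `collections.Counter`. The function name `count_duplicates` and the sample rows are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0, has_header=True):
    """Return {name: extra_copies} for names that appear more than once."""
    rows = dataset[1:] if has_header else dataset
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Made-up rows for illustration only.
sample = [
    ['App', 'Reviews'],
    ['Box', '10'],
    ['Slack', '7'],
    ['Box', '12'],
    ['Box', '11'],
]

duplicates = count_duplicates(sample)
print(duplicates)
print('Total duplicate rows:', sum(duplicates.values()))
```

Summing the values counts every extra copy, which matches how the loop above appends a name to `duplicate_apps` each time it is seen again.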


There are 1181 duplicate entries in the Google Play dataset. Let's test the hypothesis that the duplicates differ in their number of reviews, a column that is likely to change over time.

In [7]:
for app in google_app_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Our test showed that the duplicate rows differ in their total number of reviews (the fourth column, index 3), which suggests that each record is a snapshot taken at a different time. Based on this, for each app we will keep the row with the highest number of reviews, which is most likely the most recent, and remove the other duplicates.

To clean our data of duplicates, we will create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app. We expect to see 9659 key-value pairs in the dictionary, which we found by subtracting the number of duplicates (1181) from the total number of apps (10840).

In [8]:
reviews_max = {}

for row in google_app_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


We now have a dictionary that maps each unique app name to the highest review count among its duplicates. We can use this dictionary to remove the duplicates from our dataset.

We do this by creating two empty lists and looping through the dataset. For each row, we check whether its review count equals the maximum stored in the dictionary for that app name. The first list, `googleplay_clean`, stores each row that matches. The second, `already_added`, stores the names of apps that are already in the clean list, so that if several duplicates share the same maximum review count, only the first one is kept.

In [9]:
googleplay_clean = []
already_added = []

for row in google_app_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # keep the first row whose review count equals the maximum for this app
    if name not in already_added and reviews_max[name] == n_reviews:
        already_added.append(name)
        googleplay_clean.append(row)

Let's check that our dataset has been changed by exploring the new dataset __googleplay_clean__, which is a version of the original dataset with incomplete and duplicate entries removed. From the steps we completed earlier, we expect to see 9659 rows in the list.

In [10]:
explore_data(googleplay_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We see that the number of rows matches our expectation.
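
The two-pass approach above (build `reviews_max`, then filter) can also be written as a single pass. This is a sketch under the same assumption that the row with the most reviews is the most recent one. The name `keep_most_reviewed` is hypothetical, and the sample rows are shortened to four columns for readability.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the earliest row, as in the two-pass version.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened, illustrative rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

One difference to be aware of: this version orders the result by each app's first appearance, whereas the two-pass version orders it by where the kept row sits in the dataset. The set of kept rows should be the same.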

### Handling Duplicate Rows - Apple App Store
Let's check if the Apple App Store dataset contains duplicate rows.

In [11]:
unique_apps = []
duplicate_apps = []

# skip the header row and compare apps by their id
for row in apple_app_data[1:]:
    app_id = row[0]
    if app_id not in unique_apps:
        unique_apps.append(app_id)
    else :
        duplicate_apps.append(app_id)
        
print(len(unique_apps))
print(len(duplicate_apps))


7197
0


These results tell us that there are no duplicate app ids in the Apple dataset.

### Non-English Apps
Both the Apple App Store and the Google Play Store have applications created for English-speaking and non-English-speaking audiences. Our business, however, only develops applications for an English-speaking audience, so it makes sense to remove the other apps where reasonably possible. Let's explore apps that are not made for English-speaking audiences and decide how to remove them.

In [12]:
# print app names from Google and Apple
print(apple_app_data[814][1])
print(apple_app_data[6732][1])
print('\n')
print(googleplay_clean[4412][0])
print(googleplay_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


Using the `ord()` function, we can identify which apps contain characters not typically used in English. The characters commonly used in English text fall within the ASCII range (0-127).

In [13]:
# show examples of unicode characters in and out of ord range
print(ord('a'))
print(ord('爱'))

97
29233


We can loop over the characters of each app name and check whether each one falls in this range (0-127). Some English app names contain a few characters outside the range, such as emoji or the ™ symbol, so we will use some discretion: only if a name contains more than three such characters do we treat the app as unlikely to be aimed at the English market.

In [14]:
# Function that takes in a string and returns `False` if there are more than three non-English characters
def english(string):
    count = 0
    for letter in string:
        if ord(letter) > 127:
            count = count + 1
        if count > 3:
            return False
    return True

# examples of the function working
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
True
True
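
For reference, the same heuristic can be expressed more compactly, with the threshold exposed as a parameter. This is only a sketch: the name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced here for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when at most max_non_ascii characters fall outside ASCII (0-127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


for app_name in ['Instagram',
                 '爱奇艺PPS -《欢乐颂2》电视剧热播',
                 'Docs To Go™ Free Office Suite',
                 'Instachat 😜']:
    print(app_name, '->', is_probably_english(app_name))
```

Note that this remains a rough filter. A name written in another language using only ASCII letters would still pass, and an English name with more than three emoji would be rejected.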


We can now loop through both of our entire datasets and, using the function above, systematically remove every app that is not for the English market.

In [15]:
# loop through the Google Play dataset and categorize apps into english and non-english lists
googleplay_eng = []
googleplay_non_eng = []

for app in googleplay_clean:
    name = app[0]
    if english(name) is True :
        googleplay_eng.append(app)
    else :
        googleplay_non_eng.append(app)

print('Count English Google Play Length: ', len(googleplay_eng))
print('Count Non-English Google Play Length: ', len(googleplay_non_eng))

# loop through the Apple dataset and categorize apps into english and non-english lists
apple_eng = []
apple_non_eng = []

apple_app_data = apple_app_data[1:]

for app in apple_app_data:
    name = app[1]
    if english(name) is True :
        apple_eng.append(app)
    else :
        apple_non_eng.append(app)
        
print('Count English Apple Length: ', len(apple_eng))
print('Count Non-English Apple Length: ', len(apple_non_eng))

Count English Google Play Length:  9614
Count Non-English Google Play Length:  45
Count English Apple Length:  6183
Count Non-English Apple Length:  1014


Before moving on, let's do a quick gut-check that the apps in the non-English lists really are non-English.

In [16]:
print('Google Play')
print(googleplay_non_eng[0:5])
print('\n')
print('Apple')
print(apple_non_eng[0:5])

Google Play
[['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up'], ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'], ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'], ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']]


Apple
[['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1'], ['405667771'

### Removing Non-Free Apps
Our business develops apps that are free to download and use, but generate ad revenue from in-app ads. The characteristics for paid and free apps can be quite different. Let's now filter both of our datasets to only include free apps.

In [17]:
# loop through Google Play dataset and filter free and paid apps into separate lists
googleplay_free = []
googleplay_paid = []

for app in googleplay_eng:
    cost = app[6]
    if cost == 'Free':
        googleplay_free.append(app)
    else:
        googleplay_paid.append(app)

# loop through apple dataset and filter free and paid apps into separate lists
apple_free = []
apple_paid = []

for app in apple_eng:
    cost = app[4]
    if cost == '0.0':
        apple_free.append(app)
    else:
        apple_paid.append(app)

# check that both datasets only contain free apps
print(apple_free[0:3])
print(len(apple_free))
print('\n')
print(googleplay_free[0:3])
print(len(googleplay_free))

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]
3222


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
8863
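
The Apple filter above compares the price to the exact string `'0.0'`, which would miss a free app whose price happened to be written as `'0'` or `'0.00'`. A slightly more defensive check could parse the price as a number instead. This is a sketch only: the helper name `is_free` is hypothetical, and the `'$'` prefix on paid Google Play prices is an assumption about the raw data, since only the free value `'0'` appears in the rows shown here.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means zero."""
    # Assumption: paid prices may carry a leading '$', which is stripped before parsing.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$0.00', '$4.99']:
    print(price, '->', is_free(price))
```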


## App Category Analysis
Our end goal is to develop an app that is successful in both the Google Play and Apple App Store marketplaces. Our stakeholders' most important success metric is revenue, and the KPI that most directly impacts revenue is the number of installs our apps have in each marketplace.

To minimize risks and overhead, the business's validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Our analysis will follow a similar structure: we first evaluate app category performance on Google Play, and then look for where there is opportunity in the Apple App Store.

### Identifying Most Relevant Columns
Our Google Play dataset consists of 13 columns and our Apple App Store dataset consists of 16 columns, not all of which may be useful in this analysis.

Let's print out our columns and identify key columns.

In [18]:
print('Google Play: ')
print(google_headers)
print('\n')
print('Apple: ')
print(apple_headers)

Google Play: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple: 
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The columns that will be useful in further analysis are:
* __Google Play__
 * `Category`
 * `Genres`
 * `Installs`

* __Apple App Store__
 * `prime_genre`
 * `rating_count_tot`

The Google Play dataset contains our primary KPI, the count of app installs, but our Apple App Store data does not. As a proxy for installs, we can use `rating_count_tot`, which tells us how many user ratings an app has received. We assume that app installs and rating counts are correlated.
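
Because the Google Play data contains both a `Reviews` column and an `Installs` column, the proxy assumption could in principle be checked there. The following sketch computes a Pearson correlation coefficient by hand. It is not part of the original analysis, the function name `pearson` is introduced here, and the sample numbers are made up purely to show the call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if either sequence is constant.
    return covariance / (spread_x * spread_y)


# Made-up illustrative numbers, not taken from either dataset.
reviews = [10.0, 200.0, 3000.0, 45000.0]
installs = [1000.0, 10000.0, 100000.0, 1000000.0]
print(round(pearson(reviews, installs), 3))
```

A value close to 1 on the real Google Play columns (after converting them to numbers) would lend some support to using rating counts as a stand-in for installs on the Apple side, though it would not prove the same relationship holds there.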

### Count of Apps in Google Play Categories

One indicator that a category is lucrative for app developers, and that it may be open to new entrants, is the number of apps currently in that category.

Let's create frequency tables of our Google Play dataset.

In [19]:
# function that takes in a dataset and creates a normalized frequency table of the given index
def freq_table(dataset, index):
    frequency_table = {}
    total = 0

    # generates a frequency table
    for row in dataset:
        target = row[index]
        total += 1
        if target in frequency_table:
            frequency_table[target] += 1
        else:
            frequency_table[target] = 1

    # normalizes the counts into percentages
    for value in frequency_table:
        percentage = round((frequency_table[value] / total) * 100, 2)
        frequency_table[value] = percentage

    return frequency_table

To make our frequency table more readable, let's create a function that prints its entries sorted from most to least frequent.

In [20]:
def display_table(dataset, index): #dataset is a list of lists and index will be an integer
    table = freq_table(dataset, index) # use freq_table function
    table_display = [] 
    
    #transforms the frequency tables into a list of tuples
    for key in table: 
        key_val_as_tuple = (table[key], key) # turns key values into a tuple in reverse order 
        table_display.append(key_val_as_tuple) # appends tuple to list
        
    #takes our complete list and sorts it in reverse (descending) order
    table_sorted = sorted(table_display, reverse = True) 
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) #prints our sorted lists of lists back in key - value form in descending order
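
The two functions above can also be sketched with `collections.Counter` from the standard library, returning the sorted table instead of printing it. The name `freq_table_sorted` and the sample rows are assumptions made for this illustration.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    table = [(value, round(count / total * 100, 2)) for value, count in counts.items()]
    return sorted(table, key=lambda pair: pair[1], reverse=True)


# Made-up rows: app name and category.
sample = [
    ['a', 'TOOLS'],
    ['b', 'GAME'],
    ['c', 'TOOLS'],
    ['d', 'TOOLS'],
]

for value, percentage in freq_table_sorted(sample, 1):
    print(value, ':', percentage)
```

Returning the pairs rather than printing them makes the table easier to reuse in later steps, for example when comparing category frequency with average installs.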

Let's now use our frequency table on the `Genres` column of the Google Play dataset.

In [21]:
display_table(googleplay_free, 9) #Genre

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;B

In [22]:
display_table(googleplay_free, 1) #Category

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


By counting the apps within Google Play Categories and Genres we see:
1) The most common app genres are 'Tools', 'Entertainment', 'Education', and 'Business'<br>
2) The most common app categories are 'Family', 'Game', 'Tools', and 'Business'<br>
3) There is a hierarchy in how categories and genres are classified, categories being higher level and genres falling beneath them. This may be useful in future analysis.

Whichever genre we choose, it must also have a chance of success in the Apple App Store. Let's compare these findings with the frequency of apps in the `prime_genre` column of the Apple App Store data.

### Count of Apps in Apple App Store Genres

In [23]:
display_table(apple_free, 11) #prime_genre

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


From this frequency table we see:
1) The most common app genre is 'Games', representing about 58% of the free English apps in the dataset. <br>
2) The top genres, generally, are more geared towards entertainment and lifestyle.

The mix of apps seems to vary by platform. Generally, apps that provide utility are more common on the Google Play marketplace, while the most common Apple apps provide some sort of entertainment. This may be a result of the customizability that each operating system allows its developers to exercise.

Considering these tables together:  
- The gaming category is a prime target, being the second most common category on Android and by far the most common on iOS. The Google Play gaming genres Action, Simulation, Arcade, Casual, and Puzzle are all well represented.
- Entertainment, Education, and Utilities/Tools apps are common in both marketplaces.

### Estimating Active Users with Installs and Total Reviews
To get an idea of which genres have the most users, we can take the `Installs` column (index `5`) from our Google Play data and the `rating_count_tot` column (also index `5`), our closest proxy, from the Apple dataset, and average each across genres.

To accomplish this, we will sum installs and total ratings for each genre and divide by the number of apps in that genre.

For Google Play, we have install data for our apps, but not exact numbers. The apps fall into buckets with fairly large ranges. For example, an app in the 1,000,000+ bucket could have 1,000,000 installs or 3,000,000 installs. We will therefore take each bucket at face value, so a 1,000,000+ app is assumed to have exactly 1,000,000 installs.

In [24]:
# use the display_table function selecting installs
display_table(googleplay_free, 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05


The percentages themselves aren't important here. However, the bins are stored as strings, which we can't sum or average. Let's convert them from strings to numbers so we can better sort, analyze, and interpret our installation bins.

In [25]:
for app in googleplay_free:
    installs = app[5]
    installs = installs.replace('+','')
    installs = installs.replace(',','')
    installs = float(installs)
    app[5] = installs
    
# use the display_table function selecting installs
display_table(googleplay_free, 5)

1000000.0 : 15.73
100000.0 : 11.55
10000000.0 : 10.55
10000.0 : 10.2
1000.0 : 8.39
100.0 : 6.92
5000000.0 : 6.83
500000.0 : 5.56
50000.0 : 4.77
5000.0 : 4.51
10.0 : 3.54
500.0 : 3.25
50000000.0 : 2.3
100000000.0 : 2.13
50.0 : 1.92
5.0 : 0.79
1.0 : 0.51
500000000.0 : 0.27
1000000000.0 : 0.23
0.0 : 0.05
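
One caveat with the conversion cell above: it overwrites `app[5]` in place, so running the cell a second time would raise an `AttributeError`, because a float has no `replace` method. A re-run-safe variant could look like the sketch below. The helper name `parse_installs` is hypothetical.

```python
def parse_installs(value):
    """Convert an install bucket such as '1,000,000+' to a float lower bound."""
    if isinstance(value, (int, float)):
        return float(value)  # already converted, e.g. when the cell is re-run
    return float(value.replace('+', '').replace(',', ''))


for raw in ['1,000,000+', '0+', 5000.0]:
    print(raw, '->', parse_installs(raw))
```

In the loop above, `app[5] = parse_installs(app[5])` would then give the same result no matter how many times the cell is executed.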


We can now manipulate install values for our app categories. Let's create a few functions that will allow us to sum the installs for each category and look at averages. 

In [26]:
# function that totals the values in one column (sum_col_index) for each unique value in another column (index)
def sum_table(dataset, index, sum_col_index):
    totals = {}

    # generates the sum table
    for row in dataset:
        target = row[index]
        value = float(row[sum_col_index])
        if target in totals:
            totals[target] += value
        else:
            totals[target] = value
    return totals

# function that prints the sum table in descending order, optionally as per-group averages
def sorted_sum_table(dataset, index, sum_col_index, average=False):
    table = sum_table(dataset, index, sum_col_index)

    if average:
        # counts the rows in each group
        counts = {}
        for row in dataset:
            target = row[index]
            if target in counts:
                counts[target] += 1
            else:
                counts[target] = 1

        # turns each total into an average
        for key in table:
            table[key] = table[key] / counts[key]

    # transforms the table into a list of (value, key) tuples
    rows = []
    for key in table:
        rows.append((table[key], key))

    # sorts the list in descending order and prints it in key : value form
    for entry in sorted(rows, reverse=True):
        print(entry[1], ':', entry[0])

Let's look at the average installs for the categories and genres of the Google Play dataset.

In [27]:
# category average installs
sorted_sum_table(googleplay_free, 1, 5, average=True) 

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

In [28]:
# genre average installs Google Play Store
sorted_sum_table(googleplay_free, 9, 5, average=True) 

Communication : 38456119.167247385
Adventure;Action & Adventure : 35333333.333333336
Video Players & Editors : 24947335.796178345
Social : 23253652.127118643
Arcade : 22888365.48780488
Casual : 19569221.602564104
Puzzle;Action & Adventure : 18366666.666666668
Photography : 17840110.40229885
Educational;Action & Adventure : 17016666.666666668
Productivity : 16787331.344927534
Racing : 15910645.681818182
Travel & Local : 14051476.145631067
Casual;Action & Adventure : 12916666.666666666
Action : 12603588.872727273
Strategy : 11339901.3125
Tools : 10802461.246995995
Tools;Education : 10000000.0
Role Playing;Brain Games : 10000000.0
Lifestyle;Pretend Play : 10000000.0
Casual;Music & Video : 10000000.0
Card;Action & Adventure : 10000000.0
Adventure;Education : 10000000.0
News & Magazines : 9549178.467741935
Music : 9445583.333333334
Educational;Pretend Play : 9375000.0
Puzzle;Brain Games : 9280666.666666666
Word : 9094458.695652174
Racing;Action & Adventure : 8816666.666666666
Books & Refere

In [29]:
# genre average installs Apple App Store
sorted_sum_table(apple_free, 11, 5, average=True) 

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


Social, Video Players & Editors, Photography Apps, Entertainment, and Travel/Navigation/Weather applications received high average installs/proxy installs in both the Google Play and Apple App Store.
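
Averages like those above can be pulled upward by a handful of very large apps. As a possible cross-check, which is not part of the original analysis, the median per group could be compared with the mean. The sketch below uses `statistics.median` from the standard library. The function name `median_by_group` and the sample rows are made up for illustration.

```python
from statistics import median


def median_by_group(dataset, group_index, value_index):
    """Return (group, median) pairs sorted by descending median."""
    groups = {}
    for row in dataset:
        groups.setdefault(row[group_index], []).append(float(row[value_index]))
    medians = [(group, median(values)) for group, values in groups.items()]
    return sorted(medians, key=lambda pair: pair[1], reverse=True)


# Made-up rows: one giant app dominates the mean of the 'SOCIAL' group.
sample = [
    ['Giant', 'SOCIAL', 1000000000.0],
    ['Small 1', 'SOCIAL', 1000.0],
    ['Small 2', 'SOCIAL', 5000.0],
    ['Tool 1', 'TOOLS', 100000.0],
    ['Tool 2', 'TOOLS', 500000.0],
]

for group, value in median_by_group(sample, 1, 2):
    print(group, ':', value)
```

In this toy data the mean of the `SOCIAL` group is in the hundreds of millions while its median is 5000.0, which shows how a large gap between the two can flag a category dominated by a few apps.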

## Conclusion
Considering the average installs (or rating counts, for Apple) and the frequency distributions of apps together, we get a more complete picture. We want to develop an app that is capable of attracting a decent number of installs, but we don't want too much or too little competition among apps. The gaming category is an example of too much competition leading to market saturation. Games are among the most frequently occurring apps in both marketplaces, but the average installs they receive are not very high. This suggests that many games likely attract very few new players. On the other hand, a category/genre can be dominated by a few very large apps, making market penetration difficult. Social apps are an example of this: they have a high average number of users per app, but there are relatively few of them. Competing in this category will similarly be difficult.

Categories/genres in the middle to mid-high range for both average installs and number of apps are the most attractive. A few categories/genres meet these criteria across both marketplaces:
* Video Players & Editors
* Photography
* Entertainment
* Books and Reference

We leave these recommendations to the development team to determine which category to move forward with.