# Profitable App Profiles for the App Store & Google Play Markets
This project has four parts:
* I will start by clarifying the goal of the project (business understanding)
* Then I will collect relevant data and review it (data exploration)
* Next I'll clean the data to prepare it for analysis (data preparation)
* Finally I will analyze the cleaned data (data analysis)

**NOTE - this project is guided in spirit by Dataquest.io's guided project for the "Python for Data Science: Fundamentals" module. However, I do not use all of their methods. For example, I use pandas dataframes instead of lists of lists, with code written to suit dataframes.**

# Project objective (business understanding)
This project looks at iOS and Android mobile apps from the perspective of an analyst for a company which builds free mobile apps and makes money from ad revenue on those mobile apps. To this end, I analyze free apps by number of users to determine which kinds of apps are likely to attract more users.

My goal in this project is to develop a profile or set of profiles for profitable apps on the App Store & Google Play markets. That way, the company's developers have data to inform what kind of apps they build.

According to [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/), in September 2018 there were approximately 2 million iOS apps on the App Store, and 2.1 million Android apps on Google Play:
<img src='https://s3.amazonaws.com/dq-content/350/py1m8_statista.png'>

Since the data for all of those apps are not readily available, I will use two datasets which can function as samples of the data instead. There is [one data set](https://www.kaggle.com/lava18/google-play-store-apps/home) with approximately 10,000 Android apps from Google Play (collected August 2018) and [another](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) with approximately 7,000 iOS apps from the App Store (collected July 2017).

# Collecting and reviewing the data (data exploration)
**Note: I've uploaded the data to [my GitHub](https://github.com/gmayock/profitable_app_profiles). I tried to access it directly from Kaggle's servers but was unsuccessful. Please give all credit where credit is due to the appropriate creators of the datasets, as linked above.**

In [1]:
import pandas as pd
df_google_play = pd.read_csv('https://raw.githubusercontent.com/gmayock/profitable_app_profiles/master/googleplaystore.csv')
df_app_store = pd.read_csv('https://raw.githubusercontent.com/gmayock/profitable_app_profiles/master/AppleStore.csv', index_col=0)

### Identifying columns which could help with my analysis
First I print the shape of each dataframe along with its list of columns to see what they contain.

In [56]:
print("Google Play:",df_google_play.shape,"\n", list(df_google_play), "\n\nApp Store:",df_app_store.shape,"\n", list(df_app_store))

Google Play: (8862, 15) 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', 'num_non_eng_chars', 'Installs_count'] 

App Store: (3220, 17) 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic', 'num_non_eng_chars']


Then I print the head of each dataframe to see what the data looks like. First Google:

In [3]:
df_google_play.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Then the App Store:

In [4]:
df_app_store.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


Now I'll print the unique values counts ([nunique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html)): 

In [57]:
print("Google Play:\n",df_google_play.nunique(), df_google_play.shape,"\n\nApp Store:\n", df_app_store.nunique(), df_app_store.shape)

Google Play:
 App                  8862
Category               33
Rating                 39
Reviews              5121
Size                  406
Installs               21
Type                    1
Price                   1
Content Rating          6
Genres                115
Last Updated         1273
Current Ver          2695
Android Ver            33
num_non_eng_chars       4
Installs_count         20
dtype: int64 (8862, 15) 

App Store:
 id                   3220
track_name           3220
size_bytes           3203
currency                1
price                   1
rating_count_tot     2201
rating_count_ver      875
user_rating            10
user_rating_ver        10
ver                  1130
cont_rating             4
prime_genre            23
sup_devices.num        18
ipadSc_urls.num         6
lang.num               56
vpp_lic                 2
num_non_eng_chars       4
dtype: int64 (3220, 17)


### Summary 
Some features that stick out from Google Play are:
* Category 
* Genres
* Rating
* Reviews
* Installs
* Price

For the App Store:
* prime_genre
* rating_count_tot
* user_rating
* price

# Cleaning the data for analysis (data preparation)
The next step is to remove data which is not relevant for this project. First I'll remove wrong information - bad lines etc - and then, since the goal is to build a free app in English, I will remove information which doesn't fit those parameters.

## Google Play data
### Removing a row which was scraped wrong
A quick look at the [discussion forum](https://www.kaggle.com/lava18/google-play-store-apps/discussion/81460) on Kaggle for the Google Play data set reveals that the cells for index 10472 are shifted. Let's look at that first.

In [6]:
df_google_play[10472:10473]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


It looks like the error is still in the dataset. I could try to fix the data, or simply delete it. Let's consider fixing it.

#### Why I can't fix the row with wrong values
To fix it, I would need to shift the content of the cells over one to the right, then manually enter the value which was supposed to be in "Category", as that's the cell which is overwritten.

From my data exploration above, it looks like the "Genres" feature maps to the "Category" feature. However, for this row, the "Genres" feature is blank, so I can't use it to recover the Category. I'll delete the row instead.

#### Deleting it using loc
I could in theory drop the row by its index. However, in a Jupyter Notebook it's better to drop it in a way that won't cause an error if the cell is run more than once. The method I use can be run multiple times without throwing an error, as you see below:

In [7]:
df_google_play = df_google_play.loc[df_google_play['App'] != 'Life Made WI-Fi Touchscreen Photo Frame']
df_google_play.shape

(10840, 13)

In [8]:
df_google_play = df_google_play.loc[df_google_play['App'] != 'Life Made WI-Fi Touchscreen Photo Frame']
df_google_play.shape

(10840, 13)
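
To illustrate the difference between the two approaches, here is a minimal sketch on a toy dataframe. The app names and values are hypothetical stand-ins, not rows from the real data set; only the index label 10472 mirrors the bad row.

```python
import pandas as pd

# Toy stand-in for the Google Play dataframe (hypothetical values).
df = pd.DataFrame({"App": ["Good App", "Bad Row", "Other App"]}, index=[0, 10472, 2])

# Boolean-mask filtering: safe to run any number of times.
for _ in range(2):
    df = df.loc[df["App"] != "Bad Row"]
print(df.shape)

# Dropping by index label works once, but raises KeyError if the cell is re-run.
df2 = pd.DataFrame({"App": ["Good App", "Bad Row", "Other App"]}, index=[0, 10472, 2])
df2 = df2.drop(index=10472)
second_drop_failed = False
try:
    df2 = df2.drop(index=10472)
except KeyError:
    second_drop_failed = True
print("second drop failed:", second_drop_failed)

# errors="ignore" makes the label-based drop repeatable as well.
df2 = df2.drop(index=10472, errors="ignore")
```

Either the mask or `errors="ignore"` keeps the cell re-runnable; the mask has the added benefit of not depending on the row keeping its index label.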

### Dropping duplicate rows
#### Dropping duplicate rows using drop_duplicates()
It looks like the discussion forum mentions some duplicate rows, as well, so we'll take care of that with [pandas.DataFrame.drop_duplicates()](https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html). However, it's important to note that these duplicates vary mostly in number of reviews. Therefore I will make sure the "Reviews" column does not factor in.

In [9]:
col_list = list(df_google_play)
okay_to_drop_non_dupe = ['Reviews']
drop_dupe_list = [col for col in col_list if col not in okay_to_drop_non_dupe]
df_google_play_test = df_google_play.drop_duplicates(subset=drop_dupe_list)
df_google_play_test.shape

(9782, 13)

That removes over a thousand rows, or nearly 10% of the data. But perhaps this isn't the best way to do this, for two reasons:
1. It keeps whichever duplicate happens to appear first in the dataframe, which is arbitrary
2. It doesn't necessarily result in a single row for each app

To point one: if the number of reviews is different, it likely indicates the information was pulled at different times, because logically the number of reviews should only be able to go up. Therefore, it's more logical to keep the row with the highest number of reviews.

To point two: there are 9,660 unique values for App in the raw data, according to *df\_google\_play.nunique()*, yet 9,782 rows remain after this drop. The extra rows could be caused by differences in any of the other columns.

For example, since the rows may have been pulled on different dates, it's possible that "Last Updated" is causing duplicates which wouldn't be dropped by the above method. 

#### Dropping duplicate rows after sorting the dataframe
Let's see what it looks like if we drop duplicates with the subset being only "App". To incorporate the learning from point one, I'll first sort the dataframe by "Reviews" in descending order.

In [10]:
df_google_play_test = df_google_play.sort_values(by="Reviews", ascending=False)
df_google_play_test = df_google_play_test.drop_duplicates(subset="App")
df_google_play_test.shape

(9659, 13)

9,660 unique values, minus the one we dropped ('Life Made WI-Fi Touchscreen Photo Frame'), resulting in 9,659 unique rows. Wonderful. Now I'll assign this back to df_google_play.

In [11]:
df_google_play = df_google_play_test.copy()
df_google_play.shape

(9659, 13)
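
One caveat worth checking: if "Reviews" was read in as text (which can happen when a malformed row puts a non-numeric value in the column), `sort_values` orders the values lexicographically rather than numerically. The sketch below uses hypothetical toy data to show the effect and one possible remedy with `pd.to_numeric`; the `Reviews_num` column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data: the same app scraped twice with different review counts.
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Notes"],
    "Reviews": ["9", "10", "5"],  # strings, as they may be if the column is read as object
})

# Sorting strings is lexicographic, so "9" sorts above "10" in descending order.
by_text = toy.sort_values("Reviews", ascending=False).drop_duplicates(subset="App")

# Converting to numbers first keeps the row with the genuinely highest count.
toy["Reviews_num"] = pd.to_numeric(toy["Reviews"], errors="coerce")
by_number = toy.sort_values("Reviews_num", ascending=False).drop_duplicates(subset="App")

kept_by_text = by_text.loc[by_text["App"] == "Chat", "Reviews"].item()
kept_by_number = by_number.loc[by_number["App"] == "Chat", "Reviews"].item()
print(kept_by_text, kept_by_number)
```

The number of rows left after dropping duplicates is the same either way; only the choice of which duplicate survives changes.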

### Removing non-free apps

In [12]:
df_google_play = df_google_play.loc[df_google_play['Price'] == '0']
df_google_play.shape, df_google_play['Price'].value_counts()

((8903, 13), 0    8903
 Name: Price, dtype: int64)

### Removing non-English apps
It's somewhat difficult to remove non-English apps from the Google Play data set as it does not explicitly state the language of the app. 

The suggested way to address this is to filter out apps with more than three non-ASCII characters (code point above 127) in their names. This is used as a proxy for foreign language, although due to the pervasiveness of such characters in modern app names - "Lep's World 3 🍀🍀🍀" comes to mind, or "► MultiCraft ― Free Miner! 👍" - it's not a perfect method.

Nevertheless, it is the method I will use at this time. Other methods which were considered but ultimately rejected:
1. I could use the characters with code points between 128 and 255 to denote non-English (thus omitting the emojis, etc)
2. I could find ranges for the ordinal of the emojis and use characters between 128 and the ordinal start of the emojis to denote non-English (thus omitting the emojis but including characters like kanji, hiragana, etc).
3. I could look for supplementary data sources

In [13]:
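def nonEnglishCharacterCount(app_name):
    """Return the number of characters in app_name with a code point above 127."""
    non_eng_char_ct = 0
    for character in app_name:
        # ord() gives the Unicode code point; ASCII ends at 127
        if ord(character) > 127:
            non_eng_char_ct += 1
    return non_eng_char_ct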

In [14]:
df_google_play['num_non_eng_chars'] = [nonEnglishCharacterCount(i) for i in df_google_play['App']]

In [15]:
df_google_play = df_google_play.loc[df_google_play['num_non_eng_chars'] <= 3]
df_google_play.shape

(8862, 14)
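
As a rough illustration of why the threshold matters, and of what one of the rejected alternatives might look like, here is a small sketch. The function names and the CJK example name are hypothetical, and the letter-only variant relies on the standard library's `unicodedata` categories as a stand-in for hand-built emoji ranges; it is not the method used in this project.

```python
import unicodedata

def non_ascii_count(name):
    """Count characters whose code point is above 127 (the rule used above)."""
    return sum(1 for ch in name if ord(ch) > 127)

def non_ascii_letter_count(name):
    """Hypothetical variant: count only non-ASCII letters, skipping emoji and symbols."""
    return sum(
        1 for ch in name
        if ord(ch) > 127 and unicodedata.category(ch).startswith("L")
    )

# Three clover emoji, written as escapes so the code points are unambiguous.
emoji_name = "Lep's World 3 \U0001F340\U0001F340\U0001F340"
# A made-up name consisting of five kanji/katakana characters.
cjk_name = "\u7121\u6599\u30b2\u30fc\u30e0"

for name in (emoji_name, cjk_name):
    print(non_ascii_count(name), non_ascii_letter_count(name))
```

With the `<= 3` filter, the emoji-decorated English name is kept and the five-character CJK name is dropped. The letter-only count would separate the two even more clearly, at the cost of extra complexity.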

## App Store data
### Remove non-English apps

In [16]:
df_app_store.shape

(7197, 16)

In [17]:
df_app_store['num_non_eng_chars'] = [nonEnglishCharacterCount(i) for i in df_app_store['track_name']]
df_app_store = df_app_store.loc[df_app_store['num_non_eng_chars'] <= 3]
df_app_store.shape

(6183, 17)

In [18]:
df_app_store.dtypes

id                     int64
track_name            object
size_bytes             int64
currency              object
price                float64
rating_count_tot       int64
rating_count_ver       int64
user_rating          float64
user_rating_ver      float64
ver                   object
cont_rating           object
prime_genre           object
sup_devices.num        int64
ipadSc_urls.num        int64
lang.num               int64
vpp_lic                int64
num_non_eng_chars      int64
dtype: object

### Removing non-Free apps

In [19]:
df_app_store = df_app_store.loc[df_app_store['price'] == 0]
df_app_store.shape, df_app_store['price'].value_counts()

((3222, 17), 0.0    3222
 Name: price, dtype: int64)

### Removing duplicates
There are 7195 unique track_names and 7197 rows. Let's see what accounts for the difference.

In [20]:
dupe_list = ['Mannequin Challenge', 'VR Roller Coaster']
df_check1 = df_app_store.loc[df_app_store['track_name'] == dupe_list[0]]
df_check2 = df_app_store.loc[df_app_store['track_name'] == dupe_list[1]]
df_check1
# df_check2

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,num_non_eng_chars
10751,1173990889,Mannequin Challenge,109705216,USD,0.0,668,87,3.0,3.0,1.4,9+,Games,37,4,1,1,0
10885,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1,0


In [21]:
df_check2

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,num_non_eng_chars
4000,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1,0
7579,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1,0


At a glance, it looks like an old version is left in the list. There were no discussions about this at the source of the data, so I [started one](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409). We'll see if anything comes of it.

In the meantime, let's drop_duplicates as before.

In [22]:
df_app_store = df_app_store.sort_values(by='ver', ascending=False)
df_app_store = df_app_store.drop_duplicates(subset='track_name')
df_app_store.shape

(3220, 17)

Alright, it worked.
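
Two refinements could be considered here, sketched below on hypothetical toy data. First, `duplicated(keep=False)` can locate repeated names without hard-coding them. Second, because `ver` is a text column, sorting it compares strings character by character, so a version-aware sort key may pick the newer release more reliably. The `version_key` helper and its treatment of non-numeric parts are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the version numbers are invented for the example.
toy = pd.DataFrame({
    "track_name": ["VR Roller Coaster", "VR Roller Coaster", "Bible"],
    "ver": ["2.0.0", "10.1", "7.5.1"],
})

# keep=False marks every member of a duplicate group, not just the later ones.
dupes = toy[toy.duplicated(subset="track_name", keep=False)]

def version_key(ver):
    """Turn '10.1' into (10, 1); non-numeric parts count as 0 (an assumption)."""
    return tuple(int(part) if part.isdigit() else 0 for part in str(ver).split("."))

# Sort by the parsed version so that 10.1 ranks above 2.0.0, then keep the first row per name.
toy["ver_key"] = toy["ver"].map(version_key)
latest = toy.sort_values("ver_key", ascending=False).drop_duplicates(subset="track_name")

kept_ver = latest.loc[latest["track_name"] == "VR Roller Coaster", "ver"].item()
print(len(dupes), kept_ver)
```

A plain string sort on this toy data would rank "2.0.0" above "10.1", which is the situation the key function is meant to avoid.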

# Understanding what data we need (business understanding)
We have the data filtered to relevant information (free, English-language, non-duplicate apps), but what are we really looking for? 

The business objective is to develop a _successful_ app. Therefore, we should explore the data to determine which apps are successful on both Google Play and the App Store. 

The standard operating procedure for companies building this sort of app has three steps:
1. Build a minimal version of the app on Google Play
2. If the app has a good response, develop it further
3. If the app is profitable after a short time, port it to the App Store

The order of operating systems to develop can be switched depending on company competency, etc.

Nevertheless, it's important that we build a profile of apps which are successful on both Google Play _and_ the App Store, or we'll be leaving behind a large portion of the market ([about 45%](https://www.statista.com/statistics/266572/market-share-held-by-smartphone-platforms-in-the-united-states/)) which is accessible for presumably much less work than building a new app from scratch.

# Building the profile of a successful app (data analysis)
## Genre
First things first, let's explore which genres are most common in each market.

In [23]:
# print(df_google_play['Genres'].value_counts().to_dict())
df_google_play['Genres'].value_counts(normalize=True)

Tools                                  0.084405
Entertainment                          0.060709
Education                              0.053487
Business                               0.045926
Lifestyle                              0.038930
Productivity                           0.038930
Finance                                0.037012
Medical                                0.035206
Sports                                 0.034642
Personalization                        0.033175
Communication                          0.032385
Action                                 0.031031
Health & Fitness                       0.030806
Photography                            0.029452
News & Magazines                       0.027985
Social                                 0.026631
Travel & Local                         0.023245
Shopping                               0.022455
Books & Reference                      0.021440
Simulation                             0.020424
Dating                                 0

In [24]:
# print(df_google_play['Category'].value_counts().to_dict())
df_google_play['Category'].value_counts(normalize=True)

FAMILY                 0.189461
GAME                   0.096931
TOOLS                  0.084518
BUSINESS               0.045926
LIFESTYLE              0.039043
PRODUCTIVITY           0.038930
FINANCE                0.037012
MEDICAL                0.035206
SPORTS                 0.033965
PERSONALIZATION        0.033175
COMMUNICATION          0.032385
HEALTH_AND_FITNESS     0.030806
PHOTOGRAPHY            0.029452
NEWS_AND_MAGAZINES     0.027985
SOCIAL                 0.026631
TRAVEL_AND_LOCAL       0.023358
SHOPPING               0.022455
BOOKS_AND_REFERENCE    0.021440
DATING                 0.018619
VIDEO_PLAYERS          0.017942
MAPS_AND_NAVIGATION    0.013992
FOOD_AND_DRINK         0.012413
EDUCATION              0.011735
ENTERTAINMENT          0.009479
LIBRARIES_AND_DEMO     0.009366
AUTO_AND_VEHICLES      0.009253
HOUSE_AND_HOME         0.008237
WEATHER                0.008012
EVENTS                 0.007109
PARENTING              0.006545
ART_AND_DESIGN         0.006432
COMICS  

In [25]:
# print(df_app_store['prime_genre'].value_counts().to_dict())
df_app_store['prime_genre'].value_counts(normalize=True)

Games                0.581366
Entertainment        0.078882
Photo & Video        0.049689
Education            0.036646
Social Networking    0.032919
Shopping             0.026087
Utilities            0.025155
Sports               0.021429
Music                0.020497
Health & Fitness     0.020186
Productivity         0.017391
Lifestyle            0.015839
News                 0.013354
Travel               0.012422
Finance              0.011180
Weather              0.008696
Food & Drink         0.008075
Reference            0.005590
Business             0.005280
Book                 0.004348
Medical              0.001863
Navigation           0.001863
Catalogs             0.001242
Name: prime_genre, dtype: float64

Games are by far the largest genre across the two stores, but what's this Family category on Google Play?

In [60]:
df_family = df_google_play.loc[df_google_play['Category'] == 'FAMILY']
df_family['Genres'].value_counts(normalize=True)

Entertainment                            0.273377
Education                                0.226921
Simulation                               0.103633
Casual                                   0.079809
Puzzle                                   0.046456
Role Playing                             0.042883
Strategy                                 0.039309
Educational;Education                    0.020846
Educational                              0.019655
Education;Education                      0.014294
Casual;Pretend Play                      0.012507
Puzzle;Brain Games                       0.009529
Racing;Action & Adventure                0.008934
Casual;Action & Adventure                0.007147
Entertainment;Music & Video              0.007147
Arcade;Action & Adventure                0.006552
Casual;Brain Games                       0.006552
Educational;Pretend Play                 0.004765
Board;Brain Games                        0.004765
Action;Action & Adventure                0.004765


In [61]:
# Apps in the Family category whose genre is plain 'Entertainment'
df_family_entertainment = df_family.loc[df_family['Genres'] == 'Entertainment']
# value_counts() puts the app names in the index; reading them from there avoids
# relying on the column names reset_index() produces, which differ across pandas versions
list_family_entertainment = df_family_entertainment['App'].value_counts().index.tolist()
print(list_family_entertainment)

['DL Hughley', '3D Color by Number with Voxels', 'Ako ay may lobo Pinoy Kid Song Offline', 'FE Mix - Jokes - Status - Wallpaper', 'BG Television', 'CX-40', 'Radio K - KUOM', 'Color by Number - Draw Sandbox Pixel Art', 'Cute Images for Whatsapp', 'Funny Jokes', 'Harris J Lyrics', 'Movie DB', 'Glitter Color By Number - Glitter Number Coloring', 'Results for DC Lottery', 'SYFY', 'BK Dinos', 'Ek Biladi Jadi Video Song', 'Scanning under clothes (prank)', 'BS Meter (Ad Supported)', 'FD VR - Virtual 3D Web Browser', 'E-cigarette for free', 'Deck Builder & Analyzer for CR', 'Evolution CP & IV Calculator for pokemon', 'Sanu Ek Pal Chain Song Videos - RAID Movie Songs', 'AW Radio', 'WPBS-DT', 'Bono’s Pit Bar-B-Q', 'Scanning body and undressing people', 'Best DP and Status - All Type DP & Status', 'Voice changer with effects', 'FANDOM for: GTA', 'Mahila Vashikaran(Ek rat me)-Hindi', 'Butterfly Pixel Art - coloring by number', 'ALPHA - Artificial Intelligence', 'CX-OF', 'Results for FL Lottery (Fl

At least a quarter of the Family category look related to games (simulation, casual, puzzle, role playing, strategy, brain games, etc), but most are streaming devices or random entertainment. 

The two app stores look different. The App Store is largely dominated by games with more than half of the apps in our target market. Google Play has productivity-type apps - Tools, Business, Lifestyle, Productivity, etc - but still a large portion are games of one type or another.

My initial recommendation is to move forward with making a game. Games are the most common genre on the App Store and close to the most common on Google Play - certainly the most common across both. However, I'm going to check two things first - the Genres to Category relationship, and the popularity of genres in each market.

## 'Genres' to 'Category' relationship on Google Play
At a glance, Genres looks more detailed than Category. We can check that each Genre is assigned to only one Category with a few commands.

In [28]:
df_genre_category_relationship = df_google_play.loc[:,['Genres','Category']]
df_genre_category_relationship = df_genre_category_relationship.drop_duplicates()
len(df_genre_category_relationship)

134

In [29]:
# Count how many Categories each Genre appears with (column names set explicitly
# so the result does not depend on the pandas version)
df_g_c_r_counts = df_genre_category_relationship['Genres'].value_counts().rename_axis('Genres').reset_index(name='Counts')
# Keep only the Genres that map to two or more Categories
df_g_c_r_counts = df_g_c_r_counts.loc[df_g_c_r_counts['Counts'] >= 2]
len(df_g_c_r_counts)

19

It does look like the majority of Genres map to only one Category, but 19 of the 115 Genres (about 17%) appear under two Categories, which accounts for the 134 Genre-Category pairs. Let's take a look at those.

In [30]:
check_list = list(df_g_c_r_counts['Genres'])
check_df = df_google_play.loc[df_google_play['Genres'].isin(check_list)]
check_df = check_df.loc[:,['Genres','Category']].drop_duplicates().sort_values('Genres')
check_df

Unnamed: 0,Genres,Category
1955,Action;Action & Adventure,GAME
2071,Action;Action & Adventure,FAMILY
4,Art & Design;Creativity,ART_AND_DESIGN
7027,Art & Design;Creativity,FAMILY
1890,Casual,GAME
4879,Casual,FAMILY
1980,Casual;Brain Games,GAME
2166,Casual;Brain Games,FAMILY
741,Education,EDUCATION
7826,Education,FAMILY


Lots of crossover between family and education, family and entertainment, and family and game, but not anything drastic enough to make me want to reconsider my recommendation to build a game.  
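
The one-to-many check can also be cross-checked more directly by counting distinct Categories per Genre. This is a minimal sketch on hypothetical toy rows, not the real data; on the real dataframe the same `groupby` would be applied to `df_google_play`.

```python
import pandas as pd

# Hypothetical toy rows mimicking the Genres/Category columns.
toy = pd.DataFrame({
    "Genres":   ["Casual", "Casual", "Tools", "Education", "Education", "Tools"],
    "Category": ["GAME",   "FAMILY", "TOOLS", "EDUCATION", "FAMILY",    "TOOLS"],
})

# Number of distinct Categories each Genre appears under.
cats_per_genre = toy.groupby("Genres")["Category"].nunique()

# Genres that are not tied to a single Category, and their share of all Genres.
multi = cats_per_genre[cats_per_genre > 1]
share = len(multi) / len(cats_per_genre)
print(sorted(multi.index), round(share, 2))
```

Dividing by the number of Genres, rather than by the number of Genre-Category pairs, gives the share of Genres that cross Categories.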

## Popularity of genres
We want our app to have a lot of installs, so it's important to see how popular each genre is. We can use the 'Installs' column directly for Google Play, and for the App Store we can use rating_count_tot as a proxy. First we have to convert the install counts from text strings (such as "10,000+") to integers. The granularity isn't great, but it's better than nothing.

In [31]:
df_google_play['Installs_count'] = [i.replace(',','').replace('+','') for i in df_google_play['Installs']]
df_google_play['Installs_count'] = df_google_play['Installs_count'].astype(int)
# df_google_play['Installs_count'].value_counts()

Now we can look at the average install count for the Genres and Categories.

In [50]:
# Average installs (in millions) by Genre on Google Play
df_gp_avg_installs_g = df_google_play.groupby('Genres', as_index=False)['Installs_count'].mean().sort_values('Installs_count', 
                                                                                                          ascending=False)
df_gp_avg_installs_g['Installs_count'] = [i/1000000 for i in df_gp_avg_installs_g['Installs_count']]
df_gp_avg_installs_g = df_gp_avg_installs_g.rename(columns={'Installs_count':'Average_installs_count_in_millions'})
df_gp_avg_installs_g

Unnamed: 0,Genres,Average_installs_count_in_millions
33,Communication,38.456119
3,Adventure;Action & Adventure,35.333333
110,Video Players & Editors,24.947336
97,Social,23.253652
5,Arcade,22.888365
24,Casual,19.630959
81,Puzzle;Action & Adventure,18.366667
78,Photography,17.805628
44,Educational;Action & Adventure,17.016667
79,Productivity,16.787331


In [51]:
# Average installs (in millions) by Category on Google Play
df_gp_avg_installs_c = df_google_play.groupby('Category', as_index=False)['Installs_count'].mean().sort_values('Installs_count', 
                                                                                                          ascending=False)
df_gp_avg_installs_c['Installs_count'] = [i/1000000 for i in df_gp_avg_installs_c['Installs_count']]
df_gp_avg_installs_c = df_gp_avg_installs_c.rename(columns={'Installs_count':'Average_installs_count_in_millions'})
df_gp_avg_installs_c

Unnamed: 0,Category,Average_installs_count_in_millions
6,COMMUNICATION,38.456119
31,VIDEO_PLAYERS,24.727872
27,SOCIAL,23.253652
24,PHOTOGRAPHY,17.805628
25,PRODUCTIVITY,16.787331
14,GAME,15.560966
30,TRAVEL_AND_LOCAL,13.984078
9,ENTERTAINMENT,11.719762
29,TOOLS,10.682301
21,NEWS_AND_MAGAZINES,9.549178


In [54]:
# Average number of ratings by prime genre on the App Store
df_a_avg_rat = df_app_store.groupby('prime_genre', as_index=False)['rating_count_tot'].mean().sort_values('rating_count_tot', ascending=False)
df_a_avg_rat['rating_count_tot'] = df_a_avg_rat['rating_count_tot'].astype(int)
df_a_avg_rat

Unnamed: 0,prime_genre,rating_count_tot
12,Navigation,86090
16,Reference,74942
18,Social Networking,71548
11,Music,57326
22,Weather,52279
0,Book,39758
6,Food & Drink,33333
5,Finance,31467
14,Photo & Video,28441
20,Travel,28243
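
Averages like these can be pulled upward by a handful of very large apps, so it may help to view the mean next to the median and the number of apps in each genre. The sketch below uses invented numbers purely to show the mechanics; the output column names are assumptions.

```python
import pandas as pd

# Invented rating counts: one genre with a single giant app, one with evenly sized apps.
toy = pd.DataFrame({
    "prime_genre": ["Navigation"] * 3 + ["Games"] * 3,
    "rating_count_tot": [300000, 100, 50, 1000, 2000, 3000],
})

# Named aggregation: app count, mean and median of the rating counts per genre.
summary = (
    toy.groupby("prime_genre")["rating_count_tot"]
       .agg(apps="count", mean_ratings="mean", median_ratings="median")
       .sort_values("mean_ratings", ascending=False)
)
print(summary)
```

In this toy case the genre with the highest mean has the lowest median, which is the kind of gap worth checking before reading too much into a genre's average.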


# Final Recommendation
Based on the information above, I recommend building a game of the genre "action and adventure" or related. 