# Analysis goal 
- The goal for this project is to analyze Google Play's data to help understand what kinds of apps are likely to attract more users.
- I'll focus on free apps for this analysis.

## About the Data
As of September 2019, there were [approximately 2.8 million Android apps](https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/) on Google Play.

Collecting data for this many apps is not an easy task, so I decided to look for an existing data set that could help me. After some searching I found two promising data sets:
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) with 10k apps, collected in February 2019
- And [a data set](https://www.kaggle.com/lava18/google-play-store-apps) with 267k apps, collected in April 2019


After some thought I decided to use the second one, because it has more data and was collected more recently.


## Opening the Data

In [2]:
from csv import reader

### The Google Play data set ###
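# Use a context manager so the file is closed once it has been read
with open('..\\data\\raw\\Google-Playstore-Full.csv') as opened_file:
    read_file = reader(opened_file)
    google_play = list(read_file)

# The first row is the header; everything after it is data
google_play_header = google_play[0]
google_play_data = google_play[1:]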

## Exploring the Data
To make it easier to explore the data set I created four helper functions that I will reuse throughout the project.

In [3]:
def print_header(header):
    print(header)
    print('\n')

In [4]:
def print_data(data, start = 0, end = 5):
    data_slice = data[start:end]
    for row in data_slice:
        print(row)
        print('\n')

In [5]:
def print_data_info(data):
    print('Number of rows:', len(data))
    print('Number of columns:', len(data[0]))

In [6]:
def print_data_overview():
    print_header(google_play_header)
    print_data(google_play_data)
    print_data_info(google_play_data)   

In [7]:
print_data_overview()

['App Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Size', 'Price', 'Content Rating', 'Last Updated', 'Minimum Version', 'Latest Version', '', '', '', '']


['DoorDash - Food Delivery', 'FOOD_AND_DRINK', '4.548561573', '305034', '5,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device', '', '', '', '']


['TripAdvisor Hotels Flights Restaurants Attractions', 'TRAVEL_AND_LOCAL', '4.400671482', '1207922', '100,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device', '', '', '', '']


['Peapod', 'SHOPPING', '3.656329393', '1967', '100,000+', '1.4M', '0', 'Everyone', 'September 20, 2018', '5.0 and up', '2.2.0', '', '', '', '']


['foodpanda - Local Food Delivery', 'FOOD_AND_DRINK', '4.107232571', '389154', '10,000,000+', '16M', '0', 'Everyone', 'March 22, 2019', '4.2 and up', '4.18.2', '', '', '', '']


['My CookBook Pro (Ad Free)', 'FOOD_AND_DRINK', '4.647752285', '2291', 

## Data Wrangling

After a quick glance we can already learn something useful about this data, such as which columns are likely to matter ('Category', 'Rating', 'Reviews', 'Installs', 'Price', 'Content Rating').

Looking more closely, we can see that the last 4 values of the header and of each row are empty strings. Since the header gives these fields no name, any information in them would be meaningless, so we can start by removing them.

### Drop the last 4 empty values

In [8]:
number_of_items_to_delete = 4

del google_play_header[-number_of_items_to_delete:]

for row in google_play_data:
    del row[-number_of_items_to_delete:]
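Before running the deletion above, it may be worth confirming that the trailing fields really are empty in every row, because any row that is not would silently lose data. The sketch below is one way to do that check in pure Python. The function name `rows_with_trailing_data` and the sample rows are illustrative and not part of the original notebook.

```python
def rows_with_trailing_data(rows, n_trailing=4):
    """Return the indexes of rows whose last n_trailing fields are not all empty."""
    return [
        i for i, row in enumerate(rows)
        if any(value != '' for value in row[-n_trailing:])
    ]


# Hypothetical rows: two real columns followed by four padding fields
sample_rows = [
    ['Peapod', 'SHOPPING', '', '', '', ''],
    ['Steins', 'Gate ALARM', 'ENTERTAINMENT', '', '', ''],  # shifted right by one field
]
print(rows_with_trailing_data(sample_rows))
```

An empty result would mean the deletion is safe. Any index it reports points at a row that deserves a closer look before its last values are dropped.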

In [9]:
print_data_overview()

['App Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Size', 'Price', 'Content Rating', 'Last Updated', 'Minimum Version', 'Latest Version']


['DoorDash - Food Delivery', 'FOOD_AND_DRINK', '4.548561573', '305034', '5,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['TripAdvisor Hotels Flights Restaurants Attractions', 'TRAVEL_AND_LOCAL', '4.400671482', '1207922', '100,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['Peapod', 'SHOPPING', '3.656329393', '1967', '100,000+', '1.4M', '0', 'Everyone', 'September 20, 2018', '5.0 and up', '2.2.0']


['foodpanda - Local Food Delivery', 'FOOD_AND_DRINK', '4.107232571', '389154', '10,000,000+', '16M', '0', 'Everyone', 'March 22, 2019', '4.2 and up', '4.18.2']


['My CookBook Pro (Ad Free)', 'FOOD_AND_DRINK', '4.647752285', '2291', '10,000+', 'Varies with device', '$5.99', 'Everyone', 'April 1, 2019', 'Varies w

### Find duplicates

This is where I hit a wall using pure Python. To check for duplicates I'd have to do something like:
```python
duplicate_apps = []
unique_apps = []

for app in google_play_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
```
This would work just fine for a small data set, but on this one it takes too long, so from here on I'll use pandas.
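The slowdown most likely comes from `name in unique_apps`: a membership test on a list scans it element by element, so the total work grows roughly with the square of the number of rows. As a minimal sketch, assuming the app name is a string in the first column, the lists can be swapped for a `set`, whose membership test takes constant time on average. The function name `find_duplicate_names` and the sample rows are illustrative.

```python
def find_duplicate_names(rows, name_index=0):
    """Return one entry for every repeated occurrence of a name."""
    seen = set()       # names encountered so far (average O(1) lookup)
    duplicates = []    # repeated names, in the order they reappear
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Hypothetical rows shaped like google_play_data
sample_rows = [
    ['Peapod', 'SHOPPING'],
    ['Gmail', 'COMMUNICATION'],
    ['Peapod', 'SHOPPING'],
]
print(find_duplicate_names(sample_rows))
```

This keeps the analysis in pure Python, although pandas is still the more convenient tool for the filtering and grouping that follow.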

In [10]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [11]:
df = pd.read_csv('..\\data\\raw\\Google-Playstore-Full.csv', low_memory=False)

In [12]:
df.head()

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,DoorDash - Food Delivery,FOOD_AND_DRINK,4.548561573,305034,"5,000,000+",Varies with device,0,Everyone,"March 29, 2019",Varies with device,Varies with device,,,,
1,TripAdvisor Hotels Flights Restaurants Attract...,TRAVEL_AND_LOCAL,4.400671482,1207922,"100,000,000+",Varies with device,0,Everyone,"March 29, 2019",Varies with device,Varies with device,,,,
2,Peapod,SHOPPING,3.656329393,1967,"100,000+",1.4M,0,Everyone,"September 20, 2018",5.0 and up,2.2.0,,,,
3,foodpanda - Local Food Delivery,FOOD_AND_DRINK,4.107232571,389154,"10,000,000+",16M,0,Everyone,"March 22, 2019",4.2 and up,4.18.2,,,,
4,My CookBook Pro (Ad Free),FOOD_AND_DRINK,4.647752285,2291,"10,000+",Varies with device,$5.99,Everyone,"April 1, 2019",Varies with device,Varies with device,,,,


In [13]:
df.drop(columns=['Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], inplace=True)

### Filtering only free apps

For this we'll focus on the Price column. Looking closer, we can see a price like '$5.99', which indicates that the column is stored as strings. Knowing this, we can create a function to convert a price to a number.


In [14]:
def price_to_number(price):
    price = price.replace('$', '')
    return float(price)

Let's apply this function to the price column

In [15]:
df['Price'].apply(price_to_number)

ValueError: could not convert string to float: '2.4M'

Let's check what's going on

In [16]:
df[df['Price'].str.contains(pat = 'M')]

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version
13504,Never have I ever 18+,),GAME_STRATEGY,4.0,6,100+,2.4M,$0.99,Mature 17+,"December 30, 2018",4.0.3 and up
32229,Old-time Radio presents,,ENTERTAINMENT,4.0,20,"10,000+",3.1M,0,Everyone,"October 16, 2018",4.1 and up
48438,Mojo Times: Bihar Hindi Video News,Breaking News,NEWS_AND_MAGAZINES,4.775640965,156,"10,000+",6.9M,0,Teen,"March 30, 2019",4.1 and up
113151,Steins,Gate ALARM,ENTERTAINMENT,4.716867447,166,500+,67M,$0.99,Teen,"November 12, 2018",4.4 and up
125479,2-6 Ya? E?itici �ocuk Zeka Oyunlar?,Alfabe �?ren,EDUCATION,5.0,1,10+,57M,$2.49,Everyone,"October 31, 2017",2.3 and up
125480,2-6 Ya? E?itici �ocuk Zeka Oyunlar?,T�rk Alfabesi,EDUCATION,4.333333492,54,"50,000+",43M,0,Everyone,"October 31, 2017",2.3 and up
165230,Shytter -Twitter client,not notified you follow -,SOCIAL,4.098591328,71,"5,000+",7.7M,0,Everyone,"March 30, 2019",4.1 and up
168914,CorreosTrack 2.0 (Correos de M�xico,Mexpost),PRODUCTIVITY,4.389830589,59,"10,000+",16M,0,Everyone,"December 21, 2018",4.1 and up
180371,eShagird - Online academy,ETEA & MDCAT,EDUCATION,4.504273415,117,"10,000+",6.9M,0,Everyone,"March 2, 2019",4.0 and up
190759,Friend in Iceland,Tour Guide,TRAVEL_AND_LOCAL,5.0,6,"1,000+",27M,0,Everyone,"October 16, 2017",4.0 and up


Some kind of shift happened to the data in these rows: the app name appears to have been split at a comma, pushing every later value one column to the right. Because of this error, we'll delete these rows.

In [17]:
# Get the row indexes of the shifted rows
indexes = df[df['Price'].str.contains(pat = 'M')].index

# Delete these rows from the DataFrame
df.drop(indexes , inplace=True)

Let's try to apply our price_to_number function again

In [18]:
df['Price'] = df['Price'].apply(price_to_number)

ValueError: could not convert string to float: 'Varies with device'

After trying our function again, we realize that we can do a better job by simply selecting all rows where the price is '0'. This also excludes the remaining problematic rows at the same time.

In [19]:
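# .copy() gives an independent DataFrame, so the in-place edits further down don't act on a slice of df
df_free = df[df['Price'] == '0'].copy()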
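The string comparison above relies on free apps being recorded exactly as '0'. A hedged alternative, sketched here on made-up values rather than the real column, is to convert with `pd.to_numeric(..., errors='coerce')`, which turns anything unparseable (such as '2.4M' or 'Varies with device') into NaN instead of raising an error. The sample Series is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical sample mimicking the kinds of values seen in the Price column
prices = pd.Series(['0', '$5.99', '2.4M', 'Varies with device', '0'])

# Strip the dollar sign, then coerce anything non-numeric to NaN
numeric_prices = pd.to_numeric(
    prices.str.replace('$', '', regex=False), errors='coerce'
)

# NaN never equals 0, so unparseable rows are excluded from the free apps
is_free = numeric_prices == 0

print(prices[is_free].tolist())               # values treated as free
print(prices[numeric_prices.isna()].tolist()) # values that failed to parse
```

The second printout is a quick way to inspect the problematic values in one pass instead of discovering them one `ValueError` at a time.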

### Removing Duplicate Entries
Now that we have only free apps, we can go back to removing duplicate entries.
First we have to think about why the duplication is happening. Let's take a look at some duplicate entries.

In [20]:
# Filter first, then sort, so the boolean mask is aligned with the frame it indexes
df_free[df_free.duplicated(subset=['App Name'], keep=False)].sort_values('App Name', ascending=True).head(30)

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version
66734,#NAME?,SPORTS,4.55343914,4725,"100,000+",4.9M,0,Everyone,"November 6, 2018",4.1 and up,1.7.5
85493,#NAME?,COMMUNICATION,2.817708254,192,"10,000+",46M,0,Teen,"October 31, 2018",4.2 and up,4.0.180.15257
65076,#NAME?,LIFESTYLE,3.971428633,2520,"100,000+",16M,0,Everyone,"February 25, 2019",4.1 and up,2.1.7
250895,#NAME?,COMMUNICATION,4.879069805,215,"5,000+",16M,0,Everyone,"December 10, 2018",4.1 and up,1.1
83867,#NAME?,HEALTH_AND_FITNESS,2.846153736,39,"1,000+",4.3M,0,Everyone,"May 14, 2015",4.0.3 and up,1.2
183900,#NAME?,FINANCE,3.158273458,278,"50,000+",3.0M,0,Everyone,"March 4, 2019",4.4 and up,1.15.5
209262,#NAME?,EDUCATION,4.594594479,370,"10,000+",18M,0,Everyone,"December 12, 2018",4.1 and up,2.1
155728,#NAME?,TOOLS,4.647058964,85,"1,000+",5.2M,0,Everyone,"March 26, 2019",4.4 and up,1.1.6
170318,#NAME?,EDUCATION,5.0,2,100+,6.4M,0,Everyone,"July 23, 2018",4.4 and up,1
181267,#NAME?,FINANCE,3.972222328,180,"10,000+",3.1M,0,Everyone,"March 4, 2019",4.4 and up,1.8.5


As we can see, the first apps are missing their names, which were replaced with '#NAME?'. Let's remove them.

In [21]:
# Get the row indexes of the apps without a name
indexes = df_free[df_free['App Name'] == '#NAME?'].index

# Delete these rows from the DataFrame
df_free.drop(indexes , inplace=True)

To avoid removing duplicates arbitrarily, we can choose which entry to keep based on some information such as rating, number of reviews, number of installs, last update or latest version. Number of reviews seems the best criterion: the entry with the most reviews should also carry the most reliable rating, so for each name we'll keep the entry with the highest number of reviews.

In [22]:
#First we'll sort the data frame based on the Reviews column
df_free.sort_values('Reviews', ascending=False, inplace=True)

In [23]:
# Then we'll drop the duplicated names, keeping only the entry with the most reviews
df_free.drop_duplicates(subset='App Name', keep='first', inplace=True)
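One caveat, stated as an assumption because the notebook does not show the column's dtype: if `Reviews` was read as strings, sorting compares text, so '99' would rank above '1000'. The minimal sketch below, on toy data, converts the column to numbers first and then applies the same sort-and-drop approach. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'App Name': ['Alpha', 'Alpha', 'Beta'],
    'Reviews': ['99', '1000', '5'],  # strings, as they might come out of the CSV
})

# Convert to numbers so the sort is numeric rather than alphabetical
toy['Reviews'] = pd.to_numeric(toy['Reviews'], errors='coerce')

# Highest review count first, then keep the first row seen for each name
deduplicated = (
    toy.sort_values('Reviews', ascending=False)
       .drop_duplicates(subset='App Name', keep='first')
)
print(deduplicated)
```

Checking `df_free['Reviews'].dtype` beforehand would show whether this conversion is needed at all.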

## Most Common Apps by Category

In [24]:
df_free_groupby_category_percentage = df_free.groupby('Category', as_index=False).agg({'App Name': lambda x: (x.count()/len(df_free)) * 100}).rename(columns={'App Name':'Percentage'})
df_free_groupby_category_percentage.sort_values('Percentage', ascending=False)

Unnamed: 0,Category,Percentage
8,EDUCATION,12.46185
9,ENTERTAINMENT,7.92797
45,TOOLS,7.81738
3,BOOKS_AND_REFERENCE,7.56104
36,MUSIC_AND_AUDIO,6.84604
33,LIFESTYLE,5.56865
4,BUSINESS,4.12152
11,FINANCE,4.05807
39,PERSONALIZATION,3.79788
41,PRODUCTIVITY,3.4421


As we can see, there is more than one category for games, so let's add them together.

In [25]:
df_free_groupby_category_percentage[df_free_groupby_category_percentage['Category'].str.contains("GAME")].sum()

Category      GAME_ACTIONGAME_ADVENTUREGAME_ARCADEGAME_BOARD...
Percentage                                              7.86967
dtype: object

Google Play shows a fairly balanced landscape of apps. Even if we group all the game categories together, they would only come in third place. Now we'd like to get an idea of which kinds of apps have the most users.
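To see the combined game share next to the other categories in a single ranking, one option is to relabel every `GAME_*` category before counting. This is a sketch on toy data, not the notebook's method: it uses `str.startswith('GAME')` rather than the `str.contains("GAME")` filter used earlier, and the `toy` frame is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'Category': ['GAME_ACTION', 'GAME_RACING', 'EDUCATION', 'TOOLS'],
})

# Collapse every GAME_* label into 'GAME'; other categories are left unchanged
is_game = toy['Category'].str.startswith('GAME')
merged = toy['Category'].where(~is_game, 'GAME')

# Share of apps per merged category, as a percentage
percentages = merged.value_counts(normalize=True) * 100
print(percentages)
```

Applied to `df_free['Category']`, the same two steps would place a single 'GAME' row in the ranked table.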

## Most Popular Apps by Category

One way to find out which categories are the most popular (have the most users) is to calculate the average number of installs for each app category. We can find this information in the Installs column. However, the install numbers are not precise: they are open-ended buckets like '50+', '100+' and '100,000+'. For our purposes we don't need that much precision; we only need an idea of which categories attract more users. So we'll simply convert these values to float, treating '100,000+' as 100000, so we can do calculations.

In [26]:
df_free['Installs'] = df_free['Installs'].apply(lambda x: float(x.replace(',', '').replace('+', '')))

With the code above we remove commas and plus signs and convert the string to float so we can make some calculations.

Let's get the average number of installs per category.

In [27]:
df_free.groupby('Category', as_index=False)['Installs'].mean().sort_values('Installs', ascending=False)

Unnamed: 0,Category,Installs
23,GAME_RACING,6222893.08406
13,GAME_ACTION,4724329.77269
26,GAME_SPORTS,4166148.82175
48,VIDEO_PLAYERS,3773136.82155
27,GAME_STRATEGY,3425426.62332
15,GAME_ARCADE,3296286.01473
19,GAME_CASUAL,2937325.03293
21,GAME_MUSIC,2644452.86047
6,COMMUNICATION,2596150.06365
25,GAME_SIMULATION,2254367.49413


As we can see, games are clearly popular: game categories take eight of the ten spots shown above.

In [35]:
df_free[df_free['Category'] == 'VIDEO_PLAYERS'].sort_values('Installs', ascending=False).head(20)

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version
813,YouTube,VIDEO_PLAYERS,4.36842823,41919102,5000000000.0,Varies with device,0,Teen,"April 2, 2019",Varies with device,Varies with device
6412,Google Play Movies & TV,VIDEO_PLAYERS,3.703356266,1048972,1000000000.0,Varies with device,0,Teen,"April 1, 2019",Varies with device,Varies with device
3154,LIKE Video -Magic Video Maker & Community,VIDEO_PLAYERS,4.581243515,2048739,100000000.0,45M,0,Teen,"April 2, 2019",4.0 and up,2.15.12
10796,Motorola FM Radio,VIDEO_PLAYERS,3.924235821,58141,100000000.0,Varies with device,0,Everyone,"December 13, 2018",Varies with device,Varies with device
8987,"VideoShow Video Editor, Video Maker, Beauty Ca...",VIDEO_PLAYERS,4.559087276,4458794,100000000.0,22M,0,Everyone,"April 3, 2019",4.0.3 and up,8.3.5rc
8986,"DU Recorder � Screen Recorder, Video Editor, Live",VIDEO_PLAYERS,4.745390892,3992518,100000000.0,17M,0,Everyone,"April 3, 2019",5.0 and up,2.1.1.5
2114,VLC for Android,VIDEO_PLAYERS,4.423864365,1139749,100000000.0,Varies with device,0,Everyone,"April 1, 2019",2.3 and up,Varies with device
3151,VMate 2019- Video Downloader &Best Video Tube ...,VIDEO_PLAYERS,4.248163223,323372,100000000.0,17M,0,Teen,"April 1, 2019",4.0.3 and up,2.14
3497,Motorola Gallery,VIDEO_PLAYERS,3.947426319,126318,100000000.0,23M,0,Everyone,"January 25, 2016",Varies with device,Varies with device
44563,VivaVideo - Video Editor & Photo Movie,VIDEO_PLAYERS,4.603675365,10855572,100000000.0,40M,0,Teen,"April 3, 2019",4.1 and up,7.9.7


In [34]:
df_free[df_free['Category'] == 'COMMUNICATION'].sort_values('Installs', ascending=False).head(20)

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version
632,Messenger � Text and Video Chat for Free,COMMUNICATION,4.085856438,65469531,1000000000.0,Varies with device,0,Everyone,"April 2, 2019",Varies with device,Varies with device
815,Skype - free IM & video calls,COMMUNICATION,4.134609699,10746013,1000000000.0,Varies with device,0,Everyone,"March 28, 2019",Varies with device,Varies with device
3267,Samsung Internet Browser,COMMUNICATION,4.424014568,832714,1000000000.0,Varies with device,0,Everyone,"February 21, 2019",5.0 and up,Varies with device
842,Google Chrome: Fast & Secure,COMMUNICATION,4.335205078,13636591,1000000000.0,Varies with device,0,Everyone,"March 26, 2019",Varies with device,Varies with device
1981,Hangouts,COMMUNICATION,4.040542603,3960560,1000000000.0,Varies with device,0,Everyone,"March 29, 2019",Varies with device,Varies with device
2724,Gmail,COMMUNICATION,4.346980095,5614163,1000000000.0,Varies with device,0,Everyone,"April 2, 2019",Varies with device,Varies with device
6411,Google Duo - High Quality Video Calls,COMMUNICATION,4.596403599,3641252,1000000000.0,20M,0,Everyone,"April 2, 2019",4.4 and up,50.1.240885383.DR50_RC06
671,WhatsApp Messenger,COMMUNICATION,4.417610168,86214292,1000000000.0,Varies with device,0,Everyone,"March 27, 2019",Varies with device,Varies with device
832,"Viber Messenger - Messages, Group Chats & Calls",COMMUNICATION,4.362102032,12723787,500000000.0,Varies with device,0,Everyone,"April 2, 2019",4.1 and up,10.4.0.4
2729,UC Browser � Short Video Status & Video Downlo...,COMMUNICATION,4.450339317,19573637,500000000.0,44M,0,Teen,"April 2, 2019",4.0 and up,12.10.9.1193


In [33]:
df_free[df_free['Category'] == 'SOCIAL'].sort_values('Installs', ascending=False).head(20)

Unnamed: 0,App Name,Category,Rating,Reviews,Installs,Size,Price,Content Rating,Last Updated,Minimum Version,Latest Version
831,Facebook Lite,SOCIAL,4.288809299,10866006,1000000000.0,Varies with device,0,Teen,"March 28, 2019",Varies with device,Varies with device
653,Instagram,SOCIAL,4.51955986,79726403,1000000000.0,Varies with device,0,Teen,"April 2, 2019",Varies with device,Varies with device
704,Facebook,SOCIAL,4.087946415,85766433,1000000000.0,Varies with device,0,Teen,"April 2, 2019",Varies with device,Varies with device
642,TikTok,SOCIAL,4.386767387,9670607,500000000.0,72M,0,Teen,"April 2, 2019",4.1 and up,10.5.0
691,Snapchat,SOCIAL,4.07791853,19026060,500000000.0,56M,0,Teen,"April 3, 2019",4.4 and up,10.54.0.31
13915,Hike News & Content (for chatting go to new app),SOCIAL,4.337181568,2929521,100000000.0,Varies with device,0,Teen,"February 15, 2019",Varies with device,Varies with device
7637,VK � social network and calls,SOCIAL,3.821457386,6286733,100000000.0,51M,0,Mature 17+,"April 2, 2019",Varies with device,Varies with device
1991,Tango - Live Video Broadcasts,SOCIAL,4.290686131,3822103,100000000.0,80M,0,Mature 17+,"April 2, 2019",4.4 and up,6.6.233897
463,Pinterest,SOCIAL,4.617506027,5064395,100000000.0,Varies with device,0,Teen,"April 2, 2019",Varies with device,Varies with device
2710,"LinkedIn: Jobs, professional profile, & networ...",SOCIAL,4.240080833,1372067,100000000.0,23M,0,Everyone,"March 28, 2019",5.0 and up,4.1.289


The case for games becomes even stronger when we look at the other categories: video players, communication and social are dominated by a handful of apps such as YouTube, Skype and Facebook, with a billion or more installs each, which makes these categories seem more popular than they really are.
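One way to check how much a few giant apps distort the picture is to put the median next to the mean, since the median is barely affected by a single extreme value. The sketch below uses invented install counts, not figures from the data set, purely to show the effect.

```python
import pandas as pd

# Invented numbers: one giant app among small ones vs. evenly sized apps
toy = pd.DataFrame({
    'Category': ['VIDEO_PLAYERS'] * 4 + ['GAME_RACING'] * 4,
    'Installs': [5_000_000_000, 1_000, 1_000, 1_000,
                 1_000_000, 1_000_000, 500_000, 500_000],
})

# Mean and median installs side by side for each category
summary = toy.groupby('Category')['Installs'].agg(['mean', 'median'])
print(summary.sort_values('median', ascending=False))
```

In this toy example the video category has the higher mean but by far the lower median. Running the same aggregation on `df_free` would show whether the real categories behave the same way.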

## Conclusion
In this project, we analyzed data about Google Play mobile apps with the goal of understanding what kinds of free apps are likely to attract more users.

After the analysis, we concluded that making a game would probably be the best bet. Racing games have the highest average number of installs, but we think any kind of game would be fine. Games are popular, and people tend to install multiple games of the same genre. In other categories people generally settle on a preferred app, so we would start at a disadvantage against established competitors.