## Profitable App Profiles for the Google Play Store

Author: Julian Moors\
Contact: julian.moors@outlook.com

### Introduction
_The goal of this project is to analyse Google Play Store data to help developers understand which types of apps are likely to attract the most users._

In [2]:
# load the Google Play Store dataset and summarise its columns
import pandas as pd
pd.set_option('display.max_rows', None)

google_df = pd.read_csv('data/google-playstore.csv')
google_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
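
The `info()` summary shows that `Rating` has 9367 non-null values out of 10841 entries, and that a few other columns are missing a handful of values. A minimal sketch of how missing values could be counted per column, using a small hypothetical frame (`sample_df` and its contents are illustrative, not taken from the dataset):

```python
import pandas as pd

# hypothetical miniature of the dataset; values are illustrative only
sample_df = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.1, None, 3.9, None],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# number of missing values per column, largest first
missing_counts = sample_df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Applied to the real data, the equivalent call would be `google_df.isna().sum()`.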


### Data Cleaning

In [4]:
# user has reported incorrect data in row 10472
# google_df = google_df.drop(10472)

# drop duplicate app names
google_df = google_df.drop_duplicates(subset=['App'])

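# keep only apps whose names consist solely of ASCII letters and whitespace
# (a strict proxy for English names: names with digits or punctuation are dropped too)
google_df = google_df[google_df['App'].str.contains(r'^[a-zA-Z\s]+$', na=False)].copy()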

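# remove '+' and ',' characters from the installs column
# (regex=False so that '+' is treated as a literal character, not a regex quantifier)
google_df['Installs'] = (
    google_df['Installs']
    .str.replace('+', '', regex=False)
    .str.replace(',', '', regex=False)
)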

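# show the top 20 free apps by number of installs
# note: 'Installs' is still a string column here, so the sort is lexicographic
# (e.g. '500000000' ranks above '1000000000')
google_df[google_df['Type'] == 'Free'].sort_values(by='Installs', ascending=False).head(20)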

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3473 | Dropbox | PRODUCTIVITY | 4.4 | 1861310 | 61M | 500000000 | Free | 0 | Everyone | Productivity | August 1, 2018 | Varies with device | Varies with device |
| 5596 | Samsung Health | HEALTH_AND_FITNESS | 4.3 | 480208 | 70M | 500000000 | Free | 0 | Everyone | Health & Fitness | July 31, 2018 | 5.17.2.009 | 5.0 and up |
| 1722 | My Talking Tom | GAME | 4.5 | 14891223 | Varies with device | 500000000 | Free | 0 | Everyone | Casual | July 19, 2018 | 4.8.0.132 | 4.1 and up |
| 3476 | Google Calendar | PRODUCTIVITY | 4.2 | 858208 | Varies with device | 500000000 | Free | 0 | Everyone | Productivity | August 6, 2018 | Varies with device | Varies with device |
| 3574 | Cloud Print | PRODUCTIVITY | 4.1 | 282460 | Varies with device | 500000000 | Free | 0 | Everyone | Productivity | May 23, 2018 | Varies with device | Varies with device |
| 347 | imo free video calls and chat | COMMUNICATION | 4.3 | 4785892 | 11M | 500000000 | Free | 0 | Everyone | Communication | June 8, 2018 | 9.8.000000010501 | 4.0 and up |
| 342 | Viber Messenger | COMMUNICATION | 4.3 | 11334799 | Varies with device | 500000000 | Free | 0 | Everyone | Communication | July 18, 2018 | Varies with device | Varies with device |
| 2546 | Facebook Lite | SOCIAL | 4.3 | 8606259 | Varies with device | 500000000 | Free | 0 | Teen | Social | August 1, 2018 | Varies with device | Varies with device |
| 1662 | Pou | GAME | 4.3 | 10485308 | 24M | 500000000 | Free | 0 | Everyone | Casual | May 25, 2018 | 1.4.77 | 4.0 and up |
| 2550 | Snapchat | SOCIAL | 4.0 | 17014787 | Varies with device | 500000000 | Free | 0 | Teen | Social | July 30, 2018 | Varies with device | Varies with device |
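
After the cleaning step, `Installs` holds digit strings rather than numbers, so sorting it compares text character by character. A minimal sketch of one way to make the ranking numeric, assuming the raw values look like `'1,000,000+'` (the frame `sample_df` and its values are hypothetical, and this is not necessarily the conversion the notebook goes on to use):

```python
import pandas as pd

# hypothetical install counts in the dataset's raw string format
sample_df = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Installs': ['500,000,000+', '1,000,000,000+', '10,000+'],
})

# strip '+' and ',' as literal characters
installs = (
    sample_df['Installs']
    .str.replace('+', '', regex=False)
    .str.replace(',', '', regex=False)
)

# convert to a nullable integer column; values that cannot be parsed become <NA>
sample_df['Installs'] = pd.to_numeric(installs, errors='coerce').astype('Int64')

# numeric sort: the largest install count now comes first
ranked = sample_df.sort_values(by='Installs', ascending=False)
print(ranked)
```

With `errors='coerce'`, any malformed entry is turned into a missing value instead of raising, which makes such rows easy to find and inspect afterwards.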


### Data Analysis

In [3]:
# from the top 1000 free apps by number of installs, show the 21 most common genres
free_apps = google_df[google_df['Type'] == 'Free']
top_1000 = free_apps.sort_values(by='Installs', ascending=False).head(1000)
top_1000['Genres'].value_counts().head(21)

Genres
Tools                      72
Entertainment              58
Education                  42
Lifestyle                  42
Photography                41
Health & Fitness           40
Shopping                   40
Productivity               39
Dating                     39
Social                     37
Action                     36
Business                   35
Communication              34
Casual                     34
Medical                    31
Finance                    29
Personalization            28
Sports                     28
Books & Reference          24
Video Players & Editors    20
News & Magazines           18
Name: count, dtype: int64
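
Raw counts can be easier to compare when expressed as a share of the selected apps. A minimal sketch using `value_counts(normalize=True)` on a small hypothetical series (`genres` and its labels are illustrative, not the notebook's data):

```python
import pandas as pd

# hypothetical genre labels; illustrative only
genres = pd.Series(['Tools', 'Tools', 'Education', 'Casual'], name='Genres')

# share of apps per genre, as a percentage rounded to one decimal place
genre_share = (genres.value_counts(normalize=True) * 100).round(1)
print(genre_share)
```

On the notebook's data, the same idea would apply to the `Genres` column of the top 1000 free apps.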