# Introduction

This notebook analyses the Android app market in the Google Play Store.
The charts are created with plotly.express.

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [1]:
import pandas as pd
import plotly.express as px

# Data Cleaning
- The 'Rating' and 'Type' columns contain missing values -> remove the rows with missing values
- The dataset also contains duplicated rows -> remove the duplicates

In [2]:
df_apps = pd.read_csv('apps.csv')
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
501,Chronolink DX Lite,FAMILY,,2,47.0,10,Free,0,Everyone,Puzzle,"October 9, 2017",4.1 and up
1761,Selfie With Champion AJ Style,PHOTOGRAPHY,5.0,2,7.5,500,Free,0,Everyone,Photography,"January 8, 2018",3.2 and up
7310,ES App Locker,TOOLS,4.3,32207,3.4,1000000,Free,0,Everyone,Tools,"August 16, 2017",2.1 and up
153,Ej-buy,BUSINESS,,2,4.1,5,Free,0,Everyone,Business,"August 2, 2018",4.1 and up
10679,Farm Heroes Saga,GAME,4.4,7614407,70.0,100000000,Free,0,Everyone,Casual,"July 26, 2018",2.3 and up


In [3]:
df_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size_MBs        10841 non-null  float64
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content_Rating  10841 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last_Updated    10841 non-null  object 
 11  Android_Ver     10839 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 1016.5+ KB


In [4]:
df_apps.drop(['Android_Ver','Last_Updated'], axis=1, inplace=True)

### Check and remove missing values

In [5]:
df_apps.isna().any()

App               False
Category          False
Rating             True
Reviews           False
Size_MBs          False
Installs          False
Type               True
Price             False
Content_Rating    False
Genres            False
dtype: bool

In [6]:
df_app_clean = df_apps.dropna()

In [7]:
df_app_clean.shape

(9367, 10)

### Check and remove duplicated values

In [8]:
df_app_clean.duplicated().any()

True

In [9]:
row_duplicates = df_app_clean[df_app_clean.duplicated()]
row_duplicates.sort_values(['App', 'Type', 'Price'], ascending=False)[:-10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10156,trivago: Hotels & Travel,TRAVEL_AND_LOCAL,4.2,219848,12.0,50000000,Free,0,Everyone,Travel & Local
10159,trivago: Hotels & Travel,TRAVEL_AND_LOCAL,4.2,219848,12.0,50000000,Free,0,Everyone,Travel & Local
9632,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133825,34.0,10000000,Free,0,Everyone 10+,Sports
9634,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133833,34.0,10000000,Free,0,Everyone 10+,Sports
9635,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133833,34.0,10000000,Free,0,Everyone 10+,Sports
...,...,...,...,...,...,...,...,...,...,...
6488,Ada - Your Health Guide,MEDICAL,4.7,87418,14.0,1000000,Free,0,Everyone,Medical
6786,AdWords Express,BUSINESS,4.1,7149,11.0,1000000,Free,0,Everyone,Business
5686,Accounting App - Zoho Books,BUSINESS,4.5,3079,8.5,100000,Free,0,Everyone,Business
3242,ASCCP Mobile,MEDICAL,4.5,63,25.0,10000,Paid,$9.99,Everyone,Medical


In [10]:
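# Sort so that, for each app, the row with the most reviews comes last
df_app_clean = df_app_clean.sort_values(['App', 'Reviews'])
# Keep one row per (App, Type, Price): the last one, i.e. the one with the most reviews
df_app_clean = df_app_clean.drop_duplicates(['App', 'Type', 'Price'], keep='last')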
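
The sort-then-deduplicate step can be checked on a small example. The sketch below uses made-up rows (the app names and numbers are hypothetical, not taken from the dataset) to show that sorting by 'Reviews' before calling `drop_duplicates(..., keep='last')` retains the copy of each app with the most reviews.

```python
import pandas as pd

# Toy data (hypothetical values): 'Alpha' appears twice with different review counts
toy = pd.DataFrame({
    'App': ['Alpha', 'Beta', 'Alpha'],
    'Type': ['Free', 'Paid', 'Free'],
    'Price': ['0', '$1.99', '0'],
    'Reviews': [150, 10, 120],
})

# After sorting, the row with the most reviews is the last one for each app
toy_sorted = toy.sort_values(['App', 'Reviews'])

# keep='last' retains that row for every (App, Type, Price) combination
toy_dedup = toy_sorted.drop_duplicates(['App', 'Type', 'Price'], keep='last')
print(toy_dedup)
```

Only one 'Alpha' row survives, and it is the one with 150 reviews.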

# Data Transformation
- The 'Installs' and 'Price' columns should be numeric -> remove the unnecessary characters (',' and '$') and convert them to a numeric data type
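
As a minimal sketch of this conversion, the example below cleans a few hypothetical values in the same format as the 'Installs' and 'Price' columns. Passing `regex=False` makes `str.replace` treat ',' and '$' as literal characters; this matters for '$', which is otherwise a regular-expression anchor.

```python
import pandas as pd

# Hypothetical raw values in the same string format as the dataset
raw = pd.DataFrame({
    'Installs': ['1,000', '50,000,000', '10'],
    'Price': ['0', '$9.99', '$399.99'],
})

# Strip the literal characters, then convert the cleaned strings to numbers
raw['Installs'] = pd.to_numeric(raw['Installs'].str.replace(',', '', regex=False))
raw['Price'] = pd.to_numeric(raw['Price'].str.replace('$', '', regex=False))

print(raw.dtypes)
```

After the conversion, 'Installs' is an integer column and 'Price' is a float column.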

In [11]:
df_app_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 7523 to 7574
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


In [12]:
df_app_clean['Installs'] = df_app_clean['Installs'].astype(str).str.replace(',','')
df_app_clean['Installs'] = pd.to_numeric(df_app_clean['Installs'])

In [13]:
# regex=False treats '$' as a literal character and avoids the pandas FutureWarning
df_app_clean['Price'] = df_app_clean['Price'].astype(str).str.replace('$', '', regex=False)
df_app_clean['Price'] = pd.to_numeric(df_app_clean['Price'])


In [14]:
df_app_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 7523 to 7574
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64  
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   float64
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(3), int64(2), object(5)
memory usage: 704.6+ KB


#### Add a new column 'Estimated Revenue'

In [15]:
df_app_clean['Estimated Revenue'] = df_app_clean['Price']*df_app_clean['Installs']
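
A short worked example of the same formula on hypothetical numbers is given below. Note the assumption built into the estimate: every install is counted as one purchase at the listed price, which the dataset itself does not confirm.

```python
import pandas as pd

# Hypothetical paid and free apps, for illustration only
sample = pd.DataFrame({
    'App': ['Paid app', 'Free app'],
    'Price': [400.0, 0.0],
    'Installs': [10000, 1000000],
})

# Estimated revenue = price x installs (assumes one purchase per install)
sample['Estimated Revenue'] = sample['Price'] * sample['Installs']
print(sample)
```

Free apps always get an estimated revenue of 0, regardless of how many installs they have.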

In [16]:
df_app_clean.sort_values('Price', ascending=False)[:20]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Estimated Revenue
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle,4000000.0
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment,1999950.0
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance,1999950.0
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,399.99,Everyone,Entertainment,39999.0
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle,3999900.0
3145,I am rich(premium),FINANCE,3.5,472,0.942383,5000,Paid,399.99,Everyone,Finance,1999950.0
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance,399990.0
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance,19999500.0
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance,399990.0
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle,39999000.0


#### Remove the rows related to 'I am Rich' apps
Many apps have names like 'I am rich'. They are priced far higher than other apps; the lowest price among them is 299.99
-> remove these rows

In [17]:
# Keep only apps priced below 250; the result is stored under a new name, df_clean_app
df_clean_app = df_app_clean[df_app_clean['Price'] < 250]

### The highest rated Apps

In [18]:
df_clean_app.sort_values(['Rating','Reviews'], ascending=False)[:10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Estimated Revenue
2095,Ríos de Fe,LIFESTYLE,5.0,141,15.0,1000,Free,0.0,Everyone,Lifestyle,0.0
2438,"FD Calculator (EMI, SIP, RD & Loan Eligilibility)",FINANCE,5.0,104,2.3,1000,Free,0.0,Everyone,Finance,0.0
3115,Oración CX,LIFESTYLE,5.0,103,3.8,5000,Free,0.0,Everyone,Lifestyle,0.0
2107,Barisal University App-BU Face,FAMILY,5.0,100,10.0,1000,Free,0.0,Everyone,Education,0.0
2069,Master E.K,FAMILY,5.0,90,19.0,1000,Free,0.0,Everyone,Education,0.0
1968,CL REPL,TOOLS,5.0,47,17.0,1000,Free,0.0,Everyone,Tools,0.0
790,AJ Cam,PHOTOGRAPHY,5.0,44,2.8,100,Free,0.0,Everyone,Photography,0.0
1275,AI Today : Artificial Intelligence News & AI 101,NEWS_AND_MAGAZINES,5.0,43,2.3,100,Free,0.0,Everyone,News & Magazines,0.0
2544,CS & IT Interview Questions,FAMILY,5.0,43,3.3,1000,Free,0.0,Everyone,Education,0.0
1789,Ek Vote,PRODUCTIVITY,5.0,43,6.2,500,Free,0.0,Everyone,Productivity,0.0


### 5 Apps with the most reviews

In [19]:
df_clean_app.sort_values('Reviews', ascending = False)[:5]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Estimated Revenue
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0.0,Teen,Social,0.0
10789,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0.0,Everyone,Communication,0.0
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0.0,Teen,Social,0.0
10790,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56646578,3.5,1000000000,Free,0.0,Everyone,Communication,0.0
10652,Clash of Clans,GAME,4.6,44893888,98.0,100000000,Free,0.0,Everyone 10+,Strategy,0.0


# Visualization

### Top 15 Categories with the most apps

In [20]:
categories = df_clean_app['Category'].value_counts()
print(categories)
print(f'\nNumber of categories: {len(categories)}')

FAMILY                 1649
GAME                    898
TOOLS                   720
PRODUCTIVITY            301
PERSONALIZATION         298
LIFESTYLE               297
FINANCE                 296
MEDICAL                 291
PHOTOGRAPHY             263
BUSINESS                263
SPORTS                  260
COMMUNICATION           256
HEALTH_AND_FITNESS      244
NEWS_AND_MAGAZINES      204
SOCIAL                  203
TRAVEL_AND_LOCAL        187
SHOPPING                180
BOOKS_AND_REFERENCE     169
VIDEO_PLAYERS           149
DATING                  134
MAPS_AND_NAVIGATION     118
EDUCATION               104
FOOD_AND_DRINK           94
ENTERTAINMENT            86
AUTO_AND_VEHICLES        73
WEATHER                  72
LIBRARIES_AND_DEMO       64
HOUSE_AND_HOME           61
ART_AND_DESIGN           59
COMICS                   54
PARENTING                50
EVENTS                   45
BEAUTY                   42
Name: Category, dtype: int64

Number of categories: 33


In [44]:
bar = px.bar(categories, x=categories.index[:15], y=categories.values[:15],
             color = categories.values[:15],
             color_continuous_scale = 'Tealgrn',
            title='Number of apps by Category')
bar.update_layout(xaxis_title='Category', yaxis_title='Number of apps',
                 coloraxis_showscale = False)
bar.show()

Evaluating the top 15 categories with the most apps
- 'Family' is the category with the most apps: 1649 apps belong to it
- 'Game' (898) and 'Tools' (720) have the 2nd and 3rd highest number of apps
- Each of the remaining categories in the top 15 has roughly 200-300 apps

### Content Rating component

In [22]:
cont_rating = df_clean_app['Content_Rating'].value_counts()
cont_rating

Everyone           6607
Teen                911
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [45]:
pie = px.pie(cont_rating, values = cont_rating.values, names=cont_rating.index,
            title='Label Component Percentage')
pie.show()

- Most apps (80.7%) are labeled 'Everyone'; 11.1% are labeled 'Teen'
- The rest are 'Mature 17+', 'Everyone 10+', 'Adults only 18+' and 'Unrated'
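
The percentages quoted here can be recomputed from the counts alone. The sketch below copies the counts from the `value_counts()` output into a Series (named `label_counts` for illustration) and converts them to shares of the total.

```python
import pandas as pd

# Counts copied from the Content_Rating value_counts() output
label_counts = pd.Series({
    'Everyone': 6607,
    'Teen': 911,
    'Mature 17+': 357,
    'Everyone 10+': 305,
    'Adults only 18+': 3,
    'Unrated': 1,
})

# Share of each label as a percentage of all apps, rounded to one decimal
share = (label_counts / label_counts.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `df_clean_app['Content_Rating'].value_counts(normalize=True)` should give the same shares as fractions.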

In [24]:
category_download = df_clean_app.groupby('Category', as_index=False).agg(Downloads=('Installs','sum'))
category_download.sort_values('Downloads', inplace=True)

In [41]:
h_bar = px.bar(category_download, 
               x=category_download['Downloads'][-15:], 
               y=category_download['Category'][-15:],
               color = category_download['Downloads'][-15:],
              color_continuous_scale = 'Tealgrn',
              title = 'Top 15 Categories with the most installs')
h_bar.update_layout(xaxis_title = 'Number of installs',
                    yaxis_title = 'Category',
                    coloraxis_showscale=False)
h_bar.show()

### Category Concentration - Downloads vs. Number of Apps

In [26]:
cate_install_apps = df_clean_app.groupby('Category', as_index=False).\
                        agg(Downloads=('Installs','sum'), Nb_Apps=('App','count'))

In [27]:
scatter = px.scatter(cate_install_apps, x='Nb_Apps', y='Downloads', color='Downloads', size='Nb_Apps',
                    title = 'Category Concentration - Downloads vs. Number of Apps', hover_name='Category')
scatter.update_layout(xaxis_title='Number of Apps',
                     yaxis=dict(type='log'))
scatter.show()

### Genres with the most apps

In [28]:
df_clean_app['Genres'].value_counts()

Tools                                  719
Entertainment                          467
Education                              429
Productivity                           301
Personalization                        298
                                      ... 
Arcade;Pretend Play                      1
Casual;Music & Video                     1
Art & Design;Pretend Play                1
Health & Fitness;Action & Adventure      1
Board;Pretend Play                       1
Name: Genres, Length: 114, dtype: int64

In [29]:
stack = df_clean_app['Genres'].astype(str).str.split(';', expand=True).stack()
number_stack = stack.value_counts()
print(len(number_stack))

53
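
To see what the split-and-stack step does, the sketch below runs it on a few hypothetical genre strings. It also shows `explode()` as an alternative way to get the same counts; this is a swapped-in technique for comparison, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical genre strings; ';' separates multiple genres for one app
genres = pd.Series(['Tools', 'Arcade;Pretend Play', 'Casual;Pretend Play', 'Tools'])

# expand=True gives one column per genre position; stack() flattens the
# columns into one long Series, and value_counts() ignores missing entries
stacked = genres.str.split(';', expand=True).stack()
genre_counts = stacked.value_counts()

# Alternative: split into lists, then explode() to get one row per genre
exploded_counts = genres.str.split(';').explode().value_counts()

print(genre_counts)
```

Each app with two genres contributes to both counts, so the totals add up to more than the number of apps.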


In [37]:
bar = px.bar(number_stack, x=number_stack.index[:15], y=number_stack.values[:15],
            color=number_stack.values[:15], color_continuous_scale = 'Tealgrn',
            title='Top 15 Genres with the most Apps')
bar.update_layout(xaxis_title='Genres', yaxis_title='Number of Apps', coloraxis_showscale=False)
bar.show()

### Free and Paid Apps by Category

In [31]:
free_paid =  df_clean_app.groupby(['Category', 'Type'], as_index=False).agg(Nb_Apps=('App','count'))

In [32]:
free_paid_bar = px.bar(free_paid, x='Category', y='Nb_Apps', color='Type', barmode='group')
free_paid_bar.update_layout(yaxis=dict(type='log'), xaxis={'categoryorder':'total descending'})
free_paid_bar.show()

### Distribution

In [38]:
box = px.box(df_clean_app, x='Type', y='Installs', color='Type',
            title = 'Distribution of number of installs by Free and Paid type')
box.update_layout(yaxis=dict(type='log'))
box.show()

In [34]:
paid_app = df_clean_app[df_clean_app['Type'] == 'Paid']

In [35]:
paid_box = px.box(paid_app, x='Category', y='Price',
                 title='Distribution of Price by Category')
paid_box.update_layout(yaxis=dict(type='log'), xaxis={'categoryorder':'max descending'})
paid_box.show()

In [36]:
paid_box = px.box(paid_app, x='Category', y='Estimated Revenue',
                 title='Distribution of Revenue by Category')
paid_box.update_layout(yaxis=dict(type='log'), xaxis = {'categoryorder':'max descending'})
paid_box.show()