# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go



# Notebook Presentation

In [None]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [None]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

Check how many rows and columns `df_apps` has. What are the column names? Look at a random sample of 5 rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [None]:
df_apps.shape

(10841, 12)

In [None]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [None]:
df_apps.sample(5)

### Drop Unused Columns

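Remove the `Last_Updated` and `Android_Ver` columns from the DataFrame; we will not use them. Note that `.drop()` returns a new DataFrame, and the call below does not assign the result, so `df_apps` itself keeps both columns (they still appear in later outputs).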

In [None]:
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business
...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade
10839,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade


### Find and Remove NaN values in Ratings

Check how many rows have a NaN (not-a-number) value in the `Rating` column. Create a DataFrame called `df_apps_clean` that does not include these rows.

In [None]:
df_apps.isna().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size_MBs             0
Installs             0
Type                 1
Price                0
Content_Rating       0
Genres               0
Last_Updated         0
Android_Ver          2
dtype: int64

In [None]:
df_apps_clean = df_apps.dropna()
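
Called without arguments, `dropna()` removes rows with a missing value in any column, so the rows missing `Type` or `Android_Ver` are dropped as well. If the intent is to filter on ratings only, one option is the `subset` parameter. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_apps
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.5, np.nan, 3.9],
    'Type': ['Free', 'Paid', np.nan],
})

# Drop rows only where Rating is missing; NaNs in other columns are kept
rating_only = toy.dropna(subset=['Rating'])

# Drop rows with a NaN in any column (the behaviour of a bare dropna())
any_nan = toy.dropna()

print(len(rating_only), len(any_nan))
```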

### Find and Remove Duplicates

Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries are there for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [None]:
# Count fully duplicated rows
# 474 duplicate entries
df_apps_clean.duplicated().sum()

474

In [None]:
# Rows flagged as duplicates of an earlier row
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]

In [None]:
# Check for duplicate entries of Instagram
duplicated_rows[duplicated_rows.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


In [None]:
# Dropping duplicates
df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'], inplace=True)
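
The warning above appears because pandas still tracks `df_apps_clean` as derived from `df_apps`. One common way to avoid it is to take an explicit `.copy()` of the cleaned frame and to reassign results rather than use `inplace=True`. This is a sketch on made-up data (`toy` and `toy_clean` are hypothetical names), not a rerun of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated app and one missing rating
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Maps', 'Notes'],
    'Type': ['Free', 'Free', 'Free', 'Paid'],
    'Price': ['0', '0', '0', '$1.99'],
    'Rating': [4.5, 4.5, 4.3, np.nan],
})

# An explicit copy makes the cleaned frame independent of the original
toy_clean = toy.dropna().copy()

# Reassign the result instead of modifying in place
toy_clean = toy_clean.drop_duplicates(subset=['App', 'Type', 'Price'])

print(toy_clean.App.tolist())
```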


In [None]:
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


# Find Highest Rated Apps

Identify which apps are the highest rated. What problem might we encounter if we rely on ratings alone to determine the quality of an app?

In [None]:
df_apps_clean.sort_values(by='Rating', ascending=False)

# One problem with relying on ratings alone: the top-rated apps here have only a handful of reviews, so a 5.00 rating says little about quality

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
1790,SUMMER SONIC app,EVENTS,5.00,4,61.00,500,Free,0,Everyone,Events,"July 24, 2018",4.4 and up
1769,Yazdani Cd Center EllahAbad Official App,FAMILY,5.00,8,3.80,500,Free,0,Everyone,Entertainment,"January 12, 2018",4.0 and up
985,DW Security,BUSINESS,5.00,6,15.00,100,Free,0,Everyone,Business,"July 25, 2018",4.1 and up
981,EU Exit poll,LIFESTYLE,5.00,10,9.40,100,Free,0,Everyone,Lifestyle,"July 15, 2016",4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
1314,CR Magazine,BUSINESS,1.00,1,7.80,100,Free,0,Everyone,Business,"July 23, 2014",2.3.3 and up
240,House party - live chat,DATING,1.00,1,9.20,10,Free,0,Mature 17+,Dating,"July 31, 2018",4.0.3 and up
1932,FE Mechanical Engineering Prep,FAMILY,1.00,2,21.00,1000,Free,0,Everyone,Education,"July 27, 2018",5.0 and up
617,DT future1 cam,TOOLS,1.00,1,24.00,50,Free,0,Everyone,Tools,"March 27, 2018",2.2 and up
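
A 5.00 rating based on four reviews carries little weight. One possible workaround is to require a minimum number of reviews before ranking by rating. The sketch below uses made-up numbers, and the `min_reviews` threshold is an arbitrary, hypothetical choice rather than anything taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data: two apps with very few reviews, two with many
toy = pd.DataFrame({
    'App': ['Tiny', 'Popular', 'Solid', 'Niche'],
    'Rating': [5.0, 4.6, 4.7, 4.9],
    'Reviews': [4, 1_000_000, 50_000, 12],
})

min_reviews = 1_000  # arbitrary cut-off for illustration

# Keep only apps with enough reviews, then rank by rating
top_rated = (toy[toy.Reviews >= min_reviews]
             .sort_values('Rating', ascending=False))

print(top_rated.App.tolist())
```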


# Find 5 Largest Apps in terms of Size (MBs)

Find the size in megabytes (MB) of the largest Android apps in the Google Play Store. Based on the data, do we think there could be a limit in place, or can developers make apps as large as they please?

In [None]:
df_apps_clean.sort_values(by='Size_MBs', ascending=False)

# The largest apps in this dataset all top out at exactly 100 MB, which suggests that the Google Play Store imposes a size limit on apps.

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.00,140995,100.00,10000000,Free,0,Everyone,Lifestyle;Pretend Play,"July 16, 2018",4.0 and up
10687,Hungry Shark Evolution,GAME,4.50,6074334,100.00,100000000,Free,0,Teen,Arcade,"July 25, 2018",4.1 and up
9943,Miami crime simulator,GAME,4.00,254518,100.00,10000000,Free,0,Mature 17+,Action,"July 9, 2018",4.0 and up
9944,Gangster Town: Vice District,FAMILY,4.30,65146,100.00,10000000,Free,0,Mature 17+,Simulation,"May 31, 2018",4.0 and up
3144,Vi Trainer,HEALTH_AND_FITNESS,3.60,124,100.00,5000,Free,0,Everyone,Health & Fitness,"August 2, 2018",5.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
2648,Ad Remove Plugin for App2SD,PRODUCTIVITY,4.10,66,0.02,1000,Paid,$1.29,Everyone,Productivity,"September 25, 2013",2.2 and up
5798,ExDialer PRO Key,COMMUNICATION,4.50,5474,0.02,100000,Paid,$3.99,Everyone,Communication,"January 15, 2014",2.1 and up
2684,My baby firework (Remove ad),FAMILY,4.10,30,0.01,1000,Paid,$0.99,Everyone,Entertainment,"April 25, 2013",Varies with device
7966,Market Update Helper,LIBRARIES_AND_DEMO,4.10,20145,0.01,1000000,Free,0,Everyone,Libraries & Demo,"February 12, 2013",1.5 and up


# Find the 5 Apps with the Most Reviews

Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [None]:
df_apps_clean.sort_values(by='Reviews', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10805,Facebook,SOCIAL,4.10,78158306,5.30,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.40,69119316,3.50,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10806,Instagram,SOCIAL,4.50,66577313,5.30,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.00,56642847,3.50,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10650,Clash of Clans,GAME,4.60,44891723,98.00,100000000,Free,0,Everyone 10+,Strategy,"July 15, 2018",4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
425,Labs on Demand,MEDICAL,5.00,1,22.00,10,Free,0,Everyone,Medical,"August 3, 2018",4.2 and up
901,ES Billing System (Offline App),PRODUCTIVITY,5.00,1,4.20,100,Free,0,Everyone,Productivity,"May 17, 2018",4.1 and up
453,Wowkwis aq Ka'qaquj,FAMILY,5.00,1,49.00,10,Free,0,Everyone,Education;Education,"February 16, 2018",4.0.3 and up
462,CB Fit,HEALTH_AND_FITNESS,5.00,1,7.80,10,Free,0,Everyone,Health & Fitness,"July 9, 2018",4.1 and up
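
The sorted table alone does not answer whether any paid apps are among the most reviewed. One way to check is to take the top rows with `nlargest()` and count the `Paid` entries. The sketch below uses a made-up frame and a small `top_n` (both hypothetical; the real check would use 50):

```python
import pandas as pd

# Hypothetical toy data standing in for df_apps_clean
toy = pd.DataFrame({
    'App': ['Chat', 'Social', 'Game', 'Editor'],
    'Reviews': [900, 800, 700, 50],
    'Type': ['Free', 'Free', 'Paid', 'Paid'],
})

top_n = 3  # would be 50 for the real data

# Rows with the largest review counts
top_reviewed = toy.nlargest(top_n, 'Reviews')

# Number of paid apps among them
paid_in_top = int((top_reviewed.Type == 'Paid').sum())

print(paid_in_top)
```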


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [None]:
ratings = df_apps_clean.Content_Rating.value_counts()
ratings

Everyone           6619
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [None]:
# Plot the pie chart (names= supplies the slice labels; labels= is for renaming columns)
fig = px.pie(names=ratings.index, values=ratings.values)
fig.show()


In [None]:
# Tuning the pie chart
fig = px.pie(values=ratings.values,
             title='Content Rating',
             names=ratings.index
             )
fig.update_traces(textposition='outside', textinfo='percent+label')



In [None]:
# Creating a donut chart
fig = px.pie(values=ratings.values,
             title='Content Rating',
             names=ratings.index,
             hole=0.6
             )
fig.update_traces(textposition='outside', textinfo='percent+label', textfont_size=15)



# Numeric Type Conversion: Examine the Number of Installs

How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [None]:
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8197 entries, 21 to 10835
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8197 non-null   object 
 1   Category        8197 non-null   object 
 2   Rating          8197 non-null   float64
 3   Reviews         8197 non-null   int64  
 4   Size_MBs        8197 non-null   float64
 5   Installs        8197 non-null   object 
 6   Type            8197 non-null   object 
 7   Price           8197 non-null   object 
 8   Content_Rating  8197 non-null   object 
 9   Genres          8197 non-null   object 
 10  Last_Updated    8197 non-null   object 
 11  Android_Ver     8197 non-null   object 
dtypes: float64(2), int64(1), object(9)
memory usage: 832.5+ KB


In [None]:
# Count the number of apps at each level of installations
df_apps_clean.Installs.value_counts()

# The values in this column contain comma characters, so they are stored as strings

1,000,000        1417
100,000          1096
10,000            987
10,000,000        933
1,000             697
5,000,000         607
500,000           504
50,000            457
5,000             425
100               303
50,000,000        202
500               199
100,000,000       189
10                 69
50                 56
500,000,000        24
1,000,000,000      20
5                   9
1                   3
Name: Installs, dtype: int64

In [None]:
# Replacing the commas with empty string
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [None]:
# Convert the Installs column from str to a number
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)

df_apps_clean[['App', 'Installs']].groupby('Installs').count()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,697
5000,425
10000,987
50000,457


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [None]:


# Strip the dollar sign, then convert the Price column from object to a number
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', '')
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8197 entries, 21 to 10835
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8197 non-null   object 
 1   Category        8197 non-null   object 
 2   Rating          8197 non-null   float64
 3   Reviews         8197 non-null   int64  
 4   Size_MBs        8197 non-null   float64
 5   Installs        8197 non-null   int64  
 6   Type            8197 non-null   object 
 7   Price           8197 non-null   float64
 8   Content_Rating  8197 non-null   object 
 9   Genres          8197 non-null   object 
 10  Last_Updated    8197 non-null   object 
 11  Android_Ver     8197 non-null   object 
dtypes: float64(3), int64(2), object(7)
memory usage: 832.5+ KB



The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
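
The regex warning in the output above is triggered by `str.replace('$', '')`, because `$` is also a regular-expression anchor. Passing `regex=False` states explicitly that the character should be treated literally. The sketch below applies the same two-step conversion to made-up strings (the `prices` and `installs` Series are hypothetical):

```python
import pandas as pd

# Hypothetical raw values in the same format as the Price and Installs columns
prices = pd.Series(['$13.99', '0', '$2.99'])
installs = pd.Series(['1,000', '50', '1,000,000,000'])

# Step 1: remove the non-numeric characters as literal strings
# Step 2: convert the cleaned strings to numbers
prices_num = pd.to_numeric(prices.str.replace('$', '', regex=False))
installs_num = pd.to_numeric(installs.str.replace(',', '', regex=False))

print(prices_num.tolist(), installs_num.tolist())
```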



### The most expensive apps sub $250

In [None]:
df_apps_clean[df_apps_clean.Price <= 250].sort_values(by='Price', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
2281,Vargo Anesthesia Mega App,MEDICAL,4.60,92,32.00,1000,Paid,79.99,Everyone,Medical,"June 18, 2018",4.0.3 and up
1407,LTC AS Legal,MEDICAL,4.00,6,1.30,100,Paid,39.99,Everyone,Medical,"April 4, 2018",4.1 and up
2629,I am Rich Person,LIFESTYLE,4.20,134,1.80,1000,Paid,37.99,Everyone,Lifestyle,"July 18, 2017",4.0.3 and up
2481,A Manual of Acupuncture,MEDICAL,3.50,214,68.00,1000,Paid,33.99,Everyone,Medical,"October 2, 2017",4.0 and up
4264,Golfshot Plus: Golf GPS,SPORTS,4.10,3387,25.00,50000,Paid,29.99,Everyone,Sports,"July 11, 2018",4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
4509,Hashtags For Likes.co,SOCIAL,4.30,420,18.00,50000,Free,0.00,Everyone,Social,"December 19, 2016",4.0.3 and up
4508,myAir™ for Air10™ by ResMed,MEDICAL,3.70,236,18.00,50000,Free,0.00,Everyone,Medical,"July 25, 2018",5.0 and up
4507,AK Math Coach,FAMILY,3.60,283,18.00,50000,Free,0.00,Everyone,Education,"May 19, 2015",2.3.3 and up
4506,Forgotten Hill: Fall,GAME,4.40,1063,18.00,50000,Free,0.00,Teen,Adventure,"October 30, 2017",3.0 and up
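
Note that the filtered frame above is displayed but never assigned back, so the $399.99 apps still show up in the revenue ranking in the next section. A minimal sketch of persisting the filter before estimating revenue, using made-up data (`toy` is hypothetical), might look like this:

```python
import pandas as pd

# Hypothetical toy data standing in for df_apps_clean
toy = pd.DataFrame({
    'App': ['Craft', 'I am rich', 'Sniper', 'Freebie'],
    'Category': ['FAMILY', 'LIFESTYLE', 'GAME', 'TOOLS'],
    'Price': [6.99, 399.99, 0.99, 0.0],
    'Installs': [10_000_000, 100_000, 10_000_000, 5_000],
})

# Assign the filtered result so the expensive apps are actually removed
toy = toy[toy.Price <= 250].copy()

# Ballpark revenue: price times number of installs
toy['Revenue_Estimate'] = toy.Price * toy.Installs

# Top 10 by estimated revenue, and how many of them are games
top = toy.sort_values('Revenue_Estimate', ascending=False).head(10)
games = int((top.Category == 'GAME').sum())

print(top.App.tolist(), games)
```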


### Highest Grossing Paid Apps (ballpark estimate)

In [None]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Price * df_apps_clean.Installs
df_apps_clean.sort_values(by='Revenue_Estimate', ascending=False)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver,Revenue_Estimate
9220,Minecraft,FAMILY,4.50,2376564,19.00,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device,69900000.00
5765,I am rich,LIFESTYLE,3.80,3547,1.80,100000,Paid,399.99,Everyone,Lifestyle,"January 12, 2018",4.0.3 and up,39999000.00
4606,I Am Rich Premium,FINANCE,4.10,1867,4.70,50000,Paid,399.99,Everyone,Finance,"November 12, 2017",4.0 and up,19999500.00
8825,Hitman Sniper,GAME,4.60,408292,29.00,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",4.1 and up,9900000.00
7151,Grand Theft Auto: San Andreas,GAME,4.40,348962,26.00,1000000,Paid,6.99,Mature 17+,Action,"March 21, 2015",3.0 and up,6990000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4509,Hashtags For Likes.co,SOCIAL,4.30,420,18.00,50000,Free,0.00,Everyone,Social,"December 19, 2016",4.0.3 and up,0.00
4508,myAir™ for Air10™ by ResMed,MEDICAL,3.70,236,18.00,50000,Free,0.00,Everyone,Medical,"July 25, 2018",5.0 and up,0.00
4507,AK Math Coach,FAMILY,3.60,283,18.00,50000,Free,0.00,Everyone,Education,"May 19, 2015",2.3.3 and up,0.00
4506,Forgotten Hill: Fall,GAME,4.40,1063,18.00,50000,Free,0.00,Teen,Adventure,"October 30, 2017",3.0 and up,0.00


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [None]:
df_apps_clean.Category.nunique()

33

In [None]:
top_10_category = df_apps_clean.Category.value_counts()[:10]
top_10_category

FAMILY             1610
GAME                910
TOOLS               719
FINANCE             302
LIFESTYLE           302
PRODUCTIVITY        301
PERSONALIZATION     296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [None]:
bar = px.bar(x=top_10_category.index, y=top_10_category.values)
bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [None]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [None]:
h_bar = px.bar(x=category_installs.Installs, y=category_installs.index, orientation='h')
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**:
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [None]:
# Number of apps and total installs per category
df_noApps_noDown = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum, 'App': pd.Series.count})
# sort_values() returns a new DataFrame, so sort in place to keep the result
df_noApps_noDown.sort_values('Installs', ascending=False, inplace=True)

fig = px.scatter(df_noApps_noDown, x='App', y='Installs', color='Installs', size='App', hover_name=df_noApps_noDown.index)
fig.update_layout(yaxis=dict(type='log'), xaxis_title='Number of Apps (Lower=More Concentrated)', yaxis_title='Installs')
fig.show()

# Extracting Nested Data from a Column

How many different genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if we can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [None]:

print(df_apps_clean.Genres.nunique())
stack=df_apps_clean.Genres.str.split(';', expand=True).stack()
print(stack)
num_genres = stack.value_counts()
print(num_genres)

114
21     0                    Medical
28     0                     Arcade
47     0                     Arcade
82     0                     Arcade
99     0                    Medical
                     ...           
10824  0               Productivity
10828  0    Video Players & Editors
10829  0    Video Players & Editors
10831  0           News & Magazines
10835  0                     Arcade
Length: 8577, dtype: object
Tools                      719
Education                  587
Entertainment              502
Action                     304
Lifestyle                  303
Finance                    302
Productivity               301
Personalization            296
Medical                    292
Sports                     270
Photography                263
Business                   262
Communication              258
Health & Fitness           245
Casual                     216
News & Magazines           204
Social                     203
Simulation                 200
Travel & Local
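
To see why the split-and-stack step matters, it can help to run it on a tiny example. The sketch below uses three made-up genre strings (the `genres` Series is hypothetical): counting the raw strings treats each combination as its own genre, while splitting on `;` and stacking counts every individual genre.

```python
import pandas as pd

# Hypothetical genre strings in the same semicolon-separated format
genres = pd.Series(['Arcade', 'Puzzle;Brain Games', 'Arcade;Action & Adventure'])

# On the raw column, each combined string is counted as a separate value
raw_counts = genres.value_counts()

# Split into one column per genre, then stack into a single long Series
stacked = genres.str.split(';', expand=True).stack()
genre_counts = stacked.value_counts()

print(raw_counts['Arcade'], genre_counts['Arcade'])
```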

# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data?

<img src=https://imgur.com/DbcoQli.png width=400>

Experiment with the built-in colour scales in Plotly. We can find a full list [here](https://plotly.com/python/builtin-colorscales/).

* Find a way to set the colour scale using the color_continuous_scale parameter.
* Find a way to make the color axis disappear by using coloraxis_showscale.

In [None]:
bar = px.bar(x=num_genres[:15].index, y=num_genres[:15].values, color_continuous_scale='Agsunset', color=num_genres.values[:15])
bar.update_layout(xaxis_title='Genre', yaxis_title='Number of Apps', coloraxis_showscale=False)
bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [None]:
# Count how many apps are Paid vs Free
df_apps_clean.Type.value_counts()
# Making dataframe for free vs paid
df_free_vs_paid = df_apps_clean.groupby(['Category', 'Type'], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42
...,...,...,...
56,TRAVEL_AND_LOCAL,Paid,8
57,VIDEO_PLAYERS,Free,144
58,VIDEO_PLAYERS,Paid,4
59,WEATHER,Free,65


Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart:

<img src=https://imgur.com/LE0XCxA.png>

We'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category.

See if we can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

In [None]:
g_bar = px.bar(df_free_vs_paid, x='Category', y='App', barmode='group', color='Type')
g_bar.update_layout(xaxis={'categoryorder': 'total descending'},
                    xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    yaxis=dict(type='log')
                    )
g_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart.

<img src=https://imgur.com/uVsECT3.png>


In [None]:
box = px.box(df_apps_clean, x='Type', y='Installs', color='Type', title='How Many Downloads Are Paid Apps Giving Up?')
box.update_layout(yaxis=dict(type='log'))
box.show()
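
The box plot shows the medians on hover, but they can also be computed directly with a `groupby()`. This is a sketch on made-up install counts (`toy` is hypothetical), so the numbers are not those of the real dataset:

```python
import pandas as pd

# Hypothetical install counts for free and paid apps
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [1_000, 500_000, 10_000_000, 100, 5_000, 100_000],
})

# Median number of installs for each app type
median_installs = toy.groupby('Type')['Installs'].median()

print(median_installs)
```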

# Plotly Box Plots: Revenue by App Category
See if we can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [None]:
df_apps_clean_paid = df_apps_clean[df_apps_clean['Type'] == 'Paid']
box_category = px.box(df_apps_clean_paid, x='Category', y='Revenue_Estimate')
box_category.update_layout(yaxis=dict(type='log'), xaxis={'categoryorder': 'min ascending'},
                           xaxis_title='Category',
                           yaxis_title='Paid App Ball Park Revenue')
box_category.show()

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.

In [None]:
df_apps_clean_paid


# Median price for paid apps is $2.99

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver,Revenue_Estimate
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,1.49,Everyone,Arcade,"February 8, 2017",2.3 and up,1.49
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,0.99,Everyone,Arcade,"March 3, 2017",2.3 and up,0.99
233,Chess of Blades (BL/Yaoi Game) (No VA),FAMILY,4.80,4,23.00,10,Paid,14.99,Teen,Casual,"January 24, 2018",2.3.3 and up,149.90
248,The DG Buddy,BUSINESS,3.70,3,11.00,10,Paid,2.49,Everyone,Business,"June 30, 2014",2.2 and up,24.90
291,AC DC Power Monitor,LIFESTYLE,5.00,1,1.20,10,Paid,3.04,Everyone,Lifestyle,"May 29, 2016",2.3 and up,30.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7957,League of Stickman 2018- Ninja Arena PVP(Dream...,GAME,4.40,32496,99.00,1000000,Paid,0.99,Teen,Action,"July 3, 2018",2.3 and up,990000.00
7977,Sleep as Android Unlock,LIFESTYLE,4.50,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,"June 27, 2018",4.0 and up,5990000.00
7988,Where's My Water?,FAMILY,4.70,188740,69.00,1000000,Paid,1.99,Everyone,Puzzle;Brain Games,"July 5, 2018",4.2 and up,1990000.00
8825,Hitman Sniper,GAME,4.60,408292,29.00,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",4.1 and up,9900000.00
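
Because a few very expensive apps pull the mean upwards, the median is usually the more telling figure for pricing. The sketch below compares the two on made-up prices (`toy` is hypothetical) after filtering to paid apps:

```python
import pandas as pd

# Hypothetical prices, including one free app and one extreme outlier
toy = pd.DataFrame({
    'Type': ['Paid', 'Paid', 'Paid', 'Paid', 'Paid', 'Free'],
    'Price': [0.99, 1.99, 2.99, 4.99, 399.99, 0.0],
})

# Keep only paid apps before summarising prices
paid = toy[toy.Type == 'Paid']

median_price = paid.Price.median()
mean_price = paid.Price.mean()

print(median_price, round(mean_price, 2))
```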


In [None]:
box_price = px.box(df_apps_clean_paid, x='Category', y='Price')
box_price.update_layout(yaxis=dict(type='log'), xaxis={'categoryorder': 'max descending'})
box_price.show()