# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [44]:
import pandas as pd
# creating charts with plotly
import plotly.express as px


# Notebook Presentation

In [68]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [69]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

**Test**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [70]:
df_apps.shape
# there are 10,841 rows, and 12 columns associated with the dataset in df_apps

(10841, 12)

In [71]:
df_column_names = df_apps.columns
df_column_names_str = ', '.join(df_column_names)
print(f'The column names in df_apps are: {df_column_names_str}.')

The column names in df_apps are: App, Category, Rating, Reviews, Size_MBs, Installs, Type, Price, Content_Rating, Genres, Last_Updated, Android_Ver.


In [72]:
#random sample of five rows
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
5446,Diabetes Connect,MEDICAL,4.5,4027,8.4,100000,Free,0,Everyone,Medical,"October 10, 2017",4.0 and up
3196,v-view,FAMILY,3.6,309,19.0,10000,Free,0,Everyone,Entertainment,"June 22, 2017",4.2 and up
7315,How Old am I?,FAMILY,2.8,4635,3.9,1000000,Free,0,Everyone,Entertainment,"January 1, 2018",4.0 and up
327,bacterial vaginosis,MEDICAL,,0,3.6,10,Free,0,Teen,Medical,"March 26, 2018",4.0 and up
965,DS Helpdesk Plus,BUSINESS,3.6,21,7.2,100,Paid,$12.99,Everyone,Business,"January 30, 2017",Varies with device


### Drop Unused Columns

**Test**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [73]:
# columns= names the columns to drop; inplace=True modifies df_apps directly instead of returning a copy
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


### Find and Remove NaN values in Ratings

**Test**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [74]:
df_apps.isna().sum()
#1474 rows have NaN values in the Rating column; remove these and store the result in a new DataFrame

Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size_MBs,0
Installs,0
Type,1
Price,0
Content_Rating,0
Genres,0


In [75]:
#drop every row that contains a NaN in any column and store the result in a new DataFrame
df_apps_clean = df_apps.dropna()
#verify cleaning
df_apps_clean.isna().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size_MBs,0
Installs,0
Type,0
Price,0
Content_Rating,0
Genres,0
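
Calling `.dropna()` without arguments removes rows with a missing value in any column, not only in `Rating`. The sketch below uses a small made-up DataFrame (the rows are hypothetical, not taken from `apps.csv`) to show how the `subset` parameter restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows): one app lacks a rating, another lacks a type.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# No arguments: drop every row that has a NaN in any column.
dropped_any = toy.dropna()

# subset=: only rows with a NaN in the listed columns are dropped.
dropped_rating_only = toy.dropna(subset=["Rating"])

print(len(dropped_any))
print(len(dropped_rating_only))
```

In this notebook the distinction has little effect, because the only row with a missing `Type` shown above (Command & Conquer: Rivals) also lacks a rating.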


### Find and Remove Duplicates

**Test**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can be found for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [76]:
df_apps_clean.duplicated().sum()
#476 rows are duplicated;

476

In [77]:
#new variable = from new dataframe, select the duplicated rows
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
#sample data to show the duplicated records with instagram example
print(df_apps_clean[df_apps_clean.App == 'Instagram'])

(476, 10)
             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10808  Instagram   SOCIAL    4.50  66577446      5.30  1,000,000,000  Free   
10809  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10810  Instagram   SOCIAL    4.50  66509917      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
10808     0           Teen  Social  
10809     0           Teen  Social  
10810     0           Teen  Social  


In [78]:
#drop duplicates from the new df; specify the subset for identifying the duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App','Type','Price'])
#verify drop of duplicates
print(df_apps_clean.duplicated().sum())
df_apps_clean[df_apps_clean.App == 'Instagram']

0


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social


In [79]:
df_apps_clean.shape

(8199, 10)
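
The Instagram rows differ only in their review counts, which is why a full-row `.duplicated()` check does not flag all of them and a `subset` is needed. A minimal sketch on made-up rows (the values are hypothetical) illustrates the difference:

```python
import pandas as pd

# Toy data: three scrapes of the same app, one with a different review count.
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Maps"],
    "Reviews": [100, 100, 105, 50],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
})

# A row counts as a duplicate only if every column matches an earlier row.
full_row_dupes = toy.duplicated().sum()

# Comparing only the identifying columns also catches the near-duplicate.
near_dupes = toy.duplicated(subset=["App", "Type", "Price"]).sum()

# drop_duplicates keeps the first occurrence of each (App, Type, Price) combination.
deduped = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(full_row_dupes, near_dupes, len(deduped))
```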

# Find Highest Rated Apps

**Test**: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?

In [80]:
df_apps_clean.sort_values('Rating', ascending=False).head()
# Relying on ratings alone is risky: an app can have a perfect rating based on only a handful of reviews.
# The five apps below all score 5.0 but have between 2 and 36 reviews, possibly written by friends or family, so the ranking says little about quality.

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.0,2,22.0,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.0,36,2.6,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.0,13,2.6,100,Free,0,Everyone,Photography
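
One way to reduce the effect of thinly reviewed apps is to require a minimum number of reviews before ranking by rating. The sketch below uses made-up rows, and the `MIN_REVIEWS` cut-off is an assumption chosen only for illustration, not a value from the dataset.

```python
import pandas as pd

# Toy data (hypothetical apps).
toy = pd.DataFrame({
    "App": ["Tiny", "Popular", "Niche"],
    "Rating": [5.0, 4.6, 4.9],
    "Reviews": [2, 250_000, 40],
})

MIN_REVIEWS = 1_000  # assumed threshold, purely illustrative

# Keep only apps with enough reviews, then rank by rating (ties broken by review count).
trusted = (
    toy[toy["Reviews"] >= MIN_REVIEWS]
    .sort_values(["Rating", "Reviews"], ascending=False)
)

print(trusted)
```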


# Find the 5 Largest Apps in Terms of Size (MBs)

**Test**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?

In [81]:
df_apps_clean.sort_values('Size_MBs',ascending=False).head()
#every one of the largest apps in this sample is exactly 100 MB, which suggests that app size is capped at about 100 MB.


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


# Find the 5 Apps with the Most Reviews

**Test**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [82]:
df_apps_clean.sort_values('Reviews',ascending=False).head(50)
# The apps below are the most reviewed apps, and none of them are paid apps. This suggests that these apps earn money through
# in-app purchases or some other means, such as advertisements or harvesting of data.

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [83]:
# Comparing the distribution of the apps by the content ratings in our dataset.
ratings = df_apps_clean.Content_Rating.value_counts()
ratings

Unnamed: 0_level_0,count
Content_Rating,Unnamed: 1_level_1
Everyone,6621
Teen,912
Mature 17+,357
Everyone 10+,305
Adults only 18+,3
Unrated,1
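
The percentages that a pie chart displays can also be computed directly with `value_counts(normalize=True)`. This is a small sketch on a made-up Series (the values are hypothetical, not the real counts above):

```python
import pandas as pd

# Toy content ratings: 6 x Everyone, 3 x Teen, 1 x Mature 17+.
content = pd.Series(["Everyone"] * 6 + ["Teen"] * 3 + ["Mature 17+"])

# normalize=True returns fractions that sum to 1; scale them to percentages.
shares = content.value_counts(normalize=True).mul(100).round(1)

print(shares)
```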


In [84]:
# call plotly express pie(): names= sets the slice labels, values= sets the slice sizes
fig = px.pie(names=ratings.index, values=ratings.values)
#show the figure created with: show()
fig.show()

In [85]:
# customise the pie chart with a title and outside labels; see the px.pie API reference for the full parameter list
fig = px.pie(names=ratings.index,     # slice labels, also used for the legend
             values=ratings.values,   # slice sizes
             title="Content Rating")  # title for the chart graphic
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

In [86]:
#donut charts can be created by adding a hole argument
fig = px.pie(names=ratings.index,     # slice labels, also used for the legend
             values=ratings.values,   # slice sizes
             title="Content Rating",  # title for the chart graphic
             hole=0.6)                # size of the hole as a fraction of the pie radius
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Test**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [87]:
# Check the datatype of the Installs column.
df_apps_clean.Installs.describe() #not as descriptive as anticipated, check next type
df_apps_clean.info() # 5   Installs        8199 non-null   object


<class 'pandas.core.frame.DataFrame'>
Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


In [88]:
#if we take two columns: 'Installs' and 'App' we can count the number of entries per level of installations,
# but since we are dealing with a non-numeric, ordering isn't helpful. (the comma throws it off)
df_apps_clean[['App','Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


In [89]:
# Convert the Installs column to a numeric data type so the install levels sort correctly.
# First cast to str and strip the thousands separators with str.replace (regex=False treats ',' literally).
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', '', regex=False)
#convert to numeric
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)

In [90]:
# NOW we can display the counts associated with each grouping of installs
df_apps_clean[['App','Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457
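
The reason the grouped output changes order after the conversion is that strings sort character by character, while numbers sort by magnitude. A minimal sketch with made-up install strings:

```python
import pandas as pd

# Toy install counts stored as text with thousands separators.
installs = pd.Series(["1,000", "50", "1,000,000", "500"])

# As strings, "1,000,000" sorts before "50" because '1' comes before '5'.
text_order = sorted(installs)

# Strip the commas, convert to integers, and sort by magnitude instead.
numeric = pd.to_numeric(installs.str.replace(",", "", regex=False))
numeric_order = numeric.sort_values().tolist()

print(text_order)
print(numeric_order)
```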


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Test**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [94]:
# look at the raw price strings first
df_apps_clean[['App','Price']].groupby('Price').count()
# strip the '$' sign; regex=False treats '$' as a literal character rather than the end-of-string anchor
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', '', regex=False)
# convert the cleaned strings to numbers
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
# the 20 most expensive apps in the dataset
df_apps_clean.sort_values("Price", ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


### The most expensive apps sub $250

In [97]:
# Remove all apps that cost more than $250 from the df_apps_clean DataFrame
# by keeping only the rows whose Price is below 250
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values("Price", ascending=False).head(20)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical
2207,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,29.99,Everyone,Medical
4264,Golfshot Plus: Golf GPS,SPORTS,4.1,3387,25.0,50000,Paid,29.99,Everyone,Sports
504,AP Art History Flashcards,FAMILY,5.0,1,96.0,10,Paid,29.99,Mature 17+,Education
4772,Human Anatomy Atlas 2018: Complete 3D Human Body,MEDICAL,4.5,2921,25.0,100000,Paid,24.99,Everyone,Medical
3241,"Muscle Premium - Human Anatomy, Kinesiology, B...",MEDICAL,4.2,168,25.0,10000,Paid,24.99,Everyone,Medical


In [99]:
# Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs.
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)


### Highest Grossing Paid Apps (ballpark estimate)

In [100]:
# What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?
#sort and sample top 10
df_apps_clean.sort_values('Revenue_Estimate',ascending=False)[:10]
#holy moly look at the revenue from those games; 4 of the top 10 are in the GAME category, and Minecraft, Bloons TD 5 and Card Wars (filed under FAMILY) are games too!

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0
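
The whole estimate (clean both columns, multiply, sort) can be sketched end to end on made-up rows. The app names and values below are hypothetical. Note that the install counts in this dataset only take rounded values (1, 5, 10, 50, ...), so the result is a coarse ballpark rather than an actual revenue figure.

```python
import pandas as pd

# Toy data in the same raw text format as the Installs and Price columns.
toy = pd.DataFrame({
    "App": ["Game X", "Tool Y", "Free Z"],
    "Installs": ["1,000,000", "50,000", "10,000,000"],
    "Price": ["$0.99", "$4.99", "0"],
})

# Remove the non-numeric characters, then convert to numbers.
toy["Installs"] = pd.to_numeric(toy["Installs"].str.replace(",", "", regex=False))
toy["Price"] = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False))

# Ballpark revenue: price times number of installs (free apps come out as 0).
toy["Revenue_Estimate"] = toy["Installs"] * toy["Price"]

top = toy.sort_values("Revenue_Estimate", ascending=False)
print(top[["App", "Revenue_Estimate"]])
```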


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [107]:
df_apps_clean.Category.nunique() # 33 unique categories present within this dataset
top10 = df_apps_clean.Category.value_counts()[:10]
top10

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
FAMILY,1606
GAME,910
TOOLS,719
PRODUCTIVITY,301
PERSONALIZATION,298
LIFESTYLE,297
FINANCE,296
MEDICAL,292
PHOTOGRAPHY,263
BUSINESS,262


In [109]:
bar = px.bar(x= top10.index, #index = Category Name
             y= top10.values)
bar.show()

In [117]:
# looking at this from a different perspective, how often are apps downloaded in each category? This tells us how popular a category is.
# first, group apps by category and aggregate the sum of the installations
category_installs = df_apps_clean.groupby('Category').agg({"Installs":pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)
#create horizontal bar chart, by adding orientation parameter
h_bar = px.bar(x= category_installs.Installs,
              y= category_installs.index,
              orientation='h',
              title="App Installs by Category")
h_bar.update_layout(xaxis_title = 'Number of Downloads', yaxis_title="Category")
h_bar.show()

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [126]:
highest_competition = df_apps_clean.groupby('Category').agg({"App":pd.Series.count})
highest_competition.sort_values("App", ascending=False, inplace=True)
#create bar chart
hcomp_bar = px.bar(x= highest_competition.index,
                   y=highest_competition.App,
                   orientation = "v",
                   title="App Number by Category")
hcomp_bar.update_layout(xaxis_title="Category", yaxis_title="Number of Apps")
hcomp_bar.show()

### Category Concentration - Downloads vs. Competition

**Test**:
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this:

<img src=https://imgur.com/cHsqh6a.png>

Hint: Use the size, hover_name and color parameters in .scatter().
To scale the y-axis, call .update_layout() and specify that the y-axis should be on a log scale like so: yaxis=dict(type='log')

In [127]:
# merge the dataframes with pandas merge, on field Category, inner join
df_category_merged = pd.merge(highest_competition, category_installs, on='Category', how='inner')
print(f"The dimensions of the dataframe are {df_category_merged.shape}")
df_category_merged.sort_values("Installs", ascending=False)

The dimensions of the dataframe are (33, 2)


Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650
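
As an alternative to merging two separately grouped DataFrames, pandas named aggregation can produce both columns in one `groupby` call. This is a sketch on made-up rows, not the approach used in the cell above:

```python
import pandas as pd

# Toy data (hypothetical apps).
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["GAME", "GAME", "TOOLS", "GAME"],
    "Installs": [100, 1_000, 500, 10],
})

# Each keyword is an output column: new_name=(source_column, aggregation).
cat_summary = (
    toy.groupby("Category")
    .agg(App=("App", "count"), Installs=("Installs", "sum"))
    .sort_values("Installs", ascending=False)
)

print(cat_summary)
```

The result has the same layout as `df_category_merged`: one row per category, with the number of apps and the total installs side by side.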


In [131]:
scatter = px.scatter(df_category_merged,
                     x='App',
                     y='Installs',
                     title='Category Concentration',
                     size='App',
                     hover_name=df_category_merged.index,
                     color='Installs')

scatter.update_layout(xaxis_title='Number of Apps',
                      yaxis_title='Installs',
                      yaxis=dict(type='log'))

scatter.show()

#the visual shows three distinct groupings:
#1) Diverse and popular (top right of visual): many different apps share a high number of downloads
#2) Concentrated and popular (top left of visual): a few apps account for a high number of downloads
#3) Concentrated and unloved (bottom left): few apps and comparatively few downloads in these categories, with no major pattern

# Extracting Nested Data from a Column

**Test**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [133]:
# How many different types of genres are there?
num_genres = len(df_apps_clean.Genres.unique())
print(f"There are {num_genres} different genres associated with the dataset.")
# Can an app belong to more than one genre? Yes, illustrated below
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]

There are 114 different genres associated with the dataset.


Unnamed: 0_level_0,count
Genres,Unnamed: 1_level_1
Lifestyle;Pretend Play,1
Strategy;Education,1
Adventure;Education,1
Role Playing;Brain Games,1
Tools;Education,1


In [137]:
#split each string on the semicolon into separate columns, then .stack() the columns into one long Series with one genre per entry
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'Now we have a single column with shape {stack.shape}') #(8564,)

num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}') #53

Now we have a single column with shape (8564,)
Number of genres: 53
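
The same unnesting can also be done with `Series.explode()`, which turns each element of a list into its own row. This is a sketch of an alternative on made-up genre strings, not the method used in the cell above:

```python
import pandas as pd

# Toy genre strings: some apps carry two genres separated by a semicolon.
genres = pd.Series(["Arcade;Action & Adventure", "Arcade", "Lifestyle;Pretend Play"])

# Split into lists, then give every list element its own row.
exploded = genres.str.split(";").explode()

# Count how often each individual genre occurs.
counts = exploded.value_counts()

print(counts)
```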


# Colour Scales in Plotly Charts - Competition in Genres

Create a bar chart from the Series containing the genre counts.

Reference: the full list of built-in colour scales in Plotly is [here](https://plotly.com/python/builtin-colorscales/).


In [147]:
genre_bar = px.bar(x=num_genres.index[:15], # index = genre name
                   y=num_genres.values[:15], #count
                   title= 'Top Genres',
                   hover_name=num_genres.index[:15],
                   color=num_genres.values[:15],
                   color_continuous_scale='Agsunset'
                   )

genre_bar.update_layout(xaxis_title="Genre",
                        yaxis_title="Number of Apps",
                        coloraxis_showscale=False)
genre_bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart:

<img src=https://imgur.com/LE0XCxA.png>

Use the `df_free_vs_paid` DataFrame created below, which holds the total number of free and paid apps per category.

Set `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

In [152]:
#create a free vs paid following the same group by and aggregation methods as above
#as_index=False keeps Category and Type as regular columns instead of moving them into the index
df_free_vs_paid = df_apps_clean.groupby(['Category','Type'], as_index=False).agg({'App':pd.Series.count})
df_free_vs_paid.sort_values('App')

Unnamed: 0,Category,Type,App
3,AUTO_AND_VEHICLES,Paid,1
24,FOOD_AND_DRINK,Paid,2
38,NEWS_AND_MAGAZINES,Paid,2
40,PARENTING,Paid,2
17,ENTERTAINMENT,Paid,2
...,...,...,...
31,LIFESTYLE,Free,284
21,FINANCE,Free,289
53,TOOLS,Free,656
25,GAME,Free,834
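
A grouped count in long form simply omits any category and type combination that has no apps. The sketch below, on made-up rows, contrasts this with `pd.crosstab`, which returns a wide table with explicit zeros. It is offered as an illustration, not as the method used above.

```python
import pandas as pd

# Toy data: TOOLS has no paid apps.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Free", "Free", "Paid", "Free", "Free"],
})

# Long form: one row per (Category, Type) pair that actually occurs.
long_form = toy.groupby(["Category", "Type"], as_index=False).agg(App=("App", "count"))

# Wide form: one row per category, one column per type, zeros where nothing occurs.
wide_form = pd.crosstab(toy["Category"], toy["Type"])

print(long_form)
print(wide_form)
```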


In [160]:
#creating the bar chart
fvp_bar = px.bar(df_free_vs_paid,
                 x='Category',
                 y='App',
                 title="Free vs Paid Apps by Category",
                 color='Type',
                 barmode='group'
                 )
fvp_bar.update_layout(xaxis_title='Category',
                      yaxis_title='Number of Apps',
                      xaxis={'categoryorder':'total descending'},
                      yaxis=dict(type='log'))
fvp_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

[Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html)

In [164]:
fvp_box = px.box(df_apps_clean,
                x='Type',
                 y='Installs',
                 color='Type',
                 notched=True,
                 points='all',
                 title="How Many Downloads are Paid Apps Giving Up?"
                 )
fvp_box.update_layout(yaxis=dict(type='log'))
fvp_box.show()

# the upper fence for free app installs is 10M, whereas for paid apps it is 100k. That's a substantial gap in the number of downloads.
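
The medians that the box plot displays can also be read off numerically with a `groupby`. This is a minimal sketch on made-up install counts (the numbers are hypothetical and do not come from the dataset):

```python
import pandas as pd

# Toy data: three free and three paid apps.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000, 100_000, 10_000_000, 100, 5_000, 50_000],
})

# Median installs per app type.
medians = toy.groupby("Type")["Installs"].median()

# How many times larger the free median is than the paid median.
ratio = medians["Free"] / medians["Paid"]

print(medians)
print(ratio)
```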

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [169]:
df_paid_apps = df_apps_clean[df_apps_clean['Type']== 'Paid']
fvp_box = px.box(df_paid_apps,
                x='Category',
                 y='Revenue_Estimate',
                 title="How Much Can Paid Apps Earn?"
                 )
fvp_box.update_layout(xaxis_title='Category',
                      yaxis_title='Paid App Estimated Revenue',
                      xaxis={'categoryorder':'min ascending'},
    yaxis=dict(type='log'))
fvp_box.show()

#if an Android app costs about $30k to develop, the average paid app recoups that cost in very few categories.
# several categories have high-earning outliers, shown by the dots above the boxes (e.g. Game, Family, Personalization, Medical, Tools),
# and in those categories development could still pay off, provided the app's features are strong enough to lift it above the category average

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

What is the median price for a paid app?
Compare pricing by category with another box plot, this time examining the prices (instead of the revenue estimates) of the paid apps.

In [172]:
df_paid_apps.Price.median() #the median price for a paid app is 2.99, but it varies by category

2.99
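
A per-category price summary gives the numbers behind the box plot that follows. The sketch below uses made-up prices (hypothetical values, not taken from `df_paid_apps`):

```python
import pandas as pd

# Toy paid apps in two categories.
toy = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "MEDICAL", "GAME", "GAME"],
    "Type": ["Paid"] * 5,
    "Price": [24.99, 29.99, 79.99, 0.99, 2.99],
})

# Median, maximum and number of paid apps per category, most expensive median first.
price_summary = (
    toy[toy["Type"] == "Paid"]
    .groupby("Category")["Price"]
    .agg(["median", "max", "count"])
    .sort_values("median", ascending=False)
)

print(price_summary)
```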

In [174]:
fvp_box = px.box(df_paid_apps,
                x='Category',
                 y='Price',
                 title="Price per Category"
                 )
fvp_box.update_layout(xaxis_title='Category',
                      yaxis_title='Paid App Price',
                      xaxis={'categoryorder':'max descending'},
    yaxis=dict(type='log'))
fvp_box.show()

# The highest-priced apps fall in categories such as Medical, Lifestyle, Sports and Family, where demand looks relatively
# price inelastic, meaning that buyers keep purchasing even when prices are higher.
# It is often said that even during the Great Depression, entertainment goods kept selling because consumers justified them as
# 'a nice treat'. These categories can be priced higher because their target audience is more likely to keep purchasing, so long as
# the selection fits both this price chart and the revenue chart above.