# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [None]:
import pandas as pd

# Notebook Presentation

In [None]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [None]:
df_apps = pd.read_csv('apps.csv')
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [None]:
df_apps.shape

(10841, 12)

In [None]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [None]:
# Show n randomly chosen rows with .sample(n)
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
4333,ConnectLine,MEDICAL,3.1,253,4.2,50000,Free,0,Everyone,Medical,"September 12, 2016",2.3.3 and up
8159,Pixgram- video photo slideshow,PHOTOGRAPHY,4.2,93726,9.2,5000000,Free,0,Everyone,Photography,"July 8, 2018",4.3 and up
8403,BetterMe: Weight Loss Workouts,HEALTH_AND_FITNESS,4.2,14709,15.0,5000000,Free,0,Everyone,Health & Fitness,"July 26, 2018",5.0 and up
1377,eG Monitor,PRODUCTIVITY,4.6,8,7.5,100,Free,0,Everyone,Productivity,"July 27, 2018",4.1 and up
3422,DZ Mobile Market,SHOPPING,,59,6.0,10000,Free,0,Everyone,Shopping,"February 1, 2018",4.0.3 and up


In [None]:
df_apps[['Category','Rating']]

Unnamed: 0,Category,Rating
0,SOCIAL,
1,FAMILY,
2,PERSONALIZATION,
3,FAMILY,
4,BUSINESS,
...,...,...
10836,GAME,4.50
10837,GAME,4.50
10838,GAME,4.50
10839,GAME,4.50


In [None]:
df_apps['Category']

0                 SOCIAL
1                 FAMILY
2        PERSONALIZATION
3                 FAMILY
4               BUSINESS
              ...       
10836               GAME
10837               GAME
10838               GAME
10839               GAME
10840               GAME
Name: Category, Length: 10841, dtype: object

### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [None]:
# .drop() removes rows (axis=0) or columns (axis=1); here we drop two columns in place.
df_apps.drop(['Android_Ver','Last_Updated'],axis=1,inplace=True)
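
As a side note to the cell above, here is a minimal sketch of the same column drop without `inplace=True`. The frame `toy` and its rows are made up for illustration; only the two column names come from the dataset. Returning a new DataFrame keeps the original intact, which can make re-running notebook cells safer.

```python
import pandas as pd

# Toy frame (made-up rows) that carries the same two unused columns as df_apps.
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, 3.9],
    "Last_Updated": ["July 28, 2017", "June 28, 2018"],
    "Android_Ver": ["4.1 and up", "Varies with device"],
})

# columns= names the axis explicitly; without inplace=True a new frame
# is returned and `toy` itself is left untouched.
trimmed = toy.drop(columns=["Last_Updated", "Android_Ver"])
print(list(trimmed.columns))
```

Running the drop cell twice with `inplace=True` raises a `KeyError` on the second run because the columns are already gone; the non-mutating form avoids that.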


In [None]:
# Now recheck our df_apps
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the Rating column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [None]:
# Are there any NaN values anywhere in the DataFrame?
df_apps.isna().values.any()

True

In [None]:
# Does the Rating column contain any NaN values?
df_apps['Rating'].isna().values.any()

True

In [None]:
# Total number of NaN values across all columns of the DataFrame (not only Rating)
df_apps.isna().values.sum()

1475

In [None]:
# Show all rows that have a NaN value in the Rating column.
df_apps[df_apps['Rating'].isna()]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business
...,...,...,...,...,...,...,...,...,...,...
5840,Em Fuga Brasil,FAMILY,,1317,60.00,100000,Free,0,Everyone,Simulation
5862,Voice Tables - no internet,PARENTING,,970,71.00,100000,Free,0,Everyone,Parenting
6141,Young Speeches,LIBRARIES_AND_DEMO,,2221,2.40,500000,Free,0,Everyone,Libraries & Demo
7035,SD card backup,TOOLS,,142,3.40,1000000,Free,0,Everyone,Tools


In [None]:
# Drop every row that contains a NaN value in any column
df_apps_clean=df_apps.dropna()
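
The difference between counting NaN values per column and across the whole DataFrame, and between `dropna()` and `dropna(subset=...)`, is easy to miss. The sketch below uses a made-up frame called `toy` (not the real data) to show both.

```python
import numpy as np
import pandas as pd

# Made-up rows: two missing ratings and one missing Type.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, np.nan, 3.8],
    "Type": ["Free", "Paid", "Free", None],
})

# NaN count per column; summing this Series gives the whole-frame total.
nan_per_column = toy.isna().sum()

# Drop rows only when Rating is missing ...
rating_only = toy.dropna(subset=["Rating"])
# ... versus dropping rows with a NaN in any column.
any_column = toy.dropna()

print(nan_per_column)
print(len(rating_only), len(any_column))
```

If the goal is only to remove apps without a rating, `subset=["Rating"]` is the narrower choice; the plain `dropna()` used in the notebook also removes rows that are missing values in other columns.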

### Find and Remove Duplicates

**Challenge**: Are there any duplicates in data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [None]:
# .duplicated() returns a boolean Series: True marks a row that repeats an earlier row.
df_apps_clean.duplicated()

21       False
28       False
47       False
82       False
99       False
         ...  
10836    False
10837    False
10838    False
10839     True
10840    False
Length: 9367, dtype: bool

In [None]:
# How many duplicated rows are there?
df_apps_clean.duplicated().values.sum()

476

In [None]:
# Show which rows are duplicates
duplicated_data=df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_data.shape)
duplicated_data.head(10)

(476, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating
1379,CT Scan Cross Sectional Anatomy,MEDICAL,4.3,10,46.0,100,Free,0,Everyone,Medical
1616,JH Blood Pressure Monitor,MEDICAL,3.7,9,2.9,500,Free,0,Everyone,Medical
1642,Cardi B Live Stream Video Chat - Prank,DATING,4.4,28,3.4,500,Free,0,Everyone,Dating
1813,Diabetes & Diet Tracker,MEDICAL,4.6,395,19.0,1000,Paid,$9.99,Everyone,Medical
1821,Transenger – Ts Dating and Chat for Free,DATING,3.6,8,14.0,1000,Free,0,Mature 17+,Dating


In [None]:
# All entries for the Instagram app
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [None]:
# To get rid of duplicated data
df_apps_clean=df_apps_clean.drop_duplicates()

### A plain `drop_duplicates()` only removes rows that are identical in every column, so Instagram entries with different review counts stay in the DataFrame.

In [None]:
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0.0,Teen,Social,0.0


**To remove the remaining duplicates, pass the column names that should be compared when identifying duplicates via `subset`.**

In [None]:
df_apps_clean=df_apps_clean.drop_duplicates(subset=['App','Type','Price'])

In [None]:
# Recheck the Instagram rows: only one entry is left.
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
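
To make the effect of `subset` concrete, here is a small sketch on a made-up frame called `toy`. The three Instagram review counts are copied from the output above; the "Other" row is invented.

```python
import pandas as pd

# Rows 0 and 2 are identical; row 1 differs only in its review count.
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Other"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [66577313, 66577446, 66577313, 10],
})

# Compares every column: only the exact copy (row 2) is removed.
full_row = toy.drop_duplicates()

# Compares only the listed columns: one row per app/type/price survives,
# and by default the first occurrence is the one that is kept.
by_key = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(len(full_row), len(by_key))
```

Which occurrence survives depends on row order, so sorting before de-duplicating (for example by `Reviews`) would be one way to control which entry is kept.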


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [None]:
df_apps_clean['App'].loc[df_apps_clean['Rating'].idxmax()]

'KBA-EZ Health Guide'

In [None]:
df_apps_clean.sort_values('Rating',ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.0,2,22.0,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.0,36,2.6,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.0,13,2.6,100,Free,0,Everyone,Photography
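
The table above shows the problem with ratings alone: the 5.0-rated apps have only a handful of reviews. One possible workaround, sketched here on made-up rows, is to require a minimum number of reviews before ranking. The threshold `MIN_REVIEWS` is a hypothetical value chosen for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Made-up apps: two perfect ratings backed by almost no reviews.
toy = pd.DataFrame({
    "App": ["Tiny Guide", "Niche Tool", "Big Game", "Popular Chat"],
    "Rating": [5.0, 5.0, 4.6, 4.4],
    "Reviews": [4, 2, 44891723, 69119316],
})

MIN_REVIEWS = 1000  # hypothetical cut-off, tune to taste

# Keep only apps with enough reviews, then rank by rating.
well_reviewed = toy[toy["Reviews"] >= MIN_REVIEWS]
top_rated = well_reviewed.sort_values("Rating", ascending=False)
print(top_rated)
```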


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [None]:
df_apps_clean.sort_values('Size_MBs',ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [None]:
df_apps_clean.sort_values('Reviews',ascending=False).head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy
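
The second half of the challenge (are there paid apps among the most-reviewed ones?) can be checked with a filter on the top rows. This is a minimal sketch on invented rows; `TOP_N` is 3 here only because the toy frame is tiny, and would be 50 on the real data.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [500, 40, 300, 10],
    "Type": ["Free", "Paid", "Free", "Paid"],
})

TOP_N = 3  # use 50 on the real DataFrame

# nlargest sorts by Reviews (descending) and keeps the first TOP_N rows.
top_n = toy.nlargest(TOP_N, "Reviews")
paid_in_top = top_n[top_n["Type"] == "Paid"]
print(paid_in_top)
```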


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [None]:
# First, count the number of occurrences of each content rating with value_counts().
content_rating=df_apps_clean.Content_Rating.value_counts()
content_rating

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64
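
The pie chart below displays percentages, which are simply each count divided by the total. As a sketch, the counts from the output above can be turned into shares by hand (on the real column, `value_counts(normalize=True)` should give the same fractions directly):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "Everyone": 6621,
    "Teen": 912,
    "Mature 17+": 357,
    "Everyone 10+": 305,
    "Adults only 18+": 3,
    "Unrated": 1,
})

# Share of each content rating in percent.
share = counts / counts.sum() * 100
print(share.round(2))
```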

In [None]:
import plotly.express as px

### Pie chart

In [None]:
fig = px.pie(title='Content Rating', names=content_rating.index, values=content_rating.values)
fig.update_traces(textposition='outside',textinfo='percent+label',textfont_size=15)
fig.show()

In [None]:
fig = px.pie(title='Content Rating', names=content_rating.index, values=content_rating.values, hole=0.8)
fig.update_traces(textposition='outside',textinfo='label+percent',textfont_size=15)
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [None]:
df_apps_clean.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
28,Ra Ga Ba,GAME,5.0,2,20.0,1,Paid,$1.49,Everyone,Arcade
47,Mu.F.O.,GAME,5.0,2,16.0,1,Paid,$0.99,Everyone,Arcade
82,Brick Breaker BR,GAME,5.0,7,19.0,5,Free,0,Everyone,Arcade
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.0,1,4.6,5,Free,0,Everyone,Medical


In [None]:
df_apps_clean.sort_values('Installs',ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10731,My Talking Tom,GAME,4.5,14891223,36.0,500000000,Free,0,Everyone,Casual
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools
10711,SHAREit - Transfer & Share,TOOLS,4.6,7790693,17.0,500000000,Free,0,Everyone,Tools
10713,imo free video calls and chat,COMMUNICATION,4.3,4785892,11.0,500000000,Free,0,Everyone,Communication
10717,Pou,GAME,4.3,10485308,24.0,500000000,Free,0,Everyone,Casual


In [None]:
# Count how many apps fall into each install level.
df_apps_clean.Installs.value_counts()

1,000,000        1417
100,000          1096
10,000            988
10,000,000        933
1,000             698
5,000,000         607
500,000           504
50,000            457
5,000             425
100               303
50,000,000        202
500               199
100,000,000       189
10                 69
50                 56
500,000,000        24
1,000,000,000      20
5                   9
1                   3
Name: Installs, dtype: int64

In [None]:
#Checking the data type of values in Install column..
print(df_apps_clean.Installs.describe())
print('Just see rightmost bottom to check data type which is an object data type.')

count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object
Just see rightmost bottom to check data type which is an object data type.


In [None]:
# To get the information about every column in dataframe about Dtype and Non-Null Count.
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 962.6+ KB


In [None]:
# Count the number of apps at each level of installations.
df_apps_clean[['App','Installs']].groupby('Installs').count().head(10)

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


In [None]:
# An alternative way to do the same
abc=df_apps_clean.groupby('Installs').agg({'App':pd.Series.count})
abc.head(10)

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


### Remove the non-numeric comma separator ","
### Use the **.str.replace()** method to replace "," with an empty string
### Finally, convert the column to a numeric type with **pd.to_numeric()**

In [None]:
#Convert the number of installations (the Installs column) to a numeric data type.
df_apps_clean.Installs=df_apps_clean.Installs.astype(str).str.replace(',','')
# To convert it to numeric type.
df_apps_clean.Installs=pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App','Installs']].groupby('Installs').count().head(10)

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457
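
Once the column is numeric, the two questions from the challenge become simple comparisons. The sketch below repeats the two-step conversion on a made-up Series called `raw` and then counts the matches with boolean masks. Note that the top install level in the data is exactly 1,000,000,000, so the comparison uses `>=`.

```python
import pandas as pd

# Made-up install strings in the same comma-separated format.
raw = pd.Series(["1,000,000,000", "1", "5,000", "1,000,000,000", "100"])

# Step 1: strip the commas. Step 2: convert the strings to integers.
installs = pd.to_numeric(raw.str.replace(",", "", regex=False))

# A boolean mask sums to the number of True entries.
billion_plus = (installs >= 1_000_000_000).sum()
single_install = (installs == 1).sum()
print(billion_plus, single_install)
```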


### Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [None]:
# checking Dtype of Price column.
df_apps_clean.Price.describe()

count     8199
unique      73
top          0
freq      7595
Name: Price, dtype: object

In [None]:
# Find the Dtype of all columns at once.
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64  
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 704.6+ KB


In [None]:
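# Strip the dollar sign. regex=False makes pandas treat "$" as a literal character rather than the regex end-of-string anchor.
df_apps_clean.Price=df_apps_clean.Price.astype(str).str.replace("$","",regex=False)
df_apps_clean.Price=pd.to_numeric(df_apps_clean.Price)


Without an explicit `regex` argument, older pandas versions print this warning: "The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True."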
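
The dollar sign deserves care because `$` is a special character in regular expressions (it matches the end of the string). A minimal sketch on a made-up Series called `prices` shows two ways to remove it that do not depend on the pandas default: a literal replacement, or an escaped pattern.

```python
import pandas as pd

# Made-up price strings in the same format as the Price column.
prices = pd.Series(["$13.99", "0", "$2.99"])

# Option 1: treat "$" as plain text.
literal = prices.str.replace("$", "", regex=False)

# Option 2: keep regex matching but escape the special character.
escaped = prices.str.replace(r"\$", "", regex=True)

numeric = pd.to_numeric(literal)
print(numeric)
```

Both options produce the same strings here; passing `regex` explicitly mainly documents the intent.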



In [None]:
# Sort the DataFrame by the Price column, most expensive first
df_apps_clean.sort_values('Price',ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


In [None]:
# This also works, but it returns only the Price column, so the app names are lost.
df_apps_clean.Price.sort_values(ascending=False)

3946    400.00
2461    399.99
4606    399.99
3145    399.99
3554    399.99
         ...  
4508      0.00
4507      0.00
4506      0.00
4505      0.00
10835     0.00
Name: Price, Length: 8199, dtype: float64

### The most expensive apps sub $250

In [None]:
df_apps_clean=df_apps_clean[df_apps_clean['Price']<250]
df_apps_clean.sort_values('Price',ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.60,92,32.00,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.00,6,1.30,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.20,134,1.80,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.50,214,68.00,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.20,64,41.00,1000,Paid,29.99,Everyone,Medical
...,...,...,...,...,...,...,...,...,...,...
4508,myAir™ for Air10™ by ResMed,MEDICAL,3.70,236,18.00,50000,Free,0.00,Everyone,Medical
4507,AK Math Coach,FAMILY,3.60,283,18.00,50000,Free,0.00,Everyone,Education
4506,Forgotten Hill: Fall,GAME,4.40,1063,18.00,50000,Free,0.00,Teen,Adventure
4505,AE Video Poker,GAME,4.00,721,18.00,50000,Free,0.00,Teen,Casino


In [None]:
# Add a 'Revenue_Estimate' column equal to Installs multiplied by Price for each row
df_apps_clean['Revenue_Estimate']=df_apps_clean.Installs.mul(df_apps_clean.Price)

In [None]:
df_apps_clean.sort_values('Revenue_Estimate',ascending=False)[:10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle,39999000.0
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance,19999500.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle,4000000.0
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle,3999900.0
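
The last question of the challenge (how many of the top grossing paid apps are games?) can be answered by filtering to paid apps, taking the largest revenue estimates, and counting the `GAME` category. The rows below are invented for illustration, and the sketch takes the top 3 rather than the top 10 because the toy frame is small.

```python
import pandas as pd

# Made-up apps; only the column names mirror the real DataFrame.
toy = pd.DataFrame({
    "App": ["Block Game", "Sniper Game", "Sleep Tool", "Photo Tool", "Free Game"],
    "Category": ["FAMILY", "GAME", "LIFESTYLE", "PHOTOGRAPHY", "GAME"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Free"],
    "Price": [6.99, 0.99, 5.99, 5.99, 0.0],
    "Installs": [10_000_000, 10_000_000, 1_000_000, 1_000_000, 100_000_000],
})

# Ballpark revenue: every install is assumed to be a purchase at list price.
toy["Revenue_Estimate"] = toy["Installs"] * toy["Price"]

paid = toy[toy["Type"] == "Paid"]
top = paid.nlargest(3, "Revenue_Estimate")  # use 10 on the real data
games_in_top = (top["Category"] == "GAME").sum()
print(top[["App", "Category", "Revenue_Estimate"]])
print(games_in_top)
```

Counting by `Category` alone can undercount games: in the real data some games are filed under `FAMILY`, so inspecting the `Genres` column as well may be worthwhile.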


### Highest Grossing Paid Apps (ballpark estimate)

In [None]:
df_apps_clean.Category.nunique()

33

In [None]:
# Top 10 categories by number of apps
top10_category=df_apps_clean.Category.value_counts()[:10]
top10_category

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [None]:
bar=px.bar(x=top10_category.index,y=top10_category.values)
bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [None]:
category_install=df_apps_clean.groupby('Category').agg({'Installs':pd.Series.sum})
category_install.sort_values('Installs',ascending=True,inplace=True)

In [None]:
# Plot the number of installs per category as a horizontal bar chart (orientation='h')
# and add a custom title.
h_bar=px.bar(x=category_install.Installs,y=category_install.index,orientation='h',title='Category Install')
# update_layout() lets you change many aspects of the plot, such as the axis titles
h_bar.update_layout(xaxis_title='Number of Downloads',yaxis_title='Category')
h_bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**:
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [None]:
cat_app_no = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})

In [None]:
cat_merged_app_installs=pd.merge(cat_app_no,category_install,on='Category')
cat_merged_app_installs.sample(5)

Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
COMICS,54,44931100
EVENTS,45,15949410
BUSINESS,262,692018120
SHOPPING,180,1400331540
WEATHER,72,361096500


# This is my way of doing the same

In [None]:
category_app_installs=df_apps_clean.groupby('Category').agg({'Installs':pd.Series.sum,'App':pd.Series.count})
category_app_installs.head()

Unnamed: 0_level_0,Installs,App
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
ART_AND_DESIGN,114233100,61
AUTO_AND_VEHICLES,53129800,73
BEAUTY,26916200,42
BOOKS_AND_REFERENCE,1665791655,169
BUSINESS,692018120,262
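
The two approaches above (two group-bys plus a merge, or a single `.agg()` call) should give the same numbers. As a sketch on a made-up frame called `toy`, here they are side by side, with the single pass written using pandas named aggregation as one further variant:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": [100, 1000, 10, 50],
})

# Approach 1: aggregate twice, then merge on the shared Category index.
app_count = toy.groupby("Category").agg({"App": "count"})
install_sum = toy.groupby("Category").agg({"Installs": "sum"})
merged = pd.merge(app_count, install_sum, on="Category")

# Approach 2: one pass with named aggregation (output name = (column, function)).
single_pass = toy.groupby("Category").agg(
    App=("App", "count"),
    Installs=("Installs", "sum"),
)
print(single_pass)
```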


In [None]:
scar_plot=px.scatter(category_app_installs,x='App',y='Installs',title='Category Concentration',
                     size='App',color='Installs',hover_name=category_app_installs.index)
scar_plot.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",yaxis_title='Installs',yaxis=dict(type='log'))
scar_plot.show()

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


**Some entries in the Genres column hold nested data, i.e. one entry contains two or more genre names separated by ";".**

In [None]:
df_apps_clean.Genres.sample(20)

9523     Role Playing;Brain Games
10317                      Arcade
9343                        Tools
1986               Travel & Local
7599                 Food & Drink
2498                     Shopping
2535                       Social
717                        Sports
4184                      Medical
10676                      Casual
10022                      Action
9735                Communication
5045                 Food & Drink
7424                       Action
8456                     Shopping
5673                      Weather
9586                       Trivia
2573                        Tools
5048             Health & Fitness
8745             Libraries & Demo
Name: Genres, dtype: object

In [None]:
# .nunique() returns the number of unique values in a column; .unique() returns the unique values themselves.
# Here we count how many different genre strings appear in the Genres column.
df_apps_clean.Genres.unique()
print(f"There are {df_apps_clean.Genres.nunique()} unique genres")

There are 114 unique genres


In [None]:
# Count how often each Genres entry occurs; some entries list more than one genre.
print('As you can see we have nested data,semi-colon (;) separates the genre names.')
df_apps_clean.Genres.value_counts().sort_values(ascending=True).sample(10)

As you can see we have nested data,semi-colon (;) separates the genre names.


Casino                                37
Lifestyle                            296
Board;Brain Games                     15
Role Playing;Pretend Play              4
Puzzle                               100
Casual;Music & Video                   1
Simulation;Action & Adventure          7
Social                               203
Dating                               134
Travel & Local;Action & Adventure      1
Name: Genres, dtype: int64

### Now separate the genres with .split() and stack them into a single column with .stack()

In [None]:
stack=df_apps_clean.Genres.str.split(";",expand=True).stack()
print(f"We now have a single column with shape {stack.shape}")

We now have a single column with shape (8579,)


In [None]:
num_genre=stack.value_counts()
print(f"No of genre: {len(num_genre)}")
num_genre[:5]

No of genre: 53


Tools            719
Education        587
Entertainment    502
Action           304
Lifestyle        303
dtype: int64

In [None]:
# Same can be done as.....
# No of unique items in single stack column.
stack.nunique()

53
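
As an alternative to splitting with `expand=True` and stacking, the nested values can also be unpacked with `Series.explode()`, which turns each list element into its own row. This is a sketch on a made-up Series called `genres`, offered as another option rather than the approach used in this notebook.

```python
import pandas as pd

# Made-up genre strings; ";" separates nested genres as in the dataset.
genres = pd.Series([
    "Arcade",
    "Role Playing;Brain Games",
    "Tools",
    "Arcade;Action & Adventure",
])

# Split each string into a list, then give every list element its own row.
exploded = genres.str.split(";").explode()

print(exploded.value_counts())
print(exploded.nunique())
```

The exploded Series keeps the original row index, so an app that belongs to two genres appears twice under the same index label.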

# Colour Scales in Plotly Charts - Competition in Genres

---



**Challenge**: Can you create this chart with the Series containing the genre data?

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/).

* Find a way to set the colour scale using the color_continuous_scale parameter.
* Find a way to make the color axis disappear by using coloraxis_showscale.

In [None]:
# Both hover_name= and hover_data=[...] are available for customising the hover text.
bar=px.bar(x=num_genre.index[:15],
       y=num_genre.values[:15],
       title='Top Genres',
       hover_name=num_genre.index[:15],
       color=num_genre.values[:15],
       color_continuous_scale='Agsunset' # try any other built-in colour scale
       )
# Set the axis titles and hide the colour scale shown on the right
bar.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=False)
bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [None]:
df_apps_clean.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
10824,Google Drive,PRODUCTIVITY,4.4,2731171,4.0,1000000000,Free,0.0,Everyone,Productivity,0.0
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0.0,Teen,Video Players & Editors,0.0
10829,Google Play Movies & TV,VIDEO_PLAYERS,3.7,906384,4.65,1000000000,Free,0.0,Teen,Video Players & Editors,0.0
10831,Google News,NEWS_AND_MAGAZINES,3.9,877635,13.0,1000000000,Free,0.0,Teen,News & Magazines,0.0
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0.0,Everyone 10+,Arcade,0.0


### **Here we group the data first by Category and then by Type, and count the number of apps of each type.**

In [None]:
# as_index=False keeps Category and Type as regular columns instead of making them the index.
df_free_vs_paid=df_apps_clean.groupby(['Category','Type'],as_index=False).agg({'App':pd.Series.count})
df_free_vs_paid.sort_values('App',ascending=False)

Unnamed: 0,Category,Type,App
19,FAMILY,Free,1456
25,GAME,Free,834
53,TOOLS,Free,656
21,FINANCE,Free,289
31,LIFESTYLE,Free,284
...,...,...,...
17,ENTERTAINMENT,Paid,2
24,FOOD_AND_DRINK,Paid,2
40,PARENTING,Paid,2
38,NEWS_AND_MAGAZINES,Paid,2


**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart:

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

### Categories can be sorted alphabetically or by value using the categoryorder attribute:

Set categoryorder to "category ascending" or "category descending" for the alphanumerical order of the category names or "total ascending" or "total descending" for numerical order of values.
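
To build intuition for what "total descending" means, the ordering can be reproduced by hand in pandas: sum the values per category across both types and sort from largest to smallest. This sketch only illustrates the idea and makes no claim about how Plotly computes it internally. The free counts are taken from the table above; the paid counts are the differences to the category totals shown earlier.

```python
import pandas as pd

# Three categories in the same long format as df_free_vs_paid.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Paid", "Free", "Paid"],
    "App": [834, 76, 656, 63, 1456, 150],
})

# Total per category (free + paid), largest first.
totals = toy.groupby("Category")["App"].sum().sort_values(ascending=False)
order = totals.index.tolist()
print(order)
```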

In [None]:
bar_free_vs_paid=px.bar(df_free_vs_paid,
                        x='Category',
                        y='App',
                        title='Free vs paid apps by category',
                        color='Type',
                        barmode='group'
                        )
bar_free_vs_paid.update_layout(
    xaxis_title='Category',
    yaxis_title='Number of Apps',
    # The y-axis should be on a log-scale
    yaxis=dict(type='log'),
    # 'total descending' orders the categories by their total bar value (free + paid), largest first
    xaxis={'categoryorder':'total descending'}
)
bar_free_vs_paid.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart.

<img src=https://imgur.com/uVsECT3.png>



### **Display the underlying data**
The `points` argument controls which underlying data points are drawn: all points (`'all'`), outliers only (`'outliers'`, the default), or none of them (`False`).

In [None]:
box=px.box(df_apps_clean,y='Installs',x='Type', points="all",color='Type',title='How Many Downloads are Paid Apps Giving Up?',)
box.update_layout(yaxis=dict(type='log')) # y axis on log scale value.
box.show()
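
The medians that the box plot shows on hover can also be computed directly with a group-by. The numbers below are invented to keep the sketch self-contained, so the resulting ratio says nothing about the real dataset.

```python
import pandas as pd

# Made-up install counts for free and paid apps.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000, 100_000, 10_000_000, 100, 5_000, 1_000_000],
})

# Median installs per app type, and how many times larger the free median is.
median_installs = toy.groupby("Type")["Installs"].median()
ratio = median_installs["Free"] / median_installs["Paid"]
print(median_installs)
print(ratio)
```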

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below:

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [None]:
df_paid_apps=df_apps_clean[df_apps_clean.Type=='Paid']
box_2=px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?',

             )
box_2.update_layout(xaxis_title='Category',
                    yaxis_title='Paid App Ballpark Revenue',
                    xaxis={'categoryorder':'min ascending'},# min ascending to sort category
                    yaxis=dict(type='log') # Take y axis on log scale..
                    )
box_2.show()
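
Instead of reading the medians off the hover text, they can be compared against the development cost in code. This is a sketch with invented revenue figures and placeholder category names (`CAT_A`, `CAT_B`), so it demonstrates the calculation only; `DEV_COST` is the $30,000 figure from the challenge text.

```python
import pandas as pd

# Made-up revenue estimates for two placeholder categories.
toy_paid = pd.DataFrame({
    "Category": ["CAT_A", "CAT_A", "CAT_A", "CAT_B", "CAT_B", "CAT_B"],
    "Revenue_Estimate": [990.0, 4990.0, 29900.0, 5990.0, 59900.0, 599000.0],
})

DEV_COST = 30_000  # development cost quoted in the challenge

# Median revenue per category and whether it covers the development cost.
median_revenue = toy_paid.groupby("Category")["Revenue_Estimate"].median()
recoups_cost = median_revenue >= DEV_COST
print(median_revenue)
print(recoups_cost)
```

The median is used rather than the mean because a few very successful apps would otherwise dominate the figure.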

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.

In [None]:
box_3=px.box(df_paid_apps,
             x='Category',
             y='Price',
             title='Price per Category'
             )
box_3.update_layout(xaxis_title='Category',
                    yaxis_title='Paid App Price',
                    xaxis={'categoryorder':'max descending'},
                    yaxis=dict(type='log'))
box_3.show()
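
The first part of the challenge, the median price of a paid app, needs no plot at all. Here is a minimal sketch on made-up prices; on the real data the same calls would run against `df_paid_apps`.

```python
import pandas as pd

# Made-up apps; the free app is filtered out before taking the median.
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Paid", "Paid", "Paid"],
    "Category": ["GAME", "GAME", "MEDICAL", "MEDICAL", "GAME"],
    "Price": [0.0, 0.99, 9.99, 12.99, 1.49],
})

paid = toy[toy["Type"] == "Paid"]

# Median over all paid apps, and the median within each category.
overall_median = paid["Price"].median()
median_by_category = paid.groupby("Category")["Price"].median()
print(overall_median)
print(median_by_category)
```

Filtering to paid apps first matters: including free apps, which make up most of the dataset, would pull the median towards zero.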