# Google Play Store App Analysis | Python
An exploratory data analysis (EDA) of the Android app market, comparing thousands of apps in the Google Play Store.

Google Play, formerly known as Android Market, is a digital distribution service owned and operated by Google. It is the official app store for Android devices, where users can browse and download apps. It also functions as a digital media store, selling books, music, movies, and TV shows.

# <span> Table of Contents</span>  

* [About](#about)
* [Data Cleaning](#head_1)
* [Finding the Highest Rated Apps, the Largest Apps in terms of Size (MBs), and Top 5 Apps with the Most Reviews](#head_2)
* [Content Ratings Distribution](#head_3)
* [Examine the Number of Installs](#head_4)
* [Find the Most Expensive Apps and Calculate Sales Revenue Estimate](#head_5)
* [Analyzing the App Categories](#head_6)
* [Free vs. Paid Apps per Category](#head_7)
* [Examine Paid App Pricing Strategies by Category](#head_8)
* [Conclusion](#conclusion)
***


## About the Dataset of Google Play Store Apps & Reviews <a class="anchor" id="about"></a>

**Data Source:** <br>
The dataset was web-scraped from the Google Play Store by [Lavanya Gupta](https://www.kaggle.com/lava18/google-play-store-apps) in 2018. 

**Data Limitations:**
* The analysis assumes that the sample is representative of the Google Play Store as a whole. Note, however, that the apps in this sample were served up based on Gupta's behavior and geographical location.
* The data was compiled in 2018, so it is not current.
* The 'Installs' column is not the exact number of installs. For example, if an app has 263,493 downloads, Google simply reports an order of magnitude such as 100,000+. The '+' was removed, and the analysis treats the remaining number as the exact install count.
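To make the last limitation concrete, here is a minimal sketch of how a bucketed install label maps to its lower bound, and how much it can undercount. The helper name `parse_install_bucket` is hypothetical and not part of the notebook; the figures are the ones from the example above.

```python
def parse_install_bucket(label: str) -> int:
    """Convert a Play Store install label such as '100,000+' to its lower bound."""
    # Hypothetical helper: strip thousands separators and the trailing '+'
    return int(label.replace(',', '').rstrip('+'))


true_downloads = 263_493                      # example figure from the limitation above
reported = parse_install_bucket('100,000+')   # what the dataset would store
undercount = true_downloads - reported        # installs lost to bucketing

print(reported, undercount)
```

Any statistic built on `Installs` should therefore be read as a rough lower bound rather than an exact count.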

<img src='https://media.wired.com/photos/610d91840a1a0353ebea4cd2/master/w_1600,c_limit/Sec_01-play.jpg' width=60%, height=60%/>

## Processing Data

In [1]:
# Importing libraries
import pandas as pd
import plotly.express as px

### Reading the Dataset

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read dataset
df_apps = pd.read_csv('../input/google-play-store-app-review/apps.csv')

***
# Data Cleaning <a class="anchor" id="head_1"></a>

We'll be dropping unused columns, removing duplicates and NaN values.

### Explore and preview the dataframe structure

In [3]:
df_apps.shape

(10841, 12)

In [4]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [5]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
6375,CJ VLC HD Remote (+ Stream),VIDEO_PLAYERS,4.2,4074,1.3,500000,Free,0,Everyone,Video Players & Editors,"November 13, 2013",2.1 and up
4034,Profile w/o crop for Telegram,PHOTOGRAPHY,4.2,348,1.5,10000,Free,0,Everyone,Photography,"April 11, 2014",2.3.3 and up
9222,Colorfy: Coloring Book for Adults - Free,FAMILY,4.5,787107,19.0,10000000,Free,0,Everyone,Entertainment,"June 20, 2018",Varies with device
8255,Relax Melodies: Sleep Sounds,HEALTH_AND_FITNESS,4.5,233243,8.8,5000000,Free,0,Everyone,Health & Fitness,"July 23, 2018",Varies with device
4166,Predator Calls for Hunting AU,SPORTS,4.4,27,78.0,10000,Free,0,Everyone,Sports,"August 3, 2016",4.0.3 and up


### Drop Unused Columns

We'll remove the columns called `Last_Updated` and `Android_Ver` 

In [6]:
# Drop the 2 columns
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)

In [7]:
# Check if columns removed
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres'],
      dtype='object')

### Find and Remove NaN values in Rating Column


In [8]:
# Checking how many nan values in Rating column
df_apps['Rating'].isna().sum()

1474

In [9]:
# Create a subset df of the dataframe based on the condition that .isna() evaluates to True
nan_rows_rating = df_apps[df_apps['Rating'].isna()]
nan_rows_rating

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business
...,...,...,...,...,...,...,...,...,...,...
5840,Em Fuga Brasil,FAMILY,,1317,60.00,100000,Free,0,Everyone,Simulation
5862,Voice Tables - no internet,PARENTING,,970,71.00,100000,Free,0,Everyone,Parenting
6141,Young Speeches,LIBRARIES_AND_DEMO,,2221,2.40,500000,Free,0,Everyone,Libraries & Demo
7035,SD card backup,TOOLS,,142,3.40,1000000,Free,0,Everyone,Tools


From this subset, we can see that many of the NaN values in `Rating` belong to apps with no reviews and no installs, although some unrated apps do have installs. Let's proceed by removing the NaN values and creating a new, cleaned dataframe.

In [10]:
# Create a new dataframe that has no NaN values
df_apps_clean = df_apps.dropna()
df_apps_clean

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,$1.49,Everyone,Arcade
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,$0.99,Everyone,Arcade
82,Brick Breaker BR,GAME,5.00,7,19.00,5,Free,0,Everyone,Arcade
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical
...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade
10839,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade


### Find and Remove Duplicates



In [11]:
# Creating a subset df, showing only duplicated rows
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
duplicated_rows

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.00,2,11.00,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.70,3,3.90,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.40,8,6.50,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.00,3,22.00,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.70,6,4.90,100,Free,0,Mature 17+,Dating
...,...,...,...,...,...,...,...,...,...,...
10802,Skype - free IM & video calls,COMMUNICATION,4.10,10484169,3.50,1000000000,Free,0,Everyone,Communication
10809,Instagram,SOCIAL,4.50,66577313,5.30,1000000000,Free,0,Teen,Social
10826,Google Drive,PRODUCTIVITY,4.40,2731211,4.00,1000000000,Free,0,Everyone,Productivity
10832,Google News,NEWS_AND_MAGAZINES,3.90,877635,13.00,1000000000,Free,0,Teen,News & Magazines


In [12]:
# Checking for individual app 
df_apps_clean[df_apps_clean['App'] == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


To drop duplicates, we can't simply use `df_apps_clean = df_apps_clean.drop_duplicates()`.

Without specifying how to identify duplicates, three copies of Instagram are retained because they have different numbers of reviews. We therefore need to provide the column names used to compare rows and identify duplicates.
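The difference can be sketched with the four Instagram rows shown above. This is a small illustration on a hand-built frame (only a few of the original columns are reproduced, and the names `all_columns` and `key_columns` are made up for the example):

```python
import pandas as pd

# The four Instagram rows from the output above, reduced to the relevant columns
instagram = pd.DataFrame({
    'App': ['Instagram'] * 4,
    'Reviews': [66577313, 66577446, 66577313, 66509917],
    'Type': ['Free'] * 4,
    'Price': ['0'] * 4,
})

# Comparing every column: only the exact repeat (same review count) is dropped
all_columns = instagram.drop_duplicates()

# Comparing only the identifying columns: a single row per app survives
key_columns = instagram.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(all_columns), len(key_columns))
```

By default `drop_duplicates` keeps the first occurrence, so which review count survives depends on row order.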

In [13]:
# Drop Duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])
df_apps_clean

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,$1.49,Everyone,Arcade
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,$0.99,Everyone,Arcade
82,Brick Breaker BR,GAME,5.00,7,19.00,5,Free,0,Everyone,Arcade
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical
...,...,...,...,...,...,...,...,...,...,...
10824,Google Drive,PRODUCTIVITY,4.40,2731171,4.00,1000000000,Free,0,Everyone,Productivity
10828,YouTube,VIDEO_PLAYERS,4.30,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10829,Google Play Movies & TV,VIDEO_PLAYERS,3.70,906384,4.65,1000000000,Free,0,Teen,Video Players & Editors
10831,Google News,NEWS_AND_MAGAZINES,3.90,877635,13.00,1000000000,Free,0,Teen,News & Magazines


In [14]:
# Double-check that only a single Instagram row remains
df_apps_clean[df_apps_clean['App'] == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social


***
# Finding the Highest Rated Apps, the Largest Apps in terms of Size (MBs), and Top 5 Apps with the Most Reviews <a class="anchor" id="head_2"></a>

## Find Highest Rated Apps

Which apps are the highest rated?

In [15]:
# Check which apps are rated 5/5
df_apps_clean[df_apps_clean.Rating == 5]

df_apps_clean.sort_values(by=['Rating'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.00,3,22.00,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.00,2,22.00,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.00,36,2.60,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.00,13,2.60,100,Free,0,Everyone,Photography
...,...,...,...,...,...,...,...,...,...,...
1314,CR Magazine,BUSINESS,1.00,1,7.80,100,Free,0,Everyone,Business
1932,FE Mechanical Engineering Prep,FAMILY,1.00,2,21.00,1000,Free,0,Everyone,Education
357,Speech Therapy: F,FAMILY,1.00,1,16.00,10,Paid,$2.99,Everyone,Education
818,Familial Hypercholesterolaemia Handbook,MEDICAL,1.00,2,33.00,100,Free,0,Everyone,Medical


**Observations:**

Only apps with a small number of reviews and installs have a perfect 5-star rating. With so few reviews, these ratings could plausibly come from close friends, family, or coworkers.

## Find 5 Largest Apps in Terms of Size (MBs)

What is the size in megabytes (MB) of the largest Android apps in the Google Play Store?

In [16]:
df_apps_clean.sort_values(by=['Size_MBs'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.00,140995,100.00,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.50,6074334,100.00,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.00,254518,100.00,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.30,65146,100.00,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.60,124,100.00,5000,Free,0,Everyone,Health & Fitness
...,...,...,...,...,...,...,...,...,...,...
2648,Ad Remove Plugin for App2SD,PRODUCTIVITY,4.10,66,0.02,1000,Paid,$1.29,Everyone,Productivity
5798,ExDialer PRO Key,COMMUNICATION,4.50,5474,0.02,100000,Paid,$3.99,Everyone,Communication
2684,My baby firework (Remove ad),FAMILY,4.10,30,0.01,1000,Paid,$0.99,Everyone,Entertainment
7966,Market Update Helper,LIBRARIES_AND_DEMO,4.10,20145,0.01,1000000,Free,0,Everyone,Libraries & Demo


In [17]:
df_apps_clean[df_apps_clean['Size_MBs']==100]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
1795,Navi Radiography Pro,MEDICAL,4.7,11,100.0,500,Paid,$15.99,Everyone,Medical
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness
4176,Car Crash III Beam DH Real Damage Simulator 2018,GAME,3.6,151,100.0,10000,Free,0,Everyone,Racing
7926,Post Bank,FINANCE,4.5,60449,100.0,1000000,Free,0,Everyone,Finance
7927,The Walking Dead: Our World,GAME,4.0,22435,100.0,1000000,Free,0,Teen,Action
7928,Stickman Legends: Shadow Wars,GAME,4.4,38419,100.0,1000000,Paid,$0.99,Everyone 10+,Action
8718,Mini Golf King - Multiplayer Game,GAME,4.5,531458,100.0,5000000,Free,0,Everyone,Sports
8719,Draft Simulator for FUT 18,SPORTS,4.6,162933,100.0,5000000,Free,0,Everyone,Sports
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action


**Observation:**

It seems that the Google Play Store has a maximum app size of 100 MB. 

## Finding the Top 5 Apps with the Most Reviews

Which apps have the highest number of reviews? 

In [18]:
df_apps_clean.sort_values(by=['Reviews'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.10,78158306,5.30,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.40,69119316,3.50,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.50,66577313,5.30,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.00,56642847,3.50,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.60,44891723,98.00,100000000,Free,0,Everyone 10+,Strategy
...,...,...,...,...,...,...,...,...,...,...
453,Wowkwis aq Ka'qaquj,FAMILY,5.00,1,49.00,10,Free,0,Everyone,Education;Education
462,CB Fit,HEALTH_AND_FITNESS,5.00,1,7.80,10,Free,0,Everyone,Health & Fitness
901,ES Billing System (Offline App),PRODUCTIVITY,5.00,1,4.20,100,Free,0,Everyone,Productivity
1416,Ek Kahani Aisi Bhi Season 3 - The Horror Story,FAMILY,3.00,1,5.80,100,Free,0,Teen,Entertainment


**Observations:** 

Facebook, WhatsApp, Instagram, Facebook Messenger, and Clash of Clans appear to be the top five apps with the most reviews. Furthermore, they're all classified as free apps.

***
# Content Ratings Distribution <a class="anchor" id="head_3"></a>

All Android apps have content ratings that indicate whether the app is intended for a specific age group, such as "Everyone," "Teen," or "Mature 17+." Let's take a look at the content rating distribution in our dataset.

In [19]:
ratings = df_apps_clean['Content_Rating'].value_counts()
ratings

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [20]:
# Pie Chart
# `names` supplies the slice labels; `labels` is meant for renaming columns, so it is not needed here
ratings_pie = px.pie(values=ratings.values, names=ratings.index, title="Content Rating")
ratings_pie.update_traces(textposition='outside', textinfo='percent+label')
ratings_pie.show()

**Observations:**

The distribution shows that the majority of apps (about 81%) are rated Everyone, with Teen accounting for about 11% and Adults only 18+ for less than 0.05%.
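These shares can be recomputed directly from the counts printed above. The snippet below is a small sketch that rebuilds the `ratings` Series by hand rather than reading the dataset; `share_pct` is a name introduced only for this example.

```python
import pandas as pd

# Content rating counts copied from the value_counts() output above
ratings = pd.Series({
    'Everyone': 6621,
    'Teen': 912,
    'Mature 17+': 357,
    'Everyone 10+': 305,
    'Adults only 18+': 3,
    'Unrated': 1,
})

# Share of each content rating as a percentage of all apps
share_pct = (ratings / ratings.sum() * 100).round(2)
print(share_pct)
```

On the real Series, `df_apps_clean['Content_Rating'].value_counts(normalize=True)` should give the same proportions without the manual division.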

***
# Examine the Number of Installs <a class="anchor" id="head_4"></a>


In [21]:
# View all the columns datatype
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


In [22]:
df_apps_clean.Installs.describe()

count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object

In [23]:
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
1000,698
1000000,1417
1000000000,20
10,69
10000,988
10000000,933
100,303
100000,1096
100000000,189


In [24]:
# Cast the column to string and remove the commas using .str.replace
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',','')

# Convert data into numeric
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457


**Observations:**

There are 20 apps with 1 billion or more downloads and 24 apps with 500 million or more. It's also worth noting that three apps have only one installation, which may well be the developer's own.
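The "value is trying to be set on a copy of a slice" message in the output above is pandas' `SettingWithCopyWarning`. It most likely appears because `df_apps_clean` was derived from another dataframe before its columns were reassigned. One common way to avoid it, sketched here on a toy frame (the names `raw` and `clean` are illustrative, not from the notebook), is to take an explicit `.copy()` before modifying columns:

```python
import pandas as pd

# Toy stand-in for the raw data: one unrated app and comma-formatted installs
raw = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.5, None, 3.9],
    'Installs': ['1,000', '10', '1,000,000'],
})

# An explicit copy makes it unambiguous that `clean` is its own dataframe
clean = raw.dropna().copy()

# Remove thousands separators literally (regex=False), then convert to numbers
clean['Installs'] = pd.to_numeric(clean['Installs'].str.replace(',', '', regex=False))

print(clean.dtypes)
```

The original frame is left untouched, and the assignment on `clean` no longer depends on whether pandas treats it as a view or a copy.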

***
# Find the Most Expensive Apps and Calculate Sales Revenue Estimate <a class="anchor" id="head_5"></a>


## Examining the Price Column

In [25]:
# Preview data info
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64  
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 704.6+ KB


In [26]:
# Convert price to string to remove $ and convert into numeric data
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$','')
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)

df_apps_clean.sort_values('Price', ascending=False).head(20)


The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


There are 15 *I am Rich* apps that cost $299.99 or more. After doing some research, it appears that this application is described as "a work of art with no hidden function at all," with its sole purpose being to show others that the buyer could afford it.

We'll exclude these apps because they would distort our analysis of the most expensive 'real' apps. To do so, we'll keep only apps priced below $250.
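The price cleaning and the $250 cutoff can be sketched together on a handful of values. This is an illustration on made-up input in the same format as the `Price` column, not the notebook's own cell. Passing `regex=False` is an addition here: it tells pandas to treat `$` as a literal character, which also sidesteps the `regex` default-change warning shown in the output above.

```python
import pandas as pd

# Sample prices in the same string format as the Price column
prices = pd.Series(['$399.99', '0', '$79.99', '$1.49'])

# Strip the literal dollar sign, then convert the strings to floats
numeric = pd.to_numeric(prices.str.replace('$', '', regex=False))

# Keep only prices strictly below the $250 cutoff
kept = numeric[numeric < 250]

print(kept.tolist())
```

Because the comparison is strict, an app priced at exactly $250 would also be dropped.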

In [27]:
# Keep only rows priced below $250 (drops anything priced at $250 or more)
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.60,92,32.00,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.00,6,1.30,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.20,134,1.80,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.50,214,68.00,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.20,64,41.00,1000,Paid,29.99,Everyone,Medical
...,...,...,...,...,...,...,...,...,...,...
4508,myAir™ for Air10™ by ResMed,MEDICAL,3.70,236,18.00,50000,Free,0.00,Everyone,Medical
4507,AK Math Coach,FAMILY,3.60,283,18.00,50000,Free,0.00,Everyone,Education
4506,Forgotten Hill: Fall,GAME,4.40,1063,18.00,50000,Free,0.00,Teen,Adventure
4505,AE Video Poker,GAME,4.00,721,18.00,50000,Free,0.00,Teen,Casino


## Highest Grossing Paid Apps (Estimated) 

To estimate revenue, we'll multiply each app's price by its number of installs.

In [28]:
# Let's calc the highest grossing paid apps.  (Price * Installs)
df_apps_clean['Revenue_Estimate'] = df_apps_clean['Price']*df_apps_clean['Installs']
df_apps_clean.head()

# Sort to find highest gross paid apps
df_apps_clean.sort_values('Revenue_Estimate', ascending=False).head(10)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0


**Observations:**

For the sake of simplicity, we assume that every installation was purchased at the listed price. Minecraft, with nearly $70 million in estimated revenue, is the highest-grossing paid app. Note, however, that Minecraft is classified as Family rather than Game. If we count Minecraft, Bloons TD 5, and Card Wars as games rather than family apps, games account for 7 of the top 10 highest-grossing apps.

It's also worth noting that the Google Play store's category labels are somewhat flexible.
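As a worked check of the estimate, the arithmetic for three of the apps in the table above can be reproduced on a tiny hand-built frame (prices and install counts copied from that table; the frame name `apps` is illustrative):

```python
import pandas as pd

# Price and install figures copied from the top-grossing table above
apps = pd.DataFrame({
    'App': ['Minecraft', 'Hitman Sniper', 'Weather Live'],
    'Price': [6.99, 0.99, 5.99],
    'Installs': [10_000_000, 10_000_000, 500_000],
})

# Revenue estimate = listed price x reported installs
apps['Revenue_Estimate'] = apps['Price'] * apps['Installs']

print(apps[['App', 'Revenue_Estimate']])
```

Since `Installs` is a bucketed lower bound and prices can change over time, the result is best read as a rough order-of-magnitude figure.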

***
# Analyzing the App Categories <a class="anchor" id="head_6"></a>

Let's analyze which categories dominate the market. This could help app developers narrow down which category to enter. 


In [29]:
# Find the number of different categories
df_apps_clean.Category.nunique()

33

In [30]:
# Calc the number of apps per category.
df_apps_clean.Category.value_counts()

# Top 10 categories for apps
top_10_category = df_apps_clean.Category.value_counts()[:10]
top_10_category

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

## Highest Competition (Number of Apps)

In [31]:
# Create a vertical bar chart for the top 10 categories
top_10_category_bar = px.bar(x=top_10_category.index, 
                             y=top_10_category.values, 
                             title='Top 10 Categories Based on Number of Apps')

top_10_category_bar.update_layout(xaxis_title='Category', 
                                  yaxis_title='Number of Apps')

top_10_category_bar.update_traces(text=top_10_category.values,
                                  textposition='auto')

top_10_category_bar.show()

**Observations:**

Judging by the large number of apps, the Family and Game categories are the most competitive. A new app released in either category may struggle to get noticed.

## Most Popular Categories (Number of Installations)

Let's take a different approach, and instead of focusing solely on the total number of apps created in a given category, we'll consider how frequently those apps are downloaded. This will give us a sense of how popular a particular category is.

In [32]:
# Group our apps by category & sum the number of installations
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

# Graph a horizontal barchart
category_install_bar = px.bar(x=category_installs.Installs, y=category_installs.index, orientation='h', title='Category Popularity <br><sub>Based on the number of installations</sub>')
category_install_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')



category_install_bar.show()




**Observation:**

The Games, Communication, and Tools categories are the most popular in terms of downloads, whereas the Events and Parenting categories are the least popular.

## Category Concentration - Installs/Downloads vs. Competition

Let's try to plot the popularity of a category next to the number of apps in that category to see how concentrated a category is.

In [33]:
# Create a new dataframe with the number of apps per category
category_number = df_apps_clean.groupby('Category').agg({'App': pd.Series.count}) 
category_number

# Merge category_number df with category_installs df based on Category
category_merged_df = pd.merge(category_number, category_installs, on='Category', how="inner")
category_merged_df.sort_values('Installs', ascending=False)

Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650


In [34]:
# Create the scatterplot
category_scatterplot = px.scatter(category_merged_df, 
                                  x='App', 
                                  y='Installs', 
                                  title='Category Concentration<br><sub>Lower number of apps = More concentration</sub>', 
                                  size='App', 
                                  hover_name=category_merged_df.index, 
                                  color='Installs')

category_scatterplot.update_layout(xaxis_title='Number of Apps', yaxis_title='Installs', yaxis=dict(type='log'))

category_scatterplot.show()

**Observation:**

We can see that Family, Tools, and Game have a lot of downloads while having varying numbers of apps. Communication, Productivity, Social, and Photography are more concentrated: they have fewer apps in the store but a high number of downloads. This may be because users prefer to stick with the social, photography, or communication apps they're already comfortable with.
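One way to put a number on concentration is installs per app. The sketch below applies that ratio to four rows copied from the merged table above; the column name `Installs_per_App` and the frame name `category_merged` are introduced here for illustration only.

```python
import pandas as pd

# App counts and total installs copied from the merged table above
category_merged = pd.DataFrame(
    {
        'App': [910, 257, 719, 1606],
        'Installs': [13858762717, 11039241530, 8099724500, 4437554490],
    },
    index=['GAME', 'COMMUNICATION', 'TOOLS', 'FAMILY'],
)

# Average installs per app: higher values suggest a more concentrated category
category_merged['Installs_per_App'] = (
    category_merged['Installs'] / category_merged['App']
)

ranked = category_merged.sort_values('Installs_per_App', ascending=False)
print(ranked)
```

On these four rows, Communication comes out well ahead of Game, Tools, and Family, which is consistent with the scatterplot reading above.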

## Genre Competition

Let's explore the `Genres` column and find out how many genres there are.

In [35]:
# Count number of Genres
df_apps_clean['Genres'].nunique()

114

In [36]:
# On closer inspection, some entries contain multiple genres separated by ;
df_apps_clean['Genres'].value_counts().sort_values()

Lifestyle;Pretend Play        1
Strategy;Education            1
Adventure;Education           1
Role Playing;Brain Games      1
Tools;Education               1
                           ... 
Personalization             298
Productivity                301
Education                   429
Entertainment               467
Tools                       718
Name: Genres, Length: 114, dtype: int64

Since the column contains nested data, we'll separate the genre names using `.str.split()`, then combine them into a single column with `.stack()`.
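To see what the split-and-stack step does, here is a minimal sketch on a few genre strings taken from the outputs above. The extra `.dropna()` is a precaution added for this example, in case the installed pandas version keeps empty cells when stacking.

```python
import pandas as pd

# A few Genres values, some holding two genres separated by ';'
genres = pd.Series([
    'Arcade',
    'Lifestyle;Pretend Play',
    'Arcade;Action & Adventure',
    'Tools',
])

# Split into one column per genre, then stack the columns into a single Series
stacked = genres.str.split(';', expand=True).stack().dropna()

# Count how often each individual genre appears
genre_counts = stacked.value_counts()
print(genre_counts)
```

`len(genre_counts)` then gives the number of distinct genres, here five, with Arcade counted twice.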


In [37]:
# Split each string on ';', expanding the pieces into separate columns, then stack them into one column
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
stack.shape

num_genres = stack.value_counts()
# Number of distinct genres (num_genres.nunique() would instead count distinct count values)
len(num_genres)

num_genres

Tools                      719
Education                  587
Entertainment              498
Action                     304
Productivity               301
Personalization            298
Lifestyle                  298
Finance                    296
Medical                    292
Sports                     270
Photography                263
Business                   262
Communication              258
Health & Fitness           245
Casual                     216
News & Magazines           204
Social                     203
Simulation                 200
Travel & Local             187
Arcade                     185
Shopping                   180
Books & Reference          171
Video Players & Editors    150
Dating                     134
Puzzle                     124
Maps & Navigation          118
Role Playing               111
Racing                     103
Action & Adventure          96
Strategy                    95
Food & Drink                94
Educational                 93
Adventur

In [38]:
# Create a genre barchart with the Series
genre_bar = px.bar(x=num_genres.index[:20],
                   y=num_genres.values[:20],
                   title='Top 20 Genres',
                   hover_name=num_genres.index[:20], 
                   color=num_genres.values[:20],
                   color_continuous_scale='Agsunset')

genre_bar.update_layout(xaxis_title='Genre',
                        yaxis_title='Number of Apps',
                        coloraxis_showscale=False)

genre_bar.update_traces(text= num_genres.values[:20], textposition='auto')

genre_bar.show()

**Observations:**

Tools, Education, and Entertainment are the most common genres by number of apps. As a result, an app developed in one of these three genres would face strong competition.

***
# Free vs. Paid Apps per Category <a class="anchor" id="head_7"></a>

In [39]:
# Count number of apps per type
df_apps_clean['Type'].value_counts()

Free    7595
Paid     589
Name: Type, dtype: int64

Since the majority of apps are free, let's see if some categories have more paid apps than others.

In [40]:
# Group data by Category then Type. 
# Using as_index=False, we push all data into columns instead of having Category as our index
# Then add up the number of apps per each type
free_vs_paid_df = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
free_vs_paid_df


Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42
...,...,...,...
56,TRAVEL_AND_LOCAL,Paid,8
57,VIDEO_PLAYERS,Free,144
58,VIDEO_PLAYERS,Paid,4
59,WEATHER,Free,65


In [41]:
# Create a barchart comparing free & paid apps
free_vs_paid_bar = px.bar(free_vs_paid_df, 
                          x='Category',
                          y='App',
                          title='Free vs Paid Apps by Category',
                          color='Type',
                          barmode='group')

free_vs_paid_bar.update_layout(xaxis_title='Category',
                               yaxis_title='Number of Apps',
                               xaxis={'categoryorder': 'total descending'},
                               yaxis=dict(type='log'))
free_vs_paid_bar.show()

**Observations:**

Although free apps far outnumber paid apps overall, some categories have noticeably more paid apps than others (e.g. Personalization and Medical). As a result, depending on the category, releasing a paid app can make sense.

This makes us wonder:
* How much should you charge, compared with what other apps in the same category charge?

* How many downloads are you potentially giving up when you decide not to make the app free?
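
Raw counts make it hard to compare categories of different sizes, so one option is to look at the paid share per category instead. The sketch below is a minimal example, assuming a long-format table shaped like `free_vs_paid_df`; it uses only the six rows visible in the output above, and the names `sample_counts`, `wide`, and `Paid_Share_Pct` are hypothetical.

```python
import pandas as pd

# Small stand-in for free_vs_paid_df, built from rows shown in the output above
sample_counts = pd.DataFrame({
    'Category': ['ART_AND_DESIGN', 'ART_AND_DESIGN',
                 'AUTO_AND_VEHICLES', 'AUTO_AND_VEHICLES',
                 'VIDEO_PLAYERS', 'VIDEO_PLAYERS'],
    'Type': ['Free', 'Paid', 'Free', 'Paid', 'Free', 'Paid'],
    'App': [58, 3, 72, 1, 144, 4],
})

# Reshape to one row per category with separate Free and Paid columns.
# Categories without any paid (or free) apps would show NaN, so fill with 0.
wide = sample_counts.pivot(index='Category', columns='Type', values='App').fillna(0)

# Paid apps as a percentage of all apps in the category
wide['Paid_Share_Pct'] = (wide['Paid'] / (wide['Free'] + wide['Paid']) * 100).round(1)

paid_share = wide.sort_values('Paid_Share_Pct', ascending=False)
print(paid_share)
```

Applied to the full `free_vs_paid_df`, the same reshaping would rank every category by how common paid apps are within it.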

## Paid App Revenue (Estimate)

In [42]:
# Create a paid apps df
paid_apps_df = df_apps_clean[df_apps_clean['Type']=='Paid']
paid_apps_df


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,1.49,Everyone,Arcade,1.49
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,0.99,Everyone,Arcade,0.99
233,Chess of Blades (BL/Yaoi Game) (No VA),FAMILY,4.80,4,23.00,10,Paid,14.99,Teen,Casual,149.90
248,The DG Buddy,BUSINESS,3.70,3,11.00,10,Paid,2.49,Everyone,Business,24.90
291,AC DC Power Monitor,LIFESTYLE,5.00,1,1.20,10,Paid,3.04,Everyone,Lifestyle,30.40
...,...,...,...,...,...,...,...,...,...,...,...
7957,League of Stickman 2018- Ninja Arena PVP(Dream...,GAME,4.40,32496,99.00,1000000,Paid,0.99,Teen,Action,990000.00
7977,Sleep as Android Unlock,LIFESTYLE,4.50,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.00
7988,Where's My Water?,FAMILY,4.70,188740,69.00,1000000,Paid,1.99,Everyone,Puzzle;Brain Games,1990000.00
8825,Hitman Sniper,GAME,4.60,408292,29.00,10000000,Paid,0.99,Mature 17+,Action,9900000.00


In [43]:
# Create boxplot
paid_apps_box = px.box(paid_apps_df,
                       x='Category',
                       y='Revenue_Estimate',
                       title='How much can paid apps earn?',
                       )

paid_apps_box.update_layout(xaxis_title='Category', 
                            yaxis_title='Paid App Revenue Estimate',
                            xaxis={'categoryorder':'min ascending'},
                            yaxis=dict(type='log'))

paid_apps_box.show()

**Observations:**

It seems that many paid apps earn below `$10,000`.
It's worth noting that a simple Android app would [cost](https://www.businessofapps.com/app-developers/research/app-development-cost/) `$16,000-$32,000` to develop. This means many developers will have to find other revenue streams to cover development costs.
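
The "below `$10,000`" reading can be checked numerically rather than by eye. The sketch below assumes `Revenue_Estimate` is `Price` multiplied by `Installs`, which is consistent with the rows displayed above. It uses only those nine visible rows (the first and last rows of the table), so the resulting fraction is illustrative and not representative of the full dataset; `sample_paid` and `share_below_10k` are hypothetical names.

```python
import pandas as pd

# Price and Installs for the nine rows visible in the paid_apps_df output above
sample_paid = pd.DataFrame({
    'Price': [1.49, 0.99, 14.99, 2.49, 3.04, 0.99, 5.99, 1.99, 0.99],
    'Installs': [1, 1, 10, 10, 10,
                 1_000_000, 1_000_000, 1_000_000, 10_000_000],
})

# Assumption: the revenue estimate is simply price times reported installs
sample_paid['Revenue_Estimate'] = sample_paid['Price'] * sample_paid['Installs']

# Fraction of apps in this sample whose estimated revenue is under $10,000
share_below_10k = (sample_paid['Revenue_Estimate'] < 10_000).mean()
print(f"{share_below_10k:.0%} of the sampled paid apps earn below $10,000")
```

Running the same comparison on the full `paid_apps_df['Revenue_Estimate']` column would give the actual share for the dataset.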

Outliers with High Earnings:
It's also worth noting that certain app categories appear to have a high number of outliers with much higher revenue (Game, Family, Personalization, Tools, Photography, Medical, and Communication). This suggests that apps in these categories have the potential to earn considerably more than the typical paid app.
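
One way to count high-revenue outliers per category is the common 1.5 × IQR rule. The sketch below is a minimal example on made-up numbers: `revenue_df` and `count_high_outliers` are hypothetical, and the rule is an assumption that may not match exactly how the box plot above draws its whiskers.

```python
import pandas as pd

# Made-up revenue figures for illustration only (not taken from the dataset)
revenue_df = pd.DataFrame({
    'Category': ['GAME'] * 6 + ['LIFESTYLE'] * 5,
    'Revenue_Estimate': [100, 200, 300, 400, 500, 1_000_000,
                         10, 20, 30, 40, 50],
})

def count_high_outliers(revenue: pd.Series) -> int:
    """Count values above Q3 + 1.5 * IQR within one category."""
    q1, q3 = revenue.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((revenue > upper_fence).sum())

# Apply the rule separately to each category
outlier_counts = (revenue_df
                  .groupby('Category')['Revenue_Estimate']
                  .apply(count_high_outliers))
print(outlier_counts)
```

Swapping `revenue_df` for `paid_apps_df` would produce a per-category outlier count to compare against the visual impression from the chart.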

***
# Examine Paid App Pricing Strategies by Category <a class="anchor" id="head_8"></a>

If a developer wants to list a paid app, how should they price it? It's helpful to look at the competitors in the same category.

In [44]:
# Find the median price for paid app
paid_apps_df['Price'].median()

2.99

In [45]:
# Create boxplot for app's price based on category
paid_price_box = px.box(paid_apps_df,
                        x='Category',
                        y='Price')

paid_price_box.update_layout(xaxis_title='Category',
                             yaxis_title='App Price',
                             xaxis={'categoryorder': 'max descending'},
                             yaxis=dict(type='log'))

paid_price_box.show()

**Observations:**

The median price of a paid Android app is `$2.99`. Some categories, however, have a higher median price than others. The Medical category, for example, has a median price of `$5.49`. Furthermore, the Dating category has an unexpectedly high median price of `$6.99`. On the lower end, Personalization apps have a median price of `$1.49` and Video Players apps have a median price of `$1.29`.
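
The per-category medians quoted above can be read off the box plot, but they can also be computed directly with a `groupby`. The sketch below assumes a table with `Category` and `Price` columns like `paid_apps_df`; it uses only the nine rows visible in the earlier output, so its medians will differ from the full-dataset figures. `sample_prices` and `median_by_category` are hypothetical names.

```python
import pandas as pd

# Category and Price for the nine rows visible in the paid_apps_df output
sample_prices = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'FAMILY', 'BUSINESS', 'LIFESTYLE',
                 'GAME', 'LIFESTYLE', 'FAMILY', 'GAME'],
    'Price': [1.49, 0.99, 14.99, 2.49, 3.04, 0.99, 5.99, 1.99, 0.99],
})

# Median price per category, highest first
median_by_category = (sample_prices
                      .groupby('Category')['Price']
                      .median()
                      .sort_values(ascending=False))
print(median_by_category)
```

The median is used rather than the mean because a handful of very expensive apps would otherwise pull the typical price upward.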

***
# Conclusions <a class="anchor" id="conclusion"></a>

* The most popular apps are in the **Social and Communication** categories.

* Approximately **80%** of app content in the app store is targeted for **everyone aged 10+**.

* The **highest-grossing paid app is Minecraft, with nearly $70 million** (2018). Minecraft is classified as a Family app rather than a Game. However, if we consider Minecraft, Bloons TD 5, and Card Wars to be games rather than family apps, we can see that **games account for 7 of the top 10 highest-grossing apps**.

* The **Family and Game categories are the most competitive** in terms of the number of apps created. Thus, when another app is released in those categories, it's difficult for it to get noticed.

* The **most popular apps** in terms of downloads are in the **Game, Communication, and Tools** categories, whereas the Events and Parenting categories are the least popular.

* A developer who wants to create a simple Android app has to consider that development would cost `~$16,000-$32,000`, while the dataset shows that **many paid apps earn below `$10,000`**. It's important to find other revenue streams to cover development costs.

* It's worth noting that in the **Game, Family, Personalization, Tools, Photography, Medical, and Communication** categories, there appears to be a high number of outliers with much higher revenue, indicating the **potential for revenue growth**.

* When pricing an app, keep in mind that the **median paid app costs $2.99**. The Medical and Dating categories have a higher median price, whereas the Personalization and Video Players categories have a lower one.




*Project based on 100 Days of Code: The Complete Python Pro Bootcamp for 2022 via Udemy*