# Introduction

In this notebook, we analyse the Android app market by comparing thousands of apps in the Google Play Store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [1]:
import pandas as pd


# Notebook Presentation

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [3]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [4]:
df_apps.shape

(10841, 12)

In [5]:
df_apps.describe()

Unnamed: 0,Rating,Reviews,Size_MBs
count,9367.0,10841.0,10841.0
mean,4.19,444111.93,19.77
std,0.52,2927628.66,21.4
min,1.0,0.0,0.01
25%,4.0,38.0,4.9
50%,4.3,2094.0,11.0
75%,4.5,54768.0,27.0
max,5.0,78158306.0,100.0


In [6]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [7]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
6425,easyFocus,PHOTOGRAPHY,3.7,3396,0.26,500000,Free,0,Everyone,Photography,"November 28, 2012",2.2 and up
4560,MoodSpace,MEDICAL,4.5,503,22.0,50000,Free,0,Everyone,Medical,"June 10, 2018",5.0 and up
7167,"Meet24 - Love, Chat, Singles",DATING,4.2,57081,7.9,1000000,Free,0,Mature 17+,Dating,"July 27, 2018",4.0.3 and up
721,FN,BUSINESS,5.0,14,3.3,50,Free,0,Everyone,Business,"February 1, 2018",4.0 and up
4980,School scientific calculator fx 500 es plus 50...,FAMILY,4.6,1553,9.2,100000,Free,0,Everyone,Education,"August 2, 2018",4.0 and up


### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [8]:
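# Note: drop() returns a new DataFrame; without assignment (or inplace=True),
# df_apps itself is left unchanged, as the shape check in the next cell shows.
df_apps.drop('Last_Updated', axis=1)
df_apps.drop('Android_Ver', axis=1)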

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social,"July 28, 2017"
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education,"April 15, 2016"
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018"
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy,"June 28, 2018"
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business,"August 6, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018"
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018"
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018"
10839,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018"


In [9]:
df_apps.shape

(10841, 12)
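
The shape is still `(10841, 12)` because `DataFrame.drop()` returns a new DataFrame by default and leaves the original untouched. A minimal sketch of the difference, using a small made-up frame (`toy` and its values are illustrative and not taken from `apps.csv`):

```python
import pandas as pd

# Hypothetical three-column frame standing in for df_apps
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Last_Updated": ["July 1, 2018", "May 2, 2017"],
    "Android_Ver": ["4.0 and up", "5.0 and up"],
})

# Without assignment the result is discarded and toy keeps all its columns
toy.drop("Last_Updated", axis=1)
shape_unchanged = toy.shape

# Assigning the result back (or passing inplace=True) makes the change stick
toy = toy.drop(columns=["Last_Updated", "Android_Ver"])
shape_after = toy.shape
```

Applying the same assignment pattern to `df_apps` would reduce it to 10 columns, although the later cells in this notebook were run with both columns still present.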

### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [10]:
df_apps.Rating.isna().sum()

np.int64(1474)

In [11]:
# dropna() with no arguments drops rows with a NaN in any column, not only Rating
df_apps_clean = df_apps.dropna()

In [12]:
df_apps_clean.isna().sum()

App               0
Category          0
Rating            0
Reviews           0
Size_MBs          0
Installs          0
Type              0
Price             0
Content_Rating    0
Genres            0
Last_Updated      0
Android_Ver       0
dtype: int64

In [13]:
df_apps_clean.shape

(9365, 12)
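
The result has 9365 rows rather than the 10841 - 1474 = 9367 that removing only the missing ratings would leave, presumably because two further rows have a missing value in some other column. A small sketch of the difference between dropping on any column and dropping on `Rating` alone (the frame `toy` is made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical data: row B lacks a rating, row C lacks a type
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# Drops every row that has a missing value in any column
drop_any = toy.dropna()

# Drops only the rows whose Rating is missing
drop_rating_only = toy.dropna(subset=["Rating"])
```

Here `drop_any` keeps only row A, while `drop_rating_only` keeps rows A and C.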

### Find and Remove Duplicates

**Challenge**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [14]:
df_apps_clean.duplicated().sum()

np.int64(474)

In [15]:
df_apps_dup = df_apps_clean[df_apps_clean.duplicated()]
df_apps_dup.shape

(474, 12)

In [16]:
df_apps_dup.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical,"June 6, 2018",4.1 and up
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating,"July 17, 2018",4.4 and up
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical,"July 25, 2018",3.0 and up
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical,"July 25, 2018",5.0 and up
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating,"July 24, 2018",4.0.3 and up


In [17]:
df_apps_dup[df_apps_dup.App == "Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


In [18]:
df_apps_clean[df_apps_clean.App == "Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


In [19]:
df_apps_dedup = df_apps_clean.drop_duplicates(keep='first', subset=['App', 'Type', 'Price'])
df_apps_dedup.shape

(8197, 12)
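
The Instagram rows above differ only in their review counts, so a full-row `duplicated()` check flags just one of them. Restricting the comparison with `subset` treats rows as duplicates when the chosen columns match. A sketch with made-up rows (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows: three Instagram entries, two of them fully identical
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Other"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [66577313, 66577446, 66577313, 10],
})

# Full-row comparison: only the exact repeat is flagged
full_row_dupes = int(toy.duplicated().sum())

# Comparison on App, Type and Price: every repeat after the first is flagged
subset_dupes = int(toy.duplicated(subset=["App", "Type", "Price"]).sum())

# Keep the first occurrence of each (App, Type, Price) combination
deduped = toy.drop_duplicates(subset=["App", "Type", "Price"], keep="first")
```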

# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [20]:
# These are sorted by Rating
df_apps_dedup.sort_values('Rating', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
126,Tablet Reminder,MEDICAL,5.00,4,2.50,5,Free,0,Everyone,Medical,"August 3, 2018",4.1 and up
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical,"August 2, 2018",4.0 and up
4058,Ek Bander Ne Kholi Dukan,FAMILY,5.00,10,3.00,10000,Free,0,Everyone,Entertainment,"June 26, 2017",4.0 and up
321,FK CLASSIC FOR YOU,BUSINESS,5.00,1,3.50,10,Free,0,Everyone,Business,"February 20, 2018",4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
818,Familial Hypercholesterolaemia Handbook,MEDICAL,1.00,2,33.00,100,Free,0,Everyone,Medical,"July 2, 2018",4.1 and up
617,DT future1 cam,TOOLS,1.00,1,24.00,50,Free,0,Everyone,Tools,"March 27, 2018",2.2 and up
576,Clarksburg AH,MEDICAL,1.00,1,28.00,50,Free,0,Everyone,Medical,"May 1, 2017",4.0.3 and up
357,Speech Therapy: F,FAMILY,1.00,1,16.00,10,Paid,$2.99,Everyone,Education,"October 7, 2016",2.3.3 and up
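
The apps shown at the top of this list have a perfect 5.0 from only a handful of reviews, so the rating alone says little about quality. One possible workaround, sketched here with made-up rows and an arbitrary threshold (`min_reviews` is an assumption, not a value from the notebook), is to require a minimum number of reviews before ranking:

```python
import pandas as pd

# Hypothetical apps: two with perfect ratings but almost no reviews
toy = pd.DataFrame({
    "App": ["Tiny", "Small", "Popular", "Huge"],
    "Rating": [5.0, 5.0, 4.7, 4.5],
    "Reviews": [1, 4, 250000, 66000000],
})

min_reviews = 1000  # arbitrary cut-off chosen for illustration

# Keep only apps with enough reviews, then rank by rating
well_reviewed = toy[toy.Reviews >= min_reviews]
ranked = well_reviewed.sort_values("Rating", ascending=False)
top_app = ranked.iloc[0].App
```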


In [21]:
# These are sorted by the number of reviews
df_apps_dedup.sort_values('Reviews', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10805,Facebook,SOCIAL,4.10,78158306,5.30,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.40,69119316,3.50,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10806,Instagram,SOCIAL,4.50,66577313,5.30,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.00,56642847,3.50,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10650,Clash of Clans,GAME,4.60,44891723,98.00,100000000,Free,0,Everyone 10+,Strategy,"July 15, 2018",4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
265,FAST EO,EVENTS,5.00,1,6.30,10,Free,0,Everyone,Events,"May 15, 2018",4.1 and up
576,Clarksburg AH,MEDICAL,1.00,1,28.00,50,Free,0,Everyone,Medical,"May 1, 2017",4.0.3 and up
582,MI-BP,HEALTH_AND_FITNESS,5.00,1,12.00,50,Free,0,Everyone,Health & Fitness,"April 9, 2018",5.0 and up
485,Familyfirst Messenger,MEDICAL,4.00,1,3.30,10,Free,0,Everyone,Medical,"July 12, 2018",4.4 and up


In [22]:
# These are sorted by the size of the app
df_apps_dedup.sort_values('Size_MBs', ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10295,SimCity BuildIt,FAMILY,4.50,4218587,100.00,50000000,Free,0,Everyone 10+,Simulation,"June 19, 2018",4.0 and up
3144,Vi Trainer,HEALTH_AND_FITNESS,3.60,124,100.00,5000,Free,0,Everyone,Health & Fitness,"August 2, 2018",5.0 and up
4176,Car Crash III Beam DH Real Damage Simulator 2018,GAME,3.60,151,100.00,10000,Free,0,Everyone,Racing,"May 20, 2018",4.1 and up
7926,Post Bank,FINANCE,4.50,60449,100.00,1000000,Free,0,Everyone,Finance,"July 23, 2018",4.0 and up
8718,Mini Golf King - Multiplayer Game,GAME,4.50,531458,100.00,5000000,Free,0,Everyone,Sports,"July 20, 2018",4.0.3 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
5798,ExDialer PRO Key,COMMUNICATION,4.50,5474,0.02,100000,Paid,$3.99,Everyone,Communication,"January 15, 2014",2.1 and up
2648,Ad Remove Plugin for App2SD,PRODUCTIVITY,4.10,66,0.02,1000,Paid,$1.29,Everyone,Productivity,"September 25, 2013",2.2 and up
2684,My baby firework (Remove ad),FAMILY,4.10,30,0.01,1000,Paid,$0.99,Everyone,Entertainment,"April 25, 2013",Varies with device
7966,Market Update Helper,LIBRARIES_AND_DEMO,4.10,20145,0.01,1000000,Free,0,Everyone,Libraries & Demo,"February 12, 2013",1.5 and up


In [24]:
# Take the 50 most-reviewed apps so we can check how many of them are paid
df_apps_dedup.sort_values('Reviews', ascending=False).head(50)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy,"July 15, 2018",4.1 and up
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools,"August 3, 2018",Varies with device
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors,"August 2, 2018",Varies with device
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools,"August 4, 2018",Varies with device
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy,"June 27, 2018",4.1 and up


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [25]:
# These are sorted by the size of the app
df_apps_dedup.sort_values('Size_MBs', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10295,SimCity BuildIt,FAMILY,4.5,4218587,100.0,50000000,Free,0,Everyone 10+,Simulation,"June 19, 2018",4.0 and up
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness,"August 2, 2018",5.0 and up
4176,Car Crash III Beam DH Real Damage Simulator 2018,GAME,3.6,151,100.0,10000,Free,0,Everyone,Racing,"May 20, 2018",4.1 and up
7926,Post Bank,FINANCE,4.5,60449,100.0,1000000,Free,0,Everyone,Finance,"July 23, 2018",4.0 and up
8718,Mini Golf King - Multiplayer Game,GAME,4.5,531458,100.0,5000000,Free,0,Everyone,Sports,"July 20, 2018",4.0.3 and up
7927,The Walking Dead: Our World,GAME,4.0,22435,100.0,1000000,Free,0,Teen,Action,"August 1, 2018",5.0 and up
7928,Stickman Legends: Shadow Wars,GAME,4.4,38419,100.0,1000000,Paid,$0.99,Everyone 10+,Action,"August 3, 2018",4.1 and up
9945,Ultimate Tennis,SPORTS,4.3,183004,100.0,10000000,Free,0,Everyone,Sports,"July 19, 2018",4.0.3 and up
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play,"July 16, 2018",4.0 and up
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action,"July 9, 2018",4.0 and up
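
Every row shown above sits at exactly 100 MB, which suggests (but does not prove) that sizes are capped or clipped at that value in this dataset. A quick way to check how many apps share the maximum, sketched on a made-up frame (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical sizes: three apps at the same maximum value
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Size_MBs": [100.0, 100.0, 100.0, 42.5],
})

# Largest size in the column and the number of apps that hit it
largest = toy.Size_MBs.max()
apps_at_max = int((toy.Size_MBs == largest).sum())
```

Many apps at one identical maximum is the pattern a hard limit would produce.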


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [26]:
# Take the 50 most-reviewed apps so we can check how many of them are paid
df_apps_dedup.sort_values('Reviews', ascending=False).head(50)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy,"July 15, 2018",4.1 and up
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools,"August 3, 2018",Varies with device
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors,"August 2, 2018",Varies with device
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools,"August 4, 2018",Varies with device
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy,"June 27, 2018",4.1 and up
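
The cell above lists the most-reviewed apps but leaves the counting to the eye. One way to count the paid apps directly is sketched below on a made-up frame (`toy` and `top_n` are assumptions for illustration; the notebook uses the top 50):

```python
import pandas as pd

# Hypothetical apps with review counts and a Free/Paid type
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [900, 50, 700, 10],
    "Type": ["Free", "Paid", "Free", "Paid"],
})

top_n = 2  # the notebook uses 50; 2 keeps the toy example small

# Most-reviewed apps first, then count how many of them are paid
top_by_reviews = toy.sort_values("Reviews", ascending=False).head(top_n)
paid_in_top = int((top_by_reviews.Type == "Paid").sum())

# value_counts() gives the full Free/Paid breakdown in one call
type_counts = top_by_reviews.Type.value_counts()
```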


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [27]:
ratings = df_apps_dedup.Content_Rating.value_counts()
ratings

Content_Rating
Everyone           6619
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: count, dtype: int64

In [28]:
import plotly.express as px

In [29]:
ratings.describe()

count       6.00
mean    1,366.17
std     2,594.80
min         1.00
25%        78.50
50%       331.00
75%       773.25
max     6,619.00
Name: count, dtype: float64

In [30]:
ratings.head()

Content_Rating
Everyone           6619
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Name: count, dtype: int64

In [32]:
fig = px.pie(labels=ratings.index,
             values=ratings.values,
             title="Content Rating",
             names=ratings.index)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

In [34]:
fig = px.pie(labels=ratings.index,
             values=ratings.values,
             title="Content Rating",
             names=ratings.index,
             hole=0.4)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [35]:
df_apps_dedup.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
28,Ra Ga Ba,GAME,5.0,2,20.0,1,Paid,$1.49,Everyone,Arcade,"February 8, 2017",2.3 and up
47,Mu.F.O.,GAME,5.0,2,16.0,1,Paid,$0.99,Everyone,Arcade,"March 3, 2017",2.3 and up
82,Brick Breaker BR,GAME,5.0,7,19.0,5,Free,0,Everyone,Arcade,"July 23, 2018",4.1 and up
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.0,1,4.6,5,Free,0,Everyone,Medical,"August 2, 2018",4.0 and up


In [37]:
df_apps_dedup.sort_values(by=['Installs'], ascending=False).tail(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
2408,BP Tracker-Symptoms & Solution,MEDICAL,4.2,34,3.2,1000,Free,0,Everyone,Medical,"October 12, 2016",3.0 and up
2407,Dy So Exam,FAMILY,3.0,6,3.2,1000,Free,0,Everyone,Education,"July 3, 2017",2.3 and up
2406,CI Dictionary,FAMILY,4.6,31,3.2,1000,Free,0,Everyone,Education,"September 1, 2015",4.0.3 and up
2405,ALPHA - Artificial Intelligence,FAMILY,4.0,51,3.2,1000,Free,0,Everyone,Entertainment,"June 3, 2018",4.0 and up
2404,BK Shivani Videos,LIFESTYLE,4.7,29,3.2,1000,Free,0,Everyone,Lifestyle,"July 25, 2018",4.0.3 and up
2403,CI Screwed - Icon Pack,PERSONALIZATION,4.7,19,6.4,1000,Free,0,Everyone,Personalization,"June 30, 2016",4.0.3 and up
2402,EG Movi,TOOLS,4.2,40,7.4,1000,Free,0,Everyone,Tools,"May 12, 2017",3.2 and up
28,Ra Ga Ba,GAME,5.0,2,20.0,1,Paid,$1.49,Everyone,Arcade,"February 8, 2017",2.3 and up
47,Mu.F.O.,GAME,5.0,2,16.0,1,Paid,$0.99,Everyone,Arcade,"March 3, 2017",2.3 and up
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up


In [45]:
# Count the apps at each level of installation. The index is sorted, but as text (lexicographically), not numerically.
df_apps_dedup.groupby('Installs')['Installs'].count()



Installs
1                   3
1,000             697
1,000,000        1417
1,000,000,000      20
10                 69
10,000            987
10,000,000        933
100               303
100,000          1096
100,000,000       189
5                   9
5,000             425
5,000,000         607
50                 56
50,000            457
50,000,000        202
500               199
500,000           504
500,000,000        24
Name: Installs, dtype: int64
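
The odd ordering above (`1,000,000,000` before `10`) comes from comparing the values character by character as strings. A small plain-Python sketch of the same effect, using a few made-up install levels:

```python
# Install levels stored as text, as in the Installs column before conversion
levels = ["1", "1,000", "10", "5", "1,000,000"]

# Text sort: compares character by character, and "," sorts before "0"
as_text = sorted(levels)

# Numeric sort: strip the commas and compare the integer values
as_numbers = sorted(levels, key=lambda s: int(s.replace(",", "")))
```

Here `as_text` is `['1', '1,000', '1,000,000', '10', '5']`, while `as_numbers` is `['1', '5', '10', '1,000', '1,000,000']`.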

In [52]:
# Installs has dtype object (strings with commas as thousands separators), so it must be converted to a number.
df_apps_dedup.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs           object
Type               object
Price              object
Content_Rating     object
Genres             object
Last_Updated       object
Android_Ver        object
dtype: object

In [65]:
# Convert Installs from object (string) to int64 in a copy of df_apps_dedup
df_apps_numbers = df_apps_dedup.copy()
# Step 1: remove the thousands separators (astype(str) keeps the cell safe to re-run)
df_apps_numbers.Installs = df_apps_numbers.Installs.astype(str).str.replace(',', '')
# Step 2: parse the cleaned strings as numbers
df_apps_numbers.Installs = pd.to_numeric(df_apps_numbers.Installs)
df_apps_numbers.groupby('Installs')['Installs'].count()


Installs
1                3
5                9
10              69
50              56
100            303
500            199
1000           697
5000           425
10000          987
50000          457
100000        1096
500000         504
1000000       1417
5000000        607
10000000       933
50000000       202
100000000      189
500000000       24
1000000000      20
Name: Installs, dtype: int64

# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [68]:
# Find the most expensive apps. Let's see what we have first.
df_apps_clean.sort_values('Price', ascending=True).head()
# The prices are strings with dollar signs, so they sort as text rather than as numbers.


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
1472,FO Bixby,PERSONALIZATION,5.0,5,0.84,100,Paid,$0.99,Everyone,Personalization,"April 25, 2018",7.0 and up
4975,ZOOKEEPER DX TouchEdition,FAMILY,4.2,3195,4.6,100000,Paid,$0.99,Everyone,Puzzle,"November 26, 2015",2.1 and up
748,Interactive NPC DM Tool,FAMILY,2.8,5,0.61,50,Paid,$0.99,Everyone,Role Playing,"January 31, 2015",2.3.3 and up
1203,World Racers family board game,FAMILY,4.8,4,42.0,100,Paid,$0.99,Everyone,Board;Pretend Play,"September 3, 2015",5.1 and up
1204,Dress Up RagazzA13 DX,FAMILY,3.5,13,42.0,100,Paid,$0.99,Everyone,Simulation,"September 16, 2016",4.0.3 and up


In [69]:
# Count the apps at each price with a groupby. The result shows 73 distinct price values.
df_apps_clean.groupby('Price').Price.count()

Price
$0.99     107
$1.00       2
$1.20       1
$1.29       1
$1.49      30
         ... 
$8.49       1
$8.99       4
$9.00       2
$9.99      16
0        8719
Name: Price, Length: 73, dtype: int64

In [72]:
# Now let's strip out the dollar signs.
df_apps_cleaner = df_apps_clean.copy()
df_apps_cleaner.Price = [x.replace('$', '') for x in df_apps_cleaner.Price]
df_apps_cleaner.Price = pd.to_numeric(df_apps_cleaner.Price)
df_apps_cleaner.groupby('Price')['Price'].count()



Price
0.00      8719
0.99       107
1.00         2
1.20         1
1.29         1
          ... 
299.99       1
379.99       1
389.99       1
399.99      11
400.00       1
Name: Price, Length: 73, dtype: int64

### The most expensive apps sub $250

In [73]:
# Show the apps that cost $250 or more; these are the ones we want to filter out
df_apps_cleaner[df_apps_cleaner.Price >= 250]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,399.99,Everyone,Entertainment,"July 16, 2018",7.0 and up
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance,"December 6, 2017",4.0.3 and up
2193,I am extremely Rich,LIFESTYLE,2.9,41,2.9,1000,Paid,379.99,Everyone,Lifestyle,"July 1, 2018",4.0 and up
2394,I am Rich!,FINANCE,3.8,93,22.0,1000,Paid,399.99,Everyone,Finance,"December 11, 2017",4.1 and up
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance,"June 25, 2018",4.1 and up
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment,"May 30, 2017",1.6 and up
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance,"March 22, 2018",4.2 and up
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance,"May 1, 2017",4.4 and up
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment,"May 19, 2018",4.4 and up
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle,"March 11, 2018",4.4 and up


In [74]:
# Count the apps that cost $250 or more
df_apps_cleaner[df_apps_cleaner.Price >= 250].count()

App               15
Category          15
Rating            15
Reviews           15
Size_MBs          15
Installs          15
Type              15
Price             15
Content_Rating    15
Genres            15
Last_Updated      15
Android_Ver       15
dtype: int64

In [76]:
df_apps_cleaner[df_apps_cleaner.Price <= 250].sort_values(by="Price", ascending=False).head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
2282,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical,"June 18, 2018",4.0.3 and up
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical,"June 18, 2018",4.0.3 and up
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical,"April 4, 2018",4.1 and up
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle,"July 18, 2017",4.0.3 and up
2482,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical,"October 2, 2017",4.0 and up
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical,"October 2, 2017",4.0 and up
2207,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,29.99,Everyone,Medical,"October 22, 2014",4.0 and up
2464,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical,"December 22, 2015",2.2 and up
504,AP Art History Flashcards,FAMILY,5.0,1,96.0,10,Paid,29.99,Mature 17+,Education,"January 19, 2016",4.0 and up
2208,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,29.99,Everyone,Medical,"October 22, 2014",4.0 and up
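
The cell above only views the filtered rows; the DataFrame itself still contains the expensive apps. To remove them for good, the filtered result has to be assigned back. A minimal sketch on a made-up frame (`toy` and its values are illustrative; `price_cap` mirrors the $250 threshold from the challenge):

```python
import pandas as pd

# Hypothetical prices, already converted to numbers
toy = pd.DataFrame({
    "App": ["Cheap", "Pricey", "Junk"],
    "Price": [0.99, 79.99, 399.99],
})

price_cap = 250  # threshold named in the challenge

# Keep only the apps at or below the cap and store the result
toy = toy[toy.Price <= price_cap]

# The most expensive remaining app
most_expensive = toy.sort_values("Price", ascending=False).iloc[0].App
```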


### Highest Grossing Paid Apps (ballpark estimate)

In [87]:
df_apps_cleaner.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs           object
Type               object
Price             float64
Content_Rating     object
Genres             object
Last_Updated       object
Android_Ver        object
dtype: object

In [82]:
# Find out if any of the price entries are blank
df_apps_cleaner.Price.isna().any()

np.False_

In [83]:
# Is any price blank?
df_apps_cleaner.Price.isnull().values.any()

np.False_

In [84]:
# is any install blank?
df_apps_cleaner.Installs.isna().any()

np.False_

In [85]:
# is any install blank?
df_apps_cleaner.Installs.isnull().values.any()

np.False_

In [93]:
# Installs are of type object.  We need to convert them to integers
#df_apps_cleaner.Installs = pd.to_numeric(df_apps_cleaner.Installs)
df_apps_cleaner.dtypes


Revenue_Estimate    float64
App                  object
Category             object
Rating              float64
Reviews               int64
Size_MBs            float64
Installs              int64
Type                 object
Price               float64
Content_Rating       object
Genres               object
Last_Updated         object
Android_Ver          object
dtype: object

In [92]:
# Multiply the price by the number of installs and treat the product as a rough proxy for revenue.
# The result is stored in a new column called Revenue_Estimate; the following cells sort by it.

# Installs may still contain commas, so strip them and convert to numbers first.
# astype(str) keeps this cell safe to re-run once the column is already numeric.
df_apps_cleaner.Installs = df_apps_cleaner.Installs.astype(str).str.replace(',', '')
df_apps_cleaner.Installs = pd.to_numeric(df_apps_cleaner.Installs)
# insert() raises an error if the column already exists, so only add it once.
if 'Revenue_Estimate' not in df_apps_cleaner.columns:
    df_apps_cleaner.insert(0, "Revenue_Estimate", df_apps_cleaner.Price * df_apps_cleaner.Installs)

In [96]:
df_apps_cleaner.head()


Unnamed: 0,Revenue_Estimate,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,0.0,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0.0,Everyone,Medical,"August 2, 2018",4.0.3 and up
28,1.49,Ra Ga Ba,GAME,5.0,2,20.0,1,Paid,1.49,Everyone,Arcade,"February 8, 2017",2.3 and up
47,0.99,Mu.F.O.,GAME,5.0,2,16.0,1,Paid,0.99,Everyone,Arcade,"March 3, 2017",2.3 and up
82,0.0,Brick Breaker BR,GAME,5.0,7,19.0,5,Free,0.0,Everyone,Arcade,"July 23, 2018",4.1 and up
99,0.0,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.0,1,4.6,5,Free,0.0,Everyone,Medical,"August 2, 2018",4.0 and up


In [98]:
df_apps_cleaner.sort_values(by='Revenue_Estimate', ascending=False).head(10)

Unnamed: 0,Revenue_Estimate,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
9220,69900000.0,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device
9224,69900000.0,Minecraft,FAMILY,4.5,2375336,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device
5765,39999000.0,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle,"January 12, 2018",4.0.3 and up
4606,19999500.0,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance,"November 12, 2017",4.0 and up
8825,9900000.0,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",4.1 and up
7151,6990000.0,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,"March 21, 2015",3.0 and up
7479,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
7478,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
7477,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
7977,5990000.0,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,"June 27, 2018",4.0 and up


In [101]:
df_apps_cleaner[df_apps_cleaner.Price <= 250].sort_values(by="Revenue_Estimate", ascending=False).head(10)

Unnamed: 0,Revenue_Estimate,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
9220,69900000.0,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device
9224,69900000.0,Minecraft,FAMILY,4.5,2375336,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device
8825,9900000.0,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,"July 12, 2018",4.1 and up
7151,6990000.0,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,"March 21, 2015",3.0 and up
7478,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
7977,5990000.0,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,"June 27, 2018",4.0 and up
7479,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
7477,5990000.0,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,"July 25, 2018",4.1 and up
6594,4990000.0,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,"July 19, 2016",2.3 and up
6082,2995000.0,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,"November 21, 2017",Varies with device
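
The ranking above lists Minecraft twice and Facetune three times, so duplicate rows take up slots in the top 10. Below is a minimal sketch of one way to handle this. It rests on two assumptions: that rows sharing the same `App`, `Type` and `Price` are the same listing, and that `Revenue_Estimate` is `Installs * Price` (consistent with the values shown, but not confirmed here). The toy data is made up for illustration.

```python
import pandas as pd

# Toy data (made up) that mimics duplicated app rows
toy = pd.DataFrame({
    'App': ['Alpha', 'Alpha', 'Beta', 'Gamma', 'Gamma'],
    'Type': ['Paid'] * 5,
    'Price': [6.99, 6.99, 0.99, 5.99, 5.99],
    'Installs': [1000, 1000, 5000, 200, 200],
})

# Assumption: revenue estimate is installs multiplied by price
toy['Revenue_Estimate'] = toy['Installs'] * toy['Price']

# Assumption: same App/Type/Price means the same listing, so keep one row each
deduped = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

# Rank the remaining rows by estimated revenue
top = deduped.sort_values(by='Revenue_Estimate', ascending=False).head(10)
print(top[['App', 'Revenue_Estimate']])
```

Whether to deduplicate on these three columns or on more of them (for example `Reviews`) depends on how the duplicates arose in the scrape.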


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [104]:
df_apps_cleaner.Category.nunique()

33

In [107]:
top10_category = df_apps_cleaner.Category.value_counts()[:10]
top10_category

Category
FAMILY           1747
GAME             1097
TOOLS             734
PRODUCTIVITY      351
MEDICAL           350
COMMUNICATION     328
FINANCE           323
SPORTS            319
PHOTOGRAPHY       317
LIFESTYLE         315
Name: count, dtype: int64

In [109]:
bar = px.bar(x = top10_category.index,
             y = top10_category.values,
             color = top10_category.index,
             title = 'Top 10 Categories by Number of Apps')
bar.show()

### Vertical Bar Chart - Highest Competition (Number of Apps)

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [113]:
category_installs = df_apps_cleaner.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values(by='Installs', ascending=False, inplace=True)

h_bar = px.bar(x=category_installs.Installs, 
               y=category_installs.index,
               orientation='h',
               title='Category Popularity'
               )
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')

h_bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 
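
As a rough sketch of the first step in this challenge, the per-category app count and install total can also be built in a single `groupby` with named aggregation. The toy data and the name `cat_summary` are made up for illustration; only the column names `App`, `Category` and `Installs` come from the dataset.

```python
import pandas as pd

# Toy data (made up): five apps in two categories
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd', 'e'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS', 'TOOLS'],
    'Installs': [1000, 500, 100, 100, 50],
})

# One row per category: number of apps and total installs
cat_summary = toy.groupby('Category').agg(
    App=('App', 'count'),
    Installs=('Installs', 'sum'),
)
print(cat_summary)
```

This avoids a separate merge, at the cost of computing both aggregates in the same place.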

In [119]:
# category_installs (total installs per category) was already computed above
category_number = df_apps_cleaner.groupby('Category').agg({'App': pd.Series.count})
category_merged_df = pd.merge(category_installs, category_number, on='Category')
category_merged_df.shape

(33, 2)

In [121]:

scatter = px.scatter(category_merged_df,
                     x='App',
                     y='Installs',
                     title='Category concentration',
                     size='App',
                     hover_name=category_merged_df.index,
                     color='Installs'
                     )
scatter.update_layout(xaxis_title='Number of apps',
                      yaxis_title="Installs",
                      yaxis=dict(type='log')
                      )
scatter.show()

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).
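
To make the problem concrete, here is a small sketch on a made-up Series of genre strings. It shows how `.value_counts()` treats each combined string as its own value, and how splitting on the semicolon and stacking yields one entry per individual genre. The extra `.dropna()` is a defensive assumption so that the padding values created by `expand=True` never end up in the result.

```python
import pandas as pd

# Toy genre strings (made up); the semicolon separates nested genres
genres = pd.Series([
    'Action',
    'Action;Action & Adventure',
    'Puzzle;Brain Games',
    'Puzzle',
])

# Counting the raw strings treats every combination as a separate genre
nested_counts = genres.value_counts()

# Split into columns, stack into one long Series, drop the padding values
stacked = genres.str.split(';', expand=True).stack().dropna()

# Now each individual genre is counted once per app that lists it
genre_counts = stacked.value_counts()
print(genre_counts)
```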


In [122]:
df_apps_cleaner.groupby("Genres").Genres.count()

Genres
Action                                   358
Action;Action & Adventure                 17
Adventure                                 73
Adventure;Action & Adventure              13
Adventure;Brain Games                      1
                                        ... 
Video Players & Editors                  158
Video Players & Editors;Creativity         2
Video Players & Editors;Music & Video      3
Weather                                   75
Word                                      28
Name: Genres, Length: 115, dtype: int64

In [125]:
df_apps_cleaner.groupby("Genres").Genres.value_counts()
# Notice the output contains strings like "Action" and "Action;Action & Adventure". The semicolon is the delimiter.

Genres
Action                                   358
Action;Action & Adventure                 17
Adventure                                 73
Adventure;Action & Adventure              13
Adventure;Brain Games                      1
                                        ... 
Video Players & Editors                  158
Video Players & Editors;Creativity         2
Video Players & Editors;Music & Video      3
Weather                                   75
Word                                      28
Name: count, Length: 115, dtype: int64

In [124]:
len(df_apps_cleaner.Genres.unique())

115

In [129]:
# Split each Genres string on the semicolon, then stack the pieces into a single Series
stack = df_apps_cleaner.Genres.str.split(';', expand=True).stack()
print(f'The stack shape is: {stack.shape}')

num_genres = stack.value_counts()
print(f'Number of unique genres: {len(num_genres)}')

The stack shape is: (9848,)
Number of unique genres: 53


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/). 

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

In [156]:
# Use the split genre counts (num_genres) so an app with several genres counts towards each one
genre_category = num_genres[:20]

bar = px.bar(x = genre_category.index,
             y = genre_category.values,
             title = 'Top Genres',
             hover_name = genre_category.index,
             color = genre_category.values,
             color_continuous_scale='Agsunset',
             )

bar.update_layout(xaxis_title='Genre',
                  yaxis_title="Number of Apps",
                  coloraxis_showscale=False,
                  )

bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [157]:
df_apps_cleaner.Type.value_counts()

Type
Free    8719
Paid     646
Name: count, dtype: int64

**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use a `df_free_vs_paid` DataFrame that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).
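
To see what 'total descending' amounts to, the sketch below reproduces the ordering in pandas: categories are ranked by the sum of their free and paid counts. The counts and the names `toy_counts` and `category_order` are made up for illustration, and this is only an approximation of what Plotly does internally.

```python
import pandas as pd

# Toy counts (made up) in the same long format: one row per Category/Type pair
toy_counts = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS', 'FAMILY', 'FAMILY'],
    'Type': ['Free', 'Paid', 'Free', 'Paid', 'Free', 'Paid'],
    'App': [100, 10, 60, 5, 150, 20],
})

# Sum free and paid apps per category, largest total first
totals = toy_counts.groupby('Category')['App'].sum().sort_values(ascending=False)

# Category names in the order a 'total descending' axis would show them
category_order = totals.index.tolist()
print(category_order)
```

A list like this could also be passed explicitly through the `category_orders` argument of `px.bar` if a fixed order is preferred.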

In [163]:
df_free_vs_paid = df_apps_cleaner.groupby(['Category', 'Type'], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.sort_values(['App'], ascending=True)

Unnamed: 0,Category,Type,App
3,AUTO_AND_VEHICLES,Paid,1
17,ENTERTAINMENT,Paid,2
24,FOOD_AND_DRINK,Paid,2
38,NEWS_AND_MAGAZINES,Paid,2
48,SHOPPING,Paid,2
...,...,...,...
21,FINANCE,Free,310
45,PRODUCTIVITY,Free,333
53,TOOLS,Free,671
25,GAME,Free,1020


In [164]:
df_free_vs_paid.sort_values(['App'], ascending=False)

Unnamed: 0,Category,Type,App
19,FAMILY,Free,1585
25,GAME,Free,1020
53,TOOLS,Free,671
45,PRODUCTIVITY,Free,333
21,FINANCE,Free,310
...,...,...,...
24,FOOD_AND_DRINK,Paid,2
50,SOCIAL,Paid,2
40,PARENTING,Paid,2
48,SHOPPING,Paid,2


In [169]:
g_bar = px.bar(df_free_vs_paid,
               x ='Category',
               y = 'App',
               title = 'Free v Paid Apps by Category',
               color ='Type',
               barmode = 'group'
               )
g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log')
                    )

g_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart. 

<img src=https://imgur.com/uVsECT3.png>
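
The medians that the box plot displays can also be computed directly. A minimal sketch on made-up install numbers (the real figures will differ):

```python
import pandas as pd

# Toy data (made up): installs for three free and three paid apps
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [1000, 100000, 500000, 100, 5000, 10000],
})

# Median installs for each app type
median_installs = toy.groupby('Type')['Installs'].median()

# How many times larger the free median is than the paid median
ratio = median_installs['Free'] / median_installs['Paid']
print(median_installs)
print(f'Free median is {ratio:.0f}x the paid median')
```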


In [170]:
box = px.box(df_apps_cleaner,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')
 
box.update_layout(yaxis=dict(type='log'))
 
box.show()

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 
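
Instead of reading the medians off the hover text, they can be checked against the $30,000 development cost in pandas. The sketch below uses made-up revenue figures, and `DEV_COST` is just a name for the figure quoted in the challenge.

```python
import pandas as pd

DEV_COST = 30_000  # development cost quoted in the challenge text

# Toy revenue estimates (made up) for paid apps in two categories
toy_paid = pd.DataFrame({
    'Category': ['TOOLS', 'TOOLS', 'TOOLS',
                 'PHOTOGRAPHY', 'PHOTOGRAPHY', 'PHOTOGRAPHY'],
    'Revenue_Estimate': [500.0, 4000.0, 90000.0,
                         10000.0, 40000.0, 250000.0],
})

# Median revenue per category
median_revenue = toy_paid.groupby('Category')['Revenue_Estimate'].median()

# True where the median app earns at least the development cost
recoups = median_revenue >= DEV_COST
print(median_revenue)
print(recoups)
```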

In [172]:
df_paid_apps = df_apps_cleaner[df_apps_cleaner['Type'] == 'Paid']
box = px.box(df_paid_apps, 
             x='Category', 
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')
 
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))
 
 
box.show()

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.
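
The first question can be answered with a single `.median()` call on the `Price` column. A minimal sketch on made-up prices, which also shows the per-category maximum that a 'max descending' axis sorts by:

```python
import pandas as pd

# Toy prices (made up) for five paid apps
toy_paid = pd.DataFrame({
    'Category': ['MEDICAL', 'MEDICAL', 'GAME', 'GAME', 'FINANCE'],
    'Price': [24.99, 2.99, 0.99, 4.99, 399.99],
})

# Median price across all paid apps
median_price = toy_paid['Price'].median()

# Categories ordered by their most expensive app, highest first
max_price_order = (toy_paid.groupby('Category')['Price']
                   .max()
                   .sort_values(ascending=False)
                   .index.tolist())
print(median_price)
print(max_price_order)
```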

In [173]:
box = px.box(df_paid_apps,
             x='Category',
             y="Price",
             title='Price per Category')
 
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))
 
box.show()