# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [2]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Notebook Presentation

In [3]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [4]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [5]:
print(df_apps.shape)

(10841, 12)


In [6]:
print(df_apps.columns)

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')


In [7]:
print(df_apps.sample(n=5))

                         App       Category  Rating  Reviews  Size_MBs  \
8481              GUN ZOMBIE           GAME    4.40   243121     38.00   
6603  Local weather Forecast        WEATHER    4.40     5482     12.00   
5832               EI Mobile        FINANCE    3.80     4231     82.00   
9045      Puffin Web Browser  COMMUNICATION    4.30   541389      3.50   
3757           DL Calculator         SPORTS    4.50      177      1.10   

        Installs  Type Price Content_Rating         Genres  \
8481   5,000,000  Free     0           Teen         Arcade   
6603   1,000,000  Free     0       Everyone        Weather   
5832     100,000  Free     0       Everyone        Finance   
9045  10,000,000  Free     0       Everyone  Communication   
3757      10,000  Free     0       Everyone         Sports   

            Last_Updated   Android_Ver  
8481    October 17, 2014    2.3 and up  
6603      August 7, 2018  4.0.3 and up  
5832       July 19, 2018    4.2 and up  
9045        July 9

### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns. 

In [8]:
# columns= already names the axis, so a separate axis argument is redundant
df_apps.drop(columns=["Last_Updated", "Android_Ver"], inplace=True)

### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows. 

In [9]:
print(df_apps.isna().value_counts())

App    Category  Rating  Reviews  Size_MBs  Installs  Type   Price  Content_Rating  Genres
False  False     False   False    False     False     False  False  False           False     9367
                 True    False    False     False     False  False  False           False     1473
                                                      True   False  False           False        1
dtype: int64


In [10]:
df_apps_clean = df_apps.dropna()
print(df_apps_clean.isna().value_counts())
print(df_apps_clean.shape)

App    Category  Rating  Reviews  Size_MBs  Installs  Type   Price  Content_Rating  Genres
False  False     False   False    False     False     False  False  False           False     9367
dtype: int64
(9367, 10)
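
The printout above counts NaN patterns across every column at once. As a minimal sketch of a more direct route, the snippet below counts missing values in `Rating` only and drops rows on that column alone. The `toy` DataFrame and its values are made up for illustration and are not taken from `apps.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for apps.csv (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, None, 3.9, None],
    "Type": ["Free", "Free", None, "Paid"],
})

# Count missing ratings directly instead of reading them off a value_counts() table
missing_ratings = toy["Rating"].isna().sum()
print(missing_ratings)

# Drop only the rows whose Rating is missing; NaN values in other columns are kept
toy_clean = toy.dropna(subset=["Rating"])
print(toy_clean.shape)
```

Note that `dropna()` without `subset` (as used in the cell above) also removes rows that are missing values in any other column.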


### Find and Remove Duplicates

**Challenge**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`. 


In [11]:
print(df_apps_clean.duplicated().value_counts())

False    8891
True      476
dtype: int64


In [12]:
print(df_apps_clean[df_apps_clean["App"] == "Instagram"])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10808  Instagram   SOCIAL    4.50  66577446      5.30  1,000,000,000  Free   
10809  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10810  Instagram   SOCIAL    4.50  66509917      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
10808     0           Teen  Social  
10809     0           Teen  Social  
10810     0           Teen  Social  


In [13]:
# drop_duplicates() with no arguments is too strict: only rows that are identical in EVERY column are removed
df_apps_cleaner = df_apps_clean.drop_duplicates()

In [14]:
print(df_apps_cleaner[df_apps_cleaner["App"] == "Instagram"].duplicated().value_counts())

False    3
dtype: int64


In [15]:
print(df_apps_cleaner[df_apps_cleaner["App"] == "Instagram"])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10808  Instagram   SOCIAL    4.50  66577446      5.30  1,000,000,000  Free   
10810  Instagram   SOCIAL    4.50  66509917      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
10808     0           Teen  Social  
10810     0           Teen  Social  


In [16]:
# Passing subset= compares only the listed columns, so entries that differ only in their review count are dropped too
df_apps_cleaner = df_apps_clean.drop_duplicates(subset=["App", "Type", "Price"])

In [17]:
print(df_apps_cleaner[df_apps_cleaner["App"] == "Instagram"])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
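
The difference between the two calls can be reproduced on a few rows. This is a small sketch with made-up data (the `toy` frame and its review counts are hypothetical); it only illustrates how `subset` changes what counts as a duplicate.

```python
import pandas as pd

# Hypothetical rows: three "Instagram" entries, two of them fully identical
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Maps"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [100, 100, 105, 50],
})

# Strict: a row is a duplicate only if every column matches an earlier row
strict = toy.drop_duplicates()

# Keyed: a row is a duplicate if App, Type and Price match an earlier row
by_key = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(len(strict), len(by_key))
```

By default `drop_duplicates()` keeps the first occurrence (`keep="first"`), so which review count survives depends on row order.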


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [18]:
# Relying on ratings alone is misleading: an app with a single 5-star rating outranks a 4.9-star app with millions of ratings
print(df_apps_cleaner.sort_values(ascending=False, by="Rating").head())

                      App     Category  Rating  Reviews  Size_MBs Installs  \
21    KBA-EZ Health Guide      MEDICAL    5.00        4     25.00        1   
1230         Sway Medical      MEDICAL    5.00        3     22.00      100   
1227    AJ Men's Grooming    LIFESTYLE    5.00        2     22.00      100   
1224       FK Dedinje BGD       SPORTS    5.00       36      2.60      100   
1223      CB VIDEO VISION  PHOTOGRAPHY    5.00       13      2.60      100   

      Type Price Content_Rating       Genres  
21    Free     0       Everyone      Medical  
1230  Free     0       Everyone      Medical  
1227  Free     0       Everyone    Lifestyle  
1224  Free     0       Everyone       Sports  
1223  Free     0       Everyone  Photography  
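
One way to soften the problem described above is to require a minimum number of reviews before ranking by rating. The sketch below assumes a cut-off of 1,000 reviews; that threshold and the `toy` data are illustrative assumptions, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical apps: two with perfect ratings but almost no reviews
toy = pd.DataFrame({
    "App": ["Tiny", "Small", "Popular", "Huge"],
    "Rating": [5.0, 5.0, 4.7, 4.5],
    "Reviews": [2, 13, 250_000, 66_000_000],
})

MIN_REVIEWS = 1_000  # assumed cut-off; tune it to the question being asked

# Keep only apps with enough reviews for the rating to be meaningful
trusted = toy[toy["Reviews"] >= MIN_REVIEWS]

# Rank the remaining apps by rating
top_rated = trusted.sort_values("Rating", ascending=False)
print(top_rated)
```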


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please? 

In [19]:
print(df_apps_cleaner.sort_values(ascending=False, by="Size_MBs").head())

                                  App            Category  Rating  Reviews  \
9942   Talking Babsy Baby: Baby Games           LIFESTYLE    4.00   140995   
10687          Hungry Shark Evolution                GAME    4.50  6074334   
9943            Miami crime simulator                GAME    4.00   254518   
9944     Gangster Town: Vice District              FAMILY    4.30    65146   
3144                       Vi Trainer  HEALTH_AND_FITNESS    3.60      124   

       Size_MBs     Installs  Type Price Content_Rating  \
9942     100.00   10,000,000  Free     0       Everyone   
10687    100.00  100,000,000  Free     0           Teen   
9943     100.00   10,000,000  Free     0     Mature 17+   
9944     100.00   10,000,000  Free     0     Mature 17+   
3144     100.00        5,000  Free     0       Everyone   

                       Genres  
9942   Lifestyle;Pretend Play  
10687                  Arcade  
9943                   Action  
9944               Simulation  
3144         Hea

# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [20]:
print(df_apps_cleaner.sort_values(ascending=False, by="Reviews").head())

                                            App       Category  Rating  \
10805                                  Facebook         SOCIAL    4.10   
10785                        WhatsApp Messenger  COMMUNICATION    4.40   
10806                                 Instagram         SOCIAL    4.50   
10784  Messenger – Text and Video Chat for Free  COMMUNICATION    4.00   
10650                            Clash of Clans           GAME    4.60   

        Reviews  Size_MBs       Installs  Type Price Content_Rating  \
10805  78158306      5.30  1,000,000,000  Free     0           Teen   
10785  69119316      3.50  1,000,000,000  Free     0       Everyone   
10806  66577313      5.30  1,000,000,000  Free     0           Teen   
10784  56642847      3.50  1,000,000,000  Free     0       Everyone   
10650  44891723     98.00    100,000,000  Free     0   Everyone 10+   

              Genres  
10805         Social  
10785  Communication  
10806         Social  
10784  Communication  
10650       S

In [21]:
print(df_apps_cleaner.sort_values(by="Reviews", ascending=False).head(n=50)["Type"] == "Free")

10805    True
10785    True
10806    True
10784    True
10650    True
10744    True
10835    True
10828    True
10746    True
10584    True
10763    True
10770    True
10735    True
10489    True
10731    True
10594    True
10302    True
10354    True
10549    True
10757    True
10721    True
10578    True
10813    True
10724    True
10717    True
10792    True
10628    True
10388    True
10694    True
10695    True
10644    True
10696    True
10660    True
10786    True
10817    True
10672    True
10734    True
10649    True
10699    True
10322    True
10396    True
10777    True
10822    True
10359    True
10711    True
10389    True
10676    True
10576    True
10461    True
10502    True
Name: Type, dtype: bool
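
Scanning a 50-row boolean printout works, but the same question can be answered with a single number. A minimal sketch on made-up data (the `toy` frame is hypothetical, and it uses the top 3 instead of the top 50 to stay short):

```python
import pandas as pd

# Hypothetical apps with review counts and a Free/Paid flag
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Reviews": [900, 800, 700, 600, 10],
    "Type": ["Free", "Free", "Paid", "Free", "Paid"],
})

# nlargest() returns the n rows with the highest values in the given column
top = toy.nlargest(3, "Reviews")

# Summing a boolean Series counts the True values
paid_in_top = (top["Type"] == "Paid").sum()
print(paid_in_top)
```

Applied to the notebook's data, the equivalent would be `nlargest(50, "Reviews")` on `df_apps_cleaner`.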


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [22]:
content_ratings = df_apps_cleaner["Content_Rating"].value_counts()

In [23]:
print(content_ratings)

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64


In [24]:
# names= supplies the slice labels; the labels= parameter is meant for renaming columns, so it is omitted here
figure = px.pie(names=content_ratings.index, values=content_ratings.values, title="Content Ratings of Google Play Store Apps", hole=0.5)
figure.update_traces(textposition="outside", textinfo="percent", textfont_size=15)
figure.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first. 

In [25]:
print(df_apps_cleaner["Installs"].value_counts())

1,000,000        1417
100,000          1096
10,000            988
10,000,000        933
1,000             698
5,000,000         607
500,000           504
50,000            457
5,000             425
100               303
50,000,000        202
500               199
100,000,000       189
10                 69
50                 56
500,000,000        24
1,000,000,000      20
5                   9
1                   3
Name: Installs, dtype: int64


In [26]:
print(type(df_apps_cleaner["Installs"].values[0]))

<class 'str'>


In [27]:
df_apps_cleaner["Installs"] = df_apps_cleaner["Installs"].str.replace(",", "")
df_apps_cleaner["Installs"] = pd.to_numeric(df_apps_cleaner["Installs"])



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [28]:
df_apps_cleaner[["App", "Installs"]].groupby("Installs").count()

          App
Installs     
1           3
5           9
10         69
50         56
100       303
500       199
1000      698
5000      425
10000     988
50000     457
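
The two-step conversion, and one common way to avoid the copy warning shown above, can be sketched on a few rows. The `toy` data is made up; calling `.copy()` on the filtered frame is a suggestion for making the result an independent DataFrame, not what the notebook itself does.

```python
import pandas as pd

# Hypothetical install counts stored as strings with thousands separators
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": ["1,000", "1,000,000", "5", None],
})

# .copy() makes the filtered result an independent DataFrame, so the
# column assignment below is not treated as writing to a slice
clean = toy.dropna().copy()

# Step 1: strip the commas. Step 2: convert the strings to numbers.
clean["Installs"] = pd.to_numeric(clean["Installs"].str.replace(",", "", regex=False))

print(clean["Installs"].dtype)
print(clean["Installs"].value_counts().sort_index())
```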


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [29]:
print(df_apps_cleaner["Price"].head())

21        0
28    $1.49
47    $0.99
82        0
99        0
Name: Price, dtype: object


In [30]:
# regex=False treats "$" as a literal character rather than as a regex end-of-string anchor
df_apps_cleaner["Price"] = df_apps_cleaner["Price"].str.replace("$", "", regex=False)
df_apps_cleaner["Price"] = pd.to_numeric(df_apps_cleaner["Price"])



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [31]:
print(df_apps_cleaner.sort_values(by="Price", ascending=False).head(n=20))

                                 App   Category  Rating  Reviews  Size_MBs  \
3946        I'm Rich - Trump Edition  LIFESTYLE    3.60      275      7.30   
2461              I AM RICH PRO PLUS    FINANCE    4.00       36     41.00   
4606               I Am Rich Premium    FINANCE    4.10     1867      4.70   
3145              I am rich(premium)    FINANCE    3.50      472      0.94   
3554                      💎 I'm rich  LIFESTYLE    3.80      718     26.00   
5765                       I am rich  LIFESTYLE    3.80     3547      1.80   
1946  I am rich (Most expensive app)    FINANCE    4.10      129      2.70   
2775                   I Am Rich Pro     FAMILY    4.40      201      2.70   
3221                  I am Rich Plus     FAMILY    4.00      856      8.70   
3114                       I am Rich    FINANCE    4.30      180      3.80   
1331          most expensive app (H)     FAMILY    4.30        6      1.50   
2394                      I am Rich!    FINANCE    3.80       93

### The Most Expensive Apps Under $250

In [32]:
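# Blank out every row priced above $250, then drop the resulting all-NaN rows.
# Side effect: assigning None turns the integer columns (Reviews, Installs) into floats.
df_apps_cleaner[df_apps_cleaner["Price"] > 250] = None
df_apps_cleaner.dropna(inplace=True)

# Simpler alternative that keeps the integer dtypes:
# df_apps_cleaner = df_apps_cleaner[df_apps_cleaner["Price"] <= 250]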



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



### Highest Grossing Paid Apps (ballpark estimate)

In [33]:
# Ballpark revenue: price multiplied by the number of installs
df_apps_cleaner.loc[:, "Revenue_Estimate"] = df_apps_cleaner["Price"] * df_apps_cleaner["Installs"]



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [34]:
df_apps_cleaner.sort_values(by="Revenue_Estimate", ascending=False).head(10)

                                App     Category  Rating    Reviews  Size_MBs  \
9220                      Minecraft       FAMILY     4.5  2376564.0      19.0
8825                  Hitman Sniper         GAME     4.6   408292.0      29.0
7151  Grand Theft Auto: San Andreas         GAME     4.4   348962.0      26.0
7477            Facetune - For Free  PHOTOGRAPHY     4.4    49553.0      48.0
7977        Sleep as Android Unlock    LIFESTYLE     4.5    23966.0      0.85
6594            DraStic DS Emulator         GAME     4.6    87766.0      12.0
6082                   Weather Live      WEATHER     4.5    76593.0      4.75
7954                    Bloons TD 5       FAMILY     4.6   190086.0      94.0
7633        Five Nights at Freddy's         GAME     4.6   100805.0      50.0
6746     Card Wars - Adventure Time       FAMILY     4.3   129603.0      23.0

        Installs  Type  Price  Content_Rating                     Genres  Revenue_Estimate
9220  10000000.0  Paid   6.99    Everyone 10+  Arcade;Action & Adventure        69900000.0
8825  10000000.0  Paid   0.99      Mature 17+                     Action         9900000.0
7151   1000000.0  Paid   6.99      Mature 17+                     Action         6990000.0
7477   1000000.0  Paid   5.99        Everyone                Photography         5990000.0
7977   1000000.0  Paid   5.99        Everyone                  Lifestyle         5990000.0
6594   1000000.0  Paid   4.99        Everyone                     Action         4990000.0
6082    500000.0  Paid   5.99        Everyone                    Weather         2995000.0
7954   1000000.0  Paid   2.99        Everyone                   Strategy         2990000.0
7633   1000000.0  Paid   2.99            Teen                     Action         2990000.0
6746   1000000.0  Paid   2.99    Everyone 10+    Card;Action & Adventure         2990000.0
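
The whole pipeline in this section (clean the price, filter out the junk, estimate revenue, count games in the top results) can be sketched end to end on a few rows. Everything in `toy` is made up, and the top 3 stands in for the top 10 to keep the example short.

```python
import pandas as pd

# Hypothetical paid apps; prices are strings with a dollar sign, as in the raw data
toy = pd.DataFrame({
    "App": ["Blocks", "Sniper", "Editor", "Rich"],
    "Category": ["GAME", "GAME", "PHOTOGRAPHY", "FINANCE"],
    "Price": ["$6.99", "$0.99", "$5.99", "$399.99"],
    "Installs": [10_000_000, 10_000_000, 1_000_000, 10_000],
})

# Strip the literal "$" and convert to a numeric dtype
toy["Price"] = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False))

# Keep apps that cost $250 or less; .copy() gives an independent DataFrame
toy = toy[toy["Price"] <= 250].copy()

# Ballpark estimate: assumes every install was a sale at the listed price
toy["Revenue_Estimate"] = toy["Price"] * toy["Installs"]

top = toy.nlargest(3, "Revenue_Estimate")
games_in_top = (top["Category"] == "GAME").sum()
print(top[["App", "Revenue_Estimate"]])
print(games_in_top)
```

The estimate is deliberately crude: it ignores price changes, refunds, and in-app purchases.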


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [35]:
print(df_apps_cleaner["Category"].nunique())

33


In [36]:
top10_category = df_apps_cleaner.Category.value_counts()[:10]
print(top10_category)

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64


In [37]:
category_installs = df_apps_cleaner.groupby("Category").agg({"Installs": pd.Series.sum})
category_installs.sort_values("Installs", ascending=True, inplace=True)
print(category_installs)

                             Installs
Category                             
EVENTS                  15,949,410.00
BEAUTY                  26,916,200.00
PARENTING               31,116,110.00
MEDICAL                 39,162,676.00
COMICS                  44,931,100.00
LIBRARIES_AND_DEMO      52,083,000.00
AUTO_AND_VEHICLES       53,129,800.00
HOUSE_AND_HOME          97,082,000.00
ART_AND_DESIGN         114,233,100.00
DATING                 140,912,410.00
FOOD_AND_DRINK         211,677,750.00
EDUCATION              352,852,000.00
WEATHER                361,096,500.00
FINANCE                455,249,400.00
MAPS_AND_NAVIGATION    503,267,560.00
LIFESTYLE              503,611,120.00
BUSINESS               692,018,120.00
SPORTS               1,096,431,465.00
HEALTH_AND_FITNESS   1,134,006,220.00
SHOPPING             1,400,331,540.00
PERSONALIZATION      1,532,352,930.00
BOOKS_AND_REFERENCE  1,665,791,655.00
ENTERTAINMENT        2,113,660,000.00
NEWS_AND_MAGAZINES   2,369,110,650.00
TRAVEL_AND_L

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [38]:
px.bar(x=top10_category.index, y=top10_category.values)

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [39]:
h_bar = px.bar(x=category_installs.Installs, y=category_installs.index, orientation="h")
h_bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this. 

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 

In [40]:
category_concentration = df_apps_cleaner.groupby("Category").agg({"App": pd.Series.count})
print(category_concentration)

                      App
Category                 
ART_AND_DESIGN         61
AUTO_AND_VEHICLES      73
BEAUTY                 42
BOOKS_AND_REFERENCE   169
BUSINESS              262
COMICS                 54
COMMUNICATION         257
DATING                134
EDUCATION             118
ENTERTAINMENT         102
EVENTS                 45
FAMILY               1606
FINANCE               296
FOOD_AND_DRINK         94
GAME                  910
HEALTH_AND_FITNESS    243
HOUSE_AND_HOME         62
LIBRARIES_AND_DEMO     64
LIFESTYLE             297
MAPS_AND_NAVIGATION   118
MEDICAL               292
NEWS_AND_MAGAZINES    204
PARENTING              50
PERSONALIZATION       298
PHOTOGRAPHY           263
PRODUCTIVITY          301
SHOPPING              180
SOCIAL                203
SPORTS                260
TOOLS                 719
TRAVEL_AND_LOCAL      187
VIDEO_PLAYERS         148
WEATHER                72


In [41]:
category_concentration_merged_df = pd.merge(category_concentration, category_installs, on="Category", how="inner")
print(category_concentration_merged_df.shape)
print(category_concentration_merged_df.sort_values("Installs", ascending=False))

(33, 2)
                      App          Installs
Category                                   
GAME                  910 13,858,762,717.00
COMMUNICATION         257 11,039,241,530.00
TOOLS                 719  8,099,724,500.00
PRODUCTIVITY          301  5,788,070,180.00
SOCIAL                203  5,487,841,475.00
PHOTOGRAPHY           263  4,649,143,130.00
FAMILY               1606  4,437,554,490.00
VIDEO_PLAYERS         148  3,916,897,200.00
TRAVEL_AND_LOCAL      187  2,894,859,300.00
NEWS_AND_MAGAZINES    204  2,369,110,650.00
ENTERTAINMENT         102  2,113,660,000.00
BOOKS_AND_REFERENCE   169  1,665,791,655.00
PERSONALIZATION       298  1,532,352,930.00
SHOPPING              180  1,400,331,540.00
HEALTH_AND_FITNESS    243  1,134,006,220.00
SPORTS                260  1,096,431,465.00
BUSINESS              262    692,018,120.00
LIFESTYLE             297    503,611,120.00
MAPS_AND_NAVIGATION   118    503,267,560.00
FINANCE               296    455,249,400.00
WEATHER                7

In [42]:
scatter = px.scatter(category_concentration_merged_df, x="App", y="Installs", title="Category Concentration", size="App", hover_name=category_concentration_merged_df.index, color="Installs")
scatter.update_layout(xaxis_title="Number of Apps (Lower = More Concentrated)", yaxis=dict(type="log"))
scatter.show()
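
The app count and the install total per category can also be computed in a single `groupby` using pandas named aggregation, which avoids the separate merge step. This is an alternative sketch on made-up data, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical apps in two categories
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": [100, 1_000, 10, 50],
})

# Named aggregation: output_column=(input_column, aggregation)
cat = toy.groupby("Category").agg(
    App=("App", "count"),
    Installs=("Installs", "sum"),
)
print(cat)
```

The result has the same shape as `category_concentration_merged_df`: one row per category, with `App` and `Installs` columns.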

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the string .split() method and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html). 


In [43]:
stack = df_apps_cleaner.Genres.str.split(";", expand=True).stack()
print(df_apps_cleaner.Genres)
print(stack)

21                       Medical
28                        Arcade
47                        Arcade
82                        Arcade
99                       Medical
                  ...           
10824               Productivity
10828    Video Players & Editors
10829    Video Players & Editors
10831           News & Magazines
10835                     Arcade
Name: Genres, Length: 8184, dtype: object
21     0                    Medical
28     0                     Arcade
47     0                     Arcade
82     0                     Arcade
99     0                    Medical
                     ...           
10824  0               Productivity
10828  0    Video Players & Editors
10829  0    Video Players & Editors
10831  0           News & Magazines
10835  0                     Arcade
Length: 8564, dtype: object


In [44]:
num_genres = stack.value_counts()
print(num_genres)

Tools                      719
Education                  587
Entertainment              498
Action                     304
Productivity               301
Personalization            298
Lifestyle                  298
Finance                    296
Medical                    292
Sports                     270
Photography                263
Business                   262
Communication              258
Health & Fitness           245
Casual                     216
News & Magazines           204
Social                     203
Simulation                 200
Travel & Local             187
Arcade                     185
Shopping                   180
Books & Reference          171
Video Players & Editors    150
Dating                     134
Puzzle                     124
Maps & Navigation          118
Role Playing               111
Racing                     103
Action & Adventure          96
Strategy                    95
Food & Drink                94
Educational                 93
Adventur
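
As an alternative to `split(expand=True)` followed by `stack()`, the nested values can be flattened with `Series.explode()`, which turns each list element into its own row. A short sketch on made-up genre strings:

```python
import pandas as pd

# Hypothetical Genres column: some apps list two genres separated by ";"
genres = pd.Series([
    "Arcade",
    "Arcade;Action & Adventure",
    "Medical",
    "Card;Action & Adventure",
])

# split() yields a list per row; explode() gives every list element its own row
counts = genres.str.split(";").explode().value_counts()
print(counts)
```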

# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/). 

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

In [45]:
top_genres_plot = px.bar(x=num_genres.index[:15], y=num_genres.values[:15], color=num_genres.values[:15], color_continuous_scale="Agsunset")
top_genres_plot.update_layout(yaxis_title="Number of Apps", xaxis_title="Genre", title="Top Genres", coloraxis_showscale=False)
top_genres_plot.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [46]:
print(df_apps_cleaner.Type.value_counts())

Free    7595
Paid     589
Name: Type, dtype: int64


In [47]:
df_free_vs_paid = df_apps_cleaner.groupby(["Category", "Type"], as_index=False).agg({"App":pd.Series.count})
print(df_free_vs_paid.head())

            Category  Type  App
0     ART_AND_DESIGN  Free   58
1     ART_AND_DESIGN  Paid    3
2  AUTO_AND_VEHICLES  Free   72
3  AUTO_AND_VEHICLES  Paid    1
4             BEAUTY  Free   42


**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category. 

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value). 

In [48]:
free_df = df_free_vs_paid[df_free_vs_paid["Type"] == "Free"]
paid_df = df_free_vs_paid[df_free_vs_paid["Type"] == "Paid"]
free_vs_paid_plot = go.Figure()
free_vs_paid_plot.add_trace(go.Bar(x=free_df["Category"], y=free_df["App"], name="Free"))
free_vs_paid_plot.add_trace(go.Bar(x=paid_df["Category"], y=paid_df["App"], name="Paid"))
free_vs_paid_plot.update_layout(title_text="Free vs Paid Apps by Category", yaxis_title="Number of Apps", xaxis_title="Category", yaxis=dict(type="log"))
free_vs_paid_plot.update_xaxes(categoryorder="total descending")
free_vs_paid_plot.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart. 

<img src=https://imgur.com/uVsECT3.png>


In [49]:
df_free_paid_installs = df_apps_cleaner.groupby(["Installs", "Type"], as_index=False).agg({"App":pd.Series.count})
print(df_free_paid_installs.head())

   Installs  Type  App
0      1.00  Free    1
1      1.00  Paid    2
2      5.00  Free    9
3     10.00  Free   51
4     10.00  Paid   18


In [50]:
fig = px.box(df_apps_cleaner, x="Type", y="Installs", points="all", color="Type", notched=True, title="How Many Downloads are Paid Apps Giving Up?")
fig.update_layout(yaxis=dict(type="log"))
fig.show()
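
The medians shown in the box plot can also be computed directly, which makes the comparison between free and paid apps explicit. A minimal sketch on made-up install counts:

```python
import pandas as pd

# Hypothetical install counts for free and paid apps
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000, 100_000, 10_000_000, 100, 1_000, 10_000],
})

# Median installs per type; the median is robust to a few blockbuster apps
medians = toy.groupby("Type")["Installs"].median()
print(medians)

# How many times larger the free median is than the paid median
ratio = medians["Free"] / medians["Paid"]
print(ratio)
```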

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 

In [56]:
paid_df2 = df_apps_cleaner[df_apps_cleaner["Type"] == "Paid"]
fig2 = px.box(paid_df2, x="Category", y="Revenue_Estimate", title="How Much Can Paid Apps Earn?")
fig2.update_layout(yaxis=dict(type="log"), yaxis_title="Paid App Ballpark Revenue")
fig2.update_xaxes(categoryorder="min ascending")
fig2.show()

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.

In [59]:
fig3 = px.box(paid_df2, x="Category", y="Price", title="App Pricing by Category")
fig3.update_layout(yaxis=dict(type="log"), yaxis_title="App Price")
fig3.update_xaxes(categoryorder="max descending")
fig3.show()
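
The first part of the challenge, the median price of a paid app, needs no chart at all. The sketch below computes it overall and per category on made-up prices; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical paid apps with numeric prices
toy = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "GAME", "GAME", "GAME"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Paid"],
    "Price": [9.99, 24.99, 0.99, 2.99, 4.99],
})

paid = toy[toy["Type"] == "Paid"]

# Median price across all paid apps
overall_median = paid["Price"].median()
print(overall_median)

# Median price per category, most expensive category first
by_category = paid.groupby("Category")["Price"].median().sort_values(ascending=False)
print(by_category)
```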