# INTRODUCTION

Have you ever thought about building your own iOS or Android app? If so, you have probably wondered how things work in the app stores. This notebook replicates some of the app store analytics provided by companies like App Annie or Sensor Tower, which help inform development and app marketing strategies for many companies. This stuff is BIG business!

### IMPORTING ALL NECESSARY LIBRARIES

In [126]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

### LOAD THE DATA AND GET ITS INFORMATION

In [127]:
app_data = pd.read_csv('sample_data/apps.csv')

In [128]:
# view 3 random rows from app_data
app_data.sample(3)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
7617,Justice League Action Run,GAME,4.3,22333,9.7,1000000,Free,0,Everyone 10+,Action,"August 7, 2018",4.3 and up
543,Guide to Nikon Df,PHOTOGRAPHY,,1,0.647461,10,Paid,$29.99,Everyone,Photography,"February 18, 2014",4.0.3 and up
10477,TripAdvisor Hotels Flights Restaurants Attract...,TRAVEL_AND_LOCAL,4.4,1162331,12.0,100000000,Free,0,Everyone,Travel & Local,"August 4, 2018",Varies with device


#### Check For The Data Information

In [129]:
# get the data information
app_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size_MBs        10841 non-null  float64
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content_Rating  10841 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last_Updated    10841 non-null  object 
 11  Android_Ver     10839 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 1016.5+ KB


#### Check The Statistical Summary of The Data

In [130]:
app_data.describe()

Unnamed: 0,Rating,Reviews,Size_MBs
count,9367.0,10841.0,10841.0
mean,4.191513,444111.9,19.774147
std,0.515735,2927629.0,21.404354
min,1.0,0.0,0.008301
25%,4.0,38.0,4.9
50%,4.3,2094.0,11.0
75%,4.5,54768.0,27.0
max,5.0,78158310.0,100.0


#### Check For Missing Values In The Data

In [131]:
# how many missing values are there in each column
print(f'The number of missing records are: \n{app_data.isna().sum()}\n')

# view the missing records
app_data[app_data.isna().values.any(axis=1)].head(3)

The number of missing records are: 
App                  0
Category             0
Rating            1474
Reviews              0
Size_MBs             0
Installs             0
Type                 1
Price                0
Content_Rating       0
Genres               0
Last_Updated         0
Android_Ver          2
dtype: int64



Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up


`Rating` has 1474 missing values, `Android_Ver` has 2, and `Type` has 1. We will remove the records with a missing `Rating` or `Type`; `Android_Ver` is dropped as a column during cleaning.
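
As a minimal sketch of a more targeted alternative (the toy frame below is made up for illustration and is not taken from `apps.csv`), `dropna` accepts a `subset` argument, so a row is dropped only when one of the listed columns is missing:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the gaps seen above
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.9, 4.1],
    "Type": ["Free", "Paid", None, "Free"],
    "Android_Ver": ["4.1 and up", "5.0 and up", "4.0 and up", None],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Drop a row only when Rating or Type is missing;
# a gap in Android_Ver alone does not remove the row
cleaned = toy.dropna(subset=["Rating", "Type"])

print(missing_per_column)
print(cleaned)
```

This keeps app "D" even though its `Android_Ver` is missing, which a plain `dropna()` on the full frame would not.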

#### Check For Duplicates In The Data

In [132]:
# print the number of duplicated records from the data
print(f'The number of duplicated records are: {app_data.duplicated().sum()}\n')

The number of duplicated records are: 483



In [133]:
# print out the first few duplicated records
duplicated_data = app_data[app_data.duplicated()]
duplicated_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
190,RT 516 VET,MEDICAL,,0,29.0,10,Free,0,Everyone,Medical,"July 13, 2018",4.0.3 and up
741,Penn State Health OnDemand,MEDICAL,,0,40.0,50,Free,0,Everyone,Medical,"July 24, 2018",4.0.3 and up
803,Maricopa AH,MEDICAL,,0,29.0,100,Free,0,Everyone,Medical,"July 16, 2018",4.0.3 and up
914,Breastfeeding Tracker Baby Log,MEDICAL,,6,23.0,100,Free,0,Everyone,Medical,"July 20, 2018",5.0 and up
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical,"June 6, 2018",4.1 and up


In [134]:
# print duplicated data for the Instagram app
duplicated_data[duplicated_data.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device


#### CHECK FOR TYPE 

In [135]:
app_data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs           object
Type               object
Price              object
Content_Rating     object
Genres             object
Last_Updated       object
Android_Ver        object
dtype: object

The columns `Installs` and `Price` should be numerical columns, not object columns.

### DATA CLEANING

* Remove Missing Records and Redundant Column.
* Remove Duplicated Data

In [136]:
# remove the columns: Last_Updated and Android_Ver
app_data.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)

In [137]:
# remove the missing records
app_data.dropna(axis=0, inplace=True)

In [138]:
# confirm whether there are still missing values:
app_data.isna().values.sum()

0

In [139]:
# drop the duplicated records based on specific columns holding the same values
# we will drop duplicates on the columns: App, Type and Price
app_data.drop_duplicates(subset=['App', 'Type', 'Price'], inplace=True)

In [140]:
# confirm whether there are still duplicates
app_data.duplicated().sum()

0
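
The difference between a full-row duplicate check and a check on a subset of columns can be sketched on toy data. The rows below are hypothetical (the second review count in particular is invented); the point is only that two snapshots of the same app can differ in a volatile column such as `Reviews`, in which case a full-row check misses them:

```python
import pandas as pd

# Hypothetical rows: the same app captured three times, once with a changed review count
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Maps"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [66577313, 66577313, 66577446, 100],
})

# Full-row check: only the exact copy (second row) is flagged
full_row_dupes = int(toy.duplicated().sum())

# Subset check: every repeat of the same App/Type/Price combination is flagged
subset_dupes = int(toy.duplicated(subset=["App", "Type", "Price"]).sum())

# Keeps the first occurrence of each App/Type/Price combination
deduped = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(full_row_dupes, subset_dupes, len(deduped))
```

Passing `keep=False` to `duplicated` would flag every copy, including the first, which is handy when inspecting the repeats side by side.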

In [141]:
# change the type of the Price and Installs columns to numerical
# we have to remove the special characters from each column before converting it
app_data.Installs = app_data.Installs.str.replace(',', '').astype(int)

# check for the type again
app_data.Installs.dtype

dtype('int64')

In [142]:
# change the type of Price too
# regex=False treats '$' as a literal character rather than a regex anchor
app_data.Price = app_data.Price.str.replace('$', '', regex=False).astype('float')

# check for the type again
app_data.Price.dtype

dtype('float64')
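
The two conversions can be sketched end to end on a few strings in the same shape as the raw columns (the values are illustrative, not a sample of the real file). `$` is a regex metacharacter that matches the end of a string, so passing `regex=False` makes the literal replacement explicit:

```python
import pandas as pd

# Illustrative raw strings in the same shape as the Installs and Price columns
raw = pd.DataFrame({
    "Installs": ["1,000,000", "10", "100,000,000"],
    "Price": ["0", "$29.99", "$1.49"],
})

# Strip the thousands separators, then cast to integers
installs = raw["Installs"].str.replace(",", "", regex=False).astype("int64")

# Strip the literal dollar sign, then cast to floats
price = raw["Price"].str.replace("$", "", regex=False).astype(float)

print(installs.dtype, price.dtype)
```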

### PRELIMINARY EXPLORATION

In [143]:
# which app has the highest rating? (idxmax returns the first row that holds the maximum)
highest_rated_app = app_data.loc[app_data.Rating.idxmax()].App
print(f'Highest Rated App is: {highest_rated_app}')

Highest Rated App is: KBA-EZ Health Guide
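
Because `idxmax` returns only the first row holding the maximum, a single name can hide ties. A small sketch on made-up ratings shows how to list every app that shares the top value:

```python
import pandas as pd

# Made-up ratings with a three-way tie at the top
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta"],
    "Rating": [5.0, 4.7, 5.0, 5.0],
})

# idxmax picks the first row with the maximum rating
first_top = toy.loc[toy["Rating"].idxmax(), "App"]

# A boolean mask returns every row that shares the maximum
all_top = toy[toy["Rating"] == toy["Rating"].max()]

print(first_top)
print(all_top)
```

The same mask applied to `app_data` would show how many apps share the top rating in the real data.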


In [144]:
# which app is the largest (in MB)?
highest_size_app = app_data.loc[app_data.Size_MBs.idxmax()].App
print(f'The Largest App is: {highest_size_app}')

The Largest App is: Navi Radiography Pro


In [145]:
app_data.sort_values('Size_MBs', ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0.0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0.0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0.0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0.0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0.0,Everyone,Health & Fitness


Here we can see that there appears to be an upper bound of 100 MB on the size of an app.

A quick Google search suggests that this limit is imposed by the Google Play Store itself.

It’s interesting to see that a number of apps hit that limit exactly.

In [146]:
# which app has the highest number of reviews?
highest_reviewed_app = app_data.loc[app_data.Reviews.idxmax()].App
print(f'The Most Reviewed App is: {highest_reviewed_app}')

The Most Reviewed App is: Facebook


In [147]:
app_data.sort_values('Reviews', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0.0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0.0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0.0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0.0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0.0,Everyone 10+,Strategy
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0.0,Everyone,Tools
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0.0,Everyone 10+,Arcade
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0.0,Teen,Video Players & Editors
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0.0,Everyone,Tools
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0.0,Everyone 10+,Strategy


If you look at the number of reviews, you can find the most popular apps on the Google Play Store.

These include the usual suspects: Facebook, WhatsApp, Instagram etc. 

What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app! 
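
A claim like this can be checked directly rather than by eye. The helper below is a hypothetical sketch (the function name and the toy rows are not part of the notebook) that counts the paid apps among the `n` most reviewed:

```python
import pandas as pd

def paid_in_top(df, n):
    """Return the number of paid apps among the n most reviewed apps."""
    top = df.nlargest(n, "Reviews")
    return int((top["Type"] == "Paid").sum())

# Made-up rows for illustration only
toy = pd.DataFrame({
    "App": ["Chat", "Photos", "Runner", "Craft", "Notes"],
    "Reviews": [900, 750, 600, 400, 50],
    "Type": ["Free", "Free", "Free", "Paid", "Free"],
})

print(paid_in_top(toy, 3))
print(paid_in_top(toy, 4))
```

On the notebook's frame, `paid_in_top(app_data, 50)` would be the corresponding check for the top 50.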

### DATA VISUALIZATION

In [148]:
# count the number of apps in each content rating category
content_rating_count = app_data.Content_Rating.value_counts()
content_rating_count

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [149]:
fig = px.pie(names=content_rating_count.index, values=content_rating_count.values,
             title='Content Rating')

fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

In [150]:
# donut chart
fig = px.pie(names=content_rating_count.index, values=content_rating_count.values,
             title='Content Rating', hole=0.6)

fig.update_traces(textposition='inside', textinfo='percent')
fig.show()

#### How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

In [151]:
# check for the types of all the columns again
app_data.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size_MBs          float64
Installs            int64
Type               object
Price             float64
Content_Rating     object
Genres             object
dtype: object

In [152]:
# number of apps with 1B or more installations:
f'The number of apps with at least 1 billion installations is --> {app_data[app_data.Installs >= 1000000000].shape[0]}'

'The number of apps with at least 1 billion installations is --> 20'

In [153]:
# number of apps with just 1 installation:
f'The number of apps with a single installation is --> {app_data[app_data.Installs == 1].shape[0]}'

'The number of apps with a single installation is --> 3'

In [154]:
# check the most expensive apps
app_data.sort_values('Price', ascending=False).reset_index(drop=True).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
1,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
2,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3,I am rich(premium),FINANCE,3.5,472,0.942383,5000,Paid,399.99,Everyone,Finance
4,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
6,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
7,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
8,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
9,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


Judging by the dataframe above, there are apparently 15 "I am Rich" apps in the Google Play Store.

They all cost `$300` or more, which is the main point of the app. The story goes that in 2008, Armin Heinrich released the very first I am Rich app in the iOS App Store for `$999.99`. The app does absolutely nothing. It just displays the picture of a gemstone and can be used to prove to your friends how rich you are. Armin reportedly made a total of eight sales before the app was hastily removed by Apple. Nonetheless, it inspired a bunch of copycats on the Google Play Store, but if you search today, you’ll find all of these apps have disappeared as well. The high installation numbers were likely gamed by making the app available for free at some point to collect reviews and appear more legitimate.

In [155]:
# we will remove all record with prices greater than or equal to $250
app_data = app_data[app_data.Price < 250]

# now print the sorted data based on Price again
app_data.sort_values('Price', ascending=False).reset_index(drop=True).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical
1,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical
2,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle
3,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
4,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical


We can work out the highest grossing paid apps now. All we need to do is multiply the values in the price and the installs column to get the number:

In [156]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
app_data = app_data.assign(Revenue_Estimate=app_data.Installs.mul(app_data.Price))
app_data.sort_values('Revenue_Estimate', ascending=False)[:5]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.851562,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0


The top spot of the highest-grossing paid app goes to … Minecraft at close to $70 million. It’s quite interesting that Minecraft (along with Bloons and Card Wars) is actually listed in the Family category rather than in the Game category. If we include these titles, we see that 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels.
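
The estimate is simply installs multiplied by price, which can be reproduced on a couple of rows. The first two rows reuse the install counts and prices from the table above; the third is a made-up free app. The calculation assumes that every install was a purchase at the listed price, so treat the result as a rough ballpark rather than actual revenue:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Minecraft", "Hitman Sniper", "Some Free App"],  # last row is hypothetical
    "Installs": [10_000_000, 10_000_000, 1_000_000],
    "Price": [6.99, 0.99, 0.0],
})

# Ballpark revenue: assumes every install paid the listed price
toy = toy.assign(Revenue_Estimate=toy["Installs"] * toy["Price"])

# Row with the highest estimate
top = toy.sort_values("Revenue_Estimate", ascending=False).iloc[0]

print(top["App"], round(top["Revenue_Estimate"]))
```

Free apps always come out at zero under this formula, which is why the ranking above only ever contains paid apps.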

Find the number of different app categories:

In [157]:
app_data.Category.nunique()

33

**Which categories contain the most apps?**

In [158]:
# the top 5 categories by number of apps are:
top_categories = app_data.Category.value_counts()
top_categories[:5]

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
Name: Category, dtype: int64

In [159]:
bars = px.bar(x=top_categories[:10].index, y=top_categories[:10].values)
bars.show()

Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

**What matters is not just the total number of apps in the category but how often apps are downloaded in that category?**

In [160]:
category_installs = app_data.groupby('Category').agg({'Installs': 'sum'}).sort_values('Installs', ascending=True)

category_installs

Category,Installs
EVENTS,15949410
BEAUTY,26916200
PARENTING,31116110
MEDICAL,39162676
COMICS,44931100
LIBRARIES_AND_DEMO,52083000
AUTO_AND_VEHICLES,53129800
HOUSE_AND_HOME,97082000
ART_AND_DESIGN,114233100
DATING,140912410


In [178]:
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h')

h_bar.update_layout(xaxis_title='Number of Installs', yaxis_title='App Category')
 
h_bar.show()

Now we see that, measured by installs, Game, Communication and Tools are actually the most popular categories.

If we plot popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. 

**Do few apps have most of the downloads or are the downloads spread out over many apps?**

In [162]:
popular_category_install = app_data.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})

In [163]:
popular_category_install.sort_values('Installs', ascending=False)

Category,App,Installs
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650


In [164]:
scatter = px.scatter(popular_category_install, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=popular_category_install.index,
                    color='Installs')
 
scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))
 
scatter.show()

What we see is that categories like Family, Tools, and Game have many different apps sharing a high number of downloads. In categories like Video Players and Entertainment, by contrast, the downloads are concentrated in very few apps.
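
One crude way to put a number on concentration is the average number of installs per app in each category. This metric is an assumption added here for illustration, not something the scatter plot computes; the counts below are copied from the table above:

```python
import pandas as pd

# App counts and total installs taken from the category table above
summary = pd.DataFrame(
    {
        "App": [910, 257, 1606, 148],
        "Installs": [13858762717, 11039241530, 4437554490, 3916897200],
    },
    index=["GAME", "COMMUNICATION", "FAMILY", "VIDEO_PLAYERS"],
)

# Average installs per app: higher values mean fewer apps share the downloads
summary["Installs_per_App"] = summary["Installs"] / summary["App"]
ranked = summary.sort_values("Installs_per_App", ascending=False)

print(ranked["Installs_per_App"].round())
```

An average hides how the installs are distributed within a category, so it is best read alongside the plot rather than instead of it.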

**Let’s turn our attention to the Genres column. This is quite similar to the categories column but more granular.**

In [165]:
app_data['Genres'].value_counts()

Tools                                718
Entertainment                        467
Education                            429
Productivity                         301
Personalization                      298
                                    ... 
Adventure;Brain Games                  1
Travel & Local;Action & Adventure      1
Art & Design;Pretend Play              1
Music & Audio;Music & Video            1
Lifestyle;Pretend Play                 1
Name: Genres, Length: 114, dtype: int64

It can be seen from the series above that some apps belong to several genres.

In [166]:
# how many unique genres do we have in the data
app_data['Genres'].nunique()

114

If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here. We can see this using .value_counts() and looking at the values that just have a single entry. There we see that the semi-colon (;) separates the genre names.

In [168]:
# splitting the genre on ';'
stack = app_data.Genres.str.split(';', expand=True).stack()
stack.shape

(8564,)

In [169]:
stack.head()

21  0    Medical
28  0     Arcade
47  0     Arcade
82  0     Arcade
99  0    Medical
dtype: object
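
The effect of splitting nested genre strings can be seen on a handful of values. This sketch uses `str.split` followed by `explode`, a swapped-in alternative to the `expand=True` plus `stack()` approach used above that yields the same individual genre names:

```python
import pandas as pd

# A few genre strings in the same shape as the Genres column
genres = pd.Series([
    "Arcade;Action & Adventure",
    "Tools",
    "Arcade",
    "Lifestyle;Pretend Play",
])

# Counting the raw strings treats each combination as its own genre
nested_unique = genres.nunique()

# Split on ';' into lists, then give every list element its own row
split = genres.str.split(";").explode()
split_counts = split.value_counts()

print(nested_unique, split.nunique())
print(split_counts)
```

Here "Arcade" is counted twice after the split, once from the combined string and once on its own.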

In [181]:
# now counting the unique genres
ngenres = stack.value_counts()
ngenres

Tools                      719
Education                  587
Entertainment              498
Action                     304
Productivity               301
Personalization            298
Lifestyle                  298
Finance                    296
Medical                    292
Sports                     270
Photography                263
Business                   262
Communication              258
Health & Fitness           245
Casual                     216
News & Magazines           204
Social                     203
Simulation                 200
Travel & Local             187
Arcade                     185
Shopping                   180
Books & Reference          171
Video Players & Editors    150
Dating                     134
Puzzle                     124
Maps & Navigation          118
Role Playing               111
Racing                     103
Action & Adventure          96
Strategy                    95
Food & Drink                94
Educational                 93
Adventur

In [182]:
print(f'We can see that there are {len(ngenres)} unique genres in our data.')

We can see that there are 53 unique genres in our data.


In [187]:
bar = px.bar(x = ngenres.index[:15], 
             y = ngenres.values[:15], 
             title='Top Genres',
             hover_name=ngenres.index[:15],
             color=ngenres.values[:15],
             color_continuous_scale='Agsunset')
 
bar.update_layout(xaxis_title='Genre', yaxis_title='Number of Apps', coloraxis_showscale=False)
 
bar.show()

**How many apps are free and how many are paid?**

In [190]:
app_free_vs_paid = app_data.groupby(["Category", "Type"], as_index=False).agg({'App': 'count'})
app_free_vs_paid.head()

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42


In [193]:
app_free_vs_paid.sort_values('App', ascending=False)[:5]

Unnamed: 0,Category,Type,App
19,FAMILY,Free,1456
25,GAME,Free,834
53,TOOLS,Free,656
21,FINANCE,Free,289
31,LIFESTYLE,Free,284


Unsurprisingly, the biggest categories top the list, and free apps dominate. However, there might be some patterns if we put the numbers on a graph!

In [194]:
g_bar = px.bar(app_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group')
 
g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))
 
g_bar.show()

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.
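
The relative share of paid apps per category can also be computed rather than read off the log-scaled chart. A minimal sketch on made-up rows, using `pd.crosstab` with `normalize='index'` so that each category's row sums to 1:

```python
import pandas as pd

# Made-up rows: 2 of 4 medical apps and 1 of 5 games are paid
toy = pd.DataFrame({
    "Category": ["MEDICAL"] * 4 + ["GAME"] * 5,
    "Type": ["Paid", "Paid", "Free", "Free",
             "Paid", "Free", "Free", "Free", "Free"],
})

# Fraction of Free and Paid apps within each category
share = pd.crosstab(toy["Category"], toy["Type"], normalize="index")

# Categories ordered by their share of paid apps
paid_share = share["Paid"].sort_values(ascending=False)

print(paid_share)
```

Applied to `app_data`, the same two lines would rank the real categories by their proportion of paid apps.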

In [195]:
box = px.box(app_data,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')
 
box.update_layout(yaxis=dict(type='log'))
 
box.show()

But does this mean we should give up on selling a paid app? 

**Let’s see how much revenue we would estimate per category.**

In [197]:
paid_app = app_data[app_data['Type'] == 'Paid']

In [198]:
box = px.box(paid_app, 
             x='Category', 
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')
 
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))
 
box.show()

If an Android app costs `$30,000` to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about `$20,000`. Many apps' revenues were even lower, meaning they would need other sources of revenue such as advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers with much higher (estimated) revenue, for example Medical, Personalization, Tools, Game, and Family.
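
The comparison against a development budget can be sketched with a grouped median. The revenue figures below are invented for illustration; only the `$30,000` cost comes from the text. The median is used because a few outliers with very high revenue would pull a mean upward:

```python
import pandas as pd

DEV_COST = 30_000  # development cost assumed in the text

# Invented revenue estimates for two categories
paid = pd.DataFrame({
    "Category": ["PHOTOGRAPHY"] * 3 + ["GAME"] * 3,
    "Revenue_Estimate": [5_000, 20_000, 90_000, 10_000, 50_000, 1_000_000],
})

# Median estimated revenue of a paid app in each category
median_revenue = paid.groupby("Category")["Revenue_Estimate"].median()

# True where the typical paid app would recover the development cost
covers_cost = median_revenue >= DEV_COST

print(median_revenue)
print(covers_cost)
```

Running the same grouping on `paid_app` would show which real categories clear the threshold.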



In [199]:
box = px.box(paid_app,
             x='Category',
             y="Price",
             title='Price per Category')
 
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))
 
box.show()

This time we see that the Medical category contains the most expensive apps and has a median price of `$5.49`. In contrast, Personalization apps are quite cheap, with a median price of `$1.49`. Other categories with higher median prices are Business (`$4.99`) and Dating (`$6.99`). It seems that customers who shop in these categories are not so concerned about paying a bit extra for their apps.