# Wrestle the Android App Store Data into Beautiful Looking Charts with Plotly

Have you ever thought about building your own iOS or Android app? If so, you have probably wondered how things work in the app stores. Today we'll replicate some of the app store analytics provided by companies like App Annie or Sensor Tower, which help inform development and app marketing strategies for many companies. This stuff is BIG business!

We will compare thousands of apps in the Google Play Store so that we can gain insight into:

- How competitive different app categories (e.g., Games, Lifestyle, Weather) are

- Which app category offers compelling opportunities based on its popularity

- How many downloads you would give up by making your app paid vs. free

- How much you can reasonably charge for a paid app

- Which paid apps have had the highest revenue

- How many paid apps will recoup their development costs based on their sales revenue



In [1]:
import pandas as pd

df_apps = pd.read_csv('apps.csv')

# Data Cleaning: Removing NaN Values and Duplicates

The first step as always is getting a better idea about what we're dealing with.

### Preliminary Data Exploration

How many rows and columns does df_apps have? What are the column names? What does the data look like? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [None]:
# How many rows and columns does df_apps have?
df_apps.shape

# 10841 rows, 12 columns

(10841, 12)

We are working with a fairly large DataFrame this time: 10,841 rows and 12 columns.

In [11]:
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


We can already see that there are some data issues that we need to fix. In the Rating and Type columns there are NaN (not-a-number) values, and in the Price column we have dollar signs that will cause problems.
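As a minimal sketch of how the dollar signs might be handled later, the snippet below strips the `$` and converts the strings to floats. The toy DataFrame `prices` is an assumption made up for illustration; only the `Price` column name and the sample values come from the output above.

```python
import pandas as pd

# Toy frame mimicking the Price column seen above (illustrative values only).
prices = pd.DataFrame({"Price": ["$13.99", "$2.99", "0", "$1.49"]})

# Strip the literal dollar sign (regex=False treats "$" as plain text),
# then convert the remaining strings to floating-point numbers.
prices["Price"] = pd.to_numeric(
    prices["Price"].str.replace("$", "", regex=False)
)

print(prices["Price"].tolist())
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.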



In [None]:
# What are the column names? 
print('Columns: ')
for col in df_apps.columns:
    print(f' - {col}')

Columns: 
 - App
 - Category
 - Rating
 - Reviews
 - Size_MBs
 - Installs
 - Type
 - Price
 - Content_Rating
 - Genres
 - Last_Updated
 - Android_Ver


In [None]:
# What does the data look like? Look at a random sample of 5 different rows with .sample()
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
6002,"Chatting - Free chat, random chat, boyfriend, ...",DATING,4.2,2506,6.1,500000,Free,0,Mature 17+,Dating,"June 15, 2017",4.0 and up
1433,CW Bluetooth SPP,COMMUNICATION,,3,0.938477,100,Free,0,Everyone,Communication,"March 31, 2018",6.0 and up
6747,Go-Go-Goat! Free Game,FAMILY,4.5,66740,23.0,1000000,Free,0,Everyone,Arcade;Action & Adventure,"July 8, 2016",2.3 and up
5855,Zombie War Z : Hero Survival Rules,GAME,3.9,1987,62.0,100000,Free,0,Mature 17+,Action,"December 13, 2017",4.0.3 and up
3084,Check Your Visitors on FB ?,SOCIAL,3.6,40,1.5,5000,Free,0,Everyone,Social,"February 17, 2017",4.2 and up


Remove the columns called Last_Updated and Android_Ver from the DataFrame. We will not use these columns.



In [12]:
df_apps = df_apps.drop('Last_Updated', axis=1) #axis = 0 for rows, axis = 1 for columns
df_apps = df_apps.drop('Android_Ver', axis=1)
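The two `drop` calls above could also be combined into one by passing a list to the `columns` keyword. Here is a small sketch on a made-up DataFrame called `toy` (an assumption for illustration), so it does not touch `df_apps`:

```python
import pandas as pd

# Toy frame with the two columns we want to remove (illustrative values only).
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Last_Updated": ["July 28, 2017", "April 15, 2016"],
    "Android_Ver": ["4.1 and up", "3.0 and up"],
})

# columns=[...] drops several columns at once and avoids the axis argument.
trimmed = toy.drop(columns=["Last_Updated", "Android_Ver"])

print(trimmed.columns.tolist())
```

`drop` returns a new DataFrame, so `toy` itself still has all three columns afterwards.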

In [13]:
print('Columns: ')
for col in df_apps.columns:
    print(f' - {col}')

Columns: 
 - App
 - Category
 - Rating
 - Reviews
 - Size_MBs
 - Installs
 - Type
 - Price
 - Content_Rating
 - Genres


How many rows have a NaN value (not-a-number) in the Rating column? Create a DataFrame called df_apps_clean that does not include these rows.

In [16]:
df_apps.Rating.isna().sum()

np.int64(1474)

In [24]:
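# Alternative that only looks at the Rating column:
# df_apps_clean = df_apps[df_apps.Rating.notna()]
# dropna() without arguments removes rows with a NaN in any column.
df_apps_clean = df_apps.dropna()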

In [26]:
df_apps_clean.shape

(9367, 10)
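Note that 10,841 - 1,474 = 9,367, so in this dataset dropping rows with any NaN removes exactly the rows with a missing Rating. That will not hold in general. The sketch below uses a made-up DataFrame `toy` (an assumption for illustration) to show how `dropna()`, `dropna(subset=...)`, and a boolean mask can differ:

```python
import numpy as np
import pandas as pd

# Toy frame: B is missing a Rating, C is missing a Type (illustrative values).
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 3.9],
    "Type": ["Free", "Paid", np.nan],
})

only_rating = toy.dropna(subset=["Rating"])  # drops B only
any_column = toy.dropna()                    # drops B and C
mask = toy[toy.Rating.notna()]               # same rows as only_rating

print(len(only_rating), len(any_column), len(mask))
```

If the goal is strictly "no missing ratings", `subset=["Rating"]` or the boolean mask states that intent more precisely.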

This leaves us with 9,367 entries in our DataFrame. But there may be other problems with the data too:

Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.



In [33]:
df_apps_clean[df_apps_clean.duplicated()]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating
...,...,...,...,...,...,...,...,...,...,...
10802,Skype - free IM & video calls,COMMUNICATION,4.1,10484169,3.5,1000000000,Free,0,Everyone,Communication
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10826,Google Drive,PRODUCTIVITY,4.4,2731211,4.0,1000000000,Free,0,Everyone,Productivity
10832,Google News,NEWS_AND_MAGAZINES,3.9,877635,13.0,1000000000,Free,0,Teen,News & Magazines


In [36]:
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [38]:
df_apps_clean[df_apps_clean.App == 'Instagram'].duplicated()

10806    False
10808    False
10809     True
10810    False
dtype: bool

In [39]:
df_apps_clean = df_apps_clean.drop_duplicates()

In [40]:
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


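Three Instagram entries remain because they differ in their number of reviews, so they are not exact duplicates. To remove them, we need to provide the column names that should be used in the comparison to identify duplicates.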

In [41]:
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])

In [42]:
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social


In [43]:
df_apps_clean.shape

(8199, 10)

This leaves us with 8,199 entries after removing duplicates. 
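To make the difference between the two `drop_duplicates` calls concrete, here is a small sketch on a made-up DataFrame `toy` (an assumption for illustration; the review counts are borrowed from the Instagram rows above):

```python
import pandas as pd

# Toy frame: three Instagram rows, two of them identical in every column.
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Facebook"],
    "Reviews": [66577313, 66577446, 66577313, 78158306],
    "Type": ["Free"] * 4,
    "Price": ["0"] * 4,
})

# Without arguments, only rows that match in every column are removed.
exact = toy.drop_duplicates()

# With subset, rows count as duplicates if they match in these columns alone.
by_app = toy.drop_duplicates(subset=["App", "Type", "Price"])

print(len(exact), len(by_app))
```

By default `drop_duplicates` keeps the first occurrence of each duplicate group, which is why the Instagram row that survives is the first one.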

### What else should I know about the data?

So we can see that 13 different features were originally scraped from the Google Play Store.

- Obviously, the data is just a sample out of all the Android apps. It doesn't include all Android apps of which there are millions.

- I’ll assume that the sample is representative of the Google Play Store as a whole. This is not necessarily the case because, during the web scraping process, this sample was served up based on the geographical location and user behaviour of the person who scraped it (in our case, Lavanya Gupta).

- The data was compiled around 2017/2018. The pricing data reflects the price in US dollars at the time of scraping (developers can offer promotions and change their app’s pricing).

- I’ve converted the app’s size to a floating-point number in MBs. If data was missing, it has been replaced by the average size for that category.

- The installs are not the exact number of installs. If an app has 245,239 installs then Google will simply report an order of magnitude like 100,000+. I’ve removed the '+' and we’ll assume the exact number of installs in that column for simplicity.
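The size and install preprocessing described in the list above was done before we loaded `apps.csv`, so we cannot see the exact steps. As a rough sketch of how it might look, assuming a hypothetical raw DataFrame `raw` with install strings like `100,000+` (the raw format and values are assumptions, not taken from the file):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing size, installs as strings with ',' and '+'.
raw = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Size_MBs": [60.0, np.nan, 4.0, 6.0],
    "Installs": ["100,000+", "1,000+", "500+", "1,000,000+"],
})

# Replace a missing size with the mean size of the app's category.
raw["Size_MBs"] = raw["Size_MBs"].fillna(
    raw.groupby("Category")["Size_MBs"].transform("mean")
)

# Drop the thousands separators and the trailing '+', then convert to integers.
raw["Installs"] = (
    raw["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

print(raw)
```

`transform("mean")` returns one value per row, aligned with the original index, which is what lets `fillna` pick the right category mean for each missing entry.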



# Preliminary Exploration: The Highest Ratings, Most Reviews, and Largest Size

- Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings to determine the quality of an app?

In [48]:
df_apps_clean.sort_values(by='Rating', ascending=False)[:11]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1230,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1227,AJ Men's Grooming,LIFESTYLE,5.0,2,22.0,100,Free,0,Everyone,Lifestyle
1224,FK Dedinje BGD,SPORTS,5.0,36,2.6,100,Free,0,Everyone,Sports
1223,CB VIDEO VISION,PHOTOGRAPHY,5.0,13,2.6,100,Free,0,Everyone,Photography
1222,"Beacon Baptist Jupiter, FL",LIFESTYLE,5.0,14,2.6,100,Free,0,Everyone,Lifestyle
1214,BV Mobile Apps,PRODUCTIVITY,5.0,3,4.8,100,Free,0,Everyone,Productivity
2680,Florida Wildflowers,FAMILY,5.0,5,69.0,1000,Free,0,Everyone,Education
1206,ADS-B Driver,TOOLS,5.0,2,6.3,100,Paid,$1.99,Everyone,Tools
2750,"Superheroes, Marvel, DC, Comics, TV, Movies News",COMICS,5.0,34,12.0,5000,Free,0,Everyone,Comics


Only apps with very few reviews (and a low number of installs) have perfect 5-star ratings (most likely from friends and family).
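One possible way around this problem is to require a minimum number of reviews before ranking by rating. The sketch below uses a made-up DataFrame `toy`, and the threshold `MIN_REVIEWS` is an arbitrary assumption chosen for illustration, not a recommended value:

```python
import pandas as pd

# Toy frame: one tiny app with a perfect score, two widely reviewed apps.
toy = pd.DataFrame({
    "App": ["Tiny", "Popular", "Solid"],
    "Rating": [5.0, 4.6, 4.4],
    "Reviews": [4, 44891723, 2731211],
})

MIN_REVIEWS = 1000  # arbitrary cut-off, purely illustrative

# Keep only apps with enough reviews, then rank the remainder by rating.
established = toy[toy.Reviews >= MIN_REVIEWS]
top_rated = established.sort_values(by="Rating", ascending=False)

print(top_rated.App.tolist())
```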

- What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [51]:
df_apps_clean.sort_values(by='Size_MBs', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


The data suggests an upper bound of 100 MB for the size of an app. A quick Google search would also have revealed that this limit is imposed by the Google Play Store itself. It’s interesting that a number of apps hit that limit exactly.
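To count how many apps sit exactly at the maximum size, you could compare the column against its own maximum. A small sketch on a made-up DataFrame `toy` (the app names are hypothetical):

```python
import pandas as pd

# Toy frame: two apps at the largest size, two below it (illustrative values).
toy = pd.DataFrame({
    "App": ["Big Game", "Big Sim", "Medium Game", "Small Tool"],
    "Size_MBs": [100.0, 100.0, 98.0, 5.3],
})

max_size = toy.Size_MBs.max()

# The comparison yields a boolean Series; summing it counts the True values.
at_limit = (toy.Size_MBs == max_size).sum()

print(max_size, at_limit)
```

The same two lines applied to `df_apps_clean` would tell you how many apps in the real data hit the cap.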


- Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [53]:
df_apps_clean.sort_values(by='Reviews', ascending=False).head(n=50)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10784,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56642847,3.5,1000000000,Free,0,Everyone,Communication
10650,Clash of Clans,GAME,4.6,44891723,98.0,100000000,Free,0,Everyone 10+,Strategy
10744,Clean Master- Space Cleaner & Antivirus,TOOLS,4.7,42916526,3.4,500000000,Free,0,Everyone,Tools
10835,Subway Surfers,GAME,4.5,27722264,76.0,1000000000,Free,0,Everyone 10+,Arcade
10828,YouTube,VIDEO_PLAYERS,4.3,25655305,4.65,1000000000,Free,0,Teen,Video Players & Editors
10746,"Security Master - Antivirus, VPN, AppLock, Boo...",TOOLS,4.7,24900999,3.4,500000000,Free,0,Everyone,Tools
10584,Clash Royale,GAME,4.6,23133508,97.0,100000000,Free,0,Everyone 10+,Strategy


If you look at the number of reviews, you can find the most popular apps on the Google Play Store. These include the usual suspects: Facebook, WhatsApp, Instagram, etc. What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app! 🤔
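Rather than scanning 50 rows by eye, the check for paid apps can be done in code. Here is a minimal sketch on a made-up DataFrame `toy` (the paid app is hypothetical, and `head(2)` stands in for `head(50)` on the real data):

```python
import pandas as pd

# Toy frame: two heavily reviewed free apps and one rarely reviewed paid app.
toy = pd.DataFrame({
    "App": ["Facebook", "WhatsApp Messenger", "Hypothetical Paid App"],
    "Reviews": [78158306, 69119316, 120],
    "Type": ["Free", "Free", "Paid"],
})

# Take the most reviewed apps, then count how many of them are paid.
top = toy.sort_values(by="Reviews", ascending=False).head(2)
paid_in_top = (top.Type == "Paid").sum()

print(paid_in_top)
```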

