# Introduction

In this guided project, we work with two sets of mobile app data -- one from the iOS App Store and one from the Android Google Play Store -- in order to understand what types of apps are likely to attract the most users. Our focus will be on free apps that are targeted at an English-speaking audience. More specifically, we will explore the most common and the most popular free apps by genre to see whether any app genres are particularly popular. While this project was originally written in pure Python, we will instead make use of the usual Python data science packages: NumPy, pandas, Matplotlib, and seaborn.

Since there are over four million apps available between the two major mobile app stores, we will work with samples from each that have been collected and uploaded to Kaggle. The first sample is a [set of approximately 10,000 Google Play Store apps](https://www.kaggle.com/lava18/google-play-store-apps) that was scraped in February 2019. The second is a [set of approximately 7200 iOS App Store apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) that was scraped in June 2018.

The Google Play Store data contains the following columns.

- `App` - Application name
- `Category` - Category the app belongs to
- `Rating` - Overall user rating of the app (as when scraped)
- `Reviews` - Number of user reviews for the app (as when scraped)
- `Size` - Size of the app (as when scraped)
- `Installs` - Number of user downloads/installs for the app (as when scraped)
- `Type` - Paid or Free
- `Price` - Price of the app (as when scraped)
- `Content Rating` - Age group the app is targeted at - Children / Mature 21+ / Adult
- `Genres` - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.
- `Last Updated` - Date when the app was last updated on Play Store (as when scraped)
- `Current Ver` - Current version of the app available on Play Store (as when scraped)
- `Android Ver` - Min required Android version (as when scraped)

The iOS App Store data contains the following columns.

- `id` - App ID
- `track_name` - App Name
- `size_bytes` - Size (in Bytes)
- `currency` - Currency Type
- `price` - Price amount
- `rating_count_tot` - User Rating counts (for all versions)
- `rating_count_ver` - User Rating counts (for current version)
- `user_rating` - Average User Rating value (for all versions)
- `user_rating_ver` - Average User Rating value (for current version)
- `ver` - Latest version code
- `cont_rating` - Content Rating
- `prime_genre` - Primary Genre
- `sup_devices.num` - Number of supporting devices
- `ipadSc_urls.num` - Number of screenshots shown for display
- `lang.num` - Number of supported languages
- `vpp_lic` - VPP Device Based Licensing Enabled

# Loading data and initial cleaning

We start by importing our usual Python data science packages and then loading the two sets of app data into separate dataframes to perform some initial inspections. In particular, we will start off by cleaning up the column names to make them a little easier to work with.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
android_apps_filepath = "googleplaystore.csv"
android_apps = pd.read_csv(android_apps_filepath)

ios_apps_filepath = "AppleStore.csv"
ios_apps = pd.read_csv(ios_apps_filepath).drop(columns = ["Unnamed: 0"])

The column names for the Google Play Store data are all reasonably descriptive and concise, so the only cleaning we do for them is converting each name into snake case, which is the preferred style for Python.

In [3]:
android_apps_cols = android_apps.columns
android_apps_cols = android_apps_cols.str.replace(" ", "_")
android_apps_cols = android_apps_cols.str.lower()
android_apps.columns = android_apps_cols

The column names for the iOS App Store data can be left as is, though a few could be made more concise. For example, `rating_count_tot` could be changed to `oa_reviews` and `rating_count_ver` to `cur_ver_reviews`, which in turn lets us shorten `user_rating` and `user_rating_ver` to `oa_rating` and `cur_ver_rating`, respectively. This is purely a matter of personal preference.

In [4]:
ios_apps_cols = ios_apps.columns.to_series()
columns_dict = {"rating_count_tot": "oa_reviews", "rating_count_ver": "cur_ver_reviews",
               "user_rating": "oa_rating", "user_rating_ver": "cur_ver_rating",
               "sup_devices.num": "num_sup_devices", "ipadSc_urls.num": "num_screenshots",
               "lang.num": "num_langs"}
ios_apps_cols.replace(columns_dict, inplace = True)
ios_apps.columns = ios_apps_cols
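
As an aside, the same kind of renaming can also be expressed with `DataFrame.rename()`, which takes a mapping of old names to new names and leaves unlisted columns untouched. The following is only a minimal sketch on a toy dataframe; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame reusing a few of the original iOS column names (values are made up).
toy = pd.DataFrame({"id": [1], "rating_count_tot": [100], "lang.num": [3]})

columns_dict = {"rating_count_tot": "oa_reviews", "lang.num": "num_langs"}

# rename() returns a new dataframe; columns missing from the mapping keep their names.
toy = toy.rename(columns=columns_dict)
print(list(toy.columns))
```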

Now that we have cleaned up the column names for each data set to our liking, we move on to the rest of the cleaning. This will involve four main steps:

1. Checking for and handling null values.
2. Checking for duplicate apps and deciding which duplicates to remove.
3. Removing apps that aren't targeted toward English-speaking audiences.
4. Removing apps that are not free.

## Checking for and handling null values

The first step in our data cleaning is checking for and handling null values.

In [5]:
android_apps.isnull().sum()

app                  0
category             0
rating            1474
reviews              0
size                 0
installs             0
type                 1
price                0
content_rating       1
genres               0
last_updated         0
current_ver          8
android_ver          3
dtype: int64

In [6]:
ios_apps.isnull().sum()

id                 0
track_name         0
size_bytes         0
currency           0
price              0
oa_reviews         0
cur_ver_reviews    0
oa_rating          0
cur_ver_rating     0
ver                0
cont_rating        0
prime_genre        0
num_sup_devices    0
num_screenshots    0
num_langs          0
vpp_lic            0
dtype: int64

We see that while the iOS App Store data doesn't contain any null values, the Google Play Store data does. In particular, about 1500 apps are missing rating information, and a few apps are missing other information, such as type, content rating, app version info, and minimum required Android version. Since only a handful of apps are missing information other than the rating, we'll look at those individually to decide how to handle them before proceeding to the question of the null values in the ratings column. We start with the app which has a null value in the `type` column.

In [7]:
android_apps[android_apps["type"].isnull()]

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


We can see that while this app, [Command & Conquer: Rivals](https://play.google.com/store/apps/details?id=com.ea.gp.candcwarzones), has a null value in the `type` column, it has a `price` value of 0. This indicates that the app is a free app, which we can confirm by looking up the Play Store listing for the game. Therefore, we can simply replace the null value with "Free".

In [8]:
android_apps.loc[9148, "type"] = "Free"

In [9]:
android_apps["type"].isnull().sum()

0
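
With only one affected row, fixing it by index label is the simplest option. If there were many such rows, the same rule (a missing `type` with a price of zero means "Free") could be applied with a boolean mask. The sketch below uses a made-up toy dataframe and assumes that prices are stored as strings such as `"0"` and `"$4.99"`.

```python
import numpy as np
import pandas as pd

# Toy data (made up); prices are assumed to be stored as strings.
toy = pd.DataFrame({
    "app": ["A", "B", "C"],
    "type": ["Free", np.nan, "Paid"],
    "price": ["0", "0", "$4.99"],
})

# Rows whose type is missing but whose price says the app costs nothing.
missing_but_free = toy["type"].isnull() & (toy["price"] == "0")
toy.loc[missing_but_free, "type"] = "Free"

print(toy["type"].tolist())
```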

Now that we have taken care of the row with a missing `type` value, we'll move on to the row with a missing `content_rating` value.

In [10]:
android_apps[android_apps["content_rating"].isnull()]

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


Inspecting this row reveals that it has more issues than just a missing `content_rating` value. In fact, a missing `category` value caused the values in all of the other columns to be shifted one position to the left. To fix this row, we can look up its [Play Store listing](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) to confirm that it should have a `category` value of "LIFESTYLE". We can then create a working copy of the row, shift its values one position to the right, and update the `category` and `genres` values to reflect that this app is a lifestyle app.

In [11]:
s = android_apps.loc[10472].copy()

In [12]:
s.iloc[2:] = android_apps.iloc[10472, 1:-1].values

s["category"] = "LIFESTYLE"

s["genres"] = "Lifestyle"

In [13]:
android_apps.loc[10472] = s

In [14]:
android_apps["content_rating"].isnull().sum()

0
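
To make the shift easier to follow, here is the same idea on a shortened, made-up row with only five columns. The names `row` and `fixed` are hypothetical, and the values are illustrative rather than copied from the dataset.

```python
import pandas as pd

# A misaligned row: "category" is missing, so every later value sits one slot too far left.
row = pd.Series(
    ["Photo Frame", "1.9", "19", "3.0M", None],
    index=["app", "category", "rating", "reviews", "size"],
    dtype="object",
)

fixed = row.copy()
# Move everything after "app" one slot to the right; the trailing empty slot is dropped.
fixed.iloc[2:] = row.iloc[1:-1].to_numpy()
# Fill in the value that was missing in the first place.
fixed["category"] = "LIFESTYLE"

print(fixed.tolist())
```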

While there are still columns with missing values, some of them, such as the `android_ver` column, hold information that we can no longer look up by hand through the Google Play Store as of July 2020. In addition, since our ultimate goal in this project is to identify app genres which are particularly popular, the null values that remain are not directly relevant to our analysis at this point. A potential future extension of this project would involve handling the null values in the `rating` column to allow us to explore possible relationships between an app's rating and its popularity. For now, though, we move on to checking for and handling duplicate apps in the two datasets.

## Checking for and handling duplicate apps

To check for duplicate apps in the two datasets, we will make use of the [`duplicated()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function, which returns a boolean Series object denoting the duplicate rows. Since we are only using the app name to identify duplicates, we pass in the keyword argument `subset = ["app"]` to specify the column we wish to use.

In [15]:
android_apps.duplicated(subset = ["app"]).sum()

1181

There are a total of 1,181 duplicate Android app listings. For example, the app "Coloring book moana" appears twice.

In [16]:
android_apps.loc[android_apps.duplicated(subset = ["app"], keep = False), "app"].unique()[0]

'Coloring book moana'

In [17]:
android_apps[android_apps["app"] == "Coloring book moana"]

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2033,Coloring book moana,FAMILY,3.9,974,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


Ideally, we'd be able to choose which of the above two rows to keep by comparing when each was scraped and then picking the row which was scraped more recently. Unfortunately, this dataset does not explicitly contain that information. We can, however, use the values in the `reviews` column as a proxy, on the assumption that the row with more reviews is the one that was scraped more recently. To have pandas handle this process, we first sort the dataframe by the `reviews` column in descending order and then use the [`drop_duplicates()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) function to drop the duplicate rows. By default, `drop_duplicates()` keeps the first occurrence of each duplicated row, so by sorting in descending order by the number of reviews we keep the row with the most reviews (i.e. the row which was presumably scraped most recently) for each duplicated app. Before sorting by the number of reviews, we need to cast the `reviews` column to integers, since correcting the entry for Life Made WI-Fi Touchscreen Photo Frame put a float into that column while the rest of the values are currently stored as strings. Strings and floats cannot be compared to each other, so sorting would raise an error if we did not convert the `reviews` column first.
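
The sort-then-drop pattern is easiest to see on a small example. The sketch below uses a made-up dataframe `toy` in which apps "A" and "B" each appear twice with different review counts.

```python
import pandas as pd

# Toy data (made up): "A" and "B" are each listed twice.
toy = pd.DataFrame({
    "app": ["A", "B", "A", "C", "B"],
    "reviews": [10, 5, 12, 7, 3],
})

# Sort so the most-reviewed listing of each app comes first, then keep
# only the first occurrence of each app name.
deduped = (
    toy.sort_values(by="reviews", ascending=False)
       .drop_duplicates(subset=["app"], ignore_index=True)
)
print(deduped)
```

Note that the result is ordered by review count rather than by the original row order, which is also what happens to the full dataframe in the cells that follow.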

In [18]:
android_apps["reviews"] = android_apps["reviews"].astype("int64")

In [19]:
android_apps = android_apps.sort_values(by = ["reviews"], ascending = False).drop_duplicates(subset = ["app"], ignore_index = True)

In [20]:
android_apps.shape

(9660, 13)

After removing the duplicate rows, there are 9,660 apps left in the Google Play Store dataset. Now that we have handled the duplicate Android apps, we perform the same process with the iOS App Store dataset. For this set, we can use either the `id` or the `track_name` column to identify duplicate apps.

In [21]:
ios_apps.duplicated(subset = ["id"]).sum()

0

In [22]:
ios_apps.duplicated(subset = ["track_name"]).sum()

2

In [23]:
ios_apps[ios_apps.duplicated(subset = ["track_name"], keep = False)]

Unnamed: 0,id,track_name,size_bytes,currency,price,oa_reviews,cur_ver_reviews,oa_rating,cur_ver_rating,ver,cont_rating,prime_genre,num_sup_devices,num_screenshots,num_langs,vpp_lic
3319,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1
5603,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1
7092,1173990889,Mannequin Challenge,109705216,USD,0.0,668,87,3.0,3.0,1.4,9+,Games,37,4,1,1
7128,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1


It turns out that while every app in the iOS App Store dataset has a unique app ID, two app names each appear twice. For each duplicated name, we keep only the occurrence with the highest number of reviews. In both cases shown above, the first occurrence is the one with more reviews, so the default behavior of `drop_duplicates()`, which keeps the first occurrence, is sufficient here.

In [24]:
ios_apps = ios_apps.drop_duplicates(subset = ["track_name"], ignore_index = True)
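
If we did not want to rely on row order, one possible way to make the "keep the most-reviewed listing" rule explicit is to look up, for each app name, the index label of the row with the largest review count. This is only a sketch on a hypothetical four-row frame `ios_toy`, whose IDs, names, and review counts are copied from the duplicated rows shown above.

```python
import pandas as pd

ios_toy = pd.DataFrame({
    "id": [952877179, 1089824278, 1173990889, 1178454060],
    "track_name": ["VR Roller Coaster", "VR Roller Coaster",
                   "Mannequin Challenge", "Mannequin Challenge"],
    "oa_reviews": [107, 67, 668, 105],
})

# For each app name, find the index label of the row with the most reviews.
keep_idx = ios_toy.groupby("track_name")["oa_reviews"].idxmax()

# Select those rows, restoring the original row order before resetting the index.
deduped = ios_toy.loc[keep_idx].sort_index().reset_index(drop=True)
print(deduped)
```

Unlike sorting the whole dataframe, this approach leaves the surviving rows in their original relative order.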

Now that we have handled duplicate apps in both datasets, our next step is to remove apps that are not targeted toward English-speaking audiences.

## Removing apps that are not targeted toward English-speaking audiences

Since in this project we are focusing on apps that are targeted toward English-speaking audiences, we want to remove any apps which may not be targeted toward English speakers. One way that we can focus on apps targeted toward English-speaking audiences is by filtering out apps with names that use non-English characters. A naive way of doing this would be to use the built-in [`str.isascii()` function](https://docs.python.org/3/library/stdtypes.html#str.isascii) to filter out all apps which use non-ASCII characters.

In [25]:
titles = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
list(filter(str.isascii, titles))

['Instagram']

As we can see above, filtering out apps with names that include any non-ASCII characters is a very aggressive filtering strategy, as it would filter out apps with titles that are in English but use special non-ASCII characters such as emoji or the trademark symbol. To minimize the potential data loss from this aggressive strategy, we will instead filter out app titles in which the number of non-ASCII characters exceeds a given threshold. The first step in implementing this less aggressive strategy is to write a function that counts the number of non-ASCII characters in a string. This will make use of the [`ord()` function](https://docs.python.org/3/library/functions.html#ord), which converts a character into its Unicode code point.

In [26]:
def num_non_ascii(string):
    # convert string into list of Unicode code points of each character
    character_codepoints = list(map(ord, string))
    # keep only the non-ASCII characters, which have code point > 127
    non_ascii = list(filter(lambda x: x > 127, character_codepoints))
    return len(non_ascii)

In [27]:
list(map(num_non_ascii, titles))

[0, 13, 1, 1]
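
For comparison, the same count can be written more compactly with a generator expression, since `True` counts as 1 when summed. The name `count_non_ascii` is a hypothetical alternative used only for this sketch; it is not used elsewhere in the project.

```python
def count_non_ascii(text):
    """Count characters whose Unicode code point lies outside the ASCII range (0-127)."""
    return sum(ord(ch) > 127 for ch in text)

titles = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
counts = [count_non_ascii(title) for title in titles]
print(counts)
```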

With this function, we can then filter our data by excluding apps with titles that contain more than a given number of non-ASCII characters. To do this, we use the [`Series.apply()` function](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html#pandas.Series.apply) along with our `num_non_ascii` function to create a boolean mask for app titles that contain no more than our threshold for non-ASCII characters. For instance, we can filter out apps with titles that contain more than three non-ASCII characters.

In [28]:
non_ascii_threshold = 3
android_english_mask = android_apps["app"].apply(lambda x: num_non_ascii(x) <= non_ascii_threshold)
ios_english_mask = ios_apps["track_name"].apply(lambda x: num_non_ascii(x) <= non_ascii_threshold)

In [29]:
android_apps.shape[0] - android_apps.loc[android_english_mask, :].shape[0]

45

In [30]:
android_apps.loc[android_english_mask, :].shape[0]

9615

In [31]:
ios_apps.shape[0] - ios_apps.loc[ios_english_mask, :].shape[0]

1014

In [32]:
ios_apps.loc[ios_english_mask, :].shape[0]

6181

This filter removed 45 apps from the Google Play Store set and 1,014 apps from the iOS App Store set, leaving us with 9,615 and 6,181 apps from the Google Play Store and iOS App Store, respectively. With our sets of English-language apps from each store in hand, the last step is to filter out the apps that are not free.

## Removing apps that are not free

Isolating the free apps in each data set is a straightforward process. The Google Play Store data already contains the `type` column, which tags apps as either `Free` or `Paid`, so all we need to do is create a boolean mask to select only the apps that are tagged as `Free`.

In [33]:
android_free_mask = android_apps["type"] == "Free"
android_english_free_mask = android_english_mask & android_free_mask

In [34]:
android_apps.loc[android_english_free_mask, :].shape

(8865, 13)

As we can see, there are 8,865 free Android apps that are targeted toward English-speaking audiences. While the iOS App Store data doesn't contain a column like the `type` column from the Google Play Store data, we can still isolate the free iOS apps by using the `price` column. All we need to do is select the apps with a `price` value of zero.

In [35]:
ios_free_mask = ios_apps["price"] == 0
ios_english_free_mask = ios_english_mask & ios_free_mask

In [36]:
ios_apps.loc[ios_english_free_mask, :].shape

(3220, 16)

There are 3,220 free iOS apps that are targeted toward English-speaking audiences.
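
The way the two boolean masks combine can be illustrated on a tiny, made-up dataframe. The `&` operator compares the masks element by element, so a row is selected only when both conditions hold. The frame `apps` and its column names are hypothetical.

```python
import pandas as pd

# Toy data (made up): one non-English title and one paid app.
apps = pd.DataFrame({
    "name": ["Instagram", "爱奇艺PPS -《欢乐颂2》电视剧热播", "Paid Tool", "Instachat 😜"],
    "price": [0.0, 0.0, 2.99, 0.0],
})

non_ascii_threshold = 3
# True where the title has at most `non_ascii_threshold` non-ASCII characters.
english_mask = apps["name"].apply(
    lambda title: sum(ord(ch) > 127 for ch in title) <= non_ascii_threshold
)
# True where the app costs nothing.
free_mask = apps["price"] == 0

selected = apps.loc[english_mask & free_mask, "name"].tolist()
print(selected)
```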

# Analyzing most common apps by genre

Now that we've done our initial data cleaning, which involved the following steps:

1. Checking for and handling null values;
2. Checking for duplicate apps and deciding which duplicates to remove;
3. Removing apps that aren't targeted toward English-speaking audiences; and
4. Removing apps that are not free,

we can finally dive into some analysis. Our goal is to explore the most common and most popular free apps by genre, which would be relevant if we were a hypothetical app developer looking for particular kinds of apps that are likely to be profitable. We would also want our hypothetical app to have maximal reach, so we should see what kinds of apps are successful on both the Google Play Store and the iOS App Store. As a starting point for our analysis, we'll simply look for which types of apps are most common in each app store. The [`pandas.Series.value_counts()` function](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) provides a convenient way to generate frequency tables for a given column of data. We will use this function with the `category` and `genres` columns from the Google Play Store data, and with the `prime_genre` column from the iOS App Store data. Note that passing the argument `normalize = True` tells the function to provide the relative frequencies of the unique values instead of the counts.

In [37]:
android_apps.loc[android_english_free_mask, "category"].value_counts(normalize = True)

FAMILY                 0.189171
GAME                   0.097236
TOOLS                  0.084602
BUSINESS               0.045911
LIFESTYLE              0.039143
PRODUCTIVITY           0.038917
FINANCE                0.036999
MEDICAL                0.035307
SPORTS                 0.033954
PERSONALIZATION        0.033164
COMMUNICATION          0.032375
HEALTH_AND_FITNESS     0.030795
PHOTOGRAPHY            0.029442
NEWS_AND_MAGAZINES     0.027975
SOCIAL                 0.026622
TRAVEL_AND_LOCAL       0.023350
SHOPPING               0.022448
BOOKS_AND_REFERENCE    0.021433
DATING                 0.018613
VIDEO_PLAYERS          0.017936
MAPS_AND_NAVIGATION    0.013988
FOOD_AND_DRINK         0.012408
EDUCATION              0.011506
ENTERTAINMENT          0.009588
LIBRARIES_AND_DEMO     0.009363
AUTO_AND_VEHICLES      0.009250
HOUSE_AND_HOME         0.008235
WEATHER                0.008009
EVENTS                 0.007107
PARENTING              0.006543
ART_AND_DESIGN         0.006430
COMICS  

Looking at the frequency table for the `category` column, we see that the two most common categories are "Family" and "Game". The "Family" category is a little nebulous, and the name in and of itself isn't particularly descriptive, but a look at the [corresponding page on the Google Play Store](https://play.google.com/store/apps/category/FAMILY) reveals this category is a mix of educational apps and games that are targeted at children. In other words, almost 30% of the free apps for English-speaking audiences in the Google Play Store set could be considered games. The next most common categories, however, are more practical apps, such as "Tools", "Business", "Lifestyle", and "Productivity", which together make up roughly the next 20% of the apps that we are looking at.
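
The two percentages quoted above can be checked by summing the relevant relative frequencies. In this sketch the six values are copied by hand from the frequency table, and the variable names are made up for illustration.

```python
import pandas as pd

# Relative frequencies copied from the `category` frequency table above.
category_share = pd.Series({
    "FAMILY": 0.189171,
    "GAME": 0.097236,
    "TOOLS": 0.084602,
    "BUSINESS": 0.045911,
    "LIFESTYLE": 0.039143,
    "PRODUCTIVITY": 0.038917,
})

games_share = category_share[["FAMILY", "GAME"]].sum()
practical_share = category_share[["TOOLS", "BUSINESS", "LIFESTYLE", "PRODUCTIVITY"]].sum()

print(round(games_share, 3), round(practical_share, 3))
```

The sums come to about 0.286 and 0.209, which is where the "almost 30%" and "next 20%" figures come from.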

In [38]:
android_apps.loc[android_english_free_mask, "genres"].value_counts(normalize = True)

Tools                                 0.084490
Entertainment                         0.060688
Education                             0.053469
Business                              0.045911
Lifestyle                             0.039030
                                        ...   
Arcade;Pretend Play                   0.000113
Racing;Pretend Play                   0.000113
Video Players & Editors;Creativity    0.000113
Casual;Music & Video                  0.000113
Tools;Education                       0.000113
Name: genres, Length: 114, dtype: float64

The frequency table for the `genres` column is less helpful, at least without further processing, since individual apps can be placed into more than one genre. For now, since we want more of a general overview of the free English app landscape on the Google Play Store, we will focus our attention back to the `category` column. In the future, if we want to try and extract some more granular information regarding app types, we can revisit this column and perform some further processing to make it more usable. Next, we explore the `prime_genre` column from the iOS App Store data.
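
As a rough idea of what that further processing might look like, the `;`-separated values could be split into lists and then expanded to one row per genre with `Series.explode()`. This is only a sketch on a made-up Series; note that an app with two genres is counted once under each, so the counts add up to more than the number of apps.

```python
import pandas as pd

# Toy genre strings (made up) in the same ";"-separated format as the `genres` column.
genres = pd.Series([
    "Tools",
    "Art & Design;Pretend Play",
    "Tools;Education",
    "Education",
])

# Split each entry into a list of genres, then give every genre its own row.
per_genre = genres.str.split(";").explode()
genre_counts = per_genre.value_counts()
print(genre_counts)
```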

In [39]:
ios_apps.loc[ios_english_free_mask, "prime_genre"].value_counts(normalize = True)

Games                0.581366
Entertainment        0.078882
Photo & Video        0.049689
Education            0.036646
Social Networking    0.032919
Shopping             0.026087
Utilities            0.025155
Sports               0.021429
Music                0.020497
Health & Fitness     0.020186
Productivity         0.017391
Lifestyle            0.015839
News                 0.013354
Travel               0.012422
Finance              0.011180
Weather              0.008696
Food & Drink         0.008075
Reference            0.005590
Business             0.005280
Book                 0.004348
Navigation           0.001863
Medical              0.001863
Catalogs             0.001242
Name: prime_genre, dtype: float64

Among free iOS apps targeted at English-speaking audiences, the picture is very different compared to the Google Play Store apps. The most striking difference is that games account for almost 60% of the iOS apps we are looking at, compared to the roughly 30% of Android apps that could be considered games. Also of note is the general prevalence of "fun" apps among the rest of the top five most common iOS app genres (Entertainment, Photo & Video, Social Networking) compared to the more practical bent of the top five most common Android app categories.

One important thing to note, however, is that these frequency tables only show **how common** various types of apps are. They say nothing about the **popularity** of apps within those genres. For example, there might be so many Family apps in the Google Play Store simply because they are easy and inexpensive to make, allowing them to turn a profit through sheer numbers even if none of those apps is individually particularly popular. Put another way, a large supply of a given type of app does not imply the existence of a correspondingly large demand. If we want to find out which types of apps are the most popular, we'll need to instead look at some of the other columns for a more relevant measure of popularity.

# Analyzing the most popular apps by genre

One way of measuring the popularity of an app is to look at how many times it has been downloaded and installed. We can then compare the popularity of different types of apps by computing the mean and median number of installs for each type of app. We will use both the mean and the median to account for the possibility of outliers (dominant apps with huge user bases) in each category. While this is something that we can do fairly straightforwardly with the Google Play Store data, since the `installs` column provides information about the number of installs for each app, the iOS App Store data doesn't include such a column. Instead, we'll need to see if we can use some other column from the iOS App Store data as a reasonable proxy for the number of installs. Before we do that, though, we will focus our attention on the Google Play Store data.

## Most popular Android apps by genre

While the Google Play Store data already includes an `installs` column, there is still a bit of pre-processing that we need to do before we can fully utilize it.

In [40]:
android_apps["installs"].head()

0    1,000,000,000+
1    1,000,000,000+
2    1,000,000,000+
3    1,000,000,000+
4      100,000,000+
Name: installs, dtype: object

Right now the `installs` column is stored as the [`object` data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes), which is the most general data type that pandas can use to accommodate columns which contain multiple data types. Looking at the output from the `unique()` function tells us that this is because the column contains strings which are used to represent numerical ranges.

In [41]:
android_apps["installs"].unique()

array(['1,000,000,000+', '100,000,000+', '500,000,000+', '50,000,000+',
       '10,000,000+', '5,000,000+', '1,000,000+', '500,000+', '100,000+',
       '50,000+', '10,000+', '5,000+', '1,000+', '500+', '100+', '10+',
       '50+', '5+', '1+', '0+', '0'], dtype=object)

We also see that the values in the `installs` column aren't exact numbers of installs. Instead, they are order-of-magnitude estimates such as 100+ installs or 1,000,000+ installs. This is important to note, since an app with an `installs` value of 1,000,000+ could have anywhere from exactly 1,000,000 installs to as many as 4,999,999 installs (the next bucket starts at 5,000,000+). However, since we are trying to examine overall popularity trends for various app genres, this data is sufficient for our purposes.

In order to perform computations with this column, we will need to convert the strings into integers. To do so, we will assume that the number of installs for each app will be at the bottom of the range indicated by the value in the `installs` column. For example, we will simply convert a value of 100+ to just 100. This conversion will happen in two steps. First, we use the [`pandas.Series.str.replace()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html#pandas.Series.str.replace) to strip out all of the "+" and "," characters that occur. The second step is to use the [`pandas.to_numeric()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) to convert those values from strings to a numeric data type (either ints or floats depending on the context). We will then replace the original `installs` column with the converted version.

In [42]:
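# regex = True is passed explicitly so the pattern is treated as a character class
# regardless of the pandas version's default for this argument
android_apps["installs"] = pd.to_numeric(android_apps["installs"].str.replace(r"[,+]", "", regex = True))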

In [43]:
android_apps["installs"]

0       1000000000
1       1000000000
2       1000000000
3       1000000000
4        100000000
           ...    
9655            10
9656            10
9657             1
9658             1
9659          1000
Name: installs, Length: 9660, dtype: int64
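
The two-step conversion can be tried out on a handful of representative strings. The Series `raw` below is a made-up sample using values that appear in the `installs` column.

```python
import pandas as pd

# A few representative raw values from the `installs` column.
raw = pd.Series(["1,000,000+", "500+", "0+", "0"])

# Step 1: strip every "," and "+" using a regular-expression character class.
stripped = raw.str.replace(r"[,+]", "", regex=True)

# Step 2: convert the remaining digit strings to a numeric dtype.
cleaned = pd.to_numeric(stripped)
print(cleaned.tolist())
```

Each value ends up at the bottom of its range, so "0+" and "0" both become 0.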

Now that we have converted the `installs` column from strings to integers, we use the [`pandas.DataFrame.pivot_table()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) to aggregate the install numbers by category and produce a pivot table with the mean, standard deviation, and median number of installs for each category. We'll focus on the top ten categories with the highest mean number of installs, and we also make use of the [`round()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round) to help us focus on the general orders of magnitude among the means and standard deviations.
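
To see what such a pivot table looks like, and why we report the median alongside the mean, consider a made-up example in which one category contains a single app with a huge number of installs. The category labels and install counts below are illustrative only.

```python
import pandas as pd

# Toy data (made up): "COMM" contains one dominant app with a billion installs.
toy = pd.DataFrame({
    "category": ["COMM", "COMM", "COMM", "TOOLS", "TOOLS"],
    "installs": [1_000, 10_000, 1_000_000_000, 50_000, 100_000],
})

# One row per category; one column per (aggregation, value column) pair.
pivot = toy.pivot_table(values="installs", index="category", aggfunc=["mean", "median"])
print(pivot)
```

The single dominant app pulls the "COMM" mean up to roughly 333 million, while its median stays at 10,000, which is much closer to what a typical app in that toy category looks like.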

In [44]:
# aggregation names are passed as strings, which newer pandas versions prefer over NumPy functions
android_installs_pivot_table = android_apps.loc[android_english_free_mask, :].pivot_table(values = "installs", index = "category", aggfunc = ["mean", "std", "median"])

In [45]:
android_installs_pivot_table.sort_values(by = ("mean", "installs"), ascending = False).head(10).round(-4)

category,mean installs,std installs,median installs
COMMUNICATION,38460000.0,156540000.0,500000
VIDEO_PLAYERS,24730000.0,119080000.0,1000000
SOCIAL,23250000.0,121490000.0,100000
PHOTOGRAPHY,17840000.0,66750000.0,1000000
PRODUCTIVITY,16790000.0,78220000.0,100000
GAME,15590000.0,53680000.0,1000000
TRAVEL_AND_LOCAL,13980000.0,98590000.0,100000
ENTERTAINMENT,11640000.0,24590000.0,1000000
TOOLS,10800000.0,57300000.0,100000
NEWS_AND_MAGAZINES,9550000.0,77470000.0,50000


As we can see, the Communication category has the highest mean number of installs (about 38.5 million), with Video Players (about 24.7 million) in second place and Social (about 23.3 million) in third. However, each of these categories has an extremely high standard deviation: about 157 million, 119 million, and 121 million, respectively, which seems to indicate that each category contains apps which are huge outliers in terms of popularity. In comparison, the medians, which are much more resistant to outliers, are 500,000, 1,000,000, and 100,000 for Communication, Video Players, and Social, respectively. These values are much more modest than the means, and they provide a picture of the popularity of a typical app in each of those categories that isn't as skewed by the large outliers.

An ambitious hypothetical app developer might still want to target one of those categories, since the high mean number of installs is an indicator of the large potential demand, but they would face stiff competition from the already-existing apps that dominate each category. This is especially true of the Communication and Social categories due to the extremely strong influence of [network effects](https://en.wikipedia.org/wiki/Network_effect) for the established players. It would take a lot of resources to even have a chance at competing with the most popular existing apps, and the chances of successfully doing so are slim. Before discussing an alternative strategy for how a more cautious hypothetical app developer with fewer resources at their disposal might choose a category to explore, we'll take a closer look at the top ten apps from each of the three most popular categories.

In [46]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "COMMUNICATION")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
1,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
25,Skype - free IM & video calls,COMMUNICATION,4.1,10484169,Varies with device,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
3,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56646578,Varies with device,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
93,Gmail,COMMUNICATION,4.3,4604483,Varies with device,1000000000,Free,0,Everyone,Communication,"August 2, 2018",Varies with device,Varies with device
33,Google Chrome: Fast & Secure,COMMUNICATION,4.3,9643041,Varies with device,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
123,Hangouts,COMMUNICATION,4.0,3419513,Varies with device,1000000000,Free,0,Everyone,Communication,"July 21, 2018",Varies with device,Varies with device
23,LINE: Free Calls & Messages,COMMUNICATION,4.2,10790289,Varies with device,500000000,Free,0,Everyone,Communication,"July 26, 2018",Varies with device,Varies with device
20,Viber Messenger,COMMUNICATION,4.3,11335481,Varies with device,500000000,Free,0,Everyone,Communication,"July 18, 2018",Varies with device,Varies with device
91,imo free video calls and chat,COMMUNICATION,4.3,4785988,11M,500000000,Free,0,Everyone,Communication,"June 8, 2018",9.8.000000010501,4.0 and up
192,Google Duo - High Quality Video Calls,COMMUNICATION,4.6,2083237,Varies with device,500000000,Free,0,Everyone,Communication,"July 31, 2018",37.1.206017801.DR37_RC14,4.4 and up


The Communication category is dominated by well-established players such as WhatsApp, Skype, Messenger, and Gmail. Six of the top ten apps have at least 1 billion installs, and all ten have at least 500 million.
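
The same filter, sort, and `head` pattern is used for every category table in this section, so it could be wrapped in a small helper. The sketch below is only an illustration: `top_apps_by_installs`, `toy_apps`, and `toy_free_mask` are hypothetical names, and the rows are invented stand-ins for `android_apps` and `android_english_free_mask`.

```python
import pandas as pd


def top_apps_by_installs(apps, mask, category, n=10):
    """Return the n most-installed rows of `apps` in `category` that also satisfy `mask`."""
    in_category = apps["category"] == category
    return apps.loc[mask & in_category].sort_values(by="installs", ascending=False).head(n)


# Toy stand-in for android_apps; the rows are invented for illustration only.
toy_apps = pd.DataFrame({
    "app": ["Chat A", "Chat B", "Chat C", "Video A", "Chat D"],
    "category": ["COMMUNICATION", "COMMUNICATION", "COMMUNICATION",
                 "VIDEO_PLAYERS", "COMMUNICATION"],
    "installs": [1_000_000_000, 500_000_000, 10_000, 100_000_000, 1_000_000_000],
    "type": ["Free", "Free", "Free", "Free", "Paid"],
})
toy_free_mask = toy_apps["type"] == "Free"

top_communication = top_apps_by_installs(toy_apps, toy_free_mask, "COMMUNICATION", n=2)
print(top_communication)

# Count how many of the selected apps reach the one-billion-install tier.
billion_tier = (top_communication["installs"] >= 1_000_000_000).sum()
print(billion_tier)
```

With the real data, a call along the lines of `top_apps_by_installs(android_apps, android_english_free_mask, "COMMUNICATION")` should give the same rows as the cell above, assuming the mask and column names are as used in this notebook.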

In [47]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "VIDEO_PLAYERS")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
7,YouTube,VIDEO_PLAYERS,4.3,25655305,Varies with device,1000000000,Free,0,Teen,Video Players & Editors,"August 2, 2018",Varies with device,Varies with device
377,Google Play Movies & TV,VIDEO_PLAYERS,3.7,906384,Varies with device,1000000000,Free,0,Teen,Video Players & Editors,"August 6, 2018",Varies with device,Varies with device
53,MX Player,VIDEO_PLAYERS,4.5,6474672,Varies with device,500000000,Free,0,Everyone,Video Players & Editors,"August 6, 2018",Varies with device,Varies with device
339,VLC for Android,VIDEO_PLAYERS,4.4,1032076,Varies with device,100000000,Free,0,Everyone,Video Players & Editors,"July 30, 2018",Varies with device,2.3 and up
1322,Motorola Gallery,VIDEO_PLAYERS,3.9,121916,23M,100000000,Free,0,Everyone,Video Players & Editors,"January 25, 2016",Varies with device,Varies with device
1883,Motorola FM Radio,VIDEO_PLAYERS,3.9,54815,Varies with device,100000000,Free,0,Everyone,Video Players & Editors,"May 2, 2018",Varies with device,Varies with device
32,VivaVideo - Video Editor & Photo Movie,VIDEO_PLAYERS,4.6,9879473,40M,100000000,Free,0,Teen,Video Players & Editors,"August 4, 2018",7.2.1,4.1 and up
196,Dubsmash,VIDEO_PLAYERS,4.2,1971777,29M,100000000,Free,0,Teen,Video Players & Editors,"May 11, 2018",2.35.8,4.1 and up
108,"VideoShow-Video Editor, Video Maker, Beauty Ca...",VIDEO_PLAYERS,4.6,4016834,Varies with device,100000000,Free,0,Everyone,Video Players & Editors,"July 23, 2018",Varies with device,Varies with device
221,Vigo Video,VIDEO_PLAYERS,4.3,1615596,Varies with device,50000000,Free,0,Teen,Video Players & Editors,"August 3, 2018",Varies with device,4.0.3 and up


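In the Video Players category, the top two apps, YouTube and Google Play Movies & TV, each have at least 1 billion installs and are very dominant. Both are owned by Google, so they are probably installed by default on most Android phones. After MX Player at 500 million installs, the rest of the top ten have 100 million installs or fewer.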

In [48]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "SOCIAL")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
0,Facebook,SOCIAL,4.1,78158306,Varies with device,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
89,Google+,SOCIAL,4.2,4831125,Varies with device,1000000000,Free,0,Teen,Social,"July 26, 2018",Varies with device,Varies with device
2,Instagram,SOCIAL,4.5,66577446,Varies with device,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
12,Snapchat,SOCIAL,4.0,17015352,Varies with device,500000000,Free,0,Teen,Social,"July 30, 2018",Varies with device,Varies with device
36,Facebook Lite,SOCIAL,4.3,8606259,Varies with device,500000000,Free,0,Teen,Social,"August 1, 2018",Varies with device,Varies with device
65,VK,SOCIAL,3.8,5793284,Varies with device,100000000,Free,0,Mature 17+,Social,"August 3, 2018",Varies with device,Varies with device
68,Tik Tok - including musical.ly,SOCIAL,4.4,5637451,59M,100000000,Free,0,Teen,Social,"August 3, 2018",8.0.0,4.1 and up
100,Pinterest,SOCIAL,4.6,4305441,Varies with device,100000000,Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
116,Tango - Live Video Broadcast,SOCIAL,4.3,3806669,Varies with device,100000000,Free,0,Mature 17+,Social,"August 1, 2018",Varies with device,Varies with device
117,Badoo - Free Chat & Dating App,SOCIAL,4.3,3781770,Varies with device,100000000,Free,0,Mature 17+,Social,"August 2, 2018",Varies with device,Varies with device


The Social category is dominated by Facebook, which also owns Instagram; together with Facebook Lite, that gives one company three of the top five spots. Note that Google+ has since been shut down. Its high install count probably reflects its status as a default app on most Android phones.

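For a developer with fewer resources, a more informative ranking sorts the categories by median installs in descending order and breaks ties by standard deviation in ascending order. A high median means that a typical app in the category attracts many users, while a low standard deviation suggests that installs are not concentrated in a handful of dominant apps.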

In [49]:
android_installs_pivot_table.sort_values(by = [("median", "installs"), ("std", "installs")], ascending = [False, True]).head(10).round(-4)

category,mean installs,std installs,median installs
EDUCATION,1840000.0,2810000.0,1000000
WEATHER,5070000.0,11570000.0,1000000
SHOPPING,7040000.0,17730000.0,1000000
ENTERTAINMENT,11640000.0,24590000.0,1000000
GAME,15590000.0,53680000.0,1000000
PHOTOGRAPHY,17840000.0,66750000.0,1000000
VIDEO_PLAYERS,24730000.0,119080000.0,1000000
HOUSE_AND_HOME,1330000.0,2620000.0,500000
FOOD_AND_DRINK,1920000.0,3280000.0,500000
HEALTH_AND_FITNESS,4190000.0,31000000.0,500000


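When sorted this way, seven categories share the highest median of 1,000,000 installs. Among them, Education has by far the lowest standard deviation (about 2.8 million), followed by Weather (about 11.6 million), Shopping (about 17.7 million), and Entertainment (about 24.6 million). We'll take a closer look at the top ten apps in the first three.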
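
The two-key sort can be sketched on toy data as follows. This is only an illustration under stated assumptions: `toy_apps` and `toy_pivot` are hypothetical, the install counts are invented, and the way `android_installs_pivot_table` was actually built may differ from the `pivot_table` call used here.

```python
import pandas as pd

# Invented install counts for three made-up categories.
toy_apps = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "installs": [1_000, 5_000, 9_000, 1_000, 5_000, 900_000, 100, 500, 900],
})

# Aggregating one value column with several functions gives two-level
# column labels such as ("median", "installs").
toy_pivot = toy_apps.pivot_table(index="category", values="installs",
                                 aggfunc=["mean", "std", "median"])

# Highest median first; among equal medians, lowest standard deviation first.
ranked = toy_pivot.sort_values(
    by=[("median", "installs"), ("std", "installs")],
    ascending=[False, True],
)
print(ranked)
```

Categories A and B have the same median, so the tie is broken in favor of A, whose installs are spread much more evenly. Category C comes last because its median is lower, regardless of its small standard deviation.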

In [50]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "EDUCATION")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
320,"Learn languages, grammar & vocabulary with Mem...",EDUCATION,4.7,1107948,Varies with device,10000000,Free,0,Everyone,Education,"August 2, 2018",Varies with device,Varies with device
1395,Remind: School Communication,EDUCATION,4.5,108613,Varies with device,10000000,Free,0,Everyone,Education,"August 3, 2018",Varies with device,Varies with device
781,Learn English with Wlingua,EDUCATION,4.7,314300,3.3M,10000000,Free,0,Everyone,Education,"May 2, 2018",1.94.9,4.0 and up
741,Math Tricks,EDUCATION,4.5,342918,8.1M,10000000,Free,0,Everyone,Education,"July 29, 2018",2.24,4.0 and up
1701,Google Classroom,EDUCATION,4.2,69498,Varies with device,10000000,Free,0,Everyone,Education,"July 19, 2018",Varies with device,Varies with device
987,Lumosity: #1 Brain Games & Cognitive Training App,EDUCATION,4.2,215301,Varies with device,10000000,Free,0,Everyone,Education,"August 1, 2018",Varies with device,Varies with device
997,Quizlet: Learn Languages & Vocab with Flashcards,EDUCATION,4.6,211856,Varies with device,10000000,Free,0,Everyone,Education,"August 1, 2018",Varies with device,Varies with device
1202,ClassDojo,EDUCATION,4.4,148550,59M,10000000,Free,0,Everyone,Education;Education,"August 3, 2018",4.21.1,4.1 and up
3008,Mermaids,EDUCATION,4.2,14286,Varies with device,5000000,Free,0,Everyone,Education;Creativity,"April 26, 2018",Varies with device,4.1 and up
1879,Learn 50 languages,EDUCATION,4.4,55256,14M,5000000,Free,0,Everyone,Education,"June 19, 2018",10.9.1,4.0 and up


The Education category could be promising for an app developer with modest resources. The top eight apps each have 10 million installs, so no single app stands far above the rest, and several of the most popular ones are for language learning.

In [51]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "WEATHER")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
173,Weather & Clock Widget for Android,WEATHER,4.4,2371543,11M,50000000,Free,0,Everyone,Weather,"June 4, 2018",5.9.4.0,4.0.3 and up
229,The Weather Channel: Rain Forecast & Storm Alerts,WEATHER,4.4,1558437,Varies with device,50000000,Free,0,Everyone,Weather,"August 1, 2018",Varies with device,Varies with device
255,"GO Weather - Widget, Theme, Wallpaper, Efficient",WEATHER,4.5,1422858,Varies with device,50000000,Free,0,Everyone,Weather,"August 3, 2018",Varies with device,Varies with device
193,AccuWeather: Daily Forecast & Live Weather Rep...,WEATHER,4.4,2053404,Varies with device,50000000,Free,0,Everyone,Weather,"August 6, 2018",Varies with device,Varies with device
1057,wetter.com - Weather and Radar,WEATHER,4.2,189313,38M,10000000,Free,0,Everyone,Weather,"August 6, 2018",Varies with device,Varies with device
2663,HTC Weather,WEATHER,3.9,22154,Varies with device,10000000,Free,0,Everyone,Weather,"August 10, 2017",8.50.935520,4.4 and up
2804,Weather,WEATHER,4.2,18773,12M,10000000,Free,0,Everyone,Weather,"May 24, 2018",1.3.A.2.9,4.4 and up
1085,MyRadar NOAA Weather Radar,WEATHER,4.5,178934,Varies with device,10000000,Free,0,Everyone,Weather,"August 4, 2018",Varies with device,Varies with device
875,Amber Weather,WEATHER,4.4,260137,13M,10000000,Free,0,Everyone 10+,Weather,"July 16, 2018",3.8.1,4.1 and up
839,Weather 14 Days,WEATHER,4.4,279917,Varies with device,10000000,Free,0,Everyone,Weather,"July 18, 2018",Varies with device,Varies with device


The Weather category is probably fairly saturated. It is also hard for a new app to differentiate itself, because at its core a weather app can only offer so much: accurate forecasts and an attractive design.

In [52]:
android_apps.loc[android_english_free_mask & (android_apps["category"] == "SHOPPING")].sort_values(by = "installs", ascending = False).head(10)

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
57,Wish - Shopping Made Fun,SHOPPING,4.5,6212081,15M,100000000,Free,0,Everyone,Shopping,"August 3, 2018",4.20.5,4.1 and up
64,"AliExpress - Smarter Shopping, Better Living",SHOPPING,4.6,5917485,Varies with device,100000000,Free,0,Teen,Shopping,"August 6, 2018",Varies with device,Varies with device
63,Flipkart Online Shopping App,SHOPPING,4.4,6012719,Varies with device,100000000,Free,0,Teen,Shopping,"August 6, 2018",Varies with device,Varies with device
148,eBay: Buy & Sell this Summer - Discover Deals ...,SHOPPING,4.4,2788923,Varies with device,100000000,Free,0,Teen,Shopping,"July 30, 2018",Varies with device,Varies with device
376,Amazon Shopping,SHOPPING,4.3,909226,42M,100000000,Free,0,Teen,Shopping,"July 31, 2018",16.14.0.100,4.4 and up
327,The birth,SHOPPING,4.7,1084945,Varies with device,50000000,Free,0,Teen,Shopping,"August 3, 2018",Varies with device,Varies with device
355,"letgo: Buy & Sell Used Stuff, Cars & Real Estate",SHOPPING,4.5,973270,20M,50000000,Free,0,Teen,Shopping,"August 6, 2018",2.4.9,4.1 and up
395,OLX - Buy and Sell,SHOPPING,4.2,857923,18M,50000000,Free,0,Everyone,Shopping,"July 31, 2018",11.7.3.0,4.1 and up
272,Myntra Online Shopping App,SHOPPING,4.3,1315242,Varies with device,50000000,Free,0,Everyone,Shopping,"July 21, 2018",3.27.1,4.1 and up
263,"Groupon - Shop Deals, Discounts & Coupons",SHOPPING,4.6,1371082,Varies with device,50000000,Free,0,Teen,Shopping,"August 3, 2018",Varies with device,Varies with device


The Shopping category seems to consist mainly of apps that serve as storefronts for existing e-commerce websites, such as AliExpress, Flipkart, eBay, and Amazon. It is therefore not realistic for a hypothetical app developer to compete here without an e-commerce platform to leverage.