# Introduction

In this guided project, we work with two sets of mobile app data -- one from the iOS App Store and one from the Android Google Play Store -- in order to understand what types of apps are likely to attract the most users. Our focus will be on free apps that are targeted at an English-speaking audience. More specifically, we will explore the most common and the most popular free apps by genre to see whether any app genres are particularly popular. While this project was originally written in pure Python, we will instead make use of the usual Python data science packages: NumPy, pandas, Matplotlib, and seaborn.

Since there are over four million apps available between the two major mobile app stores, we will work with samples from each that have been collected and uploaded to Kaggle. The first sample is a [set of approximately 10,000 Google Play Store apps](https://www.kaggle.com/lava18/google-play-store-apps) that was scraped in February 2019. The second is a [set of approximately 7,200 iOS App Store apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) that was scraped in June 2018.

The Google Play Store data contains the following columns.

- `App` - Application name
- `Category` - Category the app belongs to
- `Rating` - Overall user rating of the app (as when scraped)
- `Reviews` - Number of user reviews for the app (as when scraped)
- `Size` - Size of the app (as when scraped)
- `Installs` - Number of user downloads/installs for the app (as when scraped)
- `Type` - Paid or Free
- `Price` - Price of the app (as when scraped)
- `Content Rating` - Age group the app is targeted at - Children / Mature 21+ / Adult
- `Genres` - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.
- `Last Updated` - Date when the app was last updated on Play Store (as when scraped)
- `Current Ver` - Current version of the app available on Play Store (as when scraped)
- `Android Ver` - Min required Android version (as when scraped)

The iOS App Store data contains the following columns.

- `id` - App ID
- `track_name` - App name
- `size_bytes` - Size (in bytes)
- `currency` - Currency type
- `price` - Price amount
- `rating_count_tot` - User rating count (for all versions)
- `rating_count_ver` - User rating count (for current version)
- `user_rating` - Average user rating value (for all versions)
- `user_rating_ver` - Average user rating value (for current version)
- `ver` - Latest version code
- `cont_rating` - Content rating
- `prime_genre` - Primary genre
- `sup_devices.num` - Number of supporting devices
- `ipadSc_urls.num` - Number of screenshots shown for display
- `lang.num` - Number of supported languages
- `vpp_lic` - VPP device-based licensing enabled

# Loading data and initial cleaning

We start by importing our usual Python data science packages and then loading the two sets of app data into separate dataframes to perform some initial inspections. In particular, we will start off by cleaning up the column names to make them a little easier to work with.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
android_apps_filepath = "googleplaystore.csv"
android_apps = pd.read_csv(android_apps_filepath)

ios_apps_filepath = "AppleStore.csv"
ios_apps = pd.read_csv(ios_apps_filepath).drop(columns = ["Unnamed: 0"])
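
The `Unnamed: 0` column dropped above is most likely a row-number column that was saved to the CSV without a header name. As a minimal sketch of that behavior, the example below loads a small in-memory CSV (the two rows are made up for illustration and are not taken from `AppleStore.csv`) and compares dropping the column with reading it as the index via `index_col=0`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for AppleStore.csv. The blank first header
# field is what pandas names "Unnamed: 0" when the file is loaded.
csv_text = ",id,track_name\n1,281656475,Example App A\n2,281796108,Example App B\n"

# Approach used above: load everything, then drop the leftover column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Alternative: read the first column as the index, then reset to a default index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

print(dropped.columns.tolist())
print(dropped.equals(indexed))
```

Either approach leaves the same data columns; dropping the column, as done above, keeps the default zero-based index without an extra step.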

The column names for the Google Play Store data are all reasonably descriptive and concise, so the only cleaning we do for them is converting each name into snake case, which is the preferred style for Python.

In [3]:
android_apps_cols = android_apps.columns
android_apps_cols = android_apps_cols.str.replace(" ", "_")
android_apps_cols = android_apps_cols.str.lower()
android_apps.columns = android_apps_cols
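
The same conversion can also be written as one chained expression. The sketch below uses an empty toy dataframe (a hypothetical stand-in that reuses three of the Google Play column names) so that it runs without the CSV file.

```python
import pandas as pd

# Toy frame reusing three of the Google Play column names listed earlier.
toy = pd.DataFrame(columns=["App", "Content Rating", "Last Updated"])

# Same two string operations as above, chained into a single expression.
toy.columns = toy.columns.str.replace(" ", "_").str.lower()

print(toy.columns.tolist())
```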

The column names for the iOS App Store data can be left as is, though a few could be made more concise. For example, renaming `rating_count_tot` to `oa_reviews` and `rating_count_ver` to `cur_ver_reviews` lets us shorten `user_rating` and `user_rating_ver` to `oa_rating` and `cur_ver_rating`, respectively. We also replace the `.num` suffixes with a `num_` prefix so that no column name contains a period. This is purely a matter of personal preference.

In [4]:
ios_apps_cols = ios_apps.columns.to_series()
columns_dict = {"rating_count_tot": "oa_reviews", "rating_count_ver": "cur_ver_reviews",
               "user_rating": "oa_rating", "user_rating_ver": "cur_ver_rating",
               "sup_devices.num": "num_sup_devices", "ipadSc_urls.num": "num_screenshots",
               "lang.num": "num_langs"}
ios_apps_cols.replace(columns_dict, inplace = True)
ios_apps.columns = ios_apps_cols
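
An alternative to replacing values in a series of column names is `DataFrame.rename`, which takes the same kind of dictionary and leaves unlisted columns untouched. The sketch below is only an illustration on an empty toy dataframe with a subset of the iOS column names, not the approach used in this project.

```python
import pandas as pd

# Toy frame with a subset of the original iOS column names.
toy = pd.DataFrame(columns=["id", "rating_count_tot", "user_rating", "lang.num"])

# Subset of the mapping defined above.
columns_dict = {
    "rating_count_tot": "oa_reviews",
    "user_rating": "oa_rating",
    "lang.num": "num_langs",
}

# rename returns a new dataframe; columns missing from the mapping keep their names.
renamed = toy.rename(columns=columns_dict)

print(renamed.columns.tolist())
```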

Now that we have cleaned up the column names for each data set to our liking, we move on to the rest of the cleaning. This will involve four main steps:

1. Checking for and handling null values.
2. Checking for duplicate apps and deciding which duplicates to remove.
3. Removing apps that aren't targeted toward English-speaking audiences.
4. Removing apps that are not free.

## Checking for and handling null values

The first step in our data cleaning is checking for and handling null values.

In [5]:
android_apps.isnull().sum()

app                  0
category             0
rating            1474
reviews              0
size                 0
installs             0
type                 1
price                0
content_rating       1
genres               0
last_updated         0
current_ver          8
android_ver          3
dtype: int64

In [6]:
ios_apps.isnull().sum()

id                 0
track_name         0
size_bytes         0
currency           0
price              0
oa_reviews         0
cur_ver_reviews    0
oa_rating          0
cur_ver_rating     0
ver                0
cont_rating        0
prime_genre        0
num_sup_devices    0
num_screenshots    0
num_langs          0
vpp_lic            0
dtype: int64

We see that while the iOS App Store data doesn't contain any null values, the Google Play Store data does. In particular, about 1500 apps are missing rating information, and a few apps are missing other information, such as type, content rating, app version info, and minimum required Android version. Since only a handful of apps are missing information other than the rating, we'll look at those individually to decide how to handle them before proceeding to the question of the null values in the ratings column. We start with the app which has a null value in the `type` column.

In [7]:
android_apps[android_apps["type"].isnull()]

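| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9148 | Command & Conquer: Rivals | FAMILY | NaN | 0 | Varies with device | 0 | NaN | 0 | Everyone 10+ | Strategy | June 28, 2018 | Varies with device | Varies with device |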


We can see that while this app, [Command & Conquer: Rivals](https://play.google.com/store/apps/details?id=com.ea.gp.candcwarzones), has a null value in the `type` column, it has a `price` value of 0. This indicates that the app is free, which we can confirm by looking up the Play Store listing for the game. Therefore, we can simply replace the null value with "Free".

In [8]:
android_apps.loc[9148, "type"] = "Free"

In [9]:
android_apps["type"].isnull().sum()

0
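
The reasoning used for this single row (a price of 0 implies a free app) could also be written as a rule that fills every missing `type` value whose price is 0. The sketch below is a hedged illustration on a made-up three-row dataframe. It assumes that prices are stored as strings such as `"0"` and `"$4.99"`, and it is not needed for the real data, where only one row is affected.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; prices are assumed to be stored as strings.
toy = pd.DataFrame(
    {
        "app": ["Free app", "Paid app", "Unlabelled app"],
        "type": ["Free", "Paid", np.nan],
        "price": ["0", "$4.99", "0"],
    }
)

# Only fill rows where the type is missing AND the price is zero.
missing_type = toy["type"].isnull()
is_zero_price = toy["price"] == "0"
toy.loc[missing_type & is_zero_price, "type"] = "Free"

print(toy)
```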

Now that we have taken care of the row with a missing `type` value, we'll move on to the row with a missing `content_rating` value.

In [10]:
android_apps[android_apps["content_rating"].isnull()]

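| | app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |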


Inspecting this row reveals that it has more issues than just a missing `content_rating` value. In fact, a missing `category` value caused the values in all of the later columns to be shifted one position to the left. To fix this row, we can look up its [Play Store listing](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) to confirm that it should have a `category` value of "LIFESTYLE". We then create a working copy of the row, shift its values one position to the right, and update the `category` and `genres` values to reflect that this is a lifestyle app.

In [11]:
s = android_apps.loc[10472].copy()

In [12]:
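# Shift every value after `app` one column to the right: the new
# rating..android_ver values come from the old category..current_ver positions.
s.iloc[2:] = android_apps.iloc[10472, 1:-1].values

# Fill in the category and genre confirmed from the Play Store listing.
s["category"] = "LIFESTYLE"
s["genres"] = "Lifestyle"

# The shifted rating was read as text ("1.9"); convert it so the column stays numeric.
s["rating"] = float(s["rating"])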

In [13]:
android_apps.loc[10472] = s

In [14]:
android_apps["content_rating"].isnull().sum()

0
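
The shift-to-the-right repair can also be expressed with `Series.shift`. The sketch below is an alternative illustration on a made-up two-row dataframe, not the code used above. It uses only text columns, so the type conversions a numeric column such as `rating` would need do not arise, and the `"LIFESTYLE"` value stands in for a category that would be looked up by hand.

```python
import numpy as np
import pandas as pd

# Made-up data: in the second row the category is missing, so every later
# value sits one column too far to the left and the last column is empty.
toy = pd.DataFrame(
    {
        "app": ["Good app", "Shifted app"],
        "category": ["TOOLS", "3.0M"],
        "size": ["5.0M", "Free"],
        "type": ["Free", "Everyone"],
        "content_rating": ["Everyone", np.nan],
    }
)

row = toy.loc[1].copy()

# Move everything after `app` one position to the right; the vacated
# `category` slot becomes NaN until it is filled in below.
row.iloc[1:] = row.iloc[1:].shift(1).to_numpy()

# Hypothetical value standing in for one confirmed from the store listing.
row["category"] = "LIFESTYLE"

toy.loc[1] = row
print(toy.loc[1].tolist())
```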

While there are still columns with missing values, some of them, such as the `android_ver` column, hold information that we can no longer look up by hand through the Google Play Store as of July 2020. In addition, since our ultimate goal in this project is to identify app genres which are particularly popular, the null values that remain are not directly relevant to our analysis at this point. A potential future extension of this project would involve handling the null values in the `rating` column to allow us to explore possible relationships between an app's rating and its popularity. For now, though, we move on to checking for and handling duplicate apps in the two datasets.

## Checking for and handling duplicate apps

## Removing apps that are not targeted toward English-speaking audiences

## Removing apps that are not free