# Introduction

In this guided project, we work with two sets of mobile app data -- one from the iOS App Store and one from the Android Google Play Store -- to understand what types of apps are likely to attract the most users. Our focus will be on free apps that are targeted at an English-speaking audience. While this project was originally written in pure Python, we will instead use the usual Python data science packages: NumPy, pandas, Matplotlib, and seaborn.

Since there are over four million apps available between the two major mobile app stores, we will work with samples from each that have been collected and uploaded to Kaggle. The first sample is a [set of approximately 10,000 Google Play Store apps](https://www.kaggle.com/lava18/google-play-store-apps) that was scraped in February 2019. The second is a [set of approximately 7,200 iOS App Store apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) that was scraped in June 2018.

The Google Play Store data contains the following columns.

- `App` - Application name
- `Category` - Category the app belongs to
- `Rating` - Overall user rating of the app (as of when scraped)
- `Reviews` - Number of user reviews for the app (as of when scraped)
- `Size` - Size of the app (as of when scraped)
- `Installs` - Number of user downloads/installs for the app (as of when scraped)
- `Type` - Paid or Free
- `Price` - Price of the app (as of when scraped)
- `Content Rating` - Age group the app is targeted at - Children / Mature 21+ / Adult
- `Genres` - Genres the app belongs to, which can be several in addition to its main category. For example, a musical family game belongs to the Music, Game, and Family genres.
- `Last Updated` - Date when the app was last updated on the Play Store (as of when scraped)
- `Current Ver` - Current version of the app available on the Play Store (as of when scraped)
- `Android Ver` - Minimum required Android version (as of when scraped)

The iOS App Store data contains the following columns.

- `id` - App ID
- `track_name` - App name
- `size_bytes` - Size (in bytes)
- `currency` - Currency type
- `price` - Price amount
- `rating_count_tot` - User rating count (for all versions)
- `rating_count_ver` - User rating count (for current version)
- `user_rating` - Average user rating value (for all versions)
- `user_rating_ver` - Average user rating value (for current version)
- `ver` - Latest version code
- `cont_rating` - Content rating
- `prime_genre` - Primary genre
- `sup_devices.num` - Number of supported devices
- `ipadSc_urls.num` - Number of screenshots shown for display
- `lang.num` - Number of supported languages
- `vpp_lic` - VPP device-based licensing enabled

# Loading data and initial cleaning

We start by importing our usual Python data science packages and then loading the two sets of app data into separate dataframes to perform some initial inspections. In particular, we will start off by cleaning up the column names to make them a little easier to work with.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn's default plot theme
```

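```python
# Google Play Store sample
android_apps_filepath = "googleplaystore.csv"
android_apps = pd.read_csv(android_apps_filepath)

# iOS App Store sample; pandas labels the CSV's unnamed first column
# "Unnamed: 0", so we drop it right after loading
ios_apps_filepath = "AppleStore.csv"
ios_apps = pd.read_csv(ios_apps_filepath).drop(columns=["Unnamed: 0"])
```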
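
As a rough illustration of what an initial inspection can look like, the sketch below summarizes each column's dtype, null count, and number of distinct values. The frame `sample_apps` and the helper `quick_overview` are hypothetical names with made-up rows, used only so the example runs on its own; in the notebook the same helper could be applied to `android_apps` or `ios_apps`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the CSV files.
sample_apps = pd.DataFrame({
    "App": ["Sample App A", "Sample App B", "Sample App C"],
    "Rating": [4.1, np.nan, 3.9],
    "Installs": ["10,000+", "500+", "1,000,000+"],
})


def quick_overview(df):
    """Return one row per column with its dtype, null count, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


overview = quick_overview(sample_apps)
print(sample_apps.shape)
print(overview)
```

A summary like this makes it easy to spot columns that contain nulls or that were read as text when a numeric type was expected.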

The column names for the Google Play Store data are all reasonably descriptive and concise, so the only cleaning we do for them is converting each name into snake case, which is the style PEP 8 recommends for Python variable and function names.

```python
# Convert column names to snake case: spaces become underscores, then lowercase
android_apps_cols = android_apps.columns
android_apps_cols = android_apps_cols.str.replace(" ", "_")
android_apps_cols = android_apps_cols.str.lower()
android_apps.columns = android_apps_cols
```
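
To see what this conversion produces without loading the CSV, the sketch below applies the same two string operations to the Google Play Store column names listed in the introduction. The names `original_cols` and `snake_cols` are introduced here only for illustration.

```python
import pandas as pd

# Column names as listed in the Google Play Store data description.
original_cols = pd.Index([
    "App", "Category", "Rating", "Reviews", "Size", "Installs", "Type",
    "Price", "Content Rating", "Genres", "Last Updated", "Current Ver",
    "Android Ver",
])

# Same two steps as in the cell above, chained:
# spaces become underscores, then everything is lowercased.
snake_cols = original_cols.str.replace(" ", "_").str.lower()
print(list(snake_cols))
```

The four multi-word names become `content_rating`, `last_updated`, `current_ver`, and `android_ver`; the single-word names are simply lowercased.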

The column names for the iOS App Store data can be left as is, though a few could be made more concise. For example, `rating_count_tot` could become `oa_reviews` and `rating_count_ver` could become `cur_ver_reviews`, which in turn lets `user_rating` and `user_rating_ver` be shortened to `oa_rating` and `cur_ver_rating`, respectively. We also rename the three columns whose names contain a dot (`sup_devices.num`, `ipadSc_urls.num`, and `lang.num`) to `num_sup_devices`, `num_screenshots`, and `num_langs`. This is purely a matter of personal preference.

```python
# Map selected iOS column names to shorter, dot-free alternatives
columns_dict = {"rating_count_tot": "oa_reviews", "rating_count_ver": "cur_ver_reviews",
                "user_rating": "oa_rating", "user_rating_ver": "cur_ver_rating",
                "sup_devices.num": "num_sup_devices", "ipadSc_urls.num": "num_screenshots",
                "lang.num": "num_langs"}

# Replace the mapped names in a Series copy of the columns, then assign it back
ios_apps_cols = ios_apps.columns.to_series()
ios_apps_cols.replace(columns_dict, inplace=True)
ios_apps.columns = ios_apps_cols
```
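
The same renaming can also be expressed with `DataFrame.rename`, which accepts the mapping directly. The sketch below is an alternative, not the approach used in the cell above; `ios_demo` is a hypothetical empty frame that only carries the iOS column names listed in the introduction, so the example runs without the CSV.

```python
import pandas as pd

# Hypothetical empty frame with the iOS column names from the data description.
ios_demo = pd.DataFrame(columns=[
    "id", "track_name", "size_bytes", "currency", "price",
    "rating_count_tot", "rating_count_ver", "user_rating", "user_rating_ver",
    "ver", "cont_rating", "prime_genre", "sup_devices.num",
    "ipadSc_urls.num", "lang.num", "vpp_lic",
])

columns_dict = {
    "rating_count_tot": "oa_reviews",
    "rating_count_ver": "cur_ver_reviews",
    "user_rating": "oa_rating",
    "user_rating_ver": "cur_ver_rating",
    "sup_devices.num": "num_sup_devices",
    "ipadSc_urls.num": "num_screenshots",
    "lang.num": "num_langs",
}

# rename() returns a new frame; names missing from the mapping stay unchanged.
ios_demo = ios_demo.rename(columns=columns_dict)
print(list(ios_demo.columns))
```

Either way, the renamed columns no longer contain dots, so they can be accessed with attribute syntax such as `ios_demo.num_langs` in addition to bracket indexing.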

Now that we have cleaned up the column names for each data set to our liking, we move on to the rest of the cleaning. This will involve four main steps:

1. Checking for and handling null values.
2. Checking for duplicate apps and deciding which duplicates to remove.
3. Removing apps that aren't targeted toward English-speaking audiences.
4. Removing apps that are not free.
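
The four steps above can be sketched on a toy frame as follows. Everything here is an assumption made for illustration: the rows in `toy` are invented, keeping the duplicate with the most reviews is only one possible rule, and treating all-ASCII names as English is a crude heuristic. It is not necessarily the approach taken in the sections that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented purely to exercise each step.
toy = pd.DataFrame({
    "app": ["Alpha Notes", "Alpha Notes", "Beta Chat",
            "Приложение", "Gamma Pro", "Delta Maps"],
    "rating": [4.5, 4.5, np.nan, 4.0, 4.8, 3.9],
    "reviews": [100, 250, 40, 900, 75, 60],
    "type": ["Free", "Free", "Free", "Free", "Paid", "Free"],
})

# 1. Null values: count them per column before deciding how to handle them.
null_counts = toy.isna().sum()

# 2. Duplicates: flag every row whose app name appears more than once.
dup_mask = toy.duplicated(subset="app", keep=False)
# Assumed rule: for each name, keep the row with the most reviews.
deduped = (
    toy.sort_values("reviews", ascending=False)
       .drop_duplicates(subset="app")
       .sort_index()
)

# 3. English audience: assumed heuristic that keeps all-ASCII app names.
is_ascii = deduped["app"].map(lambda name: name.isascii())
english = deduped[is_ascii]

# 4. Free apps only.
free_english = english[english["type"] == "Free"]

print(null_counts)
print(free_english)
```

In this toy example the null in `rating` is only counted, not handled; whether to drop or fill such rows depends on how the column is used later in the analysis.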

## Checking for and handling null values

## Checking for and handling duplicate apps

## Removing apps that are not targeted toward English-speaking audiences

## Removing apps that are not free