# Profitable App Profiles for the App Store and Google Play Markets

- About: This project analyzes the apps a company has available on Google Play and the App Store. As the company only builds apps that are free to download and install, its main source of revenue is in-app ads. The ad revenue for any given app depends mostly on the number of people who use it, i.e. the more users who see and engage with the ads, the better.
- Goal: The goal for this project is to analyze data to help the developers understand what type of apps are likely to attract more users.

## First Steps

The first steps when performing data analysis involve importing the necessary package(s), bringing in the data to be analyzed, and performing a cursory exploration of the data sets.

In [1]:
# Import the necessary package(s)
import pandas as pd

In [2]:
# Import the data sets and store them as DataFrames
apple_app_store = pd.read_csv('AppleStore.csv')
google_play_store = pd.read_csv('googleplaystore.csv')

In [3]:
# Explore both data sets

# Print the first few rows of each data set
print("The first few rows of the App Store data set: \n", apple_app_store.head())
print("\n")
print("The first few rows of the Google Play data set: \n", google_play_store.head())
print("\n")

# Find the number of rows and columns of each data set
print("The number of rows and columns in the App Store data set is: ", apple_app_store.shape)
print("The number of rows and columns in the Google Play data set is: ", google_play_store.shape)

The first few rows of the App Store data set: 
           id               track_name  size_bytes currency  price  \
0  284882215                 Facebook   389879808      USD    0.0   
1  389801252                Instagram   113954816      USD    0.0   
2  529479190           Clash of Clans   116476928      USD    0.0   
3  420009108               Temple Run    65921024      USD    0.0   
4  284035177  Pandora - Music & Radio   130242560      USD    0.0   

   rating_count_tot  rating_count_ver  user_rating  user_rating_ver      ver  \
0           2974676               212          3.5              3.5     95.0   
1           2161558              1289          4.5              4.0    10.23   
2           2130805               579          4.5              4.5  9.24.12   
3           1724546              3842          4.5              4.0    1.6.2   
4           1126879              3594          4.0              4.5    8.4.1   

  cont_rating        prime_genre  sup_devices.num  ipadS

In [4]:
# Print the column names and try to identify the columns that could help with the analysis
print("Column names for the App Store data set: ", apple_app_store.columns.values)
print("Column names for the Google Play data set: ", google_play_store.columns.values)

Column names for the App Store data set:  ['id' 'track_name' 'size_bytes' 'currency' 'price' 'rating_count_tot'
 'rating_count_ver' 'user_rating' 'user_rating_ver' 'ver' 'cont_rating'
 'prime_genre' 'sup_devices.num' 'ipadSc_urls.num' 'lang.num' 'vpp_lic']
Column names for the Google Play data set:  ['App' 'Category' 'Rating' 'Reviews' 'Size' 'Installs' 'Type' 'Price'
 'Content Rating' 'Genres' 'Last Updated' 'Current Ver' 'Android Ver']


## Data Set Description

### For the App Store
- id: App ID
- track_name: App Name
- size_bytes: Size (in Bytes) of APP
- currency: Currency Type
- price: Price Amount
- rating_count_tot: User Rating Counts (for all version)
- rating_count_ver: User rating count (for current version)
- user_rating: Average User Rating Value (for all version)
- user_rating_ver: Average User Rating Value (for current version)
- ver: Latest Version Code
- cont_rating: Content Rating
- prime_genre: Primary Genre
- sup_devices.num: Number of Supporting Devices
- ipadSc_urls.num: Number of Screenshots Showed for Display
- lang.num: Number of Supported Languages
- vpp_lic: Vpp Device Based Licensing Enabled

With respect to the App Store, none of the columns provide sufficient information to determine what type of apps are likely to attract more users. For example, there is no information on the number of installs for a given app that could be paired with user rating data.

### For Google Play
- App: Name of the App
- Category: Category Under Which the App Falls
- Rating: App's Rating on the Play Store
- Reviews: Number of Reviews of the App
- Size: Size of the App
- Installs: Number of Installs of the App
- Type: If the App Is Free/Paid
- Price: Price of the App
- Content Rating: Appropriate Target Audience for the App
- Genres: Genre Under Which the App Falls
- Last Updated: Date When the App Was Last Updated
- Current Ver: Current Version of the App
- Android Ver: Minimum Android Version Required to Run the App

For Google Play, there is data on the number of installs of the app, information on the app's rating on the Play Store, as well as additional data (e.g. review information, genre data, etc.) that can be used to determine what type of apps are likely to attract more users.
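The 'Installs' values are stored as text (e.g. '1,000,000,000+' in the Instagram output further down), so they would need to be converted to numbers before they can be compared or averaged. The following is a minimal sketch of one way to do that. The helper name `parse_installs` is hypothetical and is not part of the original analysis.

```python
def parse_installs(value):
    """Convert an install bucket such as '1,000+' to the integer 1000.

    Hypothetical helper: the result is a lower bound, because the
    trailing '+' marks an open-ended bucket rather than an exact count.
    """
    return int(value.replace(',', '').rstrip('+'))


print(parse_installs('1,000+'))
print(parse_installs('1,000,000,000+'))
```

The trailing '+' is simply discarded here, so the numbers are best read as "at least this many installs".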

## Next Step - Data Cleaning

Before analyzing the data, it is necessary to make sure the data is accurate; otherwise, the results of the analysis will be wrong. This is the purpose of data cleaning, which involves:

- Detecting inaccurate data and correcting or removing it.
  - Inaccurate data: Information that has not been entered correctly or maintained.
- Detecting duplicate data and removing it.
  - Duplicate data: A single entity (in this case, an app) that occupies more than one record in the data set.
- Modifying the data to fit the purpose of the analysis.
  - In this case, the data needs to be modified so that paid apps and non-English apps are not included in the analysis.

### Detecting Inaccurate Data and Correcting or Removing It

[One of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) regarding the Google Play data set describes an error for a certain row. The error is that the value in the 'Category' column is actually the value for the 'Rating' column and the values for all subsequent columns have shifted left by one column (i.e. the value in the 'Rating' column is actually the value for the 'Reviews' column, the value in the 'Reviews' column is actually the value for the 'Size' column, etc.).

So, what should be done about the inaccurate (in this case, missing) data? There are a number of different options, including:

- Dropping the row from the data set.
- Beginning with the 'Category' column, shifting the values right by one column, and:
  - Replacing the value in the 'Category' column with "NO INFO" or
  - Attempting to infer what the value in the 'Category' column should be based on adjacent rows and using that information to replace the value in the 'Category' column.

As the 'Category' column is information that helps the developers understand what type of apps are likely to attract more users, and the correct value for this row is not known, the row is dropped from the data set. This ensures that the inaccurate (i.e. missing) data does not impact the analysis.

In [5]:
# Observe the inaccurate (i.e. missing) data and drop the row from the data set
print(google_play_store.loc[[10472]])
google_play_store = google_play_store.drop(10472)
google_play_store = google_play_store.reset_index(drop=True)

                                           App Category  Rating Reviews  \
10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   

         Size Installs Type     Price Content Rating             Genres  \
10472  1,000+     Free    0  Everyone            NaN  February 11, 2018   

      Last Updated Current Ver Android Ver  
10472       1.0.19  4.0 and up         NaN  
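
The row above was located using the index reported in the Kaggle discussion. As a rough sketch of how such a row could be found without knowing its index in advance, the check below flags any 'Category' value that parses as a number. The DataFrame `toy` and its contents are made up for illustration.

```python
import pandas as pd

# Toy data mimicking the problem: in the second row the category is missing,
# so the rating (1.9) has slipped into the 'Category' column.
toy = pd.DataFrame({
    'App': ['Instagram', 'Shifted App', 'Box'],
    'Category': ['SOCIAL', '1.9', 'BUSINESS'],
})

# A genuine category is text, so any value that parses as a number is suspect.
looks_numeric = pd.to_numeric(toy['Category'], errors='coerce').notna()
print(toy[looks_numeric])

# Drop the suspect rows and renumber the index.
toy_clean = toy[~looks_numeric].reset_index(drop=True)
print(toy_clean)
```

This only catches the specific kind of shift described above; other kinds of malformed rows would need their own checks.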


### Detecting Duplicate Data and Removing It

A search of the [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) of the (Apple) App Store data set shows that there is some duplicate data that needs to be removed.

The Google Play data set also has duplicate data that needs to be removed. For example, Instagram appears four (4) times in the data set.

In [6]:
# Subset the Google Play data set using loc to provide an example of the appearance of duplicate data
print(google_play_store.loc[google_play_store['App'] == 'Instagram'])

            App Category  Rating   Reviews                Size  \
2545  Instagram   SOCIAL     4.5  66577313  Varies with device   
2604  Instagram   SOCIAL     4.5  66577446  Varies with device   
2611  Instagram   SOCIAL     4.5  66577313  Varies with device   
3909  Instagram   SOCIAL     4.5  66509917  Varies with device   

            Installs  Type Price Content Rating  Genres   Last Updated  \
2545  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
2604  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
2611  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
3909  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   

             Current Ver         Android Ver  
2545  Varies with device  Varies with device  
2604  Varies with device  Varies with device  
2611  Varies with device  Varies with device  
3909  Varies with device  Varies with device  


Instagram is not the only culprit. There are 1,181 rows whose app name has already appeared earlier in the data set.

In [7]:
# Create a DataFrame that contains only the 'App' column
google_play_store_dataframe_object = pd.DataFrame(google_play_store, columns=['App'])

# Find duplicate rows
google_play_store_duplicate_row = google_play_store_dataframe_object[google_play_store_dataframe_object.duplicated()]
print(google_play_store_duplicate_row)

                                      App
229          Quick PDF Scanner + OCR FREE
236                                   Box
239                    Google My Business
256                   ZOOM Cloud Meetings
261             join.me - Simple Meetings
...                                   ...
10714                  FarmersOnly Dating
10719  Firefox Focus: The privacy browser
10729                         FP Notebook
10752      Slickdeals: Coupons & Shopping
10767                                AAFP

[1181 rows x 1 columns]
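
Note that `duplicated()` counts repeat rows rather than distinct apps: an app that appears four times contributes three rows to the total. The sketch below illustrates the difference on made-up data (the DataFrame `toy` is not part of the real data set).

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['Instagram', 'Box', 'Instagram', 'Instagram', 'Box', 'Zoom'],
})

# duplicated() marks every occurrence after the first one as True.
repeat_rows = toy.duplicated(subset='App')
print(repeat_rows.sum())

# Number of distinct apps that appear more than once.
apps_with_repeats = toy.loc[repeat_rows, 'App'].nunique()
print(apps_with_repeats)
```

Here three rows are flagged as repeats, but only two distinct apps are affected.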


The duplicate data needs to be removed to prevent the analysis from being affected.

Returning to the example of Instagram, right now there are four (4) occurrences when there should only be one (1). So, this means that there are three (3) occurrences which are not valid (e.g. due to being outdated) and one (1) occurrence which is valid (e.g. the most recent scrape of the data source). In order to prevent the valid occurrence from being removed or an invalid occurrence from being retained, a criterion must be used to remove the duplicates (i.e. the duplicates should not be removed randomly).

The criterion used to remove duplicate data can be based on one or more factors, including:

- The number of reviews.
- The date the app was last updated.
- The current version of the app.
- Etc.

For Instagram, the number of reviews is the main difference between the duplicate entries. The different numbers show the data was collected at different times. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, the row with the highest number of reviews will be kept and the other entries will be removed for any given app. This will first be done for Instagram (to ensure the process works and there are no unintended consequences) and will then be applied to the rest of the duplicate entries in the data set.

In [8]:
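# Make a copy of the DataFrame, which can then be modified without affecting the original DataFrame
google_play_store_copy = google_play_store.copy()

# 'Reviews' may have been read in as text (row 10472 held '3.0M' in that column), so convert it to
# numbers; otherwise the sort below would be alphabetical (e.g. '999' would rank above '1000')
google_play_store_copy['Reviews'] = pd.to_numeric(google_play_store_copy['Reviews'])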

# Sort the copied DataFrame by reviews, then subset the sorted copy of the Google Play data set using  
# loc to show that the duplicate data for Instagram is now sorted based on the number of reviews, with  
# the row having the highest number of reviews appearing first
google_play_store_instagram = google_play_store_copy.sort_values(by='Reviews', ascending=False).loc[lambda x: x.App == 'Instagram']
print(google_play_store_instagram)
print("\n")

# For Instagram, keep the row with the highest number of reviews and remove the other entries
google_play_store_instagram_actual = google_play_store_instagram.drop_duplicates('App', keep='first')
print(google_play_store_instagram_actual)

            App Category  Rating   Reviews                Size  \
2604  Instagram   SOCIAL     4.5  66577446  Varies with device   
2611  Instagram   SOCIAL     4.5  66577313  Varies with device   
2545  Instagram   SOCIAL     4.5  66577313  Varies with device   
3909  Instagram   SOCIAL     4.5  66509917  Varies with device   

            Installs  Type Price Content Rating  Genres   Last Updated  \
2604  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
2611  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
2545  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   
3909  1,000,000,000+  Free     0           Teen  Social  July 31, 2018   

             Current Ver         Android Ver  
2604  Varies with device  Varies with device  
2611  Varies with device  Varies with device  
2545  Varies with device  Varies with device  
3909  Varies with device  Varies with device  


            App Category  Rating   Reviews                Siz

So, when applied to Instagram, the process to remove duplicate data was successful.

The process is:
1. Sort the data based on the 'Reviews' column, i.e. by the number of reviews, in descending order.
2. Drop the duplicate data based on the 'App' column, keeping the entry with the highest number of reviews.
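
The two steps can be sketched on a small, made-up DataFrame as follows. The sketch assumes the review counts might be stored as text, which is worth checking with `dtypes`, and converts them to numbers first so that the sort is numeric rather than alphabetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Box', 'Box'],
    'Reviews': ['66577313', '66577446', '999', '1000'],  # stored as text
})

# Convert to numbers first: as text, '999' would sort above '1000'.
toy['Reviews'] = pd.to_numeric(toy['Reviews'])

# Step 1: sort by review count, highest first.
# Step 2: keep only the first (highest-review) row for each app.
deduplicated = (
    toy.sort_values(by='Reviews', ascending=False)
       .drop_duplicates('App', keep='first')
)
print(deduplicated)
```

Without the conversion, the 'Box' row with 999 reviews would be kept instead of the one with 1000, because text values are compared character by character.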

Apply the process to the entire data set, then verify that there are no more duplicate entries.

If the search for duplicate rows returns an empty DataFrame, then there are no duplicates left and the process was successful. The number of rows offers a second check: removing the 1,181 repeat rows found earlier should leave 9,659 rows.

In [9]:
google_play_store_no_duplicates = google_play_store_copy.sort_values(by='Reviews', ascending=False).drop_duplicates('App', keep='first')

# Create a DataFrame that contains only the 'App' column
google_play_store_dataframe_object = pd.DataFrame(google_play_store_no_duplicates, columns=['App'])

# Find duplicate rows
google_play_store_duplicate_row = google_play_store_dataframe_object[google_play_store_dataframe_object.duplicated()]
print(google_play_store_duplicate_row)

print("\nAfter removing the duplicates, the number of rows and columns in the Google Play data set is: ", google_play_store_no_duplicates.shape)

Empty DataFrame
Columns: [App]
Index: []

After removing the duplicates, the number of rows and columns in the Google Play data set is:  (9659, 13)
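
A more compact way to run the same verification, shown here as a sketch on made-up data, is the `is_unique` attribute of a pandas Series.

```python
import pandas as pd

toy = pd.DataFrame({'App': ['Instagram', 'Box', 'Zoom']})

# is_unique is True only when no value in the column repeats.
print(toy['App'].is_unique)

# Equivalent count-based check: no row is flagged as a repeat.
print(toy.duplicated(subset='App').sum() == 0)
```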


### Modifying the Data to Fit the Purpose of the Analysis

The last step in the data cleaning process is to modify the data to fit the purpose of the analysis. In this case, the data needs to be modified so that paid apps and non-English apps are not included in the analysis.

So, how can non-English apps be removed from the data set? An initial approach is to check whether the app name can be encoded using only ASCII characters. If it cannot, then the name contains characters from some other alphabet or special characters.

Testing this approach on some toy examples yields:

In [10]:
def is_english(app_name):
    try:
        app_name.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


There is an issue with the initial approach, namely that 'Docs To Go™ Free Office Suite' and 'Instachat 😜', both English apps, are being recognized as non-English apps because they have special characters (i.e. '™' and '😜').
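
For reference, Python 3.7 and later provide `str.isascii()`, which performs the same test as the encode/decode round trip and therefore has the same limitation. A short sketch, not the notebook's own code:

```python
names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']

for name in names:
    # isascii() is True only when every character has a code point below 128.
    print(name, name.isascii())
```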

An imperfect solution is as follows: Define a function that uses a regular expression to calculate the share of characters in the app name that are probably English, and return True when that share meets a certain threshold. The solution is not perfect, as words in many non-English languages are written with the same letters (e.g. Daten, which is German for data).

In [11]:
import re

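def is_probably_english(app_name, threshold=0.90):
    # Characters treated as "probably English": ASCII letters, digits, hyphen, underscore, and space
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    matching_characters = [character for character in app_name if regular_expression.search(character)]
    # Share of the characters in the name that match the pattern
    quotient = len(matching_characters) / len(app_name)
    passed = quotient >= threshold
    return passed, quotient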

print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('Daten'))

(True, 1.0)
(False, 0.3157894736842105)
(True, 0.9655172413793104)
(True, 0.9090909090909091)
(True, 1.0)


So, while not perfect, the solution does a relatively good job at identifying English apps and non-English apps.

Applying the solution to the data set:

In [12]:
import re

def is_probably_english(row, threshold=0.90):
    # Same test as above, but takes a DataFrame row and returns only the boolean result
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    matching_characters = [character for character in row['App'] if regular_expression.search(character)]
    quotient = len(matching_characters) / len(row['App'])
    passed = quotient >= threshold
    return passed

google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)

google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]

print(google_play_store_english.shape)

(9231, 13)


Above, the is_probably_english function is applied to each row of the google_play_store_no_duplicates DataFrame, and the resulting boolean values are stored in the google_play_store_is_probably_english Series. This Series is then used as a mask to filter the non-English apps out of the google_play_store_no_duplicates DataFrame, with the end result stored in the google_play_store_english DataFrame.
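
As a sketch of an alternative, the same kind of mask can be built by applying a function to the 'App' column alone instead of to whole rows. The names `ENGLISH_LIKE`, `share_english_like`, and `toy` are hypothetical and used only for illustration; the pattern and the 0.90 threshold are taken from the cells above.

```python
import re
import pandas as pd

ENGLISH_LIKE = re.compile(r'[-a-zA-Z0-9_ ]')


def share_english_like(app_name):
    """Fraction of characters in app_name that match the pattern."""
    return len(ENGLISH_LIKE.findall(app_name)) / len(app_name)


toy = pd.DataFrame({
    'App': ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Instachat 😜'],
})

# Series.apply passes each app name directly, so no row lookup is needed.
mask = toy['App'].apply(share_english_like) >= 0.90
print(toy[mask])
```

Compiling the pattern once outside the function also avoids repeating that work for every app name.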

So far in the data cleaning process:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps (to the greatest extent practical)

As mentioned in the introduction, the company only builds apps that are free to download and install, with the main source of revenue consisting of in-app ads. The data sets contain both free and non-free apps, so it is necessary to isolate the free apps for the analysis.

Isolating the free apps will be our last step in the data cleaning process.
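
A minimal sketch of how that isolation might look, using the column names from the data set descriptions above. The rows in `google_toy` and `apple_toy` are made up, including the format of the non-zero Google Play price, and the actual cell may take a different approach.

```python
import pandas as pd

# Toy rows using the column names listed in the data set descriptions.
google_toy = pd.DataFrame({
    'App': ['Instagram', 'Paid Game'],
    'Type': ['Free', 'Paid'],
    'Price': ['0', '$4.99'],  # assumed format for a paid app
})
apple_toy = pd.DataFrame({
    'track_name': ['Facebook', 'Paid Tool'],
    'price': [0.0, 2.99],
})

# Google Play: 'Type' states directly whether an app is free.
google_free = google_toy[google_toy['Type'] == 'Free']

# App Store: there is no 'Type' column, so a price of 0.0 marks a free app.
apple_free = apple_toy[apple_toy['price'] == 0.0]

print(google_free)
print(apple_free)
```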