
## Most Profitable Free Apps on Google Play and Apple Store

#### Introduction
This is the first project in the Data Analyst in Python learning pathway with DataQuest.

#### Scenario
I am doing data analysis for a company that builds Android and iOS mobile apps. The company only makes apps that are free to download and install, so the main source of revenue consists of in-app ads. This means revenue for any given app is mostly influenced by the number of users who use the app — the more users that see and engage with the ads, the better. 

#### Goal
To understand what type of apps are likely to attract more users.

#### Dataset
* Apple Store: [This data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) contains details of more than 7,000 iOS apps.
* Play Store: [This data set](https://www.kaggle.com/lava18/google-play-store-apps) contains details of more than 10,000 Android apps.

## Part 1 - Preparation
First we mount Google Drive and read in the data sets.



In [7]:
# Mount Google.
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)
#drive.flush_and_unmount()

# Load reader modules
from csv import reader

# Find out where the data is.
#!ls "gdrive/MyDrive/Colab Notebooks/p1/"

### Google Play Data Set ###
# The with-statement closes each file once it has been read.
with open('gdrive/MyDrive/Colab Notebooks/p1/PlayStore.csv', encoding='utf8') as opened_file:
  android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### Apple Store Data Set ###
with open('gdrive/MyDrive/Colab Notebooks/p1/AppleStore.csv', encoding='utf8') as opened_file:
  ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

# Show the header row and one sample data row (index 1) of each data set
print("\n")
print("First two lines of the Apple Store data set")
print("-------------------------------------------")
print(ios_header)
print(ios[1])

print("First two lines of the Google Play Store data set")
print("-------------------------------------------------")
print(android_header)
print(android[1])

Mounted at /content/gdrive


First two lines of the Apple Store data set
-------------------------------------------
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
First two lines of the Google Play Store data set
-------------------------------------------------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
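
The same open/read/split-header steps are repeated for both files, so they could be wrapped in one helper. The following is a minimal sketch, not part of the original notebook: the name `load_csv` is hypothetical, and the demonstration uses a tiny throwaway file rather than the real data sets.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demonstration on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf8", newline=""
) as tmp:
    tmp.write("App,Category\nDemo App,TOOLS\nOther App,GAME\n")
    tmp_path = tmp.name

demo_header, demo_rows = load_csv(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(demo_header)
print(demo_rows[0])
```

With the real files, the call would presumably look like `android_header, android = load_csv('gdrive/MyDrive/Colab Notebooks/p1/PlayStore.csv')`.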


## Part 1 Continued
Next, we will need an inspection function to take a look at slices of the data as we clean it up.

In [8]:
# Print the rows in dataset[start:end]; if rows_and_columns is True, also print the row and column counts.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
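
Note that explore_data reports the column count of the first row only. As a complementary check, the sketch below looks for rows whose number of fields differs from the header's. The function name `find_malformed_rows` and the toy data are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(header, dataset):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical toy data: the second row is missing its 'Category' value.
toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["Good App", "TOOLS", "4.1"],
    ["Broken App", "1.9"],
    ["Another App", "GAME", "3.9"],
]

malformed = find_malformed_rows(toy_header, toy_rows)
for index, row in malformed:
    print(index, row)
```

On the real data this would be called as `find_malformed_rows(android_header, android)`.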


## Part 2 - Cleaning the Data

### Deleting Rows
This function shows a row and then asks whether you want to delete it.

In [11]:
### TEST LIST OF LISTS ###
#test_col = ["col0", "col1", "col2", "col3"]
#test_row1 = ["row1", "row1", "row1", "row1"]
#test_row2 = ["row2", "row2", "row2", "row2"]
#test_list_of_list = [test_col, test_row1, test_row2]

def del_row(data_set, row_index):
  print("This is row ", row_index)
  print(data_set[row_index])
  choice = input("Delete this row? [y/n]")
  if choice == "y": # only one valid answer
    del data_set[row_index]
    print("I have deleted row", row_index)
    #return data_set #use for testing.
  else: # else used so any other character results in no deletion 
    print("No action taken")
    #return data_set #use for testing.

del_row(android,10472)


This is row  10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Delete this row? [y/n]y
I have deleted row 10472


The call to **del_row** above deletes the known error at row **10472 of the Android data set**. As the printed row shows, its category value is missing, so it has 12 fields instead of 13 and every later column is shifted one place to the left.

Since I'm not sure whether the reported row number includes the header row, the function prints the row first so we can confirm it matches the description ("Life Made WI-Fi...") before deleting it. Run the deletion only once: after the row is removed, index 10472 points at a different app.
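
Because re-running an index-based deletion removes whatever row currently sits at that index, a guarded, non-interactive variant may be safer. This is a minimal sketch under my own assumptions: `del_row_if_named` and its parameters are hypothetical and not part of the original notebook.

```python
def del_row_if_named(data_set, row_index, expected_name, name_col=0):
    """Delete data_set[row_index] only if its name matches expected_name.

    Returns True if a row was deleted, False otherwise.
    """
    if row_index < len(data_set) and data_set[row_index][name_col] == expected_name:
        del data_set[row_index]
        return True
    return False

# Hypothetical toy data with one malformed row at index 1.
toy = [["App A", "4.0"], ["Bad Row"], ["App B", "3.5"]]

first_call = del_row_if_named(toy, 1, "Bad Row")   # name matches, row is deleted
second_call = del_row_if_named(toy, 1, "Bad Row")  # index 1 is now "App B", nothing happens

print(first_call, second_call, len(toy))
```

For the Android data the equivalent call would be `del_row_if_named(android, 10472, 'Life Made WI-Fi Touchscreen Photo Frame')`, which is safe to run more than once.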


### Duplicate Data Points
Different versions of the same app, as well as updates, re-releases and scraping errors, mean there may well be duplicate data points.

Analysis and visualisation will be helped by ensuring each app occurs only once in the data set.

The Android data set does indeed have multiple duplicate rows for the same application.

### Count Duplicates
This function counts the number of duplicate apps in the data set.

In [12]:
def find_dup(dataset):
  duplicates = []
  uniques = []
  print("Total Rows:",len(dataset))
  for row in dataset:
    #print(len(duplicates)) # debug line
    #print(len(uniques)) # debug line
    name = row[0]
    #print(name) # debug line
    #print(row) # debug line
    if name in uniques:
      duplicates.append(name)
    else:
      uniques.append(name)
  print("Duplicate apps:", len(duplicates))
  print("Unique apps:", len(uniques))
  print("Checksum:", len(duplicates)+len(uniques))
  return

find_dup(android)



Total Rows: 10840
Duplicate apps: 1181
Unique apps: 9659
Checksum: 10840
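
find_dup checks membership in a list, which rescans the list for every row. A set gives the same counts with a single hashed lookup per row. The sketch below is an alternative under that assumption, not the notebook's own function: `count_duplicates` is a hypothetical name and the rows are toy data.

```python
def count_duplicates(dataset, name_col=0):
    """Return (number of duplicate rows, number of unique names)."""
    seen = set()
    duplicates = 0
    for row in dataset:
        name = row[name_col]
        if name in seen:
            duplicates += 1  # name already encountered in an earlier row
        else:
            seen.add(name)
    return duplicates, len(seen)

# Hypothetical toy data: 'Instagram' appears three times.
toy = [
    ["Instagram", "10"],
    ["Slack", "5"],
    ["Instagram", "12"],
    ["Instagram", "11"],
]

dup_count, unique_count = count_duplicates(toy)
print("Duplicate apps:", dup_count)
print("Unique apps:", unique_count)
```

As in find_dup, the two counts should add up to the total number of rows.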


### Delete Duplicates
Now we know they are there, we need to delete them.

We will use two functions:
* **fill_dict**: Creates a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* **fill_new_dataset**: Uses the dictionary as an index to build a new data set that keeps, for each app, only the row with the highest number of reviews.

In [14]:

def fill_dict(dataset, name, review):
  u_dict = {}
  for row in dataset:
    app_name = row[name]
    app_review = float(row[review])
         
    # If app not in u_dict then add it.
    if app_name not in u_dict:
      u_dict[app_name] = app_review
    # If app is already in u_dict and this row has more reviews, keep the higher count.
    elif app_review > u_dict[app_name]:
      u_dict[app_name] = app_review
  print("New unique dictionary contains:", len(u_dict), "keys with values.")
  return u_dict

fill_dict(android, 0, 3)
print("\n")


New unique dictionary contains: 9659 keys with values.




In [15]:
# Global scope
clean_android = []
clean_names = []

def fill_new_dataset(index_dict, dataset):
  # Loop through the raw dataset.
  for row in dataset:
    app_name = row[0] # Isolate the name for checking.
    n_reviews = float(row[3])
    # Keep the row only if its review count equals the maximum recorded in index_dict
    # (the dictionary returned by fill_dict) and the app has not been kept already.
    # The second check matters when duplicate rows share the same maximum review count.
    if (index_dict[app_name] == n_reviews) and (app_name not in clean_names):
      clean_android.append(row) # Copy to new clean dataset
      clean_names.append(app_name) # Record the name so later duplicates are skipped.
  print("New unique dataset contains:", len(clean_android), "rows of data.")
  return clean_android

fill_new_dataset(fill_dict(android, 0, 3), android)
print("\n")
print("Here's a couple of rows of the clean data_set:")
explore_data(clean_android, 5, 7, True)



New unique dictionary contains: 9659 keys with values.
New unique dataset contains: 9659 rows of data.


Here's a couple of rows of the clean data_set:
['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']


Number of rows: 9659
Number of columns: 13
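
The two-step approach above can also be folded into a single pass by storing the best row itself, rather than only its review count, in the dictionary. This is a sketch of that alternative, not the notebook's method: `dedupe_by_max_reviews` is a hypothetical name, the rows are toy data, and the result is ordered by each app's first appearance, which may differ from the order produced by fill_new_dataset.

```python
def dedupe_by_max_reviews(dataset, name_col=0, review_col=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_col]
        reviews = float(row[review_col])
        # On ties the first row seen is kept, because only a strictly
        # higher count replaces the stored row.
        if name not in best or reviews > float(best[name][review_col]):
            best[name] = row
    return list(best.values())

# Hypothetical toy data in the Android column order (name, category, rating, reviews).
toy = [
    ["App A", "TOOLS", "4.0", "100"],
    ["App B", "GAME", "3.5", "50"],
    ["App A", "TOOLS", "4.1", "250"],
    ["App B", "GAME", "3.5", "50"],
]

deduped = dedupe_by_max_reviews(toy)
for row in deduped:
    print(row)
```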


## Non-English Apps
Removing non-English apps is important because the company develops English-language apps, so apps aimed at speakers of other languages are outside the scope of this analysis.

### Finding Non-English Apps
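We will use the built-in ord(char) function to identify non-English characters by their code point. ASCII, which covers ordinary English text, uses the values 0 to 127, so any character with a value over 127 is treated as non-English.

An app is discarded only if its name contains more than 3 such characters. This tolerates English names that include a few emoji or symbols such as ™, though some non-English apps may still get through.

Two nested for loops go through the data set row by row, and then through each app name character by character.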

In [25]:
# iOS and Android datasets have app names in different columns, so we need to 
# specify which column to look at as a parameter.
def get_english(dataset, name_col):
  non_english_names = [] # Two lists so we can check we have them all.
  english_names = []
  print("Total Rows:",len(dataset))
  for row in dataset: # Loop through the data set.
    non_english_chars = 0 # Reset the counter for each app name.
    name = row[name_col]
    for char in name: # Loop through each character in the app name.
      if ord(char) > 127:
        non_english_chars += 1 # Increment the counter of non-ASCII chars.
    if non_english_chars > 3: # More than 3 non-ASCII chars counts as non-English.
      non_english_names.append(row) # Append to the discard pile if non-Eng.
    else:
      english_names.append(row) # Append to the keep pile if Eng.
  # Quality control. 
  print("Apps named in Non-English:", len(non_english_names))
  print("Apps named in English:", len(english_names))
  print("Checksum:", len(non_english_names)+len(english_names))
  return english_names


eng_android = get_english(clean_android, 0)
print("Android data set now has", len(eng_android), "rows.")
print("\n")
eng_ios = get_english(ios, 1)
print("iOS data set now has", len(eng_ios), "rows.")


Total Rows: 9659
Apps named in Non-English: 45
Apps named in English: 9614
Checksum: 9659
Android data set now has 9614 rows.


Total Rows: 7197
Apps named in Non-English: 1014
Apps named in English: 6183
Checksum: 7197
iOS data set now has 6183 rows.
