# Profitable App Profiles
### Hemanth Soni, June 2020

---

## Introduction and Overview

The goal of this project is to identify the most profitable app profiles in the store. This should help our agency identify where we should focus our development effort. In order to ensure only relevant data is analyzed, the characteristics of the agency need to be kept in mind:
* Only builds free apps (no paid apps)
* Only builds apps for the English-speaking world (no foreign-language apps)

Typically, I wouldn't want to exclude data outside of this profile (as I may find that those excluded categories / formats are actually the most lucrative) but for the purposes of this exercise I'll take those constraints for granted.

## Importing datasets

First, I'm going to start by importing a few datasets. The tutorial I am following provides two:
* [9660 Android apps](https://www.kaggle.com/lava18/google-play-store-apps)
* [7195 iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

Separately, I was able to find a [third much larger dataset of Android apps on Kaggle](https://www.kaggle.com/gauthamp10/google-playstore-apps?select=Google-Playstore-Full.csv). It has the same fields available in the provided Android dataset, so I'm going to also include this in the analysis. The larger dataset should allow for more granular insights into the Android market. Unfortunately, a similar larger dataset couldn't be found for the Apple app store.

In [2]:
from csv import reader

#The small Google dataset
open_file = open('apps_datasets/google_small.csv', encoding='utf8')
read_file = reader(open_file)
googlesmall = list(read_file)
googlesmall_header = googlesmall[0]
googlesmall_table = googlesmall[1:]

#The large Google dataset
open_file = open('apps_datasets/google_large.csv', encoding='utf8')
read_file = reader(open_file)
googlelarge = list(read_file)
googlelarge_header = googlelarge[0]
googlelarge_table = googlelarge[1:]

#The Apple dataset
open_file = open('apps_datasets/apple.csv', encoding='utf8')
read_file = reader(open_file)
apple = list(read_file)
apple_header = apple[0]
apple_table = apple[1:]

To make this data easier to explore, I first wrote a function that makes it easier to 'peek' into a dataset in a readable way. This function lets me print any number of rows from each of the datasets and get a view into the datasets total number of rows and columns.

In [3]:
def explore_data (dataset, start, end, overview=True, hasHeader=True):
    slice = dataset[start:end]
    
    print('Overview of first ' + str(end-start) + ' rows in database')
    print('\n')
    
    for each in slice:
        print(each)
        print('\n')
        
    if overview == True:
        if hasHeader == True:
            print('Number of columns = ' + str(len(dataset[0])))
            print('Number of rows = ' + str(len(dataset)-1))
            print('-'*40)
        else:
            print('Number of columns = ' + str(len(dataset[0])))
            print('Number of rows = ' + str(len(dataset)))
            print('-'*40)
            
explore_data(googlesmall,0,5)
explore_data(googlelarge,0,5)
explore_data(apple,0,5)

Overview of first 5 rows in database


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of columns = 13
Number of rows = 10841
--------------------------

## Cleaning data

### Manually correcting known error

Based on a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) of one of the datasets, there appears to be a known error in the small Google Play Store dataset. We can correct for this by filling in the data by finding [the app](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) in the Play Store and filling in the missing value.

In [4]:
# Finding the app based on the comments section and printing it to ensure it matches the expected error row.
print(googlesmall[10473])

# Printing another row that is known to be fine to understand where the issue lays.
print(googlesmall[1])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


By comparing these two outputs, we can see that the "category" (index position 1) is missing in the error row. We can correct for this by adding it into the dataset.

In [5]:
googlesmall[10473].insert(1,'LIFESTYLE')

print(googlesmall[10473])

['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


### Removing duplicates

Generally, it's a good idea to check for duplicates in the datasets, and remove them if they exist. We will do this as a two step process.
1. Check if the database has duplicates
2. Remove the duplicates

We could theoretically skip step 1, but we'll do it anyways since this is meant to be a learning experience.

In [6]:
# Initiating lists
duplicate_apps = []
unique_apps = []

# Function to check for duplicates
def check_dupes(listname):
    for each in listname:
        name = each[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    print('Number of unique apps:',len(unique_apps))
    print('Number of duplicate apps:',len(duplicate_apps))
    
    del unique_apps[:]
    del duplicate_apps[:]
    
# Checking each list for duplicates
check_dupes(apple)
check_dupes(googlesmall)

# The check for the large database is disabled as my computer isn't strong enough to run it.
# check_dupes(googlelarge)

Number of unique apps: 7198
Number of duplicate apps: 0
Number of unique apps: 9661
Number of duplicate apps: 1181


From this, we can see that the Apple Store dataset doesn't have any duplicates for us to worry about, but the smaller Google Play Store dataset does. We'll filter through this list and keep only the version of each app with the most reviews (as this suggests the most complete and up-to-date data).

In [18]:
# Initiating dictionary to store highest-review-count-version of each app
reviews_max = {}

# Iterating through smaller Google dataset
for each in googlesmall[1:]:
    name = each[0]
    n_reviews = float(each[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

I can check for errors by comparing the expected length of the dictionary vs. the actual length of the dictionary.

In [19]:
if int(len(googlesmall_table)-1181) == len(reviews_max):
    print ('Success! The dictionary is the expected size:',len(reviews_max),'entries.')
else:
    print ('Something is wrong.',int(len(googlesmall_table)-1181),'entries were expected, but',len(reviews_max),'were recorded.')

Success! The dictionary is the expected size: 9660 entries.


Now, I'll use this dictionary to remove the duplicates, keeping the entry version with the greatest number of reviews and entering them into a new table, 'googlesmall_nodupes'.

In [23]:
# Initiating new lists
googlesmall_nodupes = []
already_added = []

# Filling out new lists
for each in googlesmall[1:]:
    name = each[0]
    n_reviews = float(each[3])
    
    if reviews_max[name] >= n_reviews and name not in already_added:
        googlesmall_nodupes.append(each)
        already_added.append(name)

### Removing non-English apps

To filter out non-English apps, I'll filter the database for any non-English characters (beyond ASCII code 127).

In [33]:
# Defining a function to check if the passed phrase is fully English
def isEnglish(phrase):
    
    non_english = 0
    
    for each in phrase:
        if ord(each) > 127:
            non_english += 1
        
        if non_english >= 3:
            return False

    return True

# Testing this function against several examples
# print(isEnglish('Instachat 😜'))
# print(isEnglish('Instagram'))
# print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# print(isEnglish('Docs To Go™ Free Office Suite'))

googlesmall_nodupes_eng = []
googlelarge_eng = []
apple_eng = []

for each in googlesmall_nodupes:
    name = each[0]
    if isEnglish(name):
        googlesmall_nodupes_eng.append(each)
        
for each in googlelarge:
    name = each[0]
    if isEnglish(name):
        googlelarge_eng.append(each)

for each in apple:
    name = each[1]
    if isEnglish(name):
        apple_eng.append(each)

### Removing paid apps

Because the agency is only concerned with free apps, we can use a similar mechanism to the above to remove any apps that are paid. This is done below.

In [84]:
# Initializing final lists for each dataset
googlesmall_final = []
googlelarge_final = []
apple_final = []

for each in googlesmall_nodupes_eng:
    price = each[7]
    if price == '0':
        googlesmall_final.append(each)
        
for each in apple_eng:
    price = each[4]
    if price == '0.0':
        apple_final.append(each)

print(len(googlesmall_final))
print(len(apple_final))

8847
3203


## Analyzing Data

Now that I have my cleaned datasets, I can begin the analysis. To do this in the most useful way possible, I have to consider the launch strategy of the agency, which is as follows:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

The agency has already determined through previous analysis that there is a direct and linear correlation between the number of installs and the revenue generated by the app; thus to maximize profit our goal is to maximize installations.

### Most Common App Profiles

Given that the agency's target is to launch the same app on multiple stores, I can start by identifying the types of applications that are successful in both the Google Play Store and the Apple iOS store.

In [87]:
# Creating a function to build a frequency table out of any dataset for a given index number

def freq_table(dataset, index):
    table = {}
    total = 0

    for each in dataset:

        # Extracting the value at the given index number
        value = each[index]

        # Checking if value exists in table and either adding to the count or creating the entry
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

        # Increasing the total count by 1
        total += 1

        # Initializing a new table to return figures in percentages
        table_percent = {}

        for each in table:
            percentage = table[each] / total * 100
            table_percent[each] = percentage

        return table_percent

# Creating a function that sorts and then prints a given input frequency table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [88]:
# Creating tables for each of the app stores

display_table(googlesmall_final,1)
display_table(googlesmall_final,9)

ART_AND_DESIGN : 100.0
Art & Design : 100.0
