# Profitable App Profiles
### Hemanth Soni, June 2020

---

## Introduction and Overview

The goal of this project is to identify the most profitable app profiles in the Apple App Store and the Google Play Store. This should help our agency decide where to focus its development effort. To ensure only relevant data is analyzed, the characteristics of the agency need to be kept in mind:
* Only builds free apps (no paid apps)
* Only builds apps for the English-speaking world (no foreign-language apps)

Typically, I wouldn't want to exclude data outside of this profile (as I may find that the excluded categories / formats are actually the most lucrative), but for the purposes of this exercise I'll take those constraints for granted.

## Importing datasets

First, I'm going to import a few datasets. The tutorial I am following provides two:
* [9660 Android apps](https://www.kaggle.com/lava18/google-play-store-apps)
* [7195 iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

Separately, I was able to find a [third, much larger dataset of Android apps on Kaggle](https://www.kaggle.com/gauthamp10/google-playstore-apps?select=Google-Playstore-Full.csv). It has the same fields available in the provided Android dataset, so I'm also going to include it in the analysis. The larger dataset should allow for more granular insights into the Android market. Unfortunately, I couldn't find a similarly large dataset for the Apple App Store.

In [None]:
from csv import reader

def load_dataset(path):
    # Read a CSV file into a list of rows; the with-block closes the file afterwards.
    with open(path, encoding='utf8', newline='') as open_file:
        return list(reader(open_file))

#The small Google dataset
googlesmall = load_dataset('apps_datasets/google_small.csv')
googlesmall_header = googlesmall[0]
googlesmall_table = googlesmall[1:]

#The large Google dataset
googlelarge = load_dataset('apps_datasets/google_large.csv')
googlelarge_header = googlelarge[0]
googlelarge_table = googlelarge[1:]

#The Apple dataset
apple = load_dataset('apps_datasets/apple.csv')
apple_header = apple[0]
apple_table = apple[1:]
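
The claim that the two Android datasets share the same fields can be checked rather than taken on trust. The following is a minimal sketch, not part of the tutorial: `compare_headers` is a hypothetical helper, and the sample headers are made up for illustration. In this notebook the inputs would be `googlesmall_header` and `googlelarge_header`.

```python
def compare_headers(header_a, header_b):
    """Return the column names found in only one of the two headers.

    Names are compared after stripping whitespace and lowercasing,
    which is an assumption about how the two files might differ.
    """
    norm_a = {name.strip().lower() for name in header_a}
    norm_b = {name.strip().lower() for name in header_b}
    return sorted(norm_a - norm_b), sorted(norm_b - norm_a)

# Hypothetical headers for illustration only (not the real column lists).
sample_small = ['App', 'Category', 'Rating', 'Reviews']
sample_large = ['app', 'category', 'rating', 'reviews', 'Developer']

only_small, only_large = compare_headers(sample_small, sample_large)
print('Only in first header:', only_small)
print('Only in second header:', only_large)
```

An empty first list would mean every field of the small dataset also exists in the large one, which is what the analysis relies on.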

To make this data easier to explore, I first wrote a function that lets me 'peek' into a dataset in a readable way. It prints any range of rows from a dataset and reports the dataset's total number of rows and columns.

In [None]:
def explore_data(dataset, start, end, overview=True, hasHeader=True):
    # Select the requested rows (avoid naming this 'slice', which shadows a built-in).
    rows = dataset[start:end]

    print('Overview of ' + str(len(rows)) + ' rows starting at index ' + str(start))
    print('\n')

    for each in rows:
        print(each)
        print('\n')

    if overview:
        # If the dataset includes a header row, leave it out of the row count.
        row_count = len(dataset) - 1 if hasHeader else len(dataset)
        print('Number of columns = ' + str(len(dataset[0])))
        print('Number of rows = ' + str(row_count))
        print('-'*40)
            
explore_data(googlesmall,0,5)
explore_data(googlelarge,0,5)
explore_data(apple,0,5)

## Cleaning data

### Manually correcting known error

Based on a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) of the dataset, there is a known error in one row of the small Google Play Store dataset. We can correct it by looking up [the app](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) in the Play Store and filling in the missing value.

In [None]:
# Finding the app based on the comments section and printing it to ensure it matches the expected error row.
print(googlesmall[10473])

# Printing another row that is known to be fine to understand where the issue lies.
print(googlesmall[1])

By comparing these two outputs, we can see that the "category" (index position 1) is missing in the error row. We can correct for this by adding it into the dataset.

In [None]:
googlesmall[10473].insert(1,'LIFESTYLE')

print(googlesmall[10473])
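
The row above was located through the Kaggle discussion, but a row with a missing field can also be found programmatically, because it has fewer values than the header. The following is a rough sketch under that assumption; `find_malformed_rows` is a hypothetical helper and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs for rows whose length differs from the header's.

    Assumes dataset[0] is the header row, as in the lists loaded above.
    """
    expected = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]

# Tiny made-up dataset: the second data row is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Example App A', 'TOOLS', '4.1'],
    ['Example App B', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Run against `googlesmall` before the correction, a check like this should flag the row at index 10473, and it should return an empty list once the category has been inserted.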

### Removing duplicates

Generally, it's a good idea to check for duplicates in the datasets and remove them if they exist. We will do this as a two-step process.
1. Check if the dataset has duplicates
2. Remove the duplicates

We could theoretically skip step 1, but we'll do it anyway since this is meant to be a learning experience.

In [39]:
def check_dupes(listname):
    # Local lists, so nothing needs to be reset between calls.
    duplicate_apps = []
    unique_apps = []

    for each in listname:
        # each[0] is the first column of the row, assumed here to identify the app.
        name = each[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)

    print('Number of unique apps: ',len(unique_apps))
    print('Number of duplicate apps: ',len(duplicate_apps))

# Checking each list for duplicates.
# Note: these lists still include the header row, so each unique count
# is one higher than the number of unique apps.

check_dupes(apple)
check_dupes(googlesmall)

#The check for the large dataset is disabled as my computer isn't strong enough to run it.
#check_dupes(googlelarge)

Number of unique apps:  7198
Number of duplicate apps:  0
Number of unique apps:  9661
Number of duplicate apps:  1181


From this, we can see that the App Store dataset doesn't have any duplicates for us to worry about, but the smaller Google Play Store dataset does. We'll go through that dataset and keep only the entry for each app with the most reviews (as this suggests the most complete and up-to-date data).
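
As an aside on the disabled check for the large dataset: one likely reason it is so slow is that testing whether a name is in a list scans the whole list, so the total work grows roughly with the square of the number of rows. A set makes each membership test close to constant time. The following is a minimal sketch of that alternative, not the tutorial's method; `count_dupes` and its `name_index` parameter are hypothetical, and I haven't timed it on the large dataset.

```python
def count_dupes(rows, name_index=0):
    """Count unique names and repeated occurrences in a single pass.

    Uses a set for membership tests instead of a list.
    Returns (number of unique names, number of duplicate rows).
    """
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return len(seen), duplicates

# Made-up rows for illustration: 'Example A' appears three times.
sample = [['Example A'], ['Example B'], ['Example A'], ['Example A']]

unique_count, duplicate_count = count_dupes(sample)
print('Number of unique apps: ', unique_count)
print('Number of duplicate apps: ', duplicate_count)
```

It counts the same way as `check_dupes`: the first occurrence of a name is unique and every later occurrence is a duplicate. Passing `googlelarge_table` rather than `googlelarge` would keep the header row out of the counts.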