# Profitable App Profiles
### Hemanth Soni, June 2020

---

## Introduction and Overview

The goal of this project is to identify the most profitable app profiles in the store. This should help our agency identify where we should focus our development effort. In order to ensure only relevant data is analyzed, the characteristics of the agency need to be kept in mind:
* Only builds free apps (no paid apps)
* Only builds apps for the English-speaking world (no foreign-language apps)

Typically, I wouldn't want to exclude data outside of this profile (as I may find that those excluded categories / formats are actually the most lucrative) but for the purposes of this exercise I'll take those constraints for granted.

## Importing datasets

First, I'm going to start by importing a few datasets. The tutorial I am following provides two:
* [9660 Android apps](https://www.kaggle.com/lava18/google-play-store-apps)
* [7195 iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

Separately, I was able to find a [third much larger dataset of Android apps on Kaggle](https://www.kaggle.com/gauthamp10/google-playstore-apps?select=Google-Playstore-Full.csv). It has the same fields available in the provided Android dataset, so I'm going to also include this in the analysis. The larger dataset should allow for more granular insights into the Android market. Unfortunately, a similar larger dataset couldn't be found for the Apple app store.

In [17]:
from csv import reader

#The small Google dataset
open_file = open('apps_datasets/google_small.csv', encoding='utf8')
read_file = reader(open_file)
googlesmall = list(read_file)
googlesmall_header = googlesmall[0]
googlesmall_table = googlesmall[1:]

#The large Google dataset
open_file = open('apps_datasets/google_large.csv', encoding='utf8')
read_file = reader(open_file)
googlelarge = list(read_file)
googlelarge_header = googlelarge[0]
googlelarge_table = googlelarge[1:]

#The Apple dataset
open_file = open('apps_datasets/apple.csv', encoding='utf8')
read_file = reader(open_file)
apple = list(read_file)
apple_header = apple[0]
apple_table = apple[1:]

To make this data easier to explore, I first wrote a function that makes it easier to 'peek' into a dataset in a readable way. This function lets me print any number of rows from each of the datasets and get a view into the datasets total number of rows and columns.

In [18]:
def explore_data (dataset, start, end, overview=True, hasHeader=True):
    slice = dataset[start:end]
    
    print('Overview of first ' + str(end-start) + ' rows in database')
    print('\n')
    
    for each in slice:
        print(each)
        print('\n')
        
    if overview == True:
        if hasHeader == True:
            print('Number of columns = ' + str(len(dataset[0])))
            print('Number of rows = ' + str(len(dataset)-1))
            print('-'*40)
        else:
            print('Number of columns = ' + str(len(dataset[0])))
            print('Number of rows = ' + str(len(dataset)))
            print('-'*40)
            
explore_data(googlesmall,0,5)
explore_data(googlelarge,0,5)
explore_data(apple,0,5)

Overview of first 5 rows in database


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of columns = 13
Number of rows = 10841
--------------------------

## Cleaning data

### Deleting blank columns

From the above, we can see that the last four columns of the larger Google dataset are blank. The code below quickly runs through the database and deletes those four rows.

In [19]:
for each in googlelarge:
    del each[-4:]

I then check the database again to make sure I did this right.

In [20]:
explore_data(googlelarge,0,5)

Overview of first 5 rows in database


['App Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Size', 'Price', 'Content Rating', 'Last Updated', 'Minimum Version', 'Latest Version']


['DoorDash - Food Delivery', 'FOOD_AND_DRINK', '4.548561573', '305034', '5,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['TripAdvisor Hotels Flights Restaurants Attractions', 'TRAVEL_AND_LOCAL', '4.400671482', '1207922', '100,000,000+', 'Varies with device', '0', 'Everyone', 'March 29, 2019', 'Varies with device', 'Varies with device']


['Peapod', 'SHOPPING', '3.656329393', '1967', '100,000+', '1.4M', '0', 'Everyone', 'September 20, 2018', '5.0 and up', '2.2.0']


['foodpanda - Local Food Delivery', 'FOOD_AND_DRINK', '4.107232571', '389154', '10,000,000+', '16M', '0', 'Everyone', 'March 22, 2019', '4.2 and up', '4.18.2']


Number of columns = 11
Number of rows = 267052
----------------------------------------


### Checking data integrity

To start, we can run a simple test on the datasets to ensure that each row is complete (ie. has the same number of elements as the header). The code for this is below

In [21]:
# Function to run through each dataset and return a frequency table by the number of elements in each row

errorBase = {}

def rowCheck(dataset):
    
    errorBase = {}
    headerCount = len(dataset[0])
    
    print('Expecting',headerCount,'rows.')
    
    for each in dataset:
        rowCount = len(each)
        if (rowCount < headerCount) and (rowCount not in errorBase):
            errorBase[rowCount] = 1
        elif (rowCount < headerCount) and (rowCount in errorBase):
            errorBase[rowCount] += 1
    
    if len(errorBase) == 0:
        print('No errors found')
    else:
        print(errorBase)

In [22]:
rowCheck(googlesmall)
rowCheck(googlelarge)
rowCheck(apple)

Expecting 13 rows.
{12: 1}
Expecting 11 rows.
No errors found
Expecting 16 rows.
No errors found


#### Filling in missing data point

Based on the above, one row appears to be short two elements. This a known error in the data, as seen in this [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). But for the purposes of the exercise, let's pretend we don't know the row number, so we need to dig through and find it. I'll do that with the code below, then fill in the missing data by finding the app in the Play Store and filling in the missing data.

In [23]:
rowCounter = 0

for each in googlesmall:
    if len(each) == 12:
        print('The error row is',rowCounter)
        break
    
    rowCounter += 1    

# Finding the app based on the counter above.
print(googlesmall[int(rowCounter)])

# Printing another row that is known to be fine to understand where the issue lays.
print(googlesmall[1])

The error row is 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


By comparing these two outputs, we can see that the "category" (index position 1) and the "genre" (index position 11) are missing in the error row. We can correct for this by adding it into the dataset.

In [24]:
# Adding in missing data (category and genre)
# googlesmall[10473].insert(1,'LIFESTYLE') # commented out so that data isn't inserted in re-runs
# googlesmall[10473].insert(9,'Lifestyle') # commented out so that data isn't inserted in re-runs

# Comparing against a row known to be error-free to visually check
print(googlesmall[10473])
print(googlesmall[1])

# Deleting blank element at index 10
# googlesmall[10473].remove('') # commented out so that data isn't inserted in re-runs
print(googlesmall[10473])

# Confirming that error has been corrected
rowCheck(googlesmall)

['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Expecting 13 rows.
No errors found


### Continuing data integrity check

Next, we can check to make sure that the columns contain only the type of data expected. For example, the reviews column should only include numbers, no letters / phrases. We can do this by defining a function to check that an index number contains only the type of elements stated.

In [25]:
def elementCheck(database, index, var):
    
    counter = 1
    errorCounter = 0
    errorBase = []
    
    for each in database[1:]:
        
        # There's almost definitely a better way to do this, but the code below checks if setting the specified element to especified type works
        try:
            if isinstance(var(each[index]), var):
                pass
            
        # If it doesn't, adds it to the error-checker database
        except ValueError:
            errorBase.append(counter)
            errorCounter += 1
        counter += 1
    
    print('This database has',errorCounter,'errors at index point',str(index)+'.')
    
    if len(errorBase) > 0:
        print(errorBase)

In [26]:
# A better way to do this error check would be to define a dictionary of elements and their correct type, but I'm lazy.
# A better checker would also be able to account for some common data characteristics, like commas in numbers, or dollars for currency-denominated fields
# Also, hecks for strings isn't actually useful, as everything is stored a string. So I'm only running this on columns where I am expecting only non-strings

# Checking the smaller Google dataset
print('Small Google dataset check:')
elementCheck(googlesmall, 2, float)
elementCheck(googlesmall, 3, int)
print('')
    
# Checking the larger Google dataset
print('Larger Google dataset check:')
elementCheck(googlelarge, 2, float)
elementCheck(googlelarge, 3, int)
print('')

# Checking the Apple dataset
print('Apple dataset check:')
elementCheck(apple, 0, int)
elementCheck(apple, 2, int)
elementCheck(apple, 4, float)
elementCheck(apple, 5, int)
elementCheck(apple, 6, int)
elementCheck(apple, 7, float)
elementCheck(apple, 8, float)

Small Google dataset check:
This database has 0 errors at index point 2.
This database has 0 errors at index point 3.

Larger Google dataset check:
This database has 17 errors at index point 2.
[6942, 13505, 23458, 32230, 48439, 113152, 125480, 125481, 165231, 168915, 177166, 180372, 190760, 193870, 194166, 232812, 257774]
This database has 13 errors at index point 3.
[6942, 23458, 48439, 113152, 125481, 165231, 168915, 177166, 180372, 193870, 194166, 232812, 257774]

Apple dataset check:
This database has 0 errors at index point 0.
This database has 0 errors at index point 2.
This database has 0 errors at index point 4.
This database has 0 errors at index point 5.
This database has 0 errors at index point 6.
This database has 0 errors at index point 7.
This database has 0 errors at index point 8.


From this, we can tell the larger Google dataset has some errors that need to be corrected. All of the index numbes that have an error on the third element also have an error on the second element, so I will start by printing some of the flagged indexes in that list.

In [27]:
print(googlelarge[6942])
print(googlelarge[13505])
print(googlelarge[23458])

['ELer Japanese - NHK News', ' Podcasts', ' Lessons', '', 'EDUCATION', '4.705075264', '1458', '100,000+', '9.5M', '0', 'Everyone']
['Never have I ever 18+ ', ')', 'GAME_STRATEGY', '4', '6', '100+', '2.4M', '$0.99', 'Mature 17+', 'December 30, 2018', '4.0.3 and up']
['Israel News', ' Channel 2 News', 'NEWS_AND_MAGAZINES', '3.857798815', '11976', '1,000,000+', 'Varies with device', '0', 'Everyone 10+', 'March 16, 2019', 'Varies with device']


From here, I can see these are genuine errors (incorrectly filled, with blank tags, etc.). Because the number of flagged errors is so small, I will just delete these rows from the dataset rather than attempting to fix them. The dataset as a whole will still provide sufficient value.

In [28]:
errorSet = [6942, 13505, 23458, 32230, 48439, 113152, 125480, 125481, 165231, 168915, 177166, 180372, 190760, 193870, 194166, 232812, 257774]

# Reversing the errorset so that they get deleted from largest to smallest index (so that the index numbers don't change during deletion)
errorSet.reverse()

for each in errorSet:
    del googlelarge[each]

We can then check the database for errors again.

In [29]:
print('Larger Google dataset check:')
elementCheck(googlelarge, 2, float)
elementCheck(googlelarge, 3, int)
print('')

Larger Google dataset check:
This database has 0 errors at index point 2.
This database has 0 errors at index point 3.



Perfect! We now have 3 databases with proper rows and columns.

### Removing duplicates

Generally, it's a good idea to check for duplicates in the datasets, and remove them if they exist. We will do this as a two step process.
1. Check if the database has duplicates
2. Remove the duplicates

In [30]:
# Initiating lists
duplicate_apps = []
unique_apps = []

# Function to check for duplicates
def check_dupes(listname):
    for each in listname:
        name = each[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    print('Number of unique apps:',len(unique_apps))
    print('Number of duplicate apps:',len(duplicate_apps))
    print('')
    
    del unique_apps[:]
    del duplicate_apps[:]
    
# Checking each list for duplicates
check_dupes(apple)
check_dupes(googlesmall)

# The check for the large database is disabled as my computer isn't strong enough to run it.

# check_dupes(googlelarge)

# This is likely because of a limitation of the code I wrote for checking for duplicates. The if statement checks a database that grows with every row of the database
# It starts by checkin a database with 0 elements, then 1, then 2, then 3, etc. But for a large dataset it soon needs to check through a database of 100k+ rows
# I don't know a better way to check for duplicates, but it looks like Kaggle does because the website says the dataset has 244407 unique apps

if len(googlelarge) == 244407:
    print('No duplicates the larger Google database')
else:
    print('There are',len(googlelarge)-244407,'duplicates in the larger Google database.')

Number of unique apps: 7198
Number of duplicate apps: 0

Number of unique apps: 9661
Number of duplicate apps: 1181

There are 22629 duplicates in the larger Google database.


From this, we can see that the Apple Store dataset doesn't have any duplicates for us to worry about, but the smaller Google Play Store dataset does, as does the larger Google Play Store dataset. We'll filter through and keep only the version of each app with the most reviews (as this suggests the most complete and up-to-date data).

In [31]:
# Initiating dictionary to store highest-review-count-version of each app
reviews_max = {}
reviews_max_l = {}

# Iterating through smaller Google dataset
for each in googlesmall[1:]:
    name = each[0]
    n_reviews = float(each[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Iterating through larger Google dataset
for each in googlelarge[1:]:
    name = each[0]
    n_reviews_l = float(each[3])
    
    if name in reviews_max_l and reviews_max_l[name] < n_reviews_l:
        reviews_max_l[name] = n_reviews_l
    elif name not in reviews_max_l:
        reviews_max_l[name] = n_reviews_l

I can check for errors by comparing the expected length of the dictionary vs. the actual length of the dictionary.

In [32]:
if int(len(googlesmall[1:])-1181) == len(reviews_max):
    print ('Success! The dictionary is the expected size:',len(reviews_max),'entries.')
else:
    print ('Something is wrong.',int(len(googlesmall[1:])-1181),'entries were expected, but',len(reviews_max),'were recorded.')

if int(len(googlelarge[1:])-22629-13-1) == len(reviews_max_l):
    print ('Success! The dictionary is the expected size:',len(reviews_max_l),'entries.')
else:
    print ('Something is wrong.',int(len(googlelarge[1:])-22629-13-1),'entries were expected, but',len(reviews_max_l),'were recorded.')

Success! The dictionary is the expected size: 9660 entries.
Success! The dictionary is the expected size: 244392 entries.


Now, I'll use this dictionary to remove the duplicates, keeping the entry version with the greatest number of reviews and entering them into a new table, 'googlesmall_nodupes'.

In [33]:
# Initiating new lists
googlesmall_nodupes = []
already_added = []
googlelarge_nodupes = []
already_added_l = []

# Filling out new lists
for each in googlesmall[1:]:
    name = each[0]
    n_reviews = float(each[3])
    
    if reviews_max[name] >= n_reviews and name not in already_added:
        googlesmall_nodupes.append(each)
        already_added.append(name)

# The code below takes forever to run. Troubleshoot here.
# for each in googlelarge[1:]:
#     name = each[0]
#     n_reviews_l = float(each[3])
#     
#     if reviews_max_l[name] >= n_reviews_l and name not in already_added_l:
#         googlelarge_nodupes.append(each)
#         already_added_l.append(name)
#     print('Cycle complete')

### Removing non-English apps

To filter out non-English apps, I'll filter the database for any non-English characters (beyond ASCII code 127).

In [34]:
# Defining a function to check if the passed phrase is fully English
def isEnglish(phrase):
    
    non_english = 0
    
    for each in phrase:
        if ord(each) > 127:
            non_english += 1
        
        if non_english >= 3:
            return False

    return True

# Testing this function against several examples
# print(isEnglish('Instachat 😜'))
# print(isEnglish('Instagram'))
# print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# print(isEnglish('Docs To Go™ Free Office Suite'))

googlesmall_nodupes_eng = []
googlelarge_eng = []
apple_eng = []

for each in googlesmall_nodupes:
    name = each[0]
    if isEnglish(name):
        googlesmall_nodupes_eng.append(each)
        
for each in googlelarge:
    name = each[0]
    if isEnglish(name):
        googlelarge_eng.append(each)

for each in apple:
    name = each[1]
    if isEnglish(name):
        apple_eng.append(each)

### Removing paid apps

Because the agency is only concerned with free apps, we can use a similar mechanism to the above to remove any apps that are paid. This is done below.

In [35]:
# Initializing final lists for each dataset
googlesmall_final = []
googlelarge_final = []
apple_final = []

for each in googlesmall_nodupes_eng:
    price = each[7]
    if price == '0':
        googlesmall_final.append(each)
        
for each in apple_eng:
    price = each[4]
    if price == '0.0':
        apple_final.append(each)

print(len(googlesmall_final))
print(len(apple_final))

8847
3203


## Analyzing Data

Now that I have my cleaned datasets, I can begin the analysis. To do this in the most useful way possible, I have to consider the launch strategy of the agency, which is as follows:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

The agency has already determined through previous analysis that there is a direct and linear correlation between the number of installs and the revenue generated by the app; thus to maximize profit our goal is to maximize installations.

### Most Common App Profiles

Given that the agency's target is to launch the same app on multiple stores, I can start by identifying the types of applications that are successful in both the Google Play Store and the Apple iOS store.

In [36]:
# Creating a function to build a frequency table out of any dataset for a given index number

def freq_table(dataset, index):
    table = {}
    total = 0

    for each in dataset:

        # Extracting the value at the given index number
        value = each[index]

        # Checking if value exists in table and either adding to the count or creating the entry
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

        # Increasing the total count by 1
        total += 1

        # Initializing a new table to return figures in percentages
        table_percent = {}

        for each in table:
            percentage = table[each] / total * 100
            table_percent[each] = round(percentage,2)

    return table_percent

# Creating a function that sorts and then prints a given input frequency table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [37]:
# Creating tables for each of the app stores

print('Google Table, Category')
display_table(googlesmall_final,1)
print(' ')
print('Google Table, Genres')
display_table(googlesmall_final,9)
print(' ')
print('Apple Table, Genre')
display_table(apple_final,11)

Google Table, Category
FAMILY : 18.48
GAME : 9.85
TOOLS : 8.43
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.9
FINANCE : 3.71
MEDICAL : 3.53
SPORTS : 3.39
PERSONALIZATION : 3.32
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.09
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.67
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.87
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.39
EDUCATION : 1.29
FOOD_AND_DRINK : 1.24
ENTERTAINMENT : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.81
WEATHER : 0.79
EVENTS : 0.71
ART_AND_DESIGN : 0.68
PARENTING : 0.66
COMICS : 0.61
BEAUTY : 0.6
 
Google Table, Genres
Tools : 8.42
Entertainment : 6.08
Education : 5.36
Business : 4.6
Productivity : 3.9
Lifestyle : 3.89
Finance : 3.71
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.23
Action : 3.1
Health & Fitness : 3.09
Photography : 2.95
News & Magazines : 2.8
Social : 2.67
Travel & Local : 2.33
Shopping : 2.25
Books & Refere

From this quick analysis, we can see that on the Google store, apps are generally a bit fragmented across various categories, with about 20% to family apps (which are mostly kids games), another 10% to games, and the remainder to more productivity-focused applications across various categories.

In the Apple store on other hand, Games are the clearly dominant category with ~60% of apps falling within that category, and another 8% to entertainment. This might indicate that focusing on building games is a decent strategy, but is no means conclusive since the number of apps in a given genre doesn't necessary correlate to the total number of installs for apps in that genre.

### Most Installed Apps

The previous calculations show us the genres with the most applications; I will now focus on identifying which genres have the most user installs. This data is given for the Google dataset, but the Apple Store data is missing install counts; I can use reviews a proxy for that data. While imperfect (as its possible apps in certain categories more frequently prompt users to leave reviews than others) it is likely still directionally informative.

In [38]:
apple_genres = freq_table(apple_final, 11)

for genre in apple_genres:
    total = 0
    len_genre = 0
    
    for each in apple_final:
        app_genre = each[11]
        if app_genre == genre:
            ratings = float(each[5])
            total += ratings
            len_genre += 1
    
    average = total / len_genre
    print(genre,':', average)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22886.36709539121
Music : 57326.530303030304
Reference : 79350.4705882353
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 19156.493670886077
Travel : 28243.8
Shopping : 27230.734939759037
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16815.48
Entertainment : 14195.358565737051
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 46384.916666666664
Finance : 32367.02857142857
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


From the above, we can see that the most frequently rated (and presumably used) are in the navigation, social networking, reference, and music categories. It is still too early to suggest that these profiles make sense to focus on, as the concentration of usage isn't clear from the above summary table. For example, we know that the vast majority of apps are in the games category from earlier analysis, but they aren't in the top 3 most rated. Which suggests that each game on average likely receives less downloads than the average app in other categories.

It'll be important to recalculate the averages above to exclude the anomalies at the top of the charts: A few mega-apps such as Spotify, Google Maps, and Facebook/Instagram are likely heavily skewing the results in specific categories. I would likely want to remove the top X apps in any given category to account for this, but will leave that for later.

For now, I'll repeat the process above for the Google Play dataset.

In [39]:
google_cats = freq_table(googlesmall_final, 1)

for genre in google_cats:
    total = 0
    len_cats = 0
    for each in googlesmall_final:
        app_cat = each[1]
        if app_cat == genre:
            installs = each[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            installs = int(installs)
            
            total += installs
            len_cats += 1
        
    average = total / len_cats
    print(genre,':', average)

ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8814199.78835979
BUSINESS : 1712290.1474201474
COMICS : 832613.8888888889
COMMUNICATION : 38590581.08741259
DATING : 854028.8303030303
EDUCATION : 3082017.543859649
ENTERTAINMENT : 21134600.0
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1341839.736111111
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1441969.3594202898
GAME : 15795366.762342136
FAMILY : 2691618.159021407
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3650602.276666667
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10723898.758713137
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5145550.285714285
VIDEO_PLAYERS : 24852732.40506329
NEWS_AND_MAGAZINES : 

On average, communications apps have the most installs, but as in the apps store, this is likely heavily skewed by a few mega-apps such as Whatsapp. To address this, I will build a function that removes the top 10 apps from consideration in any category, and then provides the same summary statistics shown above.

In [58]:
# Setting the reviews to integers instead of strings so that they will sort properly

for each in googlesmall_final:
    each[3] = int(each[3])
    
for each in apple_final:
    each[5] = int(each[5])
    
# Sorting each dataset

def googleRanking(elem):
    return elem[3] # by review count

googlesmall_final.sort(key=googleRanking, reverse=True)

def appleRanking(elem):
    return elem[5]

apple_final.sort(key=appleRanking, reverse=True)

I will check to see if this worked by exploring the first few apps in each dataset.

In [None]:
explore_data(apple_final,0,5)
explore_data(googlesmall_final,0,5)

Now that the two datasets are sorted, I can write a new function to assess each category, excluding the first 10 responses.

In [82]:
apple_genres = freq_table(apple_final, 11)

for genre in apple_genres:
    total = 0
    len_genre = 0
    excludetop = 10
    
    for each in apple_final:
        app_genre = each[11]
        if app_genre == genre and len_genre >= excludetop:
            ratings = each[5]
            total += ratings
            len_genre += 1
        elif app_genre == genre and len_genre < excludetop:
            len_genre += 1

    average = total / len_genre
    print(genre,':', int(average))

print('')

google_cats = freq_table(googlesmall_final, 1)

for genre in google_cats:
    total = 0
    len_cats = 0
    excludetop = 10
        
    for each in googlesmall_final:
        app_cat = each[1]
        if app_cat == genre and len_cats >= excludetop:
            installs = each[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            installs = int(installs)
            total += installs
            len_cats += 1
        if app_cat == genre and len_cats < excludetop:
            len_cats += 1
        
    average = total / len_cats
    print(genre,':', int(average))

Social Networking : 13521
Photo & Video : 6763
Games : 17766
Music : 5999
Reference : 225
Health & Fitness : 2395
Weather : 379
Utilities : 3895
Travel : 1127
Shopping : 8791
News : 1016
Navigation : 0
Lifestyle : 1275
Entertainment : 7308
Food & Drink : 338
Sports : 5459
Book : 0
Finance : 3417
Education : 1904
Productivity : 6251
Business : 116
Catalogs : 0
Medical : 0

SOCIAL : 4185855
COMMUNICATION : 18310860
GAME : 12236239
TOOLS : 6032209
VIDEO_PLAYERS : 11814757
NEWS_AND_MAGAZINES : 1061275
PHOTOGRAPHY : 10525934
FAMILY : 2287948
TRAVEL_AND_LOCAL : 2244947
PERSONALIZATION : 2140258
MAPS_AND_NAVIGATION : 1203746
ENTERTAINMENT : 5834600
EDUCATION : 1625877
SHOPPING : 3519289
PRODUCTIVITY : 9106171
HEALTH_AND_FITNESS : 1551459
SPORTS : 2050602
BOOKS_AND_REFERENCE : 1142242
LIFESTYLE : 891244
WEATHER : 1431264
FINANCE : 686472
BUSINESS : 643494
FOOD_AND_DRINK : 1106715
COMICS : 184465
PARENTING : 170189
DATING : 363119
HOUSE_AND_HOME : 411284
LIBRARIES_AND_DEMO : 96335
ART_AND_DESIG

By removing the most popular apps, a slightly different narrative emerges. Specifically, in the Apple store we can see that while originally it may have appeared attractive with a large number of average installs (proxied by reviews), the user activity was concentrated in a limited number of apps (less than 10). Given the presence of Waze, Google Maps, etc., in that market, I recommend staying away from the Navigation category.

The game category seems quite competitive: even after excluding the top 10 in the category, the average user activity / reviews for games was much higher than any other category in the Apple store, and quite high in the game store as well. As we know from previous analysis, this is also the category with a significant amount of developer activity (with a large number of apps falling into this category) in the Apple store. To know whether it is a good idea or not to participate in this category is a strategic choice for the agency to make: do we believe we can be competitive (ie. do we have top-tier developers, designers, and ideas?).

The productivity category appears to be promising as well, with a large user base that is not concentrated in only a few players across both app stores. We can print out some of the top apps in these categories across each store to learn more about them.

In [None]:
counter = 0

for each in googlesmall_final:
    if each[1] == 'PRODUCTIVITY' and counter <= 30:
        print(each)
        counter += 1

print('')
counter = 0
for each in apple_final:
    if each[11] == 'Productivity' and counter <= 30:
        print(each)
        counter += 1

From this quick scan, we can see that the category may not be as promising as initially thought. It is still dominated by a large nubmer of ultimate players with many apps (eg. Google with 10+ productivity apps for each of their Google Docs/Sheets/Slides, etc. Security appears to be a theme on the Android play store but does not transfer well to the Apple store, and thus isn't in line with the agency's strategy. And finally there are VPN apps, which are now a relatively crowded space as well.

Photography is another area that appears to be promising across both app stores. Printing out the top 30 apps in each category could provide some insight (done below).

In [None]:
counter = 0

for each in googlesmall_final:
    if each[1] == 'PHOTOGRAPHY' and counter <= 30:
        print(each)
        counter += 1

print('')
counter = 0
for each in apple_final:
    if each[11] == 'Photo & Video' and counter <= 30:
        print(each)
        counter += 1

There appear to be a large number of apps focused on helping individuals edit their photos. This could be promising as it scales well in both app stores, has no highly dominant players which have captured the bulk of the market, and is likely to continue to grow as a category as individuals increasingly turn to their phone cameras as their primary capture device. Pending further analysis, I would suggest that the agency explore the opportunity to build an app in this category.