# Profitable App Profiles for the App Store and Google Play Markets
**A Dataquest Guided Project**

*Lindsay Hodgens*

---

## Introduction

This project was built as part of Dataquest's "Data Scientist in Python" learning path. The objective of this project is to analyze the user base associated with various apps listed on the App Store and Google Play Markets. 


## Step 1: Load in the data

Today we'll be using two datasets:  
<ol>
    <li>"googleplaystore.csv" - A dataset containing information for approx. 10,000 Android apps listed on the Google Play store. *Data collected in August 2018.*</li>
    <li>"AppleStore.csv" - A dataset containing information for approx. 7,000 iOS apps from the App Store. *Data collected in July 2017.*</li>
</ol>
        

In [3]:
from csv import reader

# (1) Load in Google Play dataset
# The with-statement closes each file as soon as its rows have been read.
with open('googleplaystore.csv', encoding='utf8') as opened_goog:
    android_dataset = list(reader(opened_goog))
android_header = android_dataset[0]
android_data = android_dataset[1:]

# (2) Load in App Store dataset
with open('AppleStore.csv', encoding='utf8') as opened_app:
    ios_dataset = list(reader(opened_app))
ios_header = ios_dataset[0]
ios_data = ios_dataset[1:]

## Step 2: Do a little exploring

Dataquest has provided a function called "explore_data()". It prints a slice of the dataset one row at a time, with a blank line after each row, and can optionally report the total number of rows and columns. Let's see what we're going to be working with.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Previewing five rows of the Google Play data.
# The slice 1:6 covers indices 1 through 5, so the row at index 0 is not shown.

print('\n')
print("GOOGLE PLAY DATA: THE FIRST 5 ROWS")
print('\n')
explore_data(android_data, 1, 6, True)
print('\n')

print("******************************************************************")
# Previewing five rows of the App Store data.
# The slice 1:6 covers indices 1 through 5, so the row at index 0 is not shown.

print('\n')
print("APP STORE DATA: THE FIRST 5 ROWS")
print('\n')
explore_data(ios_data, 1, 6, True)
print('\n')
print('\n')
print('\n')



GOOGLE PLAY DATA: THE FIRST 5 ROWS


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 10841
Number of columns: 13


******************

## Step 3: Clean the Data  

*Note: To keep the scope of this guided project reasonable, I will only be correcting known issues with the datasets. I will point out which dataset is affected by each data cleaning step.*

## Step 3.1 - Google Play Data - Correct Known Data Issue with Entry No.10472

As [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) points out, there is a known issue with Entry 10472. The category appears to be missing for this app, which shifts every subsequent value one column to the left.

First I'm going to list out the columns in the Google Play dataset to see what kinds of information *should* be there.

In [8]:
print(android_header)
len(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


13

Now we know the name of each column, and that each row is supposed to have 13 values.

In [9]:
print(android_data[10472])
len(android_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


12

When we pull up Entry 10472 we find that the row is indeed missing a value from one of the 13 columns.

There are a couple of things we could do in this situation. We could delete the row entirely or we could fill in the value. Someone pointed out in the original Kaggle discussion thread that the category seems to be "Lifestyle," so filling in the missing value is a realistic option. Before deciding, I want to see how that category is written in the dataset.

**Digression: The Road Not Taken: How I *Could* Handle this Problem if I were Using Pandas (Which I'm Not)**

As I'm working on this step, it occurs to me that I would have more flexibility to deal with this issue if I had used pandas to put everything into a dataframe. I'm going to refrain from doing that for the time being since it seems like Dataquest wanted to reinforce the "from csv import reader" method of loading in the data. 

If I were using pandas, at this point I'd pull up all of the unique values for the "Category" column, to make sure that I get the right styling for this observation's designation of "Lifestyle." I'm sure there's still a way to do this with my current setup, but I'm not currently aware of it. For now, then, I'm going to confirm this the old-school way: by checking the original 'googleplaystore.csv' file.

![Screenshot of the Google Play Store spreadsheet. The "Category" column has been filtered to show the value 'LIFESTYLE' written in all caps.](https://github.com/lharpercannon/Android-iOS-app-data-exploration/blob/main/img/googleplay_missingcategory.PNG)

When I filtered the "Category" column, I found that "LIFESTYLE" is written out in all caps. Just to make sure, though, I'm going to print out another observation that belongs to this category.

In [11]:
print(android_data[1576])

['Telstra', 'LIFESTYLE', '3.0', '4260', '6.3M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 8, 2017', '6.1', '2.3.3 and up']
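
For what it's worth, the unique values of a column can also be listed without pandas by collecting them into a set. The snippet below is a minimal sketch of that idea; `sample_rows` and `unique_values()` are hypothetical names made up for illustration, and the sample holds only a few rows copied from the output in this notebook.

```python
# Hypothetical sample rows standing in for android_data (only the first
# columns matter here). The last row mimics the shifted entry, where the
# rating has slid into the Category position.
sample_rows = [
    ['Telstra', 'LIFESTYLE', '3.0'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Sat-Fi Voice', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19'],
]

def unique_values(dataset, index):
    """Return the sorted distinct values found in one column."""
    return sorted({row[index] for row in dataset})

categories = unique_values(sample_rows, 1)
print(categories)
```

On the full dataset the call would presumably be `unique_values(android_data, 1)`. A stray value such as '1.9' in the result is also a quick way to spot a shifted row.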


## Step 3.1 - CONTINUED

*Or, How am I Going to Fix This? By Simply Deleting the Row.*

Alright, first things first: I want to make sure that I don't delete the wrong row somehow. To guard against this, I'm going to print a sequence of rows that contains the observation I intend to delete. For ease of use, I'm going to use that nice **explore_data()** function that Dataquest provided us.

In [12]:
explore_data(android_data, 10470, 10475)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']




As we can see, our target observation (10472) is smack dab in the middle. Let's delete this observation and see what happens.

In [13]:
del android_data[10472]

A little nerve-racking, isn't it? Let's run the same explore_data() call that we did before. If everything worked correctly, the target observation should be gone and the subsequent observations should have moved up one position. This means we *also* should see a new row included as the last entry of our exploratory list.

In [14]:
explore_data(android_data, 10470, 10475)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']




That didn't turn out too bad! 

That's one known data issue taken care of.
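
Deleting the row is the simplest fix. The alternative considered earlier, filling in the missing category, could look roughly like the sketch below. It works on a copy of the printed row rather than on `android_data`, the helper name `fill_missing_category()` is hypothetical, and the value 'LIFESTYLE' is only the category suggested in the Kaggle thread, not something verified here.

```python
# Copy of the shifted row as printed earlier (12 values instead of 13).
broken_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
              '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
              '1.0.19', '4.0 and up']

def fill_missing_category(row, category, expected_len=13):
    """Return a repaired copy of a row that lacks its Category value."""
    if len(row) != expected_len - 1:
        raise ValueError('row does not look like a shifted entry')
    # Re-insert the category at index 1 so later values shift back into place.
    return row[:1] + [category] + row[1:]

fixed_row = fill_missing_category(broken_row, 'LIFESTYLE')
print(len(fixed_row))
print(fixed_row[1:3])
```

Note that the repaired row would still have an empty string in the "Genres" position (index 9), so filling in the category alone does not make the observation complete.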

---



## Step 3.2 - Google Play Data - Remove Duplicate Entries  

On the Kaggle website, folks have pointed out that the dataset contains some duplicate entries. Let's look at an example.

In [16]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As we can see here, Instagram appears within the dataset four times. While we're at it, why don't we go ahead and see how many duplicate rows are in the entire dataset?  

*You'll notice that I have also created an empty list called "unique_android_apps." Never fear! We're going to come back to that later on.*

In [21]:
duplicate_android_apps = [] # Names of apps each time they show up again.
unique_android_apps = [] # Names of apps seen so far (one entry per app).

for app in android_data:
    name = app[0]
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_android_apps))
print('\n')
print('Examples of duplicate apps:' + '\n')
explore_data(duplicate_android_apps, 0, 15)

Number of duplicate apps: 1181


Examples of duplicate apps:

Quick PDF Scanner + OCR FREE


Box


Google My Business


ZOOM Cloud Meetings


join.me - Simple Meetings


Box


Zenefits


Google Ads


Google My Business


Slack


FreshBooks Classic


Insightly CRM


QuickBooks Accounting: Invoicing & Expenses


HipChat - Chat Built for Teams


Xero Accounting Software




With this, we know that there are **1,181 duplicate apps** and we can even get a preview of some of the apps that have been impacted by this.
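
As a cross-check on the counting logic, the same tally can be sketched with `collections.Counter` from the standard library. The mini-dataset below is hypothetical and only carries app names; on the real data the generator would presumably iterate over `android_data` instead.

```python
from collections import Counter

# Hypothetical mini-dataset; only the app name (index 0) matters here.
sample_rows = [['Box'], ['Slack'], ['Box'], ['Instagram'], ['Box'], ['Slack']]

name_counts = Counter(row[0] for row in sample_rows)

# Each name contributes (occurrences - 1) surplus rows.
n_duplicates = sum(count - 1 for count in name_counts.values())

# Names that occur more than once, with how often they occur.
repeated = {name: count for name, count in name_counts.items() if count > 1}

print(n_duplicates)
print(repeated)
```

Counting surplus rows this way matches the loop above: an app that appears three times adds two entries to the duplicate tally.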

Of course, that leaves us with another question: How are we going to take care of these duplicates? We *could* go through and randomly select one observation to retain. However, this plan assumes that one observation for an app is as relevant as the next. If we're going to be thorough, we can't assume that this is the case. If nothing else, we need to check whether we should prioritize a particular entry based on other attributes within the observation.

Keeping that in mind, let's return to the example we'd looked at previously: the 4 observations associated with the Instagram app.

In [23]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')
        
# You may have noticed that this loop looks quite similar to the one we
# used before. They're not exactly the same, though. In this case, we're
# taking a more targeted approach than selecting all duplicate observations.
# By targeting one specific value for one specific attribute--a value of 
# 'Instagram' for a row's name variable--we can bring up all of the
# Instagram rows contained in this dataset.

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




At first glance, everything seems to be the same, except for the values at index 3 (the fourth value in each row). Here we have:

<ul>
    <li>'66577313'</li>
    <li>'66577446'</li>
    <li>'66577313'</li>
    <li>'66509917'</li>
</ul>  

Let's check our header names and see what this value is supposed to be.

In [26]:
print(android_header[3])

Reviews


Because the duplicate rows differ in their number of total app reviews, we actually *can* select which duplicate observation to retain in a systematic way that preserves the quality and consistency of our dataset.

Therefore, our strategy will be to retain the observation that has the highest review count, since a higher count suggests that the row was collected more recently and gives the *most recent* account of the app's attributes.

In [30]:
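reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    # Store the count the first time we see a name, and overwrite it
    # whenever a later row for the same app has more reviews.
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews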

This step ought to have recorded one entry per unique app name, paired with the highest number of reviews seen for that app. As always, we want to check our work.

In [31]:
print('Expected length:', len(android_data) - 1181)
# Remember, 1181 is the number of known duplicate entries.

print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Great! The dictionary has one entry per unique app, which is the length we expected. Next we're going to use the dictionary we created previously to remove duplicate rows from consideration.

In [33]:
android_clean = [] # The location for our new (cleaned!) dataset
already_added = [] # Stores app names as they're added to the previous list.

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        

Let me back up a little and explain what we've just done. In a previous step we used the **reviews_max** dictionary to identify which observation *should* be kept when we come across duplicates.  

We're doing something similar this time around. At first I wondered why the "reviews_max" step isn't sufficient on its own: by identifying the highest review count for every app, haven't we already functionally removed all of the duplicates we don't want? Not quite. The dictionary only stores app names and review counts, not the full rows, so we still need a second pass over android_data to collect the rows themselves. The "already_added" check matters as well: when two duplicate rows tie for the highest review count, it makes sure that only the first of them is kept.

This time we've created two empty lists, as noted above in the code. Then we've created a for loop that goes through each row and checks whether the observation in question has an "n_reviews" value that matches the corresponding value for that app in the "reviews_max" dictionary. If a match is found and the app hasn't already been added to our cleaned dataset (meaning: it's not found in our "already_added" list) then we append the details for this row to our android_clean list.  

Therefore, we can use the same methods to check our work as before. There should be 9659 entries in our android_clean list.

In [34]:
print('Expected length:', len(reviews_max))
print('Actual length:', len(android_clean))

Expected length: 9659
Actual length: 9659
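
To see why the second pass needs both the review-count check and the "already_added" check, here is a toy version of the two passes. The rows and the review count for 'Box' are hypothetical values chosen for illustration, and a set is used for `already_added` instead of a list purely to make membership tests faster.

```python
# Hypothetical rows: [name, category, rating, reviews]
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # exact tie with the first Box row
]

# Pass 1: highest review count per app name.
reviews_max = {}
for app in sample_rows:
    name, n_reviews = app[0], float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Pass 2: keep the first row that matches the maximum for its name.
clean, already_added = [], set()
for app in sample_rows:
    name, n_reviews = app[0], float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        clean.append(app)
        already_added.add(name)

print(len(reviews_max), len(clean))
```

Without the `already_added` test, both 'Box' rows would pass the review-count comparison and the cleaned list would end up with three rows instead of two.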


## Step 3.3 - Google Play and App Store Data - Addressing Business Requirement Regarding Non-English Apps



In the assignment prompt, Dataquest has noted that our objective is to perform market analysis for a company that intends to develop apps for an English-speaking audience. With that being the case, we'll want to make sure that the only apps in our dataset are English-language apps.  

Let's look at some examples of non-English apps.

In [42]:
print("App Store Example #1:", ios_data[813][1])
print('\n')
print("App Store Example #2:", ios_data[6731][1])
print('\n')
print('\n')
print("Google Play Example #1:", android_data[9308])
print('\n')
print("Google Play Example #2:", android_data[9309])

App Store Example #1: 爱奇艺PPS -《欢乐颂2》电视剧热播


App Store Example #2: 【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜




Google Play Example #1: ['أحداث وحقائق | خبر عاجل في اخبار العالم', 'NEWS_AND_MAGAZINES', '4.8', '311', '14M', '5,000+', 'Free', '0', 'Everyone 10+', 'News & Magazines', 'July 15, 2018', '2.9.19a', '4.2 and up']


Google Play Example #2: ['Diário Escola Mestres EF', 'FAMILY', 'NaN', '11', '6.6M', '1,000+', 'Free', '0', 'Everyone', 'Education', 'July 31, 2018', '3.0.18 Fundamental', '4.0 and up']


Before moving on, I do want to point out one way that my last step differs from the Dataquest walkthrough. In the Dataquest walkthrough, observations are pulled from the **android_clean** list. This makes sense, because it represents our cleanest version of the data so far. When I went to call the recommended observations, however, I noticed that my index numbers seem to differ from the tutorial's. Since we double-checked the success of our previous data cleaning steps, I'm not particularly concerned about something being wrong with my cleaned data. However, it is clear that (for whatever reason) there are some differences between *my* cleaned data and the *tutorial's* cleaned data.

**My temporary solution:** For the time being I've chosen to pull the Google Play examples from the original android_data list.

---

To filter out non-English apps, we're going to remove apps whose names contain characters that are not commonly used in English text.  

The characters commonly used in English text (letters, digits, and basic punctuation) belong to the ASCII range, whose codes fall between 0 and 127. Let's look at a couple of examples.

In [49]:
print('\n')
print('Likely to be found in English language: A B C')
print('\n')

print('Character numbers:')
print(ord('A'))
print(ord('B'))
print(ord('C'))
print('\n')
print('\n')

print('Not found in the English language: 爱 日 人')
print('\n')

print('Character numbers:')
print(ord('爱'))
print(ord('日'))
print(ord('人'))
print('\n')



Likely to be found in English language: A B C


Character numbers:
65
66
67




Not found in the English language: 爱 日 人


Character numbers:
29233
26085
20154




In [73]:
def is_English(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True
        
print(is_English('Instagram'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))

True
False
True
True
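
The same check can be written more compactly with a generator expression. This is only a sketch of an equivalent variant; the name `is_english_compact()` and the `max_non_ascii` parameter are made up here so that the threshold of three is easy to adjust.

```python
def is_english_compact(name, max_non_ascii=3):
    """True when at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english_compact('Instagram'))
print(is_english_compact('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_compact('Docs To Go™ Free Office Suite'))
print(is_english_compact('Instachat 😜'))
```

Passing `max_non_ascii=0` would reject any name containing even a single symbol such as ™ or an emoji, which is why a small tolerance is useful.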


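The function tolerates up to three non-ASCII characters so that English app names containing an emoji or a symbol such as ™ are not thrown out along with the genuinely non-English ones. Now I'm going to use this function to do a cleaning pass on both datasets.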

In [88]:
# Google Play Dataset

android_english = []
android_non_english = []

for app in android_clean:
    name = app[0]
    if is_English(name):
        android_english.append(app)
    else:
        android_non_english.append(app)
        
# App Store Dataset

ios_english = []
ios_non_english = []

for app in ios_data:
    name = app[1]
    if is_English(name):
        ios_english.append(app)
    else:
        ios_non_english.append(app)
        
print('Previous Google Play app count:', len(android_clean))
print('Updated Google Play app count:', len(android_english))
print('\n')
print('Previous App Store app count:', len(ios_data))
print('Updated App Store app count:', len(ios_english))

Previous Google Play app count: 9659
Updated Google Play app count: 9614


Previous App Store app count: 7197
Updated App Store app count: 6183


Before I do anything else, I want to peek at the first few rows for the English and Non-English lists.

In [89]:
print('\n')
print('Google Play Data')
print('\n')
print('ENGLISH APPS - FIRST 5')
print('\n')
explore_data(android_english, 0, 5)

print('NON-ENGLISH APPS - FIRST 5')
print('\n')
explore_data(android_non_english, 0, 5)
print('\n')
print('\n')

print('App Store Data')
print('\n')
print('ENGLISH APPS - FIRST 5')
print('\n')
explore_data(ios_english, 0, 5)

print('NON-ENGLISH APPS - FIRST 5')
print('\n')
explore_data(ios_non_english, 0, 5)



Google Play Data


ENGLISH APPS - FIRST 5


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


NON-ENGLISH APPS - FIRST 5


['Fl

## Step 3.4 - Google Play and App Store Data - Addressing Business Requirement Regarding Non-Free Apps  

As the Dataquest assignment instructions point out, we're doing market research for a company that wants to develop *free* apps. Thus, we'll want to remove all non-free apps.  


In [95]:
# Google app price column is index[7]
# App store is index[4]

## GOOGLE PLAY DATA ##

google_free = []
google_paid = []

for app in android_english:
    price = app[7]
    if price == '0':
        google_free.append(app)
    else:
        google_paid.append(app)

## APP STORE DATA ##

ios_free = []
ios_paid = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
    else:
        ios_paid.append(app)

print("NUMBER OF FREE, ENGLISH-LANGUAGE APPS")
print("Google Play:", len(google_free))
print("App Store:", len(ios_free))
print('\n')

print("NUMBER OF PAID, ENGLISH-LANGUAGE APPS")
print("Google Play:", len(google_paid))
print("App Store:", len(ios_paid))

NUMBER OF FREE, ENGLISH-LANGUAGE APPS
Google Play: 8862
App Store: 3222


NUMBER OF PAID, ENGLISH-LANGUAGE APPS
Google Play: 752
App Store: 2961
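
The two stores write a zero price differently ('0' on Google Play, '0.0' on the App Store), so the comparisons above depend on exact string matches. One possible way to make the check less brittle is to parse the price into a number first. The helpers below are hypothetical, and the leading '$' on paid Google Play prices is an assumption about the raw data that is not shown in this notebook.

```python
def parse_price(raw_price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: paid prices may carry a leading dollar sign.
    cleaned = raw_price.strip().lstrip('$')
    return float(cleaned)

def is_free(raw_price):
    """True when the parsed price is exactly zero."""
    return parse_price(raw_price) == 0.0

for example in ['0', '0.0', '$4.99']:
    print(example, '->', is_free(example))
```

With a helper like this, both loops could share the same test and differ only in which column index they read.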


## Step 4 - Analysis  

**Assignment Details**
Our fictional company has the following plan set in place:

<ol>
    <li>Build a minimal Android version of the app and publish on Google Play.</li>
    <li>If the app has a good response from users, develop it further.</li>
    <li>If the app is profitable after six months, build an iOS version of the app and publish it on the App Store.</li>
</ol>   

Based on this plan, we'll want to find app attributes that are correlated with success on both marketplaces.  

---

## Step 4.1 - Genre Analysis  

First, let's determine the most common genre categories for each market. To do so, we're going to build some frequency tables for a few columns in our dataset.

In [105]:
## Google Play - Genres is index[9]
## App Store - prime_genre is index[11]

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

## Dataquest has provided a handy-dandy function called "display_table()"

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
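
For comparison, the standard library's `collections.Counter` can build the same kind of percentage table in a few lines. This is a sketch rather than a replacement for the functions above; `freq_table_counter()` and `sample_rows` are hypothetical names used only for this example.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, highest share first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending count order.
    return {value: count / total * 100
            for value, count in counts.most_common()}

# Hypothetical rows: [name, genre]
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample_rows, 1)
for genre, pct in table.items():
    print(genre, ':', round(pct, 2))
```

Because dictionaries preserve insertion order in Python 3.7 and later, the returned table is already sorted from most to least frequent.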

### 4.1.1 Analyzing the "prime_genre" column in App Store data

In [106]:
print('\n')        
display_table(ios_free, 11)
print('\n')



Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665




**So, what do we notice?**

**Top 3 genres**

| Genre | Frequency (%) |
|-------|----------|
|Games|58.16|
|Entertainment|7.88|
|Photo & Video|4.97|

A few things jump out right away:
<ul>
    <li>The race was not close whatsoever between the first- and second-ranked genres. "Games" wins by a landslide.</li>
    <li>That said, there's also a fairly significant gap between the second- and third-ranked genres. This becomes especially obvious when we look at the items at the bottom of the frequency list, which often have differences in the range of .1 - .4 percentage points.</li>
</ul>

### 4.1.2 Analyzing the "Genres" column in Google Play data

In [103]:
print('\n')        
display_table(google_free, 9)



Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.925299029564

**Top 3 genres**

| Genre | Frequency (%) |
|-------|----------|
|Tools|8.43|
|Entertainment|6.07|
|Education|5.35|

Some observations:
<ul>
    <li>Unlike the previous example, we don't see a huge difference in frequency between the first-, second-, and third-ranked genres.</li>
    <li>There is a suspicious lack of "Games" on this list. When we delve deeper into the frequency table we see that there are many genre categories that seem to refer to games, e.g. Arcade, Puzzle, and Racing, as well as more specific combined categories such as Role Playing/Brain Games, Arcade/Pretend Play, and Adventure/Education.</li>
</ul>

**Recommendation**  

Based on the second observation, I would strongly recommend cleaning the datasets further--creating high-level categories like we saw in the previous example and assigning all of these subgenres to the appropriate category.
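
A rough sketch of what that regrouping might look like is shown below. The set of genres treated as games is an assumption made purely for illustration, and `broad_genre()` is a hypothetical helper; the only detail taken from the data is that combined genres are separated by a semicolon, as in 'Art & Design;Pretend Play'.

```python
# Hypothetical mapping from Google Play sub-genres to a broad "Games" bucket.
# Which genres count as games is an assumption made for illustration.
GAME_GENRES = {'Action', 'Arcade', 'Casual', 'Puzzle', 'Racing',
               'Role Playing', 'Simulation'}

def broad_genre(genres_value):
    """Collapse a 'Genres' value such as 'Arcade;Pretend Play' to one label."""
    primary = genres_value.split(';')[0]   # keep the part before any ';'
    return 'Games' if primary in GAME_GENRES else primary

for value in ['Arcade', 'Role Playing;Brain Games', 'Tools',
              'Art & Design;Creativity']:
    print(value, '->', broad_genre(value))
```

Running the frequency table on the collapsed labels instead of the raw "Genres" values would make the Google Play figures more directly comparable to the App Store's "prime_genre" column.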

### 4.1.3 Analyzing the "Category" column in Google Play data

In [104]:
print('\n')        
display_table(google_free, 1)



FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.

**Top 3 categories**

| Category | Frequency (%) |
|-------|----------|
|FAMILY|18.45|
|GAME|9.87|
|TOOLS|8.44|

Some observations:
<ul>
    <li>It's not immediately clear what "Family" refers to. This is troublesome for a few reasons.
        <ol>
            <li>Are these entertainment apps that are appropriate for a wide range of ages? If so, how does that reconcile with the "Video Players", "Entertainment", and "Comics" categories?</li>
            <li>Are they education apps geared at children? If so, how does that reconcile with the "Education" category?</li>
        </ol>
    </li>
    <li>Whatever the "Family" category consists of, it has a decent lead on the second-ranking category ("GAME").</li>
    <li>The top-3 table doesn't show it, but there is also a sizeable gap between the third- and fourth-ranked categories (TOOLS and BUSINESS): approximately 3.8 percentage points.</li>
</ul>

## Step 4.2 - App Store - Userbase Analysis (by Genre)  

Now that we've identified some popular genres/categories, let's take a look at the average number of user ratings associated with each genre. 

**Important note:** The number of user ratings is not the same as the size of an app's userbase, because not all users rate an app on its market platform.

In [110]:
genres_ios = freq_table(ios_free, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


**Three genres of interest**

| Genre | Average Number of Ratings |
|-------|----------|
|Social Networking|71548.34905660378|
|Photo & Video|28441.54375|
|Games|22788.6696905016|

These are the first three genres in the output above, not the three highest averages. Ranked by average alone, Navigation (about 86,090), Reference (about 74,942), and Social Networking (about 71,548) lead.

Some observations:
<ul>
    <li>"Social Networking" likely won't be too relevant to a new company. The average number of ratings in this genre is probably driven by a small handful of apps. Users tend to be reluctant to jump over to new social networks, so I wouldn't recommend pursuing this genre.</li>
    <li>"Photo & Video" could refer to a lot of different things. It would be very interesting to get a more detailed breakdown of what apps fall into this genre. How many apps are standalone editing tools that are designed to perform one or two specific functions? How many apps have more advanced capabilities? How many focus on photo and how many focus on video?</li>
    <li>"Games" can be a lucrative app genre, but (as I'll discuss in the conclusion of this document) building in this genre requires a lot of specific domain knowledge.</li>
</ul>
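
One way to check whether a genre's average is driven by a handful of apps is to measure what share of all ratings the most-rated apps hold. The sketch below is only an illustration: `top_k_share` is a hypothetical helper and the counts are made up, not taken from the dataset.

```python
def top_k_share(rating_counts, k=3):
    """Return the fraction of all ratings held by the k most-rated apps."""
    total = sum(rating_counts)
    if total == 0:
        return 0.0  # avoid dividing by zero when no app has any ratings
    top = sorted(rating_counts, reverse=True)[:k]
    return sum(top) / total

# Invented rating counts for five apps in one genre
sample_counts = [900000.0, 50000.0, 30000.0, 15000.0, 5000.0]
print(top_k_share(sample_counts, k=1))  # 0.9
```

A share close to 1 for a small `k` would suggest that the genre average says little about what a typical new app could expect.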

## Step 4.3 - Google Play - Userbase Analysis (by Genre)

In [112]:
# Categories of the free Google Play apps (column 1 holds the category)
categories_google = freq_table(google_free, 1)

for category in categories_google:
    total = 0          # sum of install counts for this category
    len_category = 0   # number of apps in this category
    for app in google_free:
        category_app = app[1]
        if category_app == category:
            # Column 5 holds installs as text such as '1,000,000+'
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 3082017.543859649
ENTERTAINMENT : 21134600.0
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15837565.085714286
FAMILY : 2691618.159021407
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10695245.286096256
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24852732.40506329
NEWS_AND_MAGAZINE
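
The install counts are stored as text with commas and a trailing plus sign, so each value is better read as a lower bound ("at least this many installs") than as an exact figure. A minimal sketch of the cleaning step as a reusable function follows. `parse_installs` is a hypothetical helper, and returning `None` for values that can't be parsed is my own assumption rather than something the project does.

```python
def parse_installs(raw):
    """Convert text such as '1,000,000+' to a float, or None if it isn't numeric."""
    cleaned = raw.replace('+', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('500+'))        # 500.0
print(parse_installs('Free'))        # None
```

Rows that come back as `None` can then be skipped or inspected instead of stopping the loop with an exception.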

**Three categories of interest**

| Category | Average Number of Installs |
|-------|----------|
|ART_AND_DESIGN|1905351.6666666667|
|AUTO_AND_VEHICLES|647317.8170731707|
|BEAUTY|513151.88679245283|

These are the first three categories in the output above, not the three highest averages. Among the categories whose averages are shown, COMMUNICATION (about 38.5 million), VIDEO_PLAYERS (about 24.9 million), and SOCIAL (about 23.3 million) rank highest.

Some observations:
<ul>
    <li>To reiterate a point I made in the previous subsection, art and design would be a good category to research. It isn't immensely useful as a category at the moment, though, because "art and design" could refer to a wide range of things.</li>
    <li>I would also make the same point about the "Beauty" category.</li>
    <li>Auto and vehicles would be another interesting category to research. However, the company *is* attempting to produce an app that can succeed on both marketplaces. Given that, I would at minimum conduct additional research on the viability of auto/vehicle apps on the App Store before choosing this category.</li>
</ul>

## Conclusions  

### Complicating Factors to Consider  

Unless more work is done on the data, I'd have to be very conservative with my recommendations. 
<ol>
<li>We assessed the average number of user ratings (App Store) and installs (Google Play) for each genre/category, but that *doesn't* give us any insight about the range of values for apps *within* a given genre or category. If a particular game is a runaway hit (e.g. Genshin Impact, Candy Crush, Temple Run), the userbase numbers for that game *do not imply* that any app in the "Games" category will find a comparable userbase. In practice, there is usually a tremendous disparity between the user counts of the top games and the user counts spread across all other apps in that category.</li>
    
<li>I would be wary of giving the impression that all categories or genres are equally difficult to build. This is especially relevant considering that "Games" emerges as a prominent category in both the App Store and Google Play markets. Game design is a specialized field, and users have different expectations for what constitutes a "good user experience." If the company wants to quickly produce an app for the market *and* that company does not have employees who are experienced with mobile game design, I would consider other options first. It would be prudent to assume that this could also hold true for other app categories.</li>

<li>If we look at the top genres and categories for the App Store and Google Play, we'll find that practical apps factor heavily into the Google Play market but not so much within the App Store. However, enjoyment-based apps perform strongly in both marketplaces.</li>
</ol>
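
The first point above can be checked directly by putting each genre's median next to its mean: when one runaway hit dominates, the mean sits far above the median. The following is a sketch under my own assumptions. `summarize_by_key` is a hypothetical helper, and the rows are invented rather than drawn from either dataset.

```python
from collections import defaultdict
from statistics import mean, median

def summarize_by_key(rows, key_index, value_index):
    """Return {key: (mean, median, count)} for float(row[value_index])."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_index]].append(float(row[value_index]))
    return {
        key: (mean(values), median(values), len(values))
        for key, values in grouped.items()
    }

# Invented rows: [name, rating count, genre]; 'Game D' plays the runaway hit
sample_rows = [
    ['Game A', '10', 'Games'],
    ['Game B', '20', 'Games'],
    ['Game C', '30', 'Games'],
    ['Game D', '999940', 'Games'],
    ['Weather A', '100', 'Weather'],
    ['Weather B', '300', 'Weather'],
]
summary = summarize_by_key(sample_rows, key_index=2, value_index=1)
for key, (avg, mid, count) in summary.items():
    print(key, ':', 'mean', avg, 'median', mid, 'apps', count)
```

In the toy data the "Games" mean is 250000.0 while the median is only 25.0, which is the kind of gap that would warn against reading the genre average as a typical outcome.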

### Next Steps

I'd recommend the following as next steps:

<ul>
    <li>Analyze the popularity of individual apps within genres of interest.
        <ul>
            <li>How does popularity seem to be distributed within the apps that belong to this genre?</li>
            <li>Are there any noticeable patterns for popular and unpopular apps?</li>
        </ul>
    </li>
    <li>Consult with Subject Matter Experts about the viability of developing within particular genres, particularly:
        <ul>
            <li>Video Games</li>
            <li>Social Media/Networking</li>
        </ul>
    </li>
    <li>Enhance the usefulness of the Google Play data by engineering high-level thematic categories and sorting the existing subcategories into the appropriate categories.</li>
    <li>Once you've narrowed it down to two or three genres/categories, conduct additional analysis to identify subgenres/subcategories and the patterns associated with them.</li>
</ul>
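
As a starting point for the thematic-category step, the existing Google Play categories could be mapped to a handful of broader themes with a dictionary. Everything in the sketch below is hypothetical: the theme names, the category-to-theme assignments, and the helper functions are placeholders I chose for illustration, and a real mapping would need input from subject matter experts.

```python
# Hypothetical mapping: themes and assignments are illustrative guesses only
THEME_BY_CATEGORY = {
    'GAME': 'Entertainment',
    'COMICS': 'Entertainment',
    'VIDEO_PLAYERS': 'Entertainment',
    'ENTERTAINMENT': 'Entertainment',
    'TOOLS': 'Practical',
    'PRODUCTIVITY': 'Practical',
    'FINANCE': 'Practical',
    'BUSINESS': 'Practical',
}

def theme_for(category, default='Unassigned'):
    """Look up the broad theme for a category, falling back to a default."""
    return THEME_BY_CATEGORY.get(category, default)

def count_by_theme(rows, category_index=1):
    """Count how many rows fall under each theme."""
    counts = {}
    for row in rows:
        theme = theme_for(row[category_index])
        counts[theme] = counts.get(theme, 0) + 1
    return counts

# Invented rows: [name, category]
sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'COMICS'],
    ['App D', 'PARENTING'],
]
theme_counts = count_by_theme(sample_rows)
print(theme_counts)  # {'Entertainment': 2, 'Practical': 1, 'Unassigned': 1}
```

Keeping an explicit "Unassigned" bucket makes it easy to see which categories still need a decision.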