# Profitable App Profiles for the App Store and Google Play Markets

Our company builds Android and iOS mobile apps that are free to download and install. Our main source of revenue consists of in-app ads.

This means that the more users we get, the more people will be seeing the ads, the more revenue we will be generating. Therefore it is important for us, as data analysts, to help our developers understand what kinds of apps are likely to attract more users on Google Play and the App Store.

### Summary of Results
After analyzing the data, we concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. For example, quite a few apps built around the Quran qualify as successful in both markets. Since both the Google Play and the App Store markets are already full of libraries, we need to add special features besides the raw version of the book we select. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

For more details, please refer to the full analysis below.

## Opening and Exploring the Data Sets  

Our initial goal is to explore the datasets we're interested in. 

The [first data set](https://www.kaggle.com/lava18/google-play-store-apps/home) contains data on approximately ten thousand Android apps from Google Play. The data was collected in August 2018.  

The [second data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) contains data on approximately seven thousand iOS apps from the App Store. The data was collected in July 2017.  

We first start by using the `open()` function to open both the Apple Store and Google Play Store data sets. We then import the `reader()` function from the `csv` module and read both files. In order to be able to start analyzing the data, we must place the dataset in a list of lists. To do this, we use the `list()` function and assign the final lists to `list_apple` and `list_google`.

In [1]:
opened_file_apple = open('AppleStore.csv')
opened_file_google = open('googleplaystore.csv')
from csv import reader
read_file_apple = reader(opened_file_apple)
read_file_google = reader(opened_file_google)
list_apple = list(read_file_apple)
list_google = list(read_file_google)
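
As a hedged alternative to the cell above, the same loading step can be wrapped in a small helper that closes the file automatically and states the text encoding explicitly. The name `load_csv` and the choice of UTF-8 are assumptions made for this sketch, not part of the original notebook.

```python
import csv


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return its rows as a list of lists (hypothetical helper)."""
    # The `with` block closes the file even if reading fails.
    # newline="" is the setting the csv module documentation recommends for reader input.
    with open(path, encoding=encoding, newline="") as csv_file:
        return list(csv.reader(csv_file))


# Possible usage, assuming the two files sit next to the notebook:
# list_apple = load_csv('AppleStore.csv')
# list_google = load_csv('googleplaystore.csv')
```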

We create a function that allows us to explore our dataset easily. This function prints the dataset's rows in a more readable way and can tell us how many rows and columns are within our dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We apply the `explore_data()` function to both lists:

In [3]:
explore_data(list_apple,0,4,True)
print('\n')
explore_data(list_google,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

- The [data set from the Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) appears to have 7198 rows (including the header row) and 16 columns. The following table provides a description of each column:

|Column Name|Description|
|----|-----------|
|'id'|App ID|
|'track_name'| App Name|
|'size_bytes'| Size (in Bytes)|
|'currency'| Currency Type|
|'price'| Price amount|
|'rating_count_tot'| User Rating counts (for all version)|
|'rating_count_ver'| User Rating counts (for current version)|
|'user_rating'|Average User Rating value (for all version)|
|'user_rating_ver'|Average User Rating value (for current version)|
|'ver'| Latest version code|
|'cont_rating'| Content Rating|
|'prime_genre'|Primary Genre|
|'sup_devices.num'| Number of supporting devices|
|'ipadSc_urls.num'| Number of screenshots showed for display|
|'lang.num'| Number of supported languages|
|'vpp_lic'|Vpp Device Based Licensing Enabled|

- The [dataset from the Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/home) appears to have 10842 rows (including the header row) and 13 columns. The following table provides a description of each column:

|Column Name|Description|
|-|-|
|'App'|Application name|
|'Category'|Category the app belongs to|
|'Rating'|Overall user rating of the app (as when scraped)|
|'Reviews'|Number of user reviews for the app (as when scraped)|
|'Size'|Size of the app (as when scraped)|
|'Installs'|Number of user downloads/installs for the app (as when scraped)|
|'Type'|Paid or Free|
|'Price'|Price of the app (as when scraped)|
|'Content Rating'|Age group the app is targeted at - Children / Mature 21+ / Adult|
|'Genres'|An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.|
|'Last Updated'|Date when the app was last updated on Play Store (as when scraped)|
|'Current Ver'|Current version of the app available on Play Store (as when scraped)|
|'Android Ver'|Min required Android version (as when scraped)|

## Preparing the Data Sets (Data Cleaning)  

Data cleaning is done before beginning the analysis, and it includes removing or correcting wrong data, removing duplicate data, modifying the data to fit the purpose of our analysis, etc. 

At our company, we only build apps that are free to download and install and that are directed toward an English-speaking audience. This means that we'll need to:

- Remove non-English apps
- Remove non-free apps

### Detecting and Deleting Wrong Data  

Using the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) from the Google Play data set, we have found several errors, including displaced values in row 10472 (index number 10473). We begin by printing the row at that index to check whether it's indeed incorrect. Printing the header row allows us to cross reference whether each entry belongs in the right column.

In [4]:
print(list_google[0])
print(list_google[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We notice that the entries are displaced (e.g. Category is '1.9', Price is 'Everyone').

An error has been detected - we now need to fix it.
To do that, we remove the row that has an error using the `del` statement.

In [5]:
del list_google[10473]
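
The faulty row printed above has 12 values while the header has 13 columns, so rows of this kind can also be located by comparing lengths instead of relying on a known index. The following is a minimal sketch under that assumption; `find_malformed_rows` is a hypothetical helper, and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's length."""
    header_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != header_length
    ]


# Toy data: the last row is missing one value, so its columns are shifted.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

A length check only catches rows with missing or extra values. A row with the right number of columns but wrong contents would still need a separate check.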

We now perform the same task on the App Store data set.

The [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) of the App Store data set does not report any errors. We can therefore continue with the rest of the data cleaning process.

### Detecting and Deleting Duplicate Entries

Within the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) from the Google Play data set, we also found that there were several cases of duplicate entries that were noticed by other data scientists.

The first thing that we need to do is to confirm that there are duplicate entries in the data set. One example that was brought up was the case of Instagram. Let's check if this is the case:

In [6]:
for app in list_google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We notice that Instagram has 4 entries in the data set. This confirms that the data set has at least one case of duplicate entries. We now need to check how many other cases there are and then remove the redundant entries.

In order to detect all the duplicate entries within the data set, we start by:
1. Creating two lists: one for storing the name of duplicate apps (`duplicate_apps`), and one for storing the name of unique apps (`unique_apps`).
2. Looping through the `list_google` data set (the Google Play data set), and for each iteration:
    - Saving the app name to a variable named `name`
    - Checking if the name (`name`) exists in the `unique_apps` list 
    - Appending `name` to the `duplicate_apps` list if it already exists in the `unique_apps` list
    - Appending `name` to the `unique_apps` list if it does not exist in the `unique_apps` list
3. Printing the amount of duplicate apps using the `len()` function on the `duplicate_apps` list.
4. Printing a couple examples of the duplicate apps.

In [7]:
duplicate_apps = []
unique_apps = []

for app in list_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


There are a total of 1181 duplicate apps.
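
The same count can be sketched with `collections.Counter`, which avoids the repeated `name in unique_apps` list lookups. This is an illustrative alternative, not the notebook's method. `count_duplicates` is a hypothetical helper and the sample rows are toy data.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return a dict mapping each repeated name to its number of occurrences."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n for name, n in counts.items() if n > 1}


sample = [['Instagram'], ['Box'], ['Instagram'], ['Slack'], ['Instagram'], ['Box']]
duplicates = count_duplicates(sample)

# Number of surplus rows, which is what the loop above counts:
# every occurrence of a name beyond the first one.
extra_rows = sum(n - 1 for n in duplicates.values())
print(duplicates, extra_rows)
```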

One thing we could do is remove the duplicate rows randomly, but we could probably find a better way. When examining the rows we printed for the Instagram app, the main difference between each row is found on the fourth position of each row, which corresponds to the number of reviews. The difference in these numbers shows that the data was collected at different points in time.

To include the correct/latest version of each duplicate within our list, we create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

More precisely, we:  
1. Create an empty dictionary named `reviews_max` 
2. Loop through the Google Play data set (`list_google`), excluding the header row
3. For each iteration we:
    - Assign the app name to a variable named `name`
    - Assign the number of reviews to a variable named `n_reviews`
    - If `name` is not yet a key in the `reviews_max` dictionary, create a new entry where the key is the app name and the value is the number of reviews.
    - If `name` already exists as a key in the `reviews_max` dictionary and the number of reviews stored for that app is less than the number of reviews in the current row (`reviews_max[name] < n_reviews`), update the value for that key with the larger number.

In [8]:
reviews_max = {}

for row in list_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max:
        reviews_max[name] = n_reviews    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

We initially calculated the number of duplicate entries in the Google Play data set (1181). To make sure we have gotten rid of all the duplicate entries, we compare the expected length versus the actual length of the current dictionary:

In [9]:
print('Expected length is: ', len(list_google[1:]) - 1181)
print('\n')
print('Actual length is: ', len(reviews_max))

Expected length is:  9659


Actual length is:  9659


This confirms that the `reviews_max` dictionary contains only unique data. However, we still need to modify the data from the original list: `list_google`. To do that, we:  
1. Create two empty lists. The first one (`android_clean`) will store our new cleaned data set, and the second one (`already_added`) will store app names.
2. Loop through the Google Play data set (`list_google`), excluding the header row
3. For each iteration we:
    - Assign the app name to a variable named `name`
    - Assign the number of reviews to a variable named `n_reviews`
    - Check whether the number of reviews (`n_reviews`) is the same as the number stored in the `reviews_max` dictionary **and** whether this app has not already been added to the `already_added` list. The second condition makes sure we do not include duplicates of an app that have the same number of reviews.
    - If both conditions are met, append the entire row to the `android_clean` list.
    - In that case, also append the name of the app (`name`) to the `already_added` list to keep track of the apps that we have already added.
4. We explore the first couple rows of our new clean list and check that the length is correct (9659 rows).

In [10]:
android_clean = []
already_added = []

for row in list_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print('Here are the first couple rows from our new list: ', '\n', android_clean[:3])
print('\n')
print('Length of the clean data set: ', len(android_clean))

Here are the first couple rows from our new list:  
 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Length of the clean data set:  9659


The new list seems to be just right.
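
For comparison, the two passes above can be condensed into a single pass that stores the best row per app name directly. This is a sketch under the same assumptions as the notebook (name at index 0, number of reviews at index 3). `dedupe_by_max_reviews` is a hypothetical helper, and the sample rows are toy data.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie on review count.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Toy App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
cleaned = dedupe_by_max_reviews(sample)
print(cleaned)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version orders rows by the position of the retained row. The set of retained rows is the same.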

### Detecting and Deleting Non-English Apps  

The language we use for the apps we develop at our company is English, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, when exploring the data long enough, we find that both data sets have apps whose name suggests that they are not directed toward an English-speaking audience:

In [11]:
print(list_apple[814][1])
print('\n')
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播


لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. 

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character `a` is 97, for character `A` is 65, and for character `爱` is 29,233. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. 
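
The numbers mentioned above can be inspected with Python's built-in `ord()` function. The short sketch below prints the number for a few characters used in this section and whether each one falls within the ASCII range. The variable names are chosen for illustration only.

```python
# Characters taken from the examples discussed in this section.
characters = ['a', 'A', '爱', '™', '😜']
code_points = {character: ord(character) for character in characters}

for character, code_point in code_points.items():
    in_ascii_range = code_point <= 127
    print(repr(character), code_point, in_ascii_range)
```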

Before removing the rows corresponding to non-English apps, we will first create a function that detects whether a string contains English characters.

Our function `en_detect()` takes any string as input. For each character in that string, the function checks whether the number corresponding to that character is greater than 127 (that is, outside the ASCII range). If so, the function returns `False`. If every character in the string has a corresponding number less than or equal to 127, the function returns `True`.

In [12]:
def en_detect(string):
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

We now test the function on a couple of strings:

In [13]:
print(en_detect('Instagram'))
print(en_detect('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(en_detect('Docs To Go™ Free Office Suite'))
print(en_detect('Instachat 😜'))

True
False
False
False


Our newly created `en_detect` function does not handle special characters such as emojis and ™ the way we would like. This is because these characters fall outside the ASCII range and have corresponding numbers greater than 127. As we can see:

In [14]:
print(en_detect('😜'))
print(en_detect('™'))

False
False


If we are going to use this newly created function, we will lose useful data since many of the English apps may be labeled as non-English. 

To minimize the impact of data loss, we will only remove an app if its name has more than three characters that fall outside of the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

We proceed to modifying our function:

In [15]:
def en_detect(string):
    non_asci = 0
    
    for character in string:
        if ord(character) > 127:
            non_asci += 1
   
    if non_asci > 3:
        return False
          
    
    return True

We proceed to test with a couple of app names to see if this modified function works:

In [16]:
print(en_detect('Docs To Go™ Free Office Suite'))
print(en_detect('Instachat 😜'))
print(en_detect('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


The function seems to be working fine. Although it might not filter out all non-English apps, it is the quickest and most effective solution we could find to resolve this problem.
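
To make the threshold easier to experiment with, the same heuristic can be sketched with the limit exposed as a parameter. `is_probably_english` and `max_non_ascii` are hypothetical names introduced for this example. The logic is the one described above: count the characters outside the ASCII range and compare the count with a limit.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# A stricter setting reproduces the behavior of the first version of the function.
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

The heuristic only looks at character codes, so a non-English name written entirely in unaccented Latin letters would still pass.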

We will now use the new function to filter out non-English apps from the Google Play Store and App Store data sets. We loop through each data set, and if an app name is identified as English, we append the whole row to a separate list: `list_apple_en` for the App Store data set and `android_clean_en` for the Google Play Store data set.

In [17]:
list_apple_en =[]

for row in list_apple[1:]:
    if en_detect(row[1]) == True:
        list_apple_en.append(row)


android_clean_en = []

for row in android_clean:
    if en_detect(row[0]) == True:
        android_clean_en.append(row)

        
print('Examples from the new App Store list: ', list_apple_en[:3])
print('\n')
print('The number of apps in the App Store list: ', len(list_apple_en))
print('\n')
print('Examples from the new Google Play Store list: ', android_clean_en[:3])
print('\n')
print('The number of apps in the Google Play Store list: ', len(android_clean_en))

Examples from the new App Store list:  [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


The number of apps in the App Store list:  6183


Examples from the new Google Play Store list:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', 

### Detecting and Deleting Non-Free Apps

As mentioned in the introduction, our company only builds apps that are free to download and install. The current data sets contain both free and non-free apps. For the sake of our analysis we'll need to isolate the free apps.

Using the same process as we used to filter out non-English apps, we isolate the free apps in separate lists (`list_apple_en_free` and `android_clean_en_free`). A free app is identified by its price: the 'price' column has the value '0.0' in the App Store data set, and the 'Price' column has the value '0' in the Google Play data set.

In [18]:
list_apple_en_free = []

for row in list_apple_en:
    if row[4] == '0.0':
        list_apple_en_free.append(row)
        
android_clean_en_free = []

for row in android_clean_en:
    if row[7] == '0':
        android_clean_en_free.append(row)
        
print('The number of apps remaining in the App Store list: ', len(list_apple_en_free))
print('The number of apps remaining in the Google Play Store list: ', len(android_clean_en_free))

The number of apps remaining in the App Store list:  3222
The number of apps remaining in the Google Play Store list:  8864
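
Comparing price strings exactly works here because each store writes a free price in one fixed way ('0.0' and '0'). A more tolerant check could convert the text to a number first. The sketch below is an assumption-laden illustration: `is_free` is a hypothetical helper, and the idea that paid prices may carry a leading `$` sign is an assumption about the data, not something shown in the output above.

```python
def is_free(price_text):
    """Return True if a price string represents zero (hypothetical helper)."""
    # Assumption: a paid price may be written with a leading dollar sign.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Text that is not a number (for example a displaced value) is not treated as free.
        return False


for price in ['0.0', '0', '$4.99', 'Everyone']:
    print(price, is_free(price))
```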


## Data Analysis

So far we have cleaned our data, and:
- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps
    
As mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

This validation strategy lets us put minimal effort into creating an app at first, so we can quickly gather data on how well it is received in a market.  
Depending on how positive the response is and how much revenue the app generates, we can then put more effort into building an iOS version of the app and adding it to the App Store, creating even more revenue for our company.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. We will therefore begin the analysis by getting a sense of what the most common genres for each market are. For this, we'll need to build frequency tables for a few columns in our data sets.

### Finding the Most Common Genres in Each Market

Looking at the index table of the App Store, we notice that the most relevant column in our list is going to be **'prime_genre'**. In our data set, this can be retrieved by calling index **11**.


|Column Name|Description|
|----|-----------|
|'id'|App ID|
|'track_name'| App Name|
|'size_bytes'| Size (in Bytes)|
|'currency'| Currency Type|
|'price'| Price amount|
|'rating_count_tot'| User Rating counts (for all version)|
|'rating_count_ver'| User Rating counts (for current version)|
|'user_rating'|Average User Rating value (for all version)|
|'user_rating_ver'|Average User Rating value (for current version)|
|'ver'| Latest version code|
|'cont_rating'| Content Rating|
|'prime_genre'|Primary Genre|
|'sup_devices.num'| Number of supporting devices|
|'ipadSc_urls.num'| Number of screenshots showed for display|
|'lang.num'| Number of supported languages|
|'vpp_lic'|Vpp Device Based Licensing Enabled|
    
    
Looking at the index table of the Google Play Store, we notice that the most relevant columns in our list are going to be **'Category'** and **'Genres'**. In our data set, these can be retrieved by calling indexes **1** and **9**.


|Column Name|Description|
|-|-|
|'App'|Application name|
|'Category'|Category the app belongs to|
|'Rating'|Overall user rating of the app (as when scraped)|
|'Reviews'|Number of user reviews for the app (as when scraped)|
|'Size'|Size of the app (as when scraped)|
|'Installs'|Number of user downloads/installs for the app (as when scraped)|
|'Type'|Paid or Free|
|'Price'|Price of the app (as when scraped)|
|'Content Rating'|Age group the app is targeted at - Children / Mature 21+ / Adult|
|'Genres'|An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.|
|'Last Updated'|Date when the app was last updated on Play Store (as when scraped)|
|'Current Ver'|Current version of the app available on Play Store (as when scraped)|
|'Android Ver'|Min required Android version (as when scraped)|

We now proceed to building frequency tables for the 'prime_genre' column of the App Store data set, and for the 'Genres' and 'Category' columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

The first function, `freq_table()` takes in two inputs: `dataset` (which is expected to be a list of lists) and `index` (which is expected to be an integer). The function returns the frequency table (as a dictionary) for any column we want. Because we want the frequencies to be expressed as percentages, our function takes each key in our dictionary and converts the value into a percentage.

In [19]:
def freq_table(dataset, index):
    
    freq_dictionary = {}
    
    for row in dataset:
        chosen_column = row[index]
        if chosen_column in freq_dictionary:
            freq_dictionary[chosen_column] += 1
        else:
            freq_dictionary[chosen_column] = 1
    
    total = 0
    
    for key in freq_dictionary:
        total += freq_dictionary[key]  # Alternatively, we could have used the length of the data set passed to the function.
        
    for key in freq_dictionary:
        freq_dictionary[key] = round(((freq_dictionary[key]/total) * 100), 2)
        
    return freq_dictionary

We can test this function on the App Store data set:

In [20]:
freq_table(list_apple_en_free, 11)

{'Social Networking': 3.29,
 'Photo & Video': 4.97,
 'Games': 58.16,
 'Music': 2.05,
 'Reference': 0.56,
 'Health & Fitness': 2.02,
 'Weather': 0.87,
 'Utilities': 2.51,
 'Travel': 1.24,
 'Shopping': 2.61,
 'News': 1.33,
 'Navigation': 0.19,
 'Lifestyle': 1.58,
 'Entertainment': 7.88,
 'Food & Drink': 0.81,
 'Sports': 2.14,
 'Book': 0.43,
 'Finance': 1.12,
 'Education': 3.66,
 'Productivity': 1.74,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Medical': 0.19}

We notice that the key-value pairs are not sorted by frequency: they appear in the order in which each genre was first encountered in the data set. This makes it difficult to analyze the frequency tables that we would like to generate. We could use the `sorted()` function, but when applied directly to a dictionary it only considers and returns the dictionary keys.

Therefore, we need to build a second function which can help us display the entries in the frequency table in a descending order.

The `display_table()` function below takes in two parameters: `dataset` and `index`. `dataset` is expected to be a list of lists, and `index` is expected to be an integer. It will generate a frequency table using the `freq_table()` function that we have just written. It will transform the frequency table into a list of tuples, and then it will sort the list in a descending order. Finally, it will print the entries of the frequency table in descending order.

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
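
As a side note, `sorted()` can also order a dictionary by value when it is given the dictionary's items and a `key` function. The sketch below shows that alternative on a small sample taken from the App Store table above. `sort_table` is a hypothetical helper, not part of the notebook.

```python
def sort_table(table):
    """Return (key, value) pairs sorted by value in descending order."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_table = {'Social Networking': 3.29, 'Games': 58.16, 'Entertainment': 7.88}
for genre, percentage in sort_table(sample_table):
    print(genre, ':', percentage)
```

The two approaches can order ties differently: the tuple-based version breaks ties by comparing the names, while this version keeps tied entries in their original dictionary order.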

We will now use it to display the frequency tables of the columns 'prime_genre', 'Category' and 'Genres':

In [22]:
print('The "prime_genre" column from the App Store data set: ')
print('\n')    
display_table(list_apple_en_free, 11)
print('\n')
print('The "Category" column from the Google Play Store data set: ')
print('\n') 
display_table(android_clean_en_free, 1)
print('\n')
print('The "Genres" column from the Google Play Store data set: ')
print('\n') 
display_table(android_clean_en_free, 9)

The "prime_genre" column from the App Store data set: 


Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


The "Category" column from the Google Play Store data set: 


FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHIC

When analyzing the 'prime_genre' column of the App Store data set, we notice that the most common genre in the App Store is 'Games' (58.16%). This is followed by 'Entertainment' (7.88%) and 'Photo & Video' (4.97%). So far, it seems like most of the apps on the App Store are designed for fun rather than for a practical purpose. Based on this alone, we cannot conclude that this is the recommended app profile for the App Store market, because a genre with many apps does not necessarily have many users. We have to verify which genres have the largest number of users.  

Going through the 'Category' and 'Genres' columns of the Google Play data set, we notice that the most common categories are 'Family' (18.91%), 'Game' (9.72%) and 'Tools' (8.46%). The 'Genres' column seems to give a more detailed breakdown of the type of apps available on the Google Play market. The top 3 app genres are: 'Tools' (8.45%), 'Entertainment' (6.07%) and 'Education' (5.35%).

Compared to the App Store data set, the Google Play data set is not as heavily populated with games. The entertainment genre is well represented in both data sets, but Google Play shows a more balanced landscape of both practical and for-fun apps. 

So far, there is no way of recommending an app profile based on what we have found.

### Finding the Most Popular Genres in Each Market

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the 'Installs' column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the 'rating_count_tot' column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

- Isolate the apps of each genre
- Sum up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

Using the `freq_table()` function, we loop over the unique genres of the App Store data set and compute the average number of user ratings per app for each genre.

In [23]:
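# Average number of user ratings per genre
# (column 11 = 'prime_genre', column 5 = 'rating_count_tot')
for genre in freq_table(list_apple_en_free, 11):
    total = 0        # sum of user ratings for this genre
    len_genre = 0    # number of apps in this genre
    for row in list_apple_en_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1

    avg_num_rat = total / len_genre
    print(genre, ':', avg_num_rat)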

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Out of all the genres, 'Navigation' comes out on top, followed by 'Reference' and 'Social Networking'. Not far behind are 'Music' and 'Weather'. Exploring a few of the apps in each of the top genres should give us a better idea of the potential each genre has.
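
The ranking is easier to read when the averages are sorted. The following is a minimal sketch, not part of the original analysis: it computes the averages in a single pass with dictionaries and prints them in descending order. The helper `average_by_key()` and the `sample_rows` list are hypothetical, and the sample values are a handful of rating counts copied from the outputs in this notebook.

```python
def average_by_key(rows, key_index, value_index):
    """Return {key: mean of the values} computed in a single pass over rows."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical mini data set: [genre, rating count] pairs for illustration only
sample_rows = [
    ['Navigation', '345046'],
    ['Navigation', '154911'],
    ['Reference', '985920'],
    ['Reference', '200047'],
    ['Social Networking', '2974676'],
]

averages = average_by_key(sample_rows, 0, 1)

# Sort genres from the highest to the lowest average
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for genre, avg in ranked:
    print(genre, ':', round(avg, 2))
```

With the notebook's data, the equivalent call would presumably be `average_by_key(list_apple_en_free, 11, 5)`, using the same column indexes as the cell above.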

First, we explore the 'Navigation' apps:

In [24]:
for row in list_apple_en_free:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


We notice that the high average is driven almost entirely by Waze and Google Maps, which account for the vast majority of the ratings in this genre. This doesn't seem like the right niche for us to target.
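
To make the skew concrete, here is a small sketch that compares the mean with the median for the six 'Navigation' rating counts printed above. It uses the standard-library `statistics` module, which the original notebook does not use; the variable names are only illustrative.

```python
from statistics import mean, median

# Rating counts of the six 'Navigation' apps, copied from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg = mean(navigation_ratings)
mid = median(navigation_ratings)

# Share of all ratings that belong to the two largest apps (Waze and Google Maps)
top_two_share = (345046 + 154911) / sum(navigation_ratings)

print('mean   :', round(avg, 2))
print('median :', mid)
print('top two share (%):', round(top_two_share * 100, 1))
```

The mean matches the 86090.33 reported for 'Navigation', while the median is far lower (8196.5), which is what we would expect when two apps hold almost all of the ratings.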

Next, we explore the 'Reference', 'Social Networking', 'Music' and 'Weather' apps:

In [25]:
print('Reference', '\n')

for row in list_apple_en_free:
    if row[11] == 'Reference':
        print(row[1], ':', row[5])
        
print('\n', 'Social Networking', '\n')

for row in list_apple_en_free:
    if row[11] == 'Social Networking':
        print(row[1], ':', row[5])
        
print('\n', 'Music', '\n')

for row in list_apple_en_free:
    if row[11] == 'Music':
        print(row[1], ':', row[5])
        
print('\n', 'Weather', '\n')

for row in list_apple_en_free:
    if row[11] == 'Weather':
        print(row[1], ':', row[5])

Reference 

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0

 Social Networking 

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumb

Although the 'Bible' app is dominating the 'Reference' genre market, there are many dictionaries and other interesting types of apps that we could work on.

The 'Social Networking' genre is dominated by top apps such as 'Facebook', 'WhatsApp Messenger' and 'Pinterest'. This is a market that will be hard to break into.

The 'Music' genre is dominated by 'Pandora' and 'Spotify'. However, there seem to be quite a few music streaming services competing with each other on the level below these two apps. This might not be worth breaking into either.

The 'Weather' genre looks similar to the 'Music' genre, but with tighter competition between all apps. We should keep this in mind, but will probably not end up digging into this genre.

We will now take a look at the Google Play market. We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [26]:
display_table(android_clean_en_free, 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


One problem with this data is that it's not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
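
As a minimal sketch of that conversion, the hypothetical helper `parse_installs()` below strips the commas and the plus sign before calling `float()`, and the `try` block shows the error raised when the characters are left in place. The labels are taken from the frequency table above.

```python
def parse_installs(raw):
    """Convert an install label such as '1,000,000+' to a float (its lower bound)."""
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


for label in ['100,000+', '1,000,000+', '0+', '0']:
    print(label, '->', parse_installs(label))

# Without the cleanup, float() rejects the commas and the plus sign
try:
    float('1,000+')
except ValueError as err:
    print('conversion failed:', err)
```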

Approaching this data set the same way as the App Store one (with a nested loop), we use the `freq_table()` function to loop over the unique categories of the Google Play data set and compute the average number of installs per app for each category.

In [27]:
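# Average number of installs per category
# (column 1 = 'Category', column 5 = 'Installs')
for category in freq_table(android_clean_en_free, 1):
    total = 0           # sum of installs for this category
    len_category = 0    # number of apps in this category
    for row in android_clean_en_free:
        category_app = row[1]
        if category_app == category:
            # Strip '+' and ',' so that the string can be converted to float
            num_installs = row[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            total += float(num_installs)
            len_category += 1
    avg_num_installs = total / len_category
    print(category, ':', avg_num_installs)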

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The 'COMMUNICATION' category has the highest average number of installs (about 38,456,119). This is most likely driven by a few heavily used messaging apps such as 'WhatsApp'. Looking through the list, we notice that the 'BOOKS_AND_REFERENCE' category has a fairly high average number of installs (about 8,767,812). This is interesting because it lines up with the 'Reference' genre, the most promising one we found in the App Store data set. To make a more informed decision about which type of app our company should build, we explore the 'BOOKS_AND_REFERENCE' category:

In [28]:
print('BOOKS_AND_REFERENCE', '\n')

for row in android_clean_en_free:
    if row[1] == 'BOOKS_AND_REFERENCE':
        print(row[0], ':', row[5])

BOOKS_AND_REFERENCE 

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HT

This category contains a wide variety of apps, and only a few of them seem to be very popular. To verify this, we print only the apps with at least 100,000,000 installs:

In [29]:
print('BOOKS_AND_REFERENCE top range:', '\n')

for row in android_clean_en_free:
    num_installs = row[5]
    num_installs = num_installs.replace('+', '')
    num_installs = num_installs.replace(',', '')
    num_installs = float(num_installs)
    if row[1] == 'BOOKS_AND_REFERENCE' and num_installs >= 100000000:
        print(row[0], ':', row[5])

BOOKS_AND_REFERENCE top range: 

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Only 5 apps are extremely popular, so most of the category sits below that level. It is therefore worth exploring the apps in the mid-range of popularity (at least 100,000 but fewer than 100,000,000 installs):

In [30]:
print('BOOKS_AND_REFERENCE mid-range:', '\n')

for row in android_clean_en_free:
    num_installs = row[5]
    num_installs = num_installs.replace('+', '')
    num_installs = num_installs.replace(',', '')
    num_installs = float(num_installs)
    if row[1] == 'BOOKS_AND_REFERENCE' and 100000000 > num_installs >= 100000:
        print(row[0], ':', row[5])

BOOKS_AND_REFERENCE mid-range: 

Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 100,000+
Pdf Book Download - Read Pdf Book : 100

There seem to be many dictionaries and libraries, so this might not be a good type of app to build given how fierce the competition is.
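
One rough way to quantify this impression is to count how many app names contain a given keyword. The sketch below is only illustrative: `count_by_keyword()` is a hypothetical helper, the keywords are an arbitrary choice, and `sample_names` holds just ten names copied from the mid-range output above rather than the full category.

```python
def count_by_keyword(names, keywords):
    """Count, for each keyword, how many names contain it (case-insensitive)."""
    counts = {}
    for keyword in keywords:
        counts[keyword] = sum(1 for name in names if keyword in name.lower())
    return counts


# Ten app names copied from the mid-range output above (not the full list)
sample_names = [
    'Wikipedia',
    'Cool Reader',
    'FBReader: Favorite Book Reader',
    'Offline English Dictionary',
    'Offline: English to Tagalog Dictionary',
    'ReadEra – free ebook reader',
    'Ebook Reader',
    'English to Urdu Dictionary',
    'English Persian Dictionary',
    'All Maths Formulas',
]

counts = count_by_keyword(sample_names, ['dictionary', 'reader'])
for keyword, count in counts.items():
    print(keyword, ':', count)
```

On the real data, the names would presumably come from `row[0]` of the 'BOOKS_AND_REFERENCE' rows in `android_clean_en_free`.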

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

Because there is so much competition in this category, it is important for our company to bring new features to our app. Audio versions of the book, discussion forums and in-app note taking are examples of features that could give us a competitive edge.

## Conclusion

In this project, our goal was to analyze data from the App Store and Google Play mobile apps and recommend an app profile that can be profitable to our app-developing company for both markets.

We concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.