# Profitable App Profiles for the Google Play and Apple App Store Markets

**Author: Miguel Tapia**

The aim of this project is to find mobile applications that are profitable for the App Store (iOS) and Google Play Store (Android) markets. 

Setting:
- We are working as data analysts for a fictional company that builds iOS and Android mobile apps, and our job is to enable our team of developers to make **data-driven** decisions with respect to the kind of apps they build.

- At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app.

Explaining my code:
- Aside from the objective of this project, this is also a learning project for me, so I'd like to explain in more detail what each complex line of code actually does.

Here are a few examples of what that would look like:

```
#import the reader() command from the csv module (COMMENT)
from csv import reader

## Google Play Store Dataset ## (HEADER COMMENT)
opened_file = open('googleplaystore.csv') #opening the file using the open() command (SAME LINE COMMENT)
read_file = reader(opened_file) #Once file is open, we read it in using a command called reader()
android = list(read_file) 
android_header = android[0]
android = android[1:]

## The App Store data set ##
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
```


## Goal
Analyze data from Google Play and App Store markets to help our team of developers understand what kinds of apps are likely to attract more users.

## Opening and exploring data

Our aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Source: [Link](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals. Each link below documents its data set, including the number of columns and the column definitions.

* A data set containing data about approximately 10,000 Android Apps from Google Play; the data was collected in August 2018. [Link](https://www.kaggle.com/lava18/google-play-store-apps)
* A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in August 2018. [Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

Let's start by opening and exploring these two data sets. 

**Opening the data sets**

In [1]:
#import the reader() command from the csv module
from csv import reader

## Google Play Store Dataset ##
opened_file = open('googleplaystore.csv') #opening the file using the open() command
read_file = reader(opened_file) #Once file is open, we read it in using a command called reader()
android = list(read_file) 
android_header = android[0]
android = android[1:]

## The App Store data set ##
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
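
As a side note, the opening steps above could be wrapped in a small helper so the same logic serves both files. This is only a sketch: the name `load_csv` and the `encoding='utf8'` default are assumptions for illustration and are not part of the cells above. The `with` block closes the file once the rows have been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # The with-block closes the file even if reading fails part-way.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # First row is the header, everything after it is data.
    return rows[0], rows[1:]

# Hypothetical usage, assuming the file sits next to the notebook:
# android_header, android = load_csv('googleplaystore.csv')
```

The helper returns the header and the data rows separately, which mirrors the `android_header` / `android` split used in this project.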

**Exploring the data sets**

To make it easier to explore the two data sets, we'll first write a function named `explore_data()` that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

General notes about the code below (`explore_data()` function):

- Function takes in four parameters:
    - `dataset`, which is expected to be a list of lists.
    - `start` and `end`, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
    - `rows_and_columns`, which is expected to be a Boolean and has `False` as a default argument.
- Slices the data set using `dataset[start:end]`
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.
    - The `\n` in `print('\n')` is a special character and won't be printed. Instead, the `\n` character adds a new line, and we use `print('\n')` to add some blank space between rows.
- Prints the number of rows and columns if `rows_and_columns` is `True`.
    - The data set shouldn't include a header row; otherwise the function will report one row more than the actual number of data rows.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now let's explore the data sets using our explore function! 

Let's print the first few rows of each data set:

In [3]:
#First few rows of the Google Play Store
print(android_header)
print('\n')
explore_data(android, 0, 3, True)


#First few rows of the App Store
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215

Let's try to do the following:
- Find the exact number of rows each data set has;
- Print the column names and try to identify the columns that could help us with our analysis.
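
Since the rest of this project refers to columns by their list index, it can help to see each column name next to its index. Here is a minimal sketch; the helper name `column_indices` is made up for illustration, and the sample header is a shortened copy of the Google Play header printed above.

```python
def column_indices(header):
    """Map each column name to its position in the row lists."""
    return {column_name: index for index, column_name in enumerate(header)}

# Shortened sample header (first four Google Play columns).
sample_header = ['App', 'Category', 'Rating', 'Reviews']

for column_name, index in column_indices(sample_header).items():
    print(index, column_name)
```

With the full header, the same call would show that `Reviews` sits at index 3, which is the index used later when we compare review counts.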

We have 10841 apps in the Google Play Store data set, and the following columns may be useful for this analysis:
- "Category": Category the app belongs to
- "Rating": Overall user rating of the app (as when scraped)
- "Reviews": Number of user reviews for the app (as when scraped)
- "Installs": Number of user downloads/installs for the app (as when scraped)
- "Type": Paid or Free
- "Price": Price of the app (as when scraped)
- "Content Rating": Age group the app is targeted at - Children / Mature 21+ / Adult
- "Genres": An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.
    
We have 7197 apps in the App Store data set, and the following columns may be useful for this analysis:
- "currency": Currency Type
- "price": Price amount
- "rating_count_tot": User Rating counts (for all version)
- "rating_count_ver": User Rating counts (for current version)
- "cont_rating": Content Rating
- "prime_genre": Primary Genre
- "sup_devices.num": Number of supporting devices


## Deleting wrong data

Now that our data is loaded and ready to roll, we need to make sure the data we analyze is accurate; otherwise the results of our analyses will be wrong. This means we need to:

- Detect inaccurate data, and either correct or remove it.
- Detect duplicate data, and remove the duplicates.

Note that the process of preparing our data for analysis is called data cleaning. Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

**FUN FACT:** It's often said that data scientists spend around 80% of their time cleaning data, and only about 20% actually analyzing (cleaned) data.

The Google Play data set has a dedicated discussion section ([Link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)), and we can see that one of the discussions outlines an error for row **10472**. Let's print this row and compare it against the header and another row that is correct.


In [4]:
print(android_header) # header row
print('\n')
print(android[10472]) # incorrect row
print('\n')
print(android[0]) # first row (correct)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


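The `Category` value seems to be missing in row 10472, which shifts every later value one column to the left. You can verify this in two ways: the value in the `Rating` position is 19, but a rating should be a number between 1 and 5, and the row has only 12 values while the header has 13.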

**Delete row 10472**

In [5]:
print(len(android))
del android[10472] # careful not to run this more than once since that would delete replacement row in slot 10472
print(len(android))

10841
10840
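
The check we did by eye (a row with fewer values than the header) can also be automated, and filtering into a new list avoids the risk of running `del` twice. The sketch below is one possible approach, not what the cells above do; the helper names `find_malformed_rows` and `drop_rows` and the tiny sample are made up for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

def drop_rows(dataset, bad_indices):
    """Return a new list without the rows at `bad_indices` (input is untouched)."""
    bad = set(bad_indices)
    return [row for i, row in enumerate(dataset) if i not in bad]

# Made-up sample: the second row is missing its Category value.
sample_header = ['App', 'Category', 'Rating']
sample = [
    ['Good app', 'TOOLS', '4.1'],
    ['Shifted app', '19'],
    ['Another app', 'GAME', '3.9'],
]

bad = find_malformed_rows(sample, sample_header)
cleaned = drop_rows(sample, bad)
print(bad)           # indices of the suspicious rows
print(len(cleaned))  # number of rows left after filtering
```

Because `drop_rows` builds a new list instead of deleting in place, running it a second time on the cleaned data removes nothing further.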


**What about duplicate rows?**

I wonder if we have duplicate rows? That can be a problem when we start our analyses since we don't want to be making any assumptions based on data that, among other things, is skewed by having more than one row for a single app.

Let's start with checking to see if we have any duplicate entries for the Facebook and Snapchat apps.

In [6]:
## Is there more than one Facebook entry in our Google Play data set?
for app in android:
    name = app[0]
    if name == 'Facebook':
        print(app)
        
## Is there more than one Snapchat entry in our Google Play data set?
for app in android:
    name = app[0]
    if name == 'Snapchat':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17014787', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17014705', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17015352', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17000166', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varie

We asked for every row whose app name is Facebook or Snapchat. In theory, we should see only two rows (one for Facebook and one for Snapchat), but it looks like there are 2 Facebook entries and 4 Snapchat entries.

**How to efficiently detect and delete duplicate apps**

How much duplicate data do we have in our data set? We can use a for loop to find out how many duplicate and unique apps we have.

In [7]:
duplicate_apps = [] # List for storing the name of duplicate apps
unique_apps = [] # List for storing the name of unique apps

for app in android:
    name = app[0] # Save the app name to a variable named name
    if name in unique_apps:
        duplicate_apps.append(name) # If name was already in unique apps, we append it to duplicate_apps list
    else:
        unique_apps.append(name) # if name wasn't already in the unique_apps list, we appended name to unique_apps

print('Number of duplicate apps in android:', len(duplicate_apps))
print('\n')
print('Number of unique apps in android:', len(unique_apps))

Number of duplicate apps in android: 1181


Number of unique apps in android: 9659


We have quite a few duplicate apps! **1,181** to be exact.
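
As an aside, checking `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. A possible alternative is `collections.Counter` from the standard library. This is a sketch with a made-up sample and a made-up helper name (`count_duplicates`), and it counts duplicates the same way as the loop above: every occurrence after the first one counts as a duplicate.

```python
from collections import Counter

def count_duplicates(dataset, name_index):
    """Return (number of unique names, number of extra occurrences)."""
    counts = Counter(row[name_index] for row in dataset)
    n_unique = len(counts)
    # Each name contributes (occurrences - 1) duplicates.
    n_duplicates = sum(count - 1 for count in counts.values())
    return n_unique, n_duplicates

# Made-up sample with only the name column.
sample = [['Facebook'], ['Facebook'], ['Snapchat'],
          ['Snapchat'], ['Snapchat'], ['Box']]

print(count_duplicates(sample, 0))
```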

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we can do is create a function that removes duplicate rows for any app that has more than one row.

But of the duplicate rows, say for Snapchat, which one should we keep? Let's examine the Snapchat duplicates and see what the major differences are between them.

In [8]:
## Let's examine the duplicate rows for Snapchat and see what the key differences are.
print(android_header) # Column names
for app in android:
    name = app[0]
    if name == 'Snapchat':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Snapchat', 'SOCIAL', '4.0', '17014787', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17014705', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17015352', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17000166', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']


It looks like the only data that varies between rows is in the fourth column (number of reviews). Some Snapchat rows have more reviews than others, which suggests that the rows were scraped from the Google Play store at different points in time.

Thus, the higher the number in the reviews column, the more recent the data should be. **So rather than keeping a duplicate at random, it may be good to keep the row with the highest number of reviews.**

**Now, let's delete the duplicate rows!**

Recall that we found the following for unique and duplicate apps.
- Number of duplicate apps: 1181
- Number of unique apps: 9659

We need to not only remove the duplicates, but simultaneously keep the app that has the most reviews! 

Here's how:
- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [9]:
reviews_max = {} # Empty dictionary

for app in android[1:]: # Loop through the Google Play data set (the header row was already removed, so [1:] also skips the first app)
    name = app[0] # Assign the app name to a variable named name
    n_reviews = float(app[3]) # Assign the number of reviews to n_reviews, converted to float so the values compare as numbers

    # If name already exist in reviews_max dict. and the specific row has the most reviews, - 
    # then update that number of reviews for that entry in the reviews_max dict.
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    # if name does not exist in reviews_max dict., create a new entry in the dictionary - 
    # where the key is the app name and the value is the number of reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9658


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the row with the highest number of reviews.

Here's a brief description of what the `if` statement is doing.
We loop through the android data set, and for every iteration:
- we isolate the name of the app and the number of reviews.
- we add the current row to the `android_clean` list, and the app name to the `already_added` list, if:
    - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
    - The name of the app is not already in the `already_added` list. We need this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has 3 entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll end up with duplicate entries for some apps.


In [10]:
android_clean = [] # Empty list that stores our new clean data set
already_added = [] # Empty list that stores app names

for app in android[1:]: # Loop through the data set (the header row was already removed, so [1:] also skips the first app)
    name = app[0] # Assign the app name to a variable named name
    n_reviews = float(app[3]) # Assign the number of reviews to n_reviews, converted to float
    
    if (reviews_max[name] == n_reviews and (name not in already_added)): # See the notes above for explanation
        android_clean.append(app)
        already_added.append(name)
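
The two steps above (build a dictionary of maximum review counts, then filter) could also be folded into a single pass by storing the best row itself in the dictionary. This is a sketch of that alternative, not the method used in this notebook. The name `keep_most_reviewed` is made up, the Snapchat review counts are copied from the output above, and the Box rows are invented to show the tie case.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the one with the most reviews."""
    best_rows = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

sample = [
    ['Snapchat', 'SOCIAL', '4.0', '17014787'],
    ['Snapchat', 'SOCIAL', '4.0', '17014705'],
    ['Snapchat', 'SOCIAL', '4.0', '17015352'],
    ['Snapchat', 'SOCIAL', '4.0', '17000166'],
    ['Box', 'BUSINESS', '4.2', '1000'],  # invented values, tied on purpose
    ['Box', 'BUSINESS', '4.2', '1000'],
]

for row in keep_most_reviewed(sample, 0, 3):
    print(row)
```

One difference to keep in mind: here the output follows the order in which each app name first appears, whereas the loop above keeps rows in the order of the winning entries.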

The data should now have 9,659 rows. Let's check!

In [11]:
explore_data(android_clean, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9658
Number of columns: 13


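Sweet! We have successfully removed the duplicate entries from our Google Play data set, keeping for each app the row with the most reviews (which should be the most recent one). Note that the result has 9,658 rows rather than the expected 9,659. The most likely reason is that `android` no longer contains its header row, so looping over `android[1:]` in the two cells above skips the first app in the data set; looping over `android` directly would include it.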

**Is there duplicate data in our iOS dataset?**

Now let's briefly apply the same duplicate check to our iOS data set.

In [12]:
ios_duplicate_apps = [] # List for storing the name of duplicate apps
ios_unique_apps = [] # List for storing the name of unique apps

for track_name in ios:
    name = track_name[1] # Save the app name to a variable named name
    if name in ios_unique_apps:
        ios_duplicate_apps.append(name) # If name was already in unique apps, we append it to duplicate_apps list
    else:
        ios_unique_apps.append(name) # if name wasn't already in the unique_apps list, we appended name to unique_apps

print('Number of duplicate apps in iOS :', len(ios_duplicate_apps))
print('\n')
print('Number of unique apps in iOS:', len(ios_unique_apps))

Number of duplicate apps in iOS : 2


Number of unique apps in iOS: 7195


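It looks like there are only two duplicate entries in our entire iOS data set. Since we matched on app name only, these may even be different apps that happen to share a name. We could repeat the process we used above, but that seems like too much effort for two rows. We'll leave the data as is!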

## Deleting non-English apps
We managed to remove the duplicate app entries in our Google Play data set. We now need to delete non-English apps, since our fictional company only makes apps for an English-speaking market.

Are there any non-English apps in our data sets?

In [13]:
print(ios[813][1]) # Printing out non-english app name from iOS data set
print(ios[6731][1])
print('\n')
print(android_clean[4411][0]) # Printing out non-english app name from Google Play data set
print(android_clean[7939][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


How can we go about deleting these apps from our data sets?

The first approach that comes to mind is to delete any app whose name contains a symbol that is not commonly used in English text, that is, anything other than the letters A-Z, the digits 0-9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

**Fun fact:** Each character we use in a string has a corresponding number associated with it. For instance, the number for the character 'a' is 97, for 'A' it is 65, and for '爱' it is 29,233. We can get the number for any character using the built-in `ord()` function. [Link](https://docs.python.org/3/library/functions.html#ord)
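
To make the fun fact concrete, here is a tiny sketch that lists the number behind each character of a string. The helper name `code_points` is made up for illustration; `ord()` itself is a Python built-in.

```python
def code_points(text):
    """Return the number associated with each character in `text`."""
    return [ord(character) for character in text]

for character, number in zip('aA爱', code_points('aA爱')):
    print(character, number)
```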
    
So if we take the approach of deleting any app that uses non-English characters, we need to find the range of numbers that corresponds to the characters commonly used in English text. That's easy to find with a simple Wikipedia search! [Link](https://en.wikipedia.org/wiki/ASCII). English text seems to fall in the range 0-127 (the ASCII range).

Now we can use this number range (0-127) and assume that an app name that contains a character with a number greater than 127 is probably a non-English app.

Another roadblock is the fact that our app names are stored as strings. So we need a way to look at each individual character in a string and check its corresponding number.

Fortunately for us, in Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

Now let's write the function to detect non-English apps!

Explanation inside the function:
- Inside the function, iterate over the input string. For each iteration, check whether the number associated with the character is greater than 127. When it is, the function should immediately return False: the app name is probably non-English, since it contains a character that doesn't belong to the set of common English characters.
- If the loop finishes running without the return statement being executed, then no character had a corresponding number over 127. The app name is probably English, so the function should return True.

In [14]:
def is_english(string): 
    for character in string:
        if ord(character) > 127:
            return False
        
    return True
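
For comparison, the same "every character is in the 0-127 range" check can be written more compactly. Both variants below are sketches with made-up names: the first uses the built-in `all()`, and the second uses the string method `str.isascii()`, which requires Python 3.7 or later.

```python
def is_english_all(string):
    """True if every character has a number in the 0-127 range."""
    return all(ord(character) <= 127 for character in string)

def is_english_isascii(string):
    """Same check using str.isascii() (available from Python 3.7)."""
    return string.isascii()

for name in ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    print(name, is_english_all(name), is_english_isascii(name))
```

Like the loop version, `all()` stops at the first character that fails the check.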

Now let's test our function on some app names!

In [15]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))

True
False
False


Notice the unintended consequence of this method of detecting non-English apps. Our function misclassifies some English app names, such as 'Instachat 😜', because the emoji falls outside the 0-127 range we used in our function. That specific emoji has the number 128540:

In [16]:
print(ord('😜'))

128540


This is a problem because we are going to lose useful data: many English apps use characters, like the emoji above, that fall outside the range we provided. How can we solve this issue?

One way would be to revise our function to **ONLY** remove apps whose names contain more than 3 characters that fall outside our range of 0-127. This means that all English apps with up to three emojis or other special characters will still be labeled as English. It's not perfect, but it gets fairly close!

In [17]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3: # Change from the original function: return False only if more than 3 characters fall outside the ASCII range
        return False
    else:
        return True

Now let's try the app name with the emoji and see whether our function classifies it as True (an English app) or False (a non-English app).

In [18]:
print(is_english('Instachat 😜'))

True


Success!
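
The cutoff of 3 is a judgment call, so it may be worth making it a parameter and trying a few values. The sketch below is a variation on the function above, with a made-up name (`is_probably_english`) and a made-up parameter (`max_non_ascii`). It also shows a limitation of the approach: a very short non-English name can slip through.

```python
def is_probably_english(string, max_non_ascii=3):
    """True if at most `max_non_ascii` characters fall outside 0-127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))                   # one emoji only
print(is_probably_english('Instachat 😜', max_non_ascii=0))  # strict ASCII check
print(is_probably_english('中国'))                            # short name slips through
```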

Now let's apply this function to both data sets!

In [19]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0] # Column 1 has the name of the app hence the 0
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[1] # Column 2 has the name of the app hence the 1
    if is_english(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9613
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

**Quick recap of what we've done so far**
- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

Now let's isolate the free apps, since those are our main concern. Recall that our company only builds free apps, so only free apps are relevant to our analysis.

In [20]:
android_final = [] # List of free apps from the Google Play store
ios_final = [] # List of free apps from the App Store

for app in android_english:
    price = app[7] # Price is in column 8 (index 7)
    if price == '0': # If the price is 0, append that row to the android_final list
        android_final.append(app)

for app in ios_english:
    price = app[4] # Price is in column 5 (index 4)
    if price == '0.0': # If the price is 0, append that row to the ios_final list
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8863
3222
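
Notice that the two data sets spell a free price differently (`'0'` on Google Play, `'0.0'` on the App Store), which is why the two loops compare against different strings. One way to avoid relying on the exact spelling is to compare numbers instead. This is only a sketch: the helper name `is_free` is made up, and stripping a leading `$` is an assumption about how paid prices might be written.

```python
def is_free(price_text):
    """Return True if a price string such as '0' or '0.0' means free."""
    cleaned = price_text.replace('$', '').strip()  # assumption: '$' may prefix paid prices
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free so they can be reviewed.
        return False

for price in ['0', '0.0', '$4.99']:
    print(price, is_free(price))
```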


## Data Cleaning Done
Isolating the free apps from the non-free apps was our last step in the data cleaning process. 

## Analysis - Most common Apps by genre
    
Here's a breakdown of our data cleaning process:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

That was quite a bit of work with no analysis to show for it. But recall that it's common for data scientists and analysts to spend about 80% of their time on data cleaning and 20% on data analysis. Let's start analyzing!

Recall that our goal is to determine what apps are likely to attract more users, since our revenue is largely influenced by the number of people using our apps.

We need to be a little strategic about validating which apps are profitable and which ones are not. Here's an example validation strategy:
1. Build a minimal Android version app, and add it to Google Play market.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store market.

Note that our end goal is to add the app on both Google Play and the App Store, so we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of the most common genres for each market. Frequency tables for some of the columns in our data can give us this result. Which columns within each data set are we going to use?

Google Play
- Category (index 1): The category the app belongs to
- Genres (index 9): An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.

App Store
- prime_genre (index 11): Primary Genre
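
Because one `Genres` value can hold several genres (for example `'Art & Design;Pretend Play'` in the rows printed earlier), a frequency table on the raw column treats each combination as its own genre. If we ever wanted to count each genre tag separately, a sketch could look like this. The helper name `genre_tag_counts` is made up, and splitting on `';'` is an assumption based on the sample rows shown above.

```python
from collections import Counter

def genre_tag_counts(dataset, genres_index):
    """Count every genre tag, splitting combined values on ';'."""
    counts = Counter()
    for row in dataset:
        for tag in row[genres_index].split(';'):
            counts[tag.strip()] += 1
    return counts

# Sample with only the Genres column, using values seen in the output above.
sample = [['Art & Design'],
          ['Art & Design;Pretend Play'],
          ['Art & Design;Creativity']]

print(genre_tag_counts(sample, 0))
```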

Let's build two functions: 
- One to generate the frequency tables that show the genres' frequency in percentages
- One to display the percentages in descending order

**Frequency tables**

To create a frequency table for the elements of a list, we need to:
- Create an empty dictionary.
- Loop through that list and check for each iteration whether the iteration variable exists as a key in the dictionary created.
    - If it exists, then increment by 1 the dictionary value at that key.
    - Else (if it doesn't exist), create a new key-value pair in the dictionary, where the dictionary key is the iteration variable, and the dictionary value is `1`.

In [21]:
#Function 1
def freq_table(dataset, index): # Takes two parameters: dataset is a list of lists and index is an int
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

#Function 2
def display_table(dataset, index): # Takes two parameters: dataset is a list of lists and index is an int
    table = freq_table(dataset, index) # Generate a frequency table using function above
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) # Transform the frequency table into list of tuples
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True) # Sort the list in descending order
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
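
For reference, the same frequency table can be sketched with `collections.Counter` from the standard library, sorting directly on the percentage instead of building (value, key) tuples. This is an alternative sketch, not a replacement for the functions above; the names `freq_table_counter` and `sorted_percentages` and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Made-up sample: app name in column 0, genre in column 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for value, percentage in sorted_percentages(sample, 1):
    print(value, ':', percentage)
```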

Now that our functions are built, let's start by examining the genres in the iOS data set:

In [22]:
display_table(ios_final, 11) # We input column 11 from iOS data set into our function

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that among the free English apps in the App Store, more than half (**58.16%**) are games. Entertainment is the runner-up (**7.88%**), followed by photo & video apps (**4.97%**) and education (**3.66%**).

General takeaway: 

It looks like the free English segment of the App Store is dominated by apps designed for fun (games, entertainment, photo & video, social networking), while apps with practical purposes (education, shopping, utilities, and productivity) are rarer.

BUT we should be wary of assuming that, because fun apps are the most numerous, they also have the greatest number of users.

Now let's examine the `Genres` and `Category` columns in the Google Play data set:

In [23]:
display_table(android_final, 1) # We input column index 1 (Category) from the Android data set into our function

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [24]:
display_table(android_final, 9) # Genre column

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

**Category column**
The landscape seems to be more evenly distributed in the Google Play data set. For instance, there aren't nearly as many apps made for fun, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). But what's included in the family category? For the most part, it's kids' games.

**Genre column**
Regardless, practical apps seem to be better represented on Google Play than in the App Store. 

Notice that the Genre column has more categories (a lot more!). Although the difference between the category and genre columns is not crystal clear, it's safe to say that they both represent the same big picture (which is what we want). Thus, we will stick to the category column for now.

**General takeaway from reviewing the category and genre columns in the App Store and Google Play Store**
It looks like the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of fun apps and more practical apps. 

Now let's get an idea about the kind of apps that have the most users!

## Analysis - Most popular apps by genre

Recall that the frequency tables we analyzed showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more evenly distributed landscape of both fun and practical apps. Now we would like to get an idea about the kinds of apps with the most users by genre. 

One way to find out what genres are most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the installs column (column 5). 

This specific column is not available in the App Store data set. But a good workaround is to take the total number of user ratings column called rating_count_tot (column 5). 

Now we need to calculate the average number of user ratings per app genre on the App Store. Here's how to do it:
- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [25]:
genres_ios = freq_table(ios_final, 11) # generate a freq table for prime_genres (column 11) in ios data set

for genre in genres_ios: # loop over the unique genres of the App Store data set; the iteration variable is named genre
    total = 0 # variable that stores the sum of users ratings specific to each genre
    len_genre = 0 # variable that stores the number of apps specific to each genre
    for app in ios_final: # nested loop
        genre_app = app[11] # loop over data set, and for each iteration save app genre to variable named genre_app
        if genre_app == genre: # if genre_app is the same as genre (previous loop iteration)
            n_ratings = float(app[5]) # then save number of user ratings of the app as a float type
            total += n_ratings # add up the number of user ratings to the total variable
            len_genre += 1 # increment the len_genre variable by 1
    avg_n_ratings = total / len_genre # compute the avg number of user ratings by dividing total by len_genre
    print(genre, ':', avg_n_ratings) # print the app genre and average number of user ratings

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


It looks like the navigation genre has the highest average number of user ratings. I wonder how many apps are in the navigation genre and how the ratings are distributed among them.

Let's write a quick for loop:

In [26]:
for app in ios_final:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


We have a total of 6 apps, with Waze and Google Maps weighing heavily on our average. Together they have about half a million ratings!

**Why is this a problem?**
Our approach is a workaround for not having a column in our iOS data set that records the total number of downloads for each app. Using total ratings and grouping them by genre gives us a sense of a genre's popularity. But our focus is on genres, and apps that heavily skew our data (like Google Maps and Waze in the Navigation genre) will distort our analysis. 
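
To see how strongly a couple of giants pull the average, here is a minimal sketch comparing the mean with the median for the Navigation genre. The rating counts are copied from the output above. Using the median is only an illustrative alternative and is not part of the original analysis.

```python
from statistics import mean, median

# Rating counts for the six Navigation apps printed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

mean_ratings = mean(navigation_ratings)      # pulled up by Waze and Google Maps
median_ratings = median(navigation_ratings)  # midpoint of the sorted values

print('mean   :', mean_ratings)
print('median :', median_ratings)
```

The mean matches the figure of roughly 86,090 reported for Navigation, while the median is 8,196.5, which is much closer to what a typical app in this genre receives.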

We could get a better picture by removing these extremely popular apps from each genre and then reworking the averages, but there may be some insight in this data as it stands. 

Take, for instance, the reference genre. 

In [27]:
for app in ios_final:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The reference genre has 74,942 user ratings on average, but it's actually the Bible (985,920 ratings) and Dictionary.com (254,222 ratings across its two apps) that skew the average up. 

This particular niche shows some potential. One thing we could do is take a popular book (like the Bible) and create an app that does more than the original app does. For instance, we could create an app that includes quotes from the Bible, an audio version of the Bible, quizzes, etc. On top of that, we could also embed a dictionary (the second most-rated app in the reference genre) so users don't need to exit our app to look up words in an external app.

Recall that this analysis is for the App Store, and the App Store is fairly dominated by "fun" apps and not so much by "practical" apps. This may imply that the market is saturated with "fun" apps, and creating more practical apps might give us a greater chance of standing out.

**Other interesting notes regarding other apps**

Weather apps - people generally don't spend much time in these apps. Thus, the chance of making a profit from in-app ads is low. Not to mention that we may need to spend money to connect to reliable weather data via an API.

Food & Drink - making a popular food or drink app would likely require us to actually cook and deliver food and drinks.

Finance apps - examples of these apps involve banking, paying bills, money transfers, etc. We may need to hire a finance expert to build a finance app.

**The Google Play data set**

In [28]:
display_table(android_final, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.188423784271691
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835
0 : 0.011282861333634209


In the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture. But as you can see above, the install numbers are not that precise: most of the values are open-ended (100+, 1,000+, 5,000+, etc.). 

We don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need precise data for our purposes - we only want to get an idea which app genres attract the most users, and we do not need to know the exact amount of users. 

I think it's okay to leave those numbers as is, and we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

Before we can compute the averages, we need to convert each install number into a float. This means that we need to remove the commas and plus characters, otherwise the conversion will fail and raise an error.
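
As a small illustration of that cleaning step, here is a sketch of a helper that strips the commas and plus signs before converting. The function name `parse_installs` is hypothetical and does not appear elsewhere in this notebook.

```python
def parse_installs(raw):
    """Convert an install label such as '1,000,000+' to a float."""
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


print(parse_installs('1,000,000+'))
print(parse_installs('0'))

# Without the cleaning step, float() rejects the raw label
try:
    float('1,000,000+')
except ValueError as error:
    print('Conversion failed:', error)
```

Labels without a comma or plus sign (such as `'0'`) pass through unchanged, so the same helper works for every value in the installs column shown above.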

Now let's calculate the average number of installs per app genre for the Google Play data set. Note that we'll need to use the nested loop, just like in the previous function.

In [29]:
categories_android = freq_table(android_final, 1) # freq table generated using the freq_table() function

for category in categories_android: # loop over the unique categories in the Android data set
    total = 0 # stores the sum of installs specific to each category
    len_category = 0 # stores the number of apps specific to each category
    for app in android_final: # nested loop over every app in the data set
        category_app = app[1] # save the app category to a variable named category_app
        if category_app == category: # if category_app is the same as category (the iteration variable in main loop)
            n_installs = app[5] # save the number of installs (a string such as '1,000,000+')
            n_installs = n_installs.replace(',', '') # remove the commas
            n_installs = n_installs.replace('+', '') # remove the plus signs
            total += float(n_installs) # convert the string to a float and add it to total
            len_category += 1 # increment len_category by 1
    avg_n_installs = total / len_category # compute the average number of installs
    print(category, ':', avg_n_installs) # print the category and its average number of installs

ART_AND_DESIGN : 2021626.7857142857
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

It looks like, on average, the communication genre has the most installs: 38,456,119. Let's dig a little deeper into this specific genre:

In [30]:
for app in android_final: 
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])
    

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

ES-1 : 500+
Hangouts Dialer - Call Phones : 10,000,000+
EU Council : 1,000+
Council Voting Calculator : 5,000+
Have your say on Europe : 500+
Programi podrške EU : 100+
Inbox.eu : 10,000+
Web Browser for Android : 1,000,000+
Everbridge : 100,000+
Best Auto Call Recorder Free : 500+
EZ Wifi Notification : 10,000+
Test Server SMS FA : 5+
Lite for Facebook Messenger : 1,000,000+
FC Browser - Focus Privacy Browser : 1,000+
EHiN-FH conferenceapp : 100+
Carpooling FH Hagenberg : 100+
Wi-Fi Auto-connect : 1,000,000+
Talkie - Wi-Fi Calling, Chats, File Sharing : 500,000+
WeFi - Free Fast WiFi Connect & Find Wi-Fi Map : 1,000,000+
Sat-Fi : 5,000+
Portable Wi-Fi hotspot Free : 100,000+
TownWiFi | Wi-Fi Everywhere : 500,000+
Jazz Wi-Fi : 10,000+
Sat-Fi Voice : 1,000+
Free Wi-fi HotspoT : 50,000+
FN Web Radio : 10+
FNH Payment Info : 10+
MARKET FO : 100+
FO OP St-Nazaire : 100+
FO SODEXO : 100+
FO RCBT : 100+
FO Interim : 100+
FO PSA Sept-Fons : 100+
FO AIRBUS TLSE : 1,000+
FO STELIA Méaulte : 100

And to no one's surprise, this number is heavily skewed upward by a small number of apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts). 

In [31]:
for app in android_final: 
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'
                                      ):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we were to remove the communication apps that have 100 million or more installs, what would happen to the average?

In [32]:
apps_under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        apps_under_100_m.append(float(n_installs))

sum(apps_under_100_m) / len(apps_under_100_m)

3603485.3884615386

- The average installs for the communication genre with the really popular apps included (100 million+ installs): **38,456,119**
- The average installs for the communication genre **without** these really popular apps included: **3,603,485**

That's quite a drastic change!

The same can be said about the **video players** category, which is the runner-up with 24,727,872 average installs. The market is dominated by apps like YouTube, Google Play Movies, and MX Player. The pattern is repeated for **social apps** (where we have giants like Facebook, Instagram, Google+, etc.), **photography apps** (Google Photos and other popular photo editors), and **productivity apps** (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Recall that our **main concern** is that app genres might seem more popular than they really are! It is also worth noting that these niches seem to be dominated by a few giants that are hard to compete against (giants like Google or Facebook). 

So do we really want to build apps in genres that are dominated by a handful of apps run by giants like Google or Facebook?

The BOOKS_AND_REFERENCE genre might be worth looking into. Not only is it fairly popular (8,767,811 average installs), but the reference genre was also fairly popular in our iOS data set. And recall that part of our objective is to recommend apps that would be popular in both the App Store and Google Play markets.

Let's look at some of the apps from this genre and their number of installs:

In [33]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

Is there a small number of apps that skew the average install numbers?

In [34]:
for app in android_final: 
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'
                                            ):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Let's compare the avgs with and without these popular apps:

In [35]:
book_apps_under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'BOOKS_AND_REFERENCE') and (float(n_installs) < 100000000):
        book_apps_under_100_m.append(float(n_installs))

sum(book_apps_under_100_m) / len(book_apps_under_100_m)

1437212.2162162163

- The average installs for the BOOKS_AND_REFERENCE genre with the really popular apps included (100 million+ installs): **8,767,811**
- The average installs for the BOOKS_AND_REFERENCE genre without these really popular apps included: **1,437,212**

It still looks like a small number of apps heavily skew the average, but notice that there are only a few very popular apps (Google Play Books, Bible, Amazon Kindle, Wattpad, Audiobooks from Audible), so the market still shows some potential.
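
Since the with/without comparison was written out twice (once for COMMUNICATION and once for BOOKS_AND_REFERENCE), it could be wrapped in a small helper. The sketch below assumes rows laid out like `android_final` (category at index 1, installs at index 5). The function name `avg_installs`, the `max_installs` parameter, and the sample rows are all hypothetical and only serve to illustrate the idea.

```python
def avg_installs(dataset, category, max_installs=None):
    """Average installs for one category.

    If max_installs is given, apps at or above that number are skipped.
    Returns None when no app matches.
    """
    installs = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = float(app[5].replace(',', '').replace('+', ''))
        if max_installs is None or n_installs < max_installs:
            installs.append(n_installs)
    if not installs:
        return None
    return sum(installs) / len(installs)


# Hypothetical rows: only index 1 (category) and index 5 (installs) matter here
sample = [
    ['App A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['App B', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['App C', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
    ['App D', 'COMMUNICATION', '', '', '', '10,000+'],
]

avg_all = avg_installs(sample, 'BOOKS_AND_REFERENCE')
avg_capped = avg_installs(sample, 'BOOKS_AND_REFERENCE', max_installs=100000000)

print('with giants    :', avg_all)
print('without giants :', avg_capped)
```

With the real data, calling the helper with `android_final` in place of `sample` should reproduce the two averages compared above, provided the columns are laid out as assumed.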

Let's try to get some app ideas based **only** on apps that sit somewhere in the middle in terms of popularity. Let's say that middle apps have between 1,000,000 and 100,000,000 downloads:

In [36]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                     or app[5] == '5,000,000+'
                                     or app[5] == '10,000,000+'
                                     or app[5] == '50,000,000+'
                                     ):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since we would again face significant competition.

Notice that there are also quite a few apps built around the Quran, which may imply that building an app around a popular book can be profitable. Broadly speaking: taking a popular book and turning it into an app could be profitable for both Google Play and the App Store!

It's important to note that the market seems to be full of libraries, so we need to add some special features besides the raw version of the book, such as daily quotes from the book, an audio version of the book, quizzes on the book, and maybe even a forum where people can discuss the book.

## Conclusion

Let's briefly recap what our process looked like for this project:
- We started by clarifying the goal of our project
- We collected relevant data
- We cleaned the data to prepare it for our analysis
- We analyzed the cleaned data


We analyzed data about the App Store and Google Play mobile apps. The goal was to recommend an app profile that is profitable for both markets.

We concluded that taking a popular book and turning it into an app could be profitable for both markets. However, both markets are already full of libraries, so we need to add some special features besides the raw version of the book to distinguish our application, such as daily quotes from the book, an audio version of the book, quizzes on the book, or a forum where people can discuss the book.