# Profitable Apps in Google Play and App Store
_First chapter of my journey to learn Data Analytics and Data Science_

## 1. Background
Our company plans to develop a new application for mobile devices. We aim to build a free app and rely on in-app ads as the main source of revenue. Since our revenue will be highly influenced by the number of users, before deciding which app to develop we would like to analyze which app profiles are more likely to attract users.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. See the chart below for a comparison across five different app marketplaces.

![App statistics](py1m8_statista.png)

Based on the statistics above, we limit the scope of our analysis to apps offered on two platforms: Android (Google Play) and iOS (App Store).


## 2. Exploring the data  
Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two datasets that seem suitable for our purpose:
- [Google Play dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing approximately 10,000 Android apps, collected in August 2018, and
- [App Store dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing approximately 7,000 iOS apps, collected in July 2017

First, let's open our datasets.

In [1]:
from csv import reader

# This is the dataset for App Store
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]  # Extract the header row
ios = ios[1:]        # ios dataset without header

# This is the dataset for Google Play
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]  # Extract the header row
android = android[1:]        # android dataset without header
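
The cell above leaves both files open after reading them. As a minimal sketch (the helper name `load_csv` is hypothetical, not part of the original notebook), the same loading step could be wrapped in a function that closes the file automatically with a `with` block:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline='' is what the csv module documentation recommends for files passed to reader
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the file names from this project:
# ios_header, ios = load_csv('AppleStore.csv')
# android_header, android = load_csv('googleplaystore.csv')
```

This should give the same `ios`, `ios_header`, `android`, and `android_header` values as the cell above, assuming both files are in the working directory.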

Next, we build a function so we can easily read and explore the data. A small preview of the App Store dataset is shown below.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(ios_header)
print('\n')
explore_data(ios, 0, 2 ,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


From the preview of the App Store dataset above, we have 7197 apps and 16 attributes for each app. We identified several attributes that may be useful for our analysis: **track_name**, **price**, **rating_count_tot**, **user_rating**, and **prime_genre**.  

Now, let's see what's in the Google Play dataset.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In the Play Store dataset, we have 10841 apps with 13 attributes each, not too different from the App Store dataset. We determine that **App**, **Category**, **Rating**, **Reviews**, **Price**, and **Genres** will be useful for our analysis.

## 2. Cleaning the Data
Before analyzing the data, we need to clean it to ensure our datasets are free from incorrect, irrelevant, inconsistent, and duplicate information. This section is broken down into several subsections. 

### 2.1 Deleting wrong data

According to [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the Google Play dataset, there is an error in row 10472. We have to confirm this error first:

In [4]:
print(android_header, '\n')
flag = 0                                    # Set once, before the loop, so a detected error is not forgotten
for index, row in enumerate(android):
    if len(row) != len(android_header):     # Compare the length of each row with the header row to find a mismatch
        print('Error occurs in row: ', index, '\n')  # Report the row number where the mismatch occurs
        print('Preview:', '\n', row)
        flag = 1

if flag == 0:
    print('No error detected')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Error occurs in row:  10472 

Preview: 
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The result above shows that there is indeed an error in row 10472, since the length of the row does not match the length of the header. On closer inspection, the row is missing its `Category` value: a category is very unlikely to be `1.9`, so this value must be the rating, shifted one position to the left because one column is missing.  

Therefore, we will delete this row:

In [5]:
del android[10472] 
# Careful not to run this command more than once, otherwise a valid row will be deleted
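
Because `del android[10472]` removes whatever row currently sits at that position, running the cell twice would delete a valid row. A possible alternative, sketched here on toy data (the function name `drop_malformed_rows` and the demo rows are illustrative assumptions), is to filter by row length, which is safe to repeat:

```python
def drop_malformed_rows(rows, header):
    """Return only the rows whose length matches the header length."""
    return [row for row in rows if len(row) == len(header)]

# Toy data standing in for the real dataset
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # malformed: one column is missing
    ['App C', 'GAME', '3.9'],
]

demo_rows = drop_malformed_rows(demo_rows, demo_header)
demo_rows = drop_malformed_rows(demo_rows, demo_header)  # a second run changes nothing
print(len(demo_rows))  # 2
```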

Let's apply similar process to check the App Store dataset.

In [6]:
flag = 0                                    # Set once, before the loop
for index, row in enumerate(ios):
    if len(row) != len(ios_header):
        print('Error occurs in row: ', index, '\n')
        print('Preview:', '\n', row)
        flag = 1

if flag == 0:
    print('No error detected')

No error detected


We did not find any errors in the App Store dataset, so we can continue to the next step of the data cleaning process.

### 2.2 Removing duplicate entries
This time we will check each dataset for duplicate entries, meaning the same app appears several times in the dataset. For the Play Store, we can check the app names in the `App` column (index 0). If an app name appears more than once in the dataset, we conclude that the app has duplicate entries.

In [7]:
duplicate_apps_android = []
unique_apps_android = []

for app in android:
    name = app[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)
        
print('No of duplicates: ', len(duplicate_apps_android), '\n')
print(duplicate_apps_android[:4])

No of duplicates:  1181 

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']


Now we know there are 1181 duplicates in the Google Play dataset. Let's use a similar approach to check for duplicates in the App Store dataset. We will use the `id` attribute (index 0) in this dataset.

In [8]:
duplicate_apps_ios = []
unique_apps_ios = []

for app_id in ios:
    name = app_id[0]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)
        
print('No of duplicates: ', len(duplicate_apps_ios), '\n')
print(duplicate_apps_ios[:4])

No of duplicates:  0 

[]


The result shows that there are no duplicate entries in the App Store dataset.  
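
The `name in unique_apps_android` test scans a list on every iteration, which gets slow as the dataset grows. As a hedged alternative sketch (the helper `find_duplicates` and the toy rows are assumptions, not the notebook's code), `collections.Counter` can count every name in a single pass:

```python
from collections import Counter

def find_duplicates(rows, index):
    """Return {value: count} for every value that appears more than once in a column."""
    counts = Counter(row[index] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

# Toy rows standing in for the real dataset
demo = [['Box'], ['Zoom'], ['Box'], ['Box'], ['Gmail']]
dupes = find_duplicates(demo, 0)
print(dupes)                               # {'Box': 3}
print(sum(n - 1 for n in dupes.values()))  # 2 redundant rows
```

The second number is the one comparable to `len(duplicate_apps_android)` above, since that list receives one entry for every repeated occurrence after the first.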

Next we have to remove the duplicates in the Google Play dataset, retaining a single entry for each app. We could randomly pick which duplicate to keep, but we won't do that in this analysis. Instead, we will keep the most recent entry, which we assume is the one with the highest number of reviews, since review counts grow over time. We can find this information in the `Reviews` column (index 3).

To do that, we will:
- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create new dataset, which will have only one entry per app

First, we build the dictionary:

In [25]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])                                  # Index 3 is the number of reviews of each app
    if name in reviews_max and reviews_max[name] < n_reviews:  # If the app is already in the dictionary and this row has more reviews,
        reviews_max[name] = n_reviews                          #    replace the stored value with the higher count
    elif name not in reviews_max:                              # If the app is not in the dictionary yet,
        reviews_max[name] = n_reviews                          #    add it with its number of reviews
        
print(len(reviews_max))                                        # The length of our new dictionary
print(len(android) - len(duplicate_apps_android))              # The length of our original data minus 
                                                               #    the number of duplicates (expected result)

9659
9659


We have built the dictionary and verified that its length matches the expected length of the Google Play dataset after removing the duplicates. The expected length is the current length minus the number of duplicates.

Now that we have the dictionary `reviews_max`, we can build our new dataset, which contains only unique apps.

We loop through the `android` dataset and in each iteration we compare the number of reviews to the value stored in `reviews_max`. If it matches, we append the row to our new dataset `android_clean`.

In detail, the code below does the following:
- create two lists: `android_clean`, which will be our new dataset, and `already_added`, which helps us identify whether the app has already been added to `android_clean`,
- iterate over each row in the `android` dataset,
- obtain the app name (index 0) and assign it to the variable `name`,
- obtain the number of reviews (index 3) and assign it to the variable `n_reviews`,
- append the row only if `n_reviews` matches the corresponding value in the `reviews_max` dictionary we built in the previous step, and only if the app name is not yet in the `already_added` list. The second condition matters when duplicates share the same highest number of reviews; without it, each of them would be added.

Then we test by printing the length of our new dataset. The expected length is 9659.

In [10]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print('Our new dataset length: ', len(android_clean))

Our new dataset length:  9659


The length of our new dataset is verified.
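
The two steps above (build `reviews_max`, then filter) can also be folded into one pass by storing the whole row in the dictionary. This is only a sketch under the same assumptions as the notebook (name at index 0, review count at index 3); the function name `keep_most_reviewed` and the toy rows are hypothetical:

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two duplicates tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: [App, Category, Rating, Reviews]
demo = [
    ['Box', 'BUSINESS', '4.2', '10'],
    ['Zoom', 'BUSINESS', '4.4', '5'],
    ['Box', 'BUSINESS', '4.2', '25'],
]
print(keep_most_reviewed(demo))
# [['Box', 'BUSINESS', '4.2', '25'], ['Zoom', 'BUSINESS', '4.4', '5']]
```

One difference to be aware of: the rows come out in the order each name first appears, which may differ from the row order of `android_clean`.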

### 2.3 Removing non-English app
In this project, we limit our analysis to apps designed for an English-speaking audience. So we have to check our datasets for non-English apps and remove them.

To do this, we check for app names that contain symbols not commonly used in English text. Using the built-in function `ord()`, we can obtain the corresponding number for each character. According to [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange), characters used in English text correspond to numbers in the range **0 to 127**. Therefore, we check the app names in our datasets, and if we find characters with numbers beyond the range of 0 to 127, the app is most likely not in English.

To make the check easy to reuse, we build a function `is_english` which takes a string and returns `True` if the input text is in English, and `False` otherwise. To make the function more robust, we only classify the string as non-English if it contains more than three characters whose number is above 127, so even if the string contains an emoji or a few other uncommon symbols, our function still classifies it as English text.

This method is not perfect, since a non-English app may have an English name and an English app may have several uncommon characters in its name, but it should be good enough for our analysis.

In [11]:
def is_english(string):
    count = 0                      # Counter for the non-ASCII characters detected so far
    for character in string:
        if (ord(character) > 127):
            count += 1
            if count > 3:
                return False
    return True

# Test the function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
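
To see why the last two test strings pass, it can help to count the characters above 127 directly. The helper below is an illustrative sketch (`count_non_ascii` is a hypothetical name, not part of the notebook):

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for character in text if ord(character) > 127)

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))
# Instagram -> 0
# Docs To Go™ Free Office Suite -> 1
# Instachat 😜 -> 1
```

Each of these names stays at or below three non-ASCII characters, so `is_english` returns `True` for them, while the Chinese title contains many more and is rejected.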


Using our new function `is_english`, we can now filter out the non-English apps and build new datasets that contain only English apps. The new datasets will be `android_english` and `ios_english`.

To do this, we iterate over each app in each dataset and check whether the app name is English using the function `is_english`. Only English apps are appended to the new datasets.

After that, we check the length of each new dataset.

In [12]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if (is_english(name)):
        android_english.append(app)

for app in ios:
    name = app[1]  # For ios, the app name is under column 'track_name' (index 1)
    if (is_english(name)):
        ios_english.append(app)
    
print('Google Play dataset length: ', len(android_english))
print('App Store dataset length: ', len(ios_english))

Google Play dataset length:  9614
App Store dataset length:  6183


In the Play Store dataset, we have reduced the number of apps **from 9659 to 9614**, and in the App Store dataset **from 7197 to 6183**. We may still find a few non-English apps in our new datasets because our function is not perfect, but they should be rare enough not to affect our analysis.

### 2.4 Filtering free app
Since we plan to develop an app that is free to download and install, with in-app ads as our main source of revenue, we will keep only the free apps for our analysis. The attributes in each dataset that we need for this step are:
- Play Store: `Price` column (index 7)
- App Store: `price` column (index 4)

Note that the price values are strings, so we compare them with the strings `'0'` and `'0.0'` rather than with `int` or `float` values.

This will be the last step of our data cleaning process.

In [26]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]                 # index 7 is the app price in Play Store
    if(price == '0'):
        android_final.append(app)
        
for app in ios_english:
    price = app[4]                 # index 4 is the app price in App Store
    if(price == '0.0'):
        ios_final.append(app)

print('Final android dataset: ', len(android_final))
print('Final ios dataset: ', len(ios_final))

Final android dataset:  8864
Final ios dataset:  3222
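
Comparing against `'0'` and `'0.0'` works here because each dataset happens to write a free price in exactly one way. A more defensive sketch converts the text to a number first; it assumes paid Google Play prices carry a leading `$` (for example `'$4.99'`), which should be verified against the data, and the helper name `is_free` is hypothetical:

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')   # assumption: '$' is the only currency prefix
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With such a helper, the same condition could be used for both datasets instead of two different string literals.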


So far we have done these steps for the data cleaning process:
- Deleting wrong data
- Removing duplicate entries
- Removing non-English apps
- Filtering free apps

Now that we have finished cleaning the datasets, we can start to analyze them.

## 3. Analyzing the Data
### 3.1 Categorizing by genres
As a reminder, our goal is to determine what kind of app attracts the most users, because our revenue depends on the number of people using our app.

So let's begin our analysis by determining which genres of apps people are likely to install. Both datasets contain a column we can use to categorize the apps:
- in Google Play we use column `Category`, and
- in App Store we use column `prime_genre`.

To simplify, we will use the word **genres** to refer to those columns. By categorizing the apps by genre, we can find which genres dominate the market on each platform. Next, we will build a function that returns the share of apps in each genre of a dataset, expressed as a percentage.

In [27]:
def freq_percent(dataset, index):
    freq_table = {}  # Dictionary that will hold the frequency of each value
    total = 0        # Number of rows in the dataset, used to calculate the percentages
    
    # Iteration to create the frequency table
    for row in dataset:
        value = row[index]
        total += 1
        
        if(value in freq_table):       # If the value is already in our frequency table, then add the occurrence by 1
            freq_table[value] += 1
        else:                          # First occurrence of the value
            freq_table[value] = 1
    
    # Iteration to calculate the percentages of each value
    percent_table = {}
    for key in freq_table:
        percentage = (freq_table[key] / total) * 100
        percent_table[key] = round(percentage, 2)     # Rounding the percentage value up to two decimals
        
    return percent_table

print(freq_percent(android_final, 1))
print('\n')
print(freq_percent(ios_final, 11))

{'ART_AND_DESIGN': 0.64, 'AUTO_AND_VEHICLES': 0.93, 'BEAUTY': 0.6, 'BOOKS_AND_REFERENCE': 2.14, 'BUSINESS': 4.59, 'COMICS': 0.62, 'COMMUNICATION': 3.24, 'DATING': 1.86, 'EDUCATION': 1.16, 'ENTERTAINMENT': 0.96, 'EVENTS': 0.71, 'FINANCE': 3.7, 'FOOD_AND_DRINK': 1.24, 'HEALTH_AND_FITNESS': 3.08, 'HOUSE_AND_HOME': 0.82, 'LIBRARIES_AND_DEMO': 0.94, 'LIFESTYLE': 3.9, 'GAME': 9.72, 'FAMILY': 18.91, 'MEDICAL': 3.53, 'SOCIAL': 2.66, 'SHOPPING': 2.25, 'PHOTOGRAPHY': 2.94, 'SPORTS': 3.4, 'TRAVEL_AND_LOCAL': 2.34, 'TOOLS': 8.46, 'PERSONALIZATION': 3.32, 'PRODUCTIVITY': 3.89, 'PARENTING': 0.65, 'WEATHER': 0.8, 'VIDEO_PLAYERS': 1.79, 'NEWS_AND_MAGAZINES': 2.8, 'MAPS_AND_NAVIGATION': 1.4}


{'Social Networking': 3.29, 'Photo & Video': 4.97, 'Games': 58.16, 'Music': 2.05, 'Reference': 0.56, 'Health & Fitness': 2.02, 'Weather': 0.87, 'Utilities': 2.51, 'Travel': 1.24, 'Shopping': 2.61, 'News': 1.33, 'Navigation': 0.19, 'Lifestyle': 1.58, 'Entertainment': 7.88, 'Food & Drink': 0.81, 'Sports': 2.14, 'Bo

We have built a function that shows us the share of apps in each genre, but the output is pretty difficult to read.
So let's build another function that displays our result in a more readable way by sorting it.

We will use the built-in function `sorted` for this. But since this function does not work well with dictionaries, we will have to convert our dictionary `percent_table` to a list of tuples. If we don't do this, the `sorted` function will only return the dictionary keys, and the dictionary values will be left out.
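
A small illustration of this behaviour, using three values taken from the output above (the name `demo_table` is just for this sketch). It also shows a shorter alternative to building the tuples by hand: sorting `dict.items()` with a `key` function.

```python
demo_table = {'GAME': 9.72, 'FAMILY': 18.91, 'TOOLS': 8.46}

# Sorting a dictionary directly only looks at (and returns) its keys
print(sorted(demo_table))
# ['FAMILY', 'GAME', 'TOOLS']

# Sorting the (key, value) pairs by value keeps both pieces of information
by_value = sorted(demo_table.items(), key=lambda item: item[1], reverse=True)
print(by_value)
# [('FAMILY', 18.91), ('GAME', 9.72), ('TOOLS', 8.46)]
```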

In detail, here are the steps we will take:
- call the function `freq_percent` so we have our initial unsorted dictionary `table`,
- initialize the list `unsorted_table`, which will hold our conversion result, still unsorted,
- iterate over each key in the dictionary `table` and pair it with its value as a tuple. We put the value `table[key]` first and then the key, because we want to sort our list by the values, not the keys,
- in every iteration, append the tuple to `unsorted_table`,
- sort `unsorted_table` using the built-in function `sorted`, with `reverse=True` because we want the table in descending order,
- print each entry of `sorted_table` with the key first, followed by the value, for readability.

In [29]:
def display_table(dataset, index):
    table = freq_percent(dataset, index)                # Obtain the frequency table of the column which we will sort
    unsorted_table = []                                 # Empty list to store our conversion result
    
    for key in table:
        key_val_as_tuple = (table[key], key)            # Obtain each value and key for dictionary, and save them as tuple
        unsorted_table.append(key_val_as_tuple)         # Store the tuples into unsorted_table
        
    sorted_table = sorted(unsorted_table, reverse=True) # Sort the list by the value in descending order (reverse=True)
    for entry in sorted_table:                          # Print the list in readable way
        print(entry[1], ': ', entry[0])

# Testing the function on Play Store dataset
print("Genres' popularity in Google Play:")
display_table(android_final, 1)                         # Index 1 is the Category column in Play Store dataset

Genres' popularity in Google Play:
FAMILY :  18.91
GAME :  9.72
TOOLS :  8.46
BUSINESS :  4.59
LIFESTYLE :  3.9
PRODUCTIVITY :  3.89
FINANCE :  3.7
MEDICAL :  3.53
SPORTS :  3.4
PERSONALIZATION :  3.32
COMMUNICATION :  3.24
HEALTH_AND_FITNESS :  3.08
PHOTOGRAPHY :  2.94
NEWS_AND_MAGAZINES :  2.8
SOCIAL :  2.66
TRAVEL_AND_LOCAL :  2.34
SHOPPING :  2.25
BOOKS_AND_REFERENCE :  2.14
DATING :  1.86
VIDEO_PLAYERS :  1.79
MAPS_AND_NAVIGATION :  1.4
FOOD_AND_DRINK :  1.24
EDUCATION :  1.16
ENTERTAINMENT :  0.96
LIBRARIES_AND_DEMO :  0.94
AUTO_AND_VEHICLES :  0.93
HOUSE_AND_HOME :  0.82
WEATHER :  0.8
EVENTS :  0.71
PARENTING :  0.65
ART_AND_DESIGN :  0.64
COMICS :  0.62
BEAUTY :  0.6


The result above shows that in Google Play, no genre significantly dominates the market; the apps are distributed relatively evenly across genres. But if we dig deeper, there is actually another attribute in the Google Play dataset, the `Genres` column, which gives us more specific genres.

In [30]:
print("Genres' popularity in Google Play:")
display_table(android_final, 9)                # Index 9 is the `Genres` column in Play Store dataset

Genres' popularity in Google Play:
Tools :  8.45
Entertainment :  6.07
Education :  5.35
Business :  4.59
Productivity :  3.89
Lifestyle :  3.89
Finance :  3.7
Medical :  3.53
Sports :  3.46
Personalization :  3.32
Communication :  3.24
Action :  3.1
Health & Fitness :  3.08
Photography :  2.94
News & Magazines :  2.8
Social :  2.66
Travel & Local :  2.32
Shopping :  2.25
Books & Reference :  2.14
Simulation :  2.04
Dating :  1.86
Arcade :  1.85
Video Players & Editors :  1.77
Casual :  1.76
Maps & Navigation :  1.4
Food & Drink :  1.24
Puzzle :  1.13
Racing :  0.99
Role Playing :  0.94
Libraries & Demo :  0.94
Auto & Vehicles :  0.93
Strategy :  0.91
House & Home :  0.82
Weather :  0.8
Events :  0.71
Adventure :  0.68
Comics :  0.61
Beauty :  0.6
Art & Design :  0.6
Parenting :  0.5
Card :  0.45
Casino :  0.43
Trivia :  0.42
Educational;Education :  0.39
Board :  0.38
Educational :  0.37
Education;Education :  0.34
Word :  0.26
Casual;Pretend Play :  0.24
Music :  0.2
Racing;Action & 

This information is more specific because it categorizes the data not only by genre but also by subgenre or secondary genre. But since the result is not much different from our previous result, and at the moment we only need the bigger picture, we will disregard this result and proceed with the previous one in mind.

Now let's move on to the iOS dataset:

In [17]:
print("Genres' popularity in App Store:")
display_table(ios_final, 11)

Genres' popularity in App Store:
Games :  58.16
Entertainment :  7.88
Photo & Video :  4.97
Education :  3.66
Social Networking :  3.29
Shopping :  2.61
Utilities :  2.51
Sports :  2.14
Music :  2.05
Health & Fitness :  2.02
Productivity :  1.74
Lifestyle :  1.58
News :  1.33
Travel :  1.24
Finance :  1.12
Weather :  0.87
Food & Drink :  0.81
Reference :  0.56
Business :  0.53
Book :  0.43
Navigation :  0.19
Medical :  0.19
Catalogs :  0.12


In contrast with the Google Play dataset, in the App Store we can see one genre that clearly dominates the market: `Games`, with 58.16% of the apps (more than half of the dataset). Looking further, the market is mostly dominated by apps that are designed for fun (games, entertainment, hobbies, etc.).

However, this data is based only on the number of apps in the market, not the number of users. So we cannot yet conclude that those top genres are actually popular.

So let's continue our analysis from another angle to determine the actual popularity of each genre. We can estimate it by taking into account the number of users who install the apps.

### 3.2 Most popular apps by genres
So far our frequency tables only show which genres have the most apps. The `Games` genre dominates the App Store, while Google Play shows a more balanced landscape across genres. Now we'd like to determine which genres have the most users.

One way to find out which genres have the most users is:
- For Google Play: use the value in the `Installs` column to determine how many users downloaded and installed each app.
- For App Store: there is no `Installs` column here, but we can use `rating_count_tot` as a proxy, that is, how many ratings each app has received.
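
Either way, the calculation is an average of one numeric column grouped by genre. As a rough sketch on toy rows (the helper name `average_by_group` and the demo data are assumptions), this can be done in a single pass over the dataset with two dictionaries:

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: average of the value column}, rounded to two decimals."""
    totals = defaultdict(float)   # sum of values per group
    counts = defaultdict(int)     # number of rows per group
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Toy rows: [genre, rating count]
demo = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
print(average_by_group(demo, 0, 1))
# {'Games': 200.0, 'Weather': 50.0}
```

For the App Store data this would correspond to `average_by_group(ios_final, 11, 5)`, using the column indexes described in this notebook.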

### Most popular app in App Store
Let's retrieve the data from App Store first:

In [31]:
prime_genre = freq_percent(ios_final, 11)    # Index 11 is the `prime_genre` column in App Store dataset

for genre in prime_genre:
    total = 0                                # This will store the total number of ratings received on each genre
    len_genre = 0                            # This will store the number of apps on each genre
    
    for row in ios_final:
        genre_app = row[11]                  # Index 11 is the `prime_genre` column in App Store dataset
        if (genre_app == genre):
            total += float(row[5])           # Index 5 is the `rating_count_tot`, we count the number of ratings for each genre
            len_genre += 1
            
    avg_rating = round(total / len_genre, 2) # Average number of ratings per app, rounded to two decimal places
    print(genre, ': ', avg_rating)

Social Networking :  71548.35
Photo & Video :  28441.54
Games :  22788.67
Music :  57326.53
Reference :  74942.11
Health & Fitness :  23298.02
Weather :  52279.89
Utilities :  18684.46
Travel :  28243.8
Shopping :  26919.69
News :  21248.02
Navigation :  86090.33
Lifestyle :  16485.76
Entertainment :  14029.83
Food & Drink :  33333.92
Sports :  23008.9
Book :  39758.5
Finance :  31467.94
Education :  7003.98
Productivity :  21028.41
Business :  7491.12
Catalogs :  4004.0
Medical :  612.0


The result above shows that, on average, Navigation apps receive the most ratings, followed by Reference and Social Networking apps. But before drawing any conclusion, let's look more closely at which Navigation apps account for those ratings.

In [19]:
for app in ios_final:
    if app[11] == 'Navigation':
        print(app[1], ': ', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5


It turns out that our data is highly influenced by Waze and Google Maps (together they have almost 500k ratings!). This is not a good sign, because our average is skewed by those two apps. Moreover, our goal is to develop a new app, and it would be very difficult to compete with those two big players. To avoid introducing bias into our analysis, we have to remove the apps that skew our dataset and rework the averages.
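
As a rough sketch of what reworking the average could look like, the snippet below recomputes the Navigation average from the six rating counts printed above after dropping the apps above a cutoff. The cutoff of 100,000 ratings and the helper name `average_below` are arbitrary assumptions for illustration, not a recommendation:

```python
def average_below(values, cutoff):
    """Average of the values strictly below the cutoff, or None if nothing remains."""
    kept = [value for value in values if value < cutoff]
    return round(sum(kept) / len(kept), 2) if kept else None

# Rating counts of the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(sum(navigation_ratings) / len(navigation_ratings), 2))  # 86090.33
print(average_below(navigation_ratings, 100000))                    # 4146.25
```

Without Waze and Google Maps, the average drops from about 86,000 to about 4,000 ratings per app, which shows how strongly the two largest apps drive the figure.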

We suspect a similar situation occurs in the Reference and Social Networking genres, so let's confirm it.

In [20]:
print('Reference apps')
for app in ios_final:
    if app[11] == 'Reference':
        print(app[1], ': ', app[5])

print('\n')
print('Social Networking apps')
for app in ios_final:
    if app[11] == 'Social Networking':
        print(app[1], ': ', app[5])

Reference apps
Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Google Translate :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
Merriam-Webster Dictionary :  16849
Night Sky :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693
GUNS MODS for Minecraft PC Edition - Mods Tools :  1497
Guides for Pokémon GO - Pokemon GO News and Cheats :  826
WWDC :  762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free :  718
VPN Express :  14
Real Bike Traffic Rider Virtual Reality Glasses :  8
教えて!goo :  0
Jishokun-Japanese English Dictionary & Translator :  0


Social Networking apps
Facebook :  2974676
Pinterest :  1061624
Skype for iPhone :  373

The Reference genre is highly influenced by Bible and Dictionary.com, while the Social Networking genre is highly influenced by Facebook and Pinterest. Still, developing an app in one of these genres is not necessarily a bad idea: as we know, the App Store is saturated with fun apps (as a reminder, more than 58% of its apps are games), so a good app outside the "fun" category might stand out more. We will keep these genres in mind for later evaluation.

### Most popular app in Google Play
Now let's continue our analysis with Google Play. Unfortunately, we cannot process the data in the same way as the App Store dataset, because the `Installs` column is a string and does not tell us precisely how many installs each app has.

In [32]:
display_table(android_final, 5)  # Index 5 is the `Installs` column; display_table prints the table itself

1,000,000+ :  15.73
100,000+ :  11.55
10,000,000+ :  10.55
10,000+ :  10.2
1,000+ :  8.39
100+ :  6.92
5,000,000+ :  6.83
500,000+ :  5.56
50,000+ :  4.77
5,000+ :  4.51
10+ :  3.54
500+ :  3.25
50,000,000+ :  2.3
100,000,000+ :  2.13
50+ :  1.92
5+ :  0.79
1+ :  0.51
500,000,000+ :  0.27
1,000,000,000+ :  0.23
0+ :  0.05
0 :  0.01


In this case we will convert each string into a `float` and use the lower bound as the value. To put it simply, the string `100,000+` will be treated as the number `100000.0`. We will use the `str.replace(old, new)` method for the conversion.
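
The conversion on its own can be sketched as a small helper (the name `parse_installs` is hypothetical and not used elsewhere in this notebook):

```python
def parse_installs(text):
    """Convert an install bucket such as '100,000+' into a float (its lower bound)."""
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('100,000+'))        # 100000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
print(parse_installs('0'))               # 0.0
```

Keep in mind that the result is only a lower bound: an app listed as `100,000+` may have anywhere between 100,000 and the next bucket's threshold in installs.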

In [33]:
Category = freq_percent(android_final, 1)            # Index 1 is the `Category` column in Play Store
max_cat = 0                                          # This variable is to capture the category which has maximum install
max_install = 0                                      # This variable is to capture the maximum install on each iteration

for cat in Category:
    total = 0                                        # To store the total number of installation of each category
    len_category = 0                                 # To store the number of apps within each category
    
    for app in android_final:
        category_app = app[1]
        if(category_app == cat):
            no_install = app[5]                      # Initiate the no_install variable
            no_install = no_install.replace('+', '') # Removing the plus ('+') sign
            no_install = no_install.replace(',', '') # Removing the comma (',') sign
            
            total += float(no_install)               # Count the total install for each category, after we convert them
            len_category += 1                        # Count the number of app that falls into each category
    
    avg_install = round(total / len_category, 2)     # Calculate the average value
    print(cat, ': ', avg_install)
    
    if(avg_install > max_install):
        max_cat = cat                                # Retrieve the category which have the most average install
        max_install = avg_install                    # Save the number of average install to be compared with the next iteration

print('\n')
print('The most popular is in', max_cat, 'genre')

ART_AND_DESIGN :  1986335.09
AUTO_AND_VEHICLES :  647317.82
BEAUTY :  513151.89
BOOKS_AND_REFERENCE :  8767811.89
BUSINESS :  1712290.15
COMICS :  817657.27
COMMUNICATION :  38456119.17
DATING :  854028.83
EDUCATION :  1833495.15
ENTERTAINMENT :  11640705.88
EVENTS :  253542.22
FINANCE :  1387692.48
FOOD_AND_DRINK :  1924897.74
HEALTH_AND_FITNESS :  4188821.99
HOUSE_AND_HOME :  1331540.56
LIBRARIES_AND_DEMO :  638503.73
LIFESTYLE :  1437816.27
GAME :  15588015.6
FAMILY :  3695641.82
MEDICAL :  120550.62
SOCIAL :  23253652.13
SHOPPING :  7036877.31
PHOTOGRAPHY :  17840110.4
SPORTS :  3638640.14
TRAVEL_AND_LOCAL :  13984077.71
TOOLS :  10801391.3
PERSONALIZATION :  5201482.61
PRODUCTIVITY :  16787331.34
PARENTING :  542603.62
WEATHER :  5074486.2
VIDEO_PLAYERS :  24727872.45
NEWS_AND_MAGAZINES :  9549178.47
MAPS_AND_NAVIGATION :  4056941.77


The most popular is in COMMUNICATION genre


The result shows that `COMMUNICATION` is the most popular genre on Google Play, with an average of more than 38 million installs per app. As in the App Store case, we should check whether this average is heavily influenced by a handful of apps.

In [23]:
for app in android_final:
    if(app[1] == 'COMMUNICATION') and ((app[5] == '1,000,000,000+') or (app[5] == '500,000,000+') or (app[5] == '100,000,000+')):
        print(app[0], ': ', app[5])

WhatsApp Messenger :  1,000,000,000+
imo beta free calls and text :  100,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
Who :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji :  100,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
Firefox Browser fast & private :  100,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Messenger Lite: Free Calls & Messages :  100,000,000+
Kik :  100,000,000+
KakaoTalk: Free Calls & Text :  100,000,000+
Opera Mini - fast web browser :  100,000,000+
Opera Browser: Fast and Secure :  100,000,000+
Telegram :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure :  

We filtered for apps with an extremely high number of installs, and we still got a long list. These apps clearly skew the average. Let's exclude the apps with 100M or more installs from the calculation to see whether it is still feasible to compete in this genre.

In [39]:
under_100M = []

for app in android_final:
    no_install = app[5]
    no_install = no_install.replace('+', '')
    no_install = no_install.replace(',', '')
    
    if(app[1] == 'COMMUNICATION') and (float(no_install) < 100000000):
        under_100M.append(float(no_install))
        
avg_under_100M = round(sum(under_100M) / len(under_100M), 2)
print('New average for COMMUNICATION genre:', avg_under_100M)

New average for COMMUNICATION genre: 3603485.39
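Another way to see how strongly a few giant apps pull the mean upward is to compare it with the median, which is far less sensitive to outliers. The sketch below uses the standard-library `statistics` module on made-up install counts (`toy_installs` is hypothetical data, not taken from the dataset):

```python
from statistics import mean, median

# Hypothetical install counts for one genre: several small apps and one giant
toy_installs = [1000.0, 5000.0, 10000.0, 50000.0, 1000000000.0]

toy_mean = mean(toy_installs)      # dominated by the single 1B+ app
toy_median = median(toy_installs)  # the middle value, unaffected by the giant

print('mean:', toy_mean)
print('median:', toy_median)
```

A large gap between the two values suggests that the average describes the few biggest apps rather than a typical app in the genre.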


After excluding the extremely popular apps, the average for the `COMMUNICATION` genre drops sharply, from about 38M to about 3.6M installs, so it no longer tops the ranking. The genres with the highest averages are now `VIDEO_PLAYERS` and `SOCIAL`, with about 24.7M and 23.3M installs respectively (before any similar filtering).

But we suspect that a similar situation occurs in the `VIDEO_PLAYERS` and `SOCIAL` genres, so let's check them first.

In [43]:
print('Extremely popular apps in VIDEO_PLAYERS genre:')
for app in android_final:
    if(app[1] == 'VIDEO_PLAYERS') and ((app[5] == '1,000,000,000+') or (app[5] == '500,000,000+') or (app[5] == '100,000,000+')):
        print(app[0], ': ', app[5])
        
print('\n')
print('Extremely popular apps in SOCIAL genre:')
for app in android_final:
    if(app[1] == 'SOCIAL') and ((app[5] == '1,000,000,000+') or (app[5] == '500,000,000+') or (app[5] == '100,000,000+')):
        print(app[0], ': ', app[5])

Extremely popular apps in VIDEO_PLAYERS genre:
YouTube :  1,000,000,000+
Motorola Gallery :  100,000,000+
VLC for Android :  100,000,000+
Google Play Movies & TV :  1,000,000,000+
MX Player :  500,000,000+
Dubsmash :  100,000,000+
VivaVideo - Video Editor & Photo Movie :  100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera :  100,000,000+
Motorola FM Radio :  100,000,000+


Extremely popular apps in SOCIAL genre:
Facebook :  1,000,000,000+
Facebook Lite :  500,000,000+
Tumblr :  100,000,000+
Pinterest :  100,000,000+
Google+ :  1,000,000,000+
Badoo - Free Chat & Dating App :  100,000,000+
Tango - Live Video Broadcast :  100,000,000+
Instagram :  1,000,000,000+
Snapchat :  500,000,000+
LinkedIn :  100,000,000+
Tik Tok - including musical.ly :  100,000,000+
BIGO LIVE - Live Stream :  100,000,000+
VK :  100,000,000+


So a similar situation also occurs in the `VIDEO_PLAYERS` and `SOCIAL` genres. Both are dominated by big players such as Facebook, Google, Instagram, and YouTube. Once those apps are set aside, `VIDEO_PLAYERS` and `SOCIAL` do not leave much popularity for small players.
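To compare genres on equal footing, the same cap can be applied to every category at once rather than one genre at a time. The following is only a sketch under stated assumptions: the function name `avg_installs_below`, its parameters, and the `toy_rows` data are hypothetical, and the column positions (1 for category, 5 for installs) mirror the ones used above.

```python
def parse_installs(raw):
    """Convert an install bucket such as '100,000+' into a float (hypothetical helper)."""
    return float(raw.replace('+', '').replace(',', ''))

def avg_installs_below(rows, cap, category_idx=1, installs_idx=5):
    """Average installs per category, ignoring apps with `cap` or more installs."""
    totals = {}
    counts = {}
    for row in rows:
        installs = parse_installs(row[installs_idx])
        if installs >= cap:
            continue                                  # skip the dominant apps
        cat = row[category_idx]
        totals[cat] = totals.get(cat, 0.0) + installs
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: round(totals[cat] / counts[cat], 2) for cat in totals}

# Made-up rows laid out like the Play Store data (name, category, ..., installs)
toy_rows = [
    ['App A', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['App B', 'SOCIAL', '', '', '', '10,000+'],
    ['App C', 'SOCIAL', '', '', '', '50,000+'],
    ['App D', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['App E', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
]

capped = avg_installs_below(toy_rows, cap=100_000_000)
for cat, avg in sorted(capped.items(), key=lambda item: item[1], reverse=True):
    print(cat, ': ', avg)
```

Running the same function on `android_final` would rank all genres by what is left for smaller apps; a category whose apps all exceed the cap simply does not appear in the result.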

The `GAME` category is very popular too, with an average of about 15.6M installs. But as we mentioned during the App Store analysis, the market is saturated with 'fun' apps, including the game, entertainment, and hobby genres. The situation might differ between Google Play and the App Store, but we would like to avoid recommending a different genre for each market. That means we also need to drop the `TRAVEL_AND_LOCAL` category, even though it has a fairly high number of installs.

## 4. Conclusion
To determine what kind of mobile app to develop, we analyzed app profiles on two big markets, Google Play and the App Store, to find out which genres are most likely to attract users.

After this process of elimination, especially of genres dominated by big players, we recommend developing an app in the `Reference` genre (`BOOKS_AND_REFERENCE` on Google Play). Reference apps are fairly popular on both platforms but are not dominated by a few companies, so competing in this genre is feasible. We will need to showcase unique features in our app to stand out.

Another option is the `Education` genre, for the same reasons.