# 1. Profitable App Profiles for the App Store and Google Play Markets

This project presents an introductory statistical analysis of Android and iOS mobile apps available on Google Play and the App Store, respectively. It aims to identify tendencies and correlations between an app's characteristics and its profitability. The extracted knowledge may help a company consolidate its strategy for developing an app whose specifications lead to a higher expected profitability in the two markets considered.


### Primary Constraints on the App's Specifications for Development

As constraints, this analysis only takes into account free apps (free to download and install), so the main source of revenue consists of in-app ads. As a consequence, the profitability of an app is mostly influenced by its number of users (taken as directly proportional to ad engagement). Additionally, a language constraint is imposed: the apps must be directed to an English-speaking audience.

These considerations set the guiding question of the statistical analysis: what types of apps are likely to attract more users on Google Play and the App Store, given the constraints on price (must be free) and language (English)?


### Data Methodology Overview

To achieve the purposes of the analysis, it is necessary to collect and analyze data about mobile apps available on Google Play and the App Store that satisfy the specified primary constraints. This constitutes the data methodology for validating an app idea before its development and release on Google Play and the App Store, and it is organized as follows:
- 1.1 Opening and Structuring the Data
- 1.2 Data Cleaning
- 1.3 Data Analysis 

The final results are essentially suited for companies that align with the free apps market. Nevertheless, the adopted methodology (statistical data manipulation) can be adapted for paid apps.


To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


## 1.1 Opening and Structuring the Data

To meet these goals, two extensively studied datasets, both available on the Kaggle website, were analyzed. For in-depth information about the attributes of both **datasets**, named by their **headers**, check the [Google Play Store Apps dataset documentation](https://www.kaggle.com/lava18/google-play-store-apps) and the [Apple Store Apps dataset documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).


After downloading the two datasets and saving them to the same directory as this Jupyter notebook, both were imported into the notebook environment. As both datasets are in CSV format, Python's built-in csv module is useful. This module offers two widely used tools that structure the input data (that is, organize its content): reader and DictReader. Since the headers are known in advance, DictReader was used for more intuitive data manipulation: it maps each row of the CSV file to an independent dictionary. To group the data into a single object, the row dictionaries were collected in a list, formally referred to as a list of dictionaries.

The resulting data structure (a list of dictionaries) allows data to be looked up by the CSV file's headers, as they are the keys of each entry (a dictionary) in the list.

In [1]:
from csv import DictReader
with open("AppleStore.csv") as file_apple_apps:
    read_apple_apps = DictReader(file_apple_apps)
    apple_apps = list(read_apple_apps)

with open("googleplaystore.csv") as file_google_apps:
    read_google_apps = DictReader(file_google_apps)
    google_apps = list(read_google_apps)
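
The difference between the two readers mentioned above can be illustrated with a minimal sketch. It assumes a tiny in-memory CSV (the `sample_csv` string is hypothetical sample data, not taken from the real files), so it runs without the downloaded datasets.

```python
import csv
import io

# A tiny in-memory CSV standing in for a real file (hypothetical sample data).
sample_csv = "track_name,price\nFacebook,0.0\nInstagram,0.0\n"

# csv.reader yields each row as a list of strings; the header is just the first row.
rows_as_lists = list(csv.reader(io.StringIO(sample_csv)))

# csv.DictReader consumes the header and yields every other row as a dictionary.
rows_as_dicts = list(csv.DictReader(io.StringIO(sample_csv)))

print(rows_as_lists[1])                # a list: values are found by position
print(rows_as_dicts[0]["track_name"])  # a dictionary: values are found by header
```

With the list of dictionaries, a value is reached through its header name instead of a column position, which is why this structure was preferred here.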

### Data Exploration

To visualize the resulting data structure and compute some basic metrics (the number of apps, i.e. rows of the CSV file, and the number of headers, i.e. columns of the CSV file, in both datasets), a function named explore_data() was written. It abstracts the algorithm and minimizes repetitive code.

The explore_data() function takes in four parameters:
- **dataset**, which is expected to be a list of dictionaries;
- **start** and **end** (two independent parameters), which are both expected to be integers, representing the starting and ending indices of the slice of the dataset to be displayed;
- **additional_info**, which is expected to be a boolean, with False as its default argument. If set to True, the function prints the number of rows and columns of the dataset.

The function slices the dataset using dataset[start:end] and loops through the slice. On each iteration it prints one entry of the list (a row of the CSV file mapped to a dictionary) followed by '\n', which leaves blank lines between consecutive rows.

In [2]:
def explore_data(dataset, start, end, additional_info = False):
    dataset_delimited = dataset[start:end]
    
    for data in dataset_delimited: 
        print(data)
        print('\n')
    
    if additional_info:
        print("Number of rows in the entire dataset are " + str(len(dataset)))
        print("Number of dataset headers are " + str(len(dataset[0])) + "\n")

In [3]:
print("Exploring Apple Store Apps dataset:")
explore_data(apple_apps, 0, 10, additional_info = True)

print("Exploring Google Play Store Apps dataset:")
explore_data(google_apps, 0, 10, additional_info = True)

Exploring Apple Store Apps dataset:
{'rating_count_ver': '212', 'cont_rating': '4+', 'ver': '95.0', 'sup_devices.num': '37', 'prime_genre': 'Social Networking', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '3.5', 'rating_count_tot': '2974676', 'user_rating': '3.5', 'currency': 'USD', 'track_name': 'Facebook', 'ipadSc_urls.num': '1', 'size_bytes': '389879808', 'vpp_lic': '1', 'id': '284882215'}


{'rating_count_ver': '1289', 'cont_rating': '12+', 'ver': '10.23', 'sup_devices.num': '37', 'prime_genre': 'Photo & Video', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '4.0', 'rating_count_tot': '2161558', 'user_rating': '4.5', 'currency': 'USD', 'track_name': 'Instagram', 'ipadSc_urls.num': '0', 'size_bytes': '113954816', 'vpp_lic': '1', 'id': '389801252'}


{'rating_count_ver': '579', 'cont_rating': '9+', 'ver': '9.24.12', 'sup_devices.num': '38', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '18', 'user_rating_ver': '4.5', 'rating_count_tot': '2130805', 'user_rating':

## 1.2 Data Cleaning

At this point, both datasets include all the apps that were available on Google Play and the App Store at the time of collection, so they may contain data that doesn't meet the primary constraints on the app's specifications: free to download and install, and directed towards an English-speaking audience. Additionally, there may be other inconsistencies (duplicate entries and partially omitted data in some entries) that affect the accuracy of the analysis.

Therefore, before beginning the analysis, it is necessary to pre-process the data (formally called data cleaning) so that it fits the purpose of the analysis.

Overall, the data cleaning process includes the following steps:
- **1.2.1** Detect inaccurate data, and correct or remove it.
- **1.2.2** Detect duplicate data, and remove the duplicates.
- **1.2.3** Isolate English and free apps.

### 1.2.1 Data Errors Analysis 

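Errors are detected by taking into consideration the expected data type and value pattern of each header: a column that should hold text must not parse as a number, a numeric column must parse as a number, and a rating cannot exceed 5. Missing values (None or an empty string) are flagged as well. We assume that a data point has been lost rather than swapped, so the affected row holds fewer values than there are headers.

The check is wrapped in a function, data_error(), allowing the abstraction of the conceived algorithm and the minimization of repetitive code.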
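
The assumption that a value was lost rather than swapped can be illustrated with a short sketch. It relies on the documented behaviour of csv.DictReader, which pads a short row with None (its default `restval`). The `sample_csv` content below is hypothetical and only mimics the shape of the problem.

```python
import csv
import io

# Hypothetical three-column file whose second data row lost its 'Category' value.
sample_csv = "App,Category,Rating\nGood App,TOOLS,4.1\nBroken App,1.9\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# The short row is padded with None at the end, and the surviving values
# slide one column to the left: the rating ends up under 'Category'.
print(rows[1])
```

A numeric value under a text header, together with a trailing None, is the pattern the error check looks for.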

In [4]:
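def float_type(data):
    """Return True when data can NOT be converted to float, False otherwise."""
    try:
        float(data)
        error = False
        
    except (ValueError, TypeError):
        # ValueError: non-numeric text; TypeError: None
        error = True
        
    return error 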

def data_error(dataset, dataset_flag):
    error_total = True
    
    if dataset_flag == "google":
        # Text columns; "App" is left out because an app name may legitimately be numeric.
        string_data = ["Category", "Genres", "Content Rating", "Last Updated", "Type", "Size"]
        num_data = ['Rating']
        key_rating = 'Rating'
        name_app = 'App'
                
    elif dataset_flag == "apple":
        string_data = ['cont_rating', 'prime_genre', 'currency']
        num_data = ['rating_count_ver', 'sup_devices.num', 'price', 'lang.num', 'user_rating_ver', 'rating_count_tot', 
                    'user_rating', 'ipadSc_urls.num', 'size_bytes', 'vpp_lic', 'id']
        key_rating = 'user_rating'
        name_app = 'track_name'
        
    app_error = []
    # enumerate() gives the true row number, even when two rows hold identical data
    for row, app in enumerate(dataset):
        for key, value in app.items():
            error = False
            
            if key in string_data:
                # a text column that parses as a number is suspicious
                error = not float_type(value)
                    
            elif key in num_data:
                # a numeric column must parse as a number; ratings cannot exceed 5
                error = float_type(value)

                if not error and key == key_rating and float(value) > 5.0:
                    error = True
                    
            if value is None or value == '':
                value = "None or empty string"
                app[key] = None
                error = True
                
            if error:   
                print(key + ": " + value)
                print("Incorrectness in the " + str(row) + " ith row in the dataset, relative to the '" + 
                      app[name_app] + "' app.\n")
                error_total = False
                
                # register each faulty row only once
                if row not in app_error:
                    app_error.append(row)
    
    if error_total:
        print("It wasn't found any error in the dataset.")
    
    return app_error

In [5]:
print("Apple Store Apps dataset:")
apple_error = data_error(apple_apps, "apple")

print("\nGoogle Play Store Apps dataset:")
google_error = data_error(google_apps, "google")

Apple Store Apps dataset:
It wasn't found any error in the dataset.

Google Play Store Apps dataset:
Current Ver: None or empty string
Incorrectness in the 1553 ith row in the dataset, relative to the 'Market Update Helper' app.

Type: NaN
Incorrectness in the 9148 ith row in the dataset, relative to the 'Command & Conquer: Rivals' app.

Category: 1.9
Incorrectness in the 10472 ith row in the dataset, relative to the 'Life Made WI-Fi Touchscreen Photo Frame' app.

Rating: 19
Incorrectness in the 10472 ith row in the dataset, relative to the 'Life Made WI-Fi Touchscreen Photo Frame' app.

Content Rating: None or empty string
Incorrectness in the 10472 ith row in the dataset, relative to the 'Life Made WI-Fi Touchscreen Photo Frame' app.

Android Ver: None or empty string
Incorrectness in the 10472 ith row in the dataset, relative to the 'Life Made WI-Fi Touchscreen Photo Frame' app.

Type: 0
Incorrectness in the 10472 ith row in the dataset, relative to the 'Life Made WI-Fi Touchscreen 

The obtained results agree with the statistics reported for both datasets, more precisely the reported missing data. These statistics can be viewed at the following links:
* [Apple Store Apps dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play Store Apps dataset](https://www.kaggle.com/lava18/google-play-store-apps)

To access the mentioned statistics, select the same features as those illustrated in the following images.

### Data Inspection for Insertion and Deletion

To better understand the reported errors, the complete data of the apps to which those errors relate will be inspected. This helps to understand the reasons behind those errors and, possibly, to identify others.

In [6]:
for i in google_error:
    print ("Entire " + str(i) + " ith app data information:")
    print(google_apps[i])
    print('\n')

Entire 1553 ith app data information:
{'Category': 'LIBRARIES_AND_DEMO', 'Rating': '4.1', 'Genres': 'Libraries & Demo', 'Content Rating': 'Everyone', 'Installs': '1,000,000+', 'Last Updated': 'February 12, 2013', 'App': 'Market Update Helper', 'Android Ver': '1.5 and up', 'Reviews': '20145', 'Current Ver': None, 'Type': 'Free', 'Price': '0', 'Size': '11k'}


Entire 9148 ith app data information:
{'Category': 'FAMILY', 'Rating': 'NaN', 'Genres': 'Strategy', 'Content Rating': 'Everyone 10+', 'Installs': '0', 'Last Updated': 'June 28, 2018', 'App': 'Command & Conquer: Rivals', 'Android Ver': 'Varies with device', 'Reviews': '0', 'Current Ver': 'Varies with device', 'Type': 'NaN', 'Price': '0', 'Size': 'Varies with device'}


Entire 10472 ith app data information:
{'Category': '1.9', 'Rating': '19', 'Genres': 'February 11, 2018', 'Content Rating': None, 'Installs': 'Free', 'Last Updated': '1.0.19', 'App': 'Life Made WI-Fi Touchscreen Photo Frame', 'Android Ver': None, 'Reviews': '3.0M', 'C

After inspecting both datasets for errors, the Google Play Store was searched to check whether the missing or correct information was available.

For the **'Market Update Helper' app**, it was confirmed that its current version (**'Current Ver'** attribute) is **missing**, so it will **remain empty** in the dataset (with the **None** flag as an alert for later analysis) instead of deleting the entire entry, which is relevant for the purpose of the `statistical analysis`.

For the **'Command & Conquer: Rivals' app** it was ....

For the **'Life Made WI-Fi Touchscreen Photo Frame' app**, inspecting its entire data shows that many **property values are shifted**. As already reported on [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), this is due to the **omission** of the value for the **'Category' column** (equivalent to the 'Category' property in the dictionary) in the original CSV file. This omission shifted all the remaining values one column to the left, producing an incorrect correspondence between each property and its value in the dictionary. Because the **dataset syntax** is not fully known to me (a manual fix could overlook syntax variants relevant to inserting the data correctly), it was decided to delete this entry from the Google Play Store Apps dataset.

In [7]:
print(len(google_apps))
del google_apps[10472]
print(len(google_apps))

10841
10840
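
Deleting by a hard-coded index removes a different app if the cell is executed a second time. A minimal sketch of a guarded deletion follows; `delete_row_checked` is a hypothetical helper (not part of the original notebook) and the three-row `sample` list merely stands in for google_apps.

```python
def delete_row_checked(dataset, row, expected_name, name_key="App"):
    """Delete dataset[row] only when it still holds the expected app.

    Hypothetical helper: it guards against deleting a different row when
    the cell is executed more than once.
    """
    if row >= len(dataset) or dataset[row][name_key] != expected_name:
        print("Row " + str(row) + " does not hold '" + expected_name + "'; nothing deleted.")
        return False
    del dataset[row]
    return True


# Hypothetical three-row sample standing in for google_apps.
sample = [{"App": "A"}, {"App": "Broken"}, {"App": "C"}]
print(delete_row_checked(sample, 1, "Broken"))  # first run removes the row
print(delete_row_checked(sample, 1, "Broken"))  # second run leaves the data untouched
print(len(sample))
```

In the notebook, the equivalent call would pass google_apps, 10472 and the app name reported by the error check.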


### 1.2.2 Duplicate Data Deletion

Besides errors related to incorrect data insertion, the datasets must be inspected for possible duplicate entries of the same app before the statistical analysis is performed. Duplicates can appear as complete copies of the same app, with all properties holding exactly the same values, or as entries that differ in some properties. In this context, the differences may result from updates to properties that change over time, like the rating of an app or its number of downloads, which were inserted into the dataset without first deleting the outdated data.

With that said, the first step is to search both datasets for duplicate entries.

In [8]:
def search_duplicates(dataset, dataset_flag, all_duplicates = False, example_duplicates = True):
    
    if dataset_flag == "google":
        name_app = "App"
    elif dataset_flag == "apple":
        name_app = "track_name"
        
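    unique_apps = set()  # names already seen (a set gives fast membership tests)
    duplicate_apps = []
    # enumerate() gives the true row number, even for rows with identical data
    for row, app in enumerate(dataset): 
        if app[name_app] in unique_apps:
            duplicate_apps.append((app[name_app], row))
        else:
            unique_apps.add(app[name_app])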
    
    if len(duplicate_apps) != 0:
        print("This dataset has duplicate entries")
        
        if all_duplicates:
            print("\nList of duplicates tuples in the form of (name_duplicate_app, row_dataset)")
            for app in duplicate_apps:
                print(app)
        
        if example_duplicates:
            print("\nDuplicate entries example:\n")
            num_duplicate = 0
            duplicate = duplicate_apps[num_duplicate][0]
            for app in dataset:
                if app[name_app] == duplicate:
                    print(app)
                    print("\n")
    else:
        print("This dataset doesn't have duplicate entries")
    
    return duplicate_apps

In [9]:
print("- Apple Store Apps dataset:")
apple_duplicates = search_duplicates(apple_apps, "apple")

print("\n- Google Play Store Apps dataset:")
google_duplicates = search_duplicates(google_apps, "google")

- Apple Store Apps dataset:
This dataset has duplicate entries

Duplicate entries example:

{'rating_count_ver': '87', 'cont_rating': '9+', 'ver': '1.4', 'sup_devices.num': '37', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '1', 'user_rating_ver': '3.0', 'rating_count_tot': '668', 'user_rating': '3.0', 'currency': 'USD', 'track_name': 'Mannequin Challenge', 'ipadSc_urls.num': '4', 'size_bytes': '109705216', 'vpp_lic': '1', 'id': '1173990889'}


{'rating_count_ver': '58', 'cont_rating': '4+', 'ver': '1.0.1', 'sup_devices.num': '38', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '1', 'user_rating_ver': '4.5', 'rating_count_tot': '105', 'user_rating': '4.0', 'currency': 'USD', 'track_name': 'Mannequin Challenge', 'ipadSc_urls.num': '5', 'size_bytes': '59572224', 'vpp_lic': '1', 'id': '1178454060'}



- Google Play Store Apps dataset:
This dataset has duplicate entries

Duplicate entries example:

{'Category': 'BUSINESS', 'Rating': '4.2', 'Genres': 'Business', 'Content Rating'

The number of duplicates can now be obtained by counting the elements of the duplicate list generated by the previous algorithm.

In [10]:
print("- Number of duplicates on the Apple Store Apps dataset:" + str(len(apple_duplicates)))
print("\n- Number of duplicates on the Google Play Store Apps dataset:" + str(len(google_duplicates)))

- Number of duplicates on the Apple Store Apps dataset:2

- Number of duplicates on the Google Play Store Apps dataset:1181
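
When only the counts are needed, the same number can be obtained with collections.Counter from the standard library. This is a hedged alternative sketch: `count_duplicates` and the `sample` rows are hypothetical and not part of the original notebook.

```python
from collections import Counter


def count_duplicates(dataset, name_key):
    """Count the entries that repeat an app name already present in the dataset."""
    name_counts = Counter(app[name_key] for app in dataset)
    # Every occurrence beyond the first one is a duplicate.
    return sum(count - 1 for count in name_counts.values())


# Hypothetical sample rows.
sample = [{"App": "Instagram"}, {"App": "Slack"}, {"App": "Instagram"}, {"App": "Instagram"}]
print(count_duplicates(sample, "App"))
```

For the real data, the calls would be count_duplicates(apple_apps, "track_name") and count_duplicates(google_apps, "App").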


The duplicate entries must therefore be removed, keeping only one entry per app. One option would be to remove the duplicate rows randomly, but there is probably a better way.

Comparing the duplicate rows of a given app (Instagram, for instance), the main difference lies in the number of reviews. The different numbers show that the data was collected at different times.

This information provides a criterion for removing the duplicates: the higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, only the row with the highest number of reviews is kept for any given app, and the other entries are removed.

Criterion:

GOOGLE
- category, genres and content_rating stay the same; this can be taken as an assumption;
- if the rating changes, the number of reviews has changed too; that number increases incrementally over time, so the higher it is, the more recent the information;
- installs may change, probably also associated with an increase in reviews;
- android_ver, current_ver and size may change, but they depend on last_updated, which is also incremental;
- type and price: unknown.

APPLE
- rating_count_ver, user_rating_ver and user_rating influence rating_count_tot, which is incremental;
- ...
- cont_rating, sup_devices.num, prime_genre, ver, price, lang.num, currency, ipadSc_urls.num, size_bytes, vpp_lic, id ???

In [11]:
def max_review(dataset, dataset_flag):
    
    if dataset_flag == "google":
        reviews = "Reviews" 
        name_app = "App"
    elif dataset_flag == "apple":
        reviews = "rating_count_tot"
        name_app = "track_name"
        
    reviews_max = {}
    
    for app in dataset:
        name = app[name_app]
        n_reviews = int(app[reviews])
        
        if name in reviews_max and (reviews_max[name] <= n_reviews):
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
            
    return reviews_max

In [12]:
def delete_duplicates(dataset, dataset_flag):
    if dataset_flag == "google":
        reviews = "Reviews"
        name_app = "App"
        
    elif dataset_flag == "apple":
        reviews = "rating_count_tot"
        name_app = "track_name"
    
    reviews_max = max_review(dataset, dataset_flag)
    
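    clean_data = []
    added_data = set()  # names already kept, so ties on the review count are added only once
    
    for app in dataset:
        name = app[name_app]
        n_reviews = int(app[reviews])
        
        # keep the first row that reaches the maximum number of reviews for this app
        if (n_reviews == reviews_max[name]) and (name not in added_data):
            clean_data.append(app)
            added_data.add(name)
            
    return clean_data 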
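
The two functions above can also be condensed into a single pass over the data. The sketch below is an alternative under stated assumptions, not the notebook's method: `keep_most_reviewed` and the `sample` rows are hypothetical, and the kept rows come out in the order in which each app name first appears rather than at the position of the kept row.

```python
def keep_most_reviewed(dataset, name_key, reviews_key):
    """Keep, for every app name, the row with the highest number of reviews."""
    best = {}
    for app in dataset:
        name = app[name_key]
        # A strict '>' keeps the first row when two rows tie on the review count.
        if name not in best or int(app[reviews_key]) > int(best[name][reviews_key]):
            best[name] = app
    return list(best.values())


# Hypothetical sample rows.
sample = [
    {"App": "Instagram", "Reviews": "10"},
    {"App": "Slack", "Reviews": "5"},
    {"App": "Instagram", "Reviews": "20"},
]
for app in keep_most_reviewed(sample, "App", "Reviews"):
    print(app)
```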

In [13]:
apple_clean_data = delete_duplicates(apple_apps, "apple")
print("- Apple Store Apps cleaned dataset:")
explore_data(apple_clean_data, 0, 3, additional_info = True)

google_clean_data = delete_duplicates(google_apps, "google")
print("\n- Google Play Store Apps cleaned dataset:")
explore_data(google_clean_data, 0, 3, additional_info = True)

- Apple Store Apps cleaned dataset:
{'rating_count_ver': '212', 'cont_rating': '4+', 'ver': '95.0', 'sup_devices.num': '37', 'prime_genre': 'Social Networking', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '3.5', 'rating_count_tot': '2974676', 'user_rating': '3.5', 'currency': 'USD', 'track_name': 'Facebook', 'ipadSc_urls.num': '1', 'size_bytes': '389879808', 'vpp_lic': '1', 'id': '284882215'}


{'rating_count_ver': '1289', 'cont_rating': '12+', 'ver': '10.23', 'sup_devices.num': '37', 'prime_genre': 'Photo & Video', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '4.0', 'rating_count_tot': '2161558', 'user_rating': '4.5', 'currency': 'USD', 'track_name': 'Instagram', 'ipadSc_urls.num': '0', 'size_bytes': '113954816', 'vpp_lic': '1', 'id': '389801252'}


{'rating_count_ver': '579', 'cont_rating': '9+', 'ver': '9.24.12', 'sup_devices.num': '38', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '18', 'user_rating_ver': '4.5', 'rating_count_tot': '2130805', 'user_rating': 

### Removing Non-English Apps

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

If an app name contains a character whose number is greater than 127, the app name is probably non-English, since it contains a character that doesn't belong to the set of common English characters.
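
The character numbers quoted above can be checked with a few lines. This sketch assumes Python 3, where every element of a string is a full Unicode character; the name "Instagram" is only an example.

```python
# Numbers associated with a few characters.
for character in ["a", "A", "爱"]:
    print(character, ord(character))

# A name is plain ASCII when every character maps to a number up to 127.
name = "Instagram"
print(all(ord(character) <= 127 for character in name))
```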

In [14]:
def inspect_english(string): 
    for letter in string:
        if ord(letter) > 127:
            return False
    
    return True

In [15]:
print(inspect_english('Docs To Go™ Free Office Suite'))
print(inspect_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False


TypeError: ord() expected a character, but string of length 3 found

The function written above detects non-English app names, but it can't correctly identify certain English app names, such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜'.

This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

If we use the function as it is, we'll lose useful data, since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

Let's edit the function created above, and then use it to filter out the non-English apps.

In [16]:
def inspect_english(string, max_special_chr):
    num_special_chr = 0 
    
    for letter in string:
        if ord(letter) > 127:
            num_special_chr += 1
    
    if num_special_chr > max_special_chr:
        return False
    
    return True

In [17]:
print(inspect_english('Docs To Go™ Free Office Suite', 5))
print(inspect_english('Instachat 😜', 5))

print(ord('™'))
print(ord('😜'))

True
True


TypeError: ord() expected a character, but string of length 3 found

In [18]:
def english_apps(dataset, dataset_flag, max_special_chr):
    if dataset_flag == "google":
        name_app = "App"
        
    elif dataset_flag == "apple":
        name_app = "track_name"
    
    english_list = []
    for app in dataset:
        name = app[name_app]
        
        if inspect_english(name, max_special_chr):
            english_list.append(app)
            
    return english_list 

In [19]:
total_apple = len(apple_clean_data)
apple_english = english_apps(apple_clean_data, "apple", 3)
len_apple_english = len(apple_english)
print("- For the Apple dataset, " + str(total_apple - len_apple_english) + " apps classified as non-English were removed.\n")
explore_data(apple_english, 0, 3, True)

total_google = len(google_clean_data)
google_english = english_apps(google_clean_data, "google", 3)
len_google_english = len(google_english)
print("\n- For the Google dataset, " + str(total_google - len_google_english) + " apps classified as non-English were removed.\n")
explore_data(google_english, 0, 3, True)

- For the Apple dataset, 1097 apps classified as non-English were removed.

{'rating_count_ver': '212', 'cont_rating': '4+', 'ver': '95.0', 'sup_devices.num': '37', 'prime_genre': 'Social Networking', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '3.5', 'rating_count_tot': '2974676', 'user_rating': '3.5', 'currency': 'USD', 'track_name': 'Facebook', 'ipadSc_urls.num': '1', 'size_bytes': '389879808', 'vpp_lic': '1', 'id': '284882215'}


{'rating_count_ver': '1289', 'cont_rating': '12+', 'ver': '10.23', 'sup_devices.num': '37', 'prime_genre': 'Photo & Video', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '4.0', 'rating_count_tot': '2161558', 'user_rating': '4.5', 'currency': 'USD', 'track_name': 'Instagram', 'ipadSc_urls.num': '0', 'size_bytes': '113954816', 'vpp_lic': '1', 'id': '389801252'}


{'rating_count_ver': '579', 'cont_rating': '9+', 'ver': '9.24.12', 'sup_devices.num': '38', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '18', 'user_rating_ver': '4.5', 'rati

### Isolating the Free Apps


As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [20]:
def free_apps(dataset, dataset_flag):
    if dataset_flag == "google":
        price = "Price"
        free = "0"
        
    elif dataset_flag == "apple":
        price = "price"
        free = "0.0"
        
    # named free_list so that it does not shadow the free_apps() function itself
    free_list = []
    for app in dataset:
        price_app = app[price]
        
        if price_app == free:
            free_list.append(app)
     
    return free_list
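
Comparing price strings works for the exact values '0' and '0.0', but it would miss a variant such as '0.00'. A hedged alternative is to compare the numeric value instead. The `is_free` helper below is hypothetical, and it assumes that paid prices are written either as a plain number or as a number prefixed with '$'.

```python
def is_free(price_text):
    """Return True when a price string represents zero (hypothetical helper)."""
    # Assumption: prices look like '0', '0.0', '4.99' or '$4.99'.
    return float(price_text.strip().lstrip("$")) == 0.0


for price in ["0", "0.0", "0.00", "$4.99"]:
    print(price, is_free(price))
```

With such a helper, the same test would serve both datasets, at the cost of failing loudly on any price format not covered by the assumption.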

In [21]:
apple_free_apps = free_apps(apple_english, "apple")
len_apple_free = len(apple_free_apps)
print("- For the Apple dataset, " + str(len_apple_english - len_apple_free) + " apps classified as non-free were removed.\n")
explore_data(apple_free_apps, 0, 3, True)

google_free_apps = free_apps(google_english, "google")
len_google_free = len(google_free_apps)
print("- For the Google dataset, " + str(len_google_english - len_google_free) + " apps classified as non-free were removed.\n")
explore_data(google_free_apps, 0, 3, True)

- For the Apple dataset, 2931 apps classified as non-free were removed.

{'rating_count_ver': '212', 'cont_rating': '4+', 'ver': '95.0', 'sup_devices.num': '37', 'prime_genre': 'Social Networking', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '3.5', 'rating_count_tot': '2974676', 'user_rating': '3.5', 'currency': 'USD', 'track_name': 'Facebook', 'ipadSc_urls.num': '1', 'size_bytes': '389879808', 'vpp_lic': '1', 'id': '284882215'}


{'rating_count_ver': '1289', 'cont_rating': '12+', 'ver': '10.23', 'sup_devices.num': '37', 'prime_genre': 'Photo & Video', 'price': '0.0', 'lang.num': '29', 'user_rating_ver': '4.0', 'rating_count_tot': '2161558', 'user_rating': '4.5', 'currency': 'USD', 'track_name': 'Instagram', 'ipadSc_urls.num': '0', 'size_bytes': '113954816', 'vpp_lic': '1', 'id': '389801252'}


{'rating_count_ver': '579', 'cont_rating': '9+', 'ver': '9.24.12', 'sup_devices.num': '38', 'prime_genre': 'Games', 'price': '0.0', 'lang.num': '18', 'user_rating_ver': '4.5', 'rating_

## 1.3 Data Analysis

### Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. 

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

Recall the validation strategy for an app idea outlined in the introduction: a minimal Android version is added to Google Play first, it is developed further if users respond well, and an iOS version is built for the App Store only if the app is profitable after six months. An app profile therefore has to fit both the App Store and Google Play.

Inspecting the datasets, the columns that might be useful for finding out what the most common genres in each market are:
- prime_genre for the App Store dataset;
- Genres (an app can belong to multiple genres, apart from its main category) and Category (the category the app belongs to) for the Google Play dataset.

Our conclusion is that we'll need to build a frequency table for the prime_genre column of the App Store dataset, and for the Genres and Category columns of the Google Play dataset.

We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in descending order

We already know how to generate frequency tables that show percentages, and we'll build a function for that below. However, a dictionary is not sorted by its values, which makes the frequency tables very difficult to analyze. We'll need a second function to display the entries of the frequency table in descending order.


To do that, we'll make use of the built-in sorted() function. This function takes an iterable (like a list, dictionary, tuple, etc.) and returns a new list with the elements of that iterable sorted in ascending or descending order (the reverse parameter controls whether the order is ascending or descending).

Passing a dictionary directly to sorted() isn't useful here, because iterating over a dictionary yields only its keys, so sorted() considers and returns just the keys.

However, the sorted() function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. One option is to put the dictionary value first and the dictionary key second, because tuples are compared element by element. The freq_table() function below instead keeps the (key, value) pairs returned by items() and tells sorted() to compare on the value through the key parameter.
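As a minimal sketch of the difference between these approaches, consider a small made-up dictionary (the `sample` name and its values are illustrative only and are not taken from the data sets):

```python
# Hypothetical frequency table, for illustration only.
sample = {"Games": 58.5, "Education": 3.7, "Music": 2.1}

# Sorting the dictionary directly only looks at (and returns) the keys.
keys_only = sorted(sample)

# Option 1: build (value, key) tuples so that the value drives the ordering.
by_value_first = sorted([(value, key) for key, value in sample.items()], reverse=True)

# Option 2: keep the (key, value) pairs and sort on the value through `key=`.
by_key_func = sorted(sample.items(), key=lambda item: item[1], reverse=True)

print(keys_only)
print(by_value_first)
print(by_key_func)
```

Both options give the same ranking. The second one keeps each entry in (key, value) order, which is convenient when the entries are printed afterwards.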

In [22]:
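def display(freq_table):
    """Print each (key, value) pair of an already sorted table on its own line."""
    for key, value in freq_table:
        print(key + " | " + str(value))

def freq_table(dataset, propritie, descending = True, display_table = True):
    """Return the percentage frequency of each value found in column `propritie`
    as a list of (value, percentage) tuples sorted by percentage."""
    table = {}

    # Count how many apps share each value of the chosen column
    for app in dataset:
        key_propritie = app[propritie]

        if key_propritie in table:
            table[key_propritie] += 1.0
        else:
            table[key_propritie] = 1.0

    # Convert the counts to percentages of the whole data set
    total = len(dataset)
    for key in table:
        table[key] = (table[key] * 100) / total

    # Sort the (key, value) pairs by value, i.e. by percentage
    sorted_table = sorted(table.items(), key = lambda item: item[1], reverse = descending)

    if display_table:
        display(sorted_table)

    return sorted_table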

### Most Common Apps by Genre: Part Three

Remember our data set only contains free English apps, so you should be careful not to extend your conclusions beyond that scope. If you find that gaming apps are the most numerous among the free English apps on Google Play, it doesn't mean we'll see the same pattern on Google Play as a whole.

In [26]:
print("\nPrime Genre frequency table of the apple dataset:")
prime_genre_freq_table = freq_table(apple_free_apps, "prime_genre")


Prime Genre frequency table of the apple dataset:
Games | 58.5096305652
Entertainment | 7.8307546574
Photo & Video | 5.05209977897
Education | 3.72592358699
Social Networking | 3.28386485633
Shopping | 2.52604988949
Utilities | 2.39974739501
Sports | 2.17871802968
Music | 2.05241553521
Health & Fitness | 1.98926428797
Productivity | 1.7050836754
Lifestyle | 1.54720555731
News | 1.32617619198
Travel | 1.13672245027
Finance | 1.10514682665
Weather | 0.852541837701
Food & Drink | 0.820966214083
Business | 0.536785601516
Reference | 0.536785601516
Book | 0.378907483423
Medical | 0.189453741711
Navigation | 0.189453741711
Catalogs | 0.126302494474


Analyze the frequency table you generated for the prime_genre column of the App Store data set.
- What is the most common genre? What is the runner-up?
- What other patterns do you see?
- What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
- Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?

In [23]:
print("Category frequency table of the google dataset:") 
category_freq_table = freq_table(google_free_apps, "Category")

print("\nGenres frequency table of the google dataset:") 
genres_freq_table = freq_table(google_free_apps, "Genres") 

Category frequency table of the google dataset:
FAMILY | 18.9383561644
GAME | 9.65753424658
TOOLS | 8.48173515982
BUSINESS | 4.64611872146
PRODUCTIVITY | 3.93835616438
LIFESTYLE | 3.91552511416
FINANCE | 3.72146118721
MEDICAL | 3.5502283105
SPORTS | 3.33333333333
PERSONALIZATION | 3.28767123288
COMMUNICATION | 3.25342465753
HEALTH_AND_FITNESS | 3.09360730594
PHOTOGRAPHY | 2.97945205479
NEWS_AND_MAGAZINES | 2.80821917808
SOCIAL | 2.64840182648
TRAVEL_AND_LOCAL | 2.3401826484
SHOPPING | 2.24885844749
BOOKS_AND_REFERENCE | 2.14611872146
DATING | 1.86073059361
VIDEO_PLAYERS | 1.80365296804
MAPS_AND_NAVIGATION | 1.38127853881
FOOD_AND_DRINK | 1.23287671233
EDUCATION | 1.17579908676
ENTERTAINMENT | 0.958904109589
AUTO_AND_VEHICLES | 0.924657534247
LIBRARIES_AND_DEMO | 0.901826484018
HOUSE_AND_HOME | 0.787671232877
WEATHER | 0.787671232877
EVENTS | 0.719178082192
ART_AND_DESIGN | 0.650684931507
PARENTING | 0.639269406393
BEAUTY | 0.60502283105
COMICS | 0.582191780822

Genres frequency table o

Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.
- What are the most common genres?
- What other patterns do you see?
- Compare the patterns you see for the Google Play market with those you saw for the App Store market.
- Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?

The frequency tables we analyzed above showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

### Most Popular Apps by Genre on the App Store

Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:
- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).
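
The three steps above can also be carried out in a single pass over the data. The following is a minimal sketch, assuming each app is a dictionary with `prime_genre` and `rating_count_tot` keys, as in the App Store rows shown earlier. The `average_by_genre` helper and the toy rows are illustrative and are not part of the notebook's own code.

```python
def average_by_genre(apps, genre_column, value_column):
    """Return {genre: average of value_column} for a list of app dictionaries."""
    totals = {}   # genre -> sum of the values seen so far
    counts = {}   # genre -> number of apps seen so far
    for app in apps:
        genre = app[genre_column]
        totals[genre] = totals.get(genre, 0.0) + float(app[value_column])
        counts[genre] = counts.get(genre, 0) + 1
    # Divide by the number of apps in each genre, not by the total number of apps.
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Made-up rows, for illustration only.
toy_apps = [
    {"prime_genre": "Games", "rating_count_tot": "100"},
    {"prime_genre": "Games", "rating_count_tot": "300"},
    {"prime_genre": "Music", "rating_count_tot": "50"},
]

averages = average_by_genre(toy_apps, "prime_genre", "rating_count_tot")
print(averages)
```

Here the genre is read from one named column, so an app is only counted for the genre stored in that column.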

In [41]:
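def genres_popular_average(dataset, freq_genre_table, propritie, descending = True, display_table = True):
    """Average the numeric column `propritie` over the apps of each genre
    listed in `freq_genre_table`, a list of (genre, percentage) tuples."""
    averages = {}

    for genre, _ in freq_genre_table:
        popularity = 0.0        # running sum of the column for this genre
        total_num_genre = 0.0   # number of apps that belong to this genre

        for app in dataset:
            # An app counts for the genre when any of its column values equals the genre
            if genre not in app.values():
                continue

            if propritie == "Installs":
                # Strip "+" and "," so that values like "1,000,000+" convert to float
                value = app[propritie].replace("+", "").replace(",", "")
            else:
                value = app[propritie]

            popularity += float(value)
            total_num_genre += 1

        # Divide by the number of apps of this genre, not by the total number of apps
        averages[genre] = popularity / total_num_genre

    sorted_table = sorted(averages.items(), key = lambda item: item[1], reverse = descending)

    if display_table:
        display(sorted_table)

    return sorted_table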

In [36]:
print("- Popularity of prime_genre on apple dataset:")
genres_popular_avg_apple = genres_popular_average(apple_free_apps, prime_genre_freq_table, "rating_count_tot")

- Popularity of prime_genre on apple dataset:
Navigation | 86090.3333333
Reference | 79350.4705882
Social Networking | 72916.5480769
Music | 58205.0307692
Weather | 54215.2962963
Book | 46384.9166667
Food & Drink | 33333.9230769
Finance | 32367.0285714
Travel | 31358.5
Photo & Video | 28441.54375
Shopping | 27816.2
Health & Fitness | 24037.6349206
Games | 23009.9271452
Sports | 23008.8985507
Productivity | 21799.1481481
News | 21750.0714286
Utilities | 19900.4736842
Lifestyle | 16739.3469388
Entertainment | 14364.7741935
Business | 7491.11764706
Education | 7003.98305085
Catalogs | 4004.0
Medical | 612.0


Analyze the results and try to come up with at least one app profile recommendation for the App Store based on the number of user ratings.

In [45]:
def popularity_black_swans(dataset, dataset_flag, genres_popular_avg, max_analyzed):
    """For each genre, print the `max_analyzed` apps with the highest popularity
    value (installs for Google Play, total rating count for the App Store)."""
    if dataset_flag == "google":
        name = "App"
        propritie = "Installs"
    elif dataset_flag == "apple":
        name = "track_name"
        propritie = "rating_count_tot"
    else:
        raise ValueError("dataset_flag must be 'google' or 'apple'")

    for genre, _ in genres_popular_avg:
        apps_genre = {}   # app name -> popularity value

        for app in dataset:
            if genre not in app.values():
                continue

            app_name = app[name]

            if propritie == "Installs":
                # Strip "+" and "," so that values like "1,000,000+" convert to int
                value = app[propritie].replace("+", "").replace(",", "")
            else:
                value = app[propritie]

            apps_genre[app_name] = int(value)

        print("- For the genre \'" + genre + "\', the first " + str(max_analyzed) + " most popular apps have the following engangement on rating counting:")
        sorted_table = sorted(apps_genre.items(), key = lambda item: item[1], reverse = True)

        display(sorted_table[:max_analyzed])
        print("\n")

In [46]:
popularity_black_swans(apple_free_apps, "apple", genres_popular_avg_apple, 5)

- For the genre 'Navigation', the first 5 most popular apps have the following engangement on rating counting:
Waze - GPS Navigation, Maps & Real-time Traffic | 345046
Google Maps - Navigation & Transit | 154911
Geocaching® | 12811
CoPilot GPS – Car Navigation & Offline Maps | 3582
ImmobilienScout24: Real Estate Search in Germany | 187


- For the genre 'Reference', the first 5 most popular apps have the following engangement on rating counting:
Bible | 985920
Dictionary.com Dictionary & Thesaurus | 200047
Dictionary.com Dictionary & Thesaurus for iPad | 54175
Google Translate | 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran | 18418


- For the genre 'Social Networking', the first 5 most popular apps have the following engangement on rating counting:
Facebook | 2974676
Pinterest | 1061624
Skype for iPhone | 373519
Messenger | 351466
Tumblr | 334293


- For the genre 'Music', the first 5 most popular apps have the following engangement on rating counting:
Pandora - Music & Rad

### Most Popular Apps by Genre on Google Play

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.).
This is not a problem for our purposes, since we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

In [47]:
print("- Installs frequency table on google dataset:")
installs_freq_table = freq_table(google_free_apps, "Installs")

- Installs frequency table on google dataset:
1,000,000+ | 15.7420091324
100,000+ | 11.5182648402
10,000,000+ | 10.6050228311
10,000+ | 10.2054794521
1,000+ | 8.36757990868
100+ | 6.95205479452
5,000,000+ | 6.87214611872
500,000+ | 5.54794520548
50,000+ | 4.77168949772
5,000+ | 4.48630136986
10+ | 3.51598173516
500+ | 3.20776255708
50,000,000+ | 2.28310502283
100,000,000+ | 2.13470319635
50+ | 1.92922374429
5+ | 0.787671232877
1+ | 0.513698630137
500,000,000+ | 0.27397260274
1,000,000,000+ | 0.228310502283
0+ | 0.0456621004566
0 | 0.0114155251142


We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
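
A minimal sketch of that conversion is shown below. The `installs_to_float` helper name is hypothetical, and the labels are a few of the values that appear in the Installs frequency table above.

```python
def installs_to_float(installs):
    """Convert an install label such as '1,000,000+' to a float."""
    cleaned = installs.replace(",", "").replace("+", "")
    return float(cleaned)

labels = ["100,000+", "1,000,000+", "0+", "0"]
converted = [installs_to_float(label) for label in labels]
print(converted)  # [100000.0, 1000000.0, 0.0, 0.0]
```

Calling float() directly on a label such as "1,000+" raises a ValueError, which is why both characters are removed first.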

In [44]:
print("- Popularity of Category on google dataset:")
category_popular_avg_google = genres_popular_average(google_free_apps, category_freq_table, "Installs")

print("\n- Popularity of Genres on google dataset:")
genres_popular_avg_google = genres_popular_average(google_free_apps, genres_freq_table, "Installs")

- Popularity of Category on google dataset:
COMMUNICATION | 38550548.0386
VIDEO_PLAYERS | 24878048.8608
SOCIAL | 23628689.2328
PHOTOGRAPHY | 17840110.4023
PRODUCTIVITY | 16787331.3449
GAME | 15571586.6903
TRAVEL_AND_LOCAL | 14120454.078
ENTERTAINMENT | 11767380.9524
TOOLS | 10902378.8345
NEWS_AND_MAGAZINES | 9626407.35772
BOOKS_AND_REFERENCE | 8329168.93617
SHOPPING | 7103190.7868
PERSONALIZATION | 5240358.98611
WEATHER | 5212877.10145
HEALTH_AND_FITNESS | 4219697.05535
MAPS_AND_NAVIGATION | 4115374.21488
SPORTS | 3750580.64384
FAMILY | 3716053.75527
ART_AND_DESIGN | 1986335.08772
FOOD_AND_DRINK | 1951283.80556
EDUCATION | 1833495.14563
BUSINESS | 1712290.14742
LIFESTYLE | 1447458.97668
HOUSE_AND_HOME | 1385541.46377
FINANCE | 1365500.40491
DATING | 861409.552147
COMICS | 859042.156863
AUTO_AND_VEHICLES | 654074.82716
LIBRARIES_AND_DEMO | 649314.050633
PARENTING | 552875.178571
BEAUTY | 513151.886792
EVENTS | 253542.222222
MEDICAL | 121161.877814

- Popularity of Genres on google datas

Analyze the results and try to come up with at least one app profile recommendation for Google Play. Remember, our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [49]:
popularity_black_swans(google_free_apps, "google", category_popular_avg_google, 10)

- For the genre 'COMMUNICATION', the first 10 most popular apps have the following engangement on rating counting:
Skype - free IM & video calls | 1000000000
Gmail | 1000000000
Google Chrome: Fast & Secure | 1000000000
WhatsApp Messenger | 1000000000
Hangouts | 1000000000
Messenger – Text and Video Chat for Free | 1000000000
LINE: Free Calls & Messages | 500000000
UC Browser - Fast Download Private & Secure | 500000000
imo free video calls and chat | 500000000
Viber Messenger | 500000000


- For the genre 'VIDEO_PLAYERS', the first 10 most popular apps have the following engangement on rating counting:
Google Play Movies & TV | 1000000000
YouTube | 1000000000
MX Player | 500000000
VLC for Android | 100000000
VivaVideo - Video Editor & Photo Movie | 100000000
Motorola Gallery | 100000000
VideoShow-Video Editor, Video Maker, Beauty Camera | 100000000
Dubsmash | 100000000
Motorola FM Radio | 100000000
Vigo Video | 50000000


- For the genre 'SOCIAL', the first 10 most popular apps have th

- For the genre 'HOUSE_AND_HOME', the first 10 most popular apps have the following engangement on rating counting:
Realtor.com Real Estate: Homes for Sale and Rent | 10000000
tinyCam Monitor FREE | 10000000
Zillow: Find Houses for Sale & Apartments for Rent | 10000000
Trulia Real Estate & Rentals | 10000000
Houzz Interior Design Ideas | 10000000
Alfred Home Security Camera | 5000000
Trulia Rent Apartments & Homes | 5000000
Real Estate sale & rent Trovit | 5000000
DaBang - Rental Homes in Korea | 5000000
ColorSnap® Visualizer | 1000000


- For the genre 'FINANCE', the first 10 most popular apps have the following engangement on rating counting:
Google Pay | 100000000
PayPal | 50000000
Mobile Bancomer | 10000000
Cash App | 10000000
Bank of America Mobile Banking | 10000000
Credit Karma | 10000000
Wells Fargo Mobile | 10000000
HDFC Bank MobileBanking | 10000000
CASHIER | 10000000
K PLUS | 10000000


- For the genre 'DATING', the first 10 most popular apps have the following engangement o

The communication category has the highest average number of installs, but the list above shows that this average is driven by a handful of apps with 500 million or 1 billion installs each (Skype, Gmail, Google Chrome, WhatsApp Messenger, and so on). We see the same pattern for the video players category, which is the runner-up with an average of roughly 24,878,049 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
The main concern is that these app genres might seem more popular than they really are, because a few apps with hundreds of millions of installs inflate the average. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.
The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

TODO: try removing the dominant apps and see how far the average popularity drops.
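
One way to check how much a few giants skew a category's average is to recompute it while ignoring apps above some install threshold. The sketch below assumes Google Play rows stored as dictionaries with `Category` and `Installs` keys, as used elsewhere in this notebook. The `average_installs` helper, the toy rows, and the 100,000,000 cutoff are assumptions made for illustration and not results from the data set.

```python
def average_installs(apps, category, max_installs=None):
    """Average installs for one category, optionally ignoring apps whose
    install count is at or above `max_installs`."""
    total = 0.0
    count = 0
    for app in apps:
        if app["Category"] != category:
            continue
        installs = float(app["Installs"].replace(",", "").replace("+", ""))
        if max_installs is not None and installs >= max_installs:
            continue
        total += installs
        count += 1
    # Return 0.0 when no app of the category is left after filtering.
    return total / count if count else 0.0

# Made-up rows that mimic the Google Play columns, for illustration only.
toy_apps = [
    {"App": "Giant Chat", "Category": "COMMUNICATION", "Installs": "1,000,000,000+"},
    {"App": "Small Chat", "Category": "COMMUNICATION", "Installs": "1,000,000+"},
    {"App": "Tiny Chat", "Category": "COMMUNICATION", "Installs": "500,000+"},
    {"App": "Some Game", "Category": "GAME", "Installs": "10,000,000+"},
]

with_giants = average_installs(toy_apps, "COMMUNICATION")
without_giants = average_installs(toy_apps, "COMMUNICATION", max_installs=100000000)
print(with_giants, without_giants)
```

Comparing the two averages for each category of the real data set would show which categories stay popular once their largest apps are set aside.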


The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average.
However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):