# Profitable App Profile Recommendation

In this project we analyze the data available for apps on Google Play and the App Store, the two markets our company develops for, to understand user engagement. As all of our apps are free, revenue is generated through in-app ads. The more users interact with the ads, the more profitable the apps become.

The main goal of the project is to identify the kinds of apps that are likely to generate the most revenue. In other words, the goal is to help the developers understand what types of apps are likely to attract more users on Google Play and the App Store.

# Opening and Exploring the datasets

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals:

A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Now that the data is available to us, we can open the datasets and start exploring them.

In [1]:
from csv import reader
opened_file_google = open('dataset/googleplaystore.csv', encoding="utf8")  # forward slash avoids the invalid '\g' escape
read_file_google = reader(opened_file_google)
apps_data_google = list(read_file_google)
google_header = apps_data_google[0]
google_data = apps_data_google[1:]

opened_file_apple = open('dataset/AppleStore.csv', encoding="utf8")  # forward slash avoids the invalid '\A' escape
read_file_apple = reader(opened_file_apple)
apps_data_apple = list(read_file_apple)
apple_header = apps_data_apple[0]
apple_data = apps_data_apple[1:]
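
The two loading cells follow the same pattern, so they could be folded into one helper. The following is a minimal sketch rather than part of the original analysis: `load_csv` is a hypothetical name, and the commented paths assume the same `dataset` folder used above. Opening the file in a `with` block closes it once the rows have been read.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path`."""
    # newline="" is the setting the csv module documentation recommends
    # for files that are handed to reader().
    with open(path, encoding="utf8", newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Hypothetical usage with the files from the cells above:
# google_header, google_data = load_csv('dataset/googleplaystore.csv')
# apple_header, apple_data = load_csv('dataset/AppleStore.csv')
```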

To make the datasets easy to explore, we define a function `explore_data()` below. It takes the dataset to explore along with the start and end row indexes of the slice to print. When `rows_and_columns` is `True`, it also prints the total number of rows and columns in the dataset. We can call this function repeatedly while keeping the code readable.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We are ready to use the `explore_data()` function to start exploring the data. Below we print the first 3 rows of each dataset: first the Google Play Store data, then the Apple Store data.

As can be seen below, the Google Play Store dataset has **10842** rows in total, including the header row (which holds the column names), and **13** columns. The output reports 10841 rows because `google_data` excludes the header row.

In [3]:
explore_data(google_data, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


On the other hand, the Apple Store dataset has **7198** rows including the header row (7197 without it, as reported below) and **16** columns.

In [4]:
explore_data(apple_data, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Let us now take a look at the column names (header row) for each dataset separately. 

Firstly, we look at the column names of the Google Play Store dataset. A detailed description of the columns is given in the [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

Columns that may be useful for our analysis include `'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Content Rating', 'Genres'`.

In [5]:
print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Now let us take a glance at the column names of the Apple Store dataset. A detailed description of the columns is given in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

From this set of columns, it looks like `'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'prime_genre'` are some of the columns that can be useful for our analysis.

In [6]:
print(apple_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


# Deleting wrong data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

The row in question is `google_data[10472]`. According to the discussion, this entry has a missing `'Rating'` value and the columns after it are shifted. Let's print this row together with the header row and a valid row so that we can see the problem more clearly.

In [7]:
print(google_header, '\n\n', google_data[10472], '\n\n', google_data[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


After inspecting entry **10472** more closely, it looks like the value for the `'Category'` column is missing, which shifted all the remaining values in the row one column to the left. This entry is clearly wrong, so we delete it below.

In [8]:
# deleting wrong entry for 10472

del google_data[10472]

Let's print the row that is now at index 10472 to verify that the entry was deleted from the dataset.

In [9]:
print(google_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
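
A shifted row like this one can also be located without reading the discussion board, because it has fewer values than the header. The sketch below is an illustration under stated assumptions: `find_malformed_rows` is a hypothetical helper and the toy rows are made up. Run on `google_data` before the deletion, a check like this should flag index 10472, since that row has 12 values against 13 header columns.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data (made up for illustration): the second row is missing one value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]
print(find_malformed_rows(toy_header, toy_rows))  # [1]
```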


# Removing Duplicate Data

Now we look for duplicate entries (multiple entries for the same app) in both datasets. If we find duplicates, we remove them and keep only the most recent entry for each app.

### Google Play Store

**First, we take a look at the *Google Play Store* dataset.** We can decide which entry is the most recent based on the number of user reviews, found in the `'Reviews'` column. Among all the duplicate entries for an app, we treat the one with the highest number of reviews as the most recent and use it in our analysis.

We can write a function `detect_uniques_and_duplicates()` that detects duplicate entries in a given dataset and stores the duplicated apps' names in a list named `duplicate_entries`. It also stores the unique apps' names in another list named `unique_entries`. The function takes the index of the column used to check for duplication.

Finally, we will get both lists (`duplicate_entries`, `unique_entries`) as output from the function `detect_uniques_and_duplicates()`.

In [10]:
def detect_uniques_and_duplicates(dataSet, col):
    unique_entries = []
    duplicate_entries = []
    
    for each_row in dataSet:
        if each_row[col] in unique_entries:
            duplicate_entries.append(each_row[col])
        else:
            unique_entries.append(each_row[col])
    
    return duplicate_entries, unique_entries
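
Checking `in unique_entries` scans a list, which gets slower as the list grows. The variant below is a sketch, not the version used in this analysis: it keeps a `set` alongside the lists for membership tests and is meant to return the same two lists. The function name and the toy rows are hypothetical.

```python
def detect_uniques_and_duplicates_fast(dataset, col):
    """Same outputs as detect_uniques_and_duplicates(), using a set for lookups."""
    seen = set()            # values met so far, for fast membership tests
    unique_entries = []     # first occurrence of each value, in original order
    duplicate_entries = []  # every repeated occurrence after the first
    for row in dataset:
        value = row[col]
        if value in seen:
            duplicate_entries.append(value)
        else:
            seen.add(value)
            unique_entries.append(value)
    return duplicate_entries, unique_entries

# Toy rows (made up): 'Box' appears three times, so it is reported twice.
toy = [['Box'], ['Slack'], ['Box'], ['Box']]
print(detect_uniques_and_duplicates_fast(toy, 0))
# (['Box', 'Box'], ['Box', 'Slack'])
```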

Below, we use the `detect_uniques_and_duplicates()` function on the Google Play Store dataset and find that it contains **1181** duplicate entries. Note that `0` is passed as the second argument because `'App'`, the first column (index `0`) of each row, is the column used to check for duplication.

Some of the duplicate apps' names in the Google Play store dataset are also shown below.

In [11]:
dupDataGoogle, uniqueDataGoogle = detect_uniques_and_duplicates(google_data, 0)
print("Total number of duplicate entries in the Google play store dataset: " + str(len(dupDataGoogle)))
print("\n")
print("Some of the apps that has duplicate entries in the Google play store dataset: ")
print("\n")
print(dupDataGoogle[:13])

Total number of duplicate entries in the Google play store dataset: 1181


Some of the apps that has duplicate entries in the Google play store dataset: 


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses']


As we have found **1181** duplicate entries, we can work out the expected number of unique entries by subtracting that count from the total number of rows. As can be seen below, the total number of unique entries should be **9659**.

In [12]:
expected_length = len(google_data)-len(dupDataGoogle)
print("Expected unique entries: ", expected_length)

Expected unique entries:  9659


To remove the duplicates, we will create a dictionary `reviews_max`, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app. Therefore, the length of this dictionary should be **9659**, as this is the number of expected unique entries. We can verify that by printing the length of `reviews_max`.

In [13]:
reviews_max = {}

for each_row in google_data:
    name = each_row[0]
    n_reviews = float(each_row[3]) # 3 is the column index for 'Reviews' column in the dataset
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print("Expected Length: ", expected_length)
print("Actual Length: ", len(reviews_max))

Expected Length:  9659
Actual Length:  9659


Now, we can use the information stored in the dictionary `reviews_max` and create a new data set `android_clean`, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews). In the code cell below:

- We start by initializing two empty lists, `android_clean` and `already_added`.
- We loop through the `google_data` data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (`each_row`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        - The name of the app is not already in the `already_added` list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.

In [None]:
android_clean = []
already_added = []

for each_row in google_data:
    name = each_row[0]
    n_reviews = float(each_row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(each_row)
        already_added.append(name)

Finally, we can verify the length (number of rows) of `android_clean`, which should match the length of `reviews_max` (**9659**), by exploring the dataset with the `explore_data()` function below.

In [None]:
explore_data(android_clean, 0, 5, True)
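
The two passes above (building `reviews_max`, then filtering) could also be combined into a single pass that stores the best row seen so far for each app. This is only a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, the toy rows are made up, and the result is ordered by the first appearance of each app name, which may differ from the row order of `android_clean`.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_col]
        n_reviews = float(row[reviews_col])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Toy rows (made up): name, category, rating, reviews.
toy = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '120'],
    ['Box', 'BUSINESS', '4.2', '120'],
]
for row in keep_most_reviewed(toy):
    print(row[0], row[3])
# Box 120
# Slack 50
```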

### Apple Store

We can now check for duplicate entries in the Apple Store dataset using the `detect_uniques_and_duplicates()` function, with the `track_name` column (index `1` of each row) used to check for duplication.

Below we get 2 apps that apparently have duplicate entries. However, from the [discussion here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409), it is evident that these are not duplicates; rather, they are different apps with the same name. Therefore, we do not have to remove any entry from this dataset.

In [None]:
dupDataApple, uniqueDataApple = detect_uniques_and_duplicates(apple_data, 1)
print("Apps with similar names but eventually not duplicates: ", "\n\n")
print(dupDataApple)

# Removing Non-English Apps

### Part One

Note, we use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience. We're not interested in keeping these apps, so we'll remove them.

In [None]:
print(android_clean[4412][0])
print(android_clean[7940][0])
print("\n")
print(apple_data[813][1])
print(apple_data[6731][1])

We can start by writing a function `detect_EnglisName()` that takes in a string and returns `False` if any character in the string doesn't belong to the set of characters commonly used in English text; otherwise it returns `True`.

We will use character codes to solve this. Each character in a string has a corresponding number, which we can get with the built-in `ord()` function. The characters commonly used in English text fall in the ASCII range, 0 to 127. Therefore, we can treat a character as non-English if its number is greater than 127.

In [None]:
def detect_EnglisName(appName):
    for each_char in appName:
        if ord(each_char) > 127:  # ord() never returns a negative number
            return False
    
    return True

We test our function `detect_EnglisName()` below with a few strings.

In [None]:
print(detect_EnglisName('Instagram'))
print(detect_EnglisName('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(detect_EnglisName('Docs To Go™ Free Office Suite'))
print(detect_EnglisName('Instachat 😜'))

It looks like the function couldn't correctly identify certain English app names, such as `'Docs To Go™ Free Office Suite'` and `'Instachat 😜'`. This is because emojis and characters like `™` fall outside the ASCII range and have corresponding numbers over 127.

In [None]:
print('™ :', ord('™'))
print('😜 :', ord('😜'))

### Part Two

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.

It is not a perfect solution, but it is better than the previous one. Let's edit the function `detect_EnglisName()` to fit the idea; we'll later use it to filter out the non-English apps from the datasets.

In [None]:
def detect_EnglisName(appName):
    count = 0
    for each_char in appName:
        if ord(each_char) > 127:
            count += 1
            if count > 3:
                return False
    
    return True

We can now use the modified function to check whether it is working properly. 

In [None]:
print(detect_EnglisName('Instagram'))
print(detect_EnglisName('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(detect_EnglisName('Docs To Go™ Free Office Suite'))
print(detect_EnglisName('Instachat 😜'))

We can now confirm that it is working properly and is ready to be used for filtering out non-English apps from both datasets.
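
The threshold rule can also be written more compactly by counting the non-ASCII characters first and comparing the count once. This is a sketch, not the function used in the rest of the analysis; `looks_english` and its `max_non_ascii` parameter are hypothetical names, with the default of 3 taken from the rule described above.

```python
def looks_english(app_name, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for ch in app_name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(looks_english('Instagram'))                      # True
print(looks_english('Docs To Go™ Free Office Suite'))  # True
print(looks_english('Instachat 😜'))                   # True
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

Unlike the loop version, this variant always scans the whole name instead of stopping at the fourth non-ASCII character, which should make little difference for short app names.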

So, below we create a function `removeNonEnglishApps()` that takes in the dataset and the index of the column which has the name of an app. It loops through the provided data set. If an app name is identified as English, the whole row for that app is appended to a separate list `english_apps`.

In [None]:
def removeNonEnglishApps(dataSet, col):
    english_apps = []
    for each_row in dataSet:
        if detect_EnglisName(each_row[col]):
            english_apps.append(each_row)
    
    return english_apps

### Part Three

Below we apply the function `removeNonEnglishApps()` to both the Google Play Store and Apple Store datasets to filter out any non-English apps.

Two clean datasets containing only English apps are now stored in `android_clean_English` for the Google Play Store and `ios_clean_English` for the Apple Store.

In [None]:
android_clean_English = removeNonEnglishApps(android_clean, 0)
ios_clean_English = removeNonEnglishApps(apple_data, 1)

After exploring `android_clean_English`, we can see that the number of rows has decreased. We can also find out exactly how many non-English apps we removed. The total number of non-English apps in the Google Play Store dataset was **62**.

In [None]:
explore_data(android_clean_English, 0, 3, True)
print("\n")
print("Total Non-English apps removed: ", len(android_clean)-len(android_clean_English))

After exploring `ios_clean_English`, we can see that the number of rows has also decreased here. We can also find out exactly how many non-English apps we removed. The total number of non-English apps in the Apple Store dataset was **1042**, significantly more than in the Google Play Store dataset.

In [None]:
explore_data(ios_clean_English, 0, 3, True)
print("\n")
print("Total Non-English apps removed: ", len(apple_data)-len(ios_clean_English))

# Removing non-Free apps

As we mentioned earlier, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

### Google Play Store

We first isolate the free apps from the so far cleaned Google Play Store dataset: `android_clean_English`. The `'Price'` column of the dataset contains information about the app - whether it is free or not. If it is `0`, then it is free. Otherwise, it is not.

In [None]:
print(google_header)
print("\n")
print(android_clean_English[0])

We can use this column index (`7` for each row) in the dataset to extract out the free apps in a new set of data: `android_free_clean`.

In [None]:
android_free_clean = []
for each_row in android_clean_English:
    if each_row[7] == "0":
        android_free_clean.append(each_row)

After the extraction, we can start exploring `android_free_clean`. We find that the length of the dataset has shrunk even more. So, finally, the number of free apps is **8864**.

In [None]:
explore_data(android_free_clean, 0, 3, True)

### Apple Store

We first isolate the free apps from the so far cleaned Apple Store dataset: `ios_clean_English`. The `'price'` column of the dataset contains information about the app - whether it is free or not. Unlike Google Play store data, this column has numerical value as the price. If it is more than `0`, then it is not a free app.

In [None]:
print(apple_header)
print("\n")
print(ios_clean_English[0])

We can use this column index (`4` for each row) in the dataset to extract out the free apps in a new set of data: `ios_free_clean`.

In [None]:
ios_free_clean = []
for each_row in ios_clean_English:
    if float(each_row[4]) <= 0:
        ios_free_clean.append(each_row)

After the extraction, we can start exploring `ios_free_clean`. We find that the length of the dataset has shrunk even more. So, finally, the number of free apps is **3222**.

In [None]:
explore_data(ios_free_clean, 0, 3, True)
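
The two filters above compare the price in different ways (a string comparison with `"0"` for Google Play, a numeric comparison for the Apple Store). A single helper could cover both. The sketch below is an illustration only: `is_free` is a hypothetical name, and it assumes that paid prices may carry a leading `'$'`, which `float()` cannot parse on its own.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$4.99' means free."""
    # Assumption: paid prices may carry a leading '$'; strip it before converting.
    return float(price_text.strip().lstrip('$')) == 0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```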

# Finding App Profile

### Part One

As we mentioned earlier, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

We can start the analysis by getting an idea of the most common app genres in both the Android and iOS markets. For this, we'll need to build frequency tables for a few columns in our data sets.

Let's again take a look at both data sets' columns below and decide which columns we should take into account when building the frequency table for each dataset.

We will use the `'Genres'` and `'Category'` columns to generate the Google Play Store frequency tables. On the other hand, `'prime_genre'` will be used to generate the Apple Store frequency table.

In [None]:
print("Google Play Store dataset column names: ")
print(google_header)
print("\n")

print("Apple Store dataset column names: ")
print(apple_header)

### Part Two

We'll build two functions we can use to analyze the frequency tables:
- One function named `freq_table()` to generate frequency tables that show percentages
- Another function named `display_table()` we can use to display the percentages in a descending order

First, we will start implementing the function `freq_table()` below.
- It takes in two inputs: `dataset` (which is expected to be a list of lists) and `index` (which is expected to be an integer).
- The function will return the frequency table (as a dictionary) for any column we want. The frequencies should also be expressed as percentages.

In [None]:
def freq_table(dataset, index):
    percent_dict = {}
    for each_row in dataset:
        key = each_row[index]
        if key in percent_dict:
            percent_dict[key] += 1
        else:
            percent_dict[key] = 1
    
    for i in percent_dict:
        percent_dict[i] = round((percent_dict[i]/len(dataset)) * 100, 2)
    
    return percent_dict

Now, below, the `display_table()` function is implemented. It:
- Takes in two parameters: `dataset` and `index`. `dataset` is expected to be a list of lists, and `index` is expected to be an integer.
- Generates a frequency table using the `freq_table()` function
- Transforms the frequency table into a list of tuples `table_display`, then sorts the list in descending order using the built-in `sorted()` function. Calling `sorted()` on a dictionary sorts only its keys, not its values, so it cannot order the table by percentage directly. For that reason, we convert the dictionary returned by `freq_table()` into a list of tuples where, in each tuple, the dictionary value comes first and the key comes second. Because we want descending order, the `reverse` parameter of `sorted()` is set to `True`.
- Finally, prints the entries of the frequency table in descending order.

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1] + ': ' + str(entry[0]) + "%")
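
For comparison, the same table can be built with `collections.Counter` and sorted by value through the `key` argument of `sorted()`, which avoids swapping keys and values into tuples. This is a sketch, not the implementation used below; the function names and the toy rows are hypothetical. When two percentages tie, the order may differ from `display_table()`, which breaks ties by key.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, rounded to 2 decimals."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

def display_table_sorted(dataset, index):
    """Print the column's values from most to least frequent."""
    table = freq_table_counter(dataset, index)
    # Sort the (key, value) pairs by value, largest first.
    for key, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(key + ': ' + str(pct) + '%')

# Toy rows (made up): app name, genre.
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
display_table_sorted(toy, 1)
# Games: 75.0%
# Tools: 25.0%
```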

### Part Three

First, we invoke the `display_table()` function on the `ios_free_clean` dataset, using the `prime_genre` column (index `11` of each row).

In [None]:
display_table(ios_free_clean, 11)

Looking at the data above, the **Games** genre has the most apps available, with 58.16% of the total. This is far more than the second genre in the list, **Entertainment** (7.88%). After that, apps are spread fairly evenly among the remaining genres. It looks like the Apple Store mostly offers apps that are fun and fewer practical apps.

However, we are only analyzing free English apps, so this might not reflect the whole of the Apple Store. More importantly, the number of apps in a genre does not tell us how much those apps are used, so it would be unwise to recommend an app profile from this table alone. We can combine this information with usage data later.

Now let us take a look at the Google Play Store platform's data.

We invoke the `display_table()` function on the `android_free_clean` dataset, using the `'Category'` column (index `1` of each row).

In [None]:
display_table(android_free_clean, 1)

As can be seen from the data above for the `'Category'` column of the Google Play Store dataset, where we have only considered free English apps, **FAMILY** apps are the most numerous (around 19% of the total). The second most numerous are the **GAME** apps with around 10% (almost half of **FAMILY**). Third in line are the **TOOLS** apps with 8.46% of the total. After that, apps are spread fairly evenly among the remaining categories.

It does look like there is some balance in the number of apps offered across categories on this platform. However, upon further inspection of the [Google Play Store website](https://play.google.com/store/apps), the [**FAMILY**](https://play.google.com/store/apps/category/FAMILY) apps are mostly for kids. Apart from that, most [**GAME**](https://play.google.com/store/apps/category/GAME) apps are also generally for kids. From this viewpoint, it looks like Google Play Store offers most of its apps for kids. However, we cannot say confidently that the same holds for app usage, because availability does not readily translate to demand.

We can further analyze the `'Genres'` column in this dataset to see how apps are distributed across genres.

Finally, we invoke the `display_table()` function on the `android_free_clean` dataset again, this time using the `'Genres'` column (index `9` of each row).

In [None]:
display_table(android_free_clean, 9)

This data tells a very different story about Google Play Store's app distribution than the `'Category'` column did. The `'Genres'` column is more granular than `'Category'`. Therefore, going forward with our analysis of the Google Play Store, we will use only the `'Genres'` column.

Here, we see that the **Tools** genre has the most apps (8.45%), followed by **Entertainment** with 6.07%. After that, **Education**, **Business**, **Productivity**, **Lifestyle**, **Finance**, **Medical**, **Sports**, **Personalization**, **Communication**, **Action**, and **Health & Fitness** are all above 3% and close to each other in terms of percentage of the total.

It tells us that in the Google Play Store, unlike the Apple Store, there is a balance in the apps offered. However, again, as we are basing our opinion on filtered data (free English apps), it might not hold true for the whole of both platforms. Apart from that, it would be unwise to recommend, based on this data alone, which kind of app is used most: the data shows how many apps of each category or genre are available on the platforms, but gives no concrete idea of how much they are used.

However, we can use this information later in conjunction with other data in both datasets to find the apps that are most used.

# App Profile Recommendation from Apple Store

### Part One

The frequency tables we analyzed in the previous section showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `'Installs'` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column; this is index `5` of each row in the Apple Store dataset.

### Part Two

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:
- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

We can start by generating a frequency table for the `'prime_genre'` column to get the unique app genres (below, we'll need to loop over the unique genres). We'll use the `freq_table()` function to do this.

We can then reuse the dictionary `freq_tbl_unique_prime_genre`, which has each unique app genre as a key, to store the average number of total ratings (`'rating_count_tot'`) across all the apps of that genre.

We want to display the `freq_tbl_unique_prime_genre` dictionary sorted in descending order, so we use the built-in `sorted()` function again. As noted earlier, `sorted()` cannot order a dictionary by its values directly. So, we convert the dictionary into a list of tuples and then use `sorted()` to print the data in descending order.

In [None]:
freq_tbl_unique_prime_genre = freq_table(ios_free_clean, 11)

for genre in freq_tbl_unique_prime_genre:
    total = 0
    len_genre = 0
    for each_row in ios_free_clean:
        genre_app = each_row[11]
        if genre_app == genre:
            total += float(each_row[5]) #index 5 for column 'rating_count_tot'
            len_genre += 1
    avg = round(total/len_genre, 2)
    freq_tbl_unique_prime_genre[genre] = avg

table = freq_tbl_unique_prime_genre
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1] + ': ' + str(entry[0]))
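
The cell above loops over the whole dataset once per genre. The same averages can be computed in a single pass by keeping a running sum and a running count per genre. The following is a sketch under stated assumptions: `average_by_group` is a hypothetical helper and the toy rows are made up.

```python
def average_by_group(dataset, group_col, value_col):
    """Return {group: average of value_col}, computed in one pass over the rows."""
    totals = {}  # group -> running sum of the values
    counts = {}  # group -> number of rows seen
    for row in dataset:
        group = row[group_col]
        totals[group] = totals.get(group, 0.0) + float(row[value_col])
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Toy rows (made up): genre, total rating count.
toy = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
print(average_by_group(toy, 0, 1))
# {'Games': 200.0, 'Weather': 50.0}
```

With the columns used in this section, a call would presumably look like `average_by_group(ios_free_clean, 11, 5)`.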

From the data above, we are going to analyze the *top 5* app profiles and find out how many of their apps deserve the developers' attention, so that they can target those for in-app ad campaigns.

Let us first create a function `appPopularityByGenre()` that takes the following inputs:
- `dataset`
- `nameColumn` (index of the app name in each row)
- `genreColumn` (index of the app genre in each row)
- `n_install_column` (index of the installation count in each row)
- `genreName` (the name of the genre)
- `androidDataset` (optional; set it to `True` when the installation counts are strings such as `'10,000+'`, as in the Google Play Store dataset)

and finds the apps under that genre along with each app's percentage share of the genre's installations.

The function first prints the total number of apps under the given genre. It then prints only the apps that account for 5% or more of the installations under that genre. We believe an app with less than 5% of its genre's installations does not deserve the developers' attention, because users favor other apps under that genre and those are more worthy of attention.

In [None]:
def appPopularityByGenre(dataset, nameColumn, genreColumn, n_install_column, genreName, androidDataset=False):
    # Map each app name in the genre to its install (or rating) count
    app = {}
    total_n_install = 0
    for each_row in dataset:
        name = each_row[nameColumn]
        if each_row[genreColumn] == genreName:
            n_install = each_row[n_install_column]
            if androidDataset:
                # Google Play install counts contain '+' and ',' characters
                n_install = n_install.replace('+', '')
                n_install = n_install.replace(',', '')
            app[name] = float(n_install)
            total_n_install += float(n_install)

    print("Total app count for '" + genreName + "' genre: ", len(app))
    print("\n")
    print("Popular apps under this genre ===> ")
    for k in app:
        # Convert the raw count to a percentage of the genre total
        app[k] = round((app[k]/total_n_install) * 100, 2)
        if app[k] >= 5:
            print(k, ": ", app[k], "%")
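
Because `appPopularityByGenre()` only prints its results, they cannot be reused or checked later. Below is a minimal sketch of a variant that returns the shares instead. The name `genre_install_shares`, the `parse` and `threshold` parameters, and the sample rows are all hypothetical and not part of the original notebook. The `parse` callable stands in for the `androidDataset` flag, and the function returns an empty dictionary when the genre total is zero.

```python
def genre_install_shares(rows, name_idx, genre_idx, count_idx, genre_name,
                         parse=float, threshold=5.0):
    """Return {app name: % of the genre total} for apps at or above threshold."""
    counts = {}
    for row in rows:
        if row[genre_idx] == genre_name:
            # parse converts the raw cell (a string) into a number
            counts[row[name_idx]] = parse(row[count_idx])

    total = sum(counts.values())
    if total == 0:
        return {}  # no matching apps or no installs: avoid dividing by zero

    shares = {}
    for name, count in counts.items():
        share = round(count / total * 100, 2)
        if share >= threshold:
            shares[name] = share
    return shares


# Made-up rows: [app name, count, genre]
sample_rows = [
    ["Maps", "900", "Navigation"],
    ["Compass", "80", "Navigation"],
    ["Tiny", "20", "Navigation"],
    ["Radio", "500", "Music"],
]
shares = genre_install_shares(sample_rows, 0, 2, 1, "Navigation")
print(shares)  # {'Maps': 90.0, 'Compass': 8.0}
```

For Google Play rows, a caller could pass something like `parse=lambda s: float(s.replace('+', '').replace(',', ''))` to handle the formatted install strings.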

We have decided to analyze the *top 5* most popular app genres on the App Store. They are:
1. Navigation
2. Reference
3. Social Networking
4. Music
5. Weather

One by one, we will analyze each of these genres and print only the apps that are worthy of the developers' attention for increasing in-app ads and user engagement.

We will start with the most popular one: **1. Navigation**.

In [None]:
appPopularityByGenre(ios_free_clean, 1, 11, 5, 'Navigation')

Next, we will take a look at **2. Reference**.

In [None]:
appPopularityByGenre(ios_free_clean, 1, 11, 5, 'Reference')

Now it is the turn of **3. Social Networking**.

In [None]:
appPopularityByGenre(ios_free_clean, 1, 11, 5, 'Social Networking')

After that, we have **4. Music**

In [None]:
appPopularityByGenre(ios_free_clean, 1, 11, 5, 'Music')

Finally, we have **5. Weather**

In [None]:
appPopularityByGenre(ios_free_clean, 1, 11, 5, 'Weather')

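# App Profile Recommendation from Google Play

We repeat the same analysis for Google Play, this time averaging the install counts (index 5) over the apps of each genre (index 9).
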
In [None]:
freq_tbl_unique_genre = freq_table(android_free_clean, 9)

for genre in freq_tbl_unique_genre:
    total = 0
    len_genre = 0
    for each_row in android_free_clean:
        genre_app = each_row[9]
        if genre_app == genre:
            n_installs = each_row[5] #index 5 for column 'installs'
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs) 
            len_genre += 1
    avg = round(total/len_genre, 2)
    freq_tbl_unique_genre[genre] = avg

table = freq_tbl_unique_genre
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1] + ': ' + str(entry[0]))
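
The install-count cleanup is repeated in several cells, so it could be factored into a small helper. This is only a sketch: `parse_installs` is a hypothetical name, not a function defined elsewhere in this notebook. Note that the trailing `+` suggests the Google Play values are open-ended buckets (for example `1,000,000+`), in which case the parsed number is a lower bound rather than an exact count.

```python
def parse_installs(raw):
    """Convert a Google Play install string such as '1,000,000+' to a float."""
    # Remove the trailing '+' and the thousands separators before converting
    return float(raw.replace('+', '').replace(',', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('50+'))         # 50.0
print(parse_installs('0'))           # 0.0
```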

As with the App Store, we analyze the *top 5* genres from the output above:

1. Communication
2. Adventure;Action & Adventure
3. Video Players & Editors
4. Social
5. Arcade

In [None]:
appPopularityByGenre(android_free_clean, 0, 9, 5, 'Communication', True)

In [None]:
appPopularityByGenre(android_free_clean, 0, 9, 5, 'Adventure;Action & Adventure', True)

In [None]:
appPopularityByGenre(android_free_clean, 0, 9, 5, 'Video Players & Editors', True)

In [None]:
appPopularityByGenre(android_free_clean, 0, 9, 5, 'Social', True)

In [None]:
appPopularityByGenre(android_free_clean, 0, 9, 5, 'Arcade', True)