# Analyzing Mobile App Data from Google & Apple

author >> <mark>hawkihawk</mark>

This project contains a series of analyses of free app profiles listed in both the Apple App Store and the Google Play Store, and doubles as practice in data collection, hygiene, visualization, etc. Our goal is to help developers understand what types of apps are likely to attract more users. To do this, <mark>we analyzed existing data (at no cost) from more than 10,000 mobile apps</mark> using the following datasets:

1. A dataset containing [data about approximately 10,000 Android apps from Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps); the data was collected in August 2018.
2. A dataset containing [data about approximately 7,000 iOS apps from the App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps); the data was collected in July 2017.

#### Column descriptions
There are 16 columns in the iOS dataset. Here are their descriptions:

| column_nr | column_name | description |
| ----------- | ----------- | ----------- |
| 0 | "id" | App ID |
| 1 | "track_name" | App Name |
| 2 | "size_bytes" | Size (in Bytes) |
| 3 | "currency" | Currency Type |
| 4 | "price" | Price amount |
| 5 | "rating_count_tot" | User Rating counts (for all versions) |
| 6 | "rating_count_ver" | User Rating counts (for current version) |
| 7 | "user_rating" | Average User Rating value (for all versions) |
| 8 | "user_rating_ver" | Average User Rating value (for current version) |
| 9 | "ver" | Latest version code |
| 10 | "cont_rating" | Content Rating |
| 11 | "prime_genre" | Primary Genre |
| 12 | "sup_devices.num" | Number of supported devices |
| 13 | "ipadSc_urls.num" | Number of screenshots shown for display |
| 14 | "lang.num" | Number of supported languages |
| 15 | "vpp_lic" | VPP Device Based Licensing Enabled |

There are 13 columns in the Android dataset. Here are their descriptions:

| column_nr | column_name | description |
| ----------- | ----------- | ----------- |
| 0 | "App" | App Name |
| 1 | "Category" | App Category |
| 2 | "Rating" | Average User Rating value |
| 3 | "Reviews" | User Rating counts |
| 4 | "Size" | Size in MB |
| 5 | "Installs" | Total Number Of Installs |
| 6 | "Type" | Free or Paid |
| 7 | "Price" | Price of App |
| 8 | "Content Rating" | Content Rating |
| 9 | "Genres" | App Genre |
| 10 | "Last Updated" | Date of last update |
| 11 | "Current Ver" | Current app version |
| 12 | "Android Ver" | Supported Android versions |

# Data exploration

## Opening both datasets

Let's explore the data

In [1]:
from csv import reader

### The Google Play dataset ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0] #first row is the header row
android_body = android[1:] #the body of the dataset

### The App Store dataset ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0] #first row is the header row
ios_body = ios[1:] #the body of the dataset
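
The cell above leaves both files open and relies on the platform's default text encoding. As a minimal sketch (the helper name `load_csv` and the `utf8` default are assumptions, not part of the original notebook), the same loading step could be wrapped so that the file is always closed:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, body), closing the file afterwards."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the two files from this project:
# android_header, android_body = load_csv('googleplaystore.csv')
# ios_header, ios_body = load_csv('AppleStore.csv')
```

The returned pair mirrors the header/body split used throughout the rest of the notebook.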

To make both datasets easier to explore, we'll use a function named explore_data() (defined in the next section) that prints rows in a readable way. First, a quick look at the Android header and one sample row:

In [2]:
print(android_header)
print('\n')
print(android_body[1])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


### Android data:

In [33]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android_body, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


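The Google Play dataset has 10,841 apps and 13 columns. (The cell output above reports 10,840 rows, which suggests the cell was re-run after the faulty row was removed later in this notebook.) At a quick glance, the columns that might be useful for the purpose of our analysis are *'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.*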

### iOS data:

In [4]:
# explore_data() was already defined in the Android cell above, so we reuse it here
print(ios_header)
print('\n')
explore_data(ios_body, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The iOS dataset has 7,197 apps and 16 columns. The columns that might be useful for the purpose of our analysis are *'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.*

# Data Cleaning

## Deleting wrong data

Before beginning our analysis we need to make sure the data we analyze is accurate, or the results of our analysis **will be wrong.** This means that we need to do the following:

1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove the duplicates.

Or, in other words: seek and destroy.

The Google Play data set has a [dedicated discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472.

In [5]:
print(android_body[10472])
print('\n')
print(android_header)
print('\n')

android_dictionary = dict(zip(android_header, android_body[10472]))
print(android_dictionary)
print('\n')
    
for key, value in android_dictionary.items():
    print(f'{key: <20}{value}')

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


{'App': 'Life Made WI-Fi Touchscreen Photo Frame', 'Category': '1.9', 'Rating': '19', 'Reviews': '3.0M', 'Size': '1,000+', 'Installs': 'Free', 'Type': '0', 'Price': 'Everyone', 'Content Rating': '', 'Genres': 'February 11, 2018', 'Last Updated': '1.0.19', 'Current Ver': '4.0 and up'}


App                 Life Made WI-Fi Touchscreen Photo Frame
Category            1.9
Rating              19
Reviews             3.0M
Size                1,000+
Installs            Free
Type                0
Price               Everyone
Content Rating      
Genres              February 11, 2018
Last Updated        1.0.19
Current Ver         4.0 and up


Row 10472 includes the app "Life Made WI-Fi Touchscreen Photo Frame", and we can see that the rating is 19, which is impossible since the maximum rating for a Google Play app is 5. The problem is caused by a missing value in the 'Category' column: the row has only 12 values, so every value after the app name is shifted one column to the left.
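
Rather than relying only on the discussion forum, rows like this can also be found programmatically, because a row with a missing value has fewer fields than the header. The sketch below is illustrative only: `find_malformed_rows` is a hypothetical helper and the toy rows stand in for `android_header` and `android_body`.

```python
def find_malformed_rows(header, body):
    """Return the indexes of rows whose number of fields differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(body) if len(row) != expected]

# Toy data standing in for the real dataset
toy_header = ['App', 'Category', 'Rating']
toy_body = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken Frame', '1.9'],        # the 'Category' value is missing
    ['Slack', 'BUSINESS', '4.4'],
]

print(find_malformed_rows(toy_header, toy_body))
```

On the real data, the call would be `find_malformed_rows(android_header, android_body)`.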

**To tackle this, we remove the row from the dataset.**

In [6]:
del android_body[10472]  # run this cell only once; re-running it would delete another row

print(android_body[10472])

print('\n')
android_dictionary = dict(zip(android_header, android_body[10472]))
for key, value in android_dictionary.items():
    print(f'{key: <20}{value}')

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


App                 osmino Wi-Fi: free WiFi
Category            TOOLS
Rating              4.2
Reviews             134203
Size                4.1M
Installs            10,000,000+
Type                Free
Price               0
Content Rating      Everyone
Genres              Tools
Last Updated        August 7, 2018
Current Ver         6.06.14
Android Ver         4.4 and up


## Removing Duplicate Entries

### Part One

In total, there are 1,181 cases where an Android app occurs more than once. **There are, however, 0 cases where an iOS app occurs more than once** (the iOS check below compares the 'id' column at index 0, not the app name).

In [7]:
duplicate_android_apps = []
unique_android_apps = []

for app in android_body: 
    name = app[0]
    
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)
        
duplicate_ios_apps = []
unique_ios_apps = []

for app in ios_body: 
    name = app[0]  # index 0 is the iOS 'id' column; 'track_name' is at index 1
    
    if name in unique_ios_apps:
        duplicate_ios_apps.append(name)
    else:
        unique_ios_apps.append(name)

print('Number of duplicate Android apps:', len(duplicate_android_apps))
print('Number of duplicate iOS apps:', len(duplicate_ios_apps))
print('\n')
print('Examples of duplicate Android apps:', duplicate_android_apps[:15])
print('\n')
print('Examples of duplicate iOS apps:', duplicate_ios_apps[:15])

Number of duplicate Android apps: 1181
Number of duplicate iOS apps: 0


Examples of duplicate Android apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Examples of duplicate iOS apps: []
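
Checking `name in unique_android_apps` scans a list on every iteration, which gets slow as the list grows. A possible alternative, sketched here under the assumption that only the counts are needed, uses `collections.Counter` from the standard library; `count_duplicates` is a hypothetical helper name.

```python
from collections import Counter

def count_duplicates(body, name_index):
    """Count the extra occurrences of every value found in column name_index."""
    counts = Counter(row[name_index] for row in body)
    # An app that appears n times contributes n - 1 duplicates
    return sum(n - 1 for n in counts.values() if n > 1)

# Toy rows standing in for android_body
toy_rows = [['Box'], ['Slack'], ['Box'], ['Box']]
print(count_duplicates(toy_rows, 0))
```

This counts duplicates the same way as the loop above: the first occurrence is treated as unique and every later occurrence as a duplicate.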


When we start examining individual apps, we can see that for the Instagram app, for example, the main difference is in the 4th position of each row, which corresponds to the number of reviews. **The different numbers suggest the data was collected at different times.**

In [8]:
for app in android_body:
    name = app[0]
    
    if name == 'Instagram':
        print(app, '\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 



We can use this information to build a criterion for removing the duplicates. One of the possible explanations for the difference in column 4 is that the higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, **we'll only keep the row with the highest number of reviews and remove the other entries for any given app.**

### Part Two

In [9]:
print('The expected length of Android list is', f'{len(android_body):,}', 'excluding the header')

#I researched how to format numbers in Python, hence f'{:,}'

The expected length of Android list is 10,840 excluding the header


In order to remove the duplicate entries we will do the following:

1. Create a dictionary
    1. Each dictionary key is a unique app name 
    2. Each corresponding dictionary value is the highest number of reviews of that app
2. Create a new dataset with the dictionary
    1. Only have one entry per app
    2. Ensure that the entry with the highest number of reviews is selected

To turn the steps above into code, we'll need the `not in` operator, which is the opposite of the `in` operator. For instance, **'z' in ['a', 'b', 'c'] returns False** because 'z' is not in ['a', 'b', 'c'], but **'z' not in ['a', 'b', 'c'] returns True** because it's true that 'z' is not in the list ['a', 'b', 'c'], like so:

In [10]:
print('z' in ['a', 'b','c'])
print('z' not in ['a', 'b','c'])

False
True


Below we create an empty dictionary named reviews_max and loop through the Google Play dataset (excluding the header row), keeping the highest review count seen for each app name.

In [11]:
reviews_max = {}
    
for app in android_body:
    name = app[0] #Assign the app name to a variable named name.
    n_reviews = float(app[3]) #Convert the number of reviews to float. Assign it to a variable named n_reviews.
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our dataset and 1,181.

In [12]:
print('Expected length:',f'{(len(android_body)-1181):,}')
print('Actual length:',f'{len(reviews_max):,}')

Expected length: 9,659
Actual length: 9,659


Now that we've confirmed the actual length, we'll use the dictionary we created above to remove the duplicate rows: 

1. First we create two empty lists: 

    1. <mark>android_clean</mark> which will store our new cleaned dataset
    2. <mark>already_added</mark> which will just store app names

2. We'll then loop through the <mark>android_body</mark> dataset, and for each iteration we'll:

    1. Assign the app name to a variable named *name* & reviews to a variable named *n_reviews*
    2. We add the current row (<mark>app</mark>) to the <mark>android_clean list</mark>, and the app name (<mark>name</mark>) to the already_added list if:
    
        1. The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary
        2. The name of the app is not already in the already_added list. 

In [13]:
android_clean = []
already_added = []

for app in android_body:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [14]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The supplementary condition (name not in already_added) accounts for cases where more than one entry of a duplicated app shares the highest number of reviews (for example, the Box app has three entries, and the number of reviews is the same). In particular:

- If we only checked reviews_max[name] == n_reviews, we'd still end up with duplicate entries for some apps.
- If we added an <mark>else</mark> before <mark>already_added.append(name)</mark>, we'd also end up with duplicate entries for some apps.
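
The two-step approach above (a dictionary of maximum review counts, then a second pass with `already_added`) could also be collapsed into a single pass that stores the best row per app. This is only a sketch of an alternative, not the method used in this notebook; `keep_most_reviewed` is a hypothetical name, and the column indexes default to the Android layout.

```python
def keep_most_reviewed(body, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in body:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that ties keep the row that was seen first
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows standing in for android_body (only the relevant columns are realistic)
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]
for row in keep_most_reviewed(toy_rows):
    print(row)
```

One difference worth noting: the result is ordered by the first appearance of each app name, so the row order may not match android_clean exactly even though the same rows are kept.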

## Removing Non-English Apps

### Part One

We use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.

In [15]:
#Android dataset
print(android_clean[4412][0])
print(android_clean[7940][0])

print('\n')

#iOS dataset
print(ios_body[813][1])
print(ios_body[6731][1])

中国語 AQリスニング
لعبة تقدر تربح DZ


爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


We're not interested in keeping these apps, so we'll remove them. Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.
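
As a quick check of the numbers quoted above, the snippet below prints the code point of a few characters and whether each falls inside the ASCII range. The helper name `is_ascii` is an assumption made for illustration.

```python
def is_ascii(character):
    """Return True if the character's code point is in the ASCII range 0-127."""
    return ord(character) <= 127

for character in ['a', 'A', '爱', '™']:
    print(character, ord(character), is_ascii(character))
```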

In [16]:
def english_check(an_input):
    
    for character in an_input:
        if ord(character) > 127:
            return False
    
    return True

Above we wrote a function that takes in a string and returns <mark>False</mark> if there's any character in the string that doesn't belong to the set of common English characters; otherwise, the function returns <mark>True</mark>.

- The function iterates over the input string
- For each iteration it checks whether the number associated with the character is greater than 127
    - When the number is greater than 127, the function immediately returns False: the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
    - If the loop finishes without the return statement executing, no character had a corresponding number over 127, so the app name is probably English and the function returns True.

**Below we test the new function**

In [17]:
print(english_check('pew pew pew 🌌 🔫'))
print(english_check('Instachat 😜'))
print(english_check('Instachat'))

print(ord('™'))
print(ord('😜'))

False
False
True
8482
128540


Excluding every app whose name contains a character with a number greater than 127 is too strict: the function fails to identify certain English app names, such as 'pew pew pew 🌌 🔫', as English. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

### Part Two

To minimize the impact of data loss we only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.

In [18]:
def english_check(an_input):
    not_ascii = 0
    
    for character in an_input:
        if ord(character) > 127:
            not_ascii += 1

    if not_ascii > 3:
        return False
    else:
        return True

print(english_check('Could this be?'))
print(english_check('pew pew pew 🔫'))
print(english_check('pew剧 pew剧 pew剧 🔫'))

True
True
False


Our filter function is still not perfect, but it should be fairly effective. In the next code cell we use the new function to filter out non-English apps: we loop through each dataset and build a separate list of the apps that are identified as English.

In [19]:
android_english_body = []
ios_english_body = []

for app in android_clean:
    name = app[0]

    if english_check(name):
        android_english_body.append(app)
        
        
for app in ios_body:
    name = app[1]
    if english_check(name):
        ios_english_body.append(app)

android_clean = android_english_body
ios_clean = ios_english_body
        
print('**Android**')
explore_data(android_english_body, 0, 2, rows_and_columns=True)
print('\n')
print('**iOS**')
explore_data(ios_english_body, 0, 2, rows_and_columns=True)

**Android**
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


**iOS**
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


**The Android dataset went from 9,659 to 9,614 reducing the number of entries by 45**


**The iOS dataset went from 7,197 to 6,183 reducing the number of entries by 1,014**

## Isolating the Free Apps

So far in the data cleaning process, we've done the following:

1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps

We're tasked with analyzing apps that are free to download and install. Since our datasets contain both free and non-free apps, we need to isolate the free apps for our analysis.

In [20]:
free_android_apps = []
free_ios_apps = []


for row in android_clean:
    price = row[7]
    if price == '0':
        free_android_apps.append(row)
        
for row in ios_clean:
    price = row[4]
    if price == '0.0':
        free_ios_apps.append(row)
        
print('Entries for free Android apps')
explore_data(free_android_apps, 0, 2, rows_and_columns=True)
print('\n')
print('Entries for free iOS apps')
explore_data(free_ios_apps, 0, 2, rows_and_columns=True)

Entries for free Android apps
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


Entries for free iOS apps
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 3222
Number of columns: 16


## Most Common Apps by Genre

### Part One

Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification. One of the main benefits of having an app in both markets is the reach and larger revenue pool we wouldn't otherwise have access to.

Below we begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets. The most useful column in the iOS dataset is most likely 'prime_genre'; for Android, the candidates are 'Genres' and 'Category'.

In [21]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [22]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Part Two

We'll build two functions we can use to analyze the frequency tables:

1. One function to generate frequency tables (freq_table) that show percentages
2. Another function we can use to display the percentages (display_table) in a descending order

The freq_table() function takes in two inputs: 
- dataset (which will be a list of lists)
- index (which will be an integer).

The function should return the frequency table as a dictionary for any column we want. The frequencies should also be expressed as percentages.

In [23]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
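
For comparison, the same percentages could be produced with `collections.Counter` from the standard library. This is a sketch of an alternative rather than a replacement for the function above; the name `freq_table_counter` is hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: (count / total) * 100 for key, count in counts.items()}

# Toy rows: the genre sits at index 1
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(toy_rows, 1))
```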

### Part Three

The display_table() function below does the following:

- Takes in two parameters: dataset and index. dataset will be a list of lists, and index will be an integer
- Generates a frequency table using the freq_table() function defined above
- Transforms the frequency table into a list of tuples, then sorts the list in descending order
- Prints the entries of the frequency table in descending order

Below we take a look at the 'prime_genre' column from the iOS dataset. Apps made for entertainment have by far the largest share of apps in the App Store: the top two genres, Games and Entertainment, account for more than 60% of all apps. That said, while entertainment apps dominate in number of apps, this doesn't tell us whether they attract a proportional number of users. Note that the cell below passes ios_clean, which still contains paid apps; to restrict the table to free apps, pass free_ios_apps instead (the percentages would then differ from the ones shown).

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0],2))

display_table(ios_clean,11)

Games : 54.86
Entertainment : 7.26
Education : 6.63
Photo & Video : 5.52
Utilities : 3.44
Productivity : 2.72
Health & Fitness : 2.67
Music : 2.22
Social Networking : 2.04
Sports : 1.68
Lifestyle : 1.6
Shopping : 1.37
Weather : 1.12
Travel : 0.97
News : 0.92
Book : 0.89
Reference : 0.86
Business : 0.86
Finance : 0.79
Food & Drink : 0.71
Navigation : 0.45
Medical : 0.34
Catalogs : 0.08


Below we take a look at both the 'Category' and 'Genres' columns from the Android dataset. At first glance, the share of apps seems more evenly distributed and leans toward practical apps. The category with the largest share of apps is ['Family'](https://play.google.com/store/apps/category/FAMILY) with 19.33%, followed by Game with 9.82%. The five categories that follow are *Tools, Business, Medical, Personalization, and Productivity*, which together have a 24.86% share.

In [25]:
print("**Android \'category'**")
print('\n')
display_table(android_clean,1)

**Android 'category'**


FAMILY : 19.33
GAME : 9.82
TOOLS : 8.61
BUSINESS : 4.36
MEDICAL : 4.11
PERSONALIZATION : 3.9
PRODUCTIVITY : 3.88
LIFESTYLE : 3.79
FINANCE : 3.59
SPORTS : 3.38
COMMUNICATION : 3.27
HEALTH_AND_FITNESS : 3.0
PHOTOGRAPHY : 2.91
NEWS_AND_MAGAZINES : 2.6
SOCIAL : 2.49
TRAVEL_AND_LOCAL : 2.28
BOOKS_AND_REFERENCE : 2.27
SHOPPING : 2.09
DATING : 1.77
VIDEO_PLAYERS : 1.7
MAPS_AND_NAVIGATION : 1.34
FOOD_AND_DRINK : 1.16
EDUCATION : 1.1
ENTERTAINMENT : 0.9
LIBRARIES_AND_DEMO : 0.87
AUTO_AND_VEHICLES : 0.87
WEATHER : 0.82
HOUSE_AND_HOME : 0.76
EVENTS : 0.67
PARENTING : 0.62
ART_AND_DESIGN : 0.62
COMICS : 0.57
BEAUTY : 0.55


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

In [26]:
print("**Android \'genres'**")
print('\n')
display_table(android_clean,9)

**Android 'genres'**


Tools : 8.6
Entertainment : 5.79
Education : 5.23
Business : 4.36
Medical : 4.11
Personalization : 3.9
Productivity : 3.88
Lifestyle : 3.78
Finance : 3.59
Sports : 3.44
Communication : 3.27
Action : 3.11
Health & Fitness : 3.0
Photography : 2.91
News & Magazines : 2.6
Social : 2.49
Travel & Local : 2.27
Books & Reference : 2.27
Shopping : 2.09
Simulation : 1.98
Arcade : 1.91
Dating : 1.77
Casual : 1.72
Video Players & Editors : 1.67
Maps & Navigation : 1.34
Puzzle : 1.24
Food & Drink : 1.16
Role Playing : 1.08
Strategy : 0.98
Racing : 0.95
Libraries & Demo : 0.87
Auto & Vehicles : 0.87
Weather : 0.82
House & Home : 0.76
Adventure : 0.75
Events : 0.67
Art & Design : 0.58
Comics : 0.56
Beauty : 0.55
Card : 0.49
Parenting : 0.48
Board : 0.44
Casino : 0.41
Educational;Education : 0.4
Trivia : 0.38
Educational : 0.38
Education;Education : 0.36
Casual;Pretend Play : 0.26
Word : 0.24
Music : 0.2
Puzzle;Brain Games : 0.18
Education;Pretend Play : 0.18
Racing;Action & Adv

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.

## Most Popular Apps by Genre on the App Store

With the frequency tables in hand, the next step is to determine which kinds of apps have the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing for the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to do the following:

- Isolate the apps of each genre
- Add up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

To calculate the average number of user ratings for each genre, we'll use a for loop inside another for loop: the outer loop iterates over the genres, and the inner loop iterates over the apps to sum the ratings of the current genre.

In [27]:
unique_genres = freq_table(ios_clean, 11)

for main_genre in unique_genres:
    total = 0
    len_genre = 0

    for app in ios_clean:
        genre_app = app[11]
    
        if genre_app == main_genre:
            n_user_rating = float(app[5])
            total += n_user_rating
            len_genre += 1
    
    avg_user_rating = total / len_genre
    print(main_genre, ':', round(avg_user_rating))

Social Networking : 60254
Photo & Video : 14689
Games : 15587
Music : 29047
Reference : 27037
Health & Fitness : 10802
Weather : 23145
Utilities : 7928
Travel : 19030
Shopping : 26635
News : 16980
Navigation : 19371
Lifestyle : 8930
Entertainment : 8862
Food & Drink : 19934
Sports : 15351
Book : 10359
Finance : 23354
Education : 2472
Productivity : 8508
Business : 5149
Catalogs : 3465
Medical : 649
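
The nested loop above rescans the whole data set once per genre. As a hedged alternative, the same averages can be accumulated in a single pass with two dictionaries. The sketch below uses a small made-up list (`sample_apps`, with hypothetical rows that are not taken from the data set) in place of `ios_clean`, assuming the genre sits at index 11 and the rating count at index 5, as in the iOS column layout. The names `make_row` and `average_ratings_by_genre` are illustrative only.

```python
# Hypothetical rows shaped like ios_clean: only index 5 (rating count)
# and index 11 (genre) matter here; the other columns are left as None.
def make_row(n_ratings, genre):
    row = [None] * 16
    row[5] = str(n_ratings)
    row[11] = genre
    return row

sample_apps = [
    make_row(100, 'Games'),
    make_row(300, 'Games'),
    make_row(50, 'Weather'),
]

def average_ratings_by_genre(dataset, genre_index=11, rating_index=5):
    totals = {}  # genre -> sum of rating counts
    counts = {}  # genre -> number of apps
    for app in dataset:
        genre = app[genre_index]
        totals[genre] = totals.get(genre, 0) + float(app[rating_index])
        counts[genre] = counts.get(genre, 0) + 1
    # Divide by the number of apps in each genre, not by the total number of apps
    return {genre: totals[genre] / counts[genre] for genre in totals}

averages = average_ratings_by_genre(sample_apps)
for genre in averages:
    print(genre, ':', round(averages[genre]))
```

For a data set of this size the difference in speed is negligible, so the nested loop remains a perfectly reasonable choice.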


While there are a lot of things to consider when selecting an app profile, at this point I'd probably recommend selecting one that has a high number of users relative to the number of apps in the category. 'Reference', for example, represents under 1% of the apps but averages 27,037 user ratings per app, the third-highest figure in the list above. That suggests competition isn't as tough as in other categories.

## Most Popular Apps by Genre on Google Play

Above we came up with an app profile recommendation for the App Store based on the number of user ratings. We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough:

In [28]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0],2))

display_table(android_clean,5)

1,000,000+ : 14.71
100,000+ : 11.5
10,000+ : 10.62
10,000,000+ : 9.75
1,000+ : 9.15
100+ : 7.32
5,000,000+ : 6.29
500,000+ : 5.24
5,000+ : 4.84
50,000+ : 4.82
10+ : 3.99
500+ : 3.41
50,000,000+ : 2.12
50+ : 2.12
100,000,000+ : 1.97
5+ : 0.85
1+ : 0.69
500,000,000+ : 0.25
1,000,000,000+ : 0.21
0+ : 0.14
0 : 0.01


We can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes. We only want to find out which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
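
To see why the cleanup is needed, here is a minimal check in plain Python. The string is an illustrative value in the same format as the Installs column, and `conversion_failed` is a hypothetical name; `float()` raises a `ValueError` when the commas and the plus sign are left in.

```python
raw_installs = '1,000,000+'  # illustrative value in the Installs format

try:
    float(raw_installs)
    conversion_failed = False
except ValueError as error:
    # float() cannot parse thousands separators or a trailing plus sign
    conversion_failed = True
    print('Conversion failed:', error)
```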

To remove characters from strings, we can use the [str.replace(old, new)](https://docs.python.org/3/library/stdtypes.html?#str.replace) method. str.replace() takes in two parameters, old and new, and replaces all occurrences of old within a string with new.

In [29]:
n_install = '1,000,000+'
n_install_plus = n_install.replace('+','')
n_install_comma = n_install_plus.replace(',','')
n_install = float(n_install_comma)
print(n_install)
print(type(n_install))

1000000.0
<class 'float'>
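
The two replace calls can also be wrapped in a small helper. `parse_installs` is a hypothetical name used for illustration, not a function from this project; the sketch assumes every value contains only digits, commas, and an optional plus sign, which also covers the plain `0` entry in the frequency table.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    # Strip the thousands separators and the open-ended marker before converting
    return float(raw.replace(',', '').replace('+', ''))

for raw in ['1,000,000+', '500+', '0']:
    print(raw, '->', parse_installs(raw))
```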


In [30]:
cat_and_install = freq_table(android_clean,1)

for category in cat_and_install:
    total = 0
    len_category = 0 
    
    for app in android_clean:
        cat_in_clean = app[1]   
        if cat_in_clean == category:
            
            install_app = app[5]
            install_app = install_app.replace('+', '')
            install_app = install_app.replace(',', '')              
            
            total += float(install_app)
            len_category += 1
    
    avg_andr_installs = total / len_category
    
    print(category,'|',f'{round(avg_andr_installs):,}','|', len_category, 'apps in category')

ART_AND_DESIGN | 1,887,285 | 60 apps in category
AUTO_AND_VEHICLES | 632,501 | 84 apps in category
BEAUTY | 513,152 | 53 apps in category
BOOKS_AND_REFERENCE | 7,641,778 | 218 apps in category
BUSINESS | 1,663,759 | 419 apps in category
COMICS | 817,657 | 55 apps in category
COMMUNICATION | 35,153,714 | 314 apps in category
DATING | 828,971 | 170 apps in category
EDUCATION | 1,782,566 | 106 apps in category
ENTERTAINMENT | 11,375,402 | 87 apps in category
EVENTS | 249,581 | 64 apps in category
FINANCE | 1,319,851 | 345 apps in category
FOOD_AND_DRINK | 1,891,060 | 112 apps in category
HEALTH_AND_FITNESS | 3,972,300 | 288 apps in category
HOUSE_AND_HOME | 1,331,541 | 73 apps in category
LIBRARIES_AND_DEMO | 630,904 | 84 apps in category
LIFESTYLE | 1,369,955 | 364 apps in category
GAME | 14,256,218 | 944 apps in category
FAMILY | 3,345,019 | 1858 apps in category
MEDICAL | 96,944 | 395 apps in category
SOCIAL | 22,961,790 | 239 apps in category
SHOPPING | 6,966,909 | 201 apps in category
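
Because the loop prints categories in dictionary order, the most installed ones are hard to spot. One hedged way to rank them is to collect the averages and sort them, much like `display_table` does for frequencies. The values below are copied from a few rows of the output above; `avg_installs` and `ranked` are hypothetical names.

```python
# A few averages copied from the printed output (category -> average installs)
avg_installs = {
    'LIFESTYLE': 1369955,
    'GAME': 14256218,
    'MEDICAL': 96944,
    'COMMUNICATION': 35153714,
    'SOCIAL': 22961790,
}

# Sort by average installs, largest first
ranked = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)

for category, average in ranked:
    print(category, '|', f'{average:,}')
```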

## Conclusion

There are definitely some good sectors to pursue. The one we found to be a good starting point to test is LIFESTYLE. Apps in that sector average 1,369,955 installs, and its 364 apps cover anything from DIY lifestyle to alarm clocks. This seems like a good choice because the sector is less competitive than others, which should make market penetration easier.

In [32]:
for app in android_clean:
    if app[1] == 'LIFESTYLE':
        print(app[0], ':', app[5])

Dollhouse Decorating Games : 5,000,000+
metroZONE : 10,000,000+
Easy Hair Style Design : 100,000+
Talking Babsy Baby: Baby Games : 10,000,000+
Black Wallpaper, AMOLED, Dark Background: Darkify : 5,000,000+
Girly Wallpapers Backgrounds : 1,000,000+
Chart - Myanmar Keyboard : 5,000,000+
Easy Makeup Tutorials : 1,000,000+
Horoscopes – Daily Zodiac Horoscope and Astrology : 10,000,000+
Entel : 1,000,000+
ZenUI Safeguard : 1,000,000+
Live 4D Results ! (MY & SG) : 5,000,000+
Diary with lock : 10,000,000+
FOSSIL Q: DESIGN YOUR DIAL : 500,000+
Telstra : 5,000,000+
Family Locator - GPS Tracker : 10,000,000+
Van Nien 2018 - Lich Van su & Lich Am : 1,000,000+
Safeway : 1,000,000+
HTC Speak : 10,000,000+
Kawaii Easy Drawing : How to draw Step by Step : 5,000,000+
Tattoodo - Find your next tattoo : 1,000,000+
H&M : 10,000,000+
Samsung+ : 50,000,000+
Anime Avatar Creator: Make Your Own Avatar : 1,000,000+
Beautiful Design Birthday Cake : 500,000+
Pronunciation and know the name of the caller from hi