# Profitable App Profiles

## About Project:
In this project, we will be analysing data about apps from the Apple app store as well as the Android app store.

## Goal of Project:
Our goal is to help developers understand what kinds of apps are more likely to be downloaded by users, and what kinds of apps are likely to be profitable.

### Dataset sources:
[**Android (Google Playstore) Dataset**](https://www.kaggle.com/lava18/google-play-store-apps)

[**Apple App Store Dataset**](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader
def open_dataset(dataset, header=True): #takes a file path and a flag saying whether the first row is a header
    with open(dataset, encoding='utf8') as opened_file: #utf8 is assumed; the with block closes the file once it has been read
        read_file = reader(opened_file)
        data = list(read_file) #converts the file into a list of rows
    if header:
        header_dataset = data[0]
        list_dataset = data[1:]
        return header_dataset, list_dataset #if there is a header, a tuple of the header and data will be returned
    else:
        return data #if there is no header, all rows will be returned

In [2]:
#opening the Android data set
android_data = open_dataset('googleplaystore.csv')
android_data_header, android_dataset = android_data #assigning variables to the header and actual data
#print(android_dataset)

#opening the Apple data set
apple_data = open_dataset('AppleStore.csv')
apple_data_header, apple_dataset = apple_data #assigning variables to the header and actual data
#print(apple_data_header)

In the cells above, we passed the Android and Apple CSV files to `open_dataset` and unpacked each result into two variables. One variable holds just the header of the dataset, while the second holds the full dataset without the header.
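
To make the header/data split concrete, here is a minimal sketch that runs the same logic on an in-memory CSV instead of a file on disk. The sample text and the names `sample_csv`, `sample_header` and `sample_rows` are made up for illustration and are not part of the project.

```python
from csv import reader
from io import StringIO

# Hypothetical two-row CSV standing in for a real file such as 'AppleStore.csv'.
sample_csv = StringIO("id,track_name,price\n1,Facebook,0.0\n2,Instagram,0.0\n")

rows = list(reader(sample_csv))
# Same split that open_dataset performs when header=True.
sample_header, sample_rows = rows[0], rows[1:]

print(sample_header)
print(len(sample_rows))
```

Note that every value comes back as a string, which is why later cells convert review and rating counts with `float()`.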

### Exploring our datasets

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False): #dataset takes in list, start and end take in integers
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
#showing the first 5 rows of the Android dataset, with how many rows and columns there are. The header row is not counted because it is stored in a separate variable
print("These are the first 5 rows of the Android dataset\n")
explore_data(android_dataset, 0, 5, rows_and_columns=True) #android dataset, keyword argument


print("\nThese are the column names for the Android dataset:\n")
print("\n".join(android_data_header))




These are the first 5 rows of the Android dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13

We can see the first 5 rows of the Android dataset, without the header. The dataset has 10841 rows and 13 columns. The column names are listed above.

In [5]:
print("\nThese are the first 5 rows of the Apple dataset")
explore_data(apple_dataset, 0, 5, True) #positional argument

print("\nThese are the column names for the Apple dataset:\n")
print("\n".join(apple_data_header))


These are the first 5 rows of the Apple dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16

These are the column names for the Apple dataset:

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


We can see the first 5 rows of the Apple dataset, without the header. The dataset has 7197 rows and 16 columns. The column names are listed above. The documentation for the column names can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

### Finding wrong/incorrect data

In [6]:
#trying to find wrong data in the Android dataset
for row in android_dataset:
    if len(row) != len(android_data_header):
        print("The wrong data row is:")
        print(row)
        print("\n")
        print("Index of the wrong row is " + str(android_dataset.index(row)))

print("\nThe headers of the Android dataset are:\n")
print(android_data_header)

print("\nA normal row in an Android dataset looks like:\n")
print(android_dataset[3])
        

The wrong data row is:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index of the wrong row is 10472

The headers of the Android dataset are:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

A normal row in an Android dataset looks like:

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


It seems that row 10472 is missing the **Category** data, shifting everything by one cell.
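
The length check can be sketched on a tiny made-up dataset. This variant uses `enumerate` to get each row's position directly, rather than looking it up afterwards with `list.index`, which returns the position of the first row equal to the one searched for. The names `toy_header`, `toy_rows` and `bad_indexes` are illustrative only.

```python
# Hypothetical miniature dataset: the second row lacks its Category value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
]

# enumerate yields the position alongside each row, so no second lookup is needed.
bad_indexes = [i for i, row in enumerate(toy_rows) if len(row) != len(toy_header)]
print(bad_indexes)
```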

In [7]:
corrected_android_dataset = android_dataset[:10472] + android_dataset[10473:] #new dataset without the wrong data
#print(corrected_android_dataset[10470:10480])
explore_data(corrected_android_dataset, 0, 5, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10840
Number of columns: 13


The wrong data row has been removed, and the new corrected data set is stored in the variable `corrected_android_dataset`. The new number of rows is 10840, and columns remain the same.

### Removing duplicate entries

A closer look at the Android dataset shows that it contains duplicate entries. In this section, we find those duplicates and decide on the criterion for removing them.

In [8]:
#initializing 2 lists for unique and duplicate apps
unique_android_apps = []
duplicate_android_apps = []
for app in corrected_android_dataset:
    name = app[0]
    
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)

print("There are " + str(len(unique_android_apps)) + " unique apps. Some of the unique apps are:\n")
print("\n".join(unique_android_apps[:15]))
print("\n")

print("There are " + str(len(duplicate_android_apps)) + " duplicate apps. Some of the duplicate apps are:\n")
print("\n".join(duplicate_android_apps[:15]))
print("\n")
print("An example of a duplicate app is Instagram. The only difference between the rows is the number of reviews (index 3):\n")
for app in corrected_android_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

There are 9659 unique apps. Some of the unique apps are:

Photo Editor & Candy Camera & Grid & ScrapBook
Coloring book moana
U Launcher Lite – FREE Live Cool Themes, Hide Apps
Sketch - Draw & Paint
Pixel Draw - Number Art Coloring Book
Paper flowers instructions
Smoke Effect Photo Maker - Smoke Editor
Infinite Painter
Garden Coloring Book
Kids Paint Free - Drawing Fun
Text on Photo - Fonteee
Name Art Photo Editor - Focus n Filters
Tattoo Name On My Photo Editor
Mandala Coloring Book
3D Color Pixel by Number - Sandbox Art Coloring


There are 1181 duplicate apps. Some of the duplicate apps are:

Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack
FreshBooks Classic
Insightly CRM
QuickBooks Accounting: Invoicing & Expenses
HipChat - Chat Built for Teams
Xero Accounting Software


An example of a duplicate app is Instagram. The only difference between the rows is the number of reviews (index 3):



We can see that there are 1181 duplicate app listings in the Android dataset. Since the main difference between duplicate entries is the number of reviews (index 3), we can use it as the criterion for removing duplicates. The higher the number of reviews, the more recently the data was likely collected, so for each app we keep the listing with the most reviews and remove the others.

In [9]:
reviews_max = {}
for app in corrected_android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews: #if the app name already exists as a key in the dictionary, and the reviews in the dict are LESS than the new row, then we update the dictionary
        reviews_max[name] = n_reviews
        
    if name not in reviews_max: #if the app does not exist, we add it into the dictionary
        reviews_max[name] = n_reviews
        
print("The number of items in the dictionary are: " + str(len(reviews_max.keys())))

The number of items in the dictionary are: 9659


We now have a dictionary where the key is the unique name of the app, and the value is the max number of reviews.
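
As a small illustration of what this dictionary holds, the sketch below builds one from three made-up rows. The rows, their review counts, and the names `toy_apps` and `toy_reviews_max` are invented for the example. It uses `dict.get` with a default to fold the two `if` statements into one.

```python
# Hypothetical rows: [name, category, rating, reviews]; the review counts are made up.
toy_apps = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159880'],
]

toy_reviews_max = {}
for app in toy_apps:
    name = app[0]
    n_reviews = float(app[3])
    # dict.get returns -1 for unseen names, so the first occurrence is always stored.
    if n_reviews > toy_reviews_max.get(name, -1):
        toy_reviews_max[name] = n_reviews

print(toy_reviews_max)
```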

In [10]:
android_clean = []
already_added = []

for app in corrected_android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    #now we compare the reviews_max dictionary to our dataset
    if (reviews_max[name] == n_reviews) and (name not in already_added): #the second condition is needed because duplicate rows can share the same maximum number of reviews; without it, such an app would be added more than once
        android_clean.append(app) #appending the whole row. This will be our new dataset
        already_added.append(name) #only appending the name of the app
print("To make sure everything went like we wanted it to, we explore the data from the new android_clean dataset\n")
explore_data(android_clean,0,10,True)

To make sure everything went like we wanted it to, we explore the data from the new android_clean dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', 

In order to make sure we have a new dataset that only contains unique app names, with the highest number of reviews, we first initialized two lists: `android_clean` and `already_added`.

We then looped over `corrected_android_dataset` and compared each row to the `reviews_max` dictionary. A row was appended to the new `android_clean` list only if both of the following held:
1. The number of reviews in the row matches the maximum stored in `reviews_max` for that app, AND
2. The name does not already exist in the `already_added` list.

The second condition covers duplicate rows that share the same maximum review count; without it, such an app would be added more than once.

Exploring the data matches what we expected. `android_clean` has **9659** entries, the same as the number of unique apps found earlier, and the number of columns remains the same.
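
To see why the name check matters, consider a made-up case in which two rows for the same app carry an identical review count. The sketch below compares filtering with and without that check. All data and names here (`toy_apps`, `without_check`, `with_check`, `seen`) are hypothetical, and it tracks names in a set rather than a list purely because membership tests on a set are faster.

```python
# Hypothetical rows: [name, reviews].
toy_apps = [
    ['Box', '100'],
    ['Box', '100'],   # exact duplicate: same name, same review count
    ['Slack', '50'],
]
toy_reviews_max = {'Box': 100.0, 'Slack': 50.0}

# Only the first condition: both 'Box' rows pass.
without_check = [app for app in toy_apps if toy_reviews_max[app[0]] == float(app[1])]

# Both conditions: the second 'Box' row is skipped.
with_check = []
seen = set()
for app in toy_apps:
    name = app[0]
    if toy_reviews_max[name] == float(app[1]) and name not in seen:
        with_check.append(app)
        seen.add(name)

print(len(without_check), len(with_check))
```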

### Removing non-English apps

In [11]:
def is_english(string):
    """
    This function will take a string, and if it is non-English, it will return False, else True.
    """
    for i in string:
        if ord(i) > 127:
            return False
        
    return True

is_english('Docs To Go™ Free Office Suite')

False

The function above is too strict: we passed in an English app name, yet it returned False. The reason is that characters such as ™ and emojis have Unicode code points above 127, which puts them outside the ASCII range (0 to 127). To filter our strings more sensibly, we can allow up to 3 non-ASCII characters per name. This will likely account for emojis and special symbols like ™.
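
A quick way to check this is to print the code points of a few characters with the built-in `ord`. The sketch below is illustrative only; the variable names are made up, and the sample characters are taken from app names that appear in the datasets.

```python
# Code points of an ASCII letter, the trademark sign, and a Japanese character.
for character in ['a', '™', '教']:
    print(repr(character), ord(character))

# Collect the characters that fall outside the ASCII range in one app name.
non_ascii = [ch for ch in 'Docs To Go™ Free Office Suite' if ord(ch) > 127]
print(non_ascii)
```

Only the trademark sign is flagged in this name, so a threshold of 3 non-ASCII characters lets it through.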

In [12]:
def is_english_better(string):
    """
    This function does pretty much the same thing as the last function, but allows up to 3 non-ASCII characters
    """
    non_ascii_count = 0 #initializing a non_ascii count
    
    for i in string:
        if ord(i) > 127:
            non_ascii_count +=1
        
    if non_ascii_count > 3:
        return False
    else:
        return True
    
is_english_better('Docs To Go™ Free Office Suite')

True

Since this filtration system works better for us, we will use it to filter out all non-English apps.

In [13]:
def is_english_app(dataset,index_of_name=0): #the second parameter is because Android data set name index is 0, but Apple dataset is 1
    
    dataset_english = []
    
    for app in dataset:
        name = app[index_of_name]
        
        if is_english_better(name): #checking that name is in english using previous function
            dataset_english.append(app)
    
    return dataset_english

print("The Android apps with English names are:\n")
android_english = is_english_app(android_clean)
explore_data(android_english,0,5,True)
print("\n")

print("The Apple apps with English names are:\n")
apple_english = is_english_app(apple_dataset,1)
explore_data(apple_english,0,5,True)



The Android apps with English names are:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


The 

Using our function to filter out only English names, we have now created two new datasets:
1. Android dataset = `android_english`
2. Apple dataset = `apple_english`

### Isolating the free apps
In this section, we define a function that splits a dataset into two lists: the free apps and the non-free apps.

In [14]:
def free_apps(dataset, price_index, int_or_float='integer'):
    """
    This function will take in the dataset, the index of the price, and if the price is an integer, i.e. '0' or a float, i.e. '0.0'. If the price is an integer, argument passed should be 'integer', else it should be 'float'
    """
    free_apps_list = []
    not_free_apps_list = []
    
    for app in dataset:
        price = app[price_index]
        
        if int_or_float == 'integer':
            if price == '0':
                free_apps_list.append(app)
            else:
                not_free_apps_list.append(app)
        if int_or_float == 'float':
            if price == '0.0':
                free_apps_list.append(app)
            else:
                not_free_apps_list.append(app)
            
    return free_apps_list, not_free_apps_list


The function above returns a tuple of two lists: the free apps and the non-free apps. This tuple needs to be unpacked into separate variables. If the dataset stores a free price as `'0'`, pass `'integer'` as the `int_or_float` argument; if it stores it as `'0.0'`, pass `'float'`.

The index for Apple Store price data is **4**.

The index for Android Store price data is **7**.
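
The `int_or_float` flag exists only because the two stores write a free price differently (`'0'` versus `'0.0'`). One possible alternative, sketched below, is to convert the price text to a number before comparing. The helper `is_free` is hypothetical and not part of the notebook, and the handling of a leading `'$'` is an assumption about how paid prices might be written.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Stripping a leading '$' is an assumption about how paid rows are formatted.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

toy_prices = ['0', '0.0', '$4.99', '2.99']
print([is_free(p) for p in toy_prices])
```

This sketch raises `ValueError` on an empty or non-numeric price, so malformed rows would have to be removed first.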



In [15]:
apple_free_notfree_tuple = free_apps(apple_english, 4, 'float')
apple_free, apple_not_free = apple_free_notfree_tuple

explore_data(apple_free,0,5,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 3222
Number of columns: 16


For Apple's App Store, we have 3222 apps that are free.

In [16]:
android_free_notfree_tuple = free_apps(android_english, 7)
android_free, android_not_free = android_free_notfree_tuple

explore_data(android_free,1,5,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


For Android's Play Store, we have 8864 apps that are free.

### Most common apps by genre

In this section, we explore the most common genres in both the Apple App Store and the Android Play Store.

The rationale is that we would like to build an app that is actually successful. We want to analyze the market to see which kinds of apps succeed, and then build that kind of app for Android first. If it gets a good response, we will build it for Apple as well.

We will start by inspecting the headers of both datasets to figure out which columns we can base frequency tables on.


In [17]:
print("Android dataset headers:\n")
print(android_data_header)
print("\n")

print("Apple dataset headers:\n")
print(apple_data_header)
print("\n")


Android dataset headers:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple dataset headers:

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




From the headers, we see that:
* For the Android dataset, we can build a frequency table using the 'Category' (index 1) and the 'Genres' (index 9) columns.
* For the Apple dataset, we can use the 'prime_genre' column (index -5)

In [18]:
def freq_table(dataset, index):
    freq_dict = {}
    
    for app in dataset:
        freq_param = app[index] #for every row, the column we are trying to build a freq table for
        
        if freq_param in freq_dict: #if the column name exists as a key in our dictionary, just add one to the value of that key
            freq_dict[freq_param] += 1
        elif freq_param not in freq_dict: #if it does not, then initialize it with 1
            freq_dict[freq_param] = 1
    
    total = 0
    
    for i in freq_dict.values(): #summing the counts to get the total number of apps
        total += i
    
    freq_dict_percentage = {}
    
    for c_name in freq_dict:
        percent = (freq_dict[c_name] / total) * 100
        freq_dict_percentage[c_name] = percent #you're taking the key from the main freq_dict, adding the key to our new dict, and assigning it a value of the percent we just calculated
        
    return freq_dict_percentage


def display_table(dataset, index):
    """
    This function was provided by Dataquest.
    It takes in the dataset and index parameter,
    generates a frequency table using the function we wrote,
    transforms the table (dictionary) into a list of tuples,
    sorts it into descending order,
    prints the entries in descending order
    """
    table = freq_table(dataset, index) #just replicating the dictionary we create from our function
    table_display = [] #initializing new list
    for key in table:
        key_val_as_tuple = (table[key], key) #converts each dictionary entry into a tuple, with the value first, then the key
        table_display.append(key_val_as_tuple) #appends the tuple to the list

    table_sorted = sorted(table_display, reverse = True) #sorts the list. Reason above value comes first is so that value is sorted, not the key
    for entry in table_sorted: #prints it in descending order, with key:value notation
        print(entry[1], ':', entry[0])

We first wrote a function that takes a dataset and the index of the column to build a frequency table for, and returns a dictionary that maps each value in that column to its percentage of the total. For example, we could use it to display every genre and the share of the Apple App Store it makes up.

The next function, provided by Dataquest, takes this dictionary of percentages, and prints out the whole dictionary in descending order.
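
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. The genre list below is made up for illustration, and the output order for tied percentages may differ from what `display_table` prints.

```python
from collections import Counter

# Hypothetical genre column extracted from five apps.
toy_genres = ['Games', 'Games', 'Music', 'Games', 'Weather']

counts = Counter(toy_genres)          # genre -> number of apps
total = sum(counts.values())          # total number of apps
percentages = {genre: count / total * 100 for genre, count in counts.items()}

# Sort by percentage, largest first, similar to display_table.
for genre, percent in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percent)
```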

Time to test the functions on the Android and Apple App Stores. Reminder, we are looking at the following:
1. Android Play Store:
    * Category (index 1)
    * Genres (index 9)
2. Apple App Store:
    * prime_genre (index -5)


In [19]:
print("The following data is the percentage of app categories present for Android Play Store:\n")
display_table(android_free, 1)
print("\n")
print("The following data is the percentage of app genres present for Android Play Store:\n")
display_table(android_free,9)

The following data is the percentage of app categories present for Android Play Store:

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.80099

In [20]:
print("The following data is the percentage of app genres present for Apple App Store:\n")
display_table(apple_free, -5)

The following data is the percentage of app genres present for Apple App Store:

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From above, we can surmise the following about the respective App stores:

1. For Android apps, almost 19% fall under the 'FAMILY' category, followed by 'GAME' at almost 10% and 'TOOLS' at about 8.5%.
2. For Apple apps, well over half (about 58%) fall under the 'Games' genre.

A more detailed look at the two stores reveals the following:
The most common category among the free English Android apps is Family, followed by Games, and then Tools, Business, Lifestyle, etc. In the Apple App Store, Games account for more than half of the free English apps. In both cases, the largest share of apps is built for entertainment rather than for practical purposes.

However, we do see more of the 'useful/practical' categories in the Android Play Store. No category accounts for even half of the total apps, and compared to the Apple App Store, the 'useful/practical' apps hold a higher share. This is further confirmed when we look at the genres in the Android Store, where tools rank ahead of fun apps.

### Most Popular Apps by Genre for the Apple App Store

The number of apps in each genre does not tell us which apps are actually used or downloaded the most. The Apple dataset has no install counts, so we will use the total number of user ratings (`rating_count_tot`, index 5) as a proxy and find which genre has the highest average number of ratings per app.

To do so, we will iterate over the genres and, for each genre, sum the number of ratings of the apps that belong to it, divide by the number of apps in that genre, and store the average in a new dictionary.
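
The averaging described here can also be done in a single pass over the apps by keeping a running total and a running count per genre. The sketch below shows that variant on made-up data; the rows, numbers, and names (`toy_apps`, `totals`, `counts`, `averages`) are invented for the example.

```python
# Hypothetical rows: (genre, rating_count_tot); the numbers are invented.
toy_apps = [
    ('Navigation', '300'),
    ('Navigation', '100'),
    ('Reference', '50'),
]

totals = {}   # genre -> sum of rating counts
counts = {}   # genre -> number of apps

for genre, n_ratings in toy_apps:
    totals[genre] = totals.get(genre, 0) + float(n_ratings)
    counts[genre] = counts.get(genre, 0) + 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}
print(averages)
```

With only a few thousand rows either approach is fast enough; the single pass mainly avoids rescanning the whole dataset once per genre.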

In [21]:
# We start by generating a frequency table for the unique genres using the function we wrote before.

apple_genres = freq_table(apple_free, -5)
'''
We start off by taking every genre, and the total number of ratings is zero.
The number of apps in that genre are zero also.

Then we compare the main dataset, and every time we see that in the loop of the main dataset, the genre matches that of our first loop,
we take the ratings of that app, and we add it to the total for that genre
and we add one to the length of that genre, to show how many apps belong to the genre.

The genre_and_rating dictionary keeps filling after every loop.
'''
apple_genre_and_rating = {}
for genre in apple_genres:
    total = 0 # total number of ratings, per genre
    
    len_genre = 0 #number of apps in each genre
    
    for app in apple_free:
        genre_app = app[-5]
        
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    average = total / len_genre
    
    apple_genre_and_rating[genre] = average
    
    print(genre, ':', average)
#print(apple_genre_and_rating)
    

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


From our analysis of the genres in the Apple App Store, Navigation has the highest average number of ratings per app. Reference comes next, followed by Social Networking.

To see which apps are dominating the ratings, we can use a loop to check this out.


In [22]:
#In the Navigation genre, let's see how many ratings each app has.
print("These are the ratings for the apps in the Navigation genre:\n")
for app in apple_free:
    genre = app[-5]
    if genre == "Navigation":
        print(app[1] + ": " + app[5])

These are the ratings for the apps in the Navigation genre:

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5


In [23]:
#In the Reference genre, let's see how many ratings each app has.

for app in apple_free:
    genre = app[-5]
    if genre == "Reference":
        print(app[1] + ": " + app[5])

Bible: 985920
Dictionary.com Dictionary & Thesaurus: 200047
Dictionary.com Dictionary & Thesaurus for iPad: 54175
Google Translate: 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran: 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition: 17588
Merriam-Webster Dictionary: 16849
Night Sky: 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE): 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools: 4693
GUNS MODS for Minecraft PC Edition - Mods Tools: 1497
Guides for Pokémon GO - Pokemon GO News and Cheats: 826
WWDC: 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free: 718
VPN Express: 14
Real Bike Traffic Rider Virtual Reality Glasses: 8
教えて!goo: 0
Jishokun-Japanese English Dictionary & Translator: 0


It seems that 'Waze' and 'Google Maps' dominate the Navigation genre, while 'Bible' and 'Dictionary.com' dominate the Reference genre.

My recommendation for the Apple App Store would be to develop a reference app. Maybe this app could offer translations of the Bible from one language to another, combining the Bible and dictionary ideas.

Reason for this recommendation: Navigation ratings are concentrated in 'Waze' and 'Google Maps'. Both apps are owned by Google, a giant that a startup would need a significant investment to compete with.

### Most Popular Apps by Genre on Google Play

For the Google Play Store, we have data on the number of installs for each app. Using this, we can take an approach similar to the one for the Apple App Store and use the average number of installs per category to determine which categories are popular.
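
The install counts are stored as text such as `'10,000+'`, so they have to be cleaned before they can be averaged. A minimal sketch of that conversion follows; the helper name `parse_installs` is hypothetical and not part of the notebook.

```python
def parse_installs(text):
    """Turn an install count such as '10,000+' into a float."""
    # Remove the trailing '+' and the thousands separators before converting.
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))
print(parse_installs('50,000,000+'))
```

Because a value like `'10,000+'` means at least 10,000 installs, the averages computed from these numbers are lower-bound approximations rather than exact counts.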

In [24]:
android_categories = freq_table(android_free, 1)
android_category_and_downloads = {}
for category in android_categories:
    total = 0
    len_category = 0
    
    for app in android_free:
        category_app = app[1]
        
        if category_app == category:
            n_install = app[5]
            n_install = n_install.replace('+', '')
            n_install = n_install.replace(',', '')
            n_install = float(n_install)
            total += n_install
            len_category += 1
            
    average = total / len_category
    android_category_and_downloads[category] = average
    print(category + ": " + str(average))
            

ART_AND_DESIGN: 1986335.0877192982
AUTO_AND_VEHICLES: 647317.8170731707
BEAUTY: 513151.88679245283
BOOKS_AND_REFERENCE: 8767811.894736841
BUSINESS: 1712290.1474201474
COMICS: 817657.2727272727
COMMUNICATION: 38456119.167247385
DATING: 854028.8303030303
EDUCATION: 1833495.145631068
ENTERTAINMENT: 11640705.88235294
EVENTS: 253542.22222222222
FINANCE: 1387692.475609756
FOOD_AND_DRINK: 1924897.7363636363
HEALTH_AND_FITNESS: 4188821.9853479853
HOUSE_AND_HOME: 1331540.5616438356
LIBRARIES_AND_DEMO: 638503.734939759
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
FAMILY: 3695641.8198090694
MEDICAL: 120550.61980830671
SOCIAL: 23253652.127118643
SHOPPING: 7036877.311557789
PHOTOGRAPHY: 17840110.40229885
SPORTS: 3638640.1428571427
TRAVEL_AND_LOCAL: 13984077.710144928
TOOLS: 10801391.298666667
PERSONALIZATION: 5201482.6122448975
PRODUCTIVITY: 16787331.344927534
PARENTING: 542603.6206896552
WEATHER: 5074486.197183099
VIDEO_PLAYERS: 24727872.452830188
NEWS_AND_MAGAZINES: 9549178.467741935
MA