# Maximise in App Add Revenue
Company A builds free to download apps for the IOS and Google Play platforms. 
 
It relies on in app adds for its revenue and requires large numbers of users.  
We are working as Data Analysts to try and uncover what apps may attract the most users.  

By exploring some of the data from the Apple Store and Google Play Store we will look to see what apps are likely to attract more customers.

## Imports

In [2]:
from csv import reader
import emoji as em
from operator import itemgetter

## Functions

Lets use the following functions to make our lives a bit easier. The first function opens and reads the data and the secound will return a slice of the data as well as the number of rows and columns.

In [3]:
def open_dataset(file_name, header=True):
    '''
    Opens a .csv file.

    Args:
        file_name (str): Name of .csv file to open.
        header (bool, optional): Has a header or not. Defaults to True.

    Returns:
        Lists: Opens the .csv file and converts to a list or list of lists
    '''          
    opened_file = open(file_name, encoding="utf8")
    read_file = reader(opened_file)
    data = list(read_file)
    # Left this in to create two datasets if header is True.
    if header:
        return data[1:], data[0]
    else:
        return data
  
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    Explores a data set and returns a slice of the data. If rows and columns is set to True will return the number of rows and columns.

    Args:
        dataset (variable): Name of variable holding the data.
        start (int): Index of first row to slice.
        end (int): Index of last row to slice.
        rows_and_columns (bool, optional): Set to True if the data has a header. Defaults to False.
    '''
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## The Raw Data
Lets explore the two datasets from Kaggle to get a grasp of the data and to investigate the header names to determine which columns will be of best use for our research.

### Google Data

In [4]:
google_data, google_data_header = open_dataset('googleplaystore.csv')
print(google_data_header)
print('\n')
print(explore_data(google_data, 0, 4, rows_and_columns=True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13
None


### Column Headers
From the headers we can see these column names to be useful to us:  
`'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, and `'Genres'`

### Apple Data

In [5]:
apple_data, apple_data_header = open_dataset('AppleStore.csv')
print(apple_data_header)
print('\n')
print(explore_data(apple_data, 0, 4, rows_and_columns=True))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16
None


### Column Headers
From the headers we can see these column names to be useful to us:  
`'track_name'`, `'currency'`, `'price'`, `'rating_count_total'`, and `'prime_genre'`

## Data Cleaning
Remove errored row in the android data set as documented on Kaggle by @Giovanni Chrysostomo.  
First we identify the errored row.

In [6]:
print(google_data[10472])
print('\n')
print(google_data_header)
print(len(google_data))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
10841


We then remove the errored row using an `if` statement, to avoid incorrect deletion if the code runs again. We then get a row count to confirm deletion.

In [7]:
print(len(google_data))
if google_data[10472][0] == 'Life Made WI-Fi Touchscreen Photo Frame':
    del(google_data[10472])
print(len(google_data))

10841
10840


On quick inspection there appears to be duplicate data for Instagram in the android data set. A `for loop` uncovers 4 occurrence's of Instagram.

In [8]:
instagram_duplicates = []
for name in google_data:
    if 'Instagram' == name[0]:
        instagram_duplicates.append(name[0:4])
print(google_data_header[0:4])
print(instagram_duplicates)

['App', 'Category', 'Rating', 'Reviews']
[['Instagram', 'SOCIAL', '4.5', '66577313'], ['Instagram', 'SOCIAL', '4.5', '66577446'], ['Instagram', 'SOCIAL', '4.5', '66577313'], ['Instagram', 'SOCIAL', '4.5', '66509917']]


On closer inspection we can see that there are differences in the `values` in the `Reviews` column (`index [3]`).  

Having duplicate data can corrupt our results. As we need `Reviews` to back up our results we will only remove the duplicates with the lowest number of `Reviews`.  

We will now determine if the data has any more duplicates.

In [9]:
duplicate_apps = []
unique_apps = []

for name in google_data:
    name = name[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of unique apps:', len(unique_apps))
print('Number of duplicate apps:', len(duplicate_apps))
print('Example of duplicate apps:', '\n', '\n', sorted(duplicate_apps[:20]))

Number of unique apps: 9659
Number of duplicate apps: 1181
Example of duplicate apps: 
 
 ['AdWords Express', 'Asana: organize team projects', 'Box', 'Box', 'Crew - Free Messaging and Scheduling', 'FreshBooks Classic', 'Google Ads', 'Google Analytics', 'Google My Business', 'Google My Business', 'HipChat - Chat Built for Teams', 'Insightly CRM', 'MailChimp - Email, Marketing Automation', 'Quick PDF Scanner + OCR FREE', 'QuickBooks Accounting: Invoicing & Expenses', 'Slack', 'Xero Accounting Software', 'ZOOM Cloud Meetings', 'Zenefits', 'join.me - Simple Meetings']


We have 1181 occurrence's of duplicate apps.  
  
We will have to clean these up.  
  
To do this we will create a dictionary by looping over our data and checking to see that only the duplicate apps with the highest reviews are added.

In [10]:
max_reviews = {}
for app in google_data:
    name = app[0]
    num_reviews = float(app[3])
    if name in max_reviews and max_reviews[name] > num_reviews: # Ensures that the higher rated app is copied to the dictionary
        max_reviews[name] = num_reviews
    elif name not in max_reviews:
        max_reviews[name] = num_reviews

In an earlier section we determined there where 9659 unique apps. Lets confirm our dictionary has the correct amount of entries. 

In [11]:
print('Expected length:', len(unique_apps), '\nActual length:', len(max_reviews))

Expected length: 9659 
Actual length: 9659


Now we have all the unique data in our dictionary we can clean them from the data.

In [12]:
android_clean = []
already_added = []

for app in google_data:
    name = app[0]
    num_reviews = float(app[3])
    
    if (max_reviews[name] == num_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [13]:
explore_data(android_clean, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


We will now check the apple data set.

In [14]:
print(apple_data_header)
print(apple_data[1])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


It appears we have two duplicates.

In [15]:
duplicate_apps_apple = []
unique_apps_apple = []

for name in apple_data:
    name = name[1]
    if name in unique_apps_apple:
        duplicate_apps_apple.append(name)
    else:
        unique_apps_apple.append(name)
print('Number of unique apps:', len(unique_apps_apple))
print('Number of duplicate apps:', len(duplicate_apps_apple))
print('Example of duplicate apps:', '\n', '\n', sorted(duplicate_apps_apple))

Number of unique apps: 7195
Number of duplicate apps: 2
Example of duplicate apps: 
 
 ['Mannequin Challenge', 'VR Roller Coaster']


Lets identify the duplicate data.

In [16]:
def deep_index(lst, w):
    return [(i, sub.index(w)) for (i, sub) in enumerate(lst) if w in sub]
print(deep_index(apple_data, 'Mannequin Challenge'))
print(deep_index(apple_data, 'VR Roller Coaster'))

[(2948, 1), (4463, 1)]
[(4442, 1), (4831, 1)]


As there are only two duplicates lets examine them.

In [17]:
print(apple_data[2948])
print(apple_data[4463])
print(apple_data[4442])
print(apple_data[4831])

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Lets delete the duplicates with the least reviews.

In [18]:
print(len(apple_data))
if apple_data[4463][1] == 'Mannequin Challenge' and apple_data[4463][2] == '59572224':
    del(apple_data[4463])
print(len(apple_data))

7197
7196


In [19]:
print(len(apple_data))
if apple_data[4830][1] == 'VR Roller Coaster' and apple_data[4830][2] == '240964608':
    del(apple_data[4830])
print(len(apple_data))

7196
7195


#### Removing Non-English Apps

In an attempt to achieve a more accurate language check to detect the non english apps we will remove the emoji's first using the `emoji` library.

In [20]:
android_demoji = []
apple_demoji = []

for app in android_clean:
    app[0] = em.replace_emoji(app[0]).strip()
    android_demoji.append(app)
print(len(android_demoji))
explore_data(android_demoji, 139, 140, True)

for app in apple_data:
    app[0] = em.replace_emoji(app[0]).strip()
    apple_demoji.append(app)
print(len(apple_demoji))
explore_data(apple_demoji, 139, 140, True)

9659
['Wattpad  Free Books', 'BOOKS_AND_REFERENCE', '4.6', '2914724', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Books & Reference', 'August 1, 2018', 'Varies with device', 'Varies with device']


Number of rows: 9659
Number of columns: 13
7195
['552039496', 'The Room', '338273280', 'USD', '0.99', '143908', '1056', '5.0', '5.0', '1.0.4', '9+', 'Games', '24', '5', '1', '1']


Number of rows: 7195
Number of columns: 16


By removing the emoji's we were able to reduce the non ascii characters to two, allowing a more accurate filter.

In [21]:
def is_english(string):
    '''
    Checks for ascii characters above 127 in a string, if there are more than 2 characters it flags False.  

    Args:
        string (Variable):  Column containing strings to check.

    Returns:
        Bool: True if =< 127 or False if > 127.
    '''
    non_ascii = 0

    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 2: # reduced the number for a more accurate result
        return False
    else:
        return True

We can now separate the english and non english apps from the android data.

In [22]:
android_english = []
android_foreign = []
for i in android_demoji:
    name = i[0]
    if is_english(name):
        android_english.append(i)
    else:
        android_foreign.append(i)
     
explore_data(android_english, 0, 3, True)
explore_data(android_foreign, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9601
Number of columns: 13
['Truyện Vui Tý Quậy', 'COMICS', '4.5', '144', '4.7M', '10,000+', 'Free', '0', 'Everyone', 'Comics', 'July 19, 2018', '3.0', '4.0.3 and up']


['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


['At home - rental · real estate · room finding application such as apartment · apartment', 'HOUSE

We can now separate the english and non english apps from the apple data.

In [23]:
ios_english = []
ios_foreign = []

for app in apple_data:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
    else:
        ios_foreign.append(app)    
explore_data(ios_english, 0, 3, True)
explore_data(ios_foreign, 0, 4, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6153
Number of columns: 16
['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']


['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


['336141475', '优酷视频', '204959744', 'USD', '0.0', '4885', '0', '3.5', '0.0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']


['425349261', '网易新闻 - 精选好内容，算出你的兴趣', '133134336', 'USD', '0.0', '4263', '6', '4.5', '1.0', '23.2', '

### Isolating the Free Apps and final clean

Create a new list with only the free apps. Remove everything except digits in `Installs` column and convert `Installs` and `Reviews` to `int` and `Rating` to `float` so we can use them as numbers later on.

In [24]:
android_final = []
ios_final = []

for app in android_english:
    if type(app[5]) == str:
        # Remove everything except digits and convert to 'int'
        app[5] = int(''.join(c for c in app[5] if c.isdigit()))
    if type(app[2]) == str:
        # Replace 'NaN'
        app[2] = app[2].replace('NaN', '0')
    if type(app[2]) == str:
        app[2] = float(app[2])  # Convert to 'float'
    if type(app[3]) == str:
        app[3] = int(app[3]) # Convert to 'int'
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    app[7] = float(app[7])
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8850
3201


Check the `type` of values of the new datasets, using a `list comprehension`.

In [25]:
aft = [type(item) for item in android_final[1]]
print(aft)
ift = [type(item) for item in ios_final[1]]
print(ift)

[<class 'str'>, <class 'str'>, <class 'float'>, <class 'int'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'float'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]


Lets remind ourselves of the column header names.

In [26]:
print(apple_data_header)
print('\n')
print(google_data_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


#### Category investigation
Now we have a clean dataset we can investigate the categories to begin to isolate the important data.  
  
To do this we need to check the categories to see which ones have the most reviews.  
   
We will create a function to do this for us. (Copied from Dataquest with some small mods).

In [27]:
def freq_table(dataset, index):
    '''
    Iterates through a data set and counts the number of occurrences and returns the result 
    as a dictionary key pair with the value as a percentage.

    Args:
        dataset (variable): object containing a data set
        index (int): column index number

    Returns:
        dict: key pair with the value as a percentage
    '''
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = round((table[key] / total) * 100, 2)
        table_percentages[key] = percentage 
    
    return table_percentages

In [28]:
def display_table(dataset, index):
    '''
    Converts the dictionary created by the function 'freq_table' into a sorted list

    Args:
        dataset (variable): object containing a data set
        index (int): column index number
    '''
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], '%')

We can see from the Apple Store data that Games are nearly 60% of all the apps.  
  
The remaining 40% are made up of several other apps all below 8% with more than half of these being below 2%.

In [29]:
display_table(ios_final, -5)

Games : 58.23 %
Entertainment : 7.84 %
Photo & Video : 5.0 %
Education : 3.69 %
Social Networking : 3.31 %
Shopping : 2.59 %
Utilities : 2.47 %
Sports : 2.16 %
Music : 2.06 %
Health & Fitness : 2.03 %
Productivity : 1.75 %
Lifestyle : 1.56 %
News : 1.34 %
Travel : 1.25 %
Finance : 1.09 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.53 %
Business : 0.53 %
Book : 0.37 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


We can see from Google Playstore that the spread of apps is very thin. The categories FAMILY, GAME and TOOLS are the highest.

In [30]:
display_table(android_final, 1)

FAMILY : 18.82 %
GAME : 9.63 %
TOOLS : 8.44 %
BUSINESS : 4.59 %
PRODUCTIVITY : 3.9 %
LIFESTYLE : 3.89 %
FINANCE : 3.71 %
MEDICAL : 3.54 %
SPORTS : 3.41 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.07 %
PHOTOGRAPHY : 2.95 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.67 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.39 %
EDUCATION : 1.25 %
FOOD_AND_DRINK : 1.24 %
ENTERTAINMENT : 1.04 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.81 %
WEATHER : 0.79 %
EVENTS : 0.71 %
ART_AND_DESIGN : 0.68 %
PARENTING : 0.66 %
COMICS : 0.61 %
BEAUTY : 0.6 %


Lets concentrate our efforts on apps with 5000 reviews or more.

In [31]:
higher_rev = []
for apps in android_final:
    hr = apps[3]
    if hr >= 5000:
        higher_rev.append(apps)
        
print(round(len(higher_rev) / len(android_final) * 100, 2))

40.96


The categories FAMILY, GAME and TOOLS are still the highest.

In [32]:
display_table(higher_rev, 1)

FAMILY : 16.69 %
GAME : 15.89 %
TOOLS : 7.31 %
PHOTOGRAPHY : 4.61 %
PRODUCTIVITY : 4.28 %
COMMUNICATION : 4.0 %
SPORTS : 3.72 %
HEALTH_AND_FITNESS : 3.59 %
SHOPPING : 3.26 %
PERSONALIZATION : 3.2 %
SOCIAL : 3.09 %
FINANCE : 3.09 %
TRAVEL_AND_LOCAL : 2.62 %
LIFESTYLE : 2.46 %
NEWS_AND_MAGAZINES : 2.34 %
ENTERTAINMENT : 2.21 %
VIDEO_PLAYERS : 2.15 %
BUSINESS : 1.96 %
EDUCATION : 1.93 %
BOOKS_AND_REFERENCE : 1.74 %
MAPS_AND_NAVIGATION : 1.46 %
FOOD_AND_DRINK : 1.46 %
DATING : 1.3 %
WEATHER : 1.27 %
MEDICAL : 0.83 %
HOUSE_AND_HOME : 0.8 %
COMICS : 0.55 %
AUTO_AND_VEHICLES : 0.55 %
ART_AND_DESIGN : 0.47 %
LIBRARIES_AND_DEMO : 0.36 %
PARENTING : 0.33 %
BEAUTY : 0.28 %
EVENTS : 0.22 %


In order to better understand the sorts of apps in the categories lets list the Genres and their size as a percentage of the category.  
To do this we will create a function so we can investigate the categories of most interest.

In [33]:
def freq_table_sub(dataset, index1, index2, string):
    '''
    Iterates through the data set and counts the number of 'Genres' per 'Category' 
    and returns the result as a dictionary key pair. We then create a list of tuples
    and print these with the value as a percentage.
    
    Args:
        dataset (variable): object containing the data
        index1 (int): column index number
        index2 (int): column index number
        string (str): category name
    '''
    cat = {}
    count = 0
    for i in dataset:
        count += 1
        col1 = i[index1]
        col2 = i[index2]
        if string in col1 and col2 in cat:
            cat[col2] += 1
        elif string in col1:
            cat[col2] = 1
            
    table_display = []
    for key in cat:
        key_val_as_tuple = (cat[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0] / (sum(list(cat.values()))) * 100, 2), '%')

In [34]:
freq_table_sub(higher_rev, 1, -4, 'FAMILY')

Simulation : 15.7 %
Entertainment : 15.54 %
Casual : 11.24 %
Strategy : 8.43 %
Role Playing : 8.1 %
Puzzle : 6.28 %
Education : 4.13 %
Casual;Pretend Play : 3.14 %
Educational;Education : 2.64 %
Casual;Action & Adventure : 1.98 %
Racing;Action & Adventure : 1.65 %
Puzzle;Brain Games : 1.65 %
Arcade;Action & Adventure : 1.49 %
Entertainment;Music & Video : 1.32 %
Educational;Pretend Play : 1.32 %
Educational : 1.16 %
Simulation;Action & Adventure : 0.99 %
Education;Education : 0.99 %
Casual;Brain Games : 0.99 %
Action;Action & Adventure : 0.99 %
Casual;Creativity : 0.83 %
Educational;Brain Games : 0.66 %
Education;Pretend Play : 0.66 %
Role Playing;Pretend Play : 0.5 %
Role Playing;Action & Adventure : 0.5 %
Entertainment;Brain Games : 0.5 %
Educational;Creativity : 0.5 %
Adventure;Action & Adventure : 0.5 %
Video Players & Editors;Music & Video : 0.33 %
Simulation;Pretend Play : 0.33 %
Puzzle;Creativity : 0.33 %
Puzzle;Action & Adventure : 0.33 %
Entertainment;Pretend Play : 0.33 %
Ent

In [35]:
freq_table_sub(android_final, 1, -4, 'FAMILY')

Entertainment : 27.37 %
Education : 22.69 %
Simulation : 10.56 %
Casual : 8.28 %
Puzzle : 4.86 %
Strategy : 4.14 %
Role Playing : 4.14 %
Educational;Education : 2.1 %
Educational : 1.98 %
Education;Education : 1.44 %
Casual;Pretend Play : 1.2 %
Racing;Action & Adventure : 0.9 %
Puzzle;Brain Games : 0.9 %
Casual;Action & Adventure : 0.72 %
Casual;Brain Games : 0.66 %
Arcade;Action & Adventure : 0.66 %
Entertainment;Music & Video : 0.54 %
Educational;Pretend Play : 0.48 %
Board;Brain Games : 0.48 %
Simulation;Action & Adventure : 0.36 %
Educational;Brain Games : 0.36 %
Action;Action & Adventure : 0.36 %
Entertainment;Brain Games : 0.3 %
Casual;Creativity : 0.3 %
Role Playing;Pretend Play : 0.24 %
Education;Pretend Play : 0.24 %
Role Playing;Action & Adventure : 0.18 %
Puzzle;Action & Adventure : 0.18 %
Entertainment;Action & Adventure : 0.18 %
Educational;Creativity : 0.18 %
Educational;Action & Adventure : 0.18 %
Adventure;Action & Adventure : 0.18 %
Video Players & Editors;Music & Vide

In [36]:
freq_table_sub(android_final, 1, -4, 'TOOLS')

Tools : 99.87 %
Tools;Education : 0.13 %


In [37]:
freq_table_sub(android_final, 1, -4, 'GAME')

Action : 32.28 %
Arcade : 19.25 %
Racing : 10.33 %
Adventure : 7.04 %
Card : 4.58 %
Trivia : 4.34 %
Casino : 4.34 %
Board : 3.87 %
Word : 2.7 %
Puzzle : 2.23 %
Music : 2.11 %
Casual : 2.0 %
Role Playing : 1.64 %
Strategy : 1.41 %
Simulation : 0.59 %
Sports : 0.47 %
Action;Action & Adventure : 0.35 %
Simulation;Action & Adventure : 0.12 %
Casual;Pretend Play : 0.12 %
Casual;Creativity : 0.12 %
Casual;Brain Games : 0.12 %


In [38]:
print(google_data_header)
print(explore_data(google_data, 0, 1, rows_and_columns=True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', 4.1, 159, '19M', 10000, 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13
None


In [39]:
def top_genres(cat, genre, count):  
    num_installs = []
    num_cats = []
    for apps in android_final:
        top_app = apps[3]
        top_cat = apps[1]
        top_genre = apps[9]
        if cat in top_cat and genre in apps[9] and top_app >= count:
            num_installs.append(apps[0])
        elif cat in top_cat and top_app >= count:
            num_cats.append(apps[1])
    print(cat, '+', genre, ':', 'There are',len(num_installs), 
          'apps or', round(len(num_installs) / len(num_cats) * 100, 2), '% who have', count, 'reviews or more')

In [40]:
top_genres('FAMILY', 'Education', 5000)
top_genres('FAMILY', 'Action', 5000)
top_genres('GAME', 'Action', 5000)

FAMILY + Education : There are 81 apps or 15.46 % who have 5000 reviews or more
FAMILY + Action : There are 61 apps or 11.21 % who have 5000 reviews or more
GAME + Action : There are 196 apps or 51.58 % who have 5000 reviews or more


In [41]:
print(google_data_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [42]:
total_reviews = {}
total_downloads = {}
av_rating = {}
av_ratings = {}
count1 = 0
count2 = 0
count3 = 0
count4 = 0
for row in android_final:
    value = row[1]
    downloads = row[5]
    reviews = row[3]
    rats = row[2]
    count1 += reviews
    count2 += downloads
    count3 += 1
    count4 += rats
    if value in total_reviews:
        total_reviews[value] += reviews
    if value in total_downloads:
        total_downloads[value] += downloads
    if value in av_rating:
        av_rating[value] += 1
    if value in av_ratings:
        av_ratings[value] += rats
    else:
        total_reviews[value] = reviews
        total_downloads[value] = downloads
        av_rating[value] = 1
        av_ratings[value] = rats

tot_rev = {k: round(total_reviews[k] / total_downloads[k] * 100, 2) for k in total_downloads.keys() & total_reviews}
tot_rev_list = list(map(list, tot_rev.items()))

avg_rev = {k: round(av_ratings[k] / av_rating[k], 1) for k in av_rating.keys() & av_ratings}
avg_rev_list = list(map(list, avg_rev.items()))

tot_avg = [a + [b[1]] for (a, b) in zip(tot_rev_list, avg_rev_list)]

sort_tot = sorted(tot_avg, key=itemgetter(1), reverse=True)

for i in sort_tot:
    a = i[0]
    b = i[1]
    c = i[2]
    if b >= 0.5:
        print(a, 'has', b, 'reviews per one hundred downloads, with an average rating of', c, '\n')


COMICS has 5.21 reviews per one hundred downloads, with an average rating of 4.0 

SPORTS has 4.57 reviews per one hundred downloads, with an average rating of 3.3 

FAMILY has 4.22 reviews per one hundred downloads, with an average rating of 3.7 

SOCIAL has 4.15 reviews per one hundred downloads, with an average rating of 3.6 

GAME has 3.97 reviews per one hundred downloads, with an average rating of 4.0 

EDUCATION has 3.85 reviews per one hundred downloads, with an average rating of 4.3 

MEDICAL has 3.67 reviews per one hundred downloads, with an average rating of 3.0 

MAPS_AND_NAVIGATION has 3.54 reviews per one hundred downloads, with an average rating of 3.6 

PERSONALIZATION has 3.48 reviews per one hundred downloads, with an average rating of 3.4 

WEATHER has 3.38 reviews per one hundred downloads, with an average rating of 3.9 

SHOPPING has 3.18 reviews per one hundred downloads, with an average rating of 3.8 

PARENTING has 3.02 reviews per one hundred downloads, with a

In [43]:
from operator import itemgetter
sort_tot = sorted(tot_avg, key=itemgetter(2), reverse=True)
print(sort_tot)

[['EDUCATION', 3.85, 4.3], ['ENTERTAINMENT', 1.31, 4.1], ['ART_AND_DESIGN', 1.24, 4.1], ['GAME', 3.97, 4.0], ['PHOTOGRAPHY', 2.26, 4.0], ['COMICS', 5.21, 4.0], ['WEATHER', 3.38, 3.9], ['SHOPPING', 3.18, 3.8], ['AUTO_AND_VEHICLES', 2.18, 3.7], ['VIDEO_PLAYERS', 1.72, 3.7], ['BOOKS_AND_REFERENCE', 1.0, 3.7], ['FAMILY', 4.22, 3.7], ['HEALTH_AND_FITNESS', 1.85, 3.6], ['SOCIAL', 4.15, 3.6], ['PARENTING', 3.02, 3.6], ['MAPS_AND_NAVIGATION', 3.54, 3.6], ['FINANCE', 2.78, 3.6], ['HOUSE_AND_HOME', 1.99, 3.5], ['TOOLS', 2.86, 3.5], ['TRAVEL_AND_LOCAL', 0.93, 3.5], ['FOOD_AND_DRINK', 2.99, 3.5], ['BEAUTY', 1.46, 3.4], ['COMMUNICATION', 2.59, 3.4], ['PRODUCTIVITY', 0.96, 3.4], ['PERSONALIZATION', 3.48, 3.4], ['NEWS_AND_MAGAZINES', 0.97, 3.3], ['LIFESTYLE', 2.36, 3.3], ['SPORTS', 4.57, 3.3], ['EVENTS', 1.01, 3.2], ['LIBRARIES_AND_DEMO', 1.71, 3.2], ['DATING', 2.57, 3.2], ['MEDICAL', 3.67, 3.0], ['BUSINESS', 1.4, 2.5]]


In [44]:
che = []
for i in android_final:
    a = i[1]
    b = i[2]
    if a == 'EDUCATION':
        che.append(float(b))
        
print(round(sum(che) / 60, 1))

8.0


In [45]:
dict3 = {k: table1[k] / table2[k] for k in table2.keys() & table1}

list10 = list(map(list, dict3.items()))
print(list10)

NameError: name 'table2' is not defined

In [None]:
cats1 = []
cats2 = []
for row in android_final:
    cats1.append(row[1])
    if row[1] in cats1:
        cats2.append(row[1])
        
cats3 = sorted((set(cats2)))
print(cats3)

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FAMILY', 'FINANCE', 'FOOD_AND_DRINK', 'GAME', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'MAPS_AND_NAVIGATION', 'MEDICAL', 'NEWS_AND_MAGAZINES', 'PARENTING', 'PERSONALIZATION', 'PHOTOGRAPHY', 'PRODUCTIVITY', 'SHOPPING', 'SOCIAL', 'SPORTS', 'TOOLS', 'TRAVEL_AND_LOCAL', 'VIDEO_PLAYERS', 'WEATHER']


In [None]:
def high_rated(string):
    beau = []
    downs =[]
    rating = []
    for row in android_final:
        rev = row[3]
        down = row[5]
        rat = float(row[2])
        if row[1] == string:
            beau.append(rev)
            downs.append(down)
            rating.append(rat)
        rating = [0 if x == 'NaN' else x for x in rating]
    print(string, 'has a total of', sum(beau), 'reviews.')
    print(string, 'has a total of', sum(downs), 'downloads.')
    print(string, 'has', round(sum(beau) / sum(downs) * 100, 2), 'reviews per 100 downloads')
    print(string, 'has an average', round(sum(rating) / len(rating), 2), 'rating')

In [None]:
high_rated('GAME')
high_rated('SHOPPING')
high_rated('DATING')

GAME has a total of 440604975 reviews.
GAME has a total of 11107764450 downloads.
GAME has 3.97 reviews per 100 downloads
GAME has an average 4.03 rating
SHOPPING has a total of 44523992 reviews.
SHOPPING has a total of 1400338585 downloads.
SHOPPING has 3.18 reviews per 100 downloads
SHOPPING has an average 3.78 rating
DATING has a total of 3621934 reviews.
DATING has a total of 140914757 downloads.
DATING has 2.57 reviews per 100 downloads
DATING has an average 3.16 rating


In [None]:
beau = []
downs =[]
for row in android_final:
    beau.append(row[3])
    downs.append(row[5])
        
print(sum(beau))
print(len(beau))
print(sum(downs))
print(len(downs))
print(round(sum(beau) / sum(downs) * 100, 2))
print(len(android_final))

2083613379
8850
75002705825
8850
2.78
8850


In [None]:
total_reviews = []
total_downloads = []
for apps in android_final:
    total_reviews.append(apps[3])
    total_downloads.append(apps[5])
    
print('Total numbe of reviews for all apps:', sum(total_reviews))
print('Total number of downloads for all apps:', sum(total_downloads))

print(round(sum(total_reviews) / sum(total_downloads) * 100, 2))
print(len(android_final))

Total numbe of reviews for all apps: 2083613379
Total number of downloads for all apps: 75002705825
2.78
8850


In [None]:
per_rev = []
for apps in android_final:
    if apps[5] > 0:
        downloads = apps[5]
    if apps[3] > 0:
        reviews = apps[3]
    total = round((reviews / downloads) * 100, 2)
    per_rev.append(total)

if i in per_rev:
    print(i)

In [None]:
import emoji as em

foreign_strings = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜', 'Wattpad 📖 Free Books']
english_strings = []
    
for i in foreign_strings:
    i = em.replace_emoji(i).strip()
    # if i.isascii():
    english_strings.append(i)
print(english_strings)


['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go Free Office Suite', 'Instachat', 'Wattpad  Free Books']


In [None]:
dir()

['In',
 'Out',
 '_',
 '_423',
 '_433',
 '_435',
 '_436',
 '_456',
 '_VSCode_defaultMatplotlib_Params',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__vsc_ipynb_file__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i100',
 '_i101',
 '_i102',
 '_i103',
 '_i104',
 '_i105',
 '_i106',
 '_i107',
 '_i108',
 '_i109',
 '_i11',
 '_i110',
 '_i111',
 '_i112',
 '_i113',
 '_i114',
 '_i115',
 '_i116',
 '_i117',
 '_i118',
 '_i119',
 '_i12',
 '_i120',
 '_i121',
 '_i122',
 '_i123',
 '_i124',
 '_i125',
 '_i126',
 '_i127',
 '_i128',
 '_i129',
 '_i13',
 '_i130',
 '_i131',
 '_i132',
 '_i133',
 '_i134',
 '_i135',
 '_i136',
 '_i137',
 '_i138',
 '_i139',
 '_i14',
 '_i140',
 '_i141',
 '_i142',
 '_i143',
 '_i144',
 '_i145',
 '_i146',
 '_i147',
 '_i148',
 '_i149',
 '_i15',
 '_i150',
 '_i151',
 '_i152',
 '_i153',
 '_i154',
 '_i155',
 '_i156',
 '_i157',
 '_i158',
 '_i159',
 '_i16',
 '_i160',
 '_i161',
 '_i162',
 '_i163',
 '_i164',
 '_i165',
 '_i166',
 