# Exploring Android and iOS App Data


## Overview
This is my first data project on `GitHub` and `Dataquest`. 

The functions used in this project may seem inefficient at times; after all, many existing tools would do the same job. That is because the project is part of a `Python` learning module, and the purpose is to understand the fundamentals and inner workings of the functions without taking shortcuts. 

## Dataset

[iOS Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): details of about 7,000 Apple iOS mobile applications.

[Android Dataset](https://www.kaggle.com/lava18/google-play-store-apps): details of about 10,000 Android mobile applications.

## Tasks

1. Extract and Explore
2. Clean the Datasets
3. Light Analysis





### 1.1 Extract and Explore

1.1a Define a function to open the dataset<br>
1.1b Assign the data to `AppleData` and `GoogleData` variables<br>
1.1c Print the length of each dataset, and the headers 

In [2]:
# 1.1a Open, read, and assign a csv file to a variable.

def open_file(filename):
    opened_file = open(filename, encoding = "utf8")
    from csv import reader
    read_file = reader(opened_file)
    AppData = list(read_file)
    return AppData
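
The function above leaves the file handle open after reading. A minimal sketch of the same idea with a `with` block, which closes the file automatically, could look like the following. The name `open_file_closed` and the temporary sample file are hypothetical and only serve the illustration.

```python
import csv
import os
import tempfile


def open_file_closed(filename):
    """Read a CSV file into a list of rows and close the file afterwards."""
    # newline="" is what the csv module documentation recommends for csv.reader
    with open(filename, encoding="utf8", newline="") as opened_file:
        return list(csv.reader(opened_file))


# Small demonstration on a temporary file (made-up rows, not the real dataset)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Price\nExample App,0\n")
    sample_path = tmp.name

sample_rows = open_file_closed(sample_path)
os.remove(sample_path)
print(sample_rows)
```

The return value has the same shape as `AppData`: a list of rows, with the header as the first row.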

In [3]:
# 1.1b Assigning to AppleData and GoogleData

AppleData = open_file("AppleStore.csv")
GoogleData = open_file("googleplaystore.csv")

# 1.1c Printing out headers for each data table

print('Apple Dataset Length:', len(AppleData)-1)
print('Google Dataset Length:', len(GoogleData)-1)
print('\n')
print('Apple Dataset Header:')
print(AppleData[0])
print('\n')
print('Google Dataset Header:')
print(GoogleData[0])
print('\n')



Apple Dataset Length: 7197
Google Dataset Length: 10841


Apple Dataset Header:
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google Dataset Header:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




### 2.1 Clean the Datasets: Wrong Entry

2.1a Looking at the [`Discussion`](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the dataset, we noticed that row `10473` has an error<br>
2.1b We isolate the row and the issue<br>
2.1c Since we don't have the true data to update, we will delete the row


In [4]:
# 2.1b Spotting the entry with an error and comparing to a normal entry

print(GoogleData[10473])
print('\n')
print(GoogleData[10472])

# Row 10473 is missing the Category value (index 1), so every later value is shifted one column to the left

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


In [5]:
# 2.1c Deleting the erroneous entry
# Run this cell only once; running it again deletes a valid row
del GoogleData[10473]



In [6]:
print(GoogleData[10472:10474])

[['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'], ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']]
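
Because `del GoogleData[10473]` removes whatever sits at that index, rerunning the cell deletes a valid row. One possible guard is to drop only rows whose length differs from the header's (the faulty row has 12 values, the header 13). This is sketched below on a tiny made-up table; `drop_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def drop_malformed_rows(table):
    """Return the header plus every row that has as many columns as the header."""
    header = table[0]
    kept = [header]
    for row in table[1:]:
        if len(row) == len(header):
            kept.append(row)
    return kept


# Made-up stand-in for the real table: the second data row lacks its category
toy_table = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

cleaned = drop_malformed_rows(toy_table)
print(len(cleaned) - 1)                          # number of data rows kept
print(drop_malformed_rows(cleaned) == cleaned)   # running it again changes nothing
```

The trade-off is that this builds a new list instead of deleting in place, but it is safe to run any number of times.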


### 2.2 Clean the Datasets: Dealing with Duplicate Entries 

2.2a Identify which dataset has duplicate entries<br>
2.2b Remove duplicate entries. Keep the entry with the most reviews (which is presumably the most recent one).


In [7]:
# 2.2a Which dataset has duplicate entries?

android_unique_apps = []
android_duplicate_apps = []

for app in GoogleData[1:]:
    name = app[0]
    if name in android_unique_apps:
        android_duplicate_apps.append(name)
    else:
        android_unique_apps.append(name)
        
print(android_duplicate_apps[:4])

apple_unique_apps = []
apple_duplicate_apps = []

for app in AppleData[1:]:
    # Index 0 is an unnamed row-number column, so we check the app id at index 1 instead
    app_id = app[1]
    if app_id in apple_unique_apps:
        apple_duplicate_apps.append(app_id)
    else:
        apple_unique_apps.append(app_id)
        
print(apple_duplicate_apps[:4])
    
# The Android dataset has duplicate apps. The iOS dataset has none. 

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
[]
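
The `name in android_unique_apps` test scans a list, so the loop gets slower as the list grows. A sketch of the same check with a dictionary of counts is shown here on made-up names; `count_duplicates` is a hypothetical helper.

```python
def count_duplicates(names):
    """Return (unique_count, duplicate_count) for a list of names."""
    seen = {}
    duplicates = 0
    for name in names:
        if name in seen:          # dictionary lookup, no list scan
            seen[name] += 1
            duplicates += 1
        else:
            seen[name] = 1
    return len(seen), duplicates


toy_names = ['Box', 'Instagram', 'Box', 'Instagram', 'Instagram', 'Slack']
unique_count, duplicate_count = count_duplicates(toy_names)
print(unique_count, duplicate_count)   # 3 3
```

As in the list-based version, every repeat after the first occurrence counts as one duplicate.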


In [8]:
# 2.2a Printing the number of duplicate apps. The locale module handles the thousands separator.

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print("The number of duplicate apps is", '{0:n}'.format(len(android_duplicate_apps)))



The number of duplicate apps is 1,181


In [9]:
# 2.2b Let's print a case of duplicates

for row in GoogleData[1:]:
    if row[0] == "Instagram":
        print(row)

# We can see that Reviews, or index 3, varies depending on when the entry was added


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
# 2.2b We will create a dictionary called reviews_max, where the key is the app name 
# and the value is the maximum number of reviews

reviews_max = {}

for row in GoogleData[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        

        


In [11]:
# 2.2b Creating a deduplicated clean list called android_clean
# The already_added list prevents duplicates where the review number is the same 
# Instagram had 2 rows with the same review number

android_clean = []
already_added = []

for row in GoogleData[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('The android_clean list has', len(android_clean), 'rows')
        
# Optional
# Why do we need the already_added list? 
# While exploring the data, I realised that some duplicate rows share the same maximum
# number of reviews but differ in other columns. Uncomment and run the loop below to spot the difference!

# for row in GoogleData[1:]:
#     if row[0] == 'Learn C++':
#         print(row)




The android_clean list has 9659 rows
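
The two passes above (build `reviews_max`, then filter) can also be folded into a single pass that stores the best row per app name. This is a sketch on made-up rows, assuming the name is at index 0 and the review count at index 3 as in the Google table; `dedupe_by_reviews` is a hypothetical name.

```python
def dedupe_by_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when two rows tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened, made-up rows: name, category, rating, reviews
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

deduped = dedupe_by_reviews(toy_rows)
print(len(deduped))      # 2
print(deduped[0][3])     # 66577446
```

One difference from the two-pass version: rows come out in the order each app name first appears, not in the order of the winning rows.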


### 2.3 Clean the Datasets: Removing Non-English Entries and Paid Entries

2.3a Example of a Non-English Entry<br>
2.3b Function to identify Non-English Entries<br>
2.3c Cleaning `android` and `iOS` datatables of Non-English entries<br>
2.3d Cleaning `android` and `iOS` datatables of Paid entries

In [12]:
# 2.3a Printing a Non-English Entry

print(android_clean[4412])

['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']


In [13]:
# 2.3b The function is_english returns False for any app name with more than 3 non-ASCII
# characters (code point above 127), and True otherwise

def is_english(input_string):
    count = 0
    for char in input_string:
        if ord(char) > 127:
            count += 1
            if count > 3:
                return False
    return True


print('Is the word "chicken" an English word?', is_english('chicken'))
print('Is the word "chicken語" an English word?', is_english('chicken語'))
print('Is the word "chicken語語語" an English word?', is_english('chicken語語語'))
print('Is the word "chicken語語語語" an English word?', is_english('chicken語語語語'))
print('Is the word "chicken😜😜😜😜" an English word?', is_english('chicken😜😜😜😜'))


Is the word "chicken" an English word? True
Is the word "chicken語" an English word? True
Is the word "chicken語語語" an English word? True
Is the word "chicken語語語語" an English word? False
Is the word "chicken😜😜😜😜" an English word? False
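
The check relies on `ord`, which returns a character's Unicode code point. The ASCII range, which covers plain English letters, digits, and common punctuation, ends at 127. A quick look at a few characters shows why a symbol such as `™` or an emoji counts toward the limit of three; the app name below is only an illustrative string.

```python
# Code points of a few characters; anything above 127 is outside ASCII
for char in ['a', 'Z', '+', '™', '語', '😜']:
    print(repr(char), ord(char), ord(char) > 127)

# One trademark sign and one emoji: two non-ASCII characters, still under the limit
non_ascii = [ch for ch in 'Docs To Go™ Free 😜' if ord(ch) > 127]
print(len(non_ascii))   # 2
```

This is why the threshold is "more than 3" and not "any": English app names with a stray symbol or emoji are kept.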


In [14]:
# 2.3c Cleaning the android and iOS datatables of Non-English entries

android_english = []
ios_english = []

for row in android_clean:
    if is_english(row[0]):
        android_english.append(row)

for row in AppleData[1:]:
    # Remember that we still need to remove the header for the AppleData
    if is_english(row[2]):
        ios_english.append(row)
      

    
print(len(android_english))
print(len(ios_english))

9614
6183


In [15]:
# 2.3d Cleaning the android and iOS datatables of Paid entries
android_free = []
ios_free = []

for row in android_english:
    if row[7] == '0':
        android_free.append(row)
        
for row in ios_english:
    if row[5] == '0':
        ios_free.append(row)
        
print(len(android_free))
print(len(ios_free))


8864
3222
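
Comparing the price to the string `'0'` works for these files, but it would miss a free app whose price is written as `'0.0'` or `'$0'`. A more tolerant sketch converts the price to a number first; `is_free` is a hypothetical helper and the sample prices are made up.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    cleaned = price_text.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number (for example 'Everyone') is not treated as free
        return False


for sample in ['0', '0.0', '$0', '$4.99', 'Everyone']:
    print(sample, is_free(sample))
```

Only the first three samples are reported as free.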


### 3.1 Light Analysis: Genre Frequencies

We can look at the genres of `android` and `iOS` apps to see how many apps each genre has (the supply side); section 3.2 uses ratings and installs as a rough proxy for demand.

3.1a Dictionary which shows the number of apps in each genre on `android`<br>
3.1b Creating two functions: (1) a frequency table and (2) a display table sorted by frequency

In [16]:
# 3.1a Dictionary which shows the number of each genre on android

android_genre = {}

for app in android_free:
    genre = app[9]
    if genre in android_genre:
        android_genre[genre] += 1
    else:
        android_genre[genre] = 1


print(android_genre)

{'Art & Design': 53, 'Art & Design;Creativity': 6, 'Auto & Vehicles': 82, 'Beauty': 53, 'Books & Reference': 190, 'Business': 407, 'Comics': 54, 'Comics;Creativity': 1, 'Communication': 287, 'Dating': 165, 'Education': 474, 'Education;Creativity': 4, 'Education;Education': 30, 'Education;Pretend Play': 5, 'Education;Brain Games': 3, 'Entertainment': 538, 'Entertainment;Brain Games': 7, 'Entertainment;Creativity': 3, 'Entertainment;Music & Video': 15, 'Events': 63, 'Finance': 328, 'Food & Drink': 110, 'Health & Fitness': 273, 'House & Home': 73, 'Libraries & Demo': 83, 'Lifestyle': 345, 'Lifestyle;Pretend Play': 1, 'Card': 40, 'Arcade': 164, 'Puzzle': 100, 'Racing': 88, 'Sports': 307, 'Casual': 156, 'Simulation': 181, 'Adventure': 60, 'Trivia': 37, 'Action': 275, 'Word': 23, 'Role Playing': 83, 'Strategy': 81, 'Board': 34, 'Music': 18, 'Action;Action & Adventure': 9, 'Casual;Brain Games': 12, 'Educational;Creativity': 3, 'Puzzle;Brain Games': 15, 'Educational;Education': 35, 'Casual;Pre

In [17]:
# 3.1b Two functions. Frequency table and display table.

def freq_table(dataset, index):
    a_dict = {}
    count = 0
    for row in dataset:
        key = row[index]
        count += 1
        if key in a_dict:
            a_dict[key] += 1
        else:
            a_dict[key] = 1
            
    # Convert the raw counts to percentages of the whole dataset
    for key in a_dict:
        a_dict[key] /= count
        a_dict[key] *= 100
        
    return a_dict
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_list = [table[key], key]
        table_display.append(key_val_as_list)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
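
To see what the two functions produce, here is a self-contained sketch on a made-up four-row table. `toy_freq_table` mirrors the logic of `freq_table` under a different (hypothetical) name so that the example runs on its own.

```python
def toy_freq_table(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = {}
    for row in dataset:
        key = row[index]
        counts[key] = counts.get(key, 0) + 1
    return {key: counts[key] / len(dataset) * 100 for key in counts}


toy_apps = [
    ['App A', 'Tools'],
    ['App B', 'Tools'],
    ['App C', 'Games'],
    ['App D', 'Tools'],
]

percentages = toy_freq_table(toy_apps, 1)

# Sort (percentage, genre) pairs so that the most common genre comes first
for share, genre in sorted(((v, k) for k, v in percentages.items()), reverse=True):
    print(genre, ':', share)
```

Putting the percentage first in each pair is what lets `sorted` order by frequency, which is the same trick `display_table` uses.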
    

In [18]:
# 3.1b Trying out our two functions.

print('Android Genres:')
display_table(android_free, 9)
print('\n')
print('Android Categories:')
display_table(android_free, 1)
print('\n')
print('iOS Genres:')
display_table(ios_free, 12)

Android Genres:
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles :

### 3.2 Light Analysis: iOS Ratings by Genre

On iOS, we don't know the number of app downloads.
However, we do know the number of times an app has been rated (column `rating_count_tot`, index 6), which we can use as a proxy.

3.2a iOS Genre with total number of ratings, avg. number of ratings, and the ratio of the two (roughly the number of apps in the genre)

In [43]:
# 3.2a iOS Genre with total number of ratings and avg. number of ratings 


# Create a Dictionary with genres for iOS
ios_genres = freq_table(ios_free, 12)


# Replace each genre's value with a list of three numbers: [total number of ratings,
# avg. number of ratings per app, ratio of total to average]
for genre in ios_genres:
    total = 0
    len_genre = 0
    for row in ios_free:
        genre_app = row[12]
        if genre_app == genre:
            total += float(row[6])
            len_genre += 1
    avg_user_ratings = round(total / len_genre)
    ios_genres[genre] = [round(total), avg_user_ratings, round(total/avg_user_ratings)]

    
# The function dict_table takes an already-built table (a dictionary) as its only input. 
def dict_table(table):
    table_display = []
    for key in table:
        key_val_as_list = [table[key], key]
        table_display.append(key_val_as_list)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
dict_table(ios_genres)
print('\n')

# From the results, it's possible that in the games genre, there are a few apps which take up a lot of the ratings.
# Let's test our hypothesis by looking at the game with the lowest number of ratings and the game with 
# the highest number of ratings (the range).
# We can compare this with the Medical genre.

def min_max(data, genre):
    # Start from None so that no arbitrary starting value has to be hardcoded
    genre_min = None
    genre_max = None
    for row in data:
        if row[12] == genre:
            n_ratings = round(float(row[6]))
            if genre_min is None or n_ratings < genre_min:
                genre_min = n_ratings
            if genre_max is None or n_ratings > genre_max:
                genre_max = n_ratings

    print('The range for the number of games reviews:', genre_min, 'to', genre_max)

        
min_max(ios_free, 'Games')
min_max(ios_free, 'Medical')


Games : [42705967, 22789, 1874]
Social Networking : [7584125, 71548, 106]
Photo & Video : [4550647, 28442, 160]
Music : [3783551, 57327, 66]
Entertainment : [3563577, 14030, 254]
Shopping : [2261254, 26920, 84]
Sports : [1587614, 23009, 69]
Health & Fitness : [1514371, 23298, 65]
Utilities : [1513441, 18684, 81]
Weather : [1463837, 52280, 28]
Reference : [1348958, 74942, 18]
Productivity : [1177591, 21028, 56]
Finance : [1132846, 31468, 36]
Travel : [1129752, 28244, 40]
News : [913665, 21248, 43]
Food & Drink : [866682, 33334, 26]
Lifestyle : [840774, 16486, 51]
Education : [826470, 7004, 118]
Book : [556619, 39758, 14]
Navigation : [516542, 86090, 6]
Business : [127349, 7491, 17]
Catalogs : [16016, 4004, 4]
Medical : [3672, 612, 6]


The range for the number of games reviews: 0 to 2130805
The range for the number of games reviews: 0 to 1341
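
One way to probe the hypothesis further is to ask what share of a genre's ratings belongs to its few most-rated apps. The sketch below does this on made-up rating counts; `top_share` is a hypothetical helper, and with the real data the counts would come from index 6 of `ios_free`.

```python
def top_share(rating_counts, top_n):
    """Return the percentage of all ratings held by the top_n most-rated apps."""
    total = sum(rating_counts)
    if total == 0:
        return 0.0
    largest = sorted(rating_counts, reverse=True)[:top_n]
    return sum(largest) * 100 / total


# Made-up counts: one very popular app and several small ones
toy_counts = [900, 40, 30, 20, 10, 0]
print(top_share(toy_counts, 1))   # 90.0
```

A high share for a small `top_n` would suggest that the genre's average is driven by a handful of apps, which the wide range for `Games` already hints at.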


### 3.2 Light Analysis: Android Installs by Category

On `Android`, the number of app installs can be found at index 5

3.2b Create a table with the average and maximum number of installs for each category

In [22]:
# 3.2b Create a table with the average and maximum number of installs for each category
# Index 1 is the category column
# Index 5 is the installs column

category_dict = freq_table(android_free, 1)

for category in category_dict:
    total = 0
    len_category = 0
    max_category = 0
    for row in android_free:
        installs = row[5]
        # We need to clean the installs data to use as a float.
        installs = installs.replace('+','')
        installs = installs.replace(',','')
        installs = float(installs)
        if category == row[1]:
            total += installs
            len_category += 1
            if installs > max_category:
                max_category = installs
    avg_installs = total / len_category
    category_dict[category] = [round(avg_installs), round(max_category)]
    
dict_table(category_dict)
            
            

COMMUNICATION : [38456119, 1000000000]
VIDEO_PLAYERS : [24727872, 1000000000]
SOCIAL : [23253652, 1000000000]
PHOTOGRAPHY : [17840110, 1000000000]
PRODUCTIVITY : [16787331, 1000000000]
GAME : [15588016, 1000000000]
TRAVEL_AND_LOCAL : [13984078, 1000000000]
ENTERTAINMENT : [11640706, 100000000]
TOOLS : [10801391, 1000000000]
NEWS_AND_MAGAZINES : [9549178, 1000000000]
BOOKS_AND_REFERENCE : [8767812, 1000000000]
SHOPPING : [7036877, 100000000]
PERSONALIZATION : [5201483, 100000000]
WEATHER : [5074486, 50000000]
HEALTH_AND_FITNESS : [4188822, 500000000]
MAPS_AND_NAVIGATION : [4056942, 100000000]
FAMILY : [3695642, 1000000000]
SPORTS : [3638640, 100000000]
ART_AND_DESIGN : [1986335, 50000000]
FOOD_AND_DRINK : [1924898, 10000000]
EDUCATION : [1833495, 10000000]
BUSINESS : [1712290, 100000000]
LIFESTYLE : [1437816, 100000000]
FINANCE : [1387692, 100000000]
HOUSE_AND_HOME : [1331541, 10000000]
DATING : [854029, 10000000]
COMICS : [817657, 10000000]
AUTO_AND_VEHICLES : [647318, 10000000]
LIBRAR
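
The cleaning of the installs string can be pulled into a small helper so that each row is converted only once, instead of once per category. A sketch follows, with `parse_installs` and `average_installs` as hypothetical names and made-up rows that only fill the columns being used.

```python
def parse_installs(text):
    """Turn an installs label such as '1,000,000+' into a float."""
    return float(text.replace('+', '').replace(',', ''))


def average_installs(rows, category_index=1, installs_index=5):
    """Return {category: average installs} in a single pass over the rows."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


toy_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'SOCIAL', '', '', '', '500+'],
]
print(average_installs(toy_rows))   # {'TOOLS': 2000.0, 'SOCIAL': 500.0}
```

Keep in mind that a label such as `'1,000+'` is a bracket, not an exact figure, so these averages are lower bounds at best.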

In [40]:
# Just out of curiosity, I wanted to see which `Medical` apps are in the '5,000,000+' installs bracket.

for row in android_free:
    if row[5] == '5,000,000+' and row[1] == "MEDICAL":
        print(row)

        

['Best Hairstyles step by step', 'BEAUTY', '4.5', '45452', '9.2M', '5,000,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 19, 2018', '1.25', '4.0 and up']
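
The comparison `row[5] == '5,000,000+'` matches that label exactly, so apps in higher brackets are not picked up. A sketch of an "at least 5 million" filter is shown below on made-up rows; `at_least` is a hypothetical helper.

```python
def at_least(rows, category, minimum, category_index=1, installs_index=5):
    """Return rows of `category` whose installs label is at or above `minimum`."""
    selected = []
    for row in rows:
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        if row[category_index] == category and installs >= minimum:
            selected.append(row)
    return selected


toy_rows = [
    ['App A', 'MEDICAL', '', '', '', '5,000,000+'],
    ['App B', 'MEDICAL', '', '', '', '10,000,000+'],
    ['App C', 'MEDICAL', '', '', '', '1,000,000+'],
    ['App D', 'BEAUTY', '', '', '', '5,000,000+'],
]
for row in at_least(toy_rows, 'MEDICAL', 5000000):
    print(row[0])
```

Here both `App A` and `App B` are selected, while the exact-match version would return only `App A`.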
