# Analyzing Google Play and App Store Data

__Role__: For this project I will be working as a data analyst for an app development company that builds Android and iOS mobile apps.

__Goal:__ The main goal of this project is to analyze data to help the company's developers better understand which types of free apps are most appealing to users. The analysis targets free apps because the company only builds free apps and relies on in-app ads for revenue, so more users means more revenue.

## Opening and Studying the Data
In this part of the project we will open and explore a sample set of data from both the App Store and Google Play.

The App Store dataset, from July 2017, contains data on approximately 7,000 iOS apps. It can be found at the following link: https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps

The Google Play dataset, from August 2018, contains data on approximately 10,000 Android apps. It can be found at the following link: https://www.kaggle.com/datasets/lava18/google-play-store-apps

This part of the code opens the sample datasets.

In [1]:
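from csv import reader

### The App Store data set ###
# encoding='utf8' is set explicitly because app names contain non-ASCII characters;
# the with-block closes the file once it has been read
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]   # first row holds the column names
ios = ios[1:]         # remaining rows are the apps

### The Google Play data set ###
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]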

Here we create a function called explore_data. This function takes 5 parameters:
1. 'dataset' - The dataset to explore, a list of lists
2. 'start' - An integer that determines where the slice of the dataset starts
3. 'end' - An integer that determines where the slice of the dataset ends
4. 'rows_and_columns' - A boolean, False by default, that determines whether the number of rows and columns is printed
5. 'is_ios_data' - A boolean, False by default, that says whether the data set is the iOS or the Android one, so that the matching header is printed

The function starts by slicing the dataset using the start and end parameters. It then checks if rows_and_columns is true and, if so, prints the number of rows and columns of the full dataset. Next it checks if is_ios_data is true: if so, it prints the column names for the iOS data, and if not it prints the column names for the Android data. Finally, it goes through every row in the sliced dataset and prints it.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False, is_ios_data = False):
    dataset_slice = dataset[start:end]    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
    
    if is_ios_data:
        print('Name of columns:', ios_header)
        print('\n')
    else:
        print('Name of columns:', android_header)
        print('\n')
    
    for row in dataset_slice:
        print(row)
        print('\n')

In [3]:
explore_data(ios, 0, 4, True, True)

Number of rows: 7197
Number of columns: 16
Name of columns: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']




In [4]:
explore_data(android, 0, 4, True)

Number of rows: 10841
Number of columns: 13
Name of columns: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




## Data Cleaning
In one of the discussion posts about the Google Play dataset, a user found that row 10472 of the Android data set is incorrect. First we'll print that row alongside another row and compare the two to look for discrepancies. The output below shows that row 10472 has no Category value, so every later field is shifted one column to the left; for that reason the row is deleted.

In [5]:
print(android[10472])
print('\n')
print(android[2000])
del(android[10472]) #should be run only once

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Magic Tiles 3', 'GAME', '4.5', '592504', 'Varies with device', '50,000,000+', 'Free', '0', 'Everyone', 'Music', 'August 3, 2018', '5.13.007', '4.1 and up']
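
The manual comparison above can also be automated. The following is a minimal sketch, assuming that every valid row should have exactly as many fields as the header. `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample rows are made up for illustration.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its category field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.0'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
for index, row in malformed:
    print(index, row)
```

On the real data the call would be `find_malformed_rows(android, android_header)`, run before the deletion. Since the faulty row printed above has 12 fields against a 13-column header, it would be expected to show up in the result.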


### Removing Duplicate Entries 
Part of data cleaning is removing duplicate entries. The Google Play dataset contains duplicates, as the code below shows. In this example we look for duplicate 'Instagram' entries.

In [6]:
# Loop through the android list and print every
# entry whose app name is 'Instagram'
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Now that we know there are duplicates, we can remove them. We start by creating 2 empty lists, one that contains the individual app names and one that contains the duplicate names. We then loop through the android list: if an app's name is already in the individual list, it is a duplicate and is appended to the duplicate list; otherwise it is appended to the individual list.

In [7]:
duplicate_apps = []
individual_apps = []

for app in android:
    name = app[0]
    if name in individual_apps:
        duplicate_apps.append(name)
    else:
        individual_apps.append(name)
        
print('There were a total of '+ str(len(duplicate_apps)) + ' duplicate apps')
print('\nExample of a few duplicate apps: ', duplicate_apps[:5])
print('\nExpected length:', len(android) - len(duplicate_apps))
        

There were a total of 1181 duplicate apps

Example of a few duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']

Expected length: 9659
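
Checking `name in individual_apps` scans a list on every iteration, which gets slow as the list grows. As an alternative, here is a minimal sketch that counts names with `collections.Counter` from the standard library. `count_duplicates` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Count surplus rows per name: an app listed 3 times has 2 duplicates."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Toy rows (hypothetical), shaped like the Google Play data: name first.
sample_rows = [
    ['Instagram', 'SOCIAL'],
    ['Box', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
    ['Box', 'BUSINESS'],
    ['Slack', 'BUSINESS'],
]

duplicates = count_duplicates(sample_rows)
total_duplicates = sum(duplicates.values())
print(duplicates)
print('Expected length:', len(sample_rows) - total_duplicates)
```

The expected length is computed the same way as in the cell above: the number of rows minus the number of surplus entries.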


Now that we know what the expected length of the list should be, we can build a new list, this time keeping, for each duplicated app, the entry with the most reviews. We start by making a dictionary called reviews_max, then loop through the android list and assign the name and the number of reviews to variables. We then check whether the name is already in the dictionary; if not, we add it with its number of reviews. If it is, we compare the two review counts and, if the new value is greater than the stored one, update the dictionary with the greater number. Once the loop is complete, we have a dictionary with each app's maximum review count, and its length should equal the expected length calculated previously.

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Now that we know the maximum number of reviews for each app, we can create a new list that contains every app exactly once, keeping the entry with the most reviews. We do this by creating 2 empty lists, android_clean and already_added. Then we loop through the Google Play dataset and assign the name and the number of reviews to variables. If the number of reviews of the entry equals the number stored for that app in the reviews_max dictionary, and the name is not yet in the already_added list, we append the entry to android_clean and its name to already_added. The second condition matters because several entries of the same app can share the same maximum review count.

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print(android_clean[:4])
print(already_added[:4])
print(len(android_clean))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]
['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book']
9659
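
The two steps above can be condensed into one function and tried on a small example. This is a sketch under stated assumptions: `dedupe_by_max_reviews` is a hypothetical name, the 'Box' row is made up, and a set replaces the already_added list because membership tests on a set are faster.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the most reviews."""
    # Pass 1: record the highest review count seen for each name.
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row per name that matches that maximum.
    clean = []
    seen = set()
    for row in rows:
        name = row[name_index]
        if name not in seen and float(row[reviews_index]) == reviews_max[name]:
            clean.append(row)
            seen.add(name)
    return clean


# Toy rows (hypothetical values), using the same column order as the notebook.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

deduped = dedupe_by_max_reviews(sample_rows)
for row in deduped:
    print(row)
```

Two of the toy Instagram rows share the maximum review count, and only the first of them is kept, which is the case the `already_added` check guards against.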


### Removing Non-English Apps
Since non-English apps are not to be used, we need to remove them from the datasets. We test whether an app is non-English by looking at its name and checking the code of each character with ord(). The characters commonly used in English text all fall in the range 0 to 127 under the ASCII system, so a character with a code above 127 suggests a non-English name. We do this with the eng_find function below. It takes a single parameter, 'string' (an app name), and loops over its characters: if a character's code is greater than 127 it returns False, and if not it returns True.

In [10]:
def eng_find (string):
    for character in string:
        if ord(character) > 127:
            return False
        else:
            return True       

Testing the function using the following strings:

In [11]:
print(eng_find('Instagram'))
print(eng_find('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_find('Docs To Go™ Free Office Suite'))
print(eng_find('Instachat 😜'))

True
False
True
True


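Looking at the results, it's clear that the function does not fully work: the last two strings contain the characters '™' and '😜', which are not ASCII, yet the function returned True for both. The cause is that the else branch returns True as soon as the first character passes the check, so only the first character of each name is ever examined. Separately, rejecting a name because of a single symbol or emoji would be too strict, since many English app names contain one. To address both problems we will adjust the function to count the non-ASCII characters in the whole string and return False only if there are more than 3; otherwise it returns True.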

In [12]:
def eng_find2 (string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            
    if count > 3:
        return False
    else:
        return True

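Even with this adjustment the filter is not perfect: an English app name with more than three emoji or symbols would still be rejected, and a non-English name with three or fewer non-ASCII characters would slip through. It is good enough for this analysis, so we'll leave it at this and move on.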

In [13]:
print(eng_find2('Docs To Go™ Free Office Suite'))
print(eng_find2('Instachat 😜'))
print(eng_find2('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
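
To see why each test string passes or fails, it helps to print the number of non-ASCII characters next to the decision. The sketch below applies the same rule as eng_find2. The helper names `count_non_ascii` and `is_probably_english` and the `max_non_ascii` parameter are hypothetical additions, not part of the notebook.

```python
def count_non_ascii(text):
    """Number of characters whose code point is above 127 (outside ASCII)."""
    return sum(1 for character in text if ord(character) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Same rule as eng_find2: allow at most `max_non_ascii` such characters."""
    return count_non_ascii(text) <= max_non_ascii


test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in test_names:
    print(name, '|', count_non_ascii(name), '|', is_probably_english(name))
```

Keeping the threshold as a parameter makes it easy to experiment with stricter or looser values, although any fixed cutoff remains a heuristic.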


In [14]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if eng_find2(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if eng_find2(name):
        ios_english.append(app)

print(android_english[:2])
print("\n")
print(ios_english[:2])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]


[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


### Isolating for Free Apps
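In this part of the code we will remove all apps that are not free from the given datasets. We do so by creating 2 empty lists, one for the Android data set and one for the iOS data set. We then loop through each English-only list, find all the free apps and append them to their respective lists. Because the prices are stored as strings, a free app appears as '0' in the Google Play data and as '0.0' in the App Store data.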


In [15]:
android_free = []
ios_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
print(len(android_free))  
print(len(ios_free))
    

8864
3222
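
The string comparisons above depend on the exact formatting of each store's price column ('0' versus '0.0'). A more forgiving option is to convert the price to a number first. The following is a minimal sketch with a hypothetical helper `is_free`. It assumes that paid prices may carry a leading '$', which is an assumption about the data and is not shown in the rows printed above.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: a paid price may look like '$4.99', so drop a leading '$'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Price strings that cannot be parsed are treated as not free.
        return False


for price in ['0', '0.0', '$4.99', '']:
    print(repr(price), is_free(price))
```

With a helper like this, both loops could use the same test, for example `if is_free(app[7])` for the Android rows and `if is_free(app[4])` for the iOS rows.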


## Most Common Apps by Genre
Since the company's goal is to build apps that attract more users, and therefore more revenue, we'll go through each dataset to find the most common app profiles and genres. To achieve this we will create frequency tables for the prime_genre column of the App Store data set and for the Category and Genres columns of the Google Play data set.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.


In [16]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
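
As a quick check of what a percentage frequency table looks like, here is a sketch on a four-row toy dataset. It uses `collections.Counter` from the standard library rather than the hand-written counting loop. `freq_percentages` and the sample rows are hypothetical and only meant for illustration.

```python
from collections import Counter


def freq_percentages(rows, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Toy rows (hypothetical): app name, then genre.
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = freq_percentages(sample_rows, 1)
# Sort by percentage, highest first, similar to what display_table prints.
for genre, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```

Three of the four toy apps are games, so Games gets 75% and Music 25%. The percentages of a table always add up to 100, which is a handy sanity check on the real data too.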

### Examining the datasets 
__Examining the prime_genre column in iOS dataset__

Here we can see that Games is the most common genre, at about 58% of the free English apps, followed by Entertainment (about 8%) and Photo & Video (about 5%). After that, the genres are more practical, lifestyle-oriented ones. From this frequency table, I can recommend creating a game app, as games make up the majority of the free apps.

In [17]:
display_table(ios_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


__Examining the Category column in the Android dataset__

Here we can see that the most common categories are FAMILY, GAME and TOOLS. After that, the categories become more lifestyle-oriented.

In [18]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

__Examining the Genres column in the Android dataset__

The most common genres are Tools, Entertainment and Education.

In [19]:
display_table(android_free, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

### Most Popular Apps by Genre on the App Store

In this part of the code we will look at the most popular apps by genre. Ideally we would use the number of installs, but while the Android dataset has an Installs column, the iOS dataset does not, so as a proxy we will use the average number of user ratings (the rating_count_tot column) for each genre. We do so with a nested loop. The outer loop goes through the iOS genres in the frequency table and the inner loop goes through the free iOS dataset; if the genres match, we add the app's number of ratings to the running total and add one to the count of apps in that genre. Once the inner loop has ended, we find the average number of ratings by dividing the total by the count.

In [20]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Based on the results above, Navigation apps seem to be the most popular, followed by Reference and Social Networking. With that, I would recommend working on a free Navigation app.
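
The nested loop scans the whole dataset once per genre. The same averages can be computed in a single pass by keeping a running total and a count per genre. The following is a minimal sketch: `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` per value of `group_index`."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows (hypothetical): app name, genre, number of ratings.
sample_rows = [
    ['Maps X', 'Navigation', '100'],
    ['Maps Y', 'Navigation', '300'],
    ['Game Z', 'Games', '50'],
]

averages = average_by_group(sample_rows, 1, 2)
for genre, avg in averages.items():
    print(genre, ':', avg)
```

On the notebook's data the equivalent call would be `average_by_group(ios_free, -5, 5)`, using the same column positions as the cell above.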

### Most Popular Apps by Genre on Google Play
In this part of the code we will find the most popular category of apps on Google Play based on installs. To do this we will create a frequency table for the Category column of the Google Play dataset. Then, using a nested loop, we'll loop through the frequency table and the free Android dataset and check whether the categories match. If so, we add the number of installs of that app to the total variable. However, the install values are strings containing characters such as '+' and ',', as seen below, so we must remove those characters before converting the values to numbers.

In [23]:
display_table(android_free, 5) 

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Once we remove these extra characters, we can add the number of installs to the total and add one to the number of occurrences. Once the inner loop has ended, we divide the total by the number of occurrences to get the average.
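
The cleaning step on its own can be sketched as a small function. `parse_installs` is a hypothetical helper name. It performs the same two replacements as the cell below and then converts the result to a float.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


for bucket in ['1,000,000+', '500+', '0+', '0']:
    print(bucket, '->', parse_installs(bucket))
```

A value like '100,000+' only states a lower bound for the real number of installs, so the averages computed from these values are rough estimates rather than exact figures.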

In [30]:
android_category = freq_table(android_free, 1)

for category in android_category:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Here we can see that the most popular app category in the Android data set, measured by average number of installs, is COMMUNICATION.