# Analysing Android and iOS mobile apps

This project performs a practical data analysis of a dataset of Android and iOS mobile apps. The goal is to help app developers understand which kinds of apps attract more users, since more users can drive higher revenue growth.

This project will use the following skills in Python:
* Lists and for loops
* Conditional statements
* Dictionaries and frequency tables 
* Functions 

### Opening and Exploring the data 
In September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. This project will use a sample of these, with data collected in August 2018 and July 2017.

Let's start by opening the two datasets and exploring them. 

In [1]:
import csv 

with open('AppleStore.csv', encoding='utf8') as f:
    apple_file = list(csv.reader(f))  # list of rows; row 0 is the header

with open('googleplaystore.csv', encoding='utf8') as f:
    google_file = list(csv.reader(f))  # list of rows; row 0 is the header

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns is True:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_file, 0, 5)

explore_data(google_file, 0, 5)

print(apple_file[0])

print(google_file[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'A

### Deleting Inaccurate Data
Before we start analysing the datasets, we need to ensure the data is accurate through data cleaning. We are interested in free to download apps in the English language, so we will remove paid apps and non-English apps. 

In [3]:
print(google_file[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


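A Kaggle community discussion reports an error in row 10473 of the Google Play dataset. The print statement above confirms it: the row has 12 values against the header's 13 because the Category value is missing, so every later value sits one column to the left (the Rating position holds '19'). We delete the row so that it does not distort the analysis.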
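A row like this can also be found without relying on a forum post, by comparing each row's length with the header's. The following is a minimal sketch on a made-up three-row dataset; `find_malformed_rows` and `sample` are illustrative names, not part of the project.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # the header defines the expected column count
    return [
        (i, row)
        for i, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]

# Made-up data: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]

print(find_malformed_rows(sample))
```

Calling `find_malformed_rows(google_file)` on the full dataset would be the natural next step, but its output is not shown here because it was not run as part of this project.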

In [4]:
del google_file[10473]

### Removing Duplicate entries
Some of the data has multiple entries for the same app.

In [5]:
for row in google_file:
    app_name = row[0]
    if app_name == 'Instagram':
        print(row[:10])

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social']


We could remove the duplicates at random, but the number of reviews gives us a better criterion: the row with the highest review count was most likely collected most recently, so that is the row we keep.

Let's count the number of duplicates. 

In [6]:
duplicate_apps = []
unique_apps = []

for row in google_file[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
print('The number of duplicate apps is:', len(duplicate_apps))
print('\n')
print('The number of unique apps is:', len(unique_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])


The number of duplicate apps is: 1181


The number of unique apps is: 9659


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
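The loop above tests membership in a list for every row, which gets slower as the list grows. As an alternative, here is a minimal sketch that counts names with `collections.Counter` from the standard library. The function name `count_duplicates` and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra occurrences)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first counts as a duplicate,
    # which matches how the loop above counts them.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates

# Made-up rows containing only the name column
sample_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample_rows))
```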


We can follow the same process for the iOS dataset. 

In [7]:
ios_duplicate_apps = []
ios_unique_apps = []

for row in apple_file[1:]:
    app_id = row[0]  # column 0 of the iOS data is 'id'; the app name is in column 1
    if app_id in ios_unique_apps:
        ios_duplicate_apps.append(app_id)
    else:
        ios_unique_apps.append(app_id)
print('The number of duplicate apps is:', len(ios_duplicate_apps))
print('\n')
print('The number of unique apps is:', len(ios_unique_apps))
print('\n')
print('Examples of duplicate apps:', ios_duplicate_apps[:10])

The number of duplicate apps is: 0


The number of unique apps is: 7197


Examples of duplicate apps: []


This check uses the first column of the iOS dataset, which holds the app id, and it shows that no id appears more than once. We therefore treat the iOS dataset as free of duplicate entries and move on to removing the duplicates from the Google Play dataset.

We first create a dictionary where the key of the dictionary is the app name, and the dictionary value is the highest number of reviews of that app. 

In [8]:
reviews_max = {}

for row in google_file[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


The dictionary has 9659 entries, which matches the number of unique apps we counted earlier. We can now use this dictionary to remove the duplicate rows.

In [9]:
android_clean = []
already_added = []

for row in google_file[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print(len(android_clean))

9659
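The two passes above (build `reviews_max`, then filter) can also be folded into one. The sketch below keeps, for each name, the row with the highest review count. `keep_most_reviewed` is a hypothetical helper, and the 'Box' row is made up for the example.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins a tie, as in the two-pass version
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],      # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]
for row in keep_most_reviewed(rows):
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each name, not by the position of the row that was kept.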


### Removing Non-English Apps
Now that we have cleaned the data, we need to remove any apps that are non-English. To do this, we need to write a function that takes in a string (e.g. app name) and checks whether the characters are common English characters. One way to do this is to check whether the character is in the range of 0 - 127 according to the ASCII system. 

In [13]:
def language_check(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

print(language_check('Instagram'))
print(language_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_check('Docs To Go™ Free Office Suite'))
print(language_check('Instachat 😜'))

True
False
False
False


Although this function works quite well, it incorrectly identifies apps with emojis or special characters as non-English. This is because they fall outside of the 0-127 ASCII range. 
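To see why these names fail the check, it helps to look at the code points themselves. The sketch below prints `ord()` for a few characters and lists the characters in a name that fall outside 0 to 127; `non_ascii_chars` is an illustrative helper, not part of the project.

```python
def non_ascii_chars(text):
    """Return the characters in text whose code point is above 127."""
    return [ch for ch in text if ord(ch) > 127]

# Letters and common punctuation stay at or below 127; symbols and emoji do not
for ch in ['a', 'Z', '~', '™', '😜']:
    print(repr(ch), ord(ch), ord(ch) > 127)

print(non_ascii_chars('Docs To Go™ Free Office Suite'))
print(non_ascii_chars('Instachat 😜'))
```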

To minimise data loss, the function should return False if there are more than three characters that fall outside of the ASCII range. 

In [15]:
def language_check(a_string):
    special_character = 0
    for character in a_string:
        if ord(character) > 127:
            special_character += 1

    if special_character > 3:
        return False
    else:
        return True
print(language_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(language_check('Docs To Go™ Free Office Suite'))
print(language_check('Instachat 😜'))

False
True
True
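The cut-off of three characters is a judgment call, so it can be useful to make it a parameter and see how the result changes. This is a minimal sketch under that assumption; `is_probably_english` and `max_non_ascii` are names introduced here for illustration only.

```python
def is_probably_english(name, max_non_ascii=3):
    """True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# A threshold of 0 reproduces the strict first version of the check
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

The heuristic is still imperfect: a short non-English name with three or fewer non-ASCII characters passes, and an English name with many emoji fails.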


In [20]:
ios_clean = []
google_clean = []

for row in apple_file[1:]:
    name = row[1]

    if language_check(name):
        ios_clean.append(row)

for row in android_clean:
    name = row[0]

    if language_check(name):
        google_clean.append(row)

explore_data(ios_clean, 0, 4, True)  # explore_data is the helper defined at the start of the project
print('\n')
explore_data(google_clean, 0, 4, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketc

### Isolating the Free apps
So far, the data cleaning process has consisted of removing inaccurate data, removing duplicate entries and removing non-English apps. The last step of the data cleaning process is to isolate the free apps.

In [26]:
ios_free = []

google_free = []

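# Caution: ios_clean was built from apple_file[1:] and has no header row,
# so this [1:] slice skips the first app rather than a header.
# The counts below were produced with the slice as written.
for row in ios_clean[1:]:
    price = row[4]
    
    if price == '0.0':  # the price is stored as a string rather than a number
        ios_free.append(row)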
        
for row in google_clean:
    price = row[7]
    
    if price == '0':
        google_free.append(row)

print('The length of iOS clean data is:', len(ios_free))
print('\n')
print('The length of Google clean data is:', len(google_free))

The length of iOS clean data is: 3221


The length of Google clean data is: 8864
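The two loops compare against different strings ('0.0' for iOS, '0' for Google Play) because each dataset stores the price as text in its own format. One way to avoid depending on the exact spelling is to parse the price as a number first. The sketch below assumes that paid prices may carry a leading '$' sign; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """True if a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: '$' is the only currency marker
    return float(cleaned) == 0.0

for price in ['0.0', '0', '$4.99', '2.99']:
    print(price, is_free(price))
```

A value that is not a number at all raises `ValueError`, which is arguably useful here because it points to a malformed row.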


### Most Common Apps by Genre
The final dataset for analysis consists of 3221 iOS apps and 8864 Google Play apps. Now we can analyse which kinds of apps attract more users and so support higher revenue growth. The end goal is an app that is available on both Google Play and the App Store, so we need to find app profiles that perform well in both markets.

To begin the analysis, let's see which genres are most common in each market. To do this, we will build frequency tables for some of the columns in the clean datasets. The best columns for this analysis are the prime_genre column in the iOS dataset, and the Genres and Category columns in the Google Play dataset.

We will build two functions to help with this task:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [29]:
def freq_table(dataset, index):
    freq = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in freq:
            freq[value] +=1
        else:
            freq[value] = 1
    
    freq_pct = {}
    for row in freq:
        pct = (freq[row]/ total) * 100
        freq_pct[row] = pct
    
    return freq_pct 

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_free, 11)
print('\n')


Games : 58.180689226948154
Entertainment : 7.885749767153058
Photo & Video : 4.967401428127911
Education : 3.6634585532443342
Social Networking : 3.2598571872089415
Shopping : 2.607885749767153
Utilities : 2.5147469729897547
Sports : 2.1421918658801617
Music : 2.049053089102763
Health & Fitness : 2.018006830176964
Productivity : 1.7385904998447685
Lifestyle : 1.5833592052157717
News : 1.334989133809376
Travel : 1.2418503570319777
Finance : 1.11766532132878
Weather : 0.8692952499223843
Food & Drink : 0.8072027320707855
Reference : 0.55883266066439
Business : 0.5277864017385905
Book : 0.43464762496119214
Navigation : 0.18627755355479667
Medical : 0.18627755355479667
Catalogs : 0.12418503570319776




The table shows that Games is by far the most common value in the prime_genre column, accounting for 58.18% of free iOS apps. Every other genre is under 10%; the next three are Entertainment (7.89%), Photo & Video (4.97%) and Education (3.66%).
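For comparison, the same kind of percentage table can be built more compactly with `collections.Counter`, whose `most_common()` method already returns entries in descending order of count. This is a sketch on made-up rows, and `freq_table_pct` is an illustrative name, not a replacement for the functions above.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Map each value in column `index` to its share of rows, largest first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # Dicts keep insertion order, so the result stays sorted by frequency
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up rows: (name, genre)
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]
for genre, pct in freq_table_pct(sample_rows, 1).items():
    print(genre, ':', round(pct, 2))
```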

In [30]:
display_table(google_free, 1)
print('\n')
display_table(google_free, 9)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

From these tables, the most common entries in the Category and Genres columns are Family and Tools; in the Category column, Family leads with 18.91% of free Google Play apps. The other top genres relate to Game, Education and Entertainment.

This tells us that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps.

Now we'd like to know which genres have the most users. For the Google Play dataset, we can use the Installs column. The iOS dataset has no install counts, so we use the total number of user ratings (the rating_count_tot column) as a proxy.

In [38]:
#Start by generating a frequency table for the prime_genre column

ios_genre = freq_table(ios_free, 11)

for genre in ios_genre:
    total = 0
    len_genre = 0
    
    for row in ios_free:
        genre_app = row[11]
        if genre_app == genre:
            rating = float(row[5])
            total += rating
            len_genre += 1
    
    avg_nr_ratings = total / len_genre
    print(genre, ":", avg_nr_ratings)
    

Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Social Networking : 43899.514285714286
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
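The nested loop above scans the whole dataset once per genre. The same averages can be collected in a single pass by keeping a running total and count for each genre. The sketch below uses made-up rows; `average_by_group` is a hypothetical helper name.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: (genre, number of ratings stored as text)
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Medical', '50'],
]
print(average_by_group(sample_rows, 0, 1))
```

With the project data, the equivalent call would be `average_by_group(ios_free, 11, 5)`.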


In [40]:
google_genres = freq_table(google_free, 1)

for category in google_genres:
    total = 0
    len_category = 0
    
    for row in google_free:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total_installs = float(installs)
            total += total_installs
            len_category += 1
            
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
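The averages above depend on turning install strings such as '1,000,000,000+' into numbers. That conversion can be isolated in a small function so it is easy to check on its own. This is a minimal sketch; `parse_installs` is an illustrative name.

```python
def parse_installs(text):
    """Convert an install string such as '10,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

for raw in ['1,000,000,000+', '10,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Because a value like '10,000+' reads as "at least 10,000", the parsed numbers are lower bounds, and averages built on them are best treated as rough estimates.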

## Conclusion
The Games genre is popular, but its market may already be saturated.

The books and reference genre looks fairly popular as well, with an average of 8,767,811 installs per app on Google Play. On the App Store, the Reference and Book genres also show fairly high average numbers of user ratings (roughly 74,900 and 39,800 respectively). Since our aim is to recommend an app genre with the potential to be profitable on both the App Store and Google Play, books and reference is the genre we recommend for app development.