# Analyzing Successful Free App Profiles in the AppleStore and Google Play Markets
## Determining what differentiates the boring apps from the cool ones
In this project we are working with a team of app developers, and our task as junior data scientists is to find the profile of the most successful types of apps in both the **App Store** and **Google Play** markets.

In [1]:
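# Read both CSV files as lists of lists (the first row of each list is the header)
from csv import reader

# 'with' closes each file once it has been read; the explicit encoding avoids platform-dependent decoding errors
with open('AppleStore.csv', encoding='utf8') as opened_ios:
    AppleStore = list(reader(opened_ios))
with open('googleplaystore.csv', encoding='utf8') as opened_android:
    googleplaystore = list(reader(opened_android))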

In [2]:
# Create a function to explore the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# AppleStore
explore_data(AppleStore, 1, 5, True)

# Google Play Store
explore_data(googleplaystore, 1, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live 

In [4]:
# Checking Headers
print(AppleStore[:1])
print('\n')
print(googleplaystore[:1])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']]


[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]


## Unifying the Data for our purposes
To unify the data we must be aware that the number of columns is not the same in the two data files. We must select the columns that will be useful for our analysis and disregard the rest. When unifying the columns of the **AppleStore** data and the **googleplaystore** data we must make sure that each value stays under its corresponding column header.
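One way to keep values aligned with their headers is to pick columns by header name rather than by position. The sketch below is only an illustration: the mapping `ANDROID_COLUMNS`, the helper `select_columns`, the common field names and the sample row are all assumptions made up here, not part of the notebook.

```python
# Hypothetical mapping: common field name -> header label in the Google Play file.
ANDROID_COLUMNS = {'name': 'App', 'price': 'Price', 'genre': 'Category'}


def select_columns(dataset, column_map):
    """Return rows as dicts keyed by the common field names.

    `dataset` is a list of lists whose first row is the header.
    """
    header = dataset[0]
    # Look up each wanted column's position once, by its header label.
    positions = {field: header.index(label) for field, label in column_map.items()}
    return [
        {field: row[pos] for field, pos in positions.items()}
        for row in dataset[1:]
    ]


# Tiny made-up sample using some of the Google Play header labels.
sample_android = [
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price'],
    ['Demo App', 'TOOLS', '4.2', '10', '1M', '100+', 'Free', '0'],
]
unified = select_columns(sample_android, ANDROID_COLUMNS)
print(unified)
```

A similar mapping for the App Store file (for example `track_name`, `price`, `prime_genre`) would give both datasets the same keys.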


In [5]:
# Checking the length of the rows around a reported error (found on a blog site)
print(len(googleplaystore[10472]))
print(len(googleplaystore[10473]))
print(len(googleplaystore[10474]))

#Row 10473 has one fewer data point

13
12
13
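Rather than relying on a row number reported elsewhere, the same check can be run over the whole dataset. This is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper, not something defined in this notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, length) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the last row is missing one value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

Calling it on `googleplaystore` should list every row whose length does not match the header.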


In [6]:
# Row 10473 is missing its Category value, so every later value is shifted one column to the left
print(googleplaystore[10473])


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
# We delete the row with the missing value from the googleplaystore dataset (run this cell only once)
del googleplaystore[10473]

In [8]:
#We check again to make sure that the row has been deleted
print(googleplaystore[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Duplicate entries in the googleplaystore data set
By reading the Kaggle discussions online we have discovered that the dataset has duplicate entries for some apps. In the following few cells we will identify the duplicates and delete them, keeping only the most relevant (most recent) entry for each app.
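As a rough sketch of the counting step, `collections.Counter` can tally app names in one pass, which avoids repeated `in` lookups on a growing list. The helper `count_duplicates` and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of surplus duplicate rows)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every row beyond the first one for a given name counts as a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Made-up sample: 'Box' appears three times, so two rows are surplus.
sample_rows = [['Box', '10'], ['Slack', '5'], ['Box', '12'], ['Box', '12']]
n_unique, n_duplicates = count_duplicates(sample_rows)
print(n_unique, n_duplicates)
```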

In [9]:
# Checking for duplicate app entries and counting how many unique apps are in the Google Play Store dataset
unique_apps = []
duplicate_apps = []
for app in googleplaystore[1:]:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))
print('\n')
print('Example of duplicate entry for apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181


Number of unique apps:  9659


Example of duplicate entry for apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


To be more precise with our data, instead of removing the duplicated apps at random we will delete the entries with fewer reviews, assuming that those entries are older.
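The same idea can be sketched as a single pass that keeps, for each name, the row with the highest review count. This is an illustrative alternative on made-up rows, not the method the cells below use; `keep_most_reviewed` is a hypothetical helper, and its output follows the order in which each name first appears.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the earlier row, so no name is stored twice.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample: three 'Box' rows, two of them tied on reviews.
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
deduped = keep_most_reviewed(sample_rows)
print(deduped)
```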

In [10]:
# Create a dictionary that allows only one entry per app (the highest review count seen for that app)
reviews_max = {}
for app in googleplaystore[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Number of items in our new dictionary', len(reviews_max)) # Must equal 9659, as shown above
for key in list(reviews_max)[:5]:
    print("key: {}, value: {} ".format(key, reviews_max[key]), '\n')

Number of items in our new dictionary 9659
key: Jaumo Dating, Flirt & Live Video, value: 900064.0  

key: ES App Locker, value: 32207.0  

key: C Examples, value: 1002.0  

key: Realtor.com Real Estate: Homes for Sale and Rent, value: 162243.0  

key: A+ Mobile, value: 730.0  



In [11]:
# With the help of the reviews_max dictionary we clean the main list and keep only one row per app
# The list "already_added" prevents entries with the same number of reviews from being included twice

android_clean = []
already_added = []
for app in googleplaystore[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
    
print(len(android_clean))
print(android_clean[:10])    

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000

In [12]:
# Creating a function to identify English app names
def is_english(phrase):
    non_english = 0
    for letter in phrase:
        if ord(letter) > 127:
            non_english += 1
        if non_english > 3:  # Some English app names contain emojis or symbols, so we only reject names with more than 3 non-ASCII characters
            return False
    return True
        
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('😜 Instachat '))
#Testing our new function

True
False
True
True
True


Our function is not perfect: some English apps will be excluded (those whose names contain more than three non-ASCII characters, such as emojis), and non-English names written in ASCII characters will pass. Still, it is accurate enough for our purposes.
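To make the threshold explicit, here is a minimal sketch of the same rule with the cutoff as a parameter. `is_mostly_ascii` and `max_non_ascii` are names made up for this example; the rule itself (more than three characters with a code point above 127 means rejection) is the one used above.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True when `text` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii


# Boundary cases: three non-ASCII characters pass, four do not.
checks = {
    'Instagram': is_mostly_ascii('Instagram'),
    'three emojis': is_mostly_ascii('Fun 😜😜😜'),
    'four emojis': is_mostly_ascii('Fun 😜😜😜😜'),
}
print(checks)
```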

In [13]:
# We use the above function to create new lists that include only the apps whose title passes the is_english check
ios_clean = []
andr_clean = []

for app in android_clean:
    if is_english(app[0]):
        andr_clean.append(app)
        
# AppleStore still contains its header row; the header passes the check, so it is kept in ios_clean
for app in AppleStore:
    if is_english(app[1]):
        ios_clean.append(app)

# We print the new lengths to see how many rows we will be working with
print(len(andr_clean))
print(len(ios_clean))

9614
6184


In [14]:
# Lastly, we create 2 final lists and only include the free apps, since that is what we are interested in
free_ios = []
free_android = []
for app in ios_clean[1:]:
    price = float(app[4])
    if price == 0:
        free_ios.append(app)
        
for app in andr_clean:
    app_type = app[6]
    if app_type != 'Paid': # A different test from the price check above, for practice; it keeps every row whose Type is not exactly 'Paid'
        free_android.append(app)

In [15]:
# Display the rows in each final version of the datasets
print('Android Free: ', len(free_android))
print('IOS Free: ', len(free_ios))

Android Free:  8864
IOS Free:  3222


### Most common app profiles for both IOS and Android
Since we want to help our team build an app that will work well on both Android and iOS, we have to identify which app profiles thrive on both platforms. In the next cells we are going to explore the most common genres among the apps on each platform.
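As a rough illustration of what such a frequency table contains, here is a minimal sketch using `collections.Counter`. `sorted_freq_table` and the sample rows are made up for this example and are not the functions this notebook defines.

```python
from collections import Counter


def sorted_freq_table(dataset, index):
    """Return (value, share) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total) for value, count in counts.most_common()]


# Made-up sample: three of the four apps are games.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = sorted_freq_table(sample, 1)
for value, share in table:
    print(value, ':', share)
```

Each share is a proportion of all rows, so the shares in one table add up to 1.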

In [16]:
# Creating a function to show the frequency data tables
def freq_table(dataset, index):
    frequency = {}
    for row in dataset:
        genre = row[index]
        if genre in frequency:
            frequency[genre] += 1
        else:
            frequency[genre] = 1
    for element in frequency:
        frequency[element] /= len(dataset)
    return frequency    

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


In [18]:
# Showing frequency tables for genres in AppleStore and for genres & categories in Google Play Store
# display_table prints its rows itself and returns nothing, so we call it directly instead of wrapping it in print()
display_table(free_ios, 11)
print('\n')
display_table(free_android, 9)
print('\n')
display_table(free_android, 1)

Games : 0.5816263190564867
Entertainment : 0.07883302296710118
Photo & Video : 0.04965859714463067
Education : 0.03662321539416512
Social Networking : 0.032898820608317815
Shopping : 0.0260707635009311
Utilities : 0.025139664804469275
Sports : 0.021415270018621976
Music : 0.020484171322160148
Health & Fitness : 0.020173805090006207
Productivity : 0.01738050900062073
Lifestyle : 0.015828677839851025
News : 0.01334574798261949
Travel : 0.012414649286157667
Finance : 0.0111731843575419
Weather : 0.008690254500310366
Food & Drink : 0.008069522036002483
Reference : 0.00558659217877095
Business : 0.005276225946617008
Book : 0.004345127250155183
Navigation : 0.00186219739292365
Medical : 0.00186219739292365
Catalogs : 0.0012414649286157666


Tools : 0.08449909747292418
Entertainment : 0.06069494584837545
Education : 0.05347472924187725
Business : 0.04591606498194946
Productivity : 0.03892148014440433
Lifestyle : 0.03892148014440433
Finance : 0.03700361010830325
Medical : 0.03531137184115

In [19]:
# Showing a table with the average number of user ratings per app in each genre of the AppleStore dataset
genre_ios = freq_table(free_ios, 11)
for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in free_ios:
        genre_app = app[11]
        if genre_app == genre:
            rating = float(app[5])
            total += rating
            len_genre += 1
    avg_user_ratings = round(total / len_genre,2)
    print(genre, ': ', avg_user_ratings)


Catalogs :  4004.0
Medical :  612.0
Finance :  31467.94
Education :  7003.98
Health & Fitness :  23298.02
Navigation :  86090.33
Shopping :  26919.69
Travel :  28243.8
Entertainment :  14029.83
Games :  22788.67
Lifestyle :  16485.76
Food & Drink :  33333.92
News :  21248.02
Productivity :  21028.41
Social Networking :  71548.35
Book :  39758.5
Utilities :  18684.46
Photo & Video :  28441.54
Business :  7491.12
Weather :  52279.89
Sports :  23008.9
Music :  57326.53
Reference :  74942.11


Navigation looks like the genre with the highest average number of ratings on the iOS platform. That is somewhat strange, so let's see what might be causing it...

In [20]:
for app in free_ios:
    if app[11] == 'Navigation':
        print(app[1], ': ', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5
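Two apps account for almost all of the ratings in this genre, which is why the average is so high. A quick way to see the skew is to compare the mean with the median, as in this minimal sketch using the six rating counts printed above and the standard `statistics` module (the variable names are made up here).

```python
from statistics import mean, median

# Rating counts for the six free Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

mean_ratings = mean(navigation_ratings)
median_ratings = median(navigation_ratings)

print('Mean:  ', round(mean_ratings, 2))
print('Median:', median_ratings)
```

The median is far below the mean, so a typical Navigation app gets far fewer ratings than the genre average suggests.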


What about the *Reference* apps? What the heck are those anyways?

In [21]:
for app in free_ios:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


## Now let's check out the most popular apps by category on the Google Play platform

In [22]:
# Repeating the process: a table with the average number of installs per app in each category
category_android = freq_table(free_android, 1)
for category in category_android:
    total = 0 
    len_category = 0 
    popular_apps = 0 # counts the very popular apps, those with more than 10M installs
    for app in free_android:
        if app[1] == category:
            # Strip the '+' and ',' characters from strings such as '10,000+' and convert to float for the calculations
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs.replace(',', ''))
            total += n_installs
            len_category += 1
            if n_installs > 10000000:
                popular_apps += 1
    avg_num_installs = total / len_category
    print(category, ': ', round(avg_num_installs,2), 'Number of apps: ', len_category, 'Apps with +10M installs', ': ', popular_apps)

FINANCE :  1387692.48 Number of apps:  328 Apps with +10M installs :  2
EVENTS :  253542.22 Number of apps:  63 Apps with +10M installs :  0
BEAUTY :  513151.89 Number of apps:  53 Apps with +10M installs :  0
NEWS_AND_MAGAZINES :  9549178.47 Number of apps:  248 Apps with +10M installs :  4
BUSINESS :  1712290.15 Number of apps:  407 Apps with +10M installs :  7
HOUSE_AND_HOME :  1331540.56 Number of apps:  73 Apps with +10M installs :  0
MEDICAL :  120550.62 Number of apps:  313 Apps with +10M installs :  0
FOOD_AND_DRINK :  1924897.74 Number of apps:  110 Apps with +10M installs :  0
SPORTS :  3638640.14 Number of apps:  301 Apps with +10M installs :  9
WEATHER :  5074486.2 Number of apps:  71 Apps with +10M installs :  4
PARENTING :  542603.62 Number of apps:  58 Apps with +10M installs :  0
HEALTH_AND_FITNESS :  4188821.99 Number of apps:  273 Apps with +10M installs :  3
ENTERTAINMENT :  11640705.88 Number of apps:  85 Apps with +10M installs :  9
LIBRARIES_AND_DEMO :  638503.73 Number of app

We see that there are some categories with an impressive average number of installs, which might be skewed by a few very popular apps (here, apps with more than 10M installs). Let's explore this phenomenon in the Communication and News & Magazines categories.

In [23]:
# Exploring the apps with the highest install counts in the Communication category
for app in free_android:
    if app[1] == "COMMUNICATION" and (app[5] == '1,000,000,000+' or
                                      app[5] == '500,000,000+'):
        print(app[0], ': ', app[5])

WhatsApp Messenger :  1,000,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Viber Messenger :  500,000,000+


In [24]:
# Recalculating the average installs per app in the "Communication" category, keeping only apps below 100M installs
under_100M = []
for app in free_android:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'COMMUNICATION' and n_installs < 100000000:
        under_100M.append(n_installs)
        
print('Average installs for communication apps excluding apps with over 100M installs: ', sum(under_100M) / len(under_100M))


Average installs for communication apps excluding apps with over 100M installs:  3603485.3884615386


In [25]:
# Exploring the apps with the highest install counts in the News & Magazines category
for app in free_android:
    if app[1] == "NEWS_AND_MAGAZINES" and (app[5] == '1,000,000,000+' or
                                      app[5] == '500,000,000+'):
        print(app[0], ': ', app[5])

Twitter :  500,000,000+
Flipboard: News For Our Time :  500,000,000+
Google News :  1,000,000,000+


In [26]:
# Recalculating the average installs per app in the "News and Magazines" category, keeping only apps below 100M installs (the rest are outliers)
under_100M = []
for app in free_android:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'NEWS_AND_MAGAZINES' and n_installs < 100000000:
        under_100M.append(n_installs)
        
print('Average installs for news and magazines apps excluding apps with over 100M installs: ', sum(under_100M) / len(under_100M))

Average installs for news and magazines apps excluding apps with over 100M installs:  1502841.8775510204


In [27]:
# Recalculating the average installs per app in the "Entertainment" category, keeping only apps below 10M installs
under_10M = []
for app in free_android:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'ENTERTAINMENT' and n_installs < 10000000:
        under_10M.append(n_installs)
        
print('Average installs for entertainment apps excluding apps with over 10M installs: ', sum(under_10M) / len(under_10M))

Average installs for entertainment apps excluding apps with over 10M installs:  1597500.0


In [28]:
# Recalculating the average installs per app in the "Books and Reference" category, keeping only apps below 10M installs
under_10M = []
for app in free_android:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'BOOKS_AND_REFERENCE' and n_installs < 10000000:
        under_10M.append(n_installs)
        
print('Average installs for books and reference apps excluding apps with over 10M installs: ', sum(under_10M) / len(under_10M))

Average installs for books and reference apps excluding apps with over 10M installs:  457134.0963855422


In [29]:
# Recalculating the average installs per app in the "Photography" category, keeping only apps below 10M installs
under_10M = []
for app in free_android:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = float(n_installs.replace(',', ''))
    if category == 'PHOTOGRAPHY' and n_installs < 10000000:
        under_10M.append(n_installs)
        
print('Average installs for Photography apps excluding apps with over 10M installs: ', sum(under_10M) / len(under_10M))

Average installs for Photography apps excluding apps with over 10M installs:  1142753.4662576688
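The last few cells repeat the same steps with a different category and cap each time. A minimal sketch of how they could be folded into one helper follows; `parse_installs`, `avg_installs_below` and the sample rows are assumptions made up for illustration, with the category in column 1 and the installs in column 5 as in the Google Play data.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))


def avg_installs_below(dataset, category, cap, category_index=1, installs_index=5):
    """Average installs for `category`, counting only apps strictly below `cap`."""
    installs = [
        parse_installs(row[installs_index])
        for row in dataset
        if row[category_index] == category
    ]
    kept = [n for n in installs if n < cap]
    # Return None rather than dividing by zero when nothing is below the cap.
    return sum(kept) / len(kept) if kept else None


# Made-up rows; only columns 1 (category) and 5 (installs) matter here.
sample = [
    ['A', 'PHOTOGRAPHY', '', '', '', '1,000+'],
    ['B', 'PHOTOGRAPHY', '', '', '', '3,000+'],
    ['C', 'PHOTOGRAPHY', '', '', '', '1,000,000,000+'],
    ['D', 'TOOLS', '', '', '', '500+'],
]
photo_avg = avg_installs_below(sample, 'PHOTOGRAPHY', 10_000_000)
print(photo_avg)
```

With a helper like this, each comparison becomes a single call such as `avg_installs_below(free_android, 'PHOTOGRAPHY', 10000000)`.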


## Recommendation for App Developers
#### Start with an app in the entertainment or photography category
When looking at the average number of installs per category on the Google Play platform, we notice some great performing categories and some not so good ones. 
After ignoring the outliers that drastically change the averages, what we can say to our developers is that the best category to focus on is *Photography*. 
The new app could take the form of a *photo editor or an easy Instagram filter creator* (both popular right now). Another essential question left unanswered in this project is how the team could develop in-app purchases and how effective this category is at monetizing them. That could be a great follow-up project if we were to continue developing an app.