# Attractive Apps - a data science project

This project will analyse data from various apps to determine which type of app attracts the most users.

Since our company develops free apps for Google Play and the App Store, revenue is gained from paid ads, and the revenue from those ads is determined by the number of users.

In [55]:
from csv import reader


# open AppleStore.csv dataset (the with block closes the file once it is read)
with open('AppleStore.csv', encoding="utf-8") as open_applestore:
    apple_data = list(reader(open_applestore))

# open googleplaystore.csv dataset
with open('googleplaystore.csv', encoding="utf-8") as open_playstore:
    play_data = list(reader(open_playstore))


Let's define an explore_data function to help us take a quick look at the data.

In [56]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

Let's use the explore_data function to take an initial look at the data sets.

In [57]:
# first explore the AppleStore.csv dataset
explore_data(apple_data,0,5)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating',

## Cleaning Data

The next step is to begin the data cleaning process. First we will look for any data rows which have errors.

An example of an error row is shown below.

In [58]:
error_row = play_data[10473]
print(error_row)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Comparing row 10473 with the header row shows the problem. The columns should appear in the following order:

Header: - ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Row 10473: - ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

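The Category value is missing from this row, so every later value has shifted one column to the left: '1.9' sits in the Category position and '19' in the Rating position.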
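A row with a missing value is shorter than the header, so one way to look for such rows is to compare lengths. The following is a minimal sketch on a made-up miniature of the data; `find_short_rows` and the sample rows are illustrative names, not part of the original notebook.

```python
# Hypothetical miniature of the Google Play data: a header plus two rows,
# the second of which is missing its Category value.
header = ['App', 'Category', 'Rating', 'Reviews']
rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19'],
]


def find_short_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]


bad_rows = find_short_rows(header, rows)
for index, row in bad_rows:
    print(index, row)
```

On the real data the same idea would be applied as `find_short_rows(play_data[0], play_data[1:])`, keeping in mind that the returned indices are then offset by one because the header is excluded.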

In [59]:
#let's delete this row

del play_data[10473]

We need to search for duplicate entries.

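We can do this by creating two lists, one for unique app names and one for duplicate app names.

First for the Google Play data:

If the app name is not in the unique names list, it is added to that list.

If the app name is already in the unique names list, it is added to the duplicate list.

When an app appears more than once, we keep the entry with the most reviews. The reviews_max dictionary below records the highest review count for each app name.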
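The two-list check described above can be sketched on a few made-up rows. The sample rows and the names `unique_names` and `duplicate_names` are assumptions for illustration; only the position of the app name (column 0) is taken from the Google Play header.

```python
# Hypothetical sample rows: app name in column 0, as in the Google Play data.
sample_rows = [
    ['Instagram', 'SOCIAL'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN'],
    ['Instagram', 'SOCIAL'],
]

unique_names = []
duplicate_names = []

for row in sample_rows:
    name = row[0]
    if name in unique_names:
        # Seen before: record it as a duplicate entry.
        duplicate_names.append(name)
    else:
        unique_names.append(name)

print('Unique app names:', len(unique_names))
print('Duplicate entries:', len(duplicate_names))
```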

In [60]:
#Play store

reviews_max = {}

for app in play_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print("Number of apps in reviews_max: "+ str(len(reviews_max)))

Number of apps in reviews_max: 9659


Now that we have created the reviews_max dictionary, which holds the highest review count recorded for each app, we can build a new list containing only that entry for each app. We treat the entry with the most reviews as the most recent one.

First we initialise two lists.

We then loop over the data set, reading the app name and number of reviews from each row.

If a row's review count equals the highest count stored for that app, and the name has not already been added, the row is appended to the cleaned list.

The second condition is needed because some duplicate entries share the same highest review count. Without it, each of those entries would be added.
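To see why the second condition matters, here is a minimal sketch on made-up rows in which two entries for the same app tie on the highest review count. The rows and the helper `keep_latest` are hypothetical and exist only for this illustration.

```python
# Hypothetical rows: [name, category, rating, reviews]. 'App A' appears three
# times, and two of those entries tie on the highest review count.
rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App B', 'GAME', '4.5', '75'],
]

# Highest review count per app name.
reviews_max = {}
for row in rows:
    name, n_reviews = row[0], float(row[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews


def keep_latest(rows, check_already_added):
    """Keep rows whose review count matches the maximum for their name."""
    kept, already_added = [], set()
    for row in rows:
        name, n_reviews = row[0], float(row[3])
        if reviews_max[name] != n_reviews:
            continue
        if check_already_added and name in already_added:
            continue
        kept.append(row)
        already_added.add(name)
    return kept


print(len(keep_latest(rows, check_already_added=False)))  # tied entry kept twice
print(len(keep_latest(rows, check_already_added=True)))   # one row per app
```

The sketch tracks added names in a set rather than a list, which makes the membership test cheaper on large data sets; the result is the same either way.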

In [61]:
android_clean = []
already_added = []

for app in play_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659


Next we will look for duplicate apps in the Apple store data set.

In [62]:
ios_unique_ids = []
ios_duplicate_apps = []

for app in apple_data[1:]:
    app_id = app[1]  # 'id' column; 'track_name' is at index 2

    if app_id not in ios_unique_ids:
        ios_unique_ids.append(app_id)
    else:
        ios_duplicate_apps.append(app[2])  # record the name of the repeated app

print("Number of duplicate apps in the Apple Store Data: " + str(len(ios_duplicate_apps)))

Number of duplicate apps in the Apple Store Data: 0


As you can see, no app id appears more than once in the Apple Store data. Note that this check compares the id column (index 1) rather than the app name (index 2). We will therefore move on to the next stage.

## Removing non-English apps

Our data sets contain many apps whose names include non-English characters and which are aimed at a non-English-speaking audience. We only want to keep apps designed for an English-speaking audience. Below we define a function, english(), that checks a string for non-English characters, that is, characters whose code point falls outside the ASCII range (0 to 127).

In [63]:
def english(string):
    # ASCII characters have code points 0-127; anything above is treated as non-English
    for letter in string:
        if ord(letter) > 127:
            return False
    return True

print(english("Instagram"))
print(english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(english("Docs To Go™ Free Office Suite"))
print(english("Instachat 😜"))
            

True
False
False
False


Some apps may be designed for an English-speaking audience but have a few non-English symbols in their names, e.g. Instachat 😜

We will therefore assume that a name with at most three non-ASCII characters belongs to an app designed for an English-speaking audience, and that a name with more than three belongs to an app for a non-English-speaking audience. This is a slightly crude way to split the data, but it should be sufficient.
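To get a feel for the threshold, the sketch below counts the non-ASCII characters in the sample names used in this notebook. The helper `count_non_ascii` is an illustrative name, not part of the original code, and the keep/drop labels simply apply the "at most three" rule.

```python
def count_non_ascii(string):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for letter in string if ord(letter) > 127)


names = [
    "Instagram",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
]

for name in names:
    n = count_non_ascii(name)
    print(n, "non-ASCII ->", "keep" if n <= 3 else "drop", ":", name)
```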

In [64]:
def english(string):
    non_ascii = 0
    for letter in string:
        if ord(letter) > 127:
            non_ascii += 1
    # allow up to three non-ASCII characters (emoji, ™, – and similar)
    return non_ascii <= 3

print(english("Instagram"))
print(english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(english("Docs To Go™ Free Office Suite"))
print(english("Instachat 😜"))
print(english("U Launcher Lite – FREE Live Cool Themes"))


True
False
True
True
True


In [65]:
android_cleaned = []
android_non_english = []

for app in android_clean:
    name = app[0]  # 'App' column
    if english(name):
        android_cleaned.append(app)
    else:
        android_non_english.append(app)

print("# Google Play data set: \n")
print("Number of English language android apps: "+ str(len(android_cleaned)))
print("Number non-english android apps: "+ str(len(android_non_english)))
print("\n")

print("Example android English language apps: \n" + str(android_cleaned[:3]))
print("\n")
print("Example android non-English language apps: \n"+ str(android_non_english[:3])+" \n ")

print("# Apple data set: \n")

ios_cleaned = []
ios_non_english = []

for app in apple_data[1:]:
    name = app[2]  # 'track_name' column
    if english(name):
        ios_cleaned.append(app)
    else:
        ios_non_english.append(app)

print("Number of English language ios apps: "+ str(len(ios_cleaned)))
print("Number non-english ios apps: "+ str(len(ios_non_english)))
print("\n")

print("Example ios English language apps: \n" + str(ios_cleaned[:3]))
print("\n")
print("Example ios non-English language apps: \n"+ str(ios_non_english[:3]))



# Google Play data set: 

Number of English language android apps: 9614
Number non-english android apps: 45


Example android English language apps: 
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Example android non-English language apps: 
[['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'F

In [66]:
android_free = []
android_paid = []

for item in android_cleaned:
    price = (item[7])
    if price == "0":
        android_free.append(item)
    else:
        android_paid.append(item)

print("Number of free android apps: "+str(len(android_free)))

ios_free = []
ios_paid = []

for item in ios_cleaned:
    price = (item[5])
    if price == "0":
        ios_free.append(item)
    else:
        ios_paid.append(item)

print("Number of free ios apps: "+str(len(ios_free)))


Number of free android apps: 8864
Number of free ios apps: 3222


## Analysis time - find most common apps by genre

Our company's strategy is to:
 - a. find out which app genres are popular
 - b. quickly develop an app and launch it into the Play store
 - c. if the app gets good feedback, develop it further
 - d. if it is generating an income within 6 months, launch it into the App store

a. Let's use the explore_data function from the beginning to find out which columns we need for determining the most popular app genres.


In [67]:
explore_data(android_free, 0, 3)
print("\n")
explore_data(ios_free, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241',

In the android free data set, the category is in column 1 (Category).
In the ios free data set, the genre is in column 12 (prime_genre). Both positions are counted from zero.
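Rather than counting columns by eye, the positions can be looked up from the header rows with `list.index`. The sketch below uses the two headers as printed earlier in this notebook; the variable names `android_header` and `ios_header` are assumptions for illustration.

```python
# Header rows as printed earlier in this notebook.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# list.index returns the zero-based position of the first matching item.
android_category_index = android_header.index('Category')
ios_genre_index = ios_header.index('prime_genre')

print(android_category_index)  # 1
print(ios_genre_index)         # 12
```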

In [68]:
def freq_table(dataset, index):
    table = {}
    index = int(index)
    total = 0

    for row in dataset:
        total += 1
        genre = row[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
        
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
   

In [69]:
# display_table calls freq_table internally, so no separate call is needed
display_table(android_free, 1)
print("\n")
display_table(ios_free, 12)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Now that we've found the most common apps, we also want to find the most installed, that is, the most popular, apps. The Apple Store data set does not include the number of installations, so we will use the number of ratings as a proxy.

Below, for each genre in the frequency table, we loop over the ios_free data set. Whenever an app's genre matches, we add its number of ratings to a running total and add one to the len_genre counter. We then calculate the average number of ratings for the genre by dividing the total number of ratings by the number of apps in that genre.
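The same averages can also be computed in a single pass over the data by keeping a running total and a count per genre. This is an alternative sketch, not the approach used in the notebook; `average_by_genre` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_genre(rows, genre_index, ratings_index):
    """Return {genre: mean rating count}, visiting each row once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        genre = row[genre_index]
        totals[genre] += float(row[ratings_index])
        counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Hypothetical sample rows: [name, rating_count, genre]
sample = [
    ['App A', '100', 'Weather'],
    ['App B', '300', 'Weather'],
    ['App C', '50', 'Games'],
]

averages = average_by_genre(sample, genre_index=2, ratings_index=1)
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', average)
```

With the notebook's data the call would be `average_by_genre(ios_free, 12, 6)`, since the genre is at index 12 and the total rating count at index 6.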

In [70]:
genre_ios = freq_table(ios_free, 12)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[12]
        if genre_app == genre:
            ratings_total = float(app[6])
            #print(ratings_total)
            total += ratings_total
            #print(total)
            len_genre += 1
            #print(len_genre)
    average_ratings = (total / len_genre)
    print(genre, ":", average_ratings)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


Navigation is the genre with the highest average number of ratings. Let's explore Navigation apps in more detail:

In [71]:
for app in ios_free:
    genre = app[12]
    if genre == "Navigation":
        print(app[2], ":", app[6]) # name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


As we can see, the Navigation genre is dominated by Waze and Google Maps, and it would likely be difficult to compete with these big players. The Music and Social Networking categories are similar: a few apps dominate and there is a lot of competition.
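How strongly the two largest apps pull up the genre average can be checked with a little arithmetic on the rating counts listed above. This is a rough sketch; the dictionary simply copies the six values from the Navigation listing, and the variable names are illustrative.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Geocaching®': 12811,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'Google Maps - Navigation & Transit': 154911,
}

counts = sorted(navigation_ratings.values(), reverse=True)
top_two_share = sum(counts[:2]) / sum(counts)

print('Mean ratings per app:', mean(counts))
print('Median ratings per app:', median(counts))
print('Share of ratings held by the top two apps:', round(top_two_share * 100, 1), '%')
```

The mean matches the Navigation figure in the genre table, while the median is far lower, which suggests the average mostly reflects the two largest apps rather than a typical Navigation app.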

Let's look at the genre with the next highest average number of ratings: Reference

In [72]:
for app in ios_free:
    genre = app[12]
    if genre == "Reference":
        print(app[2], ":", app[6]) # name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


The Reference category has a few big apps, such as the Bible and Dictionary.com. However, there is a lot of variety within this category and, unlike the other genres we looked at, it is not so flooded. This category could be a good choice.