# Profitable Mobile App Analysis

## Project Purpose:
The purpose of this project is to identify which types of free-to-download apps are the most popular.

## Personal Objectives:
Upon completing this project, I hope to solidify key concepts in Python programming and to apply data science and analytics to a real data set.

In [1]:
from csv import reader

### The Google Play data set ###
# 'with' closes the file once it has been read
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print (ios_header)
print ('\n')
explore_data(ios, 0 ,3, True)

print (android_header)
print ('\n')
explore_data(android, 0 ,3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'J

From the output above, useful columns that could help with the analysis of profitable free apps are:

`Apple Store`
- track_name (index 1), price (index 4), rating_count_tot (index 5), user_rating (index 7), cont_rating (index 10), prime_genre (index 11), lang.num (index 14)

`Google Play Store`
- App (index 0), Category (index 1), Rating (index 2), Installs (index 5), Type (index 6), Price (index 7), Content Rating (index 8), Genres (index 9)

In [3]:
print (android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


## Cleaning Bad Entries

- From reading the discussion section for the Google Play Store data set, it was deduced that there is an error in the data.
- The "Category" entry is missing for the row at index 10472 (or 10473 if the header is kept in the file).

- The row is removed using `del android[10472]`. This cell should only be run once: running it again would delete a valid entry.

In [4]:
del android[10472]
print (android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
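
The malformed row could also be located programmatically instead of relying on the discussion thread. The sketch below assumes that a bad row shows up as a row whose length differs from the header's (the row printed above has 12 values against a 13-column header). The function name `find_bad_rows` and the sample data are hypothetical and used for illustration only.

```python
def find_bad_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad = find_bad_rows(sample_rows, sample_header)
print(bad)  # [(1, ['App B', '1.9'])]
```

Rows flagged this way can be inspected before deciding whether to delete them.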


## Duplicate Entries
- We will now check for duplicates.
- Duplicate entries must be deleted so as not to skew the end result.

The Code Below:
- creates new lists to store unique and duplicate apps
- iterates through the data sets. If the name of the app on the current iteration is already an entry in the `unique_apps` list, the name is `appended` to the `duplicate_apps` list.
- If the current iteration app name `is not` in the `unique_apps` list, then it is added to that list.

In [5]:
unique_ios_apps = []
duplicate_ios_apps = []
unique_android_apps = []
duplicate_android_apps = []

for row in ios:
    name = row[1]
    if name in unique_ios_apps: # checks app store App Name
        duplicate_ios_apps.append(name)
    else:
        unique_ios_apps.append(name)

for row in android:
    name = row[0]
    if name in unique_android_apps: # checks Google Store App Name
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)
        
print ('Number of Duplicate IOS Apps: ', len(duplicate_ios_apps))
print ('Repeated IOS Apps: ', duplicate_ios_apps)
print ('\n')
print ('Number of Duplicate ANDROID Apps: ', len(duplicate_android_apps))
print ('Some Repeated Android Apps: ', duplicate_android_apps[:10])
        

Number of Duplicate IOS Apps:  2
Repeated IOS Apps:  ['Mannequin Challenge', 'VR Roller Coaster']


Number of Duplicate ANDROID Apps:  1181
Some Repeated Android Apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
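
As an alternative to the two-list approach above, duplicates can be counted with `collections.Counter` from the standard library, which avoids repeated list membership checks. This is only a sketch on made-up names; `count_duplicates` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def count_duplicates(names):
    """Map each name that appears more than once to its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample of app names
sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Slack', 'Box']

dupes = count_duplicates(sample_names)
print(dupes)  # {'Box': 3, 'Slack': 2}

# Number of surplus rows, comparable to len(duplicate_android_apps) above
extra_rows = sum(n - 1 for n in dupes.values())
print(extra_rows)  # 3
```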


## Method For Deleting Duplicate Apps
`AppleStore`
- For this data set, we will use the `rating_count_tot` column (index 5) to determine which of the duplicate entries has the most total ratings. Only the entry with the largest number of ratings will be kept.

`Google Play Store`
- A similar approach will be used, but with the Google Play Store `Reviews` column (index 3).

The Code Below:
- creates new `dictionary` data structures for ios and android apps to store each app name and its associated highest number of ratings.
- Iterates through the data sets.
- If the current iteration app name is not a `key` in its `dictionary`, then it is added to the `dictionary` with the current iteration value for `ratings_count` as its associated value.
- If the current iteration app name is a `key` in its `dictionary`, and the current iteration `ratings_count` value is greater than the existing value associated with the key, then the key's associated value is replaced with the higher value of `ratings_count`.


In [6]:
ios_review_max = {}
android_review_max = {}

for row in ios:
    name = row[1]
    ratings_count = float(row[5])
    
    if name in ios_review_max and ios_review_max[name] < ratings_count:
        ios_review_max[name] = ratings_count  
        
    elif name not in ios_review_max:
        ios_review_max[name] = ratings_count
        
for row in android:
    name = row[0]
    ratings_count = float(row[3])
    
    if name in android_review_max and android_review_max[name] < ratings_count:
        android_review_max[name] = ratings_count
        
    elif name not in android_review_max:
        android_review_max[name] = ratings_count

In [7]:
print ('Expected Length of IOS Data: ', len(ios) - 2)
print ('Actual Length of IOS Data: ', len(ios_review_max))
print ('\n')
print ('Expected Length of Android Data: ', len(android) - 1181)
print ('Actual Length of Android Data: ', len(android_review_max))

Expected Length of IOS Data:  7195
Actual Length of IOS Data:  7195


Expected Length of Android Data:  9659
Actual Length of Android Data:  9659


## Algorithm to Remove Duplicate Entries

The code below:
- creates new lists to store the final data sets. (`_clean`)
- creates new lists to keep track of which apps have already been added (`_already_added`)
- Iterates through the data set
- if the `app name` on the current iteration has not already been added to the `_clean` list AND the current iteration `ratings_count` value is equal to the value stored in the dictionary from the previous step, the entire row is added to the `_clean` list.

In [8]:
ios_clean = []
ios_already_added = []
android_clean = []
android_already_added = []

for row in ios:
    name = row[1]
    ratings_count = float(row[5])
    
    if (name not in ios_already_added) and (ios_review_max[name] == ratings_count):
        ios_clean.append(row)
        ios_already_added.append(name)
    
for row in android:
    name = row[0]
    ratings_count = float(row[3])
    
    if (name not in android_already_added) and (android_review_max[name] == ratings_count):
        android_clean.append(row)
        android_already_added.append(name)
        
print ('Length of Cleaned IOS Data: ', len(ios_clean))
print ('Length of Cleaned ANDROID Data: ', len(android_clean))
        

Length of Cleaned IOS Data:  7195
Length of Cleaned ANDROID Data:  9659
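
The dictionary step and the filtering step could also be combined into a single pass by storing the whole row, rather than only the count, as the dictionary value. The sketch below is one possible way to do this, using a hypothetical helper `dedupe_by_max` and made-up rows. Note that it orders the result by the first appearance of each name, which may differ from the row order produced by the two-pass approach above.

```python
def dedupe_by_max(dataset, name_index, count_index):
    """Keep, for each app name, the row with the highest count (first one wins on ties)."""
    best = {}
    for row in dataset:
        name = row[name_index]
        count = float(row[count_index])
        if name not in best or count > float(best[name][count_index]):
            best[name] = row
    return list(best.values())


# Made-up rows: name, category, rating, reviews
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

clean = dedupe_by_max(sample, 0, 3)
for row in clean:
    print(row)
```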


## Removing Non-English Directed Apps

To do this, we will iterate through each data set, loop through each character of the app name, and check its associated ASCII value.
- If a character has an ASCII value in the range [0,127] it is likely in the English language. All other characters are flagged, and an app with too many flagged characters will not be added to the new list (`_clean_english`).
- The built-in function `ord()` will be used.
- A counter is used to determine how many characters in the app name fall outside this range. If that counter reaches 3, the app is said to be non-English directed.

In [9]:
def check_english(appname):
    counter = 0
    
    for character in appname:
        if ord(character) > 127:
            counter += 1
            
    # Three or more non-ASCII characters: treat the app as non-English
    return counter < 3

In [10]:
print (check_english('Instagram'))
print (check_english('Docs To Go™ Free Office Suite'))

True
True


The Code Below:
- creates new lists for android and ios apps that have been cleaned of bad data, repeated entries and non-english apps.

In [11]:
ios_clean_english = []
android_clean_english = []

for row in ios_clean:
    name = row[1]
    
    if check_english(name):
        ios_clean_english.append(row)
        
for row in android_clean:
    name = row[0]
    
    if check_english(name):
        android_clean_english.append(row)
        
print ('Length of Cleaned and English IOS Data: ', len(ios_clean_english))
print ('Length of Cleaned and English ANDROID Data: ', len(android_clean_english))

Length of Cleaned and English IOS Data:  6153
Length of Cleaned and English ANDROID Data:  9597


## Isolating Free Apps
- Based on the problem description, we wish to analyse free apps.
- The code below creates one final list for each of the Apple App Store and Google Play Store.
- The lists store `cleaned` data, rid of bad and duplicate entries, non-English apps, and paid apps.

The code below:
- Iterates through the most recently cleaned data sets
- Checks the price column of each row. If the app is free, then the row is `appended` to the final list.

In [12]:
ios_cleaned_free = []
android_cleaned_free = []

for row in ios_clean_english:
    price = row[4]
    
    if (price == '0.0'):
        ios_cleaned_free.append(row)
        
for row in android_clean_english:
    price = row[7]
    
    if (price == '0'):
        android_cleaned_free.append(row)
        
print ('Length of Completely Cleaned IOS data: ', len(ios_cleaned_free))
print ('Length of Completely Cleaned Android data: ', len(android_cleaned_free))
        

Length of Completely Cleaned IOS data:  3201
Length of Completely Cleaned Android data:  8848
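
The string comparisons above depend on the exact text used for a zero price in each file (`'0.0'` for iOS, `'0'` for Android). A numeric comparison is less sensitive to formatting. The sketch below assumes that paid prices may carry a leading `$`; that assumption and the helper name `is_free` are illustrative and not taken from the notebook.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    cleaned = price_text.strip().lstrip('$')  # assumption: paid prices may look like '$4.99'
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```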


## Begin Analysis
- The goal is to find an app profile that fits both the Apple App Store and the Google Play Store, to maximize the volume of potential users.

- Validation Strategy: (1) Build a minimal version of the app for Android and add it to Google Play. (2) If the response from users is good, the app is developed further. (3) If the app is profitable after six months, an IOS version is made and added to the Apple App Store.

The Code Below:
- defines a function named `freq_table` that takes in a data set (expected to be a list of lists) and a column index
- The function iterates through the data set and stores the value at the chosen column index in the variable `target`
- Generates a `frequency table` for the chosen column index, expressed as percentages

In [13]:
def freq_table(dataset, index):
    ft = {}
    ft_percent = {}
    total = 0
    
    for row in dataset:
        total+=1
        target = row[index]
        
        if target in ft:
            ft[target] += 1
        else:
            ft[target] = 1
            
    for item in ft:
        ft_percent[item] = (ft[item] / total) * 100
        
    return ft_percent
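
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's method; `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the given column index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```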


In [14]:
print (freq_table(ios_cleaned_free, 11))

{'Catalogs': 0.12496094970321774, 'Business': 0.5310840362386754, 'Lifestyle': 1.5620118712902218, 'Music': 2.0618556701030926, 'Reference': 0.5310840362386754, 'Sports': 2.1555763823805063, 'Food & Drink': 0.8122461730709154, 'Productivity': 1.7494532958450486, 'Shopping': 2.592939706341768, 'Entertainment': 7.841299593876913, 'News': 1.3433302093095907, 'Photo & Video': 4.99843798812871, 'Medical': 0.18744142455482662, 'Social Networking': 3.3114651671352706, 'Games': 58.23180256169947, 'Utilities': 2.4679787566385505, 'Travel': 1.2496094970321776, 'Health & Fitness': 2.0306154326772883, 'Finance': 1.0934083099031553, 'Education': 3.6863480162449234, 'Book': 0.37488284910965325, 'Weather': 0.8747266479225243, 'Navigation': 0.18744142455482662}


## Display Ordered Frequency Tables

The code below:
- makes use of the `sorted` built-in function
- when given a dictionary, `sorted` only orders the keys, so the frequency table is first converted to a list of tuples.
- a tuple variable is created (`tuple_value`) that has the frequency table value as its first entry and the key as its second. This tuple is then `appended` to a list `table_display` to create a list of tuples.
- the `sorted()` function is then used to display the data from highest to lowest values.

In [15]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for item in table:
        tuple_value = (table[item], item)
        table_display.append(tuple_value)
        
    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print (entry[1], ':', entry[0])
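
The same ordering can also be obtained without building the intermediate list by hand, by sorting the dictionary's `items()` with a `key` function. The helper name `sort_table` and the sample values below are hypothetical and only meant as a sketch.

```python
def sort_table(table):
    """Return (key, value) pairs ordered from highest to lowest value."""
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)


# Made-up frequency table
sample_table = {'Games': 58.2, 'Education': 3.7, 'Entertainment': 7.8}

for genre, percent in sort_table(sample_table):
    print(genre, ':', percent)
```

One difference from the tuple approach: when two values are equal, this version keeps the original dictionary order for the tied keys instead of ordering them by key.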
        

In [16]:
print ('IOS DATA: ', '\n')
display_table(ios_cleaned_free, 11)  # display_table prints its rows and returns None
print ('\n')
print ('Android Data Category Column: ', '\n')
display_table(android_cleaned_free, 1)

IOS DATA:  

Games : 58.23180256169947
Entertainment : 7.841299593876913
Photo & Video : 4.99843798812871
Education : 3.6863480162449234
Social Networking : 3.3114651671352706
Shopping : 2.592939706341768
Utilities : 2.4679787566385505
Sports : 2.1555763823805063
Music : 2.0618556701030926
Health & Fitness : 2.0306154326772883
Productivity : 1.7494532958450486
Lifestyle : 1.5620118712902218
News : 1.3433302093095907
Travel : 1.2496094970321776
Finance : 1.0934083099031553
Weather : 0.8747266479225243
Food & Drink : 0.8122461730709154
Reference : 0.5310840362386754
Business : 0.5310840362386754
Book : 0.37488284910965325
Navigation : 0.18744142455482662
Medical : 0.18744142455482662
Catalogs : 0.12496094970321774


Android Data Category Column:  

FAMILY : 18.942133815551536
GAME : 9.697106690777577
TOOLS : 8.453887884267631
BUSINESS : 4.599909584086799
PRODUCTIVITY : 3.899186256781193
LIFESTYLE : 3.887884267631103
FINANCE : 3.7070524412296564
MEDICAL : 3.5375226039783
SPORTS : 3.3

## Genre Analysis (AppStore):

Apple AppStore:
- Free app types are dominated by non-productive types of apps (Games, Social Networking, Entertainment, etc.)
- Together, these three genres make up roughly 70% of free apps on the App Store.

Google Play:
- Free app types are dominated by productive apps (Family, Tools, Business, etc.)

Conclusions:
- The above data tells us how common each app genre is.
- We still need to analyse which app genres are most frequently installed to provide a better recommendation for an app profile.

In [17]:
ios_genre_ft = freq_table(ios_cleaned_free, 11)
android_genre_ft = freq_table(android_cleaned_free, 1)

In [18]:
print ('App Genre: Avg Num User Ratings') 

for genre in ios_genre_ft:
    total = 0
    len_genre = 0
    
    for item in ios_cleaned_free:
        genre_app = item[11]
        
        
        if (genre_app == genre):
            num_ratings = float(item[5])
            total += num_ratings
            len_genre += 1
            
    avg_user_ratings = total / len_genre
    print (genre, ':', avg_user_ratings)

App Genre: Avg Num User Ratings
Catalogs : 4004.0
Business : 7491.117647058823
Lifestyle : 16815.48
Music : 57326.530303030304
Reference : 79350.4705882353
Sports : 23008.898550724636
Food & Drink : 33333.92307692308
Productivity : 21028.410714285714
Shopping : 27230.734939759037
Entertainment : 14195.358565737051
News : 21248.023255813954
Photo & Video : 28441.54375
Medical : 612.0
Social Networking : 71548.34905660378
Games : 22910.83100858369
Utilities : 19156.493670886077
Travel : 28243.8
Health & Fitness : 23298.015384615384
Finance : 32367.02857142857
Education : 7003.983050847458
Book : 46384.916666666664
Weather : 52279.892857142855
Navigation : 86090.33333333333


## Genre Analysis Continued

- The table above shows the average number of user ratings for each genre in the `ios_cleaned_free` data set.
- The `Navigation` genre has the highest average number of user ratings among free apps, although this average is driven largely by Waze and Google Maps.
- The next most popular genres are `Reference` and `Social Networking`, although these are also dominated by a few key apps. These would be tough, competitive markets to enter.
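
One way to check how much a few large apps pull up a genre's average is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses `statistics.mean` and `statistics.median` from the standard library on made-up numbers; `genre_summary` and the sample rows are hypothetical and do not come from the data set.

```python
from statistics import mean, median


def genre_summary(dataset, genre_index, count_index, genre):
    """Return (mean, median) of the rating counts for one genre."""
    counts = [float(row[count_index]) for row in dataset if row[genre_index] == genre]
    return mean(counts), median(counts)


# Made-up rows: name, genre, rating count
sample = [
    ['Big Maps', 'Navigation', '300000'],
    ['Small Nav A', 'Navigation', '1000'],
    ['Small Nav B', 'Navigation', '2000'],
]

avg, mid = genre_summary(sample, 1, 2, 'Navigation')
print(avg, mid)  # 101000.0 2000.0
```

A mean far above the median suggests the genre's popularity is concentrated in a handful of apps.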

In [19]:
for row in ios_cleaned_free:
    if row[11] == 'Productivity':
        print (row[1], ':', row[5])
        
print ('\n')


Evernote - stay organized : 161065
Gmail - email by Google: secure, fast & organized : 135962
iTranslate - Language Translator & Dictionary : 123215
Yahoo Mail - Keeps You Organized! : 113709
Google Docs : 64259
Google Drive - free online storage : 59255
Dropbox : 49578
Microsoft Word : 47999
Microsoft OneNote : 39638
Microsoft Outlook - email and calendar : 32807
Hotspot Shield Free VPN Proxy & Wi-Fi Privacy : 32499
Documents 6 - File manager, PDF reader and browser : 29110
Google Sheets : 24602
Microsoft Excel : 24430
Inbox by Gmail : 21561
T-Mobile : 19977
Paper by FiftyThree - Sketch, Diagram, Take Notes : 18219
MyScript Calculator - Handwriting calculator : 16555
VPN Proxy Master - Unlimited WiFi security VPN : 13674
Microsoft OneDrive – File & photo cloud storage : 12797
Ever - Capture Your Memories : 12755
Speak & Translate － Voice and Text Translator : 12062
Tayasui Sketches : 11505
Drawing Desk - Draw, Paint, Doodle & Sketch board : 11040
Microsoft PowerPoint : 10939
Email - F

## App Profile Suggestion (AppStore)

- It is important to be aware of dominant apps in the marketplace that may skew the perceived popularity of a genre (e.g. the `Social Networking` genre is heavily dominated by apps like `Facebook`, whereas others struggle to get off the ground).
- We determined the IOS app market is dominated by fun, unproductive apps. Perhaps an app that `gamifies` a productive task would stand out amongst others in this less populated segment of the App Store market.

Idea: a to-do list app that tracks calendar events and other self-made lists. This fits well with an idea that will initially take off in the Google Play market (tools/productivity genre apps) and includes features that attract IOS users (fun app genres).

## Summarizing Google Play Store Genre Popularity
- The table below shows Google Play Store categories and their respective average number of installs.

In [20]:
print ('App Category: Avg Num Installs') 

for category in android_genre_ft:
    total = 0
    len_category = 0
    
    for item in android_cleaned_free:
        category_app = item[1]
        
        if (category_app == category):
            # Installs are stored as text such as '10,000+'; strip '+' and ',' before converting
            num_installs = item[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            total += float(num_installs)
            len_category += 1
            
    avg_installs = total / len_category
    print (category, ':', avg_installs)

App Category: Avg Num Installs
FOOD_AND_DRINK : 1924897.7363636363
COMMUNICATION : 38590581.08741259
TRAVEL_AND_LOCAL : 13984077.710144928
MEDICAL : 120550.61980830671
EVENTS : 253542.22222222222
BEAUTY : 513151.88679245283
DATING : 854028.8303030303
PARENTING : 542603.6206896552
NEWS_AND_MAGAZINES : 9549178.467741935
PERSONALIZATION : 5201482.6122448975
GAME : 15544014.51048951
HEALTH_AND_FITNESS : 4188821.9853479853
TOOLS : 10830251.970588235
SOCIAL : 23253652.127118643
SPORTS : 3650602.276666667
HOUSE_AND_HOME : 1360598.042253521
ENTERTAINMENT : 11640705.88235294
BUSINESS : 1712290.1474201474
COMICS : 832613.8888888889
BOOKS_AND_REFERENCE : 8814199.78835979
PHOTOGRAPHY : 17840110.40229885
WEATHER : 5145550.285714285
FINANCE : 1387692.475609756
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1446158.2238372094
ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
EDUCATION : 1833495.145631068
SHOPPING : 7036877.311557789
MAPS_AND_NAVIGATION : 4049274.6341463416
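
The install-count cleaning used in the loop above can be isolated into a small helper, which makes it easier to test on its own. `parse_installs` is a hypothetical name; the sample strings follow the format seen in the `Installs` column earlier in this notebook.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))      # 10000.0
print(parse_installs('10,000,000+'))  # 10000000.0
```

Because the values are open-ended buckets ('10,000+' means at least 10,000), the averages above are lower-bound approximations rather than exact install counts.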

# Conclusion:

App profile for the `validation strategy` mentioned above:

- To-do list tracker that gamifies completed list items, producing rewards that can be spent on in-app purchases.
- In-app purchases will use a made-up currency, although credit cards can be used to buy more in-app currency.
- The genre of the app (productivity) is likely to take off on the Google Play Store initially, based on the above analysis.
- The game and reward aspects reach IOS users, whose App Store is dominated by fun, non-productive apps.