# Guided Project: Profitable App Profiles for the Apple App Store and Google Play Markets.

The primary aim of this project is to reveal the mobile app profiles that are most profitable for the Apple App Store and Google Play markets. 

The project simulates a contract from a mobile app development company to analyze data from both markets and provide insights on the most profitable app profiles.

The company builds free-to-download applications, and its main source of revenue is in-app advertising. Its revenue is therefore driven largely by the number of users each app attracts.

**Question: What kind of apps are likely to attract the most users?**


## A. Opening and Exploring the App Store and Google Play Datasets:

This project analyzes a sample of each dataset rather than the complete data, strictly because of cost and time constraints.

The two sample files are:
- "AppleStore.csv"
- "googleplaystore.csv"






In [1]:
# Import the reader function and load both datasets as lists of rows
from csv import reader

# encoding="utf8" is an assumption about the files; it avoids decode errors
# on platforms whose default encoding is not UTF-8
with open("AppleStore.csv", encoding="utf8") as opened_apple_file:
    apple_file = list(reader(opened_apple_file))

with open("googleplaystore.csv", encoding="utf8") as opened_google_file:
    google_file = list(reader(opened_google_file))



### Information about DataSets

The App Store dataset contains data about approximately 7,000 iOS apps (data from July 2017).

The Google Play dataset contains data about approximately 10,000 Android apps (data from August 2018).

The block of code below defines a function, explore_data(), that prints a slice of rows from a dataset and, optionally, the number of rows and columns in that dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_file, 0, 5, True) # Displays the first five rows of the dataset (header row included)
explore_data(google_file, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Ed

From the exploratory output above, the App Store dataset has 7,198 rows and 16 columns, each column describing one feature of an app.

The Google Play dataset has 10,842 rows and 13 columns. Both row counts include the header row, because the full files (header included) were passed to explore_data().

Below, we print out each individual column name to better understand the nature of the datasets that we're working with.



In [3]:
# Printing the column names to enable a better understanding of the data that is useful to the project.
for row in apple_file[0]:
    print(row.title())
    
print('\n')
for row in google_file[0]:
    print(row)
    
    

Id
Track_Name
Size_Bytes
Currency
Price
Rating_Count_Tot
Rating_Count_Ver
User_Rating
User_Rating_Ver
Ver
Cont_Rating
Prime_Genre
Sup_Devices.Num
Ipadsc_Urls.Num
Lang.Num
Vpp_Lic


App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver


## B. Cleaning up the data: Deleting Wrong Data.

Before diving into the analysis, we need to make sure the data is accurate and relevant to the purpose of the analysis.

In the first phase of cleaning, I detect inaccurate data and correct or remove it, then remove duplicate entries.

In [4]:
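# Discussions of the Google Play dataset report a malformed row at index 10473
# (header included). The code below prints it to confirm, then deletes it.
# Note: run this cell only once; running it again deletes a different, valid row.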

print(google_file[10473])

del google_file[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
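
The printed row has 12 fields, while the header has 13: the Category value is missing, so every later value is shifted one column to the left. Rather than relying on a known index, rows like this can be found by comparing each row's length with the header's. The snippet below is a minimal sketch on a small hypothetical dataset; the function name `find_malformed_rows` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical miniature dataset: a header plus two rows, one of them short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],          # the Category value is missing
]

def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

malformed = find_malformed_rows(sample)
print(malformed)
```

Applied to google_file before the deletion, a check of this kind should flag the row at index 10473, assuming no other rows are short or long.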


## Removing Duplicate Data.

Apps that occur multiple times distort the analysis, so duplicate entries must be removed.


Starting with the Google Play dataset, two empty lists are created to hold the unique and duplicate app names.

We then iterate through the Google Play dataset (excluding the header row) and classify each app as either unique or a duplicate.

In [5]:
unique_apps = []
duplicate_apps = []

for row in google_file[1:]:
    if row[0] in unique_apps:
        duplicate_apps.append(row[0])
    else:
        unique_apps.append(row[0])
        
print(duplicate_apps[:5]) # Confirming the presence of a few duplicate apps.

print("Number of duplicate apps: " + str(len(duplicate_apps)))

# The criterion used to clean out the duplicates is to keep the entry with the
# highest number of reviews (column 3). This is the most useful information for the primary goal of our analysis.

# Highest number of reviews = most value in finding out which type of apps users lean towards.





['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
Number of duplicate apps: 1181
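
The loop above tests membership in a list, which gets slower as the list grows. As a rough alternative, the same count can be obtained with `collections.Counter` from the standard library. This is only a sketch on hypothetical app names standing in for column 0 of the Google Play rows.

```python
from collections import Counter

# Hypothetical app names standing in for column 0 of the Google Play rows.
names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Zoom']

name_counts = Counter(names)
# Every occurrence beyond the first counts as one duplicate entry,
# which matches what the list-based loop appends to duplicate_apps.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = sorted(name for name, count in name_counts.items() if count > 1)

print(n_duplicates)
print(repeated)
```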


In the process of removing duplicate apps, we compared the number of reviews of each duplicate entry and kept the entry with the highest number of reviews as the most valuable piece of data.

To do this, we created a dictionary that maps each app name to the highest number of reviews found among its entries.

In [6]:
#Creating a dictionary that stores the app names as keys and the highest number of reviews amongst all duplicates as values.
reviews_max = {}
for row in google_file[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

        

#Using the dictionary created to remove duplicate rows
android_clean = []
already_added = []

for row in google_file[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))

        
        
        


9659
9659
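
Both counts match, which suggests that exactly one row was kept per app name. For comparison, the two passes can be folded into one by storing the best row per name directly. The sketch below uses hypothetical rows with the same column positions as the code above (name in column 0, reviews in column 3); the review numbers are made up for illustration.

```python
# Hypothetical rows: [name, category, rating, reviews]
rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '159880'],
]

best_row = {}
for row in rows:
    name, n_reviews = row[0], float(row[3])
    # Keep a row only if it is the first seen or has strictly more reviews.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
print(len(deduplicated))
```

One difference to be aware of: the resulting rows are ordered by the first appearance of each name, not by the position of the row that was kept.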


## Removing Non-English Apps from the Dataset.

The next part of the data cleaning process is to ensure homogeneity, that is, that all apps in the data target the same language (English).

The function below iterates through a string and checks whether each character falls within the ASCII range (code points 0 to 127).

In [7]:
# The function below checks the characters in a string against the ASCII range
# (code points 0 to 127), which covers the characters commonly used in English text.

# Characters with a code point above 127 fall outside that range.
# Up to three such characters are tolerated before a name is rejected.
def english_char_presence(string):
    non_ascii_count = 0
    for a in string:
        if ord(a) > 127:
            non_ascii_count += 1
    return non_ascii_count <= 3
    
print(english_char_presence("Instagram"))
print(english_char_presence("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(english_char_presence("Docs To Go™ Free Office Suite"))
print(english_char_presence("Instachat 😜"))


            

True
False
True
True


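The function above is not optimal, but it serves the purpose of filtering out non-English apps for now. It tolerates up to three non-ASCII characters so that English names containing an emoji or a symbol such as ™ are not discarded.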
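
To see which characters the check is actually counting, it can help to list them for a few names. The helper below is a hypothetical addition for inspection only (the name `non_ascii_chars` is not part of the original notebook); it uses the same `ord(...) > 127` test as the function above.

```python
def non_ascii_chars(name):
    """Return the characters in name whose code point is above 127."""
    return [ch for ch in name if ord(ch) > 127]

for name in ["Instagram", "Docs To Go™ Free Office Suite", "Instachat 😜"]:
    print(name, non_ascii_chars(name))
```

A known limitation of this kind of filter is that names written in other languages that use mostly ASCII letters, with three or fewer accented characters, would still pass.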
            

In [8]:
#For the google_file dataset
english_google_apps = []
for row in android_clean:
    app_names = row[0]
    app_language = english_char_presence(app_names)
    if app_language:
        english_google_apps.append(row)
        
#For the apple_file dataset
english_apple_apps = []
for row in apple_file[1:]:
    app_names = row[1]
    app_language = english_char_presence(app_names)
    if app_language:
        english_apple_apps.append(row)
        

print(len(english_google_apps))
print(len(english_apple_apps))
#The len() function shows the number of applications left in each dataset after filtering out the non-English apps.



9614
6183


## Categorizing Free to download Apps.

Since the company develops only free-to-download apps, the datasets are filtered down to free apps so that the analysis reflects the market the company operates in.

In [9]:
#For the google_file dataset
free_google_apps = []
for row in english_google_apps:
    app_price = str(row[7])
    if app_price == "0" or app_price == "$0":
        free_google_apps.append(row)
    else:
        continue
    
print(len(free_google_apps))
#For the apple_file dataset
free_apple_apps = []
for row in english_apple_apps:
    app_price = str(row[4])
    if app_price == "0.0" or app_price == "$0.0":
        free_apple_apps.append(row)
    else:
        continue
print(len(free_apple_apps))



8864
3222
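
The filter above compares price strings literally, so it depends on knowing every spelling of "free" in advance ("0", "$0", "0.0"). A slightly more tolerant approach is to convert the price to a number first. The sketch below is hypothetical (`is_free` is not part of the original notebook) and assumes every price is numeric once a leading dollar sign is removed; any other value would raise a ValueError.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0' equals zero."""
    # Assumption: the only non-numeric decoration is a leading '$'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```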


## Most Common Apps by Genre:

The company's end goal is to determine the kinds of apps that are likely to attract the most users. Its strategy is to build revenue through app development on both mobile store platforms.

With this in mind, the analysis benefits from finding app profiles that perform well in both markets (iOS and Android), since part of the company's goal is to have a strong foothold in both mobile stores.

The block of code below reveals the columns important to complete this task in the respective data-sets.

In [10]:
for field in free_google_apps[81]:
    print("\n" + field)
    #In the Google Play data, column 1 shows the category and column 9 shows the genre
    
for field in free_apple_apps[81]:
    print("\n" + field)
    #In the App Store data, column 11 shows the prime genre


CarMax – Cars for Sale: Search Used Car Inventory

AUTO_AND_VEHICLES

4.4

21777

Varies with device

1,000,000+

Free

0

Everyone

Auto & Vehicles

August 4, 2018

Varies with device

Varies with device

991153141

Fallout Shelter

1172922368

USD

0.0

199396

1131

4.5

4.5

1.12

12+

Games

38

5

5

1


### Frequency tables for both datasets.

The next step is to build frequency tables that show the percentage of apps in each genre or category, so that the most common ones in each dataset stand out.

In [11]:
def freq_table(dataset, index):
    # Count how many rows hold each distinct value in the given column.
    counts = {}
    for row in dataset:
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
    # Convert the raw counts to percentages of the whole dataset.
    percentages = {}
    for key, count in counts.items():
        percentages[key] = count / len(dataset) * 100
    return percentages

freq_table(free_google_apps, 9)


{'Art & Design': 0.5979241877256317,
 'Art & Design;Creativity': 0.06768953068592057,
 'Auto & Vehicles': 0.9250902527075812,
 'Beauty': 0.5979241877256317,
 'Books & Reference': 2.1435018050541514,
 'Business': 4.591606498194946,
 'Comics': 0.6092057761732852,
 'Comics;Creativity': 0.01128158844765343,
 'Communication': 3.2378158844765346,
 'Dating': 1.861462093862816,
 'Education': 5.347472924187725,
 'Education;Creativity': 0.04512635379061372,
 'Education;Education': 0.33844765342960287,
 'Education;Pretend Play': 0.056407942238267145,
 'Education;Brain Games': 0.033844765342960284,
 'Entertainment': 6.069494584837545,
 'Entertainment;Brain Games': 0.078971119133574,
 'Entertainment;Creativity': 0.033844765342960284,
 'Entertainment;Music & Video': 0.16922382671480143,
 'Events': 0.7107400722021661,
 'Finance': 3.7003610108303246,
 'Food & Drink': 1.2409747292418771,
 'Health & Fitness': 3.0798736462093865,
 'House & Home': 0.8235559566787004,
 'Libraries & Demo': 0.936371841155234
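
The same kind of table can be sketched more compactly with `collections.Counter`, and sorted by percentage without building intermediate tuples. The function name `freq_table_pct` and the sample rows are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each distinct value in column `index` to its percentage share."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows with the genre in column 1.
sample_rows = [['a', 'Tools'], ['b', 'Tools'], ['c', 'Education'], ['d', 'Tools']]
sample_table = freq_table_pct(sample_rows, 1)

# Print the entries from the most to the least common.
for genre, pct in sorted(sample_table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```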

In [12]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
 

print("\nINFO FOR GOOGLE APPS GENRE COLUMN")
display_table(free_google_apps, 9)
print("\nINFO FOR APPLE APPS PRIME GENRE COLUMN")
display_table(free_apple_apps, 11)
print("\nINFO FOR GOOGLE APPS CATEGORY COLUMN")
display_table(free_google_apps, 1)


INFO FOR GOOGLE APPS GENRE COLUMN
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.936371841155234

### Most Popular Apps on the App Store
The next step of the analysis involves finding the most popular apps on each platform.

We begin with genres on the App Store. The App Store dataset has no install counts, so popularity is approximated by the average number of user ratings (column 5, rating_count_tot) for each genre.

In [18]:
#Generate a frequency table for the prime genre column (index 11) in the cleaned app store dataset.
app_store_freq_table= freq_table(free_apple_apps, 11)

for genre in app_store_freq_table:
    total = 0
    len_genre = 0
    for row in free_apple_apps:
        genre_app = row[11]
        if genre_app == genre:
            no_of_user_ratings = float(row[5])
            total += no_of_user_ratings 
            len_genre += 1 
    average_no_user_ratings = total/len_genre
    print("App Genre: " + genre )
    print("Average number of user ratings: " + str(average_no_user_ratings))
    

App Genre: Social Networking
Average number of user ratings: 71548.34905660378
App Genre: Photo & Video
Average number of user ratings: 28441.54375
App Genre: Games
Average number of user ratings: 22788.6696905016
App Genre: Music
Average number of user ratings: 57326.530303030304
App Genre: Reference
Average number of user ratings: 74942.11111111111
App Genre: Health & Fitness
Average number of user ratings: 23298.015384615384
App Genre: Weather
Average number of user ratings: 52279.892857142855
App Genre: Utilities
Average number of user ratings: 18684.456790123455
App Genre: Travel
Average number of user ratings: 28243.8
App Genre: Shopping
Average number of user ratings: 26919.690476190477
App Genre: News
Average number of user ratings: 21248.023255813954
App Genre: Navigation
Average number of user ratings: 86090.33333333333
App Genre: Lifestyle
Average number of user ratings: 16485.764705882353
App Genre: Entertainment
Average number of user ratings: 14029.830708661417
App Genre:
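
The nested loop above scans the whole dataset once for every genre. As a rough alternative, the per-genre averages can be accumulated in a single pass with `collections.defaultdict`. The sketch below uses hypothetical (genre, rating count) pairs in place of columns 11 and 5 of the App Store rows.

```python
from collections import defaultdict

# Hypothetical rows: (prime_genre, rating_count_tot)
sample_apps = [('Games', '100'), ('Games', '300'), ('Weather', '50')]

totals = defaultdict(float)   # sum of rating counts per genre
n_apps = defaultdict(int)     # number of apps per genre
for genre, rating_count in sample_apps:
    totals[genre] += float(rating_count)
    n_apps[genre] += 1

averages = {genre: totals[genre] / n_apps[genre] for genre in totals}
print(averages)
```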

### Most Popular Apps on Google Play Store.


Next, we move on to the most popular apps on Google Play, measured by the average number of installs in each category (column 1).

In [22]:
google_store_freq_table = freq_table(free_google_apps, 1)
print(google_store_freq_table)

for category in google_store_freq_table:
    total = 0
    len_category = 0
    for row in free_google_apps:
        category_app = row[1]
        if category_app == category:
            no_of_installs = row[5]
            no_of_installs = no_of_installs.replace("+","")
            no_of_installs = no_of_installs.replace(",","")
            no_of_installs = int(no_of_installs)
            total += no_of_installs
            len_category += 1
    average_no_of_installs = total/len_category
    print("App Category: " + category)
    print("Average no. of installs: " + str(average_no_of_installs))
    

{'ART_AND_DESIGN': 0.6430505415162455, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'BEAUTY': 0.5979241877256317, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'BUSINESS': 4.591606498194946, 'COMICS': 0.6204873646209386, 'COMMUNICATION': 3.2378158844765346, 'DATING': 1.861462093862816, 'EDUCATION': 1.1620036101083033, 'ENTERTAINMENT': 0.9589350180505415, 'EVENTS': 0.7107400722021661, 'FINANCE': 3.7003610108303246, 'FOOD_AND_DRINK': 1.2409747292418771, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'HOUSE_AND_HOME': 0.8235559566787004, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'LIFESTYLE': 3.9034296028880866, 'GAME': 9.724729241877256, 'FAMILY': 18.907942238267147, 'MEDICAL': 3.531137184115524, 'SOCIAL': 2.6624548736462095, 'SHOPPING': 2.2450361010830324, 'PHOTOGRAPHY': 2.944494584837545, 'SPORTS': 3.395758122743682, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'TOOLS': 8.461191335740072, 'PERSONALIZATION': 3.3167870036101084, 'PRODUCTIVITY': 3.892148014440433, 'PARENTING': 0.6543321299638989, 'WEATHER': 
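
The install values in this dataset are strings such as "1,000,000+", so the loop above strips the plus sign and the commas before converting to an integer. That step can be isolated as below; `parse_installs` is a hypothetical helper name used only for this sketch.

```python
def parse_installs(installs_text):
    """Convert an install bucket such as '1,000,000+' to the integer 1000000."""
    return int(installs_text.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))
```

Because the trailing "+" marks an open-ended bucket, the converted number is best read as a lower bound, and the resulting averages are approximations rather than exact install counts.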

## Conclusion.
Based on the above analysis, the ideal app profile recommendation for the Google Play Store would be an app in the Communication category.

For the Apple App Store, the ideal app profile recommendation would be an app in the Navigation/Scothe genre.