# Google Play & App Store Apps: Free + profitable

This project analyses the free application market on the two major mobile platforms.

Let's imagine for a moment that we work for a software house that builds only iOS and Android apps, and that its business model is to release free apps and earn revenue from in-app ads. This means the revenue for any given app is mostly driven by the number of people who use it: the more users who see and engage with the ads, the better.

Our goal is to deliver analytical data that shows which combination of platform and app type is likely to be profitable for developers of free apps.

### The data sets.
More than 4 million apps are available across Google Play and the App Store. Collecting data for such a vast number of apps requires a significant amount of time and money, so we'll analyze a sample of the data instead.

We have one [dataset](https://www.kaggle.com/lava18/google-play-store-apps) with approximately [10,000 Android apps](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) from Google Play and [another](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) with approximately [7,000 iOS apps](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) from the App Store.

We'll start by opening and exploring these two data sets.

**Open and prepare the datasets**

In [1]:
# Read App Store data.
# --------------------------------------------
# Import reader from the csv module, open the file, and read it.
from csv import reader
openAppStoreFile = open("AppleStore.csv", encoding="utf8")
readAppStoreFile = reader(openAppStoreFile) # reader returns an iterator over rows

# Convert it to a list of lists, then close the file.
dataAppStore = list(readAppStoreFile)
openAppStoreFile.close()

In [2]:
# Read Google Play data.
# --------------------------------------------
# Import reader from the csv module, open the file, and read it.
from csv import reader
openGooglePlayFile = open("googleplaystore.csv", encoding="utf8")
readGooglePlayFile = reader(openGooglePlayFile) # reader returns an iterator over rows

# Convert it to a list of lists, then close the file.
dataGooglePlay = list(readGooglePlayFile)
openGooglePlayFile.close()
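
The two loading cells repeat the same steps, so they could be folded into one helper. The following is only a sketch: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and the `utf8` encoding is an assumption about the data files.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) from a CSV file.

    The `with` block closes the file even if reading fails.
    `load_csv` is a hypothetical helper, not part of the notebook.
    """
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With such a helper, a call like `header, apps = load_csv("AppleStore.csv")` would keep the header separate from the data rows from the start.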

In [3]:
# To make them easier to explore, we created a function that we can
# repeatedly use to print rows in a readable way.

# data exploration function:
def explore_data(dataset, start, end, rows_and_columns=False):
    # dataset: the dataset list of lists, 
    # start, end: indices of a slice from the dataset,
    # rows and columns: Boolean, print the number of rows, columns.
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

**Explore the datasets**

In [4]:
# A little exploration of the App Store data.
# 1. Print the header row (the fields, the columns) separately.
# (explore_data only prints and returns None, so nothing is assigned.)
print("App Store header (fields):")
explore_data(dataAppStore[:1], 0, 1)

# 2. Print a few rows.
print("App Store list (data):")
explore_data(dataAppStore[1:], 0, 5, rows_and_columns=True)

App Store header (fields):
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


App Store list (data):
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number o

In [5]:
# A little exploration of the Google Play data.
# 1. Print the header row (the fields, the columns) separately.
# (explore_data only prints and returns None, so nothing is assigned.)
print("Google Play header (fields):")
explore_data(dataGooglePlay[:1], 0, 1)

# 2. Print a few rows.
print("Google Play list (data):")
explore_data(dataGooglePlay[1:], 0, 5, rows_and_columns=True)

Google Play header (fields):
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Google Play list (data):
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN

### Data cleansing time
As always, before analysing anything we need to make sure the data is accurate: no mistakes, no duplicate entries and nothing malformed inside the data files.

In [6]:
# There is a known error in a particular row, and we're going to fix it.

# Print the faulty row (we know it's at index 10473, counting the header
# as index 0). Its Category value is missing, so every field is shifted
# one column to the left.
print("The defective line is this:\n", dataGooglePlay[10473], "\n")
# Delete the faulty row.
del dataGooglePlay[10473]
print(dataGooglePlay[10473])

The defective line is this:
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
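
The faulty row above has fewer fields than the header, which suggests a general check instead of relying on a known index. A minimal sketch, assuming the data set is a list of lists with the header first; `find_malformed_rows` is a hypothetical helper name:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data standing in for the real file.
sample = [
    ["App", "Category", "Rating"],
    ["Box", "BUSINESS", "4.2"],
    ["Broken app", "1.9"],          # the category value is missing
]
print(find_malformed_rows(sample))  # [(2, ['Broken app', '1.9'])]
```

Run before the deletion, a check like this would be expected to flag the row at index 10473, since it has 12 fields against 13 in the header.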


In [7]:
# Helper function to find duplicate app names:
# ----------------------------------------------------
# args: data_set, application name column, platform, header.
# The data set is expected to include its header row, which is skipped;
# the header argument itself is not used inside the function.
def find_data_duplicates(data_set, app_name_col, platform, header):
    unique_list = []     # holds the unique values.
    duplicates_list = [] # holds the duplicates.

    # Duplicates
    for row in data_set[1:]:             # loop through the dataset.
        name = row[app_name_col]         # get the name of app.
        if name in unique_list:          # if it's in "unique_list"
            duplicates_list.append(name) # add it to duplicates.
        else:
            unique_list.append(name)     # else, add it to uniques.
            
    print("Number of duplicates for", platform, "platform : ", len(duplicates_list))
    print("Examples of duplicates: ", duplicates_list[:15])       
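
As an alternative sketch, the same duplicate count can be obtained with `collections.Counter`, which avoids the repeated `in` searches over a list. The function name `count_duplicates` and the sample rows are illustrative assumptions, not part of the data.

```python
from collections import Counter

def count_duplicates(rows, name_col):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(row[name_col] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Toy rows: "Box" appears three times.
sample_rows = [["Box", "10"], ["Slack", "5"], ["Box", "12"], ["Box", "12"]]
dupes = count_duplicates(sample_rows, 0)
print(dupes)  # {'Box': 3}

# Number of surplus entries, comparable to len(duplicates_list) above.
extra = sum(n - 1 for n in dupes.values())
print(extra)  # 2
```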

**Checking for Duplicates**

Now, we'll check for duplicates over the two platforms, using the function we just created.

In [8]:
find_data_duplicates(dataGooglePlay, 0, "Google Play", True)

Number of duplicates for Google Play platform :  1181
Examples of duplicates:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [9]:
# Column 0 of the App Store data is the app id, so this checks for repeated ids.
find_data_duplicates(dataAppStore, 0, "App Store", True)

Number of duplicates for App Store platform :  0
Examples of duplicates:  []


**Duplicates found**

It looks like Google Play has numerous duplicates, but the App Store has none.

### Duplicates Removal: Which ones are we going to remove?

We won't remove duplicates at random. The duplicate entries seem to differ in the number of reviews recorded, presumably because they were collected at different times. So our strategy is to keep, for each app, the entry with the highest number of reviews. That way we get more accurate results in the analysis.

Additionally, we could compare the **dates** among the duplicates; if they differ, that would be a matter for discussion. (Checking dates requires additional prerequisites.)

In [10]:
# Create a new data set by removing duplicates using a dictionary

# 1. Create a new dictionary of unique apps (name: number of reviews),
#    keeping the highest review count seen for each name.
#    The data set is expected to include its header row, which is skipped.
def remove_duplicates(data_set, app_name_col, reviews_col, header):
    reviews_max = {} # maps each app name to its highest review count.

    # Loop through data
    for row in data_set[1:]:
        name = row[app_name_col]
        n_reviews = float(row[reviews_col])

        # if name as key exists and has a value less than what is stored
        # in n_reviews, then replace its value in dictionary with
        # n_reviews value. Elif, add a new entry.
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews

    return reviews_max

dictionaryGooglePlay = remove_duplicates(dataGooglePlay, 0, 3, True)
print(len(dictionaryGooglePlay))

9659


In [11]:
# 2. Use the dictionary created above to remove
#    the duplicate rows from the original data_set.

googlePlay_clean = [] # the new data list
already_added = set() # names already kept, to handle ties in review count.

# A. Get the header and add it to the new list.
googlePlay_clean.append(dataGooglePlay[0])

# B. Loop through the original data set, row by row.
for row in dataGooglePlay[1:]:
    
    # C. Get the name and review count of the row.
    name = row[0]   
    n_reviews = float(row[3])
    
    # D. Compare the row with the dictionary and keep it only if it
    #    holds the highest review count and was not added before.
    if n_reviews == dictionaryGooglePlay[name] and name not in already_added:
        googlePlay_clean.append(row)
        already_added.add(name)

print(len(googlePlay_clean), "(Header is included)")

9660 (Header is included)
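
The two steps above (build the dictionary, then filter the rows) could also be done in one pass by storing the whole row in the dictionary. This is a sketch on toy data; `keep_most_reviewed` is a hypothetical name, and it relies on Python 3.7+ dictionaries preserving insertion order.

```python
def keep_most_reviewed(rows, name_col, reviews_col):
    """Keep one row per name: the one with the highest review count.

    On ties the first row seen is kept, as in the two-step version.
    """
    best = {}
    for row in rows:
        name = row[name_col]
        if name not in best or float(row[reviews_col]) > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

# Toy rows without a header: [name, reviews]
sample_rows = [["Box", "10"], ["Slack", "5"], ["Box", "12"]]
print(keep_most_reviewed(sample_rows, 0, 1))  # [['Box', '12'], ['Slack', '5']]
```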


### Removal of entries in foreign languages.

Here, we'll clean out any app entry that is written in a foreign language and targets a foreign audience.

In [12]:
def checkEnglishChars(appName):
    conseqChars = [] # holds the current run of consecutive non-ASCII characters
    for char in appName:
        codePoint = ord(char)

        if codePoint > 127:
            conseqChars.append(char)
            # more than 3 CONSECUTIVE non-ASCII characters: treat as non-English
            if len(conseqChars) > 3:
                return False
        else:
            conseqChars = [] # ASCII character: reset the run.
    return True
        
result = checkEnglishChars("In😜😜😜at😜😜😜😜edfghdfgh")
print(result, "(the result var)") 
        

False (the result var)


#### English / non-English: A different strategy
I decided to follow a different strategy for tagging foreign-language apps:

Rather than cleaning the list by removing the non-English app rows, it's better to keep the list intact and add a Boolean data point near the beginning of each row (each inner list), marking it as "True" if the app is English and "False" if not.

That way, we can keep improving the English/non-English filter seamlessly without damaging the list.

If we want only the English entries, we loop through the rows and keep those marked "True"; for the non-English ones, we keep those marked "False".

I'm guessing we could do the same with a dictionary as well, right?

In [13]:
# 1. Google Play cleaning
# Insert the "English App" string in the header before
# the app name (first column) to create a new column.
googlePlay_clean[0].insert(0, "English App")

# Loop through the entire data.
for row in googlePlay_clean[1:]:
    # call the previous func and get the result (True/False)
    result = checkEnglishChars(row[0])
    # insert the result at the beginning of each row.
    row.insert(0, result) 

    
# 2. App Store cleaning
# (I'm going to use the same convention "_clean" as in Google Play).
# Note: this is a second name for the same list, not a copy.
appStore_clean = dataAppStore

# Insert the "English App" string in the header before
# the app name (second column) to create a new column.
appStore_clean[0].insert(1, "English App")

# Loop through the entire data.
for row in appStore_clean[1:]:
    # call the previous func and get the result (True/False)
    result = checkEnglishChars(row[1])
    # insert the result just before the app name (index 1).
    row.insert(1, result)
    

In [14]:
# Explore data sets for non-English apps.

# Google Play
gCount = 0
gNonEnglish = []
for row in googlePlay_clean[1:]:
    if not row[0]:
        gNonEnglish.append(row)
        gCount += 1

print("The total non-English apps in Google Play are: ", gCount)

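# App Store
# The English flag sits at index 1 for App Store rows (index 0 is the id,
# a non-empty string that is always truthy), so we test row[1].
# The recorded output below came from a run that tested row[0],
# which can never match and therefore always reports 0.
asCount = 0
asNonEnglish = []
for row in appStore_clean[1:]:
    if not row[1]:
        asNonEnglish.append(row)
        asCount += 1

print("The total non-English apps in App Store are: ", asCount)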


The total non-English apps in Google Play are:  38
The total non-English apps in App Store are:  0


In [15]:
# print the non-English rows from Google Play data.
# for row in gNonEnglish:
#     print(row, "\n")

In [16]:
# Selecting the free-apps only.
free_AppStore = []
free_GooglePlay = []

# App Store
for row in appStore_clean[1:]:
    if row[1] and float(row[5]) == 0:
        free_AppStore.append(row)
        
print("The total Free English apps in App Store are: ", len(free_AppStore))
    
# Google Play
for row in googlePlay_clean[1:]:
    if row[0] and (row[8] == "0" or row[8] == "0.0"):
        free_GooglePlay.append(row)
        
print("The total Free English apps in Google Play are: ", len(free_GooglePlay))

The total Free English apps in App Store are:  3228
The total Free English apps in Google Play are:  8871


## Application Profile.
From here on, we'll concentrate on what kind of application we should build, based on an analysis of the data. We also need to build the app for both major platforms.

In [17]:
# generate frequency table from a list of lists,
# for any column of that list.
def freq_table(dataset, lstColIndex):
    freqColumnData = {}
    
    for row in dataset:
        freqValue = row[lstColIndex]
        if freqValue in freqColumnData:
            freqColumnData[freqValue] += 1
        else:
            freqColumnData[freqValue] = 1
    
    # Get the number of distinct values and the grand total of all counts.
    numOfGenres = 0
    grandTotalGenres = 0
    for genre in freqColumnData:
        numOfGenres += 1
        grandTotalGenres += freqColumnData[genre]
#     print("Number of Genres: ", numOfGenres)
#     print("Grand Total of all frequencies: ", grandTotalGenres)
    
    # convert the dictionary values to percentages
    for genre in freqColumnData:
        perCent = (freqColumnData[genre] / grandTotalGenres) * 100
        perCent = round(perCent, 2)
        freqColumnData[genre] = perCent   
    
    return freqColumnData


# transforms the frequency table into a list of tuples,
# then sorts the list in a descending order.
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []

    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
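
For comparison, a more compact sketch of the same percentage table can be built on `collections.Counter`. The name `freq_table_pct` and the sample rows are assumptions for illustration only.

```python
from collections import Counter

def freq_table_pct(rows, col):
    """Return {value: percentage of rows}, rounded to 2 decimals."""
    counts = Counter(row[col] for row in rows)
    total = sum(counts.values())
    return {key: round(n / total * 100, 2) for key, n in counts.items()}

# Toy rows: [name, genre]
sample_rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
table = freq_table_pct(sample_rows, 1)

# Print in descending order of percentage.
for key, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(key, ":", pct)
```

Sorting on the percentage with a `key` function avoids building the intermediate list of `(value, key)` tuples.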
    

In [18]:
# Print the results for analysis:
print("Frequency Table for App Store/Prime Genre:")    
display_table(free_AppStore, 12)
print()

print("Frequency Table for Google Play/Category:")    
display_table(free_GooglePlay, 2)
print()

print("Frequency Table for Google Play/Genres:")    
display_table(free_GooglePlay, 10)


Frequency Table for App Store/Prime Genre:
Games : 58.12
Entertainment : 7.87
Photo & Video : 4.96
Education : 3.66
Social Networking : 3.28
Shopping : 2.63
Utilities : 2.51
Sports : 2.14
Music : 2.04
Health & Fitness : 2.01
Productivity : 1.73
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.15
Weather : 0.9
Food & Drink : 0.84
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12

Frequency Table for Google Play/Category:
FAMILY : 18.9
GAME : 9.72
TOOLS : 8.45
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.39
PERSONALIZATION : 3.33
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.33
SHOPPING : 2.24
BOOKS_AND_REFERENCE : 2.16
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.41
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.92
HOUSE_AND_HOME : 0.82
W

### Analyse average installations per genre/category

In [19]:
# 1. App Store: Generating a frequency table for the prime_genre column.

genre_AppStore = freq_table(free_AppStore, 12)
print("App Store average values per genre\n")
print("Genre\t\tAverage User Ratings")
print("-------------------------------------")

for genre in genre_AppStore:
    # "total" stores the sum of user ratings (the number of ratings,
    # not the actual ratings) specific to each genre.
    total = 0
    len_genre = 0 # the number of apps of each genre.
    
    for row in free_AppStore:
        genre_app = row[12] # the genre of each app in the main file.
        if genre_app == genre: # if it matches the outer loop's genre
            user_ratings = float(row[6]) # get the rating_count_tot column
            # Add up the number of user ratings to the total variable.
            total += user_ratings 
            len_genre += 1 # Increment the len_genre variable by 1.
    
    avgUserRatings = total / len_genre
    print(genre, "\t\t", round(avgUserRatings, 2))


App Store average values per genre

Genre		Average User Ratings
-------------------------------------
Social Networking 		 71548.35
Health & Fitness 		 23298.02
Weather 		 50477.55
Productivity 		 21028.41
Photo & Video 		 28441.54
Music 		 57326.53
Utilities 		 18684.46
Medical 		 612.0
Finance 		 30617.81
Navigation 		 86090.33
Book 		 39758.5
Education 		 7003.98
News 		 21248.02
Games 		 22764.91
Catalogs 		 4004.0
Business 		 7491.12
Lifestyle 		 16485.76
Shopping 		 26608.0
Entertainment 		 14029.83
Reference 		 74942.11
Sports 		 23008.9
Travel 		 28243.8
Food & Drink 		 32099.52


### App Store profile recommendation.

In [20]:
# 2. Google Play: Generating a frequency table for the "Category" column.

category_GooglePlay = freq_table(free_GooglePlay, 2)
# print("Google Play average values per category\n")
# print("Category\t\tAverage Installs")
# print("-------------------------------------")

for category in category_GooglePlay:    
    total = 0 # "total" stores the sum of installs of each category.
    len_category = 0 # the number of apps specific to each category.
    
    for row in free_GooglePlay:
        category_app = row[2] # the category of each app in the main file.
        if category_app == category: # if it matches the outer loop's category
            user_installs = row[6] # get the "Installs" column.
            
            # clean the numbers and convert.
            user_installs = user_installs.replace(",", "")
            user_installs = user_installs.replace("+", "")
            user_installs = float(user_installs)
            
            # Add up the number of installations to the total variable.
            total += user_installs 
            len_category += 1 # Increment the "len_category" var. by 1.
    
    avgInstalls = total / len_category
    print(category, "\t\t", round(avgInstalls, 2))


TOOLS 		 10801391.3
DATING 		 854028.83
LIBRARIES_AND_DEMO 		 638503.73
SHOPPING 		 7036877.31
FAMILY 		 3693438.69
EDUCATION 		 1833495.15
ENTERTAINMENT 		 11640705.88
LIFESTYLE 		 1433701.52
NEWS_AND_MAGAZINES 		 9549178.47
WEATHER 		 5074486.2
PERSONALIZATION 		 5183850.81
HEALTH_AND_FITNESS 		 4188821.99
EVENTS 		 253542.22
FOOD_AND_DRINK 		 1924897.74
PRODUCTIVITY 		 16787331.34
MAPS_AND_NAVIGATION 		 4025286.24
PARENTING 		 542603.62
BUSINESS 		 1712290.15
AUTO_AND_VEHICLES 		 647317.82
BEAUTY 		 513151.89
GAME 		 15588015.6
COMICS 		 817657.27
TRAVEL_AND_LOCAL 		 13984077.71
PHOTOGRAPHY 		 17840110.4
COMMUNICATION 		 38456119.17
HOUSE_AND_HOME 		 1331540.56
VIDEO_PLAYERS 		 24727872.45
BOOKS_AND_REFERENCE 		 8676537.81
ART_AND_DESIGN 		 1952105.17
SOCIAL 		 23253652.13
MEDICAL 		 120550.62
FINANCE 		 1387692.48
SPORTS 		 3638640.14


### Google Play profile recommendation.

# Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.