# Profitable App Profiles for the Apple Store and Google Play Markets

This project analyzes free apps on the Apple Store and Google Play. The objective is to understand which kinds of apps are most profitable when revenue comes from in-app ads. Looking at how users interact with each app is a good way to see which ones stand out, and it gives developers a direction regarding what kind of apps to build.

## Opening and exploring the data
To save time, I'll open a data set with information about ~10000 Google Play apps and ~7000 Apple Store apps, instead of collecting information about all of the approximately 4 million apps available across both stores.

In [1]:
from csv import reader

# Opening the Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
googleapps = list(read_file)
opened_file.close()
google_apps_data = googleapps[1:] # all rows except the header

# Opening the Apple Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
appleapps = list(read_file)
opened_file.close()
apple_apps_data = appleapps[1:] # all rows except the header
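
Both loading steps follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that is not part of this notebook, and the demo writes a tiny temporary file instead of reading the real data sets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-block closes the file even if reading fails;
    # newline='' is what the csv module documentation recommends.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Demo on a throwaway file, so the sketch runs without the Kaggle data
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as demo_file:
        csv.writer(demo_file).writerows([['App', 'Reviews'], ['Box', '159872']])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like `load_csv('googleplaystore.csv')`, assuming the file sits next to the notebook.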



In [2]:
# Data exploration function: prints the rows of the chosen dataset from the
# start row up to (but not including) the end row, with blank lines between rows
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(googleapps[0])
print('\n')
explore_data(google_apps_data, 0, 4, True) # the function only prints, so there is nothing to store

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


For the Google Play data set, there is data on 10841 apps in 13 columns. The columns that look useful at first are: App, Category, Rating, Reviews, Installs, Price and Genres.

If there are any questions about a column, here is a link to the [data set](https://www.kaggle.com/lava18/google-play-store-apps) documentation, where more information is available.
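
Because each row is a plain list, the columns are accessed by position rather than by name. A minimal sketch of looking the positions up from the header printed above; `column_index` and `useful_positions` are names introduced here for illustration only.

```python
# Header row as printed by the exploration cell above
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

# Map each column name to its position in a row
column_index = {name: position for position, name in enumerate(google_header)}

useful = ['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Price', 'Genres']
useful_positions = [column_index[name] for name in useful]
print(useful_positions)  # [0, 1, 2, 3, 5, 7, 9]
```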

Next is the Apple Store exploration. 

In [3]:
print(appleapps[0])
print('\n')
explore_data(apple_apps_data, 0, 4, True) # the function only prints, so there is nothing to store

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


Regarding the Apple Store data set, there are 7197 apps and 16 columns. The ones that can be interesting are: track_name, currency, price, rating_count_tot, rating_count_ver and prime_genre. For help with columns that are not so clear, here is a [link to the documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) on the data set.

## Cleaning the data
One row of the Google Play data set contains wrong information, as reported in a thread in the data set's discussion section, so it has to be deleted. The error is in entry 10472 (not counting the header). Below, this row is printed and compared to the header row and a correct row.

In [4]:
print(googleapps[0])
print('\n')
print(googleapps[1])
print('\n')
print(googleapps[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


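**The "Category" value is missing, which shifts the remaining values one column to the left and leaves the row with 12 entries instead of 13; the "Genres" value is also empty, as seen in the comparison above.**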
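
A row with a missing value has fewer entries than the header, which suggests a simple way to look for such rows without relying on a forum thread. The sketch below assumes that a length mismatch is the only symptom; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for the demo.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(index, row) for index, row in enumerate(rows) if len(row) != expected]


# Made-up sample data: the second row lacks its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good app', 'TOOLS', '4.1'],
    ['Broken app', '1.9'],
    ['Another app', 'GAME', '3.8'],
]

bad_rows = find_malformed_rows(sample_header, sample_rows)
print(bad_rows)  # [(1, ['Broken app', '1.9'])]
```

On the real data, the call would be `find_malformed_rows(googleapps[0], google_apps_data)`.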

In [5]:
#Deleting the row with error
print(len(google_apps_data))
del google_apps_data[10472] # run this only once: a second run would delete a valid row
print(len(google_apps_data))

10841
10840


Some applications also appear multiple times (see below). Duplicate entries are of no interest, so we'll identify them first and remove them afterwards.

In [7]:
for app in google_apps_data:
    name = app[0]
    if name in ('Instagram', 'Box'):
        print(app)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Va

Next, the number of duplicates is counted.

In [20]:
unique_apps = []
duplicates = []

for app in google_apps_data:
    name = app[0]
    if name in unique_apps:
        duplicates.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicates:',len(duplicates))
print('Number of unique apps:',len(unique_apps))

Number of duplicates: 1181
Number of unique apps: 9659
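
The same counts can be obtained with `collections.Counter` from the standard library, which avoids scanning a list for every app. This is an alternative sketch, not the method used in this notebook, and the sample rows are shortened, made-up stand-ins for the real data.

```python
from collections import Counter

# Shortened stand-in rows: only the app name (index 0) matters here
sample_rows = [
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
    ['Instagram', 'SOCIAL'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN'],
]

name_counts = Counter(row[0] for row in sample_rows)
n_unique = len(name_counts)
n_duplicates = len(sample_rows) - n_unique  # every extra occurrence is a duplicate
repeated = {name: count for name, count in name_counts.items() if count > 1}

print('Number of duplicates:', n_duplicates)
print('Number of unique apps:', n_unique)
print(repeated)
```

The `repeated` dictionary also shows how many times each duplicated app appears, which the two lists above do not record directly.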


The main difference between the duplicate entries shown above is in the fourth column (number of reviews). The different numbers reveal that the data was collected at different times. For each app, the row that will be kept is the one with the highest number of reviews, since that should be the most recent entry for that application. The other rows will be removed because we don't want to count an app more than once when analyzing the data.

To remove the duplicate rows we will:
* Create a dictionary, where each key is an app name and the corresponding value is the highest number of reviews for that app;
* Use the information stored in the dictionary to create a new dataset with only one entry per app (the entry with the highest number of reviews in the previous dataset).


In [11]:
#building the dictionary
reviews_max = {}

for app in google_apps_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
#print(reviews_max)

In [17]:
android_clean = [] #new cleaned data set
already_added = [] #store app names

for app in google_apps_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
#print(android_clean)

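In the two cells above, a dictionary with the highest number of reviews for each app was built and then used to remove the duplicate rows.

Two new lists were created for this: android_clean, the new dataset to be used, which has no repeated rows, and already_added, which keeps track of the app names that were already copied. The second list is needed because some duplicates share the same highest number of reviews (Box above, for example), and without it every one of those rows would be added.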
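
For comparison, the dictionary and the filtering step can be folded into a single pass by storing the whole row instead of just the review count. This is a sketch of a variant, not the approach used above: `keep_most_reviewed` is a hypothetical name, the sample rows are shortened, and the result is ordered by each app's first appearance (dictionaries keep insertion order in Python 3.7+), which can differ from the order of android_clean.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened stand-in rows: name, category, rating, reviews
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```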

The expected number of rows is now 9659, since 1181 of the 10840 rows were duplicates.
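
The expectation can be written down as a quick check. The sketch below uses the counts printed earlier and a hypothetical helper, `has_unique_names`, that confirms no app name appears twice; the sample rows are made up.

```python
def has_unique_names(rows, name_index=0):
    """Return True when no app name appears in more than one row."""
    names = [row[name_index] for row in rows]
    return len(names) == len(set(names))


rows_before = 10840      # rows left after deleting the malformed entry
duplicates_found = 1181  # duplicates counted earlier
expected_rows = rows_before - duplicates_found
print(expected_rows)  # 9659

# Made-up rows to show the helper on clean and on duplicated data
print(has_unique_names([['Box'], ['Instagram']]))  # True
print(has_unique_names([['Box'], ['Box']]))        # False
```

On the real data, `has_unique_names(android_clean)` and `len(android_clean) == expected_rows` should both hold if the cleaning worked as intended.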

In [18]:
#checking the first rows and the number of rows of the cleaned data set
explore_data(android_clean,0,3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
