# Profitable App Profiles

## About Project:
In this project, we will be analysing data about apps from the Apple app store as well as the Android app store.

## Goal of Project:
Our goal is to help developers understand what kinds of apps are more likely to be downloaded by users, and what kinds of apps are likely to be profitable.

### Dataset sources:
[**Android (Google Playstore) Dataset**](https://www.kaggle.com/lava18/google-play-store-apps)

[**Apple App Store Dataset**](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader
def open_dataset(dataset, header = True): #defining function, takes two input parameters
    #the with-block closes the file once it has been read; the encoding is set explicitly so the result does not depend on the platform default
    with open(dataset, encoding='utf8') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file) #converts the file into a list of rows
    if header:
        header_dataset = data[0]
        list_dataset = data[1:]
        return header_dataset, list_dataset #if there is a header, a tuple of the header and data will be returned
    else:
        return data #if there is no header, all rows are returned as data

In [2]:
#opening the Android data set
android_data = open_dataset('googleplaystore.csv')
android_data_header, android_dataset = android_data #assigning variables to the header and actual data
#print(android_dataset)

#opening the Apple data set
apple_data = open_dataset('AppleStore.csv')
apple_data_header, apple_dataset = apple_data #assigning variables to the header and actual data
#print(apple_data_header)

In the section above, we passed the Android and Apple CSV files to the `open_dataset` function and unpacked each result into two variables. One variable holds the header row of the dataset, while the second holds the data rows without the header.
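As a side note, here is a minimal, self-contained sketch of the same loading pattern run against a tiny made-up file. The name `load_csv`, the temporary file, and its contents are hypothetical and only illustrate the call shape; they are not part of the datasets used in this project.

```python
import csv
import os
import tempfile

def load_csv(path, has_header=True, encoding="utf8"):
    """Return (header, rows) if has_header is True, otherwise just the rows."""
    # The with-block closes the file even if reading fails.
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    if has_header:
        return rows[0], rows[1:]
    return rows

# Tiny made-up file (hypothetical data), written only to demonstrate the call.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf8") as tmp:
    tmp.write("App,Price\nExample App,0\nOther App,$1.99\n")
    sample_path = tmp.name

sample_header, sample_rows = load_csv(sample_path)
os.remove(sample_path)  # clean up the temporary file

print(sample_header)
print(sample_rows)
```

Unpacking the returned tuple directly (`header, rows = ...`) avoids the intermediate variable used in the cell above, but both approaches give the same two lists.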

### Exploring our datasets

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False): #dataset takes in list, start and end take in integers
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
#showing the first 5 rows of the Android dataset, with how many rows and columns there are. The header row is not counted because it is stored in a separate variable
print("These are the first 5 rows of the Android dataset\n")
explore_data(android_dataset, 0, 5, rows_and_columns=True) #android dataset, keyword argument


print("\nThese are the column names for the Android dataset:\n")
print("\n".join(android_data_header))




These are the first 5 rows of the Android dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13

We can see the first 5 rows of the Android dataset, without the header. The number of rows is 10841, and the number of columns is 13. The column names are listed above.

In [5]:
print("\nThese are the first 5 rows of the Apple dataset")
explore_data(apple_dataset, 0, 5, True) #positional argument

print("\nThese are the column names for the Apple dataset:\n")
print("\n".join(apple_data_header))


These are the first 5 rows of the Apple dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16

These are the column names for the Apple dataset:

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num


We can see the first 5 rows of the Apple dataset, without the header. The number of rows is 7197, and the number of columns is 16. The column names are listed above. The documentation to understand the column names can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

### Finding wrong/incorrect data

In [6]:
#trying to find wrong data in the Android dataset
for index, row in enumerate(android_dataset): #enumerate gives each row's position, so we do not need to search the list again for it
    if len(row) != len(android_data_header):
        print("The wrong data row is:")
        print(row)
        print("\n")
        print("Index of the wrong row is " + str(index))

print("\nThe headers of the Android dataset are:\n")
print(android_data_header)

print("\nA normal row in an Android dataset looks like:\n")
print(android_dataset[3])
        

The wrong data row is:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index of the wrong row is 10472

The headers of the Android dataset are:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

A normal row in an Android dataset looks like:

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


It seems that row 10472 is missing the **Category** data, shifting everything by one cell.

In [7]:
corrected_android_dataset = android_dataset[:10472] + android_dataset[10473:] #new dataset without the wrong data
#print(corrected_android_dataset[10470:10480])
explore_data(corrected_android_dataset, 0, 5, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10840
Number of columns: 13


The wrong data row has been removed, and the new corrected data set is stored in the variable `corrected_android_dataset`. The new number of rows is 10840, and columns remain the same.
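The two steps above (finding the malformed row, then rebuilding the list without it) could also be done in one pass, without hard-coding the index. The sketch below assumes a hypothetical helper `drop_malformed_rows` and uses made-up sample rows, so treat it as an illustration rather than the method used in this notebook.

```python
def drop_malformed_rows(header, rows):
    """Split rows into those matching the header length and those that do not."""
    good, bad = [], []
    for index, row in enumerate(rows):
        if len(row) == len(header):
            good.append(row)
        else:
            bad.append((index, row))  # remember where the bad row was found
    return good, bad

# Made-up sample data (hypothetical), shaped like a smaller version of the dataset.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "3.9"],  # Category is missing, so the row is one cell short
    ["App C", "GAME", "4.5"],
]

kept, dropped = drop_malformed_rows(sample_header, sample_rows)
print("Rows kept:", len(kept))
print("Rows dropped:", dropped)
```

Returning the dropped rows together with their indexes keeps the cleaning step auditable: we can still inspect what was removed and why.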

### Removing duplicate entries

A closer look at the Android dataset shows that it contains duplicate entries. In this section, we find those duplicates and decide on a criterion for removing them.

In [8]:
#initializing 2 lists for unique and duplicate apps
unique_android_apps = []
duplicate_android_apps = []
for app in corrected_android_dataset:
    name = app[0]
    
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)

print("There are " + str(len(unique_android_apps)) + " unique apps. Some of the unique apps are:\n")
print("\n".join(unique_android_apps[:15]))
print("\n")

print("There are " + str(len(duplicate_android_apps)) + " duplicate apps. Some of the duplicate apps are:\n")
print("\n".join(duplicate_android_apps[:15]))
print("\n")
print("An example of a duplicate app is Instagram. The only difference between the rows is the number of reviews (index 3):\n")
for app in corrected_android_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

There are 9659 unique apps. Some of the unique apps are:

Photo Editor & Candy Camera & Grid & ScrapBook
Coloring book moana
U Launcher Lite – FREE Live Cool Themes, Hide Apps
Sketch - Draw & Paint
Pixel Draw - Number Art Coloring Book
Paper flowers instructions
Smoke Effect Photo Maker - Smoke Editor
Infinite Painter
Garden Coloring Book
Kids Paint Free - Drawing Fun
Text on Photo - Fonteee
Name Art Photo Editor - Focus n Filters
Tattoo Name On My Photo Editor
Mandala Coloring Book
3D Color Pixel by Number - Sandbox Art Coloring


There are 1181 duplicate apps. Some of the duplicate apps are:

Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack
FreshBooks Classic
Insightly CRM
QuickBooks Accounting: Invoicing & Expenses
HipChat - Chat Built for Teams
Xero Accounting Software


An example of a duplicate app is Instagram. The only difference between the rows is the number of reviews (index 3):



We can see that there are 1181 duplicate app listings in the Android dataset. Since the only difference between these entries is the number of reviews (index 3), we can use it to decide which listing to keep. The higher the number of reviews, the more recently the data was likely collected, so for each app we keep the listing with the most reviews and remove the others.
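As a cross-check on the duplicate count, the same tally can be sketched with `collections.Counter` from the standard library. The sample rows below are made up, and the variable names are hypothetical; the point is only that the number of extra rows equals the total number of rows minus the number of unique names.

```python
from collections import Counter

# Made-up sample rows (hypothetical): name, category, rating, reviews.
sample_rows = [
    ["App A", "TOOLS", "4.1", "100"],
    ["App B", "GAME", "4.5", "250"],
    ["App A", "TOOLS", "4.1", "120"],
    ["App A", "TOOLS", "4.1", "120"],
]

# Count how many times each app name appears.
name_counts = Counter(row[0] for row in sample_rows)

# Names that appear more than once, with their number of occurrences.
duplicate_names = {name: n for name, n in name_counts.items() if n > 1}

# Every occurrence after the first one is an extra row to remove.
extra_rows = sum(n - 1 for n in duplicate_names.values())

print(duplicate_names)
print("Extra rows:", extra_rows)
```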

In [13]:
reviews_max = {}
for app in corrected_android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews: #the app name already exists as a key and this row has MORE reviews than the stored value, so we update the dictionary
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max: #the app does not exist yet, so we add it to the dictionary
        reviews_max[name] = n_reviews
        
print("The number of items in the dictionary are: " + str(len(reviews_max.keys())))

The number of items in the dictionary are: 9659


We now have a dictionary where the key is the unique name of the app, and the value is the max number of reviews.

In [17]:
android_clean = []
already_added = []

for app in corrected_android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    #now we compare the reviews_max dictionary to our dataset
    if (reviews_max[name] == n_reviews) and (name not in already_added): #the second condition is needed because some duplicate rows share the same maximum number of reviews; without it, those apps would be added more than once
        android_clean.append(app) #appending the whole row. This will be our new dataset
        already_added.append(name) #only appending the name of the app
print("To make sure everything went like we wanted it to, we explore the data from the new android_clean dataset\n")
explore_data(android_clean,0,10,True)

To make sure everything went like we wanted it to, we explore the data from the new android_clean dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']

In order to make sure we have a new dataset that only contains unique app names, with the highest number of reviews, we first initialized two lists: `android_clean` and `already_added`.

We then looped over `corrected_android_dataset` and compared each row to the `reviews_max` dictionary. A row was appended to `android_clean` (and its name to `already_added`) only if both of the following conditions held:
1. The number of reviews in the row is the same as the maximum stored in `reviews_max`, AND
2. The name did not already exist in the `already_added` list.

Exploring the data matches what we expected. We see **9659** entries, the same as the number of unique apps found earlier, and the number of columns remains the same.
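For comparison, the two passes above can be folded into a single pass that stores the best row per app name in a dictionary. This is a sketch under stated assumptions, not the approach used in this notebook: the function name `keep_most_reviewed` and the sample rows are hypothetical, and the result is ordered by the first appearance of each name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strictly greater, so ties keep the row that was seen first.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows (hypothetical): name, category, rating, reviews.
sample_rows = [
    ["App A", "TOOLS", "4.1", "100"],
    ["App B", "GAME", "4.5", "250"],
    ["App A", "TOOLS", "4.1", "120"],
    ["App A", "TOOLS", "4.1", "120"],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

Dictionary lookups take roughly constant time, whereas `name not in already_added` scans a list that keeps growing, so the single-pass version should scale better on larger datasets.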

### Removing non-English apps

In [27]:
def is_english(string):
    """
    This function will take a string, and if it is non-English, it will return False, else True.
    """
    for i in string:
        if ord(i) > 127:
            return False
        
    return True

is_english('Docs To Go™ Free Office Suite')

False

The above function is too strict: we passed in an English app name, yet it returned False. The reason is that symbols like ™ and emojis fall outside the ASCII range, so their code points (the values returned by `ord`) are greater than 127. To filter more accurately, we can allow up to 3 non-ASCII characters. This will likely account for emojis and special symbols like ™.

In [28]:
def is_english_better(string):
    """
    This function does pretty much the same thing as the last function, but allows up to 3 non-ASCII characters
    """
    non_ascii_count = 0 #initializing a non_ascii count
    
    for i in string:
        if ord(i) > 127:
            non_ascii_count +=1
        
    if non_ascii_count > 3:
        return False
    else:
        return True
    
is_english_better('Docs To Go™ Free Office Suite')

True

Since this filtration system works better for us, we will use it to filter out all non-English apps.

In [38]:
def is_english_app(dataset,index_of_name=0): #the second parameter is because Android data set name index is 0, but Apple dataset is 1
    
    dataset_english = []
    
    for app in dataset:
        name = app[index_of_name]
        
        if is_english_better(name): #checking that name is in english using previous function
            dataset_english.append(app)
    
    return dataset_english

print("The Android apps with English names are:\n")
android_english = is_english_app(android_clean)
explore_data(android_english,0,5,True)
print("\n")

print("The Apple apps with English names are:\n")
apple_english = is_english_app(apple_dataset,1)
explore_data(apple_english,0,5,True)



The Android apps with English names are:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


The Apple apps with English names are:

Using our function to filter out only English names, we have now created two new datasets:
1. Android dataset = `android_english`
2. Apple dataset = `apple_english`

### Isolating the free apps
In this section, we define a new function that splits a dataset into a list of free apps and a list of non-free apps.

In [41]:
def free_apps(dataset, price_index, int_or_float='integer'):
    """
    This function takes in the dataset, the index of the price column, and how a free price is written in that column.
    If a free price is stored as '0', the argument passed should be 'integer'; if it is stored as '0.0', it should be 'float'.
    """
    if int_or_float == 'integer':
        free_price = '0'
    elif int_or_float == 'float':
        free_price = '0.0'
    else:
        raise ValueError("int_or_float must be 'integer' or 'float'") #fail loudly instead of silently returning two empty lists
    
    free_apps_list = []
    not_free_apps_list = []
    
    for app in dataset:
        price = app[price_index]
        
        if price == free_price: #prices are stored as strings, so we compare against the string form of zero
            free_apps_list.append(app)
        else:
            not_free_apps_list.append(app)
            
    return free_apps_list, not_free_apps_list


The above function returns a tuple of free and non-free apps, which needs to be unpacked into separate variables. If the dataset stores a free price as `'0'`, the `int_or_float` argument should be `'integer'`; if it stores it as `'0.0'`, it should be `'float'`.

The index for Apple Store price data is **4**.

The index for Android Store price data is **7**.
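An alternative that avoids the format flag is to convert the price string to a number before comparing. The sketch below is hedged: it assumes that paid prices may carry a leading `$` (an assumption, since only free rows are shown in this notebook), and the names `parse_price` and `split_free_paid` as well as the sample rows are hypothetical.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is an optional leading '$'.
    cleaned = raw.strip().lstrip("$")
    return float(cleaned)

def split_free_paid(rows, price_index):
    """Return (free_rows, paid_rows) based on the numeric value of the price."""
    free, paid = [], []
    for row in rows:
        if parse_price(row[price_index]) == 0:
            free.append(row)
        else:
            paid.append(row)
    return free, paid

# Made-up sample rows (hypothetical): name, price.
sample_rows = [
    ["App A", "0"],
    ["App B", "0.0"],
    ["App C", "$4.99"],
    ["App D", "2.99"],
]

free_rows, paid_rows = split_free_paid(sample_rows, price_index=1)
print("Free:", [row[0] for row in free_rows])
print("Paid:", [row[0] for row in paid_rows])
```

With this approach the same call works for both stores, because `'0'` and `'0.0'` both convert to zero. A value that is not a number (for example an empty string) would raise a `ValueError`, which this sketch does not handle.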



In [48]:
apple_free_notfree_tuple = free_apps(apple_english, 4, 'float')
apple_free, apple_not_free = apple_free_notfree_tuple

explore_data(apple_free,0,5,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 3222
Number of columns: 16


For Apple's App Store, we have 3222 apps that are free.

In [49]:
android_free_notfree_tuple = free_apps(android_english, 7)
android_free, android_not_free = android_free_notfree_tuple

explore_data(android_free,1,5,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


For Android's Play Store, we have 8864 apps that are free.

### Most common apps by genre

In this section, we will explore the most common app genres for both the Apple App Store and the Android Play Store.

The rationale is that we would like to make an app that is actually successful. We want to analyze the market to see what kinds of apps succeed, and then build that kind of app for Android first. If we get a good response, we will make it for Apple as well.
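One way to find the most common genres is a frequency table of percentages, sorted in descending order. The sketch below is only an illustration under stated assumptions: the function names `freq_table` and `sorted_freq` and the sample rows are hypothetical. Going by the column listings shown earlier, the genre column would be index 11 (`prime_genre`) for the Apple dataset and index 1 (`Category`) or 9 (`Genres`) for the Android dataset.

```python
def freq_table(rows, index):
    """Map each value found in column `index` to its share of rows, in percent."""
    counts = {}
    for row in rows:
        key = row[index]
        counts[key] = counts.get(key, 0) + 1
    total = len(rows)
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_freq(rows, index):
    """Return (value, percentage) pairs, most common first."""
    table = freq_table(rows, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows (hypothetical): name, genre.
sample_rows = [
    ["App A", "Games"],
    ["App B", "Music"],
    ["App C", "Games"],
    ["App D", "Social Networking"],
]

for genre, percentage in sorted_freq(sample_rows, index=1):
    print(genre, ":", percentage)
```

Percentages are used instead of raw counts so that the two stores, which have different numbers of apps, can be compared directly.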



In [None]:
T