# Profitable App Profiles

## About Project:
In this project, we will be analysing data about apps from the Apple App Store and the Android (Google Play) store.

## Goal of Project:
Our goal is to help developers understand what kinds of apps are more likely to be downloaded by users, and what kinds of apps are likely to be profitable.

### Dataset sources:
[**Android (Google Playstore) Dataset**](https://www.kaggle.com/lava18/google-play-store-apps)

[**Apple App Store Dataset**](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader
def open_dataset(dataset, header = True): #dataset is the path to a csv file; header says whether its first row is a header
    with open(dataset, encoding='utf8') as opened_file: #assumes the file is UTF-8 encoded; the with block closes the file after reading
        read_file = reader(opened_file)
        data = list(read_file) #converts the file into a list of rows
    if header:
        header_dataset = data[0]
        list_dataset = data[1:]
        return header_dataset, list_dataset #if there is a header, a tuple of the header and data will be returned
    else:
        return data #if there is no header, all the rows will be returned

In [28]:
#opening the Android data set
android_data = open_dataset('googleplaystore.csv')
android_data_header, android_dataset = android_data #assigning variables to the header and actual data
#print(android_dataset)

#opening the Apple data set
apple_data = open_dataset('AppleStore.csv')
apple_data_header, apple_dataset = apple_data #assigning variables to the header and actual data
#print(apple_data_header)

In the section above, we passed the Android and Apple csv files to the `open_dataset` function and stored the result for each dataset in two variables. One variable holds the header row of the dataset, while the second variable holds the rest of the dataset without the header.
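
To make the header/data split concrete, here is a minimal sketch that applies the same idea to a tiny in-memory CSV instead of a file. The sample text and the names `sample_csv` and `split_header` are made up for illustration and are not part of the project.

```python
import io
from csv import reader

# Hypothetical three-line CSV standing in for a real file
sample_csv = "App,Category,Rating\nApp A,GAME,4.5\nApp B,TOOLS,3.9\n"

def split_header(lines):
    """Return (header, rows) from an iterable of CSV lines."""
    data = list(reader(lines))
    return data[0], data[1:]

# Tuple unpacking assigns the header and the data rows in one step
header, rows = split_header(io.StringIO(sample_csv))
print(header)     # ['App', 'Category', 'Rating']
print(len(rows))  # 2
```

Returning a tuple keeps the header available for later lookups while leaving the data rows free of non-data entries.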

### Exploring our datasets

In [6]:
def explore_data(dataset, start, end, rows_and_columns=False): #dataset takes in list, start and end take in integers
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # leaves two blank lines after each row (the '\n' plus print's own line break)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [32]:
#showing the first 5 rows of the Android dataset, along with the number of rows and columns. The header row is not counted because it is stored in a separate variable
print("These are the first 5 rows of the Android dataset\n")
explore_data(android_dataset, 0, 5, rows_and_columns=True) #android dataset, keyword argument


print("\nThese are the column names for the Android dataset:\n")
print("\n".join(android_data_header))


These are the first 5 rows of the Android dataset

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


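Number of rows: 10841
Number of columns: 13

These are the column names for the Android dataset:

App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver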

We can see the first 5 rows of the Android dataset, without the header. The dataset has 10841 rows and 13 columns. The column names are listed above.

In [33]:
print("\nThese are the first 5 rows of the Apple dataset")
explore_data(apple_dataset, 0, 5, True) #positional argument

print("\nThese are the column names for the Apple dataset:\n")
print("\n".join(apple_data_header))


These are the first 5 rows of the Apple dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16

These are the column names for the Apple dataset:

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num


We can see the first 5 rows of the Apple dataset, without the header. The dataset has 7197 rows and 16 columns. The column names are listed above. The documentation for the column names can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
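
When a printed row is hard to read, pairing each value with its column name can help. The sketch below does this for the first five columns of the Facebook row shown above. The helper name `label_row` and the variables `header_sample` and `row_sample` are assumptions introduced for illustration.

```python
# First five column names and the matching values from the Facebook row above
header_sample = ['id', 'track_name', 'size_bytes', 'currency', 'price']
row_sample = ['284882215', 'Facebook', '389879808', 'USD', '0.0']

def label_row(header, row):
    """Pair each column name with the value in the same position."""
    return dict(zip(header, row))

labelled = label_row(header_sample, row_sample)
for name, value in labelled.items():
    print(f"{name}: {value}")
```

Note that `zip` stops at the shorter of its two inputs, so a row with a missing value would be labelled without raising an error. A separate length check is still needed to catch such rows.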

### Finding wrong/incorrect data

In [37]:
#trying to find wrong data in the Android dataset
for index, row in enumerate(android_dataset): #enumerate gives each row's index without searching the list again
    if len(row) != len(android_data_header):
        print("The wrong data row is:")
        print(row)
        print("\n")
        print("Index of the wrong row is " + str(index))

print("\nThe headers of the Android dataset are:\n")
print(android_data_header)

print("\nA normal row in an Android dataset looks like:\n")
print(android_dataset[3])
        

The wrong data row is:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index of the wrong row is 10472

The headers of the Android dataset are:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

A normal row in an Android dataset looks like:

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


The row at index 10472 has 12 values instead of 13. It seems to be missing the **Category** data, which shifts every value after the app name one column to the left.
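
A quick way to confirm how many rows have an unexpected length is to count the row lengths. The following is a minimal sketch on a made-up three-row dataset. `row_length_counts` and `toy_dataset` are hypothetical names, not part of the original notebook.

```python
from collections import Counter

def row_length_counts(dataset):
    """Count how many rows have each length."""
    return Counter(len(row) for row in dataset)

# Toy data: the second row is missing its category value
toy_dataset = [
    ['App A', 'GAME', '4.5'],
    ['App B', '3.9'],
    ['App C', 'TOOLS', '4.1'],
]

counts = row_length_counts(toy_dataset)
print(counts)  # Counter({3: 2, 2: 1})
```

If the same check were run on `android_dataset`, any length other than 13 would point to malformed rows.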

In [38]:
corrected_android_dataset = android_dataset[:10472] + android_dataset[10473:] #new dataset without the wrong data
#print(corrected_android_dataset[10470:10480])
explore_data(corrected_android_dataset, 0, 5, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10840
Number of columns: 13


The wrong data row has been removed, and the corrected dataset is stored in the variable `corrected_android_dataset`. The dataset now has 10840 rows, and the number of columns is still 13.
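
Slicing around a fixed index works here because only one row is affected. As an alternative sketch, rows can be filtered by length so that no index has to be hard-coded. The function `drop_malformed_rows` and the toy data below are assumptions for illustration only.

```python
def drop_malformed_rows(dataset, header):
    """Keep only the rows that have as many values as the header."""
    expected = len(header)
    return [row for row in dataset if len(row) == expected]

# Toy data: the second row is missing its category value
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'GAME', '4.5'],
    ['App B', '3.9'],
    ['App C', 'TOOLS', '4.1'],
]

cleaned = drop_malformed_rows(toy_rows, toy_header)
print(len(toy_rows), len(cleaned))  # 3 2
```

Like the slicing approach, this builds a new list and leaves the original dataset unchanged.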

### Removing duplicate entries