
In [1]:
from google.colab import drive 
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).



# **Profitable App Profiles for the App Store and Google Play Markets**
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## **Opening and Exploring the Data**

In [2]:
import csv

# Google Play data set: read all rows, then split the header from the data
with open('/content/drive/MyDrive/my_datasets/Project: Profitable App Profiles for the App Store and Google Play Markets/googleplaystore.csv', encoding='utf8') as open_file:
  android = list(csv.reader(open_file))
android_header = android[0]
android = android[1:]

# App Store data set
with open('/content/drive/MyDrive/my_datasets/Project: Profitable App Profiles for the App Store and Google Play Markets/AppleStore.csv', encoding='utf8') as open_file:
  ios = list(csv.reader(open_file))

# drop the leading column of every row (header included) so the header starts at 'id'
for row in ios:
  row.pop(0)

ios_header = ios[0]
ios = ios[1:]


In [3]:
print(android_header)
print('\n')
print(android[:10])
print('\n')
print(ios_header)
print('\n')
print(ios[:10])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyo


To make it easier to explore the two data sets, we'll first write a function named **explore_data()** that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.




In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
  dataset_slice = dataset[start:end]
  for row in dataset_slice:
    print(row)
    print('\n') # add a new empty line between rows

  if rows_and_columns:
    print('Number of rows:', len(dataset))
    # use the first row instead of the loop variable, which is undefined for an empty slice
    print('Number of columns:', len(dataset[0]))

In [5]:
print(android_header)
print('\n')
explore_data(android, 0, 10, True) # the function prints on its own; wrapping it in print() would also print None

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are '**App**', '**Category**', '**Reviews**', '**Installs**', '**Type**', '**Price**', and '**Genres**'.
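Because each row is a plain list, these columns have to be addressed by position. The following is a minimal sketch of how the positions could be looked up from the header; `column_index()` and `useful_columns` are hypothetical names that are not part of the original notebook.

```python
# Header copied from the Google Play output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
                  'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
                  'Android Ver']

def column_index(header, name):
  # hypothetical helper: list.index raises ValueError if the name is not in the header
  return header.index(name)

# hypothetical list of the columns singled out in the text
useful_columns = ['App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']
positions = {name: column_index(android_header, name) for name in useful_columns}

for name, index in positions.items():
  print(name, '->', index)
```

Looking the index up by name avoids hard-coding numbers such as `app[5]`, at the cost of one extra call per column.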

Now let's take a look at the App Store data set.

In [6]:
print(ios_header)
print('\n')
explore_data(ios, 0, 10, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['2836193

We have 7197 iOS apps in this data set, and the columns that seem interesting are: '**track_name**', '**currency**', '**price**', '**rating_count_tot**', '**rating_count_ver**', and '**prime_genre**'. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## **Deleting Wrong Data**

We'll write a function, **find_missing_data()**, that compares the length of each row with the length of the header and prints the index of every row whose length differs. A row with the wrong length signals missing or misaligned data.

In [7]:
def find_missing_data(dataset, header):
  count_err_rows = 0
  # enumerate gives each row's true position, even when identical rows appear more than once
  for index, row in enumerate(dataset):
    if len(row) != len(header):
      print('row {} has wrong length'.format(index))
      count_err_rows += 1
  if count_err_rows == 0:
    return 'Dataset has 0 rows wrong length'

In [8]:
find_missing_data(android, android_header)

row 10472 has wrong length


In [9]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 corresponds to the app ***Life Made WI-Fi Touchscreen Photo Frame***, and we can see that its rating is **19**. This is clearly off because the maximum rating for a Google Play app is **5**. The row has 12 values instead of 13: the problem is caused by a **missing value in the 'Category'** column, which shifts every later value one position to the left. As a consequence, we'll delete this row.
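To make the shift visible, the row can be paired with the header value by value. This is only an illustrative sketch using the header and row printed above; the `realigned` list is a hypothetical check that the gap sits at 'Category', not a repair the notebook performs.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
           'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

# zip stops at the shorter list, so the last header name gets no value
shifted = dict(zip(header, bad_row))
print('Category as read:', shifted['Category'])
print('Rating as read:', shifted['Rating'])

# hypothetical check: an empty placeholder at the 'Category' position restores the alignment
realigned = bad_row[:1] + [''] + bad_row[1:]
fixed = dict(zip(header, realigned))
print('Rating after realignment:', fixed['Rating'])
```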

In [13]:
print(len(android))
del android[10472] # run this cell only once; running it again would delete a valid row
print(len(android))

10841
10840


In [11]:
find_missing_data(ios, ios_header)

'Dataset has 0 rows wrong length'

## **Removing Duplicate Entries**


### **Part One**

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application *Instagram* has four entries:

In [27]:
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [28]:
for app in android:
  name = app[0]
  if name == 'Instagram':
    print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
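
The same check can be generalized from one app name to the whole data set. The sketch below counts repeated names on a small sample built from rows shown earlier, trimmed to the 'App' and 'Reviews' columns; `sample`, `seen_names`, and `duplicate_names` are hypothetical names, and on the real data the loop would run over `android` instead.

```python
# Small sample (name, reviews) taken from rows printed above; not the full data set.
sample = [
  ['Instagram', '66577313'],
  ['Instagram', '66577446'],
  ['Instagram', '66577313'],
  ['Instagram', '66509917'],
  ['Coloring book moana', '967'],
  ['Sketch - Draw & Paint', '215644'],
]

seen_names = set()     # names encountered at least once
duplicate_names = []   # one entry for every repeat after the first occurrence

for app in sample:
  name = app[0]
  if name in seen_names:
    duplicate_names.append(name)
  else:
    seen_names.add(name)

print('Number of duplicate entries:', len(duplicate_names))
print('Number of unique apps:', len(seen_names))
```

A set is used for the membership test because checking `name in seen_names` stays fast as the number of apps grows, whereas the same test on a list scans every element.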
