# Profitable Apps in Google Play and App Store
As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. The chart below compares five different app marketplaces.

![App statistics](py1m8_statista.png)

## 1. Exploring the Data
First, let's open our datasets. We already have two of them: one describing apps on Google Play and one describing apps on the App Store.

In [1]:
from csv import reader

# This is the dataset for App Store
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

# This is the dataset for Google Play
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

We build a function to make reading and exploring the data easier. A small preview of the App Store dataset is shown below.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(ios_header)
print('\n')
explore_data(ios, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


From the App Store preview above, we identified several columns that may be useful for our analysis: **track_name**, **price**, **rating_count_tot**, **user_rating**, and **prime_genre**.

Now, let's see what's in the Google Play dataset.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In the Google Play dataset, we determine that **App**, **Category**, **Rating**, **Reviews**, **Price**, and **Genres** will be useful for our analysis.
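
Because each row is a plain list, these columns are addressed by position rather than by name. A minimal sketch of mapping the column names of interest to their indices, using the header printed above (the `column_index` helper and the `header` variable are illustrative names, not part of the notebook):

```python
# Header copied from the Google Play preview above.
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
          'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
          'Current Ver', 'Android Ver']

def column_index(header_row, column_name):
    """Return the position of column_name in header_row (hypothetical helper)."""
    return header_row.index(column_name)

columns_of_interest = ['App', 'Category', 'Rating', 'Reviews', 'Price', 'Genres']
indices = {name: column_index(header, name) for name in columns_of_interest}
print(indices)
# {'App': 0, 'Category': 1, 'Rating': 2, 'Reviews': 3, 'Price': 7, 'Genres': 9}
```

Looking indices up this way avoids hard-coding numbers such as `app[3]` when the column layout is not memorized.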

## 2. Cleaning the Data
In this section, we clean the data before analyzing it further. The section is broken down into several subsections.

### 2.1 Deleting wrong data

According to [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the Google Play dataset, there is an error in row 10472. We have to confirm this error first:

In [4]:
print(android_header, '\n')
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print('Error occurs in row: ', index, '\n')
        print('Preview:', '\n', row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Error occurs in row:  10472 

Preview: 
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The result above shows that there is indeed an error in row 10472, since the length of the row does not match the length of the header. On closer inspection, the row is missing its 'Category' value: '1.9' is very unlikely to be a category, so it must be the rating, shifted left from the adjacent column.

Therefore, we will delete this row:

In [5]:
del android[10472] 
# Careful not to run this command more than once
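
A more defensive variant, sketched here on toy rows rather than the real dataset, removes malformed rows by filtering on length instead of deleting by index. Running it a second time has no further effect. The `header` and `rows` names and their contents are illustrative assumptions.

```python
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # malformed: the 'Category' value is missing
    ['App C', 'GAME', '3.9'],
]

# Keep only rows whose length matches the header.
rows = [row for row in rows if len(row) == len(header)]
# Running the same filter again changes nothing, so it is safe to re-run.
rows = [row for row in rows if len(row) == len(header)]

print(len(rows))  # 2
```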

Let's apply a similar process to check the App Store dataset.

In [6]:
flag = 0  # Set once before the loop so an earlier error is not forgotten
for index, row in enumerate(ios):
    if len(row) != len(ios_header):
        print('Error occurs in row: ', index, '\n')
        print('Preview:', '\n', row)
        flag = 1

if flag == 0:
    print('No error detected')

No error detected


We did not find any errors in the App Store dataset, so we can continue to the next step of the data cleaning process.

### 2.2 Removing duplicate entries
Next, we check each dataset for duplicate entries, meaning the same app appears several times in the dataset.

In [7]:
duplicate_apps_android = []
unique_apps_android = []

for app in android:
    name = app[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)
        
print('No of duplicates: ', len(duplicate_apps_android), '\n')
print(duplicate_apps_android[:4])

No of duplicates:  1181 

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']


Now we know there are 1181 duplicates in the Google Play dataset. Let's run a similar check for duplicates in the App Store dataset.

In [8]:
duplicate_apps_ios = []
unique_apps_ios = []

for app in ios:
    app_id = app[0]  # For ios, we compare the 'id' column (index 0)
    if app_id in unique_apps_ios:
        duplicate_apps_ios.append(app_id)
    else:
        unique_apps_ios.append(app_id)
        
print('No of duplicates: ', len(duplicate_apps_ios), '\n')
print(duplicate_apps_ios[:4])

No of duplicates:  0 

[]


The result shows that there are no duplicate entries in the App Store dataset. Here we compared apps by their **id**.
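
The `name in unique_apps_android` test scans a list on every iteration, which gets slow as a dataset grows. A minimal sketch of the same count using `collections.Counter` from the standard library, shown on toy names rather than the real data (the `names` list is made up for illustration):

```python
from collections import Counter

names = ['Box', 'Zoom', 'Box', 'Slack', 'Box', 'Zoom']

counts = Counter(names)
# An app that appears n times contributes n - 1 duplicate entries.
n_duplicates = sum(count - 1 for count in counts.values())
duplicated_names = [name for name, count in counts.items() if count > 1]

print(n_duplicates)      # 3
print(duplicated_names)  # ['Box', 'Zoom']
```

The total matches what the list-based loop would report, because that loop also records every occurrence after the first.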

Next, we have to remove the duplicates from the Google Play dataset, retaining a single entry for each app. Rather than picking one of the duplicates at random, we keep the most recent entry, which we identify as the one with the highest number of reviews.

To do that, we will:
- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new dataset, which will have only one entry per app

First, we build the dictionary:

In [9]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print(len(android) - len(duplicate_apps_android))

9659
9659


We have built the dictionary and verified that its length matches the expected length of the Google Play dataset after removing duplicates. The expected length is the current length minus the number of duplicates: 10840 - 1181 = 9659.

Now that we have the `reviews_max` dictionary, we can build our new dataset, which contains only unique apps.

We loop through the `android` dataset and, in each iteration, compare the number of reviews with the value stored in `reviews_max`. If they match, we append the row to our new dataset `android_clean`.

In detail, the code below does the following:
- creates two lists: `android_clean`, which will be our new dataset, and `already_added`, which tracks which apps have already been added to `android_clean`,
- iterates over each row in the `android` dataset,
- obtains the app name (index 0) and assigns it to the variable `name`,
- obtains the number of reviews (index 3) and assigns it to the variable `n_reviews`,
- appends the row to `android_clean` only if `n_reviews` matches the corresponding value in the `reviews_max` dictionary we built in the previous step and the app name is not yet in `already_added`. The second condition matters because several duplicates of an app can share the same highest review count.

Then we test by printing the length of our new dataset. The expected length is 9659.

In [10]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print('Our new dataset length: ', len(android_clean))

Our new dataset length:  9659


The length of our new dataset is verified.
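
For comparison, the two passes above can be collapsed into one by keeping the best row per app name in a dictionary. This is a sketch on toy rows, not the approach used in this notebook; `best_row` and the sample data are illustrative assumptions.

```python
rows = [
    ['Box', 'BUSINESS', '4.2', '159043'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
]

best_row = {}
for row in rows:
    name = row[0]
    n_reviews = float(row[3])
    # Strict '>' keeps the first row when review counts tie.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
print(len(deduplicated))  # 2
```

Ties keep the first occurrence, which mirrors the `already_added` check. The row order can differ from `android_clean`, since the dictionary is ordered by each name's first appearance.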

### 2.3 Removing non-English apps
In this project, we limit our analysis to apps designed for an English-speaking audience. So we have to check our datasets for non-English apps and remove them.

To do this, we check for app names that contain symbols not commonly used in English text. The built-in function `ord()` returns the corresponding number for each character. According to ASCII (American Standard Code for Information Interchange), characters used in English text correspond to numbers in the range 0 to 127. Therefore, if an app name in our datasets contains characters with numbers outside this range, the app is most likely not in English.
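
A quick check of `ord()` on a few characters makes the 0 to 127 boundary concrete (the sample characters are chosen for illustration):

```python
sample_characters = ['a', 'A', '+', '爱', '™', '😜']

# Map each character to the number ord() returns for it.
code_points = {character: ord(character) for character in sample_characters}

for character, number in code_points.items():
    print(repr(character), number, number <= 127)
# 'a' 97 True
# 'A' 65 True
# '+' 43 True
# '爱' 29233 False
# '™' 8482 False
# '😜' 128540 False
```

Note that `™` and the emoji fall outside the range even though they often appear in English app names.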

To make the check reusable, we build a function `is_english` that takes a string and returns `True` if the text is in English and `False` otherwise. To make the function more robust, we only classify a string as non-English if it contains more than three characters with numbers above 127, so a name with an emoji or a few other uncommon symbols is still classified as English.

In [15]:
def is_english(string):
    count = 0  # Number of non-ASCII characters seen so far
    for character in string:
        if (ord(character) > 127):
            count += 1
            if count > 3:
                return False
    return True

# Test the function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
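
The threshold is easiest to see at the boundary. The sketch below restates the check in a compact but equivalent form (counting all non-ASCII characters instead of returning early) and tests names with exactly three and exactly four such characters. The name `is_english_compact` and the sample strings are made up for illustration.

```python
def is_english_compact(string):
    # Count characters outside the ASCII range (number above 127).
    non_ascii = sum(1 for character in string if ord(character) > 127)
    # Up to three such characters are tolerated.
    return non_ascii <= 3

three = is_english_compact('Docs™™™')    # three non-ASCII characters
four = is_english_compact('Docs™™™™')    # four non-ASCII characters
print(three, four)  # True False
```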


In [16]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if (is_english(name)):
        android_english.append(app)

for app in ios:
    name = app[1]  # For ios, the app name is under column 'track_name' (index 1)
    if (is_english(name)):
        ios_english.append(app)
    
print('Google Play dataset length: ', len(android_english))
print('App Store dataset length: ', len(ios_english))

Google Play dataset length:  9614
App Store dataset length:  6183
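
As a quick sanity check, the number of rows removed by the English filter follows from simple subtraction. This is a worked calculation using only the figures printed earlier in this notebook; the variable names are illustrative.

```python
# Row counts before and after the English filter, taken from the outputs above.
android_before, android_after = 9659, 9614
ios_before, ios_after = 7197, 6183

android_removed = android_before - android_after
ios_removed = ios_before - ios_after

print('Google Play rows removed: ', android_removed)  # 45
print('App Store rows removed: ', ios_removed)        # 1014
```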
