# Opening and Exploring the Data

The aim of this project is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first see whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our goals:

- A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from https://dq-content.s3.amazonaws.com/350/googleplaystore.csv.
- A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from https://dq-content.s3.amazonaws.com/350/AppleStore.csv.

In [1]:
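from csv import reader

# Google Play data set #
# Open with an explicit encoding (app names contain non-ASCII characters)
# and let the with-block close the file once it has been read.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]  # first row holds the column names
android = android[1:]        # remaining rows are the apps

# The App Store data set #
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]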
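
Both files are read the same way, so the loading step could be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and the demo reads a made-up in-memory sample instead of the real files.

```python
from csv import reader
from io import StringIO


def load_csv(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# Made-up stand-in for a real file; note the quoted field containing a comma.
sample = StringIO('App,Category\n"Notes, Lite",TOOLS\nChess,GAME\n')
header, rows = load_csv(sample)
print(header)
print(rows)
```

Taking a file object rather than a path keeps the helper easy to try on in-memory samples; for the real data it would receive the result of `open('googleplaystore.csv', encoding='utf8')`.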


We will explore the data with the following explore_data() function, which prints a slice of rows and, optionally, the number of rows and columns:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

To print the header, the first three rows, and the row and column counts of the Android data set:

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


To print the header, the first three rows, and the row and column counts of the iOS data set:

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


# Deleting Wrong Data

In the previous step, we opened the two data sets and performed a brief exploration of the data. Before beginning our analysis, we need to make sure the data we analyze is accurate; otherwise, the results of our analysis will be wrong. This means that we need to:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. This means that we'll need to:

- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.
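
One possible way to apply both filters is sketched below. The heuristic is an assumption, not something the data sets document: a name counts as English if at most three of its characters fall outside the ASCII range (code points above 127), which tolerates symbols such as ™. The names `is_english` and `is_free`, the threshold, and the sample rows are all hypothetical, and the price column index would need to be adjusted for each data set.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic: allow a few non-ASCII characters (symbols, emoji)."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


def is_free(price):
    """Prices are strings; strip a leading '$' before comparing to zero."""
    return float(price.replace('$', '')) == 0.0


# Made-up rows of the form [name, price].
sample = [
    ['Instagram', '0.0'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播', '0.0'],
    ['Docs To Go™ Free Office Suite', '0'],
    ['Paid Notes', '$4.99'],
]

kept = [row[0] for row in sample if is_english(row[0]) and is_free(row[1])]
print(kept)
```

A stricter rule that rejects any non-ASCII character would also drop legitimate English names containing ™ or an en dash, which is why the sketch tolerates a small number of them.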

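Row 10472 of the Android data set (counting from 0, with the header excluded) was reported as broken (https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). Printing it shows 12 values instead of 13: the 'Category' value is missing, so every later value is shifted one column to the left: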

In [5]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


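Therefore, this data point will be deleted from the data set. The cell below should be run only once, because every run deletes whichever row currently sits at index 10472: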

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
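
A row like this can also be located without relying on a reported index, by comparing the length of every row with the length of the header. The following is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper, not part of the notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


header = ['App', 'Category', 'Rating']
rows = [
    ['Chess', 'GAME', '4.5'],
    ['Photo Frame', '1.9'],      # 'Category' missing, values shifted left
    ['Notes', 'TOOLS', '4.1'],
]

bad_indexes = find_malformed_rows(header, rows)
print(bad_indexes)
```

When deleting by index, it is safer to remove the reported indexes from highest to lowest so that earlier deletions do not shift the later ones.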


# Removing Duplicate Entries

There are some reports of duplicate data points in the data set (https://www.kaggle.com/lava18/google-play-store-apps/discussion). The following loop shows that 1,181 rows in the Android data set carry an app name that has already appeared in an earlier row:

In [7]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
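
The same tally can be cross-checked with `collections.Counter` from the standard library, which counts every name in a single pass. This is a sketch on made-up names, not the notebook's own method.

```python
from collections import Counter

names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']

counts = Counter(names)
# Extra rows beyond the first occurrence, for names seen more than once.
extra = {name: n - 1 for name, n in counts.items() if n > 1}
total_duplicates = sum(extra.values())

print(extra)
print(total_duplicates)
```

Here `total_duplicates` corresponds to the length of a list built like `duplicate_apps`, since each name enters that list once per extra occurrence.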


The following is a full account of all duplicate entries in the data set. The verified_apps dictionary maps each duplicated app name to the number of extra entries it has beyond its first occurrence:

In [8]:
def duplicate_verification(duplicate_apps):
    # Each name appears in duplicate_apps once per extra entry,
    # so tallying the names gives the number of extra entries per app.
    verified_apps = {}
    for app in duplicate_apps:
        verified_apps[app] = verified_apps.get(app, 0) + 1
    return verified_apps

android_duplicate_hash = duplicate_verification(duplicate_apps)
print(android_duplicate_hash)

{'Quick PDF Scanner + OCR FREE': 2, 'Box': 2, 'Google My Business': 2, 'ZOOM Cloud Meetings': 1, 'join.me - Simple Meetings': 2, 'Zenefits': 1, 'Google Ads': 2, 'Slack': 2, 'FreshBooks Classic': 1, 'Insightly CRM': 1, 'QuickBooks Accounting: Invoicing & Expenses': 2, 'HipChat - Chat Built for Teams': 1, 'Xero Accounting Software': 1, 'MailChimp - Email, Marketing Automation': 1, 'Crew - Free Messaging and Scheduling': 1, 'Asana: organize team projects': 1, 'Google Analytics': 1, 'AdWords Express': 1, 'Accounting App - Zoho Books': 1, 'Invoice & Time Tracking - Zoho': 1, 'Invoice 2go — Professional Invoices and Estimates': 1, 'SignEasy | Sign and Fill PDF and other Documents': 1, 'Genius Scan - PDF Scanner': 1, 'Tiny Scanner - PDF Scanner App': 1, 'Fast Scanner : Free PDF Scan': 1, 'Mobile Doc Scanner (MDScan) Lite': 1, 'TurboScan: scan documents and receipts in PDF': 1, 'Tiny Scanner Pro: PDF Doc Scan': 1, 'Docs To Go™ Free Office Suite': 1, 'OfficeSuite : Free Office + PDF Editor': 1,

For example, the function below reports 'ROBLOX' as the most duplicated app, with 8 extra entries (9 rows in total). Its second return value is the app that held the maximum before 'ROBLOX' was reached, 'Duolingo: Learn Languages Free', with 6 extra entries. Because the runner-up is only updated when a new maximum or a tie with it is found, it is not guaranteed to be the second-highest count overall:

In [9]:
def most_duplicated_apps(duplicate_hash):
    most_duplicated = [False, 0]
    second_most_duplicated = []
    for app in duplicate_hash:
        if duplicate_hash[app] > most_duplicated[1]:
            # New maximum: the previous maximum becomes the runner-up.
            second_most_duplicated = [most_duplicated[0], most_duplicated[1]]
            most_duplicated = [app, duplicate_hash[app]]
        elif duplicate_hash[app] == most_duplicated[1]:
            # Tie with the current maximum: record it as the runner-up.
            second_most_duplicated = [app, duplicate_hash[app]]
    return most_duplicated, second_most_duplicated

android_most_duplicated = most_duplicated_apps(android_duplicate_hash)

most_duplicated = android_most_duplicated[0]
second_most_duplicated = android_most_duplicated[1]

print(most_duplicated)
print(second_most_duplicated)

['ROBLOX', 8]
['Duolingo: Learn Languages Free', 6]


The full sets of rows for those two apps are extracted as follows:

In [10]:
most_dup_full_set = []
sec_most_dup_full_set = []

for app in android:
    if app[0] == most_duplicated[0]:
        most_dup_full_set.append(app)
    elif app[0] == second_most_duplicated[0]:
        sec_most_dup_full_set.append(app)

print(android_header)

for line in most_dup_full_set:
    print(line)
    
for line in sec_most_dup_full_set:
    print(line)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX

The above shows that the number of reviews (the fourth item, index 3) is the only value that varies meaningfully between the duplicate rows. It will therefore be used as the criterion for choosing among duplicates: only the entry with the most reviews will be kept in the data set, and all others will be deleted. As a first step, the duplicate rows are collected and grouped by app name as follows:

In [None]:
def sorting_unsorted_duplicates_array(duplicates_unsorted):
    # Group rows that share an app name (column 0) into one list per app,
    # preserving the order in which each name is first seen.
    dups_sorted = {}
    for dup in duplicates_unsorted:
        dups_sorted.setdefault(dup[0], []).append(dup)
    return list(dups_sorted.values())

def sorted_duplicates(verified_apps_hash, data_set):
    # Collect every row whose app name is a known duplicate, then group them.
    duplicates_unsorted = []
    for row in data_set:
        if row[0] in verified_apps_hash:
            duplicates_unsorted.append(row)
    return sorting_unsorted_duplicates_array(duplicates_unsorted)


sorted_android_duplicates = sorted_duplicates(android_duplicate_hash, android)


print(sorted_android_duplicates)


In [None]:
def remove_highest_rating(all_duplicates_sorted):
    # For each group of duplicates, set aside the row with the most reviews
    # (column 3, compared as an integer) and return the remaining rows,
    # i.e. the ones that should be deleted from the data set.
    remaining = []
    for dups in all_duplicates_sorted:
        highest_rating = max(dups, key=lambda app: int(app[3]))
        remaining.append([app for app in dups if app is not highest_rating])
    return remaining

sorted_and_dups_wo_highest_rating = remove_highest_rating(sorted_android_duplicates)
print(sorted_and_dups_wo_highest_rating)
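
The criterion described in this section (for each app name, keep only the row with the most reviews) can also be applied in a single pass with a dictionary. This is a sketch under the assumption that the name is in column 0 and the review count is a numeric string in column 3; `keep_most_reviewed` is a hypothetical name and the rows are shortened, made-up examples.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Return one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = int(row[reviews_idx])
        # Keep the first row seen for a name, then replace it only when a
        # later row has strictly more reviews.
        if name not in best or reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


rows = [
    ['ROBLOX', 'GAME', '4.5', '4447388'],
    ['Chess', 'GAME', '4.2', '950'],
    ['ROBLOX', 'GAME', '4.5', '4449910'],
    ['ROBLOX', 'GAME', '4.5', '4447346'],
]

clean = keep_most_reviewed(rows)
for row in clean:
    print(row)
```

Converting the review counts with `int()` matters: compared as strings, `'967' > '87510'` is `True`, because string comparison goes character by character.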