# Analysis of apps from Google Play and App Store

Aim: practice newly learned Python skills by analyzing app data from Google Play and the App Store.

## Open and Explore the data

Opening datasets:

In [1]:
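from csv import reader

def open_dataset(file_name):
    # Read the whole CSV file into a list of rows (each row is a list of strings).
    # The with-block closes the file once reading is done.
    with open(file_name, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))

ios = open_dataset('AppleStore.csv')
android = open_dataset('googleplaystore.csv')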

Exploring datasets:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Number of rows and columns for App Store:

In [4]:
explore_data(ios, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


Number of rows and columns for Google Play:

In [5]:
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The following datasets were used:
- [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
- [Google Play](https://www.kaggle.com/lava18/google-play-store-apps/home)

## Cleaning Data

### Deleting wrong data

In [14]:
print(android[0])
print('\n')
print(android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


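As we can see, row 10473 has only 12 values instead of 13: the `Category` value is missing, so every later value is shifted one column to the left and the `Rating` column holds 19. The maximum rating on Google Play is 5, so this row is wrong and we delete it.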
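
A shifted row like this one can also be found by comparing each row's length with the header's length. The following is a minimal sketch on toy data; `find_malformed_rows` and `sample` are hypothetical names that are not part of the notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected width
    return [
        (index, row)
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]

# Toy data: the second data row is missing its category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],
]

bad_rows = find_malformed_rows(sample)
print(bad_rows)
```

Applied to `android`, a check of this kind should report index 10473, since that row has 12 values against 13 header columns.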

In [15]:
print(len(android))
del android[10473]
print(len(android))

10842
10841


### Remove duplicates

Search for duplicates in the App Store (rows are compared by `id`, the first column):

In [16]:
duplicate_apps_ios = []
unique_apps_ios = []

for app in ios:
    name = app[0]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)
        
print('Number of duplicate apps in App Store:', len(duplicate_apps_ios))
print('\n')
print('Number of unique apps in App Store:', len(unique_apps_ios))
print('\n')
print('Example of dups:', duplicate_apps_ios[:8])

Number of duplicate apps in App Store: 0


Number of unique apps in App Store: 7198


Example of dups: []


Search for duplicates in Google Play (rows are compared by app name, the first column):

In [19]:
duplicate_apps_android = []
unique_apps_android = []

for app in android:
    name = app[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)
        
print('Number of duplicate apps in Google Play:', len(duplicate_apps_android))
print('\n')
print('Number of unique apps in Google Play:', len(unique_apps_android[1:]))
print('\n')
print('Example of dups:', duplicate_apps_android[:20])

Number of duplicate apps in Google Play: 1181


Number of unique apps in Google Play: 9659


Example of dups: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


Remove duplicates from the Google Play dataset.

First, create a dictionary in which each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app.

In [25]:
reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Actual length:', len(reviews_max))

Actual length: 9659


Removing the duplicate rows:

In [57]:
android_clean = [] #store clean data
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print('Length of clean Google Play dataset:', len(android_clean))
explore_data(android_clean, 0, 3, True)

Length of clean Google Play dataset: 9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
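
The same "keep the entry with the most reviews" rule can be tried on a toy dataset. This is only a sketch of a single-pass alternative, not the method used in the cells above; `keep_most_reviewed` and `toy_apps` are hypothetical names. It relies on dictionaries preserving insertion order (Python 3.7+), so the result is ordered by the first appearance of each name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows in the form [name, reviews]; 'Box' appears twice.
toy_apps = [
    ['Box', '10'],
    ['Slack', '5'],
    ['Box', '12'],
]

deduplicated = keep_most_reviewed(toy_apps, 0, 1)
print(deduplicated)
```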


Detect and remove non-English apps:

In [47]:
# Treat a name as English unless it contains more than three non-ASCII characters (code point > 127)
def is_english(string_name):
    non_ascii = 0
    
    for character in string_name:
        if ord(character) > 127:
            non_ascii +=1
            
    if non_ascii > 3:
        return False
    else:
        return True

In [61]:
android_english_apps = []
ios_english_apps = []

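# Google Play: the app name is column 0.
# android_clean has no header row, so the [1:] slice skips its first app.
for app in android_clean[1:]:
    name = app[0]
    if is_english(name):
        android_english_apps.append(app)
        
# App Store: ios[1:] skips the header row.
# Column 0 is the numeric id (track_name is column 1), so the check is applied to the id string.
for app in ios[1:]:
    name = app[0]
    if is_english(name):
        ios_english_apps.append(app)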
        
print('Google Play apps:')
explore_data(android_english_apps, 0, 3, True)
print('\n')
print('App Store apps:')
explore_data(ios_english_apps, 0, 3, True)

Google Play apps:
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9613
Number of columns: 13


App Store apps:
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579',

We only need free apps, so we keep the rows whose price is zero:

In [69]:
android_free_apps = []
ios_free_apps = []

for app in android_english_apps:
    price = app[7]
    if price == '0':
        android_free_apps.append(app)
        
for app in ios_english_apps:
    price = app[4]
    if price == '0.0':
        ios_free_apps.append(app)
        
print(len(android_free_apps))
print(len(ios_free_apps))

8863
4056


## Analyze datasets

In [74]:
# Build a frequency table for one column, with values expressed as percentages
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

# Turn the frequency table into a list of tuples and print it sorted by percentage, highest first
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Apps in App Store by genre in percents', '\n')
display_table(ios_free_apps, 11)
print('\n')
print('Apps in Google Play by genre in percents', '\n')
display_table(android_free_apps, 9)
print('\n')
print('Apps in Google Play by category in percents', '\n')
display_table(android_free_apps, 1)

Apps in App Store by genre in percents 

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


Apps in Google Play by genre in percents 

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.700778517432

Calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).
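
The three steps above can also be sketched as a single pass that accumulates a sum and a count per genre. This is a minimal illustration on toy rows; `average_by_group` and `toy_rows` are hypothetical names that are not part of the notebook.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} computed in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        # Step 1 and 2: isolate by group and sum the values.
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    # Step 3: divide each sum by the number of rows in that group only.
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows in the form [genre, rating_count].
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

averages = average_by_group(toy_rows, 0, 1)
print(averages)
```

With the notebook's data, a call such as `average_by_group(ios_free_apps, 11, 5)` would be expected to give the same per-genre averages as the nested loop, while reading the dataset only once.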

In [76]:
#generating a frequency table for the prime_genre column
ios_genre = freq_table(ios_free_apps, 11)

for genre in ios_genre:
    total = 0
    len_genre = 0
    for app in ios_free_apps:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    
    print(genre, ':', avg_n_ratings)

Entertainment : 10822.961077844311
Book : 8498.333333333334
Finance : 13522.261904761905
Reference : 67447.9
Education : 6266.333333333333
Business : 6367.8
Weather : 47220.93548387097
Medical : 459.75
Lifestyle : 8978.308510638299
Music : 56482.02985074627
Catalogs : 1779.5555555555557
Games : 18924.68896765618
Productivity : 19053.887096774193
Social Networking : 53078.195804195806
Health & Fitness : 19952.315789473683
Food & Drink : 20179.093023255813
News : 15892.724137931034
Travel : 20216.01785714286
Photo & Video : 27249.892215568863
Sports : 20128.974683544304
Shopping : 18746.677685950413
Navigation : 25972.05
Utilities : 14010.100917431193


Most popular apps by category on Google Play, measured by the average number of installs:

In [77]:
android_category = freq_table(android_free_apps, 1)

for category in android_category:
    total = 0
    len_category = 0
    for app in android_free_apps:
        category_app = app[1]
        if category_app == category:            
            # Installs are strings such as '10,000+': strip ',' and '+' before converting
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
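
Install counts in the Google Play data are stored as strings such as `'10,000+'`, so they have to be cleaned before any arithmetic. The conversion can be isolated in a small helper, sketched below; `parse_installs` is a hypothetical name that is not part of the notebook.

```python
def parse_installs(raw):
    """Convert an install string such as '10,000+' to a float."""
    # Remove the thousands separators and the trailing plus sign.
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)

small = parse_installs('10,000+')
large = parse_installs('50,000,000+')
print(small, large)
```

Keeping the parsing in one place makes it easier to test separately from the averaging loop.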