# AppStore and Google Play store analysis

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

Our aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store.

Data sources:

* [A dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* [A dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Setup

In [1]:
# Imports
from csv import reader

In [2]:
def open_file(filename, encoding="utf-8"):
    # The context manager closes the file even if reading fails;
    # newline='' is the setting the csv module expects for text files.
    with open(filename, encoding=encoding, newline='') as opened_file:
        apps_data = list(reader(opened_file))

    return apps_data

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # leaves two blank lines after each row ('\n' plus print's own newline)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
app_store_data = open_file('data/AppleStore.csv')
google_play_store_data = open_file('data/googleplaystore.csv')

app_store_data_header = app_store_data[0]
app_store_data_content = app_store_data[1:]
google_play_store_data_header = google_play_store_data[0]
google_play_store_data_content = google_play_store_data[1:]

In [4]:
explore_data(app_store_data_content, 0, 5, rows_and_columns=True)
explore_data(google_play_store_data_content, 0, 5, rows_and_columns=True)

print("AppStore data header:\n", app_store_data_header)
print("GooglePlay data header:\n", google_play_store_data_header)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '

Columns that seem to be helpful for our future analysis are:

* for AppStore dataset: 'price', 'rating_count_tot', 'prime_genre'
* for GooglePlay dataset: 'Category', 'Installs'
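
The rest of this analysis refers to these columns by hard-coded position. As a minimal sketch, the positions could instead be looked up by name from each header row. The header lists below are assumptions written out from the public datasets (the header output is cut off in this export), and `column_index` is a hypothetical helper that is not part of the original notebook.

```python
# Assumed header rows; check them against app_store_data_header and
# google_play_store_data_header before relying on these positions.
APP_STORE_HEADER = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                    'rating_count_tot', 'rating_count_ver', 'user_rating',
                    'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                    'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
GOOGLE_PLAY_HEADER = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                      'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                      'Current Ver', 'Android Ver']


def column_index(header, column_name):
    """Return the position of column_name in header (hypothetical helper)."""
    try:
        return header.index(column_name)
    except ValueError:
        raise KeyError(f"Column {column_name!r} not found in header") from None


print(column_index(APP_STORE_HEADER, 'prime_genre'))
print(column_index(GOOGLE_PLAY_HEADER, 'Installs'))
```

Looking columns up by name makes a call such as `freq_table(app_store_cleaned, 11)` easier to audit, because the reader no longer has to count fields in a printed row.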

## Data cleaning

We only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

* Remove non-English apps
* Remove apps that aren't free

### Remove the malformed row with a missing Category value

In [5]:
# Inspect the rows around index 10472 of the Google Play data
explore_data(google_play_store_data_content, 10470, 10474, rows_and_columns=True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [6]:
# This row has only 12 fields: the 'Category' value is missing, so every later value
# is shifted one column to the left. We need to get rid of it.
del google_play_store_data_content[10472]

In [7]:
explore_data(google_play_store_data_content, 10470, 10474, rows_and_columns=True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


Number of rows: 10840
Number of columns: 13
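
The malformed row above was removed by its known index. As a rough sketch of a more general check, rows can be flagged whenever their field count differs from the header's. `find_malformed_rows` and the sample data are illustrative assumptions, not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny illustrative sample, not the real dataset
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category missing
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

When deleting several flagged rows from a list, it is safer to delete from the highest index down so that earlier deletions do not shift the remaining indices.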


### Remove duplicate apps

For each app name that appears more than once, keep only the row with the highest number of reviews and remove the other entries.

In [8]:
def remove_duplicate_names(dataset, name_column_idx, n_reviews_column_idx):
    """Keep one row per app name: the row with the highest review count."""
    print("Dataset length:", len(dataset))
    duplicate_apps = []
    unique_apps = set()  # sets give fast membership checks

    for app in dataset:
        name = app[name_column_idx]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)

    print("Unique apps:", len(unique_apps))
    print("Duplicate apps:", len(duplicate_apps))

    # Highest review count seen for each app name
    reviews_max = {}

    for app in dataset:
        name = app[name_column_idx]
        n_reviews = float(app[n_reviews_column_idx])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    assert len(unique_apps) == len(reviews_max)

    # Keep the first row that reaches the maximum; the already_added check
    # handles duplicates that share the same maximum review count
    data_no_duplicates = []
    already_added = set()

    for app in dataset:
        name = app[name_column_idx]
        n_reviews = float(app[n_reviews_column_idx])
        if n_reviews == reviews_max[name] and name not in already_added:
            data_no_duplicates.append(app)
            already_added.add(name)

    assert len(unique_apps) == len(data_no_duplicates)

    return data_no_duplicates

In [9]:
app_store_data_no_duplicates = remove_duplicate_names(app_store_data_content, name_column_idx=1, n_reviews_column_idx=5)

Dataset length: 7197
Unique apps: 7195
Duplicate apps: 2


In [10]:
google_play_data_no_duplicates = remove_duplicate_names(google_play_store_data_content, name_column_idx=0, n_reviews_column_idx=3)

Dataset length: 10840
Unique apps: 9659
Duplicate apps: 1181
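
The same rule (keep the most-reviewed row for each app name) can also be expressed in a single pass with a dictionary. This is only a sketch: `keep_most_reviewed` is a hypothetical name, and the sample rows are made up for illustration.

```python
def keep_most_reviewed(rows, name_idx, reviews_idx):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: [name, review count]
sample = [
    ['Instagram', '100'],
    ['Facebook', '50'],
    ['Instagram', '300'],
    ['Instagram', '300'],
]
print(keep_most_reviewed(sample, name_idx=0, reviews_idx=1))
# [['Instagram', '300'], ['Facebook', '50']]
```

One difference from `remove_duplicate_names`: this version returns rows in the order each name first appears, not in the order of the rows that are kept. On ties, both keep the first row that reaches the maximum.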


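### Remove non-English apps

We treat an app name as English if it contains at most three characters outside the ASCII range (code points above 127). The tolerance keeps English names that include an emoji or a symbol such as ™.
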
In [11]:
def is_english_app_name(app_name):
    non_english_char_count = 0
    for char in app_name:
        if ord(char) > 127:
            non_english_char_count += 1
    if non_english_char_count > 3:
        return False
    return True

In [12]:
print(is_english_app_name('Instagram'))
print(is_english_app_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_app_name('Docs To Go™ Free Office Suite'))
print(is_english_app_name('Instachat 😜'))

True
False
True
True
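
To see why some tolerance for non-ASCII characters is needed, the sketch below counts them for the test names used above and compares a strict threshold of 0 with the threshold of 3. `count_non_ascii` and `is_english_name` are hypothetical variants written for this illustration, not functions from the notebook.

```python
def count_non_ascii(text):
    """Number of characters outside the ASCII range (code point > 127)."""
    return sum(1 for char in text if ord(char) > 127)


def is_english_name(name, max_non_ascii=3):
    """Hypothetical variant of is_english_app_name with a tunable threshold."""
    return count_non_ascii(name) <= max_non_ascii


names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
for name in names:
    # Columns: name, non-ASCII count, result with threshold 0, result with threshold 3
    print(name, count_non_ascii(name),
          is_english_name(name, max_non_ascii=0),
          is_english_name(name))
```

With a threshold of 0, the last two names would be rejected even though they are English. The heuristic is still imperfect: a non-English name written in Latin letters passes, and an English name with more than three emoji is dropped.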


In [13]:
def remove_non_english_apps(dataset, name_column_idx):
    dataset_english_names = []
    for app in dataset:
        app_name = app[name_column_idx]
        if is_english_app_name(app_name):
            dataset_english_names.append(app)
    return dataset_english_names

In [14]:
# Remove non-English apps from AppStore data
app_store_data_eng = remove_non_english_apps(app_store_data_no_duplicates, name_column_idx=1)
print(len(app_store_data_eng))

6181


In [15]:
# Remove non-English apps from Google Play data
google_play_data_eng = remove_non_english_apps(google_play_data_no_duplicates, name_column_idx=0)
print(len(google_play_data_eng))

9614


### Isolate free apps

In [16]:
def filter_free_apps(dataset, price_column_idx):
    dataset_free_apps = []
    for app in dataset:
        try:
            app_price = float(app[price_column_idx])
        except ValueError:
            app_price = float(app[price_column_idx][1:])
        if app_price == 0.0:
            dataset_free_apps.append(app)
    return dataset_free_apps

In [17]:
# Keep only free apps from the AppStore data
app_store_cleaned = filter_free_apps(app_store_data_eng, price_column_idx=4)
print(len(app_store_cleaned))

3220


In [18]:
# Keep only free apps from the Google Play data
google_play_cleaned = filter_free_apps(google_play_data_eng, price_column_idx=7)
print(len(google_play_cleaned))

8864


## Market analysis

### Determine the most common genres for each market

Generate frequency tables for app genres. We'll need to build a frequency table for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.

In [19]:
def freq_table(dataset, index):
    freq_table_dict = {}
    for row in dataset:
        value = row[index]
        if value in freq_table_dict:
            freq_table_dict[value] += 1
        else:
            freq_table_dict[value] = 1
    return freq_table_dict


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [20]:
# Frequency table for prime_genre from AppStore data
appstore_prime_genre_freq = freq_table(app_store_cleaned, 11)
display_table(app_store_cleaned, 11)

Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


In [21]:
# Frequency table for Genres from Google Play data
googleplay_genres_freq = freq_table(google_play_cleaned, 9)
display_table(google_play_cleaned, 9)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [22]:
# Frequency table for Category from Google Play data
googleplay_category_freq = freq_table(google_play_cleaned, 1)
display_table(google_play_cleaned, 1)

FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53
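
The tables above show raw counts, which are hard to compare across two stores of different sizes. As a sketch, the same tables can be expressed as percentages with `collections.Counter` from the standard library. `freq_table_percent` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up sample rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, share in freq_table_percent(sample, 1).items():
    print(f"{genre}: {share:.1f}%")
```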


### Find average number of app users by genre

In [23]:
def avg_users_by_genre(dataset, index_genre, index_installs, genre_freq_table):
    users_by_genre_dict = {}
    for row in dataset:
        app_genre = row[index_genre]
        try:
            app_installs = int(row[index_installs])
        except ValueError:
            app_installs = int(row[index_installs].replace('+', '').replace(',', ''))
        if app_genre in users_by_genre_dict:
            users_by_genre_dict[app_genre] += app_installs
        else:
            users_by_genre_dict[app_genre] = app_installs
    # Take the average using freq table
    for item in users_by_genre_dict:
        users_by_genre_dict[item] /= genre_freq_table[item]
    
    # Sort and ready for printing    
    table_display = []
    for key in users_by_genre_dict:
        key_val_as_tuple = (users_by_genre_dict[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

    return users_by_genre_dict

In [24]:
# Index 5 is the total rating count, used here as a proxy for the number of users
appstore_avg_users_by_genre = avg_users_by_genre(app_store_cleaned, 11, 5, appstore_prime_genre_freq)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22812.92467948718
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


In [25]:
# Index 3 is the 'Reviews' column, so these averages are review counts used as a proxy for users
googleplay_avg_users_by_genres = avg_users_by_genre(google_play_cleaned, 9, 3, googleplay_genres_freq)

Adventure;Action & Adventure : 1513269.0
Strategy : 1251841.8148148148
Communication : 995608.4634146341
Social : 965830.9872881356
Casual;Action & Adventure : 942726.4166666666
Card;Action & Adventure : 920571.0
Casual : 837706.0064102564
Sports;Action & Adventure : 730014.0
Arcade : 708829.8353658536
Racing : 597997.1590909091
Action : 544975.6218181818
Puzzle;Action & Adventure : 533895.3333333334
Video Players & Editors : 428992.6050955414
Photography : 404081.3754789272
Tools;Education : 342336.0
Role Playing;Action & Adventure : 322308.3333333333
Tools : 305684.02803738316
Adventure : 302214.93333333335
Adventure;Education : 288606.0
Role Playing : 249256.8313253012
Education;Education : 234564.86666666667
Word : 228272.04347826086
Educational;Action & Adventure : 225802.33333333334
Shopping : 223887.34673366835
Music : 216456.44444444444
Puzzle : 215662.56
Educational;Pretend Play : 214720.125
Sports : 213438.38436482084
Racing;Action & Adventure : 203194.86666666667
Trivia : 19

In [26]:
# Index 3 is the 'Reviews' column, so these averages are review counts used as a proxy for users
googleplay_avg_users_by_category = avg_users_by_genre(google_play_cleaned, 1, 3, googleplay_category_freq)

COMMUNICATION : 995608.4634146341
SOCIAL : 965830.9872881356
GAME : 683523.8445475638
VIDEO_PLAYERS : 425350.08176100627
PHOTOGRAPHY : 404081.3754789272
TOOLS : 305732.8973333333
ENTERTAINMENT : 301752.24705882353
SHOPPING : 223887.34673366835
PERSONALIZATION : 181122.31632653062
WEATHER : 171250.77464788733
PRODUCTIVITY : 160634.5420289855
MAPS_AND_NAVIGATION : 142860.0483870968
TRAVEL_AND_LOCAL : 129484.42512077295
SPORTS : 116938.6146179402
FAMILY : 113142.99821002387
NEWS_AND_MAGAZINES : 93088.03225806452
BOOKS_AND_REFERENCE : 87995.06842105264
HEALTH_AND_FITNESS : 78094.9706959707
FOOD_AND_DRINK : 57478.79090909091
EDUCATION : 56293.09708737864
COMICS : 42585.61818181818
FINANCE : 38535.8993902439
LIFESTYLE : 33921.82369942196
HOUSE_AND_HOME : 26435.465753424658
ART_AND_DESIGN : 24699.42105263158
BUSINESS : 24239.727272727272
DATING : 21953.272727272728
PARENTING : 16378.706896551725
AUTO_AND_VEHICLES : 14140.280487804877
LIBRARIES_AND_DEMO : 10925.807228915663
BEAUTY : 7476.22641

### Analyze average installs per category on Google Play

In [27]:
googleplay_avg_installs_by_category = avg_users_by_genre(google_play_cleaned, 1, 5, googleplay_category_freq)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
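
The Google Play install values are strings such as `'10,000+'`, so they must be cleaned before averaging. The sketch below separates that parsing step from the averaging step and does not need a precomputed frequency table. `parse_installs` and `average_by_group` are hypothetical helpers, and the sample rows are made up.

```python
from collections import defaultdict


def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))


def average_by_group(rows, group_idx, value_idx, parse=float):
    """Average the parsed value column for each distinct group value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += parse(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: [category, installs]
sample = [
    ['TOOLS', '10,000+'],
    ['TOOLS', '1,000,000+'],
    ['BEAUTY', '500+'],
]
print(average_by_group(sample, 0, 1, parse=parse_installs))
```

Because each install value is the lower bound of a bucket (`'10,000+'` means at least 10,000), these averages are rough estimates, not exact user counts.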

## Analyzing the results and conclusions

### AppStore

Games are by far the most common type of app in the AppStore: 1,872 of the 3,220 free English apps. So it's a competitive field, but an average Games app doesn't attract that many users (measured here by rating counts). Navigation and Reference are the opposite: many users per app, but few apps. So it may be a good idea to try to enter that market.
Social Networking seems to be the most balanced choice: a competitive market with a relatively high number of users per app.
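
The "few apps, many users" reasoning can be made explicit. The sketch below copies the AppStore counts and average rating counts from the tables above (averages rounded to two decimals) and lists the genres with fewer apps than the median genre but more ratings per app than the median genre. Using medians as cut-offs is an assumption made for this illustration, not a rule from the analysis.

```python
from statistics import median

# genre: (number of free English apps, average rating count per app)
app_store_genres = {
    'Games': (1872, 22812.92), 'Entertainment': (254, 14029.83),
    'Photo & Video': (160, 28441.54), 'Education': (118, 7003.98),
    'Social Networking': (106, 71548.35), 'Shopping': (84, 26919.69),
    'Utilities': (81, 18684.46), 'Sports': (69, 23008.90),
    'Music': (66, 57326.53), 'Health & Fitness': (65, 23298.02),
    'Productivity': (56, 21028.41), 'Lifestyle': (51, 16485.76),
    'News': (43, 21248.02), 'Travel': (40, 28243.80),
    'Finance': (36, 31467.94), 'Weather': (28, 52279.89),
    'Food & Drink': (26, 33333.92), 'Reference': (18, 74942.11),
    'Business': (17, 7491.12), 'Book': (14, 39758.50),
    'Navigation': (6, 86090.33), 'Medical': (6, 612.00),
    'Catalogs': (4, 4004.00),
}

median_count = median(count for count, _ in app_store_genres.values())
median_avg = median(avg for _, avg in app_store_genres.values())

# Genres with less competition and more ratings per app than the median genre
niche_genres = [genre for genre, (count, avg) in app_store_genres.items()
                if count < median_count and avg > median_avg]
print(niche_genres)
```

Averages for genres with very few apps (Navigation has only 6) can be dominated by one or two very popular apps, so a shortlist like this is a starting point for a closer look, not a final answer.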

### Google Play genres

Here the genres with the most apps barely overlap with the genres whose apps are the most popular. So I'd recommend that we rely on the average popularity per app and pick genres in 5th to 7th place (Casual or Sports here).

### Google Play categories

Communication, social, and game apps have a lot of installs and also look attractive from the market perspective. But these markets may be too big for us, so let's choose some less popular categories: photography, entertainment, and shopping.
