## App Store analysis

The goal of this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

First, I created a helper function to explore the data in an easy way.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In this cell I'm reading the datasets.

In [2]:
from csv import reader

apple_store_file_name = "AppleStore.csv"
google_play_store_file = "googleplaystore.csv"


def read_dataset_with(file_name):
    file = open(file_name)
    dataset = list(reader(file))

    return dataset


apple_store_dataset = read_dataset_with(apple_store_file_name)
google_store_dataset = read_dataset_with(google_play_store_file)

explore_data(apple_store_dataset, 0, 2, True)
explore_data(google_store_dataset, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In this cell, I will be removing corrupt rows from each dataset. A corrupt row is a row that has a different number of columns than the header row.

In [3]:
def remove_corrupt_rows_from(raw_dataset):
    number_of_columns = len(raw_dataset[0])
    clean_dataset = [raw_dataset[0]]

    for row in raw_dataset[1:]:
        if len(row) == number_of_columns:
            clean_dataset.append(row)

    return clean_dataset


apple_store_dataset = remove_corrupt_rows_from(apple_store_dataset)
google_store_dataset = remove_corrupt_rows_from(google_store_dataset)

explore_data(apple_store_dataset, 0, 2, True)
explore_data(google_store_dataset, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Next, I'm going to remove duplicated apps. I determine duplication as two apps having the same name. If it happens, I'll select the instance with the biggest number of ratings - because it suggests it's the newest piece of data about this app.

In [4]:
def remove_duplicated_apps_from(raw_dataset, name_column_index, rating_count_column_index):
    deduped_apps = {}
    header = raw_dataset[0]

    for app in raw_dataset[1:]:
        app_name = app[name_column_index]
        rating_count = app[rating_count_column_index]

        if app_name not in deduped_apps or rating_count > deduped_apps[app_name][rating_count_column_index]:
            deduped_apps[app_name] = app

    return [header] + list(deduped_apps.values())


apple_store_dataset = remove_duplicated_apps_from(apple_store_dataset, 1, 5)
google_store_dataset = remove_duplicated_apps_from(google_store_dataset, 0, 3)

explore_data(apple_store_dataset, 0, 2, True)
explore_data(google_store_dataset, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7196
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9660
Number of columns: 13


The next step is to remove apps that are not English. It's not a smart language model - I'm just checking for the characters from the English aplhabet, punctuation and up to 3 other characters.

In [5]:
import re


def remove_non_english_apps(raw_dataset, name_column_index):
    english_apps = [raw_dataset[0]]


    for app in raw_dataset[1:]:
        app_name = app[name_column_index]

        non_ascii_pattern = re.compile(r'[^\x00-\x7F]')
        non_ascii_chars = non_ascii_pattern.findall(app_name)

        if len(non_ascii_chars) <= 3:
            english_apps.append(app)

    return english_apps


apple_store_dataset = remove_non_english_apps(apple_store_dataset, 1)
google_store_dataset = remove_non_english_apps(google_store_dataset, 0)

explore_data(apple_store_dataset, 0, 2, True)
explore_data(google_store_dataset, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6182
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9615
Number of columns: 13


Since our interest is only in the free apps, I will remove the paid ones.

In [6]:
def remove_paid_apps(raw_dataset, price_column_index):
    free_apps = [raw_dataset[0]]

    for app in raw_dataset[1:]:
        try:
            price = float(app[price_column_index])
        except ValueError:
            price = float(app[price_column_index].replace("$", ""))

        if price == 0.0:
            free_apps.append(app)

    return free_apps

apple_store_dataset = remove_paid_apps(apple_store_dataset, 4)
google_store_dataset = remove_paid_apps(google_store_dataset, 7)

explore_data(apple_store_dataset, 0, 2, True)
explore_data(google_store_dataset, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3221
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8863
Number of columns: 13


Now I want to display a frequency table that shows the share of each type of app genre within the remaining apps. 

In [7]:
def create_percentage_frequency_table_for(raw_dataset, column_index):
    numerical_frequencies = {}
    percentage_frequencies = {}
    dataset = raw_dataset[1:]
    number_of_apps = len(dataset)

    for app in dataset:
        frequency_field = app[column_index]
        if frequency_field in numerical_frequencies:
            numerical_frequencies[frequency_field] += 1
        else:
            numerical_frequencies[frequency_field] = 1

    for frequency_field, frequency in numerical_frequencies.items():
        percentage_frequencies[frequency_field] = round(frequency / number_of_apps * 100, 2)


    return percentage_frequencies

apple_store_frequency_table = create_percentage_frequency_table_for(apple_store_dataset, 11)
google_store_category_frequency_table = create_percentage_frequency_table_for(google_store_dataset, 1)
google_store_genres_frequency_table = create_percentage_frequency_table_for(google_store_dataset, 9)

In [8]:
def display_table(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Apple store - prime_genre column


In [9]:
display_table(apple_store_frequency_table)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


**What is the most common genre? What is the next most common?**\
The vast majority (58%) of the free, English apps are Games, the next category (8%) is Entertainment.

**What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?**\
Applications designed for fun (games, entertainment, socials, photos) are the most popular ones. Much less applications have any practical pruposes.

**Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?**\
The fact that there are many apps available for a purpose doesn't imply that people are actually using them.


### Google store - Category Column

In [10]:
display_table(google_store_category_frequency_table)

FAMILY : 18.93
GAME : 9.69
TOOLS : 8.45
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.52
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.17
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


### Google store - Genres Column
**What is the most common genre? What is the next most common?**\
The majority (98%) of the free, English apps are designed for Family, the next category (10%) is Games.

**What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?**\
Practical apps like Family, Tools, Business category are more common than fun ones like Games.

In [11]:
display_table(google_store_genres_frequency_table)

Tools : 8.44
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.52
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.95
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.75
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.44
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Educational : 0.37
Board : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Puzzle;Brain Games : 0.18
Racing;Action & Adventure : 0.17
Entertainment;Music & Video : 0.17
Casual;

**What is the most common genre? What is the next most common?**\
The majority (8%) of the free, English apps are Tools, the next category (6%) is Entertainment.

**What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?**\
Practical apps like Tools, Education, Business category are more common than fun ones like Entertainment, Lifestyle.

#### Comparison & conclusions

The comparison of the number of free, English apps on Apple store and Google store show that most apps in Apple store are designed to have fun, while on Goole they are practical. However, it's not possible to determine which profile we should choose for our app based on this data. It only indicates what apps are available, not which ones are the most used. 

Now it's time to analyze how many people actually use the apps from each genre. For Google store there is an `Installs` column available, but Apple store dataset doesn't have that. To be consistent, I'll use the column with the number of ratings for both stores.

### Apple store - average of reviews per genre

In [12]:
def average_reviews_per_app_in_genre(raw_dataset, genre_column_index, reviews_column_index):
    reviews_per_genre = {}
    average_reviews_per_genre = {}

    for app in raw_dataset[1:]:
        genre = app[genre_column_index]
        reviews = int(app[reviews_column_index])

        if genre in reviews_per_genre:
            reviews_per_genre[genre]["sum_reviews"] += reviews
            reviews_per_genre[genre]["count_reviews"] += 1
        else:
            reviews_per_genre[genre] = {"sum_reviews": reviews, "count_reviews": 1}

    for genre, reviews in reviews_per_genre.items():
        average_reviews_per_genre[genre] = reviews["sum_reviews"] // reviews["count_reviews"]
    
    return average_reviews_per_genre

In [13]:
apple_average_reviews_per_genre = average_reviews_per_app_in_genre(apple_store_dataset, 11, 5)
display_table(apple_average_reviews_per_genre)

Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57326
Weather : 52279
Book : 39758
Food & Drink : 33333
Finance : 31467
Photo & Video : 28441
Travel : 28243
Shopping : 26919
Health & Fitness : 23298
Sports : 23008
Games : 22812
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16485
Entertainment : 14029
Business : 7491
Education : 7003
Catalogs : 4004
Medical : 612


We can see that the most reviewed app categories are: Navigation, Reference and Social Networking. Let's see what's inside of those categories.

In [14]:
def apple_show_apps_in_genre(genre):
    print(f"{genre} apps: ")
    for app in apple_store_dataset:
        app_genre = app[1]
        if app_genre == genre:
            print(app[0], ":", app[5])

apple_show_apps_in_genre("Navigation")
apple_show_apps_in_genre("Reference")
apple_show_apps_in_genre("Social Networking")

Navigation apps: 
Reference apps: 
Social Networking apps: 


All of the most popular categories contain some apps that have significantly more reviews than others. In the `Navigation` category, we can see that `Geocaching` could be a potential niche to fill, however it seems like it might be a high effort to start such an app. In the `Reference` category, I can see a niche in the religion related applications. The Social Networking seems to be already saturated with popular apps, so it might be difficult to get our app to the wider public.

### Google store - average of reviews per category

In [15]:
google_average_reviews_per_category = average_reviews_per_app_in_genre(google_store_dataset, 1, 3)
display_table(google_average_reviews_per_category)

COMMUNICATION : 995608
SOCIAL : 965830
GAME : 683839
VIDEO_PLAYERS : 425350
PHOTOGRAPHY : 403207
TOOLS : 306086
ENTERTAINMENT : 301752
SHOPPING : 223887
PERSONALIZATION : 181122
WEATHER : 171250
PRODUCTIVITY : 160634
MAPS_AND_NAVIGATION : 142860
TRAVEL_AND_LOCAL : 129484
SPORTS : 116938
FAMILY : 112996
NEWS_AND_MAGAZINES : 93088
BOOKS_AND_REFERENCE : 87995
HEALTH_AND_FITNESS : 78094
FOOD_AND_DRINK : 57478
EDUCATION : 55791
COMICS : 42585
FINANCE : 38535
LIFESTYLE : 33921
HOUSE_AND_HOME : 26435
ART_AND_DESIGN : 24699
BUSINESS : 24239
DATING : 21953
PARENTING : 16378
AUTO_AND_VEHICLES : 14140
LIBRARIES_AND_DEMO : 10925
BEAUTY : 7476
MEDICAL : 3727
EVENTS : 2555


In [16]:
def google_show_apps_in_genre(genre):
    print(f"{genre} apps: ")
    for app in google_store_dataset:
        app_genre = app[1]
        if app_genre == genre:
            print(app[0], ":", app[3])

google_show_apps_in_genre("COMMUNICATION")
google_show_apps_in_genre("SOCIAL")
google_show_apps_in_genre("GAME")

COMMUNICATION apps: 
Messenger – Text and Video Chat for Free : 56646578
WhatsApp Messenger : 69119316
Messenger for SMS : 125257
Google Chrome: Fast & Secure : 9643041
Messenger Lite: Free Calls & Messages : 1429038
Gmail : 4604483
Hangouts : 3419513
Viber Messenger : 11335481
My Tele2 : 158679
Firefox Browser fast & private : 3075118
Yahoo Mail – Stay Organized : 4188345
imo beta free calls and text : 659395
imo free video calls and chat : 4785988
Contacts : 66602
Call Free – Free Call : 30209
Web Browser & Explorer : 36901
Opera Mini - fast web browser : 5150801
Browser 4G : 192948
MegaFon Dashboard : 99559
ZenUI Dialer & Contacts : 437674
Cricket Visual Voicemail : 13698
Opera Browser: Fast and Secure : 2473795
TracFone My Account : 20769
Firefox Focus: The privacy browser : 36981
Google Voice : 171052
Chrome Dev : 63576
Xperia Link™ : 45487
TouchPal Keyboard - Fun Emoji & Android Keyboard : 615381
Who : 2451093
Skype Lite - Free Video Call & Chat : 33053
WeChat : 5387631
UC Browse