## Profitable Applications for Apple Store and Google Play Markets
---
- The idea behind this project is to analyze data from both stores to see which free apps and categories yield a larger user base
- Datasets used: [Google Play](https://www.kaggle.com/lava18/google-play-store-apps), [Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader
import itertools
from collections import OrderedDict

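# Read both CSV files fully into memory; the with-block closes the file handles afterwards.
with open('../dataset/AppleStore.csv', encoding='utf8') as apple_store_file, \
     open('../dataset/googleplaystore.csv', encoding='utf8') as google_play_file:
    apple_store_data, google_play_data = list(reader(apple_store_file)), list(reader(google_play_file))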

In [2]:
def explore_data(data, start, end, number_of_rows_and_columns=True):
    data_slice = data[start:end]
    
    if number_of_rows_and_columns: 
        print(f'Number of rows: {len(data)}\nNumber of columns: {len(data[0])}\n')
        
    for row in data_slice:
        print(f'{row}\n')

In [3]:
print(f'{apple_store_data[0][1:]}\n\n{google_play_data[0]}')

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [4]:
explore_data(data=apple_store_data[1:],start=0,end=5)
print('-'*80)
explore_data(data=google_play_data[1:], start=0,end=5)

Number of rows: 7197
Number of columns: 17

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']

['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']

--------------------------------------------------------------------------------
Number of rows: 10841
Number of columns: 13

['Photo Editor & Candy Camera & Grid & ScrapBook'

### Eliminate non-English applications, rows missing values, and apps that are not free 🔎
- Delete row [10473] because its category value is missing, which shifts the remaining fields one column to the left (the Rating column ends up holding 19), as pointed out [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)
- Check whether the Google Play dataset contains any duplicate entries
- The criterion chosen to pick among duplicates is the number of reviews: the entry with the most reviews was probably collected most recently, so it is the one kept
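
The "keep the entry with the most reviews" rule can also be sketched in a single pass. This is a minimal illustration rather than the notebook's implementation: the helper name `keep_most_reviewed` and the sample rows are made up, and it assumes the review count is a plain integer string.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_col]
        reviews = int(row[reviews_col])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or reviews > int(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


# Made-up rows for illustration only: [App, Category, Rating, Reviews]
sample_rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '2000'],
    ['App A', 'TOOLS', '4.1', '150'],
]
deduplicated = keep_most_reviewed(sample_rows)
print(deduplicated)
```

Storing the whole row in the dictionary avoids a second loop over the dataset. The result is ordered by each app's first appearance.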

In [5]:
print(google_play_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
if not google_play_data[10473][8]: del google_play_data[10473]

In [7]:
def eliminate_duplicates(dataset):
    unique_apps = {}
    number_of_duplicate_apps = 0
    
    for row in dataset[1:]:
        app_name = row[0]
        app_reviews = float(row[3]) if 'M' not in row[3] else float(row[3][:-1])
        
        if app_name in unique_apps:
            number_of_duplicate_apps += 1
            if app_reviews > unique_apps[app_name]: unique_apps[app_name] = app_reviews
        else:
            unique_apps[app_name] = app_reviews
            
    print(f'{dict(itertools.islice(unique_apps.items(), 4))}...\n\nThere are {number_of_duplicate_apps} duplicate app entries on this dataset')
    filtered_dataset = [dataset[0]]
    already_added = set()
    for row in dataset[1:]:
        app_name = row[0]
        app_reviews = float(row[3]) if 'M' not in row[3] else float(row[3][:-1])
        
        if unique_apps[app_name] == app_reviews and app_name not in already_added:
            filtered_dataset.append(row)
            already_added.add(app_name)
            
    print(f'\nOld dataset size: {len(dataset)}, new size: {len(filtered_dataset)}')
    
    return filtered_dataset

In [8]:
google_play_data = eliminate_duplicates(google_play_data)

{'Photo Editor & Candy Camera & Grid & ScrapBook': 159.0, 'Coloring book moana': 974.0, 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': 87510.0, 'Sketch - Draw & Paint': 215644.0}...

There are 1181 duplicate app entries on this dataset

Old dataset size: 10841, new size: 9660


### Checking whether or not an application is directed to English-speaking users
- Look over each application name and delete the row if it contains any character outside the ASCII range (this also removes English names containing emojis, ™, etc)

In [9]:
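def is_english_app(txt):
    # Reject the name as soon as any character falls outside the ASCII range.
    # ord() never returns a negative value, so only the upper bound needs checking.
    for letter in txt:
        if ord(letter) >= 127:
            return False
    return True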

In [10]:
def eliminate_non_english_apps(dataset, name_col, store):
    filtered_dataset = [dataset[0]]
    for row in dataset[1:]:
        app_name = row[name_col]
        if is_english_app(app_name): filtered_dataset.append(row)

    print(f'{len(dataset) - len(filtered_dataset)} phrases were eliminated after filtering non english applications from the {store} dataset')
                                                             
    return filtered_dataset

In [11]:
google_play_data = eliminate_non_english_apps(google_play_data, 0, 'Google Play')
apple_store_data = eliminate_non_english_apps(apple_store_data, 2, 'Apple Store')

542 phrases were eliminated after filtering non english applications from the Google Play dataset
1490 phrases were eliminated after filtering non english applications from the Apple Store dataset


In [12]:
def eliminate_paid_apps(dataset, price_col, free_indicator, store):
    filtered_dataset = [dataset[0]]
    for row in dataset[1:]:
        if row[price_col] == free_indicator:
            filtered_dataset.append(row)
    
    print(f'{len(dataset) - len(filtered_dataset)} phrases were eliminated after filtering paid applications from the {store} dataset')
    return filtered_dataset

In [13]:
google_play_data = eliminate_paid_apps(google_play_data, 6, 'Free', 'Google Play')
apple_store_data = eliminate_paid_apps(apple_store_data, 5, '0', 'Apple Store')

710 phrases were eliminated after filtering paid applications from the Google Play dataset
2785 phrases were eliminated after filtering paid applications from the Apple Store dataset


### Inspect the top application categories for each store

In [50]:
def get_frequency_list_percentages(dictionary, dataset_size):
    for key, value in dictionary.items():
        dictionary[key] = f'{round((value/dataset_size)*100, 2)}%'
    return dictionary

def get_categories_frequency(dataset, category_col, store, genre_col=None):
    categories_frequency_table = {}
    if genre_col: genres_frequency_table = {}
        
    for row in dataset[1:]:
        category = row[category_col].capitalize()
        if category in categories_frequency_table:
            categories_frequency_table[category] += 1
        else:
            categories_frequency_table[category] = 1
            
        if genre_col:
            genre = row[genre_col].capitalize()
            if genre in genres_frequency_table:
                genres_frequency_table[genre] += 1
            else:
                genres_frequency_table[genre] = 1
    
    categories_frequency_table_percentages = categories_frequency_table.copy()
    categories_frequency_table_percentages = get_frequency_list_percentages(categories_frequency_table_percentages, len(dataset))
    categories_frequency_table_percentages = dict(sorted(categories_frequency_table_percentages.items(), key=lambda item: -float(item[1][:-1])))
    
    print(f'App categories frequency for free apps on {store} store:\n\n{categories_frequency_table_percentages}')
    
    if genre_col:
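        # Work on a copy so the raw genre counts returned below are not overwritten with percentage strings.
        genres_frequency_table_percentages = get_frequency_list_percentages(genres_frequency_table.copy(), len(dataset))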
        genres_frequency_table_percentages = dict(sorted(genres_frequency_table_percentages.items(), key=lambda item: -float(item[1][:-1])))
        
        print(f'\nApp genres frequency for free apps on {store} store:\n\n{genres_frequency_table_percentages}')
    
    if not genre_col:
        return categories_frequency_table
    else:
        return {'categories': categories_frequency_table, 'columns': genres_frequency_table}

apple_store_categories = get_categories_frequency(apple_store_data, 12, 'Apple Store')
print(f'\n{"-"*80}\n')
google_play_categories = get_categories_frequency(google_play_data, 1, 'Google Play', 9)

App categories frequency for free apps on Apple Store store:

{'Games': '59.15%', 'Entertainment': '7.53%', 'Photo & video': '5.13%', 'Education': '3.83%', 'Social networking': '3.11%', 'Shopping': '2.5%', 'Utilities': '2.26%', 'Music': '2.16%', 'Sports': '2.05%', 'Health & fitness': '1.98%', 'Productivity': '1.71%', 'Lifestyle': '1.47%', 'News': '1.33%', 'Travel': '1.13%', 'Finance': '1.09%', 'Weather': '0.89%', 'Food & drink': '0.89%', 'Reference': '0.51%', 'Business': '0.51%', 'Book': '0.27%', 'Medical': '0.21%', 'Navigation': '0.14%', 'Catalogs': '0.1%'}

--------------------------------------------------------------------------------

App categories frequency for free apps on Google Play store:

{'Family': '18.79%', 'Game': '9.61%', 'Tools': '8.58%', 'Business': '4.71%', 'Productivity': '3.97%', 'Lifestyle': '3.89%', 'Finance': '3.73%', 'Medical': '3.64%', 'Personalization': '3.31%', 'Sports': '3.26%', 'Communication': '3.22%', 'Health_and_fitness': '3.13%', 'Photography': '3.01%'

### Apple Store analysis 
---
- There is a disproportionate number of game and entertainment apps among the free English applications, with Utilities and Productivity each falling below the 3% mark

### Google Play analysis 
---
- The most common free apps span a range of different genres and categories, and are more evenly distributed than the apps from the Apple Store
- Family, Game, and Tools apps are the most common on the store
- This differs considerably from the Apple Store, since there is no clear concentration in games and entertainment

#### Note that the conclusions above come only from inspecting the main categories and genres of each dataset. The next step explores which apps yield a larger user base. The Apple Store dataset has no installs column, so as a workaround we use the rating count column (`rating_count_tot`) instead
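
Using the rating count as a popularity proxy amounts to averaging one numeric column per category. A minimal sketch of that computation, assuming a hypothetical helper `average_by_category` and made-up sample rows (not taken from either dataset):

```python
from collections import defaultdict


def average_by_category(rows, category_col, value_col):
    """Average the numeric column `value_col` within each category."""
    totals = defaultdict(float)  # category -> sum of values
    counts = defaultdict(int)    # category -> number of rows
    for row in rows:
        category = row[category_col]
        totals[category] += float(row[value_col])
        counts[category] += 1
    return {category: round(totals[category] / counts[category], 2) for category in totals}


# Made-up rows for illustration only: [category, rating count]
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]
averages = average_by_category(sample_rows, category_col=0, value_col=1)
print(averages)
```

Averages like these are sensitive to a few very popular apps in a small category, so a high value does not necessarily mean every app in that category has a large user base.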

In [56]:
def get_category_average_user_base(popularity_dict, category_frequency_dict):
    average_category_user_base = {}
    for category, user_base in popularity_dict.items():
        average_category_user_base[category] = round(user_base/float(category_frequency_dict[category]), 2)
    return average_category_user_base

def get_most_downloaded_app_genres(dataset, category_dict, category_col, popularity_col):
    category_popularity_sum = {}
    for row in dataset[1:]:
        category = row[category_col].capitalize()
        popularity = float(row[popularity_col])
        
        if category in category_popularity_sum:
            category_popularity_sum[category] += popularity
        else:
            category_popularity_sum[category] = popularity
            
    average_category_user_base = get_category_average_user_base(category_popularity_sum, category_dict)
    average_category_user_base = dict(sorted(average_category_user_base.items(), key=lambda item: -float(item[1])))
    
    return average_category_user_base
    
get_most_downloaded_app_genres(apple_store_data, apple_store_categories, 12, 6)

{'Navigation': 125037.25,
 'Reference': 89562.6,
 'Social networking': 78567.31,
 'Music': 55396.02,
 'Weather': 48275.58,
 'Travel': 34115.58,
 'Food & drink': 33333.92,
 'Photo & video': 29249.77,
 'Shopping': 28877.58,
 'Finance': 26038.69,
 'Sports': 25791.67,
 'News': 23382.18,
 'Productivity': 22842.22,
 'Games': 21560.75,
 'Health & fitness': 19418.62,
 'Lifestyle': 17260.53,
 'Book': 16671.0,
 'Entertainment': 15006.23,
 'Utilities': 11571.7,
 'Business': 6839.6,
 'Education': 6103.46,
 'Catalogs': 5195.0,
 'Medical': 612.0}