# Profitable App Profiles for the App Store and Google Play Markets

This notebook analyzes mobile app data from the App Store and Google Play to identify the kinds of apps that are likely to be profitable in these markets.

## Data

- [Data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data on approximately 10,000 Android apps from Google Play.
    - [Direct Link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [Data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data on approximately 7,000 iOS apps from the App Store.
    - [Direct Link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
from csv import reader

### The Google Play data set ###
# The with block closes the file once it has been read;
# newline='' is the setting the csv module recommends for file objects.
with open('googleplaystore.csv', encoding="utf8", newline='') as opened_file:
    android_apps = list(reader(opened_file))
android_apps_header = android_apps[0]
android_apps = android_apps[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding="utf8", newline='') as opened_file:
    ios_apps = list(reader(opened_file))
ios_apps_header = ios_apps[0]
ios_apps = ios_apps[1:]

To make the two data sets easier to explore, we will define a function called `read_data()`. It prints a slice of rows in a readable format and, optionally, the number of rows and columns in the data set, so we can reuse it whenever we need to inspect a data set.

In [2]:
def read_data(dataset, start, end, rows_and_columns=False, num_columns=None):
    for index, row in enumerate(dataset[start:end], start=start):
        print(f"Row {index}: {row}\n")
        
    if rows_and_columns:
        num_rows = len(dataset)
        num_columns = num_columns if num_columns is not None else len(dataset[0])
        print('Number of rows:', num_rows)
        print('Number of columns:', num_columns)

print(android_apps_header)
print('\n')
read_data(android_apps, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Row 0: ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

Number of rows: 10841
Number of columns: 13


Relevant columns in the Google Play data could be `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

In [3]:
print(ios_apps_header)
print('\n')
read_data(ios_apps, 0, 1, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Row 0: ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

Number of rows: 7197
Number of columns: 16


Relevant columns in the iOS apps data could be `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

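According to this [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), row 10472 contains an error: its `'Category'` value is missing, so the remaining values shift one column to the left and the app appears to have a rating of 19: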

In [4]:
print(android_apps[10472])  # incorrect row
print('\n')
print(android_apps_header)  # header
print(len(android_apps))
del android_apps[10472]  # delete this row
print(len(android_apps))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
10841
10840
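
Rows with a missing field, like the one above, can also be found programmatically by comparing each row's length with the header's. The sketch below is a minimal illustration on made-up rows; `find_malformed_rows` is a hypothetical helper, not part of the original analysis.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample data: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

Run against `android_apps` and `android_apps_header` before the deletion, a check like this would be expected to flag row 10472, since that row has 12 fields instead of 13.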


## Remove Duplicate Entries

In [6]:
duplicate_apps = []
unique_apps = []

for app in android_apps:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
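
Checking `name in unique_apps` scans a list on every iteration, which gets slow as the data grows. As an alternative, the duplicates can be counted with `collections.Counter`. This is only a sketch on made-up rows, and the variable names are illustrative.

```python
from collections import Counter

# Made-up rows; only the app name in column 0 matters here.
sample_rows = [
    ['Box', '100'],
    ['Zoom', '50'],
    ['Box', '120'],
    ['Box', '90'],
    ['Zoom', '70'],
    ['Solo', '10'],
]

name_counts = Counter(row[0] for row in sample_rows)

# Every occurrence beyond the first counts as a duplicate entry.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicate entries:', n_duplicates)
print('Apps with duplicates:', duplicated_names)
```

Unlike `duplicate_apps`, which lists a name once per extra occurrence, `duplicated_names` lists each duplicated app only once.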


We don't want to count the apps more than once when we analyze the data, so we need to remove the duplicate entries. We'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

In [7]:
# Build a dictionary where each key is an app name and each value is the
# highest number of reviews recorded for that app.
reviews_max = {}

for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print('Expected dictionary length:', len(android_apps) - 1181)
print('Actual dictionary length:', len(reviews_max))

Expected dictionary length: 9659
Actual dictionary length: 9659


In [8]:
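android_apps_clean = []
already_added = []

# Use reviews_max to remove the duplicates: keep only the row whose review
# count matches the maximum recorded for that app.
for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    
    # The already_added check keeps a single row when several duplicates
    # share the same maximum number of reviews.
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_apps_clean.append(app)
        already_added.append(name)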
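
The same "keep the most-reviewed row" idea can be expressed in a single pass by storing the best row seen so far for each app name. The following is a sketch under stated assumptions rather than a replacement for the cells above: `keep_most_reviewed` is a hypothetical helper, and the sample rows are made up.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows laid out as [App, Category, Rating, Reviews].
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '161'],
    ['Box', 'BUSINESS', '4.2', '161'],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

One difference to keep in mind: this variant orders rows by the first appearance of each app name, so the row order may differ from `android_apps_clean` even though the same rows are kept.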

In [9]:
read_data(android_apps_clean, 0, 1, True)

Row 0: ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

Number of rows: 9659
Number of columns: 13


## Free Apps

We can separate free apps from paid apps and analyze the two groups differently. I am assuming that paid apps are profitable and will analyze them later. I will begin with free apps, whose profitability depends largely on the number of users because their main source of revenue is ads. Note that the cell below only looks at the first 5,000 rows of each data set.

In [15]:
android_apps_final = []
ios_apps_final = []

for app in android_apps_clean[0:5000]:
    price = app[7]
    if price == '0':
        android_apps_final.append(app)
        
for app in ios_apps[0:5000]:
    price = app[4]
    if price == '0.0':
        ios_apps_final.append(app)
        
print(len(android_apps_final))
print(len(ios_apps_final))

4623
2851
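
The two data sets store a free price as different strings (`'0'` for Google Play and `'0.0'` for the App Store), which is why the loops above compare against different values. A numeric comparison avoids depending on the exact string. The sketch below is illustrative only: `is_free` is a hypothetical helper, and the handling of a leading `$` is an assumption about how paid prices may be formatted.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: paid prices may carry a leading currency symbol such as '$'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Prices that cannot be parsed are treated as not free.
        return False


for text in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(text), is_free(text))
```

With a helper like this, both loops could use the same condition, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.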


## Most Common Apps by Genre

To start our analysis, we will determine the most common genres in each market. To achieve this, we'll create frequency tables for the `prime_genre` column in the App Store dataset, and the `Genres` and `Category` columns in the Google Play dataset.

To aid our analysis, we'll develop two functions:

1. The first function will generate frequency tables that show the percentage of apps in each genre or category.
2. The second function will display the percentages in descending order, allowing us to identify the most prevalent genres or categories.

With these functions, we can see which app genres are most common in each market, which helps us identify app profiles that could succeed in both.

In [27]:
from collections import Counter

def freq_table(dataset, index):
    total = len(dataset)
    counts = Counter(row[index] for row in dataset)
    
    table_percentages = {key: (count / total) * 100 for key, count in counts.items()}
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_sorted = sorted(table.items(), key=lambda x: x[1], reverse=True)
    for key, value in table_sorted:
        print(f"{key}: {value}")
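
When only the leading genres matter, `Counter.most_common` can handle the sorting and truncation. The sketch below shows the idea on made-up rows; `top_genres` is a hypothetical helper, and rounding to two decimal places is a presentation choice that the functions above do not make.

```python
from collections import Counter


def top_genres(dataset, index, n=3):
    """Return the n most frequent values in a column as (value, percentage) pairs."""
    total = len(dataset)
    counts = Counter(row[index] for row in dataset)
    return [(key, round(count / total * 100, 2)) for key, count in counts.most_common(n)]


# Made-up rows laid out as [App, Genre].
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Tools'],
    ['App D', 'Games'],
    ['App E', 'Education'],
]

for genre, percentage in top_genres(sample_rows, 1, n=2):
    print(f"{genre}: {percentage}")
```

Values with equal counts are returned in the order they were first encountered, so ties near the cutoff depend on row order.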

## Analysis