# Profitable App Profiles for the App Store and Google Play Markets

***
This is a guided project from Dataquest's curriculum.
***

The company builds free-to-download Android and iOS mobile apps for English speakers. Its main source of revenue is in-app ads: the more users an app has, the more likely the ads are to be seen and engaged with, and hence the higher the revenue.

The goal of this project is to analyze App Store and Google Play data to understand which types of apps are likely to attract users and could therefore generate higher revenue.


# Step 1 - Data preparation
## Step 1a - Import data

The two stores combined list over 4 million apps. This analysis uses a sample data set from each store.

* iOS data is obtained from <a href="https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps">here</a>.
* Android data is obtained from <a href="https://www.kaggle.com/lava18/google-play-store-apps">here</a>.

In [2]:
from csv import reader

#### iOS dataset ####
ios_file = open(r"D:\Documents\Python Journey\Dataquest\Guided Project 1_Profitable apps\AppleStore.csv", encoding="utf8")  # raw string keeps the backslashes literal
ios = list(reader(ios_file))
ios_header = ios[0]
ios_data = ios[1:]

#### Android dataset ####
android_file = open(r"D:\Documents\Python Journey\Dataquest\Guided Project 1_Profitable apps\googleplaystore.csv", encoding="utf8")  # raw string keeps the backslashes literal
android = list(reader(android_file))
android_header = android[0]
android_data = android[1:]
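
The cell above leaves both file handles open. A minimal sketch of an alternative, assuming a hypothetical helper named `load_csv` (not part of the original notebook), uses a `with` block so the file is closed automatically. The demo writes a tiny temporary CSV with made-up rows so that the block runs on its own; in the notebook the path would point at `AppleStore.csv` or `googleplaystore.csv` instead.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline="" is the setting the csv module documentation recommends
    with open(path, encoding="utf8", newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a throwaway file (made-up sample rows, not the real data set)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("id,track_name,price\n1,Example App,0.0\n2,Other App,1.99\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)  # clean up the temporary file

print(demo_header)
print(demo_rows)
```

Returning the header and the data rows together keeps the two from drifting apart when the same steps are repeated for the second store.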

## Step 1b - Explore data

In [3]:
# This function prints a slice of the dataset in a readable format and, optionally, the dataset's dimensions.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n") # adds a new empty line between each row

    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [4]:
# Exploring iOS dataset
print(ios_header)
print("\n")
explore_data(ios_data, 0, 3, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


The columns that look useful for the analysis are *track_name*, *price*, *rating_count_tot*, and *prime_genre*.

In [5]:
# Exploring Android dataset
print(android_header)
print("\n")
explore_data(android_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


The columns that look useful for the analysis are *App*, *Category*, *Rating*, *Installs*, *Type*, *Price*, *Content Rating*, and *Genres*.


## Step 1c - Clean data
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

Since the company is building free apps for English speakers, the following apps will be removed:
- non-English apps
- paid apps

### Clean iOS dataset

In [7]:
# Check that each row has the same number of columns as the header
ios_col_row = True

for index, row in enumerate(ios_data):
    if len(ios_header) != len(row):
        print(row, "\n")
        print("Index of the row is " + str(index))
        ios_col_row = False

print("Check ended.")
print(f"Each row contains all columns: {ios_col_row}")

Check ended.
Each row contains all columns: True


In [6]:
# Find duplicate rows
ios_dup_apps = []
ios_unique_apps = []

for i in range(len(ios_data)):
    ios_id = ios_data[i][0]
    if ios_id in ios_unique_apps:
        ios_dup_apps.append(ios_id)
    else:
        ios_unique_apps.append(ios_id)

print(f"Number of unique apps: {len(ios_unique_apps):,.0f}")
print()
print(f"Number of duplicate apps: {len(ios_dup_apps):,.0f}")
print()
print("Example of duplicate apps are:", sorted(ios_dup_apps[0:20]))

assert len(ios_data) == len(ios_dup_apps) + len(ios_unique_apps)    # check if number of rows are matched

Number of unique apps: 7,197

Number of duplicate apps: 0

Example of duplicate apps are: []
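
The loop above tests membership against a list, which rescans the list for every row. That is fine for a few thousand rows, but a counting approach does the same job in a single pass. The following is a sketch only, assuming a hypothetical helper `find_duplicates` and made-up sample rows:

```python
from collections import Counter


def find_duplicates(rows, key_index):
    """Return (unique_keys, duplicate_count) for the column at key_index."""
    counts = Counter(row[key_index] for row in rows)
    # Every occurrence beyond the first one counts as a duplicate row
    duplicate_count = sum(n - 1 for n in counts.values())
    return set(counts), duplicate_count


# Made-up rows for illustration only: id "1" appears three times
sample_rows = [["1", "A"], ["2", "B"], ["1", "A"], ["3", "C"], ["1", "A"]]
unique_ids, n_dup = find_duplicates(sample_rows, 0)

print(len(unique_ids), n_dup)
# Same sanity check as in the notebook: unique + duplicates == all rows
assert len(sample_rows) == len(unique_ids) + n_dup
```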


### Prepare iOS data for the analysis
To simplify the analysis, we assume that the name of an English app contains few non-English characters. Characters outside the ASCII range (code point above 127) are treated as non-English. If a name contains 3 or more such characters, or no ASCII characters at all, the app is considered non-English. We will then separate the English apps into free and paid apps.

In [7]:
# This function is used to determine whether the app has a non-English name
def eng_check(app):
    non_eng = 0
    eng = 0
    for char in app:
        if ord(char) > 127:
            non_eng += 1
        else:
            eng += 1
    if eng == 0 or non_eng >= 3:
        return False
    else:
        return True    

assert (eng_check("Instagram")) == True
assert (eng_check("爱奇艺PPS -《欢乐颂2》电视剧热播")) == False
assert (eng_check("Docs To Go™ Free Office Suite")) == True
assert (eng_check("Instachat 😜")) == True
assert (eng_check("豆瓣")) == False
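
To see why the check needs both conditions, it helps to count the non-ASCII characters in the test names. The helper `count_non_ascii` below is a hypothetical name introduced only for this illustration:

```python
def count_non_ascii(name):
    """Count characters whose code point is above 127 (outside ASCII)."""
    return sum(1 for char in name if ord(char) > 127)


names = ["Instagram", "Docs To Go™ Free Office Suite", "Instachat 😜", "豆瓣"]
non_ascii_counts = {name: count_non_ascii(name) for name in names}

for name, count in non_ascii_counts.items():
    print(name, "->", count)
```

The name "豆瓣" has only 2 non-ASCII characters, which is below the threshold of 3. It is still rejected by `eng_check` because it contains no ASCII characters at all, which is what the `eng == 0` condition covers.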

In [8]:
# Separate English apps and non-English apps
ios_eng_apps = []
ios_noneng_apps = []

for row in ios_data:
    eng = eng_check(row[1])
    if eng == True:
        ios_eng_apps.append(row)
    else:
        ios_noneng_apps.append(row)

assert len(ios_data) == len(ios_eng_apps) + len(ios_noneng_apps)

print(f"There are {len(ios_eng_apps):,.0f} apps with English names.")
print("Example:", ios_eng_apps[0][1], ",", ios_eng_apps[1][1], ",", ios_eng_apps[2][1], ",", ios_eng_apps[3][1])
print()
print(f"There are {len(ios_noneng_apps):,.0f} apps with non-English names.")
print("Example:", ios_noneng_apps[0][1], ",", ios_noneng_apps[1][1], ",", ios_noneng_apps[2][1], ",", ios_noneng_apps[3][1])

There are 6,149 apps with English names.
Example: Facebook , Instagram , Clash of Clans , Temple Run

There are 1,048 apps with non-English names.
Example: 爱奇艺PPS -《欢乐颂2》电视剧热播 , 聚力视频HD-人民的名义,跨界歌王全网热播 , 优酷视频 , 网易新闻 - 精选好内容，算出你的兴趣


In [9]:
# Check price of apps
ios_eng_apps_price_f = {}

for row in ios_eng_apps:
    ios_eng_apps_price_f.setdefault(row[4], 0)
    ios_eng_apps_price_f[row[4]] += 1

print(ios_eng_apps_price_f)

{'0.0': 3198, '1.99': 610, '0.99': 636, '6.99': 165, '2.99': 669, '7.99': 30, '4.99': 373, '9.99': 75, '3.99': 265, '8.99': 8, '5.99': 43, '14.99': 15, '13.99': 6, '19.99': 13, '17.99': 3, '15.99': 4, '24.99': 8, '20.99': 1, '29.99': 6, '12.99': 1, '39.99': 2, '74.99': 1, '16.99': 2, '249.99': 1, '11.99': 3, '27.99': 1, '49.99': 2, '59.99': 3, '22.99': 1, '18.99': 1, '99.99': 1, '34.99': 1, '299.99': 1}


From the result above, free apps in the iOS data set have a price of "0.0". Hence, "0.0" will be used as the criterion to separate free apps from paid apps.

In [10]:
# Separate free and paid apps
ios_eng_free_apps = []
ios_eng_paid_apps = []

for row in ios_eng_apps:
    if row[4] == "0.0":
        ios_eng_free_apps.append(row)
    else:
        ios_eng_paid_apps.append(row)

assert len(ios_eng_apps) == len(ios_eng_free_apps) + len(ios_eng_paid_apps)

print(f"There are {len(ios_eng_free_apps):,.0f} free apps with English names.")
print("Example:", ios_eng_free_apps[0][1], ",", ios_eng_free_apps[1][1], ",", ios_eng_free_apps[2][1], ",", ios_eng_free_apps[3][1])
print()
print(f"There are {len(ios_eng_paid_apps):,.0f} paid apps with English names.")
print("Example:", ios_eng_paid_apps[0][1], ",", ios_eng_paid_apps[1][1], ",", ios_eng_paid_apps[2][1], ",", ios_eng_paid_apps[3][1])

There are 3,198 free apps with English names.
Example: Facebook , Instagram , Clash of Clans , Temple Run

There are 2,951 paid apps with English names.
Example: Fruit Ninja Classic , Clear Vision (17+) , Minecraft: Pocket Edition , Plants vs. Zombies


### Clean Android dataset

In [11]:
# Check that each row has the same number of columns as the header
android_col_row = True

for index, row in enumerate(android_data):
    if len(android_header) != len(row):
        print(row, "\n")
        print(f"Index of the row is {index}. There are {len(row)} items in the row")
        android_col_row = False

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

Index of the row is 10472. There are 12 items in the row


The row at index 10472 in `android_data` has only 12 items, so it is missing one column.

In [12]:
# Compare the incorrect row with header
print(android_data[10472])
print()
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The comparison above shows that the *Category* value is missing from the row at index 10472, so every later value is shifted one column to the left: *Category* reads *1.9* and *Rating* reads *19*. Hence, this row will be deleted.

In [13]:
print(len(android_data))    # number of rows before deletion
del android_data[10472]     # row to delete
print(len(android_data))    # number of rows after deletion

10841
10840


In [14]:
# Find duplicate rows
android_dup_apps = []
android_unique_apps = []

for i in range(len(android_data)):
    name = android_data[i][0]
    if name in android_unique_apps:
        android_dup_apps.append(name)
    else:
        android_unique_apps.append(name)

print(f"Number of unique apps: {len(android_unique_apps):,.0f}")
print()
print(f"Number of duplicate apps: {len(android_dup_apps):,.0f}")
print()
print("Example of duplicate apps are:", sorted(android_dup_apps[0:20]))

assert len(android_data) == len(android_dup_apps) + len(android_unique_apps)    # check if number of rows are matched

Number of unique apps: 9,659

Number of duplicate apps: 1,181

Example of duplicate apps are: ['AdWords Express', 'Asana: organize team projects', 'Box', 'Box', 'Crew - Free Messaging and Scheduling', 'FreshBooks Classic', 'Google Ads', 'Google Analytics', 'Google My Business', 'Google My Business', 'HipChat - Chat Built for Teams', 'Insightly CRM', 'MailChimp - Email, Marketing Automation', 'Quick PDF Scanner + OCR FREE', 'QuickBooks Accounting: Invoicing & Expenses', 'Slack', 'Xero Accounting Software', 'ZOOM Cloud Meetings', 'Zenefits', 'join.me - Simple Meetings']


In [15]:
# Select some duplicate apps to see the data
for row in android_data:
    name = row[0]
    if name == "Amazon Kindle":
        print(row)

['Amazon Kindle', 'BOOKS_AND_REFERENCE', '4.2', '814080', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Books & Reference', 'July 27, 2018', 'Varies with device', 'Varies with device']
['Amazon Kindle', 'BOOKS_AND_REFERENCE', '4.2', '814151', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Books & Reference', 'July 27, 2018', 'Varies with device', 'Varies with device']


The data in both rows suggest that they are the same app. The only difference between them is the number of reviews, so the row with the higher number of reviews likely contains the more recent data.

In [16]:
# Select some duplicate apps to see the data
for row in android_data:
    name = row[0]
    if name == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Similar to *Amazon Kindle*, the data in all 4 rows suggest that they are the same app. The only difference is the number of reviews.

In [17]:
# Select some duplicate apps to see the data
for row in android_data:
    name = row[0]
    if name == "Ebook Reader":
        print(row)

['Ebook Reader', 'BOOKS_AND_REFERENCE', '4.1', '85842', '37M', '5,000,000+', 'Free', '0', 'Everyone', 'Books & Reference', 'June 25, 2018', '5.0.6', '4.0 and up']
['Ebook Reader', 'BOOKS_AND_REFERENCE', '4.1', '85842', '37M', '5,000,000+', 'Free', '0', 'Everyone', 'Books & Reference', 'June 25, 2018', '5.0.6', '4.0 and up']


Both rows contain exactly the same data.

Next, for each duplicated app, only the row with the highest number of reviews will be kept. Rows with an equal or lower number of reviews will be removed.

In [18]:
# Create dataset that only include unique apps
reviews_max = {}

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

assert len(reviews_max) == len(android_unique_apps)

Each app name is paired with its highest number of reviews and stored in `reviews_max`. At the end, we check that the number of apps in `reviews_max` equals the number of apps in `android_unique_apps` to ensure that all apps are included.

In [19]:
android_clean = []
already_added = []

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

assert len(reviews_max) == len(android_unique_apps) == len(android_clean)

For each app, the row with the highest number of reviews is selected and stored in a new list called `android_clean`. As before, we check that the number of apps in the new list equals the number of apps in `reviews_max` and `android_unique_apps` to ensure that all apps are included.
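
The two passes above (first `reviews_max`, then `android_clean`) can also be folded into one. The following is a minimal sketch, assuming a hypothetical helper `keep_most_reviewed` and made-up four-column rows with the review count at index 3, as in the Android data. Unlike `android_clean`, the result is ordered by the first appearance of each app name, so it is an alternative rather than a drop-in replacement.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict ">" keeps the earliest row when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows for illustration only
sample = [
    ["App A", "CAT", "4.2", "100"],
    ["App B", "CAT", "4.0", "50"],
    ["App A", "CAT", "4.2", "150"],
    ["App B", "CAT", "4.0", "50"],
]
deduped = keep_most_reviewed(sample)
print(deduped)
```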

### Prepare Android data for the analysis
As with the iOS data set, the Android app names are checked to identify English apps. If there are 3 or more non-English characters in the name, the app is considered non-English. We will then separate the English apps into free and paid apps.

In [20]:
# Separate English apps and non-English apps
android_eng_apps = []
android_noneng_apps = []

for row in android_clean:
    eng = eng_check(row[0])
    if eng == True:
        android_eng_apps.append(row)
    else:
        android_noneng_apps.append(row)

assert len(android_clean) == len(android_eng_apps) + len(android_noneng_apps)

print(f"There are {len(android_eng_apps):,.0f} apps with English names.")
print("Example:", android_eng_apps[0][0], ",", android_eng_apps[1][0], ",", android_eng_apps[2][0], ",", android_eng_apps[3][0])
print()
print(f"There are {len(android_noneng_apps):,.0f} apps with non-English names.")
print("Example:", android_noneng_apps[0][0], ",", android_noneng_apps[1][0], ",", android_noneng_apps[2][0], ",", android_noneng_apps[3][0])

There are 9,597 apps with English names.
Example: Photo Editor & Candy Camera & Grid & ScrapBook , U Launcher Lite – FREE Live Cool Themes, Hide Apps , Sketch - Draw & Paint , Pixel Draw - Number Art Coloring Book

There are 62 apps with non-English names.
Example: Truyện Vui Tý Quậy , Flame - درب عقلك يوميا , At home - rental · real estate · room finding application such as apartment · apartment , 乐屋网: Buying a house, selling a house, renting a house


In [21]:
# Check price of apps
android_eng_apps_price_f1 = {}

for row in android_eng_apps:
    android_eng_apps_price_f1.setdefault(row[6], 0)         # Grouped data based on "Type"
    android_eng_apps_price_f1[row[6]] += 1

print(android_eng_apps_price_f1)
print()

android_eng_apps_price_f2 = {}

for row in android_eng_apps:
    android_eng_apps_price_f2.setdefault(row[7], 0)         # Grouped data based on "Price"
    android_eng_apps_price_f2[row[7]] += 1

print(android_eng_apps_price_f2)

{'Free': 8847, 'Paid': 749, 'NaN': 1}

{'0': 8848, '$4.99': 70, '$3.99': 56, '$1.49': 45, '$2.99': 123, '$7.99': 7, '$5.99': 26, '$3.49': 7, '$1.99': 73, '$6.99': 10, '$9.99': 19, '$7.49': 2, '$0.99': 145, '$9.00': 1, '$5.49': 5, '$10.00': 2, '$11.99': 3, '$79.99': 1, '$16.99': 2, '$14.99': 9, '$1.00': 3, '$29.99': 5, '$2.49': 25, '$24.99': 3, '$10.99': 1, '$1.50': 1, '$19.99': 5, '$15.99': 1, '$33.99': 1, '$74.99': 1, '$39.99': 2, '$3.95': 1, '$4.49': 9, '$1.70': 2, '$8.99': 5, '$2.00': 3, '$3.88': 1, '$25.99': 1, '$399.99': 11, '$17.99': 2, '$400.00': 1, '$3.02': 1, '$1.76': 1, '$4.84': 1, '$4.77': 1, '$1.61': 1, '$2.50': 1, '$1.59': 1, '$6.49': 5, '$1.29': 1, '$5.00': 1, '$13.99': 2, '$299.99': 1, '$379.99': 1, '$37.99': 1, '$18.99': 1, '$389.99': 1, '$19.90': 1, '$8.49': 2, '$1.75': 1, '$14.00': 1, '$4.85': 1, '$46.99': 1, '$109.99': 1, '$154.99': 1, '$3.08': 1, '$2.59': 1, '$4.80': 1, '$1.96': 1, '$19.40': 1, '$3.90': 1, '$4.59': 1, '$15.46': 1, '$3.04': 1, '$12.99': 3, '$4.29': 1

The first dictionary, `android_eng_apps_price_f1`, shows the frequency of each app type, while the second dictionary, `android_eng_apps_price_f2`, shows the frequency of each price.

Note that there are 8,847 "Free" apps, while 8,848 apps have a price of "0". The difference of one matches the single app whose *Type* is "NaN".

In [22]:
for row in android_eng_apps:
    if row[6] == "NaN":
        print(row)

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


I have confirmed from other sources that the *Command & Conquer: Rivals* game is free-to-play. Hence, it will be treated as a "Free" app, in line with its price of "0". Next, we can separate the apps into 2 groups: *free* and *paid* apps.

In [23]:
# Separate free and paid apps
android_eng_free_apps = []
android_eng_paid_apps = []

for row in android_eng_apps:
    if row[7] == "0":
        android_eng_free_apps.append(row)
    else:
        android_eng_paid_apps.append(row)

assert len(android_eng_apps) == len(android_eng_free_apps) + len(android_eng_paid_apps)

print(f"There are {len(android_eng_free_apps):,.0f} free apps with English names.")
print("Example:", android_eng_free_apps[0][0], ",", android_eng_free_apps[1][0], ",", android_eng_free_apps[2][0], ",", android_eng_free_apps[3][0])
print()
print(f"There are {len(android_eng_paid_apps):,.0f} paid apps with English names.")
print("Example:", android_eng_paid_apps[0][0], ",", android_eng_paid_apps[1][0], ",", android_eng_paid_apps[2][0], ",", android_eng_paid_apps[3][0])

There are 8,848 free apps with English names.
Example: Photo Editor & Candy Camera & Grid & ScrapBook , U Launcher Lite – FREE Live Cool Themes, Hide Apps , Sketch - Draw & Paint , Pixel Draw - Number Art Coloring Book

There are 749 paid apps with English names.
Example: TurboScan: scan documents and receipts in PDF , Tiny Scanner Pro: PDF Doc Scan , Puffin Browser Pro , Truth or Dare Pro


After data preparation, there are 3,198 free iOS apps and 8,848 free Android apps, with English names, for the analysis. From this point on, these apps will be referred to as "English free apps".
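
The two data sets label free apps differently ("0.0" in the iOS data, "0" in the Android data, and paid Android prices carry a "$" prefix), so the free/paid split relies on exact string matches. A sketch of a more uniform check, assuming hypothetical helpers `parse_price` and `is_free` that are not part of the original notebook, converts the text to a number first:

```python
def parse_price(price_text):
    """Convert price text such as '0', '0.0' or '$4.99' to a float."""
    return float(price_text.replace("$", "").strip())


def is_free(price_text):
    """An app is treated as free when its parsed price is zero."""
    return parse_price(price_text) == 0.0


# Price strings in the formats seen in the two data sets
for text in ["0", "0.0", "$4.99", "$0.99"]:
    print(text, is_free(text))
```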

# Step 2 - Data analysis

The validation strategy for an app idea consists of 3 steps:

1. Build a minimal Android version and add it to Google Play store.
2. If the app has a good response from users, the app will be developed further.
3. If the app is profitable after 6 months, the iOS version will be developed and added to the App Store.

As our goal is to add this app to both Google Play and App Store, the following analysis will help to determine the app profiles that are successful on both stores.

In [24]:
# Revisit the table headers to determine the columns used for the analysis
print(ios_header)
print()
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The columns that could be beneficial to the analysis are *prime_genre* for the iOS data set, and *Category* and *Genres* for the Android data set, as these indicate the most common app genres and categories in each store.

In [25]:
# This function builds a frequency table, in percentages, for the column at the given index
def freq_table(dataset, index):
    table_eng_free_f = {}
    for row in dataset:
        table_eng_free_f.setdefault(row[index], 0)
        table_eng_free_f[row[index]] += 1
    for k, v in table_eng_free_f.items():
        v = (v / len(dataset)) * 100
        table_eng_free_f[k] = v
    
    return table_eng_free_f

In [26]:
# This function is to display the table in descending order of the frequencies
def display_table(dataset):
    table_display = []
    for key in dataset:
        key_val_as_tuple = (dataset[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', f"{entry[0]:,.4f}")     # Reduce the decimal digits to be more readable
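
For comparison, the same two steps can be written with `collections.Counter` and a sort key. This is only a sketch with hypothetical function names (`freq_table_pct`, `display_sorted`) and made-up rows. One difference to be aware of: ties are left in first-seen order here, whereas `display_table` breaks ties by key in reverse alphabetical order.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}


def display_sorted(table):
    """Print entries from the largest percentage to the smallest."""
    for key, value in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(key, ":", f"{value:,.4f}")


# Made-up rows for illustration only
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
sample_table = freq_table_pct(sample, 1)
display_sorted(sample_table)
```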

## Step 2a - Find the most common genre in each store
### iOS common genre

In [27]:
# Find an index of "prime_genre" column
ios_header.index("prime_genre")

11

In [28]:
# Determine frequencies of each genre
ios_eng_free_f = freq_table(ios_eng_free_apps, 11)
display_table(ios_eng_free_f)

Games : 58.2864
Entertainment : 7.8487
Photo & Video : 5.0031
Education : 3.6898
Social Networking : 3.2520
Shopping : 2.5954
Utilities : 2.4703
Sports : 2.1576
Music : 2.0638
Health & Fitness : 2.0325
Productivity : 1.7511
Lifestyle : 1.5635
News : 1.3446
Travel : 1.2195
Finance : 1.0944
Weather : 0.8755
Food & Drink : 0.8130
Reference : 0.5316
Business : 0.5316
Book : 0.3752
Navigation : 0.1876
Medical : 0.1876
Catalogs : 0.1251


Approximately 71% of the English free apps in the App Store are designed for entertainment purposes. The top 3 genres are:

1. *Games*, which accounts for more than half of all English free apps, at 58.29%.
2. *Entertainment*, at 7.85%.
3. *Photo & Video*, at 5.00%.

These frequencies show how many apps exist in each genre, not how many users they attract, so they are not enough to conclude that *Games* is the most popular app profile in the App Store. They do suggest, however, that a new game would face many competitors there.
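
The 71% figure is the sum of the three largest shares in the table above, as this short check shows:

```python
# Shares (in percent) copied from the iOS frequency table above
top_three = {"Games": 58.2864, "Entertainment": 7.8487, "Photo & Video": 5.0031}

combined = sum(top_three.values())
print(f"{combined:.2f}%")
```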

### Android common genre

In [29]:
# Find an index of "Category" column
android_header.index("Category")

1

In [30]:
# Determine frequencies of each category
android_eng_free_f_cat = freq_table(android_eng_free_apps, 1)
display_table(android_eng_free_f_cat)

FAMILY : 18.9421
GAME : 9.6971
TOOLS : 8.4539
BUSINESS : 4.5999
PRODUCTIVITY : 3.8992
LIFESTYLE : 3.8879
FINANCE : 3.7071
MEDICAL : 3.5375
SPORTS : 3.3906
PERSONALIZATION : 3.3228
COMMUNICATION : 3.2324
HEALTH_AND_FITNESS : 3.0854
PHOTOGRAPHY : 2.9498
NEWS_AND_MAGAZINES : 2.8029
SOCIAL : 2.6673
TRAVEL_AND_LOCAL : 2.3395
SHOPPING : 2.2491
BOOKS_AND_REFERENCE : 2.1361
DATING : 1.8648
VIDEO_PLAYERS : 1.7970
MAPS_AND_NAVIGATION : 1.3901
FOOD_AND_DRINK : 1.2432
EDUCATION : 1.1641
ENTERTAINMENT : 0.9607
LIBRARIES_AND_DEMO : 0.9381
AUTO_AND_VEHICLES : 0.9268
HOUSE_AND_HOME : 0.8024
WEATHER : 0.7911
EVENTS : 0.7120
PARENTING : 0.6555
ART_AND_DESIGN : 0.6442
COMICS : 0.6103
BEAUTY : 0.5990


The English free apps in Google Play are more evenly distributed across categories than those in the App Store. The most common *Category* is *FAMILY*, at 18.94%, followed by *GAME* (9.70%) and *TOOLS* (8.45%).

After further investigation, the *FAMILY* category turns out to be for apps designed for children. (At the time of writing, this category is named *Kids* on the Google Play website.) Its sub-categories include education, games, and entertainment.

Next, we check the *Genres* column.

In [31]:
# Find an index of "Genres" column
android_header.index("Genres")

9

In [32]:
# Determine frequencies of each Genres
android_eng_free_f_genres = freq_table(android_eng_free_apps, 9)
display_table(android_eng_free_f_genres)

Tools : 8.4426
Entertainment : 6.0805
Education : 5.3571
Business : 4.5999
Productivity : 3.8992
Lifestyle : 3.8766
Finance : 3.7071
Medical : 3.5375
Sports : 3.4584
Personalization : 3.3228
Communication : 3.2324
Action : 3.0967
Health & Fitness : 3.0854
Photography : 2.9498
News & Magazines : 2.8029
Social : 2.6673
Travel & Local : 2.3282
Shopping : 2.2491
Books & Reference : 2.1361
Simulation : 2.0457
Dating : 1.8648
Arcade : 1.8422
Video Players & Editors : 1.7744
Casual : 1.7631
Maps & Navigation : 1.3901
Food & Drink : 1.2432
Puzzle : 1.1302
Racing : 0.9946
Role Playing : 0.9381
Libraries & Demo : 0.9381
Auto & Vehicles : 0.9268
Strategy : 0.9155
House & Home : 0.8024
Weather : 0.7911
Events : 0.7120
Adventure : 0.6668
Comics : 0.5990
Beauty : 0.5990
Art & Design : 0.5990
Parenting : 0.4973
Card : 0.4521
Trivia : 0.4182
Casino : 0.4182
Educational;Education : 0.3956
Board : 0.3843
Educational : 0.3730
Education;Education : 0.3391
Word : 0.2599
Casual;Pretend Play : 0.2373
Music :

In [33]:
# Check which categories a sample of multi-genre apps belong to
temp_genres_cat = {}
multi_genres = ("Racing;Action & Adventure", "Entertainment;Music & Video", "Puzzle;Brain Games", "Casual;Brain Games")

for row in android_eng_free_apps:
    if row[9] in multi_genres:
        temp_genres_cat.setdefault(row[1], 0)
        temp_genres_cat[row[1]] += 1

print(temp_genres_cat)

{'ENTERTAINMENT': 3, 'GAME': 1, 'FAMILY': 53}


The *Genres* column has too many distinct values to be useful on its own. Note that "Game" and "Family", our top 2 categories, do not appear in *Genres*. Instead, there are many game genres (e.g. Racing, Simulation, Arcade, and Puzzle).

It seems that the *Genres* column acts as a sub-category or label for each app. Hence, we will focus on the *Category* data.
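
Part of the reason *Genres* has so many distinct values is that some entries combine several labels with a semicolon, such as "Art & Design;Pretend Play". If the column were ever needed, one option would be to count each label separately. A minimal sketch, assuming a hypothetical helper `split_genre_counts` and a handful of sample values:

```python
from collections import Counter


def split_genre_counts(genre_values):
    """Count each label separately when a value holds several joined by ';'."""
    counts = Counter()
    for value in genre_values:
        for label in value.split(";"):
            counts[label.strip()] += 1
    return counts


# A few Genres values in the format used by the Android data set
sample_genres = [
    "Art & Design",
    "Art & Design;Pretend Play",
    "Casual;Pretend Play",
    "Puzzle;Brain Games",
]
label_counts = split_genre_counts(sample_genres)
print(label_counts)
```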

## Step 2b - Find the most popular apps in each store
### iOS popular apps
In our App Store data set, there is no data on the number of downloads or installs. Assuming that a higher number of reviews signals a higher number of installs, *rating_count_tot* will be used as a proxy for the popularity of the apps.

In [34]:
print(ios_header.index("rating_count_tot"))
print(ios_header.index("prime_genre"))

5
11


In [35]:
# Find average number of reviews in each genre
ios_eng_free_genre = freq_table(ios_eng_free_apps, 11)
ios_ef_genre_sorted = {}

for genre in ios_eng_free_genre:
    total = 0
    len_genre = 0
    for app in ios_eng_free_apps:
        genre_app = app[11]
        if genre_app == genre:
            rating_count = float(app[5])
            total += rating_count
            len_genre += 1
    avg_user = total / len_genre
    ios_ef_genre_sorted.setdefault(genre, avg_user)

display_table(ios_ef_genre_sorted)

Navigation : 86,090.3333
Reference : 79,350.4706
Social Networking : 72,916.5481
Music : 57,326.5303
Weather : 52,279.8929
Book : 46,384.9167
Food & Drink : 33,333.9231
Finance : 32,367.0286
Travel : 28,964.0513
Photo & Video : 28,441.5438
Shopping : 27,230.7349
Health & Fitness : 23,298.0154
Sports : 23,008.8986
Games : 22,910.9233
News : 21,248.0233
Productivity : 21,028.4107
Utilities : 19,156.4937
Lifestyle : 16,815.4800
Entertainment : 14,195.3586
Business : 7,491.1176
Education : 7,003.9831
Catalogs : 4,004.0000
Medical : 612.0000


The table above shows that navigation apps have the highest average number of reviews per app, followed by reference apps (examples of reference apps are Wikipedia, WolframAlpha, and Dictionary) and social networking apps. Games, the most common genre, averages only about 23k reviews per app.
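
The averaging cell above rescans every app once per genre. A single pass that accumulates totals and counts per genre gives the same averages. The sketch below assumes a hypothetical helper `mean_by_group` and a simplified three-column row layout (name, review count, genre) instead of the 16-column iOS rows. The resulting numbers describe this small sample only, not the full genres.

```python
from collections import defaultdict


def mean_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` within each group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Simplified sample rows: [name, review count, genre]
sample = [
    ["Facebook", "2974676", "Social Networking"],
    ["Pinterest", "1061624", "Social Networking"],
    ["Clash of Clans", "2130805", "Games"],
]
sample_means = mean_by_group(sample, group_index=2, value_index=1)
print(sample_means)
```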

In [36]:
# Exploring each genre
for row in ios_eng_free_apps:
    if row[11] == "Navigation":
        print(f"{row[1]} : {float((row[5])):,.2f}")

Waze - GPS Navigation, Maps & Real-time Traffic : 345,046.00
Google Maps - Navigation & Transit : 154,911.00
Geocaching® : 12,811.00
CoPilot GPS – Car Navigation & Offline Maps : 3,582.00
ImmobilienScout24: Real Estate Search in Germany : 187.00
Railway Route Search : 5.00


There are only 6 English free apps in the navigation genre. *Waze* and *Google Maps* together have almost 500k reviews, while the other 4 apps combined do not even reach 20k. It could be difficult to take market share from these Google products.
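
How strongly the two largest apps dominate the genre can be worked out from the counts printed above (app names are shortened here for readability):

```python
# Review counts copied from the Navigation output above
navigation = {
    "Waze": 345046,
    "Google Maps": 154911,
    "Geocaching": 12811,
    "CoPilot GPS": 3582,
    "ImmobilienScout24": 187,
    "Railway Route Search": 5,
}

total = sum(navigation.values())
top_two = navigation["Waze"] + navigation["Google Maps"]
share = top_two / total

print(f"Total reviews: {total:,}")
print(f"Average per app: {total / len(navigation):,.4f}")
print(f"Share held by the top two apps: {share:.1%}")
```

The average reproduces the 86,090.3333 shown in the genre table, yet about 97% of the reviews belong to just two apps, so the average says little about a typical navigation app.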

In [37]:
# Exploring each genre
for row in ios_eng_free_apps:
    if row[11] == "Reference":
        print(f"{row[1]} : {float((row[5])):,.2f}")

Bible : 985,920.00
Dictionary.com Dictionary & Thesaurus : 200,047.00
Dictionary.com Dictionary & Thesaurus for iPad : 54,175.00
Google Translate : 26,786.00
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18,418.00
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17,588.00
Merriam-Webster Dictionary : 16,849.00
Night Sky : 12,122.00
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8,535.00
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4,693.00
GUNS MODS for Minecraft PC Edition - Mods Tools : 1,497.00
Guides for Pokémon GO - Pokemon GO News and Cheats : 826.00
WWDC : 762.00
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718.00
VPN Express : 14.00
Real Bike Traffic Rider Virtual Reality Glasses : 8.00
Jishokun-Japanese English Dictionary & Translator : 0.00


Among reference apps, *Bible* has the most reviews, at almost 1M. The second and third places are the same product from *Dictionary.com*, released for different devices. Together, the two apps have approximately 250k reviews. As the data is heavily skewed, even without considering *Bible*, it could also be difficult to win market share from a big player like *Dictionary.com*. Even *Google Translate* and *Merriam-Webster Dictionary* are far behind.
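
One way to make the skew concrete is to compare the mean with the median, and to recompute the mean without *Bible*. The sketch below uses the review counts from the Reference output above and the standard `statistics` module:

```python
from statistics import mean, median

# Review counts copied from the Reference output above, largest first
reference_counts = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0,
]

mean_all = mean(reference_counts)
mean_without_top = mean(reference_counts[1:])  # drop Bible, the largest app
median_all = median(reference_counts)

print(f"Mean of all apps: {mean_all:,.4f}")
print(f"Mean without Bible: {mean_without_top:,.4f}")
print(f"Median: {median_all:,}")
```

The median is roughly a tenth of the mean, which suggests that the genre average is driven by a few very large apps rather than by the typical reference app.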

Similarly, the social networking, music, and weather genres are dominated by big players, such as Facebook (~3M reviews), Pandora and Spotify (~2M reviews combined), and The Weather Channel (~700k reviews across its two apps).

In [38]:
# Exploring each genre
ios_Social = []

for row in ios_eng_free_apps:
    if row[11] == "Social Networking":
        ios_Social.append(row)

for app in ios_Social[0:5]:
    print(f"{app[1]} : {float((app[5])):,.2f}")

Facebook : 2,974,676.00
Pinterest : 1,061,624.00
Skype for iPhone : 373,519.00
Messenger : 351,466.00
Tumblr : 334,293.00


In [39]:
# Exploring each genre
ios_Music = []

for row in ios_eng_free_apps:
    if row[11] == "Music":
        ios_Music.append(row)

for app in ios_Music[0:5]:
    print(f"{app[1]} : {float((app[5])):,.2f}")

Pandora - Music & Radio : 1,126,879.00
Spotify Music : 878,563.00
Shazam - Discover music, artists, videos & lyrics : 402,925.00
iHeartRadio – Free Music & Radio Stations : 293,228.00
SoundCloud - Music & Audio : 135,744.00


In [40]:
# Exploring each genre
ios_Weather = []

for row in ios_eng_free_apps:
    if row[11] == "Weather":
        ios_Weather.append(row)

for app in ios_Weather[0:5]:
    print(f"{app[1]} : {float((app[5])):,.2f}")

The Weather Channel: Forecast, Radar & Alerts : 495,626.00
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208,648.00
WeatherBug - Local Weather, Radar, Maps, Alerts : 188,583.00
MyRadar NOAA Weather Radar Forecast : 150,158.00
AccuWeather - Weather for Life : 144,214.00
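
The three cells above repeat the same filter-and-print pattern for each genre. As a minimal sketch, that pattern could be wrapped in one helper. The names `top_apps_by_genre`, `make_row`, and `sample_rows` are hypothetical and only serve the illustration; the column positions (1 for the app name, 5 for the review count, 11 for the genre) follow the indices used in the cells above. Unlike those cells, this sketch sorts by review count explicitly instead of relying on the order of the rows.

```python
def top_apps_by_genre(dataset, genre, n=5, name_idx=1, count_idx=5, genre_idx=11):
    """Return the n most-reviewed (name, review_count) pairs for one genre."""
    matches = [
        (row[name_idx], float(row[count_idx]))
        for row in dataset
        if row[genre_idx] == genre
    ]
    # Sort from most to fewest reviews before taking the top n.
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:n]


def make_row(name, reviews, genre):
    """Build a 12-column placeholder row with only the columns used here filled in."""
    row = [""] * 12
    row[1], row[5], row[11] = name, reviews, genre
    return row


# Hypothetical sample rows; the review counts are taken from the outputs above.
sample_rows = [
    make_row("Tumblr", "334293", "Social Networking"),
    make_row("Facebook", "2974676", "Social Networking"),
    make_row("Spotify Music", "878563", "Music"),
    make_row("Pinterest", "1061624", "Social Networking"),
]

for name, reviews in top_apps_by_genre(sample_rows, "Social Networking", n=2):
    print(f"{name} : {reviews:,.2f}")
```

With the real data, a call such as `top_apps_by_genre(ios_eng_free_apps, "Weather")` would stand in for one of the cells above.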


The Book genre shows some potential. Currently, there are 12 free English apps in this genre, and only one of them is an adult coloring book. eBook and audiobook apps would require extensive collections to steal the spotlight from Amazon's Kindle and Audible. An adult coloring book seems to be a good option because people would need to spend time in our app: once one picture is done, they can move on to the next one, with no natural stopping point. Later, we could also offer in-app purchases for more colors or more pictures.

In [41]:
# Exploring each genre
for row in ios_eng_free_apps:
    if row[11] == "Book":
        print(f"{row[1]} : {float((row[5])):,.2f}")

Kindle – Read eBooks, Magazines & Textbooks : 252,076.00
Audible – audio books, original series & podcasts : 105,274.00
Color Therapy Adult Coloring Book for Adults : 84,062.00
OverDrive – Library eBooks and Audiobooks : 65,450.00
HOOKED - Chat Stories : 47,829.00
BookShout: Read eBooks & Track Your Reading Goals : 879.00
Dr. Seuss Treasury — 50 best kids books : 451.00
Green Riding Hood : 392.00
Weirdwood Manor : 197.00
MangaZERO - comic reader : 9.00
ikouhoushi : 0.00
MangaTiara - love comic reader : 0.00


### Android popular apps
Unlike the App Store data, our Android data set provides the number of installs. We use this column to find which genre has the most installs on average.

In [42]:
print(android_header.index("Category"))
print(android_header.index("Installs"))

1
5


In [43]:
explore_data(android_eng_free_apps, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In the slices of data above, the install counts are open-ended. For example:

- *Photo Editor & Candy Camera & Grid & ScrapBook* has 10,000+ installs.
- *U Launcher Lite – FREE Live Cool Themes, Hide Apps* has 5,000,000+ installs.
- *Sketch - Draw & Paint* has 50,000,000+ installs.

As we don't have more precise numbers, 10,000+ installs will be treated as 10,000 installs, 5,000,000+ as 5,000,000 installs, and so on. To compute averages, each value needs to be converted to a float, which means removing the "+" and "," characters first.
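
The conversion described above can be sketched as a small function. The name `parse_installs` is hypothetical; the labels come from the three rows shown earlier. Each result is the lower bound of its bucket, not an exact install count.

```python
def parse_installs(raw):
    """Convert an open-ended install label such as '10,000+' to a float."""
    return float(raw.replace("+", "").replace(",", ""))


for label in ["10,000+", "5,000,000+", "50,000,000+"]:
    print(f"{label} -> {parse_installs(label):,.0f}")
```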

In [44]:
# Find average number of installs in each category
android_eng_free_genre = freq_table(android_eng_free_apps, 1)
android_ef_genre_sorted = {}

for genre in android_eng_free_genre:
    total = 0
    len_genre = 0
    for app in android_eng_free_apps:
        genre_app = app[1]
        if genre_app == genre:
            installs = app[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_genre += 1
    avg_user = total / len_genre
    android_ef_genre_sorted.setdefault(genre, avg_user)

display_table(android_ef_genre_sorted)

COMMUNICATION : 38,590,581.0874
VIDEO_PLAYERS : 24,727,872.4528
SOCIAL : 23,253,652.1271
PHOTOGRAPHY : 17,840,110.4023
PRODUCTIVITY : 16,787,331.3449
GAME : 15,544,014.5105
TRAVEL_AND_LOCAL : 13,984,077.7101
ENTERTAINMENT : 11,640,705.8824
TOOLS : 10,830,251.9706
NEWS_AND_MAGAZINES : 9,549,178.4677
BOOKS_AND_REFERENCE : 8,814,199.7884
SHOPPING : 7,036,877.3116
PERSONALIZATION : 5,201,482.6122
WEATHER : 5,145,550.2857
HEALTH_AND_FITNESS : 4,188,821.9853
MAPS_AND_NAVIGATION : 4,049,274.6341
FAMILY : 3,695,641.8198
SPORTS : 3,650,602.2767
ART_AND_DESIGN : 1,986,335.0877
FOOD_AND_DRINK : 1,924,897.7364
EDUCATION : 1,833,495.1456
BUSINESS : 1,712,290.1474
LIFESTYLE : 1,446,158.2238
FINANCE : 1,387,692.4756
HOUSE_AND_HOME : 1,360,598.0423
DATING : 854,028.8303
COMICS : 832,613.8889
AUTO_AND_VEHICLES : 647,317.8171
LIBRARIES_AND_DEMO : 638,503.7349
PARENTING : 542,603.6207
BEAUTY : 513,151.8868
EVENTS : 253,542.2222
MEDICAL : 120,550.6198
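
The cell above rescans the whole data set once per category. As an alternative sketch, a single pass that accumulates a running total and a count per category should give the same averages. The names `average_installs_by_category` and `sample_apps` are hypothetical, and the sample rows are shortened to six columns, with placeholder values where the real columns are not needed.

```python
def average_installs_by_category(dataset, category_idx=1, installs_idx=5):
    """Return {category: average installs} using one pass over the rows."""
    totals = {}
    counts = {}
    for app in dataset:
        category = app[category_idx]
        # "10,000+" is treated as 10000.0, as described earlier.
        installs = float(app[installs_idx].replace("+", "").replace(",", ""))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Shortened sample rows; empty strings are placeholders for unused columns.
sample_apps = [
    ["Photo Editor & Candy Camera & Grid & ScrapBook", "ART_AND_DESIGN", "4.1", "159", "19M", "10,000+"],
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5", "215644", "25M", "50,000,000+"],
    ["WhatsApp Messenger", "COMMUNICATION", "", "", "", "1,000,000,000+"],
]

averages = average_installs_by_category(sample_apps)
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{category} : {avg:,.4f}")
```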


On Google Play, communication apps have the highest average number of installs, followed by video player apps and social apps. However, the figures for these categories are heavily skewed by very popular apps with 100,000,000+ to 1,000,000,000+ installs, for example WhatsApp Messenger, Google Duo, YouTube, VLC for Android, Facebook, and Tumblr.

In [45]:
# Exploring each genre
android_Commu = []

for row in android_eng_free_apps:
    if row[1] == "COMMUNICATION":
        android_Commu.append(row)

for app in android_Commu[0:20]:
    print(f"{app[0]} : {(app[5])}")

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+


In [46]:
# Exploring each genre
android_vdo = []

for row in android_eng_free_apps:
    if row[1] == "VIDEO_PLAYERS":
        android_vdo.append(row)

for app in android_vdo[0:20]:
    print(f"{app[0]} : {(app[5])}")

YouTube : 1,000,000,000+
All Video Downloader 2018 : 1,000,000+
Video Downloader : 10,000,000+
HD Video Player : 1,000,000+
Iqiyi (for tablet) : 1,000,000+
Video Player All Format : 10,000,000+
Motorola Gallery : 100,000,000+
Free TV series : 100,000+
Video Player All Format for Android : 500,000+
VLC for Android : 100,000,000+
Code : 10,000,000+
Vote for : 50,000,000+
XX HD Video downloader-Free Video Downloader : 1,000,000+
OBJECTIVE : 1,000,000+
Music - Mp3 Player : 10,000,000+
HD Movie Video Player : 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
YouTube Studio : 10,000,000+
video player for android : 10,000,000+


In [47]:
# Exploring each genre
android_social = []

for row in android_eng_free_apps:
    if row[1] == "SOCIAL":
        android_social.append(row)

for app in android_social[0:20]:
    print(f"{app[0]} : {(app[5])}")

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Social network all in one 2018 : 100,000+
Pinterest : 100,000,000+
TextNow - free text + calls : 10,000,000+
Google+ : 1,000,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
Telegram X : 5,000,000+
The Video Messenger App : 100,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
LiveMe - Video chat, new friends, and make money : 10,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
Web Browser ( Fast & Secure Web Explorer) : 500,000+
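
The listings above show a handful of apps with 100,000,000+ installs sitting next to many apps with far fewer. As a rough sketch of how much such apps move a category average, the function below recomputes the average after dropping every app at or above a cap. The name `average_installs` and the cap of 100,000,000 are assumptions made for illustration, not part of the original analysis; the sample uses five Communication apps from the listing above.

```python
def average_installs(labels, cap=None):
    """Average install count; optionally ignore apps at or above `cap`."""
    values = [float(label.replace("+", "").replace(",", "")) for label in labels]
    if cap is not None:
        values = [value for value in values if value < cap]
    return sum(values) / len(values) if values else 0.0


# Install labels of five Communication apps from the listing above.
sample_labels = [
    "1,000,000,000+",  # WhatsApp Messenger
    "10,000,000+",     # Messenger for SMS
    "5,000,000+",      # My Tele2
    "500,000,000+",    # Google Duo - High Quality Video Calls
    "1,000,000+",      # TracFone My Account
]

print(f"All apps          : {average_installs(sample_labels):,.0f}")
print(f"Under 100,000,000 : {average_installs(sample_labels, cap=100_000_000):,.0f}")
```

In this small sample, the two largest apps account for almost all of the average, which is the skew described for these categories.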


In [48]:
# Exploring each genre
android_entertainment = []

for row in android_eng_free_apps:
    if row[1] == "ENTERTAINMENT":
        android_entertainment.append(row)

for app in android_entertainment[0:30]:
    print(f"{app[0]} : {(app[5])}")

Complete Spanish Movies : 1,000,000+
Pluto TV - It’s Free TV : 1,000,000+
Mobile TV : 10,000,000+
TV+ : 5,000,000+
Digital TV : 5,000,000+
Motorola Spotlight Player™ : 10,000,000+
Vigo Lite : 5,000,000+
Hotstar : 100,000,000+
Peers.TV: broadcast TV channels First, Match TV, TNT ... : 5,000,000+
The green alien dance : 1,000,000+
Spectrum TV : 5,000,000+
H TV : 5,000,000+
StarTimes - Live International Champions Cup : 1,000,000+
Cinematic Cinematic : 1,000,000+
MEGOGO - Cinema and TV : 10,000,000+
Talking Angela : 100,000,000+
DStv Now : 5,000,000+
ivi - movies and TV shows in HD : 10,000,000+
Radio Javan : 1,000,000+
Talking Ginger 2 : 50,000,000+
Girly Lock Screen Wallpaper with Quotes : 5,000,000+
🔥 Football Wallpapers 4K | Full HD Backgrounds 😍 : 1,000,000+
Movies by Flixster, with Rotten Tomatoes : 10,000,000+
Low Poly – Puzzle art game : 1,000,000+
BBC Media Player : 10,000,000+
Amazon Prime Video : 50,000,000+
Adult Glitter Color by Number Book - Sandbox Pages : 1,000,000+
IMDb M

Next, we take a look at (adult) coloring books, as the iOS data suggests some potential there. The coloring books are spread across different categories. Although there are many free coloring books, they are under the *Family* category, which is for kids. The adult coloring books are mainly in *Entertainment*, and there aren't many of them. The most popular apps in *Entertainment* are streaming or TV apps. The company could introduce a new adult coloring book app to the market and grab some market share.

In [49]:
# Exploring potential for adult coloring book app in Play Store
for row in android_eng_free_apps:
    if "color" in row[0] or "Color" in row[0] or "COLOR" in row[0]:
        print(f"{row[1]} >> {row[0]} : {(row[5])}")

ART_AND_DESIGN >> Pixel Draw - Number Art Coloring Book : 100,000+
ART_AND_DESIGN >> Garden Coloring Book : 1,000,000+
ART_AND_DESIGN >> Mandala Coloring Book : 100,000+
ART_AND_DESIGN >> 3D Color Pixel by Number - Sandbox Art Coloring : 100,000+
ART_AND_DESIGN >> Colorfit - Drawing & Coloring : 500,000+
ART_AND_DESIGN >> Anime Manga Coloring Book : 100,000+
ART_AND_DESIGN >> How To Color Disney Princess - Coloring Pages : 500,000+
BEAUTY >> Colors of white in Urdu : 10,000+
BEAUTY >> Discover Color : 100,000+
COMICS >> Unicorn Pokez - Color By Number : 50,000+
EDUCATION >> Dinosaurs Coloring Pages : 500,000+
EDUCATION >> Cars Coloring Pages : 1,000,000+
ENTERTAINMENT >> Adult Glitter Color by Number Book - Sandbox Pages : 1,000,000+
ENTERTAINMENT >> ColorFul - Adult Coloring Book : 5,000,000+
ENTERTAINMENT >> Colorfy: Coloring Book for Adults - Free : 10,000,000+
HOUSE_AND_HOME >> ColorSnap® Visualizer : 1,000,000+
GAME >> Color Road : 10,000,000+
GAME >> Pixel Art: Color by Number Ga
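
The search above checks three capitalizations of "color" one by one. A possible variant, sketched here under stated assumptions, lowercases the app name before matching and groups the hits by category, which makes it easier to see how many coloring apps each category holds. The names `find_apps_by_keyword` and `sample_apps` are hypothetical, the rows are shortened to six columns with empty placeholders, and, unlike the original cell, this version also matches mixed capitalizations such as "cOLOR".

```python
def find_apps_by_keyword(dataset, keyword, name_idx=0, category_idx=1, installs_idx=5):
    """Group (name, installs) pairs by category for names containing keyword, ignoring case."""
    keyword = keyword.lower()
    grouped = {}
    for row in dataset:
        if keyword in row[name_idx].lower():
            grouped.setdefault(row[category_idx], []).append(
                (row[name_idx], row[installs_idx])
            )
    return grouped


# Shortened sample rows based on the outputs above; empty strings are placeholders.
sample_apps = [
    ["Garden Coloring Book", "ART_AND_DESIGN", "", "", "", "1,000,000+"],
    ["ColorFul - Adult Coloring Book", "ENTERTAINMENT", "", "", "", "5,000,000+"],
    ["Colorfy: Coloring Book for Adults - Free", "ENTERTAINMENT", "", "", "", "10,000,000+"],
    ["Hotstar", "ENTERTAINMENT", "", "", "", "100,000,000+"],
]

for category, apps in find_apps_by_keyword(sample_apps, "color").items():
    print(f"{category}: {len(apps)} app(s)")
    for name, installs in apps:
        print(f"  {name} : {installs}")
```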

# Conclusion
This project analyzes which app profiles are likely to be profitable on both the App Store and the Google Play Store.

We conclude that adult coloring books could be profitable in both stores. The category/genre is not yet dominated by a few companies, and the market is not too niche. Adult coloring books also give the company flexibility, as we could extend our revenue model with in-app purchases, whether to remove ads or to unlock more colors or more pictures.