# Profitable App Profiles for the Apple Store and Google Play Markets

#### Table of Contents
* [Introduction](#1)
* [Load and Explore Dataset](#2)
* [Data Exploration and Cleaning](#3)
	* [Check for Null values](#4)
	* [Check for duplicates](#5)
	* [Explore Apple dataset](#6)
	* [Removal of Non-English Apps](#7)
	* [Isolate FREE Apps](#8)
* [Data Analysis](#9)

#### Introduction <a class='anchor' id='1'></a>

Goal: Analyze App Store and Google Play data to help developers understand which kinds of free, English-language apps attract the most users.<br>
Exclusion criteria: non-English apps and paid apps.

#### Load and Explore Dataset <a class='anchor' id='2'></a>

In [60]:
# Print the rows in dataset[start:end] and, optionally, the row and column counts
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))  

In [61]:
from csv import reader
import re

apple_open = open('D:/DataQuest/AppleStore.csv', encoding="utf8")
read_apple = reader(apple_open)
list_apple = list(read_apple)
google_open = open('D:/DataQuest/googleplaystore.csv', encoding="utf8")
read_google = reader(google_open)
list_google = list(read_google)
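
The cell above leaves both file handles open. A minimal sketch of the same loading step with a context manager is shown below. The helper name `load_csv` is hypothetical (it is not part of the notebook), and the demo writes a tiny temporary file so that it runs without the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows); the file is closed on exit."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a tiny temporary file (made-up content, not the real data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Price\nExample App,0\n")
    demo_path = tmp.name

header, rows = load_csv(demo_path)
os.remove(demo_path)
print(header)
print(rows)
```

Returning the header separately would also remove the need for the repeated `[1:]` slices, at the cost of changing the row indices used later in the notebook.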

In [62]:
explore_data(list_apple, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [63]:
explore_data(list_google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


#### Data Exploration and Cleaning <a class='anchor' id='3'></a>

In [64]:
for row in list_apple[1]:
    print(f"{row} {type(row)}")

284882215 <class 'str'>
Facebook <class 'str'>
389879808 <class 'str'>
USD <class 'str'>
0.0 <class 'str'>
2974676 <class 'str'>
212 <class 'str'>
3.5 <class 'str'>
3.5 <class 'str'>
95.0 <class 'str'>
4+ <class 'str'>
Social Networking <class 'str'>
37 <class 'str'>
1 <class 'str'>
29 <class 'str'>
1 <class 'str'>


In [65]:
for row in list_google[1]:
    print(f"{row} {type(row)}")

Photo Editor & Candy Camera & Grid & ScrapBook <class 'str'>
ART_AND_DESIGN <class 'str'>
4.1 <class 'str'>
159 <class 'str'>
19M <class 'str'>
10,000+ <class 'str'>
Free <class 'str'>
0 <class 'str'>
Everyone <class 'str'>
Art & Design <class 'str'>
January 7, 2018 <class 'str'>
1.0.0 <class 'str'>
4.0.3 and up <class 'str'>


All values are stored as strings (`str`).
Next, check whether each value can be converted to the data type it needs.

In [66]:
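def checkdata_int(dataset, index):
    # Try to convert column `index` of every data row to int; stop at the first failure
    count = 1 # index of the current row in `dataset` (row 0 is the header)
    try:
        for row in dataset[1:]:
            check = int(row[index])
            count+=1
    except ValueError:
        print("Exception INT has occurred.")
        print(row)
        print(count)

def checkdata_float(dataset, index):
    # Same check as above, but for float conversion
    count = 1
    try:
        for row in dataset[1:]:
            check = float(row[index])
            count+=1
    except ValueError:
        print("Exception FLOAT has occurred.")
        print(row)
        print(count)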

In [67]:
checkdata_int(list_apple, 5)
checkdata_float(list_apple, 4)
checkdata_float(list_apple, 7)

In [68]:
checkdata_int(list_google, 3)
checkdata_float(list_google, 2)
checkdata_float(list_google, 7)

Exception INT has occurred.
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473
Exception FLOAT has occurred.
['TurboScan: scan documents and receipts in PDF', 'BUSINESS', '4.7', '11442', '6.8M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'March 25, 2018', '1.5.2', '4.0 and up']
235


INT exception: row 10473 is missing its category, so every later column is shifted one place to the left; its genre is also empty.
FLOAT exception: prices of paid apps carry a `$` sign.
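
A shifted row like this one can also be found directly, because it has fewer columns than the header. The sketch below assumes that a wrong column count is the only symptom being looked for; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (made up): the second data row lacks its category value.
sample = [
    ["App", "Category", "Rating"],
    ["Good App", "TOOLS", "4.1"],
    ["Shifted App", "1.9"],
]
for index, row in find_malformed_rows(sample):
    print(index, row)
```

Unlike the conversion checks above, this reports every malformed row rather than stopping at the first failure.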

In [69]:
categories = []
for row in list_google[1:]:
    if row[1] in categories:
        pass
    else:
        categories.append(row[1])
print(categories)

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9']


Double-check the category values. The stray value '1.9' comes from the shifted row; 'Life Made WI-Fi Touchscreen Photo Frame' most likely belongs under PHOTOGRAPHY.

In [70]:
genres = []
for row in list_google[1:]:
    if row[9] in genres:
        pass
    else:
        genres.append(row[9])
print(genres)

['Art & Design', 'Art & Design;Pretend Play', 'Art & Design;Creativity', 'Art & Design;Action & Adventure', 'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business', 'Comics', 'Comics;Creativity', 'Communication', 'Dating', 'Education;Education', 'Education', 'Education;Creativity', 'Education;Music & Video', 'Education;Action & Adventure', 'Education;Pretend Play', 'Education;Brain Games', 'Entertainment', 'Entertainment;Music & Video', 'Entertainment;Brain Games', 'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink', 'Health & Fitness', 'House & Home', 'Libraries & Demo', 'Lifestyle', 'Lifestyle;Pretend Play', 'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card', 'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports', 'Music', 'Word', 'Racing', 'Casual;Creativity', 'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board', 'Trivia', 'Role Playing', 'Simulation;Education', 'Action;Action & Adventure', 'Casual;Brain Games', 'Simulation;Action & Adven

The matching genre is probably 'Photography'.

In [71]:
list_google[10473].insert(1, 'PHOTOGRAPHY')
list_google[10473][9] = 'Photography'
print(list_google[10473])

['Life Made WI-Fi Touchscreen Photo Frame', 'PHOTOGRAPHY', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Photography', 'February 11, 2018', '1.0.19', '4.0 and up']


In [72]:
checkdata_int(list_google, 3)


In [73]:
# Remove the '$' sign in place and print each cleaned price
for row in list_google[1:]:
    if "$" in row[7]:
        row[7] = row[7].strip('$')
        print(row[7])

4.99
4.99
4.99
4.99
3.99
3.99
6.99
1.49
2.99
3.99
7.99
3.99
3.99
5.99
3.99
3.99
4.99
2.99
3.49
4.99
2.99
3.99
2.99
2.99
2.99
1.99
4.99
4.99
4.99
5.99
6.99
9.99
4.99
3.99
2.99
3.99
2.99
3.99
3.99
4.99
3.99
2.99
7.49
2.99
0.99
0.99
0.99
4.99
2.99
4.99
2.99
4.99
4.99
2.99
2.99
3.99
3.99
2.99
2.99
3.99
3.99
6.99
2.99
9.00
0.99
5.49
9.99
6.99
10.00
3.99
5.99
24.99
11.99
79.99
11.99
2.99
16.99
3.99
2.99
9.99
3.99
14.99
2.99
3.99
2.99
1.00
29.99
2.99
2.99
12.99
4.99
2.99
14.99
5.99
3.49
0.99
2.49
24.99
10.99
1.99
24.99
4.99
3.99
2.99
7.49
1.50
2.99
3.99
1.99
9.99
3.99
3.99
3.99
7.99
14.99
9.99
3.99
19.99
29.99
15.99
0.99
33.99
0.99
79.99
9.00
24.99
9.99
10.00
16.99
11.99
29.99
14.99
74.99
11.99
6.99
5.49
14.99
9.99
33.99
29.99
24.99
12.99
39.99
5.99
2.99
24.99
19.99
2.99
0.99
5.99
0.99
5.99
2.99
5.99
3.95
5.99
29.99
2.49
0.99
0.99
9.99
4.49
5.99
4.49
1.99
1.49
1.99
3.99
1.70
0.99
1.99
0.99
2.99
2.99
0.99
0.99
0.99
2.99
8.99
4.99
2.99
4.99
5.99
39.99
2.49
2.00
1.49
4.99
1.70
1.49
2.99
3.88
0.9

Strip the `$` sign from the prices so they can be converted to floats.
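
The cell above edits the dataset in place. As an alternative sketch, the conversion can be kept in a small function that leaves the original strings untouched. `parse_price` is a hypothetical name, and the function assumes prices look like `'$4.99'` or `'0'`, as in the output above.

```python
def parse_price(text):
    """Convert a price string such as '$4.99' or '0' to a float."""
    return float(text.strip().lstrip("$"))


for raw in ["$4.99", "0", "$79.99"]:
    print(raw, "->", parse_price(raw))
```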

In [74]:
checkdata_float(list_google, 7)

#### Check for Null values <a class='anchor' id='4'></a>

In [75]:
rating = []
list_null = []
count = 1
for row in list_google[1:]:
    if row[2] in rating:
        pass
    else:
        rating.append(row[2])
    if float(row[2]) <1 or float(row[2]) >5:
        print(f"Data in {count} is out of range.")
    if row[2] == 'NaN':
        list_null.append(row)
    count +=1
print(rating)
print(f'There are {len(list_null)} Null values')


['4.1', '3.9', '4.7', '4.5', '4.3', '4.4', '3.8', '4.2', '4.6', '3.2', '4.0', 'NaN', '4.8', '4.9', '3.6', '3.7', '3.3', '3.4', '3.5', '3.1', '5.0', '2.6', '3.0', '1.9', '2.5', '2.8', '2.7', '1.0', '2.9', '2.3', '2.2', '1.7', '2.0', '1.8', '2.4', '1.6', '2.1', '1.4', '1.5', '1.2']
There are 1474 Null values


NaN ratings are present; every other value lies within 1-5.
There are 1474 NaN ratings among the 10841 apps. These rows can be kept for now, since the analysis can be based on installs instead.
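
One detail worth spelling out: `float('NaN')` converts without raising, and every comparison with NaN is false, so the range check in the cell above can never flag a missing rating. That is why the separate `== 'NaN'` test is needed. A small sketch that makes the three cases explicit follows; `rating_status` is a hypothetical helper, not part of the notebook.

```python
import math


def rating_status(text):
    """Classify a rating string as 'missing', 'out_of_range' or 'ok'."""
    value = float(text)  # float('NaN') parses without raising
    if math.isnan(value):
        return "missing"
    if value < 1 or value > 5:
        return "out_of_range"
    return "ok"


for raw in ["4.1", "NaN", "19"]:
    print(raw, "->", rating_status(raw))
```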

In [76]:
for row in list_null[:6]:
    print(row)
    print()

['Mcqueen Coloring pages', 'ART_AND_DESIGN', 'NaN', '61', '7.0M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Action & Adventure', 'March 7, 2018', '1.0.0', '4.1 and up']

['Wrinkles and rejuvenation', 'BEAUTY', 'NaN', '182', '5.7M', '100,000+', 'Free', '0', 'Everyone 10+', 'Beauty', 'September 20, 2017', '8.0', '3.0 and up']

['Manicure - nail design', 'BEAUTY', 'NaN', '119', '3.7M', '50,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 23, 2018', '1.3', '4.1 and up']

['Skin Care and Natural Beauty', 'BEAUTY', 'NaN', '654', '7.4M', '100,000+', 'Free', '0', 'Teen', 'Beauty', 'July 17, 2018', '1.15', '4.1 and up']

['Secrets of beauty, youth and health', 'BEAUTY', 'NaN', '77', '2.9M', '10,000+', 'Free', '0', 'Mature 17+', 'Beauty', 'August 8, 2017', '2.0', '2.3 and up']

['Recipes and tips for losing weight', 'BEAUTY', 'NaN', '35', '3.1M', '10,000+', 'Free', '0', 'Everyone 10+', 'Beauty', 'December 11, 2017', '2.0', '3.0 and up']



#### Check through rest of the columns

In [77]:
# Review counts should never be negative
for row in list_google[1:]:
    if int(row[3]) < 0:
        print("There is a negative review count")

installs = []
for row in list_google[1:]:
    if row[5] in installs:
        pass
    else:
        installs.append(row[5])
print(installs)

num_list = []
for row in list_google[1:]:
    number = float(row[7])
    num_list.append(number)
print(min(num_list))
print(max(num_list))

['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+', '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+', '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+', '10+', '1+', '5+', '0+', '0']
0.0
400.0


Review counts, install buckets, and prices all fall within plausible ranges.

#### Next check for duplicates <a class='anchor' id='5'></a>

In [78]:
unique_google = []
duplicated_google = []
for row in list_google[1:]:
    if row[0] in unique_google:
        duplicated_google.append(row[0])
    else:
        unique_google.append(row[0])
print(len(unique_google))
print(len(list_google[1:]))
print(len(duplicated_google))


9660
10841
1181


There are many duplicate entries (1181). Let's look at the duplicated app names.

In [79]:
for row in sorted(duplicated_google):
    print(row)
    print()

10 Best Foods for You

1800 Contacts - Lens Store

2017 EMRA Antibiotic Guide

21-Day Meditation Experience

365Scores - Live Scores

420 BZ Budeze Delivery

8 Ball Pool

8 Ball Pool

8 Ball Pool

8 Ball Pool

8 Ball Pool

8 Ball Pool

8fit Workouts & Meal Planner

95Live -SG#1 Live Streaming App

A Manual of Acupuncture

A&E - Watch Full Episodes of TV Shows

A&E - Watch Full Episodes of TV Shows

A&E - Watch Full Episodes of TV Shows

AAFP

ABC News - US & World News

AC - Tips & News for Android™

AP Mobile - Breaking News

ASCCP Mobile

ASOS

Accounting App - Zoho Books

AccuWeather: Daily Forecast & Live Weather Reports

Acorns - Invest Spare Change

AdWords Express

Ada - Your Health Guide

Adobe Acrobat Reader

Adobe Acrobat Reader

Adobe Photoshop Express:Photo Editor Collage Maker

Adult Dirty Emojis

Adult Dirty Emojis

Advanced Comprehension Therapy

Agar.io

Airbnb

Airway Ex - Intubate. Anesthetize. Train.

AliExpress - Smarter Shopping, Better Living

AliExpress - Smarter

In [80]:
max_review = {}
for row in list_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Keep the highest review count seen so far for each app name
    if name not in max_review or n_reviews > max_review[name]:
        max_review[name] = n_reviews

print(len(max_review))

9660


Build a dictionary that maps each app name to its highest review count.

In [81]:
cleaned_google = []
added = []
for row in list_google[1:]:
    if float(row[3]) == float(max_review[row[0]]) and row[0] not in added:
        cleaned_google.append(row)
        added.append(row[0])
print(len(cleaned_google))

9660


Use the dictionary to build a list of full rows, keeping one row per app so that the result matches the number of unique names (9660).
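
The two cells above can also be combined into a single pass by storing the whole row in the dictionary instead of only the review count. This is a sketch under that assumption, with a hypothetical helper name and made-up sample rows. Note that the result is ordered by each name's first appearance, which may differ from the order produced by the cells above.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy rows (made up): only the name and review columns matter here.
sample = [
    ["App A", "", "", "10"],
    ["App B", "", "", "5"],
    ["App A", "", "", "25"],
]
print(keep_most_reviewed(sample))
```

Because membership tests on a dictionary do not scan a list, this also avoids the repeated `row[0] not in added` lookups.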

#### Next we explore the Apple dataset <a class='anchor' id='6'></a>

In [82]:
explore_data(list_apple, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


#### Check through the columns

In [83]:
def check_data(index):
    # Collect the unique values in a column of list_apple (as floats when possible)
    values = []
    for row in list_apple[1:]:
        try:
            if float(row[index]) not in values:
                values.append(float(row[index]))
        except ValueError:
            if row[index] not in values:
                values.append(row[index])
    print(values)
    print(f'MIN: {min(values)}')
    print(f'MAX: {max(values)}')

check_data(3) # currency
check_data(4) # price
check_data(5) # rating count total
check_data(7) # user rating

['USD']
MIN: USD
MAX: USD
[0.0, 1.99, 0.99, 6.99, 2.99, 7.99, 4.99, 9.99, 3.99, 8.99, 5.99, 14.99, 13.99, 19.99, 17.99, 15.99, 24.99, 20.99, 29.99, 12.99, 39.99, 74.99, 16.99, 249.99, 11.99, 27.99, 49.99, 59.99, 22.99, 18.99, 99.99, 21.99, 34.99, 299.99, 23.99, 47.99]
MIN: 0.0
MAX: 299.99
[2974676.0, 2161558.0, 2130805.0, 1724546.0, 1126879.0, 1061624.0, 985920.0, 961794.0, 878563.0, 824451.0, 706110.0, 698516.0, 679055.0, 677247.0, 669079.0, 612532.0, 567344.0, 541693.0, 522012.0, 508808.0, 507706.0, 503230.0, 495626.0, 481564.0, 479440.0, 464312.0, 446880.0, 446185.0, 426463.0, 418033.0, 417779.0, 416736.0, 414803.0, 405647.0, 405007.0, 402925.0, 397730.0, 395261.0, 393469.0, 391401.0, 386521.0, 373857.0, 373835.0, 373519.0, 370370.0, 360974.0, 359832.0, 354058.0, 351466.0, 345046.0, 342969.0, 334293.0, 327025.0, 326482.0, 323905.0, 308844.0, 303856.0, 301182.0, 295211.0, 293857.0, 293228.0, 291787.0, 290996.0, 287589.0, 287095.0, 278166.0, 277268.0, 274501.0, 267394.0, 266921.0, 266

These data look okay.

#### Next we check for duplicates

In [84]:
unique_apple = []
duplicated_apple = []
for row in list_apple[1:]:
    if row[1] in unique_apple:
        duplicated_apple.append(row[1])
    else:
        unique_apple.append(row[1])
print(len(unique_apple))
print(len(list_apple[1:]))
print(len(duplicated_apple))

7195
7197
2


In [85]:
print(list_apple[0])
for row in list_apple[1:]:
    if row[1] == 'Mannequin Challenge' or row[1] == 'VR Roller Coaster':
        print(row)
        print()

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']

['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']

The two repeated names have different ids, sizes, and versions, so they are distinct apps that happen to share a name rather than duplicate rows. Both are kept.



#### Removal of Non-English Apps <a class='anchor' id='7'></a>

In [86]:
def english(string):
    for x in string:
        if ord(x) > 127: # character is outside the ASCII range (0-127)
            return False
    return True

english('Instachat 😜')
english("English")

True

In [87]:
cleaned_google_english = []
cleaned_google_nonenglish = []
for x in cleaned_google:
    if english(x[0]) == False:
        cleaned_google_nonenglish.append(x)
    else:
        cleaned_google_english.append(x)
print(len(cleaned_google_nonenglish))
print(len(cleaned_google_english))

542
9118


The filter flags 542 apps as non-English and keeps 9118 as English.
Let's look at the apps flagged as non-English.

In [88]:
for row in cleaned_google_nonenglish[:30]:
    print(row[0])
    print()

U Launcher Lite – FREE Live Cool Themes, Hide Apps

CarMax – Cars for Sale: Search Used Car Inventory

AutoScout24 Switzerland – Find your new car

Zona Azul Digital Fácil SP CET - OFFICIAL São Paulo

ReadEra – free ebook reader

Docs To Go™ Free Office Suite

USPS MOBILE®

Invoice 2go — Professional Invoices and Estimates

Röhrich Werner Soundboard

Manga Net – Best Online Manga Reader

Truyện Vui Tý Quậy

Comic Es - Shojo manga / love comics free of charge ♪ ♪

漫咖 Comics - Manga,Novel and Stories

Tapas – Comics, Novels, and Stories

【Ranobbe complete free】 Novelba - Free app that you can read and write novels

Call Free – Free Call

Xperia Link™

Messenger – Text and Video Chat for Free

Dolphin Browser - Fast, Private & Adblock🐬

Sync.ME – Caller ID & Block

myMail – Email for Hotmail, Gmail and Outlook Mail

Vonage Mobile® Call Video Text

Match™ Dating - Meet Singles

Find Real Love — YouLove Premium Dating

Sudy – Meet Elite & Rich Single

EliteSingles – Dating for Single Profes

Many of these names are mostly English but contain a symbol or an emoji.
Let's exclude an app only when its name has more than three non-ASCII characters.

In [89]:
def english(string):
    count = 0
    for x in string:
        if ord(x) > 127: # character is outside the ASCII range (0-127)
            count += 1
    # Allow up to three non-ASCII characters (symbols, emoji)
    return count <= 3

cleaned_google_english = []
cleaned_google_nonenglish = []
for x in cleaned_google:
    if english(x[0]) == False:
        cleaned_google_nonenglish.append(x)
    else:
        cleaned_google_english.append(x)
print(len(cleaned_google_nonenglish))
print(len(cleaned_google_english))

45
9615
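
To see why the count drops from 542 to 45, it helps to run the threshold rule on a few names taken from the outputs in this notebook. The sketch below restates the rule with a hypothetical helper, `is_mostly_ascii`, whose `max_non_ascii` parameter is an assumption added for illustration.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


for name in ["Docs To Go™ Free Office Suite", "Instachat 😜", "РИА Новости"]:
    print(name, "->", is_mostly_ascii(name))
```

A name with a single trademark sign or emoji now passes, while a name written mostly in another script still fails. The rule remains a heuristic: a non-English name written entirely in ASCII letters would pass it.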


In [90]:
for row in cleaned_google_nonenglish[:30]:
    print(row[0])
    print()

Flame - درب عقلك يوميا

သိင်္ Astrology - Min Thein Kha BayDin

РИА Новости

صور حرف H

L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]

RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템

AJ렌터카 법인 카셰어링

Al Quran Free - القرآن (Islam)

中国語 AQリスニング

日本AV历史

Ay Yıldız Duvar Kağıtları

বাংলা টিভি প্রো BD Bangla TV

Cъновник BG

CSCS BG (в български)

뽕티비 - 개인방송, 인터넷방송, BJ방송

BL 女性向け恋愛ゲーム◆俺プリクロス

SecondSecret ‐「恋を読む」BLノベルゲーム‐

BL 女性向け恋愛ゲーム◆ごくメン

あなカレ【BL】無料ゲーム

감성학원 BL 첫사랑

BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች

BS Calendar / Patro / पात्रो

Vip视频免费看-BT磁力搜索

Билеты ПДД CD 2019 PRO

Offline Jízdní řády CG Transit

Bonjour 2017 Abidjan CI ❤❤❤❤❤

CK 初一 十五

الفاتحون Conquerors

DG ग्राम / Digital Gram Panchayat

DM הפקות



Now the excluded apps appear to be mostly non-English.
Let's apply the same filter to the App Store dataset.

In [91]:
cleaned_apple_english = []
cleaned_apple_nonenglish = []
for x in list_apple[1:]:
    if english(x[1]) == False:
        cleaned_apple_nonenglish.append(x)
    else:
        cleaned_apple_english.append(x)
print(len(cleaned_apple_nonenglish))
print(len(cleaned_apple_english))

1014
6183


In [92]:
for row in cleaned_apple_nonenglish[:30]:
    print(row[1])
    print()

爱奇艺PPS -《欢乐颂2》电视剧热播

聚力视频HD-人民的名义,跨界歌王全网热播

优酷视频

网易新闻 - 精选好内容，算出你的兴趣

淘宝 - 随时随地，想淘就淘

搜狐视频HD-欢乐颂2 全网首播

阴阳师-全区互通现世集结

百度贴吧-全球最大兴趣交友社区

百度网盘

爱奇艺HD -《欢乐颂2》电视剧热播

乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播

万年历-值得信赖的日历黄历查询工具

新浪新闻-阅读最新时事热门头条资讯视频

喜马拉雅FM（听书社区）电台有声小说相声英语

央视影音-海量央视内容高清直播

腾讯视频HD-楚乔传,明日之子6月全网首播

手机百度 - 百度一下你就得到

百度视频HD-高清电视剧、电影在线观看神器

MOMO陌陌-开启视频社交,用直播分享生活

QQ 浏览器-搜新闻、选小说漫画、看视频

同花顺-炒股、股票

聚力视频-蓝光电视剧电影在线热播

快看漫画

乐视视频-白鹿原,欢乐颂,奔跑吧全网热播

酷我音乐HD-无损在线播放

随手记（专业版）-好用的记账理财工具

Dictionary ( قاموس عربي / انجليزي + ودجيت الترجمة)

滴滴出行

高德地图（精准专业的手机地图）

百度HD-极速安全浏览器



#### Isolate FREE apps <a class='anchor' id='8'></a>

In [93]:
final_data_apple = []
final_data_google = []

for x in cleaned_google_english:
    if float(x[7]) == 0:
        final_data_google.append(x)

for x in cleaned_apple_english:
    if float(x[4]) == 0:
        final_data_apple.append(x)

print(len(final_data_google))
print(len(final_data_apple))

8865
3222


In [94]:
for row in final_data_google[:10]:
    print(row)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free',

In [95]:
for row in final_data_apple[:10]:
    print(row)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']
['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']
['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
['553834731', 'Candy Cr

### Data Analysis <a class='anchor' id='9'></a>

#### Number of Apps per Genre/Category

In [96]:
def freq_table(dataset, index):
    freq = {}
    for row in dataset:
        if row[index] in freq:
            freq[row[index]] += 1
        else:
            freq[row[index]] = 1
    total = len(dataset)

    freq_percent = {}
    for x in freq:
        freq_percent[x] = 100* (freq[x] / total)
    
    return freq_percent

print(freq_table(final_data_google, 1))

{'ART_AND_DESIGN': 0.6429780033840948, 'AUTO_AND_VEHICLES': 0.924985899605189, 'BEAUTY': 0.5978567399887197, 'BOOKS_AND_REFERENCE': 2.143260011280316, 'BUSINESS': 4.591088550479413, 'COMICS': 0.6204173716864072, 'COMMUNICATION': 3.2374506486181613, 'DATING': 1.8612521150592216, 'EDUCATION': 1.161872532430908, 'ENTERTAINMENT': 0.9588268471517203, 'EVENTS': 0.7106598984771574, 'FINANCE': 3.699943598420756, 'FOOD_AND_DRINK': 1.2408347433728144, 'HEALTH_AND_FITNESS': 3.0795262267343486, 'HOUSE_AND_HOME': 0.8234630569655951, 'LIBRARIES_AND_DEMO': 0.9362662154540328, 'LIFESTYLE': 3.902989283699944, 'GAME': 9.723632261703328, 'FAMILY': 18.905809362662154, 'MEDICAL': 3.5307388606880994, 'SOCIAL': 2.662154540327129, 'SHOPPING': 2.2447828539199097, 'PHOTOGRAPHY': 2.955442752397067, 'SPORTS': 3.395375070501974, 'TRAVEL_AND_LOCAL': 2.33502538071066, 'TOOLS': 8.460236886632826, 'PERSONALIZATION': 3.3164128595600673, 'PRODUCTIVITY': 3.8917089678511, 'PARENTING': 0.6542583192329385, 'WEATHER': 0.8009

Code to generate frequency table, then the same table in percentages.
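
The same percentage table can be sketched with `collections.Counter` from the standard library. The function name `freq_table_percent` and the sample rows are made up for illustration; this is an alternative formulation, not the notebook's own function.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: 100 * n / total for value, n in counts.items()}


# Toy rows (made up for illustration).
sample = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "FAMILY"]]
table = freq_table_percent(sample, 1)
for value, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value}: {pct:.1f}%")
```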

In [97]:
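def sort_table(table, start, end):
    # Build (value, key) pairs so that sorting orders the table by value
    x_table = []
    for x in table:
        pair = (table[x], x) # avoid shadowing the built-in name `tuple`
        x_table.append(pair)
    sorted_final = sorted(x_table, reverse = True)
    for row in sorted_final[start:end]:
        print(f'{row[1]}: {row[0]}')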

In [98]:
google_table = freq_table(final_data_google, 1)
sort_table(google_table, 0, 5)

FAMILY: 18.905809362662154
GAME: 9.723632261703328
TOOLS: 8.460236886632826
BUSINESS: 4.591088550479413
LIFESTYLE: 3.902989283699944


In [99]:
apple_table = freq_table(final_data_apple, -5)
sort_table(apple_table, 0, 5)

Games: 58.16263190564867
Entertainment: 7.883302296710118
Photo & Video: 4.9658597144630665
Education: 3.662321539416512
Social Networking: 3.2898820608317814


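From the tables, the most common free English apps are Games on the App Store (about 58%) and Family apps on Google Play (about 19%), with Games second on Google Play (about 10%). These figures show how many apps exist in each genre, not how many users they attract.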

#### Most Popular Apps per Genre
Evaluated via the average number of ratings per app in each genre on the App Store

In [100]:
apple_table = freq_table(final_data_apple, -5) #generate percentage table by Genre
genre_rating_table = {}
for genre in apple_table:
    total = 0
    num_of_ratings = 0
    for x in final_data_apple: #while looping through percent table, loop through dataset, add up ratings for each genre
        if genre == x[-5]:
            ratings = float(x[5])
            total += ratings
            num_of_ratings += 1
    avg_ratings = total / num_of_ratings #calculate average
    genre_rating_table[genre] = avg_ratings

sort_table(genre_rating_table, 0, 5)


Navigation: 86090.33333333333
Reference: 74942.11111111111
Social Networking: 71548.34905660378
Music: 57326.530303030304
Weather: 52279.892857142855


The five genres with the highest average rating counts are Navigation, Reference, Social Networking, Music, and Weather.
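
The averages above are computed by rescanning the whole dataset once per genre. A single-pass sketch using `collections.defaultdict` is shown below; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column}, computed in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows (made up): [name, genre, rating_count]
sample = [["a", "Games", "100"], ["b", "Games", "300"], ["c", "Music", "50"]]
print(average_by_group(sample, 1, 2))
```

Keep in mind that a mean can be pulled up by one or two very large apps in a genre, so it is worth looking at the individual apps as well.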

In [101]:
count = 0
for x in final_data_apple:
    if count == 5:
        break
    if x[-5] == "Navigation":
        print(f'{x[1]}: {x[5]}')
        count += 1


Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187


In [102]:
count = 0
for x in final_data_apple:
    if count == 5:
        break
    if x[-5] == "Reference":
        print(f'{x[1]}: {x[5]}')
        count += 1



Bible: 985920
Dictionary.com Dictionary & Thesaurus: 200047
Dictionary.com Dictionary & Thesaurus for iPad: 54175
Google Translate: 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran: 18418


In [103]:
count = 0
for x in final_data_apple:
    if count == 5:
        break
    if x[-5] == "Social Networking":
        print(f'{x[1]}: {x[5]}')
        count += 1


Facebook: 2974676
Pinterest: 1061624
Skype for iPhone: 373519
Messenger: 351466
Tumblr: 334293


##### Evaluated via installs on Google Play
The install counts are strings such as '1,000,000+', so they must be converted to numbers before they can be averaged.
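
As a minimal sketch of that conversion, assuming every value follows the patterns seen in the install list earlier (digits, commas, and an optional trailing `+`), a small helper could look like this. `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to the integer 1000000."""
    return int(text.replace(",", "").replace("+", ""))


for raw in ["1,000,000+", "50+", "0"]:
    print(raw, "->", parse_installs(raw))
```

The resulting numbers are lower bounds of each bucket rather than exact install counts, so the averages built from them are approximate.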

In [104]:
categories_google = freq_table(final_data_google, 1)

installs_table ={}

for category in categories_google:
    total = 0
    len_category = 0
    for app in final_data_google:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    installs_table[category] = avg_n_installs

sort_table(installs_table, 0, 15)

COMMUNICATION: 38456119.167247385
VIDEO_PLAYERS: 24727872.452830188
SOCIAL: 23253652.127118643
PHOTOGRAPHY: 17772022.194656488
PRODUCTIVITY: 16787331.344927534
GAME: 15588015.603248259
TRAVEL_AND_LOCAL: 13984077.710144928
ENTERTAINMENT: 11640705.88235294
TOOLS: 10801391.298666667
NEWS_AND_MAGAZINES: 9549178.467741935
BOOKS_AND_REFERENCE: 8767811.894736841
SHOPPING: 7036877.311557789
PERSONALIZATION: 5201482.6122448975
WEATHER: 5074486.197183099
HEALTH_AND_FITNESS: 4188821.9853479853


In [105]:
count = 0
for x in final_data_google:
    if count == 5:
        break
    if x[1] == "COMMUNICATION":
        print(f'{x[0]}: {x[5]}')
        count += 1

WhatsApp Messenger: 1,000,000,000+
Messenger for SMS: 10,000,000+
My Tele2: 5,000,000+
imo beta free calls and text: 100,000,000+
Contacts: 50,000,000+


In [106]:
count = 0
for x in final_data_google:
    if count == 5:
        break
    if x[1] == "VIDEO_PLAYERS":
        print(f'{x[0]}: {x[5]}')
        count += 1

YouTube: 1,000,000,000+
All Video Downloader 2018: 1,000,000+
Video Downloader: 10,000,000+
HD Video Player: 1,000,000+
Iqiyi (for tablet): 1,000,000+


In [107]:
count = 0
for x in final_data_google:
    if count == 5:
        break
    if x[1] == "SOCIAL":
        print(f'{x[0]}: {x[5]}')
        count += 1

Facebook: 1,000,000,000+
Facebook Lite: 500,000,000+
Tumblr: 100,000,000+
Social network all in one 2018: 100,000+
Pinterest: 100,000,000+


In [108]:
count = 0
for x in final_data_google:
    if count == 15:
        break
    if x[1] == "TOOLS":
        print(f'{x[0]}: {x[5]}')
        count += 1

Google: 1,000,000,000+
Google Translate: 500,000,000+
Moto Display: 10,000,000+
Motorola Alert: 50,000,000+
Motorola Assist: 50,000,000+
Moto Suggestions ™: 1,000,000+
Moto Voice: 10,000,000+
Calculator: 100,000,000+
Device Help: 100,000,000+
Account Manager: 100,000,000+
myMetro: 10,000,000+
File Manager: 50,000,000+
My Telcel: 50,000,000+
Calculator - free calculator, multi calculator app: 10,000,000+
ASUS Sound Recorder: 10,000,000+


In [109]:
count = 0
for x in final_data_google:
    if count == 15:
        break
    if x[1] == "BOOKS_AND_REFERENCE":
        print(f'{x[0]}: {x[5]}')
        count += 1

E-Book Read - Read Book for free: 50,000+
Download free book with green book: 100,000+
Wikipedia: 10,000,000+
Cool Reader: 10,000,000+
Free Panda Radio Music: 100,000+
Book store: 1,000,000+
FBReader: Favorite Book Reader: 10,000,000+
English Grammar Complete Handbook: 500,000+
Free Books - Spirit Fanfiction and Stories: 1,000,000+
Google Play Books: 1,000,000,000+
AlReader -any text book reader: 5,000,000+
Offline English Dictionary: 100,000+
Offline: English to Tagalog Dictionary: 500,000+
FamilySearch Tree: 1,000,000+
Cloud of Books: 1,000,000+


The market for free English apps appears to be dominated by a handful of companies in each category. There may be demand for reference and book apps in both markets.
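
One way to put a number on "dominated by a handful" is the share of installs held by the largest apps in a category. The sketch below is only an illustration: `top_share` is a hypothetical helper, and the sample uses the lower-bound install figures of the five COMMUNICATION apps printed earlier, not the full category.

```python
def top_share(install_counts, k=1):
    """Fraction of total installs held by the k largest entries."""
    ordered = sorted(install_counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Lower-bound installs of the five COMMUNICATION apps listed in the output above.
communication = {
    "WhatsApp Messenger": 1_000_000_000,
    "Messenger for SMS": 10_000_000,
    "My Tele2": 5_000_000,
    "imo beta free calls and text": 100_000_000,
    "Contacts": 50_000_000,
}
share = top_share(communication.values(), k=1)
print(f"Top app share of these five: {share:.1%}")
```

Running the same calculation over every app in a category would show how much of its average install count comes from a few very large apps.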