# Analyzing Mobile App Data

- The goal of this project is to analyze app data to help developers understand which types of apps are likely to attract more users.

## Data Exploration

The analysis uses two datasets:

- A dataset of approximately 10,000 Android apps from Google Play, collected in August 2018.

- A dataset of approximately 7,000 iOS apps from the App Store, collected in July 2017. Details on both datasets follow below.

I start by opening and exploring the two datasets. I created a function that can be used to explore the rows in each dataset.

In [57]:
# Opening the AppleStore data
opened_file = open('AppleStore.csv', 'r')
from csv import reader
read_file = reader(opened_file)
apples_data = list(read_file)
apples_header = apples_data[0]
apples_data = apples_data[1:]


# Opening the googleplaystore data
opened_file = open('googleplaystore.csv', 'r')
read_file = reader(opened_file)
google_data = list(read_file)
google_header = google_data[0]
google_data = google_data[1:]
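
The two loading blocks above repeat the same steps (open, read, split off the header). A minimal sketch of a reusable loader is shown below. `load_csv` is a hypothetical helper name, and the UTF-8 encoding is an assumption about the files rather than something the notebook states.

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`.

    The encoding default is an assumption; adjust it if the file differs.
    """
    # The `with` block closes the file even if reading fails.
    with open(path, "r", encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]


# Hypothetical usage with the notebook's file names:
# apples_header, apples_data = load_csv('AppleStore.csv')
# google_header, google_data = load_csv('googleplaystore.csv')
```

Returning the header and the rows as a pair avoids the intermediate reassignment of `apples_data` and `google_data`.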

In [58]:
def explore_data(dataset, first, last, header=False):
    dataset_slice = dataset[first:last]
    
    for row in dataset_slice:
        print(row)
        print('\n')
        
        if header:
            print('Number of rows:', len(dataset))
            print('Number of columns:', len(dataset[0]))

In [59]:
first_dataset = explore_data(apples_data, 0, 3, True)
print(first_dataset)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16
None
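
The output above repeats the row and column counts after every row because the `if header:` check sits inside the `for` loop, and the final `None` appears because `explore_data` returns nothing, so `print(first_dataset)` prints `None`. A sketch of a variant that prints the summary once is given below; the name `explore_data_once` and the `summary` parameter are illustrative and not part of the original notebook.

```python
def explore_data_once(dataset, first, last, summary=False):
    """Print rows dataset[first:last]; optionally print the size once."""
    for row in dataset[first:last]:
        print(row)
        print()  # blank line between rows

    # This check is outside the loop, so the counts are printed a single time.
    if summary:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


# Small made-up sample, only to show the call
sample = [['a', '1'], ['b', '2'], ['c', '3']]
explore_data_once(sample, 0, 2, summary=True)
```

Because the function only prints, it can be called directly without assigning its result to a variable.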


In [60]:
second_dataset = explore_data(google_data, 2, 8, True)
print(second_dataset)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 10841
Number of columns: 13
['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '

Exploring the column headers of the two datasets

In [21]:
apples_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

## Data Cleaning

### Deleting Wrong Data

- Detect inaccurate data and correct or remove it
- Detect duplicate data and remove the duplicates
- Remove non-English apps such as 爱奇艺PPS -《欢乐颂2》电视剧热播
- Remove apps that aren't free

In [22]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [61]:
for row in google_data:
    if len(row) != len(google_data[0]):
        print(row)
        
        print('\n')
        print('The index position is:', google_data.index(row))
        print('The number of datapoints in this row is:', len(row))
        

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The index position is: 10472
The number of datapoints in this row is: 12


In [62]:
# Deleting the wrong data entry at that index position

del google_data[10472]
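
Deleting by a hard-coded index removes a different row if the cell is run a second time, and `google_data.index(row)` reports only the first matching row. A minimal sketch of an alternative that compares each row against the header length and builds a new list is shown below. `find_malformed_rows` and `drop_rows` are hypothetical helper names.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


def drop_rows(dataset, indexes):
    """Return a new list without the rows at the given indexes."""
    skip = set(indexes)
    return [row for i, row in enumerate(dataset) if i not in skip]


# Small made-up sample: the second row is missing one value
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '3.9'], ['C', 'GAME', '4.5']]

bad = find_malformed_rows(rows, header)
clean = drop_rows(rows, bad)
print(bad)
print(len(clean))
```

Since `drop_rows` leaves the input untouched, rerunning the cell gives the same result each time.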

### Checking for Duplicate Entries

- Some apps have duplicate entries; we inspect these entries and remove them from our datasets.

In [63]:
duplicate_apps = []
unique_apps = []

count = 0
for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
        if count <= 5:
            print(app)
        count+=1
        
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']
['join.me - Simple Meetings', 'BUSINESS', '4.0', '6989', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 16, 2018', '4.3.0.508', '4.4 and up']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018'

In [64]:
print('Expected length:', len(google_data) - 1181)

Expected length: 9659
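
The expected length follows from the counts above: 10,841 rows minus the one deleted row leaves 10,840, and 10,840 minus 1,181 duplicates gives 9,659. The same counts can be obtained with `collections.Counter`, which avoids the repeated `name in unique_apps` list lookups. This is a sketch, and `count_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, number of duplicate rows)."""
    counts = Counter(row[name_index] for row in dataset)
    n_unique = len(counts)
    # Every occurrence beyond the first one of a name is a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Small made-up sample: 'Box' appears three times
sample = [['Box'], ['Slack'], ['Box'], ['Box']]
print(count_duplicates(sample))  # (2, 2)
```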


In [71]:
reviews_max = {}
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
        
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


### Removing Duplicate Entries

In [66]:
# Removing duplicate rows: for each app, keep the row with the most reviews

android_clean = []
already_added = []

for item in google_data:
    name = item[0]
    n_reviews = float(item[3])  # review count of the current row

    # already_added guards against ties (several rows with the same maximum)
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(item)
        already_added.append(name)

print(len(android_clean))

9659
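
The rule used here (for each app name, keep the row with the highest review count) can also be written as a single pass over the data. The sketch below is an alternative formulation, not the notebook's own code; `dedupe_by_max_reviews` is a hypothetical name, and the default column indexes assume the Google Play layout (`App` at 0, `Reviews` at 3).

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per name: the one with the highest review count."""
    best = {}  # name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The strict comparison keeps the first row when counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Small made-up sample with a duplicated 'Box' entry
sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614'],
]
print(len(dedupe_by_max_reviews(sample)))  # 2
```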


In [67]:
applestore_duplicates = []
unique_entries = []

for app in apples_data:
    name = app[0]
    if name in unique_entries:
        applestore_duplicates.append(name)
    else:
        unique_entries.append(name)

In [68]:
print(len(applestore_duplicates))

0


There appear to be no duplicate entries in the App Store data (the check compares the `id` column).

In [72]:
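def string_func(string):
    # Count the characters outside the ASCII range (code point > 127)
    count = 0
    for element in string:
        if ord(element) > 127:
            count += 1
    # Allow up to three such characters (emoji, ™, etc.) before
    # treating the name as non-English
    return count <= 3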

In [70]:
name = string_func('Instachat 😜')
docs = string_func('Docs To Go™ Free Office Suite')
chinese_lang = string_func('爱奇艺PPS -《欢乐颂2》电视剧热播')
new = string_func("Free Office Suite")

print(name)
print(docs)
print(chinese_lang)
print(new)


True
True
False
True


### Filtering Out Non-English Apps

In [75]:
# Filtering out non-English apps from both the Android apps data and the IOS apps data
english_apps = []
non_english_apps = []

for element in google_data:
    name = element[0]
    english_chr_app = string_func(name)
    if english_chr_app is True:
        english_apps.append(element)
    else:
        non_english_apps.append(element)
        
ios_english_apps = []
ios_non_english_apps = []

for item in apples_data:
    name = item[0]
    english_chr_app = string_func(name)
    if english_chr_app is True:
        ios_english_apps.append(item)
    else:
        ios_non_english_apps.append(name)

print(len(ios_english_apps))
print(len(ios_non_english_apps))

7197
0


In [81]:
print('Number of english apps in the android dataset is:', len(english_apps))
print('Number of non_english apps in the android dataset is:', len(non_english_apps))
print('\n')
print('Number of english apps in the ios apps dataset:', len(ios_english_apps))
print('Number of non english apps in the ios apps dataset:', len(ios_non_english_apps))

Number of english apps in the android dataset is: 10795
Number of non_english apps in the android dataset is: 45


Number of english apps in the ios apps dataset: 7197
Number of non english apps in the ios apps dataset: 0
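
A count of 0 non-English iOS apps is what one would expect from the loop as written: it checks `item[0]`, which `apples_header` lists as the numeric `id` column, while the app name is `track_name` at index 1. The Android loop also reads `google_data` rather than the deduplicated `android_clean`. A sketch of a filter that takes the name column as a parameter is given below; `is_english` and `filter_english` are hypothetical names, and no counts are claimed for the real datasets.

```python
def is_english(name, max_non_ascii=3):
    """True if `name` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


def filter_english(dataset, name_index):
    """Return the rows whose name column passes the is_english check."""
    return [row for row in dataset if is_english(row[name_index])]


# Made-up rows in the iOS layout: id at index 0, track_name at index 1
sample = [
    ['1', 'Instachat 😜'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
]
print(len(filter_english(sample, name_index=1)))  # 1

# Hypothetical usage on the notebook's data:
# android_english = filter_english(android_clean, name_index=0)
# ios_english = filter_english(apples_data, name_index=1)
```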


### Isolating the Free Apps

We only build apps that are free to download and install, and our main source of revenue is in-app ads. Our datasets contain both free and non-free apps, so we need to isolate the free apps for our analysis.

In [100]:
# Isolating free apps from non-free for the google dataset
free_apps = []
non_free_apps = []


for element in google_data:
    app_price = element[7]
    if app_price == '0':
        free_apps.append(element)
    else:
        non_free_apps.append(element)
print('Number of free android apps:', len(free_apps))


# Isolating free apps from non free apps for the IOS dataset
free_ios_apps = []
non_free_ios_apps = []

for element in apples_data:
    app_price = element[4]
    if app_price == '0.0':
        free_ios_apps.append(element)
    else:
        non_free_ios_apps.append(element)


print('Number of free IOS apps:', len(free_ios_apps))

Number of free android apps: 10040
Number of free IOS apps: 4056
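
The two loops compare the price against different string literals (`'0'` for Google Play and `'0.0'` for the App Store). A sketch that converts the price to a number first, so one check serves both datasets, is shown below. `is_free` and `split_free` are hypothetical names, and the assumption that paid prices may carry a leading `$` is mine, not something the output above shows.

```python
def is_free(price_text):
    """True if the price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (leading '$' optional).
    """
    return float(price_text.lstrip('$')) == 0.0


def split_free(dataset, price_index):
    """Return (free_rows, paid_rows) based on the price column."""
    free, paid = [], []
    for row in dataset:
        (free if is_free(row[price_index]) else paid).append(row)
    return free, paid


# Made-up rows: name at index 0, price at index 1
sample = [['A', '0'], ['B', '0.0'], ['C', '$4.99']]
free_rows, paid_rows = split_free(sample, price_index=1)
print(len(free_rows), len(paid_rows))  # 2 1
```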


As mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [108]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [179]:
apples_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [176]:
apples_header.index('rating_count_tot')

5

Building frequency tables for the Google Play data using the `Genres` and `Category` columns, and for the App Store data using the `prime_genre` column

In [128]:
# Changing the variable name of google and IOS header columns to free apps for google and IOS

free_apps_header = google_header
free_ios_header = apples_header

In [131]:
free_apps_header.index('Genres')

9

In [129]:
free_ios_header.index('prime_genre')

11

In [136]:
free_apps_header.index('Category')

1

In [147]:
def freq_table(dataset, index):
    genres = {}
    total_number_of_apps = 0
    
    for element in dataset:
        total_number_of_apps += 1
        genre = element[index]
        
        if genre in genres:
            genres[genre] += 1
        else:
            genres[genre] = 1
            
    genres_percent = {}        
    for key in genres:
        percentage = (genres[key] / total_number_of_apps) * 100
        genres_percent[key] = round(percentage, 3)
    
    return genres_percent


    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_tuple = (table[key], key) # takes each key-value pair and stores it as a tuple
        table_display.append(key_value_tuple)
    
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], entry[0])

### Analyzing the `prime_genre` Column of the App Store Data

In [159]:
prime_genre = display_table(free_ios_apps, 11)

Games 55.646
Entertainment 8.235
Photo & Video 4.117
Social Networking 3.526
Education 3.254
Shopping 2.983
Utilities 2.687
Lifestyle 2.318
Finance 2.071
Sports 1.948
Health & Fitness 1.874
Music 1.652
Book 1.627
Productivity 1.529
News 1.43
Travel 1.381
Food & Drink 1.06
Weather 0.764
Reference 0.493
Navigation 0.493
Business 0.493
Catalogs 0.222
Medical 0.197


- We can see that more than half (55.65%) of the free iOS apps are games. Entertainment apps come next at about 8%, followed by photo and video apps at about 4%. Social networking apps account for 3.53% of the apps in our dataset, and education apps for 3.25%.
- Apps with recreational purposes (games, entertainment, photo and video, social networking, sports, music, etc.) predominate in the App Store (at least the portion of it that offers free English apps), while apps with more practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are less common. However, just because there are more enjoyable apps available doesn't necessarily mean that there are more users as well; demand and supply might not line up perfectly.

### Analyzing the Frequency Tables for the Google Play `Genres` and `Category` Columns

In [160]:
genre = display_table(free_apps, 9)

Tools 7.61
Entertainment 6.016
Education 5.169
Business 4.442
Productivity 3.944
Sports 3.725
Lifestyle 3.606
Communication 3.586
Medical 3.526
Finance 3.476
Action 3.396
Health & Fitness 3.237
Photography 3.118
Personalization 3.078
Social 2.908
News & Magazines 2.799
Shopping 2.57
Travel & Local 2.44
Dating 2.261
Books & Reference 2.022
Arcade 1.992
Simulation 1.902
Casual 1.833
Video Players & Editors 1.683
Maps & Navigation 1.315
Food & Drink 1.245
Puzzle 1.205
Racing 0.946
Strategy 0.936
House & Home 0.876
Role Playing 0.867
Libraries & Demo 0.837
Auto & Vehicles 0.817
Weather 0.737
Events 0.627
Adventure 0.627
Comics 0.588
Art & Design 0.548
Beauty 0.528
Parenting 0.438
Education;Education 0.438
Card 0.408
Trivia 0.378
Educational;Education 0.378
Casino 0.378
Board 0.349
Educational 0.329
Word 0.289
Entertainment;Music & Video 0.269
Casual;Pretend Play 0.249
Music 0.209
Casual;Action & Adventure 0.199
Racing;Action & Adventure 0.189
Puzzle;Brain Games 0.169
Educational;Pretend Pl

In [161]:
# Index 1 is the Category column of the Google Play data
# (in the iOS rows, index 1 is track_name, not a category)
category = display_table(free_apps, 1)

- The landscape on Google Play looks quite different: there are not as many games, and most apps seem to be made for family, tools, business, lifestyle, productivity, etc. On closer inspection, the family category (which includes over 19% of all apps) consists primarily of children's games.

### Analyzing Most Popular Apps by Genre

- Now we want to determine which kinds of apps have the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each genre. The Google Play dataset provides this in its `Installs` column, but the App Store dataset has no such column, so we use the total number of user ratings (`rating_count_tot`) as a proxy.

In [173]:
# Redefining freq_table to return raw counts per value instead of percentages.
# display_table calls freq_table, so from here on it prints counts as well.

def freq_table(dataset, index):
    prime_genre_table = {}

    for element in dataset:
        genre = element[index]

        if genre in prime_genre_table:
            prime_genre_table[genre] += 1
        else:
            prime_genre_table[genre] = 1

    return prime_genre_table

In [191]:
prime_freq = freq_table(free_ios_apps, 11)

for genre in prime_freq:
    total = 0
    len_genre = 0
    for element in free_ios_apps:
        genre_app = element[11]
        if genre_app == genre:
            # rating_count_tot is the number of ratings, not the rating score
            n_ratings = float(element[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75
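
The nested loop above scans the whole dataset once per genre. The same per-genre averages can be computed in one pass by accumulating a running total and a count for each genre. The sketch below assumes nothing beyond the row layout; `average_by_group` is a hypothetical helper name.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index, parse=float):
    """Return {group: mean of parsed values} computed in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += parse(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: genre at index 0, rating count at index 1
sample = [['Games', '100'], ['Games', '200'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 150.0, 'Music': 50.0}

# Hypothetical usage on the notebook's data:
# average_by_group(free_ios_apps, group_index=11, value_index=5)
```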


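On average, Reference apps have the most user ratings (about 67,448), followed by Music (about 56,482) and Social Networking (about 53,078). Music ranks near the top, which is why I recommend the Music app profile for the App Store.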

In [192]:
display_table(free_apps, 5) # The Installs columns



1,000,000+ 1555
10,000,000+ 1249
100,000+ 1079
10,000+ 925
1,000+ 758
5,000,000+ 752
100+ 623
500,000+ 527
50,000+ 436
5,000+ 410
100,000,000+ 409
10+ 316
500+ 290
50,000,000+ 289
50+ 171
500,000,000+ 72
5+ 70
1,000,000,000+ 58
1+ 46
0+ 4
0 1


Computing the average number of installs per app category for the Google Play dataset

In [214]:
category_unique = freq_table(free_apps, 1)

for category in category_unique:
    total = 0
    len_category = 0
    for element in free_apps:
        category_app = element[1]
        if category_app == category:
            n_installs = element[5]
            n_installs = float(n_installs.replace('+', '').replace(',', ''))
            total += n_installs
            len_category += 1
    average_installs = total / len_category
    print(category, ': ',average_installs)
            
           

ART_AND_DESIGN :  2005195.1612903227
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  9465252.512315271
BUSINESS :  2245520.3811659194
COMICS :  934769.1666666666
COMMUNICATION :  90683100.55833334
DATING :  1164270.7356828193
EDUCATION :  5729276.315789473
ENTERTAINMENT :  19516734.69387755
EVENTS :  253542.22222222222
FINANCE :  2511355.6790830945
FOOD_AND_DRINK :  2190710.008
HEALTH_AND_FITNESS :  4869225.852307692
HOUSE_AND_HOME :  1917187.0568181819
LIBRARIES_AND_DEMO :  749950.119047619
LIFESTYLE :  1477863.44077135
GAME :  33048939.16116871
FAMILY :  5742274.952835485
MEDICAL :  147563.28813559323
SOCIAL :  48184458.56849315
SHOPPING :  12588522.03488372
PHOTOGRAPHY :  32218111.54952077
SPORTS :  4860918.563888889
TRAVEL_AND_LOCAL :  27921561.32520325
TOOLS :  14968685.586928105
PERSONALIZATION :  7508854.330097088
PRODUCTIVITY :  35794644.73232323
PARENTING :  542603.6206896552
WEATHER :  5747142.162162162
VIDEO_PLAYERS :  36385565.6140
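
The `Installs` values are strings such as `'1,000,000+'`, so they need to be cleaned before averaging. The conversion used in the loop can be isolated as a small function, as in the sketch below (`parse_installs` is a hypothetical name). Note that `'1,000,000+'` means at least 1,000,000 installs, so averages built from these values are lower bounds rather than exact figures.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    # Drop the trailing '+' and the thousands separators before converting.
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```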

Communication apps have the highest average number of installs, at about 90,683,101. A few apps (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) with over one billion installs, and a few others with between 100 million and 500 million installs, heavily skew this average: