# Profitable App Profiles for the App Store and Google Play Markets



In this data project, our objective is to identify mobile app profiles that are profitable in the App Store and Google Play markets. As data analysts for a company specializing in Android and iOS mobile app development, our role is to help our team of developers make data-driven decisions when creating new apps.

At our company, we exclusively focus on developing apps that are free to download and install, relying primarily on in-app advertisements as our main source of revenue. Consequently, the number of users who engage with our apps heavily influences our earnings. Therefore, the aim of this project is to conduct data analysis that will assist our developers in comprehending the types of apps likely to attract a larger user base.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. (Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/))

To avoid spending resources on collecting new data ourselves, we should first check whether any relevant data already exists at no cost. Luckily, there are two data sets that seem suitable for our goals:

- [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [27]:
from csv import reader

### The Google Play data set ###
# encoding='utf8' is set explicitly because some app names contain non-ASCII characters
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))

### The iOS App data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))

We will create a function to explore a data set:

In [28]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's take a look at the iOS App data set.

In [29]:
explore_data(ios, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


We have 7197 iOS apps in this data set (excluding the header row), and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', and 'prime_genre'. Not all column names are self-explanatory in this case, but details about each column can be found in the [data set documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

Next, let's take a look at the Google Play data set.

In [30]:
explore_data(android, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


We see that the Google Play data set has 10841 apps (10842 rows including the header row) and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

## Data Cleaning

Before we begin the analysis, let's go over the data to ensure it is clean (i.e., standardised, free of duplicates, etc.).

Let's start by creating a function that lists all unique values in each column of the data table, so that we can identify any anomalies in each column.

In [31]:
def unique(data):
    total_rows = len(data)
    num_columns = len(data[0])

    # Iterate over each column
    for col_index in range(num_columns):
        # Extract the values of the column from the data table, skipping the header row
        column_values = [row[col_index] for row in data[1:] if col_index < len(row)]
            
        # Count the occurrences of each unique value in the column
        unique_counts = {}
        for value in column_values:
            if value in unique_counts:
                unique_counts[value] += 1
            else:
                unique_counts[value] = 1

        num_unique_values = len(unique_counts)

        # Columns whose distinct values number at least 10% of the rows are only
        # summarised (the threshold is 0.1, even though the message below says "90%")
        if num_unique_values >= 0.1 * (total_rows):
            print(f"Over 90% values are unique in column {data[0][col_index]}")
        else:
            print(f"Column '{data[0][col_index]}'")
            # Print the unique values and their respective counts
            for key, value in unique_counts.items():
                print(key, ':', value)       
        
        print('\n')
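
The same per-column counting can also be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative, not the function used in this notebook: the helper name `column_value_counts` and the tiny `sample` table are assumptions made for the example.

```python
from collections import Counter

def column_value_counts(rows):
    """Map each header name to a Counter of the values found in that column."""
    header, body = rows[0], rows[1:]
    counts = {}
    for col_index, name in enumerate(header):
        # Rows that are too short to contain this column are skipped
        counts[name] = Counter(row[col_index] for row in body if col_index < len(row))
    return counts

# Hypothetical miniature table with a header row, for illustration only
sample = [
    ['currency', 'price'],
    ['USD', '0.0'],
    ['USD', '1.99'],
    ['USD', '0.0'],
]
sample_counts = column_value_counts(sample)
print(sample_counts['price'].most_common())
```

Returning the counts instead of printing them makes it easier to check a single column, for example `sample_counts['price']['0.0']`.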

Let's use the function on the iOS Apps data.

#### Data cleaning of the iOS data set

In [32]:
unique(ios)

Over 90% values are unique in column id


Over 90% values are unique in column track_name


Over 90% values are unique in column size_bytes


Column 'currency'
USD : 7197


Column 'price'
0.0 : 4056
1.99 : 621
0.99 : 728
6.99 : 166
2.99 : 683
7.99 : 33
4.99 : 394
9.99 : 81
3.99 : 277
8.99 : 9
5.99 : 52
14.99 : 21
13.99 : 6
19.99 : 13
17.99 : 3
15.99 : 4
24.99 : 8
20.99 : 2
29.99 : 6
12.99 : 5
39.99 : 2
74.99 : 1
16.99 : 2
249.99 : 1
11.99 : 6
27.99 : 2
49.99 : 2
59.99 : 3
22.99 : 2
18.99 : 1
99.99 : 1
21.99 : 1
34.99 : 1
299.99 : 1
23.99 : 2
47.99 : 1


Over 90% values are unique in column rating_count_tot


Over 90% values are unique in column rating_count_ver


Column 'user_rating'
3.5 : 702
4.5 : 2663
4.0 : 1626
3.0 : 383
5.0 : 492
2.5 : 196
2.0 : 106
1.5 : 56
1.0 : 44
0.0 : 929


Column 'user_rating_ver'
3.5 : 533
4.0 : 1237
4.5 : 2205
5.0 : 964
3.0 : 304
0.0 : 1443
2.5 : 176
1.5 : 74
2.0 : 136
1.0 : 125


Over 90% values are unique in column ver


Column 'cont_rating'
4+ : 4433
12

From our function, we can see that only 4056 of the 7197 apps in the data set are free. Let's isolate those before we move further.

#### Isolating free apps

In [33]:
# The ios data price column (index 4) uses '0.0' as a string to represent free apps
print('Original data:', len(ios), 'rows')
ios = [row for row in ios if row[4] == '0.0']
print('Free apps:', len(ios),'rows')

Original data: 7198 rows
Free apps: 4056 rows


#### Isolating English-only apps
Next, let's make sure our iOS data only contains English apps by inspecting all of the names in the data set.

In [34]:
for row in ios:
    name = row[1]
    print(name)

Facebook
Instagram
Clash of Clans
Temple Run
Pandora - Music & Radio
Pinterest
Bible
Candy Crush Saga
Spotify Music
Angry Birds
Subway Surfers
Solitaire
CSR Racing
Crossy Road - Endless Arcade Hopper
Injustice: Gods Among Us
Hay Day
PAC-MAN
Calorie Counter & Diet Tracker by MyFitnessPal
DragonVale
The Weather Channel: Forecast, Radar & Alerts
Head Soccer
Google – Search made just for mobile
Despicable Me: Minion Rush
The Sims™ FreePlay
Google Earth
Sonic Dash
Groupon - Deals, Coupons & Discount Shopping App
8 Ball Pool™
Tiny Tower - Free City Building
Jetpack Joyride
Bike Race - Top Motorcycle Racing Games
Shazam - Discover music, artists, videos & lyrics
Kim Kardashian: Hollywood
Trivia Crack
WordBrain
Sniper 3D Assassin: Shoot to Kill Gun Game
Flow Free
Lose It! – Weight Loss Program and Calorie Counter
Skype for iPhone
Geometry Dash Lite
▻Sudoku
Twitter
Messenger
Waze - GPS Navigation, Maps & Real-time Traffic
Zillow Real Estate - Homes for Sale & for Rent
Tumblr
Fruit Ninja®
Snapch

From a quick glance at the above result, we can see there are some non-English (Chinese and Japanese) apps in the data set.

The characters commonly used in English text all belong to the ASCII standard. Each ASCII character corresponds to a number between 0 and 127, so we can use that range to flag the rows whose names contain other characters.
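
As a small illustration of the ASCII idea, the built-in `ord()` returns the code point of a single character, and a name is pure ASCII when every code point is at most 127. The helper name `is_ascii` is an assumption made for this sketch.

```python
def is_ascii(text):
    """Return True when every character's code point is in the ASCII range 0-127."""
    return all(ord(char) <= 127 for char in text)

# Code points of an ASCII letter, a trademark sign and a Chinese character
print(ord('a'), ord('™'), ord('爱'))

print(is_ascii('Instagram'))
print(is_ascii('Pokémon GO'))
```

Note that a strict check like this also rejects English names that contain a single accented letter or symbol.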

In [35]:
# Print the apps whose name (column index 1) contains at least one non-ASCII character
for row in ios:
    string = row[1]
    non_english = False
    for char in string:
        if ord(char) > 127:
            non_english = True
    if non_english:
        print(row[1])

Google – Search made just for mobile
The Sims™ FreePlay
8 Ball Pool™
Lose It! – Weight Loss Program and Calorie Counter
▻Sudoku
Fruit Ninja®
iHeartRadio – Free Music & Radio Stations
The Simpsons™: Tapped Out
Plants vs. Zombies™ 2
Pokémon GO
Star Wars™: Commander
Kindle – Read eBooks, Magazines & Textbooks
Chase Mobile℠
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking
Call of Duty®: Heroes
ooVoo – Free Video Call, Text and Voice
The Secret Society® - Hidden Mystery
Viber Messenger – Text & Call
Words with Friends – Best Word Game
Jurassic World™: The Game
Flashlight Ⓞ
▻Solitaire
Guess My Age  Math Magic
Tetris® Blitz
Star Wars™: Galaxy of Heroes
Bubble Mania™
Big Fish Casino – Best Vegas Slot Machines & Games
⋆Solitaire
Audible – audio books, original series & podcasts
DoubleDown Casino & Slots  – Vegas Slot Machines!
Walgreens – Pharmacy, Photo, Coupons and Shopping
ABC – Watch Live TV & Stream Full Episodes
QuizUp™
UNO ™ & Friends
Solitaire·
Over

The for loop seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) which fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.
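
One way to tolerate such symbols is to allow a small number of non-ASCII characters per name. The sketch below wraps that idea in a hypothetical helper, `is_english`; the helper name and the default threshold of 3 are illustrative assumptions, and the heuristic can still misclassify very short non-English names.

```python
def is_english(name, max_non_ascii=3):
    """Treat a name as English unless it has more than max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_english('The Sims™ FreePlay'))
print(is_english('优酷视频'))
```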


In [36]:
# To minimize the impact of data loss, let's change it to only print out an app 
# if its name has more than three non-ASCII characters
for row in ios:
    string = row[1]
    non_english = 0
    for char in string:
        if ord(char) > 127:
            non_english += 1
    if non_english > 3:
        print(row[1])

爱奇艺PPS -《欢乐颂2》电视剧热播
聚力视频HD-人民的名义,跨界歌王全网热播
优酷视频
网易新闻 - 精选好内容，算出你的兴趣
淘宝 - 随时随地，想淘就淘
搜狐视频HD-欢乐颂2 全网首播
阴阳师-全区互通现世集结
百度贴吧-全球最大兴趣交友社区
百度网盘
爱奇艺HD -《欢乐颂2》电视剧热播
乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播
万年历-值得信赖的日历黄历查询工具
新浪新闻-阅读最新时事热门头条资讯视频
喜马拉雅FM（听书社区）电台有声小说相声英语
央视影音-海量央视内容高清直播
腾讯视频HD-楚乔传,明日之子6月全网首播
手机百度 - 百度一下你就得到
百度视频HD-高清电视剧、电影在线观看神器
MOMO陌陌-开启视频社交,用直播分享生活
QQ 浏览器-搜新闻、选小说漫画、看视频
同花顺-炒股、股票
聚力视频-蓝光电视剧电影在线热播
快看漫画
乐视视频-白鹿原,欢乐颂,奔跑吧全网热播
酷我音乐HD-无损在线播放
滴滴出行
高德地图（精准专业的手机地图）
百度HD-极速安全浏览器
美丽说-潮流穿搭快人一步
百度地图-智能的手机导航，公交地铁出行必备
Majiang Mahjong（单机+川麻+二人+武汉+国标）
土豆视频HD—高清影视综艺视频播放器
360手机卫士-超安全的来电防骚扰助手
QQ浏览器HD-极速搜索浏览器
搜狗输入法-Sogou Keyboard
百度网盘 HD
大众点评-发现品质生活
讯飞输入法-智能语音输入和表情斗图神器
美柚 - 女生助手
爱奇艺 - 电视剧电影综艺娱乐视频播放器
搜狐视频-欢乐颂2 全网首播
百度地图HD
QQ同步助手-新机一键换机必备工具
QQ音乐-来这里“发现・音乐”
腾讯新闻-头条新闻热点资讯掌上阅读软件
土豆（短视频分享平台）
风行视频+ HD - 电影电视剧体育视频播放器
YY- 小全民手机直播交友软件
腾讯视频-欢乐颂2全网首播
中华万年历-2亿用户首选的日历软件
央视影音HD-海量央视内容高清直播
蘑菇街-网红直播搭配的购物特卖平台
Keep - 移动健身教练 自由运动场
美团 - 吃喝玩乐全都有
百度贴吧HD
腾讯手机管家-拦截骚扰电话的QQ安全助手
Color•多彩手帐
饿了么外卖-大牌美食，折扣热卖
宝宝树孕育-火爆的备孕怀孕育儿社区
懂球帝 - 足球迷必备神器
今日头条 - 热点新闻资讯、

Now we need to isolate the non-English apps and remove them from our data.

In [37]:
# Collect the IDs of apps whose names contain more than 3 non-ASCII characters

id_to_remove = []

for row in ios:
    string = row[1]
    app_id = row[0]  # avoid shadowing the built-in id()
    non_english = 0
    for char in string:
        if ord(char) > 127:
            non_english += 1
    if non_english > 3:
        id_to_remove.append(app_id)

print('Free apps data:',len(ios),'rows')

ios = [row for row in ios if row[0] not in id_to_remove]
        
print('Free English-only app data:', len(ios),'rows')

Free apps data: 4056 rows
Free English-only app data: 3222 rows


Finally, let's check for duplicated apps, as our unique function did not show us whether there are any in the data set.

#### Checking for and removing duplicated apps

In [38]:
unique_apps = []
duplicated_apps = []

for app in ios:
    name = app[1]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of unique apps:', len(unique_apps))
print('\n')
print('Number of duplicated apps:', len(duplicated_apps))
print(duplicated_apps)

Number of unique apps: 3220


Number of duplicated apps: 2
['Mannequin Challenge', 'VR Roller Coaster']


In [39]:
for app in ios:
    name = app[1]
    if name in duplicated_apps:
        print(app)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Column index 5 is 'rating_count_tot'. The row with the higher value has more ratings in total, so it is most likely the more recent entry, and we will keep it.
We will delete the rows with IDs 1178454060 and 1089824278.
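
The rule described here (keep the entry with the most ratings for each app name) can be sketched as a small function. The helper name `keep_most_rated` and the shortened sample rows, which hold only the first six columns of the four duplicated entries, are assumptions made for illustration.

```python
def keep_most_rated(rows, name_index=1, count_index=5):
    """Keep, for each app name, the row with the largest rating count."""
    best = {}
    for row in rows:
        name = row[name_index]
        if name not in best or int(row[count_index]) > int(best[name][count_index]):
            best[name] = row
    return list(best.values())

# First six columns of the duplicated rows shown above
duplicates = [
    ['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668'],
    ['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107'],
    ['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105'],
    ['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67'],
]
kept_ids = [row[0] for row in keep_most_rated(duplicates)]
print(kept_ids)
```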

In [40]:
print('Before removing duplicates:', len(ios), 'rows')

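duplicated_index = [index for index, row in enumerate(ios) if row[0] in ('1089824278', '1178454060')]
# Delete from the highest index down so that earlier deletions don't shift the remaining indices
for index in sorted(duplicated_index, reverse=True):
    del ios[index]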
        
print('After removing duplicates:', len(ios),'rows')

Before removing duplicates: 3222 rows
After removing duplicates: 3220 rows


Based on the [Kaggle page](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) as well as our unique function, the rest of the iOS data columns seem to be free of anomalies.

Let's repeat the above data cleaning method on the Google Play data set.

#### Data cleaning on the Play Store data set

In [41]:
unique(android)

Over 90% values are unique in column App


Column 'Category'
ART_AND_DESIGN : 65
AUTO_AND_VEHICLES : 85
BEAUTY : 53
BOOKS_AND_REFERENCE : 231
BUSINESS : 460
COMICS : 60
COMMUNICATION : 387
DATING : 234
EDUCATION : 156
ENTERTAINMENT : 149
EVENTS : 64
FINANCE : 366
FOOD_AND_DRINK : 127
HEALTH_AND_FITNESS : 341
HOUSE_AND_HOME : 88
LIBRARIES_AND_DEMO : 85
LIFESTYLE : 382
GAME : 1144
FAMILY : 1972
MEDICAL : 463
SOCIAL : 295
SHOPPING : 260
PHOTOGRAPHY : 335
SPORTS : 384
TRAVEL_AND_LOCAL : 258
TOOLS : 843
PERSONALIZATION : 392
PRODUCTIVITY : 424
PARENTING : 60
WEATHER : 82
VIDEO_PLAYERS : 175
NEWS_AND_MAGAZINES : 283
MAPS_AND_NAVIGATION : 137
1.9 : 1


Column 'Rating'
4.1 : 708
3.9 : 386
4.7 : 499
4.5 : 1038
4.3 : 1076
4.4 : 1109
3.8 : 303
4.2 : 952
4.6 : 823
3.2 : 64
4.0 : 568
NaN : 1474
4.8 : 234
4.9 : 87
3.6 : 174
3.7 : 239
3.3 : 102
3.4 : 128
3.5 : 163
3.1 : 69
5.0 : 274
2.6 : 25
3.0 : 83
1.9 : 13
2.5 : 21
2.8 : 42
2.7 : 25
1.0 : 16
2.9 : 45
2.3 : 20
2.2 : 14
1.7 : 8
2.0 : 12
1.8 : 8
2.4 

From the function, we can see the following issues:

- In the 'Rating' column (index 2), there are entries of '19' and 'NaN', which need removing
- In the 'Installs' column (index 5), there are entries of 'Free' and '0', which need removing
- In the 'Type' column (index 6), there are entries of 'NaN' and '0'; we need to check whether these apps are 'Paid' or 'Free', otherwise they will be removed
- In the 'Content Rating' column (index 8), there is an entry of ' ', which we will remove
- Finally, we will isolate the free apps for our analysis

In [42]:
# Creating lists of indices of rows with problematic data
rating_list = [index for index, row in enumerate(android) if row[2] == '19' or row[2] == 'NaN']
install_list = [index for index, row in enumerate(android) if row[5] == 'Free' or row[5] == '0']
type_list = [index for index, row in enumerate(android) if row[6] == 'NaN' or row[6] == '0']
content_rating_list = [index for index, row in enumerate(android) if row[8] == ' ']

In [43]:
# Observing potential issues in the type_list
for index in type_list:
    print(android[index])

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


From the above, it seems that 'Command & Conquer: Rivals' is a free app (its price is '0') but is labelled 'NaN' in the 'Type' column.
As for the app 'Life Made WI-Fi Touchscreen Photo Frame', the 'Category' value appears to be missing, which shifts every later value one column to the left; that is why there is a 'Rating' of 19 and 'Free' under the 'Installs' column. For now, we will put it under the category of 'Photography'.
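
A row with a missing value like this one can also be spotted by comparing its length with the header's. The following is a minimal sketch; the helper name `find_misaligned_rows` is an assumption, and the sample reuses the header and the two rows printed above.

```python
def find_misaligned_rows(rows):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(rows[0])
    return [(index, len(row)) for index, row in enumerate(rows) if len(row) != expected]

sample = [
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
     'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'],
    ['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN',
     '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device',
     'Varies with device'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0',
     'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]
misaligned = find_misaligned_rows(sample)
print(misaligned)
```

In this sample only the second data row is reported, since it has 12 values instead of 13.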

In [44]:
type_list # the indices of the above mentioned rows

[9149, 10473]

In [45]:
print('Before')
print(android[9149])
android[9149][6]='Free' # Setting the missing 'Type' value to 'Free'
print('After')
print(android[9149])
print('\n')
print('Before')
print(android[10473])
android[10473].insert(1, 'PHOTOGRAPHY') # Inserting 'Photography' to the Category column
print('After')
print(android[10473])

Before
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
After
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'Free', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


Before
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
After
['Life Made WI-Fi Touchscreen Photo Frame', 'PHOTOGRAPHY', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


With the above amended, the only rows we still need to remove from the Android data are those without a rating. After that, we will isolate the free apps.

#### Deleting apps with no rating and Isolating free apps

In [46]:
print('Amended Android data:', len(android), 'rows')

rating_list = [index for index, row in enumerate(android) if row[2] == 'NaN']
for index in sorted(rating_list, reverse=True):
    del android[index]
    
android = [row for row in android if row[6] == 'Free']
print('Free Android Apps data:', len(android), 'rows')

Amended Android data: 10842 rows
Free Android Apps data: 8720 rows


#### Checking for duplicates

In [53]:
unique_apps = []
duplicated_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of unique apps:', len(unique_apps))
print('\n')
print('Number of duplicated apps:', len(duplicated_apps))
print('Example of duplicated apps:', duplicated_apps[:10])

Number of unique apps: 7595


Number of duplicated apps: 1125
Example of duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We will apply the same principle as we did for the iOS data and keep the entry with the highest count for each app. However, with over 1,000 duplicated entries in the Android data, we can't simply pick out a few indices as we did earlier.

Let's create a dictionary that maps each app name to its highest number of reviews:

In [48]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,125 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,125.

In [49]:
print('Expected length:', len(android) - 1125)
print('Actual length:', len(reviews_max))

Expected length: 7595
Actual length: 7595


In [50]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) 

In [51]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 7595
Number of columns: 13


In [52]:
print(len(android_clean))
print(len(ios))

7595
3220


We're left with 7595 Android apps and 3220 iOS apps, which should be enough for our analysis.

## Most Common Apps by Genre

### Part One

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

### Part Two

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the 'prime_genre' column of the App Store data set, and the 'Genres' and 'Category' columns of the Google Play data set.

In [56]:
def freq_table(data, index):
    table = {}
    total = 0
    
    for row in data:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Part Three

We start by examining the frequency table for the 'prime_genre' column of the App Store data set.

In [57]:
display_table(ios, -5)

Games : 58.16770186335404
Entertainment : 7.888198757763975
Photo & Video : 4.937888198757764
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


We can see that within the free English iOS apps, more than half (58.17%) are games. Entertainment apps come next (7.89%), followed by photo and video apps (4.94%), education apps (3.66%), and social networking apps (3.29%).

The overall perception is that the App Store, particularly the section featuring free English apps, is primarily filled with apps designed for entertainment purposes, including games, photo and video, social networking, sports, and music. Conversely, apps with practical functionalities such as education, shopping, utilities, productivity, and lifestyle are relatively scarce. However, it is important to note that while fun apps may dominate in quantity, it does not necessarily mean they have the highest user demand. The actual user demand might differ from the abundance of offerings in the market.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [58]:
display_table(android_clean, 1)

FAMILY : 19.67083607636603
GAME : 10.836076366030282
TOOLS : 8.650427913100724
FINANCE : 3.80513495720869
PRODUCTIVITY : 3.726135615536537
LIFESTYLE : 3.726135615536537
BUSINESS : 3.344305464121132
PHOTOGRAPHY : 3.2784726793943384
SPORTS : 3.133640552995392
COMMUNICATION : 3.0809743252139565
PERSONALIZATION : 3.067807768268598
HEALTH_AND_FITNESS : 3.067807768268598
MEDICAL : 3.0019749835418037
NEWS_AND_MAGAZINES : 2.6596445029624753
SOCIAL : 2.6464779460171166
TRAVEL_AND_LOCAL : 2.356813693219223
SHOPPING : 2.3436471362738645
BOOKS_AND_REFERENCE : 2.1198156682027647
VIDEO_PLAYERS : 1.9091507570770245
DATING : 1.7248189598420012
MAPS_AND_NAVIGATION : 1.4878209348255431
EDUCATION : 1.3561553653719554
FOOD_AND_DRINK : 1.2113232389730084
ENTERTAINMENT : 1.119157340355497
AUTO_AND_VEHICLES : 0.9479921000658328
WEATHER : 0.8558262014483212
LIBRARIES_AND_DEMO : 0.8426596445029624
HOUSE_AND_HOME : 0.8031599736668862
ART_AND_DESIGN : 0.7373271889400922
COMICS : 0.7109940750493746
PARENTING : 0.

The Google Play landscape presents notable differences compared to the App Store, with fewer apps focused on entertainment and a larger presence of practical categories such as family, tools, business, lifestyle, and productivity. However, upon closer examination, it becomes apparent that the family category, which constitutes nearly 20% of the apps, primarily consists of games intended for children.

In [60]:
display_table(android_clean, -4)

Tools : 8.637261356155365
Entertainment : 6.030283080974326
Education : 5.411454904542462
Finance : 3.80513495720869
Productivity : 3.726135615536537
Lifestyle : 3.7129690585911783
Action : 3.528637261356155
Business : 3.344305464121132
Photography : 3.2653061224489797
Sports : 3.2126398946675447
Communication : 3.0809743252139565
Personalization : 3.067807768268598
Health & Fitness : 3.067807768268598
Medical : 3.0019749835418037
News & Magazines : 2.6596445029624753
Social : 2.6464779460171166
Travel & Local : 2.3436471362738645
Simulation : 2.3436471362738645
Shopping : 2.3436471362738645
Books & Reference : 2.1198156682027647
Arcade : 2.0276497695852536
Casual : 1.9749835418038184
Video Players & Editors : 1.8828176431863068
Dating : 1.7248189598420012
Maps & Navigation : 1.4878209348255431
Food & Drink : 1.2113232389730084
Racing : 1.1059907834101383
Puzzle : 1.0928242264647794
Role Playing : 1.053324555628703
Strategy : 1.0269914417379855
Auto & Vehicles : 0.9479921000658328
Weat

The distinction between the 'Genres' and 'Category' columns may not be entirely clear, but one observation is evident: the 'Genres' column provides a more detailed breakdown with numerous categories. However, for our current analysis, we will focus solely on the 'Category' column to gain a broader understanding of the data.

### Most Popular Apps by Genre on the App Store

To determine the popularity of different genres based on user engagement, one approach is to calculate the average number of installs for each app genre. In the Google Play dataset, this information is available in the Installs column. However, in the App Store dataset, the number of installs is not directly provided. 
As a workaround, we can utilize the total number of user ratings, which can be found in the rating_count_tot column, as a proxy for popularity. By analyzing this metric, we can gain insights into the relative popularity of genres across both datasets.
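
Note that the Installs values in the Google Play data are strings such as '10,000+', so they need converting before any average can be computed. The sketch below shows one way to do that; the helper name `parse_installs` is an assumption, and treating the lower bound of each bracket as the install count is a simplification rather than a fact about the data.

```python
def parse_installs(text):
    """Convert an install bracket such as '10,000+' to the number 10000.0."""
    return float(text.replace(',', '').replace('+', ''))

# Install values taken from the rows printed earlier in this notebook
for raw in ['10,000+', '500,000+', '5,000,000+']:
    print(raw, '->', parse_installs(raw))
```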

Below, we calculate the average number of user ratings per app genre on the App Store:

In [62]:
genres_ios = freq_table(ios, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings,2))

Social Networking : 71548.35
Photo & Video : 28620.01
Games : 22800.78
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [63]:
for app in ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [64]:
for app in ios:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5]) # print name and number of ratings

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

The same pattern applies to social networking apps, where the average is driven by a handful of giants such as Facebook, Pinterest, and Skype. Likewise, the music average is pulled up by major players like Pandora, Spotify, and Shazam.

Our objective is to identify popular genres, but navigation, social networking, and music apps may appear more popular than they really are. The average number of ratings is skewed by a few apps with hundreds of thousands of user ratings, while other apps struggle to get past the 10,000 threshold. Removing these extremely popular apps from each genre and recalculating the averages would give a more accurate picture, but we leave that level of detail for later.
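
As a rough illustration of how much the outliers matter, the sketch below reuses the six Navigation rating counts printed above and compares the plain mean with the median and with a mean that ignores apps above a cutoff. The helper `mean_below` and the 100,000 cutoff are assumptions made for this example, not part of the analysis itself.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed in the output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

def mean_below(values, cutoff):
    """Average of the values strictly below `cutoff` (hypothetical helper)."""
    kept = [v for v in values if v < cutoff]
    return mean(kept) if kept else 0.0

print(round(mean(navigation_ratings), 2))       # 86090.33 (matches the genre average)
print(median(navigation_ratings))               # 8196.5
print(mean_below(navigation_ratings, 100_000))  # 4146.25 (Waze and Google Maps left out)
```

With the two largest apps left out, the average drops from about 86,000 to about 4,000, which is why the raw genre averages should be read with caution.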

Reference apps have 74,942 user ratings on average, but it is actually the Bible and Dictionary.com that skew the average number of ratings upward:

In [65]:
for app in ios:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, this particular niche appears to hold promise. One approach we could explore is to transform another popular book into an app, augmenting it with additional features beyond the raw version of the book. This could involve incorporating daily quotes from the book, an audio rendition, quizzes related to the book, and more. Furthermore, we could integrate a dictionary within the app, eliminating the need for users to exit the app when looking up words.

This idea fits well with the fact that the App Store is dominated by entertainment-oriented apps. If the market is saturated with entertainment-focused apps, a practical app like the one described may have a better chance of standing out among the vast number of apps available.

Other popular genres include weather, books, food and drink, and finance. The book genre overlaps with the app idea described above, but the other genres look less attractive:

- Weather apps typically have limited in-app engagement, making it challenging to generate substantial profits from in-app ads. Additionally, obtaining reliable live weather data may require connecting to non-free APIs.

- Food and drink apps often involve established brands like Starbucks, Dunkin' Donuts, McDonald's, etc. Developing a successful food and drink app would necessitate cooking capabilities and a delivery service, which fall outside the scope of our company.

- Finance apps encompass various functionalities such as banking, bill payment, and money transfers. Building a finance app would require specialized knowledge in the finance domain, which is not our expertise or focus at the moment.

Now let's analyze the Google Play market a bit.

### Most Popular Apps by Genre on Google Play

For the Google Play market, the data actually contains the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not exact; they are grouped into open-ended buckets (100+, 1,000+, 5,000+, etc.):

In [67]:
display_table(android_clean, 5) # the Installs column (percentage of apps per bucket)

1,000,000+ : 18.367346938775512
100,000+ : 13.337722185648454
10,000,000+ : 12.310730743910467
10,000+ : 11.441737985516786
5,000,000+ : 7.992100065832784
1,000+ : 7.465437788018433
500,000+ : 6.477946017116524
50,000+ : 5.490454246214615
5,000+ : 4.739960500329164
100+ : 3.120473996050033
50,000,000+ : 2.685977616853193
100,000,000+ : 2.488479262672811
500+ : 2.1461487820934826
10+ : 0.6714944042132982
50+ : 0.5529953917050692
500,000,000+ : 0.31599736668861095
1,000,000,000+ : 0.26333113890717574
5+ : 0.1184990125082291
1+ : 0.013166556945358787


One limitation of this data is its lack of precision. For example, we do not know whether an app with "100,000+" installs has 100,000, 200,000, or 350,000 installs. However, precise figures are not required for our purposes: we only want a general idea of which app genres attract the most users.

We will therefore leave the numbers as they are and treat each label as its lower bound: an app with "100,000+" installs is counted as having 100,000 installs, an app with "1,000,000+" installs as having 1,000,000 installs, and so on.

To perform computations on the install numbers, we will need to convert them to floats. This requires removing the commas and plus characters from the values. By doing this directly within the loop, we can also calculate the average number of installs for each genre (category).
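
The conversion step on its own can be sketched as a small helper. The name `parse_installs` is introduced here for illustration only; the notebook performs the same two replacements inline.

```python
def parse_installs(raw):
    """Turn an install label such as '1,000,000+' into a float (hypothetical helper)."""
    # Remove thousands separators and the trailing plus sign before converting.
    return float(raw.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('5+'))          # 5.0
```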

In [68]:
categories_android = freq_table(android_clean, 1)

for category in categories_android:
    total = 0          # sum of installs across the category
    len_category = 0   # number of apps in the category
    for app in android_clean:
        category_app = app[1]
        if category_app == category:
            # strip ',' and '+' so that '1,000,000+' becomes '1000000'
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 2021733.9285714286
AUTO_AND_VEHICLES : 737219.4444444445
BEAUTY : 640861.9047619047
BOOKS_AND_REFERENCE : 10346391.335403727
BUSINESS : 2743328.5826771655
COMICS : 832057.4074074074
COMMUNICATION : 47166160.384615384
DATING : 1075582.5190839695
EDUCATION : 1842233.009708738
ENTERTAINMENT : 11640705.88235294
EVENTS : 354431.3333333333
FINANCE : 1574833.2179930797
FOOD_AND_DRINK : 2300192.934782609
HEALTH_AND_FITNESS : 4907867.896995708
HOUSE_AND_HOME : 1591344.262295082
LIBRARIES_AND_DEMO : 813796.875
LIFESTYLE : 1775837.4911660778
GAME : 16326565.558930742
FAMILY : 4149821.124497992
MEDICAL : 165329.71929824562
SOCIAL : 27302664.05472637
SHOPPING : 7866974.382022472
PHOTOGRAPHY : 18699861.8875502
SPORTS : 4601628.844537815
TRAVEL_AND_LOCAL : 16171381.56424581
TOOLS : 12327241.522070015
PERSONALIZATION : 6562636.9527897
PRODUCTIVITY : 20465227.455830388
PARENTING : 647208.5416666666
WEATHER : 5542846.153846154
VIDEO_PLAYERS : 27115353.103448275
NEWS_AND_MAGAZINES : 1172

On average, communication apps have the most installs: 47,166,160. This number is heavily skewed by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 million and 500 million installs.
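
One way to list those heavyweights is to compare the parsed install counts against a threshold. The sketch below is illustrative only: `parse_installs`, `apps_at_or_above`, and `sample_rows` are names introduced here, and the sample rows are stand-ins that fill only the columns used (name at index 0, category at index 1, installs at index 5). On the real data, `android_clean` would be passed instead of `sample_rows`.

```python
def parse_installs(raw):
    """Turn an install label such as '1,000,000+' into a float (hypothetical helper)."""
    return float(raw.replace(',', '').replace('+', ''))

def apps_at_or_above(rows, category, threshold):
    """Return (name, installs label) pairs for apps in `category` whose
    install lower bound is at least `threshold`."""
    return [(row[0], row[5]) for row in rows
            if row[1] == category and parse_installs(row[5]) >= threshold]

# Stand-in rows; the unused columns are left empty.
sample_rows = [
    ['WhatsApp', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Gmail', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Example Chat (hypothetical)', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
]

for name, installs in apps_at_or_above(sample_rows, 'COMMUNICATION', 100_000_000):
    print(name, ':', installs)
```

Comparing numbers rather than label strings avoids having to enumerate every bucket by hand.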

We observe a similar trend in the social category, which comes second with an average of 27,302,664 installs and is dominated by giants like Facebook, Instagram, and Google+, and in the video players category, close behind with 27,115,353 installs and dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern repeats in photography apps with Google Photos and other popular photo editors, and in productivity apps with Microsoft Word, Dropbox, Google Calendar, Evernote, and more.

However, it is important to consider that these app genres may appear more popular than they truly are. Additionally, these niches are predominantly controlled by a few dominant players, making it challenging to compete against them.

Although the game genre remains popular, we previously identified it as a saturated market. Therefore, we should explore alternative app recommendations if possible.

The books and reference genre also demonstrates popularity, with an average number of installs reaching 10,346,391. This genre presents an intriguing opportunity for further exploration since it shows potential for success on both the App Store and Google Play. Our objective is to recommend an app genre that exhibits profitability potential on both platforms.

Let's take a look at some of the apps from this genre and their number of installs:

In [70]:
for app in android_clean:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 100

The books and reference genre includes a variety of apps: software for processing and reading ebooks, library collections, dictionaries, and programming or language tutorials. However, a small number of extremely popular apps still appear to skew the average:

In [71]:
top_buckets = ('1,000,000,000+', '500,000,000+', '100,000,000+')
for app in android_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in top_buckets:
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [72]:
mid_buckets = ('1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+')
for app in android_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in mid_buckets:
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The book and reference genre predominantly consists of software designed for processing and reading ebooks, along with various collections of libraries and dictionaries. Given the existing competition in this niche, it may not be advisable to develop similar apps that could face significant market saturation.

However, quite a few apps are built around the Quran, which suggests that building an app around a popular book can be profitable. Developing an app around a popular, recent book could therefore work in both the Google Play and App Store markets.

To differentiate our app from existing libraries, it is crucial to incorporate distinctive features beyond offering the raw version of the book. This could involve including features such as daily quotes from the book, an audio rendition, quizzes related to the book, or a forum where users can engage in discussions about the book.

### Conclusion

Our analysis suggests that transforming a popular, contemporary book into an app presents a profitable opportunity in both the Google Play and App Store markets. However, to stand out among existing libraries, incorporating additional interactive features is essential for success. These features could include daily quotes, an audio version, quizzes, and a dedicated forum for book-related discussions.