# Profitable App Profiles for the App Store and Google Play Markets
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
- A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

*For the App Store data set, we set the id column as the index.*

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### The App Store data set ###
appstore_df = pd.read_csv('AppleStore.csv', index_col = "id") 
appstore_df.sort_values(by = 'rating_count_tot', ascending = False, inplace = True)
appstore_df.head()

id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [2]:


### The Google Play data set ###

google_df = pd.read_csv('googleplaystore.csv') 
# Note: Reviews is read as strings (object dtype), so this sort is lexicographic, not numeric
google_df.sort_values(by = 'Reviews', ascending = False, inplace = True)
google_df.head()


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up


# Data exploration
To make the two data sets easier to work with, we'll first explore their rows in a more readable way. We'll also show the number of apps in each data set.



In [3]:
appstore_apps = len(appstore_df)
print("There are", appstore_apps, "apps in this dataset")

There are 7197 apps in this dataset


In [4]:
google_apps = len(google_df)
print("There are", google_apps, "apps in this dataset")

There are 10841 apps in this dataset


In [5]:
# Saving the number of Android apps for later analysis
amount_of_android = len(google_df) 

In [6]:
# Saving the number of App Store apps for later analysis
amount_of_ios = len(appstore_df)

In [7]:
appstore_df.prime_genre.value_counts()

Games                3862
Entertainment         535
Education             453
Photo & Video         349
Utilities             248
Health & Fitness      180
Productivity          178
Social Networking     167
Lifestyle             144
Music                 138
Shopping              122
Sports                114
Book                  112
Finance               104
Travel                 81
News                   75
Weather                72
Reference              64
Food & Drink           63
Business               57
Navigation             46
Medical                23
Catalogs               10
Name: prime_genre, dtype: int64

In [8]:
google_df.Category.value_counts()

FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
COMICS                   60
PARENTING                60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64

In [9]:
google_df.isna().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [10]:
google_df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [11]:
google_df.Installs.value_counts()

1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Free                 1
Name: Installs, dtype: int64
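
The Reviews, Installs, and Price columns are all read as strings (object dtype), so they need converting before any numeric sorting or averaging. Below is a minimal sketch of such a conversion. The helper names `parse_installs` and `parse_price` are hypothetical, and values that cannot be parsed (such as the stray `Free` in Installs) are returned as `None` rather than guessed.

```python
def parse_installs(value):
    """Turn strings such as '1,000,000+' into an int; return None if not numeric."""
    cleaned = str(value).replace(",", "").replace("+", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def parse_price(value):
    """Turn strings such as '$0.99' or '0' into a float; return None on failure."""
    cleaned = str(value).replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Sample values taken from the outputs above
print(parse_installs("1,000,000+"))  # 1000000
print(parse_installs("Free"))        # None
print(parse_price("$0.99"))          # 0.99
```

With pandas, these helpers could be applied column-wise, for example with `google_df["Installs"].map(parse_installs)`.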

*Counting repeated apps*

In [12]:
appstore_df["track_name"].value_counts()

Mannequin Challenge                                  2
VR Roller Coaster                                    2
PhotoScan - scanner by Google Photos                 1
My Challenge Tracker                                 1
Deliciously Ella                                     1
                                                    ..
LivePapers - Live Wallpapers from your photos        1
Lines the Game                                       1
Paper by FiftyThree - Sketch, Diagram, Take Notes    1
Vetmoji                                              1
Lotto Out! - Mexican Loteria                         1
Name: track_name, Length: 7195, dtype: int64

In [13]:
google_df["App"].value_counts()

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
Duolingo: Learn Languages Free                       7
Candy Crush Saga                                     7
ESPN                                                 7
                                                    ..
Canon CameraWindow                                   1
Death Dragon Knights RPG                             1
Brick Breaker BR                                     1
CB Fit                                               1
DM Magazine                                          1
Name: App, Length: 9660, dtype: int64

In [14]:
google_df["Type"].value_counts()

Free    10039
Paid      800
0           1
Name: Type, dtype: int64

In [15]:
google_df["Content Rating"].value_counts()

Everyone           8714
Teen               1208
Mature 17+          499
Everyone 10+        414
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

In [16]:
google_df.Genres.value_counts()

Tools                                  842
Entertainment                          623
Education                              549
Medical                                463
Business                               460
                                      ... 
Board;Pretend Play                       1
Puzzle;Education                         1
Health & Fitness;Action & Adventure      1
Adventure;Brain Games                    1
Parenting;Brain Games                    1
Name: Genres, Length: 120, dtype: int64

In [17]:
google_df['Last Updated'].value_counts()

August 3, 2018       326
August 2, 2018       304
July 31, 2018        294
August 1, 2018       285
July 30, 2018        211
                    ... 
March 3, 2016          1
September 9, 2017      1
February 22, 2013      1
February 18, 2013      1
November 8, 2015       1
Name: Last Updated, Length: 1378, dtype: int64

# Data Cleaning
The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [18]:
google_df[google_df.App=="Life Made WI-Fi Touchscreen Photo Frame"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
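
A row like this can also be found without knowing its index in advance. Store ratings run from 1 to 5 stars, so a rating above 5 signals that the fields have shifted. The sketch below uses a small made-up frame in place of google_df, and the helper name `find_suspect_rows` is hypothetical.

```python
import pandas as pd


def find_suspect_rows(df, rating_col="Rating", max_rating=5.0):
    """Return rows whose rating exceeds the maximum a store rating can take."""
    return df[df[rating_col] > max_rating]


# Toy data mimicking the shifted row (Rating holds the review count, 19.0)
toy = pd.DataFrame(
    {
        "App": ["Facebook", "Life Made WI-Fi Touchscreen Photo Frame"],
        "Category": ["SOCIAL", "1.9"],
        "Rating": [4.1, 19.0],
    },
    index=[2544, 10472],
)

suspects = find_suspect_rows(toy)
print(suspects.index.tolist())  # [10472]
```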


In [19]:
google_df[google_df.App=="Facebook"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2544,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
3943,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device


In [20]:
# Row 10472 has no Category value, so every field is shifted one column to the left.
# Build the row mask once, then overwrite each field.
mask = google_df.App == "Life Made WI-Fi Touchscreen Photo Frame"
google_df.loc[mask, "Category"] = 'FAMILY'
google_df.loc[mask, "Rating"] = 1.9
google_df.loc[mask, "Reviews"] = 19.0
google_df.loc[mask, "Size"] = 'Varies with device'
google_df.loc[mask, "Installs"] = '1,000+'
google_df.loc[mask, "Type"] = 'Free'
google_df.loc[mask, "Price"] = 0
google_df.loc[mask, "Content Rating"] = 'Everyone'
google_df.loc[mask, "Genres"] = 'Tools'
google_df.loc[mask, "Last Updated"] = 'August 3, 2018'
google_df.loc[mask, "Current Ver"] = 'Varies with device'
google_df.loc[mask, "Android Ver"] = 'Varies with device'

In [21]:
google_df[google_df.App=="Life Made WI-Fi Touchscreen Photo Frame"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,FAMILY,1.9,19.0,Varies with device,"1,000+",Free,0,Everyone,Tools,"August 3, 2018",Varies with device,Varies with device


# Removing Duplicate Entries
### Part One
If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [22]:
google_df.App.value_counts()

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
Duolingo: Learn Languages Free                       7
Candy Crush Saga                                     7
ESPN                                                 7
                                                    ..
Canon CameraWindow                                   1
Death Dragon Knights RPG                             1
Brick Breaker BR                                     1
CB Fit                                               1
DM Magazine                                          1
Name: App, Length: 9660, dtype: int64

In [23]:
google_df.App.unique()

array(['GollerCepte Live Score', 'Ad Block REMOVER - NEED ROOT',
       'SnipSnap Coupon App', ..., 'CE-SETRAM l’Appli',
       'Glanceable Ap Watch Face', 'G-NetReport Pro'], dtype=object)

In [24]:
#google_df = google_df.App.loc == "Instagram"

In [25]:
google_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7217,CE-STRONG,FAMILY,,0,16M,100+,Free,0,Everyone,Education,"June 17, 2016",1.0.4,4.0 and up
6492,Anime Mod for BM,BOOKS_AND_REFERENCE,,0,8.0M,100+,Free,0,Everyone,Books & Reference,"July 28, 2017",1.0,4.0 and up
7221,CE-SETRAM l’Appli,LIBRARIES_AND_DEMO,,0,2.6M,100+,Free,0,Everyone,Libraries & Demo,"December 5, 2017",1.1.8,4.0.3 and up
5480,Glanceable Ap Watch Face,PERSONALIZATION,,0,11M,5+,Paid,$0.99,Everyone,Personalization,"August 14, 2016",1.0.103,4.4 and up


In [26]:
google_df[google_df.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


In [27]:
google_df.App[ google_df.App == "Instagram"].value_counts()

Instagram    4
Name: App, dtype: int64

In [28]:
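# Overwrite the App name of three of the four Instagram rows with '0'.
# The rows themselves stay in the DataFrame; only their name changes.
google_df.loc[2545,"App"] = '0'
google_df.loc[2604,"App"] = '0'
google_df.loc[2611,"App"] = '0'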

In [29]:
google_df[google_df.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
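
Overwriting names by hand works for one app but does not scale to every duplicated name. One common alternative, shown here as a sketch on toy data rather than as this notebook's method, is to keep a single row per app: for example the one with the most reviews, on the assumption that it is the most recent snapshot. The helper name `keep_most_reviewed` and the "Example App" row are made up for illustration.

```python
import pandas as pd


def keep_most_reviewed(df, name_col="App", reviews_col="Reviews"):
    """Keep one row per app name: the row with the highest review count."""
    out = df.copy()
    # Reviews is stored as text in the raw file, so convert before sorting
    out[reviews_col] = pd.to_numeric(out[reviews_col], errors="coerce")
    out = out.sort_values(reviews_col, ascending=False)
    return out.drop_duplicates(subset=name_col, keep="first")


# Toy data based on the Instagram rows shown above, plus one invented app
toy = pd.DataFrame(
    {
        "App": ["Instagram", "Instagram", "Instagram", "Instagram", "Example App"],
        "Reviews": ["66577446", "66577313", "66577313", "66509917", "10"],
    },
    index=[2604, 2545, 2611, 3909, 1],
)

deduped = keep_most_reviewed(toy)
print(len(deduped))                  # 2
print(deduped.loc[2604, "Reviews"])  # 66577446
```

Note that this keeps row 2604, whereas the manual cells above keep row 3909. Which row to keep is a judgment call.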


In total, the Google Play data set contains 1,181 duplicate entries: it has 10,841 rows but only 9,660 unique app names.



*Checking the app with the largest number of reviews*

## Isolating the Free Apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [30]:
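# Caution: str.contains("0") matches any price string containing the character "0"
# (for example "$2.50"), so this count is higher than the 10,040 apps priced at exactly 0
android = google_df["Price"].str.contains("0").value_counts()[True]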
android

10223

In [31]:
ios = appstore_df["price"]

In [32]:
google_df.groupby('Price')['Price'].agg('count').sort_values(ascending = False).to_frame()

Price,Price
0,10040
$0.99,148
$2.99,129
$1.99,73
$4.99,72
...,...
$2.60,1
$2.59,1
$2.56,1
$2.50,1


In [33]:
appstore_df.groupby('price')['price'].agg('count').sort_values(ascending = False).to_frame()

price,price
0.0,4056
0.99,728
2.99,683
1.99,621
4.99,394
3.99,277
6.99,166
9.99,81
5.99,52
7.99,33
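
To actually isolate the free apps, the rows can be filtered rather than counted. The sketch below uses toy frames in place of google_df and appstore_df. It assumes Google Play prices are strings such as `0` and `$0.99`, while App Store prices are floats, as in the tables above.

```python
import pandas as pd

# Toy stand-ins for the two data sets
google_toy = pd.DataFrame(
    {"App": ["A", "B", "C"], "Price": ["0", "$0.99", "$10.00"]}
)
ios_toy = pd.DataFrame(
    {"track_name": ["X", "Y"], "price": [0.0, 2.99]}
)

# Strip the dollar sign and compare numerically, so "$10.00" is not
# mistaken for a free app just because it contains the character "0"
google_price = pd.to_numeric(
    google_toy["Price"].astype(str).str.replace("$", "", regex=False),
    errors="coerce",
)
google_free = google_toy[google_price == 0]
ios_free = ios_toy[ios_toy["price"] == 0.0]

print(google_free["App"].tolist())      # ['A']
print(ios_free["track_name"].tolist())  # ['X']
```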


# Most Common Apps by Genre
## Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

## Part Two
We'll analyze each market with two frequency tables:

- A table of app counts per genre
- A table that expresses those counts as a share of all apps in the data set

In [35]:
ios_category_frequency_table = appstore_df.sort_values('track_name', ascending=False).groupby("prime_genre")["track_name"].count()
ios_category_frequency_table.to_frame()

prime_genre,track_name
Book,112
Business,57
Catalogs,10
Education,453
Entertainment,535
Finance,104
Food & Drink,63
Games,3862
Health & Fitness,180
Lifestyle,144


In [37]:
ios_category_percentage_table = ios_category_frequency_table / amount_of_ios
round(ios_category_percentage_table,5).to_frame()

prime_genre,track_name
Book,0.01556
Business,0.00792
Catalogs,0.00139
Education,0.06294
Entertainment,0.07434
Finance,0.01445
Food & Drink,0.00875
Games,0.53661
Health & Fitness,0.02501
Lifestyle,0.02001
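
The two cells above can be folded into one reusable helper. This is a sketch only: the function name `freq_table` is hypothetical, and it relies on pandas `value_counts(normalize=True)`, which returns shares sorted in descending order.

```python
import pandas as pd


def freq_table(df, column):
    """Return the percentage of rows per value of `column`, largest first."""
    return (df[column].value_counts(normalize=True) * 100).round(2)


# Toy data standing in for appstore_df
toy = pd.DataFrame({"prime_genre": ["Games", "Games", "Games", "Education"]})

table = freq_table(toy, "prime_genre")
print(table.index[0], table.iloc[0])  # Games 75.0
```

The same helper would work for the Google Play columns, for example `freq_table(google_df, "Category")`.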


We can see that more than half of the apps in the data set (54%) are games. Entertainment apps are close to 7% and education apps about 6%, followed by photo and video apps at close to 5%. Social networking apps account for about 2% of the apps in our data set. Note that the table above covers all apps in the sample, since the free apps have not been filtered out at this point.

The general impression is that the App Store (at least the part covered by this sample) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not match the supply.
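
The App Store data has no install counts, so one possible proxy for the number of users, assumed here rather than established by the data, is the total number of ratings in rating_count_tot. A sketch of averaging it per genre, on made-up numbers:

```python
import pandas as pd

# Toy data; the real frame would be appstore_df
toy = pd.DataFrame(
    {
        "prime_genre": ["Games", "Games", "Reference", "Reference"],
        "rating_count_tot": [100, 300, 1000, 200],
    }
)

# Mean number of ratings per genre, highest first
avg_ratings = (
    toy.groupby("prime_genre")["rating_count_tot"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_ratings.to_dict())  # {'Reference': 600.0, 'Games': 200.0}
```

In this toy example the less numerous genre has the higher average, which is exactly the gap between supply and demand described above.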

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [38]:
google_category_frequency_table = google_df.groupby("Genres")["App"].count()
google_category_frequency_table.to_frame()

Genres,App
Action,365
Action;Action & Adventure,17
Adventure,75
Adventure;Action & Adventure,13
Adventure;Brain Games,1
...,...
Video Players & Editors,173
Video Players & Editors;Creativity,2
Video Players & Editors;Music & Video,3
Weather,82


In [39]:
google_category_percentage_table = google_df.groupby("Genres")["App"].count()/amount_of_android
round(google_category_percentage_table,2).to_frame()

Genres,App
Action,0.03
Action;Action & Adventure,0.00
Adventure,0.01
Adventure;Action & Adventure,0.00
Adventure;Brain Games,0.00
...,...
Video Players & Editors,0.02
Video Players & Editors;Creativity,0.00
Video Players & Editors;Music & Video,0.00
Weather,0.01
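
Many Genres values combine two labels separated by a semicolon (for example `Action;Action & Adventure`), which fragments the counts across 120 distinct values. One way to count each label on its own, sketched here on toy data, is to split on the semicolon and use `explode`:

```python
import pandas as pd

# Toy data standing in for google_df["Genres"]
genres = pd.Series(["Action", "Action;Action & Adventure", "Weather"])

# One row per individual label, then count
label_counts = genres.str.split(";").explode().value_counts()
print(label_counts["Action"])  # 2
```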


## Downloads Analysis
We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Before looking at individual apps from this genre, let's see how the apps in the data set are distributed across install brackets:

In [40]:
downloads_apps= google_df.groupby("Installs")["App"].count()/amount_of_android
round(downloads_apps,2).to_frame()

Installs,App
0,0.0
0+,0.0
1+,0.01
"1,000+",0.08
"1,000,000+",0.15
"1,000,000,000+",0.01
10+,0.04
"10,000+",0.1
"10,000,000+",0.12
100+,0.07


This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

# Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.