# Profitable App Profiles for the App Store and Google Play Markets
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
- A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

*For the App Store data set, we use the id column as the index.*

In [2]:
import pandas as pd
### The App Store data set ###
appstore_df = pd.read_csv('AppleStore.csv', index_col = "id") 

In [3]:


### The Google Play data set ###

google_df = pd.read_csv('googleplaystore.csv') 


# Data exploration
To make the two data sets easier to explore, we'll first count the number of apps in each data set and then look at the first few rows and some summary statistics.



In [4]:
appstore_apps = len(appstore_df)
print("There are", appstore_apps, "apps in this dataset")

There are 7197 apps in this dataset


In [5]:
google_apps = len(google_df)
print("There are", google_apps, "apps in this dataset")

There are 10841 apps in this dataset


In [74]:
# Saving the number of Android apps for later analysis
amount_of_android = len(google_df) 

In [72]:
# Saving the number of App Store apps for later analysis
amount_of_ios = len(appstore_df)

In [6]:
appstore_df.head()

id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [7]:
google_df.describe()

,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


In [8]:
appstore_df.describe()

,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0
mean,199134500.0,1.726218,12892.91,460.373906,3.526956,3.253578,37.361817,3.7071,5.434903,0.993053
std,359206900.0,5.833006,75739.41,3920.455183,1.517948,1.809363,3.737715,1.986005,7.919593,0.083066
min,589824.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,46922750.0,0.0,28.0,1.0,3.5,2.5,37.0,3.0,1.0,1.0
50%,97153020.0,0.0,300.0,23.0,4.0,4.0,37.0,5.0,1.0,1.0
75%,181924900.0,1.99,2793.0,140.0,4.5,4.5,38.0,5.0,8.0,1.0
max,4025970000.0,299.99,2974676.0,177050.0,5.0,5.0,47.0,5.0,75.0,1.0


In [9]:
google_df.head()

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


*Counting repeated apps*

In [100]:
appstore_df["track_name"].value_counts().to_frame()

,track_name
Mannequin Challenge,2
VR Roller Coaster,2
The Very Hungry Caterpillar - Creative Play,1
スタバで呪文,1
Truecaller - Spam Identification & Block,1
...,...
Mystic Messenger,1
FOX Sports GO,1
Pocket Glasses PRO - text magnifier app,1
Visionn - Real Time Artistic Photo & Video Effects,1


In [99]:
google_df["App"].value_counts().to_frame()

,App
ROBLOX,9
"CBS Sports App - Scores, News, Stats & Watch Live",8
Duolingo: Learn Languages Free,7
8 Ball Pool,7
Candy Crush Saga,7
...,...
Autool BT-BOX,1
Woody Puzzle,1
Job CV Maker & Portfolio Maker,1
ZERO Lock Screen,1


# Deleting Wrong Data
The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [12]:
google_df["App"].str.contains("Life Made WI-Fi Touchscreen Photo Frame")
google_df.head(10473)

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10468,Tassa.fi Finland,LIFESTYLE,3.6,346,7.5M,"50,000+",Free,0,Everyone,Lifestyle,"May 22, 2018",5.5,4.0 and up
10469,TownWiFi | Wi-Fi Everywhere,COMMUNICATION,3.9,2372,58M,"500,000+",Free,0,Everyone,Communication,"August 2, 2018",4.2.1,4.2 and up
10470,Jazz Wi-Fi,COMMUNICATION,3.4,49,4.0M,"10,000+",Free,0,Everyone,Communication,"February 10, 2017",0.1,2.3 and up
10471,Xposed Wi-Fi-Pwd,PERSONALIZATION,3.5,1042,404k,"100,000+",Free,0,Everyone,Personalization,"August 5, 2014",3.0.0,4.0.3 and up


In [13]:
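# Drop the faulty row by its index label. Note that the chained .head(10473)
# also limits the frame to its first 10473 remaining rows.
google_df = google_df.drop([10472]).head(10473)
google_df.head(10473)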

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10468,Tassa.fi Finland,LIFESTYLE,3.6,346,7.5M,"50,000+",Free,0,Everyone,Lifestyle,"May 22, 2018",5.5,4.0 and up
10469,TownWiFi | Wi-Fi Everywhere,COMMUNICATION,3.9,2372,58M,"500,000+",Free,0,Everyone,Communication,"August 2, 2018",4.2.1,4.2 and up
10470,Jazz Wi-Fi,COMMUNICATION,3.4,49,4.0M,"10,000+",Free,0,Everyone,Communication,"February 10, 2017",0.1,2.3 and up
10471,Xposed Wi-Fi-Pwd,PERSONALIZATION,3.5,1042,404k,"100,000+",Free,0,Everyone,Personalization,"August 5, 2014",3.0.0,4.0.3 and up


In [14]:
google_df["App"].str.contains("Life Made WI-Fi Touchscreen Photo Frame")

0        False
1        False
2        False
3        False
4        False
         ...  
10468    False
10469    False
10470    False
10471    False
10473    False
Name: App, Length: 10473, dtype: bool
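
A small sketch of how a single faulty row can be inspected and dropped by its index label. The toy frame and the `bad_label` value are made up for illustration and only stand in for `google_df` and row 10472. It also shows that chaining `.head(n)` after `drop` keeps just the first n remaining rows, which would explain why the series above has 10473 entries rather than 10840.

```python
import pandas as pd

# Toy stand-in for google_df (made-up values, not the real data).
toy_df = pd.DataFrame(
    {
        "App": ["App A", "App B", "Broken row", "App C", "App D"],
        "Rating": [4.1, 3.9, 19.0, 4.5, 4.3],
    }
)

bad_label = 2  # hypothetical index label of the faulty row

# .loc with a list of labels returns a one-row DataFrame for inspection.
print(toy_df.loc[[bad_label]])

# drop() removes the row by label and leaves every other row in place.
cleaned = toy_df.drop(index=bad_label)

# Chaining .head(n) after drop() keeps only the first n remaining rows.
truncated = toy_df.drop(index=bad_label).head(3)

print(len(toy_df), len(cleaned), len(truncated))
```

If the goal is only to remove the faulty row, the plain `drop` call is enough; the `.head(n)` step is what discards the additional rows.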

# Removing Duplicate Entries
### Part One
If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [98]:
google_df['App'].value_counts().head(50).to_frame()

,App
ROBLOX,9
"CBS Sports App - Scores, News, Stats & Watch Live",8
Duolingo: Learn Languages Free,7
8 Ball Pool,7
Candy Crush Saga,7
ESPN,7
slither.io,6
"Bleacher Report: sports news, scores, & highlights",6
Sniper 3D Gun Shooter: Free Shooting Games - FPS,6
Subway Surfers,6


In [16]:
apps = list(google_df['App'])

In [17]:
for app in apps:
    name = app
    if name == 'Instagram':
        print(app)

Instagram
Instagram
Instagram
Instagram


In total, there are 1,173 cases where an app occurs more than once:



In [18]:
duplicate_apps = []
unique_apps = []

for app in apps:
    name = app
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1173


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
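
The same count can probably be obtained without an explicit loop. The sketch below uses `Series.duplicated` on a made-up list of app names (not the real data); with `keep="first"`, every occurrence after the first is flagged, which mirrors the rule the loop above applies.

```python
import pandas as pd

# Made-up app names standing in for google_df["App"].
apps = pd.Series(["Box", "Slack", "Box", "Instagram", "Slack", "Box"], name="App")

# duplicated() marks every occurrence after the first as True,
# which matches the counting rule of the loop above.
is_duplicate = apps.duplicated(keep="first")

n_duplicates = int(is_duplicate.sum())
duplicate_names = apps[is_duplicate].tolist()

print("Number of duplicate apps:", n_duplicates)
print("Examples of duplicate apps:", duplicate_names)
```

Besides being shorter, this avoids the repeated `in` lookups on a growing list, which can become slow on larger data sets.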


*Checking the app with the highest number of reviews*

In [19]:
google_df[ 
    google_df["Reviews"]==max(google_df["Reviews"]) 
]

,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up


In [20]:
google_df['Reviews'].describe()

count     10473
unique     5912
top           0
freq        549
Name: Reviews, dtype: object
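
The `describe()` output above reports `dtype: object`, which suggests that Reviews is stored as text. In that case `max` compares strings character by character, so a value like 9992 can rank above much larger numbers. A minimal sketch of the difference, using made-up review counts and converting with `pd.to_numeric` first:

```python
import pandas as pd

# Made-up rows; Reviews is deliberately stored as text.
toy_df = pd.DataFrame(
    {
        "App": ["Small app", "Big app", "Mid app"],
        "Reviews": ["9992", "78158306", "215644"],
    }
)

# String comparison is lexicographic, so "9992" beats "78158306".
string_max = max(toy_df["Reviews"])

# Convert to numbers first; errors="coerce" turns unparseable text into NaN.
reviews_num = pd.to_numeric(toy_df["Reviews"], errors="coerce")
top_app = toy_df.loc[reviews_num.idxmax(), "App"]

print(string_max, top_app)
```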

In [94]:
appstore_df['rating_count_ver'].describe().to_frame()

,rating_count_ver
count,7197.0
mean,460.373906
std,3920.455183
min,0.0
25%,1.0
50%,23.0
75%,140.0
max,177050.0


## Isolating the Free Apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [22]:
# Count free apps with an exact match on "0"; a substring test would also match paid prices such as $0.99
android = int((google_df["Price"] == "0").sum())
android

9699

In [23]:
ios = appstore_df["price"]

In [93]:
google_df.groupby('Price')['Price'].agg('count').sort_values(ascending = False).to_frame()

Price,Price
0,9699
$0.99,142
$2.99,127
$4.99,72
$1.99,71
...,...
$2.50,1
$4.29,1
$4.59,1
$4.60,1


In [97]:
appstore_df.groupby('price')['price'].agg('count').sort_values(ascending = False).to_frame()

price,price
0.0,4056
0.99,728
2.99,683
1.99,621
4.99,394
3.99,277
6.99,166
9.99,81
5.99,52
7.99,33
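
Since the Google Play prices are strings such as "0" and "$0.99" while the App Store prices are floats, one way to isolate the free apps is an exact comparison against zero in each format. The sketch below uses two made-up toy frames; `google_free` and `ios_free` are hypothetical names, not variables from this notebook.

```python
import pandas as pd

# Toy frames mirroring the price formats seen in the two tables above.
google_toy = pd.DataFrame(
    {"App": ["A", "B", "C", "D"], "Price": ["0", "$0.99", "$10.00", "0"]}
)
ios_toy = pd.DataFrame(
    {"track_name": ["W", "X", "Y"], "price": [0.0, 0.99, 0.0]}
)

# Exact comparison keeps only rows whose price is exactly zero.
google_free = google_toy[google_toy["Price"] == "0"]
ios_free = ios_toy[ios_toy["price"] == 0.0]

# A substring test also matches paid prices that merely contain a "0".
substring_hits = int(google_toy["Price"].str.contains("0").sum())

print(len(google_free), len(ios_free), substring_hits)
```

Keeping the filtered rows as DataFrames, rather than only a count, makes it possible to run the later genre analysis on free apps alone.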


# Most Common Apps by Genre
## Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

## Part Two
We'll use pandas to generate frequency tables that show both the counts and the proportions for each genre.
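
As a rough illustration of a percentage frequency table, `value_counts(normalize=True)` returns proportions directly and sorts them in descending order. The genre values below are made up and only stand in for a column such as prime_genre.

```python
import pandas as pd

# Made-up genre labels standing in for a real genre column.
genres = pd.Series(
    ["Games", "Games", "Education", "Games", "Music", "Education"],
    name="prime_genre",
)

# normalize=True returns proportions instead of raw counts;
# value_counts sorts in descending order by default.
percentages = (genres.value_counts(normalize=True) * 100).round(1)

print(percentages)
```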

In [101]:
# average_rating_category = google_df.groupby("Category")["Rating"].agg(["mean"]).round(2)
# average_rating_category

In [113]:
ios_category_frequency_table = appstore_df.groupby("prime_genre")["track_name"].count()
ios_category_frequency_table.to_frame()

prime_genre,track_name
Book,112
Business,57
Catalogs,10
Education,453
Entertainment,535
Finance,104
Food & Drink,63
Games,3862
Health & Fitness,180
Lifestyle,144


In [112]:
appstore_df.head()

id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [115]:
ios_category_percentage_table = ios_category_frequency_table / amount_of_ios
round(ios_category_percentage_table,2).to_frame()

prime_genre,track_name
Book,0.02
Business,0.01
Catalogs,0.0
Education,0.06
Entertainment,0.07
Finance,0.01
Food & Drink,0.01
Games,0.54
Health & Fitness,0.03
Lifestyle,0.02


We can see that more than half of the apps (54%) are games. Entertainment apps account for close to 7% and education apps for about 6%, followed by photo and video apps at close to 5% and social networking apps at about 2% of the apps in our data set.

The general impression is that the App Store (at least the part covered by this data set) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the supply.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [123]:
# Count apps per Category (the output below is indexed by Category)
google_category_frequency_table = google_df.groupby("Category")["App"].count()
google_category_frequency_table.to_frame()

Category,App
ART_AND_DESIGN,65
AUTO_AND_VEHICLES,82
BEAUTY,53
BOOKS_AND_REFERENCE,214
BUSINESS,445
COMICS,59
COMMUNICATION,369
DATING,233
EDUCATION,156
ENTERTAINMENT,149


In [121]:
# Share of apps per Category, relative to the total number of Android apps
google_category_percentage_table = google_df.groupby("Category")["App"].count() / amount_of_android
round(google_category_percentage_table, 2).to_frame()

Category,App
ART_AND_DESIGN,0.01
AUTO_AND_VEHICLES,0.01
BEAUTY,0.01
BOOKS_AND_REFERENCE,0.02
BUSINESS,0.04
COMICS,0.01
COMMUNICATION,0.04
DATING,0.02
EDUCATION,0.01
ENTERTAINMENT,0.01


## Downloads Analysis
We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [131]:
downloads_apps= google_df.groupby("Installs")["App"].count()/amount_of_android
round(downloads_apps,2).to_frame()

Installs,App
0,0.0
0+,0.0
1+,0.01
"1,000+",0.08
"1,000,000+",0.15
"1,000,000,000+",0.01
10+,0.03
"10,000+",0.1
"10,000,000+",0.12
100+,0.06
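
The Installs values in the table above are text brackets such as "1,000+", so they sort alphabetically rather than by size. Below is a rough sketch of turning them into numbers before averaging per category. The data and category names are made up, and treating each bracket's lower bound as the install count is an assumption.

```python
import pandas as pd

# Made-up rows; Installs uses the same text format as the table above.
toy_df = pd.DataFrame(
    {
        "Category": ["BOOKS", "BOOKS", "GAME", "GAME"],
        "Installs": ["1,000+", "10,000+", "1,000,000+", "100+"],
    }
)

# Strip the thousands separators and the trailing "+", then convert to integers.
installs_num = (
    toy_df["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

# Average installs per category, using the lower bound of each bracket.
avg_installs = installs_num.groupby(toy_df["Category"]).mean()

print(avg_installs)
```

Passing `regex=False` matters here because "+" is a special character in regular expressions.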


This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

# Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.