# Title: Analyzing mobile App Data 

I loved this project on Dataquest and picked up to work on it using Pandas!

Short Introduction:

What the project is about: As a data analyst for a company that builds Android and iOS mobile apps, I have to analyze the mobile app data. For example: Which users have seen and engaged them with ads, will help us understand how much revenue the user of our apps determines.

This company only builds free apps to download and install, and their main source of revenue consists of in-app ads. This means that the number of users of the apps determines the revenue for any given app — the more users who see and engage with the ads, the better.

This project aims to analyze data to help the developers understand what apps will likely attract more users on Google Play and the App Store.

To do this: I'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


# Exploartion of Data

In [1]:
import pandas as pd

AppleStore_Data = pd.read_csv('/Users/kajol/Downloads/AppleStore.csv')
AppleStore_Data


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.00,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.00,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.00,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.00,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.00,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.00,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.00,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.00,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.00,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.00,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


In [2]:
AppleStore_Data.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0
mean,863131000.0,199134500.0,1.726218,12892.91,460.373906,3.526956,3.253578,37.361817,3.7071,5.434903,0.993053
std,271236800.0,359206900.0,5.833006,75739.41,3920.455183,1.517948,1.809363,3.737715,1.986005,7.919593,0.083066
min,281656500.0,589824.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,600093700.0,46922750.0,0.0,28.0,1.0,3.5,2.5,37.0,3.0,1.0,1.0
50%,978148200.0,97153020.0,0.0,300.0,23.0,4.0,4.0,37.0,5.0,1.0,1.0
75%,1082310000.0,181924900.0,1.99,2793.0,140.0,4.5,4.5,38.0,5.0,8.0,1.0
max,1188376000.0,4025970000.0,299.99,2974676.0,177050.0,5.0,5.0,47.0,5.0,75.0,1.0


In [3]:
AppleStore_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
id                  7197 non-null int64
track_name          7197 non-null object
size_bytes          7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
prime_genre         7197 non-null object
sup_devices.num     7197 non-null int64
ipadSc_urls.num     7197 non-null int64
lang.num            7197 non-null int64
vpp_lic             7197 non-null int64
dtypes: float64(3), int64(8), object(5)
memory usage: 899.7+ KB


I have 7197 iOS apps in this data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)




In [4]:
GoogleplayStore_Data = pd.read_csv('/Users/kajol/Downloads/googleplaystore.csv')
GoogleplayStore_Data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In [5]:
GoogleplayStore_Data.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


In [6]:
GoogleplayStore_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


I have 10841 apps in Googleplaystore data, and the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

# Deleting Wrong Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and I can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [7]:
GoogleplayStore_Data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [8]:
display(GoogleplayStore_Data.iloc[10472])

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object

Few fields like Content rating and Android Version are missing with incorrect data under Price coulmn here as per the [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) of this data.

It also corresponds to the app _Life Made WI-Fi Touchscreen Photo Frame_, and I can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the [discussions section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), this problem is caused by a missing value in the `'Category'` column). As a consequence, I'll delete this row. 

In [9]:
GoogleplayStore_Data.drop([10472],inplace=True) # don't run this more than once


In [10]:
GoogleplayStore_Data.loc[10470:10475]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10470,Jazz Wi-Fi,COMMUNICATION,3.4,49,4.0M,"10,000+",Free,0,Everyone,Communication,"February 10, 2017",0.1,2.3 and up
10471,Xposed Wi-Fi-Pwd,PERSONALIZATION,3.5,1042,404k,"100,000+",Free,0,Everyone,Personalization,"August 5, 2014",3.0.0,4.0.3 and up
10473,osmino Wi-Fi: free WiFi,TOOLS,4.2,134203,4.1M,"10,000,000+",Free,0,Everyone,Tools,"August 7, 2018",6.06.14,4.4 and up
10474,Sat-Fi Voice,COMMUNICATION,3.4,37,14M,"1,000+",Free,0,Everyone,Communication,"November 21, 2014",2.2.1.5,2.2 and up
10475,Wi-Fi Visualizer,TOOLS,3.9,132,2.6M,"50,000+",Free,0,Everyone,Tools,"May 17, 2017",0.0.9,2.3 and up


In [11]:
GoogleplayStore_Data.shape

(10840, 13)

I can see that, row 10472 is deleted now.  

# Removing duplicate entries for Googleplaystore Data
# Part One

If I explore the Google Play data set long enough, I'll find that some apps have more than one entry. For instance, the application Instagram has four entries:



In [12]:
duplicate = GoogleplayStore_Data[GoogleplayStore_Data.duplicated('App')]
duplicate

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
229,Quick PDF Scanner + OCR FREE,BUSINESS,4.2,80805,Varies with device,"5,000,000+",Free,0,Everyone,Business,"February 26, 2018",Varies with device,4.0.3 and up
236,Box,BUSINESS,4.2,159872,Varies with device,"10,000,000+",Free,0,Everyone,Business,"July 31, 2018",Varies with device,Varies with device
239,Google My Business,BUSINESS,4.4,70991,Varies with device,"5,000,000+",Free,0,Everyone,Business,"July 24, 2018",2.19.0.204537701,4.4 and up
256,ZOOM Cloud Meetings,BUSINESS,4.4,31614,37M,"10,000,000+",Free,0,Everyone,Business,"July 20, 2018",4.1.28165.0716,4.0 and up
261,join.me - Simple Meetings,BUSINESS,4.0,6989,Varies with device,"1,000,000+",Free,0,Everyone,Business,"July 16, 2018",4.3.0.508,4.4 and up
265,Box,BUSINESS,4.2,159872,Varies with device,"10,000,000+",Free,0,Everyone,Business,"July 31, 2018",Varies with device,Varies with device
266,Zenefits,BUSINESS,4.2,296,14M,"50,000+",Free,0,Everyone,Business,"June 15, 2018",3.2.1,4.1 and up
267,Google Ads,BUSINESS,4.3,29313,20M,"5,000,000+",Free,0,Everyone,Business,"July 30, 2018",1.12.0,4.0.3 and up
268,Google My Business,BUSINESS,4.4,70991,Varies with device,"5,000,000+",Free,0,Everyone,Business,"July 24, 2018",2.19.0.204537701,4.4 and up
269,Slack,BUSINESS,4.4,51507,Varies with device,"5,000,000+",Free,0,Everyone,Business,"August 2, 2018",Varies with device,Varies with device


In total, there are 1,181 cases where an app occurs more than once. After sorting values below, I examined the rows I printed for the 8 Ball Pool app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times.


In [13]:

duplicate.sort_values(by='App', ascending=True)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1407,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
2543,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2385,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up
1434,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15M,"100,000+",Free,0,Everyone,Health & Fitness,"August 2, 2018",3.0.0,4.1 and up
5415,365Scores - Live Scores,SPORTS,4.6,666246,25M,"10,000,000+",Free,0,Everyone,Sports,"July 29, 2018",5.5.9,4.1 and up
7035,420 BZ Budeze Delivery,MEDICAL,5.0,2,11M,100+,Free,0,Mature 17+,Medical,"June 6, 2018",1.0.1,4.1 and up
3953,8 Ball Pool,SPORTS,4.5,14184910,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1703,8 Ball Pool,GAME,4.5,14198602,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1755,8 Ball Pool,GAME,4.5,14200344,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1844,8 Ball Pool,GAME,4.5,14200550,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up


![Screenshot%202024-07-24%20at%2012.05.42%E2%80%AFPM.png](attachment:Screenshot%202024-07-24%20at%2012.05.42%E2%80%AFPM.png)

The highest review here is : 14201891	


I can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, I'll only keep the row with the highest number of reviews and remove the other entries for any given app.

Earlier, I worked on Google Play dataset and found that there are 1,181 duplicates. After I remove the duplicates, I should be left with 9,659 rows:

# Part Two

Now, I will remove duplicates using pandas:

First, sorted Reviews column by descending order, then dropped duplicates at column App and reset the index to have a new clean one. 



In [14]:
New_GoogleplayStore_Data = GoogleplayStore_Data.sort_values('Reviews', ascending=False).drop_duplicates('App').sort_index().reset_index(drop=True)
New_GoogleplayStore_Data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
2,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
3,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
4,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
5,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
6,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
7,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
8,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
9,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


As we see, after removing duplicates I have 

Number of rows: 9659
Number of columns: 13

And, to verify that the row with the highest number of review exists while all other entries have been removed. To check this, I used the '8 Ball Pool' app in the New_GoogleplayStore_Data, where it includes the row with the highest number of reviews and with new index.

In [15]:
(New_GoogleplayStore_Data[New_GoogleplayStore_Data['App'] == '8 Ball Pool'])

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1347,8 Ball Pool,GAME,4.5,14201891,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up


# Removing duplicates for Applestore data


In [16]:
AppleStore_Data

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.00,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.00,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.00,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.00,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.00,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.00,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.00,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.00,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.00,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.00,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


Looking for Duplicates: First I'm checking based on 'id' column and then based on 'track_name' column:

In [17]:
Apple_duplicates = AppleStore_Data[AppleStore_Data.duplicated('id')]
Apple_duplicates

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic


There are no duplicates based on 'id' column.

In [18]:
Apple_duplicates = AppleStore_Data[AppleStore_Data.duplicated('track_name')]
Apple_duplicates

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
4463,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1
4831,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1


Here, I found two duplicates of each of the above rows based on column 'track name'

In [19]:
(AppleStore_Data[AppleStore_Data['track_name']== 'Mannequin Challenge'])

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
2948,1173990889,Mannequin Challenge,109705216,USD,0.0,668,87,3.0,3.0,1.4,9+,Games,37,4,1,1
4463,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1


In [20]:
(AppleStore_Data[AppleStore_Data['track_name']== 'VR Roller Coaster'])

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
4442,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1
4831,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1


I don't want to count certain apps more than once when I analyze data, so I should ideally remove the duplicate entries and keep only one entry per app like I did in Googleplaystore data. But here, I don't see any specific criteria for removing the duplicates. Hence, I will skip it Applestore data case because there are not many duplicte entries here. 

# Removing Non-English apps for Googleplaystore data
# Part One

Checking whether these app names are detected as English or non-English:

In [21]:
New_GoogleplayStore_Data.App

0          Photo Editor & Candy Camera & Grid & ScrapBook
1       U Launcher Lite – FREE Live Cool Themes, Hide ...
2                                   Sketch - Draw & Paint
3                   Pixel Draw - Number Art Coloring Book
4                              Paper flowers instructions
5                 Smoke Effect Photo Maker - Smoke Editor
6                                        Infinite Painter
7                                    Garden Coloring Book
8                           Kids Paint Free - Drawing Fun
9                                 Text on Photo - Fonteee
10                Name Art Photo Editor - Focus n Filters
11                         Tattoo Name On My Photo Editor
12                                  Mandala Coloring Book
13        3D Color Pixel by Number - Sandbox Art Coloring
14                        Learn To Draw Kawaii Characters
15           Photo Designer - Write your name with shapes
16                               350 Diy Room Decor Ideas
17            

After exploring the data sets enough, I found the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from googleappstore dataset;

In [22]:
New_GoogleplayStore_Data.iloc[[8995, 8022, 6979]]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
8995,CNY Slots : Gong Xi Fa Cai 发财机,GAME,3.6,33,71M,"5,000+",Free,0,Teen,Casino,"June 6, 2017",1.2,4.0.3 and up
8022,বাংলাflix,FAMILY,4.2,1111,7.3M,"100,000+",Free,0,Everyone,Entertainment,"June 5, 2018",3.6.1,4.1 and up
6979,Hlášenírozhlasu.cz,COMMUNICATION,,0,17M,10+,Free,0,Everyone,Communication,"July 27, 2018",2.1.3,4.1 and up


After taking a look at Applestore data set, I found multiple apps not directed towards english :

In [23]:
AppleStore_Data.track_name

0                                                Facebook
1                                               Instagram
2                                          Clash of Clans
3                                              Temple Run
4                                 Pandora - Music & Radio
5                                               Pinterest
6                                                   Bible
7                                        Candy Crush Saga
8                                           Spotify Music
9                                             Angry Birds
10                                         Subway Surfers
11                                    Fruit Ninja Classic
12                                              Solitaire
13                                             CSR Racing
14                    Crossy Road - Endless Arcade Hopper
15                               Injustice: Gods Among Us
16                                                Hay Day
17            

Python has a built-in package called re , which can be used to work with Regular Expressions.

Using regular expressions, I filetered apps as much as possible. 

In [24]:

import re

def is_english(text):
    """Checks if a string contains only English characters."""
    pattern = re.compile(r'^[a-zA-Z0-9\s\&\d\™\®\#\()\ ⁴\!\‰\℠\’\⁺\%\+\‼\◎\Ⓞ\⋆\*\-\'\\▻\.\∞\,.,!,?,;,:,-–]+')
    return pattern.fullmatch(str(text)) is not None



#is_english("Countdown‼ (Event Reminders and Timer)")
Apple_data = AppleStore_Data
Apple_df = pd.DataFrame(Apple_data)

Apple_df = Apple_df[Apple_df['track_name'].apply(is_english)]
print(Apple_df)


              id                                         track_name  \
0      284882215                                           Facebook   
1      389801252                                          Instagram   
2      529479190                                     Clash of Clans   
3      420009108                                         Temple Run   
4      284035177                            Pandora - Music & Radio   
5      429047995                                          Pinterest   
6      282935706                                              Bible   
7      553834731                                   Candy Crush Saga   
8      324684580                                      Spotify Music   
9      343200656                                        Angry Birds   
10     512939461                                     Subway Surfers   
11     362949845                                Fruit Ninja Classic   
12     359917414                                          Solitaire   
13    

In case of apple apps above, I was able to filter 6089 clean english apps using regular expressions. In remaining 1108 apps: there are still few english apps that includes some expressions which were not easy to filter. For example: checking few random english looking apps from the remaining 1108 apps if they are english or not:

In [25]:
def is_english(text):
    """Checks if a string contains only English characters."""
    pattern = re.compile(r'^[a-zA-Z0-9\s\&\d\™\®\#\()\ ⁴\!\‰\℠\’\⁺\%\+\‼\◎\Ⓞ\⋆\*\-\'\\▻\.\∞\,.,!,?,;,:,-–]+')
    return pattern.fullmatch(str(text)) is not None



print(is_english("Lapse It Pro • Time Lapse & Stop Motion Camera... "))

print(is_english("Adobe Spark Page — Create Stunning Web Pages    "))

print(is_english("COOKING MAMA Let's Cook！   "))

print(is_english("MSQRD — Live Filters & Face Swap for Video Sel... "))

False
False
False
False


Here, I see all of these above are False. That means the text is using some expressions which is not easy to identify using Regex.

Filtering english apps from Android:

In [26]:

import re

def is_english(text):
    """Checks if a string contains only English characters."""
    pattern = re.compile(r'^[a-zA-Z0-9\s\&\d\™\®\#\()\ ⁴\!\‰\℠\’\⁺\%\+\‼\◎\Ⓞ\⋆\*\-\'\\▻\.\∞\,.,!,?,;,:,-–]+')
    return pattern.fullmatch(str(text)) is not None



#is_english("Countdown‼ (Event Reminders and Timer)")
Google_data = New_GoogleplayStore_Data
Google_df = pd.DataFrame(Google_data)

Google_df = Google_df[Google_df['App'].apply(is_english)]
print(Google_df)

                                                    App             Category  \
0        Photo Editor & Candy Camera & Grid & ScrapBook       ART_AND_DESIGN   
1     U Launcher Lite – FREE Live Cool Themes, Hide ...       ART_AND_DESIGN   
2                                 Sketch - Draw & Paint       ART_AND_DESIGN   
3                 Pixel Draw - Number Art Coloring Book       ART_AND_DESIGN   
4                            Paper flowers instructions       ART_AND_DESIGN   
5               Smoke Effect Photo Maker - Smoke Editor       ART_AND_DESIGN   
6                                      Infinite Painter       ART_AND_DESIGN   
7                                  Garden Coloring Book       ART_AND_DESIGN   
8                         Kids Paint Free - Drawing Fun       ART_AND_DESIGN   
9                               Text on Photo - Fonteee       ART_AND_DESIGN   
10              Name Art Photo Editor - Focus n Filters       ART_AND_DESIGN   
11                       Tattoo Name On 

In case of Android apps above, I was able to filter 9544 clean english apps using regular expressions. In remaining  115 apps: there are still few english apps that includes some expressions or emojis which were not easy to filter. For example: checking few random english looking apps from the remaining 115 apps if they are english or not:

In [27]:
def is_english(text):
    """Checks if a string contains only English characters."""
    pattern = re.compile(r'^[a-zA-Z0-9\s\&\d\™\®\#\()\ ⁴\!\‰\℠\’\⁺\%\+\‼\◎\Ⓞ\⋆\*\-\'\\▻\.\∞\,.,!,?,;,:,-–]+')
    return pattern.fullmatch(str(text)) is not None



print(is_english("Invoice 2go — Professional Invoices and Estimates "))
print(is_english("🔥 Football Wallpapers 4K | Full HD Backgrounds 😍 "))
print(is_english("   FlirtChat - ♥Free Dating/Flirting App♥ "))
print(is_english("Homes.com 🏠 For Sale, Rent "))

False
False
False
False


# Hence, I'm left with 9544 Android apps and 6089 iOS apps.

This filter function does impact the data loss but I have all the clean english apps without any unidetifiable expressions and emojis which will make it easy to analyse the new filtered data later. Our filter function is still not perfect, but it should be fairly effective.



# Isolating the free apps

As we know, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads.  Our datasets contain both free and non-free apps; I'll need to isolate only the free apps for our analysis. Below, I isolate the free apps for both our data sets.



In [28]:
Google_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9544 entries, 0 to 9658
Data columns (total 13 columns):
App               9544 non-null object
Category          9544 non-null object
Rating            8099 non-null float64
Reviews           9544 non-null object
Size              9544 non-null object
Installs          9544 non-null object
Type              9543 non-null object
Price             9544 non-null object
Content Rating    9544 non-null object
Genres            9544 non-null object
Last Updated      9544 non-null object
Current Ver       9536 non-null object
Android Ver       9542 non-null object
dtypes: float64(1), object(12)
memory usage: 1.0+ MB


Here, the dtype for Price column for android app is object . Therefore, filtering using str contain. 

In [29]:

filtered_Google_df = Google_df[Google_df['Price'].str.contains('0')]


filtered_Google_df 

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
2,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
3,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
4,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
5,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
6,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
7,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
8,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
9,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


In [30]:
Apple_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6089 entries, 0 to 7195
Data columns (total 16 columns):
id                  6089 non-null int64
track_name          6089 non-null object
size_bytes          6089 non-null int64
currency            6089 non-null object
price               6089 non-null float64
rating_count_tot    6089 non-null int64
rating_count_ver    6089 non-null int64
user_rating         6089 non-null float64
user_rating_ver     6089 non-null float64
ver                 6089 non-null object
cont_rating         6089 non-null object
prime_genre         6089 non-null object
sup_devices.num     6089 non-null int64
ipadSc_urls.num     6089 non-null int64
lang.num            6089 non-null int64
vpp_lic             6089 non-null int64
dtypes: float64(3), int64(8), object(5)
memory usage: 808.7+ KB


Since, the price column dtype is float in Apple dataset. I will filter it directly. 

In [31]:
filtered_Apple_df  = Apple_df[Apple_df['price']== 0]
filtered_Apple_df 


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.0,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.0,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.0,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.0,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


In [32]:
print(len(filtered_Google_df ))
print(len(filtered_Apple_df))

8974
3156


I'm left with 8974 Android apps and 3156 iOS apps, which should be enough for our analysis.




## Most Common Apps by Genre

### Part One

As I mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, my validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, I need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, I'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

### Part Two

I'll use pandas to analyze the frequency tables:

- First one to generate frequency tables that show percentages
- Another to display the percentages in a descending order

In [33]:
#Apple:

counts = filtered_Apple_df['prime_genre'].value_counts()

Apple_percentages = (counts / len(filtered_Apple_df)) * 100

print(Apple_percentages)

Games                58.713561
Entertainment         7.889734
Photo & Video         5.006337
Education             3.738910
Social Networking     3.168568
Shopping              2.503169
Utilities             2.376426
Sports                2.186312
Health & Fitness      2.059569
Music                 2.027883
Productivity          1.615970
Lifestyle             1.489227
News                  1.299113
Travel                1.172370
Finance               1.108999
Weather               0.887199
Food & Drink          0.823828
Business              0.538657
Reference             0.538657
Book                  0.348542
Medical               0.190114
Navigation            0.190114
Catalogs              0.126743
Name: prime_genre, dtype: float64


In [34]:
#Android: Category

counts = filtered_Google_df['Category'].value_counts()

Google_percentages = (counts / len(filtered_Google_df)) * 100

print(Google_percentages)

FAMILY                 18.965901
GAME                    9.772677
TOOLS                   8.502340
BUSINESS                4.535324
PRODUCTIVITY            3.900156
LIFESTYLE               3.889013
FINANCE                 3.666147
PERSONALIZATION         3.655003
MEDICAL                 3.655003
SPORTS                  3.398707
COMMUNICATION           3.242701
HEALTH_AND_FITNESS      3.008692
PHOTOGRAPHY             2.919545
NEWS_AND_MAGAZINES      2.819256
SOCIAL                  2.607533
TRAVEL_AND_LOCAL        2.273234
BOOKS_AND_REFERENCE     2.206374
SHOPPING                2.206374
DATING                  1.816358
VIDEO_PLAYERS           1.749499
MAPS_AND_NAVIGATION     1.370626
FOOD_AND_DRINK          1.214620
EDUCATION               1.158903
ENTERTAINMENT           0.913751
AUTO_AND_VEHICLES       0.913751
LIBRARIES_AND_DEMO      0.891464
HOUSE_AND_HOME          0.791175
WEATHER                 0.780031
EVENTS                  0.713171
ART_AND_DESIGN          0.646312
PARENTING 

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

![image.png](attachment:image.png)

In [35]:
#Android: Genres

counts = filtered_Google_df['Genres'].value_counts()

Google_percentages = (counts / len(filtered_Google_df)) * 100

print(Google_percentages)

Tools                                  8.491197
Entertainment                          6.028527
Education                              5.337642
Business                               4.535324
Productivity                           3.900156
Lifestyle                              3.877869
Finance                                3.666147
Personalization                        3.655003
Medical                                3.655003
Sports                                 3.465567
Communication                          3.242701
Action                                 3.120125
Health & Fitness                       3.008692
Photography                            2.919545
News & Magazines                       2.819256
Social                                 2.607533
Travel & Local                         2.262090
Shopping                               2.206374
Books & Reference                      2.206374
Simulation                             2.039224
Arcade                                 1

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so I'll only work with the Category column moving forward.

Up to this point, I found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now I'd like to get an idea about the kind of apps that have most users.

## Most Popular Apps by Genre on the App Store
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, I can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, I'll take the total number of user ratings as a proxy, which I can find in the rating_count_tot app.

Below, I calculated the average number of user ratings per app genre on the App Store:



In [36]:
filtered_Apple_df

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.0,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.0,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.0,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.0,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


In [37]:



AvgRating_ios = filtered_Apple_df[['rating_count_tot', 'prime_genre']].groupby("prime_genre").mean().sort_values(by = 'rating_count_tot', ascending=False)
AvgRating_ios 

Unnamed: 0_level_0,rating_count_tot
prime_genre,Unnamed: 1_level_1
Navigation,86090.333333
Reference,79350.470588
Social Networking,75813.06
Music,59112.140625
Weather,52279.892857
Book,50560.727273
Food & Drink,33333.923077
Finance,32367.028571
Travel,30474.864865
Photo & Video,28616.734177


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [38]:

Navigation = filtered_Apple_df[filtered_Apple_df['prime_genre'] == 'Navigation']
Nav_grouped = Navigation.groupby('track_name')
Navigation_ios = Nav_grouped['rating_count_tot'].mean().sort_values( ascending=False)
Navigation_ios

track_name
Waze - GPS Navigation, Maps & Real-time Traffic     345046
Google Maps - Navigation & Transit                  154911
Geocaching®                                          12811
CoPilot GPS – Car Navigation & Offline Maps           3582
ImmobilienScout24: Real Estate Search in Germany       187
Railway Route Search                                     5
Name: rating_count_tot, dtype: int64

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.

Reference apps have 79,350. user ratings on average, but it's actually the Bible and Dictionary.com which skew up the average rating:

In [39]:
Reference = filtered_Apple_df[filtered_Apple_df['prime_genre'] == 'Reference']
Ref_grouped = Reference.groupby('track_name')
Reference_ios = Ref_grouped['rating_count_tot'].mean().sort_values( ascending=False)
Reference_ios 

track_name
Bible                                                                                                 985920
Dictionary.com Dictionary & Thesaurus                                                                 200047
Dictionary.com Dictionary & Thesaurus for iPad                                                         54175
Google Translate                                                                                       26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran                                                     18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition                                 17588
Merriam-Webster Dictionary                                                                             16849
Night Sky                                                                                              12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)                          8535
LUCKY BL

Other genres that seem popular include weather, book, food and drink, or finance. 

Weather apps — people generally don't spend too much time in-app, and the chances of making profit from in-app adds are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

Food and drink — examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

Finance apps — these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

## Most Popular Apps by Genre on Google Play
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [40]:
filtered_Google_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
2,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
3,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
4,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
5,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
6,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
7,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
8,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
9,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


In [41]:
Install_counts = filtered_Google_df['Installs'].value_counts()

Avg_installs = (Install_counts / len(filtered_Google_df)) *100

print(Avg_installs)

1,000,000+        15.433474
100,000+          11.377312
10,000+           10.296412
10,000,000+       10.262982
1,000+             8.624916
100+               7.131714
5,000,000+         6.674838
500,000+           5.460218
50,000+            4.758190
5,000+             4.602184
10+                3.799866
500+               3.264988
50,000,000+        2.217517
100,000,000+       2.072654
50+                2.016938
5+                 0.835748
1+                 0.601738
500,000,000+       0.267439
1,000,000,000+     0.222866
0+                 0.066860
0                  0.011143
Name: Installs, dtype: float64


One problem with this data is that is not precise. For instance, I don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, I don't need very precise data for our purposes — we only want to get an idea which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

I'm going to leave the numbers as they are, which means that I'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, I'll need to convert each install number to `float` — this means that I need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. Then, I'll compute the average number of installs for each genre (category).

In [42]:

filtered_Google_df['Installs'] = filtered_Google_df['Installs'].replace({',': '','\+': ''}, regex=True).astype(float)
filtered_Google_df['Installs']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


0          10000.0
1        5000000.0
2       50000000.0
3         100000.0
4          50000.0
5          50000.0
6        1000000.0
7        1000000.0
8          10000.0
9        1000000.0
10       1000000.0
11      10000000.0
12        100000.0
13        100000.0
14          5000.0
15        500000.0
16         10000.0
17       5000000.0
18      10000000.0
19        100000.0
20        100000.0
21        500000.0
22         50000.0
23         10000.0
24        500000.0
25        100000.0
26         10000.0
27        100000.0
28        100000.0
29         50000.0
           ...    
9629         100.0
9630        1000.0
9631       10000.0
9632       50000.0
9633      500000.0
9634         100.0
9635      100000.0
9636       10000.0
9637        5000.0
9638        1000.0
9639          50.0
9640          10.0
9641         100.0
9642       10000.0
9643         100.0
9644     5000000.0
9645        5000.0
9646       10000.0
9647       10000.0
9648      100000.0
9649        5000.0
9650      10

In [43]:
avg_installs = filtered_Google_df.groupby('Category')['Installs'].mean().sort_values( ascending=False)
pd.options.display.float_format = '{:.0f}'.format
avg_installs

Category
COMMUNICATION         37687362
VIDEO_PLAYERS         24940973
SOCIAL                23426760
PHOTOGRAPHY           17699500
PRODUCTIVITY          16545944
GAME                  14972564
TRAVEL_AND_LOCAL      14130902
ENTERTAINMENT         11992195
TOOLS                 10471269
NEWS_AND_MAGAZINES     9364495
BOOKS_AND_REFERENCE    7908804
SHOPPING               7067367
WEATHER                5146965
PERSONALIZATION        4604751
HEALTH_AND_FITNESS     4227918
MAPS_AND_NAVIGATION    3968872
FAMILY                 3632271
SPORTS                 3590926
ART_AND_DESIGN         1952105
FOOD_AND_DRINK         1933383
EDUCATION              1782212
BUSINESS               1685263
LIFESTYLE              1408595
FINANCE                1383475
HOUSE_AND_HOME         1353556
COMICS                  842714
DATING                  800091
AUTO_AND_VEHICLES       647318
LIBRARIES_AND_DEMO      641199
PARENTING               542604
BEAUTY                  513152
EVENTS                  249581

On average, communication apps have the most installs: 37,687,362. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [44]:
Communication = filtered_Google_df[filtered_Google_df['Category'] == 'COMMUNICATION']
Comm_grouped = Communication.groupby('App')
Comm_Android = Comm_grouped['Installs'].mean().sort_values( ascending=False)
Comm_Android

App
Gmail                                               1000000000
Google Chrome: Fast & Secure                        1000000000
Messenger – Text and Video Chat for Free            1000000000
WhatsApp Messenger                                  1000000000
Hangouts                                            1000000000
Skype - free IM & video calls                       1000000000
LINE: Free Calls & Messages                          500000000
imo free video calls and chat                        500000000
UC Browser - Fast Download Private & Secure          500000000
Viber Messenger                                      500000000
Google Duo - High Quality Video Calls                500000000
Who                                                  100000000
Android Messages                                     100000000
Telegram                                             100000000
BBM - Free Calls & Messages                          100000000
Opera Mini - fast web browser                      

If I remove all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:



In [46]:
under_100mn = filtered_Google_df.loc[(filtered_Google_df['Installs'] < 100000000) 
                                           & (filtered_Google_df['Category'] == 'COMMUNICATION')]
under_100mn 

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
278,Messenger for SMS,COMMUNICATION,4,125257,17M,10000000,Free,0,Teen,Communication,"June 6, 2018",1.8.9,4.1 and up
279,My Tele2,COMMUNICATION,4,158679,8.8M,5000000,Free,0,Everyone,Communication,"August 3, 2018",2.4.1,4.4 and up
281,Contacts,COMMUNICATION,4,66602,Varies with device,50000000,Free,0,Everyone,Communication,"June 26, 2018",2.8.4.201036949,5.0 and up
282,Call Free – Free Call,COMMUNICATION,4,30209,15M,5000000,Free,0,Everyone,Communication,"July 28, 2018",1.3.4,4.1 and up
283,Web Browser & Explorer,COMMUNICATION,4,36901,6.6M,5000000,Free,0,Everyone,Communication,"July 4, 2018",11.8.6,4.0.3 - 7.1.1
284,Browser 4G,COMMUNICATION,4,192948,6.6M,10000000,Free,0,Everyone,Communication,"June 19, 2018",24.6.6,4.0.3 - 7.1.1
285,MegaFon Dashboard,COMMUNICATION,4,99559,Varies with device,10000000,Free,0,Everyone,Communication,"July 30, 2018",Varies with device,Varies with device
286,ZenUI Dialer & Contacts,COMMUNICATION,4,437674,Varies with device,10000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
287,Cricket Visual Voicemail,COMMUNICATION,4,13698,5.1M,10000000,Free,0,Everyone,Communication,"July 2, 2018",3.2.0.100171,4.1 and up
288,TracFone My Account,COMMUNICATION,4,20769,18M,1000000,Free,0,Everyone,Communication,"July 11, 2018",R6.0.3,4.1 and up


In [47]:
sum(under_100mn['Installs'])/len(under_100mn['Installs'])

3284175.4583333335

We see the same pattern for the video players category, which is the runner-up with 24,940,973 installs. The market is dominated by apps like Youtube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 7,908,804. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [57]:
Book_Ref = filtered_Google_df.loc[(filtered_Google_df['Category'] == 'BOOKS_AND_REFERENCE')]
Book_Ref.sort_values(by=['Installs'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
144,Google Play Books,BOOKS_AND_REFERENCE,4,1433233,Varies with device,1000000000,Free,0,Teen,Books & Reference,"August 3, 2018",Varies with device,Varies with device
2958,Bible,BOOKS_AND_REFERENCE,5,2440695,Varies with device,100000000,Free,0,Teen,Books & Reference,"August 2, 2018",Varies with device,Varies with device
4525,Audiobooks from Audible,BOOKS_AND_REFERENCE,4,568922,Varies with device,100000000,Free,0,Teen,Books & Reference,"August 1, 2018",Varies with device,Varies with device
3069,Amazon Kindle,BOOKS_AND_REFERENCE,4,814151,Varies with device,100000000,Free,0,Teen,Books & Reference,"July 27, 2018",Varies with device,Varies with device
7124,Dictionary,BOOKS_AND_REFERENCE,4,264260,Varies with device,10000000,Free,0,Everyone,Books & Reference,"June 22, 2018",Varies with device,Varies with device
8464,English Hindi Dictionary,BOOKS_AND_REFERENCE,4,384368,Varies with device,10000000,Free,0,Everyone,Books & Reference,"August 4, 2018",Varies with device,Varies with device
8463,Oxford Dictionary of English : Free,BOOKS_AND_REFERENCE,4,364452,7.1M,10000000,Free,0,Everyone,Books & Reference,"July 11, 2018",9.1.363,4.1 and up
8447,JW Library,BOOKS_AND_REFERENCE,5,922752,Varies with device,10000000,Free,0,Everyone,Books & Reference,"June 15, 2018",Varies with device,Varies with device
8443,Dictionary - Merriam-Webster,BOOKS_AND_REFERENCE,4,454412,Varies with device,10000000,Free,0,Everyone,Books & Reference,"May 18, 2018",Varies with device,Varies with device
8392,Spanish English Translator,BOOKS_AND_REFERENCE,4,87919,Varies with device,10000000,Free,0,Teen,Books & Reference,"May 28, 2018",Varies with device,Varies with device


The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [86]:


Popular_Book_Ref = filtered_Google_df.loc[(filtered_Google_df['Category'] == 'BOOKS_AND_REFERENCE')
                                          & ((filtered_Google_df['Installs'] == 1000000000)
                                            |(filtered_Google_df['Installs'] == 500000000)
                                             |(filtered_Google_df['Installs'] == 100000000))]

Popular_Book_Ref.sort_values(by=['Installs'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
144,Google Play Books,BOOKS_AND_REFERENCE,4,1433233,Varies with device,1000000000,Free,0,Teen,Books & Reference,"August 3, 2018",Varies with device,Varies with device
2958,Bible,BOOKS_AND_REFERENCE,5,2440695,Varies with device,100000000,Free,0,Teen,Books & Reference,"August 2, 2018",Varies with device,Varies with device
3069,Amazon Kindle,BOOKS_AND_REFERENCE,4,814151,Varies with device,100000000,Free,0,Teen,Books & Reference,"July 27, 2018",Varies with device,Varies with device
4525,Audiobooks from Audible,BOOKS_AND_REFERENCE,4,568922,Varies with device,100000000,Free,0,Teen,Books & Reference,"August 1, 2018",Varies with device,Varies with device


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [87]:
MidPop_Book_Ref = filtered_Google_df.loc[(filtered_Google_df['Category'] == 'BOOKS_AND_REFERENCE')
                                          & ((filtered_Google_df['Installs'] == 1000000)
                                            |(filtered_Google_df['Installs'] == 5000000)
                                             |(filtered_Google_df['Installs'] == 10000000)
                                            |(filtered_Google_df['Installs'] == 50000000))]

MidPop_Book_Ref.sort_values(by=['Installs'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
137,Wikipedia,BOOKS_AND_REFERENCE,4,577550,Varies with device,10000000,Free,0,Everyone,Books & Reference,"August 2, 2018",Varies with device,Varies with device
3077,Aldiko Book Reader,BOOKS_AND_REFERENCE,4,210534,22M,10000000,Free,0,Everyone,Books & Reference,"June 13, 2018",3.0.58,4.0 and up
8464,English Hindi Dictionary,BOOKS_AND_REFERENCE,4,384368,Varies with device,10000000,Free,0,Everyone,Books & Reference,"August 4, 2018",Varies with device,Varies with device
8463,Oxford Dictionary of English : Free,BOOKS_AND_REFERENCE,4,364452,7.1M,10000000,Free,0,Everyone,Books & Reference,"July 11, 2018",9.1.363,4.1 and up
8447,JW Library,BOOKS_AND_REFERENCE,5,922752,Varies with device,10000000,Free,0,Everyone,Books & Reference,"June 15, 2018",Varies with device,Varies with device
8443,Dictionary - Merriam-Webster,BOOKS_AND_REFERENCE,4,454412,Varies with device,10000000,Free,0,Everyone,Books & Reference,"May 18, 2018",Varies with device,Varies with device
8392,Spanish English Translator,BOOKS_AND_REFERENCE,4,87919,Varies with device,10000000,Free,0,Teen,Books & Reference,"May 28, 2018",Varies with device,Varies with device
7124,Dictionary,BOOKS_AND_REFERENCE,4,264260,Varies with device,10000000,Free,0,Everyone,Books & Reference,"June 22, 2018",Varies with device,Varies with device
5355,NOOK: Read eBooks & Magazines,BOOKS_AND_REFERENCE,4,155466,Varies with device,10000000,Free,0,Teen,Books & Reference,"April 25, 2018",Varies with device,Varies with device
5155,English Dictionary - Offline,BOOKS_AND_REFERENCE,4,341234,30M,10000000,Free,0,Everyone 10+,Books & Reference,"March 20, 2018",3.9.1,4.2 and up


This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so I'll have to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.