# Profitable App Profiles for the Google Play Market

Our aim in this project is to find mobile app profiles that are profitable for the Google Play market. We're working as data analysts for a company that builds Android mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

We use [a data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play.

In [1]:
import numpy as np
import pandas as pd

In [2]:
android = pd.read_csv("googleplaystore.csv", encoding = "utf-8")

In [4]:
android.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


In [5]:
android.shape

(10841, 13)

# Data Cleaning (Deleting Wrong Data)

In [6]:
android.loc[10472]  # incorrect row: the Category value is missing, so every later value is shifted one column to the left

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object
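
One way rows like this could be flagged automatically is to look for ratings outside the valid range. This is a minimal sketch, assuming ratings on Google Play never exceed 5; the small `sample` frame is built only for illustration, with values copied from the outputs above.

```python
import pandas as pd

# Tiny illustrative frame; the second row mimics the malformed entry shown above.
sample = pd.DataFrame(
    {
        "App": ["Coloring book moana", "Life Made WI-Fi Touchscreen Photo Frame"],
        "Category": ["ART_AND_DESIGN", "1.9"],
        "Rating": [3.9, 19.0],
    }
)

# Assumption: a valid rating is at most 5, so anything above 5 is suspect.
suspect = sample[sample["Rating"] > 5]
print(suspect.index.tolist())
```

Applied to the full data set, the same filter would only point at candidates; each flagged row should still be inspected by hand before it is dropped.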

In [7]:
print(len(android))
android.drop(10472, axis = 0, inplace = True)
print(len(android))

10841
10840


# Removing Duplicate Entries

In [8]:
android[android["App"] =="Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


In [9]:
print("Number of duplicate apps:", android.duplicated(["App"]).sum())

Number of duplicate apps: 1181


In [11]:
android["Reviews"] = android["Reviews"].astype(float)

In [12]:
android.sort_values("Reviews", ascending = False, inplace = True)

In [13]:
android.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2544,Facebook,SOCIAL,4.1,78158306.0,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
3943,Facebook,SOCIAL,4.1,78128208.0,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
381,WhatsApp Messenger,COMMUNICATION,4.4,69119316.0,Varies with device,"1,000,000,000+",Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device


In [14]:
android.drop_duplicates(["App"], inplace = True)

In [15]:
android.shape

(9659, 13)
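
The steps above rely on `drop_duplicates` keeping the first occurrence of each app name by default, so sorting by `Reviews` in descending order first means the entry with the most reviews survives. Treating that entry as the most recent snapshot is an assumption. A minimal sketch using the review counts of the four Instagram rows shown earlier:

```python
import pandas as pd

# Review counts and index labels copied from the four Instagram rows shown earlier.
dupes = pd.DataFrame(
    {
        "App": ["Instagram"] * 4,
        "Reviews": [66577313.0, 66577446.0, 66577313.0, 66509917.0],
    },
    index=[2545, 2604, 2611, 3909],
)

# Sort so the highest review count comes first, then keep the first row per app.
deduped = dupes.sort_values("Reviews", ascending=False).drop_duplicates(["App"])
print(deduped)
```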

# Removing Non-English Apps

In [16]:
print(android["App"].sort_values().tail(20).values)

['বাংলাflix' 'သိင်္ Astrology - Min Thein Kha BayDin'
 '► MultiCraft ― Free Miner! 👍' '【Miku AR Camera】Mikuture'
 '【Ranobbe complete free】 Novelba - Free app that you can read and write novels'
 'あなカレ【BL】無料ゲーム' 'パーリーゲイツ公式通販｜EJ STYLE（イージェイスタイル）' '中国語 AQリスニング'
 '乐屋网: Buying a house, selling a house, renting a house'
 '乗換NAVITIME\u3000Timetable & Route Search in Japan Tokyo' '哈哈姆特不EY'
 '日本AV历史' '漫咖 Comics - Manga,Novel and Stories' '英漢字典 EC Dictionary'
 '감성학원 BL 첫사랑' '뽕티비 - 개인방송, 인터넷방송, BJ방송' "💎 I'm rich"
 '💘 WhatsLov: Smileys of love, stickers and GIF'
 '📏 Smart Ruler ↔️ cm/inch measuring for homework!'
 '🔥 Football Wallpapers 4K | Full HD Backgrounds 😍']


In [17]:
def is_english(string):
    for item in string:
        if ord(item) > 127:
            return False
    return True

print(is_english("Instagram"))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))

True
False


In [18]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [19]:
def is_english(string):
    non_ascii = 0
    for item in string:
        if ord(item) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
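
The three-character threshold is a heuristic, not a language detector. The sketch below shows where it helps and where it still rejects names that are at least partly English. `count_non_ascii` and `looks_english` are hypothetical helpers introduced only for this illustration; they apply the same rule as above with the threshold exposed as a parameter.

```python
def count_non_ascii(name):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(ord(char) > 127 for char in name)


def looks_english(name, max_non_ascii=3):
    """Same rule as above: tolerate up to `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii


# Names taken from the outputs above.
print(count_non_ascii("Docs To Go™ Free Office Suite"))  # only the trademark sign counts
print(looks_english("Instachat 😜"))  # a single emoji is tolerated
print(looks_english("中国語 AQリスニング"))  # well above the threshold, so rejected
print(looks_english("英漢字典 EC Dictionary"))  # four non-ASCII characters, so rejected
```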


In [20]:
is_eng = android["App"].apply(is_english)

In [21]:
android_english = android[is_eng].copy()

In [22]:
android_english.shape

(9614, 13)

In [23]:
android_english.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2544,Facebook,SOCIAL,4.1,78158306.0,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
381,WhatsApp Messenger,COMMUNICATION,4.4,69119316.0,Varies with device,"1,000,000,000+",Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446.0,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


We can see that we're left with 9614 Android apps.

# Isolating the Free Apps

In [24]:
android_final = android_english[android_english["Price"] == "0"]

In [25]:
android_final.shape

(8864, 13)

We're left with 8864 Android apps, which should be enough for our analysis.
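
The filter above compares `Price` with the string `"0"` because the column is stored as text. The sketch below is an alternative that converts prices to numbers first. It assumes paid apps are written with a leading dollar sign (for example `$4.99`), a format that is not visible in the rows shown in this notebook, and the sample values are made up.

```python
import pandas as pd

# Made-up sample; the "$4.99" format for paid apps is an assumption.
prices = pd.Series(["0", "$4.99", "0", "$0.99"], name="Price")

# Strip the currency symbol (literal match, not a regex) and convert to float.
numeric_price = prices.str.replace("$", "", regex=False).astype(float)

# Free apps are those whose numeric price is exactly zero.
is_free = numeric_price == 0
print(is_free.tolist())
```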

# Most Common Apps by Genre

In [26]:
(android_final["Category"].value_counts(normalize = True)*100).head(10)

FAMILY             18.919224
GAME                9.713448
TOOLS               8.461191
BUSINESS            4.591606
LIFESTYLE           3.903430
PRODUCTIVITY        3.892148
FINANCE             3.700361
MEDICAL             3.531137
SPORTS              3.395758
PERSONALIZATION     3.316787
Name: Category, dtype: float64

In [27]:
(android_final["Genres"].value_counts(normalize = True)*100).head(10)

Tools              8.449910
Entertainment      6.069495
Education          5.347473
Business           4.591606
Lifestyle          3.892148
Productivity       3.892148
Finance            3.700361
Medical            3.531137
Sports             3.463448
Personalization    3.316787
Name: Genres, dtype: float64
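
The `Genres` column can hold more than one genre per app, separated by a semicolon (for example `Art & Design;Pretend Play` in the first rows of the data set), so each combination is counted as its own value in the table above. A minimal sketch of counting individual genres instead, on a small made-up sample:

```python
import pandas as pd

# Small sample; the second value mirrors the multi-genre entry seen earlier.
genres = pd.Series(
    ["Art & Design", "Art & Design;Pretend Play", "Tools", "Tools"],
    name="Genres",
)

# Split on ";" into lists, then give each individual genre its own row.
single_genres = genres.str.split(";").explode()

# Share of each individual genre, as a percentage of all genre labels.
genre_share = single_genres.value_counts(normalize=True) * 100
print(genre_share)
```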

# Most Popular Apps

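To find the most popular categories, we calculate the average number of installs per category on the Google Play Store. The `Installs` column is stored as text in open-ended buckets such as `1,000,000+`, so we first look at how the values are distributed and then convert them to integers: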

In [28]:
(android_final["Installs"].value_counts(normalize = True)*100).head()

1,000,000+     15.726534
100,000+       11.552347
10,000,000+    10.548285
10,000+        10.198556
1,000+          8.393502
Name: Installs, dtype: float64

In [29]:
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
installs_clean = android_final["Installs"].str.replace(r"[+,]", "", regex=True)
android_final = android_final.assign(Installs=installs_clean.astype(int))


In [31]:
android_final.groupby("Category")["Installs"].mean().sort_values(ascending  = False).head(10)

Category
COMMUNICATION         3.845612e+07
VIDEO_PLAYERS         2.472787e+07
SOCIAL                2.325365e+07
PHOTOGRAPHY           1.784011e+07
PRODUCTIVITY          1.678733e+07
GAME                  1.559451e+07
TRAVEL_AND_LOCAL      1.398408e+07
ENTERTAINMENT         1.164071e+07
TOOLS                 1.080139e+07
NEWS_AND_MAGAZINES    9.549178e+06
Name: Installs, dtype: float64

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps in the one-billion-installs bucket (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others in the 100 million and 500 million buckets:



In [32]:
comm = android_final["Category"] == "COMMUNICATION"
installs = android_final["Installs"].isin([1000000000, 500000000, 100000000])

android_final.loc[comm & installs,["App","Installs"]].sort_values("Installs", ascending = False)

Unnamed: 0,App,Installs
381,WhatsApp Messenger,1000000000
411,Google Chrome: Fast & Secure,1000000000
464,Hangouts,1000000000
382,Messenger – Text and Video Chat for Free,1000000000
451,Gmail,1000000000
468,Skype - free IM & video calls,1000000000
383,imo free video calls and chat,500000000
4039,Google Duo - High Quality Video Calls,500000000
474,LINE: Free Calls & Messages,500000000
4676,Viber Messenger,500000000


If we removed all the communication apps that have 100 million installs or more, the average would drop roughly tenfold:

In [33]:
comm = android_final["Category"] == "COMMUNICATION"
installs = android_final["Installs"] < 100000000

android_final[comm & installs]["Installs"].mean()

3603485.3884615386
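
Because a handful of giants pull the mean up, the median can be a more robust summary of a typical app in a category. A minimal sketch on made-up install counts (the numbers are illustrative and not taken from the data set):

```python
import pandas as pd

# Made-up install counts: one giant app among several much smaller ones.
apps = pd.DataFrame(
    {
        "Category": ["COMMUNICATION"] * 5,
        "Installs": [1_000_000_000, 1_000_000, 500_000, 100_000, 10_000],
    }
)

# Compare the mean, which the giant dominates, with the median per category.
summary = apps.groupby("Category")["Installs"].agg(["mean", "median"])
print(summary)
```

On this sample the mean lands above 200 million while the median stays at 500,000, which is the kind of gap the paragraph above describes.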

We see the same pattern for the video players category, which is the runner-up with an average of 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The books and reference genre looks fairly popular as well, with an average of 8,767,811 installs. It's worth exploring in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [34]:
books = android_final[android_final["Category"] =='BOOKS_AND_REFERENCE' ]
books[["App","Installs"]]

Unnamed: 0,App,Installs
4715,Wattpad 📖 Free Books,100000000
3941,Bible,100000000
152,Google Play Books,1000000000
9625,JW Library,10000000
6290,Dictionary.com: Find Definitions for English W...,10000000
...,...,...
9255,CompactiMa EC pH Calibration,100
4543,Guide for R Programming,5
6492,Anime Mod for BM,100
4546,Learn R Programming,10


The books and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [36]:
# Boolean masks: Books and Reference apps in the three largest install buckets
books_mask = android_final["Category"] == 'BOOKS_AND_REFERENCE'
installs = android_final["Installs"].isin([1000000000, 500000000, 100000000])

android_final.loc[books_mask & installs, ["App", "Installs"]].sort_values("Installs", ascending = False)

Unnamed: 0,App,Installs
152,Google Play Books,1000000000
4715,Wattpad 📖 Free Books,100000000
3941,Bible,100000000
4083,Amazon Kindle,100000000
5651,Audiobooks from Audible,100000000


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [37]:
# Boolean masks: Books and Reference apps in the mid-range install buckets
books_mask = android_final["Category"] == 'BOOKS_AND_REFERENCE'
installs = android_final["Installs"].isin([1000000, 5000000, 10000000, 50000000])

android_final.loc[books_mask & installs, ["App", "Installs"]].sort_values("Installs", ascending = False)

Unnamed: 0,App,Installs
9625,JW Library,10000000
8293,Dictionary,10000000
6290,Dictionary.com: Find Definitions for English W...,10000000
173,HTC Help,10000000
6497,NOOK: Read eBooks & Magazines,10000000
149,FBReader: Favorite Book Reader,10000000
4100,Aldiko Book Reader,10000000
5319,Al-Quran (Free),10000000
179,Moon+ Reader,10000000
144,Cool Reader,10000000
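
The bucket-by-bucket checks above work because `Installs` only takes a fixed set of values. The sketch below shows an equivalent range-based filter that avoids listing every bucket; the sample rows are copied from the outputs above.

```python
import pandas as pd

# A few Books and Reference apps and their install counts from the outputs above.
books = pd.DataFrame(
    {
        "App": ["Google Play Books", "Bible", "JW Library", "Learn R Programming"],
        "Installs": [1_000_000_000, 100_000_000, 10_000_000, 10],
    }
)

# Keep apps with at least 1,000,000 and fewer than 100,000,000 installs.
in_range = (books["Installs"] >= 1_000_000) & (books["Installs"] < 100_000_000)
mid_range = books[in_range]
print(mid_range["App"].tolist())
```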


# Conclusions

In this project, we analyzed data about Google Play mobile apps with the goal of recommending an app profile that can be profitable.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for the Google Play market. The market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.