# Profitable App Profiles for the App Store and Google Play Markets

# Aim: 
My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. I work as a data analyst for a company that builds Android and iOS mobile apps, and my job is to enable the developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Opening and Exploring the Data


As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
import pandas as pd

apple_data = pd.read_csv('AppleStore.csv')
android_data = pd.read_csv('googleplaystore.csv')

As you can see above, we loaded both files into pandas DataFrames so that we can view and explore them.

In [2]:
apple_data.head()

Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


In [3]:
android_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Let's see how many rows and columns each data set has.

In [4]:
print("Android Data = Number of rows: ", len(android_data))
print("Android Data = Number of columns: ", len(android_data.columns))

print("\nApple Data = Number of rows: ", len(apple_data))
print("Apple Data = Number of columns: ", len(apple_data.columns))

Android Data = Number of rows:  10841
Android Data = Number of columns:  13

Apple Data = Number of rows:  7197
Apple Data = Number of columns:  17
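The row and column counts above can also be read in one step from the `shape` attribute of a DataFrame. The snippet below is a minimal sketch on a made-up three-row frame (the `sample` name and its values are illustrative, not taken from the data sets):

```python
import pandas as pd

# Toy stand-in for one of the data sets; values are illustrative only.
sample = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Category': ['X', 'Y', 'Z'],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print('Number of rows:', n_rows)
print('Number of columns:', n_cols)
```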


The Google Play data set has 10841 Android apps. The most interesting columns are: `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

The App Store data set has 7197 iOS apps. The columns that might be the most useful for our analysis are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`.

Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# Deleting Wrong Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [5]:
#inconsistent/incorrect row

print(android_data.iloc[10472]) 

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object


In [6]:
#correct row

print(android_data.iloc[10471])

App               Xposed Wi-Fi-Pwd
Category           PERSONALIZATION
Rating                         3.5
Reviews                       1042
Size                          404k
Installs                  100,000+
Type                          Free
Price                            0
Content Rating            Everyone
Genres             Personalization
Last Updated        August 5, 2014
Current Ver                  3.0.0
Android Ver           4.0.3 and up
Name: 10471, dtype: object


Comparing rows 10472 and 10471, we can see that row 10472, which corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, has a rating of 19. This is clearly off because the maximum rating for a Google Play app is 5. The `Category` field also holds what looks like a rating (1.9), which suggests that the values in this row are shifted one column to the left. As a consequence, we'll delete this row.
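Rather than relying on a forum discussion to locate such rows, a range check on the rating column can flag them. The snippet below is a minimal sketch on a made-up frame (the `sample` data and app names are illustrative, and it assumes the rating column is numeric):

```python
import pandas as pd

# Toy stand-in for the Google Play data; values are illustrative only.
sample = pd.DataFrame({
    'App': ['Good App', 'Shifted Row App', 'Unrated App'],
    'Rating': [4.5, 19.0, None],
})

# Ratings cannot exceed 5, so anything above 5 signals a bad row.
# Comparisons with NaN evaluate to False, so unrated apps are not flagged.
bad_rows = sample[sample['Rating'] > 5]
print(bad_rows)

# Drop the flagged rows by their index labels.
cleaned = sample.drop(bad_rows.index)
print(len(cleaned))
```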

In [7]:
#Deleting the inconsistent row

android_data.drop([10472], axis = 0, inplace = True)

#Checking the length of the data set to make sure the
#row has been deleted. 

len(android_data)

10840

As we can see, the incorrect row has been deleted, and the total number of rows decreased from 10841 to 10840.

# Removing Duplicate Entries
## Part One
If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [8]:
duplicates = android_data[android_data['App'] == "Instagram"]
duplicates[0:]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


In total, there are 1181 cases where an app occurs more than once:

In [9]:
duplicate_apps = []
unique_apps = []

for idx, row in android_data.iterrows():
    name = row['App']
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('Number of unique apps:', len(unique_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181
Number of unique apps: 9659


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)
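The same "keep the entry with the most reviews" criterion can also be sketched with pandas' own tools. This is an alternative to the dictionary approach described above, not the method used in this notebook. The `sample` frame is a made-up stand-in and its review counts are illustrative:

```python
import pandas as pd

# Toy stand-in for the Google Play data; values are illustrative only.
sample = pd.DataFrame({
    'App': ['Instagram', 'Box', 'Instagram', 'Box', 'Slack'],
    'Reviews': ['66577313', '159872', '66577446', '159872', '51507'],
})

# Reviews is stored as text in this sketch, so convert it before comparing.
sample['Reviews'] = sample['Reviews'].astype(int)

# Sort so the highest review count comes first for each app,
# then keep only the first row per app name.
deduped = (
    sample.sort_values('Reviews', ascending=False)
          .drop_duplicates(subset='App', keep='first')
)
print(deduped)
```

Because `drop_duplicates` keeps exactly one row per app name, ties in the review count (such as the two Box rows here) do not leave duplicates behind.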

## Part Two

Let's start by building the dictionary.

In [10]:
reviews_max = {}

for idx, row in android_data.iterrows():
    name = row['App']
    reviews = float(row['Reviews'])
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

len(reviews_max)

9659

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [11]:
print("Expected length: ", len(android_data) - 1181)
print("Actual length: ", len(reviews_max))

Expected length:  9659
Actual length:  9659


Now, let's use the `reviews_max` dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, `android_clean` and `already_added`.
- We loop through the `android_data` data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (`row`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        - The name of the app is not already in the `already_added` list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.

In [12]:
android_clean = [] #list of new cleaned data
already_added = [] #list of the cleaned app names

for idx, row in android_data.iterrows():
    name = row['App']
    n_reviews = float(row['Reviews'])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
len(android_clean)

9659

Now let's quickly explore the new data set, and confirm that the number of rows is 9,659.

In [13]:
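# Convert the list of rows back into a pandas DataFrame
android_clean = pd.DataFrame(android_clean)

# Removing duplicates left gaps in the index. reset_index() returns
# a new DataFrame with a continuous index; the result is only displayed
# here, so android_clean itself keeps its original index labels.
android_clean.reset_index(drop=True)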


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
2,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
3,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
4,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
5,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
6,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
7,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
8,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
9,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


# Removing Non-English Apps

## Part One

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see an example from the Google Play data set:

In [14]:
android_clean[4410:4415]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
5510,AQ Coach,SPORTS,,0,28M,5+,Free,0,Everyone,Sports,"May 25, 2018",1.1.0,4.4 and up
5511,AQ Dentals,HEALTH_AND_FITNESS,,0,12M,10+,Free,0,Everyone,Health & Fitness,"December 22, 2017",1.0.1,4.1 and up
5513,中国語 AQリスニング,FAMILY,,21,17M,"5,000+",Free,0,Everyone,Education,"June 22, 2016",2.4.0,4.0 and up
5514,ClanHQ,COMMUNICATION,2.7,560,37M,"10,000+",Free,0,Everyone,Communication,"July 25, 2018",1.0.21,4.4 and up
5515,QuickShortcutMaker,PERSONALIZATION,4.6,41000,2.0M,"1,000,000+",Free,0,Everyone,Personalization,"February 23, 2014",2.4.0,1.6 and up


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We built this function below, and we use the built-in `ord()` function to find out the corresponding encoding number of each character.

In [15]:
def is_english(string):
    # Iterate over the characters directly instead of over indices
    for char in string:
        if ord(char) > 127:
            return False
    return True

print(is_english("Instagram"))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('لعبة تقدر تربح DZ'))
print(is_english('Maninder'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
True
False
False


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.
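To make the ASCII boundary concrete, the short sketch below prints the code point that `ord()` returns for one ASCII character and for the trademark symbol, and then lists the non-ASCII characters in one of the app names tested above. The `non_ascii` variable name is introduced here for illustration only:

```python
# ord() returns the Unicode code point of a single character.
print(ord('A'))   # 65, inside the ASCII range (0 to 127)
print(ord('™'))   # 8482, outside the ASCII range

# Collect the characters of an app name that fall outside the ASCII range.
name = 'Docs To Go™ Free Office Suite'
non_ascii = [char for char in name if ord(char) > 127]
print(non_ascii)  # ['™']
```

A single symbol like this is enough to make the first version of `is_english()` reject an otherwise English name.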

## Part Two
To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [16]:
def is_english(string):
    special_char = 0  # number of non-ASCII characters seen so far
    for char in string:
        if ord(char) > 127:
            special_char += 1
        if special_char > 3:
            return False
    return True
 
print(is_english("Instagram"))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('لعبة تقدر تربح DZ'))
print(is_english('Maninder'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
True
True
True


The function is still not perfect, and a few non-English apps might get past our filter, but this seems good enough at this point in our analysis.

Below, we use the `is_english()` function to filter out the non-English apps for both data sets:

In [17]:
android_english = []
apple_english = []

for idx, row in android_clean.iterrows():
    name = row['App']

    # Keep the row only if the app name passes the English check
    if is_english(name):
        android_english.append(row)
print(len(android_english))

for idx, row in apple_data.iterrows():
    name = row['track_name']

    if is_english(name):
        apple_english.append(row)

print(len(apple_english))

9614
6183


In [20]:
android_english = pd.DataFrame(android_english)
apple_english = pd.DataFrame(apple_english)

We can see that we're left with 9614 Android apps and 6183 iOS apps.
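The row-by-row loops above can also be expressed as a boolean mask, which avoids rebuilding the DataFrame from a list of rows. The sketch below is self-contained: it restates the "more than three non-ASCII characters" rule in a compact form and applies it to a made-up three-row frame (`sample` and `english_only` are illustrative names):

```python
import pandas as pd

def is_english(string):
    # Same rule as above: reject names with more than three non-ASCII characters.
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= 3

# Toy stand-in for the app data, using names tested earlier.
sample = pd.DataFrame({
    'App': [
        'Instagram',
        '爱奇艺PPS -《欢乐颂2》电视剧热播',
        'Docs To Go™ Free Office Suite',
    ]
})

# apply() calls is_english on every name and returns a Series of booleans,
# which is then used to select the matching rows.
mask = sample['App'].apply(is_english)
english_only = sample[mask]
print(english_only)
```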

# Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [21]:
android_final = []
apple_final =[]

for idx, row in android_english.iterrows():
    price = row['Price']
    
    if price == '0':
        android_final.append(row)
        
for idx,row in apple_english.iterrows():
    price = row['price']
    
    if price == 0:
        apple_final.append(row)
        
print(len(android_final))
print(len(apple_final))

8864
3222


In [22]:
android_final = pd.DataFrame(android_final)
apple_final = pd.DataFrame(apple_final)

We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
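As a sketch of the same filter in vectorized form, the snippet below selects free apps with a boolean comparison. It mirrors the two conditions used in the loop above: the Google Play price is compared against the string `'0'`, while the App Store price is compared against the number `0`. The two small frames are made up for illustration, including the paid price values:

```python
import pandas as pd

# Toy stand-ins for the two data sets; values are illustrative only.
android_sample = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Price': ['0', '$4.99', '0'],      # prices stored as text
})
apple_sample = pd.DataFrame({
    'track_name': ['X', 'Y'],
    'price': [0.0, 3.99],              # prices stored as numbers
})

# Keep only the rows whose price marks the app as free.
android_free = android_sample[android_sample['Price'] == '0']
apple_free = apple_sample[apple_sample['price'] == 0]

print(len(android_free))
print(len(apple_free))
```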

# Most Common Apps by Genre

## Part One

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

## Part Two

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order
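The two helpers described above are not defined in this notebook, so the following is only a minimal sketch of what they could look like for pandas DataFrames. The signatures, which take a column name, are an assumption made for this sketch, and the `sample` frame is made up:

```python
import pandas as pd

def freq_table(dataset, column):
    """Return a dict mapping each value in `column` to its percentage of rows.

    Hypothetical helper: takes a DataFrame and a column name.
    """
    percentages = dataset[column].value_counts(normalize=True) * 100
    return percentages.to_dict()

def display_table(dataset, column):
    """Print the percentages from freq_table() in descending order."""
    table = freq_table(dataset, column)
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    for value, pct in ordered:
        print(value, ':', round(pct, 2))

# Toy stand-in for the App Store data; values are illustrative only.
sample = pd.DataFrame({
    'prime_genre': ['Games', 'Games', 'Games', 'Education'],
})
display_table(sample, 'prime_genre')
```

Returning a plain dictionary from `freq_table` keeps the two responsibilities separate: one function computes, the other only sorts and prints.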

## Part Three

We start by examining the frequency table for the `prime_genre` column of the App Store data set.

In [23]:
pd.value_counts(apple_final['prime_genre'].values.ravel()) 

Games                1874
Entertainment         254
Photo & Video         160
Education             118
Social Networking     106
Shopping               84
Utilities              81
Sports                 69
Music                  66
Health & Fitness       65
Productivity           56
Lifestyle              51
News                   43
Travel                 40
Finance                36
Weather                28
Food & Drink           26
Reference              18
Business               17
Book                   14
Medical                 6
Navigation              6
Catalogs                4
dtype: int64

In [24]:
pd.value_counts(android_final['Category'].values.ravel()) 

FAMILY                 1676
GAME                    862
TOOLS                   750
BUSINESS                407
LIFESTYLE               346
PRODUCTIVITY            345
FINANCE                 328
MEDICAL                 313
SPORTS                  301
PERSONALIZATION         294
COMMUNICATION           287
HEALTH_AND_FITNESS      273
PHOTOGRAPHY             261
NEWS_AND_MAGAZINES      248
SOCIAL                  236
TRAVEL_AND_LOCAL        207
SHOPPING                199
BOOKS_AND_REFERENCE     190
DATING                  165
VIDEO_PLAYERS           159
MAPS_AND_NAVIGATION     124
FOOD_AND_DRINK          110
EDUCATION               103
ENTERTAINMENT            85
LIBRARIES_AND_DEMO       83
AUTO_AND_VEHICLES        82
HOUSE_AND_HOME           73
WEATHER                  71
EVENTS                   63
PARENTING                58
ART_AND_DESIGN           57
COMICS                   55
BEAUTY                   53
dtype: int64

In [25]:
pd.value_counts(android_final['Genres'].values.ravel()) 

Tools                                  749
Entertainment                          538
Education                              474
Business                               407
Lifestyle                              345
Productivity                           345
Finance                                328
Medical                                313
Sports                                 307
Personalization                        294
Communication                          287
Action                                 275
Health & Fitness                       273
Photography                            261
News & Magazines                       248
Social                                 236
Travel & Local                         206
Shopping                               199
Books & Reference                      190
Simulation                             181
Dating                                 165
Arcade                                 164
Video Players & Editors                157
Casual     

Among the free English apps on the App Store, more than half (58.16%) are games. Entertainment apps account for close to 8%, followed by photo and video apps at close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.
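The percentages quoted here follow from the `prime_genre` counts printed earlier, divided by the 3222 free English iOS apps. A short sketch of that arithmetic (the `counts` and `shares` names are introduced for illustration):

```python
# Counts taken from the prime_genre frequency table above.
total_apps = 3222
counts = {
    'Games': 1874,
    'Entertainment': 254,
    'Photo & Video': 160,
    'Education': 118,
    'Social Networking': 106,
}

# Convert each count into a percentage of all free English iOS apps.
shares = {
    genre: round(count / total_apps * 100, 2)
    for genre, count in counts.items()
}
print(shares)
```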

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users. The demand might not be the same as the offer.

Let's continue by examining the `Genres` and `Category` columns of the Google Play data set (two columns which seem to be related).

In [None]:
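# Category percentages; value_counts() sorts in descending order by default
(android_final['Category'].value_counts(normalize=True) * 100).round(2)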

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

Even so, practical apps seem to have a better representation on Google Play compared to App Store. This picture is also confirmed by the frequency table we see for the `Genres` column:

In [None]:
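# Genres percentages; value_counts() sorts in descending order by default
(android_final['Genres'].value_counts(normalize=True) * 100).round(2)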


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have the most users.

# Most Popular Apps by Genre on the App Store
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [None]:
# Unique genres among the free English iOS apps
genres_ios = apple_final['prime_genre'].unique()

for genre in genres_ios:
    # Select the apps in this genre by column name rather than by position
    genre_apps = apple_final[apple_final['prime_genre'] == genre]

    # rating_count_tot is our proxy for the number of users
    avg_n_ratings = genre_apps['rating_count_tot'].astype(float).mean()
    print(genre, ':', avg_n_ratings)
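The per-genre average can also be sketched with `groupby`, which computes all the means in one pass and makes it easy to sort them. The frame below is a made-up stand-in for `apple_final` with illustrative values, so the printed numbers say nothing about the real data:

```python
import pandas as pd

# Toy stand-in for the free English iOS apps; values are illustrative only.
sample = pd.DataFrame({
    'prime_genre': ['Games', 'Games', 'Reference', 'Reference'],
    'rating_count_tot': [21292, 100, 985920, 80],
})

# Average number of user ratings per genre, highest first.
avg_ratings = (
    sample.groupby('prime_genre')['rating_count_tot']
          .mean()
          .sort_values(ascending=False)
)
print(avg_ratings)
```

Note how a single very popular app dominates the mean of its genre in this toy example, which is worth keeping in mind when reading averages of this kind.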