# Analyzing the Free Mobile App Market
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

### Loading the Data and First Look
Collecting data for over four million apps would take a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our goals:

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play (the data was collected in August 2018)
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store (the data was collected in July 2017)

We'll start by opening and exploring these two data sets.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

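def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end of a DataFrame and, optionally, its shape."""
    # Use positional slicing so the helper works on a pandas DataFrame
    dataset_slice = dataset.iloc[start:end]
    for _, row in dataset_slice.iterrows():
        print(row.tolist())
        print('\n')

    if rows_and_columns:
        print('Number of rows:', dataset.shape[0])
        print('Number of columns:', dataset.shape[1])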

In [2]:
google_store_df = pd.read_csv('googleplaystore.csv')
apple_store_df = pd.read_csv('AppleStore.csv')

Let's look at the first few rows of the Google Play store data:

In [3]:
print('''Google Play store 
          # of apps:    {} 
          # of columns: {}
      '''.format(*google_store_df.shape))
google_store_df.head(3)

Google Play store 
          # of apps:    10841 
          # of columns: 13
      


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Now let's look at the iOS App store data set.

In [4]:
print('''Apple store 
          # of apps:    {} 
          # of columns: {}
      '''.format(*apple_store_df.shape))
apple_store_df.head(3)

Apple store 
          # of apps:    7197 
          # of columns: 17
      


Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1


The Apple Store data set has 7197 apps and the columns that look interesting are 'track_name', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. A more complete description of each column can be found in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

### Data Cleaning
Next, let's look through the data for erroneous entries and pare it down to what we need.

#### Duplicates and Wrong Data
The [Kaggle Discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/81616) mention a few mistakes found in the Google Play data:
1. Several apps appear more than once in the data: 

In [5]:
print('Number of duplicate apps:', len(google_store_df) - google_store_df['App'].nunique())
google_store_df[(google_store_df['App'] == 'Subway Surfers') | (google_store_df['App'] == 'Facebook')]

Number of duplicate apps: 1181


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1654,Subway Surfers,GAME,4.5,27722264,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
1700,Subway Surfers,GAME,4.5,27723193,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
1750,Subway Surfers,GAME,4.5,27724094,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
1872,Subway Surfers,GAME,4.5,27725352,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
1917,Subway Surfers,GAME,4.5,27725352,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
2544,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
3896,Subway Surfers,GAME,4.5,27711703,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
3943,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device


Before removing these duplicates, notice that the copies of an app appear identical in every column except 'Reviews', so we need a rule for which copy to keep. Rather than dropping rows at random, we'll keep the entry with the highest number of reviews, since it likely has the most up-to-date information on the app.

In [6]:
google_store_df = google_store_df.sort_values('Reviews').drop_duplicates(subset='App', keep='last').sort_index()
print('Apps remaining in Google Play: ', len(google_store_df))
google_store_df[(google_store_df['App'] == 'Subway Surfers') | (google_store_df['App'] == 'Facebook')]

Apps remaining in Google Play:  9660


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1917,Subway Surfers,GAME,4.5,27725352,76M,"1,000,000,000+",Free,0,Everyone 10+,Arcade,"July 12, 2018",1.90.0,4.1 and up
2544,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
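
One caveat is worth checking here. If pandas read the 'Reviews' column as text (plausible for this file, because the malformed row 10472 examined below holds '3.0M' in that column), then `sort_values('Reviews')` orders the values alphabetically rather than numerically. The sketch below uses a small made-up table to show the difference; `toy_df` and the helper column `reviews_num` are names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the Google Play data; 'Reviews' is stored as text.
toy_df = pd.DataFrame({
    'App': ['Alpha', 'Alpha', 'Beta', 'Beta'],
    'Reviews': ['9', '10000', '250', '31'],
})

# Sorting the raw strings is lexicographic, so '9' ranks above '10000'.
text_sorted = (
    toy_df.sort_values('Reviews')
          .drop_duplicates(subset='App', keep='last')
          .sort_index()
)

# Converting to numbers first keeps the row with the largest review count.
numeric_sorted = (
    toy_df.assign(reviews_num=pd.to_numeric(toy_df['Reviews'], errors='coerce'))
          .sort_values('reviews_num')
          .drop_duplicates(subset='App', keep='last')
          .drop(columns='reviews_num')
          .sort_index()
)

print(text_sorted)
print(numeric_sorted)
```

On the toy table, the text-based sort keeps the rows with 9 and 31 reviews, while the numeric sort keeps the rows with 10000 and 250. Whether this affects the real data depends on how the column was parsed, which `google_store_df['Reviews'].dtype` would reveal.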


2. Row 10472 has its values shifted one column to the left: 'Category' is '1.9' and 'Rating' is 19, even though ratings are supposed to fall in the range 0 to 5. Let's remove that row as well.

In [7]:
google_store_df.loc[10472]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object

In [8]:
google_store_df.drop(10472, inplace=True)
print('Apps remaining in Google Play:', len(google_store_df))

Apps remaining in Google Play: 9659
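
Here the bad row was already known from the Kaggle discussion. As a rough sketch of how such a row could be found without that hint, an out-of-range rating can serve as a flag. The table below is made up for illustration, and it assumes 'Rating' has been read as a numeric column.

```python
import pandas as pd

# Made-up rows: one normal, one with shifted columns, one without a rating.
toy_df = pd.DataFrame({
    'App': ['Good App', 'Shifted App', 'Unrated App'],
    'Category': ['TOOLS', '1.9', 'GAME'],
    'Rating': [4.2, 19.0, None],
})

# Ratings above 5 cannot be genuine, so they flag rows worth inspecting.
# Missing ratings (NaN) compare as False and are therefore not flagged.
suspicious = toy_df[toy_df['Rating'] > 5]
print(suspicious)

cleaned = toy_df.drop(suspicious.index)
print('Rows remaining:', len(cleaned))
```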


#### Non-English Apps

Looking through the data sets, you'll notice that the names of some apps suggest they are not aimed at an English-speaking audience. Below is an example from each data set:

In [9]:
print(apple_store_df.loc[814, 'track_name'])
google_store_df.loc[4550, 'App']

搜狐新闻—新闻热点资讯掌上阅读软件


'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템'

Since we are looking at the free app market for a primarily English-speaking audience, let's remove the non-English apps from the data sets. One way to do this is to remove every app whose name contains characters other than the letters of the English alphabet, the digits 0-9, and common punctuation marks and symbols (?, ;, *, /, etc.).

In the ASCII standard, all of these characters have a code between 0 and 127, so removing every name that contains a character with a code above 127 would leave only the apps named with standard English characters.

However, many characters commonly used in app names, such as the trademark symbol (™), the em dash, and emojis, fall outside this range, so we will only remove an app if its name has more than three non-ASCII characters. This approach isn't perfect, but we don't want to spend too much time on optimization.
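
To make the threshold concrete, here is a small worked example that counts the non-ASCII characters in a few names. The helper `count_non_ascii` and the first two sample names are introduced only for illustration; the third name is the App Store example shown earlier.

```python
def count_non_ascii(name):
    """Return how many characters in `name` have a code point above 127."""
    return sum(ord(ch) > 127 for ch in name)

samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    '搜狐新闻—新闻热点资讯掌上阅读软件',
]

for name in samples:
    n = count_non_ascii(name)
    verdict = 'keep' if n <= 3 else 'remove'
    print(f'{n:>2} non-ASCII -> {verdict}: {name}')
```

The first name has no non-ASCII characters and the second has one (the trademark symbol), so both stay. Every character of the third name is non-ASCII, so it is removed.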

In [10]:
def has_mostly_english_chars(name):
    """Return True if `name` has at most three non-ASCII characters."""
    return sum(ord(c) > 127 for c in name) <= 3

# Boolean masks: True for apps whose names pass the check
gp_english_apps_boolean = google_store_df['App'].apply(has_mostly_english_chars)
google_store_df = google_store_df[gp_english_apps_boolean]

ios_english_apps_boolean = apple_store_df['track_name'].apply(has_mostly_english_chars)
apple_store_df = apple_store_df[ios_english_apps_boolean]

removal_fs = '''Removed {0} {1} apps from Google Play data
Apps remaining in Google Play: {2}

Removed {3} {1} apps from Apple Store data
Apps remaining in Apple Store: {4}'''

print(removal_fs.format(len(gp_english_apps_boolean) - np.sum(gp_english_apps_boolean), 
                        'non-english',
                        len(google_store_df),
                        len(ios_english_apps_boolean) - np.sum(ios_english_apps_boolean),
                        len(apple_store_df)))

Removed 45 non-english apps from Google Play data
Apps remaining in Google Play: 9614

Removed 1014 non-english apps from Apple Store data
Apps remaining in Apple Store: 6183


Let's also check the Apple Store data for repeated app names:

In [11]:
apple_store_df[apple_store_df.duplicated(subset='track_name')]

Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
5603,7579,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1
7128,10885,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1
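
`duplicated` marks only the later copies by default, so the first occurrence of each repeated name is not shown. A possible way to see every copy side by side, and to check whether the copies share an id, is sketched below on a made-up table (all names and values here are hypothetical).

```python
import pandas as pd

# Made-up stand-in for the Apple Store data.
toy_df = pd.DataFrame({
    'id': [101, 102, 103],
    'track_name': ['App A', 'App B', 'App A'],
    'rating_count_tot': [67, 500, 12],
})

# keep=False marks every copy of a repeated name, not only the later ones.
all_copies = toy_df[toy_df.duplicated(subset='track_name', keep=False)]
print(all_copies)

# Different ids under one name suggest distinct apps rather than repeated rows.
print(all_copies.groupby('track_name')['id'].nunique())
```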


#### Removing Non-Free Apps
Since we are only interested in apps that are free to download and install, we need to isolate the free apps for our analysis.

In [14]:
free_gs_df = google_store_df[google_store_df['Price'] == '0']
free_ios_df = apple_store_df[apple_store_df['price'] == 0]

print(removal_fs.format(len(google_store_df) - len(free_gs_df),
                        'non-free',
                        len(free_gs_df),
                        len(apple_store_df) - len(free_ios_df),
                        len(free_ios_df)))

Removed 752 non-free apps from Google Play data
Apps remaining in Google Play: 8862

Removed 2961 non-free apps from Apple Store data
Apps remaining in Apple Store: 3222
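
Note that the Google Play filter compares against the string '0' because 'Price' is stored as text there, while the Apple Store 'price' column is numeric. If numeric prices were needed later, one option would be to convert the column first. The sketch below assumes paid prices are formatted like '$4.99', which is an assumption about the file rather than something shown above; `price_usd` is a column name introduced for this sketch.

```python
import pandas as pd

# Made-up rows; the '$4.99' format for paid apps is assumed.
toy_df = pd.DataFrame({
    'App': ['Free App', 'Paid App', 'Another Free App'],
    'Price': ['0', '$4.99', '0'],
})

# Strip the dollar sign (if any) and convert to float.
# Values that still fail to parse become NaN instead of raising an error.
toy_df['price_usd'] = pd.to_numeric(
    toy_df['Price'].str.replace('$', '', regex=False), errors='coerce'
)

free_apps = toy_df[toy_df['price_usd'] == 0]
print(free_apps)
```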


## Analysis

Our aim is to identify the types of apps that attract the most users, because for an app that can be downloaded and used for free, more active users means more advertising revenue.

#### Most Popular App Genres
We'll begin our analysis by getting a sense for the most common genres for each market by looking at frequency tables for the 'prime_genre' column of the iOS App Store data set, and the 'Genres' and 'Category' columns of the Google Play data set.

Let's start by looking at the iOS apps.

In [28]:
def value_count_percent(column):
    return column.value_counts() / len(column) * 100

value_count_percent(free_ios_df['prime_genre'])

Games                58.162632
Entertainment         7.883302
Photo & Video         4.965860
Education             3.662322
Social Networking     3.289882
Shopping              2.607076
Utilities             2.513966
Sports                2.141527
Music                 2.048417
Health & Fitness      2.017381
Productivity          1.738051
Lifestyle             1.582868
News                  1.334575
Travel                1.241465
Finance               1.117318
Weather               0.869025
Food & Drink          0.806952
Reference             0.558659
Business              0.527623
Book                  0.434513
Medical               0.186220
Navigation            0.186220
Catalogs              0.124146
Name: prime_genre, dtype: float64

Games is by far the most common genre among free apps in the Apple Store data, accounting for about 58% of them. Entertainment follows distantly at about 8%, and Photo & Video is third at about 5%.

The general impression is that the App Store is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes are less common. However, the fact that fun-oriented apps are the most numerous doesn't mean they also have the greatest number of users.
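
As a side note, the percentages produced by `value_count_percent` can also be obtained with the `normalize` option of `value_counts`. The sketch below compares the two on a made-up series.

```python
import pandas as pd

# Made-up genre labels for illustration.
genres = pd.Series(['Games', 'Games', 'Games', 'Education'])

# Same computation as value_count_percent above
manual = genres.value_counts() / len(genres) * 100

# Built-in alternative: proportions scaled to percentages
built_in = genres.value_counts(normalize=True) * 100

print(built_in)
```

The two agree when the column has no missing values. With missing values they differ, because `value_counts` skips NaN by default while `len(column)` still counts those rows.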

Now let's look at the Google Play apps.

In [29]:
value_count_percent(free_gs_df['Category'])

FAMILY                 18.957346
GAME                    9.681787
TOOLS                   8.451817
BUSINESS                4.592643
LIFESTYLE               3.904311
PRODUCTIVITY            3.893026
FINANCE                 3.701196
MEDICAL                 3.520650
SPORTS                  3.396524
PERSONALIZATION         3.317536
COMMUNICATION           3.238547
HEALTH_AND_FITNESS      3.080569
PHOTOGRAPHY             2.945159
NEWS_AND_MAGAZINES      2.798465
SOCIAL                  2.663056
TRAVEL_AND_LOCAL        2.335816
SHOPPING                2.245543
BOOKS_AND_REFERENCE     2.143986
DATING                  1.861882
VIDEO_PLAYERS           1.794177
MAPS_AND_NAVIGATION     1.399233
FOOD_AND_DRINK          1.241255
EDUCATION               1.162266
ENTERTAINMENT           0.959151
LIBRARIES_AND_DEMO      0.936583
AUTO_AND_VEHICLES       0.925299
HOUSE_AND_HOME          0.823742
WEATHER                 0.801174
EVENTS                  0.710900
PARENTING               0.654480
ART_AND_DE

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and a good number of apps appear to be designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.

Even so, practical apps seem to be better represented on Google Play than on the App Store. This picture is also confirmed by the frequency table for the 'Genres' column:

In [30]:
value_count_percent(free_gs_df['Genres'])

Tools                                  8.440533
Entertainment                          6.070864
Education                              5.348680
Business                               4.592643
Productivity                           3.893026
Lifestyle                              3.893026
Finance                                3.701196
Medical                                3.520650
Sports                                 3.464229
Personalization                        3.317536
Communication                          3.238547
Action                                 3.103137
Health & Fitness                       3.080569
Photography                            2.945159
News & Magazines                       2.798465
Social                                 2.663056
Travel & Local                         2.324532
Shopping                               2.245543
Books & Reference                      2.143986
Simulation                             2.042428
Dating                                 1

The difference between the 'Category' and 'Genres' columns is not entirely clear, but one thing we can see is that the 'Genres' column is much more granular (it has more distinct values). Since we're only looking for the bigger picture at the moment, we'll work only with the 'Category' column moving forward.

Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a slightly more balanced landscape of practical and for-fun apps, though it still leans heavily towards entertainment. Now we'd like to get an idea of the kinds of apps that have the most users.
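
One visible reason for the extra granularity is that a 'Genres' value can combine several labels separated by ';' (for example 'Art & Design;Pretend Play' in the first rows shown earlier). The sketch below, built on a few values copied from the outputs above, compares the number of distinct values in each column and splits the combined labels.

```python
import pandas as pd

# A few Category/Genres pairs taken from the rows displayed earlier.
toy_df = pd.DataFrame({
    'Category': ['ART_AND_DESIGN', 'ART_AND_DESIGN', 'GAME'],
    'Genres': ['Art & Design', 'Art & Design;Pretend Play', 'Arcade'],
})

print('Distinct categories:', toy_df['Category'].nunique())
print('Distinct genre labels:', toy_df['Genres'].nunique())

# Split combined labels on ';' and give each individual genre its own row.
individual_genres = toy_df['Genres'].str.split(';').explode()
print(individual_genres.value_counts())
```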
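
For Google Play, the 'Installs' column is the natural starting point, but its values are text such as '10,000+'. A minimal sketch of one way to turn them into numbers and average them per category follows. The table is made up, and `installs_min` is a column name introduced for this sketch.

```python
import pandas as pd

# Made-up rows using the 'Installs' format seen in the data above.
toy_df = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS'],
    'Installs': ['1,000,000,000+', '500,000+', '10,000+'],
})

# Remove thousands separators and the trailing '+', then convert to integers.
# Each value is a lower bound ("at least this many installs"), not an exact count.
toy_df['installs_min'] = (
    toy_df['Installs']
    .str.replace(',', '', regex=False)
    .str.replace('+', '', regex=False)
    .astype('int64')
)

avg_installs = toy_df.groupby('Category')['installs_min'].mean()
print(avg_installs)
```

Because the install figures are open-ended buckets, averages computed this way are rough indicators rather than precise user counts.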