# Data Analyst Project01: 
## Profitable App Profiles for the App Store and Google Play Markets

**Synopsis**:
Pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

In [1]:
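def explore_data(dataset, start=0, end=None, rows_and_cols=False):
    # Print the rows in dataset[start:end]; end=None slices through the last row
    # (a default of -1 would silently drop the final row)
    dset_slice = dataset[start:end]
    for row in dset_slice:
        print(row)
        print('\n')

    if rows_and_cols:
        print('Number of rows: ', len(dataset))
        print('Number of cols: ', len(dataset[0]))

    return None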

We will work with two datasets from Kaggle published within the last two years:

- Apple Store data (2018): https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

- Google Play Store data (2019): https://www.kaggle.com/lava18/google-play-store-apps

In [2]:
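import csv
# Apple Store data
# (explicit UTF-8 because app names contain non-ASCII characters;
#  newline='' is what the csv module recommends for file objects)
with open('AppleStore.csv', encoding='utf8', newline='') as file:
    apple_store_data = list(csv.reader(file))

# Google Play Store data
with open('googleplaystore.csv', encoding='utf8', newline='') as file:
    gplay_store_data = list(csv.reader(file))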

# Separate headers and data
apple_store_header = apple_store_data[0]
apple_store = apple_store_data[1:]

gplay_store_header = gplay_store_data[0]
gplay_store = gplay_store_data[1:]

In [3]:
print(apple_store_header, '\n')
_ = explore_data(apple_store, end=3, rows_and_cols=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of cols:  16


### First look at the data:

The Apple store data lists 7197 apps described by 16 columns. The column names are terse, but the descriptions on the Kaggle page point to a few columns of initial interest:
1. `track_name`: App name in store.
2. `currency` and `price`: Currency and price of the app. Since we are interested in free apps, we will want to filter out paid apps.
3. `rating_count_tot` and `rating_count_ver`: Number of ratings for the app in total and for the most recent version, respectively. These give an idea of how popular the app is overall and with its most recent version.
4. `user_rating` and `user_rating_ver`: Average user review scores (using a scale from 0.0-5.0) overall and for the most recent version. These indicate the quality/reception of an app.
5. `cont_rating` and `prime_genre`: Recommended age restrictions and the main category for an app. These help describe (generally) what the app does, and what audience the app is targeted towards.

In [4]:
print(gplay_store_header, '\n')
_ = explore_data(gplay_store, end=3, rows_and_cols=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of cols:  13


The Google Play store data lists just over 10,000 apps described with 13 columns. The column names are a little more descriptive so we can decide what columns might be of interest:

1. `App`: App name in store.
2. `Type` and `Price`: Whether the app is 'Free' or 'Paid', and the price of the app (in USD).
3. `Reviews`: The number of ratings for the app in total.
4. `Rating`: The average user review score (using a scale from 0.0-5.0) overall.
5. `Content Rating`, `Category`, and `Genres`: Recommended age restrictions, primary category, and full list of relevant categories for the app. Note that an app belongs to only one `Category`, but possibly many `Genres`.

If we compare the two datasets, we can thankfully see that they mostly contain the same types of data we're interested in. One notable difference is that the Apple store data includes information about the most recent version of an app as well as its overall history, while the Google Play store data provides a finer resolution of the different types of apps.

## Data Validation

Let's do a quick smoke check for incorrect/missing data. Two of the most common data errors are missing values and duplicated data that should be unique.

The simplest check for missing data is to iterate looking for rows that are shorter than the header, indicating that at least one column is missing data. So we introduce a function that checks that each row in our table contains the same number of elements as the header.

This type of error is fairly rare in our data, and without prior knowledge of how to fill in the missing values, it's safest to simply remove the affected rows from our analysis.

Our data is meant to represent individual apps in either the Google Play or Apple App stores. This means that we need to ensure that each row corresponds to one and only one potential app for each store. We do this by looking through the data for duplicated entries in a field that should be unique to each row (usually called a key). We'll then use that key to detect collisions in the data and either eliminate or fold in the duplicated data until we are left with distinct values in the key column(s). 

If we knew that our data was obtained from a relational database (RDB), or as the result of a query to an official source, we could depend on the RDB management system to handle this for us. However, in our case it seems likely that this data was obtained by scraping the websites for the Google Play and Apple App stores. Although the Apple dataset has an `id` field, which is likely to be a key, no such field exists in the Google Play dataset. Our next best option is to look for duplicates in the fields containing the app names.
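As a rough illustration of both checks, the sketch below runs them on a tiny made-up table. The header, rows, and variable names are hypothetical and are not taken from either dataset.

```python
from collections import defaultdict

# Hypothetical toy table: 3 columns, with the app name (column 0) as the intended key
header = ['name', 'genre', 'reviews']
rows = [
    ['Alpha', 'Games', '10'],
    ['Beta', '25'],              # short row: one column is missing
    ['Alpha', 'Games', '12'],    # same key as the first row
    ['Gamma', 'Music', '7'],
]

# Check 1: indices of rows whose length differs from the header's
short_rows = [idx for idx, row in enumerate(rows) if len(row) != len(header)]

# Check 2: map each key value to the row indices where it appears,
# then keep only the keys that appear more than once
positions = defaultdict(list)
for idx, row in enumerate(rows):
    positions[row[0]].append(idx)
duplicates = {key: idxs for key, idxs in positions.items() if len(idxs) > 1}

print(short_rows)   # [1]
print(duplicates)   # {'Alpha': [0, 2]}
```

Returning row indices rather than the rows themselves keeps the decision about what to delete separate from the detection step.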

In [5]:
def data_smoke_test_missing(dataset, n_cols):
    # Simple smoke test: return the indices of rows whose length differs from n_cols
    out = []
    for idx, row in enumerate(dataset):
        if len(row) != n_cols: # row has a different number of columns than the header
            out.append(idx)

    return out

def data_smoke_test_duplicate(dataset, key_index):
    # Another simple test looking for uniqueness of values for a given key column.
    # seen_values is keyed by the values present in the key column.
    # For a given key k, seen_values[k] is a list of each row index idx
    # such that dataset[idx][key_index] == k
    # Thus a key k is unique if and only if len(seen_values[k]) == 1
    seen_values = {}
    for idx, row in enumerate(dataset):
        val = row[key_index]
        if val not in seen_values:
            seen_values[val] = [idx]
        else:
            seen_values[val].append(idx)

    # keep only the keys that appear in more than one row
    non_unique_keys = {}
    for key, value in seen_values.items():
        if len(value) > 1:
            non_unique_keys[key] = value

    return non_unique_keys

In [6]:
idx_missing_apple = data_smoke_test_missing(apple_store, len(apple_store_header))
idx_missing_gplay = data_smoke_test_missing(gplay_store, len(gplay_store_header))

print("Apple store rows w. missing columns = {}".format(len(idx_missing_apple)))
print("Gplay store rows w. missing columns = {}".format(len(idx_missing_gplay)))

print([apple_store[idx] for idx in idx_missing_apple])
print([gplay_store[idx] for idx in idx_missing_gplay])
    
dup_key_apple = data_smoke_test_duplicate(apple_store, key_index=1)
dup_key_gplay = data_smoke_test_duplicate(gplay_store, key_index=0)

if dup_key_apple:
    n_dup_keys = len(dup_key_apple.keys())
    print("\nDuplicated apps in apple store: {}\n".format(n_dup_keys))
    for key, idx_dup_apple in dup_key_apple.items():
        print(*[apple_store[idx] for idx in idx_dup_apple], sep='\n')

if dup_key_gplay:
    n_dup_keys = len(dup_key_gplay.keys())
    print("\nA selection of duplicated apps in gplay store: {}\n".format(n_dup_keys))
    # There are many more collisions in the gplay data.. we'll just show a few
    print(*[gplay_store[idx] for idx in dup_key_gplay['Instagram']], sep='\n')

Apple store rows w. missing columns = 0
Gplay store rows w. missing columns = 1
[]
[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]

Duplicated apps in apple store: 2

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

A selection of duplicated apps in gplay store: 798

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies

Even though there are no collisions in the `id` field of the Apple App store data, we still find two sets of collisions in the app names. For the Google Play set, we find many more sets of collisions. 

Looking at the collisions in the Google Play data, it's clear that the only differing column is the `Reviews` column. What's most likely happening is that when this dataset was composed, the Google Play store was scraped multiple times with some time elapsing between runs. During this gap, additional reviews were submitted for several of the apps (even more likely for highly popular apps), which were then incorrectly identified as new apps and added to the dataset. To resolve these conflicts, we'll keep only the record with the largest number of reviews, indicating the most recent data pulled for that app.

For the Apple Store collisions, we see that both sets of apps have more substantial distinguishing columns. Both collisions have differing version columns, with one set further having a different content rating. Along with the unique `id` field, this could at least plausibly indicate that these collisions are in fact distinct apps. There is still some risk in not resolving these collisions, because we don't know for certain the provenance of the data. However, that risk is acceptably small since there are only two sets of two collisions.
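The keep-the-largest rule for the Google Play duplicates can be sketched with a dictionary on made-up rows. The app names and review counts below are hypothetical; only the rule itself comes from the text above.

```python
# Hypothetical rows: [app name, review count as a string]
rows = [
    ['Chat', '100'],
    ['Maps', '50'],
    ['Chat', '140'],   # later scrape of 'Chat' with more reviews
    ['Chat', '120'],
]

NAME, REVIEWS = 0, 1

# For each app name, remember the index of the row with the most reviews.
# The strict '>' means the first row wins when review counts tie.
best_idx = {}
for idx, row in enumerate(rows):
    name = row[NAME]
    if name not in best_idx or int(row[REVIEWS]) > int(rows[best_idx[name]][REVIEWS]):
        best_idx[name] = idx

# Keep the winning rows in their original order
keep = set(best_idx.values())
deduped = [row for idx, row in enumerate(rows) if idx in keep]

print(deduped)   # [['Maps', '50'], ['Chat', '140']]
```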

In [7]:
# Copy the lists so that the missing-row results are not mutated below
idx_to_remove_apple = list(idx_missing_apple)
idx_to_remove_gplay = list(idx_missing_gplay)

# De-duplicate gplay dataset
# Get list of indices to remove, leaving row with largest number of reviews
discrim_col = 3 # discriminate rows based on total number of reviews
for key in dup_key_gplay:
    list_of_idx = dup_key_gplay[key]
    max_idx = list_of_idx[0]
    max_rev = int(gplay_store[max_idx][discrim_col])
    for idx in list_of_idx[1:]:
        n_rev = int(gplay_store[idx][discrim_col])
        if n_rev > max_rev:
            idx_to_remove_gplay.append(max_idx)
            max_idx = idx
            max_rev = n_rev
        else:
            idx_to_remove_gplay.append(idx)

# Filter out indices to be removed (sets make the membership tests fast)
idx_to_remove_apple = set(idx_to_remove_apple)
idx_to_remove_gplay = set(idx_to_remove_gplay)
delete = True
if delete:
    apple_store = [row for idx, row in enumerate(apple_store) if idx not in idx_to_remove_apple]
    gplay_store = [row for idx, row in enumerate(gplay_store) if idx not in idx_to_remove_gplay]

print("Number of Apple Apps Remaining = {}".format(len(apple_store)))
print("Number of Google Play Apps Remaining = {}".format(len(gplay_store)))


Number of Apple Apps Remaining = 7197
Number of Google Play Apps Remaining = 9659


Next, we'll partition the data into 'English primary' and 'non-English primary' apps based on the number of characters in the app names that fall outside the ASCII range. We define an app name as 'English primary' if it has three or fewer characters outside the standard ASCII range (`0 <= ord(c) <= 127`). We allow a few characters outside this range to permit the odd '™' or emoji characters.

In [8]:
def is_str_eng(string, tol=0):
    MAX_ASCII = 127
    count = 0
    for char in string:
        if ord(char) > MAX_ASCII:
            count += 1
            if count > tol:
                return False
    return True

print(is_str_eng('Instagram', tol=3))
print(is_str_eng('爱奇艺PPS -《欢乐颂2》电视剧热播', tol=3))
print(is_str_eng('Docs To Go™ Free Office Suite', tol=3))
print(is_str_eng('Instachat 😜', tol=3))

True
False
True
True


In [9]:
# Partition datasets into English primary and non-English primary subsets
apple_eng = []
apple_non_eng = []
APP_NAME_INDEX = 1
for row in apple_store:
    if is_str_eng(row[APP_NAME_INDEX], tol=3):
        apple_eng.append(row)
    else:
        apple_non_eng.append(row)
        
gplay_eng = []
gplay_non_eng = []
APP_NAME_INDEX = 0
for row in gplay_store:
    if is_str_eng(row[APP_NAME_INDEX], tol=3):
        gplay_eng.append(row)
    else:
        gplay_non_eng.append(row)
        
print("Number of English primary Apple Apps Remaining = {}".format(len(apple_eng)))
print("Number of English primary Google Play Apps Remaining = {}".format(len(gplay_eng)))

Number of English primary Apple Apps Remaining = 6183
Number of English primary Google Play Apps Remaining = 9614


In [10]:
# Partition English primary data into Free and Paid apps
apple_free = []
apple_paid = []
APP_PRICE_INDEX = 4
for row in apple_eng:
    if float(row[APP_PRICE_INDEX]) == 0.:
        apple_free.append(row)
    else:
        apple_paid.append(row)
        
gplay_free = []
gplay_paid = []
APP_PRICE_INDEX = 7
for row in gplay_eng:
    if float(row[APP_PRICE_INDEX].strip('$')) == 0.:
        gplay_free.append(row)
    else:
        gplay_paid.append(row)
        
print("Number of Free, English primary Apple Apps = {}".format(len(apple_free)))
print("Number of Free, English primary Google Play Apps = {}".format(len(gplay_free)))

Number of Free, English primary Apple Apps = 3222
Number of Free, English primary Google Play Apps = 8864


Now that we have a relatively clean dataset for these two app stores, we can start discussing what types of analyses we're going to pursue, and how these analyses will answer some of the questions we posed at the beginning of the notebook.

In the synopsis, we stated that our aim was to provide a team of developers a profile of 'successful' free apps on the Google Play and Apple Store markets. Here we will define 'successful' as maximizing the number of downloads (hence maximizing the number of ad impressions). Given this profile of successful apps, our developer team might produce a minimum working example (MWE) that is beta tested on one of the app stores. If that MWE is well received by users, the team may decide it's worth developing further to capture a larger share of users or build a second version of the app for other markets.

Looking through the datasets, a good place to start building a profile for successful apps is to look at the frequencies of different genres in the `prime_genre` column of the Apple Store data and the `Category` and `Genres` columns of the Google Play data. This might then be further broken down by the age/content ratings later on.

In [11]:
import pandas as pd
apple_dframe = pd.DataFrame(data=apple_free, columns=apple_store_header)
apple_dframe.rating_count_tot = pd.to_numeric(apple_dframe.rating_count_tot)

gplay_dframe = pd.DataFrame(data=gplay_free, columns=gplay_store_header)

APPLE_GENRE_INDEX = 11
GPLAY_CAT_INDEX = 1
GPLAY_GENRE_INDEX = 9

def freq_table(dframe, col_index):
    freq = {}
    if type(col_index) == int:
        df = dframe.iloc[:,col_index]
    else:
        df = dframe[col_index]
    
    for row in df:
        if row in freq:
            freq[row] += 1
        else:
            freq[row] = 1
    
    freq_df = pd.DataFrame({'count':freq})
    freq_df['percentage'] = (freq_df['count']/freq_df['count'].sum())*100.
    return freq_df

apple_genre_freq = freq_table(apple_dframe, 'prime_genre')
gplay_categories_freq = freq_table(gplay_dframe, 'Category')

# Gplay 'Genres' column requires a little extra work. Column values are ';'-separated lists of genres
# split values into lists
gplay_genres_split = gplay_dframe['Genres'].str.split(";")

# build dframe to send to freq_table()
gplay_genres = []
for row in gplay_genres_split:
    for val in row:
        gplay_genres.append(val)
        
gplay_genres = pd.DataFrame(data=gplay_genres,columns=['Genres'])
gplay_genre_freq = freq_table(gplay_genres, 'Genres')

print("Apple Genre Frequencies:")
print(apple_genre_freq.sort_values(by='percentage', ascending=False))

Apple Genre Frequencies:
                   count  percentage
Games               1874   58.162632
Entertainment        254    7.883302
Photo & Video        160    4.965860
Education            118    3.662322
Social Networking    106    3.289882
Shopping              84    2.607076
Utilities             81    2.513966
Sports                69    2.141527
Music                 66    2.048417
Health & Fitness      65    2.017381
Productivity          56    1.738051
Lifestyle             51    1.582868
News                  43    1.334575
Travel                40    1.241465
Finance               36    1.117318
Weather               28    0.869025
Food & Drink          26    0.806952
Reference             18    0.558659
Business              17    0.527623
Book                  14    0.434513
Navigation             6    0.186220
Medical                6    0.186220
Catalogs               4    0.124146


In the Apple Store dataset, we can see that, broadly speaking, apps relating to entertainment (Games, Entertainment, and Photo & Video) dominate the market. These are followed by Education and Social Networking apps. More practical apps, like Utilities or Health & Fitness, are comparatively rare. Note that this data relates to the frequency of apps listed on the store and doesn't indicate the popularity among users or success of apps of any genre.

In [12]:
print("+++++++++++++++++++++")
print("Gplay Categories Frequencies:")
print(gplay_categories_freq.sort_values(by='percentage', ascending=False))

+++++++++++++++++++++
Gplay Categories Frequencies:
                     count  percentage
FAMILY                1676   18.907942
GAME                   862    9.724729
TOOLS                  750    8.461191
BUSINESS               407    4.591606
LIFESTYLE              346    3.903430
PRODUCTIVITY           345    3.892148
FINANCE                328    3.700361
MEDICAL                313    3.531137
SPORTS                 301    3.395758
PERSONALIZATION        294    3.316787
COMMUNICATION          287    3.237816
HEALTH_AND_FITNESS     273    3.079874
PHOTOGRAPHY            261    2.944495
NEWS_AND_MAGAZINES     248    2.797834
SOCIAL                 236    2.662455
TRAVEL_AND_LOCAL       207    2.335289
SHOPPING               199    2.245036
BOOKS_AND_REFERENCE    190    2.143502
DATING                 165    1.861462
VIDEO_PLAYERS          159    1.793773
MAPS_AND_NAVIGATION    124    1.398917
FOOD_AND_DRINK         110    1.240975
EDUCATION              103    1.162004
ENTERTAINMEN

The Google Play data is dominated by the Family and Game categories. Examining a few pages of the Family category reveals that it consists mostly of games for younger children. This is similar to the trend we noted in the Apple Store data. However, after these two categories there appears to be a larger selection of practical apps (tools, business, lifestyle, productivity, and finance).

In [13]:
print("+++++++++++++++++++++")
print("Gplay Genres Frequencies:")
print(gplay_genre_freq.sort_values(by='percentage', ascending=False))

+++++++++++++++++++++
Gplay Genres Frequencies:
                         count  percentage
Tools                      750    8.165487
Education                  606    6.597714
Entertainment              569    6.194883
Business                   407    4.431138
Lifestyle                  347    3.777899
Productivity               345    3.756124
Finance                    328    3.571040
Medical                    313    3.407730
Sports                     309    3.364181
Personalization            294    3.200871
Communication              288    3.135547
Action                     284    3.091998
Health & Fitness           275    2.994012
Photography                261    2.841590
News & Magazines           248    2.700054
Social                     236    2.569407
Casual                     210    2.286336
Travel & Local             207    2.253674
Shopping                   199    2.166576
Books & Reference          191    2.079477
Simulation                 191    2.079477
Arcade

The `Genres` column of the Google Play set is essentially a more fine-grained view of the `Category` column. For the most part it appears to agree with the categories results. However, since each app may have multiple genres, we lose the one-to-one property of the `Category` column. It's also likely that some of the genres are strongly correlated with each other (e.g. it's likely that tools correlates with business, lifestyle, and productivity).
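To see what losing the one-to-one property means in practice, here is a small sketch that splits the `Genres` strings of the three sample rows printed earlier and counts the resulting tags. It uses `collections.Counter` as a plain-Python stand-in for the pandas-based `freq_table`, so it illustrates the idea rather than reproducing the notebook's code.

```python
from collections import Counter

# 'Genres' values of the three sample Google Play rows shown earlier
genres_column = [
    'Art & Design',
    'Art & Design;Pretend Play',
    'Art & Design',
]

# Split each value on ';' and count every tag separately
tag_counts = Counter(tag for value in genres_column for tag in value.split(';'))
total_tags = sum(tag_counts.values())

# Three apps produce four tags, so percentages are shares of tags, not of apps
print(len(genres_column), total_tags)   # 3 4
for tag, count in tag_counts.most_common():
    print(tag, count, round(100 * count / total_tags, 1))
```

Because an app with two genres contributes two tags, the percentages in a genre table are computed over the tag total rather than the app total.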

One caveat: this analysis only looks at the number of apps that are listed on the two markets, and so only gives insight into the profile of apps available as a whole, rather than which genres are the most successful. So let's incorporate some information about the number of downloads for each genre/category.

In [14]:
apple_genre_freq['avg_review_count'] = None
for genre in apple_dframe['prime_genre'].unique():
    df = apple_dframe[apple_dframe['prime_genre']==genre]
    total_reviews = df['rating_count_tot'].sum()
    count = apple_genre_freq.loc[genre,'count']
    avg_review_count = total_reviews/count
    
    apple_genre_freq.loc[genre,'avg_review_count'] = avg_review_count

In [15]:
print("Apple Genre Frequencies:")
print(apple_genre_freq.sort_values(by=['avg_review_count','percentage'], ascending=False))

Apple Genre Frequencies:
                   count  percentage avg_review_count
Navigation             6    0.186220          86090.3
Reference             18    0.558659          74942.1
Social Networking    106    3.289882          71548.3
Music                 66    2.048417          57326.5
Weather               28    0.869025          52279.9
Book                  14    0.434513          39758.5
Food & Drink          26    0.806952          33333.9
Finance               36    1.117318          31467.9
Photo & Video        160    4.965860          28441.5
Travel                40    1.241465          28243.8
Shopping              84    2.607076          26919.7
Health & Fitness      65    2.017381            23298
Sports                69    2.141527          23008.9
Games               1874   58.162632          22788.7
News                  43    1.334575            21248
Productivity          56    1.738051          21028.4
Utilities             81    2.513966          18684.5
Lif

Including the average number of reviews per app indicates that Navigation, Reference, Social Networking, Music, and Weather apps have significantly more reviews on average. However, the app counts for some of these genres are relatively small, which hints that the averages might be skewed by a few extremely popular apps. Investigating reveals this to be the case:

In [16]:
criteria = apple_dframe.prime_genre == 'Navigation'
df = apple_dframe[criteria][['track_name','prime_genre', 'rating_count_tot']]

print(df.sort_values(by='rating_count_tot', ascending=False))

                                            track_name prime_genre  \
43     Waze - GPS Navigation, Maps & Real-time Traffic  Navigation   
118                 Google Maps - Navigation & Transit  Navigation   
712                                        Geocaching®  Navigation   
1173       CoPilot GPS – Car Navigation & Offline Maps  Navigation   
2322  ImmobilienScout24: Real Estate Search in Germany  Navigation   
3014                              Railway Route Search  Navigation   

      rating_count_tot  
43              345046  
118             154911  
712              12811  
1173              3582  
2322               187  
3014                 5  


The Waze and Google Maps navigation apps have well over an order of magnitude more reviews than the rest of the set, which barely breaks 10k reviews, so the result is skewed. This is the case for many other genres as well. In general, it would be best to filter out as many of these outliers as possible to get a better sense of user engagement for a certain genre. However, we leave that for later.
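One quick way to see the size of the skew is to compare the mean with the median of the six Navigation review counts printed above. The median is used here only as an illustrative robust summary; it is not a step taken elsewhere in this notebook.

```python
from statistics import mean, median

# rating_count_tot for the six Navigation apps listed above
navigation_reviews = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_reviews), 1))   # 86090.3
print(median(navigation_reviews))           # 8196.5
```

The mean matches the `avg_review_count` reported for Navigation, while the median is roughly a tenth of it, which is what we would expect when two apps account for most of the reviews.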

In [17]:
criteria = apple_dframe.prime_genre == 'Reference'
df = apple_dframe[criteria][['track_name','prime_genre', 'rating_count_tot']]

print(df.sort_values(by='rating_count_tot', ascending=False))

                                             track_name prime_genre  \
6                                                 Bible   Reference   
80                Dictionary.com Dictionary & Thesaurus   Reference   
304      Dictionary.com Dictionary & Thesaurus for iPad   Reference   
474                                    Google Translate   Reference   
597   Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...   Reference   
612   New Furniture Mods - Pocket Wiki & Game Tools ...   Reference   
628                          Merriam-Webster Dictionary   Reference   
732                                           Night Sky   Reference   
850   City Maps for Minecraft PE - The Best Maps for...   Reference   
1064  LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...   Reference   
1522    GUNS MODS for Minecraft PC Edition - Mods Tools   Reference   
1747  Guides for Pokémon GO - Pokemon GO News and Ch...   Reference   
1791                                               WWDC   Reference   
1815  

The Reference genre shows some promise, with a good number of apps receiving well over 10k reviews.

In [18]:
criteria = apple_dframe.prime_genre == 'Social Networking'
df = apple_dframe[criteria][['track_name','prime_genre', 'rating_count_tot']]

print(df.sort_values(by='rating_count_tot', ascending=False))

                                             track_name        prime_genre  \
0                                              Facebook  Social Networking   
5                                             Pinterest  Social Networking   
38                                     Skype for iPhone  Social Networking   
42                                            Messenger  Social Networking   
45                                               Tumblr  Social Networking   
56                                   WhatsApp Messenger  Social Networking   
64                                                  Kik  Social Networking   
100             ooVoo – Free Video Call, Text and Voice  Social Networking   
106                    TextNow - Unlimited Text + Calls  Social Networking   
109                       Viber Messenger – Text & Call  Social Networking   
171          Followers - Social Analytics For Instagram  Social Networking   
205                   MeetMe - Chat and Meet New People  Social 

The Social Networking genre also features many apps with high engagement, although the number of apps already listed suggests the market may be saturated. Other genres worth looking into might be Music, Weather, Book, or Finance.

Next let's look at the Google Play data again. 

In [22]:
gplay_categories_freq['avg_installs'] = None
# Installs is stored as strings such as '10,000+'. Convert only while the column
# is still non-numeric, so that re-running this cell does not raise an error.
if not pd.api.types.is_numeric_dtype(gplay_dframe.Installs):
    installs = gplay_dframe.Installs.str.replace('+', '', regex=False).str.replace(',', '', regex=False)
    gplay_dframe.Installs = pd.to_numeric(installs)


for genre in gplay_dframe['Category'].unique():
    df = gplay_dframe[gplay_dframe.Category==genre]
    total_installs = df.Installs.sum()
    count = gplay_categories_freq.loc[genre,'count']
    avg_install_count = total_installs/count

    gplay_categories_freq.loc[genre,'avg_installs'] = avg_install_count
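The `Installs` column holds strings such as `'10,000+'`, as seen in the sample rows earlier. Below is a minimal sketch of a conversion helper that tolerates being applied to values that are already integers; `parse_installs` is a hypothetical name and not part of the notebook.

```python
def parse_installs(value):
    """Convert a Play Store install bucket such as '10,000+' to an int.

    Values that are already ints are returned unchanged, so the function
    can be applied more than once without raising.
    """
    if isinstance(value, int):
        return value
    return int(value.replace('+', '').replace(',', ''))


# The first three strings are the install buckets of the sample rows shown earlier
samples = ['10,000+', '500,000+', '5,000,000+', 1000]
print([parse_installs(v) for v in samples])
# [10000, 500000, 5000000, 1000]
```

Note that the trailing `+` means each value is a lower bound on the true install count, so averages built from these numbers are approximate.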

In [20]:
print("Gplay Category Frequencies:")
print(gplay_categories_freq.sort_values(by=['avg_installs','percentage'], ascending=False))

Gplay Category Frequencies:
                     count  percentage avg_installs
BEAUTY                  53    0.597924       69.283
COMICS                  55    0.620487      66.7636
ART_AND_DESIGN          57    0.643051      64.4211
PARENTING               58    0.654332      63.3103
EVENTS                  63    0.710740      58.2857
WEATHER                 71    0.800993      51.7183
HOUSE_AND_HOME          73    0.823556      50.3014
AUTO_AND_VEHICLES       82    0.925090      44.7805
LIBRARIES_AND_DEMO      83    0.936372       44.241
ENTERTAINMENT           85    0.958935         43.2
EDUCATION              103    1.162004      35.6505
FOOD_AND_DRINK         110    1.240975      33.3818
MAPS_AND_NAVIGATION    124    1.398917      29.6129
VIDEO_PLAYERS          159    1.793773      23.0943
DATING                 165    1.861462      22.2545
BOOKS_AND_REFERENCE    190    2.143502      19.3263
SHOPPING               199    2.245036      18.4523
TRAVEL_AND_LOCAL       207    2.3352

In [21]:
criteria = gplay_dframe.Category == 'COMMUNICATION'
df = gplay_dframe[criteria][['App','Category', 'Installs']]

print(df.sort_values(by='Installs', ascending=False))

                                                    App       Category  \
277                                  WhatsApp Messenger  COMMUNICATION   
302            Messenger – Text and Video Chat for Free  COMMUNICATION   
356                                            Hangouts  COMMUNICATION   
318                        Google Chrome: Fast & Secure  COMMUNICATION   
306                       Skype - free IM & video calls  COMMUNICATION   
344                                               Gmail  COMMUNICATION   
314                         LINE: Free Calls & Messages  COMMUNICATION   
3372                                    Viber Messenger  COMMUNICATION   
322         UC Browser - Fast Download Private & Secure  COMMUNICATION   
294               Google Duo - High Quality Video Calls  COMMUNICATION   
303                       imo free video calls and chat  COMMUNICATION   
307                                                 Who  COMMUNICATION   
3332        UC Browser Mini -Tiny Fast

Once again we see that the number of installs for many of the categories is dominated by a handful of very popular apps. However, 