# Guided Project: Profitable App Profiles For The App Store And Google Play Markets

As a company that builds free apps to download and install, our main source of revenue is from in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app. In this project, I will be analyzing data to understand what type of apps are likely to attract more users and thus are more profitable for the company.

Below is a function we've defined to review any dataset, so that we have an easy way to review our data.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
apple_file = open('AppleStore.csv')
google_file = open('googleplaystore.csv')

from csv import reader
apple_data = list(reader(apple_file))
google_data = list(reader(google_file))

# Identify available columns
explore_data(apple_data[1:],0,1, True)
print('\n')
explore_data(google_data[1:],0,1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Documentation for the columns can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for the App Store data and [here](https://www.kaggle.com/lava18/google-play-store-apps) for the Google Play data.

## Removing Invalid Records
As explained in [this discussion thread](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the Google Play dataset forum, one of the records is missing one of the columns and will cause issues for our data work. So we must delete the particular record (found to be on line 10473).

In [4]:
# cleaning of record that is missing the 'Category' column value in the Google Play data
explore_data(google_data,10473,10474)
del google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




In [5]:
explore_data(google_data,10473,10474)

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




## Duplicate Removal
### Part One
Reviewing the Google Play data, we can also see that there are duplicate entries for the same apps. Below we're using the Instagram app as an example:

In [6]:
for app in google_data[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In the below code, we loop through the Google Play data. For each app, we check whether the app's name exists within the `unique_apps` list. If it does, we append it to the `duplicate_apps` list. Otherwise, we apped it to the `unique_apps` list.

In [7]:
duplicate_apps = []
unique_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps), '\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

Number of duplicate apps:  1181 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We will want to clean the data to ensure unique entries. We could remove duplicates randomly, but we run the risk of losing the most accurate data available to us.

We can see that the main difference between the entries appears to be the number of reviews in the ```Reviews``` field. We can utilise this to make sure that we only keep the entry with the most current number of ratings, and remove the rest.

In [8]:
print('number of apps after duplicate cleanup: ', len(google_data[1:]) - len(duplicate_apps))

number of apps after duplicate cleanup:  9659


### Part Two
In the below code:
* We create a new dictionary called `reviews_max` to hold a list of the highest review for each app.
* We loop through the Google Play data and for each app, grab the name and number of reviews:
    * If the app name already exists in the dictionary and the number of reviews is higher than the key's associated value, we update the value with the new number of reviews.
    * Otherwise, if the name does not exist, we append the app name and number of ratings as a new key-value pair in the dictionary. 

In [9]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print(reviews_max['Instagram'])

9659
66577446.0


Now that we have a list of the highest ratings for each app, we can select the appropriate unique records for our data set:

* We initialise two new lists, `android_clean` and `already_added`.
* Looping through the Google Play data, we pull the current app's name and number of ratings:
    * If the number of ratings matches the app's maximum number of ratings and the app has not already been added, we add the app's record to the `android_clean` list and the app's name to the `already_added` list.

In [10]:
android_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(android_clean[:5])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


## Filtering Non-English Records
### Part One
As our company uses English for any apps they develop, we want to analyse only the apps that are aimed at an English-speaking audience. To do this, we have written the below function to identify non-English characters in an app name.

We loop through each character of the supplied string. Using the built-in `ord()` function, we check the character's Unicode number:
* If the number is higher than the standard English character range (0-127), we can safely assume that the app name contains non-English characters and return a false value.
* If no character's Unicode number is above 127, we can safely assume the app name is English and so return a true value.

In [11]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    
    return True

In [12]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


### Part Two
As we can see in the printed values above, certain values are incorrectly identified as non-English. This occurs when emojis or symbols appear within the app's name. To minimise the chance of valid apps being incorrectly removed, we will update the function to only return true if three or more non-English characters are identified.

In [13]:
def is_english(string):
    no_ascii = 0;
    for char in string:
        if ord(char) > 127:
            no_ascii += 1;
            
        if no_ascii > 3:
            return False
        else:
            return True
        
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True


Now we'll use our new function to filter the two data sets.

In [14]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    
    if is_english(name):
        android_english.append(app)
        
for app in apple_data:
    name = app[0]
    
    if is_english(name):
        ios_english.append(app)

explore_data(ios_english, 1, 5, True)
print('\n')
explore_data(android_english, 1, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number

## Filtering Out Paid Apps

As our company's main source of revenue consists of in-app ads, we only want to see the free apps within our lists. To do this, we'll loop through each list and isolate the free apps within separate lists. 

In [15]:
ios_free = []
android_free = []

for app in ios_english:
    cost = app[4]
    
    if cost == '0.0':
        ios_free.append(app)

for app in android_english:
    cost = app[6]
    
    if cost == 'Free':
        android_free.append(app)

explore_data(ios_free, 1, 5, True)
print('\n')
explore_data(android_free, 1, 5, True)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 4056
Number of columns: 16


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - N

## Most Common Apps by Genre
### Part One

Our aim is to work out which apps are most likely to attract users as our revenue is based on the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

We will be using the primary columns in both sets: the `prime_genre` column for iOS, and the `Genres` and `Category` columns for Android.

### Part Two

Now we're going to create two functions to help us generate and analyse frequency tables.

The `freq_table()` function will take two inputs: `dataset` (a list of lists) and `index` (an integer). It will return a frequency table as a dictionary for any column we pass to it, with the values as percentages.

The `display_table()` function will convert any dataset to a frequency table (using the `freq_table()` function), sort in descending order by the frequencies and display it. 

In [16]:
def freq_table(dataset, index):
    frequency_table = {}
    dataset_count = len(dataset)
    
    for row in dataset:
        data_point = row[index]
        if data_point in frequency_table:
            frequency_table[data_point] += 1
        else:
            frequency_table[data_point] = 1

    for key in frequency_table:
        frequency_table[key] /= dataset_count
        frequency_table[key] *= 100
    
    return frequency_table

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Part Three

Now we will review the frequency tables and see what insights we can find. We will start with the App Store `genre` frequency table.

In [18]:
print('iOS prime_genre frequency table:')
display_table(ios_free, 11)

iOS prime_genre frequency table:
Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


We can see that the most common genre in the free English app market is games (55.64%), followed by entertainment (8.23%). The games genre has the majority share of the market. Looking at the table's overall results, it appears that the most popular genres tend to be entertainment-focused, with practical-focused apps falling behind.

This would suggest that an entertainment or game-based app would be the most successful profile on the App Store, but it's worth considering that the market may be too overpopulated or that demand does not necessarily match the supply, so this may not be a good fit for our apps.

Let's review the Google Play data using the `Category` and `Genres` columns (they appear to be related) next.

In [19]:
print('Google Play Category frequency table:')
display_table(android_free, 1) # Category

Google Play Category frequency table:
FAMILY : 18.96900269541779
GAME : 9.703504043126685
TOOLS : 8.434411500449237
BUSINESS : 4.5822102425876015
LIFESTYLE : 3.930817610062893
PRODUCTIVITY : 3.8858939802336026
FINANCE : 3.6837376460017968
MEDICAL : 3.515274034141959
SPORTS : 3.380503144654088
PERSONALIZATION : 3.313117699910153
COMMUNICATION : 3.234501347708895
HEALTH_AND_FITNESS : 3.0660377358490565
PHOTOGRAPHY : 2.9424977538185084
NEWS_AND_MAGAZINES : 2.8301886792452833
SOCIAL : 2.6504941599281224
TRAVEL_AND_LOCAL : 2.324797843665768
SHOPPING : 2.2461814914645104
BOOKS_AND_REFERENCE : 2.178796046720575
DATING : 1.853099730458221
VIDEO_PLAYERS : 1.7969451931716083
MAPS_AND_NAVIGATION : 1.4150943396226416
FOOD_AND_DRINK : 1.2353998203054808
EDUCATION : 1.1680143755615455
ENTERTAINMENT : 0.9546271338724168
LIBRARIES_AND_DEMO : 0.9321653189577718
AUTO_AND_VEHICLES : 0.9209344115004492
HOUSE_AND_HOME : 0.8198562443845463
WEATHER : 0.7973944294699011
EVENTS : 0.7075471698113208
PARENTING :

We can see significant differences on Google Play: Practical app categories (Family, Tools, Business, etc.) appear to be dominating the marketplace. On closer inspection, the Family category (which holds 18% of the market) means mostly kid-focused games.

However, it does appear that the productivity apps are more prevalent in Google Play, as confirmed by the data from the `Genre` frequency table:

In [20]:
print('Google Play Genres frequency table:')
display_table(android_free, 9)

Google Play Genres frequency table:
Tools : 8.423180592991914
Entertainment : 6.087151841868823
Education : 5.3908355795148255
Business : 4.5822102425876015
Lifestyle : 3.919586702605571
Productivity : 3.8858939802336026
Finance : 3.6837376460017968
Medical : 3.515274034141959
Sports : 3.447888589398023
Personalization : 3.313117699910153
Communication : 3.234501347708895
Action : 3.088499550763702
Health & Fitness : 3.0660377358490565
Photography : 2.9424977538185084
News & Magazines : 2.8301886792452833
Social : 2.6504941599281224
Travel & Local : 2.3135669362084457
Shopping : 2.2461814914645104
Books & Reference : 2.178796046720575
Simulation : 2.0664869721473496
Dating : 1.853099730458221
Arcade : 1.8418688230008984
Video Players & Editors : 1.7744833782569631
Casual : 1.7520215633423182
Maps & Navigation : 1.4150943396226416
Food & Drink : 1.2353998203054808
Puzzle : 1.1230907457322552
Racing : 0.9883198562443846
Role Playing : 0.9321653189577718
Libraries & Demo : 0.9321653189577

The `Genre` column seems to be much more granular than the `Category` column, and it's not immediately clear what the relationship is between the two. As we're only looking at a high-level plan for the app, we'll focus on the `Category` column for now.

The frequency tables we analyzed showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

In [21]:
genres_ios = freq_table(ios_free,11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    for app in ios_free:
        genre_app = app[11]
        
        if genre_app == genre:
            no_of_ratings = float(app[6])
            total += no_of_ratings
            len_genre += 1
    
    avg_ratings = total / len_genre
    print(genre,':',avg_ratings)

Medical : 36.75
Social Networking : 646.0
Weather : 1624.4516129032259
News : 62.3448275862069
Business : 183.6
Health & Fitness : 234.02631578947367
Productivity : 225.3709677419355
Finance : 298.98809523809524
Entertainment : 179.31736526946108
Games : 609.4714222419141
Catalogs : 364.77777777777777
Utilities : 1618.2844036697247
Lifestyle : 1007.5
Sports : 177.25316455696202
Music : 618.8059701492538
Travel : 175.89285714285714
Food & Drink : 457.4651162790698
Shopping : 714.8595041322315
Education : 683.3712121212121
Book : 106.68181818181819
Navigation : 228.15
Reference : 2471.9
Photo & Video : 417.47904191616766


Based on the results, we can see that the `Weather` genre has the highest average user rating (1624.45), followed closely by the `Utilities` genre (1618.28). Digging deeper, we can see that the `Weather` genre's average is heavily influenced by The Weather Channel and AccuWeather:

In [22]:
for app in ios_free:
    if app[-5] == 'Weather':
        print(app[1], ':', app[5]) # print name and number of ratings

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
실시간 날씨 :

The same pattern applies to utility apps, where the average number is heavily influenced by Google Search. Shopping apps are similarly skewed up by Amazon, Groupon, Wish, etc.

Our aim is to find popular genres, but weather, utility and shopping apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.

We can see once we get to the education apps that we're finding more potential:

In [23]:
for app in ios_free:
    if app[-5] == 'Education':
        print(app[1], ':', app[5]) # print name and number of ratings

Duolingo - Learn Spanish, French and more : 162701
Guess My Age  Math Magic : 123190
Lumosity - Brain Training : 96534
Elevate - Brain Training and Games : 58092
Fit Brains Trainer : 46363
ClassDojo : 35440
Memrise: learn languages : 20383
Peak - Brain Training : 20322
Canvas by Instructure : 19981
ABCmouse.com - Early Learning Academy : 18749
Quizlet: Study Flashcards, Languages & Vocabulary : 16683
Photomath - Camera Calculator : 16523
iTunes U : 15801
Blackboard Mobile Learn™ : 13567
Star Chart : 13482
Remind: Fast, Efficient School Messaging : 9796
PBS KIDS Video : 8651
Toca Kitchen Monsters : 8062
Toca Hair Salon - Christmas Gift : 8049
Edmodo : 7197
Prodigy Math Game : 6683
Epic! - Unlimited Books for Kids : 6676
ChineseSkill -Learn Mandarin Chinese Language Free : 6077
Google Classroom : 5942
TED : 5782
Khan Academy: you can learn anything : 5459
Got It - Homework Help Math, Chem, Physics Solver : 4903
PowerSchool Mobile : 4547
SkyView® Free - Explore the Universe : 4188
Hopsco

We can see that the user ratings are much more balanced in the education app market. The more popular apps appear to be around learning, and specifically around either improving mental processes or learning additional languages. It's clear that there is a strong potential for an iOS app that is focused around gamifying learning processes.

### Most Popular Apps by Genre on Google Play
We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [24]:
display_table(android_free, 5) # Installs column

1,000,000+ : 15.689577717879605
100,000+ : 11.579065588499551
10,000,000+ : 10.500898472596585
10,000+ : 10.25381850853549
1,000+ : 8.423180592991914
100+ : 6.918238993710692
5,000,000+ : 6.8171608265947885
500,000+ : 5.536837376460018
50,000+ : 4.818059299191375
5,000+ : 4.526055705300988
10+ : 3.5377358490566038
500+ : 3.234501347708895
50,000,000+ : 2.2911051212938007
100,000,000+ : 2.1226415094339623
50+ : 1.9092542677448336
5+ : 0.7861635220125787
1+ : 0.5166217430368374
500,000,000+ : 0.2695417789757413
1,000,000,000+ : 0.22461814914645103
0+ : 0.04492362982929021


For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.

In [34]:
categories_android = freq_table(android_free,1)
category_installs = {}

for category in categories_android:
    total = 0
    len_category = 0
    
    for app in android_free:
        category_app = app[1]
        
        if category_app == category:
            no_of_installs = float(app[5].replace('+','').replace(',',''))
            total += no_of_installs
            len_category += 1
    
    avg_installs = total / len_category
    category_installs[category] = avg_installs

template = '{} : {:,}'

for cat, installs in category_installs.items():
    print(template.format(cat,installs))

EDUCATION : 1,825,480.7692307692
COMICS : 803,234.8214285715
LIFESTYLE : 1,436,126.94
HEALTH_AND_FITNESS : 4,188,821.9853479853
COMMUNICATION : 38,322,625.697916664
MEDICAL : 120,550.61980830671
ENTERTAINMENT : 11,640,705.88235294
FINANCE : 1,387,692.475609756
BUSINESS : 1,708,215.906862745
MAPS_AND_NAVIGATION : 3,993,339.603174603
AUTO_AND_VEHICLES : 647,317.8170731707
SPORTS : 3,638,640.1428571427
ART_AND_DESIGN : 1,952,105.1724137932
PERSONALIZATION : 5,183,850.806779661
NEWS_AND_MAGAZINES : 9,401,635.952380951
HOUSE_AND_HOME : 1,331,540.5616438356
BEAUTY : 513,151.88679245283
WEATHER : 5,074,486.197183099
LIBRARIES_AND_DEMO : 638,503.734939759
TOOLS : 10,787,009.952063914
PARENTING : 542,603.6206896552
PRODUCTIVITY : 16,738,957.554913295
PHOTOGRAPHY : 17,772,018.759541985
FOOD_AND_DRINK : 1,924,897.7363636363
TRAVEL_AND_LOCAL : 13,984,077.710144928
FAMILY : 3,671,043.037892244
VIDEO_PLAYERS : 24,573,948.25
DATING : 854,028.8303030303
EVENTS : 253,542.22222222222
BOOKS_AND_REFERENCE

The highest results for installs on the Google Play Store is found in communication apps (38,322,625.697916664), followed by video player apps (24,573,948.25). This is not surprising given the amount of ways to communicate in the modern world (SMS, instant messaging, email, phone calls, social media, etc.) but let's take a closer look at the top end of the communication category:

In [50]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


We can see that the communication category is being heavily skewed up by some of the big names: WhatsApp, Messenger, Gmail, Skype and Chrome are all hitting upwards of 1,000,000,000 installs. Let's see what happens to our average if we remove any app with over 100,000,000 installs:

In [52]:
below_100_m = []

for app in android_free:
    no_of_installs = float(app[5].replace('+','').replace(',',''))
    total += no_of_installs
    
    if (app[1] == 'BOOKS_AND_REFERENCE' and float(no_of_installs) < 100000000):
        below_100_m.append(float(no_of_installs))
    
sum(below_100_m) / len(below_100_m)

1407123.0687830688

Our average has now dropped by 90%. A similar situation is occurring in the runner-up categories of video players (dominated by Youtube and Google Play Movies & TV), social (Facebook, Instagram, Snapchat, LinkedIn, Pinterest, etc), photography (Google Photos and other big-name photo editing apps) and productivity (Microsoft Office, Google Docs and Drive, Evernote among others). Given the next categories on our list, games and entertainment, are incredibly saturated markets, we want to avoid these two in favour of somewhere our app can make more of an impact and thus get more users.

This brings us to the books and reference category. With the exception of Google Play Books, there are not many big names that are dominating this market, allowing more chance for our app to stand out. Providing some level of gamification (for example, reading challenges or studying quizzes) and social networking (such as challenge leaderboards, forums or friend recommendations) would help to provide a point of difference for our app within this smaller market.

### Conclusion

Given the results that we've found in both the iOS and Google Play markets, a strong case can be made to create an app in the books and reference genre. A suggestion would be an online library of study resources (that can be bought in-app for cheaper rates than their physical counterparts). Users can access the resources, as well as additional study material, both in-house and user-submitted. They can also connect with friends and other users studying the same materials, to either discuss the content or gamify their studying through challenges and leaderboards.