# Types of apps that are likely to attract more users

This project is intended to give insights into what types of apps are likely to attract more users. We will share the findings with our development team so that they know which types of apps to build in order to reach the largest number of users. Since we only develop free apps, our main source of revenue is in-app ads, which is why we are interested in finding out what types of apps attract more users.

Our goal for this project is to determine which apps currently available in the App Store and the Google Play store are attracting users, and why.

As a first step, we're going to open both data sets and slice each one into two lists: a list with the header row and a list with the actual data. The lists are named with ios or android accordingly, to distinguish between the two operating systems.

In [1]:
from csv import reader

# import ios data set and convert to two separate lists
opened_file = open(r'C:\AppleStore.csv', encoding='utf8')  # raw string so the backslash is not treated as an escape
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios_data = ios[1:]

# import android data set and convert to two separate lists
opened_file = open(r'C:\googleplaystore.csv', encoding='utf8')  # raw string so the backslash is not treated as an escape
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]
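
The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that; the name load_csv is hypothetical and not part of the original notebook. It uses a with block so that the file is closed once it has been read.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, data_rows).

    Hypothetical helper, shown only to illustrate the pattern.
    """
    # newline='' is the setting the csv module documentation recommends for files
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # The first row is the header, everything after it is data
    return rows[0], rows[1:]
```

With this helper, the first cell would reduce to calls such as ios_header, ios_data = load_csv(r'C:\AppleStore.csv').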

Now that we have our data sets opened and converted into lists, we're going to inspect the data and see what it contains. To do this, we first define a function that prints a slice of a data set with a blank line between rows so that they are easier to read. The function takes four inputs: the data set, the start and end indices of the slice to display, and a flag that, when True, also prints the number of rows and columns in the data set (the data set is expected to be passed without its header row).

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row
        
    if rows_and_columns is True:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        
    return

print(ios_header)
print('\n')
explore_data(ios_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7197
Number of columns:  16


Looking at the data from the iOS store, we see a total of 7,197 apps and 16 columns. Of these columns, we might be able to use "track_name," "currency," "price," "rating_count_tot," "rating_count_ver," and "prime_genre" for our analysis. For more details on what these columns mean, see the [data set documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [3]:
print(android_header)
print('\n')
explore_data(android_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', '15-Jan-18', '2.0.0', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


Analyzing the data from the Google Play store, we see that there are a total of 10,841 apps and 13 columns, out of which the following seem interesting: "App," "Category," "Reviews," "Installs," "Type," "Price," and "Genres."

Now, we are going to check both data sets to make sure the data is reliable. The first step is to make sure that every row of each data set has the same length as the header row. A row that is shorter or longer than the header has a missing or an extra value.

In [4]:
length_ios_header = len(ios_header)
for row in ios_data:
    if len(row) != length_ios_header:
        print(row)
        print(ios_data.index(row))

Running the above code, we did not get any output, which means that every row in the iOS data set has the same number of columns as the header. Now we are going to run the same test for the Android data set.

In [5]:
# Alternative: loop over the full list (header included) and print the index of each mismatched row
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print(index)


10473


Based on the output from the above code, we can see that the row at index 10473 of the full list (index 10472 once the header row is excluded) does not have the same length as the header row, which means that a data point is missing in this row. We are going to delete this row from the data set in the next step.

In [6]:

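# android includes the header row, so data row 10472 sits at index 10473
del android[10473]
# Note: android_data was built earlier with android[1:], which is a copy,
# so this deletion does not remove the row from android_data
print(len(android))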

10841
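
One detail worth checking at this step: android_data was created earlier with android[1:], and a slice is a separate list object. The toy sketch below uses hypothetical names (full and data) and made-up rows, not the real data set. It shows that deleting from the original list leaves the slice untouched, which suggests that a row removed from android would also have to be removed from android_data (for example with del android_data[10472]), or the slice would have to be rebuilt.

```python
# Toy stand-in for the full list: one header row plus three data rows
full = [['name', 'value'], ['a', '1'], ['b'], ['c', '3']]
data = full[1:]          # slicing builds a new list object

del full[2]              # removes ['b'] from `full` only

print(len(full))         # 3
print(len(data))         # 3, because `data` still contains ['b']
print(['b'] in data)     # True
```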


Now, we are going to check for duplicate apps in each data set. To do this, we create two empty lists, one for unique apps and one for duplicate apps. We run the check on the iOS data set first, and then on the Android data set.

In [7]:
unique_apps = []
duplicate_apps = []

for row in ios_data:
    app_id = row[0]  # column 0 of the iOS data set is 'id', not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  0


Examples of duplicate apps:  []


As we can see, there are no duplicate ids in the iOS data set. Now, we're going to run the same check on the app names in the Android data set.

In [8]:
unique_apps = []
duplicate_apps = []

for row in android_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We can see above that there are 1,181 duplicate entries in the Android data set. To keep things simple, I reused the same list names from the iOS example above.
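
The check above tests membership in a list, which scans the list on every lookup. A set offers faster lookups for the same logic. The sketch below is one way to write it; find_duplicates and its name_index parameter are hypothetical names, and the sample rows are made up for illustration.

```python
def find_duplicates(rows, name_index=0):
    """Return the names that appear more than once, one entry per repeat."""
    seen = set()       # names encountered so far
    duplicates = []    # every repeated occurrence, in order
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


sample = [['Slack'], ['Box'], ['Slack'], ['Box'], ['Slack'], ['Zoom']]
print(find_duplicates(sample))  # ['Slack', 'Box', 'Slack']
```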

To find out more about the duplicate apps, I'm going to print the rows of one duplicate app and see what the data looks like.

In [9]:
for row in android_data:
    name = row[0]
    if name == 'Slack':
        print(row)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', '2-Aug-18', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', '2-Aug-18', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', '2-Aug-18', 'Varies with device', 'Varies with device']


I took the example of the Slack app. If we look at the rows above, the only difference between the three entries is in the fourth column (Reviews), which holds the number of reviews. The third entry has the highest number of reviews, so it is the one we want to keep, and the other two entries should be removed. We will use this logic to remove duplicates in the following steps.

In [10]:
reviews_max = {}

for row in android_data:
    name = row[0]
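    # n_reviews stays a string here, so '<' below compares text, not numbers;
    # float(row[3]) would compare numerically but fails on any non-numeric value
    n_reviews = row[3]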
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

expected_length = len(android_data) - len(duplicate_apps)
print(expected_length)
print(len(reviews_max))

9660
9660


Above, we created an empty dictionary and looped through the Android data set. For each app, we stored the name and the number of reviews in the dictionary. If the name was already in the dictionary, we checked whether the stored number of reviews was lower than the number in the current row and, if so, replaced it with the higher number.

Then, we checked the expected length of our new data set by subtracting duplicate apps from the original data set and comparing this length with the length of the newly created dictionary, to make sure we will have the same length after removing our duplicates.
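
Because the review counts are read from the CSV file as strings, the < comparison in the cell above compares text rather than numbers. A minimal sketch of a numeric version follows. The names parse_reviews and max_reviews_by_name are hypothetical, the sample rows are made up, and skipping rows whose review count cannot be parsed is an assumption about how malformed rows should be handled.

```python
def parse_reviews(text):
    """Return the review count as a float, or None if it is not a plain number."""
    try:
        return float(text)
    except ValueError:
        return None


def max_reviews_by_name(rows, name_index=0, reviews_index=3):
    """Map each app name to its highest review count, compared numerically."""
    reviews_max = {}
    for row in rows:
        n_reviews = parse_reviews(row[reviews_index])
        if n_reviews is None:
            continue  # assumption: skip rows with a malformed review count
        name = row[name_index]
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    return reviews_max


rows = [['A', '', '', '999'], ['A', '', '', '1000'], ['B', '', '', '3.0M']]
print('999' < '1000')             # False: strings compare character by character
print(max_reviews_by_name(rows))  # {'A': 1000.0}
```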

In [11]:
android_clean = []
already_added = []

for row in android_data:
    name = row[0]
    n_reviews = row[3]
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print(len(android_clean))

9660


Above, we finished cleaning the Android data set. We first created two empty lists: one for the cleaned data and one to keep track of the names of apps that have already been added. We then looped through the Android data set and compared the number of reviews in each row with the number stored in our dictionary for that app name. If the two matched and the name was not yet in already_added, we appended the full row to android_clean and the name to already_added. The second condition ensures that only one row is kept when several rows for the same app share the highest number of reviews.

Now we have completed the cleaning of duplicate apps.

## Removing Non-English Apps


Now, we are going to begin the process of removing non-English apps from our data. Since our company is only focused on creating apps for the English-speaking population, we are only interested in analyzing English-language apps.

To begin with this, we're going to write a function that checks whether an app's name is in English or not.

In [12]:
def is_english(a_string):
    special_char = 0
    for character in a_string:
        if ord(character) > 127:
            special_char += 1
    if special_char > 3:
        return False
    else:
        return True

In [13]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视')) # more than three characters with code points above 127
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat😜'))

True
False
True
True


In [14]:
print(ord('😜'))
print(ord('™'))


128540
8482


In [15]:
print(ord('b'))
print(ord('j'))

98
106


Above, we wrote a function that checks each character of a string and uses the built-in ord() function to get its code point. The characters commonly used in English text (letters, digits, punctuation) are all in the ASCII range, with code points from 0 to 127, so a character with a code point above 127 suggests that the name may not be in English.

However, rejecting a name as soon as it contains a single character above 127 would be too strict: names such as Docs To Go™ Free Office Suite and Instachat 😜 would be labeled non-English, because emojis and special characters like ™ fall outside the 0-127 range.

To minimize the impact of data loss, is_english() only returns False if a name has more than three non-ASCII characters. The filter is not perfect: an English name with more than three emojis or special characters will still be discarded, and a non-English name with three or fewer such characters will be kept.

Now that we have our function ready, we loop through both data sets and use is_english() to filter out the non-English apps. Each row whose name is identified as English is appended to a new list.
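
The threshold of three non-ASCII characters can also be written as an explicit parameter, which makes it easy to try stricter or looser filters. The sketch below is an illustration only; the name is_probably_english and the parameter max_non_ascii are hypothetical and do not appear in the notebook.

```python
def is_probably_english(a_string, max_non_ascii=3):
    """Return True if the string has at most `max_non_ascii` non-ASCII characters."""
    # Count the characters whose code point falls outside the ASCII range (0-127)
    non_ascii = sum(1 for character in a_string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))                       # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视'))          # False
print(is_probably_english('Docs To Go™ Free Office Suite'))   # True
print(is_probably_english('Instachat😜', max_non_ascii=0))    # False
```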

In [16]:
ios_english = []
android_english = []

for row in ios_data:
    if is_english(row[1]):
        ios_english.append(row)
        
for row in android_clean:
    if is_english(row[0]):
        android_english.append(row)
        
explore_data(ios_english, 0, 3, True)
print('\n')
explore_data(android_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  6183
Number of columns:  16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device'

After running the above loop, we are left with 6,183 English apps in the iOS data set and 9,615 in the Android data set. We used our function to check whether each app name was English and, if it was, we added the whole row to a new list. These two lists will now be the basis of our analysis.

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis. Below, we do this for both data sets.

In [17]:
ios_free = []
android_free = []

for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)
        
for row in android_english:
    app_type = row[6]  # column 6 is 'Type', which is 'Free' for free apps
    if app_type == 'Free':
        android_free.append(row)
        
explore_data(ios_free, 0, 3, True)
print('\n')
explore_data(android_free, 0, 3, True)
print('\n')


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  3222
Number of columns:  16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device'

## Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1) Build a minimal Android version of the app, and add it to Google Play.
2) If the app has a good response from users, we then develop it further.
3) If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.


Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful in both markets.

For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.



    
    
We'll build two functions we can use to analyze the frequency tables:

1) One function to generate frequency tables that show percentages
2) Another function that we can use to display the percentages in descending order

A frequency table is stored in a dictionary. We can create a dictionary and populate it with values by following these steps:

1) We create an empty dictionary.
2) We add values one by one to that empty dictionary.

Adding a value to a dictionary follows the pattern dictionary_name[index] = value. For example, to add a value 4433 with an index '4+' to a dictionary named content_ratings, we use the code content_ratings['4+'] = 4433.

The functions and loops below also rely on the accumulator pattern (total = 0, followed by total += ...). In freq_table(), total counts the rows of the data set. Later, when we compute averages per genre:

1) We initialize a variable named total with a value of 0. This variable stores the sum of user ratings (the number of ratings, not the actual ratings) specific to each genre.
2) We initialize a variable named len_genre with a value of 0. This variable stores the number of apps specific to each genre.




In [18]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
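
For comparison, the same kind of percentage table can be built with collections.Counter from the standard library. This is an alternative technique, not the approach used in the notebook; the name freq_table_counter is hypothetical and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows with that value in column `index`}."""
    counts = Counter(row[index] for row in dataset)   # value -> number of rows
    total = sum(counts.values())                      # total number of rows
    return {key: count / total * 100 for key, count in counts.items()}


sample = [['x', 'Games'], ['y', 'Games'], ['z', 'Music'], ['w', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```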


However, dictionaries are not sorted by value, which makes the frequency tables hard to analyze as they are.

We'll need to build a second function that displays the entries of the frequency table in descending order. To do that, we'll use the built-in sorted() function. This function takes in an iterable (like a list, dictionary, or tuple) and returns a list of its elements sorted in ascending or descending order (the reverse parameter controls the direction).

Passing a dictionary to sorted() directly only sorts its keys, so we first transform the dictionary into a list of tuples, where each tuple contains a dictionary value along with its corresponding key. Tuples are compared element by element, so the dictionary value comes first and the dictionary key comes second to make the sort order follow the values.

This is a roundabout way to sort a dictionary, and there are simpler ways to do it with more advanced techniques.

Using the workaround above, we write a helper function named display_table(), which builds on freq_table(). The display_table() function below:

1) Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.
2) Generates a frequency table using the freq_table() function.
3) Transforms the frequency table into a list of tuples, then sorts the list in descending order.
4) Prints the entries of the frequency table in descending order.



In [19]:

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
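
One of the simpler ways to sort a dictionary by value is to pass a key function to sorted(), which avoids building the (value, key) tuples by hand. The sketch below assumes a frequency table shaped like the one returned by freq_table(); the name sorted_entries and the sample table are hypothetical.

```python
def sorted_entries(table):
    """Return the (key, value) pairs of `table`, highest value first."""
    # item[1] is the dictionary value, so the sort follows the percentages
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_table = {'Music': 25.0, 'Games': 75.0}
for key, value in sorted_entries(sample_table):
    print(key, ':', value)
```

Unlike the tuple approach, entries with equal values are not ordered by key here; they keep the order in which they appear in the dictionary.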

In [20]:
# Examine the frequency table for the prime_genre column of the App Store data set
display_table(ios_english, -5) # prime_genre

Games : 54.860100274947435
Entertainment : 7.261846999838266
Education : 6.6310852337053205
Photo & Video : 5.515122109008572
Utilities : 3.4449296458030085
Productivity : 2.7171276079573023
Health & Fitness : 2.6686074721009216
Music : 2.215752870774705
Social Networking : 2.037845705967977
Sports : 1.6820313763545207
Lifestyle : 1.6011644832605532
Shopping : 1.3747371825974446
Weather : 1.1159631246967492
Travel : 0.9704027171276078
News : 0.9218825812712276
Book : 0.8895358240336406
Reference : 0.8571890667960537
Business : 0.8571890667960537
Finance : 0.7924955523208799
Food & Drink : 0.7116286592269124
Navigation : 0.452854601326217
Medical : 0.3396409509946628
Catalogs : 0.08086689309396733


We can see that among the English apps, more than half (54.86%) are Games. Note that the table is built from ios_english, so it covers paid as well as free apps.

Entertainment apps make up a little over 7%, followed by education apps at about 6.6% and photo and video apps at about 5.5%. Social networking apps account for only about 2% of the apps in our data set.

The general impression is that the App Store (at least the part containing English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer.

However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not be the same as the offer.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).


In [21]:
display_table(android_english, 1) # Category

FAMILY : 19.34477379095164
GAME : 9.786791471658866
TOOLS : 8.60114404576183
BUSINESS : 4.357774310972439
MEDICAL : 4.108164326573062
PERSONALIZATION : 3.9001560062402496
PRODUCTIVITY : 3.8793551742069687
LIFESTYLE : 3.785751430057202
FINANCE : 3.58814352574103
SPORTS : 3.3801352054082163
COMMUNICATION : 3.2657306292251693
HEALTH_AND_FITNESS : 2.995319812792512
PHOTOGRAPHY : 2.912116484659386
NEWS_AND_MAGAZINES : 2.6001040041601664
SOCIAL : 2.485699427977119
TRAVEL_AND_LOCAL : 2.277691107644306
BOOKS_AND_REFERENCE : 2.267290691627665
SHOPPING : 2.0904836193447736
DATING : 1.7784711388455536
VIDEO_PLAYERS : 1.6952678107124284
MAPS_AND_NAVIGATION : 1.341653666146646
FOOD_AND_DRINK : 1.1648465938637547
EDUCATION : 1.1128445137805512
ENTERTAINMENT : 0.9048361934477379
LIBRARIES_AND_DEMO : 0.8736349453978159
AUTO_AND_VEHICLES : 0.8736349453978159
WEATHER : 0.8216328653146125
HOUSE_AND_HOME : 0.7592303692147686
EVENTS : 0.6656266250650026
PARENTING : 0.62402496099844
ART_AND_DESIGN : 0.62402

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for about 19% of the apps) consists mostly of games for kids.

Even so, practical apps seem to have a better representation on Google Play compared to the App Store. This picture is also confirmed by the frequency table we see for the Genres column:

In [22]:
display_table(android_english, -4) # Genres

Tools : 8.590743629745191
Entertainment : 5.793031721268851
Education : 5.231409256370255
Business : 4.357774310972439
Medical : 4.108164326573062
Personalization : 3.9001560062402496
Productivity : 3.8793551742069687
Lifestyle : 3.7753510140405613
Finance : 3.58814352574103
Sports : 3.4425377015080603
Communication : 3.2657306292251693
Action : 3.109724388975559
Health & Fitness : 2.995319812792512
Photography : 2.912116484659386
News & Magazines : 2.6001040041601664
Social : 2.485699427977119
Travel & Local : 2.267290691627665
Books & Reference : 2.267290691627665
Shopping : 2.0904836193447736
Simulation : 1.9760790431617263
Arcade : 1.9136765470618826
Dating : 1.7784711388455536
Casual : 1.7056682267290693
Video Players & Editors : 1.6744669786791473
Maps & Navigation : 1.341653666146646
Puzzle : 1.2376495059802393
Food & Drink : 1.1648465938637547
Role Playing : 1.0816432657306292
Strategy : 0.9776391055642226
Racing : 0.9464378575143005
Libraries & Demo : 0.8736349453978159
Auto &

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. 


Now we'd like to get an idea about the kind of apps that have the most users.

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing.

As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:


In [4]:
genres_ios = freq_table(ios_english, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_english:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [24]:
for app in ios_english:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
MotionX GPS : 14970
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
Gaia GPS Classic : 2429
Plane Finder - Flight Tracker : 1438
iMaps+ for Google Maps ™ and Street View ™ : Transit and Offline Contacts : 1225
NAVIGON Europe : 927
Localscope - Find places and people around you : 868
Ski Tracks : 829
TRANSPORT MODS for MINECRAFT Pc EDITION : 754
Pocket Earth PRO Offline Maps & Travel Guides : 748
Ship Finder : 624
Boating USA : 342
Maps 3D PRO - GPS for Bike, Hike, Ski & Outdoor : 280
Cachly - Simple and powerful Geocaching for iPhone : 263
ImmobilienScout24: Real Estate Search in Germany : 187
The JMU Bus App : 35
Avertinoo : 32
iStellar : 30
mySTATE - State College : 26
Road watcher: dash camera, car video recorder. : 10
Streets – Street View Browser : 10
Railway Route Search : 5
parkOmator – for Apple Watch meter expiration timer, notifications & GPS navigator t

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.
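
As a rough illustration of how much a few giants can move the average, the sketch below recomputes a mean with and without apps above a cutoff. The helper name average_ratings and the cutoff of 100,000 ratings are assumptions made for this example; the four rating counts are taken from the Navigation listing above.

```python
def average_ratings(apps, cap=None):
    """Average the rating counts in `apps`, a list of (name, n_ratings) pairs.

    If `cap` is given, apps with more than `cap` ratings are left out.
    """
    counts = [n for _, n in apps if cap is None or n <= cap]
    return sum(counts) / len(counts) if counts else 0.0


navigation_sample = [
    ('Waze - GPS Navigation, Maps & Real-time Traffic', 345046),
    ('Google Maps - Navigation & Transit', 154911),
    ('MotionX GPS', 14970),
    ('Geocaching®', 12811),
]

print(average_ratings(navigation_sample))              # 131934.5
print(average_ratings(navigation_sample, cap=100000))  # 13890.5
```

In this small sample, leaving out the two largest apps lowers the average by roughly a factor of ten.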

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew the average upward:



In [28]:
for app in ios_english:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Sky Guide: View Stars Night or Day : 22100
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
Dictionary.com Dictionary & Thesaurus Premium : 11530
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
WolframAlpha : 7410
e-Sword HD: Bible Study Made Easy : 7309
iHandy Translator Pro : 5163
Dictionary.com Premium Dictionary & Thesaurus for iPad : 4922
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Speak & Translate － Live Voice and Text Translator : 4344
National Geographic World Atlas : 4255
Knots 3D : 3196
iQuran : 2929
Merriam-Webster Dictionary & Thesaurus : 2843
e-Sword LT: Bible Study on the Go : 2152
GUNS MODS for Minecraf

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

Weather apps: people generally don't spend too much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

## Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):



In [26]:
display_table(android_english, 5)  # the Installs column

1,000,000+ : 14.716588663546542
100,000+ : 11.502860114404575
10,000+ : 10.618824752990118
10,000,000+ : 9.713988559542381
1,000+ : 9.162766510660427
100+ : 7.332293291731669
5,000,000+ : 6.302652106084243
500,000+ : 5.252210088403536
5,000+ : 4.83619344773791
50,000+ : 4.815392615704628
10+ : 3.9937597503900157
500+ : 3.4113364534581385
50+ : 2.1216848673946958
50,000,000+ : 2.1112844513780553
100,000,000+ : 1.9552782111284452
5+ : 0.8528341133645346
1+ : 0.6864274570982839
500,000,000+ : 0.24960998439937598
1,000,000,000+ : 0.20800832033281333
0+ : 0.13520540821632865
Free : 0.010400416016640665
0 : 0.010400416016640665



One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
  
To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
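The cleaning step described above can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: the name `parse_installs` is hypothetical, and returning `None` for values that are not numeric (such as the `Free` entry visible in the Installs table above) is an assumption about how malformed rows should be handled.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float.

    Returns None when the cleaned string is not a whole number
    (assumption: such rows are malformed and should be skipped).
    """
    cleaned = raw.replace(',', '').replace('+', '')
    if not cleaned.isdigit():
        return None
    return float(cleaned)


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
print(parse_installs('Free'))        # None
```

Checking for `None` before adding a value to a running total keeps a single bad row from stopping the whole computation.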

In [27]:
categories_android = freq_table(android_english, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_english:
        category_app = app[1]
        if category_app == category:
            # strip the commas and the plus sign so the value can be parsed
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            # skip malformed rows, e.g. the one with 'Free' in the Installs column
            if not n_installs.isdigit():
                continue
            total += float(n_installs)
            len_category += 1
    # a category made up only of malformed rows has nothing to average
    if len_category > 0:
        avg_n_installs = total / len_category
        print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1887285.0
AUTO_AND_VEHICLES : 632501.3214285715
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 7641777.871559633
BUSINESS : 1663758.627684964
COMICS : 817657.2727272727
COMMUNICATION : 35153714.17515924
DATING : 824129.2807017544
EDUCATION : 1770579.4392523365
ENTERTAINMENT : 11375402.298850575
EVENTS : 249580.640625
FINANCE : 1319851.4028985507
FOOD_AND_DRINK : 1891060.2767857143
HEALTH_AND_FITNESS : 3972300.388888889
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 630903.6904761905
LIFESTYLE : 1369954.7774725275
GAME : 14227278.868225291
FAMILY : 3344163.6580645163
MEDICAL : 96691.58734177215
SOCIAL : 22961790.384937238
SHOPPING : 6966908.880597015
PHOTOGRAPHY : 16604098.410714285
SPORTS : 3373767.6861538463
TRAVEL_AND_LOCAL : 13218662.767123288
TOOLS : 9676869.30471584
PERSONALIZATION : 4086652.4853333333
PRODUCTIVITY : 15530942.008042896
PARENTING : 525351.8333333334
WEATHER : 4570892.658227848
VIDEO_PLAYERS : 24121489.079754602
NEWS_AND_MAGAZINES : 947

On average, communication apps have the most installs: 35,153,714 (rounded). This number is skewed heavily upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million and 500 million installs:

In [29]:
for app in android_english:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess


If we removed all the communication apps that have 100 million installs or more, the average would drop roughly tenfold:

In [30]:

under_100_m = []

for app in android_english:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3269220.386759582

We see the same pattern for the video players category, which is the runner-up with an average of 24,121,489 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
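To repeat this check for other categories without copying the loop, the comparison could be wrapped in a function. The following is a sketch under stated assumptions: `avg_installs` and its `cap` parameter are hypothetical names, the sample rows are made up, and the column positions (category at index 1, installs at index 5) mirror the Google Play data used above.

```python
def avg_installs(dataset, category, cap=None, cat_idx=1, installs_idx=5):
    """Average installs for `category`, optionally ignoring apps at or above `cap`."""
    values = []
    for row in dataset:
        if row[cat_idx] != category:
            continue
        cleaned = row[installs_idx].replace(',', '').replace('+', '')
        if not cleaned.isdigit():
            continue  # assumption: skip rows with non-numeric installs
        n = float(cleaned)
        if cap is None or n < cap:
            values.append(n)
    # None signals that no app matched, which avoids dividing by zero
    return sum(values) / len(values) if values else None


# Made-up rows; only indexes 1 and 5 matter here.
sample = [
    ['App A', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['App B', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['App C', 'VIDEO_PLAYERS', '', '', '', '500,000+'],
    ['App D', 'SOCIAL', '', '', '', '10,000+'],
]

print(avg_installs(sample, 'VIDEO_PLAYERS'))
print(avg_installs(sample, 'VIDEO_PLAYERS', cap=100_000_000))
```

Comparing the two printed values for a category gives a rough sense of how much of its average comes from the largest apps.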

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.
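One way to see how far a few giants can inflate a category's average is to compare the mean with the median, which is much less sensitive to extreme values. The sketch below uses Python's standard `statistics` module on a made-up list of install counts; the numbers are illustrative only and are not taken from the data set.

```python
from statistics import mean, median

# Hypothetical install counts for one category: several small apps, one giant.
installs = [1_000, 5_000, 10_000, 50_000, 100_000, 1_000_000_000]

print('mean  :', mean(installs))
print('median:', median(installs))
```

In this toy example the single giant pulls the mean far above what a typical app in the list achieves, while the median stays close to the middle of the smaller apps.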

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 7,641,778. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [31]:
for app in android_english:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.
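The observation about apps built around a single book can be checked by listing the apps in a category whose names mention a given title. The sketch below is an illustration under stated assumptions: `apps_with_keyword` is a hypothetical helper, the sample rows are made up, and only the column positions (name at index 0, category at index 1) follow the Google Play data used in this notebook.

```python
def apps_with_keyword(dataset, category, keyword, name_idx=0, cat_idx=1):
    """Return the names of apps in `category` whose name contains `keyword`."""
    keyword = keyword.lower()
    return [row[name_idx] for row in dataset
            if row[cat_idx] == category and keyword in row[name_idx].lower()]


# Made-up sample rows: [name, category]
sample_books = [
    ['Quran for Android', 'BOOKS_AND_REFERENCE'],
    ['Offline English Dictionary', 'BOOKS_AND_REFERENCE'],
    ['Al Quran Audio', 'BOOKS_AND_REFERENCE'],
    ['Quran Quiz Game', 'GAME'],
]

matches = apps_with_keyword(sample_books, 'BOOKS_AND_REFERENCE', 'quran')
print(len(matches), matches)
```

The same call with other titles as the keyword would show whether any other book already has a cluster of apps around it.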

Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.