# Which app should you build? An analysis of the most profitable app profiles on the AppStore and Google Play Store

According to Statista, by September 2018 there were more than 2 million apps on the AppStore and 2.1 million apps on the Google Play Store. This huge number of apps poses a hard question for new app developers: what app should they build to stand out and succeed?

In this project, we focus on the free-app market, that is, apps that users can download at no cost. These apps earn their revenue mainly from in-app advertising, so it is desirable to attract as many users as possible. By analyzing two datasets covering more than 18,000 apps on the AppStore and Google Play Store, we will try to find out which kinds of apps are likely to attract users.


# 1. Exploring Data
In this project, we analyze two datasets available from Kaggle: the [ios][1] dataset, and the [android][2] dataset.
We first open the files containing the datasets, then read the content into two variables: `ios_data` and `android_data`.

[1]: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home "ios"
[2]: https://www.kaggle.com/lava18/google-play-store-apps/home "android"

In [2]:
opened_ios_file = open('AppleStore.csv',encoding='utf-8')
opened_android_file = open('googleplaystore.csv',encoding='utf-8')

from csv import reader
read_ios_file = reader(opened_ios_file)
read_android_file = reader(opened_android_file)

ios_data = list(read_ios_file)
android_data = list(read_android_file)
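
The cell above leaves both files open after reading. As a minimal sketch of the same loading step, the helper below wraps it in a `with` block so that each file is closed once its rows have been read. The name `load_csv` is an assumption made for illustration and is not part of the original notebook.

```python
from csv import reader


def load_csv(path, encoding='utf-8'):
    """Read a CSV file into a list of rows (the header is row 0)."""
    # newline='' is the setting the csv module documentation recommends,
    # and the with block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as csv_file:
        return list(reader(csv_file))


# Hypothetical usage with the files from this project:
# ios_data = load_csv('AppleStore.csv')
# android_data = load_csv('googleplaystore.csv')
```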

To make data exploration easier, we create a function `explore_data()`. The function prints the rows from index `start` up to (but not including) `end` and, if requested, the number of rows and columns in the dataset.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The first three rows of the ios dataset (the header row plus two apps) are shown below. The dataset has 7,198 rows, so, excluding the header, it contains information about 7,197 apps. Details about the meaning of each column are given in the data [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home). Each app has a unique ID. The fields we will probably care most about are the rating counts and the genre.

In [4]:
#first three rows of ios data
explore_data(ios_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


The first three rows of the android dataset, together with its row and column counts, are shown below. The documentation for this dataset is available [here](https://www.kaggle.com/lava18/google-play-store-apps/home). The android dataset has 10,842 rows (10,841 apps plus the header row), which is more than the ios dataset.

In [5]:
#first three rows of android data
explore_data(android_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


# 2. Cleaning Data
In this section, we fix some errors in the dataset found by other users working on the same datasets. Discussion about these errors is available on the dataset [discussion page](https://www.kaggle.com/lava18/google-play-store-apps/discussion). 

## 2.1. Removing Erroneous Entry
In [one](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) of the discussions, it is reported that entry #10473 in the Google Play dataset has no value in the `Category` column, which shifts the remaining values one column to the left. Since only one of the 10,841 apps in the android dataset has this problem, it is safe to simply delete this entry without affecting the analysis.

In [6]:
android_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [7]:
del android_data[10473]
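
Rather than relying on an index reported by other users, rows like this one can also be located by comparing each row's length with the header's. The following is a minimal sketch under that assumption; `find_malformed_rows` is a hypothetical helper and the sample rows are made up.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is the header row
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: the last row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
print(find_malformed_rows(sample))
```

If it is run on `android_data` before the deletion, this check should flag index 10473, because that row has 12 values against 13 columns. It only catches rows with the wrong number of fields, not wrong values inside a well-formed row.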

## 2.2. Removing Duplicated Entries
The Google Play dataset has some duplicated entries for the same apps because the data was collected at different times, and one entry was created each time the information was recorded. To find the duplicated apps, we write the `find_duplicate()` function. It compares the first column of each row (the app name in the android dataset, the app ID in the ios dataset) and returns two lists: the unique apps and the duplicated apps. Looking at the result, we see that there are no duplicated entries in the ios dataset, but there are 1,181 duplicated entries in the android dataset. We will remove the duplicates so that each app has only one entry. The criterion for removal is the number of reviews: we assume that, for a given app, the entry with the largest number of user reviews is the most recent record. We keep only that entry and delete all the others.

In [18]:
def find_duplicate(dataset):

    duplicate_apps = []
    unique_apps = []
    
    for app in dataset[1:]:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    return unique_apps, duplicate_apps

ios_unique_apps, ios_duplicate_apps = find_duplicate(ios_data)
android_unique_apps, android_duplicate_apps = find_duplicate(android_data)

print('Number of duplicated ios apps:',len(ios_duplicate_apps))
print('Number of unique ios apps:',len(ios_unique_apps))
print('Number of duplicated android apps:',len(android_duplicate_apps))
print('Number of unique android apps:',len(android_unique_apps))

Number of duplicated ios apps: 0
Number of unique ios apps: 7197
Number of duplicated android apps: 1181
Number of unique android apps: 9659
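
`find_duplicate()` tests `name in unique_apps` against a list, which rescans the list for every row. As a sketch of an alternative, the counts can be obtained with `collections.Counter`, which uses hashing instead. The helper name `count_duplicates` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, number of duplicated entries)."""
    counts = Counter(row[name_index] for row in dataset[1:])  # skip header
    n_unique = len(counts)
    # Every occurrence after the first counts as a duplicated entry.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Made-up sample: 'Chat' appears three times, 'Maps' once.
sample = [
    ['App', 'Reviews'],
    ['Chat', '10'],
    ['Chat', '12'],
    ['Maps', '5'],
    ['Chat', '12'],
]
print(count_duplicates(sample))
```

For the sample this gives two unique names and two duplicated entries, which matches the lengths of the two lists `find_duplicate()` would return for the same rows.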


Next, we record the largest number of reviews that each app received. The result is stored in a dictionary in which each key is the name of an app and the value is the largest number of reviews recorded for that app. As a verification step, we print the number of entries in the dictionary and confirm that it matches the number of unique apps in the android dataset.

In [19]:
def find_max_review(dataset, index):
    reviews_max = {}
    for app in dataset[1:]:
        name = app[0]
        n_reviews = float(app[index])

        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
    return reviews_max

# the number of reviews is the 4th column of the android dataset
android_max_reviews = find_max_review(android_data,3)
print('Length of the dictionary:',len(android_max_reviews))

Length of the dictionary: 9659


Now we filter the android dataset, keeping for each app only the record whose number of reviews matches the maximum stored in the dictionary above. If several records of the same app tie at the maximum, only the first one is kept. The first rows of the cleaned dataset are shown below.

In [20]:
def clean_duplicate(dataset, index, reviews_max):   
    dataset_clean = []
    already_added = []

    for app in dataset[1:]:
        name = app[0]
        n_reviews = float(app[index])
        if n_reviews == reviews_max[name] and name not in already_added:
            dataset_clean.append(app)
            already_added.append(name)
    return dataset_clean

android_clean = clean_duplicate(android_data, 3, android_max_reviews)
print('First three rows of the cleaned android dataset:\n')
explore_data(android_clean,0,3,True)

First three rows of the cleaned android dataset:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
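
The two steps above (finding the maximum review count, then filtering) can also be combined into a single pass. This is a minimal sketch, not the notebook's method: `keep_most_reviewed` is a hypothetical helper that expects rows without the header (for example `android_data[1:]`) and assumes the same column positions as above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the most reviews."""
    best = {}  # name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows; the second column only labels each row for the example.
sample = [
    ['Chat', 'a', 'x', '10'],
    ['Maps', 'b', 'x', '5'],
    ['Chat', 'c', 'x', '12'],
    ['Chat', 'd', 'x', '12'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

One difference from `clean_duplicate()` is the order of the result: here apps appear in the order their names are first seen, not in the order of the rows that are kept.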


## 2.3. Removing non-English Apps

Interestingly, not every app in the ios and android stores is in English. In this analysis, assuming that we want to develop an app for English-speaking users, we narrow our focus to apps that are in English. To this end, we write a function `is_English` that checks whether an app name is in English. Since the characters commonly used in English text have code points in the ASCII range (0 to 127), we flag names containing characters whose code point is larger than 127. To avoid discarding apps that are actually in English but contain an emoji or a symbol such as ™, we remove an app only if its name contains more than 3 such characters. Some sample outputs of `is_English` for a variety of inputs are shown below.

In [11]:
def is_English(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count = count + 1
        if count > 3:
            return False
    return True

print(is_English('Instagram'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))

True
False
True
True
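
To see how much the result depends on the cutoff of 3, the same check can be written with the threshold as a parameter. This is only a sketch; the name `is_mostly_ascii` and the parameter `max_non_ascii` are assumptions introduced here.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """True if text has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    # Compare the default threshold with a strict ASCII-only check.
    print(name, is_mostly_ascii(name), is_mostly_ascii(name, max_non_ascii=0))
```

With a threshold of 0, names such as 'Docs To Go™ Free Office Suite' would be rejected, which is why some tolerance is useful. The heuristic remains approximate in the other direction too: a name written in another language using only unaccented Latin letters still passes.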


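Next, we create a function named `only_English` to keep only the apps that are in English. The function checks the first column of each row, which is the app name in the android dataset but the numeric `id` in the ios dataset, so every ios row passes the check. The resulting numbers of apps in the android and ios stores are as follows.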

In [21]:
def only_English(dataset):   
    dataset_english = []

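    for app in dataset:
        # NOTE: index 0 is the app name in the android dataset but the numeric
        # `id` in the ios dataset (`track_name` is at index 1), so ios rows are
        # checked by id here and all of them pass.
        name = app[0]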
        if is_English(name):
            dataset_english.append(app)
    return dataset_english

android_english = only_English(android_clean)
ios_english = only_English(ios_data[1:])
print('Number of English apps in android store:',len(android_english))
print('Number of English apps in the ios store:',len(ios_english))

Number of English apps in android store: 9614
Number of English apps in the ios store: 7197
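
`only_English` reads the name from index 0, while the ios header shown earlier lists `id` at index 0 and `track_name` at index 1. A possible variant, sketched below, takes the name column as a parameter so that each dataset can be filtered on its own name field. `filter_by_name` and the sample rows are assumptions for illustration, and this variant was not used to produce any of the numbers in this notebook.

```python
def filter_by_name(rows, name_index, max_non_ascii=3):
    """Keep rows whose name has at most max_non_ascii non-ASCII characters."""
    kept = []
    for row in rows:
        name = row[name_index]
        non_ascii = sum(1 for char in name if ord(char) > 127)
        if non_ascii <= max_non_ascii:
            kept.append(row)
    return kept


# Made-up rows in the ios layout: id at index 0, track_name at index 1.
ios_sample = [
    ['1', 'Instagram'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
]
print(len(filter_by_name(ios_sample, name_index=0)))  # checks the ids
print(len(filter_by_name(ios_sample, name_index=1)))  # checks the names
```

Checking the ids keeps both sample rows, while checking the names keeps only one. Applied as `filter_by_name(ios_data[1:], name_index=1)`, the ios count would presumably differ from the 7,197 printed above.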


## 2.4. Removing Paid Apps

Since we are only interested in free apps, we filter our datasets again to keep only the free apps. The number of apps remaining in each store after this filter is shown below.

In [22]:
android_free = []
ios_free = []

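for app in android_english:
    app_type = app[6]  # 'Type' column, e.g. 'Free'
    if app_type == 'Free':
        android_free.append(app)

# NOTE: ios_english no longer contains the header row, so this [1:] slice
# also skips the first app in the list.
for app in ios_english[1:]:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)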
print('Number of apps after filters in the android store:',len(android_free))
print('Number of apps after filters in the ios store:',len(ios_free))

Number of apps after filters in the android store: 8863
Number of apps after filters in the ios store: 4055


After all the filters, we are left with 8,863 apps in the android store and 4,055 apps in the ios store, which is sufficient for our analysis.

# 3. Analysis

In this section, we perform several analyses on the two datasets. First, we look at which genres are the most common on the two platforms.

## 3.1. Genre Analysis

We take a look at the frequency tables of the app genres, i.e., the share of each genre in the store, expressed as a percentage. The following `freq_table` and `display_table` functions build and display the frequency table for a given column of a dataset, respectively.

In [27]:
def freq_table(dataset, index):
    freq = {}
    count = 0
    for app in dataset:
        count = count + 1
        column = app[index]
        if column not in freq:
            freq[column] = 1
        else:
            freq[column] = freq[column] + 1
    for key in freq:
        freq[key] = (freq[key] / count) * 100
    return freq

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
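
The same kind of table can be built more compactly with `collections.Counter`. The sketch below is an alternative under that assumption, not the code used to produce the tables in this section; `freq_table_pct` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() already sorts by count, highest first.
    return [(value, count / total * 100) for value, count in counts.most_common()]


sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in freq_table_pct(sample, 1):
    print(f'{genre}: {pct:.1f}%')
```

One small difference from `display_table`: when two values have the same frequency, `most_common()` keeps them in the order they were first seen, whereas sorting `(percentage, key)` tuples in reverse orders ties by key.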

First, let's take a look at the ios store. The relevant column is `prime_genre`.

In [25]:
print('Frequency table of the ios dataset (in percentage):\n')
display_table(ios_free, 11)

Frequency table of the ios dataset (in percentage):

Games : 55.659679408138096
Entertainment : 8.236744759556105
Photo & Video : 4.1183723797780525
Social Networking : 3.501849568434032
Education : 3.255240443896424
Shopping : 2.9839704069050557
Utilities : 2.688039457459926
Lifestyle : 2.318125770653514
Finance : 2.0715166461159065
Sports : 1.9482120838471024
Health & Fitness : 1.8742293464858202
Music : 1.6522811344019728
Book : 1.627620221948212
Productivity : 1.528976572133169
News : 1.4303329223181258
Travel : 1.381011097410604
Food & Drink : 1.060419235511714
Weather : 0.7644882860665845
Reference : 0.4932182490752158
Navigation : 0.4932182490752158
Business : 0.4932182490752158
Catalogs : 0.22194821208384713
Medical : 0.19728729963008632


As we can see, more than half of the free apps in our ios dataset are games (55.7%). The next four most common genres are entertainment (8.2%), photo & video (4.1%), social networking (3.5%), and education (3.3%). Adding up the first four genres, we see that 71.5% of the apps belong to 'fun' categories. However, a large share of apps in a genre does not necessarily mean a large number of users for those apps.

Let's switch our focus to the android store for now. There are two columns in the android dataset that describe the app groups: `Category` and `Genres`. In the code below, we look at the frequency tables for both columns.

In [28]:
print('Frequency table of the genres in android dataset (in percentage):\n')
display_table(android_free, 9)

Frequency table of the genres in android dataset (in percentage):

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries

In [29]:
print('Frequency table of the categories in android dataset (in percentage):\n')
display_table(android_free,1)

Frequency table of the categories in android dataset (in percentage):

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS

Unlike the App Store, in which most apps fall into the 'fun' categories, the Google Play Store shows a more even spread of apps across tools, entertainment, education, and games. This is an interesting observation that may help developers choose the platform on which to publish their apps.

## 3.2. User Downloads Analysis

In this section, we analyze which app genres attract the most user downloads. For the android dataset, this information is in the `Installs` column. The ios dataset does not provide it explicitly, so as a substitute we look at which genres receive the most ratings, using the `rating_count_tot` column.

### 3.2.1. On the Google Play Store

Let us first focus on the Google Play Store dataset. We store the result in a dictionary in which each key is a category name and the corresponding value is the average number of installs received by the apps in that category. To this end, we write two functions, `android_genre_avg_install()` and `display_android_genre_rating_table()`, to calculate and display the result, respectively. Since the number of installs is only given in open-ended buckets such as `100,000+` or `50,000+`, we only get a rough idea of how many installs each app actually has. In our analysis, we treat `100,000+` as 100,000 installs, `50,000+` as 50,000 installs, and so on.
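
The conversion described above can be isolated in a small helper. This is a minimal sketch; the name `parse_installs` is an assumption for illustration. Because a label such as `100,000+` means "at least 100,000", the parsed value is a lower bound, so averages computed from it will tend to underestimate the true number of installs.

```python
def parse_installs(label):
    """Convert an install label such as '100,000+' to its lower bound."""
    return int(label.replace(',', '').replace('+', ''))


print(parse_installs('100,000+'))        # 100000
print(parse_installs('1,000,000,000+'))  # 1000000000
```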

In [31]:
android_category_freq = freq_table(android_free,1)

def android_genre_avg_install(dataset, index):
    android_genre_install = {}
    for category in android_category_freq:
        total = 0
        len_category = 0
        for app in dataset:
            category_app = app[1]
            if category_app == category:
                app_install = app[index]
                app_install = app_install.replace('+','')
                app_install = app_install.replace(',','')
                app_install = float(app_install)
                total = total + app_install
                len_category = len_category + 1
        android_genre_install[category] = total / len_category
    return android_genre_install

def display_android_genre_rating_table(dataset, index):
    table = android_genre_avg_install(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Average number of installations per category of apps on the Google Play Store:\n')
display_android_genre_rating_table(android_free, 5)

Average number of installations per category of apps on the Google Play Store:

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LI
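
The nested loops in `android_genre_avg_install()` scan the whole dataset once per category. As a sketch of an alternative, the per-category totals and counts can be accumulated in a single pass. `average_by_group` and `installs_to_float` are hypothetical helpers, and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index, parse=float):
    """Return {group: mean of parsed values} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += parse(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def installs_to_float(label):
    # '1,000+' -> 1000.0, following the convention used in this section
    return float(label.replace(',', '').replace('+', ''))


sample = [
    ['App A', 'TOOLS', '1,000+'],
    ['App B', 'TOOLS', '3,000+'],
    ['App C', 'SOCIAL', '500+'],
]
averages = average_by_group(sample, 1, 2, parse=installs_to_float)
for category, mean in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(category, ':', mean)
```

With the column positions used in this notebook, the android call would be `average_by_group(android_free, 1, 5, parse=installs_to_float)`. The same helper would also cover the average number of ratings per genre in the ios dataset, as `average_by_group(ios_free, 11, 5)`.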

We observe that the five categories with the highest average number of installs are communication, video players, social, photography, and productivity. This suggests that android users mostly use apps to communicate and socialize, and then for entertainment. Next, we take a closer look at the communication apps with the most installs.

In [32]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


We see that 11 communication apps have at least 500 million installs. To see whether these 11 apps dominate the average for the communication category, we recompute the average after removing them.

In [36]:
android_under_500 = []

for app in android_free:
    no_installs = app[5]
    no_installs = no_installs.replace(',', '')
    no_installs = no_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(no_installs) < 500000000):
        android_under_500.append(float(no_installs))
        
sum(android_under_500) / len(android_under_500)

9191689.13405797

Compared to the average number of installs before removing the 11 apps, which is 38,456,119, the new average of 9,191,689 is more than 4 times smaller (a ratio of about 4.2). This indicates that the majority of downloads in this category go to a few very popular apps.
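
The comparison above can be packaged into a small helper that also reports the median, which is less sensitive to a few very large values than the mean. This is a sketch on made-up numbers; `skew_summary` is a hypothetical name, and it assumes that at least one value falls below the cap.

```python
from statistics import mean, median


def skew_summary(installs, cap):
    """Compare the mean of all values with the mean of those below cap."""
    below = [value for value in installs if value < cap]
    return {
        'mean_all': mean(installs),
        'mean_below_cap': mean(below),  # raises StatisticsError if empty
        'median_all': median(installs),
        'n_at_or_above_cap': len(installs) - len(below),
    }


# Made-up install counts: one giant app and four much smaller ones.
sample = [1_000_000_000, 1_000_000, 500_000, 100_000, 10_000]
print(skew_summary(sample, cap=500_000_000))
```

In the sample, a single giant app pulls the overall mean far above both the median and the mean of the remaining apps, which is the pattern discussed for the communication category.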

We would like to see whether the same pattern applies to the next four categories: video players, social, photography, and productivity. Since these categories contain fewer apps, we lower the threshold and separate the apps with at least 100 million installs from the rest.

In [48]:
android_video = 0
android_social = 0
android_photo = 0
android_productivity = 0
android_video_under_100 = []
android_social_under_100 = []
android_photo_under_100 = []
android_productivity_under_100 = []

for app in android_free:
    no_installs = app[5]
    no_installs = no_installs.replace(',', '')
    no_installs = no_installs.replace('+', '')
    if (app[1] == 'VIDEO_PLAYERS'):
        android_video += 1
        if (float(no_installs) < 100000000):
            android_video_under_100.append(float(no_installs))
    elif (app[1] == 'SOCIAL'):
        android_social += 1
        if (float(no_installs) < 100000000):
            android_social_under_100.append(float(no_installs))
    elif (app[1] == 'PHOTOGRAPHY'):
        android_photo += 1
        if (float(no_installs) < 100000000):
            android_photo_under_100.append(float(no_installs))
    elif (app[1] == 'PRODUCTIVITY'):
        android_productivity += 1
        if (float(no_installs) < 100000000):
            android_productivity_under_100.append(float(no_installs))
        
print('Number of apps with at least 100 million installs in video players:', android_video - len(android_video_under_100))
print('Average installs of the remaining apps in video players:', sum(android_video_under_100) / len(android_video_under_100))
print('\n')
print('Number of apps with at least 100 million installs in social:', android_social - len(android_social_under_100))
print('Average installs of the remaining apps in social:', sum(android_social_under_100) / len(android_social_under_100))
print('\n')
print('Number of apps with at least 100 million installs in photography:', android_photo - len(android_photo_under_100))
print('Average installs of the remaining apps in photography:', sum(android_photo_under_100) / len(android_photo_under_100))
print('\n')
print('Number of apps with at least 100 million installs in productivity:', android_productivity - len(android_productivity_under_100))
print('Average installs of the remaining apps in productivity:', sum(android_productivity_under_100) / len(android_productivity_under_100))

Number of apps with at least 100 million installs in video players: 9
Average installs of the remaining apps in video players: 5544878.133333334


Number of apps with at least 100 million installs in social: 13
Average installs of the remaining apps in social: 3084582.5201793723


Number of apps with at least 100 million installs in photography: 19
Average installs of the remaining apps in photography: 7670532.29338843


Number of apps with at least 100 million installs in productivity: 22
Average installs of the remaining apps in productivity: 3379657.318885449


For the video players and social categories, we observe the same trend as in the `Communication` category: the majority of the downloads go to a few apps. In other words, the markets for these categories are dominated by a few 'big players'. The photography and productivity categories, however, have more apps with at least 100 million installs (19 and 22, compared with 9 and 13), which suggests that success is spread across more apps. We will compare this result to the AppStore.
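
The four near-identical branches in the cell above can be folded into one function that takes the category and the threshold as arguments. This is a sketch rather than the notebook's code: `split_at_cap` is a hypothetical helper, the column positions are the ones used in this section, and the sample rows are made up.

```python
def split_at_cap(rows, category, cap, category_index=1, installs_index=5):
    """Return (number of apps at or above cap, mean installs of the rest)."""
    below = []
    n_total = 0
    for row in rows:
        if row[category_index] != category:
            continue
        n_total += 1
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if installs < cap:
            below.append(installs)
    # None signals that no app in the category falls below the cap.
    mean_below = sum(below) / len(below) if below else None
    return n_total - len(below), mean_below


# Made-up rows with the install label at index 5, as in the android dataset.
sample = [
    ['A', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['B', 'SOCIAL', '', '', '', '1,000+'],
    ['C', 'SOCIAL', '', '', '', '3,000+'],
    ['D', 'TOOLS', '', '', '', '100+'],
]
print(split_at_cap(sample, 'SOCIAL', cap=100_000_000))
```

Calling `split_at_cap(android_free, category, 100_000_000)` in a loop over `'VIDEO_PLAYERS'`, `'SOCIAL'`, `'PHOTOGRAPHY'`, and `'PRODUCTIVITY'` would then cover all four cases with one code path.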

### 3.2.2. On the AppStore

We now look at the genres in the AppStore, sorted in descending order of the average number of user ratings per app.

In [49]:
def find_ios_genre_freq(dataset, index):    
    ios_genre_freq = freq_table(dataset, index)
    genre_avg_rating = {}
    for genre in ios_genre_freq:
        total = 0
        len_genre = 0
        for app in dataset:  # use the function argument, not the global ios_free
            genre_app = app[index]
            if genre_app == genre:
                rating_count = float(app[5])
                total = total + rating_count
                len_genre = len_genre + 1
        avg_rating = total / len_genre
        genre_avg_rating[genre] = avg_rating
    return genre_avg_rating

def display_genre_rating_table(dataset, index):
    table = find_ios_genre_freq(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print('Average number of user ratings per category of apps on the AppStore:\n')
display_genre_rating_table(ios_free, 11)

Average number of user ratings per category of apps on the AppStore:

Reference : 67447.9
Music : 56482.02985074627
Weather : 47220.93548387097
Social Networking : 32503.563380281692
Photo & Video : 27249.892215568863
Navigation : 25972.05
Travel : 20216.01785714286
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
Productivity : 19053.887096774193
Games : 18924.68896765618
Shopping : 18746.677685950413
News : 15892.724137931034
Utilities : 14010.100917431193
Finance : 13522.261904761905
Entertainment : 10822.961077844311
Lifestyle : 8978.308510638299
Book : 8498.333333333334
Business : 6367.8
Education : 6266.333333333333
Catalogs : 1779.5555555555557
Medical : 459.75


We look more closely at the `Photo & Video` and `Productivity` genres in order to compare them with the android dataset. Using thresholds close to each genre's average number of ratings, we count how many apps have at least 30k ratings in `Photo & Video` and at least 20k ratings in `Productivity`.

In [53]:
ios_photo = 0
ios_productivity = 0
ios_photo_under_100 = []
ios_productivity_under_100 = []

for app in ios_free:
    no_ratings = app[5]
    
    if (app[11] == 'Photo & Video'):
        ios_photo += 1
        if (float(no_ratings) < 30000):
            ios_photo_under_100.append(float(no_ratings))
    elif (app[11] == 'Productivity'):
        ios_productivity += 1
        if (float(no_ratings) < 20000):
            ios_productivity_under_100.append(float(no_ratings))
        
print('Number of apps with at least 30k ratings in Photo & Video:', ios_photo - len(ios_photo_under_100))
print('Average ratings of the remaining Photo & Video apps:', sum(ios_photo_under_100) / len(ios_photo_under_100))
print('\n')
print('Number of apps with at least 20k ratings in Productivity:', ios_productivity - len(ios_productivity_under_100))
print('Average ratings of the remaining Productivity apps:', sum(ios_productivity_under_100) / len(ios_productivity_under_100))

Number of apps with at least 30k ratings in Photo & Video: 21
Average ratings of the remaining Photo & Video apps: 4243.232876712329


Number of apps with at least 20k ratings in Productivity: 15
Average ratings of the remaining Productivity apps: 4716.0


We see that 21 apps in the `Photo & Video` genre and 15 apps in the `Productivity` genre account for most of the user ratings. However, these relatively large numbers also indicate that neither market is controlled by a small handful of 'big players' or 'giants'. Therefore, developers who aim to build new apps that attract many users on both platforms may find `Photo & Video` and `Productivity` to be good choices, since these categories attract a large number of users and do not appear to be saturated at present.

# 4. Concluding Remarks

In this project, we analyzed app data from the AppStore and the Google Play Store to find the most profitable app profile for developers to build. In particular, we focused on the market for free English-language apps. Our analysis suggests that `Photo & Video` and `Productivity` are favorable categories for new apps, because they tend to attract a large number of users and are not yet dominated by a few 'big players'.