# Mobile apps data analysis

In this project, we inspect mobile apps available on the App Store and Google Play. Our goal is to find out which types of apps are the most popular, so that we can use this information to attract more users to our free app.

In [1]:
from csv import reader
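def open_dataset(dataset_file):
    # The with-block closes the file once it has been read. The encoding is
    # stated explicitly (the files are assumed to be UTF-8) so the result does
    # not depend on the platform default.
    with open(dataset_file, encoding="utf8", newline="") as opened_file:
        apps_data = list(reader(opened_file))
    # Data rows first, header row second.
    return apps_data[1:], apps_data[0]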

In [2]:
android_data, android_header = open_dataset("/home/gutz22/Downloads/data_sets/googleplaystore.csv")
ios_data, ios_header = open_dataset("/home/gutz22/Downloads/data_sets/AppleStore.csv")

First, we created a function that returns the list of data rows and the header of a dataset, called it for both Android and iOS, and stored the results in the variables we are going to work with.
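As a small illustration of the header/rows split, the sketch below parses an in-memory CSV instead of a file on disk. The sample text and the helper name `split_header` are made up for this example; they are not part of the project's data or code.

```python
from csv import reader
from io import StringIO

def split_header(lines):
    """Return (rows, header) from any iterable of CSV lines."""
    rows = list(reader(lines))
    return rows[1:], rows[0]

# Hypothetical two-app sample standing in for a real CSV file.
sample = StringIO(
    'App,Category,Rating\n'
    '"Photo Editor, Pro",ART_AND_DESIGN,4.1\n'
    'Notes,TOOLS,4.5\n'
)
data, header = split_header(sample)
print(header)
print(data[0])  # the quoted comma stays inside the app name
```

Using the `csv` reader rather than `str.split(",")` matters here because app names may themselves contain commas.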

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))
        print("\n")

In [4]:
explore_data(android_data, 0, 3, rows_and_columns=True)
explore_data(ios_data, 0, 3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+

Second, we explored the data by printing the first few rows of each dataset, along with its number of rows and columns.


In [5]:
print(android_header, "\n\n", ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In this step, the headers are shown to give a better understanding of what each column holds. In case you are still lost, here are the links to the documentation of the sources the data came from.

- [Google Play data documentation](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)
- [Apple Store data documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

These links give a thorough explanation of what each header means.

Next, it's time to create a function that checks if there are any missing values for each data point.

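def check_missing_values(dataset, dataset_header):
    # enumerate gives each row's own position; dataset.index(row) would
    # return the first equal row and rescan the list on every hit.
    for index, row in enumerate(dataset):
        if len(dataset_header) != len(row):
            print("Row with a missing value:")
            print(row)
            print(index)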

In [7]:
check_missing_values(android_data, android_header)

Row with a missing value:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


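When we pass the Google Play data, we get one row with a missing value: it has 12 fields instead of 13, and the `Category` field is the one that is absent, so every later value is shifted one column to the left. As we don't want a malformed row polluting our data, the solution here is to delete it.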
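When more than one row is malformed, deleting by index needs some care, because each deletion shifts the rows after it. The sketch below is one way to handle that, using made-up rows and a hypothetical helper name `malformed_row_indices`.

```python
def malformed_row_indices(dataset, header):
    """Indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

# Hypothetical sample data, not taken from the real dataset.
header = ['App', 'Category', 'Rating']
rows = [
    ['Notes', 'TOOLS', '4.5'],
    ['Broken Frame', '1.9'],      # 'Category' is missing
    ['Maps', 'TRAVEL', '4.2'],
]

bad = malformed_row_indices(rows, header)
# Delete from the end so that earlier indices stay valid.
for i in reversed(bad):
    del rows[i]
print(bad, len(rows))
```

Collecting the indices first also makes the cell safe to re-run: a second run finds nothing left to delete.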

In [None]:
del android_data[10472]

In [8]:
check_missing_values(ios_data, ios_header)

Now we can see that, unlike the Android data, the App Store data has no rows with missing elements, so it remains untouched.

In [9]:
def has_duplicates(dataset, name_index):
    duplicate_apps = []
    unique_apps = []
    for row in dataset:
        name = row[name_index]
        if name not in unique_apps:
            unique_apps.append(name)
        else:
            duplicate_apps.append(name)
    return duplicate_apps

Continuing the data cleansing, we created a function that returns the names of apps that have more than one entry in a given dataset. A name appears in the result once for every repeated entry.
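For comparison, the same question can be answered with `collections.Counter`, which avoids scanning a list for every row. This is only a sketch on made-up rows; `duplicate_names` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
from collections import Counter

def duplicate_names(dataset, name_index):
    """Map each repeated name to the number of entries it has."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n for name, n in counts.items() if n > 1}

# Hypothetical rows: [name, reviews]
rows = [['Box', '10'], ['Slack', '7'], ['Box', '12'], ['Box', '12']]
dups = duplicate_names(rows, 0)
print(dups)

# Number of surplus entries, the figure has_duplicates reports via len().
extra_entries = sum(n - 1 for n in dups.values())
print(extra_entries)
```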

In [10]:
android_duplicates = has_duplicates(android_data, 0)
print("Examples of android duplicate apps:\n", android_duplicates[:3])
print("\nNumber of android duplicate apps:", len(android_duplicates))

Examples of android duplicate apps:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']

Number of android duplicate apps: 1181


In [11]:
ios_duplicates = has_duplicates(ios_data, 1)
print("All IOS duplicate apps:\n", ios_duplicates)

All IOS duplicate apps:
 ['Mannequin Challenge', 'VR Roller Coaster']


In [12]:
print("IOS duplicate app name entries:")
for row in ios_data:
    name = row[1]
    if (name == "Mannequin Challenge") or (name == "VR Roller Coaster"):
        print(row, "\n")

IOS duplicate app name entries:
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1'] 

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1'] 

['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1'] 

['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1'] 



This snippet prints the rows of the iOS dataset that share an application name.
From them, we can infer that each application has an outdated entry: the values in the `ver` column differ between the two rows, and `rating_count_tot` grew between those versions.

In [13]:
unique_id = []
duplicate_id = []
for row in ios_data:
    app_id = row[0]  # avoid shadowing the built-in id()
    if app_id not in unique_id:
        unique_id.append(app_id)
    else:
        duplicate_id.append(app_id)
print("Number of IOS duplicate ids: ", len(duplicate_id))

Number of IOS duplicate ids:  0


Following that line of thought, we would expect the `id` column to contain duplicates as well. Instead, every id is unique, which suggests either an inconsistency in the App Store data or that these entries are different apps sharing a name.

This is left as a curiosity, since it does not affect our analysis: we are going to remove duplicate entries from both datasets by name, regardless of whether the ids are unique.

In [14]:
def dict_nreviews_max(dataset, name_index, nreviews_index):
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        nreviews = row[nreviews_index]
        if nreviews[-1] == "M":
            nreviews = nreviews.replace('M', '')
        nreviews = float(nreviews)
        if name in reviews_max and reviews_max[name] < nreviews:
            reviews_max[name] = nreviews
        elif name not in reviews_max:
            reviews_max[name] = nreviews
    return reviews_max

In [15]:
android_nreviews_max = dict_nreviews_max(android_data, 0, 3)
print("Length of the Android dictionary with the maximum number of reviews: ", len(android_nreviews_max))

Length of the Android dictionary with the maximum number of reviews:  9660


In [16]:
ios_nreviews_max = dict_nreviews_max(ios_data, 1, 5)
print("\nLength of the IOS dictionary with the maximum number of reviews: ", len(ios_nreviews_max))


Length of the IOS dictionary with the maximum number of reviews:  7195


To eliminate every app that has more than one entry, we built a dictionary that maps each application name to its maximum number of reviews (`Reviews`) or ratings (`rating_count_tot`). We chose this criterion because, as shown a few steps above, the entry with the higher count is the more recent one.

In [17]:
def clean_duplicates(dataset, name_index, nreviews_index, dataset_nreviews_max):
    clean_dataset = []
    already_added = []
    for row in dataset:
        name = row[name_index]
        nreviews = row[nreviews_index]
        if nreviews[-1] == "M":
            nreviews = nreviews.replace('M', '')
        nreviews = float(nreviews)
        if nreviews == dataset_nreviews_max[name] and name not in already_added:
            clean_dataset.append(row)
            already_added.append(name)
    return clean_dataset

In [18]:
android_clean_1 = clean_duplicates(android_data, 0, 3, android_nreviews_max)
print("Length of the Android list after removing duplicate apps: ", len(android_clean_1))

Length of the Android list after removing duplicate apps:  9660


In [19]:
ios_clean_1 = clean_duplicates(ios_data, 1, 5, ios_nreviews_max)
print("\nLength of the IOS list after removing duplicate apps: ", len(ios_clean_1))



Length of the IOS list after removing duplicate apps:  7195


After that, it was just a matter of creating a function that takes this dictionary as a parameter and keeps a single row per app: the one whose count matches the maximum.
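The two steps (find each name's maximum, then keep one matching row) can also be folded into a single pass. The sketch below is an alternative, not the method used in this notebook; it assumes the review counts are plain numeric strings, and `keep_most_reviewed` is a hypothetical name.

```python
def keep_most_reviewed(dataset, name_index, nreviews_index):
    """Keep, for every app name, the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        count = float(row[nreviews_index])
        # Strict '>' keeps the first row when two entries tie.
        if name not in best or count > float(best[name][nreviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical rows: [name, rating, reviews]
rows = [
    ['Box', '4.2', '100'],
    ['Slack', '4.4', '50'],
    ['Box', '4.3', '180'],
    ['Box', '4.3', '180'],
]
cleaned = keep_most_reviewed(rows, 0, 2)
print(cleaned)
```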

In [20]:
def is_English(app_name):
    non_english_character = 0
    for character in app_name:
        if ord(character) > 127: 
            non_english_character += 1
            if non_english_character > 3:
                return False
    return True

In [21]:
print("Testing Function")
print(is_English("Instagram"))
print(is_English("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_English("Docs To Go™ Free Office Suite"))
print(is_English("Instachat 😜"))

Testing Function
True
False
True
True


In [22]:
def clean_non_English(dataset, name_index):
    clean_dataset = []
    for row in dataset:
        name = row[name_index]
        if is_English(name):
            clean_dataset.append(row)
    return clean_dataset

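Since the app being developed targets an English-speaking audience, we remove any app whose name does not look English, so that the data better matches our objective. A name is treated as non-English when it contains more than three characters outside the ASCII range, which leaves room for symbols such as ™ and emoji.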
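The same rule can be written compactly with the threshold exposed as a parameter, which makes it easy to experiment with stricter or looser filters. This is a sketch; `looks_english` and `max_non_ascii` are names introduced only for this example.

```python
def looks_english(name, max_non_ascii=3):
    """True when the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(looks_english("Instagram"))
print(looks_english("Docs To Go™ Free Office Suite"))
print(looks_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(looks_english("Instachat 😜", max_non_ascii=0))  # strict: any symbol rejects
```

With the default threshold it agrees with `is_English` on the four test names used in the testing cell.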

In [23]:
android_clean_2 = clean_non_English(android_clean_1, 0)
print("Length of the Android list after removing non English apps:", len(android_clean_2))

Length of the Android list after removing non English apps: 9615


In [24]:
ios_clean_2 = clean_non_English(ios_clean_1, 1)
print("\nLength of the IOS list after removing non English apps: ", len(ios_clean_2))


Length of the IOS list after removing non English apps:  6181


In [25]:
def clean_non_free(dataset, price_index):
    clean_dataset = []
    for row in dataset:
        price = row[price_index]
        if price == "0" or price == "0.0":
            clean_dataset.append(row)
    return clean_dataset

Another thing to keep in mind is that our app is free, so we made a function that filters out all non-free apps.
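Comparing against the strings `"0"` and `"0.0"` works for these two files. A slightly more defensive variant parses the price as a number instead. The sketch below assumes paid prices may carry a leading `$` (for example `$4.99`); `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """True when the price parses to zero; unparseable text counts as not free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

for price in ['0', '0.0', '$4.99', '1.99', 'Everyone']:
    print(price, is_free(price))
```

Treating unparseable text as not free means a malformed row is dropped by the filter rather than silently kept.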

In [26]:
android_cleaned = clean_non_free(android_clean_2, 7)
print("Length of the Android list after removing non free apps:", len(android_cleaned))

Length of the Android list after removing non free apps: 8864


In [27]:
ios_cleaned = clean_non_free(ios_clean_2, 4)
print("Length of the IOS list after removing non free apps:", len(ios_cleaned))

Length of the IOS list after removing non free apps: 3220


At this point, we have finished cleansing the data. We now move on to finding the most common app genres, to increase our market knowledge and build a better profile of our desired audience.

If the app succeeds on Google Play, it will also be released on the App Store.

In [28]:
def dict_popular_genre_or_category_percentage(dataset, genre_or_category_index):
    genres_or_category_freq = {}
    for row in dataset:
        genre_or_category = row[genre_or_category_index]
        if genre_or_category in genres_or_category_freq:
            genres_or_category_freq[genre_or_category] += 1
        else:
            genres_or_category_freq[genre_or_category] = 1
    for key in genres_or_category_freq:
        genres_or_category_freq[key] = (genres_or_category_freq[key] / len(dataset)) * 100
    return genres_or_category_freq

In [29]:
android_genres_freq_percentage = dict_popular_genre_or_category_percentage(android_cleaned, 9) 
print(f"Number of genres on the Android cleaned dataset: {len(android_genres_freq_percentage)}\n")


Number of genres on the Android cleaned dataset: 114



In [30]:
ios_genres_freq_percentage = dict_popular_genre_or_category_percentage(ios_cleaned, 11)
print(f"\nNumber of genres on the IOS cleaned dataset: {len(ios_genres_freq_percentage)}")


Number of genres on the IOS cleaned dataset: 23


In [32]:
def display_table_ordered(dataset, index):
    table = dict_popular_genre_or_category_percentage(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(f"{entry[1]} : {entry[0]:.2f}")

In [33]:
display_table_ordered(android_cleaned, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.70
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.10
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.80
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.40
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.80
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.60
Art & Design : 0.60
Parenting : 0.50
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.20
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.1

Glancing at the most common genres among free English apps on the Google Play Store, we notice that they are concentrated on utility and learning, followed by entertainment, which is predominantly gaming apps.
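A percentage table like the one above can also be produced with `collections.Counter`, whose `most_common()` already returns the entries sorted by count. This is a sketch on made-up rows, with `percentage_table` as a hypothetical name.

```python
from collections import Counter

def percentage_table(dataset, index):
    """List of (value, percentage of rows) sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(key, count / total * 100) for key, count in counts.most_common()]

# Hypothetical rows: [name, genre]
rows = [['a', 'Tools'], ['b', 'Tools'], ['c', 'Education'], ['d', 'Tools']]
table = percentage_table(rows, 1)
for genre, percentage in table:
    print(f"{genre} : {percentage:.2f}")
```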

In [34]:
display_table_ordered(android_cleaned, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.90
PRODUCTIVITY : 3.89
FINANCE : 3.70
MEDICAL : 3.53
SPORTS : 3.40
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.80
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.40
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.80
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.60


Still on Google Play, now looking at the `Category` column: although it does not differ sharply from the `Genres` view, it strengthens the previous analysis with a more concise picture.

For confirmation, we verify that the leading category, `FAMILY`, is filled mostly with gaming apps. That means our preceding hypothesis was not entirely correct, although a large percentage of the applications still belongs to the "utility" set.

In [35]:
display_table_ordered(ios_cleaned, -5)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


The App Store, on the other hand, is clearly dominated by gaming apps, which make up 58% of the free English apps in our dataset, followed by `Entertainment` and `Photo & Video`. This suggests that most apps there are built for leisure.

We have to keep in mind, though, that a large number of apps in a particular genre does not mean the genre has a correspondingly large number of users. A genre with fewer apps may still attract more users.

In [36]:
display_table_ordered(android_cleaned, 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.20
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.30
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


To start our conclusions on Google Play, we examined the percentage of apps that belongs to each installation range.

We can see that each range covers a wide span of values: an app listed as `100,000+` may have anywhere from 100,000 up to 500,000 installs. The popularity measure is therefore not very precise, but that is not a problem. Our goal is only to get an idea of which app genres attract the most users, so we won't modify this data.

Depending on our business intentions, a successful app could be one with around a hundred thousand installs, but a good benchmark would be around one million installs.
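To do arithmetic on these ranges, each label has to be turned into a number first. A minimal sketch of that conversion, treating every label as its lower bound (`parse_installs` is a hypothetical helper name):

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to its lower bound."""
    return int(text.replace(',', '').replace('+', ''))

for label in ['1,000,000+', '500+', '0+', '0']:
    print(label, '->', parse_installs(label))
```

Because the lower bound is used, any average built on these numbers underestimates the true installs, consistently across categories.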

In [37]:
def table_most_populars_genres_or_categories_by_avg_installs_or_rating_count_tot(dataset, genre_or_category_index, installs_or_rating_count_tot_index):
    # Accumulate [total, number of apps] per genre or category in one pass.
    totals = {}
    for row in dataset:
        genre_or_category = row[genre_or_category_index]
        # Android installs look like '1,000,000+', while iOS rating counts are
        # plain digits, so stripping '+' and ',' is harmless for both stores
        # and the function no longer depends on the global android_cleaned.
        value = row[installs_or_rating_count_tot_index]
        value = value.replace('+', '').replace(',', '')
        entry = totals.setdefault(genre_or_category, [0.0, 0])
        entry[0] += float(value)
        entry[1] += 1
    for genre_or_category, (total, count) in totals.items():
        print(f"{genre_or_category} : {total / count:.2f}")

In [38]:
table_most_populars_genres_or_categories_by_avg_installs_or_rating_count_tot(android_cleaned, 1, 5)

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.60
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.40
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.30
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.20
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


This table shows every `Category` on Google Play and its average number of installs, so that we can find the most promising area for our app.

At first glance, categories like `COMMUNICATION`, `VIDEO_PLAYERS`, and `SOCIAL` look like the best bets for the kind of app we are going to make. However, a closer look shows that in each of these areas just a few apps raise the average significantly. That means those markets have already been taken by major companies.
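One quick way to see how much a few giants pull an average up is to compare the mean with the median. The install counts below are invented for illustration and do not come from the dataset.

```python
from statistics import mean, median

# Hypothetical category: one giant app and four ordinary ones.
installs = [1_000_000_000, 500_000, 100_000, 50_000, 10_000]

category_mean = mean(installs)
category_median = median(installs)
print(f"mean:   {category_mean:,.0f}")
print(f"median: {category_median:,.0f}")
```

When the mean is orders of magnitude above the median, the average says more about the largest app than about a typical one.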

In [39]:
def show_big_installs_android(category):
    for row in android_cleaned:
        if row[1] == category and (row[5] == '1,000,000,000+'
                                          or row[5] == '500,000,000+'
                                          or row[5] == '100,000,000+'):
            print(row[0], ':', row[5])

In [40]:
show_big_installs_android("COMMUNICATION")
print("\n")
show_big_installs_android("VIDEO_PLAYERS")
print("\n")
show_big_installs_android("SOCIAL")
print("\n")


WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

For illustration, those are the apps with at least a hundred million installs that belong to the three categories with the highest average installs.

Instead of discarding these possibilities right away, we can check whether the mean number of installs remains considerable without these apps.

In [41]:
def show_android_avg_installs_below_100Million(category):
    under_100_m = []
    for row in android_cleaned:
        installs = row[5]
        installs = installs.replace(',', '')
        installs = installs.replace('+', '')
        if (row[1] == category) and (float(installs) < 100000000):
            under_100_m.append(float(installs))
    print(f"{sum(under_100_m) / len(under_100_m):.2f}")

In [42]:
show_android_avg_installs_below_100Million("COMMUNICATION")
print("\n")
show_android_avg_installs_below_100Million("VIDEO_PLAYERS")
print("\n")
show_android_avg_installs_below_100Million("SOCIAL")
print("\n")

3603485.39


5544878.13


3084582.52




As a result, the average installs for these three categories, when computed only over apps below one hundred million installs, still look pretty solid. But that's not enough, so let's repeat the check with a cap of ten million installs for all categories.

In [45]:
def show_all_android_avg_category_installs_below_10Million():
    table_percentage = dict_popular_genre_or_category_percentage(android_cleaned, 1)
    for category in table_percentage:
        total = 0
        len_category = 0
        for row in android_cleaned:
            category_app = row[1]
            if category_app == category:
                installs = row[5]
                installs = installs.replace(',', '')
                installs = installs.replace('+', '')
                if float(installs) < 10000000:
                    total += float(installs)
                    len_category += 1
        print(f"{category}  :  {(total / len_category):.2f}")     

In [58]:
show_all_android_avg_category_installs_below_10Million()

ART_AND_DESIGN  :  446559.62
AUTO_AND_VEHICLES  :  413500.76
BEAUTY  :  330712.50
BOOKS_AND_REFERENCE  :  457134.10
BUSINESS  :  302072.58
COMICS  :  647613.89
COMMUNICATION  :  747172.39
DATING  :  387992.08
EDUCATION  :  1145789.47
ENTERTAINMENT  :  1597500.00
EVENTS  :  253542.22
FINANCE  :  559626.62
FOOD_AND_DRINK  :  842667.54
HEALTH_AND_FITNESS  :  767984.95
HOUSE_AND_HOME  :  694153.84
LIBRARIES_AND_DEMO  :  287447.62
LIFESTYLE  :  425648.39
GAME  :  1130890.65
FAMILY  :  588878.50
MEDICAL  :  120550.62
SOCIAL  :  733307.99
SHOPPING  :  961223.18
PHOTOGRAPHY  :  1142753.47
SPORTS  :  831006.50
TRAVEL_AND_LOCAL  :  889103.94
TOOLS  :  522021.91
PERSONALIZATION  :  639501.56
PRODUCTIVITY  :  624106.60
PARENTING  :  376684.39
WEATHER  :  1057693.33
VIDEO_PLAYERS  :  629225.61
NEWS_AND_MAGAZINES  :  527661.88
MAPS_AND_NAVIGATION  :  936916.18


At this point, we can confidently say that those three areas are unlikely to be our pick: their average installs have dropped sharply compared to other categories, and there is a higher chance of our app being crushed by the competition.

Even so, any conclusion about the other categories would be premature, because the average under ten million installs doesn't tell us how many apps we would be competing with in each category. For instance, an ideal place for our app could be a category without many installs, as long as there is room to grow: a potential market exists, there are not many competitors, and no single app holds a monopoly the way Google Maps does in `MAPS_AND_NAVIGATION`.

That limitation holds if we look only at the last few snippets. However, if we look back at one of our earlier results, we already have this information in our hands.

In [48]:
display_table_ordered(android_cleaned, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.90
PRODUCTIVITY : 3.89
FINANCE : 3.70
MEDICAL : 3.53
SPORTS : 3.40
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.80
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.40
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.80
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.60


Recalling this table, the values are the percentage of apps in the cleaned Google Play dataset that belong to each category, so they tell us how crowded each category is.

From that, when we cross the last two tables and look for categories that are not too crowded and still have a decent installation mean, some good candidates emerge: `WEATHER`, `COMICS`, `EDUCATION`, and `BOOKS_AND_REFERENCE`.
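Crossing two tables like this can be expressed as a filter over a combined dictionary. The sketch below copies a handful of values from the two tables in this notebook; the two cut-offs are hypothetical, chosen only to illustrate the idea, and are not how the candidates were selected.

```python
# Category: (share of apps in %, average installs among apps under 10 million)
categories = {
    "FAMILY": (18.91, 588878.50),
    "GAME": (9.72, 1130890.65),
    "TOOLS": (8.46, 522021.91),
    "BOOKS_AND_REFERENCE": (2.14, 457134.10),
    "EDUCATION": (1.16, 1145789.47),
    "WEATHER": (0.80, 1057693.33),
    "COMICS": (0.62, 647613.89),
}

# Hypothetical cut-offs, for illustration only.
MAX_SHARE = 2.5
MIN_AVG_INSTALLS = 450_000

candidates = [
    name
    for name, (share, avg_installs) in categories.items()
    if share < MAX_SHARE and avg_installs > MIN_AVG_INSTALLS
]
print(candidates)
```

Other categories in the full tables would also pass cut-offs like these, so a filter of this kind only narrows the list and judgment is still needed for the final pick.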

In [65]:
def table_apps_installs_or_rating_count_tot(dataset, app_name_index, genre_or_category_name, genre_or_category_index, installs_or_rating_count_tot_index):
    for row in dataset:
        if row[genre_or_category_index] == genre_or_category_name:
            print(row[app_name_index], " : ", row[installs_or_rating_count_tot_index])

In [60]:
table_apps_installs_or_rating_count_tot(android_cleaned, 0, "WEATHER", 1, 5)
print('\n')
table_apps_installs_or_rating_count_tot(android_cleaned, 0, "COMICS", 1, 5)
print("\n")
table_apps_installs_or_rating_count_tot(android_cleaned, 0, "EDUCATION", 1, 5)
print("\n")
table_apps_installs_or_rating_count_tot(android_cleaned, 0, "BOOKS_AND_REFERENCE", 1, 5)

The Weather Channel: Rain Forecast & Storm Alerts  :  50,000,000+
Weather forecast  :  1,000,000+
AccuWeather: Daily Forecast & Live Weather Reports  :  50,000,000+
Live Weather Pro  :  10,000+
Weather by WeatherBug: Forecast, Radar & Alerts  :  10,000,000+
weather - weather forecast  :  1,000,000+
MyRadar NOAA Weather Radar  :  10,000,000+
SMHI Weather  :  1,000,000+
Free live weather on screen  :  1,000,000+
Weather Radar Widget  :  1,000,000+
Weather –Simple weather forecast  :  10,000,000+
Weather Crave  :  5,000,000+
Klara weather  :  500,000+
Yahoo Weather  :  10,000,000+
Real time Weather Forecast  :  1,000,000+
METEO FRANCE  :  5,000,000+
APE Weather ( Live Forecast)  :  5,000,000+
Live Weather & Daily Local Weather Forecast  :  1,000,000+
Weather  :  10,000,000+
Rainfall radar - weather  :  5,000,000+
Yahoo! Weather for SH Forecast for understanding the approach of rain clouds Free  :  1,000,000+
The Weather Network  :  5,000,000+
Klart.se - Sweden's best weather  :  1,000,000

For the time being, we are going to keep these candidates in mind, with the exception of `WEATHER`.

The reason we are setting it aside is that our main goal is to build a free app. For that to be profitable, we need to keep our users engaged for as long as possible, which generates more revenue from in-app advertisements, and a weather app is usually opened only for a quick check.

In [63]:
table_most_populars_genres_or_categories_by_avg_installs_or_rating_count_tot(ios_cleaned, 11, 5)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22812.92
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.80
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.90
Book : 39758.50
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.00
Medical : 612.00


In our final approach to deciding what type of app to build, the list above shows every genre and its average number of ratings on the App Store.

As in the Android context, the drawback of using this criterion for popularity is that most of these genres are already dominated by big companies that hold the vast majority of the user base. Competing in these areas may therefore take a lot of resources to attract an acceptable number of users, so it is unlikely to be sustainable.

In [59]:
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Navigation", -5, 5)
print('\n')
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Social Networking", -5, 5)
print('\n')
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Music", -5, 5)

Waze - GPS Navigation, Maps & Real-time Traffic  :  345046
Google Maps - Navigation & Transit  :  154911
Geocaching®  :  12811
CoPilot GPS – Car Navigation & Offline Maps  :  3582
ImmobilienScout24: Real Estate Search in Germany  :  187
Railway Route Search  :  5


Facebook  :  2974676
Pinterest  :  1061624
Skype for iPhone  :  373519
Messenger  :  351466
Tumblr  :  334293
WhatsApp Messenger  :  287589
Kik  :  260965
ooVoo – Free Video Call, Text and Voice  :  177501
TextNow - Unlimited Text + Calls  :  164963
Viber Messenger – Text & Call  :  164249
Followers - Social Analytics For Instagram  :  112778
MeetMe - Chat and Meet New People  :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos  :  90414
InsTrack for Instagram - Analytics Plus More  :  85535
Tango - Free Video Call, Voice and Chat  :  75412
LinkedIn  :  71856
Match™ - #1 Dating App.  :  60659
Skype for iPad  :  60163
POF - Best Dating App for Conversations  :  52642
Timehop  :  49510
Find My Family, Friends & iPhone 

As already shown for Google Play, the App Store is, as expected, not very different regarding the weight of big companies. These lists are examples where a few apps clearly dominate their genres: Waze and Google Maps in `Navigation`; Facebook, Pinterest, and Skype in `Social Networking`; Pandora and Spotify in `Music`; and more.

Those apps skew the previous list of genres and average rating counts upward, making it look as if there were more successful apps than there really are.
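The `Navigation` genre is small enough to check this by hand. The sketch below uses the six rating counts printed above and measures how much of the genre's total belongs to its two largest apps; the variable names are introduced only for this example.

```python
# rating_count_tot for the six free English Navigation apps listed above.
navigation = {
    "Waze - GPS Navigation, Maps & Real-time Traffic": 345046,
    "Google Maps - Navigation & Transit": 154911,
    "Geocaching®": 12811,
    "CoPilot GPS – Car Navigation & Offline Maps": 3582,
    "ImmobilienScout24: Real Estate Search in Germany": 187,
    "Railway Route Search": 5,
}

total = sum(navigation.values())
genre_mean = total / len(navigation)
top_two = sorted(navigation.values(), reverse=True)[:2]
top_two_share = sum(top_two) / total * 100

print(f"mean ratings per app: {genre_mean:.2f}")
print(f"share held by the top two apps: {top_two_share:.1f}%")
```

The mean reproduces the `Navigation` figure from the genre table, and almost all of it comes from Waze and Google Maps.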

In [55]:
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Weather", -5, 5)
print("\n")
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Finance", -5, 5)
print("\n")
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Food & Drink", -5, 5)
print("\n")
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "News", -5, 5)
print('\n')
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Travel", -5, 5)
print('\n')
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Photo & Video", -5, 5)

The Weather Channel: Forecast, Radar & Alerts  :  495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking  :  208648
WeatherBug - Local Weather, Radar, Maps, Alerts  :  188583
MyRadar NOAA Weather Radar Forecast  :  150158
AccuWeather - Weather for Life  :  144214
Yahoo Weather  :  112603
Weather Underground: Custom Forecast & Local Radar  :  49192
NOAA Weather Radar - Weather Forecast & HD Radar  :  45696
Weather Live Free - Weather Forecast & Alerts  :  35702
Storm Radar  :  22792
QuakeFeed Earthquake Map, Alerts, and News  :  6081
Moji Weather - Free Weather Forecast  :  2333
Hurricane by American Red Cross  :  1158
Forecast Bar  :  375
Hurricane Tracker WESH 2 Orlando, Central Florida  :  203
FEMA  :  128
iWeather - World weather forecast  :  80
Weather - Radar - Storm with Morecast App  :  78
Yurekuru Call  :  53
Weather & Radar  :  37
WRAL Weather Alert  :  25
Météo-France  :  24
JaxReady  :  22
Freddy the Frogcaster's Weather Station  :  14
A

The tables above indicate, however, that if you are a medium or large firm specialized in one of these areas, with an established customer base and domain knowledge, it might be a good idea to consider creating an app, at least on the App Store, though it probably won't be free. The exceptions are `Finance` and `News`, where users tend to spend more time than in the other genres.

Since this is not the case for our company, we will continue the data exploration.

In [61]:
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Education", -5, 5)
print("\n")
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Book", -5, 5)
print("\n")
table_apps_installs_or_rating_count_tot(ios_cleaned, 1, "Reference", -5, 5)

Duolingo - Learn Spanish, French and more  :  162701
Guess My Age  Math Magic  :  123190
Lumosity - Brain Training  :  96534
Elevate - Brain Training and Games  :  58092
Fit Brains Trainer  :  46363
ClassDojo  :  35440
Memrise: learn languages  :  20383
Peak - Brain Training  :  20322
Canvas by Instructure  :  19981
ABCmouse.com - Early Learning Academy  :  18749
Quizlet: Study Flashcards, Languages & Vocabulary  :  16683
Photomath - Camera Calculator  :  16523
iTunes U  :  15801
Blackboard Mobile Learn™  :  13567
Star Chart  :  13482
Remind: Fast, Efficient School Messaging  :  9796
PBS KIDS Video  :  8651
Toca Kitchen Monsters  :  8062
Toca Hair Salon - Christmas Gift  :  8049
Edmodo  :  7197
Prodigy Math Game  :  6683
Epic! - Unlimited Books for Kids  :  6676
ChineseSkill -Learn Mandarin Chinese Language Free  :  6077
Google Classroom  :  5942
TED  :  5782
Khan Academy: you can learn anything  :  5459
Got It - Homework Help Math, Chem, Physics Solver  :  4903
PowerSchool Mobile  : 

One observation from this list is that, although the `Education` genre is not well developed on Google Play, it is much more common on the App Store. So it is not the best option, but it remains a reasonable one.

With that, we have identified a legitimate genre on which to build our product. The Google Play categories `Books and Reference` and `Comics` should be great choices for generating income after developing a free app for both the Android and iOS markets, targeting an English-speaking audience.

Accordingly, some suggestions for an app would be a comic or manga reader, a dictionary or, as an even safer option, an app built around a religious text, preferably a famous book. The app could come with extra features that complement the book: for example, highlighted quotes from the volume, a variety of daily quizzes, an audio version, a discussion forum, etc.

## Conclusion

After extracting and cleaning the App Store and Google Play data, we created functions that show the percentage of apps belonging to each category, as well as the categories with the most installs.

Next, we used this information to identify the most promising app type for both markets, and arrived at `Comics` and `Books and Reference`.

Lastly, we came up with some ideas for what the app could be, and proposed an app built around a religious book, with bonus features to keep users interested.