# Profitable apps on App Store and Google Play markets
The company's revenue comes from in-app ads, so the more users an app has, the more money the company earns.
Our goal for this project is to show the developers which types of apps are most popular and why.

We will explore the data, clean it, and then analyze it.

Let's start by exploring the data.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))                             

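Dataquest provides us with the explore_data function, which takes four parameters: the dataset, the start and end indices of the slice to display, and a rows_and_columns flag. We are aware that there are better and much shorter ways to do this, but it fits our current level of knowledge.
The function prints every row in the requested slice and, if rows_and_columns is True, the number of rows and columns in the dataset. It does not return anything.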

In [2]:
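from csv import reader

# Open both files as UTF-8 so non-ASCII app names are decoded consistently.
opened_applestore = open('AppleStore.csv', encoding='utf8')
opened_googleplaystore = open('googleplaystore.csv', encoding='utf8')

# App Store: the first row holds the column names, the rest is data.
read_AppleStore = reader(opened_applestore)
apple_data = list(read_AppleStore)
header_1 = apple_data[0]
rest_1 = apple_data[1:]

# Google Play: same layout.
read_googleplay = reader(opened_googleplaystore)
google_data = list(read_googleplay)
header_2 = google_data[0]
rest_2 = google_data[1:]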






To read in both datasets, we import the reader function from the csv module and open the files with the open() function.
The first row of each dataset holds the column names, and explore_data does not skip it, so we store the header separately from the rest of the rows.
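
As a rough sketch of a more compact way to do the same loading, the helper below wraps the steps in one function and closes the file automatically. The name load_csv is hypothetical (it is not part of the Dataquest material), and the demonstration uses a tiny made-up file rather than the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows). Hypothetical helper."""
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader; the with-block closes the file when it ends.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demonstration on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Category\nDemo App,TOOLS\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like load_csv('AppleStore.csv'), assuming the file is in the working directory.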

In [3]:
print(explore_data(google_data, 0, 3, True))
for row in google_data[1:]:
    if len(row) != len(header_2):
        print(row)
        print("\n")
        print("Index postion is:", google_data.index(row))
del(google_data[10473])
print(len(google_data))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13
None
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index postion is: 10473
10841


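The check above found one malformed row: it has 12 fields instead of 13 because the 'Category' value is missing, which shifts every later column one place to the left. It sits at index 10473 of google_data (header included), so we deleted it, leaving 10841 rows. The None in the output appears because explore_data returns nothing and its call is wrapped in print().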
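
A minimal sketch of the same check without a hard-coded index is shown below. The function name find_malformed_rows and the toy dataset are made up for illustration. Using enumerate gives the position of each row directly, whereas list.index returns the first matching row and has to scan the list again.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Made-up dataset: the first row is the header, one data row lacks a field.
toy = [
    ["App", "Category", "Rating"],
    ["A", "TOOLS", "4.1"],
    ["B", "1.9"],
    ["C", "GAME", "4.5"],
]

bad = find_malformed_rows(toy)
print(bad)

# Delete from the end so the remaining indices stay valid.
for i in reversed(bad):
    del toy[i]
print(len(toy))
```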

In [4]:
print(explore_data(apple_data, 0, 3, True))
 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
None


In [5]:
for app in google_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Using Instagram as a test case, we searched for duplicate entries and found 4 rows for the same app. This tells us that the dataset does contain duplicates.

In [6]:
duplicate_apps = []
unique_apps = []

for row in rest_2:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])
        
        

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


#### 1181 rows were duplicated.
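
Checking `name in unique_apps` against a list gets slow as the list grows. As a sketch of an alternative, the standard library's collections.Counter can count every name in one pass. The list of names below is a small made-up sample, not the real data.

```python
from collections import Counter

# Made-up sample of app names with repeats.
names = ["Box", "Slack", "Box", "Zoom", "Box", "Slack"]

counts = Counter(names)

# Every occurrence beyond the first one counts as a duplicate.
n_unique = len(counts)
n_duplicates = sum(c - 1 for c in counts.values())
repeated = {name: c for name, c in counts.items() if c > 1}

print("Unique apps:", n_unique)
print("Duplicate rows:", n_duplicates)
print("Repeated names:", repeated)
```

The two counts always add up to the number of rows examined, which is a handy sanity check.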

In [7]:
print('Expected length:', len(unique_apps))

Expected length: 9660


In [8]:
reviews_max = {}
for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))



9659


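We interpret the duplicates as older snapshots of the same app. The higher the number of reviews, the more recent the entry should be, so for each app we keep the row with the highest review count. The dictionary has 9659 entries, one fewer than the expected length of 9660 printed above. The difference comes from rest_2: it is a copy made before the malformed row was deleted, so it still counts that row (1181 + 9660 = 10841, the length of rest_2).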

In [9]:
android_clean = []
already_added = []

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
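
The two loops above can also be folded into a single pass. The sketch below is one possible way to do it, not the method used in this notebook: keep_most_reviewed is a hypothetical helper, the rows are shortened to four columns, and 'Demo App' is made up (the Instagram review counts are taken from the output shown earlier). Note that the result is ordered by the first appearance of each name, which can differ from the order of android_clean.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict comparison: on a tie, the first row seen is kept.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Demo App", "TOOLS", "4.0", "10"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

deduplicated = keep_most_reviewed(rows)
for row in deduplicated:
    print(row)
```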


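We want to work only with apps aimed at an English-speaking audience. Characters commonly used in English text fall in the ASCII range (code points 0 to 127), so a character with a code point above 127 is a hint that the name is not English. There are exceptions, though, such as the ™ sign or emojis.
To avoid discarding those apps, we treat a name as non-English only if it contains more than 3 such characters.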

In [10]:
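def string_check(string):
    # Count characters outside the ASCII range (code point above 127).
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    # Up to 3 non-ASCII characters are tolerated (™, emojis, and so on).
    return non_ascii <= 3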

print(string_check('Docs To Go™ Free Office Suite'))
print(string_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))



True
False


Let's apply this check to both datasets.

In [11]:
english_google_list = []
english_apple_list = []
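for row in android_clean:
    name = row[0]
    if string_check(name):
        english_google_list.append(row)
# apple_data still contains its header row here; 'track_name' passes the
# check, so the header is kept for now and dropped later by the price filter.
for row in apple_data:
    name = row[1]
    if string_check(name):
        english_apple_list.append(row)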
        
explore_data(english_google_list, 0, 3, True)
explore_data(english_apple_list, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram

From the beginning, we wanted information about free apps only.
Let's exclude the apps with a non-zero price.

In [12]:
android_final = []
apple_final = []
for row in english_google_list:
    price = row[7]
    if price == '0':
        android_final.append(row)
    
for row in english_apple_list:
    price = row[4]
    if price == '0.0':
        apple_final.append(row)
    
print(len(android_final))
print(len(apple_final))


8864
3222
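
The comparison above depends on the exact text used for a zero price ('0' in one dataset, '0.0' in the other). As a hedged sketch, the hypothetical helper is_free below compares prices numerically instead. It assumes that paid prices may carry a leading '$' sign, which is an assumption about the data and not something shown in this notebook.

```python
def is_free(price_text):
    """Return True if the price string represents zero. Hypothetical helper."""
    # Assumption: paid prices may be written with a leading '$'.
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number (for example a header cell or a
        # value from a shifted row) is treated as "not free".
        return False


for sample in ["0", "0.0", "$4.99", "Everyone"]:
    print(sample, "->", is_free(sample))
```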


#### Okay, we have finished cleaning the data. Now let's analyze it.
We want apps that fit both the App Store and Google Play. We can do this by first adding a minimal Android version of the app to Google Play. If the app gets good feedback from users, we will develop it further. If the app is profitable after six months, we will add it to the App Store as well.

In [13]:
explore_data(android_final, 0, 3, True)
explore_data(apple_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

To give the developers more specific information, we will show them which genres of apps are most common. To do so, we will write a freq_table function that returns, for a chosen column of a dataset, the percentage of rows taken up by each unique value.

In [14]:
def freq_table(dataset, index):
    table = {}
    total = 0 
    
    for row in dataset:
        total += 1
        value = row [index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages
    
    

Now we use freq_table inside a display_table function, which prints the results sorted in descending order.

In [15]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
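
To make the behaviour of the two functions concrete, here is a small worked example. It is a sketch under stated assumptions: freq_percentages is a hypothetical Counter-based variant of freq_table, and the four-row dataset is made up.

```python
from collections import Counter


def freq_percentages(dataset, index):
    """Return {value: percentage of rows} for one column. Hypothetical variant."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up dataset: app name and genre.
toy = [
    ["A", "Games"],
    ["B", "Games"],
    ["C", "Games"],
    ["D", "Music"],
]

table = freq_percentages(toy, 1)

# Sort by percentage, largest first, like display_table does.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(f"{genre}: {pct:.2f}%")
```

Three of the four rows are games, so 'Games' comes out at 75% and 'Music' at 25%.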

Below are all the genres from the App Store dataset. What stands out right away is the 'Games' genre: it is easily the most common one, at about 58% of the free English apps.

Why is that? One plausible explanation is that people get bored with games quickly and download new ones. Game development is a huge industry and there are hundreds of games, even though people probably spend more time on social apps than on gaming. In the Social Networking genre, by contrast, only a few popular apps attract most of the users.

In [16]:
display_table(apple_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Now let's check the Google Play dataset, looking at the 'Category' and 'Genres' columns.

In [22]:
display_table(android_final, 1) 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

As we can see, the 'GAME' category is in second place here with almost 10%, and first place goes to the 'FAMILY' category with about 19%. Looking more closely, though, most of the Family apps are games for kids. So overall, games are the most common kind of app on Google Play too, which makes gaming look like the most promising investment. But let's check the 'Genres' column now.

In [23]:
display_table(android_final, 9) 

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

'Tools' and 'Entertainment' are the most common genres here, but when we take the 'Genres' and 'Category' columns together, gaming is still the most promising investment.

There is a difference between the stores, though: in the App Store dataset, apps designed for fun dominate the market, while on Google Play entertainment and practical tools are much more balanced.

Now we want to know which kinds of apps have the most users.

### Most popular apps by genre on the App Store
We want to identify the genres with the most users. The App Store dataset has no column that directly gives the number of users, but we can use the total number of user ratings as a stand-in. It is stored in the 'rating_count_tot' column (index 5).

First we calculate the average number of user ratings per app for each genre. To do that, we take the following steps:

1. Separate the apps by genre.
2. Sum up the user ratings of the apps in that genre.
3. Divide the sum by the number of apps in that genre (not by the total number of apps).

In [26]:
genres_ios = freq_table(apple_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[-5]
        if genre_app == genre:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
    avg_numb_user_ratings = total/len_genre
    print(genre,':', round(avg_numb_user_ratings), 'users')
        
    
    
    

Social Networking : 71548 users
Photo & Video : 28442 users
Games : 22789 users
Music : 57327 users
Reference : 74942 users
Health & Fitness : 23298 users
Weather : 52280 users
Utilities : 18684 users
Travel : 28244 users
Shopping : 26920 users
News : 21248 users
Navigation : 86090 users
Lifestyle : 16486 users
Entertainment : 14030 users
Food & Drink : 33334 users
Sports : 23009 users
Book : 39758 users
Finance : 31468 users
Education : 7004 users
Productivity : 21028 users
Business : 7491 users
Catalogs : 4004 users
Medical : 612 users


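As we can see, 'Navigation', 'Reference' and 'Social Networking' are on top, with 86090, 74942 and 71548 ratings per app on average. Note that the figures printed above are average rating counts, which we use as a proxy for the number of users.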
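
The nested loop above scans the whole dataset once per genre. As a sketch of an alternative, the averages can be collected in a single pass with two dictionaries. The helper average_by_group and the three-row dataset below are made up for illustration.

```python
from collections import defaultdict


def average_by_group(dataset, group_idx, value_idx):
    """Average a numeric column per group in one pass. Hypothetical helper."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    # Divide each group's sum by the number of rows in that group.
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: name, genre, rating count.
toy = [
    ["App 1", "Games", "100"],
    ["App 2", "Games", "300"],
    ["App 3", "Music", "50"],
]

averages = average_by_group(toy, 1, 2)
for genre, avg in averages.items():
    print(genre, ":", round(avg))
```

For the App Store data, the equivalent call would be average_by_group(apple_final, -5, 5).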

### Most popular apps by genre on Google Play
The Google Play data is easier to work with, because we can look at the 'Installs' column directly. Let's display its frequency table.

In [28]:
display_table(android_final,5)


1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


As we can see, the install counts are not exact: they are open-ended buckets such as 100,000+. That is still good enough for our purposes, because we only need to compare which categories attract more users.
First, though, we must convert the values from strings to floats.
We start with a frequency table for the 'Category' column to get the unique categories. Then, for each category, we loop through the Google Play dataset, remove the '+' and ',' characters from the installs value, add up the installs, and divide the sum by the number of apps in that category.
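
The string conversion on its own can be sketched as follows. The helper name parse_installs is hypothetical, and the sample values are buckets that appear in the frequency table above.

```python
def parse_installs(text):
    """Turn an installs bucket such as '1,000,000+' into a float."""
    # Remove the '+' suffix and the thousands separators, then convert.
    return float(text.replace("+", "").replace(",", ""))


for sample in ["1,000,000+", "10+", "0"]:
    print(sample, "->", parse_installs(sample))
```

Because every bucket is open-ended, the converted number is a lower bound: an app listed as 100,000+ is counted as having exactly 100,000 installs.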

In [30]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0 
    len_category = 0
    for row in android_final:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', "")
            installs = installs.replace(',', "")
            total += float(installs)
            len_category += 1
    avg_num = total/len_category
    print(category, ":", round(avg_num))
            
    

ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647318
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8767812
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854029
EDUCATION : 1833495
ENTERTAINMENT : 11640706
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437816
GAME : 15588016
FAMILY : 3695642
MEDICAL : 120551
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984078
TOOLS : 10801391
PERSONALIZATION : 5201483
PRODUCTIVITY : 16787331
PARENTING : 542604
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056942


As we can see from our data, the categories with the highest average number of installs are COMMUNICATION (about 38.5 million), VIDEO_PLAYERS, SOCIAL, PHOTOGRAPHY, PRODUCTIVITY and GAME, followed by TRAVEL_AND_LOCAL and ENTERTAINMENT.
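
The output above is printed in dictionary order, which makes the ranking hard to read. A minimal sketch of sorting it is shown below. The dictionary holds a subset of the averages copied from the output above; the variable names are made up.

```python
# A subset of the average installs per category, copied from the output above.
avg_installs = {
    "GAME": 15588016,
    "FAMILY": 3695642,
    "COMMUNICATION": 38456119,
    "MEDICAL": 120551,
    "SOCIAL": 23253652,
    "VIDEO_PLAYERS": 24727872,
    "PHOTOGRAPHY": 17840110,
    "PRODUCTIVITY": 16787331,
    "ENTERTAINMENT": 11640706,
    "TRAVEL_AND_LOCAL": 13984078,
    "TOOLS": 10801391,
}

# Sort by the average, largest first.
ranking = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)

for category, avg in ranking[:6]:
    print(f"{category}: {avg:,}")
```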

### Conclusions:
Our project's objective was to recommend a type of app that would be widely used, build a sizable user base, and therefore probably attract many clients willing to place advertisements in it. After analyzing the data on Google Play and App Store apps, we can say that gaming apps appear to be quite popular. What's more, gaming apps are highly diverse, which means a new app has a better chance of competing on the market and attracting users soon after release. In addition, game apps offer a lot of room for expansion once they are out, including paid features that do not require a large financial outlay yet generate extra revenue.