# Revenue analysis from free apps on Google Play and the App Store

This is a data analysis of apps in Google Play and the App Store. The goal of this project is to identify which type of free apps generate the greater number of downloads. The free apps generate revenue depending on in-app ads, and therefore the number of downloads is a reliable indicator of profitability. It is important to mention that I do not analyze the time each user spend using each app which will have an impact in the number of ads, and thus, the profitability of the app types.

## Import the data

In [1]:
from csv import reader

def open_file(path):
    opened_file = open(path, encoding="utf8")
    read_file = reader(opened_file)
    dataset = list(read_file)
    return dataset

## Explore the data

Look in how the data is structured and what type of information is given by app in the two different datasets (Google Play and App Store).

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
appstore_data = open_file('AppleStore.csv')
googleplay_data = open_file('googleplaystore.csv')

In [4]:
explore_data(appstore_data, 0, 3)
explore_data(googleplay_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15

In [5]:
del googleplay_data[10472+1]

## Process the data

### Removing duplicates

Removing duplicate apps in the Google Play dataset. The entry with the highest number of reviews (i.e. the most recent entry) will be kept.

In [6]:
reviews_max = {}

for app in googleplay_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews

In [7]:
len(reviews_max)

9659

In [8]:
android_clean = []
already_added = []

for app in googleplay_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
len(android_clean)

9659

### Remove non-English apps

For a more objective metric, let's focus on only apps with english characters in their names.

In [9]:
def is_it_english(any_string):
    count_non_english = 0
    for c in any_string:
        if ord(c) > 127:
            count_non_english +=1
        if (count_non_english > 3) and (len(any_string) > 3):
            return False
    return True

In [10]:
print(is_it_english('Instagram'))
print(is_it_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_it_english('Docs To Go™ Free Office Suite'))
print(is_it_english('Instachat 😜'))

True
False
True
True


In [11]:
appstore_english = []
googleplay_english = []

for app in appstore_data[1:]:
    name = app[0]
    if is_it_english(name):
        appstore_english.append(app)
        
for app in android_clean:
    name = app[0]
    if is_it_english(name):
        googleplay_english.append(app)

In [12]:
len(googleplay_english)

9614

### Remove non-free apps

Select only free apps

In [13]:
app_free = []
google_free = []

for app in appstore_english:
    price = float(app[4])
    if price == 0.0:
        app_free.append(app)
        
for app in googleplay_english:
    price = app[6]
    if price == 'Free':
        google_free.append(app)

## Evaluate the two datasets

Now we want to create an app profile for both providers. In order to validate the idea of developing and app in both platforms, we need to look if the app might have a good response in both platforms.

In [14]:
def freq_table(dataset, index):
    dictionary  = {}
    for row in dataset:
        content = row[index]
        if content in dictionary:
            dictionary[content] +=1
        else:
            dictionary[content] = 1
    return dictionary

In [15]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Check for the frequency of app types in the market

Counting the number of apps is a proxy to know which app are most popular and most produced.

In [16]:
print('GENRES FROM FREE ENGLISH APPS IN APPLESTORE')
display_table(app_free, 11)

GENRES FROM FREE ENGLISH APPS IN APPLESTORE
Games : 2257
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8


In [17]:
print('GENRES FROM FREE ENGLISH APPS IN GOOGLE PLAY')
display_table(google_free, 1)

GENRES FROM FREE ENGLISH APPS IN GOOGLE PLAY
FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [18]:
print('CATEGORIES FROM FREE ENGLISH APPS IN GOOGLE PLAY')
display_table(google_free, 9)

CATEGORIES FROM FREE ENGLISH APPS IN GOOGLE PLAY
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casu

### Check for the number of installs of each app type

Knowing the average and/or median number of installs of an app is a proxy to know which app types are the most downloaded.

In [19]:
genres_app_free = freq_table(app_free, 11)

for genre in genres_app_free:
    total = 0
    len_genre = 0
    
    for app in app_free:
        genre_app = app[11]
        
        if genre_app == genre:
            total += float(app[5]) # ratings
            len_genre += 1
    
    avg_rating = total/len_genre
    print(genre + ': ' + str(avg_rating))

Social Networking: 53078.195804195806
Photo & Video: 27249.892215568863
Games: 18924.68896765618
Music: 56482.02985074627
Reference: 67447.9
Health & Fitness: 19952.315789473683
Weather: 47220.93548387097
Utilities: 14010.100917431193
Travel: 20216.01785714286
Shopping: 18746.677685950413
News: 15892.724137931034
Navigation: 25972.05
Lifestyle: 8978.308510638299
Entertainment: 10822.961077844311
Food & Drink: 20179.093023255813
Sports: 20128.974683544304
Book: 8498.333333333334
Finance: 13522.261904761905
Education: 6266.333333333333
Productivity: 19053.887096774193
Business: 6367.8
Catalogs: 1779.5555555555557
Medical: 459.75


In [20]:
category_google_free = freq_table(google_free, 1)

for category in category_google_free:
    total = 0
    len_genre = 0
    
    for app in google_free:
        category_app = app[1]
        
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs) # ratings
            len_genre += 1
    
    avg_rating = total/len_genre
    print(category + ': ' + str(avg_rating))

ART_AND_DESIGN: 1986335.0877192982
AUTO_AND_VEHICLES: 647317.8170731707
BEAUTY: 513151.88679245283
BOOKS_AND_REFERENCE: 8767811.894736841
BUSINESS: 1712290.1474201474
COMICS: 817657.2727272727
COMMUNICATION: 38456119.167247385
DATING: 854028.8303030303
EDUCATION: 1833495.145631068
ENTERTAINMENT: 11640705.88235294
EVENTS: 253542.22222222222
FINANCE: 1387692.475609756
FOOD_AND_DRINK: 1924897.7363636363
HEALTH_AND_FITNESS: 4188821.9853479853
HOUSE_AND_HOME: 1331540.5616438356
LIBRARIES_AND_DEMO: 638503.734939759
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
FAMILY: 3697848.1731343283
MEDICAL: 120550.61980830671
SOCIAL: 23253652.127118643
SHOPPING: 7036877.311557789
PHOTOGRAPHY: 17840110.40229885
SPORTS: 3638640.1428571427
TRAVEL_AND_LOCAL: 13984077.710144928
TOOLS: 10801391.298666667
PERSONALIZATION: 5201482.6122448975
PRODUCTIVITY: 16787331.344927534
PARENTING: 542603.6206896552
WEATHER: 5074486.197183099
VIDEO_PLAYERS: 24727872.452830188
NEWS_AND_MAGAZINES: 9549178.467741935
MA

The average rating might not be the best example since it can be heavily skewed by outliers. The median is computed in the cells below.

In [21]:
import numpy as np

genres_app_free = freq_table(app_free, 11)

all_median_rating = {}
all_mean_rating = {}

for genre in genres_app_free:
    total = []
    
    for app in app_free:
        genre_app = app[11]
        
        if genre_app == genre:
            total.append(float(app[5])) # ratings
    
    median_rating = np.median(total)
    all_median_rating[genre] = median_rating
    
    mean_rating = np.mean(total)
    all_mean_rating[genre] = mean_rating


## Most downloaded app types in App Store (from lower to higher number of installations)

In [22]:
{k: v for k, v in sorted(all_median_rating.items(), key=lambda item: item[1])}

{'Book': 0.0,
 'Catalogs': 0.0,
 'Finance': 13.5,
 'Navigation': 15.5,
 'Medical': 17.5,
 'Lifestyle': 125.0,
 'Weather': 128.0,
 'Food & Drink': 148.0,
 'News': 263.5,
 'Travel': 320.5,
 'Games': 422.0,
 'Education': 457.0,
 'Utilities': 612.0,
 'Entertainment': 640.5,
 'Sports': 809.0,
 'Business': 853.0,
 'Health & Fitness': 882.0,
 'Shopping': 895.0,
 'Social Networking': 1033.0,
 'Photo & Video': 2099.0,
 'Reference': 3095.0,
 'Music': 3687.0,
 'Productivity': 5335.0}

## Most downloaded app types in Google Play Store (from lower to higher number of installations)

In [23]:
import numpy as np

category_google_free = freq_table(google_free, 1)

all_median_rating = {}
all_mean_rating = {}

for category in category_google_free:
    total = []
    
    for app in google_free:
        category_app = app[1]
        
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total.append(float(n_installs)) # ratings
    
    median_rating = np.nanmedian(total)
    all_median_rating[category] = median_rating
    
    mean_rating = np.nanmean(total)
    all_mean_rating[category] = mean_rating

In [24]:
{k: v for k, v in sorted(all_median_rating.items(), key=lambda item: item[1])}

{'BUSINESS': 1000.0,
 'EVENTS': 1000.0,
 'MEDICAL': 1000.0,
 'DATING': 10000.0,
 'FINANCE': 10000.0,
 'LIBRARIES_AND_DEMO': 10000.0,
 'LIFESTYLE': 10000.0,
 'BEAUTY': 50000.0,
 'BOOKS_AND_REFERENCE': 50000.0,
 'NEWS_AND_MAGAZINES': 50000.0,
 'ART_AND_DESIGN': 100000.0,
 'AUTO_AND_VEHICLES': 100000.0,
 'COMICS': 100000.0,
 'FAMILY': 100000.0,
 'SOCIAL': 100000.0,
 'SPORTS': 100000.0,
 'TRAVEL_AND_LOCAL': 100000.0,
 'TOOLS': 100000.0,
 'PERSONALIZATION': 100000.0,
 'PRODUCTIVITY': 100000.0,
 'PARENTING': 100000.0,
 'MAPS_AND_NAVIGATION': 100000.0,
 'COMMUNICATION': 500000.0,
 'FOOD_AND_DRINK': 500000.0,
 'HEALTH_AND_FITNESS': 500000.0,
 'HOUSE_AND_HOME': 500000.0,
 'EDUCATION': 1000000.0,
 'ENTERTAINMENT': 1000000.0,
 'GAME': 1000000.0,
 'SHOPPING': 1000000.0,
 'PHOTOGRAPHY': 1000000.0,
 'WEATHER': 1000000.0,
 'VIDEO_PLAYERS': 1000000.0}