# Profitable app profiles for the App Store and Google Play

Our aim is to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

## Data Exploration
First, we open the two data sets: the App Store (`AppleStore.csv`) and Google Play (`googleplaystore.csv`).

In [2]:
import csv
appleStore = []
playStore = []

# Load in AppleStore data
with open('AppleStore.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        appleStore.append(row)
        
# Load in PlayStore data
with open('googleplaystore.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        playStore.append(row)
        
        
# Make sure data loaded correctly
print(len(playStore))
print(len(appleStore))

10842
7198
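
Note that the counts above include the header row of each file. As a minimal sketch, the loading step could be wrapped in a helper that returns the header separately from the data rows. The name `load_dataset` and the `encoding` default are assumptions for illustration, not part of the original notebook:

```python
import csv


def load_dataset(path, encoding='utf-8'):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline='' is what the csv module documentation recommends for files
    # that are handed to csv.reader.
    with open(path, 'r', encoding=encoding, newline='') as csv_file:
        rows = list(csv.reader(csv_file))
    if not rows:
        return [], []  # empty file: no header, no data
    return rows[0], rows[1:]
```

With such a helper, `header, rows = load_dataset('AppleStore.csv')` would give a row count that excludes the header.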


To ease the task of exploring the data, we create a function that prints a slice of rows in a readable way and, optionally, the number of rows and columns.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Column headings

explore_data(playStore, 0, 1, True)
print('\n')
explore_data(appleStore, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


## Removing duplicates

First, we have to remove the duplicate apps from the Google Play data. The cell below counts the duplicate entries and lists a few of them.

In [5]:
duplicates = []
uniques = []
for app in playStore:
    name = app[0]
    if name in uniques:
        duplicates.append(name)
    else:
        uniques.append(name)

print('Number of duplicate apps: ', len(duplicates))
print('\n')
print('Examples: ', duplicates[:15])

Number of duplicate apps:  1181


Examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
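
The same count can be obtained without repeated list lookups. The sketch below uses `collections.Counter` on a toy sample; `count_duplicates` and `sample` are hypothetical names introduced here, not part of the notebook:

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each repeated name to the number of extra rows it has."""
    counts = Counter(row[name_index] for row in rows)
    # A name that appears k times contributes k - 1 duplicate rows.
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Toy sample standing in for the Google Play rows (assumption).
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
extra = count_duplicates(sample)
print(extra)                # {'Box': 2}
print(sum(extra.values()))  # 2
```

Summing the values gives the same kind of total as `len(duplicates)` above, since every repeat after the first occurrence is counted once.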


In [6]:
# Detecting duplicate entries (criterion: only keep the ones with most reviews)

reviews_max = {}
for app in playStore:
    name = app[0]
    n_reviews = app[3]
    if((name in reviews_max) and reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif(name not in reviews_max):
        reviews_max[name] = n_reviews
    
print('Expected len: ', len(playStore)-1181)
print('Actual cleaned len: ', len(reviews_max))

# Removing duplicates
android_clean = []
already_added = []

for app in playStore:
    name = app[0]
    n_reviews = app[3]
    if(n_reviews == reviews_max[name] and name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print('\n')
explore_data(android_clean, 1, 3, True)

Expected len:  9661
Actual cleaned len:  9661


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9661
Number of columns: 13
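
One caveat: the cell above compares the review counts as they come out of the CSV, that is, as strings. Python compares strings character by character, so this can pick a different row than a numeric comparison would. The sketch below shows the difference and one possible numeric variant. `to_int` and `keep_most_reviewed` are hypothetical helpers, the sample rows are toy data, and skipping rows whose review count is not a whole number (such as the header) is an assumption that would change the row count:

```python
def to_int(text):
    """Return text as an int, or None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest numeric review count."""
    best = {}  # name -> (review count, row)
    for row in rows:
        n_reviews = to_int(row[reviews_index])
        if n_reviews is None:
            continue  # assumption: skip the header and any malformed row
        name = row[name_index]
        if name not in best or n_reviews > best[name][0]:
            best[name] = (n_reviews, row)
    return [row for _, row in best.values()]


print('9' < '10')  # False: strings compare character by character
print(9 < 10)      # True

# Toy rows (assumption), laid out like the first four Google Play columns.
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Box', 'BUSINESS', '4.2', '9'],
    ['Box', 'BUSINESS', '4.2', '10'],
    ['Slack', 'BUSINESS', '4.4', '5'],
]
print(keep_most_reviewed(sample))
# [['Box', 'BUSINESS', '4.2', '10'], ['Slack', 'BUSINESS', '4.4', '5']]
```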


## Removing non-English apps

If we explore the data long enough, we'll see that some entries have non-English names. The function below flags them by counting non-ASCII characters:

In [7]:
def checkEng(name):
    isEng = True
    count = 0
    for s in name:
        if(ord(s) > 127):
            count+=1
    if(count >= 3):
        isEng = False
        
    return isEng

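A stricter check that rejects any name containing a non-ASCII character would also eliminate English apps whose names include special characters or emojis. Hence, the function above tolerates up to two non-ASCII characters and rejects a name only once it contains three or more.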

Here are some examples:

In [8]:
print(checkEng('Docs To Go™ Free Office Suite'))
print(checkEng('Instachat 😜'))

True
True


Now, we loop through both data sets and keep only the entries with English names.

In [9]:
new_android_list = []
new_ios_list = []

for app in android_clean:
    name = app[0]
    if(checkEng(name)):
        new_android_list.append(app)

for app in appleStore:
    name = app[1]
    if(checkEng(name)):
        new_ios_list.append(app)

print(len(new_android_list))
print(len(new_ios_list))

9599
6156


## Isolating free apps

Below, we keep only the free apps, since those are the only ones we analyze.

In [10]:
free_android_list = []
free_ios_list = []

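for app in new_android_list:
    price = app[6]  # index 6 is the 'Type' column, which reads 'Free' for free apps
    if(price == 'Free'):
        free_android_list.append(app)
        
for app in new_ios_list:
    price = app[4]  # index 4 is the 'price' column, stored as a string
    if(price == '0.0'):
        free_ios_list.append(app)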
        
print(len(free_android_list))
print(len(free_ios_list))

8845
3203
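
The filters above match exact strings (`'Free'` and `'0.0'`). A sketch of a numeric alternative that reads a price column directly is shown below. `is_free` is a hypothetical helper, and the `'$4.99'` format for paid prices is an assumption about how the data is written:

```python
def is_free(price_text):
    """Return True if a price string represents zero, False otherwise."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # not a number, e.g. a header cell


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
print(is_free('price'))  # False
```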


## Taking a step back

(Verbatim from Dataquest)

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.

2. If the app has a good response from users, we develop it further.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [11]:
explore_data(playStore, 0, 1, True)
print('\n')
explore_data(appleStore, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


A glance at the columns shows that `Genres` and `Category` in the Google Play data and `prime_genre` in the App Store data describe what kind of app each entry is. Hence, we build a frequency table for each.

In [12]:
# Generates a frequency table
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1   
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


# Displays table in percentages
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
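
For comparison, the same percentages can be sketched with `collections.Counter`. `freq_percentages` and the toy `sample` rows are assumptions for illustration:

```python
from collections import Counter


def freq_percentages(rows, index):
    """Return {value: percentage of rows holding that value in column index}."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Toy rows: (app name, genre).
sample = [['a', 'Tools'], ['b', 'Tools'], ['c', 'Education'], ['d', 'Tools']]
table = freq_percentages(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
# Tools : 75.0
# Education : 25.0
```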

Below are the five most common genres among free apps on Google Play:

In [13]:
# play store freq table for genres
play_genres = freq_table(free_android_list, 9)

# Displaying top five genres
sorted_play_genres = sorted(play_genres, key=play_genres.get, reverse = True)

print(sorted_play_genres[:5])

['Tools', 'Entertainment', 'Education', 'Business', 'Productivity']


Below are the five most common genres among free apps on the App Store:

In [14]:
# App Store freq table for prime_genre
apple_genres = freq_table(free_ios_list, 11)

# Displaying top five genres
sorted_apple_genres = sorted(apple_genres, key=apple_genres.get, reverse = True)

print(sorted_apple_genres[:5])

['Games', 'Entertainment', 'Photo & Video', 'Education', 'Social Networking']


Below are the five most common categories among free apps on Google Play:

In [15]:
# play store freq table for categories
play_category = freq_table(free_android_list, 1)

# Displaying top five categories
sorted_play_category = sorted(play_category, key=play_category.get, reverse = True)

print(sorted_play_category[:5])

['FAMILY', 'GAME', 'TOOLS', 'BUSINESS', 'PRODUCTIVITY']


## Other ways to determine popular profiles

Instead of only looking at the most frequent genres, we can look at which genres have the most users. The App Store data has no install counts, so we use the total number of user ratings (`rating_count_tot`) as a proxy.

Below, we compute the average number of user ratings for each app genre on the App Store:

In [16]:
for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in free_ios_list:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Photo & Video : 28441.54375
Entertainment : 14195.358565737051
News : 21248.023255813954
Music : 57326.530303030304
Shopping : 27230.734939759037
Catalogs : 4004.0
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Navigation : 86090.33333333333
Weather : 52279.892857142855
Business : 7491.117647058823
Lifestyle : 16815.48
Games : 22886.36709539121
Productivity : 21028.410714285714
Finance : 32367.02857142857
Utilities : 19156.493670886077
Social Networking : 71548.34905660378
Reference : 79350.4705882353
Sports : 23008.898550724636
Travel : 28243.8
Health & Fitness : 23298.015384615384
Education : 7003.983050847458
Medical : 612.0
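
The loop above rescans the whole list once per genre. A single-pass sketch of the same average is shown below; `average_by_group` and the toy `sample` rows are hypothetical:

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column value_index for each value of group_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: (genre, rating count).
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Music': 50.0}
```

For the App Store rows this would be called with `group_index=11` (`prime_genre`) and `value_index=5` (`rating_count_tot`), matching the indices used in the cell above.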


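Now we calculate the average number of installs per category for the Google Play data set. The `Installs` values are strings such as '10,000+', so we strip the commas and plus signs before converting them to numbers.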

In [17]:
categories_android = freq_table(free_android_list, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android_list:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

HEALTH_AND_FITNESS : 4188821.9853479853
PARENTING : 542603.6206896552
HOUSE_AND_HOME : 1360598.042253521
BUSINESS : 1712290.1474201474
EDUCATION : 1820673.076923077
MAPS_AND_NAVIGATION : 4049274.6341463416
PRODUCTIVITY : 16787331.344927534
SOCIAL : 23253652.127118643
TOOLS : 10710881.491298527
LIFESTYLE : 1446158.2238372094
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8814199.78835979
COMICS : 832613.8888888889
ENTERTAINMENT : 11640705.88235294
SPORTS : 3650602.276666667
FOOD_AND_DRINK : 1924897.7363636363
MEDICAL : 120616.48717948717
FAMILY : 3696479.242695289
PERSONALIZATION : 5201482.6122448975
TRAVEL_AND_LOCAL : 13984077.710144928
EVENTS : 253542.22222222222
COMMUNICATION : 38590581.08741259
SHOPPING : 7036877.311557789
AUTO_AND_VEHICLES : 647317.8170731707
VIDEO_PLAYERS : 24727872.452830188
PHOTOGRAPHY : 17805627.643678162
LIBRARIES_AND_DEMO : 638503.734939759
GAME : 15516683.567251462
BEAUTY : 513151.88679245283
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1986335
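
The output above is easier to compare once it is sorted. The sketch below parses the install strings and returns the categories ordered by average installs. `parse_installs`, `average_installs` and the toy rows are assumptions for illustration:

```python
def parse_installs(text):
    """Turn an install string such as '10,000+' into the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))


def average_installs(rows, category_index=1, installs_index=5):
    """Return (category, average installs) pairs, highest average first."""
    totals, counts = {}, {}
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    averages = [(category, totals[category] / counts[category]) for category in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Toy rows with the category at index 1 and installs at index 5.
sample = [
    ['a', 'MEDICAL', '', '', '', '100+'],
    ['b', 'SOCIAL', '', '', '', '1,000+'],
    ['c', 'SOCIAL', '', '', '', '3,000+'],
]
print(parse_installs('5,000,000+'))  # 5000000
print(average_installs(sample))      # [('SOCIAL', 2000.0), ('MEDICAL', 100.0)]
```

Because each install figure ends in `+`, the parsed numbers are best read as lower bounds, so the averages are approximate.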

# Done!