# Profitable App Profiles

Here, we analyze app data across Google Play and the Apple App Store to understand what types of apps are likely to attract more users.

Here are the links to the documentation:

[Apple Store Data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

[Google Play Store Data](https://www.kaggle.com/lava18/google-play-store-apps)

In [1]:
from csv import reader

# Both CSV files are expected in the working directory; adjust the paths if they live elsewhere.
opened_apple_file = open('AppleStore.csv', encoding='utf8')
opened_google_file = open('googleplaystore.csv', encoding='utf8')
read_afile = reader(opened_apple_file)
read_gfile = reader(opened_google_file)
apple_apps_data = list(read_afile)
google_apps_data = list(read_gfile)
opened_apple_file.close()
opened_google_file.close()
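
The loading step above can also be wrapped in a small helper that closes the file automatically. This is a minimal sketch, not the notebook's own code: `load_csv` is a hypothetical name, and the demo writes a throwaway file so that it runs without the Kaggle downloads.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return every row of a CSV file (header included) as a list of lists."""
    # newline="" is what the csv module documentation recommends for reader input
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Demo on a temporary file (an assumption made so the sketch is self-contained).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Démo app", "0"]])
    demo_rows = load_csv(demo_path)

print(demo_rows)
```

With the real files in place, `load_csv('AppleStore.csv')` would return the same kind of list of rows as `apple_apps_data`.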

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_apps_data,0,3,True)
explore_data(google_apps_data,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone

In [3]:
for label in apple_apps_data[0]:
    print(label, '\n')
for label in google_apps_data[0]:
    print(label, '\n')

id 

track_name 

size_bytes 

currency 

price 

rating_count_tot 

rating_count_ver 

user_rating 

user_rating_ver 

ver 

cont_rating 

prime_genre 

sup_devices.num 

ipadSc_urls.num 

lang.num 

vpp_lic 

App 

Category 

Rating 

Reviews 

Size 

Installs 

Type 

Price 

Content Rating 

Genres 

Last Updated 

Current Ver 

Android Ver 



In [4]:
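# Row 10473 has 12 fields instead of 13: its 'Category' value is missing,
# so every later value is shifted one column to the left. Print it, then drop it.
print(google_apps_data[10473])
del google_apps_data[10473]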

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
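
A row like the one above can be located by comparing each row's length with the header's length. The following is a small sketch on toy data; `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data: the last row lacks its 'Category' value, so later values slide left.
toy_data = [
    ["App", "Category", "Rating"],
    ["Good app", "TOOLS", "4.1"],
    ["Shifted app", "1.9"],
]

malformed = find_malformed_rows(toy_data)
print(malformed)
```

Running the same helper on `google_apps_data` before the deletion should report the row printed above, assuming no other rows have a missing field.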


There are some duplicates in our data set. We will find out just how many, what they are, and sample a few:

In [5]:
duplicate_apps = []
unique_apps = []

for app in google_apps_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[100:111])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Meet4U - Chat, Love, Singles!', '95Live -SG#1 Live Streaming App', 'Just She - Top Lesbian Dating', 'Hily: Dating, Chat, Match, Meet & Hook up', 'O-Star', 'Random Video Chat', 'Black People Meet Singles Date', 'Howlr', 'Free Dating & Flirt Chat - Choice of Love', 'Cardi B Live Stream Video Chat - Prank', 'Chat Kids - Chat Room For Kids']
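
The duplicate check above tests membership in a list, which rescans the list for every app. A hedged alternative is to track the names already seen in a set, where membership tests are fast on average. `split_duplicates` and the toy rows below are illustrative assumptions.

```python
def split_duplicates(rows, name_index=0):
    """Split app names into first occurrences and repeated occurrences."""
    seen = set()
    unique_names = []
    duplicate_names = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names


toy_rows = [["Chat", "10"], ["Maps", "5"], ["Chat", "12"], ["Chat", "12"]]
unique_names, duplicate_names = split_duplicates(toy_rows)
print("Number of duplicate apps:", len(duplicate_names))
```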


To remove the duplicates, we use the number of reviews to decide which instance of an app to keep. We assume that the instance with the most reviews is the most recent one, so we keep it and discard all other instances of that app.

In [6]:
reviews_max = {}
for app in google_apps_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max and reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


Now that we have our dictionary `reviews_max`, we use it to remove duplicate rows. We create two empty lists, `android_clean` and `already_added`, and loop over the Google apps data. For each app, we check two conditions: its number of reviews equals the maximum stored for its name in `reviews_max`, and its name is **not** already in `already_added`. The second check handles the case where duplicate rows share the same maximum review count. If both conditions hold, we append the entire row to `android_clean` and add the app's name to `already_added`.

In [7]:
android_clean = []
already_added = []
for app in google_apps_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name] and name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))
print(android_clean[:5])

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
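
The two passes above (build `reviews_max`, then filter) can be folded into one pass that stores the best row per name. This is only a sketch under the same assumption that the most-reviewed row is the one to keep; `keep_most_reviewed` and the toy rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


toy_rows = [
    ["Chat", "SOCIAL", "4.0", "10"],
    ["Maps", "TRAVEL", "4.5", "5"],
    ["Chat", "SOCIAL", "4.1", "12"],
    ["Chat", "SOCIAL", "4.2", "12"],
]
deduplicated = keep_most_reviewed(toy_rows)
print(len(deduplicated))
```

One difference from the cell above: the result is ordered by each name's first appearance, not by the position of the row that was kept.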


In [8]:
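def is_english(your_string):
    # Count characters outside the ASCII range (code points above 127)
    non_ascii_count = 0
    for character in your_string:
        if ord(character) > 127:
            non_ascii_count += 1
        # Treat the name as non-English once three such characters are seen
        if non_ascii_count == 3:
            return False
    return True
# Check a few sample strings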
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
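
The check above counts non-ASCII characters against a fixed limit, so a very short non-English name with only one or two such characters still passes. One possible variant, offered as an assumption and not as the notebook's method, thresholds the share of non-ASCII characters instead. The name `is_mostly_ascii` and the 0.2 cutoff are illustrative choices.

```python
def is_mostly_ascii(text, max_non_ascii_share=0.2):
    """Return True when at most the given share of characters is non-ASCII."""
    if not text:
        return True
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii / len(text) <= max_non_ascii_share


samples = [
    "Instagram",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
    "日本",  # two non-ASCII characters: passes a count-of-three check, fails here
]
for sample in samples:
    print(sample, is_mostly_ascii(sample))
```

This remains a heuristic: it looks only at character codes, so a name written in another language with Latin letters would still pass.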


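Our `is_english` function treats a name as non-English once it contains three characters outside the ASCII range (code points above 127), which tolerates the occasional emoji or trademark symbol. It is a rough heuristic rather than true language detection. We next use it to remove non-English app names from both the Apple data and the Google data.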

In [9]:
android_clean_english = []
apple_apps_data_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_clean_english.append(app)

for app in apple_apps_data[1:]:
    name = app[1]
    if is_english(name):
        apple_apps_data_english.append(app)
        
print('Nonduplicate, english Android apps:' + '\n')
print(len(android_clean_english))
print('English Apple apps: \n')
print(len(apple_apps_data_english))

Nonduplicate, english Android apps:

9597
English Apple apps: 

6155


Next, we isolate only the free apps from these lists.

In [10]:
android_clean_english_free = []
apple_apps_data_english_free = []

for app in android_clean_english:
    price = app[7]
    if price == '0' or price == '$0':
        android_clean_english_free.append(app)
        
for app in apple_apps_data_english:
    price = float(app[4])
    if price == 0:
        apple_apps_data_english_free.append(app)
        
print(len(android_clean_english_free))
print(len(apple_apps_data_english_free))

8848
3203
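
The free-app filter above compares the Google price against the exact strings `'0'` and `'$0'`, and converts the Apple price with `float`. A sketch of one shared helper is shown here, assuming prices are plain numbers optionally prefixed with `$`. `parse_price` and `is_free` are hypothetical names.

```python
def parse_price(raw):
    """Convert a price string such as '0', '$0' or '$4.99' to a float."""
    return float(raw.strip().lstrip("$"))


def is_free(raw):
    """Return True when the parsed price is zero."""
    return parse_price(raw) == 0.0


for raw_price in ["0", "$0", "0.0", "$4.99"]:
    print(raw_price, is_free(raw_price))
```

A value that is not a number after stripping the `$` raises `ValueError`, which makes unexpected formats visible instead of silently treating them as paid.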


We want to determine what kinds of apps are likely to attract more users.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well in both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll need to build frequency tables for a few columns in our data sets.

In [12]:
def freq_table(dataset, index):
    ourtable = {}
    for app in dataset:
        data_point = app[index]
        if data_point in ourtable:
            ourtable[data_point] += 1
        else:
            ourtable[data_point] = 1
    return ourtable

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('prime_genre for Apple App Store:\n')
display_table(apple_apps_data_english_free, 11)
print('Category for Google Play:\n')
display_table(android_clean_english_free, 1) # Category
print('Genre for Google Play:\n')
display_table(android_clean_english_free, -4) # Genres

prime_genre for Apple App Store:

Games : 1866
Entertainment : 251
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 83
Utilities : 79
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 50
News : 43
Travel : 40
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 17
Business : 17
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4
Category for Google Play:

FAMILY : 1676
GAME : 858
TOOLS : 748
BUSINESS : 407
PRODUCTIVITY : 345
LIFESTYLE : 344
FINANCE : 328
MEDICAL : 313
SPORTS : 300
PERSONALIZATION : 294
COMMUNICATION : 286
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 189
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 123
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 71
WEATHER : 70
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 54
BEAUTY : 53
Genre for Goo
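
The tables above report raw counts, which are hard to compare across two stores of different sizes. A possible variant that reports percentages is sketched here with `collections.Counter` from the standard library. `freq_table_percent` and the toy rows are assumptions for illustration.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, 100 * n / total) for value, n in counts.most_common()]


toy_apps = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "GAME"]]
shares = freq_table_percent(toy_apps, 1)
for value, percent in shares:
    print(value, ":", round(percent, 2))
```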

Now that we have frequency tables of app types by genre and category, we want to know how many users, on average, each kind of app has. For the Google Play Store (labeled, somewhat inaccurately, as Android in our variable names), the `Installs` column gives this number directly, but for the Apple App Store we must infer the number of users from the total rating count (`rating_count_tot`). We'll compute this average for the App Store first, using a nested for loop.

In [15]:
prime_genre_table = freq_table(apple_apps_data_english_free,11)
for genre in prime_genre_table:
    total = 0
    len_genre = 0
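    # Note: this inner loop reads apple_apps_data_english, which still contains paid apps,
    # so the averages below are not limited to free apps (use apple_apps_data_english_free for that).
    for app in apple_apps_data_english: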
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_user_ratings = total / len_genre
    print('The genre: ' + genre)
    print(avg_user_ratings) 
    print('average ratings\n')

The genre: Social Networking
60253.84920634921
average ratings

The genre: Photo & Video
14688.715542521993
average ratings

The genre: Games
15641.67426035503
average ratings

The genre: Music
29047.109489051094
average ratings

The genre: Reference
28096.21568627451
average ratings

The genre: Health & Fitness
10868.024390243903
average ratings

The genre: Weather
23145.246376811596
average ratings

The genre: Utilities
8002.298578199052
average ratings

The genre: Travel
19351.4406779661
average ratings

The genre: Shopping
26938.964285714286
average ratings

The genre: News
17283.535714285714
average ratings

The genre: Navigation
19370.821428571428
average ratings

The genre: Lifestyle
9021.5
average ratings

The genre: Entertainment
8920.807174887892
average ratings

The genre: Food & Drink
19934.386363636364
average ratings

The genre: Sports
15350.913461538461
average ratings

The genre: Book
10750.11320754717
average ratings

The genre: Finance
23840.0625
average ratings

The 
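
The nested loop above rescans the whole data set once per genre. The same per-genre average can be accumulated in a single pass with two dictionaries. This is a sketch on toy rows; `average_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


toy_rows = [["Games", "100"], ["Games", "300"], ["Music", "50"]]
averages = average_by_group(toy_rows, 0, 1)
print(averages)
```

For the Apple data, the call would look like `average_by_group(apple_apps_data_english_free, 11, 5)`, using the `prime_genre` and `rating_count_tot` column positions from the header.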

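My recommendation would be any app genre that averages at least 20,000 ratings. Among the genres shown above, that includes Social Networking, Music, Reference, Shopping, Finance, and Weather. Now for the Google Play Store data: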

In [22]:
category_freq_table = freq_table(android_clean_english_free,1)
for category in category_freq_table:
    total = 0
    len_category = 0
    for app in android_clean_english_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total / len_category
    print(category + ' has')
    print(avg_installs)
    print('average installs.\n')

ART_AND_DESIGN has
1986335.0877192982
average installs.

AUTO_AND_VEHICLES has
647317.8170731707
average installs.

BEAUTY has
513151.88679245283
average installs.

BOOKS_AND_REFERENCE has
8814199.78835979
average installs.

BUSINESS has
1712290.1474201474
average installs.

COMICS has
832613.8888888889
average installs.

COMMUNICATION has
38590581.08741259
average installs.

DATING has
854028.8303030303
average installs.

EDUCATION has
1833495.145631068
average installs.

ENTERTAINMENT has
11640705.88235294
average installs.

EVENTS has
253542.22222222222
average installs.

FINANCE has
1387692.475609756
average installs.

FOOD_AND_DRINK has
1924897.7363636363
average installs.

HEALTH_AND_FITNESS has
4188821.9853479853
average installs.

HOUSE_AND_HOME has
1360598.042253521
average installs.

LIBRARIES_AND_DEMO has
638503.734939759
average installs.

LIFESTYLE has
1446158.2238372094
average installs.

GAME has
15544014.51048951
average installs.

FAMILY has
3695641.8198090694
average 
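
The install counts in the loop above arrive as strings such as `'10,000+'`, so they are cleaned before averaging. That cleaning step can be isolated in a small function, sketched here with a hypothetical name, `parse_installs`. The toy values are illustrative and not taken from the data set's results.

```python
def parse_installs(raw):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(raw.replace(",", "").replace("+", ""))


toy_installs = ["10,000+", "500,000+", "1,000+"]
parsed = [parse_installs(value) for value in toy_installs]
average_installs = sum(parsed) / len(parsed)
print(parsed)
print(average_installs)
```

The trailing `+` suggests each value is a lower bound on the real install count, so averages built from these buckets are best read as rough estimates.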