# Profitable App Profiles for the App Store and Google Play Markets

## Phase 1: Business Understanding
_The goal of this project is to determine the kind of apps an app development company should focus their efforts on considering various factors such as market share, competition, app genre, and user engagement._

---

## Phase 2: Data Mining
_Finding necessary data to fit the purpose of analysis_

### Data sets used
_Further information/documentation are available in the source links._
* [Android](https://www.kaggle.com/lava18/google-play-store-apps/home) contains data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
* [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) contains data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

In [1]:
from csv import reader

### Android data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android_raw = list(read_file)
android_header = android_raw[0]
android_data = android_raw[1:]

### iOS data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios_raw = list(read_file)
ios_header = ios_raw[0]
ios_data = ios_raw[1:]

In [2]:
def convert_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
### Android preview ###
print(android_header)
print('\n')
convert_data(android_data, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


('Number of rows:', 10841)
('Number of columns:', 13)


In [4]:
### iOS preview ###
print(ios_header)
print('\n')
convert_data(ios_data, 0, 1, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


('Number of rows:', 7197)
('Number of columns:', 17)


### Relevant Info Between These Data Sets

| Android Index | Android Column | iOS Index | iOS Column |
| --- | --- | --- | --- |
| 0 | App | 1 | track_name |
| 1 | Category | 11 | prime_genre |
| 2 | Rating | 7 | user_rating |
| 3 | Reviews | 5 | rating_count_tot |
| 7 | Price | 4 | price |

---

## Phase 3: Data Cleaning
_Making sure data across the boards are complete and consistent by:_
* deleting wrong data
* removing duplicates
* removing non-english apps
* isolating free apps

### Deleting Wrong Data

In [5]:
print(android_header)
print(android_data[10472]) # missing CATEGORY value
#del android_data[10472]  <--- converted to comment after running once
print(android_data[10472]) # after deletion, new len should be 10840

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Removing Duplicates

In [6]:
### Checking for duplicates in android data ###

duplicate_android = []
unique_android = []

for app in android_data:
    name = app[0]
    if name in unique_android:
        duplicate_android.append(name)
    else:
        unique_android.append(name)
        
print('Duplicate android apps: ', len(duplicate_android))
print('\n')
print('Including: ', duplicate_android[:10])

### Keeping only duplicate with the most reviews
n_reviews_max = {} # starts with an empty dictionary
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in n_reviews_max and n_reviews_max[name] < n_reviews:
        n_reviews_max[name] = n_reviews # if the current n_reviews_max is less, we change the value
    elif name not in n_reviews_max:
        n_reviews_max[name] = n_reviews # otherwise, we add a new key-value pair

('Duplicate android apps: ', 1181)


('Including: ', ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack'])


We expect there to be 1181 duplicates. So after deleting entries, the **new len of android_data should be 9659** (previous len minus duplicates).

In [7]:
len(n_reviews_max)

9659

In [8]:
### Isolating android data to a new data set

android_clean = [] # stores new cleaned data set
android_already_added = [] # only store app names

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews_max[name] == n_reviews) and (name not in android_already_added):
        android_clean.append(app)
        android_already_added.append(app[0])
        
print('len of android_data:', len(android_data))
print('len of android_clean: ', len(android_clean))
print('number of entries removed: ', len(android_data) - len(android_clean))


('len of android_data:', 10840)
('len of android_clean: ', 9659)
('number of entries removed: ', 1181)


In [9]:
### Checking for duplicates in ios data ###

duplicate_ios = []
unique_ios = []

for app in ios_data:
    name = app[1]
    if name in unique_ios:
        duplicate_ios.append(name)
    else:
        unique_ios.append(name)
        
print('Duplicate ios apps: ', len(duplicate_ios))
print('\n')
print('Including: ', duplicate_ios[:10])

### Keeping only duplicate with the most reviews
n_reviews_max_ios = {} # starts with an empty dictionary
for app in ios_data:
    name = app[1]
    n_reviews = float(app[5])
    if name in n_reviews_max_ios and n_reviews_max_ios[name] < n_reviews:
        n_reviews_max_ios[name] = n_reviews # if the current n_reviews_max is less, we change the value
    elif name not in n_reviews_max_ios:
        n_reviews_max_ios[name] = n_reviews # otherwise, we add a new key-value pair

('Duplicate ios apps: ', 0)


('Including: ', [])


We expect there to be 2 duplicates. So after deleting entries, the **new len of ios_data should be 7195** (previous len minus duplicates).

In [10]:
len(n_reviews_max_ios)

7197

In [11]:
### Isolating ios data to a new data set

ios_clean = [] # stores new cleaned data set
ios_already_added = [] # only store app names

for app in ios_data:
    name = app[1]
    n_reviews = float(app[5])
    
    if (n_reviews_max_ios[name] == n_reviews) and (name not in ios_already_added):
        ios_clean.append(app)
        ios_already_added.append(app[1])
        
print('len of ios_data:', len(ios_data))
print('len of ios_clean: ', len(ios_clean))
print('number of entries removed: ', len(ios_data) - len(ios_clean))

('len of ios_data:', 7197)
('len of ios_clean: ', 7197)
('number of entries removed: ', 0)


Now, we have a separate and clean data sets called `android_clean` and `ios_clean`. We did this by following these steps:
1. Checking for duplicates
2. Getting duplicates with the most reviews
3. Isolating the apps we need in new data sets:
    * unique apps
    * duplicates with most reviews
4. Validating that we have to correct and expected len for both data sets

### Removing non-English apps
We develop apps directed for English-speaking audience so there's no need to include in both data sets apps that are not targeted to the same audience.

Characters that are not typically used in English text can be isolated by using the `ord(character)` function. Originally based on the English alphabet, ASCII encodes __128 specified characters (slice :127)__ into seven-bit integers as shown by the ASCII chart below:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/USASCII_code_chart.png/361px-USASCII_code_chart.png" alt="Alt text that describes the graphic" title="Title text" />

We will create a function called `is_this_english(string)` that will take a string and returns the bool `True` or `False` based on whether there are characters that suggest the app is not directed for English-speaking audience.

In [12]:
### Creating the is_this_english function ###
# this will return False only if there are more than 3 characters
# within the string classified as non_ascii

def is_english(string):
    non_ascii = 0
    
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

is_english('爱奇艺PPS -《欢乐颂2》电视剧热播') # test for is_english function

False

In [13]:
android_english = []
for app in android_clean:
    if is_english(app[0]):
        android_english.append(app)

print('len of android_clean:', len(android_clean))
print('len of android_english: ', len(android_english))
print('number of entries removed: ', len(android_clean) - len(android_english))
        
ios_english = []
for app in ios_clean:
    if is_english(app[1]):
        ios_english.append(app)

print('\n')
print('len of ios_clean:', len(ios_clean))
print('len of ios_english: ', len(ios_english))
print('number of entries removed: ', len(ios_clean) - len(ios_english))

('len of android_clean:', 9659)
('len of android_english: ', 9500)
('number of entries removed: ', 159)


('len of ios_clean:', 7197)
('len of ios_english: ', 7197)
('number of entries removed: ', 0)


### Isolating free apps
We will to this by looping through the price columns in both data sets: `android_english[app][7]` and `ios_english[app][4]`.

In [14]:
### Isolating free apps on android ###

android_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
print('len of android_english: ', len(android_english))
print('len of android_free:', len(android_free))
print('number of entries removed: ', len(android_english) - len(android_free))

('len of android_english: ', 9500)
('len of android_free:', 8760)
('number of entries removed: ', 740)


In [15]:
### Isolating free apps on ios ###

ios_free = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
print('len of ios_english: ', len(ios_english))
print('len of ios_free:', len(ios_free))
print('number of entries removed: ', len(ios_english) - len(ios_free))

('len of ios_english: ', 7197)
('len of ios_free:', 0)
('number of entries removed: ', 7197)


### Clean up funnel

| Derived list | Android | iOS |
| --- | --- |
| _data | 10841 | 7197 |
| (remove wrong data) | 10840 | 7197 |
| _clean | 9659 | 7195 |
| _english | 9614 | 6191 |
| _free | 8864 | 3220 |

---

## Phase 4: Data Exploration
_Form hypothesis to answer what kind of apps we should be building_

### Validation strategy
To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

### Ways to explore data
1. Most common overall genres by number of apps - _what kind of apps do people look for?_
2. Most engaging genres by number of reviews - _what do users use and care about?_
3. Most common apps per recommended genre - _what specific apps can we look into?_

In [16]:
### Most common overall genres by number of apps ###
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for app in dataset:
        total += 1
        value = app[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for value in table:
        percentage = (table[value] / total) * 100
        table_percentages[value] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [17]:
android_genre = display_table(android_free, 1)

('WEATHER', ':', 0)
('VIDEO_PLAYERS', ':', 0)
('TRAVEL_AND_LOCAL', ':', 0)
('TOOLS', ':', 0)
('SPORTS', ':', 0)
('SOCIAL', ':', 0)
('SHOPPING', ':', 0)
('PRODUCTIVITY', ':', 0)
('PHOTOGRAPHY', ':', 0)
('PERSONALIZATION', ':', 0)
('PARENTING', ':', 0)
('NEWS_AND_MAGAZINES', ':', 0)
('MEDICAL', ':', 0)
('MAPS_AND_NAVIGATION', ':', 0)
('LIFESTYLE', ':', 0)
('LIBRARIES_AND_DEMO', ':', 0)
('HOUSE_AND_HOME', ':', 0)
('HEALTH_AND_FITNESS', ':', 0)
('GAME', ':', 0)
('FOOD_AND_DRINK', ':', 0)
('FINANCE', ':', 0)
('FAMILY', ':', 0)
('EVENTS', ':', 0)
('ENTERTAINMENT', ':', 0)
('EDUCATION', ':', 0)
('DATING', ':', 0)
('COMMUNICATION', ':', 0)
('COMICS', ':', 0)
('BUSINESS', ':', 0)
('BOOKS_AND_REFERENCE', ':', 0)
('BEAUTY', ':', 0)
('AUTO_AND_VEHICLES', ':', 0)
('ART_AND_DESIGN', ':', 0)


In [18]:
ios_genre = display_table(ios_free, 11)

__iOS App Store is dominated by leisure apps__ with GAMES taking up more than half of the pie (58.14%). Combining it with other categories related to leisure such as ENTERTAINMENT, LIFESTYLE, TRAVEL, FOOD & DRINK (61.77%), the remaining 38.23% is shared by more practical categories including EDUCATION, PRODUCTIVITY, and BUSINESS.

__Google Play has a more balanced landscape__ with FAMILY, GAMES, and TOOLS being most common. However, the differences are less pronounced than they are in iOS App Store.

In [19]:
### Most engaging genres by number of reviews ###

print('Most reviewed Android Apps')
print('\n')
for genre in freq_table(android_free, 1):
    total = 0
    len_genre = 0
    for app in android_free:
        genre_app = app[1]
        if genre_app == genre:            
            n_ratings = float(app[3])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

print('\n')
print('Most reviewed iOS Apps')
print('\n')
    
for genre in freq_table(ios_free, 11):
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Most reviewed Android Apps


('LIBRARIES_AND_DEMO', ':', 10564.037974683544)
('SHOPPING', ':', 226042.02030456852)
('BUSINESS', ':', 24239.727272727272)
('ENTERTAINMENT', ':', 305205.71428571426)
('MEDICAL', ':', 3753.2861736334407)
('MAPS_AND_NAVIGATION', ':', 145978.80991735536)
('LIFESTYLE', ':', 34075.09037900875)
('GAME', ':', 691572.0271867613)
('BOOKS_AND_REFERENCE', ':', 73424.84042553192)
('AUTO_AND_VEHICLES', ':', 14217.567901234568)
('HOUSE_AND_HOME', ':', 27812.057971014492)
('BEAUTY', ':', 7476.226415094339)
('COMICS', ':', 45616.98039215686)
('PHOTOGRAPHY', ':', 404081.3754789272)
('PARENTING', ':', 16913.339285714286)
('WEATHER', ':', 175771.34782608695)
('ART_AND_DESIGN', ':', 24699.42105263158)
('PERSONALIZATION', ':', 182426.62847222222)
('DATING', ':', 22207.28834355828)
('EVENTS', ':', 2555.84126984127)
('FAMILY', ':', 114086.83664858348)
('HEALTH_AND_FITNESS', ':', 78671.31365313653)
('VIDEO_PLAYERS', ':', 427904.43670886074)
('FINANCE', ':', 37600.61042944785)
('P

### Most Rated Android Genre 
_We will focus our development efforts on Android. Higher reward across the board than iOS (downloads, competition, user engagement)._
1. COMMUNICATION : 995608.4634146341
2. SOCIAL : 965830.9872881356
3. GAME : 683523.8445475638 **also among with the most apps**
4. VIDEO_PLAYERS : 425350.081761006
5. PHOTOGRAPHY : 404081.3754789272 **category with fewer apps but more invested users**
6. TOOLS : 305732.8973333333 **moderate number of apps and good engagement**
7. ENTERTAINMENT : 301752.24705882353
8. SHOPPING : 223887.34673366835
9. PERSONALIZATION : 181122.31632653062
10. WEATHER : 171250.77464788733
11. PRODUCTIVITY : 160634.542028985527
12. MAPS_AND_NAVIGATION : 142860.0483870968

### Most Rated iOS Genre 
1. Navigation : 86090.33333333333
2. Reference : 74942.11111111111
3. Social Networking : 71548.34905660378
4. Music : 57326.530303030304 **good competition and engagement**
5. Weather : 52279.892857142855
6. Book : 39758.5
7. Food & Drink : 33333.92307692308
8. Finance : 31467.944444444445
9. Photo & Video : 28441.54375
10. Travel : 28243.8
11. Shopping : 26919.690476190477
12. Health & Fitness : 23298.015384615384

In [20]:
### Most common apps per recommended genre ###

print('Most reviewed Android GAME apps')
for app in android_free:
    genre = app[1]
    n_ratings = float(app[3])
    if genre == 'GAME' and n_ratings > 20000000:
        print(app[0], ':', app[3])
        
print('\n')

print('Most reviewed Android PHOTOGRAPHY apps')
for app in android_free:
    genre = app[1]
    n_ratings = float(app[3])
    if genre == 'PHOTOGRAPHY' and n_ratings > 5000000:
        print(app[0], ':', app[3])

Most reviewed Android GAME apps
('Candy Crush Saga', ':', '22430188')
('Subway Surfers', ':', '27725352')
('Clash Royale', ':', '23136735')
('Clash of Clans', ':', '44893888')


Most reviewed Android PHOTOGRAPHY apps
('B612 - Beauty & Filter Camera', ':', '5282578')
('Google Photos', ':', '10859051')
('Retrica', ':', '6120977')
('PicsArt Photo Studio: Collage Maker & Pic Editor', ':', '7594559')
('PhotoGrid: Video & Pic Collage Maker, Photo Editor', ':', '7529865')


---

## Phase 5: Insights

### Recommended App Profiles to Generate Profit

#### GAMES
* High competition
* Enough people who are interested and engaged (in terms of leaving feedback)
* Ads can be put strategically (to earn extra life, power up, or currency within the game)
    * in-your-face ads - while the next game loads
    * user-activated ads - reward user behavior
* Digital product (vs SHOPPING apps that have engagement but also demands a lot of offline work)

#### PHOTOGRAPHY
* Less competition
* High engagement from users
* Ads can be put strategically
    * limited edition/ bonus stickers/filters
* Digital product (everything is in the app)

<br>

<div class="alert alert-block alert-info">
<b>Note:</b> The results between Android and iOS apps have marked differences in terms of categories. Because Android still has a dominant worldwide market share of 74.85% by April 2019, we decided to focus our development efforts on leading Android categories. 

<br>

Since <b>GAMES</b> and <b>PHOTOGRAPHY</b> are social apps-- meaning users can be very public about the specific apps they use, we can park our iOS development efforts until their active users actually express their interest in joining the bandwagon.
</div>