# Profitable App Profiles for the App Store and Google Play Markets

This is a __data product__ submitted to a developer agency specializing in _Android_ and _iOS_ mobile apps. The agency only built apps that were free to download and install, and its main source of revenue was in-app ads.

This meant that the number of users determined the overall revenue for any given app: the more users who see and engage with the ads, the better.

* My goal for this project was to analyze data to help their developers understand what _type_ of apps were more likely to attract more users.
* For this data product, I left in my original code comments to better illustrate my thought process.
---

## Opening and Exploring the Data

To avoid paying for expensive data, I found two data sets that seemed like suitable starting points for analysis.

* A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data on approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data on approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

### Open and explore both datasets

In [2]:
from csv import reader

# Open the Android csv file and read every row into a list
# (the with-block closes the file once the rows are loaded)
with open('googleplaystore.csv', encoding='utf8') as android_file:
    android_data = list(reader(android_file))

# Open the iOS csv file and read every row into a list
with open('AppleStore.csv', encoding='utf8') as ios_file:
    ios_data = list(reader(ios_file))

#### Helper function to quickly view app data

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    
    data_set_slice = dataset[start:end]
    print('Headers:', dataset[0], '\n')

    for row in data_set_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset[1:]))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Explore first 2 rows of Android data set and
# get baseline of columns and rows
explore_data(android_data, 1, 3, True)

Headers: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [5]:
explore_data(ios_data, 1, 3, True)

Headers: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


### Clean data

Best practice when getting acquainted with a new data set is to clean it up first: check for duplicates, missing data, and anything else that could undermine the analysis, so that the work rests on a solid data foundation.

Sorting the discussions at the data source by _most commented_ shows that others had already [discovered some bad data](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=votes).

To fix this, we can compare every row against the header row and make sure they have the same length. If a row's length differs, it is an entry with bad data and is safe to remove.

In [6]:
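header_length = len(android_data[0])
found = None

# Compare every data row against the header length;
# a mismatch means the row is missing (or has extra) values
for index, row in enumerate(android_data[1:]):
    if len(row) != header_length:
        print('index of bad entry:', index, '\n\n')
        print('Bad Row:', row, '\n\n')
        found = row

# Only remove a row if a bad one was actually found
if found is not None:
    android_data.remove(found)
print('Data after deletion:\n', android_data[10472])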

index of bad entry: 10472 


Bad Row: ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 


Data after deletion:
 ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
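
A minimal sketch of the same length check packaged as a reusable function that returns every mismatched row instead of a single one. The function name `find_malformed_rows` and the three-column `sample` data are made up for illustration and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs for data rows whose length differs from the header.

    Assumes dataset[0] is the header row; indices are positions in `dataset`.
    """
    header_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if index > 0 and len(row) != header_length
    ]


# Hypothetical three-column sample; the second data row is missing a value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '3.9'],
    ['Gamma', 'GAMES', '4.5'],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

Returning the index alongside the row makes it possible to delete by position (`del dataset[index]`) rather than by value.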


### Duplicate entries

It is good practice to always check for duplicates. The data community also [flagged](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/67894?sort=votes) duplicates.

- After looping through the data, it turns out there are quite a few duplicates: 1,181

These entries would have skewed any meaningful analysis.

In [7]:
duplicate_apps = []
unique_apps = []

for app in android_data[1:]:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Random sampling of duplicate names:', duplicate_apps[100:106], '\n\n')
print('Total num of duplicates:', len(duplicate_apps), '\n\n')

Random sampling of duplicate names: ['Meet4U - Chat, Love, Singles!', '95Live -SG#1 Live Streaming App', 'Just She - Top Lesbian Dating', 'Hily: Dating, Chat, Match, Meet & Hook up', 'O-Star', 'Random Video Chat'] 


Total num of duplicates: 1181 
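
As an alternative to checking membership in a list for every row, the duplicate count can be sketched with `collections.Counter` from the standard library. The helper name `count_duplicates` and the sample rows (including their review counts) are hypothetical and only serve to show the idea.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return {name: extra_copies} for names that appear more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Hypothetical rows: [app name, review count]
sample_rows = [
    ['App A', '100'],
    ['App A', '120'],
    ['App B', '50'],
    ['App A', '90'],
]

extras = count_duplicates(sample_rows)
print(extras)
print('Total num of duplicates:', sum(extras.values()))
```

Each name's first occurrence is treated as the original, so three rows for one app count as two duplicates, which matches how the loop above counts them.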




#### Only keep the app entry that has the highest customer review count

In [8]:
reviews_max = dict()

for app in android_data[1:]:
    name = app[0]
    reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    
    if name not in reviews_max:
        reviews_max[name] = reviews

# print('Length of original Android data', len(android_data[1:]))
# ==> Length of Android 10840

# If this worked, the number of entries in our 'reviews_max' dictionary
# should be (10,840 - 1,181) == 9,659
print(len(reviews_max)) # => 9,659


# Using the dictionary we built, we can now selectively remove
# the duplicates!

android_clean = []
already_added = []

for app in android_data[1:]:
    name = app[0]
    reviews = float(app[3])
    
    if reviews_max[name] == reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
9659
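
The two passes above (build the maximum per name, then filter) can also be sketched as a single pass that stores the best row per name directly. This is an illustrative variant, not the notebook's method; `keep_most_reviewed` and the sample rows are assumptions.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count.

    Relies on dicts preserving insertion order (Python 3.7+), so the result
    is ordered by each name's first appearance.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows: [app name, review count]
sample_rows = [
    ['App A', '10'],
    ['App B', '5'],
    ['App A', '30'],
    ['App A', '20'],
]

deduped = keep_most_reviewed(sample_rows, name_index=0, reviews_index=1)
print(deduped)
```

When two rows for the same app tie on review count, the first one seen is kept, which is the same tie-breaking the `already_added` list provides.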


#### Removing Non-English Apps

Since your agency only makes apps for an English-speaking audience, only English apps were used for this data product.

In [9]:
# print(len(ios_data))       >>> 7198
# print(len(android_clean))  >>> 9659

# Edge cases to let through the filter (values are Unicode code points,
# all above the ASCII range of 0-127)
# "–" (en dash) = 8211
# "™" = 8482
# "®" = 174



# HELPER FUNCTIONS 

def check_for_good_symbol(letter):
    approved_symbs = [8211, 8212, 8482, 174, 65293, 233, 9412,
              8480, 1057, 65281, 8252, 8226, 8217, 967, 1088]

    return ord(letter) in approved_symbs

def not_english(app_name):
    for letter in app_name:
        if check_for_good_symbol(letter):
            continue
        if ord(letter) > 127:
            return True
    return False


# Function that takes a dataset and
# returns two datasets, English and NonEnglish

def sift_for_english(dataset, appNameIndex=1):
    # NB: find the index of the app name column in the dataset and
    # use that for "appNameIndex" - defaults to the 2nd column
    english_apps = []
    non_english_apps = []
    
    for app in dataset[1:]:
        name = app[appNameIndex]
    
        if not_english(name):
            non_english_apps.append(app)
            continue
        else:
            english_apps.append(app)
        
    return (english_apps, non_english_apps)

    
ios_english, ios_non_english = sift_for_english(ios_data, appNameIndex=1)

android_english, android_non_english = sift_for_english(android_clean, 1)

print('original android dataset:', len(android_clean[1:]))
print('android english apps:', len(android_english))
print('android non-english apps:', len(android_non_english), '\n')

print('original iOS dataset:', len(ios_data[1:]))
print('iOS english apps:', len(ios_english))
print('iOS non-english apps:', len(ios_non_english))

original android dataset: 9658
android english apps: 9658
android non-english apps: 0 

original iOS dataset: 7197
iOS english apps: 6096
iOS non-english apps: 1101


#### Remove apps that cost money

Your agency's app portfolio shows most of your products are free-to-download.

Once the paid apps are filtered out, the datasets are in a good state to begin analysis.

In [10]:
"""
    Loop through each dataset to isolate the free apps in separate lists.
        Make sure you identify the columns describing the app price correctly.
        Prices come up as strings ('0', $0.99, $2.99, etc.), so make sure you're not checking an integer or a float in your conditional statements.

    After you isolate the free apps, check the length of each dataset to see how many apps you have remaining.

"""

# android english apps: 9658 > android_english
# iOS english apps: 6096 > ios_english

ios_free = []
android_free = []

for app in ios_english:
    price = float(app[4])
    
    if price == 0:
        ios_free.append(app)
    
for app in android_english:
    price = app[7]
    
    # Free Android apps carry the price string '0'; paid ones look like '$0.99'
    if price == '0':
        android_free.append(app)
        
print('Original android dataset:', len(android_english))
print('Free android apps:', len(android_free), '\n')

print('Original iOS dataset:', len(ios_english))
print('Free iOS apps:', len(ios_free))


Original android dataset: 9658
Free android apps: 8904 

Original iOS dataset: 6096
Free iOS apps: 3165
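
Because the two stores format prices differently, one option is a single helper that normalizes the price string before comparing. This is a sketch under the assumption that prices only ever look like the examples mentioned above (`'0'`, `'0.0'`, `'$0.99'`); the name `is_free` is made up for illustration.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.99' equals zero."""
    return float(price_text.replace('$', '')) == 0


# Price strings in the formats described for the two datasets
for price in ['0', '0.0', '$0.99', '$2.99']:
    print(price, '->', is_free(price))
```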


# Analysis

#### Sanitized data sets: `ios_free`, `android_free` (Lists)

**Primary objective**: Provide analytical guidance and data-driven suggestions to help you and your team decide how your agency selects new app products to build.

*Foundations*:

We want to know which app genres were downloaded the most, and which genres generated the most user engagement as measured by reviews and ratings.

### Data Rows

In [11]:
# The first entry of each data list holds the column names
print('Android data columns:\n\n', android_data[0], '\n')
print('iOS data columns:\n\n', ios_data[0])

Android data columns:

 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

iOS data columns:

 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Right away, a few columns stand out.

- Android: Category, Genres, Installs
- iOS: rating_count_tot, prime_genre, user_rating

In the lists below, an X marks the columns that stand out.

__Android columns__

* App
* Category X
* Rating
* Installs X
* Type
* Price
* Genres X

__iOS columns__

* track_name
* price
* rating_count_tot X (user rating counts for all versions)
* user_rating X
* prime_genre X

In [12]:
# CLEANED DATASETS
# ios_free
# android_free

# android > category[1] > installs[5] > genres[9]
# ios > prime_genre[11] > rating_count_tot[5]


android_histo = {}
ios_histo = {}

# iOS histogram

for app in ios_free:
    genre = app[11]  # prime_genre
    
    if genre in ios_histo:
        ios_histo[genre] += 1
    else:
        ios_histo[genre] = 1

sorted_ios_apps = []

for k, v in ios_histo.items():
    sorted_ios_apps.append((v, k))

# Android histogram

for app in android_free:
    genre = app[9]  # Genres column (Category, at index 1, is not counted here)
    
    if genre in android_histo:
        android_histo[genre] += 1
    else:
        android_histo[genre] = 1

sorted_android_apps = []

for k, v in android_histo.items():
    sorted_android_apps.append((v, k))
    
print('iOS genre counts (first 10, unsorted):')
print(sorted_ios_apps[:10])
# print('Top Android by app genre:')
# print(sorted_android_apps)

iOS genre counts (first 10, unsorted):
[(104, 'Social Networking'), (159, 'Photo & Video'), (1849, 'Games'), (65, 'Music'), (17, 'Reference'), (63, 'Health & Fitness'), (28, 'Weather'), (76, 'Utilities'), (37, 'Travel'), (81, 'Shopping')]


## iOS & Android Histograms

iOS Top Genres

- 1849 > Games
- 251 > Entertainment
- 159 > Photo & Video
- 117 > Education
- 104 > Social Networking
- 81 > Shopping
- 76 > Utilities
- 68 > Sports
- 65 > Music
- 63 > Health & Fitness
- 54 > Productivity
- 49 > Lifestyle
- 43 > News
- 37 > Travel
- 35 > Finance
- 28 > Weather
- 26 > Food & Drink
- 17 > Reference
- 15 > Business
- 12 > Book
- 6 > Medical
- 6 > Navigation
- 4 > Catalogs
 
Android Top Genres

- 750 > Tools
- 542 > Entertainment
- 480 > Education
- 408 > Business
- 349 > Lifestyle
- 346 > Productivity
- 328 > Finance
- 313 > Medical
- 307 > Sports
- 295 > Personalization
- 288 > Communication
- 275 > Action
- 273 > Health & Fitness
- 262 > Photography
- 252 > News & Magazines
- 236 > Social
- 206 > Travel & Local
- 200 > Shopping
- 194 > Books & Reference
- 184 > Simulation
- 165 > Dating
- 164 > Arcade
- 158 > Video Players & Editors
- 156 > Casual
- 126 > Maps & Navigation
- 110 > Food & Drink
- 100 > Puzzle
- 88 > Racing
- 83 > Libraries & Demo
- 82 > Auto & Vehicles
- 73 > House & Home
- 71 > Weather
- 63 > Events
- 61 > Adventure
- 55 > Comics
- 53 > Art & Design
- 44 > Parenting
- 40 > Card
- 38 > Casino
- 35 > Educational;Education
- 34 > Board
- 33 > Educational
- 31 > Education;Education
- 23 > Word
- 21 > Casual;Pretend Play
- 18 > Music
- 15 > Entertainment;Music & Video
- 12 > Casual;Action & Adventure
- 11 > Arcade;Action & Adventure
- 9 > Action;Action & Adventure
- 8 > Educational;Pretend Play
- 7 > Board;Brain Games
- 6 > Art & Design;Creativity
- 5 > Education;Pretend Play
- 4 > Education;Creativity
- 3 > Adventure;Action & Adventure
- 2 > Board;Action & Adventure
- 1 > Adventure;Education

### Identifying the genres with the highest user engagement

To uncover the most popular app genres in terms of user base, one approach is to compute the average number of installs for each genre. The App Store data has no installs column, so for iOS the total number of user ratings (`rating_count_tot`) serves as a proxy.

In [13]:
"""
Android data columns:

 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating',
  'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

iOS data columns:

 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating',
  'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

"""

# CLEANED DATASETS
# ios_free
# android_free

# HISTOGRAMS
ios_top_genres = dict(reversed(sorted(sorted_ios_apps)))
android_top_genres = dict(reversed(sorted(sorted_android_apps)))

print('iOS top genres:', ios_top_genres, '\n\n', 'genre count', len(ios_top_genres), '\n')
print('android top genres:', android_top_genres, '\n\n', 'genre count', len(android_top_genres))

iOS top genres: {1849: 'Games', 251: 'Entertainment', 159: 'Photo & Video', 117: 'Education', 104: 'Social Networking', 81: 'Shopping', 76: 'Utilities', 68: 'Sports', 65: 'Music', 63: 'Health & Fitness', 54: 'Productivity', 49: 'Lifestyle', 43: 'News', 37: 'Travel', 35: 'Finance', 28: 'Weather', 26: 'Food & Drink', 17: 'Reference', 15: 'Business', 12: 'Book', 6: 'Medical', 4: 'Catalogs'} 

 genre count 22 

android top genres: {750: 'Tools', 542: 'Entertainment', 480: 'Education', 408: 'Business', 349: 'Lifestyle', 346: 'Productivity', 328: 'Finance', 313: 'Medical', 307: 'Sports', 295: 'Personalization', 288: 'Communication', 275: 'Action', 273: 'Health & Fitness', 262: 'Photography', 252: 'News & Magazines', 236: 'Social', 206: 'Travel & Local', 200: 'Shopping', 194: 'Books & Reference', 184: 'Simulation', 165: 'Dating', 164: 'Arcade', 158: 'Video Players & Editors', 156: 'Casual', 126: 'Maps & Navigation', 110: 'Food & Drink', 100: 'Puzzle', 88: 'Racing', 83: 'Libraries & Demo', 82:
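
Using the count as the dictionary key means two genres with the same count overwrite each other, which is why the iOS genre count above is 22 rather than 23 (Medical and Navigation both have 6 apps). A minimal sketch of ranking that keeps the genre as the key; `rank_genres` is a hypothetical helper, and the sample counts are taken from the iOS figures in this notebook.

```python
def rank_genres(histogram):
    """Return (genre, count) pairs sorted by count, highest first.

    Genres stay the unique key, so equal counts do not collide.
    """
    return sorted(histogram.items(), key=lambda item: item[1], reverse=True)


# Four iOS genres with their counts from this notebook; two share a count of 6
sample_histo = {'Games': 1849, 'Medical': 6, 'Navigation': 6, 'Travel': 37}

ranked = rank_genres(sample_histo)
for genre, count in ranked:
    print('-', count, '>', genre)
```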

In [57]:
basket = {}

for app in ios_free:
    genre = app[11]
    rating = float(app[5])
#     print(genre)
#     print(rating, '\n')
    
    if genre not in basket:
        basket[genre] = rating
    else:
        basket[genre] += rating
    
    
"""

basket
{'Games': 10594110.0, 'Music': 560667.0, 'Entertainment': 894076.0, 'Sports': 927512.0,
'Social Networking': 1055267.0, 'Photo & Video': 524867.0, 'Shopping': 661336.0, 'Food & Drink': 258624.0,
'Book': 252076.0, 'Finance': 466210.0, 'Travel': 369434.0, 'Weather': 691603.0, 'Reference': 200047.0,
'Education': 162701.0, 'Productivity': 297027.0, 'Navigation': 154911.0, 'Lifestyle': 143040.0,
'Health & Fitness': 136833.0, 'News': 132703.0, 'Utilities': 257398.0}

[(4, 'Catalogs'), (6, 'Medical'), (6, 'Navigation'), (12, 'Book'), (15, 'Business'),
(17, 'Reference'), (26, 'Food & Drink'), (28, 'Weather'), (35, 'Finance'),
(37, 'Travel'), (43, 'News'), (49, 'Lifestyle'), (54, 'Productivity'),
(63, 'Health & Fitness'), (65, 'Music'), (68, 'Sports'), (76, 'Utilities'), (81, 'Shopping'),
(104, 'Social Networking'), (117, 'Education'), (159, 'Photo & Video'), (251, 'Entertainment'),
(1849, 'Games')]


"""

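# Helper Function - takes a dictionary, swaps the key value pairs for easy sorting/ranking
# returns List of (value, key) tuples
def swap(mapping):
    # parameter is named `mapping` so it does not shadow the built-in `dict`
    final = []
    
    for k, v in mapping.items():
        final.append((v, k))
    
    return final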

total_app_genres = {}

for k, v in sorted(sorted_ios_apps):
    total_app_genres[v] = k

# print(total_app_genres)

avg_ratings_per_genre = {}

# basket holds the summed rating counts per genre;
# total_app_genres holds the number of apps per genre
for genre in basket:
    if genre in total_app_genres:
        avg_ratings_per_genre[genre] = round(basket[genre] / total_app_genres[genre])

ios_avg = swap(avg_ratings_per_genre)

# for k, v in list(reversed(sorted(ios_avg))):
#     print('-', v, ':', k)

### Results

Average number of user ratings per app, by iOS genre:

- Navigation : 86090
- Reference : 79350
- Social Networking : 72917
- Music : 58205
- Weather : 52280
- Book : 46385
- Food & Drink : 33334
- Finance : 32367
- Travel : 30524
- Photo & Video : 28619
- Shopping : 27899
- Health & Fitness : 24038
- Sports : 23102
- Games : 22734
- Productivity : 21799
- News : 21248
- Utilities : 19423
- Lifestyle : 17155
- Entertainment : 14195
- Business : 6840
- Education : 6011
- Catalogs : 4004
- Medical : 612
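
The average above is a per-genre sum divided by a per-genre app count. A compact sketch of that calculation in a single pass, assuming rows shaped like the iOS data; `average_by_genre` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_genre(rows, genre_index, value_index):
    """Return {genre: rounded mean of the value column} in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        genre = row[genre_index]
        totals[genre] += float(row[value_index])
        counts[genre] += 1
    return {genre: round(totals[genre] / counts[genre]) for genre in totals}


# Hypothetical rows: [app name, genre, total rating count]
sample_rows = [
    ['App A', 'Navigation', '100'],
    ['App B', 'Navigation', '300'],
    ['App C', 'Reference', '50'],
]

averages = average_by_genre(sample_rows, genre_index=1, value_index=2)
print(averages)
```

Keeping the sum and the count side by side avoids depending on a separately built histogram staying in sync with the totals.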

# Average Installs per Genre - Android 

In [42]:
"""
Android data columns:

 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating',
  'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
  
"""

# HISTOGRAMS
# ios_histo
# android_histo

total_installs_by_genre = {}

for app in android_free:
    genre = app[9]
    installs = app[5].replace('+', '')
    installs = int(installs.replace(',', ''))
    
    if genre in total_installs_by_genre:
        total_installs_by_genre[genre] += installs
    else:
        total_installs_by_genre[genre] = installs

# total_installs_by_genre
# android_histo

avg_installs = {}

for genre in android_histo.keys():
    avg_installs[genre] = round(total_installs_by_genre[genre] / android_histo[genre])

android_install_ranks = swap(avg_installs)

# Sort the genres by install
# for v, k in reversed(sorted(android_install_ranks)):
#     print('-', k, v)

### Results

Average number of installs per app, by Android genre (top 16):

- Communication 38322626
- Adventure;Action & Adventure 35333333
- Video Players & Editors 24790074
- Social 23253652
- Arcade 22888365
- Casual 19569222
- Puzzle;Action & Adventure 18366667
- Photography 17772019
- Educational;Action & Adventure 17016667
- Productivity 16738958
- Racing 15910646
- Travel & Local 14051476
- Casual;Action & Adventure 12916667
- Action 12603589
- Strategy 11124294
- Tools 10788059
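
Install figures in the Google Play data are open-ended buckets such as `'10,000+'`, so the parsed number is a lower bound and the averages above should be read as approximate. A small sketch of the parsing step on its own; `parse_installs` is a made-up helper name, and the sample strings are the install values shown in the first two Android rows.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


for raw in ['10,000+', '500,000+']:
    print(raw, '->', parse_installs(raw))
```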

## Conclusions & Action Items

Based on the average installs per genre (Android) and the average number of user ratings per genre (iOS), I recommend selecting any of the following genres/categories for your agency's next app product. Their popularity and review engagement give your agency the best possible foundation for generating ad revenue.

Additionally, I recommend setting up a *Voice of the Customer* program that employs a closed-loop process. A VoC program will better position your agency to identify problems and opportunities so you can take the appropriate next steps. [Further reading](https://monkeylearn.com/blog/voice-of-customer-analysis/).

### Avg Installs (Android)
- Communication 38322626
- Adventure;Action & Adventure 35333333
- Video Players & Editors 24790074
- Social 23253652
- Arcade 22888365
- Casual 19569222
- Puzzle;Action & Adventure 18366667
- Photography 17772019
- Educational;Action & Adventure 17016667
- Productivity 16738958

### Avg Ratings (iOS)
- Navigation : 86090
- Reference : 79350
- Social Networking : 72917
- Music : 58205
- Weather : 52280
- Book : 46385
- Food & Drink : 33334
- Finance : 32367
- Travel : 30524
- Photo & Video : 28619