# Profitable App Profiles for the App Store and Google Play Markets

For the purpose of this project I am imagining myself as a data analyst that works for a company that build Android and iOS mobile apps. These apps are available on Google Play and the App Store. We only build apps that are free to download and install, and our main source of revenue is in-app ads. Therefore, our revenue for any given app is primarily determined by how many people are using said app. 

The goal of this project is to analyse the data to help our developers understand what type of apps are likely to attract more users.

Data used:
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing approximately 10,000 android apps from Google Play, collected in August 2018.
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing approximately 7,000 iOS apps from the App Store, collected in July 2017

## Data exploration

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google_data = list(read_file)

In [4]:
explore_data(apps_data, 0, 3, rows_and_columns=True)
explore_data(google_data, 0, 3, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone

## Data cleaning

In [5]:
print(google_data[10473])
del google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Now to check for duplicates in the Google Play data:

In [6]:
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Rather than remove duplicates randomly, the entry with the most reviews will be kept as this will be the most recent entry

In [7]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
len(reviews_max)

9659

In [8]:
android_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
len(android_clean)
explore_data(android_clean, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




Next, we need to remove any apps that are not aimed at english-speaking audiences. According to the ASCII (American Standard Code for Information Interchange) system, the numbers corresponding to the characters commonly used in English text are all in the range 0 to 127.

In [9]:
def english_check(app_name):
    count = 0
    for character in app_name:
        code = ord(character)
        
        if code > 127:
            count += 1
        
        if count > 3:
            return False
        
    return True

print(english_check('Instagram'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))

True
False
True
True


In [10]:
english_google_data = []
english_apps_data = []

for app in android_clean[1:]:
    name = app[0]
    if english_check(name):
        english_google_data.append(app)
        
for app in apps_data[1:]:
    name = app[0]
    if english_check(name):
        english_apps_data.append(app)

print(len(android_clean))        
print(len(english_google_data))
print(len(apps_data))
print(len(english_apps_data))

9659
9613
7198
7197


Now we need to isolate free apps

In [11]:
free_android = []
free_apple = []

for app in english_google_data:
    cost = app[6]
    if cost == 'Free':
        free_android.append(app)
        
for app in english_apps_data:
    cost = float(app[4])
    if cost == 0.0:
        free_apple.append(app)
        
print(len(free_android))
print(len(free_apple))
    

8862
4056


## Data analysis

We want to find an app profile that will be popular on both the App Store and Google Play. Firstly, let's see what the most common genres are for each market.

In [12]:
explore_data(google_data, 0, 1)
explore_data(apps_data, 0, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




It appears the most relevant column in the App Store dataset is the 'prime_genre' column and for the Play Store dataset, the 'Genres' and 'Category' columns.

In [13]:
def freq_table(dataset, index):
    new_dict = {}
    
    for row in dataset:
        value = row[index]
        
        if value in new_dict:
            new_dict[value] += 1
        else:
            new_dict[value] = 1
            
    total = len(dataset)
    
    for value in new_dict:
        new_dict[value] = (new_dict[value] / total) * 100
           
            
    return new_dict

{'Book': 1.6272189349112427,
 'Business': 0.4930966469428008,
 'Catalogs': 0.22189349112426035,
 'Education': 3.2544378698224854,
 'Entertainment': 8.234714003944774,
 'Finance': 2.0710059171597637,
 'Food & Drink': 1.0601577909270217,
 'Games': 55.64595660749507,
 'Health & Fitness': 1.8737672583826428,
 'Lifestyle': 2.3175542406311638,
 'Medical': 0.19723865877712032,
 'Music': 1.6518737672583828,
 'Navigation': 0.4930966469428008,
 'News': 1.4299802761341223,
 'Photo & Video': 4.117357001972387,
 'Productivity': 1.5285996055226825,
 'Reference': 0.4930966469428008,
 'Shopping': 2.983234714003945,
 'Social Networking': 3.5256410256410255,
 'Sports': 1.947731755424063,
 'Travel': 1.3806706114398422,
 'Utilities': 2.687376725838264,
 'Weather': 0.7642998027613412}

Now to convert the dictionaries into a list of tuples so we can order them descending

In [20]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
#display_table(free_apple, 11)        
#display_table(free_android, 9)
#display_table(free_android, 1)

FAMILY : 18.900925299029563
GAME : 9.726923944933423
TOOLS : 8.463100880162491
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5319341006544795
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.162265854208982
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.6

Within our subset, 'Games' and 'Entertainment' are the most common genres on the App Store. Within the Play Store, 'Tools' and 'Entertainment' are the most common genres and 'Family', 'Game' and 'Tools' are the most common categories. 

Let's analyse the average rating per genre in the App Store:

In [22]:
prime_genre_freq = freq_table(free_apple, 11)

for genre in prime_genre_freq:
    total = 0
    len_genre = 0
    
    for row in free_apple:
        genre_app = row[11]
        
        if genre_app == genre:
            user_ratings = float(row[7])
            total += user_ratings
            len_genre += 1
            
    average = total / len_genre
    print(genre, average)

Education 3.484848484848485
Games 3.5285777580859548
News 2.8793103448275863
Productivity 3.9596774193548385
Food & Drink 3.0348837209302326
Navigation 2.2
Entertainment 3.1482035928143715
Photo & Video 3.7934131736526946
Book 1.5984848484848484
Lifestyle 2.5904255319148937
Business 3.5
Weather 3.2580645161290325
Utilities 3.4541284403669725
Reference 3.3
Travel 3.375
Music 3.9402985074626864
Medical 2.875
Social Networking 2.9965034965034967
Sports 2.9177215189873418
Finance 2.2202380952380953
Shopping 3.5330578512396693
Health & Fitness 3.5789473684210527
Catalogs 1.8333333333333333


Now let's calculate the average number of installs per app genre for the Google Play dataset.

In [23]:
category_freq = freq_table(free_android, 1)

for category in category_freq:
    total = 0
    len_category = 0
    
    for row in free_android:
        category_app = row[1]
        
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs