# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost. Luckily, there are two datasets that seem suitable for our goals:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the dataset directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the dataset directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

We'll start by opening and exploring these two datasets. To make them easier for you to explore, we created a function named `explore_data()` that you can repeatedly use to print rows in a readable way.

In [20]:
from csv import reader

def open_data(data, header=True):
    # Read the whole CSV into a list of rows; the file is closed on exit
    with open(data, encoding="utf8") as opened_file:
        apps_data = list(reader(opened_file))
    # Return (rows, header); the header slot is a placeholder when absent
    if header:
        return apps_data[1:], apps_data[0]
    return apps_data[:], "No header"

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


applestore = open_data('AppleStore.csv')
googleplaystore = open_data('googleplaystore.csv')

apple_data = applestore[0]
google_data = googleplaystore[0]

print("Apple store data:")
explore_data(apple_data, 0, 5, rows_and_columns=True)

print('\n')

print('Google Play store:')
explore_data(google_data, 0, 5, rows_and_columns=True)

Apple store data:
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows: 7197
Number of columns: 17


Google Play store:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art &

At our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

Thus we do data cleaning before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis. We begin by detecting and deleting wrong data.

Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and we can see that its rating is 19. This is clearly wrong because the maximum rating for a Google Play app is 5 (as mentioned in the discussions section, this problem is caused by a missing value in the 'Category' column). As a consequence, we'll delete this row.

In [21]:
print(len(google_data))
del google_data[10472]
print(len(google_data))
# for row in google_data:
#    for item in row:
#        if len(item) < 1:
#            break
#    else:
#        continue 

10841
10840
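
Rows like this can also be found without inspecting them by hand, by comparing each row's length with the header's. The sketch below uses a small made-up dataset; `find_malformed_rows`, `header`, and `rows` are illustrative names that do not appear in the notebook, and it assumes that a missing value leaves the row shorter than the header.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data for illustration only; the real files have many more columns.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],            # 'Category' missing, so the rating shifted left
    ['App C', 'GAME', '3.9'],
]

for index, row in find_malformed_rows(rows, header):
    print(index, row)
```

A check like this only flags rows with the wrong number of columns; a row with the right length but a wrong value would still need a separate range check.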


There is also discussion of duplicate apps in the data set. [This discussion](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) in particular raises the issue of the 'name' column having non-unique values.

Our next step is to find all the duplicates within the dataset, and then use this information to build a criterion for removing them. Rather than removing duplicates randomly, we'll keep only the row with the highest number of reviews for any given app and remove the other entries, on the reasoning that the higher the number of reviews, the more recent the data should be.

To remove the duplicates, we will do the following:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).
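
Before removing anything, it can help to see how many names actually occur more than once. The following is a minimal sketch on made-up rows; `count_duplicate_names` and `sample_rows` are hypothetical names introduced only for illustration.

```python
def count_duplicate_names(rows, name_index):
    """Map each app name that occurs more than once to its number of occurrences."""
    counts = {}
    for row in rows:
        name = row[name_index]
        counts[name] = counts.get(name, 0) + 1
    return {name: n for name, n in counts.items() if n > 1}


# Toy rows for illustration: [name, category, rating, reviews]
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'TOOLS', '4.0', '30'],
]

duplicates = count_duplicate_names(sample_rows, name_index=0)
print(duplicates)  # {'App A': 2}
```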

In [22]:
reviews_max_android = {}
reviews_max_apple = {}

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max_android and reviews_max_android[name] < n_reviews:
        reviews_max_android[name] = n_reviews
    if name not in reviews_max_android:
        reviews_max_android[name] = n_reviews

for row in apple_data:
    name = row[2]
    n_reviews = float(row[6])
    if name in reviews_max_apple and reviews_max_apple[name] < n_reviews:
        reviews_max_apple[name] = n_reviews
    if name not in reviews_max_apple:
        reviews_max_apple[name] = n_reviews

count = 0
for item in reviews_max_apple:
    if count < 4:
        print(item)
        count += 1
print(reviews_max_apple['PAC-MAN Premium'])

PAC-MAN Premium
Evernote - stay organized
WeatherBug - Local Weather, Radar, Maps, Alerts
eBay: Best App to Buy, Sell, Save! Online Shopping
21292.0


Both datasets have apps with names that suggest they are not designed for an English-speaking audience. We're not interested in keeping these apps, so we'll remove them by removing each app with a name containing a symbol that isn't commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /). 

If an app name contains a character whose code point is greater than 127 (outside the ASCII range), the app probably has a non-English name. Next we'll write a function that checks each character in a string and returns a boolean indicating whether the string meets this criterion. To minimize data loss, we'll only remove an app if its name has more than three characters falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.
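
To see how the three-character threshold behaves, here is a small standalone sketch applied to a few example names. `is_probably_english` and its `max_non_ascii` parameter are illustrative and separate from the notebook's own function; apart from the Chinese title quoted earlier, the sample names are made up for the demonstration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True unless more than `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))                      # True
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True (one non-ASCII character)
print(is_probably_english('Instachat 😜'))                    # True (one emoji)
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
```

The filter is a heuristic: an English app with four or more emoji in its name would still be dropped, and a non-English name written entirely in ASCII would be kept.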

In [30]:
import math

def num_check(num):
    if math.isnan(num):
        return False
    else:
        return True

test = google_data[24]
print(num_check(float(test[2])))

True


In [35]:
def lang_check(string):
    char_count = 0
    for char in string:
        if ord(char) > 127:
            char_count += 1
    
    if char_count > 3:
        return False
    else:
        return True


# Keep one row per Android app: English name and highest review count
android_clean = []
already_added_android = []

for row in google_data:
    name = row[0]
    name_check = lang_check(name)
    n_reviews = float(row[3])
    if n_reviews == reviews_max_android[name] and name_check and name not in already_added_android:
        android_clean.append(row)
        already_added_android.append(name)


# Keep one row per iOS app: English name and highest review count
apple_clean = []
already_added_apple = []

for row in apple_data:
    name = row[2]
    name_check = lang_check(name)
    n_reviews = float(row[6])
    if n_reviews == reviews_max_apple[name] and name_check and name not in already_added_apple:
        apple_clean.append(row)
        already_added_apple.append(name)



# Isolate the free apps; Android rows with a missing (NaN) rating are dropped too
free_apps_clean_android = []
free_apps_clean_apple = []

for row in android_clean:
    if row[6] == 'Free' and num_check(float(row[2])):
        free_apps_clean_android.append(row)

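for row in apple_clean:
    price = float(row[5])
    # Note: price < 1 also admits apps priced under $1 (e.g. 0.99);
    # price == 0 would keep strictly free apps only
    if price < 1:
        free_apps_clean_apple.append(row)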

print(len(free_apps_clean_android))
print(len(free_apps_clean_apple))

7566
3861


Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll need to build frequency tables for a few columns in our datasets.

In [42]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = round(percentage, 3) 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        # Put the percentage first so that sorting orders by frequency
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Android ')
display_table(free_apps_clean_android, 1)
print('\n')
display_table(free_apps_clean_apple, 12)

FAMILY : 19.614
GAME : 10.851
TOOLS : 8.684
FINANCE : 3.82
PRODUCTIVITY : 3.727
LIFESTYLE : 3.688
BUSINESS : 3.344
PHOTOGRAPHY : 3.278
SPORTS : 3.146
COMMUNICATION : 3.093
PERSONALIZATION : 3.08
HEALTH_AND_FITNESS : 3.08
MEDICAL : 3.013
SOCIAL : 2.657
NEWS_AND_MAGAZINES : 2.617
TRAVEL_AND_LOCAL : 2.366
SHOPPING : 2.353
BOOKS_AND_REFERENCE : 2.102
VIDEO_PLAYERS : 1.916
DATING : 1.731
MAPS_AND_NAVIGATION : 1.48
EDUCATION : 1.348
FOOD_AND_DRINK : 1.216
ENTERTAINMENT : 1.123
AUTO_AND_VEHICLES : 0.952
WEATHER : 0.859
LIBRARIES_AND_DEMO : 0.846
HOUSE_AND_HOME : 0.806
ART_AND_DESIGN : 0.727
COMICS : 0.701
PARENTING : 0.634
EVENTS : 0.595
BEAUTY : 0.555


Games : 58.534
Entertainment : 8.184
Photo & Video : 5.361
Education : 3.574
Social Networking : 2.979
Utilities : 2.927
Shopping : 2.176
Sports : 2.02
Health & Fitness : 1.943
Productivity : 1.813
Music : 1.813
Lifestyle : 1.658
News : 1.217
Travel : 1.114
Finance : 1.036
Weather : 0.881
Food & Drink : 0.751
Reference : 0.544
Business : 0.51

Examining the Apple App Store's 'Prime Genre' frequency table, we see that among the free English apps, more than half (58.53%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.57% of the apps are designed for education, followed by social networking apps, which account for 2.98% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not be the same as the offer.

Examining the Genres and Category columns of the Google Play data set, the landscape seems significantly different: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 20% of the apps) consists mostly of games for kids.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing for the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
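
Note that the Installs values are strings such as '10,000+' (see the sample Google Play row printed earlier), so they need cleaning before they can be averaged. A minimal sketch follows; `installs_to_float` is a hypothetical helper, and it assumes every value follows the "digits, commas, trailing plus sign" pattern.

```python
def installs_to_float(installs):
    """Convert an Installs string such as '10,000+' to a float."""
    cleaned = installs.replace(',', '').replace('+', '')
    return float(cleaned)


print(installs_to_float('10,000+'))  # 10000.0
```

Because the values are open-ended buckets, the resulting averages are lower bounds rather than exact install counts.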

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to do the following:

- Isolate the apps of each genre
- Add up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)
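
The three steps above can be sketched as a single pass that keeps a running total and a running count per genre. `average_by_genre` and the toy rows are illustrative assumptions, not the notebook's own code.

```python
def average_by_genre(rows, genre_index, count_index):
    """Return {genre: mean of the numeric column at count_index} for the given rows."""
    totals = {}
    counts = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[count_index])
        counts[genre] = counts.get(genre, 0) + 1
    # Divide by the number of apps in each genre, not by the total number of apps
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Toy rows for illustration: [name, genre, rating count]
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

averages = average_by_genre(sample, genre_index=1, count_index=2)
print(averages)  # {'Games': 200.0, 'Weather': 50.0}
```

Going by the earlier cells, the App Store rows would use index 12 for the genre and index 6 for the rating count.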