# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
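from csv import reader

# Read each CSV file fully into a list of rows; row 0 is the header.
# An explicit encoding avoids platform-dependent decoding errors,
# and the with-blocks close the files once they have been read.
with open('AppleStore.csv', encoding='utf8') as opened_ios:
    apps_ios = list(reader(opened_ios))

with open('googleplaystore.csv', encoding='utf8') as opened_android:
    apps_android = list(reader(opened_android))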

In [3]:
explore_data(apps_ios,0,2)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']




In [4]:
explore_data(apps_android,0,2)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']




In [5]:
rows_ios = len(apps_ios) - 1
print('number of rows in ios = ', rows_ios)
col_ios = len(apps_ios[0])
print('number of columns in ios = ', col_ios)

number of rows in ios =  7197
number of columns in ios =  16


In [6]:
rows_android = len(apps_android) - 1
print('number of rows in android = ', rows_android)
col_android = len(apps_android[0])
print('number of columns in android = ', col_android)

number of rows in android =  10841
number of columns in android =  13


In [15]:
print(apps_android[10473])
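# This row is malformed: the Category value is missing, so the values that
# follow are shifted one column to the left and the Category column shows '1.9'.
# We delete this row. Run the deletion only once, otherwise a valid row is removed.

#del apps_android[10473]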

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [16]:
# After deleting the error row
print(apps_android[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
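
A malformed row like the one above (12 values against a 13-column header) can also be located programmatically instead of by a hard-coded index. The following is a minimal sketch, assuming the header row gives the expected column count; `find_malformed_rows` and the tiny `sample` data set are hypothetical and exist only for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up data set: the last row is missing its Category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

A length check only catches rows with missing or extra values; a row with the right number of columns but wrong contents would still need a separate inspection.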


The Android data set contains duplicate rows for some apps. First we count how many duplicate rows there are, and then we work out a way to remove them.

In [9]:
duplicate_apps_android = []
unique_apps_android = []

for i in apps_android:
    name = i[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)
        
print('number of duplicate android apps', len(duplicate_apps_android))
print('number of unique android apps', len(unique_apps_android))

number of duplicate android apps 1181
number of unique android apps 9661


In [10]:
print(duplicate_apps_android[0:10])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
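
Checking `name in some_list` scans the whole list each time, so the loop above slows down as the data set grows. A hash-based count gives the same kind of tally in a single pass. This is a sketch only: `count_duplicates`, its `has_header` parameter, and the `sample` rows are hypothetical, and unlike the cell above it skips the header row by default.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0, has_header=True):
    """Return (number of duplicate rows, number of unique names)."""
    rows = dataset[1:] if has_header else dataset
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one of a name is a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique


# Made-up rows: 'Box' appears three times, 'Slack' once.
sample = [
    ['App', 'Reviews'],
    ['Box', '100'],
    ['Slack', '50'],
    ['Box', '100'],
    ['Box', '120'],
]

print(count_duplicates(sample))  # (2, 2)
```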


In [11]:
duplicate_apps_ios = []
unique_apps_ios = []

for i in apps_ios:
    name = i[0]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)
        
print('number of duplicate ios apps', len(duplicate_apps_ios))
print('number of unique ios apps', len(unique_apps_ios))

number of duplicate ios apps 0
number of unique ios apps 7198


We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app. One option is to remove duplicate rows at random, but we will use a more deliberate criterion.

In [12]:
for i in apps_android:
    name = i[0]
    if name == 'Instagram':
        print(i)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If you examine the rows we printed above for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly; instead, we'll keep the row that has the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [24]:
reviews_max = {}

for i in apps_android[1:]:
    app_name = i[0]
    reviews = float(i[3])
    if app_name in reviews_max and reviews_max[app_name] < reviews:
        reviews_max[app_name] = reviews
    elif app_name not in reviews_max:
        reviews_max[app_name] = reviews

In [27]:
print('Actual Length = ',len(reviews_max))

Actual Length =  9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the android data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (i) to the android_clean list, and the app name (app_name) to the already_added list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
        - The name of the app is not already in the already_added list.

We need this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews == reviews_max[app_name], we'll still end up with duplicate entries for some apps.

In [28]:
android_clean = []
already_added = []

for i in apps_android[1:]:
    app_name = i[0]
    reviews = float(i[3])
    
    if (reviews == reviews_max[app_name]) and (app_name not in already_added):
        android_clean.append(i)
        already_added.append(app_name)

In [29]:
print(len(android_clean))

9659
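
The two passes above (build reviews_max, then filter) can also be folded into a single pass that stores the best row per app directly. The following is a hedged sketch, not the method used in this notebook: `dedupe_by_max_reviews` is a hypothetical helper, the Box review counts are made up, and the result is ordered by each app's first appearance, which may differ from the order of android_clean.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Header-less sample rows; the Box values are invented for illustration.
sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '100'],
]

for row in dedupe_by_max_reviews(sample):
    print(row)
```

Because the dictionary lookup replaces the `app_name not in already_added` list scan, this version also avoids the quadratic cost of the list membership test.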


# Removing Non-English Apps

## Part One

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience.


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We built this function below, and we use the built-in ord() function to find out the corresponding encoding number of each character.
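
To make the ASCII range concrete, the short sketch below prints the code points of a few characters and counts the non-ASCII characters in some app names. `count_non_ascii` is a hypothetical helper introduced here for illustration; it is not the function used in the rest of the notebook.

```python
def count_non_ascii(name):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for character in name if ord(character) > 127)


print(ord('a'))   # 97
print(ord('™'))   # 8482
print(ord('😜'))  # 128540

print(count_non_ascii('Instagram'))                      # 0
print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
print(count_non_ascii('Instachat 😜'))                   # 1
```

A rule that rejected every name with at least one non-ASCII character would therefore discard English apps whose names merely contain a trademark sign or an emoji.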



To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [40]:
def check_string(name):
    count = 0
    for i in name:
        if ord(i) > 127:
            count = count + 1
    if count > 3:
        return False
    else:
        return True
    

In [41]:
check_string('Instagram')

True

In [42]:
check_string('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [43]:
check_string('Docs To Go™ Free Office Suite')

True

In [44]:
check_string('Instachat 😜')

True

## Part Two


The function is still not perfect, and a few non-English apps might get past our filter, but this seems good enough at this point in our analysis. We shouldn't spend too much time on optimization yet.

Below, we use the check_string() function to filter out the non-English apps for both data sets:

In [46]:
english_android = []
non_english_android = []
for i in android_clean:
    if check_string(i[0]):
        english_android.append(i)
    else:
        non_english_android.append(i)

In [51]:
print('Number of non-english android apps = ', len(non_english_android))
print('Number of english android apps = ', len(english_android))

Number of non-english android apps =  45
Number of english android apps =  9614


In [54]:
english_ios = []
non_english_ios = []
for i in apps_ios[1:]:
    if check_string(i[1]):
        english_ios.append(i)
    else:
        non_english_ios.append(i)

In [55]:
print('Number of non-english ios apps = ', len(non_english_ios))
print('Number of english ios apps = ', len(english_ios))

Number of non-english ios apps =  1014
Number of english ios apps =  6183


# Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [79]:
android_paid = []
android_final = []
for i in english_android:
    if i[7] == '0':
        android_final.append(i)
    else:
        android_paid.append(i)

In [80]:
print('Number of paid-english android apps = ', len(android_paid))
print('Number of free-english android apps = ', len(android_final))

Number of paid-english android apps =  750
Number of free-english android apps =  8864


In [81]:
ios_paid = []
ios_final = []
for i in english_ios:
    if i[4] == '0.0':
        ios_final.append(i)
    else:
        ios_paid.append(i)

In [84]:
print('Number of paid-english ios apps = ', len(ios_paid))
print('Number of free-english ios apps = ', len(ios_final))

Number of paid-english ios apps =  2961
Number of free-english ios apps =  3222
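
The comparisons above depend on the exact text in each price column ('0' for Google Play, '0.0' for the App Store). A slightly more defensive sketch parses the price as a number first. It assumes that paid prices may carry a leading dollar sign (for example '$4.99'), which is an assumption about the data rather than something shown above, and `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    # Assumption: the only non-numeric decoration is an optional leading '$'.
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With a helper like this, the same condition could be used for both data sets, for example `is_free(i[7])` for Android rows and `is_free(i[4])` for iOS rows. A value that is not a number after stripping raises a ValueError, which makes unexpected formats visible instead of silently counting them as paid.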


# Most Common Apps by Genre

## Part One

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.


## Part Two

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in descending order
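
One possible shape for these two functions is sketched below. The names `freq_table` and `display_table`, their parameters, and the `sample` rows are assumptions made for illustration, and the sketch expects a data set without a header row (as is the case for android_final and ios_final).

```python
def freq_table(dataset, index):
    """Return a dict mapping each value in column `index` to its percentage."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


def display_table(dataset, index):
    """Print the frequency table for column `index`, largest percentage first."""
    table = freq_table(dataset, index)
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    for value, percentage in ordered:
        print(value, ':', round(percentage, 2))


# Made-up rows: three 'Games' apps and one 'Tools' app.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
display_table(sample, 1)
```

Keeping the table construction separate from the printing means the percentages can be reused later in the analysis without parsing printed output.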

In [90]:
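genres_android = {}

# Count how many free English Android apps fall into each genre (column 9).
for i in android_final:
    name = i[9]
    if name in genres_android:
        genres_android[name] = genres_android[name] + 1
    else:
        genres_android[name] = 1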

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [87]:
len(ios_final)

3222