# Analysis of Profitable Apps for the App Store and Google Play

As of 2018, there were about 2 million iOS apps available in the App Store, and about 2.1 million Android apps in the Google Play store. This analysis aims to find the most profitable types of apps to develop in terms of ad revenue, since the apps themselves will be free to download. That means finding the kinds of apps that are installed most often: more users means more people seeing and potentially engaging with the ads. The goal, therefore, is to help our hypothetical developers understand which kinds of apps are most likely to attract users on both Android and iOS.

We will accomplish this in three fairly straightforward steps:

1. Open data for the App Store and Google Play apps, and create lists to more easily process the data
2. Clean the data by removing any incorrect, duplicate, or irrelevant apps (non-free, non-English, incorrect data)
3. Categorize and sort the apps so we can make informed statements about which ones users prefer

## Data Exploration and Sorting

Collecting data for over 4 million apps would not only be very time-consuming, it would also be expensive. Realistically, it would not even be necessary, as a smaller sample should be suitable to give us the information we need. Our data sets for the Android and iOS app stores will include about 10,000 and 7,000 apps, respectively.

Opening and exploring the data sets is the logical first step. We'll use a function to allow for repeated printing of rows in a readable way.

In [1]:
def explore_data(dataset, start, end, rows_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')

    if rows_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns: {len(dataset[0])}')

The `explore_data()` function takes four parameters: `dataset`, which is expected to be a list of lists; `start` and `end`, which should be integers marking the starting and ending indices of a slice of the dataset; and `rows_columns`, a boolean that defaults to `False`.

The function slices the data set, then loops through the slice and prints each row followed by a blank line. If `rows_columns` is `True`, it also prints the number of rows and columns in the full data set.
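
As a quick sanity check, the function can be tried on a tiny hand-made list of lists before pointing it at the real files. The sketch below repeats the function so it runs on its own; `toy_data` and its values are made up for illustration.

```python
def explore_data(dataset, start, end, rows_columns=False):
    # Same logic as the function above, repeated so this sketch runs on its own.
    for row in dataset[start:end]:
        print(row)
        print('\n')

    if rows_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns: {len(dataset[0])}')


# Hypothetical three-row data set (no header row), for illustration only.
toy_data = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', 'GAME', '3.9'],
    ['App C', 'SOCIAL', '4.5'],
]

# Prints the first two rows, then the dimensions of the whole data set.
explore_data(toy_data, 0, 2, rows_columns=True)
```

Note that the row and column counts describe the full `dataset`, not the slice, and that `dataset[0]` assumes the list is not empty.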

Let's open the two app data sets and create lists from them. We'll also save each header row into its own variable.

In [2]:
from csv import reader

### The Google Play data set ###
# encoding='utf8' is set explicitly because some app names contain non-ASCII characters
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]

Simple enough! Next, we'll use our explore_data() function to examine the first few rows of the Google data file.

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 4, rows_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,00

There are 10,841 apps, sorted into 13 columns. Many of the column headings seem like they will be useful to our analysis. The most relevant for now are probably 'App', 'Category', 'Rating', 'Installs', 'Type', 'Price', and 'Genres'.

Now let's do the same for the iOS store data.

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 4, rows_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0

The iOS data set includes 7,197 apps with attributes sorted into 16 columns. The columns for the Apple data are in some cases a bit cryptic. Nonetheless, they are easy enough to figure out by looking at the entries. For example, track_name corresponds to the app's name. The most useful seem to be 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. For additional help with the columns, the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) is available.
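
Because the rest of the analysis refers to columns by position, it can help to look positions up by name instead of counting by hand. The following is a minimal sketch using the iOS header printed above; the `ios_columns` dictionary is a hypothetical helper, not part of the original notebook.

```python
# Header row copied from the App Store output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# Map each column name to its index so code can say what it means.
ios_columns = {name: index for index, name in enumerate(ios_header)}

print(ios_columns['track_name'])   # position of the app name
print(ios_columns['prime_genre'])  # position of the genre
```

The same mapping could be built for `android_header`, which would make it harder to mix up indices between the two data sets.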

## Data Cleaning

One of the most critical parts of data analysis is cleaning the data, i.e. removing irrelevant, inaccurate, or duplicate data which would interfere with drawing accurate conclusions.

From the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data, we see that row 10,472 is incorrect - it lists the app's rating as 19, while the maximum rating for an app in the Google Play Store should be 5.

In [5]:
print(android_header)  # header
print('\n')
print(android[10472])  # incorrect row
print('\n')
print(android[0])      # example correct row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


According to the discussion, this issue is caused by a missing value for the 'Category' column. To fix this, we'll simply delete the row so it doesn't interfere with the rest of the data.

In [6]:
del android[10472] # don't run this more than once!
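
A discussion thread won't always point out a broken row, and deleting by a hard-coded index is risky if the cell runs twice. Since the faulty row above has one value fewer than the header, one possible check is to compare each row's length with the header's. The sketch below uses a made-up miniature data set; `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up miniature data set; the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

bad_rows = find_malformed_rows(sample_rows, sample_header)
print(bad_rows)

# Deleting from the highest index down keeps the remaining indices valid,
# and re-running the cell is harmless once no malformed rows are left.
for index, _ in reversed(bad_rows):
    del sample_rows[index]
print(len(sample_rows))
```

This only catches rows with a missing or extra value; a row with the right number of columns but a wrong value would still need a separate check.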

### Checking for Duplicates

Another issue we should check for is duplicate rows. A few duplicate entries can really skew our data and the conclusions we draw from it. Fortunately, we can check the [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) around our data; this won't always be an option so we'll also want to check for ourselves.

The easiest way to do this is to create two lists, one for unique app names and one for duplicates:

In [7]:
android_duplicate_apps = []
android_unique_apps = []

for app in android:
    name = app[0]
    if name in android_unique_apps:
        android_duplicate_apps.append(name)
    else:
        android_unique_apps.append(name)

print('Number of duplicates: ', len(android_duplicate_apps))
print('\n')

Number of duplicates:  1181




So there are 1,181 cases where an Android app name occurs more than once. Let's check the App Store as well.

In [8]:
apple_duplicate_apps = []
apple_unique_apps = []

for app in ios:
    name = app[0] # for iOS, the first column is the unique app 'id'; the app name ('track_name') is at index 1
    if name in apple_unique_apps:
        apple_duplicate_apps.append(name)
    else:
        apple_unique_apps.append(name)

print('Number of duplicates: ', len(apple_duplicate_apps))
print('\n')
print(apple_duplicate_apps)

Number of duplicates:  0


[]


It appears there are no duplicate entries in the App Store data: every value in the `id` column is unique.
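
Because a test like `name in android_unique_apps` rescans a list for every row, the loops above slow down as the data grows. One possible alternative, shown here as a sketch on made-up rows and not on the real data, is `collections.Counter` from the standard library, which tallies every name in a single pass. `count_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def count_duplicates(rows, name_idx=0):
    """Count rows whose name has already appeared earlier in the data."""
    counts = Counter(row[name_idx] for row in rows)
    # A name seen n times contributes n - 1 duplicate rows.
    return sum(n - 1 for n in counts.values())


# Made-up rows: 'Instagram' appears three times, 'Slack' twice, 'Maps' once.
sample_rows = [
    ['Instagram'], ['Slack'], ['Instagram'],
    ['Instagram'], ['Slack'], ['Maps'],
]

print(count_duplicates(sample_rows))
```

The result matches what the two-list approach would report, because both count every occurrence after the first.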

We could just remove all the duplicate entries, but a better idea would be to examine them a little closer and remove only the entries which make sense to remove. For example, let's take a closer look at the Instagram app entries on Google Play:

In [9]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0',

Index 3 (the fourth column) holds the number of reviews. The values are not all the same, which tells us that the data was collected at different times even though it's the same app. With this in mind, it makes sense to keep only the most recent entry, which we take to be the one with the highest review count. Once all the duplicates are removed, we should be left with 9,659 rows:

In [10]:
print('Expected length: ', len(android) - 1181)

Expected length:  9659


### Removing Duplicates

The safest way to remove the duplicate entries is to confirm the expected number of entries (current total - duplicates), build a new dataset of only unique values, and compare the two before deleting any rows.

First, we'll create a dictionary called `reviews_max` (specific to each store), where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app. Then, we'll create a new data set from the dictionary, with just the latest entry per app.

In [11]:
android_reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in android_reviews_max and android_reviews_max[name] < n_reviews:
        android_reviews_max[name] = n_reviews
        
    elif name not in android_reviews_max:
        android_reviews_max[name] = n_reviews

Now let's see if our new dictionary has the same number of entries as we expect:

In [12]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(android_reviews_max))

Expected length: 9659
Actual length: 9659


Now we can use our `android_reviews_max` dictionary to remove the duplicates. As stated before, we'll only keep the entry with the most reviews for each app.

We'll start by creating two empty lists, `android_clean` and `already_added`. Then we can loop through the data and add each app to `android_clean` only if its number of reviews equals the value stored in `android_reviews_max` and its name isn't already in `already_added`. The second condition matters when two rows for the same app share the same maximum review count. This will eliminate any duplicates and leave us with just the most up-to-date row for each unique app.

In [13]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if (android_reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Let's use our `explore_data()` function to make sure everything worked as expected.

In [14]:
print('Android Data:')
print('\n')
explore_data(android_clean, 0, 3, True)

Android Data:


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Looks like we got the results we were expecting. The next step is to remove some more apps which aren't relevant to our analysis.

### Isolating English Apps

Since our company only creates apps in English, it makes sense to analyze just the English-language apps. Looking at our data, we'll find that both data sets include app names that are not in English or don't seem to target an English-speaking audience. Here are a few examples:

In [16]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in these apps for our analysis, so let's remove them. The most logical way to do this is by comparing the characters in each app name to the characters which are typically used in English. In other words, we'll exclude apps whose names include characters which are not used in English.

Standard English text, including the letters A through Z, the digits 0 through 9, and common punctuation (., !, ?, ;, etc.), falls within the ASCII standard, where each character is assigned a number from 0 to 127. With this in mind, we can write a function that checks whether an app name includes non-English (i.e. non-ASCII) characters. We'll use Python's built-in `ord()` function, which returns a character's Unicode code point; for ASCII characters, this is the same as the ASCII value.

In [17]:
def is_english(string):
    # We'll only remove an app if it has more than 3 non-ASCII characters, to minimize unnecessary data loss.
    non_ascii = 0

    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

Let's test our function on a few app names:

In [18]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
True


Great! This function is somewhat simple and a few non-English apps could sneak past, but it should be mostly effective.
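
To see why the threshold matters and where the heuristic falls short, it helps to look at a few code points directly. The second name below is a made-up example, and `count_non_ascii` is a hypothetical helper, not part of the analysis code.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


print(ord('A'))   # 65, inside the ASCII range
print(ord('™'))   # 8482, outside the ASCII range

# One trademark sign stays under the threshold of 3, so the name is kept.
print(count_non_ascii('Docs To Go™ Free Office Suite'))

# A German phrase written with plain ASCII letters has no non-ASCII
# characters at all, so a check like is_english() would treat it as English.
print(count_non_ascii('Guten Morgen Wecker'))
```

In other words, the function really tests "mostly ASCII", not "English", which is the trade-off accepted here for simplicity.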

Next, we can apply our function to filter out the non-English apps in our data sets.
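
A minimal sketch of that filtering step might look like the following. It assumes the app name sits at index 0 in the Google Play rows and at index 1 (`track_name`) in the App Store rows, as the headers above indicate. The sample rows are shortened, the second iOS `id` is a placeholder, and `keep_english` is a hypothetical helper name.

```python
def is_english(string):
    # Same rule as above: allow up to 3 non-ASCII characters.
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= 3


def keep_english(rows, name_idx):
    """Return only the rows whose name passes the is_english() check."""
    return [row for row in rows if is_english(row[name_idx])]


# Shortened sample rows; the real rows have 13 and 16 columns respectively.
android_sample = [
    ['Instagram', 'SOCIAL'],
    ['中国語 AQリスニング', 'FAMILY'],
]
ios_sample = [
    ['284882215', 'Facebook'],
    ['000000000', '爱奇艺PPS -《欢乐颂2》电视剧热播'],  # placeholder id
]

android_english = keep_english(android_sample, name_idx=0)
ios_english = keep_english(ios_sample, name_idx=1)

print(len(android_english), len(ios_english))
```

On the real data, the same calls would be made with `android_clean` and `ios` in place of the samples.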