# Profitable App Profiles for the App Store and Google Play Markets

We'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample instead, using two data sets:

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The explore_data() function:

- Takes in four parameters:
    - dataset, which is expected to be a list of lists.
    - start and end, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
    - rows_and_columns, which is expected to be a Boolean and has False as a default argument.
- Slices the data set using dataset[start:end].
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using print('\n').
    - The \n in print('\n') is a special character and won't be printed. Instead, the \n character adds a new line, and we use print('\n') to add some blank space between rows.
- Prints the number of rows and columns if rows_and_columns is True.
    - dataset shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).
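
The header caveat in the last bullet can be illustrated with a small sketch. The toy data and the names `raw`, `header`, and `rows` are made up for illustration and are not part of the project data:

```python
# Hypothetical toy data: the first inner list is the header row.
raw = [
    ['name', 'genre'],
    ['App A', 'Games'],
    ['App B', 'Music'],
]

header = raw[0]  # column names
rows = raw[1:]   # data rows only

print('With header:', len(raw))      # counts the header as if it were an app
print('Without header:', len(rows))  # the actual number of apps
```

Separating the header before calling explore_data() keeps the reported row count equal to the number of apps.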

In [2]:
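from csv import reader

# Read the App Store data; the encoding is stated explicitly so that
# non-ASCII app names decode the same way on every platform.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]  # column names
apple = apple[1:]        # data rows only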

In [3]:
# Read the Google Play data the same way.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]  # column names
android = android[1:]        # data rows only
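
Both loading cells follow the same pattern, so it could be wrapped in a small helper. This is only a sketch: `load_csv` is a hypothetical name, and the `utf8` default is an assumption about how the two CSV files are encoded.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`.

    The default encoding is an assumption; pass another one if the file
    was saved differently.
    """
    # newline='' is what the csv module documentation recommends for files
    # handed to csv.reader; the with block closes the file when done.
    with open(path, encoding=encoding, newline='') as opened_file:
        data = list(reader(opened_file))
    return data[0], data[1:]
```

With this helper, a call such as `apple_header, apple = load_csv('AppleStore.csv')` would replace a whole loading cell.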

In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

In [5]:
print(apple_header)
print('\n')
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 iOS apps in this data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are self-explanatory in this case, but details about each column can be found in the data set documentation.

## Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [6]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']



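Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame. The row has only 12 values instead of 13: the Category value is missing, so every later value is shifted one column to the left and the Rating column reads 19. This is clearly off because the maximum rating for a Google Play app is 5. As a consequence, we'll delete this row.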
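
The faulty row printed above has fewer values than the header, so comparing lengths is one way to spot rows like it without relying on the discussion section. A minimal sketch on made-up data (`find_malformed_rows` and the toy rows are hypothetical):

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]


# Toy data: the second row is missing its Category value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Photo Frame', '1.9'],
]

print(find_malformed_rows(toy_header, toy_rows))
```

A length check only catches missing or extra values; a row with the right number of columns but a wrong value would still need a separate check.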

In [7]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840


## Removing Duplicate Entries

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [8]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [9]:
unique = []
repeated = []
for rows in android:
    if rows[0] in unique:
        repeated.append(rows[0])
    else:
        unique.append(rows[0])

In [10]:
len(unique)

9659

In [11]:
len(repeated)

1181

In total, there are 1,181 cases where an app occurs more than once:

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed earlier for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. Rather than removing rows randomly, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)
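
As an alternative to the two steps above, the same criterion can be applied in a single pass by storing the best row itself in the dictionary. This is a sketch on toy data, not the approach used in this notebook; `keep_most_reviewed` and its index parameters are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews,
        # so the first of several tied rows is the one that is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: [name, reviews]
toy_rows = [
    ['Instagram', '66577313'],
    ['Box', '159'],
    ['Instagram', '66577446'],
    ['Instagram', '66509917'],
]

print(keep_most_reviewed(toy_rows, name_index=0, reviews_index=1))
```

One difference worth noting: the result here is ordered by the first appearance of each app name, which may not match the order produced by the two-step approach.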

In [12]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [13]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


In [14]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

In [15]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Eliminating non-English Apps

We'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
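
To make the number range concrete, the built-in `ord()` function returns the number behind a character. The sample characters below are chosen for illustration only:

```python
# A few sample characters: letters, a digit, a symbol, and three non-ASCII ones.
samples = ['a', 'Z', '7', '+', '™', '爱', '😜']

# Map each character to its number.
codes = {character: ord(character) for character in samples}

for character, code in codes.items():
    # The last column is True when the character is in the ASCII range.
    print(repr(character), code, code <= 127)
```

For example, `ord('a')` is 97, which is inside the range, while `ord('™')` is 8482, which is far outside it.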

In [16]:
def language_check(text):
    for character in text:
        if ord(character) > 127:
            return False
    return True

In [17]:
values = [
'Instagram',
'爱奇艺PPS -《欢乐颂2》电视剧热播',
'Docs To Go™ Free Office Suite',
'Instachat 😜',
]
for i in values:
    print(language_check(i))

True
False
False
False


We wrote a function that detects non-English app names, but we saw that the function couldn't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.
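
Before fixing the threshold, it can help to see how many non-ASCII characters each sample name contains. A small sketch (`count_non_ascii` is a hypothetical helper, not part of the notebook):

```python
def count_non_ascii(text):
    """Count the characters in `text` whose number is above 127."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))
```

The English names stay at or below the limit of three, while the Chinese name is well above it. The filter is not perfect: an English name with four or more emoji or symbols would still be dropped.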

In [18]:
def language_check(text):
    count = 0
    for character in text:
        if ord(character) > 127:
            count+=1
            if count > 3:
                return False
    return True

In [19]:
values = [
'Instagram',
'爱奇艺PPS -《欢乐颂2》电视剧热播',
'Docs To Go™ Free Office Suite',
'Instachat 😜',
]
for i in values:
    print(language_check(i))

True
False
True
True


In [20]:
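apple_english_apps = []
for row in apple:
    if language_check(row[1]):  # track_name
        apple_english_apps.append(row)

# Note: this loop reads `android` (duplicates included), not `android_clean`,
# so the Google Play counts below still contain the duplicate entries.
android_english_apps = []
for row in android:
    if language_check(row[0]):  # App
        android_english_apps.append(row)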

In [21]:
len(apple_english_apps)

6183

In [22]:
free_apple = []
for row in apple_english_apps:
    if row[4] == '0.0':
        free_apple.append(row)
free_android = []
for row in android_english_apps:
    if row[6] == 'Free':
        free_android.append(row)

So far in the data cleaning process, we:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.
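
The price is stored as text, so a comparison against one exact string only works if every free app is written the same way. A more tolerant check might convert the text to a number first. This is a sketch: `is_free` is a hypothetical helper, and the assumption that paid prices can carry a leading dollar sign (for example '$4.99') should be checked against the data.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes the text is a plain number, optionally prefixed with '$'.
    Anything else raises ValueError instead of being silently kept.
    """
    cleaned = price_text.replace('$', '').strip()
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```

With a helper like this, the same test could be applied to the price column of either data set instead of one string comparison per store.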

In [23]:
len(free_android)

9998

In [24]:
len(free_apple)

3222

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market.

We'll need to build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

In [25]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

The display_table() function you see below:

- Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.
- Generates a frequency table using the freq_table() function.
- Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
- Prints the entries of the frequency table in descending order.
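
The same idea (count, convert to percentages, sort in descending order) can also be sketched with `collections.Counter` from the standard library. This is an illustrative variant on toy data, not the implementation used in this notebook; `sorted_percentages` is a hypothetical name.

```python
from collections import Counter


def sorted_percentages(values):
    """Return (label, percentage) pairs sorted from most to least frequent."""
    counts = Counter(values)
    total = sum(counts.values())
    pairs = [(label, count / total * 100) for label, count in counts.items()]
    # Sort on the percentage (second element of each pair), largest first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


toy_genres = ['Games', 'Games', 'Music', 'Games', 'Education']

for label, percentage in sorted_percentages(toy_genres):
    print(label, ':', percentage)
```

Sorting with a `key` avoids having to put the percentage first in each tuple. Labels with equal percentages keep their first-seen order here, whereas sorting (value, key) tuples breaks ties by the label.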

In [26]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [27]:
display_table(free_apple, -5)  # prime_genre column; the function prints and returns nothing

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665



We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: supply and demand might not match.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [28]:
display_table(free_android, 1)  # Category column; the function prints and returns nothing

FAMILY : 17.663532706541307
GAME : 10.592118423684736
TOOLS : 7.641528305661133
BUSINESS : 4.450890178035607
PRODUCTIVITY : 3.9507901580316065
SPORTS : 3.6007201440288057
LIFESTYLE : 3.590718143628726
COMMUNICATION : 3.590718143628726
MEDICAL : 3.540708141628326
FINANCE : 3.4906981396279257
HEALTH_AND_FITNESS : 3.2506501300260053
PHOTOGRAPHY : 3.120624124824965
PERSONALIZATION : 3.080616123224645
SOCIAL : 2.9205841168233646
NEWS_AND_MAGAZINES : 2.7705541108221645
SHOPPING : 2.570514102820564
TRAVEL_AND_LOCAL : 2.4604920984196843
DATING : 2.2704540908181636
BOOKS_AND_REFERENCE : 1.990398079615923
VIDEO_PLAYERS : 1.7003400680136025
EDUCATION : 1.5103020604120825
ENTERTAINMENT : 1.4702940588117623
MAPS_AND_NAVIGATION : 1.300260052010402
FOOD_AND_DRINK : 1.250250050010002
HOUSE_AND_HOME : 0.8801760352070414
LIBRARIES_AND_DEMO : 0.8401680336067214
AUTO_AND_VEHICLES : 0.8201640328065612
WEATHER : 0.7401480296059212
EVENTS : 0.630126025205041
ART_AND_DESIGN : 0.610122024404881
COMICS : 0.5901


The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 18% of the apps in the table above) consists mostly of games for kids.

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [29]:
genres_ios = freq_table(free_apple, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in free_apple:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Games : 22788.6696905016
Productivity : 21028.410714285714
Entertainment : 14029.830708661417
Lifestyle : 16485.764705882353
Weather : 52279.892857142855
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Catalogs : 4004.0
Travel : 28243.8
Navigation : 86090.33333333333
Social Networking : 71548.34905660378
News : 21248.023255813954
Finance : 31467.944444444445
Medical : 612.0
Utilities : 18684.456790123455
Reference : 74942.11111111111
Business : 7491.117647058823
Music : 57326.530303030304
Food & Drink : 33333.92307692308
Book : 39758.5
Photo & Video : 28441.54375
Shopping : 26919.690476190477
Education : 7003.983050847458


On average, navigation apps have the highest number of user ratings on the App Store.
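
The averages above are computed by looping over the whole data set once per genre. A single-pass variant that accumulates totals and counts per group might look like the sketch below. `average_by_group` and the toy rows are hypothetical and only illustrate the technique.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)  # group -> running sum
    counts = defaultdict(int)    # group -> number of rows
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: [genre, number of ratings]
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

print(average_by_group(toy_rows, 0, 1))
```

On data sets of this size the nested loop is fast enough; the single pass mainly keeps the grouping logic in one reusable function.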

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [31]:
display_table(free_android,5)

1,000,000+ : 15.523104620924185
10,000,000+ : 12.49249849969994
100,000+ : 10.722144428885777
10,000+ : 9.161832366473295
1,000+ : 7.5215043008601725
5,000,000+ : 7.501500300060012
100+ : 6.2012402480496105
500,000+ : 5.271054210842169
50,000+ : 4.300860172034407
100,000,000+ : 4.0908181636327265
5,000+ : 4.0708141628325665
10+ : 3.150630126025205
500+ : 2.900580116023205
50,000,000+ : 2.8905781156231245
50+ : 1.7103420684136827
500,000,000+ : 0.7201440288057611
5+ : 0.7001400280056012
1,000,000,000+ : 0.580116023204641
1+ : 0.4500900180036007
0+ : 0.040008001600320066


We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
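
The conversion described above amounts to removing the commas and the plus sign before turning the text into a number. A minimal sketch (`parse_installs` is a hypothetical helper name):

```python
def parse_installs(text):
    """Convert an open-ended install count such as '1,000,000+' to a float.

    The result is the lower bound of the bucket, not the true install count.
    """
    return float(text.replace(',', '').replace('+', ''))


for value in ['0+', '100+', '100,000+', '1,000,000+']:
    print(value, '->', parse_installs(value))
```

Because every value is a lower bound, averages built on these numbers understate the real install counts, but they remain usable for comparing categories with each other.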

In [42]:
categories_android = freq_table(free_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

AUTO_AND_VEHICLES : 647317.8170731707
SOCIAL : 48184458.56849315
PERSONALIZATION : 7533233.402597402
FOOD_AND_DRINK : 2190710.008
LIBRARIES_AND_DEMO : 749950.119047619
HOUSE_AND_HOME : 1917187.0568181819
COMICS : 950443.220338983
NEWS_AND_MAGAZINES : 27058831.263537906
PARENTING : 542603.6206896552
ENTERTAINMENT : 19516734.69387755
EVENTS : 253542.22222222222
PHOTOGRAPHY : 32321374.407051284
SPORTS : 4860918.563888889
TRAVEL_AND_LOCAL : 27921561.32520325
BOOKS_AND_REFERENCE : 9655197.28643216
LIFESTYLE : 1479956.6267409471
ART_AND_DESIGN : 2038050.8196721312
DATING : 1164270.7356828193
BUSINESS : 2250454.1348314607
GAME : 33111302.596789423
COMMUNICATION : 90935671.86908078
HEALTH_AND_FITNESS : 4869225.852307692
FAMILY : 5787370.152887883
MEDICAL : 147563.28813559323
MAPS_AND_NAVIGATION : 5569698.307692308
BEAUTY : 513151.88679245283
SHOPPING : 12637504.221789883
EDUCATION : 5760596.026490066
PRODUCTIVITY : 35885137.50379747
TOOLS : 14988276.79842932
FINANCE : 2511355.6790830945
WEATHE

On average, communication apps have the most installs on Google Play: roughly 90.9 million per app, according to the table above.
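
Reading the largest value off a long unsorted printout is error-prone, so it may help to let Python do the ranking. The sketch below uses four of the category averages printed above; collecting them in a dictionary named `avg_installs` is an assumption, since the loop in this notebook only prints them.

```python
# Four of the averages printed above, copied into a dictionary for illustration.
avg_installs = {
    'COMMUNICATION': 90935671.86908078,
    'SOCIAL': 48184458.56849315,
    'PRODUCTIVITY': 35885137.50379747,
    'GAME': 33111302.596789423,
}

# Category with the highest average.
top_category = max(avg_installs, key=avg_installs.get)

# Full ranking, highest average first.
ranked = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)

print(top_category)
for category, average in ranked:
    print(category, ':', round(average))
```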

## Conclusion

A communication (social) app or a navigation app could be a good fit for both platforms, since both genres seem to be in high demand.