# Profitable App Profiles for the App Store and Google Play Markets

This project analyzes what profitable applications have in common on the App Store and Google Play, the two major application stores. As of 2018, there were 4 million apps on the App Store and Google Play. Analyzing all of them would take significant time and money, so we analyze a sample instead:

1. About 10,000 Android apps from Google Play
2. About 7,000 iOS apps from the App Store

The goal of the project is to identify the traits that profitable applications share and to define a strategy for building one.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
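def create_header_list(filename, delimiter = ',', newline = '', encoding = 'utf8'):
    import csv
    # encoding is set explicitly (assumed utf8) so app names read the same on every platform
    with open(filename, newline=newline, encoding=encoding) as csvfile:
        openfile = csv.reader(csvfile, delimiter=delimiter)
        list_ = list(openfile)
        header_ = list_[0]
    # first row is the header, the remaining rows are the data
    return header_, list_[1:]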


In [3]:
apple_store_header, apple_store_list = create_header_list('AppleStore.csv')
google_store_header, google_store_list = create_header_list('googleplaystore.csv')


## Data Cleansing

We only analyze free apps aimed at an English-speaking audience, so we are going to remove:

1. Non-English apps.
2. Apps that aren't free.

In [4]:
print(google_store_list[10472])  # incorrect row
print('\n')
print(google_store_header)  # header
print('\n')
print(google_store_list[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [5]:
print(len(google_store_list))
del google_store_list[10472]  # don't run this more than once
print(len(google_store_list))

10841
10840
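
The incorrect row above has 12 values while the header has 13 columns. A minimal sketch of how such rows could be located by comparing each row's length with the header's is shown below. The function name `find_malformed_rows` and the sample data are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's length."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny hypothetical sample: the second row is missing its 'Category' value,
# so every later value is shifted one column to the left.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
for index, row in malformed:
    print(index, row)
```

Running the same check on google_store_list with google_store_header before deleting anything would confirm which index to remove.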


## Duplicate Data

### Defining duplicate data

The Instagram example below shows that the data contains duplicates. There is only one Instagram app, but it has four records. We want to keep only the most recent record, and one way to identify it is the number of reviews. The assumption is that the more reviews a record has, the more recently it was collected.

In [6]:
for app in google_store_list:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The steps to identify duplicates:

1. Create empty lists for duplicate apps and unique apps.
2. Loop through the data and extract the name of each app.
3. If the name is already in the unique apps list, append it to the duplicate apps list.
4. If the name is not in the unique apps list, append it to the unique apps list.

In [7]:
duplicate_apps = []
unique_apps = []

for app in google_store_list:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
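
As an alternative sketch, the same count can be obtained with `collections.Counter`, which avoids searching a growing list on every iteration. The function name `count_duplicates` and the sample rows are hypothetical; only the Instagram review counts are taken from the output above.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each app name that occurs more than once to its number of rows."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Shortened sample rows: [name, reviews]. 'Example App' is made up.
sample_rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Example App', '100'],
    ['Instagram', '66509917'],
]

dupes = count_duplicates(sample_rows)
# Rows that would be dropped if one row per app is kept
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes, extra_rows)
```

With the loop above, `len(duplicate_apps)` corresponds to `extra_rows` here: every occurrence after the first is counted once.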

The steps to identify the most recent data:

1. Create an empty dictionary for the maximum review count of each app.
2. Loop through the data.
3. For each row, assign name and n_reviews. Because every value is stored as a string, we convert the review count to float.
4. If the app is already in the dictionary and its stored maximum is smaller than the current review count, replace the stored maximum with the current review count.
5. If the app is not in the dictionary yet, add it with the current review count.

In [8]:
reviews_max = {}
for app in google_store_list:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews

In [9]:
len(reviews_max)

9659

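Steps to store the data without duplicates:

1. Create two empty lists: one for the cleaned rows and one for the names of the apps already added.
2. For each row, assign the name and review count to variables.
3. If the name is not in the already-added list, check whether the review count matches the app's maximum review count.
4. If it does, append the full row to the cleaned list and the name to the already-added list. The name list prevents a second row with the same maximum review count from being added.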
In [10]:
android_clean = [] 
already_added = []

In [11]:
for app in google_store_list:
    name = app[0]
    n_reviews = float(app[3])
    if name not in already_added:
        if reviews_max[name] == n_reviews:
            android_clean.append(app)
            already_added.append(app[0])
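
The two passes (build reviews_max, then filter) can also be collapsed into one. The following is a minimal sketch, not the approach used above: it keeps, for each name, the row with the highest review count in a dictionary. The name `dedupe_by_max_reviews` is hypothetical, and the sample rows are shortened to four columns.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when two rows tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: [App, Category, Rating, Reviews]. 'Example App' is made up.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean = dedupe_by_max_reviews(sample_rows)
print(len(clean))
```

One difference from the loop above: the result is ordered by the first appearance of each name rather than by the position of the kept row.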

### Removing apps that are not in English

In [12]:
android_english = []

In [13]:
def check_english(string):
    for c in string[:3]:
        if ord(c) > 127:
            return False
    return True

In [14]:
for app in android_clean:
    if check_english(app[0]):
        android_english.append(app)

### Filtering free apps only

In [15]:
android_free = []

In [16]:
for app in android_english:
    if app[6] == 'Free':
        android_free.append(app)

The goal is to determine which app profiles are popular on both platforms, because the more users an app attracts, the more likely it is to be profitable. To minimize risk and overhead, the validation strategy for an app is:

1. Build a minimal Android version of the app, and add it to Google Play
2. If the app has a good response from users, develop it further
3. If the app is profitable after six months, build an iOS version of the app and add it to App Store

Both datasets provide a genre or category column: prime_genre for iOS, and Genres and Category for Android.

### Frequency Table

In [17]:
def freq_table(dataset, index):
    dict_ = {}
    for app in dataset:
        if app[index] in dict_:
            dict_[app[index]] += 1
        else:
            dict_[app[index]] = 1
    return dict_

In [18]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [19]:
print(apple_store_header)
print(google_store_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [20]:
print(apple_store_header.index('prime_genre'))
print(google_store_header.index('Genres'))
print(google_store_header.index('Category'))

11
9
1


In [21]:
display_table(apple_store_list, 11)

Games : 3862
Entertainment : 535
Education : 453
Photo & Video : 349
Utilities : 248
Health & Fitness : 180
Productivity : 178
Social Networking : 167
Lifestyle : 144
Music : 138
Shopping : 122
Sports : 114
Book : 112
Finance : 104
Travel : 81
News : 75
Weather : 72
Reference : 64
Food & Drink : 63
Business : 57
Navigation : 46
Medical : 23
Catalogs : 10


From the result above we can say:

1. The most common genre in the App Store is Games.
2. Even looking at the other genres, most apps are developed for entertainment: Games and Entertainment together make up about 61% of the apps (4,397 of 7,197).
3. In general there are more entertainment apps because that's what people want. However, that also means the market for them is already crowded, so an app in the entertainment category is not guaranteed to be profitable.
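
To make the shares explicit, the counts printed above can be converted to percentages. This is a small worked calculation; the helper name `to_percentages` is hypothetical, and the counts are copied from the display_table output.

```python
# Genre counts copied from the display_table(apple_store_list, 11) output
ios_genre_counts = {
    'Games': 3862, 'Entertainment': 535, 'Education': 453, 'Photo & Video': 349,
    'Utilities': 248, 'Health & Fitness': 180, 'Productivity': 178,
    'Social Networking': 167, 'Lifestyle': 144, 'Music': 138, 'Shopping': 122,
    'Sports': 114, 'Book': 112, 'Finance': 104, 'Travel': 81, 'News': 75,
    'Weather': 72, 'Reference': 64, 'Food & Drink': 63, 'Business': 57,
    'Navigation': 46, 'Medical': 23, 'Catalogs': 10,
}


def to_percentages(counts):
    """Convert a {key: count} dictionary to {key: percentage of the total}."""
    total = sum(counts.values())
    return {key: 100 * value / total for key, value in counts.items()}


ios_genre_shares = to_percentages(ios_genre_counts)
fun_share = ios_genre_shares['Games'] + ios_genre_shares['Entertainment']
print(f"Games: {ios_genre_shares['Games']:.1f}%")
print(f"Games + Entertainment: {fun_share:.1f}%")
```

Games alone account for a little over half of the apps in this sample.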

In [22]:
display_table(android_free, 9)

Tools : 748
Entertainment : 537
Education : 473
Business : 408
Lifestyle : 344
Productivity : 343
Finance : 327
Medical : 311
Sports : 306
Personalization : 295
Communication : 285
Action : 274
Health & Fitness : 272
Photography : 262
News & Magazines : 248
Social : 235
Travel & Local : 206
Shopping : 197
Books & Reference : 192
Simulation : 183
Dating : 165
Arcade : 163
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 121
Food & Drink : 108
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 72
Weather : 69
Events : 63
Adventure : 60
Beauty : 53
Art & Design : 53
Comics : 51
Parenting : 44
Card : 40
Trivia : 38
Casino : 38
Educational;Education : 35
Board : 34
Educational : 32
Education;Education : 31
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

Compared to the iOS data, this is much messier to navigate, and it also seems more granular. For instance, Role Playing, Strategy, Adventure, etc. would most likely be categorized as Games on iOS.
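
Part of the mess comes from combined values such as Casual;Pretend Play. One way to tidy the table, sketched here as an assumption rather than something the analysis does, is to count only the part before the first semicolon. The names `primary_genre_counts` and `make_row` are hypothetical.

```python
from collections import Counter


def primary_genre_counts(rows, genre_index=9):
    """Count rows by the first genre listed in a ';'-separated Genres value."""
    return Counter(row[genre_index].split(';')[0] for row in rows)


def make_row(genre, n_columns=13, genre_index=9):
    """Build a placeholder row with only the Genres column filled in."""
    row = [''] * n_columns
    row[genre_index] = genre
    return row


# Genre values taken from the table above
sample_rows = [make_row(g) for g in [
    'Casual;Pretend Play', 'Casual', 'Puzzle;Brain Games',
    'Puzzle', 'Education;Education',
]]

counts = primary_genre_counts(sample_rows)
for genre, n in counts.most_common():
    print(genre, ':', n)
```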

In [23]:
display_table(android_free, 1)

FAMILY : 1675
GAME : 861
TOOLS : 749
BUSINESS : 408
LIFESTYLE : 345
PRODUCTIVITY : 343
FINANCE : 327
MEDICAL : 311
SPORTS : 300
PERSONALIZATION : 295
COMMUNICATION : 285
HEALTH_AND_FITNESS : 272
PHOTOGRAPHY : 262
NEWS_AND_MAGAZINES : 248
SOCIAL : 235
TRAVEL_AND_LOCAL : 207
SHOPPING : 197
BOOKS_AND_REFERENCE : 192
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 121
FOOD_AND_DRINK : 108
EDUCATION : 104
ENTERTAINMENT : 84
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 72
WEATHER : 69
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
BEAUTY : 53
COMICS : 52


This data is tidier. What we see is that:

1. Game is also one of the most common categories on Android.
2. Family is the most common category, but what qualifies an app as Family is vague.
3. From the analysis so far, games are among the most common apps in both markets.

One way to estimate how many users an app has is to look at its number of installs. That data is missing from the iOS dataset, so the next best thing we can use is the total number of ratings (rating_count_tot).

In [24]:
iOS_freq = freq_table(apple_store_list, 11)

In [26]:
for genre in iOS_freq:
    total = 0
    len_genre = 0
    for app in apple_store_list:
        genre_app = app[11]
        if genre == genre_app:
            number_user_ratings = float(app[5])
            total += number_user_ratings
            len_genre += 1
    average_number_user_ratings = total / len_genre
    print(genre, average_number_user_ratings)


Food & Drink 13938.619047619048
Education 2239.2295805739514
Utilities 6863.822580645161
Reference 22410.84375
Entertainment 7533.678504672897
Weather 22181.027777777777
Medical 592.7826086956521
Navigation 11853.95652173913
Finance 11047.653846153846
Health & Fitness 9913.172222222222
Lifestyle 6161.763888888889
News 13015.066666666668
Games 13691.996633868463
Shopping 18615.32786885246
Productivity 8051.3258426966295
Photo & Video 14352.280802292264
Catalogs 1732.5
Travel 14129.444444444445
Social Networking 45498.89820359281
Sports 14026.929824561403
Music 28842.021739130436
Book 5125.4375
Business 4788.087719298245


This is interesting. There are lots of games out there, but judging by the average number of ratings, what people use most in the App Store is Social Networking.

Now we are going to look at the Android data. It does include the number of installs, but the values are not precise. They are open-ended buckets such as 100+, 1,000+, or 5,000+, so we don't know the exact number: 5,000+ could mean 6,000, 7,000, or 9,999. For our purposes this is precise enough, so we strip the commas and plus signs and use the lower bound of each bucket as the install count.
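
The conversion from an install bucket to a number can be sketched as a small helper. The name `parse_installs` is hypothetical; the sample values are taken from the rows printed earlier.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


for value in ['1,000+', '10,000+', '1,000,000,000+']:
    print(value, '->', parse_installs(value))
```

Because each result is a lower bound, averages built from these numbers underestimate the real install counts.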

In [27]:
android_category_freq = freq_table(android_free, 1)

In [31]:
android_free[:1]

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up']]

In [37]:
for category in android_category_freq:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category == category_app:
            number_installs = app[5]
            number_installs = number_installs.replace(',','')
            number_installs = number_installs.replace('+','')
            number_installs = float(number_installs)        
            total += number_installs
            len_category += 1
    avg_n_installs = total /len_category
    print(category, avg_n_installs)

MEDICAL 121319.43408360129
PHOTOGRAPHY 17772018.759541985
BEAUTY 513151.88679245283
EVENTS 253542.22222222222
BUSINESS 1708215.906862745
WEATHER 5212877.101449275
SOCIAL 23348348.519148935
TRAVEL_AND_LOCAL 13984077.710144928
SHOPPING 7105728.85786802
VIDEO_PLAYERS 24727872.452830188
AUTO_AND_VEHICLES 647317.8170731707
COMMUNICATION 38725984.88070176
GAME 15547984.262485482
SPORTS 3650768.91
HOUSE_AND_HOME 1348645.2916666667
LIFESTYLE 1456502.3739130434
COMICS 835022.1153846154
FINANCE 1361355.1437308867
LIBRARIES_AND_DEMO 638503.734939759
FOOD_AND_DRINK 1960358.8055555555
ART_AND_DESIGN 1986335.0877192982
ENTERTAINMENT 11767380.952380951
NEWS_AND_MAGAZINES 9549218.387096774
TOOLS 10815793.690253671
BOOKS_AND_REFERENCE 8676746.145833334
EDUCATION 1825480.7692307692
FAMILY 3697409.3677611942
DATING 854028.8303030303
PERSONALIZATION 5183850.806779661
HEALTH_AND_FITNESS 4200545.595588235
MAPS_AND_NAVIGATION 4108312.2314049588
PRODUCTIVITY 16879239.98250729
PARENTING 542603.6206896552


The result is the same as iOS. The most popular catego