# Profitable App Profiles For The App Store & Google Play Markets

As our company's success depends on the number of downloads our mobile apps achieve, being able to pinpoint the most profitable app genres is crucial.

The goal of this project is to sift through our data to figure out what kind of apps our development team should focus on creating. We might also gain insight into what our core users could be into next.

The two data sets used for this analysis are:


[**Apple App Store** data set - (_Documentation Link_)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

[**Google Play** data set - (_Documentation Link_)](https://www.kaggle.com/lava18/google-play-store-apps/home)

My first task is to use Python to open both datasets and read each one into a list of lists.

In [1]:
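import csv

# Read each CSV into a list of lists (row 0 is the header).
# The with-blocks close the files; the encoding is set explicitly
# because some app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as f_apple:
    f_a_list = list(csv.reader(f_apple))

with open('googleplaystore.csv', encoding='utf8') as f_google:
    f_g_list = list(csv.reader(f_google))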

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
apple_explore = explore_data(f_a_list, 0, 5, rows_and_columns=True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


Looking at the first few rows of the Apple App Store data, I believe the most useful columns will be **user_rating, rating_count_tot, cont_rating and prime_genre.**

In [4]:
google_explore = explore_data(f_g_list, 0, 5, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
del f_g_list[10473]

The Google Play store data is similar to the Apple data. The most useful columns will be **Rating, Installs, Genres and Content Rating**.
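The `del` statement above removes a single row by its index. A common reason for dropping a row like this is that its number of columns does not match the header. Below is a minimal sketch of how such rows could be located, assuming that a column-count mismatch is the problem; the helper name `find_malformed_rows` and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up sample; the last row is missing a column
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
bad_rows = find_malformed_rows(sample)
print(bad_rows)
```

Because `del` shifts every later row up by one, a cell like the one above should only be run once per load of the data.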

At this point, I want to check for any duplicate data in the Google Play dataset. I will keep the most recent entry for each app, based on the number of reviews. A for loop with a conditional statement will allow me to find the duplicates:

In [6]:
google_dups = []
no_dups = []
for app in f_g_list[1:]:
    name = app[0]
    if name in no_dups:
        google_dups.append(name)
    else:
        no_dups.append(name)

print(len(no_dups))
print(len(google_dups))
print(google_dups[:15])        

9659
1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
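The membership test `name in no_dups` scans a list on every iteration. Here is a sketch of the same duplicate count using `collections.Counter`, run on made-up sample rows; `count_duplicates` is an illustrative helper name, not part of the notebook.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (unique_count, duplicate_count, names occurring more than once)."""
    counts = Counter(row[name_index] for row in rows)
    unique_count = len(counts)
    # Every occurrence beyond the first one counts as a duplicate
    duplicate_count = sum(counts.values()) - unique_count
    repeated = [name for name, n in counts.items() if n > 1]
    return unique_count, duplicate_count, repeated

# Made-up sample: 'Box' appears three times, so it contributes two duplicates
sample_rows = [['Box', '10'], ['Slack', '20'], ['Box', '12'], ['Box', '15']]
unique_count, duplicate_count, repeated = count_duplicates(sample_rows)
print(unique_count, duplicate_count, repeated)
```

Unlike `google_dups`, the `repeated` list names each duplicated app only once.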


In [7]:
reviews_max = {}  # maps each app name to its highest review count

for app in f_g_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and \
    reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    
len(reviews_max) # checking the length of the dictionary for accuracy.

android_clean = [] #empty list to store cleaned data
already_added = [] # stores app names

for app in f_g_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name]:
        if name not in already_added:
            android_clean.append(app)
            already_added.append(name)
            
print(len(android_clean))
    

9659


**The code above required a few steps to fully clean the data:**
* First, I created a dictionary (`reviews_max`) that stores the highest review count for each app name, on the assumption that the entry with the most reviews is the most recent one.
* Afterward, I created two lists: one for the cleaned data I will be working with going forward, and another to track the names of the apps already added. This ensures no duplicates are added and that the list lengths are correct.
* I ran a for loop similar to the one used for the dictionary, but with an if statement and a nested if, so that a row is kept only when its review count matches the dictionary entry and its name is not yet in the `already_added` list.
* After that, I checked the length of the `android_clean` dataset to ensure it was the same as that of the dictionary `reviews_max`.
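The steps above can also be collapsed into a single pass. This is a minimal sketch, assuming the app name is at index 0 and the review count at index 3, as in the Google Play rows; `dedupe_by_max_reviews` is a hypothetical helper and the sample rows are made up.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict '>' keeps the first row when two rows tie on reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
cleaned = dedupe_by_max_reviews(sample_rows)
print(cleaned)
```

One difference from the two-loop version: the result is ordered by each name's first appearance rather than by the position of the row that was kept.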

In [8]:
def check_chars(string):
    '''Returns False if the string contains three or more
    non-ASCII characters, and True otherwise.'''

    non_ascii_count = 0
    for let in string:
        if ord(let) > 127:  # ASCII code points run from 0 to 127
            non_ascii_count += 1
    if non_ascii_count >= 3:
        return False
    else:
        return True
    
check_chars('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
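To see why the filter tolerates a couple of non-ASCII characters instead of rejecting a name at the first one, the sketch below counts them for a few names. The helper names and the example strings are illustrative; the default `max_non_ascii=2` mirrors the "three or more" cutoff used in `check_chars`.

```python
def count_non_ascii(text):
    """Count characters outside the ASCII range (code points above 127)."""
    return sum(1 for ch in text if ord(ch) > 127)

def looks_english(text, max_non_ascii=2):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

examples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',   # one trademark symbol
    'Instachat 😜',                     # one emoji
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in examples:
    print(name, count_non_ascii(name), looks_english(name))
```

A zero-tolerance filter would drop English names that carry a trademark symbol or an emoji, which is the trade-off the threshold is meant to soften.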

In [9]:
android_apps = []
apple_apps = []

for app in android_clean:
    name = app[0]
    if check_chars(name):
        android_apps.append(app)

for app in f_a_list:
    name = app[1]
    if check_chars(name):
        apple_apps.append(app)
        
print(android_apps[:2])
print(apple_apps[:2])


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']]


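Comparing list lengths (the counts are printed in the next cell), the Android dataset didn't have many non-English apps (9,659 rows before filtering, 9,597 after), but the Apple dataset clearly had quite a few (7,198 rows including the header before filtering, 6,156 after). Note that the Apple loop iterates over `f_a_list` with its header row included, so the header is still the first entry of `apple_apps`; it drops out in the next step because its price field is not `'0.0'`.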

Next, we will filter the data further to isolate the apps that are free. We will use another set of lists, a set of for loops and some if statements to collect the information we need.

In [10]:
# android_apps - our cleaned google data set - index 6 for type ('Free' or 'Paid')
# apple_apps - our cleaned apple data set - index 4 for price

android_free_apps = []
apple_free_apps = []

for app in android_apps:
    free_app = app[6]
    if free_app == 'Free':
        android_free_apps.append(app)
    
print(len(android_apps)) # Length of original list 
print(len(android_free_apps)) # Length of free list
    
for app in apple_apps:
    free_app = app[4]
    if free_app == '0.0':
        apple_free_apps.append(app)
        
print(len(apple_apps)) # Length of original list
print(len(apple_free_apps)) # Length of free list

print(apple_free_apps[:2])
print('\n')
print(android_free_apps[:2])
        

9597
8847
6156
3203
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
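The Apple filter above compares the price against the exact string `'0.0'`. A slightly more forgiving check converts the text to a number first. This is only a sketch: `is_free_price` is a hypothetical helper, and the leading `$` handling is an assumption about how prices might be formatted.

```python
def is_free_price(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    try:
        return float(price_text.strip().lstrip('$')) == 0.0
    except ValueError:
        return False  # non-numeric text (for example a header cell) is not free

sample_prices = ['0.0', '0', '$4.99', 'price', '2.99']
free_flags = [is_free_price(p) for p in sample_prices]
print(free_flags)
```

The `ValueError` branch also means a stray header row is skipped rather than crashing the loop.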


## Beginning The Analysis
At this point, we've cleaned the data enough to start looking for patterns. Because the goal is to find the most popular apps across both platforms _(in order to determine the most profitable)_, we need to take a closer look at the columns that could help us.

Columns related to genres might be a good place to start.


In [31]:
# android_free_apps: Category is index 1, Genres is index 9
# apple_free_apps: prime_genre is index 11

def freq_table(dataset, index):
    '''Builds a frequency table (as percentages) from a dataset'''
    table = {}
    num_of_apps = len(dataset)
    print('Total Number of apps: ' + str(num_of_apps) + '\n')

    # Count how many apps fall under each value of the column
    for app in dataset:
        col = app[index]
        if col in table:
            table[col] += 1
        else:
            table[col] = 1

    # Convert the raw counts to percentages
    for key in table:
        table[key] = table[key] / num_of_apps * 100

    return table


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
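For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns entries sorted by count. The function name `percentage_table` and the sample rows are made up for illustration.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample: three Games apps and one Music app
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = percentage_table(sample, 1)
for genre, pct in table:
    print(genre, ':', round(pct, 2))
```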


**The Apple 'English Only' App Frequency Table:**

In [32]:
print('Apple Genre List:\n')
apple_prime_genre = display_table(apple_free_apps, 11)

Apple Genre List:

Total Number of apps: 3203

Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


Looking at the Apple App Store frequency table, the most common genre among free, English-only apps is clearly **Games**, at about 58% of the list. The runner-up is the closely related **Entertainment**, at just under 8%. What is interesting is how far apart the two are percentage-wise.

A few other genres have a modest share, like Photo & Video (about 5%) and Education (about 3.7%). I'd like to look a bit deeper into Social Networking (about 3.3%), as I thought it would be far more common than the data shows.

Based on this table alone, you could say that Games is the genre we should focus on developing for. However, this table counts how many apps exist in each genre, not how many people use them, so the most common genre does not necessarily have the most regular users. If social media has fewer apps but more daily users, it might make sense to go in that direction development-wise.

**The Android "English Only" Categories Frequency Table:**

In [33]:
print('\nAndroid Category List:\n')
android_cats = display_table(android_free_apps, 1)


Android Category List:

Total Number of apps: 8847

FAMILY : 18.932971628800725
GAME : 9.698202780603594
TOOLS : 8.45484344975698
BUSINESS : 4.600429524132474
PRODUCTIVITY : 3.8996269922007465
LIFESTYLE : 3.888323725556686
FINANCE : 3.7074714592517237
MEDICAL : 3.537922459590822
SPORTS : 3.39097999321804
PERSONALIZATION : 3.3231603933536795
COMMUNICATION : 3.2327342602011986
HEALTH_AND_FITNESS : 3.0857917938284163
PHOTOGRAPHY : 2.950152594099695
NEWS_AND_MAGAZINES : 2.803210127726913
SOCIAL : 2.6675709279981916
TRAVEL_AND_LOCAL : 2.3397761953204474
SHOPPING : 2.2493500621679665
BOOKS_AND_REFERENCE : 2.136317395727365
DATING : 1.8650389962699219
VIDEO_PLAYERS : 1.797219396405561
MAPS_AND_NAVIGATION : 1.3903017972193965
FOOD_AND_DRINK : 1.2433593308466147
EDUCATION : 1.1642364643381937
ENTERTAINMENT : 0.9607776647451114
LIBRARIES_AND_DEMO : 0.938171131456991
AUTO_AND_VEHICLES : 0.9268678648129309
HOUSE_AND_HOME : 0.8025319317282694
WEATHER : 0.7912286650842093
EVENTS : 0.712105798575788

The most common category in this Android column is **Family**, at about 18.9%. It is tough to draw conclusions from this one because we do not know what the **Family** category truly entails. The **Games** category comes next, at nearly 9.7%. However, these two categories could very well be combined if you are looking to build a game that is rated for all ages.

Once again, social media apps make up a small share of the list (SOCIAL is about 2.7%). That is odd to me, because social media apps are built in to almost every mobile device. Developers might not be labeling their social apps as _"Social"_, and some apps might be filed under the wrong category. That is something to look into. On the other hand, maybe Games and Family apps are just that common.

Based on this data, I recommend that developers build a Family app, simply because of how common they are. Building something rated for all ages gives the best chance of appealing to the largest group of consumers.

**The Android "English Only" Genres Frequency Table:**

In [37]:
print('\nAndroid Genre List:\n')
android_genres = display_table(android_free_apps, 9)


Android Genre List:

Total Number of apps: 8847

Tools : 8.44354018311292
Entertainment : 6.081157454504352
Education : 5.357748389284503
Business : 4.600429524132474
Productivity : 3.8996269922007465
Lifestyle : 3.8770204589126256
Finance : 3.7074714592517237
Medical : 3.537922459590822
Sports : 3.458799593082401
Personalization : 3.3231603933536795
Communication : 3.2327342602011986
Action : 3.0970950604724763
Health & Fitness : 3.0857917938284163
Photography : 2.950152594099695
News & Magazines : 2.803210127726913
Social : 2.6675709279981916
Travel & Local : 2.3284729286763874
Shopping : 2.2493500621679665
Books & Reference : 2.136317395727365
Simulation : 2.045891262574884
Dating : 1.8650389962699219
Arcade : 1.8424324629818016
Video Players & Editors : 1.7746128631174407
Casual : 1.763309596473381
Maps & Navigation : 1.3903017972193965
Food & Drink : 1.2433593308466147
Puzzle : 1.1303266644060133
Racing : 0.9946874646772917
Role Playing : 0.938171131456991
Libraries & Demo : 0.93

Looking at the _Genres_ column of the Android _"English only"_ dataset, I come to a different conclusion than I did with the _Category_ column. If you base your choice on this column, **Tools** is the most common genre (about 8.4%). **Entertainment** comes in second (about 6.1%), and **Education** third (about 5.4%).

I'm not sure what to make of this column because, looking at it more closely, several genres are split into finer-grained labels, which skews the percentages. The question is, by how much?

Genres like Arcade, Racing, Puzzle, Role Playing, Strategy, Adventure and Casino should all technically fall under **Games**, but they are listed separately. I bet that if we folded these sub-genres into Games, we would see a definite rise in its share, like we saw in the **Category** column data.

My guess is that most developers fill out Category and use the Genres column to describe the type of game they are creating. Because of this, I wouldn't make a recommendation based on this column. I would use the Category column to make a recommendation for this dataset.
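To estimate how much the split labels skew the Genres column, the sub-genres could be folded back into a single Games bucket before counting. The sketch below is hypothetical: the `GAME_SUBGENRES` set contains only the labels named above, the sample values are made up, and taking the part before `;` as the primary genre is an assumption based on entries such as `'Art & Design;Pretend Play'`.

```python
from collections import Counter

# Hypothetical grouping, limited to the sub-genres named in the text
GAME_SUBGENRES = {'Arcade', 'Racing', 'Puzzle', 'Role Playing',
                  'Strategy', 'Adventure', 'Casino'}

def normalize_genre(genre):
    """Map a raw Genres value to its primary genre, folding game sub-genres into 'Games'."""
    primary = genre.split(';')[0].strip()  # 'Art & Design;Pretend Play' -> 'Art & Design'
    return 'Games' if primary in GAME_SUBGENRES else primary

# Made-up sample of raw Genres values
sample_genres = ['Arcade', 'Tools', 'Puzzle', 'Racing;Action & Adventure',
                 'Tools', 'Art & Design;Pretend Play']
merged = Counter(normalize_genre(g) for g in sample_genres)
print(merged.most_common())
```

Running the same normalization over index 9 of `android_free_apps` would show how far Games climbs once its sub-genres are combined.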

In [74]:
# Average number of user ratings (rating_count_tot, index 5)
# per genre (prime_genre, index 11) for the App Store data

apple_rating_tots = {}

for app in apple_free_apps:
    app_genre = app[11]
    if app_genre in apple_rating_tots:
        apple_rating_tots[app_genre] += 1
    else:
        apple_rating_tots[app_genre] = 1


for genre in apple_rating_tots:
    total = 0
    len_genre = 0
    for i in apple_free_apps:
        genre_app = i[11]
        if genre_app == genre:
            num_ratings = float(i[5])
            total += num_ratings
            len_genre += 1
    avg_num_ratings = total / len_genre
    print(genre + ' - Average num of users: ' + str(avg_num_ratings))

    


Food & Drink - Average num of users: 33333.92307692308
Photo & Video - Average num of users: 28441.54375
Games - Average num of users: 22886.36709539121
Entertainment - Average num of users: 14195.358565737051
Lifestyle - Average num of users: 16815.48
Productivity - Average num of users: 21028.410714285714
Finance - Average num of users: 32367.02857142857
Book - Average num of users: 46384.916666666664
Education - Average num of users: 7003.983050847458
Health & Fitness - Average num of users: 23298.015384615384
Medical - Average num of users: 612.0
Reference - Average num of users: 79350.4705882353
News - Average num of users: 21248.023255813954
Business - Average num of users: 7491.117647058823
Catalogs - Average num of users: 4004.0
Social Networking - Average num of users: 71548.34905660378
Sports - Average num of users: 23008.898550724636
Navigation - Average num of users: 86090.33333333333
Weather - Average num of users: 52279.892857142855
Utilities - Average num of users: 19156

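This data supports some of what I mentioned earlier regarding social media. The App Store data has no install counts, so I'm using the total number of ratings (`rating_count_tot`) as a proxy for the number of users. Even though **Games** are the most commonly created apps, **Navigation** (about 86,000), **Reference** (about 79,000) and **Social Networking** (about 72,000) apps have the highest average number of ratings per app.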

This is important because we want to build apps that a high number of users will want to use in order to generate the maximum amount of revenue.
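The nested loop above rescans the whole app list once per genre. A single-pass alternative is sketched here with `collections.defaultdict`; `average_by_group` is an illustrative helper and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample: genre in column 0, rating count in column 1
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(sample, 0, 1)
# Print from highest to lowest average so the leaders are easy to spot
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, avg)
```

An average can be pulled up by a handful of very large apps, so it is worth checking the top entries of a genre before reading too much into its mean.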

In [83]:
# Calculating the average installs per app category (index 1) for Google Play

google_cate_freq = {}

for app in android_free_apps:
    app_genre = app[1]
    if app_genre in google_cate_freq:
        google_cate_freq[app_genre] += 1
    else:
        google_cate_freq[app_genre] = 1
        


for cat in google_cate_freq:
    total = 0
    len_category = 0
    for i in android_free_apps:
        category_app = i[1]
        if category_app == cat:
            num_installs = i[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            new_num_installs = float(num_installs)
            total += new_num_installs
            len_category += 1
    
    avg_num_installs = total / len_category
    print(cat + ' - Total Number of Installs: ' +
          str(avg_num_installs))


BOOKS_AND_REFERENCE - Total Number of Installs: 8814199.78835979
PHOTOGRAPHY - Total Number of Installs: 17840110.40229885
LIBRARIES_AND_DEMO - Total Number of Installs: 638503.734939759
MEDICAL - Total Number of Installs: 120550.61980830671
COMMUNICATION - Total Number of Installs: 38590581.08741259
SPORTS - Total Number of Installs: 3650602.276666667
PRODUCTIVITY - Total Number of Installs: 16787331.344927534
TOOLS - Total Number of Installs: 10830251.970588235
FOOD_AND_DRINK - Total Number of Installs: 1924897.7363636363
SHOPPING - Total Number of Installs: 7036877.311557789
FINANCE - Total Number of Installs: 1387692.475609756
PERSONALIZATION - Total Number of Installs: 5201482.6122448975
WEATHER - Total Number of Installs: 5145550.285714285
FAMILY - Total Number of Installs: 3697848.1731343283
PARENTING - Total Number of Installs: 542603.6206896552
HEALTH_AND_FITNESS - Total Number of Installs: 4188821.9853479853
BEAUTY - Total Number of Installs: 513151.88679245283
BUSINESS - Tot

Like the previous analysis of average ratings in the Apple data, this data gives us a clearer picture of what users keep on their phones. Note that the values printed above are averages per category, even though the label reads "Total Number of Installs". While I initially thought games and entertainment were on top, it turns out that **Social** and **Video Player** apps rank among the highest categories in terms of installs, and **Communication** is also very high at roughly 38.6 million installs per app. This might be because these apps tend to come preloaded on mobile devices.
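The install values are stored as text buckets such as `'10,000+'`, so they have to be cleaned before averaging. A minimal sketch of that step, with the results sorted so the top categories are easy to read; the helper names and the sample rows are made up for illustration.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the float 10000.0."""
    return float(text.replace(',', '').replace('+', ''))

def average_installs(rows, category_index, installs_index):
    """Return (category, average installs) pairs, highest average first."""
    totals, counts = {}, {}
    for row in rows:
        cat = row[category_index]
        totals[cat] = totals.get(cat, 0.0) + parse_installs(row[installs_index])
        counts[cat] = counts.get(cat, 0) + 1
    averages = {cat: totals[cat] / counts[cat] for cat in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

sample = [
    ['App A', 'TOOLS', '10,000+'],
    ['App B', 'TOOLS', '500,000+'],
    ['App C', 'GAME', '1,000,000+'],
]
ranked = average_installs(sample, 1, 2)
for category, avg in ranked:
    print(category, avg)
```

Since a value like `'10,000+'` means at least 10,000 installs, these averages are lower bounds rather than exact figures.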

To get a better picture, it would be nice to see uninstall data on users. That would let us know what users are getting rid of and allow us to avoid creating apps of that nature. 

All in all, judging by this data, the genre that makes the most sense would be **Games** or **Entertainment**. I know the Social and Video Player genres have more installs, but these apps tend to come preinstalled on devices. Plus, it will be difficult to dethrone the Facebooks, Twitters and Instagrams of the app world...at least right now. The same goes for video players like YouTube and Vevo.

Our best bet based on these datasets is to target the Game and Entertainment categories. 