# Top Profitable Apps from the Google Play and Apple Store

This project dives into user data for apps in both the Google Play Store and the Apple Store. We will analyze which apps have the most user interaction.

All of the apps that we will analyze are free to play, so all of our revenue comes from the ads we serve. Discovering which apps are gaining the most traction would let us focus development and advertising efforts on those types of apps.

# Opening the Data Files
The data we will be working with was obtained from kaggle.com. Working with an existing data set saves time compared with sourcing our own data.

[The Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps) has about ten thousand apps that are available on Android. This data set can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

[The Apple iOS Store data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) has about seven thousand iOS apps from the Apple iOS App Store and can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)


In [1]:
def explore_data(dataset, start, end,
                rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds new empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
            
from csv import reader

# Apple Store data: read every row, then split off the header
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_list = list(reader(apple_file))
apple_header = apple_list[0]
apple_list = apple_list[1:]

# Google Play data: read every row, then split off the header
with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_list = list(reader(google_file))
google_header = google_list[0]
google_list = google_list[1:]

# The Data at First Glance
We need a first look at the data before we can start to clean it up. The function 'explore_data' above takes a list of rows plus start and end indexes, and prints each row in that slice. When 'rows_and_columns' is True, it also prints the total number of rows in the list and the number of columns in the list's first row.

In [2]:
print(google_header)
print('\n')
explore_data(google_list, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Above we can see the header row for the Google Apps data as well as the first two rows. The data set has 10841 rows, and each row has 13 columns.

In [3]:
print(apple_header)
print('\n')
explore_data(apple_list, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


Above we can see the header row for the Apple iOS apps data as well as the first two rows. This data set has 7197 rows, and each row has 16 columns.

# Cleaning the Data
On the discussion forum for the Google Apps data set, a user has pointed out an error in the data [here.](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) The users in this thread indicated that one row is missing a column, so let's look at the row they are talking about.

In [4]:
print(google_list[0])
print(google_list[10472])
print(f'Columns in header: {len(google_header)}')
print(f'Columns in error row: {len(google_list[10472])}')

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Columns in header: 13
Columns in error row: 12


The row we are looking at is missing its 'Category' value, so every value after the app name is shifted one column to the left. This row should be deleted.

In [5]:
del google_list[10472]
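
Beyond the single row reported on the forum, a more general check can scan every row for a column count that differs from the header. The sketch below runs on a small made-up sample rather than the real file; `find_malformed_rows` and the sample rows are illustrative names and data, not part of the original analysis.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical three-column sample; the second row is missing its category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Photo Frame', '1.9'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
for index, row in malformed:
    print(index, row)
```

If such a check ever reports several rows, deleting them from the highest index to the lowest keeps earlier deletions from shifting the positions of later ones.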

# Remove Duplicate Data
Since there is one error, there might be more. A good first check is whether the data set contains duplicate entries.

In [6]:
for app in google_list:
    if app[0] == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
for app in google_list:
    if app[0] == 'Facebook':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


The Google Apps store data set has duplicate entries for both the Instagram and Facebook apps. There are likely to be more, so we should go through all of the data and collect every app that appears more than once.

If you look closely at the duplicates, the number of reviews is the only value that changes. We can assume that the row with the highest number of reviews holds the most up to date data, and the other rows should be removed from the data set.

In [8]:
duplicate_apps = []
unique_apps = []

for app in google_list:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])
        
print(f'Number of duplicate apps: {len(duplicate_apps)}')

Number of duplicate apps: 1181


There are a total of 1181 duplicate entries. We can use this number after removing the duplicates to check that the expected number of rows is left. Now we need to make a clean list with the most up to date row for each app in the app store. First we need to know the highest review count recorded for each app.

To do this we will use a dictionary where the app name is the key and the number of reviews is the value. If the app is not already in the dictionary, it is added along with its review count. If the app has already been added, the higher review count is the one kept as the final value in the dictionary.

In [9]:
max_reviews = {}

for app in google_list:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in max_reviews and max_reviews[name] < n_reviews:
        max_reviews[name] = n_reviews
        
    elif name not in max_reviews:
        max_reviews[name] = n_reviews
        
print(f'Expected length: {len(google_list) - 1181}')
print(f'Actual length: {len(max_reviews)}')

Expected length: 9659
Actual length: 9659


Now that we have confirmed that the dictionary holds the expected number of unique apps, we can start removing the duplicates with outdated data.

In [10]:
google_list_clean = []
already_added = []

for app in google_list:
    name = app[0]
    n_reviews = float(app[3])
    
    if (max_reviews[name] == n_reviews) and (name not in already_added):
        google_list_clean.append(app)
        already_added.append(name)
        
explore_data(google_list_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The 9659 remaining rows match the number we expected after removing the duplicates, so we can assume that every app in the data set now appears only once.
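
The same result can be reached in a single pass by storing the whole row, not just the review count, in the dictionary. This is a sketch of that alternative, not the method used above; `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened to four columns for readability.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, number of reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
]

deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

One difference to be aware of: the result is ordered by the first time each app name appears, whereas the two-step approach above keeps each app at the position of its most-reviewed row.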

# Remove Non-English Apps

In [11]:
print(google_list_clean[4412][0])
print(google_list_clean[7940][0])
print('\n')
print(apple_list[813][1])
print(apple_list[6731][1])

中国語 AQリスニング
لعبة تقدر تربح DZ


爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


There are rows in both of the datasets that contain non-English apps which we are not concerned about, so we should remove them.

In [12]:
def english_or_not(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True        
        
print(english_or_not('Instagram'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))

True
False
True
True


The function above returns False when a string has more than three non-ASCII characters. Allowing up to three non-ASCII characters reduces the chance of flagging an English app as non-English just because its name happens to contain a special character or emoji.

Below we will create an English-only data set for both the Google Apps and the Apple iOS Apps:
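
To see why a tolerance is needed, it helps to count the non-ASCII characters in a few of the names tested above. The sketch below uses a hypothetical helper, `count_non_ascii`, which is not part of the original notebook; it only illustrates that symbols such as ™ and emoji have code points above 127 even in otherwise English names.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


# English names with at most one special character stay under the threshold.
for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))

# A name written mostly in Chinese characters goes well over it.
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播') > 3)
```

The threshold of three is a heuristic: an English name with four or more emoji would still be dropped, and a short non-English name with three or fewer non-ASCII characters would still be kept.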

In [13]:
google_english = []
apple_english = []

for app in google_list_clean:
    name = app[0]
    if english_or_not(name):
        google_english.append(app)
        
for app in apple_list:
    name = app[1]
    if english_or_not(name):
        apple_english.append(app)
        
explore_data(google_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

After keeping only the English apps in both data sets, we are left with 9614 apps in the Google Apps data set and 6183 apps in the Apple iOS data set.

# Focusing on Free To Play Apps
The apps that we are developing are free to play apps, so to focus our attention on these apps we need to filter out all of the paid apps. Once we are left with only free to play apps in both data sets we can start to do more data analysis.

In [14]:
google_free = []
apple_free = []

for app in google_english:
    price = app[6]
    if price == 'Free':
        google_free.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_free.append(app)
        
explore_data(google_free, 0, 3, True)
explore_data(apple_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

After finding only the free apps in each data set, we are left with 8863 Google apps and 3222 Apple iOS Apps for the analysis.
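
The filter above compares strings exactly ('Free' in the Google 'Type' column, '0.0' in the Apple 'price' column). A numeric comparison is a little more tolerant of formatting differences such as '0' versus '0.0'. The sketch below is only an illustration: `is_free` is a hypothetical helper, and the '$4.99' format for paid apps is an assumption about how prices are written, not something shown in the rows above.

```python
def is_free(price_text):
    """Return True when a price string represents a cost of zero.

    Strings that cannot be parsed as a number are treated as not free,
    so they can be inspected separately instead of slipping through.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


# '0' and '0.0' appear in the data above; '$4.99' is an assumed paid format.
for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), '->', is_free(price))
```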

We want to find apps that are profitable in both the Google Play store and the Apple iOS App store. When we find an app that is profitable in both markets, we will make a minimal version of the app to add to the Google Play store. If the app has a good response from users we will develop it further. If the app is profitable in six months we will post an iOS version of the app.

# Most Common Genres

To find the most popular genres of apps in each app store, we will first create a frequency table for each genre. Using the prime_genre column for the Apple store and the Genres and Category columns for the Google Play store, we will be able to see the most common types of apps.

In [15]:
def freq_table(dataset, index):
    list_dict = {}
    total_app = len(dataset)
    for app in dataset:
        if app[index] in list_dict:
            list_dict[app[index]] += 1
        else:
            list_dict[app[index]] = 1
            
    for genre in list_dict:
        percentage = (list_dict[genre] / total_app) * 100
        list_dict[genre] = percentage

    return list_dict

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
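
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch rather than the functions used in this analysis; `freq_table_counter` and `sorted_table` are hypothetical names, and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up sample: app name and genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, percentage in sorted_table(sample, 1):
    print(genre, ':', percentage)
```

Sorting with a `key` function avoids building the intermediate list of (percentage, name) tuples, at the cost of leaving ties in their original order instead of breaking them by name.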

Now that the functions to create the frequency table are set up, we can start to examine the data in the prime_genre column of the Apple iOS app store data.

In [16]:
display_table(apple_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The frequency data shows that games make up the clear majority (about 58%) of free, English apps on the Apple iOS store. They are followed by Entertainment, Photo & Video, Education, and Social Networking. Most of the top genres are focused around entertainment. Apps with a more practical purpose, such as Education, Shopping, and Utilities, take up a smaller share of the store, but a smaller share of apps does not mean a smaller user base.

Next we will analyze the data in the Google Play store.

In [17]:
display_table(google_free, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

The Google Play store is not dominated by games to nearly the same degree as the Apple iOS store. The Family category holds the largest share, but if you look into what the Family category contains in the app store, it appears to be mostly games suitable for families. So games appear to be the dominant genre on the Google Play store as well, but practical apps make up a noticeably larger share of this store.

# User Count
These tables have shown that the most prevalent apps in each store are apps made for fun. We now need to know which genres have the most users.

# Apple User Count
Below is the average number of ratings per app for each genre on the Apple iOS store. We are using the number of user ratings instead of user count because user counts are not available in this data set. The number of ratings should suffice, as users who have left a rating for an app are likely to be more active than typical users and can give a reasonable picture of the activity in an app genre.

In [18]:
apple_genres = freq_table(apple_free, -5)
for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[-5]
        if genre_app == genre:
            user_rate = float(app[5])
            total += user_rate
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
            

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Based on the above results, there is a large number of users in the social media and networking space. But this would be a difficult space to gain users in, since a social media platform needs a large number of users in a short amount of time before others see it as worth using. There are also plenty of users in the photo and video space, but these users are most likely accustomed to a specific suite for their photo and video work. This leaves games as the next most likely category for app development. A single player game that is fun to play can attract users without needing a large player base to make the experience enjoyable.
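
The per-genre averages above were computed with a nested loop that rescans every app once per genre. As a sketch of an alternative, the totals and counts can be accumulated in one pass over the rows. `average_by_group` is a hypothetical helper, and the sample rows are shortened and partly made up for illustration.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Shortened sample rows: name, genre, number of ratings.
sample = [
    ['Facebook', 'Social Networking', '2974676'],
    ['Instagram', 'Photo & Video', '2161558'],
    ['Clash of Clans', 'Games', '2130805'],
    ['Hypothetical Game', 'Games', '100'],  # made-up row for illustration
]

averages = average_by_group(sample, 1, 2)
for genre, average in averages.items():
    print(genre, ':', average)
```

The made-up low-count row also shows how strongly one very popular app can pull a genre's average upward, which is worth keeping in mind when reading the table above.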

Next we will analyze the Google Play store data.

# Google Play Store User Count
The Google Play store data set includes an install count for each app. We will use the install count, which gives a good idea of how many users have downloaded apps within a specific genre.

In the data set, the install count is not an exact number. It is a bucket such as 10,000+ or 500,000+. So we do not know exactly how many installs the apps have, but it will be a close enough estimate to make an informed decision.
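
The conversion from a bucket label to a number can be sketched on its own. The helper name `parse_installs` is hypothetical; the sample labels are taken from rows shown earlier. Note the assumption baked in: each bucket is treated as its lower bound, so the true install counts are underestimated.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound."""
    return float(text.replace(',', '').rstrip('+'))


# Bucket labels as they appear in the Installs column.
for bucket in ['10,000+', '500,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```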

In [19]:
google_genres = freq_table(google_free, 1)

for genre in google_genres:
    #print(genre)
    total = 0
    len_genre = 0
    for app in google_free:
        genre_app = app[1]
        if genre_app == genre:
            installs_tot = app[5]
            installs_tot = installs_tot.strip('+')
            installs_tot = installs_tot.replace(',', '')
            total +=  float(installs_tot)
            len_genre += 1
            
    avg_installs = total / len_genre
    print(genre, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

In this table, many of the categories with the highest average installs are practical apps. Looking at the GAME category, there is also a large number of users who have downloaded game related apps. Considering that the FAMILY category consists mainly of family appropriate games, there is a large user base for games on the Google Play store.

# Conclusion
In both the Google Play Store and the Apple iOS store, there is evidence that a large number of users are interested in game type apps. Given the nature of games, a simple game that does not rely on a large player base can gain traction quickly. If there is marketing behind a game, a multiplayer game has the potential to attract a user base that will promote the game themselves, leading the game to grow further over time.

With this in mind, we should put development effort towards the game genre. This has the greatest potential to create returns on investment compared to the effort that would be needed to create a larger user base for something such as a social media platform. If the game app is profitable on the Google Play store, we will carry it over to the Apple iOS store.