# Profitable App Profiles for the App Store and Google Play Markets

The purpose of this project is to analyze data on free apps whose main source of revenue is in-app ads, and to help developers determine which types of apps attract the most users and therefore maximize ad engagement.

## Opening and Exploring the Data

Our analysis uses two datasets: one for iOS apps from the App Store and one for Android apps from Google Play. As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. Analyzing all of that data would take a significant amount of time, so we work with a sample instead: AppleStore.csv contains approximately 7,000 iOS apps, and googleplaystore.csv contains approximately 10,000 Android apps.

The Android sample is a [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download it directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

The iOS sample is a [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download it directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
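from csv import reader

# Read each CSV file into a list of rows (each row is a list of strings).
# encoding='utf8' keeps non-ASCII app names intact regardless of the platform default.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))

with open('googleplaystore.csv', encoding='utf8') as opened_file_2:
    android = list(reader(opened_file_2))

# Drop the header row from each dataset.
ios = ios[1:]
android = android[1:]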

explore_data(ios, 0, 3)

explore_data(android, 0, 3)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
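
Because the header rows are dropped, the rest of the analysis refers to columns by position (for example `row[3]` or `row[5]`). A minimal sketch of how the header could be kept as a name-to-index lookup instead. The helper `column_indexes` is hypothetical, and the header list below is an assumption about the first row of googleplaystore.csv; in practice it would be read from the file.

```python
def column_indexes(header):
    """Map each column name to its position in a row."""
    return {name: index for index, name in enumerate(header)}


# Assumed header for googleplaystore.csv; in practice, take it from the
# first row of the file before that row is dropped.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

android_columns = column_indexes(android_header)

# One of the rows printed above, used as a sample.
sample_row = ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
              '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
              'January 15, 2018', '2.0.0', '4.0.3 and up']

# Look up values by column name rather than by a bare number.
print(sample_row[android_columns['Installs']])
print(sample_row[android_columns['Price']])
```

With a lookup like this, an expression such as `row[android_columns['Reviews']]` documents itself, at the cost of one extra dictionary.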




## Deleting Wrong Data

Some rows in the dataset are malformed. Row 10472 of the Android data (counting from zero, after the header has been removed) is missing one column, the category, so every later value is shifted one position to the left. We remove this row so that it does not distort our analysis.
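
A shifted row like this one has fewer fields than the rest, so one way to find such rows is to compare each row's length with the expected column count. The sketch below is illustrative only: `find_malformed_rows` is a hypothetical helper, and the two sample rows are copied from the notebook output.

```python
def find_malformed_rows(rows, expected_columns):
    """Return (index, row) pairs whose length differs from expected_columns."""
    return [(index, row) for index, row in enumerate(rows)
            if len(row) != expected_columns]


sample_rows = [
    # A well-formed row with 13 fields.
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
     '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
     'January 15, 2018', '2.0.0', '4.0.3 and up'],
    # The malformed row: the category is missing, so only 12 fields remain.
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
     '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
     '1.0.19', '4.0 and up'],
]

for index, row in find_malformed_rows(sample_rows, expected_columns=13):
    print(index, len(row), row[0])
```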

In [2]:
explore_data(android, 10472, 10473)
del android[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




For our analysis to be as accurate as possible, each app in the dataset should be represented exactly once, with the same weight. Duplicate entries would skew our results, so we need to identify and remove them. For example, the app named "Quick PDF Scanner + OCR FREE" has three entries, with slightly different review counts.

In [3]:
unique_apps = []
duplicate_apps = []

for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

test_name = duplicate_apps[0]
for row in android:
    if test_name == row[0]:
        print(row)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
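
To see how many entries each duplicated app has, rather than just listing the names, the counting could also be done with `collections.Counter` from the standard library. This is a sketch on shortened sample rows, and `duplicated_names` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter


def duplicated_names(rows, name_index=0):
    """Return {name: count} for every app name that appears more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: count for name, count in counts.items() if count > 1}


# Shortened sample rows: name, category, rating, reviews.
sample = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
]

print(duplicated_names(sample))
```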


These duplicates will not be removed randomly. Entries with the same name can contain different statistics, most notably the number of reviews, so we need a criterion for deciding which one to keep. A higher number of reviews suggests that the entry was collected more recently, so for each app we keep the entry with the most reviews.

In [4]:
reviews_max = {}
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Now that we have the highest review count for every unique app, we need to decide which row to keep in our cleaned dataset. We loop through the Android dataset and compare each row's review count with the maximum stored in the dictionary for that app name; if they match, we keep the row. However, as the example above shows, several rows for the same app can share the same review count. To avoid keeping more than one of them, we record each name in an `already_added` list and only add a row if its name is not already there. As a result, the cleaned list has the same length as the dictionary.

In [5]:
android_clean = []
already_added = []
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]):
        if name not in already_added:
            android_clean.append(row)
            already_added.append(name)

print(len(android_clean))

9659
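
The two passes above (build `reviews_max`, then filter) could also be collapsed into a single pass that stores the best row per name directly. A minimal sketch, assuming the same column positions (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means a later row with an equal count never replaces
        # the one already stored, so ties resolve to the first occurrence.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # Rows come back ordered by the first appearance of each name, which may
    # differ from the order produced by the two-pass version.
    return list(best.values())


sample = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

A dictionary lookup also avoids the repeated `name not in already_added` list scans, which grow slower as the list gets longer.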


## Removing Non-English Apps

Another way we can clean the dataset is by filtering out non-English apps, as our goal is focused on an English-speaking user base. Because these datasets are potentially global, we want to remove apps that are probably not aimed at English speakers. Therefore, if an app name contains more than 3 characters outside the ASCII range (code points above 127), we presume that the app is not an English app. Allowing up to 3 such characters keeps English apps whose names include emoji or symbols such as ™.

In [6]:
def isEnglishName(name):
    counter = 0
    for character in name:
        if ord(character) > 127:
            counter+= 1
            if (counter > 3):
                return False
    return True

print(isEnglishName('Instagram'))
print(isEnglishName('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(isEnglishName('Docs To Go‚Ñ¢ Free Office Suite'))
print(isEnglishName('Instachat üòú'))

True
False
True
True


In [7]:
android_english = []
ios_english = []

for row in android_clean:
    name = row[0]
    if isEnglishName(name):
        android_english.append(row)

for row in ios:
    name = row[1]
    if isEnglishName(name):
        ios_english.append(row)
        
print(len(android_english))
print(len(ios_english))

9614
6183


## Isolating the Free Apps

Our focus is primarily on apps that generate ad revenue, which are typically free apps. Therefore, we can clean our dataset once more by filtering out apps with a price above zero. In the Google Play dataset, a free app has the string `'0'` in its price column. In the App Store dataset, a free app has the string `'0.0'` in its price column.

In [8]:
android_free = []
ios_free = []

for row in android_english:
    price = row[7]
    if price == '0':
        android_free.append(row)

for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)
        
print(len(android_free))
print(len(ios_free))

8864
3222
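
Comparing against the exact strings `'0'` and `'0.0'` works for these two files but depends on how each file happens to format a zero price. A more tolerant sketch converts the price to a number first. The `is_free` helper is hypothetical, and the `'$4.99'` format for paid apps is an assumption about how the Google Play file writes non-zero prices.

```python
def is_free(price_text):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0', '2.99' or '$4.99'. Anything that
    cannot be parsed as a number raises ValueError instead of being guessed at.
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), is_free(price))
```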


## Most Common Apps by Genre

Once we have our fully cleaned dataset, we can create a frequency table of the apps based on specific columns. The App Store dataset has one applicable column, "prime_genre", which lists the primary genre of the app. The Google Play dataset has two candidate columns, "Category" and "Genres". "Category" gives the single category the app belongs to, while "Genres" is more granular and can list several descriptors for one app.

In [9]:
def freq_table(dataset, index):
    # Count how many rows hold each distinct value in the given column.
    myTable = {}
    numElements = 0
    for row in dataset:
        element = row[index]
        if element in myTable:
            myTable[element] += 1
        else:
            myTable[element] = 1
        numElements += 1

    # Convert the raw counts to percentages of all rows.
    for key in myTable:
        myTable[key] = (myTable[key] / numElements) * 100
    return myTable

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
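
For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns entries sorted by count. The name `freq_table_counter` and the tiny sample dataset are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.most_common()}


# Hypothetical sample: app name in column 0, genre in column 1.
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
```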


In [10]:
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [11]:
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [12]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

For Android, Tools, Entertainment, and Education are the most common genres, and Family, Game, and Tools are the most common categories. For iOS, however, more than half of the free English apps (about 58%) are games.

## Most Common Apps by Number of Installs/Ratings

Although a large share of apps may fall into one genre or category, that share does not tell us how popular those apps are with users. A genre with few apps can still attract far more users per app. As a proxy for popularity on iOS, we can compute the average number of user ratings per app for each genre.

In [13]:
prime_genre_table = freq_table(ios_free, 11)
for genre in prime_genre_table:
    total = 0
    len_genre = 0
    for row in ios_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    print(genre + ': ' + str(total / len_genre))
    

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0
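
The nested loop above rescans the whole dataset once per genre. A single-pass alternative keeps running totals and counts per genre; the sketch below shows the idea on shortened rows. `average_by_group` is a hypothetical helper, and the rating counts in the sample are taken from the app listings in this notebook.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Average the numeric column value_index for each value of group_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Shortened sample rows: name, genre, total rating count.
sample = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', 'Navigation', '345046'],
    ['Google Maps - Navigation & Transit', 'Navigation', '154911'],
    ['Bible', 'Reference', '985920'],
]

for genre, average in average_by_group(sample, 1, 2).items():
    print(genre + ': ' + str(average))
```

On the full dataset the call would be `average_by_group(ios_free, 11, 5)`, assuming the column positions used elsewhere in this notebook.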


We can see that while most apps on the App Store are games, their average number of ratings (about 22,800) is on the lower end. This average may be dragged down by a large number of low-popularity games sitting alongside a few extremely popular ones. The genres with the highest averages are Navigation, Reference, Social Networking, and Music, followed by Weather. However, a high average might say less about the genre's popularity than about how a few very large apps skew the data. For example, in the Navigation genre, which has the highest average, the data looks like this:

In [14]:
for app in ios_free:
    if app[11] == 'Navigation':  # index 11 is prime_genre (same column as app[-5])
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Most of the ratings belong to Waze and Google Maps. We can do a similar analysis for the other genres with high averages:

In [15]:
for app in ios_free:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miito

In [16]:
for app in ios_free:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Therefore, it might be difficult to build a good app profile from average rating counts alone, and we would need to analyze the distributions more deeply to make a case. With that caveat, popular genres seem to be Reference, Weather, Music, Book, and Food & Drink.
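
One simple way to see how much a few large apps pull the average is to compare the mean with the median. The sketch below uses the six Navigation rating counts printed earlier and the standard library's `statistics` module. The mean matches the Navigation average reported above (about 86,090), while the median is only 8,196.5.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed earlier.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('mean:', mean(navigation_ratings))
print('median:', median(navigation_ratings))

# Share of all Navigation ratings that belong to the two largest apps.
top_two = sum(sorted(navigation_ratings, reverse=True)[:2])
top_two_share = top_two / sum(navigation_ratings)
print('share held by the top two apps:', round(top_two_share, 3))
```

A median far below the mean is a sign that the genre average describes its largest apps more than a typical app.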

For Android, instead of rating counts, we have the number of installs, so we calculate our averages from that column. However, this information isn't precise, as the values are open-ended ranges such as 1,000+ or 5,000,000+. To do arithmetic on them, we first remove the commas and plus signs and convert the remaining string to a number.

In [17]:
category_table = freq_table(android_free, 1)
for category in category_table:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            num_installs = row[5].replace('+', '')
            num_installs = num_installs.replace(',', '')
            num_installs = float(num_installs)
            len_category += 1
            total += num_installs
    print(category + ": " + str(total / len_category))
    
            

ART_AND_DESIGN: 1986335.0877192982
AUTO_AND_VEHICLES: 647317.8170731707
BEAUTY: 513151.88679245283
BOOKS_AND_REFERENCE: 8767811.894736841
BUSINESS: 1712290.1474201474
COMICS: 817657.2727272727
COMMUNICATION: 38456119.167247385
DATING: 854028.8303030303
EDUCATION: 1833495.145631068
ENTERTAINMENT: 11640705.88235294
EVENTS: 253542.22222222222
FINANCE: 1387692.475609756
FOOD_AND_DRINK: 1924897.7363636363
HEALTH_AND_FITNESS: 4188821.9853479853
HOUSE_AND_HOME: 1331540.5616438356
LIBRARIES_AND_DEMO: 638503.734939759
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
FAMILY: 3695641.8198090694
MEDICAL: 120550.61980830671
SOCIAL: 23253652.127118643
SHOPPING: 7036877.311557789
PHOTOGRAPHY: 17840110.40229885
SPORTS: 3638640.1428571427
TRAVEL_AND_LOCAL: 13984077.710144928
TOOLS: 10801391.298666667
PERSONALIZATION: 5201482.6122448975
PRODUCTIVITY: 16787331.344927534
PARENTING: 542603.6206896552
WEATHER: 5074486.197183099
VIDEO_PLAYERS: 24727872.452830188
NEWS_AND_MAGAZINES: 9549178.467741935
MA

The Google Play data covers a wider variety of categories, and its install counts are much larger than the iOS rating counts. However, we also have to take into account the phenomenon we discussed before, where a few very large apps dominate the average number of installs.
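
One hedged way to limit the effect of very large apps is to recompute an average after excluding apps above an install cutoff. The sketch below shows the mechanics only. The helpers `parse_installs` and `average_installs`, the cutoff of 100 million, and the three sample rows are all hypothetical and not taken from the dataset.

```python
def parse_installs(text):
    """Convert a string such as '5,000,000+' into an integer lower bound."""
    return int(text.replace('+', '').replace(',', ''))


def average_installs(rows, installs_index=5, max_installs=None):
    """Average installs over rows, optionally ignoring apps above max_installs."""
    values = [parse_installs(row[installs_index]) for row in rows]
    if max_installs is not None:
        values = [value for value in values if value <= max_installs]
    # An empty selection returns 0.0 instead of dividing by zero.
    return sum(values) / len(values) if values else 0.0


# Hypothetical rows: name in column 0, installs in column 1.
sample = [
    ['App A', '1,000,000,000+'],
    ['App B', '5,000,000+'],
    ['App C', '10,000+'],
]

print(average_installs(sample, installs_index=1))
print(average_installs(sample, installs_index=1, max_installs=100_000_000))
```

Any cutoff is arbitrary, so it is worth reporting results for more than one threshold rather than relying on a single number.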

## Conclusion

Massive apps can dominate popularity analysis, and only by breaking the data down further can we detect which genres are popular across many apps rather than just a few. With that breakdown, we can look for traits that the successful apps share and build something that fits a specific niche. Based on the data, games, weather apps, and lifestyle apps appear to do well on both iOS and Google Play, so those types are worth considering first.