# Profitable App Profiles

Rutgers University Mobile App Development Club (RUMAD) is looking to build the next big app. Thanks to its recent graduates, the club has been commissioned by numerous up-and-coming companies to build apps for them. The question is: which company should the club work with, and what should it build?

The president of the club has asked me, Nandan Patel, to figure out which app profiles are most profitable. He laid out the rules:

1. Companies want a free app.
2. We can only work with one company.
3. The app must be available in both the App Store and the Google Play Store.

Somewhat frantic, I luckily found two data sets, one for the App Store and one for the Google Play Store, and started to look for useful trends.

Since free apps rely primarily on advertising for revenue, we will look for the most popular types of apps in both data sets.

First, I thought it would be important to read the data and make the appropriate lists.


In [1]:
from csv import reader

# ios
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[:1]
ios = ios[1:]

# android
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android  = list(read_file)
android_header = android[:1]
android = android[1:]
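
The cell above leaves both files open after reading them. A minimal sketch of the same loading step with a context manager is shown below. The helper name `load_csv` and the `utf8` encoding are assumptions for illustration and are not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # rows[0] is the header as a flat list; the rest are data rows
    return rows[0], rows[1:]
```

With this helper, the loading step could be written as `ios_header, ios = load_csv('AppleStore.csv')`. Note that the header comes back as a flat list here, whereas `ios[:1]` in the cell above produces a list containing one list.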


I then wrote a function, `explore_data`, to print a few of the rows and, optionally, the number of rows and columns.

In [2]:
# Print rows start..end of a dataset and, optionally, its dimensions

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
    


In [3]:
print(ios_header)
print('\n')
explore_data(ios,0,5, True)

print('\n')

print(android_header)
print('\n')
explore_data(android,0,5,True)


[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']]


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


[['App', 'C

Since I found the data on Kaggle, I was able to find some of the issues with the dataset. For instance, a user mentioned that row 10472 in the Google Play Store was incorrectly formatted. I proceeded to delete that row. However, this led me to believe that there would be other rows with issues as well.

In [4]:
print(len(android))
del android[10472]  # run this cell only once: re-running it would delete a different, valid row
print (len(android))

10841
10840
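
Rather than relying only on reports from other users, one way to look for further badly formatted rows is to compare each row's length with the number of header columns. The sketch below is a hypothetical helper (`find_malformed_rows` is not part of the original notebook), shown on a tiny made-up sample.

```python
def find_malformed_rows(dataset, expected_columns):
    """Return the indexes of rows whose length differs from expected_columns."""
    return [i for i, row in enumerate(dataset) if len(row) != expected_columns]

# Tiny made-up example: the second row is missing a column.
sample = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample, 3))  # [1]
```

In this notebook the call would look like `find_malformed_rows(android, len(android_header[0]))`, since `android_header` is a list that contains the header row. A length check only catches rows with missing or extra fields, so it is a first filter rather than a full validation.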


## Deleting Duplicate Entries
The Google Play data set is not perfect. Numerous apps appear more than once. When we use the following code, this becomes clear:

In [5]:
for app in android:
    if app[0] == 'Instagram':
        print(app)

duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('\n')
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate entries:', duplicate_apps[:10])


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Number of duplicate apps: 1181


Examples of duplicate entries: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


When deciding which duplicate entry to keep, we will look at the number of reviews in each one. We will keep the entry with the highest number of reviews, since a higher count suggests the data was collected more recently.

First, we will build a dictionary that maps each app name to its highest review count.

Then we will check that the dictionary has as many entries as the original data set minus the 1,181 duplicates.

Following that, we will create a new list that contains only the entries we want to keep.

In [6]:
highest_reviews = {}

for app in android:
    name = app[0]
    num_reviews = float(app[3])
    
    if name in highest_reviews and highest_reviews[name] < num_reviews:
        highest_reviews[name] = num_reviews
        
    elif name not in highest_reviews:
        highest_reviews[name] = num_reviews

print('Given Value from Code:', len(highest_reviews))
print('Actual Value:', len(android) - 1181)
        

Given Value from Code: 9659
Actual Value: 9659


In [7]:
nodups_android = []
adding_android = []

for app in android:
    name = app[0]
    num_reviews = float(app[3])
    
    if (num_reviews == highest_reviews[name]) and (name not in adding_android):
        nodups_android.append(app)
        adding_android.append(name)

explore_data(nodups_android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
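
The two passes above (build a dictionary of maximum review counts, then filter the rows) can also be collapsed into a single pass by storing the best row itself. This is only a sketch of an alternative, not the method used in this notebook; `keep_most_reviewed` and its parameters are hypothetical names.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# First four fields of the Instagram rows printed earlier, plus one made-up app.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Example App', 'TOOLS', '4.0', '12'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

The result keeps the Instagram row with 66577446 reviews and the single made-up row. Rows come back in the order in which each app name first appears, which may differ from the order produced by the two-pass version.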


Since we are only targeting apps aimed at English speakers, we will remove all apps whose names use other languages and characters.

In [8]:
def english_only (string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True 

print(english_only('Instagram'))
print(english_only('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_only('Docs To Go™ Free Office Suite'))
print(english_only('Instachat 😜'))

True
False
False
False


The issue here is that apps whose names contain emojis or symbols such as ™ are labeled as non-English by the function above. To curb this, we will only reject a name if it contains more than 3 characters outside the ASCII range (code points above 127).

In [9]:
def english_only (string):
    
    count = 0
    
    for character in string:
        if ord(character) > 127:
            count += 1
    
    if count > 3:
        return False
    else:
        return True 

print(english_only('Docs To Go™ Free Office Suite'))
print(english_only('Instachat 😜'))  

True
True
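
The cutoff of 3 is a judgment call, so it can help to make it a parameter. Below is a compact sketch of the same check; the function name `is_probably_english` and the `max_non_ascii` parameter are assumptions for illustration, not names used elsewhere in this notebook.

```python
def is_probably_english(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii non-ASCII characters."""
    # Characters with a code point above 127 fall outside the ASCII range
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))                    # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))                      # False
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))   # False
```

Setting `max_non_ascii=0` reproduces the stricter first version of the function. This remains a rough heuristic: a short non-English name with three or fewer non-ASCII characters would still pass.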


In [10]:
android_english = []
ios_english = []

for app in nodups_android:
    name = app[0]
    if english_only(name):
        android_english.append(app)

for app in ios:
    name = app[1]
    if english_only(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Now we are going to isolate all of the free apps in both data sets. This is the last step in data cleaning before we start to analyze for trends.

In [11]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))
    
    

8864
3222
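
The string comparisons above depend on how each file happens to spell zero (`'0'` in one, `'0.0'` in the other). A numeric check avoids that dependency. The sketch below uses a hypothetical helper, `is_free`, and assumes that paid prices may carry a leading `$`; that assumption is mine and should be checked against the data.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Assumption: paid prices may look like '$4.99', so strip a leading '$'
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that cannot be parsed as a number is treated as not free
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

The same function could then be applied to both data sets, for example `is_free(app[7])` for Android rows and `is_free(app[4])` for iOS rows.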


## Finding Most Common Apps by Genre

The end goal is to determine apps that are likely to attract more users, which will create more revenue for the company we decide to work with.

Here was my thought process:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

I begin by looking at the most common genres for both Android and iOS. Before we do that, we need to build a frequency function to find the percentage of apps that fall into each genre.

In [12]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
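
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an equivalent approach rather than a replacement for the functions above; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Made-up rows: three games and one music app
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
# Games : 75.0
# Music : 25.0
```

Sorting on the percentage with a `key` function avoids building the intermediate list of tuples that `display_table` uses.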

In [13]:
display_table(ios_final, -5) #prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As we can see, the most common genre by far is Games, with about 58% of the apps. Entertainment comes in second at around 8%. Overall, apps that are geared towards fun are more frequent than apps that are geared towards practical purposes.

In [14]:
display_table(android_final, 1) #category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In the Google Play Store, the most common apps appear to be geared towards practical purposes. However, when we take a closer look at the most frequent category, Family, which might be considered practical, we see that most of its apps are games for children, which would fall under fun.

In [15]:
display_table(android_final, -4) #Genre for Android

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The Genre category takes a more in-depth look at the types of apps on the Google Play Store. Since we only want to focus on the bigger picture, we will continue to look at the Category column in the Google Play Store.

Overall, the App Store tends to have more apps geared towards fun, while Google Play is more balanced between fun apps and practical apps.

## Most Popular Apps by Genre in App Store

To make this calculation, we will use the number of user ratings as a proxy for popularity, since we do not have the number of installs for each app. We will calculate the average number of ratings per app in each genre.

In [16]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total_ratings = 0
    number_of_apps = 0
    for app in ios_final:
        app_genre = app[-5]
        app_ratings = float(app[5])
        if genre == app_genre:
            total_ratings += app_ratings
            number_of_apps += 1
    # mean number of ratings per app in this genre (not the mean star rating)
    average_n_ratings = total_ratings/number_of_apps
    print(genre, ':', average_n_ratings)

Reference : 74942.11111111111
Games : 22788.6696905016
Utilities : 18684.456790123455
Travel : 28243.8
Medical : 612.0
Food & Drink : 33333.92307692308
Music : 57326.530303030304
Health & Fitness : 23298.015384615384
Productivity : 21028.410714285714
Navigation : 86090.33333333333
Education : 7003.983050847458
Shopping : 26919.690476190477
Entertainment : 14029.830708661417
Social Networking : 71548.34905660378
Business : 7491.117647058823
Weather : 52279.892857142855
News : 21248.023255813954
Catalogs : 4004.0
Lifestyle : 16485.764705882353
Photo & Video : 28441.54375
Finance : 31467.944444444445
Sports : 23008.898550724636
Book : 39758.5
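
The nested loop above scans the whole data set once per genre. The same averages can be computed in a single pass by keeping running totals per genre. The sketch below is an illustration only; `average_by_group` is a hypothetical helper and the three-column sample rows are made up (only the names and rating counts are taken from this data set).

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up row layout: [name, rating count, genre]
sample = [
    ['Waze', '345046', 'Navigation'],
    ['Google Maps', '154911', 'Navigation'],
    ['Bible', '985920', 'Reference'],
]
print(average_by_group(sample, group_index=2, value_index=1))
# {'Navigation': 249978.5, 'Reference': 985920.0}
```

For the iOS data the call would be `average_by_group(ios_final, -5, 5)`.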


The two genres with the highest averages are Navigation and Reference. However, the Navigation average is skewed by apps such as Waze and Google Maps, which receive far more ratings than the other apps in the genre. The same goes for the Reference genre, where the Bible and Dictionary.com account for most of the ratings.

In [17]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [18]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0
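
One way to see how strongly a few apps pull up the average is to compare the mean with the median. The sketch below uses the six Navigation rating counts printed above; the comparison itself is an addition for illustration and not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('mean:', mean(navigation_ratings))      # roughly 86090.33, the Navigation average reported earlier
print('median:', median(navigation_ratings))  # 8196.5
```

The median is about a tenth of the mean, which is consistent with the average being driven by Waze and Google Maps.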


One way that we can work with this information is by creating an app that has niche features in these categories. We know that categories such as Games are oversaturated, so an app in one of these less crowded categories might garner more reviews if we can think of an idea that is beneficial to customers (such as Waze using traffic patterns to become an alternative to Google Maps).

A potential idea could be creating a travel map that shows roads or areas with the most reviews for food/entertainment/hotel, etc.

Let's now look at Google Play data.

## Most Popular Apps by Genre in Google Play Store

The Google Play data set already provides an installs column that we can use. However, the values are strings such as '10,000+', so we need to remove the '+' and ',' characters and convert each value to a float.

Then we find the total number of installs per category and divide by the number of apps in that category.
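
The string cleanup described above can be isolated in a small helper so that it is written only once. This is a sketch; `parse_installs` is a hypothetical name and is not used elsewhere in this notebook.

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to a number."""
    # Remove the thousands separators and the trailing '+'
    return int(installs.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))         # 10000
print(parse_installs('1,000,000,000+'))  # 1000000000
```

Because each value is a bucket ("at least this many installs"), the parsed number is a lower bound, so the averages computed from it are approximate.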

In [19]:
genres_android = freq_table(android_final, 1)

for genre in genres_android:
    total_installs = 0
    total_apps = 0
    for app in android_final:
        category = app[1]
        if category == genre:
            installs = app[5]
            installs = installs.replace(',','')
            installs = installs.replace('+','')
            total_installs += float(installs)
            total_apps += 1
    average_installs = total_installs/total_apps
    print(genre, ':', average_installs)

PRODUCTIVITY : 16787331.344927534
BOOKS_AND_REFERENCE : 8767811.894736841
LIFESTYLE : 1437816.2687861272
TOOLS : 10801391.298666667
SOCIAL : 23253652.127118643
GAME : 15588015.603248259
EDUCATION : 1833495.145631068
PERSONALIZATION : 5201482.6122448975
COMICS : 817657.2727272727
LIBRARIES_AND_DEMO : 638503.734939759
BUSINESS : 1712290.1474201474
MEDICAL : 120550.61980830671
HEALTH_AND_FITNESS : 4188821.9853479853
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1986335.0877192982
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
FAMILY : 3695641.8198090694
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMMUNICATION : 38456119.167247385
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 24727872.452830188
PHOTOGRAPHY : 17840110.40229885
BEAUTY : 513151.88679245283
FOOD_AND_DRINK : 1924897.7363636363
SHOPPING : 7036877.311557789
WEATHER : 5074486.197183099
AUTO_AND_VEHICLES : 647317.8170731707
NEWS_AND_MAGAZINES : 9549178.467741935
PARENTING : 542603.620689655

Maps and navigation is once again a fairly popular category in the Google Play Store. If we take a closer look, we can see that a few apps account for most of the installs, so it may be possible to break into the maps and navigation market with a niche idea.

In [24]:
for app in android_final:
    if app[1] == 'MAPS_AND_NAVIGATION' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'
                                            or app[5] == '50,000,000+'
                                            or app[5] == '10,000,000+'):
        print(app[0], ':', app[5])

Waze - GPS, Maps, Traffic Alerts & Live Navigation : 100,000,000+
MapQuest: Directions, Maps, GPS & Navigation : 10,000,000+
Yahoo! transit guide free timetable, operation information, transfer search : 10,000,000+
Uber : 100,000,000+
GPS Navigation & Offline Maps Sygic : 50,000,000+
Yandex.Transport : 10,000,000+
Compass : 10,000,000+
Subway Terminator: Smarter Subway : 10,000,000+
Moovit: Bus Time & Train Time Live Info : 10,000,000+
AT&T DriveMode : 10,000,000+
Free GPS Navigation : 50,000,000+
TomTom GPS Navigation Traffic : 10,000,000+
DB Navigator : 10,000,000+
Maps, GPS Navigation & Directions, Street View : 10,000,000+


In [25]:
under_10_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    # keep only maps and navigation apps below the 10,000,000 install bucket
    if (app[1] == 'MAPS_AND_NAVIGATION') and (float(n_installs) < 10000000):
        under_10_m.append(float(n_installs))
        
# average installs among those smaller apps
sum(under_10_m) / len(under_10_m)

936916.1818181818

Even the maps and navigation apps with fewer than 10 million installs average about 937,000 installs each, so smaller apps in this category still find an audience. Along with our travel map idea from before, we could also build an app that helps people practice social distancing during these rough times (Spring 2020). Since the coronavirus has been on everyone's mind and the government needs a stronger reopening plan for many states, we could create an Uber-like service that follows CDC guidelines for vehicle cleanliness and limits each ride to one passenger. We could also create an app that partitions areas by how frequently they are visited, so that the most frequently visited places are opened last and the less frequently visited places are opened first.

## Conclusion

After analyzing both the Google Play Store and the App Store, I was able to suggest to the president of our club that the maps and navigation sector is not only popular and easier to break into than some other categories, but also in great demand due to our current circumstances.

I suggested that we work with the up-and-coming map company that has approached our club so that our app has the best chance of doing well. Our app would feature maps with a closer look at where coronavirus cases are higher (by town) and would suggest to the government and users which areas would be best to reopen and visit in order to avoid crowds.