## App Profiles for the iOS App Store and Google Play Markets ##

The goal of this project is to better understand which types of mobile apps are likely to attract more users by analyzing data about apps on the iOS App Store and Google Play. Both data sets can be found on Kaggle.

In [1]:
import csv 
open_file_1 = open('AppleStore.csv')
open_file_2 = open('googleplaystore.csv')
read_file_1 = csv.reader(open_file_1)
read_file_2 = csv.reader(open_file_2)
app_store_data = list(read_file_1)
google_play_data = list(read_file_2)

def explore_data(dataset, start, end, rows_and_columns = True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset[1:])) # without header row
        print('Number of columns:', len(dataset[0]))
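
The cell above opens both files without closing them. As a minimal sketch of an alternative, the loader below uses a `with` block so the file is closed once it has been read. The name `load_csv` and the `utf8` encoding are assumptions for illustration, not part of the original notebook, and the demo writes a throwaway file so it runs without the Kaggle downloads.

```python
import csv
import os
import tempfile

def load_csv(path):
    """Read a CSV file into a list of rows and close the file afterwards.

    Hypothetical helper: the encoding is an assumption about the data files.
    """
    # newline='' is the setting the csv module documentation recommends
    # for files handed to csv.reader / csv.writer.
    with open(path, encoding='utf8', newline='') as f:
        return list(csv.reader(f))

# Demo on a temporary file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Price'], ['Box', '0']])
    demo_rows = load_csv(demo_path)

print(demo_rows)
```

With the real files this would be called as `load_csv('AppleStore.csv')` and `load_csv('googleplaystore.csv')`.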

We will now use the `explore_data` function above to examine the first 3 rows of each data set.

In [2]:
explore_data(app_store_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


In [3]:
explore_data(google_play_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Now that we have an idea of what the data looks like, we will clean it so that it fits the scope of the project and supports a more reliable analysis.

To begin, the discussion section for the Google Play data set on Kaggle contains some notes about errors in the data. We can check whether this is the case and remove the affected rows if necessary.

In [4]:
google_play_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

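Here you can see that this row has only 12 columns, when it should have 13 as reported by the `explore_data()` function. The `Category` value is missing, so every later value is shifted one column to the left (the rating `'1.9'` sits where the category should be).
We will delete this row from our data set.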

In [5]:
del google_play_data[10473]
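
Rather than relying on a row index taken from the discussion page, rows with the wrong number of columns can also be found programmatically. The sketch below is one way to do that; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for each row whose length differs from the header's.

    Hypothetical helper; assumes dataset[0] is the header row.
    """
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the last row is missing one value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Box', 'BUSINESS', '4.2'],
    ['Broken row', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

A check like this also guards against running the `del` cell twice, which would remove a valid row the second time.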

The discussion also mentions that some apps have duplicate entries. We will now check both data sets to see if this is the case.

In [6]:
duplicate_play_apps = []
unique_play_apps = []

for app in google_play_data:
    app_name = app[0]
    if app_name in unique_play_apps:
        duplicate_play_apps.append(app_name)
    else:
        unique_play_apps.append(app_name)
        
print(str(len(duplicate_play_apps)) + ' duplicate apps')
print(str(len(unique_play_apps)) + ' unique apps')

1181 duplicate apps
9660 unique apps


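The code above shows that there are 1181 duplicate entries in the Google Play data set. Because the loop also iterates over the header row, the count of 9660 unique names includes the header, which leaves 9659 unique apps.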
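
Checking `app_name in unique_play_apps` against a list gets slower as the list grows. As a sketch of an alternative, `collections.Counter` gives the same two numbers in a single pass; `count_duplicates` is a hypothetical helper and the names are sample values.

```python
from collections import Counter

def count_duplicates(names):
    """Return (n_duplicates, n_unique) for a list of names.

    Every occurrence after the first counts as a duplicate, which matches
    how the loop in the notebook counts them.
    """
    counts = Counter(names)
    n_unique = len(counts)
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique

sample_names = ['Box', 'Slack', 'Box', 'Box', 'Zenefits']
n_dup, n_unique = count_duplicates(sample_names)
print(n_dup, 'duplicate apps')
print(n_unique, 'unique apps')
```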

In [7]:
duplicate_ios_apps = []
unique_ios_apps = []

for app in app_store_data:
    app_name = app[0]
    if app_name in unique_ios_apps:
        duplicate_ios_apps.append(app_name)
    else:
        unique_ios_apps.append(app_name)

print(str(len(duplicate_ios_apps)) + ' duplicate apps')
print(str(len(unique_ios_apps)) + ' unique apps')

0 duplicate apps
7198 unique apps


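There aren't any duplicates in the iOS App Store data set. Note that this check compares column 0, which holds the app `id` rather than the name (`track_name`, column 1), and that the count of 7198 includes the header row.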

Let's explore some of the duplicates in the Google Play Store data set:

In [8]:
print(sorted(duplicate_play_apps[:10]))

['Box', 'Box', 'Google Ads', 'Google My Business', 'Google My Business', 'Quick PDF Scanner + OCR FREE', 'Slack', 'ZOOM Cloud Meetings', 'Zenefits', 'join.me - Simple Meetings']


Just from looking at the first 10 items we can see that 'Box' is duplicated.

Looking at that specific app further:

In [9]:
print(google_play_data[0])
for app in google_play_data:
    app_name = app[0]
    if app_name == 'Box':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


When removing duplicates, we will keep the entry that has the most reviews. To do this, we will first create a dictionary whose keys are app names and whose values are the highest number of reviews found for each app.

In [10]:
reviews_max = {}

for app in google_play_data[1:]:
    app_name = app[0]
    n_reviews = float(app[3]) # we need to convert to float from string
    if app_name in reviews_max and reviews_max[app_name] < n_reviews:
        reviews_max[app_name] = n_reviews
    elif app_name not in reviews_max:
        reviews_max[app_name] = n_reviews
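
To make the behavior of the dictionary concrete, here is a small worked example on made-up rows (the app names and review counts are placeholders, not values from the data set). It uses `dict.get` with `max` as a compact variant of the `if`/`elif` above.

```python
# Made-up rows in the Google Play column order: name, category, rating, reviews.
sample_rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'TOOLS', '4.1', '5'],
    ['App A', 'TOOLS', '4.0', '12'],
]

sample_reviews_max = {}
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[3])
    # Keep the larger of the stored value and the current one; on first
    # sight of a name, the default makes the current value the maximum.
    sample_reviews_max[name] = max(n_reviews, sample_reviews_max.get(name, n_reviews))

print(sample_reviews_max)
```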

We should verify that the `reviews_max` dictionary has the correct number of entries:


In [11]:
print('Expected:', len(google_play_data[1:]) - 1181) # 1181 is the number of duplicates found earlier.
print('Actual:', len(reviews_max))

Expected: 9659
Actual: 9659


Now we will remove the duplicate rows:

In [12]:
google_play_clean = []
already_added = []

for app in google_play_data[1:]:
    app_name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[app_name] and app_name not in already_added: # Check against the dictionary above
        google_play_clean.append(app)
        already_added.append(app_name)
        
len(google_play_clean)

9659

The length of the `google_play_clean` list matches the number of entries we expected.
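
The two passes above (build `reviews_max`, then filter) can also be folded into one. The following is a sketch under stated assumptions, not the notebook's method: `dedupe_keep_most_reviewed` is a hypothetical helper, the rows are made up, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def dedupe_keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Replace only on a strictly larger count, so ties keep the first row seen.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Made-up rows; the last column only labels the rows for this demo.
sample_rows = [
    ['App A', 'TOOLS', '4.0', '10', 'older'],
    ['App B', 'TOOLS', '4.1', '5', 'only'],
    ['App A', 'TOOLS', '4.0', '12', 'first max'],
    ['App A', 'TOOLS', '4.0', '12', 'second max'],
]
deduped = dedupe_keep_most_reviewed(sample_rows)
for row in deduped:
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each name, whereas the notebook's loop orders rows by where the kept row sits in the data set.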

## Removing Non-English Apps ##

In [13]:
def english_detection(a_string):
    count = 0
    for character in a_string:
        if ord(character) > 127: # End of the ASCII range (0-127)
            count += 1

    # Treat the string as English if at most three characters are non-ASCII
    return count <= 3

The function above takes a string and returns `False` if more than three of its characters fall outside the ASCII range (0-127). The threshold of three is arbitrary, but it lets us remove apps that are most likely not in English while keeping names that contain a few emoji or symbols such as ™.

In [14]:
google_play_english = []
for app in google_play_clean:
    app_name = app[0]
    if english_detection(app_name):
        google_play_english.append(app)

len(google_play_clean) - len(google_play_english)

45

In [15]:
ios_english = []
for app in app_store_data[1:]:
    app_name = app[1]
    if english_detection(app_name):
        ios_english.append(app)
    
len(app_store_data[1:]) - len(ios_english)

1014

We filtered out 45 apps as likely non-English from the Google Play Store data set and 1014 from the iOS App Store data set.

## Isolating Free Apps ##

In [16]:
google_play_free_apps = []
app_store_free_apps = []

for app in google_play_english:
    price = app[7]
    if price == '0':
        google_play_free_apps.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        app_store_free_apps.append(app)
        
print('Google Play Apps: ' + str(len(google_play_free_apps)))
print('iOS App Store Apps: ' + str(len(app_store_free_apps)))

final_google_data = google_play_free_apps
final_ios_data = app_store_free_apps

Google Play Apps: 8864
iOS App Store Apps: 3222


Here we isolated the free apps by looping through both data sets and appending every app with a price of zero to a new list. These two lists are the final data sets for the analysis, and the number of apps in each is printed above.
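
The comparison above depends on the exact strings `'0'` and `'0.0'`. A slightly more forgiving check converts the price to a number first. This is only a sketch: `is_free` is a hypothetical helper, and the assumption that some prices carry a leading `$` is mine rather than something shown in the output above.

```python
def is_free(price_text):
    """Return True if the price string represents zero.

    Hypothetical helper: strips whitespace and a leading '$' (an assumed
    format) before converting; anything non-numeric counts as not free.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

checks = {text: is_free(text) for text in ['0', '0.0', '$0.00', '$4.99', 'Everyone']}
print(checks)
```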

# Analysis #

So far, we have removed inaccurate data, removed duplicate app entries, removed non-English apps, and isolated the apps that are free. We are now in a better position to analyze the data to determine the kinds of apps that will most likely attract the most users.

In [17]:
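# These lists have no header row, so the 'Number of rows' printed by
# explore_data (which skips the first row) is one less than len(dataset).
explore_data(final_google_data, 0, 5)
explore_data(final_ios_data, 0, 5)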

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8863
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0

To start, we can build a frequency table to determine which genres of apps are most popular.

In [18]:
def freq_table(dataset, index):
    frequency_table = {} 
    table_percentages = {}
    total = 0
    for row in dataset:
        total += 1
        column = row[index]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
            
    for key in frequency_table:
        percentage = round((frequency_table[key] / total) * 100, 3)
        table_percentages[key] = percentage 
            
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

The `freq_table` function will be used to generate frequency tables that show percentages. The second function, `display_table`, will be used to display those percentages in descending order. 
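
As a quick check of what these functions compute, here is a small example on made-up rows. It is a sketch that uses `collections.Counter` in place of the manual counting; `freq_percentages` is a hypothetical name.

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 3) for key, count in counts.items()}

# Made-up rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_percentages(sample, 1)

# Sort by percentage, largest first, as display_table does.
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
```

Sorting with a `key` orders by percentage only, so tied entries keep their original order, whereas sorting `(percentage, name)` tuples in reverse also orders ties by name.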

In [19]:
display_table(final_ios_data, 11)

Games : 58.163
Entertainment : 7.883
Photo & Video : 4.966
Education : 3.662
Social Networking : 3.29
Shopping : 2.607
Utilities : 2.514
Sports : 2.142
Music : 2.048
Health & Fitness : 2.017
Productivity : 1.738
Lifestyle : 1.583
News : 1.335
Travel : 1.241
Finance : 1.117
Weather : 0.869
Food & Drink : 0.807
Reference : 0.559
Business : 0.528
Book : 0.435
Navigation : 0.186
Medical : 0.186
Catalogs : 0.124


For the iOS App Store, Games is by far the most common genre among free English apps (58.163%), with Entertainment a distant second (7.883%). Most of the top genres on the iOS App Store are oriented toward entertainment and creativity.

In [20]:
display_table(final_google_data, 1)

FAMILY : 18.908
GAME : 9.725
TOOLS : 8.461
BUSINESS : 4.592
LIFESTYLE : 3.903
PRODUCTIVITY : 3.892
FINANCE : 3.7
MEDICAL : 3.531
SPORTS : 3.396
PERSONALIZATION : 3.317
COMMUNICATION : 3.238
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.944
NEWS_AND_MAGAZINES : 2.798
SOCIAL : 2.662
TRAVEL_AND_LOCAL : 2.335
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.144
DATING : 1.861
VIDEO_PLAYERS : 1.794
MAPS_AND_NAVIGATION : 1.399
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.959
LIBRARIES_AND_DEMO : 0.936
AUTO_AND_VEHICLES : 0.925
HOUSE_AND_HOME : 0.824
WEATHER : 0.801
EVENTS : 0.711
PARENTING : 0.654
ART_AND_DESIGN : 0.643
COMICS : 0.62
BEAUTY : 0.598


In [21]:
display_table(final_google_data, 9)

Tools : 8.45
Entertainment : 6.069
Education : 5.347
Business : 4.592
Productivity : 3.892
Lifestyle : 3.892
Finance : 3.7
Medical : 3.531
Sports : 3.463
Personalization : 3.317
Communication : 3.238
Action : 3.102
Health & Fitness : 3.08
Photography : 2.944
News & Magazines : 2.798
Social : 2.662
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.144
Simulation : 2.042
Dating : 1.861
Arcade : 1.85
Video Players & Editors : 1.771
Casual : 1.76
Maps & Navigation : 1.399
Food & Drink : 1.241
Puzzle : 1.128
Racing : 0.993
Role Playing : 0.936
Libraries & Demo : 0.936
Auto & Vehicles : 0.925
Strategy : 0.914
House & Home : 0.824
Weather : 0.801
Events : 0.711
Adventure : 0.677
Comics : 0.609
Beauty : 0.598
Art & Design : 0.598
Parenting : 0.496
Card : 0.451
Casino : 0.429
Trivia : 0.417
Educational;Education : 0.395
Board : 0.384
Educational : 0.372
Education;Education : 0.338
Word : 0.259
Casual;Pretend Play : 0.237
Music : 0.203
Racing;Action & Adventure : 0.169
Puzzle;Brain G

There are two different columns that could be useful for determining the most common genres of apps in the Google Play store. In the Category column of the `final_google_data` data set, FAMILY has the highest share at 18.908%. In the more fine-grained Genres column, Tools has the highest share at 8.45%.

## Most Popular Apps by Genre - iOS App Store ##

We will now look at which genres have the most users. Since the iOS App Store data set has no installs column, we will use the total number of user ratings (the `rating_count_tot` column) as a proxy and compute the average number of ratings per app for each genre.

In [22]:
ios_genres = freq_table(final_ios_data, -5)

for genre in ios_genres:
    total = 0
    len_genre = 0
    for app in final_ios_data:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Catalogs : 4004.0
Education : 7003.983050847458
Book : 39758.5
Business : 7491.117647058823
Entertainment : 14029.830708661417
Music : 57326.530303030304
Shopping : 26919.690476190477
Weather : 52279.892857142855
Social Networking : 71548.34905660378
Photo & Video : 28441.54375
News : 21248.023255813954
Utilities : 18684.456790123455
Productivity : 21028.410714285714
Travel : 28243.8
Reference : 74942.11111111111
Finance : 31467.944444444445
Food & Drink : 33333.92307692308
Games : 22788.6696905016
Lifestyle : 16485.764705882353
Navigation : 86090.33333333333
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Medical : 612.0
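
The nested loop above scans the whole data set once per genre. The same averages can be accumulated in a single pass, as in this sketch; `average_by_group` is a hypothetical helper and the rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: name, genre, number of ratings.
sample = [
    ['Maps A', 'Navigation', '100'],
    ['Maps B', 'Navigation', '300'],
    ['Chat', 'Social Networking', '50'],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

For the notebook's data the call would be `average_by_group(final_ios_data, 11, 5)`, assuming the column positions shown in the header row.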


Navigation, Reference, and Social Networking have the highest average number of user ratings. Exploring Navigation and Social Networking further:

In [23]:
for app in final_ios_data:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # printing name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [24]:
for app in final_ios_data:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

## Most Popular Apps by Genre - Google Play Store ##

For the Google Play Store, we can use the installs column:

In [25]:
display_table(final_google_data, 5)

1,000,000+ : 15.727
100,000+ : 11.552
10,000,000+ : 10.548
10,000+ : 10.199
1,000+ : 8.394
100+ : 6.916
5,000,000+ : 6.825
500,000+ : 5.562
50,000+ : 4.772
5,000+ : 4.513
10+ : 3.542
500+ : 3.249
50,000,000+ : 2.301
100,000,000+ : 2.132
50+ : 1.918
5+ : 0.79
1+ : 0.508
500,000,000+ : 0.271
1,000,000,000+ : 0.226
0+ : 0.045
0 : 0.011
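
The install counts above are open-ended strings such as `'1,000,000+'`, so they have to be converted before any averaging. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper name.

```python
def parse_installs(text):
    """Convert a string like '1,000,000+' to the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))

parsed = [parse_installs(text) for text in ['1,000,000+', '500+', '0+', '0']]
print(parsed)
```

Each parsed value is a floor: an app listed as `'100,000+'` could have anywhere from 100,000 up to the next bucket, so averages built on these numbers are rough lower bounds.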


In [26]:
google_genres = freq_table(final_google_data, 1)

for category in google_genres:
    total = 0
    len_category = 0
    for app in final_google_data:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

PRODUCTIVITY : 16787331.344927534
ART_AND_DESIGN : 1986335.0877192982
COMMUNICATION : 38456119.167247385
SHOPPING : 7036877.311557789
FINANCE : 1387692.475609756
PERSONALIZATION : 5201482.6122448975
HEALTH_AND_FITNESS : 4188821.9853479853
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
COMICS : 817657.2727272727
BUSINESS : 1712290.1474201474
EVENTS : 253542.22222222222
BOOKS_AND_REFERENCE : 8767811.894736841
MEDICAL : 120550.61980830671
SPORTS : 3638640.1428571427
BEAUTY : 513151.88679245283
LIFESTYLE : 1437816.2687861272
MAPS_AND_NAVIGATION : 4056941.7741935486
LIBRARIES_AND_DEMO : 638503.734939759
SOCIAL : 23253652.127118643
FAMILY : 3695641.8198090694
WEATHER : 5074486.197183099
PHOTOGRAPHY : 17840110.40229885
TRAVEL_AND_LOCAL : 13984077.710144928
VIDEO_PLAYERS : 24727872.452830188
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
NEWS_AND_MAGAZINES : 9549178.467741935
PARENTING : 542603.6206896552
EDUCATION : 1833495.145631068
GAME : 15588015.603248259
FOO

Here COMMUNICATION has the highest average number of installs among the categories shown. Let's look at the apps in this category with more than a billion installs:

In [27]:
for app in final_google_data:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+


Ultimately, a small number of very large apps (WhatsApp, Waze, Facebook, etc.) dominate their respective genres by number of installs or ratings, which pulls the genre averages upward.
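
To see how strongly a couple of large apps skew a genre average, the sketch below takes the six iOS Navigation rating counts printed earlier and compares the mean with the median and with a mean that leaves out the largest apps. The cutoff of 100,000 ratings is an arbitrary assumption chosen for illustration.

```python
from statistics import mean, median

# Rating counts for the six free English Navigation apps listed earlier
# (Waze, Google Maps, Geocaching, CoPilot GPS, ImmobilienScout24, Railway Route Search).
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

THRESHOLD = 100000  # arbitrary cutoff, an assumption for this sketch

overall_mean = mean(navigation_ratings)
overall_median = median(navigation_ratings)
trimmed_mean = mean(n for n in navigation_ratings if n < THRESHOLD)

print('Mean:', overall_mean)
print('Median:', overall_median)
print('Mean without apps above the cutoff:', trimmed_mean)
```

The mean reproduces the Navigation average reported above, while the median and the trimmed mean are far lower, which suggests the genre average mostly reflects Waze and Google Maps rather than a typical Navigation app.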