# User Engagement Analysis

XYZ company builds apps that are free to download and install, so our main source of revenue is in-app ads. Our revenue therefore depends on the number of users who use our apps.

---

Goals for User Engagement Analysis
* Analyze data to understand our users and which apps attract them
* Give guidance to our product and development teams
* Personal goal: to be better

In [12]:
# The App Store data set
from csv import reader

opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_data_header = apple_data[0]
apple_data = apple_data[1:]

In [13]:
# The Google Play Store data set
from csv import reader

opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
google_data = list(read_file)
google_data_header = google_data[0]
google_data = google_data[1:]
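
The two loading cells above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, assuming the files are UTF-8 encoded; `load_csv` is a hypothetical name that does not appear in the notebook, and the demonstration uses a tiny temporary file with made-up contents rather than the real data sets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demonstration on a temporary file (made-up contents, not the real data).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nExample App,TOOLS\n")
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

With such a helper, each data set could be loaded in one line, for example `apple_data_header, apple_data = load_csv('AppleStore.csv')`.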

In [16]:
explore_data(apple_data, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


In [15]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [17]:
explore_data(google_data, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


In [18]:
print('Apple Data Headers', apple_data_header)
print('Google Data Headers', google_data_header)

Apple Data Headers ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Google Data Headers ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Useful Apple Data Headers
* track_name
* currency
* price
* rating_count_tot
* rating_count_ver
* user_rating
* prime_genre

---

## Useful Google Data Headers
* App
* Category
* Rating
* Reviews
* Installs
* Type
* Price
* Genres
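
The later cells refer to columns by position (for example `row[5]` or `row[-5]`). One way to keep those positions readable is to look them up from the header lists printed above. This is a small sketch; `column_indexes`, `apple_col`, and `google_col` are illustrative names, not part of the notebook.

```python
apple_data_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                     'rating_count_tot', 'rating_count_ver', 'user_rating',
                     'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                     'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
google_data_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                      'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                      'Current Ver', 'Android Ver']


def column_indexes(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}


apple_col = column_indexes(apple_data_header)
google_col = column_indexes(google_data_header)

# prime_genre is column 11, which is the same column as index -5 of 16.
print(apple_col['prime_genre'], len(apple_data_header) - 5)
# Installs is column 5 and Genres is column 9 (index -4 of 13).
print(google_col['Installs'], google_col['Genres'])
```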

In [21]:
# Inspect the row that was reported as faulty. Its index is 10472 when the
# header is excluded (it would be 10473 if the header were still in the list).
# The row is missing its category value, so the remaining values are shifted
# one column to the left.
# The correct category for this app is 'Lifestyle'

error_row = google_data[10472]
print(google_data_header)
print(error_row)

# Delete the faulty row with the del statement. Run the deletion only once:
# afterwards, index 10472 holds the next row, so running it again would
# delete a valid row. The statement is commented out here for that reason.
# del google_data[10472]
print(google_data[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
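
The faulty row above has 12 values while the header has 13, so rows like it can also be found by comparing lengths rather than by knowing the index in advance. A minimal sketch, using the two rows printed above as sample data; `find_malformed_rows` is a hypothetical helper.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
sample_rows = [
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+',
     'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up'],
]


def find_malformed_rows(dataset, header):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


print(find_malformed_rows(sample_rows, header))  # -> [(0, 12)]
```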


In [22]:
# Before we start the analysis, we want to remove any duplicates from the
# Google Play data set

unique_apps = []
duplicate_apps = []

for row in google_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

# Show a few of the duplicate app names
print(duplicate_apps[:11])
print('Count of duplicates: ', len(duplicate_apps))

# We will not remove duplicates at random, because some entries are older
# than others. Instead, we will set a criterion that keeps the most recent
# entry for each app, which gives us more accurate data

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic']
Count of duplicates:  1181
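
The same count can be cross-checked with `collections.Counter`, which also shows how many times each name occurs. This is a sketch on a short, made-up list of names (not the real data set); `extra_entries` corresponds to what the loop above collects in `duplicate_apps`.

```python
from collections import Counter

# Made-up sample of app names; in the notebook these would come from row[0].
app_names = ['Box', 'Slack', 'Box', 'Google Ads', 'Box', 'Slack']

name_counts = Counter(app_names)

# Names that occur more than once, with their number of occurrences.
repeated = {name: count for name, count in name_counts.items() if count > 1}

# Every occurrence after the first one counts as a duplicate entry.
extra_entries = sum(count - 1 for count in repeated.values())

print(repeated)
print('Count of duplicates: ', extra_entries)
```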


In [23]:
# The criterion for removing duplicates is the number of reviews: we keep the
# entry with the highest count, since the higher the number, the more recent
# the data should be

reviews_max = {}

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Length: ', len(reviews_max))

Length:  9659


Now that we have identified the duplicates, let's use reviews_max to remove the duplicates from our data set. We will only be keeping the entries that have the highest number of reviews.

* We start by initializing two empty lists, google_clean and already_added.
* We use a for loop to iterate through the original data set:
    * For each iteration, we grab the name of the app and the number of reviews.
    * We add the current row to the google_clean list and the app name to the already_added list if both of the following hold:
        * The number of reviews matches the number of reviews recorded in the reviews_max dictionary.
        * The app is not already in the already_added list. We need this additional condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry.

In [24]:
google_clean = []
already_added = []

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(row)
        already_added.append(name)
        
print('Length: ', len(google_clean))

Length:  9659
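
To see why the second condition matters, here is the same two-pass approach on a few made-up rows in which two entries tie for the highest review count. It is a sketch rather than the notebook's exact code: the rows are invented, and `already_added` is a set instead of a list, which makes the membership test faster on large data sets.

```python
# Made-up rows in the form [name, category, rating, reviews].
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '150'],
    ['Box', 'BUSINESS', '4.2', '150'],  # ties with the previous entry
    ['Slack', 'BUSINESS', '4.4', '80'],
]

# Pass 1: record the highest review count seen for each app name.
reviews_max = {}
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    if n_reviews > reviews_max.get(name, -1):
        reviews_max[name] = n_reviews

# Pass 2: keep the first row that matches the maximum for each name.
clean = []
already_added = set()
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        clean.append(row)
        already_added.add(name)

print('Length: ', len(clean))
```

Without the `already_added` check, both tied 'Box' rows would be kept and the result would still contain a duplicate.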


In [25]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


Next, we filter out non-English apps.

In [26]:
# def is_english(string):
#    for character in string:
#        if ord(character) > 127:
#            return False
#    return True
    
# print(is_english('Instagram'))
# print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# print(is_english('Docs To Go™ Free Office Suite'))
# print(is_english('Instachat 😜'))

Are you surprised that the last two apps return False? I was, but it makes sense once you think about it: the ™ symbol and the emoji both have code points higher than 127. As written, the function would remove some English apps, which is not our intention. We need to modify it so that it still accepts English names that contain a few such symbols.

In [27]:
print(ord('™'))
print(ord('😜'))

8482
128540
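
To make the 127 threshold concrete, the sketch below prints the code point of a few characters and counts how many characters in a name fall outside the ASCII range. `count_non_ascii` is an illustrative helper, not a function from the notebook.

```python
# ASCII characters have code points 0 to 127; everything else is higher.
for character in ['a', 'Z', '~', '™', '😜']:
    print(repr(character), ord(character), ord(character) <= 127)


def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in text if ord(character) > 127)


print(count_non_ascii('Instagram'))
print(count_non_ascii('Docs To Go™ Free Office Suite'))
print(count_non_ascii('Instachat 😜'))
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

The two English names each contain a single non-ASCII character, while the Chinese name contains many, which is what makes a small tolerance a workable rule of thumb.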


We are now going to make a variation of the is_english() function that allows an app name to contain up to three emoji or other special characters. If a name has more than three, the app is considered non-English.

In [28]:
def is_english(string):
    special_char_count = 0
    for character in string:
        if ord(character) > 127:
            special_char_count += 1
        
    # Allow up to three non-ASCII characters
    return special_char_count <= 3
            
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))        

True
False
True
True


Next, apply the function to each data set and collect the English apps in a separate list (google_eng and apple_eng). The app name is in a different column in each data set:

* For the App Store, the name is row[1] (row[0] is the ID)
* For Google Play, the name is row[0]

In [30]:
google_eng = []
apple_eng = []

for app in google_clean:
    name = app[0]
    if is_english(name):
        google_eng.append(app)
        
for app in apple_data:
    name = app[1]
    if is_english(name):
        apple_eng.append(app)
        
explore_data(google_eng, 0, 3, True)
print('\n')
explore_data(apple_eng, 0, 3, True)

    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

In [32]:
# just to check if it's osmino
# google_data[10472]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

In [35]:
apple_free = []
google_free = []

for app in apple_eng:
    price = app[4]
    if price == '0.0':
        apple_free.append(app)
        
for app in google_eng:
    price = app[7]
    if price == '0':
        google_free.append(app)
        
print(len(apple_free))
print(len(google_free))


3222
8864


In [43]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    percent_table = {}
    for key in table:
        percentage = (table[key] / total) * 100
        percent_table[key] = percentage
        
    return percent_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
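
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's implementation; `freq_table_counter` is a hypothetical name and the rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows in the form [name, genre].
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_counter(toy_rows, 1)

# Sort by percentage, highest first, as display_table does.
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```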

In [39]:
print(google_data_header)
print(apple_data_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Analyzing the frequency of Category and Genres in the Google Play store data set.

What are the most common genres? Tools, Entertainment, Education, and Business.

What other patterns do you see? The Genres column is more granular than Category because it adds secondary labels. For example, there is an Entertainment genre and also an Entertainment;Music & Video genre, so the count for a broad genre is split across several entries and looks lower than it otherwise would. The Tools genre is more common than Entertainment, which is an interesting contrast with the App Store data set. In Category, Family is about twice the size of Game (roughly 18.9% versus 9.7%), which suggests a market for kid-friendly apps aimed at parents with children.


In [47]:
display_table(google_free, 1)
print('\n')
display_table(google_free, -4)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Analyzing the frequency of prime_genre in the App Store data set.

Most common genre: Games
Runner-up: Entertainment

Patterns: Entertainment apps are more common than Lifestyle apps. Games account for over 50% of the apps (about 58%).

General impression: Most apps are designed for entertainment, probably to keep users engaged and push in-app purchases (a strong guess).

Can you recommend an app profile for the App Store market based on this frequency table alone? No, because the table only shows how many apps of each kind exist. We would need a further analysis of the apps themselves, not just their genres.

If there's a large number of apps of a particular genre, does that also imply that apps of that genre generally have a large number of users? Not necessarily. Developers create many kinds of apps every day, but that does not mean users download apps of a particular genre in abundance. It only means there are a lot of apps in that genre. To identify the top 10 genres by users rather than by frequency, we would need a secondary analysis that finds the top apps and then looks at which genres they belong to.

In [49]:
display_table(apple_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


## Most popular apps by genre on the App Store

In [53]:
unique_app_genres = freq_table(apple_free, -5)

# print(unique_app_genres)
  
for genre in unique_app_genres:
    total = 0
    len_genre = 0
    for row in apple_free:
        genre_app = row[-5]
        if genre_app == genre:
            n_ratings = float(row[5])  # rating_count_tot: total number of ratings
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre  # average number of ratings, not the star rating
    print(genre, ':', avg_n_ratings)
        
    

Social Networking : 71548.34905660378
Health & Fitness : 23298.015384615384
Reference : 74942.11111111111
Shopping : 26919.690476190477
News : 21248.023255813954
Business : 7491.117647058823
Travel : 28243.8
Finance : 31467.944444444445
Food & Drink : 33333.92307692308
Games : 22788.6696905016
Book : 39758.5
Music : 57326.530303030304
Photo & Video : 28441.54375
Weather : 52279.892857142855
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Sports : 23008.898550724636
Entertainment : 14029.830708661417
Catalogs : 4004.0
Lifestyle : 16485.764705882353
Medical : 612.0
Navigation : 86090.33333333333
Education : 7003.983050847458


In [54]:
for app in apple_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
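
One reading of the list above is that the Navigation average is dominated by its two largest apps. Comparing the mean with the median makes this visible. The sketch below uses the six counts printed above and the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Rating counts for the free English Navigation apps listed above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

counts = list(navigation_ratings.values())
mean_count = mean(counts)
median_count = median(counts)

# Share of all Navigation ratings that belong to the two largest apps.
top_two_share = sum(sorted(counts, reverse=True)[:2]) / sum(counts)

print('Mean:  ', mean_count)
print('Median:', median_count)
print('Share of top two apps:', top_two_share)
```

The mean matches the Navigation figure in the genre table (about 86090), while the median is far lower, so the genre average on its own may overstate how many users a typical Navigation app has.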


## Most popular apps by category on the Google Play Store

In [60]:
unique_app_cat = freq_table(google_free, 1)
# print(unique_app_cat)

for category in unique_app_cat:
    total = 0
    len_category = 0
    for row in google_free:
        category_app = row[1]
        if category_app == category:
            num_installs = row[5]
            num_installs = num_installs.replace('+','')
            num_installs = num_installs.replace(',','')
            total += float(num_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

HEALTH_AND_FITNESS : 4188821.9853479853
SOCIAL : 23253652.127118643
EVENTS : 253542.22222222222
ART_AND_DESIGN : 1986335.0877192982
TRAVEL_AND_LOCAL : 13984077.710144928
BUSINESS : 1712290.1474201474
MAPS_AND_NAVIGATION : 4056941.7741935486
MEDICAL : 120550.61980830671
PARENTING : 542603.6206896552
FAMILY : 3695641.8198090694
VIDEO_PLAYERS : 24727872.452830188
SPORTS : 3638640.1428571427
PERSONALIZATION : 5201482.6122448975
GAME : 15588015.603248259
BOOKS_AND_REFERENCE : 8767811.894736841
COMMUNICATION : 38456119.167247385
TOOLS : 10801391.298666667
COMICS : 817657.2727272727
FOOD_AND_DRINK : 1924897.7363636363
ENTERTAINMENT : 11640705.88235294
PHOTOGRAPHY : 17840110.40229885
DATING : 854028.8303030303
AUTO_AND_VEHICLES : 647317.8170731707
HOUSE_AND_HOME : 1331540.5616438356
NEWS_AND_MAGAZINES : 9549178.467741935
SHOPPING : 7036877.311557789
PRODUCTIVITY : 16787331.344927534
FINANCE : 1387692.475609756
BEAUTY : 513151.88679245283
LIFESTYLE : 1437816.2687861272
EDUCATION : 1833495.14563
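
The cell above rescans the whole data set once per category. As an alternative sketch, the averages can be computed in a single pass by keeping a running total and count per category. `parse_installs` and `average_installs_by_category` are hypothetical helpers, and the rows below are made up, with the category at index 1 and the installs string at index 5 as in the Google Play data.

```python
from collections import defaultdict


def parse_installs(text):
    """Convert an installs string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average installs} using one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows; only the category and installs columns are filled in.
toy_rows = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['App C', 'TOOLS', '', '', '', '5,000,000+'],
]

averages = average_installs_by_category(toy_rows)
for category, avg_installs in averages.items():
    print(category, ':', avg_installs)
```

Because install counts are open-ended buckets such as '10,000,000+', the averages are lower bounds rather than exact figures.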

In [65]:
for app in google_free:
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 