DataQuest Project 1: App Usage Data Analysis

This project analyzes data for a company that builds Android and iOS mobile apps.  The goal of the project is to recommend app profiles likely to attract more users on Google Play and the App Store, thereby maximizing in-app ad revenue.

In [2]:
import csv        
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
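# Read both CSV files into lists of rows (the first row of each is the header).
# The context managers close the files; utf8 handles non-ASCII app names.
with open("AppleStore.csv", encoding="utf8") as ios_file:
    ios = list(csv.reader(ios_file))
with open("googleplaystore.csv", encoding="utf8") as google_file:
    google = list(csv.reader(google_file))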

In [3]:
# Explore the iOS data: print the header row and the first 4 apps
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
# Explore the Google data: print the header row and the first 4 apps
explore_data(google, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
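# Row 10473 of the Google dataset is malformed: the 'Category' field is missing,
# so it has 12 fields instead of 13 and every later value is shifted one column left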
google[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']
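
A malformed row like this one can also be located without knowing its index in advance, by comparing each row's length with the header's. The sketch below is only illustrative: the helper name `find_malformed_rows` and the three-column `sample` list are assumptions made for the example, not part of the original notebook or the real file.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected field count
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up dataset: a header plus two rows, the second of which
# is missing its 'Category' field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)  # [(2, 2)]
```

Calling `find_malformed_rows(google)` on the real data before deleting anything should flag the row shown above.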

In [6]:
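# Delete the malformed row, then confirm that the next row now occupies its index.
# Run this cell only once; rerunning it would delete a valid row.
del google[10473]
google[10473]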

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

In [7]:
# Google data has duplicate apps so we need to remove them.
# We'll keep the version with the most reviews since that version 
# likely has the most recent data.
unique_apps = []
duplicate_apps = []
for row in google:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)    
print(len(duplicate_apps))
duplicate_apps[0:10]

1181


['Quick PDF Scanner + OCR FREE',
 'Box',
 'Google My Business',
 'ZOOM Cloud Meetings',
 'join.me - Simple Meetings',
 'Box',
 'Zenefits',
 'Google Ads',
 'Google My Business',
 'Slack']

In [8]:
reviews_max = {}
for row in google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


To clean the Google data set, we need to eliminate duplicate applications, keeping the instance with the most reviews.  We'll create two lists: one to hold the cleaned rows and one to track the names of apps already added.  A row is added only if its review count matches the maximum stored in reviews_max and its name has not been added yet.  The second check matters when duplicate rows tie on review count.
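
The same idea can be sketched as a single pass that stores the best row per app name in a dictionary. This is an alternative formulation, not the notebook's method: the function name `dedupe_by_max_reviews` and the review counts in `sample_rows` are made up for illustration, and the result is ordered by each name's first appearance rather than by the position of its winning row.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical rows (name, category, rating, reviews); values are illustrative only.
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '120'],
]
deduped = dedupe_by_max_reviews(sample_rows)
print(deduped)
# [['Box', 'BUSINESS', '4.2', '120'], ['Slack', 'BUSINESS', '4.4', '50']]
```

Because dictionary lookups take constant time on average, this avoids the repeated list scans that `name not in already_added` performs on a list.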

In [9]:
android_clean = []
already_added = []
for row in google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name]:
        if name not in already_added:
            android_clean.append(row)
            already_added.append(name)
print(len(android_clean))
android_clean[0:5]

9659


[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up'],
 ['Sketch - Draw & Paint',
  'ART_AND_DESIGN',
  '4.5',
  '215644',
  '25M',
  '50,000,000+',
  'Free',
  '0',
  'Teen',
  'Art & Design',
  'June 8, 2018',
  'Varies with device',
  '4.2 and up'],
 ['Pixel Draw - Number Art Coloring Book',
  'ART_AND_DESIGN',
  '4.3',
  '967',
  '2.8M',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design;Creativity',
  'June 20, 2018',
  '1.1',
  '4.4 and up'],
 ['Paper flowers instructions',
  'ART_AND_DESIGN',
  '4.4',
  '167',
  '5.6M',
  '50,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'March 26, 2017

In [10]:
# Check whether or not an app is intended for English speakers
# using ASCII and ord() function.  We'll require more than 3 non-ASCII
# characters to consider it non-English
def check_english(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            # A fourth non-ASCII character marks the name as non-English
            if count > 3:
                return False
    return True
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
True
True
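
The tolerance of three characters is easier to justify after looking at the code points involved: symbols such as ™ and emoji fall outside the ASCII range (0 to 127) even though the surrounding name is English. The sketch below prints a few code points and counts non-ASCII characters per name; `count_non_ascii` is a hypothetical helper introduced only for this illustration.

```python
def count_non_ascii(string):
    """Count the characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)

# Code points of a plain letter, the trademark sign, and an emoji.
print(ord('a'))   # 97
print(ord('™'))   # 8482
print(ord('😜'))  # 128540

# English names with one stray symbol stay under the threshold of 3.
print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
print(count_non_ascii('Instachat 😜'))                   # 1

# A name written mostly in Chinese characters exceeds it.
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播') > 3)  # True
```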


In [11]:
ios_english = []
android_clean_english = []
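for row in ios[1:]:
    # track_name is column 1; column 0 is the numeric id, which always passes
    # the ASCII check. The iOS counts recorded in this notebook come from a run
    # that checked column 0, so they still include non-English titles.
    name = row[1]
    if check_english(name):
        ios_english.append(row)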
for row in android_clean:
    name = row[0]
    if check_english(name):
        android_clean_english.append(row)
print(len(ios_english))
print(len(android_clean_english))

7197
9614


In [12]:
ios_english_free = []
android_clean_english_free = []
for row in ios_english:
    price = float(row[4])
    if price == 0:
        ios_english_free.append(row)
for row in android_clean_english:
    free = row[6]
    if free == 'Free':
        android_clean_english_free.append(row)
print(len(ios_english_free))
print(len(android_clean_english_free))

4056
8863


We would like to develop a successful app using minimal resources.  The plan involves three steps:

1) Build a bare-bones Android version of the app and add it to Google Play.
2) If the initial reception is good, develop the app further.
3) If the app's success continues for an additional 6 months, build an iOS version.

To validate this strategy, we'll look for app profiles that are successful on both Google Play and the App Store.

In [13]:
def freq_table(dataset, index):
    # Map each value found in column `index` to its share of the rows
    table = {}
    total = len(dataset)
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    # Convert raw counts to proportions (0 to 1)
    for key in table:
        table[key] /= total
    return table
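
For comparison, the same proportions can be computed with `collections.Counter` from the standard library. This is a minimal sketch of an alternative, not a replacement for the function above; the name `freq_table_counter` and the `sample` rows are made up for the example.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: proportion of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total for value, count in counts.items()}

# Made-up rows: (app name, genre)
sample = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Music'],
    ['d', 'Games'],
]
print(freq_table_counter(sample, 1))  # {'Games': 0.75, 'Music': 0.25}
```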

In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
print('Apple Prime Genre:')
display_table(ios_english_free, 11)
print('\nGoogle Genres:')
display_table(android_clean_english_free, 9)
print('\nGoogle Category:')
display_table(android_clean_english_free, 1)

Apple Prime Genre:
Games : 0.5564595660749507
Entertainment : 0.08234714003944774
Photo & Video : 0.04117357001972387
Social Networking : 0.035256410256410256
Education : 0.03254437869822485
Shopping : 0.029832347140039447
Utilities : 0.02687376725838264
Lifestyle : 0.023175542406311637
Finance : 0.020710059171597635
Sports : 0.01947731755424063
Health & Fitness : 0.01873767258382643
Music : 0.016518737672583828
Book : 0.016272189349112426
Productivity : 0.015285996055226824
News : 0.014299802761341223
Travel : 0.013806706114398421
Food & Drink : 0.010601577909270217
Weather : 0.007642998027613412
Reference : 0.004930966469428008
Navigation : 0.004930966469428008
Business : 0.004930966469428008
Catalogs : 0.0022189349112426036
Medical : 0.0019723865877712033

Google Genres:
Tools : 0.08450863138892023
Entertainment : 0.06070179397495205
Education : 0.053480762721426156
Business : 0.04592124562789123
Productivity : 0.038925871601038026
Lifestyle : 0.038925871601038026
Finance : 0.037007

The frequency tables above show the share of free English apps in each Apple prime genre, Android genre and Android category, expressed as proportions (0.08 means 8%).  Note, though, that the two stores do not classify apps in the same manner.  For example, 'Family', the leading Android category at 19% of free English apps, is not a category for iOS apps at all.  Thus a direct comparison is not possible.

Nevertheless, we see some significant differences between the breakdown of Apple apps and that of Google apps.  Above all, games dominate the list of free English apps on the App Store: more than half of them (55.6%) are games.  Adding entertainment apps (8.2%) and social networking apps (3.5%), apps designed for fun account for more than 2/3 of all free English iOS apps.
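
The two-thirds figure can be checked directly from the proportions printed in the Apple Prime Genre table. The variable names below are introduced only for this check.

```python
# Proportions copied from the Apple Prime Genre output above.
games = 0.5564595660749507
entertainment = 0.08234714003944774
social_networking = 0.035256410256410256

fun_share = games + entertainment + social_networking
print(round(fun_share, 3))  # 0.674
print(fun_share > 2 / 3)    # True
```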

On the other hand, the breakdown of Google apps appears much more balanced.  Games account for less than 10% of the Google apps.  Whereas iOS apps are heavily skewed towards fun-oriented apps, many Android apps are designed for productivity.  Tools represent 8.5% of the Google apps, business apps 4.6% and productivity apps 3.9%.

According to our stated plan, the ultimate success of the app would come through the iOS release.  Therefore, we recommend developing an app likely to succeed on that platform.  Of course, the fact that games dominate the App Store does not necessarily mean they also have the most users.  Nevertheless, it is reasonable to believe that developers would not release so many games unless they had some success generating ad revenue.  Thus we recommend that the client develop a free English game, since games are prevalent on Google Play (the #2 category) and dominate the App Store (the #1 genre).

In the next section we'll explore average usage among the different categories.

In [16]:
apple_genres = freq_table(ios_english_free, 11)
for genre in apple_genres:
    total = 0
    len_genre = 0
    for row in ios_english_free:
        genre_app = row[11]
        if genre_app == genre:
            ratings = float(row[5])
            total += ratings
            len_genre += 1
    average_genre = total / len_genre
    print("{}: {:,.1f} ({:,.0f})".format(genre, average_genre, len_genre))

Sports: 20,129.0 (79)
Shopping: 18,746.7 (121)
Travel: 20,216.0 (56)
Utilities: 14,010.1 (109)
Book: 8,498.3 (66)
Education: 6,266.3 (132)
Finance: 13,522.3 (84)
Weather: 47,220.9 (31)
Photo & Video: 27,249.9 (167)
Productivity: 19,053.9 (62)
Business: 6,367.8 (20)
Music: 56,482.0 (67)
Health & Fitness: 19,952.3 (76)
Reference: 67,447.9 (20)
News: 15,892.7 (58)
Navigation: 25,972.0 (20)
Games: 18,924.7 (2,257)
Catalogs: 1,779.6 (9)
Medical: 459.8 (8)
Entertainment: 10,823.0 (334)
Lifestyle: 8,978.3 (94)
Social Networking: 53,078.2 (143)
Food & Drink: 20,179.1 (43)


Based on the table above, the genres with the highest average number of reviews (the rating_count_tot column) are Reference (67,448), Music (56,482) and Social Networking (53,078).  These genres would seem to be attractive areas to target.  Still, while these genres have high averages, there aren't many apps in each of them, which suggests that the high averages are driven by just a few very popular apps.  For example, among social networking apps, Facebook has the highest total reviews of any iOS app with just under 3 million.  It would be hard to produce another social networking app against such a formidable competitor.  Similarly, free music apps are dominated by Pandora.  While Games has a much lower average at just under 19,000 reviews, this genre has more than 2,200 apps, which indicates that the average is not driven by a single app.  Therefore, this area still appears the most attractive.
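
One way to test whether a few apps drive a genre's average is to compare the mean with the median and to measure the share of all reviews held by the top app. The sketch below is a hedged illustration only: `summarize_ratings` is a hypothetical helper, and apart from Facebook's count (taken from the iOS output above) the numbers in `sample_counts` are invented.

```python
from statistics import mean, median

def summarize_ratings(counts):
    """Return the mean, the median, and the top app's share of all ratings."""
    total = sum(counts)
    return {
        'mean': mean(counts),
        'median': median(counts),
        'top_share': max(counts) / total,
    }

# First value is Facebook's rating count from the data; the rest are made up.
sample_counts = [2974676, 1000, 500, 250, 100]
summary = summarize_ratings(sample_counts)
print(summary['median'])                 # 500
print(summary['mean'] > summary['median'])  # True
print(summary['top_share'] > 0.99)       # True
```

A mean far above the median, or a top share close to 1, indicates that the genre average says little about a typical app.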

In [25]:
android_genres = freq_table(android_clean_english_free, 1)
for genre in android_genres:
    total = 0
    len_genre = 0
    for row in android_clean_english_free:
        genre_app = row[1]
        if genre_app == genre:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_genre += 1
    average_genre = total / len_genre
    print("{}: {:,.1f} ({:,.0f})".format(genre, average_genre, len_genre))

ENTERTAINMENT: 11,640,705.9 (85)
GAME: 15,588,015.6 (862)
SPORTS: 3,638,640.1 (301)
PARENTING: 542,603.6 (58)
DATING: 854,028.8 (165)
PRODUCTIVITY: 16,787,331.3 (345)
MAPS_AND_NAVIGATION: 4,056,941.8 (124)
TRAVEL_AND_LOCAL: 13,984,077.7 (207)
SHOPPING: 7,036,877.3 (199)
PERSONALIZATION: 5,201,482.6 (294)
COMMUNICATION: 38,456,119.2 (287)
MEDICAL: 120,550.6 (313)
WEATHER: 5,074,486.2 (71)
PHOTOGRAPHY: 17,840,110.4 (261)
COMICS: 817,657.3 (55)
HEALTH_AND_FITNESS: 4,188,822.0 (273)
LIFESTYLE: 1,437,816.3 (346)
ART_AND_DESIGN: 1,986,335.1 (57)
FOOD_AND_DRINK: 1,924,897.7 (110)
TOOLS: 10,801,391.3 (750)
FAMILY: 3,697,848.2 (1,675)
EVENTS: 253,542.2 (63)
BEAUTY: 513,151.9 (53)
NEWS_AND_MAGAZINES: 9,549,178.5 (248)
VIDEO_PLAYERS: 24,727,872.5 (159)
EDUCATION: 1,833,495.1 (103)
HOUSE_AND_HOME: 1,331,540.6 (73)
BUSINESS: 1,712,290.1 (407)
AUTO_AND_VEHICLES: 647,317.8 (82)
FINANCE: 1,387,692.5 (328)
SOCIAL: 23,253,652.1 (236)
BOOKS_AND_REFERENCE: 8,767,811.9 (190)
LIBRARIES_AND_DEMO: 638,503.7 (
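
Note that the Installs column holds open-ended buckets such as '10,000+' rather than exact counts, so the conversion used above yields each bucket's lower bound and the averages are approximations. A minimal sketch of that conversion as a reusable helper (the name `parse_installs` is an assumption made for this example):

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its numeric lower bound."""
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))      # 10000.0
print(parse_installs('50,000,000+'))  # 50000000.0
```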

Based on the table above, the categories with the highest average number of installs are Communication (38.5 million), Video Players (24.7 million) and Social (23.3 million).  Again, it would likely be difficult to create a successful app in these areas.  Not only are these categories dominated by a few very popular apps, but they also seem to demand specific expertise: both communication apps and video players would require a high degree of domain knowledge, which is likely hard to come by or expensive.

One potential area to target based on the list above is Productivity.  This category ranks fifth by average installs, just behind Photography, with just under 17 million.  It also contains a relatively high number of apps (345), which indicates a broad level of demand.  On the iOS platform, the Productivity genre has a solid average of about 19,000 reviews (just above Games), which suggests that a successful Android version could translate into a successful iOS version.
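
To see exactly where Productivity and Games fall, the categories can be sorted by average installs. The sketch below copies the seven largest averages from the output above into a dictionary (every other category in the table is below 12 million) and ranks them; the variable names are introduced only for this illustration.

```python
# Average installs copied from the Android category output above
# (only the seven largest values are included).
avg_installs = {
    'COMMUNICATION': 38456119.2,
    'VIDEO_PLAYERS': 24727872.5,
    'SOCIAL': 23253652.1,
    'PHOTOGRAPHY': 17840110.4,
    'PRODUCTIVITY': 16787331.3,
    'GAME': 15588015.6,
    'TRAVEL_AND_LOCAL': 13984077.7,
}

# Sort category names from highest to lowest average installs.
ranked = sorted(avg_installs, key=avg_installs.get, reverse=True)
for rank, category in enumerate(ranked, start=1):
    print("{}. {}: {:,.1f}".format(rank, category, avg_installs[category]))

print(ranked.index('PRODUCTIVITY') + 1)  # 5
print(ranked.index('GAME') + 1)          # 6
```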

In addition to Productivity, Games also appears to be an attractive category on Google Play.  It has a high number of apps (862) as well as a relatively high average number of installs (more than 15 million).  As highlighted above, Games is also an attractive genre on the App Store.

While deeper analysis would be beneficial here, our initial recommendation is to focus on either the Productivity category or the Games category.