### New App Project

Determine which app types will attract the most users based on analyzing Google Play store data.

In [3]:
from csv import reader

#Google Play Store data
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]

print(len(android_data))


10841


Let's see what the data looks like and what columns would make sense to analyze.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
print(android_header)
print('\n')
explore_data(android_data,0,3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']





['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']





['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']





['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']





Number of rows: 10841

Number of columns: 13


Remove row with one column missing (bad data).

In [6]:
del android_data[10472]

Below we'll check for duplicates in each dataset.

In [7]:
duplicate_apps_android=[]
unique_apps_android=[]

for app in android_data:
    name = app[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)

print('Google Play Store duplicates:',len(duplicate_apps_android))
print('\n')
print('Examples',duplicate_apps_android[:10])


Google Play Store duplicates: 1181





Examples ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


For all apps, create a dictionary that selects the app name and highest number of ratings. 

In [8]:
reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


If the row matches the max reviews in the dictionary, keep it and add to the android_clean list.

In [9]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name]==n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))    


9659


Create function to identify all apps wih non-English chars.

In [10]:
def char_detect(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True

In [11]:
android_clean_english = []

for app in android_clean:
    name = app[0]
    if char_detect(name):
        android_clean_english.append(app)

print(android_clean_english[:10])
print(len(android_clean_english))


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

Isolate dataset to free apps only.

In [12]:
android_clean_english_free = []

for app in android_clean_english:
    price = app[7]
    if price == '0':
        android_clean_english_free.append(app)
        
print(len(android_clean_english_free))

8864


Analyze apps by Category column. Determine percent of totals apps for each Category.

In [13]:
def freq_table_percent(dataset,index):

    column_analysis = {}
    total = len(dataset)

    for app in dataset:
        column = app[index]
        if column in column_analysis:
            column_analysis[column] += 1
        else:
            column_analysis[column] = 1
    
    app_percentages = {}
    for app in column_analysis:    
        percentage = (column_analysis[app] / total) * 100
        app_percentages[app] = percentage
        
    return app_percentages

Create a function that will call the frequency table percent function above and sort the frequency table produced from largest to smallest.

In [14]:
def display_table(dataset, index):
    table = freq_table_percent(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [15]:
display_table(android_clean_english_free,1) #category

FAMILY : 18.907942238267147

GAME : 9.724729241877256

TOOLS : 8.461191335740072

BUSINESS : 4.591606498194946

LIFESTYLE : 3.9034296028880866

PRODUCTIVITY : 3.892148014440433

FINANCE : 3.7003610108303246

MEDICAL : 3.531137184115524

SPORTS : 3.395758122743682

PERSONALIZATION : 3.3167870036101084

COMMUNICATION : 3.2378158844765346

HEALTH_AND_FITNESS : 3.0798736462093865

PHOTOGRAPHY : 2.944494584837545

NEWS_AND_MAGAZINES : 2.7978339350180503

SOCIAL : 2.6624548736462095

TRAVEL_AND_LOCAL : 2.33528880866426

SHOPPING : 2.2450361010830324

BOOKS_AND_REFERENCE : 2.1435018050541514

DATING : 1.861462093862816

VIDEO_PLAYERS : 1.7937725631768955

MAPS_AND_NAVIGATION : 1.3989169675090252

FOOD_AND_DRINK : 1.2409747292418771

EDUCATION : 1.1620036101083033

ENTERTAINMENT : 0.9589350180505415

LIBRARIES_AND_DEMO : 0.9363718411552346

AUTO_AND_VEHICLES : 0.9250902527075812

HOUSE_AND_HOME : 0.8235559566787004

WEATHER : 0.8009927797833934

EVENTS : 0.7107400722021661

PARENTING : 0.65433

Create function that can pull categories from frequency table and calculate average for Installs or Rating. Remove any strings ('NaN') that cannot be converted to a number. Remove special characters so that strings from Installs can be converted.Then update value to average Installs or Rating. Lastly, sort the values from high to low.

NOTE: since installs are given in a range (e.g. 10,000+, 50,000+, 100,000+,), installs are estimated as being the bottom number used to name the range.

In [16]:
def avg_metric(dataset,metric_index,dim_index=1):
    
    import operator
    avg_metric_dict = freq_table_percent(dataset,dim_index)
    

    for category in avg_metric_dict:
        total = 0
        len_metric = 0

        for app in dataset:
            category_app = app[dim_index]
            if category_app == category:
                metric = app[metric_index]
                if metric.find('.') == 1 and metric != 'NaN':
                    dec_metric = float(metric)
                    
                elif metric.find('.') != 1 and metric != 'NaN':
                    metric = metric.replace(',', '')
                    metric = metric.replace('+', '')
                    dec_metric = float(metric)
                    total += dec_metric
                    len_metric += 1
            
                total += dec_metric
                len_metric += 1 
            
        metric_avg = total / len_metric
        avg_metric_dict[category] = metric_avg
        
        sorted_values = sorted(avg_metric_dict.items(), key=operator.itemgetter(1))
        sorted_values = dict(sorted(avg_metric_dict.items(), key=operator.itemgetter(1),reverse=True))
        
    return sorted_values



In [17]:
avg_metric(android_clean_english_free,5) #installs

{'COMMUNICATION': 38456119.167247385,
 'VIDEO_PLAYERS': 24727872.452830188,
 'SOCIAL': 23253652.127118643,
 'PHOTOGRAPHY': 17840110.40229885,
 'PRODUCTIVITY': 16787331.344927534,
 'GAME': 15588015.603248259,
 'TRAVEL_AND_LOCAL': 13984077.710144928,
 'ENTERTAINMENT': 11640705.88235294,
 'TOOLS': 10801391.298666667,
 'NEWS_AND_MAGAZINES': 9549178.467741935,
 'BOOKS_AND_REFERENCE': 8767811.894736841,
 'SHOPPING': 7036877.311557789,
 'PERSONALIZATION': 5201482.6122448975,
 'WEATHER': 5074486.197183099,
 'HEALTH_AND_FITNESS': 4188821.9853479853,
 'MAPS_AND_NAVIGATION': 4056941.7741935486,
 'FAMILY': 3695641.8198090694,
 'SPORTS': 3638640.1428571427,
 'ART_AND_DESIGN': 1986335.0877192982,
 'FOOD_AND_DRINK': 1924897.7363636363,
 'EDUCATION': 1833495.145631068,
 'BUSINESS': 1712290.1474201474,
 'LIFESTYLE': 1437816.2687861272,
 'FINANCE': 1387692.475609756,
 'HOUSE_AND_HOME': 1331540.5616438356,
 'DATING': 854028.8303030303,
 'COMICS': 817657.2727272727,
 'AUTO_AND_VEHICLES': 647317.8170731707

In [18]:
avg_metric(android_clean_english_free,2) #rating

{'EVENTS': 4.439682539682542,
 'ART_AND_DESIGN': 4.345614035087719,
 'EDUCATION': 4.34271844660194,
 'BOOKS_AND_REFERENCE': 4.333684210526316,
 'PARENTING': 4.3275862068965525,
 'PERSONALIZATION': 4.300680272108847,
 'SOCIAL': 4.276694915254236,
 'BEAUTY': 4.243396226415094,
 'SPORTS': 4.242192691029909,
 'WEATHER': 4.222535211267605,
 'SHOPPING': 4.218592964824123,
 'GAME': 4.217517401392118,
 'LIBRARIES_AND_DEMO': 4.2144578313253,
 'COMICS': 4.198181818181819,
 'HEALTH_AND_FITNESS': 4.193406593406592,
 'FOOD_AND_DRINK': 4.178181818181818,
 'MEDICAL': 4.172523961661343,
 'FAMILY': 4.166467780429597,
 'PRODUCTIVITY': 4.157681159420297,
 'AUTO_AND_VEHICLES': 4.156097560975611,
 'PHOTOGRAPHY': 4.145210727969348,
 'COMMUNICATION': 4.132404181184663,
 'ENTERTAINMENT': 4.118823529411763,
 'LIFESTYLE': 4.108670520231213,
 'FINANCE': 4.106097560975613,
 'BUSINESS': 4.103194103194107,
 'HOUSE_AND_HOME': 4.083561643835616,
 'NEWS_AND_MAGAZINES': 4.068145161290324,
 'VIDEO_PLAYERS': 4.0515723270

The top 3 categories when looking at all apps from the Google Play store are family, game, and tools. The top 3 categories for most installs are communication (e.g. WhatsApp, Skype), video players (e.g. YouTube), and social (Facebook, Twitter, Snapchat). The top 3 rated app categories are events, art & design, and education. Depending on how you slice the data, you’ll get a different answer for what app category to build a new app for. Advertising revenue would be based on users downloading the app and using it often. For that reason, the results of most installs should be prioritized over the others. That said, it may not be wise to build an app within the top 3 (communication, video players, social) as that market is saturated. Perhaps one of the other categories from the top 10 of most installs would be a place to start. We would need to dig deeper into user behavior to understand which apps have the highest engagement from users, and would drive the highest ad profits. 