# Identifying Lucrative Apps

* **Overview**: An app company wants to build a new revenue-generating app. Because its revenue comes from in-app ads, a high user count is the target metric. We need to identify which free app categories attract the most users.
* **Goal**: 
    To find profitable Free App categories in the Android and iOS marketplaces, based on number of users.

In [1]:
from csv import reader

#read a CSV file into a list of rows and close the file afterwards
#utf8 is assumed here because many app names contain non-ASCII characters
def load_csv(path):
    with open(path, encoding='utf8') as f:
        return list(reader(f))

dataset_apple, dataset_droid = load_csv('AppleStore.csv'), load_csv('googleplaystore.csv')

#separate the header row from the data rows
dataset_apple_h, dataset_droid_h = dataset_apple[0], dataset_droid[0]
dataset_apple, dataset_droid = dataset_apple[1:], dataset_droid[1:]

In [2]:
#takes 4 inputs:
#a dataset, start and end indices for the slice, and an optional flag to print row/column counts
def explore_data(dataset, start, end, rows_and_columns=True):
    #holds the section of data
    dataset_slice = dataset[start:end]   
    #prints each row, then adds a new line
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

[Apple Source](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

[Android Source](https://www.kaggle.com/lava18/google-play-store-apps)

In [3]:
print("Apple DataSet Exploration\n")
explore_data(dataset_apple,0,5,True)
print("\n")
print("Android DataSet Exploration\n")
explore_data(dataset_droid,0,5,True)

Apple DataSet Exploration

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows: 7197
Number of columns: 17


Android DataSet Exploration

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0',

Finding Useful Column Data

In [4]:
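#the header rows are flat lists of column names, so in the output below
#"Number of rows" is really the column count and "Number of columns"
#is only the length of the first header string
print("Apple DataSet Columns")
explore_data(dataset_apple_h,0,17)
print("\n")
print("Droid DataSet Columns")
print("\n")
explore_data(dataset_droid_h,0,13)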

Apple DataSet Columns



id


track_name


size_bytes


currency


price


rating_count_tot


rating_count_ver


user_rating


user_rating_ver


ver


cont_rating


prime_genre


sup_devices.num


ipadSc_urls.num


lang.num


vpp_lic


Number of rows: 17
Number of columns: 0


Droid DataSet Columns


App


Category


Rating


Reviews


Size


Installs


Type


Price


Content Rating


Genres


Last Updated


Current Ver


Android Ver


Number of rows: 13
Number of columns: 3


Useful Apple Cols
* id 
* track_name  
* price
* rating_count_tot
* user_rating

Useful Droid Cols
* App 
* Category  
* Rating
* Installs
* Price

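Error in Android Dataset: entry 10472 is malformed.
Its Category value is missing, so every later field is shifted one column to the left and the Rating column reads 19, which is above the maximum of 5.
[Discussion link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)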

In [5]:
print(dataset_droid[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del dataset_droid[10472]  #run once only: re-running deletes whichever row now sits at index 10472
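
A more general check, sketched here under the assumption that a malformed row always has a different number of fields than the header, would scan for such rows rather than rely on a known index. The name `find_malformed_rows` and the toy rows are illustrative and not part of the notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its Category value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '19'],
]
bad_indices = find_malformed_rows(header, rows)
print(bad_indices)
```

If several rows were flagged, deleting them from the highest index down would keep the remaining indices valid.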

Data Cleaning - Duplicate Check - Droid

In [7]:

def duplicate_finder(dupes, uniq, dataset):
    for app in dataset:
        #the app name is in the first column
        name = app[0]
        #a name we have already seen is a duplicate
        if name in uniq:
            dupes.append(name)
        else:
            uniq.append(name)
    return dupes, uniq

dupes_set_1, unique_set_1 =  [], []
dupes_set_1, unique_set_1 = duplicate_finder(dupes_set_1, unique_set_1, dataset_droid)

print("Duplicate Count", len(dupes_set_1))
print("Uniques Count", len(unique_set_1))
print("Expected Count", len(dataset_droid) - len(dupes_set_1))

#print("\n", dupes_set_1)

Duplicate Count 1181
Uniques Count 9659
Expected Count 9659


Duplicate Cleaning
Rule: "keep the row with the highest number of reviews and remove the other entries for any given app", since the row with the most reviews should be the most recently collected one.

In [8]:
unique_apps_reviews = {}

for app in dataset_droid:
    #search for each app by name, index 0
    #if not found
    if app[0] not in unique_apps_reviews:
        #set key = name, value = num of reviews
        unique_apps_reviews[app[0]] = float(app[3])
    #if its already in the list, we have a duplicate
    #check for the most recent, by number of reviews
    elif unique_apps_reviews[app[0]] < float(app[3]):
        unique_apps_reviews[app[0]] = float(app[3])

#check count
print(len(unique_apps_reviews))
#print(unique_apps_reviews)

9659


Count of unique_apps_reviews matches our expected value.

In [9]:
#cleaned android data
android_clean = []
#holder
already_added = []

#cleaning the full dataset using the list of uniques
for app in dataset_droid:
    #grab the name, and num of reviews
    name = app[0]
    n_reviews = float(app[3])
    #search uniques list for the name
    #if the app matches our unique list by review, we know its the most recent
    #and its not in the list already,i.e an exact duplicate, add it
    if (n_reviews == unique_apps_reviews[name]) and (name not in already_added):
        #append the entire row to the clean list
        android_clean.append(app)
        #update the already added list
        already_added.append(name)

#check length and first values
print("Clean Values Count: " + str(len(android_clean)) + "\n")
explore_data(android_clean, 0, 3, True)

Clean Values Count: 9659

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
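
The two cells above could probably be folded into a single pass. The sketch below is an assumption about how that might look, not the notebook's method: it keeps, for each app name, the first row seen with the highest review count. `dedupe_by_max_reviews` and the toy rows are hypothetical.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}  # name -> row with the most reviews so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy data (hypothetical): 'App A' appears three times, twice with 250 reviews.
rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.2', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App A', 'TOOLS', '4.1', '250'],
]
clean = dedupe_by_max_reviews(rows)
print(len(clean))
```

Dictionary lookups avoid the repeated list scans of `name not in already_added`. One difference to be aware of: the result is ordered by each name's first appearance, not by the position of the kept row.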


Data Cleaning - Language Check - Apple/Droid

In [10]:
#Non-English App Examples
print(android_clean[4412][0])
print(android_clean[7940][0])

中国語 AQリスニング
لعبة تقدر تربح DZ


In [11]:
def eng_check(text):
    #count characters outside the ASCII range (code point > 127)
    non_eng_counter = 0

    for char in text:
        if ord(char) > 127:
            non_eng_counter += 1
    #allow up to 3 such characters as leeway for emojis, punctuation, etc.
    #more than 3 means the name is probably not English
    return non_eng_counter <= 3
#eng_check('Instachat 😜')

def check_lang_droid(dataset, new_set):
    for app in dataset:
        name = app[0]
        if eng_check(name):
            new_set.append(app)
    found_non_eng = len(dataset) - len(new_set)
    print("Found Non English: ", str(found_non_eng))
    print("Old Data Count: ", str(len(dataset)))
    print("Cleaned Data Count: ", str(len(new_set)))

def check_lang_apple(dataset, new_set):
    for app in dataset:
        name = app[2]
        if eng_check(name):
            new_set.append(app)
    found_non_eng = len(dataset) - len(new_set)
    print("Found Non English: ", str(found_non_eng))
    print("Old Data Count: ", str(len(dataset)))
    print("Cleaned Data Count: ", str(len(new_set)))
            

In [12]:
ios_eng_clean = []
android_eng_clean = []

print("iOS Dataset")
check_lang_apple(dataset_apple, ios_eng_clean )
print("\n")
print("Android Dataset")
check_lang_droid(android_clean, android_eng_clean)


 

iOS Dataset
Found Non English:  1014
Old Data Count:  7197
Cleaned Data Count:  6183


Android Dataset
Found Non English:  45
Old Data Count:  9659
Cleaned Data Count:  9614
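
To see how the filter behaves on individual names, here is a small self-contained variant of the check. The `max_non_ascii` parameter is an addition for illustration; the notebook's version hard-codes a threshold of 3.

```python
def eng_check(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(eng_check('Instachat 😜'))        # True: a single emoji is within the leeway
print(eng_check('中国語 AQリスニング'))   # False: eight non-ASCII characters
```

This is only a heuristic on the app name: an English name with more than three emojis is dropped, and a non-English name written entirely in ASCII passes.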


Summary so far:
* Removed Duplicate Data from android set
* Removed Non-English Data from Both Sets

Next:
* Remove Non-Free apps

In [13]:
#Apple Price is index 5
#Android Price is index 7
def find_free(data_set, new_set,index):
    non_free_count = 0
    for app in data_set:
        if app[index] == '0':
            new_set.append(app)
        else:
            non_free_count += 1
    print("Found Non Free: ", str(non_free_count))
    print("Old Data Count: ", str(len(data_set)))
    print("Cleaned Data Count: ", str(len(new_set)))
    print("\n")
            

In [14]:
ios_eng_free_clean = []
android_eng_free_clean = []

print("iOS Dataset")
find_free(ios_eng_clean, ios_eng_free_clean, 5)
explore_data(ios_eng_free_clean, 0,5,True)

print("\n")
print("Android Dataset")
find_free(android_eng_clean, android_eng_free_clean, 7)
explore_data(android_eng_free_clean, 0,5,True)

iOS Dataset
Found Non Free:  2961
Old Data Count:  6183
Cleaned Data Count:  3222


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']


Number of rows: 3222
Number of columns: 17


Android Dataset
Found Non Free:  750
Old Data Count:  9614
C

Summary so far:

* Removed Duplicate Data from android set
* Removed Non-English Data from Both Sets
* Removed Non-Free apps

Next:
* Building Frequency tables to find profitable apps

Apple Dataset = ios_eng_free_clean

Google Dataset = android_eng_free_clean


1. Identify the most common genre for each market

In [15]:
#find genre index
explore_data(ios_eng_free_clean,0,1,False)
explore_data(android_eng_free_clean,0,1,False)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']




In [16]:
#ios prime_genre index = 12 or -5
#droid genre index = 9
#droid category index = 1

def freq_table(dataset, index):
    newset = {}
    for app in dataset:
        if app[index] in newset:
            newset[app[index]] += 1
        else:
            newset[app[index]] = 1
    
    for app in newset:
        newset[app] = (newset[app] / len(dataset) ) * 100
    
    return newset

In [17]:
#builds a frequency table for one column and prints it sorted from most to least common
def display_table(dataset, index):

    table = freq_table(dataset, index)
    
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
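
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. `freq_table_pct` is a hypothetical name and the genre list is toy data; in the notebook the values would come from one column of the cleaned dataset.

```python
from collections import Counter


def freq_table_pct(values):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


genres = ['Games', 'Games', 'Games', 'Weather']  # toy data
for genre, pct in freq_table_pct(genres):
    print(genre, ':', pct)
```

Ties are ordered by first appearance here, whereas `display_table` breaks ties by sorting on the key in reverse.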

Sorted Genres Frequency Table - iOS, Android

In [18]:
display_table(ios_eng_free_clean, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [19]:
display_table(android_eng_free_clean, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [20]:
display_table(android_eng_free_clean, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

# Pre-lim Report

1. What is the most common genre?

Apple: Games, Entertainment

Google: Family, Game

2. Thoughts?
Most free iOS apps are entertainment focused rather than serious.

The Google Play Store is more mixed, with entertainment having a slight edge.

3. Recommendation.
Game App

4. Questions?
Does a genre being common translate to a high user count?



# Next Steps:

Match user counts to genres to find an active and popular genre

In [21]:
#ios_eng_free_clean
#android_eng_free_clean

In [22]:
ios_freq = freq_table(ios_eng_free_clean, 12)
droid_freq = freq_table(android_eng_free_clean, 1)

print("Average Number of Reviews Per Genre iOS: \n")

for genre in ios_freq: #each genre in the frequency table
    total = 0 #sum of rating counts (rating_count_tot) for this genre
    apps_per_genre = 0 #number of apps in this genre
    for app in ios_eng_free_clean:
        genre_app = app[12] #prime_genre of this app
        if genre_app == genre:
            num_user_rating = float(app[6]) #rating_count_tot
            total += num_user_rating
            apps_per_genre += 1
    avg_num_user = total / apps_per_genre #average rating count per app
    print(genre, ": ", avg_num_user)



Average Number of Reviews Per Genre iOS: 

Productivity :  21028.410714285714
Weather :  52279.892857142855
Shopping :  26919.690476190477
Reference :  74942.11111111111
Finance :  31467.944444444445
Music :  57326.530303030304
Utilities :  18684.456790123455
Travel :  28243.8
Social Networking :  71548.34905660378
Sports :  23008.898550724636
Health & Fitness :  23298.015384615384
Games :  22788.6696905016
Food & Drink :  33333.92307692308
News :  21248.023255813954
Book :  39758.5
Photo & Video :  28441.54375
Entertainment :  14029.830708661417
Business :  7491.117647058823
Lifestyle :  16485.764705882353
Education :  7003.983050847458
Navigation :  86090.33333333333
Medical :  612.0
Catalogs :  4004.0
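
The nested loop rescans the whole dataset once per genre. A single-pass alternative is sketched below, assuming only that the group and value columns are known by index; `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of float(row[value_idx])} using a single pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data (hypothetical): genre in column 0, rating count in column 1.
rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]
averages = average_by_group(rows, group_idx=0, value_idx=1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

On the iOS data the call would be `average_by_group(ios_eng_free_clean, 12, 6)`, and sorting the result makes the top genres easier to read than the unordered listing.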


# Initial iOS Recommendation:

1. Navigation 
2. Music App
3. Social
4. Book
 
Plotting these averages on a graph would help locate a midpoint (such as the median) and spot outliers to remove.
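
Even without a plot, a midpoint and outliers can be estimated numerically. The sketch below assumes the median as the midpoint and Tukey's 1.5 × IQR fences as the outlier rule; both are choices made here for illustration, and the sample values are hypothetical, not taken from the datasets.

```python
from statistics import median, quantiles


def median_and_outliers(averages):
    """Return the median and the names lying outside the 1.5 * IQR fences."""
    values = list(averages.values())
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [name for name, value in averages.items() if value < low or value > high]
    return median(values), outliers


# Hypothetical per-genre averages: one genre is far above the rest.
sample = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 'I': 1000}
mid, outliers = median_and_outliers(sample)
print(mid, outliers)
```

The same function could be applied to the per-app counts within one genre to check whether a few very large apps dominate that genre's average.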

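Our Android dataset contains install information.

The figures are not precise (they are open-ended buckets such as 10,000+), but precision is not needed here.

They are stored as strings, so they must be converted to numbers before averaging.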
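
The conversion can be sketched as a small helper. `parse_installs` is a hypothetical name; it treats each bucket as its lower bound, which matches the precision caveat above.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))      # 10000.0
print(parse_installs('5,000,000+'))   # 5000000.0
```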

In [26]:
#droid_freq
#android_eng_free_clean

#relist headers to find index
#for header in dataset_droid_h:
#    print(header)

print("Average Number of Installs Per Genre Android: \n")
for category in droid_freq:
    total_installs = 0
    apps_per_genre = 0
    for app in android_eng_free_clean:
        app_category = app[1]
        #print(category, app[1])
        if app_category == category:
            installs = app[5]
            installs = installs.replace("+","")
            installs = installs.replace(",","")
            installs = float(installs)
           
            total_installs += installs
            apps_per_genre += 1
    avg_num_installs = total_installs / apps_per_genre
    print(category,": ",avg_num_installs)

Average Number of Installs Per Genre Android: 

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.620689

# Initial Android Recommendation:

1. COMMUNICATION 
2. SOCIAL
3. ENTERTAINMENT
4. SPORTS
 
Plotting these averages on a graph would help locate a midpoint (such as the median) and spot outliers to remove.