# Free apps and their marketability

In this project, I am going to analyze types of mobile apps and how many users they attract. Therefore, we are going to compare different free apps with their visibility and popularity. We are going to compare the apps on both Android and iOS platforms. 

In [1]:
from csv import reader

apple = open('AppleStore.csv')
read_a = reader(apple)
ios = list(read_a)
google = open('googleplaystore.csv')
read_g = reader(google)
android = list(read_g)

Due to unavailability of the entire data set, we will use the available samples for our analysis (source: *Kaggle*). 

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(android,10471,10474,rows_and_columns=False)
del android[10472]

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




In [3]:
#Clean google data set
#Check for duplicates
#Delete duplicate redundant entries

duplicate = []
unique = []

for app in android:
    name = app[0]
   # print(name)
    if name in unique:
        duplicate.append(name)
    else:
        unique.append(name)
        
for app in android:
    name = app[0]
    if name == 'Box':
        print(app)
        
print('Number of duplicate apps = ', len(duplicate))
print('\n')
print('Egs. of duplicate apps', duplicate[:15])
#print(unique)
print('\n')
print('Number of rows in android = ', len(android))
print('\n')
print("No. of unique apps = ", len(android[1:]) - len(duplicate))

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
Number of duplicate apps =  1181


Egs. of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Number of rows in android =  10841


No. of unique apps =  9659


In the previous section we figured that there are duplicate entries for some apps. We also deleted any entry that was missing or was wrong.

Now we need to remove the redundant entries to stop us from double counting them. In order to do that, we will first create `dictionaries` to map app-names to their number of reviews. We will keep the app-names only when their value in the dictionary is the highest, i.e. we will count the app entries for the highest number of reviews as they are the latest data available. In the next few steps, we will perform this task.

In [4]:
#Create dictionary {app-name: no. of reviews}
reviews_max = {}

#Once we have this dictionary, 
#we will create a data set with only unique rows

android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = (app[3]) #Check if there is 3.0M somewhere in reviews, delete that row
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
#print(reviews_max)  
len(reviews_max)

for app in android[1:]:
    name = app[0]
    n_reviews = app[3]
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))        

9659


Now we have removed any duplicate entries in our Google Play app data set.

The next step is to remove any apps that do not have **English** names. This is because we are catering to our English-speaking audiences. 

Using `ordinality` of the English alphabets, given by the **ASCII** codes, we will only keep alphabets that conform to this ordinality, i.e. in the range of `0 to 127`. 

Before we delete these rows in our data set, we first need to index and find out the app names with these characters which will be our next step.

In [5]:
def string_log(string):
    for character in string:
        if ord(character) <= 127:
            return True
        else:
            return False
    
string = '😜a'
print(string_log(string))
    
#This method will remove some relevant apps too 
#Therefore, keep atleast 3 non-eng characters

def string_keep(string):
    non_eng = 0
    for character in string:
        if ord(character) > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    else:
        return True

string = '😜乐电a'
print(string_keep(string))

ios_eng = []
android_eng = []

for app in ios[1:]:
    name = app[1]
    keep_name = string_keep(name)
    if (keep_name == True):
        ios_eng.append(app)
       
        
for app in android_clean:
    name = app[0]
    keep_name = string_keep(name)
    if keep_name == True:
        android_eng.append(app)
        
explore_data(android_eng, 0, 10, True)
print('\n')
explore_data(ios_eng, 0, 3, True)        

False
True
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M

## Isolating the free apps

In [8]:
ios_free = []
android_free = []

for app in ios_eng:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
for app in android_eng:
    price = app[6]
    if price == 'Free':
        android_free.append(app)
        
print(len(ios_free))
print(len(android_free))

3222
8860


## Validation Strategy for an App

Steps:
1. Build a minimal Android version and launch on Google Play
2. Develop further if it gets a good response
3. If the app is profitable after 6 months, build an iOS version and launch it on Apple Store

We need to assess profitable apps for both versions.

In [14]:
print(ios_free[0])
print('\n')
print(android_free[0])

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


By inspection of the data set, we need to build a frequency table for the `prime_genre` column in the Apple data set, and for the `Genres` and `Category` column in the Google data set.

In [20]:
def freq_table(dataset, index):
    table = {}
    freq_genre = 0
    
    for app in dataset:
        genre = app[index]
        freq_genre += 1
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1 #This gives us the freq
            
        #Now need to convert freq to percentage 
    table_pp = {}
    for genre in table:
        table_pp[genre] = (table[genre] / freq_genre) * 100
        
    return table_pp

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_free,-5)
print('\n')
display_table(android_free,1)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


FAMILY : 18.927765237020317
GAME : 9.69525959367946
TOOLS : 8.45372460496614
BUSINESS : 4.5936794582392775
LIFESTYLE : 3.9051918735891653
PRODUCTIVITY : 3.8939051918735887
FINANCE : 3.7020316027088036
MEDICAL : 3.5214446952595937
SPORTS : 3.397291196388262
PERSONALIZATION : 3.3069977426636568


## Analyzing the frequency tables of different genres in iOS and Android apps

From the above analysis, we learn that in the Apple store data set, amongst the free English-named apps, **Games**, and **Entertainment** are the most common. Games dominate the iOS market by a huge margin. However, there is no such strong domination in the Android market. The Android market has **Family** and **Game** as the most common apps in our observable sub-sample. 

Based on this analysis, we can recommend Gaming apps (or Entertainment apps) based on the number of apps available alone. However, we need to check user rating and popularity or downloads before making any suggestions. 

## Analyzing App popularity

Now we move on to checking app popularity by checking the number of installs *or* the user ratings.

In order to do this, we will perform the following steps to first calculate the average user rating in each genre of app in the iOS store:
1. Isolate the apps in each genre
2. Calculate the average user rating 

In [38]:
ios_genre_table = freq_table(ios_free,-5)
print(ios_genre_table)
print('\n')

for genre in ios_genre_table:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            rating = float(app[5])
            total += rating
            len_genre += 1
    avg_rating = total/len_genre 
    print(genre)
    print(avg_rating)
    

{'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Medical': 0.186219739292365, 'Entertainment': 7.883302296710118, 'Games': 58.16263190564867, 'Book': 0.4345127250155183, 'Shopping': 2.60707635009311, 'Reference': 0.5586592178770949, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Travel': 1.2414649286157666, 'Food & Drink': 0.8069522036002483, 'News': 1.3345747982619491, 'Finance': 1.1173184357541899, 'Catalogs': 0.12414649286157665, 'Sports': 2.1415270018621975, 'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Weather': 0.8690254500310366, 'Music': 2.0484171322160147, 'Health & Fitness': 2.0173805090006205, 'Utilities': 2.5139664804469275, 'Business': 0.5276225946617008}


Navigation
86090.33333333333
Lifestyle
16485.764705882353
Medical
612.0
Entertainment
14029.830708661417
Games
22788.6696905016
Book
39758.5
Shopping
26919.690476190477
Reference
74942.11111111111
Education
7003.983050847458
Productivity
21028.410

From the above table, it seems like `Navigation` dominates the user ratings followed by `Reference` and `Social Networking`. By examination of the data set, Navigation is primarily dominated by Google maps. Therefore, we would suggest coming up with a free app in either of the two latter categories. 

The same could be true of any category (needs further analysis).

In [43]:
android_freqtab = freq_table(android_free,1)
print(android_freqtab)
print('\n')

for genre in android_freqtab:
    total = 0
    len_install = 0
    for app in android_free:
        genre_app = app[1]
        if genre_app == genre:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_install += 1
    abg_installs = total/len_install 
    print(genre)
    print(abg_installs)
    

{'SOCIAL': 2.6636568848758464, 'MAPS_AND_NAVIGATION': 1.399548532731377, 'EDUCATION': 1.1738148984198644, 'NEWS_AND_MAGAZINES': 2.799097065462754, 'PERSONALIZATION': 3.3069977426636568, 'HEALTH_AND_FITNESS': 3.0812641083521446, 'PHOTOGRAPHY': 2.945823927765237, 'SHOPPING': 2.2460496613995486, 'BEAUTY': 0.5981941309255079, 'MEDICAL': 3.5214446952595937, 'VIDEO_PLAYERS': 1.7945823927765236, 'BOOKS_AND_REFERENCE': 2.144469525959368, 'EVENTS': 0.7110609480812641, 'COMMUNICATION': 3.239277652370203, 'BUSINESS': 4.5936794582392775, 'FOOD_AND_DRINK': 1.2415349887133182, 'LIFESTYLE': 3.9051918735891653, 'COMICS': 0.6207674943566591, 'ART_AND_DESIGN': 0.6433408577878104, 'TRAVEL_AND_LOCAL': 2.336343115124153, 'GAME': 9.69525959367946, 'FAMILY': 18.927765237020317, 'HOUSE_AND_HOME': 0.8239277652370203, 'WEATHER': 0.8013544018058691, 'FINANCE': 3.7020316027088036, 'DATING': 1.8623024830699775, 'TOOLS': 8.45372460496614, 'SPORTS': 3.397291196388262, 'PARENTING': 0.6546275395033859, 'AUTO_AND_VEHIC