# Analysis of user engagement with free apps 
Data analysis scenario: "We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users."
 -scenario taken from dataquest.io

(The purpose of this project is to put into practice and demonstrate the beginner skills I have gained since I started learning Python)

**datasets used**
[App store](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)
[Google Play](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) 

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # leaves a blank gap after each row: '\n' is a newline character, and print adds one more newline of its own

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In the above code:
- `dataset` is the list of lists extracted from the data source
- `start` is the start index of the slice and `end` is the end index (not included in the slice)
- `rows_and_columns` is a bool with a default value of False
 
How it works:
- the dataset is sliced with `dataset[start:end]`
- the slice is looped through and, on each iteration, a row is printed followed by a blank gap via `print('\n')`
- if `rows_and_columns` is True, the number of rows and columns is printed as well
  

In [2]:
def open_dataset(dataset='../input/app-store-apple-data-set-10k-apps/AppleStore.csv'):
    from csv import reader
    #community solution was to add encoding='utf-8', as opening the file without it returned an error
    with open(dataset, encoding='utf-8') as opened: #the with block closes the file once it has been read
        read_file = reader(opened)
        data = list(read_file)
    return data

In [3]:
Apple_data = open_dataset('../input/app-store-apple-data-set-10k-apps/AppleStore.csv')

In [4]:
Gplay_data = open_dataset('../input/google-play-store-apps/googleplaystore.csv')

In [5]:
explore_data(Apple_data, 0, 3, True) #testing out above function to see if it is working

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7198
Number of columns: 17


In [6]:
explore_data(Gplay_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Now I'll decide which columns are going to help with the analysis.

**Google Play**
Since free apps are trying to capitalise on ad revenue, it makes sense to focus on the 'Installs' [5] column (i.e. the number of people who have installed the app), together with 'Category' [1] and 'Genres' [9].

**Apple store**
The columns are less descriptive, but an app with a high 'user_rating' [8] or 'rating_count_tot' [6] could indicate that more people have installed or used the app. I'll also use 'prime_genre' [12], as this would help to identify whether certain genres tend to have higher numbers of user ratings.

# Data cleaning
The aims here are to:
- remove non-English apps, as the target audience is English speakers
- remove non-free apps

More generally, data cleaning involves:

- removing duplicate data
- modifying data to fit the purpose of the analysis
- removing or correcting wrong data

### Missing entries

In [7]:
#this code iterates through each row in Gplay_data and checks whether it has the same length as the header
#(i.e. the same number of entries as there are column names)

for row in Gplay_data:
    header_length = len(Gplay_data[0]) #header is at index 0 of the data set
    row_length = len(row)
    if row_length != header_length: 
        print(row)
        print('\n')
        print(Gplay_data.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


10473


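The row printed above has 12 entries instead of 13. The 'Category' value is missing, so every later value is shifted one position to the left (the rating '1.9' sits where the category should be), and the value that should be 'Genres' is an empty string. Since the row is malformed, I'll delete it.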
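
The same length check can be sketched with `enumerate`, which yields each row together with its position, so there is no need for a second search with `list.index`. The `rows` list and the helper name `find_malformed_rows` are made up for illustration and are not part of the dataset or the scenario.

```python
# Hypothetical miniature dataset: a header plus three rows, one of them too short.
rows = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # the 'Category' value is missing, so the row is one entry short
    ['App C', 'GAME', '4.5'],
]

def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's length."""
    header_length = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_length]

malformed = find_malformed_rows(rows)
for index, row in malformed:
    print(index, row)
```

Collecting the indices first also makes it easier to review the bad rows before deleting anything.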

In [8]:
del Gplay_data[10473]

In [9]:
print(Gplay_data[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [10]:
for row in Apple_data:
    header_length = len(Apple_data[0]) 
    row_length = len(row)
    if row_length != header_length: 
        print(row)
        print('\n')
        print(Apple_data.index(row))

### Duplicate entries

In [11]:
#this code iterates through the Apple data and sorts each app name into one of the two lists below

apple_unique_apps = [] 
apple_duplicate_apps = [] 

for app in Apple_data: 
    app_name = app[2] #index of the column containing app names ('track_name')

    if app_name not in apple_unique_apps:
        apple_unique_apps.append(app_name) #first time this name is seen
    else:
        apple_duplicate_apps.append(app_name) #name already seen, so this entry is a repeat

In [12]:
print(len(apple_unique_apps))
print(len(apple_duplicate_apps))

7196
2


According to the above output, two app names appear more than once.

In [13]:
print(apple_duplicate_apps) #to identify which are duplicated

['VR Roller Coaster', 'Mannequin Challenge']
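
A membership test against a list (`app_name not in apple_unique_apps`) rescans the list for every row. As a sketch of an alternative, `collections.Counter` from the standard library can tally the names in one pass. The `names` list below is invented for illustration.

```python
from collections import Counter

# Hypothetical app names; two of them appear more than once.
names = ['Notes', 'Chess', 'Notes', 'Radio', 'Chess', 'Notes']

name_counts = Counter(names)

# Names that occur more than once, with the number of times each occurs.
duplicates = {name: n for name, n in name_counts.items() if n > 1}

print(duplicates)
print('Unique names:', len(name_counts))
```

Unlike the two-list approach, this also shows how many times each name is repeated.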


In [14]:
for row in Apple_data:
    if 'Mannequin Challenge' in row:
        print('Mannequin Challenge is located at index number:')
        print(Apple_data.index(row))

Mannequin Challenge is located at index number:
7093
Mannequin Challenge is located at index number:
7129


In [15]:
for row in Apple_data:
    if 'VR Roller Coaster' in row:
        print('VR Roller Coaster is located at index number:')
        print(Apple_data.index(row))

VR Roller Coaster is located at index number:
3320
VR Roller Coaster is located at index number:
5604


In [16]:
print(Apple_data[7093])
print(Apple_data[7129])

['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


In [17]:
print(Apple_data[3320])
print(Apple_data[5604])

['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']


In the above output, the two entries for each app differ in several columns. The difference in 'size_bytes' is quite large, presumably because the two entries represent different versions of the app, one older and one updated. The version number at index 10 supports this.

I'll use the total number of ratings (index 6) to decide which entry to keep. Since Apple_data has only two duplicated names, removing the extra entries manually by index is practical, and that is what I'll do here. However, as seen below, manually removing 1181 entries from Gplay_data would be laborious.

I'll apply the same logic to the Gplay data and keep, for each duplicated app, the entry with the most reviews. The idea is that the most recent entry will have the highest number of reviews, since reviews accumulate over time (I believe they are not reset when an app is updated).

In [18]:
print(len(Apple_data))
del Apple_data[7129]
del Apple_data[5604]
print(len(Apple_data))

7198
7196


In [19]:
duplicate_apps = []
unique_apps = []

for app in Gplay_data:  
    app_name = app[0]
    if app_name in unique_apps: #instead of using 'not in' we use in to append to duplicate list first, just an alternative
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [20]:
for row in Gplay_data:
    if 'Quick PDF Scanner + OCR FREE' in row:
        print('Quick PDF Scanner + OCR FREE is located at index number:') #223 is printed twice because list.index() returns the first row equal to the current one, and two of the rows are identical
        print(Gplay_data.index(row))


Quick PDF Scanner + OCR FREE is located at index number:
223
Quick PDF Scanner + OCR FREE is located at index number:
223
Quick PDF Scanner + OCR FREE is located at index number:
286
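
The repeated 223 is consistent with how `list.index` works: it returns the position of the first element that compares equal to its argument, so two identical rows report the same index. The small sketch below, using made-up rows, shows the effect and how `enumerate` reports the actual positions instead.

```python
# Hypothetical rows: positions 0 and 1 are identical, position 2 differs in the review count.
rows = [
    ['Scanner', '80805'],
    ['Scanner', '80805'],
    ['Scanner', '80804'],
]

# list.index returns the first equal row, so the second row also reports 0.
via_index = [rows.index(row) for row in rows]

# enumerate yields the actual position of each row.
via_enumerate = [i for i, row in enumerate(rows) if 'Scanner' in row]

print(via_index)
print(via_enumerate)
```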


In [21]:
print(Gplay_data[223])
print(Gplay_data[286])

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


## Removing duplicate apps

To remove the duplicates in the Google Play dataset, my aim is to create a dictionary in which each key is a unique app name and each value is the highest number of reviews recorded for that app.

In [22]:
reviews_max = {}
for row in Gplay_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews: #the app is already in the dictionary, but this row has more reviews than the stored value
        reviews_max[name] = n_reviews #so the stored value is updated to the higher review count
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
    

In [23]:
count = 0                 #to confirm if the above worked, printing out the first five entries of reviews_max
for key in reviews_max:
    if count < 5: #if the value in the variable count < 5, only then will the key, and dict[key] will be printed
        print(key, reviews_max[key])
    count += 1 #necessary to increment the count variable each loop

Photo Editor & Candy Camera & Grid & ScrapBook 159.0
Coloring book moana 974.0
U Launcher Lite – FREE Live Cool Themes, Hide Apps 87510.0
Sketch - Draw & Paint 215644.0
Pixel Draw - Number Art Coloring Book 967.0


In [24]:
print (len(reviews_max))

9659


In [25]:
android_clean = []
already_added = []

for row in Gplay_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if reviews_max[name] == n_reviews and name not in already_added: #if values in reviews_max's keys equal number of reviews, and isn't already in already_added
        android_clean.append(row)
        already_added.append(name)

In the above code:
- as before, the app name and number of reviews are taken from each row and assigned to variables
- the current row is appended to android_clean, and the app name is appended to already_added, only if:
 - the number of reviews (n_reviews) matches the highest number of reviews for that app in reviews_max
 - the name of the app is not already in already_added (required because, for some duplicates, the highest number of reviews is shared by more than one entry)
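
Both steps can be tried on a small example. The rows below are hypothetical and include a tie (two rows with the same highest review count) to show why the `already_added` check is needed. A set is used for the names already kept, which is a substitution of mine rather than part of the code above.

```python
# Hypothetical rows of the form [name, category, rating, reviews].
rows = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Box', 'BUSINESS', '4.2', '160'],   # ties with the previous row
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

# Step 1: highest review count seen for each app name.
reviews_max = {}
for row in rows:
    name, n_reviews = row[0], float(row[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Step 2: keep exactly one row per app, the first one that has the highest count.
clean = []
already_added = set()
for row in rows:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        clean.append(row)
        already_added.add(name)

print(clean)
```

Without the `already_added` check, both tied 'Box' rows would be kept.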

In [26]:
print(len(android_clean))

9659


In [27]:
print(len(already_added))

9659


## Removing Non-English apps
The scenario assumes our target audience is English speakers. However, the dataset contains many entries written in other scripts, which suggests they were designed for a non-English market.

In ASCII (American Standard Code for Information Interchange), the characters commonly used in English text (letters, digits, punctuation) each correspond to a number in the range 0 to 127. The number for a character can be found with the ord() function:

In [28]:
print(ord('a')) #ord returns the number corresponding to a character

97


In [29]:
print(ord('爱')) #for example, taking a Chinese character and looking at the number printed, we expect >127

29233


Therefore, I'll remove any app whose name contains a character with a code greater than 127.

In [30]:
def isitenglish(x):
    x = str(x)
    for letter in x: #iterate through each character in the string
        if ord(letter) > 127:
            return False #returns False as soon as one non-ASCII character is found, even if the rest of the name is English
    #note: there is no explicit return for all-ASCII names, so the function returns None in that case

In [31]:
isitenglish('Instagram')

In [32]:
isitenglish('爱奇艺PPS -《欢乐颂2》test电视剧热播Jack') #returns False at the first non-ASCII character, even though the name also contains English words

False

In [33]:
isitenglish('Instachat 😜')

False

Here's a new issue: emoji and certain other characters have codes outside the ASCII range of 0-127. This means I'd end up losing many English apps by running this function over the dataset.

Workaround: remove an app only if its name contains more than 3 characters outside the ASCII range.
- this still isn't perfect, but it will minimise data loss (unsure if there are any other methods to use here?)

In [34]:
def isitenglish(word):
    count = 0 #initialise a counter at 0
    
    for letter in word: 
        if ord(letter) > 127:
            count += 1 #count increments by 1 for each character outside the ASCII range
        
    if count > 3: #more than 3 non-ASCII characters: treat the name as non-English
        return False
    else:
        return True #otherwise, return True    (edited to fix indentation issue)

In [35]:
isitenglish('Instachat 😜')

True

In [36]:
isitenglish('Instachat 😜剧 剧 ')

True

In [37]:
isitenglish('Instachat 😜😜😜😜')

False

In [38]:
isitenglish('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

**IT'S WORKING :)**
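
The same rule can be written more compactly with `sum` over a generator expression. This is only a sketch: the name `is_probably_english` and the `max_non_ascii` parameter are my own, and the threshold of 3 is the same heuristic as above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII (0-127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))
print(is_probably_english('Instachat 😜😜😜😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

Making the threshold a parameter makes it easy to check how many apps are kept or dropped at different cut-offs.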

Now I'll use this new function to filter the non-English apps out of both datasets. I'll do this by looping through each dataset and appending the rows with English names to a separate list.

In [39]:
english_apple = []

for row in Apple_data[1:]:
    app_name = row[2] 
    if isitenglish(app_name):
        english_apple.append(row)

In [40]:
explore_data(english_apple, 0, 3, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6181
Number of columns: 17


In [41]:
english_google = []

for row in android_clean:
    app_name = row[0] #index of the column containing app names ('App')
    if isitenglish(app_name):
        english_google.append(row)

In [42]:
explore_data(english_google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


**As above, we're left with 9614 Google apps (from 9659) and 6181 Apple apps (from 7195, as the 7196 rows counted earlier include the header row)**

## Isolating free apps
#### Recap
- inaccurate data has been removed (those with missing categories, etc.)
- removed duplicate data
- removed non-English apps

#### Isolation method
Now I'll isolate the free apps by looping through each dataset and appending the free apps to a separate list.

I had originally printed the below in Jupyter, which created a scrollable table of the values for me. Here, it prints all outputs on the screen. I've changed to markdown to prevent screen spam 

`for row in english_google:     
    print(row[7])
    print(type(row[7]))`

I can see from the above that the values at index 7, the price column, are strings. I therefore need to make sure that my conditional statements compare against a string, not an integer or a float.
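
Because a zero price can be written in more than one way (for example `'0'`, `'0.0'` or `'$0'`), comparing against a single string is fragile. One possible sketch converts the string to a number first. The helper name `is_free` and the sample values are assumptions for illustration.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0' represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a misplaced text value): treat as not free.
        return False

samples = ['0', '0.0', '$0', '3.99', '$4.99', 'Everyone']
free_flags = [is_free(p) for p in samples]
print(free_flags)
```

A check like this behaves the same for both datasets, whichever way the price is formatted.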

In [43]:
free_english_google = []

for row in english_google:
    price = row[7]
    if price == '0':
        free_english_google.append(row)
    

In [44]:
explore_data(free_english_google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


`for row in free_english_google: #to check to see if it has worked - shouldn't see any value other than '0' (string)
    print(row[7])
    print(type(row[7]))`

Now, for the apple dataset.

`for row in english_apple:
    print(row[5])
    print(type(row[5]))`

In [45]:
free_english_apple = []

for row in english_apple:
    price = row[5]
    if price == '0': #this caught me out: in this file a free app's price is the string '0', the same as in Google Play, not '0.0'
        free_english_apple.append(row)

In [46]:
explore_data(free_english_apple, 0, 3, True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 3220
Number of columns: 17


# Data Analysis
Now that the data has been cleaned, I need to analyse it. In this scenario, the company wants to create a free app that will become popular so it can cash in on the ad revenue (which depends on the number of users engaging with the app).

I should look for a link between category/genre and installs.

**Google data**
- category at index [1] and genres at index [-4]


**Apple data**
- prime genre at [12]


In [47]:
def display_table(dataset, index): 
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#### Above code
This code was provided by the scenario. 

This is because the scenario provider follows a learning pathway from beginner onwards, and this project precedes more advanced techniques. They acknowledge that there are simpler ways to achieve the above.

What the code is doing:
- Takes in two parameters: dataset and index. dataset will be a list of lists, and index will be an integer
- Generates a frequency table using the freq_table() function (which I have to write)
- Transforms the frequency table into a list of tuples, then sorts the list in a descending order
- Prints the entries of the frequency table in descending order
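
One of those simpler ways, sketched under the assumption that the frequency table is a plain dictionary of `{value: percentage}`, is to sort the dictionary items directly with a `key` function instead of building `(value, key)` tuples. The `table` below is a made-up example.

```python
# Hypothetical frequency table of percentages.
table = {'Games': 58.1, 'Weather': 0.9, 'Entertainment': 7.9}

# Sort (key, value) pairs by value, largest first.
table_sorted = sorted(table.items(), key=lambda item: item[1], reverse=True)

for name, percentage in table_sorted:
    print(name, ':', percentage)
```

One difference from the provided function: when two entries have the same value, this version keeps them in their original order, whereas sorting `(value, key)` tuples in reverse breaks ties by key in reverse alphabetical order.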

**My first attempt**

`def freq_table(dataset, index):
    ftable = {}
    for row in dataset:
        genre = row[index]
        if genre in ftable:
            ftable[genre] += 1
        else:
            ftable[genre] = 1
    return ftable
`

I had originally tried this, but it did not display as a percentage

**This is the function I'll create, to be used alongside the provided 'display_table' function**

In [48]:
def freq_table(dataset, index):
    ftable = {}
    total = 0
    
    for row in dataset:
        total += 1 #increments total by 1 for each row in the dataset, giving the total number of rows
        genre = row[index] #the value in the column specified by the index argument
        if genre in ftable:
            ftable[genre] += 1
        else:
            ftable[genre] = 1 
    
    ftablepercentages = {}
    for key in ftable: 
        percentage = (ftable[key] / total) * 100 #multiply by 100 to get a percentage rather than a proportion between 0 and 1
        ftablepercentages[key] = percentage #store the percentage under the same key
            
    return ftablepercentages
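
For comparison, the same percentages could be produced with `collections.Counter` from the standard library. This is a sketch, not the approach the scenario asks for. `freq_table_counter` is a hypothetical name and the rows are invented.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows of the form [name, genre].
rows = [['A', 'Games'], ['B', 'Games'], ['C', 'Weather'], ['D', 'Games']]
percentages = freq_table_counter(rows, 1)
print(percentages)
```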

In [49]:
print(freq_table(free_english_apple, 12))

{'Productivity': 1.7391304347826086, 'Weather': 0.8695652173913043, 'Shopping': 2.608695652173913, 'Reference': 0.5590062111801243, 'Finance': 1.1180124223602486, 'Music': 2.049689440993789, 'Utilities': 2.515527950310559, 'Travel': 1.2422360248447204, 'Social Networking': 3.291925465838509, 'Sports': 2.142857142857143, 'Health & Fitness': 2.018633540372671, 'Games': 58.13664596273293, 'Food & Drink': 0.8074534161490683, 'News': 1.3354037267080745, 'Book': 0.43478260869565216, 'Photo & Video': 4.968944099378882, 'Entertainment': 7.888198757763975, 'Business': 0.5279503105590062, 'Lifestyle': 1.5838509316770186, 'Education': 3.6645962732919255, 'Navigation': 0.18633540372670807, 'Medical': 0.18633540372670807, 'Catalogs': 0.12422360248447205}


In [50]:
display_table(free_english_apple, 12)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


In [51]:
display_table(free_english_google, 1) #categories

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [52]:
display_table(free_english_google, -4) #genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

# Analysis of data: trends in free apps

**App store**

Most common genre: 'Games' with 58.1% share

Second most common: 'Entertainment' with 7.9%

- there's already a huge divide in % of the most common and second-most
- most apps that appear at the bottom of the list tend to be more utility-based e.g. navigation apps, medical, weather, news etc. 
- the apps that appear towards the top are more leisure-focussed, like gaming, entertainment, photo and videos

Based on the above, I would recommend developing a free app that is designed around gaming or related genres - though it would also depend on the number of users of apps within those categories. 

**Google store**

Most common category: 'FAMILY' with 18.9%

2nd: 'GAME' with 9.7%

3rd: 'TOOLS' with 8.5%

Most common genre: 'Tools' with 8.4%

2nd: 'Entertainment' with 6.1%

3rd: 'Education' with 5.3%

I've widened the search to include the top 3, as there's less of a gap between the most common genres/categories.
- the Google Play store seems to follow a different app profile, with a wider spread of genres/categories
- tools and leisure (entertainment and 'GAME') feature near the top of both the genre and category tables

Because of this different profile, I would recommend a Google Play app centred around an entertainment or utility function. 'FAMILY' ranks highest among categories, so the app should be inclusive and not age-restricted.

This, of course, reveals the most frequent app genres, and not necessarily which genres have the most users.

#### My goal is to calculate the average number of installs for each app genre
- for the App Store data, I'll use the total number of user ratings as there's no column for installs

In [53]:
unique_genres = freq_table(free_english_apple, 12) #i create a freq. table for the unique genres column

for genres in unique_genres: #within this freq. table i initialise the following two empty variables
    total = 0
    len_genre = 0
    
    for row in free_english_apple: #nested loop, i'm looping through the dataset
        genre_app = row[12] #saving the app genre to this variable
        if genre_app == genres: #if the above variable is the same as each element within unique_genres (i.e. the genres)
            total_ratings = float(row[6]) #i save no. of user ratings of the app as a float
            total += total_ratings #the number of user ratings are added to the total (initialised above)
            len_genre += 1 #len_genre is incremented by 1 for each app in this genre, giving the number of apps in the genre
            
    average = total / len_genre #the total number of user ratings for a given genre is divided by the total of apps corresponding to that genre

    print(genres, ':', average)
            
#so, total = combined number of user ratings for all apps within a genre, len_genre = number of apps in that genre
#so total / len_genre gives the average number of user ratings for a given genre

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22812.92467948718
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0
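
The nested loop rescans the whole dataset once per genre. A single pass that accumulates a running total and a count per genre gives the same averages. The sketch below uses `collections.defaultdict` and invented rows, and `average_by_column` is a hypothetical helper name.

```python
from collections import defaultdict

def average_by_column(dataset, group_index, value_index):
    """Average the numeric column `value_index`, grouped by the column `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows of the form [name, genre, rating_count].
rows = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Weather', '50'],
]
averages = average_by_column(rows, 1, 2)
print(averages)
```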


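According to the above, Navigation (about 86,090), Reference (about 74,942) and Social Networking (about 71,548) have the highest average number of user ratings, while leisure genres such as Games (about 22,813) and Entertainment (about 14,030) sit lower down. If we take the number of ratings to represent the number of users, then targeting the top genres would mean reaching the most users.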

However, ratings for apps under the genre of social networking are likely to be skewed by certain highly popular apps such as Facebook and Instagram.

In [54]:
for row in free_english_apple: #prints the app name and its number of ratings for every app whose genre is 'Social Networking'
    if row[12] == 'Social Networking':
        print(row[2] + ':' + row[6])

Facebook:2974676
LinkedIn:71856
Skype for iPhone:373519
Tumblr:334293
Match™ - #1 Dating App.:60659
WhatsApp Messenger:287589
TextNow - Unlimited Text + Calls:164963
Grindr - Gay and same sex guys chat, meet and date:23201
imo video calls and chat:18841
Ameba:269
Weibo:7265
Badoo - Meet New People, Chat, Socialize.:34428
Kik:260965
Qzone:1649
Fake-A-Location Free ™:354
Tango - Free Video Call, Voice and Chat:75412
MeetMe - Chat and Meet New People:97072
SimSimi:23530
Viber Messenger – Text & Call:164249
Find My Family, Friends & iPhone - Life360 Locator:43877
Weibo HD:16772
POF - Best Dating App for Conversations:52642
GroupMe:28260
Lobi:36
WeChat:34584
ooVoo – Free Video Call, Text and Voice:177501
Pinterest:1061624
知乎:397
Qzone HD:458
Skype for iPad:60163
LINE:11437
QQ:9109
LOVOO - Dating Chat:1985
QQ HD:5058
Messenger:351466
eHarmony™ Dating App - Meet Singles:11124
YouNow: Live Stream Video Chat:12079
Cougar Dating & Life Style App for Mature Women:213
Battlefield™ Companion:689
Wh

As seen above, Facebook, Skype, etc. have huge numbers of user ratings relative to the rest of the list. It would not make sense to develop a free social media app given the market is dominated by a few apps that already have a high user base. Given the resources of Facebook and related companies, it would be difficult to compete.
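
One way to make the skew visible is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses six of the rating counts printed above (Facebook, LinkedIn, Skype for iPhone, Tumblr, Match and WhatsApp Messenger). Using the median is my suggestion, not part of the original scenario.

```python
from statistics import mean, median

# Rating counts taken from the Social Networking output above.
rating_counts = [2974676, 71856, 373519, 334293, 60659, 287589]

average = mean(rating_counts)
middle = median(rating_counts)

print('mean:', average)
print('median:', middle)
```

For these six values the mean is more than twice the median, because Facebook alone accounts for most of the total.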

In [55]:
unique_category_google = freq_table(free_english_google, 1)

for category in unique_category_google:
    total = 0
    len_category = 0
    
    for row in free_english_google:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '') 
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1
    
    average = total / len_category
    
            
    print(category, ':', average)        


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
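
The string clean-up can be isolated in a small helper so that it is easy to test on its own. `parse_installs` is a hypothetical name, and the inputs follow the format shown in the 'Installs' column.

```python
def parse_installs(installs):
    """Convert an 'Installs' string such as '1,000,000+' to an integer."""
    return int(installs.replace('+', '').replace(',', ''))

examples = ['10,000+', '500,000+', '1,000,000,000+']
parsed = [parse_installs(value) for value in examples]
print(parsed)
```

Note that values such as '10,000+' are lower bounds (at least 10,000 installs), so averages built from them are approximate.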

In [56]:
for row in free_english_google: 
    if row[1] == 'SOCIAL':
        print(row[0] + ':' + row[5])

Facebook:1,000,000,000+
Facebook Lite:500,000,000+
Tumblr:100,000,000+
Social network all in one 2018:100,000+
Pinterest:100,000,000+
TextNow - free text + calls:10,000,000+
Google+:1,000,000,000+
The Messenger App:1,000,000+
Messenger Pro:1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus:1,000,000+
Telegram X:5,000,000+
The Video Messenger App:100,000+
Jodel - The Hyperlocal App:1,000,000+
Hide Something - Photo, Video:5,000,000+
Love Sticker:1,000,000+
Web Browser & Fast Explorer:5,000,000+
LiveMe - Video chat, new friends, and make money:10,000,000+
VidStatus app - Status Videos & Status Downloader:5,000,000+
Love Images:1,000,000+
Web Browser ( Fast & Secure Web Explorer):500,000+
SPARK - Live random video chat & meet new people:5,000,000+
Golden telegram:50,000+
Facebook Local:1,000,000+
Meet – Talk to Strangers Using Random Video Chat:5,000,000+
MobilePatrol Public Safety App:1,000,000+
💘 WhatsLov: Smileys of love, stickers and GIF:1,000,000+
HTC Social Plugin - Faceb

Again, the market for 'SOCIAL' apps is dominated by a few large players.

In [57]:
for row in free_english_google: 
    if row[1] == 'GAME':
        print(row[0] + ':' + row[5])
        


Solitaire:10,000,000+
Sonic Dash:100,000,000+
PAC-MAN:100,000,000+
Bubble Witch 3 Saga:50,000,000+
Race the Traffic Moto:10,000,000+
Marble - Temple Quest:10,000,000+
Shooting King:10,000,000+
Geometry Dash World:10,000,000+
Jungle Marble Blast:5,000,000+
Roll the Ball® - slide puzzle:100,000,000+
Block Craft 3D: Building Simulator Games For Free:50,000,000+
Farm Fruit Pop: Party Time:1,000,000+
Love Balls:50,000,000+
Piano Tiles 2™:100,000,000+
Pokémon GO:100,000,000+
Paint Hit:10,000,000+
Snake VS Block:50,000,000+
Rolly Vortex:10,000,000+
Woody Puzzle:1,000,000+
Stack Jump:10,000,000+
The Cube:5,000,000+
Extreme Car Driving Simulator:100,000,000+
Bricks n Balls:1,000,000+
The Fish Master!:1,000,000+
Color Road:10,000,000+
Draw In:10,000,000+
PLANK!:500,000+
Looper!:1,000,000+
Trivia Crack:100,000,000+
Will it Crush?:5,000,000+
Tomb of the Mask:5,000,000+
Baseball Boy!:10,000,000+
Hello Stars:10,000,000+
Tank Stars:10,000,000+
Hole.io:10,000,000+
Mini Golf King - Multiplayer Game:5,0

There's a wider spread of installs among apps under the 'GAME' category. This could be a potential target for a free app.