# Dataquest Project 1: Which apps generate downloads?

## 1. Analyzing Mobile App Data

I have to determine which apps in this data set are downloaded the most, as though I'm an analyst at a gaming company looking for which types of apps are the most downloaded. Maybe we're trying to diversify, I don't know, my boss has a terrible habit of not keeping me informed. I also know from prior experience that the verticals defined do a terrible job of revealing such information, but hey, Dataquest didn't consult me on the project up front.

Some 'roleplay' notes: we only make free games, deriving revenue from in-app ads (we're Voodoo, let's say). As such, the guide wants me to later delete paid-for apps. We're also going to focus on English only. Both steps I find confusing but ok.

## 2. Opening the Data
First, let's import our data set. I've never actually written this before, so I've added notes to code copied from elsewhere to explain what things do.

In [1]:
from csv import reader

#Google Play data
opened_file = open('Project1/googleplaystore.csv', encoding='utf8') #Stating the encoding avoids errors on app names with non-English characters.
read_file = reader(opened_file) #Open file doesn't read the file, so I have to read it with this command.
android = list(read_file) #Now turning it into a list and setting it to a variable.
android_header = android[0] #Finally, splitting the data set into header and data.
android = android[1:]

#App Store data
opened_file = open('Project1/AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

Dataquest have created a function, which I've copied in below. It allows me to "repeatedly print rows in a readable way."

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### explore_data documentation:

The `explore_data()` function:

- Takes in four parameters:
    - `dataset`, which is expected to be a list of lists.
    - `start` and `end`, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
    - `rows_and_columns`, which is expected to be a Boolean and has `False` as a default argument.

- Slices the data set using `dataset[start:end]`.
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.

- The `\n` character adds a new line. We use `print('\n')` to add some blank space between rows.

- Prints the number of rows and columns if `rows_and_columns` is `True`.
     - `dataset` shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).
     
Here's that in practice:

In [3]:
explore_data(android, 9999, 10001, True)
print('\n')
explore_data(ios, 500, 501, True)

['EW Handbook', 'BUSINESS', 'NaN', '1', '17M', '100+', 'Free', '0', 'Everyone', 'Business', 'June 11, 2015', '4', '2.3.3 and up']


['Mee-EW', 'TOOLS', 'NaN', '0', '17M', '5+', 'Free', '0', 'Everyone', 'Tools', 'August 2, 2018', '1.0.1', '4.1 and up']


Number of rows: 10841
Number of columns: 13


['387893495', 'Virtual Regatta Offshore', '123541504', 'USD', '0', '209', '1', '3.5', '5', '2.2.1', '4+', 'Games', '37', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


## 2.1 Exploring the Data

Let's see what the headers of each dataset look like and pick out which columns are useful for our analysis.

In [4]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Android columns of interest:
I would say: App, Category, Installs, Genres.

They say: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

- App = App name
- Category = Vertical
- Reviews = Number of reviews
- Installs = Number of installs
- Type = Free/Paid app
- Price = If it has a price
- Genres = More verticals (Not sure why there are more in Play Store)

### iOS columns of interest:
Supposedly, these are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'

- track_name = App name
- currency = ??? (Would not have included)
- price = free (0) or paid app
- rating_count_tot = Number of reviews
- rating_count_ver = ???
- prime_genre = The first vertical they've picked to categorize

## 3. Deleting Wrong Data

It's time to begin cleaning the dataset, before we analyze it. We need to:

- Detect inaccurate data
- Correct or remove it
- Detect duplicate data
- Update it

We don't care about the 'foreign' market (lmao), so I need to remove all apps that are **non-English**.
I also have to remove apps that **aren't free** because, in role play mode, I'm analyzing on behalf of an app company that only produces free apps (as above).

First, let's take out the inaccurate data. Here's a row where things have gone wrong:

In [5]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


This row is missing a cell: we have the app name, but no category, so every later value has shifted one column to the left (hence the rating reads `19`, which simply can't be).

Here I've been told to delete the row (in future perhaps we'd populate it, or at least add an N/A?)
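
A row like this could also be found automatically rather than by knowing its index up front. The sketch below is only an illustration, assuming a faulty row is one whose length differs from the header's; `find_short_rows` and the toy data are hypothetical and not part of the Dataquest guide.

```python
# Toy stand-ins for the real header and rows (hypothetical data).
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],           # missing a cell, like the broken Android row
    ['App C', 'GAME', '4.5'],
]

def find_short_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    bad = []
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            bad.append(index)
    return bad

bad_rows = find_short_rows(rows, header)
print(bad_rows)
```

On the real data the same idea would be called as `find_short_rows(android, android_header)`.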

To delete row 10472, use the **`del`** statement:

In [6]:
del android[10472]

print(android[10472])
print(len(android))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
10840


The row is gone, with a new row in its place. We can see TOOLS where the category should be, and we have 10840 rows, as expected.

According to the instructions there are 1,181 cases where an app appears more than once. That's ~10% of the dataset.

Now we have to create a function to find these errors. First, let's look for **duplicates**.

## 4. Removing Duplicate Entries: Part One

Here's proof of multiple apps of the same name (they've given me this code):

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print(' ')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
 
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
 
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
 
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
 


There are four Instagram entries in the set. What differs between them is the fourth column (index 3; zero-based counting takes some getting used to). That's 'Reviews', which means the *number* of accumulated reviews. The higher the number, the more recent the data, so goes the logic of increasing reviews.

If we want to remove duplicates, rather than remove randomly, we should keep the row with the **biggest** number in that fourth column.

But how many duplicates do we have overall? The code below tells us:

In [8]:
#Make two lists, one for unique apps and another for duplicates.
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name) #2) Except for ones which appear already
    else:
        unique_apps.append(name) #1) All the names are filled in here first

#Then we can get clever with the labelling, with a string followed by the length
print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:11])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic']


Now we know for sure that there are 1181 instances of repetitions. When we remove them we should at least know what our number should be to confirm our code worked. Here's how we do it:

In [9]:
#Since we now know the number of duplicate apps, we can subtract from the total - here you can use len, but the figure was 10840 as we found in the del section.
print('Expected length:', len(android) - 1181)

Expected length: 9659
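
As a cross-check on that expected length, the number of unique names can also be counted with a `set`, which drops repeats by itself. This is a minimal sketch on made-up rows (`toy_android` is hypothetical), not the method the guide uses.

```python
# Hypothetical toy rows standing in for the `android` list of lists.
toy_android = [
    ['Instagram', 'SOCIAL'],
    ['Box', 'BUSINESS'],
    ['Instagram', 'SOCIAL'],
    ['Slack', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
]

names = [row[0] for row in toy_android]
unique_count = len(set(names))                 # a set keeps each name once
duplicate_count = len(names) - unique_count    # every repeat beyond the first

print('Unique apps:', unique_count)
print('Duplicate entries:', duplicate_count)
```

The duplicate count here matches what the two-list loop produces, because that loop also counts every appearance after the first.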


In [10]:
#Gonna do the same for iOS and see if it works. But won't remove them as the answer booklet does not and I haven't been asked to.

duplicate_apps_ios = []
unique_apps_ios = []

for apps in ios:
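    name = apps[2] #Careful: index 2 is size_bytes (track_name is index 1), so this compares file sizes, not app names.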
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name) #2) Except for ones which appear already
    else:
        unique_apps_ios.append(name) #1) All the names are filled in here first

#Then we can get clever with the labelling, with a string followed by the length
print('Number of duplicate apps:', len(duplicate_apps_ios))
print('\n')
print('Examples of duplicate apps:', duplicate_apps_ios[:11])

print('Our expected length is 7195')
print('\n')
explore_data(unique_apps_ios, 0, 1, True)

Number of duplicate apps: 90


Examples of duplicate apps: ['270729216', '44099584', '53805056', '85724160', '114601984', '68952064', '116443136', '70040576', '30252032', '143166464', '79953920']
Our expected length is 7195


100788224


Number of rows: 7107
Number of columns: 9


## 5. Removing Duplicate Entries: Part Two

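We've discovered there are 1,181 duplicates in our Android data set. (For iOS the expected figure is two; the 90 reported above comes from comparing column 2, `size_bytes`, rather than the app name.) We have to remove the Android duplicates based on our established criteria.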

### To remove the duplicates, I will:

- Create a dictionary where each key is the **unique app name**. 
- The dictionary **value** is the **highest number of reviews** of the named app.
- Then I'll use the dictionary to create a new data set that will have only one entry per app, only selecting from the highest number of reviews.

We're going to use the `not in` operator. It's the opposite of `in`: `name not in reviews_max` returns `True` when `name` is *not* already a key in the dictionary, and `False` when it is.
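
Here's a quick sketch of how `in` and `not in` behave with a dictionary. The one-entry dictionary is made up for illustration; the thing to notice is that membership checks look at keys, not values.

```python
# A tiny made-up dictionary: app name -> number of reviews.
reviews = {'Instagram': 66577446.0}

has_instagram = 'Instagram' in reviews     # the key exists
missing_box = 'Box' not in reviews         # the key does not exist
value_lookup = 66577446.0 in reviews       # values are not searched

print(has_instagram, missing_box, value_lookup)
```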

Let's create the dictionary:

In [11]:
#Start by creating an empty dict:
reviews_max = {}

#Loop through the Google Play (android) dataset
for apps in android:
    name = apps[0] # Assign the app name to a variable named `name` - this is column 0 in android.
    n_reviews = float(apps[3]) # Convert the number of reviews to a float (1.1) and assign to n_reviews. Column 3 is Reviews.
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews # If name already exists and its stored value is less than the current n_reviews, update!
    elif name not in reviews_max: # We're told not to use else: a plain else would also overwrite the stored value when the current n_reviews is lower. Also, notice not in.
        reviews_max[name] = n_reviews # This line then creates a new entry where the key is the app name
        
#Inspect the dictionary - the expected length is 9659        
print('Actual length is:', len(reviews_max))

Actual length is: 9659


That's what we expected. So far, so good. Now we have our dictionary of unique apps only, we can use it to create a new dataset.

In [12]:
#Start with two empty lists:

android_clean = [] # Stores our new dataset
already_added = [] # Stores app names

#Loop through the Google Play (android) dataset:
for apps in android:
    name = apps[0] # Assign the app name to a variable named `name` - this is column 0 in android.
    n_reviews = float(apps[3]) # Convert the number of reviews to a float (1.1) and assign to n_reviews. Column 3 is Reviews.
    
    if n_reviews == reviews_max[name] and name not in already_added: #If n_reviews is the same as the max number of reviews of the app name and the app name is not already in the list already added
        android_clean.append(apps)
        already_added.append(name)
        
explore_data(android_clean, 10, 11, True)

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


Number of rows: 9659
Number of columns: 13


How this works:

- We loop through the android data set.
- For every iteration, we isolate the `name` of the app and the `number of reviews`
- We add the row (app) to the android_clean list and the app name (name) to the already_added list if:
    - The number of reviews of the current app matches the number of reviews in the dictionary we made, and
    - The name of the app is not in the `already_added` list. (We've added this to account for cases where there are duplicate entries with the same high number of reviews.)
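
To see why the second condition matters, here is a small sketch of the same two-pass idea on made-up rows, including an exact tie. The toy data and the `_toy` names are hypothetical, and a `set` is used for the seen names, which is my own swap for the list used above.

```python
# Made-up rows: name, category, rating, reviews.
toy = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # exact tie on reviews
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],  # newer entry, more reviews
]

# Pass 1: highest review count per app name.
reviews_max_toy = {}
for row in toy:
    name, n_reviews = row[0], float(row[3])
    if name not in reviews_max_toy or reviews_max_toy[name] < n_reviews:
        reviews_max_toy[name] = n_reviews

# Pass 2: keep one row per name, the one matching that maximum.
clean_toy = []
seen = set()
for row in toy:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == reviews_max_toy[name] and name not in seen:
        clean_toy.append(row)
        seen.add(name)

print(clean_toy)
```

Without the `name not in seen` check, both tied Box rows would pass the first condition and be kept.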
    
Now we have our 'clean' dataset, which we'll apply to the next section.

## 6. Removing Non-English Apps: Part One

We're going to remove 'non-English' apps from our dataset. This'll be done with a couple of cool techniques, but a less cool idea behind it.

This approach has limitations. Chinese, Japanese and (everyone's favorite) Russian would be filtered. But French, German and Portuguese would remain, generally. So while this is testing my skill by later making me create a Function, it is in general not a good way of approaching app name filtration.

Anyway, did you know, each letter (or character, or even emoji) has a corresponding number, its Unicode code point. The first 128 of these (0 to 127) match the ASCII (American Standard Code for Information Interchange) system, which covers the common English characters. See:

In [13]:
print(ord('A'))
print(ord('b'))
print(ord('C')) 
print(ord('ツ'))
print(ord('ф'))
print(ord('🥰'))
print(ord('9'))

65
98
67
12484
1092
129392
57


Based on this, we can build a function that detects whether a character belongs to the set of common English characters, or not, and then filter them into their own final list of clean apps.

So, the logic goes, if the number is less than or equal to 127, then the character belongs to the set of common English characters and should be kept in our dataset. Everything above that should go.

Let's build a function that "takes in a string and returns `False` if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns `True`". The one below relaxes that rule: a string is only filtered out if it has more than three characters above 127, so names with the odd emoji or symbol survive.

In [14]:
def getout(string):
    strikes = 0
    
    for characters in string:
        if ord(characters) > 127:
            strikes += 1
            
    if strikes > 3: #Really important to remember indentation here - this is a new if statement separate from the For loop.
        return False
    else:
        return True
        
print(getout('超强清理大师'))
print(getout('Instachat 😏'))
print(getout('爱奇艺PPS -《欢乐颂2》电视剧热播'))

False
True
False


The function is not perfect, but this "seems good enough".
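
To make the limitation concrete, here's a short sketch using the same "more than three characters above 127" rule. `mostly_ascii` is a hypothetical stand-in for the function above, and the app names are invented for the test.

```python
def mostly_ascii(text, max_strikes=3):
    """True if at most `max_strikes` characters have a code point above 127."""
    strikes = sum(1 for ch in text if ord(ch) > 127)
    return strikes <= max_strikes

french_name = mostly_ascii('Météo France')    # two accented letters: kept
russian_name = mostly_ascii('Привет мир')     # nine Cyrillic letters: removed
emoji_name = mostly_ascii('Fun 😀😀😀😀')      # English, four emoji: removed

print(french_name, russian_name, emoji_name)
```

So the filter is really a "mostly ASCII" check: it lets French through and can throw out English names that lean on emoji.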

Let's use the getout() function to filter the non-English apps out of our data sets:

In [15]:
#First of all, empty lists to house our new clean data

android_no_russian = []
ios_english = []

#Now the code to filter out the foreign characters:
for apps in android_clean: #Loop through android_clean
    name = apps[0]         #Separate the name column
    if getout(name):   #Check the name column with our newly created function
        android_no_russian.append(apps) #Append the English language(?) apps to the new empty list
        
for app in ios:
    name = app[1]
    if getout(name):
        ios_english.append(app)
        
#Let's see how we did: 
explore_data(android_no_russian, 0, 1, True)
print('\n')
explore_data(ios_english, 814, 815, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


['450555662', 'The Singing Machine Mobile Karaoke App', '32770048', 'USD', '0', '130', '9', '2.5', '1.5', '1.10.4', '12+', 'Music', '37', '1', '1', '1']


Number of rows: 6183
Number of columns: 16


So, we have android_no_russian with 9614 apps, and ios_english with 6183 apps. This is the same outcome as the [solutions notebook](https://github.com/dataquestio/solutions/blob/master/Mission350Solutions.ipynb).

## Eliminating paid-for apps

We're limiting our analysis to free apps. That means we need to remove any apps that are paid for. Let's first remind ourselves of the headings:

In [16]:
print(android_header)
print('\n')
explore_data(android_no_russian, 0, 1, False)
print(ios_header)
print('\n')
explore_data(ios_english, 0, 1, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']




In Android, index 7 is 'Price' and index 6 is 'Type' ('Free' or 'Paid'). In iOS, 'price' is index 4. (We should create code to look at the potential outcome as we did with duplicate entries, but I won't for now.)

Instead, let's loop through each data set, identify our free apps, and separate them into our final data sets:

In [17]:
android_final = []
ios_final = []

for apps in android_no_russian:
    price = apps[6]
    if price == 'Free':
        android_final.append(apps)

for app in ios_english:
    price = app[4]
    if price == '0': #csv.reader returns every value as a string, so you can either convert to an int/float or compare against '0'.
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8863
3222
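
Earlier I skipped predicting the outcome of this filter. A minimal sketch of that check is to tally the 'Type' column before filtering, here with `collections.Counter` on made-up rows (`toy_android` is hypothetical, shortened to the first eight columns).

```python
from collections import Counter

# Made-up rows; index 6 is 'Type', index 7 is 'Price'.
toy_android = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+', 'Free', '0'],
    ['App B', 'GAME', '4.5', '2000', '25M', '100,000+', 'Paid', '$4.99'],
    ['App C', 'GAME', '4.0', '87', '12M', '1,000+', 'Free', '0'],
]

type_counts = Counter(row[6] for row in toy_android)
print(type_counts)

# The 'Free' tally is the length the filtered list should end up with.
free_apps = [row for row in toy_android if row[6] == 'Free']
print(len(free_apps) == type_counts['Free'])
```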


WE'RE DONE, AND WE'RE NEVER DOING THIS PART AGAIN (UNTIL THE NEXT DATA CLEAN). 

(The iOS set provided is different to the solutions notebook - there's an extra column which made it confusing, so I just deleted it). I also have one less app in the Android dataset, but that's ok, I probably deleted it.
 
Let's move on to analyzing the data.

## 9. Most Common Apps by Genre

Apparently we build apps for android, then (if they are successful) ship them later onto iOS. Not sure this is the winning strategy, but then again we only make English-only apps as well. We also look at profitability after six months, not one or two like most other hyper-casuals. I will probably leave this company soon. 

Anyway, we need to look for:

1. Apps in our dataset that are successful on both iOS and Android

So let's begin by getting a sense of common genres for each market. Looking up, column 2 [1] is 'Category' in Android, and column 10 [9] 'genre' could be good too. Column 12 [11] is 'prime_genre' in iOS. I could use THESE to generate frequency tables.

To analyze the data, let's build two functions for working with frequency tables:

- One to generate frequency tables that **show percentages**
- Another to display the percentages in a _descending_ order

Here are my instructions:

Create a function named `freq_table()` that takes in two inputs: `dataset` (which is expected to be a list of lists) and `index` (which is expected to be an integer).

- The function should return the frequency table (as a dictionary) for any column we want. 
- The frequencies should be expressed as percentages.

In [18]:
#freq_table function - this is the answer, let's look at it and see if we can replicate after deleting it.
def freq_table(dataset, index):
    
    table = {} #Empty dic. to fill in
    total = 0 #We add this so we can divide by the total later
    
    for row in dataset:
        total += 1 #Every loop, total gets one more value.
        value = row[index] #Now the frequency table starts - `value` is whatever this row holds in the chosen column (e.g. its genre).
        if value in table:
            table[value] += 1 #Checks whether `value` already exists as a key in table. If so, adds one.
        else:
            table[value] = 1 #As above, but if it doesn't exist yet, adds it to the table with a count of 1.
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages
    
#display_table() function - completely copied from Dataquest:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
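
For comparison, the same frequency table can be sketched with `collections.Counter` from the standard library. This is an alternative I'm assuming would give the same percentages, not the Dataquest solution; `freq_percentages` and the toy rows are hypothetical.

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Return {value: percentage of rows holding that value in column `index`}."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_percentages(toy, 1)

# Sort by percentage, highest first, before printing.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```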

In [19]:
#You can now use the display_table function to look at your data, like so:

display_table(ios_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


### Trends:
    
Developers are HOT for Gaming, with 58% of apps in our dataset appearing to be Games. Entertainment more generally is 8%, along with Photo and Video at 5%. Music is separate at 2%. Fun, essentially, makes up 73% of apps in the App Store.

But, as it stands, this just isn't enough data - sure, there are lots of developed 'Fun' apps, but that doesn't mean anyone is playing them, or enjoying them.

Let's also get a sense of the Play Store figures:

In [20]:
#Category

display_table(android_final, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [21]:
#Genre

display_table(android_final, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The data is much more spread out, as Android has way more categories (36 vs. the 23 genres listed for iOS above). Genre is a real pain, as the games are split across lots of separate genres (Action, Arcade, Puzzle and so on), and it includes apps with two categories.

We've spent a lot of time filtering data that doesn't serve our purposes. This won't be the last time.

But yes, we have most common genres, now it's time to figure out which are the most popular, I think?

## 12. Most Popular Apps by Genre on the App Store

Next stage, let's figure out popularity. Because the App Store dataset doesn't have installs, we're going to approximate popularity by looking at the number of ratings in `rating_count_tot`, and working out an average number per genre.

To do this, we need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

I think this means we're creating an average of the number of user ratings - not the score.

To do these steps, we'll start by generating a frequency table for the `prime_genre` column to get the unique app genres. You can use the freq_table() function you wrote in a previous screen.

We're also going to use nested loops for the first time. This means... loops inside loops! The instructions showed me that for each single step of the outer loop, the inner loop runs through its whole iteration.
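
A tiny sketch of that behaviour, using made-up genres and apps (all names here are hypothetical): the inner loop restarts from the top for every genre, so the number of comparisons is genres times apps.

```python
genres = ['Games', 'Music']
apps = [['App A', 'Games'], ['App B', 'Music'], ['App C', 'Games']]

comparisons = 0
matches = {}
for genre in genres:          # outer loop: one step per genre
    matches[genre] = []
    for app in apps:          # inner loop: scans every app, every time
        comparisons += 1
        if app[1] == genre:
            matches[genre].append(app[0])

print(comparisons)
print(matches)
```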

Let's start with calculating the average number of user ratings per app genre on the App Store:

In [22]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios: # Loop over the unique genres in the dataset
    total = 0 #This variable stores the sum of user ratings specific to each genre.
    len_genre = 0 #This variable will store the number of apps specific to each genre.
    for apps in ios_final: #Loop over the App Store data set
        genre_app = apps[-5]
        if genre_app == genre: #If genre_app is the same as genre (the iteration variable of the main loop), then
            n_ratings = float(apps[5]) #Save the number of user ratings of the app as a float
            total += n_ratings  #Add up the number of user ratings to the total variable. - I copied this because I didn't understand
            len_genre += 1 # Increment the `len_genre` variable by `1`. Again, copied - it's the same as above but?? Increment? Ok
    avg_n_rating = total / len_genre
    print(genre, ':', avg_n_rating) #This needs to be inside the outer loop, or it would only print the last genre, lol.

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


(I want to know how to sort this list. It's more important to me than the analysis itself.)
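
One way to get a sorted list, as a sketch: store the averages in a dictionary instead of printing them straight away, then sort the `(genre, average)` pairs by the average. The four values below are copied from the output above; `avg_ratings` and `ranked` are names I've made up.

```python
# A few of the averages printed above, collected into a dictionary.
avg_ratings = {
    'Productivity': 21028.410714285714,
    'Reference': 74942.11111111111,
    'Navigation': 86090.33333333333,
    'Medical': 612.0,
}

# sorted() with a key function orders the pairs by their second element.
ranked = sorted(avg_ratings.items(), key=lambda item: item[1], reverse=True)

for genre, avg in ranked:
    print(genre, ':', round(avg, 1))
```

In the loop above, that would mean replacing the `print` with something like `avg_ratings[genre] = avg_n_rating` and sorting once the loop has finished.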

The notebook tells me that Navigation has the highest average number of user reviews. However, a snippet of code from there reveals that the figure is skewed by the high number of reviews of a few apps, notably Waze and Google Maps.

If three apps have 100000 reviews each, and 300 apps have 100 reviews each, can we tell anything about popularity from the average? Not really. And factors such as a handful of giant apps, and the ability of apps to push users to leave a rating, are just getting in the way of our analysis.

Also, this is just the _number_ of reviews -- what if the average score is a 3 because of a number of 1 star reviews?
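
The three-apps-versus-300 example can be worked through with the standard `statistics` module. This is only an illustration with the made-up numbers from that example, comparing the mean with the median, which is far less sensitive to a few huge values.

```python
from statistics import mean, median

# Three apps with 100000 reviews, 300 apps with 100 reviews each.
reviews = [100000] * 3 + [100] * 300

avg = mean(reviews)       # pulled up by the three giants
middle = median(reviews)  # what a typical app looks like

print(round(avg, 1))
print(middle)
```

The mean lands at roughly 1089 reviews, more than ten times what 300 of the 303 apps actually have, while the median stays at 100.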

In [23]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


Not so; here are the apps. Mine don't come out in order of ratings, which is frustrating, and I'd like to know how to order the output.
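
A sketch of how the rows could be ordered, using four of the apps printed above. The catch is that the counts are strings, so they need converting with `int()` in the sort key; otherwise Python compares them character by character. `navigation_apps` and `by_ratings` are hypothetical names.

```python
# Name and rating count (as strings, the way csv.reader delivers them).
navigation_apps = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', '345046'],
    ['Geocaching®', '12811'],
    ['Railway Route Search', '5'],
    ['Google Maps - Navigation & Transit', '154911'],
]

# Convert inside the key so the comparison is numeric.
by_ratings = sorted(navigation_apps, key=lambda app: int(app[1]), reverse=True)
for app in by_ratings:
    print(app[0], ':', app[1])

# Sorting the raw strings instead puts '5' first, because '5' > '3'.
as_strings = sorted((app[1] for app in navigation_apps), reverse=True)
print(as_strings[0])
```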

In fact, analyzing this for our purposes doesn't seem to be that useful. If we're a game company, surely we'd focus on Games. Or we'd focus on our own Games dataset! Here's what we do at Adjust when analyzing our data:

1) Manually recategorize subverticals into different types of games (there are dozens!)
2) Run analysis on a multitude of metrics, not only on our own data but on useful benchmark data (this can be found in Adjust reports).
3) Consider more in-house factors, such as cost modelling, direct competitors, etc.

Any further analysis seems futile. Since we have rough Google Play download figures, let's move on.

## Most Popular Apps by Genre on Google Play

But wait, there's a snag: the Google Play data is kind of broad - these aren't exact numbers (see below). So, we're going to assume that the number is _the number_, sans `+` sign.

In [24]:
display_table(android_final, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


So, that means removing commas and the plus sign from each row.
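
As a quick sketch of that clean-up on a single value: `str.replace()` returns a new string each time, so the two calls can be chained before converting to a float. `installs_to_float` is a hypothetical helper name.

```python
def installs_to_float(text):
    """Turn an installs label such as '1,000,000+' into a number."""
    return float(text.replace(',', '').replace('+', ''))

million = installs_to_float('1,000,000+')
none_yet = installs_to_float('0+')

print(million, none_yet)
```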

In [30]:
cat_android = freq_table(android_final, 1)

for category in cat_android:
    total = 0
    len_cat = 0 
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5] 
            n_installs = n_installs.replace('+', '') #This function replaces plus signs with... nothing!
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_cat += 1
    av_n_installs = total / len_cat
    print(category, ':', av_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Now we have a 'healthy' average of installs. Again, this is probably skewed by a handful of huge apps, and the list is unordered, so it's a bit painstaking to analyze.

That said, 'Books and reference' have a ton of downloads. Likely Bible apps...

In [33]:
#I totally stole this:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


OK, also readers themselves. Not interesting. Generally I'd support going for some kind of app in the reader space: it's easy to produce, and the game company would likely have writers on staff who could use the practice and develop 'lore' for content marketing purposes. Angry Birds has TWO films out of this strategy, after all.

Let's get writing.