# Project 1: Profitable App Profiles for the App Store and Google Play Markets

## Context: 
Company X builds free apps for English-speaking audiences and publishes them on the App Store and Google Play. Its main source of revenue is in-app ads, so its **income** depends **directly** on **user engagement with the ads**: the more users an app has, the more people see and engage with the ads.

## Goal: 
Analyze data to help our developers understand **what kinds of apps** are likely to **attract more users**.

## 1.0 Collect and Analyze data about mobile apps available on Google Play and the App Store

* Google Play has 2.1M apps & the App Store has 2M apps.

* Collecting data for >4M apps is not feasible. We will collect a sample instead.

* Rather than collecting the data myself, I use two existing data sets that suit the project's goal. The .csv files can be found here:

[App Store data](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)
* Sourced: July 2017
* Apps: approx. 7 000

[Google Play data](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* Sourced: August 2018
* Apps: approx. 10 000

### 1.1 Open the two data sets and save them as lists of lists

In [None]:
from csv import reader

# Read each .csv file into a list of lists, then split off the header row.
# The `with` block closes the file once it has been read.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_list = list(reader(opened_file))

apple_header = apple_list[0]
apple_data = apple_list[1:]

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_list = list(reader(opened_file))

google_header = google_list[0]
google_data = google_list[1:]

### 1.2 Explore both data sets using `explore_data()` 

1. Print the first few rows of each data set
2. Find the number of rows and columns of each data set
3. Print the column names and try to identify the columns that could help us with our analysis.

##### The *`explore_data()` function* prints a sample of our data sets.

Its four **parameters** are:
* `dataset`: a list of lists created after we open our .csv files
* `start`: first index of the slice
* `end`: index at which the slice stops (the row at `end` is not included)
* `rows_and_columns`: if true, it prints the number of rows and columns (false by default)

It **works** like this:
1. Slices the data using `dataset[start:end]`
2. Loops through the slice and prints each row, followed by an empty line
```python
for row in dataset_slice:
    print(row)
    print('\n')
```
3. If `rows_and_columns` is true, it prints the number of rows and columns of the whole data set (not of the slice)
```python
if rows_and_columns:
    print('Number of rows:', len(dataset))
    print('Number of columns:', len(dataset[0]))
```
> If the data set still includes its header row, the header is counted as a row and the reported number of rows is off by one. Store the header separately from the rest of the .csv file (i.e. in different variables).

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

#### Apple Store Sample

In [None]:
print(apple_header)

print('\n')

explore_data(apple_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


#### Google Play Sample

In [None]:
print(google_header)

print('\n')

explore_data(google_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


### 1.3 Detect inaccurate or duplicate data, and correct or remove it

We need to perform data cleaning before the analysis to remove or correct wrong or duplicate data. 

When we are cleaning data we are looking for some **common errors** like:

* shifted columns
* empty data points
* duplicate data

##### The *`not in` and `in` operators* check for membership
We want to know whether a value belongs to some group of values or not. 

We can use the `in` and `not in` operators to check for **membership in a dictionary**. The membership check is only done over the dictionary keys.

> These operators return boolean values: `in` returns True if the value is found and False otherwise, and `not in` returns the opposite.
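
As a quick illustration of the membership operators, here is a minimal sketch on a toy dictionary. The name `reviews` and its contents are only for illustration and are not part of the project code.

```python
# Toy dictionary mapping app names to review counts (illustrative only)
reviews = {'Facebook': 2974676, 'Instagram': 2161558}

print('Facebook' in reviews)        # True: 'Facebook' is a key
print(2974676 in reviews)           # False: values are not searched
print('Temple Run' not in reviews)  # True: the key is absent
```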

##### The *`del` statement* can be used to remove rows that contain incorrect or incomplete data
More information about the [`del`](https://docs.python.org/3/reference/simple_stmts.html?highlight=del#the-del-statement) statement is available in the Python documentation.

> Do **not** run a `del` statement more than once: every run deletes whatever row currently sits at that index, so rerunning the cell removes a valid row.
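
The sketch below shows one way to delete several rows safely, using a toy list. The names `rows` and `bad_indexes` are hypothetical. Deleting from the highest index down means earlier deletions do not shift the rows still waiting to be removed.

```python
rows = ['row0', 'row1', 'bad_a', 'row3', 'bad_b', 'row5']
bad_indexes = [2, 4]

# Delete from the highest index down so earlier deletions
# do not shift the positions of rows still to be removed.
for index in sorted(bad_indexes, reverse=True):
    del rows[index]

print(rows)  # ['row0', 'row1', 'row3', 'row5']
```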

### The following **functions** loop through each data set and identify these common errors.

`len_error` compares the length of each row to the length of the data set's header. If the lengths differ, it appends that row to a list. It prints:
1. The number of rows with **shifted columns** (the length of the list)
2. The indexes of rows with shifted columns
3. Some examples of shifted rows (the first 15 entries of the list)

In [None]:
def len_error(dataset, header):
    eLen_list = []  # rows whose length differs from the header's
    ind_list = []   # positions of those rows in the data set

    # enumerate() gives each row's own position, even if an identical row appears earlier
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            eLen_list.append(row)
            ind_list.append(index)

    print("Number of rows with shifted columns: ", len(eLen_list))
    print('\n')
    print("Indexes of rows with shifted columns: ",ind_list)
    print('\n')
    print("Some shifted column apps: ", eLen_list[:15])


`empty_str` loops through each data point and compares it to an **empty string ""**. If a row has at least one empty string cell, it is appended to a list. It prints:
1. The number of rows with empty string cells (the length of the list)
2. The indexes of rows with empty string cells
3. Some examples of rows with empty string cells (the first 15 entries of the list)

In [None]:
def empty_str(dataset):
    eStr_list = []  # rows containing at least one empty string
    ind_list = []   # positions of those rows in the data set

    for index, row in enumerate(dataset):
        if "" in row:  # record each row once, even if it has several empty cells
            eStr_list.append(row)
            ind_list.append(index)

    print("Number of rows with empty string data points: ", len(eStr_list))
    print('\n')
    print("Indexes of rows with empty string data points: ",ind_list)
    print('\n')
    print("Some rows with empty string data points : ", eStr_list[:15])


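`dupl` loops through the app name of each row in the data set and appends names it has not seen before to a unique name list. If the name already exists in the unique name list, it is appended to a duplicate name list instead. It prints:
1. The number of rows with **duplicate names** (the length of the duplicate list)
2. The expected number of unique dictionary keys (total rows minus duplicates)
3. The indexes of rows with duplicate names

> The indexes come from `dataset.index(row)`, which returns the position of the first row equal to `row`. When a duplicate is an exact copy of an earlier row, the earlier row's index is reported, so the printed list is not strictly increasing.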

In [None]:
def dupl(dataset, name_index):
    
    dupl_list = []
    uniq_list = []
    ind_list = []
    
    for row in dataset:
        name = row[name_index]
        
        if name in uniq_list:
            dupl_list.append(name)
            ind_list.append(dataset.index(row))
        else:
            uniq_list.append(name)
            
    print("Number of duplicate apps: ", len(dupl_list))
    print('\n')
    print("Expected number of unique keys: ", len(dataset) - len(dupl_list))
    print('\n')
    print("Indexes of rows with duplicate apps: ",ind_list)
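
If we want each duplicate's own position instead, one option is to track positions with `enumerate()` and keep the names seen so far in a set. This is only a sketch under that assumption: `find_duplicates` is a hypothetical helper, not the function used in the rest of this notebook.

```python
def find_duplicates(dataset, name_index):
    """Return (index, name) pairs for every repeated app name."""
    seen = set()     # names encountered so far
    duplicates = []  # (row index, name) of each later occurrence
    for index, row in enumerate(dataset):
        name = row[name_index]
        if name in seen:
            duplicates.append((index, name))
        else:
            seen.add(name)
    return duplicates

toy_data = [['A', '10'], ['B', '5'], ['A', '10'], ['A', '12']]
print(find_duplicates(toy_data, 0))  # [(2, 'A'), (3, 'A')]
```

A set also makes each membership check much faster than searching a list, which helps on data sets with thousands of rows.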

`new_dict` loops through each row in the data set and returns a dictionary that maps each app name to its highest number of reviews, so every app appears only once.
> The higher the number of reviews, the more **recent** the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

In [None]:
def new_dict(dataset, name_index, rev_index):
    
    recent_data = {}
    
    for row in dataset:
        
        name = row[name_index]
        n_reviews = float(row[rev_index])
        
        if ((name in recent_data) and (recent_data[name] < n_reviews)):
            
            recent_data[name] = n_reviews 
            
        elif name not in recent_data:
            
            recent_data[name] = n_reviews 
    
    print("Actual number of unique keys: ", len(recent_data))
            
    return recent_data

`unique_data` loops through each row in the dataset and isolates the current row's name and review count. The review count is compared to the dictionary's value corresponding to that name (ie. the most recent data entry). If the values are equal and the name of the app is not already stored in the `already_added` list then the current row is added to `clean_list` and the app name is added to the `already_added` list.

> We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry 

In [None]:
def unique_data(dataset, name_index, rev_index, dictionary):

    clean_list = []
    already_added = []
    
    for row in dataset:
        
        name = row[name_index]
        n_reviews = float(row[rev_index])
        
        if((dictionary[name] == n_reviews) and (name not in already_added)):
            
            clean_list.append(row)
            already_added.append(name)
    
    return clean_list
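
To see the two-pass idea on a small scale, the sketch below combines both passes into one hypothetical helper, `keep_most_reviewed`, and runs it on made-up rows. It is an illustration of the approach, not a replacement for the functions above.

```python
def keep_most_reviewed(dataset, name_index, rev_index):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # name -> highest review count seen
    for row in dataset:
        name, n_reviews = row[name_index], float(row[rev_index])
        if name not in best or best[name] < n_reviews:
            best[name] = n_reviews

    clean, added = [], set()
    for row in dataset:
        name, n_reviews = row[name_index], float(row[rev_index])
        # The `added` check handles ties for the highest review count
        if best[name] == n_reviews and name not in added:
            clean.append(row)
            added.add(name)
    return clean

toy_data = [['A', '10'], ['B', '5'], ['A', '12'], ['A', '12']]
print(keep_most_reviewed(toy_data, 0, 1))  # [['B', '5'], ['A', '12']]
```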

### Identify shifted columns, empty string cells in Google Play apps

In [None]:
len_error(google_data, google_header)

Number of rows with shifted columns:  1


Indexes of rows with shifted columns:  [10472]


Some shifted column apps:  [['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]


In [None]:
empty_str(google_data)

Number of rows with empty string data points:  2


Indexes of rows with empty string data points:  [1553, 10472]


Some rows with empty string data points :  [['Market Update Helper', 'LIBRARIES_AND_DEMO', '4.1', '20145', '11k', '1,000,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'February 12, 2013', '', '1.5 and up'], ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]


To remove the rows with shifted columns and empty strings we use the **`del` statement**, deleting the higher index first so that the lower index still points at the intended row.

> Remember to only run the del statement **once**. I will comment it out after running it once as a precaution.

In [None]:
print("Initial number of apps: ", len(google_data))

del google_data[10472]  # row with shifted columns
del google_data[1553]   # row with an empty string cell (index reported by empty_str)

print("Number of apps after removing rows with shifted columns and empty strings: ", len(google_data))

Initial number of apps:  10841
Number of apps after removing rows with shifted columns and empty strings:  10839


In [None]:
print(google_data[10471])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


After removing the rows with shifted columns and empty strings we move onto removing **duplicates** and keeping only the **most recent data** available.

In [None]:
dupl(google_data, 0)

Number of duplicate apps:  1181


Expected number of unique keys:  9658


Indexes of rows with duplicate apps:  [222, 204, 193, 213, 253, 204, 237, 238, 193, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 228, 285, 230, 231, 232, 233, 234, 235, 192, 293, 294, 243, 336, 382, 383, 385, 386, 390, 392, 393, 359, 369, 408, 410, 411, 412, 372, 415, 417, 419, 420, 388, 385, 436, 451, 452, 464, 383, 389, 467, 391, 390, 399, 402, 403, 407, 483, 490, 485, 486, 488, 487, 520, 495, 489, 491, 518, 503, 516, 525, 494, 500, 502, 564, 565, 566, 568, 571, 574, 575, 576, 579, 580, 534, 528, 590, 596, 662, 663, 664, 603, 610, 496, 506, 670, 605, 608, 673, 674, 602, 589, 611, 679, 607, 528, 604, 613, 606, 609, 529, 600, 501, 697, 702, 700, 734, 741, 702, 700, 784, 740, 742, 741, 756, 789, 749, 793, 750, 796, 784, 801, 737, 807, 740, 824, 784, 800, 803, 801, 832, 717, 835, 809, 766, 824, 843, 844, 740, 835, 801, 883, 903, 855, 888, 931, 932, 933, 914, 912, 913, 917, 938, 919, 92

In [None]:
google_dict = new_dict(google_data, 0, 3)

Actual number of unique keys:  9658


The expected number of unique keys **matches** the actual number of unique keys! 

> Therefore our dictionary has exactly one key per app.

Next, we use the dictionary to build a list with exactly one row per app: the row with the highest number of reviews. When several rows tie for the highest count, only the first one is kept.

In [None]:
google_uniq = unique_data(google_data, 0, 3, google_dict)

Now we **explore** our clean data list to double check our process.

In [None]:
print(google_header)
print('\n')
explore_data(google_uniq, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9658
Number of columns: 13


#### Our Google Play data set is free from duplicates, shifted column rows, and empty cell rows!

### Identify and delete shifted columns, empty string cells, and duplicates in the Apple Store data set

In [None]:
len_error(apple_data, apple_header)

Number of rows with shifted columns:  0


Indexes of rows with shifted columns:  []


Some shifted column apps:  []


In [None]:
empty_str(apple_data)

Number of rows with empty string data points:  0


Indexes of rows with empty string data points:  []


Some rows with empty string data points :  []


We **do not** have any rows with shifted columns or empty strings in our Apple Store data. Yay! 

Now we proceed to remove all **duplicates**:

In [None]:
dupl(apple_data,1)

Number of duplicate apps:  2


Expected number of unique keys:  7195


Indexes of rows with duplicate apps:  [4463, 4831]


In [None]:
apple_dict = new_dict(apple_data,1,5)

Actual number of unique keys:  7195


The expected number of unique keys **matches** the actual number of unique keys! 

> Therefore our dictionary has exactly one key per app.

Next, we use the dictionary to build a list with exactly one row per app: the row with the highest number of reviews. When several rows tie for the highest count, only the first one is kept.

In [None]:
apple_uniq = unique_data(apple_data, 1, 5, apple_dict)

Now we **explore** our clean data list to double check our process.

In [None]:
print(apple_header)
print('\n')
explore_data(apple_uniq, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7195
Number of columns: 16


#### Our Apple Store data set is free from duplicates, shifted column rows, and empty cell rows!

### 1.4 Remove non-English apps

Our data must be modified to fit the purpose of our analysis.

We want to analyze apps that satisfy two conditions:
* Cost = Free
* Language = English

English text usually **includes**: 
* Letters from the English alphabet 
* Numbers composed of digits from 0 to 9
* Punctuation marks (., !, ?, ;), and 
* Other symbols (+, *, /).

#### According to the ASCII system, each character we use in a string has a corresponding number associated with it

We can build a function that detects whether a character belongs to the set of common English characters or not.

> The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127

##### **We can get the corresponding number of each character using the `ord(c)` function**
Given a string representing one Unicode character (like "a"), `ord(c)` returns an integer representing the Unicode code point of that character (like 97).

Our app names are stored as strings. How can we apply `ord(c)` to a whole name when it only accepts a single character?

<blockquote> In Python, strings are indexable and iterable. We can: 
    
* Use indexing to select an individual character, or 
* Iterate on the string using a for loop </blockquote>
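
Both options can be sketched on a short sample name. The string `'Docs™'` is just an example chosen because it contains one character outside the ASCII range.

```python
name = 'Docs™'

# Indexing selects a single character
print(ord(name[0]))  # 68, the code point of 'D'

# Iterating visits every character in turn
code_points = [ord(character) for character in name]
print(code_points)   # [68, 111, 99, 115, 8482]

# Characters above 127 fall outside the ASCII range
non_ascii = [character for character in name if ord(character) > 127]
print(non_ascii)     # ['™']
```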

#### **Exceptions: Emojis and other characters fall outside our ASCII range**

If we reject every name that contains a character outside the ASCII range, we'll lose useful data, since many English **apps will be incorrectly labeled as non-English**
> To minimize data loss, we'll only remove an app if its name has more than **three** characters with code points > 127

The function `is_english()` takes in a string and outputs a boolean value: 
* True if the string contains at most three characters outside the ASCII range (emojis or other symbols)

* False if the string contains more than three such characters, in which case we treat it as non-English

In [None]:
def is_english(string):
    counter = 0
    
    for character in string:
        
        if ord(character) > 127:
            
            counter += 1
            
            if counter > 3:
                
                return False
            
    return True


In [None]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


Now, we call our `is_english()` function inside our `clean_lang()` function to loop through our dataset and return an **English-only dataset**

In [None]:
def clean_lang(dataset, name_index):
    eng_apps = []  # rows whose app name passes is_english()

    for row in dataset:
        name = row[name_index]
        if is_english(name):
            eng_apps.append(row)

    return eng_apps

In [None]:
google_eng = clean_lang(google_uniq, 0)

print(google_eng[:10])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

In [None]:
apple_eng = clean_lang(apple_uniq, 1)
print(apple_eng[:10])

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1'], ['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'], ['553834731', '

### 1.5 Remove non-free apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; **we'll need to isolate only the free apps for our analysis.**

In [None]:
def free_only(dataset, price_index):
    
    free_apps = []
    paid_apps = []
    
    for row in dataset:
        
        price = row[price_index]
        
        if (price == "0") or (price == "0.0") or (price == "Free"):
            
            free_apps.append(row)
            
        else:
            
            paid_apps.append(row)
    
    return free_apps
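
The string comparison above only recognises the exact spellings `"0"`, `"0.0"` and `"Free"`. A more general check could convert the price to a number first. The sketch below assumes paid prices look like `'$4.99'` (an assumption about the data, not something shown in the samples above), and `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99';
    any other format (e.g. 'Free') raises ValueError.
    """
    return float(price.strip().lstrip('$')) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```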

#### Now let's **apply our functions** to our google_eng and apple_eng datasets!

In [None]:
google_clean = free_only(google_eng,7)

print(google_clean[:10])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

In [None]:
apple_clean = free_only(apple_eng,4)

print(apple_clean[:10])

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'], ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1'], ['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'], ['553834731', '

### We have cleaned our data!

So far we've done this:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

In [None]:
print("Number of Free - English apps available on Google Play: ", len(google_clean))
print("\n")
print("Number of Free - English apps available on Apple Store: ", len(apple_clean))

Number of Free - English apps available on Google Play:  8863


Number of Free - English apps available on Apple Store:  3220


## 2.0 App profile analysis

As we mentioned in the introduction, our aim is to **determine the kinds of apps that are likely to attract more users** because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

### 2.1 Most common genres/ categories for each market

We will generate a sorted frequency table that outputs the percentage of each genre/category in descending order.
> Dictionaries are not sorted by their values, so a raw frequency table would be very difficult to analyze.

##### **The `sorted()` function can help us display the entries in the frequency table in a descending order**
This function takes in an iterable (like a list, dictionary, tuple, etc.) and returns a list of its elements sorted in ascending or descending order (the `reverse` parameter controls the direction).
> It doesn't work too well with dictionaries because it only considers and returns the dictionary **keys**. However, we can transform the dictionary into a **list of tuples**, where each tuple contains a dictionary key along with its corresponding dictionary value. The order matters: with `(value, key)` tuples, `sorted()` compares the values first, so the table ends up ordered by percentage.
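
A small sketch of this behaviour, using a toy dictionary with rounded, illustrative percentages:

```python
genre_share = {'Games': 58.1, 'Music': 2.0, 'Entertainment': 7.9}  # toy values

# sorted() on a dictionary only sees the keys
print(sorted(genre_share))  # ['Entertainment', 'Games', 'Music']

# (value, key) tuples are compared on the value first
pairs = [(value, key) for key, value in genre_share.items()]
for value, key in sorted(pairs, reverse=True):
    print(key, ':', value)
# Games : 58.1
# Entertainment : 7.9
# Music : 2.0
```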

In [None]:
def freq_table(dataset, column_index):
    
    table = {}
    total = 0
    percent_table = {}

    for row in dataset:
        
        total += 1
        key = row[column_index]
        
        if key in table:
            
            table[key] += 1
            
        else:
            
            table[key] = 1
    
    for key in table:
        
        percentage = (table[key]/total)*100
        
        percent_table[key] = percentage
        
   
    return percent_table
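
For comparison, the same percentages can be computed more compactly with `collections.Counter` from the standard library. This is an alternative sketch, not the function used in the rest of the notebook, and `freq_table_counter` is a hypothetical name.

```python
from collections import Counter

def freq_table_counter(dataset, column_index):
    """Return {value: percentage of rows} for one column."""
    counts = Counter(row[column_index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

toy_data = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(toy_data, 1))  # {'Games': 75.0, 'Music': 25.0}
```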

In [None]:
def display_table(dataset, index):
    
    table = freq_table(dataset, index)
    tuple_list = []
    
    for key in table:
        
        vk_tuple = (table[key],key)
        
        tuple_list.append(vk_tuple)
        
    table_sorted = sorted(tuple_list, reverse = True)
    
    for entry in table_sorted:
        print(entry[1],":",entry[0])

In [None]:
ios_genres = display_table(apple_clean, 11)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


In [None]:
droid_genre = display_table(google_clean,9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9251946293580051
Auto & Vehicles : 0.9251946293580051
S

In [None]:
droid_category = display_table(google_clean, 1)

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9251946293580051
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

#### Frequency Table Highlights

|  | Google Play | Apple Store |
| --- | ----------- | --- |
| Most common category | Family = 18.91 % | |
| 2nd most common category | Game = 9.73 % | |
| 3rd most common category | Tools = 8.46 % | |
|  |  |  |
| Most common genre | Tools = 8.45 % |Games = 58.14 % |
| 2nd most common genre | Entertainment = 6.07 % |Entertainment = 7.89 % |
| 3rd most common genre | Education = 5.35 % |Photo & Video = 4.97 % |


#### Apple Store Analysis

From a sample of free English apps, more than a half (58.14%) are games. Entertainment apps account for 8% of the pool, followed by photo and video apps at almost 5%.

This sample (free English apps) is **dominated by apps that are designed for fun and leisure** rather than productivity and utility.

However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: **demand might not match supply**.

####  Google Play Analysis

From a sample of free English apps, the most common genres are Tools (8.45%), Entertainment (6.07%), and Education (5.35%). The most common categories are Family (18.91%), Game (9.73%), and Tools (8.46%).

According to the genres table, this sample (free English apps) seems to be **dominated by apps designed for practical purposes**. According to the categories table, however, Family accounts for almost 19% of the pool, and that category consists mostly of children's games.

The difference between the Genres and the Category columns is not clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so **we'll only work with the Category column moving forward**.
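One quick way to check the granularity claim is to count the distinct values in each column. The sketch below runs on a small hypothetical data set; `sample_rows`, `count_unique()`, and the column positions are assumptions made for illustration, not part of the original notebook.

```python
# Hypothetical mini data set: [app name, Category, Genres]
sample_rows = [
    ["App A", "FAMILY", "Education;Pretend Play"],
    ["App B", "FAMILY", "Casual;Brain Games"],
    ["App C", "GAME", "Puzzle"],
    ["App D", "GAME", "Racing"],
    ["App E", "TOOLS", "Tools"],
]


def count_unique(dataset, index):
    """Return the number of distinct values found in one column."""
    return len({row[index] for row in dataset})


n_categories = count_unique(sample_rows, 1)
n_genres = count_unique(sample_rows, 2)

print("Distinct categories:", n_categories)
print("Distinct genres:", n_genres)
```

On the real data the same helper could be pointed at `google_clean`, using the actual positions of the Category and Genres columns.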

#### Summary:

The App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and leisure apps. Now we'd like to get an idea about the **kinds of apps that have the most users.**

### 2.2 Most popular apps for each market

One way to find out what genres are the most popular (have the most users) is to calculate the **average number of installs for each app genre**. 

For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

`avg_install()` prints the average number of installs (or, for the App Store, user ratings) per app genre.

In [None]:
def avg_install(dataset, genre_index, install_index):
    """Print the average installs (or rating counts) for each genre in `dataset`."""

    # freq_table() supplies the unique genres as dictionary keys
    genre_table = freq_table(dataset, genre_index)

    for key in genre_table:

        total = 0        # sum of installs for this genre
        len_genre = 0    # number of apps in this genre

        for row in dataset:

            if row[genre_index] == key:

                # Strip "+" and "," so strings like "1,000,000+" convert to float
                n_installs = row[install_index].replace("+", "").replace(",", "")

                total += float(n_installs)
                len_genre += 1

        average = total / len_genre

        print(key, ":", average)
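The function above rescans the whole data set once per genre. As a possible alternative, the sketch below accumulates totals and counts in a single pass and returns a dictionary, so the averages can be sorted afterwards. `parse_installs()`, `avg_by_genre()`, and `sample_rows` are hypothetical names introduced for illustration only.

```python
def parse_installs(value):
    """Convert a string such as '1,000,000+' (or '345046') to a float."""
    return float(value.replace("+", "").replace(",", ""))


def avg_by_genre(dataset, genre_index, install_index):
    """Return {genre: average installs}, reading each row only once."""
    totals = {}
    counts = {}
    for row in dataset:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + parse_installs(row[install_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Hypothetical rows: [app name, category, installs]
sample_rows = [
    ["App A", "TOOLS", "1,000+"],
    ["App B", "TOOLS", "3,000+"],
    ["App C", "GAME", "10,000+"],
]

averages = avg_by_genre(sample_rows, 1, 2)

# Print the genres from highest to lowest average
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", average)
```

Returning the dictionary instead of printing inside the function also makes it easier to rank the genres or reuse the averages later.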

In [None]:
avg_install(google_clean,1,5)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 646168.4146341464
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND

In [None]:
avg_install(apple_clean,11,5)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22812.92467948718
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


#### Top 5 Most Popular App Genres/Categories


| Rank | Google Play (avg. installs) | Apple Store (avg. user ratings) |
| --- | ----------- | --- |
| 1 | COMMUNICATION = 38M + | Navigation = 86K + |
| 2 | VIDEO_PLAYERS = 24M + | Reference = 74K + |
| 3 | SOCIAL = 23M + | Social Networking = 71K + |
| 4 | PHOTOGRAPHY = 17M + | Music = 57K + |
| 5 | PRODUCTIVITY = 16M + | Weather = 52K + |

#### App Store Analysis

Navigation, Reference, and Social Networking are the most popular app genres for free apps in English. Let's take a look at a sample from each of these genres. The App Store rows are already sorted by rating count in descending order, so we will only print the top 10.

In [None]:
# Rows are already sorted by rating count, so the first 10 matches
# in each genre are that genre's 10 most-rated apps.
count = 0
for row in apple_clean:

    if (row[11] == 'Navigation') and (count < 10):
        count += 1
        print(row[1], ':', row[5])


print("\n")

count = 0
for row in apple_clean:

    if (row[11] == 'Reference') and (count < 10):
        count += 1
        print(row[1], ':', row[5])


print("\n")

count = 0
for row in apple_clean:
    if (row[11] == 'Social Networking') and (count < 10):
        count += 1
        print(row[1], ':', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693


Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and

The Navigation genre might be the most popular only because of Waze and Google Maps, which have markedly higher rating counts than the other apps in the same genre.

The Reference genre's most-rated apps are the Bible and Dictionary.com.

The Social Networking genre is probably the third most popular because of apps like Facebook and Pinterest, which have huge user bases compared with the other apps in the same genre.
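To put a number on this skew, we can take the Navigation rating counts printed above. Their mean matches the 86090.33 that `avg_install()` reported for the genre, which suggests these six apps are the whole genre. The sketch below compares the mean with the median and computes the share of ratings held by the two largest apps; the variable names are illustrative.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps printed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

total_ratings = sum(navigation_ratings)

# Share of all ratings that belongs to the two most-rated apps
top_two = sorted(navigation_ratings, reverse=True)[:2]
top_two_share = sum(top_two) / total_ratings

print("Mean rating count:", round(mean(navigation_ratings), 2))
print("Median rating count:", median(navigation_ratings))
print("Share held by the top two apps:", round(top_two_share * 100, 1), "%")
```

The mean (about 86,090) is roughly ten times the median (8,196.5), and Waze and Google Maps together account for about 96.8% of all Navigation ratings, so the genre average says little about a typical Navigation app.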

The Reference genre appears to be the best option for launching an app. Unlike Social Networking, which Facebook dominates, no single app has a monopoly over it. We could launch a reading app that offers new and helpful features, like an in-app dictionary tool.

This idea fits well with the fact that the App Store is dominated by for-fun apps. The market may be somewhat saturated with them, so a practical app might have a better chance of standing out among the huge number of apps on the App Store.

#### Google Play Analysis

Communication, Video Players, and Social are Google Play's most popular app categories for free apps in English. We will print the most installed apps from each of these categories, along with Photography, Productivity, and Books and Reference for comparison.

In [None]:
# Install tiers used to pick out the most installed apps
top_tiers = ["1,000,000,000+", "500,000,000+"]
wide_tiers = top_tiers + ["100,000,000+"]

# Each category is paired with the tiers to display for it
categories = [
    ('COMMUNICATION', top_tiers),
    ('VIDEO_PLAYERS', top_tiers),
    ('SOCIAL', wide_tiers),
    ('PHOTOGRAPHY', wide_tiers),
    ('PRODUCTIVITY', wide_tiers),
    ('BOOKS_AND_REFERENCE', wide_tiers),
]

for position, (category, tiers) in enumerate(categories):

    # Separate consecutive categories with blank lines
    if position > 0:
        print("\n")

    for row in google_clean:

        if (row[1] == category) and (row[5] in tiers):

            print(row[0], ':', row[5])

WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Instagram : 1,000,000,000+
Snapchat : 500,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


B612 - Beauty & Filter Camera : 100,000,0

The Communication category appears to be completely saturated by giants like WhatsApp, Facebook Messenger, and Google's apps (Chrome, Gmail, Hangouts).

The second most popular category, Video Players, is dominated by other giants like YouTube, Google Play Movies & TV, and MX Player.

The third most popular category, Social, follows a similar pattern.

A major concern here is that these categories may only look like the most popular ones: a handful of apps with huge user bases, which are hard to compete against, inflate the averages and skew the results.
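One way to test this concern is to recompute a category's average while leaving out the giants. The sketch below is only an illustration: `avg_installs_under()`, the 100-million cap, and `sample_rows` are assumptions, and the rows are hypothetical ones laid out like `google_clean` (category at index 1, installs at index 5).

```python
def parse_installs(value):
    """Convert a string such as '1,000,000+' to a float."""
    return float(value.replace("+", "").replace(",", ""))


def avg_installs_under(dataset, category, cap, category_index=1, install_index=5):
    """Average installs for one category, ignoring apps at or above `cap`."""
    kept = [
        parse_installs(row[install_index])
        for row in dataset
        if row[category_index] == category
        and parse_installs(row[install_index]) < cap
    ]
    # Avoid dividing by zero when every app is filtered out
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical rows: only the category and installs columns are filled in
sample_rows = [
    ["Giant Chat", "COMMUNICATION", "", "", "", "1,000,000,000+"],
    ["Small Chat", "COMMUNICATION", "", "", "", "1,000,000+"],
    ["Tiny Chat", "COMMUNICATION", "", "", "", "500,000+"],
]

avg_all = avg_installs_under(sample_rows, "COMMUNICATION", cap=float("inf"))
avg_capped = avg_installs_under(sample_rows, "COMMUNICATION", cap=100_000_000)

print("Average with giants:", avg_all)
print("Average without giants:", avg_capped)
```

In this toy example a single giant lifts the average from 750,000 to over 333 million installs. Running the same comparison on `google_clean` would show how much each top category depends on its largest apps.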

If we look at the Books and Reference category's most popular apps, we find a shorter list than in the other categories, which suggests an area of opportunity. Books and Reference is only the 11th most popular category. However, the top categories might only appear popular because of apps with enormous user bases, like Facebook, which skew the averages.

The market seems to be saturated with libraries and software for processing and reading ebooks, so we need to add some special features to our book application. These might include an audiobook version, daily reading goals, a discussion forum, an in-app dictionary tool, etc.

## 3.0 Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.