# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [8]:
# import reduce from functools, numpy and pandas
import pandas as pd
import numpy as np
from functools import reduce

#### Fun fact: 
In 2005, Python's author, Guido van Rossum, criticized `map()`, `filter()`, and `reduce()`. He argued against these functions, saying that loops were more readable and easier to handle. The discussion led to `reduce()` being removed from the built-in functions and moved into the `functools` module.

Source: https://www.artima.com/weblogs/viewpost.jsp?thread=98196

# Challenge 1 - Mapping

#### We will use the map function to clean up some words in a book.

In the following cell, we will read a text file containing the book The Prophet by Kahlil Gibran.

* The book can be downloaded from this [website](https://www.gutenberg.org/ebooks/58585).
* Our first step is to download the data and open it in our Python environment:

In [9]:
# Define the file location
location = '../58585-0.txt'

In [10]:
# open the file with open() and read its whole content into one string
with open(location, 'r', encoding="utf8") as f:
    prophet_string = f.read()

In [11]:
# the object imported is a string
print('Object type:', type(prophet_string))
print('Object length:', len(prophet_string))

Object type: <class 'str'>
Object length: 87749


* Now, let's divide the string we just imported at each space character.
* The method that divides a string is called `split()`.

In [12]:
# split (tokenize) the string with the split() method
prophet = prophet_string.split(' ')

In [13]:
# after splitting the string, we get a list with 13,637 elements
print('Object type:', type(prophet))
print('Object length:', len(prophet))

Object type: <class 'list'>
Object length: 13637


* Can you explain why the type of the object changed and the number of elements decreased?

In [14]:
# Before the split, the object was a single string, so len() counted its characters.
# After the split, each space-separated word became a separate string in a list,
# so len() now counts the number of words (list elements) instead.
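
A toy illustration of this answer, using a made-up sentence rather than the book file: `len()` of a string counts characters, while `len()` of the list returned by `split(' ')` counts the chunks between single spaces. The sketch also shows that splitting on `' '` leaves newlines inside chunks and yields an empty string between two consecutive spaces, which is consistent with elements such as `'Gibran\n\nThis'` and `''` in the outputs of this notebook.

```python
# Hypothetical sample text (not taken from the book file)
sample = "The Prophet,  by Kahlil\nGibran"

n_chars = len(sample)        # number of characters in the string
chunks = sample.split(' ')   # split on a single space character only
n_chunks = len(chunks)       # number of elements in the resulting list

print(n_chars, n_chunks)
print(chunks)

# split() with no argument splits on any run of whitespace instead
whitespace_chunks = sample.split()
print(whitespace_chunks)
```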

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [15]:
# Let's start by checking the number of elements inside the list 'prophet'
print(len(prophet))

13637


* The `prophet` list has 13,637 elements. If we drop the first 568 elements, we expect a new list with 13,069 elements.

In [16]:
len(prophet)-568

13069

#### __Method 1__: list slicing with [ ]
* Now let's slice the list from index 568 through the end of the list (its length is 13637).

In [17]:
# define the index where the slice starts (the number of elements to drop)
max_size = 568

In [18]:
# slice the list from index max_size through the last element
prophet_568 = prophet[max_size:]

In [19]:
# the new object has 13,069 elements, as expected
len(prophet_568)

13069

* Recheck: the element at index 568 of `prophet` is 'PROPHET\n\n|Almustafa,'.
* So the new object should start with the same element, 'PROPHET\n\n|Almustafa,'.

In [20]:
prophet[568]

'PROPHET\n\n|Almustafa,'

In [21]:
prophet_568[0]

'PROPHET\n\n|Almustafa,'
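
The same check can be sketched on a small list. The list contents and the name `n_drop` below are hypothetical stand-ins for `prophet` and the 568 header elements; the point is only that `lst[n:]` keeps `len(lst) - n` elements and starts at the element that was at index `n`.

```python
# Hypothetical toy list standing in for `prophet`
words = ['header1', 'header2', 'header3', 'PROPHET', 'Almustafa,', 'the']
n_drop = 3  # number of leading elements to discard

# Slicing returns a new list; `words` itself is left unchanged
body = words[n_drop:]

print(len(body) == len(words) - n_drop)   # length shrinks by n_drop
print(body[0] == words[n_drop])           # first kept element was at index n_drop
```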

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at the words at indices 1 through 9.

In [22]:
# Slice the list from index 1 up to (but not including) index 10
prophet1_10 = prophet[1:10]

In [23]:
prophet1_10

['Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran\n\nThis']

In [24]:
# the new list has 9 elements
len(prophet1_10)

9

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [25]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    # Your code here:
    return x.split('{')[0]
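
As a quick check of how splitting on `{` behaves, here is a minimal sketch on made-up inputs (the sample words are hypothetical). A word without `{` comes back unchanged, and a word that starts with `{` becomes an empty string. `str.partition` is shown as an alternative that gives the same result here; it is not the approach the exercise asks for.

```python
def reference(x):
    # Keep only the part of the string before the first '{'
    return x.split('{')[0]

samples = ['the{7}', 'beloved', '{12}']   # hypothetical sample words

cleaned = [reference(s) for s in samples]
print(cleaned)

# Alternative: partition splits once at the first '{' and returns a 3-tuple
alt = [s.partition('{')[0] for s in samples]
print(alt)
```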

In [26]:
prophet_string.index('{')

867

* This is the index of the first '{' in the string we imported at the start.

Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Store the resulting list in a new variable called `prophet_reference`.

In [27]:
# Your code here:

prophet_reference = list(map(reference, prophet))
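
The call to `list()` matters because, in Python 3, `map()` returns a lazy iterator rather than a list. A minimal sketch on a hypothetical three-word sample shows this, together with the equivalent list comprehension:

```python
def reference(x):
    # Keep only the part of the string before the first '{'
    return x.split('{')[0]

words = ['the{7}', 'chosen', 'and{8}']   # hypothetical sample

lazy = map(reference, words)   # an iterator; nothing has been computed yet
as_list = list(lazy)           # consuming the iterator builds the list
again = list(lazy)             # the iterator is now exhausted

# A list comprehension produces the same result in one step
comprehension = [reference(w) for w in words]

print(as_list, again, comprehension)
```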

In [28]:
prophet_reference

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran\n\nThis',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and\nmost',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions\nwhatsoever.',
 '',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms\nof',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at\nwww.gutenberg.org.',
 '',
 'If',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'United',
 'States,',
 "you'll\nhave",
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using\nthis',
 'ebook.\n\n\n\nTitle:',
 'The',
 'Prophet\n\nAuthor:',
 'Kahlil',
 'Gibran\n\nRelease',
 'Date:',
 'January',
 '1,',
 '2019',
 '[EBook',
 '#58585]\

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [29]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    return x.split('\n')    
    

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [30]:
prophet_line = list(map(line_break, prophet_reference))
prophet_line[0:10]

[['\ufeffThe'],
 ['Project'],
 ['Gutenberg'],
 ['EBook'],
 ['of'],
 ['The'],
 ['Prophet,'],
 ['by'],
 ['Kahlil'],
 ['Gibran', '', 'This']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [31]:
prophet_flat = [y for x in prophet_line for y in x if y != '']
prophet_flat[0:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran']
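
The nested comprehension reads left to right like two nested `for` loops. As a sketch on a hypothetical three-element list of lists, the same flattening can also be written with `itertools.chain.from_iterable` from the standard library; this is an alternative for comparison, not what the exercise asks for.

```python
from itertools import chain

# Hypothetical nested list shaped like `prophet_line`
nested = [['Kahlil'], ['Gibran', '', 'This'], ['eBook']]

# Same pattern as the cell above: outer loop over sublists, inner loop over items
flat_comp = [y for x in nested for y in x if y != '']

# chain.from_iterable walks the sublists one after another
flat_chain = [y for y in chain.from_iterable(nested) if y != '']

print(flat_comp)
print(flat_chain)
```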

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [32]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list and False if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''    
    word_list = ['and', 'the', 'a', 'an']
    return x not in word_list
    

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [33]:
prophet_filter = list(filter(word_filter, prophet_flat))
prophet_filter[0:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran']
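
A small sketch of what `filter()` keeps and drops, on a hypothetical word list. The name `STOP_WORDS` is an assumption introduced here; it holds the same four words in a set, which is a common choice for membership tests. Note that the match is exact, so `'The'` and `'the,'` survive the filter.

```python
# Same four words as in word_filter, stored as a set (hypothetical name)
STOP_WORDS = {'and', 'the', 'a', 'an'}

def word_filter(x):
    # True keeps the word, False drops it
    return x not in STOP_WORDS

words = ['The', 'sea', 'and', 'the', 'ship', 'a', 'the,']   # hypothetical sample
kept = list(filter(word_filter, words))

print(kept)
```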

# Challenge 2 - Part 2

Rewrite the `word_filter` function above so that it is not case sensitive.

In [34]:
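def word_filter_case(x):
    '''
    Input: A string
    Output: True if the lowercased word is not in the specified list, False otherwise
    '''
    word_list = ['and', 'the', 'a', 'an']
    return x.lower() not in word_list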
    

In [35]:
prophet_filter_case = list(filter(word_filter_case, prophet_flat))
print("Case sensitive:", prophet_filter[30:40])
print('---------------------------')
print("Case insensitive:", prophet_filter_case[30:40])


Case sensitive: ['almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it']
---------------------------
Case insensitive: ['no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away']
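
The two slices above are shifted by one position because the filter that lowercases its input removed one more word earlier in the list. A minimal sketch on a hypothetical word list makes the difference between the two filters visible:

```python
word_list = ['and', 'the', 'a', 'an']
words = ['The', 'Prophet', 'and', 'THE', 'sea', 'An', 'island']   # hypothetical sample

# Exact match: only the lowercase 'and' is dropped
case_sensitive = [w for w in words if w not in word_list]

# Lowercasing first: 'The', 'THE' and 'An' are dropped as well
case_insensitive = [w for w in words if w.lower() not in word_list]

print(case_sensitive)
print(case_insensitive)
```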


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [36]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string with the two inputs separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return "{} {}".format(a, b)


Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.
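
The reduction described above can be sketched on a toy list (the four words are hypothetical). `reduce()` applies `concat_space` pairwise from left to right, so the result matches `' '.join(...)`, which is shown only for comparison.

```python
from functools import reduce

def concat_space(a, b):
    # Join two strings with a single space between them
    return "{} {}".format(a, b)

words = ['John', 'Smith', 'went', 'home']   # hypothetical sample

# Evaluates concat_space(concat_space(concat_space('John', 'Smith'), 'went'), 'home')
reduced = reduce(concat_space, words)

# str.join builds the same string in one call
joined = ' '.join(words)

print(reduced)
print(reduced == joined)
```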

In [37]:
prophet_string = reduce(concat_space, prophet_filter)
prophet_string[0:1000]

"\ufeffThe Project Gutenberg EBook of The Prophet, by Kahlil Gibran This eBook is for use of anyone anywhere in United States most other parts of world at no cost with almost no restrictions whatsoever. You may copy it, give it away or re-use it under terms of Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in United States, you'll have to check laws of country where you are located before using this ebook. Title: The Prophet Author: Kahlil Gibran Release Date: January 1, 2019 [EBook #58585] Last Updated: January 3, 2018 Language: English Character set encoding: UTF-8 *** START OF THIS PROJECT GUTENBERG EBOOK THE PROPHET *** Produced by David Widger from page images generously provided by Internet Archive Transcriber's Note: Page numbers, ie: are included in this utf-8 text file. For those wishing to use text file unencumbered with page numbers open or download Latin-1 file 58585-8.txt. THE PROPHET By Kahlil Gibran New York: Alf

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a DataFrame and transform all cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [38]:
# Run this code:

# The dataset below contains information about pollution from PM2.5 particles in Beijing 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the `head()` function.

In [None]:
pm25.head()

The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [None]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    return x / 24

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [None]:
pm25_hourly = pm25[['Iws','Is','Ir']].apply(hourly)
pm25_hourly.head()
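
A minimal sketch of what `apply` does here, on a tiny hand-made DataFrame. The numbers are hypothetical; only the column names come from the dataset. `DataFrame.apply` passes each column to `hourly` as a Series, so for simple arithmetic like this the result matches dividing the DataFrame directly.

```python
import pandas as pd

def hourly(x):
    # Works on a scalar or on a whole Series
    return x / 24

# Hypothetical values; only the column names match the real dataset
toy = pd.DataFrame({'Iws': [24.0, 48.0], 'Is': [0.0, 12.0], 'Ir': [72.0, 6.0]})

by_apply = toy[['Iws', 'Is', 'Ir']].apply(hourly)   # hourly receives one column at a time
by_division = toy[['Iws', 'Is', 'Ir']] / 24         # same arithmetic without apply

print(by_apply)
print(by_apply.equals(by_division))
```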

#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by (the length of the column minus 1). Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [None]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # Your code here:
    return np.std(x) / (x.count() - 1)
    

In [None]:
pm25_hourly_sd = pm25[['Iws','Is','Ir']].apply(sample_sd)
pm25_hourly_sd.head()
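
The example value in the `sample_sd` docstring can be cross-checked by hand. The sketch below uses only the standard library and assumes the population standard deviation (dividing by n), which is the default behaviour of `np.std`; the variable names are introduced here for illustration.

```python
import math
import statistics

values = [1, 2, 3, 4]   # the docstring's example input

# Population standard deviation: sqrt(mean of squared deviations) = sqrt(1.25)
pop_sd = statistics.pstdev(values)

# Divide by (number of elements - 1), as the exercise asks
sample_sd_value = pop_sd / (len(values) - 1)

print(round(pop_sd, 6))
print(round(sample_sd_value, 10))
```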