# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# import reduce from functools, numpy and pandas

from functools import reduce 
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up a words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
# Run this code:

location = '../58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')
    
    

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [3]:
# Your code here:

libro = list(filter(lambda x: prophet.index(x) > 567, prophet))
print(libro)

#print(prophet[567:])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'the\nbeloved,', 'who', 'was', 'dawn', 'unto', 'own\nday,', 'had', 'waited', 'twelve', 'years', 'city\nof', 'Orphalese', 'ship', 'that', 'was', 'to\nreturn', 'bear', 'him', 'back', 'isle', 'of\nhis', 'birth.\n\nAnd', 'twelfth', 'year,', 'on', 'seventh\nday', 'Ielool,', 'month', 'reaping,', 'he\nclimbed', 'hill', 'without', 'city', 'walls\nand', 'looked', 'seaward;', 'beheld', 'his\nship', 'coming', 'mist.\n\nThen', 'gates', 'heart', 'flung\nopen,', 'joy', 'flew', 'far', 'over', 'sea.\nAnd', 'closed', 'eyes', 'prayed', 'the\nsilences', 'soul.\n\n*****\n\nBut', 'as', 'descended', 'hill,', 'sadness\ncame', 'upon', 'him,', 'thought', 'his\nheart:\n\nHow', 'shall', 'I', 'go', 'peace', 'without\nsorrow?', 'Nay,', 'without', 'wound', 'the\nspirit', 'shall', 'I', 'leave', 'city.', '{8}Long\nwere', 'days', 'pain', 'I', 'have', 'spent\nwithin', 'its', 'walls,', 'long', 'the\nnights', 'aloneness;', 'who', 'can', 'depart\nfrom', 'pain', 'aloneness', '

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [4]:
# Your code here:

print(libro[:10])


['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'the\nbeloved,', 'who', 'was', 'dawn', 'unto', 'own\nday,', 'had']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [5]:
def quitar(x):
    
    if '{' in x:
    
        a = x.split('{')[0]
        b = x.split('}')[1]
        
        if a != ''  and b != '':
            c = a, b
            return str(c)
        
        elif a == '' and b != '':
            return b
        
        elif a != '' and b == '':
            return a
        
        else:
            return x
        

quitar('{7}the')

'the'

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`

In [None]:
# Your code here:

prophet_reference = list(map(quitar, libro))

print(prophet_reference)

#??

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [6]:
def line_break(x):
    return x.split("\n")

line_break('the\nbeloved')

['the', 'beloved']

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [None]:
# Your code here:

prophet_line = list(map(line_break, prophet_reference))
print(prophet_line)

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [None]:
prophet_flat = [x for e in prohpet_line for x in e]
print(prophet_flat)

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [None]:
word_list = ['and', 'the', 'a', 'an']

def word_filter(x):
    return x not in word_list

Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [None]:
word_list = ['and', 'the', 'a', 'an']

def word_filter_case(x):
    return x not in word_list.lower()
    
word_filter_case("The")

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [None]:
def concat_space(a, b):
    return a + " " + b

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [None]:
prophet_string = reduce(concat_space, prophet_filter)
print(prophet_string)

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to use the apply function to a dataframe and transform all cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [7]:
# Run this code:

# The dataset below contains information about pollution from PM2.5 particles in Beijing 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the `head()` function.

In [8]:
pm25.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [9]:
def hourly(x):
    return x/24

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [10]:
# Your code here:

pm25_hourly = pd.DataFrame(map(hourly, [pm25['Iws'], pm25['Is'], pm25['Ir']]))
pm25_hourly.T

Unnamed: 0,Iws,Is,Ir
0,0.074583,0.0,0.0
1,0.205000,0.0,0.0
2,0.279583,0.0,0.0
3,0.410000,0.0,0.0
4,0.540417,0.0,0.0
...,...,...,...
43819,9.665417,0.0,0.0
43820,9.907500,0.0,0.0
43821,10.112500,0.0,0.0
43822,10.280000,0.0,0.0


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [11]:
def sample_sd(x):
    return np.std(x) / (pd.Series.count(x) - 1)
    