# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import pandas as pd
import numpy as np

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [4]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [31]:
# your code here
prophet_reduced = prophet[568:]   # Drop elements 0 through 567; keep element 568 through the end
print('Length prophet: ', len(prophet))
print('Length prophet reduced: ', len(prophet_reduced))
print('Difference between both: ', len(prophet) - len(prophet_reduced))   # Number of words removed

Length prophet:  13637
Length prophet reduced:  13069
Difference between both:  568


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at the first 10 words.

In [24]:
# your code here
print(prophet_reduced[0:10])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.
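As a minimal sketch of this idea on single words (the helper name `strip_reference` and the sample words are illustrative assumptions, not part of the exercise): `str.split('{')` always returns at least one element, so taking index 0 is safe whether or not the word carries a reference.

```python
def strip_reference(word):
    # Hypothetical helper: keep only the text before the first '{'.
    # split() returns [word] when '{' is absent, so [0] is always valid.
    return word.split('{')[0]

samples = ['the{7}', 'chosen', '{12}']   # illustrative words, not taken from the book
cleaned = [strip_reference(w) for w in samples]
print(cleaned)  # ['the', 'chosen', '']
```

Note that a word consisting only of a reference becomes an empty string, which may need to be filtered out later.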

In [105]:
# Function with Loop
def reference(x):
    '''
    Input: A list of strings
    Output: A list of the same strings with references removed
    
    Example:
    Input: ['the{7}', 'chosen']
    Output: ['the', 'chosen']
    '''
    new_x = []   # Create empty list
    
    # your code here
    for word in x:   # Loop through every word in the list
        # split('{') always returns at least one element, so index 0 is
        # the text before the first '{' (or the whole word if there is none)
        new_x.append(word.split('{')[0])
    return new_x

reference(prophet_reduced)

In [104]:
# Function with List comprehension
def reference(x):
    return [word.split('{')[0] for word in x]   # Keep only the text before the first '{'

reference(prophet_reduced) 

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [112]:
# Function to use in map()
# map() calls the function once per element, so it must take a single word
def reference_map(word):
    return word.split('{')[0]   # Text before the first '{' (the whole word if there is none)

# Equivalent explicit loop that map() will replace
for word in prophet_reduced:
    print(reference_map(word))
    

In [110]:
# map() statement using previous function
x = prophet_reduced
#print(type(x))
#print(prophet_reference)

prophet_reference = list(map(reference_map, x))
print(prophet_reference)
#print(type(prophet_reference))

In [111]:
# map() statement using lambda function
x = prophet_reduced
prophet_reference = list(map(lambda word: word.split('{')[0], x))
print(prophet_reference)
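One detail worth illustrating: in Python 3, `map()` returns a lazy iterator rather than a list, which is why the cells above wrap it in `list()`. The sketch below uses a hypothetical helper `strip_reference` and a made-up sample list to show that the iterator can only be consumed once.

```python
def strip_reference(word):
    # Hypothetical helper mirroring the exercise: drop everything from '{' on.
    return word.split('{')[0]

words = ['the{7}', 'chosen', 'and']   # illustrative sample, not the real book

lazy = map(strip_reference, words)    # lazy iterator; nothing is computed yet
first_pass = list(lazy)               # consumes the iterator
second_pass = list(lazy)              # the iterator is already exhausted

print(first_pass)   # ['the', 'chosen', 'and']
print(second_pass)  # []
```

Storing the result of `list(map(...))` in a variable, as done with `prophet_reference`, avoids this pitfall.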

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [113]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # your code here   
    return x.split('\n')   # Split the word using '\n' as separator

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [129]:
# your code here
x = prophet_reference
prophet_line = list(map(line_break, x))

print(prophet_line)

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [124]:
# With for loop
x = prophet_line

prophet_flat = []
for sublist in x:
    for item in sublist:
        prophet_flat.append(item)
        
print(prophet_flat)

In [158]:
# With list comprehension
x = prophet_line
prophet_flat = [item for sublist in x for item in sublist]

print(prophet_flat)
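For comparison, the standard library offers `itertools.chain.from_iterable` for the same one-level flattening. The following is a small sketch on a made-up nested list (the variable names are illustrative), showing that it should give the same result as the list comprehension:

```python
from itertools import chain

nested = [['the', 'beloved,'], ['who'], ['own', 'day,']]   # illustrative sample

# One-level flattening with a nested list comprehension
flat_comprehension = [item for sublist in nested for item in sublist]

# The same flattening with itertools; chain.from_iterable is lazy, so wrap it in list()
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)  # ['the', 'beloved,', 'who', 'own', 'day,']
```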

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [160]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return x not in word_list   # True keeps the word, False drops it

prophet_filter = list(filter(word_filter, prophet_flat))
#len(prophet_filter)
print(prophet_filter)

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.
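To make the behavior of `filter()` concrete, here is a minimal sketch on a made-up word list. The names `STOP_WORDS` and `keep_word` are assumptions for illustration. `filter()` keeps exactly the elements for which the predicate returns a truthy value, and the match is exact, so capitalized or punctuated variants are not removed.

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}   # a set gives fast membership tests

def keep_word(word):
    # Hypothetical predicate: True keeps the word, False drops it
    return word not in STOP_WORDS

words = ['the', 'chosen', 'and', 'The', 'beloved,', 'a', 'dawn']   # illustrative sample
kept = list(filter(keep_word, words))

print(kept)  # ['chosen', 'The', 'beloved,', 'dawn']
```

Here `'The'` survives because the comparison is case sensitive.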

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [179]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return x.lower() not in word_list   # Compare in lowercase so 'The' and 'THE' are also dropped
    
prophet_filter = list(filter(word_filter_case, prophet_flat))
#len(prophet_filter)
print(prophet_filter)

In [153]:
x = 'THE'
word_list = ['and', 'the', 'a', 'an']
x = x.lower()
if x in word_list:
    print(True)
else:
    print(False)

True


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [176]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string made of the two inputs joined by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here
    return a + ' ' + b

a = 'hola'
b = 'que'
concat_space(a, b)

'hola que'

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [178]:
# your code here
prophet_string = reduce(concat_space, prophet_filter)
print(prophet_string)
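As a point of comparison, joining words with `reduce()` and a concatenating function should give the same string as the built-in `str.join`. The sketch below uses a short made-up list rather than the book:

```python
from functools import reduce

def concat_space(a, b):
    # Same idea as in the exercise: join two strings with a single space
    return a + ' ' + b

words = ['chosen', 'beloved', 'dawn']   # illustrative sample

# reduce() folds the list left to right: (('chosen' + ' ' + 'beloved') + ' ' + 'dawn')
reduced = reduce(concat_space, words)

# str.join builds the same string in a single call
joined = ' '.join(words)

print(reduced)  # chosen beloved dawn
```

One difference to keep in mind: `reduce()` without an initial value raises a `TypeError` on an empty list, whereas `' '.join([])` returns an empty string.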

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a dataframe and transform all cells.

To do this, we will connect to Ironhack's database and retrieve the data from the *pollution* database. Select the *beijing_pollution* table and retrieve its data.

In [30]:
# your code here
dataset = pd.read_csv('../data/PRSA_data_2010.1.1-2014.12.31.csv')
dataset

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43819,43820,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
43820,43821,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
43821,43822,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
43822,43823,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


Let's look at the data using the `head()` function.

In [9]:
# your code here
fragment = dataset.head(10)
fragment

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0
5,6,2010,1,1,5,,-19,-10.0,1017.0,NW,16.1,0,0
6,7,2010,1,1,6,,-19,-9.0,1017.0,NW,19.23,0,0
7,8,2010,1,1,7,,-19,-9.0,1017.0,NW,21.02,0,0
8,9,2010,1,1,8,,-19,-9.0,1017.0,NW,24.15,0,0
9,10,2010,1,1,9,,-20,-8.0,1017.0,NW,27.28,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [4]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    # your code here
    return x/24

In [10]:
hourly(48)

2.0

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [31]:
# your code here
pm25_hourly = dataset.copy()   # Copy so that the original dataset is not modified
cols = ['Iws', 'Is', 'Ir']
pm25_hourly[cols] = pm25_hourly[cols].apply(hourly)   # apply() runs hourly on each selected column

fragment = pm25_hourly.head(150)
fragment

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,0.074583,0.0,0.0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,0.205000,0.0,0.0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,0.279583,0.0,0.0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,0.410000,0.0,0.0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,0.540417,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,146,2010,1,7,1,130.0,-21,-16.0,1035.0,cv,0.018750,0.0,0.0
146,147,2010,1,7,2,43.0,-22,-18.0,1036.0,cv,0.055833,0.0,0.0
147,148,2010,1,7,3,37.0,-23,-15.0,1036.0,NW,0.167500,0.0,0.0
148,149,2010,1,7,4,30.0,-24,-16.0,1035.0,NW,0.297917,0.0,0.0
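The difference between assigning a dataframe to a new name and copying it can be sketched on a toy dataframe. The values below are made up, with column names borrowed from the dataset. Plain assignment only creates a second name for the same object, while `.copy()` gives an independent dataframe that can be transformed safely.

```python
import pandas as pd

def hourly(x):
    # Divide a value (or a whole column) by 24
    return x / 24

# Toy data with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    'Iws': [24.0, 48.0],
    'Is': [0.0, 24.0],
    'Ir': [12.0, 0.0],
    'TEMP': [-11.0, -12.0],
})

cols = ['Iws', 'Is', 'Ir']
toy_hourly = toy.copy()                            # independent copy of the data
toy_hourly[cols] = toy_hourly[cols].apply(hourly)  # transform only the selected columns

print(toy_hourly['Iws'].tolist())  # [1.0, 2.0]
print(toy['Iws'].tolist())         # [24.0, 48.0], the original is unchanged
```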


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [27]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # your code here
    series = pd.Series(x)   # Convert the input to a pandas Series

    return np.std(series)/(series.count()-1)   # Standard deviation divided by (number of elements - 1)

In [28]:
sample = [1,2,3,4]
sample_sd(sample)

0.37267799624996495
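The last step described above, applying the aggregate to a select group of columns, could look like the following sketch. It uses a toy dataframe with made-up values and column names, and it calls `to_numpy()` before `np.std` so that NumPy's default population standard deviation (`ddof=0`) is used explicitly. That choice is an assumption of this sketch, made to match the docstring example.

```python
import numpy as np
import pandas as pd

def sample_sd(x):
    # Population standard deviation (NumPy default, ddof=0)
    # divided by the number of non-null elements minus 1
    return np.std(x.to_numpy()) / (x.count() - 1)

# Toy dataframe; columns 'a', 'b', 'c' are illustrative only
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [0, 0, 0, 0]})

# DataFrame.apply calls sample_sd once per selected column and returns a Series
result = toy[['a', 'b']].apply(sample_sd)

print(result)
```

For column `a` this reproduces the docstring example (about 0.3727), and column `b`, whose values are doubled, gives twice that value.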