# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [2]:
# import reduce from functools, numpy and pandas

from functools import reduce
import numpy as np
import pandas as pd


# Challenge 1 - Mapping

#### We will use the map function to clean up the words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [3]:
# Run this code:

location = '../58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [4]:
# Your code here:

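print('Number of Words: ', len(prophet), '\n')
# print('Elements to be removed: ', prophet[0:567], '\n')
# Note: a slice stop is exclusive, so [0:567] removes elements 0 through 566
# (567 words); removing elements 0 through 567 would need prophet[0:568].
del prophet[0:567]
print('New Number of Words:', len(prophet), '(-567 Words)')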

Number of Words:  13637 

New Number of Words: 13070 (-567 Words)
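As a quick check on slice bounds, here is a minimal sketch on a toy list (the names `words`, `kept_by_slice`, and `trimmed` are illustrative and not part of the exercise). It shows that deleting a leading slice and keeping the remainder give the same result, and that the stop index is exclusive.

```python
# Toy list standing in for the book; ten placeholder "words".
words = [f"w{i}" for i in range(10)]

# Option 1: keep everything from index 3 onward (the original list is untouched).
kept_by_slice = words[3:]

# Option 2: delete indices 0, 1 and 2 in place. The stop index (3) is exclusive.
trimmed = words.copy()
del trimmed[0:3]

print(kept_by_slice == trimmed)    # True
print(len(words) - len(trimmed))   # 3
```

By the same rule, removing elements 0 through 567 inclusive corresponds to a stop index of 568.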


If you look through the words, you will find that many of them have a reference attached. For example, let's look at the first 10 words.

In [5]:
# Your code here:

prophet[:10]


['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [6]:
sample = 'Farewell................92\n\n\n\n\nTHE'

def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # Your code here:
    
    # Keep only the text before the first '{'. An earlier attempt based on
    # str.isalpha() returned None for some words, such as the first element
    # of the list above ('Farewell................92\n\n\n\n\nTHE').
    return x.split('{')[0]
        
print(reference(sample))

Farewell................92




THE
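The same "keep what comes before `{`" idea can also be sketched with `str.partition`, which always returns a three-part tuple and therefore needs no indexing. The helper name `strip_reference` is hypothetical and only used for illustration.

```python
def strip_reference(word):
    """Return the part of `word` that precedes the first '{' (hypothetical helper)."""
    # partition() returns (before, separator, after); if '{' is absent,
    # `before` is the whole string and the other two parts are empty.
    head, _, _ = word.partition('{')
    return head

print(strip_reference('the{7}'))   # the
print(strip_reference('chosen'))   # chosen
```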


Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Store the resulting list in a new variable called `prophet_reference`.

In [9]:
# Your code here:

prophet_reference = list(map(reference, prophet))
prophet_reference[:30]


['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own\nday,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city\nof',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to\nreturn',
 'and',
 'bear',
 'him']
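One detail worth keeping in mind is why the result is wrapped in `list()`: in Python 3, `map()` returns a lazy iterator that can be consumed only once. The short sketch below uses a made-up three-word list to show this.

```python
words = ['the{7}', 'chosen', 'and{12}']   # made-up sample, not taken from the book

lazy = map(lambda w: w.split('{')[0], words)
print(type(lazy).__name__)   # map

cleaned = list(lazy)         # consuming the iterator materializes the results
print(cleaned)               # ['the', 'chosen', 'and']

print(list(lazy))            # [] because the iterator is already exhausted
```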

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [10]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # Your code here:
    
    return x.split('\n')
    
sample = 'Farewell................92\n\n\n\n\nTHE'

line_break(sample)

['Farewell................92', '', '', '', '', 'THE']
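The empty strings in this output come from consecutive `\n` characters. As a point of comparison (not what the exercise asks for), calling `split()` with no argument splits on any run of whitespace and drops the empty pieces:

```python
sample = 'Farewell................92\n\n\n\n\nTHE'

on_newline = sample.split('\n')   # one piece per '\n', empty strings included
on_whitespace = sample.split()    # runs of whitespace collapse into one separator

print(on_newline)      # ['Farewell................92', '', '', '', '', 'THE']
print(on_whitespace)   # ['Farewell................92', 'THE']
```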

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [11]:
# Your code here:

prophet_line = list(map(line_break, prophet_reference))
prophet_line[:30]

[['Farewell................92', '', '', '', '', 'THE'],
 ['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn'],
 ['unto'],
 ['his'],
 ['own', 'day,'],
 ['had'],
 ['waited'],
 ['twelve'],
 ['years'],
 ['in'],
 ['the'],
 ['city', 'of'],
 ['Orphalese'],
 ['for'],
 ['his'],
 ['ship'],
 ['that'],
 ['was'],
 ['to', 'return'],
 ['and'],
 ['bear'],
 ['him']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [13]:
# Your code here:

prophet_flat = [element for lst in prophet_line for element in lst if element != '']
prophet_flat[:30]


['Farewell................92',
 'THE',
 'PROPHET',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city',
 'of',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that']
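For comparison, the same flattening can be sketched with `itertools.chain.from_iterable` from the standard library. The nested list below is a small made-up sample, and empty strings are dropped just as in the cell above.

```python
from itertools import chain

# Small made-up sample shaped like prophet_line.
nested = [['Farewell', '', 'THE'], ['the', 'beloved,'], ['who']]

# chain.from_iterable walks each inner list in order; the condition drops ''.
flat = [word for word in chain.from_iterable(nested) if word != '']

print(flat)   # ['Farewell', 'THE', 'the', 'beloved,', 'who']
```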

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [20]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list and False if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # Your code here:
    # A list comprehension was tried first, but it would not work here because
    # the next cell passes this function to filter(), which needs a per-word test.
    return x not in word_list

print(word_filter('Hola'), word_filter('and'))

True False


Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [21]:
prophet_filter = list(filter(word_filter, prophet_flat))  # The capitalized 'THE' is kept: the filter is case sensitive, and it belongs to the book title
print('Filtered List:', prophet_filter[:30], 'there are', len(prophet_filter), 'words')

Filtered List: ['Farewell................92', 'THE', 'PROPHET', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to', 'return', 'bear', 'him'] there are 13529 words
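A minimal sketch of the same predicate-plus-`filter()` pattern, using a `frozenset` for the stop words so that each membership test does not scan a list. The names `STOP_WORDS` and `keep_word` and the sample list are assumptions made for illustration.

```python
# Hypothetical names; the four words match the list used in word_filter.
STOP_WORDS = frozenset({'and', 'the', 'a', 'an'})

def keep_word(word):
    """Return True when `word` should be kept (exact, case-sensitive match)."""
    return word not in STOP_WORDS

sample = ['THE', 'the', 'chosen', 'and', 'beloved,']
print(list(filter(keep_word, sample)))   # ['THE', 'chosen', 'beloved,']
```

`filter()` keeps the items for which the predicate returns a truthy value, which is why returning a plain boolean is enough.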


In [25]:
# First idea for this step, kept as an alternative: it filters a whole list at once
# instead of testing a single word as the exercise asks.

def word_filter_efficient(x):
    word_list = ['and', 'the', 'a', 'an']
    return list(filter(lambda word: word not in word_list, x))

prophet_filter2 = word_filter_efficient(prophet_flat)
print('Flatten List:', prophet_flat[:30], len(prophet_flat), 'words \n')
print('Filtered List: ', prophet_filter2[:30], 'there are', len(prophet_filter2), 'words')

Flatten List: ['Farewell................92', 'THE', 'PROPHET', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'the', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that'] 15299 words 

Filtered List:  ['Farewell................92', 'THE', 'PROPHET', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to', 'return', 'bear', 'him'] there are 13529 words


# Bonus Challenge - Part 1

Rewrite the `word_filter` function above so that it is case insensitive.

In [26]:
def word_filter_case(x):
    # Note: unlike word_filter, this version takes the whole list of words
    # and returns the filtered list.
    word_list = ['and', 'the', 'a', 'an']
    
    # Your code here:
    
    return list(filter(lambda word: word.lower() not in word_list, x))

In [27]:
new_filtered_list = word_filter_case(prophet_flat)
print('New Filtered List:', new_filtered_list[:30], 'containing %s words' %len(new_filtered_list))

New Filtered List: ['Farewell................92', 'PROPHET', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to', 'return', 'bear', 'him', 'back'] containing 13235 words
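Since the bonus asks for a rewrite of `word_filter` itself, here is a hedged sketch of a word-level variant that can be passed straight to `filter()`. The name `keep_word_case` and the sample words are made up for illustration.

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}   # same four words as in word_filter

def keep_word_case(word):
    """Return False for stop words regardless of capitalization, True otherwise."""
    # Lower-casing only the comparison key leaves the kept words unchanged.
    return word.lower() not in STOP_WORDS

sample = ['THE', 'And', 'PROPHET', 'a', 'dawn']
print(list(filter(keep_word_case, sample)))   # ['PROPHET', 'dawn']
```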


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [35]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: The two strings joined into one, separated by a single space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''

    # Your code here:
    
    return ' '.join([a, b])  # alternative: a + ' ' + b

concat_space('hola', 'John')


'hola John'

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [40]:
# Your code here:

prophet_string = reduce(concat_space, prophet_filter)
prophet_string[:1500]


'Farewell................92 THE PROPHET |Almustafa, chosen beloved, who was dawn unto his own day, had waited twelve years in city of Orphalese for his ship that was to return bear him back to isle of his birth. And in twelfth year, on seventh day of Ielool, month of reaping, he climbed hill without city walls looked seaward; he beheld his ship coming with mist. Then gates of his heart were flung open, his joy flew far over sea. And he closed his eyes prayed in silences of his soul. ***** But as he descended hill, sadness came upon him, he thought in his heart: How shall I go in peace without sorrow? Nay, not without wound in spirit shall I leave this city. days of pain I have spent within its walls, long were nights of aloneness; who can depart from his pain his aloneness without regret? Too many fragments of spirit have I scattered in these streets, too many are children of my longing that walk naked among these hills, I cannot withdraw from them without burden ache. It is not garmen
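To make the reduction concrete, the sketch below folds a four-word list step by step and compares the result with `' '.join`, which produces the same string for a non-empty list. The word list is a made-up sample.

```python
from functools import reduce

def concat_space(a, b):
    # Same behavior as the function defined above.
    return a + ' ' + b

words = ['who', 'was', 'a', 'dawn']   # made-up sample

# reduce computes concat_space(concat_space(concat_space('who', 'was'), 'a'), 'dawn')
via_reduce = reduce(concat_space, words)
via_join = ' '.join(words)

print(via_reduce)               # who was a dawn
print(via_reduce == via_join)   # True
```

One difference to keep in mind: `reduce` without an initial value raises a `TypeError` on an empty list, while `' '.join([])` returns an empty string.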

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a dataframe with `apply` and transform all cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [41]:
# Run this code:

# The dataset below contains information about pollution from PM2.5 particles in Beijing 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the `head()` function.

In [42]:
# Your code here:

pm25.head()


Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [50]:
def hourly(x):
    '''
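    Input: A numerical value
    Output: The value divided by 24, rounded to the nearest integer
        
    Example:
    Input: 48
    Output: 2
    '''
    
    # Your code here:
    
    # round() is applied on top of the division the exercise asks for,
    # so small values such as 1.79 / 24 become 0.
    return round(x/24)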

print(hourly(72), hourly(94), hourly(24))


3 4 1


Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [59]:
# Your code here:

pm25_hourly = pm25.apply(lambda column: hourly(column) if column.name in ['Iws', 'Is', 'Ir'] else column)  # use the .name attribute to identify each column passed in by apply
pm25_hourly.head(-1)


Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,0.0,0.0,0.0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,0.0,0.0,0.0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,0.0,0.0,0.0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,0.0,0.0,0.0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43818,43819,2014,12,31,18,10.0,-22,-2.0,1033.0,NW,9.0,0.0,0.0
43819,43820,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,10.0,0.0,0.0
43820,43821,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,10.0,0.0,0.0
43821,43822,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,10.0,0.0,0.0
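For reference, here is a sketch of the division without rounding, applied only to the selected columns of a small made-up DataFrame. The values and the names `toy`, `cols`, and `toy_hourly` are assumptions for illustration, not rows from the Beijing dataset.

```python
import pandas as pd

# Made-up values that only mimic the column names of the dataset.
toy = pd.DataFrame({
    'TEMP': [-11.0, -12.0],
    'Iws': [24.0, 48.0],
    'Is': [0.0, 12.0],
    'Ir': [0.0, 0.0],
})

cols = ['Iws', 'Is', 'Ir']

toy_hourly = toy.copy()
# Selecting the columns first means apply() only ever sees the ones to transform.
toy_hourly[cols] = toy[cols].apply(lambda col: col / 24)

print(toy_hourly['Iws'].tolist())    # [1.0, 2.0]
print(toy_hourly['Is'].tolist())     # [0.0, 0.5]
print(toy_hourly['TEMP'].tolist())   # [-11.0, -12.0]
```

Selecting the columns up front avoids the `column.name` check inside the lambda; both approaches leave the other columns untouched.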


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [76]:
def sample_sd(x):
    '''
    Input: A pandas Series of values
    Output: The standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # Your code here:
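    array = pd.Series(x)  # convert the input into a pandas Series
    # np.count_nonzero counts the nonzero entries, not all entries, so zeros in the
    # data shrink the denominator; Series.count() is the count the exercise suggests
    return np.std(array) / (np.count_nonzero(array)-1)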

sample_sd([1,2,3,4])
    

0.37267799624996495

In [124]:
# sample_sd returns one scalar per selected column, so that value is repeated in every row
pm25_std = pm25.apply(lambda column: sample_sd(column) if column.name in ['TEMP', 'PRES', 'Iws'] else column)

print(pm25_std[['TEMP', 'PRES', 'Iws']].mean())

pm25_std.head(10)

TEMP    0.000286
PRES    0.000234
Iws     0.001141
dtype: float64


Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,0.000286,0.000234,NW,0.001141,0,0
1,2,2010,1,1,1,,-21,0.000286,0.000234,NW,0.001141,0,0
2,3,2010,1,1,2,,-21,0.000286,0.000234,NW,0.001141,0,0
3,4,2010,1,1,3,,-21,0.000286,0.000234,NW,0.001141,0,0
4,5,2010,1,1,4,,-20,0.000286,0.000234,NW,0.001141,0,0
5,6,2010,1,1,5,,-19,0.000286,0.000234,NW,0.001141,0,0
6,7,2010,1,1,6,,-19,0.000286,0.000234,NW,0.001141,0,0
7,8,2010,1,1,7,,-19,0.000286,0.000234,NW,0.001141,0,0
8,9,2010,1,1,8,,-19,0.000286,0.000234,NW,0.001141,0,0
9,10,2010,1,1,9,,-20,0.000286,0.000234,NW,0.001141,0,0
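As a hedged alternative, the sketch below uses `count()` for the denominator, as the exercise text suggests, and applies the function only to the selected columns so that the result is one value per column. The function name `sd_over_n_minus_1` and the toy data are made up for illustration; the `TEMP` column deliberately contains a zero.

```python
import numpy as np
import pandas as pd

def sd_over_n_minus_1(col):
    """Population standard deviation (ddof=0) divided by (non-null count - 1)."""
    values = col.dropna().to_numpy(dtype=float)   # ignore missing values
    return np.std(values) / (col.count() - 1)     # count() excludes NaN but keeps zeros

# Made-up data; both columns have the same spread.
toy = pd.DataFrame({
    'TEMP': [0.0, 1.0, 2.0, 3.0],
    'PRES': [1.0, 2.0, 3.0, 4.0],
})

result = toy[['TEMP', 'PRES']].apply(sd_over_n_minus_1)
print(result)   # a Series indexed by column name, about 0.372678 for both columns
```

With `np.count_nonzero` in the denominator, the `TEMP` column here would be divided by 2 instead of 3, because its zero would not be counted.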
