# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [50]:
# Import reduce from functools, numpy and pandas
import numpy as np
import pandas as pd
from functools import reduce  # call it as reduce(); with a plain "import functools", call functools.reduce() instead

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.
Data are available in the data folder.

In [32]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [33]:
# your code here
prophet_short = prophet[568:]
print("a: \n", prophet_short[-1])
print("check a: \n", prophet[-1])
print("b: \n", prophet_short[0])
print("check b: \n", prophet[568])


a: 
 eBooks.


check a: 
 eBooks.


b: 
 PROPHET

|Almustafa,
check b: 
 PROPHET

|Almustafa,


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at the first 11 elements (indices 0 through 10).

In [34]:
# your code here
prophet_short[0:11]

['PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [35]:
# try the logic in a for loop on a small test list first
test = ['hello,', 'I', 'am', 'a', 'three{4}', 'bye']

for i in test:
    if '{' in i:
        a = i.split("{")
        print(a[0], type(a[0]))
    else:
        print(i)


hello,
I
am
a
three <class 'str'>
bye


In [36]:
#make a function out of the for loop
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    if '{' in x:
        a = x.split("{")
        return a[0]
    else:
        return x
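
A more compact variant, offered only as a sketch: `str.split` returns the whole string as a one-element list when the separator is absent, so the `if`/`else` is not strictly needed. The name `strip_reference` is hypothetical and not part of the exercise.

```python
def strip_reference(word):
    """Return the part of `word` before the first '{' (hypothetical helper)."""
    # maxsplit=1 stops after the first '{'; index 0 is the text before it.
    # If there is no '{', split returns [word], so index 0 is the word itself.
    return word.split('{', 1)[0]

sample = ['hello,', 'I', 'am', 'a', 'three{4}', 'bye']
cleaned = list(map(strip_reference, sample))
print(cleaned)  # ['hello,', 'I', 'am', 'a', 'three', 'bye']
```

This behaves the same as `reference` on the test list; whether it reads better is a matter of taste.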

Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Convert the result to a list and assign it to a new variable called `prophet_reference`.

In [37]:
# your code here

test2 = ['hello,', 'I', 'am', 'a', 'three{4}', 'bye']

print(list(map(reference, test2)))

prophet_reference = list(map(reference, prophet_short))
print(prophet_reference[0:12])


['hello,', 'I', 'am', 'a', 'three', 'bye']
['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his', 'own\nday,']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [38]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    if '\n' in x:
        return x.split("\n") #this returns a list
    else:
        return [x] #this needs to be a list for the ease of flattening later..

In [39]:
#test
x = 'the\nbeloved'
line_break(x)

['the', 'beloved']
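
As a side note, `str.split` already returns a one-element list when the separator does not occur, so the explicit branch is optional. A minimal sketch, where `split_lines` is a hypothetical name used only for illustration:

```python
def split_lines(word):
    """Split on the newline character; the result is always a list (hypothetical helper)."""
    return word.split('\n')

print(split_lines('the\nbeloved'))  # ['the', 'beloved']
print(split_lines('chosen'))        # ['chosen']
```

Consecutive line breaks produce empty strings, which is where the `''` entries in the later output come from.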

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [40]:
# your code here
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line[0:12])


[['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,'], ['who'], ['was'], ['a'], ['dawn'], ['unto'], ['his'], ['own', 'day,']]


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [41]:
# your code here
prophet_flat = [j for i in prophet_line for j in i]  # unpacks each inner list into its individual words

# variant that also drops empty strings:
# prophet_flat5 = [y for x in prophet_line for y in x if y != '']
prophet_flat[0:12]

['PROPHET',
 '',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn']
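
For comparison, the standard library offers `itertools.chain.from_iterable` for the same one-level flattening. The sketch below uses a short made-up list named `nested` rather than the full `prophet_line`:

```python
from itertools import chain

# Small stand-in for prophet_line (first three elements only).
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['chosen']]

# Nested comprehension: outer loop over sublists, inner loop over their words.
flat_comprehension = [word for sublist in nested for word in sublist]

# chain.from_iterable walks the sublists lazily; list() materialises the result.
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)  # ['PROPHET', '', '|Almustafa,', 'the', 'chosen']
```

Both approaches flatten exactly one level of nesting.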

In [42]:
# When a list mixes plain strings and nested lists, you can still flatten it,
# but the comprehension needs a conditional expression.

test_string = ['hello', 'as', ['a', 'test'], 'I', ['enter', 'some', 'words', 'here'], 'bye']

test_flat2 = [i for i in test_string if isinstance(i, str)]  # keeps only the top-level strings (drops the nested lists)
test_flat3 = [j for i in test_string if isinstance(i, list) for j in i]  # keeps only the words from the nested lists (unpacked)

print(test_flat2)
print(test_flat3)

test_flat4 = [j for i in test_string for j in (i if isinstance(i, list) else (i,))]

print(test_flat4)

['hello', 'as', 'I', 'bye']
['a', 'test', 'enter', 'some', 'words', 'here']
['hello', 'as', 'a', 'test', 'I', 'enter', 'some', 'words', 'here', 'bye']


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [43]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return x not in word_list


test=["hello", "a", "friend", "and", "a", "foe"]
print("using function", list(filter(word_filter, test)))

word_list_lambda = ['and', 'the', 'a', 'an']
print("using lambda function", list(filter(lambda x: x not in word_list_lambda, test)))

using function ['hello', 'friend', 'foe']
using lambda function ['hello', 'friend', 'foe']
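
The same filtering can also be written as a list comprehension with a condition. The sketch below additionally stores the words in a set, which makes each membership test a hash lookup; `STOP_WORDS` and `keep_word` are hypothetical names.

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}  # hypothetical name; same words as word_list

def keep_word(word):
    """Return True when `word` should stay in the corpus."""
    return word not in STOP_WORDS

words = ["hello", "a", "friend", "and", "a", "foe"]

with_filter = list(filter(keep_word, words))
with_comprehension = [w for w in words if keep_word(w)]

print(with_filter)         # ['hello', 'friend', 'foe']
print(with_comprehension)  # ['hello', 'friend', 'foe']
```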


Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [44]:
prophet_filter = list(filter(word_filter, prophet_flat))
prophet_filter[0:12]

['PROPHET',
 '',
 '|Almustafa,',
 'chosen',
 'beloved,',
 'who',
 'was',
 'dawn',
 'unto',
 'his',
 'own',
 'day,']

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [45]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an', ''] #EXTRA:filtered out empty string ''.
    
    # your code here
    return x.lower() not in word_list
    

test=["hello", "A", "friend", "aNd", "a", "foe"]
print("test using function", list(filter(word_filter_case, test)))
print("prophet using function", list(filter(word_filter_case, prophet_flat[0:12])))

prophet_filter = list(filter(word_filter_case, prophet_flat))

test using function ['hello', 'friend', 'foe']
prophet using function ['PROPHET', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn']


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [46]:
prophet_filter[0:12] #list of strings

['PROPHET',
 '|Almustafa,',
 'chosen',
 'beloved,',
 'who',
 'was',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had']

In [47]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: The two strings joined by a single space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here
    return a + ' ' + b

#test block:
# a = 'hello'
# b = 'world'

# concat_space(a, b)

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [52]:
# your code here
prophet_string = reduce(concat_space, prophet_filter)
print(prophet_string[0:49])

prophet_string_lambda = reduce(lambda a,b: a + ' ' + b, prophet_filter)
print(prophet_string_lambda[0:49])

PROPHET |Almustafa, chosen beloved, who was dawn 
PROPHET |Almustafa, chosen beloved, who was dawn 
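
Outside of an exercise about `reduce()`, the usual way to glue strings together is `str.join`. The sketch below, on a short made-up list called `words`, shows that both give the same string:

```python
from functools import reduce

words = ['PROPHET', '|Almustafa,', 'chosen', 'beloved,']

# reduce applies the function pairwise, left to right:
# ((w0 + ' ' + w1) + ' ' + w2) + ' ' + w3
joined_reduce = reduce(lambda a, b: a + ' ' + b, words)

# join builds the result in a single pass over the list.
joined_join = ' '.join(words)

print(joined_reduce == joined_join)  # True
```

One difference worth knowing: `reduce` without an initial value raises a `TypeError` on an empty list, while `' '.join([])` returns an empty string.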


# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a dataframe and transform all cells.

To do this, we will use the *pollution* dataset from Ironhack's database, a copy of which you can find in the data folder. Access the file via a relative path and load it as a dataframe.

In [None]:
# your code here
pollution = pd.read_csv('../data/pollution.csv')

pollution.info()


Let's look at the data using the `head()` function.

In [None]:
# your code here
pollution.head()

The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [None]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    # your code here
    return x/24

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [None]:
# your code here
#pollution_num = pollution._get_numeric_data()
pm25_hourly = pollution[['Iws', 'Is', 'Ir']].apply(hourly)

print(pm25_hourly.head(), '\n\n')
print(pm25_hourly.info())
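
Note that `DataFrame.apply` passes each column to the function as a whole Series by default, and dividing a Series by 24 divides every value in it. The sketch below checks this on a tiny made-up dataframe called `toy` (the values are not from the pollution data) and compares it with dividing the dataframe directly:

```python
import pandas as pd

# Hypothetical stand-in for the pollution data; the values are made up.
toy = pd.DataFrame({'Iws': [24.0, 48.0], 'Is': [0.0, 12.0], 'Ir': [6.0, 72.0]})

def hourly(x):
    """Divide a value (or a whole Series) by 24."""
    return x / 24

via_apply = toy[['Iws', 'Is', 'Ir']].apply(hourly)  # hourly receives one column at a time
via_division = toy[['Iws', 'Is', 'Ir']] / 24        # element-wise division of the dataframe

print(via_apply.equals(via_division))  # True
```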

#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by (the number of values in the column minus 1). Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [None]:
#using len anyway
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by (the number of elements in the series minus 1)
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # your code here
    return np.std(x)/(len(x)-1)  # len() also works here, but it counts NaN entries; count() only counts non-missing values

#Using count
def sample_sd_2(x):
    return np.std(x)/(x.count()-1)
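
On the `len()` versus `count()` question: the two agree only when a column has no missing values, because `len()` counts every row while `count()` skips `NaN`. The sketch below uses made-up Series to show the difference and to reproduce the docstring example; converting to a NumPy array with `to_numpy()` is a choice made here so that NumPy's default population standard deviation (`ddof=0`) is used explicitly.

```python
import numpy as np
import pandas as pd

# Made-up Series with one missing value.
s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(len(s))     # 4
print(s.count())  # 3

# Docstring example: population standard deviation divided by (n - 1).
values = pd.Series([1, 2, 3, 4])
result = np.std(values.to_numpy()) / (values.count() - 1)
print(round(result, 10))  # 0.3726779962
```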

In [None]:
sample_sd_2(pollution[['TEMP', 'Iws', 'Is', 'Ir']])

In [None]:
sample_sd(pollution[['TEMP', 'Iws', 'Is', 'Ir']])

In [None]:
pollution[['TEMP', 'Iws', 'Is', 'Ir']].apply(sample_sd)