# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
#Import reduce from functools, numpy and pandas
from functools import reduce 
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up a words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [3]:
del prophet[0:567]

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [4]:
prophet[0:10]

['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [5]:
def removeReferences(string):
    return string.split("{")[0] #Split string in two parts and keep the first one
    
#Test
removeReferences("the{7}")

'the'

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`

In [6]:
#Without map() function (Guido likes that)
prophet_reference = [string.split("{")[0] for string in prophet]
    
#Test
prophet_reference[:10]

['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn']

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [7]:
def line_break(string):
    return string.split("\n") #Split strings with '\n' in two parts
    
#Test
line_break("own\nday")    

['own', 'day']

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [8]:
#Without map() function
prophet_line = [line_break(string) for string in prophet_reference]

#Test
prophet_line[:10]

[['Farewell................92', '', '', '', '', 'THE'],
 ['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [9]:
#Go through each sub-item and add it to the new list
prophet_flat = [subitem for item in prophet_line for subitem in item]
    
#Test
prophet_flat[:10]

['Farewell................92',
 '',
 '',
 '',
 '',
 'THE',
 'PROPHET',
 '',
 '|Almustafa,',
 'the']

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [10]:
def word_filter(word):
    '''
    Input: A string
    Output: true if the word is not in the specified list and false if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    if word in word_list:
        return False
    else:
        return True

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [11]:
prophet_filter = filter(word_filter, prophet_flat)

#Test
for word in prophet_filter:
    print(word)

Farewell................92




THE
PROPHET

|Almustafa,
chosen
beloved,
who
was
dawn
unto
his
own
day,
had
waited
twelve
years
in
city
of
Orphalese
for
his
ship
that
was
to
return
bear
him
back
to
isle
of
his
birth.

And
in
twelfth
year,
on
seventh
day
of
Ielool,
month
of
reaping,
he
climbed
hill
without
city
walls
looked
seaward;
he
beheld
his
ship
coming
with
mist.

Then
gates
of
his
heart
were
flung
open,
his
joy
flew
far
over
sea.
And
he
closed
his
eyes
prayed
in
silences
of
his
soul.

*****

But
as
he
descended
hill,
sadness
came
upon
him,
he
thought
in
his
heart:

How
shall
I
go
in
peace
without
sorrow?
Nay,
not
without
wound
in
spirit
shall
I
leave
this
city.

days
of
pain
I
have
spent
within
its
walls,
long
were
nights
of
aloneness;
who
can
depart
from
his
pain
his
aloneness
without
regret?

Too
many
fragments
of
spirit
have
I
scattered
in
these
streets,
too
many
are
children
of
my
longing
that
walk
naked
among
these
hills,
I
cannot
withdraw
from
them
without
burden
ache.

It
i

house.

It
is
to
sow
seeds
with
tenderness
reap
harvest
with
joy,
even
as
if
your
beloved
were
to
eat
fruit.

It
is
to
charge
all
things
you
fashion
with
breath
of
your
own
spirit,

And
to
know
that
all
blessed
dead
are
standing
about
you
watching.

Often
have
I
heard
you
say,
as
if
speaking
in
sleep,
“He
who
works
in
marble,
finds
shape
of
his
own
soul
in
stone,
is
nobler
than
he
who
ploughs
soil.

he
who
seizes
rainbow
to
lay
it
on
cloth
in
likeness
of
man,
is
more
than
he
who
makes
sandals
for
our
feet.”

But
I
say,
not
in
sleep
but
in
overwakefulness
of
noontide,
that
wind
speaks
not
more
sweetly
to
giant
oaks
than
to
least
of
all
blades
of
grass;

And
he
alone
is
great
who
turns
voice
of
wind
into
song
made
sweeter
by
his
own
loving.

*****

Work
is
love
made
visible.

And
if
you
cannot
work
with
love
but
only
with
distaste,
it
is
better
that
you
should
leave
your
work
sit
at
gate
of
temple
take
alms
of
those
who
work
with
joy.

For
if
you
bake
bread
with
indifference,
you
bake
bi

your
heart’s
knowledge.

You
would
know
in
words
that
which
you
have
always
known
in
thought.

You
would
touch
with
your
fingers
naked
body
of
your
dreams.

And
it
is
well
you
should.

The
hidden
well-spring
of
your
soul
must
needs
rise
run
murmuring
to
sea;

And
treasure
of
your
infinite
depths
would
be
revealed
to
your
eyes.

But
let
there
be
no
scales
to
weigh
your
unknown
treasure;

And
seek
not
depths
of
your

with
staff
or
sounding
line.

For
self
is
sea
boundless
measureless.

*****

Say
not,
“I
have
found
truth,”
but
rather,
“I
have
found
truth.”

Say
not,
“I
have
found
path
of
soul.”
Say
rather,
“I
have
met
soul
walking
upon
my
path.”

For
soul
walks
upon
all
paths.

The
soul
walks
not
upon
line,
neither
does
it
grow
like
reed.

The
soul
unfolds
itself,
like
lotus
of
countless
petals.

[Illustration:
0083]

*****
*****


said
teacher,
Speak
to
us
of
_Teaching_.

And
he
said:

“No
man
can
reveal
to
you
aught
but
that
which
already
lies
half
asleep
in
dawning
of
your
knowledge.


about
you
you
shall
see
Him
playing
with
your
children.

And
look
into
space;
you
shall
see
Him
walking
in
cloud,
outstretching
His
arms
in
lightning
descending
in
rain.

You
shall
see
Him
smiling
in
flowers,
then
rising
waving
His
hands
in
trees.

*****
*****


Almitra
spoke,
saying,
We
would
ask
now
of
_Death_.

And
he
said:

You
would
know
secret
of
death.

But
how
shall
you
find
it
unless
you
seek
it
in
heart
of
life?

The
owl
whose
night-bound
eyes
are
blind
unto
day
cannot
unveil
mystery
of
light.

If
you
would
indeed
behold
spirit
of
death,
open
your
heart
wide
unto
body
of
life.

For
life
death
are
one,
even
as
river
sea
are
one.

In
depth
of
your
hopes
desires
lies
your
silent
knowledge
of
beyond;

And
like
seeds
dreaming
beneath
snow
your
heart
dreams
of
spring.

Trust
dreams,
for
in
them
is
hidden
gate
to
eternity.

fear
of
death
is
but
trembling
of
shepherd
when
he
stands
before
king
whose
hand
is
to
be
laid
upon
him
in
honour.

Is
shepherd
not
joyful
beneath
his
trembling,

ebook.

1.E.2.
If
individual
Project
Gutenberg-tm
electronic
work
is
derived
from
texts
not
protected
by
U.S.
copyright
law
(does
not
contain
notice
indicating
that
it
is
posted
with
permission
of
copyright
holder),
work
can
be
copied
distributed
to
anyone
in
United
States
without
paying
any
fees
or
charges.
If
you
are
redistributing
or
providing
access
to
work
with
phrase
"Project
Gutenberg"
associated
with
or
appearing
on
work,
you
must
comply
either
with
requirements
of
paragraphs
1.E.1
through
1.E.7
or
obtain
permission
for
use
of
work
Project
Gutenberg-tm
trademark
as
set
forth
in
paragraphs
1.E.8
or
1.E.9.

1.E.3.
If
individual
Project
Gutenberg-tm
electronic
work
is
posted
with
permission
of
copyright
holder,
your
use
distribution
must
comply
with
both
paragraphs
1.E.1
through
1.E.7
any
additional
terms
imposed
by
copyright
holder.
Additional
terms
will
be
linked
to
Project
Gutenberg-tm
License
for
all
works
posted
with
permission
of
copyright
holder
found
at
beginning
of
this
w

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [12]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    if word.lower() in word_list:
        return False
    else:
        return True

In [13]:
#Rewrite prophet_filter, now being case insensitive
prophet_filter = filter(word_filter_case, prophet_flat)

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [14]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    sentence = a + " " + b
    return(sentence)

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [15]:
prophet_string = reduce(concat_space, prophet_filter)

#Test
prophet_string[:200]

'Farewell................92     THE PROPHET  |Almustafa, the chosen and the beloved, who was a dawn unto his own day, had waited twelve years in the city of Orphalese for his ship that was to return an'

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to use the apply function to a dataframe and transform all cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [16]:
# The dataset below contains information about pollution from PM2.5 particles in Beijing 
# Run this code:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the `head()` function.

In [17]:
pm25.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [18]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    return x/24

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [19]:
# Your code here:
pm25_hourly = pm25.copy()

pm25_hourly[["Iws"]] = pm25_hourly[["Iws"]].apply(hourly)
pm25_hourly[["Is"]] = pm25_hourly[["Is"]].apply(hourly)
pm25_hourly[["Ir"]] = pm25_hourly[["Ir"]].apply(hourly)

#Test
pm25_hourly.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,0.074583,0.0,0.0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,0.205,0.0,0.0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,0.279583,0.0,0.0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,0.41,0.0,0.0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,0.540417,0.0,0.0


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [20]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    numOfElements = pm25_hourly[["Iws"]].count()
    stDeviation = x.std() #Pandas function for standard deviation
    
    return stDeviation/numOfElements
    
#Test
result = sample_sd(pm25_hourly[["Iws"]])
print("Result of the function:", result)

Result of the function: Iws    0.000048
dtype: float64
