# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [22]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [27]:
# Run this code:

location = '58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [29]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [31]:
# Keep elements 568 through the end; a plain slice already returns a new list
prophet = prophet[568:]
len(prophet)

13069
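
The two approaches mentioned above (removing elements 0 through 567, or keeping elements 568 onward) should give the same list. Here is a minimal sketch on a small made-up list, where a hypothetical `cutoff` of 3 plays the role of 568:

```python
# Toy stand-in for the book: 3 "front matter" words followed by the text itself.
words = ['title', 'author', 'license', 'the', 'chosen', 'and']
cutoff = 3  # hypothetical value; plays the role of 568 in the notebook

# Option 1: keep everything from the cutoff onward (builds a new list).
kept = words[cutoff:]

# Option 2: delete the leading elements in place, working on a copy.
trimmed = list(words)
del trimmed[:cutoff]

print(kept)             # ['the', 'chosen', 'and']
print(kept == trimmed)  # True
```

Slicing leaves the original list untouched, while `del` modifies the list it is applied to.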

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [35]:
for index, word in enumerate(prophet[:10], start=1):
    print(f"Word {index}: {word}")

Word 1: PROPHET

|Almustafa,
Word 2: the{7}
Word 3: chosen
Word 4: and
Word 5: the
beloved,
Word 6: who
Word 7: was
Word 8: a
Word 9: dawn
Word 10: unto


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [39]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    return x.split('{')[0]

In [41]:
print(reference('the{7}'))

the
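
It can help to check how this approach behaves on less tidy inputs. The sketch below repeats the same `split('{')` logic and runs it on a few hypothetical strings (they are not taken from the book): a word without a reference, a string that is only a reference, and a word with two references.

```python
def reference(x):
    # Same logic as the function above: keep the text before the first '{'.
    return x.split('{')[0]

# Hypothetical sample inputs chosen to exercise the edge cases.
samples = ['the{7}', 'chosen', '{12}', 'day,{3}{4}']

print([reference(s) for s in samples])
# ['the', 'chosen', '', 'day,']
```

A word with no `{` is returned unchanged, and a string that starts with `{` becomes an empty string.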


In [43]:
cleaned_words = list(map(reference, prophet[:10]))

print(cleaned_words)

['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']


Now that we have our function, use the `map()` function to apply it to every word in our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [47]:
prophet_reference = list(map(reference, prophet))

print(prophet_reference[:10])

['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [56]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    return x.split('\n')

In [58]:
print(line_break('the\nbeloved'))

['the', 'beloved']
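
One side effect of splitting on `\n` is that two consecutive line breaks produce an empty string. The short sketch below shows this on the first word of the text, and compares it with `str.split()` called without arguments, which splits on any run of whitespace. The comparison is only for illustration; the exercise itself uses `split('\n')`.

```python
word = 'PROPHET\n\n|Almustafa,'

# Splitting on '\n' yields an empty string between the two consecutive breaks.
print(word.split('\n'))  # ['PROPHET', '', '|Almustafa,']

# With no argument, split() treats any run of whitespace as one separator
# and never returns empty pieces.
print(word.split())      # ['PROPHET', '|Almustafa,']
```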


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [61]:
prophet_line = list(map(line_break, prophet_reference))

print(prophet_line[:10])

[['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,'], ['who'], ['was'], ['a'], ['dawn'], ['unto']]


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings, so `prophet_line` is now a list of lists. Flatten it using a list comprehension. Assign this new list to `prophet_flat`.

In [63]:
# Walk through each sub-list and collect its words into one flat list
prophet_flat = [i for sub in prophet_line for i in sub]
print(prophet_flat[:10])

['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,', 'who', 'was']
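
As a point of comparison, the standard library offers `itertools.chain.from_iterable`, which flattens one level of nesting in the same way as the list comprehension. A minimal sketch on a small sample (the `nested` list is a shortened, hypothetical version of `prophet_line`):

```python
from itertools import chain

# Shortened, hypothetical version of prophet_line.
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# The list comprehension used in the exercise.
flat_comprehension = [word for sub in nested for word in sub]

# The same flattening with itertools.
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)
# ['PROPHET', '', '|Almustafa,', 'the', 'the', 'beloved,']
print(flat_comprehension == flat_chain)  # True
```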


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [65]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [69]:
prophet_filter = list(filter(word_filter, prophet_flat))
print(prophet_filter[:10])
print(prophet_filter[:40])

['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his']
['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to', 'return', 'bear', 'him', 'back', 'to', 'isle', 'of', 'his', 'birth.', '', 'And', 'in', 'twelfth', 'year,']
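
To see exactly what the filter keeps and drops, the sketch below applies the same idea to a short hypothetical sample. It stores the words in a set (an assumption made here for faster membership tests, not something the exercise requires) and shows that `filter()` gives the same result as a list comprehension with a condition.

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}  # same words as word_list, stored as a set

def word_filter(x):
    # True means "keep this word".
    return x not in STOP_WORDS

# Hypothetical sample, including an empty string and a capitalized 'And'.
sample = ['PROPHET', '', 'the', 'chosen', 'and', 'the', 'beloved,', 'And']

with_filter = list(filter(word_filter, sample))
with_comprehension = [w for w in sample if word_filter(w)]

print(with_filter)
# ['PROPHET', '', 'chosen', 'beloved,', 'And']
print(with_filter == with_comprehension)  # True
```

Note that the empty string and the capitalized `'And'` both pass through, because neither is an exact match for a word in the list.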


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [96]:
def word_filter_case(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    (case-insensitive) and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'And'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    word_list = ['and', 'the', 'a', 'an']

    # The word list is already lowercase, so only the input needs lowering
    return x.lower() not in word_list

In [98]:
prophet_filter_case = list(filter(word_filter_case, prophet_flat))

print(prophet_filter_case[:100])

['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'in', 'city', 'of', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to', 'return', 'bear', 'him', 'back', 'to', 'isle', 'of', 'his', 'birth.', '', 'in', 'twelfth', 'year,', 'on', 'seventh', 'day', 'of', 'Ielool,', 'month', 'of', 'reaping,', 'he', 'climbed', 'hill', 'without', 'city', 'walls', 'looked', 'seaward;', 'he', 'beheld', 'his', 'ship', 'coming', 'with', 'mist.', '', 'Then', 'gates', 'of', 'his', 'heart', 'were', 'flung', 'open,', 'his', 'joy', 'flew', 'far', 'over', 'sea.', 'he', 'closed', 'his', 'eyes', 'prayed', 'in', 'silences', 'of', 'his', 'soul.', '', '*****', '', 'But', 'as', 'he', 'descended', 'hill,', 'sadness', 'came', 'upon', 'him,', 'he']
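
If the word list might itself contain mixed-case entries, one possible variation is to lowercase it once, up front, and return a predicate that can be passed to `filter()`. The helper name `make_word_filter` below is hypothetical and not part of the original exercise.

```python
def make_word_filter(words):
    # Hypothetical helper: lowercase the word list once, then reuse it.
    lowered = {w.lower() for w in words}

    def predicate(x):
        # True means "keep this word"; the comparison ignores case.
        return x.lower() not in lowered

    return predicate

keep = make_word_filter(['And', 'THE', 'a', 'an'])
sample = ['And', 'in', 'The', 'twelfth', 'year,', 'A']

print(list(filter(keep, sample)))
# ['in', 'twelfth', 'year,']
```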


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [100]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: The two strings joined into one, separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    return a + ' ' + b

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [102]:
# reduce was already imported from functools in the first cell
prophet_string = reduce(concat_space, prophet_filter)

print(prophet_string[:100])

PROPHET  |Almustafa, chosen beloved, who was dawn unto his own day, had waited twelve years in city 
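
The double space after `PROPHET` in the output comes from the empty string left behind by the line-break split. The sketch below reproduces this on a short sample and compares the `reduce()` result with `' '.join()`, which produces the same string for a non-empty list of strings. The comparison is shown only as a cross-check; the exercise itself asks for `reduce()`.

```python
from functools import reduce

def concat_space(a, b):
    # Same logic as the function above.
    return a + ' ' + b

# First few words of prophet_filter, including the empty string.
sample = ['PROPHET', '', '|Almustafa,', 'chosen']

reduced = reduce(concat_space, sample)
joined = ' '.join(sample)

print(repr(reduced))      # 'PROPHET  |Almustafa, chosen'
print(reduced == joined)  # True
```

One difference worth knowing: `reduce()` without an initial value raises a `TypeError` on an empty list, while `' '.join([])` returns an empty string.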
