# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [41]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [42]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [43]:
len(prophet)

13637
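The separator passed to `split` affects what ends up in the word list. As a small illustration (the sample string is made up, not taken from the book file), `split(' ')` cuts on single spaces only, so line breaks stay inside words and doubled spaces produce empty strings, while `split()` with no argument cuts on any run of whitespace:

```python
# Hypothetical sample text, not taken from the book file
sample = "the\nbeloved  prophet"

# Splitting on a single space keeps '\n' inside a word and
# yields an empty string for the doubled space
by_space = sample.split(' ')

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = sample.split()

print(by_space)       # ['the\nbeloved', '', 'prophet']
print(by_whitespace)  # ['the', 'beloved', 'prophet']
```

This is why a list built with `split(' ')` can contain empty strings and words with embedded line breaks.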

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).
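Both approaches mentioned above can be sketched on a toy list. The names `words` and `n_metadata` are placeholders for this example only, and the count of 3 is arbitrary:

```python
# Hypothetical stand-in for the word list: 3 "metadata" items, then content
words = ['title', 'author', 'license', 'It', 'is', 'not']
n_metadata = 3  # assumed count for this toy example

# Option 1: keep everything from index n_metadata onward (new list)
kept = words[n_metadata:]

# Option 2: delete elements 0 through n_metadata - 1 in place (on a copy here)
trimmed = list(words)
del trimmed[:n_metadata]

print(kept)             # ['It', 'is', 'not']
print(kept == trimmed)  # True
```

Slicing returns a new list and leaves the original untouched, which is usually the safer choice when a cell may be re-run.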

In [44]:
# your code here
location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    # Read the file and split it into words
    prophet = f.read().split()
    
    # Skip the leading metadata words (this cell slices from index 572)
    book_content = prophet[572:]

# Optional: Join the remaining words back into a string
book_text = ' '.join(book_content)

# Output or save the remaining content 
print(' '.join(book_content[:100]))  # Print the first 100 words of the actual book content


It is not a garment I cast off this day, but a skin that I tear with my own hands. Nor is it a thought I leave behind me, but a heart made sweet with hunger and with thirst. ***** Yet I cannot tarry longer. The sea that calls all things unto her calls me, and I must embark. For to stay, though the hours burn in the night, is to freeze and crystallize and be bound in a mould. Fain would I take with me all that is here. But how shall I? A voice cannot carry the tongue


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [45]:
print(book_content[:10])

['It', 'is', 'not', 'a', 'garment', 'I', 'cast', 'off', 'this', 'day,']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [46]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # your code here
    return x.split('{')[0]

In [47]:
def reference(x):
    return x.split("{")[0]

Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.
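As a quick sketch of how `map()` behaves, here it is on a short made-up list (the reference markers in `sample_words` are invented for illustration). Note that `map()` returns a lazy iterator, so it needs to be wrapped in `list()` to get a list back:

```python
def reference(x):
    # Keep only the text before the first '{'
    return x.split('{')[0]

# Hypothetical sample words; the reference markers are made up
sample_words = ['the{7}', 'beloved', 'city{12}']

mapped = map(reference, sample_words)  # lazy map object, nothing computed yet
cleaned = list(mapped)                 # list() consumes the iterator

print(cleaned)  # ['the', 'beloved', 'city']
```

Words without a `{` pass through unchanged, because splitting on a missing character returns the whole string as the first element.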

In [48]:
# your code here
prophet_reference = list(map(reference, book_content))
print(prophet_reference)

['It', 'is', 'not', 'a', 'garment', 'I', 'cast', 'off', 'this', 'day,', 'but', 'a', 'skin', 'that', 'I', 'tear', 'with', 'my', 'own', 'hands.', 'Nor', 'is', 'it', 'a', 'thought', 'I', 'leave', 'behind', 'me,', 'but', 'a', 'heart', 'made', 'sweet', 'with', 'hunger', 'and', 'with', 'thirst.', '*****', 'Yet', 'I', 'cannot', 'tarry', 'longer.', 'The', 'sea', 'that', 'calls', 'all', 'things', 'unto', 'her', 'calls', 'me,', 'and', 'I', 'must', 'embark.', 'For', 'to', 'stay,', 'though', 'the', 'hours', 'burn', 'in', 'the', 'night,', 'is', 'to', 'freeze', 'and', 'crystallize', 'and', 'be', 'bound', 'in', 'a', 'mould.', 'Fain', 'would', 'I', 'take', 'with', 'me', 'all', 'that', 'is', 'here.', 'But', 'how', 'shall', 'I?', 'A', 'voice', 'cannot', 'carry', 'the', 'tongue', 'and', '', 'lips', 'that', 'gave', 'it', 'wings.', 'Alone', 'must', 'it', 'seek', 'the', 'ether.', 'And', 'alone', 'and', 'without', 'his', 'nest', 'shall', 'the', 'eagle', 'fly', 'across', 'the', 'sun.', '*****', 'Now', 'when',

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [49]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # your code here
    return x.split('\n')

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.
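To see what shape the result will have, here is a minimal sketch on two made-up words. Because `split` always returns a list, mapping it over a list gives a list of lists, even for words with no line break:

```python
def line_break(x):
    # Split a string on the line break character
    return x.split('\n')

# Hypothetical sample words for illustration
sample_words = ['the\nbeloved', 'city']

nested = list(map(line_break, sample_words))

print(nested)  # [['the', 'beloved'], ['city']]
```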

In [50]:
# your code here
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line)

[['It'], ['is'], ['not'], ['a'], ['garment'], ['I'], ['cast'], ['off'], ['this'], ['day,'], ['but'], ['a'], ['skin'], ['that'], ['I'], ['tear'], ['with'], ['my'], ['own'], ['hands.'], ['Nor'], ['is'], ['it'], ['a'], ['thought'], ['I'], ['leave'], ['behind'], ['me,'], ['but'], ['a'], ['heart'], ['made'], ['sweet'], ['with'], ['hunger'], ['and'], ['with'], ['thirst.'], ['*****'], ['Yet'], ['I'], ['cannot'], ['tarry'], ['longer.'], ['The'], ['sea'], ['that'], ['calls'], ['all'], ['things'], ['unto'], ['her'], ['calls'], ['me,'], ['and'], ['I'], ['must'], ['embark.'], ['For'], ['to'], ['stay,'], ['though'], ['the'], ['hours'], ['burn'], ['in'], ['the'], ['night,'], ['is'], ['to'], ['freeze'], ['and'], ['crystallize'], ['and'], ['be'], ['bound'], ['in'], ['a'], ['mould.'], ['Fain'], ['would'], ['I'], ['take'], ['with'], ['me'], ['all'], ['that'], ['is'], ['here.'], ['But'], ['how'], ['shall'], ['I?'], ['A'], ['voice'], ['cannot'], ['carry'], ['the'], ['tongue'], ['and'], [''], ['lips'], ['t

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings, so we now have a list of lists. Flatten it using a list comprehension and assign the new list to `prophet_flat`.
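A flattening comprehension reads left to right like nested `for` loops: the outer loop comes first, the inner loop second. The sketch below uses a made-up nested list and also shows `itertools.chain.from_iterable` from the standard library as an alternative that gives the same result:

```python
from itertools import chain

# Hypothetical nested list for illustration
nested = [['the', 'beloved'], ['city'], ['of', 'light']]

# Outer loop over sublists, inner loop over the words in each sublist
flat_comprehension = [word for sub in nested for word in sub]

# Standard-library alternative that chains the sublists together
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)  # ['the', 'beloved', 'city', 'of', 'light']
print(flat_comprehension == flat_chain)  # True
```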

In [51]:
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

['It',
 'is',
 'not',
 'a',
 'garment',
 'I',
 'cast',
 'off',
 'this',
 'day,',
 'but',
 'a',
 'skin',
 'that',
 'I',
 'tear',
 'with',
 'my',
 'own',
 'hands.',
 'Nor',
 'is',
 'it',
 'a',
 'thought',
 'I',
 'leave',
 'behind',
 'me,',
 'but',
 'a',
 'heart',
 'made',
 'sweet',
 'with',
 'hunger',
 'and',
 'with',
 'thirst.',
 '*****',
 'Yet',
 'I',
 'cannot',
 'tarry',
 'longer.',
 'The',
 'sea',
 'that',
 'calls',
 'all',
 'things',
 'unto',
 'her',
 'calls',
 'me,',
 'and',
 'I',
 'must',
 'embark.',
 'For',
 'to',
 'stay,',
 'though',
 'the',
 'hours',
 'burn',
 'in',
 'the',
 'night,',
 'is',
 'to',
 'freeze',
 'and',
 'crystallize',
 'and',
 'be',
 'bound',
 'in',
 'a',
 'mould.',
 'Fain',
 'would',
 'I',
 'take',
 'with',
 'me',
 'all',
 'that',
 'is',
 'here.',
 'But',
 'how',
 'shall',
 'I?',
 'A',
 'voice',
 'cannot',
 'carry',
 'the',
 'tongue',
 'and',
 '',
 'lips',
 'that',
 'gave',
 'it',
 'wings.',
 'Alone',
 'must',
 'it',
 'seek',
 'the',
 'ether.',
 'And',
 'alone',

In [52]:
# your code here

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if the word is in the specified list and `True` otherwise.
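`filter()` keeps the elements for which the predicate returns a truthy value and drops the rest. A minimal sketch, using a hypothetical predicate `keep_word` and a made-up word list:

```python
# Hypothetical stop words and predicate, for illustration only
stop_words = ['and', 'the', 'a', 'an']

def keep_word(word):
    # True keeps the word, False drops it
    return word not in stop_words

sample_words = ['the', 'sea', 'and', 'a', 'ship', 'And']

kept = list(filter(keep_word, sample_words))

print(kept)  # ['sea', 'ship', 'And']
```

A predicate that returns nothing (that is, `None`) is treated as falsy for every element, so the filtered list comes back empty. Note also that `'And'` survives here because the comparison is case sensitive.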

In [53]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    # Keep the word only if it is not one of the words to remove
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [54]:
# your code here
prophet_filter = list(filter(word_filter, prophet_flat))

print(prophet_filter)


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.
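One pitfall when making the comparison case-insensitive is the difference between whole-word matching and substring matching. The sketch below contrasts the two on a made-up list; both function names are hypothetical:

```python
stop_words = ['and', 'the', 'a', 'an']

def exact_match_filter(word):
    # Drop the word only if, lowercased, it equals a stop word
    return word.lower() not in stop_words

def substring_filter(word):
    # Drops any word that merely contains a stop word, e.g. 'garment' contains 'a'
    return not any(stop in word.lower() for stop in stop_words)

sample_words = ['And', 'garment', 'The', 'sea', 'is']

exact = list(filter(exact_match_filter, sample_words))
substring = list(filter(substring_filter, sample_words))

print(exact)      # ['garment', 'sea', 'is']
print(substring)  # ['is']
```

Substring matching removes far more than intended, since the single letter `'a'` appears inside many ordinary words.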

In [55]:
def word_filter_case(x):

    word_list = ['and', 'the', 'a', 'an']

    # your code here
    # Lowercase the word and test whole-word membership, so 'The' and 'AND'
    # are removed but a word such as 'garment' is kept
    return x.lower() not in word_list

prophet_filter = list(filter(word_filter_case, prophet_flat))

print(prophet_filter)

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [57]:
from functools import reduce

# Define the concat_space function
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return a + ' ' + b

# Demonstrate the function on a small sample list, kept in its own
# variable so that the real prophet_filter is not overwritten
sample_words = ["John", "Smith", "is", "a", "prophet"]

# Use reduce to concatenate the sample words with spaces in between
sample_string = reduce(concat_space, sample_words)

# Output the concatenated string
print(sample_string)


John Smith is a prophet


In [36]:
# your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.
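As a sketch of what `reduce()` does here, the example below folds a short made-up list from left to right and checks the result against `' '.join`, which produces the same string for a list of strings. It also shows one edge case: calling `reduce()` on an empty list without an initial value raises a `TypeError`.

```python
from functools import reduce

def concat_space(a, b):
    # Join two strings with a single space
    return a + ' ' + b

# Hypothetical sample words for illustration
sample_words = ['Yet', 'I', 'cannot', 'tarry', 'longer.']

# reduce folds left to right:
# concat_space(concat_space(concat_space('Yet', 'I'), 'cannot'), ...)
folded = reduce(concat_space, sample_words)
joined = ' '.join(sample_words)

print(folded)            # Yet I cannot tarry longer.
print(folded == joined)  # True

# An empty list with no initial value cannot be reduced
empty_error = None
try:
    reduce(concat_space, [])
except TypeError as err:
    empty_error = err
```

For long word lists `' '.join` is the more common idiom, but `reduce()` makes the pairwise combination explicit, which is the point of this challenge.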

In [37]:
# your code here