# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [19]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [21]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [23]:
len(prophet)

13637
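The word count above depends on splitting on a single space character. The sketch below uses a made-up string (`text` is a hypothetical sample, not taken from the book file) to show why tokens such as `'the\nbeloved,'` survive that split, and how `split()` with no argument behaves differently.

```python
# Hypothetical sample text; note the embedded newline and the double space
text = 'the{7} chosen and the\nbeloved,  who'

# split(' ') cuts on every single space and nothing else
by_space = text.split(' ')

# split() with no argument cuts on any run of whitespace, including '\n'
by_whitespace = text.split()

print(by_space)       # ['the{7}', 'chosen', 'and', 'the\nbeloved,', '', 'who']
print(by_whitespace)  # ['the{7}', 'chosen', 'and', 'the', 'beloved,', 'who']
```

The exercise keeps `split(' ')` so that the later mapping steps have line breaks to clean up.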

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing elements 0 through 567 from the `prophet` list (equivalently, keep elements 568 through the last element).

In [25]:
# your code here

# Remove the first 568 elements from the list
prophet = prophet[568:]

# Check the new length of the list to ensure it's correct
len(prophet)

13069
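As a small illustration of the two equivalent approaches described above, the following sketch uses a toy list. The names `words` and `n_header` are hypothetical and stand in for `prophet` and the 568 header words.

```python
# Hypothetical toy data standing in for the book's word list
words = ['title', 'author', 'chapter', 'one', 'begins']
n_header = 2  # assumed number of leading words to drop

# Option 1: keep everything from index n_header onward (builds a new list)
kept = words[n_header:]

# Option 2: delete the leading elements in place, here on a copy
trimmed = list(words)
del trimmed[:n_header]

print(kept)             # ['chapter', 'one', 'begins']
print(kept == trimmed)  # True
```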

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [27]:
# your code here

# Display words 1 through 10 from the `prophet` list
prophet[:10]

['PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [29]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # your code here

    # Split the string at the '{' character
    parts = x.split('{')

    # Return the part before the '{' (i.e., the first part of the split)
    return parts[0]

In [None]:
# your code here

Now that we have our function, use the `map()` function to apply it to every word in our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [31]:
# your code here

# Use map() to apply the reference function to each word in the prophet list
prophet_reference = list(map(reference, prophet))

# Check the first 10 words from the prophet_reference list to verify
prophet_reference[:10]


['PROPHET\n\n|Almustafa,',
 'the',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto']
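The call above wraps `map()` in `list()` because, in Python 3, `map()` returns a lazy iterator rather than a list. The sketch below illustrates this on a short made-up sample; `strip_reference` and `sample` are hypothetical names that mirror `reference` and `prophet`.

```python
def strip_reference(word):
    # Same idea as `reference`: keep the text before the first '{'
    return word.split('{')[0]

sample = ['the{7}', 'chosen', 'and{12}']  # hypothetical sample words

lazy = map(strip_reference, sample)  # lazy iterator, nothing computed yet
as_list = list(lazy)                 # consuming it produces the values
leftover = list(lazy)                # the iterator is now exhausted

# An equivalent list comprehension
comprehension = [strip_reference(w) for w in sample]

print(as_list)                   # ['the', 'chosen', 'and']
print(leftover)                  # []
print(as_list == comprehension)  # True
```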

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [33]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # your code here
    
    # Split the string on the newline character (\n)
    return x.split('\n')

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [56]:
# your code here

# Apply the line_break function to each word in the prophet_reference list
prophet_line = list(map(line_break, prophet_reference))

# Check the first 10 elements in prophet_line to verify
prophet_line[:10]

[['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn'],
 ['unto']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [37]:
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

['PROPHET',
 '',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city',
 'of',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to',
 'return',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of',
 'his',
 'birth.',
 '',
 'And',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh',
 'day',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he',
 'climbed',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls',
 'and',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his',
 'ship',
 'coming',
 'with',
 'the',
 'mist.',
 '',
 'Then',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung',
 'open,',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.',
 'And',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the',
 'silences',
 'of',
 'his',
 'soul.',
 '',
 '*****',
 '',
 'But',

In [None]:
# your code here
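The flattening step above can also be sketched with `itertools.chain.from_iterable` from the standard library, which is an alternative to the list comprehension the exercise asks for. The list `nested` below is a hypothetical sample shaped like the first elements of `prophet_line`.

```python
from itertools import chain

# Hypothetical sample shaped like the start of prophet_line
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# Nested list comprehension: outer loop over sublists, inner loop over words
flat_comprehension = [word for sub in nested for word in sub]

# chain.from_iterable yields the items of each sublist in order
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)
# ['PROPHET', '', '|Almustafa,', 'the', 'the', 'beloved,']
print(flat_comprehension == flat_chain)  # True
```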

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the list and `True` otherwise.

In [39]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here

    # Return False if the word is in the word_list, otherwise True
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [41]:
# your code here

# Apply the word_filter function to prophet_flat using filter()
prophet_filter = list(filter(word_filter, prophet_flat))

# Check the first 10 filtered words
prophet_filter[:10]

['PROPHET',
 '',
 '|Almustafa,',
 'chosen',
 'beloved,',
 'who',
 'was',
 'dawn',
 'unto',
 'his']
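The output above still contains an empty string, because `word_filter` only removes the four listed words. The sketch below shows, on a made-up sample, that `filter()` with a predicate matches a conditional list comprehension, and adds an optional extra step that is not part of the exercise: `filter(None, ...)` drops falsy items such as `''`. The names `STOP_WORDS`, `keep_word`, and `sample` are hypothetical.

```python
# Same words as word_list, stored as a set for fast membership tests
STOP_WORDS = {'and', 'the', 'a', 'an'}

def keep_word(word):
    # Hypothetical helper mirroring word_filter
    return word not in STOP_WORDS

sample = ['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'beloved,']

with_filter = list(filter(keep_word, sample))
with_comprehension = [w for w in sample if keep_word(w)]

# Optional extra step (not in the exercise): remove empty strings
non_empty = list(filter(None, with_filter))

print(with_filter)  # ['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,']
print(non_empty)    # ['PROPHET', '|Almustafa,', 'chosen', 'beloved,']
print(with_filter == with_comprehension)  # True
```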

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [48]:
def word_filter_case(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list (case-insensitive)
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the', 'a', 'an']
    Input: 'AND'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    # List of words to filter out
    word_list = ['and', 'the', 'a', 'an']

    # your code here
    
    # Convert the input word to lowercase and check if it's in the lowercase word_list
    return x.lower() not in word_list

In [50]:
# Test list of words with mixed cases
test_words = ['AND', 'the', 'A', 'An', 'John', 'love', 'Friend']

# Apply the word_filter_case function to the test list
filtered_test = list(filter(word_filter_case, test_words))

# Print the filtered result
print(filtered_test)


['John', 'love', 'Friend']


Explanation:

1. Case normalization:

    * `x.lower()` converts the input word `x` to lowercase so that the comparison is case-insensitive.

2. Membership check:

    * The lowercase word `x.lower()` is checked against `word_list` (which contains only lowercase words).

3. Return value:

    * If the lowercase word is not in `word_list`, the function returns `True` (keep the word).
    * If it is in `word_list`, the function returns `False` (filter it out).
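The steps above can be sketched with `str.casefold()`, a more aggressive form of `lower()` intended for caseless matching. This is a hypothetical variant (`word_filter_casefold` and `WORD_LIST` are illustrative names), not a replacement for `word_filter_case`. It also shows a limitation both versions share: a word with punctuation attached does not match the list.

```python
WORD_LIST = ['and', 'the', 'a', 'an']

def word_filter_casefold(x):
    # Hypothetical variant of word_filter_case using casefold() instead of lower()
    return x.casefold() not in WORD_LIST

print(word_filter_casefold('AND'))   # False: matches 'and' after casefolding
print(word_filter_casefold('John'))  # True: not in the list
print(word_filter_casefold('The,'))  # True: the trailing comma prevents a match
```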

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [52]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here

    # Concatenate the two strings with a space in between
    return a + ' ' + b

In [None]:
# your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [54]:
# your code here

from functools import reduce

# Use reduce to concatenate the words in prophet_filter with spaces
prophet_string = reduce(concat_space, prophet_filter)

# Check the first 100 characters of the resulting string
print(prophet_string[:100])

PROPHET  |Almustafa, chosen beloved, who was dawn unto his own day, had waited twelve years in city 
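The double space after `PROPHET` in the output comes from the empty string still present in `prophet_filter`. The sketch below reproduces this on a short made-up list (`words` is hypothetical) and shows that, for a non-empty list of strings, the `reduce()` call gives the same result as `' '.join()`.

```python
from functools import reduce

def concat_space(a, b):
    # Same behavior as the function defined above
    return a + ' ' + b

# Hypothetical sample shaped like the start of prophet_filter
words = ['PROPHET', '', '|Almustafa,', 'chosen']

reduced = reduce(concat_space, words)
joined = ' '.join(words)

print(repr(reduced))      # 'PROPHET  |Almustafa, chosen'
print(reduced == joined)  # True
```

One difference worth knowing: `reduce()` without an initial value raises a `TypeError` on an empty list, while `' '.join([])` returns an empty string.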
