# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [42]:
# Run this code:

import requests


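# Note: this is a GitHub "blob" page URL, so the response is the HTML page
# that wraps the file rather than the plain text of the book.
url = 'https://github.com/DJDarah/lab-map-filter-reduce-en/blob/master/data/58585-0.txt'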

# Fetch content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    content = response.text  # Use .content for binary files
    prophet = content.split(' ')
    print(prophet)
else:
    print(f"Failed to fetch file: {response.status_code}")




In [48]:
total_words = sum(len(item.split()) for item in prophet if isinstance(item, str))
print(total_words)


44954


#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [53]:
def is_in_index_range(index):
    # Keep elements 568 through the end (this drops indices 0 through 567)
    return index >= 568

filtered_items = list(filter(lambda pair: is_in_index_range(pair[0]), enumerate(prophet)))

filtered_items = [word for _, word in filtered_items]

print(filtered_items)
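
The same trimming can also be written as a slice, which avoids the index bookkeeping. The sketch below compares the two approaches on a short stand-in list; `sample_words` and `n_header` are hypothetical names used only for illustration (the lab itself uses 568 on the real `prophet` list).

```python
# Hypothetical stand-in for `prophet`; the real list is much longer.
sample_words = ['title', 'author', 'the', 'coming', 'of', 'the', 'ship']
n_header = 2  # illustrative cutoff; the lab uses 568

# Approach 1: filter on the index supplied by enumerate, then drop the index
kept_by_filter = [
    word
    for _, word in filter(lambda pair: pair[0] >= n_header, enumerate(sample_words))
]

# Approach 2: slice off the first n_header elements
kept_by_slice = sample_words[n_header:]

print(kept_by_filter == kept_by_slice)
```

Both versions keep the element at index `n_header` itself, which is the "keep elements 568 through the last element" reading of the instruction.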




If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [56]:
def first_ten(index):
    # True for indices 0 through 10, so eleven elements are kept
    return index < 11


filtered_items = list(filter(lambda pair: first_ten(pair[0]), enumerate(prophet)))

filtered_items = [word for _, word in filtered_items]

print(filtered_items)


['\n\n\n\n\n\n<!DOCTYPE', 'html>\n<html\n', '', 'lang="en"\n', '', '\n', '', 'data-color-mode="auto"', 'data-light-theme="light"', 'data-dark-theme="dark"\n', '']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [68]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # your code here



In [69]:
def reference(word):
    '''
    Input: A string
    Output: The string with references removed

    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    if '{' in word and '}' in word:  
        return word.split('{')[0]  
    return word  
    
    



Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [70]:
input_string = 'the{7} quick{12} brown fox{99}'

 
words = input_string.split()

 
cleaned_words = map(reference, words)

 
result = ' '.join(cleaned_words)

print(result)   


the quick brown fox
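
The cell above checks `reference` on a short test string. Applied to a list of words, the same idea might look like the sketch below. `sample_words` is a hypothetical stand-in for the trimmed book list, and `reference` is repeated (in a shortened form) so the snippet runs on its own.

```python
def reference(word):
    """Return the part of `word` before the first '{'."""
    return word.split('{')[0]

# Hypothetical stand-in for the list of words from the book
sample_words = ['the{7}', 'quick{12}', 'brown', 'fox{99}']

# map() is lazy in Python 3, so wrap it in list() to materialize the result
prophet_reference = list(map(reference, sample_words))
print(prophet_reference)
```

Words without a `{` pass through unchanged, because `split` returns the whole string as its first element when the separator is absent.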


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [None]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # your code here

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [71]:
def line_break(text):
    '''
    Input: A string
    Output: A list of strings split on the line break character `\n`
    
    Example:
    Input: 'hello\nworld'
    Output: ['hello', 'world']
    '''
    return text.split('\n')
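
To produce `prophet_line`, the function can be mapped over the list of cleaned words. This is a minimal sketch on made-up data: `sample_reference` is a hypothetical stand-in for `prophet_reference`, and `line_break` is redefined so the snippet is self-contained.

```python
def line_break(text):
    """Split a string on the line break character."""
    return text.split('\n')

# Hypothetical stand-in for `prophet_reference`
sample_reference = ['the\nbeloved', 'city', 'of\n\nhis']

# Each word becomes a list, so the result is a list of lists
prophet_line = list(map(line_break, sample_reference))
print(prophet_line)
```

Note that two consecutive line breaks produce an empty string in the sublist, so a later filtering step may want to drop empty strings as well.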


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [74]:
prophet_line = [['hello', 'world'], ['this', 'is'], ['a', 'test']]

prophet_flat = [word for sublist in prophet_line for word in sublist]

print(prophet_flat)


['hello', 'world', 'this', 'is', 'a', 'test']
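
As an aside, the standard library offers an alternative to the nested comprehension. The sketch below, using the same toy data under the hypothetical name `nested`, shows that `itertools.chain.from_iterable` gives the same flat list.

```python
from itertools import chain

# Same toy data as the cell above, under a hypothetical name
nested = [['hello', 'world'], ['this', 'is'], ['a', 'test']]

# Nested list comprehension: outer loop over sublists, inner loop over words
flat_comprehension = [word for sublist in nested for word in sublist]

# chain.from_iterable yields the items of each sublist in order
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension == flat_chain)
```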


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [75]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [77]:
def word_filter(word):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
    '''
    word_list = ['and', 'the', 'a', 'an']
    return word not in word_list


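# `prophet` is a list when built by the loading cell above, and a list has no
# .split(); accept either a plain string or a list of words here.
prophet_words = prophet.split() if isinstance(prophet, str) else prophet
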
prophet_filter = list(filter(word_filter, prophet_words))

print(prophet_filter)


['John', 'Mary', 'went', 'to', 'market', 'bought', 'loaf', 'of', 'bread.']


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [None]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
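
One possible approach to the bonus, offered as a sketch rather than the expected solution, is to lowercase the input before the membership test. The list `sample` is hypothetical test data.

```python
def word_filter_case(word):
    """Return False if `word` is in the stop list, ignoring case."""
    word_list = ['and', 'the', 'a', 'an']
    # Lowercasing the input lets 'The' and 'AND' match the lowercase list
    return word.lower() not in word_list

# Hypothetical test data mixing upper and lower case
sample = ['The', 'prophet', 'AND', 'a', 'ship']
filtered_sample = list(filter(word_filter_case, sample))
print(filtered_sample)
```

This works because every entry in `word_list` is already lowercase; if the list contained mixed-case entries, both sides of the comparison would need normalizing.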

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [None]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here

In [None]:
# your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [None]:
# your code here
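
A minimal sketch of how the two steps above could fit together, assuming a short hypothetical list `sample_filter` in place of the real `prophet_filter`:

```python
from functools import reduce

def concat_space(a, b):
    """Join two strings with a single space between them."""
    return a + ' ' + b

# Hypothetical stand-in for `prophet_filter`
sample_filter = ['John', 'Mary', 'went', 'to', 'market']

# reduce applies concat_space cumulatively from left to right:
# (((John + Mary) + went) + to) + market
prophet_string = reduce(concat_space, sample_filter)
print(prophet_string)
```

For a non-empty list this gives the same result as `' '.join(sample_filter)`, which is the more common idiom outside of an exercise on `reduce()`. Without an initial value, `reduce()` raises a `TypeError` on an empty list.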