## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Yiyun Fan
    - Email: yf366@drexel.edu
- Group member 2
    - Name: Shreekant Malviya
    - Email: sm4546@drexel.edu
- Group member 3
    - Name: Kunal Sharma
    - Email: kos26@drexel.edu
- Group member 4
    - Name: NA
    - Email: NA
### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Start by writing a function called `load_book` that reads in a book based on a provided `book_id` string and returns a list of `paragraphs` from the book. When book data is loaded, you should remove the whitespace characters at the beginning and end of the text (e.g., using `strip()`). Then, to split books into paragraphs, use the `re.split()` method to split the input wherever there are two or more consecutive new lines. Note that books are in the provided `data/books` directory.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you do _not_ need to remove markup or non-substantive content.
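As a minimal sketch of the splitting step described above, the snippet below applies `strip()` and `re.split()` to a small made-up string (`sample_text` is hypothetical, not taken from any book). It also shows why the pattern should match runs of two _or more_ new lines: splitting on exactly two leaves a stray new line at the start of a paragraph that was preceded by three.

```python
import re

# Hypothetical sample text: paragraphs separated by two, then three, new lines.
sample_text = "\n\nFirst paragraph,\nstill first.\n\nSecond paragraph.\n\n\nThird paragraph.\n"

stripped = sample_text.strip()

# Split on runs of two or more new lines.
sample_paragraphs = re.split(r"\n{2,}", stripped)

# For contrast: splitting on exactly two new lines leaves a leftover "\n".
naive_paragraphs = re.split(r"\n\n", stripped)

print(sample_paragraphs)
print(naive_paragraphs)
```

Single new lines inside a paragraph (ordinary line wraps) are left untouched by both patterns.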

In [384]:
# B1:Function(3/3)

import re

def load_book(book_id):
    
    with open("./data/books/"+book_id+".txt", "r") as fh:
        paragraphs = fh.read()
        paragraphs = paragraphs.strip()
        paragraphs = re.split("\n\n", paragraphs)
    return paragraphs

To test your function, let's apply it to look at a few paragraphs from book 84.

In [385]:
# B1:SanityCheck
paragraphs = load_book('84')
print(len(paragraphs))
print(paragraphs[10])

723
These reflections have dispelled the agitation with which I began my
letter, and I feel my heart glow with an enthusiasm which elevates me
to heaven, for nothing contributes so much to tranquillize the mind as
a steady purpose--a point on which the soul may fix its intellectual
eye.  This expedition has been the favourite dream of my early years. I
have read with ardour the accounts of the various voyages which have
been made in the prospect of arriving at the North Pacific Ocean
through the seas which surround the pole.  You may remember that a
history of all the voyages made for purposes of discovery composed the
whole of our good Uncle Thomas' library.  My education was neglected,
yet I was passionately fond of reading.  These volumes were my study
day and night, and my familiarity with them increased that regret which
I had felt, as a child, on learning that my father's dying injunction
had forbidden my uncle to allow me to embark in a seafaring life.


__B2.__ _(10 points)_ Next, write a function called `kwic(paragraphs, search_terms)` that accepts a list of string `paragraphs` and a set of `search_terms` strings. The function should:

1. initialize `data` as a `defaultdict` of lists;
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`, where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```

Note, we have imported spacy and set it up to use the `"en"` model. This will require you to install spacy by running `pip install spacy` and downloading the `"en"` model by running the command `python -m spacy download en`.
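To make the target data structure concrete, here is a small sketch of the same triple loop that does not depend on `spacy`. The helpers `toy_sentences` and `toy_tokens` are hypothetical regex-based stand-ins for `doc.sents` and `spacy`'s tokens, so their splits will not match `spacy`'s exactly; only the shape of `data` is the point.

```python
import re
from collections import defaultdict

def toy_sentences(paragraph):
    """Hypothetical stand-in for doc.sents: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def toy_tokens(sentence):
    """Hypothetical stand-in for spacy tokens: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def toy_kwic(paragraphs, search_terms):
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):                    # paragraph-in-book
        for j, sentence in enumerate(toy_sentences(paragraph)):   # sentence-in-paragraph
            tokens = toy_tokens(sentence)
            for k, word in enumerate(tokens):                     # word-in-sentence
                if word in search_terms:
                    data[word].append([[i, j, k], tokens])
    return data

# Made-up paragraphs, for illustration only.
toy_paragraphs = ["The sea was calm. We crossed the sea at dawn.", "No sea here? None."]
toy_data = toy_kwic(toy_paragraphs, {"sea"})
print(toy_data["sea"])
```

Each entry pairs an `[i, j, k]` location with the full token list of the sentence that contains the key word.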

In [386]:

from collections import defaultdict
import spacy
nlp = spacy.load("en")

def kwic(paragraphs, search_terms = {}):

    #---Your code starts here---
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs): 
        doc = nlp(paragraph)
        sentences = list(doc.sents)
        for j, sentence in enumerate(sentences): 
            for k, word in enumerate(sentence): 
                if word.text in search_terms:
                    data[word.text].append([[i, j, k], [token.text for token in sentence]])        
    #---Your code ends here---
    
    return(data)

Now, let's test your function using the paragraphs from your `load_book` function.

In [387]:
# B2:SanityCheck
kwic(paragraphs, {'Ocean', 'seas'})

defaultdict(list,
            {'Ocean': [[[10, 2, 26],
               ['I',
                '\n',
                'have',
                'read',
                'with',
                'ardour',
                'the',
                'accounts',
                'of',
                'the',
                'various',
                'voyages',
                'which',
                'have',
                '\n',
                'been',
                'made',
                'in',
                'the',
                'prospect',
                'of',
                'arriving',
                'at',
                'the',
                'North',
                'Pacific',
                'Ocean',
                '\n',
                'through',
                'the',
                'seas',
                'which',
                'surround',
                'the',
                'pole',
                '.',
                ' ']]],
             'seas': [[[10, 2, 30],
             

__B3.__ _(2 points)_ Let's test your `kwic` search function's utility using the pre-processed `paragraphs` from book `84` for the key words `Frankenstein` and `monster` in context. Answer the inline questions about these tests.

In [388]:
# B3:SanityCheck
results = kwic(paragraphs, {"Frankenstein", "monster"})

print("# of sentences 'Frankenstein' appears in: {}".format(len(results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(results['monster'])))
print()

print(" ".join(results['Frankenstein'][7][1]))
print()
print(" ".join(results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


In [389]:
# B3:Inline(1/2)

# Is the kwic function fast or slow? Print "Fast" or "Slow"
print("Slow")

Slow


In [390]:
# B3:Inline(1/2)

# How many sentences does the word Frankenstein appear in? Print the integer (0 is just a placeholder).
print(len(results['Frankenstein']))

27


__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor here for kwic. Presently, we have our pre-processing `load_book` function just splitting a document into paragraphs. Rewrite the `load_book` function to do some additional preprocessing. Specifically, this function should be modified to:

1. split a `book` into paragraphs and loop over them, but
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instance in `document`, and the triple-list format of our document will allow us fast access to extract the sentence for KWiC. 
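The relationship between the two structures can be sketched without `spacy` as follows. The function `toy_index_book` and its regex-based sentence and word splitting are hypothetical simplifications, used only to show that every `[i, j, k]` stored in the index points back at the same word in the nested document.

```python
import re
from collections import defaultdict

def toy_index_book(text):
    """Build a triple-nested document and an inverted index from raw text.

    Sentence and word splitting here are naive regex stand-ins for spacy.
    """
    document = []
    index = defaultdict(list)
    paragraphs = re.split(r"\n{2,}", text.strip())
    for i, paragraph in enumerate(paragraphs):
        sentence_lists = []
        for j, sentence in enumerate(re.split(r"(?<=[.!?])\s+", paragraph)):
            tokens = re.findall(r"\w+|[^\w\s]", sentence)
            for k, token in enumerate(tokens):
                index[token].append([i, j, k])   # record where this word occurs
            sentence_lists.append(tokens)
        document.append(sentence_lists)
    return document, index

# Made-up text, for illustration only.
toy_text = "The sea was calm. We crossed the sea.\n\nNo monster here."
toy_document, toy_index = toy_index_book(toy_text)

print(toy_index["sea"])
print(toy_document[0][1][3])
```

The expensive pass over the text happens once, at load time; afterwards, finding a word's sentences only requires a dictionary lookup followed by list indexing.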

In [428]:
# B4:Function(10/10)

def load_book(book_id):
    
    #---Your code starts here---
    document = []  # triple-nested list: document[i][j][k] is a word string
    index = defaultdict(list)  # word string -> list of [i, j, k] locations
    with open(f"./data/books/{book_id}.txt", "r") as fh:
        paragraphs = re.split('\n{2,}', fh.read().strip())
        
        for i, paragraph in enumerate(paragraphs):
            doc = nlp(paragraph)
            paragraph_list = []
            sentences = list(doc.sents)
            for j, sentence in enumerate(sentences):
                sentences_list = []
                for k, word in enumerate(sentence):
                    sentences_list.append(word.text)
                    index[word.text].append([i,j,k])
                paragraph_list.append(sentences_list) 
                
            document.append(paragraph_list)
    return(document, index)

Now, let's test your new function on `book_id` = `'84'`. We'll use the returned document to access a particular sentence and print out the `[i,j,k]` locations of the word `'monster'` from `index`.

In [429]:
# B4:SanityCheck

# load the book
document, index = load_book("84")

In [430]:
# B4:SanityCheck

# Output paragraph 9, sentence 5
document[9][5]

['There', ',', 'Margaret', ',', 'the', 'sun', 'is', 'forever', '\n', 'visible']

In [431]:
# B4:SanityCheck

# Output the indices for monster
index['monster']

[[124, 10, 57],
 [136, 3, 6],
 [139, 3, 4],
 [142, 1, 4],
 [243, 3, 29],
 [261, 3, 18],
 [280, 0, 2],
 [321, 1, 35],
 [345, 9, 6],
 [380, 13, 5],
 [397, 1, 46],
 [437, 0, 16],
 [439, 0, 3],
 [477, 7, 8],
 [478, 7, 6],
 [510, 1, 8],
 [527, 0, 1],
 [538, 19, 22],
 [560, 3, 31],
 [585, 4, 43],
 [587, 11, 72],
 [606, 3, 2],
 [615, 2, 11],
 [633, 10, 9],
 [639, 1, 21],
 [644, 4, 17],
 [653, 8, 5],
 [663, 5, 2],
 [673, 0, 39],
 [673, 2, 2],
 [709, 11, 1]]

__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic` that takes a `document` and `index` from our new `load_book` function as well as a provided set of `search_terms` (just like our original `kwic` function). The function should loop through all specified `search_terms`, look up the indices in `index[word]` for the key word-containing sentences, and use them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
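A minimal sketch of this lookup on a hand-built toy `document` and `index` follows (the names `toy_document`, `toy_index`, and `toy_fast_kwic` are hypothetical). It uses a plain `dict` with `.get()` for the index, which is one way, among others, to avoid adding empty entries for terms that never occur.

```python
from collections import defaultdict

# Hand-built toy data in the same shapes as `document` and `index`.
toy_document = [
    [["The", "sea", "was", "calm", "."], ["We", "crossed", "the", "sea", "."]],
    [["No", "monster", "here", "."]],
]
toy_index = {"sea": [[0, 0, 1], [0, 1, 3]], "monster": [[1, 0, 1]]}

def toy_fast_kwic(document, index, search_terms):
    data = defaultdict(list)
    for term in search_terms:
        # Missing terms yield an empty list, so they produce no entries.
        for i, j, k in index.get(term, []):
            data[term].append([[i, j, k], document[i][j]])
    return data

toy_results = toy_fast_kwic(toy_document, toy_index, {"sea", "dragon"})
print(toy_results["sea"])
```

No text is re-processed here: each hit costs one dictionary lookup and two list indexing operations.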

In [432]:
# B5:Function(5/5)

def fast_kwic(document, index, search_terms = {}):
    
    data = defaultdict(list)
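    for term in search_terms:
        # A membership test avoids inserting empty entries into the defaultdict.
        if term in index:
            for i, j, k in index[term]:
                # Pair each location with its full sentence from the document.
                data[term].append([[i, j, k], document[i][j]])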
            
    
    return(data)

To check our new function, let's run it on the same keywords as before: `Frankenstein` and `monster`. Note that the output from this sanity check should be the same as the one from **B3**. 

In [433]:
# B5:SanityCheck

fast_results = fast_kwic(document, index, {'Frankenstein', 'monster'})

print("# of sentences 'Frankenstein' appears in: {}".format(len(fast_results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(fast_results['monster'])))
print()

print(" ".join(fast_results['Frankenstein'][7][1]))
print()
print(" ".join(fast_results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the signature: `load_book(book_id, pos = True, lemma = True):`, so that it accepts two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both `document` and `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`.

Note that this function's output should still consist of a `document` and `index` in the same format, aside from the replacement of `word` with `heading`. This allows the same use of the output in `fast_kwic`, with key terms now specified more precisely by their textual features.
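The four flag combinations can be sketched with a small helper. The function `make_heading` is hypothetical, and the text, lemma, and tag strings passed to it below are made-up values rather than actual `spacy` output.

```python
def make_heading(text, lemma_text, pos_tag, pos=True, lemma=True):
    """Return the (text, tag) key for one word under the given flags.

    `text`, `lemma_text`, and `pos_tag` play the roles of word.text,
    word.lemma_, and word.pos_ respectively.
    """
    return (lemma_text if lemma else text, pos_tag if pos else "")

# Made-up attribute values for a single word.
example = ("monsters", "monster", "NOUN")

for pos_flag in (True, False):
    for lemma_flag in (True, False):
        print(pos_flag, lemma_flag, make_heading(*example, pos=pos_flag, lemma=lemma_flag))
```

Because the key is a tuple, the same surface form can be indexed separately by part of speech, which is what distinguishes `('cold', 'NOUN')` from `('cold', 'ADJ')` in a search.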

In [436]:
# B6:Function(5/5)

def load_book(book_id, pos = True, lemma = True):
    
    with open("./data/books/"+book_id+".txt", "r") as fh:
        document = []
        index = defaultdict(list)
        paragraphs = fh.read()
        paragraphs = paragraphs.strip()
        paragraphs = re.split('\n{2,}', paragraphs)
        for i, paragraph in enumerate(paragraphs):
            doc = nlp(paragraph)
            doc = list(doc.sents)
            paragraph_list = []
            for j, sentence in enumerate(doc):
                sentences_list = []
                for k, word in enumerate(sentence):
                    if lemma: 
                        text = word.lemma_
                    else: 
                        text = word.text
                    if pos: 
                        tag = word.pos_
                    else: 
                        tag = ""
                    heading = (text, tag)   
                    sentences_list.append(word.text.strip())
                    index[heading].append([i,j,k])
                paragraph_list.append(sentences_list)
            document.append(paragraph_list)
    return document, index

In [437]:
# B6:SanityCheck
document, index = load_book("84", pos = True, lemma = True)

In [438]:
# B6:SanityCheck
print("Sentence with ('cold', 'NOUN'):")
" ".join(fast_kwic(document, index, search_terms = {('cold', 'NOUN')})[('cold', 'NOUN')][0][1])

Sentence with ('cold', 'NOUN'):


'The  cold is not excessive , if you are wrapped in furs -- a dress which I have  already adopted , for there is a great difference between walking the  deck and remaining seated motionless for hours , when no exercise  prevents the blood from actually freezing in your veins . '

In [439]:
# B6:SanityCheck
print("Sentence with ('cold', 'ADJ'):")
" ".join(fast_kwic(document, index, search_terms = {('cold', 'ADJ')})[('cold', 'ADJ')][0][1])

Sentence with ('cold', 'ADJ'):


'I am already far north of London , and as I walk in the streets of  Petersburgh , I feel a cold northern breeze play upon my cheeks , which  braces my nerves and fills me with delight . '