# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Initialize `spacy`'s English model and write a function called `load_book(book_id)`, which reads the book with the specified number (passed as a string, `book_id`) and uses a regular expression with `re.split()` to split the loaded `book` (a string) into a list of `paragraphs` (strings).

Test your code on `book_id = '84'`, and print the number of paragraphs in the resulting output.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you are only required to determine a _reasonable_ split criterion, and _not_ to remove markup or non-substantive content.
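
As a minimal sketch of one _reasonable_ split criterion, the example below treats any run of two or more newlines as a paragraph break. The `sample` string is a hypothetical stand-in for a loaded book, not text from the data set.

```python
import re

# Hypothetical sample text standing in for a loaded book.
sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n\n\n\nThird paragraph.\n"

# Two or more consecutive newlines (one or more blank lines) mark a paragraph break.
paragraphs = re.split(r"\n\n+", sample.strip())

print(len(paragraphs))
for p in paragraphs:
    print(repr(p))
```

Single newlines inside a paragraph (hard-wrapped lines) are left in place by this criterion, which is why the KWiC output further down still contains line breaks.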

In [1]:
from collections import defaultdict
import spacy, json, re

nlp = spacy.load("en_core_web_sm")  ## the "en" shortcut only works in spaCy 2.x

def load_book(book_id):
    ## load the book
    with open("./data/books/"+book_id+".txt") as f:
        book = f.read().strip()
        
    paragraphs = re.split("\n\n+", book)
    
    return paragraphs

paragraphs = load_book('84')
print(len(paragraphs))

723


__B2.__ _(10 points)_ Write a function called `kwic(paragraphs, search_terms)` that accepts a list of `paragraphs` (strings) and a set of `search_terms` (strings). The function should:

1. initialize `data` as a `defaultdict` of lists;
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`,

where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
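
The six steps above can be sketched without `spacy` to show the shape of `data`. The helpers `toy_sentences` and `toy_words` are hypothetical, deliberately naive stand-ins for `doc.sents` and for spaCy's tokens; they are assumptions for illustration only and will not reproduce spaCy's segmentation on real text.

```python
import re
from collections import defaultdict

def toy_sentences(paragraph):
    """Hypothetical stand-in for doc.sents: naive split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def toy_words(sentence):
    """Hypothetical stand-in for spaCy tokens: word runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def toy_kwic(paragraphs, search_terms):
    data = defaultdict(list)                                      # step 1
    for i, paragraph in enumerate(paragraphs):                    # step 2
        for j, sentence in enumerate(toy_sentences(paragraph)):   # step 3
            words = toy_words(sentence)
            for k, word in enumerate(words):                      # step 4
                if word in search_terms:                          # step 5
                    data[word].append([[i, j, k], words])         # step 6
    return data

paragraphs = ["The monster fled. I followed.", "No monster was seen."]
data = toy_kwic(paragraphs, {"monster"})
print(data["monster"])
```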

In [2]:
def kwic(paragraphs, search_terms = {}):
    
    ## set up the kwic data we're storing
    data = defaultdict(list)

    for i, paragraph in enumerate(paragraphs):
        
        ## run NLP on the paragraph
        doc = nlp(paragraph)

        ## loop over sentences
        for j, sentence in enumerate(doc.sents):
            ## loop over words in sentences
            for k, word in enumerate(sentence):
                ## check if word is a match
                if word.text in search_terms:
                    ## store with index in the data
                    data[word.text].append([[i, j, k], [x.text for x in sentence]])

    return data

__B3.__ _(2 points)_ Prove your `kwic` search function's utility using the pre-processed paragraphs from book `84` and __B1__. Exhibit examples of the key words `Frankenstein` and `monster` in context, comment on the run time of this program, and explain why it runs so slowly and, in particular, why it would not support repeated queries. Note: if you think it doesn't run slowly, then just confirm that `kwic` functions correctly, and proceed to part __B5__. You can comment here after completing the module.

_Response._ Every call has to process and search the entire document, one word at a time. We can provide a batch of key words to capture several results in a single pass, but if we want to come back and search for _another_ word afterwards (e.g., motivated by earlier output), we have to wait just as long all over again. Pre-computing an index would help _a lot_.
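
One way to put numbers on the run time is a small timing wrapper around each call. The sketch below is an assumption, not part of the assignment: `timed` is a hypothetical helper, and `slow_scan` is a stand-in workload so that the example runs without `spacy` or the book files.

```python
import time

def timed(fn, *args, **kwargs):
    """Hypothetical helper: run fn once and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload. In the notebook one could instead time, for example,
# timed(kwic, paragraphs, {"Frankenstein"}) and then timed(kwic, paragraphs, {"monster"}).
def slow_scan(words, term):
    return [k for k, w in enumerate(words) if w == term]

words = ["the", "monster", "fled"] * 1000
hits, seconds = timed(slow_scan, words, "monster")
print(len(hits), f"{seconds:.6f} s")
```

Timing two separate `kwic` calls this way should show roughly the same cost for each, since `nlp(paragraph)` is re-run on every paragraph in every call.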

In [3]:
results = kwic(paragraphs, {"Frankenstein", "monster"})

In [4]:
print(len(results['Frankenstein']), len(results['monster']))

27 31


In [5]:
print(" ".join(results['Frankenstein'][7][1]))
print()
print(" ".join(results['monster'][0][1]))

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor in this process. Presently, our pre-processing `load_book` function just splits a document into paragraphs. This function should now be modified to:

1. split a `book` into paragraphs and loop over them;
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instances in `document`, and the triple-list format of our document will give us fast access to the sentence for KWiC. Exhibit this modified version of `load_book` on `book_id = '84'` and print out the `[i,j,k]` locations of the word `'monster'` from `index`.
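
To illustrate how the two structures relate, here is a small sketch that builds an inverted index from a hand-written triple-nested list. The toy `document` is hypothetical; in the assignment the sentences and words come from `spacy`.

```python
from collections import defaultdict

# Hypothetical triple-nested document: document[i][j][k] is a word string.
document = [
    [["The", "monster", "fled", "."], ["I", "followed", "."]],
    [["No", "monster", "was", "seen", "."]],
]

# Inverted index: word -> list of [i, j, k] locations.
index = defaultdict(list)
for i, paragraph in enumerate(document):
    for j, sentence in enumerate(paragraph):
        for k, word in enumerate(sentence):
            index[word].append([i, j, k])

print(index["monster"])    # [[0, 0, 1], [1, 0, 1]]

# Any stored location leads straight back to the word and its sentence.
i, j, k = index["followed"][0]
print(document[i][j][k])   # followed
print(document[i][j])      # ['I', 'followed', '.']
```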

In [6]:
def load_book(book_id):
    ## load the book
    with open("./data/books/"+book_id+".txt") as f:
        book = f.read().strip()

    ## set up the kwic index for fast access
    ## keys are word strings (word.text)
    ## values are lists of [i,j,k] indices: (paragraph, sentence, word)
    index = defaultdict(list)
        
    paragraphs = re.split("\n\n+", book)
    
    ## documents look like a triple-nested list: paragraphs/sentences/words
    document = []

    for i, paragraph in enumerate(paragraphs):
        ## create the new paragraph
        document.append([])
        
        ## run NLP on the paragraph
        doc = nlp(paragraph)

        ## loop over sentences
        for j, sentence in enumerate(doc.sents):
            ## create the new sentence
            document[-1].append([])
            
            ## loop over words in sentences
            for k, word in enumerate(sentence):
                ## append the new word
                document[-1][-1].append(word.text)
                
                ## store the location of this instance in the index
                location = [i,j,k]
                index[word.text].append(location)
                    
    return document, index

In [7]:
document, index = load_book("84")

In [8]:
index['monster']

[[124, 9, 57],
 [136, 3, 6],
 [139, 3, 4],
 [142, 1, 4],
 [243, 3, 29],
 [261, 3, 18],
 [280, 0, 2],
 [321, 1, 35],
 [345, 9, 6],
 [380, 13, 5],
 [397, 1, 46],
 [437, 0, 16],
 [439, 0, 3],
 [477, 7, 8],
 [478, 7, 6],
 [510, 1, 8],
 [527, 0, 1],
 [538, 19, 8],
 [560, 3, 31],
 [585, 3, 43],
 [587, 11, 72],
 [606, 3, 2],
 [615, 2, 11],
 [633, 8, 9],
 [639, 1, 21],
 [644, 4, 17],
 [653, 8, 5],
 [663, 5, 2],
 [673, 0, 39],
 [673, 1, 2],
 [709, 11, 1]]

__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic(document, index, search_terms)` that loops through all specified `search_terms`, identifies the indices in `index[word]` for the key word-containing sentences, and uses them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
Confirm your output again by exhibiting examples of the key words `Frankenstein` and `monster` in context.
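
A minimal sketch of the lookup on toy data follows; `lookup`, the toy `document`, and the hand-written `index` are hypothetical. One design choice here is an assumption rather than a requirement: reading with `index.get(term, [])` instead of `index[term]`, because indexing a `defaultdict` with an unseen term silently adds an empty entry for it.

```python
from collections import defaultdict

# Hypothetical toy document and a matching hand-written index.
document = [
    [["The", "monster", "fled", "."], ["I", "followed", "."]],
    [["No", "monster", "was", "seen", "."]],
]
index = defaultdict(list, {
    "monster": [[0, 0, 1], [1, 0, 1]],
    "fled": [[0, 0, 2]],
})

def lookup(document, index, search_terms):
    data = defaultdict(list)
    for term in search_terms:
        # .get() reads without creating empty entries for unseen terms
        for i, j, k in index.get(term, []):
            data[term].append([[i, j, k], document[i][j]])
    return data

found = lookup(document, index, {"monster", "ghost"})
print(found["monster"])
print("ghost" in index)   # False: the failed query left the index unchanged
```

No text is re-processed here: each query costs only a dictionary access plus one list lookup per hit.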

In [9]:
def fast_kwic(document, index, search_terms = {}):

    ## set up the kwic data we're storing
    data = defaultdict(list)
    
    for term in search_terms:
        for i, j, k in index[term]:
            
            ## store the data
            data[term].append([[i, j, k], document[i][j]])
    
    return data

In [10]:
fast_results = fast_kwic(document, index, {'Frankenstein', 'monster'})

In [11]:
print(len(fast_results['Frankenstein']), len(fast_results['monster']))

27 31


In [12]:
print(" ".join(fast_results['Frankenstein'][7][1]))
print()
print(" ".join(fast_results['monster'][0][1]))

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the input: `load_book(book_id, pos = True, lemma = True):`, to accept two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both the `document` and `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. When you've completed this part, exhibit your function's utility by using its output in the `fast_kwic` function to search for the key terms `('cold', 'NOUN')` and `('cold', 'ADJ')`.

Note that this function's output should still consist of a `document` and `index` in the same format, aside from the replacement of `word` with `heading`. This allows the output to be used in `fast_kwic` in the same way, with searches made more specific by the textual features.
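
The effect of the two flags on a `heading` can be sketched as below. `Token` is a hypothetical stand-in exposing only the three attributes used here, `make_heading` is a hypothetical helper, and the attribute values are written by hand rather than produced by `spacy`.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, with only the attributes used here.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

def make_heading(word, pos=True, lemma=True):
    """Build the (text, tag) key described above from a token-like object."""
    return (word.lemma_ if lemma else word.text,
            word.pos_ if pos else "")

# Hand-written attribute values, not actual spaCy output.
word = Token(text="colds", lemma_="cold", pos_="NOUN")

print(make_heading(word))                          # ('cold', 'NOUN')
print(make_heading(word, pos=False))               # ('cold', '')
print(make_heading(word, pos=False, lemma=False))  # ('colds', '')
```

Search terms have to be built with the same flags as the index: with `pos = False`, for instance, the key to look up is `('cold', '')` rather than `('cold', 'NOUN')`.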

In [13]:
def load_book(book_id, pos = True, lemma = True):
    ## load the book
    with open("./data/books/"+book_id+".txt") as f:
        book = f.read().strip()

    ## set up the kwic index for fast access
    ## keys are headings: (lemma or text, POS or "")
    ## values are lists of (i,j,k) indices: (paragraph, sentence, word)
    index = defaultdict(list)
        
    paragraphs = re.split("\n\n+", book)
    
    ## documents look like a triple-nested list: paragraphs/sentences/words
    document = []

    for i, paragraph in enumerate(paragraphs):
        ## create the new paragraph
        document.append([])
        
        ## run NLP on the paragraph
        doc = nlp(paragraph)

        ## loop over sentences
        for j, sentence in enumerate(doc.sents):
            ## create the new sentence
            document[-1].append([])
            
            ## loop over words in sentences
            for k, word in enumerate(sentence):
                ## append the new word
                document[-1][-1].append(word.text)
                
                ## create the heading for this entry
                heading = (word.lemma_ if lemma else word.text, 
                           word.pos_ if pos else "")
                
                ## store the location of this instance in the index
                location = [i,j,k]
                index[heading].append(location)
                    
    return document, index

In [14]:
document, index = load_book("84", pos = True, lemma = True)

In [15]:
" ".join(fast_kwic(document, index, search_terms = {('cold', 'NOUN')})[('cold', 'NOUN')][0][1])

'The \n cold is not excessive , if you are wrapped in furs -- a dress which I have \n already adopted , for there is a great difference between walking the \n deck and remaining seated motionless for hours , when no exercise \n prevents the blood from actually freezing in your veins .  '

In [16]:
" ".join(fast_kwic(document, index, search_terms = {('cold', 'ADJ')})[('cold', 'ADJ')][0][1])

'I am already far north of London , and as I walk in the streets of \n Petersburgh , I feel a cold northern breeze play upon my cheeks , which \n braces my nerves and fills me with delight .  '