# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll build up some code to discover how different terms are used similarly across a document. Our first task is to create a word-frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key will consist of a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.

In [23]:
# C1:Function(12/12)

from collections import Counter
import spacy, json, re

nlp = spacy.load("en")

def count_words(paragraph, pos = True, lemma = True):
    
    frequency = Counter()
    # tokenize, tag, and lemmatize the paragraph with spaCy
    for word in nlp(paragraph):
        # key on the lemma or the raw text, per the `lemma` argument
        text = word.lemma_ if lemma else word.text
        # key on the part of speech, or leave the tag empty as ""
        tag = word.pos_ if pos else ""
        frequency[(text, tag)] += 1
    return frequency

Let's make sure your function works by testing it on a short sentence. 

In [24]:
# C1:SanityCheck

count_words("The quick brown fox jumps over the lazy dog.")

Counter({('the', 'DET'): 2,
         ('quick', 'ADJ'): 1,
         ('brown', 'ADJ'): 1,
         ('fox', 'PROPN'): 1,
         ('jump', 'VERB'): 1,
         ('over', 'ADP'): 1,
         ('lazy', 'ADJ'): 1,
         ('dog', 'NOUN'): 1,
         ('.', 'PUNCT'): 1})
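
To see how the `pos` and `lemma` flags change the keys, here is a minimal sketch that does not load spaCy at all. The `Token` namedtuple and the hand-tagged `tokens` list are hypothetical stand-ins for spaCy tokens (the tags were written by hand, not produced by a model), and `count_tokens` is an illustrative name rather than the assignment's `count_words`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token: only the three attributes used here.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

# Hand-tagged example tokens (an assumption for illustration, not model output).
tokens = [
    Token("The", "the", "DET"),
    Token("dogs", "dog", "NOUN"),
    Token("chase", "chase", "VERB"),
    Token("the", "the", "DET"),
    Token("dog", "dog", "NOUN"),
]

def count_tokens(tokens, pos=True, lemma=True):
    """Count (text, tag) headings, mirroring the flag logic described in C1."""
    frequency = Counter()
    for word in tokens:
        text = word.lemma_ if lemma else word.text
        tag = word.pos_ if pos else ""
        frequency[(text, tag)] += 1
    return frequency

with_both = count_tokens(tokens)                          # lemmas + POS tags
text_only = count_tokens(tokens, pos=False, lemma=False)  # raw text, empty tags

print(with_both[("dog", "NOUN")])                      # "dogs" and "dog" share one lemma
print(text_only[("The", "")], text_only[("the", "")])  # case variants stay separate
```

With lemmas enabled, inflected forms collapse onto one heading; with both flags off, every distinct surface form gets its own heading and the tag slot is the empty string.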

__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words` appropriately, passing along the `lemma` and `pos` arguments that the caller of `book_TDM` specified.

In [44]:
# C2:Function(8/8)

import numpy as np
from collections import Counter
import re

def book_TDM(book_id, pos = True, lemma = True):

    with open("./data/books/"+book_id+".txt", "r") as fh:
        paragraphs = fh.read()
        paragraphs = paragraphs.strip()

        ## the 'master' set, keeps track of the words in all documents
        all_words = set()
        ## store the word frequencies by document (here, each sentence is a document)
        all_doc_frequencies = {}
        
        doc = nlp(paragraphs)

        for j, sentence in enumerate(doc.sents):
            frequency = count_words(sentence.text, pos, lemma)
            all_doc_frequencies[j] = frequency
            doc_words = set(frequency.keys())
            all_words = all_words.union(doc_words)
        
            
        ## create a matrix of zeros: (words) x (documents)
        TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
        ## fix a word ordering for the rows
        all_words = sorted(list(all_words))
        ## loop over the (sorted) document numbers and (ordered) words; fill in matrix
        
        for j in all_doc_frequencies:
            for i, word in enumerate(all_words):
                TDM[i,j] = all_doc_frequencies[j][word]

    return TDM, all_words


To test your function, let's process `book_id = 84` with both `pos = True` and `lemma = True`, then print the first ten terms in `all_words` and the `TDM`'s `.shape` attribute.

In [35]:
# C2:SanityCheck

TDM, terms = book_TDM("84", pos = True, lemma = True)
terms[:10]

[('\n', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE'),
 ('\n     ', 'SPACE'),
 ('\n                              ', 'SPACE'),
 (' ', 'SPACE'),
 ('  ', 'SPACE'),
 ('    ', 'SPACE'),
 ('     ', 'SPACE'),
 ('               ', 'SPACE')]

In [36]:
# C2:SanityCheck

TDM.shape

(6262, 40)
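
The matrix construction inside `book_TDM` can be sketched on a toy input. The two hand-written `Counter`s below are assumed stand-ins for per-sentence `count_words` output; the variable names follow the cell above, and the data is invented purely for illustration.

```python
import numpy as np
from collections import Counter

# Assumed toy data: one Counter of (text, tag) headings per "document" (sentence).
all_doc_frequencies = {
    0: Counter({("dog", "NOUN"): 2, ("bark", "VERB"): 1}),
    1: Counter({("dog", "NOUN"): 1, ("sleep", "VERB"): 1}),
}

# Union of all headings, sorted to fix the row order.
all_words = sorted(set().union(*all_doc_frequencies.values()))

# Rows are terms, columns are documents.
TDM = np.zeros((len(all_words), len(all_doc_frequencies)))
for j, frequency in all_doc_frequencies.items():
    for i, word in enumerate(all_words):
        TDM[i, j] = frequency[word]  # a Counter returns 0 for unseen headings

print(all_words)
print(TDM.shape)
```

Each row of the result describes how one term is distributed across the documents, which is the vector that the similarity functions in the next part compare.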

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which should take two arbitrary numeric vectors and compute and output the `cosine_similarity`, as described in __Section 1.1.2.10__.  

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim` function) to output a list of cosine similarity values (`sim_values`) between the word/row `i` and all others (rows) in the `TDM`.

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!

In [38]:
# C3:Function(4/8)
def sim(u,v):
    
    cosine_similarity = np.dot(u,v) / (np.linalg.norm(u) * np.linalg.norm(v))
    
    return cosine_similarity

In [39]:
# C3:SanityCheck

print("Exactly similar:", sim(np.array([1,2,3]), np.array([1,2,3])))
print("Exactly dissimilar:", sim(np.array([1,2,3]), np.array([-1,-2,-3])))
print("In the middle:", sim(np.array([1,1]), np.array([-1,1])))

Exactly similar: 1.0
Exactly dissimilar: -1.0
In the middle: 0.0
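
For reference, the quantity computed by `sim` is the standard cosine similarity of the vectors `u` and `v`, and the third sanity check can be verified by hand:

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\mathrm{sim}\big((1,1),\,(-1,1)\big) = \frac{(1)(-1) + (1)(1)}{\sqrt{2}\,\sqrt{2}} = \frac{0}{2} = 0 .
$$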


In [42]:
# C3:Function(4/8)

def term_sims(i, TDM):
    
    # compare row i of the TDM (not the integer index i) to every row
    sim_values = [sim(TDM[i], row) for row in TDM]
    
    return sim_values

In [43]:
# C3:SanityCheck

# Compare word/row 0 to all other (rows) in the TDM
term_sims(0, TDM)
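
As an alternative to comparing rows one at a time, the similarities between row `i` and every row can be computed with one matrix-vector product. The sketch below is one possible formulation, not the assignment's required solution: `term_sims_vectorized` and `toy_TDM` are hypothetical names, and the choice to report 0.0 for any all-zero row (where cosine similarity is undefined and would otherwise appear as `nan`) is an assumption.

```python
import numpy as np

def term_sims_vectorized(i, TDM):
    """Cosine similarity between row i of TDM and every row of TDM."""
    norms = np.linalg.norm(TDM, axis=1)   # length of each row vector
    dots = TDM @ TDM[i]                   # dot product of each row with row i
    denom = norms * norms[i]
    # Assumption: similarity with an all-zero row is reported as 0.0 instead of nan.
    return np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom != 0)

# Toy term-document matrix (rows are terms, columns are documents).
toy_TDM = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
])

sims = term_sims_vectorized(0, toy_TDM)
print(sims)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 up to floating-point rounding, while row 2 shares no documents with row 0 and scores 0.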

__C4.__ _(7 pts)_ Finally, your goal now is to write a function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms (`top_n_terms`) most similar to the one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`. 

\[Hint: to locate the row containing the term of interest, apply the list `.index()` method to the `terms` argument.\]

In [None]:
# C4:Function(6/7)

def most_similar(term, terms, TDM, top = 25):
    
    #---Your code starts here---

    #---Your code ends here---
    
    return top_n_terms
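
One way the pieces could fit together is sketched below on a toy matrix. This is an illustrative outline under assumptions, not a reference solution: the names `most_similar_sketch`, `toy_terms`, and `toy_TDM` are hypothetical, `sim` is redefined locally so the block runs on its own, and the query term itself is left in the ranking (the assignment text does not say whether to exclude it).

```python
import numpy as np

def sim(u, v):
    """Cosine similarity of two numeric vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar_sketch(term, terms, TDM, top=25):
    """Return up to `top` entries of [row_ix, similarity, term], most similar first."""
    i = terms.index(term)                            # row of the query term
    sim_values = [sim(TDM[i], row) for row in TDM]   # compare row i to every row
    ranked = sorted(enumerate(sim_values), key=lambda pair: pair[1], reverse=True)
    return [[ix, float(value), terms[ix]] for ix, value in ranked[:top]]

# Toy data, invented for illustration: rows are terms, columns are documents.
toy_terms = [("cat", "NOUN"), ("dog", "NOUN"), ("run", "VERB")]
toy_TDM = np.array([
    [1.0, 1.0, 0.0],   # cat
    [2.0, 1.0, 0.0],   # dog
    [0.0, 0.0, 3.0],   # run
])

top_n_terms = most_similar_sketch(("dog", "NOUN"), toy_terms, toy_TDM, top=2)
print(top_n_terms)
```

Sorting `enumerate(sim_values)` keeps each similarity paired with its row index, so the term can be looked up again from `terms` when the output rows are assembled.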

Now, let's test your function's utility on the `TDM` produced for `book_id = 84` and exhibit the top 25 terms most similar to each of `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

In [None]:
# C4:SanityCheck

most_similar(('monster', 'NOUN'), terms, TDM, top = 25)

In [None]:
# C4:SanityCheck

most_similar(('beautiful', 'ADJ'), terms, TDM, top = 25)

In [None]:
# C4:Inline

# Comment on the ordered results returned in the sanity checks.
# Do you think the algorithm is exhibiting sensible results? print "Yes" or "No"
print("<ANSWER>")