# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll build up code to discover which terms are used similarly across a document. Our first task is to create a word-frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key is a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.

In [1]:
# C1:Function(12/12)

from collections import Counter
import spacy, json, re

nlp = spacy.load("en")  # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as "en_core_web_sm"

def count_words(paragraph, pos = True, lemma = True):

    #---Your code starts here
    doc = nlp(paragraph)      
    frequency = Counter()
    for word in doc: 
        if lemma: 
            text = word.lemma_
        else: 
            text = word.text
        if pos: 
            tag = word.pos_
        else: 
            tag = ""
        heading = (text, tag)
        frequency[heading] += 1
    #---Your code ends here
    
    return frequency

Let's make sure your function works by testing it on a short sentence. 

In [2]:
# C1:SanityCheck

count_words("The quick brown fox jumps over the lazy dog.")

Counter({('the', 'DET'): 2,
         ('quick', 'ADJ'): 1,
         ('brown', 'ADJ'): 1,
         ('fox', 'PROPN'): 1,
         ('jump', 'VERB'): 1,
         ('over', 'ADP'): 1,
         ('lazy', 'ADJ'): 1,
         ('dog', 'NOUN'): 1,
         ('.', 'PUNCT'): 1})
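
To see how the `(text, tag)` headings behave as `Counter` keys without loading a `spacy` model, here is a minimal sketch using hand-tagged tokens. The `tokens` list and the `count_headings` helper are hypothetical stand-ins for `spacy`'s output and are not part of the assignment.

```python
from collections import Counter

# Hypothetical hand-tagged tokens (text, lemma, pos), standing in for spacy tokens
tokens = [
    ("The", "the", "DET"),
    ("dogs", "dog", "NOUN"),
    ("chase", "chase", "VERB"),
    ("the", "the", "DET"),
    ("dog", "dog", "NOUN"),
]

def count_headings(tokens, pos=True, lemma=True):
    """Count (text, tag) headings, mirroring the branching in count_words."""
    return Counter(
        (lem if lemma else text, tag if pos else "")
        for text, lem, tag in tokens
    )

with_lemma = count_headings(tokens)
surface_only = count_headings(tokens, pos=False, lemma=False)
print(with_lemma[("dog", "NOUN")])   # 2: "dogs" and "dog" share a lemma
print(surface_only[("dog", "")])     # 1: surface forms stay distinct
```

With lemmatization on, inflected forms collapse into one heading; with it off, each surface form (including capitalization differences such as "The" and "the") is counted separately.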

__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words` appropriately, passing through the `lemma` and `pos` arguments specified by the caller of `book_TDM`.

In [3]:
# C2:Function(8/8)

import numpy as np
from collections import Counter
import re

def book_TDM(book_id, pos = True, lemma = True):

    #---Your code starts here---
    text = open(f"./data/books/{book_id}.txt", "r").read()
    all_words = set()
    all_doc_frequencies = {}
    doc = nlp(text)
    for j, sentence in enumerate(doc.sents):
        frequency = count_words(sentence.text, pos, lemma)
        all_doc_frequencies[j] = frequency
        doc_words = set(frequency.keys())
        all_words = all_words.union(doc_words)
    
    TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
    all_words = sorted(list(all_words))
    for j in all_doc_frequencies:
        for i, word in enumerate(all_words):
            TDM[i,j] = all_doc_frequencies[j][word]
    #---Your code ends here---

    return(TDM, all_words)
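
As a small illustration of the term-document layout, the sketch below builds a matrix from two hand-written frequency tables instead of a book. The `doc_frequencies` data and the names `vocab`, `row_of`, and `toy_TDM` are assumptions made up for this example. It also writes only the observed counts, which is one possible alternative to looping over every term for every sentence.

```python
import numpy as np
from collections import Counter

# Hypothetical per-sentence frequencies, shaped like the output of count_words
doc_frequencies = [
    Counter({("monster", "NOUN"): 2, ("see", "VERB"): 1}),
    Counter({("see", "VERB"): 1, ("friend", "NOUN"): 1}),
]

# A sorted vocabulary fixes the row order of the matrix
vocab = sorted(set().union(*doc_frequencies))
row_of = {term: i for i, term in enumerate(vocab)}

# Rows are terms, columns are sentences; only observed counts are written
toy_TDM = np.zeros((len(vocab), len(doc_frequencies)))
for j, frequency in enumerate(doc_frequencies):
    for term, count in frequency.items():
        toy_TDM[row_of[term], j] = count

print(vocab)
print(toy_TDM)
```

Each row of `toy_TDM` is the usage profile of one term across sentences, which is the vector that the similarity functions in C3 compare.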


To test your code, let's process `book_id = 84` with both `pos = True` and `lemma = True`, then print the first ten terms in `all_words` and the `TDM`'s `.shape` attribute.

In [4]:
# C2:SanityCheck

TDM, terms = book_TDM("84", pos = True, lemma = True)
terms[:10]

[('\n', 'SPACE'),
 ('\n\n', 'SPACE'),
 ('\n\n  ', 'SPACE'),
 ('\n\n    ', 'SPACE'),
 ('\n\n     ', 'SPACE'),
 ('\n\n               ', 'SPACE'),
 ('\n\n                    ', 'SPACE'),
 ('\n\n                                                ', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE')]

In [5]:
# C2:SanityCheck

TDM.shape

(6266, 3902)

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which should take two arbitrary numeric vectors and compute and output the `cosine_similarity`, as described in __Section 1.1.2.10__.  

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim` function) to output a list of cosine similarity values (`sim_values`) between the word/row `i` and all others (rows) in the `TDM`.

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!

In [6]:
# C3:Function(4/8)
def sim(u,v):
    
    #---Your code starts here
    cosine_similarity = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)) 
    #---Your code ends here
    
    return cosine_similarity

In [7]:
# C3:SanityCheck

print("Exactly similar:", sim(np.array([1,2,3]), np.array([1,2,3])))
print("Exactly dissimilar:", sim(np.array([1,2,3]), np.array([-1,-2,-3])))
print("In the middle:", sim(np.array([1,1]), np.array([-1,1])))

Exactly similar: 1.0
Exactly dissimilar: -1.0
In the middle: 0.0


In [8]:
# C3:Function(4/8)

def term_sims(i, TDM):
    
    #---Your code starts here
    sim_values = [sim(TDM[j,],TDM[i,]) for j in range(len(TDM)) if j != i]
    
    #---Your code ends here
    
    return sim_values
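
For comparison, here is a hedged sketch of a fully vectorized variant. The name `term_sims_vectorized` and the `toy` matrix are hypothetical; unlike `term_sims` above, this version keeps row `i` in the result and reports `0.0` for all-zero rows instead of dividing by zero.

```python
import numpy as np

def term_sims_vectorized(i, TDM):
    """Cosine similarity between row i and every row of TDM (row i included)."""
    norms = np.linalg.norm(TDM, axis=1)   # one norm per row
    dots = TDM @ TDM[i]                   # dot product of each row with row i
    denom = norms * norms[i]
    # Rows with zero norm would divide by zero; leave their similarity at 0.0
    return np.divide(dots, denom, out=np.zeros(len(TDM)), where=denom > 0)

toy = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.0, 1.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 0.0]])
print(term_sims_vectorized(0, toy))
```

Because row `i` is kept, position `k` of the result corresponds to row `k` of the `TDM`. In `term_sims`, dropping row `i` shifts every later position by one, which matters when similarity values are mapped back to terms.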

In [9]:
# C3:SanityCheck

# Compare word/row 0 to all other rows in the TDM
term_sims(0, TDM)

[0.3778103392705014,
 0.009957275377167391,
 0.0,
 0.008623253429104237,
 0.018292682926829267,
 0.0,
 0.018292682926829267,
 0.0,
 0.0,
 0.0,
 0.0,
 0.6101046533478747,
 0.006097560975609756,
 0.07578592925955181,
 0.20896153821725802,
 0.058667101883215104,
 0.06814184302332854,
 0.1636367198104125,
 0.06251858736100573,
 0.06251858736100573,
 0.7973957529858674,
 0.007712872341874096,
 0.008623253429104237,
 0.212658952698075,
 0.008623253429104237,
 0.1793193238625128,
 0.025869760287312714,
 0.0,
 0.008623253429104237,
 0.008623253429104237,
 0.04311626714552119,
 0.7261135704346007,
 0.8130088819528111,
 0.7705401601405453,
 0.0,
 0.0,
 0.0,
 0.012195121951219513,
 0.0,
 0.012195121951219513,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.013000043680220149,
 0.012195121951219513,
 0.008623253429104237,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.008623253429104237,
 0.0,
 0.0,
 0.0,
 0.017246506858208475,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0049786376885836954,
 0.0,
 0.0,
 0.017246506858

__C4.__ _(7 pts)_ Finally, your goal is to write a function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms (`top_n_terms`) most similar to the one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`. 

\[Hint: to locate the row containing the term of interest, utilize the list `.index()` method in application to the `terms` argument.\]

In [10]:
# C4:Function(6/7)

def most_similar(term, terms, TDM, top = 25):
    
    #---Your code starts here---
    top_n_terms = []
    sorted_sim_values = sorted(term_sims(terms.index(term), TDM), reverse = True)
    for row_ix in range(len(terms)):
        if sim(TDM[row_ix,], TDM[terms.index(term),]) in sorted_sim_values[:30] and row_ix != terms.index(term): 
            top_n_terms.append([row_ix, sim(TDM[row_ix,], TDM[terms.index(term),]), terms[row_ix]])                            
                               
    #---Your code ends here---
    
    return top_n_terms
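
The cell above filters against a hard-coded `[:30]` cutoff and returns matches in row order, so the `top` argument has no effect and the list is not sorted by similarity. One possible alternative, sketched under the assumption that `terms` is a list aligned with the rows of `TDM`, sorts row indices with `np.argsort`. The name `most_similar_sketch` and the toy data are hypothetical, and the sketch computes similarities directly instead of calling `term_sims` so that it runs on its own.

```python
import numpy as np

def most_similar_sketch(term, terms, TDM, top=25):
    """Return [row_ix, similarity, term] for the `top` rows most similar to `term`."""
    i = terms.index(term)
    norms = np.linalg.norm(TDM, axis=1)
    denom = norms * norms[i]
    dots = TDM @ TDM[i]
    # Zero-norm rows keep a similarity of 0.0 instead of producing NaN
    sims = np.divide(dots, denom, out=np.zeros(len(terms)), where=denom > 0)
    # Row indices by descending similarity, skipping the query row itself
    order = [int(j) for j in np.argsort(-sims, kind="stable") if j != i]
    return [[j, float(sims[j]), terms[j]] for j in order[:top]]

toy_terms = [("cat", "NOUN"), ("dog", "NOUN"), ("purr", "VERB"), ("bark", "VERB")]
toy_TDM = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0],
                    [1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
result = most_similar_sketch(("cat", "NOUN"), toy_terms, toy_TDM, top=2)
print(result)
```

Sorting indices instead of values keeps each similarity paired with its row, so ties cannot pull in extra rows and the result always has at most `top` entries.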

Now, let's test your function's utility on a `TDM` produced for `book_id = 84` and exhibit the top 25 terms most similar to each of `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

In [11]:
# C4:SanityCheck

most_similar(('monster', 'NOUN'), terms, TDM, top = 25)

[[932, 0.12309149097933272, ('beneficial', 'ADJ')],
 [949, 0.17407765595569785, ('besiege', 'VERB')],
 [987, 0.12309149097933272, ('blind', 'VERB')],
 [998, 0.12309149097933272, ('blot', 'NOUN')],
 [1202, 0.12309149097933272, ('chatter', 'VERB')],
 [1400, 0.12309149097933272, ('confessor', 'NOUN')],
 [1503, 0.17407765595569785, ('convulsed', 'ADJ')],
 [1619, 0.12309149097933272, ('dark', 'NOUN')],
 [1648, 0.17407765595569785, ('dearer', 'ADJ')],
 [1734, 0.17407765595569785, ('depart', 'NOUN')],
 [1807, 0.12309149097933272, ('detestable', 'ADJ')],
 [1851, 0.17407765595569785, ('dim', 'NOUN')],
 [1908, 0.17407765595569785, ('disown', 'VERB')],
 [2134, 0.12309149097933272, ('engagement', 'NOUN')],
 [2278, 0.17407765595569785, ('existence', 'VERB')],
 [2466, 0.1315903389919538, ('finger', 'NOUN')],
 [2548, 0.17407765595569785, ('forehead', 'NOUN')],
 [2610, 0.12309149097933272, ('fringe', 'VERB')],
 [2633, 0.17407765595569785, ('furiously', 'ADV')],
 [2794, 0.12309149097933272, ('gun', 'NO

In [12]:
# C4:SanityCheck

most_similar(('beautiful', 'ADJ'), terms, TDM, top = 25)

[[57, 0.1889822365046136, ('27th', 'NOUN')],
 [229, 0.1889822365046136, ('Lavenza', 'PROPN')],
 [265, 0.1889822365046136, ('Montalegre', 'PROPN')],
 [373, 0.1889822365046136, ('Uri', 'PROPN')],
 [379, 0.1889822365046136, ('Villa', 'PROPN')],
 [598, 0.1889822365046136, ('alluring', 'ADJ')],
 [878, 0.1889822365046136, ('bat', 'NOUN')],
 [1300, 0.1889822365046136, ('coast', 'VERB')],
 [1568, 0.1889822365046136, ('croaking', 'NOUN')],
 [1637, 0.1889822365046136, ('dazzling', 'ADJ')],
 [1877, 0.1889822365046136, ('discontent', 'NOUN')],
 [2056, 0.1889822365046136, ('ecstatic', 'ADJ')],
 [2072, 0.1889822365046136, ('elasticity', 'NOUN')],
 [2105, 0.1889822365046136, ('emulate', 'VERB')],
 [2146, 0.1889822365046136, ('ennui', 'NOUN')],
 [2171, 0.1889822365046136, ('entrancingly', 'ADV')],
 [2307, 0.1889822365046136, ('exquisitely', 'ADV')],
 [2611, 0.1889822365046136, ('frog', 'NOUN')],
 [3280, 0.1889822365046136, ('interrupted', 'ADJ')],
 [3333, 0.1889822365046136, ('issue', 'NOUN')],
 [3819

In [None]:
# C4:Inline

# Comment on the ordered results returned in the sanity checks.
# Do you think the algorithm is exhibiting sensible results? print "Yes" or "No"
print("Yes")