# Assignment group 1: Textual feature extraction and numerical comparison

## Module C _(35 points)_ Similarity of word usage across a document

Here we'll build up code to discover which terms are used similarly across a document. Our first task is to write a word-frequency counting function.

__C1.__ _(12 points)_ Define a function called `count_words(paragraph, pos = True, lemma = True)` that `return`s a `Counter()` called `frequency`. In `frequency`, each key will consist of a `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and `word.lemma_` attribute if `True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`. The `Counter()` should simply contain the number of times each `heading` is observed in the `paragraph`.
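To make the `heading` key concrete, here is a minimal sketch that applies the same counting logic to hand-made tokens instead of a parsed `spacy` document. The `Token` namedtuple and the `count_tokens` helper are hypothetical stand-ins introduced only for illustration; they are not part of `spacy` or of the assignment.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token: only the three attributes
# that count_words reads are modeled here.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

tokens = [
    Token("The", "the", "DET"),
    Token("monsters", "monster", "NOUN"),
    Token("saw", "see", "VERB"),
    Token("the", "the", "DET"),
    Token("monster", "monster", "NOUN"),
]

def count_tokens(tokens, pos=True, lemma=True):
    """Count (text, tag) headings over an already-tokenized sequence."""
    frequency = Counter()
    for word in tokens:
        heading = (word.lemma_ if lemma else word.text,
                   word.pos_ if pos else "")
        frequency[heading] += 1
    return frequency

# With lemmas and tags, inflected forms collapse onto one key.
full = count_tokens(tokens)
# Without them, every distinct surface form gets its own key and an empty tag.
surface = count_tokens(tokens, pos=False, lemma=False)
```

With both flags on, `"monsters"` and `"monster"` share the key `("monster", "NOUN")`; with both off, the five surface forms stay distinct.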

In [1]:
from collections import Counter
import spacy, json, re

nlp = spacy.load("en")

def count_words(paragraph, pos = True, lemma = True):
    doc = nlp(paragraph)
    
    frequency = Counter()
    for sentence in doc.sents:
        for word in sentence:
            heading = (word.lemma_ if lemma else word.text, 
                       word.pos_ if pos else "")
        
            frequency[heading] += 1
    return frequency



__C2.__ _(8 pts)_ Next, define a function called `book_TDM(book_id, pos = True, lemma = True)` and copy into it the TDM-producing code from __Section 2.1.5.1__ of the lecture notes, now `return`-ing `TDM` and `all_words`. Once copied, modify this function to call `count_words`, passing along the `lemma` and `pos` arguments supplied by the caller of `book_TDM`.

To prove that your code works, process `book_id = 84` with both `pos = True` and `lemma = True`, then print out the `TDM`'s `.shape` attribute and the first ten terms in `all_words`.
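As a small illustration of the matrix being requested, the sketch below fills a term-document matrix from two hand-written `Counter`s. The `doc_frequencies` data is made up for the example and merely mimics the keys that `count_words` produces; it is not output from book 84.

```python
import numpy as np
from collections import Counter

# Hypothetical per-paragraph counts, keyed the way count_words keys them.
doc_frequencies = {
    0: Counter({("monster", "NOUN"): 2, ("see", "VERB"): 1}),
    1: Counter({("see", "VERB"): 1, ("beautiful", "ADJ"): 3}),
}

# Fix a row ordering by sorting the union of all headings.
all_words = sorted(set().union(*doc_frequencies.values()))

# Rows are terms, columns are paragraphs.
TDM = np.zeros((len(all_words), len(doc_frequencies)))
for j, frequency in doc_frequencies.items():
    for i, word in enumerate(all_words):
        TDM[i, j] = frequency[word]  # a Counter returns 0 for unseen keys
```

Here `TDM.shape` is `(3, 2)`: three distinct headings across two paragraphs.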

In [2]:
import numpy as np
from collections import Counter
import re

def book_TDM(book_id, pos = True, lemma = True):
    ## load the book (str() lets book_id be passed as 84 or "84")
    with open("./data/books/"+str(book_id)+".txt") as f:
        book = f.read().strip()
    
    ## break the document into paragraphs
    paragraphs = re.split("\n\n+", book)

    ## the 'master' set, keeps track of the words in all documents
    all_words = set()

    ## store the word frequencies by paragraph
    all_doc_frequencies = {}

    ## loop over the paragraphs
    for j, paragraph in enumerate(paragraphs):        
        frequency = count_words(paragraph, pos = pos, lemma = lemma)
        all_doc_frequencies[j] = frequency
        doc_words = set(frequency.keys())
        all_words = all_words.union(doc_words)

    ## create a matrix of zeros: (words) x (documents)
    TDM = np.zeros((len(all_words),len(all_doc_frequencies)))
    ## fix a word ordering for the rows
    all_words = sorted(list(all_words))
    ## loop over the (sorted) document numbers and (ordered) words; fill in matrix
    for j in all_doc_frequencies:
        for i, word in enumerate(all_words):
            TDM[i,j] = all_doc_frequencies[j][word]

    return TDM, all_words

In [3]:
TDM, terms = book_TDM("84", pos = True, lemma = True)
terms[:10]

[('\n', 'SPACE'),
 ('\n  ', 'SPACE'),
 ('\n   ', 'SPACE'),
 ('\n     ', 'SPACE'),
 ('\n                              ', 'SPACE'),
 (' ', 'SPACE'),
 ('  ', 'SPACE'),
 ('    ', 'SPACE'),
 ('     ', 'SPACE'),
 ('               ', 'SPACE')]

In [4]:
TDM.shape

(6221, 723)

__C3.__ _(8 pts)_ Next, your job is to define two functions. The first is `sim(u,v)`, which should take two arbitrary numeric vectors and compute/output the cosine similarity, as described in __Section 1.1.2.10__.

The second function is `term_sims(i, TDM)`, which should utilize the first function (`sim`) to output a list of cosine similarity values between the word/row `i` and all others (rows) in the `TDM`. 

Note: each of these functions can be straightforwardly completed using a single line of code! Exhibit your knowledge of comprehensions and vectorization!
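One possible reading of "vectorization" here is to compute every row similarity with a single matrix-vector product rather than a comprehension. The sketch below is an assumption about what that could look like, not the lecture notes' method; `term_sims_vectorized` is a hypothetical name, and the code assumes no row of the matrix is all zeros.

```python
import numpy as np

def sim(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def term_sims_vectorized(i, TDM):
    # Norm of every row at once, then one matrix-vector product.
    norms = np.linalg.norm(TDM, axis=1)
    return (TDM @ TDM[i]) / (norms * norms[i])

# Toy matrix: three terms (rows) observed across two documents (columns).
toy = np.array([[0.0, 3.0],
                [2.0, 0.0],
                [1.0, 1.0]])

looped = [sim(toy[1], toy[k]) for k in range(toy.shape[0])]
vectorized = term_sims_vectorized(1, toy)
```

Both approaches agree on the toy matrix: row 1 is orthogonal to row 0 (similarity 0), identical to itself (similarity 1), and at 45 degrees to row 2.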

In [5]:
def sim(u,v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [6]:
def term_sims(i, TDM):
    return [sim(TDM[i,], TDM[sim_i,]) for sim_i in range(TDM.shape[0])]

__C4.__ _(7 pts)_ Finally, your goal is to write a function, `most_similar(term, terms, TDM, top = 25)`, that utilizes `term_sims` to output a sorted list of the `top = N` terms most similar to the one specified (`term`). The output data type should be a list of lists, with each inner list representing information for a similar term as: `[row_ix, similarity, term]`. Once complete, prove your function's utility on a `TDM` produced for `book_id = 84` and exhibit the top 25 terms most similar to each of `('monster', 'NOUN')` and `('beautiful', 'ADJ')`.

Once computation is complete, comment on the ordered results returned in the markdown cell below. Do you think the algorithm is exhibiting sensible results? What would you do to improve it?

\[Hint: to locate the row containing the term of interest, utilize the list `.index()` method in application to the `terms` argument.\]
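For reference, the requested return shape, a list of `[row_ix, similarity, term]` lists, could be produced along the following lines. This is a minimal sketch on a made-up three-term matrix; `most_similar_rows`, `toy_terms`, and `toy_TDM` are hypothetical names used only for illustration.

```python
import numpy as np

def sim(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def term_sims(i, TDM):
    return [sim(TDM[i], TDM[k]) for k in range(TDM.shape[0])]

def most_similar_rows(term, terms, TDM, top=25):
    # Locate the row of the query term, as the hint suggests.
    term_index = terms.index(term)
    sims = term_sims(term_index, TDM)
    # Rank (row_ix, similarity) pairs from most to least similar.
    ranked = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)
    # Return a list of lists rather than printing.
    return [[row_ix, similarity, terms[row_ix]]
            for row_ix, similarity in ranked[:top]]

# Toy data (hypothetical): three terms over two documents.
toy_terms = [("beautiful", "ADJ"), ("monster", "NOUN"), ("see", "VERB")]
toy_TDM = np.array([[0.0, 3.0],
                    [2.0, 0.0],
                    [1.0, 1.0]])

result = most_similar_rows(("monster", "NOUN"), toy_terms, toy_TDM, top=2)
```

The query term always ranks first with similarity 1, so in practice one may want to request `top + 1` rows or skip the first entry.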

_Response._

In [7]:
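def most_similar(term, terms, TDM, top = 25):
    ## locate the row for the term of interest
    term_index = terms.index(term)
    sims = term_sims(term_index, TDM)
    ## rank (row_ix, similarity) pairs from most to least similar
    sims = sorted(enumerate(sims), key = lambda x: x[1], reverse = True)
    ## report only the requested number of terms
    for sim_i, term_sim in sims[:top]:
        print(sim_i, term_sim, terms[sim_i])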

In [8]:
most_similar(('monster', 'NOUN'), terms, TDM, top = 25)

3628 1.0 ('monster', 'NOUN')
470 0.3380617018914066 ('asseveration', 'NOUN')
1254 0.3380617018914066 ('correct', 'VERB')
2048 0.3380617018914066 ('existence', 'VERB')
2332 0.3380617018914066 ('formation', 'NOUN')
3703 0.3380617018914066 ('mutilate', 'VERB')
4243 0.3380617018914066 ('posterity', 'NOUN')
28 0.3072240245631793 ('-PRON-', 'PRON')
2710 0.30578831486257535 ('hideous', 'ADJ')
5 0.2929277805918349 (' ', 'SPACE')
1473 0.29277002188455997 ('demoniacal', 'ADJ')
2325 0.29277002188455997 ('forgetfulness', 'NOUN')
12 0.28811865589289226 ('!', 'PUNCT')
5483 0.27909246469593313 ('that', 'ADP')
1730 0.2760262237369417 ('down', 'ADV')
29 0.27266231928075757 ('.', 'PUNCT')
0 0.27092834781732605 ('\n', 'SPACE')
327 0.2680355919935828 ('and', 'CCONJ')
5487 0.2665740440836767 ('the', 'DET')
3748 0.2548235957188128 ('neck', 'NOUN')
1163 0.253546276418555 ('connected', 'ADJ')
3718 0.253546276418555 ('narrative', 'NOUN')
68 0.250263571622741 (';', 'PUNCT')
20 0.2502061289650293 (',', 'PUNCT')


In [9]:
most_similar(('beautiful', 'ADJ'), terms, TDM, top = 25)

602 0.9999999999999998 ('beautiful', 'ADJ')
646 0.40824829046386296 ('beneath', 'ADV')
4152 0.37499999999999994 ('picturesque', 'NOUN')
4758 0.37499999999999994 ('rotterdam', 'PROPN')
5036 0.37499999999999994 ('singularly', 'ADV')
6060 0.37499999999999994 ('whence', 'ADV')
1462 0.35355339059327373 ('delineate', 'NOUN')
1771 0.35355339059327373 ('dun', 'NOUN')
2221 0.35355339059327373 ('fifth', 'ADJ')
3415 0.35355339059327373 ('luxuriance', 'NOUN')
3442 0.35355339059327373 ('mainz', 'PROPN')
3462 0.35355339059327373 ('mannheim', 'PROPN')
3505 0.35355339059327373 ('meander', 'VERB')
4062 0.35355339059327373 ('pearly', 'ADJ')
4225 0.35355339059327373 ('populous', 'ADJ')
4969 0.35355339059327373 ('shipping', 'NOUN')
4993 0.35355339059327373 ('shrivelled', 'ADJ')
5238 0.35355339059327373 ('steep', 'ADJ')
5262 0.35355339059327373 ('straight', 'ADJ')
5893 0.35355339059327373 ('variegate', 'VERB')
5946 0.35355339059327373 ('vineyard', 'NOUN')
6080 0.35355339059327373 ('whiteness', 'NOUN')
6110