# COLX 561 Lab Assignment 3: Word Similarity (Cheat sheet)

## Assignment Objectives

Methods such as PMI, LSA, and Word2Vec sound like they could be capturing important semantic information, but up until now, you've had to take our word for it.  In this lab, you'll be evaluating the quality of several of these methods by calculating their similarity to a hand-derived word-similarity dataset.  You won't achieve perfect similarity scores (even human annotators rarely agree exactly about semantic similarity), but you will see that the quality can vary significantly between methods.

In this assignment you will:

- Derive word similarity scores using distributional approaches such as PMI, LSA, and Word2Vec
- Compare them with scores derived from a manual resource (WordNet)
- Evaluate these scores by comparison to a manually built word similarity dataset

## Getting Started

Run the code below to access relevant modules (you can add to this as needed)

In [1]:
# !python3 -m pip install --user gensim

import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet as wn
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet_ic
from collections import defaultdict, Counter
import numpy as np
from scipy.sparse import csr_matrix, dok_matrix
from sklearn.decomposition import TruncatedSVD
from scipy.spatial.distance import cosine
from sklearn.feature_extraction import DictVectorizer
from scipy.stats import pearsonr


In [None]:
!python3 -m pip install gensim 

In [None]:
from gensim.models import Word2Vec

The code below lets you compare two sets of similarity scores over the same word pairs using their Pearson correlation coefficient. The two similarity dicts must have the same keys in the same order. This will allow you to evaluate the methods you implement in this lab.

In [2]:
def compare_similarity(similarity_dict1,similarity_dict2):
    assert list(similarity_dict1.keys()) == list(similarity_dict2.keys())
    X = [similarity_dict1[word_pair] for word_pair in similarity_dict1]
    Y = [similarity_dict2[word_pair] for word_pair in similarity_dict1]
    return pearsonr(X,Y)[0]

# The p-value is the probability of observing a correlation at least this strong if the true correlation coefficient were in fact zero (the null hypothesis).
# If this probability is below the conventional 5% threshold (p < 0.05), the correlation coefficient is called statistically significant.

In [4]:
x, y = [1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7] 
print(pearsonr(x, y))
pearsonr(x, y)[0]

PearsonRResult(statistic=0.9999999999999999, pvalue=2.494476486799542e-40)


0.9999999999999999

Lemmatization is a good idea in this lab because it helps mitigate sparsity. You can use the `lemmatizer`/`lemmatize` function below.

In [3]:
lemmatizer = WordNetLemmatizer()

def lemmatize(word): 
    lemma = lemmatizer.lemmatize(word,'n')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'v')
    return lemma
    

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### Exercise 0: Preparing the gold standard


#### Exercise 0.0:
rubric={accuracy:1}

For this lab we will be comparing our methods against a popular dataset of word similarities called WordSim-353 (353 refers to the number of pairs in the data set). You first need to obtain this data set, which can be downloaded <a href="http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.zip">here</a>. The file we will be using is called *combined.tab*. The file is tab-separated, with the first two columns holding the two words and the third column holding a human-annotated similarity score for the pair. You can simply include the file with your lab submission.

For example, the similarity score between "television" and "radio" is given in the following line:

television&emsp;radio&emsp;6.77

You should load this file into a Python dictionary called `gold_standard` where the pairs are the keys, and the similarity score is the value. No need to lemmatize here.

For example, the "television/radio" example should be stored in the `gold_standard` dict as `{("television", "radio"): 6.77}`

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing Search in Context: The Concept Revisited. *ACM Transactions on Information Systems*, 20(1), 116–131. https://doi.org/10.1145/503104.503110

```
tiger   cat     7.35
student professor       6.81
smart   student 4.62
sugar   approach        0.88
problem challenge       6.75
planet  people  5.75
video   archive 6.34
money	cash	9.15
money	cash	9.08
```
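As a rough illustration of the loading step, here is a minimal sketch that parses a tab-separated sample from an in-memory string. The header row and the names `sample`, `load_pairs`, and `toy_gold` are assumptions made for this example; check the layout of your copy of *combined.tab* before reusing the idea.

```python
import csv
import io

# Toy stand-in for combined.tab. The header row is an assumption about
# the file layout, not something guaranteed by this notebook.
sample = io.StringIO(
    "Word 1\tWord 2\tHuman (mean)\n"
    "tiger\tcat\t7.35\n"
    "money\tcash\t9.15\n"
    "money\tcash\t9.08\n"
)

def load_pairs(handle):
    """Read (word1, word2, score) rows into a dict keyed by word pair."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the (assumed) header row
    pairs = {}
    for word1, word2, score in reader:
        # A repeated pair simply overwrites the earlier score.
        pairs[(word1, word2)] = float(score)
    return pairs

toy_gold = load_pairs(sample)
```

Because a dict keeps one value per key, the repeated pair collapses into a single entry, which is why a file with 353 rows can end up as 352 items.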

In [4]:
import csv 
# csv.DictReader

gold_standard = {} # gold_standard[('tiger', 'cat')] = 7.35 

#Your code here

#Your code here

In [5]:
assert gold_standard[("tiger","cat")] == 7.35
assert len(gold_standard) == 352 # turns out there's one duplicate!

#### Exercise 0.1:
rubric={accuracy:2}

Next, we will filter this dataset to create a much smaller test set which is better suited to the resources we are testing here. First, we will remove words which are rare or not present in the Brown corpus. In this assignment, we will be treating the <i>paragraphs</i> of the Brown corpus as our "documents". You can iterate over them by using the `paras` method of the corpus reader (i.e., `for paragraph in brown.paras()`).

Using the Brown paragraphs, you should obtain the counts of all words in the corpus. Then you should remove from `gold_standard` any pairs that contain words that occur fewer than 5 times in the Brown corpus.  For example, if you have a pair ("beetle", "insect"), and "beetle" occurs 3 times, and "insect" 24, you remove the pair (i.e., if either word occurs fewer than 5 times, remove the pair).  You should use the provided lemmatizer and lowercase the words before calculating their counts.

Prior to removing these pairs, gold_standard should have 352 items.  After, it should have 261.

```
tiger   cat     7.35
student professor       6.81
smart   student 4.62
sugar   approach        0.88
problem challenge       6.75
planet  people  5.75
#video   archive 6.34                    <-- will be deleted because "video" appears 2 times and "archive" 4 times
money	cash	9.15
money	cash	9.08
```
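The filtering step can be sketched on toy data as follows. This sketch assumes the counts are document frequencies (each word type counted once per paragraph), which is what the `# dfs` comment in the next cell suggests; if raw token counts are intended, update the counter with every token instead of a set. `normalize` is a hypothetical stand-in for `lemmatize(word.lower())`, and all `toy_*` names are invented for the example.

```python
from collections import Counter

# Toy "paragraphs" shaped like brown.paras(): a list of sentences,
# each sentence a list of tokens.
toy_paras = [
    [["The", "cat", "saw", "the", "tiger"], ["Cat", "and", "tiger"]],
    [["A", "video", "of", "the", "cat"]],
    [["The", "tiger", "slept"]],
]

def normalize(token):
    # Hypothetical stand-in for lemmatize(token.lower()).
    return token.lower()

toy_df = Counter()
for para in toy_paras:
    # Using a set counts each word type once per paragraph.
    types = {normalize(token) for sent in para for token in sent}
    toy_df.update(types)

toy_threshold = 2
toy_pairs = {("tiger", "cat"): 7.35, ("video", "archive"): 6.34}

# Keep a pair only if both of its words reach the threshold.
kept = {
    pair: score
    for pair, score in toy_pairs.items()
    if all(toy_df[normalize(word)] >= toy_threshold for word in pair)
}
```

Building a new dict (rather than deleting while iterating) avoids modifying `toy_pairs` during the loop.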

In [6]:
threshold = 5

brown_counts = Counter() # dfs

#Your code here

#Your code here
    
print(len(gold_standard))

261


In [7]:
assert ("video","archive") not in gold_standard
assert len(gold_standard) == 261
print("Success!")

Success!


#### Exercise 0.2:
rubric={accuracy:1}

The second filtering step removes any words which do not have a single primary sense, or whose primary sense is not a noun (check this after you have found the primary sense). You can adapt the code for this from Lab 1, again **returning a primary (dominant) sense only when it accounts for at least 75%** of all the counts for that word. You can assume everything in the test set is a noun for this purpose (no need to use the lemmatizer function above).  After removing these words, your gold standard set will be down to 57 entries.

```
#tiger   cat     7.35           <-- will be deleted because 'tiger' doesn't have a dominant sense
student professor       6.81
smart   student 4.62
sugar   approach        0.88
problem challenge       6.75
planet  people  5.75
money	cash	9.15
money	cash	9.08
```
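The 75% rule itself can be sketched independently of WordNet. The helper below takes a list of `(sense, count)` tuples and is purely illustrative: `pick_dominant` and the toy sense names are assumptions, and in the notebook the counts would presumably come from WordNet lemma counts as in Lab 1.

```python
def pick_dominant(sense_counts, ratio=0.75):
    """Return the sense holding at least `ratio` of all counts, else None."""
    total = sum(count for _, count in sense_counts)
    if total == 0:
        return None  # no usage counts at all, so no dominant sense
    best_sense, best_count = max(sense_counts, key=lambda item: item[1])
    return best_sense if best_count / total >= ratio else None

# Toy counts, invented for illustration.
clear_case = pick_dominant([("student.n.01", 9), ("scholar.n.01", 1)])
split_case = pick_dominant([("tiger.n.01", 1), ("tiger.n.02", 1)])
empty_case = pick_dominant([])
```

Returning `None` for both the "no counts" and the "no sense reaches 75%" cases lets the caller treat them the same way when deciding which pairs to delete.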

In [8]:
dominant_sense_ratio = 0.75

def get_dominant_sense(word):
    lemma =  lemmatizer.lemmatize(word,'n')
    syn_counts = []
    total = 0
    #Your code here

    #Your code here 

to_delete = []
for pair in gold_standard:
    sense1 = get_dominant_sense(pair[0])
    sense2 = get_dominant_sense(pair[1])
    if not sense1 or not sense2 or ".n." not in str(sense1) or ".n." not in str(sense2):
        to_delete.append(pair)
        

for pair in to_delete:
    del gold_standard[pair]
    

In [9]:
assert ("tiger","cat") not in gold_standard
assert len(gold_standard) == 57
print("Success!")

Success!



### Methodologies

1. WordNet
2. Co-occurrence matrices
     - PMI
     - LSA
3. Word2Vec

### Exercise 1: WordNet

#### 1.1
rubric={accuracy:2}

Now you will create several dictionaries with similarity scores for pairs of words in your test set derived using the techniques discussed in class. The first of these is the Wu-Palmer scores (`wup_similarity`) derived from the hypernym relationships in WordNet, which you should calculate using the primary sense for each word derived above. You can use the built-in method included in the NLTK interface; you don't have to implement your own. When you're done, print out the correlation with the gold standard by using the `compare_similarity` function provided above.  Garrett gets a correlation of 0.4094...

`compare_similarity` reports the Pearson correlation coefficient, $r = \frac{\sum{(x-m_x)(y-m_y)}}{\sqrt{\sum{(x-m_x)^2} \sum{(y-m_y)^2}}}$, where $m_x$ and $m_y$ are the means of the values in the two dictionaries $x$ and $y$, respectively.
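To make the correlation formula concrete, here is a small pure-Python version of it. This is only a sketch for intuition (the name `pearson_r` is made up here); in the lab you should keep using `scipy.stats.pearsonr` through `compare_similarity`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    m_x = sum(xs) / len(xs)
    m_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    denominator = math.sqrt(
        sum((x - m_x) ** 2 for x in xs) * sum((y - m_y) ** 2 for y in ys)
    )
    return numerator / denominator

perfect = pearson_r([1, 2, 3], [2, 4, 6])
reversed_order = pearson_r([1, 2, 3], [3, 2, 1])
```

Note that only the relative ordering and spacing of the scores matter: the two similarity scales do not need to share a range for the correlation to be high.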

In [10]:
wupalmer_similarities = {}

#Your code here#
for pair in gold_standard:
    wupalmer_similarities[pair] =  ...
#Your code here

print(compare_similarity(wupalmer_similarities,gold_standard))

0.40938383930272365


#### 1.2 (Optional)
rubric={reasoning:1}

Test all the other WordNet-based similarity measures available in NLTK, including those that require a corpus. (There is a list [here](https://www.nltk.org/howto/wordnet.html)) Discuss what you see, and why you think some do better than others.  This code will be very similar to Exercise 1.1, except you will replace wup_similarity with the other available similarities.  Your scores will range between 0.35 and 0.45.

In [11]:
# import nltk
# nltk.download('wordnet_ic')

brown_ic = wordnet_ic.ic('ic-brown.dat')

path_similarities = {}
lch_similarities = {}
res_similarities = {}
lin_similarities = {}

#Your code here
    
#Your code here    

print("Path:", compare_similarity(path_similarities,gold_standard))
print("Leacock-Chodorow:", compare_similarity(lch_similarities, gold_standard))
print("Wu-Palmer:",compare_similarity(wupalmer_similarities,gold_standard))
print("Resnik:", compare_similarity(res_similarities, gold_standard))
print("Lin Similarity",compare_similarity(lin_similarities, gold_standard))

Path: 0.37053061382292296
Leacock-Chodorow: 0.3701027071347122
Wu-Palmer: 0.40938383930272365
Resnik: 0.42194564019584446
Lin Similarity 0.4369632240383018


Answer:



### Exercise 2: Co-occurrence matrices

#### 2.1
rubric={accuracy:3,efficiency:1}

Build a word-word co-occurrence matrix `X` using scipy, where each cell holds the number of paragraphs of the Brown corpus in which the two corresponding word types appear together. Before you do this, you will want to review **the creation of sparse matrices from DSCI 512** (`scipy.sparse.dok_matrix` (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html)). Importantly, you should create a **symmetric matrix** where both row $n$ and column $n$ refer to the same word type. For this, you will also want to create a lookup from words to indices in the array called `word_index_lookup`; you can use the words in the document frequency dictionary you built in **Exercise 0**. It will likely take several minutes, so it's a good idea to print out your progress (100 or 1000 paragraphs at a time; there are ~16,000 paragraphs in the Brown corpus).

- `brown_counts = {'the': 13125, 'of': 10633, 'and': 10514, 'a': 10409, 'to': 10046, ... }`
- `word_index_lookup` has one entry per word in `brown_counts`
- `X` has shape `len(brown_counts)` $\times$ `len(brown_counts)`

For example, given the word set `{'The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.', ...}` from a paragraph in the Brown corpus, we count the co-occurrences `('the', 'the')`, `('the', 'fulton')`, ..., `('fulton', 'the')`, ..., where each word is first passed through `lower()` and `lemmatize()`:

1. Start with an empty set `words`.
2. Add every (lowercased, lemmatized) word in the paragraph to `words`.
3. Count every pair of words in `words`: (w0,w0), (w0,w1), ..., (w1,w0), (w1,w1), ..., (wn,wn).


Then convert your `dok_matrix` into a `csr_matrix` (recall that this is the type `CountVectorizer` returns).
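The counting scheme described above can be sketched on three toy "paragraphs". The data and the `toy_*` names are invented for illustration, and the tokens are assumed to be lowercased and lemmatized already; this is a small-scale sketch, not a tuned solution for the full corpus.

```python
from itertools import product

import numpy as np
from scipy.sparse import dok_matrix

# Toy documents, assumed to be already lowercased and lemmatized.
toy_docs = [["the", "cat", "sat"], ["the", "cat"], ["a", "dog"]]

vocab = sorted({word for doc in toy_docs for word in doc})
toy_lookup = {word: index for index, word in enumerate(vocab)}

toy_X = dok_matrix((len(vocab), len(vocab)), dtype=np.int64)
for doc in toy_docs:
    types = set(doc)  # each word type counts once per document
    # product(..., repeat=2) yields (w, w) and both orders of every pair,
    # which keeps the matrix symmetric.
    for w1, w2 in product(types, repeat=2):
        toy_X[toy_lookup[w1], toy_lookup[w2]] += 1

toy_X = toy_X.tocsr()  # CSR is better suited to arithmetic and slicing
```

With this scheme the diagonal cell for a word holds its document frequency, and each document with $k$ distinct word types contributes $k^2$ to the total of the matrix.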

In [12]:
# takes a lot of time...............  > 15min

word_index_lookup = {}
for word in brown_counts:
    ... 

# Your code here
for paragraph in brown.paras():
    ...

    for sent in paragraph:
        ...
    # then, 
    # update (wi, wj) using: X[word_index_lookup[wi], word_index_lookup[wj]] += 1

# Your code here

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000


In [13]:
assert X.shape == (38822,38822)
assert np.sum(X) == 54930486
assert np.sum(X[word_index_lookup["similarity"]]) == 1778
print("Success!")

Success!


#### 2.2
rubric={accuracy:3,efficiency:1}

Use the co-occurrence matrix to calculate the positive pointwise mutual information (PPMI) between words in your test set (i.e., **gold_standard**). For efficiency, use numpy operations as much as possible, only calculate the statistics you need for this testing (do NOT calculate PPMI for all pairs of words), and don't calculate the same thing more than once. When you are done, evaluate your result using the similarity pairs.  Garrett gets a correlation of ~0.26.

PMI (pointwise mutual information) is the log of the ratio between the observed and the expected co-occurrence probability:

$PMI(i, j) = \log{\left(\frac{P(X_{ij})}{P(X_i)P(X_j)}\right)}$ where $P(X_{ij})$ = `X[i, j] / np.sum(X)` and $P(X_{i})$ = `np.sum(X[i]) / np.sum(X)`. In practice, you should calculate the positive variant, $PPMI(i, j) = \max\left(0, \log{\left(\frac{P(X_{ij})}{P(X_i)P(X_j)}\right)}\right)$.

`X` is $n \times n$, where $n$ is the length of `brown_counts`.

Equivalently, PMI can be viewed as $\log{\left(\frac{P(X_{i}|X_{j})}{P(X_i)}\right)}$ or $\log{\left(\frac{P(X_{j}|X_{i})}{P(X_j)}\right)}$
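Here is a minimal sketch of PPMI for a single pair on a tiny dense count matrix, assuming the marginal probability of a word is its row sum divided by the grand total. The matrix and the names `toy_counts` and `ppmi` are invented for the example; on the real sparse matrix you would cache the total and the row sums instead of recomputing them per pair.

```python
import math

import numpy as np

# Tiny symmetric count matrix, invented for illustration.
toy_counts = np.array([[2, 1, 0],
                       [1, 2, 1],
                       [0, 1, 2]], dtype=float)

def ppmi(counts, i, j):
    """Positive PMI between rows i and j of a co-occurrence count matrix."""
    total = counts.sum()
    p_ij = counts[i, j] / total
    if p_ij == 0:
        return 0.0  # log(0) is undefined; PPMI treats this case as zero
    p_i = counts[i].sum() / total
    p_j = counts[j].sum() / total
    # Parentheses around the denominator matter: a / b * c is (a / b) * c.
    return max(0.0, math.log(p_ij / (p_i * p_j)))

self_score = ppmi(toy_counts, 0, 0)    # log(0.2 / 0.09), positive
weak_score = ppmi(toy_counts, 0, 1)    # log(0.1 / 0.12) < 0, clipped to 0
never_score = ppmi(toy_counts, 0, 2)   # never co-occur
```

The explicit zero check avoids taking the log of zero for pairs that never co-occur, which is the common case in a sparse matrix.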

In [14]:
total = np.sum(X)    #The total counts of all words
P_x = {}             #The marginal probability of a word x
PMI_similarity = {}  #The Positive Pointwise Mutual Information of the matrix.

#Your code here
for pair in gold_standard:
    # make sure the words in the pair are lowercased
    # find the index of each word in `word_index_lookup`
    # get P_x --> P(Xi) and P(Xj)
    # then (pseudocode: replace these with your own variable names,
    # and note the parentheses around the whole denominator)
    PMI_similarity[pair] = max(0, log(P(X_ij) / (P(Xi) * P(Xj))))

print(compare_similarity(PMI_similarity,gold_standard))

0.2559103711054943


#### 2.3
rubric={accuracy:2}

Do LSA using the co-occurrence matrix you built in 2.1. That is, use SVD to lower the dimensionality of the matrix (set k = 200), and evaluate using the similarity pairs. You should only be calculating cosine similarity for the words you need to compare in your evaluation. Don't be surprised if your results here are significantly worse than in 2.2 (though they should be well above zero correlation).  For example, Garrett gets ~0.13.  Note that SVD and LSA can take a bit of time.

$LSA_{similarity}(i, j) = 1 - cosine(X_{svd}[i], X_{svd}[j])$ because `scipy.spatial.distance.cosine` computes the **cosine distance** ($1 - COS_{similarity}$).
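The following sketch shows the reduce-then-compare pattern on a toy 4x4 matrix with two obvious "topics". The matrix, `n_components=2`, and the names `toy_X`, `toy_reduced`, and `reduced_similarity` are assumptions for illustration only; the lab itself uses the real co-occurrence matrix and k = 200.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.decomposition import TruncatedSVD

# Two blocks of words that only co-occur within their own block.
toy_X = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 2, 1],
                  [0, 0, 1, 2]], dtype=float)

toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_X)  # one 2-d vector per row/word

def reduced_similarity(i, j):
    # scipy's cosine() is a distance, so subtract it from 1.
    return 1 - cosine(toy_reduced[i], toy_reduced[j])

same_block = reduced_similarity(0, 1)
cross_block = reduced_similarity(0, 2)
```

On this toy input, words from the same block should come out with a similarity close to 1 and words from different blocks close to 0, which is the behavior LSA is meant to capture.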

In [15]:
svd = TruncatedSVD(...)
LSA_similarities = {}

#Your code here
for pair in gold_standard:
    LSA_similarities[pair]  = 1-cosine(...)
#Your code here

print(compare_similarity(LSA_similarities,gold_standard))

0.12674569509666345


#### 2.4 (Optional)
rubric={accuracy:1}

Recent research has suggested that it is better to do SVD on a PPMI (Positive PMI) weighted matrix rather than the raw co-occurrence matrix like you built in 2.1. Try out this idea by converting your sparse matrix into a (still sparse) PPMI matrix before applying SVD. For an added challenge, you must accomplish this entirely using vectorization, no loops. See [here](https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) for the methods available to do elementwise operations on the CSR sparse matrix. You can't use normal numpy array operators for most of this. Remember that your matrix is symmetric; that's important.  Garrett gets ~0.18 at this point.
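One possible loop-free approach, sketched here on a toy matrix, is to work on the stored nonzero entries through the COO view of the matrix. The function name `ppmi_matrix` and the toy data are assumptions for this example, and the sketch relies on the matrix being symmetric (so row sums can stand in for column sums); it is one way to do it, not the required one.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ppmi_matrix(X):
    """Return a sparse PPMI matrix for a symmetric sparse count matrix X."""
    coo = X.tocoo()
    total = coo.data.sum()
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    # P(ij) / (P(i) * P(j)) simplifies to x_ij * total / (row_i * row_j).
    ratio = coo.data * total / (row_sums[coo.row] * row_sums[coo.col])
    values = np.maximum(np.log(ratio), 0.0)  # clip negative PMI to zero
    result = csr_matrix((values, (coo.row, coo.col)), shape=X.shape)
    result.eliminate_zeros()  # drop the clipped entries to stay sparse
    return result

toy_sparse = csr_matrix(np.array([[2, 1, 0],
                                  [1, 2, 1],
                                  [0, 1, 2]], dtype=float))
toy_ppmi = ppmi_matrix(toy_sparse)
```

Only the stored entries are touched, so cells that were zero in the count matrix stay zero and the result does not become dense.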

In [16]:
#Your code here

#Your code here

0.17934109110381627


#### 2.5
rubric={accuracy:2,efficiency:1}

Finally, do LSA (i.e., truncated SVD) with **a term-document matrix** (i.e., the rows are terms and the columns are the documents they appear in). This time, you should build your sparse matrix using a `DictVectorizer`. For full efficiency marks, iterate through the Brown corpus just once to build your matrix, and don't just transpose from a document-term matrix. (Hint: use your word-index lookup dict from 2.1 to build a list of empty feature dictionaries, one for each word, *before* you start iterating over the Brown corpus; then, whenever you encounter a word in a particular document, add that document as a "feature" to the feature dict of the corresponding word.) Evaluate as usual. This should give you your best results for this section.
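The hint can be illustrated on toy data as follows. The documents, the `doc{n}` feature names, and the names `toy_docs`, `toy_lookup`, `term_features`, and `td` are made up for this sketch, and the tokens are assumed to be normalized already.

```python
from collections import defaultdict

from sklearn.feature_extraction import DictVectorizer

# Toy documents, assumed to be already lowercased and lemmatized.
toy_docs = [["the", "cat", "sat"], ["the", "cat"], ["a", "dog"]]
vocab = sorted({word for doc in toy_docs for word in doc})
toy_lookup = {word: index for index, word in enumerate(vocab)}

# One feature dict per term, created before the single pass over the corpus.
term_features = [defaultdict(int) for _ in vocab]
for doc_id, doc in enumerate(toy_docs):
    for word in doc:
        # The document identifier is the "feature" of the term.
        term_features[toy_lookup[word]][f"doc{doc_id}"] += 1

# Rows follow the order of term_features (terms); columns are documents.
td = DictVectorizer().fit_transform(term_features)
the_row = td[toy_lookup["the"]].toarray().tolist()
```

Because the row order follows the list of feature dicts, row `toy_lookup[word]` of the resulting matrix is the vector for `word`, so no transpose is needed.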

In [17]:
td_matrix = []
for word in brown_counts: # see 0.1
    td_matrix.append(defaultdict(int))

#Your code here

for para in brown.paras(): # a paragraph in brown = 1 document; 
    ...




LSA_similarities = {}
...
#Your code here

print(compare_similarity(LSA_similarities,gold_standard))

0.35887018977212554


### Exercise 3: Word2Vec

#### 3.1

rubric={accuracy:2}

Build a Word2Vec model on the Brown corpus with gensim using default settings, and use cosine to derive similarity scores to compare to the gold standard. Code to build the corpus is provided. Don't be surprised if the default is quite poor (ie, < 0.1)!  You will be using this code several times in the next few exercises, so it might be a good idea to encapsulate it in a function.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), *Advances in Neural Information Processing Systems 26* (pp. 3111–3119). Curran Associates, Inc. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf


OR

- **fastText** http://fasttext.cc/ by Mikolov et al. ($\approx$ Word2Vec, plus subword information); the vector file starts with a header line giving its dimensions $m~n$ (number of words and vector size)
- **GloVe** https://nlp.stanford.edu/projects/glove/ by Stanford; the vector file has no such header line

https://radimrehurek.com/gensim/models/word2vec.html

`model = gensim.models.Word2Vec(corpus)`:
- `epochs=`,
- `vector_size=` (called `size=` before gensim 4.0),
- `workers=`,
- `seed=`

In [18]:
corpus = [[lemmatize(word.lower()) for word in sent] for sent in nltk.corpus.brown.sents()] 


model = Word2Vec(corpus)

def check_model(model):
    #Your code here

    #Your code here    

check_model(model)

0.1838846038461938


#### 3.2

rubric={accuracy:3,reasoning:1}

Many ML models have a number of so-called "hyper-parameters" that can be tuned to affect model quality.  We've seen a few of these in class, such as window size and number of iterations.  Some of these parameters can have a significant effect on the quality of the model, and should be tuned on a development set. For this lab, we'll replicate the process of hyper-parameter tuning by evaluating results on your gold standard.  Note that this is typically a bad way of doing things!  When you tune on your test set, you bias the model towards the data in your test set, and your reported results may be artificially high.  In real experiments, you should separate out another small set to serve as a "development" or "tuning" set.

Explore the parameter space of the Word2Vec model. You should at least consider the following parameters (see the [docs](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)): `epochs` (called `iter` before gensim 4.0), `vector_size` (called `size` before gensim 4.0), `window`, `sg`, and `negative`. You will need to show evidence that you have found good (if not great) parameter settings, so make sure you test a good range of values. One way to find suitable values is to look up the default in the documentation, increase and decrease it to explore preliminary values, and then keep moving in the direction of improvement. Discuss what you found in the box below; some parameters are going to have more of an effect than others.

It's okay to optimize the parameters one at a time (the order in the list above is a fine one to follow): optimize the first parameter, "lock" it to the value you found, then tune the second parameter, and so on. That said, you're free to optimize in other ways. To speed things up, you might want to use the `workers` option (which trains with multiple worker threads), and the `seed` option can help eliminate random variation (i.e., if the seed is identical in each run, improvements are more likely due to the parameter change than to random variation; note that gensim only guarantees fully reproducible runs with `workers=1`). You should be able to get the highest performance among all the options tested in this lab (Garrett's best numbers are ~0.5).  You probably only need a few lines of code for each of these experiments, but they may take a while, especially as you increase the number of epochs, window size, etc.

- number of epochs
- dim size
- window size
- negative sampling
- ...
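The "optimize one parameter, lock it, move on" strategy can be sketched generically. Everything here is illustrative: `coordinate_search` is a made-up helper, and `fake_evaluate` is a placeholder scoring function standing in for "train a Word2Vec model with these parameters and return its correlation with the gold standard".

```python
def coordinate_search(evaluate, defaults, grid):
    """Tune parameters one at a time, locking in each best value found."""
    best_params = dict(defaults)
    best_score = evaluate(best_params)
    for name, candidates in grid.items():
        for value in candidates:
            # Change only the current parameter; keep earlier winners fixed.
            trial = dict(best_params, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_params, best_score = trial, score
    return best_params, best_score

def fake_evaluate(params):
    # Placeholder score with its peak at epochs=20, window=10.
    # In the lab this would train a model and call check_model on it.
    return -abs(params["epochs"] - 20) - abs(params["window"] - 10)

best, score = coordinate_search(
    fake_evaluate,
    {"epochs": 5, "window": 5},
    {"epochs": [5, 10, 20, 40], "window": [2, 5, 10, 15]},
)
```

This tries far fewer combinations than a full grid search, at the cost of possibly missing settings where two parameters only help in combination.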

In [19]:
# Your hyper-parameter tuning

### Your Answer and Discussion:



In [None]:
# it goes up to 0.5 ... using w2v