# CS4765/6765 NLP Assignment 3: Word vectors

In this two-part assignment you will first examine and interact with word vectors. (This part of the assignment is adapted from a recent CS224N assignment at Stanford.) You will then implement a new approach to sentiment analysis.

In this assignment we will use [gensim](https://radimrehurek.com/gensim/) to access and interact with word embeddings. In gensim we’ll be working with a KeyedVectors object which represents word embeddings. [Documentation for KeyedVectors is available.](https://radimrehurek.com/gensim/models/keyedvectors.html) However, this assignment description and the sample code in it might be sufficient to show you how to use a KeyedVectors object.




In [4]:
import gensim.downloader
model = gensim.downloader.load('fasttext-wiki-news-subwords-300')

# Part 1: Examining word vectors

## Polysemy and homonymy

Polysemy and homonymy are the phenomena of words having multiple meanings/senses. The nearest neighbours (under cosine similarity) for a given word can indicate whether it has multiple senses.
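As a rough illustration of what "nearest neighbours under cosine similarity" means, here is a minimal sketch in plain Python. The three-dimensional vectors and the helper names `cosine_similarity` and `nearest_neighbours` are made up for this example; they are not part of gensim, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "mouse": [0.9, 0.8, 0.1],
    "rat": [1.0, 0.2, 0.0],
    "keyboard": [0.1, 1.0, 0.2],
    "banana": [0.0, 0.1, 1.0],
}

def nearest_neighbours(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(nearest_neighbours("mouse", toy_vectors))
```

In this toy setup *mouse* points partly along the "animal" direction and partly along the "device" direction, so both *rat* and *keyboard* rank above *banana*. That is the pattern to look for in the real neighbour lists.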

Consider the following example which shows the top-10 most similar words for *mouse*. The "input device" and "animal" senses of *mouse* are clearly visible from the top-10 most similar words. 


In [2]:
# Find the words most similar to "mouse" using cosine similarity.
# restrict_vocab=100000 limits the results to the 100000 most
# frequent words. This avoids rare words in the output. For this
# assignment, whenever you call most_similar, also pass
# restrict_vocab=100000.
model.most_similar('mouse', restrict_vocab=100000)

[('mice', 0.7038448452949524),
 ('rat', 0.6446240544319153),
 ('rodent', 0.6280483603477478),
 ('Mouse', 0.6180493831634521),
 ('cursor', 0.6154769062995911),
 ('keyboard', 0.6149151921272278),
 ('rabbit', 0.607288658618927),
 ('cat', 0.6070616245269775),
 ('joystick', 0.5888146162033081),
 ('touchpad', 0.5878496766090393)]

*Cursor*, *keyboard*, *joystick*, *touchpad* correspond to the input device sense. *Rat*, *rodent*, *rabbit*, *cat* correspond to the animal sense.


You can observe something similar for the different senses of the word *leaves*. Find a new example that exhibits polysemy/homonymy, show its top-10 most similar words, and explain why they show that this word has multiple senses. Write your answer in the code and text boxes below.

In [3]:
# Write your code here
model.most_similar('lie',restrict_vocab=100000)

[('lies', 0.83467036485672),
 ('lying', 0.7737990021705627),
 ('lied', 0.7100739479064941),
 ('falsehood', 0.6283779740333557),
 ('lay', 0.6210237741470337),
 ('truth', 0.6181300282478333),
 ('pretend', 0.614453911781311),
 ('untruth', 0.6050377488136292),
 ('deceit', 0.6038239598274231),
 ('deception', 0.595779538154602)]

The word "lie" has at least two senses. "falsehood", "untruth", "deceit", and "deception" correspond to the sense of saying something untrue, while "lay" corresponds to the sense of resting in a horizontal position.

## Synonyms and antonyms

Find three words (w1 , w2 , w3) such that w1 and w2 are synonyms (i.e., have roughly the same meaning), and w1 and w3 are antonyms (i.e., have opposite meanings), but the similarity between w1 and w3 > the similarity between w1 and w2. Note that this should be counter to your expectations, because synonyms (which mean roughly the same thing) would be expected to be more similar than antonyms (which have opposite meanings). Explain why you think this unexpected situation might have occurred.

Here is an example. w1 = *happy*, w2 = *cheerful*, and w3 = *sad*. (You will need to find a different example for your report.) Notice that the antonyms *happy* and *sad* are (slightly) more similar than the (near) synonyms *happy* and *cheerful*.


In [25]:
# Find the cosine similarity between "happy" and "cheerful"
model.similarity('happy', 'cheerful')


0.68476284

In [27]:
# and between "happy" and "sad".
model.similarity('happy', 'sad')


0.69010293

In [20]:
# Write your code here
model.similarity('like','adore')

0.51089025

In [18]:
model.similarity('like','hate')

0.5742829

"like" and "adore" show positive emotions. "like" and "hate" are antonyms. Here, the similarity between "like" and "adore" is lower than the similarity between "like" and "hate". In Word2Vec model, if two words often appear together, their vector similarity increases. The words "like" and "adore" may not be appearing as frequently and together as the words "like" and "hate". 

## Analogies

Analogies such as man is to king as woman is to X can be solved using word embeddings. This analogy can be expressed as X = woman + king − man. The following code snippet shows how to solve this analogy with gensim. Notice that the model gets it correct! I.e., *queen* is the most similar word.

In [29]:
# Find the model's predictions for the solution to the analogy
# "man" is to "king" as "woman" is to X
model.most_similar(positive=['woman', 'king'],
                   negative=['man'],
                   restrict_vocab=100000)


[('queen', 0.7786749005317688),
 ('monarch', 0.6666999459266663),
 ('princess', 0.653827428817749),
 ('kings', 0.6497675180435181),
 ('queens', 0.6284460425376892),
 ('prince', 0.6235989928245544),
 ('ruler', 0.5971586108207703),
 ('kingship', 0.5883600115776062),
 ('lady', 0.5851913094520569),
 ('royal', 0.5821066498756409)]
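The arithmetic behind X = woman + king − man can be sketched without gensim. The vectors below are hypothetical (their three axes loosely stand for "royalty", "male", and "female"), and `solve_analogy` is an illustrative helper, not a gensim function. The sketch leaves the three input words out of the candidate list, so it cannot simply return one of them. gensim's own implementation differs in details, such as how it normalises the vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: [royalty, male, female]. Invented for illustration.
toy = {
    "man": [0.1, 1.0, 0.0],
    "woman": [0.1, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.1],
}

def solve_analogy(a, b, c, vectors, topn=1):
    """Solve "a is to b as c is to X" using X = c + b - a."""
    target = [vc + vb - va
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the input words so they cannot be returned as the answer.
    candidates = [(word, cosine(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

print(solve_analogy("man", "king", "woman", toy))
```

Here the target vector coincides with the toy vector for *queen*, so *queen* is ranked first. With real embeddings the match is only approximate, which is why some analogies fail.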

### Correct analogy

Find a new analogy that the model is able to answer correctly (i.e., the most-similar word is the solution to the analogy). Explain briefly why the analogy holds. For the above example, this explanation would be something along the lines of a king is a ruler who is a man and a queen is a ruler who is a woman.


In [28]:
# Write your code here
model.most_similar(positive=['chef', 'painting'],
                   negative=['artist'],
                   restrict_vocab=100000)

[('cooking', 0.6424155235290527),
 ('restaurant', 0.5705956220626831),
 ('kitchen', 0.5680593252182007),
 ('chefs', 0.558533251285553),
 ('Chef', 0.5407431721687317),
 ('cook', 0.5393264889717102),
 ('cookery', 0.5337901711463928),
 ('repainting', 0.5258774757385254),
 ('cooks', 0.5199151039123535),
 ('decorating', 0.5195807218551636)]

An artist is someone who paints, and a painting is the creative work through which an artist shows their talent. The analogy holds because "artist" is to "painting" as "chef" is to "cooking": cooking is the chef's form of art and creativity.

### Incorrect analogy

Find a new analogy that the model is not able to answer correctly. Again explain briefly why the analogy holds. For example, here is an analogy that the model does not answer correctly:


In [65]:
# Find the model's predictions for the solution to the analogy
# "plate" is to "food" as "cup" is to X
model.most_similar(positive=['cup', 'food'],
                   negative=['plate'],
                   restrict_vocab=100000)

[('cups', 0.5481787919998169),
 ('coffee', 0.5461026430130005),
 ('beverage', 0.5460603833198547),
 ('drink', 0.5451807975769043),
 ('tea', 0.53434818983078),
 ('foods', 0.5310320854187012),
 ('drinks', 0.516447901725769),
 ('beverages', 0.5022991299629211),
 ('milk', 0.4976045787334442),
 ('non-food', 0.4929129481315613)]

A plate is used to serve food as a cup is used to serve a drink, but the model does not predict *drink*, or a similar term, as the most similar word.

In [38]:
# Write your code here
model.most_similar(positive=['bread', 'liquid'],
                   negative=['milk'],
                   restrict_vocab=100000)

[('flatbread', 0.5114241242408752),
 ('dough', 0.49446845054626465),
 ('mixture', 0.4813441038131714),
 ('molten', 0.48022177815437317),
 ('buttered', 0.47864025831222534),
 ('breads', 0.4758601486682892),
 ('illiquid', 0.4682086110115051),
 ('porous', 0.4669260084629059),
 ('liquids', 0.465412437915802),
 ('sourdough', 0.46143409609794617)]

"milk" is in a "liquid" state. The ideal output of this code is "porous" for the word "bread", but the model predicts flatbread as the most similar word.

## Bias

Consider the examples below. The first shows the words that are most similar to *man* and *worker* and least similar to *woman*. The second shows the words that are most similar to *woman* and *worker* and least similar to *man*.

In [85]:
# Find the words that are most similar to "man" and "worker" and
# least similar to "woman".
model.most_similar(positive=['man', 'worker'],
                   negative=['woman'],
                   restrict_vocab=100000)



[('workman', 0.7217649817466736),
 ('laborer', 0.6744564175605774),
 ('labourer', 0.6498093605041504),
 ('workers', 0.6487939357757568),
 ('foreman', 0.6226886510848999),
 ('machinist', 0.6098095178604126),
 ('employee', 0.6091086864471436),
 ('technician', 0.6029269099235535),
 ('helper', 0.5994961261749268),
 ('manager', 0.5832769274711609)]

In [86]:
# Find the words that are most similar to "woman" and "worker" and
# least similar to "man".
model.most_similar(positive=['woman', 'worker'],
                   negative=['man'],
                   restrict_vocab=100000)



[('cheerleaders', 0.7048168778419495),
 ('cheerleading', 0.6419737339019775),
 ('Cheerleader', 0.6108335256576538),
 ('girl', 0.5797179341316223),
 ('schoolgirl', 0.5546731948852539),
 ('Cheerleaders', 0.547419011592865),
 ('businesswoman', 0.5433568954467773),
 ('tomboy', 0.5339425802230835),
 ('mom', 0.5313844084739685),
 ('actress', 0.5302129983901978)]

The output shows that *man* is associated with some stereotypically male jobs (e.g., foreman, machinist) while *woman* is associated with some stereotypically female roles (e.g., cheerleader, actress). This indicates that there is gender bias in the word embeddings.

Find a new example, using the same approach as above, that indicates that there is bias in the word embeddings. Briefly explain how the model output indicates that there is bias in the word embeddings. (You are by no means restricted to considering gender bias here. You are encouraged to explore other ways that embeddings might indicate bias.)

In [44]:
# Write your code here
model.most_similar(positive=['doctor', 'intelligent'],
                   negative=['worker'],
                   restrict_vocab=100000)


[('astute', 0.5551915764808655),
 ('smart', 0.5531858801841736),
 ('perceptive', 0.5490319728851318),
 ('unintelligent', 0.5447569489479065),
 ('brilliant', 0.5297819375991821),
 ('doctors', 0.5252407789230347),
 ('erudite', 0.5232177376747131),
 ('sane', 0.5215501189231873),
 ('clever', 0.5201746225357056),
 ('well-informed', 0.5177281498908997)]

In [45]:
model.most_similar(positive=['worker', 'intelligent'],
                   negative=['doctor'],
                   restrict_vocab=100000)

[('unintelligent', 0.5780892372131348),
 ('smart', 0.5646598935127258),
 ('skilled', 0.5364747047424316),
 ('hard-working', 0.5275830626487732),
 ('efficient', 0.5258950591087341),
 ('intelligence', 0.523655354976654),
 ('hardworking', 0.5231699347496033),
 ('perceptive', 0.5225290656089783),
 ('adaptable', 0.5180816054344177),
 ('industrious', 0.5173808336257935)]

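There is an occupational bias in the word embeddings. When "doctor" and "intelligent" are the positive words and "worker" is the negative word, the top results ("astute", "smart", "perceptive") are all close in meaning to "intelligent". When "worker" and "intelligent" are the positive words and "doctor" is the negative word, the top result is "unintelligent", and several of the other results ("hard-working", "industrious") describe diligence rather than intelligence.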

# Part 2: Sentiment Analysis

## Background and data

In this part you will consider sentiment analysis of tweets. You will need the data for this assignment from D2L: train.docs.txt, train.classes.txt, test.docs.txt, test.classes.txt. Put those files in the same directory that you run this notebook from.

train.docs.txt and test.docs.txt are training and testing tweets, respectively, in one-tweet-per-line format. These are tweets related to health care reform in the United States from early 2010. All tweets contain the hashtag #hcr. These tweets have been manually labeled as “positive”, “negative”, or “neutral”.

These are real tweets. Some of the tweets contain content that you might find offensive (e.g., expletives, racist and homophobic remarks). Despite this offensive content, these tweets are still very valuable data, and building NLP systems that can operate over them is important. That is why we are working with this potentially-offensive data in this assignment.

This dataset is further described in the following paper: Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. [Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph](https://aclanthology.org/W11-2207/). In Proceedings of the First Workshop on Unsupervised Methods in NLP. Edinburgh, Scotland.

train.classes.txt and test.classes.txt contain class labels for the training and test data, 1 label per line. The labels are “positive”, “neutral”, and “negative”.

## Approach

We will consider sentiment analysis using an average of word embeddings document representation and a multinomial logistic regression classifier. We will compare this approach to a most-frequent class baseline.

Complete the function `vec_for_doc` below. (You should not modify other parts of the code.) This function takes a list consisting of the tokens in a document $d$. It then returns a vector $\vec{v}$ representing the document as the average of the embeddings for the words in the document as follows:

\begin{equation}
d = w_1, w_2, \ldots, w_n
\end{equation}
\begin{equation}
\vec{v} = \dfrac{\vec{w_1} + \vec{w_2} + \cdots + \vec{w_n}}{n}
\end{equation}

You can then run the code to compare logistic regression using an average of word embeddings to a most-frequent class baseline. (If your implementation of `vec_for_doc` is correct, logistic regression should beat the baseline in terms of accuracy (by a little bit) and in terms of F1.)
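To make the averaging formula concrete, here is a small worked example in plain Python. The two-dimensional embeddings and the helper name `average_vectors` are hypothetical and exist only for this illustration. In the notebook itself the word vectors come from `model`.

```python
def average_vectors(vectors):
    """Return the element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / n for i in range(dim)]

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "good": [1.0, 3.0],
    "health": [2.0, 0.0],
    "care": [3.0, 3.0],
}

doc = ["good", "health", "care"]  # d = w_1, w_2, w_3, so n = 3
doc_vector = average_vectors([toy_embeddings[w] for w in doc])
print(doc_vector)  # [2.0, 2.0]
```

Each dimension is averaged independently: (1 + 2 + 3) / 3 = 2 in the first dimension and (3 + 0 + 3) / 3 = 2 in the second. The result has the same shape as a single word vector.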



In [52]:
# TODO: Implement this function. tokenized_doc is a list of tokens in
# a document. Return a vector representation of the document as
# described above.
# Hints: 
# -You can get the vector for a word w using model[w] or
#  model.get_vector(w)
# -You can add vectors using + and sum, e.g.,
#  model['cat'] + model['dog']
#  sum([model['cat'], model['dog']])
# -You can see the shape of a vector using model['cat'].shape
# -The vector you return should have the same shape as a word vector 
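def vec_for_doc(tokenized_doc):
    # Start from a zero vector with the same shape as a word vector.
    doc_vec = np.zeros(model.vector_size)

    # Sum the embeddings of the tokens that the model has a vector for.
    for token in tokenized_doc:
        if token in model:
            doc_vec += model[token]

    # An empty document has no tokens to average over; return the
    # zero vector instead of dividing by zero.
    if not tokenized_doc:
        return doc_vec

    # Divide by n, the number of tokens in the document.
    return doc_vec / len(tokenized_doc)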



In [53]:
import math, re
import numpy as np
from sklearn.linear_model import LogisticRegression

# Get the train and test documents and classes. File formats
# are similar to assignment 2.
train_texts_fname = 'train.docs.txt'
train_klasses_fname = 'train.classes.txt'
test_texts_fname = 'test.docs.txt'
test_klasses_fname = 'test.classes.txt'

train_texts = [x.strip() for x in open(train_texts_fname,
                                       encoding='utf8')]
train_klasses = [x.strip() for x in open(train_klasses_fname,
                                         encoding='utf8')]
test_texts = [x.strip() for x in open(test_texts_fname,
                                      encoding='utf8')]
test_klasses = [x.strip() for x in open(test_klasses_fname,
                                        encoding='utf8')]

# A simple tokenizer. Applies case folding
def tokenize(s):
    tokens = s.lower().split()
    trimmed_tokens = []
    for t in tokens:
        if re.search(r'\w', t):
            # t contains at least 1 alphanumeric character
            t = re.sub(r'^\W*', '', t) # trim leading non-alphanumeric chars
            t = re.sub(r'\W*$', '', t) # trim trailing non-alphanumeric chars
        trimmed_tokens.append(t)
    return trimmed_tokens

# train_vecs and test_vecs are lists; each element is a vector
# representing a (train or test) document
train_vecs = [vec_for_doc(tokenize(x)) for x in train_texts]
test_vecs = [vec_for_doc(tokenize(x)) for x in test_texts]

# Train logistic regression, similarly to assignment 2
lr = LogisticRegression(multi_class='multinomial',
                        solver='sag',
                        penalty='l2',
                        max_iter=1000000,
                        random_state=0)
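# Note: this second assignment replaces the classifier configured
# above with one that uses scikit-learn's default settings.
lr = LogisticRegression()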
clf = lr.fit(train_vecs, train_klasses)
results = clf.predict(test_vecs)
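To see what the tokenizer above does to a tweet, the sketch below repeats its definition (with raw-string regular expressions) and applies it to a made-up example. The sample text is hypothetical and is not taken from the dataset.

```python
import re

def tokenize(s):
    """Lowercase, split on whitespace, and trim punctuation from token edges."""
    tokens = s.lower().split()
    trimmed_tokens = []
    for t in tokens:
        if re.search(r'\w', t):
            # t contains at least 1 alphanumeric character
            t = re.sub(r'^\W*', '', t)  # trim leading non-alphanumeric chars
            t = re.sub(r'\W*$', '', t)  # trim trailing non-alphanumeric chars
        trimmed_tokens.append(t)
    return trimmed_tokens

# A made-up tweet, for illustration only.
sample = "Great news! #hcr passed :) @user"
print(tokenize(sample))
# ['great', 'news', 'hcr', 'passed', ':)', 'user']
```

The `#` and `@` prefixes are stripped, so hashtags and mentions are looked up as ordinary words. Tokens with no alphanumeric characters, such as `:)`, are kept unchanged.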



In [48]:
# Determine accuracy and macro F1 using sklearn evaluation metrics

import sklearn.metrics

acc = sklearn.metrics.accuracy_score(test_klasses, results)
f1 = sklearn.metrics.f1_score(test_klasses, results, average='macro')

print("Accuracy: ", acc) 
print("Macro F1: ", f1)



Accuracy:  0.6975
Macro F1:  0.3875844421020532


In [49]:
# Also determine accuracy and macro F1 for a most-frequent class baseline

from sklearn.dummy import DummyClassifier

baseline_clf = DummyClassifier(strategy="most_frequent")
baseline_clf.fit(train_vecs, train_klasses)
baseline_results = baseline_clf.predict(test_vecs)

acc = sklearn.metrics.accuracy_score(test_klasses, baseline_results)
f1 = sklearn.metrics.f1_score(test_klasses, baseline_results, average='macro')

print("Baseline accuracy: ", acc) 
print("Baseline macro F1: ", f1)


Baseline accuracy:  0.67
Baseline macro F1:  0.26746506986027946
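The gap between the baseline's accuracy and its macro F1 follows from how macro F1 is defined: F1 is computed per class and then averaged, and a classifier that always predicts one class scores 0 on every other class. The sketch below shows this on hypothetical labels (the class distribution is invented). `macro_f1` is an illustrative helper written for this example, not scikit-learn's implementation.

```python
from collections import Counter

def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)

# Hypothetical gold labels: 6 negative, 3 positive, 1 neutral.
gold = ["negative"] * 6 + ["positive"] * 3 + ["neutral"] * 1

# Most-frequent-class baseline: always predict the commonest label.
most_frequent = Counter(gold).most_common(1)[0][0]
predicted = [most_frequent] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print("Accuracy:", accuracy)                    # 0.6
print("Macro F1:", macro_f1(gold, predicted))   # 0.25
```

The majority class gets F1 = 2·6 / (2·6 + 4) = 0.75 and the other two classes get 0, so the macro average is 0.25. The same arithmetic with an accuracy of 0.67 and three classes gives (2·0.67 / 1.67) / 3 ≈ 0.267, which is consistent with the baseline output above.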


# Submitting your work

When you're done, submit a3.ipynb to the assignment 3 folder on D2L.