#  Word vectors

In this two-part assignment you will first examine and interact with word vectors. (This part of the assignment is adapted from a recent CS224N assignment at Stanford.) You will then implement a new approach to sentiment analysis.

In this assignment we will use [gensim](https://radimrehurek.com/gensim/) to access and interact with word embeddings. In gensim we’ll be working with a KeyedVectors object which represents word embeddings. [Documentation for KeyedVectors is available.](https://radimrehurek.com/gensim/models/keyedvectors.html) However, this assignment description and the sample code in it might be sufficient to show you how to use a KeyedVectors object.




In [1]:
#hanieh ghabelialla  
import gensim.downloader
model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



# Part 1: Examining word vectors

## Polysemy and homonymy

Polysemy and homonymy are the phenomena of words having multiple meanings/senses. The nearest neighbours (under cosine similarity) for a given word can indicate whether it has multiple senses.
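
As a rough illustration of what "nearest neighbours under cosine similarity" means, the sketch below ranks a few made-up 2-dimensional vectors by their cosine similarity to a query word. The vectors and the helper names `cosine` and `nearest` are hypothetical and chosen only for illustration; they are not taken from the fastText model used in this assignment.

```python
import math

# Made-up 2-d "embeddings" for illustration only (not from the real model).
toy_vectors = {
    "mouse":    [1.0, 1.0],
    "rat":      [1.0, 0.3],
    "keyboard": [0.2, 1.0],
    "banana":   [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Return the topn other words ranked by cosine similarity to word."""
    scores = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = nearest("mouse", toy_vectors)
print(neighbours)
```

In this toy setup the vector for "mouse" lies between the "rat" and "keyboard" directions, so both appear among its nearest neighbours. That is the kind of pattern to look for when a real word has more than one sense.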

Consider the following example which shows the top-10 most similar words for *mouse*. The "input device" and "animal" senses of *mouse* are clearly visible from the top-10 most similar words.


In [2]:
# Find the words most similar to "mouse" using cosine similarity.
# restrict_vocab=100000 limits the results to the 100000 most
# frequent words. This avoids rare words in the output. For this
# assignment, whenever you call most_similar, also pass
# restrict_vocab=100000.
model.most_similar('mouse', restrict_vocab=100000)

[('mice', 0.7038448452949524),
 ('rat', 0.6446240544319153),
 ('rodent', 0.6280483603477478),
 ('Mouse', 0.6180493831634521),
 ('cursor', 0.6154769062995911),
 ('keyboard', 0.6149151921272278),
 ('rabbit', 0.607288658618927),
 ('cat', 0.6070616245269775),
 ('joystick', 0.5888146162033081),
 ('touchpad', 0.5878496766090393)]

*Cursor*, *keyboard*, *joystick*, *touchpad* correspond to the input device sense. *Rat*, *rodent*, *rabbit*, *cat* correspond to the animal sense.


You can observe something similar for the different senses of the word *leaves*. Find a new example that exhibits polysemy/homonymy, show its top-10 most similar words, and explain why they show that this word has multiple senses. Write your answer in the code and text boxes below.

In [53]:
model.most_similar('Match', restrict_vocab=100000)



[('Matches', 0.7960425019264221),
 ('Matching', 0.7419515252113342),
 ('Score', 0.6290581226348877),
 ('Game', 0.6225259900093079),
 ('Test', 0.6177693605422974),
 ('Play', 0.6137019991874695),
 ('Man', 0.6104843616485596),
 ('Knockout', 0.6100562810897827),
 ('Round', 0.6062699556350708),
 ('Fight', 0.6050064563751221)]


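Answer: The word *match* has several senses: a sports contest, a small stick for starting a fire, and a pairing of things that fit together. The neighbours above mostly reflect the sports-contest sense (*Score*, *Game*, *Knockout*, *Round*, *Fight*), while *Matching* points to the pairing sense. The model mixes these senses because it learns a single vector for the word from all of the contexts in which it appears. Note that the query uses the capitalized form *Match*, and the neighbours returned are capitalized forms as well.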

## Synonyms and antonyms

Find three words (w1 , w2 , w3) such that w1 and w2 are synonyms (i.e., have roughly the same meaning), and w1 and w3 are antonyms (i.e., have opposite meanings), but the similarity between w1 and w3 > the similarity between w1 and w2. Note that this should be counter to your expectations, because synonyms (which mean roughly the same thing) would be expected to be more similar than antonyms (which have opposite meanings). Explain why you think this unexpected situation might have occurred.

Here is an example. w1 = *happy*, w2 = *cheerful*, and w3 = *sad*. (You will need to find a different example for your report.) Notice that the antonyms *happy* and *sad* are (slightly) more similar than the (near) synonyms *happy* and *cheerful*.


In [7]:
# Find the cosine similarity between "happy" and "cheerful"
model.similarity('happy', 'cheerful')


0.68476284

In [8]:

model.similarity('happy', 'sad')


0.69010293

In [18]:
model.similarity("complex", "Complicated")


0.4755332

In [19]:
model.similarity("complex","simple")

0.64020914

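Answer: This situation can occur because antonyms tend to appear in very similar contexts, and words are often contrasted with their opposites in the same text. Both *complex* and *simple*, for example, describe how difficult a problem or design is. Because the embedding algorithm places words that share contexts close together in the vector space, antonyms can end up closer than synonyms. Note also that the synonym query above uses the capitalized form *Complicated*, which has its own vector and may be less similar to *complex* than the lowercase form would be.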
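
The effect of shared contexts can be sketched with a tiny count-based example. This is a simplified stand-in for learned embeddings, not the method used to train the fastText vectors: the corpus is made up, and `context_counts` and `cosine` are hypothetical helpers.

```python
import math
from collections import Counter

# Tiny made-up corpus: the antonyms "hot" and "cold" occur in the same
# frames, while the near-synonym "scorching" occurs in a different one.
corpus = [
    "the soup is hot today",
    "the soup is cold today",
    "the tea is hot today",
    "the tea is cold today",
    "a scorching desert afternoon",
]

def context_counts(word, sentences):
    """Count the other words that co-occur with word in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

hot = context_counts("hot", corpus)
cold = context_counts("cold", corpus)
scorching = context_counts("scorching", corpus)

print(cosine(hot, cold))       # antonyms with identical contexts
print(cosine(hot, scorching))  # near-synonyms with no shared contexts here
```

In this toy corpus the antonyms have identical context counts, so their similarity is higher than that of the near-synonyms, which never share a context word.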

## Analogies

Analogies such as man is to king as woman is to X can be solved using word embeddings. This analogy can be expressed as X = woman + king − man. The following code snippet shows how to solve this analogy with gensim. Notice that the model gets it correct! I.e., *queen* is the most similar word.
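
To make the arithmetic X = woman + king − man concrete, here is a minimal sketch using made-up 2-dimensional vectors. The vectors, the axis interpretation, and the helper `solve_analogy` are assumptions for illustration and are not part of gensim.

```python
import math

# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def solve_analogy(a, b, c, vectors):
    """Solve "a is to b as c is to X" via X = c + b - a."""
    target = [vc + vb - va
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words from the candidate answers.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

answer = solve_analogy("man", "king", "woman", toy)
print(answer)
```

Here woman + king − man gives [1.0, −1.0], which coincides with the toy vector for "queen". Real embeddings are far noisier, so the target vector only lands near the answer rather than exactly on it.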

In [20]:
# Find the model's predictions for the solution to the analogy
# "man" is to "king" as "woman" is to X
model.most_similar(positive=['woman', 'king'],
                   negative=['man'],
                   restrict_vocab=100000)


[('queen', 0.7786749005317688),
 ('monarch', 0.6666999459266663),
 ('princess', 0.653827428817749),
 ('kings', 0.6497675180435181),
 ('queens', 0.6284460425376892),
 ('prince', 0.6235989928245544),
 ('ruler', 0.5971586108207703),
 ('kingship', 0.5883600115776062),
 ('lady', 0.5851913094520569),
 ('royal', 0.5821066498756409)]

### Correct analogy

Find a new analogy that the model is able to answer correctly (i.e., the most-similar word is the solution to the analogy). Explain briefly why the analogy holds. For the above example, this explanation would be something along the lines of a king is a ruler who is a man and a queen is a ruler who is a woman.


In [22]:
# "man" is to "businessman" as "woman" is to X
model.most_similar(positive=['woman', 'businessman'],
                   negative=['man'],
                   restrict_vocab=100000)

[('businesswoman', 0.8535027503967285),
 ('businessperson', 0.7498276233673096),
 ('entrepreneur', 0.7154306769371033),
 ('housewife', 0.6567320227622986),
 ('philanthropist', 0.6523854732513428),
 ('industrialist', 0.6522493958473206),
 ('banker', 0.6502622365951538),
 ('politician', 0.6422467231750488),
 ('lawyer', 0.6274409890174866),
 ('congresswoman', 0.6196478605270386)]

Answer: The analogy holds because a businessman is a man engaged in business. Changing the gender to woman while keeping the occupation fixed gives the female equivalent, which is *businesswoman*.

### Incorrect analogy

Find a new analogy that the model is not able to answer correctly. Again explain briefly why the analogy holds. For example, here is an analogy that the model does not answer correctly:


In [33]:
# Find the model's predictions for the solution to the analogy
# "plate" is to "food" as "cup" is to X
model.most_similar(positive=['cup', 'food'],
                   negative=['plate'],
                   restrict_vocab=100000)

[('cups', 0.5481787919998169),
 ('coffee', 0.5461026430130005),
 ('beverage', 0.5460603833198547),
 ('drink', 0.5451807975769043),
 ('tea', 0.53434818983078),
 ('foods', 0.5310320854187012),
 ('drinks', 0.516447901725769),
 ('beverages', 0.5022991299629211),
 ('milk', 0.4976045787334442),
 ('non-food', 0.4929129481315613)]

A plate is used to serve food as a cup is used to serve a drink, but the model does not predict drink, or a similar term, as the most similar word.

In [34]:
# "socks" is to "feet" as "gloves" is to X
model.most_similar(positive=['gloves', 'feet'],
                   negative=['socks'],
                   restrict_vocab=100000)



[('inches', 0.5423212647438049),
 ('hands', 0.5274760723114014),
 ('foot', 0.5268399715423584),
 ('glove', 0.5170854926109314),
 ('meters', 0.4980737268924713),
 ('fingertips', 0.4865354895591736),
 ('elbows', 0.4857611656188965),
 ('centimeters', 0.47440657019615173),
 ('metres', 0.47025373578071594),
 ('millimeters', 0.4700991213321686)]

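Socks are worn on feet as gloves are worn on hands, but the model does not predict *hands*, or a similar term, as the most similar word. Instead the top result is *inches*, followed by other units of length such as *meters* and *centimeters*. A likely reason is that *feet* is also a unit of measurement, so its vector reflects that sense as well as the body-part sense.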

## Bias

Consider the examples below. The first shows the words that are most similar to *man* and *worker* and least similar to *woman*. The second shows the words that are most similar to *woman* and *worker* and least similar to *man*.

In [35]:
# Find the words that are most similar to "man" and "worker" and
# least similar to "woman".
model.most_similar(positive=['man', 'worker'],
                   negative=['woman'],
                   restrict_vocab=100000)



[('workman', 0.7217649817466736),
 ('laborer', 0.6744564175605774),
 ('labourer', 0.6498093605041504),
 ('workers', 0.6487939357757568),
 ('foreman', 0.6226886510848999),
 ('machinist', 0.6098095178604126),
 ('employee', 0.6091086864471436),
 ('technician', 0.6029269099235535),
 ('helper', 0.5994961261749268),
 ('manager', 0.5832769274711609)]

In [44]:
# Find the words that are most similar to "woman" and "worker" and
# least similar to "man".
model.most_similar(positive=['woman', 'worker'],
                   negative=['man'],
                   restrict_vocab=100000)

[('workers', 0.6522067189216614),
 ('employee', 0.6391042470932007),
 ('housewife', 0.608704686164856),
 ('nurse', 0.5983445644378662),
 ('employer', 0.5973758101463318),
 ('caseworker', 0.5940523147583008),
 ('seamstress', 0.581404447555542),
 ('laborer', 0.5809912085533142),
 ('policewoman', 0.5767977237701416),
 ('Worker', 0.5750083327293396)]

The output shows that *man* is associated with some stereotypically male jobs (e.g., foreman, machinist) while *woman* is associated with some stereotypically female jobs (e.g., housewife, nurse, seamstress). This indicates that there is gender bias in the word embeddings.


In [43]:
# Find the words that are most similar to "woman" and "sport" and
# least similar to "man".
model.most_similar(positive=['woman', 'sport'],
                   negative=['man'],
                   restrict_vocab=100000)

[('sports', 0.7389599084854126),
 ('netball', 0.6590258479118347),
 ('athletics', 0.6529361009597778),
 ('sporting', 0.6372365355491638),
 ('athlete', 0.5908907651901245),
 ('sportsperson', 0.5895546078681946),
 ('gymnastics', 0.5854197144508362),
 ('volleyball', 0.5823972821235657),
 ('soccer', 0.5809586644172668),
 ('sportive', 0.5755864977836609)]

In [42]:
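# Find the words that are most similar to "man" and "sport" and
# least similar to "women" (note the plural, unlike "woman" above).
model.most_similar(positive=['man', 'sport'],
                   negative=['women'],
                   restrict_vocab=100000)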



[('sportsman', 0.677655041217804),
 ('sporting', 0.6111401915550232),
 ('sportive', 0.5944186449050903),
 ('sports', 0.5667303204536438),
 ('football', 0.5594378709793091),
 ('showman', 0.552361011505127),
 ('athlete', 0.5520541667938232),
 ('sportsperson', 0.5415318608283997),
 ('racing', 0.5353881120681763),
 ('beast', 0.5351638197898865)]

The output shows that some sports are associated more with men, like football, while other sports are associated with women, like volleyball, indicating a potential bias.

# Part 2: Sentiment Analysis

## Background and data

In this part you will consider sentiment analysis of tweets. You will need the data for this assignment from D2L: train.docs.txt, train.classes.txt, test.docs.txt, test.classes.txt. Put those files in the same directory that you run this notebook from.

train.docs.txt and test.docs.txt are training and testing tweets, respectively, in one-tweet-per-line format. These are tweets related to health care reform in the United States from early 2010. All tweets contain the hashtag #hcr. These tweets have been manually labeled as “positive”, “negative”, or “neutral”.

These are real tweets. Some of the tweets contain content that you might find offensive (e.g., expletives, racist and homophobic remarks). Despite this offensive content, these tweets are still very valuable data, and building NLP systems that can operate over them is important. That is why we are working with this potentially-offensive data in this assignment.

This dataset is further described in the following paper: Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. [Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph](https://aclanthology.org/W11-2207/). In Proceedings of the First Workshop on Unsupervised Methods in NLP. Edinburgh, Scotland.

train.classes.txt and test.classes.txt contain class labels for the training and test data, 1 label per line. The labels are “positive”, “neutral”, and “negative”.

## Approach

We will consider sentiment analysis using an average of word embeddings document representation and a multinomial logistic regression classifier. We will compare this approach to a most-frequent class baseline.
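
For reference, a most-frequent class baseline can be sketched in a few lines. The labels below are made up for illustration; in the assignment they come from train.classes.txt and test.classes.txt, and the baseline itself is computed later with scikit-learn.

```python
from collections import Counter

# Made-up labels for illustration only.
train_labels = ["negative", "negative", "positive", "neutral", "negative"]
test_labels = ["negative", "positive", "negative", "neutral"]

# The baseline ignores the documents and always predicts the label that
# is most common in the training data.
most_frequent = Counter(train_labels).most_common(1)[0][0]
predictions = [most_frequent] * len(test_labels)

# Accuracy is the fraction of test labels that equal the majority label.
correct = sum(p == g for p, g in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
print(most_frequent, accuracy)
```

Any useful classifier should do better than this baseline, which is why it serves as the point of comparison.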

Complete the function `vec_for_doc` below. (You should not modify other parts of the
code.) This function takes a list consisting of the tokens in a document $d$. It then returns a vector $\vec{v}$ representing the document as the average of the embeddings for the words in the document as follows:

\begin{equation}
d = w_1, w_2, \ldots, w_n
\end{equation}
\begin{equation}
\vec{v} = \dfrac{\vec{w_1} + \vec{w_2} + \cdots + \vec{w_n}}{n}
\end{equation}

You can then run the code to compare logistic regression using an average of word embeddings to a most-frequent class baseline. (If your implementation of `vec_for_doc` is correct, logistic regression should beat the baseline in terms of accuracy (by a little bit) and in terms of F1.)
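
As a small worked example of the averaging formula, the sketch below uses made-up 3-dimensional embeddings in place of the 300-dimensional vectors from the model. The dictionary `toy_embeddings` and its values are hypothetical.

```python
import numpy as np

# Made-up 3-d embeddings for illustration; the real model uses 300 dimensions.
toy_embeddings = {
    "health": np.array([1.0, 0.0, 2.0]),
    "care":   np.array([3.0, 2.0, 0.0]),
    "reform": np.array([2.0, 4.0, 1.0]),
}

doc = ["health", "care", "reform"]

# v = (w_1 + w_2 + ... + w_n) / n
doc_vec = sum(toy_embeddings[w] for w in doc) / len(doc)
print(doc_vec)        # [2. 2. 1.]
print(doc_vec.shape)  # (3,)
```

The document vector has the same shape as a single word vector, which is what allows every document to be passed to the classifier as a fixed-length feature vector regardless of its length.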



In [52]:

# TODO: Implement this function. tokenized_doc is a list of tokens in
# a document. Return a vector representation of the document as
# described above.
# Hints:
# -You can get the vector for a word w using model[w] or
#  model.get_vector(w)
# -You can add vectors using + and sum, e.g.,
#  model['cat'] + model['dog']
#  sum([model['cat'], model['dog']])
# -You can see the shape of a vector using model['cat'].shape
# -The vector you return should have the same shape as a word vector
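import numpy as np

def vec_for_doc(tokenized_doc):
    # Running sum of the embeddings of the tokens found in the model
    doc_vector = np.zeros(model.vector_size)
    token_count = 0

    for token in tokenized_doc:
        try:
            doc_vector += model.get_vector(token)
            token_count += 1
        except KeyError:
            # Skip tokens that have no embedding in the model
            continue

    if token_count > 0:
        # Average over the tokens that had an embedding
        return doc_vector / token_count
    # No token had an embedding: return the zero vector
    return doc_vector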



In [54]:
import math, re
import numpy as np
from sklearn.linear_model import LogisticRegression

# Get the train and test documents and classes. File formats
# are similar to assignment 2.
train_texts_fname = 'train.docs.txt'
train_klasses_fname = 'train.classes.txt'
test_texts_fname = 'test.docs.txt'
test_klasses_fname = 'test.classes.txt'

train_texts = [x.strip() for x in open(train_texts_fname,
                                       encoding='utf8')]
train_klasses = [x.strip() for x in open(train_klasses_fname,
                                         encoding='utf8')]
test_texts = [x.strip() for x in open(test_texts_fname,
                                      encoding='utf8')]
test_klasses = [x.strip() for x in open(test_klasses_fname,
                                        encoding='utf8')]

# A simple tokenizer. Applies case folding
def tokenize(s):
    tokens = s.lower().split()
    trimmed_tokens = []
    for t in tokens:
        if re.search(r'\w', t):
            # t contains at least 1 alphanumeric character
            t = re.sub(r'^\W*', '', t) # trim leading non-alphanumeric chars
            t = re.sub(r'\W*$', '', t) # trim trailing non-alphanumeric chars
        trimmed_tokens.append(t)
    return trimmed_tokens

# train_vecs and test_vecs are lists; each element is a vector
# representing a (train or test) document
train_vecs = [vec_for_doc(tokenize(x)) for x in train_texts]
test_vecs = [vec_for_doc(tokenize(x)) for x in test_texts]

# Train logistic regression, similarly to assignment 2
lr = LogisticRegression(multi_class='multinomial',
                        solver='sag',
                        penalty='l2',
                        max_iter=1000000,
                        random_state=0)
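# Note: the next line replaces the classifier configured above, so the
# model that is fit uses scikit-learn's default LogisticRegression settings.
lr = LogisticRegression()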
clf = lr.fit(train_vecs, train_klasses)
results = clf.predict(test_vecs)
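
To see what the tokenizer in the cell above produces, the following sketch applies a copy of the same logic to a made-up tweet. The example text is an assumption and is not taken from the dataset.

```python
import re

def tokenize(s):
    # Same logic as the tokenizer above, written with raw-string patterns.
    tokens = s.lower().split()
    trimmed_tokens = []
    for t in tokens:
        if re.search(r'\w', t):
            t = re.sub(r'^\W*', '', t)  # trim leading non-alphanumeric chars
            t = re.sub(r'\W*$', '', t)  # trim trailing non-alphanumeric chars
        trimmed_tokens.append(t)
    return trimmed_tokens

tokens = tokenize("Great news!! #HCR passed :)")
print(tokens)  # ['great', 'news', 'hcr', 'passed', ':)']
```

Tokens with no alphanumeric character, such as the emoticon, are kept unchanged. Such tokens may not have an embedding in the model, which is one reason `vec_for_doc` has to cope with missing words.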



In [55]:
# Determine accuracy and macro F1 using sklearn evaluation metrics

import sklearn.metrics

acc = sklearn.metrics.accuracy_score(test_klasses, results)
f1 = sklearn.metrics.f1_score(test_klasses, results, average='macro')

print("Accuracy: ", acc)
print("Macro F1: ", f1)



Accuracy:  0.6975
Macro F1:  0.3709784548964257


In [56]:
# Also determine accuracy and macro F1 for a most-frequent class baseline

from sklearn.dummy import DummyClassifier

baseline_clf = DummyClassifier(strategy="most_frequent")
baseline_clf.fit(train_vecs, train_klasses)
baseline_results = baseline_clf.predict(test_vecs)

acc = sklearn.metrics.accuracy_score(test_klasses, baseline_results)
f1 = sklearn.metrics.f1_score(test_klasses, baseline_results, average='macro')

print("Baseline accuracy: ", acc)
print("Baseline macro F1: ", f1)


Baseline accuracy:  0.67
Baseline macro F1:  0.26746506986027946
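
The baseline macro F1 can be reproduced by hand from the baseline accuracy. The arithmetic below is a sketch that assumes three classes and assumes that the F1 of a class that is never predicted is counted as 0, which is consistent with the value printed above.

```python
# Baseline accuracy reported above.
baseline_accuracy = 0.67

# Majority class: every test document of that class is recovered
# (recall 1), and the fraction of predictions that are correct equals
# the accuracy (precision 0.67).
precision, recall = baseline_accuracy, 1.0
f1_majority = 2 * precision * recall / (precision + recall)

# Assumption: the other two classes are never predicted, so their F1 is 0.
f1_per_class = [f1_majority, 0.0, 0.0]
macro_f1 = sum(f1_per_class) / len(f1_per_class)

print(round(f1_majority, 4), round(macro_f1, 4))  # 0.8024 0.2675
```

This shows why macro F1 penalizes the baseline much more heavily than accuracy does: two of the three classes contribute nothing to the average.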


# Submitting your work

When you're done, submit a3.ipynb to the assignment 3 folder on D2L.