1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also utilizes the benefit of counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
import numpy as np
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.

In [None]:
### CODE HERE
def get_similar_words(model, word, top_n=10):
    # Get the top_n most similar words to the input word (words only, scores dropped)
    try:
        similar_words = model.most_similar(word, topn=top_n)
        return [sw[0] for sw in similar_words]
    except KeyError:
        return ["Word not in model's vocabulary"]

In [None]:
get_similar_words(wv_from_bin, "wolf", top_n=10)

['bear',
 'dog',
 'lion',
 'grizzly',
 'wolves',
 'coyote',
 'mankowitz',
 'cub',
 'hunter',
 'hunt']

This is a known limitation of static word embedding models such as GloVe and word2vec. These models represent each word as a single vector in a high-dimensional space, where the vector's position is determined by the contexts in which the word appears in the training corpus. The model learns to place words that occur in similar contexts close together in this space. For "wolf", the top 10 above is dominated by the animal sense ("bear", "dog", "lion", "coyote"); only "mankowitz" (as in the writer Wolf Mankowitz) reflects its use as a personal name.

However, this approach doesn't handle polysemy or homonymy well. Polysemous words have multiple related meanings, while homonyms are spelled and pronounced the same way but have different, unrelated meanings. Since a static embedding represents each word as a single vector, it can't capture these meanings separately. Instead, it tends to blend them, resulting in a vector that is often dominated by the most frequent sense and might not accurately represent any of the others.

For example, consider the word "bank". It could mean a financial institution, or the side of a river. In a static embedding, these two meanings are conflated into a single vector. Depending on the contexts in which "bank" appeared in the training corpus, the resulting vector might be closer to words related to finance, or to words related to geography, or it might not be particularly close to either.

There are more advanced models, like BERT or ELMo, that can handle polysemy and homonymy better. These models represent words as functions of their context, so they can generate different vectors for the same word in different contexts. However, they're also more complex and computationally intensive than GloVe or word2vec.
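
The "bank" example can be illustrated with a toy calculation. The sketch below is only an illustration under a strong simplifying assumption: that the single vector of an ambiguous word behaves like a frequency-weighted average of two hypothetical, orthogonal "sense" vectors. The 2-D values and the names `finance_sense`, `river_sense`, and `blended_vector` are made up for this example and are not taken from GloVe.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical, orthogonal sense directions in a toy 2-D space (not real GloVe values).
finance_sense = np.array([1.0, 0.0])
river_sense = np.array([0.0, 1.0])

def blended_vector(finance_share):
    """Single vector for a word whose contexts are `finance_share` finance and
    the rest river, modeled (as an assumption) as a weighted average."""
    return finance_share * finance_sense + (1.0 - finance_share) * river_sense

for share in (0.5, 0.9):
    v = blended_vector(share)
    print(share,
          round(cosine(v, finance_sense), 3),
          round(cosine(v, river_sense), 3))
```

With an even split the blended vector is equally close to both senses, but at a 90/10 split it is almost indistinguishable from the frequent sense. Under this assumption, that is one plausible reason why the top 10 neighbors of many polysemous words show only one meaning.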

### SOLUTION

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [None]:
w1 = "big"
w2 = "large"
w3 = "small"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms big, large have cosine distance: 0.3340132236480713
Antonyms big, small have cosine distance: 0.3511464595794678
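
Note that with the distances printed above, the triple (big, large, small) does not satisfy the required inequality: the synonym distance (about 0.334) is smaller than the antonym distance (about 0.351). A small helper can make this check explicit while trying other triples. The sketch below is hypothetical: `antonym_closer_than_synonym` and `toy_distance` are names introduced here, and the stand-in distance table only holds the two rounded values from the output above so that the block runs without downloading GloVe.

```python
def antonym_closer_than_synonym(distance, w1, w2, w3):
    """Return (d_syn, d_ant, ok), where ok is True when the antonym w3 is
    strictly closer to w1 than the synonym w2 is."""
    d_syn = distance(w1, w2)
    d_ant = distance(w1, w3)
    return d_syn, d_ant, d_ant < d_syn

# Stand-in distance table (rounded values copied from the output above).
_toy = {("big", "large"): 0.334, ("big", "small"): 0.351}

def toy_distance(a, b):
    """Look up a precomputed toy distance; replaces a real model here."""
    return _toy[(a, b)]

print(antonym_closer_than_synonym(toy_distance, "big", "large", "small"))
# (0.334, 0.351, False)
```

With the loaded model, the same helper could presumably be called as `antonym_closer_than_synonym(wv_from_bin.distance, w1, w2, w3)` for each candidate triple until one returns `True` in the last position.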


### SOLUTION

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [None]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

In [None]:
def cosine_similarity(A, B):
    dot_product = np.dot(A, B)
    norm_a = np.linalg.norm(A)
    norm_b = np.linalg.norm(B)
    return dot_product / (norm_a * norm_b)

In [None]:
man_vec = wv_from_bin['man']
grandfather_vec = wv_from_bin['grandfather']
woman_vec = wv_from_bin['woman']

x = woman_vec + (grandfather_vec - man_vec)

vocab = wv_from_bin.key_to_index

# Compute the cosine similarity between x and every word vector in the vocabulary
sim_dict = {}
for word in vocab:
    sim_dict[word] = cosine_similarity(x, wv_from_bin[word])

In [None]:
max(sim_dict, key=sim_dict.get)

'grandmother'
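
The word-by-word loop above can be slow on a large vocabulary. A vectorized variant is sketched below on a made-up 2-D vocabulary, where axis 0 loosely stands for "gender" and axis 1 for "generation". The function name `rank_by_cosine` and the toy values are assumptions for illustration. Unlike the loop above, it also skips the query words, which, as far as I understand, `most_similar` does as well.

```python
import numpy as np

def rank_by_cosine(target, matrix, words, exclude=()):
    """Return (word, cosine) pairs sorted by descending similarity to `target`.
    `matrix` holds one word vector per row, aligned with `words`."""
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_target = target / np.linalg.norm(target)
    sims = unit_rows @ unit_target          # one dot product per vocabulary row
    order = np.argsort(-sims)               # indices from most to least similar
    return [(words[i], float(sims[i])) for i in order if words[i] not in exclude]

# Toy vocabulary with made-up vectors (not GloVe values).
words = ["man", "woman", "grandfather", "grandmother"]
matrix = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 2.0], [-1.0, 2.0]])
vec = dict(zip(words, matrix))

# x = w + g - m, the expression whose cosine similarity is being maximized.
x = vec["woman"] + vec["grandfather"] - vec["man"]
ranked = rank_by_cosine(x, matrix, words, exclude={"man", "woman", "grandfather"})
print(ranked[0][0])
```

On the real model, the matrix and word list would presumably come from `wv_from_bin.vectors` and `wv_from_bin.index_to_key` (attribute names as in gensim 4.x). Scores may differ slightly from `most_similar`, which to my understanding unit-normalizes the input vectors before combining them.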

### SOLUTION

### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?

Because these words are **semantically similar** to "grandmother". Word embeddings capture semantic relationships between words, so words that are used in similar contexts will have similar vectors.

In the case of "grandmother", "granddaughter", "daughter", and "mother", these words are often used in similar contexts as they all relate to family relationships and, more specifically, female family members. Therefore, their vectors are close in the high-dimensional space, leading to high cosine similarity scores.

The `most_similar` function doesn't understand the specific relationship you're trying to capture (i.e., the gender-switched equivalent of "grandfather"). It simply returns words that have vectors close to the computed vector `x`. Hence, other family-related terms appear in the results.

### SOLUTION

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [None]:
x, y, a, b = 'man', 'king', 'woman', 'queen'
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b
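
Since many analogies need to be tried, it can help to report the rank of the intended word instead of stopping on a failed `assert`. The helper below is a hypothetical sketch: `check_analogy` and `toy_most_similar` are names introduced here, and the stand-in function returns the first three rows (rounded) of the grandfather output shown earlier so that the block runs without the model.

```python
def check_analogy(most_similar, x, y, a, b, topn=10):
    """Check the analogy x : y :: a : b.
    Returns (holds, rank, top_word): `rank` is the 1-based position of b in
    the top-n candidates, or None if b is not among them."""
    candidates = [w for w, _ in most_similar(positive=[a, y], negative=[x], topn=topn)]
    rank = candidates.index(b) + 1 if b in candidates else None
    return rank == 1, rank, candidates[0]

def toy_most_similar(positive, negative, topn=10):
    """Stand-in for a model's most_similar; ignores its inputs."""
    ranked = [("grandmother", 0.7608), ("granddaughter", 0.7201), ("daughter", 0.7168)]
    return ranked[:topn]

print(check_analogy(toy_most_similar, "man", "grandfather", "woman", "grandmother"))
# (True, 1, 'grandmother')
print(check_analogy(toy_most_similar, "man", "grandfather", "woman", "daughter"))
# (False, 3, 'grandmother')
```

With the loaded vectors, a call such as `check_analogy(wv_from_bin.most_similar, 'man', 'king', 'woman', 'queen')` should work the same way, assuming `most_similar` accepts `positive`, `negative`, and `topn` as keyword arguments.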

### SOLUTION

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did.

In [None]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


1. **Frequency of word usage**: The words "foot" and "sock" might not co-occur as frequently as "hand" and "glove" in the training corpus. This could affect the learned relationships.

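2. **Multiple meanings**: Words can have multiple meanings. For example, "foot" can also refer to a unit of measurement, and "glove" can refer to a type of boxing equipment. The results above ('45,000-square', '15,000-square', and so on) suggest that the measurement sense of "foot", as in "45,000-square foot", dominates its vector. When such a sense is more prevalent in the training corpus, the resulting vector might not capture the intended relationship.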

3. **Complexity of relationships**: The relationship between "hand" and "glove" might not be exactly parallel to the relationship between "foot" and "sock" in terms of usage, context, and semantics. For example, gloves might be discussed more often in the context of safety equipment, while socks might be discussed more often in the context of comfort or fashion.
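
One way to probe the "multiple meanings" explanation is to compare how similar a word is, on average, to a few probe words for each candidate sense. The sketch below is hypothetical throughout: `mean_similarity`, `dominant_sense`, the probe words, and the similarity values in `toy_sim` are all made up for illustration and are not measurements from GloVe.

```python
def mean_similarity(similarity, word, probes):
    """Average similarity between `word` and each probe word."""
    return sum(similarity(word, p) for p in probes) / len(probes)

def dominant_sense(similarity, word, probes_by_sense):
    """Return (best_sense, scores), where scores maps each sense label to the
    mean similarity between `word` and that sense's probe words."""
    scores = {sense: mean_similarity(similarity, word, probes)
              for sense, probes in probes_by_sense.items()}
    return max(scores, key=scores.get), scores

# Made-up similarity values so the sketch runs without the model.
toy_sim = {("foot", "inch"): 0.7, ("foot", "meter"): 0.6,
           ("foot", "ankle"): 0.5, ("foot", "toe"): 0.4}

def toy_similarity(a, b):
    """Stand-in for a model's pairwise similarity function."""
    return toy_sim[(a, b)]

probes = {"measurement": ["inch", "meter"], "body part": ["ankle", "toe"]}
best, scores = dominant_sense(toy_similarity, "foot", probes)
print(best, scores)
```

With the real vectors, `wv_from_bin.similarity` could be passed in place of `toy_similarity`. Which sense actually scores higher for "foot" would have to be checked on the model; the toy numbers above do not answer that.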

### SOLUTION

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [None]:
x, y, a, b = 'sun', 'day', 'moon', 'night'
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

[('days', 0.6118359565734863),
 ('week', 0.592682421207428),
 ('month', 0.5850773453712463),
 ('next', 0.563667356967926),
 ('first', 0.5591050386428833),
 ('trip', 0.552686333656311),
 ('last', 0.5418316125869751),
 ('weeks', 0.5370946526527405),
 ('eve', 0.5344699621200562),
 ('this', 0.5237874984741211)]


AssertionError: 

The intended analogy is sun : day :: moon : night, but the word vectors rank **'days'** first, and 'night' does not appear in the top 10 at all.

### SOLUTION

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.

In [None]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


The gender bias in the word embeddings is reflected in the associations the model makes between the words “man”, “woman”, and “profession” and the other words it deems most similar.

For the word pair “man” and “profession”, the model associates words like “business”, “respected”, and “regarded”. These words carry a certain connotation of authority, respect, and business acumen, which are traditionally associated with men in many societies.

On the other hand, for the word pair “woman” and “profession”, the model associates words like “teaching”, “nursing”, “teacher”, and “educator”. These words are often associated with caregiving or educational roles, which are traditionally seen as ‘feminine’ professions in many societies.

This difference in associations reflects a gender bias in the data the model was trained on. It’s important to note that these biases are not a reflection of reality or an endorsement of these stereotypes, but rather a reflection of the biases present in the text data the model was trained on. As such, it’s crucial to be aware of these biases when using word embeddings in applications, as they can inadvertently perpetuate harmful stereotypes.

### SOLUTION

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [None]:
A = 'he'
B = 'she'
word = 'doctor'

pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('physician', 0.6617421507835388),
 ('surgeon', 0.5713728070259094),
 ('doctors', 0.5704407691955566),
 ('medical', 0.564996600151062),
 ('him', 0.5392459034919739),
 ('dr.', 0.5354140996932983),
 ('himself', 0.5292457938194275),
 ('his', 0.5211162567138672),
 ('hospital', 0.5205545425415039),
 ('man', 0.505194365978241)]

[('nurse', 0.6992564797401428),
 ('mother', 0.6033010482788086),
 ('woman', 0.6029850840568542),
 ('her', 0.5852972269058228),
 ('physician', 0.5735723376274109),
 ('pregnant', 0.5676814913749695),
 ('dr.', 0.5605944395065308),
 ('doctors', 0.5586872696876526),
 ('patient', 0.5518049001693726),
 ('hospital', 0.548409104347229)]


In this output, you can see that when we ask for words similar to “he” and “doctor” but dissimilar to “she”, we get words like “surgeon” and “physician”, which are typically high-status roles in the medical field. On the other hand, when we ask for words similar to “she” and “doctor” but dissimilar to “he”, we get “nurse” as the top result, which is a role that has been historically female-dominated and is often perceived as having lower status than “doctor”. This could be seen as a reflection of gender bias in the data the model was trained on.
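
The same kind of bias can also be summarized with a single number per word: the cosine similarity between the word and the difference of the "he" and "she" vectors. The sketch below is a simple illustration, not a standard gensim function; `gender_projection` and the 2-D vectors in `toy` are assumptions made up for this example.

```python
import numpy as np

def gender_projection(get_vector, word, male="he", female="she"):
    """Cosine similarity between `word` and the (male - female) direction.
    Positive values lean toward `male`, negative values toward `female`."""
    direction = get_vector(male) - get_vector(female)
    v = get_vector(word)
    return float(np.dot(v, direction) / (np.linalg.norm(v) * np.linalg.norm(direction)))

# Made-up 2-D vectors (not GloVe values): axis 0 is a toy "gender" axis.
toy = {
    "he": np.array([1.0, 1.0]),
    "she": np.array([-1.0, 1.0]),
    "surgeon": np.array([0.5, 2.0]),
    "nurse": np.array([-0.5, 2.0]),
}

for w in ("surgeon", "nurse"):
    print(w, round(gender_projection(toy.__getitem__, w), 3))
```

On the loaded model, `wv_from_bin.get_vector` could be passed as `get_vector`. Comparing the scores of several profession words would give a rough, single-direction view of the associations seen in the lists above.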

### SOLUTION

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

Bias in word vectors often originates from the data they are trained on. Word embeddings are learned from large amounts of text, so if that text contains biases, the embeddings will learn and reproduce them.

For example, if the training data frequently associates men with professions like “engineer” or “doctor” and women with roles like “nurse” or “teacher”, the word vectors will reflect these associations. This is because the model learns to predict words based on their context, and if certain words (like professions) are frequently found in the context of certain other words (like gendered pronouns), the model will learn to associate these words with each other.

A real-world example of this can be seen in the use of word embeddings in job recommendation algorithms. If a job recommendation algorithm uses biased word embeddings, it might recommend high-paying, prestigious jobs to men more often than to women, simply because the word vectors associate words like “CEO” or “engineer” more closely with “he” than with “she”. This can perpetuate gender inequality in the job market. It’s therefore crucial to be aware of these biases and to develop methods to mitigate them when using word embeddings in real-world applications.
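
The mechanism described above, where professions are frequently found in the context of gendered pronouns, can be made concrete by counting co-occurrences in a tiny corpus. GloVe is fit to word co-occurrence statistics, so skewed counts like these are the raw material for biased vectors. The corpus, the function name `cooccurrence_counts`, and the window size are all invented for this sketch.

```python
from collections import Counter

def cooccurrence_counts(sentences, targets, context_words, window=3):
    """Count how often each context word appears within `window` tokens
    of each target word. Keys are (target, context_word) pairs."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok not in targets:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in context_words:
                    counts[(tok, tokens[j])] += 1
    return counts

# A deliberately skewed toy corpus (made up for illustration).
corpus = [
    "he is an engineer",
    "he became an engineer",
    "she is a nurse",
    "she was a nurse",
    "she is an engineer",
]
counts = cooccurrence_counts(corpus, {"engineer", "nurse"}, {"he", "she"})
for pair in sorted(counts):
    print(pair, counts[pair])
```

In this toy corpus "engineer" co-occurs with "he" twice as often as with "she", and "nurse" never co-occurs with "he". A model trained on such counts has no way to tell that the skew comes from the text rather than from the meaning of the words.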

### SOLUTION

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.

One method to mitigate bias exhibited by word vectors is called "hard de-biasing". This method involves identifying specific directions in the vector space that correspond to undesired biases (such as gender or race), and then adjusting the word vectors to remove these biases.

For example, Bolukbasi et al. (2016), researchers at Boston University and Microsoft Research, used this method to reduce gender bias in word embeddings. They first identified a set of gender-neutral words (such as 'doctor' or 'engineer') that should not be associated with any particular gender. They then used the differences between the vectors of gendered word pairs (such as 'man' and 'woman', or 'he' and 'she') to estimate a "gender direction" in the vector space.

Next, they adjusted the vectors for the gender-neutral words to make them orthogonal (i.e., perpendicular) to the gender direction. This removes the component of each word along the gender direction while preserving its other semantic properties.

In a real-world example, this method could be used to reduce gender bias in a job recommendation system. By de-biasing the word vectors used by the system, it would be less likely to recommend certain jobs to men over women (or vice versa) based solely on their gender. This could help to promote more equitable hiring practices and reduce gender disparities in the workforce.
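
The step of making a word vector orthogonal to the gender direction can be sketched in a few lines. This is a simplified illustration under stated assumptions: the gender direction is taken from a single made-up "he" and "she" pair rather than estimated from many pairs, the vectors are toy 3-D values, and only the projection step is shown. The name `neutralize` is introduced here for the example.

```python
import numpy as np

def neutralize(vector, bias_direction):
    """Remove the component of `vector` that lies along `bias_direction`,
    leaving a vector orthogonal to that direction."""
    g = bias_direction / np.linalg.norm(bias_direction)  # unit bias direction
    return vector - np.dot(vector, g) * g

# Made-up 3-D vectors (not GloVe values); axis 0 plays the role of gender.
he = np.array([1.0, 1.0, 0.0])
she = np.array([-1.0, 1.0, 0.0])
doctor = np.array([0.4, 0.5, 2.0])

g = he - she                      # single-pair estimate of the gender direction
debiased = neutralize(doctor, g)
print(debiased)                   # the gender-axis component is now zero
print(np.dot(debiased, g))        # orthogonal to the gender direction
```

Only the first coordinate of `doctor` changes here, which mirrors the idea that the other semantic properties of the word are left intact.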


### SOLUTION