1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help readers understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also utilizes the benefit of counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
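One way to check whether a neighbor list covers more than one sense is to compare it against hand-picked keyword sets. The sketch below assumes the `(word, similarity)` list format that `most_similar` returns; the helper name `senses_in_neighbors`, the toy neighbor list, and the keyword sets are all hypothetical and only illustrate the idea.

```python
def senses_in_neighbors(neighbors, sense_keywords):
    """Map each sense label to the neighbor words that match its keyword set."""
    words = [word for word, _score in neighbors]
    return {
        sense: [w for w in words if w in keywords]
        for sense, keywords in sense_keywords.items()
    }

# Neighbors in the (word, similarity) format returned by most_similar.
# These values are made up for illustration; they are not GloVe output.
toy_neighbors = [("banks", 0.81), ("lender", 0.70), ("river", 0.55), ("shore", 0.52)]

# Hand-picked keyword sets for each sense (an assumption, not part of gensim).
toy_senses = {
    "finance": {"banks", "lender", "loan"},
    "geography": {"river", "shore", "creek"},
}

coverage = senses_in_neighbors(toy_neighbors, toy_senses)
# A word "works" for this task only if every sense has at least one match.
both_senses_present = all(coverage.values())
print(coverage, both_senses_present)
```

In the notebook, `toy_neighbors` could be replaced with the result of `wv_from_bin.most_similar(word)`; the keyword sets still have to be chosen by hand for each candidate word.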

In [37]:
### CODE HERE
# Load the GloVe word vectors
wv_from_bin = api.load("glove-wiki-gigaword-200")

word = 'bank'

top_10_similar_words = wv_from_bin.most_similar(word, topn=10)


In [44]:

word = 'novel'

top_10_similar_words = wv_from_bin.most_similar(word, topn=10)
top_10_similar_words

[('novels', 0.7762995958328247),
 ('adaptation', 0.7534605264663696),
 ('book', 0.7485204935073853),
 ('fiction', 0.734031617641449),
 ('author', 0.7132314443588257),
 ('novelist', 0.6811251044273376),
 ('story', 0.677466869354248),
 ('memoir', 0.6494017243385315),
 ('novella', 0.6484362483024597),
 ('tale', 0.6375203132629395)]

### SOLUTION

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
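As a minimal sketch of that relationship, the NumPy snippet below computes both quantities for small hand-made vectors. The helper names and the example vectors are illustrative assumptions; this is not how gensim implements `distance` internally.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

a = [1.0, 0.0]
same_direction = [2.0, 0.0]   # parallel to a, different length
orthogonal = [0.0, 1.0]       # at a right angle to a
opposite = [-1.0, 0.0]        # points the other way

for name, vec in [("same direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(name, cosine_distance(a, vec))
```

Because only the angle matters, the distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction), regardless of vector length.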

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [7]:
w1 = "hot"
w2 = "warm"
w3 = "cold"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms hot, warm have cosine distance: 0.4111672639846802
Antonyms hot, cold have cosine distance: 0.40621882677078247


### SOLUTION
**Analysis:** <br>
GloVe vectors are learned from co-occurrence statistics, so words that appear in similar contexts end up close together, whether or not they mean the same thing. "hot" and "cold" are used in nearly interchangeable contexts (weather, water, food, drinks, temperature), so their vectors point in similar directions and their cosine distance is small. "warm" is a synonym of "hot", but it is used in a somewhat different mix of contexts (for example "warm welcome" or "warm up"), which pulls its vector away from "hot". Cosine distance therefore reflects similarity of usage rather than similarity of meaning, which is why an antonym can end up slightly closer than a synonym.
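Searching for further examples can be automated by looping over candidate triples. The sketch below takes any two-argument distance function, so it can be tried without loading the vectors; the helper name `closer_to_antonym` and the lookup table are hypothetical, and the table holds only the two distances printed above (rounded).

```python
def closer_to_antonym(triples, distance):
    """Return the (w1, synonym, antonym) triples where the antonym is closer to w1."""
    hits = []
    for w1, w2, w3 in triples:
        d_syn = distance(w1, w2)
        d_ant = distance(w1, w3)
        if d_ant < d_syn:
            hits.append((w1, w2, w3, d_syn, d_ant))
    return hits

# Stand-in for wv_from_bin.distance, filled with the rounded values shown above.
_known = {("hot", "warm"): 0.4112, ("hot", "cold"): 0.4062}

def lookup_distance(w1, w2):
    return _known[(w1, w2)]

hits = closer_to_antonym([("hot", "warm", "cold")], lookup_distance)
print(hits)
```

In the notebook, passing `wv_from_bin.distance` in place of `lookup_distance` would let the same loop screen a longer list of hand-written candidate triples.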

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [8]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION
In the analogy "man : grandfather :: woman : x", we are looking for the word that is to "woman" as "grandfather" is to "man". Geometrically, this means finding the word whose vector $x$ has the highest cosine similarity to: <br>$w + (g - m)$<br>

This expression involves vector arithmetic: <br>
$g - m$ is the direction from "man" to "grandfather".
<br>
$w + (g - m)$ moves from the position of "woman" along that direction.

This shifts the vector for "woman" by the same direction and magnitude as the step from "man" to "grandfather". Geometrically, it translates "woman" to the position that is analogous to "man" becoming "grandfather". <br>
Written out with the words themselves:

x ≈ woman + (grandfather − man)
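The 2D picture suggested in the hint can be sketched with toy vectors. The coordinates below are hypothetical (axis 0 stands in for gender, axis 1 for generation) and are not taken from GloVe. gensim's `most_similar` may additionally normalize the vectors before combining them, so treat this as the idea rather than its exact computation.

```python
import numpy as np

# Hypothetical 2D embeddings: axis 0 ~ gender, axis 1 ~ generation.
vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "grandfather": np.array([1.0, 2.0]),
    "grandmother": np.array([-1.0, 2.0]),
    "daughter": np.array([-1.0, -1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Target point: start at "woman" and apply the man -> grandfather offset.
target = vectors["woman"] + (vectors["grandfather"] - vectors["man"])

# Skip the query words themselves, as they do not appear in the results above.
query_words = {"man", "woman", "grandfather"}
candidates = [w for w in vectors if w not in query_words]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(target, best)
```

With these made-up coordinates the target lands exactly on "grandmother"; in a real embedding space the target only lands near the answer, which is why other family terms also score highly.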


### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. Give an intuitive explanation as to why the `most_similar` function also gives us words like "granddaughter", "daughter", or "mother".

### SOLUTION

1. Family Relationship: The word vectors for "woman" and "man" are likely to be close in the vector space because they are used in similar contexts; similarly, "grandfather" and "grandmother" are close because they represent the same family relationship. When we subtract the vector for "man" from "grandfather", we get a direction vector that represents the concept of being a grandparent, which is likely to be almost orthogonal to the gender difference between "woman" and "man". Adding "woman" therefore lands the target in a region populated by female family terms in general, not only on "grandmother".

2.  Proximity in Vector Space: Words like "granddaughter", "daughter", and "mother" are conceptually related to the concept of "woman" and "grandfather", albeit not exactly synonymous. They share semantic similarities because they represent familial relationships that are closely associated with the concept of "grandfather" and "woman". Therefore, they may appear among the most similar words due to their proximity in the vector space, even though "grandmother" might be the most direct completion of the analogy.

3.  Frequency in Corpus: Additionally, the frequency of these words in the training corpus might also influence their cosine similarity scores. If "granddaughter", "daughter", or "mother" appear frequently in contexts similar to those of "woman" and "grandfather", they receive high similarity scores even though "grandmother" is the more direct completion of the analogy.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [9]:
x, y, a, b = "king", "queen", "man", "woman"
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

### SOLUTION
"king : queen :: man : woman"

In this analogy, we expect the word that completes the analogy to be "woman" since "queen" is the female counterpart to "king" just as "woman" is the female counterpart to "man".

This analogy holds because the offset from "king" to "queen" mainly encodes gender, and roughly the same offset separates "man" from "woman". Adding that offset to "man" therefore lands closest to "woman", which is ranked top in the `most_similar` output.

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [10]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION
Why might the most similar results be numerical tokens such as "45,000-square" rather than ordinary words?

**Data Bias:** The training data could contain biases or patterns that are not intuitively related to the analogy but are statistically significant within the data. For instance, if the dataset contains a lot of real estate listings or descriptions where measurements and "foot" or "square foot" are frequently mentioned together, the model might learn to associate "foot" more strongly with square footage than with items worn on the foot.




**Polysemy:** The word "foot" is polysemous; it has multiple meanings, including as a unit of measurement (as in "square foot") and as a part of the body. If the model's representation of "foot" leans more towards its usage as a unit of measurement rather than a body part due to the training data, the analogy will skew towards related concepts in that domain.


b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [22]:
x, y, a, b = "bread" , "bakery" , "book" , "library"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('books', 0.616112232208252),
 ('bookstore', 0.5795304775238037),
 ('bookshop', 0.5778532028198242),
 ('author', 0.5619520545005798),
 ('publishing', 0.5522783994674683),
 ('published', 0.5197589993476868),
 ('novel', 0.5005371570587158),
 ('publications', 0.49985089898109436),
 ('publisher', 0.49705472588539124),
 ('memoir', 0.4887678921222687)]


### SOLUTION
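Intended analogy: bread : bakery :: book : library. Just as bread is produced and sold in a bakery, books are stored and made available in a library.

Incorrect value of b according to the word vectors: **'books'** ("library" does not appear in the top 10 at all).

Reason: The training data might have a stronger association between books and the concepts of publishing or selling than with their storage or availability in libraries.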
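To make "holds" and "does not hold" concrete, one option is to report the rank of the intended word in the returned list. The helper `rank_of` below is a hypothetical sketch that assumes the `(word, similarity)` list format; the sample list reuses the first five results shown above with rounded scores.

```python
def rank_of(expected, results):
    """Return the 1-based rank of `expected` in a most_similar-style list, or None."""
    for rank, (word, _score) in enumerate(results, start=1):
        if word == expected:
            return rank
    return None

# First five results for bread : bakery :: book : ?, copied from the output above
# (scores rounded to three decimals).
book_results = [
    ("books", 0.616),
    ("bookstore", 0.580),
    ("bookshop", 0.578),
    ("author", 0.562),
    ("publishing", 0.552),
]

print(rank_of("library", book_results))    # intended answer
print(rank_of("bookstore", book_results))  # a plausible alternative answer
```

An analogy holds in the sense used here when the intended word has rank 1; a result of `None` means it did not make the list at all.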

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation, etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias.

In [23]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION



*   For men, the model listed qualities like being skilled, ethical, and respected, without pointing to specific jobs. It's like saying men are seen in a wide variety of important roles but not tying them down to any particular job.

*   For women, the model pointed to specific jobs like teaching and nursing, which are often seen as caring or nurturing roles. It kind of boxed women into certain types of jobs traditionally thought of as "women's work."
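The comparison can also be done mechanically by separating the words that appear in both lists from the words unique to each. The snippet below is a small sketch that hard-codes the words from the two outputs above; the variable names are illustrative.

```python
# Words from the "man + profession - woman" output above.
man_side = ["reputation", "professions", "skill", "skills", "ethic",
            "business", "respected", "practice", "regarded", "life"]

# Words from the "woman + profession - man" output above.
woman_side = ["professions", "practitioner", "teaching", "nursing", "vocation",
              "teacher", "practicing", "educator", "physicians", "professionals"]

shared = sorted(set(man_side) & set(woman_side))
only_man = [w for w in man_side if w not in woman_side]      # keeps ranking order
only_woman = [w for w in woman_side if w not in man_side]    # keeps ranking order

print("shared:    ", shared)
print("man only:  ", only_man)
print("woman only:", only_woman)
```

Listing the unique words side by side makes the contrast easier to see: general qualities on one side, specific occupations on the other.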



### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates bias exhibited by the vectors. Briefly explain the example of bias that you discover.

In [34]:
# Compare which words are associated with "peaceful" for each religion
# once the other religion is subtracted out.

A = "christian"
B = "muslim"
word = "peaceful"

pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B])) # christian + (peaceful - muslim)
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A])) # muslim + (peaceful - christian)


[('principles', 0.4248402416706085),
 ('spirit', 0.4119269847869873),
 ('peacefully', 0.4070417582988739),
 ('sustainable', 0.4070028066635132),
 ('promote', 0.39793041348457336),
 ('harmony', 0.3913986384868622),
 ('coexistence', 0.3912509083747864),
 ('harmonious', 0.3873632252216339),
 ('unity', 0.3870052695274353),
 ('dialogue', 0.386999249458313)]

[('peacefully', 0.5776247978210449),
 ('kashmir', 0.5749312043190002),
 ('kashmiris', 0.5432711243629456),
 ('iranians', 0.5095259547233582),
 ('muslims', 0.49575552344322205),
 ('moslem', 0.4929211735725403),
 ('demonstrations', 0.4893590211868286),
 ('protests', 0.48372882604599),
 ('musharraf', 0.47846537828445435),
 ('crackdown', 0.4762367904186249)]


### SOLUTION

**Analysis of "Christian + peaceful - Muslim"** <br>
The terms associated with "Christian + peaceful - Muslim" include "principles," "spirit," "peacefully," "sustainable," "promote," "harmony," "coexistence," "harmonious," "unity," and "dialogue." These words paint a picture of peace in abstract, idealistic terms, focusing on the positive aspects of peace like harmony, unity, and dialogue and suggesting an idealized view of peace associated with Christian contexts.


**Analysis of "Muslim + peaceful - Christian"** <br>
Conversely, the terms related to "Muslim + peaceful - Christian" are more specific and grounded in geopolitical realities, including "peacefully," "kashmir," "kashmiris," "iranians," "muslims," "moslem," "demonstrations," "protests," "musharraf," and "crackdown." This list reflects a direct association with specific regions and events, such as Kashmir and Iran, and suggests a narrative of struggle or conflict, with "demonstrations," "protests," and "crackdown" indicating societal unrest. Overall, the context is markedly different from the abstract principles associated with the Christian query.


### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION

**Data-Driven Bias**

Word vectors are generated by analyzing vast amounts of text data. If this data contains biased language or perspectives—like stereotypical roles, unequal representation of genders, races, ethnicities, or any other group—these biases are learned by the model.
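A toy illustration of how skewed text turns into skewed statistics: count-based inputs such as GloVe's co-occurrence counts simply record which words appear together. The five-sentence corpus and the helper `cooccurrence` below are hypothetical and far smaller than any real training set.

```python
from collections import Counter

# Hypothetical mini-corpus with an imbalance between pronouns and occupations.
toy_corpus = [
    "he is an engineer",
    "he works as an engineer",
    "she is a nurse",
    "she works as a nurse",
    "she is an engineer",
]

def cooccurrence(corpus, target, context_words):
    """Count how often each context word shares a sentence with `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if target in tokens:
            for tok in tokens:
                if tok in context_words:
                    counts[tok] += 1
    return counts

pronouns = {"he", "she"}
engineer_counts = cooccurrence(toy_corpus, "engineer", pronouns)
nurse_counts = cooccurrence(toy_corpus, "nurse", pronouns)
print("engineer:", dict(engineer_counts))
print("nurse:   ", dict(nurse_counts))
```

A model trained on counts like these has no way to tell a property of the occupation from a property of the text it was given, so the imbalance carries over into the vectors.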

**Real-World Example:**

Imagine we have a model that helps companies decide who to interview for jobs. This model is taught how to pick candidates by looking at old job ads and the resumes of people who got those jobs. If most of these old ads for engineering jobs talk about men or use words more often used when talking about men, the model starts to think that engineering jobs are mostly for men. And if ads for nursing jobs do the same for women, the model thinks nursing is mostly for women.

Now, when this model looks at new applications, it might prefer men for engineering jobs and women for nursing jobs, not because it's trying to be unfair, but because that's what it learned from the old ads and resumes. This has happened in practice: Amazon reportedly abandoned an experimental resume-screening tool after it was found to downgrade resumes associated with women for technical roles.

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.


### SOLUTION


**Adversarial Training for Equalizing Word Representations:**

Adversarial training involves training a model to simultaneously generate word embeddings while also training an adversary to predict sensitive attributes from these embeddings. The goal is to encourage the model to generate embeddings that do not contain information about the sensitive attributes, effectively equalizing the representations across different groups.

**Real-World Example:**

Consider a scenario where a machine learning model is trained to generate word embeddings using a dataset containing text data from various sources, including social media posts, news articles, and literature. To mitigate biases related to gender, the model is trained using an adversarial setup:

*   **Embedding Generation:** The model learns to generate word embeddings from the input text data, aiming to capture semantic information while minimizing the influence of sensitive attributes like gender.

*   **Adversarial Training:** Simultaneously, an adversary model is trained to predict the gender of individuals mentioned in the text based on the generated word embeddings. The adversary's objective is to correctly predict gender while the embedding model aims to generate embeddings that confuse the adversary.

*   **Equalization Objective:** The embedding model is optimized using an objective function that penalizes the adversary's ability to predict gender from the embeddings. This encourages the embedding model to generate embeddings that are indistinguishable in terms of gender, effectively equalizing the representations across different gender groups.
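The alternating updates described above can be sketched in NumPy under heavy simplifying assumptions. Here the "embedding model" is just a matrix of vectors adjusted directly, the adversary is a logistic regression, and the data, dimensions, learning rates, and the weight `lam` are all hypothetical. How much the adversary is actually confused depends on these choices, so this shows the mechanics only and is not a tuned debiasing method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values hypothetical): 200 embeddings of dimension 8 whose
# first coordinate leaks a binary sensitive attribute z.
n, d = 200, 8
z = rng.integers(0, 2, size=n).astype(float)
E0 = rng.normal(size=(n, d))
E0[:, 0] += 2.0 * z - 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def adversary_loss(E, w, b):
    """Mean binary cross-entropy of a logistic adversary predicting z from E."""
    p = sigmoid(E @ w + b)
    eps = 1e-12
    return float(-np.mean(z * np.log(p + eps) + (1.0 - z) * np.log(1.0 - p + eps)))

E = E0.copy()
w, b = np.zeros(d), 0.0
lr_adv, lr_emb, lam = 0.5, 0.1, 1.0  # assumed hyperparameters

for _ in range(300):
    # Adversary step: gradient descent on its cross-entropy loss.
    err = sigmoid(E @ w + b) - z
    w -= lr_adv * (E.T @ err) / n
    b -= lr_adv * err.mean()

    # Embedding step: for each row, descend on
    #   0.5 * ||E_i - E0_i||^2  -  lam * (adversary loss on row i),
    # i.e. stay near the original vector while making the adversary worse.
    err = sigmoid(E @ w + b) - z
    grad_keep = E - E0
    grad_adv = np.outer(err, w)  # gradient of each row's adversary loss w.r.t. E_i
    E -= lr_emb * (grad_keep - lam * grad_adv)

final_loss = adversary_loss(E, w, b)
print("adversary loss after training:", final_loss)
```

The `0.5 * ||E_i - E0_i||^2` term stands in for "capture semantic information": without it, the embeddings could defeat the adversary by discarding everything they encode.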
