<a href="https://colab.research.google.com/github/kussy29/machine_learning/blob/main/lab12_nlp_towards_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework Assignment: Is a *queen* really just a *king*, minus a *man*, plus a *woman*?**

--------------



In class, we dealt with **embeddings** trained for **sentiment classification**. These embeddings are optimized to separate *positive* from *negative* expressions and **do not encode deeper semantic information**.

However, in modern natural language processing, there exist other embeddings — such as those from **BERT**, **word2vec**, or **GloVe** — that **do capture semantic structure**. These models are trained on large corpora, and their embeddings often allow for meaningful **vector arithmetic**, like the famous:

```
embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
```

This homework explores **semantic vector relationships** using such pretrained embeddings.

## **The Objective**

Your task is to:

1. Construct semantic classes of word pairs.
2. Visualize them using PCA.
3. Explore arithmetic operations in embedding space.

## **Tasks & Deliverables**

### 1. **Semantic Pair Classes**

- You must gather **at least 10 classes** of semantically related word pairs.
- Each class must contain **at least 5 pairs**.
- That gives a **minimum total of 100 unique words** (10 classes × 5 pairs × 2 words per pair).

Two example classes:

**Class 1: Gender**

- (king, queen)
- (man, woman)
- (doctor, nurse)
- (prince, princess)
- *(you must add one more)*

**Class 2: Verb tense (past tense)**

- (bring, brought)
- (get, got)
- (like, liked)
- *(you must add two more)*

**Your job:**

- Invent or search for **at least 10 such classes**, including the examples above.
- Each class must be conceptually coherent.
- Other examples: singular/plural, country/capital, comparative/superlative, tool/user, job/object, etc.
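
One way to keep the classes manageable is a plain dictionary that maps class names to lists of pairs. The sketch below is only an illustration: the name `PAIR_CLASSES`, the helper `check_classes`, and the two abbreviated classes are assumptions made for this example, not a required layout.

```python
# Hypothetical container for the word-pair classes (names are illustrative).
PAIR_CLASSES = {
    "gender": [("king", "queen"), ("man", "woman"), ("doctor", "nurse"),
               ("prince", "princess"), ("actor", "actress")],
    "past_tense": [("bring", "brought"), ("get", "got"), ("like", "liked"),
                   ("give", "gave"), ("leave", "left")],
}


def check_classes(classes, min_classes=10, min_pairs=5):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(classes) < min_classes:
        problems.append(f"only {len(classes)} classes, need at least {min_classes}")
    for name, pairs in classes.items():
        if len(pairs) < min_pairs:
            problems.append(f"class '{name}' has {len(pairs)} pairs, need at least {min_pairs}")
    # Walk over every word to find those that occur more than once across classes.
    seen, duplicates = set(), set()
    for pairs in classes.values():
        for pair in pairs:
            for word in pair:
                (duplicates if word in seen else seen).add(word)
    if duplicates:
        problems.append(f"repeated words: {sorted(duplicates)}")
    return problems


for problem in check_classes(PAIR_CLASSES):
    print(problem)
```

The duplicate check matters because the 100-word minimum counts unique words, so a word reused in two classes does not count twice.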

### 2. **Global PCA (Across All Words)**

- Use PCA to reduce the **entire set of 100 word embeddings** to 2D, and plot it.
- Additionally, plot **10 separate charts** in the same global PCA space, one for each class.
  - Each chart should display only the 10 words (5 pairs) of the given class.
- Points should be labeled with the words themselves.
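
A minimal sketch of the global projection, assuming the embeddings are already stacked into a NumPy array with one row per word. PCA is done here with a plain SVD so that the fitted mean and axes can be reused for the per-class charts. The helpers `fit_pca` and `project` and the random stand-in vectors are assumptions for illustration (scikit-learn's `PCA` would work equally well).

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit PCA on the rows of X; return the mean and the top principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Project rows of X onto previously fitted principal axes."""
    return (X - mean) @ components.T


# Stand-in data: 100 random 300-dimensional "embeddings" (replace with real vectors).
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(100)]
vectors = rng.normal(size=(100, 300))

# Fit once on all words, then reuse the same axes for every class chart.
mean, components = fit_pca(vectors)
coords_all = project(vectors, mean, components)   # shape (100, 2), global plot

class_rows = list(range(10))                      # rows of one class (hypothetical)
coords_class = coords_all[class_rows]             # same coordinates, subset only
print(coords_all.shape, coords_class.shape)
```

Because the class charts reuse the global axes, a word sits at the same coordinates in the global plot and in its class plot; only the set of displayed points changes.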

### 3. **Local PCA (Per Class)**

- For each class (10 total), perform PCA **only** on the 10 words of that class.
- Plot these class-wise PCA visualizations as separate charts.
- Again, points should be labeled with the words.

**Total: 21 charts**
(1 global plot with 100 words + 10 global-space class plots + 10 local PCA class plots)

Charts should be presented in a self-explanatory manner with clear labels.
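
For the class-wise charts, the only change from the global case is that PCA is fitted on the 10 class words alone. The sketch below assumes NumPy and Matplotlib are available; `local_pca_2d` and `plot_labeled` are hypothetical helper names, and the random vectors stand in for real embeddings.

```python
import matplotlib.pyplot as plt
import numpy as np


def local_pca_2d(X):
    """PCA fitted only on the rows of X, returning 2D coordinates."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def plot_labeled(coords, labels, title):
    """Scatter plot in which every point is annotated with its word."""
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        ax.annotate(label, (x, y), xytext=(4, 4), textcoords="offset points")
    ax.set_title(title)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    return fig, ax


# Stand-in vectors for one class of 5 pairs (replace with real embeddings).
rng = np.random.default_rng(1)
class_words = ["king", "queen", "man", "woman", "doctor",
               "nurse", "prince", "princess", "actor", "actress"]
class_vectors = rng.normal(size=(len(class_words), 300))

coords = local_pca_2d(class_vectors)
fig, ax = plot_labeled(coords, class_words, "Local PCA: gender")
```

In a notebook the figure is rendered when the cell finishes; in a script, a call to `plt.show()` would be needed.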

### 4. **Embedding Arithmetic**

For each class, choose **one example pair** (e.g., (king, queen)) and perform the operation:

```
embedding(B) - embedding(A) + embedding(C)
```

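Here A and B form a known pair, and C is the base word of another pair from the same class.
For example, with (A, B) = (man, woman) and C = king, the operation (with its terms reordered) is: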

```
embedding("king") - embedding("man") + embedding("woman")
```

* For each such result vector, find the **5 closest word embeddings** (using cosine similarity or Euclidean distance).
* Print the top 5 neighbors **with their distances**.
* Do this **once per class** (i.e., 10 times).

This makes it possible to verify whether

```
embedding("queen") ≈ embedding("king") - embedding("man") + embedding("woman")
```

holds for the *gender* class.
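
The neighbor search can be sketched with plain NumPy. The tiny hand-made vocabulary below is purely an assumption to keep the example self-contained. With the real model, gensim's `KeyedVectors.most_similar(positive=[...], negative=[...], topn=5)` performs a comparable query, though it normalizes the input vectors and excludes the query words from the results.

```python
import numpy as np

# Toy 3-dimensional "embeddings" invented for illustration only.
vocab = ["king", "queen", "man", "woman", "apple", "car"]
E = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.9, 0.1, 0.8],   # queen
    [0.1, 0.9, 0.1],   # man
    [0.1, 0.1, 0.9],   # woman
    [0.5, 0.5, 0.5],   # apple
    [0.2, 0.3, 0.1],   # car
])
index = {w: i for i, w in enumerate(vocab)}


def nearest(query, E, vocab, k=5, exclude=()):
    """Return the k words most cosine-similar to `query`, with similarities."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = unit @ (query / np.linalg.norm(query))
    order = np.argsort(-sims)  # indices sorted by decreasing similarity
    hits = [(vocab[i], float(sims[i])) for i in order if vocab[i] not in exclude]
    return hits[:k]


# embedding("king") - embedding("man") + embedding("woman")
query = E[index["king"]] - E[index["man"]] + E[index["woman"]]
for word, sim in nearest(query, E, vocab, k=5):
    print(f"{word:>6}  cosine similarity = {sim:.3f}")
```

If distances are preferred over similarities, the cosine distance is simply `1 - similarity`. Passing `exclude=("king", "man", "woman")` mimics the common convention of dropping the input words from the neighbor list, which can change the top result noticeably.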


### 5. **Discussion**

* Analyze and interpret your 21 plots.
* Discuss whether the vector relationships are preserved.
* Does PCA capture semantic differences?
* Are the closest words from the arithmetic meaningful?
* What kinds of relationships are captured, and what are not?
* Are some classes better behaved than others?
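
One possible way to put a number on whether the vector relationships are preserved is to compare the offset vectors `embedding(B) - embedding(A)` across the pairs of a class. If the relationship is encoded as a roughly constant direction, the offsets should have high mutual cosine similarity. This metric is a suggestion rather than part of the assignment, and `offset_consistency` with its toy inputs is a hypothetical helper.

```python
import numpy as np


def offset_consistency(first, second):
    """Mean pairwise cosine similarity between the offsets second[i] - first[i]."""
    offsets = second - first
    unit = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Average over distinct pairs only (upper triangle, diagonal excluded).
    upper = np.triu_indices(len(offsets), k=1)
    return float(sims[upper].mean())


# Toy data: 5 pairs whose offsets all point along the same direction ...
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 50))
direction = rng.normal(size=50)
consistent = offset_consistency(base, base + direction)

# ... versus 5 pairs with unrelated random offsets.
unrelated = offset_consistency(base, base + rng.normal(size=(5, 50)))
print(f"consistent: {consistent:.3f}, unrelated: {unrelated:.3f}")
```

Computing this score per class on the real embeddings gives a simple basis for comparing which classes are better behaved.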


### 6. **Publish on GitHub**

- Place the Colab notebook in your **GitHub repository** for this course.
- In your repository’s **README**, add a **link** to the notebook, and include an **“Open in Colab”** badge at the top of the notebook so it can be launched directly from GitHub.


## Acknowledgments

*This homework assignment was inspired by an idea from my master's student **Andrzej Małek**, to whom I would like to express my thanks.*



**Class 1: Gender**


*   King - Queen
*   Man - Woman
*   Doctor - Nurse
*   Prince - Princess
*   Actor - Actress
*   Uncle - Aunt

**Class 2: Past tense**


*   bring - brought
*   get - got
*   like - liked
*   give - gave
*   leave - left

**Class 3: Capital**

*   Poland - Warsaw
*   USA - Washington, D.C.
*   Washington - Olympia
*   Turkey - Ankara
*   Liechtenstein - Vaduz

**Class 4: Antonyms**

*   pretty - ugly
*   happy - sad
*   noise - silence
*   wealth - poverty
*   quickly - slowly
*   left - right

**Class 5: Adjective**

*  Poland - Polish
*  wealth - wealthy
*  happiness - happy
*  creativity - creative
*  beauty - beautiful

**Class 6: Substitutes**

*  Orange - tangerine
*  Butter - margarine
*  Coffee - Tea
*  Coca-Cola - Pepsi
*  Sandwich - Burger

**Class 7: Diminutives**

*  dear - darling
*  Charles - Charlie
*  duck - duckling
*  dog - doggie
*  cat - kitty

**Class 8: Child**

*  Dog - puppy
*  Cow - calf
*  cat - kitten
*  Horse - foal
*  Swan - cygnet

**Class 9: Meat**

*  pig - pork
*  cow - beef
*  sheep - mutton
*  deer - venison
*  bird - poultry

**Class 10: Food**

*  potato - chips
*  lettuce - salad
*  pork - chop
*  wheat - bread
*  avocado - guacamole

In [None]:
!pip install gensim torch

import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Load pretrained Word2Vec model (e.g., Google News)
# Make sure you've downloaded it first (e.g., GoogleNews-vectors-negative300.bin)
word2vec_path = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

# Build the embedding matrix directly from the gensim vectors
# (shape: vocab_size x embedding_dim); this avoids a slow per-word Python loop
embedding_dim = word2vec.vector_size
vocab_size = len(word2vec.key_to_index)
embedding_matrix = torch.tensor(word2vec.vectors)

# Mapping from word to row index (gensim already stores it in matrix row order)
word2idx = dict(word2vec.key_to_index)

# Create PyTorch Embedding layer
embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)  # use freeze=True if the embeddings should not be trained

In [None]:
# Look up the embedding vector of a single word
word = "king"
idx = word2idx[word]
embedding = embedding_layer(torch.tensor(idx))  # tensor of shape (embedding_dim,)
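
Note that `word2idx[word]` raises a `KeyError` for any word missing from the model's vocabulary, which can happen with misspelled, differently capitalized, or multi-word entries. A minimal sketch of a pre-check follows, using a small stand-in dictionary in place of the real `word2idx`; the helper name `split_by_vocab` and the toy contents are hypothetical.

```python
def split_by_vocab(words, vocab):
    """Split words into those present in `vocab` and those missing from it."""
    known = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return known, missing


# Stand-in for the real word-to-index mapping (illustrative contents only).
toy_word2idx = {"king": 0, "queen": 1, "Poland": 2, "Warsaw": 3}

known, missing = split_by_vocab(
    ["king", "Poland", "poland", "Washington, D.C."], toy_word2idx
)
print("known:  ", known)
print("missing:", missing)
```

Running such a check over all word pairs before plotting makes it clear which entries need to be respelled or replaced.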