# NLP exercise Lab2
# Name: Trao An Huy
# MSSV: 22280041

# Word embedding and one-hot encoding







## One-hot encoding

> One-hot encoding is the process of turning categorical factors into a numerical structure that machine learning algorithms can readily process. It functions by representing each category in a feature as a binary vector of 1s and 0s, with the vector's size equivalent to the number of potential categories.

In [37]:
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']

### One-hot integer encoding

In [38]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(np.array(data))

print(data)
print(integer_encoded)

['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
[0 0 2 0 1 1 2 0 2 1]


### One-hot binary encoding

In [39]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded_data = one_hot_encoder.fit_transform(integer_encoded)

print(data)
print(onehot_encoded_data)

['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


## Problem 1
What are the limitations of one-hot encoding?

## ANSWER .
- Curse of Dimensionality (High Cardinality Problem):

Explanation: When a categorical feature has a very large number of unique categories (high cardinality), one-hot encoding creates a new binary feature (column) for each category. This drastically increases the number of features (dimensionality) in the dataset.

Impact: High dimensionality can lead to increased computational cost (memory and processing time), make models harder to train, potentially require more data to generalize well, and increase the risk of overfitting. Imagine one-hot encoding a 'ZIP Code' or 'City' column in a large dataset – you could add thousands of new columns.

- Multicollinearity:

Explanation: The newly created binary features are often highly correlated. Specifically, if you know the values of k-1 columns for a given feature (where k is the number of categories), you automatically know the value of the k-th column (it will be 1 if all others are 0, and 0 if one of the others is 1). This is sometimes called the "dummy variable trap".

Impact: While many algorithms can handle this, it can be problematic for some models, particularly linear models like Linear Regression or Logistic Regression. It can destabilize coefficient estimates and make their interpretation difficult. Often, one of the one-hot encoded columns is dropped to avoid perfect multicollinearity.

- Increased Sparsity:

Explanation: After one-hot encoding, each original data point will have a '1' in only one of the newly created columns for that feature, and '0's in all the others. This results in a dataset matrix that is mostly filled with zeros (sparse).

Impact: While sparse matrices can be stored efficiently using specialized formats, processing them can still be computationally intensive if not handled correctly. Some algorithms might not perform optimally on highly sparse data.

- Loss of Ordinal Information (If Applicable):

Explanation: If the original categorical variable has an inherent order (e.g., 'low', 'medium', 'high'; 'cold', 'warm', 'hot'), one-hot encoding treats each category as independent and equally different from all others. It doesn't preserve the ordinal relationship between the categories.

Impact: The model loses potentially valuable information about the ranking or order between categories. For such cases, other encoding methods like Ordinal Encoding might be more appropriate (though Ordinal Encoding has its own limitations, like implying equal distance between ranks).

- Handling New/Unseen Categories:

Explanation: If the model encounters a category in new data (e.g., test set or during prediction) that was not present in the training data used to fit the encoder, the encoder won't have a corresponding column for it.

Impact: This can cause errors during the transformation process. Strategies are needed to handle this, such as ignoring the category (resulting in all zeros for that feature's columns), assigning it to a predefined 'other' category (if planned during training), or raising an error.

## Word embedding

ELI5 for word embeddings
> The word embeddings can be thought of as a child’s understanding of the words. Initially, the word embeddings are randomly initialized and they don’t make any sense, just like the baby has no understanding of different words. It’s only after the model has started getting trained, the word vectors/embeddings start to capture the meaning of the words, just like the baby hears and learns different words."

In [40]:
import torch
from torch import nn
from torch.nn import functional as F

In [41]:
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

### Unigram transformation

In [42]:
from nltk import ngrams
from typing import List

def ngrams_transform(document: List[str],
                     n_gram: int) -> List[str]:
    """
    N-grams transformations for a given text

    Args:
    document (List[str]) -- The document to-be-processed
    n_gram   (int)       -- Number of grams

    Returns:
    A list of string after n-grams processed
    """

    ### START YOUR CODE HERE ###

     # Use nltk.ngrams to generate tuples of n-grams
    # ngrams() returns an iterator, so we need to process it
    ngram_tuples = ngrams(document, n_gram)

    # Convert each tuple into a space-separated string
    # Use a list comprehension for conciseness
    ngram_list = [" ".join(gram) for gram in ngram_tuples]

    return ngram_list

    ### END YOUR CODE HERE ###

In [43]:
n_grams_list = ngrams_transform(corpus,
                                n_gram=1)
n_grams_list

['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?']

In [44]:
# Integer label for the given corpus
label_encoder = LabelEncoder()
corpus_vector = label_encoder.fit_transform(np.array(n_grams_list))

# Tensorize the input vector
example_text_tensor = torch.Tensor(corpus_vector).to(dtype=torch.long)
print(f"Example text tensor: {example_text_tensor}")
print(f"Shape of example text tensor: {example_text_tensor.shape}")

Example text tensor: tensor([3, 2, 0, 1])
Shape of example text tensor: torch.Size([4])


### Create an example for embedding function to map from a word dimension to a lower dimensional space

In [45]:
num_vocab = 22 # number of vocabulary
num_dimension = 50 # dimensional embeddings

# Declare the mapping function
example_embedding_function = nn.Embedding(num_vocab, num_dimension)

In [46]:
example_output_tensor = example_embedding_function(example_text_tensor)
print(f"Embedding shape: {example_output_tensor.shape}")

Embedding shape: torch.Size([4, 50])


# Word2vec


* Word2vec is a **class of models** that represents a word in a large text corpus as a vector in n-dimensional space(or n-dimensional feature space) bringing similar words closer to each other.



* Word2vec is a simple yet popular model to construct representating embedding for words from a representation space to a much lower dimensional space (compared to the respective number of words in a dictionary).



* Word2Vec has two neural network-based variants, which are:

    * Continuous Bag of Words (CBOW)
    * Skip-gram.
![](https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-2048x1075.png)


## Continuous Bag of words (CBOW)

* The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance a couple of percent.

* CBOW is modelled as follows:
    * Given a target word $w_i$ and an $N$ context window on each side, $w_{i-1}, \cdots, w_{i-N}$ and $w_{i+1},\cdots, w_{i+N}$, referring to all context words collectively as $C$.

    * CBOW tries to minimize the objective function:

$$
-\log p(w_i|C) = -\log\text{Softmax}\left(A\left(\sum_{w\in C}q_w\right)+b\right)
$$

where $q_w$ is the embedding of word $w$.

In [47]:
# N = 2 according to the definition
CONTEXT_SIZE = 2

corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

corpus = corpus.split()
len(corpus)

62

### Create an integer mapping

In [48]:
vocab = set(corpus)
vocab_size = len(vocab)

# Integer word mapping
word_to_idx = {word: i for i, word in enumerate(vocab)}
word_to_idx

{'about': 0,
 'process.': 1,
 'direct': 2,
 'processes': 3,
 'beings': 4,
 'abstract': 5,
 'rules': 6,
 'idea': 7,
 'computers.': 8,
 'our': 9,
 'Computational': 10,
 'that': 11,
 'with': 12,
 'spirits': 13,
 'by': 14,
 'the': 15,
 'effect,': 16,
 'a': 17,
 'computational': 18,
 'conjure': 19,
 'called': 20,
 'We': 21,
 'study': 22,
 'As': 23,
 'evolution': 24,
 'pattern': 25,
 'processes.': 26,
 'programs': 27,
 'other': 28,
 'directed': 29,
 'is': 30,
 'to': 31,
 'evolve,': 32,
 'are': 33,
 'The': 34,
 'things': 35,
 'People': 36,
 'data.': 37,
 'spells.': 38,
 'manipulate': 39,
 'create': 40,
 'In': 41,
 'program.': 42,
 'of': 43,
 'process': 44,
 'inhabit': 45,
 'they': 46,
 'computer': 47,
 'we': 48}

### Build context according to the given corpus

In [49]:
data = []

for i in range(CONTEXT_SIZE, len(corpus) - CONTEXT_SIZE):
    context = (
        [corpus[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [corpus[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = corpus[i]
    data.append((context, target))

data

[(['are', 'We', 'to', 'study'], 'about'),
 (['about', 'are', 'study', 'the'], 'to'),
 (['to', 'about', 'the', 'idea'], 'study'),
 (['study', 'to', 'idea', 'of'], 'the'),
 (['the', 'study', 'of', 'a'], 'idea'),
 (['idea', 'the', 'a', 'computational'], 'of'),
 (['of', 'idea', 'computational', 'process.'], 'a'),
 (['a', 'of', 'process.', 'Computational'], 'computational'),
 (['computational', 'a', 'Computational', 'processes'], 'process.'),
 (['process.', 'computational', 'processes', 'are'], 'Computational'),
 (['Computational', 'process.', 'are', 'abstract'], 'processes'),
 (['processes', 'Computational', 'abstract', 'beings'], 'are'),
 (['are', 'processes', 'beings', 'that'], 'abstract'),
 (['abstract', 'are', 'that', 'inhabit'], 'beings'),
 (['beings', 'abstract', 'inhabit', 'computers.'], 'that'),
 (['that', 'beings', 'computers.', 'As'], 'inhabit'),
 (['inhabit', 'that', 'As', 'they'], 'computers.'),
 (['computers.', 'inhabit', 'they', 'evolve,'], 'As'),
 (['As', 'computers.', 'evol

### Problem 2
Name at least 2 limitations at this context construction step? Explain your answers.

### Vectorize context

In [50]:
def make_context_vector(context: List[str],
                        word_to_idx: dict) -> torch.Tensor:
    """
    Function to map a word context vector into a torch tensor

    Args:
    context (List[str]) -- A context (including individual n-grams tokens)
    word_to_idx (dict)  -- A functionto map a word into its respective integer

    Returns:
    A pytorch tensor including a list of mapped word

    Example:
    ['are', 'We', 'to', 'study'] --> tensor([40, 22, 27, 47])
    """

    ### START YOUR CODE HERE ###

    # 1. Look up the index for each word in the context using the word_to_idx dictionary.
    #    A list comprehension is a concise way to do this.
    idxs = [word_to_idx[word] for word in context]

    # 2. Convert the list of indices into a PyTorch tensor.
    #    Specify dtype=torch.long as indices are typically represented as long integers.
    context_vector = torch.tensor(idxs, dtype=torch.long)

    return context_vector

    ### END YOUR CODE HERE ###

In [51]:
# Functional test
print("Example sample: ", data[0][0])
make_context_vector(data[0][0], word_to_idx)

Example sample:  ['are', 'We', 'to', 'study']


tensor([33, 21, 31, 22])

### CBOW model implementation

In [52]:
class CBOW(nn.Module):
    def __init__(self,
                 vocab_size: int,
                 embed_dim: int) -> None:
        """
        Model constructor
        """
        super().__init__()

        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

        self.embedding_layer = nn.Embedding(vocab_size, embed_dim)
        self.linear_layer = nn.Linear(embed_dim, vocab_size)

        # Neural weight initialization
        nn.init.xavier_normal_(self.embedding_layer.weight)
        nn.init.xavier_normal_(self.linear_layer.weight)

    def forward(self, inputs):
        """
        Function to conduct forward passing
        """
        embedding = self.embedding_layer(inputs)
        embedding = torch.sum(embedding, dim=1)
        output = self.linear_layer(embedding)
        output_softmax = F.log_softmax(output, dim=1)
        return output_softmax

In [53]:
cbow_model = CBOW(vocab_size=vocab_size,
                  embed_dim=10)

# Enable gradient for model training
cbow_model.train()
cbow_model

CBOW(
  (embedding_layer): Embedding(49, 10)
  (linear_layer): Linear(in_features=10, out_features=49, bias=True)
)

### Train

#### Hyperparameters and training configuration

In [54]:
num_epochs: int = 5
learning_rate: float = 5e-2
optimizer: torch.optim = torch.optim.Adam(cbow_model.parameters(),
                                          lr=learning_rate)

loss_function = nn.NLLLoss()

#### Training phase

In [55]:
for epoch in range(1, num_epochs + 1):
    print(f"#Epoch {epoch}/{num_epochs}")

    # Construct input and target tensor
    input_vector, target_vector = torch.tensor(make_context_vector(data[0][0], word_to_idx)), torch.tensor(word_to_idx[data[0][1]])
    input_vector = input_vector.unsqueeze(0)
    target_vector = target_vector.unsqueeze(0)

    # Join whole data into 1 tensor set
    for idx in range(1, len(data)):
        input_tensor = torch.tensor(make_context_vector(data[idx][0], word_to_idx)).unsqueeze(0)
        target_tensor = torch.tensor(word_to_idx[data[idx][1]]).unsqueeze(0)
        torch.cat((input_vector, input_tensor), 0)
        torch.cat((target_vector, target_tensor), 0)

    # Zero out the gradients from the old instance to avoid tensor accumulation
    cbow_model.zero_grad()

    # Forward passing
    log_probabilities = cbow_model(input_vector)

    # Evaluate loss
    loss = loss_function(log_probabilities, target_vector)

    # Backpropagation
    loss.backward()

    # Update the gradient according to the optimization algorithm
    optimizer.step()

    # Get loss values
    epoch_loss = loss.item()
    print("Loss:", epoch_loss)

#Epoch 1/5
Loss: 3.704463243484497
#Epoch 2/5
Loss: 3.2461307048797607
#Epoch 3/5
Loss: 2.7708699703216553
#Epoch 4/5
Loss: 2.1763405799865723
#Epoch 5/5
Loss: 1.4540271759033203


  input_vector, target_vector = torch.tensor(make_context_vector(data[0][0], word_to_idx)), torch.tensor(word_to_idx[data[0][1]])
  input_tensor = torch.tensor(make_context_vector(data[idx][0], word_to_idx)).unsqueeze(0)


#### Inference

In [56]:
with torch.no_grad(): # No gradient update in inference
    context = ['In', 'processes.', 'we', 'conjure']

    # Vectorize input from text to numeric type
    input_tensor = torch.tensor(make_context_vector(context, word_to_idx)).unsqueeze(0)

    # Model makes prediction
    output_tensor = cbow_model(input_tensor)

    # Get the item id with the highest probability
    prediction = torch.argmax(output_tensor).detach().tolist()

    # Query the respective word from the given item id
    key_list = list(word_to_idx.keys())
    prediction = key_list[prediction]

    print("Context:", context)
    print("Prediction:", prediction)

Context: ['In', 'processes.', 'we', 'conjure']
Prediction: about


  input_tensor = torch.tensor(make_context_vector(context, word_to_idx)).unsqueeze(0)


## Skip-gram

<center>
<img src="https://machinelearningcoban.com/tabml_book/_images/word2vec2.png">
</center>

- Skip gram is based on the distributional hypothesis where words with similar distribution is considered to have similar meanings. Researchers of skip gram suggested a model with less parameters along with the novel methods to make optimization step more efficient.

- Vanilla SkipGram model:

<center>
<img src="https://d3i71xaburhd42.cloudfront.net/a1d083c872e848787cb572a73d97f2c24947a374/5-Figure1-1.png" scale=70%>
</center>

- Main idea is to optimize model so that if it is queried with a word, it should correctly guess all the context (context = 2 in the figure) words. That is,
$$
y=\sigma(Ux)
$$
    - where $x$, $y$ are one-hot encoded word vector, $U$ is the embedding matrix, and $\sigma(\cdot)$ is the softmax function.

With the same dataset, training set for skip gram can be much larger than that of NPLM since it can have $2c$ samples $\left(w_t:w_{t-c}, ...,w_t:w_{t-1},w_t:w_{t+1},...,w_{t+c}\right)$ while other n-gram based models have one $\left((w_{t-c},...w_{t-1},w_{t+1},...,w_{t+c}):w_t\right)$.

In [57]:
corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

In [58]:
def make_context_vector(context: List[str],
                        word_to_idx: dict) -> torch.Tensor:
    """
    Function to map a word context vector into a torch tensor

    Args:
    context (List[str]) -- A context (including individual n-grams tokens)
    word_to_idx (dict)  -- A functionto map a word into its respective integer

    Returns:
    A pytorch tensor including a list of mapped word

    Example:
    ['are', 'We', 'to', 'study'] --> tensor([40, 22, 27, 47])
    """
    # Map words to indices, handling potential unknown words (assigning a default index, e.g., 0 for <UNK>)
    idxs = [word_to_idx.get(w, word_to_idx.get("<UNK>", 0)) for w in context]
    return torch.tensor(idxs, dtype=torch.long)

In [59]:
class SkipGramModel(nn.Module):
    def __init__(self,
                 vocab_size: int,
                 embed_dim: int) -> None:
        """
        Model construction
        """
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

        ### START YOUR CODE HERE ###

        # Declare embedding function u and v
        # with given vocab size and embed dim using nn.Embedding
        # v_embedding_layer: Represents the embeddings for the center words (input words)
        self.v_embedding_layer = nn.Embedding(self.vocab_size, self.embed_dim)
        # u_embedding_layer: Represents the embeddings for the context words (output words)
        self.u_embedding_layer = nn.Embedding(self.vocab_size, self.embed_dim)

        # Network weight initialization with Xavier uniform initialization
        # Ensure 'init' is imported: import torch.nn.init as init
        nn.init.xavier_uniform_(self.v_embedding_layer.weight)
        nn.init.xavier_uniform_(self.u_embedding_layer.weight)

        ### END YOUR CODE HERE ###

    def forward(self, center_words, context):
        """
        Function to perform forward passing
        """
        v_embedding = self.v_embedding_layer(center_words)
        u_embedding = self.u_embedding_layer(context)

        score = torch.mul(v_embedding, u_embedding)
        score = torch.sum(score, dim=1)
        log_score = F.logsigmoid(score)
        return log_score

In [60]:
skipgram_model = SkipGramModel(vocab_size=vocab_size,
                               embed_dim=128)

skipgram_model.train()
skipgram_model

SkipGramModel(
  (v_embedding_layer): Embedding(49, 128)
  (u_embedding_layer): Embedding(49, 128)
)

### Prepare training data to match the format of SkipGram model

In [61]:
def gather_training_data(corpus,
                         word_to_idx: dict,
                         context_size: int):
    """
    This function is to transform the given corpus
    into the correct format for SkipGram to serve as its input
    """

    training_data = []
    all_vocab_indices = list(range(len(word_to_idx)))

    split_text = corpus.split('\n')

    # For each sentence
    for sentence in split_text:
        indices = []
        indices = [word_to_idx[word] for word in sentence.split(' ')]

        # For each word treated as center word
        for center_word_pos in range(len(indices)):

            # For each window  position
            for w in range(-context_size, context_size+1):
                context_word_pos = center_word_pos + w

                # Make sure we dont jump out of the sentence
                if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                    continue

                context_word_idx = indices[context_word_pos]
                center_word_idx  = indices[center_word_pos]

                # Same words might be present in the close vicinity of each other. we want to avoid such cases
                if center_word_idx == context_word_idx:
                    continue

                training_data.append([center_word_idx, context_word_idx])

    return training_data

In [62]:
training_data = gather_training_data(corpus,
                                     word_to_idx,
                                     context_size=2)
training_data = torch.tensor(training_data).to(dtype=torch.long)
training_data.shape

torch.Size([212, 2])

In [63]:
training_data

tensor([[21, 33],
        [21,  0],
        [33, 21],
        [33,  0],
        [33, 31],
        [ 0, 21],
        [ 0, 33],
        [ 0, 31],
        [ 0, 22],
        [31, 33],
        [31,  0],
        [31, 22],
        [31, 15],
        [22,  0],
        [22, 31],
        [22, 15],
        [22,  7],
        [15, 31],
        [15, 22],
        [15,  7],
        [15, 43],
        [ 7, 22],
        [ 7, 15],
        [ 7, 43],
        [ 7, 17],
        [43, 15],
        [43,  7],
        [43, 17],
        [43, 18],
        [17,  7],
        [17, 43],
        [17, 18],
        [17,  1],
        [18, 43],
        [18, 17],
        [18,  1],
        [ 1, 17],
        [ 1, 18],
        [10,  3],
        [10, 33],
        [ 3, 10],
        [ 3, 33],
        [ 3,  5],
        [33, 10],
        [33,  3],
        [33,  5],
        [33,  4],
        [ 5,  3],
        [ 5, 33],
        [ 5,  4],
        [ 5, 11],
        [ 4, 33],
        [ 4,  5],
        [ 4, 11],
        [ 4, 45],
        [1

### Hyperparamters and training configuration

In [64]:
num_epochs: int = 200
learning_rate: float = 5e-1
optimizer: torch.optim = torch.optim.SGD(skipgram_model.parameters(),
                                          lr=learning_rate)

In [65]:
input_vector = training_data[:, 0]
# Accessing elements of the tensor directly as indices for word_to_idx
target_vector = training_data[:, 1]

In [66]:
input_vector

tensor([21, 21, 33, 33, 33,  0,  0,  0,  0, 31, 31, 31, 31, 22, 22, 22, 22, 15,
        15, 15, 15,  7,  7,  7,  7, 43, 43, 43, 43, 17, 17, 17, 17, 18, 18, 18,
         1,  1, 10, 10,  3,  3,  3, 33, 33, 33, 33,  5,  5,  5,  5,  4,  4,  4,
         4, 11, 11, 11, 11, 45, 45, 45,  8,  8, 23, 23, 46, 46, 46, 32, 32, 32,
        32,  3,  3,  3,  3, 39, 39, 39, 39, 28, 28, 28, 28,  5,  5,  5,  5, 35,
        35, 35, 35, 20, 20, 20, 37, 37, 34, 34, 24, 24, 24, 43, 43, 43, 43, 17,
        17, 17, 17, 44, 44, 44, 44, 30, 30, 30, 30, 29, 29, 29, 29, 14, 14, 14,
        14, 17, 17, 17, 17, 25, 25, 25, 25, 43, 43, 43,  6,  6, 20, 20, 17, 17,
        17, 42, 42, 42, 42, 36, 36, 36, 36, 40, 40, 40, 40, 27, 27, 27, 27, 31,
        31, 31, 31,  2,  2,  2,  2, 26, 26, 26, 26, 41, 41, 41, 16, 16, 48, 48,
        19, 19, 19, 15, 15, 15, 15, 13, 13, 13, 13, 43, 43, 43, 43, 15, 15, 15,
        15, 47, 47, 47, 47, 12, 12, 12, 12,  9,  9,  9, 38, 38])

### Training phase

In [67]:
for epoch in range(num_epochs + 1):
    """
    Adapt the given CBOW training code for SkipGram
    Following by the instruction comments, or you could do it on your own ;)
    """
    # Construct input and target tensor
    inputs = training_data[:, 0]  # Center words
    targets = training_data[:, 1]  # Context words

    skipgram_model.zero_grad()

    # Zero out the gradients from the old instance to avoid tensor accumulation


    # Forward passing
    logsoftmax_prediction = skipgram_model(inputs, targets)

    # Evaluate loss (Negative log likelihood)
    loss = torch.mean(-1 * logsoftmax_prediction)

    # Backpropagation
    loss.backward()

    # Update the gradient according to the optimization algorithm
    optimizer.step()

    # Get loss values
    epoch_loss = loss.item()

    # Log result
    if epoch % 50 == 0:
        print(f"#Epoch {epoch}/{num_epochs}")
        print("Loss:", epoch_loss)

#Epoch 0/200
Loss: 0.6916677951812744
#Epoch 50/200
Loss: 0.606799840927124
#Epoch 100/200
Loss: 0.5255119800567627
#Epoch 150/200
Loss: 0.44386813044548035
#Epoch 200/200
Loss: 0.3657078146934509


### Inference

In [68]:
input_tensor = torch.tensor(make_context_vector(context, word_to_idx)).unsqueeze(0)
print(input_tensor)

tensor([[41, 26, 48, 19]])


  input_tensor = torch.tensor(make_context_vector(context, word_to_idx)).unsqueeze(0)


In [69]:
with torch.no_grad():
    context = ['we']

    ### START YOUR CODE HERE ###
    # Based on the given inference code in the previous section, training code and the context
    # Implement the inference flow from the given context to an output word
      # Convert the input word to its index
    input_word_idx = torch.tensor([word_to_idx[context[0]]])

    # Get the embedding for the input word
    input_word_embedding = skipgram_model.v_embedding_layer(input_word_idx)

    # Calculate similarity scores with all words in the vocabulary
    all_word_embeddings = skipgram_model.u_embedding_layer.weight
    similarity_scores = torch.matmul(input_word_embedding, all_word_embeddings.T)

    # Get the word index with the highest score (excluding the input word)
    scores = similarity_scores.flatten()
    input_word_idx_value = input_word_idx.item()
    scores[input_word_idx_value] = float('-inf')  # Exclude the input word

    # Get the highest scoring word index
    best_idx = torch.argmax(scores).item()

    # Convert the index back to a word
    key_list = list(word_to_idx.keys())
    prediction = key_list[best_idx]

    ### END YOUR CODE HERE ###
    print("Context:", context)
    print("Prediction:", prediction)

Context: ['we']
Prediction: the


## Problem 3
What are the differences between CBOW and Skip-gram?

###### **Core Differences:**  


 **1. CBOW (Continuous Bag-of-Words):**  
   * **Objective:** Predict the **target word** based on surrounding **context words**.  
   * **How it works:** The model takes vectors of context words within a window, typically averaging (or summing) them, and uses the combined vector to predict the middle word (target word).  
   * **Example:** Given the context "the cat ___ on the mat," CBOW tries to predict the missing word, e.g., "sits."  
   * **Input:** Multiple context word vectors.  
   * **Output:** A single target word vector.  

 **2. Skip-gram:**  
   * **Objective:** Predict **context words** around a given **target word**.  
   * **How it works:** The model takes the target word’s vector and uses it to predict different words that are likely to appear in its surrounding window.  
   * **Example:** Given the target word "sits," Skip-gram tries to predict context words such as "the," "cat," "on," "the," "mat."  
   * **Input:** A single target word vector.  
   * **Output:** Multiple context word vectors.  

 **Detailed Comparison Table:**  

| **Criteria**            | **CBOW (Continuous Bag-of-Words)**                | **Skip-gram**                               |
|------------------------|------------------------------------------------|---------------------------------------------|
| **Main objective**     | Predicts the target word from context words.   | Predicts context words from a target word. |
| **Input**             | Vectors of multiple context words.             | Vector of a single target word.            |
| **Output**            | Vector of a single target word.                | Vectors of multiple context words.         |
| **Training speed**    | **Faster.** Treats the context as a single observation (usually by averaging vectors). | **Slower.** Generates more (target, context) pairs per window, requiring more predictions. |
| **Representation quality** | Works well for **frequent words.** Averaging may "smooth out" contextual information. | Works better for **rare words** and small datasets. Learns better word representations by considering individual (target, context) pairs. |
| **Handling rare words**  | Less effective than Skip-gram.               | More effective than CBOW.                  |
| **Computational complexity** | Lower.                                  | Higher (since more predictions are required per input word). |
| **Example (Window=2)**  | Input: ("the", "quick", "fox", "jumps") → Output: "brown" | Input: "brown" → Output: ("the", "quick", "fox", "jumps") |

####### **When to Use Which Model?**  

 **Use CBOW if:**  
   - You need **faster training**.  
   - Your dataset is **large**.  
   - You care more about performance on **frequent words**.  

**Use Skip-gram if:**  
   - You have a **smaller dataset**.  
   - You want **better representations for rare words** or specialized phrases.  
   - You prioritize **representation quality** over training speed.  

### **Summary:**  
CBOW is **faster** and better suited for frequent words by predicting the target word from context. Skip-gram is **slower** but more effective for rare words, producing higher-quality word representations by predicting context words from a target word.