# Word embedding and one-hot encoding







## One-hot encoding

> One-hot encoding is the process of turning categorical factors into a numerical structure that machine learning algorithms can readily process. It functions by representing each category in a feature as a binary vector of 1s and 0s, with the vector's size equivalent to the number of potential categories.
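
The definition above can be sketched by hand with NumPy. This is a minimal illustration, not the scikit-learn implementation: the helper name `one_hot` is hypothetical, and sorting the categories alphabetically to fix the column order is an assumption made here for reproducibility.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix); each row of matrix is the one-hot vector of one input value."""
    categories = sorted(set(values))                      # fixed column order (assumed convention)
    index = {cat: i for i, cat in enumerate(categories)}  # category -> column position
    identity = np.eye(len(categories))                    # row i is the one-hot vector of category i
    return categories, identity[[index[v] for v in values]]

categories, encoded = one_hot(['cold', 'warm', 'hot', 'cold'])
print(categories)
print(encoded)
```

Each row contains exactly one 1, and the vector length equals the number of distinct categories.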

In [None]:
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']

### Integer (label) encoding

In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(np.array(data))

print(data)
print(integer_encoded)

['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
[0 0 2 0 1 1 2 0 2 1]


### One-hot binary encoding

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded_data = one_hot_encoder.fit_transform(integer_encoded)

print(data)
print(onehot_encoded_data)

['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


In [None]:
np.array(data).reshape(len(data), 1)

array([['cold'],
       ['cold'],
       ['warm'],
       ['cold'],
       ['hot'],
       ['hot'],
       ['warm'],
       ['cold'],
       ['warm'],
       ['hot']], dtype='<U4')

In [None]:
test = one_hot_encoder.fit_transform(np.array(data).reshape(len(data), 1))
test

array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

## Problem 1
What are the limitations of one-hot encoding?

- It cannot express semantic similarity between words. Every word is at the same distance from every other word.

In [None]:
data = ['cold', 'warm', 'hot', 'cold']
data = np.array(data)

one_hot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = one_hot_encoder.fit_transform(data.reshape(len(data), 1))

# Compare the distance (similarity) of 'cold' and of 'warm' to 'hot'
print(np.linalg.norm(integer_encoded[0] - integer_encoded[2]))
print(np.linalg.norm(integer_encoded[1] - integer_encoded[2]))

1.4142135623730951
1.4142135623730951


- It does not capture the importance of individual words within a sentence.


In [None]:
data = ['cold', 'warm', 'hot', 'the', 'weather', 'is']
data = np.array(data)

one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoder.fit(data.reshape(len(data), 1))
print(one_hot_encoder.categories_)

sentence = ['the', 'weather', 'is', 'hot']
sentence = np.array(sentence).reshape(len(sentence), 1)
onehot_encoded_data = one_hot_encoder.transform(sentence)
onehot_encoded_data


[array(['cold', 'hot', 'is', 'the', 'warm', 'weather'], dtype='<U7')]


array([[0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])

- It is limited to the initial vocabulary, and adding new words is difficult.
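
The vocabulary limitation can be seen in a small NumPy sketch. The `encode` helper is hypothetical, and mapping unseen words to an all-zero vector is only one possible convention, assumed here for illustration.

```python
import numpy as np

vocabulary = ['cold', 'hot', 'is', 'the', 'warm', 'weather']
word_to_column = {word: i for i, word in enumerate(vocabulary)}

def encode(word):
    """One-hot encode a word; unseen words become an all-zero vector (assumed convention)."""
    vector = np.zeros(len(vocabulary))
    column = word_to_column.get(word)
    if column is not None:
        vector[column] = 1.0
    return vector

print(encode('hot'))       # a single 1 in the 'hot' column
print(encode('freezing'))  # all zeros: the unseen word carries no information
```

Supporting a new word such as 'freezing' properly would mean adding a column, which changes the length of every existing vector.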

## Word embedding

ELI5 for word embeddings
> Word embeddings can be thought of as a child's understanding of words. Initially, the embeddings are randomly initialized and carry no meaning, just as a baby has no understanding of different words. Only once the model starts training do the word vectors begin to capture the meaning of words, just as the baby hears and learns them.

In [None]:
import torch
from torch import nn
from torch.nn import functional as F

In [None]:
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

### Unigram transformation

#### About `typing.List`

The function below annotates its arguments with `List` from the `typing` module, so it is worth explaining what that annotation does.

Type hints were introduced by PEP 484, and the `typing` module supplies generic annotations such as `List`, `Dict`, and `Tuple`. Annotating an argument as `List[str]` instead of a bare `list` has three practical benefits:

- **Clarity:** the signature states that `document` is a list of strings and that the function returns a list of strings.
- **Static analysis:** type checkers such as mypy can use the element type to flag type errors before the code runs. A bare `list` says nothing about what the list contains.
- **Maintainability:** the hints act as documentation when the code is later modified or refactored.

Type hints are not enforced at runtime, so a bare `list` would behave identically when the code runs. Since Python 3.9 (PEP 585) the built-in `list` can also be subscripted directly, as in `list[str]`, which makes the `typing.List` import unnecessary on recent versions.
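
As a small illustration of the annotations discussed above, the sketch below annotates the same function twice: once with `typing.List[str]` and once, assuming Python 3.9 or later, with the built-in generic `list[str]`. The function names are hypothetical. Neither annotation is checked at runtime; both only inform readers and static type checkers.

```python
from typing import List

def tokenize_with_typing(sentence: str) -> List[str]:
    """Annotated with typing.List (works on older Python 3 versions too)."""
    return sentence.split()

def tokenize_with_builtin(sentence: str) -> list[str]:
    """Annotated with the built-in generic (Python 3.9+)."""
    return sentence.split()

# Both functions behave identically at runtime
print(tokenize_with_typing("This is the first document."))
print(tokenize_with_builtin("This is the first document."))
print(tokenize_with_builtin.__annotations__)
```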

In [None]:
from nltk import ngrams
help(ngrams)

Help on function ngrams in module nltk.util:

ngrams(sequence, n, **kwargs)
    Return the ngrams generated from a sequence of items, as an iterator.
    For example:
    
        >>> from nltk.util import ngrams
        >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    
    Wrap with list for a list version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:
    
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
 

In [None]:
len(corpus)

4

In [None]:
for gram in ngrams(corpus[0].split(' '), n = 2):
  print(gram)

('This', 'is')
('is', 'the')
('the', 'first')
('first', 'document.')


In [None]:
[' '.join(gram) for gram in ngrams(corpus[0].split(' '), n = 2)]

['This is', 'is the', 'the first', 'first document.']

- A list needs to be passed in.
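
To make the n-gram cells above less of a black box, here is a pure-Python sketch of the same idea. `simple_ngrams` is a hypothetical helper, not part of NLTK, and it omits NLTK's padding options. It also shows why the input should be a list of tokens: a raw string is iterated character by character.

```python
def simple_ngrams(tokens, n):
    """Return the n-grams of a sequence as tuples by zipping n shifted views of it."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'This is the first document.'.split(' ')
print(simple_ngrams(tokens, 2))

# Passing the raw string instead of a token list yields character n-grams
print(simple_ngrams('This', 2))
```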

In [None]:
for sentence in corpus:
  print(sentence)

This is the first document.
This document is the second document.
And this is the third one.
Is this the first document?


In [None]:
from nltk import ngrams
from typing import List

def ngrams_transform(document: List[str],
                     n_gram: int) -> List[str]:
    """
    N-grams transformations for a given text

    Args:
    document (List[str]) -- The document to-be-processed
    n_gram   (int)       -- Number of grams

    Returns:
    A list of string after n-grams processed
    """

    ### START YOUR CODE HERE ###
    result = []
    for sentence in document:
      result += [' '.join(gram) for gram in ngrams(sentence.split(' '), n = n_gram)]

    return result
    ### END YOUR CODE HERE ###

In [None]:
n_grams_list = ngrams_transform(corpus,
                                n_gram=1)
n_grams_list

['This',
 'is',
 'the',
 'first',
 'document.',
 'This',
 'document',
 'is',
 'the',
 'second',
 'document.',
 'And',
 'this',
 'is',
 'the',
 'third',
 'one.',
 'Is',
 'this',
 'the',
 'first',
 'document?']

In [None]:
# Integer label for the given corpus
label_encoder = LabelEncoder()
corpus_vector = label_encoder.fit_transform(np.array(n_grams_list))

# Tensorize the input vector
example_text_tensor = torch.Tensor(corpus_vector).to(dtype=torch.long)
print(f"Example text tensor: {example_text_tensor}")
print(f"Shape of example text tensor: {example_text_tensor.shape}")

Example text tensor: tensor([ 2,  7, 10,  6,  4,  2,  3,  7, 10,  9,  4,  0, 12,  7, 10, 11,  8,  1,
        12, 10,  6,  5])
Shape of example text tensor: torch.Size([22])


### Create an example embedding function that maps words to a dense vector space

In [None]:
num_vocab = 22 # upper bound on the vocabulary size (22 tokens, 13 of them unique)
num_dimension = 50 # dimensional embeddings

# Declare the mapping function
example_embedding_function = nn.Embedding(num_vocab, num_dimension)

In [None]:
example_output_tensor = example_embedding_function(example_text_tensor)
print(f"Embedding shape: {example_output_tensor.shape}")

Embedding shape: torch.Size([22, 50])
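
Conceptually, an embedding lookup is the same as multiplying a one-hot vector by the weight matrix; the lookup just skips the multiplication. The NumPy sketch below checks this on a small random matrix. The sizes and token ids are arbitrary, and `weight` merely stands in for the matrix that `nn.Embedding` stores.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
weight = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the embedding matrix

token_ids = np.array([2, 0, 2])
one_hot = np.eye(vocab_size)[token_ids]            # shape (3, 5), one row per token

via_matmul = one_hot @ weight                      # one-hot vectors times the matrix
via_lookup = weight[token_ids]                     # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can take integer indices as input instead of explicit one-hot vectors.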


In [None]:
# Print embedding example
example_output_tensor

tensor([[-2.0442,  1.5012, -1.4179,  ...,  0.1615, -0.4502, -1.6173],
        [ 1.2046,  0.3967,  0.8174,  ...,  0.3511,  2.1552,  0.0819],
        [ 1.7798, -1.6000,  0.1290,  ..., -0.6039, -1.3091,  0.5135],
        ...,
        [ 1.7798, -1.6000,  0.1290,  ..., -0.6039, -1.3091,  0.5135],
        [ 0.3608,  1.5465, -0.4711,  ..., -1.3572, -1.3828, -0.1157],
        [ 1.7119,  0.6990, -0.4768,  ...,  0.4151, -0.1432, -0.0275]],
       grad_fn=<EmbeddingBackward0>)

#### How the values are initialized

In `torch.nn.Embedding`, the weights are initialized randomly from the standard normal distribution N(0, 1). This is done by `torch.nn.init.normal_`, whose defaults are `mean=0` and `std=1`.

The relevant code from PyTorch's `embedding.cpp` and `sparse.py` is shown below for reference:

```c++ name=torch/csrc/api/src/nn/modules/embedding.cpp
void EmbeddingImpl::reset_parameters() {
  torch::nn::init::normal_(weight); // Initialize from the standard normal distribution N(0, 1)
  if (options.padding_idx().has_value()) {
    torch::NoGradGuard no_grad;
    weight[*options.padding_idx()].fill_(0);
  }
}
```

```python name=torch/nn/modules/sparse.py
def reset_parameters(self) -> None:
    init.normal_(self.weight) # Initialize from the standard normal distribution N(0, 1)
    self._fill_padding_idx_with_zero()
```

So when an `nn.Embedding` object is created, the entries of its weight matrix are drawn at random from N(0, 1), and the row at `padding_idx`, if one is set, is then filled with zeros.
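
The initialization can also be checked empirically. The sketch below creates a deliberately large embedding table (the sizes and the seed are arbitrary choices for this illustration) and prints the sample mean and standard deviation of its weights, which should be close to 0 and 1.

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding = nn.Embedding(1000, 100)   # 100,000 weights, enough for stable statistics

weights = embedding.weight.detach()
print(f"mean: {weights.mean().item():.3f}")
print(f"std:  {weights.std().item():.3f}")
```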

# Word2vec


* Word2vec is a **class of models** that represents each word in a large text corpus as a vector in an n-dimensional feature space, placing similar words close to each other.

* Word2vec is a simple yet popular way to build word embeddings: it maps words from a sparse representation space into one of much lower dimension than the number of words in the dictionary.



* Word2Vec has two neural network-based variants, which are:

    * Continuous Bag of Words (CBOW)
    * Skip-gram.
![](https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-2048x1075.png)


## Continuous Bag of words (CBOW)

* The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It predicts a target word from its context: a few words before it and a few words after it. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to train word embeddings quickly, and those embeddings then initialize the embedding layer of a more complicated model. This is usually referred to as pretraining embeddings, and it often improves downstream performance by a small margin.

* CBOW is modelled as follows:
    * Given a target word $w_i$ and a context window of $N$ words on each side, $w_{i-1}, \cdots, w_{i-N}$ and $w_{i+1},\cdots, w_{i+N}$, refer to all context words collectively as $C$.

    * CBOW tries to minimize the objective function:

$$
-\log p(w_i|C) = -\log\text{Softmax}\left(A\left(\sum_{w\in C}q_w\right)+b\right)
$$

where $q_w$ is the embedding of word $w$, and $A$ and $b$ are the weight matrix and bias of the output layer.
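
The objective can be evaluated numerically for a single sample. In the NumPy sketch below, `Q` holds the embeddings $q_w$ as rows, and `A` and `b` play the role of the output projection and bias. All sizes, indices, and random values are arbitrary assumptions for illustration; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4

Q = rng.normal(size=(vocab_size, embed_dim))  # row w is the embedding q_w
A = rng.normal(size=(vocab_size, embed_dim))  # output projection
b = np.zeros(vocab_size)                      # output bias

context_ids = [1, 2, 4, 5]  # indices of the words in C
target_id = 3               # index of the target word w_i

summed = Q[context_ids].sum(axis=0)           # sum of the context embeddings
logits = A @ summed + b

# Log-softmax, subtracting the max first for numerical stability
shifted = logits - logits.max()
log_probs = shifted - np.log(np.exp(shifted).sum())

loss = -log_probs[target_id]                  # -log p(w_i | C)
print(loss)
```

Because the context embeddings are summed, the order of the context words does not affect the loss.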

In [None]:
# N = 2 according to the definition
CONTEXT_SIZE = 2

corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

corpus = corpus.split()
len(corpus)

62

### Create an integer mapping

In [None]:
len(set(corpus))

49

In [None]:
vocab = set(corpus)
vocab_size = len(vocab)

# Integer word mapping
word_to_idx = {word: i for i, word in enumerate(vocab)}
word_to_idx

{'things': 0,
 'are': 1,
 'called': 2,
 'create': 3,
 'We': 4,
 'Computational': 5,
 'that': 6,
 'pattern': 7,
 'spells.': 8,
 'a': 9,
 'about': 10,
 'evolution': 11,
 'other': 12,
 'rules': 13,
 'The': 14,
 'the': 15,
 'our': 16,
 'People': 17,
 'by': 18,
 'we': 19,
 'beings': 20,
 'effect,': 21,
 'to': 22,
 'of': 23,
 'is': 24,
 'conjure': 25,
 'processes.': 26,
 'with': 27,
 'In': 28,
 'inhabit': 29,
 'idea': 30,
 'computational': 31,
 'direct': 32,
 'data.': 33,
 'directed': 34,
 'processes': 35,
 'abstract': 36,
 'study': 37,
 'program.': 38,
 'evolve,': 39,
 'programs': 40,
 'manipulate': 41,
 'computer': 42,
 'process': 43,
 'As': 44,
 'spirits': 45,
 'they': 46,
 'computers.': 47,
 'process.': 48}

### Build context according to the given corpus

In [None]:
data = []

for i in range(CONTEXT_SIZE, len(corpus) - CONTEXT_SIZE):
    context = (
        [corpus[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [corpus[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = corpus[i]
    data.append((context, target))

data

[(['are', 'We', 'to', 'study'], 'about'),
 (['about', 'are', 'study', 'the'], 'to'),
 (['to', 'about', 'the', 'idea'], 'study'),
 (['study', 'to', 'idea', 'of'], 'the'),
 (['the', 'study', 'of', 'a'], 'idea'),
 (['idea', 'the', 'a', 'computational'], 'of'),
 (['of', 'idea', 'computational', 'process.'], 'a'),
 (['a', 'of', 'process.', 'Computational'], 'computational'),
 (['computational', 'a', 'Computational', 'processes'], 'process.'),
 (['process.', 'computational', 'processes', 'are'], 'Computational'),
 (['Computational', 'process.', 'are', 'abstract'], 'processes'),
 (['processes', 'Computational', 'abstract', 'beings'], 'are'),
 (['are', 'processes', 'beings', 'that'], 'abstract'),
 (['abstract', 'are', 'that', 'inhabit'], 'beings'),
 (['beings', 'abstract', 'inhabit', 'computers.'], 'that'),
 (['that', 'beings', 'computers.', 'As'], 'inhabit'),
 (['inhabit', 'that', 'As', 'they'], 'computers.'),
 (['computers.', 'inhabit', 'they', 'evolve,'], 'As'),
 (['As', 'computers.', 'evol

### Problem 2
Name at least two limitations of this context construction step and explain your answers.

- The procedure above does not normalize tokens, so variants of the same word are treated as distinct words. For example, 'processes' in the middle of a sentence differs from 'processes.' at the end of a sentence because of the trailing period, so both appear in the vocabulary. Likewise, 'computational' is capitalized when it starts a sentence, which produces a separate 'Computational' entry.

In [None]:
word_to_idx['Computational'] != word_to_idx['computational']

True

In [None]:
word_to_idx['processes.'] != word_to_idx['processes']

True

- The training set contains context windows that span two sentences. Such windows do not reflect the real context of the target word.

In [None]:
# Consider the following two sentences in the corpus
# People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.
data[-11]

(['direct', 'to', 'In', 'effect,'], 'processes.')
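
One possible way to address both limitations is sketched below: lowercase the text, strip punctuation, and build context windows one sentence at a time. The helper `sentence_tokens`, the regular expressions, and the shortened example text are illustrative assumptions, not part of the original exercise.

```python
import re

text = "People create programs to direct processes. In effect, we conjure the spirits."
CONTEXT = 2

def sentence_tokens(text):
    """Split on '.', '?' or '!', then lowercase each sentence and keep only word characters."""
    sentences = re.split(r'[.?!]', text)
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s.strip()]

samples = []
for tokens in sentence_tokens(text):
    # Windows never cross a sentence boundary because each sentence is handled separately
    for i in range(CONTEXT, len(tokens) - CONTEXT):
        context = tokens[i - CONTEXT:i] + tokens[i + 1:i + CONTEXT + 1]
        samples.append((context, tokens[i]))

for sample in samples:
    print(sample)
```

Here the left context is listed in reading order, unlike the cell above; since CBOW sums the context embeddings, the order makes no difference to the model.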

### Vectorize context

In [None]:
def make_context_vector(context: List[str],
                        word_to_idx: dict) -> torch.Tensor:
    """
    Function to map a word context vector into a torch tensor

    Args:
    context (List[str]) -- A context (including individual n-grams tokens)
    word_to_idx (dict)  -- A function to map a word into its respective integer

    Returns:
    A pytorch tensor including a list of mapped word

    Example:
    ['are', 'We', 'to', 'study'] --> tensor([ 1,  4, 22, 37])
    """

    ### START YOUR CODE HERE ###
    # Index directly so that an out-of-vocabulary word raises a clear KeyError
    return torch.tensor([word_to_idx[word] for word in context], dtype=torch.long)
    ### END YOUR CODE HERE ###

In [None]:
# Functional test
print("Example sample: ", data[0][0])
make_context_vector(data[0][0], word_to_idx)

Example sample:  ['are', 'We', 'to', 'study']


tensor([ 1,  4, 22, 37])

### CBOW model implementation

In [None]:
class CBOW(nn.Module):
    def __init__(self,
                 vocab_size: int,
                 embed_dim: int) -> None:
        """
        Model constructor
        """
        super().__init__()

        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

        self.embedding_layer = nn.Embedding(vocab_size, embed_dim)
        self.linear_layer = nn.Linear(embed_dim, vocab_size)

        # Neural weight initialization
        nn.init.xavier_normal_(self.embedding_layer.weight)
        nn.init.xavier_normal_(self.linear_layer.weight)

    def forward(self, inputs):
        """
        Function to conduct forward passing
        """
        embedding = self.embedding_layer(inputs)
        embedding = torch.sum(embedding, dim=1)
        output = self.linear_layer(embedding)
        output_softmax = F.log_softmax(output, dim=1)
        return output_softmax

In [None]:
cbow_model = CBOW(vocab_size=vocab_size,
                  embed_dim=10)

# Switch to training mode (gradients are already tracked by default)
cbow_model.train()
cbow_model

CBOW(
  (embedding_layer): Embedding(49, 10)
  (linear_layer): Linear(in_features=10, out_features=49, bias=True)
)

In [None]:
# Run one forward pass of cbow_model on a single context example
context_example = data[1][0]
integer_context = make_context_vector(context_example, word_to_idx)
output = cbow_model(integer_context.unsqueeze(0)) # Add a batch dimension
output


tensor([[-3.4890, -4.3462, -4.2833, -4.0922, -3.6406, -4.1345, -3.7928, -3.9724,
         -4.4476, -4.0507, -4.4788, -4.0544, -4.0365, -4.0280, -3.8548, -3.9399,
         -3.6291, -3.9395, -3.6216, -3.7341, -4.0825, -3.5447, -3.8730, -3.7023,
         -3.8599, -3.8637, -3.6783, -3.9750, -3.8881, -4.1699, -4.1348, -4.3463,
         -3.6063, -3.7652, -3.7825, -4.0533, -3.7050, -4.2818, -3.4532, -4.0201,
         -3.9828, -4.3346, -4.5141, -3.7451, -3.5092, -3.8068, -3.5355, -3.7247,
         -3.9387]], grad_fn=<LogSoftmaxBackward0>)

In [None]:
# The model returns log-probabilities, so their exponentials should sum to 1
torch.sum(torch.exp(output))

tensor(1., grad_fn=<SumBackward0>)

### Train

#### Hyperparameters and training configuration

In [None]:
num_epochs: int = 5
learning_rate: float = 5e-2
optimizer: torch.optim.Optimizer = torch.optim.Adam(cbow_model.parameters(),
                                                    lr=learning_rate)

loss_function = nn.NLLLoss()

#### Training phase

In [None]:
# Build the full training set once: one row of context indices per sample
input_vector = torch.stack([make_context_vector(context, word_to_idx) for context, _ in data])
target_vector = torch.tensor([word_to_idx[target] for _, target in data])

for epoch in range(1, num_epochs + 1):
    print(f"#Epoch {epoch}/{num_epochs}")

    # Zero out the gradients from the previous step to avoid gradient accumulation
    cbow_model.zero_grad()

    # Forward pass over the whole data set
    log_probabilities = cbow_model(input_vector)

    # Evaluate loss
    loss = loss_function(log_probabilities, target_vector)

    # Backpropagation
    loss.backward()

    # Update the parameters according to the optimization algorithm
    optimizer.step()

    # Get loss values
    epoch_loss = loss.item()
    print("Loss:", epoch_loss)

#Epoch 1/5
Loss: 4.489914894104004
#Epoch 2/5
Loss: 3.731325626373291
#Epoch 3/5
Loss: 3.085339307785034
#Epoch 4/5
Loss: 2.4373152256011963
#Epoch 5/5
Loss: 1.6826579570770264




In [None]:
print(input_vector[:1])  # first row of the training inputs

tensor([[ 1,  4, 22, 37]])


#### Inference

In [None]:
with torch.no_grad(): # No gradient update in inference
    context = ['In', 'processes.', 'we', 'conjure']

    # Vectorize input from text to numeric type (make_context_vector already returns a tensor)
    input_tensor = make_context_vector(context, word_to_idx).unsqueeze(0)

    # Model makes prediction
    output_tensor = cbow_model(input_tensor)

    # Get the item id with the highest probability
    prediction = torch.argmax(output_tensor).item()

    # Query the respective word from the given item id
    key_list = list(word_to_idx.keys())
    prediction = key_list[prediction]

    print("Context:", context)
    print("Prediction:", prediction)

Context: ['In', 'processes.', 'we', 'conjure']
Prediction: about




In [None]:
with torch.no_grad(): # No gradient update in inference
    idx = 18
    context = data[idx][0]
    ground_truth = data[idx][1]

    # Vectorize input from text to numeric type (make_context_vector already returns a tensor)
    input_tensor = make_context_vector(context, word_to_idx).unsqueeze(0)

    # Model makes prediction
    output_tensor = cbow_model(input_tensor)

    # Get the item id with the highest probability
    prediction = torch.argmax(output_tensor).item()

    # Query the respective word from the given item id
    key_list = list(word_to_idx.keys())
    prediction = key_list[prediction]

    print("Context:", context)
    print("Prediction:", prediction)
    print("Ground truth: ", ground_truth)

Context: ['As', 'computers.', 'evolve,', 'processes']
Prediction: People
Ground truth:  they




In [None]:
data[15][1]

'inhabit'

## Skip-gram

<center>
<img src="https://machinelearningcoban.com/tabml_book/_images/word2vec2.png">
</center>

- Skip-gram is based on the distributional hypothesis, under which words with similar distributions are considered to have similar meanings. Its authors proposed a model with fewer parameters, along with novel methods that make the optimization step more efficient.

- Vanilla SkipGram model:

<center>
<img src="https://d3i71xaburhd42.cloudfront.net/a1d083c872e848787cb572a73d97f2c24947a374/5-Figure1-1.png" scale=70%>
</center>

- The main idea is to optimize the model so that, when queried with a word, it correctly guesses all of the surrounding context words (a context size of 2 in the figure). That is,
$$
y=\sigma(Ux)
$$
    - where $x$ and $y$ are one-hot encoded word vectors, $U$ is the embedding matrix, and $\sigma(\cdot)$ is the softmax function.

With the same dataset, the training set for skip-gram can be much larger than that of a neural probabilistic language model (NPLM), since each position yields $2c$ samples $\left(w_t:w_{t-c}, ...,w_t:w_{t-1},w_t:w_{t+1},...,w_t:w_{t+c}\right)$ while other n-gram based models yield one $\left((w_{t-c},...,w_{t-1},w_{t+1},...,w_{t+c}):w_t\right)$.
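
The difference in training-set size can be counted directly. The pure-Python sketch below uses two hypothetical helpers: one lists skip-gram (center, context) pairs, the other lists CBOW-style (context, target) samples for positions that have a full window. The short token list is an arbitrary example.

```python
def skipgram_pairs(tokens, c):
    """All (center, context) pairs within a window of c words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-c, c + 1):
            if offset != 0 and 0 <= t + offset < len(tokens):
                pairs.append((center, tokens[t + offset]))
    return pairs

def cbow_samples(tokens, c):
    """One (context, target) sample per position that has a full window."""
    return [(tokens[t - c:t] + tokens[t + 1:t + c + 1], tokens[t])
            for t in range(c, len(tokens) - c)]

tokens = "we are about to study the idea".split()
print(len(skipgram_pairs(tokens, 2)))  # pairs for skip-gram
print(len(cbow_samples(tokens, 2)))    # samples for a CBOW-style model
```

Interior positions contribute $2c$ pairs each; positions near the edges contribute fewer because part of the window falls outside the text.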

In [None]:
corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

In [None]:
class SkipGramModel(nn.Module):
    def __init__(self,
                 vocab_size: int,
                 embed_dim: int) -> None:
        """
        Model construction
        """
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

        ### START YOUR CODE HERE ###
        # Declare embedding function u and v
        # with given vocab size and embed dim using nn.Embedding
        self.v_embedding_layer = nn.Embedding(self.vocab_size, self.embed_dim)
        self.u_embedding_layer = nn.Embedding(self.vocab_size, self.embed_dim)

        # Network weight initialization with Xavier initialization
        nn.init.xavier_normal_(self.u_embedding_layer.weight)
        nn.init.xavier_normal_(self.v_embedding_layer.weight)

        ### END YOUR CODE HERE ###

    def forward(self, center_words, context):
        """
        Function to perform forward passing
        """
        v_embedding = self.v_embedding_layer(center_words)
        u_embedding = self.u_embedding_layer(context)

        score = torch.mul(v_embedding, u_embedding)
        score = torch.sum(score, dim=1)
        log_score = F.logsigmoid(score)
        return log_score

In [None]:
skipgram_model = SkipGramModel(vocab_size=vocab_size,
                               embed_dim=128)

skipgram_model.train()
skipgram_model

SkipGramModel(
  (v_embedding_layer): Embedding(49, 128)
  (u_embedding_layer): Embedding(49, 128)
)

### Prepare training data to match the format of SkipGram model

In [None]:
def gather_training_data(corpus,
                         word_to_idx: dict,
                         context_size: int):
    """
    This function is to transform the given corpus
    into the correct format for SkipGram to serve as its input
    """

    training_data = []
    all_vocab_indices = list(range(len(word_to_idx)))

    split_text = corpus.split('\n')

    # For each line of the corpus (the text is split on newlines, not on sentence boundaries)
    for sentence in split_text:
        indices = [word_to_idx[word] for word in sentence.split(' ')]

        # For each word treated as center word
        for center_word_pos in range(len(indices)):

            # For each window  position
            for w in range(-context_size, context_size+1):
                context_word_pos = center_word_pos + w

                # Make sure we don't step outside the sentence
                if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                    continue

                context_word_idx = indices[context_word_pos]
                center_word_idx  = indices[center_word_pos]

                # The same word may appear close to itself; skip such pairs
                if center_word_idx == context_word_idx:
                    continue

                training_data.append([center_word_idx, context_word_idx])

    return training_data

In [None]:
training_data = gather_training_data(corpus,
                                     word_to_idx,
                                     context_size=2)
training_data = torch.tensor(training_data).to(dtype=torch.long)
training_data.shape

torch.Size([212, 2])

In [None]:
training_data

tensor([[ 4,  1],
        [ 4, 10],
        [ 1,  4],
        [ 1, 10],
        [ 1, 22],
        [10,  4],
        [10,  1],
        [10, 22],
        [10, 37],
        [22,  1],
        [22, 10],
        [22, 37],
        [22, 15],
        [37, 10],
        [37, 22],
        [37, 15],
        [37, 30],
        [15, 22],
        [15, 37],
        [15, 30],
        [15, 23],
        [30, 37],
        [30, 15],
        [30, 23],
        [30,  9],
        [23, 15],
        [23, 30],
        [23,  9],
        [23, 31],
        [ 9, 30],
        [ 9, 23],
        [ 9, 31],
        [ 9, 48],
        [31, 23],
        [31,  9],
        [31, 48],
        [48,  9],
        [48, 31],
        [ 5, 35],
        [ 5,  1],
        [35,  5],
        [35,  1],
        [35, 36],
        [ 1,  5],
        [ 1, 35],
        [ 1, 36],
        [ 1, 20],
        [36, 35],
        [36,  1],
        [36, 20],
        [36,  6],
        [20,  1],
        [20, 36],
        [20,  6],
        [20, 29],
        [ 

### Hyperparameters and training configuration

In [None]:
num_epochs: int = 200
learning_rate: float = 5e-1
optimizer: torch.optim.Optimizer = torch.optim.SGD(skipgram_model.parameters(),
                                                   lr=learning_rate)

### Training phase

In [None]:
for epoch in range(num_epochs + 1):
    """
    Adapt the given CBOW training code for SkipGram
    Following by the instruction comments, or you could do it on your own ;)
    """
    ### START YOUR CODE HERE ###

    # Construct input and target tensor
    inputs = training_data[:, 0]
    targets = training_data[:, 1]

    # Zero out the gradients from the previous step to avoid gradient accumulation
    skipgram_model.zero_grad()

    # Forward pass: log-sigmoid score of each (center, context) pair
    log_sigmoid_scores = skipgram_model(inputs, targets)

    # Evaluate loss (negative mean log-sigmoid score of the observed pairs)
    loss = torch.mean(-1 * log_sigmoid_scores)

    # Backpropagation
    loss.backward()

    # Update the parameters according to the optimization algorithm
    optimizer.step()

    # Get loss values
    epoch_loss = loss.item()

    # Log result
    if epoch % 50 == 0:
        print(f"#Epoch {epoch}/{num_epochs}")
        print("Loss:", epoch_loss)

    ### END YOUR CODE HERE ###

#Epoch 0/200
Loss: 0.6919984817504883
#Epoch 50/200
Loss: 0.6037424206733704
#Epoch 100/200
Loss: 0.5197837948799133
#Epoch 150/200
Loss: 0.4365466237068176
#Epoch 200/200
Loss: 0.3578314185142517
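
The loop above scores only observed (center, context) pairs. A commonly used extension, negative sampling, also penalizes randomly drawn pairs. The NumPy sketch below computes that loss for a single pair. It is not the method used in this notebook: the uniform sampler, the matrix sizes, the indices, and the helper `log_sigmoid` are all illustrative assumptions.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

rng = np.random.default_rng(0)
vocab_size, embed_dim, num_negatives = 49, 8, 5

V = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # center-word embeddings
U = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # context-word embeddings

center, context = 4, 1                                   # one observed (positive) pair
# Uniform sampling is a simplification; it may occasionally draw the true context word
negatives = rng.integers(0, vocab_size, size=num_negatives)

positive_term = log_sigmoid(U[context] @ V[center])             # pull the observed pair together
negative_term = log_sigmoid(-(U[negatives] @ V[center])).sum()  # push sampled pairs apart
loss = -(positive_term + negative_term)
print(loss)
```

Without the negative term, the loss can be lowered simply by making every dot product large, regardless of which words actually occur together.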


### Inference

In [None]:
with torch.no_grad():
    context = ['we'] # center word

    ### START YOUR CODE HERE ###
    # Based on the given inference code in the previous section, training code and the context
    # Implement the inference flow from the given context to an output word

    # Use V (the center-word matrix) as the embedding matrix
    embedding_matrix = skipgram_model.v_embedding_layer.weight

    # Embed the input word
    input_tensor = torch.tensor(word_to_idx[context[0]]).unsqueeze(0)
    input_embedding = embedding_matrix[input_tensor]

    # Use log_softmax in place of softmax
    predict_matrix = F.log_softmax(input_embedding @ embedding_matrix.T, dim=1)

    context_size = 2
    # Use the embedding matrix to find the most similar words
    top_indices = torch.topk(predict_matrix, 2 * context_size + 1, sorted=True).indices

    # Print the result
    key_list = list(word_to_idx.keys())
    prediction = [key_list[idx] for idx in top_indices[0]]
    prediction = prediction[1:]  # drop the top match, expected to be the query word itself


    print("Context:", context)
    print("Prediction:", prediction)

Context: ['we']
Prediction: ['study', 'the', 'spirits', 'with']


## Problem 3
What are the differences between CBOW and Skip-gram?

- CBOW aims to predict the *target* word from its neighboring *context* words, whereas Skip-gram aims to predict the *context* words from the target word. Accordingly, the input to CBOW is a list of neighboring words and its output is a single target word. Conversely, the input to Skip-gram is a single target word and its output is the neighboring words.
- Besides the embedding matrix, the CBOW model has an additional weight matrix in its Linear layer, which transforms the aggregated input embedding into the output scores. The Skip-gram model has only two word-embedding matrices, one for context words and one for center words.
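
The structural difference also shows up in the parameter counts. The sketch below counts parameters for the two architectures as they are defined in this notebook (an embedding table plus a Linear layer with bias for CBOW, two embedding tables for Skip-gram). The helper names are hypothetical; the sizes are the ones used when the models were instantiated above.

```python
def cbow_parameters(vocab_size, embed_dim):
    """Embedding table plus a Linear layer (weight and bias)."""
    embedding = vocab_size * embed_dim
    linear = embed_dim * vocab_size + vocab_size
    return embedding + linear

def skipgram_parameters(vocab_size, embed_dim):
    """Two embedding tables: center words (v) and context words (u)."""
    return 2 * vocab_size * embed_dim

print(cbow_parameters(49, 10))       # CBOW(vocab_size=49, embed_dim=10)
print(skipgram_parameters(49, 128))  # SkipGramModel(vocab_size=49, embed_dim=128)
```

At equal embedding size the two counts differ only by the Linear layer's bias, so the larger Skip-gram total here comes from the larger `embed_dim`, not from the architecture.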