## Word2Vec

**Word Embeddings**: A word embedding represents each word as a dense vector (typically tens to hundreds of dimensions) in a continuous vector space, where words with similar meanings lie close to each other.

Word embeddings enable algorithms to understand the semantic relationships between words and perform tasks like language translation, sentiment analysis, and text classification more effectively.
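To make "closer to each other" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. Cosine similarity is one common choice of closeness measure and is an assumption here, since the text does not name a metric. The three-dimensional vectors are invented for illustration and are not learned embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these toy values, the "king"/"queen" pair scores much higher than the "king"/"apple" pair, which is the kind of structure a trained embedding is expected to show.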

In Word2Vec, word embeddings are created using a shallow neural network model trained on a large corpus of text. There are two main architectures for training word embeddings in Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

1. **Continuous Bag of Words (CBOW)**: In CBOW, the model is trained to predict the target word based on its context words within a certain window size. The input to the model is a context of surrounding words, and the output is the target word. The model learns to map words to vectors in such a way that similar words have similar vectors.


2. **Skip-gram**: In Skip-gram, the model is trained to predict the context words given a target word. The input to the model is a single word, and the output is the probability distribution of context words within a certain window size. The model learns to generate word embeddings by maximizing the probability of predicting context words given the target word.
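The difference between the two architectures shows up directly in the training examples they produce. The sketch below builds both kinds of examples from one short sentence. The helper names `cbow_pairs` and `skipgram_pairs` are hypothetical, and dropping positions without a full window in the CBOW case is a simplifying assumption.

```python
def cbow_pairs(tokens, window=2):
    """One (context_words, target) example per position with a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


def skipgram_pairs(tokens, window=2):
    """One (target, context_word) example per neighbour inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "people create programs to direct processes".split()
cbow_examples = cbow_pairs(tokens)
skipgram_examples = skipgram_pairs(tokens)

print(cbow_examples[0])       # (['people', 'create', 'to', 'direct'], 'programs')
print(skipgram_examples[:2])  # [('people', 'create'), ('people', 'programs')]
print(len(cbow_examples), len(skipgram_examples))  # 2 18
```

Skip-gram turns the same text into many more examples than CBOW, one per (target, neighbour) pair, which is one reason it tends to be slower to train.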

<center>
<img src="images/skipgram_detailed.png" alt="Skip-gram Architecture" style="width: 600px;"/>
<figcaption align = "center"> Skip-gram architecture </figcaption>
</center>
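The Skip-gram objective described above can be written out as follows. The notation is an assumption introduced here for illustration: $T$ is the number of words in the corpus, $c$ is the window size, $V$ is the vocabulary size, $v_w$ is the input (embedding) vector of word $w$, and $v'_w$ is its output vector.

$$
J = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
$$

$$
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{V} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
$$

Training maximizes $J$. Because the sum in the denominator runs over the whole vocabulary, practical implementations usually approximate it, for example with negative sampling or a hierarchical softmax.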


During training, the neural network adjusts the word embeddings (vector representations) based on the errors in predicting the target or context words. The training process involves iterating through the corpus multiple times (epochs) to update the word embeddings until they capture meaningful semantic relationships between words.
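A single update step can be sketched as below, assuming a CBOW-style setup with an averaged context and plain SGD. The sizes, word indices, and learning rate are arbitrary values chosen for illustration. The sketch records which rows of the embedding matrix change after one step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

vocab_size, embedding_dim = 10, 4          # toy sizes, chosen for illustration
embeddings = nn.Embedding(vocab_size, embedding_dim)
output_layer = nn.Linear(embedding_dim, vocab_size)
params = list(embeddings.parameters()) + list(output_layer.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

context_ids = torch.tensor([1, 2, 4, 5])   # hypothetical context word indices
target_id = torch.tensor([3])              # hypothetical target word index

before = embeddings.weight.detach().clone()

# Forward pass: average the context embeddings, then score every vocabulary word.
context_vector = embeddings(context_ids).mean(dim=0, keepdim=True)
loss = loss_fn(output_layer(context_vector), target_id)

# Backward pass and one SGD update.
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Rows of the embedding matrix whose values changed in this step.
changed = (embeddings.weight.detach() != before).any(dim=1)
changed_rows = changed.nonzero().flatten().tolist()
print("rows updated:", changed_rows)
```

Only the rows belonging to the context words receive a gradient, so each step moves just the embeddings of the words that took part in the prediction. Repeating this over the corpus for several epochs is what gradually shapes the embedding space.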


|Aspect |	CBOW  |	Skip-gram |
|---	|---	|---	|
| Architecture |	Predicts a target word given its context  |	Predicts context words given a target word |
| Objective | Minimizes the loss of predicting the target word from its context | Minimizes the loss of predicting each context word from the target word |
| Training Data	| Requires a large corpus of text |	Requires a large corpus of text | 
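| Computational Efficiency | Typically trains faster: one prediction per context window | Typically trains slower: one prediction per context word in the window |
| Model Size | Same number of parameters as Skip-gram for a given vocabulary size and embedding dimension | Same number of parameters as CBOW for a given vocabulary size and embedding dimension |
| Context Window Size | Window size is a hyperparameter; all context words in the window are combined into one input | Window size is a hyperparameter; each context word in the window is a separate prediction |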
| Word Representations | Dense word vectors | Dense word vectors | 
| Semantic Relationships | Loses some of the finer semantic details, since context vectors are combined | Captures more fine-grained semantic relationships |
| Performance on Rare Words | Tends to perform poorly on rare words | Tends to perform better on rare words | 
| Performance on Frequent Words | Tends to perform well on frequent words | Tends to perform worse on frequent words | 
| Use Cases	| Good for applications with limited resources | Useful for a wider range of NLP tasks | 
| Typical Application | Often chosen for large corpora, where training speed matters | Often reported to work well even with smaller corpora |
| Word2Vec Implementation | One of the two original Word2Vec architectures | One of the two original Word2Vec architectures |



<center>
<img src="images/cbow_skipgram_example.png" alt="Example use of CBOW and Skip-gram" style="width: 600px;"/>
<figcaption align = "center"> Example use of CBOW and Skip-gram </figcaption>
</center>


### CBOW Model

- The basic CBOW model is a custom PyTorch module (a subclass of `nn.Module`).
- It consists of an embedding layer (`nn.Embedding`) and a linear layer (`nn.Linear`).
- Embedding layer: converts the indices of the context words into dense vectors.
    - The `nn.Embedding` layer takes two parameters: the vocabulary size and the dimensionality of the embeddings.
    - Each unique word in the vocabulary is associated with its own embedding vector in this layer.
    - The outputs of the embedding layer are averaged to get a single vector that represents the context.
    - During training, these vectors are updated so that words appearing in similar contexts end up with similar vectors. This helps the model generalize to contexts it has not seen.
- Linear layer: a fully connected (dense) layer that applies a linear transformation to the incoming data.
    - It is defined by two parameters: the number of input features and the number of output features.
    - In this case, `embedding_dim` is the number of input features and `vocab_size` is the number of output features.
    - The `vocab_size` outputs can be interpreted as unnormalized log probabilities (logits), one for each word in the vocabulary being the target word.
    - The `nn.Linear` class automatically creates the weight and bias tensors, which are learned during training, and computes `output = input.matmul(weight.t()) + bias`.
- The `torch.mean` function computes the average of the embeddings of the context words.
- The context vector is then passed through the linear layer to predict the target word.
- The `CBOW` class in the code below varies this design: it sums the context embeddings instead of averaging them, and it adds a 128-unit hidden linear layer with a ReLU activation before the output layer, which is followed by `nn.LogSoftmax`.
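A minimal sketch of the model as described in the list above (embedding layer, mean over the context, single linear layer) might look like this. The batch-first input shape, the toy sizes, and the word indices are assumptions made for illustration. The last lines check the `nn.Linear` formula numerically.

```python
import torch
import torch.nn as nn


class CBOWModel(nn.Module):
    """Embedding layer -> mean over context words -> linear layer."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch_size, context_len) tensor of word indices
        embedded = self.embeddings(context_ids)        # (batch, context_len, dim)
        context_vector = torch.mean(embedded, dim=1)   # (batch, dim)
        return self.linear(context_vector)             # (batch, vocab_size) logits


model_sketch = CBOWModel(vocab_size=50, embedding_dim=8)   # toy sizes
batch = torch.tensor([[3, 7, 11, 19], [0, 1, 2, 4]])       # hypothetical indices
logits = model_sketch(batch)
print(logits.shape)  # torch.Size([2, 50])

# nn.Linear computes input.matmul(weight.t()) + bias
x = torch.randn(2, 8)
manual = x.matmul(model_sketch.linear.weight.t()) + model_sketch.linear.bias
print(torch.allclose(model_sketch.linear(x), manual, atol=1e-5))
```

The model returns raw logits, so it would be paired with `nn.CrossEntropyLoss`, which applies the log-softmax internally. Applying `nn.LogSoftmax` in the model and using `nn.NLLLoss` is an equivalent alternative.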

<center>
<img src="images/cbow_detailed.png" alt="CBOW Architecture" style="width: 600px;"/>
<figcaption align = "center"> CBOW architecture </figcaption>
</center>

```python
import torch
import torch.nn as nn


def make_context_vector(context, word_to_ix):
    """Convert a list of context words into a tensor of vocabulary indices."""
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_DIM = 300

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: ix for ix, word in enumerate(vocab)}
ix_to_word = {ix: word for ix, word in enumerate(vocab)}

# Build (context, target) pairs: CONTEXT_SIZE words on each side of the target
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (raw_text[i - CONTEXT_SIZE:i]
               + raw_text[i + 1:i + CONTEXT_SIZE + 1])
    target = raw_text[i]
    data.append((context, target))


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()

        # Embedding lookup, then a hidden layer: 1 x embedding_dim -> 1 x 128
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()

        # Output layer: 1 x 128 -> 1 x vocab_size log probabilities
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # Sum the context word embeddings into a single 1 x embedding_dim vector
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_embedding(self, word):
        """Return the learned 1 x embedding_dim vector for `word`."""
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)


model = CBOW(vocab_size, EMBEDDING_DIM)

# NLLLoss expects log probabilities, which LogSoftmax provides
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)


# TRAINING
print("\n** ------ *** ------ * TRAINING * ------ *** ------ **\n ")

iterations = 100
for epoch in range(iterations):
    total_loss = 0

    # Accumulate the loss over every (context, target) pair in the dataset
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        total_loss += loss_function(log_probs, torch.tensor([word_to_ix[target]]))

    if epoch % 10 == 0:
        print(f'Epoch: {epoch}/{iterations} Loss: {total_loss.item()}')

    # Full-batch update: backpropagate the summed loss once per epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()


# TESTING
print("\n** ------ *** ------ * TESTING * ------ *** ------ **\n ")
context = ['People', 'create', 'to', 'direct']
context_vector = make_context_vector(context, word_to_ix)
log_probs = model(context_vector)

# Print result
print(f'Raw text: {" ".join(raw_text)}')
print(f'Context: {context}')
print(f'Prediction: {ix_to_word[torch.argmax(log_probs[0]).item()]}')
```


Output:

```text
** ------ *** ------ * TRAINING * ------ *** ------ **

Epoch: 0/100 Loss: 230.70034790039062
Epoch: 10/100 Loss: 111.81954956054688
Epoch: 20/100 Loss: 42.16696548461914
Epoch: 30/100 Loss: 17.44841194152832
Epoch: 40/100 Loss: 9.785517692565918
Epoch: 50/100 Loss: 6.553323268890381
Epoch: 60/100 Loss: 4.847179889678955
Epoch: 70/100 Loss: 3.8093013763427734
Epoch: 80/100 Loss: 3.117750644683838
Epoch: 90/100 Loss: 2.6279261112213135

** ------ *** ------ * TESTING * ------ *** ------ **

Raw text: We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.
Context: ['People', 'create', 'to', 'direct']
Prediction: programs
```


### Word2Vec Skip-gram implementation using PyTorch

A PyTorch implementation of Skip-gram with negative sampling is available at:
https://github.com/lukysummer/SkipGram_with_NegativeSampling_Pytorch/blob/master/SkipGram_NegativeSampling.ipynb
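For a quick feel of the architecture without following the link, here is a minimal Skip-gram sketch. It is not the linked notebook's method: it uses a full softmax over the vocabulary instead of negative sampling, which is only practical for tiny vocabularies. The class name `SkipGramSketch`, the `sg_` prefixed names, the toy sentence, and the hyperparameters are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable


class SkipGramSketch(nn.Module):
    """Target word index -> embedding -> scores over possible context words."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, target_ids):
        # target_ids: (batch,) -> logits: (batch, vocab_size)
        return self.linear(self.embeddings(target_ids))


# Toy corpus and vocabulary (sorted so the word indices are deterministic)
sg_tokens = "people create programs to direct processes".split()
sg_word_to_ix = {w: i for i, w in enumerate(sorted(set(sg_tokens)))}

# (target, context) index pairs within a window of 2 words on each side
WINDOW = 2
pairs = [
    (sg_word_to_ix[sg_tokens[i]], sg_word_to_ix[sg_tokens[j]])
    for i in range(len(sg_tokens))
    for j in range(max(0, i - WINDOW), min(len(sg_tokens), i + WINDOW + 1))
    if j != i
]
targets = torch.tensor([t for t, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

sg_model = SkipGramSketch(vocab_size=len(sg_word_to_ix), embedding_dim=8)
sg_loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood
sg_optimizer = torch.optim.SGD(sg_model.parameters(), lr=0.1)

# Full-batch gradient descent over all (target, context) pairs
losses = []
for epoch in range(100):
    sg_optimizer.zero_grad()
    loss = sg_loss_fn(sg_model(targets), contexts)
    loss.backward()
    sg_optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss cannot reach zero here, because each target word has several valid context words and the model can at best spread probability across them. After training, the rows of `sg_model.embeddings.weight` are the learned word vectors.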
