# Embedding in NLP

This note explains embeddings in natural language processing (NLP) through a small worked example.

---

## Key Terms:
1. **Vocabulary ($\mathcal{V}$)**: The set of all unique tokens (words, symbols, etc.) in the dataset. For example, if we have a vocabulary of 30,000 words, $|\mathcal{V}| = 30{,}000$.

2. **Embedding Size ($D$)**: The size of the vector that represents each token. For example, if each word is represented as a 1024-dimensional vector, $D = 1024$.

3. **Embedding Matrix ($\Omega_e$)**: A matrix of size $D \times |\mathcal{V}|$. Each column in this matrix represents the vector (embedding) of a token in the vocabulary. This matrix is learned during training.

4. **One-Hot Encoding**: A way to represent tokens as vectors. For a vocabulary of size $|\mathcal{V}|$, each token is represented as a $|\mathcal{V}|$-dimensional vector where all values are 0 except for one position, which is set to 1. For example:
   - "cat" → [0, 1, 0, 0, ...]
   - "dog" → [0, 0, 1, 0, ...]
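
The one-hot definition above can be sketched in a few lines of plain Python. This is only an illustration: the four-token vocabulary, the `token_to_index` mapping, and the `one_hot` helper are hypothetical names chosen for this example, with the token order picked so that the indices match the "cat" and "dog" vectors shown above.

```python
# Hypothetical toy vocabulary; the order of the list fixes each token's index.
vocab = ["the", "cat", "dog", "fish"]
token_to_index = {token: i for i, token in enumerate(vocab)}


def one_hot(token, token_to_index):
    """Return a list of length |V| with a single 1 at the token's index."""
    vector = [0] * len(token_to_index)
    vector[token_to_index[token]] = 1
    return vector


cat_vector = one_hot("cat", token_to_index)
dog_vector = one_hot("dog", token_to_index)
print(cat_vector)  # [0, 1, 0, 0]
print(dog_vector)  # [0, 0, 1, 0]
```

In practice the full one-hot vector is rarely materialized, since only the position of the 1 (the token index) carries information.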

## The Process:
1. **Example Vocabulary**:
   Let our vocabulary be:
   $$ 
   \mathcal{V} = \{ \text{"The", "cat", "sat", "on", "mat"} \}
   $$ 
   Thus, $|\mathcal{V}| = 5$.

2. **Input Tokens**:
   Suppose our input sequence is:
   $$ 
   \text{Tokens} = [\text{"The"}, \text{"cat"}, \text{"sat"}] 
   $$ 
   There are $N = 3$ tokens in this sequence.

3. **One-Hot Matrix ($T$)**:
   Convert these tokens into a one-hot encoded matrix $T$ of size $|\mathcal{V}| \times N$:
   $$ 
   T = \begin{bmatrix} 
        1 & 0 & 0 \\
        0 & 1 & 0 \\
        0 & 0 & 1 \\
        0 & 0 & 0 \\
        0 & 0 & 0
   \end{bmatrix} 
   $$ 
   Here, each column corresponds to one token, and the rows correspond to vocabulary terms.

4. **Embedding Matrix ($\Omega_e$)**:
   Suppose $D = 3$ (embedding size). The embedding matrix $\Omega_e$ then has size $3 \times 5$. It is usually initialized with random values; for illustration, take:
   $$ 
   \Omega_e = \begin{bmatrix}
       0.1 & 0.4 & 0.3 & 0.2 & 0.5 \\
       0.6 & 0.1 & 0.2 & 0.7 & 0.3 \\
       0.8 & 0.9 & 0.4 & 0.5 & 0.6
   \end{bmatrix}
   $$ 
   Column $i$ of $\Omega_e$ is the embedding of the $i$-th token in the vocabulary.

5. **Input Embeddings ($X$)**:
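   Left-multiply the one-hot matrix $T$ by the embedding matrix $\Omega_e$ to compute the input embeddings $X$. Because each column of $T$ contains a single 1, the product selects the matching column of $\Omega_e$ for each token, so $X$ has size $D \times N$ (here $3 \times 3$) and its columns are the first three columns of $\Omega_e$. The transformation is: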
   $$ 
   X = \Omega_e T.
   $$ 
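
The five steps above can be checked numerically with a dependency-free sketch. It reuses the vocabulary, tokens, and $\Omega_e$ values from the example; the `matmul` helper and the list-of-lists layout are assumptions made only for this illustration. It also confirms that the matrix product gives the same result as picking columns of $\Omega_e$ by token index.

```python
# Values copied from the worked example above.
vocab = ["The", "cat", "sat", "on", "mat"]
tokens = ["The", "cat", "sat"]

# Embedding matrix Omega_e stored as D rows of |V| entries (D = 3, |V| = 5).
omega_e = [
    [0.1, 0.4, 0.3, 0.2, 0.5],
    [0.6, 0.1, 0.2, 0.7, 0.3],
    [0.8, 0.9, 0.4, 0.5, 0.6],
]

# One-hot matrix T of size |V| x N: column n has a 1 in the row of token n.
T = [[1 if vocab[v] == tokens[n] else 0 for n in range(len(tokens))]
     for v in range(len(vocab))]


def matmul(A, B):
    """Plain matrix product for lists of lists (hypothetical helper)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Input embeddings X = Omega_e T, of size D x N.
X = matmul(omega_e, T)

# Equivalent direct lookup: take the column of Omega_e at each token's index.
indices = [vocab.index(token) for token in tokens]
X_lookup = [[row[i] for i in indices] for row in omega_e]

for row in X:
    print(row)
# [0.1, 0.4, 0.3]
# [0.6, 0.1, 0.2]
# [0.8, 0.9, 0.4]
print(X == X_lookup)  # True
```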
---


## Implementation in PyTorch


In [3]:
import torch
import torch.nn as nn

The **torch.nn.Embedding** module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.

In [22]:

vocab_size = 5  # Number of unique tokens
embedding_dim = 3  # Size of each embedding vector

# Create an embedding layer
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
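# PyTorch stores the weights with shape (vocab_size, embedding_dim), i.e. the
# transpose of Omega_e as defined above: each row is one token's embedding
print("Embedding matrix (Omega_e):")
print(embedding_layer.weight.data)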


Embedding matrix (Omega_e):
tensor([[-0.8805, -0.6517,  0.4077],
        [ 0.4389, -1.1243, -0.8373],
        [ 1.3981, -1.4097,  0.8434],
        [ 2.0104,  2.2844,  0.1933],
        [ 0.7380,  0.5161,  1.5216]])


In [23]:
# Example token indices
token_indices = torch.tensor([0, 1, 3, 4])  # indices into the example vocabulary: "The", "cat", "on", "mat"

In [24]:
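# Create the one-hot encoded matrix (one row per token, so this is the
# transpose of T as defined above)
T = torch.nn.functional.one_hot(token_indices, num_classes=vocab_size).to(torch.float)
print("One-hot encoded matrix:")
print(T)
# Multiplying by the (vocab_size, embedding_dim) weights selects one row per token
print("Embeddings for tokens:")
print(T @ embedding_layer.weight.data)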

One-hot encoded matrix:
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]])
Embeddings for tokens:
tensor([[-0.8805, -0.6517,  0.4077],
        [ 0.4389, -1.1243, -0.8373],
        [ 2.0104,  2.2844,  0.1933],
        [ 0.7380,  0.5161,  1.5216]])


In [25]:
# Get embeddings using embedding_layer
X = embedding_layer(token_indices)
print("Embeddings for tokens:")
print(X)

Embeddings for tokens:
tensor([[-0.8805, -0.6517,  0.4077],
        [ 0.4389, -1.1243, -0.8373],
        [ 2.0104,  2.2844,  0.1933],
        [ 0.7380,  0.5161,  1.5216]], grad_fn=<EmbeddingBackward0>)
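
To connect the PyTorch cells back to the notation of the worked example, the sketch below loads the illustrative $\Omega_e$ values into an embedding layer and compares the index lookup with the product $X = \Omega_e T$. It assumes the token order "The", "cat", "sat", "on", "mat" for the indices, and it uses `nn.Embedding.from_pretrained`, which expects weights of shape $|\mathcal{V}| \times D$ and freezes them by default. The variable names `omega_e`, `X_rows`, and `X` are chosen for this illustration only.

```python
import torch
import torch.nn as nn

# Omega_e from the worked example, shape (D, |V|) = (3, 5).
omega_e = torch.tensor([
    [0.1, 0.4, 0.3, 0.2, 0.5],
    [0.6, 0.1, 0.2, 0.7, 0.3],
    [0.8, 0.9, 0.4, 0.5, 0.6],
])

# nn.Embedding stores weights as (|V|, D), hence the transpose.
# from_pretrained keeps the weights frozen unless freeze=False is passed.
embedding_layer = nn.Embedding.from_pretrained(omega_e.T.contiguous())

# Assumed indices for the input sequence "The", "cat", "sat".
token_indices = torch.tensor([0, 1, 2])

# Index lookup returns shape (N, D): one row per token.
X_rows = embedding_layer(token_indices)

# One-hot product in the math convention: T is (|V|, N), X is (D, N).
T = nn.functional.one_hot(token_indices, num_classes=omega_e.shape[1])
T = T.T.to(omega_e.dtype)
X = omega_e @ T

# The lookup result is the transpose of X = Omega_e T.
print(torch.allclose(X, X_rows.T))  # True
```

The lookup avoids building the one-hot matrix and the full matrix product, which matters when the vocabulary has tens of thousands of entries.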
