## TOKEN EMBEDDINGS


![Screenshot](images/screenshot3.png)


A well-trained embedding can capture significant syntactic information.

<div class="alert alert-block alert-success">
    
Let's illustrate how the token ID to embedding vector conversion works with a hands-on
example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:</div>

In [24]:
import torch
input_ids = torch.tensor([2, 3, 5, 1])  # the four token IDs from the text above

<div class="alert alert-block alert-success">
    
For simplicity, suppose we have a small vocabulary of only 6 tokens (instead of the
50,257 tokens in the BPE tokenizer vocabulary), and we want to create embeddings of
size 4 (in GPT-3, the embedding size is 12,288 dimensions):

</div>

<div class="alert alert-block alert-success">
    
Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch,
setting the random seed to 123 for reproducibility purposes:

</div>

In [25]:
vocab_size = 6
output_dim = 4
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

In [26]:
embedding_layer.weight

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880],
        [ 0.3486,  0.6603, -0.2196, -0.3792],
        [-0.1606, -0.4015,  0.6957, -1.8061],
        [ 1.8960, -0.1750,  1.3689, -1.6033],
        [-0.7849, -1.4096, -0.4076,  0.7953],
        [ 0.9985,  0.2212,  1.8319, -0.3378]], requires_grad=True)

<div class="alert alert-block alert-info">
    
We can see that the weight matrix of the embedding layer contains small, random values.
These values are optimized during LLM training as part of the LLM optimization itself, as we
will see in upcoming chapters. Moreover, we can see that the weight matrix has six rows
and four columns. There is one row for each of the six possible tokens in the vocabulary,
and one column for each of the four embedding dimensions.
    
</div>
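
To make the link between lookup and training concrete, here is a minimal sketch (an illustration, not a cell from the original notebook). It rebuilds the same 6x4 layer, looks up one token ID, and backpropagates a dummy loss. The loss `out.sum()` is an assumption chosen only so the gradient is easy to read.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Embedding(6, 4)  # 6 tokens, 4 embedding dimensions

# One row per token in the vocabulary, one column per embedding dimension
print(layer.weight.shape)

# Look up token ID 3 and backpropagate a dummy loss (sum of the vector entries)
out = layer(torch.tensor([3]))
out.sum().backward()

# Only the row that was looked up receives a nonzero gradient
print(layer.weight.grad)
```

In this sketch only row 3 of the gradient is nonzero, which suggests why an optimizer step would update just the embeddings of tokens that actually appear in a batch.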

<div class="alert alert-block alert-success">
    
Now that we have instantiated the embedding layer, let's apply it to a token ID to obtain
the embedding vector:

</div>

In [None]:
embedding_layer(torch.tensor([3])) # token ID 3 selects the 4th row (index 3) of the matrix

tensor([[ 1.8960, -0.1750,  1.3689, -1.6033]], grad_fn=<EmbeddingBackward0>)

<div class="alert alert-block alert-info">
    
The embedding layer is essentially a lookup table: each token ID selects the row with the same index in the embedding matrix.
    
</div>

In [None]:
torch.tensor([3])

tensor([3])
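
The lookup view can be checked directly. The sketch below (an illustration with the same toy sizes, not a cell from the original notebook) compares the embedding layer's output with plain row indexing into the weight matrix, and with a one-hot matrix multiplication, which is the equivalent but less efficient formulation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
layer = torch.nn.Embedding(6, 4)
ids = torch.tensor([2, 3, 5, 1])

looked_up = layer(ids)        # shape [4, 4]: one embedding row per token ID
indexed = layer.weight[ids]   # plain row indexing into the weight matrix

# Equivalent formulation: one-hot encode the IDs, then multiply by the weights
one_hot = F.one_hot(ids, num_classes=6).float()
via_matmul = one_hot @ layer.weight

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```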

![Screenshot](images/screenshot4.png)


This is the result of sinusoidal encoding; each horizontal line in the plot is the positional encoding vector for one position. Pick any two positions that are close together and you will see that most of their vector entries are the same, while the vectors for two positions far apart are clearly different. The encoding therefore captures relative position well. It also captures absolute position, since the vectors for the first and last words are quite different.
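
A minimal sketch of this behavior, assuming the standard sine/cosine formulation with base 10000 (the screenshots do not state the exact formula). The function name `sinusoidal_encoding` and the sizes 50 and 64 are hypothetical, picked only for illustration.

```python
import math
import torch

def sinusoidal_encoding(num_positions, dim):
    """Return a [num_positions, dim] matrix of sine/cosine encodings (dim must be even)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    # One frequency per sine/cosine pair, decreasing geometrically from 1 to ~1/10000
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe

pe = sinusoidal_encoding(50, 64)

# Compare neighbouring positions with positions that are far apart
near = torch.nn.functional.cosine_similarity(pe[10], pe[11], dim=0)
far = torch.nn.functional.cosine_similarity(pe[10], pe[40], dim=0)
print(near.item(), far.item())
```

Under these assumptions, the similarity between positions 10 and 11 is higher than between positions 10 and 40, matching the pattern described above.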

![Screenshot](images/screenshot5.png)


## POSITIONAL EMBEDDINGS (ENCODING WORD POSITIONS)

In [24]:
from torch.utils.data import Dataset, DataLoader
import torch
import tiktoken
import re

In [31]:
from torch.utils.data import Dataset, DataLoader
import torch
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to create input and target pairs
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i: i + max_length]
            target_chunk = token_ids[i+1: max_length + 1 + i]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):

        return self.input_ids[idx], self.target_ids[idx]
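
The sliding-window logic in `GPTDatasetV1` can be traced on toy data. The sketch below is an illustration only: it replaces real token IDs with the integers 0 to 9 (an assumption made for readability) and applies the same loop with `max_length=4` and `stride=2`.

```python
# Stand-in for tokenizer output: ten consecutive "token IDs"
token_ids = list(range(10))
max_length, stride = 4, 2

pairs = []
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i: i + max_length]
    target_chunk = token_ids[i + 1: i + max_length + 1]  # inputs shifted by one
    pairs.append((input_chunk, target_chunk))

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Each target is its input shifted one token to the right, and consecutive windows overlap by `max_length - stride` tokens.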

In [32]:
class SimpleTokenizerV2:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        pre_processed = re.split(r'([,.:;"!?()\'_]|--|\s)', text)   
        pre_processed = [item for item in pre_processed if item.strip()]
        pre_processed = [item if item in self.str_to_int else "<|unk|>" for item in pre_processed]

        ids = [self.str_to_int[s] for s in pre_processed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


In [33]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [34]:
import torch
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

In [35]:
with open("wharton_verdict.txt", "r", encoding = "utf-8") as f:
    raw_text = f.read()

In [42]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size = 8,
                                  max_length = max_length, stride = 2,
                                  shuffle = True)

data_iter = iter(dataloader)
inputs, targets  = next(data_iter)

In [43]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   13,  1002,   340,   547],
        [  510,   379,   262,  4286],
        [ 7109, 14655,   683,   866],
        [  666,   966,    13,   383],
        [ 1936,  2431,   438,   392],
        [ 1917,    13,  1675, 24456],
        [  464,   748,   586,   652],
        [  526,   198, 43920,  3619]])

Inputs shape:
 torch.Size([8, 4])


In [47]:
token_embeddings = token_embedding_layer(inputs)
print("\nToken Embeddings:\n", token_embeddings.shape)


Token Embeddings:
 torch.Size([8, 4, 256])


In [48]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [50]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [51]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


<div class="alert alert-block alert-info">
    
As shown in the preceding code example, the input to the positional embedding layer is
usually a placeholder vector torch.arange(context_length), which contains the sequence of
numbers 0, 1, ..., up to the maximum input length − 1. The context_length is a variable
that represents the supported input size of the LLM. Here, we set it equal to max_length,
the length of each input sequence. In practice, input text can be longer than the supported
context length, in which case we have to truncate the text.
    
</div>
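
A minimal sketch of the truncation case, assuming the simplest strategy of keeping only the first `context_length` tokens (other strategies, such as keeping the most recent tokens, are equally possible). The six token IDs are arbitrary values chosen for illustration.

```python
import torch

context_length = 4

# One sequence of 6 token IDs, longer than the supported context length
token_ids = torch.tensor([[464, 748, 586, 652, 13, 1002]])

# Keep only the first context_length tokens of each sequence
truncated = token_ids[:, :context_length]

# Position IDs 0 .. context_length - 1, one per remaining token
positions = torch.arange(truncated.shape[1])

print(truncated)
print(positions)
```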

<div class="alert alert-block alert-info">
    
As we can see, the positional embedding tensor consists of four 256-dimensional vectors.
We can add these directly to the token embeddings: PyTorch broadcasts the 4x256-dimensional
pos_embeddings tensor across the batch, adding it to the 4x256-dimensional token embedding
tensor of each of the 8 sequences.
    
</div>
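
The broadcasting step can be verified in isolation. The sketch below is an illustration that uses random tensors in place of real embeddings (an assumption that keeps it self-contained). It checks that the broadcast sum equals adding the positional embeddings to each sequence explicitly.

```python
import torch

torch.manual_seed(123)
token_embeddings = torch.randn(8, 4, 256)  # [batch, tokens, embedding dim]
pos_embeddings = torch.randn(4, 256)       # [tokens, embedding dim]

# Broadcasting: pos_embeddings is treated as [1, 4, 256] and reused for all 8 sequences
broadcast_sum = token_embeddings + pos_embeddings

# The same result, written as an explicit loop over the sequences in the batch
explicit_sum = torch.stack([seq + pos_embeddings for seq in token_embeddings])

print(broadcast_sum.shape)
print(torch.equal(broadcast_sum, explicit_sum))
```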