## Token embeddings

In [5]:
with open("../the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("The first 50 characters of the raw text are:", raw_text[:50])

The first 50 characters of the raw text are: I HAD always thought Jack Gisburn rather a cheap g


#### tiktoken tokenizer

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

#### DataLoader and Dataset

In [10]:
import torch
from torch.utils.data import Dataset, DataLoader

In [15]:
class GPTDataLoaderV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
        # tokenize the text
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        
        # chunking
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1 : i + 1 + max_length]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]        
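
The chunking loop in the dataset can be traced on a toy sequence without PyTorch or a tokenizer. The following is a minimal sketch in plain Python: `sliding_windows` is a hypothetical helper name (not part of the notebook), and small integers stand in for real token IDs. It is meant to show how `stride` controls the overlap between consecutive input chunks, and that each target chunk is the input chunk shifted by one position.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) chunks with the same loop bounds as the dataset class."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        # The target is the input shifted right by one token.
        target_chunk = token_ids[i + 1:i + 1 + max_length]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs (assumption)

overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print("stride=1, first two pairs:", overlapping[:2])
print("stride=4, all pairs:      ", disjoint)
```

With `stride=1`, consecutive inputs share all but one token; with `stride` equal to `max_length`, the inputs do not overlap at all.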

In [16]:
def createDataLoader_V1(txt, max_length=256, stride=128, batch_size=4, shuffle=True, drop_last=True, num_workers=0):
    # tokenizer
    tokenizer =  tiktoken.get_encoding("gpt2")
    
    # dataset
    dataset = GPTDataLoaderV1(txt, tokenizer, max_length=max_length, stride=stride)
    
    # dataloader 
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers, drop_last=drop_last)
    
    return dataloader

**Create the dataloader with batch_size=1 and max_length=4, then wrap it in `iter` to fetch the first batch**

In [17]:
dataloader = createDataLoader_V1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("first batch is: \n", first_batch)

first batch is: 
 [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [25]:
inputs, targets = first_batch
print(f"first batch input output pairs are\n {inputs} ---> {targets}")
print(f" {tokenizer.decode(inputs.squeeze().tolist())} ---> {tokenizer.decode(targets.squeeze().tolist())}")

first batch input output pairs are
 tensor([[  40,  367, 2885, 1464]]) ---> tensor([[ 367, 2885, 1464, 1807]])
 I HAD always --->  HAD always thought


**let's examine the 2nd batch as well**

In [28]:
dataloader = createDataLoader_V1(raw_text, max_length=4, stride=1, batch_size=1, shuffle=False)
data_iter = iter(dataloader)

for i in range(0, 2):
    batch = next(data_iter)
    inputs, targets = batch
    print(f"\nbatch: {i} input output pairs are\n {inputs} ---> {targets}")
    print(f" {tokenizer.decode(inputs.squeeze().tolist())} ---> {tokenizer.decode(targets.squeeze().tolist())}")


batch: 0 input output pairs are
 tensor([[  40,  367, 2885, 1464]]) ---> tensor([[ 367, 2885, 1464, 1807]])
 I HAD always --->  HAD always thought

batch: 1 input output pairs are
 tensor([[ 367, 2885, 1464, 1807]]) ---> tensor([[2885, 1464, 1807, 3619]])
  HAD always thought ---> AD always thought Jack


**Here the tokenizer splits the uppercase " HAD" into two tokens, " H" (367) and "AD" (2885), so the target of the 2nd batch starts in the middle of the word. A lowercase " had" would most likely have been kept as a single token.**

#### Let's examine the behavior with different values

In [31]:
dataloader = createDataLoader_V1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)

for i in range(0, 2):
    batch = next(data_iter)
    inputs, targets = batch
    print(f"\nbatch: {i} input output pairs are\n {inputs} ---> {targets}")
    print(f"\n{tokenizer.decode(inputs.flatten().tolist())} ---> {tokenizer.decode(targets.flatten().tolist())}")


batch: 0 input output pairs are
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]) ---> tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, --->  HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in

batch: 1 input output pairs are
 tensor([[  287,   262,  6001,   286],
   

### Embeddings

In [32]:
input_ids = torch.tensor([2, 3, 4, 5])

In [33]:
vocab_size = 6
output_dim = 3

torch.manual_seed(42)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [35]:
print(embedding_layer.weight)
print("\nThe shape of the weight matrix is: ", embedding_layer.weight.shape)

Parameter containing:
tensor([[ 1.9269,  1.4873, -0.4974],
        [ 0.4396, -0.7581,  1.0783],
        [ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516],
        [ 0.8599, -0.3097, -0.3957]], requires_grad=True)

The shape of the weight matrix is:  torch.Size([6, 3])


**let's get the embedding vector for id number 3**

In [36]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.6866,  0.6105,  1.3347]], grad_fn=<EmbeddingBackward0>)
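
The result for id 3 is simply row 3 of the weight matrix printed earlier. The sketch below illustrates this in plain Python: the `weight` list is copied from that printed output, while `lookup` and `one_hot_matmul` are hypothetical helper names introduced only for illustration. It compares a direct row lookup with multiplying a one-hot vector by the weight matrix, which is the usual way to think about what an embedding layer computes.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 tokens x 3 dims).
weight = [
    [1.9269, 1.4873, -0.4974],
    [0.4396, -0.7581, 1.0783],
    [0.8008, 1.6806, 0.3559],
    [-0.6866, 0.6105, 1.3347],
    [-0.2316, 0.0418, -0.2516],
    [0.8599, -0.3097, -0.3957],
]


def lookup(token_id):
    """Embedding as a table lookup: return the row for this token ID."""
    return weight[token_id]


def one_hot_matmul(token_id):
    """Same result via a one-hot row vector times the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weight))]
    return [
        sum(one_hot[row] * weight[row][col] for row in range(len(weight)))
        for col in range(len(weight[0]))
    ]


print(lookup(3))
print(one_hot_matmul(3))
```

Both calls return the same three numbers as the tensor above, which is why the lookup form is preferred: it avoids the multiplication entirely.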


**getting vector for all the ids we defined above**

In [37]:
# input_ids is already a tensor, so it can be passed to the layer directly
print(embedding_layer(input_ids))

tensor([[ 0.8008,  1.6806,  0.3559],
        [-0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516],
        [ 0.8599, -0.3097, -0.3957]], grad_fn=<EmbeddingBackward0>)



<div><h3>Generating Token Embeddings Similar to GPT-2</h3></div>

**First, we create an embedding layer of size 50257 × 256**
<br>

**Then we use the dataloader designed above to generate the token IDs**
<br>

**Finally, we pass those IDs to the embedding layer to get the corresponding embedding vectors**

**GPT2 embeddings**

In [42]:
vocab_size = 50257  # vocabulary size of the GPT-2 BPE tokenizer
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print("Shape of token_embedding_layer weight is:", token_embedding_layer.weight.shape)

Shape of token_embedding_layer weight is: torch.Size([50257, 256])


In [39]:
max_length = 4
dataloader = createDataLoader_V1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [40]:
print("Token IDs: \n", inputs)
print("\nInputs shape: \n", inputs.shape)

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape: 
 torch.Size([8, 4])


**We now have a batch of token IDs with shape 8 × 4. The embedding layer maps each ID to a vector of size 256, so the result has shape 8 × 4 × 256**

In [50]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


### Positional Encoding

In [51]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [52]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


### Final input embeddings

In [54]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
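
The addition works even though `token_embeddings` has shape [8, 4, 256] and `pos_embeddings` has shape [4, 256]: the positional vectors are broadcast across the batch dimension, so every sample gets the same vector added at the same position. The sketch below spells that out in plain Python with tiny made-up shapes (batch of 2, context length 3, dimension 2); all values are illustrative assumptions, not numbers from the notebook.

```python
batch_size, context_length, dim = 2, 3, 2  # toy sizes (assumption)

# token_embeddings[b][t] is the vector for the token at position t of sample b.
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]

# pos_embeddings[t] is the vector for position t, shared by every sample.
pos_embeddings = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]

# Explicit version of the broadcast add: reuse pos_embeddings[t] for each sample b.
input_embeddings = [
    [
        [tok + pos for tok, pos in zip(token_embeddings[b][t], pos_embeddings[t])]
        for t in range(context_length)
    ]
    for b in range(batch_size)
]

for sample in input_embeddings:
    print(sample)
```

The output keeps the shape of the token embeddings (here 2 × 3 × 2), matching the [8, 4, 256] result above.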
