## Reading in a short story text sample into Python

Reference code from [LLMs from scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb).

## Vector Embedding

### Using pre-trained token embeddings

Import a pretrained model using [gensim](https://radimrehurek.com/gensim/), then load Google's [word2vec-google-news-300](https://huggingface.co/fse/word2vec-google-news-300) vectors, which have already been trained.

In [3]:
"""
Downloading the model directly via api.load('word2vec-google-news-300') failed after a number of attempts,
so I downloaded it manually from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing into the local ./data folder.
"""

# import gensim.downloader as api
# import time

# model = None
# retries = 5
# delay_seconds = 5

# for attempt in range(retries):
#     try:
#         print(f"Attempt {attempt + 1} to load word2vec-google-news-300...")
#         model = api.load('word2vec-google-news-300')
#         print("Model loaded successfully.")
#         break  # Exit loop if successful
#     except Exception as e:
#         print(f"Error loading model: {e}")
#         if attempt < retries - 1:
#             print(f"Retrying in {delay_seconds} seconds...")
#             time.sleep(delay_seconds)
#         else:
#             print("Max retries reached. Could not load model.")

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)

We can get the vector representation of the word `dog` from the Google News model:

In [None]:
dog = model.get_vector(key="dog")

print(dog)

[ 5.12695312e-02 -2.23388672e-02 -1.72851562e-01  1.61132812e-01
 -8.44726562e-02  5.73730469e-02  5.85937500e-02 -8.25195312e-02
 -1.53808594e-02 -6.34765625e-02  1.79687500e-01 -4.23828125e-01
 -2.25830078e-02 -1.66015625e-01 -2.51464844e-02  1.07421875e-01
 -1.99218750e-01  1.59179688e-01 -1.87500000e-01 -1.20117188e-01
  1.55273438e-01 -9.91210938e-02  1.42578125e-01 -1.64062500e-01
 -8.93554688e-02  2.00195312e-01 -1.49414062e-01  3.20312500e-01
  3.28125000e-01  2.44140625e-02 -9.71679688e-02 -8.20312500e-02
 -3.63769531e-02 -8.59375000e-02 -9.86328125e-02  7.78198242e-03
 -1.34277344e-02  5.27343750e-02  1.48437500e-01  3.33984375e-01
  1.66015625e-02 -2.12890625e-01 -1.50756836e-02  5.24902344e-02
 -1.07421875e-01 -8.88671875e-02  2.49023438e-01 -7.03125000e-02
 -1.59912109e-02  7.56835938e-02 -7.03125000e-02  1.19140625e-01
  2.29492188e-01  1.41601562e-02  1.15234375e-01  7.50732422e-03
  2.75390625e-01 -2.44140625e-01  2.96875000e-01  3.49121094e-02
  2.42187500e-01  1.35742

To get the number of dimensions the model was trained with (300, as the model name suggests), we check the vector's `shape`:

In [None]:
dog.shape

Let's try to find similar words.

If we compute `king + woman - man` as a vector operation, the words closest to the result should include `queen`.

In [None]:
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=10))

[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593831062317), ('monarchy', 0.5087411999702454)]
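
To make the arithmetic behind this query concrete, here is a minimal sketch using hand-picked toy vectors. The 3-dimensional vectors and the `cosine_similarity` helper are assumptions made up for illustration; they are not taken from the Google News model, and this is not how gensim implements `most_similar` internally.

```python
import math

# Hypothetical toy vectors, chosen by hand for illustration only.
# The dimensions loosely stand for: [royalty, maleness, femaleness]
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king + woman - man, computed element by element
query = [
    k + w - m
    for k, w, m in zip(toy_vectors["king"], toy_vectors["woman"], toy_vectors["man"])
]

# Rank the remaining words by similarity to the query vector,
# leaving out the words that were used to build the query.
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "woman", "man"}}
ranked = sorted(candidates, key=lambda w: cosine_similarity(query, candidates[w]), reverse=True)
best = ranked[0]

print(best)
```

With these toy values the query vector lands closest to `queen`, which mirrors the result the real model returns above.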


### Creating our own token embeddings

Suppose we have input tokens with the ids below

In [4]:
import torch

input_ids = torch.tensor([2, 3, 5, 1])

For simplicity, we'll create a token embedding with a vocabulary size of 6 and an embedding dimension of 3.

In [5]:
vocab_size = 6  # rows
output_dim = 3  # columns

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Let's view the embedding layer's weight matrix:

In [None]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The embedding layer contains a row for each token in the vocabulary and a column for each embedding dimension.


We can pass a token id to the layer to obtain its embedding vector. The layer acts as a lookup table: the token id is the key, and the value is the row of the weight matrix at that index.

In [None]:
print(embedding_layer(torch.tensor([3])))

Let's now retrieve 3D vector embedding representations of multiple token ids

In [None]:
print(embedding_layer(input_ids))
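
The lookup behaviour described above can be checked directly. The sketch below rebuilds the same small layer and compares three ways of getting the vector for token id 3: calling the layer, indexing the weight matrix, and multiplying a one-hot vector by the weight matrix. The variable names are chosen for illustration only.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3  # same sizes as the small example above
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

token_id = 3

# 1. Call the layer with the token id
via_layer = embedding_layer(torch.tensor([token_id]))  # shape (1, 3)

# 2. Index the weight matrix row directly
via_index = embedding_layer.weight[token_id]  # shape (3,)

# 3. Multiply a one-hot row vector by the weight matrix
one_hot = torch.nn.functional.one_hot(
    torch.tensor([token_id]), num_classes=vocab_size
).float()
via_matmul = one_hot @ embedding_layer.weight  # shape (1, 3)

rows_match = torch.equal(via_layer[0], via_index)
matmul_match = torch.allclose(via_layer, via_matmul)
print(rows_match, matmul_match)
```

All three give the same vector, but the lookup avoids building the one-hot vector and doing a full matrix multiplication.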

## Positional Encoding/Embedding (Encoding Word Positions)

We'll now use an embedding size of 256 (columns) and a vocabulary of 50257 token ids (rows), which matches the GPT-2 tokenizer used below.

In [None]:
vocab_size = 50257
output_dim = 256

# randomly initialised layer
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Next we'll sample data from a DataLoader with a batch size of 8, so each batch holds 8 sequences. During training, this means parameters would be updated once per batch, that is, after every 8 sequences.

In [14]:
## Data Loader from Previous example
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    # stride determines how much we slide.
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        #Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            output_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(output_chunk))

    def __len__(self):
        return len(self.input_ids)

    # Implemented for PyTorch Dataloader to use when loading into the Dataloader
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True, 
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )

    return dataloader

## Import the raw text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Now we instantiate the data loader, which samples data with a sliding window.

In [16]:
context_length = 4

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=context_length, 
    stride=context_length, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs", inputs) 
print("Inputs Shape", inputs.shape)

Token IDs tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Inputs Shape torch.Size([8, 4])
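
To see what the sliding window does without the tokenizer, here is a small pure-Python sketch of the same loop. The `sliding_windows` helper and the `toy_ids` list are hypothetical and only mirror the indexing used in `GPTDatasetV1`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks, with targets shifted one token to the right."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

# Hypothetical token ids 10..19 standing in for a tokenized text
toy_ids = list(range(10, 20))

# stride == max_length: the input chunks do not overlap
non_overlapping = sliding_windows(toy_ids, max_length=4, stride=4)
for input_chunk, target_chunk in non_overlapping:
    print(input_chunk, "->", target_chunk)

# stride == 1: each chunk starts one token after the previous one
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
print(len(overlapping))
```

Setting `stride` equal to `max_length`, as in the cell above, means every token appears in exactly one input chunk. A smaller stride produces more, overlapping samples from the same text.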


In [26]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


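As seen from the shape above, each token is now a 256-dimensional vector.

For GPT's absolute position embedding approach, we just need another embedding layer with one row per position in the context (`context_length` rows) and the same `output_dim` as the token embeddings.

The position embeddings depend only on a token's position, not on the token itself, so we only need to look them up once. The same position embedding vectors are then reused for every sequence in the batch.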

In [None]:
context_length = 4
output_dim = 256

# randomly initialised layer
position_embedding_layer = torch.nn.Embedding(context_length, output_dim)
position_embeddings = position_embedding_layer(torch.arange(context_length))

print(position_embeddings.shape)

torch.Size([4, 256])


Add the token embeddings to the position embeddings, as the absolute position embedding approach requires. The 8x4x256 tensor can be added to the 4x256 matrix through broadcasting, which reuses the 4x256 matrix for each of the 8 sequences in the batch.

In [31]:
final_vector_embedding_output = token_embeddings + position_embeddings
print(final_vector_embedding_output[0])

tensor([[-3.1951, -0.5051,  1.0742,  ...,  2.0425, -0.7197,  0.9533],
        [ 0.2950, -0.5241,  0.9449,  ...,  0.0627,  0.6589, -2.6676],
        [-0.3345,  0.0423,  0.3279,  ...,  1.9987,  1.1475, -0.8502],
        [-2.4531,  0.4572,  0.6748,  ...,  1.8094, -0.2438, -1.3124]],
       grad_fn=<SelectBackward0>)
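
The broadcasting step can be verified with a short sketch. Random tensors stand in for the real embeddings here (an assumption made to keep the example self-contained), and the broadcast sum is compared against an explicitly expanded copy of the position embeddings.

```python
import torch

torch.manual_seed(123)
batch_size, context_length, output_dim = 8, 4, 256

# Random stand-ins for the token and position embeddings above
token_embeddings = torch.randn(batch_size, context_length, output_dim)
position_embeddings = torch.randn(context_length, output_dim)

# Broadcasting: the (4, 256) tensor is treated as (1, 4, 256) and reused per sequence
broadcast_sum = token_embeddings + position_embeddings

# The same addition with the batch dimension made explicit
expanded_positions = position_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
explicit_sum = token_embeddings + expanded_positions

same = torch.equal(broadcast_sum, explicit_sum)
print(broadcast_sum.shape, same)
```

Both sums are identical, so broadcasting saves us from creating the 8 copies of the position embeddings by hand.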
