## Reading a short story text sample into Python

Reference code from [LLMs from Scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb).

## Vector Embedding

### Using pre-trained token embeddings

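We can load a pretrained model with [gensim](https://radimrehurek.com/gensim/). Here we load [word2vec-google-news-300](https://huggingface.co/fse/word2vec-google-news-300), a set of 300-dimensional word vectors that Google has already pretrained. Because the download is large and can fail, the cell below retries a few times before giving up.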

In [None]:
import gensim.downloader as api
import time

model = None
retries = 5
delay_seconds = 5

for attempt in range(retries):
    try:
        print(f"Attempt {attempt + 1} to load word2vec-google-news-300...")
        model = api.load('word2vec-google-news-300')
        print("Model loaded successfully.")
        break  # Exit loop if successful
    except Exception as e:
        print(f"Error loading model: {e}")
        if attempt < retries - 1:
            print(f"Retrying in {delay_seconds} seconds...")
            time.sleep(delay_seconds)
        else:
            print("Max retries reached. Could not load model.")

Attempt 1 to load word2vec-google-news-300...
Retrying in 5 seconds...
Attempt 2 to load word2vec-google-news-300...
Retrying in 5 seconds...
Attempt 3 to load word2vec-google-news-300...

Demo using word2vec to be shown.

### Creating our own token embeddings

Suppose we have four input tokens with the following IDs:

In [6]:
import torch

input_ids = torch.tensor([2, 3, 5, 1])

For simplicity, we'll create an embedding layer with a vocabulary size of 6 and an embedding dimension of 3:

In [None]:
vocab_size = 6  # rows: one per token ID
output_dim = 3  # columns: one per embedding dimension

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Let's view the embedding layer's weight matrix:

In [None]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


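The weight matrix contains one row for each token in the vocabulary and one column for each embedding dimension, so here it has 6 rows and 3 columns. The values are random initial weights; `requires_grad=True` means they would be updated during training.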
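
As a quick sanity check of the row and column layout, the shape and size of the weight matrix can be inspected directly. This is a minimal sketch that rebuilds the same layer from the values used in this notebook (vocabulary size 6, embedding dimension 3, seed 123):

```python
import torch

vocab_size = 6  # rows: one per token ID
output_dim = 3  # columns: one per embedding dimension

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# The weight matrix has one row per token ID and one column per dimension.
print(embedding_layer.weight.shape)    # torch.Size([6, 3])

# Total number of trainable values: vocab_size * output_dim.
print(embedding_layer.weight.numel())  # 18
```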


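Applying the layer to a token ID returns that token's embedding vector. The layer acts as a lookup table: the token ID is the key, and the value is the corresponding row of the weight matrix. Since indexing starts at 0, token ID 3 selects the fourth row.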

In [None]:
print(embedding_layer(torch.tensor([3])))
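
To make the lookup-table idea concrete, here is a plain-Python sketch with no PyTorch involved. It assumes the weight values printed earlier (rounded to four decimals), and `lookup` is a hypothetical helper written only for illustration:

```python
# Weight values copied from the printed embedding matrix (rounded to 4 decimals).
weights = [
    [ 0.3374, -0.1778, -0.1690],  # token ID 0
    [ 0.9178,  1.5810,  1.3010],  # token ID 1
    [ 1.2753, -0.2010, -0.1606],  # token ID 2
    [-0.4015,  0.9666, -1.1481],  # token ID 3
    [-1.1589,  0.3255, -0.6315],  # token ID 4
    [-2.8400, -0.7849, -1.4096],  # token ID 5
]


def lookup(token_id):
    """Hypothetical helper: return the embedding row for a single token ID."""
    return weights[token_id]


print(lookup(3))  # [-0.4015, 0.9666, -1.1481]
```

The result is the fourth row of the matrix, which is what the embedding layer returns for token ID 3 (up to rounding).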

Now let's retrieve the 3-dimensional embedding vectors for all four token IDs in `input_ids`:

In [None]:
print(embedding_layer(input_ids))
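
The result has one row per input token. The sketch below, which rebuilds the layer with the same seed and sizes as above, checks two ways of reading this operation: as row indexing into the weight matrix, and as a one-hot encoding multiplied by the weight matrix. The variable names `same_as_indexing` and `same_as_matmul` are introduced here only for illustration:

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

embedded = embedding_layer(input_ids)
print(embedded.shape)  # torch.Size([4, 3]): 4 tokens, 3 dimensions each

# View 1: row i of the result is the weight row selected by input_ids[i].
same_as_indexing = torch.equal(embedded, embedding_layer.weight[input_ids])

# View 2: one-hot encode the IDs (shape 4 x 6) and multiply by the weights (6 x 3).
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=6).float()
same_as_matmul = torch.allclose(embedded, one_hot @ embedding_layer.weight)

print(same_as_indexing, same_as_matmul)  # True True
```

The embedding layer can therefore be thought of as a shortcut for the one-hot matrix product that skips building the one-hot vectors.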