## Understanding Word Embeddings

Word embeddings convert text into numeric vectors so that it is represented in a form a computer can work with.

Word2Vec is a well-known approach for generating word embeddings; pretrained Word2Vec vectors are also available.

An interesting fact about embeddings is that more advanced models store text embeddings in a higher-dimensional space, which gives them more room to represent differences between pieces of text.
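As a rough illustration of what "dimension" means here, the sketch below builds two toy lookup tables that map token IDs to vectors. Everything in it is an assumption for illustration: the helper name `make_embedding_table`, the vocabulary size, and the random values (real models learn these vectors during training).

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

def make_embedding_table(vocab_size, embedding_dim):
    """Return one random vector of length embedding_dim per token ID (hypothetical helper)."""
    return [
        [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]

# Same vocabulary size, different embedding dimensions.
small_table = make_embedding_table(vocab_size=6, embedding_dim=3)
large_table = make_embedding_table(vocab_size=6, embedding_dim=8)

token_id = 2
print(len(small_table[token_id]))  # 3 numbers describe this token
print(len(large_table[token_id]))  # 8 numbers describe the same token
```

The embedding dimension is a design choice that is independent of the vocabulary size: each token simply gets a longer or shorter vector.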

## Tokenizing Text


Here is an example of opening a text file and accessing its contents by character.

There are basic ways to split text into tokens, and there is the more complete pattern below, which handles common punctuation as well as whitespace. For example:
```python
first_pass = re.split(r'([,.:;?_!"()\s]|--)', test_text)
second_pass = re.split(r'([,.:;?_!"()\s]|--)', text[0:100])
```

In [None]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
  text = f.read()

print(len(text), text[0:100], '\n')


# More advanced split: punctuation, whitespace and "--" become separate pieces
import re

test_text = "Hello, world. This is a test."

first_pass = re.split(r'([,.:;?_!"()\s]|--)', test_text)
second_pass = re.split(r'([,.:;?_!"()\s]|--)', text[0:100])
preprocessed = re.split(r'([,.:;?_!"()\s]|--)', text)
# Drop empty strings and whitespace-only pieces
example_1 = [item.strip() for item in first_pass if item.strip()]
example_2 = [item.strip() for item in second_pass if item.strip()]
preprocessed = [item.strip() for item in preprocessed if item.strip()]


20479 I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g 
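To make the effect of the split pattern concrete, here is a small self-contained sketch run on the short test sentence rather than the book text. The names `pieces` and `tokens` are only illustrative.

```python
import re

test_text = "Hello, world. This is a test."

# The capture group keeps the delimiters (punctuation, whitespace, "--") in the result.
pieces = re.split(r'([,.:;?_!"()\s]|--)', test_text)

# Drop empty strings and whitespace-only pieces.
tokens = [item.strip() for item in pieces if item.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

Punctuation survives as separate tokens, while the whitespace is discarded by the filtering step.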



### Converting Tokens to Token IDs

The first line creates an alphabetically sorted list of the unique tokens (no repeats). The second line appends two special tokens, `<|endoftext|>` and `<|unk|>`.

The third, more mystical line is just a quick way to build a dictionary that maps each token to an index. The index holds no correlation to where the token occurs in the actual text; it is just the vocabulary.

In [None]:
alphabetical_words = sorted(set(preprocessed))
alphabetical_words.extend(["<|endoftext|>", "<|unk|>"])
vocabulary = {token:integer for integer, token in enumerate(alphabetical_words)}
print(vocabulary)

{'!': 0, '"': 1, "'": 2, "'Are": 3, "'It's": 4, "'coming'": 5, "'done'": 6, "'subject": 7, "'technique'": 8, "'way": 9, '(': 10, ')': 11, ',': 12, '--': 13, '.': 14, ':': 15, ';': 16, '?': 17, 'A': 18, 'Ah': 19, 'Among': 20, 'And': 21, 'Arrt': 22, 'As': 23, 'At': 24, 'Be': 25, 'Begin': 26, 'Burlington': 27, 'But': 28, 'By': 29, 'Carlo': 30, 'Chicago': 31, 'Claude': 32, 'Come': 33, 'Croft': 34, 'Destroyed': 35, 'Devonshire': 36, "Don't": 37, 'Dubarry': 38, 'Emperors': 39, 'Florence': 40, 'For': 41, 'Gallery': 42, 'Gideon': 43, 'Gisburn': 44, "Gisburn's": 45, 'Gisburns': 46, 'Grafton': 47, 'Greek': 48, 'Grindle': 49, "Grindle's": 50, 'Grindles': 51, 'HAD': 52, 'Had': 53, 'Hang': 54, 'Has': 55, 'He': 56, 'Her': 57, 'Hermia': 58, "Hermia's": 59, 'His': 60, 'How': 61, 'I': 62, "I'd": 63, "I'll": 64, "I've": 65, 'If': 66, 'In': 67, 'It': 68, "It's": 69, 'Jack': 70, "Jack's": 71, 'Jove': 72, 'Just': 73, 'Lord': 74, 'Made': 75, 'Miss': 76, "Money's": 77, 'Monte': 78, 'Moon-dancers': 79, 'Mr': 
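The same steps can be followed on a toy token list where the whole result fits on one line. The token list and the names `toy_vocabulary` and `inverse_vocabulary` are made up for this sketch.

```python
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# sorted(set(...)) removes repeats and orders the tokens alphabetically.
unique_tokens = sorted(set(toy_tokens))

# Map each token to its position in the sorted list.
toy_vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# The reverse mapping is what a decode step needs later on.
inverse_vocabulary = {index: token for token, index in toy_vocabulary.items()}

print(toy_vocabulary)
# {'.': 0, 'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
```

Note that `the` appears twice in the input but only once in the vocabulary, and that its ID (5) says nothing about its position in the sentence.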

#### Creating Tokenizer Class (So that we can support backwards operations)

In other words, if we have an operation that converts text into token IDs (`encode`), we should also have the backward operation that converts token IDs back into text (`decode`). The class has built-in functionality for everything we've been doing.

In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocabulary):
        self.str_to_int = vocabulary
        self.int_to_str = {i:s for s,i in vocabulary.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\s]|--)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text



In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
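A compact round trip shows what the `<|unk|>` fallback does in practice. This is a minimal function-based sketch, not the classes above: the toy vocabulary and the helper names `encode` and `decode` are assumptions, and only the two regular expressions are taken from `SimpleTokenizerV2`.

```python
import re

# Hypothetical toy vocabulary; the real one is built from the book text.
toy_tokens = sorted({"Hello", ",", "world", ".", "This", "is", "a", "test"})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
str_to_int = {token: index for index, token in enumerate(toy_tokens)}
int_to_str = {index: token for token, index in str_to_int.items()}

def encode(text):
    """Split text into tokens and map each one to an ID, falling back to <|unk|>."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    pieces = [piece.strip() for piece in pieces if piece.strip()]
    return [str_to_int.get(piece, str_to_int["<|unk|>"]) for piece in pieces]

def decode(ids):
    """Map IDs back to tokens and remove the space before punctuation."""
    text = " ".join(int_to_str[i] for i in ids)
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

round_trip = decode(encode("Hello, world. This is a quiz."))
print(round_trip)
# Hello, world. This is a <|unk|>.
```

The word `quiz` is not in the toy vocabulary, so it cannot be recovered after decoding. That information loss is the price of the `<|unk|>` fallback.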

#### Tiktoken tokenizer

Much more widespread than a custom tokenizer. It handles more cases and is probably more useful since it is more built out.

In [None]:

!pip install tiktoken==0.7.0
from importlib.metadata import version
import tiktoken

version("tiktoken")




'0.7.0'

In [None]:
tiktoken_tokenizer = tiktoken.get_encoding("gpt2")
example_text = "hello mother"
encoded_data = tiktoken_tokenizer.encode(example_text, allowed_special={'<|endoftext|>'})
decoded_data = tiktoken_tokenizer.decode(encoded_data)
print(encoded_data, decoded_data)

[31373, 2802] hello mother


#### Example Encode / Decode (Lame custom classes)

It is important to remember that we pass raw text into `encode`. It seems strange that we have to pass a preprocessed vocabulary to the tokenizer first, but it is what it is.



In [None]:
tokenizer = SimpleTokenizerV1(vocabulary)
ids = tokenizer.encode(text)
print(ids, '\n')
print(tokenizer.decode(ids))

tokenizer_2 = SimpleTokenizerV2(vocabulary)
ids = tokenizer_2.encode(text)
print(ids, '\n')



[62, 52, 167, 1024, 70, 44, 840, 133, 274, 503, 13, 1023, 133, 517, 452, 409, 13, 929, 606, 1097, 731, 525, 982, 1037, 685, 1037, 554, 1006, 12, 589, 1007, 557, 744, 568, 513, 12, 550, 531, 387, 568, 770, 12, 683, 133, 862, 1123, 12, 175, 414, 566, 589, 133, 1086, 749, 1007, 98, 14, 10, 116, 62, 840, 1024, 606, 1142, 547, 226, 99, 756, 40, 14, 11, 1, 110, 557, 744, 568, 513, 1, 13, 1006, 1097, 1110, 1007, 1134, 260, 606, 14, 62, 262, 554, 81, 14, 43, 117, 13, 568, 624, 31, 918, 13, 343, 568, 1063, 134, 14, 1, 87, 315, 607, 515, 1037, 886, 1007, 1080, 744, 719, 791, 9, 1072, 16, 257, 62, 376, 1019, 744, 1006, 12, 80, 14, 97, 13, 1007, 668, 1037, 22, 605, 163, 62, 1019, 744, 14, 1, 110, 1138, 12, 749, 81, 14, 118, 656, 12, 715, 608, 132, 868, 132, 195, 1023, 1014, 1109, 848, 589, 174, 406, 1089, 744, 699, 14, 21, 606, 1097, 733, 753, 1007, 81, 14, 119, 1118, 710, 14, 53, 733, 1007, 427, 58, 34, 12, 198, 1007, 624, 47, 42, 903, 12, 950, 685, 227, 45, 1, 79, 1, 1037, 876, 12, 1130, 996, 58

## Data Sampling with Sliding Window

This is the main logic behind next-word prediction tasks. We use a context size, i.e. how many tokens of context we want to give the model.

In [None]:
enc_text = tiktoken_tokenizer.encode(text)
enc_sample = enc_text[50:]


context_size = 4
context_window = enc_sample[:context_size]
prediction_window = enc_sample[1:context_size+1]

for i in range(1, context_size+1):
  enc_context = enc_sample[:i]
  enc_prediction = enc_sample[i]
  print(enc_context, "---->", enc_prediction, '\n')
  print(tiktoken_tokenizer.decode(enc_context), "---->", tiktoken_tokenizer.decode([enc_prediction]), '\n')

[290] ----> 4920 

 and ---->  established 

[290, 4920] ----> 2241 

 and established ---->  himself 

[290, 4920, 2241] ----> 287 

 and established himself ---->  in 

[290, 4920, 2241, 287] ----> 257 

 and established himself in ---->  a 
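The cell above defines `context_window` and `prediction_window` but never prints them. The sketch below shows the same idea with the five token IDs from the printed output hard-coded, so it runs without the tokenizer; the names `inputs` and `targets` are illustrative.

```python
# Token IDs taken from the printed output above.
enc_sample = [290, 4920, 2241, 287, 257]
context_size = 4

inputs = enc_sample[:context_size]        # what the model sees
targets = enc_sample[1:context_size + 1]  # the same window shifted by one token

print("inputs: ", inputs)   # [290, 4920, 2241, 287]
print("targets:", targets)  # [4920, 2241, 287, 257]

# Each target is the token that follows the input at the same position.
for position in range(context_size):
    assert targets[position] == enc_sample[position + 1]
```

One input/target pair of length `context_size` therefore contains `context_size` separate next-token prediction examples.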



## Creating Dataset / Dataloaders

Important functionality to practice, might delete from my codebase and replicate for practice un

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import tiktoken

# Initialize the tokenizer
byte_tokenizer = tiktoken.get_encoding("gpt2")

class GPTDatasetV1(Dataset):
  '''
  input: raw text, tokenizer selection, length of
  '''
  def __init__(self, txt, tokenizer, max_length, stride):
      self.input_ids = []
      self.target_ids = []

      # Tokenize the entire text
      token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
      assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

      # Use a sliding window to chunk the book into overlapping sequences of max_length
      for i in range(0, len(token_ids) - max_length, stride):
          input_chunk = token_ids[i:i + max_length]
          target_chunk = token_ids[i + 1: i + max_length + 1]
          self.input_ids.append(torch.tensor(input_chunk))
          self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
      return len(self.input_ids)

  def __getitem__(self, idx):
      return self.input_ids[idx], self.target_ids[idx]
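To see how `max_length` and `stride` interact, the slicing loop can be tried on plain integers without PyTorch or a tokenizer. The helper `sliding_windows` is a hypothetical stand-in that only mirrors the loop in `GPTDatasetV1.__init__`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs sliced the same way as the loop in GPTDatasetV1."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start:start + max_length]
        target_chunk = token_ids[start + 1:start + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: consecutive inputs do not overlap.
no_overlap = sliding_windows(toy_ids, max_length=4, stride=4)
# stride == 1: each window starts one token after the previous one.
full_overlap = sliding_windows(toy_ids, max_length=4, stride=1)

print(no_overlap)
# [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
print(len(full_overlap))  # 6
```

A smaller stride yields more (overlapping) training examples from the same text; a stride equal to `max_length` uses each token as an input exactly once.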



`max_length` is how many tokens you want per example; for instance, 3 gives something like `['the', 'cat', 'is']`. `batch_size` is how many such examples are grouped into one batch. 128 refers to the

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = byte_tokenizer

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader




This loads the data for training. Each input sequence is paired with its corresponding target sequence, which is the same window shifted one token ahead.

In [None]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False  # stride=1: each window starts one token after the previous one
)

dataloader_large = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False  # stride=4 equals max_length, so windows do not overlap
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
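The difference between `batch_size` and `max_length` is easy to mix up, so here is a plain-Python stand-in for the grouping step. It is not how `DataLoader` is implemented; the helper `make_batches` is hypothetical and only mirrors the effect of `batch_size` and `drop_last` when shuffling is off.

```python
def make_batches(examples, batch_size, drop_last=True):
    """Group examples into consecutive batches, optionally dropping a short final batch."""
    batches = [
        examples[start:start + batch_size]
        for start in range(0, len(examples), batch_size)
    ]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

# Five examples, each max_length=4 tokens long (toy values).
examples = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
    [16, 17, 18, 19],
]

kept = make_batches(examples, batch_size=2, drop_last=True)
everything = make_batches(examples, batch_size=2, drop_last=False)

print(len(kept), len(everything))  # 2 3
print(kept[0])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each batch holds `batch_size` rows and each row holds `max_length` tokens, which matches the shape of the tensors printed above (one row of four tokens for `batch_size=1`).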


## Create Token Embeddings