## Step 1: Creating Tokens

In [42]:
# Read the short story "The Verdict" from disk
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Print the total number of characters and the first 1000 characters
print("The total number of characters: ", len(raw_text))
print(raw_text[:1000])

# The goal is to tokenize the book's text into individual words and special characters, which we will later turn into embeddings for LLM training

# Because of hardware limitations, we train on a single book, which is enough to illustrate the concepts

The total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon i

In [43]:
# How can we best split the text to obtain a list of tokens?
# We can use regular expressions to split the text into words and special characters,
# based on whitespace and punctuation marks


import re
text = "Hello, world! This is a test."
#Split wherever white spaces are found
result = re.split(r'(\s)', text)
print(result)

# Extend the pattern so that it splits on punctuation marks as well as whitespace: [\s.,!?:;]
result = re.split(r'([\s.,!?:;])', text)
print(result)

# The list still contains whitespace and empty strings; we can filter them out
# token.strip() is truthy only when the token contains at least one non-whitespace character
result = [token for token in result if token.strip()]
print(result)

# Remove whitespace or not?
''' When developing a tokenizer, whether to encode whitespace as separate tokens depends on the application requirements.
Removing whitespace reduces the memory and computing requirements.
However, keeping whitespace can be useful when training models that are sensitive to the exact structure of the text (for example, Python code,
which is sensitive to indentation and spacing). In such cases it makes sense to keep the whitespace; the examples in this notebook remove it.'''

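# Extend the pattern to cover more of the punctuation found in the text
# Note: inside a character class the hyphens are matched one at a time, so "--" comes out as two '-' tokens
# Two lines of code: split the text, then strip each token and drop the empty strings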
text = "Hello, world! Is This-- a test?"
result = re.split(r'([\s.,!?:;()--])', text)
result = [token.strip() for token in result if token.strip()]
print(result)

# result is already stripped and filtered, so filtering again leaves it unchanged
result = [token for token in result if token.strip()]
print(result)

# For building LLMs, a different tokenization scheme called byte pair encoding (BPE) is used.




['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']
['Hello', ',', '', ' ', 'world', '!', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
['Hello', ',', 'world', '!', 'Is', 'This', '-', '-', 'a', 'test', '?']
['Hello', ',', 'world', '!', 'Is', 'This', '-', '-', 'a', 'test', '?']
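
The two `'-'` tokens in the output above come from the character class, which matches one hyphen at a time. A minimal sketch of the difference, assuming the illustrative names `char_class_pattern`, `alternation_pattern`, and `tokenize` (none of them are from the original cell): listing `--` as its own alternative keeps the double dash together as a single token.

```python
import re

text = "Hello, world! Is This-- a test?"

# Character class: every hyphen is a delimiter on its own (hyphen escaped for clarity)
char_class_pattern = r'([\s.,!?:;()\-])'
# Alternation: "--" is listed as a separate alternative and matched as one unit
alternation_pattern = r'([,.:;?_!"()\']|--|\s)'


def tokenize(pattern, text):
    """Split on the pattern, keep the delimiters, drop whitespace and empty strings."""
    return [tok.strip() for tok in re.split(pattern, text) if tok.strip()]


char_class_tokens = tokenize(char_class_pattern, text)
alternation_tokens = tokenize(alternation_pattern, text)

print(char_class_tokens)
print(alternation_tokens)
```

The alternation form is the one applied to the full text in the next cell.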


In [44]:
# Apply the tokenization to the entire raw text
preprocessed_text = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed_text = [token.strip() for token in preprocessed_text if token.strip()]
print(preprocessed_text[:100])
print(len(preprocessed_text))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--']
4690


## Step 2: Creating Token IDs

In [45]:
# Create a sorted list of all unique tokens to determine the vocabulary size

all_words = sorted(set(preprocessed_text))
vocab_size = len(all_words)
print(vocab_size)

1130


In [46]:
# After determining the vocabulary size, we assign a unique integer to each token in the vocabulary using a dictionary
# This dictionary maps each token to its integer ID; the reverse mapping is built in a later cell
word_to_id = {word: i for i, word in enumerate(all_words)}
print(word_to_id)

for i, item in enumerate(word_to_id.items()):
    print(item)
    if i >= 50:
        break

# The dictionary now contains the individual tokens, each associated with a unique integer label
# This mapping is used for encoding; later we need a decoder that converts token IDs back into words (the reverse mapping)


{'!': 0, '"': 1, "'": 2, '(': 3, ')': 4, ',': 5, '--': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'Ah': 12, 'Among': 13, 'And': 14, 'Are': 15, 'Arrt': 16, 'As': 17, 'At': 18, 'Be': 19, 'Begin': 20, 'Burlington': 21, 'But': 22, 'By': 23, 'Carlo': 24, 'Chicago': 25, 'Claude': 26, 'Come': 27, 'Croft': 28, 'Destroyed': 29, 'Devonshire': 30, 'Don': 31, 'Dubarry': 32, 'Emperors': 33, 'Florence': 34, 'For': 35, 'Gallery': 36, 'Gideon': 37, 'Gisburn': 38, 'Gisburns': 39, 'Grafton': 40, 'Greek': 41, 'Grindle': 42, 'Grindles': 43, 'HAD': 44, 'Had': 45, 'Hang': 46, 'Has': 47, 'He': 48, 'Her': 49, 'Hermia': 50, 'His': 51, 'How': 52, 'I': 53, 'If': 54, 'In': 55, 'It': 56, 'Jack': 57, 'Jove': 58, 'Just': 59, 'Lord': 60, 'Made': 61, 'Miss': 62, 'Money': 63, 'Monte': 64, 'Moon-dancers': 65, 'Mr': 66, 'Mrs': 67, 'My': 68, 'Never': 69, 'No': 70, 'Now': 71, 'Nutley': 72, 'Of': 73, 'Oh': 74, 'On': 75, 'Once': 76, 'Only': 77, 'Or': 78, 'Perhaps': 79, 'Poor': 80, 'Professional': 81, 'Renaissance': 82, 'Ri

In [47]:
# For the decoder, we create the reverse mapping of the dictionary, which maps token IDs back to the corresponding text tokens
id_to_word = {i: word for word, i in word_to_id.items()}
print(id_to_word)


{0: '!', 1: '"', 2: "'", 3: '(', 4: ')', 5: ',', 6: '--', 7: '.', 8: ':', 9: ';', 10: '?', 11: 'A', 12: 'Ah', 13: 'Among', 14: 'And', 15: 'Are', 16: 'Arrt', 17: 'As', 18: 'At', 19: 'Be', 20: 'Begin', 21: 'Burlington', 22: 'But', 23: 'By', 24: 'Carlo', 25: 'Chicago', 26: 'Claude', 27: 'Come', 28: 'Croft', 29: 'Destroyed', 30: 'Devonshire', 31: 'Don', 32: 'Dubarry', 33: 'Emperors', 34: 'Florence', 35: 'For', 36: 'Gallery', 37: 'Gideon', 38: 'Gisburn', 39: 'Gisburns', 40: 'Grafton', 41: 'Greek', 42: 'Grindle', 43: 'Grindles', 44: 'HAD', 45: 'Had', 46: 'Hang', 47: 'Has', 48: 'He', 49: 'Her', 50: 'Hermia', 51: 'His', 52: 'How', 53: 'I', 54: 'If', 55: 'In', 56: 'It', 57: 'Jack', 58: 'Jove', 59: 'Just', 60: 'Lord', 61: 'Made', 62: 'Miss', 63: 'Money', 64: 'Monte', 65: 'Moon-dancers', 66: 'Mr', 67: 'Mrs', 68: 'My', 69: 'Never', 70: 'No', 71: 'Now', 72: 'Nutley', 73: 'Of', 74: 'Oh', 75: 'On', 76: 'Once', 77: 'Only', 78: 'Or', 79: 'Perhaps', 80: 'Poor', 81: 'Professional', 82: 'Renaissance', 83:
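
To see the two mappings work together on something small enough to read at a glance, here is a minimal sketch on a hand-written token list. The toy tokens and the names `vocab` and `inverse_vocab` are assumptions made for illustration; the notebook itself uses `word_to_id` and `id_to_word`.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Vocabulary: sorted unique tokens mapped to consecutive integer IDs
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
# Inverse vocabulary: integer IDs mapped back to tokens
inverse_vocab = {idx: token for token, idx in vocab.items()}

ids = [vocab[token] for token in tokens]        # encode
restored = [inverse_vocab[i] for i in ids]      # decode

print(vocab)
print(ids)
print(restored == tokens)
```

Because every token in the list is in the vocabulary, decoding the IDs gives back the original token list.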

In [48]:
# Implement a complete tokenizer class that handles tokenization, encoding, and decoding of text
"""
This class has an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs.

In addition, we implement a decode method that carries out the reverse mapping from token IDs back to text tokens.

Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
Step 2: Create an inverse vocabulary for the reverse mapping to the original text tokens
Step 3: Process input text into token IDs
Step 4: Convert token IDs back to text
Step 5: Remove spaces before the specified punctuation marks
"""

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_id = vocab
        self.id_to_str = {i: word for word, i in vocab.items()}
        
    def encode(self, text):
        preprocessed = re.split(r'([.,!?:;"()\']|--|\s)', text)
        
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_id[token] for token in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.id_to_str[i] for i in ids])
        #Replacing spaces before the specified punctuation marks
        text = re.sub(r'\s([.,!?:;"()\'])', r'\1', text)
        return text
        
    

In [49]:
"""For testing purpose, instantiate a new tokenizer object from SimpleTokenizerV1 class and encode and decode a sample text"""
tokenizer = SimpleTokenizerV1(word_to_id)
sample_text = """"It's the last he painted, you know," 
                    Mrs. Gisburn said with pardonable pride."""
encoded_ids = tokenizer.encode(sample_text)
print(encoded_ids)
# This block of code prints the following token IDs.

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [50]:
#Converting the token IDs back to the original text
decoded_text = tokenizer.decode(encoded_ids)
print(decoded_text)

# The round trip works: the tokenizer encodes and decodes a snippet from the training text, although the spacing around quotes and apostrophes differs from the original




" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [51]:
# What if the text contains a word that is not in the vocabulary?

text = "Hello, do you like tea?"
print(tokenizer.encode(text))

# The problem is that the word "Hello" is not in the vocabulary, so the dictionary lookup raises a KeyError.
# This highlights the need for large and diverse training sets that extend the vocabulary when working with LLMs
# It can also be handled by adding a special token for unknown words

KeyError: 'Hello'

## Adding Special Context Tokens

In the previous code, we implemented a simple tokenizer and applied it to a passage from the training set.

Now we will modify the tokenizer to handle unknown words.

In particular, we will modify the vocabulary and the tokenizer from the previous section, SimpleTokenizerV1, to create SimpleTokenizerV2, which supports two new tokens, <|unk|> and <|endoftext|>.

In [None]:
# We can modify the tokenizer to use an <|unk|> token whenever it encounters a word that is not in the vocabulary
# Furthermore, we add a token between unrelated texts to separate them
# For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source


# Now we modify the vocabulary to include the two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words created in the previous section

all_tokens = sorted(set(preprocessed_text))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {word: i for i, word in enumerate(all_tokens)}
print(len(vocab.items()))

# Before adding the two tokens, the vocabulary had 1130 entries, two fewer than now

1132


In [None]:
#As an additional quick check, printing the last 5 entries of the updated vocabulary
for i, item in enumerate(vocab.items()):
    if i >= len(vocab) - 5:
        print(item)
        
# The two special tokens now sit at the end of the vocabulary; next, the tokenizer itself must map unknown words to <|unk|>

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


# Step 1: Replace unknown words by <|unk|> tokens
# Step 2: Replace spaces before the specified punctuations

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_id = vocab
        self.id_to_str = {i: word for word, i in vocab.items()}
        
    def encode(self, text):
        preprocessed = re.split(r'([.,!?:;"()\']|--|\s)', text)
        
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_id 
               else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_id[token] for token in preprocessed]
        
        return ids
    
    def decode(self, ids):
        text = " ".join([self.id_to_str[i] for i in ids])
        #Replacing spaces before the specified punctuation marks
        text = re.sub(r'\s+([.,!?:;"()\'])', r'\1', text)
        return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

#Joining two text sources together
text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [None]:
tokenizer.encode(text)

# The output is the list of token IDs for the text, with the <|unk|> ID (1131) replacing the unknown words "Hello" and "palace"

# We no longer get a KeyError because unknown words are mapped to the <|unk|> token in the vocabulary.

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
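
The unknown-word fallback can also be written with `dict.get`, which returns a default when the key is missing. The sketch below is self-contained and uses a tiny hand-made vocabulary; `toy_vocab` and `encode_with_unk` are hypothetical names chosen for this example and are not part of SimpleTokenizerV2.

```python
import re

# Tiny illustrative vocabulary plus the two special tokens appended at the end
base_tokens = sorted({"do", "you", "like", "tea", ",", "?"})
toy_vocab = {token: idx for idx, token in enumerate(base_tokens)}
toy_vocab["<|endoftext|>"] = len(toy_vocab)
toy_vocab["<|unk|>"] = len(toy_vocab)


def encode_with_unk(text, vocab):
    """Tokenize with the same regex as above; map unseen tokens to the <|unk|> ID."""
    tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]
    unk_id = vocab["<|unk|>"]
    return [vocab.get(token, unk_id) for token in tokens]


toy_ids = encode_with_unk("Hello, do you like tea?", toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Here "Hello" is the only token missing from the toy vocabulary, so it is the only one that receives the `<|unk|>` ID.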

In [None]:
# Now we pass the encoded text to the decode method
tokenizer.decode(tokenizer.encode(text))

# The unknown words "Hello" and "palace" are replaced by the <|unk|> token in the decoded text

# This confirms that the training dataset did not include the words "Hello" and "palace".



'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## Notes

> So far we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one starts.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.

> Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity.

> The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units.
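
To make the padding idea concrete, here is a minimal sketch that pads a batch of token-ID lists to the length of the longest one. The function name `pad_batch` and the pad ID `0` are assumptions for illustration only; a real setup would use whichever ID its tokenizer reserves for padding.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[5, 8, 2], [7], [4, 4, 9, 1]]
padded = pad_batch(batch, pad_id=0)  # pad_id=0 is a hypothetical choice

for row in padded:
    print(row)
```

After padding, every row has the same length, so the batch can be stored as a single rectangular array.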

## BYTE PAIR ENCODING

> The tokenization scheme implemented in the previous section was for illustration purposes only.

> Now we will go through a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE).

> The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

> Since implementing BPE can be relatively complicated, we will use an existing open-source Python library called tiktoken. This is a fast BPE tokenizer for use with OpenAI's models.

> This library implements the BPE algorithm very efficiently, based on source code in Rust.

In [52]:
# Once the tiktoken library is installed, we can instantiate the GPT-2 BPE tokenizer via tiktoken.get_encoding
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# The usage of this tokenizer is similar to the SimpleTokenizerV2 class we implemented previously, via an encode method:

# "someunknownPlace" is an out-of-vocabulary word; BPE handles it without an <|unk|> token
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

#The code below will print the token IDs
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [53]:
#We can convert this token id back to the text using the decode method
decoded_text = tokenizer.decode(integers)
print(decoded_text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


## Notes

We can make two noteworthy observations based on the token IDs and the decoded text above.

> First, the <|endoftext|> token is assigned a relatively large token ID, namely 50256. In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT has a total vocabulary size of 50257, with <|endoftext|> being assigned the largest token ID.

> Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

> The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters. This enables it to handle out-of-vocabulary words: if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
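
The merge idea behind BPE can be illustrated with a toy, character-level sketch: start from single characters and repeatedly merge the most frequent adjacent pair into one symbol. This is a simplified illustration under stated assumptions, not the byte-level algorithm or the pre-trained merge table behind tiktoken's GPT-2 encoding; the helper names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the adjacent symbol pair that occurs most often."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    return pair_counts.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of the given adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


corpus = "low lower lowest"
symbols = list(corpus)          # start from individual characters

for _ in range(2):              # two merge rounds are enough for this toy corpus
    symbols = merge_pair(symbols, most_frequent_pair(symbols))

print(symbols)
```

After two merges the frequent fragment "low" has become a single symbol, while the rarer endings stay as individual characters. A word that was never seen can still be spelled out from such smaller pieces, which is why no `<|unk|>` token is needed.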




## CREATING INPUT TARGET PAIRS

> In this section, we will implement a data loader that fetches input-target pairs using a sliding window approach.

> To get started, we first tokenize the whole short story "The Verdict" that we worked with earlier, using the BPE tokenizer.

In [55]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

# This is the total number of tokens in the training set after applying the BPE tokenizer

5145


In [None]:
# Remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage in the next steps

enc_sample = enc_text[50:]
print(len(enc_sample))

# One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x
# contains the input tokens and y contains the targets, which are the inputs shifted by one position. This is the sliding window approach.

# Why do the input and target arrays have the same size? Both are context_size tokens long. The context size is how many tokens we give the model as input
# to make its prediction, and every input position has a corresponding next-token target.


5095


In [None]:
context_size = 4  # length of the input
"""
A context size of 4 means that the model is trained to look at a sequence of up to 4 tokens to predict the next token in the sequence.
If the token sequence is [1, 2, 3, 4, 5], the input x is the first 4 tokens [1, 2, 3, 4] and the target y is the same window shifted by one, [2, 3, 4, 5].
"""

x = enc_sample[:context_size]
y = enc_sample[1:context_size + 1]

print(f"x: {x}")
print(f"y:      {y}")
#This is how input and output pairs are constructed.


x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [None]:
# Pairing the inputs with the targets, which are the inputs shifted by one position, we can create the next-word prediction tasks as follows:

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f"context: {context} -> desired: {desired}")
 
# Everything left of the arrow (->) is the input an LLM would receive, and the token ID on the right side of the arrow is the target token ID
# that the LLM is supposed to predict.
 

context: [290] -> desired: 4920
context: [290, 4920] -> desired: 2241
context: [290, 4920, 2241] -> desired: 287
context: [290, 4920, 2241, 287] -> desired: 257


In [63]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f"context: {tokenizer.decode(context)} -> desired: {tokenizer.decode([desired])}")

context:  and -> desired:  established
context:  and established -> desired:  himself
context:  and established himself -> desired:  in
context:  and established himself in -> desired:  a


## Notes

> We have now created the input-target pairs that we can use for LLM training in the next steps.

> There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

> In particular, we will return two tensors: an input tensor containing the text that the LLM sees, and a target tensor that contains the targets for the LLM to predict.

## IMPLEMENTING A DATA LOADER


In [None]:
#For more ef

## Notes

For an efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

> Step 1: Tokenize the entire text.

> Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length.

> Step 3: Return the total number of rows in the dataset.

> Step 4: Return a single row from the dataset.
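
Steps 2 to 4 above can be sketched in plain Python before bringing in PyTorch. The function name `sliding_windows` and the toy token IDs `0..9` are assumptions for illustration; the loop bounds follow the same `range(0, len(token_ids) - max_length, stride)` pattern as the dataset class in the next cell.

```python
def sliding_windows(token_ids, max_length, stride):
    """Chunk token_ids into input windows and targets shifted by one position."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets


toy_ids = list(range(10))  # stand-in for real token IDs
inputs, targets = sliding_windows(toy_ids, max_length=4, stride=2)

print(len(inputs))               # number of rows in the dataset
for x, y in zip(inputs, targets):
    print(x, "->", y)            # a single row: input window and its targets
```

With a stride of 2 and a window of 4, neighbouring input windows overlap by two tokens.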

In [82]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):  # stride: number of tokens the window advances between samples
        self.input_ids = []
        self.target_ids = []
        
        #Tokenize the entire text
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        
        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
            
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        x = self.input_ids[idx]
        y = self.target_ids[idx]
        
        return x, y

## Notes

> Based on the PyTorch Dataset class.

> Defines how individual rows are fetched from the dataset.

> Each row consists of a number of token IDs (based on max_length) assigned to the input_chunk tensor.

> The target_chunk tensor contains the corresponding targets.

> Recommendation: Check what the data returned from the dataset looks like when we combine the dataset with a PyTorch DataLoader.

The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:

> Step 1: Initialize the tokenizer

> Step 2: Create a Dataset

> Step 3: drop_last = True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training.

> Step 4: The number of CPU processes to use for preprocessing
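
The effect of drop_last on the number of batches is simple arithmetic, sketched below. The helper name `num_batches` and the sample count of 10 are assumptions made for this example; the function does not call PyTorch and only mirrors the behaviour described in Step 3.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches per epoch, with or without the final short batch."""
    if drop_last:
        return num_samples // batch_size            # the incomplete last batch is discarded
    return math.ceil(num_samples / batch_size)      # the incomplete last batch is kept


kept = num_batches(10, batch_size=4, drop_last=False)
dropped = num_batches(10, batch_size=4, drop_last=True)

print(kept, dropped)
```

With 10 samples and a batch size of 4, keeping the last batch gives three batches, the last one holding only two samples; dropping it leaves two full batches.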

In [83]:
def create_data_loader_v1(txt, batch_size=4, max_length=256, stride=128,
                          shuffle=True, drop_last=True, num_workers=0):
    #Initializing the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    #Creating the dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    #Creating the DataLoader
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    
    return dataloader

# We will test the dataloader with a batch size of 1 for an LLM with a context size of 4

# This will develop an intuition for how the GPTDatasetV1 class and the create_data_loader_v1 function work together

In [84]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
#Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next function

In [85]:
import torch
print("Pytorch version: ", torch.__version__)
dataloader = create_data_loader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

Pytorch version:  2.6.0
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


## Notes
 
> The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

> Since max_length is set to 4, each of the two tensors contains 4 token IDs.

> Note: An input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.

In [86]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


> If we compare the first batch with the second, we can see that the second batch's token IDs are shifted by one position compared to the first batch.

> The stride setting dictates the number of positions the input shifts across batches, emulating a sliding window approach.
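
The stride also determines how many samples the dataset yields. The sketch below counts the windows for a few stride values using the same range expression as GPTDatasetV1; the helper name `num_windows` is hypothetical, and 5145 is the BPE token count printed earlier for "The Verdict".

```python
def num_windows(num_tokens, max_length, stride):
    """Count the windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


num_tokens = 5145  # token count of the story reported earlier in this notebook
counts = {stride: num_windows(num_tokens, max_length=4, stride=stride)
          for stride in (1, 2, 4)}

for stride, count in counts.items():
    print(f"stride={stride}: {count} samples")
```

A stride of 1 gives the most samples with heavily overlapping windows, while a stride equal to max_length gives input windows that do not overlap at all.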
