## Data Sampling using Sliding Window

### Creating Input-Target pairs

We will implement a `data loader` that fetches input-target pairs using a `sliding window` approach.

In [None]:
import tiktoken

In [None]:
tokenizer=tiktoken.get_encoding("gpt2")

In [None]:
with open("verdict.txt", 'r', encoding='utf-8-sig') as f:
  raw_text=f.read()

enc_text=tokenizer.encode(raw_text)
print(len(enc_text))

In [None]:
enc_sample=enc_text[50:]

- The most intuitive way to create input-target pairs for next-word prediction is to use two variables, $x$ and $y$.

- $x$ contains the input tokens and $y$ contains the targets, which are the inputs shifted by 1.

In [None]:
context_size=4

x=enc_sample[:context_size]
y=enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y: {y}")

In [None]:
for i in range(1, context_size+1):
  context=enc_sample[:i]
  desired=enc_sample[i]

  print(context, "---->", desired)

In [None]:
for i in range(1, context_size+1):
  context=enc_sample[:i]
  desired=enc_sample[i]

  context_decoded=tokenizer.decode(context)
  desired_decoded=tokenizer.decode([desired])

  print(context_decoded, "---->", desired_decoded)

Now we have to implement an efficient data loader that iterates over the input dataset and returns inputs and targets as `PyTorch tensors`, which can be thought of as multi-dimensional arrays.

### Implementing Data Loader

- We will be using PyTorch's built-in `Dataset` and `DataLoader` classes.

- Step 1: Tokenize the entire text.

- Step 2: Use a sliding window to chunk the book into overlapping sequences of `max_length` tokens.

- Step 3: Return the total number of rows in the dataset.

- Step 4: Return a single row from the dataset.
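
Steps 2 to 4 can be sketched without PyTorch or a tokenizer. The snippet below is a minimal illustration, not the dataset class itself: `sliding_window_pairs` and `toy_ids` are hypothetical names, and the integers 0 to 9 stand in for real token IDs.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens.
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10))  # hypothetical token IDs 0..9
pairs = sliding_window_pairs(toy_ids, max_length=4, stride=2)

print(len(pairs))  # number of rows in the dataset
for input_chunk, target_chunk in pairs:
    print(input_chunk, "---->", target_chunk)
```

With `stride=2` and `max_length=4`, neighbouring input chunks share two tokens, which is the overlap that Step 2 refers to.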

In [None]:
from torch.utils.data import Dataset, DataLoader
from torch import tensor

class GPTDatasetV1(Dataset):
  def __init__(self, txt, tokenizer, max_length, stride):
    self.input_ids=[]
    self.target_ids=[]

    # Tokenize the entire text
    token_ids=tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

    # Apply the sliding window approach to chunk the dataset;
    # stopping at len(token_ids)-max_length keeps every target chunk full length
    for i in range(0, len(token_ids)-max_length, stride):
      input_chunk=token_ids[i:i+max_length]
      target_chunk=token_ids[i+1:i+max_length+1]
      self.input_ids.append(tensor(input_chunk))
      self.target_ids.append(tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)
  
  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

- Step 1: Initialize the tokenizer.

- Step 2: Create dataset.

- Step 3: Create the data loader. `drop_last=True` drops the last batch if it is shorter than the specified `batch_size`, to prevent loss spikes during training.

- Step 4: `num_workers` sets the number of CPU worker processes used to load the data.
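
The effect of `drop_last` can be seen with a small pure-Python sketch. It only mimics how sample indices would be grouped into batches and is not PyTorch's implementation; `batch_indices` is a hypothetical helper.

```python
def batch_indices(num_samples, batch_size, drop_last):
    """Group sample indices into consecutive batches of up to batch_size."""
    batches = [
        list(range(start, min(start + batch_size, num_samples)))
        for start in range(0, num_samples, batch_size)
    ]
    # Discard the final batch when it is smaller than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


kept = batch_indices(num_samples=10, batch_size=4, drop_last=False)
dropped = batch_indices(num_samples=10, batch_size=4, drop_last=True)

print([len(b) for b in kept])     # [4, 4, 2]
print([len(b) for b in dropped])  # [4, 4]
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples, so `drop_last=True` leaves two full batches.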

In [None]:
def create_data_loader_v1(txt, batch_size=4, max_length=256,
                          stride=128, shuffle=True, drop_last=True,
                          num_workers=0):
  # Initialize the tokenizer
  tokenizer=tiktoken.get_encoding("gpt2")

  # Create dataset
  dataset=GPTDatasetV1(txt, tokenizer, max_length, stride)

  # Create dataloader
  dataloader=DataLoader(dataset, batch_size=batch_size,
                        shuffle=shuffle, drop_last=drop_last,
                        num_workers=num_workers)
  
  return dataloader

- `batch_size`: The number of samples (input-target pairs) the model processes in one batch before updating its parameters.

- `num_workers`: The number of worker subprocesses used to load data in parallel; `0` loads data in the main process.

In [None]:
dataloader=create_data_loader_v1(
  raw_text,
  batch_size=1,
  max_length=4,
  stride=1,
  shuffle=False
)

data_iter=iter(dataloader)
first_batch=next(data_iter)
print(first_batch)

In [None]:
second_batch=next(data_iter)
print(second_batch)

In [None]:
dataloader=create_data_loader_v1(
  raw_text,
  batch_size=8,
  max_length=4,
  stride=4,
  shuffle=False
)

data_iter=iter(dataloader)
inputs, targets=next(data_iter)

print(f"Inputs: {inputs}")
print(f"Targets: {targets}")
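
The two loader configurations above differ only in `stride` and `batch_size`. As a rough illustration of what the stride does, the sketch below compares `stride=1` with `stride=4` on toy data; `input_windows` and `toy_ids` are hypothetical names, and plain lists stand in for tensors.

```python
def input_windows(token_ids, max_length, stride):
    """Return only the input chunks produced by the sliding window."""
    return [
        token_ids[i:i + max_length]
        for i in range(0, len(token_ids) - max_length, stride)
    ]


toy_ids = list(range(12))  # hypothetical token IDs 0..11

dense = input_windows(toy_ids, max_length=4, stride=1)
disjoint = input_windows(toy_ids, max_length=4, stride=4)

print(dense[:2])  # consecutive windows share three tokens
print(disjoint)   # consecutive windows share no tokens
```

When `stride` equals `max_length`, each token appears in at most one input chunk, so the dataset has fewer rows and no overlap between them. A smaller stride yields more rows at the cost of repeated tokens.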