# Data

## 1 Dataset

Here we implement our own custom **Dataset**.
To do so, we must do the following:
- Inherit from `torch.utils.data.Dataset`
- Implement `__init__`
<br>
    Instantiates the Dataset object
- Implement `__len__`
<br>
    Returns the number of samples in our dataset
- Implement `__getitem__`
<br>
    Loads and returns the sample from the dataset at the given index `idx`

Let's define a class that implements a custom Dataset.
Here we use the poetry of Borges as the example text.

In [1]:
import torch
from torch.utils.data import Dataset

In [None]:
class LLMDatset(Dataset):
    
    # Build all (input, target) pairs when the dataset is instantiated
    def __init__(self, text, tokenizer, max_len, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(text)
        assert len(token_ids) > max_len, (
            f"Text must contain at least max_len + 1 = {max_len + 1} tokens, got {len(token_ids)}"
        )

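        # Slide a window of max_len tokens over the text, advancing by stride;
        # each target is its input shifted one token to the right
        for i in range(0, len(token_ids) - max_len, stride):
            input_chunk = token_ids[i : i + max_len]
            target_chunk = token_ids[i + 1 : i + 1 + max_len]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))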

    def __len__(self):
        # Number of (input, target) pairs built in __init__
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]
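
To see what the sliding-window loop in `__init__` produces, it can help to trace it on a toy sequence. The following is only a minimal sketch in plain Python: `sliding_windows` is a hypothetical helper (not part of the class above), the integers stand in for the output of `tokenizer.encode`, and it returns lists rather than tensors.

```python
import math


def sliding_windows(token_ids, max_len, stride):
    """Hypothetical helper mirroring the loop in __init__, without tensors."""
    pairs = []
    for i in range(0, len(token_ids) - max_len, stride):
        input_chunk = token_ids[i : i + max_len]
        # The target is the input shifted one token to the right
        target_chunk = token_ids[i + 1 : i + 1 + max_len]
        pairs.append((input_chunk, target_chunk))
    return pairs


# Assumed stand-in for tokenizer.encode(text): ten consecutive token ids
token_ids = list(range(10))
max_len, stride = 4, 2

pairs = sliding_windows(token_ids, max_len, stride)
for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]

# The number of samples follows directly from the range() bounds
expected_count = math.ceil((len(token_ids) - max_len) / stride)
print(len(pairs), expected_count)  # 3 3
```

With `stride` smaller than `max_len`, consecutive windows overlap (here by two tokens); setting `stride` equal to `max_len` would give non-overlapping windows.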
