# Dataset
in this notebook, we're going to build ```NpyTokensDataset``` <br/>
the goal is to stream token sequences efficiently from large ```.npy``` files without loading everything into RAM <br/>
we'll use NumPy memmap and a PyTorch ```Dataset```.

# 1. Imports & skeleton

In [1]:
import numpy as np
import torch
from torch.utils.data import Dataset

class NpyTokensDataset_note(Dataset):
    def __init__(self, path: str, seq_len: int):
        ...
    def __getstate__(self):
        ...
    def __setstate__(self, state):
        ...
    def __len__(self):
        ...
    def __getitem__(self, idx):
        ...


# 2. ```__init__```
what does it do?
- saves ```path``` / ```seq_len```
- opens the ```.npy``` file with ```mmap_mode="r"``` so it isn't all loaded into memory
- accepts 1D or 2D arrays (we mainly use 2D; 1D is kept because i was also testing an earlier version)
- asserts that the sequence length (the array length for 1D, the row length for 2D) is greater than ```seq_len + 1```, so the 1-token ```(x, y)``` shift always fits

In [1]:
class NpyTokensDataset_note(Dataset):
    def __init__(self, path: str, seq_len: int):
        self.path = str(path)
        self.seq_len = int(seq_len)
        # open as memmap (not full load)
        arr = np.load(self.path, mmap_mode="r")

        # record shape info
        self._ndim = arr.ndim
        self._shape = arr.shape

        # enforce 1D or 2D
        assert self._ndim in (1, 2), f"Expected 1D/2D, got {self._ndim}"

        # ensure we can create (x, y) with a 1-token shift
        if self._ndim == 1:
            assert self._shape[0] > self.seq_len + 1, \
                f"1D length {self._shape[0]} must be > seq_len+1={self.seq_len+1}"
        else:
            assert self._shape[1] > self.seq_len + 1, \
                f"2D length {self._shape[1]} must be > seq_len+1={self.seq_len+1}"

        # keep the memmap handle (in main process)
        self._arr = arr


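
to see why the check is ```> seq_len + 1```, here is a small sketch. The helper name ```random_start_bound``` and the toy sizes are made up for illustration. The bound it returns is the same expression that ```__getitem__``` later passes to ```np.random.randint``` as its exclusive upper limit, so it has to be positive.

```python
import numpy as np

def random_start_bound(length: int, seq_len: int) -> int:
    """Hypothetical helper: exclusive upper bound for the random start index."""
    high = length - seq_len - 1
    if high <= 0:
        raise ValueError(f"length {length} must be > seq_len+1={seq_len + 1}")
    return high

# toy numbers: 10 tokens, windows of 4
high = random_start_bound(length=10, seq_len=4)

tokens = np.arange(10)
s = high - 1                 # largest start that randint(0, high) can return
x = tokens[s:s + 4]          # input window
y = tokens[s + 1:s + 5]      # same window shifted by one token
print(high, x.tolist(), y.tolist())
```

with ```length=5, seq_len=4``` the helper raises, which is the case the asserts in ```__init__``` guard against.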

#### (TL;DR) WHY MEMMAP?
in the actual pretraining, i'm feeding in a huge dataset (around 35 GB as a text file) <br/>
and the computation itself needs a lot of memory per block (a rough guess: around 20 GB per chunk) <br/>
that won't fit in RAM, since we also need room for other values and state <br/>
(my machine has 40 GB of RAM, which is still too small to hold all 8 chunks at once)
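
a minimal sketch of the memmap behaviour, assuming a tiny throwaway file (the file name, shape and dtype here are made up, not the real training data). ```np.load(..., mmap_mode="r")``` hands back an ```np.memmap```, and a slice only reads the part of the file it covers.

```python
import os
import tempfile

import numpy as np

# hypothetical tiny token file, only for illustration
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tokens.npy")
np.save(demo_path, np.arange(1_000, dtype=np.uint16).reshape(10, 100))

# mmap_mode="r" maps the file read-only instead of loading it
arr = np.load(demo_path, mmap_mode="r")
print(type(arr).__name__, arr.shape, arr.dtype)

# slicing pulls in only the requested region (row 3, tokens 10..14)
window = np.asarray(arr[3, 10:15])
print(window.tolist())
```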

# 3. ```__getstate__``` & ```__setstate__```
the problem: when ```DataLoader``` starts worker processes (```num_workers > 0```), the dataset object may be pickled and sent to each worker, and we don't want to pickle a huge memmap array <br/>
so we save only the lightweight fields in ```__getstate__```, then reopen the memmap inside each worker in ```__setstate__```

In [3]:
class NpyTokensDataset_note(Dataset):
    ...
    def __getstate__(self):
        # return only small, picklable state.
        return {
            "path": self.path,
            "seq_len": self.seq_len,
            "_ndim": self._ndim,
            "_shape": self._shape,
        }

    def __setstate__(self, state):
        # restore small state
        self.__dict__.update(state)
        # reopen memmap in the worker process
        self._arr = np.load(self.path, mmap_mode="r")
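
a quick way to convince yourself this works is a pickle round trip. This is only a sketch: ```MemmapHolder``` is a hypothetical stand-in that keeps just the pickling logic, and the file is a small throwaway array.

```python
import os
import pickle
import tempfile

import numpy as np

class MemmapHolder:
    """Hypothetical stand-in for the dataset, keeping only the pickling logic."""

    def __init__(self, path: str):
        self.path = str(path)
        self._arr = np.load(self.path, mmap_mode="r")

    def __getstate__(self):
        # only the path travels through pickle, not the array
        return {"path": self.path}

    def __setstate__(self, state):
        self.__dict__.update(state)
        # reopen the file on the receiving side
        self._arr = np.load(self.path, mmap_mode="r")

# throwaway file: 64 x 1024 uint16 values = 128 KiB of data
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tokens.npy")
np.save(demo_path, np.zeros((64, 1024), dtype=np.uint16))

holder = MemmapHolder(demo_path)
payload = pickle.dumps(holder)   # what would be sent to a worker
clone = pickle.loads(payload)    # what the worker rebuilds

print(len(payload), "bytes pickled for", clone._arr.nbytes, "bytes of data")
print(type(clone._arr).__name__, clone._arr.shape)
```

the pickled payload stays small no matter how big the file is, because the array itself never goes through pickle.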


# 4. ```__len__```
what does it do?
- for 1D: returns the number of non-overlapping windows that fit, ```(length - 1) // seq_len```, but never less than 4096
- for 2D: multiplies the row count by a per-row window count (at least 4), then raises the result to at least 4096

this is a virtual length that just keeps random sampling going <br/>
we don't actually index by ```idx``` in ```__getitem__``` below

In [4]:
class NpyTokensDataset_note(Dataset):
    ...
    def __len__(self):
        if self._ndim == 1:
            # rough capacity estimate or a minimum baseline
            return max(4096, (self._shape[0] - 1) // self.seq_len)
        rows, L = self._shape
        # for 2D, ensure each row contributes at least a few samples
        return max(4096, rows * max(4, (L - 1) // self.seq_len))
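
to make the arithmetic concrete, here is the same formula as a free function with a few example shapes. The function name ```virtual_len``` and the shapes are made up for illustration.

```python
def virtual_len(shape, seq_len, floor=4096, min_per_row=4):
    """Same arithmetic as __len__ above, as a standalone (hypothetical) helper."""
    if len(shape) == 1:
        return max(floor, (shape[0] - 1) // seq_len)
    rows, length = shape
    return max(floor, rows * max(min_per_row, (length - 1) // seq_len))

# 1D, short: only 19 windows fit, so the 4096 floor wins
print(virtual_len((10_000,), 512))
# 1D, long: 19531 windows fit
print(virtual_len((10_000_000,), 512))
# 2D: 8 rows * 3906 windows per row = 31248
print(virtual_len((8, 2_000_000), 512))
# 2D, short rows: 100 * max(4, 1) = 400, so the floor wins again
print(virtual_len((100, 600), 512))
```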


#### (TL;DR) Why 4096?
with a floor of 4096, ```DataLoader``` progress bars and samplers still behave nicely when the array is short <br/>
it also gives enough iterations per epoch to see the metrics move <br/>
(i also tried 8192, but it stretched epoch times too much: progress bars moved slowly and eval checkpoints lagged)
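
for a rough feel of what the floor means per epoch, here is the batch count for both values. It assumes ```batch_size=12``` and ```drop_last=True```, the settings used in the smoke test in this notebook.

```python
batch_size = 12  # same value as the smoke test in this notebook

# with drop_last=True the last partial batch is dropped, hence the floor division
batches = {n: n // batch_size for n in (4096, 8192)}

for n, count in batches.items():
    print(n, "samples ->", count, "batches per epoch")
```

so going from 4096 to 8192 doubles the number of batches between epoch boundaries.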

# 5. ```__getitem__```
what does it do?
- ignores ```idx``` and samples a random window on every call
- picks a random start ```s``` so that ```[s : s+seq_len]``` is ```x``` and ```[s+1 : s+seq_len+1]``` is ```y```
- for 2D, it also picks a random row ```r```
- converts the slices with ```.astype(np.int64, copy=False)``` (this only avoids a copy when the file is already stored as int64) and wraps them as torch tensors

In [2]:
class NpyTokensDataset_note(Dataset):
    ...
    def __getitem__(self, idx):
        arr = self._arr
        if self._ndim == 1:
            L = self._shape[0]
            s = np.random.randint(0, L - self.seq_len - 1)
            x = arr[s:s+self.seq_len].astype(np.int64, copy=False)
            y = arr[s+1:s+self.seq_len+1].astype(np.int64, copy=False)
        else:
            rows, L = self._shape
            r = np.random.randint(0, rows) # pick a random row/document
            s = np.random.randint(0, L - self.seq_len - 1) # pick a random start within that row
            row = arr[r]
            x = row[s:s+self.seq_len].astype(np.int64, copy=False)
            y = row[s+1:s+self.seq_len+1].astype(np.int64, copy=False)

        return torch.from_numpy(x), torch.from_numpy(y)


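
a small check of what ```.astype(np.int64, copy=False)``` really does. The uint16 storage dtype is an assumption here (i'm not stating what the real file uses). The point is that ```copy=False``` only skips the copy when the dtype already matches.

```python
import numpy as np

stored_small = np.arange(20, dtype=np.uint16)  # assumed compact storage dtype
stored_wide = np.arange(20, dtype=np.int64)    # already the target dtype

# same call as in __getitem__, on a slice of each array
x_small = stored_small[3:11].astype(np.int64, copy=False)
x_wide = stored_wide[3:11].astype(np.int64, copy=False)

# dtype had to change -> a new array was allocated
print(np.shares_memory(x_small, stored_small))
# dtype already int64 -> the slice is returned as-is, still a view
print(np.shares_memory(x_wide, stored_wide))
```

either way, only ```seq_len``` tokens are touched per sample, so a copy here is cheap compared with loading the whole file.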

# Smoke test

In [1]:
from torch.utils.data import DataLoader
import sys
sys.path.append("../python_files")
from npy_datasets import NpyTokensDataset

SEQ_LEN = 512
PATH = "../materials/train_smoke.npy" 
# for full dataset, use these lines instead
# PATH = "../materials/train_smoke.npy" 

train_ds = NpyTokensDataset(PATH, seq_len=SEQ_LEN)

# small test
loader = DataLoader(
    train_ds,
    batch_size=12,
    shuffle=False, # the dataset already picks random windows itself
    num_workers=2, # small number of workers
    pin_memory=True, # if using CUDA
    persistent_workers=True, # keeps workers alive for speed
    prefetch_factor=4, # number of batches each worker loads in advance
    drop_last=True,
)
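
since ```__getitem__``` draws from ```np.random```, the batches differ from run to run. If you want repeatable sampling, one option is the ```worker_init_fn``` + ```generator``` recipe from the PyTorch reproducibility notes. This is only a sketch: the seed value and the names ```seed_worker``` / ```repro_kwargs``` are my own, not part of the dataset code.

```python
import numpy as np
import torch

def seed_worker(worker_id: int) -> None:
    """worker_init_fn: seed NumPy from this worker's torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)  # hypothetical base seed

# extra keyword arguments one could pass to DataLoader(...)
repro_kwargs = dict(worker_init_fn=seed_worker, generator=g)

# quick check in the main process: same torch seed -> same NumPy draws
torch.manual_seed(0)
seed_worker(0)
first = np.random.randint(0, 1000, size=3).tolist()

torch.manual_seed(0)
seed_worker(0)
second = np.random.randint(0, 1000, size=3).tolist()

print(first == second)
```

usage would look like ```DataLoader(train_ds, batch_size=12, num_workers=2, **repro_kwargs)```.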


# Sanity check
what to look for:
- the shape and dtype of the inputs and targets should match
- sample x shows actual token ids
- sample y is sample x shifted one position to the left

In [2]:
# grab one batch
xb, yb = next(iter(loader))

print("Input (x):", xb.shape, xb.dtype, xb.device)
print("Target (y):", yb.shape, yb.dtype, yb.device)

# show first sequence of tokens (truncated for readability)
print("\n")
print("Sample x[0]:", xb[0, :10].tolist()) # first 10 tokens
print("Sample y[0]:", yb[0, :10].tolist()) # shifted by 1


Input (x): torch.Size([12, 512]) torch.int64 cpu
Target (y): torch.Size([12, 512]) torch.int64 cpu


Sample x[0]: [1052, 724, 7314, 322, 35, 208, 3801, 1500, 1933, 20934]
Sample y[0]: [724, 7314, 322, 35, 208, 3801, 1500, 1933, 20934, 653]
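
instead of eyeballing the shift, you can assert it. The lists below are copied from the printed output above; the commented line is how the same check could look on the full batch tensors.

```python
# values copied from the printed samples above
x0 = [1052, 724, 7314, 322, 35, 208, 3801, 1500, 1933, 20934]
y0 = [724, 7314, 322, 35, 208, 3801, 1500, 1933, 20934, 653]

# y is x shifted left by one: x[1:] must equal y[:-1]
shift_ok = x0[1:] == y0[:-1]
print(shift_ok)

# on the real batch (needs the loader from the smoke test):
# assert torch.equal(xb[:, 1:], yb[:, :-1])
```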


# DONE!
you'll see ```Datasets.py``` in the python_files folder <br/>
it contains exactly what this notebook covers (some names may differ)