# Synthetic Regression Data

In [1]:
!pip install jedi

Collecting jedi
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [2]:
# Keep pip/tools fresh
!pip -q install -U pip setuptools wheel

In [3]:
# Install d2l but SKIP its pinned (old) deps
!pip -q install d2l==1.0.3 --no-deps

In [4]:
%matplotlib inline
import random
import torch
from d2l import torch as d2l

## Generating the Dataset

In [7]:
class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise = 0.01, num_train = 1000, num_val = 1000,
                 batch_size = 32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise

In [8]:
data = SyntheticRegressionData(w = torch.tensor([2, -3.4]), b = 4.2)

In [9]:
print('features', data.X[0], '\nlabel:', data.y[0])

features tensor([ 1.1986, -1.9502]) 
label: tensor([13.2176])


## Reading the Dataset

In [10]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train + self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i:i + self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]

In [11]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


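This hand-rolled generator is fine for illustration but inefficient in practice: it keeps the whole dataset in memory and gathers every minibatch through random index lookups. The framework's built-in data loader handles batching and shuffling for us.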

## Concise Implementation of the Data Loader

In [14]:
@d2l.add_to_class(d2l.DataModule)
def get_tensorloader(self, tensors, train, indices = slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                        shuffle = train)

In [15]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

In [16]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


In [17]:
len(data.train_dataloader())

32

## Exercises

### 3.3.5.1

What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument by using the framework's API?

If $N$ (number of examples) isn’t a multiple of $B$ (batch size), the last batch is smaller by default.

Example: $N = 1000, B = 32 \implies$ you get 31 full batches of 32 and one final batch of 8.

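To change this behavior in PyTorch, pass `drop_last=True` to the `DataLoader`. The incomplete final batch is then dropped, so all batches have the same size. The redefined `get_tensorloader` at the end of this answer sets `drop_last=train`, so only the training loader drops its final batch.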

When to use which:

*   Training: `drop_last=True` is handy if you need fixed batch size (e.g., BatchNorm, certain fused kernels, multi-GPU balance).
*   Validation/Test: keep `drop_last=False` to evaluate all examples.
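
As a quick check of the counts above, here is a minimal sketch on a standalone `TensorDataset` with 1000 rows. The names `features`, `labels`, `keep_last`, and `drop_last` are illustrative and not part of the notebook's classes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 examples with 2 features each, mirroring N = 1000 and B = 32 above.
features = torch.randn(1000, 2)
labels = torch.randn(1000, 1)
dataset = TensorDataset(features, labels)

keep_last = DataLoader(dataset, batch_size=32, drop_last=False)
drop_last = DataLoader(dataset, batch_size=32, drop_last=True)

# Collect the number of rows in every batch each loader produces.
sizes_keep = [xb.shape[0] for xb, _ in keep_last]
sizes_drop = [xb.shape[0] for xb, _ in drop_last]

print(len(keep_last), sizes_keep[-1])  # 32 batches, the last one has 8 rows
print(len(drop_last), sizes_drop[-1])  # 31 batches, all with 32 rows
```

With `drop_last=True`, the 8 leftover examples are simply skipped in that pass. If the loader also shuffles, a different set of 8 is skipped each epoch.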





In [None]:
@d2l.add_to_class(d2l.DataModule)
def get_tensorloader(self, tensors, train, indices = slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train,
                                       drop_last=train)

### 3.3.5.2

Suppose that we want to generate a huge dataset, where both the size of the parameter vector `w` and the number of examples `num_examples` are large.
1. What happens if we cannot hold all data in memory?
2. How would you shuffle the data if it is held on disk? Your task is to design an efficient algorithm that does not require too many random reads or writes. Hint: pseudorandom permutation generators allow you to design a reshuffle without the need to store the permutation table explicitly (Naor and Reingold, 1999).

**“Don’t preload everything”: stream minibatches instead**

Problem (in human terms):

Your dataset is like a 1 TB photo album, but your computer’s memory (RAM) is only 16 GB. You can’t open the whole album at once.

Fix:

Open one page at a time: read just enough examples to make a minibatch, feed them to the model, then read the next minibatch. That’s streaming.

Why this works:

You only ever keep one minibatch (say 1–64 MB) in memory, not the whole dataset.
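
One way to sketch this in Python is with `np.memmap`, which maps a file into virtual memory so that only the rows you slice are actually read. The file name `features.bin`, the helper `stream_minibatches`, and the raw float32 layout are assumptions made for this illustration. A real pipeline would use whatever on-disk format the data already has.

```python
import os
import tempfile

import numpy as np

def stream_minibatches(path, num_rows, num_cols, batch_size):
    """Yield consecutive minibatches from a float32 file without loading it whole."""
    # np.memmap maps the file into virtual memory; rows are paged in on access.
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(num_rows, num_cols))
    for start in range(0, num_rows, batch_size):
        # np.array(...) copies only this slice into RAM.
        yield np.array(data[start:start + batch_size])

# Toy stand-in for a dataset too large for memory: 1000 rows, 3 columns.
num_rows, num_cols = 1000, 3
path = os.path.join(tempfile.mkdtemp(), "features.bin")
np.arange(num_rows * num_cols, dtype=np.float32).reshape(num_rows, num_cols).tofile(path)

batch_shapes = [batch.shape for batch in stream_minibatches(path, num_rows, num_cols, 32)]
print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])  # 32 (32, 3) (8, 3)
```

The reads here are sequential, which is the access pattern disks handle best. Shuffling is a separate concern and is addressed further down.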

**“Store data in shards”: medium files, not one giant blob**

Problem:

Disk is fast if you read sequentially, but slow if you jump around randomly. One giant file can be unwieldy; thousands of tiny files cause overhead.

Fix:

Split data into shards: many medium-sized files (e.g., 64–256 MB each).


*   Big enough to read efficiently in long, sequential chunks.
*   Small enough that you can reshuffle the order of shards each epoch (a cheap way to get randomness).


Picture:

Think of 100 labeled boxes instead of 1 huge crate or 10,000 envelopes. You can shuffle box order easily, and inside each box you grab items in order.
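
The box analogy can be sketched as follows. This is a toy illustration under stated assumptions: the helpers `write_shards` and `read_epoch`, the `.npy` shard format, and the tiny shard size are all invented for the example, and integers stand in for real records.

```python
import os
import random
import tempfile

import numpy as np

def write_shards(records, shard_size, directory):
    """Split `records` into consecutive .npy shard files and return their paths."""
    paths = []
    for shard_id, start in enumerate(range(0, len(records), shard_size)):
        path = os.path.join(directory, f"shard-{shard_id:05d}.npy")
        np.save(path, records[start:start + shard_size])
        paths.append(path)
    return paths

def read_epoch(paths, epoch):
    """Visit shards in an epoch-specific random order; read each one sequentially."""
    order = list(paths)
    random.Random(epoch).shuffle(order)  # seeding with the epoch keeps the order reproducible
    for path in order:
        for record in np.load(path):
            yield record

records = np.arange(1000)
paths = write_shards(records, shard_size=100, directory=tempfile.mkdtemp())

epoch0 = [int(r) for r in read_epoch(paths, epoch=0)]
epoch1 = [int(r) for r in read_epoch(paths, epoch=1)]
print(len(paths), sorted(epoch0) == list(range(1000)))  # 10 True
```

Every record is visited exactly once per epoch, and only the order of the shards changes. Records inside a shard still arrive in their stored order, so a small in-memory shuffle per shard is a common addition.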

**“If `w` is huge”: the parameters don’t fit easily**

Sometimes the model is the big thing. Examples:


*   A gigantic embedding table for words/items/users (millions of rows).
*   A very wide linear model (10M+ features).


Here are practical tricks to still train:

**Minibatch / online SGD**

Key idea: You never need the whole dataset in memory to compute a gradient—only the current minibatch. So even with a huge dataset, training works the same: stream minibatches (from disk) and update `w` each step.

**Sparse updates (only touch what you use)**

When it helps: Features are “mostly zeros” (e.g., bag-of-words). Each sample only references a tiny subset of parameters.

Trick: Use layers/ops that do indexed lookups and sparse updates (e.g., `nn.Embedding` / `EmbeddingBag`).


*   The optimizer updates only the rows of `w` that were actually used in the minibatch.
*   Memory traffic and compute are much smaller than touching all of `w`.

Mental picture: You have a phone book with 10 million names (parameters), but for one call (minibatch) you look up and update only 500 of them.
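
A minimal sketch of the phone-book picture, assuming `nn.Embedding(..., sparse=True)` paired with plain `torch.optim.SGD` (one of the optimizers that accepts sparse gradients). The table size, the indices, and the toy loss are arbitrary choices for the illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A table with 10,000 rows; sparse=True makes the gradient a sparse tensor.
emb = nn.Embedding(num_embeddings=10_000, embedding_dim=4, sparse=True)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
before = emb.weight.detach().clone()

# One "minibatch" that looks up only three distinct rows (17 appears twice).
idx = torch.tensor([3, 17, 17, 4242])
loss = emb(idx).sum()

opt.zero_grad()
loss.backward()
opt.step()

# Find which rows of the table actually changed.
changed = (emb.weight.detach() != before).any(dim=1).nonzero().flatten()
print(emb.weight.grad.is_sparse)  # True
print(changed.tolist())           # [3, 17, 4242]
```

Only the three rows that were looked up are modified; the other 9,997 rows are never touched by the update.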

**The problem we’re solving**

You have N records on disk. You want to read them in a new random order each epoch (to train well), without doing tons of slow random disk reads.

If you just pick a random index for every next record, the disk has to jump all over the place → slow.

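**The fix: block-level PRP (shuffle big chunks, read each chunk sequentially)**

A pseudorandom permutation (PRP) is a keyed function that maps each index in 0..M-1 to a distinct index in the same range, so the permutation can be computed on demand instead of being stored as a table. Instead of permuting individual records, permute blocks of records.

1.   Pick a block size B (e.g., 64k records). Here B is the block size, not the minibatch size.
2.   Then you have M = ceil(N / B) blocks, numbered 0..M-1 (the last block may be shorter):
     *   Block 0 holds records [0 .. B-1]
     *   Block 1 holds records [B .. 2B-1]
     *   ...
3.   Use a PRP on the block IDs, with a different key each epoch, to get a new block order each epoch.
4.   For each block (in that permuted order), read it sequentially (fast!), and optionally do a small in-RAM shuffle within the block for extra randomness.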
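
The steps above can be sketched with a small Feistel network plus cycle-walking, which is one common way to build a permutation over an arbitrary range without storing a table. This is an illustrative toy, not the Naor and Reingold construction and not a vetted cipher. The helpers `permute_index` and `epoch_block_order`, the SHA-256 round function, and the choice of 4 rounds are all assumptions made for the example.

```python
import hashlib

def _round_value(key, round_index, value, half_bits):
    """Keyed round function: hash (key, round, value) down to `half_bits` bits."""
    digest = hashlib.sha256(f"{key}:{round_index}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)

def permute_index(index, domain_size, key, rounds=4):
    """Map `index` in [0, domain_size) to a distinct index in the same range."""
    # Smallest even bit width whose range covers the domain.
    half_bits = max(1, ((domain_size - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    value = index
    while True:
        left, right = value >> half_bits, value & mask
        for round_index in range(rounds):
            # One Feistel round: invertible for any round function.
            left, right = right, left ^ _round_value(key, round_index, right, half_bits)
        value = (left << half_bits) | right
        if value < domain_size:  # cycle-walk: retry until we land inside the domain
            return value

def epoch_block_order(num_records, block_size, epoch, seed=0):
    """Yield (block_id, start, end) for every block, in this epoch's order."""
    num_blocks = -(-num_records // block_size)  # ceiling division
    key = f"{seed}-{epoch}"                     # a new key gives a new order each epoch
    for position in range(num_blocks):
        block_id = permute_index(position, num_blocks, key)
        start = block_id * block_size
        yield block_id, start, min(start + block_size, num_records)

blocks0 = list(epoch_block_order(1_000_000, 65_536, epoch=0))
order0 = [block_id for block_id, _, _ in blocks0]
print(len(order0), sorted(order0) == list(range(16)))  # 16 True
```

Each `(start, end)` range is then read from disk in one sequential pass. Only the key has to be remembered to reproduce an epoch's order, not the permutation itself.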


### 3.3.5.3

Implement a data generator that produces new data on the fly, every time the iterator is called.

In [11]:
import torch
from torch.utils.data import IterableDataset, DataLoader

class _OnTheFlyRegression(d2l.HyperParameters, IterableDataset):
    """Stream fresh samples: X ~ N(0, I), y = X·w + b + eps."""
    def __init__(self, w, b, noise = 0.01, num_train = 1000, num_val = 1000,
                 train = False):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.as_tensor(self.w, dtype = torch.float32).reshape(-1) # (d, )

    def __iter__(self):
        d = self.w.numel()
        L = self.num_train if self.train else self.num_val  # <- pick train/val length
        for _ in range(L):
            X = torch.randn(d)
            eps = torch.randn(1) * self.noise
            y = (X @ self.w) + self.b + eps
            yield X, y



*   Every pass through the loop makes a brand-new `X` and noise `eps`.
*   `y` is computed from those fresh values.
*   `yield` returns that sample to the `DataLoader`, which stacks many such samples into a minibatch.

So the on-the-fly generation is exactly the `torch.randn(...)` calls inside `__iter__`, executed each time you iterate.

**Non-streaming (book's original)**

No new calls to `torch.randn` during iteration.

Iteration indexes into already-made tensors; you may shuffle order, but the values don’t change across epochs.

**Streaming (on-the-fly)**

Data is not stored up front. It’s generated during iteration:

In [8]:
class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression (precomputed OR streaming)."""
    def __init__(self, w, b, noise = 0.01, num_train = 1000, num_val = 1000,
                 batch_size = 32, streaming = False):
        super().__init__()
        self.save_hyperparameters()
        if not self.streaming:
            n = num_train + num_val
            self.X = torch.randn(n, len(w))
            noise = torch.randn(n, 1) * noise
            self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise

In [12]:
@d2l.add_to_class(d2l.DataModule)
def get_tensorloader(self, tensors, train, indices = slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                        shuffle = train)

In [13]:
# Switch between precomputed (non-streaming) and on-the-fly (streaming)
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train: bool):
    if getattr(self, "streaming", False):
        ds = _OnTheFlyRegression(self.w, self.b, self.noise,
                                 self.num_train, self.num_val, train)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=False)
    else:
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader((self.X, self.y), train, i)

**What `getattr(self, "streaming", False)` means**

`getattr(obj, name, default)` returns `obj.name` if it exists; otherwise it returns `default`.

So here it reads: “If `self.streaming` exists and is truthy, use the streaming path; otherwise act as if it’s `False`.”

This avoids `AttributeError` if some instance doesn’t have `self.streaming` (e.g., older code or an object created before we introduced the flag).
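
A tiny standalone illustration of that fallback. The classes `LegacyData` and `NewData` are hypothetical stand-ins, not part of `d2l`.

```python
class LegacyData:
    """Hypothetical object created before a `streaming` flag existed."""

class NewData:
    """Hypothetical object that stores the flag explicitly."""
    def __init__(self, streaming):
        self.streaming = streaming

legacy, new = LegacyData(), NewData(streaming=True)

print(getattr(legacy, "streaming", False))  # False (attribute missing, default used)
print(getattr(new, "streaming", False))     # True  (attribute present)

# Plain attribute access on the legacy object raises instead of falling back.
try:
    legacy.streaming
    raised = False
except AttributeError:
    raised = True
print(raised)  # True
```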

In [14]:
data = SyntheticRegressionData(w=torch.tensor([2.0, -3.4]), b=4.2,
                               noise=0.01, num_train=1000, num_val=1000,
                               batch_size=32, streaming=True)
X, y = next(iter(data.train_dataloader()))
print(X.shape, y.shape)  # torch.Size([32, 2]) torch.Size([32, 1])

torch.Size([32, 2]) torch.Size([32, 1])



```
X, y = next(iter(data.train_dataloader()))
```

does two things:


*   `iter(data.train_dataloader())` creates a new iterator over the DataLoader (i.e., “start an epoch from the beginning”).
*   `next(...)` asks that iterator for the first minibatch, and you assign that batch to the local variables `X` and `y`.


**Why you don’t see `data.X` / `data.y` “saved” anymore:**


*   In your class you only create and store `self.X` and `self.y` when `streaming == False`.
*   When `streaming == True`, you do not create those tensors. Instead, `train_dataloader()` builds an `_OnTheFlyRegression` dataset that generates fresh samples inside `__iter__` and yields them to the DataLoader. Nothing is stored in `self`—the batches are created on demand and returned to you, then discarded unless you keep them.

So that line does “save” the batch—into the local Python variables `X` and `y`. It just doesn’t store the whole dataset on `data.X`/`data.y` because, in streaming mode, that dataset doesn’t exist.

### 3.3.5.4

How would you design a random data generator that generates the same data each time it is called?

In [25]:
import torch
from torch.utils.data import IterableDataset, DataLoader

class OnTheFlyDeterministic(d2l.HyperParameters, IterableDataset):
    """Generates the SAME sequence every time you iterate."""
    def __init__(self, w, b, noise = 0.01, num_train = 1000, num_val = 1000,
                 train = True, seed = 0):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.as_tensor(self.w, dtype = torch.float32).reshape(-1) # (d, )

    def __iter__(self):
        d = self.w.numel()
        L = self.num_train if self.train else self.num_val  # <- pick train/val length

        g = torch.Generator()     # local RNG (doesn't touch global state)
        g.manual_seed(self.seed)  # key line: reseed at the start of every pass

        for _ in range(L):
            X = torch.randn(d, generator = g)
            eps = torch.randn(1, generator = g) * self.noise
            y = (X @ self.w) + self.b + eps
            yield X, y

**What’s an RNG?**

RNG = Random Number Generator. In code it’s a tiny state machine that spits out a sequence of numbers that look random. It’s deterministic: if you start it from the same seed, it will produce the same sequence every time.

In PyTorch there are two flavors:


*   a global RNG (used when you call `torch.randn(...)` without specifying anything), and
*   local RNG objects you create with `torch.Generator()` and pass explicitly (e.g., `torch.randn(..., generator=g)`).


Using a local RNG isolates your randomness so other parts of your program can’t “consume” random numbers and change your sequence.


```
g = torch.Generator()          # local RNG (doesn't touch global state)
g.manual_seed(self.seed)       # reset g's internal state to a fixed start
```

`torch.Generator()`

Makes a new RNG object `g` with its own internal state. It’s separate from PyTorch’s global RNG. Nothing else can accidentally advance it unless they also use `g`.

`g.manual_seed(self.seed)`

Sets `g`’s starting point. From now on, every call that uses `generator=g` will draw the same sequence of “random” numbers each time you recreate and reseed `g` with the same seed. That’s what gives you repeatable data.
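
A short sketch of both properties, isolation from the global RNG and repeatability. The helper `draw` is a name introduced only for this illustration.

```python
import torch

def draw(seed, n=3):
    """Draw n standard-normal values from a freshly seeded local generator."""
    g = torch.Generator()
    g.manual_seed(seed)
    return torch.randn(n, generator=g)

a = draw(42)
_ = torch.randn(1000)  # advance the global RNG in between
b = draw(42)           # same seed: same values, regardless of the global RNG
c = draw(43)           # different seed: a different sequence

# Using the local generator leaves the global RNG state untouched.
state_before = torch.get_rng_state()
draw(42)
state_after = torch.get_rng_state()

print(torch.equal(a, b))                       # True
print(torch.equal(a, c))                       # False
print(torch.equal(state_before, state_after))  # True
```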

In [26]:
ds = OnTheFlyDeterministic(w=torch.tensor([2.0, -3.4]), b=4.2,
                           noise=0.01, num_train=1000, train=True,
                           seed=42)

loader = DataLoader(ds, batch_size=32, shuffle=False)

X1, y1 = next(iter(loader))
X2, y2 = next(iter(loader))

print("First batch == first batch from new iterator?",
      torch.equal(X1, X2), torch.equal(y1, y2))  # True, True

First batch == first batch from new iterator? True True


`DataLoader` is acting as a batch maker + iterator factory wrapped around your dataset `ds`. Here’s what it does, step by step, in plain terms:


1.   Starts an iteration when you ask for it

When you call `iter(loader)` (or loop with `for X, y in loader:`), the loader starts pulling samples from `ds`.

2.   Pulls items from your dataset

Because `ds` is an `IterableDataset`, `DataLoader` just calls `ds.__iter__()` and reads items in the order the dataset yields them.

Your dataset yields per-sample pairs `(X, y)` with shapes `(d,)` and `(1,)`.

3.   Groups samples into minibatches of 32

It collects 32 such samples at a time (that’s `batch_size=32`).

4.   Stacks them into batched tensors (the “collate” step)

The default collate function turns a list of 32 `(X, y)` tuples into a pair of stacked tensors:

`X_batch` becomes shape `(32, d)`

`y_batch` becomes shape `(32, 1)`
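
The collate step can be tried in isolation. This sketch assumes a PyTorch version that exports `default_collate` from `torch.utils.data`, and uses random tensors in place of the dataset's samples.

```python
import torch
from torch.utils.data import default_collate

d = 2
# 32 per-sample pairs with shapes (d,) and (1,), like the dataset yields.
samples = [(torch.randn(d), torch.randn(1)) for _ in range(32)]

# Stack the 32 X's together and the 32 y's together along a new first axis.
X_batch, y_batch = default_collate(samples)
print(X_batch.shape, y_batch.shape)  # torch.Size([32, 2]) torch.Size([32, 1])
```

Row `i` of each batch is sample `i`, so the pairing between features and labels is preserved.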


