# 1.3 Synthetic Regression Data


## 1.3.1 Generating the Dataset

The following code snippet generates 1000 training examples
(and, with the default arguments, another 1000 validation examples)
with 2-dimensional features drawn 
from a standard normal distribution.
The resulting design matrix $\mathbf{X}$ for the training set
belongs to $\mathbb{R}^{1000 \times 2}$. 
We generate each label by applying 
a *ground truth* linear function to the features
and corrupting the result via additive noise $\boldsymbol{\epsilon}$, 
drawn independently and identically for each example:

$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}.$$

For convenience we assume that $\boldsymbol{\epsilon}$ is drawn 
from a normal distribution with mean $\mu= 0$ 
and standard deviation $\sigma = 0.01$.

In [2]:
%matplotlib inline
import random
import torch
from d2l import torch as d2l

In [3]:
class SyntheticRegressionData(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
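        super().__init__()  # Call the parent class initializer
        self.save_hyperparameters()  # Save the hyperparameters (constructor arguments) as attributes
        n = num_train + num_val  # Total number of examples
        self.X = torch.randn(n, len(w))  # Random features, shape (n, number of features)
        noise = torch.randn(n, 1) * noise  # Additive noise, shape (n, 1)
        # Labels: features times weights, plus bias and noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise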
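
To make the shapes and the noise scale concrete, here is a minimal standalone sketch of the same generative process without the `d2l` helper class. The seed, the sample size `n`, and the variable names `eps`, `residual`, and `bad` are assumptions chosen for illustration. The sketch also shows why `w` is reshaped into a column: without the reshape, broadcasting silently produces a square matrix instead of a column of labels.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable
n, noise_std = 2000, 0.01
w, b = torch.tensor([2.0, -3.4]), 4.2

X = torch.randn(n, len(w))               # features, shape (n, 2)
eps = torch.randn(n, 1) * noise_std      # additive noise, shape (n, 1)
y = torch.matmul(X, w.reshape(-1, 1)) + b + eps  # labels, shape (n, 1)

# Subtracting the noise-free prediction should leave only the noise.
residual = y - (torch.matmul(X, w.reshape(-1, 1)) + b)
print(X.shape, y.shape)
print(residual.std())  # close to noise_std

# Without the reshape, (5,) + (5, 1) broadcasts to (5, 5).
bad = torch.matmul(X[:5], w) + b + eps[:5]
print(bad.shape)
```

The empirical standard deviation of `residual` should land near 0.01, matching the `noise` argument of the class.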

Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.
Later, we can check our estimated parameters against these *ground truth* values.

In [4]:
# Create a SyntheticRegressionData instance with the true parameters w and b
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

Each row in `data.X` (the features) is a vector in $\mathbb{R}^2$ and each row in `data.y` (the labels) is a scalar. Let's have a look at the first entry.

In [5]:
# Print the features and label of the first example
print('features:', data.X[0],'\nlabel:', data.y[0])

features: tensor([-0.1407,  0.1343]) 
label: tensor([3.4590])
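
As a quick worked check, we can plug the printed features into the ground truth function by hand and compare with the printed label. This is only a sketch in plain Python: the numbers are copied from the rounded output above, so the residual is approximate.

```python
# Ground truth parameters and the first example, copied from the text above.
w = [2.0, -3.4]
b = 4.2
x = [-0.1407, 0.1343]
label = 3.4590

# Noise-free prediction w . x + b, then the leftover attributed to noise.
noise_free = sum(wi * xi for wi, xi in zip(w, x)) + b
residual = label - noise_free

print(f"{noise_free:.4f}")  # 3.4620
print(f"{residual:.4f}")    # -0.0030
```

The residual of about $-0.003$ is well within a few standard deviations of the noise ($\sigma = 0.01$), as expected.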


## 1.3.2 Reading the Dataset
Training machine learning models often requires multiple passes over a dataset, 
grabbing one minibatch of examples at a time. 
This data is then used to update the model. 
To illustrate how this works, we implement the `get_dataloader` method,
registering it in the `SyntheticRegressionData` class via `add_to_class`.
It takes a `train` flag, selects the corresponding features and labels,
and generates minibatches of size `batch_size`.
As such, each minibatch consists of a tuple of features and labels. 
Note that we need to be mindful of whether we're in training or validation mode: 
in the former, we will want to read the data in random order, 
whereas for the latter, being able to read data in a pre-defined order 
may be important for debugging purposes.


In [6]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):  # Yields minibatches in training or validation mode
    if train:
        indices = list(range(0, self.num_train))  # Indices of the training examples
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):  # Step through the indices one minibatch at a time
        batch_indices = torch.tensor(indices[i: i+self.batch_size])  # Indices of the current minibatch
        yield self.X[batch_indices], self.y[batch_indices]  # Features and labels of the current minibatch

To build some intuition, let's inspect the first minibatch of
data. The shape of each minibatch of features tells us both the minibatch size and the dimensionality of the input features.
Likewise, our minibatch of labels has a matching first dimension given by `batch_size`.

In [7]:
X, y = next(iter(data.train_dataloader()))  # Fetch the first minibatch
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])
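
To see that one pass over the data visits every example exactly once, the index logic of a hand-written iterator can be sketched in plain Python. The helper `minibatch_indices` and its `seed` argument are hypothetical names introduced for this illustration; they are not part of `d2l`.

```python
import random

def minibatch_indices(num_examples, batch_size, shuffle, seed=None):
    """Yield one list of example indices per minibatch."""
    indices = list(range(num_examples))
    if shuffle:
        # A private RNG keeps the global random state untouched.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

# Training mode: random order, yet every index appears exactly once.
train_batches = list(minibatch_indices(1000, 32, shuffle=True, seed=0))
seen = [i for batch in train_batches for i in batch]
covers_all_once = sorted(seen) == list(range(1000))

# Validation mode: fixed order, so the first minibatch is always 0..31.
val_batches = list(minibatch_indices(1000, 32, shuffle=False))
in_order = val_batches[0] == list(range(32))

print(covers_all_once, in_order)  # True True
```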


Throughout the iteration we obtain distinct minibatches
until the entire dataset has been exhausted (try this).
While the iteration implemented above is good for didactic purposes,
it is inefficient in ways that might get us into trouble with real problems.
For example, it requires that we load all the data in memory
and that we perform lots of random memory access.
The built-in iterators implemented in a deep learning framework
are considerably more efficient and they can deal
with sources such as data stored in files, 
data received via a stream, 
and data generated or processed on the fly. 
Next let's try to implement the same method using built-in iterators.

## 1.3.3 Concise Implementation of the Data Loader
Rather than writing our own iterator,
we can call the existing API in a framework to load data.
As before, we need a dataset with features `X` and labels `y`. 
Beyond that, we set `batch_size` in the built-in data loader 
and let it take care of shuffling examples efficiently.

In [9]:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)  # Select the subset given by `indices`
    dataset = torch.utils.data.TensorDataset(*tensors)  # Wrap the tensors in a PyTorch TensorDataset
    # Build the data loader; shuffle only in training mode
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

In [12]:
@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    # Training mode uses the first num_train examples; validation uses the rest
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)  # Delegate to get_tensorloader

In [13]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])
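
The slicing and shuffling behavior can also be seen on a toy dataset. The following is a minimal sketch that calls `TensorDataset` and `DataLoader` directly, without the `d2l` helpers; the tensor values, the 6/4 split, and the batch size of 4 are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 examples whose label equals the row index.
X = torch.arange(20, dtype=torch.float32).reshape(10, 2)
y = torch.arange(10, dtype=torch.float32).reshape(10, 1)
train_idx, val_idx = slice(0, 6), slice(6, None)

# Validation: shuffle=False keeps the examples in their stored order.
val_loader = DataLoader(TensorDataset(X[val_idx], y[val_idx]),
                        batch_size=4, shuffle=False)
val_batches = [yb.flatten().tolist() for _, yb in val_loader]
print(val_batches)  # [[6.0, 7.0, 8.0, 9.0]]

# Training: shuffle=True reorders the examples but still covers each one once.
train_loader = DataLoader(TensorDataset(X[train_idx], y[train_idx]),
                          batch_size=4, shuffle=True)
train_labels = sorted(v for _, yb in train_loader
                      for v in yb.flatten().tolist())
print(train_labels)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```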


The new data loader behaves just like the previous one, except that it is more efficient and has some added functionality.
For instance, the data loader provided by the framework API 
supports the built-in `__len__` method, 
so we can query its length, 
i.e., the number of batches.

In [11]:
len(data.train_dataloader())  # Number of minibatches in the training data loader

32
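
The value 32 follows from simple arithmetic: 1000 training examples in minibatches of 32 give 31 full minibatches plus one partial minibatch. A short sketch of that calculation, with variable names chosen for illustration:

```python
import math

num_train, batch_size = 1000, 32

# The partial final minibatch still counts, hence the ceiling.
num_batches = math.ceil(num_train / batch_size)
last_batch_size = num_train - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 32 8

# Number of full minibatches only.
full_batches = num_train // batch_size
print(full_batches)  # 31
```

PyTorch's `DataLoader` also accepts `drop_last=True`, in which case the partial minibatch is discarded and the reported length would be the number of full minibatches instead.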