# 3. Synthetic Regression Data
Synthetic data does not contain the rich patterns found in real-world data. It is nonetheless useful for didactic purposes: it helps us evaluate the properties of our learning algorithms and confirm that our implementations work as expected. For example, if we create data for which the correct parameters are known a priori, then we can check that the model can in fact recover them.

In [1]:
%matplotlib inline
import random
import torch
from d2l import torch as d2l

## Generate Dataset
We generate 1000 training examples (plus, by default, 1000 validation examples), each with 2-dimensional features drawn from a standard normal distribution.

Note that we set the ground-truth parameters to w = [2, -3.4] and b = 4.2. Later, we can check our estimated parameters against these ground-truth values.
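
Written out, the generating process implemented by the class in the next cell is the following, where the noise standard deviation of 0.01 is the default value of the `noise` argument:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}, \qquad \mathbf{w} = \begin{bmatrix} 2 \\ -3.4 \end{bmatrix}, \quad b = 4.2, \quad \epsilon_i \sim \mathcal{N}(0,\, 0.01^2)$$

Each row of $\mathbf{X}$ is drawn independently from a standard normal distribution.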

In [2]:
class SyntheticRegressionData(d2l.DataModule):  #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise

In [3]:
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

In [4]:
print('features:', data.X[0],'\nlabel:', data.y[0])

features: tensor([ 0.4505, -0.9686]) 
label: tensor([8.3928])
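
Because the ground-truth parameters are known, we can already sanity-check that they are recoverable from data of this kind. The sketch below is an illustration under stated assumptions, not the training procedure of this section: it regenerates data with the same formula instead of reusing `data`, fixes a seed, and uses a closed-form least-squares fit via `torch.linalg.lstsq`. The names `true_w`, `true_b`, `X_aug`, `w_hat` and `b_hat` are introduced here for illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Ground-truth parameters, matching the values used in this section
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2

# Generate data with the same formula as SyntheticRegressionData:
# y = Xw + b + noise, with noise standard deviation 0.01
n = 1000
X = torch.randn(n, 2)
y = X @ true_w.reshape(-1, 1) + true_b + 0.01 * torch.randn(n, 1)

# Append a column of ones so the bias is estimated as an extra weight
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)

# Closed-form least-squares fit (assumption: used only as a quick check)
solution = torch.linalg.lstsq(X_aug, y).solution.squeeze(1)
w_hat, b_hat = solution[:2], solution[2]

print('estimated w:', w_hat)
print('estimated b:', b_hat)
```

With noise this small, the estimates should land very close to w = [2, -3.4] and b = 4.2. With real data there is no ground truth to compare against, which is why the check is only possible on synthetic data.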


## Reading the Dataset
Training ML models often requires multiple passes over a dataset, grabbing one minibatch of examples at a time. Each minibatch is then used to update the model.

Each minibatch consists of a tuple of features and labels.

Note that we need to be mindful of whether we are in training or validation mode:
- Training: we want to read the data in random order
- Validation: we read the data in a predefined order, which helps with debugging

In [5]:
# Implement get_dataloader() and register it as a method of the class
# via the add_to_class decorator
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]

In [6]:
# train_dataloader() is inherited from d2l.DataModule
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


This implementation is problematic because it requires that we load all the data in memory and that we perform lots of random memory access.

Thus, we should use iterators, which can also deal with data stored in files, data received via a stream, and data generated or processed on the fly.
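
To make the "generated on the fly" case concrete, here is a minimal sketch using PyTorch's `IterableDataset`. It is not part of `d2l`; the class name `StreamingRegressionData` and its arguments are hypothetical and chosen only to mirror `SyntheticRegressionData`. Each example is created when it is requested, so the full dataset is never held in memory.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamingRegressionData(IterableDataset):  # hypothetical name
    """Generates regression examples one at a time instead of storing them."""
    def __init__(self, w, b, noise=0.01, num_examples=1000):
        super().__init__()
        self.w, self.b = w, b
        self.noise, self.num_examples = noise, num_examples

    def __iter__(self):
        for _ in range(self.num_examples):
            x = torch.randn(len(self.w))  # one feature vector
            # Label with shape (1,), same formula as the in-memory version
            y = x @ self.w + self.b + self.noise * torch.randn(1)
            yield x, y

stream = StreamingRegressionData(w=torch.tensor([2.0, -3.4]), b=4.2)
# The DataLoader groups consecutive examples from the stream into minibatches
loader = DataLoader(stream, batch_size=32)
X, y = next(iter(loader))
print('X shape:', X.shape, '\ny shape:', y.shape)
```

The minibatch shapes match those printed above, `[32, 2]` and `[32, 1]`. One trade-off of this design is that a stream has no random access, so the `DataLoader` cannot shuffle it; any shuffling would have to happen inside the dataset itself.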

## Concise Implementation of the Data Loader
- More efficient: relies on PyTorch's built-in `DataLoader` instead of a hand-written iterator

In [7]:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

In [8]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


In [9]:
len(data.train_dataloader())

32
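
The value 32 is the number of minibatches per epoch, not the batch size (the two merely coincide here). A standalone sketch of the arithmetic, using random placeholder tensors in place of `data.X` and `data.y` and assuming the default `drop_last=False`:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

num_train, batch_size = 1000, 32
# Placeholder tensors with the same shapes as the training split
X = torch.randn(num_train, 2)
y = torch.randn(num_train, 1)

loader = DataLoader(TensorDataset(X, y), batch_size, shuffle=True)

# drop_last defaults to False, so the final partial batch is counted
print(len(loader))                         # 32
print(math.ceil(num_train / batch_size))   # 32

# 31 full batches cover 992 examples; the last batch holds the remaining 8
batch_sizes = [len(xb) for xb, _ in loader]
print(batch_sizes[-1])                     # 8
```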