# 3.3 Synthetic Regression Data

## 1. What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument using the framework’s API?

`torch.utils.data.DataLoader` takes a `drop_last` argument. If it is `True`, the last incomplete batch is dropped when the dataset size is not divisible by the batch size. If it is `False`, that last batch is kept and is simply smaller than the others.

By default, `drop_last` is `False`, so the final batch contains whatever examples remain.
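A small check makes the difference concrete. The sketch below uses a toy dataset of 10 examples and a batch size of 3 (both values chosen only for illustration) and records the size of every batch with and without `drop_last`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 examples, which is not divisible by the batch size of 3.
dataset = TensorDataset(torch.arange(10).float().reshape(-1, 1))

# Default behavior (drop_last=False): the last batch is smaller.
keep_last = [batch[0].shape[0] for batch in DataLoader(dataset, batch_size=3)]

# drop_last=True: the incomplete last batch is discarded.
drop = [batch[0].shape[0]
        for batch in DataLoader(dataset, batch_size=3, drop_last=True)]

print(keep_last)  # [3, 3, 3, 1]
print(drop)       # [3, 3, 3]
```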

## 2. Suppose that we want to generate a huge dataset, where both the size of the parameter vector w and the number of examples num_examples are large.

1. What happens if we cannot hold all data in memory?

2. How would you shuffle the data if it is held on disk? Your task is to design an efficient algorithm that does not require too many random reads or writes. Hint: [pseudorandom permutation generators](https://en.wikipedia.org/wiki/Pseudorandom_permutation) allow you to design a reshuffle without the need to store the permutation table explicitly (Naor and Reingold, 1999).

pass
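For the second sub-question, the hint can be turned into a small sketch. The code below is one possible construction, not the method from the cited paper: it builds a keyed permutation of `[0, n)` from a Feistel network with a SHA-256 round function and uses cycle-walking to stay inside the range. The names `permute` and `_round_value`, the number of rounds, and the hash-based round function are all assumptions made for illustration. The point is that position `i` of the shuffled order is computed directly from `i` and a key, so no permutation table has to be stored.

```python
import hashlib

def _round_value(value, key, rnd, half_bits):
    """Hypothetical round function: hash (key, round, value) to half_bits bits."""
    digest = hashlib.sha256(f"{key}:{rnd}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)

def permute(i, n, key, rounds=4):
    """Map index i in [0, n) to a unique index in [0, n) without a table."""
    # Work on a power-of-two domain of 2 * half_bits bits that covers [0, n).
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = i
    while True:
        left, right = x >> half_bits, x & mask
        for rnd in range(rounds):
            # One Feistel round; each round is invertible, so the whole map
            # is a bijection on the power-of-two domain.
            left, right = right, left ^ _round_value(right, key, rnd, half_bits)
        x = (left << half_bits) | right
        # Cycle-walking: reapply the permutation until we land back in [0, n).
        if x < n:
            return x

n = 1000
order = [permute(i, n, key=42) for i in range(n)]
```

Changing `key` gives a different shuffle for each epoch. Applied to individual examples this still causes one random read per example; applying the same permutation to the indices of large contiguous blocks on disk, and shuffling within each block in memory, would be one way to keep the number of random reads small.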

## 3. Implement a data generator that produces new data on the fly, every time the iterator is called.

```python
import torch
import random
from d2l import torch as d2l

class SyntheticRegressionData(d2l.DataModule):
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        # Features are standard normal; labels are linear in X plus Gaussian noise
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise

    def get_dataloader(self, train):
        if train:
            indices = list(range(0, self.num_train))
            # The examples are read in random order
            random.shuffle(indices)
        else:
            indices = list(range(self.num_train, self.num_train+self.num_val))
        # Yield one minibatch at a time; the last one may be smaller
        for i in range(0, len(indices), self.batch_size):
            batch_indices = torch.tensor(indices[i: i+self.batch_size])
            yield self.X[batch_indices], self.y[batch_indices]

data = SyntheticRegressionData(w=torch.tensor([2.0, 3.0]), b=0.5)
# Print the shapes of the first 10 training minibatches
cnt = 0
for X, y in data.train_dataloader():
    cnt += 1
    print(X.shape, y.shape)
    if cnt == 10:
        break
```

```
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
torch.Size([32, 2]) torch.Size([32, 1])
```
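Note that the class above draws all of its examples once in `__init__` and only reshuffles their order on each pass. If the exercise is read strictly, so that fresh examples are drawn every time the iterator is used, a plain generator function is enough. The sketch below is one way to do that, not part of the `d2l` API; the name `fresh_batches` and its `num_batches` parameter are assumptions for illustration.

```python
import torch

def fresh_batches(w, b, noise=0.01, batch_size=32, num_batches=10):
    """Yield (X, y) minibatches that are newly sampled on every call."""
    for _ in range(num_batches):
        # Nothing is stored: each batch is drawn right before it is yielded.
        X = torch.randn(batch_size, len(w))
        eps = torch.randn(batch_size, 1) * noise
        yield X, X @ w.reshape(-1, 1) + b + eps

w = torch.tensor([2.0, 3.0])
first = [X for X, _ in fresh_batches(w, b=0.5, num_batches=2)]
second = [X for X, _ in fresh_batches(w, b=0.5, num_batches=2)]
# The two passes contain different examples.
print(torch.equal(first[0], second[0]))
```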


## 4. How would you design a random data generator that generates the same data each time it is called?

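Every pseudorandom number generator is initialized with a random seed, and the same seed always produces the same sequence of random numbers. If the seed is not set explicitly, the system chooses one by default, typically based on the current time, so the generated numbers differ from run to run.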

```python
torch.manual_seed(0)
random.seed(0)
data1 = SyntheticRegressionData(w=torch.tensor([2.0, 3.0]), b=0.5, num_train=5, num_val=5)
# data2 = SyntheticRegressionData(w=torch.tensor([2.0, 3.0]), b=0.5, num_train=5, num_val=5)
# data1.y and data2.y would differ: each constructor call draws new random
# numbers from the generator, which is seeded only once above.
print(data1.y)
# print(data2.y)
# Rerunning this whole cell, however, reproduces the same value at each position.
print(torch.randn(1))
print(torch.randn(1))
```


```
tensor([[-5.2177],
        [-1.3179],
        [-2.9639],
        [ 5.3909],
        [ 0.9725],
        [ 4.1932],
        [ 8.0397],
        [-0.1258],
        [ 6.1278],
        [ 4.6633]])
tensor([-0.0209])
tensor([-0.7185])
```
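Seeding the global generator works, but the result then depends on how many random numbers were drawn earlier in the same process. An alternative is to give the data generator its own `torch.Generator`, re-created from a fixed seed on every call. The sketch below assumes a hypothetical helper `seeded_batches`; it is one possible design, not the book's reference solution.

```python
import torch

def seeded_batches(w, b, seed=0, noise=0.01, batch_size=32, num_batches=10):
    """Yield the same sequence of (X, y) minibatches on every call."""
    # A private generator, re-seeded per call, is unaffected by other
    # random draws made elsewhere in the program.
    gen = torch.Generator().manual_seed(seed)
    for _ in range(num_batches):
        X = torch.randn(batch_size, len(w), generator=gen)
        eps = torch.randn(batch_size, 1, generator=gen) * noise
        yield X, X @ w.reshape(-1, 1) + b + eps

w = torch.tensor([2.0, 3.0])
run1 = list(seeded_batches(w, b=0.5, num_batches=3))
torch.randn(5)  # unrelated draw from the global generator
run2 = list(seeded_batches(w, b=0.5, num_batches=3))
print(all(torch.equal(y1, y2) for (_, y1), (_, y2) in zip(run1, run2)))
```

Passing a different `seed` yields a different but equally reproducible dataset.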
