### Linear Regression
- As long as the design matrix X has full rank (no feature is linearly dependent on the others), the loss surface has exactly one critical point, and it corresponds to the minimum of the loss over the entire domain.

##### Minibatch Stochastic Gradient Descent
- The choice of minibatch size depends on many factors, such as the amount of memory, the number of accelerators, the choice of layers, and the total dataset size. Despite all that, a number between 32 and 256, preferably a multiple of a large power of 2, is a good start.
- Although the algorithm converges slowly towards the minimizers, it typically will not find them exactly in a finite number of steps. Moreover, the minibatches B used for updating the parameters are chosen at random, which breaks determinism.
- Linear regression happens to be a learning problem with a global minimum (whenever X is full rank, or equivalently, whenever $\mathbf{X}^{T}\mathbf{X}$ is invertible).
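
The points above can be illustrated with a minimal NumPy sketch of minibatch SGD on a synthetic linear regression problem. The function name `minibatch_sgd`, the learning rate, the batch size, and the synthetic data are assumptions chosen for illustration; they are not taken from the notes.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, num_epochs=20, seed=0):
    """Fit y = X @ w + b (plus noise) with minibatch SGD on the squared loss."""
    rng = np.random.default_rng(seed)  # fixed seed only to make the sketch repeatable
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_epochs):
        perm = rng.permutation(n)  # examples are visited in random order
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]          # residuals on the minibatch
            w -= lr * (X[idx].T @ err) / len(idx)  # gradient of 0.5 * mean(err**2) w.r.t. w
            b -= lr * err.mean()                   # gradient of 0.5 * mean(err**2) w.r.t. b
    return w, b

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=1000)

w, b = minibatch_sgd(X, y)
print(w, b)  # close to, but generally not exactly equal to, the true parameters
```

Changing `seed` changes the order of the minibatches and therefore the exact values that come out, which is the non-determinism mentioned above.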
##### Vectorization for Speed
- When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops in Python.

In [1]:
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

n = 10000
a = torch.ones(n)
b = torch.ones(n)

c = torch.zeros(n)
t = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
print(f'{time.time()-t:.10f} sec')

t = time.time()
d = a + b
print(f'{time.time()-t:.30f} sec')

0.0496573448 sec
0.000991106033325195312500000000 sec


##### Loss Function
Naturally, fitting our model to the data requires that we agree on some measure of fitness (or, equivalently, of unfitness).

##### Analytic Solution
Loss function for the linear model:
$$ L(\mathbf{w},b) = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T\mathbf{x}^{(i)}+b-y^{(i)})^2 $$
We can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column of all 1s to the design matrix. Then our prediction problem is to minimize $||\mathbf{y} - \mathbf{X}\mathbf{w}||^2$.
As long as the design matrix $\mathbf{X}$ has full rank (no feature is linearly dependent on the others), there will be just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain. Setting the gradient with respect to $\mathbf{w}$ to zero gives
$$ \partial_{\mathbf{w}} ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2 = 2\mathbf{X}^{T}(\mathbf{X}\mathbf{w}-\mathbf{y}) = 0 \quad \text{and hence} \quad \mathbf{X}^{T}\mathbf{y} = \mathbf{X}^{T}\mathbf{X}\mathbf{w} $$
$$ \mathbf{w}^{*} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} $$
The solution is unique when the matrix $\mathbf{X}^{T}\mathbf{X}$ is invertible.
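
The analytic solution can be checked numerically. The sketch below uses NumPy on synthetic data (the sample size, the true parameters, and the noise level are assumptions made for illustration). It appends a column of ones for the bias, solves the normal equations, and confirms that the gradient vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = features @ true_w + true_b + 0.01 * rng.normal(size=n)

# Append a column of ones so that the bias becomes the last entry of w.
X = np.hstack([features, np.ones((n, 1))])

# Full column rank is the condition under which X^T X is invertible.
assert np.linalg.matrix_rank(X) == X.shape[1]

# Solve X^T X w = X^T y instead of forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer the gradient 2 X^T (X w - y) should be numerically zero.
grad = 2 * X.T @ (X @ w_star - y)
print(w_star)
print(np.abs(grad).max())
```

Solving the linear system with `np.linalg.solve` is used here in place of an explicit matrix inverse because it is generally the more numerically stable choice; the result is the same $\mathbf{w}^{*}$.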



##### The normal noise
$$ y=\mathbf{w}^{T}\mathbf{x}+b+\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,\sigma^{2}) $$
The likelihood of seeing a particular $y$ for a given $\mathbf{x}$ is
$$ P(y \mid \mathbf{x})=\frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{1}{2\sigma^{2}} (y-\mathbf{w}^{T}\mathbf{x}-b)^2\right) $$
According to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:
$$ P(\mathbf{y} \mid \mathbf{X})=\prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)}) $$
For historical reasons, optimizations are more often expressed as minimization rather than maximization.
So, we can minimize the negative log-likelihood:
$$ -\log P(\mathbf{y} \mid \mathbf{X})=\sum_{i=1}^{n} \frac{1}{2}\log(2\pi\sigma^{2})+\frac{1}{2\sigma^{2}}\left(y^{(i)}-\mathbf{w}^{T}\mathbf{x}^{(i)}-b\right)^{2} $$
It follows that minimizing the mean squared error is equivalent to the maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
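
This equivalence can be made concrete with a small numerical check. For a fixed $\sigma$, the negative log-likelihood equals a constant plus $\frac{n}{2\sigma^{2}}$ times the mean squared error, so both objectives rank parameter settings identically. The sketch below assumes synthetic data and two hand-picked parameter settings; all names and values are illustrative.

```python
import numpy as np

def negative_log_likelihood(w, b, X, y, sigma):
    """-log P(y | X) for y = w^T x + b + eps with eps ~ N(0, sigma^2)."""
    resid = y - X @ w - b
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

def mse(w, b, X, y):
    """Mean squared error of the linear model."""
    return np.mean((y - X @ w - b) ** 2)

rng = np.random.default_rng(0)
sigma = 0.1
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + sigma * rng.normal(size=100)
n = len(y)

good = (np.array([2.0, -3.4]), 4.2)  # parameters that generated the data
bad = (np.array([1.0, 0.0]), 0.0)    # deliberately worse parameters

nll_good = negative_log_likelihood(*good, X, y, sigma)
nll_bad = negative_log_likelihood(*bad, X, y, sigma)

# NLL = (n/2) log(2 pi sigma^2) + n / (2 sigma^2) * MSE, an increasing affine function of the MSE
const = 0.5 * n * np.log(2 * np.pi * sigma**2)
scale = n / (2 * sigma**2)
print(np.isclose(nll_good, const + scale * mse(*good, X, y)))
print(np.isclose(nll_bad, const + scale * mse(*bad, X, y)))
```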

# Object-Oriented Design for Implementation
+  (i) Module contains models, losses, and optimization methods
+  (ii) DataModule provides data loaders for training and validation
+  (iii) both classes are combined using the Trainer class

The first utility function allows us to register functions as methods in a class after the class has been created.

In [2]:
import numpy as np
import torch
import time
from torch import nn
from d2l import torch as d2l

def add_to_class(Class): 
    """Register functions as methods in created class"""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

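# Example: attach `do` to class A after A has been defined
class A:
    def __init__(self):
        self.b = 1  # attribute set when an instance of A is created


@add_to_class(A)
def do(self):
    print('Class attribute "b" is', self.b)

a = A()
a.do()  # `do` is now available as a method on instances of A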



The second one is a utility class that saves all arguments in a class’s __init__ method as class attributes. This allows us to extend constructor call signatures implicitly without additional code.

In [3]:
import numpy as np
import torch
import time
from torch import nn
from d2l import torch as d2l

class HyperParameters: #@save
    # Interface stub; the working implementation used below is d2l.HyperParameters
    def save_hyperparameters(self, ignore=[]):
        raise NotImplementedError
    
    
class B(d2l.HyperParameters):
    def __init__(self,a,b,c):
        self.save_hyperparameters(ignore=['b'])
        print('self.a=',self.a,'self.c=',self.c)
        print('There is no self.b = ', not hasattr(self,'b'))

b = B(a=1,b=2,c=3)

self.a= 1 self.c= 3
There is no self.b =  True
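
One way such a `save_hyperparameters` method could work is to inspect the local variables of the calling `__init__` frame. The sketch below is only an illustration under that assumption: `HyperParametersSketch` and `C` are hypothetical names, and the actual `d2l.HyperParameters` implementation may differ.

```python
import inspect

class HyperParametersSketch:
    """Hypothetical stand-in for d2l.HyperParameters, for illustration only."""
    def save_hyperparameters(self, ignore=()):
        # Read the local variables of the caller (normally __init__).
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skip = set(ignore) | {'self'}
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skip and not k.startswith('_')}
        # Expose every saved argument as an attribute on the instance.
        for k, v in self.hparams.items():
            setattr(self, k, v)

class C(HyperParametersSketch):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['b'])

obj = C(a=1, b=2, c=3)
print(obj.a, obj.c, hasattr(obj, 'b'))  # 1 3 False
```

Because the method reads the caller's frame, it has to be called directly from `__init__` for the arguments to be picked up.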


The final utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) TensorBoard we name it ProgressBoard. The implementation is deferred to Section 23.7.

In [None]:
import numpy as np
import torch
import time
from torch import nn
from d2l import torch as d2l

class HyperParameters:
    # Interface stub; the working implementation is d2l.HyperParameters
    def save_hyperparameters(self, ignore=[]):
        raise NotImplementedError

class ProgressBoard(d2l.HyperParameters):
    """The board that plots data points in animation"""
    def __init__(self, xlabel=None, ylabel=None, xlim=None, ylim=None,
                 xscale='linear', yscale='linear',
                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
        self.save_hyperparameters()

    def draw(self, x, y, label, every_n=1):
        # Interface only; the working version used below is d2l.ProgressBoard
        raise NotImplementedError

board = d2l.ProgressBoard('y')
for x in np.arange(0,10,0.1):
    board.draw(x,np.sin(x),'sin',every_n=2)
    board.draw(x,np.cos(x),'cos',every_n=10)

### Module

In [None]:
import numpy as np
import torch
import time
from torch import nn
from d2l import torch as d2l

class Module(nn.Module,d2l.HyperParameters):
    """The base class of models"""
    def __init__(self,plot_train_per_epoch=2, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = d2l.ProgressBoard()
        
    def loss(self,y_hat, y):
        raise NotImplementedError
    
    def forward(self,X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)
    
    def plot(self,key, value, train):
        """Plot a point in animation"""
        assert hasattr(self, 'trainer'), 'Trainer is not inited'
        self.board.xlabel = 'epoch'
        if train:
            x = self.trainer.train_batch_idx / \
                self.trainer.num_train_batches
            n = self.trainer.num_train_batches / \
                self.plot_train_per_epoch
        else:
            x = self.trainer.epoch + 1
            n = self.trainer.num_val_batches / \
                self.plot_valid_per_epoch
        self.board.draw(x, value.to(d2l.cpu()).detach().numpy(),
                        ('train_' if train else 'val_') + key,
                        every_n=int(n))
        
    def training_step(self,batch):
        l = self.loss(self(*batch[:-1]),batch[-1])
        self.plot('loss', l, train=True)
        return l  # the trainer needs the loss to backpropagate
        
    def validation_step(self,batch):
        l = self.loss(self(*batch[:-1]),batch[-1])
        self.plot('loss', l, train=False)
        
    def configure_optimizers(self):
        raise NotImplementedError
        

##### DataModule

In [None]:
import numpy as np
import torch
import time
from torch import nn
from d2l import torch as d2l

class DataModule(d2l.HyperParameters):
    """The base of the data"""
    def __init__(self,root='../data',num_workers=4):
        self.save_hyperparameters()
        
    def get_dataloader(self, train):
        raise NotImplementedError
    
    def train_dataloader(self):
        return self.get_dataloader(train=True)
    
    def val_dataloader(self):
        return self.get_dataloader(train=False)

##### Training

In [None]:
class Trainer(d2l.HyperParameters):
    """The base class for training models with data"""
    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'

#### Synthetic Regression Data
##### Reading the Dataset

In [None]:
import random
import torch
from torch import nn
from d2l import torch as d2l

class DataModule(d2l.HyperParameters):
    """The base of the data"""
    def __init__(self,root='../data',num_workers=4):
        self.save_hyperparameters()
        
    def get_dataloader(self, train):
        raise NotImplementedError
    
    def train_dataloader(self):
        return self.get_dataloader(train=True)
    
    def val_dataloader(self):
        return self.get_dataloader(train=False)
# 
class SyntheticRegressionData(d2l.DataModule): #@save
    """Synthetic data for linear regression"""
    def __init__(self,w,b,noise=0.01,num_train=1000,num_val=1000,batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n,len(w))
        noise = torch.randn(n,1) * noise
        self.y = torch.matmul(self.X,w.reshape((-1,1))) + b + noise
"""
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]
"""


data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=torch.tensor([4.2]))

@d2l.add_to_class(d2l.DataModule)  # attach to the class SyntheticRegressionData inherits from
def get_tensorloader(self, tensors, train, indices=slice(0,None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)



X,y = next(iter(data.train_dataloader()))

print(X.shape,y.shape)

## Linear Regression Implementation from Scratch

In [3]:
import torch
from d2l import torch as d2l

class SyntheticRegressionData(d2l.DataModule): 
    """Synthetic data for linear regression"""
    def __init__(self,w,b,noise=0.01,num_train=1000,num_val=1000,batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n,len(w))
        noise = torch.randn(n,1) * noise
        self.y = torch.matmul(self.X,w.reshape((-1,1))) + b + noise

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

class LinearRegressionScratch(d2l.Module):
    """the linear regression model implemented from scratch"""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = torch.zeros(1,requires_grad=True)
        
@d2l.add_to_class(LinearRegressionScratch) 
def forward(self,X):
    return torch.matmul(X,self.w) + self.b

@d2l.add_to_class(LinearRegressionScratch)
def loss(self, y_hat, y):
    l = 1/2*(y_hat-y)**2
    return l.mean()

class Trainer(d2l.HyperParameters): 
    """The base class for training models with data."""
    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'

    def prepare_data(self, data):
        self.train_dataloader = data.train_dataloader()
        self.val_dataloader = data.val_dataloader()
        self.num_train_batches = len(self.train_dataloader)
        self.num_val_batches = (len(self.val_dataloader)
                                if self.val_dataloader is not None else 0)

    def prepare_model(self, model):
        model.trainer = self
        model.board.xlim = [0, self.max_epochs]
        self.model = model

    def fit(self, model, data):
        self.prepare_data(data)
        self.prepare_model(model)
        self.optim = model.configure_optimizers()
        self.epoch = 0
        self.train_batch_idx = 0
        self.val_batch_idx = 0
        for self.epoch in range(self.max_epochs):
            self.fit_epoch()

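    def fit_epoch(self):
        """One pass over the training data, then one over the validation data."""
        self.model.train()
        for batch in self.train_dataloader:
            loss = self.model.training_step(batch)
            self.optim.zero_grad()
            loss.backward()
            with torch.no_grad():
                # In-place updates of leaf tensors must not be tracked by autograd
                self.optim.step()
            self.train_batch_idx += 1
        if self.val_dataloader is None:
            return
        self.model.eval()
        for batch in self.val_dataloader:
            with torch.no_grad():
                self.model.validation_step(batch)
            self.val_batch_idx += 1

class SGD(d2l.HyperParameters):
    """Minibatch stochastic gradient descent"""
    def __init__(self, params, lr):
        self.save_hyperparameters()
    
    def step(self):
        # Must be called under torch.no_grad(), since params are updated in place
        for param in self.params:
            param -= self.lr * param.grad
    
    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()

@d2l.add_to_class(LinearRegressionScratch)
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)

model1 = LinearRegressionScratch(2, lr=0.03)
data1 = SyntheticRegressionData(w=torch.tensor([2,-3.4]),b=4.2)
trainer = Trainer(max_epochs=3)
trainer.fit(model1,data1)

# Compare the learned parameters with the ones used to generate the data
print(model1.w.reshape(data1.w.shape))
print(model1.b)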