# Residual Learning
> Understanding the role of residuals in model training.

- toc: true 
- badges: true
- comments: true
- sticky_rank: 1
- author: Abhishek Sharma
- image: images/aeda.png
- categories: [deeplearning, math, fastai]

## What is a residual?

A residual is the difference between the actual value and the estimated (predicted) value.
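
As a quick illustration, here is a minimal sketch with made-up numbers. The values and the variable names `actual` and `estimated` are assumptions for this example and are not taken from the data used later in this post.

```python
# Toy values, invented for illustration only
actual = [3.0, 5.0, 7.5]       # observed targets
estimated = [2.5, 5.5, 7.0]    # what some model predicted

# Residual = actual - estimated, computed element by element
residuals = [a - e for a, e in zip(actual, estimated)]
print(residuals)
```

A positive residual means the model under-predicted that sample; a negative one means it over-predicted.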

## What is residual learning?

In ensemble learning, each new base model is fit to the residuals left by the models before it, which makes the ensemble as a whole more accurate. In deep learning, various architectures use a block or layer that fits a residual to improve the performance of the deep neural network (DNN).
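
Both ideas can be written compactly. The notation below ($F_m$, $h_m$, $r^{(m)}_i$, $H$, $\mathcal{F}$) is introduced here for illustration and does not appear in the original post; the boosting update assumes squared-error loss and no shrinkage (a learning rate of 1).

In boosting, stage $m$ fits a base model $h_m$ to the residuals of the ensemble built so far and adds it to that ensemble:

$$
r^{(m)}_i = y_i - F_{m-1}(x_i), \qquad F_m(x) = F_{m-1}(x) + h_m(x)
$$

In a residual block, the layers model only the difference $\mathcal{F}(x)$ between the desired mapping $H(x)$ and the input, and a skip connection adds the input back:

$$
H(x) = x + \mathcal{F}(x)
$$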

## How do Gradient Boosting Machines use residuals?

In [11]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

In [6]:
X, y = datasets.make_regression(n_samples=1000, random_state=41)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=41)

In [3]:
from sklearn.tree import DecisionTreeRegressor

In [8]:
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=41)
tree_reg1.fit(Xtr, ytr)

y2 = ytr - tree_reg1.predict(Xtr)

tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=41)
tree_reg2.fit(Xtr, y2)

y3 = y2 - tree_reg2.predict(Xtr)

tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=41)
tree_reg3.fit(Xtr, y3)

y_pred = sum(tree.predict(Xva) for tree in (tree_reg1, tree_reg2, tree_reg3))

### Gradient Boosting

**How do residuals play a part in Gradient Boosting?**

- Train a base learner `tree_reg1` to fit the data (`X`) and labels (`y`).
- Train a second base learner `tree_reg2` on the same data (`X`), using as targets the **residuals** between the labels and the predictions of `tree_reg1`. Essentially, we are using a base learner to learn the **residuals**.
- Train a third base learner `tree_reg3` in the same way, on the residuals that remain after `tree_reg2`.
- Finally, the predictions of all the base learners are added together to make the final prediction.

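The manual ensemble of `tree_reg1`, `tree_reg2` and `tree_reg3` is equivalent to calling `GradientBoostingRegressor` with `3` base learners of the same `max_depth` and a `learning_rate` of `1.0`, so the two should give essentially the same predictions.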

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=41)
gbrt.fit(Xtr, ytr)

gb_preds = gbrt.predict(Xva)

In [None]:
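# Largest absolute difference between the two sets of predictions;
# a plain sum of signed differences could cancel out and hide a mismatch
abs(y_pred - gb_preds).max()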
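
The residual-fitting steps described in this section generalise to any number of stages. The following is a minimal sketch of that loop in plain Python, assuming a one-dimensional input and a hand-written decision stump as a stand-in for `DecisionTreeRegressor`. The helper names (`fit_stump`, `predict_stump`, `boost`, `predict`, `train_mse`) and the toy data are hypothetical and exist only for this illustration; this is not how scikit-learn implements gradient boosting.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimises squared error on 1-D inputs."""
    best = None
    # Candidate thresholds: every distinct x except the smallest,
    # so that both sides of the split are always non-empty
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left mean, right mean)

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def boost(xs, ys, n_stages):
    """Fit n_stages stumps, each on the residuals left by the previous stages."""
    stumps, residuals = [], list(ys)
    for _ in range(n_stages):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # New residuals = old residuals minus what this stump explains
        residuals = [r - predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return stumps

def predict(stumps, x):
    # Final prediction is the sum over all base learners
    return sum(predict_stump(s, x) for s in stumps)

def train_mse(stumps, xs, ys):
    return sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up toy data, roughly y = x**2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1, 24.9]

for n in (1, 2, 3):
    print(n, train_mse(boost(xs, ys, n), xs, ys))
```

Each extra stage can only lower (or leave unchanged) the training error here, because every stump predicts the mean residual within each side of its split.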

## What is the role of residual learning in training deep networks?

## Example: Compare two networks trained to fit "pi", with and without a residual block

In [93]:
from fastai.data.all import *
from fastai.vision.all import *

In [167]:
bs    = 1
items = [(1., np.pi)]

In [168]:
items

[(1.0, 3.141592653589793)]

In [169]:
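class f(ItemTransform):
    # Select the input: first element of each (input, target) tuple
    def encodes(self, x): return x[0]
class s(ItemTransform):
    # Select the target: second element of each (input, target) tuple
    def encodes(self, x): return x[1]

# One transform pipeline per output: inputs via `f`, targets via `s`
dsets = Datasets(items, tfms=[[f],[s]])

dls = dsets.dataloaders(bs=bs)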
dls.one_batch()
# dls = DataLoader(items, bs=1)
# dls.one_batch()

(tensor([1.], device='cuda:0', dtype=torch.float64),
 tensor([3.1416], device='cuda:0', dtype=torch.float64))

In [170]:
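class WithoutResBlock(Module):
    def __init__(self, n):
        store_attr('n')
        self.lin = nn.Linear(1, 1)  # a single linear layer, reused at every step

    def forward(self, x):
        out = self.lin(x)

        # Plain stacking: apply the same layer n more times, with no skip connection
        for i in range(self.n):
            out = self.lin(out)

        return out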

In [171]:
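class DummyResBlock(Module):
    def __init__(self, n):
        store_attr('n')
        self.lin = nn.Linear(1, 1)  # a single linear layer, reused at every step

    def forward(self, x):
        out = self.lin(x)
        for i in range(self.n):
            t = self.lin(out)
            out = t + out  # skip connection: the layer output is added to its own input
        return out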
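
One commonly cited reason skip connections help is that the identity path keeps the output sensitive to the input even when the layer's weight is small. The sketch below mirrors the forward passes of `WithoutResBlock` and `DummyResBlock` in plain Python, with a scalar weight `w` and bias `b` standing in for `nn.Linear(1, 1)`, and estimates how much the output changes per unit change in the input. The function names, the finite-difference check, and the values `w=0.1`, `b=0.5`, `n=10` are assumptions chosen for illustration; this does not reproduce the training runs in this post.

```python
def plain_forward(x, w, b, n):
    # Same structure as WithoutResBlock.forward: the layer applied n + 1 times
    out = w * x + b
    for _ in range(n):
        out = w * out + b
    return out

def residual_forward(x, w, b, n):
    # Same structure as DummyResBlock.forward: layer output added to its input
    out = w * x + b
    for _ in range(n):
        out = (w * out + b) + out
    return out

def input_sensitivity(forward, x, w, b, n, eps=1e-6):
    # Central finite-difference estimate of d(output) / d(input)
    return (forward(x + eps, w, b, n) - forward(x - eps, w, b, n)) / (2 * eps)

w, b, n = 0.1, 0.5, 10
print("plain   :", input_sensitivity(plain_forward, 1.0, w, b, n))
print("residual:", input_sensitivity(residual_forward, 1.0, w, b, n))
```

For the plain stack the sensitivity works out to `w**(n+1)`, while the residual version gives `w*(1+w)**n`. With a small `w`, the first shrinks geometrically with depth and the second does not.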

In [160]:
x, y = dls.one_batch()
m = DummyResBlock(n=2).cuda().double()
m(x)

tensor([1.6790], device='cuda:0', dtype=torch.float64, grad_fn=<AddBackward0>)

In [172]:
m = DummyResBlock(n=2).cuda().double()
learn = Learner(dls, m, loss_func=mse)

In [173]:
learn.fit(n_epoch=500)

epoch,train_loss,valid_loss,time
0,8.327976,,00:00
1,8.324323,,00:00
2,8.320645,,00:00
3,8.316943,,00:00
4,8.313217,,00:00
5,8.309466,,00:00
6,8.305691,,00:00
7,8.30189,,00:00
8,8.298066,,00:00
9,8.294218,,00:00


In [174]:
wores = WithoutResBlock(n=2).cuda().double()
learn = Learner(dls, wores, loss_func=mse)

In [175]:
learn.fit(n_epoch=500)

epoch,train_loss,valid_loss,time
0,10.71639,,00:00
1,10.710617,,00:00
2,10.704815,,00:00
3,10.698986,,00:00
4,10.69313,,00:00
5,10.687246,,00:00
6,10.681334,,00:00
7,10.675397,,00:00
8,10.669433,,00:00
9,10.663443,,00:00
