# Deep Learning From Scratch


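If we define the model as a function whose operations take the parameters as inputs, we can "fit" it so that it describes the data as well as possible, using the following procedure:

1. Repeatedly feed observations through the model, keeping track of the quantities computed along the way during this "forward pass".
2. Compute the loss, which measures how far the model's predictions are from the expected outputs, or targets.
3. Using the quantities from the forward pass and the chain rule from Chapter 1, compute how much each input parameter ultimately affects the loss.
4. Update the parameter values so that the loss will hopefully decrease when the next set of observations is passed through the model.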
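
The four steps above can be sketched end to end on a toy problem. The example below is only a minimal illustration, assuming a one-feature linear model and synthetic data generated from `y = 2x + 1`; the names `W`, `B`, and `learning_rate`, and the data itself, are made up for this sketch and are not part of the notebook's classes.

```python
import numpy as np

# Hypothetical toy data (an assumption for illustration): y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1))
y = 2.0 * X + 1.0

W = np.zeros((1, 1))  # weight parameter
B = np.zeros((1, 1))  # bias parameter
learning_rate = 0.1

losses = []
for _ in range(500):
    # 1. Forward pass: compute and keep the quantities needed later.
    prediction = X @ W + B
    error = prediction - y
    # 2. Loss: mean squared error between prediction and target.
    losses.append(float(np.mean(error ** 2)))
    # 3. Chain rule: how much each parameter affects the loss.
    prediction_grad = 2.0 * error / X.shape[0]
    W_grad = X.T @ prediction_grad
    B_grad = prediction_grad.sum(axis=0, keepdims=True)
    # 4. Update the parameters so the next loss is hopefully smaller.
    W -= learning_rate * W_grad
    B -= learning_rate * B_grad
```

After the loop, `W` and `B` should sit close to the values used to generate the data, and `losses` should shrink from its first entry to its last.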

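To describe and handle models of growing complexity, the material from the previous two chapters needs to be consolidated into a form that makes the model's workings easier to follow. In practice, this means creating some higher-level objects.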

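## Building blocks of neural networks: Operations

The `Operation` class represents one of the constituent functions of a neural network. It has a `forward` method and a `backward` method, each of which takes an `ndarray` as input and produces an `ndarray` as output.

There are at least two kinds of `Operation`. The first is exemplified by matrix multiplication; the second is the activation function.

A matrix multiplication takes an `ndarray` as input and returns an `ndarray` that may have a different shape. An activation function such as `sigmoid` applies some function to each element of its input `ndarray`.

How do `ndarray`s flow through an `Operation`?

On the forward pass, each `Operation` computes its output and sends it forward. On the backward pass, it receives an "output gradient": the partial derivative of the `loss` with respect to each element of its output.
Also on the backward pass, each `Operation` sends backward an "input gradient": the partial derivative of the `loss` with respect to each element of its input.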

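Given this description, an implementation of the `Operation` class must satisfy the following constraints:
- The output gradient `ndarray` must have the same shape as the output `ndarray`.
- The input gradient `ndarray` sent backward during the backward pass must have the same shape as the input `ndarray`.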
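
As a rough illustration of these two shape constraints, the sketch below runs a matrix multiplication forward and backward with plain NumPy. The array sizes and the all-ones `output_grad` are arbitrary placeholders chosen for this example; in a real network the output gradient would come from whatever follows the operation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_ = rng.normal(size=(3, 4))     # 3 observations, 4 features
W = rng.normal(size=(4, 2))          # parameter mapping 4 features to 2 outputs

output = input_ @ W                  # forward pass, shape (3, 2)
output_grad = np.ones_like(output)   # placeholder for the gradient received

input_grad = output_grad @ W.T       # gradient sent backward, shape (3, 4)
param_grad = input_.T @ output_grad  # gradient for the parameter, shape (4, 2)

# Each gradient has the shape of the quantity it refers to.
assert output_grad.shape == output.shape
assert input_grad.shape == input_.shape
assert param_grad.shape == W.shape
```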


![Operation](./images/03_an_operation.png)



An `Operation` with parameters (weights):

![Operation With Weight](./images/03_an_operation_with_param.png)



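On the forward pass, the input is used to compute the output; on the backward pass, the input gradient is computed. The input gradient produced on the backward pass has the same shape as the input received on the forward pass.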

In [1]:
import numpy as np
from numpy import ndarray

from typing import List

In [2]:
def assert_same_shape(array: ndarray,
                      array_grad: ndarray):
    assert array.shape == array_grad.shape, \
        '''
        Two ndarrays should have the same shape;
        instead, first ndarray's shape is {0}
        and second ndarray's shape is {1}.
        '''.format(tuple(array.shape), tuple(array_grad.shape))
    return None

In [3]:
class Operation(object):
    """
    Base class for an "operation" in a neural network.
    """

    def __init__(self):
        pass

    def forward(self, input_: ndarray) -> ndarray:
        """
        Stores input in the self.input_ instance variable.
        Calls the self._output() function.
        """
        self.input_ = input_

        self.output = self._output()

        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        """
        Calls the self._input_grad() function.
        Checks that the appropriate shapes match.
        """
        assert_same_shape(self.output, output_grad)

        self.input_grad = self._input_grad(output_grad)

        assert_same_shape(self.input_, self.input_grad)
        return self.input_grad

    def _output(self) -> ndarray:
        """
        The _output method must be defined for each Operation
        """
        raise NotImplementedError()

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        The _input_grad method must be defined for each Operation
        """
        raise NotImplementedError()

In [4]:
class ParamOperation(Operation):
    """
    An Operation with parameters.
    """

    def __init__(self, param: ndarray) -> None:
        """
        Stores the parameter in the self.param instance variable.
        :param param: the parameter ndarray used by this Operation
        """
        super().__init__()
        self.param = param

    def backward(self, output_grad: ndarray) -> ndarray:
        """
        Calls self._input_grad and self._param_grad.
        Checks appropriate shapes.
        """
        assert_same_shape(self.output, output_grad)

        self.input_grad = self._input_grad(output_grad)
        self.param_grad = self._param_grad(output_grad)

        assert_same_shape(self.input_, self.input_grad)
        assert_same_shape(self.param, self.param_grad)

        return self.input_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Every subclass of ParamOperation must implement _param_grad.
        """
        raise NotImplementedError()

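## Building blocks of neural networks: Layers

The `Layer` class represents one layer of a neural network and is made up of one or more `Operation`s. Every `Layer` has a `forward` method that takes an `ndarray` as input and returns an `ndarray` as output.

A neural network represented in terms of `Operation`s: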

![layers](./images/03_layers.png)


A neural network represented in terms of `Layer`s:

![layers](./images/03_layers2.png)



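A deep learning model is a neural network with more than one hidden layer.

### Other building blocks

- Multiplication of the input matrix by the parameter matrix
- Addition of the bias
- The activation function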
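
Before wrapping these three pieces in classes, it may help to see them composed as plain functions. This is a minimal sketch of the forward direction only; the function names `weight_multiply`, `bias_add`, and `sigmoid`, and the zero-valued parameters, are assumptions made for this example.

```python
import numpy as np

def weight_multiply(x, W):
    # (batch, n_in) @ (n_in, n_out) -> (batch, n_out)
    return x @ W

def bias_add(x, B):
    # B has shape (1, n_out) and is broadcast across the rows of x.
    return x + B

def sigmoid(x):
    # Applied to each element independently; the shape is unchanged.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3 observations, 2 features
W = np.zeros((2, 4))         # placeholder weights
B = np.zeros((1, 4))         # placeholder bias

out = sigmoid(bias_add(weight_multiply(x, W), B))
```

With all-zero parameters every pre-activation is 0, so every entry of `out` is `sigmoid(0) = 0.5`, and the output has one row per observation and one column per neuron.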

In [5]:
class WeightMultiply(ParamOperation):
    """
    Weight multiplication operation for a neural network.
    """

    def __init__(self, W: ndarray):
        """
        Initialize Operation with self.param = W.
        """
        super().__init__(W)

    def _output(self) -> ndarray:
        """
        Compute output.
        """
        return np.dot(self.input_, self.param)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient.
        """
        return np.dot(output_grad, np.transpose(self.param, (1, 0)))

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute parameter gradient.
        """
        return np.dot(np.transpose(self.input_, (1, 0)), output_grad)

In [6]:
class BiasAdd(ParamOperation):
    """
    Compute bias addition.
    """

    def __init__(self, B: ndarray):
        """
        Initialize Operation with self.param = B.
        Check appropriate shape.
        """
        assert B.shape[0] == 1

        super().__init__(B)

    def _output(self) -> ndarray:
        """
        Compute output.
        """
        return self.input_ + self.param

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient.
        """
        return np.ones_like(self.input_) * output_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute parameter gradient.
        """
        param_grad = np.ones_like(self.param) * output_grad
        return np.sum(param_grad, axis=0).reshape(1, param_grad.shape[1])
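
The sum over `axis=0` in the bias gradient reflects the fact that the same bias row is added to every observation, so each observation contributes to its gradient. A small numerical check can make this concrete. The sketch below assumes a stand-in scalar "loss" of the form `sum(output * output_grad)`, which is not part of the notebook and is used only because its gradient with respect to the output is exactly `output_grad`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 observations, 3 features
B = rng.normal(size=(1, 3))            # bias row, broadcast over observations
output_grad = rng.normal(size=(5, 3))  # arbitrary gradient flowing backward

def scalar_loss(bias):
    # Stand-in loss whose gradient w.r.t. (x + bias) is output_grad.
    return float(np.sum((x + bias) * output_grad))

# Analytic gradient: sum the output gradient over the batch axis.
analytic = output_grad.sum(axis=0, keepdims=True)

# Numerical gradient by central differences, one bias entry at a time.
eps = 1e-6
numeric = np.zeros_like(B)
for j in range(B.shape[1]):
    step = np.zeros_like(B)
    step[0, j] = eps
    numeric[0, j] = (scalar_loss(B + step) - scalar_loss(B - step)) / (2.0 * eps)
```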

In [7]:
class Sigmoid(Operation):
    """
    Sigmoid activation function.
    """

    def __init__(self) -> None:
        super().__init__()

    def _output(self) -> ndarray:
        """
        Compute output.
        """
        return 1.0 / (1.0 + np.exp(-1.0 * self.input_))

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """
        Compute input gradient.
        """
        sigmoid_backward = self.output * (1.0 - self.output)
        return sigmoid_backward * output_grad
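
The backward formula used here relies on the identity that the derivative of the sigmoid is `sigmoid(x) * (1 - sigmoid(x))`. As a quick sanity check, the sketch below compares that expression against a central finite difference; the sample points and the step size `eps` are arbitrary choices for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)

# Derivative expressed through the forward output, as in the backward pass.
analytic = sigmoid(x) * (1.0 - sigmoid(x))

# Central finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```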

In [9]:
class Linear(Operation):
    """
    "Identity" activation function
    """

    def __init__(self) -> None:
        """Pass"""
        super().__init__()

    def _output(self) -> ndarray:
        """Pass through"""
        return self.input_

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        """Pass through"""
        return output_grad

In [10]:
class Layer(object):
    """
    A "layer" of neurons in a neural network.
    """

    def __init__(self, neurons: int):
        """
        The number of "neurons" roughly corresponds to the "breadth" of the layer
        """
        self.neurons = neurons
        self.first = True
        self.params: List[ndarray] = []
        self.param_grads: List[ndarray] = []
        self.operations: List[Operation] = []

    def _setup_layer(self, input_: ndarray) -> None:
        """
        The _setup_layer function must be implemented for each layer.
        It receives the first input passed to the layer.
        """
        raise NotImplementedError()

    def forward(self, input_: ndarray) -> ndarray:
        """
        Passes input forward through a series of operations.
        """
        if self.first:
            self._setup_layer(input_)
            self.first = False

        self.input_ = input_

        for operation in self.operations:
            input_ = operation.forward(input_)

        self.output = input_

        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        """
        Passes output_grad backward through a series of operations.
        """
        assert_same_shape(self.output, output_grad)

        for operation in reversed(self.operations):
            output_grad = operation.backward(output_grad)

        input_grad = output_grad

        self._param_grads()

        return input_grad

    def _param_grads(self) -> None:
        """
        Extracts the param_grads from a layer's operations
        and stores them in self.param_grads.
        """
        self.param_grads = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.param_grads.append(operation.param_grad)

    def _params(self) -> None:
        """
        Extracts the params from a layer's operations
        and stores them in self.params.
        """
        self.params = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.params.append(operation.param)

A fully connected layer is sometimes also called a `Dense` layer.

In [11]:
class Dense(Layer):
    """
    A fully connected layer which inherits from "Layer"
    """

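    def __init__(self,
                 neurons: int,
                 activation: Operation = None):
        """
        Takes an activation function upon initialization;
        defaults to a fresh Sigmoid for each layer.
        """
        super().__init__(neurons)
        # A default of Sigmoid() in the signature would be created once
        # and shared, stored state included, by every layer relying on it.
        self.activation = activation if activation is not None else Sigmoid()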

    def _setup_layer(self, input_: ndarray) -> None:
        """
        Defines the operations of a fully connected layer.
        """
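        # "seed" is attached by NeuralNetwork; it may be absent when the
        # layer is used on its own or when the network has no seed.
        seed = getattr(self, "seed", None)
        if seed:
            np.random.seed(seed)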

        self.params = []

        # weights
        self.params.append(np.random.randn(input_.shape[1], self.neurons))

        # bias
        self.params.append(np.random.randn(1, self.neurons))

        self.operations = [WeightMultiply(self.params[0]),
                           BiasAdd(self.params[1]),
                           self.activation]

        return None

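### Building blocks of neural networks: Loss

How a neural network "learns":
- The data matrix $X$ is fed in as the network's input and passes through each `Layer` on the forward pass, finally producing the `prediction`.
- From the difference between the `prediction` and the target `y`, the `loss` and the `loss gradient` are computed; the latter is the partial derivative of the `loss` with respect to each element of the last layer's output (the elements that make up the `prediction`).
- On the backward pass, the `loss gradient` is passed back through every layer and the parameter gradients are computed, that is, the partial derivative of the `loss` with respect to each parameter.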
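
To make the `loss` and `loss gradient` concrete, here is a small worked example for mean squared error, averaged over observations. The three predictions and targets are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

prediction = np.array([[2.0], [0.0], [1.0]])
target = np.array([[1.0], [0.0], [3.0]])

n = prediction.shape[0]                      # number of observations
diff = prediction - target                   # [[1], [0], [-2]]

loss = float(np.sum(diff ** 2) / n)          # (1 + 0 + 4) / 3
loss_grad = 2.0 * diff / n                   # d(loss) / d(prediction)
```

The loss is a single number, while `loss_grad` has one entry per element of `prediction`, which is what allows it to be passed back into the last layer.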



![backpropagation](./images/03_backpropagation.png)

In [12]:
class Loss(object):
    """
    The "loss" of a neural network
    """

    def __init__(self):
        """Pass"""
        pass

    def forward(self, prediction: ndarray, target: ndarray) -> float:
        """
        Computes the actual loss value
        """
        assert_same_shape(prediction, target)

        self.prediction = prediction
        self.target = target

        loss_value = self._output()

        return loss_value

    def backward(self) -> ndarray:
        """
        Computes gradient of the loss value with respect to the input to the loss function
        """
        self.input_grad = self._input_grad()

        assert_same_shape(self.prediction, self.input_grad)

        return self.input_grad

    def _output(self) -> float:
        """
        Every subclass of "Loss" must implement the _output function.
        """
        raise NotImplementedError()

    def _input_grad(self) -> ndarray:
        """
        Every subclass of "Loss" must implement the _input_grad function.
        """
        raise NotImplementedError()

In [13]:
class MeanSquaredError(Loss):

    def __init__(self) -> None:
        """Pass"""
        super().__init__()

    def _output(self) -> float:
        """
        Computes the per-observation squared error loss
        """
        loss = (
                np.sum(np.power(self.prediction - self.target, 2)) /
                self.prediction.shape[0]
        )

        return loss

    def _input_grad(self) -> ndarray:
        """
        Computes the loss gradient with respect to the input for MSE loss
        """

        return 2.0 * (self.prediction - self.target) / self.prediction.shape[0]

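### Building blocks of neural networks: Putting it together

With `Layer`, `Operation`, and `Loss` in hand, how do we combine these building blocks into a complete neural network? A `NeuralNetwork` needs:
- A list of `Layer`s. Each `Layer` has `forward` and `backward` methods that take an `ndarray` as input and return an `ndarray` as output.
- Each `Layer` has a list of `Operation`s, set up in its `_setup_layer` function and stored in its `operations` attribute.
- `Operation`s, like `Layer`s, have `forward` and `backward` methods that take an `ndarray` as input and return an `ndarray` as output.
- In each operation, the `output_grad` received by `backward` must have the same shape as the `output` produced by `forward`, and the `input_grad` sent backward must have the same shape as the input. `Layer.backward` applies the same check to the layer's own `output`.
- Some operations have parameters, stored in their `param` attribute.
- A `NeuralNetwork` also has a `Loss`, which takes the output of the network's last operation and the `target` as its inputs.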


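To recap, here is the overall flow through a `NeuralNetwork`:
1. Receive `X` and `y` as inputs; both are `ndarray`s.
2. Pass `X` through each `Layer`'s `forward` method in turn to obtain the `prediction`.
3. Use the `Loss` to compute the `loss value` and the `loss gradient` from the `prediction` and `y`; the `loss gradient` is then handed to `backward`.
4. In the `backward` method, use the `loss gradient` as input to compute the `param_grads` of every `Layer`.
5. Update the parameters of every `Layer` using the network's overall learning rate and the newly computed `param_grads`. In the code that follows, no per-layer `update_param` method is defined; the update is carried out by the `Optimizer`'s `step` method.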
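
The five steps can also be traced without any classes. The following is a minimal sketch, assuming a network with one sigmoid hidden layer and a linear output layer, random toy data, and a hypothetical helper named `train_step`; it is meant only to show the order of the computations, not to stand in for the `NeuralNetwork` class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, params, learning_rate):
    W1, B1, W2, B2 = params
    # Step 2: forward pass through two layers.
    hidden = sigmoid(X @ W1 + B1)
    prediction = hidden @ W2 + B2
    # Step 3: loss value and loss gradient (mean squared error).
    loss = float(np.sum((prediction - y) ** 2) / X.shape[0])
    grad = 2.0 * (prediction - y) / X.shape[0]
    # Step 4: backward pass, last layer first.
    W2_grad = hidden.T @ grad
    B2_grad = grad.sum(axis=0, keepdims=True)
    hidden_grad = grad @ W2.T
    pre_activation_grad = hidden_grad * hidden * (1.0 - hidden)
    W1_grad = X.T @ pre_activation_grad
    B1_grad = pre_activation_grad.sum(axis=0, keepdims=True)
    # Step 5: in-place parameter update scaled by the learning rate.
    for param, param_grad in zip(params, (W1_grad, B1_grad, W2_grad, B2_grad)):
        param -= learning_rate * param_grad
    return loss

# Step 1: X and y are both 2D ndarrays (random toy data, an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))
y = rng.normal(size=(16, 1))
params = [rng.normal(size=(3, 4)), np.zeros((1, 4)),
          rng.normal(size=(4, 1)), np.zeros((1, 1))]

losses = [train_step(X, y, params, learning_rate=0.01) for _ in range(200)]
```

Repeating the step on the same data should drive the recorded loss down, which is the behaviour the class-based version is built to reproduce in a more modular way.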

In [14]:
class NeuralNetwork(object):
    """
    A Neural Network
    """

    def __init__(self, layers: List[Layer], loss: Loss, seed: int = 1) -> None:
        """
        Neural Network need layers, and a loss
        """

        self.layers = layers
        self.loss = loss
        self.seed = seed
        if seed:
            for layer in self.layers:
                setattr(layer, "seed", self.seed)
        # self.param_grads: List[ndarray] = []

    def forward(self, X_batch: ndarray) -> ndarray:
        """
        Passes data forward through a series of layers
        """

        x_out = X_batch
        for layer in self.layers:
            x_out = layer.forward(x_out)

        return x_out

    def backward(self, loss_grad: ndarray) -> None:
        """
        Passes data backward through a series of layers
        """

        grad = loss_grad
        for layer in reversed(self.layers):
            grad = layer.backward(grad)

        return None

    def train_batch(self,
                    X_batch: ndarray,
                    y_batch: ndarray) -> float:
        """
        Passes data forward through the layers.
        Computes the loss.
        Passes data backward through the layers.
        """

        prediction = self.forward(X_batch)

        loss = self.loss.forward(prediction, y_batch)

        self.backward(self.loss.backward())

        return loss

    def params(self):
        """
        Gets the parameters for the network.
        """

        for layer in self.layers:
            yield from layer.params

    def param_grads(self):
        """
        Gets the gradient of the loss with respect to the parameters for the network.
        """

        for layer in self.layers:
            yield from layer.param_grads

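### Building blocks of neural networks: Trainer and Optimizer

Besides the `NeuralNetwork`, we also need a `Trainer` and an `Optimizer`. The `Trainer` trains the network and the `Optimizer` updates the parameters. The `Trainer` holds both the `NeuralNetwork` and the `Optimizer`.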
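
One detail worth keeping in mind for an optimizer that loops over parameter arrays: an update written as `param -= ...` modifies the array in place, whereas `param = param - ...` only rebinds the loop variable and leaves the stored array untouched. The sketch below illustrates the difference with made-up values; the names `weights`, `params`, and `grads` are placeholders for this example.

```python
import numpy as np

weights = np.array([1.0, 2.0])
params = [weights]                  # the list holds the same array object
grads = [np.array([0.5, 0.5])]
lr = 0.1

for param, grad in zip(params, grads):
    param -= lr * grad              # in place: `weights` itself changes

in_place_result = weights.copy()

for param, grad in zip(params, grads):
    param = param - lr * grad       # rebinding: `weights` is not modified

after_rebinding = weights.copy()
```

Only the first loop changes `weights`; after the second loop the array still holds the values left by the in-place update.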

In [15]:
class Optimizer(object):
    """
    Base class for a neural network optimizer.
    """

    def __init__(self,
                 lr: float = 0.01):
        """
        Every optimizer must have an initial learning rate.
        """
        self.lr = lr

    def step(self) -> None:
        """
        Every optimizer must implement the "step" function.
        """
        pass

In [16]:
class SGD(Optimizer):
    """
    Stochastic gradient descent optimizer.
    """

    def __init__(self,
                 lr: float = 0.01) -> None:
        """Pass"""
        super().__init__(lr)

    def step(self):
        """
        For each parameter, adjust in the appropriate direction, with the magnitude of the adjustment 
        based on the learning rate.
        """
        for (param, param_grad) in zip(self.net.params(),
                                       self.net.param_grads()):
            param -= self.lr * param_grad

In [17]:
from copy import deepcopy
from typing import Iterator, Tuple


class Trainer(object):
    """
    Trains a neural network
    """

    def __init__(self,
                 net: NeuralNetwork,
                 optim: Optimizer) -> None:
        """
        Requires a neural network and an optimizer in order for training to occur.
        Assign the neural network as an instance variable to the optimizer.
        """
        self.net = net
        self.optim = optim
        self.best_loss = 1e9
        setattr(self.optim, 'net', self.net)

    def generate_batches(self,
                         X: ndarray,
                         y: ndarray,
                         size: int = 32) -> Iterator[Tuple[ndarray, ndarray]]:
        """
        Generates batches for training 
        """
        assert X.shape[0] == y.shape[0], \
            '''
            features and target must have the same number of rows, instead
            features has {0} and target has {1}
            '''.format(X.shape[0], y.shape[0])

        N = X.shape[0]

        for ii in range(0, N, size):
            X_batch, y_batch = X[ii:ii + size], y[ii:ii + size]

            yield X_batch, y_batch

    def fit(self, X_train: ndarray, y_train: ndarray,
            X_test: ndarray, y_test: ndarray,
            epochs: int = 100,
            eval_every: int = 10,
            batch_size: int = 32,
            seed: int = 1,
            restart: bool = True) -> None:
        """
        Fits the neural network on the training data for a certain number of epochs.
        Every "eval_every" epochs, it evaluates the neural network on the testing data.
        """

        np.random.seed(seed)
        if restart:
            for layer in self.net.layers:
                layer.first = True

            self.best_loss = 1e9

        # for early stopping: snapshot of the best model seen so far
        last_model = deepcopy(self.net)

        for e in range(epochs):

            X_train, y_train = permute_data(X_train, y_train)

            batch_generator = self.generate_batches(X_train, y_train,
                                                    batch_size)

            for X_batch, y_batch in batch_generator:
                self.net.train_batch(X_batch, y_batch)

                self.optim.step()

            if (e + 1) % eval_every == 0:

                test_preds = self.net.forward(X_test)
                loss = self.net.loss.forward(test_preds, y_test)

                if loss < self.best_loss:
                    print(f"Validation loss after {e + 1} epochs is {loss:.3f}")
                    self.best_loss = loss
                    # keep a copy of the model as of this evaluation, so the
                    # "model from epoch ..." message below is accurate
                    last_model = deepcopy(self.net)
                else:
                    print(
                        f"""Loss increased after epoch {e + 1}, final loss was {self.best_loss:.3f}, using the model from epoch {e + 1 - eval_every}""")
                    self.net = last_model
                    # ensure self.optim is still updating self.net
                    setattr(self.optim, 'net', self.net)
                    break

In [18]:
def permute_data(X, y):
    perm = np.random.permutation(X.shape[0])
    return X[perm], y[perm]
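
Shuffling with a single permutation keeps each row of `X` paired with its row of `y`, and slicing in fixed steps then yields mini-batches whose last batch may be smaller. The sketch below shows both effects on a tiny made-up dataset; `iterate_batches` is a hypothetical standalone helper written for this example.

```python
import numpy as np

def iterate_batches(X, y, size):
    # Yield consecutive slices of at most `size` rows.
    for start in range(0, X.shape[0], size):
        yield X[start:start + size], y[start:start + size]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)
y = 10.0 * X                                 # each target is tied to its row

perm = rng.permutation(X.shape[0])           # one permutation for both arrays
X_shuffled, y_shuffled = X[perm], y[perm]

batch_sizes = [xb.shape[0]
               for xb, _ in iterate_batches(X_shuffled, y_shuffled, 4)]
```

Ten rows with a batch size of 4 give batches of 4, 4, and 2 rows, and the relationship `y = 10 * X` still holds row by row after shuffling.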

In [19]:
# Evaluation metrics
def mae(y_true: ndarray, y_pred: ndarray):
    """
    Compute mean absolute error for a neural network.
    """
    return np.mean(np.abs(y_true - y_pred))


def rmse(y_true: ndarray, y_pred: ndarray):
    """
    Compute root mean squared error for a neural network.
    """
    return np.sqrt(np.mean(np.power(y_true - y_pred, 2)))


def eval_regression_model(model: NeuralNetwork,
                          X_test: ndarray,
                          y_test: ndarray):
    """
    Compute mae and rmse for a neural network.
    """
    preds = model.forward(X_test)
    preds = preds.reshape(-1, 1)
    print("Mean absolute error: {:.2f}".format(mae(preds, y_test)))
    print()
    print("Root mean squared error {:.2f}".format(rmse(preds, y_test)))
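
A small worked example may help in reading these two metrics, and in seeing why the predictions are reshaped to a column before being compared with the targets. The numbers below are made up for illustration.

```python
import numpy as np

y_true = np.array([[1.0], [2.0], [3.0], [4.0]])
y_pred = np.array([[1.0], [2.0], [3.0], [8.0]])   # one prediction is off by 4

mae_value = float(np.mean(np.abs(y_true - y_pred)))           # 4 / 4
rmse_value = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))  # sqrt(16 / 4)

# If the predictions were 1D, broadcasting would silently build a matrix
# of all pairwise differences instead of four row-wise differences.
broadcast_shape = (y_true - y_pred.ravel()).shape
```

Here the mean absolute error is 1.0 and the root mean squared error is 2.0: the squared metric weights the single large error more heavily. `broadcast_shape` comes out as `(4, 4)`, which is the shape mismatch the reshape guards against.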

In [20]:
lr = NeuralNetwork(
    layers=[Dense(neurons=1,
                  activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

nn = NeuralNetwork(
    layers=[Dense(neurons=13,
                  activation=Sigmoid()),
            Dense(neurons=1,
                  activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

dl = NeuralNetwork(
    layers=[Dense(neurons=13,
                  activation=Sigmoid()),
            Dense(neurons=13,
                  activation=Sigmoid()),
            Dense(neurons=1,
                  activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

In [21]:
# Prepare the data
import pandas as pd

# New source for Boston housing data per https://scikit-learn.org/1.0/whats_new/v1.0.html#changes-1-0
# data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv("./housing.csv", sep="\\s+", skiprows=22, header=None)

data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
features = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                     'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Scaling the data
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
data = s.fit_transform(data)
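
For readers who want to see what the scaling step does to each column, here is a rough NumPy equivalent for columns with non-zero variance: subtract the column mean and divide by the column's population standard deviation. The small `data` array is a made-up stand-in for the housing features.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

mean = data.mean(axis=0)       # per-column mean
std = data.std(axis=0)         # per-column population standard deviation
scaled = (data - mean) / std   # each column now has mean 0 and std 1
```

Putting all features on a comparable scale tends to make gradient descent with a single learning rate better behaved, since no one feature dominates the weight updates.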

In [22]:
def to_2d_np(a: ndarray,
             type: str = "col") -> ndarray:
    """
    Turns a 1D Tensor into 2D
    """

    assert a.ndim == 1, \
        "Input tensors must be 1 dimensional"

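    if type == "col":
        return a.reshape(-1, 1)
    elif type == "row":
        return a.reshape(1, -1)
    else:
        # avoid silently returning None for an unrecognized option
        raise ValueError('type must be "col" or "row"')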

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=80718)

# make target 2d array
y_train, y_test = to_2d_np(y_train), to_2d_np(y_test)

Train the models

In [24]:
trainer = Trainer(lr, SGD(lr=0.01))

trainer.fit(X_train, y_train, X_test, y_test,
            epochs=50,
            eval_every=10,
            seed=20190501)
print()
eval_regression_model(lr, X_test, y_test)

Validation loss after 10 epochs is 18.965
Validation loss after 20 epochs is 5.640
Validation loss after 30 epochs is 3.228
Validation loss after 40 epochs is 2.377
Validation loss after 50 epochs is 2.063

Mean absolute error: 1.03

Root mean squared error 1.44


In [25]:
trainer = Trainer(nn, SGD(lr=0.01))

trainer.fit(X_train, y_train, X_test, y_test,
            epochs=50,
            eval_every=10,
            seed=20190501)
print()
eval_regression_model(nn, X_test, y_test)

Validation loss after 10 epochs is 10.329
Validation loss after 20 epochs is 6.254
Validation loss after 30 epochs is 4.606
Validation loss after 40 epochs is 3.568
Validation loss after 50 epochs is 3.004

Mean absolute error: 1.39

Root mean squared error 1.73


In [26]:
trainer = Trainer(dl, SGD(lr=0.01))

trainer.fit(X_train, y_train, X_test, y_test,
            epochs=50,
            eval_every=10,
            seed=20190501)
print()
eval_regression_model(dl, X_test, y_test)

Validation loss after 10 epochs is 19.001
Validation loss after 20 epochs is 8.479
Validation loss after 30 epochs is 6.069
Validation loss after 40 epochs is 4.749
Validation loss after 50 epochs is 4.009

Mean absolute error: 1.52

Root mean squared error 2.00
