In [1]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

from numpy import ndarray
from copy import deepcopy

from typing import Callable, List, Tuple

In [2]:
def assert_same_shape(array: np.ndarray, array_grad: np.ndarray):
    assert array.shape == array_grad.shape, \
        '''
        Two ndarrays should have the same shape;
        instead, first ndarray's shape is {0}
        and second ndarray's shape is {1}.
        '''.format(tuple(array.shape), tuple(array_grad.shape))
    return None

# 03. Deep Learning from Scratch
You may not realize it, but you now have all the mathematical and conceptual foundations to answer the key questions about deep learning models that I posed at the beginning of the book: you understand how neural networks work—the computations involved with the matrix multiplications, the loss, and the partial derivatives with respect to that loss—as well as why those computations work (namely, the chain rule from calculus). We achieved this understanding by building neural networks from first principles, representing them as a series of “building blocks” where each building block was a single mathematical function. In this chapter, you’ll learn to represent these building blocks themselves as abstract Python classes and then use these classes to build deep learning models; by the end of this chapter, you will indeed have done “deep learning from scratch”!

We’ll also map the descriptions of neural networks in terms of these building blocks to more conventional descriptions of deep learning models that you may have heard before. For example, by the end of this chapter, you’ll know what it means for a deep learning model to have “multiple hidden layers.” This is really the essence of understanding a concept: being able to translate between high-level descriptions and low-level details of what is actually going on. Let’s begin building toward this translation. So far, we’ve described models just in terms of the operations that happen at a low level. In the first part of this chapter, we’ll map this description of models to common higher-level concepts such as “layers” that will ultimately allow us to more easily describe more complex models.

## 3.1 Deep Learning Definition: A First Pass
What is a “deep learning” model? In the previous chapter, we defined a model as a mathematical function represented by a computational graph. The purpose of such a model was to try to map inputs, each drawn from some dataset with common characteristics (such as separate inputs representing different features of houses) to outputs drawn from a related distribution (such as the prices of those houses). We found that if we defined the model as a function that included parameters as inputs to some of its operations, we could “fit” it to optimally describe the data using the following procedure:
1. Repeatedly feed observations through the model, keeping track of the quantities computed along the way during this “forward pass.”

2. Calculate a loss representing how far off our model’s predictions were from the desired outputs or target.

3. Using the quantities computed on the forward pass and the chain rule math worked out in `Chapter 1`, compute how much each of the input parameters ultimately affects this loss.

4. Update the values of the parameters so that the loss will hopefully be reduced when the next set of observations is passed through the model.
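
As a rough illustration of these four steps, here is a minimal sketch that fits a plain linear model with NumPy. The toy data, the variable names, and the learning rate are assumptions made for this example only; they are not taken from the earlier chapters.

```python
import numpy as np

# Hypothetical toy data: 100 observations, 3 features, linear target plus an offset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_W = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_W + 3.0

# Parameters of the model, initialized arbitrarily.
W = np.zeros((3, 1))
b = np.zeros((1, 1))
learning_rate = 0.1

losses = []
for step in range(200):
    # 1. Forward pass: keep the quantities we will need later.
    predictions = X @ W + b
    # 2. Loss: mean squared error between predictions and targets.
    errors = predictions - y
    loss = float(np.mean(errors ** 2))
    losses.append(loss)
    # 3. Backward pass: the chain rule gives the gradient of the loss
    #    with respect to each parameter.
    grad_predictions = 2.0 * errors / X.shape[0]
    grad_W = X.T @ grad_predictions
    grad_b = grad_predictions.sum(axis=0, keepdims=True)
    # 4. Update: move each parameter a small step against its gradient.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(losses[0], losses[-1])
```

Because the toy target is exactly linear in the features, the loss should shrink toward zero and `W` should approach `true_W`.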

We started out with a model containing just a series of linear operations transforming our features into the target (which turned out to be equivalent to a traditional linear regression model). This had the expected limitation that, even when fit “optimally”, the model could nevertheless represent only linear relationships between our features and our target.

We then defined a function structure that applied these linear operations first, then a nonlinear operation (the sigmoid function), and then a final set of linear operations. We showed that with this modification, our model could learn something closer to the true, nonlinear relationship between input and output, while having the additional benefit that it could learn relationships between combinations of our input features and the target.

What is the connection between models like these and deep learning models? We’ll start with a somewhat clumsy attempt at a definition: ***deep learning models are represented by series of operations that have at least two, nonconsecutive nonlinear functions involved***.

I’ll show where this definition comes from shortly, but first note that since deep learning models are just a series of operations, the process of training them is in fact identical to the process we’ve been using for the simpler models we’ve already seen. After all, what allows this training process to work is the differentiability of the model with respect to its inputs; and as mentioned in `Chapter 1`, the composition of differentiable functions is differentiable, so as long as the individual operations making up the function are differentiable, the whole function will be differentiable, and we’ll be able to train it using the same four-step training procedure just described.
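
To make the differentiability claim concrete, the following sketch composes two differentiable functions and compares the chain-rule derivative with a finite-difference estimate. The particular functions (`square` followed by `sigmoid`) and the test points are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def square(x):
    return x ** 2

def composed(x):
    # f(x) = sigmoid(square(x))
    return sigmoid(square(x))

def composed_derivative(x):
    # Chain rule: f'(x) = sigmoid'(square(x)) * square'(x)
    inner = square(x)
    s = sigmoid(inner)
    return s * (1.0 - s) * 2.0 * x

x = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
eps = 1e-6
# Central finite difference as an independent estimate of the derivative.
numerical = (composed(x + eps) - composed(x - eps)) / (2.0 * eps)
analytical = composed_derivative(x)
max_difference = float(np.max(np.abs(numerical - analytical)))
print(max_difference)
```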

However, so far our approach to actually training these models has been to compute these derivatives by manually coding the forward and backward passes and then multiplying the appropriate quantities together to get the derivatives. For the simple neural network model in `Chapter 2`, this required 17 steps. Because we’re describing the model at such a low level, it isn’t immediately clear how we could add more complexity to this model (or exactly what that would mean) or even make a simple change such as swapping in a different nonlinear function for the sigmoid function. To transition to being able to build arbitrarily “deep” and otherwise “complex” deep learning models, we’ll have to think about where in these 17 steps we can create reusable components, at a higher level than individual operations, that we can swap in and out to build different models. To guide us in the right direction as far as which abstractions to create, we’ll try to map the operations we’ve been using to traditional descriptions of neural networks as being made up of “layers,” “neurons,” and so on.

As our first step, we’ll have to create an abstraction to represent the individual operations we’ve been working with so far, instead of continuing to code the same matrix multiplication and bias addition over and over again.

## 3.2 The Building Blocks of Neural Networks: Operations
The `Operation` class will represent one of the constituent functions in our neural networks. We know that at a high level, based on the way we’ve used such functions in our models, it should have forward and backward methods, each of which receives an `ndarray` as an input and outputs an `ndarray`. Some operations, such as matrix multiplication, seem to have another special kind of input, also an `ndarray`: the parameters. In our `Operation` class—or perhaps in another class that inherits from it—we should allow for `params` as another instance variable.

Another insight is that there seem to be two types of `Operations`: 
+ some, such as the matrix multiplication, return an `ndarray` as output that is a different shape than the `ndarray` they received as input; 
+ by contrast, some `Operations`, such as the sigmoid function, simply apply some function to each element of the input `ndarray`. 

What, then, is the “general rule” about the shapes of the `ndarrays` that get passed between our operations? Let’s consider the `ndarrays` passed through our `Operations`: each `Operation` will send outputs forward on the forward pass and will receive an “output gradient” on the backward pass, which will represent the partial derivative of the loss with respect to every element of the `Operation`’s output (computed by the other `Operations` that make up the network). Also on the backward pass, each `Operation` will send an “input gradient” backward, representing the partial derivative of the loss with respect to each element of the input.

These facts place a few important restrictions on the workings of our `Operations` that will help us ensure we’re computing the gradients correctly:
+ The shape of the output gradient `ndarray` must match the shape of the output.

+ The shape of the input gradient that the `Operation` sends backward during the backward pass must match the shape of the `Operation`’s input.
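
The two restrictions can be checked directly for a matrix multiplication. This is only a sketch with made-up sizes; the all-ones `output_grad` stands in for whatever gradient the rest of a network would send back.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 3, 2

input_ = rng.normal(size=(batch_size, n_in))
weights = rng.normal(size=(n_in, n_out))

# Forward pass: the output has a different shape than the input.
output = input_ @ weights

# Backward pass: the operation receives one gradient entry per output element...
output_grad = np.ones_like(output)
# ...and sends back one gradient entry per input element.
input_grad = output_grad @ weights.T

assert output_grad.shape == output.shape == (batch_size, n_out)
assert input_grad.shape == input_.shape == (batch_size, n_in)
```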

This will all be clearer once you see it in a diagram; let’s look at that next.

<img src="images/03_01_02.png" style="width:600px;"/>

In [3]:
# Base class for an "operation" in a neural network.
class Operation(object):
    def __init__(self):
        pass

    # Stores input in the self.input_ instance variable.
    # Calls the self._output() function.
    def forward(self, input_: ndarray) -> ndarray:
        self.input_ = input_
        self.output = self._output()
        return self.output

    # Calls the self._input_grad() function.
    # Checks that the appropriate shapes match.
    def backward(self, output_grad: ndarray) -> ndarray:
        assert_same_shape(self.output, output_grad)
        self.input_grad = self._input_grad(output_grad)
        assert_same_shape(self.input_, self.input_grad)
        return self.input_grad

    # The _output method must be defined for each Operation
    def _output(self) -> ndarray:
        raise NotImplementedError()

    # The _input_grad method must be defined for each Operation
    def _input_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError()

For any individual `Operation` that we define, we’ll have to implement the `_output` and `_input_grad` functions, so named because of the quantities they compute.
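
As a sketch of what implementing those two functions looks like, here is a hypothetical `Tanh` operation. It is not one of the operations used in this chapter, and the base class is repeated in condensed form only so that the example runs on its own.

```python
import numpy as np
from numpy import ndarray


class Operation:
    """Condensed copy of the base class above, so the sketch is self-contained."""

    def forward(self, input_: ndarray) -> ndarray:
        self.input_ = input_
        self.output = self._output()
        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        assert self.output.shape == output_grad.shape
        self.input_grad = self._input_grad(output_grad)
        assert self.input_.shape == self.input_grad.shape
        return self.input_grad

    def _output(self) -> ndarray:
        raise NotImplementedError()

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError()


class Tanh(Operation):
    """Hypothetical example operation: elementwise hyperbolic tangent."""

    def _output(self) -> ndarray:
        return np.tanh(self.input_)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        # d/dx tanh(x) = 1 - tanh(x)^2, applied elementwise.
        return output_grad * (1.0 - self.output ** 2)


op = Tanh()
out = op.forward(np.array([[0.0, 1.0], [-2.0, 0.5]]))
grad = op.backward(np.ones_like(out))
```

Only the two underscore-prefixed methods change from one operation to the next; the storing of inputs and the shape checks are inherited.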

> We’re defining base classes like this primarily for pedagogical reasons: it is important to have the mental model that all `Operations` you’ll encounter throughout deep learning fit this blueprint of sending inputs forward and gradients backward, with the shapes of what they receive on the forward pass matching the shapes of what they send backward on the backward pass, and vice versa.

We’ll define the specific `Operations` we’ve used thus far—matrix multiplication and so on—later in this chapter. First we’ll define another class that inherits from `Operation` that we’ll use specifically for `Operations` that involve parameters:

In [4]:
# An Operation with parameters.
class ParamOperation(Operation):
    # Stores the parameter in the self.param instance variable.
    def __init__(self, param: ndarray) -> None:
        super().__init__()
        self.param = param

    # Calls self._input_grad and self._param_grad.
    # Checks appropriate shapes.
    def backward(self, output_grad: ndarray) -> ndarray:
        assert_same_shape(self.output, output_grad)
        self.input_grad = self._input_grad(output_grad)
        self.param_grad = self._param_grad(output_grad)
        assert_same_shape(self.input_, self.input_grad)
        assert_same_shape(self.param, self.param_grad)
        return self.input_grad
    
    # Every subclass of ParamOperation must implement _param_grad.
    def _param_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError()

Similar to the base `Operation`, an individual `ParamOperation` would have to define the `_param_grad` function in addition to the `_output` and `_input_grad` functions.

We have now formalized the neural network building blocks we’ve been using in our models so far. We could skip ahead and define neural networks directly in terms of these `Operations`, but there is an intermediate class we’ve been dancing around for a chapter and a half that we’ll define first: the `Layer`.

## 3.3 The Building Blocks of Neural Networks: Layers
In terms of `Operations`, layers are a series of linear operations followed by a nonlinear operation. For example, our neural network from the last chapter could be said to have had five total operations: two linear operations (a weight multiplication and the addition of a bias term) followed by the sigmoid function and then two more linear operations. In this case, we would say that the first three operations, up to and including the nonlinear one, would constitute the first layer, and the last two operations would constitute the second layer. In addition, we say that the input itself represents a special kind of layer called the **input layer** (in terms of numbering the layers, this layer doesn’t count, so that we can think of it as the “zeroth” layer). The last layer, similarly, is called the **output layer**. The middle layer (the “first one,” according to our numbering) also has an important name: it is called a **hidden layer**, since it is the only layer whose values we don’t typically see explicitly during the course of training.

The output layer is an important exception to this definition of layers, in that it does not have to have a nonlinear operation applied to it; this is simply because we often want the values that come out of this layer to have values between negative infinity and infinity (or at least between 0 and infinity), whereas nonlinear functions typically “squash down” their input to some subset of that range relevant to the particular problem we’re trying to solve (for example, the sigmoid function squashes down its input to between 0 and 1).
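
A small numerical sketch of this “squashing” behavior follows; the pre-activation values are invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values coming out of the last linear operations of a network.
pre_activation = np.array([-50.0, -2.0, 0.0, 3.0, 50.0])

squashed = sigmoid(pre_activation)  # confined to the interval from 0 to 1
unsquashed = pre_activation         # no nonlinearity: the range is unrestricted

print(squashed.min(), squashed.max())
print(unsquashed.min(), unsquashed.max())
```

If the target were something like a house price, the squashed values could never reach it, which is why the output layer here skips the nonlinearity.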

To make the connection explicit, `Figure 3-3` shows the diagram of the neural network from the prior chapter with the individual operations grouped into layers.

<img src="images/03_03.png" style="width:600px;"/>

You can see that the input represents an “input” layer, the next three operations (ending with the sigmoid function) represent the next layer, and the last two operations represent the last layer.
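
The same grouping can be written out as a short sketch in plain NumPy. The sizes and the function names `hidden_layer` and `output_layer` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Hypothetical sizes: 5 observations, 4 features, 3 hidden neurons, 1 output.
X = rng.normal(size=(5, 4))  # the "zeroth" layer: the input itself
W1, B1 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))
W2, B2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

def hidden_layer(x):
    # Operations 1-3: weight multiplication, bias addition, sigmoid.
    return sigmoid(x @ W1 + B1)

def output_layer(h):
    # Operations 4-5: weight multiplication and bias addition, no nonlinearity.
    return h @ W2 + B2

hidden = hidden_layer(X)
prediction = output_layer(hidden)
print(hidden.shape, prediction.shape)
```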

This is, of course, rather cumbersome. And that’s the point: representing neural networks as a series of individual operations, while showing clearly how neural networks work and how to train them, is too “low level” for anything more complicated than a two-layer neural network. That’s why the more common way to represent neural networks is in terms of layers, as shown in `Figure 3-4`.

<img src="images/03_04.png" style="width:600px;"/>


## 3.4 Building Blocks on Building Blocks
What specific `Operations` do we need to implement for the models in the prior chapter to work? Based on our experience of implementing that neural network step by step, we know there are three kinds:
+ The matrix multiplication of the input with the matrix of parameters
+ The addition of a bias term
+ The sigmoid activation function

Let’s start with the `WeightMultiply Operation`:

In [5]:
# Weight multiplication operation for a neural network.
class WeightMultiply(ParamOperation):
    # Initialize Operation with self.param = W.
    def __init__(self, W: ndarray):
        super().__init__(W)

    # Compute output.
    def _output(self) -> ndarray:
        return np.dot(self.input_, self.param)

    # Compute input gradient.
    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return np.dot(output_grad, np.transpose(self.param, (1, 0)))

    # Compute parameter gradient.
    def _param_grad(self, output_grad: ndarray)  -> ndarray:      
        return np.dot(np.transpose(self.input_, (1, 0)), output_grad)

Here we simply code up the matrix multiplication on the forward pass, as well as the rules for “sending gradients backward” to both the inputs and the parameters on the backward pass (using the rules for doing so that we reasoned through at the end of `Chapter 1`). As you’ll see shortly, we can now use this as a building block that we can simply plug into our `Layers`.
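
One way to gain confidence in these backward rules is a finite-difference check. The sketch below uses random matrices and a helper named `numerical_grad`, both invented for this example, and compares the two formulas against numerical estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
# Stands in for the gradient arriving from later operations.
output_grad = rng.normal(size=(4, 2))

def scalar_loss(X, W):
    # A loss whose gradient with respect to (X @ W) equals output_grad.
    return float(np.sum((X @ W) * output_grad))

# Analytical gradients, using the same formulas as _input_grad and _param_grad.
input_grad = output_grad @ W.T
param_grad = X.T @ output_grad

def numerical_grad(f, array, eps=1e-6):
    # Central differences, perturbing one element of `array` at a time in place.
    grad = np.zeros_like(array)
    for idx in np.ndindex(array.shape):
        original = array[idx]
        array[idx] = original + eps
        plus = f()
        array[idx] = original - eps
        minus = f()
        array[idx] = original
        grad[idx] = (plus - minus) / (2.0 * eps)
    return grad

num_input_grad = numerical_grad(lambda: scalar_loss(X, W), X)
num_param_grad = numerical_grad(lambda: scalar_loss(X, W), W)
print(np.allclose(input_grad, num_input_grad, atol=1e-5))
print(np.allclose(param_grad, num_param_grad, atol=1e-5))
```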

Next up is the addition operation, which we’ll call `BiasAdd`:

In [6]:
# Compute bias addition.
class BiasAdd(ParamOperation):
    # Initialize Operation with self.param = B.
    # Check appropriate shape.
    def __init__(self, B: ndarray):
        assert B.shape[0] == 1
        super().__init__(B)

    # Compute output.
    def _output(self) -> ndarray:
        return self.input_ + self.param

    # Compute input gradient.
    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return np.ones_like(self.input_) * output_grad

    # Compute parameter gradient.
    def _param_grad(self, output_grad: ndarray) -> ndarray:
        param_grad = np.ones_like(self.param) * output_grad
        return np.sum(param_grad, axis=0).reshape(1, param_grad.shape[1])

Finally, let’s do sigmoid:

In [7]:
# Sigmoid activation function.
class Sigmoid(Operation):
    def __init__(self) -> None:
        super().__init__()

    # Compute output.
    def _output(self) -> ndarray:
        return 1.0/(1.0+np.exp(-1.0 * self.input_))

    # Compute input gradient.
    def _input_grad(self, output_grad: ndarray) -> ndarray:
        sigmoid_backward = self.output * (1.0 - self.output)
        input_grad = sigmoid_backward * output_grad
        return input_grad

In [8]:
# "Identity" activation function
class Linear(Operation):
    def __init__(self) -> None:       
        super().__init__()

    def _output(self) -> ndarray:
        return self.input_

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad

Now that we’ve defined these `Operations` precisely, we can use them as building blocks to define a `Layer`.

### The Layer Blueprint
Because of the way we’ve written the `Operations`, writing the `Layer` class is easy:
+ The forward and backward methods simply involve sending the input successively forward through a series of `Operations`, and the gradient successively backward through the same series in reverse order, exactly as we’ve been doing in the diagrams all along! This is the most important fact about the working of `Layers`; the rest of the code is a wrapper around this and mostly involves bookkeeping:
    - Defining the correct series of `Operations` in the `_setup_layer` function and initializing and storing the parameters in these `Operations` (which will also take place in the `_setup_layer` function)
    - Storing the correct values in `self.input_` and `self.output` on the forward method
    - Performing the correct assertion checking in the backward method
+ Finally, the `_params` and `_param_grads` functions simply extract the parameters and their gradients (with respect to the loss) from the `ParamOperations` within the layer.

Here’s what all that looks like:

In [9]:
# A "layer" of neurons in a neural network.
class Layer(object):
    # The number of "neurons" roughly corresponds to the "breadth" of the layer
    def __init__(self, neurons: int):
        self.neurons = neurons
        self.first = True
        self.params: List[ndarray] = []
        self.param_grads: List[ndarray] = []
        self.operations: List[Operation] = []

    # The _setup_layer function must be implemented for each layer
    def _setup_layer(self, input_: ndarray) -> None:
        raise NotImplementedError()

    # Passes input forward through a series of operations
    def forward(self, input_: ndarray) -> ndarray:
        if self.first:
            self._setup_layer(input_)
            self.first = False
        self.input_ = input_
        for operation in self.operations:
            input_ = operation.forward(input_)
        self.output = input_
        return self.output

    # Passes output_grad backward through a series of operations
    # Checks appropriate shapes
    def backward(self, output_grad: ndarray) -> ndarray:
        assert_same_shape(self.output, output_grad)
        for operation in reversed(self.operations):
            output_grad = operation.backward(output_grad)
        input_grad = output_grad
        self._param_grads()
        return input_grad

    # Extracts the _param_grads from a layer's operations
    def _param_grads(self) -> None:
        self.param_grads = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.param_grads.append(operation.param_grad)

    # Extracts the _params from a layer's operations
    def _params(self) -> None:
        self.params = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.params.append(operation.param)

Just as we moved from an abstract definition of an `Operation` to the implementation of specific `Operations` needed for the neural network from `Chapter 2`, let’s now implement the Layer from that network as well.

### The Dense Layer
We called the `Operations` we’ve been dealing with `WeightMultiply`, `BiasAdd`, and so on. What should we call the layer we’ve been using so far? A `LinearNonLinear` layer?

A defining characteristic of this layer is that each output neuron is a function of all of the input neurons. Thus these layers are often called `fully connected layers`; recently, in the popular `Keras` library, they are also often called `Dense layers`, a more concise term that gets across the same idea.
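
The “fully connected” property can be seen numerically: in the sketch below, nudging a single input feature moves every output of the linear part of such a layer. The sizes and the helper name `dense_linear` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, neurons = 4, 3
W = rng.normal(size=(n_in, neurons))  # one weight per (input, output neuron) pair
B = rng.normal(size=(1, neurons))

def dense_linear(x):
    # Linear part of a fully connected layer (activation omitted for clarity).
    return x @ W + B

x = rng.normal(size=(1, n_in))
bumped = x.copy()
bumped[0, 2] += 1.0  # change a single input feature by one unit

# Every output neuron shifts, by exactly the weight connecting it to that input.
change = dense_linear(bumped) - dense_linear(x)
print(change)
```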

Now that we know what to call it and why, let’s define the `Dense layer` in terms of the operations we’ve already defined—as you’ll see, because of how we defined our `Layer` base class, all we need to do is to put the `Operations` defined in the previous section in as a list in the `_setup_layer` function.

In [10]:
# A fully connected layer which inherits from "Layer"
class Dense(Layer):
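    # Takes an optional activation function upon initialization.
    # Defaults to a fresh Sigmoid so that layers never share one Operation instance.
    def __init__(self, neurons: int, activation: Operation = None):
        super().__init__(neurons)
        self.activation = activation if activation is not None else Sigmoid()

    # Defines the operations of a fully connected layer.
    def _setup_layer(self, input_: ndarray) -> None:
        # The seed attribute is set by NeuralNetwork; a standalone layer may lack it.
        if getattr(self, "seed", None):
            np.random.seed(self.seed)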
        self.params = []
        # weights
        self.params.append(np.random.randn(input_.shape[1], self.neurons))
        # bias
        self.params.append(np.random.randn(1, self.neurons))
        self.operations = [WeightMultiply(self.params[0]),
                           BiasAdd(self.params[1]),
                           self.activation]
        return None

What building blocks should we now add on top of `Operation` and `Layer`? To train our model, we know we’ll need a `NeuralNetwork` class to wrap around Layers, just as `Layers` wrapped around `Operations`. It isn’t obvious what other classes will be needed, so we’ll just dive in and build `NeuralNetwork` and figure out the other classes we’ll need as we go.


## 3.5 The NeuralNetwork Class, and Maybe Others
What should our `NeuralNetwork` class be able to do? At a high level, it should be able to learn from data: more precisely, it should be able to take in batches of data representing “observations” ($X$) and “correct answers” ($y$) and learn the relationship between $X$ and $y$, which means learning a function that can transform $X$ into predictions $p$ that are very close to $y$.

How exactly will this learning take place, given the `Layer` and `Operation` classes just defined? Recalling how the model from the last chapter worked, we’ll implement the following:
1. The neural network should take $X$ and pass it successively forward through each `Layer` (which is really a convenient wrapper around feeding it through many `Operations`), at which point the result will represent the prediction.

2. Next, the prediction should be compared with the value $y$ to calculate the loss and generate the “loss gradient,” which is the partial derivative of the loss with respect to each element in the last layer in the network (namely, the one that generated the prediction).

3. Finally, we’ll send this loss gradient successively backward through each layer, along the way computing the “parameter gradients”—the partial derivative of the loss with respect to each of the parameters—and storing them in the corresponding Operations.

<img src="images/03_05.png" style="width:600px;"/>

### Loss Class
How should we implement this? First, we’ll want our neural network to ultimately deal with `Layers` the same way our `Layers` dealt with `Operations`. For example, we want the forward method to receive $X$ as input and simply do something like:

```python
for layer in self.layers:
    X = layer.forward(X)
return X
```

Similarly, we’ll want our backward method to take in an argument (let’s initially call it `grad`) and do something like:

```python
for layer in reversed(self.layers): 
    grad = layer.backward(grad)
```

Where will `grad` come from? It has to come from the loss, a special function that takes in the prediction along with $y$ and:
+ Computes a single number representing the “penalty” for the network making that prediction.

+ Sends backward the gradient of the loss with respect to every element of the prediction. This gradient is what the last `Layer` in the network will receive as the input to its backward function.

In the example from the prior chapter, the loss function was the squared difference between the prediction and the target, and the gradient of the loss with respect to the prediction was computed accordingly.

How should we implement this? It seems like this concept is important enough to deserve its own class. Furthermore, this class can be implemented similarly to the `Layer` class, except the forward method will produce an actual number (a float) as the loss, instead of an ndarray to be sent forward to the next `Layer`. Let’s formalize this.

In [11]:
# The "loss" of a neural network
class Loss(object):
    def __init__(self):
        pass

    # Computes the actual loss value
    def forward(self, prediction: ndarray, target: ndarray) -> float:
        assert_same_shape(prediction, target)
        self.prediction = prediction
        self.target = target
        loss_value = self._output()
        return loss_value

    # Computes gradient of the loss value with respect to the input to the loss function
    def backward(self) -> ndarray:
        self.input_grad = self._input_grad()
        assert_same_shape(self.prediction, self.input_grad)
        return self.input_grad

    # Every subclass of "Loss" must implement the _output function.
    def _output(self) -> float:
        raise NotImplementedError()

    # Every subclass of "Loss" must implement the _input_grad function.
    def _input_grad(self) -> ndarray:
        raise NotImplementedError()

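As in the `Operation` class, we check that the gradient that the loss sends backward is the same shape as the prediction received as input from the last layer of the network. A concrete loss then only has to supply `_output` and `_input_grad`, as the mean squared error does here: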

In [12]:
class MeanSquaredError(Loss):
    def __init__(self) -> None:
        super().__init__()

    # Computes the per-observation squared error loss
    def _output(self) -> float:
        loss = (np.sum(np.power(self.prediction - self.target, 2))/self.prediction.shape[0])
        return loss
    
    # Computes the loss gradient with respect to the input for MSE loss
    def _input_grad(self) -> ndarray:
        return 2.0 * (self.prediction - self.target) / self.prediction.shape[0]

Here, we simply code the forward and backward rules of the mean squared error loss formula.
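
As a sanity check on those rules, this sketch recomputes the same loss and gradient as standalone functions and compares the gradient with a finite-difference estimate. The random `prediction` and `target` arrays are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
prediction = rng.normal(size=(6, 1))
target = rng.normal(size=(6, 1))

def mse(prediction, target):
    # Sum of squared errors divided by the number of observations.
    return float(np.sum((prediction - target) ** 2) / prediction.shape[0])

# Analytical gradient of the loss with respect to each element of the prediction.
analytical = 2.0 * (prediction - target) / prediction.shape[0]

# Central finite-difference estimate, one prediction at a time.
eps = 1e-6
numerical = np.zeros_like(prediction)
for i in range(prediction.shape[0]):
    shifted_up = prediction.copy()
    shifted_down = prediction.copy()
    shifted_up[i, 0] += eps
    shifted_down[i, 0] -= eps
    numerical[i, 0] = (mse(shifted_up, target) - mse(shifted_down, target)) / (2.0 * eps)

print(np.allclose(analytical, numerical, atol=1e-6))
```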

This is the last key building block we need to build deep learning from scratch. Let’s review how these pieces fit together and then proceed with building a model!


## 3.6 Deep Learning from Scratch
We ultimately want to build a `NeuralNetwork` class, using `Figure 3-5` as a guide, that we can use to define and train deep learning models. Before we dive in and start coding, let’s describe precisely what such a class would be and how it would interact with the `Operation`, `Layer`, and `Loss` classes we just defined:

1. A `NeuralNetwork` will have a list of `Layers` as an attribute. The `Layers` would be as defined previously, with forward and backward methods. These methods take in `ndarray` objects and return `ndarray` objects.

2. Each `Layer` will have a list of `Operations` saved in the operations attribute of the layer during the `_setup_layer` function.

3. These `Operations`, just like the Layer itself, have forward and backward methods that take in `ndarray` objects as arguments and return `ndarray` objects as outputs.

4. In each operation, the shape of the `output_grad` received in the backward method must be the same as the shape of the operation’s `output` attribute. The same is true for the shapes of the `input_grad` passed backward during the backward method and the `input_` attribute.

5. Some operations have parameters (stored in the `param` attribute); these operations inherit from the `ParamOperation` class. The same constraints on input and output shapes apply to `Layers` and their forward and backward methods as well — they take in `ndarray` objects and output `ndarray` objects, and the shapes of the input and output attributes and their corresponding gradients must match.

6. A `NeuralNetwork` will also have a `Loss`. This class will take the output of the last operation from the `NeuralNetwork` and the target, check that their shapes are the same, and calculate both a loss value (a number) and an `ndarray` `loss_grad` that will be fed into the output layer, starting **backpropagation**.

### Implementing Batch Training
We’ve covered several times the high-level steps for training a model one batch at a time. They are important and worth repeating:
1. Feed input through the model function (the “forward pass”) to get a prediction.

2. Calculate the number representing the loss.

3. Calculate the gradient of the loss with respect to the parameters, using the chain rule and the quantities computed during the forward pass.

4. Update the parameters using these gradients.

We would then feed a new batch of data through and repeat these steps.

Translating these steps into the `NeuralNetwork` framework just described is straightforward:
1. Receive $X$ and $y$ as inputs, both `ndarrays`.

2. Feed $X$ successively forward through each `Layer`.

3. Use the `Loss` to produce the loss value and the loss gradient to be sent backward.

4. Use the loss gradient as input to the backward method for the network, which will calculate the `param_grads` for each layer in the network.

5. Call the `update_params` function on each layer, which will use the overall learning rate for the `NeuralNetwork` as well as the newly calculated `param_grads`.
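
Before turning these steps into a class, here is a minimal sketch of the same loop on a single weight matrix, one mini-batch at a time. The helper `generate_batches`, the toy data, and the hyperparameters are all assumptions for this example.

```python
import numpy as np

def generate_batches(X, y, batch_size):
    # Yield consecutive slices of the data; the last batch may be smaller.
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
true_W = np.array([[1.5], [-0.5]])
y = X @ true_W

W = np.zeros((2, 1))
learning_rate = 0.1

for epoch in range(50):
    # Shuffle once per epoch so the batches differ between passes.
    order = rng.permutation(X.shape[0])
    for X_batch, y_batch in generate_batches(X[order], y[order], batch_size=16):
        # Steps 1-2: receive a batch and feed it forward.
        predictions = X_batch @ W
        # Step 3: loss gradient for mean squared error.
        loss_grad = 2.0 * (predictions - y_batch) / X_batch.shape[0]
        # Step 4: send the loss gradient backward to get the parameter gradient.
        W_grad = X_batch.T @ loss_grad
        # Step 5: update the parameter using the learning rate.
        W -= learning_rate * W_grad

print(W)
```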

We finally have our full definition of a neural network that can accommodate batch training. Now let’s code it up.

In [13]:
# The class for a neural network.
class NeuralNetwork(object):
    # Neural networks need layers, and a loss.
    def __init__(self, layers: List[Layer], loss: Loss, seed: int = 1) -> None:
        self.layers = layers
        self.loss = loss
        self.seed = seed
        if seed:
            for layer in self.layers:
                setattr(layer, "seed", self.seed)        

    # Passes data forward through a series of layers.
    def forward(self, x_batch: ndarray) -> ndarray:
        x_out = x_batch
        for layer in self.layers:
            x_out = layer.forward(x_out)
        return x_out
    
    # Passes data backward through a series of layers.
    def backward(self, loss_grad: ndarray) -> None:
        grad = loss_grad
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return None
    
    # Passes data forward through the layers.
    # Computes the loss.
    # Passes data backward through the layers.
    def train_batch(self, x_batch: ndarray, y_batch: ndarray) -> float:
        predictions = self.forward(x_batch)
        loss = self.loss.forward(predictions, y_batch)
        self.backward(self.loss.backward())
        return loss
    
    # Gets the parameters for the network.
    def params(self):
        for layer in self.layers:
            yield from layer.params

    # Gets the gradient of the loss with respect to the parameters for the network.
    def param_grads(self):
        for layer in self.layers:
            yield from layer.param_grads

With this `NeuralNetwork` class, we can implement the models from the prior chapter in a more modular, flexible way and define other models to represent complex nonlinear relationships between input and output. For example, here’s how to easily instantiate the two models we covered in the last chapter—the linear regression and the neural network:

```python
# The learning rate is not a NeuralNetwork argument; it belongs to the Optimizer.
linear_regression = NeuralNetwork(
    layers=[Dense(neurons=1, activation=Linear())],
    loss=MeanSquaredError())

neural_network = NeuralNetwork(
    layers=[Dense(neurons=13, activation=Sigmoid()),
            Dense(neurons=1, activation=Linear())],
    loss=MeanSquaredError())
```

We’re basically done; now we just feed data repeatedly through the network in order for it to learn. To make this process cleaner and easier to extend to the more complicated deep learning scenarios we’ll see in the following chapter, however, it will help us to define another class that carries out the training, as well as an additional class that carries out the “learning”, or the actual updating of the `NeuralNetwork` parameters given the gradients computed on the backward pass. Let’s quickly define these two classes.


## 3.7 Trainer and Optimizer
First, let’s note the similarities between these classes and the code we used to train the network in Chapter 2. There, we used the following code to implement the four steps described earlier for training the model:

```python
# pass X_batch forward and compute the loss 
forward_info, loss = forward_loss(X_batch, y_batch, weights)

# compute the gradient of the loss with respect to each of the weights 
loss_grads = loss_gradients(forward_info, weights)

# update the weights 
for key in weights.keys():
    weights[key] -= learning_rate * loss_grads[key]
```

This code sat inside a for loop that repeatedly fed data through the function defining our network and then updated the network's weights.

With the classes we have now, we’ll ultimately do this inside a fit function within the `Trainer` class that will mostly be a wrapper around the train function used in the prior chapter. The main difference is that inside this new function, the first two lines from the preceding code block will be replaced with this line:

```python
neural_network.train_batch(X_batch, y_batch)
```

Updating the parameters, which happens in the following two lines, will take place in a separate `Optimizer` class. And finally, the for loop that previously wrapped around all of this will take place in the `Trainer` class that wraps around the `NeuralNetwork` and the `Optimizer`.

Next, let’s discuss why we need an `Optimizer` class and what it should look like.

### Optimizer
In the model we described in the last chapter, each `Layer` contains a simple rule for updating the weights based on the parameters and their gradients. As we’ll touch on in the next chapter, there are many other update rules we can use, such as ones involving the history of gradient updates rather than just the gradient updates from the specific batch that was fed in at that iteration. Creating a separate `Optimizer` class will give us the flexibility to swap in one update rule for another, something that we’ll explore in more detail in the next chapter.

The base `Optimizer` class will take in a `NeuralNetwork` and, every time the step function is called, will update the parameters of the network based on their current values, their gradients, and any other information stored in the `Optimizer`:

In [14]:
# Base class for a neural network optimizer.
class Optimizer(object):
    # Every optimizer must have an initial learning rate.
    def __init__(self, lr: float = 0.01):
        self.lr = lr

    # Every optimizer must implement the "step" function.
    def step(self) -> None:
        pass

And here’s how this looks with the straightforward update rule we’ve seen so far, known as *stochastic gradient descent*:

In [15]:
# Stochastic gradient descent optimizer.
class SGD(Optimizer):   
    def __init__(self, lr: float = 0.01) -> None:
        super().__init__(lr)

    # For each parameter, adjust in the appropriate direction, with the magnitude of the adjustment 
    # based on the learning rate.
    def step(self):
        for (param, param_grad) in zip(self.net.params(), self.net.param_grads()):
            param -= self.lr * param_grad
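
The `step` interface is all that a different update rule needs. As a rough sketch of a rule that uses the history of gradient updates, here is SGD with momentum. This is not the chapter's implementation: `SGDMomentumSketch`, its `momentum` argument, and the `FakeNet` stand-in are assumptions made so the example runs on its own. It relies only on the `params()` / `param_grads()` interface of `NeuralNetwork`.

```python
import numpy as np

class FakeNet:
    """Hypothetical stand-in exposing the params()/param_grads() interface."""
    def __init__(self) -> None:
        self.weights = [np.array([1.0, 2.0])]
        self.grads = [np.array([0.5, -0.5])]

    def params(self):
        yield from self.weights

    def param_grads(self):
        yield from self.grads

class SGDMomentumSketch:
    """Illustrative only: keeps a running 'velocity' for each parameter."""
    def __init__(self, lr: float = 0.01, momentum: float = 0.9) -> None:
        self.lr = lr
        self.momentum = momentum
        self.velocities = None  # created lazily on the first step

    def step(self) -> None:
        if self.velocities is None:
            self.velocities = [np.zeros_like(p) for p in self.net.params()]
        for param, grad, velocity in zip(self.net.params(),
                                         self.net.param_grads(),
                                         self.velocities):
            # Blend the previous update direction with the new gradient.
            velocity *= self.momentum
            velocity += self.lr * grad
            param -= velocity  # in place, so the network sees the change

net = FakeNet()
optim = SGDMomentumSketch(lr=0.1, momentum=0.9)
setattr(optim, "net", net)  # attach the network as an attribute
optim.step()                # velocity = 0.05,  weights -> [0.95, 2.05]
optim.step()                # velocity = 0.095, weights -> [0.855, 2.145]
print(net.weights[0])
```

With a constant gradient, the second step is larger than the first because the velocity carries over part of the previous update.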

> Note that while our `NeuralNetwork` class does not have an `_update_params` method, we do rely on the `params()` and `param_grads()` methods to extract the correct `ndarrays` for optimization.
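
One detail worth spelling out is why `param -= self.lr * param_grad` works at all: `params()` yields the very arrays the layers hold, and `-=` on an `ndarray` modifies that array in place. The sketch below uses made-up numbers and a hypothetical module-level `params()` generator to contrast the in-place update with rebinding the loop variable, which would leave the stored weights untouched.

```python
import numpy as np

weights = np.array([1.0, 2.0])     # stands in for a layer's parameter array
grad = np.array([10.0, 10.0])
learning_rate = 0.1

def params():
    # Yields the same array object that `weights` refers to, not a copy.
    yield weights

# In-place update: modifies the array that `weights` refers to.
for param in params():
    param -= learning_rate * grad
print(weights)                     # [0. 1.]

# Rebinding: builds a new array and leaves the stored one unchanged.
before = weights.copy()
for param in params():
    param = param - learning_rate * grad
print(np.array_equal(weights, before))   # True
```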

That’s the basic `Optimizer` class; let’s cover the `Trainer` class next.

### Trainer
In addition to training the model as described previously, the `Trainer` class also links together the `NeuralNetwork` with the `Optimizer`, ensuring the latter trains the former properly. You may have noticed in the previous section that we didn’t pass in a `NeuralNetwork` when initializing our `Optimizer`; instead, we’ll assign the `NeuralNetwork` to be an attribute of the `Optimizer` when we initialize the `Trainer` class shortly, with this line:

```python
setattr(self.optim, 'net', self.net)
```

In the following subsection, I show a simplified but working version of the `Trainer` class that for now contains just the fit method. This method trains our model for a number of epochs and prints out the loss value after each set number of epochs. In each epoch, we:
1. Shuffle the data at the beginning of the epoch

2. Feed the data through the network in batches, updating the parameters after each batch has been fed through

The epoch ends when we have fed the entire training set through the `Trainer`.
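
The two steps of an epoch can be sketched on a toy dataset. The array contents, the seed, and the batch size below are arbitrary choices for illustration; the slicing follows the same pattern as a range-based batch loop.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily for the sketch
X = np.arange(20).reshape(10, 2)      # 10 observations, 2 features; row i is [2i, 2i+1]
y = np.arange(10).reshape(10, 1)      # row i is [i]

# Step 1: shuffle the rows of X and y with the same permutation.
perm = rng.permutation(X.shape[0])
X_shuffled, y_shuffled = X[perm], y[perm]

# Step 2: walk through the shuffled data in slices of `batch_size` rows.
batch_size = 4
batch_sizes = []
for start in range(0, X.shape[0], batch_size):
    X_batch = X_shuffled[start:start + batch_size]
    y_batch = y_shuffled[start:start + batch_size]
    # Features and targets stay paired after shuffling.
    assert (X_batch[:, 0] == 2 * y_batch[:, 0]).all()
    batch_sizes.append(X_batch.shape[0])

print(batch_sizes)   # [4, 4, 2]
```

The last batch is smaller whenever the number of observations is not a multiple of the batch size.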

The following code is a simple version of the `Trainer` class. It uses two self-explanatory helpers during the fit function:
+ **generate_batches**: generates batches of data from `X_train` and `y_train` for training
+ **permute_data**: shuffles `X_train` and `y_train` at the beginning of each epoch

We also include a restart argument in the fit function: if `True` (the default), the model’s parameters are reinitialized to random values when fit is called:

In [16]:
# Trains a neural network
class Trainer(object):
    # Requires a neural network and an optimizer in order for training to occur. 
    # Assign the neural network as an instance variable to the optimizer.
    def __init__(self, net: NeuralNetwork, optim: Optimizer) -> None:
        self.net = net
        self.optim = optim
        self.best_loss = 1e9
        setattr(self.optim, 'net', self.net)
    
    # Generates batches for training 
    def generate_batches(self, X: ndarray, y: ndarray, size: int = 32) -> Tuple[ndarray]:
        assert X.shape[0] == y.shape[0], \
        '''
        features and target must have the same number of rows, instead
        features has {0} and target has {1}
        '''.format(X.shape[0], y.shape[0])
        # gen batch
        N = X.shape[0]
        for ii in range(0, N, size):
            X_batch, y_batch = X[ii:ii+size], y[ii:ii+size]
            yield X_batch, y_batch

    # Fits the neural network on the training data for a certain number of epochs.
    # Every "eval_every" epochs, it evaluates the neural network on the testing data.
    def fit(self, X_train: ndarray, y_train: ndarray, X_test: ndarray, y_test: ndarray,
            epochs: int=100,
            eval_every: int=10,
            batch_size: int=32,
            seed: int = 1,
            restart: bool = True)-> None:
        np.random.seed(seed)
        # restart
        if restart:
            for layer in self.net.layers:
                layer.first = True
            self.best_loss = 1e9
        # epoch loop
        for e in range(epochs):
            if (e+1) % eval_every == 0:
                # for early stopping
                last_model = deepcopy(self.net)
            # shuffles X_train and y_train at the beginning of each epoch
            X_train, y_train = permute_data(X_train, y_train)
            batch_generator = self.generate_batches(X_train, y_train, batch_size)
            # batch loop
            for ii, (X_batch, y_batch) in enumerate(batch_generator):
                self.net.train_batch(X_batch, y_batch)
                self.optim.step()
            # test
            if (e+1) % eval_every == 0:
                test_preds = self.net.forward(X_test)
                loss = self.net.loss.forward(test_preds, y_test)
                if loss < self.best_loss:
                    print(f"Validation loss after {e+1} epochs is {loss:.3f}")
                    self.best_loss = loss
                else:
                    print(f"""Loss increased after epoch {e+1}, final loss was {self.best_loss:.3f}, 
                        using the model from epoch {e+1-eval_every}""")
                    self.net = last_model
                    # ensure self.optim is still updating self.net
                    setattr(self.optim, 'net', self.net)
                    break

## 3.8 Putting Everything Together
Here is the full code to train our network using the `Trainer` and `Optimizer` classes. We’ll set the learning rate to $0.01$ and the maximum number of epochs to $50$ and evaluate our models every $10$ epochs:

In [17]:
# Compute mean absolute error for a neural network.
def mae(y_true: ndarray, y_pred: ndarray): 
    return np.mean(np.abs(y_true - y_pred))

# Compute root mean squared error for a neural network.
def rmse(y_true: ndarray, y_pred: ndarray):
    return np.sqrt(np.mean(np.power(y_true - y_pred, 2)))

# Compute mae and rmse for a neural network.
def eval_regression_model(model: NeuralNetwork, X_test: ndarray, y_test: ndarray):
    preds = model.forward(X_test)
    preds = preds.reshape(-1, 1)
    print("Mean absolute error: {:.2f}".format(mae(preds, y_test)))
    print("Root mean squared error {:.2f}".format(rmse(preds, y_test)))
    
# Turns a 1D Tensor into 2D
def to_2d_np(a: ndarray, type: str="col") -> ndarray:
    assert a.ndim == 1, "Input tensors must be 1 dimensional"
    if type == "col":        
        return a.reshape(-1, 1)
    elif type == "row":
        return a.reshape(1, -1)

# shuffles dataset
def permute_data(X, y):
    perm = np.random.permutation(X.shape[0])
    return X[perm], y[perm]

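Load the Boston housing dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older scikit-learn release: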

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston

boston = load_boston()
data = boston.data
target = boston.target
features = boston.feature_names

s = StandardScaler()
data = s.fit_transform(data)


X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=80718)

# make target 2d array
y_train, y_test = to_2d_np(y_train), to_2d_np(y_test)
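
Reshaping the target into a column matters because of NumPy broadcasting. The sketch below uses made-up numbers, not the housing data, to show what happens when a column of predictions is compared with a 1D target: the subtraction silently produces a matrix of all pairwise differences, and any loss computed from it is wrong.

```python
import numpy as np

preds = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like a network output
target_1d = np.array([1.0, 2.0, 3.0])     # shape (3,), like the raw target

# Broadcasting produces a (3, 3) matrix of all pairwise differences.
wrong = preds - target_1d
print(wrong.shape)                         # (3, 3)

# With a column-shaped target, the subtraction is row by row as intended.
target_2d = target_1d.reshape(-1, 1)
right = preds - target_2d
print(right.shape)                         # (3, 1)

# The predictions are perfect, yet the broadcast version reports an error.
print(np.mean(wrong ** 2), np.mean(right ** 2))
```

A shape check such as the `assert_same_shape` helper defined at the top of this notebook is one way to guard against this kind of mismatch.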

In [19]:
lr = NeuralNetwork(
    layers=[Dense(neurons=1, activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

trainer = Trainer(lr, SGD(lr=0.01))
trainer.fit(X_train, y_train, X_test, y_test,
       epochs = 50,
       eval_every = 10,
       seed=20190501);
print()
eval_regression_model(lr, X_test, y_test)

Validation loss after 10 epochs is 30.293
Validation loss after 20 epochs is 28.469
Validation loss after 30 epochs is 26.293
Validation loss after 40 epochs is 25.541
Validation loss after 50 epochs is 25.087

Mean absolute error: 3.52
Root mean squared error 5.01


In [20]:
nn = NeuralNetwork(
    layers=[Dense(neurons=13, activation=Sigmoid()),
            Dense(neurons=1, activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

trainer = Trainer(nn, SGD(lr=0.01))
trainer.fit(X_train, y_train, X_test, y_test,
       epochs = 50,
       eval_every = 10,
       seed=20190501);
print()
eval_regression_model(nn, X_test, y_test)

Validation loss after 10 epochs is 27.435
Validation loss after 20 epochs is 21.839
Validation loss after 30 epochs is 18.918
Validation loss after 40 epochs is 17.195
Validation loss after 50 epochs is 16.215

Mean absolute error: 2.60
Root mean squared error 4.03


In [21]:
dl = NeuralNetwork(
    layers=[Dense(neurons=13, activation=Sigmoid()),
            Dense(neurons=13, activation=Sigmoid()),
            Dense(neurons=1, activation=Linear())],
    loss=MeanSquaredError(),
    seed=20190501
)

trainer = Trainer(dl, SGD(lr=0.01))
trainer.fit(X_train, y_train, X_test, y_test,
       epochs = 50,
       eval_every = 10,
       seed=20190501);
print()
eval_regression_model(dl, X_test, y_test)

Validation loss after 10 epochs is 44.143
Validation loss after 20 epochs is 25.278
Validation loss after 30 epochs is 22.339
Validation loss after 40 epochs is 16.500
Validation loss after 50 epochs is 14.655

Mean absolute error: 2.45
Root mean squared error 3.83
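
As a quick consistency check on the three runs, the final validation loss and the reported root mean squared error should agree, assuming `MeanSquaredError` averages the squared errors over the observations. Under that assumption, the square root of the loss after 50 epochs should match the RMSE printed by `eval_regression_model`. The numbers below are copied from the outputs above.

```python
import math

# (final validation loss, reported RMSE) for each model, taken from the runs above.
reported = {
    "linear regression": (25.087, 5.01),
    "one hidden layer": (16.215, 4.03),
    "two hidden layers": (14.655, 3.83),
}

for name, (mse, rmse_reported) in reported.items():
    rmse_from_mse = math.sqrt(mse)
    print(f"{name}: sqrt({mse}) = {rmse_from_mse:.2f}, reported {rmse_reported}")
```

In all three cases the two values agree to two decimal places.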
