# 6. Using a neural network to fit the data
This chapter covers
* Nonlinear activation functions as the key difference compared with linear models
* Working with PyTorch’s `nn` module
* Solving a linear-fit problem with a neural network

In this chapter, we will make some changes to our model architecture: we’re going to
implement a full artificial neural network to solve our temperature-conversion
problem. We’ll continue using our training loop from the last chapter, along with our
Fahrenheit-to-Celsius samples split into training and validation sets. We could start to
use a quadratic model: rewriting `model` as a quadratic function of its input (for
example, `y = a * x**2 + b * x + c`). Since such a model would be differentiable,
PyTorch would take care of computing gradients, and the training loop would work as
usual. That wouldn’t be too interesting for us, though, because we would still be fixing
the shape of the function.
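As a minimal sketch of what that quadratic variant could look like: the sample values and the names `quadratic_model` and `params` below are made up for illustration and are not the data or code from chapter 5.

```python
import torch

# Hypothetical, already-normalized inputs and targets, for illustration only.
t_u = torch.tensor([3.5, 5.6, 5.8, 8.2, 4.9])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0])

def quadratic_model(t_u, a, b, c):
    # Still a fixed functional form: only a, b, and c can change.
    return a * t_u ** 2 + b * t_u + c

# a, b, and c packed into a single tensor that autograd tracks.
params = torch.tensor([1.0, 1.0, 0.0], requires_grad=True)

t_p = quadratic_model(t_u, *params)
loss = ((t_p - t_c) ** 2).mean()
loss.backward()  # autograd fills params.grad with d(loss)/d(a, b, c)

print(params.grad)
```

Nothing in the training loop would need to change, but the model could still only ever produce a parabola.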

This is the chapter where we begin to hook together the foundational work we’ve
put in and the PyTorch features you’ll be using day in and day out as you work on your
projects. You’ll gain an understanding of what’s going on underneath the porcelain of
the PyTorch API, rather than it just being so much black magic. Before we get into the
implementation of our new model, though, let’s cover what we mean by *artificial neural network*.

![](images/6.1.png)
Figure 6.1 Our mental model of the learning process, as implemented in chapter 5

## 6.1 Artificial neurons
At the core of deep learning are **neural networks**: mathematical entities capable of representing complicated functions through a composition of simpler functions. The term neural network is obviously suggestive of a link to the way our brain works. As a matter of fact, although the initial models were inspired by neuroscience, modern artificial neural networks bear only a slight resemblance to the mechanisms of neurons in the brain. It seems likely that both artificial and physiological neural networks use vaguely similar mathematical strategies for approximating complicated functions because that family of strategies works very effectively.

The basic building block of these complicated functions is the neuron, as illustrated in figure 6.2. At its core, it is nothing but a linear transformation of the input (for example, multiplying the input by a number [the weight] and adding a constant [the bias]) followed by the application of a fixed nonlinear function (referred to as the activation function).

Mathematically, we can write this out as $o = f(w * x + b)$, with $x$ as our input, $w$ our weight or scaling factor, and $b$ as our bias or offset. $f$ is our activation function, set to the hyperbolic tangent, or `tanh` function here. In general, $x$ and, hence, $o$ can be simple scalars, or vector-valued (meaning holding many scalar values); and similarly, $w$ can be a single scalar or matrix, while $b$ is a scalar or vector (the dimensionality of the inputs and weights must match, however). In the latter case, the previous expression is referred to as a *layer* of neurons, since it represents many neurons via the multidimensional weights and biases.
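The expression above can be sketched directly in code. This is only an illustration: the function name `neuron` and all weight, bias, and input values are arbitrary choices, not values from the text.

```python
import math
import torch

def neuron(x, w, b):
    # o = f(w * x + b), with f set to tanh
    return math.tanh(w * x + b)

o = neuron(x=0.5, w=2.0, b=-1.0)  # w * x + b == 0.0, so o == tanh(0.0)

# A layer: x holds 3 input values, w is a 2 x 3 matrix (one row per neuron),
# and b has one entry per neuron.
x_vec = torch.tensor([1.0, 2.0, 3.0])
w_mat = torch.tensor([[0.1, 0.2, 0.3],
                      [-0.3, 0.0, 0.1]])
b_vec = torch.tensor([-1.4, 0.5])
o_vec = torch.tanh(w_mat @ x_vec + b_vec)  # one output per neuron

print(o, o_vec.shape)
```

The matrix product only works because `w_mat` has as many columns as `x_vec` has entries, which is the dimensionality constraint mentioned above.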

![](images/6.2.png)

### 6.1.1 Composing a multilayer network
A multilayer neural network, as represented in figure 6.3, is made up of a composition
of functions like those we just discussed:
```
x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)
```
where the output of a layer of neurons is used as an input for the following layer.
Remember that `w_0` here is a matrix, and `x` is a vector! Using a vector allows `w_0` to hold an entire layer of neurons, not just a single weight.
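A rough sketch of this composition with two layers follows. The layer sizes, the random weights, and the helper name `layer` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, just for repeatability

def layer(x, w, b):
    # One layer of neurons: linear transformation followed by tanh.
    return torch.tanh(w @ x + b)

x = torch.tensor([0.5, -1.0, 2.0])             # 3 input values

w_0, b_0 = torch.randn(4, 3), torch.randn(4)   # 3 inputs -> 4 neurons
w_1, b_1 = torch.randn(2, 4), torch.randn(2)   # 4 inputs -> 2 neurons

x_1 = layer(x, w_0, b_0)   # output of the first layer, shape (4,)
y = layer(x_1, w_1, b_1)   # first layer's output feeds the second, shape (2,)

print(x_1.shape, y.shape)
```

Each row of `w_0` plays the role of one neuron's weights, which is why a single matrix can stand for the whole layer.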

![](images/6.3.png)

### 6.1.2 Understanding the error function
An important difference between our earlier linear model and what we’ll actually be using for deep learning is the shape of the error function. Our linear model and *error-squared loss function* had a *convex error curve* with a singular, clearly defined minimum. If we were to use other methods, we could solve for the parameters minimizing the error function automatically and definitively. That means that our parameter updates were attempting to *estimate* that singular correct answer as best they could.
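One such "other method" is the closed-form least-squares solution for a line. The sketch below uses made-up, noise-free Fahrenheit/Celsius pairs rather than the chapter's samples, and recovers the slope and offset without any training loop.

```python
import torch

# Noise-free Fahrenheit/Celsius pairs, made up for illustration.
t_f = torch.tensor([32.0, 50.0, 68.0, 86.0, 104.0])
t_c = torch.tensor([0.0, 10.0, 20.0, 30.0, 40.0])

# Closed-form least-squares solution for t_c ~ w * t_f + b.
f_mean, c_mean = t_f.mean(), t_c.mean()
w = ((t_f - f_mean) * (t_c - c_mean)).sum() / ((t_f - f_mean) ** 2).sum()
b = c_mean - w * f_mean

print(w.item(), b.item())  # close to 5/9 and -160/9
```

Because the error surface of this model is convex, that single solution is the minimum that gradient descent was working toward.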

Neural networks do not have that same property of a convex error surface, even when using the same error-squared loss function! There’s no single right answer for each parameter we’re attempting to approximate. Instead, we are trying to get all of the parameters, when acting **in concert**, to produce a useful output. Since that useful output is only going to **approximate** the truth, there will be some level of imperfection. Where and how imperfections manifest is somewhat arbitrary, and by implication the
parameters that control the output (and, hence, the imperfections) are somewhat arbitrary as well. This results in neural network training looking very much like parameter estimation from a mechanical perspective, but we must remember that the theoretical underpinnings are quite different.

A big part of the reason neural networks have non-convex error surfaces is due to the activation function. The ability of an ensemble of neurons to approximate a very wide range of useful functions depends on the combination of the linear and nonlinear behavior inherent to each neuron.

### 6.1.3 All we need is activation
The activation function plays two important roles:
* In the inner parts of the model, it allows the output function to have different slopes at different values, something a linear function by definition cannot do. By trickily composing these differently sloped parts for many outputs, neural networks can approximate arbitrary functions, as we will see in section 6.1.6.
* At the last layer of the network, it has the role of concentrating the outputs of the preceding linear operation into a given range.

Let’s talk about what the second point means. Pretend that we’re assigning a “good doggo” score to images. Pictures of retrievers and spaniels should have a high score, while images of airplanes and garbage trucks should have a low score. Bear pictures should have a lowish score, too, although higher than garbage trucks.

**CAPPING THE OUTPUT RANGE**

We want to firmly constrain the output of our linear operation to a specific range so that the consumer of this output doesn’t have to handle numerical inputs of puppies at 12/10, bears at –10, and garbage trucks at –1,000.
One possibility is to just cap the output values: anything below 0 is set to 0, and anything
above 10 is set to 10. That’s a simple activation function called [`torch.nn.Hardtanh`](https://pytorch.org/docs/stable/nn.html#hardtanh), but note that the default range is –1 to +1.
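A small sketch of capping with `nn.Hardtanh` follows. The scores are invented for illustration, and `min_val` and `max_val` are set explicitly to get the 0-to-10 range described above.

```python
import torch
import torch.nn as nn

# Made-up raw scores: puppy, bear, garbage truck, and one already in range.
scores = torch.tensor([12.0, -10.0, -1000.0, 7.5])

cap = nn.Hardtanh(min_val=0.0, max_val=10.0)
capped = cap(scores)             # tensor([10.0, 0.0, 0.0, 7.5])

default = nn.Hardtanh()(scores)  # default range: tensor([1., -1., -1., 1.])

print(capped, default)
```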

**COMPRESSING THE OUTPUT RANGE**
Another family of functions that works well is the sigmoid family, which includes `torch.nn.Sigmoid` (that is, `1 / (1 + e ** -x)`), `torch.tanh`, and others that we’ll see in a moment. These functions have a curve that asymptotically approaches 0 or –1 as x goes to negative infinity and approaches 1 as x increases, with a mostly constant slope around x == 0. Conceptually, functions shaped this way work well because there’s an area in the middle of our linear function’s output that our neuron (which, again, is just a linear function followed by an activation) will be sensitive to, while everything else gets lumped next to the boundary values. As we can see in figure 6.4, our garbage truck gets a score of –0.97, while bears and foxes and wolves end up somewhere in the –0.3 to 0.3 range.

![](images/6.4.png)
Figure 6.4 Dogs, bears, and garbage trucks being mapped to how dog-like they are via the tanh activation function

```
# In[1]:
import math
math.tanh(-2.2)

# Out[1]:
-0.9757431300314515

# In[2]:
math.tanh(0.1)

# Out[2]:
0.09966799462495582

# In[3]:
math.tanh(2.5)

# Out[3]:
0.9866142981514303
```

With the bear in the sensitive range, small changes to the bear will result in a noticeable
change to the result.

### 6.1.4 More activation functions
There are quite a few activation functions, some of which are shown in figure 6.5. In the first column, we see the smooth functions `Tanh` and `Softplus`, while the second column has “hard” versions of the activation functions to their left: `Hardtanh` and `ReLU`. `ReLU` (for *rectified linear unit*) deserves special note, as it is currently considered one of the best-performing general activation functions; many state-of-the-art results
have used it.

![](images/6.5.png)
Figure 6.5 A collection of common and not-so-common activation functions
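To get a feel for how these four functions differ, we can apply each one to the same handful of inputs. This is only a sketch, and the sample inputs are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])

activations = {
    "Tanh": nn.Tanh(),
    "Softplus": nn.Softplus(),
    "Hardtanh": nn.Hardtanh(),
    "ReLU": nn.ReLU(),
}

# Apply every activation to the same inputs so the rows can be compared.
outputs = {name: f(x) for name, f in activations.items()}
for name, out in outputs.items():
    print(f"{name:9s}", out)
```

The smooth functions bend gradually toward their limits, while `Hardtanh` and `ReLU` clip abruptly.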

### 6.1.5 Choosing the best activation function
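By definition, activation functions are nonlinear, and they are differentiable (at least almost everywhere) so that gradients can be computed through them. Without these characteristics, the network either falls back to being a linear model or becomes difficult to train.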

The following are true for the functions:
* They have at least one sensitive range, where nontrivial changes to the input result in a corresponding nontrivial change to the output. This is needed for training.
* Many of them have an insensitive (or saturated) range, where changes to the input result in little or no change to the output.

The activation function will have at least one of these:
* A lower bound that is approached (or met) as the input goes to negative infinity
* A similar-but-inverse upper bound for positive infinity

Thinking of what we know about how backpropagation works, we can figure out that the errors will propagate backward through the activation more effectively when the inputs are in the response range, while errors will not greatly affect neurons for which the input is saturated (since the gradient will be close to zero, due to the flat area around the output).
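We can check this claim with autograd. The sketch below evaluates the gradient of `tanh` at one input in the sensitive range and two saturated ones; the input values are arbitrary.

```python
import torch

# 0.0 sits in the sensitive range; -5.0 and 5.0 are deep in saturation.
x = torch.tensor([-5.0, 0.0, 5.0], requires_grad=True)

torch.tanh(x).sum().backward()
grad = x.grad  # elementwise d tanh(x) / dx, which equals 1 - tanh(x) ** 2

print(grad)
```

The middle entry is 1, while the two outer entries are close to zero, so almost no error signal would flow back through a neuron in that state.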

Put together, all this results in a pretty powerful mechanism: we’re saying that in a network built out of linear + activation units, when different inputs are presented to the network, (a) different units will respond in different ranges for the same inputs, and (b) the errors associated with those inputs will primarily affect the neurons operating in the sensitive range, leaving other units more or less unaffected by the learning process. In addition, thanks to the fact that derivatives of the activation with respect to its inputs are often close to 1 in the sensitive range, estimating the parameters of the linear transformation through gradient descent for the units that operate in that range will look a lot like the linear fit we have seen previously.

### 6.1.6 What learning means for a neural network
With a deep neural network model, we have a universal
approximator and a method to estimate its parameters. This approximator can be customized
to our needs, in terms of model capacity and its ability to model complicated
input/output relationships, just by composing simple building blocks. We can see
some examples of this in figure 6.6.

![](images/6.6.png)

Figure 6.6 Composing multiple linear units and tanh activation functions to produce nonlinear outputs

The four upper-left graphs show four neurons—A, B, C, and D—each with its own
(arbitrarily chosen) weight and bias. Each neuron uses the `Tanh` activation function with a min of –1 and a max of 1. The varied weights and biases move the center point and change how drastically the transition from min to max happens, but they clearly all have the same general shape. The columns to the right of those show both pairs of neurons added together (A + B and then C + D). Here, we start to see some interesting properties that mimic a single layer of neurons. A + B shows a slight S curve, with the extremes approaching 0, but both a positive bump and a negative bump in the middle. Conversely, C + D has only a large positive bump, which peaks at a higher value than our single-neuron max of 1.

In the third row, we begin to compose our neurons as they would be in a two-layer network. Both C(A + B) and D(A + B) have the same positive and negative bumps that A + B shows, but the positive peak is more subtle. The composition of C(A + B) + D(A + B) shows a new property: two clearly negative bumps, and possibly a very subtle second positive peak as well, to the left of the main area of interest. All this with only four neurons in two layers!
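The composition described above can be sketched with four `tanh` neurons. The weights and biases below are invented, since the values behind figure 6.6 are not given, so the exact bumps will differ from the figure.

```python
import math

def make_neuron(w, b):
    # A single neuron: linear transformation followed by tanh.
    return lambda x: math.tanh(w * x + b)

# Hypothetical weights and biases, chosen arbitrarily.
a = make_neuron(-2.0, -1.25)
b = make_neuron(1.0, 0.75)
c = make_neuron(4.0, 1.0)
d = make_neuron(-3.0, -1.5)

def two_layer(x):
    hidden = a(x) + b(x)            # first layer: A + B
    return c(hidden) + d(hidden)    # second layer: C(A + B) + D(A + B)

xs = [i / 2 for i in range(-6, 7)]  # inputs from -3.0 to 3.0
ys = [two_layer(x) for x in xs]

for x, y in zip(xs, ys):
    print(f"{x:5.1f} {y:8.4f}")
```

Changing any one of the four weight and bias pairs reshapes the whole output curve, which is what the training process exploits.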

## 6.2 The PyTorch nn module
PyTorch has a whole submodule dedicated to neural networks, called `torch.nn`. It contains the building blocks needed to create all sorts of neural network architectures. Those building blocks are called **modules** in PyTorch parlance (such building blocks are often referred to as **layers** in other frameworks). A PyTorch module is a Python class deriving from the `nn.Module` base class. A module can have one or more **Parameter** instances as attributes, which are tensors whose values are optimized during the training process (think `w` and `b` in our linear model). A module can also have one or more submodules (instances of `nn.Module` subclasses) as attributes, and it will be able to track their parameters as well.
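As a minimal sketch of these ideas, the hypothetical `TinyNet` below owns one `Parameter` directly and one submodule, and reports the parameters of both. The class name and sizes are invented for the example.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(1, 3)             # submodule with its own weight and bias
        self.scale = nn.Parameter(torch.ones(1))  # parameter owned directly

    def forward(self, x):
        return self.scale * torch.tanh(self.hidden(x))

net = TinyNet()

# named_parameters() walks the module and its submodules.
param_shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
print(param_shapes)
```

Assigning a `Parameter` or a module as an attribute is what registers it, so no separate bookkeeping is needed.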

### 6.2.1 Using `__call__` rather than `forward`
All PyTorch-provided subclasses of `nn.Module` have their `__call__` method defined. This allows us to instantiate an `nn.Linear` and call it as if it were a function, like so (code/p1ch6/1_neural_networks.ipynb):

```
# In[5]:
import torch.nn as nn
linear_model = nn.Linear(1, 1) # We’ll look into the constructor 
linear_model(t_un_val)         # arguments in a moment.

# Out[5]:
tensor([[0.6018],
    [0.2877]], grad_fn=<AddmmBackward>)

```

Calling an instance of `nn.Module` with a set of arguments ends up calling a method named `forward` with the same arguments. The `forward` method is what executes the forward computation, while `__call__` does other rather important chores before and after calling `forward`. So, it is technically possible to call `forward` directly, and it will produce the same output as `__call__`, but this should not be done from user code:
```
y = model(x)             # Correct!
y = model.forward(x)     # Silent error. Don’t do it!
```
Here’s the implementation of `Module.__call__` (we left out the bits related to the JIT and made some simplifications for clarity; torch/nn/modules/module.py, line 483, class: Module):
```
def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        hook(self, input)
    result = self.forward(*input, **kwargs)
    
    for hook in self._forward_hooks.values():
        hook_result = hook(self, input, result)
        # ...
    for hook in self._backward_hooks.values():
        # ...
    return result
```
As we can see, there are a lot of hooks that won’t get called properly if we just use `.forward(…)` directly.
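The difference is easy to observe with a forward hook. The sketch below registers one through `register_forward_hook`; the hook function `record_call` and the input tensor are made up for the example.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
calls = []

def record_call(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output.
    calls.append(tuple(output.shape))

handle = model.register_forward_hook(record_call)

x = torch.ones(2, 1)
model(x)           # goes through __call__, so the hook runs
model.forward(x)   # bypasses __call__, so the hook is skipped
handle.remove()    # unregister the hook again

print(len(calls))
```

Only the first of the two calls is recorded, even though both return the same values.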

### 6.2.2 Returning to the linear model
Back to our linear model. The constructor to `nn.Linear` accepts three arguments: the number of input features, the number of output features, and whether the linear model includes a bias or not (defaulting to `True`, here):
```
# In[5]:
import torch.nn as nn

linear_model = nn.Linear(1, 1) # The arguments are input size, output
linear_model(t_un_val)         # size, and bias defaulting to True.

# Out[5]:
tensor([[0.6018],
    [0.2877]], grad_fn=<AddmmBackward>)
```
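To see what those three arguments control, we can inspect the parameters they create. This is a small sketch; the second module, `no_bias`, and its sizes are arbitrary.

```python
import torch
import torch.nn as nn

linear_model = nn.Linear(1, 1)
weight_shape = tuple(linear_model.weight.shape)  # (out_features, in_features)
bias_shape = tuple(linear_model.bias.shape)      # (out_features,)

# With bias=False, no bias parameter is created and the attribute is None.
no_bias = nn.Linear(3, 2, bias=False)

batch = torch.ones(5, 3)   # 5 samples with 3 features each
out = no_bias(batch)       # 5 samples with 2 output features each

print(weight_shape, bias_shape, no_bias.bias, out.shape)
```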