# 6. Using a neural network to fit the data
This chapter covers
* Nonlinear activation functions as the key difference compared with linear models
* Working with PyTorch’s `nn` module
* Solving a linear-fit problem with a neural network

In this chapter, we will make some changes to our model architecture: we’re going to
implement a full artificial neural network to solve our temperature-conversion
problem. We’ll continue using our training loop from the last chapter, along with our
Fahrenheit-to-Celsius samples split into training and validation sets. We could start by
using a quadratic model, rewriting `model` as a quadratic function of its input (for
example, $y = a * x**2 + b * x + c$). Since such a model would be differentiable,
PyTorch would take care of computing gradients, and the training loop would work as
usual. That wouldn’t be too interesting for us, though, because we would still be fixing
the shape of the function.
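
To make the quadratic option concrete, here is a minimal sketch. The names `t_u`, `t_c`, `t_p`, and `loss_fn`, the parameter values, and the single sample are assumptions made for illustration. The function uses only ordinary arithmetic, so it accepts plain floats as shown here; the same expression applied to PyTorch tensors would be differentiable by autograd.

```python
def model(t_u, a, b, c):
    """Quadratic model: the shape of the function is still fixed by hand."""
    return a * t_u ** 2 + b * t_u + c


def loss_fn(t_p, t_c):
    """Squared error for a single sample (a batch version would average this)."""
    return (t_p - t_c) ** 2


# Hypothetical parameter values and one made-up sample, for illustration only.
t_u, t_c = 2.0, 10.0
t_p = model(t_u, a=1.0, b=2.0, c=3.0)   # 1.0 * 4.0 + 2.0 * 2.0 + 3.0
print(t_p, loss_fn(t_p, t_c))
```

Only the parameter list changes compared with a linear `model`; the rest of a training loop would not need to know which of the two it is calling.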

This is the chapter where we begin to hook together the foundational work we’ve
put in and the PyTorch features you’ll be using day in and day out as you work on your
projects. You’ll gain an understanding of what’s going on underneath the porcelain of
the PyTorch API, rather than it just being so much black magic. Before we get into the
implementation of our new model, though, let’s cover what we mean by *artificial neural network*.

![](images/6.1.png)
Figure 6.1 Our mental model of the learning process, as implemented in chapter 5

## 6.1 Artificial neurons
At the core of deep learning are **neural networks**: mathematical entities capable of representing complicated functions through a composition of simpler functions. The term *neural network* is obviously suggestive of a link to the way our brain works. As a matter of fact, although the initial models were inspired by neuroscience,¹ modern artificial neural networks bear only a slight resemblance to the mechanisms of neurons in the brain. It seems likely that both artificial and physiological neural networks use vaguely similar mathematical strategies for approximating complicated functions because that family of strategies works very effectively.

The basic building block of these complicated functions is the neuron, as illustrated in figure 6.2. At its core, it is nothing but a linear transformation of the input (for example, multiplying the input by a number [the weight] and adding a constant [the bias]) followed by the application of a fixed nonlinear function (referred to as the activation function).

Mathematically, we can write this out as $o = f(w * x + b)$, with $x$ as our input, $w$ our weight or scaling factor, and $b$ as our bias or offset. $f$ is our activation function, set to the hyperbolic tangent, or `tanh` function here. In general, $x$ and, hence, $o$ can be simple scalars, or vector-valued (that is, holding many scalar values); similarly, $w$ can be a single scalar or a matrix, while $b$ is a scalar or a vector (the dimensionality of the inputs and weights must match, however). In the latter case, the previous expression is referred to as a *layer* of neurons, since it represents many neurons via the multidimensional weights and biases.
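
The expression $o = f(w * x + b)$ can be sketched in a few lines of plain Python, first for a scalar neuron and then for a layer. The helper names `neuron` and `layer`, the list-of-rows layout for the weight matrix, and all numeric values are assumptions chosen for illustration; in practice a tensor library would do this work.

```python
import math


def neuron(x, w, b):
    """Single neuron: a linear transformation followed by tanh."""
    return math.tanh(w * x + b)


def layer(x, W, b):
    """Layer of neurons: W is a list of rows (one row per neuron); x and b are lists."""
    if any(len(row) != len(x) for row in W) or len(W) != len(b):
        raise ValueError("dimensions of inputs, weights, and biases must match")
    return [
        math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical values, for illustration only.
o = neuron(0.5, w=2.0, b=-1.0)            # tanh(2.0 * 0.5 - 1.0) = tanh(0.0)
outs = layer([1.0, 2.0],
             W=[[0.5, -0.25], [1.0, 1.0], [0.0, 0.0]],
             b=[0.0, -3.0, 1.0])          # three neurons, two inputs each
print(o, outs)
```

Each row of `W` together with the matching entry of `b` is one neuron, which is why a single matrix and vector are enough to describe the whole layer.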

![](images/6.2.png)

### 6.1.1 Composing a multilayer network
A multilayer neural network, as represented in figure 6.3, is made up of a composition
of functions like those we just discussed:
```
x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)
```
where the output of a layer of neurons is used as an input for the following layer.
Remember that `w_0` here is a matrix, and `x` is a vector! Because the input is a vector, `w_0` can hold the weights of an entire layer of neurons, not just a single weight.
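
The chain of layers above can be sketched as a loop in which each layer's output becomes the next layer's input. The helper names `layer` and `forward`, the tiny 1-2-1 architecture, and the weights are all hypothetical, and plain lists stand in for tensors.

```python
import math


def layer(x, W, b, f=math.tanh):
    """Compute f(W @ x + b), with W as a list of rows and x, b as lists."""
    return [f(sum(w * v for w, v in zip(row, x)) + b_i) for row, b_i in zip(W, b)]


def forward(x, params):
    """Feed x through each (W, b) pair in turn."""
    for W, b in params:
        x = layer(x, W, b)
    return x


# Hypothetical weights: 1 input -> 2 hidden neurons -> 1 output.
params = [
    ([[1.0], [-1.0]], [0.0, 0.0]),   # w_0 (2x1 matrix) and b_0
    ([[1.0, 1.0]], [0.0]),           # w_1 (1x2 matrix) and b_1
]
y = forward([0.3], params)
print(y)
```

The shapes have to chain: the number of rows in one weight matrix must equal the number of columns in the next.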

![](images/6.3.png)

### 6.1.2 Understanding the error function
An important difference between our earlier linear model and what we’ll actually be using for deep learning is the shape of the error function. Our linear model and *error-squared loss function* had a *convex error curve* with a single, clearly defined minimum. If we were to use other methods (for example, solving the least-squares problem in closed form), we could solve for the parameters minimizing the error function automatically and definitively. That means that our parameter updates were attempting to *estimate* that single correct answer as best they could.
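
For a straight-line fit, one such alternative method is the closed-form least-squares solution. The sketch below assumes a scalar input, and the helper names `fit_line` and `mse` and the noise-free samples are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b for scalar x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(w, b, xs, ys):
    """Mean squared error of the line w * x + b on the samples."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up, noise-free samples drawn from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(w, b, mse(w, b, xs, ys))
```

No iteration is involved: because the error curve is convex, the minimum can be written down directly, and moving `w` or `b` away from it in any direction only increases the error.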

Neural networks do not have that same property of a convex error surface, even when using the same error-squared loss function! There’s no single right answer for each parameter we’re attempting to approximate. Instead, we are trying to get all of the parameters, when acting **in concert**, to produce a useful output. Since that useful output is only going to **approximate** the truth, there will be some level of imperfection. Where and how imperfections manifest is somewhat arbitrary, and by implication the
parameters that control the output (and, hence, the imperfections) are somewhat arbitrary as well. This results in neural network training looking very much like parameter estimation from a mechanical perspective, but we must remember that the theoretical underpinnings are quite different.
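
A small experiment illustrates why there is no single right answer. In the sketch below, the two-neuron network, its linear output, and every parameter value are assumptions chosen for illustration. Swapping the two hidden neurons gives a different parameter vector that computes exactly the same function, while the point halfway between the two vectors does worse.

```python
import math


def net(x, params):
    """1 input -> 2 tanh hidden neurons -> 1 linear output."""
    (w1, b1, v1), (w2, b2, v2) = params
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2)


def loss(params, xs, ys):
    """Mean squared error of the network on the samples."""
    return sum((net(x, params) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical parameters: (input weight, bias, output weight) per hidden neuron.
params_a = [(1.5, -0.5, 2.0), (-0.7, 0.3, 1.0)]
params_b = [params_a[1], params_a[0]]          # same network, hidden neurons swapped
midpoint = [
    tuple((p + q) / 2 for p, q in zip(na, nb))
    for na, nb in zip(params_a, params_b)
]

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [net(x, params_a) for x in xs]            # targets that params_a fits exactly

print(loss(params_a, xs, ys), loss(params_b, xs, ys), loss(midpoint, xs, ys))
```

On a convex surface, the loss halfway between two points can never exceed the average of their losses. Here both endpoints have zero loss and the midpoint does not, so the surface cannot be convex.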

A big part of the reason neural networks have non-convex error surfaces is the activation function. The ability of an ensemble of neurons to approximate a very wide range of useful functions depends on the combination of the linear and nonlinear behavior inherent to each neuron.
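
To see what the nonlinearity buys us, consider what happens without it. In the sketch below (scalar neurons, hypothetical function names and parameter values), two stacked layers with no activation function fold into a single weight and bias, so they are still just a linear model. Inserting `tanh` breaks that equivalence.

```python
import math


def two_linear_layers(x, w0, b0, w1, b1):
    """Two stacked neurons with no activation function."""
    return w1 * (w0 * x + b0) + b1


def collapsed(x, w0, b0, w1, b1):
    """The same computation folded into one weight and one bias."""
    return (w1 * w0) * x + (w1 * b0 + b1)


def two_tanh_layers(x, w0, b0, w1, b1):
    """Two stacked neurons, each followed by tanh."""
    return math.tanh(w1 * math.tanh(w0 * x + b0) + b1)


p = (2.0, 1.0, -3.0, 0.5)   # hypothetical w0, b0, w1, b1
for x in (-1.0, 0.0, 1.0):
    print(x, two_linear_layers(x, *p), collapsed(x, *p), two_tanh_layers(x, *p))
```

The first two columns of output agree for every input, whereas the `tanh` version does not lie on a straight line.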