# Multilayer Perceptrons

In this chapter, we begin to work with truly "deep" machine learning models, built from multiple connected layers. Deep networks carry a higher risk of overfitting, so we will also revisit the concepts of regularisation and generalisation.

In [1]:
%matplotlib inline
import torch
from d2l import torch as d2l

## Hidden Layers

In our previous examples, our model mapped inputs directly to outputs via a single affine transformation (linear plus bias), followed by a softmax operation to coerce the outputs into a valid probability distribution. The key issue with this is that the assumption of _linearity_ is a strong (and often incorrect) one. 

### Limitations of Linear Models

Linearity implies the weaker assumption of monotonicity: an increase in a feature (covariate) must always produce an increase in the prediction (if the corresponding weight is positive) or always produce a decrease (if the weight is negative). There are many situations where this doesn't make sense. For example, when body temperature is a feature for predicting a patient's health, temperatures that are too high and too low are _both_ bad.

For image recognition, this assumption is akin to saying that making an individual pixel brighter must always increase, or always decrease, the likelihood that an image shows, say, a cat rather than a dog. Perhaps this is appropriate for character recognition, but certainly not for more complex real-world scenarios.

For the temperature example, we could conceivably change our feature to be, say, the _distance_ from a healthy temperature (e.g. 37°C), but it's far less obvious how such feature engineering could be performed for image recognition. In deep neural networks, we learn this "feature engineering" from the data: the hidden layers produce a representation, and a final linear predictor acts on that representation.
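
The temperature example can be made concrete with a small sketch. The temperatures, weights and biases below are made-up illustrative values, not fitted to any data; the point is only that a linear model on the raw feature is forced to be monotonic, while the same linear form on the hand-engineered distance feature can treat both extremes as bad.

```python
import torch

# Hypothetical body temperatures in degrees Celsius (illustrative values only).
temps = torch.tensor([34.0, 36.0, 37.0, 38.0, 40.0])

# Linear model on the raw feature: risk = w * t + b.
# The weight and bias are assumptions chosen for illustration.
w_raw, b_raw = 0.5, -18.5
risk_raw = w_raw * temps + b_raw

# Hand-engineered feature: distance from an assumed healthy temperature of 37 C.
dist = (temps - 37.0).abs()
w_dist, b_dist = 0.5, 0.0
risk_dist = w_dist * dist + b_dist

print(risk_raw)   # strictly increasing with temperature
print(risk_dist)  # lowest at 37 C, higher on both sides
```

With images there is no obvious analogue of `dist` to write down by hand, which is why the representation has to be learned.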

### Incorporating Hidden Layers

We can overcome the limitations of linearity by incorporating hidden layers. The simplest way of doing this is to stack a series of fully-connected layers. In a network with $L$ layers, we can imagine that the first $L-1$ layers just serve to produce a _representation_ of the input data, which the final output layer then uses as a linear predictor. Such an architecture is commonly called a multilayer perceptron (MLP).

![Screenshot 2025-03-09 at 16.16.47.png](attachment:35a813f4-070e-4f94-8ba4-8b84c2d60c70.png)

The example above has 4 inputs, 3 outputs, and 5 units in the hidden layer.

### From Linear to Non-Linear

As before, we denote by $\mathbf{X} \in \mathbb{R} ^ {n \times d}$ a minibatch of $n$ examples, each containing $d$ features. For a one-hidden-layer neural network with $h$ units in the hidden layer, we denote by $\mathbf{H} \in \mathbb{R} ^ {n \times h}$ the outputs of the hidden layer.

Since the layers are fully connected, we have hidden layer weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$. We also have output layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ (where $q$ is the number of outputs) and biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$.

We can then compute the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$ as:

$$ \mathbf{H} = \mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)} $$
$$ \mathbf{O} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)} $$
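
To check the shapes, the two equations above can be traced with plain tensors. This is a minimal sketch: the sizes reuse the 4-input, 5-hidden-unit, 3-output example, while the batch size, the random initialisation and the variable names (`W1`, `b1`, `W2`, `b2`) are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# n examples, d input features, h hidden units, q outputs (n = 2 is arbitrary).
n, d, h, q = 2, 4, 5, 3

X = torch.randn(n, d)            # minibatch of inputs
W1 = torch.randn(d, h) * 0.01    # hidden layer weights, W^(1)
b1 = torch.zeros(1, h)           # hidden layer biases,  b^(1)
W2 = torch.randn(h, q) * 0.01    # output layer weights, W^(2)
b2 = torch.zeros(1, q)           # output layer biases,  b^(2)

H = X @ W1 + b1   # shape (n, h); b1 is broadcast across the n rows
O = H @ W2 + b2   # shape (n, q); b2 is broadcast across the n rows

print(H.shape, O.shape)
```

The biases have a leading dimension of 1, so broadcasting adds the same bias row to every example in the minibatch.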

However, although we are now tracking more variables and parameters, the hidden units are just an affine function of the inputs, and the outputs are just an affine function of the hidden units, so we still have a linear model! In general, an affine function of an affine function is itself an affine function, and our linear model was already capable of representing any affine function we may have wished to represent.
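
Substituting the hidden layer into the output layer gives $\mathbf{O} = (\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X}\mathbf{W} + \mathbf{b}$, with $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$. The sketch below checks this numerically; the sizes and random values are arbitrary assumptions, and double precision is used only to keep rounding error negligible.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 2, 4, 5, 3
dtype = torch.float64

X = torch.randn(n, d, dtype=dtype)
W1 = torch.randn(d, h, dtype=dtype)
b1 = torch.randn(1, h, dtype=dtype)
W2 = torch.randn(h, q, dtype=dtype)
b2 = torch.randn(1, q, dtype=dtype)

# Two stacked affine layers with no non-linearity in between.
O_two_layer = (X @ W1 + b1) @ W2 + b2

# The equivalent single affine layer.
W = W1 @ W2          # shape (d, q)
b = b1 @ W2 + b2     # shape (1, q)
O_single = X @ W + b

print(torch.allclose(O_two_layer, O_single))
```

The hidden layer adds parameters but no expressive power: any weights for the two-layer version can be replaced by a single `W` and `b`.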