# Multilayer Perceptrons

- https://d2l.ai/chapter_multilayer-perceptrons/index.html

The simplest deep networks are called multilayer perceptrons, and they consist of multiple layers of neurons each fully connected to those in the layer below (from which they receive input) and those above (which they, in turn, influence).

The **linear** model is:

$$
y = Wx + b
$$

- $y \in \mathbb{R}^{\text{out\_features}}$
- $x \in \mathbb{R}^{\text{in\_features}}$
- $W \in \mathbb{R}^{\text{out\_features} \times \text{in\_features}}$
- $b \in \mathbb{R}^{\text{out\_features}}$

Strictly speaking, this is an *affine transformation* of input features, which is characterized by a linear transformation of features via a weighted sum, combined with a *translation* via the added bias.
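As a small numerical sketch of why the bias makes the map affine rather than strictly linear, the snippet below checks that additivity fails by exactly one copy of $b$. The sizes, the seed, and the helper name `affine` are illustrative assumptions, not part of the original notes.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumed for this sketch).
in_features, out_features = 3, 2

# W maps R^3 -> R^2; float64 keeps the comparison numerically clean.
W = torch.randn(out_features, in_features, dtype=torch.float64)
b = torch.randn(out_features, dtype=torch.float64)

def affine(x):
    # Weighted sum (the linear part) followed by a translation by b.
    return W @ x + b

x1 = torch.randn(in_features, dtype=torch.float64)
x2 = torch.randn(in_features, dtype=torch.float64)

# A strictly linear map would satisfy f(x1 + x2) == f(x1) + f(x2).
lhs = affine(x1 + x2)
rhs = affine(x1) + affine(x2)

# The two sides differ by one extra copy of the bias.
gap = rhs - lhs
print(torch.allclose(gap, b))

# The zero vector is mapped to b, not to the zero vector.
zero_image = affine(torch.zeros(in_features, dtype=torch.float64))
print(torch.allclose(zero_image, b))
```

Setting `b` to zero would make both checks reduce to the usual linearity conditions.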

If we consider a batch of inputs, with one example per row (see [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)), the model becomes:

$$
y = xW + b
$$

- $y \in \mathbb{R}^{\text{batch\_size} \times \text{out\_features}}$
- $x \in \mathbb{R}^{\text{batch\_size} \times \text{in\_features}}$
- $W \in \mathbb{R}^{\text{in\_features} \times \text{out\_features}}$
- $b \in \mathbb{R}^{\text{out\_features}}$

PyTorch's `nn.Linear` stores the transposed weight matrix $A = W^T$ (so $A^T=W$) and computes:

$$
y = xA^T + b
$$

- $y \in \mathbb{R}^{\text{batch\_size} \times \text{out\_features}}$
- $x \in \mathbb{R}^{\text{batch\_size} \times \text{in\_features}}$
- $A \in \mathbb{R}^{\text{out\_features} \times \text{in\_features}}$
- $b \in \mathbb{R}^{\text{out\_features}}$

## Linear Neural Networks

![](https://d2l.ai/_images/softmaxreg.svg)
- A single-layer neural network

$$
o = xW + b
$$

This model maps inputs directly to outputs via a single affine transformation.

If our labels truly were related to the input data by a simple affine transformation, then this approach would be sufficient. However, linearity (in affine transformations) is a strong assumption.

For example, linearity implies the weaker assumption of monotonicity, i.e., that any increase in our feature must either always cause an increase in our model's output (if the corresponding weight is positive), or always cause a decrease in our model's output (if the corresponding weight is negative).
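The monotonicity claim can be checked with a quick sketch: sweep one feature at a time and watch the direction of the output. The weights, bias, and base input below are made-up values chosen only for illustration.

```python
import torch

# Hypothetical weights and bias for a scalar-output linear model.
w = torch.tensor([2.0, -1.5, 0.5])
b = torch.tensor(0.3)

def predict(x):
    # x has shape (num_examples, 3); the result has shape (num_examples,).
    return x @ w + b

base = torch.tensor([1.0, 1.0, 1.0])
steps = torch.linspace(0.0, 2.0, 5)  # amounts added to a single feature

directions = []
for j in range(3):
    xs = base.repeat(len(steps), 1)  # (5, 3) copies of the base input
    xs[:, j] += steps                # increase feature j only
    outs = predict(xs)
    diffs = outs[1:] - outs[:-1]     # change in output between sweep steps
    direction = "increasing" if (diffs > 0).all() else "decreasing"
    directions.append(direction)
    print(f"feature {j}: weight {w[j].item():+.1f} -> output is {direction}")
```

The direction depends only on the sign of the corresponding weight, never on the values of the other features.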

But what about classifying images of cats and dogs? Should increasing the intensity of the pixel at location (13, 17) always increase (or always decrease) the likelihood that the image depicts a dog? Reliance on a linear model corresponds to the implicit assumption that the only requirement for differentiating cats and dogs is to assess the brightness of individual pixels. This approach is doomed to fail in a world where inverting an image preserves the category.

## Incorporating Hidden Layers

We can overcome the limitations of linear models by incorporating one or more hidden layers. The easiest way to do this is to stack many fully connected layers on top of one another. Each layer feeds into the layer above it, until we generate outputs.

We can think of the first $L-1$ layers as our representation and the final layer as our linear predictor. This architecture is commonly called a multilayer perceptron, often abbreviated as MLP.

![](https://d2l.ai/_images/mlp.svg)
- A two-layer MLP with a hidden layer of five hidden units.

This MLP has four inputs, three outputs, and its hidden layer contains five hidden units. Since the input layer does not involve any calculations, producing outputs with this network requires implementing the computations for both the hidden and output layers; thus, the number of layers in this MLP is two. Note that both layers are fully connected. Every input influences every neuron in the hidden layer, and each of these in turn influences every neuron in the output layer.
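To make "fully connected" concrete, the following sketch counts the parameters of the two layers in the figure. Only the sizes 4, 5, and 3 come from the text; the variable names are assumptions for illustration.

```python
from torch import nn

# Layer sizes from the figure: 4 inputs, 5 hidden units, 3 outputs.
num_inputs, num_hidden, num_outputs = 4, 5, 3

# Two fully connected layers; the input layer itself has no parameters.
net = nn.Sequential(
    nn.Linear(num_inputs, num_hidden),
    nn.Linear(num_hidden, num_outputs),
)

# Each layer has (fan_in * fan_out) weights plus fan_out biases.
counts = [sum(p.numel() for p in layer.parameters()) for layer in net]
total = sum(counts)
print(counts, total)  # [25, 18] 43
```

The hidden layer contributes $4 \times 5 + 5 = 25$ parameters and the output layer $5 \times 3 + 3 = 18$.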

## From Linear to Nonlinear

We denote by the matrix $x \in \mathbb{R}^{n \times d}$ a minibatch of $n$ examples where each example has $d$ inputs (features). For a one-hidden-layer MLP whose hidden layer has $r$ hidden units, we denote by $h \in \mathbb{R}^{n \times r}$ the outputs of the hidden layer, which are **hidden representations**. Since the hidden and output layers are both fully connected, we have hidden-layer weights $W^{(1)} \in \mathbb{R}^{d \times r}$ and biases $b^{(1)}\in \mathbb{R}^{1 \times r}$ and output-layer weights $W^{(2)} \in \mathbb{R}^{r \times q}$ and biases $b^{(2)}\in \mathbb{R}^{1 \times q}$. This allows us to calculate the outputs $o \in \mathbb{R}^{n \times q}$ of the one-hidden-layer MLP as follows:

$$
\begin{align}
h & = xW^{(1)} + b^{(1)} \\
o & = hW^{(2)} + b^{(2)} \\
\end{align}
$$

The hidden units above are given by an affine function of the inputs, and the outputs are just an affine function of the hidden units. An affine function of an affine function is itself an affine function.

To see this formally, we can collapse out the hidden layer in the above definition, yielding an equivalent single-layer model with parameters $W=W^{(1)}W^{(2)}$ and $b=b^{(1)}W^{(2)}+b^{(2)}$:

$$
o = (xW^{(1)} + b^{(1)})W^{(2)} + b^{(2)} = xW^{(1)}W^{(2)} + b^{(1)}W^{(2)} + b^{(2)} = xW + b
$$
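The collapse can also be verified numerically. The sketch below uses random parameters with assumed sizes and compares the two-layer output against the single-layer one built from $W=W^{(1)}W^{(2)}$ and $b=b^{(1)}W^{(2)}+b^{(2)}$.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes: n examples, d inputs, r hidden units, q outputs.
n, d, r, q = 5, 3, 4, 2
dtype = torch.float64  # double precision keeps the comparison tight

x = torch.randn(n, d, dtype=dtype)
W1, b1 = torch.randn(d, r, dtype=dtype), torch.randn(1, r, dtype=dtype)
W2, b2 = torch.randn(r, q, dtype=dtype), torch.randn(1, q, dtype=dtype)

# Two stacked affine layers with no activation in between.
h = x @ W1 + b1
o_two_layer = h @ W2 + b2

# Equivalent single-layer parameters.
W = W1 @ W2
b = b1 @ W2 + b2
o_one_layer = x @ W + b

print(torch.allclose(o_two_layer, o_one_layer))
```

The collapsed `W` has shape `(d, q)`, so the hidden width `r` leaves no trace in the final model.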

In order to realize the potential of multilayer architectures, we need one more key ingredient: a nonlinear **activation function** $\sigma$ applied to each hidden unit following the affine transformation. For instance, a popular choice is the ReLU (rectified linear unit) activation function $\sigma(x)=\max(0, x)$, which operates on its arguments elementwise. The outputs of activation functions $\sigma(\cdot)$ are called **activations**. In general, with activation functions in place, it is no longer possible to collapse our MLP into a linear model. A 2-layer MLP then computes:

$$
\begin{align}
h & = \sigma(xW^{(1)} + b^{(1)}) \\
o & = hW^{(2)} + b^{(2)} \\
\end{align}
$$

Since each row in $x$ corresponds to an example in the minibatch, with some abuse of notation, we define the nonlinearity $\sigma$ to apply to its inputs in a rowwise fashion, i.e., one example at a time. Quite frequently the activation functions we use apply not merely rowwise but elementwise. That means that after computing the linear portion of the layer, we can calculate each activation without looking at the values taken by the other hidden units.
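A brief sketch of the rowwise versus elementwise point, using ReLU as the example activation: applying it to the whole matrix, to one row at a time, or to one entry at a time gives the same result. The tensor sizes and variable names are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Pre-activations for 3 examples and 4 hidden units (illustrative sizes).
z = torch.randn(3, 4)

# Whole matrix at once.
h_all = torch.relu(z)

# One example (row) at a time.
h_rows = torch.stack([torch.relu(row) for row in z])

# One entry at a time, using only that entry's own value.
h_elems = torch.tensor([[max(0.0, v.item()) for v in row] for row in z])

print(torch.equal(h_all, h_rows), torch.equal(h_all, h_elems))
```

An activation such as softmax would pass only the rowwise check, since each output depends on the other entries in the same row.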

To build more general MLPs, we can continue stacking such hidden layers. For an $L$-layer MLP:

$$
\begin{align}
h^{(1)} & = \sigma_1(xW^{(1)} + b^{(1)}) \\
h^{(2)} & = \sigma_2(h^{(1)}W^{(2)} + b^{(2)}) \\
& \vdots \\
h^{(l)} & = \sigma_l(h^{(l-1)}W^{(l)} + b^{(l)}) \\
& \vdots \\
o & = h^{(L-1)}W^{(L)} + b^{(L)} \\
\end{align}
$$
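One possible way to express the stacked equations in PyTorch is sketched below. `build_mlp` is a hypothetical helper written for this note, and using ReLU for every $\sigma_l$ is an assumption; the final layer stays affine, matching the last equation.

```python
import torch
from torch import nn

def build_mlp(sizes):
    """Build an MLP from layer widths [d, r_1, ..., r_{L-1}, q].

    Hypothetical helper for illustration: ReLU follows every affine
    layer except the last, which acts as the linear predictor.
    """
    layers = []
    num_affine = len(sizes) - 1
    for l in range(num_affine):
        layers.append(nn.Linear(sizes[l], sizes[l + 1]))
        if l < num_affine - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# L = 4 affine layers: three hidden layers and one output layer.
mlp = build_mlp([3, 8, 8, 4, 2])

x = torch.randn(5, 3)  # minibatch of 5 examples with 3 features each
o = mlp(x)
print(o.shape)  # torch.Size([5, 2])
```

Passing `[d, r, q]` reproduces the one-hidden-layer case described earlier.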

In [None]:
import torch
from torch import nn

In [None]:
n = 5
d = 3
r = 4
q = 2

In [None]:
# Input (n, d)
x = torch.randn(n, d)
print(x.shape)

# Single-layer neural network
layer = nn.Linear(d, q)

# Output (n, q)
o = layer(x)
print(o.shape)

torch.Size([5, 3])
torch.Size([5, 2])


In [None]:
# (q, d)
W = layer.weight
W.shape

torch.Size([2, 3])

In [None]:
# (q)
b = layer.bias
b.shape

torch.Size([2])

In [None]:
o

tensor([[-0.6177,  0.3824],
        [-0.9826, -0.2021],
        [-0.2464,  0.6741],
        [-0.3961,  0.6676],
        [-0.3510,  0.4752]], grad_fn=<AddmmBackward0>)

In [None]:
x @ W.T + b

tensor([[-0.6177,  0.3824],
        [-0.9826, -0.2021],
        [-0.2464,  0.6741],
        [-0.3961,  0.6676],
        [-0.3510,  0.4752]], grad_fn=<AddBackward0>)

In [None]:
# 2-layer MLP
class MLP(nn.Module):
    def __init__(self, d, r, q):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(d, r),
            nn.ReLU(),
            nn.Linear(r, q)
        )

    def forward(self, x):
        return self.layers(x)

In [None]:
model = MLP(d, r, q)

In [None]:
print(x.shape)
o = model(x)
print(o.shape)

torch.Size([5, 3])
torch.Size([5, 2])


In [None]:
model

MLP(
  (layers): Sequential(
    (0): Linear(in_features=3, out_features=4, bias=True)
    (1): ReLU()
    (2): Linear(in_features=4, out_features=2, bias=True)
  )
)

In [None]:
model.layers

Sequential(
  (0): Linear(in_features=3, out_features=4, bias=True)
  (1): ReLU()
  (2): Linear(in_features=4, out_features=2, bias=True)
)

In [None]:
layer1 = model.layers[0]
layer1

Linear(in_features=3, out_features=4, bias=True)

In [None]:
act = model.layers[1]
act

ReLU()

In [None]:
layer2 = model.layers[2]
layer2

Linear(in_features=4, out_features=2, bias=True)

In [None]:
# (r, d)
W1 = layer1.weight
W1.shape

torch.Size([4, 3])

In [None]:
# (r)
b1 = layer1.bias
b1.shape

torch.Size([4])

In [None]:
# (q, r)
W2 = layer2.weight
W2.shape

torch.Size([2, 4])

In [None]:
# (q)
b2 = layer2.bias
b2.shape

torch.Size([2])

In [None]:
o

tensor([[ 0.1659, -0.1240],
        [ 0.1274, -0.1152],
        [ 0.1274, -0.1152],
        [ 0.1350, -0.1169],
        [ 0.1383, -0.1652]], grad_fn=<AddmmBackward0>)

In [None]:
h = act(x @ W1.T + b1)
h @ W2.T + b2

tensor([[ 0.1659, -0.1240],
        [ 0.1274, -0.1152],
        [ 0.1274, -0.1152],
        [ 0.1350, -0.1169],
        [ 0.1383, -0.1652]], grad_fn=<AddBackward0>)