In [1]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# What We're Building
$$\underbrace{\mathbb{R}_4}_{\text{Input layer}}\xrightarrow{}\underbrace{\mathbb{R}_{64}}_{\text{Hidden layer }h_1}\xrightarrow{}\underbrace{\mathbb{R}_{64}}_{\text{Hidden layer }h_2}\xrightarrow{}\underbrace{\mathbb{R}_3}_{\text{Output layer}}$$

<img src="../etc/multi-out-arch.svg" width="1200" height="700" style="display: block; margin: auto;">
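
The architecture above could be sketched in PyTorch roughly as follows. This is a minimal sketch, not a definitive implementation: the ReLU activations between layers and the batch of 5 random samples are assumptions, since the diagram only fixes the layer sizes (4, 64, 64, 3).

```python
import torch
import torch.nn as nn

# Assumption: ReLU between layers; the diagram only specifies layer widths.
model = nn.Sequential(
    nn.Linear(4, 64),   # input layer -> hidden layer h1
    nn.ReLU(),
    nn.Linear(64, 64),  # hidden layer h1 -> hidden layer h2
    nn.ReLU(),
    nn.Linear(64, 3),   # hidden layer h2 -> output layer (raw scores)
)

# Hypothetical batch: 5 samples with 4 features each.
x = torch.randn(5, 4)
logits = model(x)

# Count every trainable weight and bias in the network.
n_params = sum(p.numel() for p in model.parameters())

print(logits.shape)  # one row of 3 raw outputs per sample
print(n_params)
```

The final `nn.Linear` deliberately has no activation here, so the model returns raw outputs that can be normalized in a separate step.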

## Terminology
> __Fully connected__ means that each node in layer $n$ projects to each node in layer $n+1$. Each connection has its own weight.
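
To make "each connection has its own weight" concrete, the sketch below inspects a single fully connected layer. It assumes the 4-to-64 layer from the architecture above; `layer` is an illustrative name.

```python
import torch.nn as nn

# One fully connected layer: 4 input nodes project to 64 output nodes.
layer = nn.Linear(in_features=4, out_features=64)

# PyTorch stores the weights as (out_features, in_features),
# i.e. one weight per (input node, output node) pair.
weight_shape = tuple(layer.weight.shape)
n_connections = layer.weight.numel()

print(weight_shape)    # (64, 4)
print(n_connections)   # 256 = 4 * 64
```

Each of the 64 output nodes also has one bias term, which is stored separately in `layer.bias`.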

## Softmax'ing the Outputs
Applying the sigmoid function to each output in $\mathbb{R}_3$ independently would yield _isolated_ probabilities, i.e., not a probability distribution over all predicted classes. The resulting outputs would generally not add up to 1, i.e., $\sum \mathbb{R}_3\ne 1$. For this reason, we pass the raw outputs from $\mathbb{R}_3$ through the _softmax_ function instead.
> The _softmax_ function
$$\sigma_i=\frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$
* With softmax, the outputs sum to one:
    $\sum_i\sigma_i=1$
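
The difference between the two functions can be checked numerically. The sketch below uses a hypothetical vector of three raw outputs, `z`, chosen only for illustration.

```python
import torch

# Hypothetical raw outputs of the 3-node output layer.
z = torch.tensor([2.0, 1.0, 0.1])

# Sigmoid treats each output independently.
sig = torch.sigmoid(z)

# Softmax normalizes across all three outputs.
soft = torch.softmax(z, dim=0)

print(sig, sig.sum())    # element-wise probabilities; the sum is well above 1
print(soft, soft.sum())  # a distribution; the sum is 1 up to float rounding
```

Both functions preserve the ordering of the raw outputs, so the largest entry of `z` still receives the largest value; only softmax ties the three values together into a single distribution.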

Thus, we can summarize the final stage of the forward pass (from $\mathbb{R}_3$ to the predicted probabilities) as:
$$\mathbb{R}_3\xrightarrow{}\sigma\xrightarrow{}\begin{pmatrix}\hat y_1 \\ \hat y_2 \\ \hat y_3\end{pmatrix}$$
where $\sigma$ is the _softmax_ function and $\sum_{i=1}^{3}\hat y_i=1$.
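
As a rough illustration of this last step, the softmax can be written by hand and compared against `torch.softmax`. The `softmax` helper and the example `logits` are hypothetical. Subtracting the row maximum before exponentiating is an addition not described above: it leaves the result unchanged but avoids overflow for large raw outputs.

```python
import torch

def softmax(z: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dimension of z."""
    # Shifting by the row maximum does not change the result,
    # because the common factor cancels in the ratio.
    shifted = z - z.max(dim=-1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=-1, keepdim=True)

# Hypothetical raw outputs for two samples; the second row would
# overflow a naive exp() in float32 without the shift.
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1000.0, 1000.0, 1000.0]])

y_hat = softmax(logits)
row_sums = y_hat.sum(dim=-1)

print(y_hat)
print(row_sums)  # each row sums to 1 up to float rounding
```

The predicted class for a sample can then be read off with `y_hat.argmax(dim=-1)`, which picks the index of the largest probability in each row.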