# Neural Networks for Statisticians

## 0. Introduction

Introductions to neural networks are typically presented for people coming from a background in computer science and/or programming, but rarely for people coming from statistics. A statistician hoping to learn about this exciting field frequently encounters a few hurdles:
- Terminological differences may obscure connections between neural networks and standard statistical models
- The machine learning community frequently emphasizes things that statisticians tend to ignore (distributed computing, optimization), while ignoring things that statisticians tend to emphasize (interval estimators, hypothesis testing)
- The machine learning community uses different computational tools than statisticians
- The study of neural networks is an extremely fast-moving field; important new advances appear almost every month

This tutorial will take the nontraditional path of implementing simple statistical methods in industrial-grade neural network packages before really justifying their need. This is in part to show just how easy developing and training neural network models has become, even compared to just a few years ago. I hope that upon seeing this, readers will be motivated to look further into these packages, and thus further into the field of neural networks.

I will avoid the term "deep learning" here simply because I don't find it particularly necessary to distinguish it from "the study of neural networks" for our purposes here. Also, in my experience, statisticians tend to dismiss it and other newer data-related terms ("data science", "big data", or even "machine learning") as contentless buzzwords. I disagree, but we can discuss that some other time.

## 1. Standard Models Rewritten as Neural Nets

I believe that the best place to begin understanding neural networks is to see them from the generalized linear model (GLM) point of view. Simple network architectures can be directly related to the three models presented here; more advanced architectures can then be built on top of those.

Note that despite being written as graphs, it is best to view these models as having *no connection* with either Bayesian networks or decision tree models. This isn't actually true, but for now, pretend that it is.

### Linear Regression

Here is a linear function of $x_1, \ldots, x_p$, written as a neural network:

![Linear Regression](../pics/Linear-Regression.png)

This graph is read from bottom to top. There are two layers in this network: an **input layer** and an **output layer**. Each arrow corresponds to a **weight** $w_j$. The line in the output neuron represents the identity function; it is the **activation function** for that neuron. Input neurons do not have activation functions; they exist solely to pass along their values $x_1$, $x_2$, etc. to the next layer. When interpreting these sorts of diagrams, **bias** terms are generally not explicitly represented in the graph; it is assumed that the bias term is added in before the activation function is evaluated.

The value of the output neuron is therefore:
$$ \operatorname{identity} \left(b + \sum_{j = 1}^p w_j x_j \right) = b + \sum_{j = 1}^p w_j x_j $$
where $b$ is the bias term and $\mathbf{w}$ is the weight vector. Incoming connections to the output neuron are summed together, a bias term is added, and the activation function is applied.

Now suppose for each $y_i$ value in my training set, I am using this function to make a prediction:
$$ y^*_i = f(\mathbf{x}_i; \mathbf{w}, b) = b + \sum_{j = 1}^p w_j x_{ij} $$
I want to choose the values of $\mathbf{w}$ and $b$ in order to make my predictions as good as possible, which means *minimizing the badness of my predictions*. Badness is quantified by a **loss function**, which you can choose depending on the problem. A loss function $L(y, y^*)$ must satisfy the following qualities:
- If $y = y^*$, $L(y, y^*) = 0$ (if your prediction is exactly correct, its badness is 0)
- $\forall y, y^*$, $L(y, y^*) \geq 0$ (there is no such thing as negative badness)

One such function is the *squared-error loss function* $L(y, y^*) = (y - y^*)^2$. Choosing $\mathbf{w}, b$ in order to minimize the average loss incurred by our predicted values is then:
$$ \underset{\mathbf{w}, b}{\operatorname{argmin}} \frac{1}{n} \sum_{i = 1}^n \left[ y_i - \left( b + \sum_{j = 1}^p w_j x_{ij} \right) \right] ^2 $$
where $n$ is the number of examples in your **training set**. This will give you the same weight and bias values as if you had assumed *iid* normal errors and decided to fit the model using maximum likelihood.
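
As a minimal sketch of the claim above, the following NumPy code writes the single-neuron network and the average squared-error loss as functions, then checks that the ordinary least squares solution minimizes that loss. The simulated data, the coefficient values, and the function names `forward` and `average_squared_error` are my own illustrative choices, not part of any package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training set (hypothetical sizes and coefficients).
n, p = 200, 3
X = rng.normal(size=(n, p))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.7
y = true_b + X @ true_w + rng.normal(scale=0.1, size=n)


def forward(X, w, b):
    """Output neuron with identity activation: b + sum_j w_j * x_ij."""
    return b + X @ w


def average_squared_error(y, y_star):
    """Average squared-error loss over the training set."""
    return np.mean((y - y_star) ** 2)


# Least squares on a design matrix with an intercept column gives the
# same (w, b) as maximum likelihood under iid normal errors.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
b_hat, w_hat = coef[0], coef[1:]

# Perturbing the fitted weights can only increase the average loss.
loss_at_minimum = average_squared_error(y, forward(X, w_hat, b_hat))
loss_nearby = average_squared_error(y, forward(X, w_hat + 0.05, b_hat))
print(b_hat, w_hat, loss_at_minimum, loss_nearby)
```

Nothing here is specific to neural networks yet; the point is only that "network plus loss function" and "linear model plus normal likelihood" describe the same fitting problem.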

### Logistic Regression

![Logistic Regression](../pics/Logistic-Regression.png)

We can do the exact same thing for logistic regression by replacing the identity function in the output neuron with the logistic sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. It should be clear that after applying the same operations as before, but with $\sigma(x)$ instead of $\operatorname{identity}(x)$, we will get the expectation function for a logistic regression. This is simply because the sigmoid function is the inverse of the typical logit link function.

Obviously the likelihood for a logistic regression model is not equivalent to that of a linear regression model, even after accounting for the different (inverse) link function. When fitting a logistic regression with maximum likelihood, we want:
$$\underset{\mathbf{w}, b}{\operatorname{argmax}} \sum_{i = 1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] =$$
$$\underset{\mathbf{w}, b}{\operatorname{argmax}} \sum_{i = 1}^n \left[ y_i \log \left(\sigma \left[ b + \sum_{j = 1}^p w_j x_{ij} \right] \right) + (1 - y_i) \log \left(1 - \sigma \left[ b + \sum_{j = 1}^p w_j x_{ij} \right] \right) \right]$$

$\sigma \left( b + \sum_{j = 1}^p w_j x_{ij} \right)$ is the value generated by the network for the $i^\text{th}$ example, so maximizing the log-likelihood here corresponds to minimizing the average **binary cross-entropy loss**, where "badness" is measured by the function $L(y, y^*) = -y \log(y^*) - (1 - y) \log(1 - y^*)$.

Logistic regression, and neural networks with sigmoid output activations more generally, are frequently used for the purpose of **binary classification**: determining whether observations belong to one of two classes. If the output value is greater than 0.5, the observation would be classified as being in "group 1"; otherwise it will be classified as being in "group 0". This 0.5 threshold may be moved depending on the specific problem being solved.
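
The relationship between the Bernoulli log-likelihood, the binary cross-entropy loss, and the 0.5 classification threshold can be sketched in a few lines of NumPy. The data, weights, and bias below are arbitrary illustrative values (they are not fitted), and the function names are my own.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, the inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y, y_star):
    """Average of -y*log(y*) - (1 - y)*log(1 - y*) over the examples."""
    return np.mean(-y * np.log(y_star) - (1 - y) * np.log(1 - y_star))


# Hypothetical inputs, labels, and parameter values (not fitted).
X = np.array([[0.5, 1.0], [-1.0, 2.0], [2.0, -0.5], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.8, -0.4])
b = 0.1

# Network output: sigmoid applied to the linear predictor.
p_hat = sigmoid(b + X @ w)

# Bernoulli log-likelihood of the observed labels.
log_lik = np.sum(np.log(np.where(y == 1, p_hat, 1 - p_hat)))

# The average cross-entropy is the negative log-likelihood divided by n.
bce = binary_cross_entropy(y, p_hat)

# Classify as "group 1" when the output exceeds the 0.5 threshold.
predicted_class = (p_hat > 0.5).astype(int)
print(bce, -log_lik / len(y), predicted_class)
```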

Note that there is no reason we couldn't use a network with a sigmoid output function and train it to minimize squared-error loss, or use the identity function with binary cross-entropy loss. In general, we can use any univariate function we want in the top layer, and choose the weights to minimize any loss function we want. That said, the optimization problem for a neural network does frequently end up being nothing more than maximum likelihood estimation, under some assumption about the conditional distribution of the $y$'s given the $x$'s.

### Categorical Regression

Generalized linear models with a categorical random component seem to appear less frequently in statistics than in machine learning, so here I will go through the full explanation of this model before connecting it to neural networks.

Consider a design matrix $\mathbf{X}$ and a response matrix $Y$, where the $i^\text{th}$ row of $Y$ is a $k$-dimensional **one-hot** vector (a single entry is 1, the others are 0). This means that the $i^\text{th}$ training example belongs to the $j^\text{th}$ class, where $j \in 1, \ldots, k$, as specified by the location of the 1 in that row.

Consider the "multinomial distribution with $n = 1$", otherwise known as the **categorical distribution** or the **multinoulli distribution**. We will say $\mathbf{y}_i | \mathbf{x}_i \sim \text{Categorical}(\mathbf{p}_i)$. As with any other GLM, we model the mean of the response distribution ($\mathbf{p}_i$) as an inverse link function of a linear predictor of the elements of $\mathbf{x}_i$.

In this case, the inverse link function we use is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), which is the $k$-dimensional analogue of the logistic function:
$$\operatorname{softmax}(\mathbf{z}) = \frac{e^{\mathbf{z}}}{\sum_{j = 1}^k e^{z_j}}$$
The exponentiation in the numerator is a vectorized operation; each member of $\mathbf{z}$ is exponentiated and the output is a $k$-dimensional vector. Note that if one element of $\mathbf{z}$ is much larger than the others, the output of the softmax function will be approximately 1 at that index, and approximately 0 at the others, just like the vector-valued (hard) max function. Also note that the sum of the $k$ elements of this function's output vector will always be 1. It is therefore appropriate to use this to model the mean of a categorical distribution.
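
The properties just listed are easy to check numerically. Below is a minimal NumPy sketch of the softmax function; subtracting the maximum before exponentiating is my own addition for numerical stability (it cancels in the ratio and does not change the result), and the input vectors are arbitrary.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D array.

    Subtracting max(z) avoids overflow in exp and leaves the output
    unchanged, because the common factor cancels in the ratio.
    """
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)


z_mild = np.array([1.0, 2.0, 3.0])
z_extreme = np.array([1.0, 2.0, 30.0])

p_mild = softmax(z_mild)        # a proper probability vector
p_extreme = softmax(z_extreme)  # nearly one-hot at the largest entry

# With k = 2 and the first entry fixed at 0, softmax reduces to the
# logistic function of the second entry.
two_class = softmax(np.array([0.0, 1.3]))
logistic = 1.0 / (1.0 + np.exp(-1.3))
print(p_mild, p_extreme, two_class[1], logistic)
```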

We need a $k$-dimensional linear predictor of $\mathbf{x}_i$ for this model, so we will have a parameter matrix:
$$\mathbf{p}_i = \operatorname{softmax} \left( \mathbf{b} + \mathbf{W} \mathbf{x}_i \right)$$
where $\mathbf{b}$ is a $k$-dimensional vector and $\mathbf{W}$ is a $k \times p$-dimensional matrix of weights.

So the likelihood and log-likelihood functions for this model are:
\begin{eqnarray*}
L(\mathbf{\theta}) &=& \prod_{i = 1}^n \prod_{j = 1}^k p_{ij}^{y_{ij}}\\
\ell(\mathbf{\theta}) &=& \sum_{i = 1}^n \sum_{j = 1}^k {y_{ij}} \log(p_{ij})\\
&=& \sum_{i = 1}^n \mathbf{y}_i^T \log(\mathbf{p}_i)\\
&=& \sum_{i = 1}^n \mathbf{y}_i^T \log \left[ \operatorname{softmax} \left( \mathbf{b} + \mathbf{W} \mathbf{x}_i \right) \right]
\end{eqnarray*}
where $\mathbf{\theta}$ collects the parameters $\mathbf{b}$ and $\mathbf{W}$, and the logarithms in the final two lines are understood to be vectorized across their arguments.

#### Categorical Regression as a Neural Network

![Categorical Regression](../pics/Categorical-Regression.png)

Here I've drawn a 3-class categorical regression as a neural network. $\sigma$ is occasionally used to refer to the softmax function as well as the logistic function, so I use $\sigma_i$ to denote its $i^\text{th}$ component. Unlike most neural networks, the neurons in the output layer here must "communicate" with each other; they all must sum to 1. This is a consequence of our network diagrams representing scalar values for each neuron rather than vector (or, more generally, **tensor**) values. More on alternative specifications of networks to come later.

Since we have multiple output neurons now, it makes more sense to represent the weights as a matrix rather than a vector, the same way that a matrix is required for the "statistical version" of the categorical regression model above.

The loss function most frequently used when training this type of network is called the **softmax cross-entropy** loss function $L(\mathbf{y}, \mathbf{y}^*) = -\mathbf{y}^T \log(\mathbf{y}^*)$. Note that the outputs and the true $y$-values of this model are vectors rather than scalars, so the loss function must take vector inputs. Like all loss functions, however, it must produce a scalar output.
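
To connect this loss with the log-likelihood derived earlier, here is a small NumPy sketch that computes both for a 3-class model. The dimensions, the random parameter values, and the class labels are hypothetical, and `softmax_rows` is an illustrative helper rather than a library function.

```python
import numpy as np


def softmax_rows(Z):
    """Apply softmax to each row of a 2-D array (max-shifted for stability)."""
    exp_Z = np.exp(Z - Z.max(axis=1, keepdims=True))
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)


rng = np.random.default_rng(1)
n, p, k = 5, 4, 3                  # hypothetical sizes
X = rng.normal(size=(n, p))
W = rng.normal(size=(k, p))        # k x p weight matrix
b = rng.normal(size=k)             # k-dimensional bias vector
labels = np.array([0, 2, 1, 1, 0])
Y = np.eye(k)[labels]              # one-hot response matrix

# Row i of P is p_i = softmax(b + W x_i).
P = softmax_rows(X @ W.T + b)

# Log-likelihood: sum over examples of y_i^T log(p_i).
log_lik = np.sum(Y * np.log(P))

# Per-example loss -y_i^T log(p_i); a scalar for each example.
per_example_loss = -np.sum(Y * np.log(P), axis=1)

# Because y_i is one-hot, the loss is just -log of the probability
# assigned to the observed class.
picked = -np.log(P[np.arange(n), labels])
print(log_lik, per_example_loss.sum(), picked)
```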

## 2. Optimization

Statisticians frequently need to resort to numerical optimization methods when closed-form solutions don't exist for, say, a log-likelihood maximization problem. Iterative optimization algorithms are extremely important when training neural networks, not only because of the intractability of the loss minimization problem but also because these methods allow us to avoid having to keep all the data in memory at the same time (more on this in a bit). This can become very important considering the scale of the datasets that neural networks are frequently applied to.

Unlike statisticians, who frequently use optimization methods requiring second derivatives, those training neural networks almost exclusively use first-order methods. This, again, has to do with the scale of the problems that neural networks are used for; the networks may become so large that the Hessian matrix may not even fit into memory.

It should be noted that machine learners tend to phrase their optimization problems in terms of *minimizing* a function rather than maximizing it. This is because the metric being optimized tends to be a measure of loss rather than one of utility. Of course, minimizing a function is the same as maximizing its negative, and vice-versa.

### Batch Gradient Descent

Imagine we would like to minimize a function $\ell(\mathbf{\theta}; \mathbf{X}, \mathbf{y})$, where we would like to optimize over the parameter vector $\mathbf{\theta}$ and treat $\mathbf{X}$ as fixed. This is frequently something like a negative log-likelihood, as in the three cases mentioned above. Recall that the direction of steepest ascent of a function is its gradient $\nabla_\mathbf{\theta} \ell(\mathbf{\theta}; \mathbf{X}, \mathbf{y})$. If we view this function as a hillside we are standing on (at some current point $\mathbf{\theta}^{(i)}$), then in order to get to the bottom of the valley, we should move in the direction of the negative gradient. This leads to the following algorithm:
1. Initialize the value of $\mathbf{\theta}$ at some value $\mathbf{\theta}^{(0)}$
2. $\mathbf{\theta}^{(i + 1)} \leftarrow \mathbf{\theta}^{(i)} - \alpha \cdot \nabla_\mathbf{\theta} \ell(\mathbf{\theta}; \mathbf{X}, \mathbf{y}) \vert_{\mathbf{\theta} = \mathbf{\theta}^{(i)}}$
3. Repeat Step 2 until convergence

$\alpha$ here is called the **learning rate**. Values of $\alpha$ that are too high can lead to frequent overshooting of the true minimum, while values that are too low may make the algorithm take a long time to converge. When optimizing complex functions, the learning rate is often lowered from iteration to iteration or in response to some event. Therefore, it might be more accurate to write $\alpha$ above as $\alpha^{(i)}$.
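
The three steps above can be sketched directly in NumPy. This example applies batch gradient descent to the average squared-error loss of a linear regression, where the gradient is available in closed form, and compares the result with least squares. The simulated data, the fixed learning rate of 0.1, and the stopping tolerance are illustrative assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical); the bias is absorbed as a column of ones.
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 0.7 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
X1 = np.column_stack([np.ones(n), X])


def loss(theta):
    """Average squared-error loss over the full training set."""
    return np.mean((y - X1 @ theta) ** 2)


def gradient(theta):
    """Gradient of the average squared-error loss with respect to theta."""
    return -2.0 / n * X1.T @ (y - X1 @ theta)


alpha = 0.1                      # learning rate (held fixed here)
theta = np.zeros(p + 1)          # Step 1: initialize
initial_loss = loss(theta)

for step in range(10_000):       # Steps 2 and 3: update until convergence
    new_theta = theta - alpha * gradient(theta)
    converged = np.max(np.abs(new_theta - theta)) < 1e-10
    theta = new_theta
    if converged:
        break

# Reference answer from a direct least squares solve.
theta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(step, theta, theta_ols, initial_loss, loss(theta))
```

This loss is convex, so gradient descent reaches the same answer as the direct solve; that guarantee is exactly what is lost for the non-convex losses of larger networks.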

In general, you will not actually minimize the loss function of a neural network! The loss functions being minimized are highly **non-convex** (intuitively, they don't look like bowls), and therefore gradient descent is prone to get stuck at local minima, have trouble with saddle points, etc. This problem has led to a lot of research into optimization methods for neural networks. With all this said, finding a good local minimum may be sufficient for the problem at hand. At the end of the day, performance on the test set is what matters, not where your optimization method ends up.

### Mini-batches

Consider the case where $\ell(\mathbf{\theta}; \mathbf{X}, \mathbf{y})$ is a negative log-likelihood, and suppose that the response values $y_i$ (which may be a vector in some cases) are independent. Then we can write our problem as:
$$ \underset{\mathbf{\theta}}{\operatorname{argmin}} \ell(\mathbf{\theta}; \mathbf{X}, \mathbf{y}) = \underset{\mathbf{\theta}}{\operatorname{argmin}} \frac{1}{n} \sum_{i = 1}^n \ell_i(\mathbf{\theta}; \mathbf{x}_i, y_i)$$
as the $\frac{1}{n}$ won't affect the result of the optimization. Because the joint distribution of the responses can be decomposed into the product of the marginals, the total loss here can be decomposed into the sum of individual per-datapoint loss terms.

Now let's think about the case where my entire dataset $\mathbf{X}$ doesn't fit in memory at once. In this case, it's useful to make the following approximation:
$$\nabla_\mathbf{\theta} \frac{1}{n} \sum_{i = 1}^n \ell_i(\mathbf{\theta}; \mathbf{x}_i, y_i) \approx \nabla_\mathbf{\theta} \frac{1}{m} \sum_{i^* = 1}^m \ell_{i^*}(\mathbf{\theta}; \mathbf{x}_{i^*}, y_{i^*})$$
where we are now averaging over a *randomly chosen subset* of our data points (that is, $m < n$). A new subset is sampled on each iteration of the gradient descent algorithm. The averaging is necessary to keep everything on the same scale. This set of $m$ data points is called a **mini-batch**. Frequently, $\frac{n}{m}$ steps of the mini-batch gradient descent algorithm constitute one training **epoch**. This is how many steps would be necessary to see all of the data (though in general, some data points will be seen multiple times in an epoch, while others not at all).

The term **mini-batch** always implies multiple data points per batch. You may also encounter the term **stochastic gradient descent**, which will refer either to the algorithm just described or to the special case of it where $m = 1$.
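
Here is a minimal sketch of mini-batch gradient descent as described in this section, again on a linear regression with squared-error loss so that the answer can be checked against least squares. The batch size, learning rate, number of epochs, and simulated data are all illustrative assumptions. Each step draws a fresh random subset, so within one epoch some points may be used more than once and others not at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical); the bias is absorbed as a column of ones.
n, p = 1000, 3
X = rng.normal(size=(n, p))
y = 0.7 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
X1 = np.column_stack([np.ones(n), X])


def minibatch_gradient(theta, idx):
    """Gradient of the average squared-error loss over the rows in idx."""
    Xb, yb = X1[idx], y[idx]
    return -2.0 / len(idx) * Xb.T @ (yb - Xb @ theta)


m = 50                       # mini-batch size
steps_per_epoch = n // m     # n / m steps make up one epoch
alpha = 0.05                 # learning rate
theta = np.zeros(p + 1)

for epoch in range(20):
    for _ in range(steps_per_epoch):
        # Sample a new random subset of m data points for this step.
        idx = rng.choice(n, size=m, replace=False)
        theta = theta - alpha * minibatch_gradient(theta, idx)

# The noisy updates end up close to, but not exactly at, the full-data
# least squares solution.
theta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(theta, theta_ols)
```

Only the rows indexed by `idx` are touched on each step, which is what makes it possible to stream batches from disk when the full dataset does not fit in memory.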

### Autodifferentiation and Backpropagation

Gradient descent and its extensions require, as the name implies, the gradient of the average loss with respect to the parameters. For simple models, this is no problem, but as we will see later, this may be quite tedious to compute for larger models.

Luckily, modern neural network packages help us out by providing **autodifferentiation** functionality. This means that when we specify a neural network model using a certain API, we will be able to use (mini-batch) gradient descent with *no manual calculus whatsoever*. Better yet, these are not finite difference derivatives; any numerical issues that occur arise from how the computer represents numbers internally, not from an imprecise approximation of the derivative.

For each iteration of our gradient descent algorithm, we imagine (a mini-batch worth of) $x$-values traveling from the input layer up to the output layer. The values computed along the way, either at the output layer or at intermediate **hidden layers**, are stored for use in the computation of the derivative. After this "forward pass", a "backward pass" occurs, where, starting from the loss function (which we can imagine lives just beyond the output layer) back to the input layer, the chain rule is applied in a computationally efficient way in order to take the derivatives with respect to the parameters. This is where the term **backpropagation** comes from, as well as the criticism that it's nothing more than the chain rule. As it turns out, naive methods of taking these derivatives will cripple the training of larger networks; the computational efficiency is key.
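
To make the forward and backward passes concrete, here is a hand-written sketch for the logistic regression network from Section 1. This is not how an autodifferentiation package works internally; it only shows the same pattern on a tiny model: the forward pass stores intermediate values, and the backward pass applies the chain rule from the loss back toward the inputs. The function names, the `cache` dictionary, and the random data are my own illustrative choices. A central finite difference is included purely as a check on the chain-rule gradient.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, y, w, b):
    """Forward pass: compute the loss and store values for the backward pass."""
    z = b + X @ w                      # linear predictor
    a = sigmoid(z)                     # output activation
    loss = np.mean(-y * np.log(a) - (1 - y) * np.log(1 - a))
    cache = {"X": X, "y": y, "a": a}
    return loss, cache


def backward(cache):
    """Backward pass: chain rule from the loss back to the parameters."""
    X, y, a = cache["X"], cache["y"], cache["a"]
    n = len(y)
    dloss_da = (-(y / a) + (1 - y) / (1 - a)) / n   # loss with respect to activation
    da_dz = a * (1 - a)                             # sigmoid derivative
    dloss_dz = dloss_da * da_dz                     # simplifies to (a - y) / n
    grad_w = X.T @ dloss_dz                         # z depends on w through X
    grad_b = np.sum(dloss_dz)                       # z depends on b directly
    return grad_w, grad_b


rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8).astype(float)
w = rng.normal(size=3)
b = 0.3

loss, cache = forward(X, y, w, b)
grad_w, grad_b = backward(cache)

# Central finite differences, used only to check the chain-rule result.
eps = 1e-6
fd_grad_w = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    fd_grad_w[j] = (forward(X, y, w + step, b)[0]
                    - forward(X, y, w - step, b)[0]) / (2 * eps)
fd_grad_b = (forward(X, y, w, b + eps)[0] - forward(X, y, w, b - eps)[0]) / (2 * eps)
print(grad_w, fd_grad_w, grad_b, fd_grad_b)
```

Note that the finite-difference check needs two extra forward passes per parameter, while the backward pass produces every derivative at once from the stored values.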

You will often hear neural network people talk about something like "the error signal propagating backward through the graph"; that is, information from the value of the loss function being used to tweak the parameter values via gradient descent. The notion of "gradient information" moving backward from output to input also motivates some of the heuristics used to train neural networks.

## 3. Implementations

All of the basic models from Section 1 can be trained in one line of code with the right package, but here I will take a slightly more roundabout way in order to demonstrate packages that people actually use for training neural networks more generally.

Python is the language most frequently used for neural networks. Though there are neural network packages in other languages, the vast majority of the research community tends to use those in Python, and they are accordingly the most developed. Increased computational horsepower (particularly with GPUs) is one of the main reasons why neural networks are experiencing their current boom in popularity after years of obscurity, and dedicated neural network packages let you take advantage of this without descending into the bowels of your computer seeking performance-enhancing boosts (that is, you don't need to write any CUDA C).

Probably the most popular neural network package is Google's [TensorFlow](https://www.tensorflow.org/), a C++ package with a highly-developed Python API (you don't need to know any C++ to use it). The name, in addition to sounding hip and cool, is also fairly accurate: both the forward and backward passes through a neural network can be thought of in terms of tensors flowing along the edges of a **computational graph**. What is a tensor, you ask? A tensor is a multidimensional array of numbers. For example, a vector is a rank-1 tensor, a matrix is a rank-2 tensor, and a "cube of numbers" would be a rank-3 tensor. A minibatch of color images, then, might represent a rank-4 tensor (1 dimension indexing the samples, 2 dimensions for the $x$- and $y$- dimensions of the image, and a final dimension for the RGB color channels).
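
If the word "tensor" still feels abstract, it may help to see that NumPy arrays already behave this way. The sketch below uses plain NumPy rather than TensorFlow, and the shapes (including the 32 images of 64 by 64 pixels) are arbitrary illustrative choices; `ndim` reports what the text above calls the rank.

```python
import numpy as np

vector = np.zeros(5)                     # rank 1
matrix = np.zeros((5, 3))                # rank 2
cube = np.zeros((5, 3, 2))               # rank 3

# Hypothetical mini-batch: 32 color images, 64 x 64 pixels, 3 RGB channels.
image_batch = np.zeros((32, 64, 64, 3))  # rank 4

ranks = [t.ndim for t in (vector, matrix, cube, image_batch)]
print(ranks, image_batch.shape)
```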

As it turns out, coding in TensorFlow is a bit difficult, requiring you to more or less manually specify the tensor transformations you want to apply. Thankfully, living on top of TensorFlow is a higher-level API named [Keras](https://keras.io/). Keras tends to operate at just the right level of abstraction: Every "chunk" of a neural network you might think of in your head translates to roughly one line of Keras code. When you train and evaluate your Keras model, however, you have the full power of TensorFlow working behind the scenes.

## 4. Networks with Hidden Layers