# 10 Deep Learning

## 10.1 Single Layer Neural Networks
![Figure 10.1](SLNN.png)

A neural network takes an input vector of $p$ variables $X=(X_1, X_2, \ldots, X_p)$ and builds a nonlinear function $f(X)$ to predict the response $Y$. The figure above shows a simple _feed-forward neural network_ for modeling a quantitative response using $p=4$ predictors. The four features $X_1, \ldots, X_4$ make up the units in the _input layer_. The arrows indicate that each of the inputs from the input layer feeds into each of the $K$ _hidden units_ (we get to pick $K$; here we chose $5$). The neural network model has the form

\begin{align*}
f(X) &= \beta_0 + \sum^K_{k=1}{\beta_k h_k(X)} \\
     &= \beta_0 + \sum^K_{k=1}{\beta_k g(w_{k0} + \sum^p_{j=1}{w_{kj}X_j})}\text{.} \tag{10.1}
\end{align*}

It is built up here in two steps. First, the $K$ _activations_ $A_k, k=1, \ldots,K$, in the hidden layer are computed as functions of the input features $X_1,\ldots,X_p$,

\begin{align*}
A_k = h_k(X) = g(w_{k0} + \sum^p_{j=1}{w_{kj}X_j})\text{,}\tag{10.2}
\end{align*}

where $g(z)$ is a nonlinear _activation function_ that is specified in advance. We can think of each $A_k$ as a different transformation $h_k(X)$ of the original features, much like the basis functions of Chapter 7. These $K$ activations from the hidden layer then feed into the output layer, resulting in

\begin{align*}
f(X) = \beta_0 + \sum^K_{k=1}{\beta_kA_k}\text{,}\tag{10.3}
\end{align*}

a linear regression model in the $K=5$ activations. All the parameters $\beta_0,\ldots,\beta_K$ and $w_{10},\ldots,w_{Kp}$ need to be estimated from data. In the early instances of neural networks, the _sigmoid_ activation function was favored,

\begin{align*}
g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}\text{,}\tag{10.4}
\end{align*}

which is the same function used in logistic regression to convert a linear function into probabilities between zero and one. The preferred choice in modern neural networks is the _ReLU_ (_rectified linear unit_) activation function, which takes the form

\begin{align*}
g(z) = (z)_+= \begin{cases}
      0 & \text{if}\ z<0 \\
      z & \text{otherwise}
\end{cases}\tag{10.5}
\end{align*}

A ReLU activation can be computed and stored more efficiently than a sigmoid activation. Although it thresholds at zero, because we apply it to a linear function (10.2) the constant term $w_{k0}$ will shift this inflection point.
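The two activation functions (10.4) and (10.5) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `sigmoid` and `relu` are our own, and the branch on the sign of `z` in `sigmoid` is an implementation choice (not part of the definition) that avoids overflow in `math.exp` for large $|z|$.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation (10.4): maps any real z into (0, 1)."""
    if z >= 0:
        # 1 / (1 + e^{-z}) is safe here because e^{-z} <= 1.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def relu(z: float) -> float:
    """ReLU activation (10.5): zero for negative z, identity otherwise."""
    return max(0.0, z)


# Evaluate both activations at a few points.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  relu={relu(z):.1f}")
```

Applied to $w_{k0} + w_{k1}X_1$ with a single input, `relu` switches on where $w_{k0} + w_{k1}X_1 = 0$, which is how the constant term moves the threshold away from zero.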

So in words, the model depicted in the figure above derives five new features by computing five different linear combinations of $X$, and then squashes each through an activation function $g(\cdot)$ to transform it. The final model is linear in these derived variables.
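The description above translates directly into a forward pass. The following is a minimal sketch in plain Python, assuming ReLU as the activation; the function name `forward`, the argument layout (`W` as a list of $K$ weight rows, `b` as the list of intercepts $w_{k0}$), and the randomly drawn weights are all hypothetical choices made for illustration, not fitted values.

```python
import random


def relu(z):
    """ReLU activation (10.5)."""
    return max(0.0, z)


def forward(x, W, b, beta, beta0, g=relu):
    """Evaluate f(x) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * x_j).

    x     : list of p input features
    W     : K rows, each a list of p weights w_k1..w_kp
    b     : list of K intercepts w_k0
    beta  : list of K output weights beta_1..beta_K
    beta0 : output intercept
    """
    # Step 1 (10.2): hidden-layer activations A_k.
    activations = [
        g(w_k0 + sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)))
        for w_k, w_k0 in zip(W, b)
    ]
    # Step 2 (10.3): linear model in the activations.
    return beta0 + sum(b_k * a_k for b_k, a_k in zip(beta, activations))


# Hypothetical network matching the figure: p = 4 inputs, K = 5 hidden units.
rng = random.Random(0)
p, K = 4, 5
W = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(K)]
b = [rng.uniform(-1, 1) for _ in range(K)]
beta = [rng.uniform(-1, 1) for _ in range(K)]
beta0 = 0.0

print(forward([1.0, 2.0, -1.0, 0.5], W, b, beta, beta0))
```

Passing `g=lambda z: z` instead of `relu` makes the collapse to a linear model easy to check numerically.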

The name _neural network_ originally derived from thinking of these hidden units as analogous to neurons in the brain: values of the activations $A_k = h_k(X)$ close to one are _firing_, while those close to zero are _silent_ (using the sigmoid activation function).

The nonlinearity in the activation function $g(\cdot)$ is essential, since without it the model $f(X)$ in (10.1) would collapse into a simple linear model in $X_1,\ldots,X_p$. Moreover, having a nonlinear activation function allows the model to capture complex nonlinearities and interaction effects. Consider a very simple example with $p=2$ input variables $X=(X_1, X_2)$, and $K=2$ hidden units $h_1(X)$ and $h_2(X)$ with $g(z)=z^2$. We specify the other parameters as

\begin{align*}
\beta_0 = 0\text{,}\ \ \ \ & \beta_1 = \frac{1}{4}\text{,} & \beta_2 = -\frac{1}{4}\text{,} \\
w_{10} = 0\text{,}\ \ \ \ & w_{11} = 1\text{,} & w_{12} = 1\text{,} \\
w_{20} = 0\text{,}\ \ \ \ & w_{21} = 1\text{,} & w_{22} = -1\text{.}\tag{10.6}
\end{align*}

From (10.2), this means that

\begin{align*}
h_1(X) &= (0 + X_1 + X_2)^2\text{,} \\
h_2(X) &= (0 + X_1 - X_2)^2\text{.}\tag{10.7}
\end{align*}

Then plugging (10.7) into (10.1), we get

\begin{align*}
f(X) &= 0 + \frac{1}{4} \cdot (0 + X_1 + X_2)^2 - \frac{1}{4} \cdot (0 + X_1 - X_2)^2 \\
     &= \frac{1}{4}\left[(X_1 + X_2)^2 - (X_1 - X_2)^2\right] \\
     &= X_1X_2\text{.}\tag{10.8}
\end{align*}

So the sum of two nonlinear transformations of linear functions can give us an interaction! In practice we would not use a quadratic function for $g(z)$, since we would always get a second-degree polynomial in the original coordinates $X_1,\ldots,X_p$. The sigmoid or ReLU activations do not have such a limitation.
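The identity in (10.8) is easy to check numerically. The sketch below hard-codes the parameters from (10.6) with $g(z)=z^2$; the function name `f_quadratic` is our own, and the test points are arbitrary.

```python
def f_quadratic(x1: float, x2: float) -> float:
    """Two-unit network from (10.6) with activation g(z) = z**2."""
    g = lambda z: z ** 2
    h1 = g(0 + 1 * x1 + 1 * x2)   # h_1(X) = (X1 + X2)^2
    h2 = g(0 + 1 * x1 - 1 * x2)   # h_2(X) = (X1 - X2)^2
    return 0 + 0.25 * h1 - 0.25 * h2


# Compare the network output with the product X1 * X2 at a few points.
for x1, x2 in [(3.0, 4.0), (-1.5, 2.0), (0.0, 7.0)]:
    print(x1, x2, f_quadratic(x1, x2), x1 * x2)
```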

Fitting a neural network requires estimating the unknown parameters in (10.1). For a quantitative response, typically squared-error loss is used, so that the parameters are chosen to minimize

\begin{align*}
\sum^n_{i=1}{(y_i - f(x_i))^2}\text{.}\tag{10.9}
\end{align*}

Details about how to perform this minimization are provided in Section 10.7.
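As a small illustration of the objective (10.9), the sketch below computes the squared-error loss for any prediction function. The name `squared_error_loss` and the toy predictor and data are hypothetical; this only evaluates the loss for given parameters and does not perform the minimization itself.

```python
def squared_error_loss(f, xs, ys):
    """Sum over observations of (y_i - f(x_i))**2, as in (10.9)."""
    return sum((y_i - f(x_i)) ** 2 for x_i, y_i in zip(xs, ys))


# Toy example: a stand-in predictor f(x) = 2 * x_1 and two observations.
toy_f = lambda x: 2.0 * x[0]
xs = [[1.0], [2.0]]
ys = [2.0, 5.0]

print(squared_error_loss(toy_f, xs, ys))
```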

## 10.2 Multilayer Neural Networks

## 10.3 Convolutional Neural Networks

### 10.3.1 Convolution Layers

### 10.3.2 Pooling Layers

### 10.3.3 Architecture of a Convolutional Neural Network

### 10.3.4 Data Augmentation

### 10.3.5 Results Using a Pretrained Classifier

## 10.4 Document Classification

## 10.5 Recurrent Neural Networks

### 10.5.1 Sequential Models for Document Classification

### 10.5.2 Time Series Forecasting

### 10.5.3 Summary of RNNs

## 10.6 When to Use Deep Learning

## 10.7 Fitting a Neural Network

### 10.7.1 Backpropagation

### 10.7.2 Regularization and Stochastic Gradient Descent

### 10.7.3 Dropout Learning

## 10.8 Interpolation and Double Descent