# Neural Networks

```python
# Import some helper functions (please ignore this!)
from utils import *
```

**Context:** So far, we've focused on translating our IHH colleague's goals into probabilistic models, and then fitting these models to data to help them answer scientific questions. In each model's conditional distributions, we've had to make two choices: what distribution to use, and how the distribution's parameters should depend on the condition. For example, in regression, recall we picked the following conditional distribution:
\begin{align}
p_{Y | X}(\cdot | x_n; \underbrace{W, \sigma}_{\theta}) = \mathcal{N}( \underbrace{\mu(x_n; W)}_{\text{trend}}, \underbrace{\sigma^2}_{\text{noise}} ),
\end{align}
where $\mu(x_n; W)$ represents the "trend" of the data. We've had to decide whether $\mu(x_n; W)$ should be linear, polynomial, or some other function. As our data grows in complexity---for example, as $x_n$ becomes high-dimensional---it becomes increasingly difficult to make up functions that are expressive, fast, and easy to code. We will show you why below.
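To make the two choices concrete, here is a minimal sketch in plain Python (not the `NumPyro` implementation used elsewhere in the course). The helper names `mu_linear`, `mu_polynomial`, and `log_p_y_given_x` and all parameter values are made up for illustration. The distribution stays Gaussian while the trend function is swapped out:

```python
import math

def mu_linear(x, W):
    """Linear trend: W = (slope, intercept). Hypothetical helper."""
    slope, intercept = W
    return slope * x + intercept

def mu_polynomial(x, W):
    """Polynomial trend: W[k] is the coefficient of x**k. Hypothetical helper."""
    return sum(w_k * x ** k for k, w_k in enumerate(W))

def log_p_y_given_x(y, x, W, sigma, mu):
    """Log-density of N(mu(x; W), sigma^2) evaluated at y."""
    mean = mu(x, W)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)

# Same conditional distribution, two different choices of trend.
# The parameter values below are arbitrary and chosen only for illustration.
print(log_p_y_given_x(y=1.0, x=0.5, W=(2.0, 0.0), sigma=1.0, mu=mu_linear))
print(log_p_y_given_x(y=1.0, x=0.5, W=(0.0, 2.0, 0.0), sigma=1.0, mu=mu_polynomial))
```

Only the `mu` argument changes between the two calls; the rest of the model is untouched, which is why the choice of trend function can be treated as a separate design decision.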

**Challenge:** So what functions should we use in our probabilistic models? Here, we will introduce a new type of function---a *neural network*. As we will show here, neural networks are expressive, fast, and easy to code.

**Outline:** 
* Shortcomings of other expressive functions, like polynomials
* The idea behind neural networks: using function composition to create expressive functions
* Introduce *a little bit* of linear algebra to help introduce neural networks
* Introduce neural networks, implement them in `NumPyro` and fit them to IHH data
* Connect the math behind neural networks to the pictures used to represent them in popular media

## The Shortcomings of Polynomials

**The Universality of Polynomials.** In the chapters on regression and classification, we observed the benefits of using non-linear functions for data with non-linear trends. In regression, we've focused on polynomials as our primary tool for creating non-linear functions, and for our low-dimensional data, they seemed to work great! So you may be wondering, why not apply them to high-dimensional data as well? In fact, polynomials boast a very powerful property: they are *universal function approximators*. By this, we mean that for any continuous function on some bounded interval $[a, b]$, we can find a polynomial that approximates it arbitrarily well (this is known as the [Stone–Weierstrass theorem](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)). This means that for *any* data set that consists of continuous trends, *theoretically speaking*, polynomials can capture it. This is a huge deal! So let's see how polynomials measure up against linear regression and neural network regression:

```{figure} _static/figs/example_regression_inductive_bias_percent_ood_0.png
---
name: fig-regression-inductive-bias-closeup
align: center
---
Examples of linear, polynomial, and neural network regression on IHH data.
```

As you can see, polynomial regression seems to capture the trend in these regression data sets super well. But *practically speaking*, polynomials actually come with several challenges. 

**Challenge 1: Numerical Instability.** Polynomials are numerically unstable. Imagine you want to use a degree-20 polynomial in your regression model. This means that, in fitting your model, you will have to evaluate $x^{20}$. When $x = 0.1$ and when $x = 10.0$, you're asking your computer to represent numbers like $10^{-20} = 0.00000000000000000001$ and $10^{20} = 100000000000000000000$. Because your computer only has finite precision, very small numbers are at risk of being rounded down to $0$ and very large numbers may overflow. Worse, when terms on such different scales are added together, the smaller ones can be lost entirely.
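Here is a minimal sketch of finite precision at work, using plain Python floats (64-bit); the variable names are illustrative. At this precision both values are representable on their own, but combining terms of very different magnitude, as evaluating a polynomial does, silently discards the smaller one:

```python
# Evaluate the degree-20 monomial at a small and a large input.
small = 0.1 ** 20    # on the order of 1e-20
large = 10.0 ** 20   # on the order of 1e+20

# Each value can be stored by itself in a 64-bit float.
print(small, large)

# Combining scales is where precision is lost:
# adding the small term to 1.0 leaves 1.0 unchanged...
print(1.0 + small == 1.0)

# ...and adding 1.0 to the large term leaves the large term unchanged.
print(large + 1.0 == large)
```

Lower-precision formats, such as the 32-bit floats commonly used on accelerators, run into these limits sooner.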

**Challenge 2: Inductive Bias.** Oftentimes, we're less interested in seeing what our model does on data we've already observed. Instead, we want to know what it might do for a *new* data point. For example, suppose we're asked to develop a model to predict telekinetic ability and glow from age (using data from the regression chapter). We aren't interested in seeing the model's predictions on patients included in our data set---we've already observed how telekinetic ability and glow correlate with age. We're interested in the model's predictions for *new* patients. In particular, what happens if we get an input that we've never seen before, such as a patient who is much older or younger than the rest of the patients in the data? We call the trend of the model away from the data its "inductive bias." Different models that fit the data equally well may actually have different inductive biases. Let's illustrate what we mean visually. The plot below shows the very same three models plotted above; however, this time the plots are zoomed out so you can see each model's behavior away from the data.

```{figure} _static/figs/example_regression_inductive_bias_percent_ood_30.png
---
name: fig-regression-inductive-bias
align: center
---
A zoomed-out plot of the same linear, polynomial, and neural network regressors as in the figure above.
```
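The same idea can be sketched numerically. The two functions below are hypothetical stand-ins, not models fitted to the IHH data: they are chosen by hand so that they nearly coincide on the "observed" range $[-1, 1]$, yet they behave very differently away from it.

```python
def linear_model(x):
    """Hypothetical linear trend (not fitted to any data)."""
    return x

def polynomial_model(x):
    """Hypothetical degree-9 polynomial that hugs the line on [-1, 1]."""
    return x + 0.001 * x ** 9

def gap(x):
    """Absolute disagreement between the two models at input x."""
    return abs(polynomial_model(x) - linear_model(x))

# Inside the observed range, the models are nearly indistinguishable.
for x in [-1.0, 0.0, 0.5, 1.0]:
    print(f"x = {x:5.1f}   gap = {gap(x):.6f}")

# Away from the data, the high-degree term dominates and they diverge.
for x in [2.0, 3.0]:
    print(f"x = {x:5.1f}   gap = {gap(x):.6f}")
```

On $[-1, 1]$ the gap never exceeds $0.001$, so a data set confined to that range could not tell the two models apart. Their predictions for inputs outside it are nonetheless far from interchangeable.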

## Expressivity via Function Composition

## Multivariate Linear Transforms

Suppose we're developing a regression model for high-dimensional data and need to pick which $\mu(\cdot; W)$ to use. Let our input data be $D$-dimensional, meaning that our inputs are an array (or vector): $x_n = [ x_n^{(1)}, x_n^{(2)}, \dots, x_n^{(D)} ]$. We use the superscript to denote the dimension of the data.
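As a small sketch of this notation in code, the patient below is hypothetical (made-up values, $D = 3$), and `feature` is an illustrative helper rather than part of any library. The one thing to watch is that the superscript in the math counts from 1, while Python indexes from 0:

```python
# Hypothetical input with D = 3 dimensions (values are made up).
x_n = [34.0, 0.7, 1.2]
D = len(x_n)

def feature(x, d):
    """Return x^{(d)} using the text's 1-based superscript convention."""
    return x[d - 1]  # Python lists are 0-indexed

# First and last dimensions of the input.
print(D, feature(x_n, 1), feature(x_n, D))
```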

## Neural Networks

## Connecting the Pictures to the Math