# Lecture #16: Neural Network Models for Regression
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2021

<img src="fig/logos.jpg" style="height:150px;">

In [2]:
### Import basic libraries
from autograd import numpy as np
from autograd import grad
import numpy
import scipy as sp
import pandas as pd
import sklearn as sk
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import math
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import rc
from IPython.display import HTML
from IPython.display import YouTubeVideo
%matplotlib inline

## Outline
1. Regression as Generalized Linear Models
2. Neural Networks
3. Automatic Differentiation and Backpropagation

# Regression as Generalized Linear Models

## Linear Regression Models
Here is a generalized linear model (GLM) you've known since day one!
\begin{align}
\mu &= \mathbf{w}^\top \mathbf{X}^{(n)}\\
Y^{(n)}&\sim \mathcal{N}(\mu, \sigma^2)
\end{align}
Alternatively, we can write this model as

\begin{align}
    Y^{(n)} = \mathbf{w}^\top \mathbf{X}^{(n)} + \epsilon; \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\end{align}

That is, injecting covariates into a normal likelihood is precisely linear regression! Just like in the case of logistic regression, we can form scientific hypotheses by examining the parameters of a linear regression model:

\begin{align}
    \widehat{\text{income}} = 2 * \text{education (yr)} + 3.1 * \text{married} - 1.5 * \text{gaps in work history}
\end{align}
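As a minimal sketch of fitting this model by maximum likelihood: under Gaussian noise, the MLE of $\mathbf{w}$ is the least-squares solution. The synthetic data, the "true" weights and the noise level below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: N observations, D covariates (all values are made up)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, 3.1, -1.5])   # illustrative weights only
sigma = 0.5
y = X @ w_true + sigma * rng.normal(size=N)   # Y = w^T X + eps, eps ~ N(0, sigma^2)

# Under Gaussian noise, the MLE of w is the least-squares solution
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# The MLE of sigma^2 is the mean squared residual
sigma2_mle = np.mean((y - X @ w_mle) ** 2)
```

With enough data, `w_mle` should land close to `w_true`, and each entry can be read as the change in the predicted outcome per unit change in that covariate.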

## How Would You Parameterize a Non-linear Trend?
<img src="./fig/fig12.png" style='height:400px;'>
It's not easy to think of a function $g(x)$ that can capture the trend in the data. For example, what degree of polynomial should we use?

## Review of the Geometry of Logistic Regression 
In **logistic regression**, we model the probability of an input $\mathbf{x}$ being labeled '1' as a function of its distance from the hyperplane parametrized by $\mathbf{w}$.
<img src="./fig/fig0.png" style='height:300px;'>
That is, we model $p(y=1 | \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})$, where $\mathbf{w}^\top \mathbf{x}=0$ is the equation of the decision boundary.

## How would you parametrize an elliptical decision boundary?

<img src="./fig/fig1.png" style='height:300px;'>

We can say that the decision boundary is given by a ***quadratic function*** of the input:
$$
w_1x^2_1 + w_2x^2_2 + w_3 = 0
$$
We can fit such a decision boundary using logistic regression with degree 2 polynomial features.
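A minimal sketch of this idea, assuming made-up data labeled by a hypothetical ellipse and plain gradient descent on the logistic loss (the step size and number of iterations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: label 1 inside an ellipse, 0 outside (the ellipse is made up)
X = rng.uniform(-2, 2, size=(500, 2))
y = ((X[:, 0] / 1.5) ** 2 + X[:, 1] ** 2 < 1).astype(float)

# Degree 2 features matching w_1 x_1^2 + w_2 x_2^2 + w_3
Phi = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, np.ones(len(X))])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient descent on the average negative log-likelihood
w = np.zeros(3)
for _ in range(5000):
    p = sigmoid(Phi @ w)
    w -= 0.5 * Phi.T @ (p - y) / len(y)

# Fraction of points on the correct side of the learned boundary
accuracy = np.mean((sigmoid(Phi @ w) > 0.5) == (y == 1))
```

The model is still linear in the features `Phi`; only the map from the raw input to the features is non-linear.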

## How would you parametrize an arbitrary complex decision boundary?

<img src="./fig/fig2.png" style='height:300px;'>

It's not easy to think of a function $g(x)$ that can capture this decision boundary.

**GOAL:** Find models that can capture *arbitrarily complex* functions.

# Neural Networks

## Approximating Arbitrarily Complex Decision Boundaries

Given an exact parametrization, we could learn the functional form, $g$, of the decision boundary directly. 

However, assuming an exact form for $g$ is restrictive. 

Rather, we can build increasingly good approximations, $\widehat{g}$, of $g$ by composing simple functions. 

## What is a Neural Network?

**Goal:** build a good approximation $\widehat{g}$ of a complex function $g$ by composing simple functions.

For example, let the following picture represent $a = f\left(\sum_{i}w_ix_i\right)$, where $f$ is a non-linear transform, and we denote the intermediate value $\sum_{i}w_ix_i$ by $s$.

<img src="./fig/fig4.png" style='height:300px;'>

**Note:** we always assume that $x_0=1$ and hence $w_0$ is the intercept or ***bias*** of the linear expression $\sum_i w_i x_i$.
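A single node of this kind can be sketched in a few lines. The function name `neuron`, the choice of `tanh` for $f$, and the numbers are illustrative assumptions.

```python
import numpy as np

def neuron(x, w, f=np.tanh):
    """One node: a = f(s) with s = sum_i w_i x_i, where x_0 = 1 carries the bias w_0."""
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1
    s = np.dot(w, x_aug)                 # intermediate value s
    return f(s)

x = np.array([0.5, -1.0])
w = np.array([0.1, 2.0, 1.0])   # w_0 is the bias; values are illustrative
a = neuron(x, w)                # s = 0.1 + 1.0 - 1.0 = 0.1, so a = tanh(0.1)
```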

## Neural Networks as Function Approximators

Then we can define the approximation $\widehat{g}$ with a graphical schema representing a complex series of compositions and sums of the form $f\left(\sum_{i}w_ix_i\right)$.

<img src="./fig/fig5.png" style='height:300px;'>

This is a ***neural network***. We denote the weights of the neural network collectively by $\mathbf{W}$.
The non-linear function $f$ is called the ***activation function***.

**Note:** 
Typically, at each node, we want to take a weighted sum of the values of the previous nodes with an additional ***bias term***. That is, we want as input $\sum_i w^l_{ij} \text{node}^l_i + \text{bias}_j$ for the $j$-th node in the $l$-th hidden layer. This is often done by adding an extra node per layer with the value of 1:
<img src="./fig/bias.png" style='height:250px;'>
The bias terms are considered to be part of the network parameters and when we jointly denote the network parameters by $\mathbf{W}$, bias terms are included.
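The extra-node trick can be sketched as a forward pass in which the last row of each weight matrix holds the bias terms. The helper names, the `tanh` activation, the linear output layer and the 2-3-1 architecture are assumptions for illustration.

```python
import numpy as np

def add_bias_node(a):
    """Append a constant node with value 1 to each row of activations."""
    return np.hstack([a, np.ones((a.shape[0], 1))])

def forward(W, X, f=np.tanh):
    """Forward pass of a feedforward network.
    W is a list of weight matrices; the last row of each holds the bias terms."""
    a = X
    for W_l in W[:-1]:
        a = f(add_bias_node(a) @ W_l)   # weighted sum plus bias, then activation
    return add_bias_node(a) @ W[-1]     # linear output layer

rng = np.random.default_rng(0)
# Hypothetical architecture: 2 inputs -> 3 hidden nodes -> 1 output
W = [rng.normal(size=(2 + 1, 3)), rng.normal(size=(3 + 1, 1))]
X = rng.normal(size=(5, 2))
out = forward(W, X)   # one output per input row
```

Storing the biases inside the weight matrices is what lets us refer to all network parameters jointly as $\mathbf{W}$.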

## A Flexible Framework for Function Approximation


<img src="./fig/fig6.png" style='height:500px;'>

## Common Choices for the Activation Function
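Three activation functions that are widely used in practice are sketched here. This particular selection is an assumption and may not match the figure exactly.

```python
import numpy as np

def sigmoid(s):
    """Squashes s into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    """Rectified linear unit: zero for negative s, identity otherwise."""
    return np.maximum(0.0, s)

tanh = np.tanh   # squashes s into (-1, 1)

s = np.array([-2.0, 0.0, 2.0])
activations = {"sigmoid": sigmoid(s), "tanh": tanh(s), "relu": relu(s)}
```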

<img src="./fig/fig8.png" style='height:500px;'>

## Neural Networks are Universal Function Approximators

So what kind of functions can be approximated by neural networks?

**Theorem: (Hornik, Stinchcombe, White, 1989)** Fix a "nice" activation function $f$. For any continuous function $g$ on a compact set $K$, there exists a feedforward neural network with activation $f$, having only a single hidden layer, which approximates $g$ to within an arbitrary degree of precision on $K$.

For this reason, we call neural networks ***universal function approximators***.

## Neural Network Regression

**Model for Regression:** $Y^{(n)}\sim \mathcal{N}(\mu, \sigma^2)$, $\mu = g_\mathbf{W}(\mathbf{X}^{(n)})$, where $g_\mathbf{W}$ is a neural network with parameters $\mathbf{W}$.

**Training Objective:** find $\mathbf{W}$ to maximize the likelihood of our data. This is equivalent to minimizing the Mean Square Error,
$$
\min_{\mathbf{W}}\, \mathrm{MSE}(\mathbf{W}) = \frac{1}{N}\sum^N_{n=1} \left(y_n - g_\mathbf{W}(x_n)\right)^2
$$
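
To see why the two objectives agree, here is a short derivation, assuming the observations are independent given the covariates and $\sigma^2$ is held fixed:

\begin{align}
\log p(y_1, \ldots, y_N | x_1, \ldots, x_N, \mathbf{W}) &= \sum^N_{n=1} \log \mathcal{N}\left(y_n; g_\mathbf{W}(x_n), \sigma^2\right)\\
&= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^N_{n=1} \left(y_n - g_\mathbf{W}(x_n)\right)^2\\
&= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{N}{2\sigma^2}\,\mathrm{MSE}(\mathbf{W})
\end{align}

The first term does not depend on $\mathbf{W}$ and the second is a negative multiple of the MSE, so the $\mathbf{W}$ that maximizes the log-likelihood is the $\mathbf{W}$ that minimizes the MSE.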

**Optimizing the Training Objective:** For linear regression (when $g_\mathbf{W}$ is a linear function), we computed the gradient of the MSE with respect to the model parameters $\mathbf{W}$, set it equal to zero and solved for the optimal $\mathbf{W}$ analytically (see Homework #0). For logistic regression, we computed the gradient and used (stochastic) gradient descent to "solve for where the gradient is zero".

Can we do the same when $g_\mathbf{W}$ is a neural network?

# Automatic Differentiation and Backpropagation

## Gradient Computation for Neural Networks

Computing the gradient for any parameter $w^l_{ij}$ in the following network requires us to use the ***chain rule***:

\begin{align}
\frac{\partial}{\partial t} g(h(t)) = g'(h(t))h'(t),\quad& \text{or}\quad\frac{\partial g}{\partial t} = \frac{\partial g}{\partial h} \frac{\partial h}{\partial t}
\end{align}

This is because a neural network is just a big composition of functions.
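
As a quick sanity check of the chain rule, the sketch below compares $g'(h(t))h'(t)$ against a finite-difference estimate. The choice $g = \sin$, $h(t) = t^2$ and the evaluation point are arbitrary.

```python
import numpy as np

# Hypothetical composition: g(h(t)) with g = sin and h(t) = t^2
h = lambda t: t ** 2
g = np.sin
dh = lambda t: 2 * t    # h'(t)
dg = np.cos             # g'(h)

def chain_rule_derivative(t):
    """d/dt g(h(t)) = g'(h(t)) * h'(t)."""
    return dg(h(t)) * dh(t)

def finite_difference(func, t, eps=1e-6):
    """Central-difference estimate of the derivative of func at t."""
    return (func(t + eps) - func(t - eps)) / (2 * eps)

t0 = 0.7
analytic = chain_rule_derivative(t0)
numeric = finite_difference(lambda t: g(h(t)), t0)
```

The two values should agree up to the error of the finite-difference approximation.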

<img src="./fig/fig7.png" style='height:150px;'>

## Example: Computing Neural Network Gradients

<img src="./fig/backprop.jpg" style='height:600px;'>

## Backpropagation: Gradient Descent for Neural Networks

The ***backpropagation*** algorithm consists of three phases:
0. (**Initialize**) initialize the network parameters $\mathbf{W}$
1. Repeat:
  1. (**Forward Pass**) compute all intermediate values $s_{ij}^l$ and $a_{ij}^l$ for the given covariates $\mathbf{X}$
  2. (**Backward Pass**) compute all the gradients $\frac{\partial \mathcal{L}}{\partial w^l_{ij}}$
  3. (**Update Parameters**) update each parameter by $-\eta \frac{\partial \mathcal{L}}{\partial w^l_{ij}}$
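
The loop above can be sketched for a network with one hidden layer, with the gradients written out by hand via the chain rule. The data, the architecture (10 `tanh` nodes), the learning rate $\eta$ and the number of steps are all assumptions; in practice the backward pass would usually be delegated to an automatic differentiation library such as `autograd`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data
X = np.linspace(-2, 2, 100).reshape(-1, 1)
y = np.sin(2 * X) + 0.1 * rng.normal(size=X.shape)

# (Initialize) one hidden layer with H tanh nodes; biases kept as separate vectors
H = 10
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, 1)), np.zeros(1)
eta = 0.05   # learning rate

losses = []
for step in range(2000):
    # (Forward Pass) compute and store the intermediate values
    s1 = X @ W1 + b1
    a1 = np.tanh(s1)
    pred = a1 @ W2 + b2
    losses.append(np.mean((pred - y) ** 2))   # L = MSE

    # (Backward Pass) apply the chain rule from the output back to the input
    d_pred = 2 * (pred - y) / len(X)   # dL/dpred
    dW2 = a1.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_a1 = d_pred @ W2.T
    d_s1 = d_a1 * (1 - a1 ** 2)        # tanh'(s) = 1 - tanh(s)^2
    dW1 = X.T @ d_s1
    db1 = d_s1.sum(axis=0)

    # (Update Parameters) step against the gradient
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2
```

Plotting `losses` against the step index is a simple way to check that training is making progress.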
  
<img src="./fig/graph_structure.png" style='height:200px;'>

## Gradient Computation with Automatic Differentiation

The forwards-backwards way of computing the gradient lends itself to an algorithm that automates gradient computation for any neural network. 

This is a special instance of ***reverse mode automatic differentiation*** -- a method of algorithmically computing exact gradients for functions defined by combinations of simple functions, by drawing graphical models of the composition of functions and then taking gradients by going forwards-backwards.
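
A hand-written sketch of the forwards-backwards computation, assuming the hypothetical function $f(x, y) = xy + \sin(x)$ (chosen for illustration; it is not necessarily the function in the figures):

```python
import numpy as np

def f_forward_backward(x, y):
    # Forward pass: evaluate and store each intermediate node of the graph
    v1 = x * y
    v2 = np.sin(x)
    v3 = v1 + v2   # output f(x, y) = x*y + sin(x)

    # Backward pass: propagate df/d(node) from the output back to the inputs
    d_v3 = 1.0
    d_v1 = d_v3 * 1.0   # v3 = v1 + v2
    d_v2 = d_v3 * 1.0
    d_x = d_v1 * y + d_v2 * np.cos(x)   # x feeds both v1 and v2, so contributions add
    d_y = d_v1 * x
    return v3, (d_x, d_y)

value, (df_dx, df_dy) = f_forward_backward(1.5, 2.0)
```

One forward pass and one backward pass yield the partial derivatives with respect to every input, which is why reverse mode suits functions with many parameters and a scalar output.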
<img src="./fig/function.png" style='height:50px;'>

<img src="./fig/computation_graph.png" style='height:150px;'>


# What Does a Neural Network Learn?

## Why is a Neural Network Classifier So Effective?

Visualizing the decision boundary:

<img src="./fig/boundary.png" style='height:350px;'>

## Why is a Neural Network Classifier So Effective?

Visualizing the output of the last hidden layer.

Before neural network models became wildly popular in machine learning, a common method for building non-linear classifiers was to first map the data, in the input space $\mathbb{R}^{\text{input}}$, into a 'feature' space $\mathbb{R}^{\text{feature}}$, such that the classes are well-separated in the feature space. Then, a linear classifier can be fitted to the transformed data.

If we ignore the output node of our neural network classifier, we are left with a function, $\mathbb{R}^{2} \to \mathbb{R}^2$, mapping the data from the input space to a 2-dimensional feature space. The transformed data (and in general, the output from a hidden layer in a neural network) is called a ***representation*** of the data. 

<img src="fig/architecture.jpeg" style="height:350px;">

Visualizing these representations can often shed light on how and what neural network models learn from the data.
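
The split of a classifier into a representation map and a linear model can be sketched as follows. The weights are random and untrained, and the architecture (2 inputs, 5 hidden nodes, 2 hidden nodes, 1 output) and the names `g0` and `g1` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained weights: 2 inputs -> 5 hidden nodes -> 2 hidden nodes -> 1 output
W1, b1 = rng.normal(size=(2, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
w_out, b_out = rng.normal(size=2), 0.0

def g0(X):
    """Transformation: input space R^2 -> 2-dimensional feature space."""
    return np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)

def g1(Z):
    """Logistic regression acting on the representation."""
    return 1.0 / (1.0 + np.exp(-(Z @ w_out + b_out)))

X = rng.normal(size=(10, 2))
Z = g0(X)   # the representation: output of the last hidden layer
p = g1(Z)   # the classifier output is the composition g1(g0(x))
```

After training, a scatter plot of `Z` colored by class label would show whether the classes have become linearly separable in the feature space.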

<img src="./fig/latent.png" style='height:350px;'>

## Two Interpretations of a Neural Network Classifier: 
<table>
    <tr><td><font size="3">A Complex Decision Boundary $g\quad\quad\quad\quad\quad$</font></td>
        <td><font size="3">A Transformation $g_0$ and a linear model $g_1\quad\quad\quad\quad$</font></td></tr>
    <tr><td><img src="fig/decision.png" style="height:350px;"></td>
        <td><img src="fig/architecture2.png" style="height:400px;"></td></tr>
</table>

# With Great Flexibility Come Great Problems

## Neural Network Regression vs Linear Regression

Linear models are easy to interpret. Once we've found the MLE of the model parameters, we can formulate scientific hypotheses about the relationship between the outcome $Y$ and the covariates $\mathbf{X}$:

\begin{align}
    \widehat{\text{income}} = 2 * \text{education (yr)} + 3.1 * \text{married} - 1.5 * \text{gaps in work history}
\end{align}

What do the weights of a neural network tell you about the relationship between the covariates and the outcome?
<img src="./fig/fig5.png" style='height:250px;'>

## Interpretable Deep Learning
We might be tempted to conclude that neural networks are uninterpretable due to their complexity. But just because we can't understand neural networks by inspecting the value of the individual weights, it does not mean that we can't understand them.

In [The Mythos of Model Interpretability](https://arxiv.org/abs/1606.03490), the author surveys a large number of methods for interpreting deep models. 

<img src="fig/cnnviz.jpg" style="height:400px;" align="center"/>

## Can Machine Learning Models Make Use of Human Concepts?
***(with Anita Mahinpei, Justin Clark, Ike Lage, Finale Doshi-Velez)***

What if, instead of building complex non-linear models based on raw inputs, we build simple linear models based on human-interpretable **concepts**? We use a neural network to predict concepts from inputs and then use a linear model to predict the outcome from the concepts. We interpret the relationship between the outcome and the concepts via the linear model. These models are called **concept bottleneck models**.

In [The Promises and Pitfalls of Black-box Concept Learning Models](https://arxiv.org/abs/2106.13314), we examine the advantages and drawbacks of these models.

<img src="fig/slide15.png" style="height:300px;" align="center"/>

## Can Machine Learning Models Learn to Explore Hypothetical Scenarios?
***(with Michael Downs, Jonathan Chu, Wisoo Song, Yaniv Yacoby, Finale Doshi-Velez)***

Rather than explaining why the model made a decision, it's often more helpful to explain how to change the data in order to change the model's decision. This modified input is a **counter-factual**. In [CRUDS: Counterfactual Recourse Using Disentangled Subspaces](https://finale.seas.harvard.edu/files/finale/files/cruds-_counterfactual_recourse_using_disentangled_subspaces.pdf), we study how to automatically generate counter-factual explanations that can help users achieve a favorable outcome from a decision system.

<img src="fig/slide16.png" style="height:350px;" align="center"/>

## Right for the Right Reasons?

In [*An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets*](), the authors build a neural network model to detect acute intracranial haemorrhage (ICH) and classify five ICH subtypes. 

Model classifications are explained by highlighting the pixels that contributed the most to the decision. The highlighted regions tend to overlap with ‘bleeding points’ annotated by neuroradiologists on the images.

<img src="./fig/shap.png" style="height: 350px;" align="center"/>

## The Perils of Explanations
In [*How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection*](), the authors found that clinicians interacting with incorrect recommendations paired with simple explanations experienced a significant reduction in treatment selection accuracy.

<img src="./fig/reliance.png" style="height: 350px;" align="center"/>


**Take-away:** Incorrect ML recommendations may adversely impact clinician treatment selections, and explanations are insufficient for addressing overreliance on imperfect ML algorithms.

## Generalization Error and Bias/Variance
Complex models have ***low bias*** -- they can model a wide range of functions, given enough samples.

But complex models like neural networks can use their 'extra' capacity to explain non-meaningful features of the training data that are unlikely to appear in the test data (i.e. noise). These models have ***high variance*** -- they are very sensitive to small changes in the data distribution, leading to a drastic decrease in performance from train to test settings.
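
The gap between training and test error can be sketched with polynomial regression standing in for a flexible model (a swapped-in technique, used here because it needs no training loop). The data-generating function, noise level, sample sizes and degrees are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy samples from a smooth non-linear trend."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(20)     # small training set
x_test, y_test = make_data(1000)     # large held-out set

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    Phi_train = np.vander(x_train, degree + 1)
    Phi_test = np.vander(x_test, degree + 1)
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ w - y_train) ** 2)
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    return train_mse, test_mse

# Compare a rigid model, a moderate one, and a very flexible one
results = {d: fit_and_score(d) for d in (1, 3, 15)}
```

Training error can only go down as the degree grows, since each model contains the smaller ones. Comparing the two entries of `results[d]` shows how much of that improvement carries over to new data.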

<table>
    <tr>
        <td>
            <img src="./fig/fig11.png" style="width: 380px;" align="center"/>
        </td>
        <td>
            <img src="./fig/fig12.png" style="width: 380px;" align="center"/>
        </td>
    </tr>
</table>

## Generalization of Deep Models

Just as in the case of linear and polynomial models, we can prevent neural networks from overfitting (i.e. poor generalization due to high variance) by regularization or by ensembling a large number of models.

However, a new body of work, such as [Deep Double Descent: Where Bigger Models and More Data Hurt](https://mltheory.org/deep.pdf), shows that very wide neural networks (with far more parameters than there are data observations) actually cease to overfit as the width surpasses a certain threshold. In fact, as the width of a neural network approaches infinity, training the neural network becomes kernel regression (this kernel is called the ***neural tangent kernel***)!

<img src="./fig/dd.jpg" style="width: 500px;" align="center"/>