# LUNA
### Michael Butler, Max Cembalest, M. Elaine Cunha

## Problem Statement
_What is the problem the paper aims to solve?_

Risk-averse machine learning applications require models that return not only high-accuracy predictions, but also - crucially - reasonable predictive uncertainty on test data. Admittedly, "reasonable" uncertainty possesses many interpretations, but one aspect essential for assessing risk is the accurate expression of epistemic uncertainty. In other words, a model whose predictions correctly account for epistemic uncertainty should generally return higher uncertainty for test data that live far away from training data in the feature space.
 
An increasingly popular approach to achieving adequate predictive uncertainty involves Bayesian modeling of neural networks - rather than producing a single point estimate, Bayesian models result in a distribution of predictions for any given input. Users may then infer model confidence based on the variation in the estimates produced: low variability suggests greater confidence in predictions, while high variability indicates greater uncertainty.
 
However, producing Bayesian estimates for a neural network with the desired expression of uncertainty requires several layers of computation that can be both complex and resource-intensive. Additionally, evaluating the correctness of predictive uncertainty during training is often difficult in higher dimensional settings. The LUNA paper proposes an algorithm for specifically modeling predictive uncertainty in a computationally efficient class of Bayesian models to ensure appropriate uncertainty expression.


## Context/Scope
_Why is this problem important or interesting?_

Recent years have seen an explosion in machine learning applications for healthcare, police work, autonomous vehicles, and the like, expanding use and interpretation beyond strictly research and scientific bases. In these high stakes situations, it is critical that the practitioner receive a valid assessment of model uncertainty to inform an appropriate course of action. Myriad examples exist of users pursuing ill-informed (or simply wrong) policies from overly confident machine learning models, resulting in devastating consequences.\cite{NYTimes} Thus, achieving accurate expressions of predictive uncertainty is crucial as machine learning spreads into other high-risk fields.

## Existing Work
_What has been done in literature?_

$\color{red}{\text{Not finished - need to update citations}}$

The broader literature in this area focuses on both assessing robust metrics for the quality of uncertainty and developing tractable models that are capable of expressing suitable predictive uncertainty.
 
In the first category, Yao et al. \cite{year} showed that log-likelihood alone is an insufficient metric for evaluating models with out-of-distribution data, specifically because it requires existing data to assess the likelihood.

Within the second research category are several approaches to modeling predictive uncertainty in neural networks: 
 
- **Gaussian Processes (GPs)** produce appropriate expressions of predictive uncertainty without relying on feature maps to transform the data. \cite{(Rasmussen & Williams, 2006)} This non-parametric quality enables adaptive modeling by avoiding the choice of a specific feature map for the given dataset. However, GPs are rather computationally slow and do not scale well with larger datasets.
- **Bayesian Neural Networks (BNNs)**\cite{neal} apply a prior over the weights of a neural network and train for the best fit distribution of parameters, resulting in a distribution of predictions for a single input at test time. The variance in these predictions can then be used to evaluate a model’s certainty in its predictions. However, BNN implementation involves significant computational complexity and cost. 
- **Neural Linear Models (NLMs)** have emerged as an alternative for producing Bayesian estimates from neural networks without the inherent complexity of BNNs.\cite{snoek}\cite{zhou} NLMs fit a model in two stages: 1) train a standard neural network over all data, 2) replace the last layer with coefficients fit with a Bayesian linear regression; all weights outside of the last layer are held constant at point estimates. Ober & Rasmussen gave a systematic review of NLM performance on benchmark tasks (https://arxiv.org/pdf/1912.08416.pdf).
 
Of these methods, inference is most efficient with NLMs. Consequently, NLMs have received increasing attention and advocacy for their use in recent years. \cite{Pinsler et al., Snoek et al.} However, although the computational savings are notable, NLMs are still susceptible to misleading expressions of uncertainty, especially when trained with regularization. We explain this failure mode in detail in the next section.


## Contribution
_What is the gap in literature that the paper is trying to fill? What is the unique contribution?_
- elaine

## Technical Content (High Level)
_What are the high level ideas behind their technical contribution?_
- elaine


## Technical Content (Details)
_Highlight (not copy and paste entire sections) the relevant details that are important to focus on (e.g. if there's a model, define it; if there is a theorem, state it and explain why it's important, etc)._

$\color{red}{\text{Placeholder - not finished}}$

In more technical terms, suppose we have training data that consist of features $X \in \mathbb{R}^{N \times D}$ and target $Y \in \mathbb{R}^N$, where $N$ is the number of observations, and $D$ is the number of features. In the first stage of NLM training, we train a neural network with $K$ hidden layers, each with width $L_k$, using our training data. To train, we minimize the following objective function:
$$
C = \frac{1}{N}||(Y - f_{\Theta}(X))||^2_2 + \gamma||\Theta||^2_2
$$
In other words, we seek to find a neural network, $f$, with parameters $\Theta$ that minimizes the mean squared error between the predictions and the actual target data, $Y$, while penalizing large coefficients through the positive regularization coefficient $\gamma$. Once trained, $f$ maps $\mathbb{R}^D \to \mathbb{R}$. 
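As a small illustration of the objective above, the sketch below evaluates $C$ for a given set of predictions and parameters. It is not the authors' training code: the function name `regularized_mse` and the toy numbers are assumptions made for this example, and the network forward pass is taken as already computed.

```python
import numpy as np

def regularized_mse(y, y_pred, params, gamma):
    """Evaluate C = (1/N) * ||y - y_pred||_2^2 + gamma * ||Theta||_2^2.

    `params` is a list of arrays holding all network parameters (Theta).
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Mean squared error between targets and network predictions
    data_term = np.sum((y - y_pred) ** 2) / y.shape[0]
    # L2 penalty: squared norm of every parameter array, scaled by gamma
    penalty = gamma * sum(np.sum(np.asarray(p, dtype=float) ** 2) for p in params)
    return data_term + penalty

# Toy values (hypothetical), standing in for real targets, predictions, and weights
y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
params = [np.array([[1.0, -1.0]]), np.array([2.0])]
cost = regularized_mse(y, y_pred, params, gamma=0.1)
```

Here the data term is $4/3$ and the penalty is $0.1 \times 6 = 0.6$, so a larger $\gamma$ shifts the balance toward small weights at the expense of fit.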

Next we chop off the final set of weights, producing a feature map that outputs the activations of the final hidden layer: $\phi_{\theta}: \mathbb{R}^{D} \to \mathbb{R}^{L_K}$, where $\theta \subset \Theta$ contains all network weights except those of the final (output) layer, and $L_K$ is the dimension of the last hidden layer. As a last step, we apply a prior to the last layer of weights, and run a Bayesian linear regression to predict our target data $Y$. In other words, the Bayesian linear regression takes the form of:
$$
y \sim \mathcal{N}(\mathbf{\Phi}_{\theta}\mathbf{w},\sigma^2\mathbf{I}), \hspace{0.5cm} \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha\mathbf{I})
$$
where $\mathbf{\Phi}_{\theta}$ is the $N \times (L_K+1)$ matrix obtained by applying the feature map to $X$ and augmenting the resulting $N \times L_K$ matrix with a column of ones for the bias; $\mathbf{w}$ represents the final layer of weights; and the covariance of $y$ and the prior covariance of $\mathbf{w}$ are identity matrices multiplied by a constant ($\sigma^2$ and $\alpha$, respectively). This formula illustrates how the fitted weights of the neural network serve as the basis for the Bayesian linear regression. 
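Because both the likelihood and the prior are Gaussian, the posterior over $\mathbf{w}$ has the standard closed form $\mathbf{\Sigma} = (\mathbf{\Phi}^\top\mathbf{\Phi}/\sigma^2 + \mathbf{I}/\alpha)^{-1}$, $\boldsymbol{\mu} = \mathbf{\Sigma}\mathbf{\Phi}^\top y/\sigma^2$. The sketch below implements that second stage only. It is a minimal illustration rather than the paper's code: the random ReLU feature map `phi` is a stand-in for a trained $\phi_{\theta}$, and the function names, toy data, and values of $\sigma^2$ and $\alpha$ are all assumptions.

```python
import numpy as np

def add_bias_column(features):
    """Append a column of ones so the last weight acts as the bias."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((features.shape[0], 1))])

def blr_posterior(Phi, y, sigma2, alpha):
    """Posterior N(mean, cov) over w for y ~ N(Phi w, sigma2 I), w ~ N(0, alpha I)."""
    dim = Phi.shape[1]
    precision = Phi.T @ Phi / sigma2 + np.eye(dim) / alpha
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / sigma2
    return mean, cov

def blr_predict(Phi_new, mean, cov, sigma2):
    """Predictive mean, epistemic variance, and total variance at new inputs."""
    pred_mean = Phi_new @ mean
    # phi^T cov phi for every row of Phi_new
    epistemic_var = np.einsum("ij,jk,ik->i", Phi_new, cov, Phi_new)
    return pred_mean, epistemic_var, epistemic_var + sigma2

# Toy 1-D data with a gap between -2 and 2 (hypothetical, cubic-like)
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.uniform(-4, -2, 20), rng.uniform(2, 4, 20)])
y_train = x_train ** 3 + rng.normal(0.0, 3.0, size=x_train.shape)

# Stand-in feature map: fixed random ReLU features, NOT a trained network
W, b = rng.normal(size=(1, 10)), rng.normal(size=10)
def phi(x):
    return np.maximum(np.asarray(x, dtype=float).reshape(-1, 1) @ W + b, 0.0)

w_mean, w_cov = blr_posterior(add_bias_column(phi(x_train)), y_train, sigma2=9.0, alpha=1.0)
x_test = np.linspace(-5, 5, 11)
pred_mean, epistemic_var, total_var = blr_predict(
    add_bias_column(phi(x_test)), w_mean, w_cov, sigma2=9.0
)
```

Whether `epistemic_var` actually grows inside the gap depends entirely on the feature map, which is the point of the failure mode discussed for NLMs: the regression can only express uncertainty along directions that the features span.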


## Experiments
_Which types of experiments were performed? What claims were these experiments trying to prove? Did the results prove the claims?_

### The experiment datasets:
The authors experimented on three simple toy datasets of their own design, each containing a gap in the training data. Though simple in appearance, they demonstrate a major failure mode for many neural network designs:
- Cubic Dataset
- Squiggle Dataset
- Sinusoidal Dataset

In addition, 5 externally-sourced datasets (from the Machine Learning Repository at UC Irvine) were given artificially-created gaps to test model robustness to limited data.
I have no idea what these datasets are really about and I think that's OK.
- Yacht - Froude
- Boston - RM
- Boston - LSTAT
- Concrete - CEMENT
- Concrete - SUPER
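
One way to picture an artificially created gap is to hold out every observation whose value on a chosen feature falls in a middle range, train on the rest, and treat the held-out points as the "gap" set. The sketch below does this with quantile cut-offs. The function name `split_with_gap`, the one-third and two-thirds cut-offs, and the toy array are assumptions for illustration; the paper's exact procedure may differ.

```python
import numpy as np

def split_with_gap(X, y, feature_idx, lower_q=1/3, upper_q=2/3):
    """Split (X, y) into non-gap and gap sets along one feature.

    Rows whose value on `feature_idx` lies strictly between the two
    quantiles form the gap set; all other rows form the non-gap set.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lo, hi = np.quantile(X[:, feature_idx], [lower_q, upper_q])
    in_gap = (X[:, feature_idx] > lo) & (X[:, feature_idx] < hi)
    return (X[~in_gap], y[~in_gap]), (X[in_gap], y[in_gap])

# Toy data (hypothetical): 12 rows, gap carved along feature 0
X = np.column_stack([np.arange(12.0), np.zeros(12)])
y = np.arange(12.0) * 2.0
(X_nongap, y_nongap), (X_gap, y_gap) = split_with_gap(X, y, feature_idx=0)
```

With this toy array the rows with feature values 4 through 7 land in the gap set and the remaining eight rows stay in the non-gap set.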

### The experiment goals:
The primary goal of the experiments is to show that LUNA holds up against the gold standards (and exceeds a variety of other models) when modeling data with gaps.

These experiments are proxies for the models' ability to handle data with gaps:
- Visualize that the uncertainty spread is wider in the gap of the Cubic Dataset
- Generalization: forecasting the gap without having the gap data of the Squiggle Dataset
- Transfer learning: forecasting the gap after having the gap data of the Squiggle Dataset
    - * Please note that max needs to double check his understanding of this
- Bayesian optimization: measuring how quickly a model locates a Sinusoidal Dataset's maximum
- Comparing the MSE, log-likelihood, and epistemic-uncertainty on the 5 UCI datasets.


The authors use low MSE and high log-likelihood as approximate measures of a good fit for a model, on *both* the gap and non-gap datasets. Epistemic uncertainty, in this context, is the standard deviation of the posterior predictive, which is low or high depending on how much your posterior predictions agree with one another. We can "identify" gaps in our data by building models that produce a posterior predictive with lots of agreeing predictions in the data-rich regions and disagreeing predictions in the gaps. Therefore epistemic uncertainty should, for a robust model, be higher on the gap dataset than on the non-gap dataset.
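
The epistemic uncertainty measure described above can be computed directly from posterior predictive samples: take the standard deviation across samples at each input, then compare its average in the gap against its average elsewhere. The sketch below uses hand-made samples; the function name `epistemic_std`, the sample values, and the gap mask are assumptions for illustration.

```python
import numpy as np

def epistemic_std(pred_samples):
    """Standard deviation across posterior draws at each input.

    `pred_samples` has shape (S, N): S posterior draws of the prediction
    at each of N inputs.
    """
    return np.std(np.asarray(pred_samples, dtype=float), axis=0)

# Three posterior draws at four inputs (hypothetical values):
# the draws agree at the first two inputs and disagree at the last two.
samples = np.array([
    [1.0, 2.0, 0.0, 5.0],
    [1.0, 2.0, 1.0, 3.0],
    [1.0, 2.0, 2.0, 1.0],
])
is_gap = np.array([False, False, True, True])  # assumed gap membership

std = epistemic_std(samples)
mean_std_gap = std[is_gap].mean()
mean_std_nongap = std[~is_gap].mean()
```

A model that recognizes the gap should give `mean_std_gap` noticeably larger than `mean_std_nongap`, as it is for these toy samples.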

Each model is initialized with the same architecture: a ReLU-activation neural network with 2 layers and a width of 50 nodes per layer.
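
For concreteness, the architecture above can be sketched as a plain NumPy forward pass. This assumes "2 layers" means two hidden layers followed by a linear output layer; the function names, the random initialization scheme, and the seed are assumptions, and no training is shown.

```python
import numpy as np

def init_mlp(input_dim, width=50, n_hidden=2, seed=0):
    """Randomly initialize (W, b) pairs for a ReLU network with a scalar output."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim] + [width] * n_hidden + [1]
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def feature_map(params, X):
    """Activations of the last hidden layer (every layer except the output)."""
    h = np.asarray(X, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU
    return h

def forward(params, X):
    """Full network output: a linear layer applied to the last hidden features."""
    W_out, b_out = params[-1]
    return (feature_map(params, X) @ W_out + b_out).ravel()

params = init_mlp(input_dim=1)
X_demo = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
features = feature_map(params, X_demo)   # shape (5, 50)
outputs = forward(params, X_demo)        # shape (5,)
```

Separating `feature_map` from `forward` mirrors the NLM construction: the hidden layers give the features, and only the final linear layer is replaced by a Bayesian linear regression.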

The authors compared LUNA to:
- NLM
- Monte Carlo with dropout
- MAP estimation
- bootstrapped ensemble
- Bayesian neural network with HMC sampling (gold standard, provides useful uncertainties exploring the posterior only a lil, slow)
- Gaussian process (gold standard, the improv jazz of modeling, and equivalent to BNN with $\infty$ width i think? also I'm unclear on how slow it is.)


### The Experiment Results


#### Cubic Dataset: Visualizing Uncertainty Spread
The LUNA model found a better uncertainty spread in the gap than NLM and the other baselines did. Notably, it achieved an uncertainty spread comparable to the gold standards.

#### Squiggle Dataset: Generalization and Transfer Learning
All of the methods worked well when their width $\rightarrow \infty$ (aka 500), but LUNA achieves notably better uncertainty estimation for low-width networks in the generalization and transfer-learning tasks.

#### Sinusoidal Dataset: Bayesian Optimization
LUNA typically took fewer iterations than GP, NLM, and Random to find the maximum. What's "Random"?

#### UCI Dataset: Comparing MSE, Log-Likelihood, and Epistemic Uncertainty
The LUNA model had comparably low MSE and high log-likelihood on both the gap and non-gap datasets, whereas NLM and MCD had higher MSE and lower log-likelihood in the gap than the non-gap. In other words, LUNA better identified the 
- LUNA shows higher uncertainty in Gap vs Not Gap, whereas NLM and MCD have the same uncertainty in Gap and Not Gap

(Max question: did they experiment on a dataset with its own *natural* gap, as opposed to an artificial gap?)

## Evaluation
_(your opinion) - do you think the work is technically sound? Do you think the proposed model/inference method is practical to use on real data and tasks? Do you think the experimental section was strong (there are sufficient evidence to support the claims and eliminate confounding factors)?_


The work is sound, though we wish we were good enough at this to find flaws in our professor's work.

The paper effectively communicates LUNA's key innovation in the objective function. The experiments thoroughly demonstrate the effectiveness of LUNA on the toy datasets. The experimentation design is useful because, in many downstream tasks, an algorithm should properly identify gaps in its training data and exercise hesitance in human contexts. This principle seems to embody sensible wisdom, that one should be confident when equipped with sufficient data while uncertain/hesitant where appropriate. 

The experiments on the UCI datasets are inherently limited because data from human contexts is high-dimensional, making it impossible to visualize simple "uncertainty bumps" (as with the toy datasets). But the epistemic uncertainty measure (the posterior predictive standard deviation) is a useful numerical check that the model is more hesitant in the presence of gaps in the data.

In terms of model scalability, the hyperparameter search limits us from implementing LUNA quickly, particularly since it seems different datasets often require their own hyperparameter scan. However, the authors include useful graphics that demonstrate the effect of hyperparameters on the Cubic Dataset.

One component of the training process we believe the authors could have explored further is the number of iterations, since we happened to produce worse results with 10,000 iterations than with 5,000.



## Future Work 
_(for those interested in continuing research in a related field) - do you think you can suggest a concrete change or modification that would improve the existing solution(s) to the problem of interest? Try to implement some of these changes/modifications._

* Future work in model evaluation: an "uncertainty trace" that visualizes the uncertainty bump after 1000, 2000, ..., 10000 iterations of training, a la the hand-drawn photo he sent Michael and Elaine in fb messenger.



## Broader Impact 
_How does this work potentially impact (both positively and negatively) the broader machine learning community and society at large when this technology is deployed? In the applications of this technology, who are the potentially human stakeholders? What are the potential risks to the interest of these stakeholders in the failure modes of this technology? Is there potential to exploit this technology for malicious purposes?_




# Code
### Max is setting up all our code in order in one notebook, FinalCode.ipynb

Code: At least one clear working pedagogical example demonstrating the problem the paper is claiming to solve. 

- want to get useful uncertainty in data-scarce regions
- log-likelihood doesn't cut it
- instead, use the more vague metric of wider predictive bumps when in data-scarce region
- PriorPredictives_Demo.ipynb

Code: At least a bare bones implementation of the model/algorithm/solution (in some cases, you may be able to make assumptions to simplify the model/algorithm/solution with the approval of your instructor)
- LUNADemo/NLMDemo.ipynb

Code: Demonstration on at least one instance that your implementation solves the problem.
- LUNADemo/NLMDemo.ipynb

Code: Demonstration on at least one instance the failure mode of the model/algorithm/solution, with an explanation for why failure occurred (is the dataset too large? Did you choose a bad hyper parameter?). The point of this is to point out edge cases to the user.
- number of iterations too high (maybe we don't have time, but experiment with increasing step size for optimization?)
- different gap datasets require diff hyperparameters (max thinks)

## References

- elaine

Test:

<cite data-cite="kluyver2016jupyter">Kluyver et al. (2016)</cite>