# Analysis, Goals, and Predictions

Here, I'd like to go over the loss functions being used to train the models and what we are aiming for, so that we can better understand the models' performance as training progresses. My approach will follow that of Ng (2015), as outlined in the textbook __"Deep Learning" by Goodfellow et al.__:

* Determine your goals -- error metric(s) and target (i.e., desired) value(s). 
* Establish a working end-to-end pipeline. 
* Determine bottlenecks in performance, their sources, and whether they're due to overfitting/underfitting/software defect(s). 
* Repeatedly make incremental changes such as gathering new data, adjusting hyperparams, or changing algorithms. 

## Selecting Hyperparameters

#### Manual Hyperparameter Tuning

Here I'll just bullet the main ideas:
* The learning rate is perhaps the most important hyperparameter. The training error increases approximately exponentially as the learning rate decreases below its optimal value. Above the optimal value, the training error rises so sharply that it basically shoots off to infinity (a vertical wall). 
* Next, the best performance usually comes from a large model that is regularized well, for example, by using dropout. 
* The table below shows how typical hyperparameters relate to model capacity. Remember that you can basically brute-force your way to good performance by increasing the model capacity and the training set size. 

| Hyperparameter | Increases capacity when... | 
| -------------- | -------------------------- |
| Num hidden units | increased | 
| Learning rate | tuned optimally |
| Convolution kernel width | increased | 
| Implicit zero padding | increased | 
| Weight decay coefficient | decreased | 
| Dropout rate | decreased | 


#### Automatic Hyperparameter Optimization

__Grid Search__: This is what I'm doing right now. The user selects a small finite set of values to explore for each hyperparameter. Grid search then trains a model for every joint specification of hyperparameter values in the Cartesian product of those sets. The experiment with the lowest _validation error_ is chosen as the best. 
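
As a rough illustration of the Cartesian-product idea, here is a minimal sketch in plain Python. The names `grid_search` and `toy_validation_error`, along with the grid values, are hypothetical; the toy function only stands in for "train a model and measure its validation error" and is not the training code used in this project.

```python
import itertools


def grid_search(grid, evaluate):
    """Try every combination in the Cartesian product of the grid values.

    grid: dict mapping hyperparameter name -> list of candidate values.
    evaluate: callable that takes a dict of hyperparameters and returns
        a validation error (lower is better).
    Returns (best_params, best_error).
    """
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_validation_error(params):
    # Hypothetical stand-in for training a model and scoring it on a
    # validation set; it is smallest at learning_rate=0.01, dropout=0.5.
    return (params["learning_rate"] - 0.01) ** 2 + (params["dropout"] - 0.5) ** 2


grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout": [0.2, 0.5, 0.8],
}
best_params, best_error = grid_search(grid, toy_validation_error)
print(best_params, best_error)
```

Note that the number of runs is the product of the list lengths (9 here), which is why grid search gets expensive quickly as hyperparameters are added.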

__Random Search (Better)__: 
1. Define a marginal distribution for each hyperparameter, e.g. multinoulli for discrete hyperparameters or uniform (on a log scale) for positive real-valued hyperparameters. For example, if we were interested in the range $[10^{-5}, 0.1]$ for the learning rate:
$$
\begin{align}
\texttt{logLearningRate} &\sim \mathrm{Unif}(-5, -1) \\
\texttt{learningRate} &= 10^{\texttt{logLearningRate}}
\end{align}
$$
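
The sampling step above could look something like the following sketch, which uses only the standard library. The function names and the `num_layers` hyperparameter (with its candidate values) are assumptions made for illustration; a uniform choice over a few values is used as a simple special case of a multinoulli.

```python
import random


def sample_learning_rate(rng, low_exp=-5.0, high_exp=-1.0):
    """Draw the exponent uniformly, then exponentiate (log-uniform sample)."""
    log_learning_rate = rng.uniform(low_exp, high_exp)
    return 10.0 ** log_learning_rate


def sample_hparams(rng):
    """Sample one joint configuration from independent marginals."""
    return {
        "learning_rate": sample_learning_rate(rng),
        # Hypothetical discrete hyperparameter: each value equally likely.
        "num_layers": rng.choice([1, 2, 3]),
    }


rng = random.Random(0)  # fixed seed so the trials are reproducible
trials = [sample_hparams(rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Sampling the exponent rather than the learning rate itself spreads the trials evenly across orders of magnitude, instead of concentrating almost all of them near 0.1.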

## Debugging Strategies

Determining whether or not a machine learning model is broken is hard. Here are some debugging tips:
* __Visualize the model in action__: Not just the quantitative stuff. How do the filters look? How is the chatbot responding?
* __Visualize the worst mistakes__: For example, our chatbot models output probabilities for the word tokens, and we either sample or argmax. One way to get an idea of which sentences our model does poorly on is to choose examples where the maximum output probability is *small*. In other words, if max(output) is much lower than usual, the model is rather unsure which word should come next (think of the limiting case where it outputs 1/numOutputs for all possible tokens!). 
* __Fit a tiny dataset__: Oooh, I like this one! Even small models can be guaranteed to fit a sufficiently small dataset. Make sure you can write a program that trains on, say, a handful of input-output sentences and reproduces the output for any of the inputs with near-perfect accuracy. 
* __Monitor histograms of activations/gradients__: The preactivations can tell us whether the units saturate, and how often they do. For tanh units, the average of the absolute value of the preactivations tells us how saturated the unit is. It is also useful to compare the magnitude of the parameter gradients with the magnitude of the parameters themselves. Ideally, we'd like the parameter update over a minibatch to be about 1 percent of the magnitude of the parameter. 
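
Some of the checks in this list are easy to sketch in plain Python. The helper names below are hypothetical, the inputs are flat lists of floats rather than real model tensors, and the update ratio assumes a plain SGD step (update = learning rate times gradient), which may not match the optimizer actually in use.

```python
import math


def least_confident(prob_rows, k=3):
    """Indices of the k rows whose maximum probability is smallest."""
    confidences = [max(row) for row in prob_rows]
    order = sorted(range(len(prob_rows)), key=lambda i: confidences[i])
    return order[:k]


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def update_ratio(params, grads, learning_rate):
    """||learning_rate * grad|| / ||param||, assuming a plain SGD update."""
    return learning_rate * l2_norm(grads) / l2_norm(params)


def mean_abs_preactivation(preacts):
    """Average |z| over a unit's preactivations; larger values suggest
    a tanh unit is spending more time in its saturated regions."""
    return sum(abs(z) for z in preacts) / len(preacts)


# Each row is a model's output distribution over three tokens.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
    [0.34, 0.33, 0.33],
]
print(least_confident(probs, k=2))
print(update_ratio(params=[3.0, 4.0], grads=[0.3, 0.4], learning_rate=0.1))
print(mean_abs_preactivation([-2.0, 0.5, 1.5]))
```

In practice these quantities would be logged per layer during training, and the examples flagged by `least_confident` would be read by hand.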