# Deep Learning Notes 

These notes assume some prior knowledge; this is more of a note-taking space for me. 

---------------------------------------------------------------------------
# Loss Functions 

Loss functions measure the network's performance and come in a variety of flavors for different purposes. The core purpose of a loss function is to measure the distance between the ground truth and the model's outputs. Depending on the type of problem you want to solve, you will choose a specific loss function. Here are the most common ones: 

| **Loss Function**             | **Purpose**                | **Keras**                      |
|---------------------------|------------------------|----------------------------|
| **Binary Cross-Entropy**      | Binary Prediction      | `binary_crossentropy`      |
| **Categorical Cross-Entropy** | Multi-class Pred - OneHot | `categorical_crossentropy` |
| **Sparse Categorical Cross-Entropy** | Multi-class Pred - Int | `sparse_categorical_crossentropy` |
| **Mean Squared Error**        | Continuous Regression  | `mean_squared_error`       |
| **Cosine Proximity**        | Vector Orientation  | `cosine_proximity`       |


**Cross-Entropy** is a quantity from the field of information theory that measures the distance between probability distributions (ground truth vs. output). This is good for models that output probabilities. In the case of Categorical Cross-Entropy, it is imperative to one-hot-encode the labels. Sparse Categorical Cross-Entropy avoids this need and takes integer class labels directly. 
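
To make the one-hot vs. integer distinction concrete, here is a minimal sketch in plain Python (no Keras). The function names and the `eps` clamp are my own choices for illustration, not the Keras API; the point is only that both variants give the same number for the same sample.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs, eps=1e-12):
    """Cross-entropy for one sample with a one-hot encoded target."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, predicted_probs))

def sparse_categorical_cross_entropy(class_index, predicted_probs, eps=1e-12):
    """Same loss, but the target is the integer index of the correct class."""
    return -math.log(max(predicted_probs[class_index], eps))

probs = [0.7, 0.2, 0.1]  # hypothetical model output for one sample
loss_one_hot = categorical_cross_entropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_cross_entropy(0, probs)
print(loss_one_hot, loss_sparse)  # both are -ln(0.7)
```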

**Mean-Squared-Error** (MSE) averages the squared residuals (the differences between targets and predictions). Sum-of-Squared-Errors (SSE) is another option, though it can explode more easily because it grows with the number of samples. Root Mean Squared Error (RMSE) is just the square root of the MSE. That is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axis for a linear regression. 
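
A small worked example of the three quantities, as a plain-Python sketch. The helper names and the sample numbers are made up for illustration.

```python
import math

def sse(targets, preds):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds))

def mse(targets, preds):
    """Mean squared error: SSE divided by the number of samples."""
    return sse(targets, preds) / len(targets)

def rmse(targets, preds):
    """Root mean squared error: same units as the target."""
    return math.sqrt(mse(targets, preds))

targets = [3.0, 5.0, 2.0, 7.0]
preds = [2.0, 5.0, 4.0, 8.0]   # residuals: 1, 0, -2, -1
print(sse(targets, preds), mse(targets, preds), rmse(targets, preds))
```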

**Cosine Proximity**, or Cosine Similarity, is a measure of how close two vectors are in terms of orientation, and not magnitude. This is useful in models like Word2Vec. 
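
A quick sketch of "orientation, not magnitude" in plain Python. The function is a textbook cosine similarity written for illustration, not the Keras loss itself, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Scaling a vector by 10 leaves the similarity at (approximately) 1, which is why the measure ignores magnitude.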

More information on the [math here](https://isaacchanghau.github.io/2017/06/07/Loss-Functions-in-Artificial-Neural-Networks/), or in the [Keras Docs](https://keras.io/losses/)

---------------------------------------------------------------------------
#### Other Metrics 

In addition to loss there are other metrics to validate the training of a network. Metrics differ for classification and regression problems. 

**Accuracy** (acc) is the simple percentage of correct predictions for a classification problem. 

**Mean Absolute Error** (MAE) is the average of the absolute differences between the predictions and the targets. It applies only to regression. It can be read as how far off you are on average, in the same units as your target. 

### Important Consideration 

You may find that at some points your accuracy increases while the loss stays the same or even increases. How is this possible? The displayed loss is an average of pointwise loss values, whereas accuracy only checks whether each predicted class probability crosses the decision threshold. The number of correctly classified points may be increasing (better accuracy) while the predictions become less confident on average, so the improvement is not seen in the loss. 
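
Here is a toy illustration of that effect for binary classification, in plain Python. The labels and the two sets of predicted probabilities are invented numbers chosen to show the point; the 0.5 threshold is an assumption.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(labels, probs))
    return hits / len(labels)

def mean_binary_cross_entropy(labels, probs):
    """Average pointwise binary cross-entropy."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 1, 0, 0]
probs_early = [0.90, 0.45, 0.10, 0.55]  # two confident hits, two near misses
probs_later = [0.55, 0.55, 0.45, 0.45]  # all correct, but barely

acc_early, acc_later = accuracy(labels, probs_early), accuracy(labels, probs_later)
loss_early = mean_binary_cross_entropy(labels, probs_early)
loss_later = mean_binary_cross_entropy(labels, probs_later)
print(acc_early, acc_later, loss_early, loss_later)
```

Accuracy goes up between the two snapshots while the average loss also goes up, because every prediction now sits just past the threshold.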

---------------------------------------------------------------------------
# Optimizers 

In general, it is safe to start with RMSprop (`rmsprop`), whatever the problem. 

---------------------------------------------------------------------------
# Activation Functions 

[CS231](https://cs231n.github.io/neural-networks-1/#actfun)

TLDR: “What neuron type should I use?” For middle layers, use the eLU non-linearity, be careful with your learning rates, and possibly monitor the fraction of “dead” units in the network. If this does not work, try ReLU, maxout, or tanh (worst case). 

For the final layer, this depends on what you want to predict. For class probabilities, use softmax. For regression, use no activation. For values in [0, 1] (binary classification), use sigmoid. 

<img src="img/activations.png" width="600" />

### Sigmoid
The sigmoid non-linearity has the mathematical form `σ(x) = 1 / (1 + e^(-x))`; it takes a real-valued number and “squashes” it into the range between 0 and 1. Large negative numbers become 0 and large positive numbers become 1. In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks: its gradient saturates to 0 at the tails, and its output is not zero-centered. 
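
The saturation drawback is easy to see numerically. This is a plain-Python sketch; the function names are mine, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient peaks at 0.25 for `x = 0` and is close to zero at `x = ±10`, which is the saturation at the tails.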

### Tanh 

Tanh squashes values into the range [-1, 1]. The tanh non-linearity is always preferred to the sigmoid non-linearity since it is zero-centered, but its gradients still saturate to 0 at the tails. 
 
### ReLU 
The Rectified Linear Unit has become very popular. It computes the function `f(x)=max(0,x)`; the activation is simply thresholded at zero. It is inexpensive, but units can "die": a large gradient update can push a unit into a regime where it outputs zero for every input and never recovers.

### Leaky ReLU 

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small non-zero slope (of 0.01, or so) for negative inputs. That is, the function computes `f(x)=1(x<0)(αx)+1(x>=0)(x)` where α is a small constant. 
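
The two formulas side by side, as a plain-Python sketch on scalars. The default `alpha=0.01` follows the slope mentioned above.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Identity for x >= 0, small slope alpha for x < 0."""
    return x if x >= 0 else alpha * x

for x in (-3.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x))
```

For negative inputs ReLU outputs exactly zero (and so passes back zero gradient), while the leaky version keeps a small signal.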

### eLU 
The ELU (Exponential Linear Unit) helps with the vanishing gradient problem: it is the identity for positive inputs, so gradients do not shrink there. 
For negative inputs it follows `α(e^x − 1)`, which smoothly saturates to −α. The negative outputs push the mean activation of neurons closer to zero, which is beneficial for learning, and the saturation for negative inputs helps to learn representations that are more robust to noise.
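
For reference, a plain-Python sketch of the standard ELU definition, which is the identity for positive inputs and `alpha * (exp(x) - 1)` otherwise. The default `alpha=1.0` is an assumption for illustration.

```python
import math

def elu(x, alpha=1.0):
    """Identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```

Negative inputs produce negative outputs that level off near `-alpha`, which is what pulls the mean activation toward zero.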

### MaxOut 
The Maxout neuron computes the function `max(w1^T x + b1, w2^T x + b2)`. Notice that both ReLU and Leaky ReLU are special cases of this form (for example, for ReLU we have w1 = 0 and b1 = 0). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike ReLU neurons, it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
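
A small sketch of the "ReLU is a special case" remark, in plain Python for a single neuron. The weights and inputs are arbitrary illustration values.

```python
def affine(x, w, b):
    """w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def maxout(x, w1, b1, w2, b2):
    """Max over two affine pieces."""
    return max(affine(x, w1, b1), affine(x, w2, b2))

def relu_neuron(x, w, b):
    """A standard ReLU neuron: max(0, w^T x + b)."""
    return max(0.0, affine(x, w, b))

w, b = [0.5, -1.0], 0.1
zero_w, zero_b = [0.0, 0.0], 0.0
inputs = [[1.0, 2.0], [4.0, 0.5], [0.0, 0.0]]

# With the first piece set to zero, maxout reduces to a ReLU neuron.
pairs = [(maxout(x, zero_w, zero_b, w, b), relu_neuron(x, w, b)) for x in inputs]
print(pairs)
```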

### Softmax 

This is a soft/smooth approximation of max. Softmax outputs a probability distribution over many categories: the output is equivalent to a categorical probability distribution and tells you the probability that each class is the true one.
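
A minimal plain-Python softmax, for reference. Subtracting the maximum logit before exponentiating is a common stability trick and does not change the result; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit gets the largest probability, but the other classes keep non-zero mass, which is the "soft" part.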

---------------------------------------------------------------------------
# Why is my Neural Net not Working? 

You can follow these two guides: [this simple one](http://theorangeduck.com/page/neural-network-not-working), and [Andrej Karpathy's from CS231n](https://cs231n.github.io/neural-networks-3/#gradcheck)

## Before Anything: 
- try to visualize your results
- try to visualize the data before it goes into the network 
- train for a few more epochs to make sure things really aren't working 

## Gradient Checks: 
Do this stuff after you design your network, but before real training. 

#### Loss with no Regularization 
- Turn off all regularization and check the data loss 
- ex.  for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302
    -  we expect a diffuse probability of 0.1 
    -  10 classes, and Softmax loss is the negative log probability of the correct class
    - so: -ln(0.1) = 2.302.
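
The same sanity check generalizes to any number of classes. A tiny sketch follows (the helper name is made up), assuming an untrained network that spreads probability evenly over the classes.

```python
import math

def expected_initial_loss(num_classes):
    """Softmax loss when every class gets probability 1 / num_classes."""
    return -math.log(1.0 / num_classes)

print(expected_initial_loss(10))   # CIFAR-10 case from above
print(expected_initial_loss(2))    # binary case: ln(2)
```

If the first reported loss is far from this value with regularization turned off, the initialization or the loss wiring is worth a second look.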

#### Increase Regularization 
- now, increasing regularization should increase the loss 

#### Overfit on a subset 
- Overfit on a small subset of the data (20 examples)
- Ensure you can get 0 loss 
- If you don't pass, something is wrong 
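
To show the shape of this check without a full framework, here is a toy stand-in: a one-feature logistic regression trained with plain gradient descent on four made-up points. Everything here (data, learning rate, step count) is an assumption for illustration; with a real network you would do the same thing with about 20 real examples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_bce(w, b, xs, ys):
    """Average binary cross-entropy of the model on the tiny subset."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)

def overfit_tiny_subset(xs, ys, lr=1.0, steps=2000):
    """Full-batch gradient descent, no regularization."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
initial_loss = mean_bce(0.0, 0.0, xs, ys)   # ln(2) for an untrained model
w, b = overfit_tiny_subset(xs, ys)
final_loss = mean_bce(w, b, xs, ys)
print(initial_loss, final_loss)
```

The loss should head toward zero on the tiny subset; if it stalls well above zero, the model or training loop has a bug.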

## Essential Checklist: 

####  0. You're not documenting your process! C'mon! Be a scientist! 

#### 1. You did not shuffle your data 

#### 2. You did not normalize your data 
- data should be between [0, 1]
- your data could be heterogenous, where features have different scales 
    - normalize independently on a per-feature basis 
    - mean = 0, std = 1
    - `x -= x.mean(axis=0)`
    - `x /= x.std(axis=0)`
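
The two NumPy one-liners above standardize each feature column. The same idea in standard-library Python, as a sketch, is shown here; the helper name and the sample rows are made up, and it assumes no column is constant (otherwise the standard deviation is zero).

```python
import statistics

def standardize_columns(rows):
    """Shift each feature to mean 0 and scale it to std 1, column by column."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]   # population std
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two features on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize_columns(rows)
print(scaled)
```

After scaling, both columns hold the same values even though the raw features differed by a factor of 100.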

####  3. Does the last layer of the network have (N nodes == N classes)? 

####  4. Are you using the right activation on the last layer?(check above)
- no activation for regression 
- softmax for probabilities 
- sigmoid [0, 1]
- ReLU [0, inf]

####  5. Are you using the right loss function? (check above)
- are you one-hot encoding? (if so don't use sparse categorical cross entropy) 
- MSE for regression 

#### 6. Are you regularizing? 
- dropout, start with 0.9 and work down to 0.5
- L1/L2
- data augmentation 

#### 7. You are using the wrong learning rate 
- Which of the following does your loss look like?
- With low learning rates, the improvements will be linear.
- With high learning rates they will start to look more exponential. 
- Higher learning rates will decay the loss faster, but get stuck (green line).
    - This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape.
<img src="img/learningrates.jpeg" width="240" />
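
A toy way to get a feel for this is plain gradient descent on `f(w) = w**2`, with three made-up learning rates. This is only a sketch: on this convex toy problem a too-high rate diverges outright, whereas the plateau in the figure comes from noisy training on a real loss surface, which this example does not reproduce.

```python
def run_gradient_descent(lr, w0=5.0, steps=20):
    """Minimize the toy loss f(w) = w**2; return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w -= lr * grad
        losses.append(w * w)
    return losses

low = run_gradient_descent(lr=0.01)       # slow, steady decrease
good = run_gradient_descent(lr=0.3)       # fast decrease
too_high = run_gradient_descent(lr=1.1)   # overshoots more on every step
print(low[-1], good[-1], too_high[-1])
```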

#### 8. You're using the wrong batch size 
- low batch size (1-16) increases stochasticity, noisy updates 
- choppiness can "jump" out of local minima
- increases training time 
- chart shows good LR, noisy update implies low batch size 
<img src="img/loss.jpeg" width="240" />

#### 9. Can you reduce the dimensionality of your data? 
- feature engineering! 

#### 10. Your network is too big for your data 
- deeper != better 
- scale down the number of hidden layers and nodes 
- start with 3-8 layers, 256-1024 nodes, go deeper after success 

#### 11. You're scaling up wrong 
- You are applying regularization and increasing network size at the wrong time 
- The ideal workflow is iterative, like tightening a car wheel: 
        **start with a small, basic network**
        --> train, work out kinks, gradient check 
        while Tuning: 
            --> overfit this model (add layers, nodes, epochs, gridsearch)
            --> add regularization (L1/L2, dropout, remove layers)
            --> work out kinks