# Deep Learning

refs:
* Coursera. deeplearning.ai
* https://blog.goodaudience.com/artificial-neural-networks-explained-436fcf36e75


## Main concepts 


An ANN can learn non-linear relationships.

<img src="images/non-linear_and_linear_decision_edge.png" width="400" align="left"/> 



### Architecture

* neuron

<div style="clear:both">
<img src="images/neuron_ANN.png" width="400" align="left"/> 
</div>

<br>

<div style="clear:both">
** activation function (**allows an ANN to learn non-linear relationships**) 

    A good activation function has a non-linear shape, and both the function and its first derivative are cheap to compute.

*** **sigmoid function**: one of the most common functions; widely used in logistic regression  
**** range: 0.0 to 1.0  
**** good for binary classifiers (output layer) 


*** **softmax**: 
**** range: a vector in which each element is between 0.0 and 1.0 and the elements sum to 1.0. The output vector has one element per class
**** good for multi-class classification (output layer)
**** emphasizes the most likely class and returns probabilities

*** **tanh**: hyperbolic tangent   
**** range: -1.0 to 1.0  
**** the mean output is zero, which helps optimization (for the same reason we normalize the input features)  
**** good for hidden layers  

*** **ReLU**: very common  
**** range: 0 to inf   
**** good for hidden layers  
</div>
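
The activation functions listed above can be sketched in a few lines of plain Python. This is only an illustration of the output ranges, not how a deep learning framework implements them; the input values in `scores` are made up for the example.

```python
import math

def sigmoid(z):
    """Map a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu(z):
    """Pass positive values through, clamp negative values to 0."""
    return max(0.0, z)

scores = [-2.0, 0.0, 3.0]  # hypothetical pre-activation values
print("sigmoid:", [round(sigmoid(z), 3) for z in scores])
print("tanh:   ", [round(math.tanh(z), 3) for z in scores])
print("relu:   ", [relu(z) for z in scores])
print("softmax:", [round(p, 3) for p in softmax(scores)])
```

Note how softmax gives most of the probability mass to the largest score, which is what "emphasizes the most likely class" refers to.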

* layers  

<div style="clear:both">
<img src="images/layers.jpeg" width="400" align="left"/> 
</div>

<br>

<div style="clear:both">
* Loss

*** **cross-entropy loss** or **log loss**: measures the performance of a classifier whose outputs are probabilities between 0.0 and 1.0 

Cross-entropy loss increases as the predicted probability diverges from the actual label.
It is the average negative log-likelihood over all the data.

* Forward propagation: computes the output for a given input (used in both the training and prediction phases)

* Back-propagation: computes the gradients used to update the weights while the ANN is learning. Only used in the training phase
</div>
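
To make the cross-entropy definition concrete, here is a minimal sketch for the binary case. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions made for illustration.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]
good = cross_entropy(labels, [0.9, 0.1, 0.8])  # predictions close to the labels
bad = cross_entropy(labels, [0.4, 0.6, 0.3])   # predictions diverging from the labels
print(f"close to labels:       {good:.3f}")
print(f"diverging from labels: {bad:.3f}")
```

The second call yields the larger loss, matching the statement that the loss grows as predictions diverge from the labels.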

## How to train recipe

refs: http://karpathy.github.io/2019/04/25/recipe/

1. General advice

    * the fast and furious approach does not work
    * being patient and paying attention to detail tends to work (it correlates with success)
    * being defensive and obsessed with visualizations works
    * take baby steps and avoid **introducing a lot of unverified complexity at once**
    * build from simple to complex
    * ReLUs are good for hidden layers
        * For positive inputs the slope is constant, so they learn faster than logistic and tanh units
        * For negative inputs the gradient is zero, so training can get stuck with dead neurons
    
1. Become one with the data 

    * inspect the data
    * try to see patterns (your brain is good at it)
    * always check for:
        * duplicates 
        * corrupted data
        * wrong labels (if not systematic, they may not hurt too much)
        * imbalanced data
        
1. Set up the training and evaluation pipeline and test it

    * work with a fixed seed
    * simplify: do not add any regularization yet
    * **verify the loss at init**: it should be -log(1/n_classes) for classifiers
    * **overfit one batch or a tiny training sample, as few as 2 examples**
    * **input-independent baseline** (shuffle the labels): the DNN should not learn. Check the errors on the test and validation datasets
    * visualize the input of the DNN right before y_hat = model(X), i.e. visualize X.  

1. Overfit (reduce bias error)

    * overfit
        * focus on the **train loss**, which should get close to zero
        * if you have tried many models of increasing complexity and still cannot overfit, this suggests you have a BUG
    * do not be a hero. Start with the most closely related paper and copy its simplest architecture.
        * for images, ResNet-50 is a good start
        * for voice, xvectors
    * **Adam is safe with a learning rate of 3e-4**, but you can try different learning rates.
    * **Add complexity one piece at a time**. If you have multiple features, add them one by one and ensure each gives a performance boost. Or start with smaller images and then increase the image size 
    * **do not trust learning rate decay**. Karpathy always disables learning rate decay entirely. This is personal advice; it may simply be less problematic

1. Regularize (reduce variance error)

    * **get more data** is by far the preferred way to regularize a model. It is **the only guaranteed way to improve the model.**
    * **data augmentation**. The next best thing
    * **pretrain**. It rarely hurts to use a pre-trained network, even if you have enough data.
        * xvectors
        * ResNet-50
    * **reduce the input dimensionality**. Remove features that may carry spurious signal (e.g. correlated features)
    * **make the model smaller**. Personal advice
        * Karpathy used to use FC layers after ImageNet, but these days he uses average pooling, eliminating a ton of parameters
    * **decrease the batch size**. Helps with regularization
    * **dropout**
    * **early stopping**
    
1. Tune

    * **random over grid search**. Never use grid search
    * **hyper-parameter optimization**
    
    
1. Squeeze the juice (It is not preference)

    * leave it training. Karpathy once forgot a model running and came back to SOTA (state of the art) results
    * ensembles
        * TODO: read this paper about how to use an ensemble to build one simple model: https://arxiv.org/abs/1503.02531
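
The "verify the loss at init" check from the pipeline step can be sketched as follows. It assumes a classifier whose untrained output is roughly uniform, modelled here by all-zero logits; `n_classes` and `true_class` are made-up values for the example.

```python
import math

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

n_classes = 10                # hypothetical number of classes
logits = [0.0] * n_classes    # stand-in for the output of an untrained network
probs = softmax(logits)
true_class = 3                # arbitrary label; any class gives the same loss here

initial_loss = -math.log(probs[true_class])
expected = -math.log(1.0 / n_classes)
print(f"initial loss {initial_loss:.4f}, expected {expected:.4f}")
```

If the loss measured on the first batch of a real model is far from this value, the final layer initialization or the loss wiring is worth a second look.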
    

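The random search recommended in the tuning step can be sketched like this. The hyper-parameter names and ranges are hypothetical and only show the idea of sampling each configuration independently, with the learning rate drawn on a log scale.

```python
import random

def sample_config(rng):
    """Draw one hyper-parameter configuration at random."""
    return {
        # log-uniform: every order of magnitude is equally likely
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "dropout": rng.uniform(0.0, 0.5),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }

rng = random.Random(0)  # fixed seed, as the recipe recommends
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Each sampled configuration would then be trained and scored on the validation set; unlike a grid, every trial explores a new value of every hyper-parameter.
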
## Learning rate diagnostics 

* refs:
    
    * https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10   (approach to determine the best lr)
    * https://www.dataquest.io/blog/learning-curves-machine-learning/




### Bias and variance trade-off 

**Training error is still too high for the application**

<img src="images/biasvariance.png" height="250" width="400">

**Variance error is related to the gap between the training error and the validation error** 

There is a minimum total error

<img src="images/irr_error.png" height="250" width="400">

### High bias


* left: high bias and low variance 
* right: high bias and high variance

What to do:

* Adding more training instances (helps mainly when variance is also high, as in the right-hand case).
* Adding more features.
* Feature selection.
* Hyperparameter optimization
* train longer (deep learning)

<img src="images/add_data.png" height="400" width="600">

### Low bias high variance error


* left: low variance 
* right: high variance

What to do?

* Adding more training instances.  

* Increase the regularization for our current learning algorithm. This should decrease the variance and increase the bias.  

    * L1 or L2
    * dropout

* Reducing the number of features in the training data we currently use. The algorithm will still fit the training data very well, but due to the decreased number of features, it will build less complex models. This should increase the bias and decrease the variance. 


<img src="images/low_high_var.png" height="400" width="600">

### Learning rates 

The learning rate controls how much we adjust the weights of our network with respect to the gradient of the loss function. 


<img src="images/learning_rate.png" height="400" width="600">

* Too small: converges slowly. Getting the rate right means **less training time and less money spent on GPU cloud compute. :)**
* Too large: the loss oscillates or diverges and training does not converge

<img src="images/learning_rate2.png" height="200" width="300">

### Is there a methodology to determine the best learning rate?

In the article **"Cyclical Learning Rates for Training Neural Networks"**, Leslie N. Smith argued that you could estimate a good learning rate by training the model initially with a very low learning rate and increasing it (either linearly or exponentially) at each iteration.


1. change the learning rate at each mini-batch (linearly or exponentially)
1. plot the loss against the learning rate (log scale) and choose a rate close to the minimum of the loss


**The Python package fastai has a function to do that.** fastai is like Keras for PyTorch
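
The two steps of the learning rate range test can be sketched without any framework. This is a toy illustration, not the fastai implementation: the one-parameter linear model, the synthetic data, and the names `lr_range_test`, `lr_min`, and `lr_max` are all assumptions made for the example.

```python
import math
import random

rng = random.Random(0)
# Toy data for a 1-D linear model y = 3x + noise (stand-in for a real dataset)
xs = [rng.uniform(-1, 1) for _ in range(512)]
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

def lr_range_test(data, lr_min=1e-4, lr_max=10.0, batch_size=16):
    """Increase the learning rate exponentially at each mini-batch and record the loss."""
    n_steps = len(data) // batch_size
    growth = (lr_max / lr_min) ** (1.0 / (n_steps - 1))
    w, lr, history = 0.0, lr_min, []
    for step in range(n_steps):
        batch = data[step * batch_size:(step + 1) * batch_size]
        loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        history.append((lr, loss))
        if not math.isfinite(loss) or loss > 4 * history[0][1]:
            break  # the loss has exploded, stop the sweep
        w -= lr * grad   # one SGD update with the current learning rate
        lr *= growth     # exponential schedule
    return history

history = lr_range_test(data)
best_lr, best_loss = min(history, key=lambda pair: pair[1])
print(f"lowest loss {best_loss:.4f} at learning rate {best_lr:.4f}")
```

In practice `history` would be plotted with the learning rate on a log axis, and the rate would be read off the curve rather than taken blindly from the single lowest point.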

### Tips and learning curve diagnostics

* https://stats.stackexchange.com/questions/345990/why-does-the-loss-accuracy-fluctuate-during-the-training-keras-lstm

* https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error 

**Summary**

* reasons for val loss or error being smaller than train
    * difference between the train and val data distributions. 
        * Maybe train has harder cases while validation has easier ones.
        * wrongly labeled data in the train dataset
    * dropout (at a high rate) can sometimes cause this, because it is active only during training

* reasons for loss oscillation
    1. batch_size is too small
    1. large neural network and small dataset (**always compare #parameters and #samples**)
    
    
* Batch size trade-off (also related to the previous point)
    * too large makes training slow
    * too small causes loss oscillation and takes more epochs to converge
    * a large batch size (**a small number of mini-batches** per epoch) can keep the DNN from learning
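
The comparison between the number of parameters and the number of samples is easy to automate. The sketch below counts weights and biases of a fully connected network; the layer sizes and the dataset size are hypothetical.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

layer_sizes = [100, 512, 256, 10]  # hypothetical architecture: input, two hidden layers, output
n_samples = 100                    # hypothetical dataset size
n_params = count_mlp_parameters(layer_sizes)
print(f"{n_params} parameters for {n_samples} samples "
      f"({n_params / n_samples:.0f} parameters per sample)")
```

A ratio this lopsided is a hint that oscillating or erratic loss curves may come from the model being far too large for the data.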

You can think of model evaluation in four different categories:

1. Underfitting – Validation and training error high

1. Overfitting – Validation error is high, training error low

1. Good fit – Validation error low, slightly higher than the training error

1. Unknown fit - Validation error low, training error 'high'


I say 'unknown' fit because the result is counter intuitive to how machine learning works. The essence of ML is to predict the unknown. If you are better at predicting the unknown than what you have 'learned', AFAIK the data between training and validation must be different in some way. 
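
The four categories can be turned into a small helper. The threshold `target_error` that separates "low" from "high" is a hypothetical value that depends on the application, and the helper ignores the "slightly higher" nuance of the good-fit case.

```python
def diagnose_fit(train_error, val_error, target_error=0.10):
    """Classify a (train, validation) error pair into one of the four categories."""
    train_high = train_error > target_error
    val_high = val_error > target_error
    if train_high and val_high:
        return "underfitting"
    if not train_high and val_high:
        return "overfitting"
    if not train_high and not val_high:
        return "good fit"
    return "unknown fit"  # validation error low, training error high

for train_err, val_err in [(0.30, 0.35), (0.02, 0.30), (0.05, 0.07), (0.20, 0.05)]:
    print(train_err, val_err, "->", diagnose_fit(train_err, val_err))
```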


=================================

There are several reasons that can cause fluctuations in training loss over epochs. The main one though is the fact that almost all neural nets are trained with different forms of stochastic gradient descent. This is why the batch_size parameter exists: it determines how many samples you want to use to make one update to the model parameters. If you use all the samples for each update, you should see the loss decreasing and finally reaching a limit. Note that there are other reasons for the loss having some stochastic behavior.

This explains why we see oscillations. But in your case, it is more than normal I would say. Looking at your code, I see two possible sources.

Large network, small dataset: It seems you are training a relatively large network with 200K+ parameters with a very small number of samples, ~100. To put this into perspective, you want to learn 200K parameters or find a good local minimum in a 200K-D space using only 100 samples. Thus, you might end up just wandering around rather than locking down on a good local minimum. (The wandering is also due to the second reason below).

Very small batch_size. You use a very small batch_size. So it's like you are trusting every small portion of the data points. Let's say within your data points, you have a mislabeled sample. This sample, when combined with 2-3 even properly labeled samples, can result in an update which does not decrease the global loss, but increases it, or throws it away from a local minimum. When the batch_size is larger, such effects would be reduced. Along with other reasons, it's good to have batch_size higher than some minimum. Having it too large would also make training go slow. Therefore, batch_size is treated as a hyperparameter.

### Gradient descent algorithms

refs: https://stats.stackexchange.com/questions/49528/batch-gradient-descent-versus-stochastic-gradient-descent

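$
J(\theta)=\frac{1}{2}\sum_{i=1}^N\left(y_i - h_{\theta}(x_i)\right)^2
$

$
\theta_j = \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} 
$

For a linear hypothesis $h_{\theta}(x_i) = \theta^T x_i$, the update is given by 


$
\Delta \theta_j = -\alpha \frac{\partial J(\theta)}{\partial \theta_j} = \alpha \sum_{i=1}^N\left(y_i - h_{\theta}(x_i)\right)x_{i,j}
$

where $x_{i,j}$ is the $j$-th feature of sample $i$.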

1. **Gradient descent**  

    * Compute the gradient of the cost function using the entire dataset 
    * Update the weights.
    
    Pros and cons  
    
    * **Computationally slow and uses a lot of memory**  
    * With a suitably small learning rate, the loss decreases at every update (no sampling noise)   
    

1. **Stochastic Gradient Descent**
    
    * Compute the gradient and update the weights for each sample
    
    Pros and cons  
    
    * More sensitive to noise  
    * Faster per update than gradient descent  
    * Uses less memory   


1. **Mini-batch Gradient Descent**  

    * Compute the gradient for each mini-batch (this is an estimate of the true gradient)  

    Pros and cons  
    
    * More robust to noisy data than stochastic gradient descent
    * Usually the fastest of the three in practice, since each update is vectorized
    * Uses less memory than gradient descent but more than stochastic gradient descent
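
The three variants differ only in how many samples feed each update, so one loop with a `batch_size` argument covers all of them. The sketch below is a toy under stated assumptions: a one-parameter linear hypothesis, synthetic data with a true parameter of 2, and a gradient averaged over the batch rather than summed.

```python
import random

rng = random.Random(0)
# Toy data for h_theta(x) = theta * x with true theta = 2 (hypothetical example)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]

def train(data, batch_size, alpha=0.1, epochs=20, seed=0):
    """Gradient descent on the squared error; batch_size selects the variant."""
    shuffler = random.Random(seed)
    theta = 0.0
    samples = list(data)
    for _ in range(epochs):
        shuffler.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # gradient of the batch mean of 1/2 * (y - theta * x)^2
            grad = -sum((y - theta * x) * x for x, y in batch) / len(batch)
            theta -= alpha * grad
    return theta

theta_batch = train(data, batch_size=len(data))  # gradient descent: 1 update per epoch
theta_sgd = train(data, batch_size=1)            # stochastic: 1 update per sample
theta_mini = train(data, batch_size=32)          # mini-batch: in between
print(f"batch: {theta_batch:.3f}  sgd: {theta_sgd:.3f}  mini-batch: {theta_mini:.3f}")
```

With the same number of epochs, the full-batch run makes far fewer updates than the other two, so in this toy setting it ends further from the true value.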


See these discussions on batch sizes:
* https://stats.stackexchange.com/questions/316464/how-does-batch-size-affect-convergence-of-sgd-and-why

>  the minibatch size gets larger the convergence of SGD actually gets harder/worse,

* Paper: https://research.fb.com/publications/accurate-large-minibatch-sgd-training-imagenet-in-1-hour/

>  large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization.  

* https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network/236393#236393 

> It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. 
