### Fundamental Difference between ML and DL

- In "simple" ML (e.g. LinReg, LogReg, Trees, Bayesian Models), feature engineering is everything!
    - hyperparameter optimization is *somewhat* useful.


- In "deep" learning, hyperparameter optimization is everything. The onus of deciding the architecture / structure of the network is on you!!
    - a lot of the feature engineering (FE) is already done for you: the hidden layers learn intermediate features from the raw inputs.

### What are the hyperparameters in Neural Networks?
- 'Neural networks are not "off-the-shelf" algorithms in the way that random forest or logistic regression are. Even for simple, feed-forward networks, the onus is largely on the user to make numerous decisions about how the network is configured, connected, initialized and optimized.'

- Number of layers:
    - the more layers, the more "hidden features" the model can learn. Too many -> tends to overfit.
    - too many layers --> `Vanishing Gradient Problem`: if the network is too "deep", the gradients shrink as backprop passes them back through each layer, so the earliest layers barely update. Diminishing returns.
- Number of epochs:
    - one epoch == one full pass over the training data. The more epochs, the longer the model trains (too many -> tends to overfit).
- Activation Functions:
    - Step
        - primitive, not used in practice.
    - Sigmoid
        - outputs a value between 0 and 1, which can be read as a probability
        - IF YOU ARE DOING BINARY CLASSIFICATION, then the LAST LAYER MUST BE A SIGMOID.
        - a single neuron in the last layer!
    - ReLU
        - use it for hidden layers! Normally used in hidden layers only.
        - trains super fast; the gradient / deriv is super easy to calculate (1 for positive inputs, 0 otherwise)
        - other variants: Leaky ReLU / ELU
    - Linear
        - no activation!
        - use it in the last layer for regression: the network then outputs an unbounded real number.
    - Softmax:
        - extension of logistic / sigmoid function for MULTIPLE CLASSES.
        - this mathematical function ensures that all the probabilities add up to 1.
        - IF YOU ARE DOING MULTICLASS CLASSIFICATION, then the LAST LAYER MUST USE A SOFTMAX FUNCTION.
- Batch Size:
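    - Batch == "sub"-epoch: the number of samples processed before each weight update. E.g. 1,000 samples with a batch size of 100 -> 10 updates per epoch.
    - larger batch size: fewer updates per epoch, so training is usually faster
    - smaller batch size: lighter load on memory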
- Number of neurons in each layer:
    - same as before: more neurons == more learning capacity. Too many neurons == maybe overfitting.
- Optimizers:
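    - different variants of gradient descent (e.g. SGD, Adam, RMSprop): backprop computes the gradients, the optimizer decides how to use them to update the weights.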
- Weight Initialization:
    - use the default one (for Keras `Dense` layers that is Glorot uniform).
- Type of layers:
    - Fully Connected == "Dense" Layers 
    - Convolutional layers (kernels / filters)
    - Recurrent Neural Network layers 
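
The activation functions listed above can be sketched in a few lines of plain Python. This is only an illustration of the math, not how Keras implements them; the function names and the example scores are made up for this sketch.

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1); used for binary classification outputs."""
    # Two branches so that math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relu(z):
    """Pass positive values through unchanged, clip negatives to 0."""
    return max(0.0, z)

def softmax(scores):
    """Turn a list of raw scores into probabilities that add up to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs ("logits") of a 3-class last layer.
probs = softmax([2.0, 1.0, 0.1])

print(sigmoid(0.0))           # 0.5
print(relu(-3.0), relu(3.0))  # 0.0 3.0
print(probs)                  # three probabilities, largest for the first class
```

Note that softmax keeps the ordering of the scores: the class with the highest raw score gets the highest probability.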

---
---
### Best Practices
---
---

How to avoid overfitting?

- Reduce complexity of model
    - make the model complex at first, then reduce.
    - complexity ~= number of layers and neurons
- Increase data size:
    - Image / Data Augmentation:
        - see: Keras Data Augmentation / Image Augmentation
        - random pixel shifting / contrast / rotation
- Regularization:
    - Dropout
        - randomly shuts off some percentage of neurons during training. Prevents overspecialization.
    - BatchNormalization
        - standard scaling between layers, computed per batch -- improves training speed.
        - basically the same idea as standard scaling as part of feature engineering in simple ML.
- Transfer Learning
    - You'll see this later.
    - We can take existing networks that have been pre-trained and re-fit them for our own purposes.
    - Some popular networks: VGG-16 / VGG-19 (16 layers and 19 layers, respectively) and ResNet50 (50 layers).
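
To make the two regularization layers from the list above more concrete, here is a rough sketch in plain Python. It assumes "inverted" dropout (survivors are scaled up so the expected sum stays the same) and leaves out the learned scale / shift and the running averages that a real batch normalization layer also keeps. The function names and sample values are hypothetical.

```python
import random
import statistics

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Survivors are divided by (1 - rate) so the expected total is unchanged.
    Assumes 0 <= rate < 1.
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def batch_normalize(values, eps=1e-5):
    """Standard-scale one feature across a mini-batch: mean 0, variance ~1."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    # eps avoids division by zero when all values in the batch are equal.
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
dropped = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng)
normalized = batch_normalize([10.0, 20.0, 30.0, 40.0])

print(dropped)     # each entry is either 0.0 or twice the original value
print(normalized)  # centered on 0 with unit spread
```

At prediction time dropout is switched off, so all neurons contribute.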

**How to save a model in Keras:**
- call `m.save('model.h5')` on your model `m`; load it back with `keras.models.load_model('model.h5')`.
- https://www.tensorflow.org/guide/keras/save_and_serialize

### Further Reading:
- What should I do when my neural network doesn't learn? Best practices for when you are stuck: https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn
    - The top answers in this post are extremely informative / give you more good "rules of thumb".
