# Copyright Note


## Deep Learning Reference 1
**Most of the deep learning and convolutional neural network materials' figures, definitions, and examples are courtesy of or adapted from the following book:** **Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, by Aurélien Géron** [relative link](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646)
## Deep Learning Reference 2

**Most of the deep learning and convolutional neural network materials' figures, definitions, and examples are courtesy of or adapted from the following book:** **Deep Learning with Python by Francois Chollet (2018)** [relative link](https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438/ref=sr_1_3?ie=UTF8&qid=1532546159&sr=8-3&keywords=deep+learning+with+python)

**Other examples are adapted from wiki and internet resources**

-----------------------------------

# Lecture Objective
## Understanding the course goals and activities.
## Knowing how to get the maximum from the course.
## Going through the Canvas.
## Starting BMI6015 review and skimming through HW1

-----------------------

# Lecture overview

## [Difference between ML and DL architecture.][1]:
 
   **Machine Learning (ML):** Learning and discovering from data patterns to perform a specific task (1959).  
   
   
   <img style="float: center" src="./images/typesOfML.png" alt="drawing" Hight="300" width="300"/>

   **Deep Learning (DL):** Learning feature representations from data in order to extract features and perform a specific task (1986).

<img style="float: center" src="./images/DL&ML&Rules-based.png" alt="drawing" Hight="600" width="600"/>


**Machine Learning:** Learning/training from data to perform a specific task by:
1. Mapping features to output.
2. Discovering patterns.

**Deep Learning:** Multi-layer model to extract data/features representation. 

<img style="float: center" src="./images/ML&DL.png" alt="drawing" Hight="600" width="600"/>

[1]: http://www.deeplearningbook.org/

Some figures from [this resource](https://hprc.tamu.edu/files/training/2021/Spring/Introduction_to_DL_with_TensorFlow.pdf)

### Problems with NN and DL

- Huge Data size.
- A lot of hyperparameters.
- Time and memory consumption.



## BMI6015-basics (review)

### [Overfitting vs. Underfitting](https://learnopencv.com/bias-variance-tradeoff-in-machine-learning/)

#### [Bias and variance trade-off][1]



1. Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
   - High bias leads to model underfitting: the model is too general to identify specific, detailed data patterns.
   - High bias is due to wrong assumptions, such as using a model that is too simple (e.g., linear when the data is quadratic).
   - Solutions:
      - Introduce new features.
      - Use other techniques.
2. Variance is the variability of the model's prediction for a given data point or value.
   - High variance leads to model overfitting: the model is too specific and very sensitive to data variations, noise, and outliers.
   - High variance is due to using models with many degrees of freedom, such as high-degree polynomials, which are excessively sensitive to small variations in the training data.
   - Solutions:
      - Increase the number of examples.
      - Reduce the variability of feature values using data preprocessing techniques.
      - Use other techniques.
      
![image.png](attachment:1e819e47-91fa-4515-a295-0fb5182913a7.png)


__We aim to reduce the generalization error__, which decomposes as $Generalization\;Error=Bias^2+Variance+Irreducible\;Noise$

- Where the irreducible noise:
    - Is the noise in the data itself.
    - Cannot be reduced by improving the model.
    - Can be reduced only by cleaning up the data.

[1]: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229


####  Diagram and solution

<img style="float: center" src="./images/Machine-Learning-Workflow.png" alt="drawing" Hight="700" width="700"/> 



### Data splitting techniques

#### [Hold-out cross-validation][1]

Split the data randomly into training, validation/development, and test/unseen parts, usually 60%, 20%, and 20% respectively (other ratios are also possible):
1. Training set: Data set to build the model.
2. Validation/development set: Data set to evaluate the learning algorithm with different configurations. It is called the development set because we use it while developing our model. Because the model is tuned on it, its results can be somewhat biased, which is why we need the third kind of data set.
3. Test/unseen set: Data set to check the accuracy of the final model and get unbiased results.
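
A 60/20/20 hold-out split can be sketched with two calls to scikit-learn's `train_test_split`. The toy arrays and the `random_state` value below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed): 100 samples, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of the data as the test/unseen set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

The second split uses 0.25 rather than 0.20 because it is applied to the 80% that remains after the test set has been removed.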

##### <span style="color:blue"> Hold-out cross-validation: Notes</span>
1. It is a simple algorithm, but a single random split cannot be used to prove model generalization.

#### [K-fold cross-validation][2]

We usually split the data into K=10 folds (each fold holds 1/K of the data) such that:
1. For each fold i where i = 1,2,...K, do:
    1. Train the learner on all folds except i.
    2. Use the ith fold to test the model trained in the previous sub-step and report the performance results.
2. Average the model performance results over the K iterations in step (1).
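
The steps above can be sketched as follows. This is a minimal illustration that assumes a synthetic dataset from `make_classification` and a logistic regression learner; neither choice comes from the lecture.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (assumed for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on all folds except the current one
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Test on the held-out fold and record the accuracy
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the performance over the K iterations
mean_score = float(np.mean(fold_scores))
print(f"mean accuracy over {len(fold_scores)} folds: {mean_score:.3f}")
```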

##### <span style="color:blue"> K-fold cross-validation: Notes</span>
1. It is an in-place and computationally feasible algorithm, and we could use it to prove model generalization on the population.

<img style="float: center" src="./images/Kfold-cross-validation.jpg" alt="drawing" Hight="300" width="500"/> 


#### [Bootstrapping cross-validation][3]


1. Choose a number of bootstrap samples to perform (usually 100, 200, ..., or 1000 repetitions).
2. Choose a sample size (usually equal to the size of the dataset).
3. For each bootstrap sample:
   1. Randomly draw a sample with replacement (in-the-bag training sample) with the chosen size:
       - While the size of the sample is less than the chosen size:
            - Randomly select an observation from the dataset
            - Add it to the sample (i.e., In-the-bag training sample).
   2. Fit a model on the data sample
   3. Estimate the model performance on the remaining unselected observations (the out-of-bag sample).
4. Calculate the average of the model performance results in all bootstrap samples in step (3).
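
The bootstrap procedure above can be sketched as follows, again assuming a synthetic dataset and a logistic regression learner purely for illustration. Drawing all indices at once with `rng.integers` is equivalent to the one-by-one sampling loop described in step 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

n_bootstraps = 100        # number of bootstrap samples
n = len(X)                # sample size = size of the dataset

oob_scores = []
for _ in range(n_bootstraps):
    # Draw n indices with replacement: the in-the-bag training sample
    in_bag = rng.integers(0, n, size=n)
    # Observations that were never drawn form the out-of-bag sample
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)
    # Fit on the in-bag sample, evaluate on the out-of-bag sample
    model = LogisticRegression(max_iter=1000).fit(X[in_bag], y[in_bag])
    oob_scores.append(model.score(X[out_of_bag], y[out_of_bag]))

# Average the performance over all bootstrap samples
mean_oob = float(np.mean(oob_scores))
print(f"mean out-of-bag accuracy: {mean_oob:.3f}")
```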


##### <span style="color:blue"> Bootstrapping: Notes</span>
1. We use the [0.632 bootstrap rule][4], in which the in-the-bag sample has about 63.2% distinct observations and the out-of-bag sample has the remaining 36.8% of the observations.
2. It is a computationally intensive and out-of-place algorithm, and we could use it to prove model generalization on the simulated population.
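
The 63.2% figure can be checked numerically. When n indices are drawn with replacement from n observations, the expected fraction of distinct observations is $1-(1-1/n)^n$, which approaches $1-1/e\approx 0.632$. A small sketch, with the dataset size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # dataset size (arbitrary choice for this sketch)

# Draw one bootstrap sample of size n with replacement
in_bag = rng.integers(0, n, size=n)

# Fraction of distinct observations that ended up in the bag
distinct_fraction = np.unique(in_bag).size / n
theoretical_fraction = 1 - (1 - 1 / n) ** n

print(f"observed distinct fraction: {distinct_fraction:.3f}")
print(f"theoretical fraction:       {theoretical_fraction:.3f}")
```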


<img style="float: center" src="./images/Bootstrap-example.jpg" alt="drawing" Hight="100" width="300"/> 


[1]: https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf
[2]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[3]:https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
[4]:http://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap_point632_score/


#### Stratified K-fold

Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
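
To illustrate how stratification preserves class percentages, the sketch below uses an assumed imbalanced label vector (80% class 0, 20% class 1) with placeholder features, and checks the share of class 1 in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels (assumed): 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Share of class 1 in this test fold
    test_positive_rates.append(float(y[test_idx].mean()))

print(test_positive_rates)
```

Each test fold keeps the same 20% share of class 1 as the full dataset, which a plain `KFold` does not guarantee.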

#### Repeated Stratified K-fold

Repeated Stratified K-Fold cross-validator. Repeats Stratified K-Fold n times with different randomization in each repetition.


#### [Sklearn library](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

In [None]:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data: 4 samples, 2 per class
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# n_splits cannot exceed the number of samples in each class (2 here)
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=1)
for train_index, test_index in rskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

## Neural Network basics

# [Gradient Descent][1]

<b>Gradient descent</b> is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. The general idea of gradient descent is to tweak parameters iteratively in order to minimize a cost function.

## In the linear regression:

We have to tweak $\theta$ parameters with the following gradient descent:

<img style="float: center" src="./images/GD-1.png" alt="drawing" height="300" width="400"/>

<u> Learning rate too small</u>:

<img style="float: center" src="./images/GD-2.jpg" alt="drawing" height="300" width="400"/>

<u> Learning rate too large</u>:

<img style="float: center" src="./images/GD-3.jpg" alt="drawing" height="300" width="400"/>
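
The effect of the learning rate can be reproduced on a toy problem. The sketch below assumes a one-parameter cost $f(\theta)=\theta^2$ (gradient $2\theta$), a starting point of 5.0, and 20 steps; these are illustrative choices, not values from the lecture.

```python
def run_gd(eta, theta0=5.0, n_steps=20):
    """Run gradient descent on f(theta) = theta**2 and return the final theta."""
    theta = theta0
    for _ in range(n_steps):
        grad = 2.0 * theta          # derivative of theta**2
        theta = theta - eta * grad  # gradient descent step
    return theta

# Too small, reasonable, and too large learning rates
final = {eta: run_gd(eta) for eta in (0.01, 0.3, 1.1)}
for eta, theta in final.items():
    print(f"eta={eta}: theta after 20 steps = {theta:.6f}")
```

With the small rate the parameter is still far from the minimum at 0 after 20 steps, with the moderate rate it is essentially at the minimum, and with the large rate each step overshoots and the parameter grows in magnitude instead of shrinking.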

If the random initialization starts the algorithm on the left, then it will converge to <u>a local minimum</u>, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross <u>the plateau</u>, and if you stop too early you will never reach the global minimum.

<img style="float: center" src="./images/GD-4.jpg" alt="drawing" height="300" width="400"/>

We need <u>feature scaling with gradient descent</u> so that it heads straight toward the minimum:

<img style="float: center" src="./images/FS-GD.jpg" alt="drawing" height="300" width="400"/>
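
One common way to scale features is standardization. The sketch below uses scikit-learn's `StandardScaler` on an assumed toy matrix whose two columns have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features (assumed): the second column is about 1000x larger than the first
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 4000.0]])

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every feature
print(X_scaled.std(axis=0))   # close to 1 for every feature
```

After scaling, both features contribute on a comparable scale, so the cost surface is less elongated and gradient descent takes a more direct path.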

### Partial Derivatives of the cost function

<center> $\frac{\partial}{\partial\theta_j} MSE(\theta)=\frac{2}{m}\sum_{i=1}^m\left(\theta^T x^{(i)}-y^{(i)}\right)x_j^{(i)} $ </center>


### Gradient vector of the cost function

<center>
$\nabla_{\theta}MSE(\theta)=\begin{pmatrix} 
      \frac{\partial}{\partial\theta_0} MSE(\theta)\\
      \frac{\partial}{\partial\theta_1} MSE(\theta) \\
      \vdots\\
      \frac{\partial}{\partial\theta_n} MSE(\theta) 
   \end{pmatrix} 
   =\frac{2}{m}X^T(X\theta-y)
$
</center>
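
As a sanity check, the per-parameter partial derivatives and the vectorized gradient can be compared on random data. The shapes and the random values below are assumptions for this sketch; `X` includes a leading column of ones for the bias term $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                                       # m examples, n features (assumed)
X = np.c_[np.ones(m), rng.normal(size=(m, n))]     # add bias column x_0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Partial derivative for each theta_j, written as an explicit sum over examples
partials = np.array([
    (2.0 / m) * sum((theta @ X[i] - y[i]) * X[i, j] for i in range(m))
    for j in range(n + 1)
])

# Vectorized gradient: (2/m) * X^T (X theta - y)
gradient = (2.0 / m) * X.T @ (X @ theta - y)

print(np.allclose(partials, gradient))
```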

### Gradient descent step

<center> $\theta^{(\text{next step})}=\theta-\eta\nabla_{\theta}MSE(\theta)$ </center> 

where $\eta$ is the learning rate. We repeat this step until the algorithm converges, i.e. until the gradient is close to zero and the loss stops decreasing.
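
Putting the gradient and the update step together gives batch gradient descent for linear regression. The following is a minimal sketch on assumed synthetic data (true intercept 4, true slope 3, Gaussian noise), with an assumed learning rate and a fixed number of iterations as the stopping rule.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
x = 2 * rng.random((m, 1))                      # one feature in [0, 2)
y = 4 + 3 * x[:, 0] + rng.normal(size=m)        # noisy linear target (assumed)
X_b = np.c_[np.ones(m), x]                      # add bias column x_0 = 1

eta = 0.1                # learning rate (assumed)
n_iterations = 1000      # fixed iteration budget (assumed)
theta = rng.normal(size=2)  # random initialization

for _ in range(n_iterations):
    gradients = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Reference solution from a least-squares solver, for comparison
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
final_mse = float(np.mean((X_b @ theta - y) ** 2))

print("gradient descent:", theta)
print("least squares:   ", theta_lstsq)
print("final MSE:       ", final_mse)
```

Because the data is noisy, the MSE settles at a non-zero value even though the parameters match the least-squares solution.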


[1]:https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

### Definitions and terminologies

The components of a neural network model, i.e. **the loss function, the optimization algorithm, and the activation function**, play very important roles in training a model efficiently and effectively and in producing accurate results. 

**A loss function (cost/objective function)**: How the network measures its performance on the training data, and thus how it steers itself in the right direction.

**An optimizer**: The mechanism through which the network updates itself based on the data it sees and its loss function.

[relative link](https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c)


**Gradient Descent**: An iterative optimization algorithm used in machine learning to find the best result (a minimum of a curve).
Gradient means the rate of inclination or declination of a slope. Descent means the act of descending.
[relative link](http://ruder.io/optimizing-gradient-descent/)

**Epochs**: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.

**Batch Size**: The number of training examples in a single batch.

**Iterations**: The number of batches needed to complete one epoch. 
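
The relationship between these three terms is simple arithmetic. The dataset size, batch size, and number of epochs below are made-up values for illustration.

```python
import math

n_samples = 2000   # size of the training set (assumed)
batch_size = 64    # examples per batch (assumed)
n_epochs = 5       # full passes over the dataset (assumed)

# The last batch may be smaller than batch_size, so round up
iterations_per_epoch = math.ceil(n_samples / batch_size)
total_iterations = iterations_per_epoch * n_epochs

print(iterations_per_epoch)  # 32
print(total_iterations)      # 160
```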

[relative link](https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9)

[Activation Functions](https://en.wikipedia.org/wiki/Activation_function)

[Difference between Softmax and Sigmoid](http://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/)