### The ML workflow

1. Frame the problem
2. Collect, analyze and prepare data
3. Select and train models
4. Tune the chosen model
5. Deploy to production and maintain the system

### The elements of a supervised ML system

1. Some **data** in numeric form.
1. A **model** able to produce results from data.
1. A **loss** (or **cost**) function to quantify the model error (difference between expected and actual results).
1. An **optimization algorithm** to update the model's parameters in order to minimize the loss.

### Design matrix

- $\pmb{X}$: matrix of the form (*samples, features*) expected by most ML algorithms and called *design matrix*.
  - First dimension is for the $m$ samples.
  - Second dimension is for the $n$ features of each sample.

$$\pmb{X} = \begin{bmatrix}
       \ x^{(1)T} \\
       \ x^{(2)T} \\
       \ \vdots \\
       \ x^{(m)T} \\
     \end{bmatrix} = 
\begin{bmatrix}
       \ x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\
       \ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\
       \ \vdots & \vdots & \ddots & \vdots \\
       \ x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n
     \end{bmatrix} \in \pmb{R}^{m \times n} $$
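As a minimal sketch of this layout, the snippet below builds a design matrix with NumPy. The toy values and the variable names (`samples`, `first_sample`, `second_feature`) are illustrative assumptions, not part of any particular dataset.

```python
import numpy as np

# Hypothetical toy dataset: m = 3 samples, n = 2 features each.
samples = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
X = np.array(samples)

m, n = X.shape            # first dimension: samples, second: features
first_sample = X[0]       # row x^(1)T: all features of the first sample
second_feature = X[:, 1]  # column: second feature across all samples
```

Indexing a row selects one sample, while indexing a column selects one feature for every sample.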

### Terminology and notations: loss

- $\mathcal{L_{\pmb{X, y}}(\pmb{\theta})}$, sometimes denoted $\mathcal{J_{\pmb{X, y}}(\pmb{\theta})}$: **loss** (or **cost**) function that quantifies the difference between expected results (called *ground truth*) and actual results computed by the model.
- During model training, the input dataset $\pmb{X}$ and the expected results $\pmb{y}$ are treated as constants. The loss depends solely on the model parameters $\pmb{\theta}$. To simplify the notation, the loss function will be written $\mathcal{L(\pmb{\theta})}$.
- Different loss functions exist. The choice depends on the type of learning task (regression, binary or multiclass classification).

### Loss functions for regression

- *Mean Absolute Error* (aka *l1 or Manhattan norm*):

$$\mathrm{MAE}(\boldsymbol{\pmb{\theta}}) = \frac{1}{m}\sum_{i=1}^m |h_\theta(\mathbf{x}^{(i)}) - y^{(i)}| = \frac{1}{m}{\lVert{h_\theta(\pmb{X}) - \pmb{y}}\rVert}_1 = \frac{1}{m}{\lVert{\pmb{y'} - \pmb{y}}\rVert}_1$$

- *Mean Squared Error*, more sensitive to outliers than MAE because errors are squared:

$$\mathrm{MSE}(\boldsymbol{\pmb{\theta}}) = \frac{1}{m}\sum_{i=1}^m (h_\theta(\mathbf{x}^{(i)}) - y^{(i)})^2 = \frac{1}{m}{{\lVert{h_\theta(\pmb{X}) - \pmb{y}}\rVert}_2}^2$$

- *Root Mean Squared Error* (aka *l2 or Euclidean norm*), the default choice:

$$\mathrm{RMSE}(\boldsymbol{\pmb{\theta}}) = \sqrt{\frac{1}{m}\sum_{i=1}^m (h_\theta(\mathbf{x}^{(i)}) - y^{(i)})^2} = \frac{1}{\sqrt{m}}{\lVert{h_\theta(\pmb{X}) - \pmb{y}}\rVert}_2$$
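The three regression losses can be sketched in a few lines of NumPy. The function names and the toy vectors are assumptions chosen for illustration; `y_pred` stands for the model outputs $h_\theta(\pmb{X})$.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return np.mean(np.abs(y_pred - y_true))

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Hypothetical example: three perfect predictions and one outlier (error of 4).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

mae_value = mae(y_true, y_pred)    # 4 / 4 = 1.0
mse_value = mse(y_true, y_pred)    # 16 / 4 = 4.0
rmse_value = rmse(y_true, y_pred)  # sqrt(4.0) = 2.0
```

With a single large error, the squared losses grow faster than MAE, which illustrates their greater sensitivity to outliers.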

### Loss function for binary classification

- The expected result $y^{(i)}$ is either 0 or 1.
- The computed result $y'^{(i)}$ is a probability (float value between 0 and 1).
- A frequently used loss function for binary classification is the *Binary Crossentropy* (aka *logistic loss* or *negative log likelihood*):

$$\mathcal{L}(\boldsymbol{\pmb{\theta}}) = -\frac{1}{m}\sum_{i=1}^m \left(y^{(i)} \log_e(y'^{(i)}) + (1-y^{(i)}) \log_e(1-y'^{(i)})\right)$$
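A possible NumPy sketch of this loss follows. The `eps` clipping is an assumption added to avoid `log(0)`; it is not part of the formula. The toy labels and probabilities are illustrative.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-12):
    """Binary crossentropy averaged over samples.

    eps is an illustrative safeguard: probabilities are clipped
    to [eps, 1 - eps] so that the logarithm stays finite.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(
        y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    )

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])  # close to the ground truth
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])  # far from the ground truth

loss_right = binary_crossentropy(y_true, confident_right)  # equals -log(0.9)
loss_wrong = binary_crossentropy(y_true, confident_wrong)  # equals -log(0.1)
```

Confident but wrong predictions are penalized far more heavily than confident and correct ones.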

### Loss function for multiclass classification

The standard loss function for multiclass classification is *Categorical Crossentropy*:

$$\mathcal{L}(\boldsymbol{\pmb{\theta}}) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K y^{(i)}_k \log_e(y'^{(i)}_k)$$

It is equivalent to _Binary Crossentropy_ when $K = 2$.
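The sketch below computes categorical crossentropy on one-hot labels and checks numerically that it matches the binary formula for two classes. The function name, the `eps` clipping and the toy values are assumptions for illustration.

```python
import numpy as np

def categorical_crossentropy(y_onehot, y_prob, eps=1e-12):
    """Categorical crossentropy for one-hot labels of shape (m, K)."""
    y_prob = np.clip(y_prob, eps, 1.0)  # illustrative guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

# Binary problem expressed with K = 2 classes.
y = np.array([1.0, 0.0, 1.0])  # ground truth (0 or 1)
p = np.array([0.8, 0.3, 0.6])  # predicted probability of class 1

y_onehot = np.stack([1.0 - y, y], axis=1)  # columns: class 0, class 1
y_prob = np.stack([1.0 - p, p], axis=1)

cce = categorical_crossentropy(y_onehot, y_prob)
bce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Here `cce` and `bce` agree up to floating-point error, since the two-column one-hot encoding reproduces both terms of the binary loss.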

### Analytical solution

- Technique for computing the regression coefficients $\theta_i$ analytically (by calculus).
- One-step learning algorithm (no iterations).
- Also called *Ordinary Least Squares*.

$$\pmb{\theta^{*}} = (\pmb{X}^T\pmb{X})^{-1}\pmb{X}^T\pmb{y}$$

- $\pmb{\theta^*}$ is the parameter vector that minimizes the loss function $\mathcal{L}(\theta)$.
- This result is called the **Normal Equation**.
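A minimal NumPy sketch of the Normal Equation on noise-free toy data. The data, the explicit bias column of ones and the comparison with `np.linalg.lstsq` are assumptions added for illustration.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 4 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 4.0 + 3.0 * x

# Design matrix with a leading column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta* = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem solved without forming an explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both approaches recover an intercept of 4 and a slope of 3. In practice, a least-squares solver is usually preferred over an explicit inverse, which can be numerically unstable when $\pmb{X}^T\pmb{X}$ is close to singular.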

### True/False positives and negatives

- **True Positive (TP)**: the model _correctly_ predicts the positive class.
- **False Positive (FP)**: the model _incorrectly_ predicts the positive class.
- **True Negative (TN)**: the model _correctly_ predicts the negative class.
- **False Negative (FN)**: the model _incorrectly_ predicts the negative class.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision and recall

- **Precision**: proportion of positive identifications that were actually correct.
- **Recall** (or *sensitivity*): proportion of actual positives that were identified correctly.

$$Precision = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$

$$Recall = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{Total Actual Positives}}$$

### F1 score

- *Harmonic mean* of precision and recall: it is high only when both are high.
- Also known as _balanced F-score_ or _F-measure_.

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$

Good metric in case of class imbalance, when precision and recall are both important.
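The metrics above can be sketched in plain Python from the four confusion counts. The helper `confusion_counts` and the imbalanced toy labels are hypothetical; the sketch does not handle the division-by-zero cases that arise when there are no predicted or actual positives.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical imbalanced dataset: 2 positives among 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)  # 1, 1, 7, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10
precision = tp / (tp + fp)                          # 1 / 2
recall = tp / (tp + fn)                             # 1 / 2
f1 = 2 * precision * recall / (precision + recall)  # 0.5
```

Accuracy looks comfortable here because negatives dominate, while precision, recall and F1 show that only half of the positives are handled correctly.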

### ROC curve and AUC

$$\text{TP Rate} = \frac{TP}{TP + FN} = Recall\;\;\;\;
\text{FP Rate} = \frac{FP}{FP + TN}$$

- ROC stands for "Receiver Operating Characteristic".
- A ROC curve plots TPR vs. FPR at different classification thresholds.
- AUC ("Area Under the ROC Curve") provides an aggregate measure of performance across all possible classification thresholds.
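One way to sketch the construction of a ROC curve is to sweep the threshold over the predicted scores and record (FPR, TPR) at each step, then integrate with the trapezoidal rule. The helper names `roc_points` and `auc`, and the toy scores, are assumptions for illustration.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return FPR and TPR arrays, sweeping thresholds from high to low."""
    thresholds = np.sort(np.unique(scores))[::-1]
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted positive
    for t in thresholds:
        pred = scores >= t   # samples classified as positive at this threshold
        tpr.append(np.sum(pred & (y_true == 1)) / positives)
        fpr.append(np.sum(pred & (y_true == 0)) / negatives)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical predicted probabilities

fpr, tpr = roc_points(y_true, scores)
area = auc(fpr, tpr)
```

A perfect ranking of positives above negatives would give an area of 1, while a random ranking would be expected to give about 0.5.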

### Decision Trees in a nutshell

- Supervised method, used for classification or regression.
- Build a tree-like structure based on a series of questions on the data.

[![Decision Tree Example](images/dt_pdsh.png)](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html)

### Tree nodes

Each node is a step in the decision process, starting with the *root node* (depth 0). Leaf nodes represent predictions of the model.

Node attributes are:

- **Gini**: measure of the node *impurity*.
- **Samples**: number of samples the node applies to.
- **Value**: number of samples of each class the node applies to.
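The Gini impurity of a node can be derived from its **Value** attribute, using the usual definition $G = 1 - \sum_k p_k^2$ where $p_k$ is the proportion of samples of class $k$ in the node. The sketch below assumes this definition; the class counts are hypothetical.

```python
def gini(value):
    """Gini impurity from per-class sample counts of a node."""
    total = sum(value)
    return 1.0 - sum((count / total) ** 2 for count in value)

pure_node = gini([0, 40, 0])     # a single class: impurity 0
mixed_node = gini([50, 50, 50])  # three equally represented classes: 1 - 3 * (1/3)^2
mostly_one = gini([0, 49, 5])    # dominated by one class: low impurity
```

A node is *pure* (Gini of 0) when all the samples it applies to belong to the same class.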

### Random Forests in a nutshell

- Ensemble of Decision Trees, generally trained via bagging.
- May be used for classification or regression.
- Trees are grown using a random subset of features.
- Ensembling mitigates the individual shortcomings of Decision Trees (overfitting, sensitivity to small changes in the dataset).
- On the other hand, results are less interpretable.
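The two sources of randomness and the aggregation step can be sketched with NumPy, without training actual trees. The dataset shape, the seed, the $\sqrt{n}$ heuristic for the number of candidate features and the example votes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

m, n = 8, 4  # hypothetical dataset: 8 samples, 4 features
X = np.arange(m * n, dtype=float).reshape(m, n)

# Bagging: each tree is trained on m rows drawn with replacement.
row_idx = rng.integers(0, m, size=m)
X_bootstrap = X[row_idx]

# Feature subsampling: candidate features considered when searching for a split.
k = int(np.sqrt(n))  # common heuristic for classification (assumption)
feature_idx = rng.choice(n, size=k, replace=False)
X_split_view = X_bootstrap[:, feature_idx]

# Aggregation for classification: majority vote over the trees' predictions.
votes = np.array([1, 0, 1])  # hypothetical predictions of three trees
prediction = int(np.bincount(votes).argmax())
```

For regression, the aggregation step would typically average the trees' predictions instead of taking a vote.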

# Artificial Neural Networks

### Neuron output

![Neuron output](images/neuron_output.png)

### Universal approximation theorem (1991)

- The hidden layers of a neural network transform their input space.
- A network can be seen as a series of non-linear compositions applied to the input data.
- Given appropriate complexity and appropriate learning, a network can theoretically approximate any continuous function.
- One of the most important theoretical results for neural networks.


### Activation functions

- Applied to the weighted sum of neuron inputs to produce its output.
- Always non-linear. Otherwise, the whole network would reduce to a single linear transformation of its inputs and couldn't solve complex problems.
- The main ones are:
  - **sigmoid** (*logistic function*)
  - **tanh** (*hyperbolic tangent*)
  - **ReLU** (*Rectified Linear Unit*)
  
$$\sigma(z) = \frac{1}{1 + e^{-z}}\;\;\;\;
\tanh(z) = 2\sigma(2z) - 1\;\;\;\;
\mathrm{ReLU}(z) = \max(0,z)$$
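A short NumPy sketch of these functions follows. It checks the identity between tanh and the sigmoid numerically, and shows why non-linearity matters: two stacked linear layers give the same result as a single one. The weight matrices are arbitrary illustrative values, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: element-wise max(0, z)."""
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0  # should match np.tanh(z)

# Two layers without activation (hypothetical weights, no biases).
W1 = np.array([[1.0, 2.0], [0.0, 1.0]])
W2 = np.array([[2.0, 0.0], [1.0, 1.0]])
x = np.array([1.0, -1.0])

two_linear_layers = W2 @ (W1 @ x)  # layer 2 applied to layer 1
single_layer = (W2 @ W1) @ x       # one equivalent linear layer
with_relu = W2 @ relu(W1 @ x)      # a non-linearity breaks the equivalence
```

For this input, `two_linear_layers` and `single_layer` are identical, while `with_relu` differs because the negative pre-activations are zeroed out.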
