```
___  ___           _     _              _                       _
|  \/  |          | |   (_)            | |                     (_)
| .  . | __ _  ___| |__  _ _ __   ___  | | ___  __ _ _ __ _ __  _ _ __   __ _
| |\/| |/ _` |/ __| '_ \| | '_ \ / _ \ | |/ _ \/ _` | '__| '_ \| | '_ \ / _` |
| |  | | (_| | (__| | | | | | | |  __/ | |  __/ (_| | |  | | | | | | | | (_| |
\_|  |_/\__,_|\___|_| |_|_|_| |_|\___| |_|\___|\__,_|_|  |_| |_|_|_| |_|\__, |
                                                                         __/ |
                                                                        |___/
```

# Supervised learning
In `supervised learning`, we have $x$, the `input`, and $y$, the `output`
(or `label`).
For instance, $x$ can be some biomedical information about a patient, or some
information about a customer. $y$ can respectively be whether the patient is
healthy or sick, or the likelihood of the customer churning.

The collections of admissible inputs and outputs are respectively called the
`input space` $\mathbb{X}$ and the `output space` $\mathbb{Y}$.

In supervised learning, the goal is to come up with some function
$\hat{y} : \mathbb{X} \rightarrow \mathbb{Y}$ modelling some phenomenon.

$\hat{y} \in \mathbb{H}$ is known as the `hypothesis`, the `model`, or
the `predictor` (classifier or regressor).
$\mathbb{H}$ is called the `hypothesis space` (or `model class`).

Supervised learning works in two stages:

1. **Learning/training**: a *good* model $\hat{y}_*$ is selected from
$\mathbb{H}$;
2. **Prediction/inference**: $\hat{y}_*$ is used to make predictions on (new)
data $x \in \mathbb{X}$.
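
The two stages above can be sketched with scikit-learn, where `fit` plays the role of learning and `predict` the role of inference. The k-nearest-neighbours classifier, the digits dataset and the 75/25 split are assumptions made purely for illustration; any other hypothesis space would do.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X = digits.data    # inputs, one row per image
y = digits.target  # outputs, the digit shown in each image

# Hold out some pairs to play the role of "new" data (assumed 75/25 split).
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Learning/training: select a model from the hypothesis space.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 2. Prediction/inference: apply the selected model to new inputs.
y_pred = model.predict(X_new)
print(y_pred[:10])
```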

> ###### Distribution (technical note)
> Not all $(x, y)$ pairs are equally likely (or even possible). This is captured
> through the notion of a probability distribution $\mathcal{D}$.
> $(x, y) \sim  \mathcal{D}$ indicates that the labeled pair $(x, y)$ has been
> drawn from $\mathcal{D}$.


## Data
To select the best hypothesis, we have a `learning set` of labeled pairs
$LS = \{(x_i, y_i) \}_{i=1}^n$.

### Inputs
Typically, we will assume that $\mathbb{X} \subseteq \mathbb{R}^p$. Therefore,
it is customary to group the inputs into a `learning matrix` $X$ such that the
element at the $i$th row and $j$th column is the $j$th component of the vector
corresponding to instance $i$.

In [None]:
from sklearn import datasets
digits = datasets.load_digits()

# Transforming images to a learning matrix
X = digits.images.reshape((len(digits.images), -1))
print(X.shape)
X  # each row corresponds to a sample, each column to a variable

The dimensions of the vectors are called (input) `variables`, or `features`.

Coming up with the appropriate features is not trivial. This is either done
manually (`feature engineering`) or as part of learning the hypothesis
(`representation learning`).


### Classification and regression
The nature of $\mathbb{Y}$ dictates the type of problem. When the output
variable is discrete (e.g. healthy/sick), the problem is known as
`classification`. The output can be referred to as the `class`. A hypothesis
can further be called a `classifier` in this setting.

When the output is continuous (e.g. the number of cases, the gross production),
the problem is a `regression`. A hypothesis can be referred to as a `regressor`
in this setting.

> ###### Class probabilities and encoding (technical note)
> The discrete nature of the output in classification is often quite limiting.
> A common work-around is for the classifier to output a vector $\hat{p}$ of
> size $K$ (where $K$ is the number of classes)
> indicating the probability (according to the model) of belonging to each
> class.
>
> The true output must sometimes also match this representation, in which case
> classes are encoded as one-hot vectors.

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

y = np.array([0, 1, 1, 2, 0, 1])  # y[i] is the class of the ith instance
OneHotEncoder().fit_transform(y.reshape((-1, 1))).toarray()
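
To make the link between probability vectors and one-hot targets concrete, here is a minimal NumPy sketch. The values in `p_hat` are hypothetical model outputs invented for the example, and taking the most probable class as the prediction is one common convention rather than the only option.

```python
import numpy as np

K = 3  # number of classes

# Hypothetical probability vectors produced by some classifier (one row per instance).
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])

# True classes and their one-hot encoding (row i has a 1 in column y[i]).
y = np.array([0, 2])
y_one_hot = np.eye(K)[y]

# Each probability vector sums to one; the predicted class is the most probable one.
print(p_hat.sum(axis=1))
predicted_class = p_hat.argmax(axis=1)
print(predicted_class)
```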

## Loss and error

We, of course, want a hypothesis that models the phenomenon well. For that, we
have the notion of `loss function`.

### Loss
A loss function $\ell$ measures how far away a prediction $\hat{y}$ falls
from the truth $y$.

In regression, the most common loss function is the `squared error`:
> $$\ell(y, \hat{y}) = (y - \hat{y})^2$$

In classification, we usually use the `zero-one` loss, which indicates
whether the model is making a mistake:
> $$\ell(y, \hat{y}) = \mathbb{I}(y \neq \hat{y})$$

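When working with class probabilities (where $K$ is the number of classes,
$y^{(j)}$ is the $j$th component of the one-hot encoding of $y$, and
$\hat{p}^{(j)}(x)$ is the predicted probability of class $j$),
a common choice of loss is the `cross-entropy`:
> $$\ell(y, \hat{p}(x)) = -\sum_{j=1}^K y^{(j)} \log \hat{p}^{(j)}(x)$$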
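
The three losses can be written as small Python functions. This is a sketch: the function names are ours, the cross-entropy carries the conventional leading minus sign so that it is non-negative, and it assumes every predicted probability is strictly positive.

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared error between a true value and a prediction (regression)."""
    return (y - y_hat) ** 2

def zero_one(y, y_hat):
    """1 if the predicted class is wrong, 0 otherwise (classification)."""
    return int(y != y_hat)

def cross_entropy(y_one_hot, p_hat):
    """Cross-entropy between a one-hot target and a probability vector.

    Assumes all entries of p_hat are strictly positive.
    """
    return -float(np.sum(y_one_hot * np.log(p_hat)))

print(squared_error(3.0, 2.5))  # 0.25
print(zero_one(1, 2))           # 1

# With a one-hot target, only the probability of the true class matters.
ce = cross_entropy(np.array([1.0, 0.0, 0.0]), np.array([0.7, 0.2, 0.1]))
print(ce)  # equals -log(0.7)
```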

> ###### Note on binary classification metrics
> There are also many specific metrics for binary classification (specificity,
> sensitivity, recall, AUROC, AUPR, FPR, F1-score, etc.). This is
> motivated by the fact that not all errors should have the same weight. For
> instance, it might be better to wrongly diagnose a cancer (an error which can
> be caught later on) than to miss one.
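
A few of these metrics can be computed by counting the four possible outcomes. The labels below are made up for the example, with 1 standing for sick (positive) and 0 for healthy (negative).

```python
# Hypothetical true labels and predictions: 1 = sick (positive), 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, wrongly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, correctly cleared

sensitivity = tp / (tp + fn)  # also called recall or true positive rate
specificity = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)          # false positive rate, equal to 1 - specificity

print(sensitivity, specificity, fpr)  # 0.75 0.75 0.25
```

Missing a sick patient lowers the sensitivity, while wrongly flagging a healthy one lowers the specificity, which is why the two kinds of errors can be weighted separately.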

### Error (risk)
We usually want to know how the model performs in general. This is captured by
the notion of `error` or `risk`, which consists in taking the
average/expectation of the loss.

The error based on the squared error is the `mean squared error` (MSE).
Sometimes, it is preferable to take the root of the MSE, which is known as the
`root mean squared error` (RMSE).

The error based on the zero-one loss is known as the
`misclassification rate`. It is the proportion (or probability) of predictions
on which the model makes a mistake. Conversely, the proportion of correct
predictions is known as the `accuracy`.
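
As an illustration, these errors are simply averages of the corresponding losses over a set of predictions. The arrays below are made-up values chosen for the example.

```python
import numpy as np

# Regression: hypothetical true values and predictions.
y_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_reg_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = float(np.mean((y_reg - y_reg_hat) ** 2))  # average squared error
rmse = float(np.sqrt(mse))                      # same unit as the output
print(mse, rmse)  # mse is 0.375

# Classification: hypothetical true classes and predicted classes.
y_cls = np.array([0, 1, 1, 2, 0, 1])
y_cls_hat = np.array([0, 1, 2, 2, 0, 0])

misclassification_rate = float(np.mean(y_cls != y_cls_hat))  # average zero-one loss
accuracy = 1.0 - misclassification_rate
print(misclassification_rate, accuracy)  # 2 mistakes out of 6
```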


> ###### Empirical vs. expected risk (technical note)
> The goal of supervised learning is
>> $$\hat{y}_* = \arg\min_{\hat{y} \in \mathbb{H}} \mathbb{E}_{(x,y) \sim \mathcal{D}} \{\ell(y, \hat{y}(x)) \}$$
> In words, we want to minimize the expected risk.
>
> In practice, we do not have access to the whole distribution. Rather, we can
> estimate the risk given some set $S = \{(x_i, y_i)\}_{i=1}^n$, which yields
> the empirical risk (or empirical error):
>> $$\frac{1}{n} \sum_{i=1}^n \ell(y_i, \hat{y}(x_i))$$
>
> For the empirical error to be a reliable estimate of the expected risk, some
> precautions must be taken (see `overfitting` and `data shift`).
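
The relation between the two risks can be checked on a toy problem where the distribution is known. Everything below is an assumption made for the sketch: $x$ is uniform on $[0, 1]$, $y = 2x$ plus Gaussian noise of standard deviation 0.5, and the hypothesis is fixed to $\hat{y}(x) = 2x$, so that its expected squared-error risk is exactly the noise variance, 0.25.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # standard deviation of the noise (assumed)

def draw(n):
    """Draw n labeled pairs (x, y) from the toy distribution."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, sigma, size=n)
    return x, y

def y_hat(x):
    """A fixed hypothesis, not fitted on the samples it is evaluated on."""
    return 2.0 * x

def empirical_risk(x, y):
    """Average squared-error loss over the given set."""
    return float(np.mean((y - y_hat(x)) ** 2))

expected_risk = sigma ** 2  # E[(y - 2x)^2] is the variance of the noise

# The estimate gets closer to the expected risk as the set grows.
for n in (10, 1_000, 100_000):
    x, y = draw(n)
    print(n, empirical_risk(x, y), expected_risk)

x_large, y_large = draw(100_000)
risk_large = empirical_risk(x_large, y_large)
```

The hypothesis here is chosen independently of the samples. When the same set is used both to select the model and to estimate its risk, the estimate is typically optimistic.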

# Related paradigms
In the following paradigms, the goal remains to learn a good predictor:

- *semi-supervised learning*: we have access to unlabeled data
(only $x$ samples) in addition to the traditional learning set;
- *few-shot learning*: we only have a (too) small learning set;
- *zero-shot learning*: we have no learning set;
- *active learning*: the goal is to decide which inputs should be labeled to
improve the performance the most;
- *transfer learning*/*domain adaptation*: leverage knowledge learned on a
source task to help in a target task;
- *transductive learning*: making predictions without explicitly building a
model.


# Other paradigms
From here on, the goal changes.

## Unsupervised learning
In unsupervised learning, the goal is to glean insight from the data.

### Clustering
The goal of clustering is to group samples that are similar in some sense.
The main difficulty is defining the notion of similarity.

The most common algorithms are `k-means` and `hierarchical clustering`.
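
As a small sketch, k-means can be run on the digits data with scikit-learn. The number of clusters (10) and the other parameters are assumptions for the example; note that the labels are never used, and that similarity here means Euclidean distance between pixel vectors.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()
X = digits.data  # one row per image, labels are ignored

# Group the images into 10 clusters (assumed number of clusters).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:20])               # cluster index of the first 20 samples
print(kmeans.cluster_centers_.shape)  # one centre per cluster: (10, 64)
```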

### Dimensionality reduction
The goal of dimensionality reduction is to summarize the data by reducing the
number of variables. We can do that by

- selecting a subset of important variables;
- projecting onto another space (`PCA`, `t-SNE`, `feature learning`).
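
Both options can be sketched on the digits data. Ranking variables by their variance is only one possible, assumed notion of importance, and the numbers of variables kept (10) and of components (2) are arbitrary choices for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data  # shape (1797, 64)

# Option 1: select a subset of variables, here the 10 with the largest variance.
top = np.argsort(X.var(axis=0))[-10:]
X_selected = X[:, top]

# Option 2: project onto another space, here the first 2 principal components.
X_projected = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_projected.shape)  # (1797, 10) (1797, 2)
```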


## Reinforcement learning

Reinforcement learning is a complex setting where the goal is for an agent to
learn a policy describing how it should behave in a (possibly only partially
observable) environment.


## Density estimation

> [TODO](https://en.wikipedia.org/wiki/Density_estimation)