# Introduction to Machine Learning and Deep Learning
### Core ML concepts

## Introduction

For general machine learning references, see e.g. [Bishop](#Bishop06), [Hastie et al.](#Hastie01) and [Murphy](#Murphy12). To motivate some of the most important concepts, let's first review the definition of machine learning itself. There are several definitions and perspectives, but one of the most popular is due to [Mitchell](#Mitchell97):

> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

We can unpick this definition by looking at what is meant by _experience E, tasks T_ and _performance measure P_. 

*Tasks T.* One of the strengths of deep learning models is their flexibility to solve a wide range of tasks. Typical tasks include:

* Classification
* Regression
* Clustering
* Anomaly detection
* Density estimation

*Experience E.* This relates to the type of data that is used to accomplish the given task. The data could be labelled examples (such as images of digits and their corresponding labels), unlabelled examples, or streaming data coming from an environment that an agent interacts with (this is the setting for reinforcement learning). Of course, the type of data needs to be appropriate for the learning task. A typical assumption is that the data is independent and identically distributed (iid).

*Performance measure P.* Given a learning task T and experience E, we then need a way of measuring how well a machine learning system accomplishes the task T. For example, for a regression task this could be the mean squared error, or for a binary classification task we could use binary cross entropy, or area under the ROC curve. 
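
To make these performance measures concrete, here is a minimal NumPy sketch that evaluates the mean squared error and the binary cross entropy on a few made-up predictions. The arrays and the helper names `mse` and `binary_cross_entropy` are illustrative assumptions and are not part of the toy example that follows.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up regression targets and predictions
reg_error = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

# Made-up binary labels with confident and uninformative predicted probabilities
bce_confident = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
bce_uncertain = binary_cross_entropy([1, 0, 1], [0.5, 0.5, 0.5])

print(reg_error, bce_confident, bce_uncertain)
```

Lower is better for both measures: the confident, mostly correct probabilities give a smaller cross entropy than predicting 0.5 everywhere.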

## Example dataset

The following toy example works through fitting a regression model, and demonstrates several key concepts in machine learning. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Create a regression dataset

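def f(x, noise_std=0.0, n_samples=100):
    """Quadratic ground truth with optional Gaussian noise; x is expected to have shape (n_samples, 1)."""
    y_true = 0.3 * x**2 + 0.5 * x - 0.5 + 3
    noise = noise_std * np.random.randn(n_samples, 1)
    return y_true + noise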

X = np.random.uniform(low=-5, high=5, size=100)[..., np.newaxis]
y = f(X, noise_std=2)

In [None]:
X.shape, y.shape

In [None]:
plt.scatter(X, y)
xlinspace = np.linspace(-5, 5, 100)[..., np.newaxis]
plt.plot(xlinspace, f(xlinspace), c='C1')
plt.xlabel("x")
plt.ylabel("y")
plt.show()

## Linear regression

We have a **dataset** $\mathcal{D} := (\mathbf{x}_i, y_i)_{i=1}^N$ consisting of $N$ examples of inputs $\mathbf{x} \in\mathbb{R}^d$ and targets $y\in\mathbb{R}$. 

We will denote the $j$-th **input feature** ($j=1,\ldots,d$) of the input $\mathbf{x}$ by $x^{(j)}$.

Our linear regression model tries to find the best parameters $\theta_j$ ($j=0, 1,\ldots,d$) such that

$$
f_\theta(\mathbf{x}) := \theta_0 + \sum_{j=1}^d \theta_j x^{(j)}
$$

is a good predictor of the **target** value $y$.

To simplify notation, we will often augment the **feature vector** by adding a constant 1 feature to the inputs:

$$
\hat{\mathbf{x}} := [1, x^{(1)}, \ldots, x^{(d)}]
$$

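Then our linear regression model can be written

$$
f_\theta(\mathbf{x}) := \sum_{j=0}^d \theta_j \hat{x}^{(j)} = \theta^T \hat{\mathbf{x}},
$$

where $\hat{x}^{(0)} = 1$ is the constant feature and $\hat{x}^{(j)} = x^{(j)}$ for $j=1,\ldots,d$.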

The parameters $\theta$ are found as the minimiser of the following mean squared error (MSE) **loss function**:

$$
{L}_{MSE}(\theta) := \frac{1}{N} \sum_{i=1}^N (\theta^T\hat{\mathbf{x}}_i - y_i)^2
$$
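
As a hedged illustration of the augmented feature vector and the MSE loss, the following NumPy sketch fits $\theta$ by least squares with `np.linalg.lstsq`. The synthetic data, the seed and the variable names are assumptions made for this sketch only; it is not the method used in the rest of this notebook, which relies on `LinearRegression` from `sklearn`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Synthetic data from a known linear rule: y = 2 + 3x + small Gaussian noise
N = 200
x = rng.uniform(-5, 5, size=(N, 1))
y = 2.0 + 3.0 * x[:, 0] + 0.1 * rng.standard_normal(N)

# Augment each input with a constant 1 feature: x_hat = [1, x]
X_hat = np.hstack([np.ones((N, 1)), x])

# The least-squares solution minimises the mean squared error over theta
theta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Evaluate the MSE loss at the fitted parameters
mse_loss = np.mean((X_hat @ theta - y) ** 2)
print("theta:", theta, "MSE:", mse_loss)
```

Here `theta[0]` plays the role of the intercept $\theta_0$ and `theta[1]` the slope $\theta_1$.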

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model.fit(X, y)
model.coef_, model.intercept_

In [None]:
def plot_model(model, data=None, ylim=None, plot_true=True):
    xlinspace = np.linspace(-5, 5, 100)[..., np.newaxis]
    predictions = model.predict(xlinspace)  # xlinspace @ model.coef_ + model.intercept_

    plt.plot(xlinspace, predictions, label='model')
    if plot_true:
        plt.plot(xlinspace, f(xlinspace), label='true')
    if data is not None:
        plt.scatter(data[0], data[1], alpha=0.2)
    plt.xlabel("x")
    plt.ylabel("y")
    if ylim is not None:
        plt.ylim(*ylim)
    plt.legend()
    plt.show()

In [None]:
plot_model(model, data=(X, y))

In [None]:
from sklearn.metrics import mean_squared_error

# sklearn's signature is mean_squared_error(y_true, y_pred)
mean_squared_error(y, model.predict(X))

## Nonlinear basis functions

We can make our linear model more expressive (or increase the **capacity** of our model) with the use of nonlinear **basis functions**.

So far our linear regression model just uses the **input feature** provided in our data:

$$
f_\theta(\mathbf{x}) := \theta_0 + \theta_1 x
$$

However, the feature vector in our linear regression model can be anything, so we can also consider linear regression models of the form

$$
f_\theta(\mathbf{x}) = \sum_{j=1}^M \theta_j\phi_j(\mathbf{x}),
$$

where the $\phi_j$ are called the **basis functions**. These basis functions can be nonlinear in general. For example, we could consider degree $P$ polynomial regressors of the form

$$
f_\theta(\mathbf{x}) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_P x^P = \sum_{j=0}^P \theta_j x^j.
$$
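
One way to see that polynomial regression is still linear in $\theta$ is to build the basis functions by hand. The sketch below is an illustration under assumptions (noise-free synthetic data and a hypothetical helper named `polynomial_features`); it uses `np.vander` to form the columns $1, x, \ldots, x^P$ and then solves an ordinary least-squares problem.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the matrix with columns x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=degree + 1, increasing=True)

# Noise-free samples of a quadratic, so the fit should recover its coefficients
x = np.linspace(-5, 5, 50)
y = 2.5 + 0.5 * x + 0.3 * x**2

Phi = polynomial_features(x, degree=2)           # one row per example, one column per basis function
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear least squares in theta
print(theta)
```

Because the data are noise-free, the recovered `theta` matches the coefficients 2.5, 0.5 and 0.3 up to floating-point error. Only the features are nonlinear in $x$; the fit itself is the same linear problem as before.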

In [None]:
# Fit a polynomial regressor of degree P

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

P = 2

poly = PolynomialFeatures(P, include_bias=False)
poly_features = poly.fit_transform(X)
model = LinearRegression()
model.fit(poly_features, y)

In [None]:
model = make_pipeline(PolynomialFeatures(P), LinearRegression())
model.fit(X, y)

In [None]:
plot_model(model, data=(X, y))

In [None]:
mean_squared_error(y, model.predict(X))

In [None]:
num_degrees = 20
degrees = np.arange(num_degrees)

polynomial_regressors = []
for degree in degrees:
    poly_regressor = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_regressor.fit(X, y)
    polynomial_regressors.append(poly_regressor)

In [None]:
evaluations = [mean_squared_error(y, model.predict(X)) for model in polynomial_regressors]
plt.semilogy(degrees, evaluations)
plt.xticks(degrees)
plt.xlabel("Degree")
plt.ylabel("MSE")
plt.show()

In [None]:
plot_model(polynomial_regressors[-1], data=(X, y))

## Data splits
In order to obtain a fair measure of the performance of an ML model, we typically split our available data into separate partitions. 

One partition will be the **training set**. This is used to infer the optimal parameters of our model, whilst the remaining data (also called **hold-out** data) is used purely for evaluation and not for training (optimising parameters). 

We may want to choose certain hyperparameters of our model (such as the polynomial degree $P$), in which case we can evaluate our model on the held-out data for each choice of hyperparameter and choose the hyperparameters that maximise the held-out data performance. In this case, the held-out data is called a **validation set**, and this process of choosing the best hyperparameters is **validation**. 

In addition, we may choose to define a third partition for a **test set**, which is used for a final evaluation of the model.

You should never use the validation or test splits for directly training the model (optimising its parameters).
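
A three-way partition can be sketched by shuffling the example indices once and slicing them. The 60/20/20 proportions, the seed and the helper name `three_way_split` are assumptions for illustration only.

```python
import numpy as np

def three_way_split(n_examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return disjoint index arrays for the training, validation and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)  # shuffle once so the partitions are random
    n_val = int(round(val_fraction * n_examples))
    n_test = int(round(test_fraction * n_examples))
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]   # everything left over is used for training
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Indexing the inputs and targets with the same index array (for example `X[train_idx]` and `y[train_idx]`) keeps each input paired with its target, and no example appears in more than one partition.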

In the following cell we use `sklearn` to make a training and validation partition of our toy dataset.

In [None]:
# We can use the train_test_split from sklearn to conveniently split the data

from sklearn.model_selection import train_test_split

print("x shape:", X.shape)
print("y shape:", y.shape)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.4)
print("\nx_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("\nx_val shape:", x_val.shape)
print("y_val shape:", y_val.shape)

This means that in practice what we optimise during training is the loss

$$
L_{MSE}(\theta) = \frac{1}{| \mathcal{D}_{train} |}\sum_{(\mathbf{x}_i, y_i)\in \mathcal{D}_{train}}(f_\theta(\mathbf{x}_i) - y_i)^2, \tag{2}
$$

where $\mathcal{D}_{train}$ denotes the training data partition.

The following cells illustrate this for our toy dataset, by fitting a polynomial regressor of each degree on the training partition only, and then computing the training and validation losses using `mean_squared_error` from `sklearn`.

In [None]:
num_degrees = 20
degrees = np.arange(num_degrees)

polynomial_regressors = []
for degree in degrees:
    poly_regressor = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_regressor.fit(x_train, y_train)
    polynomial_regressors.append(poly_regressor)

In [None]:
train_loss = [mean_squared_error(y_train, model.predict(x_train)) for model in polynomial_regressors]
val_loss = [mean_squared_error(y_val, model.predict(x_val)) for model in polynomial_regressors]
plt.semilogy(degrees, train_loss, label='Train MSE')
plt.semilogy(degrees, val_loss, label='Val MSE')
plt.xticks(degrees)
plt.xlabel("Degree")
plt.ylabel("MSE")
plt.legend()
plt.show()

In [None]:
print(f"Degree of best performing model on validation set: {degrees[np.argmin(val_loss)]}")

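By monitoring performance on both the training and validation sets, we can look for signs of **underfitting** and **overfitting**. Underfitting shows up as a high loss on both sets, because the model lacks the capacity to capture the trend in the data. Overfitting shows up as a training loss that keeps decreasing while the validation loss stops improving or gets worse.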

In this example, we can avoid overfitting by choosing a suitable degree for our polynomial features such that the performance is optimised on the validation set. 

This technique is a form of **regularisation**, where we control the **model capacity** to avoid overfitting. There are several other regularisation techniques that we will see later.

## Logistic regression

Consider a dataset $(\mathbf{x}_i, y_i)_{i=1}^N$, where each $y_i\in\{0, 1\}$. Recall that the **logistic regression** model is given by 

$$
f_\theta(\mathbf{x}) = \sigma(\theta^T\phi(\mathbf{x})),
$$

where $\sigma$ is the sigmoid function, given by

$$
\sigma(x) = \frac{1}{1 + e^{-x}}.
$$

The output of the function $f_\theta$ is interpreted as the probability of the input $\mathbf{x}$ belonging to the class label 1.

We optimise the parameters by minimising the **binary cross entropy** loss function:

$$
L_{BCE}(\theta) := -\frac{1}{N}\sum_{i=1}^N \{y_i \log f_\theta(\mathbf{x}_i) + (1 - y_i) \log (1 - f_\theta(\mathbf{x}_i))\}.
$$
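
As a rough sketch of how this loss can be minimised, the code below runs plain gradient descent on the binary cross entropy for a one-dimensional toy problem, using the feature map $\phi(x) = [1, x]$. The data, seed, learning rate and number of steps are assumptions picked for illustration; in practice a library optimiser would normally be used. The update relies on the gradient $\nabla_\theta L_{BCE} = \frac{1}{N}\sum_{i=1}^N (f_\theta(\mathbf{x}_i) - y_i)\,\phi(\mathbf{x}_i)$, which follows from differentiating the loss above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, Phi, y, eps=1e-12):
    """Binary cross entropy of labels y under the model sigmoid(Phi @ theta)."""
    p = np.clip(sigmoid(Phi @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Toy data: label 1 becomes more likely as x increases
N = 200
x = rng.uniform(-3, 3, size=N)
y = (rng.uniform(size=N) < sigmoid(2.0 * x)).astype(float)

Phi = np.column_stack([np.ones(N), x])  # feature map phi(x) = [1, x]
theta = np.zeros(2)
initial_loss = bce_loss(theta, Phi, y)

learning_rate = 0.1  # assumed step size
for _ in range(2000):
    # Gradient of the BCE loss with respect to theta
    grad = Phi.T @ (sigmoid(Phi @ theta) - y) / N
    theta -= learning_rate * grad

final_loss = bce_loss(theta, Phi, y)
print(theta, initial_loss, final_loss)
```

With $\theta = 0$ the model predicts probability 0.5 for every input, so the initial loss equals $\log 2$; gradient descent then reduces it as the slope parameter grows positive.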

<a class="anchor" id="references"></a>
### References

<a class="anchor" id="Bishop06"></a>
* Bishop, C. M. (2006), "Pattern Recognition and Machine Learning", Springer-Verlag, Berlin, Heidelberg.
<a class="anchor" id="Hastie01"></a>
* Hastie, T., Tibshirani, R. & Friedman, J. (2001), "The Elements of Statistical Learning", Springer New York Inc., New York, NY, USA.
<a class="anchor" id="Mitchell97"></a>
* Mitchell, T. (1997), "Machine Learning", McGraw-Hill, New York.
<a class="anchor" id="Murphy12"></a>
* Murphy, K. P. (2012), "Machine Learning: A Probabilistic Perspective", The MIT Press.
