# Introduction to Machine Learning and Deep Learning
### Keras / SGD / Multilayer perceptrons

We will use the following toy dataset to illustrate fitting a logistic regression classifier.

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=[[0.0, 5.0], [0.0, -5.0]], cluster_std=3)

In [None]:
X.shape, y.shape

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 3))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

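Fitting a logistic regression model means minimising the binary cross-entropy loss over the weights and bias. Unlike linear regression, this minimisation problem has no closed-form solution, so we use **gradient descent** to optimise the parameters.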
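
To make the idea concrete, here is a minimal NumPy sketch of full-batch gradient descent on the binary cross-entropy loss. The helper names (`sigmoid`, `bce_loss`, `gradient_descent`), the learning rate, the step count, and the synthetic two-cluster data are all illustrative assumptions, not part of the notebook's own code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Mean binary cross-entropy of a logistic regression model."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient_descent(X, y, lr=0.05, n_steps=500):
    """Full-batch gradient descent, starting from zero weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        # Gradient of the mean cross-entropy with respect to w and b
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-in for the blobs above: class 0 around (0, 5), class 1 around (0, -5)
rng = np.random.default_rng(0)
X_demo = np.concatenate([
    rng.normal([0.0, 5.0], 3.0, size=(75, 2)),
    rng.normal([0.0, -5.0], 3.0, size=(75, 2)),
])
y_demo = np.concatenate([np.zeros(75), np.ones(75)])

w_gd, b_gd = gradient_descent(X_demo, y_demo)
print(bce_loss(np.zeros(2), 0.0, X_demo, y_demo), bce_loss(w_gd, b_gd, X_demo, y_demo))
```

With zero parameters every prediction is 0.5, so the starting loss is $\log 2$; each gradient step should move the loss below that value.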

The `sklearn` library has a `LogisticRegression` class (similar to the `LinearRegression` class we used) that can fit a logistic regression model. From now on, however, we will use `keras` for model development: it is more flexible, and it also supports general deep learning models.

In [None]:
import keras

from keras.models import Sequential
from keras.layers import Dense, Input

In [None]:
model = Sequential([
    Input(shape=(2,)),
    Dense(1, activation='sigmoid')  # Sigmoid activation: logistic regression
])
model.summary()

We will train the logistic regression model with **stochastic gradient descent**, or SGD (Robbins and Monro 1951).

In [None]:
model.compile(loss='binary_crossentropy', optimizer='sgd')

In [None]:
model.fit(Xtrain, ytrain, epochs=50, batch_size=32)  # batch_size=32 is default
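
As a rough picture of what one epoch of mini-batch SGD involves, the sketch below shuffles the data, splits it into batches of 32, and takes one gradient step per batch. The function `sgd_epoch` and the synthetic data are hypothetical; the step size of 0.01 is assumed to match the default of Keras' `'sgd'` optimizer, and this is not a description of Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(w, b, X, y, lr=0.01, batch_size=32, rng=None):
    """One shuffled pass over the data; returns updated (w, b) and the update count."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(y))
    n_updates = 0
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]  # the last batch may be smaller
        p = sigmoid(X[idx] @ w + b)
        # Gradient of the mean cross-entropy, estimated on this batch only
        w = w - lr * X[idx].T @ (p - y[idx]) / len(idx)
        b = b - lr * np.mean(p - y[idx])
        n_updates += 1
    return w, b, n_updates

# 112 synthetic examples, roughly the size of the training split above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(112, 2))
y_demo = (X_demo[:, 1] < 0).astype(float)

w_sgd, b_sgd, n_updates = sgd_epoch(np.zeros(2), 0.0, X_demo, y_demo, rng=rng)
print(n_updates)  # 112 examples in batches of 32 give 4 updates (32 + 32 + 32 + 16)
```

Each update uses a noisy estimate of the full gradient, which is what makes the method stochastic.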

In [None]:
model.evaluate(Xtest, ytest)

In [None]:
model.predict(Xtest)
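
The predictions are probabilities of class 1, one per row. A minimal sketch of turning them into class labels and an accuracy score follows; the `probs` and `labels_true` arrays are made-up stand-ins for `model.predict(Xtest)` and `ytest`, and the 0.5 threshold is the usual convention rather than something fixed by the model.

```python
import numpy as np

# Hypothetical stand-in for model.predict(Xtest): shape (n_samples, 1)
probs = np.array([[0.92], [0.13], [0.58], [0.41]])
# Hypothetical stand-in for ytest
labels_true = np.array([1, 0, 0, 0])

# Threshold the probabilities at 0.5 to get hard class labels
labels_pred = (probs.squeeze(axis=1) > 0.5).astype(int)

# Fraction of examples classified correctly
accuracy = np.mean(labels_pred == labels_true)
print(labels_pred, accuracy)
```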

In [None]:
model.variables

In [None]:
w = model.variables[0].numpy().squeeze()
b = model.variables[1].numpy()

In [None]:
import numpy as np

x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = (-b - w[0] * x1) / w[1]

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 3))
ax1.set_title("Training set")
ax1.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring')
ax1.plot(x1, x2, c='b', linewidth=3)

ax2.set_title("Test set")
ax2.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, cmap='spring')
ax2.plot(x1, x2, c='b', linewidth=3)

plt.show()
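
The blue line in these plots is the decision boundary, the set of points where the predicted probability is exactly $0.5$. Since the sigmoid equals $0.5$ only at zero, the boundary satisfies the following, where $w_1$, $w_2$ stand for `w[0]`, `w[1]` in the code and we assume $w_2 \neq 0$:

$$
w_1 x_1 + w_2 x_2 + b = 0
\quad\Longleftrightarrow\quad
x_2 = -\frac{b + w_1 x_1}{w_2}.
$$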

## Multilayer perceptrons

The simplest type of deep learning model is the **multilayer perceptron**, also known as a **feedforward network**. This type of neural network can be viewed as an architecture consisting of layers of mathematical neurons, linked together in a directed acyclic graph.

#### MLP with single hidden layer
A key property of deep learning models is that they are _compositional_ rather than _additive_. Whereas linear regression models (or logistic regression models) increase complexity by adding extra basis functions $\phi_i$ in the expansion

$$
f(\mathbf{x}) = \sum_{i} w_i \phi_i(\mathbf{x}),
$$

deep learning models increase complexity by composing multiple simple functions $\varphi_k$ together:

$$
f(\mathbf{x}) = \varphi_L(\varphi_{L-1}(\ldots\varphi_2(\varphi_1(\mathbf{x}))\ldots )).
$$

The functions $\varphi_k$ are defined to be affine transformations followed by an element-wise activation function. An example is the MLP with a single hidden layer: 

$$
\begin{align}
h_j^{(1)} &= \sigma\left( \sum_{i=1}^D w^{(0)}_{ji}x_i + b_j^{(0)} \right),\qquad j=1,\ldots,n_h,\\
\hat{y} &= \sigma_{out}\left( \sum_{i=1}^{n_h} w^{(1)}_{i}h^{(1)}_i + b^{(1)} \right).
\end{align}
$$

In the above, $\mathbf{x}\in\mathbb{R}^D$ is an example input, $n_h\in\mathbb{N}$ is the number of hidden units in the network, $\sigma, \sigma_{out}:\mathbb{R}\mapsto\mathbb{R}$ are activation functions, $w^{(0)}_{ji}\in\mathbb{R}$ and $w^{(1)}_{i}\in\mathbb{R}$ are weights, and $b_j^{(0)}\in\mathbb{R}$ and $b^{(1)}\in\mathbb{R}$ are biases.

We will usually write the two equations above in the more concise form:

$$
\begin{align}
\mathbf{h}^{(1)} &= \sigma\left( \mathbf{W}^{(0)}\mathbf{x} + \mathbf{b}^{(0)} \right),\\
\hat{y} &= \sigma_{out}\left( \mathbf{w}^{(1)}\mathbf{h}^{(1)} + b^{(1)} \right),
\end{align}
$$

where $\mathbf{x}\in\mathbb{R}^D$, $\mathbf{W}^{(0)}\in\mathbb{R}^{n_h\times D}$, $\mathbf{b}^{(0)}\in\mathbb{R}^{n_h}$, $\mathbf{h}^{(1)}\in\mathbb{R}^{n_h}$, $\mathbf{w}^{(1)}\in\mathbb{R}^{1\times n_h}$, $b^{(1)}\in\mathbb{R}$ and we overload notation with the activation functions $\sigma, \sigma_{out}: \mathbb{R}\mapsto\mathbb{R}$ by applying them element-wise in the above.
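
As a quick check that the index form and the matrix form describe the same computation, the NumPy sketch below evaluates both on random parameters. The sizes `D = 3` and `n_h = 4`, and the choice of `tanh` for $\sigma$ and a sigmoid for $\sigma_{out}$, are arbitrary assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, n_h = 3, 4                      # input dimension and number of hidden units
x = rng.normal(size=D)
W0 = rng.normal(size=(n_h, D))     # W^(0)
b0 = rng.normal(size=n_h)          # b^(0)
w1 = rng.normal(size=(1, n_h))     # w^(1), a row vector
b1 = rng.normal()                  # b^(1), a scalar

# Index form: explicit sums over i for each hidden unit j
h_loop = np.array([
    np.tanh(sum(W0[j, i] * x[i] for i in range(D)) + b0[j])
    for j in range(n_h)
])
y_loop = sigmoid(sum(w1[0, i] * h_loop[i] for i in range(n_h)) + b1)

# Matrix form: the same computation with matrix-vector products
h_mat = np.tanh(W0 @ x + b0)
y_mat = sigmoid(w1 @ h_mat + b1)   # shape (1,)

print(np.allclose(h_loop, h_mat), np.allclose(y_loop, y_mat[0]))
```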

This hidden layer is a type of neural network layer that is often referred to as a **dense** or **fully connected** layer.

#### MLP with multiple hidden layers
More generally, for an MLP with $L$ hidden layers, we have

$$
\begin{align}
\mathbf{h}^{(0)} &:= \mathbf{x}, \\
\mathbf{h}^{(k)} &= \sigma\left( \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)} \right),\qquad k=1,\ldots, L,\\
\hat{y} &= \sigma_{out}\left( \mathbf{w}^{(L)}\mathbf{h}^{(L)} + b^{(L)} \right), 
\end{align}
$$

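where $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ and $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ for $k=0,\ldots,L-1$, $\mathbf{h}^{(k)}\in\mathbb{R}^{n_k}$ for $k=0,\ldots,L$, $\mathbf{w}^{(L)}\in\mathbb{R}^{1\times n_L}$, and $b^{(L)}\in\mathbb{R}$. Here $n_0 := D$ and, for $k\ge 1$, $n_k$ is the number of units in the $k$-th hidden layer.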
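
The recursion translates almost line by line into code. The following is a minimal NumPy sketch of the forward pass, assuming ReLU for $\sigma$, a sigmoid for $\sigma_{out}$, and randomly initialised parameters; the helper names `init_params` and `mlp_forward` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, rng):
    """layer_sizes = [n_0, n_1, ..., n_L]; returns random hidden and output parameters."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    # W^(k) has shape (n_{k+1}, n_k); scaling by 1/sqrt(n_k) keeps activations moderate
    Ws = [rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in)) for n_in, n_out in pairs]
    bs = [np.zeros(n_out) for _, n_out in pairs]
    w_out = rng.normal(scale=1.0 / np.sqrt(layer_sizes[-1]), size=(1, layer_sizes[-1]))
    b_out = 0.0
    return Ws, bs, w_out, b_out

def mlp_forward(x, Ws, bs, w_out, b_out):
    """Apply h^(k) = relu(W^(k-1) h^(k-1) + b^(k-1)) for each layer, then the output unit."""
    h = x  # h^(0) := x
    for W, b in zip(Ws, bs):
        h = relu(W @ h + b)
    return sigmoid(w_out @ h + b_out)[0]

rng = np.random.default_rng(0)
Ws, bs, w_out, b_out = init_params([2, 128, 64], rng)  # D = 2, two hidden layers
y_hat = mlp_forward(np.array([0.5, -1.0]), Ws, bs, w_out, b_out)
print([W.shape for W in Ws], y_hat)
```

The loop makes the compositional structure explicit: each layer consumes only the output of the previous one.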

The hidden layers inside a deep network can be viewed as *learned feature extractors*. The weights of the network learn to encode the data so that successive layers represent progressively more complex or abstract features that are useful for solving the task at hand. This hierarchy of representations is central to the expressive power of deep learning models ([Rumelhart et al. 1986](#Rumelhart86)).

## MLP classifier example

We will demonstrate the use of multiple hidden layers by fitting a classifier to the following 'two moons' dataset.

In [None]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.show()

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

In [None]:
model = Sequential([
    Input(shape=(2,)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.summary()
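
The parameter counts reported by the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper `dense_param_counts` below is a hypothetical name used only for this sketch.

```python
def dense_param_counts(layer_sizes):
    """Parameters per dense layer for sizes [n_0, n_1, ..., n_out]: weights plus biases."""
    return [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Input of size 2, hidden layers of 128 and 64 units, one output unit
counts = dense_param_counts([2, 128, 64, 1])
print(counts, sum(counts))  # [384, 8256, 65] 8705
```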

In [None]:
model.compile(loss='binary_crossentropy', optimizer='sgd')

In [None]:
history = model.fit(Xtrain, ytrain, epochs=2000, batch_size=16, validation_data=(Xtest, ytest), verbose=0) 

In [None]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
nx, ny = 50, 50

x_ = np.linspace(X[:, 0].min(), X[:, 0].max(), nx)
y_ = np.linspace(X[:, 1].min(), X[:, 1].max(), ny)

X_, Y_ = np.meshgrid(x_, y_)

inputs = np.transpose(np.stack([X_, Y_]), [1, 2, 0])
inputs = np.reshape(inputs, (nx * ny, 2))

Z = model(inputs).numpy()
Z = np.reshape(Z, (ny, nx))  # meshgrid arrays have shape (ny, nx)

plt.contour(X_, Y_, Z, levels=[0.5], cmap='RdGy')
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, alpha=0.8)
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, marker='*', alpha=0.8)
plt.show()
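
One detail worth checking when evaluating a model on a grid is the array orientation: `np.meshgrid` returns arrays of shape `(ny, nx)`, so reshaping the predictions to `X_.shape` keeps them aligned with the grid even when `nx != ny`. The sketch below uses a deliberately non-square grid and a hypothetical `toy_score` function in place of the trained model.

```python
import numpy as np

nx, ny = 4, 3
x_ = np.linspace(0.0, 1.0, nx)
y_ = np.linspace(0.0, 2.0, ny)
X_, Y_ = np.meshgrid(x_, y_)  # both arrays have shape (ny, nx)

# Flatten the grid into a list of (x, y) points, one row per grid node
grid_points = np.column_stack([X_.ravel(), Y_.ravel()])

def toy_score(points):
    """Hypothetical stand-in for the model: any function of (x, y) rows."""
    return points[:, 0] + 10.0 * points[:, 1]

# Reshape back to the grid layout so Z[i, j] belongs to (X_[i, j], Y_[i, j])
Z_demo = toy_score(grid_points).reshape(X_.shape)
print(X_.shape, grid_points.shape, Z_demo.shape)
```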

### References

* Robbins, H. and Monro, S. (1951), "A stochastic approximation method", *The Annals of Mathematical Statistics*, **22**, 400–407.
* Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986), "Learning representations by back-propagating errors", *Nature*, **323**, 533–536.