# Introduction to Neural Networks: Datasets, Models, and Losses 
Author: Pierre Nugues

## Dataset

We extract the counts of letters per chapter and the counts of _A_ from the *Salammbô* novel by Flaubert. There are 15 chapters in total.

In [None]:
import numpy as np

X = np.array(
    [[36961],
     [43621],
     [15694],
     [36231],
     [29945],
     [40588],
     [75255],
     [37709],
     [30899],
     [25486],
     [37497],
     [40398],
     [74105],
     [76725],
     [18317]])

y = np.array(
    [2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643, 2126, 1784, 2641, 2766,
     5047, 5312, 1215])

## Visualizing the Dataset

In [None]:
import matplotlib.pyplot as plt

fr = plt.scatter(X, y, c='b', marker='x')
plt.title("Salammbô")
plt.xlabel("Letter count")
plt.ylabel("$A$ count")
plt.show()

## Models

We fit three polynomial models, of degrees 1, 8, and 9, to the data.

In [None]:
# The polynomial degrees we will test and their color
x = X.flatten()
degrees_col = [(1, 'r-'), (8, 'b-'), (9, 'g-')]

f, axes = plt.subplots(len(degrees_col), sharex=True, sharey=True)
x_vals = np.linspace(min(x), max(x), 1000)

for idx, (degree, color) in enumerate(degrees_col):
    axes[idx].scatter(x, y, marker='x')
    # We find the fitting coefficients
    z = np.polyfit(x, y, degree)
    # We use them to create a polynomial
    p = np.poly1d(z)
    legend = axes[idx].plot(x_vals, p(x_vals), color)
plt.show()

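As a rule, simpler models are preferable: the polynomials of degree 8 and 9 pass closer to the training points, but they tend to oscillate between them and generalize poorly to unseen data.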
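To make this rule concrete, one can compare the error on the points used for the fit with the error on a held-out chapter. The following is a minimal sketch, assuming leave-one-out validation and `np.polynomial.Polynomial.fit` (used here instead of `np.polyfit` because it rescales the inputs internally); the names `letters`, `a_counts`, `train_mse`, and `loo_mse` are illustrative.

```python
import numpy as np

# Letter counts and counts of A per chapter (same data as above)
letters = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255, 37709,
                    30899, 25486, 37497, 40398, 74105, 76725, 18317], dtype=float)
a_counts = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643,
                     2126, 1784, 2641, 2766, 5047, 5312, 1215], dtype=float)


def train_mse(degree):
    """Mean squared error on the points used for the fit."""
    p = np.polynomial.Polynomial.fit(letters, a_counts, degree)
    return np.mean((p(letters) - a_counts) ** 2)


def loo_mse(degree):
    """Leave-one-out error: fit on 14 chapters, test on the remaining one."""
    errors = []
    for i in range(len(letters)):
        keep = np.arange(len(letters)) != i
        p = np.polynomial.Polynomial.fit(letters[keep], a_counts[keep], degree)
        errors.append((p(letters[i]) - a_counts[i]) ** 2)
    return np.mean(errors)


for degree in (1, 8, 9):
    print(degree, train_mse(degree), loo_mse(degree))
```

The training error can only decrease as the degree grows, since a lower-degree polynomial is a special case of a higher-degree one. The held-out error is what shows whether the extra flexibility actually helps.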

### Using the Keras Engine to Carry out a Linear Regression

We create the architecture. The model has an intercept (a bias) by default.

In [None]:
from tensorflow.keras import models
from tensorflow.keras.layers import Dense

model = models.Sequential([
    Dense(1, input_dim=1, activation='linear')])

model.summary()

We use the mean squared error as the loss and RMSprop, a variant of stochastic gradient descent, to find the parameters.

In [None]:
model.compile(optimizer='rmsprop', loss='mse', metrics=['mse'])
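For reference, the mean squared error minimized here can be written as follows. The symbols are notation introduced for this note: $w$ is the weight, $b$ the bias, $n = 15$ the number of chapters, $x_i$ the letter count and $y_i$ the count of $A$ in chapter $i$.

$$
\text{MSE}(w, b) = \frac{1}{n}\sum_{i=1}^{n} (w x_i + b - y_i)^2
$$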

We fit the two parameters

In [None]:
history = model.fit(x, y, batch_size=1, epochs=200, verbose=0)

### Visualizing the Loss

We plot the loss at each epoch of the training process.

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.title('Training loss')
plt.legend()
plt.show()

### The Model

The model is linear and consists of two parameters

In [None]:
salammbo_model = model.get_weights()
salammbo_model
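As a point of comparison for the weights returned by Keras, the two parameters of a least-squares line also have a closed-form solution. The sketch below uses `np.linalg.lstsq`; the names `letters`, `a_counts`, `w_ls`, and `b_ls` are illustrative, and the values found by the optimizer above may differ from this solution depending on how far the training has converged.

```python
import numpy as np

# Letter counts and counts of A per chapter (same data as above)
letters = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255, 37709,
                    30899, 25486, 37497, 40398, 74105, 76725, 18317], dtype=float)
a_counts = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643,
                     2126, 1784, 2641, 2766, 5047, 5312, 1215], dtype=float)

# Design matrix with a column of ones for the intercept (bias)
A = np.column_stack([letters, np.ones_like(letters)])
(w_ls, b_ls), *_ = np.linalg.lstsq(A, a_counts, rcond=None)

print('weight:', w_ls, 'bias:', b_ls)
```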

### Visualizing the Final Model

In [None]:
fr = plt.scatter(X, y, c='b', marker='x')
plt.plot(x, model.predict(x), color='red')
plt.title("Salammbô")
plt.xlabel("Letter count")
plt.ylabel("$A$ count")
plt.show()

## TensorFlow's Gradients
TensorFlow can compute gradients automatically. See https://www.tensorflow.org/guide/autodiff

### Functions

Let $f(x, y) = 4x^2 - y + 9$.

We have:
$$\begin{array}{ccc}
\frac{\partial f}{\partial x} &=& 8x\\
\frac{\partial f}{\partial y} &=& -1\\
\end{array}$$

We define the variables and we assign them values

In [None]:
import tensorflow as tf

x = tf.Variable(1.)
y = tf.Variable(2.)

We define the function and record it with `GradientTape`

In [None]:
with tf.GradientTape(persistent=True) as tape:
    f = 4.0 * x**2  - y + 9.0

And we compute the gradients with respect to $x$ and $y$ with `gradient()`

In [None]:
df_dx = tape.gradient(f, x)
df_dx

In [None]:
df_dy = tape.gradient(f, y)
df_dy
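A quick way to sanity-check these values without TensorFlow is a central finite difference. This is only a numerical approximation, not what `GradientTape` does internally; the helper `f`, the point `(x0, y0)`, and the step `h` are illustrative.

```python
def f(x, y):
    return 4.0 * x**2 - y + 9.0


x0, y0 = 1.0, 2.0   # same values as the TensorFlow variables above
h = 1e-5            # small step, chosen arbitrarily

# Central differences approximate the two partial derivatives
df_dx_num = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
df_dy_num = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(df_dx_num, 8 * x0)   # numerical estimate next to the analytic 8x
print(df_dy_num, -1.0)     # numerical estimate next to the analytic -1
```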

### Matrices

We can apply the partial differentiation to a linear combination involving a matrix and two vectors: $\mathbf{W} \cdot \mathbf{x} + \mathbf{b}$.

Let us model two outputs, $y_1$ and $y_2$ with:

$$
\begin{bmatrix}
\hat{y}_1\\
\hat{y}_2
\end{bmatrix}
=
\begin{bmatrix}
w_{1,1}&w_{1,2}\\
w_{2,1}&w_{2,2}
\end{bmatrix}
\cdot
\begin{bmatrix}
x_1\\
x_2
\end{bmatrix}
+
\begin{bmatrix}
b_1\\
b_2
\end{bmatrix}
$$
and create the matrix and the vectors.

In [None]:
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2, 1)))
x = tf.random.uniform((2, 1))

y_hat = W @ x + b

print('W:', W)
print('b:', b)
print('x:', x)
y_hat
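Written out row by row, the product gives $\hat{y}_1 = w_{1,1}x_1 + w_{1,2}x_2 + b_1$ and $\hat{y}_2 = w_{2,1}x_1 + w_{2,2}x_2 + b_2$. The NumPy sketch below checks this on arbitrary values; the names `W_np`, `x_np`, and `b_np` are illustrative and the numbers are made up.

```python
import numpy as np

W_np = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
x_np = np.array([[1.0],
                 [2.0]])
b_np = np.zeros((2, 1))

# Matrix form, as in the TensorFlow cell above
y_hat_np = W_np @ x_np + b_np

# The same result computed one row at a time
y1 = W_np[0, 0] * x_np[0, 0] + W_np[0, 1] * x_np[1, 0] + b_np[0, 0]
y2 = W_np[1, 0] * x_np[0, 0] + W_np[1, 1] * x_np[1, 0] + b_np[1, 0]

print(y_hat_np.ravel(), [y1, y2])
```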

`GradientTape` records operations for which we can compute partial derivatives.

In [None]:
with tf.GradientTape(persistent=True) as tape:
    y_hat = W @ x + b

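When its target is a tensor rather than a scalar, `gradient()` differentiates the sum of the target's elements. Applied to $\hat{\mathbf{y}}$, it therefore returns the partial derivatives of $\hat{y}_1 + \hat{y}_2$ with respect to $w_{i,j}$ (first tensor) and with respect to $b_i$ (second tensor):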

In [None]:
tape.gradient(y_hat, [W, b])

The partial derivatives with respect to the matrix are arranged this way, where $s = \hat{y}_1 + \hat{y}_2$ is the sum of the outputs that `gradient()` differentiates:
$$
\begin{bmatrix}
\frac{\partial s}{\partial w_{1,1}}&\frac{\partial s}{\partial w_{1,2}}\\
\frac{\partial s}{\partial w_{2,1}}&\frac{\partial s}{\partial w_{2,2}}\\
\end{bmatrix}
$$
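Since $\hat{y}_i = \sum_j w_{i,j} x_j + b_i$, the derivative of the sum $\hat{y}_1 + \hat{y}_2$ with respect to $w_{i,j}$ is $x_j$, and its derivative with respect to $b_i$ is 1. Each row of the first tensor should therefore contain $x_1, x_2$, and the second tensor should contain ones. The NumPy sketch below checks this with finite differences on made-up values; `total`, `W_np`, `x_np`, and `b_np` are illustrative names.

```python
import numpy as np

W_np = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
x_np = np.array([[1.5],
                 [-2.0]])
b_np = np.zeros((2, 1))


def total(W, b):
    """Sum of the two outputs of W @ x + b."""
    return np.sum(W @ x_np + b)


h = 1e-6

# Perturb one weight at a time
grad_W = np.zeros_like(W_np)
for i in range(2):
    for j in range(2):
        dW = np.zeros_like(W_np)
        dW[i, j] = h
        grad_W[i, j] = (total(W_np + dW, b_np) - total(W_np - dW, b_np)) / (2 * h)

# Perturb one bias at a time
grad_b = np.zeros_like(b_np)
for i in range(2):
    db = np.zeros_like(b_np)
    db[i, 0] = h
    grad_b[i, 0] = (total(W_np, b_np + db) - total(W_np, b_np - db)) / (2 * h)

print(grad_W)   # each row approximates x transposed
print(grad_b)   # approximates a column of ones
```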

### Gradient of the Loss

We can now add a loss in the form of a mean squared error and compute its gradient. Using the initial _Salammbô_ example:

In [None]:
X = np.array(
    [[36961],
     [43621],
     [15694],
     [36231],
     [29945],
     [40588],
     [75255],
     [37709],
     [30899],
     [25486],
     [37497],
     [40398],
     [74105],
     [76725],
     [18317]])
y = np.array(
    [2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643, 2126, 1784, 2641, 2766,
     5047, 5312, 1215])

Our model was

In [None]:
salammbo_model

Let us compute the gradient of a model that has not yet reached the minimum

In [None]:
W = tf.Variable([[0.02]])
b = tf.Variable(-2.)

We compute the gradient

In [None]:
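with tf.GradientTape(persistent=True) as tape:
    y_hat = X @ W + b
    # y has shape (15,): we reshape it to (15, 1) to match y_hat,
    # otherwise the difference would broadcast to a (15, 15) matrix
    loss = tf.reduce_mean((y_hat - y.reshape(-1, 1))**2)
tape.gradient(loss, [W, b])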
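For the model $\hat{y}_i = w x_i + b$, the gradient of the mean squared error $L$ has a simple closed form: $\frac{\partial L}{\partial w} = \frac{2}{n}\sum_i (\hat{y}_i - y_i)\,x_i$ and $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (\hat{y}_i - y_i)$. The NumPy sketch below evaluates these formulas at $w = 0.02$, $b = -2$ as an independent check; `mse_gradient`, `letters`, and `a_counts` are illustrative names.

```python
import numpy as np

# Letter counts and counts of A per chapter (same data as above)
letters = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255, 37709,
                    30899, 25486, 37497, 40398, 74105, 76725, 18317], dtype=float)
a_counts = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643,
                     2126, 1784, 2641, 2766, 5047, 5312, 1215], dtype=float)


def mse_gradient(w, b):
    """Gradient of mean((w * x + b - y)**2) with respect to w and b."""
    residuals = w * letters + b - a_counts
    grad_w = 2.0 * np.mean(residuals * letters)
    grad_b = 2.0 * np.mean(residuals)
    return grad_w, grad_b


print(mse_gradient(0.02, -2.0))
```

Both components are negative at this point, which indicates that increasing $w$ and $b$ would lower the loss.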

Now let us compute the gradient with the model that has reached the minimum (`salammbo_model`).

In [None]:
(W, b) = salammbo_model
W = tf.Variable(W)
b = tf.Variable(b)

In [None]:
print(W)
b

In [None]:
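with tf.GradientTape(persistent=True) as tape:
    y_hat = X @ W + b
    # y has shape (15,): we reshape it to (15, 1) to match y_hat,
    # otherwise the difference would broadcast to a (15, 15) matrix
    loss = tf.reduce_mean((y_hat - y.reshape(-1, 1))**2)
tape.gradient(loss, [W, b])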
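At the exact least-squares solution, both partial derivatives of the mean squared error vanish. Weights obtained by iterative training only approximate this point, so the gradient computed from them is not expected to be exactly zero. The NumPy sketch below illustrates the ideal case; it assumes the closed-form solution from `np.linalg.lstsq` rather than the Keras weights, and all names are illustrative.

```python
import numpy as np

# Letter counts and counts of A per chapter (same data as above)
letters = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255, 37709,
                    30899, 25486, 37497, 40398, 74105, 76725, 18317], dtype=float)
a_counts = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643,
                     2126, 1784, 2641, 2766, 5047, 5312, 1215], dtype=float)


def mse_gradient(w, b):
    """Gradient of mean((w * x + b - y)**2) with respect to w and b."""
    residuals = w * letters + b - a_counts
    return 2.0 * np.mean(residuals * letters), 2.0 * np.mean(residuals)


# Closed-form least-squares line
A = np.column_stack([letters, np.ones_like(letters)])
(w_ls, b_ls), *_ = np.linalg.lstsq(A, a_counts, rcond=None)

g_start = mse_gradient(0.02, -2.0)   # far from the minimum
g_min = mse_gradient(w_ls, b_ls)     # at the least-squares solution
print(g_start)
print(g_min)   # tiny compared with g_start
```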

Adding the weight updates, we could easily implement a gradient descent. This is left as an exercise.
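One possible way to approach this exercise is sketched below in plain NumPy, using the closed-form gradient of the mean squared error instead of `GradientTape`. It assumes that both counts are first divided by their maximum so that a single learning rate works for the weight and the bias; the learning rate of 0.5 and the 2,000 steps are arbitrary choices, and all names are illustrative.

```python
import numpy as np

# Letter counts and counts of A per chapter (same data as above)
letters = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255, 37709,
                    30899, 25486, 37497, 40398, 74105, 76725, 18317], dtype=float)
a_counts = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062, 2643,
                     2126, 1784, 2641, 2766, 5047, 5312, 1215], dtype=float)

# Rescale both variables so that their maximum is 1; without this, the raw
# letter counts would require an extremely small learning rate
x_scale, y_scale = letters.max(), a_counts.max()
xs, ys = letters / x_scale, a_counts / y_scale

w, b = 0.0, 0.0
learning_rate = 0.5
for step in range(2000):
    residuals = w * xs + b - ys
    grad_w = 2.0 * np.mean(residuals * xs)
    grad_b = 2.0 * np.mean(residuals)
    # Weight update: move against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Convert the parameters back to the original units
w_orig = w * y_scale / x_scale
b_orig = b * y_scale
print('weight:', w_orig, 'bias:', b_orig)
```

The result can be compared with a direct least-squares fit such as `np.polyfit(letters, a_counts, 1)`.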