In [None]:
def info(s, x):
    print(f"{s}:\t\tshape: {x.shape}\n{x}")
    
info('train images', train_images)

train images:		shape: (60000, 28, 28)
[[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 ...

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]


In [12]:
info('train labels', train_labels)

train labels:		shape: (60000,)
[5 0 4 ... 5 6 8]


In [13]:
# info('test images', test_images)
# info('test labels', test_labels)

info('train label [0] =', train_labels[0])
info('train image [0]', train_images[0])

train label [0] = 5
train image [0]:		shape: (28, 28)
[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255 247 127   0   0   0   0]
 [  0   0   0   0   0   0   0   0  30  36  94 154 170 253 253 253 253 253 225 172 253 242 195  64   0   0   0   0]
 [  0   0   0   0   0   0   0  49 238 253 253 253 253 253 253 253 253 251  93  82  82  56  39   0   0   0   0   0]
 [  0   0   0   0   0   0 

## Workflow

1. **Prepare** the training data;

2. **Train** the network to associate images and labels; 

3. **Predict** (also called "infer") on test images that the network hasn't seen before.

## Layers

- The core building blocks are **layers** of artificial neurons/units;

- Each layer is a **data-processing module (transformation)**;

- Layers transform input data into more abstract **representations**;

- Layers are chained together in a **dense** network;

- A DL model is a succession of increasingly refined data filters (the layers).

# The network

- In this case, the network is a sequence of two dense layers. Dense means **fully-connected**: the output of each unit in layer $k$ is linked to all units in layer $k + 1$;

- The input in the diagram emanates from a non-computational layer; <img src="images/tikz12.png" style="float:right">

# The network

- This input layer is not a conventional layer because it does not perform any computation.

- The units in the first layer have a **Rectified Linear Unit (ReLU)** activation function (aka positive part):

$$ \mathrm{relu}(x) = \max(0, x) $$

- The second (and last) layer is a 10-way **softmax** layer: it returns an array of 10 scores, each in the **interval** $\mathbf{[0, 1]}$, with all scores *summing to 1*;

- The scores can be interpreted as a **probability distribution**: score $i$ is the probability that the current digit image belongs to digit class $i$.
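
The two activation functions can be sketched in a few lines of NumPy. This is only an illustration with made-up input values, not the Keras implementation; the helper names `relu` and `softmax` are defined here for the example.

```python
import numpy as np

def relu(x):
    """Positive part: negative entries become 0, the others pass through."""
    return np.maximum(0, x)

def softmax(z):
    """Map a vector of raw scores to values in [0, 1] that sum to 1."""
    shifted = z - np.max(z)      # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

hidden = relu(np.array([-2.0, 0.0, 3.5]))
print(hidden)                    # the negative input is clipped to 0

# Ten made-up raw scores, one per digit class.
raw_scores = np.array([1.0, 2.0, 0.5, 0.1, 0.0, 3.0, 0.2, 0.3, 0.1, 0.4])
scores = softmax(raw_scores)
print(scores.sum())              # the scores sum to 1 (up to rounding)
print(scores.argmax())           # index of the largest score = predicted class
```

The class with the highest score is the network's prediction; here the largest raw score sits at index 5, so softmax also puts the highest probability there.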

In [18]:
from tensorflow.keras import models
from tensorflow.keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
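
To see what these two layers compute, here is a rough NumPy sketch of one forward pass, together with a count of the trainable parameters. The weights and the input image are random stand-ins (Keras uses its own initialisers), so only the shapes and the parameter count are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random weights, for illustration only (not the Keras initialisation).
W1, b1 = rng.normal(size=(784, 512)) * 0.01, np.zeros(512)
W2, b2 = rng.normal(size=(512, 10)) * 0.01, np.zeros(10)

x = rng.random((1, 784))                 # one flattened 28 x 28 image (random stand-in)

h = np.maximum(0, x @ W1 + b1)           # dense layer 1 + relu    -> shape (1, 512)
z = h @ W2 + b2                          # dense layer 2 (pre-softmax) -> shape (1, 10)
e = np.exp(z - z.max(axis=1, keepdims=True))
scores = e / e.sum(axis=1, keepdims=True)  # softmax over the 10 classes

# Every input is connected to every unit, plus one bias per unit.
n_params = W1.size + b1.size + W2.size + b2.size
print(h.shape, scores.shape, n_params)
```

The count is 784 * 512 + 512 = 401920 parameters for the first layer and 512 * 10 + 10 = 5130 for the second, 407050 in total.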

## Loss, optimiser and performance metrics

- The **loss function**, the *categorical cross-entropy*, measures how far the network's predictions are from the true labels of the training data (lower is better).

- The **optimiser**: the network update algorithm. The update depends on the bundle of input data (a *mini-batch*) and the loss function.

- **Metrics** monitor training and testing. Here we will only monitor **accuracy** (the fraction of the images that are correctly classified).
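
As a rough illustration of the loss and the metric, both can be written directly in NumPy. This is a simplified sketch on made-up numbers, assuming one-hot labels and predicted scores that already sum to 1; the function names are chosen for this example and the Keras versions differ in detail.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)           # guard against log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose highest score is on the true class."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

# Three samples, three classes (made-up numbers).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

The third sample is misclassified (the highest score is on class 0, the true class is 2), so the accuracy is 2/3, and its low score on the true class is what contributes most to the loss.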

In [19]:
network.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

## Preparing the image data

- Data is generally preprocessed (reshaped, normalised).

- The MNIST training images are stored in an array of shape (60000, 28, 28): 60000 pixel maps of 28 x 28 pixels, of type `uint8`, with values in {0, 1, ..., 255}.

- The array is reshaped into an array of **shape** (60000, 28 * 28) and **dtype** `float32`, then divided by 255 so that the values lie between 0 and 1.

In [20]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
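
The same two steps are easier to inspect on a toy array. The sketch below uses two fake 3 x 3 "images" as stand-ins for the 28 x 28 MNIST digits; the values are made up.

```python
import numpy as np

# Two fake 3 x 3 "images" with uint8 pixel values.
images = np.array([[[0, 128, 255], [0, 0, 0], [255, 255, 255]],
                   [[10, 20, 30], [40, 50, 60], [70, 80, 90]]], dtype=np.uint8)

flat = images.reshape((2, 3 * 3))        # each image becomes one row of 9 values
flat = flat.astype('float32') / 255      # rescale from [0, 255] to [0, 1]

print(flat.shape, flat.dtype)
print(flat.min(), flat.max())
```

Reshaping only changes how the values are laid out (row after row of each image); the division by 255 is what brings them into the range [0, 1].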

The labels are categorically encoded (Keras comes with a nifty function to do just that).

Instead of an integer, each label becomes a vector of length 10 with a 1 at the *index* of the class and zeros elsewhere (*one-hot* encoding).

In [21]:
from tensorflow.keras.utils import to_categorical

print(train_labels[0:4])
train_labels = to_categorical(train_labels) 
print(train_labels[0:4])

test_labels = to_categorical(test_labels)

[5 0 4 1]
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
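
The encoding is easy to reproduce and to invert by hand. The sketch below uses `np.eye` as a stand-in for `to_categorical` (it is not how Keras implements it) and `argmax` to get the integer labels back.

```python
import numpy as np

labels = np.array([5, 0, 4, 1])

# Row i of the 10 x 10 identity matrix is the one-hot vector for class i.
one_hot = np.eye(10, dtype='float32')[labels]

# The position of the 1 in each row gives the original integer label.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(recovered)
```

The same `argmax` trick turns the 10 softmax scores of the network into a predicted digit.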


## Before we train: what's our starting point?

- Let's see what our model can do without any training.

- We `evaluate` it on the test set, as we will do later.

In [23]:
test_loss, test_acc = network.evaluate(test_images, test_labels)

print()
print('test_acc:', test_acc)
print('1 would have been perfect...ground breaking! o(〒﹏〒)o')


test_acc: 0.0997999981045723
1 would have been perfect...ground breaking! o(〒﹏〒)o
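
An accuracy of about 0.10 is roughly what blind guessing gives with 10 classes. A small simulation on synthetic labels (not MNIST, and assuming uniformly random guesses, which is only an approximation of what an untrained network does) shows the same level:

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_classes = 10_000, 10
true_labels = rng.integers(0, n_classes, size=n_samples)   # synthetic labels
guesses = rng.integers(0, n_classes, size=n_samples)       # uniform random guesses

chance_acc = float(np.mean(true_labels == guesses))
print(chance_acc)   # close to 1 / n_classes
```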


## Training 

- All set to train! The model is *fitted* to its training data (= **training**). In Keras: `model.fit()`.

- Two quantities will be displayed during training:
    - the **loss** of the network over the training data;
    - the **accuracy** of the network over the training data.

- An **epoch** is a complete pass over the training set.

- Each pass is split into **mini-batches** (groups of samples processed in one go); the `batch_size` argument sets the number of samples per mini-batch.
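
With 60000 training images, a batch size of 128 and 5 epochs, the number of weight updates can be worked out directly. The sketch assumes the last, smaller mini-batch of each epoch is kept rather than dropped.

```python
import math

n_samples = 60000
batch_size = 128
epochs = 5

steps_per_epoch = math.ceil(n_samples / batch_size)               # mini-batches per epoch
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size  # size of the final one
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

That is 469 mini-batches per epoch (the last one holding only 96 images) and 2345 updates over the whole run.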

In [24]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f1fd0659f10>

## Testing

- Training accuracy is about 98.9% after five training epochs...

- How does the model perform on the test set?

In [25]:
test_loss, test_acc = network.evaluate(test_images, test_labels)

print()
print('test_acc:', test_acc)
print('Yay! ٩(◕‿◕｡)۶')


test_acc: 0.9797000288963318
Yay! ٩(◕‿◕｡)۶


## Evaluation

- Test-set accuracy (~98%) is lower than the training-set accuracy (~98.9%).

- This is to be expected! (You do better on a test you prepared than on the exam you haven't seen!)

- This gap between *training accuracy* and *test accuracy* will become very important: it is a symptom of **overfitting**.

- Machine learning models tend to perform worse on new data than on their training data. (But, honestly, who doesn't??)

- Intuitively, the network *overadapts* to the training data, creating an excessively precise representation of it, which can perform poorly on new data.
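
To put numbers on the gap, here is the arithmetic with the figures quoted above (the training accuracy is the approximate 98.9%, the test accuracy is rounded from the evaluation output):

```python
train_acc = 0.989     # approximate training accuracy after five epochs
test_acc = 0.9797     # test accuracy, rounded from the evaluation output
n_test = 10000        # number of test images

gap = train_acc - test_acc                      # difference in accuracy
test_errors = round((1 - test_acc) * n_test)    # misclassified test images

print(round(gap, 4), test_errors)
```

The gap is a little under one percentage point, which corresponds to 203 misclassified digits out of the 10000 test images.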

## What's next?

- Clarify what is really going on behind the scenes;

- Tensors & tensor operations recap;

- Gradient descent – how the network actually trains.