<a href="https://colab.research.google.com/github/lab30041954/ML_IESE_Course/blob/main/%5BML-07%5D%20Deep%20learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [ML-07] Deep learning

## What is deep learning?

**Deep learning**, the current star of machine learning, is based on neural networks. The success of deep learning, not yet fully understood, is attributed to its ability to create improved **representations** of the input data by means of successive layers of features.

Under this perspective, deep learning is a successful approach to **feature engineering**. Why is this needed? Because, in many cases, the available features do not provide an adequate representation of the data. So, replacing the original features by a new set may be useful. At the price of oversimplifying a complex question, the following two examples may help to understand this:

* A **pricing model** for predicting the sale price of a house from features like the square footage of the plot and the house, the location, the number of bedrooms, the existence of a garage, etc. You will probably agree that these are, indeed, the features that determine the price, so they provide a good representation of the data, and a **shallow learning** model, such as a gradient boosting regressor, would be a good approach. No feature engineering is needed here.

* A model for **image classification**. Here, the available features are based on a grid of pixels. But we do not recognize images from specific pixel positions. Recognition is based on **shapes** and **corners**. A shape is created by a collection of pixels, each of them close to the preceding one. And a corner is created by two shapes intersecting in a specific way. This suggests that a neural network with an input layer of pixels, a first hidden layer of shapes, and a second layer of corners can provide a better representation, useful for image classification.

The number of hidden layers in a neural network is called the **depth**. But, although deep learning is based on neural networks with more than one hidden layer, there is more in deep learning than additional layers. In the MLP model as we have seen it, every hidden node is connected to all the nodes of the preceding layer and all the nodes of the following layer. In the deep learning context, these fully-connected layers are called **dense**. But there are other types of layers, and the most glamorous applications of deep learning are based on networks which are not fully-connected.

## Convolutional neural networks

In the classic MLP network, hidden and output layers were dense, that is, every node was connected to all the nodes of the preceding layer. A **convolutional neural network** (CNN) contains other types of layer, such as **convolutional layers** and **max pooling layers**. These layers have low connectivity, and the connections are selected according to a design which takes advantage of the hierarchical pattern in the data. The main idea is that dense layers learn global patterns in their input feature space (*e.g*. in the MNIST data, patterns involving all pixels), while these new layers learn local patterns, *i.e*. patterns found in small 1D or 2D windows of the inputs.

There are two subtypes of convolutional networks:

* **1D convolutional networks** (Conv1D), used with sequence data, such as text. Though they were relevant ten years ago, they have been left aside, due to the superior performance of the **transformer networks** used in **large language models**.

* **2D convolutional networks** (Conv2D), used in image classification.

## Applications to computer vision

In the CNNs used in **image classification**, the input is a 3D tensor, called a **feature map**. The feature map has two spatial axes, called **height** and **width**, and a **depth** axis. For an RGB image, the dimension of the depth axis would be 3, since the image has 3 color channels, red, green, and blue. For grayscale images like those of the MNIST digits, it is just 1 (the gray levels).

The basic innovation is the `Conv2D` layer, which extracts patches from its input feature map, typically with a 3 $\times$ 3 window, applying the same transformation to all of these patches, producing a new output feature map. This output feature map is still a 3D tensor: it has width, height and depth. Its depth can be arbitrary, since the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors like in an RGB input, but for different views of the input, called **filters**. The filters encode specific aspects of the input data. For instance, at a high level, a single filter could be encoding the concept "presence of a face in the input".
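The patch extraction described above can be sketched in plain NumPy for a single filter. This is only an illustration under simplifying assumptions (one input channel, no padding, stride 1, no activation), not the way Keras implements the layer. The function name `conv2d_single_filter` and the random image and weights are made up for the example.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k]
            # The same weights and bias are applied to every patch
            out[i, j] = np.sum(patch * kernel) + bias
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))        # stand-in for one grayscale image
kernel = rng.normal(size=(3, 3))    # stand-in for the learned weights
feature_map = conv2d_single_filter(image, kernel)
print(feature_map.shape)            # (26, 26)
```

A layer with several filters would repeat this calculation with a different kernel and bias for every filter, stacking the results along the depth axis.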

With convolutional networks, practitioners typically use two strategies for extracting more from their data:

* **Transfer learning**. Instead of starting to train your model with random coefficients, you start with those of a model which has been pre-trained with other data. There is plenty of supply of pre-trained models, as we will comment in lecture ML-20.

* **Data augmentation**. Expanding the training data with images obtained by transforming the original images. Typical transformations are: rotation with a random angle, random shift and zoom. Keras offers many resources for that, though we don't have room for them in this short course.
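To make the idea of data augmentation more concrete, here is a minimal NumPy sketch of one of the transformations mentioned, the random shift. It does not use the Keras resources, and the function name `random_shift`, the zero padding and the maximum shift of 3 pixels are assumptions chosen for the illustration.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift a 2D image by a random number of pixels, padding with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Source and destination slices for the region that stays inside the frame
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

rng = np.random.default_rng(0)
image = rng.random((28, 28))        # stand-in for one grayscale image
augmented = random_shift(image, max_shift=3, rng=rng)
print(augmented.shape)              # (28, 28)
```

Every call produces a new variant of the same image, with the same label, which can be added to the training data.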

## CNN models in Keras

Let us use again the MNIST data to illustrate the Keras syntax, now for CNN models. The height and the width are 28, and the depth is 1. We start by reshaping the training and test feature matrices as 4D arrays (one 3D array for every image), so they can provide inputs for a `Conv2D` layer:

```
X_train, X_test = X_train.reshape(60000, 28, 28, 1), X_test.reshape(10000, 28, 28, 1)
```

*Note*. This reshaping may not be needed if you get the MNIST data from other sources than the GitHub repository of this course.
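The reshape can be checked on a small stand-in array. The sketch below assumes that every row of the features matrix stores the 28 $\times$ 28 pixels row by row, and it uses `-1` so that NumPy infers the number of images instead of hard-coding it. The array `X_flat` is hypothetical.

```python
import numpy as np

# Hypothetical stand-in: 5 flattened images with distinct values
X_flat = np.arange(5 * 784).reshape(5, 784)

# -1 lets NumPy infer the number of images
X_img = X_flat.reshape(-1, 28, 28, 1)
print(X_img.shape)                            # (5, 28, 28, 1)

# Pixel (row 1, column 2) of image 0 comes from position 28 * 1 + 2
print(X_img[0, 1, 2, 0] == X_flat[0, 30])     # True
```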

In the Functional API, the network architecture is specified as a sequence of transformations. We have seen this in example ML-15. The following architecture has been taken from a Keras example:

```
input_tensor = Input(shape=(28, 28, 1))
x1 = layers.Conv2D(32, (3, 3), activation='relu')(input_tensor)
x2 = layers.MaxPooling2D((2, 2))(x1)
x3 = layers.Conv2D(64, (3, 3), activation='relu')(x2)
x4 = layers.MaxPooling2D((2, 2))(x3)
x5 = layers.Conv2D(64, (3, 3), activation='relu')(x4)
x6 = layers.Flatten()(x5)
x7 = layers.Dense(64, activation='relu')(x6)
output_tensor = layers.Dense(10, activation='softmax')(x7)
```

A summary of the network can be printed with the method `.summary()`. In this case, we would get the table:

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer_3 (InputLayer)      │ (None, 28, 28, 1)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_6 (Conv2D)               │ (None, 26, 26, 32)     │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_6 (MaxPooling2D)  │ (None, 13, 13, 32)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_7 (Conv2D)               │ (None, 11, 11, 64)     │        18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_7 (MaxPooling2D)  │ (None, 5, 5, 64)       │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_8 (Conv2D)               │ (None, 3, 3, 64)       │        36,928 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten_2 (Flatten)             │ (None, 576)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 64)             │        36,928 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 10)             │           650 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
```

The table indicates, for every layer, the output shape and the number of parameters involved. The first layer is a `Conv2D` layer of 32 nodes. Every node takes data from a 3 $\times$ 3 window (submatrix) of the 28 $\times$ 28 pixel matrix, performing a convolution operation on those data. There are 26 $\times$ 26 such windows, so the output feature map will have height and width 26. The convolution is a linear function of the input data. For a specific node, the coefficients used by the convolution are the same for all windows.

`Conv2D` layers are typically alternated with `MaxPooling2D` layers. These layers also use windows (here 2 $\times$ 2 windows), from which they just extract the maximum value (no parameters needed). In the `MaxPooling2D` layer, the windows are disjoint, so the size of the feature map is halved. Therefore, the output feature map has height and width 13. We have an output feature map for every input feature map.
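As an illustration of the max pooling step, the following NumPy sketch applies disjoint 2 $\times$ 2 windows to a single 2D feature map. The function name `max_pool_2x2` is made up for the example, and dropping an incomplete trailing row or column is an assumption that agrees with the shapes in the summary table above.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Maximum over disjoint 2 x 2 windows of a 2D array."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing odd row or column, if any
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Split every axis into (window index, position within the window)
    blocks = trimmed.reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))
# [[ 5  7]
#  [13 15]]
print(max_pool_2x2(np.zeros((26, 26))).shape)   # (13, 13)
```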

We continue with two `Conv2D` layers, with 64 nodes each, with a `MaxPooling2D` layer in-between. The output is now a tensor of shape `(3, 3, 64)`. The network is closed by a stack of two `Dense` layers. Since the input of the first of these layers has to be one-dimensional, we have to flatten the 3D output of the last `Conv2D` layer to a 1D tensor. This is done with a `Flatten` layer, which involves no calculation, being just a reshape.
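The heights and widths reported in the summary table can be reproduced with a few lines of arithmetic. The sketch below assumes 3 $\times$ 3 convolution windows without padding and disjoint 2 $\times$ 2 pooling windows, as in the architecture above. The helper names `conv_out` and `pool_out` are made up for the illustration.

```python
def conv_out(size, window=3):
    # A window of width `window` fits in size - window + 1 positions
    return size - window + 1

def pool_out(size, window=2):
    # Disjoint windows; an incomplete trailing window is dropped
    return size // window

size = 28
sizes = [size]
for step in (conv_out, pool_out, conv_out, pool_out, conv_out):
    size = step(size)
    sizes.append(size)

print(sizes)               # [28, 26, 13, 11, 5, 3]
print(size * size * 64)    # 576, the length of the flattened tensor
```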

Next, we initialize the class `models.Model()`, specifying the input and the output:

```
clf = models.Model(input_tensor, output_tensor)
```

Now we can apply, as in the MLP model of lecture ML-16, the methods `.compile()`, `.fit()` and `.evaluate()`:

```
clf.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
clf.fit(X_train, y_train, epochs=10)
clf.evaluate(X_test, y_test)
```

Alternatively, you can fit and evaluate the model in one shot, testing after every epoch:

```
clf.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
```


## Applications to sequence data

The second area of success of deep learning is **sequence data**. This is a generic expression including text, time series data, sound and video. You may find many sources about how to apply 1D convolutional networks and **recurrent neural networks** (RNN) to this type of data.

Though this course is also concerned with the applications of deep learning to text data, it does not cover the use of CNN and RNN models for text data, since it is getting obsolete, due to the strong push of the large language models. So, we postpone text data analysis until we have introduced the new toolkit.

## Example - The MNIST data (2nd round)

### Introduction

We come back to the MNIST data of the preceding lecture. In this lecture, we train a deeper model, to explore the extent to which we can improve our previous results.

*Note*. If you run this notebook in Google Colab, you may find things a bit slow. You can speed up everything by changing the runtime to

### Questions

Q1. Rerun the part of the code used in the previous lecture that is needed to obtain the rescaled features matrices `X_train` and `X_test` and the target vectors `y_train` and `y_test`.

Q2. Train an MLP model, to be used as a benchmark.

Q3. Try now with a convolutional neural network. Do we get a real improvement with this *deep* model?

### Q1. Recovering the rescaled MNIST data

With a collection of code lines extracted from those used in the preceding lecture, we get `X_train` and `X_test`, `y_train` and `y_test`.

In [1]:
import numpy as np, pandas as pd
path = 'https://raw.githubusercontent.com/lab30041954/Data/main/'
df = pd.read_csv(path + 'mnist.csv.zip')
y = df['label'].values
X = df.drop(columns='label').values/255
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/7, random_state=0)

### Q2. MLP model

We train again the MLP model of the preceding lecture. Ten epochs will be enough, based on that experience.

In [2]:
from keras import Input, models, layers
input_tensor = Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
mlpclf = models.Model(input_tensor, output_tensor)
mlpclf.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
mlpclf.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test));

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - acc: 0.8428 - loss: 0.5611 - val_acc: 0.9283 - val_loss: 0.2447
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - acc: 0.9406 - loss: 0.2073 - val_acc: 0.9435 - val_loss: 0.1918
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - acc: 0.9548 - loss: 0.1618 - val_acc: 0.9501 - val_loss: 0.1662
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - acc: 0.9612 - loss: 0.1322 - val_acc: 0.9562 - val_loss: 0.1464
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - acc: 0.9666 - loss: 0.1157 - val_acc: 0.9602 - val_loss: 0.1358
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - acc: 0.9710 - loss: 0.0976 - val_acc: 0.9576 - val_loss: 0.1382
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s

### Q3. CNN model

We dig deeper now, exploring **convolutional neural network** (CNN) models, in particular a model based on a 2D convolutional network. Instead of using a 60,000 $\times$ 784 features matrix (a 2D array), we pack the training data as a 4D array, in which the axes are:

* `axis=0`, with dimension 60,000, identifies the inputs, *i.e*. the images.

* `axis=1` and `axis=2`, both with dimension 28, identify the pixel positions.

* `axis=3`, with dimension 1, identifies the **channel**. Since we work with gray scale, there is only one channel (for RGB pictures, we would have three channels).

We reshape the arrays `X_train` and `X_test` as 4D arrays accordingly.


In [3]:
X_train, X_test = X_train.reshape(60000, 28, 28, 1), X_test.reshape(10000, 28, 28, 1)

As for the MLP model, we specify the network architecture as a sequence of transformations. In this example, we propose a standard CNN architecture, used by many authors for these data.

In this case, every input will be a 3D array, with shape `(28,28,1)`. The network starts with a sequence of three `Conv2D` layers, with 32, 64 and 64 nodes, respectively, with two interspersed `MaxPooling2D` layers. After passing these layers, the input has been transformed into a set of 64 smaller 2D arrays (we will see below how small). Then, this set is flattened into a 1D array, which is the input for the rest of the network. This last part is the same as an MLP network with a hidden layer of 64 nodes.

In [4]:
input_tensor = Input(shape=(28, 28, 1))
x1 = layers.Conv2D(32, (3, 3), activation='relu')(input_tensor)
x2 = layers.MaxPooling2D((2, 2))(x1)
x3 = layers.Conv2D(64, (3, 3), activation='relu')(x2)
x4 = layers.MaxPooling2D((2, 2))(x3)
x5 = layers.Conv2D(64, (3, 3), activation='relu')(x4)
x6 = layers.Flatten()(x5)
x7 = layers.Dense(64, activation='relu')(x6)
output_tensor = layers.Dense(10, activation='softmax')(x7)
cnnclf = models.Model(input_tensor, output_tensor)

As for the MLP model, we can print a summary reporting the number of parameters involved in every layer:

* The first `Conv2D` layer has 32 nodes, with specific parameters. At every node, a special type of linear calculation, called **convolution**, takes place. The convolution extracts a single number from a 3 $\times$ 3 submatrix, using nine weights and one bias. This makes a total of 10 parameters for every node, a grand total of 32 $\times$ 10 = 320 parameters. The original 28 $\times$ 28 matrix can be covered by a 26 $\times$ 26 grid of 3 $\times$ 3 submatrices, so the outputs of this layer have shape `(26, 26, 32)`.

* The `MaxPooling2D` layer uses 2 $\times$ 2 submatrices and, instead of a convolution, it just takes the maximum value. So, no parameters are involved, and the outputs have shape `(13, 13, 32)`.

* The third layer has 64 nodes. Every node takes inputs from the 32 nodes of the preceding layer. Nine weights are needed for every input and one bias for the whole set, making a total of 32 $\times$ 9 + 1 = 289 parameters for every node. So, we have 64 $\times$ 289 = 18,496 additional parameters in this layer. Now, the outputs have shape `(11, 11, 64)`.

* The next two layers can be explained in a similar way. In the fifth layer, we have added 36,928 parameters and the outputs have shape `(3, 3, 64)`.

* The sixth layer consists in taking 3 $\times$ 3 $\times$ 64 = 576 inputs and arranging them as a 1D array. No parameters are needed.

* The seventh layer is a dense layer, so we already know how it works. The 64 nodes need 577 parameters each to manage the 576 inputs. So, we have 64 $\times$ 577 = 36,928 additional parameters, and the output has shape `(64,)`.

* The last layer has 10 nodes. Each node takes 64 inputs, involving 65 parameters. So we have 10 $\times$ 65 = 650 additional parameters, and the output is a vector of 10 class probabilities.

At the end of the day, comparing this *deep* model with the previous MLP model, with only 32 hidden nodes, the number of parameters does not increase so much. This is due to the **low connectivity** of the CNN model.

In [5]:
cnnclf.summary()

The rest of the process is the same as in the MLP models. We use here `epochs=10`, to make it shorter. At the end of the first epoch, the model achieves about 97.6% accuracy on the test data. After five epochs, this has been raised to about 99%, which is quite satisfactory, and the improvement stops there.

In [6]:
cnnclf.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
cnnclf.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test));

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 4ms/step - acc: 0.8973 - loss: 0.3301 - val_acc: 0.9757 - val_loss: 0.0735
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - acc: 0.9851 - loss: 0.0469 - val_acc: 0.9878 - val_loss: 0.0397
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - acc: 0.9893 - loss: 0.0332 - val_acc: 0.9882 - val_loss: 0.0386
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - acc: 0.9918 - loss: 0.0253 - val_acc: 0.9864 - val_loss: 0.0459
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - acc: 0.9940 - loss: 0.0177 - val_acc: 0.9892 - val_loss: 0.0355
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - acc: 0.9949 - loss: 0.0155 - val_acc: 0.9909 - val_loss: 0.0317
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s