### Part II. Neural Networks and Deep Learning

# 10. Introduction to Artificial Neural Networks with Keras

Artificial Neural Networks (ANNs) are Machine Learning models inspired by the networks of biological neurons found in our brains.  

Although ANNs were inspired by biological brains, they have gradually evolved to be quite different from their biological counterparts.

### Why is it different this time?

ANNs have been around for quite some time, with efforts going back to a seminal paper by McCulloch and Pitts (1943). However, after a long winter started in the 1960s they seem to be back in town as the cool kid. 

Why would this time be different? According to the author:

**1. Data quantity**: with the huge amount of data now available, ANNs frequently outperform traditional ML techniques on large and complex problems

**2. Computing power**, thanks to Moore's Law, the gaming industry (for GPUs) and cloud computing 

**3. Improved algorithms** (not that different from those of the 1990s, but the small differences had a huge impact)

**4. Theoretical limitations** (e.g. getting stuck in local optima) are **rather rare** in practice, or **not as serious** as previously thought

**5. Virtuous circle** of applications > research + funding > more and better applications

### Logical Computations with Neurons

_**Note**: Skipping the part on biological neurons_

McCulloch and Pitts proposed a simple model of the biological neuron later known as an **artificial neuron** characterized by one or more binary inputs and one binary output. 

Even with this simple neuron, it is possible to build an ANN that computes any logical proposition:

![ANN](images/10.ANN.png)

Assuming an activation threshold of two, we can see from the picture above how this could work. 
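As a minimal sketch of this idea, the neuron below fires when at least two of its incoming connections are active. The exact wiring (duplicated connections for OR, an inhibitory input for "A and not B") is an assumption about how such networks are usually drawn, not a transcription of the picture.

```python
def neuron(*active_connections, threshold=2):
    """Binary artificial neuron: outputs 1 if enough incoming connections are active."""
    return int(sum(active_connections) >= threshold)

def logical_and(a, b):
    # One connection from each input: both must be active to reach the threshold.
    return neuron(a, b)

def logical_or(a, b):
    # Two connections from each input: either input alone reaches the threshold.
    return neuron(a, a, b, b)

def a_and_not_b(a, b):
    # Two excitatory connections from A; B is treated as an inhibitory input
    # that vetoes the output whenever it is active (an assumption of this sketch).
    return 0 if b else neuron(a, a)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

More complex propositions can then be built by feeding the outputs of such neurons into other neurons.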

### The Perceptron

The next step in complexity is the _Perceptron_. Invented in 1957 by Frank Rosenblatt, it is based on an artificial neuron called a _threshold logic unit_ (TLU), also known as a _linear threshold unit_ (LTU). Its inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight.

The TLU computes a weighted sum of its inputs and then applies a step function to that sum:

1. $z = w_1x_1 + w_2x_2 + \cdots + w_nx_n = x^Tw$

2. $h_w(x) = \text{step}(z) = \text{step}(x^Tw)$

![Threshold Logic Unit](images/10.TLU.png)

Two common step functions used in Perceptrons:

1. Heaviside (z) $ = \begin{cases}
0 & z < 0\\
1 & z \ge 0
\end{cases}$

2. Sign (z) $ = \begin{cases}
-1 & z < 0 \\
0 & z = 0 \\
+1 & z > 0
\end{cases}$
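The two step functions and the TLU computation can be sketched in a few lines of plain Python. The weights and inputs are made-up values, and the trailing `1.0` in each input vector is assumed to play the role of a bias input.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def sign(z):
    # Returns -1, 0 or +1 (booleans behave as 0/1 in arithmetic).
    return (z > 0) - (z < 0)

def tlu(x, w, step=heaviside):
    """Threshold logic unit: weighted sum of the inputs followed by a step function."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return step(z)

w = [0.5, -1.0, 0.25]                  # hypothetical weights (last one acts as the bias)
print(tlu([2.0, 1.0, 1.0], w))         # z = 1.0 - 1.0 + 0.25 = 0.25
print(tlu([0.0, 1.0, 1.0], w, sign))   # z = 0.0 - 1.0 + 0.25 = -0.75
```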

A single TLU can be used for linear binary classification (by thresholding a linear combination of the inputs, similarly to Logistic Regression or a linear SVM).  
A Perceptron is simply a single layer of TLUs, with each TLU connected to all the inputs. Having multiple output TLUs makes it possible to perform multioutput classification. 

It is then possible to compute the outputs of a layer of artificial neurons for several instances at once:

$h_{W,b} (X) = \phi (XW + b)$

$X$ = matrix of input features  

$W$ = connection weights of the neurons (except the bias terms). It has one row per input neuron and one column per artificial neuron. 

$b$ = vector containing all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.

$\phi$ = activation function. In our case (artificial neuron = TLU), the activation function is a step function.
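To make the shapes concrete, here is a small pure-Python sketch of $\phi(XW + b)$ for a batch of instances. All the numbers are made up for illustration: 3 instances, 2 input features and 3 TLUs.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def layer_output(X, W, b, phi=heaviside):
    """Compute phi(XW + b): one row per instance, one column per neuron."""
    n_neurons = len(b)
    return [
        [phi(sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j])
         for j in range(n_neurons)]
        for x in X
    ]

X = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 1.0]]
W = [[0.5, -1.0, 0.0],    # weights leaving input neuron 0 (one column per TLU)
     [1.0, 0.5, -2.0]]    # weights leaving input neuron 1
b = [-2.0, 0.25, 1.0]     # one bias term per TLU

outputs = layer_output(X, W, b)
print(outputs)
```

With NumPy arrays, the same computation would typically be written as a single matrix expression, `X @ W + b`, followed by an element-wise step function.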

#### Training

The Perceptron is then trained by reinforcing the connections that help **reduce the error**. 

More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.

More formally:

$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j)x_i$

* $w_{i,j}$ is the connection weight between the $i^{th}$ input neuron and the $j^{th}$ output neuron

* $x_i$ is the $i^{th}$ input value of the current training instance

* $\hat{y}_j$ is the output of the $j^{th}$ output neuron for the current training instance

* $y_j$ is the target output of the $j^{th}$ output neuron for the current training instance

* $\eta$ is the learning rate
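The learning rule above can be sketched for a single output neuron as follows. The dataset (logical AND), the learning rate of 1.0, the zero initialization and the number of epochs are all choices made for this illustration.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def predict(w, b, x):
    return heaviside(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_epochs):
        for x, target in zip(X, y):               # one training instance at a time
            error = target - predict(w, b, x)     # (y - y_hat), zero when correct
            w = [w_i + eta * error * x_i for w_i, x_i in zip(w, x)]
            b += eta * error                      # the bias input is always 1
    return w, b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]                                  # logical AND, linearly separable
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the parameters stop moving.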

**Note**: the decision boundary of each output neuron is linear, so a Perceptron cannot learn complex patterns. However, as long as the training instances are linearly separable, the algorithm will converge to a solution. 

#### Limitations

The Perceptron is a fairly rudimentary ANN architecture: for example, it is incapable of solving the exclusive or (XOR) classification problem. 

It turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons: the resulting ANN is known as **Multilayer Perceptron (MLP)**. 
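As an illustration, a two-layer network of TLUs can compute XOR. The weights and thresholds below are hand-picked for this sketch (they are not learned, and other choices would work equally well).

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: two TLUs looking at the same inputs with different thresholds.
    h_and = heaviside(x1 + x2 - 1.5)   # fires only if both inputs are 1
    h_or = heaviside(x1 + x2 - 0.5)    # fires if at least one input is 1
    # Output TLU: "OR but not AND", i.e. exclusive or.
    return heaviside(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))
```

No single TLU can produce this truth table, because the two classes are not linearly separable in the input space; the hidden layer re-encodes the inputs so that the output neuron faces a linearly separable problem.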

### Multilayer Perceptron and Backpropagation

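An MLP is composed of one *input* layer, one or more *hidden* layers, and one final *output* layer. Layers close to the input are called *lower* layers, and those close to the output are called *upper* layers. 

**Note**: so far we are only dealing with feedforward neural networks (FNNs), where the signal flows in one direction only, from the inputs to the outputs. 

Backpropagation comes into play as a training method for MLPs. In short, it computes the gradient of the network’s error with regard to every single model parameter in two passes through the network (one forward, one backward), and then uses these gradients to perform a Gradient Descent step. This process is repeated until the network converges. 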

**Note**: More specifically, automatically computing gradients is called automatic differentiation, or **autodiff**. The autodiff technique used by backpropagation is called **reverse-mode autodiff**. 

Let's run through the algorithm:

1. **Forward pass**: each mini-batch of training instances is passed to the network’s input layer > hidden layers > output layers

2. **Error measurement**: network error computed using a loss function that compares the desired output and the actual output of the network

3. **Error attribution**: compute how much each output connection contributed to the error

4. **Reverse pass**: measure how much of these error contributions came from each connection in the layer below, working backward until the input layer is reached

5. **Gradient Descent**: tweak all the connection weights in the network, using the error gradients computed earlier
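The five steps above can be sketched for a tiny network in plain Python. Everything here is an assumption made for illustration: a 2-3-1 architecture, sigmoid activations, a squared-error loss, a single made-up training instance and a learning rate of 0.5. It is a hand-written reverse pass, not how a real framework implements autodiff.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass; returns the hidden activations and the network output."""
    W1, b1, w2, b2 = params                      # W1 has one row of weights per hidden unit
    h = [sigmoid(sum(w * x_i for w, x_i in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * h_k for w, h_k in zip(w2, h)) + b2)
    return h, y_hat

def loss(params, x, y):
    return 0.5 * (forward(params, x)[1] - y) ** 2

def backward(params, x, y):
    """Reverse pass: propagate the error gradient from the output back to the inputs."""
    W1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)            # dLoss/dz at the output
    delta_hid = [delta_out * w * h_k * (1 - h_k)             # dLoss/dz at each hidden unit
                 for w, h_k in zip(w2, h)]
    grad_W1 = [[d * x_i for x_i in x] for d in delta_hid]
    grad_w2 = [delta_out * h_k for h_k in h]
    return grad_W1, delta_hid, grad_w2, delta_out

def sgd_step(params, grads, eta=0.5):
    W1, b1, w2, b2 = params
    gW1, gb1, gw2, gb2 = grads
    return ([[w - eta * g for w, g in zip(r, gr)] for r, gr in zip(W1, gW1)],
            [b - eta * g for b, g in zip(b1, gb1)],
            [w - eta * g for w, g in zip(w2, gw2)],
            b2 - eta * gb2)

random.seed(0)
# Random initial weights break the symmetry between hidden units.
params = ([[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
          [0.0] * 3,
          [random.uniform(-1, 1) for _ in range(3)],
          0.0)
x, y = [0.5, -1.2], 1.0

loss_before = loss(params, x, y)
for _ in range(100):
    params = sgd_step(params, backward(params, x, y))
loss_after = loss(params, x, y)
print(loss_before, loss_after)
```

A useful sanity check for any hand-written reverse pass is to compare one of its gradients with a finite-difference estimate of the same derivative.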

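**Note**: it is important to initialize the connection weights of all the hidden layers randomly (_break the symmetry_), otherwise all the neurons in a given layer would remain identical throughout training. 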

In order for this to work, David Rumelhart, Geoffrey Hinton, and Ronald Williams replaced the step function with the logistic (sigmoid) function: 

$\displaystyle \sigma(z) = \frac{1}{1+ e^{-z}}$

This was essential for Gradient Descent: the step function contains only flat segments, so there is no gradient to work with, whereas the logistic function has a well-defined nonzero derivative everywhere. Secondly, chaining non-linear functions over many layers allows the network to theoretically approximate any continuous function.

#### Regression MLPs

We can use MLPs for univariate regression tasks (predicting a single value, with a single output neuron) or multivariate regression (predicting several values at once, with one output neuron per output dimension). 

A typical regression MLP architecture consists of the following **hyperparameters**:
* n. Input neurons
* n. Hidden layers
* n. Neurons per hidden layer
* n. Output neurons
* Hidden activation function
* Output activation function
* Loss function

#### Classification MLPs

For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. 

MLP classification **hyperparameters**:
* n. Input neurons
* n. Hidden layers
* n. Output neurons
* Output layer activation (logistic / softmax)
* Loss function
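The two output activations mentioned above can be sketched in plain Python. The input values are made up, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the definition.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shifted = [z - max(logits) for z in logits]   # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Binary classification: one output neuron, estimated probability of the positive class.
print(logistic(0.0))

# Multiclass classification: one output neuron per class, probabilities sum to 1.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```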

### Implementing MLPs with Keras

[Keras](https://github.com/keras-team/keras) is a high-level Deep Learning API. On the backend, it relies on one of three popular open source Deep Learning libraries: TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano.

**Note**: for simplicity we will use the TensorFlow implementation, `tf.keras`, without any TensorFlow-specific features. Therefore, the code in this chapter can be used on any backend implementation (_usually just by changing the imports_).

First, let's make sure that [TensorFlow](https://www.tensorflow.org/) is up and running (installing it in a virtualenv is recommended):

In [21]:
import tensorflow as tf

In [22]:
from tensorflow import keras

In [24]:
keras.__version__

'2.2.4-tf'

### Building an Image Classifier Using the Sequential API

Let's use the Fashion MNIST dataset:

In [27]:
# loading dataset
fashion_mnist = keras.datasets.fashion_mnist

In [28]:
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()



In [29]:
# check shape
X_train_full.shape

(60000, 28, 28)

In [30]:
# check type
X_train_full.dtype

dtype('uint8')

Let's create a validation set. We also scale the pixel intensities down to the 0-1 range, since we will be training with Gradient Descent:

In [31]:
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0

In [32]:
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [34]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

In [35]:
class_names[y_train[0]]

'Coat'

Time to build our Neural Network!

In [36]:
# create Sequential model (simplest NN)
model = keras.models.Sequential()

In [37]:
# first layer - no parameters - just preprocessing (flattens each 28x28 image into a 1D array)
model.add(keras.layers.Flatten(input_shape=[28, 28]))

In [38]:
# first Dense hidden layer with 300 neurons
model.add(keras.layers.Dense(300, activation="relu"))



In [39]:
# second with 100 neurons
model.add(keras.layers.Dense(100, activation="relu"))

In [40]:
# output layer with 10 neurons - softmax act function since we have exclusive classes
model.add(keras.layers.Dense(10, activation="softmax"))

In [41]:
# model summary (default layer names)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
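The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, neuron) pair plus one bias term per neuron. A short sketch, with the layer sizes taken from the model above:

```python
def dense_params(n_inputs, n_neurons):
    # One weight per (input, neuron) pair, plus one bias term per neuron.
    return n_inputs * n_neurons + n_neurons

layer_sizes = [28 * 28, 300, 100, 10]   # flattened input, then the three Dense layers
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts, sum(counts))
```

The first hidden layer alone accounts for most of the parameters, which gives the model a lot of flexibility but also increases the risk of overfitting.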


In order to _break symmetry_, all the connection weights have been initialized randomly. We can change the initialization method by setting a specific `kernel_initializer`.  

#### Compiling the model

After a model is created, you must call its `compile()` method to specify the loss function and the optimizer to use:

In [42]:
model.compile(loss="sparse_categorical_crossentropy", # labels are sparse class indices (0 to 9), not one-hot vectors
              optimizer="sgd", # stochastic gradient descent - generally we would also set a learning rate (default = 0.01)
              metrics=["accuracy"]) # because it's a classification problem
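To see what "sparse" means here, the sketch below computes the cross-entropy for a single instance in both label formats. The probabilities are made up, and this is only an illustration of the formula: the actual Keras implementation also handles batching and numerical clipping.

```python
import math

def sparse_categorical_crossentropy(class_index, probs):
    # The label is a single integer: pick the predicted probability of that class.
    return -math.log(probs[class_index])

def categorical_crossentropy(one_hot, probs):
    # The label is a one-hot vector: only the term of the true class is nonzero.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]          # made-up softmax output for 3 classes
sparse_loss = sparse_categorical_crossentropy(1, probs)
one_hot_loss = categorical_crossentropy([0, 1, 0], probs)
print(sparse_loss, one_hot_loss)
```

Both functions return the same value; the sparse variant simply avoids having to build one-hot target vectors.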

#### Training and Evaluation

In [43]:
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

Train on 55000 samples, validate on 5000 samples
Epoch 1/30
...
Epoch 30/30


Accuracy **89.30%**, not bad! 

**Note**: if the training set was skewed, it would be useful to set the `class_weight` argument when calling the `fit()` method, which would give a larger weight to underrepresented classes and a lower weight to overrepresented classes.
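One common heuristic for choosing such weights is to make them inversely proportional to class frequency. The sketch below uses that heuristic on made-up labels; the helper name `balanced_class_weights` is hypothetical and not part of Keras.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (average weight per sample is 1)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 80 + [1] * 15 + [2] * 5   # made-up, heavily skewed labels
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary maps class indices to weights, which is the form `fit()` expects for its `class_weight` argument.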

Now, if we are not satisfied with the performance of our model, there are many hyperparameters we could tune:

* Learning rate
* Optimizer
* Number of layers
* Number of neurons per layer
* Activation functions
* Batch size

etc.

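Once we are satisfied with the **validation** accuracy, we can evaluate the model on the **test** set to estimate the generalization error. 

**Note**: in the code shown here `X_test` is never divided by 255.0, unlike the training and validation sets, which most likely explains the very large loss reported below. Scaling the test set the same way before calling `evaluate()` should bring the loss in line with the validation loss.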

In [45]:
model.evaluate(X_test, y_test)



[77.16121282806397, 0.8327]

### Make predictions

In [47]:
X_new = X_test[:3]

In [48]:
y_proba = model.predict(X_new)

In [49]:
y_proba.round(2)

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

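Fairly easy to interpret: for the first instance, for example, the model estimates a (rounded) 100% probability for the class at index 9, i.e. the tenth class ("Ankle boot"), and 0% for every other class. 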
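To turn these probabilities into class names, we can take the index of the largest value in each row. A small sketch in plain Python, with the rounded probabilities copied from the output above:

```python
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Rounded probabilities copied from the prediction output.
y_proba = [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
           [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]]

# Index of the highest probability in each row (argmax).
y_pred = [max(range(len(row)), key=row.__getitem__) for row in y_proba]
predicted_names = [class_names[i] for i in y_pred]
print(y_pred, predicted_names)
```

On the actual NumPy array returned by `predict()`, the same indices can be obtained with `y_proba.argmax(axis=1)`.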

### Regression MLP using Sequential API

Let’s switch to the California housing problem and tackle it using a regression neural network. 

In [51]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()



In [52]:
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)

In [53]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

In [54]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
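Note that the scaler is fitted on the training set only and then reused for the validation and test sets. The sketch below reproduces the idea for a single made-up feature column in plain Python; it illustrates the principle and is not a replacement for `StandardScaler`.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    # Population standard deviation (ddof=0), which is what StandardScaler uses.
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    return [(v - mu) / sigma for v in column]

train = [2.0, 4.0, 6.0, 8.0]     # made-up values of one feature
valid = [5.0, 10.0]

mu, sigma = fit_scaler(train)                # statistics come from the training set only
train_scaled = transform(train, mu, sigma)
valid_scaled = transform(valid, mu, sigma)   # reuse them, never refit on validation/test data
print(mu, sigma)
print(train_scaled, valid_scaled)
```

Refitting the scaler on the validation or test set would leak information about those sets into the preprocessing and make the evaluation less trustworthy.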

The main difference here is that the output layer has a single neuron (we only predict one value) and uses no activation function, and that the loss function is the mean squared error.

Since the dataset is quite noisy, we just use a single hidden layer with fewer neurons than before, to avoid overfitting:

In [55]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu",
                       input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])

model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3] # pretend these are new instances
y_pred = model.predict(X_new)

Train on 11610 samples, validate on 3870 samples
Epoch 1/20
...
Epoch 20/20


### Building Complex Models Using the Functional API

Although the Sequential API is quite easy to use, we may need to build models with more complex topologies. Here comes the Functional API to the rescue.

One such example is the _wide and deep_ NN, which connects all or part of the inputs directly to the output layer, allowing our network to learn both deep patterns and simple patterns. 

In [56]:
# input object
input_ = keras.layers.Input(shape=X_train.shape[1:]) 
# dense layer with 30 neurons
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
# dense layer with 30 neurons (input is hidden1)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
# concatenate layer (input + hidden2 output)
concat = keras.layers.Concatenate()([input_, hidden2])
# output layer (single layer & no activation function)
output = keras.layers.Dense(1)(concat)
# create the model specifying inputs and outputs
model = keras.Model(inputs=[input_], outputs=[output])

The next step here is to understand how to send a subset of features to the wide net, and a subset to the deep net. To do this, we will use multiple inputs.  

In [57]:
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

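Let's now compile the model and create our input matrices. The wide input receives features 0 to 4 and the deep input receives features 2 to 7, so the two subsets overlap: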

In [58]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))
X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]
history = model.fit((X_train_A, X_train_B), y_train, epochs=20,
                    validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))

Train on 11610 samples, validate on 3870 samples
Epoch 1/20
...
Epoch 20/20


Adding extra outputs to the NN is also quite easy:

In [59]:
# same as above
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])

# connect to appropriate layers
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)
model = keras.Model(inputs=[input_A, input_B],
                    outputs=[output, aux_output])

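Each output needs its own loss function. Since we care much more about the main output than about the auxiliary output (which is mostly there for regularization), we give the main output's loss a much greater weight: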

In [60]:
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
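With these settings, the loss that gets minimized is the weighted sum of the two individual losses. A quick sketch of the arithmetic in plain Python, where the targets and predictions are made-up numbers:

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]          # made-up targets, shared by both outputs
main_pred = [1.1, 1.9, 3.2]       # made-up predictions of the main output
aux_pred = [1.5, 2.5, 2.0]        # made-up predictions of the auxiliary output

main_loss = mse(y_true, main_pred)
aux_loss = mse(y_true, aux_pred)
total_loss = 0.9 * main_loss + 0.1 * aux_loss   # same weights as loss_weights
print(main_loss, aux_loss, total_loss)
```

Because of the 0.9 / 0.1 split, a poor auxiliary output only moves the total loss a little, so training remains focused on the main output.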

Now when we train the model, we need to provide labels for each output. In this example, the main output and the auxiliary output should try to predict the same thing, so they should use the same labels.

In [61]:
history = model.fit([X_train_A, X_train_B], [y_train, y_train], epochs=20,
                    validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))

Train on 11610 samples, validate on 3870 samples
Epoch 1/20
...
Epoch 20/20


Evaluation using total loss and individual losses:

In [62]:
total_loss, main_loss, aux_loss = model.evaluate(
    [X_test_A, X_test_B], [y_test, y_test])



### Subclassing API for Dynamic Models 

Some models involve loops, varying shapes, conditional branching, and other dynamic behaviors. 

To do that, we can subclass the `Model` class, create the layers we need in the constructor, and use them to perform the computations we want in the `call()` method.

In [66]:
class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs) # handles standard args (e.g., name)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)

    def call(self, inputs):
        input_A, input_B = inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return main_output, aux_output

model = WideAndDeepModel()

This approach makes our model much more customizable, but also harder to inspect: since the architecture is hidden inside the `call()` method, Keras cannot easily analyze, save or clone it. We can also combine several models to form more complex models.  