# Basic dense (fully connected) neural network models



### Basic simple neural network model

In the previous notebooks, we saw how to implement simple neural network models with **just the output layer** for logistic, softmax (multiclass) and linear regression problems: this output layer had **only one node** (logistic and linear regression) which performed both the linear combination of input variables + bias and the sigmoid/linear activation:

<img src="https://drive.google.com/uc?export=view&id=1PRc719uT1kOUuCMbpHML2sEk7qp6UJnm">

(Softmax regression is slightly different: the single output layer has as many nodes as there are classes, each calculating the linear combination of input variables and the softmax activation).


### Basic dense neural network model

We are now building a **neural network model**, by adding **one hidden layer** (not deep) with **u nodes** (units):

<img src="https://drive.google.com/uc?export=view&id=1QROz9pFnMoqTeqrFbele8pFz8qXDSckq">

There are several hyperparameters:

- the **number of hidden nodes** (number of units in the hidden layer)
- the **type of activation function** in the hidden layer
- the **output activation function**
- the **loss function** (for backpropagation)
- the **optimizer** (for gradient descent)

By stacking together more than one hidden/intermediate layer (additional hyperparameter), we can then build **deep neural networks**.
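
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass through one hidden layer followed by a single sigmoid output node. The sizes (4 input features, `u = 8` hidden units), the random weights and the helper names `relu`, `sigmoid` and `forward` are assumptions made for illustration only; nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, u = 4, 8                      # illustrative sizes: 4 inputs, u = 8 hidden units
X = rng.normal(size=(5, n_features))      # 5 made-up samples, not real data

# randomly initialised (untrained) parameters, hypothetical values
W1 = rng.normal(scale=0.1, size=(n_features, u))   # hidden layer weights
b1 = np.zeros(u)                                    # hidden layer biases
W2 = rng.normal(scale=0.1, size=(u, 1))            # output layer weights
b2 = np.zeros(1)                                    # output layer bias

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    # each hidden node: linear combination of the inputs + bias, then activation
    hidden = relu(X @ W1 + b1)
    # output node: linear combination of the hidden outputs + bias, then sigmoid
    return sigmoid(hidden @ W2 + b2)

probs = forward(X)
print(probs.shape)   # (5, 1): one probability per sample
```

Stacking more hidden layers would simply repeat the `activation(previous @ W + b)` step once per layer.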

## Loading libraries and setting the random seed

First of all, we load some necessary libraries; then we set up the random seed to ensure reproducibility of results. Since TensorFlow uses its own internal random generator, we need to fix both the general seed (via NumPy `seed()`) and the TensorFlow seed (via `set_seed()`); the `tf.keras.utils.set_random_seed()` helper used below sets both, together with the Python `random` seed.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

  # Set the seed using keras.utils.set_random_seed. This will set:
  # 1) `numpy` seed
  # 2) `tensorflow` random seed
  # 3) `python` random seed
tf.keras.utils.set_random_seed(15)

  # This will make TensorFlow ops as deterministic as possible, but it will
  # affect the overall performance, so it's not enabled by default.
  # `enable_op_determinism()` is introduced in TensorFlow 2.9.
tf.config.experimental.enable_op_determinism()

## Get the data

We get the usual `iris` dataset:

In [None]:
import sklearn.datasets

(features, target) = sklearn.datasets.load_iris(return_X_y = True) ## feature names are not returned
print(features.shape)
print(target.shape)

This is a three-class problem, and for the logistic regression example we need to binarise it:

In [None]:
unique, counts = np.unique(target, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
#updating class labels. To make things difficult we put together old classes 0 and 1
#in a new class (non-virginica) and keep old class 2 (virginica) as new class 1.
#For an easier problem, put together versicolor and virginica and keep setosa by itself
j = 100 ## split: 50 for setosa vs versicolor+virginica, 100 for setosa+versicolor vs virginica
binary_target = np.copy(target)
binary_target[0:j] = 0
binary_target[j:150] = 1

In [None]:
unique, counts = np.unique(binary_target, return_counts=True)
print(np.asarray((unique, counts)).T)

## Training and validation sets

In [None]:
#we want to have the same proportion of classes in both train and validation sets
from sklearn.model_selection import StratifiedShuffleSplit

#building a StratifiedShuffleSplit object (sss among friends) with 20% data
#assigned to validation set (here called "test")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

#the .split() method returns (an iterable over) two lists which can be
#used to index the samples that go into train and validation sets
for train_index, val_index in sss.split(features, binary_target):
    features_train = features[train_index, :]
    features_val   = features[val_index, :]
    target_train   = binary_target[train_index]
    target_val     = binary_target[val_index]

#let's print some shapes to get an idea of the resulting data structure
print(features_train.shape)
print(features_val.shape)
print(target_train.shape)
print(target_val.shape)

In [None]:
from collections import Counter

print(Counter(target_train))
print(Counter(target_val))

In [None]:
target_train

## Build the neural network model

We now build our neural network for binary classification: it comprises one intermediate layer and one output layer that performs the final classification (actually, the calculation of the probability of belonging to class `1` given the input features: $P(y=1|x)$).

The necessary steps are:

- model set-up (define the hyperparameters)
- model architecture
- compiling (putting together the configuration -model set-up- and the architecture)
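
The loss function is one of the hyperparameters fixed during model set-up; for binary classification the usual choice is binary cross-entropy. As a rough sketch of what that loss measures (the function name `binary_crossentropy`, the `eps` clipping value and the toy numbers are assumptions for illustration, not the Keras implementation):

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the samples."""
    y_true = np.asarray(y_true, dtype=float)
    # clip probabilities to avoid log(0); eps is an arbitrary small constant
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# toy labels and predicted probabilities P(y=1|x), made up for illustration
confident = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = binary_crossentropy([1, 0, 1], [0.6, 0.4, 0.5])
print(confident, hesitant)   # the more confident correct predictions give the lower loss
```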

In [None]:
## # Configuration options
input_shape = (features.shape[1],) ## tuple that specifies the number of features
hidden_nodes = 8
hidden_activation = 'relu'
output_activation = 'sigmoid'
loss_function = 'binary_crossentropy'
optimizer_used = 'SGD' ##stochastic gradient descent
num_epochs = 100

In [None]:
#we are building a "sequential" model, meaning that the data will
#flow like INPUT -> ELABORATION -> OUTPUT. In particular, we will
#not have any loops, i.e. our output will never be recycled as
#input for the first layer
from keras.models import Sequential

#a "dense" layer is a layer where all the data coming in are connected
#to all nodes (fully connected).
from keras.layers import Dense, Input

# binary classification neural network (one hidden layer) in Keras
model = Sequential()
model.add(Input(input_shape))
model.add(Dense(units=hidden_nodes, activation=hidden_activation))
model.add(Dense(units=1, activation=output_activation))

#the model is declared, but we still need to compile it to actually
#build all the data structures
model.compile(optimizer=optimizer_used, loss=loss_function)

In [None]:
print(model.summary())

The `summary()` method of the Keras model tells us that there are 49 parameters to train:
- w1, w2, w3, w4, b (weights for the 4 features + bias term) for each of the 8 nodes in the hidden layer ($\rightarrow$ (4+1) x 8 = 40 parameters);
- w1 - w8 + b: weights for the results from the 8 intermediate nodes ("new features") + bias term, for the output layer ($\rightarrow$ 8 + 1 = 9 parameters)
- layer 1 (40 parameters) + layer 2 (9 parameters) = 49 total parameters
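
The same arithmetic can be checked with a few lines of Python. This is a sketch that assumes only dense layers with bias terms; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count trainable parameters of a stack of dense layers with biases.

    layer_sizes = [n_inputs, units_layer_1, ..., units_output_layer]
    Each layer contributes (inputs + 1 bias) x units parameters.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 4 features -> 8 hidden nodes -> 1 output node
print(dense_param_count([4, 8, 1]))   # 49
```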

## Train the neural network

In [None]:
import time

start = time.time()
history = model.fit(features_train, target_train, epochs=num_epochs, validation_data=(features_val, target_val), verbose=0)
end = time.time()
print(end - start)

In [None]:
#function to take a look at losses evolution
def plot_loss_history(h, title):
    plt.plot(h.history['loss'], label = "Train loss")
    plt.plot(h.history['val_loss'], label = "Validation loss")
    plt.xlabel('Epochs')
    plt.title(title)
    plt.legend()
    plt.show()

In [None]:
plot_loss_history(history, 'Logistic (' + str(num_epochs) + ' epochs)')

## Model evaluation

A model is only useful if it can predict new, unseen data. This is why the validation set was set aside and not used to fit the model weights.

Here we look at the following ways to evaluate our neural network model:

- error-rate / accuracy
- confusion matrix

To calculate the accuracy of the trained neural network model for binary classification, we first need to get the **predictions made on the validation set**.
Luckily, it's very easy to apply a trained model to new data (the validation set) via the [predict() method](https://keras.io/api/models/model_training_apis/#predict-method).

We can thus get our prediction for the iris flowers (see below the first 5 predictions):

In [None]:
predictions = model.predict(features_val)
print(predictions[0:5])

We plot the histogram of predictions (in the interval [0,1]), alongside the **0.5 classification threshold**

In [None]:
plt.hist(predictions)
plt.axvline(0.5, color='red', linestyle='dashed', linewidth=1)
plt.show()

#### Error rate / accuracy

In [None]:
predicted_class = np.where(predictions > 0.5, "virginica", "non-virginica")
target_class = np.where(target_val == 1, "virginica", "non-virginica")
target_class = target_class.reshape(len(target_class),1)

results = target_class == predicted_class

In [None]:
errors = np.invert(results).sum()
correct_predictions = results.sum()
total_n_predictions = len(results)

print("Error rate:", round(errors/total_n_predictions, 3))
print("Accuracy:", round(correct_predictions/total_n_predictions, 3))

#### Confusion matrix
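
A confusion matrix counts, for each true class (rows), how many samples were assigned to each predicted class (columns). The sketch below builds one by hand on a handful of made-up labels; the helper name `confusion_counts` and the toy labels are assumptions for illustration, and the notebook itself relies on scikit-learn for this.

```python
import numpy as np

def confusion_counts(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    mat = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        mat[index[t], index[p]] += 1
    return mat

# toy labels, not taken from the iris predictions
y_true = ["non-virginica", "non-virginica", "virginica", "virginica", "virginica"]
y_pred = ["non-virginica", "virginica", "virginica", "virginica", "non-virginica"]

mat = confusion_counts(y_true, y_pred, ["non-virginica", "virginica"])
print(mat)

# correct predictions sit on the diagonal, so accuracy = trace / total
accuracy = np.trace(mat) / mat.sum()
print(accuracy)
```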

In [None]:
from sklearn.metrics import confusion_matrix

labels = ['non-virginica','virginica']
con_mat_df = confusion_matrix( y_true = target_class, y_pred = predicted_class, labels=labels) #true are rows, predicted are columns
pd.DataFrame(
    con_mat_df,
    index = ['true:'+x for x in labels],
    columns = ['pred:'+x for x in labels])

## What if we want to add more layers?

This is very simple: you just need to specify one (or more) additional layer(s): see the example below.

For any additional layer, you also need to specify the number of units and the activation function (additional hyperparameters to fine-tune; the number of layers is itself another hyperparameter).

In [None]:
input_shape = (features_train.shape[1],) ## tuple that specifies the number of features
hidden_nodes_1 = 8
hidden_nodes_2 = 5
hidden_activation_1 = 'relu'
hidden_activation_2 = 'relu'
output_activation = 'sigmoid'
loss_function = 'binary_crossentropy'
optimizer_used = 'rmsprop' ## Root Mean Square Propagation
num_epochs = 100

In [None]:
## resetting the seed (new model graph by tensorflow: seed needs to be specified again)
def reset_random_seeds(nseed, enable_determinism=False):
    tf.keras.utils.set_random_seed(nseed)
    #np.random.seed(n2)
    if enable_determinism:
        tf.config.experimental.enable_op_determinism()

reset_random_seeds(19)

# binary classification shallow neural network model in Keras
model = Sequential()
model.add(tf.keras.Input(input_shape))
model.add(Dense(units=hidden_nodes_1, activation=hidden_activation_1))
model.add(Dense(units=hidden_nodes_2, activation=hidden_activation_2))
model.add(Dense(1, activation=output_activation))

#the model is declared, but we still need to compile it to actually
#build all the data structures
model.compile(optimizer=optimizer_used, loss=loss_function)

In [None]:
print(model.summary())

<u>Parameters breakdown</u>:
- layer 1: 8 x (4 + 1) = 40
- layer 2: 5 x (8 + 1) = 45
- layer 3: 1 x (5 + 1) = 6
- 40 + 45 + 6 = 91 total parameters

In [None]:
start = time.time()
history = model.fit(features_train, target_train, epochs=num_epochs, validation_data=(features_val, target_val), verbose=0)
end = time.time()
print(end - start)

It usually takes longer to train a larger neural network, and this does not necessarily translate into better model performance.

In [None]:
predictions = model.predict(features_val)

predicted_class = np.where(predictions > 0.5, "virginica", "non-virginica")
target_class = np.where(target_val == 1, "virginica", "non-virginica")
target_class = target_class.reshape(len(target_class),1)

labels = ['non-virginica','virginica']
con_mat_df = confusion_matrix( y_true = target_class, y_pred = predicted_class, labels=labels) #true are rows, predicted are columns
pd.DataFrame(
    con_mat_df,
    index = ['true:'+x for x in labels],
    columns = ['pred:'+x for x in labels])

## Neural network model for multiclass classification

We have here:

- the **input layer** (conventionally known as `layer 0`: the data)
- the **intermediate layer** (the core of the neural network)
- the **output layer**: in this case, we have here as many units as there are classes in the problem (each unit calculates the probability of belonging to one class, i.e. a unique combination of the input data, with specific weights: all probabilities are then normalised by the softmax function)

![multiclass_network](multiclass_classification_network.png)

The output layer must use the softmax activation function; in the intermediate layer, you are free to choose whichever activation function you prefer (except the linear function which, as you should know by now, wouldn't work!).
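
As a reminder of what the softmax activation does, here is a minimal NumPy sketch: it turns the raw scores of the output units into probabilities that sum to 1 for each sample. The function name `softmax` and the example scores are assumptions for illustration; subtracting the row maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    # subtract the per-row maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# made-up raw scores for 2 samples and 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs.sum(axis=1))            # each row sums to 1
print(np.argmax(probs, axis=1))     # index of the most probable class per sample
```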

We use the original target vector (three classes):

In [None]:
unique, counts = np.unique(target, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
#the "utils" subpackage is very useful, take a look at it when you have time
from tensorflow.keras.utils import to_categorical

#converting to categorical
target_multi_cat = tf.keras.utils.to_categorical(target)

#let's take a look
print(target_multi_cat[0:5,:])

We will now:

1. split the data: training and validation sets
2. data preprocessing
3. set the hyperparameters: we now need to use the `softmax` activation function, and the `categorical_crossentropy` loss function
4. build the model
5. compile the model
6. train the model
7. evaluate the model
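
The `categorical_crossentropy` loss named in step 3 compares the one-hot encoded target with the softmax probabilities. A rough NumPy sketch follows; the helper names `one_hot` and `categorical_crossentropy`, the `eps` clipping value and the toy numbers are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    # row i of the identity matrix is the one-hot vector of class i
    return np.eye(num_classes)[np.asarray(labels)]

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    # only the log-probability of the true class contributes for each sample
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

y = one_hot([0, 2], num_classes=3)       # two made-up samples: class 0 and class 2
p = np.array([[0.7, 0.2, 0.1],           # made-up softmax outputs
              [0.1, 0.3, 0.6]])

loss = categorical_crossentropy(y, p)
print(y)
print(loss)   # equals -(log(0.7) + log(0.6)) / 2
```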

In [None]:
## 1) SPLITTING THE DATA
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

#the .split() method returns (an iterable over) two lists which can be
#used to index the samples that go into train and validation sets
for train_index, val_index in sss.split(features, target):  # stratify on the three original classes
    features_train = features[train_index, :]
    features_val   = features[val_index, :]
    target_train   = target_multi_cat[train_index,:]
    target_val     = target_multi_cat[val_index,:]

In [None]:
#number of classes per split
print('\nClasses in train set:')
print(target_train.sum(axis=0))
print('\nClasses in validation set:')
print(target_val.sum(axis=0))

In [None]:
## 2) DATA PREPROCESSING: FEATURE NORMALIZATION

#calculating features averages and std devs
avg = features_train.mean(axis=0)  # one mean per feature (column)
std = features_train.std(axis=0)   # one standard deviation per feature

#standardizing the data (mean 0, std 1)
features_train = (features_train - avg)/std
features_val = (features_val - avg)/std

In [None]:
## 3) SET THE HYPERPARAMETERS
input_shape = (features_train.shape[1],) ## tuple that specifies the number of features
hidden_nodes_1 = 8
hidden_nodes_2 = 5
hidden_activation_1 = 'relu'
hidden_activation_2 = 'relu'
num_classes = target_val.shape[1] ## number of columns in OHE target array
output_activation = 'softmax'
loss_function = 'categorical_crossentropy'
optimizer_used = 'rmsprop' ## Root Mean Square Propagation
num_epochs = 100

In [None]:
## 4) BUILD THE NEURAL NETWORK MODEL

## resetting the seed (new model graph by tensorflow: seed needs to be specified again)
def reset_random_seeds(nseed, enable_determinism=False):
    tf.keras.utils.set_random_seed(nseed)
    #np.random.seed(n2)
    if enable_determinism:
        tf.config.experimental.enable_op_determinism()

reset_random_seeds(19)

# multiclass classification neural network model in Keras
model = Sequential()
model.add(tf.keras.Input(input_shape))
model.add(Dense(units=hidden_nodes_1, activation=hidden_activation_1))
model.add(Dense(units=hidden_nodes_2, activation=hidden_activation_2))
model.add(Dense(num_classes, activation=output_activation))

In [None]:
## 5) COMPILE THE MODEL
model.compile(optimizer=optimizer_used, loss=loss_function)

In [None]:
print(model.summary())

<u>Parameters breakdown</u>:
- layer 1: 8 x (4 + 1) = 40
- layer 2: 5 x (8 + 1) = 45
- layer 3: 3 x (5 + 1) = 18
- 40 + 45 + 18 = 103 total parameters

In [None]:
## 6) TRAINING THE NEURAL NETWORK

start = time.time()
history = model.fit(features_train, target_train, epochs=num_epochs, validation_data=(features_val, target_val), verbose=0)
end = time.time()
print(end - start)

In [None]:
## 7) MODEL EVALUATION

predictions = model.predict(features_val)
print(predictions[0:5,:])

predicted_classes = np.argmax(predictions,axis=1)
predicted_classes = predicted_classes.reshape(len(predicted_classes),1)

target_classes = target[val_index]

con_mat_df = confusion_matrix(target_classes, predicted_classes, labels = [0,1,2])
print("\nConfusion matrix:")
print(con_mat_df)

## Neural network model for linear regression

We use the same diabetes dataset as in the previous notebook on the one-unit neural network model for linear regression.

The dataset has 442 samples (i.e. patients). For each sample we have 10 features (age, sex, blood pressure and so forth). We need to predict a score describing the disease progression (
[details here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)).
The features have already been normalized (`scaled=True`)

In [None]:
## 0) GET THE DATA

from sklearn.datasets import load_diabetes

#here "scaled=True" is already the default, we write it explicitly just to
#let you know that it's an existing option
X, y = load_diabetes(return_X_y=True, scaled=True)

print("First 5 rows of the features\n")
print(X[0:5,:])

print("\nFirst 5 target values\n")
y[0:5]

After obtaining the data, we do the usual routine:

1. split the data
2. set the hyperparameters (linear regression: loss function is MSE, activation function in the output node is linear)
3. build the model
4. compile the model
5. train the model
6. evaluate the model
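
The MSE loss from step 2, and the coefficient of determination used later for evaluation, can be sketched by hand as follows. The helper names `mse` and `r2` and the toy numbers are assumptions for illustration; `np.ravel` is used so that column-shaped predictions of shape `(n, 1)`, such as those returned by `predict()`, line up with a flat target vector.

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# toy values: flat targets, column-shaped predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([[1.0], [2.0], [4.0]])

print(mse(y_true, y_pred))
print(r2(y_true, y_pred))
```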

This time, we build a smaller neural network with two intermediate layers.

! Caveat: this is most likely an overkill, i.e. an overparameterized / overcomplicated model for a simple problem that could be better solved in some other ways !

In [None]:
## 1) DATA SPLIT

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=X[:,1])
print("Train features shape: " + str(X_train.shape))
print("Validation features shape: " + str(X_val.shape))

In [None]:
import tensorflow.keras as keras

## 2) SET THE HYPERPARAMETERS
input_shape = (X_train.shape[1],) ## tuple that specifies the number of features
hidden_nodes_1 = 3
hidden_nodes_2 = 3
hidden_activation_1 = 'sigmoid'
hidden_activation_2 = 'relu'
output_activation = 'linear'
loss_function = 'mse'
optimizer_used = keras.optimizers.SGD(learning_rate=0.001)
num_epochs = 100

In [None]:
## 3) BUILD THE NEURAL NETWORK MODEL

## resetting the seed (new model graph by tensorflow: seed needs to be specified again)
def reset_random_seeds(nseed, enable_determinism=False):
    tf.keras.utils.set_random_seed(nseed)
    if enable_determinism:
        tf.config.experimental.enable_op_determinism()

reset_random_seeds(77)

# regression neural network model in Keras
model = Sequential()
model.add(tf.keras.Input(input_shape))
model.add(Dense(units=hidden_nodes_1, activation=hidden_activation_1))
model.add(Dense(units=hidden_nodes_2, activation=hidden_activation_2))
model.add(Dense(1, activation=output_activation))

In [None]:
## 4) COMPILE THE MODEL
model.compile(optimizer=optimizer_used, loss=loss_function)

In [None]:
print(model.summary())

<u>Parameters breakdown</u>:
- layer 1: 3 x (10 + 1) = 33
- layer 2: 3 x (3 + 1) = 12
- layer 3: 1 x (3 + 1) = 4

33 + 12 + 4 = 49 total parameters

In [None]:
## 5) TRAINING THE NEURAL NETWORK

start = time.time()
h = model.fit(x=X_train, y=y_train, epochs=num_epochs, verbose=0, validation_data=(X_val, y_val))
end = time.time()
print(end - start)

In [None]:
plot_loss_history(h, 'Regression (' + str(num_epochs) + ' epochs)')

In [None]:
X_val[0:5,:]

In [None]:
#predicting
y_val_pred = model.predict(X_val)
y_train_pred = model.predict(X_train)

#plotting true vs predicted values
plt.plot(y_train, y_train_pred, 'o', label='Train')
plt.plot(y_val, y_val_pred, 'o', label='Validation')
plt.xlabel("Ground truth")
plt.ylabel("Predictions")

#done
plt.legend()
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean squared error: {mean_squared_error(y_val, y_val_pred):.2f}")
print(f"Coefficient of determination: {r2_score(y_val, y_val_pred):.2f}")

res = np.corrcoef(y_val.T,y_val_pred.T)
print(f"Pearson correlation coefficient: {res[0,1]:.2f}")

## Conclusions

This notebook showed how to build a multi-layer neural network for a linear regression problem (continuous target variable to be predicted from some tabular data).
This is not usually a good idea: the neural network requires more computational power and more code, and its performance is at best equivalent to that of simpler regression models (and often worse, because of overfitting).

Neural networks shine on more complex problems, where linear regression is not a viable alternative (e.g. complex data structures, non-tabular data, very large datasets, non-linear relationships between the variables etc.).