___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[MEEC](https://ise.ualg.pt/en/curso/1477) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)
___

# ANN with Sklearn

Scikit-learn (sklearn) is a simple and efficient tool for data analysis and modeling. It provides a range of supervised and unsupervised learning algorithms through a consistent Python interface, including a small set of neural network models.

We are going to consider the following neural network models:
- **Perceptron**
- **Multi-layer Perceptron (MLP)**

## Perceptron

The **Perceptron is a simple neural network model that learns the weights of the input features to make a binary classification**. The Perceptron is a single-layer neural network with a **step activation function**. In the Sklearn library, the Perceptron model is available in the linear_model module (sklearn.linear_model.Perceptron) and is used here for a binary classification problem (two classes). It does not allow for activation functions other than the step function.

Although the Perceptron is a simple model, it is the basis for more complex models, such as the Multi-layer Perceptron (MLP). 
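To make the learning rule concrete, here is a minimal NumPy sketch of the textbook perceptron update with a step activation. It is an illustration only, not Sklearn's implementation; the function name `train_perceptron`, the toy data (logical AND), and the default values of `epochs` and `lr` are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Textbook perceptron rule with a step activation; y holds 0/1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # step activation
            update = lr * (yi - pred)           # zero when the prediction is correct
            w += update * xi
            b += update
    return w, b

# Toy, linearly separable data (logical AND), made up for this example
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_toy, y_toy)
pred_toy = (X_toy @ w + b > 0).astype(int)
print('weights:', w, 'bias:', b)
print('predictions:', pred_toy)
```

The weights only change when a sample is misclassified, which is why the rule stops updating once the data are separated.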

So, let us load the necessary libraries and make an example with the Iris dataset.

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()

The original Iris dataset has three classes. Here we reduce it to two: the first class (setosa, labeled 0) against the other two classes (versicolor and virginica) merged into a single class. The problem is therefore a binary classification problem, where the target variable is True if the class is 0 and False otherwise.

In [None]:
X = iris.data
y = iris.target == 0
y

In the next step we are going to split the data into training and test sets and standardize it (zero mean and unit variance per feature, with the scaler fitted on the training data only).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=iris.target,
                                                    random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now, we are going to train the Perceptron model and predict the target values over the test data.

In [None]:
perceptron = Perceptron(random_state=1,
                        max_iter=1000,
                        tol=1e-50,
                        verbose=True).fit(X_train_scaled, y_train)

perceptron.score(X_test_scaled, y_test)

Not bad for a simple model! The Perceptron model is a linear model, so it can only learn linear decision boundaries. However, it is a good starting point for more complex models, such as the Multi-layer Perceptron (MLP).

Let us see the weights of the input features.

In [None]:
perceptron.coef_
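As a check on how these weights are used, the following sketch recomputes the predictions by hand: a weighted sum of the inputs plus the bias, followed by the step function. For brevity it fits a separate model on the whole (scaled) dataset, so the names `X_demo`, `y_demo`, and `model_demo` are assumptions of this example and not the objects trained above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_demo = StandardScaler().fit_transform(iris.data)
y_demo = iris.target == 0

model_demo = Perceptron(random_state=1).fit(X_demo, y_demo)

# Weighted sum of the inputs plus the bias, one value per sample
scores = X_demo @ model_demo.coef_.ravel() + model_demo.intercept_[0]

# Step activation: positive scores map to the second class (True here)
manual_pred = scores > 0

print(np.array_equal(manual_pred, model_demo.predict(X_demo)))
```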

## Multi-layer Perceptron (MLP) Classification: Iris dataset
In this section, we'll use the Iris dataset to make our first example.

So, load and split data:

In [None]:
from sklearn.datasets import load_iris, load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    stratify=iris.target,
                                                    random_state=1)

Prepare a multi-layer perceptron classifier ([MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)), train it, and get the score over the test data.

The default parameters are:
- hidden_layer_sizes=(100,)
- activation='relu'
- solver='adam'
- alpha=0.0001
- ...

In [None]:
clf = MLPClassifier(verbose=True,  # set to False to hide the loss function evolution
                    random_state=1
                    ).fit(X_train, y_train)

score = clf.score(X_test, y_test)
print(f'Final score: {score}')

You probably got a warning message related to the number of iterations. The default number of iterations is 200. The warning message is due to the fact that the optimization algorithm did not converge. 
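To see how far the optimizer got, the fitted estimator exposes the number of iterations actually run (`n_iter_`) and the loss after each iteration (`loss_curve_`). The sketch below repeats the fit on a separate object (`clf_demo`, a name chosen for this example) and silences the warning so that only the numbers are shown.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          stratify=iris.target,
                                          random_state=1)

with warnings.catch_warnings():
    warnings.simplefilter('ignore', ConvergenceWarning)  # silence the warning for this demo
    clf_demo = MLPClassifier(max_iter=200, random_state=1).fit(X_tr, y_tr)

print(f'Iterations run: {clf_demo.n_iter_} (limit: {clf_demo.max_iter})')
print(f'First loss: {clf_demo.loss_curve_[0]:.4f}, last loss: {clf_demo.loss_curve_[-1]:.4f}')
```

If the number of iterations run equals the limit and the loss is still decreasing, raising `max_iter` is a reasonable next step.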

OK...!? Let us see if we can improve this by increasing the number of iterations.

In [None]:
clf = MLPClassifier(max_iter=1000,
                    random_state=1,
                    verbose=True
                    ).fit(X_train, y_train)

score = clf.score(X_test, y_test)
print(f'Final score: {score}')

That was good! Maybe there were other alternatives, like using more layers...?

In [None]:
clf = MLPClassifier(hidden_layer_sizes=(100, 100),
                    random_state=1,
                    max_iter=1000,
                    verbose=True
                    ).fit(X_train, y_train)

score = clf.score(X_test, y_test)
print(f'Final score: {score}')

Being a __multi-class classification problem__, the output layer has three neurons (one for each class: setosa, versicolor, and virginica). The __activation function is a softmax function__, which is a generalization of the logistic function to multiple dimensions.

The softmax function allows us to interpret the output of the network as probabilities. The class with the highest probability is the predicted class. The probabilities associated with each test instance are:

In [None]:
clf.predict_proba(X_test)
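For reference, the softmax computation itself can be sketched in a few lines of NumPy. The `logits` values below are made up for illustration; they stand for the raw outputs of the three output neurons for two instances.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])   # made-up output-layer values

probs = softmax(logits)
print(probs)
print(probs.sum(axis=1))       # each row sums to 1
print(probs.argmax(axis=1))    # index of the most probable class
```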

Let us see the distribution of the probabilities, for each class, over the test data. This allows us to understand the confidence of the classifier in its predictions.

In [None]:
plt.hist(clf.predict_proba(X_test), label=list(iris.target_names))  # one histogram per class
plt.legend()

## Multi-layer Perceptron (MLP) Regression: Bike Sharing dataset

In this section, we'll use the Bike Sharing dataset to make a regression example. This dataset contains the hourly count of rental bikes between years 2017 and 2018 in Seoul, Korea. The goal is to predict the number of bikes rented in a given hour based on the features available.

In [None]:
df = pd.read_csv('./data/SeoulBikeData.csv')
df.head()

As we can see, the dataset contains both numerical and categorical data. The target variable is the number of rented bikes.

In [None]:
df.info()

First, we pre-process the data:
- transform the date into a datetime object and extract the month and the day of the week (the date has no time component, and the hour of the day already has its own column in the dataset),
- replace the categorical data by a one-hot encoding.

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['weekday'] = df['Date'].dt.dayofweek   # Date holds no time, so .dt.hour would be 0 for every row
df['month'] = df['Date'].dt.month
df.drop(columns=['Date'], inplace=True)

df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'])
df.info()
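As a small illustration of what the one-hot encoding does, consider a toy frame with a single categorical column. The rows below are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd

# Hypothetical rows, only to show the effect of get_dummies
toy = pd.DataFrame({'Seasons': ['Winter', 'Spring', 'Winter'],
                    'Hour': [0, 1, 2]})

encoded = pd.get_dummies(toy, columns=['Seasons'])
print(encoded)
```

Each category becomes its own indicator column (here `Seasons_Spring` and `Seasons_Winter`), while the numerical column is left untouched.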

Now, we split the data into features and target and normalize it.

In [None]:
X_Seoul = df.drop(columns=['Rented Bike Count'])
y_Seoul = df['Rented Bike Count']

In [None]:
scaler = StandardScaler()
X_Seoul_scaled = scaler.fit_transform(X_Seoul)

Split the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_Seoul_scaled, y_Seoul,
                                                    test_size=0.2,   # keep 80% of the data for training
                                                    random_state=42)

Prepare a multi-layer perceptron regressor (MLPRegressor), train and get the score over the test data

In [None]:
model = MLPRegressor(max_iter=200,
                     random_state=1,
                     verbose=True
                     ).fit(X_train, y_train)

score = model.score(X_test, y_test)
print(f'Final score (R2): {score}')

You probably got a warning message related to the number of iterations. The default number of iterations is 200. The warning message is due to the fact that the optimization algorithm did not converge. So, let us increase the number of iterations.

In [None]:
model = MLPRegressor(max_iter=20000,
                     random_state=1,
                     verbose=True
                     ).fit(X_train, y_train)

score = model.score(X_test, y_test)
print(f'Final score (R2): {score}')

It's better... but maybe we can improve this... by increasing the number of hidden layers and neurons.
The previous model has only one hidden layer with 100 neurons. We can try to improve the model by adding more layers and neurons, for example:

In [None]:
model = MLPRegressor(max_iter=2000,
                     hidden_layer_sizes=(100, 100, 100),
                     random_state=1,
                     verbose=True
                     ).fit(X_train, y_train)

score = model.score(X_test, y_test)
print(f'Final score (R2): {score}')

A slightly better result was obtained. We can try to improve the model by adding more layers and neurons... or we can use a different activation function. The default activation function is ReLU; the other options are 'identity', 'logistic', and 'tanh'. Try them...!
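To get a feeling for how these activation functions differ, the sketch below evaluates ReLU, tanh, and the logistic function on a few made-up inputs using NumPy. The helper names `relu` and `logistic` are defined here for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

def logistic(z):
    """Logistic (sigmoid) function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])   # made-up pre-activation values

print('relu:    ', relu(z))
print('tanh:    ', np.tanh(z))   # outputs in (-1, 1)
print('logistic:', logistic(z))
```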

To compare the predicted values with the true values we can plot them.

In [None]:
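y_pred = model.predict(X_test)
plt.plot(y_test, y_pred, 'o', label='predictions')
plt.plot([0, 2500], [0, 2500], '-', label='ideal (predicted = true)')
plt.xlabel('True value')
plt.ylabel('Predicted value')
plt.legend()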

# ANN with Keras

Keras is a high-level neural networks API, written in Python. It originally ran on top of TensorFlow, CNTK, or Theano; recent versions (Keras 3) run on top of TensorFlow, JAX, or PyTorch. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Keras wraps these efficient numerical computation libraries and allows you to define and train neural network models in just a few lines of code.

In this section, we'll use Keras to build a neural network to classify the digits dataset and to predict the number of bikes rented in a given hour based on the Bike Sharing dataset.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Input, Dropout
from keras.utils import to_categorical

from tensorflow.keras.utils import plot_model
import tensorflow as tf

from sklearn.metrics import r2_score

import pandas as pd

tf.random.set_seed(42)      # Set random seed for reproducibility

## Classification: Digits dataset

In this section, we'll use Keras to build a neural network to classify the digits dataset. The digits dataset consists of 8x8 pixel images of digits. The goal is to classify the images into one of ten classes (0-9). 

First, we'll load the data, normalize it, and split it into training, validation, and test sets. Then, we'll build a neural network with a simple architecture and train it on the training data. Finally, we'll evaluate the model on the test data.


In [None]:
digits = load_digits()      # We are using the digits dataset, from sklearn.datasets 
X = digits.data
y = digits.target

# Normalize data - values are between 0 and 16
X = X / 16.0


One-hot encode the labels: this is needed because we are using the categorical cross-entropy loss, which expects one-hot encoded labels. The output layer has 10 neurons (one for each digit) with a softmax activation function.


In [None]:
y = to_categorical(y)
y[:5]
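The same encoding can be sketched with plain NumPy, which makes explicit what each row contains. The `labels` array below is made up for illustration.

```python
import numpy as np

labels = np.array([0, 3, 9])      # made-up digit labels
one_hot = np.eye(10)[labels]      # row i of the identity matrix encodes class i

print(one_hot)
print(one_hot.argmax(axis=1))     # recovers the original labels
```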

Split data into training, validation, and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test,
                                                test_size=0.5,
                                                random_state=42)
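The two consecutive splits give roughly 70% of the data for training, 15% for validation, and 15% for testing. The sketch below repeats the splits on the raw digits data and prints the resulting sizes; the variable names (`X_d`, `X_tr`, `X_va`, `X_te`, ...) are chosen for this example so as not to overwrite the ones used above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_d, y_d = load_digits(return_X_y=True)

# First split: 70% training, 30% held out
X_tr, X_rest, y_tr, y_rest = train_test_split(X_d, y_d, test_size=0.3, random_state=42)
# Second split: the held-out part is halved into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

for name, part in [('train', X_tr), ('validation', X_va), ('test', X_te)]:
    print(f'{name}: {len(part)} samples ({len(part) / len(X_d):.0%})')
```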

Now, let us build the neural network. We'll use a simple architecture with:
- an input layer with 64 neurons (one for each pixel in the image),
- two hidden layers, each with 100 neurons and ReLU activation functions,
- a dropout layer (rate 0.1) after the first hidden layer, to reduce overfitting,
- an output layer with 10 neurons (one for each digit) and a softmax activation function.

In [None]:
tf.random.set_seed(42)                      # Set random seed for reproducibility

m, n = X_train.shape

model = Sequential()                        # Sequential model - a linear stack of layers

model.add(Input(shape=(n,)))                # Input layer with 64 neurons
model.add(Dense(100, activation='relu'))    # Hidden layer with 100 neurons and ReLU activation function
model.add(Dropout(0.1))                     # Dropout layer with 10% of neurons being dropped
model.add(Dense(100, activation='relu'))
model.add(Dense(10, activation='softmax'))  # 10 neurons, one for each digit and softmax activation function
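To illustrate what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout: a random fraction of the activations is set to zero and the remaining ones are rescaled so that the expected value is unchanged. The array `activations` and the seed are made up for this example; at prediction time dropout is inactive.

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 0.1                                       # fraction of units to drop

activations = np.ones((1000, 100))               # made-up hidden-layer outputs
mask = rng.random(activations.shape) >= rate     # True for the units that are kept
dropped = activations * mask / (1.0 - rate)      # rescale the kept units

print(f'Fraction dropped: {1 - mask.mean():.3f}')
print(f'Mean before: {activations.mean():.3f}, after: {dropped.mean():.3f}')
```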

In the next step, we'll compile the model, specifying the optimizer, loss function, and metrics to be used during training. Compiling the model means that Keras will prepare the model for training.

In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # Cross-entropy loss function - used for multi-class classification problems
              metrics=['accuracy', 'f1_score']  # Accuracy and F1 score metrics
              )
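The categorical cross-entropy used above can be written out directly: for each sample, it is minus the logarithm of the probability assigned to the true class, averaged over the samples. The sketch below uses made-up one-hot labels and predicted probabilities; the function name and the clipping constant `eps` are assumptions of this example.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)            # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

y_true_demo = np.array([[1, 0, 0],
                        [0, 1, 0]])               # made-up one-hot labels
y_prob_demo = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])         # made-up softmax outputs

loss_demo = categorical_crossentropy(y_true_demo, y_prob_demo)
print(f'{loss_demo:.4f}')
```

The loss is small when the probability of the true class is close to 1 and grows quickly as that probability approaches 0.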

The layers, their output shapes, and the number of parameters are:

In [None]:
model.summary()
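The number of parameters reported for each dense layer can be checked by hand: each layer has one weight per (input, output) pair plus one bias per output, and dropout layers add no parameters. A short sketch for the 64-100-100-10 architecture described above:

```python
layer_sizes = [64, 100, 100, 10]   # input, two hidden layers, output

# weights (n_in * n_out) plus biases (n_out) for each dense layer
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params)

print(params, total_params)
```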

A plot of the architecture is:

In [None]:
plot_model(model, to_file='model_architecture.png', show_shapes=True, show_layer_names=True)

In [None]:
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=40, 
          batch_size=32,
          )

We achieved an accuracy of 0.9815 and an F1-score of 0.9809 on the validation data. Let us evaluate the model on the test data and check that we get similar results.

In [None]:
res = model.evaluate(X_test, y_test)

print(f'Test loss: {res[0]:.4f}')
print(f'Test accuracy: {res[1]:.4f}')

## Regression: Bike Sharing dataset

In this section, we'll use Keras to build a neural network to predict the number of bikes rented in a given hour based on the Bike Sharing dataset.

In [None]:
df = pd.read_csv('./data/SeoulBikeData.csv')
df.head()

First, we pre-process the data:
- transform the date into a datetime object and extract the month and the day of the week (the date has no time component, and the hour of the day already has its own column in the dataset),
- replace the categorical data by a one-hot encoding.

In [None]:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['weekday'] = df['Date'].dt.dayofweek   # Date holds no time, so .dt.hour would be 0 for every row
df['month'] = df['Date'].dt.month
df.drop(columns=['Date'], inplace=True)

df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'])
df.info()

Now, we split the data into features and target and normalize it.

In [None]:
X = df.drop(columns=['Rented Bike Count'])
y = df['Rented Bike Count']

We normalize the data using the StandardScaler from sklearn.

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

Split the data into training, validation, and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test,
                                                test_size=0.5,
                                                random_state=42)

Now, let us build the neural network. We'll use a simple architecture with:
- an input layer with 19 neurons (one for each feature),
- three hidden layers, each with 100 neurons and ReLU activation functions,
- dropout layers (rate 0.1) after the first and second hidden layers, to reduce overfitting,
- an output layer with one neuron and a linear activation function.

The model is compiled with:
- the mean squared error as the loss function,
- the mean absolute error as the metric,
- the Adam optimization algorithm as the optimizer.

In [None]:
m, n = X_train.shape

tf.random.set_seed(42)                      # Set random seed for reproducibility

model = Sequential()
model.add(Input(shape=(n,)))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))

model.compile(
    optimizer='adam',                   # Adam optimization, a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
    loss='mean_squared_error',          # Mean squared error loss function
    metrics=['mean_absolute_error']     # Mean absolute error metric
)
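The loss and the metric chosen above are simple to compute by hand. The sketch below does so with NumPy on made-up values and compares the results with the corresponding functions in sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true_demo = np.array([100.0, 250.0, 400.0])   # made-up bike counts
y_pred_demo = np.array([120.0, 240.0, 370.0])   # made-up predictions

mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)    # penalizes large errors more
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))   # same units as the target

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
```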

In [None]:
plot_model(model, to_file='model_architecture.png', show_shapes=True, show_layer_names=True)

In [None]:
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=1000, 
          batch_size=32)

Let us see the loss evolution during training and check if there was overfitting.

In [None]:
history = model.history

plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Evolution During Training')
plt.legend()
plt.grid()
plt.show()

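So, we can see that after ~250 epochs the validation loss stops decreasing (and may even increase), while the training loss continues to decrease. This is a sign of overfitting, so we can stop the training at that point. Since calling fit again on the same model would continue from the weights already trained, we first create a fresh copy of the model with the same architecture.


In [None]:
from keras.models import clone_model

tf.random.set_seed(42)                      # Set random seed for reproducibility

# Same architecture with freshly initialized weights, so training restarts from scratch
model = clone_model(model)
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_absolute_error'])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=250,
          batch_size=32)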
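Instead of reading the stopping point off the plot, the epoch with the lowest validation loss can be located programmatically. The sketch below uses a made-up list of validation losses; with a real training run, the list stored under `history.history['val_loss']` would play that role.

```python
import numpy as np

# Made-up validation losses, one per epoch: they decrease and then rise again
val_loss_demo = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]

best_epoch = int(np.argmin(val_loss_demo)) + 1   # epochs are counted from 1
print(f'Lowest validation loss {min(val_loss_demo):.2f} at epoch {best_epoch}')
```

Keras can also automate this during training with the `EarlyStopping` callback, which monitors a quantity such as `val_loss` and stops when it no longer improves.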

Let us evaluate the model on the test data.

In [None]:
res = model.evaluate(X_test, y_test)
print(f'Test loss (MSE): {res[0]:.4f}')
print(f'Test MAE: {res[1]:.4f}')   # the metric passed to compile is the mean absolute error

If wanted, we can compute the R² score, using the sklearn.metrics.r2_score function.

In [None]:
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R²: {r2:.4f}')
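For reference, R² compares the squared errors of the model with those of a baseline that always predicts the mean of the target: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up values and checks it against r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred_demo = np.array([2.5, 5.5, 6.0, 9.5])   # made-up predictions

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.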

And now, we can plot the predicted values against the true values.

In [None]:
plt.plot(y_test, y_pred, 'o', label='predictions')
plt.plot([0, 2500], [0, 2500], '-', label='ideal (predicted = true)')
plt.legend()

## Dropout, L1, and L2 Regularization

Dropout, L1, and L2 regularization are techniques used to prevent overfitting in neural networks. We already saw the Dropout technique in the previous examples. Let us see the L1 and L2 regularization techniques.

In [None]:
model = Sequential()
model.add(Input(shape=(n,)))
model.add(Dense(100, activation='relu', kernel_regularizer='l1'))  # L1 regularization
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu', kernel_regularizer='l2'))  # L2 regularization
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu', kernel_regularizer='l1l2'))  # L1 and L2 regularization
model.add(Dense(1))

model.compile(
    optimizer='adam',
    loss='mean_squared_error',
    metrics=['mean_absolute_error']
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=400,
          batch_size=32)

res = model.evaluate(X_test, y_test)
print(f'Test loss (MSE + regularization penalties): {res[0]:.4f}')
print(f'MAE: {res[1]:.4f}')

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R²: {r2:.4f}')

plt.plot(y_test, y_pred, 'o', label='predictions')
plt.plot([0, 2500], [0, 2500], '-', label='ideal (predicted = true)')
plt.legend()
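To make the two penalties concrete, the sketch below computes them with NumPy for a small, made-up weight matrix. The regularization factor `lam` is an assumption of this example; the penalty is added to the loss that the optimizer minimizes.

```python
import numpy as np

weights = np.array([[0.5, -1.0],
                    [2.0,  0.0]])   # made-up layer weights
lam = 0.01                          # regularization factor (assumed value)

l1_penalty = lam * np.sum(np.abs(weights))   # L1: sum of absolute values, favors sparse weights
l2_penalty = lam * np.sum(weights ** 2)      # L2: sum of squares, favors small weights

print(f'L1 penalty: {l1_penalty:.4f}')
print(f'L2 penalty: {l2_penalty:.4f}')
```

In Keras, the strength of each penalty can be set explicitly by passing a regularizer object with the desired factors (for example, from the classes `keras.regularizers.L1`, `L2`, and `L1L2`) instead of a string identifier, which relies on the library's default factors.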
