<div>
<table style="width: 100%">
    <td>
	<tr>
		<td>
		<table style="width: 100%">
			<tr>
                <td ><center><font size="30">WSD 2024-2025</font></center>
                    <center><font size="30">AI 4 Water Systems</font></center></td>
			</tr>
			<tr>
                <td><center><font size="5">Jupyter Notebook 3</font></center></td>
			</tr>
			<tr>
                <td><center><font size="10">Multi Layer Perceptron (MLP)</font></center></td>
			</tr>
            <tr>
                <td><center><font size="5">Claudia Bertini, Lecturer in Hydroinformatics</font></center></td>
			</tr> 
		</table>
		<td> <img src='ihe-logo-square.png'></img></td>
	</tr>
</table>
</div>  

# Recap and scope of this notebook
We aim to develop a data-driven model that predicts discharge in the next hour (Qt+1) using past hourly rainfall and discharge data. Throughout these 3 notebooks, we learn how to pre-define the input features of the data-driven model (Notebook 1), how to build a linear regression model and an M5 Tree model (Notebook 2), and how to build an Artificial Neural Network (ANN, Notebook 3).

In <b>Notebook 1</b> we learnt how to load and visualize the data. We also used (linear) correlation and autocorrelation to identify the potentially relevant input features. Based on our results, we concluded that we could use the past 6 hours of both effective rainfall and discharge to train our data-driven model. In Notebook 2, we learnt how to use these features to build first a <b>Linear regression model</b> and then an <b>M5 Tree model</b>. We explored different training options for the M5 model: pruned, unpruned, and with custom-made constraints on the minimum number of leaves allowed in a node and on the maximum depth of the tree.
In this <b> Notebook 3 </b>, we learn how to develop (train and test) an Artificial Neural Network (ANN) to predict hourly discharge (Qt+1) using the same input features as for the linear and M5 models. More specifically, we learn how to develop a Multi Layer Perceptron (MLP), a specific kind of ANN. The first part of the notebook is almost identical to the initial part of Notebook 2, as the data loading and preparation (including the training and testing split) are exactly the same. We repeat them here so that you can run the ANN and M5 notebooks independently of one another. Pay attention to the libraries installed and loaded at the beginning: they differ slightly, because we now import the library for the MLP and not the one for the linear model.

# 1. Installing and importing the libraries needed
Before importing the libraries below, you might need to install them. You can use the following commands (the code to be run is in the following cells; remove the leading # before running a cell):
pip install matplotlib
pip install pandas
pip install openpyxl
pip install scikit-learn
pip install numpy

The library os is part of the Python standard library, so it is always available and does not need to be installed with pip.
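If you are unsure which libraries are already installed, a quick check like the one below can help. It is only a sketch: the list of names reflects the imports used in this notebook, and note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the packages used in this notebook.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
required = ["pandas", "matplotlib", "openpyxl", "sklearn", "numpy"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing can then be installed with the corresponding pip command above.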

In [None]:
#pip install matplotlib

In [None]:
#pip install pandas

In [None]:
#pip install openpyxl

In [None]:
#pip install scikit-learn

In [None]:
#pip install numpy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error
import numpy as np

# 2. Load the data
To load the data correctly, first make sure that the file "Sieve-orig.xlsx" is saved in the same folder as this notebook. If it is not, please move the file to the folder of this notebook.
The file contains the hourly records of discharge (Qt) and effective precipitation (REt), from 1 Dec 1969 to 28 Feb 1970. There is also one column providing information about date and time.
The first lines of the next cell check whether the notebook is running in Google Colab, in case you prefer to use it, and set the file path accordingly.

In [None]:
# Define file path and check environment
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_path = '/content/drive/My Drive/'
else:
    base_path = os.getcwd()  # Use current directory in Jupyter Notebook = directory where the notebook is locally saved in your computer

# Ensure file exists before reading
file_name = 'Sieve-orig.xlsx'
file_path = os.path.join(base_path, file_name)

if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: {file_path}")

# First load the file with the data from Sieve
df = pd.read_excel(file_path)
# you can visualize the first rows of the data by:
df.head()

# 3. Select the input features
Using the information on which input features to retain, summarized at the beginning of this notebook, we now reorganize the dataset. We want each variable to be in its own column. We can reuse part of the code we developed for the correlation analysis for this purpose. In the cell below we keep the current value plus 5 lags of effective rainfall, and the current value plus 2 lags of discharge.

In [None]:
# We first shift our rainfall data and copy them to separate columns. We take up to 5 steps back in time
for lag in range(1, 6):  # lags 1 to 5
    df[f'REt_lag{lag}'] = df['REt'].shift(lag)

# we do the same with discharge, taking up to 2 steps back in time
for lag in range(1, 3):  # lags 1 to 2
    df[f'Qt_lag{lag}'] = df['Qt'].shift(lag)

# you can print the headers if you want to visualize your data set
df.head()

You now see several rows with NaN (not a number) values at the top of the table. This is expected: shifting a column down by n rows leaves its first n rows without a value. We remove the NaN rows in the next step.

In [None]:
# we create the target (Qt+1) column, which was not yet in the df
df['Qt+1'] = df['Qt'].shift(-1)

# now we remove the rows with missing values (NaN = not a number)
df = df.dropna()
df.reset_index(inplace=True,drop=True)

# you can print the headers if you want to visualize your data set
df.head()

You can now see that all the variables have a dedicated column and that there are no more NaN values.
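To see what the shifting and the removal of NaN rows do, it can help to repeat the same steps on a tiny dataset. The sketch below uses a made-up discharge series of five values (not the Sieve data) with two lags and the Qt+1 target.

```python
import pandas as pd

# Made-up discharge values, only for illustration
toy = pd.DataFrame({"Qt": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Positive shifts move values down: each row receives a value from the past
toy["Qt_lag1"] = toy["Qt"].shift(1)
toy["Qt_lag2"] = toy["Qt"].shift(2)

# A negative shift moves values up: each row receives the value of the next step
toy["Qt+1"] = toy["Qt"].shift(-1)
print(toy)

# Rows without a complete set of lags or without a target are dropped
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

The first two rows are lost because of the lags and the last row because its target lies beyond the end of the record, so only the rows with a complete set of values remain.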

# 4. Training-Testing split
We now have to split the dataset into two parts, one used for training and one for testing. For testing, we use the first 300 rows of our dataset, while we keep the rows from 301 to the very end for training.
It is also possible to use the automatic training-testing split function built into the scikit-learn library, but by default it shuffles the rows randomly, so it would not let us choose the testing period.

We then prepare the input (X) and output (Y) datasets for our data-driven model, being very careful that the target Qt+1 is not included in the input (X).
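The idea of a split by period can be sketched on a small made-up table. The column names mirror those of this notebook, but the values and the number of test rows (3) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up hourly record of 10 time steps
toy = pd.DataFrame({
    "Date": pd.date_range("1969-12-01", periods=10, freq=pd.Timedelta(hours=1)),
    "Qt": np.arange(10, dtype=float),
})

n_test = 3  # hypothetical size of the testing period

# Split by position: the first n_test rows for testing, the rest for training
toy_test = toy.iloc[:n_test]
toy_train = toy.iloc[n_test:]

print(len(toy_test), "testing rows,", len(toy_train), "training rows")
print("Testing period ends:", toy_test["Date"].max())
print("Training period starts:", toy_train["Date"].min())
```

Because the rows are kept in chronological order, the two periods do not overlap in time.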

In [None]:
# We split the data into training and testing, taking the first 300 rows for testing,
# and the rows from 301 onwards for training.
df_test = df.loc[:299]
df_train = df.loc[300:]

# we can visualize them
df_train.head()


We can see that the inputs and the output (Qt+1) are still in the same dataset, so we need to split them.

In [None]:
# Now we prepare the inputs (X) and the target (Y), being sure that X does not contain
# the target (Qt+1) and that Y contains only the target (Qt+1)

X_train = df_train.copy().drop(['Date','Qt+1'],axis=1)
X_train.head()


We can now see that the Qt+1 column is no longer in the dataframe. We repeat the same for the testing set and for the outputs. We also convert all X and Y into numpy arrays, as required by the MLP model that we will implement.

In [None]:
X_train = X_train.to_numpy()
X_test = df_test.copy().drop(['Date','Qt+1'],axis=1).to_numpy()

y_train = df_train['Qt+1'].to_numpy()
y_test = df_test['Qt+1'].to_numpy()

# 5. Data normalization
Before training our MLP model, we first have to normalize the data. Normalization makes the training faster and helps the algorithm converge more easily. Imagine, for instance, having input variables with very different magnitudes and ranges. During training, the input feature with the largest range (difference between maximum and minimum value) might end up weighing more than the others in the error computation. This might not reflect the real importance of the feature itself, but simply be a numerical artifact, and it can lead to suboptimal learning. In addition, normalization can improve the generalization performance of your model.

There are different types of normalization. The most common are those that rescale the values to the range [0,1], and those that rescale the data with respect to their mean and standard deviation (also called standardization). In this notebook, we use the standardization already implemented in the sklearn library.

<b> Attention </b>: you can choose the normalization method you prefer, but you need to compute the normalization parameters (mean and standard deviation, for instance) using only the training data. You then reuse the same parameters to normalize the testing set. It is important to keep the two separate: if the testing set is included in the computation of the normalization parameters, your test set is no longer truly independent.
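The sketch below illustrates this rule on made-up numbers: the mean and standard deviation are computed by hand from the training rows only and then applied to both sets, and the result is compared with StandardScaler. The arrays are hypothetical and only serve the comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features with very different ranges
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[5.0, 500.0]])

# Parameters computed from the TRAINING data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation, as in StandardScaler

# The same parameters are applied to both sets
X_train_manual = (X_train_toy - mean) / std
X_test_manual = (X_test_toy - mean) / std

# StandardScaler gives the same result when fitted on the training set only
scaler_toy = StandardScaler().fit(X_train_toy)
print(np.allclose(X_train_manual, scaler_toy.transform(X_train_toy)))
print(np.allclose(X_test_manual, scaler_toy.transform(X_test_toy)))
```

After scaling, both training features have zero mean and unit standard deviation, while the test values are simply expressed relative to the training statistics.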

In [None]:
# print the beginning of your input features now, to then compare it with the normalized ones
print(X_train)

In [None]:
# Normalize input features (NOT THE TARGET, THAT WILL HAPPEN LATER)
inpscaler = StandardScaler()  # create the standardizer
# fit_transform computes the parameters from the training set (fit)
# and immediately applies them for the normalization (transform)
X_train = inpscaler.fit_transform(X_train)
# transform only: the test data are scaled with the training parameters.
# Be careful, there is no fitting here.
X_test = inpscaler.transform(X_test)

# print again the training set to see the difference
print(X_train)

In [None]:
print(y_train.shape)

In [None]:
# we first reshape the target, to make it 2-dimensional (required by the scaler)
y_train = np.reshape(y_train, [-1, 1])
y_test = np.reshape(y_test, [-1, 1])
# Normalize the target
tarscaler = StandardScaler()  # create the standardizer
# fit_transform computes the parameters from the training set (fit)
# and immediately applies them for the normalization (transform)
y_train = tarscaler.fit_transform(y_train)
# transform only: the test data are scaled with the training parameters.
# Be careful, there is no fitting here.
y_test = tarscaler.transform(y_test)

# print again the training set to see the difference
print(y_train)

Why did we keep the target separated from the input features?
Imagine that you have trained your model, which accepts normalized values as inputs and provides normalized values as outputs. When you test the model, you also obtain normalized values. You can then go back to the original units by inverting the normalization. This step is a single command in sklearn, but it is much more practical to have a dedicated scaler for the target. If we had used one scaler for inputs and target together, the inverse normalization would expect a single matrix containing both the input columns and the output column, whereas the model returns only the output (you can see that the dimensions of input and output are different).
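A small round trip shows why a dedicated target scaler is convenient. The target values below are made up; the point is only that inverse_transform brings the scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values (for example discharge in m3/s)
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# The scaler expects a 2D array, so the 1D target is reshaped to one column
tar_scaler_toy = StandardScaler()
y_scaled = tar_scaler_toy.fit_transform(y_toy.reshape(-1, 1))

# inverse_transform undoes the standardization and restores the original units
y_back = tar_scaler_toy.inverse_transform(y_scaled).ravel()

print("Scaled:  ", y_scaled.ravel())
print("Restored:", y_back)
```

The same inverse_transform call works on model predictions, as long as they are reshaped to one column first.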

# 6. Training and Testing the Multi Layer Perceptron (MLP) Model
We will now build and train an MLP model. From a programming perspective, we could use two libraries to implement an MLP: scikit-learn (abbreviated sklearn) or TensorFlow. The first is easier to handle for a first hands-on experience, but it is less flexible and less suitable for (very) large datasets. The second is more flexible and allows for better customization of the model, which makes it suitable for Deep Learning models, but it is harder to handle for a first-time user. We will use sklearn for our introduction.
You can decide how many hidden layers to add and how many nodes each layer should have. This is controlled by the argument "hidden_layer_sizes=(64,64)". Currently, we have two hidden layers with 64 units each. You can change these numbers, and add or remove layers simply by adding or removing numbers. Remember to separate them with a comma. We also choose relu as the activation function and adam as the optimizer of the model parameters. The random state fixes the initial weights so that they are always the same, for the reproducibility of the notebook (you will not get different results if you run the same model many times, unless you change the parameters). You can learn more about the MLP regressor here https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html.
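To check how the tuple translates into layers, you can inspect the weight matrices of a fitted model. The sketch below uses random made-up data with 9 input features and a hypothetical architecture of three hidden layers; it is not the model trained in this notebook.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Made-up data: 50 samples with 9 input features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 9))
y_toy = X_toy.sum(axis=1)

# Three hidden layers with 32, 16 and 8 nodes (hypothetical choice)
toy_model = MLPRegressor(hidden_layer_sizes=(32, 16, 8), activation="relu",
                         solver="adam", max_iter=200, random_state=42)
toy_model.fit(X_toy, y_toy)  # a convergence warning may appear on such a short run

# One weight matrix per connection between consecutive layers
layer_shapes = [w.shape for w in toy_model.coefs_]
print(layer_shapes)
print("Total number of layers (input + hidden + output):", toy_model.n_layers_)
```

Each weight matrix has as many rows as nodes in the previous layer and as many columns as nodes in the next one, so adding a number to the tuple adds one more matrix to the list.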

In [None]:
# we reshape back y_train and y_test, as MLP needs them 1D
y_train = np.reshape(y_train,(y_train.shape[0]))
y_test = np.reshape(y_test,(y_test.shape[0]))
# Train MLP Model
mlp_model = MLPRegressor(hidden_layer_sizes=(64, 64), activation='relu', solver='adam', max_iter=500, random_state=42)
mlp_model.fit(X_train, y_train)



Now we use the fitted model to predict Qt+1 on the test set and we compute some metrics. sklearn does not compute metrics such as the MAE during training, although for the adam and sgd solvers the fitted model stores the training loss of each iteration in its loss_curve_ attribute. To obtain the metrics on the training data, apply the model to the training set and compute them in the same way. We show only the case for the testing set.
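As a sketch of how the training loss can be inspected, the example below fits a small MLP on made-up data and reads the loss_curve_ attribute. The data and the architecture are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Made-up regression problem with 3 input features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])

toy_model = MLPRegressor(hidden_layer_sizes=(16,), solver="adam",
                         max_iter=300, random_state=42)
toy_model.fit(X_toy, y_toy)  # a convergence warning may appear

# loss_curve_ holds one training loss value per iteration (adam and sgd solvers)
losses = toy_model.loss_curve_
print(f"Iterations run: {toy_model.n_iter_}")
print(f"First loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

For the model of this notebook, the same list is available as mlp_model.loss_curve_ after fitting, and it can be plotted with plt.plot to check whether the training has levelled off.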

In [None]:
# Evaluate the model
y_pred = mlp_model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: {test_mae:.3f}")
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse:.3f}")


There are several metrics that you can compute. Check the sklearn documentation to see which ones are already built in for you to use.

It seems that the error is very low, but be careful: it is computed on the normalized dataset! We now need to <b>de-normalize</b> our predictions.

In [None]:
y_pred = mlp_model.predict(X_test).reshape(-1, 1)
y_pred_inv = tarscaler.inverse_transform(y_pred)
y_test_inv = tarscaler.inverse_transform(y_test.reshape(-1, 1))


In [None]:
test_mae = mean_absolute_error(y_test_inv, y_pred_inv)
print(f"Test MAE: {test_mae:.3f}")
test_mse = mean_squared_error(y_test_inv, y_pred_inv)
print(f"Test MSE: {test_mse:.3f}")

As you can see, the results change quite a lot! Remember that both MAE and MSE have units: the MAE has the same units as the target (m³/s), while the MSE is expressed in the squared units of the target.
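Because the MSE is in squared units, its square root (RMSE) is often reported instead, as it is directly comparable with the MAE. The sketch below computes these metrics, plus the dimensionless coefficient of determination from sklearn, on made-up observed and simulated values (not the results of this notebook).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed and simulated discharge values (m3/s)
obs = np.array([10.0, 20.0, 30.0, 40.0])
sim = np.array([12.0, 18.0, 33.0, 39.0])

mae = mean_absolute_error(obs, sim)  # same units as the target
mse = mean_squared_error(obs, sim)   # squared units of the target
rmse = np.sqrt(mse)                  # back to the units of the target
# r2_score is computed as 1 - SSE/SST, the same formula as the
# Nash-Sutcliffe efficiency often used in hydrology
r2 = r2_score(obs, sim)

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

The same calls can be applied to y_test_inv and y_pred_inv to obtain the metrics in the original units.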

We can now plot our results

In [None]:
# Plot actual vs predicted discharge
plt.figure(figsize=(8, 5))
plt.scatter(y_test_inv, y_pred_inv, color='darkolivegreen', alpha=0.6, label='Predicted vs Actual')
# 1:1 line spanning the range of the observations
q_min, q_max = y_test_inv.min(), y_test_inv.max()
plt.plot([q_min, q_max], [q_min, q_max], color='black', linestyle='--', label='Perfect Fit')
plt.xlabel('Actual Discharge (m³/s)')
plt.ylabel('Predicted Discharge (m³/s)')
plt.title('Multi Layer Perceptron: Predicted vs Actual Discharge')
plt.legend()
plt.grid(True)
# save the plot in your local folder (before plt.show(), otherwise the saved file may be blank)
scatter_path = os.path.join(base_path, 'MLP_Scatter.png')
plt.savefig(scatter_path, dpi=600, format='png')
plt.show()

In [None]:
# extract the dates of the test set
time = df['Date'].loc[:299]
# Plot hydrographs
plt.figure(figsize=(10, 5))
plt.plot(time, y_test_inv, label='Actual Discharge', color='navy')
plt.plot(time, y_pred_inv, label='Predicted Discharge', color='steelblue', linestyle='dashed')
plt.xlabel('Date')
plt.ylabel('Discharge (m³/s)')
plt.title('Actual vs Predicted Discharge Hydrograph')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
# save the plot in your local folder (before plt.show(), otherwise the saved file may be blank)
hydrograph_path = os.path.join(base_path, 'MLP_Hydrograph.png')
plt.savefig(hydrograph_path, dpi=600, format='png')
plt.show()

<b> We can now save the results locally, on our laptop </b>

In [None]:
# First we create a dataframe with the observations and the predictions
dr = pd.DataFrame({
    'Date': pd.to_datetime(time),
    'Obs': y_test_inv.ravel(),   # flatten the (n, 1) arrays to 1D columns
    'MLP': y_pred_inv.ravel(),
})

# Now we save them
results_path = os.path.join(base_path, 'MLP_Predictions.xlsx')
dr.to_excel(results_path)

# visualize the first rows of the results (must be the last line of the cell to be displayed)
dr.head()

# 7. Conclusion
We have learnt how to split the data into training and testing sets using the pandas library. We then learnt how and why to normalize the input features and the target, and how to create, train and test an MLP model. Finally, we checked the performance using the mean absolute error, the mean squared error and graphical inspection.
<b> Which model is the best among all those we implemented (linear regression, M5 Tree, MLP)? Do you see much difference between them? </b>

You can keep exploring different training strategies, adding/reducing hidden layers and/or nodes, but also changing the activation function. Always monitor what happens to your metrics when changing parameters!

If you want, you can save all your predictions externally and then plot them all together. You only need to load the saved data and adjust the plotting code.
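One possible way to bring the predictions of different models together is to merge the saved tables on the Date column. The sketch below uses two small made-up tables in place of the saved files; the M5 column name and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the tables saved by the different notebooks
dates = pd.date_range("1969-12-01", periods=4, freq=pd.Timedelta(hours=1))
mlp_toy = pd.DataFrame({"Date": dates, "Obs": [1.0, 2.0, 3.0, 4.0],
                        "MLP": [1.1, 1.9, 3.2, 3.8]})
m5_toy = pd.DataFrame({"Date": dates, "Obs": [1.0, 2.0, 3.0, 4.0],
                       "M5": [0.9, 2.1, 2.9, 4.2]})

# Keep one copy of the observations and add the second model as a new column
combined = mlp_toy.merge(m5_toy[["Date", "M5"]], on="Date")
print(combined)
```

In practice the two tables would be loaded with pd.read_excel from the files you saved, and the combined table can then be plotted with combined.plot(x="Date", y=["Obs", "MLP", "M5"]).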