<div>
<table style="width: 100%">
    <td>
	<tr>
		<td>
		<table style="width: 100%">
			<tr>
                <td ><center><font size="30">WSD 2024-2025</font><center>
                    <center><font size="30">AI 4 Water Systems</font><center></td>
			</tr>
			<tr>
                <td><center><font size="5">Jupyter Notebook 2</font><center></td>
			</tr>
			<tr>
                <td><center><font size="10">Model M5 Tree</font><center></td>
			</tr>
            <tr>
                <td><center><font size="5">Claudia Bertini, Lecturer in Hydroinformatics</font><center></td>
			</tr> 
		</table>
		<td> <img src='ihe-logo-square.png'></img></td>
	</tr>
</table>
</div>  

# Recap and scope of this notebook
We aim to develop a data-driven model that predicts discharge in the next hour (Qt+1) from past hourly rainfall and discharge data. Across these three notebooks, we learn how to pre-select the input features of the data-driven model (Notebook 1), how to build a linear regression model and an M5 tree model (Notebook 2), and how to build an Artificial Neural Network (ANN, Notebook 3).

In <b>Notebook 1</b> we learnt how to load and visualize the data. We also used (linear) correlation and autocorrelation to identify potentially relevant input features. Based on those results, we concluded that the past 6 hours of both effective rainfall and discharge are candidate inputs for our data-driven model. These inputs are usually refined further by trial and error: one defines all combinations of the input variables that make sense, then trains and tests a data-driven model on each of them. Suppose we have 100 combinations of input features, which gives 100 trained models. We test all 100 models on the same testing dataset and compute one or more performance metrics (RMSE, MSE, NSE, etc.). The model with the best performance identifies the best set of input features. This process can take quite some time, so for the purpose of this exercise we have already done it. Our results indicate that the <b>most relevant input features</b> for predicting Qt+1 are:
<b>REt, REt-1, REt-2, REt-3, REt-4, REt-5, Qt, Qt-1, Qt-2.</b> 
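
The trial-and-error search described above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data rather than the Sieve dataset, the feature names `f0`, `f1`, `f2` are hypothetical, and a linear regression stands in for whichever data-driven model is being tuned.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (NOT the Sieve dataset): three candidate features,
# of which only the first two actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=400)
feature_names = ['f0', 'f1', 'f2']  # hypothetical names

# Use the same split for every candidate model, so the scores are comparable
X_test, X_train = X[:100], X[100:]
y_test, y_train = y[:100], y[100:]

results = {}
for k in range(1, len(feature_names) + 1):
    for cols in combinations(range(len(feature_names)), k):
        cols = list(cols)
        model = LinearRegression().fit(X_train[:, cols], y_train)
        y_hat = model.predict(X_test[:, cols])
        names = tuple(feature_names[c] for c in cols)
        results[names] = mean_squared_error(y_test, y_hat)

# The combination with the lowest test MSE is the preferred feature set
best = min(results, key=results.get)
print(best, results[best])
```

With 3 candidate features there are 7 non-empty combinations; with 12 candidates there would be 4095, which is why the search is restricted to combinations that make physical sense.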

In this Notebook 2, we use these features to build first a <b>Linear regression model</b> and then an <b>M5 Tree model</b>.

# 1. Installing and importing the libraries needed
Before importing the libraries below, you might need to install them. You can use the following commands (the code to be run is in the following cells):
pip install matplotlib
pip install pandas
pip install openpyxl
pip install m5py
pip install scikit-learn
pip install numpy

The os module is part of the Python standard library, so it is always available and does not need to be installed with pip.

In [None]:
pip install matplotlib

In [None]:
pip install pandas

In [None]:
pip install openpyxl

In [None]:
pip install m5py

In [None]:
pip install scikit-learn

In [None]:
pip install numpy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from m5py import M5Prime

# 2. Load the data
To load the data correctly, first make sure that the file "Sieve-orig.xlsx" is saved in the same folder as this notebook. If it is not, please move the file there.
The file contains the hourly records of discharge (Qt) and effective precipitation (REt), from 1 Dec 1969 to 28 Feb 1970. There is also one column with the date and time.
The first lines of the next cell handle the case in which you open this notebook in Google Colab, if you prefer it.

In [None]:
# Define file path and check environment
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_path = '/content/drive/My Drive/'
else:
    base_path = os.getcwd()  # Use current directory in Jupyter Notebook = directory where the notebook is locally saved in your computer

# Ensure file exists before reading
file_name = 'Sieve-orig.xlsx'
file_path = os.path.join(base_path, file_name)

if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: {file_path}")

# First load the file with the data from Sieve
df = pd.read_excel(file_path)
# you can visualize the first rows of the data by:
df.head()

# 3. Select the input features
Using the information on which input feature to retain contained at the beginning of this notebook, we now re-organize the dataset. We want each variable to be in one specific column. We can use (part) of the code we developed for the correlation analysis for this purpose.

In [None]:
# We first shift our rainfall data and copy them to separate columns. We take up to 5 steps back in time
for lag in range(1, 6):  # lags 1 to 5
    df[f'REt_lag{lag}'] = df['REt'].shift(lag)
    
# we do the same with discharge, taking up to 2 steps back in time
for lag in range(1, 3):  # lags 1 and 2
    df[f'Qt_lag{lag}'] = df['Qt'].shift(lag)

# you can print the headers if you want to visualize your data set
df.head()

You now see several rows with NaN (not a number) values. This is expected: shifting a column down by k rows leaves its first k rows without a value. We remove the rows containing NaN in the next step.

In [None]:
# we create the target (Qt+1) column, which was not yet in the df
df['Qt+1'] = df['Qt'].shift(-1)

# now we remove the rows with missing values (NaN = not a number)
df = df.dropna()
df.reset_index(inplace=True,drop=True)

# you can print the headers if you want to visualize your data set
df.head()

You can now see that all the variables have a dedicated column and that there are no more NaN values.
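
To see what `shift` and `dropna` do, here is a minimal sketch on a toy series of five values (invented for illustration, not taken from the Sieve data). A positive shift creates a lagged input, a negative shift creates the one-step-ahead target, and the rows left incomplete at either end are dropped.

```python
import pandas as pd

# Toy discharge series (illustrative values only)
toy = pd.DataFrame({'Qt': [10.0, 11.0, 12.0, 13.0, 14.0]})

toy['Qt_lag1'] = toy['Qt'].shift(1)   # value one step earlier (input feature)
toy['Qt+1'] = toy['Qt'].shift(-1)     # value one step later (the target)
print(toy)  # first row has NaN in Qt_lag1, last row has NaN in Qt+1

# Drop the incomplete rows and renumber the index from 0
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

Of the five original rows, three remain: the first is lost to the lag and the last to the target.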

# 4. Training-Testing split
We now have to split the dataset into two parts, one used for training and one for testing. For testing, we use the first 300 rows of our dataset, while we keep the rows from 301 to the very end for training.
It is also possible to use the automatic training-testing split function built into the scikit-learn library (train_test_split), but by default it shuffles the rows, and it would not let us pick this specific period for testing.
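
As a small illustration of the difference, the sketch below compares a manual split with scikit-learn's `train_test_split` on a stand-in array of row positions (not the real data). With `shuffle=False` the function keeps the chronological order, but it takes the test set from the end of the series rather than from the beginning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)  # stand-in for the row positions of a time series

# Manual split, as done in this notebook: the FIRST rows form the test set
manual_test, manual_train = rows[:3], rows[3:]

# Automatic split without shuffling: order is kept,
# but the test set is taken from the END of the series
auto_train, auto_test = train_test_split(rows, test_size=3, shuffle=False)

print('manual test rows:', manual_test)
print('automatic test rows:', auto_test)
```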

We then prepare the Input (X) and output (Y) datasets for our data driven model, being very careful that the target Qt+1 is not included in the Input (X).

In [None]:
# We split the data into training and testing, taking the first 300 rows for testing,
# and the rows from 301 onwards for training.
df_test = df.loc[:299]
df_train = df.loc[300:]

# we can visualize them
df_train.head()


We can see that the inputs and the output (Qt+1) are still in the same dataset, so we need to split them.

In [None]:
# Now we prepare the inputs (X) and the target (Y), being sure that X does not contain
# the target (Qt+1) and that Y contains only the target (Qt+1)

X_train = df_train.copy().drop(['Date','Qt+1'],axis=1)
X_train.head()


We can now see that the Qt+1 column is no longer in the dataframe. We repeat the same for the testing set and for the outputs. We also convert all X and Y to numpy arrays, which is the format we will pass to the models.

In [None]:
X_train = X_train.to_numpy()
X_test = df_test.copy().drop(['Date','Qt+1'],axis=1).to_numpy()

y_train = df_train['Qt+1'].to_numpy()
y_test = df_test['Qt+1'].to_numpy()

# 5. Training and Testing the Linear Regression Model
We will now build and train a simple linear regression model. In studies with data-driven models and Machine Learning, this kind of model is commonly used as a benchmark: it shows whether a more complex model is really needed or whether a simple linear regression already gives good results.

In [None]:
# Train Multi-Linear Regression Model
mlr_model = LinearRegression()  # this just calls the model
mlr_model.fit(X_train, y_train) # this line actually TRAINS the model


Now we use the fitted model to predict Qt+1 on the test set and we compute some metrics. The LinearRegression model in sklearn is fitted in a single least-squares step, so there is no loss curve to track during training. To assess the fit on the training data, you can apply the model to the training set and compute the same metrics. Here we show only the testing set.

In [None]:
# Make predictions
y_pred = mlr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Multi-Linear Regression Model Performance:')
print(f'Mean Squared Error: {mse:.2f} (m³/s)²')
print(f'R² score: {r2:.2f}')


There are several metrics that you can compute. The sklearn documentation lists the ones that are already built in.
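
As an example, RMSE and the Nash-Sutcliffe Efficiency (NSE) can be computed with a few lines of numpy. The helper functions `rmse` and `nse` below are written for this illustration (they are not part of sklearn), and the observed and simulated values are invented. NSE is defined as one minus the ratio between the sum of squared errors and the sum of squared deviations of the observations from their mean, which is the same formula as sklearn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def rmse(obs, sim):
    """Root mean squared error, in the same units as the data."""
    return float(np.sqrt(mean_squared_error(obs, sim)))


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 is no better than the mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))


# Invented values, for illustration only
obs = np.array([1.0, 2.0, 3.0, 4.0])
sim = np.array([1.1, 1.9, 3.2, 3.8])

print(f'RMSE: {rmse(obs, sim):.3f}')
print(f'NSE:  {nse(obs, sim):.3f}')
print(f'R2 (sklearn): {r2_score(obs, sim):.3f}')
```

In the notebook, the same functions could be called as `rmse(y_test, y_pred)` and `nse(y_test, y_pred)`.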

We can now plot the results, to see how they compare graphically to our observations. You can plot them in a scatterplot or in a regular line plot. If you use the scatterplot, the points should lie along the 1:1 line (the bisector of the axes) for a perfect simulation.

In [None]:
# Plot actual vs predicted discharge
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, color='darkolivegreen', alpha=0.6, label='Predicted vs Actual')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='black', linestyle='--', label='Perfect Fit')
plt.xlabel('Actual Discharge (m³/s)')
plt.ylabel('Predicted Discharge (m³/s)')
plt.title('Multi-Linear Regression: Predicted vs Actual Discharge')
plt.legend()
plt.grid(True)
# save the plot in your local folder (before plt.show(), otherwise the saved image is empty)
scatter_path = os.path.join(base_path, 'LinearModel_Scatter.png')
plt.savefig(scatter_path, dpi=600, format='png')
plt.show()

In [None]:
# Plot hydrographs
plt.figure(figsize=(10, 5))
plt.plot(y_test, label='Actual Discharge', color='navy')
plt.plot(y_pred, label='Predicted Discharge', color='steelblue', linestyle='dashed')
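plt.xlabel('Time step (hours)')  # the x axis is the row index of the test set, not the date
plt.ylabel('Discharge (m³/s)')
plt.title('Actual vs Predicted Discharge Hydrograph')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
# save the plot in your local folder (before plt.show())
hydro_path = os.path.join(base_path, 'LinearModel_Hydrograph.png')
plt.savefig(hydro_path, dpi=600, format='png')
plt.show()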

# 6. Training and Testing the M5 tree model
We now train the M5 tree model, following the same procedure as before. The input features and the target do not need to change: they are the same as for the linear model.

We will first train an <b>unpruned</b> M5 tree, then we will train a <b>pruned</b> model and check the differences.

In [None]:
# Let's start with an unpruned model
m5_unpruned = M5Prime(use_pruning=False) # call the model
m5_unpruned.fit(X_train, y_train) # train the model

In [None]:
# Make predictions on the test set
y_unpruned = m5_unpruned.predict(X_test)

In [None]:
# Evaluate the model
mse = mean_squared_error(y_test, y_unpruned)
print(f'Mean Squared Error: {mse:.2f} (m³/s)²')

In [None]:
# we can print the trees
regr_1_label = 'unpruned'
print("\n----- %s" % regr_1_label)
print(m5_unpruned.as_pretty_text())

Now we repeat it with the pruned model.

In [None]:
# Let's now repeat with a pruned model
m5_pruned = M5Prime(use_pruning=True) # call the model
m5_pruned.fit(X_train, y_train) # train the model


In [None]:
# Make predictions on the test set
y_pruned = m5_pruned.predict(X_test)

In [None]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pruned)
print(f'Mean Squared Error: {mse:.2f} (m³/s)²')

In [None]:
# we can print the trees
regr_2_label = 'pruned'
print("\n----- %s" % regr_2_label)
print(m5_pruned.as_pretty_text())


<b> Which model is the best? Do you see much difference between the pruned and the unpruned versions? </b>

You can keep exploring different training strategies, such as defining a minimum number of samples per leaf (min_samples_leaf) and/or a maximum tree depth (max_depth). Both options can be used together with pruning. You can see an example below.

<b>?</b>  Try reducing min_samples_leaf while increasing max_depth. Look at the MSE: how does it change? Is the model overfitting or underfitting? Then do the opposite: increase min_samples_leaf while reducing max_depth. What changes? Is the model now underfitting or overfitting?

In [None]:
m5_leaves = M5Prime(min_samples_leaf=4,max_depth=15)
# note that in the line above you have just called the model, you have not fitted it
# on the training data yet. Remember to call model.fit(X_train, y_train) to train the model.
m5_leaves.fit(X_train,y_train)
y_leaves = m5_leaves.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_leaves)
print(f'Mean Squared Error: {mse:.2f} (m³/s)²')
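
One way to judge overfitting or underfitting is to compare the MSE on the training set with the MSE on the testing set for several settings. The sketch below shows the idea under loud assumptions: it uses synthetic data, and it swaps in scikit-learn's `DecisionTreeRegressor` as a stand-in for the M5 tree, because it exposes parameters with the same names (`min_samples_leaf`, `max_depth`). It is not the M5 algorithm, and the numbers it prints say nothing about the Sieve data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the Sieve dataset): a noisy sine curve
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
X_te, X_tr = X[:100], X[100:]
y_te, y_tr = y[:100], y[100:]

settings = [
    {'min_samples_leaf': 1, 'max_depth': 20},    # very flexible tree
    {'min_samples_leaf': 10, 'max_depth': 5},    # intermediate tree
    {'min_samples_leaf': 100, 'max_depth': 2},   # very rigid tree
]

scores = []
for params in settings:
    tree = DecisionTreeRegressor(random_state=0, **params).fit(X_tr, y_tr)
    mse_train = mean_squared_error(y_tr, tree.predict(X_tr))
    mse_test = mean_squared_error(y_te, tree.predict(X_te))
    scores.append((params, mse_train, mse_test))
    print(params, f'train MSE={mse_train:.3f}', f'test MSE={mse_test:.3f}')
```

A training MSE far below the testing MSE points to overfitting, while a high MSE on both sets points to underfitting. The same loop can be applied to `M5Prime` by replacing the model constructor and using `X_train`, `y_train`, `X_test` and `y_test` from this notebook.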

We can now plot all the model predictions together, and check (also) graphically which works best.

In [None]:
# we can merge the predictions in one pandas dataframe, and then loop through it to plot
dr = pd.DataFrame({
    'Linear': y_pred,
    'M5-unpruned': y_unpruned,
    'M5-pruned': y_pruned,
    'M5-leaves/depth': y_leaves,
})
dr['Date'] = pd.to_datetime(df['Date'].loc[:299])
# Create subplots
fig, axes = plt.subplots(4, 1, figsize=(10, 12), sharex=True)

predictions = ['Linear', 'M5-unpruned', 'M5-pruned', 'M5-leaves/depth']
labels = ['Linear', 'M5-unpruned', 'M5-pruned', 'M5-leaves/depth']
colors = ['steelblue', 'firebrick', 'olivedrab', 'orchid']
legend_handles = []
for i, ax in enumerate(axes):
    actual, = ax.plot( dr['Date'],y_test, label='Actual Discharge', color='navy',linestyle='dashed')
    predicted, = ax.plot(dr['Date'],dr[predictions[i]], label=labels[i], color=colors[i],alpha=0.7)
    ax.set_ylabel('Discharge (m³/s)')
    ax.grid(True)
    # Collect handles only from the first subplot
    if i == 0:
        legend_handles.append(actual)
    legend_handles.append(predicted)

# Global legend
fig.legend(legend_handles, ['Actual Discharge'] + labels, loc='upper center', 
           fontsize=10, ncol=5, bbox_to_anchor=(0.5, 0.95))

plt.suptitle('Comparison of Actual vs Predicted Discharge')
plt.xticks(rotation=45)
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust layout to fit global title and legend
# save the plot in your local folder (before plt.show())
subplot_path = os.path.join(base_path, 'LinearvsM5_Hydrographs.png')
plt.savefig(subplot_path, dpi=600, format='png')
plt.show()

<b> We can now save our results locally, on our computer </b>

In [None]:
# First we also add the observations to the dataframe dr with the results
dr['Obs'] = pd.DataFrame(y_test)
dr.head()

# Now we save them
results_path = os.path.join(base_path, 'M5_Linear_Predictions.xlsx')
dr.to_excel(results_path)

# 7. Conclusion
We have learnt how to split the data into training and testing sets using the pandas library. We then learnt how to call, train and test a simple linear regression model and an M5 tree model. We have explored different training options for the M5 tree. Finally, we have compared the performance of all the models using the mean squared error and graphical inspection. What can you conclude about the different models tested? Which would you choose as the best model?

You can further explore the different modelling options by reducing the number of input features used by the models. For a fair comparison, use the same input features in all models. What changes in the model performance? Is the reduction of input features worth the resulting change in MSE? Which set of input features would you recommend?