# Training a Machine Learning Surrogate

To train a machine learning surrogate, we first import the required libraries. We use `numpy` and `pandas` for data manipulation and `tensorflow` for constructing and training the neural network. We also use the `KerasSurrogate` class from the `idaes.core.surrogate.keras_surrogate` module in IDAES for surrogate modeling.

In [None]:
# Import statements
import os
import numpy as np
import pandas as pd
import random as rn
import tensorflow as tf

# Import IDAES libraries
from idaes.core.surrogate.sampling.data_utils import split_training_validation
from idaes.core.surrogate.sampling.scaling import OffsetScaler
from idaes.core.surrogate.keras_surrogate import (
    KerasSurrogate,
    save_keras_json_hd5,
    load_keras_json_hd5,
)
from idaes.core.surrogate.plotting.sm_plotter import (
    surrogate_scatter2D,
    surrogate_parity,
    surrogate_residual,
)


from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Fix environment variables and random seeds to ensure consistent neural network training
os.environ["PYTHONHASHSEED"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
np.random.seed(46)
rn.seed(1342)
tf.random.set_seed(62)

After importing the libraries, we read the data generated by the Monte Carlo simulation of the desalination model from an Excel file. We then remove any points where the model was infeasible using the pandas `dropna` function. Because the number of simulations is only on the order of 1e3, we use the entire dataset; users can instead choose how many samples to use for training the surrogate model. Finally, we define the input and output variables for the surrogate model and extract the input and output labels from the column names of the dataset.

In [None]:
# Import desalination (MVC) training data
np.set_printoptions(precision=6, suppress=True)

# Read the data from Excel, drop infeasible rows, and shuffle by sampling all rows.
# mvc_simulations.xlsx is a placeholder file name;
# replace it with the actual file name.
csv_data = pd.read_excel("./mvc_simulations.xlsx")
csv_data.dropna(inplace=True)
data = csv_data.sample(n=len(csv_data))

# Defining the input and the output columns
input_data = data[["Inlet", "Recovery", "Flow"]]
output_data = data[["CAPEX", "OPEX"]]

# Define labels
input_labels = input_data.columns
output_labels = output_data.columns
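
If only part of the dataset should be used for training, a fixed-size random subset can be drawn instead of sampling every row. The sketch below illustrates the idea on a small synthetic DataFrame that stands in for the Excel file; the values, the `n_samples` size, and `random_state=0` are illustrative assumptions, not settings from the original workflow.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the simulation results (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "Inlet": rng.uniform(0, 200, size=20),
        "Recovery": rng.uniform(0.05, 0.95, size=20),
        "Flow": rng.uniform(0, 29, size=20),
        "CAPEX": rng.uniform(1, 10, size=20),
        "OPEX": rng.uniform(1, 10, size=20),
    }
)
# Mark three runs as infeasible by blanking their outputs.
demo.loc[[2, 7, 11], ["CAPEX", "OPEX"]] = np.nan

feasible = demo.dropna()  # drop the infeasible rows
n_samples = 10  # hypothetical training-set size
subset = feasible.sample(n=n_samples, random_state=0)  # reproducible random draw

print(len(demo), len(feasible), len(subset))
```

Passing `random_state` makes the draw repeatable, which helps when comparing surrogates trained on the same subset.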

We display the DataFrame to confirm that the data was read as expected.

In [None]:
# Visualizing the dataframe to make sure things are in order
data

The next step is to set up the neural network architecture, where we specify the activation function, optimizer, number of hidden layers, and number of nodes per layer. These are hyperparameters that should be configured by the user. We then define the loss function and the metrics for evaluating the model's performance. Since this is a regression model, we use mean squared error (MSE) as the loss function, and mean absolute error (MAE) and mean squared error (MSE) as the metrics.
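
To get a feel for the size of the network these hyperparameters produce, the number of trainable parameters can be counted by hand. The sketch below uses the values from the training cell (3 inputs, 3 hidden layers of 10 nodes, 2 outputs) and assumes fully connected layers with one bias per node; it is a plain arithmetic check, not a call into Keras.

```python
# Layer sizes taken from the training cell: 3 inputs, 3 hidden layers
# of 10 nodes each, and 2 outputs.
n_inputs, n_hidden_layers, n_nodes_per_layer, n_outputs = 3, 3, 10, 2

layer_sizes = [n_inputs] + [n_nodes_per_layer] * n_hidden_layers + [n_outputs]

# Each dense layer has (fan_in * fan_out) weights plus fan_out biases.
params_per_layer = [
    fan_in * fan_out + fan_out
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [40, 110, 110, 22]
print(total_params)      # 282
```

With roughly 1e3 training samples, a network of this size has fewer parameters than data points, which is one reason a small architecture is a reasonable starting point here.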

Next, we assign the input and output data to the variables `x` and `y`, respectively. We use a normalizing scaler, which maps each column to the range [0, 1] by subtracting the column minimum and dividing by the column range (maximum minus minimum). We then split the dataset into training and testing sets, using an 80/20 split.
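
The min-max formula noted in the training cell's comment, (x - xmin)/(xmax - xmin), can be checked on a toy array. The sketch below uses plain NumPy with made-up values; it mirrors the formula only and is not the IDAES `OffsetScaler` implementation.

```python
import numpy as np

# Toy input matrix: rows are samples, columns are variables (made-up values).
x = np.array([[0.0, 0.1], [50.0, 0.5], [200.0, 0.9]])

x_min = x.min(axis=0)  # per-column minimum
x_max = x.max(axis=0)  # per-column maximum

# Forward transform: every column is mapped onto [0, 1].
x_scaled = (x - x_min) / (x_max - x_min)

# Inverse transform: recover the original units from the scaled values.
x_restored = x_scaled * (x_max - x_min) + x_min

print(x_scaled)
```

The inverse transform matters because the network predicts in scaled units; the output scaler is what converts predictions back to the original units.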

We create a Keras `Sequential` object and add the input layer, followed by the hidden layers, and finally the output layer. We compile the model by specifying the loss function, optimizer, and metrics. We also set up a checkpoint to save the state of the model, storing the best-known weights based on the minimum validation loss in the `.model_checkpoint.hdf5` file.

Next, we fit the model on the training data, using the held-out 20% as validation data, and specify the number of epochs and the checkpoint callback.

We then set lower and upper bounds for each input variable. In the code below the bounds are hard-coded; they could instead be taken from the minimum and maximum of each input column in the dataset. The bounds are passed to the `KerasSurrogate` object along with the neural network model, the input and output labels, and the input and output scalers. Finally, we save the `KerasSurrogate` object to a folder.

In [None]:
# Select the hyperparameters to be used for training the model
activation, optimizer, n_hidden_layers, n_nodes_per_layer = "relu", "Adam", 3, 10
loss, metrics = "mse", ["mae", "mse"]

# Create data objects for training using min-max normalization
n_inputs = len(input_labels)
n_outputs = len(output_labels)
x = input_data
y = output_data

# Scale the data with a normalizing (min-max) scaler: (x - xmin) / (xmax - xmin),
# then convert the scaled DataFrames to NumPy arrays.
input_scaler = OffsetScaler.create_normalizing_scaler(x)
output_scaler = OffsetScaler.create_normalizing_scaler(y)
x = input_scaler.scale(x)
y = output_scaler.scale(y)
x = x.to_numpy()
y = y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Create Keras Sequential object and build neural network
model = tf.keras.Sequential()
model.add(
    tf.keras.layers.Dense(
        units=n_nodes_per_layer, input_dim=n_inputs, activation=activation
    )
)
# The first hidden layer was added above, so add the remaining n_hidden_layers - 1
for _ in range(1, n_hidden_layers):
    model.add(tf.keras.layers.Dense(units=n_nodes_per_layer, activation=activation))
model.add(tf.keras.layers.Dense(units=n_outputs))

# Train surrogate (calls optimizer on neural network and solves for weights)
model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
mcp_save = tf.keras.callbacks.ModelCheckpoint(
    ".model_checkpoint.hdf5", save_best_only=True, monitor="val_loss", mode="min"
)
history = model.fit(
    x=X_train,
    y=y_train,
    validation_data=(X_test, y_test),
    verbose=2,
    epochs=500,
    callbacks=[mcp_save],
)
# Define input bounds, create the callable surrogate object, and save it to a folder
xmin, xmax = [0.00, 0.05, 0], [200, 0.95, 29.00]
input_bounds = {input_labels[i]: (xmin[i], xmax[i]) for i in range(len(input_labels))}

keras_surrogate = KerasSurrogate(
    model,
    input_labels=list(input_labels),
    output_labels=list(output_labels),
    input_bounds=input_bounds,
    input_scaler=input_scaler,
    output_scaler=output_scaler,
)
keras_surrogate.save_to_folder("keras_surrogate")
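
The bounds above are hard-coded. If they should instead follow the sampled data, they can be derived from the column minima and maxima. The sketch below shows one way to do that on a small made-up DataFrame; the `inputs` frame and the `data_bounds` name are illustrative, and only the column names match the training data.

```python
import pandas as pd

# Made-up input samples using the same column names as the training data.
inputs = pd.DataFrame(
    {
        "Inlet": [35.0, 120.0, 70.0],
        "Recovery": [0.40, 0.70, 0.55],
        "Flow": [5.0, 20.0, 12.5],
    }
)

# Map each input label to its (min, max) over the sampled data.
data_bounds = {
    label: (float(inputs[label].min()), float(inputs[label].max()))
    for label in inputs.columns
}

print(data_bounds)
```

Data-derived bounds keep the surrogate inside the region it was trained on, whereas wider hard-coded bounds allow extrapolation beyond the sampled points.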

Now, we load the model from the folder and evaluate it using the entire dataset to assess the model's fit. This is done using the $R^2$ metric, where a value closer to 1 indicates a better fit.

In [None]:
# Loading the model from the folder and evaluating it.
surr = KerasSurrogate.load_from_folder("keras_surrogate")
y_true = data[list(output_labels)]  ### The true values, selected by column name
y_pred = surr.evaluate_surrogate(input_data)  ### The predicted values

# Compute the R2 score to assess the fit of the model.
r = r2_score(y_true, y_pred)
print(f"The R2 score = {r:.4f}")
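
Because the surrogate has two outputs, the single number reported above is an aggregate: with its default `multioutput="uniform_average"` setting, `r2_score` computes $R^2$ for each output column and averages them. The sketch below reproduces that calculation by hand in NumPy on made-up values, so a poor fit on one output is not hidden by a good fit on the other.

```python
import numpy as np

# Made-up true and predicted values for two outputs (one per column).
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_pred = np.array([[1.1, 12.0], [1.9, 18.0], [3.2, 33.0], [3.8, 37.0]])

# Residual and total sums of squares, computed per output column.
ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)

r2_per_output = 1.0 - ss_res / ss_tot  # one R2 value per output
r2_mean = r2_per_output.mean()         # uniform average over outputs

print(r2_per_output, r2_mean)
```

In scikit-learn, the per-output values can be obtained directly by passing `multioutput="raw_values"` to `r2_score`.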

Next, we visualize the model using the `surrogate_scatter2D` plot to examine the fit. We also use the `surrogate_parity` plot to check the deviation of the model predictions from the ground truth and the `surrogate_residual` plot to analyze the residuals.

In [None]:
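# The plotting helpers expect an unscaled DataFrame that contains both the
# input and output columns, so we pass `data` rather than the scaled X_train array.
surrogate_scatter2D(keras_surrogate, data, filename="keras_train_scatter2D.pdf")
surrogate_parity(keras_surrogate, data, filename="keras_train_parity.pdf")
surrogate_residual(keras_surrogate, data, filename="keras_train_residual.pdf")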