# Soft sensor use case for predicting pH of a solution

In this notebook we create and train a model that is intended to run inference on AI Inference Server.

## Loading the historical data

As we discussed in the [Soft Sensor Readme](../README.md), our hypothetical scenario contains three containers of liquids.  
We want to measure the pH value of container C, but we cannot do it directly with a sensor.  
Instead, we can measure and control various other parameters of the system that can affect the pH value. The recorded values are saved in the file `historical_data.csv`.

In [None]:
import pandas

training_data = pandas.read_csv('../data/historical_data.csv')
training_data.describe()

The loaded DataFrame contains historical data, including:

- temperature measurements from the three containers (`temperature_A`, `temperature_B`, `temperature_C`)
- the valve positions that regulate the flow rate from containers A and B to container C (`valve_position_A` and `valve_position_B`)

Furthermore, the DataFrame includes the pH value of the liquid in container C (`ph_C`).  
This data is intended for training purposes only. In a real-world scenario, obtaining this specific data can be challenging and/or costly, which justifies the use of soft sensors.
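
Before training, it can help to confirm that the loaded DataFrame actually has the columns described above and to see how many values are missing. The following is a minimal sketch, not part of the original workflow: the helper `check_training_frame` and the tiny `demo` frame are hypothetical and exist only to illustrate the check.

```python
import pandas

# Column names taken from the description above.
EXPECTED_COLUMNS = [
    "temperature_A", "temperature_B", "temperature_C",
    "valve_position_A", "valve_position_B", "ph_C",
]


def check_training_frame(df):
    """Hypothetical helper: list missing columns and count NaNs per present column."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    present = [c for c in EXPECTED_COLUMNS if c in df.columns]
    nan_counts = df[present].isna().sum().to_dict()
    return {"missing_columns": missing, "nan_counts": nan_counts}


# Made-up values, used only to exercise the helper.
# 'valve_position_B' is left out on purpose and one pH value is missing.
demo = pandas.DataFrame({
    "temperature_A": [20.0, 21.5],
    "temperature_B": [30.0, 29.0],
    "temperature_C": [25.0, 25.5],
    "valve_position_A": [0.4, 0.6],
    "ph_C": [7.0, None],
})

report = check_training_frame(demo)
print(report)
```

In the notebook, the same helper could be called as `check_training_frame(training_data)` right after loading the CSV file.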


## Splitting the Data

We hold out 20% of the data as a test set, which we use later to evaluate the trained model on samples it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split

input_tags = ["temperature_A", "temperature_B", "temperature_C", "valve_position_A", "valve_position_B"]
X, y = training_data[input_tags], training_data['ph_C']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
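
To make the effect of `test_size=0.2` and `random_state=42` concrete, the sketch below runs the same split on a small synthetic array instead of the historical data. The names `X_demo` and `y_demo` and the 100-row size are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target: 100 rows, 2 columns.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same arguments as in the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the split with the same random_state selects the same rows.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}")
print("reproducible:", np.array_equal(y_te, y_te_again))
```

Fixing `random_state` keeps the train and test rows identical between notebook runs, so error values from different runs remain comparable.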


## Training a Prediction Model

We hypothesize that the pH value of the liquid in container C exhibits a linear relationship with the measurements from other sensors.  
Consequently, we have selected a linear regression model to capture this dependency.

In [None]:
from sklearn.linear_model import LinearRegression

# Create a linear regression model
linear_reg = LinearRegression()
# Fit the model on the training data
linear_reg.fit(X_train, y_train)
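
A fitted `LinearRegression` predicts with a weighted sum of the inputs plus an intercept. The sketch below illustrates this on synthetic, noise-free data. The coefficient values, the intercept of 7.0, and the names `X_demo`, `y_demo`, and `demo_model` are made up for the example and say nothing about the real process.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Five synthetic input columns, mirroring the five input tags of the notebook.
X_demo = rng.normal(size=(200, 5))

# Assumed "true" linear relationship, chosen arbitrarily for illustration.
true_coef = np.array([0.5, -0.2, 0.1, 1.0, -1.5])
true_intercept = 7.0
y_demo = X_demo @ true_coef + true_intercept

demo_model = LinearRegression().fit(X_demo, y_demo)

# predict() is equivalent to: X @ coef_ + intercept_
manual_prediction = X_demo @ demo_model.coef_ + demo_model.intercept_

print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
```

On noise-free data the fit recovers the assumed coefficients up to floating-point precision. On real measurements the coefficients are only estimates, and their magnitudes depend on the units of each input.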

## Measure the error

In this section, we measure the error on both the training and testing datasets using the mean squared error (MSE). MSE is a common metric for evaluating regression models: the lower the MSE, the more closely the model's predictions match the observed values.

Furthermore, when the MSE values for the training and testing datasets are close to each other, it suggests that the model generalizes well to new, unseen data.

In [None]:
from sklearn.metrics import mean_squared_error

# Predict on the training set
y_train_pred = linear_reg.predict(X_train)

# Predict on the testing set
y_test_pred = linear_reg.predict(X_test)

# Measure the prediction error using mean squared error
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f"Train MSE = {train_mse:.5f}\nTest MSE = {test_mse:.5f}")

# Print the linear regression coefficients
print("Coefficients:", linear_reg.coef_)
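
For reference, MSE is the average of the squared differences between true and predicted values. The sketch below computes it by hand with the standard library only. The helper `mean_squared_error_manual` and the four sample values are hypothetical. Taking the square root gives the root mean squared error (RMSE), which is expressed in the same units as the target (here, pH units) and is often easier to interpret.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Hypothetical helper: mean of the squared residuals."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


# Made-up pH values, used only to demonstrate the calculation.
y_true_demo = [7.0, 6.5, 7.2, 6.8]
y_pred_demo = [7.1, 6.4, 7.0, 6.9]

mse_demo = mean_squared_error_manual(y_true_demo, y_pred_demo)
rmse_demo = math.sqrt(mse_demo)

print(f"MSE = {mse_demo:.4f}, RMSE = {rmse_demo:.4f} pH units")
```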

## Visualize the data

Here we plot the first 200 true pH values of the test set against the corresponding predictions of our linear regression model.

In [None]:
import matplotlib.pyplot as plt

# Plot the true values and predictions for the test set
plt.figure(figsize=(14, 7))
plt.plot(y_test.values[:200], label='True ph_C values')
plt.plot(y_test_pred[:200], label='Predicted ph_C values')
plt.legend()
plt.xlabel('Sample index')
plt.ylabel('ph_C')
plt.title('True vs Predicted ph_C values')
plt.show()

## Save the model

Once the model is acceptable, we can save it into a joblib file. 

In [None]:
import joblib

model_path = "../models/model.joblib"
# joblib.dump accepts a file path directly, so no separate open() call is needed
joblib.dump(linear_reg, model_path, compress=9)
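
A quick way to check a saved model is to load it back and compare its predictions with those of the original object. The sketch below does this with a temporary directory and a small synthetic model, so it does not touch `../models/model.joblib`. The names `demo_model`, `demo_path`, and `restored` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Small synthetic model standing in for the trained linear_reg.
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1, 1.0, -1.5]) + 7.0
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.joblib")
    # Same call pattern as in the notebook cell above.
    joblib.dump(demo_model, demo_path, compress=9)
    restored = joblib.load(demo_path)

# The restored model should reproduce the original predictions.
round_trip_ok = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print("round trip ok:", round_trip_ok)
```

Note that joblib files are pickle-based, so a model should be loaded with library versions compatible with those used to save it.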

Notebook [20-CreateInferenceWrapper](20-CreateInferenceWrapper.ipynb) shows how to create a Python wrapper around the model.