## __PROCESS ANALYTICAL TECHNOLOGY IN FERMENTATION PROCESSES__
### __Fermentation monitoring using advanced spectroscopy__

PROSYS, Department of Chemical and Biochemical Engineering, Technical University of Denmark (DTU)

__Author:__ Pau Cabaneros Lopez

#### __OVERVIEW:__

This exercise uses spectral data collected in _real-time_ during a lignocellulosic ethanol fermentation. The spectra were recorded using attenuated total reflectance mid-infrared (ATR-MIR) spectroscopy. The reference data were measured using high-performance liquid chromatography (HPLC).

The datasets used in this project consist of:

- Training dataset: containing both the spectra and the corresponding HPLC measurements.

- Fermentation dataset: containing a timeseries of spectra collected in real time, and another timeseries of off-line HPLC measurements. 

For detailed information about the training and fermentation datasets, see  _Transforming data to information: A parallel hybrid model for real-time state estimation in lignocellulosic ethanol fermentation_.



## __Step 0. Set up the environment__

When we use Google Colab, we need to set up the environment by installing the required packages and importing the necessary libraries.

### ✅ __0.1. Installing the dependencies__

In [None]:
%pip install dtuprosys

### ✅ __0.2. Importing the libraries__

In [None]:
import matplotlib.pyplot as plt

from dtuprosys.chemometrics.modelling.cross_validation import cross_validation
from dtuprosys.chemometrics.datasets import load_fermentation_data, load_train_data
from dtuprosys.chemometrics.plotting import plot_spectra, plot_predictions, plot_fermentation
from dtuprosys.chemometrics.preprocessing import RangeCut, Derivative, DriftCorrection
from mbpls.mbpls import MBPLS

## __Step 1. Explore the training data__

### ✅ __Exercise 1.1. Load the training data.__

__INSTRUCTIONS:__ load the training data and assign it to two variables: 
- ```train_spectra``` with the spectra data
- ```train_hplc``` with the HPLC data

The training data can be loaded using the ```load_train_data()``` function. An example of how to use this command is shown below:

__Usage example:__
```
train_spectra, train_hplc = load_train_data()
```

In [None]:
# Write your code below:


##### __💡 Getting stuck?__  

You can import the training data by typing the code below:

```
train_spectra, train_hplc = load_train_data()
```

### ✅ __Exercise 1.2. Inspect the training data__

__INSTRUCTIONS:__ inspect the ```train_spectra``` and ```train_hplc``` datasets.

##### _Exercise 1.2.A: Inspect the dataset ```train_spectra```._

In [None]:
# Run this cell to visualize the variable train_spectra
train_spectra

##### __💡 Getting stuck?__  

Run the cell above to display an overview of the ```train_spectra``` table.

##### _Exercise 1.2.B: Inspect the dataset ```train_hplc```._

In [None]:
# Run this cell to visualize the variable train_hplc.
# Write your code below:



##### __💡 Getting stuck?__  

To display the ```train_hplc``` table, run the following code in the cell above:

```
train_hplc
```

##### _Exercise 1.2.C: Check the sizes of the ```train_spectra``` and the ```train_hplc```._

__INSTRUCTIONS:__ Identify the dimensions of each table. This can be done using the ```.shape``` attribute.

__Example:__
To find the dimensions of the ```train_spectra``` you can use the following code:

```
dimensions_train_spectra = train_spectra.shape
```

In [None]:
# Run this cell to get the dimensions of the train_spectra table and assign them to the variable dimensions_train_spectra.
dimensions_train_spectra = 

# Find the dimensions of the train_hplc and assign it to the variable dimensions_train_hplc.
# Write your code below:
dimensions_train_hplc = 


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
print("The train_spectra table has {} rows and {} columns.".format(dimensions_train_spectra[0], dimensions_train_spectra[1]))
print("The train_hplc table has {} rows and {} columns.".format(dimensions_train_hplc[0], dimensions_train_hplc[1]))
# ------------------------------#

##### __💡 Getting stuck?__  

- To find the size of ```train_spectra```, type:

```dimensions_train_spectra = train_spectra.shape```

- To find the size of ```train_hplc```, type:

```dimensions_train_hplc = train_hplc.shape```

### ✅ __Exercise 1.3 Plot the training data__

__INSTRUCTIONS:__ plot the ```train_spectra``` and color them according to the glucose concentration. 

Plotting the spectra is a good way to get a first impression of the data. The spectra can be plotted using the ```plot_spectra()``` function.

__Usage example:__ 

```plot_spectra(train_spectra, title, x_label, y_label, reference=train_hplc)```

In [None]:
# Run this cell to visualize train_spectra.
# Write your code below:
title = ""
x_label = ""
y_label = ""


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

##### __💡 Getting stuck?__  

You can use the code below to plot the spectra.

```
title = "Spectra used for training"
x_label = "Wavenumbers"
y_label = "Absorbance"
plot_spectra(train_spectra, title, x_label, y_label, reference=train_hplc)
```

## __Step 2. Preprocess the training data__

Preprocessing the spectra is fundamental for variable selection and to remove systematic noise from the data. In this step, we will preprocess the training data to prepare it for the model training.

### ✅ __Exercise 2.1. Preprocessing the spectra__

__INSTRUCTIONS:__ preprocess the spectra by chaining three preprocessing steps:

- Range cut (between 500 and 900)
- Drift correction
- Derivative (second order)



##### _Exercise 2.1.A: Configure the preprocessing steps_

First, the different preprocessing steps need to be configured.

- Configuring the ```RangeCut()```: the range cut needs to be configured by setting the minimum and maximum wavenumbers to be considered (```min_wavenumber``` and ```max_wavenumber```, respectively).

- Configuring the ```DriftCorrection()```: the drift correction does not require any specific configuration.

- Configuring the ```Derivative()```: the derivative needs to be configured by setting the order of the derivative (```derivative_order```).
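
To make the range cut concrete, here is a minimal sketch in plain Python. It assumes each spectrum is a list of absorbances aligned with a shared list of wavenumbers. The function name `cut_range` and the toy numbers are illustrative only, and this is not necessarily how `RangeCut()` is implemented in `dtuprosys`.

```python
def cut_range(wavenumbers, spectra, min_wavenumber, max_wavenumber):
    """Keep only the columns whose wavenumber lies in [min_wavenumber, max_wavenumber]."""
    keep = [i for i, w in enumerate(wavenumbers)
            if min_wavenumber <= w <= max_wavenumber]
    cut_wavenumbers = [wavenumbers[i] for i in keep]
    cut_spectra = [[row[i] for i in keep] for row in spectra]
    return cut_wavenumbers, cut_spectra


# Toy data: two spectra measured at five wavenumbers (hypothetical values).
wavenumbers = [400, 500, 700, 900, 1100]
spectra = [
    [0.10, 0.20, 0.30, 0.40, 0.50],
    [0.15, 0.25, 0.35, 0.45, 0.55],
]

cut_wavenumbers, cut_spectra = cut_range(wavenumbers, spectra, 500, 900)
print(cut_wavenumbers)  # [500, 700, 900]
print(cut_spectra)      # [[0.2, 0.3, 0.4], [0.25, 0.35, 0.45]]
```

The number of rows (samples) is unchanged; only the columns outside the selected range are dropped.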

In [None]:
# Range cut (between 500 and 900)
min_wavenumber =
max_wavenumber =
include_range_cut = True
range_cut = RangeCut(min_wavenumber, max_wavenumber, include=include_range_cut)

# Drift correction
include_drift_correction=True
drift_correction = DriftCorrection(include=include_drift_correction)

# Derivative (second order)
derivative_order = 
include_derivative = True
derivative = Derivative(derivative_order, include=include_derivative)

##### __💡 Getting stuck?__

You can configure the preprocessing steps by typing the code below:

```
range_cut = RangeCut(min_wavenumber=500, max_wavenumber=900, include=True)
drift_correction = DriftCorrection(include=True)
derivative = Derivative(derivative_order=2, include=True)
```


#### _Exercise 2.1.B: Apply ```range_cut``` to ```train_spectra```_

In [None]:
# Write your code below:

# Apply the range cut to train_spectra using the .apply_to() method. Then assign it to the variable range_cut_spectra.
range_cut_spectra = 

# Use the plot_spectra() function to visualize the range cut.
title = ""
x_label = ""
y_label = ""


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

##### __💡 Getting stuck?__

- To apply the ```range_cut``` to the ```train_spectra``` type:

```
range_cut_spectra = range_cut.apply_to(train_spectra)
```

- To plot the spectra after the range cut, type:

```
title = "Spectra after range cut"
x_label = "Wavenumbers"
y_label = "Absorbance"
plot_spectra(range_cut_spectra, title, x_label, y_label, reference=train_hplc)
```

#### _Exercise 2.1.C: Apply ```drift_correction``` to ```range_cut_spectra```_

Drift correction is a simple way to remove scatter from the spectra. Below, apply the drift correction to ```range_cut_spectra``` and assign the result to the variable ```drift_corrected_spectra```.
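
As an illustration of the idea, the sketch below shows one common form of drift correction: subtracting the straight line that joins the first and last points of each spectrum. This is an assumption made for illustration; `DriftCorrection()` in `dtuprosys` may use a different baseline. The function name `correct_drift` and the toy values are hypothetical.

```python
def correct_drift(spectrum):
    """Subtract the straight line joining the first and last points of a spectrum."""
    n = len(spectrum)
    if n < 2:
        return [0.0] * n
    slope = (spectrum[-1] - spectrum[0]) / (n - 1)
    return [value - (spectrum[0] + slope * i) for i, value in enumerate(spectrum)]


# Toy spectrum: a peak of height 1.0 sitting on a linear baseline 0.5 + 0.25 * i.
peak = [0.0, 0.0, 1.0, 0.0, 0.0]
drifting = [p + 0.5 + 0.25 * i for i, p in enumerate(peak)]

corrected = correct_drift(drifting)
print(corrected)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

After the correction, both ends of the spectrum sit at zero and the peak is recovered without the sloping baseline.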

In [None]:
# Write your code below:

# Apply the drift correction to range_cut_spectra using the .apply_to() method. Then assign it to the variable drift_corrected_spectra.
drift_corrected_spectra = 

# Use the plot_spectra() function to visualize the drift corrected spectra.
title = ""
x_label = ""
y_label = ""


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

##### __💡 Getting stuck?__

- To apply the ```drift_correction``` to the ```range_cut_spectra``` type:

```
drift_corrected_spectra = drift_correction.apply_to(range_cut_spectra)
```

- To plot the spectra after the drift correction, type:

```
title = "Spectra after drift correction"
x_label = "Wavenumbers"
y_label = "Absorbance"
plot_spectra(drift_corrected_spectra, title, x_label, y_label, reference=train_hplc)
```


#### _Exercise 2.1.D: Apply ```derivative``` to ```drift_corrected_spectra```_

Differentiating the spectra is a very common way to remove additive and multiplicative scatter. Below, apply the derivative to ```drift_corrected_spectra``` and assign the result to the variable ```derivative_spectra```.
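
The effect on additive baselines can be seen with a small sketch that uses plain central finite differences. This is an assumption for illustration: spectral derivatives are often computed with a smoothing filter instead, and the implementation behind `Derivative()` may differ. The function name `second_difference` and the toy values are hypothetical.

```python
def second_difference(spectrum):
    """Central second difference; the result is two points shorter than the input."""
    return [spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
            for i in range(1, len(spectrum) - 1)]


peak = [0.0, 1.0, 4.0, 1.0, 0.0]
# Same peak on a constant offset plus a sloping baseline (hypothetical values).
with_baseline = [p + 2.0 + 0.5 * i for i, p in enumerate(peak)]

print(second_difference(peak))           # [2.0, -6.0, 2.0]
print(second_difference(with_baseline))  # [2.0, -6.0, 2.0]
```

Both spectra give the same second derivative, because a constant offset and a linear slope vanish when differentiated twice.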

In [None]:
# Write your code below:

# Apply the derivative to drift_corrected_spectra using the .apply_to() method and assign it to the variable derivative_spectra.
derivative_spectra =

# Use the plot_spectra() function to visualize the derivative spectra.
title = ""
x_label = ""
y_label = ""


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

##### __💡 Getting stuck?__

- To apply the ```derivative``` to the ```drift_corrected_spectra``` type:

```
derivative_spectra = derivative.apply_to(drift_corrected_spectra)
```

- To plot the spectra after the derivative, type:

```
title = "Spectra after derivative"
x_label = "Wavenumbers"
y_label = "Absorbance"
plot_spectra(derivative_spectra, title, x_label, y_label, reference=train_hplc)
```

##### _Exercise 2.1.E: Define the ```preprocessed_spectra``` variable_

In [None]:
# Define the preprocessed_spectra variable as the derivative_spectra.
preprocessed_spectra = derivative_spectra

## __Step 3. Train the model__

The next step is training the PLS model. To do so, we first need to specify the number of components to use in the PLS model. This can be achieved using a cross-validation procedure (Exercise 3.1). Then, we can train the PLS model (Exercise 3.2).

### ✅ __Exercise 3.1. Cross-validation__

__INSTRUCTIONS:__ perform a cross-validation to find the optimal number of components to use in the PLS model. The cross-validation can be performed using the ```cross_validation()``` function. This function will output a plot showing the RMSE (Root Mean Squared Error) and the RMSECV (Root Mean Squared Error of Cross-Validation) for different numbers of components.

__Usage example:__ 

```cross_validation(preprocessed_spectra, train_hplc)```
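
To show how RMSE and RMSECV differ, here is a minimal sketch in plain Python. It is not the `cross_validation()` implementation: a one-variable least-squares line stands in for the PLS model, the folds are interleaved, and all function names and data values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_line(x, y):
    """Least-squares fit of y = a + b * x; stands in for the PLS model."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmsecv(x, y, n_folds):
    """Predict every sample from a model fitted without its fold, then pool the errors."""
    held_out_predictions = [None] * len(x)
    for fold in range(n_folds):
        test = [i for i in range(len(x)) if i % n_folds == fold]
        train = [i for i in range(len(x)) if i % n_folds != fold]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            held_out_predictions[i] = a + b * x[i]
    return rmse(y, held_out_predictions)


# Toy data: one spectral feature against a concentration (hypothetical values).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

a, b = fit_line(x, y)
calibration_error = rmse(y, [a + b * xi for xi in x])
cv_error = rmsecv(x, y, n_folds=3)
print(f"RMSE = {calibration_error:.3f}, RMSECV = {cv_error:.3f}")
```

The RMSE is computed on the same samples used for fitting, whereas the RMSECV only uses predictions for samples the model has not seen. When the RMSECV stops decreasing (or starts increasing) as components are added, the extra components are likely fitting noise.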

In [None]:
# Cross validation
# Write your code below:


# ! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

According to your cross-validation, what is the number of latent variables that you would select for this model? 

In [None]:
# Write your code below:
nr_latent_variables =

# ! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
print("The number of latent variables is {}.".format(nr_latent_variables))
# ------------------------------#

##### __💡 Getting stuck?__

You can use the code below to perform the cross-validation.

```
cross_validation(preprocessed_spectra, train_hplc)
```

### ✅ __Exercise 3.2. Training the PLS model__

__INSTRUCTIONS:__ train the PLS model using the ```MBPLS``` class. This allows training the PLS model and predicting the glucose concentration. The model is then evaluated using the ```plot_predictions()``` function, which will show two outputs:

- A plot showing the predicted glucose concentration and the measured glucose concentration.
- The RMSE (Root Mean Squared Error) of the model.

__Usage example:__ 

```plot_predictions(predictions, train_hplc)```
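
For intuition about what a PLS model does, the sketch below fits a single-component PLS regression for one response variable using plain Python lists. It is a simplified illustration, not the `MBPLS` implementation, which extracts several components. All function names and data values are hypothetical.

```python
import math


def column_means(rows):
    """Mean of every column of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_pls1_one_component(X, y):
    """Fit a single-component PLS regression for one response variable."""
    x_mean = column_means(X)
    y_mean = sum(y) / len(y)
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(len(x_mean))]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores (projection of each sample on w) and regression of y on the scores.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}


def predict_pls1(model, X):
    """Project new samples on the weight vector and scale the score by q."""
    return [
        model["y_mean"] + model["q"] * sum(
            (v - m) * wj for v, m, wj in zip(row, model["x_mean"], model["w"]))
        for row in X
    ]


# Toy data: three variables per spectrum, y = 2 * first column (hypothetical values).
X = [[1.0, 0.5, 0.0], [2.0, 1.0, 0.0], [3.0, 1.5, 0.0], [4.0, 2.0, 0.0]]
y = [2.0, 4.0, 6.0, 8.0]

toy_model = fit_pls1_one_component(X, y)
toy_predictions = predict_pls1(toy_model, X)
print([round(p, 6) for p in toy_predictions])  # [2.0, 4.0, 6.0, 8.0]
```

In this toy case a single component reproduces the response exactly, because all the variation in the spectra lies along one direction. Real spectra usually need more components, which is why the cross-validation step comes first.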



In [None]:
# Define the MBPLS model with the number of latent variables you have chosen. 
model = MBPLS(n_components=nr_latent_variables, method='NIPALS')

# Use the fit_predict() method to 
# - train the model
# - predict the glucose concentration from the preprocessed spectra
predictions = model.fit_predict(preprocessed_spectra, train_hplc)

# Evaluate the performance of the model using the plot_predictions() function.
# Write your code below:


# ! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
plt.show()
# ------------------------------#

##### __💡 Getting stuck?__

The model definition is already provided in the code above; you just need to evaluate the model using the ```plot_predictions()``` function.

```
plot_predictions(predictions, train_hplc)
```

## __Step 4. Explore the fermentation data__

### ✅ __Exercise 4.1. Load and inspect the fermentation data__

##### _Exercise 4.1.A: Load the fermentation data_

__INSTRUCTIONS:__ load the fermentation data using the ```load_fermentation_data()``` function. This function will return two variables:

- ```fermentation_spectra```: the spectra of the fermentation data.
- ```fermentation_hplc```: the glucose concentration of the fermentation data.

__Usage example:__ 

```fermentation_spectra, fermentation_hplc = load_fermentation_data()```

In [None]:
# Write your code below:

##### __💡 Getting stuck?__

You can load the fermentation data using the code below:

```
fermentation_spectra, fermentation_hplc = load_fermentation_data()
```

##### _Exercise 4.1.B: Inspect the fermentation data_

In [None]:
# Find the dimensions of the fermentation_spectra and the fermentation_hplc and assign them to the variables 
# - dimensions_fermentation_spectra  
# - dimensions_fermentation_hplc

# Write your code below:
dimensions_fermentation_spectra =
dimensions_fermentation_hplc = 


#! DO NOT MODIFY THE CODE BELOW !
# ------------------------------#
print("The fermentation_spectra table has {} rows and {} columns.".format(dimensions_fermentation_spectra[0], dimensions_fermentation_spectra[1]))
print("The fermentation_hplc table has {} rows and {} columns.".format(dimensions_fermentation_hplc[0], dimensions_fermentation_hplc[1]))
# ------------------------------#

##### __💡 Getting stuck?__

You can find the dimensions of the fermentation spectra and HPLC data using the ```.shape``` attribute. Type the code below:

```
dimensions_fermentation_spectra = fermentation_spectra.shape
dimensions_fermentation_hplc = fermentation_hplc.shape
```


## __Step 5. Preprocess the fermentation data__

### ✅ __Exercise 5.1. Preprocess the data using the sequence defined in Step 2__

__INSTRUCTIONS:__ preprocess the fermentation data using the sequence defined in Step 2. Remember that the sequence was the following:

- ```range_cut```
- ```drift_correction```
- ```derivative```

and they can be applied using the ```.apply_to()``` method.
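
The reason for reusing the already configured steps is that new data must be treated exactly like the training data. The sketch below illustrates the pattern with two stand-in step classes that expose an `.apply_to()` method. The classes `Offset` and `Scale` and the helper `apply_sequence` are hypothetical and only mimic the interface; they are not part of `dtuprosys`.

```python
class Offset:
    """Stand-in preprocessing step: subtracts a constant from every value."""

    def __init__(self, amount):
        self.amount = amount

    def apply_to(self, spectra):
        return [[v - self.amount for v in row] for row in spectra]


class Scale:
    """Stand-in preprocessing step: multiplies every value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def apply_to(self, spectra):
        return [[v * self.factor for v in row] for row in spectra]


def apply_sequence(steps, spectra):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        spectra = step.apply_to(spectra)
    return spectra


# The steps are configured once and reused for both datasets (hypothetical values).
steps = [Offset(1.0), Scale(2.0)]
toy_train = [[1.0, 2.0]]
toy_new = [[3.0, 5.0]]

print(apply_sequence(steps, toy_train))  # [[0.0, 2.0]]
print(apply_sequence(steps, toy_new))    # [[4.0, 8.0]]
```

The order of the steps matters: applying `Scale` before `Offset` would give different values, so the fermentation data should go through the steps in the same order as the training data.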

In [None]:
# Write your code below:
# Range cut
fermentation_range_cut = 

# Drift correction
fermentation_drift_corrected =

# Derivative
fermentation_derivative = 

# In this cell, we will define the variable fermentation_processed_spectra as the output of the final preprocessing step.
fermentation_processed_spectra =

##### __💡 Getting stuck?__

You can apply the preprocessing steps using the code below:
    
```
fermentation_range_cut = range_cut.apply_to(fermentation_spectra)
fermentation_drift_corrected = drift_correction.apply_to(fermentation_range_cut)
fermentation_derivative = derivative.apply_to(fermentation_drift_corrected)
```
and then assign the result to the variable ```fermentation_processed_spectra```.

```
fermentation_processed_spectra = fermentation_derivative
```

## __Step 6. Predict the glucose concentration from the fermentation data__

### ✅ __Exercise 6.1. Predict the glucose concentration with the PLS model trained in Step 3__

__INSTRUCTIONS:__ predict the glucose concentration from the fermentation data using the PLS model trained in Step 3. The model can be applied using the ```.predict()``` method to the ```fermentation_processed_spectra```.

In [None]:
# Write your code below:
fermentation_predictions = 

##### __💡 Getting stuck?__

You can apply the model using the code below:

```
fermentation_predictions = model.predict(fermentation_processed_spectra)
```

### ✅ __Exercise 6.2. Evaluate the predictions of the PLS model__

__INSTRUCTIONS:__ evaluate the predictions using the ```plot_fermentation()``` function. This function will show two outputs:

- A plot showing the predicted glucose concentration and the measured glucose concentration over time.
- The RMSE (Root Mean Squared Error) of the model.

__Usage example:__ 

```plot_fermentation(fermentation_predictions, fermentation_hplc)```
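
Because the spectra are collected in real time while the HPLC samples are taken off-line, the two series may not share the same time points. The sketch below shows one possible way to compare them: each reference measurement is matched to the prediction closest in time before computing the RMSE. This is an assumption for illustration; `plot_fermentation()` may handle the alignment differently, and the function names, timestamps and values are hypothetical.

```python
import math


def nearest_index(times, target):
    """Index of the time point closest to the target time."""
    return min(range(len(times)), key=lambda i: abs(times[i] - target))


def rmse_at_reference_times(pred_times, predictions, ref_times, references):
    """RMSE between each reference value and the prediction nearest in time."""
    matched = [predictions[nearest_index(pred_times, t)] for t in ref_times]
    squared = [(m - r) ** 2 for m, r in zip(matched, references)]
    return math.sqrt(sum(squared) / len(squared))


# Toy series: frequent predictions and sparse reference measurements.
pred_times = [0.0, 0.5, 1.0, 1.5, 2.0]
toy_predictions = [30.0, 27.0, 24.0, 20.0, 17.0]
ref_times = [0.0, 1.0, 2.0]
toy_references = [31.0, 24.0, 16.0]

error = rmse_at_reference_times(pred_times, toy_predictions, ref_times, toy_references)
print(round(error, 3))  # 0.816
```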

In [None]:
# Write your code below:


##### __💡 Getting stuck?__

You can evaluate the model using the code below:

```
plot_fermentation(fermentation_predictions, fermentation_hplc)
```

### END OF EXERCISE