## Fermentation PAT course 2022
### Application of PAT to fermentation processes

Technical University of Denmark (DTU)

Department of Chemical and Biochemical Engineering, PROSYS

__Author:__ Pau Cabaneros Lopez

This exercise uses spectral data collected in _real time_ during a lignocellulosic ethanol fermentation. The spectra were acquired using attenuated total reflectance mid-infrared (ATR-MIR) spectroscopy. The reference data were measured using high-performance liquid chromatography (HPLC).

The datasets used in this project consist of:

- Training dataset: contains both the spectra and the corresponding HPLC measurements.

- Fermentation dataset: contains a time series of spectra collected in real time and a time series of off-line HPLC measurements.

For detailed information about the training and fermentation datasets, see _Transforming data to information: A parallel hybrid model for real-time state estimation in lignocellulosic ethanol fermentation_.



## Step 0. Set up the environment

When using Google Colab, we need to set up the environment by installing the required packages and importing the necessary libraries.

### 0.1. Installing the dependencies

!pip install dtuprosys

### 0.2. Importing the libraries

In [1]:
from dtuprosys.chemometrics.modelling.cross_validation import cross_validation
from dtuprosys.chemometrics.datasets import load_fermentation_data, load_train_data
from dtuprosys.chemometrics.plotting import plot_spectra, plot_predictions, plot_fermentation
from dtuprosys.chemometrics.preprocessing import RangeCut, Derivative, DriftCorrection
from mbpls.mbpls import MBPLS

## Step 1. Explore the training data

### 1.1. Load the training data

The training data is loaded directly using the ```load_train_data()``` function. In the cell below, create two variables, ```train_spectra``` and ```train_hplc```, to store the training data.

e.g. ```train_spectra, train_hplc = load_train_data()```

In [3]:
# Write your code below:


### 1.2. Inspect the training data

To inspect the training data, we will first visualize each table. Below I show an example of how to visualize the training spectra:

In [None]:
# Run this cell to visualize the variable train_spectra
train_spectra

In [None]:
# Run this cell to visualize the variable train_hplc.
# Write your code below:


Then, we would like to identify the dimensions of each table. We can do so with the ```.shape``` attribute. Below I show an example of how to inspect the dimensions of the training spectra:

In [None]:
# Run this cell to get the dimensions of the train_spectra table and assign them to the variable dimensions_train_spectra.
dimensions_train_spectra = train_spectra.shape

# Find the dimensions of the train_hplc table and assign them to the variable dimensions_train_hplc.
# Write your code below:


print("The train_spectra table has {} rows and {} columns.".format(dimensions_train_spectra[0], dimensions_train_spectra[1]))
print("The train_hplc table has {} rows and {} columns.".format(dimensions_train_hplc[0], dimensions_train_hplc[1]))

### 1.3 Plot the training data

Plotting the spectra is a good way to get a first impression of the data. Fill in the code below to show the plots of the ```train_spectra``` using the function ```plot_spectra()```.

In [None]:
# Run this cell to visualize train_spectra.
# Write your code below:
title = ""
x_label = ""
y_label = ""



## Step 2. Preprocess the training data

Preprocessing is essential for removing noise and improving the quality of the data. In this step, we will preprocess the training data to prepare it for model training.

### 2.1. Range Cut

Cut the spectra to the range of interest. You are free to choose this range, but the region between 900 and 1700 cm⁻¹ is recommended.
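
As a rough illustration of what a range cut does, the sketch below keeps only the columns of a spectra table whose wavenumber falls inside the chosen window. The wavenumber axis and the spectra are synthetic, and treating both bounds as inclusive is an assumption; ```RangeCut``` may handle the endpoints differently.

```python
import numpy as np

# Hypothetical wavenumber axis (cm^-1) and synthetic spectra; this is not the course data.
wavenumbers = np.arange(800, 1901, 100)            # 800, 900, ..., 1900
rng = np.random.default_rng(0)
spectra = rng.random((3, wavenumbers.size))        # 3 spectra, one column per wavenumber

min_wavenumber, max_wavenumber = 900, 1700

# Boolean mask selecting the columns inside the range of interest
# (inclusive bounds are an assumption made for this sketch).
mask = (wavenumbers >= min_wavenumber) & (wavenumbers <= max_wavenumber)

cut_wavenumbers = wavenumbers[mask]
cut_spectra = spectra[:, mask]

print("Wavenumbers kept:", cut_wavenumbers)
print("Shape before:", spectra.shape, "after:", cut_spectra.shape)
```

The number of rows (spectra) is unchanged; only the columns outside the window are dropped.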


In [None]:
# Write your code below:
# Range cut
min_wavenumber =
max_wavenumber =
range_cut = RangeCut(min_wavenumber, max_wavenumber)

# Apply the range cut to train_spectra using the .apply_to() method and assign it to the variable range_cut_spectra.
range_cut_spectra = range_cut.apply_to(train_spectra)

# Use the plot_spectra() function to visualize the range cut.
title = ""
x_label = ""
y_label = ""


### 2.2. Drift Correction


Drift correction is a simple way to remove scatter effects from the spectra. Below, apply a drift correction to ```range_cut_spectra``` and assign the result to the variable ```drift_corrected_spectra```.
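
To give an idea of the principle, the sketch below shows one simple form of drift correction: subtracting from each spectrum the straight line that joins its first and last points. This is an assumption about how a drift correction can work, not a description of what ```DriftCorrection``` does internally. The helper ```remove_linear_drift``` is hypothetical, and the data are synthetic with an evenly spaced axis.

```python
import numpy as np

def remove_linear_drift(spectra):
    """Subtract, from each row, the straight line joining its first and last points.

    Hypothetical helper for illustration; assumes evenly spaced points along the axis.
    """
    spectra = np.asarray(spectra, dtype=float)
    n_points = spectra.shape[1]
    # Fractional position of each point along the axis: 0 at the start, 1 at the end.
    position = np.linspace(0.0, 1.0, n_points)
    first = spectra[:, [0]]      # shape (n_spectra, 1)
    last = spectra[:, [-1]]      # shape (n_spectra, 1)
    baseline = first + (last - first) * position
    return spectra - baseline

# Synthetic example: one Gaussian peak sitting on an offset plus a linear drift.
x = np.linspace(0.0, 1.0, 101)
peak = np.exp(-((x - 0.5) ** 2) / (2 * 0.05 ** 2))
drifting = peak + 0.3 + 0.8 * x

corrected = remove_linear_drift(drifting[np.newaxis, :])[0]

print("Largest difference from the drift-free peak:", np.max(np.abs(corrected - peak)))
```

After the correction, both ends of the spectrum sit at zero and the peak shape is preserved.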

In [None]:
# Write your code below:
# Drift correction
drift_correct = 

# Apply the drift correction to range_cut_spectra using the .apply_to() method and assign it to the variable drift_corrected_spectra.
drift_corrected_spectra = 

# Use the plot_spectra() function to visualize the drift corrected spectra.
title = ""
x_label = ""
y_label = ""


### 2.3. Derivative

Taking the derivative of the spectra is a very common way to reduce additive and multiplicative scatter effects. Below, differentiate the ```drift_corrected_spectra``` and assign the result to the variable ```derivative_spectra```. Please calculate the second derivative of the spectra.
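
The effect of a second derivative on additive baselines can be checked with a small synthetic example. The sketch below uses plain finite differences (```np.gradient``` applied twice) on a Gaussian peak with and without an offset plus linear slope. This is only an illustration: finite differences amplify noise, which is why a smoothing filter such as Savitzky-Golay is commonly used in practice, and the sketch makes no claim about how the ```Derivative``` class is implemented.

```python
import numpy as np

# Synthetic "spectrum": a Gaussian peak, with and without an additive baseline.
x = np.linspace(0.0, 1.0, 201)
peak = np.exp(-((x - 0.5) ** 2) / (2 * 0.05 ** 2))
with_baseline = peak + 0.3 + 0.8 * x     # constant offset + linear slope

def second_derivative(y, x):
    """Second derivative obtained by applying finite differences twice (illustrative helper)."""
    return np.gradient(np.gradient(y, x), x)

d2_clean = second_derivative(peak, x)
d2_with_baseline = second_derivative(with_baseline, x)

# The offset and the slope vanish in the second derivative,
# so both curves agree up to floating-point error.
print("Largest difference:", np.max(np.abs(d2_clean - d2_with_baseline)))
```

Note that a peak maximum in the original spectrum appears as a minimum in the second derivative.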

In [None]:
# Write your code below:
# Derivative
derivative_order =
derivative =

# Apply the derivative to drift_corrected_spectra using the .apply_to() method and assign it to the variable derivative_spectra.
derivative_spectra =

# Use the plot_spectra() function to visualize the derivative spectra.
title = ""
x_label = ""
y_label = ""


## Step 3. Train the model

The next step is to train the PLS model. To do so, we first need to choose the number of components to use in the PLS model. This can be done with a cross-validation procedure, using the function ```cross_validation()```. This function outputs a plot showing the RMSE (root mean squared error) and the RMSECV (root mean squared error of cross-validation) for different numbers of components.
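
To make the difference between RMSE and RMSECV concrete, here is a minimal sketch on synthetic data. It does not use ```cross_validation()``` or ```MBPLS```: the helpers ```pls1_fit```, ```pls1_predict``` and ```rmse``` are hypothetical, and the model is a small single-response PLS (NIPALS) written in NumPy for illustration only. The 5-fold split is also an assumption. The RMSE is computed on the same samples used to fit the model, whereas the RMSECV is computed on samples left out during fitting.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with NIPALS (illustrative helper)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # residuals, deflated at each iteration
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)                 # weight vector
        t = Xr @ w                             # scores
        p = Xr.T @ t / (t @ t)                 # X loadings
        c = (yr @ t) / (t @ t)                 # y loading
        Xr = Xr - np.outer(t, p)
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector in the original X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data driven by 3 latent factors plus noise; this is not the course data.
rng = np.random.default_rng(0)
n_samples, n_variables = 40, 60
latent = rng.normal(size=(n_samples, 3))
X = latent @ rng.normal(size=(3, n_variables)) + 0.05 * rng.normal(size=(n_samples, n_variables))
y = latent @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=n_samples)

# 5-fold cross-validation: every sample is predicted by a model that never saw it.
folds = np.array_split(rng.permutation(n_samples), 5)
results = {}
for n_components in range(1, 7):
    rmse_cal = rmse(y, pls1_predict(pls1_fit(X, y, n_components), X))
    y_cv = np.empty(n_samples)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
        model = pls1_fit(X[train_idx], y[train_idx], n_components)
        y_cv[test_idx] = pls1_predict(model, X[test_idx])
    results[n_components] = (rmse_cal, rmse(y, y_cv))
    print(f"{n_components} components: RMSE = {results[n_components][0]:.4f}, "
          f"RMSECV = {results[n_components][1]:.4f}")
```

The calibration RMSE can only decrease as components are added, so it is a poor guide on its own. A common choice is the number of components at which the RMSECV stops improving noticeably.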
