# Partial least squares regression (PLS) and Multiple Linear Regression (MLR)

This notebook presents and compares the PLS and MLR regression models.

## Data

To get a better grasp of the task at hand, we first show the data used in this notebook.

In [1]:
from src.courses_notebooks.transformations import owu_from_csv
data = "mytable.csv"
owu = owu_from_csv(data)
owu

run,timestamps,X:VCD,X:Glc,X:Lac,X:Titer
0,0.0,13.941176,13.588235,0.000000,0.000000
0,1.0,9.491875,0.000061,21.345499,0.000560
0,2.0,4.544648,0.003097,33.278266,0.015783
0,3.0,1.955019,0.007568,38.678038,0.031730
0,4.0,0.815245,0.022498,40.960971,0.049851
...,...,...,...,...,...
19,10.0,0.024984,0.000148,57.032180,0.050391
19,11.0,0.012335,-0.000122,57.066590,0.050391
19,12.0,0.006089,-0.000110,57.083578,0.050391
19,13.0,0.003006,-0.000128,57.091964,0.050391


## Batchwise unfolded (BWU) matrix

In [2]:
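final_titer = []
owu["Y:Final_titer"] = 0.0
for run_ix, run_df in owu.groupby("run"):
    # The last titer measurement of a run is its final titer
    run_final_titer = run_df["X:Titer"].iloc[-1]
    final_titer.append(run_final_titer)
    # Assign through a single .loc call; chained indexing may write to a copy
    owu.loc[run_df.index, "Y:Final_titer"] = run_final_titer
final_titer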

[0.0503910454629536,
 0.0590018752917881,
 0.0528665793285199,
 0.0381429851136337,
 0.0939537535794846,
 0.04705785928204,
 0.0186526144156977,
 0.0205839781723832,
 0.0939650381194057,
 0.0449320418653611,
 0.085399452436487,
 0.0324888298857732,
 0.0111362022892304,
 0.0298666637936778,
 0.0503720803261121,
 0.0610760052487183,
 0.0478560442285922,
 0.1395789238174269,
 0.0,
 0.0503910454629536]

The next step is to apply the MLR model to this data. We start by importing the necessary libraries.

To make the data easier to use, we reshape it into a BWU matrix, in which every row corresponds to one run. The data is arranged in the following manner:

[still need to find the image]
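
As a minimal sketch of this arrangement, the toy example below unfolds a small observation-wise table into a BWU matrix with pandas. The values and the names `toy_owu` and `toy_bwu` are illustrative assumptions; only the `run`/`timestamps` index and the `X:` column naming follow the table shown earlier. Note that `unstack` groups the columns by variable first, which is one possible column order among several.

```python
import pandas as pd

# Toy observation-wise data: 2 runs x 3 timestamps x 2 variables (values are made up)
index = pd.MultiIndex.from_product(
    [[0, 1], [0.0, 1.0, 2.0]], names=["run", "timestamps"]
)
toy_owu = pd.DataFrame(
    {
        "X:VCD": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "X:Glc": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
    },
    index=index,
)

# Move the timestamps from the row index into the columns: one row per run
toy_bwu = toy_owu.unstack("timestamps")

# Flatten the (variable, timestamp) column pairs into names such as "X:VCD:0.0"
toy_bwu.columns = [f"{var}:{ts}" for var, ts in toy_bwu.columns]

print(toy_bwu.shape)
print(toy_bwu.columns.tolist())
```

Each run now occupies a single row, so the matrix has as many rows as runs and one column per variable and timestamp.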

## Train-test split

The first step in using the data is to create a train-test split. For that, we rely on scikit-learn to split the data for us.

In [3]:
owu.drop("Y:Final_titer", axis="columns")

run,timestamps,X:VCD,X:Glc,X:Lac,X:Titer
0,0.0,13.941176,13.588235,0.000000,0.000000
0,1.0,9.491875,0.000061,21.345499,0.000560
0,2.0,4.544648,0.003097,33.278266,0.015783
0,3.0,1.955019,0.007568,38.678038,0.031730
0,4.0,0.815245,0.022498,40.960971,0.049851
...,...,...,...,...,...
19,10.0,0.024984,0.000148,57.032180,0.050391
19,11.0,0.012335,-0.000122,57.066590,0.050391
19,12.0,0.006089,-0.000110,57.083578,0.050391
19,13.0,0.003006,-0.000128,57.091964,0.050391


In [5]:
from sklearn.model_selection import train_test_split
import pandas as pd

y_data = final_titer

# The response must not appear among the predictors
x_owu = owu.drop("Y:Final_titer", axis="columns")

# Build the BWU matrix: one row per run, one column per (variable, timestamp) pair
run_rows = []
for run_ix, run in x_owu.groupby("run"):
    run = run.droplevel("run")  # keep only the timestamps as index

    time_blocks = []
    for timestamp, row in run.iterrows():
        row = row.to_frame().T
        # Suffix with the timestamp (not the run index) so that column names stay unique
        row = row.add_suffix(f":{timestamp}")
        row.index = [int(run_ix)]
        time_blocks.append(row)

    run_rows.append(pd.concat(time_blocks, axis=1))

# Runs with fewer timestamps get NaN in the columns they do not cover
bwu = pd.concat(run_rows)
bwu


In [None]:
len(owu.groupby("run"))

In [None]:
x_data = bwu  # predictors: one row per run
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=10)

## MLR model

Now, we will start with the implementation of the MLR model.

### Model training

Here, the model is fitted using the training data generated above.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)

### Model prediction

In [None]:
y_pred = lr.predict(x_test)

### Prediction metrics

Now, we will compute different metrics to evaluate the prediction.

In [None]:
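import numpy as np
import matplotlib.pyplot as plt

y_test = np.asarray(y_test)
# Predicted vs. measured (test) values plot
plt.plot(y_test, y_pred, "ro")

# Absolute RMSE, averaged over the test samples
rmse_abs_test = np.sqrt(np.sum((y_test - y_pred) ** 2) / len(y_test))
# Relative RMSE: scaled by the standard deviation of the measured test values
# (assumption: the measured test responses are y_test)
rmse_rel_test = rmse_abs_test / np.std(y_test)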

In [None]:
from sklearn.feature_selection import f_regression

anova_results = f_regression(x_train, y_train)
# anova_results = [f_regression(x_train, y_train[col_ix]) for col_ix in range(len(shape(y_test)[1]))]

### Prediction results 

In this section, we compare the model's predictions of the final titer with the measured values of the training data.

In [None]:
import matplotlib.pyplot as plt
y_pred_on_training = lr.predict(x_train)
# Measured vs. predicted final titer on the training set. Training and test
# predictions differ in length, so they cannot be plotted against each other.
plt.scatter(y_train, y_pred_on_training)

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
r2_score_mlr = r2_score(y_test, y_pred)
rmse_MLR = np.sqrt(mean_squared_error(y_test, y_pred))

### Check model prediction on test set

In this section, we use the trained model to predict the final titer and evaluate it on the test set created earlier.

In [None]:
# y_test and y_pred are 1-D (final titer only), so no column index is needed
plt.scatter(y_test, y_pred)

## PLS model

Now, we will train a PLS model with the same data. Here, you can also choose the number of components the PLS model will use.

In [None]:
from sklearn.cross_decomposition import PLSRegression
n_components=2

pls = PLSRegression(n_components=n_components)
pls.fit(x_train, y_train)


## PLS model analysis

In [None]:
# a way to get the plot variance explained by each component


### Compute fitted response residuals

In [None]:
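import numpy as np
# Fitted response residuals: measured minus fitted values on the training set
residuals = np.asarray(y_train) - pls.predict(x_train).ravel()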


### Plot scores based on final titer

### Plot VIP scores

## Simulate Cross-Validation

In this section, we will simulate a typical cross-validation to define the optimal number of latent variables.

Cross-validation RMSE

# Historical models

In a so-called historical model, the data from different experiments are arranged into a batch-wise unfolded (BWU) matrix (i.e., every row corresponds to an experiment).
The BWU matrix can be used to compute final properties of the experiment, like CQAs, which typically result from the cumulative effect of the experiment profile.
In this example, we will use the BWU matrix to predict the final value of the titer. The titer information is therefore removed from the BWU matrix.

In [None]:
n_days = 10
n_latent_vars = 4

## Create the BWU matrix

Here the BWU matrix is created. The values of the manipulated variables are added as columns at the beginning of the matrix.
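
The construction described here can be sketched as follows on toy data. The manipulated variables are not part of the table shown in this notebook, so the `toy_design` frame and its `Z:` column names are hypothetical, as is the reading of `n_days` as the number of initial timestamps kept.

```python
import pandas as pd

n_days = 3  # assumption: keep only the first n_days timestamps

# Toy process data (values are made up): 2 runs x 4 timestamps
index = pd.MultiIndex.from_product(
    [[0, 1], [0.0, 1.0, 2.0, 3.0]], names=["run", "timestamps"]
)
toy_owu = pd.DataFrame(
    {
        "X:VCD": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5],
        "X:Titer": [0.0, 0.1, 0.2, 0.3, 0.0, 0.2, 0.4, 0.6],
    },
    index=index,
)

# Hypothetical manipulated variables, one row per run
toy_design = pd.DataFrame(
    {"Z:FeedRate": [0.1, 0.2], "Z:Temp": [36.5, 37.0]},
    index=pd.Index([0, 1], name="run"),
)

# Response: final titer of each run
y_hist = toy_owu.groupby("run")["X:Titer"].last()

# Predictors: drop the titer, keep the first n_days timestamps, unfold per run
x_owu = toy_owu.drop(columns="X:Titer")
x_owu = x_owu[x_owu.index.get_level_values("timestamps") < n_days]
x_bwu = x_owu.unstack("timestamps")
x_bwu.columns = [f"{var}:{ts}" for var, ts in x_bwu.columns]

# Manipulated variables come first, followed by the unfolded process variables
bwu_hist = pd.concat([toy_design, x_bwu], axis=1)
print(bwu_hist.columns.tolist())
```

Because the titer columns are dropped before unfolding, the response cannot leak into the predictors.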

## Create model

Create a PLS model from the initial design to the final titer

### Explained variance plots vs number of components

## Historical model residuals on the train set and test set