# Workflow for feature testing and comparison

We are testing several ways of describing the calibration information to find features suited to this classification problem.
To that end, we set up a workflow for feature testing in which the modelling performance of (at least) three classification algorithms is reported in a CSV file (modelling_results.csv). To keep the results comparable, every test uses exactly the same workflow, which avoids differences caused by randomization or by a different split into training and test sets.
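One way to keep runs comparable is to send every feature set through a single helper that fixes the split and the seeds and appends one row per model to the results file. The sketch below assumes that setup. The function name `evaluate_feature_set`, the CSV column layout, and the two models shown are illustrative assumptions, not the project's actual code.

```python
import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def evaluate_feature_set(df, features, target="note",
                         results_path="modelling_results.csv", seed=1):
    """Hypothetical helper: same split and seeds for every feature set."""
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "logistic_regression": LogisticRegression(random_state=seed, max_iter=1000),
        "svm": SVC(random_state=seed),
    }
    rows = []
    for name, model in models.items():
        # Cross-validate on the training part only, then score once on the test part
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)
        model.fit(X_train, y_train)
        rows.append({
            "features": ",".join(features),
            "model": name,
            "cv_accuracy": cv_scores.mean(),
            "test_accuracy": model.score(X_test, y_test),
        })
    results = pd.DataFrame(rows)
    # Append to the results file; write the header only when the file is new
    results.to_csv(results_path, mode="a", index=False,
                   header=not os.path.exists(results_path))
    return results
```

Calling it as `evaluate_feature_set(df_calibrations, features)` for each candidate feature list would then add comparable rows to the same file.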

## Libraries and read in cleaned data

Data cleaning was done by Yvonne and consisted of the following steps:
- removing rows with NaN in RT
- removing rows with NaN in concentration
- removing calibration graphs with only 1 or 2 calibration points

The cleaned data set contains 3860 rows and no NaN values.
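For reference, the cleaning steps listed above could look roughly like this in pandas. This is a minimal sketch, not the script that was actually used: it assumes that concentration is stored in `c_real_M` and that one calibration graph corresponds to one value of `compound`, and the function name `clean_calibrations` is hypothetical.

```python
import numpy as np
import pandas as pd


def clean_calibrations(df, rt_col="RT", conc_col="c_real_M",
                       group_col="compound", min_points=3):
    """Sketch of the cleaning steps; column names are assumptions."""
    # Drop rows with a missing retention time or concentration
    cleaned = df.dropna(subset=[rt_col, conc_col])
    # Count the remaining points per calibration graph
    n_points = cleaned.groupby(group_col)[conc_col].transform("count")
    # Keep only graphs with at least `min_points` calibration points
    return cleaned[n_points >= min_points].reset_index(drop=True)
```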


In [2]:
# libraries
import pandas as pd
import numpy as np
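from plotnine import (
    ggplot, aes, geom_point, scale_color_manual, theme_bw,
    scale_x_log10, scale_y_log10,
    facet_wrap, theme, element_line, element_text,
)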

# data
file_path = "C:/Users/loma5202/OneDrive - Kruvelab/PhD/courses/machine_learning/project/ML_calibration_graph_linearity/0_data/data_ready_addfeatures_231122.csv"
df_calibrations = pd.read_csv(file_path)
#df_calibrations.info()

## load data to google colab
#from google.colab import files
#uploaded = files.upload()



In [None]:
#file_path = "data_ready_addfeatures_231122.csv"
#df_calibrations = pd.read_csv(file_path)
#df_calibrations.info()

## Feature engineering

Define features used for modelling here

In [None]:
# new features

In [None]:

# Plotting, if needed
fig = (
    ggplot(data = df_calibrations,
          mapping = aes(x = 'c_real_M', y = 'peak_area')) +
    geom_point(aes(color = "factor(note)")) +
    scale_color_manual(values=("lightgreen", "red")) +
    theme_bw() +
    #scale_y_log10() +
    #scale_x_log10() +

    facet_wrap("compound",
               ncol=4,
               scales="free") +
    theme(figure_size = (16, 30),
          axis_line = element_line(size = 0.5, colour = "black"),
          panel_grid_major = element_line(size = 0.05, colour = "black"),
          panel_grid_minor = element_line(size = 0.05, colour = "black"),
          axis_text = element_text(colour ='black'),
          aspect_ratio=1
          )
)
fig

In [None]:
# TODO: add the density plots that Yvonne showed, to check whether the features have potential for classification

## Modelling

The following models are fitted with their default settings:

- Logistic regression
- SVM

### Training with the rf_error feature

In [2]:
## Decide on features for modelling
#features = ['peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error', 'slope', 'intercept', 'residuals', 'abs_residuals']
features = ['RT','peak_area_norm1','c_real_M_norm1', 'rf_norm1', 'rf_error_norm1', 'slope', 'intercept', 'residuals_norm1', 'abs_residuals_norm1'] # best features
#features = ['RT','peak_area_norm2','c_real_M_norm2', 'rf_norm2', 'rf_error_norm2', 'slope', 'intercept', 'residuals_norm2', 'abs_residuals_norm2']

In [3]:
# Split dataset into features and target variable
X = df_calibrations[features]
y = df_calibrations[['note']]

In [5]:
X.shape

(3860, 9)

In [4]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test

print(X_train.shape)  # (3088, 9)
print(y_train.shape)  # (3088, 1)
print(X_test.shape)   # (772, 9)
print(y_test.shape)   # (772, 1)

(3088, 9)
(3088, 1)
(772, 9)
(772, 1)


#### Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create a logistic regression model instance with default settings
model = LogisticRegression(random_state=123)

# Perform 5-fold cross-validation and store the scores
cv_scores = cross_val_score(model, X, y.values.ravel(), cv=5)

# Calculate the average cross-validation accuracy
cv_accuracy = (cv_scores.mean()*100)

print(f'Average 5-Fold CV Accuracy: {cv_accuracy:.2f}%')

# Train the model on the entire training set
model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
test_accuracy = (model.score(X_test, y_test)*100)

print(f'Test Set Accuracy: {test_accuracy:.2f}%')

Average 5-Fold CV Accuracy: 53.45%
Test Set Accuracy: 40.54%


#### SVM

In [10]:
from sklearn.svm import SVC

# Create an SVM classifier model instance with default settings
svm_model = SVC(random_state=123)

# Perform 5-fold cross-validation and store the scores
svm_cv_scores = cross_val_score(svm_model, X, y.values.ravel(), cv=5)

# Calculate the average cross-validation accuracy
svm_cv_accuracy = (svm_cv_scores.mean()*100)

print(f'Average 5-Fold CV Accuracy for SVM: {svm_cv_accuracy:.2f}%')

# Train the SVM model on the entire training set
svm_model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
svm_test_accuracy = (svm_model.score(X_test, y_test)*100)

print(f'Test Set Accuracy for SVM: {svm_test_accuracy:.2f}%')

Average 5-Fold CV Accuracy for SVM: 54.69%
Test Set Accuracy for SVM: 58.29%


### Training without the rf_error feature

In [3]:
## Decide on features for modelling
#features = ['peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error', 'slope', 'intercept', 'residuals', 'abs_residuals']
#features = ['RT','peak_area_norm1','c_real_M_norm1', 'rf_norm1', 'rf_error_norm1', 'slope', 'intercept', 'residuals_norm1', 'abs_residuals_norm1'] # best features
#features = ['RT','peak_area_norm2','c_real_M_norm2', 'rf_norm2', 'rf_error_norm2', 'slope', 'intercept', 'residuals_norm2', 'abs_residuals_norm2']
features = ['RT','peak_area_norm1','c_real_M_norm1', 'rf_norm1', 'slope', 'intercept', 'residuals_norm1', 'abs_residuals_norm1'] # without rf_error

In [4]:
# Split dataset into features and target variable
X = df_calibrations[features]
y = df_calibrations[['note']]

In [5]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test

print(X_train.shape)  # (3088, 8)
print(y_train.shape)  # (3088, 1)
print(X_test.shape)   # (772, 8)
print(y_test.shape)   # (772, 1)

(3088, 8)
(3088, 1)
(772, 8)
(772, 1)


#### Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create a logistic regression model instance with default settings
model = LogisticRegression(random_state=123)

# Perform 5-fold cross-validation and store the scores
cv_scores = cross_val_score(model, X, y.values.ravel(), cv=5)

# Calculate the average cross-validation accuracy
cv_accuracy = (cv_scores.mean()*100)

print(f'Average 5-Fold CV Accuracy: {cv_accuracy:.2f}%')

# Train the model on the entire training set
model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
test_accuracy = (model.score(X_test, y_test)*100)

print(f'Test Set Accuracy: {test_accuracy:.2f}%')

Average 5-Fold CV Accuracy: 53.45%
Test Set Accuracy: 40.54%


#### SVM

In [16]:
from sklearn.svm import SVC

# Create an SVM classifier model instance with default settings
svm_model = SVC(random_state=123)

# Perform 5-fold cross-validation and store the scores
svm_cv_scores = cross_val_score(svm_model, X, y.values.ravel(), cv=5)

# Calculate the average cross-validation accuracy
svm_cv_accuracy = (svm_cv_scores.mean()*100)

print(f'Average 5-Fold CV Accuracy for SVM: {svm_cv_accuracy:.2f}%')

# Train the SVM model on the entire training set
svm_model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
svm_test_accuracy = (svm_model.score(X_test, y_test)*100)

print(f'Test Set Accuracy for SVM: {svm_test_accuracy:.2f}%')

Average 5-Fold CV Accuracy for SVM: 54.69%
Test Set Accuracy for SVM: 58.29%


The results are exactly the same whether or not rf_error is included. I don't have high confidence in these models, but I will still try hyperparameter tuning to see whether they can be improved.
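Identical scores with and without a feature can be a sign that the models are dominated by a few large-scale columns or are predicting mostly one class. A quick check is to compare against a majority-class baseline and to standardize the features inside a pipeline. The sketch below does this on synthetic data rather than the project data. The data, the scale factors, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the calibration features (not the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
# Put the columns on very different scales, as a mix of normalized and raw columns might be
X_demo = X_demo * np.logspace(-6, 6, num=8)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
# Logistic regression on standardized features; the scaler is refitted inside each CV fold
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=123))

baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=5).mean()
scaled_acc = cross_val_score(scaled_lr, X_demo, y_demo, cv=5).mean()
print(f"Majority-class baseline: {baseline_acc:.2f}")
print(f"Scaled logistic regression: {scaled_acc:.2f}")
```

If a model on the real data scores close to the baseline, adding or removing a single feature is unlikely to change its accuracy.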

### Logistic Regression with hyperparameter tuning

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Logistic Regression
lr_param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': [None, 'l1', 'l2', 'elasticnet'],  # Type of penalty (None replaces the deprecated string 'none')
    'solver' : ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],  # Type of solver to use
    'l1_ratio' : [0, 0.2, 0.4, 0.6, 0.8, 1]  # Elastic-net mixing, only used with penalty='elasticnet'
}

# Create the GridSearchCV object for Logistic Regression
lr_grid_search = GridSearchCV(LogisticRegression(random_state = 1, max_iter = 10000), 
                              lr_param_grid, cv = 5, scoring='accuracy')

# Perform the grid search and fit the model
lr_grid_search.fit(X, y.values.ravel())

# The best hyperparameters from GridSearchCV
print(f"Best hyperparameters for Logistic Regression: {lr_grid_search.best_params_}")

# Train the model using the best parameters on the entire training set
lr_best_model = lr_grid_search.best_estimator_
lr_best_model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
lr_test_accuracy = (lr_best_model.score(X_test, y_test)*100)
print(f'Test Set Accuracy for Logistic Regression: {lr_test_accuracy:.2f}%')


Further options are to use another solver or to avoid such situation in the first place. Possible remedies are removing collinear features of X or increasing the penalization strengths.
The original Linear Algebra message was:
Ill-conditioned matrix (rcond=5.87299e-34): result may not be accurate.

Best hyperparameters for Logistic Regression: {'C': 1, 'l1_ratio': 0, 'penalty': 'l1', 'solver': 'liblinear'}
Test Set Accuracy for Logistic Regression: 77.59%
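Many combinations in a single flat grid are not valid in scikit-learn: for example, `lbfgs` does not support an `l1` penalty, and `l1_ratio` only applies to `elasticnet`. Those fits fail and get a NaN score. A possible alternative, sketched here on synthetic data, is to pass a list of grids that contains only valid combinations and to run the search on the training split only, so that the test set stays unseen during tuning. The data and the shortened value lists are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# One sub-grid per solver, listing only penalties that solver supports
valid_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 1, 100]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 1, 100]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "C": [0.01, 1, 100],
     "l1_ratio": [0.2, 0.5, 0.8]},
]

search = GridSearchCV(
    LogisticRegression(random_state=1, max_iter=10000),
    valid_grid, cv=5, scoring="accuracy",
)
# Tune on the training split only; the best model is refitted on it automatically
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

Because `GridSearchCV` refits the best estimator by default, `search.score` can be called directly on the held-out split without a separate `fit` call.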


### SVM with hyperparameter tuning

This cell is not functional yet; I will try it again. The last attempt ran for over 130 minutes without producing results.

In [6]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for SVM
svm_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'gamma': [0.001, 0.5, 1, 'scale', 'auto'],  # Kernel coefficient
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']  # Type of kernel
}

# Create the GridSearchCV object for SVM
svm_grid_search = GridSearchCV(SVC(random_state = 1), svm_param_grid, cv = 5, scoring = 'accuracy')

# Perform the grid search and fit the model
svm_grid_search.fit(X, y.values.ravel())

# The best hyperparameters from GridSearchCV
print(f"Best hyperparameters for SVM: {svm_grid_search.best_params_}")

# Train the model using the best parameters on the entire training set
svm_best_model = svm_grid_search.best_estimator_
svm_best_model.fit(X_train, y_train.values.ravel())

# Evaluate the model on the test set
svm_test_accuracy = (svm_best_model.score(X_test, y_test)*100)
print(f'Test Set Accuracy for SVM: {svm_test_accuracy:.2f}%')
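A long runtime is plausible for this grid: it has 5 × 5 × 4 = 100 combinations, each fitted five times, `gamma` has no effect on the linear kernel, and `SVC` can be slow on unscaled features. One possible way to cut the cost, sketched on synthetic data, is to standardize inside a pipeline and to give each kernel only the parameters it uses. The data, the reduced grid values, and the step names `scale` and `svc` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Scaling is part of the pipeline, so it is refitted inside every CV fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(random_state=1))])

# Each kernel gets only the parameters it actually uses
svm_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.01, 0.1]},
]

svm_search = GridSearchCV(
    pipe, svm_grid, cv=5, scoring="accuracy",
    n_jobs=1,  # set to -1 to run the fits on all CPU cores
)
svm_search.fit(X_tr, y_tr)

print(svm_search.best_params_)
print(f"Test accuracy: {svm_search.score(X_te, y_te):.2f}")
```

This grid has 12 combinations instead of 100, which makes it easier to confirm that the search finishes before widening the ranges again.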