# Workflow for feature testing and comparison

We are testing multiple ways of describing the calibration information in order to find features suited to this classification problem.
We have therefore set up a workflow in which features can be tested and the modelling performance of (at least) three different classification algorithms is reported in a CSV file (modelling_results.csv). To keep the results comparable, exactly the same workflow is used for every test, so that reported differences do not stem from randomization or from a different split into training and test sets.
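
As a minimal sketch of how one run could be appended to modelling_results.csv, assuming one row per feature set and algorithm: the helper `append_result` and the column names (`features`, `model`, `cv_accuracy`, `test_accuracy`) are hypothetical and not taken from the project.

```python
import csv
import os

# Hypothetical column layout; the real modelling_results.csv may differ.
FIELDNAMES = ["features", "model", "cv_accuracy", "test_accuracy"]

def append_result(path, features, model, cv_accuracy, test_accuracy):
    """Append one modelling result as a row, writing the header for a new file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "features": ";".join(features),  # keep the feature list in one cell
            "model": model,
            "cv_accuracy": round(cv_accuracy, 2),
            "test_accuracy": round(test_accuracy, 2),
        })
```

After a model has been evaluated, a call such as `append_result("modelling_results.csv", features, "RandomForest", cv_accuracy, test_accuracy)` would add its row, so that all feature sets end up in one comparable table.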

## Libraries and reading in the cleaned data

Data cleaning was done by Yvonne. The following steps were taken:
- removing rows with NaN in RT
- removing rows with NaN in concentration
- removing calibration graphs with only 1 or 2 calibration points

The resulting data set contains 3860 rows and no NaN values.
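
The cleaning steps listed above could be reproduced roughly as follows. This is only a sketch, not the script that was actually used: it assumes that the concentration column is `c_real_M` and that one calibration graph corresponds to one value of `compound`. The function name `clean_calibrations` and its parameters are hypothetical.

```python
import pandas as pd

def clean_calibrations(df, group_cols=("compound",), min_points=3):
    """Drop rows with missing RT or concentration, then drop sparse calibration graphs."""
    # Steps 1 and 2: remove rows with NaN in RT or in the (assumed) concentration column
    cleaned = df.dropna(subset=["RT", "c_real_M"])
    # Step 3: count the remaining points per calibration graph (assumed grouping)
    n_points = cleaned.groupby(list(group_cols))["c_real_M"].transform("count")
    # Keep only graphs with at least min_points calibration points
    return cleaned[n_points >= min_points].reset_index(drop=True)
```

The point count is taken after the NaN rows are removed, so a graph that drops to 1 or 2 valid points is removed as well.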


In [10]:
# libraries
import pandas as pd
import numpy as np
from plotnine import *

# data
file_path = "C:/Users/loma5202/OneDrive - Kruvelab/PhD/courses/machine_learning/project/ML_calibration_graph_linearity/0_data/data_ready_addfeatures_231122.csv"
df_calibrations = pd.read_csv(file_path)

In [None]:
#file_path = "data_ready_addfeatures_231122.csv"
#df_calibrations = pd.read_csv(file_path)
#df_calibrations.info()

## Feature engineering

Define features used for modelling here
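
As an illustration of what such features could look like, the sketch below derives `rf`, `slope`, `intercept`, `residuals` and `abs_residuals` from `peak_area` and `c_real_M`. The definitions are assumptions, since the notebook does not state them: `rf` is taken as the response factor peak area / concentration, the other features come from an ordinary least-squares line fitted per calibration graph, and one graph is assumed to correspond to one `compound`. The function name `add_linear_fit_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_linear_fit_features(df, group_col="compound"):
    """Add rf plus linear-fit features, fitted separately for each calibration graph."""
    out = df.reset_index(drop=True).copy()
    # Assumed definition: response factor = peak area / concentration
    out["rf"] = out["peak_area"] / out["c_real_M"]
    for col in ("slope", "intercept", "residuals"):
        out[col] = np.nan
    for _, idx in out.groupby(group_col).groups.items():
        x = out.loc[idx, "c_real_M"].to_numpy(dtype=float)
        y = out.loc[idx, "peak_area"].to_numpy(dtype=float)
        # Least-squares straight line through the calibration points
        slope, intercept = np.polyfit(x, y, deg=1)
        out.loc[idx, "slope"] = slope
        out.loc[idx, "intercept"] = intercept
        # Residual = measured peak area minus the fitted value
        out.loc[idx, "residuals"] = y - (slope * x + intercept)
    out["abs_residuals"] = out["residuals"].abs()
    return out
```

Slope and intercept are constant within a graph, while the residuals vary per calibration point, which is what could make them informative for a point-wise classification.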

In [None]:
# new features

In [None]:

# Plotting, if needed
fig = (
    ggplot(data = df_calibrations,
          mapping = aes(x = 'c_real_M', y = 'peak_area')) +
    geom_point(aes(color = "factor(note)")) +
    scale_color_manual(values=("lightgreen", "red")) +
    theme_bw() +
    #scale_y_log10() +
    #scale_x_log10() +

    facet_wrap("compound",
               ncol=4,
               scales="free") +
    theme(figure_size = (16, 30),
          axis_line = element_line(size = 0.5, colour = "black"),
          panel_grid_major = element_line(size = 0.05, colour = "black"),
          panel_grid_minor = element_line(size = 0.05, colour = "black"),
          axis_text = element_text(colour ='black'),
          aspect_ratio=1
          )
)
fig

In [None]:
# TODO: consider adding the density plots that Yvonne showed, to check whether the features have potential for classification

## Modelling

Try to improve Gordian's best-performing models with hyperparameter tuning:
- Random Forest
- XGBoost

In [11]:
## Decide on features for modelling
#features = ['peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error']
#features = ['RT','peak_area','c_real_M', 'rf', 'rf_error', 'slope', 'intercept', 'residuals', 'abs_residuals']
#features = ['RT','peak_area_norm1','c_real_M_norm1', 'rf_norm1', 'rf_error_norm1', 'slope', 'intercept', 'residuals_norm1', 'abs_residuals_norm1'] # best features
#features = ['RT','peak_area_norm2','c_real_M_norm2', 'rf_norm2', 'rf_error_norm2', 'slope', 'intercept', 'residuals_norm2', 'abs_residuals_norm2']
features = ['RT','peak_area_norm1','c_real_M_norm1', 'rf_norm1', 'slope', 'intercept', 'residuals_norm1', 'abs_residuals_norm1'] # without rf_error

In [12]:
# Split dataset into features and target variable
X = df_calibrations[features]
y = df_calibrations[['note']]

In [13]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test

print(X_train.shape) # (3088, 8)
print(y_train.shape) # (3088, 1)
print(X_test.shape) # (772, 8)
print(y_test.shape) # (772, 1)

(3088, 8)
(3088, 1)
(772, 8)
(772, 1)


## Random forest

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

# Flatten the single-column label DataFrames (y_train and y_test) into 1D numpy arrays, as scikit-learn expects
y_train_rf = y_train.values.ravel()
y_test_rf = y_test.values.ravel()

# Hyperparameter tuning setup
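param_grid_rf = {
    'n_estimators': [50, 100, 200],
    # 'entropy' and 'log_loss' are equivalent in scikit-learn; 'log_loss' needs version >= 1.1
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 4, 7, 10],
    'min_samples_leaf': [1, 2, 4],
    # minimum fraction of the total sample weight required in each leaf (valid range 0.0 to 0.5)
    'min_weight_fraction_leaf': [0.0, 0.25, 0.5],
    # fraction of the training set drawn for each tree (bootstrap sampling)
    'max_samples': [0.1, 0.4, 0.7, 1.0]
}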

# GridSearchCV setup and fitting
np.random.seed(123) # random seed for consistency

rf_model = RandomForestClassifier(random_state = 1)

cv_rf = GridSearchCV(rf_model, param_grid_rf, cv = 5, scoring = "accuracy", n_jobs = 8)
cv_rf.fit(X_train, y_train_rf)
best_parameters_rf = cv_rf.best_params_

# Training the best RandomForest model found with GridSearchCV
best_rf = RandomForestClassifier(**best_parameters_rf, random_state = 1)
best_rf.fit(X_train, y_train_rf)

# Cross-validate the best model on the training set
cv_scores = cross_val_score(best_rf, X_train, y_train_rf, cv=5, scoring='accuracy')
cv_accuracy = (np.mean(cv_scores) * 100)

# Predict on the test set
y_pred_test_rf = best_rf.predict(X_test)

# Calculate accuracy on the test set
test_accuracy_rf = (accuracy_score(y_test_rf, y_pred_test_rf) * 100)

# Print out the hyperparameter settings and results
print(f"Best hyperparameters for RF: {best_parameters_rf}")
print(f'5-fold CV Accuracy: {cv_accuracy:.2f}%')
print(f'Test Set Accuracy: {test_accuracy_rf:.2f}%')

Best hyperparameters for RF: {'criterion': 'entropy', 'max_depth': 20, 'max_samples': 1.0, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100}
5-fold CV Accuracy: 83.55%
Test Set Accuracy: 80.57%
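
For a sense of the cost of this exhaustive search, the number of model fits follows directly from the size of the grid. The sketch below only repeats `param_grid_rf` from the cell above and multiplies the list lengths; it does not call scikit-learn, and the variable names `n_candidates`, `n_folds` and `n_fits` are introduced here for illustration.

```python
from math import prod

# Same grid as in the random forest cell above
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 4, 7, 10],
    'min_samples_leaf': [1, 2, 4],
    'min_weight_fraction_leaf': [0.0, 0.25, 0.5],
    'max_samples': [0.1, 0.4, 0.7, 1.0]
}

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in param_grid_rf.values())
n_folds = 5  # cv = 5 in GridSearchCV
n_fits = n_candidates * n_folds  # fits during the search, excluding the final refit

print(n_candidates, n_fits)  # 5184 25920
```

If the runtime becomes limiting, one option would be to use fewer values per hyperparameter or a randomized search such as scikit-learn's `RandomizedSearchCV`, which reduces the number of fits.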


## Extreme gradient boosting