# Model programming

Now that the data is ready, we will train several types of machine learning models and compare the results they achieve. We will apply K-Fold Cross-Validation with at least two folds to check that our models are generalising properly, reducing the risk of overfitting and underfitting. Moreover, we will try out different sets of hyperparameters with the aim of finding the best possible model.
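As a minimal sketch of the cross-validation idea described above, the snippet below runs a stratified K-Fold on synthetic data and compares train and test scores per metric. The names `X_demo`, `y_demo` and `cv_results`, the synthetic dataset and the choice of five folds are illustrative assumptions, not part of the project's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, slightly imbalanced data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.6, 0.4], random_state=42
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["f1", "recall", "roc_auc"]
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring=metrics,
    return_train_score=True,
)

# A large gap between train and test scores hints at overfitting;
# low scores on both hint at underfitting.
for metric in metrics:
    train_mean = cv_results[f"train_{metric}"].mean()
    test_mean = cv_results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f}, test={test_mean:.3f}")
```

With only two folds each model is trained on half of the data, so the score estimates tend to be noisier; more folds cost more fits but usually give a steadier picture.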

## Model scoring

For model scoring, given the slight imbalance in the target, we will use the F1 score, recall and ROC AUC. Why these? First, the **F1 score** is the harmonic mean of precision and recall, so it sets a reliable baseline for measuring our results. Second, we choose to examine **recall** instead of accuracy because, in this case, we want to capture as many true positives as possible, even at the cost of some false positives. This metric is useful when detecting all positives is crucial, for example to minimise the number of missed failures. Finally, **ROC AUC** measures the model's ability to distinguish between classes across all decision thresholds. It is less sensitive to imbalance than accuracy and gives a general idea of how well the model separates the classes.
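To make the three metrics concrete, here is a small worked example on hand-made labels. The arrays `y_true`, `y_pred` and `y_score` are invented for illustration and do not come from the project data.

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score, roc_auc_score

# Toy labels: 4 positives and 4 negatives (illustrative values only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)      # tp / (tp + fn) = 3 / 4
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_score)   # uses scores, not hard predictions

print(f"tp={tp}, fn={fn}, fp={fp}, tn={tn}")  # tp=3, fn=1, fp=1, tn=3
print(f"recall={recall:.4f}")                 # recall=0.7500
print(f"f1={f1:.4f}")                         # f1=0.7500
print(f"roc_auc={roc_auc:.4f}")               # roc_auc=0.9375
```

Note that recall and F1 are computed from hard predictions, whereas ROC AUC needs the predicted scores or probabilities, since it measures how well positives are ranked above negatives.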

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [9]:
df = pd.read_csv("../data/clean/clean_binary.csv")
df.head()

| | Process temperature [K] | Torque [Nm] | Tool wear [min] | Type_H | Type_L | Type_M | Target |
|---|---|---|---|---|---|---|---|
| 0 | 0.000944 | 1.307442 | 1.397594 | 0 | 1 | 0 | 1 |
| 1 | -1.967206 | -0.290731 | -1.623324 | 0 | 1 | 0 | 0 |
| 2 | -0.929433 | -0.078252 | 0.03135 | 0 | 1 | 0 | 0 |
| 3 | 1.492037 | 1.550753 | -1.668866 | 0 | 0 | 1 | 0 |
| 4 | 0.926976 | 1.62232 | 1.306511 | 0 | 1 | 0 | 1 |


In [10]:
X = df.drop(columns=["Target"])
y = df["Target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_train shape: ', X_train.shape)
print('X_test shape: ', X_test.shape)
print('Y_train shape: ', y_train.shape)
print('Y_test shape: ', y_test.shape)

X_train shape:  (9012, 6)
X_test shape:  (2254, 6)
Y_train shape:  (9012,)
Y_test shape:  (2254,)
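Since the target is slightly imbalanced, one optional refinement is to stratify the split so that the train and test sets keep the same class proportions. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo` and the other names are illustrative assumptions, and this is not the split used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, slightly imbalanced data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.6, 0.4], random_state=42
)

# stratify=y_demo asks train_test_split to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of positives should be almost identical in both subsets.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"positive rate in train: {train_pos_rate:.3f}")
print(f"positive rate in test:  {test_pos_rate:.3f}")
```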


## Logistic Regression

Despite its name, Logistic Regression is a classification model rather than a regression model, and it is most commonly used for binary classification. We will try a set of hyperparameters to train the best possible model. For Logistic Regression, we will run a grid search over the following hyperparameters: penalty, C, solver and class weight.

In [14]:
log_reg = LogisticRegression()

log_reg_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['liblinear', 'saga', 'lbfgs'],
    'class_weight': [None, 'balanced']
}

gs_log_reg = GridSearchCV(
    estimator=log_reg,
    param_grid=log_reg_params,
    cv=2,
    scoring="recall",
    verbose=1
)
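Not every penalty and solver pair in a full cross-product grid is supported by `LogisticRegression`: `lbfgs` only handles `l2`, `liblinear` does not handle `elasticnet`, and `elasticnet` with `saga` also requires an `l1_ratio`. Such unsupported pairs make the corresponding fits fail. One possible way around this, sketched below on synthetic data, is to pass `param_grid` as a list of dictionaries so that only valid combinations are tried. The names `valid_params`, `gs_demo`, `X_demo` and `y_demo`, the reduced `C` values, the `l1_ratio` of 0.5 and `max_iter=5000` are illustrative assumptions, and the penalty names follow the scikit-learn API used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, slightly imbalanced data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.6, 0.4], random_state=42
)

# Shared settings, kept small so the sketch runs quickly.
common = {"C": [0.01, 1, 100], "class_weight": [None, "balanced"]}

# One dictionary per group of compatible solver/penalty pairs.
valid_params = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], **common},
    {"solver": ["saga"], "penalty": ["l1", "l2"], **common},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5], **common},
    {"solver": ["lbfgs"], "penalty": ["l2"], **common},
]

gs_demo = GridSearchCV(
    estimator=LogisticRegression(max_iter=5000),
    param_grid=valid_params,
    cv=2,
    scoring="recall",
    error_score="raise",  # fail loudly instead of silently scoring nan
)
gs_demo.fit(X_demo, y_demo)

# Count candidates whose mean test score is nan, i.e. whose fits failed.
n_candidates = len(gs_demo.cv_results_["params"])
n_failed = int(np.isnan(gs_demo.cv_results_["mean_test_score"]).sum())
print(f"{n_candidates} candidates, {n_failed} failed")
print(gs_demo.best_params_)
```

Restricting the grid this way also avoids spending time on fits that can never succeed, and `error_score="raise"` surfaces any remaining incompatibility immediately.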

In [15]:
gs_log_reg.fit(X_train, y_train)

Fitting 2 folds for each of 108 candidates, totalling 216 fits


96 fits failed out of a total of 216.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
24 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ibai.valente\AppData\Roaming\Python\Python312\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ibai.valente\AppData\Roaming\Python\Python312\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ibai.valente\AppData\Roaming\Python\Python312\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver = _check_solver(self.solver, s