# Notebook 4: Supervised Machine Learning 🧠

## Overview
In this notebook, we apply supervised machine learning to predict wine quality from its chemical attributes. The goal is to train several classifiers, evaluate them on a held-out test set, and identify the best-performing one.

## Steps
1. **Split Data:**
   - Divide the dataset into training and testing sets.

2. **Model Selection:**
   - Train multiple supervised models, including Logistic Regression, Random Forest, SVM, etc.

3. **Model Evaluation:**
   - Use metrics like Accuracy, Precision, Recall, and F1-Score to evaluate performance.

4. **Hyperparameter Tuning:**
   - Optimize the best-performing model to achieve better results.

5. **Feature Importance:**
   - Identify which features have the most significant impact on wine quality.

6. **Conclusion:**
   - Summarize key findings and prepare for comparison with unsupervised methods.


## 0. Load Data

In [22]:
import pandas as pd

df = pd.read_csv('../data/cleaned_dataset.csv')

## 1. Split Data

In [29]:
from sklearn.model_selection import train_test_split

# Target variable and features
X = df.drop('quality', axis=1)
y = df['quality']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the shapes
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")


Training set shape: (1087, 11)
Test set shape: (272, 11)
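
Because the split uses `stratify=y`, each quality class should make up roughly the same share of the training and test sets. The sketch below checks this with the class counts that this notebook prints for the same split in the SMOTE section; the helper name `class_shares` is hypothetical and only serves the illustration.

```python
from collections import Counter

# Class counts reported by this notebook for the stratified 80/20 split.
train_counts = Counter({5: 461, 6: 428, 7: 134, 4: 42, 8: 14, 3: 8})
test_counts = Counter({5: 116, 6: 107, 7: 33, 4: 11, 8: 3, 3: 2})


def class_shares(counts):
    """Return each class's share of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

# Compare the shares class by class; stratification should keep them close.
for label in sorted(train_counts):
    diff = abs(train_shares[label] - test_shares[label])
    print(f"quality={label}: train={train_shares[label]:.3f}, "
          f"test={test_shares[label]:.3f}, |diff|={diff:.3f}")
```

With only 2 or 3 test samples in the rarest classes, per-class metrics for those classes will be very noisy even though the proportions match.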


## 2-3. Model Selection & Evaluation

In [168]:
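import time

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier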

# Scale the data (important for Logistic Regression and SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Adjust classes to be consecutive for XGBoost
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Redefine models with adjustments
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (Linear Kernel)": SVC(kernel='linear', random_state=42),
    "SVM (RBF Kernel)": SVC(kernel='rbf', random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
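    # `use_label_encoder` is omitted: recent XGBoost releases ignore it and emit a warning.
    # 'mlogloss' is the multiclass log loss, which matches the six quality classes.
    "XGBoost": XGBClassifier(random_state=42, eval_metric='mlogloss'),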
    "LightGBM": LGBMClassifier(random_state=42, objective='multiclass'),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB()
}

# Train and evaluate each model
results = []
for name, model in models.items():
    start_time = time.time()
    if name == "XGBoost":  # Use encoded labels for XGBoost
        model.fit(X_train_scaled, y_train_encoded)
        y_pred = label_encoder.inverse_transform(model.predict(X_test_scaled))
    else:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    elapsed_time = time.time() - start_time
    
    # Append results
    results.append({
        "Model": name,
        "Accuracy": accuracy,
        "F1-Score": f1,
        "Time (s)": elapsed_time
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results).sort_values(by="F1-Score", ascending=False)

# Display results
print("Model Performance:")
print(results_df)


Parameters: { "use_label_encoder" } are not used.



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000096 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1006
[LightGBM] [Info] Number of data points in the train set: 1087, number of used features: 11
[LightGBM] [Info] Start training from score -4.911735
[LightGBM] [Info] Start training from score -3.253507
[LightGBM] [Info] Start training from score -0.857779
[LightGBM] [Info] Start training from score -0.932054
[LightGBM] [Info] Start training from score -2.093337
[LightGBM] [Info] Start training from score -4.352120
Model Performance:
                 Model  Accuracy  F1-Score  Time (s)
1        Random Forest  0.621324  0.601487  0.268245
7             LightGBM  0.606618  0.595429  0.168436
6              XGBoost  0.599265  0.583271  0.165150
3     SVM (RBF Kernel)  0.591912  0.564441  0.061055
5    Gradient Boosting  0.573529  0.561583  1.325994
9          Naive Bayes  0.544118  0.554837  0.00300
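
The table ranks models by weighted F1-Score, which can sit above or below accuracy. As a rough sketch of how the `average='weighted'` variant is computed, the pure-Python function below takes the per-class F1 and averages it using each class's support as the weight. The function name `weighted_f1` and the toy labels are assumptions for illustration, not data from the wine set.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        # A class that is never predicted gets precision 0 (and so F1 0).
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy labels (not from the wine dataset), with one rare class (8).
y_true_toy = [5, 5, 5, 5, 6, 6, 6, 8]
y_pred_toy = [5, 5, 5, 6, 6, 6, 5, 6]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"accuracy={accuracy_toy:.4f}, "
      f"weighted F1={weighted_f1(y_true_toy, y_pred_toy):.4f}")
```

Because the weights follow class frequency, a model can score reasonably on this metric while never predicting the rarest classes, which is worth keeping in mind when reading the table above.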

## Checking Overfitting

In [140]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report

# Train the default RandomForest model
from sklearn.ensemble import RandomForestClassifier

random_forest_default = RandomForestClassifier(random_state=42)
random_forest_default.fit(X_train, y_train)

# Predictions on the training and test sets
y_train_pred = random_forest_default.predict(X_train)
y_test_pred = random_forest_default.predict(X_test)

# Metrics on the training set
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred, average='weighted')
train_precision = precision_score(y_train, y_train_pred, average='weighted')
train_recall = recall_score(y_train, y_train_pred, average='weighted')

# Metrics on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')
test_precision = precision_score(y_test, y_test_pred, average='weighted')
test_recall = recall_score(y_test, y_test_pred, average='weighted')

# Print results
print("\nTraining Performance:")
print(f"Accuracy: {train_accuracy:.4f}")
print(f"F1-Score: {train_f1:.4f}")
print(f"Precision: {train_precision:.4f}")
print(f"Recall: {train_recall:.4f}")

print("\nTest Performance:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"F1-Score: {test_f1:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")


Training Performance:
Accuracy: 1.0000
F1-Score: 1.0000
Precision: 1.0000
Recall: 1.0000

Test Performance:
Accuracy: 0.6250
F1-Score: 0.6051
Precision: 0.5917
Recall: 0.6250
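
A perfect training score next to a test accuracy of 0.6250 is the typical signature of fully grown trees memorizing the training set. One way to see how tree depth drives that gap is sketched below on synthetic stand-in data; the dataset from `make_classification`, the variable names, and the depth values are all assumptions for illustration and do not reproduce the wine results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

gap_by_depth = {}
# max_depth=None lets each tree grow until its leaves are pure or too small to split.
for depth in (2, 4, None):
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    train_acc = forest.score(X_tr, y_tr)
    test_acc = forest.score(X_te, y_te)
    gap_by_depth[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```

A shrinking gap at smaller depths does not by itself mean better test performance; it only shows that the training score has become a more honest estimate.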




## Check Class Imbalance

In [146]:
df['quality'].value_counts()

quality
5    577
6    535
7    167
4     53
8     17
3     10
Name: count, dtype: int64
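
The counts above can be turned into two quick numbers: the ratio between the largest and smallest class, and per-class weights from the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. This is a sketch of an alternative way to look at the imbalance, not something the notebook itself uses; the variable names are hypothetical.

```python
# Class counts from the value_counts() output above.
counts = {5: 577, 6: 535, 7: 167, 4: 53, 8: 17, 3: 10}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
balanced_weights = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
for label in sorted(balanced_weights):
    print(f"quality={label}: weight={balanced_weights[label]:.2f}")
```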

## Oversampling with SMOTE

In [164]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.metrics import balanced_accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from collections import Counter
import time
import pandas as pd

# Separate features (X) and target (y)
X = df.drop(columns=['quality'])  # Replace 'quality' with your target column
y = df['quality']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Show the original class distribution
print("Original class distribution (Train):", Counter(y_train))
print("Original class distribution (Test):", Counter(y_test))

# Apply SMOTE to the training set only (the test set keeps its original distribution)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Show the new class distribution after SMOTE
print("Class distribution after SMOTE (Train):", Counter(y_train_smote))

# Models to compare
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Support Vector Machine": SVC(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "LightGBM": LGBMClassifier(random_state=42)
}

# List that collects one result row per model
results = []

# Try each model
for name, model in models.items():
    print(f"Training {name}...")
    start_time = time.time()
    
    # Train the model
    model.fit(X_train_smote, y_train_smote)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    
    elapsed_time = time.time() - start_time
    
    # Store results
    results.append({
        "Model": name,
        "Accuracy": acc,
        "F1-Score": f1,
        "Balanced Accuracy": bal_acc,
        "Time Taken (s)": elapsed_time
    })

# Build a DataFrame from the results
results_df = pd.DataFrame(results).sort_values(by="F1-Score", ascending=False)

# Display the results
print("\nModel Performance Comparison:")
print(results_df)


Original class distribution (Train): Counter({5: 461, 6: 428, 7: 134, 4: 42, 8: 14, 3: 8})
Original class distribution (Test): Counter({5: 116, 6: 107, 7: 33, 4: 11, 8: 3, 3: 2})
Class distribution after SMOTE (Train): Counter({6: 461, 7: 461, 5: 461, 4: 461, 8: 461, 3: 461})
Training Random Forest...
Training Logistic Regression...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training Support Vector Machine...
Training K-Nearest Neighbors...
Training Decision Tree...
Training LightGBM...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000164 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2805
[LightGBM] [Info] Number of data points in the train set: 2766, number of used features: 11
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759

Model Performance Comparison:
                    Model  Accuracy  F1-Score  Balanced Accuracy  \
0           Random Forest  0.551471  0.563075           0.304419   
5                LightGBM  0.551471  0.554117           0.283220   
1     Logistic Regression  0.419118  0.
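
SMOTE raised every training class to 461 samples by generating synthetic points rather than duplicating existing ones. The core idea is interpolation: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. The pure-Python sketch below illustrates only that idea; it is a simplified stand-in, not the `imblearn` implementation, and the function name `smote_like_samples`, its parameters, and the toy points are hypothetical.

```python
import math
import random


def smote_like_samples(points, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # The k nearest other points to the chosen base point.
        neighbors = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority-class points in two dimensions (not wine data).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.4)]
new_points = smote_like_samples(minority, n_new=5)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Since every synthetic point lies between two real ones, a class with only 8 training samples (quality 3 here) yields 453 synthetic points drawn from a very small region, which may help explain why balanced accuracy on the untouched test set stays low.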

## 4. Hyperparameter Tuning

In [131]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Define parameter grid
param_grid = { 
    'n_estimators': [20,25,30,35,40,45,50,60,70,75,80,85],  # 45 replaces a duplicated 35
    'max_depth': [1,2,3,4,5,6],
    'max_features': ['sqrt', 'log2', None],
    'max_leaf_nodes': [9,10,11,12,13,14,15,16,17,18,19,20], 
} 

# StratifiedKFold keeps class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV with RandomForest
# No `scoring` is set, so candidates are ranked by accuracy (the classifier default)
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=cv,
    verbose=1,
    n_jobs=-1  # Use multiple cores to speed up
)

# Fit the model
grid_search.fit(X_train, y_train)

# Evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score: {f1:.4f}")

Fitting 5 folds for each of 2592 candidates, totalling 12960 fits




Best Parameters: {'max_depth': 4, 'max_features': 'sqrt', 'max_leaf_nodes': 14, 'n_estimators': 30}
Accuracy: 0.6066
F1-Score: 0.5677
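
The size of an exhaustive grid search is the product of the number of values per hyperparameter, multiplied by the number of cross-validation folds. The short sketch below redoes that arithmetic for the grid above and arrives at the same totals as the `Fitting 5 folds...` log line; the dictionary `grid_sizes` is a hypothetical summary of the grid, holding only the list lengths.

```python
import math

# Number of values per hyperparameter in the grid above.
grid_sizes = {
    "n_estimators": 12,
    "max_depth": 6,
    "max_features": 3,
    "max_leaf_nodes": 12,
}
n_folds = 5

n_candidates = math.prod(grid_sizes.values())
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Each extra value in any list multiplies the total, so trimming one parameter from 12 values to 6 would halve the number of fits.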
