# 📊 Machine Learning Algorithm Comparison
## Driver Monitoring System - Drowsiness Detection

---

## 🎯 Objective
This notebook compares **6 popular Machine Learning algorithms** for driver state detection:
1. **Logistic Regression**
2. **SVM (Linear)** - Support Vector Machine with linear kernel
3. **SVM (RBF)** - Support Vector Machine with RBF kernel
4. **Random Forest**
5. **XGBoost** - Extreme Gradient Boosting
6. **K-Nearest Neighbors (KNN)**

---

## 📝 Contents
- ✅ Load and explore data from `face_data.csv`
- ✅ Data preprocessing: standardization, train/test split
- ✅ Train each model with GridSearchCV
- ✅ Detailed evaluation: Confusion Matrix, Classification Report
- ✅ Compare performance across all models
- ✅ Conclusions and recommendations

---

## 🚀 How to Use
1. Run each cell in order from top to bottom
2. Ensure `face_data.csv` exists in the working directory
3. Each model is trained and evaluated independently
4. A final comparison table is displayed at the end

---

## 📦 Step 1: Import Libraries

Import all necessary libraries:
- **pandas, numpy**: Data manipulation
- **matplotlib, seaborn**: Visualization
- **sklearn**: ML algorithms and evaluation tools
- **xgboost**: XGBoost algorithm

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix, 
                             accuracy_score, precision_score, recall_score, f1_score)

# ML Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

import joblib
from pathlib import Path

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("[OK] Successfully imported all libraries!")
print(f"Pandas version: {pd.__version__}")

## 📂 Step 2: Load Data

The dataset in `face_data.csv` contains:
- **478 landmarks** per sample from MediaPipe Face Mesh (each landmark has x and y coordinates, giving 478 * 2 = 956 features)
- **4 classes**: 
  - 0: Awake
  - 1: Drowsy
  - 2: Looking Down (Phone)
  - 3: Microsleep
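
To make the row layout concrete, here is a minimal sketch of how one frame's landmarks could be flattened into a single CSV row. The column names (`x0`, `y0`, ...) and the interleaved ordering are assumptions for illustration; the actual header of `face_data.csv` is not specified here, apart from the `label` column.

```python
import csv
import io

NUM_LANDMARKS = 478  # from the text: 478 landmarks, each with (x, y)


def landmarks_to_row(landmarks, label):
    """Flatten [(x0, y0), (x1, y1), ...] into [x0, y0, x1, y1, ..., label]."""
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")
    row = []
    for x, y in landmarks:
        row.extend((x, y))
    row.append(label)
    return row


# Hypothetical header: x0, y0, ..., x477, y477, label
header = [f"{axis}{i}" for i in range(NUM_LANDMARKS) for axis in ("x", "y")] + ["label"]

# Fake frame: every landmark at the image centre, labelled 0 (Awake)
fake_frame = [(0.5, 0.5)] * NUM_LANDMARKS
row = landmarks_to_row(fake_frame, label=0)

# Write header and row to an in-memory CSV to show the resulting shape
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

print(len(header) - 1)  # number of feature columns (everything except the label)
```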

In [None]:
# Load data
DATA_FILE = "face_data.csv"
df = pd.read_csv(DATA_FILE)

print(f"[OK] Loaded data from {DATA_FILE}")
print(f"Shape: {df.shape}")
print(f"Number of samples: {len(df)}")
print(f"Number of features: {df.shape[1] - 1} (expected: 478 landmarks * 2 = 956)")

# Display data distribution by class
print("\n" + "="*70)
print("DATA DISTRIBUTION BY CLASS")
print("="*70)
class_names = {0: "Awake", 1: "Drowsy", 2: "Looking Down", 3: "Microsleep"}
for label, count in df['label'].value_counts().sort_index().items():
    print(f"Class {label} ({class_names[label]}): {count} samples ({count/len(df)*100:.1f}%)")

# Display first 5 rows
print("\n" + "="*70)
print("FIRST 5 ROWS")
print("="*70)
df.head()

## 📊 Step 3: Data Visualization

Visualize the class distribution to check for class imbalance before training.

In [None]:
# Plot class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
class_counts = df['label'].value_counts().sort_index()
colors = ['#2ecc71', '#e74c3c', '#f39c12', '#9b59b6']
axes[0].bar([class_names[i] for i in class_counts.index], class_counts.values, color=colors)
axes[0].set_title('Sample Distribution by Class', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Classes')
axes[0].set_ylabel('Number of Samples')
axes[0].grid(axis='y', alpha=0.3)

# Add values on bars
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v + 20, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(class_counts.values, labels=[class_names[i] for i in class_counts.index], 
            autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Class Distribution Percentage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

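# Report the measured imbalance instead of assuming the classes are balanced
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"[OK] Largest/smallest class ratio: {imbalance_ratio:.2f} (1.00 = perfectly balanced)")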

## ⚙️ Step 4: Data Preprocessing

Preprocessing steps:
1. **Separate features and target**: X (features) and y (labels)
2. **Train/test split**: 80% train, 20% test with stratify to preserve class ratios
3. **Standardization**: Use StandardScaler to scale features to mean=0, std=1
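
As a rough illustration of what a stratified split does, the sketch below splits each class separately so that the test set keeps approximately the same class ratios as the full data. It is a simplified stand-in written with the standard library, not scikit-learn's actual implementation, and the toy label counts are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, imbalanced on purpose (class 3 is rare)
labels = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
train_idx, test_idx = stratified_split(labels)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Without stratification, a rare class such as the toy class 3 could end up with no samples at all in the test set.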

**Why standardization?**
- Algorithms like SVM, KNN, Logistic Regression are sensitive to feature scales
- Helps models converge faster and achieve better performance
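
The key detail in standardization is that the mean and standard deviation are learned from the training data only and then reused for the test data. The sketch below shows this for a single feature column using the standard library; the numbers are made up, and it mirrors the z-score formula `(x - mean) / std` rather than reproducing `StandardScaler` itself.

```python
import statistics


def fit_scaler(column):
    """Learn mean and population std from the training values only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # constant feature: leave it centred but unscaled
    return mean, std


def transform(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(value - mean) / std for value in column]


train = [0.40, 0.50, 0.60, 0.70]  # e.g. one landmark's x coordinate (made up)
test = [0.55, 0.90]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuse the training statistics

print(round(statistics.fmean(train_scaled), 6), round(statistics.pstdev(train_scaled), 6))
print([round(v, 3) for v in test_scaled])
```

The scaled test values do not have mean 0 and std 1 themselves, which is expected: they are expressed relative to the training distribution.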

In [None]:
# Separate features and labels
X = df.drop('label', axis=1)
y = df['label']

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n[OK] Train/test split:")
print(f"  Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"  Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

# Standardize data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\n[OK] Data standardized (StandardScaler)")
print(f"  Mean of X_train_scaled: {X_train_scaled.mean():.6f}")
print(f"  Std of X_train_scaled: {X_train_scaled.std():.6f}")

# Save scaler for later use
joblib.dump(scaler, 'scaler.pkl')
print(f"\n[OK] Saved scaler to scaler.pkl")

---
# 🤖 PART 2: TRAIN AND EVALUATE ALGORITHMS
---

## 1️⃣ Logistic Regression

### 📚 Algorithm Explanation
**Logistic Regression** is a linear classifier that passes a weighted sum of the features through the sigmoid function to predict probabilities. With the `liblinear` solver used below, the four classes are handled one-vs-rest.

**Advantages:**
- ✅ Simple, easy to understand and interpret
- ✅ Fast training
- ✅ Effective with linearly separable data
- ✅ Provides probability predictions

**Disadvantages:**
- ❌ Cannot model non-linear decision boundaries without extra feature engineering
- ❌ Coefficients become unstable when features are highly correlated (multicollinearity)

**Hyperparameters to tune:**
- `C`: Inverse of regularization strength (smaller = stronger regularization)
- `penalty`: L1 or L2 regularization
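
To see why a smaller `C` means stronger regularization, the sketch below evaluates the L2-penalized objective `0.5 * ||w||^2 + C * sum(log-loss)` for a fixed weight vector on a toy binary problem. The data and weights are made up, and this only evaluates the objective; it does not train a model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def l2_objective(weights, bias, X, y, C):
    """Return (total, penalty, data_loss) for labels y in {-1, +1}."""
    penalty = 0.5 * sum(w * w for w in weights)
    data_loss = 0.0
    for features, label in zip(X, y):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        data_loss += math.log(1.0 + math.exp(-label * score))
    return penalty + C * data_loss, penalty, data_loss


X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
weights, bias = [0.8, 0.6], 0.0

print(f"P(class=+1 | x={X[0]}) = {sigmoid(0.8 * 1.0 + 0.6 * 2.0):.3f}")
for C in (0.01, 1, 100):
    total, penalty, data_loss = l2_objective(weights, bias, X, y, C)
    print(f"C={C:<5} penalty share of objective: {penalty / total:.2%}")
```

As `C` grows, the penalty term makes up a smaller share of the objective, so the optimizer is freer to choose large weights that fit the training data.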

In [None]:
print("="*70)
print("1️⃣ LOGISTIC REGRESSION")
print("="*70)

# Define model
lr = LogisticRegression(max_iter=2000, random_state=42)

# GridSearchCV to find best hyperparameters
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports l1 and l2
}

grid_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training Logistic Regression with GridSearchCV...")
grid_lr.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_lr.best_params_}")
print(f"[OK] Best CV F1-score: {grid_lr.best_score_:.4f}")

# Predict on test set
y_pred_lr = grid_lr.predict(X_test_scaled)

# Evaluation
acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr, average='macro')
rec_lr = recall_score(y_test, y_pred_lr, average='macro')
f1_lr = f1_score(y_test, y_pred_lr, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_lr:.4f} ({acc_lr*100:.2f}%)")
print(f"Precision: {prec_lr:.4f}")
print(f"Recall:    {rec_lr:.4f}")
print(f"F1-Score:  {f1_lr:.4f}")

# Classification Report
print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_lr, target_names=[class_names[i] for i in range(4)]))

# Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - Logistic Regression', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Save model
joblib.dump(grid_lr.best_estimator_, 'model_logistic_regression.pkl')
print("\n[OK] Saved model to model_logistic_regression.pkl")

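## 2️⃣ SVM Linear (Support Vector Machine - Linear Kernel)

### 📚 Algorithm Explanation
**SVM** finds the hyperplane that separates the classes with the maximum margin. For multiclass problems, `SVC` combines pairwise (one-vs-one) binary classifiers.

**Advantages:**
- ✅ Effective with high-dimensional data
- ✅ Only the support vectors shape the boundary, so points far from it have no influence
- ✅ Linear kernel is simple: only `C` needs tuning

**Disadvantages:**
- ❌ Slow training with large datasets (`SVC` fit time grows at least quadratically with the number of samples)
- ❌ Sensitive to the choice of `C`
- ❌ Harder to interpret than Logistic Regression, and no probability estimates unless `probability=True`

**Hyperparameters:**
- `C`: Trade-off between margin width and classification errors (larger `C` tolerates fewer training errors and gives a narrower margin)
- `kernel='linear'`: Suitable for linearly separable data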
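
The trade-off controlled by `C` can be sketched with the soft-margin objective `0.5 * ||w||^2 + C * sum(hinge loss)`. The example below evaluates it for a hand-picked hyperplane on toy binary data; all numbers are made up, and no training takes place.

```python
def decision_function(weights, bias, x):
    """Signed score: its sign is the predicted class."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hinge_loss(weights, bias, x, label):
    """Zero when the point is on the correct side and outside the margin."""
    return max(0.0, 1.0 - label * decision_function(weights, bias, x))


def svm_objective(weights, bias, X, y, C):
    """0.5 * ||w||^2 (wide margin) + C * total hinge loss (few violations)."""
    margin_term = 0.5 * sum(w * w for w in weights)
    violations = sum(hinge_loss(weights, bias, x, label) for x, label in zip(X, y))
    return margin_term + C * violations


X = [[2.0, 2.0], [0.5, 0.4], [-2.0, -1.0], [0.2, 0.1]]
y = [1, 1, -1, -1]  # the last point sits on the wrong side of the hyperplane
weights, bias = [1.0, 1.0], 0.0

for x, label in zip(X, y):
    print(x, label, round(hinge_loss(weights, bias, x, label), 2))
print("margin width:", round(2 / sum(w * w for w in weights) ** 0.5, 3))
print("objective at C=0.1:", round(svm_objective(weights, bias, X, y, 0.1), 3))
print("objective at C=10: ", round(svm_objective(weights, bias, X, y, 10), 3))
```

Points well outside the margin contribute zero loss, which is why only the points near the boundary end up influencing the solution.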

In [None]:
print("="*70)
print("2️⃣ SVM LINEAR")
print("="*70)

# Define model
svm_linear = SVC(kernel='linear', random_state=42)

# GridSearchCV
param_grid_svm_linear = {
    'C': [0.01, 0.1, 1, 10, 100]
}

grid_svm_linear = GridSearchCV(svm_linear, param_grid_svm_linear, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training SVM Linear with GridSearchCV...")
grid_svm_linear.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_svm_linear.best_params_}")
print(f"[OK] Best CV F1-score: {grid_svm_linear.best_score_:.4f}")

# Prediction
y_pred_svm_linear = grid_svm_linear.predict(X_test_scaled)

# Evaluation
acc_svm_linear = accuracy_score(y_test, y_pred_svm_linear)
prec_svm_linear = precision_score(y_test, y_pred_svm_linear, average='macro')
rec_svm_linear = recall_score(y_test, y_pred_svm_linear, average='macro')
f1_svm_linear = f1_score(y_test, y_pred_svm_linear, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_svm_linear:.4f} ({acc_svm_linear*100:.2f}%)")
print(f"Precision: {prec_svm_linear:.4f}")
print(f"Recall:    {rec_svm_linear:.4f}")
print(f"F1-Score:  {f1_svm_linear:.4f}")

print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_svm_linear, target_names=[class_names[i] for i in range(4)]))

# Confusion Matrix
cm_svm_linear = confusion_matrix(y_test, y_pred_svm_linear)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_svm_linear, annot=True, fmt='d', cmap='Greens', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - SVM Linear', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Save model
joblib.dump(grid_svm_linear.best_estimator_, 'model_svm_linear.pkl')
print("\n[OK] Saved model to model_svm_linear.pkl")

## 3️⃣ SVM RBF (Support Vector Machine - RBF Kernel)

### 📚 Algorithm Explanation
**SVM with RBF kernel** can handle non-linear data by implicitly mapping it to a higher-dimensional space.

**Advantages:**
- ✅ Handles complex non-linear data
- ✅ Flexible with gamma and C parameters

**Disadvantages:**
- ❌ Very slow training
- ❌ Easy to overfit if gamma is too large
- ❌ Requires careful standardization

**Hyperparameters:**
- `C`: Regularization parameter (smaller = stronger regularization)
- `gamma`: Reach of a single training sample (larger = shorter reach and a more complex boundary)
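
The role of `gamma` is easiest to see in the kernel itself, `K(a, b) = exp(-gamma * ||a - b||^2)`. The sketch below evaluates it for a nearby and a distant point at several `gamma` values; the points are arbitrary.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2); equals 1.0 for identical points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


anchor = [0.0, 0.0]
near = [0.5, 0.5]
far = [3.0, 3.0]

for gamma in (0.001, 0.01, 0.1, 1.0):
    print(f"gamma={gamma:<6} near={rbf_kernel(anchor, near, gamma):.3f} "
          f"far={rbf_kernel(anchor, far, gamma):.3f}")
```

With a large `gamma`, the similarity to distant points collapses toward zero, so each training sample only affects its immediate neighborhood. That is the mechanism behind overfitting when `gamma` is set too high.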

In [None]:
print("="*70)
print("3️⃣ SVM RBF")
print("="*70)

svm_rbf = SVC(kernel='rbf', random_state=42)

param_grid_svm_rbf = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.001, 0.01, 0.1]
}

grid_svm_rbf = GridSearchCV(svm_rbf, param_grid_svm_rbf, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training SVM RBF with GridSearchCV (may take a few minutes)...")
grid_svm_rbf.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_svm_rbf.best_params_}")
print(f"[OK] Best CV F1-score: {grid_svm_rbf.best_score_:.4f}")

y_pred_svm_rbf = grid_svm_rbf.predict(X_test_scaled)

acc_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)
prec_svm_rbf = precision_score(y_test, y_pred_svm_rbf, average='macro')
rec_svm_rbf = recall_score(y_test, y_pred_svm_rbf, average='macro')
f1_svm_rbf = f1_score(y_test, y_pred_svm_rbf, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_svm_rbf:.4f} ({acc_svm_rbf*100:.2f}%)")
print(f"Precision: {prec_svm_rbf:.4f}")
print(f"Recall:    {rec_svm_rbf:.4f}")
print(f"F1-Score:  {f1_svm_rbf:.4f}")

print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_svm_rbf, target_names=[class_names[i] for i in range(4)]))

cm_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_svm_rbf, annot=True, fmt='d', cmap='Oranges', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - SVM RBF', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

joblib.dump(grid_svm_rbf.best_estimator_, 'model_svm_rbf.pkl')
print("\n[OK] Saved model to model_svm_rbf.pkl")

## 4️⃣ Random Forest

### 📚 Algorithm Explanation
**Random Forest** is an ensemble method that trains many decision trees on bootstrap samples of the data and combines their predictions by voting (scikit-learn averages the trees' class probabilities).

**Advantages:**
- ✅ High performance, robust
- ✅ Handles non-linear data well
- ✅ Less prone to overfitting than a single decision tree
- ✅ Provides feature importance
- ✅ No need for data standardization

**Disadvantages:**
- ❌ Large model size, harder to deploy
- ❌ Slower training with many trees

**Hyperparameters:**
- `n_estimators`: Number of trees
- `max_depth`: Maximum depth of each tree
- `min_samples_split`: Minimum samples required to split a node
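
Two ingredients of the ensemble can be sketched without any library: drawing a bootstrap sample for each tree, and combining the trees' outputs. The example uses a hard majority vote for simplicity, whereas scikit-learn's `RandomForestClassifier` averages class probabilities; the tree predictions here are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (one draw per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(tree_predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
indices = bootstrap_sample(10, rng)
print("bootstrap rows:", sorted(indices))
print("rows left out :", sorted(set(range(10)) - set(indices)))

# Hypothetical predictions from five trees for one sample
votes = [1, 1, 3, 1, 0]  # 1 = Drowsy, 3 = Microsleep, 0 = Awake
print("forest prediction:", majority_vote(votes))
```

Because each tree sees a different resample (and some rows are left out of each one), individual trees make different mistakes, and combining them reduces variance.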

In [None]:
print("="*70)
print("4️⃣ RANDOM FOREST")
print("="*70)

rf = RandomForestClassifier(random_state=42)

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training Random Forest with GridSearchCV...")
grid_rf.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_rf.best_params_}")
print(f"[OK] Best CV F1-score: {grid_rf.best_score_:.4f}")

y_pred_rf = grid_rf.predict(X_test_scaled)

acc_rf = accuracy_score(y_test, y_pred_rf)
prec_rf = precision_score(y_test, y_pred_rf, average='macro')
rec_rf = recall_score(y_test, y_pred_rf, average='macro')
f1_rf = f1_score(y_test, y_pred_rf, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_rf:.4f} ({acc_rf*100:.2f}%)")
print(f"Precision: {prec_rf:.4f}")
print(f"Recall:    {rec_rf:.4f}")
print(f"F1-Score:  {f1_rf:.4f}")

print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_rf, target_names=[class_names[i] for i in range(4)]))

cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Purples', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - Random Forest', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

joblib.dump(grid_rf.best_estimator_, 'model_random_forest.pkl')
print("\n[OK] Saved model to model_random_forest.pkl")

## 5️⃣ XGBoost (Extreme Gradient Boosting)

### 📚 Algorithm Explanation
**XGBoost** is a gradient boosting algorithm: it trains weak learners (shallow decision trees) sequentially, each one fitted to correct the errors of the learners before it.

**Advantages:**
- ✅ Strong track record on tabular data, including many Kaggle competitions
- ✅ Handles complex non-linear data well
- ✅ Built-in regularization that helps limit overfitting
- ✅ Supports parallel processing
- ✅ Provides feature importance

**Disadvantages:**
- ❌ Many interacting hyperparameters
- ❌ Can still overfit if not tuned carefully
- ❌ Often slower to train than Random Forest

**Hyperparameters:**
- `n_estimators`: Number of boosting rounds
- `max_depth`: Maximum depth of each tree
- `learning_rate`: Shrinkage applied to each tree's contribution (smaller values need more rounds)
- `subsample`: Fraction of training samples drawn for each tree
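
The sequential idea behind boosting can be sketched with plain gradient boosting on a 1-D regression toy problem: each round fits a decision stump to the current residuals and adds a shrunken copy to the ensemble. This is a simplified illustration with made-up data, not XGBoost's actual algorithm, which adds second-order gradients, regularized tree scoring, and multiclass objectives.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D inputs, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Return the training MSE after each boosting round."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
history = boost(xs, ys, n_estimators=20, learning_rate=0.3)
print([round(mse, 3) for mse in history[::5]])
```

The training error keeps falling with more rounds, which is also why boosting can overfit: validation-based tuning of `n_estimators` and `learning_rate` is what keeps it in check.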

In [None]:
print("="*70)
print("5️⃣ XGBOOST")
print("="*70)

xgb = XGBClassifier(random_state=42, eval_metric='mlogloss')

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

grid_xgb = GridSearchCV(xgb, param_grid_xgb, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training XGBoost with GridSearchCV...")
grid_xgb.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_xgb.best_params_}")
print(f"[OK] Best CV F1-score: {grid_xgb.best_score_:.4f}")

y_pred_xgb = grid_xgb.predict(X_test_scaled)

acc_xgb = accuracy_score(y_test, y_pred_xgb)
prec_xgb = precision_score(y_test, y_pred_xgb, average='macro')
rec_xgb = recall_score(y_test, y_pred_xgb, average='macro')
f1_xgb = f1_score(y_test, y_pred_xgb, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_xgb:.4f} ({acc_xgb*100:.2f}%)")
print(f"Precision: {prec_xgb:.4f}")
print(f"Recall:    {rec_xgb:.4f}")
print(f"F1-Score:  {f1_xgb:.4f}")

print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_xgb, target_names=[class_names[i] for i in range(4)]))

cm_xgb = confusion_matrix(y_test, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Reds', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - XGBoost', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

joblib.dump(grid_xgb.best_estimator_, 'model_xgboost.pkl')
print("\n[OK] Saved model to model_xgboost.pkl")

## 6️⃣ K-Nearest Neighbors (KNN)

### 📚 Algorithm Explanation
**KNN** is a lazy learning algorithm that classifies a sample by the labels of its K nearest neighbors in feature space.

**Advantages:**
- ✅ Simple, easy to understand
- ✅ No explicit training phase (lazy learning): fitting mainly stores the training data
- ✅ Effective with small datasets

**Disadvantages:**
- ❌ Slow prediction (must compute distance to all training samples)
- ❌ Sensitive to feature scales (requires standardization)
- ❌ Not effective with high-dimensional data (curse of dimensionality)
- ❌ Sensitive to outliers

**Hyperparameters:**
- `n_neighbors`: Number of neighbors (K)
- `weights`: 'uniform' (all equal) or 'distance' (closer = higher weight)
- `metric`: Distance metric (euclidean, manhattan, etc.)
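
The three hyperparameters map directly onto a few lines of code. The sketch below is a brute-force KNN written with the standard library on made-up 2-D points; it illustrates the idea and is not how scikit-learn implements `KNeighborsClassifier`.

```python
import math
from collections import defaultdict


def distance(a, b, metric="euclidean"):
    """Distance between two points under the chosen metric."""
    if metric == "euclidean":
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    if metric == "manhattan":
        return sum(abs(ai - bi) for ai, bi in zip(a, b))
    raise ValueError(f"unknown metric: {metric}")


def knn_predict(X_train, y_train, query, n_neighbors=3,
                weights="uniform", metric="euclidean"):
    """Vote among the n_neighbors closest training points."""
    neighbors = sorted(
        (distance(x, query, metric), label) for x, label in zip(X_train, y_train)
    )[:n_neighbors]
    scores = defaultdict(float)
    for dist, label in neighbors:
        # 'distance' weighting: closer neighbors count more (1 / distance)
        scores[label] += 1.0 if weights == "uniform" else 1.0 / max(dist, 1e-12)
    return max(scores, key=scores.get)


X_train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
y_train = [1, 0, 0, 0]
query = [0.1, 0.1]

print(knn_predict(X_train, y_train, query, weights="uniform"))   # prints 0
print(knn_predict(X_train, y_train, query, weights="distance"))  # prints 1
```

With uniform weights, the two more distant class-0 neighbors outvote the single very close class-1 neighbor; with distance weighting, the close neighbor wins.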

In [None]:
print("="*70)
print("6️⃣ KNN")
print("="*70)

knn = KNeighborsClassifier()

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

print("\n[TRAINING] Training KNN with GridSearchCV...")
grid_knn.fit(X_train_scaled, y_train)

print(f"\n[OK] Best parameters: {grid_knn.best_params_}")
print(f"[OK] Best CV F1-score: {grid_knn.best_score_:.4f}")

y_pred_knn = grid_knn.predict(X_test_scaled)

acc_knn = accuracy_score(y_test, y_pred_knn)
prec_knn = precision_score(y_test, y_pred_knn, average='macro')
rec_knn = recall_score(y_test, y_pred_knn, average='macro')
f1_knn = f1_score(y_test, y_pred_knn, average='macro')

print(f"\n" + "="*70)
print("RESULTS ON TEST SET")
print("="*70)
print(f"Accuracy:  {acc_knn:.4f} ({acc_knn*100:.2f}%)")
print(f"Precision: {prec_knn:.4f}")
print(f"Recall:    {rec_knn:.4f}")
print(f"F1-Score:  {f1_knn:.4f}")

print(f"\n" + "-"*70)
print("CLASSIFICATION REPORT")
print("-"*70)
print(classification_report(y_test, y_pred_knn, target_names=[class_names[i] for i in range(4)]))

cm_knn = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='YlOrBr', 
            xticklabels=[class_names[i] for i in range(4)],
            yticklabels=[class_names[i] for i in range(4)])
plt.title('Confusion Matrix - KNN', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

joblib.dump(grid_knn.best_estimator_, 'model_knn.pkl')
print("\n[OK] Saved model to model_knn.pkl")

---
# 📈 PART 3: COMPREHENSIVE COMPARISON
---

## Summary Table

Compare performance of all 6 algorithms on the test set
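
Since the table ranks models by macro-averaged scores, it may help to see how those are derived from a confusion matrix. The sketch below follows the row = true class, column = predicted class convention used by `confusion_matrix`; the matrix values are made up for illustration and are not results from this notebook.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


# Made-up confusion matrix for Awake, Drowsy, Looking Down, Microsleep
cm = [
    [48, 2, 0, 0],
    [3, 40, 1, 6],
    [0, 1, 49, 0],
    [0, 7, 0, 13],
]
accuracy = sum(cm[k][k] for k in range(4)) / sum(map(sum, cm))
precision, recall, f1 = macro_average(per_class_metrics(cm))
print(f"accuracy={accuracy:.3f} macro-P={precision:.3f} "
      f"macro-R={recall:.3f} macro-F1={f1:.3f}")
```

In this toy matrix the rare Microsleep class is often confused with Drowsy, which pulls the macro F1 below the accuracy. That gap is the reason to rank by macro F1 when a safety-critical class is under-represented.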

In [None]:
# Create comparison table
results_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM Linear', 'SVM RBF', 'Random Forest', 'XGBoost', 'KNN'],
    'Accuracy': [acc_lr, acc_svm_linear, acc_svm_rbf, acc_rf, acc_xgb, acc_knn],
    'Precision': [prec_lr, prec_svm_linear, prec_svm_rbf, prec_rf, prec_xgb, prec_knn],
    'Recall': [rec_lr, rec_svm_linear, rec_svm_rbf, rec_rf, rec_xgb, rec_knn],
    'F1-Score': [f1_lr, f1_svm_linear, f1_svm_rbf, f1_rf, f1_xgb, f1_knn]
})

# Sort by F1-Score
results_comparison = results_comparison.sort_values('F1-Score', ascending=False).reset_index(drop=True)

print("="*80)
print("COMPREHENSIVE COMPARISON - SORTED BY F1-SCORE")
print("="*80)
print(results_comparison.to_string(index=False))
print("="*80)

# Find best model
best_model_name = results_comparison.iloc[0]['Model']
best_f1 = results_comparison.iloc[0]['F1-Score']
print(f"\nBEST MODEL: {best_model_name} (F1-Score: {best_f1:.4f})")

# Display styled table
results_comparison_styled = results_comparison.style.background_gradient(cmap='RdYlGn', subset=['Accuracy', 'Precision', 'Recall', 'F1-Score'])
results_comparison_styled

## Comparison Charts

In [None]:
# Plot bar charts comparing all metrics
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors_list = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12', '#9b59b6', '#1abc9c']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    bars = ax.bar(results_comparison['Model'], results_comparison[metric], color=colors_list)
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.set_ylabel(metric)
    ax.set_ylim([0, 1.05])
    ax.grid(axis='y', alpha=0.3)
    # Rotate the existing tick labels rather than resetting them without fixed tick positions
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    
    # Add values on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{height:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("[OK] Comparison charts created!")

---
# 🎯 PART 4: CONCLUSIONS AND RECOMMENDATIONS
---

## Detailed Analysis

### 1. Logistic Regression

**Evaluation:**
- Simple algorithm, fast training
- Reasonably effective with linearly separable data
- Good for baseline model

**When to use:**
- Need fast, simple model for deployment
- Need probability predictions
- Data is linearly separable

**When NOT to use:**
- Complex non-linear data
- Need highest possible accuracy

---

### 2. SVM Linear

**Evaluation:**
- Good performance with high-dimensional data
- Slower training than Logistic Regression
- Only the support vectors shape the boundary, so points far from it have no influence

**When to use:**
- Large number of features (high-dimensional)
- Need optimal margin
- Data has clear decision boundaries

**When NOT to use:**
- Very large dataset (slow training)
- Need interpretability

---

### 3. SVM RBF

**Evaluation:**
- Handles non-linear data well
- Very slow training, sensitive to hyperparameters
- Easy to overfit if not tuned carefully

**When to use:**
- Complex non-linear data
- Have time for hyperparameter tuning

**When NOT to use:**
- Need fast training and prediction
- Very large dataset

---

### 4. Random Forest

**Evaluation:**
- Very good performance, robust
- Less prone to overfitting, no need for standardization
- Large model size, harder to deploy

**When to use:**
- Need high accuracy and stability
- Need feature importance
- Don't need high interpretability

**When NOT to use:**
- Need lightweight model for mobile/edge deployment
- Need very fast real-time prediction

---

### 5. XGBoost

**Evaluation:**
- Often among the most accurate models on tabular data
- Flexible, many hyperparameters
- Slower training than Random Forest

**When to use:**
- Need maximum accuracy
- Complex non-linear data
- Have time for hyperparameter tuning

**When NOT to use:**
- Need simple model
- Limited compute resources

---

### 6. KNN

**Evaluation:**
- Simple, no training required
- Slow prediction (must compute distances)
- Not effective with high-dimensional data

**When to use:**
- Small dataset
- Need quick baseline
- Low-dimensional data

**When NOT to use:**
- High-dimensional data (curse of dimensionality)
- Need fast prediction for production
- Large dataset

---

## Final Summary

### Recommended Models for Driver Monitoring System

**TOP 3 Choices:**

1. **Random Forest** - HIGHLY RECOMMENDED
   - High accuracy, stable
   - Not sensitive to outliers
   - Feature importance helps understand model
   - Saved to `model_random_forest.pkl`

2. **XGBoost** - RECOMMENDED
   - Accuracy typically similar to Random Forest
   - Slower training but fast prediction
   - Good choice when accuracy matters most

3. **SVM RBF** - ALTERNATIVE
   - Good for smaller datasets
   - Requires careful hyperparameter tuning

### Future Development Directions

1. **Collect more data**: Increase samples per class
2. **Feature Engineering**: Extract features such as EAR (Eye Aspect Ratio), MAR (Mouth Aspect Ratio), and head pose
3. **Ensemble Methods**: Combine multiple models (voting, stacking)
4. **Deep Learning**: Try Neural Networks with more data
5. **Model Optimization**: Quantization, pruning for edge device deployment
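
As an example of the feature engineering direction, the Eye Aspect Ratio (EAR) is commonly computed from six eye landmarks as `(|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|)`. The sketch below uses hypothetical coordinates; which MediaPipe landmark indices correspond to `p1` through `p6` is left open here and would need to be looked up.

```python
import math


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|).

    p1 and p4 are the eye corners; p2 and p3 lie on the upper lid,
    p6 and p5 are the matching points on the lower lid.
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


# Hypothetical (x, y) points, not real MediaPipe output
open_eye = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5), (3.0, 0.0), (2.0, -0.5), (1.0, -0.5)]
closed_eye = [(0.0, 0.0), (1.0, 0.05), (2.0, 0.05), (3.0, 0.0), (2.0, -0.05), (1.0, -0.05)]

print(round(eye_aspect_ratio(*open_eye), 3))    # prints 0.333
print(round(eye_aspect_ratio(*closed_eye), 3))  # prints 0.033
```

A single ratio like this drops toward zero as the eye closes, so it could replace many raw eyelid coordinates with one feature that does not depend on face position or scale.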

---

### Deployment Notes

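- **Random Forest** is the default recommendation for production; confirm it against the Part 3 comparison table for your data
- Ship `scaler.pkl` together with the model file: inputs must be transformed with the same fitted scaler at inference time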
- Test model on real-world data before deployment
- Monitor model performance over time
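
One way to make sure the scaler and model always travel together is to wrap them in a scikit-learn `Pipeline` and save that as a single artifact. The sketch below is a suggestion rather than what this notebook does: it uses synthetic data from `make_classification` as a stand-in for `face_data.csv`, and the file name `model_pipeline.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for face_data.csv: 4 classes, 20 features
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refit inside every CV fold, so validation folds never leak into it
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="f1_macro")
grid.fit(X_train, y_train)  # raw, unscaled features go in

# One artifact holds both the fitted scaler and the fitted model
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_pipeline.pkl"  # hypothetical file name
    joblib.dump(grid.best_estimator_, path)
    loaded = joblib.load(path)

predictions = loaded.predict(X_test)
print(grid.best_params_, predictions[:5])
```

Besides simplifying deployment, this keeps cross-validation scores honest: when data is scaled once before `GridSearchCV`, each validation fold has already influenced the scaler's statistics.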

---

**COMPLETED!** This notebook provides a detailed comparison of 6 ML algorithms.