### Obesity Level Prediction using Ridge Logistic Regression
This notebook applies ridge (L2-regularized) logistic regression to predict obesity levels from lifestyle and demographic variables.

In [5]:
# Import libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Confirm working directory
print("Current working directory:", os.getcwd())

# Load data (adjust path as needed!)
train_path = os.path.join("..", "processed_data", "train_data.feather")
test_path = os.path.join("..", "processed_data", "test_data.feather")

train_df = pd.read_feather(train_path)
test_df = pd.read_feather(test_path)

# Split features and labels
y_train = train_df["obesity_level"]
X_train = train_df.drop(columns=["obesity_level"])

y_test = test_df["obesity_level"]
X_test = test_df.drop(columns=["obesity_level"])

# Encode target
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Identify columns
numerical_cols = X_train.select_dtypes(include=["float64"]).columns.tolist()
categorical_cols = X_train.select_dtypes(include=["object", "bool"]).columns.tolist()

# Column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

# Create pipeline with Ridge Logistic Regression
ridge_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(
        penalty='l2',
        solver='saga',
        multi_class='multinomial',
        max_iter=1000,
        random_state=42
    ))
])

# Parameter grid for Ridge
param_grid_ridge = {
    'classifier__C': [0.01, 0.1, 1, 10]
}

# Grid search
grid_search_ridge = GridSearchCV(ridge_pipeline, param_grid_ridge, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_ridge.fit(X_train, y_train_encoded)

# Results
best_ridge = grid_search_ridge.best_estimator_
print(f"Best Ridge Parameters: {grid_search_ridge.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search_ridge.best_score_:.4f}")

# Evaluate on test set
y_pred = best_ridge.predict(X_test)
test_acc = accuracy_score(y_test_encoded, y_pred)
print(f"Test Accuracy: {test_acc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred, target_names=label_encoder.classes_))



Current working directory: /Users/ashleyrazo/ml-project-obesity-prediction-1/ml-project-obesity-prediction/notebooks
Best Ridge Parameters: {'classifier__C': 10}
Best Cross-Validation Accuracy: 0.9236
Test Accuracy: 0.9362

Classification Report:
                     precision    recall  f1-score   support

Insufficient_Weight       0.90      1.00      0.95        56
      Normal_Weight       0.96      0.84      0.90        62
     Obesity_Type_I       0.96      0.97      0.97        78
    Obesity_Type_II       0.95      0.93      0.94        58
   Obesity_Type_III       0.95      0.98      0.97        63
 Overweight_Level_I       0.88      0.91      0.89        56
Overweight_Level_II       0.94      0.90      0.92        50

           accuracy                           0.94       423
          macro avg       0.94      0.93      0.93       423
       weighted avg       0.94      0.94      0.94       423
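
The cell above imports `confusion_matrix` but never calls it. As a rough sketch of how the per-class recall values in a report like this arise, the snippet below tabulates a confusion matrix with class names. The labels are made up for illustration and are not the notebook's predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only; these are NOT the notebook's results.
class_names = ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I"]
y_true_demo = ["Insufficient_Weight", "Normal_Weight", "Normal_Weight",
               "Overweight_Level_I", "Overweight_Level_I", "Normal_Weight"]
y_pred_demo = ["Insufficient_Weight", "Insufficient_Weight", "Normal_Weight",
               "Overweight_Level_I", "Normal_Weight", "Normal_Weight"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)

# Recall per class is the diagonal count divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(class_names, recall.round(2))))
```

In the notebook itself, the same pattern should work with `y_test_encoded`, `y_pred`, and `label_encoder.classes_` as the row and column names. The resulting frame could then be passed to `sns.heatmap(cm_df, annot=True, fmt="d")` for a visual check.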





### Comparing Results: Ridge vs Regular Logistic Regression 
- <strong>Insufficient_Weight</strong>: Recall improved from 0.98 -> 1.00.

- <strong>Normal_Weight</strong>: Recall improved from 0.81 -> 0.84, F1 from 0.85 -> 0.90.

- <strong>Obesity_Type_III</strong>: Recall remained strong at 0.98 (per the report above), with a slight boost in precision.

- <strong>Overweight Levels</strong>: Notably stronger F1 scores across both levels:
    - Level I: 0.85 -> <strong>0.89</strong>
    - Level II: 0.89 -> <strong>0.92</strong>
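
A comparison like the one above can be produced programmatically rather than read off two printed reports. The sketch below is one possible approach, not the notebook's actual method. It assumes synthetic data from `make_classification` in place of the obesity dataset, approximates "regular" (near-unregularized) logistic regression with a very large `C`, and uses a hypothetical helper named `per_class_scores`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the obesity dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

def per_class_scores(model):
    """Fit the model and return a per-class precision/recall/F1 table."""
    model.fit(X_tr, y_tr)
    report = classification_report(y_te, model.predict(X_te), output_dict=True)
    summary_keys = ("accuracy", "macro avg", "weighted avg")
    rows = {label: vals for label, vals in report.items() if label not in summary_keys}
    return pd.DataFrame(rows).T[["precision", "recall", "f1-score"]]

# A very large C makes the L2 penalty negligible (stand-in for the baseline).
baseline = per_class_scores(LogisticRegression(C=1e6, max_iter=2000))
ridge = per_class_scores(LogisticRegression(C=1.0, max_iter=2000))

# Positive values mean the ridge model scored higher on that class.
delta = (ridge - baseline).round(3)
print(delta)
```

The sign and size of the differences depend on the data, so no particular output is implied here.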

#### Why Ridge Performed Better: 
Ridge Logistic Regression adds <strong>L2 regularization</strong>, which penalizes large coefficient values. This has several benefits:
- <strong>Prevents Overfitting</strong>
    By discouraging large swings in model weights, ridge regularization reduces the chance of the model fitting noise, especially when one-hot encoding produces many dummy variables.

- <strong>Handles Multicollinearity</strong>
    Many features in the dataset (e.g., the different frequency levels of food consumption and exercise) are likely correlated. Ridge stabilizes the fit by spreading weight more evenly across correlated features.

- <strong>Improved Generalization</strong>
    The higher <strong>test accuracy (93.6%)</strong> suggests the model generalizes better to unseen data, consistent with smoother decision boundaries.

- <strong>Better Weight Sharing Across Classes</strong>
    Ridge performs well in multi-class classification because the penalty applies to the coefficients of all classes simultaneously, whereas unregularized logistic regression can overfit certain classes.
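
To make the shrinkage effect concrete, here is a minimal sketch of how the overall coefficient size varies with `C`. It assumes synthetic data with deliberately redundant (collinear) features rather than the notebook's preprocessed inputs. In scikit-learn, `C` is the inverse of the regularization strength, so smaller `C` means a stronger L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class data with redundant features (assumption: this only
# mimics correlated dummy variables; it is not the obesity dataset).
X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_redundant=4, n_classes=3, random_state=0)
X = StandardScaler().fit_transform(X)

C_values = [0.01, 0.1, 1, 10]  # same grid as param_grid_ridge
coef_norms = []
for C in C_values:
    # LogisticRegression applies an L2 penalty by default.
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X, y)
    # Overall size of the coefficient matrix across all classes and features.
    coef_norms.append(float(np.linalg.norm(clf.coef_)))

for C, norm in zip(C_values, coef_norms):
    print(f"C={C:<5} coefficient norm = {norm:.3f}")
```

Under these assumptions the norm should grow as `C` increases. Since the grid search above selected `C = 10`, the largest value tried, the penalty in the chosen model is comparatively weak.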

#### Summary 
When comparing regular logistic regression and ridge logistic regression on the obesity classification task, we find that ridge achieves a higher test accuracy (93.6% vs 92.2%) and stronger performance across nearly all classes. This improvement stems from ridge's use of L2 regularization, which penalizes overly large coefficients and mitigates overfitting. This is especially important in high-dimensional settings with many one-hot encoded categorical features. Notably, class-level F1 scores improved in categories like "Normal_Weight" and both "Overweight_Level" classes, suggesting that ridge helped the model better distinguish between closely related classes. Overall, ridge logistic regression offers more robust generalization and smoother class boundaries in this multi-class classification context. 
