## Regularized Regression: Ridge & Lasso Models for Obesity Prediction

This notebook builds Ridge and Lasso regression models to predict obesity levels from lifestyle and physical features.
Categorical variables are one-hot encoded, and the seven obesity categories are mapped to an ordinal scale from 0 to 6.
The models are trained on 80% of the dataset and evaluated on the remaining 20% using Mean Squared Error (MSE) and R² score.

In [4]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Load the full cleaned dataset
data = pd.read_feather('../processed_data/obesity_cleaned.feather')

# Define target and features
target_col = "obesity_level"
y = data[target_col]
X = data.drop(columns=[target_col])

# List categorical columns
categorical_cols = [
    "gender", "family_history_overweight", "high_caloric_food_freq", 
    "vegetables_freq", "main_meal_count", "snacking_freq", "smokes",
    "water_intake", "calorie_tracking", "physical_activity_freq",
    "screen_time_hours", "alcohol_consumption_freq", "transport_mode"
]

# One-hot encode categorical features
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Map target classes to numbers
obesity_mapping = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Overweight_Level_I': 2,
    'Overweight_Level_II': 3,
    'Obesity_Type_I': 4,
    'Obesity_Type_II': 5,
    'Obesity_Type_III': 6
}
y_encoded = y.map(obesity_mapping)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)

# Train Ridge model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

# Predict and evaluate Ridge
ridge_pred = ridge_model.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
ridge_r2 = r2_score(y_test, ridge_pred)

print(f"Ridge Regression - MSE: {ridge_mse:.4f}")
print(f"Ridge Regression - R² Score: {ridge_r2:.4f}")

# Train Lasso model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

# Predict and evaluate Lasso
lasso_pred = lasso_model.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
lasso_r2 = r2_score(y_test, lasso_pred)

print(f"\nLasso Regression - MSE: {lasso_mse:.4f}")
print(f"Lasso Regression - R² Score: {lasso_r2:.4f}")

Ridge Regression - MSE: 0.1754
Ridge Regression - R² Score: 0.9559

Lasso Regression - MSE: 0.5137
Lasso Regression - R² Score: 0.8708


Comments on findings:

- Ridge regression (alpha = 1.0) reached a test R² of 0.9559 and an MSE of 0.1754, very similar to the OLS model. This suggests that mild L2 regularization did not hurt the model's ability to predict obesity levels.
- Lasso regression (alpha = 0.1) showed a noticeably lower R² (0.8708) and a higher MSE (0.5137). At this alpha, the L1 penalty, which can shrink coefficients all the way to zero, may be too strong for this dataset.
- Overall, at these initial settings Ridge appears to be the better regularized alternative to OLS for this problem, while Lasso with alpha = 0.1 may be too restrictive for accurate predictions.
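
One way to check the feature-selection explanation is to count how many coefficients each model sets exactly to zero. The sketch below uses synthetic data from `make_regression` as a stand-in for the encoded obesity features (an assumption, since the dataset is not reproduced here), and the names `X_demo`, `ridge_demo` and `lasso_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in for the encoded obesity features (assumption, not the real data):
# 20 features, of which only 5 actually influence the target.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=1.0, random_state=42
)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0, max_iter=10000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each fitted model.
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))

print(f"Ridge zero coefficients: {ridge_zeros} of {ridge_demo.coef_.size}")
print(f"Lasso zero coefficients: {lasso_zeros} of {lasso_demo.coef_.size}")
```

In the notebook itself, the same count could be taken from `lasso_model.coef_` and `ridge_model.coef_` after fitting. A large number of zeroed coefficients in the Lasso model would support the reading that its penalty is discarding useful features.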

In [5]:
# Hyperparameter tuning using GridSearchCV (finding best alpha)

# Ridge and Lasso are already imported in the previous cell
from sklearn.model_selection import GridSearchCV

# Define alpha values to test
alphas = [0.01, 0.1, 1, 10, 100]

# Ridge Regression 
ridge = Ridge()
ridge_params = {'alpha': alphas}

ridge_grid = GridSearchCV(ridge, ridge_params, scoring='r2', cv=5)  # 5-fold cross-validation
ridge_grid.fit(X_train, y_train)

print(f"Best Ridge alpha: {ridge_grid.best_params_['alpha']}")
print(f"Best Ridge R2: {ridge_grid.best_score_:.4f}")

# Lasso Regression
lasso = Lasso(max_iter=10000)  # increase max_iter for convergence
lasso_params = {'alpha': alphas}

lasso_grid = GridSearchCV(lasso, lasso_params, scoring='r2', cv=5)
lasso_grid.fit(X_train, y_train)

print(f"Best Lasso alpha: {lasso_grid.best_params_['alpha']}")
print(f"Best Lasso R2: {lasso_grid.best_score_:.4f}")


Best Ridge alpha: 0.01
Best Ridge R2: 0.9536
Best Lasso alpha: 0.01
Best Lasso R2: 0.9478


Findings after Hyperparameter Tuning:

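- The best alpha for both Ridge and Lasso was 0.01. This is the smallest value in the grid, so the true optimum may lie below the range tested.
- Ridge regression achieved a mean cross-validated R² of 0.9536, very close to the OLS model, which supports its stability and robustness.
- Lasso regression reached a mean cross-validated R² of 0.9478 at alpha = 0.01, well above the test R² of 0.8708 it obtained at alpha = 0.1. The two figures come from different evaluations (5-fold cross-validation on the training set versus the held-out test set), so they are not strictly comparable.
- Overall, Ridge regression remains slightly better than Lasso for this dataset, but both models perform strongly after tuning.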
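
Because `best_score_` is a mean cross-validated R² computed on the training folds, a like-for-like comparison with the earlier test-set results needs the tuned estimator scored on the held-out 20%. The following is a minimal sketch of that step. It uses synthetic `make_regression` data as a stand-in for the obesity features (an assumption) and a hypothetical wider grid, `wide_alphas`, that extends below 0.01 so the best value is less likely to sit at the edge of the grid.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded obesity data (assumption, not the real dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical wider grid: log-spaced from 1e-4 to 1e2, so 0.01 is no longer the edge.
wide_alphas = np.logspace(-4, 2, 13)

grid = GridSearchCV(
    Lasso(max_iter=10000), {"alpha": wide_alphas}, scoring="r2", cv=5
)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr,
# so it can be scored directly on the held-out split.
test_pred = grid.best_estimator_.predict(X_te)
test_mse = mean_squared_error(y_te, test_pred)
test_r2 = r2_score(y_te, test_pred)

print(f"Best alpha: {grid.best_params_['alpha']:.4g}")
print(f"Mean CV R2 (training folds): {grid.best_score_:.4f}")
print(f"Held-out test MSE: {test_mse:.4f}")
print(f"Held-out test R2: {test_r2:.4f}")
```

In this notebook, the equivalent step would be to call `ridge_grid.best_estimator_.predict(X_test)` and `lasso_grid.best_estimator_.predict(X_test)` and report MSE and R² on `y_test`, as was done for the untuned models.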