# Data Modeling - Diabetes Dataset

**Authors:**  
Filip Kobus, Łukasz Jarzęcki, Paweł Skierkowski  
**Date:** 23.01.25  
**Team 3**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import sys, pathlib

sys.path.append(str(pathlib.Path.cwd().parent / "src"))

from med_project.config import PROCESSED_DATA_DIR
from xgboost import XGBRegressor

In [None]:
X_train = pd.read_csv(PROCESSED_DATA_DIR / 'X_train.csv')
X_test = pd.read_csv(PROCESSED_DATA_DIR / 'X_test.csv')
y_train = pd.read_csv(PROCESSED_DATA_DIR / 'y_train.csv').values.ravel()
y_test = pd.read_csv(PROCESSED_DATA_DIR / 'y_test.csv').values.ravel()

final_results = []

Before applying linear regression, we check how highly correlated features, such as LDL cholesterol, HbA1c, and waist-to-hip ratio, affect the model.

**From the EDA:**
> Glucose postprandial and HbA1c are extremely highly correlated (r=0.93), as are total cholesterol and LDL cholesterol (r=0.91). Fasting glucose correlates strongly with both HbA1c (r=0.70) and postprandial glucose (r=0.59). BMI and waist-to-hip ratio also show high correlation (r=0.77).

From each correlated pair we drop one column, chosen arbitrarily.

In [None]:
# One feature from each highly correlated pair reported in the EDA
correlated_features = ['waist_to_hip_ratio', 'ldl_cholesterol', 'hba1c']
X_noncor_train = X_train.drop(columns=correlated_features)
X_noncor_test = X_test.drop(columns=correlated_features)
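
The columns above were picked by hand from the EDA. As a rough sketch, the same pairs could be listed programmatically from the correlation matrix. The helper name `correlated_pairs`, the 0.7 threshold, and the synthetic frame standing in for `X_train` are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return (feature_a, feature_b, |r|) for every pair with |r| >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Synthetic stand-in for X_train (assumption: all columns are numeric).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=500),  # nearly a copy of 'a'
    'c': rng.normal(size=500),                     # independent of both
})

pairs = correlated_pairs(demo)
print(pairs)
```

Assuming all processed columns are numeric, calling `correlated_pairs(X_train)` would be expected to list the pairs quoted from the EDA.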

In [None]:
# LINEAR REGRESSION
def linear_regression_model(X_train, y_train, X_test, y_test):
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)

    y_pred_lr = lr_model.predict(X_test)

    rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
    r2_lr = r2_score(y_test, y_pred_lr)

    print(f'Linear Regression RMSE: {rmse_lr:.6f}')
    print(f'Linear Regression R2: {r2_lr:.6f}')

    final_results.append({'Model': 'Linear Regression', 'RMSE': rmse_lr, 'R2': r2_lr})

    return lr_model

Compare the RMSE and R2 of the two approaches.

In [None]:
print("Linear Regression with all features:")
lr_before_drop = linear_regression_model(X_train, y_train, X_test, y_test)
print("Linear Regression after dropping correlated features:")
lr_after_drop = linear_regression_model(X_noncor_train, y_train, X_noncor_test, y_test)

Compare the coefficients of the two models. We check the coefficients of the features that were paired with the removed ones.

In [None]:
features_to_check = ["glucose_postprandial", "cholesterol_total", "bmi"]
# Model trained on all features
print("Coef before dropping correlated features:")
for feature in features_to_check:
    coef = lr_before_drop.coef_[list(X_train.columns).index(feature)]
    print(f"{feature}: {coef}")
# Model trained on the reduced feature set
print("\nCoef after dropping correlated features:")
for feature in features_to_check:
    coef = lr_after_drop.coef_[list(X_noncor_train.columns).index(feature)]
    print(f"{feature}: {coef}")

Dropping the highly correlated features did not change the model's accuracy, because the information they carried was redundant. However, the coefficients of the retained features became much larger (about 10 times larger for glucose_postprandial). This suggests that, under multicollinearity, the effect was split between each feature and its correlated partner, so the original coefficients understated the importance of the individual features.
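
The effect described above can be reproduced on synthetic data. The sketch below is not taken from the project: the variables `x1`, `x2`, the noise levels, and the true coefficients of 1.5 are all assumptions chosen for illustration. Two nearly identical features share a total effect of about 3; when one is dropped, the remaining feature absorbs it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1
y = 1.5 * x1 + 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Both correlated features: the total effect (about 3) is split between them.
both = LinearRegression().fit(np.column_stack([x1, x2]), y)

# Only x1 kept: it absorbs the effect of the dropped x2.
single = LinearRegression().fit(x1.reshape(-1, 1), y)

print('coefficients with both features:', both.coef_)
print('coefficient with x1 only:', single.coef_[0])
```

The sum of the two coefficients in the first model and the single coefficient in the second model both land near 3, while the individual coefficients in the first model are much less certain.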

In [None]:
lr_model = lr_after_drop

We picked the model trained on the dataset without the correlated features as the linear regression model for further analysis.

In [None]:
# RIDGE
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_model.fit(X_train, y_train)

y_pred_ridge = ridge_model.predict(X_test)

rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f'Ridge Regression RMSE: {rmse_ridge:.6f}')
print(f'Ridge Regression R2: {r2_ridge:.6f}')

final_results.append({'Model': 'Ridge', 'RMSE': rmse_ridge, 'R2': r2_ridge})

Ridge regression achieved the same performance as the linear model with an R2 of 0.9935. This shows that the data has a strong linear structure and that L2 regularization successfully maintains accuracy while protecting the model from extreme coefficient values caused by correlation.

In [None]:
# LASSO
lasso_model = Lasso(alpha=0.1, random_state=42)
lasso_model.fit(X_train, y_train)

y_pred_lasso = lasso_model.predict(X_test)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f'Lasso RMSE: {rmse_lasso:.6f}')
print(f'Lasso R2: {r2_lasso:.6f}')
final_results.append({'Model': 'Lasso', 'RMSE': rmse_lasso, 'R2': r2_lasso})

Lasso regression has a slightly higher error than Ridge, most likely because the L1 penalty shrinks the coefficients and can set those of redundant features to exactly zero, which simplifies the model. The performance is still very high (R2 = 0.992), showing that we can identify the most important risk factors without losing significant predictive power.
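
Whether Lasso actually removed features can be checked by looking for coefficients that are exactly zero. The following is a minimal sketch on synthetic data; the feature names `f0` to `f4` and the data-generating process are assumptions made for the example, not the project's dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only the first two columns influence the target; the other three are noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=n)
names = ['f0', 'f1', 'f2', 'f3', 'f4']

lasso = Lasso(alpha=0.1, random_state=42).fit(X, y)

# Lasso sets the coefficients of features it drops to exactly zero.
kept = [name for name, c in zip(names, lasso.coef_) if c != 0]
dropped = [name for name, c in zip(names, lasso.coef_) if c == 0]
print('kept:', kept)
print('dropped:', dropped)
```

For the model in this notebook, the analogous check would be `X_train.columns[lasso_model.coef_ == 0]`.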

In [None]:
# XGBOOST
xgb_model = XGBRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=6, random_state=42, n_jobs=-1
)
xgb_model.fit(X_train, y_train)

y_pred_xgb = xgb_model.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f'XGBoost RMSE: {rmse_xgb:.4f}')
print(f'XGBoost R2: {r2_xgb:.6f}')
final_results.append({'Model': 'XGBoost', 'RMSE': rmse_xgb, 'R2': r2_xgb})

XGBoost performed best, with the lowest RMSE (0.3050) and the highest R2 (0.9988). This suggests that tree-based models capture complex non-linear patterns in the data better than linear models. The predictive accuracy of XGBoost is also largely unaffected by highly correlated features, which helps it achieve superior accuracy here.

In [None]:
# TUNING
# Ridge
linear_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}

ridge_grid = GridSearchCV(
    Ridge(random_state=42), linear_params, scoring='neg_root_mean_squared_error', cv=5
)
ridge_grid.fit(X_train, y_train)

print(f'Best Ridge alpha: {ridge_grid.best_params_}')
print(f'Best Ridge RMSE: {-ridge_grid.best_score_:.4f}')

# Lasso
lasso_grid = GridSearchCV(
    Lasso(random_state=42), linear_params, scoring='neg_root_mean_squared_error', cv=5
)
lasso_grid.fit(X_train, y_train)

print(f'Best Lasso alpha: {lasso_grid.best_params_}')
print(f'Best Lasso RMSE: {-lasso_grid.best_score_:.4f}')

best_ridge = ridge_grid.best_estimator_
best_lasso = lasso_grid.best_estimator_

# xgboost
xgb_params = {
    'n_estimators': [500, 1000, 2000],
    'max_depth': [2, 3, 4],
    'min_child_weight': [5, 7, 10, 15],
    'learning_rate': [0.005, 0.01, 0.02, 0.05],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.6, 0.7, 0.8],
    'reg_alpha': [0, 0.1, 1, 5],
    'reg_lambda': [1, 2, 5],
}
xgb_search = RandomizedSearchCV(
    XGBRegressor(random_state=42, n_jobs=-1),
    param_distributions=xgb_params,
    n_iter=50,
    scoring='neg_root_mean_squared_error',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1,
)

xgb_search.fit(X_train, y_train)

print(f'Best XGBoost params: {xgb_search.best_params_}')
print(f'Best XGBoost RMSE: {-xgb_search.best_score_:.4f}')

best_xgb = xgb_search.best_estimator_

Hyperparameter tuning significantly improved the models, especially XGBoost, where the RMSE dropped from 0.30 to 0.17. For Ridge and Lasso, the best alpha values were very low, confirming that the linear relationships in the data are very strong. The tuned XGBoost model is the most precise and stable version of the regression for this project.
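
When reading the tuning output, note that `best_score_` is the mean score over the validation folds of the training data, so it is not directly comparable with an RMSE computed on the held-out test set. The sketch below shows one way to inspect the per-candidate cross-validation results through `cv_results_`. It uses Ridge on synthetic data as a stand-in; the data and the `cv_table` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

grid = GridSearchCV(
    Ridge(),
    {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
grid.fit(X, y)

# One row per alpha: mean and spread of the validation RMSE over the 5 folds.
cv_table = pd.DataFrame({
    'alpha': [p['alpha'] for p in grid.cv_results_['params']],
    'cv_rmse_mean': -grid.cv_results_['mean_test_score'],
    'cv_rmse_std': grid.cv_results_['std_test_score'],
}).sort_values('cv_rmse_mean')
print(cv_table)
```

The same table could be built from `ridge_grid`, `lasso_grid`, or `xgb_search`, which makes it easier to see how close the best candidates are to each other.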

In [None]:
# COMPARISON
feature_names = X_train.columns
fig, axes = plt.subplots(3, 1, figsize=(24, 24))

# Ridge
df_ridge = pd.DataFrame({'Feature': feature_names, 'Value': best_ridge.coef_})

df_ridge = df_ridge.reindex(
    df_ridge['Value'].abs().sort_values(ascending=False).index
).head(10)

sns.barplot(data=df_ridge, x='Value', y='Feature', ax=axes[0], hue='Feature')
axes[0].set_title('Tuned Ridge', fontsize=14, fontweight='bold')
axes[0].axvline(0, color='black', linestyle='--', linewidth=1)
axes[0].set_xlabel('Coefficient')

# Lasso
df_lasso = pd.DataFrame({'Feature': feature_names, 'Value': best_lasso.coef_})
df_lasso = df_lasso.reindex(
    df_lasso['Value'].abs().sort_values(ascending=False).index
).head(10)

sns.barplot(data=df_lasso, x='Value', y='Feature', ax=axes[1], hue='Feature')
axes[1].set_title('Tuned Lasso', fontsize=14, fontweight='bold')
axes[1].axvline(0, color='black', linestyle='--', linewidth=1)
axes[1].set_xlabel('Coefficient')

# xgboost
df_xgb = pd.DataFrame(
    {'Feature': feature_names, 'Value': best_xgb.feature_importances_}
)
df_xgb = df_xgb.sort_values(by='Value', ascending=False).head(12)

sns.barplot(data=df_xgb, x='Value', y='Feature', ax=axes[2], hue='Feature')
axes[2].set_title('Tuned XGBoost', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Importance')

1. Best Model: The tuned XGBoost model is the most accurate with an RMSE of 0.1773, significantly outperforming linear models.

2. Feature Importance: All models consistently identified family history, age, BMI, and fasting glucose as the most critical predictors of diabetes risk.

3. Regularization Impact: Using Lasso and Ridge ensured that the coefficients remained stable and the models generalized well to unseen data, even with high multicollinearity present in the features.

In [None]:
# FINAL TEST
final_models = {
    'Linear Regression': lr_model,
    'Tuned Ridge': best_ridge,
    'Tuned Lasso': best_lasso,
    'Tuned XGBoost': best_xgb,
}

results_comparison = []

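for name, model in final_models.items():
    # The linear regression was fitted without the correlated columns,
    # so it must be evaluated on the matching reduced test set.
    X_eval = X_noncor_test if name == 'Linear Regression' else X_test
    y_pred = model.predict(X_eval)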

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results_comparison.append({'Model': name, 'RMSE': rmse, 'MAE': mae, 'R2 Score': r2})

comparison_df = pd.DataFrame(results_comparison).sort_values(by='RMSE')

print(comparison_df)

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

sns.barplot(
    x='RMSE',
    y='Model',
    data=comparison_df,
    palette='viridis',
    ax=axes[0],
    hue='Model',
    legend=False,
)
axes[0].set_title('RMSE')
axes[0].set_xlabel('Root Mean Squared Error')

sns.barplot(
    x='R2 Score',
    y='Model',
    data=comparison_df,
    palette='magma',
    ax=axes[1],
    hue='Model',
    legend=False,
)
axes[1].set_title('R2 Score')
axes[1].set_xlabel('R2 Score')
axes[1].set_xlim(comparison_df['R2 Score'].min() - 0.01, 1.005)

plt.tight_layout()
plt.show()

top_features = [
    'family_history_diabetes',
    'age',
    'physical_activity_minutes_per_week',
    'bmi',
    'glucose_fasting',
]

The tuned XGBoost model achieved the best performance with an R2 of 0.9988, significantly outperforming linear models by capturing complex non-linear relationships. Dropping redundant features and using Lasso regularization stabilized the coefficients for key factors like age, BMI, and glucose without decreasing overall accuracy. These results indicate that the model is highly precise and provides stable insights for identifying the most critical diabetes risk factors.

In [None]:
# RESIDUALS ANALYSIS
y_pred_final = final_models['Tuned XGBoost'].predict(X_test)
residuals = y_test - y_pred_final

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=residuals, alpha=0.5, color='purple')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Actual values')
plt.ylabel('Residuals')
plt.show()

plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True, color='purple', bins=30)
plt.title('Residual distribution')
plt.xlabel('Residuals')
plt.show()

The residual plot shows that the errors are scattered randomly around zero, which suggests that the model captured the main patterns in the data. The approximately normal, zero-centered distribution of errors indicates that the predictions are not systematically biased. This final analysis supports the tuned XGBoost model as the most reliable choice for predicting diabetes risk.
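
The visual impression can be backed with a few numbers. The sketch below is a hypothetical helper, `residual_summary`, that reports the mean, skewness, and Shapiro-Wilk p-value of a residual vector. It is demonstrated on synthetic samples, not on the project's residuals.

```python
import numpy as np
from scipy import stats


def residual_summary(residuals):
    """Return mean, skewness and Shapiro-Wilk p-value of the residuals."""
    residuals = np.asarray(residuals, dtype=float)
    _, p_value = stats.shapiro(residuals)
    return {
        'mean': float(residuals.mean()),
        'skew': float(stats.skew(residuals)),
        'shapiro_p': float(p_value),
    }


rng = np.random.default_rng(7)
# Symmetric, zero-centered errors versus clearly right-skewed errors.
normal_like = residual_summary(rng.normal(loc=0.0, scale=0.2, size=1000))
skewed = residual_summary(rng.exponential(scale=0.2, size=1000) - 0.2)
print(normal_like)
print(skewed)
```

In the notebook this would be called as `residual_summary(residuals)`. A mean close to zero supports the absence of systematic bias, while a very small Shapiro-Wilk p-value would argue against normality. For samples larger than 5000, SciPy warns that the p-value may be inaccurate, so the skewness and the histogram remain useful cross-checks.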

## Summary

The data was cleaned and preprocessed to address multicollinearity, confirming family history and age as the strongest diabetes risk factors. Using Lasso and Ridge regularization successfully stabilized feature coefficients and improved model interpretability without sacrificing accuracy. The tuned XGBoost model proved to be the most effective solution, achieving an R2 of 0.9988 and a highly precise RMSE of 0.1773.