
# Glucose Level Prediction Project

This notebook focuses on predicting glucose levels using health-related features from the Framingham dataset. 



## Conclusion / What We Learned

- **Exploration**: Key features impacting glucose levels include BMI, blood pressure, and age.
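- **Modeling**: Linear Regression, Decision Tree, and Random Forest regressors were trained and tested. Random Forest performed the best, but R² stayed near 0 for all three and tuning did not change that.
- **Results**: A high-performing model could help in early diagnosis and preventive care; the models here are not yet accurate enough for that.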
- **Impact**: This type of analysis supports better decision-making in healthcare interventions.

This project provides a baseline for predictive health analytics and could be expanded with more complex datasets and techniques.



## What To Do

1. Import and explore the `framingham.csv` dataset.
2. Clean the data (handle nulls, correct formats, etc.).
3. Visualize the distribution of glucose and related health indicators.
4. Perform feature selection and engineering.
5. Train ML models (e.g., Linear Regression, Decision Tree, Random Forest).
6. Evaluate models using regression metrics (R², RMSE, MAE).
7. Predict glucose levels and draw insights.
8. Visualize the model’s important features and performance.
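
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the `demo` frame, its two columns, and the formula that generates `glucose` are all made up for the example. Because glucose is a continuous target, the sketch uses a regressor with R² and RMSE.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for framingham.csv (hypothetical values, not real data)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "age": rng.integers(30, 70, size=n).astype(float),
    "BMI": rng.normal(26, 4, size=n),
})
demo["glucose"] = 60 + 0.3 * demo["age"] + 0.5 * demo["BMI"] + rng.normal(0, 5, size=n)

# Inject some missing values so there is something to clean
demo.loc[demo.sample(frac=0.05, random_state=0).index, "BMI"] = np.nan

# Step 2: clean the data (median imputation)
demo = demo.fillna(demo.median())

# Steps 5-6: train a model and evaluate it with regression metrics
X_demo, y_demo = demo[["age", "BMI"]], demo["glucose"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

demo_r2 = r2_score(y_te, pred)
demo_rmse = np.sqrt(mean_squared_error(y_te, pred))
print(f"R2: {demo_r2:.3f}, RMSE: {demo_rmse:.2f}")
```

The synthetic target depends directly on the features, so the scores here say nothing about what to expect on the real dataset.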


## Importing and Exploring the dataset

(also importing required libraries and tools)

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('framingham.csv')

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nNull values:\n", df.isnull().sum())
print("\nStatistical Summary:\n", df.describe())

In [None]:
print(df.dtypes , "\n")

missing = df.isnull().sum()
print("\n Missing Values:\n", missing[missing > 0])

df_cleaned = df.copy()
num_cols = df_cleaned.select_dtypes(include=['float64', 'int64']).columns

for col in num_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_val = df_cleaned[col].median()
        df_cleaned[col] = df_cleaned[col].fillna(median_val)

# Use the median rather than the mean because it is less affected by outliers

print("Any remaining nulls?\n", df_cleaned.isnull().sum().sum())


Missing values are filled rather than dropped. Dropping the rows with nulls removed a larger share of the dataset and would make the model less accurate.
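
A small example of why the median is the safer fill value when a column has extreme readings. The numbers below are hypothetical, chosen only to show the effect of a single outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical glucose readings with one extreme value and one missing entry
readings = pd.Series([78.0, 82.0, 85.0, 90.0, 394.0, np.nan])

mean_val = readings.mean()      # pulled upward by the 394 reading
median_val = readings.median()  # stays near the typical values

# Fill the missing entry with the median, as the cleaning cell does
filled = readings.fillna(median_val)
print(f"mean={mean_val:.1f}, median={median_val:.1f}")  # mean=145.8, median=85.0
```

Filling with the mean would have assigned the missing entry a value higher than four of the five observed readings.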

## Plotting and visualizing the data (EDA)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid", palette="Set2")

plt.figure(figsize=(8, 5))
sns.histplot(df_cleaned['glucose'], kde=True, bins=30, color='darkgreen')
plt.title('Distribution of Glucose Levels')
plt.xlabel('Glucose')
plt.ylabel('Count')
plt.show()


In [None]:
# Pair plots of glucose with some important features
features_to_plot = ['age', 'BMI', 'heartRate', 'sysBP', 'diaBP', 'glucose']
sns.pairplot(df_cleaned[features_to_plot], diag_kind="kde", plot_kws={'alpha':0.6})
plt.suptitle("Glucose vs Health Indicators", y=1.02)
plt.show()


In [None]:
# Box plots to show glucose levels across binary features
plt.figure(figsize=(14, 5))
binary_features = ['diabetes', 'currentSmoker', 'prevalentHyp', 'TenYearCHD']

for i, feature in enumerate(binary_features):
    plt.subplot(1, 4, i+1)
    sns.boxplot(x=feature, y='glucose', data=df_cleaned, palette='Greens')
    plt.title(f'Glucose by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Glucose')

plt.tight_layout()
plt.show()


## Correlation Matrix


In [None]:
plt.figure(figsize=(12, 10))
correlation_matrix = df_cleaned.corr()

sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='Greens', square=True)
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Correlation of each feature with glucose
glucose_corr = correlation_matrix['glucose'].drop('glucose')
glucose_corr_sorted = glucose_corr.abs().sort_values(ascending=False)

print("Top features correlated with glucose:")
print(glucose_corr_sorted)


The feature most strongly correlated with glucose is clearly `diabetes`.
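
To illustrate how this kind of ranking behaves, here is a sketch on synthetic data in which a rare binary flag shifts the target by a large amount. The `toy` frame, the 10% prevalence, and the coefficients are assumptions made up for the example, not values taken from the Framingham data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    "diabetes": (rng.random(n) < 0.1).astype(int),  # hypothetical 10% prevalence
    "age": rng.normal(50, 10, size=n),
})
# Made-up relationship: a large shift for the flag, a weak age trend, plus noise
toy["glucose"] = 80 + 60 * toy["diabetes"] + 0.1 * toy["age"] + rng.normal(0, 10, size=n)

# Same ranking as in the cell above: absolute Pearson correlation with the target
toy_corr = toy.corr()["glucose"].drop("glucose")
ranked = toy_corr.abs().sort_values(ascending=False)
print(ranked)
```

A high rank for a binary flag mainly reflects the gap between the two group means. It does not by itself show that the feature predicts glucose well for the majority of rows where the flag is 0.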

## Feature selection and model training


In [None]:
#selecting some features only

selected_features = [
    'diabetes', 'sysBP', 'TenYearCHD', 'age', 'heartRate',
    'prevalentHyp', 'BMI', 'diaBP', 'cigsPerDay', 'currentSmoker'
]


# Drop rows where the target is missing; .copy() avoids SettingWithCopyWarning
df = df.dropna(subset=['glucose']).copy()

# Fill remaining missing values (for features) with each column's median
df = df.fillna(df.median())

# Selected features + target
X = df[selected_features]
y = df['glucose']


In [None]:
#train test splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Initialize models
lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=42)
rf = RandomForestRegressor(random_state=42)

# Train
lr.fit(X_train_scaled, y_train)
dt.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)


In [None]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

models = {'Linear Regression': lr, 'Decision Tree': dt, 'Random Forest': rf}

for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name} — R² Score: {r2:.3f}, RMSE: {rmse:.2f}")


Since the R² values are near 0.0, these models explain almost none of the variance in glucose and need tuning.
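
For reference, an R² of 0 is what a model scores when it always predicts the mean of the values it is evaluated on. The sketch below computes R² and RMSE from their definitions on a few hypothetical glucose values; `r2_and_rmse` is a helper written for this example, not part of scikit-learn.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute R² and RMSE directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / len(y_true))

# Hypothetical test-set glucose values
y_toy = np.array([70.0, 75.0, 80.0, 85.0, 90.0])

# Baseline: always predict the mean of the values being scored
baseline_pred = np.full_like(y_toy, y_toy.mean())
baseline_r2, baseline_rmse = r2_and_rmse(y_toy, baseline_pred)
print(f"baseline R2={baseline_r2:.3f}, RMSE={baseline_rmse:.2f}")  # baseline R2=0.000, RMSE=7.07
```

In practice the baseline mean comes from the training set, so a model that is no better than that baseline can score slightly below 0 on the test set.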

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot actual vs predicted for each model
plt.figure(figsize=(18, 5))

for i, (name, model) in enumerate(models.items()):
    y_pred = model.predict(X_test_scaled)

    plt.subplot(1, 3, i + 1)
    sns.scatterplot(x=y_test, y=y_pred, alpha=0.6, edgecolor='k')
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r')
    plt.xlabel("Actual Glucose")
    plt.ylabel("Predicted Glucose")
    plt.title(f"{name}")

plt.tight_layout()
plt.show()


## Fine-Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


In [None]:
# Initialize base model
rf = RandomForestRegressor(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=rf,
                           param_grid=param_grid,
                           cv=5,
                           n_jobs=-1,
                           scoring='neg_mean_squared_error',
                           verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)


In [None]:
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_test, y_pred_best)
mse = mean_squared_error(y_test, y_pred_best)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_best)

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.4f}")


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,6))
sns.scatterplot(x=y_test, y=y_pred_best)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Glucose")
plt.ylabel("Predicted Glucose")
plt.title("Tuned Random Forest: Actual vs Predicted Glucose")
plt.show()


In [None]:
importances = best_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

sns.barplot(data=importance_df, x='Importance', y='Feature')
plt.title("Feature Importances from Tuned Random Forest")
plt.show()


The tuned model still performs poorly.

# Trying another approach


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib

# Enable Intel optimizations; patch_sklearn() must run before the sklearn imports below
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load and clean data
df = pd.read_csv('framingham.csv')
df = df.dropna(subset=['glucose']).copy()

# Feature engineering: ratios, interactions, and simple transforms of the base features
def add_features(X):
    X = X.copy()
    X['BP_ratio'] = X['sysBP'] / (X['diaBP'] + 1e-6)
    X['chol_BMI'] = X['totChol'] / (X['BMI'] + 1e-6)
    X['MAP'] = X['diaBP'] + 0.4 * (X['sysBP'] - X['diaBP'])
    X['PP'] = X['sysBP'] - X['diaBP']
    X['age_sysBP'] = X['age'] * X['sysBP']
    X['BMI_heartRate'] = X['BMI'] * X['heartRate']
    X['smoke_BMI'] = X['currentSmoker'] * X['BMI']
    X['age_sq'] = X['age']**2
    X['BMI_sq'] = X['BMI']**2
    X['log_totChol'] = np.log1p(X['totChol'])
    return X

# Prepare data: base features plus the engineered ones
base_features = ['male', 'age', 'currentSmoker', 'cigsPerDay', 'BPMeds',
                 'prevalentHyp', 'diabetes', 'totChol', 'sysBP', 'diaBP',
                 'BMI', 'heartRate', 'TenYearCHD']
X = df[base_features]
X = add_features(X)
y = df['glucose']

# Remove outliers (top and bottom 1%)
lower = y.quantile(0.01)
upper = y.quantile(0.99)
mask = (y >= lower) & (y <= upper)
X, y = X[mask], y[mask]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Efficient Pipeline (reduced complexity)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('selector', SelectKBest(f_regression, k=15)),  # Fewer features → faster
    ('model', RandomForestRegressor(random_state=42, n_jobs=-1, n_estimators=150))  # Fewer trees
])

# Lighter GridSearch (fewer combinations)
param_grid = {
    'selector__k': [10, 15],  # Reduced options
    'model__max_depth': [None, 10],  # Shallower trees
    'model__min_samples_split': [2, 5]  # Simpler splits
}

# Efficient GridSearch (less exhaustive)
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,  # Still parallel, but fewer combinations
    verbose=1   # Less logging than verbose=2
)
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Evaluation
y_pred = best_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"RMSE: {rmse:.2f}")
print(f"R-squared: {r2:.2f}")

# Feature importance: map the selected polynomial terms back to readable column names
selected_indices = best_model.named_steps['selector'].get_support(indices=True)
feature_names = best_model.named_steps['poly'].get_feature_names_out(X_train.columns)[selected_indices]
importances = best_model.named_steps['model'].feature_importances_

feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("\nTop 10 Features:")
print(feature_importance.head(10))

# Cross-validation (n_jobs=-1 for speed)
cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='r2', n_jobs=-1)
print(f"\nCross-validated R-squared: {cv_scores.mean():.2f} (±{cv_scores.std():.2f})")

# Plot actual vs predicted
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.xlabel('Actual Glucose')
plt.ylabel('Predicted Glucose')
plt.title('Actual vs Predicted Glucose Levels (Efficient Model)')
plt.show()

# Save model
joblib.dump(best_model, 'efficient_glucose_predictor.joblib')
print("Model saved successfully!")

# `This project is paused here due to data quality barriers and diminishing returns on local hardware.`