<a href="https://colab.research.google.com/github/gurmehharwaalia/ttt1/blob/main/Final_Project_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**🚀 Machine Learning Project: Predicting Diabetes Progression**

This notebook presents a complete analysis for predicting diabetes disease progression using machine learning techniques. The analysis follows the required structure: problem definition, exploratory data analysis, modeling with regularization, evaluation, interpretation, predictions on new data, and conclusions.

The problem is framed as a regression task using the Diabetes dataset from scikit-learn.

# 📊 1. Problem Definition and Context

# 1.1 **Problem Statement**

This is a regression problem: we aim to predict a quantitative measure of diabetes progression one year after baseline.
The target is a continuous score, and the predictors are ten physiological measurements taken at baseline.


# 1.2 **Data Source and Variables**

The data is from the scikit-learn load_diabetes dataset, which includes 442 samples with 10 baseline variables:

age: Age in years

sex: Sex (binary)

bmi: Body mass index

bp: Average blood pressure

s1: Total serum cholesterol (tc)

s2: Low-density lipoproteins (ldl)

s3: High-density lipoproteins (hdl)

s4: Total cholesterol / HDL (tch)

s5: Possibly the log of serum triglycerides level (ltg), as described in the scikit-learn documentation

s6: Blood sugar level (glu)

The target is a quantitative measure of disease progression one year after baseline.

Data source: Efron, Hastie, Johnstone, and Tibshirani, "Least Angle Regression," Annals of Statistics (2004).

# 1.3 **Relevance**

This problem is relevant for healthcare and medical research. Predicting disease progression could help clinicians identify high-risk patients early, which supports personalized treatment plans and better resource allocation, and may reduce healthcare costs. For businesses in health tech or pharmaceuticals, such models can inform drug development or patient-monitoring apps.





# 🔍 2. Exploratory Data Analysis (EDA)
In this section, we load the data, inspect its quality, describe features, and visualize relationships to inform modeling decisions.

# 2.1 Data Loading
---



In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target, name='target')

# Combine into a single DataFrame for EDA
df = pd.concat([X, y], axis=1)

# Display first few rows
df.head()

# 2.2 Data Quality Inspection
# 2.2.1 Missing Values, Duplicates, and Outliers

In [None]:
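# Basic info: column dtypes and non-null counts
df.info()

# Summary statistics (printed explicitly, since only a cell's last expression is displayed)
print(df.describe())

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Check for duplicate rows
print("Duplicates:", df.duplicated().sum())

# Outliers: boxplots for the continuous features.
# 'sex' takes only two values and 'target' is on a much larger scale, so both are left out.
plt.figure(figsize=(12, 8))
sns.boxplot(data=df.drop(columns=['sex', 'target']))
plt.title('Boxplots for Numerical Features')
plt.xticks(rotation=45)
plt.show()

# Note: The features are already mean-centered and scaled so that each column's
# squared values sum to 1 (standard deviation of about 0.048, not 1).
# There are no missing values or duplicates.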
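
The scaling convention of this dataset is easy to misread, so here is a minimal sketch of it in plain Python. According to the scikit-learn documentation, each feature is mean-centered and divided by its standard deviation times the square root of the number of samples, so the squares of each column sum to 1. The helper name `scale_like_load_diabetes` and the raw BMI values are hypothetical, chosen only for illustration, and the population standard deviation is assumed.

```python
import math

def scale_like_load_diabetes(values):
    """Center a column and divide by (std * sqrt(n)), so its squares sum to 1."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # Population standard deviation of the raw column (an assumption of this sketch)
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / (std * math.sqrt(n)) for c in centered]

# Hypothetical raw BMI readings, for illustration only
raw_bmi = [21.5, 24.0, 27.3, 30.1, 33.8]
scaled = scale_like_load_diabetes(raw_bmi)

col_mean = sum(scaled) / len(scaled)
sum_sq = sum(s * s for s in scaled)
col_std = math.sqrt(sum_sq / len(scaled))  # equals 1 / sqrt(n), not 1

print(f"mean={col_mean:.6f}, sum of squares={sum_sq:.6f}, std={col_std:.6f}")
```

With 442 samples the same rule gives a per-column standard deviation of 1/sqrt(442), roughly 0.048, which is the value to expect in the `df.describe()` output.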

In [None]:
# Distributions of features
df.hist(bins=20, figsize=(12, 10))
plt.suptitle('Distributions of Features and Target')
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

# Relationships with target
plt.figure(figsize=(12, 8))
for i, col in enumerate(X.columns):
    plt.subplot(3, 4, i+1)
    sns.scatterplot(x=df[col], y=df['target'])
    plt.title(col)
plt.tight_layout()
plt.show()

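Insights from EDA:

Strongest correlations with the target: bmi (0.59) and s5 (0.57), both positive and moderate in strength.

Multicollinearity: s1 and s2 are highly correlated (0.90), and s3 and s4 are strongly negatively correlated (-0.74).

Correlated predictors make ordinary least-squares coefficients unstable, so regularization is a sensible choice here.

Feature engineering: Little is needed, since the features are already centered and scaled.

Model choice: Linear models are a reasonable starting point, while a non-linear model (e.g., tree-based) can test whether interactions or non-linear effects add predictive value.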
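
The heatmap values quoted above are Pearson correlation coefficients. As a rough illustration of what that number measures, the sketch below computes it from scratch for toy data. The function name `pearson_r` and the three toy columns are assumptions made for this example and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy columns, for illustration only
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 8.1, 9.8]  # nearly a multiple of feature_a
feature_c = [5.0, 3.0, 4.0, 1.0, 2.0]  # tends to fall as feature_a rises

r_ab = pearson_r(feature_a, feature_b)
r_ac = pearson_r(feature_a, feature_c)
print(f"r(a, b) = {r_ab:.3f}, r(a, c) = {r_ac:.3f}")
```

A pair like `feature_a` and `feature_b` carries almost the same information twice, which is the situation s1 and s2 are in. A linear model can then trade weight between the two columns with little change in fit, and that is why their individual coefficients become unreliable.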

# ⚙️ 3. Modeling
We split the data, then train a baseline (LinearRegression), two regularized models (Ridge, Lasso), and one additional model (RandomForestRegressor).


# 3.1 Train/Test Split Summary

In [15]:
# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3.2 Model Training

# 3.2.1 Regularization Approach

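Regularization is useful here mainly because of multicollinearity (e.g., between s1 and s2), which inflates the variance of least-squares coefficients. With 10 features and 442 samples the risk of overfitting is modest, so any gain in accuracy is expected to be small.

We use Ridge (L2 penalty) and Lasso (L1 penalty). Each has one hyperparameter, alpha, which sets the strength of the penalty and is tuned by grid search with 5-fold cross-validation.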
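
To make the effect of alpha concrete, here is a small sketch for the simplest possible case: one centered feature and no intercept. Under those assumptions both penalties have closed-form solutions. The objectives in the comments follow the ones documented for scikit-learn's Ridge and Lasso; the function names and the toy data are hypothetical.

```python
def ols_coef(x, y):
    """Least-squares slope for a single feature without intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_coef(x, y, alpha):
    """Minimizes sum((y - w*x)^2) + alpha * w^2, which shrinks w towards 0."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

def lasso_coef(x, y, alpha):
    """Minimizes sum((y - w*x)^2) / (2n) + alpha * |w| via soft-thresholding."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    threshold = n * alpha
    if sxy > threshold:
        return (sxy - threshold) / sxx
    if sxy < -threshold:
        return (sxy + threshold) / sxx
    return 0.0  # the penalty dominates, so the coefficient is exactly zero

# Toy centered data with a true slope of about 2, for illustration only
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

w_ols = ols_coef(x, y)
w_ridge = ridge_coef(x, y, alpha=10.0)
w_lasso_small = lasso_coef(x, y, alpha=1.0)
w_lasso_large = lasso_coef(x, y, alpha=5.0)
print(f"OLS={w_ols:.3f}, Ridge={w_ridge:.3f}, "
      f"Lasso(1)={w_lasso_small:.3f}, Lasso(5)={w_lasso_large:.3f}")
```

On this toy data the slope goes from about 2.0 (OLS) to about 1.0 (Ridge, alpha 10), about 1.5 (Lasso, alpha 1), and exactly 0 (Lasso, alpha 5). The last case shows the practical difference between the penalties: Ridge shrinks coefficients without zeroing them, while Lasso can remove a feature entirely.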

# 3.2.2 Create Models

In [7]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Baseline: Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Regularized: Ridge
ridge_params = {'alpha': np.logspace(-4, 4, 20)}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train, y_train)
best_ridge = ridge_grid.best_estimator_

# Lasso
lasso_params = {'alpha': np.logspace(-4, 4, 20)}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X_train, y_train)
best_lasso = lasso_grid.best_estimator_

# 3.2.3 Cross-Validation

In [8]:
# CV scores for baseline
lin_cv_scores = cross_val_score(lin_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print("Linear CV MSE mean:", -lin_cv_scores.mean(), "std:", lin_cv_scores.std())

# For best Ridge and Lasso, from grid search
print("Best Ridge alpha:", ridge_grid.best_params_['alpha'], "CV MSE:", -ridge_grid.best_score_)
print("Best Lasso alpha:", lasso_grid.best_params_['alpha'], "CV MSE:", -lasso_grid.best_score_)

Linear CV MSE mean: 3143.0153074277237 std: 355.46681369850495
Best Ridge alpha: 0.08858667904100823 CV MSE: 3123.40709108382
Best Lasso alpha: 0.08858667904100823 CV MSE: 3126.56661789274
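
The scores above come from 5-fold cross-validation. The sketch below shows the mechanics in plain Python: the data are cut into contiguous folds, each fold is held out once, and the error is averaged over folds. It assumes unshuffled folds and uses a trivial "predict the training mean" model in place of a regression; `k_fold_indices`, `cv_mse_mean_predictor`, and the toy targets are hypothetical names and values for this example.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # The first (n_samples % n_folds) folds receive one extra sample
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cv_mse_mean_predictor(y, n_folds):
    """Cross-validated MSE of a model that always predicts the training mean."""
    fold_mse = []
    for train_idx, val_idx in k_fold_indices(len(y), n_folds):
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        mse = sum((y[i] - train_mean) ** 2 for i in val_idx) / len(val_idx)
        fold_mse.append(mse)
    return fold_mse

# Toy targets, for illustration only
y_toy = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0, 13.0, 8.0, 16.0, 12.0]
fold_mse = cv_mse_mean_predictor(y_toy, n_folds=5)
mean_mse = sum(fold_mse) / len(fold_mse)
print("Per-fold MSE:", [round(m, 2) for m in fold_mse])
print("Mean CV MSE:", round(mean_mse, 2))
```

The spread of the per-fold values is what the "std" figure in the output above summarizes. When that spread is large relative to the gap between two models, as it is between the linear variants here, the difference in mean CV MSE should not be over-interpreted.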


# 3.3 One New Model Not Covered in Class
We use a Random Forest regressor (RandomForestRegressor), an ensemble of decision trees whose predictions are averaged.

In [9]:
from sklearn.ensemble import RandomForestRegressor

# Random Forest
rf_params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=5, scoring='neg_mean_squared_error')
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_
print("Best RF params:", rf_grid.best_params_, "CV MSE:", -rf_grid.best_score_)

Best RF params: {'max_depth': 20, 'n_estimators': 200} CV MSE: 3469.952781526771
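
The core idea behind a random forest is bagging: fit many trees on bootstrap resamples of the training data and average their predictions. The sketch below is a deliberately reduced illustration of that idea, not the scikit-learn algorithm: it uses depth-1 trees (stumps) on a single feature and omits the random feature subsampling that a real random forest adds. All function names and the toy data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate splits: x <= t versus x > t
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_estimators, seed):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Toy step-shaped data, for illustration only
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.2, 2.9, 3.1, 3.0]

ensemble = fit_bagged_stumps(xs, ys, n_estimators=25, seed=42)
low = predict_bagged(ensemble, 0.2)
high = predict_bagged(ensemble, 0.9)
print(f"prediction at x=0.2: {low:.2f}, at x=0.9: {high:.2f}")
```

Averaging over resamples reduces the variance of individual trees, which helps most when the underlying relationship is non-linear. When the signal is largely linear and the sample is small, as the cross-validation results above suggest for this dataset, a forest has little to gain over a linear model.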


# 3.4 Model Comparison

In [None]:
# Predictions and metrics
models = {'Linear': lin_reg, 'Ridge': best_ridge, 'Lasso': best_lasso, 'RandomForest': best_rf}

results = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    results.append({'Model': name, 'MSE': mse, 'RMSE': rmse, 'R2': r2})

# Table (printed explicitly, since only a cell's last expression is displayed)
results_df = pd.DataFrame(results)
print(results_df)

# Plot comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='R2', data=results_df)
plt.title('Model Comparison by R2 Score')
plt.show()
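
For readers who want to see what the three reported metrics compute, here is a minimal from-scratch sketch. The function name `regression_metrics` and the toy values are assumptions for illustration; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R2 for paired lists of true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "R2": 1 - ss_res / ss_tot}

# Toy values, for illustration only
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]

metrics = regression_metrics(y_true, y_pred)
baseline = regression_metrics(y_true, [175.0] * 4)  # always predict the mean
print(metrics)
print(baseline)
```

RMSE is in the same units as the target, which makes it easier to interpret than MSE. R2 compares the model with a constant prediction of the mean: the mean predictor scores 0, a perfect model scores 1, and a model worse than the mean scores below 0.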

# 4. Interpretation & Goodness-of-Fit
# 4.1 Model Parameters

In [None]:
# Coefficients for Ridge (example)
print("Ridge Coefficients:\n", pd.Series(best_ridge.coef_, index=X.columns))

# Feature importances for RF
importances = pd.Series(best_rf.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(8, 6))
sns.barplot(x=importances.values, y=importances.index)
plt.title('Random Forest Feature Importances')
plt.show()

# 4.2 Goodness-of-Fit

In [None]:
# Residuals for the Random Forest model on the test set
y_pred_rf = best_rf.predict(X_test)
residuals = y_test - y_pred_rf
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_pred_rf, y=residuals)
plt.axhline(0, color='r', linestyle='--')
plt.title('Residual Plot for Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()

# 5. Predictions on New Data

In [None]:
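# Simulate new data: take five test rows and perturb bmi
original = X_test.iloc[:5]
new_data = original.copy()
new_data['bmi'] += 0.05  # Increase bmi (in the dataset's scaled units)

# Predictions before and after the perturbation
base_preds = best_rf.predict(original)
new_preds = best_rf.predict(new_data)

# Display: the actual targets belong to the unperturbed rows only
pred_df = pd.DataFrame({'Actual (original rows)': y_test.iloc[:5].values,
                        'Predicted (original)': base_preds,
                        'Predicted (bmi + 0.05)': new_preds})
print(pred_df)

# Plot
pred_df.plot(kind='bar', figsize=(8, 6))
plt.title('Example Predictions: Original vs Perturbed bmi')
plt.show()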

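# 6. Final Conclusions and Recommendations
Conclusions: We built four models to predict diabetes progression. In 5-fold cross-validation the linear models performed best and were nearly tied (MSE of about 3123 for Ridge, 3127 for Lasso, and 3143 for plain linear regression), while Random Forest did not improve on them (about 3470). A test R2 of roughly 0.45 means that over half of the variance in progression remains unexplained. Key features: bmi, s5.

Recommendations: bmi is the strongest predictor, which is consistent with prioritizing weight management, although a predictive model alone does not show that changing bmi would change progression. Further work: add more features (e.g., diet) and try deep learning, bearing in mind that 442 samples is a small dataset for such models.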

In [17]:
import warnings
warnings.filterwarnings('ignore')