# Diabetes Dataset Classification
This notebook trains and compares three classifiers on scikit-learn's diabetes dataset. The dataset ships with a continuous target, so the target is first binarized at its median. The steps are:
- Load the diabetes dataset and convert the target to binary labels
- Train a Decision Tree classifier
- Train a Random Forest classifier and compare its accuracy with the Decision Tree
- Wrap the Random Forest in a Bagging ensemble and check whether accuracy improves
- Run pairwise paired t-tests on the predictions of all three models

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score

In [2]:
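# Load the diabetes dataset (features X, continuous target y)
data = load_diabetes()
X = data.data
y = data.target
# The target is a continuous measure of disease progression, so convert it to
# binary labels: 1 if strictly above the median, 0 otherwise
y = (y > np.median(y)).astype(int)
# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)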
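
A median split does not always produce two equal classes: values tied with the median all fall into class 0 because the comparison is strict. The sketch below illustrates this on a small made-up array. The helper name `binarize_at_median` and the toy values are assumptions for illustration, not part of the notebook or the dataset.

```python
import numpy as np


def binarize_at_median(values):
    """Return 1 where a value is strictly above the median, else 0."""
    values = np.asarray(values, dtype=float)
    return (values > np.median(values)).astype(int)


# Hypothetical target values (not taken from the diabetes dataset).
# Three values are tied at the median, so they all land in class 0.
toy_target = np.array([25.0, 60.0, 60.0, 60.0, 140.0, 310.0])
toy_labels = binarize_at_median(toy_target)

# Count how many samples fall into class 0 and class 1.
class_counts = np.bincount(toy_labels, minlength=2)
print("labels:", toy_labels)
print("class counts (0, 1):", class_counts)
```

Running `np.bincount(y)` on the notebook's own `y` would show the actual class balance before the train/test split.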

In [3]:
# Train a Decision Tree classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
print(f'Decision Tree Accuracy: {dt_accuracy:.2f}')

Decision Tree Accuracy: 0.67


In [4]:
# Train a Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')

Random Forest Accuracy: 0.72
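
To get a rough sense of how much weight a five-point accuracy gap can carry on a test set this small, one option is a Wilson score interval around each accuracy. The sketch below uses only the standard library. It assumes a test set of 89 samples (20% of the dataset's 442 rows, rounded up) and recovers the number of correct predictions by rounding the printed accuracies, so the intervals are approximate.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 442 samples with test_size=0.2 gives ceil(88.4) = 89 test samples.
n_test = 89

# Accuracies as printed in the cells above (rounded to two decimals).
reported = {"Decision Tree": 0.67, "Random Forest": 0.72}

intervals = {}
for name, acc in reported.items():
    # Approximate count of correct predictions, reconstructed from the rounded accuracy.
    correct = round(acc * n_test)
    intervals[name] = wilson_interval(correct, n_test)
    low, high = intervals[name]
    print(f"{name}: {correct}/{n_test} correct, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

Under these assumptions the two intervals overlap, which suggests that the gap between the two models could be due to the particular split rather than a real difference.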


In [5]:
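# Wrap the Random Forest in a Bagging ensemble to see whether accuracy improves.
# The estimator is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (the old name was later removed).
bagging_model = BaggingClassifier(RandomForestClassifier(random_state=42), random_state=42)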
bagging_model.fit(X_train, y_train)
bagging_predictions = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)
print(f'Bagging Random Forest Accuracy: {bagging_accuracy:.2f}')



Bagging Random Forest Accuracy: 0.73
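
The three accuracies above come from one 80/20 split, so their ranking may change with a different `random_state`. A possible cross-check is stratified 5-fold cross-validation over the full dataset, sketched below. The variable names `X_cv`, `y_cv`, `models`, and `cv_scores` are introduced here for illustration, and the choice of five folds is an assumption rather than something the notebook specifies.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Rebuild the binary task so this cell runs on its own.
data = load_diabetes()
X_cv = data.data
y_cv = (data.target > np.median(data.target)).astype(int)

# Same model settings as in the cells above.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bagging": BaggingClassifier(RandomForestClassifier(random_state=42), random_state=42),
}

# A fixed, shuffled stratified split so every model sees the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {
    name: cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```

The per-fold scores give a spread for each model, which a single hold-out accuracy cannot provide.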


In [7]:
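from scipy.stats import ttest_rel

# Pairwise significance tests on all three models
# Collect the test-set predictions of each model
predictions = {
    "Decision Tree": dt_predictions,
    "Random Forest": rf_predictions,
    "Bagging": bagging_predictions
}

# Paired t-test on the predicted labels of every ordered pair of models.
# Note: this tests whether two models predict class 1 at different rates,
# not whether their accuracies differ. Both orderings of each pair are run,
# so every result appears twice with the sign of the t-statistic flipped.
for model1, pred1 in predictions.items():
    for model2, pred2 in predictions.items():
        if model1 != model2:
            t_stat, p_value = ttest_rel(pred1, pred2)
            print(f"Significance Test between {model1} and {model2}:\nT-statistic: {t_stat}, P-value: {p_value}")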

Significance Test between Decision Tree and Random Forest:
T-statistic: 0.0, P-value: 1.0
Significance Test between Decision Tree and Bagging:
T-statistic: -1.1491423807712053, P-value: 0.25361103536435814
Significance Test between Random Forest and Decision Tree:
T-statistic: 0.0, P-value: 1.0
Significance Test between Random Forest and Bagging:
T-statistic: -2.288688541085317, P-value: 0.024492231010549658
Significance Test between Bagging and Decision Tree:
T-statistic: 1.1491423807712053, P-value: 0.25361103536435814
Significance Test between Bagging and Random Forest:
T-statistic: 2.288688541085317, P-value: 0.024492231010549658
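
The t-tests above compare predicted labels, so a small p-value says that two models predict the positive class at different rates, not that one is more accurate. A common alternative for comparing two classifiers on the same test set is McNemar's test, which looks only at the samples where exactly one model is correct. The sketch below is a standard-library implementation of the exact (binomial) version. The function name `mcnemar_exact` and the toy labels are assumptions for illustration and are not results from this notebook.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples that only model A
    classifies correctly and c counts samples that only model B
    classifies correctly.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        # The models never disagree in correctness, so there is no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5); double the smaller tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Hypothetical labels and predictions, chosen only to exercise the function.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
toy_a = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
toy_b = [1, 0, 1, 1, 0, 0, 1, 1, 1, 1]

b, c, p_value = mcnemar_exact(toy_true, toy_a, toy_b)
print(f"only A correct: {b}, only B correct: {c}, p-value: {p_value:.3f}")
```

In this notebook the same function could be called as `mcnemar_exact(y_test, rf_predictions, bagging_predictions)` to compare the Random Forest with the Bagging ensemble on correctness rather than on raw predictions.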
