# Diabetes Dataset Classification
This notebook performs binary classification on scikit-learn's diabetes dataset, a regression dataset whose target is binarized at the median. It includes the following steps:
- Load the diabetes dataset
- Train a Decision Tree classifier
- Train a Random Forest classifier and compare its performance with the Decision Tree
- Apply Bagging to improve the Random Forest model
- Run pairwise paired t-tests on the predictions of all three models

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import ttest_rel  # paired t-test used in the significance test

In [2]:
# Load the diabetes dataset
data = load_diabetes()
X = data.data
y = data.target
# load_diabetes has a continuous target; binarize it (1 = above the median, 0 = at or below)
y = (y > np.median(y)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
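
The strict `>` comparison sends any value equal to the median to class 0, so the two classes need not come out exactly balanced. A minimal sketch on a hypothetical toy target (`toy_target` is made up for illustration and is not the real `data.target`):

```python
import numpy as np

# Hypothetical continuous target standing in for data.target.
toy_target = np.array([25.0, 60.0, 97.0, 140.0, 140.0, 200.0, 310.0])

# Same binarization rule as the notebook: 1 if strictly above the median.
toy_labels = (toy_target > np.median(toy_target)).astype(int)

# Count how many instances land in class 0 and class 1.
counts = np.bincount(toy_labels, minlength=2)
print(counts)  # [5 2]
```

Here the median is 140, and both values equal to it fall into class 0. Running `np.bincount(y)` on the real target would show how balanced the actual split is.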

In [3]:
# Train a Decision Tree classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
print(f'Decision Tree Accuracy: {dt_accuracy:.2f}')

Decision Tree Accuracy: 0.67


In [4]:
# Train a Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')

Random Forest Accuracy: 0.72


In [5]:
# Apply Bagging to improve the Random Forest model
# Note: `estimator` replaced the `base_estimator` argument in scikit-learn 1.2
bagging_model = BaggingClassifier(estimator=RandomForestClassifier(random_state=42), random_state=42)
bagging_model.fit(X_train, y_train)
bagging_predictions = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)
print(f'Bagging Random Forest Accuracy: {bagging_accuracy:.2f}')



Bagging Random Forest Accuracy: 0.73


In [7]:
from scipy.stats import ttest_rel

# Perform Significance Test on All Models
# Collect predictions
predictions = {
    "Decision Tree": dt_predictions,
    "Random Forest": rf_predictions,
    "Bagging": bagging_predictions
}

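# Paired t-test on predicted labels for every ordered pair
# (each pair is tested twice, with the sign of the t-statistic flipped)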
for model1, pred1 in predictions.items():
    for model2, pred2 in predictions.items():
        if model1 != model2:
            t_stat, p_value = ttest_rel(pred1, pred2)
            print(f"Significance Test between {model1} and {model2}:\nT-statistic: {t_stat}, P-value: {p_value}")

Significance Test between Decision Tree and Random Forest:
T-statistic: 0.0, P-value: 1.0
Significance Test between Decision Tree and Bagging:
T-statistic: -1.1491423807712053, P-value: 0.25361103536435814
Significance Test between Random Forest and Decision Tree:
T-statistic: 0.0, P-value: 1.0
Significance Test between Random Forest and Bagging:
T-statistic: -2.288688541085317, P-value: 0.024492231010549658
Significance Test between Bagging and Decision Tree:
T-statistic: 1.1491423807712053, P-value: 0.25361103536435814
Significance Test between Bagging and Random Forest:
T-statistic: 2.288688541085317, P-value: 0.024492231010549658
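
The loop above visits every ordered pair, so each comparison is printed twice with the sign flipped. A sketch that visits each unordered pair once and also counts how many instances the two models disagree on; the `compare_pairs` helper and the toy predictions are hypothetical and not part of the original notebook:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel


def compare_pairs(predictions):
    """Run one paired t-test per unordered pair of models."""
    results = {}
    for (name_a, pred_a), (name_b, pred_b) in combinations(predictions.items(), 2):
        pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
        t_stat, p_value = ttest_rel(pred_a, pred_b)
        results[(name_a, name_b)] = {
            "t_stat": float(t_stat),
            "p_value": float(p_value),
            # Number of test instances where the two models predict different labels.
            "n_disagree": int(np.sum(pred_a != pred_b)),
        }
    return results


# Hypothetical 0/1 predictions from three models on the same ten instances.
toy_predictions = {
    "A": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "B": [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    "C": [1, 1, 1, 0, 1, 1, 1, 0, 1, 0],
}

results = compare_pairs(toy_predictions)
for pair, stats in results.items():
    print(pair, stats)
```

On the toy data, models A and B disagree on four of ten instances yet produce a t-statistic of 0, because the disagreements cancel out. Passing the notebook's own `predictions` dictionary to the helper should work the same way.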


The paired t-test output above breaks down as follows:

1.  **What the Test Does:** The paired t-test compares the predicted labels (0 or 1) of two models on the *same* set of test instances. Because the inputs are predictions rather than correctness indicators, it tests whether the mean difference in predicted labels is zero, i.e. whether the two models predict class 1 at the same rate. It does not compare accuracy directly. The null hypothesis is that the mean difference is zero.
    *   **T-statistic:** The mean of the paired differences divided by its standard error. A larger absolute value indicates a larger mean difference relative to the variation in the differences.
    *   **P-value:** The probability of observing a t-statistic at least this extreme if the null hypothesis were true. A small p-value (conventionally < 0.05) is taken as evidence against the null hypothesis, i.e. a statistically significant difference in the mean predicted label.

2.  **Analysis of Results:**

    *   **Decision Tree vs. Random Forest (and vice-versa):**
        *   T-statistic: 0.0
        *   P-value: 1.0
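        *   **Explanation:** A t-statistic of exactly 0 (p-value 1.0) means the mean difference in predicted labels is zero: the Decision Tree and the standard Random Forest predicted class 1 for the same number of test instances. It does not mean their predictions are identical. The accuracies differ (0.67 vs. 0.72), so the models must disagree on some instances, with the disagreements cancelling out in both directions. Identical predictions would give zero variance in the differences and an undefined t-statistic rather than 0.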

    *   **Decision Tree vs. Bagging (and vice-versa):**
        *   T-statistic: +/- 1.149
        *   P-value: 0.254
        *   **Explanation:** The p-value (0.254) is well above the common significance threshold of 0.05. Therefore, the test finds *no statistically significant difference* in how often the Decision Tree and the Bagging Random Forest predict class 1. Their predictions differ on some instances, but the net shift is too small to distinguish from chance on this test set.

    *   **Random Forest vs. Bagging (and vice-versa):**
        *   T-statistic: +/- 2.289
        *   P-value: 0.024
        *   **Explanation:** The p-value (0.024) is less than 0.05. This indicates a *statistically significant difference* in the mean predicted label between the standard Random Forest and the Bagging Random Forest: on this test set, one model predicts class 1 more often than the other.

3.  **Important Findings:**

    *   The most notable finding is the **statistically significant difference between the standard Random Forest and the Bagging Random Forest** (p = 0.024). Because the test was run on predicted labels, this suggests that applying Bagging on top of the Random Forest shifted how often class 1 is predicted. The accuracies are nearly the same (0.72 vs. 0.73), so the result is not evidence that Bagging improved the model.
    *   Neither the standard Random Forest nor the Bagging Random Forest showed a statistically significant difference when compared to the simpler Decision Tree on this dataset and split, even though both ensembles scored higher accuracy (0.72 and 0.73 vs. 0.67). Since the test compares predicted labels rather than correctness, this says only that the ensembles predict class 1 at a rate similar to the single tree. It does not show that the ensembles fail to improve on it.
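
The t-tests above compare predicted labels, not correctness. To ask whether one model is right more often than another on the same test set, a commonly used alternative is McNemar's test, which looks only at instances where exactly one of the two models is correct. The following is a sketch of the exact (binomial) version; it is a different technique from the one the notebook ran, and the `mcnemar_exact` helper and toy arrays are hypothetical. It assumes SciPy 1.7 or later for `scipy.stats.binomtest`.

```python
import numpy as np
from scipy.stats import binomtest


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test on the correctness of two classifiers."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    # Discordant instances: exactly one of the two models is correct.
    a_only = int(np.sum(correct_a & ~correct_b))
    b_only = int(np.sum(~correct_a & correct_b))
    n_discordant = a_only + b_only
    if n_discordant == 0:
        # The models are right and wrong on the same instances.
        return a_only, b_only, 1.0
    # Under the null hypothesis, each discordant instance favours A or B
    # with probability 0.5.
    p_value = binomtest(a_only, n_discordant, p=0.5).pvalue
    return a_only, b_only, float(p_value)


# Hypothetical labels and predictions for ten test instances.
toy_y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
toy_pred_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 0]  # wrong on the last four
toy_pred_b = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1]  # wrong on the sixth only

a_only, b_only, p_value = mcnemar_exact(toy_y_true, toy_pred_a, toy_pred_b)
print(a_only, b_only, p_value)  # 1 4 0.375
```

With the notebook's variables, a call such as `mcnemar_exact(y_test, rf_predictions, bagging_predictions)` would compare the two ensembles on correctness rather than on predicted labels.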