Inferential Statistics on the Diabetes Dataset

Inferential statistics draws conclusions about a population from a sample of data. Common techniques include hypothesis tests (such as t-tests and ANOVA), confidence intervals, and regression analysis.

First, let us import the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
import matplotlib.pyplot as plt

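Now let us load the dataset. As documented by scikit-learn, the ten feature columns returned by load_diabetes are mean-centered and scaled, so each feature has a sample mean of approximately 0. The target is left in its original units.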

In [2]:
# Load the diabetes dataset
diabetes = load_diabetes()

# Convert to a pandas DataFrame
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

Hypothesis Testing

One-Sample t-Test

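This tests whether the mean of a feature (e.g., bmi) differs significantly from a hypothesized value (here, 0). Because the features in this dataset are mean-centered, the sample mean of bmi is 0 by construction, so the result below demonstrates the API rather than a substantive finding.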

In [3]:
# One-sample t-test for 'bmi' (test if mean is different from 0)
t_stat, p_value = stats.ttest_1samp(df['bmi'], 0)
print("One-Sample t-Test for BMI:")
print(f"t-statistic: {t_stat}, p-value: {p_value}")
if p_value < 0.05:
    print("Reject the null hypothesis: The mean is significantly different from 0.")
else:
    print("Fail to reject the null hypothesis: The mean is not significantly different from 0.")

One-Sample t-Test for BMI:
t-statistic: -9.914159661291137e-14, p-value: 0.999999999999921
Fail to reject the null hypothesis: The mean is not significantly different from 0.
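
To make the statistic concrete, the following minimal sketch computes a one-sample t-statistic by hand using only the Python standard library. The helper name `one_sample_t`, the sample values, and the hypothesized mean `mu0` are made up for illustration and are not taken from the diabetes dataset.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t-statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical sample, for illustration only
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
t_manual = one_sample_t(sample, mu0=2.0)
print(f"t-statistic: {t_manual:.4f}")
```

Calling stats.ttest_1samp(sample, 2.0) should give the same statistic, and it additionally returns the two-sided p-value.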


Two-Sample t-Test

We can use this to compare the mean of a feature between two independent groups (e.g., bmi for observations whose target value is at or below the median versus above it).

In [4]:
# Split the data into two groups based on the median of 'target'
median_target = df['target'].median()
group1 = df[df['target'] <= median_target]['bmi']
group2 = df[df['target'] > median_target]['bmi']

# Two-sample t-test (ttest_ind assumes equal variances by default;
# pass equal_var=False for Welch's t-test)
t_stat, p_value = stats.ttest_ind(group1, group2)
print("Two-Sample t-Test for BMI:")
print(f"t-statistic: {t_stat}, p-value: {p_value}")
if p_value < 0.05:
    print("Reject the null hypothesis: The means of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis: The means of the two groups are not significantly different.")

Two-Sample t-Test for BMI:
t-statistic: -10.885847035861474, p-value: 1.3404380858420138e-24
Reject the null hypothesis: The means of the two groups are significantly different.
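
By default, stats.ttest_ind pools the two group variances. When the groups may have unequal variances, Welch's t-test is a common alternative. The sketch below works through Welch's statistic and its approximate degrees of freedom with the standard library only. The helper `welch_t` and both samples are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical groups, for illustration only
low = [1.0, 2.0, 3.0, 4.0, 5.0]
high = [2.0, 4.0, 6.0, 8.0, 10.0]
t_welch, dof_welch = welch_t(low, high)
print(f"Welch t: {t_welch:.4f}, degrees of freedom: {dof_welch:.2f}")
```

With SciPy, the equivalent call would be stats.ttest_ind(low, high, equal_var=False).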


Paired t-Test

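We use it to compare the means of two related measurements taken on the same subjects (e.g., bp and s1). Note that bp and s1 measure different quantities and are both mean-centered in this dataset, so their means are equal by construction; the example below illustrates the API rather than a meaningful comparison.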

In [5]:
# Paired t-test for 'bp' and 's1'
t_stat, p_value = stats.ttest_rel(df['bp'], df['s1'])
print("Paired t-Test for BP and S1:")
print(f"t-statistic: {t_stat}, p-value: {p_value}")
if p_value < 0.05:
    print("Reject the null hypothesis: The means of the paired samples are significantly different.")
else:
    print("Fail to reject the null hypothesis: The means of the paired samples are not significantly different.")

Paired t-Test for BP and S1:
t-statistic: -1.2252950208590596e-14, p-value: 0.9999999999999902
Fail to reject the null hypothesis: The means of the paired samples are not significantly different.
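
A paired t-test is equivalent to a one-sample t-test of the per-subject differences against 0. The following sketch shows that calculation with the standard library. The helper `paired_t` and the `before`/`after` measurements are hypothetical and are not drawn from the diabetes dataset.

```python
import math
import statistics

def paired_t(x, y):
    """Paired t-statistic: a one-sample t-test of the differences against 0."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [xi - yi for xi, yi in zip(x, y)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical repeated measurements on the same five subjects
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [11.0, 14.0, 9.0, 13.0, 16.0]
t_paired = paired_t(before, after)
print(f"paired t-statistic: {t_paired:.4f}")
```

stats.ttest_rel(before, after) should return the same statistic along with a p-value.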


Confidence Intervals

Let us calculate a confidence interval for the mean of a feature. Because bmi is mean-centered in this dataset, the resulting interval is centered on 0.

In [6]:
# Confidence interval for the mean of 'bmi'
confidence_level = 0.95
mean = df['bmi'].mean()
std_err = stats.sem(df['bmi'])  # Standard error of the mean
ci = stats.t.interval(confidence_level, len(df['bmi'])-1, loc=mean, scale=std_err)
print(f"95% Confidence Interval for BMI Mean: {ci}")

95% Confidence Interval for BMI Mean: (-0.004451549462177969, 0.004451549462177519)
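
The interval above has the form mean ± critical value × standard error. As a rough sketch of that arithmetic, the code below uses the standard library's normal distribution for the critical value instead of the Student t distribution used in the notebook. This is a swapped-in approximation that is reasonable for large samples such as the 442 observations here, and it is too narrow for small ones. The helper `mean_ci_normal` and the sample are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_ci_normal(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample, used only to show the arithmetic
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = mean_ci_normal(sample)
print(f"Approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

For a sample this small, the t-based interval from stats.t.interval would be noticeably wider, which is why the notebook uses it.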


ANOVA (Analysis of Variance)

ANOVA tests whether the means of three or more groups differ significantly.

One-Way ANOVA

Here we compare the mean of target across the four quartiles of bmi.

In [7]:
# Create quartiles for 'bmi'
df['bmi_quartile'] = pd.qcut(df['bmi'], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Perform one-way ANOVA
model = ols('target ~ C(bmi_quartile)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("One-Way ANOVA for Target across BMI Quartiles:")
print(anova_table)

One-Way ANOVA for Target across BMI Quartiles:
                       sum_sq     df          F        PR(>F)
C(bmi_quartile)  7.728782e+05    3.0  61.056395  5.375804e-33
Residual         1.848131e+06  438.0        NaN           NaN
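
The F value in the table is the ratio of the two mean squares: (7.728782e+05 / 3) / (1.848131e+06 / 438) ≈ 61.06. The sketch below shows how the between-group and within-group sums of squares behind such a table can be computed by hand. The helper `one_way_anova_f` and the three small groups are hypothetical and use only the standard library.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical groups, for illustration only
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

With SciPy, stats.f_oneway(*groups) should give the same F statistic together with its p-value.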


Regression Analysis

We perform linear regression to understand the relationship between features and the target variable.
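
As a reminder of what a least-squares fit with one predictor estimates, the sketch below computes the intercept, slope, and R-squared from their closed-form expressions using only the standard library. The helper `simple_ols` and the `xs`/`ys` values are hypothetical and unrelated to the diabetes data.

```python
import statistics

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
intercept, slope, r_squared = simple_ols(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, R-squared={r_squared:.2f}")
```

These three quantities correspond to the const coefficient, the predictor coefficient, and the R-squared entry in a statsmodels OLS summary.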

Simple Linear Regression

In [9]:
# Simple linear regression between 'bmi' and 'target'
X = df['bmi']
y = df['target']
X = sm.add_constant(X)  # Add a constant (intercept) to the model
model = sm.OLS(y, X).fit()
print("Simple Linear Regression (BMI vs Target):")
print(model.summary())

Simple Linear Regression (BMI vs Target):
                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.344
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     230.7
Date:                Tue, 18 Mar 2025   Prob (F-statistic):           3.47e-42
Time:                        16:43:08   Log-Likelihood:                -2454.0
No. Observations:                 442   AIC:                             4912.
Df Residuals:                     440   BIC:                             4920.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const     