# Statistical Tests – Assessing Numerical and Categorical Feature Differences
***

The goal of this notebook is to move beyond descriptive and exploratory analysis (EDA)  
and apply **statistical hypothesis testing** to determine whether the differences observed  
between patients with and without heart disease are **statistically significant**.  

We will focus on:  
- Testing for **normality** and **variance homogeneity** of numerical features.  
- Applying appropriate tests (t-test, Mann–Whitney U test) to compare numerical features across groups.  
- Using **Chi-square tests** (or Fisher’s exact test when necessary) to evaluate associations between categorical features and heart disease.  

The results will help us identify which features show meaningful differences between the groups,  
providing evidence for their potential predictive power in the subsequent modeling phase.


In [1]:
import pandas as pd

from scipy.stats import shapiro, mannwhitneyu

from src.stats_tests_functions import multiple_results_mwu

In [2]:
dataset1 = pd.read_csv('../data/cleaned_data/dataset1_cleaned.csv')
dataset2 = pd.read_csv('../data/cleaned_data/dataset2_cleaned.csv')

## I. Loading and Preparing the Datasets
***

In [3]:
dataset1

Unnamed: 0,age,gender,heart_rate,pressure_high,pressure_low,glucose,kcm,troponin,heart_disease
0,64,1,66,160,83,160.0,1.80,0.012,0
1,21,1,94,98,46,296.0,6.75,1.060,1
2,55,1,64,160,77,270.0,1.99,0.003,0
3,64,1,70,120,55,270.0,13.87,0.122,1
4,55,1,64,112,65,300.0,1.08,0.003,0
...,...,...,...,...,...,...,...,...,...
1311,44,1,94,122,67,204.0,1.63,0.006,0
1312,66,1,84,125,55,149.0,1.33,0.172,1
1313,45,1,85,168,104,96.0,1.24,4.250,1
1314,54,1,58,117,68,443.0,5.80,0.359,1


In [4]:
dataset2

Unnamed: 0,age,gender,cholesterol,pressure_high,heart_rate,smoking,alcohol_intake,exercise_hours,family_history,diabetes,obesity,stress_level,blood_sugar,exercise_induced_angina,chest_pain_type,heart_disease
0,75,0,228,119,66,1,2,1,0,0,1,8,119,1,atypical angina,1
1,48,1,204,165,62,1,0,5,0,0,0,9,70,1,typical angina,0
2,53,1,234,91,67,0,2,3,1,0,1,5,196,1,atypical angina,1
3,69,0,192,90,72,1,0,4,0,1,0,7,107,1,non-anginal pain,0
4,62,0,172,163,93,0,0,6,0,1,0,2,183,1,asymptomatic,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56,0,269,111,86,0,2,5,0,1,1,10,120,0,non-anginal pain,1
996,78,0,334,145,76,0,0,6,0,0,0,10,196,1,typical angina,1
997,79,1,151,179,81,0,1,4,1,0,1,8,189,1,asymptomatic,0
998,60,0,326,151,68,2,0,8,1,1,0,5,174,1,atypical angina,1


### Splitting Dataset 1 by Heart Disease Status


In [5]:
ds1_no_heart_disease = dataset1[dataset1['heart_disease'] == 0]
ds1_heart_disease = dataset1[dataset1['heart_disease'] == 1]

In [6]:
ds1_no_heart_disease

Unnamed: 0,age,gender,heart_rate,pressure_high,pressure_low,glucose,kcm,troponin,heart_disease
0,64,1,66,160,83,160.0,1.80,0.012,0
2,55,1,64,160,77,270.0,1.99,0.003,0
4,55,1,64,112,65,300.0,1.08,0.003,0
5,58,0,61,112,58,87.0,1.83,0.004,0
6,32,0,40,179,68,102.0,0.71,0.003,0
...,...,...,...,...,...,...,...,...,...
1299,40,1,57,208,40,108.0,2.11,0.003,0
1305,45,1,117,100,68,202.0,3.18,0.003,0
1309,48,1,84,118,68,96.0,5.33,0.006,0
1310,86,0,40,179,68,147.0,5.22,0.011,0


In [7]:
ds1_heart_disease

Unnamed: 0,age,gender,heart_rate,pressure_high,pressure_low,glucose,kcm,troponin,heart_disease
1,21,1,94,98,46,296.0,6.75,1.060,1
3,64,1,70,120,55,270.0,13.87,0.122,1
7,63,1,60,214,82,87.0,300.00,2.370,1
12,64,1,60,199,99,92.0,3.43,5.370,1
15,61,1,81,118,66,134.0,1.49,0.017,1
...,...,...,...,...,...,...,...,...,...
1308,85,1,112,115,69,114.0,2.19,0.062,1
1312,66,1,84,125,55,149.0,1.33,0.172,1
1313,45,1,85,168,104,96.0,1.24,4.250,1
1314,54,1,58,117,68,443.0,5.80,0.359,1


### Splitting Dataset 2 by Heart Disease Status


In [8]:
ds2_no_heart_disease = dataset2[dataset2['heart_disease'] == 0]
ds2_heart_disease = dataset2[dataset2['heart_disease'] == 1]

In [9]:
ds2_no_heart_disease

Unnamed: 0,age,gender,cholesterol,pressure_high,heart_rate,smoking,alcohol_intake,exercise_hours,family_history,diabetes,obesity,stress_level,blood_sugar,exercise_induced_angina,chest_pain_type,heart_disease
1,48,1,204,165,62,1,0,5,0,0,0,9,70,1,typical angina,0
3,69,0,192,90,72,1,0,4,0,1,0,7,107,1,non-anginal pain,0
4,62,0,172,163,93,0,0,6,0,1,0,2,183,1,asymptomatic,0
8,37,0,317,137,66,1,2,3,0,1,1,5,114,0,non-anginal pain,0
11,37,1,293,148,98,1,1,6,1,0,1,10,129,0,typical angina,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
988,37,0,349,148,62,2,1,5,0,0,0,6,144,1,asymptomatic,0
989,49,1,315,144,77,2,0,4,1,1,0,2,127,0,atypical angina,0
991,26,0,215,100,74,0,2,7,0,1,0,10,135,0,atypical angina,0
992,28,0,220,102,73,1,1,7,1,1,1,10,102,0,typical angina,0


In [10]:
ds2_heart_disease

Unnamed: 0,age,gender,cholesterol,pressure_high,heart_rate,smoking,alcohol_intake,exercise_hours,family_history,diabetes,obesity,stress_level,blood_sugar,exercise_induced_angina,chest_pain_type,heart_disease
0,75,0,228,119,66,1,2,1,0,0,1,8,119,1,atypical angina,1
2,53,1,234,91,67,0,2,3,1,0,1,5,196,1,atypical angina,1
5,77,1,309,110,73,0,0,0,0,1,1,4,122,1,asymptomatic,1
6,64,0,211,105,86,2,2,8,1,1,1,2,120,0,typical angina,1
7,60,0,208,148,83,0,1,4,0,1,1,2,113,1,asymptomatic,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,52,1,248,159,76,2,1,9,0,1,1,2,152,1,asymptomatic,1
995,56,0,269,111,86,0,2,5,0,1,1,10,120,0,non-anginal pain,1
996,78,0,334,145,76,0,0,6,0,0,0,10,196,1,typical angina,1
998,60,0,326,151,68,2,0,8,1,1,0,5,174,1,atypical angina,1


## II. Statistical Testing of Numerical Features
***

In this section we perform a sequence of statistical tests to evaluate the differences  
between patients with and without heart disease:

1. **Normality check (Shapiro–Wilk test):**  
   - Assess whether each numerical feature follows a Gaussian distribution.  

2. **Group comparison:**  
   - If normality and equal variances are satisfied → apply **Independent Samples t-test**.  
   - If assumptions are violated → apply **Mann–Whitney U test** as a non-parametric alternative.  
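
The decision rule above can be sketched as follows. This is a minimal illustration rather than the notebook's actual pipeline: the helper name `compare_groups`, the use of Levene's test for the equal-variance check, the 0.05 threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import levene, mannwhitneyu, shapiro, ttest_ind


def compare_groups(a, b, alpha=0.05):
    """Pick a two-sample test from assumption checks (hypothetical helper)."""
    # Normality is checked within each group, since that is what the t-test assumes.
    normal = shapiro(a).pvalue > alpha and shapiro(b).pvalue > alpha
    # Levene's test is one common equal-variance check (an assumed choice here).
    equal_var = levene(a, b).pvalue > alpha

    if normal and equal_var:
        stat, p = ttest_ind(a, b)
        return "t-test", stat, p

    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", stat, p


# Synthetic, clearly skewed samples (illustrative only, not taken from the datasets).
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=1.5, size=200)

test_name, stat, p = compare_groups(group_a, group_b)
print(f"{test_name}: statistic={stat:.3f}, p={p:.4f}")
```

Because the synthetic samples are strongly skewed, the normality check should fail and the sketch should fall back to the Mann–Whitney U test.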



In [11]:
stat, p = shapiro(dataset1["age"])
print("Statistic=%.3f, p=%.3f" % (stat, p))

if p > 0.05:
    print("Data looks Gaussian (normal distribution)")
else:
    print("Data does not look Gaussian (not normal)")

Statistic=0.991, p=0.000
Data does not look Gaussian (not normal)


In [12]:
stat, p = shapiro(dataset2["age"])
print("Statistic=%.3f, p=%.3f" % (stat, p))

if p > 0.05:
    print("Data looks Gaussian (normal distribution)")
else:
    print("Data does not look Gaussian (not normal)")

Statistic=0.957, p=0.000
Data does not look Gaussian (not normal)


**Conclusion:**  
The Shapiro–Wilk test results for the **Age** feature in both datasets show:  
- Dataset 1: Statistic = 0.991, p < 0.001  
- Dataset 2: Statistic = 0.957, p < 0.001  

Since the p-values are below the 0.05 threshold, we reject the null hypothesis of normality.  
This indicates that the **Age** variable in both datasets does not follow a Gaussian distribution.  
Therefore, for comparing age between groups (heart disease vs. no heart disease),  
a **non-parametric test** such as the Mann–Whitney U test will be more appropriate than a t-test.


### How the Shapiro–Wilk Test Works (Simplified)

The Shapiro–Wilk test compares the distribution of the sample with the shape of a perfect normal distribution.  
It does this by looking at how well the ordered data values align with what would be expected if the data were normal.  

- In the **denominator**, it calculates the overall variance of the data (how spread out the values are).  
- In the **numerator**, it builds a weighted linear combination of the ordered values and then squares this sum.  
- Finally, the test statistic $W$ is obtained as the ratio between the two.  

If the data is close to normal, the statistic $W$ will be close to **1**.  
If the data deviates from normality, the value of $W$ becomes smaller, and the corresponding *p-value* indicates whether this deviation is statistically significant.
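
To make the behaviour of the W statistic concrete, the sketch below runs `shapiro` on two synthetic samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The data, seed, and sample size are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic samples (illustrative only, not taken from the datasets).
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=50, scale=10, size=500)
skewed_sample = rng.exponential(scale=10, size=500)

w_normal, p_normal = shapiro(normal_sample)
w_skewed, p_skewed = shapiro(skewed_sample)

# W should sit closer to 1 for the normal sample than for the skewed one.
print(f"normal sample: W={w_normal:.3f}, p={p_normal:.3g}")
print(f"skewed sample: W={w_skewed:.3f}, p={p_skewed:.3g}")
```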

### Shapiro–Wilk Test

The Shapiro–Wilk test is used to assess whether a sample comes from a normal (Gaussian) distribution.  
It is one of the most widely used tests for normality due to its good statistical power.

The test statistic **W** is defined as:

$$
W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

**where:**
- $x_{(i)}$ is the i-th ordered sample value (order statistic),
- $\bar{x}$ is the sample mean,
- $a_i$ are constants derived from the covariance matrix and expected values of order statistics from a standard normal distribution.

**Interpretation:**  
- $W \approx 1$ → the sample is close to a normal distribution.
- Low values of $W$ with $p \leq 0.05$ → reject the null hypothesis → the data is not normally distributed.

**Notes:**  
- For **small samples**, the test may not detect deviations from normality.  
- For **large samples**, even small deviations can lead to $p < 0.05$.
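
The sample-size sensitivity noted above can be illustrated with a small sketch. It assumes idealised samples built from evenly spaced quantiles of a Student t distribution with 8 degrees of freedom (slightly heavier tails than a normal distribution). The distribution, the sample sizes, and the helper name `ideal_t_sample` are illustrative choices, not part of the original analysis.

```python
import numpy as np
from scipy.stats import shapiro, t


def ideal_t_sample(n, df=8):
    """Evenly spaced quantiles of a t distribution (hypothetical helper)."""
    probs = (np.arange(1, n + 1) - 0.5) / n
    return t.ppf(probs, df)


# The same mild deviation from normality, evaluated at two sample sizes.
w_small, p_small = shapiro(ideal_t_sample(50))
w_large, p_large = shapiro(ideal_t_sample(4000))

print(f"n=50:   W={w_small:.4f}, p={p_small:.3g}")
print(f"n=4000: W={w_large:.4f}, p={p_large:.3g}")
```

The larger sample is expected to produce a much smaller p-value even though the underlying shape is the same, which is why a rejection on a large sample is often read alongside a visual check such as a histogram or Q-Q plot.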


### Normality Testing of Remaining Numerical Features – Dataset 1

In [13]:
numeric_features_ds1 = ['heart_rate', 'pressure_high', 'pressure_low', 'glucose', 'kcm', 'troponin']
result_shapiro_ds1 = []

for col in numeric_features_ds1:
    stat, p = shapiro(dataset1[col])
    result_shapiro_ds1.append({
        "Feature": col,
        "W-Statistic": round(stat, 3),
        "p-value": round(p, 3),
        "Normality": "Yes (Gaussian)" if p > 0.05 else "No (Not Gaussian)"
    })
df_shapiro_ds1 = pd.DataFrame(result_shapiro_ds1)
df_shapiro_ds1

Unnamed: 0,Feature,W-Statistic,p-value,Normality
0,heart_rate,0.955,0.0,No (Not Gaussian)
1,pressure_high,0.972,0.0,No (Not Gaussian)
2,pressure_low,0.984,0.0,No (Not Gaussian)
3,glucose,0.785,0.0,No (Not Gaussian)
4,kcm,0.315,0.0,No (Not Gaussian)
5,troponin,0.334,0.0,No (Not Gaussian)


### Normality Testing of Remaining Numerical Features – Dataset 2

In [14]:
numeric_features_ds2 = ['cholesterol', 'pressure_high', 'heart_rate', 'blood_sugar']
result_shapiro_ds2 = []

for col in numeric_features_ds2:
    stat, p = shapiro(dataset2[col])
    result_shapiro_ds2.append({
        "Feature": col,
        "W-Statistic": round(stat, 3),
        "p-value": round(p, 3),
        "Normality": "Yes (Gaussian)" if p > 0.05 else "No (Not Gaussian)"
    })
df_shapiro_ds2 = pd.DataFrame(result_shapiro_ds2)
df_shapiro_ds2

Unnamed: 0,Feature,W-Statistic,p-value,Normality
0,cholesterol,0.953,0.0,No (Not Gaussian)
1,pressure_high,0.951,0.0,No (Not Gaussian)
2,heart_rate,0.955,0.0,No (Not Gaussian)
3,blood_sugar,0.959,0.0,No (Not Gaussian)


In [15]:
# # Save Shapiro Wilk test results for both datasets
# df_shapiro_ds1.to_csv("../data/stats_results/shapiro_results_ds1.csv", index=False)
# df_shapiro_ds2.to_csv("../data/stats_results/shapiro_results_ds2.csv", index=False)

**Conclusion:**  
Since none of the numerical features in either dataset follow a Gaussian distribution,  
non-parametric statistical tests (e.g., Mann–Whitney U) should be preferred over  
parametric tests (t-test) in the subsequent analysis.

### Mann–Whitney U Test for Group Comparison

Since the Shapiro–Wilk test revealed that the numerical features do not follow a normal distribution,  
the **Mann–Whitney U test** (a non-parametric alternative to the t-test) is applied.  

**Purpose:**  
Evaluate whether the distributions of each numerical feature differ significantly  
between patients **with** and **without** heart disease.  

In [16]:
stat, p = mannwhitneyu(ds1_no_heart_disease['age'], ds1_heart_disease['age'])
print(f'age: U={stat:.3f}, p={p:.3f}')

if p <= 0.05:
    print("Result: Statistically significant difference between groups.")
else:
    print("Result: No significant difference between groups.")

age: U=148215.000, p=0.000
Result: Statistically significant difference between groups.


**Interpretation of Results:**  
- **U (test statistic):** Summarizes how the ranks of one group compare with those of the other.  
  Its scale depends on the group sizes, so it is usually read together with the p-value rather than on its own.  

- **p-value:**  
  - **p ≤ 0.05** → Reject the null hypothesis → the distributions between patients  
    with and without heart disease are **significantly different**.  
  - **p > 0.05** → Fail to reject the null hypothesis → no statistically significant  
    difference between the two groups.  

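In our context:  
- Features with **p ≤ 0.05** show a statistically significant difference between the groups and are candidates for being associated with heart disease. Significance alone does not tell us how large or clinically important the difference is.  
- Features with **p > 0.05** show no statistically significant difference between the groups in this sample.  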
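
Because a p-value does not convey how large a difference is, the U statistic can be rescaled into an effect size. The sketch below is an illustration and not part of the original analysis: the function name `mwu_effect_size` is hypothetical, and U is computed directly from ranks so that the result does not depend on which group's U a particular SciPy version reports.

```python
import numpy as np
from scipy.stats import rankdata


def mwu_effect_size(x, y):
    """Return (U_x, probability of superiority, rank-biserial correlation)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)

    # Rank the pooled sample; tied values receive average ranks.
    ranks = rankdata(np.concatenate([x, y]))
    rank_sum_x = ranks[:n1].sum()

    # U for the first sample: number of (x, y) pairs with x > y, ties counting 0.5.
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    prob_superiority = u_x / (n1 * n2)        # P(X > Y) + 0.5 * P(X = Y)
    rank_biserial = 2 * prob_superiority - 1  # ranges from -1 to 1
    return u_x, prob_superiority, rank_biserial


# Toy data: every value in x is smaller than every value in y.
u, ps, r = mwu_effect_size([1, 2, 3], [4, 5, 6])
print(u, ps, r)  # prints: 0.0 0.0 -1.0
```

A probability of superiority near 0.5 (rank-biserial correlation near 0) indicates heavily overlapping groups, even when a large sample makes the p-value small.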


**Conclusion for Age feature:**  
- U = 148215.000, p < 0.001 (printed as 0.000 because of rounding)  
- Since p ≤ 0.05, we reject the null hypothesis.  
- Age differs significantly between the groups: patients with heart disease tend to differ in age from those without heart disease.  
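
The loops below rely on `multiple_results_mwu`, which is imported from `src.stats_tests_functions` and is not shown in this notebook. A minimal sketch of what such a helper might look like, inferred only from how it is called and from its printed output, is given here. The actual implementation may differ; the name `multiple_results_mwu_sketch`, the `alpha` parameter, and the toy data are assumptions.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def multiple_results_mwu_sketch(group_a, group_b, col, alpha=0.05):
    """Run a two-sided Mann-Whitney U test on one column and print the outcome."""
    stat, p = mannwhitneyu(group_a[col], group_b[col], alternative="two-sided")
    print(f"{col}: U={stat:.3f}, p={p:.3f}")

    if p <= alpha:
        print("Result: Statistically significant difference between groups.")
    else:
        print("Result: No significant difference between groups.")
    return stat, p


# Toy frames standing in for the two patient groups (values are made up).
toy_no_disease = pd.DataFrame({"age": [30, 35, 40, 45, 50]})
toy_disease = pd.DataFrame({"age": [60, 65, 70, 75, 80]})
stat, p = multiple_results_mwu_sketch(toy_no_disease, toy_disease, "age")
```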


In [17]:
ds1_features = ['age', 'heart_rate', 'pressure_high', 'pressure_low', 'glucose', 'kcm', 'troponin']

In [18]:
mannwhitneyu_results_ds1 = []

for col in ds1_features:
    stat, p = multiple_results_mwu(ds1_no_heart_disease, ds1_heart_disease, col)
    print('------------------------------')

    mannwhitneyu_results_ds1.append({
        "Feature": col,
        "p-value": round(p, 3),
        # Use the same threshold (p <= 0.05) as the interpretation above
        "Significant Difference": "Statistically significant difference between groups" if p <= 0.05 else "No significant difference between groups"
    })
# df_mannwhitneyu_ds1 = pd.DataFrame(mannwhitneyu_results_ds1)
# df_mannwhitneyu_ds1.to_csv("../data/stats_results/mannwhitneyu_results_ds1.csv", index=False)

age: U=148215.000, p=0.000
Result: Statistically significant difference between groups.
------------------------------
heart_rate: U=205045.500, p=0.978
Result: No significant difference between groups.
------------------------------
pressure_high: U=214167.000, p=0.183
Result: No significant difference between groups.
------------------------------
pressure_low: U=207245.500, p=0.764
Result: No significant difference between groups.
------------------------------
glucose: U=210007.000, p=0.477
Result: No significant difference between groups.
------------------------------
kcm: U=131000.000, p=0.000
Result: Statistically significant difference between groups.
------------------------------
troponin: U=43809.000, p=0.000
Result: Statistically significant difference between groups.
------------------------------


In [19]:
ds2_features = ['age', 'cholesterol', 'pressure_high', 'heart_rate', 'blood_sugar']

mannwhitneyu_results_ds2 = []
for col in ds2_features:
    stat, p = multiple_results_mwu(ds2_no_heart_disease, ds2_heart_disease, col)
    print('------------------------------')

    mannwhitneyu_results_ds2.append({
        "Feature": col,
        "p-value": round(p, 3),
        # Use the same threshold (p <= 0.05) as the interpretation above
        "Significant Difference": "Statistically significant difference between groups" if p <= 0.05 else "No significant difference between groups"
    })

# df_mannwhitneyu_ds2 = pd.DataFrame(mannwhitneyu_results_ds2)
# df_mannwhitneyu_ds2.to_csv("../data/stats_results/mannwhitneyu_results_ds2.csv", index=False)

age: U=27390.000, p=0.000
Result: Statistically significant difference between groups.
------------------------------
cholesterol: U=67477.500, p=0.000
Result: Statistically significant difference between groups.
------------------------------
pressure_high: U=118176.000, p=0.824
Result: No significant difference between groups.
------------------------------
heart_rate: U=117400.000, p=0.692
Result: No significant difference between groups.
------------------------------
blood_sugar: U=121200.000, p=0.649
Result: No significant difference between groups.
------------------------------


### Dataset 1 – Conclusion
In the first dataset, a **statistically significant difference** between groups was found for:
- **Age**
- **KCM**
- **Troponin**

This means that these factors are associated with the disease.  

For the following variables, no significant difference was detected:  
- Heart Rate  
- Pressure High  
- Pressure Low  
- Glucose  

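**This does not prove that these variables are unrelated to the disease; it only means that this dataset provides no statistically significant evidence of a difference between the groups.**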

<br><br>
### Dataset 2 – Conclusion
In the second dataset, a **statistically significant difference** between groups was found for:
- **Age**
- **Cholesterol**

This shows that these factors are associated with the disease.  

For the following variables, no significant difference was detected:  
- Pressure High  
- Heart Rate  
- Blood Sugar  

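**As with Dataset 1, this does not prove that these variables are unrelated to the disease; this dataset simply provides no statistically significant evidence of a difference between the groups.**  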

<br><br><br>
## Overall Summary
Across both datasets, **Age consistently emerges as a key factor** associated with heart disease.  
In addition, **KCM** and **Troponin** are significant in Dataset 1, while **Cholesterol** is significant in Dataset 2.  
All other tested variables (Heart Rate and Pressure High in both datasets, Pressure Low and Glucose in Dataset 1, Blood Sugar in Dataset 2) show no statistically significant difference between the groups in the datasets where they were tested.

