# Comparative Analysis of Heart Disease Risk Factors Across Two Independent Datasets

## Objective
This notebook presents the final report of our Data Science project.  
The main goal is to compare two independent heart disease datasets in order to:  
- Identify the most important risk factors associated with heart disease.  
- Highlight similarities and differences between the two datasets.  
- Evaluate and compare the performance of statistical tests and machine learning models.  

The notebook is structured as a summary of the entire workflow, focusing on results, visualizations, and key insights rather than technical implementation details.


**TODO:** DASHBOARDS (https://plotly.com/, https://www.gradio.app/)


```python
import plotly.graph_objects as go
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import gradio as gr
from matplotlib import image as mpimg
```


## I. Data and Preparation

### Data Overview

In this part, we take a first look at the two datasets we’ll be working with.  
The idea is simple: understand what’s inside, how clean they are, and what kind of features we’re dealing with.  
This will guide all the next steps like preprocessing, analysis, and modeling.

### Dataset 1
- Rows: 1319  
- Columns: 9  
- Target variable: `heart_disease` (0 = no, 1 = yes)  
- Mix of numeric (age, heart_rate, pressure_high, pressure_low, glucose, kcm, troponin) and categorical features (gender).  

**Quick notes:**  
- No major missing values.  
- The `gender` column is imbalanced (more women than men, roughly 66/34).  
- Age appears approximately normally distributed, while the other features are skewed.  
- We used boxplots to check for outliers and to compare the two target groups (healthy vs. heart disease). The plots show that many features contain outliers. However, the healthy and heart-disease groups have similar distributions: the 25th percentile and the median (50th percentile) sit at similar positions in both groups.
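
The boxplot reading above can also be checked numerically. The following is a minimal sketch, assuming the common 1.5 × IQR whisker rule; the helper `iqr_outlier_mask` and the toy values are hypothetical and are not taken from Dataset 1.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data (made up for illustration, not taken from Dataset 1).
toy = pd.DataFrame({
    "heart_disease": [0] * 9 + [1] * 9,
    "troponin": [10, 12, 11, 13, 12, 11, 10, 12, 95,
                 20, 22, 21, 23, 22, 21, 20, 22, 21],
})

# Number of flagged values in each target group.
outliers_per_group = (
    toy.groupby("heart_disease")["troponin"]
       .apply(lambda s: int(iqr_outlier_mask(s).sum()))
)

# First quartile and median per group, to compare their positions.
quartiles_per_group = (
    toy.groupby("heart_disease")["troponin"].quantile([0.25, 0.5]).unstack()
)

print(outliers_per_group)
print(quartiles_per_group)
```

Applied to the real data, the same two summaries would give a count of outliers per feature and a side-by-side view of the quartiles for the healthy and heart-disease groups.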


**Correlation Analysis**

The correlation matrix shows generally weak relationships between most features.  
Some higher correlations are observed for `age`, `kcm`, and `troponin` when using the **Pearson** method.  
With **Spearman**, we notice a similar pattern, but certain features, such as `troponin`, show stronger correlation with the target variable.  

The key point here is that the choice of correlation method matters:  
- For features that are normally distributed and linearly related, Pearson correlation is appropriate.  
- For features that are not normally distributed, Spearman correlation is recommended: it works on ranks and captures monotonic relationships, not only linear ones.  
- For categorical variables, correlation coefficients are not good practice; tests of independence such as **Chi-Square** are more suitable.
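
To make the difference between the three approaches concrete, here is a small sketch on made-up data (none of the values come from our datasets). It assumes `scipy` is available for the Chi-Square test; Spearman is computed as Pearson on ranks, which is its definition.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy numeric data (made up): y grows monotonically but non-linearly with x.
x = pd.Series(np.arange(1, 21), dtype=float)
y = np.exp(x / 2)

pearson_r = x.corr(y)                    # strength of the linear association
spearman_rho = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman

# Toy 2x2 contingency table (made up): rows = gender, columns = heart_disease.
table = pd.DataFrame(
    [[30, 10], [10, 30]],
    index=["female", "male"],
    columns=[0, 1],
)
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Chi-Square   = {chi2:.2f}, dof = {dof}, p = {p_value:.5f}")
```

Because the toy relationship is perfectly monotonic but not linear, Spearman reaches 1 while Pearson stays lower, which is the same kind of gap we would expect for skewed features.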

### Dataset 2
- Rows: 1000  
- Columns: 16  
- Target variable: `heart_disease` (0 = no, 1 = yes)  
- Mix of numeric (age, cholesterol, pressure_high, heart_rate, exercise_hours, blood_sugar, troponin) and categorical features (gender, smoking, alcohol_intake, family_history, diabetes, obesity, stress_level, exercise_induced_angina).  

**Quick notes:**  
- No major missing values.  
- The gender column is well balanced (50.3/49.7).  
- At first glance, the features Age, Cholesterol, Pressure High, Heart Rate, and Blood Sugar appear to follow a distribution close to normal. This observation suggests that these variables are relatively well-behaved and suitable for statistical analyses that assume normality.
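
A visual impression of normality can be backed by a quick numeric check. The sketch below uses skewness and excess kurtosis from pandas on synthetic columns; the column names and values are hypothetical stand-ins, not Dataset 2. Values near 0 for both statistics are consistent with a roughly normal shape.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (made up): one near-normal column, one right-skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_normal": rng.normal(loc=50, scale=10, size=1000),
    "right_skewed": rng.exponential(scale=1.0, size=1000),
})

# Skewness near 0 and excess kurtosis near 0 suggest a shape close to normal.
summary = pd.DataFrame({
    "skewness": toy.skew(),
    "excess_kurtosis": toy.kurt(),
})
print(summary.round(2))
```

This is only a first-pass screen; a formal normality test or a Q-Q plot would be the natural follow-up.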

**Correlation Analysis**

The correlation matrix for Dataset 2 shows that most features are only weakly correlated with each other, indicating a low risk of multicollinearity in modeling.
The strongest correlations with heart disease are observed for:
- `Age` (r ≈ 0.65): older patients are more likely to have heart disease.
- `Cholesterol` (r ≈ 0.37): higher cholesterol levels are moderately associated with heart disease.

Other features show little to no linear correlation with the target. However, this does not mean they are irrelevant: their impact may be non-linear or only visible when combined with other risk factors.
This information is useful for feature selection, ensuring model stability, and guiding further analysis of possible interactions between variables.
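
One way to quantify the multicollinearity risk mentioned above is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, not the check used in the project; it relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The helper name `vif_from_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)


# Synthetic, mutually independent features (made up): VIF should be close to 1.
rng = np.random.default_rng(0)
independent = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
vif_independent = vif_from_correlation(independent)

# Add a near-duplicate of column "a": its VIF and that of "a" become large.
collinear = independent.copy()
collinear["a_copy"] = collinear["a"] + 0.1 * rng.normal(size=1000)
vif_collinear = vif_from_correlation(collinear)

print(vif_independent.round(2))
print(vif_collinear.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; weakly correlated features like those described for Dataset 2 would be expected to stay near 1.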

As with Dataset 1, the correlation method should match the data: the numeric features that look close to normal suit Pearson correlation, features that are not normally distributed call for Spearman, and the categorical variables are better assessed with tests like **Chi-Square**.