# Comparative Analysis of Heart Disease Risk Factors from Two Independent Datasets

## Project Overview
This project aims to analyze and compare two independent datasets related to heart disease, with the goal of identifying key risk factors and evaluating predictive models. The analysis includes data preprocessing, exploratory data analysis (EDA), statistical hypothesis testing, and machine learning modeling for both datasets, followed by a comparative evaluation of the results.

The two datasets are independent: they were collected separately, contain different features, and do not share any identical records.

## Datasets Description
- **Dataset 1:** Contains 1,319 records with 9 variables: age, gender, and medical measurements such as blood pressure, glucose, and troponin, plus the target variable `class` indicating heart disease presence.
- **Dataset 2:** Contains about 1,000 records with 16 variables covering demographic, lifestyle, and medical information, including the target variable indicating heart disease presence.


In [966]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

from src.overview_functions import convert_to_string_to_lower
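
The helper `convert_to_string_to_lower` lives in `src/overview_functions.py` and is not shown in this notebook. Judging from how it is called below (a DataFrame and a column name go in, a lowercased column comes out, and the resulting dtype is `string[python]`), it might look roughly like the following sketch. The name `to_lower_string` and its body are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd


def to_lower_string(df: pd.DataFrame, column: str) -> pd.Series:
    """Hypothetical stand-in for convert_to_string_to_lower (assumed behavior)."""
    # Cast to pandas' nullable string dtype so missing values stay missing
    # instead of becoming the literal text "nan", then lowercase the values.
    return df[column].astype("string").str.lower()


# Toy data, not taken from either dataset
toy = pd.DataFrame({"Gender": ["Male", "FEMALE", None]})
lowered = to_lower_string(toy, "Gender")
```

Keeping missing values as missing matters later, because the `alcohol_intake` step relies on `fillna(0)` to handle them.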

In [967]:
# Load both datasets from the project's data directory
dataset1 = pd.read_csv('../data/dataset_1.csv')
dataset2 = pd.read_csv('../data/dataset_2.csv')

In [968]:
dataset1

Unnamed: 0,age,gender,impluse,pressurehight,pressurelow,glucose,kcm,troponin,class
0,64,1,66,160,83,160.0,1.80,0.012,negative
1,21,1,94,98,46,296.0,6.75,1.060,positive
2,55,1,64,160,77,270.0,1.99,0.003,negative
3,64,1,70,120,55,270.0,13.87,0.122,positive
4,55,1,64,112,65,300.0,1.08,0.003,negative
...,...,...,...,...,...,...,...,...,...
1314,44,1,94,122,67,204.0,1.63,0.006,negative
1315,66,1,84,125,55,149.0,1.33,0.172,positive
1316,45,1,85,168,104,96.0,1.24,4.250,positive
1317,54,1,58,117,68,443.0,5.80,0.359,positive


In [969]:
# Check the dimensions of the dataset (rows, columns)
dataset1.shape

(1319, 9)

In [970]:
# Check the data types of all columns in the dataset
dataset1.dtypes

age                int64
gender             int64
impluse            int64
pressurehight      int64
pressurelow        int64
glucose          float64
kcm              float64
troponin         float64
class             object
dtype: object

In [971]:
# Check for missing values in each column
dataset1.isnull().sum()

age              0
gender           0
impluse          0
pressurehight    0
pressurelow      0
glucose          0
kcm              0
troponin         0
class            0
dtype: int64

In [972]:
# Count the number of unique values in each column
dataset1.nunique()

age               75
gender             2
impluse           79
pressurehight    116
pressurelow       73
glucose          244
kcm              700
troponin         352
class              2
dtype: int64

In [973]:
# Count the number of duplicate rows in the dataset
dataset1.duplicated().sum()

np.int64(0)

In [974]:
# Generate descriptive statistics for all numerical columns
dataset1.describe()

Unnamed: 0,age,gender,impluse,pressurehight,pressurelow,glucose,kcm,troponin
count,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0
mean,56.191812,0.659591,78.336619,127.170584,72.269143,146.634344,15.274306,0.360942
std,13.647315,0.474027,51.63027,26.12272,14.033924,74.923045,46.327083,1.154568
min,14.0,0.0,20.0,42.0,38.0,35.0,0.321,0.001
25%,47.0,0.0,64.0,110.0,62.0,98.0,1.655,0.006
50%,58.0,1.0,74.0,124.0,72.0,116.0,2.85,0.014
75%,65.0,1.0,85.0,143.0,81.0,169.5,5.805,0.0855
max,103.0,1.0,1111.0,223.0,154.0,541.0,300.0,10.3
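
The summary above shows a maximum `impluse` of 1111 against a 75th percentile of 85, which may warrant a closer look before modeling. One common way to flag such values is the interquartile-range rule; the sketch below applies it to toy numbers. The function name `iqr_outliers` and the multiplier `k = 1.5` are illustrative assumptions, not choices made in this project.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values chosen for illustration; only the last one is extreme
toy_impluse = pd.Series([64, 66, 70, 74, 85, 94, 1111])
flags = iqr_outliers(toy_impluse)
flagged_values = toy_impluse[flags].tolist()
```

On the real data the same mask could be used to inspect the flagged rows before deciding whether to keep, cap, or drop them.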


In [975]:
# Display the first 5 rows of the dataset
dataset1.head()

Unnamed: 0,age,gender,impluse,pressurehight,pressurelow,glucose,kcm,troponin,class
0,64,1,66,160,83,160.0,1.8,0.012,negative
1,21,1,94,98,46,296.0,6.75,1.06,positive
2,55,1,64,160,77,270.0,1.99,0.003,negative
3,64,1,70,120,55,270.0,13.87,0.122,positive
4,55,1,64,112,65,300.0,1.08,0.003,negative


In [976]:
# Convert the values in the 'class' column: 0 for 'negative' and 1 for 'positive'
dataset1['class'] = dataset1['class'].map({
    'negative': 0,
    'positive': 1
})
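
`Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a label with unexpected casing or whitespace would silently become missing. A minimal sketch of a stricter variant follows; `map_strict` is a hypothetical helper, not part of this project or of pandas.

```python
import pandas as pd


def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values through `mapping` and fail loudly if any value has no key."""
    mapped = series.map(mapping)
    # Values that were present before mapping but are missing afterwards
    # had no entry in the dictionary.
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values: {list(unmapped)}")
    return mapped


# Toy labels mirroring the 'class' column's two categories
toy_class = pd.Series(['negative', 'positive', 'negative'])
encoded = map_strict(toy_class, {'negative': 0, 'positive': 1})
```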

In [977]:
# Display dataset1 after encoding the target column
dataset1

Unnamed: 0,age,gender,impluse,pressurehight,pressurelow,glucose,kcm,troponin,class
0,64,1,66,160,83,160.0,1.80,0.012,0
1,21,1,94,98,46,296.0,6.75,1.060,1
2,55,1,64,160,77,270.0,1.99,0.003,0
3,64,1,70,120,55,270.0,13.87,0.122,1
4,55,1,64,112,65,300.0,1.08,0.003,0
...,...,...,...,...,...,...,...,...,...
1314,44,1,94,122,67,204.0,1.63,0.006,0
1315,66,1,84,125,55,149.0,1.33,0.172,1
1316,45,1,85,168,104,96.0,1.24,4.250,1
1317,54,1,58,117,68,443.0,5.80,0.359,1


In [978]:
# Display dataset2 as loaded, before any preprocessing
dataset2

Unnamed: 0,Age,Gender,Cholesterol,Blood Pressure,Heart Rate,Smoking,Alcohol Intake,Exercise Hours,Family History,Diabetes,Obesity,Stress Level,Blood Sugar,Exercise Induced Angina,Chest Pain Type,Heart Disease
0,75,Female,228,119,66,Current,Heavy,1,No,No,Yes,8,119,Yes,Atypical Angina,1
1,48,Male,204,165,62,Current,,5,No,No,No,9,70,Yes,Typical Angina,0
2,53,Male,234,91,67,Never,Heavy,3,Yes,No,Yes,5,196,Yes,Atypical Angina,1
3,69,Female,192,90,72,Current,,4,No,Yes,No,7,107,Yes,Non-anginal Pain,0
4,62,Female,172,163,93,Never,,6,No,Yes,No,2,183,Yes,Asymptomatic,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56,Female,269,111,86,Never,Heavy,5,No,Yes,Yes,10,120,No,Non-anginal Pain,1
996,78,Female,334,145,76,Never,,6,No,No,No,10,196,Yes,Typical Angina,1
997,79,Male,151,179,81,Never,Moderate,4,Yes,No,Yes,8,189,Yes,Asymptomatic,0
998,60,Female,326,151,68,Former,,8,Yes,Yes,No,5,174,Yes,Atypical Angina,1


In [979]:
# Check which columns contain at least one missing value
dataset2.isnull().any()

Age                        False
Gender                     False
Cholesterol                False
Blood Pressure             False
Heart Rate                 False
Smoking                    False
Alcohol Intake              True
Exercise Hours             False
Family History             False
Diabetes                   False
Obesity                    False
Stress Level               False
Blood Sugar                False
Exercise Induced Angina    False
Chest Pain Type            False
Heart Disease              False
dtype: bool

In [980]:
# Convert all column names to lowercase and replace spaces with underscores for consistency
dataset2.columns = dataset2.columns.str.lower().str.replace(' ', '_')

In [981]:
# Convert 'gender' column values to lowercase strings for consistency
dataset2['gender'] = convert_to_string_to_lower(dataset2, 'gender')

# Map gender categories to numeric values: 0 for female, 1 for male
dataset2['gender'] = dataset2['gender'].map({
    'female': 0,
    'male': 1
})

In [982]:
# Convert 'smoking' column values to lowercase strings for consistency
dataset2['smoking'] = convert_to_string_to_lower(dataset2, 'smoking')

# Map smoking status to numeric values: 
# 0 for never smoked, 1 for currently smoking, 2 for former smoker
dataset2['smoking'] = dataset2['smoking'].map({
    'never': 0,
    'current': 1,
    'former': 2
})

In [983]:
# Convert 'alcohol_intake' column values to lowercase strings for consistency
dataset2['alcohol_intake'] = convert_to_string_to_lower(dataset2, 'alcohol_intake')

# Map alcohol intake levels to numeric values:
# 1 for moderate, 2 for heavy, and replace NaN with 0 (no alcohol intake)
dataset2['alcohol_intake'] = dataset2['alcohol_intake'].map({
    'moderate': 1,
    'heavy': 2
}).fillna(0)

# Convert the 'alcohol_intake' column to integer type
dataset2['alcohol_intake'] = dataset2['alcohol_intake'].astype(int)

In [984]:
# Convert 'family_history' column values to lowercase strings for consistency
dataset2['family_history'] = convert_to_string_to_lower(dataset2, 'family_history')

# Map family history of heart disease to numeric values:
# 0 for no, 1 for yes
dataset2['family_history'] = dataset2['family_history'].map({
    'no': 0,
    'yes': 1,
})

In [985]:
# Convert 'diabetes' column values to lowercase strings for consistency
dataset2['diabetes'] = convert_to_string_to_lower(dataset2, 'diabetes')

# Map diabetes status to numeric values:
# 0 for no, 1 for yes
dataset2['diabetes'] = dataset2['diabetes'].map({
    'no': 0,
    'yes': 1,
})

In [986]:
# Convert 'obesity' column values to lowercase strings for consistency
dataset2['obesity'] = convert_to_string_to_lower(dataset2, 'obesity')

# Map obesity status to numeric values:
# 0 for no, 1 for yes
dataset2['obesity'] = dataset2['obesity'].map({
    'no': 0,
    'yes': 1,
})

In [987]:
# Convert 'exercise_induced_angina' column values to lowercase strings for consistency
dataset2['exercise_induced_angina'] = convert_to_string_to_lower(dataset2, 'exercise_induced_angina')

# Map exercise-induced angina status to numeric values:
# 0 for no, 1 for yes
dataset2['exercise_induced_angina'] = dataset2['exercise_induced_angina'].map({
    'no': 0,
    'yes': 1,
})
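
The four yes/no columns above (`family_history`, `diabetes`, `obesity`, `exercise_induced_angina`) all go through the same lowercase-then-map pattern, so they could be handled in one loop. The sketch below runs on a toy DataFrame and lowercases with `.str.lower()` directly rather than with the project's helper, which is an assumption made to keep the example self-contained.

```python
import pandas as pd

# Shared mapping for every yes/no column
YES_NO = {'no': 0, 'yes': 1}
binary_columns = ['family_history', 'diabetes', 'obesity', 'exercise_induced_angina']

# Toy rows for illustration, not taken from dataset2
toy = pd.DataFrame({
    'family_history': ['No', 'Yes'],
    'diabetes': ['No', 'No'],
    'obesity': ['Yes', 'No'],
    'exercise_induced_angina': ['Yes', 'Yes'],
})

for column in binary_columns:
    # Normalize casing first so 'Yes', 'YES', and 'yes' all map to 1
    toy[column] = toy[column].astype('string').str.lower().map(YES_NO)
```

A single shared dictionary also guarantees that every binary column uses the same encoding.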

In [988]:
# Convert 'chest_pain_type' column values to lowercase strings for consistency
dataset2['chest_pain_type'] = convert_to_string_to_lower(dataset2, 'chest_pain_type')

In [989]:
# Count the number of unique values in each column, including NaN values
dataset2.nunique(dropna=False)

age                         55
gender                       2
cholesterol                200
blood_pressure              90
heart_rate                  40
smoking                      3
alcohol_intake               3
exercise_hours              10
family_history               2
diabetes                     2
obesity                      2
stress_level                10
blood_sugar                130
exercise_induced_angina      2
chest_pain_type              4
heart_disease                2
dtype: int64

In [990]:
# Display the data type of each column in the dataset
dataset2.dtypes

age                                 int64
gender                              int64
cholesterol                         int64
blood_pressure                      int64
heart_rate                          int64
smoking                             int64
alcohol_intake                      int64
exercise_hours                      int64
family_history                      int64
diabetes                            int64
obesity                             int64
stress_level                        int64
blood_sugar                         int64
exercise_induced_angina             int64
chest_pain_type            string[python]
heart_disease                       int64
dtype: object

In [991]:
# Display the entire dataset2 DataFrame
dataset2

Unnamed: 0,age,gender,cholesterol,blood_pressure,heart_rate,smoking,alcohol_intake,exercise_hours,family_history,diabetes,obesity,stress_level,blood_sugar,exercise_induced_angina,chest_pain_type,heart_disease
0,75,0,228,119,66,1,2,1,0,0,1,8,119,1,atypical angina,1
1,48,1,204,165,62,1,0,5,0,0,0,9,70,1,typical angina,0
2,53,1,234,91,67,0,2,3,1,0,1,5,196,1,atypical angina,1
3,69,0,192,90,72,1,0,4,0,1,0,7,107,1,non-anginal pain,0
4,62,0,172,163,93,0,0,6,0,1,0,2,183,1,asymptomatic,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56,0,269,111,86,0,2,5,0,1,1,10,120,0,non-anginal pain,1
996,78,0,334,145,76,0,0,6,0,0,0,10,196,1,typical angina,1
997,79,1,151,179,81,0,1,4,1,0,1,8,189,1,asymptomatic,0
998,60,0,326,151,68,2,0,8,1,1,0,5,174,1,atypical angina,1


## Dataset 2 - Encoded Values

| Encoded Value | gender  | smoking  | alcohol_intake | family_history | diabetes | obesity | exercise_induced_angina |
|---------------|---------|----------|----------------|----------------|----------|---------|-------------------------|
| 0             | female  | never    | None / NaN     | no             | no       | no      | no                      |
| 1             | male    | current  | moderate       | yes            | yes      | yes     | yes                     |
| 2             | -       | former   | heavy          | -              | -        | -       | -                       |
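
The table above can also be kept in code, which makes it possible to turn the numeric codes back into labels for plots or reports. The following is a sketch under assumptions: `ENCODINGS` and `decode` are hypothetical names, and the label `'none'` for `alcohol_intake` code 0 is chosen here for illustration, since the loaded data holds missing values for that category.

```python
import pandas as pd

# Label -> code, mirroring the encoded-values table
ENCODINGS = {
    'gender': {'female': 0, 'male': 1},
    'smoking': {'never': 0, 'current': 1, 'former': 2},
    'alcohol_intake': {'none': 0, 'moderate': 1, 'heavy': 2},  # 'none' is an assumed label
    'family_history': {'no': 0, 'yes': 1},
    'diabetes': {'no': 0, 'yes': 1},
    'obesity': {'no': 0, 'yes': 1},
    'exercise_induced_angina': {'no': 0, 'yes': 1},
}


def decode(df: pd.DataFrame, encodings: dict = ENCODINGS) -> pd.DataFrame:
    """Return a copy of `df` with encoded columns translated back to labels."""
    decoded = df.copy()
    for column, mapping in encodings.items():
        if column in decoded.columns:
            # Invert label -> code into code -> label
            inverse = {code: label for label, code in mapping.items()}
            decoded[column] = decoded[column].map(inverse)
    return decoded


# Toy encoded rows for illustration
toy_encoded = pd.DataFrame({'gender': [0, 1], 'smoking': [2, 0], 'alcohol_intake': [0, 2]})
toy_decoded = decode(toy_encoded)
```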




