# Project Topic: 
## Predictive Modeling of Coronary Heart Disease using Machine Learning


## Problem Statement:
Predicting the likelihood of coronary heart disease (CHD) based on patient demographics and clinical indicators is crucial for early diagnosis and intervention. Traditional methods often lack accuracy and efficiency in risk assessment. Therefore, there is a need to develop a machine learning model that can accurately predict the presence of CHD based on a combination of patient characteristics and medical measurements.


## Description:
This project aims to develop a predictive model for identifying the presence of coronary heart disease (CHD) in patients using machine learning techniques. The model will utilize a dataset containing various patient attributes such as age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved during exercise, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), and the slope of the peak exercise ST segment.

The project will involve the following steps:

- Data Collection and Preprocessing: Gathering a comprehensive dataset containing patient information and preprocessing it to handle missing values, normalize features, and encode categorical variables.

- Exploratory Data Analysis (EDA): Exploring the dataset to understand the relationships between different features and the target variable (presence of CHD).

- Feature Selection: Identifying the most relevant features that contribute significantly to predicting CHD.

- Model Development: Implementing various machine learning algorithms such as logistic regression, decision trees, random forests, support vector machines, and gradient boosting to develop predictive models.

- Model Evaluation: Evaluating the performance of each model using appropriate metrics such as accuracy, precision, recall, and F1-score. Employing techniques like cross-validation to ensure the robustness of the models.

The outcome of this project will be a reliable machine learning model capable of accurately predicting the presence of CHD in patients based on their demographic information and clinical indicators. Such a model can assist healthcare providers in making timely and informed decisions for the prevention and management of coronary heart disease.








In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve

In [2]:
## Loading the file into a dataframe 
df = pd.read_csv(r"C:\Users\user\Desktop\Heart Disease Project\heart_statlog_cleveland_hungary_final.csv")

In [3]:
df.head()  # The first 5 rows of the dataframe

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
0,40,1,2,140,289,0,0,172,0,0.0,1,0
1,49,0,3,160,180,0,0,156,0,1.0,2,1
2,37,1,2,130,283,0,1,98,0,0.0,1,0
3,48,0,4,138,214,0,0,108,1,1.5,2,1
4,54,1,3,150,195,0,0,122,0,0.0,1,0


In [4]:
# Exploring the dataframe
def explore_data(df):
    # Normalise column names in place: strip whitespace, replace spaces with underscores
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.replace(' ', '_')
    missing_values = df.isna().sum()
    duplicated_rows = df.duplicated().sum()
    summary = df.describe().T
    target_counts = df.target.value_counts()
    # df.info() prints its report directly and returns None, so call it instead of printing it
    print('The Structure and Datatype of the Dataframe:')
    df.info()
    print('The missing values in the dataframe:', missing_values)
    print('The duplicated rows in the dataframe:', duplicated_rows)
    print('The Statistical Summary of the Variables in the DataFrame:', summary)
    print('The class counts of the target variable:', target_counts)
    # Flag imbalance when either class makes up more than 70% of the rows
    if target_counts.max() > 0.7 * len(df):
        print('The Data is Imbalanced')
    else:
        print('The Data is Balanced')

In [5]:
explore_data(df)

The Structure and Datatype of the Dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  1190 non-null   int64  
 1   sex                  1190 non-null   int64  
 2   chest_pain_type      1190 non-null   int64  
 3   resting_bp_s         1190 non-null   int64  
 4   cholesterol          1190 non-null   int64  
 5   fasting_blood_sugar  1190 non-null   int64  
 6   resting_ecg          1190 non-null   int64  
 7   max_heart_rate       1190 non-null   int64  
 8   exercise_angina      1190 non-null   int64  
 9   oldpeak              1190 non-null   float64
 10  ST_slope             1190 non-null   int64  
 11  target               1190 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 111.7 KB
The missing values in the dataframe: age                    0
sex                    0
che

In [6]:
## Dropping all duplicated rows from the dataframe
def clean_data(df):
    # drop_duplicates returns a new dataframe, so the input is left untouched
    return df.drop_duplicates()

df = clean_data(df)


## Splitting df into the dependent (y) and independent (x) variables
def split_data(df):
    x = df.drop(columns=['target'])
    y = df['target']
    return x, y

x, y = split_data(df)



In [7]:
## Creating a preprocessing pipeline for the independent variables
from sklearn.compose import ColumnTransformer

def process_data(x):
    # Standardise every feature column to zero mean and unit variance
    pipe = Pipeline([
        ('scaler', StandardScaler())
    ])
    # ColumnTransformer applies the pipeline to the listed columns
    processor = ColumnTransformer(
        transformers=[
            ('pipe', pipe, x.columns)
        ]
    )
    return processor

# Note: the processor is defined here but not applied to the models below
processor = process_data(x)
    

## Selecting the train and test sets
def train_test(x, y):
    # Hold out 25% of the rows for testing; the fixed seed keeps the split reproducible
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test = train_test(x, y)
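
The scaling step defined in this cell is never passed to the models that follow, which are fitted on the unscaled features. Tree-based models are insensitive to feature scaling, so this matters mainly if a logistic regression or SVM is added later. As a minimal sketch of one way to wire scaling into training, the block below chains a `StandardScaler` with a `RandomForestClassifier` in a single `Pipeline`, so the scaler is fitted on the training split only. The data is synthetic (`make_classification`), and the names `demo_pipe`, `X_demo`, and `y_demo` are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-disease features (assumption: 11 numeric columns)
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The scaler is fitted on the training split only, then reused on the test split
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])
demo_pipe.fit(X_tr, y_tr)

demo_f1 = f1_score(y_te, demo_pipe.predict(X_te))
print('F1 on the synthetic test split:', round(demo_f1, 3))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, because `fit` only ever sees the training rows.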

In [21]:
## Building the Random Forest model
def rf_model(x_train, y_train, x_test):
    # No random_state is set, so scores can vary slightly between runs
    rf = RandomForestClassifier().fit(x_train, y_train)
    # Impurity-based importance of each feature, indexed by column name
    importance_df = pd.DataFrame(
        data=rf.feature_importances_, index=x_train.columns, columns=['Importance']
    )
    rf_predict = rf.predict(x_test)
    return rf_predict, importance_df

rf_predict, importance_df = rf_model(x_train, y_train, x_test)

## Evaluating the performance of the model
def get_metric(y_test, y_pred, importance_df):
    print('The F1 Score of The RandomForest Classifier:', f1_score(y_test, y_pred))
    print('The Confusion matrix of The Random Forest Classifier Model:')
    print(confusion_matrix(y_test, y_pred))
    print(importance_df)

get_metric(y_test, rf_predict, importance_df)

The F1 Score of The RandomForest Classifier: 0.9111969111969113
The Confusion matrix of The Random Forest Classifier Model:
[[ 89   9]
 [ 14 118]]
                     Importance
age                    0.096862
sex                    0.035085
chest_pain_type        0.097393
resting_bp_s           0.069754
cholesterol            0.105142
fasting_blood_sugar    0.023363
resting_ecg            0.023753
max_heart_rate         0.107092
exercise_angina        0.109642
oldpeak                0.107776
ST_slope               0.224140


In [20]:
## Building the XGBoost model
def xgb_model(x_train, y_train, x_test):
    xgb = XGBClassifier().fit(x_train, y_train)
    # Importance of each feature as reported by XGBoost, indexed by column name
    importance_df = pd.DataFrame(
        data=xgb.feature_importances_, index=x_train.columns, columns=['Importance']
    )
    xgb_predict = xgb.predict(x_test)
    return xgb_predict, importance_df

xgb_predict, importance_df = xgb_model(x_train, y_train, x_test)

## Evaluating the performance of the model
def get_metric(y_test, y_pred, importance_df):
    print('The F1 Score of The XGBoost Classifier:', f1_score(y_test, y_pred))
    print('The Confusion matrix of The XGBoost Classifier Model:')
    print(confusion_matrix(y_test, y_pred))
    print(importance_df)

get_metric(y_test, xgb_predict, importance_df)

The F1 Score of The XGBoost Classifier: 0.8818897637795275
The Confusion matrix of The XGBoost Classifier Model:
[[ 88  10]
 [ 20 112]]
                     Importance
age                    0.024083
sex                    0.072788
chest_pain_type        0.104431
resting_bp_s           0.029534
cholesterol            0.035042
fasting_blood_sugar    0.057896
resting_ecg            0.018286
max_heart_rate         0.026146
exercise_angina        0.098047
oldpeak                0.047169
ST_slope               0.486577


In [22]:
## Building the DecisionTreeClassifier model
def tree_model(x_train, y_train, x_test):
    # No random_state is set, so scores can vary slightly between runs
    tree = DecisionTreeClassifier().fit(x_train, y_train)
    # Impurity-based importance of each feature, indexed by column name
    importance_df = pd.DataFrame(
        data=tree.feature_importances_, index=x_train.columns, columns=['Importance']
    )
    tree_predict = tree.predict(x_test)
    return tree_predict, importance_df

tree_predict, importance_df = tree_model(x_train, y_train, x_test)

## Evaluating the performance of the model
def get_metric(y_test, y_pred, importance_df):
    print('The F1 Score of The Decision Tree Classifier:', f1_score(y_test, y_pred))
    print('The Confusion matrix of The Decision Tree Classifier Model:')
    print(confusion_matrix(y_test, y_pred))
    print(importance_df)

get_metric(y_test, tree_predict, importance_df)

The F1 Score of The Decision Tree Classifier: 0.7951807228915662
The Confusion matrix of The Decision Tree Classifier Model:
[[80 18]
 [33 99]]
                     Importance
age                    0.076104
sex                    0.038359
chest_pain_type        0.070424
resting_bp_s           0.054500
cholesterol            0.095478
fasting_blood_sugar    0.024164
resting_ecg            0.019411
max_heart_rate         0.117770
exercise_angina        0.026926
oldpeak                0.070452
ST_slope               0.406413



# Project Report: Predictive Modeling of Coronary Heart Disease using Machine Learning

## Introduction

Coronary heart disease (CHD) is a leading cause of death worldwide, with millions of individuals affected annually. Traditional diagnostic work-ups can be expensive and time-consuming, and may not be accessible in smaller clinics. Machine learning (ML) offers a promising aid for early detection and risk assessment of CHD, using patient data to estimate the likelihood of disease. This project aims to develop a predictive model for CHD using ML techniques, applying several algorithms to patient demographics and clinical indicators.

## Objective

The primary objective of this project is to create a reliable ML model capable of accurately predicting the presence of CHD in patients based on their demographic information and clinical indicators. This model will assist healthcare providers in making timely and informed decisions for the prevention and management of coronary heart disease.

## Methodology

### Data Collection and Preprocessing

The dataset used for this project (`heart_statlog_cleveland_hungary_final.csv`, 1,190 rows and 12 columns) contains patient attributes such as age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved during exercise, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), and the slope of the peak exercise ST segment. Every column is fully populated, so no imputation was needed; duplicate rows were dropped before modeling. A `StandardScaler` step was defined for normalizing features, and the categorical variables were used as the integer codes already present in the file.

### Exploratory Data Analysis (EDA)

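EDA was conducted to understand the relationships between different features and the target variable (presence of CHD). In the notebook this step reports the structure and data types of the dataframe, missing values, duplicated rows, summary statistics, and the class balance of the target, to surface any patterns or anomalies that could influence the models' performance.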
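
A minimal sketch of two common EDA checks, per-class feature means and correlation with the target, is given below. It uses only the five rows shown by `df.head()` (trimmed to four columns), which is far too few to draw conclusions from; the point is the pattern, which would normally be run on the full dataframe. The names `eda_sample`, `class_means`, and `target_corr` are illustrative.

```python
import pandas as pd

# The five rows shown by df.head() in the notebook, trimmed to four columns
eda_sample = pd.DataFrame({
    'age': [40, 49, 37, 48, 54],
    'max_heart_rate': [172, 156, 98, 108, 122],
    'oldpeak': [0.0, 1.0, 0.0, 1.5, 0.0],
    'target': [0, 1, 0, 1, 0],
})

# Mean of each feature per class: a quick view of how features differ by outcome
class_means = eda_sample.groupby('target').mean()

# Pearson correlation of each feature with the target, strongest first
target_corr = (
    eda_sample.corr()['target']
    .drop('target')
    .sort_values(key=abs, ascending=False)
)

print(class_means)
print(target_corr)
```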

### Feature Selection

Feature relevance was assessed through the feature importance scores of the fitted tree-based models rather than through a separate selection step, and all eleven features were kept for training. The importance scores indicate which attributes each model relies on most.
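
One simple way to turn importance scores into a feature shortlist is to rank them and keep features until a cumulative threshold is reached. The sketch below applies that idea to the Random Forest importances printed in the notebook; the 0.80 cut-off is an arbitrary assumption chosen for illustration, and the names `rf_importance`, `ranked`, and `selected` are hypothetical.

```python
# Random Forest importances as printed in the notebook output
rf_importance = {
    'age': 0.096862, 'sex': 0.035085, 'chest_pain_type': 0.097393,
    'resting_bp_s': 0.069754, 'cholesterol': 0.105142,
    'fasting_blood_sugar': 0.023363, 'resting_ecg': 0.023753,
    'max_heart_rate': 0.107092, 'exercise_angina': 0.109642,
    'oldpeak': 0.107776, 'ST_slope': 0.224140,
}

# Sort features from most to least important
ranked = sorted(rf_importance.items(), key=lambda item: item[1], reverse=True)

# Keep features until the running total reaches the cut-off (0.80 is an arbitrary choice)
cutoff = 0.80
selected, running_total = [], 0.0
for name, score in ranked:
    if running_total >= cutoff:
        break
    selected.append(name)
    running_total += score

print(selected)
```

Any shortlist produced this way should be checked by refitting the model on the reduced feature set, since impurity-based importances can be unstable between runs.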

### Model Development

Three tree-based classifiers were implemented in the notebook: a decision tree, a random forest, and XGBoost (a gradient boosting method). Each was trained on 75% of the data and tested on the remaining 25%. Logistic regression and support vector machines, listed in the project plan, were not fitted in the runs shown here.
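
A hedged sketch of how several candidate algorithms could be compared on the same split follows. It uses synthetic data from `make_classification` as a stand-in for the clinical features and scikit-learn's `GradientBoostingClassifier` in place of XGBoost; the dictionary `candidates` and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the 11 clinical features (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Scale-sensitive models get a StandardScaler; tree-based models are used as-is
candidates = {
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'svm': make_pipeline(StandardScaler(), SVC()),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}

# Fit each candidate on the training split and score it on the held-out split
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```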

### Model Evaluation

Each model was evaluated on a single 75/25 train/test split (`random_state=42`) using the F1-score and the confusion matrix, from which accuracy, precision, and recall can be derived. Cross-validation, listed in the project plan as a robustness check, is not part of the runs shown here.
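
For reference, a minimal sketch of stratified 5-fold cross-validation with F1 scoring is shown below. It runs on synthetic data (`make_classification`), so the printed scores say nothing about the CHD dataset; the fold count, seeds, and the names `X_cv`, `y_cv`, and `cv_scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and target (assumption)
X_cv, y_cv = make_classification(n_samples=300, n_features=11, random_state=1)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=1),
    X_cv, y_cv, cv=cv, scoring='f1',
)

print('F1 per fold:', cv_scores.round(3))
print('Mean F1: %.3f (std %.3f)' % (cv_scores.mean(), cv_scores.std()))
```

Reporting the mean and spread across folds gives a steadier estimate than a single split, which can shift with the choice of random seed.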

## Results

### Random Forest Classifier

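The Random Forest Classifier achieved an F1 score of 0.911 on the held-out test set, the highest of the three models. Its confusion matrix shows 118 true positives, 89 true negatives, 9 false positives, and 14 false negatives, which corresponds to a precision of 0.929, a recall of 0.894, and an accuracy of 0.900. ST slope was the most important feature (0.224), followed by exercise angina (0.110), oldpeak (0.108), max heart rate (0.107), and cholesterol (0.105). Fasting blood sugar and resting ECG contributed least (about 0.02 each).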

### XGBoost Classifier

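The XGBoost Classifier achieved an F1 score of 0.882, lower than the Random Forest Classifier but still a strong result. Its confusion matrix shows 112 true positives, 88 true negatives, 10 false positives, and 20 false negatives, which corresponds to a precision of 0.918, a recall of 0.848, and an accuracy of 0.870. The model relied heavily on ST slope (0.487), followed by chest pain type (0.104), exercise angina (0.098), sex (0.073), and fasting blood sugar (0.058). Resting ECG (0.018) and age (0.024) contributed least.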

### Decision Tree Classifier

The Decision Tree Classifier achieved an F1 score of 0.795, the lowest of the three models. Its confusion matrix shows 99 true positives, 80 true negatives, 18 false positives, and 33 false negatives, which corresponds to a precision of 0.846, a recall of 0.750, and an accuracy of 0.778. ST slope was again the dominant feature (0.406), followed by max heart rate (0.118), cholesterol (0.095), and age (0.076). Resting ECG contributed least (0.019).
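
The precision, recall, and accuracy behind each F1 score can be recomputed from the confusion matrices printed in the notebook. The sketch below does this in plain Python, relying on scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`; the helper name `summarize` is hypothetical.

```python
# Confusion matrices as printed in the notebook, in the layout [[TN, FP], [FN, TP]]
confusion = {
    'Random Forest': [[89, 9], [14, 118]],
    'XGBoost': [[88, 10], [20, 112]],
    'Decision Tree': [[80, 18], [33, 99]],
}

def summarize(matrix):
    """Derive precision, recall, F1, and accuracy from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

metrics = {name: summarize(matrix) for name, matrix in confusion.items()}
for name, values in metrics.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

In a screening setting the false negatives (missed CHD cases) are usually the costlier error, so recall is worth reading alongside F1.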

## Conclusion

The project developed predictive models for CHD using ML techniques, demonstrating the potential of these algorithms in healthcare for early detection and risk assessment. The Random Forest Classifier and XGBoost Classifier achieved the highest F1 scores (0.911 and 0.882), suggesting that these models could be valuable tools for healthcare providers in the early diagnosis and management of CHD. In all three models, ST slope was the most important feature. Further research and refinement of these models could lead to more accurate and efficient CHD prediction systems.


#### Author: Tolulope Emuleomo
#### Date: 25 April 2024
