# UCI Heart Disease Data

By: Muhammad Haseeb Abbasi

Email Id: haseeb.abbasi0075@gmail.com

Kaggle: https://www.kaggle.com/muhammadhaseebabbasi

LinkedIn: https://www.linkedin.com/in/muhammad-haseeb-abbasi-6462358a/



## About the Dataset
This is a multivariate dataset, meaning each record consists of several separate statistical variables. It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the predicted attribute. The original database includes 76 attributes, but all published studies use a subset of 14 of them; the file used here adds an id column and an origin column, giving 16 columns in total. The Cleveland database is the only one that has been used by ML researchers to date. The dataset supports two main tasks: predicting from a patient's attributes whether that person has heart disease, and exploratory analysis to extract insights that help in understanding the problem.

Column Descriptions:\
1. id (unique ID for each patient)\
2. age (age of the patient in years)\
3. origin (place of study)\
4. sex (Male/Female)\
5. cp (chest pain type: typical angina, atypical angina, non-anginal, asymptomatic)\
6. trestbps (resting blood pressure in mm Hg on admission to the hospital)\
7. chol (serum cholesterol in mg/dl)\
8. fbs (True if fasting blood sugar > 120 mg/dl)\
9. restecg (resting electrocardiographic results: normal, stt abnormality, lv hypertrophy)\
10. thalach (maximum heart rate achieved; the code in this notebook refers to this column as `thalch`)\
11. exang (exercise-induced angina: True/False)\
12. oldpeak (ST depression induced by exercise relative to rest)\
13. slope (slope of the peak exercise ST segment)\
14. ca (number of major vessels, 0-3, colored by fluoroscopy)\
15. thal (normal, fixed defect, reversible defect)\
16. num (the predicted attribute)

Size: the dataset has 920 Rows and 16 Columns

## Acknowledgements

Creators:\
Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.\
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.\
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.\
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## Aim and Purpose of the Notebook
- Understanding the distribution of the data
- Imputing missing values with machine learning methods
- Applying the Random Forest algorithm
- Applying the Decision Tree algorithm
- Applying the XGBoost algorithm
- Choosing the best model using a for loop



In [None]:
#importing libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import mean_absolute_error, confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier


In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-data/heart_disease_uci.csv')

In [None]:
df.head()

In [None]:
df.describe()

## Data Distribution

* Data visualisation of numerical features using KDE plots

In [None]:
# First, group the numerical features in one list; the same list can be reused later when handling outliers
# 6 numerical features
numerical_features = ['age', 'chol', 'trestbps', 'oldpeak', 'ca', 'num']

In [None]:
# Define the numerical features
numerical_features = ['age', 'chol', 'trestbps', 'oldpeak', 'ca', 'num']

# Create KDE plots for each numerical feature
plt.figure(figsize=(15, 10))

for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 3, i)
    sns.kdeplot(data=df, x=feature, fill=True, color='skyblue')
    plt.title(f'KDE Plot for {feature}', fontsize=14)
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('Density', fontsize=12)

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()


The insights from the KDE plots are as follows:

Age (`age`):
The KDE plot shows a single peak, suggesting a unimodal distribution. Most individuals in the dataset are concentrated around one age range.

Cholesterol (`chol`):
The KDE plot displays two distinct peaks, indicating a bimodal distribution. This suggests the presence of two subgroups within the dataset with different cholesterol levels.

Resting Blood Pressure (`trestbps`):
The KDE plot reveals two peaks, indicating a bimodal distribution. This suggests the existence of two subgroups with different resting blood pressure levels.

Max Heart Rate Achieved (`thalch`):
The KDE plot also displays two peaks, suggesting a bimodal distribution. This could imply the presence of two distinct groups with different maximum heart rates. Note that `thalch` is not in the `numerical_features` list defined above, so it has to be added to that list for this plot to appear.

ST Depression (`oldpeak`):
The KDE plot exhibits a more complex pattern with two larger peaks and two smaller peaks. This suggests a multimodal distribution, indicating the possible presence of multiple subgroups with varying degrees of ST depression.

Predicted Attribute (`num`):
The KDE plot displays five peaks. Because `num` is the discrete predicted attribute rather than a continuous measurement, these peaks most likely correspond to its distinct class values and not to subgroups with different vessel counts. The number of major vessels is the separate `ca` column.

In [None]:
# Select the columns you want to visualize
columns_to_visualize = [ 'trestbps', 'chol', 'thalch', 'oldpeak', 'num']

# Normalize the selected columns
normalized_data = (df[columns_to_visualize] - df[columns_to_visualize].mean()) / df[columns_to_visualize].std()

# Plot KDE plots
plt.figure(figsize=(12, 8))

for column in normalized_data.columns:
    sns.kdeplot(normalized_data[column], label=column, fill=True)

plt.title('Normalized Distribution of Selected Columns')
plt.xlabel('Normalized Values')
plt.ylabel('Density')
plt.legend()
plt.show()

Insights:

I used Seaborn and Matplotlib to plot selected columns: 'trestbps', 'chol', 'thalch', 'oldpeak', and 'num'. Before plotting, each column was standardized to z-scores (the column mean subtracted, then divided by the column standard deviation) so that columns measured in different units can be compared on one axis. The resulting Kernel Density Estimation (KDE) plots show how the values of each column are spread; the shaded area under each curve is tallest where values are most common. Overlaying the standardized curves makes it easier to compare the shape, spread, and typical values of the columns.
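
As a small illustration of the scaling step, here is a minimal sketch on made-up numbers (the toy values below are not taken from the dataset). It applies the same mean/standard-deviation formula and checks that each column ends up with mean 0 and standard deviation 1.

```python
import pandas as pd

# Toy values standing in for two differently scaled columns (illustrative only)
toy = pd.DataFrame({
    "trestbps": [120.0, 130.0, 140.0, 150.0],
    "chol": [180.0, 220.0, 260.0, 300.0],
})

# Same formula as in the notebook cell: subtract the column mean,
# then divide by the column (sample) standard deviation
z = (toy - toy.mean()) / toy.std()

print(z.round(3))
print("means:", z.mean().round(6).tolist())
print("stds: ", z.std().round(6).tolist())
```

After this transformation the two columns share one scale, which is what lets their density curves be drawn on the same axis.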

* Data visualisation using Plotly box plots

In [None]:
import plotly.subplots as sp
import plotly.graph_objects as go
import plotly.express as px  

# Define the columns for analysis
columns_to_plot = ['id', 'age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'num']

# Create subplots
fig = sp.make_subplots(rows=len(columns_to_plot), cols=1, subplot_titles=columns_to_plot)

# Add box plots to subplots
for i, col in enumerate(columns_to_plot, 1):
    if df[col].dtype != 'object':
        box_plot = px.box(df, y=col, title=f'Box Plot for {col}')
        fig.add_trace(box_plot['data'][0], row=i, col=1)

# Update layout
fig.update_layout(height=len(columns_to_plot) * 300, showlegend=False)

# Show the plot
fig.show()


## Correlation Heatmap

In [None]:
import plotly.express as px
import pandas as pd

# Select only numeric columns for correlation analysis
numeric_columns_for_correlation = df.select_dtypes(include=['float64', 'int64']).columns

# Create a correlation matrix
correlation_matrix = df[numeric_columns_for_correlation].corr()

# Create a correlation heatmap using Plotly Express with a custom color scale
fig = px.imshow(correlation_matrix, labels=dict(color='Correlation'),
                color_continuous_scale=[(0, '#440154'), (0.5, '#fde724'), (1, '#4dac26')])

# Customize the layout
fig.update_layout(title='Striking Correlation Heatmap for Numeric Columns',
                  width=800, height=800, coloraxis_colorbar=dict(tickformat=".2f"))

# Display the heatmap
fig.show()


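Insights:\
Hovering the cursor over a cell shows the correlation between the variables on the x and y axes. Because no explicit range is set, the color scale spans the lowest to the highest value in the matrix: dark purple marks the lowest values (the most negative correlations), yellow the middle of that range, and green the highest values (strong positive correlations, with the diagonal at 1). Color therefore encodes the signed value, not the strength. The strength of a linear relationship is the absolute value of the coefficient, so both deep purple and green cells can indicate a pronounced association, while coefficients near 0 indicate a weak one.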
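
To make the reading of the coefficient concrete, here is a minimal, dependency-free sketch of the Pearson correlation, which is what `DataFrame.corr()` computes by default. The helper name `pearson` and the toy lists are my own, chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two sequences around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with a
falling = [10, 8, 6, 4, 2]   # moves against a

r_up = pearson(a, rising)
r_down = pearson(a, falling)
print(r_up, r_down)
```

Both toy pairs are perfectly linear, one with a positive and one with a negative sign, which is why the sign and the strength of a correlation have to be read separately.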

## Imputing Missing Values

In [None]:
df.isnull().sum().sort_values(ascending=False)
(round(df.isnull().sum()/len(df)*100,2)).sort_values(ascending=False)

In [None]:
# Plotting missing values
plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()

In [None]:
# identify the features with missing values.
missing_data_cols = df.isnull().sum()[df.isnull().sum() > 0].index.tolist()
# missing_data_cols
classifier_cols = ['thal', 'ca', 'slope', 'exang', 'restecg','fbs', 'cp', 'sex', 'num']
bool_cols = ['fbs', 'exang']
regressor_cols = ['oldpeak', 'thalch', 'chol', 'trestbps', 'age']

In [None]:
def impute_categorical_missing_data(passed_col):
    
    # Work on explicit copies so the later column assignment does not touch a view of df
    df_null = df[df[passed_col].isnull()].copy()
    df_not_null = df[df[passed_col].notnull()].copy()

    X = df_not_null.drop(passed_col, axis=1)
    y = df_not_null[passed_col]
    
    other_missing_cols = [col for col in missing_data_cols if col != passed_col]
    
    label_encoder = LabelEncoder()

    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    if passed_col in bool_cols:
        y = label_encoder.fit_transform(y)
        
    iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), add_indicator=True)

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]
        else:
            pass
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf_classifier = RandomForestClassifier()

    rf_classifier.fit(X_train, y_train)

    y_pred = rf_classifier.predict(X_test)

    acc_score = accuracy_score(y_test, y_pred)

    print("The feature '"+ passed_col+ "' has been imputed with", round((acc_score * 100), 2), "accuracy\n")

    X = df_null.drop(passed_col, axis=1)

    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]
        else:
            pass
                
    if len(df_null) > 0: 
        df_null[passed_col] = rf_classifier.predict(X)
        if passed_col in bool_cols:
            df_null[passed_col] = df_null[passed_col].map({0: False, 1: True})
        else:
            pass
    else:
        pass

    df_combined = pd.concat([df_not_null, df_null])
    
    return df_combined[passed_col]

def impute_continuous_missing_data(passed_col):
    
    # Work on explicit copies so the later column assignment does not touch a view of df
    df_null = df[df[passed_col].isnull()].copy()
    df_not_null = df[df[passed_col].notnull()].copy()

    X = df_not_null.drop(passed_col, axis=1)
    y = df_not_null[passed_col]
    
    other_missing_cols = [col for col in missing_data_cols if col != passed_col]
    
    label_encoder = LabelEncoder()

    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])
    
    iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), add_indicator=True)

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]
        else:
            pass
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf_regressor = RandomForestRegressor()

    rf_regressor.fit(X_train, y_train)

    y_pred = rf_regressor.predict(X_test)

    print("MAE =", mean_absolute_error(y_test, y_pred), "\n")
    # print("RMSE =", mean_squared_error(y_test, y_pred, squared=False), "\n")
    # print("R2 =", r2_score(y_test, y_pred), "\n")

    X = df_null.drop(passed_col, axis=1)

    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]
        else:
            pass
                
    if len(df_null) > 0: 
        df_null[passed_col] = rf_regressor.predict(X)
    else:
        pass

    df_combined = pd.concat([df_not_null, df_null])
    
    return df_combined[passed_col]

In [None]:
for col in missing_data_cols:
    print("Missing Values", col, ":", str(round((df[col].isnull().sum() / len(df)) * 100, 2))+"%")
    if col in classifier_cols:
        df[col] = impute_categorical_missing_data(col)
    elif col in regressor_cols:
        df[col] = impute_continuous_missing_data(col)
    else:
        pass

Insights:\
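This code imputes the missing values in the data. It defines two functions, `impute_categorical_missing_data` and `impute_continuous_missing_data`, for categorical and continuous columns respectively. Each function label-encodes the categorical predictors, fills gaps in the other predictor columns with `IterativeImputer`, trains a Random Forest (classifier or regressor) on the rows where the target column is present, and predicts the target column for the rows where it is missing. Note that `IterativeImputer` is fitted on one column at a time here, so it has no other features to model from and in effect acts as a simple per-column fill. The final loop checks which columns have missing values and applies the matching function. For each column the code prints a hold-out score for the imputation model: accuracy for categorical columns and mean absolute error (MAE) for continuous ones.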
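
The core pattern (train on the rows where a column is known, predict it where it is missing) can be sketched in a few lines. This is a simplified illustration on synthetic data, not the notebook's functions: the toy frame, the made-up relationship between `age` and `thalch`, and the choice of a single predictor are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 200

# Synthetic data: an invented linear link between age and max heart rate
age = rng.integers(30, 80, size=n).astype(float)
toy = pd.DataFrame({
    "age": age,
    "thalch": 210.0 - 0.9 * age + rng.normal(0, 5, size=n),
})

# Remove 20 values to simulate missing measurements
missing_idx = rng.choice(n, size=20, replace=False)
toy.loc[missing_idx, "thalch"] = np.nan

# Split rows by whether the target column is present
known = toy[toy["thalch"].notna()]
unknown = toy[toy["thalch"].isna()]

# Fit on the known rows, then predict the missing ones
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(known[["age"]], known["thalch"])
toy.loc[unknown.index, "thalch"] = model.predict(unknown[["age"]])

print(toy["thalch"].isna().sum())
```

Writing the predictions back with `.loc` on the original index keeps the row order unchanged, which avoids the concatenate-and-realign step.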

## Random Forest Algorithm

In [None]:
df.columns

In [None]:
# encode features which are categorical or object using for loop
le = LabelEncoder()
for i in df.columns:
    if df[i].dtype == 'object' or df[i].dtype == 'category':
        df[i] = le.fit_transform(df[i])
df.head()

In [None]:
# split the data into X (features) and y (target) for classification
# 'num' is the target; all remaining columns are used as features
X = df.drop('num', axis = 1)
y = df['num']
# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
# create and train the model, then predict on the test set
model_cl = RandomForestClassifier(n_estimators=200, random_state=42)
model_cl.fit(X_train, y_train)
y_pred = model_cl.predict(X_test)

#evaluate the model
print('accuracy score: ', accuracy_score(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('classification report:\n', classification_report(y_test, y_pred))

Insights:\
The random forest model was applied to the UCI Heart Disease dataset after the categorical and object columns were encoded with a LabelEncoder. The dataset was split into training and testing sets (80/20), and a RandomForestClassifier with 200 estimators was trained. The model reached an accuracy of approximately 59.8% on the test set.
The classification report shows that the model performs well on class 0 (no heart disease), with a precision of 72% and a recall of 91%. However, it struggles with the other classes, especially class 4, where both precision and recall are low.
In short, the model identifies individuals without heart disease reasonably well but has difficulty separating the remaining classes, particularly those with fewer instances.
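
For readers who want to see how the per-class precision and recall relate to the confusion matrix, here is a minimal sketch. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the helper name and the 3-class matrix are invented for illustration and are not the notebook's results.

```python
def per_class_precision_recall(cm):
    """Return a list of (precision, recall) pairs, one per class.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Made-up matrix with a dominant class 0 and a rare class 2
toy_cm = [
    [9, 1, 0],
    [3, 4, 1],
    [1, 1, 0],
]

results = per_class_precision_recall(toy_cm)
for k, (p, r) in enumerate(results):
    print(f"class {k}: precision={p:.2f} recall={r:.2f}")
```

In this toy matrix the rare class is never predicted correctly, so its precision and recall are both 0 even though overall accuracy looks acceptable, which mirrors the pattern described above.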

## Decision Tree Algorithm

In [None]:
# split the data into X and y
X = df.drop('num', axis=1)
y = df['num']

# encode the input variables
le = LabelEncoder()
X['cp'] = le.fit_transform(X['cp'])
X['thal'] = le.fit_transform(X['thal'])

# encode the target variable
y = le.fit_transform(y)

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:

%%time
# train the decision tree model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# predict the test data
y_pred = dt.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Insights:\
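The decision tree classifier, trained on the UCI Heart Disease dataset, produced the following metrics: accuracy 0.59, precision 0.59, recall 0.59, and F1 0.59. The four scores are identical because precision, recall, and F1 are micro-averaged (`average='micro'`), and for a single-label multiclass problem micro-averaged precision, recall, and F1 all equal accuracy. The match therefore says nothing about how balanced the model is. An accuracy of 0.59 is modest, but it is better than uniform random guessing over five classes (about 0.20); a fairer reference point is the accuracy of always predicting the most frequent class. Because `train_test_split` is called without `random_state`, the exact numbers vary from run to run.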
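
The identical scores can be reproduced by hand. The sketch below pools true positives, false positives, and false negatives over all classes, which is what micro-averaging does; the helper name and the toy labels are my own.

```python
def micro_scores(y_true, y_pred):
    """Accuracy plus micro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for k in labels:
        tp += sum(1 for t, p in pairs if t == k and p == k)
        fp += sum(1 for t, p in pairs if t != k and p == k)
        fn += sum(1 for t, p in pairs if t == k and p != k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, precision, recall, f1

# Toy labels over five classes (illustrative only)
y_true = [0, 0, 0, 1, 1, 2, 3, 4]
y_pred = [0, 0, 1, 1, 0, 2, 0, 0]

acc, prec, rec, f1 = micro_scores(y_true, y_pred)
print(acc, prec, rec, f1)  # -> 0.5 0.5 0.5 0.5
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts always give the same ratio as accuracy.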


## XGBoost algorithm


In [None]:
# split the data into X and y
X = df.drop('num', axis=1)
y = df['num']

# encode the input variables
le = LabelEncoder()
X['cp'] = le.fit_transform(X['cp'])
X['thal'] = le.fit_transform(X['thal'])

# encode the target variable
y = le.fit_transform(y)

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
%%time
# train the xgboost model
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

# predict the test data
y_pred = xgb.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Insights:\
The code label-encodes the categorical columns chest pain type ('cp') and thalassemia type ('thal') so that they are numeric. This conversion is needed because XGBoost expects numeric input rather than string labels. After training, the model correctly predicted about 63.6% of the cases in the test data, roughly 64 out of 100. The precision, recall, and F1 scores are also about 63.6%. Since they are micro-averaged, they equal accuracy by construction in a single-label multiclass problem, so this agreement does not show that the model is balanced across classes; per-class or macro-averaged scores are needed to judge that. The cell ran in about 2.56 seconds, so training XGBoost on this dataset is fast.
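
To see why a macro average is more informative about class balance, consider this minimal sketch. It uses made-up labels and a hypothetical predictor that always outputs the majority class; the helper name is my own.

```python
def f1_per_class(y_true, y_pred):
    """F1 score for each class that appears in y_true."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for k in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        denom = 2 * tp + fp + fn
        scores[k] = 2 * tp / denom if denom else 0.0
    return scores

# Imbalanced toy labels: eight samples of class 0, two of class 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # always predicts the majority class

per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
print("per-class F1:", per_class)
print("macro F1:", round(macro_f1, 3))
```

Accuracy stays high while the macro F1 drops sharply, because the minority class contributes an F1 of 0 with equal weight. Passing `average='macro'` to the scikit-learn metric functions gives this kind of view.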

## Comparison between Random Forest, Decision Tree and XGBoost algorithms using a for loop


Which model is better?


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Create a list of models
models = [DecisionTreeClassifier(), RandomForestClassifier(), XGBClassifier()]

best_model = None
best_score = 0

# Iterate through the models
for model in models:
    # Train and evaluate the model using cross-validation
    scores = cross_val_score(model, X, y, cv=5)  # Perform 5-fold cross-validation
    mean_score = scores.mean()  # Calculate the mean score

    # Keep track of the best model
    if mean_score > best_score:
        best_score = mean_score
        best_model = model

# The best model is now stored in the best_model variable
print("The best model is: ", best_model)

How did it work in the code?\
The for loop went through a list of machine learning models: Decision Tree, Random Forest, and XGBoost. For each model, it did the following:
- Cross-validated the model: `cross_val_score` with `cv=5` split the full data (X, y) into five folds, trained a fresh copy of the model on four folds, and scored it on the held-out fold, repeating this once per fold. The earlier train_test_split sets (X_train, X_test) are not used in this step.
- Evaluated the model: each fold produced one score using the estimator's default metric, which is accuracy for these classifiers. The five scores were averaged into `mean_score`.
- Comparison: Compared the current model's mean score with the best score so far. If the current model did better, it became the new best model.
- Final selection: At the loop's end, `best_model` holds the model with the highest mean cross-validated accuracy. Because `cross_val_score` fits copies of the estimator, this object is still unfitted and must be fitted before it is used for prediction.
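
The selection loop can be written so that every model's mean score is kept, not only the winner. The sketch below is a hedged variant on synthetic data from `make_classification` (an assumption made for illustration; it is not the heart disease data), and it leaves out XGBoost so that only scikit-learn is required.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Named candidates make the printed result easier to read
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

mean_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation; the default score for classifiers is accuracy
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[name] = fold_scores.mean()

best_name = max(mean_scores, key=mean_scores.get)

# cross_val_score fits clones, so fit the chosen estimator before using it
best_model = candidates[best_name].fit(X_demo, y_demo)

print(mean_scores)
print("best:", best_name)
```

Keeping the full `mean_scores` dictionary makes it possible to see how close the candidates are, which a single "best model" printout hides.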

## To Sum Up


To enhance the model's performance, one can consider the following steps:
- Feature engineering: create new features or transform existing ones to better represent relationships in the data.
- Hyperparameter tuning: adjust parameters such as the maximum depth of the tree, the minimum samples per leaf, or the splitting criterion to find the best configuration for the data.
- Model selection: evaluate alternative machine learning models such as random forests, gradient boosting, or linear models to determine whether a different approach captures the data better.
- Data preprocessing: handle missing values, scale features, and encode categorical variables properly.

Together, these steps aim to improve how well the model learns from the data and how accurately it predicts.
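
As one possible starting point for the hyperparameter tuning step, here is a minimal `GridSearchCV` sketch over the three decision tree parameters named above. The grid values and the synthetic data are assumptions chosen for illustration, not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Small illustrative grid: depth, leaf size and splitting criterion
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_accuracy = search.best_score_

print(best_params)
print(round(best_cv_accuracy, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by the notebook's `X` and `y`, and a scoring choice such as `"f1_macro"` may be worth considering given the class imbalance.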