Project Title : Heart Disease Prediction

Goal : Predict the heart disease risk of patients.

Authors : Joana Lawer & Samuel Osei

### Loading Data

In [None]:
# Import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# df is the variable name of our dataset
from google.colab import drive
drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/Data/framingham.csv')
df.head(5)

In [None]:
# Load the dataset and view first 10 entries. (Use if dataset is in the same folder as notebook)
# df = pd.read_csv('framingham.csv')
# df.head(10)

In [None]:
# View shape of dataset (number of rows and columns)
# df.shape
print('No. of Rows :', df.shape[0], '\nNo. of Columns :', df.shape[1])

- The columns represent the features and the rows represent the observations

In [None]:
#  View statistical info of the dataset
df.describe()

In [None]:
# View the datatype attributes of the features
df.info()

In [None]:
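# Check for duplicate rows (the first column is a row number, so it is excluded from the comparison)
print('No. of duplicate rows :', df.iloc[:, 1:].duplicated().sum())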


In [None]:
# Check the categorical columns in the dataset
cat_cols = df.select_dtypes(include='object').columns.tolist()
print(cat_cols)

In [None]:
print(f"{'Feature': <20} \t {'No. of values': <20}")
print('-'*40)
for col in cat_cols:
    print(f"{col: <20} \t {df[col].nunique(): <20}")

We see that each categorical feature has two distinct values. sex takes male or female, while currentSmoker, BPMeds, prevalentStroke, prevalentHyp and diabetes take yes or no.

In [None]:
# Check for missing values in each feature
# df.isnull().sum()
print(f"{'Feature': <20} \t {'No. of missing values': <20} \t {'Percentage of missing values': <20}")
for col in df.columns:
    n_missing = df[col].isna().sum()
    # Convert to a percentage first, then format, to avoid floating-point noise in the output
    print(f"{col: <20} \t {n_missing: <20} \t {n_missing / df.shape[0] * 100:.2f}%")

In [None]:
# drop first column by index
df = df.drop(df.columns[0], axis=1)
df.head()

The first column of the dataset only numbers the observations. It is not relevant to the analysis, so we drop it.

In [None]:
df.shape # Show the number of columns after dropping the first column

### Encode categorical features

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = label_encoder.fit_transform(df[col])
df.head()

From the above preview, the categorical features have been encoded as 0 and 1. LabelEncoder assigns codes in sorted order of the labels, which gives:

*   Male = 1; Female = 0
*   Yes = 1; No = 0

All values in the dataset are now numerical.
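
The mapping above follows from how `LabelEncoder` orders labels. The sketch below illustrates this on small hand-written lists; the label strings (`"Male"`, `"Female"`, `"Yes"`, `"No"`) and the helper `encoding_map` are assumptions made for illustration, not values read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder


def encoding_map(values):
    """Return the label -> integer code mapping that LabelEncoder learns."""
    encoder = LabelEncoder().fit(values)
    # classes_ holds the unique labels in sorted order; position = code
    return {str(label): code for code, label in enumerate(encoder.classes_)}


# Hypothetical label values, assumed to resemble the raw categorical columns
sex_map = encoding_map(["Male", "Female", "Female", "Male"])
yes_no_map = encoding_map(["No", "Yes", "No"])

print(sex_map)
print(yes_no_map)
```

Because the codes depend only on sort order, checking `classes_` after fitting is a quick way to confirm which label became 0 and which became 1.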



In [None]:
# Count the females (sex == 0 after encoding) with and without TenYearCHD
female = df[df['sex'] == 0]
n_female_chd = (female['TenYearCHD'] == 1).sum()
n_female_no_chd = (female['TenYearCHD'] == 0).sum()
print(f"No. of female with TenYearCHD : {n_female_chd}")
print(f"No. of female without TenYearCHD : {n_female_no_chd}")

In [None]:
# Plot the number of males and females with and without TenYearCHD
plt.figure(figsize=(10, 6))
sns.countplot(x='sex', hue='TenYearCHD', data=df)
plt.title('Number of Male and Female with or without TenYearCHD')
plt.show()

In [None]:
# Correlation Matrix
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, fmt='.2f')
plt.show()

The correlation matrix provides insights into the relationships between different features in the dataset. The correlation coefficient values range from -1 to 1, indicating the strength and direction of the relationship between the features.

### Key Points of the Correlation Matrix
1. Diagonal Elements:
- The diagonal elements of the matrix are all 1, as each feature is perfectly correlated with itself.

2. Correlation Values:
- Positive values indicate a positive correlation: as one feature increases, the other feature tends to increase.
- Negative values indicate a negative correlation: as one feature increases, the other feature tends to decrease.
- Values close to 0 indicate little to no linear relationship between the features.
3. Heatmap Colors:
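- The color encodes the signed correlation value. With seaborn's default colormap, lighter cells mark values near 1 and darker cells mark the lowest values, so the shade tracks the value itself rather than the strength of the relationship. To read strength from color intensity, use a diverging colormap centered at 0 (for example cmap='coolwarm' with center=0).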
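
As a small illustration of the points above, the following sketch computes the Pearson correlation coefficient by hand and compares it with `np.corrcoef`. The arrays `x` and `noisy` and the helper `pearson` are made up for this example and are not taken from the dataset.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

print("x vs 2x + 1 :", pearson(x, 2 * x + 1))  # perfectly increasing relationship
print("x vs -x     :", pearson(x, -x))         # perfectly decreasing relationship
print("x vs noisy  :", pearson(x, noisy))      # positive but imperfect relationship
print("np.corrcoef :", np.corrcoef(x, noisy)[0, 1])
```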

### Notable Correlations
- Sex and Current Smoker: Correlation = 0.20
 - There is a weak positive correlation between being male and being a current smoker.
- Age and TotChol: Correlation = 0.26
 - There is a weak positive correlation between age and total cholesterol level.
- Age and SysBP: Correlation = 0.39
 - There is a moderate positive correlation between age and systolic blood pressure.
- Current Smoker and CigsPerDay: Correlation = 0.77
 - There is a strong positive correlation between being a current smoker and the number of cigarettes smoked per day.
- PrevalentHyp and SysBP: Correlation = 0.70
 - There is a strong positive correlation between having hypertension and systolic blood pressure.
- SysBP and DiaBP: Correlation = 0.78
 - There is a strong positive correlation between systolic blood pressure and diastolic blood pressure.
- Diabetes and Glucose: Correlation = 0.62
 - There is a moderately strong positive correlation between having diabetes and glucose level.
- Age and TenYearCHD: Correlation = 0.23
 - There is a weak positive correlation between age and the 10-year risk of coronary heart disease. It is still among the more notable correlations with the target.

### Feature Selection and Splitting Data


In [None]:
# Import libraries from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
X = df.drop('TenYearCHD', axis=1)
y = df['TenYearCHD']
print(X.shape)
print(y.shape)

The target variable for our model is "TenYearCHD".

X is a new DataFrame containing every column of the original DataFrame df except TenYearCHD. These columns are the features used to train the machine learning model.

y represents the target variable (dependent variable) that the model will learn to predict.

This prepares the data for machine learning by separating the independent variables (features) from the dependent variable (target).

In [None]:
# Split data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Model Training and Evaluation

In [None]:
# Handle missing values using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # Replace missing values with the mean of the column
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Re-standardize now that missing values are filled (the earlier scaling cell ran before imputation)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#  Train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
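
The imputation, scaling, and training steps above can also be chained in a scikit-learn `Pipeline`, so that each step is fitted on the training data only and applied in a fixed order. This is a minimal sketch on synthetic data, not the notebook's dataset; the names `X_demo`, `y_demo`, and `model` are hypothetical, and the stratified split is an assumption added for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a few missing values (not the Framingham data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Impute first, then scale, then train; fit() learns every step from X_tr only
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(X_tr, y_tr)
demo_pred = model.predict(X_te)
print("Predictions made:", demo_pred.shape[0])
```

Keeping the steps in one object also means `model.predict` applies the same imputation and scaling to new data without repeating the transform calls by hand.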

In [None]:
# Predict and Evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

This output is a summary of the performance of a Random Forest model on the test dataset.

Accuracy: 0.8537735849056604

Accuracy measures the proportion of correctly predicted instances (both positive and negative) out of all instances. Here it is (713 + 11) / 848 ≈ 85.38%, meaning that 85.38% of the model's predictions on the test set are correct.

The Confusion Matrix:
 [[713  11]
  [113  11]]
 summarizes the performance of the classification model by counting true negative, false positive, false negative, and true positive predictions. Rows are the actual classes and columns are the predicted classes.
*   True Negatives (TN): 713
*   False Positives (FP): 11
*   False Negatives (FN): 113
*   True Positives (TP): 11

The classification report provides detailed metrics for each class (in this case, classes 0 and 1).

Class 0 (No CHD)

* Precision: 0.86.
The proportion of true negative predictions (class 0) among all negative predictions.
Out of all predictions for class 0, 86% were correct (713 of 826).
* Recall: 0.98.
The proportion of true negative predictions (class 0) among all actual negatives.
Out of all actual class 0 instances, 98% were correctly predicted as class 0.
* F1-Score: 0.92.
The harmonic mean of precision and recall. Higher value indicates better performance.

Class 1 (CHD)

- Precision: 0.50.
The proportion of true positive predictions (class 1) among all positive predictions.
Out of all predictions for class 1, 50% were correct.
- Recall: 0.09.
The proportion of true positive predictions (class 1) among all actual positives.
Out of all actual class 1 instances, only 9% were correctly predicted as class 1.
- F1-Score: 0.15.
The harmonic mean of precision and recall. For class 1, it balances the two metrics into a single score, which is quite low.
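
The per-class figures above can be reproduced directly from the four confusion-matrix counts. The following is a small arithmetic sketch; the variable names are chosen for illustration.

```python
# Counts from the confusion matrix (rows = actual class, columns = predicted class)
tn, fp, fn, tp = 713, 11, 113, 11

# Class 0 (No CHD): "correct" predictions are the true negatives
precision_0 = tn / (tn + fn)   # of everything predicted as class 0
recall_0 = tn / (tn + fp)      # of everything that actually is class 0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Class 1 (CHD): "correct" predictions are the true positives
precision_1 = tp / (tp + fp)   # of everything predicted as class 1
recall_1 = tp / (tp + fn)      # of everything that actually is class 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```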

### Overall Metrics

- Accuracy: 0.85
 - The overall accuracy of the model across both classes is 85%.
- Macro Average: Precision = 0.68, Recall = 0.54, F1-Score = 0.54
 - The macro average calculates the metrics for each class independently and then takes the average, giving equal weight to each class.
- Weighted Average: Precision = 0.81, Recall = 0.85, F1-Score = 0.81
 - The weighted average takes into account the support (number of true instances) of each class, giving more weight to the majority class.
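
To make the difference between the two averages concrete, this sketch recomputes both from the confusion-matrix counts. The helper `precision_recall_f1` and the dictionary names are hypothetical and exist only for this example.

```python
# Counts from the confusion matrix (rows = actual class, columns = predicted class)
tn, fp, fn, tp = 713, 11, 113, 11


def precision_recall_f1(correct, n_predicted, n_actual):
    """Precision, recall and F1 for one class from raw counts."""
    precision = correct / n_predicted
    recall = correct / n_actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = {
    0: precision_recall_f1(tn, tn + fn, tn + fp),
    1: precision_recall_f1(tp, tp + fp, tp + fn),
}
support = {0: tn + fp, 1: tp + fn}  # number of actual instances per class
total = sum(support.values())

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro = [sum(per_class[c][i] for c in per_class) / len(per_class) for i in range(3)]
weighted = [sum(per_class[c][i] * support[c] for c in per_class) / total for i in range(3)]

for name, values in (("Macro", macro), ("Weighted", weighted)):
    print(f"{name}: precision={values[0]:.2f} recall={values[1]:.2f} f1={values[2]:.2f}")
```

The weighted scores sit close to the class 0 scores because class 0 supplies 724 of the 848 test instances.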

### Interpretation

- Accuracy in Context: The overall accuracy is 85.38%, but a model that always predicts class 0 would reach the same figure (724 / 848 = 85.38%). Accuracy alone therefore does not show that the model has learned to detect CHD.
- Class Imbalance: There is an imbalance between the classes (724 instances of class 0 vs. 124 instances of class 1), which inflates accuracy and the weighted averages.
- Poor Performance for Class 1:
 - Low Precision and Recall: The model performs poorly in predicting class 1 (CHD). The precision is 0.50, meaning half of the positive predictions (11 of 22) are incorrect. The recall is very low (0.09), meaning the model finds only 11 of the 124 actual positive instances and misses 113.
 - Low F1-Score: The F1-score for class 1 is also low (0.15), indicating poor overall performance in predicting this class.
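
One way to put the accuracy figure in perspective is to compare it with a majority-class baseline and with balanced accuracy (the mean of the per-class recalls). The sketch below uses only the counts reported above; the baseline comparison is an added illustration, not a step from the original analysis.

```python
# Counts from the confusion matrix (rows = actual class, columns = predicted class)
tn, fp, fn, tp = 713, 11, 113, 11

support_0 = tn + fp  # actual class 0 instances
support_1 = tp + fn  # actual class 1 instances
total = support_0 + support_1

model_accuracy = (tn + tp) / total
# A baseline that always predicts the majority class gets every majority instance right
baseline_accuracy = max(support_0, support_1) / total
# Balanced accuracy gives each class equal weight regardless of its size
balanced_accuracy = (tn / support_0 + tp / support_1) / 2

print(f"Model accuracy    : {model_accuracy:.4f}")
print(f"Baseline accuracy : {baseline_accuracy:.4f}")
print(f"Balanced accuracy : {balanced_accuracy:.4f}")
```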

### Feature Importance

In [None]:
feature_importances = clf.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Features': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values('Importance', ascending=False)
plt.figure(figsize=(12,6))
sns.barplot(x='Importance', y='Features', data=importance_df)
plt.title('Feature Importance')
plt.show()

This visualization helps in understanding which features have the most significant impact on the model's predictions and can provide insights for feature selection or further analysis.

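Systolic blood pressure, BMI, and age are the three most important features in this model. Feature importance measures how much the model relies on a feature rather than a causal effect, so this ranking suggests that blood pressure and BMI deserve close monitoring but does not by itself show that managing them lowers the risk of heart disease.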

### Cross-Validation and Hyperparameter Tuning

In [None]:
#  Cross validation

In [None]:
# Hyperparameter Tuning
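
The two cells above are left empty in the notebook. As one possible starting point, the sketch below runs stratified cross-validation and a small grid search with scikit-learn. It uses synthetic data from `make_classification` rather than the notebook's dataset, and the parameter grid, the F1 scoring choice, and the names `X_demo`, `y_demo`, and `search` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 85% class 0, 15% class 1)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
forest = RandomForestClassifier(n_estimators=30, random_state=42)

# Cross-validation: score the same model on several train/validation splits
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="f1")
print("F1 per fold:", cv_scores.round(2), "mean:", cv_scores.mean().round(2))

# Hyperparameter tuning: try every combination in a small, hypothetical grid
param_grid = {"max_depth": [3, None], "class_weight": [None, "balanced"]}
search = GridSearchCV(forest, param_grid, cv=cv, scoring="f1")
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Best mean F1   :", round(search.best_score_, 2))
```

Scoring on F1 for class 1 instead of accuracy is a deliberate choice here, since accuracy is dominated by the majority class on imbalanced data.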