# Heart Failure Prediction

### This is a machine learning project in which we built a model to predict heart failure in an individual
#### Contributors:
- Goodness Nwokebu
- Ibukunoluwa Abraham
- Faith Lucky

## Importing Necessary Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pd.set_option('display.max_rows',None)
df = pd.read_csv('heart.csv')


In [None]:
df.shape
print(f'Our dataset has {df.shape[0]} rows and {df.shape[1]} columns')

In [None]:
df.head()

In [None]:
df.isnull().sum()

The data has no null values

In [None]:
df.info()

In [None]:
data = pd.Series({
    'Age': 'age of the patient [years]',
    'Sex': 'sex of the patient [M: Male, F: Female]',
    'ChestPainType': 'chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]',
    'RestingBP': 'resting blood pressure [mm Hg]',
    'Cholesterol': 'serum cholesterol [mg/dl]',
    'FastingBS': 'fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]',
    'RestingECG': "resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]",
    'MaxHR': 'maximum heart rate achieved [Numeric value between 60 and 202]',
    'ExerciseAngina': 'exercise-induced angina [Y: Yes, N: No]',
    'Oldpeak': 'oldpeak = ST [Numeric value measured in depression]',
    'ST_Slope': 'the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]',
    'HeartDisease': 'output class [1: heart disease, 0: Normal]',
})

datas = pd.DataFrame(data, columns = ['Description'])
print(datas.to_markdown(tablefmt="grid"))

The dtype reported by `df.info()` for each column is consistent with its description in the table above.

In [None]:
df.describe(include = 'all')

In [None]:
sns.pairplot(df, hue="HeartDisease",palette = "RdPu")

In [None]:
plt.figure(figsize = (14,7))
sns.heatmap(df.corr(numeric_only = True),annot = True)  # text columns cannot be correlated, so use numeric columns only

The heatmap above displays the pairwise correlation between the numeric variables. The Pearson correlation coefficient ranges from -1 to +1: values close to +1 indicate a strong positive correlation (as one variable increases, the other also increases), values close to -1 indicate a strong negative correlation (as one variable increases, the other decreases), and values near 0 indicate little or no linear relationship.

Aside from the coefficients, the cell color also encodes the correlation; the color bar beside the heatmap shows how color maps to the coefficient. From the heatmap, there is an overall low to moderate positive correlation between the features Age, RestingBP, FastingBS and Oldpeak and the label HeartDisease. MaxHR has a moderate negative correlation with HeartDisease, and Age and MaxHR are also moderately negatively correlated.
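
To make the coefficient concrete, here is a minimal, dependency-free sketch of the Pearson formula that `df.corr()` applies to each pair of columns. The `age` and `max_hr` values are made up for illustration and are not taken from `heart.csv`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: max_hr falls by 15 for every 10 years of age
age = [40, 50, 60, 70]
max_hr = [180, 165, 150, 135]
print(pearson_r(age, max_hr))  # perfectly linear and decreasing, so -1.0
```

Real clinical data is noisier than this toy series, which is why the heatmap shows coefficients well inside the -1 to +1 range.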

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(3,2,1)
sns.boxplot(x = df.HeartDisease, y = df.Age,  palette = "RdPu")
plt.subplot(3,2,2)
sns.boxplot(x = df.HeartDisease, y = df.RestingBP, palette = "RdPu")
plt.subplot(3,2,3)
sns.boxplot(x = df.HeartDisease, y = df.Cholesterol,  palette = "RdPu")
plt.subplot(3,2,4)
sns.boxplot(x = df.HeartDisease, y = df.FastingBS,  palette = "RdPu")
plt.subplot(3,2,5)
sns.boxplot(x = df.HeartDisease, y = df.MaxHR,  palette = "RdPu")
plt.subplot(3,2,6)
sns.boxplot(x = df.HeartDisease, y = df.Oldpeak,  palette = "RdPu")





From the boxplots, the median Age and median Oldpeak of persons with heart disease were higher than those of persons without heart disease. This suggests that older persons, and persons with a higher Oldpeak reading, are more likely to have heart disease. In the fifth subplot, however, heart disease was diagnosed in persons with a lower maximum heart rate, indicating that the probability of heart disease increases as maximum heart rate decreases.

All the features contained outliers, and the majority of persons with a fasting blood sugar higher than 120 mg/dl had some form of heart disease.

The Cholesterol and Oldpeak subplots displayed negative skewness while MaxHR showed positive skewness. This indicates that the majority of persons with heart disease had a cholesterol level below 200 mg/dl (medium-low risk), an Oldpeak below 1.8 (low risk) and a MaxHR above 125 (low-medium risk).
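
For readers unfamiliar with box plots: the line inside each box is the median, the box spans the first to third quartile, and by default seaborn draws points more than 1.5 times the interquartile range (IQR) beyond the box as outliers. The sketch below applies that rule with the standard library. The readings are invented, and quartile conventions differ slightly between libraries, so treat it as an approximation of what the plot shows.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartile box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Invented resting blood pressure readings (mm Hg), for illustration only
readings = [110, 120, 125, 130, 135, 140, 200]
print(statistics.median(readings))  # centre line of the box: 130
print(iqr_outliers(readings))       # [200]
```

Note that the mean of these readings (about 137) is pulled upward by the single extreme value, while the median is not, which is why box plots are read in terms of medians.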

**What is the relationship between age and HeartFailure?**

In [None]:
sns.histplot(data = df, x = 'Age', hue = 'HeartDisease', palette = 'RdPu')
#sns.countplot(data= df, x='Age',hue='HeartDisease')

From the visualization above, cases of heart disease are concentrated among patients aged about 55 years and above.

**Counts of sex category**

In [None]:
data1 = df

In [None]:
sex_counts = data1["Sex"].value_counts(normalize=True).round(2) * 100
sex_counts = sex_counts.rename_axis("Sex").reset_index(name="Pct")  # explicit names work across pandas versions
sex_counts

In [None]:
plt.figure(figsize=(7, 5))
sns.set_context("paper")



ax1 = sns.barplot(
    data=sex_counts,
    x="Sex",
    #errorbar=None,
    y="Pct",
    palette= "RdPu",
    linewidth=0.5,
    edgecolor="black",
    alpha=0.7,
)

values1 = ax1.containers[0].datavalues
labels = ["{:g}%".format(val) for val in values1]
ax1.bar_label(ax1.containers[0], labels=labels)

ax1.set_ylabel("Percent")
ax1.set_xlabel("")
ax1.set_title(
    "Almost 80% of the patients are male, ~21% are female", fontsize=10
)


plt.show()


The data is imbalanced with respect to sex, as it contains far more males than females.

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(3,2,1)
#plt.title('Sex')
fig = sns.histplot(data = df, x ='Sex', hue = 'HeartDisease', multiple="dodge", shrink=.8, palette = "RdPu")


plt.subplot(3,2,2)
#plt.title('Chest Pain Type')
sns.histplot(data = df, x ='ChestPainType', hue = 'HeartDisease', multiple="dodge", shrink=.8, palette = "RdPu")


plt.subplot(3,2,3)
#plt.title('RestingECG')
sns.histplot(data = df, x ='RestingECG', hue = 'HeartDisease', multiple="dodge", shrink=.8, palette = "RdPu")


plt.subplot(3,2,4)
#plt.title('ExerciseAngina')
sns.histplot(data = df, x ='ExerciseAngina', hue = 'HeartDisease', multiple="dodge", shrink=.8, palette = "RdPu")


plt.subplot(3,2,5)
#plt.title('ST_Slope')
sns.histplot(data = df, x ='ST_Slope', hue = 'HeartDisease', multiple="dodge", shrink=.8, palette = "RdPu")





##### Summary of the Plots
- **Subplot 1:** A higher proportion of males than females have heart disease.
- **Subplot 2:** Most persons with heart disease have asymptomatic (ASY) chest pain, while the other chest pain types may or may not be accompanied by heart disease.
- **Subplot 3:** A normal, ST or left ventricular hypertrophy (LVH) result is not by itself a definite indicator of heart disease, though a greater proportion of persons with an ST-T wave abnormality have heart disease.
- **Subplot 4:** Exercise angina is chest pain brought on by exercise. Most persons who reported it had heart disease.
- **Subplot 5:** ST_Slope is the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]. From this visualization, an 'Up' slope appears to be the safest, as the likelihood of heart disease in that group is relatively low.

## Feature Engineering




In [None]:
data = df.copy()
for colname in data.select_dtypes("object"):
    data[colname], _ = data[colname].factorize()
#this code obtains a numeric representation of an array identifying distinct values.
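
As a rough illustration of what `factorize()` does in the cell above, the following dependency-free sketch assigns integer codes to categories in order of first appearance. The function name and sample values are hypothetical; pandas handles missing values and dtypes that this sketch ignores.

```python
def factorize(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes, uniques, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(uniques)  # next unused code
            uniques.append(v)
        codes.append(seen[v])
    return codes, uniques

codes, uniques = factorize(["M", "F", "M", "M", "F"])
print(codes)    # [0, 1, 0, 0, 1]
print(uniques)  # ['M', 'F']
```

The codes are arbitrary labels, so their numeric order carries no meaning. scikit-learn's `LabelEncoder` works similarly but assigns codes in sorted order of the class names.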

In [None]:
X = data.copy()
y = X.pop('HeartDisease')
X.dtypes
#X.dtypes is numeric so we are good to go

#### A) Feature engineering by Mutual Information


In [None]:
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

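# Treat integer-coded columns as discrete; Oldpeak (float) is treated as continuous
discrete_features = [pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes]
mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores  # show the features with their MI scores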

- The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.

- HeartDisease shows little or no dependence on RestingECG, Age and RestingBP.
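
The definition above can be sketched for two discrete variables using empirical frequencies only. This is a simplified illustration with made-up values; `mutual_info_classif` uses a nearest-neighbour estimator for continuous features, so its scores will not match a plain frequency count.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

target = [0, 0, 1, 1]
informative = ["N", "N", "Y", "Y"]    # determines the target exactly
uninformative = ["M", "F", "M", "F"]  # independent of the target
print(mutual_information(informative, target))    # ln(2), about 0.693
print(mutual_information(uninformative, target))  # 0.0
```

A score of zero means that knowing the feature says nothing about the label, which is the sense in which the lowest-ranked features above show little dependence.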

#### B) Feature engineering by K best

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
   #target column 
#apply SelectKBest class to extract top best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
X = X.drop('Oldpeak', axis = 1)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(12,'Score'))  #print best features


The Oldpeak column was dropped because it contains negative values, and chi2 only accepts non-negative feature values.
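
To see why negative values are a problem, here is a rough sketch of the idea behind a chi-square feature score: the feature's total within each class is compared with the total expected if the feature were independent of the class. This is a simplified illustration on invented data, not scikit-learn's exact implementation.

```python
def chi2_score(feature, target):
    """Chi-square statistic of per-class feature totals against independence."""
    n = len(target)
    total = sum(feature)
    score = 0.0
    for cls in set(target):
        # Observed total of the feature among samples of this class
        observed = sum(f for f, t in zip(feature, target) if t == cls)
        # Expected total if the feature were spread evenly across classes
        expected = total * target.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

target = [0, 0, 1, 1]
print(chi2_score([0, 0, 1, 1], target))  # feature tracks the class: 2.0
print(chi2_score([1, 0, 1, 0], target))  # feature ignores the class: 0.0
```

The feature values are treated like counts, so negative entries could make the expected totals zero or negative and the statistic meaningless.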

In [None]:
#to view the different unique independent variables contained in the data 
for i in data.columns:
    print(i,len(df[i].unique()))
                       
    

From the cardinality, features with many unique values, such as Age and Cholesterol, could be binned to give a better model.
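
Binning was not carried out in this notebook, but a minimal sketch of the idea looks like this. The band edges and labels are hypothetical choices made for illustration, not clinically validated cut-offs.

```python
from bisect import bisect_right

# Hypothetical band edges and labels (assumptions, not from the dataset)
AGE_EDGES = [40, 50, 60]
AGE_LABELS = ["<40", "40-49", "50-59", "60+"]

def age_band(age):
    """Return the label of the band that an age falls into."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

for age in (39, 40, 59, 60):
    print(age, age_band(age))
```

With pandas, `pd.cut` offers the same kind of grouping directly on a column.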

In [None]:
plt.figure(figsize = (14,7))
sns.heatmap(data.corr(),annot = True)

Selecting the best features is an important step when preparing a dataset for training. It helps to eliminate the less informative parts of the data and reduces training time. Based on all the feature engineering above, we decided to drop:
- RestingECG
- RestingBP

## Data Preprocessing

In [None]:
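# Names of the categorical (text) columns
categorical = df.select_dtypes("object").columns.tolist()

# Number of unique values in each categorical column
num = [len(df[cat].unique()) for cat in categorical]
print(num)
df[categorical]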

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
new_df = df.copy()
new_df.drop(['RestingECG','RestingBP'], axis = 1, inplace = True) #according to our feature engineering

In [None]:
#transforming categorical variables using the label encoding method
new_df['Sex'] = le.fit_transform(new_df['Sex'])
new_df['ChestPainType'] = le.fit_transform(new_df['ChestPainType'])
new_df['ExerciseAngina'] = le.fit_transform(new_df['ExerciseAngina'])
new_df['ST_Slope'] = le.fit_transform(new_df['ST_Slope'])

In [None]:
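numerical = df.select_dtypes("number").columns.tolist()  # names of the numeric columns
numerical.remove('RestingBP')  # RestingBP was dropped from new_df
new_df[numerical]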

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score,roc_curve,ConfusionMatrixDisplay,classification_report,roc_auc_score,confusion_matrix

Scaler = MinMaxScaler()#for an appropriate scale



In [None]:
X = new_df.copy()

y= X.pop('HeartDisease')

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3, random_state = 43)
#scaling the train data: fit the scaler on the training set only
scale_cols = ['Age', 'FastingBS', 'Cholesterol', 'MaxHR']
Scaler.fit(X_train[scale_cols])
X_train[scale_cols] = Scaler.transform(X_train[scale_cols])

X_train



In [None]:
print("Size of training set:", X_train.shape)


In [None]:
#scaling the test data with the min and max learned from the training set

X_test[scale_cols] = Scaler.transform(X_test[scale_cols])
print("Size of test set:", X_test.shape)
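
Min-max scaling maps each value to `(x - min) / (max - min)`. The dependency-free sketch below learns the minimum and maximum from a training column and reuses them on a test column, so both are measured on the same scale. The ages are invented for illustration.

```python
def fit_min_max(values):
    """Learn the scaling range from training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range."""
    return [(v - lo) / (hi - lo) for v in values]

train_age = [30, 50, 70]
test_age = [40, 80]

lo, hi = fit_min_max(train_age)
print(apply_min_max(train_age, lo, hi))  # [0.0, 0.5, 1.0]
print(apply_min_max(test_age, lo, hi))   # [0.25, 1.25]
```

A test value outside the training range lands outside [0, 1], which is expected. Refitting the range on the test set instead would give the same age a different scaled value in each split.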

## Machine Learning

This project is a **classification problem**, as it aims to predict who might have heart failure given some attributes.

There are many classification models available, but we used:
- Logistic Regression
- Support Vector Classifier
- Random Forest Classifier
- Gradient Boosting Classifier

In [None]:
#IMPORTING MODELS AND NECESSARY LIBRARIES
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
def ML(model, X_train = X_train,X_test = X_test,y_train = y_train,y_test = y_test):
    """ 
    This function takes in the type of model you want to run 
    with an already defined X_train,X_test,y_train and y_test then
    gives the accuracy, confusionmatrix and classification report
    
    """
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    print('Accuracy score of this model : ', accuracy_score(y_test, y_pred_test))
    cm =confusion_matrix(y_test, y_pred_test)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot()
    print(classification_report(y_test, y_pred_test))
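
The numbers in the classification report can all be derived from the four cells of the confusion matrix. As a small, dependency-free sketch with invented labels:

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Invented labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / len(y_true_demo)
precision = tp / (tp + fp)  # of those flagged as diseased, how many really are
recall = tp / (tp + fn)     # of those really diseased, how many were flagged
print(tp, tn, fp, fn)               # 3 3 1 1
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

scikit-learn's `confusion_matrix` lays these counts out as `[[tn, fp], [fn, tp]]`, with rows for the true class and columns for the predicted class.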



#### MODEL ONE: Logistic Regression

In [None]:
logreg = LogisticRegression(random_state=0)

In [None]:
ML(logreg)

In [None]:
from sklearn.metrics import confusion_matrix
y_pred_test = logreg.predict(X_test)  # ML() keeps its predictions local, so recompute them here
cm = confusion_matrix(y_test, y_pred_test)
cm

In [None]:
print(classification_report(y_test, y_pred_test))

In [None]:
def AUC_curve(model,y_test = y_test, X_test = X_test):
    """
    This function shows the ROC curve and the AUC score of an
    already trained model
    
    """
    y_pred1 = model.predict_proba(X_test)[:,1]  # predicted probability of class 1 (heart disease)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred1)
    plt.figure(figsize = (6,4))
    plt.plot(fpr, tpr, linewidth=2)
    plt.plot([0,1], [0,1], '--')  # diagonal: performance of a random classifier
    plt.title('ROC curve for Heart Failure classifier')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.show()
    ROC_AUC = roc_auc_score(y_test,y_pred1)
    print(f'Area under this curve is {ROC_AUC}')
    
    
AUC_curve(logreg)
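
One way to interpret the area under the ROC curve: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following dependency-free sketch computes it that way on a tiny invented example.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, probabilities))  # 3 of 4 pairs ordered correctly: 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a model that ranks every positive case above every negative one.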

#### Tuning Logistic Regression

In [None]:
C_param_range = [0.001,0.01,0.1,1,10,100]

accuracies = []
for C in C_param_range:
    # Fit a logistic regression model with this regularization strength
    lr = LogisticRegression(C = C,random_state = 0)
    lr.fit(X_train,y_train)

    # Predict on the test set and record the accuracy
    y_pred = lr.predict(X_test)
    accuracies.append(accuracy_score(y_test,y_pred))
    print(f'Accuracy Score for C = {C} is {accuracies[-1]}')

# Collect the accuracy scores in a table
logreg_table = pd.DataFrame({'C_parameter': C_param_range, 'Accuracy': accuracies})
logreg_table


In [None]:
#The best accuracy score is C = 10 or 100
logreg2 = LogisticRegression(C = 10,random_state = 0)
ML(logreg2)
AUC_curve(logreg2)

#### MODEL TWO: Support Vector Classifier

In [None]:
svc_model= SVC(kernel = 'linear',probability = True)
ML(svc_model)

In [None]:
AUC_curve(svc_model)

#### Tuning the SVC model

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear', 'rbf','poly'), 'C':[1, 10]}
svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train)
print(f'Best parameters are: {clf.best_params_}')
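
Conceptually, a grid search tries every combination of the listed parameter values and keeps the best-scoring one. The sketch below shows that loop without any libraries. `toy_score` is a made-up stand-in for the cross-validated accuracy that `GridSearchCV` would compute, so its numbers mean nothing.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical scoring function, used only to exercise the loop."""
    bonus = {"linear": 0.00, "rbf": 0.02, "poly": -0.01}[params["kernel"]]
    return 0.85 + bonus - 0.001 * params["C"]

grid = {"kernel": ("linear", "rbf", "poly"), "C": [1, 10]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'kernel': 'rbf', 'C': 1}
```

The number of model fits grows with the product of the list lengths times the number of cross-validation folds, which is why large grids are slow to search.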


In [None]:
svc = SVC(C = 1,kernel = 'rbf',probability = True)
ML(svc)

In [None]:
AUC_curve(svc)

#### MODEL THREE: Random Forest Classifier

In [None]:
RFC = RandomForestClassifier()
ML(RFC)

In [None]:
AUC_curve(RFC)

#### Tuning Random Forest Classifier

In [None]:
n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10] 

hyperF = dict(n_estimators = n_estimators, max_depth = max_depth,  
              min_samples_split = min_samples_split, 
             min_samples_leaf = min_samples_leaf)

gridF = GridSearchCV(RFC, hyperF, cv = 3, verbose = 1, 
                      n_jobs = -1)
ML(gridF)

In [None]:
AUC_curve(gridF)

#### MODEL FOUR: Gradient Boosting Classifier Model

In [None]:
GBC = GradientBoostingClassifier()
ML(GBC)

In [None]:
AUC_curve(GBC)

#### Gradient Boosting Classifier Tuning

In [None]:
parameters = {
    "n_estimators":[5,50,250,500],
    "max_depth":[1,3,5,7,9],
    "learning_rate":[0.01,0.1,1,10,100]
}


In [None]:
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(GBC,parameters,cv=5)
ML(cv)  # fits the grid search; best_params_ only exists after fitting
print(f'Best parameters are: {cv.best_params_}')

In [None]:
AUC_curve(cv)

From the results obtained, the support vector classifier could be said to be the best model, as it had the fewest false negatives (persons with heart disease who were classified as healthy). This is important because heart disease is a critical condition that should be diagnosed early to reduce fatality. However, the SVC also had the most false positives compared to the random forest and gradient boosting classifiers, meaning that more money would be spent managing persons who do not have heart disease, which amounts to a waste of health resources.
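
The trade-off described above can be made explicit by attaching a cost to each kind of error. The sketch below is purely illustrative: the model names, error counts and costs are hypothetical and are not results from this notebook.

```python
def total_cost(fn, fp, cost_fn, cost_fp):
    """Total cost of a model's errors given a cost per error type."""
    return fn * cost_fn + fp * cost_fp

# Hypothetical error counts (NOT results from this notebook)
models = {
    "model_a": {"fn": 10, "fp": 30},  # misses few patients, raises many false alarms
    "model_b": {"fn": 18, "fp": 15},  # misses more patients, raises fewer false alarms
}

def cheapest(models, cost_fn, cost_fp):
    """Name of the model with the lowest total error cost."""
    return min(models, key=lambda name: total_cost(
        models[name]["fn"], models[name]["fp"], cost_fn, cost_fp))

print(cheapest(models, cost_fn=5, cost_fp=1))  # model_a (80 vs 105)
print(cheapest(models, cost_fn=1, cost_fp=1))  # model_b (40 vs 33)
```

Which model counts as best therefore depends on how much worse a missed diagnosis is judged to be than an unnecessary follow-up.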