The dataset is publicly available and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information. It includes over 4,000 records and 16 attributes.
#### Variables : <br>

##### Demographic: <br>
sex: male or female <br>
age: age of the patient <br>

##### Behavioural: <br>
currentSmoker: whether or not the patient is a current smoker <br>
cigsPerDay: the number of cigarettes that the person smoked on average in one day.<br>

##### Medical history:<br>
BPMeds: whether or not the patient was on blood pressure medication <br>
prevalentStroke: whether or not the patient had previously had a stroke <br>
prevalentHyp: whether or not the patient was hypertensive <br>
diabetes: whether or not the patient had diabetes <br>

##### Medical current:<br>
totChol: total cholesterol level <br>
sysBP: systolic blood pressure <br>
diaBP: diastolic blood pressure <br>
BMI: Body Mass Index <br>
heartRate: heart rate <br>
glucose: glucose level <br>

##### Predict variable (desired target):<br>
TenYearCHD: 10-year risk of coronary heart disease (CHD), binary: “1” means “Yes”, “0” means “No” <br>

#### Import dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
df= pd.read_csv(r'data.csv')
df.head()

In [None]:
df.drop(['education'],axis=1,inplace=True)

In [None]:
df.rename(columns={'male':'Gender_male'},inplace=True)

In [None]:
df

#### Missing values

In [None]:
df.isnull().sum()

In [None]:
# Count the rows that contain at least one missing value.
count=int(df.isnull().any(axis=1).sum())
print('Total number of rows with missing values is', count)
print('Since this is only',round((count/len(df.index))*100), 'percent of the entire dataset, the rows with missing values are excluded.')

In [None]:
df.dropna(axis=0,inplace=True)

#### Exploratory Analysis

In [None]:
def draw_histograms(dataframe, features, rows, cols):
    fig=plt.figure(figsize=(20,20))
    for i, feature in enumerate(features):
        ax=fig.add_subplot(rows,cols,i+1)
        dataframe[feature].hist(bins=20,ax=ax,facecolor='midnightblue')
        ax.set_title(feature+" Distribution",color='DarkRed')
        
    fig.tight_layout()  
    plt.show()

draw_histograms(df,df.columns,6,3)

In [None]:
df.TenYearCHD.value_counts()

There are 3179 patients with no heart disease and 572 patients with risk of heart disease.
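To put these counts in perspective, the short calculation below works out the share of the positive class and the accuracy that a trivial classifier would reach by always predicting "no". It uses only the two counts quoted above; the variable names are illustrative.

```python
# Class counts quoted in the text above.
n_no_chd = 3179
n_chd = 572
total = n_no_chd + n_chd

# Fraction of patients in the positive (at-risk) class.
positive_share = n_chd / total

# Accuracy of a trivial classifier that always predicts "no CHD".
majority_baseline_accuracy = n_no_chd / total

print(f"Positive class share: {positive_share:.1%}")
print(f"Always-'no' baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any accuracy figure reported later is best read against this baseline, since the classes are heavily imbalanced.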

In [None]:
sn.countplot(x='TenYearCHD',data=df)

In [None]:
df.describe()

#### Splitting data into train and test sets

In [None]:
new_features=df[['age','Gender_male','cigsPerDay','totChol','sysBP','glucose','TenYearCHD']]
x=new_features.iloc[:,:-1]
y=new_features.iloc[:,-1]

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=5)

In [None]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(x_train,y_train)

In [None]:
y_pred=logreg.predict(x_test)

#### Model Evaluation

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

True positives (TP): people who had heart disease and were also predicted to have heart disease, i.e. 5 <br>
True negatives (TN): people who did not have heart disease and were also predicted to not have heart disease, i.e. 652 <br>
False positives (FP): people who did not have heart disease but the prediction says they do (also known as a “Type I error”), i.e. 7 <br>
False negatives (FN): people who have heart disease but the prediction says they don’t (also known as a “Type II error”), i.e. 87 <br>

Sensitivity/Recall = TP/(TP + FN). When it’s actually yes, how often does it predict yes?<br>
Specificity = TN/(TN + FP). When it’s actually no, how often does it predict no?<br>
Precision = TP/(TP + FP). When it predicts yes, how often is it correct?<br>
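As a worked example, the formulas above can be applied directly to the four counts listed for this confusion matrix. This is a minimal sketch in plain Python; only the counts come from the text.

```python
# Counts taken from the confusion matrix described above.
TP, TN, FP, FN = 5, 652, 7, 87

sensitivity = TP / (TP + FN)   # share of actual positives that were caught
specificity = TN / (TN + FP)   # share of actual negatives that were cleared
precision = TP / (TP + FP)     # share of predicted positives that were right
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Accuracy:    {accuracy:.3f}")
```

The gap between the very low sensitivity and the high specificity shows that the default model almost always predicts "no".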

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

In [None]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1] 

In [None]:
print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) = ',(TP+TN)/float(TP+TN+FP+FN),'\n',

'The misclassification rate = 1-Accuracy = ',1-((TP+TN)/float(TP+TN+FP+FN)),'\n',

'Sensitivity or True Positive Rate = TP/(TP+FN) = ',TP/float(TP+FN),'\n',

'Specificity or True Negative Rate = TN/(TN+FP) = ',TN/float(TN+FP),'\n',

'Positive Predictive Value = TP/(TP+FP) = ',TP/float(TP+FP),'\n',

'Negative Predictive Value = TN/(TN+FN) = ',TN/float(TN+FN))

In [None]:
from sklearn.preprocessing import binarize

# Predicted class probabilities; column 1 is the probability of CHD.
y_pred_prob_yes=logreg.predict_proba(x_test)
for i in range(1,5):
    threshold=i/10
    # `threshold` is passed by keyword, as recent scikit-learn releases require.
    y_pred2=binarize(y_pred_prob_yes,threshold=threshold)[:,1]
    cm2=confusion_matrix(y_test,y_pred2)
    print ('With',threshold,'threshold the Confusion Matrix is ','\n',cm2,'\n',
            'with',cm2[0,0]+cm2[1,1],'correct predictions and',cm2[1,0],'Type II errors (False Negatives)','\n\n',
          'Sensitivity: ',cm2[1,1]/(float(cm2[1,1]+cm2[1,0])),'Specificity: ',cm2[0,0]/(float(cm2[0,0]+cm2[0,1])),'\n\n\n')

##### High Threshold: <br>
High specificity <br>
Low sensitivity <br>

##### Low Threshold <br>
Low specificity <br>
High sensitivity <br>
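The trade-off summarized above can be reproduced on a handful of made-up predictions. The sketch below is plain Python and independent of the fitted model; the `probs` and `labels` values are hypothetical and chosen only to make the effect visible.

```python
# Hypothetical predicted probabilities and true labels, for illustration only.
probs = [0.05, 0.12, 0.18, 0.22, 0.35, 0.41, 0.58, 0.73]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

def sensitivity_specificity(probs, labels, threshold):
    """Predict positive when the probability exceeds the threshold."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

results = {t: sensitivity_specificity(probs, labels, t) for t in (0.1, 0.3, 0.5)}
for t, (sens, spec) in results.items():
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

As the threshold rises, fewer cases are flagged as positive, so sensitivity falls while specificity rises.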

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_yes[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_prob_yes[:,1])

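The area under the ROC curve (AUC) summarizes how well the model separates the two classes across all thresholds: the higher the area, the more often a patient with CHD is scored above a patient without it. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.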
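One way to build intuition for this number is its ranking interpretation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes that quantity by brute force on hypothetical scores; `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores and labels, for illustration only.
scores = [0.05, 0.12, 0.18, 0.22, 0.35, 0.41, 0.58, 0.73]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

auc = pairwise_auc(scores, labels)
print(f"Pairwise AUC: {auc}")
```

On real data the value returned by `roc_auc_score` should agree with this pairwise count, although the brute-force loop is far slower.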

#### Model Selection

In contrast to GridSearchCV, RandomizedSearchCV does not try out all parameter values; instead, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
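To give a sense of scale, the sketch below enumerates the full grid for the candidate values used in this section and then draws 10 settings at random. It uses only the standard library; `random.sample` stands in for the sampling idea and is not how scikit-learn implements it internally.

```python
import itertools
import random

# Candidate values mirroring the search space used in this section.
parameters = {
    "tol": [1e-5, 1e-4, 1e-3],
    "class_weight": [None, "balanced"],
    "max_iter": [100, 150, 200, 250, 300],
    "solver": ["newton-cg", "lbfgs"],
    "C": list(range(1, 11)),
}

# An exhaustive grid search would evaluate every combination.
keys = list(parameters)
full_grid = [dict(zip(keys, combo))
             for combo in itertools.product(*parameters.values())]

# A randomized search evaluates only n_iter of them.
n_iter = 10
rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled = rng.sample(full_grid, k=n_iter)

print(f"Grid search would try {len(full_grid)} settings")
print(f"Randomized search tries {len(sampled)} settings")
```

With cross-validation each setting is fitted several times, so the randomized search cuts the number of model fits by the same factor.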

In [None]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {
    'tol' : [1e-5, 1e-4, 1e-3], # stopping tolerance
    'class_weight' : [None, 'balanced'], # 'balanced' reweights classes inversely to their frequency
    'max_iter' : [100,150,200,250,300], # maximum number of solver iterations
    'solver' : ['newton-cg', 'lbfgs'], # optimizer
    'C':[1,2,3,4,5,6,7,8,9,10], # inverse of regularization strength
}

In [None]:
model_rs = RandomizedSearchCV(logreg, parameters, n_iter=10, n_jobs=10)

In [None]:
model_rs.fit(x_train, y_train)

In [None]:
pd.DataFrame(model_rs.cv_results_).transpose()

In [None]:
y_pred=model_rs.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

#### Handling class imbalance

Update the training data by oversampling the minority class so that it has 70 percent of the number of examples in the majority class.

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from collections import Counter


over = SMOTE(sampling_strategy=0.7,random_state=2)
steps = [('o', over)] # oversampling
pipeline = Pipeline(steps=steps)
X_res, y_res = pipeline.fit_resample(x_train, y_train)


print('Original dataset shape %s' % Counter(y_train))
print('Resampled dataset shape %s' % Counter(y_res))
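The effect of `sampling_strategy=0.7` on the class counts can be checked with simple arithmetic. The sketch below assumes the minority target is the truncated product of the majority count and the ratio; the helper name and the example counts are hypothetical, and the exact rounding used by imbalanced-learn is an assumption here.

```python
def oversampled_minority_count(n_majority, n_minority, sampling_strategy):
    """Return (final minority count, number of synthetic samples added).

    Assumes the target is int(n_majority * sampling_strategy); this is a
    simplified model of the behaviour, not the library's own code.
    """
    n_to_generate = max(int(n_majority * sampling_strategy - n_minority), 0)
    return n_minority + n_to_generate, n_to_generate

# Hypothetical training-set counts, for illustration only.
final_minority, n_synthetic = oversampled_minority_count(
    n_majority=1000, n_minority=200, sampling_strategy=0.7
)
print(f"Synthetic samples added: {n_synthetic}")
print(f"Minority count after oversampling: {final_minority}")
```

The majority class is left untouched, so the `Counter` output after resampling should show the minority class at roughly 0.7 times the majority count.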

In [None]:
model_rs.fit(X_res, y_res)

In [None]:
y_pred=model_rs.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#### Conclusions:
The selected attributes are significant in heart disease prediction, as their p-values are lower than 5%. <br>
The trained model is more specific than sensitive. <br>
The area under the ROC curve is somewhat satisfactory. <br>
Overall, the model could be improved with more data and a more complex model. <br>