# CARDIOVASCULAR DISEASE DETECTION
## FINAL CAPSTONE PROJECT - LOGISTIC ANALYSIS

### This dataset is obtained from kaggle.com.

#### Cardiovascular disease is a general term for disorders that affect the heart and blood vessels. It impacts over 10 million lives globally and is the leading cause of death both nationally and globally. Contributing factors include genetics, diet, lifestyle and a range of other health factors.

#### Using this dataset, we aim to build a model that predicts the presence or absence of cardiovascular disease in a subject from demographic and vital-sign factors.

### LIBRARIES AND DATA:

In [1]:
# IMPORT DATA LIBRARIES 
import numpy as np 
import pandas as pd 

# IMPORT VIS LIBRARIES 
import seaborn as sns 
import matplotlib.pyplot as plt 

# IMPORT MODELLING LIBRARIES 
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder 
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, precision_score, accuracy_score, recall_score, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_curve, make_scorer, roc_auc_score, roc_curve, auc
from sklearn.pipeline import Pipeline

In [2]:
cardiac_base=pd.read_csv('Data/cardio_train.csv', sep = ';')
cardiac_base.head(10)

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
5,8,21914,1,151,67.0,120,80,2,2,0,0,0,0
6,9,22113,1,157,93.0,130,80,3,1,0,0,1,0
7,12,22584,2,178,95.0,130,90,3,3,0,0,1,1
8,13,17668,1,158,71.0,110,70,1,1,0,0,1,0
9,14,19834,1,164,68.0,110,60,1,1,0,0,0,0


In [3]:
cardiac_base.set_index('id', inplace=True)

### DATA SUBSET:

#### On review of the dataset, many vital-sign values appeared implausibly low or implausibly high. With that in mind, a subset of the initial dataset was generated by restricting Height, Weight, Systolic Blood Pressure (ap_hi) and Diastolic Blood Pressure (ap_lo) to the ranges used in the filter below.

In [4]:
cardiac = cardiac_base.loc[(cardiac_base['height'] >= 140) & (cardiac_base['height'] <= 200) 
                           & (cardiac_base['weight'] >= 55) 
                           & (cardiac_base['ap_hi'] >= 90) & (cardiac_base['ap_hi'] <= 180) 
                           & (cardiac_base['ap_lo'] >= 60) & (cardiac_base['ap_lo'] <= 120),
                           ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']]
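
The same kind of range filter can also be written with `Series.between`, which is inclusive at both ends by default. The sketch below is a minimal illustration on a hypothetical four-row frame; `demo`, `mask`, `demo_subset` and the values are invented for the example and are not taken from the dataset. The trailing `.copy()` is an optional addition that makes the subset an independent DataFrame.

```python
import pandas as pd

# Hypothetical rows for illustration only; these values are not from the dataset.
demo = pd.DataFrame({
    'height': [168, 120, 175, 160],
    'weight': [62.0, 70.0, 40.0, 80.0],
    'ap_hi': [110, 120, 130, 240],
    'ap_lo': [80, 80, 85, 90],
})

# Same bounds as the filter above; between() includes both endpoints by default.
mask = (
    demo['height'].between(140, 200)
    & (demo['weight'] >= 55)
    & demo['ap_hi'].between(90, 180)
    & demo['ap_lo'].between(60, 120)
)

# .copy() returns an independent frame, so adding columns later
# does not touch (or warn about) the original data.
demo_subset = demo.loc[mask].copy()

print(f'Rows kept: {len(demo_subset)} of {len(demo)}')
```

Printing the number of rows kept is a cheap way to see how much data a filter discards before modelling.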

### TRANSFORMATION:

In [5]:
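# 'age' is in days; assign() returns a new frame and avoids SettingWithCopyWarning on the filtered subset
cardiac = cardiac.assign(age_yr=(cardiac['age'] / 365.25).round(2))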
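
As a quick check of the conversion, the sketch below applies the same arithmetic to the first three `age` values shown in the data preview above. The names `age_days` and `age_years` are illustrative only.

```python
import pandas as pd

# First three 'age' values (in days) from the preview of the raw data.
age_days = pd.Series([18393, 20228, 18857])

# Dividing by 365.25 approximates years while allowing for leap years.
age_years = (age_days / 365.25).round(2)

print(age_years.tolist())  # [50.36, 55.38, 51.63]
```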

### LOGISTIC REGRESSION:

#### DATASET:

In [6]:
cardiac_logreg=cardiac[['age_yr', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']]
cardiac_logreg

id,age_yr,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50.36,2,168,62.0,110,80,1,1,0,0,1,0
1,55.38,1,156,85.0,140,90,3,1,0,0,1,1
2,51.63,1,165,64.0,130,70,3,1,0,0,0,1
3,48.25,2,169,82.0,150,100,1,1,0,0,1,1
4,47.84,1,156,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
99993,52.68,2,168,76.0,120,80,1,1,1,0,1,0
99995,61.88,1,158,126.0,140,90,2,2,0,0,1,1
99996,52.20,2,183,105.0,180,90,3,1,0,1,0,1
99998,61.41,1,163,72.0,135,80,1,2,0,0,0,1


#### TRAIN/TEST SPLIT:

In [7]:
# Hold out a test set; train_test_split keeps 25% of the rows for testing when no size is given
X_train, X_test, y_train, y_test = train_test_split(cardiac_logreg.drop('cardio', axis=1), cardiac_logreg['cardio'],
                                                    random_state=42)
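
The split above relies on the defaults of `train_test_split`. As a sketch of one possible variation (an assumption, not part of the original analysis), passing `stratify` keeps the class proportions the same in the training and test sets. The data here is synthetic and the names `X_demo`, `y_demo`, `Xtr`, `Xte`, `ytr` and `yte` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, perfectly balanced binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0, 1] * 100)

# test_size=0.25 matches the default used when no size is given;
# stratify=y_demo preserves the 50/50 class balance in both parts.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(Xtr.shape, Xte.shape)
print(f'Positive rate, train: {ytr.mean():.2f}, test: {yte.mean():.2f}')
```

Stratification matters most when the classes are imbalanced; with a roughly balanced target the default random split usually gives similar proportions anyway.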

#### MODEL:

In [8]:
# Selects object-dtype (string) columns; every column in this dataset is numeric, so none match
selector = make_column_selector(dtype_include=object)

In [9]:
transformer = make_column_transformer((OneHotEncoder(drop = 'first'), selector),
                                     remainder = StandardScaler())
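
Because the selector keys on dtype, it is worth checking which columns it actually picks up. The sketch below uses a hypothetical two-row frame (`demo`, `demo_selector` and `demo_cat` are illustrative names) to show that integer-coded categories are not matched by `dtype_include=object`. Casting them to `object` is one possible way to route them to the `OneHotEncoder`; this is an assumption for illustration, not what the notebook does.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame with the same kinds of columns as the dataset.
demo = pd.DataFrame({
    'age_yr': [50.36, 55.38],
    'cholesterol': [1, 3],
    'gluc': [1, 2],
})

demo_selector = make_column_selector(dtype_include=object)

# Float and integer columns are not object dtype, so nothing is selected.
print(demo_selector(demo))

# Illustrative alternative: cast the coded categories to object dtype
# so that the selector (and hence the encoder) would pick them up.
demo_cat = demo.astype({'cholesterol': object, 'gluc': object})
print(demo_selector(demo_cat))
```

With an empty selection, every feature falls through to the `remainder` step and is standardized as if it were continuous.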

In [10]:
extractor = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', random_state=42))
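
To illustrate what an L1-penalized `SelectFromModel` does, the sketch below fits one on synthetic data in which only the first column carries signal. The data, the names `X_demo`, `y_demo` and `demo_extractor`, and the choice `C=0.05` are assumptions made for the example; the notebook itself uses the default `C`.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: column 0 drives the target, columns 1-3 are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# A small C means a strong L1 penalty, which pushes weak coefficients to exactly zero.
demo_extractor = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.05, random_state=42)
)
demo_extractor.fit(X_demo, y_demo)

# Boolean mask of the columns whose coefficient survived the penalty.
support = demo_extractor.get_support()
X_selected = demo_extractor.transform(X_demo)

print(support)
print(X_selected.shape)
```

`get_support()` returns the mask directly, which can be a simpler way to map selected columns back to their names than parsing generated feature names.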

In [11]:
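# Pipeline: encode/scale -> L1-based feature selection -> final logistic regression
lgr_pipe = Pipeline([('transformer', transformer),
                     ('selector', extractor),
                     ('lgr', LogisticRegression(random_state=42, max_iter=1000))])

lgr_pipe.fit(X_train, y_train)

# Accuracy on the held-out test set
pipe_1_acc = lgr_pipe.score(X_test, y_test)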

In [12]:
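# Column names produced by the transformer, e.g. 'remainder__age_yr'
feature_names = lgr_pipe.named_steps['transformer'].get_feature_names_out()
# The selector reports kept columns as 'x<position>'; strip the 'x' to index into feature_names
selected_features = feature_names[[int(i[1:]) for i in lgr_pipe.named_steps['selector'].get_feature_names_out()]]
# Drop the transformer prefix so only the original column name remains
clean_names = [i.split('__')[-1] for i in selected_features]
coef_df = pd.DataFrame({'feature': clean_names, 'coefs': lgr_pipe.named_steps['lgr'].coef_[0]})
# Rank by magnitude; taking the absolute value discards the sign (direction) of each effect
coef_df['coefs'] = coef_df['coefs'].abs()
coef_df = coef_df.sort_values(by='coefs', ascending=False)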

coef_df.head(20)

,feature,coefs
4,ap_hi,0.907423
6,cholesterol,0.34333
0,age_yr,0.331627
3,weight,0.13049
5,ap_lo,0.11512
10,active,0.098708
7,gluc,0.079301
9,alco,0.040723
8,smoke,0.037773
2,height,0.036717
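
The table above ranks features by absolute coefficient, so the direction of each effect is not visible. The sketch below shows one way to keep the sign and add an odds ratio, `exp(coef)`, per one-standard-deviation change of a scaled feature. It runs on synthetic data; `feat_up`, `feat_down`, `demo_pipe` and `demo_coef_df` are hypothetical names, and the sketch is not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature raises the odds of the positive class, one lowers them.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({'feat_up': rng.normal(size=400),
                       'feat_down': rng.normal(size=400)})
logit = 1.5 * demo_X['feat_up'] - 1.0 * demo_X['feat_down']
demo_y = (rng.random(400) < 1 / (1 + np.exp(-logit))).astype(int)

demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('lgr', LogisticRegression())]).fit(demo_X, demo_y)

coefs = demo_pipe.named_steps['lgr'].coef_[0]
demo_coef_df = pd.DataFrame({
    'feature': demo_X.columns,
    'coef': coefs,                 # signed: positive raises the odds, negative lowers them
    'odds_ratio': np.exp(coefs),   # multiplicative change in odds per 1 SD of the feature
})

# Sort by magnitude while leaving the stored coefficients signed.
demo_coef_df = demo_coef_df.sort_values('coef', key=np.abs, ascending=False)
print(demo_coef_df)
```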


#### PRECISION, ACCURACY, RECALL, ROC_AUC, TRAINING ACCURACY AND TEST ACCURACY:

In [13]:
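# Hard class predictions (default 0.5 threshold) for accuracy, precision and recall
preds = lgr_pipe.predict(X_test)
# Predicted probability of the positive class (cardio = 1), needed for ROC AUC
y_prob = lgr_pipe.predict_proba(X_test)[:, 1]
accuracy_log = accuracy_score(y_test, preds)
precision_log = precision_score(y_test, preds)
recall_log = recall_score(y_test, preds)
roc_auc_log = roc_auc_score(y_test, y_prob)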

# PRINT METRICS
print(f'Accuracy: {accuracy_log: .2f}\nPrecision: {precision_log: .2f}\nRecall: {recall_log: .2f}\nRoc_AUC: {roc_auc_log: .2f}')

Accuracy:  0.73
Precision:  0.76
Recall:  0.67
Roc_AUC:  0.79
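
Accuracy, precision and recall all derive from the four cells of the confusion matrix. The sketch below works them out by hand on ten hypothetical labels (`y_true` and `y_pred` are invented for the example) and checks the results against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that are found

print(f'Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}')
# Accuracy: 0.80, Precision: 0.75, Recall: 0.75

# The hand-computed values agree with the library functions.
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```

In a disease-detection setting, recall is the share of true cases the model catches, so a recall below precision means missed cases are more common than false alarms among the model's errors.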


In [14]:
train_acc = lgr_pipe.score(X_train, y_train)
test_acc = lgr_pipe.score(X_test, y_test)
print(f'Training Accuracy: {train_acc: .2f}')
print(f'Test Accuracy: {test_acc: .2f}')

Training Accuracy:  0.72
Test Accuracy:  0.73
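
A single train/test split gives one estimate of accuracy. As a sketch of a complementary check (an assumption, not part of the original analysis), `cross_val_score`, which is imported at the top of the notebook, can report accuracy across several folds. The data below is synthetic, and `X_demo`, `y_demo`, `demo_pipe` and `scores` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two of five features, plus noise.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.normal(size=300) > 0).astype(int)

demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('lgr', LogisticRegression(max_iter=1000))])

# Five-fold cross-validation; the whole pipeline is refit on each training fold,
# so the scaler never sees the fold it is evaluated on.
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print(f'Fold accuracies: {np.round(scores, 2)}')
print(f'Mean: {scores.mean():.2f}, Std: {scores.std():.2f}')
```

Training and test accuracy that sit close together, as in the output above, suggest little overfitting; the spread across folds adds a rough sense of how stable that estimate is.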
