
## Logistic Regression

In [14]:
# Standard Headers
# You are welcome to add additional headers here if you wish
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

!pip install scikit-learn
from sklearn.model_selection import train_test_split

# Enable inline mode for matplotlib so that Jupyter displays graphs
%matplotlib inline



## Heart Dataset 

In this assignment we will work with a dataset of heart patients.

The dataset contains records for 303 patients. The features are listed below.

In [22]:

heart_df = pd.read_csv("Heart.csv")
heart_df

Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,Target
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable,Yes
299,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable,Yes
300,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable,Yes
301,57,0,nontypical,130,236,0,2,174,0,0.0,2,1.0,normal,Yes


**Age:** The person’s age in years

**Sex:** The person’s sex (1 = male, 0 = female)

**ChestPain:** chest pain type

* Value 0: asymptomatic
* Value 1: atypical angina
* Value 2: non-anginal pain
* Value 3: typical angina

**RestBP:** The person’s resting blood pressure (mm Hg on admission to the hospital)

**Chol:** The person’s cholesterol measurement in mg/dl

**Fbs:** The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

**RestECG:** Resting electrocardiographic results

* Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
* Value 1: normal
* Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

**MaxHR:** The person’s maximum heart rate achieved

**ExAng:** Exercise induced angina (1 = yes; 0 = no)

**Oldpeak:** ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)

**Slope:** The slope of the peak exercise ST segment

* 0: downsloping
* 1: flat
* 2: upsloping

**Ca:** The number of major vessels (0–3)

**Thal:** A blood disorder called thalassemia

* Value 0: NULL (dropped from the dataset previously)
* Value 1: fixed defect (no blood flow in some part of the heart)
* Value 2: normal blood flow
* Value 3: reversible defect (a blood flow is observed but it is not normal)

**Target:** Heart disease (recorded in this file as "Yes" = heart disease, "No" = no heart disease)


Use logistic regression to predict whether a patient has heart disease. The "Target" column records the outcome: "Yes" if the patient had heart disease and "No" if not.

Prepare a dataset for predicting heart disease (the "Target" column) from three features:

* Age of the patient (column **"Age"**)
* Sex of the patient (male or female, column **"Sex"**)
* Cholesterol level of the patient (column **"Chol"**)


In [69]:
#import train test split
from sklearn.model_selection import train_test_split

#select the columns used as explanatory variables
X_val = heart_df[['Age', 'Sex', 'Chol']]

#select the response variable, kept as a one-column DataFrame
Y_val = heart_df[['Target']]

#split the data into training (80%) and test (20%) sets; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X_val, Y_val, test_size = 0.2, random_state = 0)

#display the response column
Y_val


Unnamed: 0,Target
0,No
1,Yes
2,Yes
3,No
4,No
...,...
298,Yes
299,Yes
300,Yes
301,Yes
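As a rough check on the split, the set sizes follow from the row count and `test_size`. The sketch below assumes the 303 rows stated earlier and that `train_test_split` rounds a fractional test size up, which is how scikit-learn behaves as far as I know; the variable names are illustrative only.

```python
import math

# Assumption: 303 rows, as stated in the dataset description.
n_samples = 303
test_size = 0.2

# With a float test_size, the test count is rounded up and the
# remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

Under these assumptions the test set holds 61 patients, which is the total `support` reported in the classification report further down.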



Generate logistic regression model using your training data. 


In [67]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#create the logistic regression model
logistic_reg = LogisticRegression()

#fit the model to the training data; ravel() flattens the one-column target to 1-D
logistic_reg.fit(x_train, y_train.values.ravel())
print('Coefficients:' ,logistic_reg.coef_)
print('Intercept:',logistic_reg.intercept_)

#predict the target for the test data
y_predict = logistic_reg.predict(x_test)

#display the accuracy of the model on the test data
model_acc = accuracy_score(y_test,y_predict)
print('The model is ', model_acc*100, '% accurate', sep = '')

Coefficients: [[0.05959126 1.11779218 0.0054548 ]]
Intercept: [-5.50459844]
The model is 75.40983606557377% accurate
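To see what the coefficients and intercept mean, the fitted model can be evaluated by hand: it passes a weighted sum of the features through the logistic (sigmoid) function. The sketch below copies the printed values and uses the feature values from the first row of the table (Age 63, Sex 1, Chol 233); `predict_probability` is a hypothetical helper, not part of scikit-learn.

```python
import math

# Coefficients (Age, Sex, Chol) and intercept copied from the output above.
coefficients = [0.05959126, 1.11779218, 0.0054548]
intercept = -5.50459844

def predict_probability(features):
    """Return the modelled probability of the "Yes" class."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Feature values from the first row of the table: Age 63, Sex 1, Chol 233.
patient = [63, 1, 233]
probability = predict_probability(patient)
print(f"P(Target = Yes) = {probability:.3f}")

# exp(coefficient) is the multiplicative change in the odds of "Yes"
# for a one-unit increase in that feature, holding the others fixed.
odds_ratio_sex = math.exp(coefficients[1])
print(f"Odds ratio for Sex: {odds_ratio_sex:.2f}")
```

A probability above 0.5 corresponds to a "Yes" prediction from `predict`. Because the features are on different scales, the size of each coefficient should be read together with the range of its feature: Chol varies over hundreds of units while Sex is 0 or 1.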




Generate the classification report for the logistic regression model, and interpret the results.


In [62]:

from sklearn.metrics import accuracy_score, classification_report

#generate a classification report comparing the predictions with the true test labels
class_report = classification_report(y_test,y_predict)
print(class_report)

#Interpret Results (for the "Yes" class)

#Precision: Of the patients the model classifies as having heart disease, approximately 69% actually have it.

#Recall: Of the patients who actually have heart disease, the model correctly identifies approximately 77%.

#F1-Score: The F1 score of 0.73 is the harmonic mean of precision and recall, which weighs the two equally. It indicates a moderately good balance between them.

              precision    recall  f1-score   support

          No       0.81      0.74      0.78        35
         Yes       0.69      0.77      0.73        26

    accuracy                           0.75        61
   macro avg       0.75      0.76      0.75        61
weighted avg       0.76      0.75      0.76        61
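The "Yes" row of the report can be reproduced by hand from confusion-matrix counts. The counts below are not model output; they are inferred from the rounded figures and support values in the report, so treat them as an assumption that happens to be consistent with it.

```python
# Counts inferred from the report above (an assumption, not model output):
# 26 test patients are "Yes", 20 of them predicted "Yes";
# 35 test patients are "No", 26 of them predicted "No".
true_pos, false_neg = 20, 6
true_neg, false_pos = 26, 9

# Precision: share of predicted "Yes" that are truly "Yes".
precision = true_pos / (true_pos + false_pos)
# Recall: share of true "Yes" that the model found.
recall = true_pos / (true_pos + false_neg)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_pos + true_neg) / (true_pos + false_neg + true_neg + false_pos)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

With these counts the script prints `precision=0.69 recall=0.77 f1=0.73 accuracy=0.75`, matching the "Yes" row and the overall accuracy.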



Build other models to improve predictions

In [74]:
#Another feature I can use to try to improve the prediction is RestECG, added as a fourth explanatory variable.


#RestECG model

#select the columns used as explanatory variables, now including RestECG
X_val = heart_df[['Age', 'Sex', 'Chol', 'RestECG']]

#select the response variable, kept as a one-column DataFrame
Y_val = heart_df[['Target']]

#split with the same test_size and random_state as before so both models are tested on the same rows
x_train_2,x_test_2, y_train_2, y_test_2 = train_test_split(X_val, Y_val, test_size = 0.2, random_state = 0)

#create the logistic regression model
logistic_reg_2 = LogisticRegression()

#fit the model to the training data
logistic_reg_2.fit(x_train_2, y_train_2.values.ravel())
print('Coefficients:' ,logistic_reg_2.coef_)
print('Intercept:',logistic_reg_2.intercept_)

#predict the target for the test data
y_predict_2 = logistic_reg_2.predict(x_test_2)

#display the accuracy of the model on the test data
model_acc_2 = accuracy_score(y_test_2,y_predict_2)
print('The model is ', model_acc_2*100, '% accurate', sep = '')

#classification report for the second model (uses the second model's test labels and predictions)
class_report_2 = classification_report(y_test_2,y_predict_2)
print(class_report_2)

Coefficients: [[0.05785349 1.11636138 0.00477994 0.21043382]]
Intercept: [-5.44831671]
The model is 78.68852459016394% accurate
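With a test set this small, it can help to translate the two accuracies into numbers of patients before concluding that the extra feature helps. The sketch below uses the two accuracies printed by the models and assumes the 61-patient test set shown as `support` in the first report; the variable names are illustrative.

```python
# Accuracies printed by the two models; 61 is the test-set size (support).
n_test = 61
acc_three_features = 0.7540983606557377
acc_with_restecg = 0.7868852459016394

# Convert each accuracy back into a count of correctly classified patients.
correct_before = round(acc_three_features * n_test)
correct_after = round(acc_with_restecg * n_test)

print(f"correct predictions: {correct_before} -> {correct_after} of {n_test}")
print(f"difference: {correct_after - correct_before} patients")
```

The improvement amounts to two additional correct predictions out of 61, so a different `random_state` could plausibly change the ranking of the two models.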

