# Creating an Initial Logistic Regression Model

For our first lab, we will fit a logistic regression model to a heart disease dataset. The final column, 'target', indicates whether a patient has heart disease: 1 means heart disease is present and 0 means it is not.

Our goals are to:
* Define appropriate X and y
* Normalize the Data
* Split the data into train and test sets
* Fit a logistic regression model using scikit-learn

With that, let's have at it!

In [1]:
#Starter Code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd

In [28]:
#Starter Code
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


# Define appropriate X and y
Recall that the final column, 'target', indicates whether or not a patient has heart disease. With that in mind, define appropriate X and y to model whether or not a patient has heart disease.

In [29]:
from patsy import dmatrices

In [30]:
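# Sketch of how to define y and X with patsy's dmatrices (not used below).
# Column names must match the DataFrame exactly; C(...) marks a categorical column.
# y, X = dmatrices('target ~ age + C(sex) + chol',
#                   df, return_type="dataframe")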

# Normalize the Data
Normalize the data prior to fitting the model.

In [31]:
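# Min-max scale every column to [0, 1]; 'target' is already 0/1, so only its dtype changes to float
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())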

In [36]:
#Your code here 
X = df[df.columns[:-1]]
y = df.target
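The min-max scaling and the X/y definition above can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up and only stand in for a few columns of heart.csv). Note that a column with a single constant value would divide by zero and produce NaN, so this assumes every column varies.

```python
import pandas as pd

# Hypothetical toy data standing in for a few columns of heart.csv
toy = pd.DataFrame({
    "age": [29, 45, 77],
    "chol": [126, 240, 564],
    "target": [0, 1, 1],
})

# Min-max scaling: (x - min) / (max - min), applied to every column at once
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Features are every column except the label; the label is the 'target' column
X_toy = scaled.drop(columns="target")
y_toy = scaled["target"]

print(scaled)
```

Selecting the features with `drop(columns="target")` does not depend on 'target' being the last column, which is the assumption behind `df.columns[:-1]`.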

# Train Test Split
Split the data into train and test sets.

In [38]:
import numpy as np
from sklearn.model_selection import train_test_split
split = train_test_split(X,y)

In [40]:
X_train, X_test, y_train, y_test = split[0], split[1], split[2], split[3]
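`train_test_split` returns the four pieces in a fixed order, so they can be unpacked in a single statement. Below is a small sketch on made-up data; the `random_state=0` argument is an addition that is not in the lab code and is only there to make the split reproducible. By default 25% of the rows are held out for testing, which is consistent with the 227 training and 76 test observations reported later in this lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and labels (not the heart disease data)
X_demo = pd.DataFrame({"f1": np.arange(8), "f2": np.arange(8) * 2})
y_demo = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="target")

# Order of the returned pieces: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(len(X_tr), len(X_te))
```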

# Fit a model
Fit an initial model to the training set. In scikit-learn you do this by first creating an instance of the regression class. Then use the **fit** method of that instance to fit a model to the training data.

In [42]:
import statsmodels.api as sm
import sklearn.preprocessing as preprocessing
from sklearn.linear_model import LogisticRegression

In [44]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
logit_model = sm.Logit(y_train, X_train) #statsmodels Logit; no intercept is fit unless X includes a constant column
result = logit_model.fit()
#Your code here

Optimization terminated successfully.
         Current function value: 0.335019
         Iterations 8


In [46]:
result.summary()

Dep. Variable:,target,No. Observations:,227.0
Model:,Logit,Df Residuals:,214.0
Method:,MLE,Df Model:,12.0
Date:,"Mon, 20 Aug 2018",Pseudo R-squ.:,0.5124
Time:,14:10:07,Log-Likelihood:,-76.049
converged:,True,LL-Null:,-155.96
,,LLR p-value:,5.685e-28

,coef,std err,z,P>|z|,[0.025,0.975]
age,1.4003,1.205,1.162,0.245,-0.962,3.763
sex,-2.0804,0.591,-3.521,0.000,-3.239,-0.922
cp,2.7263,0.658,4.144,0.000,1.437,4.016
trestbps,-3.6377,1.315,-2.765,0.006,-6.216,-1.059
chol,-1.6247,2.065,-0.787,0.431,-5.672,2.423
fbs,0.1000,0.611,0.164,0.870,-1.098,1.298
restecg,1.6279,0.835,1.950,0.051,-0.008,3.264
thalach,6.1360,1.459,4.207,0.000,3.277,8.995
exang,-0.8414,0.480,-1.751,0.080,-1.783,0.100


In [47]:
np.exp(result.params)

age           4.056338
sex           0.124881
cp           15.276648
trestbps      0.026312
chol          0.196966
fbs           1.105117
restecg       5.093130
thalach     462.202067
exang         0.431122
oldpeak       0.021829
slope         3.097800
ca            0.059162
thal          0.072356
dtype: float64
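Exponentiating a logistic regression coefficient turns it from a change in log-odds into an odds ratio. The sketch below reproduces two of the values above from the rounded coefficients in the summary table. Because the features were min-max scaled, a one-unit change here means moving from a feature's minimum to its maximum; for a binary feature such as sex that is simply 0 versus 1.

```python
import numpy as np

# Rounded coefficients copied from the summary table above
coefs = {"sex": -2.0804, "cp": 2.7263}

# exp(coefficient) = multiplicative change in the odds of target == 1
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.4f}")
```

An odds ratio below 1 (sex, roughly 0.125) means the fitted odds of heart disease go down as the feature increases, holding the other features fixed; a ratio above 1 (cp, roughly 15.3) means they go up.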

# Alternate Way: scikit-learn

In [48]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
#Your code here
model_log = logreg.fit(X_train, y_train)
model_log

LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
          fit_intercept=False, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [49]:
model_log.coef_

array([[ 1.40013059, -2.08032445,  2.72637562, -3.63737214, -1.62356207,
         0.09987601,  1.62803827,  6.13566786, -0.84141006, -3.82430737,
         1.13062067, -2.82731298, -2.62646424]])

# Predict
Generate predictions for the train and test sets. Use the **predict** method from the logreg object.

In [51]:
y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)
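For a binary problem, **predict** returns hard 0/1 labels, while **predict_proba** exposes the underlying estimated probabilities. The sketch below uses synthetic data from `make_classification` (a stand-in, not the heart disease dataset) and a default `LogisticRegression` to illustrate that the predicted labels correspond to thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical; not the heart disease dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X_syn, y_syn)

# One row per sample, one column per class (ordered as in clf.classes_)
proba = clf.predict_proba(X_syn)
labels = clf.predict(X_syn)

# Thresholding the class-1 probability at 0.5 recovers the predicted labels
manual = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual))
```

Keeping the probabilities around is useful when a threshold other than 0.5 is more appropriate, for example when missing a true case of heart disease is costlier than a false alarm.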

# Initial Evaluation
How many times was the classifier correct for the training set?

In [56]:
residuals = y_train - y_hat_train #if value = 0 then the prediction was correct
# residuals
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True)) #gives the proportion of each value

 0.0    194
-1.0     22
 1.0     11
Name: target, dtype: int64
 0.0    0.854626
-1.0    0.096916
 1.0    0.048458
Name: target, dtype: float64


# How many times was the classifier correct for the test set?

In [57]:
residuals_test = y_test - y_hat_test #if value = 0 then the prediction was correct
# residuals
print(pd.Series(residuals_test).value_counts())
print(pd.Series(residuals_test).value_counts(normalize=True)) 

 0.0    65
-1.0     6
 1.0     5
Name: target, dtype: int64
 0.0    0.855263
-1.0    0.078947
 1.0    0.065789
Name: target, dtype: float64
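The residual counts above carry the same information as an accuracy score and a confusion matrix. The following sketch uses small hand-made label arrays (hypothetical values, not the lab's predictions) to show how the three residual values map onto scikit-learn's `accuracy_score` and `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0])

# Same residual logic as in the lab: actual minus predicted
residuals = y_true - y_pred
n_correct = int((residuals == 0).sum())     # 0  -> correct prediction
n_false_pos = int((residuals == -1).sum())  # -1 -> predicted 1, actual 0
n_false_neg = int((residuals == 1).sum())   # +1 -> predicted 0, actual 1

acc = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, labels sorted as [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(acc)
print(cm)
```

The proportion of zero residuals is the accuracy, and the -1 and +1 counts are the off-diagonal cells of the confusion matrix.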


In [60]:
y_train.head()

146    1.0
297    0.0
186    0.0
66     1.0
195    0.0
Name: target, dtype: float64

In [62]:
y_hat_train

array([1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1.,
       0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0.,
       1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 1., 1., 0.,
       1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0.,
       1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0.,
       1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1.,
       0., 1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1.,
       0., 1., 1., 1., 1.

# Analysis
Describe how well you think this initial model performs based on the train and test results. Within your description, note how you evaluated performance here compared to our previous work with regression.

In [64]:
#-1 shows false positive (we predicted 1 (heart disease) when there was no disease)
# 0 shows correct guess
#+1 shows false negative (we predicted 0 (no heart disease) when there was a disease)