# Creating an Initial Logistic Regression Model

For our first lab, we will fit a logistic regression model to a heart disease dataset. The final column, 'target', indicates whether a patient has heart disease: 1 means heart disease is present and 0 means it is not.

Our goals are to:
* Define appropriate X and y
* Normalize the Data
* Split the data into train and test sets
* Fit a logistic regression model using scikit-learn

With that, let's have at it!

In [44]:
#Starter Code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

In [3]:
#Starter Code
df = pd.read_csv('heart.csv')
df.head()

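| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |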


# Define appropriate X and y
Recall that the final column, 'target', records whether or not a patient has heart disease. With that in mind, define an appropriate X (the features) and y (the label) to model whether or not a patient has heart disease.

In [8]:
#Your code here
X = df[df.columns[0:-1]] #features: every column except the last
y = df.target #label: the 'target' column
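
Selecting the label by name rather than by position makes the feature/label split explicit and keeps working if the column order changes. The sketch below is only an illustration on a made-up three-row frame (`toy`, `X_toy`, and `y_toy` are hypothetical names; the values are copied from the preview above).

```python
import pandas as pd

# Hypothetical three-row frame; values copied from the preview above for illustration
toy = pd.DataFrame({
    'age': [63, 37, 41],
    'chol': [233, 250, 204],
    'target': [1, 1, 1],
})

# Features: every column except the label
X_toy = toy.drop(columns='target')
# Label: the column we want to predict
y_toy = toy['target']

print(X_toy.columns.tolist())  # ['age', 'chol']
print(y_toy.tolist())          # [1, 1, 1]
```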

# Normalize the Data
Normalize the data prior to fitting the model.

In [13]:
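#Min-max normalization: subtract each column's min, then divide by its range (max - min)
#This rescales every column of df, including 'target', to lie between 0 and 1
#Note: X was selected from df in an earlier cell, so it is a separate copy and keeps the raw values;
#re-select X from df after this cell if the model should be fit on the scaled features
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
df.head()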

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.708333 | 1.0 | 1.0 | 0.481132 | 0.244292 | 1.0 | 0.0 | 0.603053 | 0.0 | 0.370968 | 0.0 | 0.0 | 0.333333 | 1.0 |
| 1 | 0.166667 | 1.0 | 0.666667 | 0.339623 | 0.283105 | 0.0 | 0.5 | 0.885496 | 0.0 | 0.564516 | 0.0 | 0.0 | 0.666667 | 1.0 |
| 2 | 0.25 | 0.0 | 0.333333 | 0.339623 | 0.178082 | 0.0 | 0.0 | 0.770992 | 0.0 | 0.225806 | 1.0 | 0.0 | 0.666667 | 1.0 |
| 3 | 0.5625 | 1.0 | 0.333333 | 0.245283 | 0.251142 | 0.0 | 0.5 | 0.816794 | 0.0 | 0.129032 | 1.0 | 0.0 | 0.666667 | 1.0 |
| 4 | 0.583333 | 0.0 | 0.0 | 0.245283 | 0.520548 | 0.0 | 0.5 | 0.70229 | 1.0 | 0.096774 | 1.0 | 0.0 | 0.666667 | 1.0 |
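
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The following is a minimal vectorized sketch on a made-up frame that scales only the features and leaves the label alone; `toy`, `features`, and `scaled` are illustrative names, not part of the lab. It also shows that a frame selected before scaling keeps its raw values.

```python
import pandas as pd

# Hypothetical five-row frame; values copied from the preview above for illustration
toy = pd.DataFrame({
    'age': [63, 37, 41, 56, 57],
    'chol': [233, 250, 204, 236, 354],
    'target': [1, 1, 1, 1, 1],
})

features = toy.drop(columns='target')

# Column-wise min-max scaling; the result is a new frame
scaled = (features - features.min()) / (features.max() - features.min())

print(scaled.min().tolist())   # [0.0, 0.0]
print(scaled.max().tolist())   # [1.0, 1.0]
print(features['age'].max())   # 63 (the original selection is unchanged)
```

scikit-learn's `MinMaxScaler` applies the same formula. A common practice is to fit the scaler on the training set only and then apply it to the test set, so that no information from the test data leaks into the preprocessing.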


# Train Test Split
Split the data into train and test sets.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
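
When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing, which is consistent with the 227 training and 76 test predictions counted later in this lab. A small sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, alternating 0/1 labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)
```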

# Fit a model
Fit an initial model to the training set. In scikit-learn, you do this by first creating an instance of the `LogisticRegression` class. Then use the **fit** method of that instance to fit a model to the training data.

In [17]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
model_log = logreg.fit(X_train, y_train)
model_log

LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
          fit_intercept=False, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
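
In scikit-learn, `C` is the inverse of the regularization strength, so a very large value such as `1e12` makes the L2 penalty negligible, and `fit_intercept=False` fits a model with no intercept term. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made purely for illustration; it is not the heart dataset), with the solver set explicitly to match the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

# Same settings as the starter code, with the solver named explicitly
clf = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
clf.fit(X_toy, y_toy)

print(clf.coef_.shape)   # (1, 4): one coefficient per feature
print(clf.intercept_)    # all zeros, because no intercept was fit
```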

# Predict
Generate predictions for the train and test sets. Use the **predict** method from the logreg object.

In [20]:
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
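
For a binary logistic regression, **predict** returns class labels, while `predict_proba` returns the estimated probability of each class; a prediction of 1 corresponds to an estimated probability above 0.5 for class 1. A short sketch on synthetic data (not the heart dataset; all names here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)

labels = clf.predict(X_toy)        # hard 0/1 predictions
proba = clf.predict_proba(X_toy)   # columns follow clf.classes_, here [0, 1]

# Thresholding the class-1 probability at 0.5 reproduces the labels
manual = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual))
```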

# Initial Evaluation
How many times was the classifier correct for the training set?

In [39]:
#calculate residuals (actual - predicted); 0 means the classifier was correct
#1 or -1 means the classifier was incorrect in one direction or the other
residuals = y_train - y_hat_train
print(pd.Series(residuals).value_counts())
#accuracy = share of residuals equal to 0
print('\nAccuracy: {}%'.format(round((residuals == 0).mean()*100,2)))

 0    194
-1     21
 1     12
Name: target, dtype: int64

Accuracy: 85.46%
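
The same figure can be obtained with `accuracy_score` from `sklearn.metrics`. The sketch below rebuilds label vectors that reproduce the training counts above (194 correct, 21 with residual -1, 12 with residual 1); the class assigned to the 194 correct rows is an arbitrary placeholder, since accuracy only depends on whether each prediction matches.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Placeholder vectors matching the counts above:
# 194 correct, 21 with actual 0 / predicted 1, 12 with actual 1 / predicted 0
y_true = np.array([1] * 194 + [0] * 21 + [1] * 12)
y_pred = np.array([1] * 194 + [1] * 21 + [0] * 12)

acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()   # same quantity computed by hand

print(round(acc * 100, 2))     # 85.46
print(round(manual * 100, 2))  # 85.46
```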


In [46]:
## scikit-learn puts actual classes in rows and predicted classes in columns (labels 0, 1)
## top left = true negative
## top right = false positive
## bottom left = false negative
## bottom right = true positive

confusion_matrix(y_train,y_hat_train)

array([[ 84,  21],
       [ 12, 110]])
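
With labels 0 and 1, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes), so the four counts can be unpacked with `ravel()`. A brief sketch using the training matrix shown above:

```python
import numpy as np

# Training confusion matrix copied from the output above
cm = np.array([[84, 21],
               [12, 110]])

# Row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(tn, fp, fn, tp)             # 84 21 12 110
print(round(accuracy * 100, 2))   # 85.46
```

The 21 false positives and 12 false negatives match the counts of -1 and 1 residuals above.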

# How many times was the classifier correct for the test set?

In [41]:
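#residuals (actual - predicted) for the test set; 0 means the classifier was correct
test_residuals = y_test - y_hat_test
print(pd.Series(test_residuals).value_counts())
#accuracy = share of residuals equal to 0
print('\nAccuracy: {}%'.format(round((test_residuals == 0).mean()*100,2)))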

 0    63
-1     9
 1     4
Name: target, dtype: int64

Accuracy: 82.89%


In [47]:
## scikit-learn puts actual classes in rows and predicted classes in columns (labels 0, 1)
## top left = true negative
## top right = false positive
## bottom left = false negative
## bottom right = true positive
confusion_matrix(y_test,y_hat_test)

array([[24,  9],
       [ 4, 39]])

# Analysis
Describe how well you think this initial model performs based on the train and test results. In your description, note how you evaluated performance compared to our previous work with regression.

**Answers will vary. Students should generally be grappling with the notion of True Positives, False Positives, True Negatives, and False Negatives that will be formalized in later sections. Hopefully they will be able to at least check the number of correct/incorrect predictions at this point.**

* Initial results seem pretty solid: accuracy is above 80% on both sets (85.46% train, 82.89% test). The training set scores slightly higher than the test set, which is expected because the model was fit on the training data.

* 1 represents a false negative (i.e. our model predicted no heart disease when they in fact had heart disease)
* -1 represents a false positive (i.e. our model predicted the patient has heart disease when they in fact do not have heart disease)
    * In the medical field, it is often better for errors to be false positives than false negatives. For example, it is often preferable to treat someone for a disease they do not have than to leave untreated a patient who does have the disease.
    * Therefore, it is encouraging that our model's errors are more often false positives than false negatives in both the training set (21 vs. 12) and the test set (9 vs. 4).
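
The balance between the two error types can be checked directly from the confusion matrices reported earlier. The sketch below computes, for each set, the share of errors that are false positives; `fp_share` is an illustrative name rather than a standard metric.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: actual, columns: predicted)
matrices = {
    'train': np.array([[84, 21], [12, 110]]),
    'test': np.array([[24, 9], [4, 39]]),
}

fp_share = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    # Fraction of all errors that are false positives
    fp_share[name] = round(float(fp / (fp + fn)), 3)

print(fp_share)  # {'train': 0.636, 'test': 0.692}
```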
    