# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you will fit a logistic regression model to a heart disease dataset. Whether or not a patient has heart disease is indicated in the final column, labeled 'target': 1 indicates heart disease and 0 indicates no heart disease.

## Objectives
You will be able to:

- Implement logistic regression in scikit-learn
- Form conclusions about the performance of a model


## Let's get started!

In [1]:
#Starter Code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
#Starter Code
df = pd.read_csv('heart.csv')
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

## Define appropriate X and y
Recall that whether or not a patient has heart disease is indicated in the final column, labeled 'target'. With that in mind, define appropriate X and y to model whether or not a patient has heart disease.

In [4]:
#Your code here 
X = df.drop('target', axis=1)
y = df['target']

## Normalize the Data
Normalize the data prior to fitting the model.

In [5]:
#Your code here
# Min-max scale each feature to the [0, 1] range
for col in X.columns:
    X[col] = (X[col] - X[col].min()) / (X[col].max() - X[col].min())

In [6]:
X.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,0.528465,0.683168,0.322332,0.354941,0.274575,0.148515,0.264026,0.600358,0.326733,0.167678,0.69967,0.182343,0.771177
std,0.18921,0.466011,0.344017,0.165454,0.118335,0.356198,0.26293,0.174849,0.469794,0.18727,0.308113,0.255652,0.204092
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.385417,0.0,0.0,0.245283,0.194064,0.0,0.0,0.477099,0.0,0.0,0.5,0.0,0.666667
50%,0.541667,1.0,0.333333,0.339623,0.260274,0.0,0.5,0.625954,0.0,0.129032,0.5,0.0,0.666667
75%,0.666667,1.0,0.666667,0.433962,0.339041,0.0,0.5,0.725191,1.0,0.258065,1.0,0.25,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
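
The manual min-max formula above can also be expressed with scikit-learn's `MinMaxScaler`. The following is a minimal sketch on a small made-up DataFrame (`X_demo` and its values are hypothetical stand-ins, not the heart disease data) that checks the two approaches agree:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the feature matrix (not the real heart.csv data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=50),
    "chol": rng.integers(126, 565, size=50),
    "oldpeak": rng.uniform(0.0, 6.2, size=50),
})

# Manual min-max scaling, column by column, as in the lab
manual = (X_demo - X_demo.min()) / (X_demo.max() - X_demo.min())

# The same transformation using scikit-learn
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(X_demo), columns=X_demo.columns)

# True when both approaches produce the same values
print(np.allclose(manual.to_numpy(), scaled.to_numpy()))
```

One practical difference is that a fitted scaler object remembers the minimum and maximum it saw, so the same transformation can later be applied to new data.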


## Train Test Split
Split the data into train and test sets.

In [7]:
#Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
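
With no `test_size` argument, `train_test_split` holds out 25% of the rows, so 303 rows become 227 training rows and 76 test rows. Because the lab normalizes before splitting, the test rows contribute to the minimum and maximum used for scaling. A common alternative, sketched below on random placeholder data (`X_demo` and `y_demo` are hypothetical), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random placeholder data with the same number of rows as the lab's dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 4))
y_demo = rng.integers(0, 2, size=303)

# Default split: 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (227, 4) (76, 4)

# Learn the min and max from the training rows only, then reuse them
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # values may fall slightly outside [0, 1]
```

This is one possible ordering, not the one the lab uses; the results shown in this lab come from normalizing the full dataset first.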

## Fit a model
Fit an initial model to the training set. In scikit-learn, you do this by first creating an instance of the LogisticRegression class, then calling that instance's **fit** method on the training data.

In [8]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code

# Your code here
model_log = logreg.fit(X_train, y_train)
model_log



LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
                   fit_intercept=False, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
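
To illustrate what the fitted object represents, the sketch below fits the same kind of model on synthetic data (the data and the names `X_demo`, `y_demo`, and `clf` are made up for illustration) and reproduces its predicted probabilities by hand. With `fit_intercept=False`, the probability of class 1 is the sigmoid of the dot product between the features and the learned coefficients. A very large `C` means very weak regularization. `max_iter=1000` is an added assumption to give the solver room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
noisy_score = X_demo @ np.array([2.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=200)
y_demo = (noisy_score > 0).astype(int)

clf = LogisticRegression(fit_intercept=False, C=1e12, max_iter=1000)
clf.fit(X_demo, y_demo)

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# No intercept, so the linear score is just X times the coefficient vector
manual_proba = sigmoid(X_demo @ clf.coef_.ravel())
sk_proba = clf.predict_proba(X_demo)[:, 1]

# True when the manual calculation matches scikit-learn
print(np.allclose(manual_proba, sk_proba))
```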

## Predict
Generate predictions for the train and test sets. Use the **predict** method from the logreg object.

In [9]:
# Your code here
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
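
For a binary problem, **predict** returns class 1 whenever the model's linear score is positive, which corresponds to a predicted probability above 0.5. The sketch below checks this on synthetic data and also shows how a different cutoff could be applied to the probabilities. The data, the variable names, and the 0.3 cutoff are all arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 3))
noisy_score = X_demo @ np.array([2.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=200)
y_demo = (noisy_score > 0).astype(int)

clf = LogisticRegression(fit_intercept=False, C=1e12, max_iter=1000)
clf.fit(X_demo, y_demo)

# predict() is equivalent to checking the sign of the linear score
scores = clf.decision_function(X_demo)
default_preds = clf.predict(X_demo)
manual_preds = (scores > 0).astype(int)
print(np.array_equal(default_preds, manual_preds))

# A lower cutoff on the probability labels more observations as class 1
proba = clf.predict_proba(X_demo)[:, 1]
lenient_preds = (proba > 0.3).astype(int)
print(default_preds.sum(), lenient_preds.sum())
```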

## Initial Evaluation
How many times was the classifier correct for the training set?

In [10]:
# Your code here
resids_train = y_train - y_hat_train
pd.Series(resids_train).value_counts(normalize=True)

 0    0.854626
-1    0.088106
 1    0.057269
Name: target, dtype: float64

The classifier was correct 85.5% of the time for the training set.

The value of 8.8% for "-1" means that 8.8% of the time, the prediction was 1 when the real value was 0 (i.e., a false positive).

The value of 5.7% for "1" means that 5.7% of the time, the prediction was 0 when the real value was 1 (i.e., a false negative).
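
The share of zero residuals is the same quantity that scikit-learn calls accuracy. A minimal sketch with ten hand-made labels (the values are invented for illustration) shows the correspondence:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Ten invented true labels and predictions
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 1, 0, 1, 0, 1, 1])

# Residuals: 0 = correct, -1 = false positive, 1 = false negative
resids = y_true - y_pred
shares = resids.value_counts(normalize=True)
print(shares)

# accuracy_score gives the same number as the share of zero residuals
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.7
```

Here 7 of the 10 predictions are correct, 2 are false positives, and 1 is a false negative.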

## How many times was the classifier correct for the test set?

In [11]:
# Your code here
resids_test = y_test - y_hat_test
pd.Series(resids_test).value_counts(normalize=True)

 0    0.815789
-1    0.118421
 1    0.065789
Name: target, dtype: float64

The classifier was correct 81.6% of the time for the test set. False positives made up 11.8% of all test predictions, and false negatives made up 6.6%.
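
The percentages above are shares of all test predictions. The conventional false positive rate and false negative rate use different denominators: the number of actual negatives and the number of actual positives, respectively. The sketch below computes both versions from a confusion matrix, using ten invented labels rather than the lab's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Ten invented true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 1])

# For binary labels 0/1, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Shares of all predictions (what the residual counts report)
fp_share = fp / len(y_true)  # 2 / 10 = 0.2
fn_share = fn / len(y_true)  # 1 / 10 = 0.1

# Conventional rates, conditioned on the true class
fpr = fp / (fp + tn)  # 2 / 5 = 0.4
fnr = fn / (fn + tp)  # 1 / 5 = 0.2

print(fp_share, fn_share, fpr, fnr)
```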

## Analysis
Describe how well you think this initial model is performing based on the train and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

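Performance was fairly good overall, with 85% accuracy for the training set and 82% for the test set. Unlike the earlier regression work, where fit was summarized by metrics such as adjusted R<sup>2</sup> computed from continuous residuals, performance here was measured by accuracy: the share of predictions that match the true class. The two kinds of metric are not directly comparable. It's encouraging that the accuracy for the training and test sets is similar, which suggests the model is not badly overfit.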
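
One way to put an accuracy figure in context is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does this on synthetic data using scikit-learn's `DummyClassifier`. The data and variable names are hypothetical, and the numbers it prints say nothing about the heart disease model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 3))
noisy_score = X_demo @ np.array([2.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=300)
y_demo = (noisy_score > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(fit_intercept=False, C=1e12, max_iter=1000).fit(X_tr, y_tr)

# score() returns accuracy for classifiers
baseline_acc = baseline.score(X_te, y_te)
model_acc = clf.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```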

## Summary

In this lab, you practiced a standard data science pipeline: importing data, normalizing it, splitting it into train and test sets, fitting a logistic regression model, and evaluating its predictions. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.