# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you will fit a logistic regression model to a heart disease dataset. The final column, 'target', indicates whether a patient has heart disease: 1 means heart disease is present, and 0 means it is absent.

## Objectives
You will be able to:

* Understand and implement logistic regression
* Compare testing and training errors

## Let's get started!

In [7]:
#Starter Code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

In [21]:
#Starter Code
df = pd.read_csv('heart.csv')
df.head()

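| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |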


## Define appropriate X and y
Recall that the final column, 'target', records whether or not a patient has heart disease. With that in mind, define an appropriate X and y to model whether a patient has heart disease.

In [23]:
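#Your code here
x_feats = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']  # categorical columns to one-hot encode
X_dum = pd.get_dummies(df, columns=x_feats)  # NOTE: df still includes 'target', so X_dum does too
print(type(df))
X = pd.concat([df, X_dum], axis=1)  # NOTE: this duplicates the non-encoded columns and keeps 'target' among the features
y = df['target']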

<class 'pandas.core.frame.DataFrame'>


In [26]:
X.shape
X.head()


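| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ... | slope_2 | ca_0 | ca_1 | ca_2 | ca_3 | ca_4 | thal_0 | thal_1 | thal_2 | thal_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |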
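
One way to build a feature matrix that cannot contain the label is to drop 'target' before one-hot encoding. The snippet below is a minimal sketch on a toy frame. The values and the names `toy`, `categorical`, `X_toy`, and `y_toy` are made up for illustration and are not taken from heart.csv.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame (illustrative values only).
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": [3, 2, 1, 1],
    "thal": [1, 2, 2, 2],
    "target": [1, 1, 0, 0],
})

categorical = ["cp", "thal"]

# Separate the label first so it can never end up among the features.
y_toy = toy["target"]
features = toy.drop(columns=["target"])

# One-hot encode only the categorical columns; numeric columns pass through.
X_toy = pd.get_dummies(features, columns=categorical)

print(X_toy.columns.tolist())
```

Applied to the lab's data, the same pattern would use `df` and the full `x_feats` list in place of the toy names.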


## Normalize the Data
Normalize the data prior to fitting the model.

In [27]:
scaler = MinMaxScaler()
scaler.fit(X, y)  # NOTE: fit() only learns each column's min and max; X itself is not rescaled

MinMaxScaler(copy=True, feature_range=(0, 1))

In [28]:
X.head()

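| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ... | slope_2 | ca_0 | ca_1 | ca_2 | ca_3 | ca_4 | thal_0 | thal_1 | thal_2 | thal_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |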


## Train Test Split
Split the data into train and test sets.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
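
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing, so each run of the cell above produces a different split. The sketch below shows two optional arguments that are often helpful: `random_state` for a reproducible split and `stratify` to keep the class balance similar in both sets. The toy arrays and the `_demo`, `_tr`, and `_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, labels evenly split between the classes.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 10 + [0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    random_state=0,     # fixes the shuffle so the split is reproducible
    stratify=y_demo,    # keeps the 0/1 proportions similar in train and test
)

print(X_tr.shape, X_te.shape)
```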

## Fit a model
Fit an initial model to the training set. In scikit-learn, you do this by first creating an instance of the regression class. Then use the **fit** method of that instance to fit a model to the training data.

In [44]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
logreg.fit(X_train, y_train)
logreg



LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
          fit_intercept=False, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [46]:
logreg.coef_

array([[-1.06488799e-02, -6.35818613e-01,  3.00334483e-01,
        -1.04647048e-02, -2.19885193e-03,  7.84766041e-03,
        -1.17035499e-01, -2.82583543e-03, -2.51129534e-01,
        -4.10831360e-02, -8.20670624e-03, -1.44936959e-01,
        -3.91386865e-01,  8.24611164e+00, -1.06488799e-02,
        -1.04647048e-02, -2.19885193e-03, -2.82583543e-03,
        -4.10831360e-02,  8.24611164e+00,  4.08473884e-01,
        -6.35818613e-01, -5.52992705e-01,  3.25651955e-01,
         2.53055366e-02, -2.53095150e-02, -2.35192389e-01,
         7.84766041e-03, -1.30483407e-01, -7.66871440e-02,
        -2.01741775e-02,  2.37848049e-02, -2.51129534e-01,
         8.84172437e-02, -6.23317239e-01,  3.07555266e-01,
         5.70093526e-01, -6.63916596e-01, -4.59985757e-01,
        -1.33094759e-01,  4.59558857e-01, -1.02868857e-01,
        -7.64256046e-02,  1.70810458e-01, -2.18860725e-01]])
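
Each entry of `coef_` is the weight on one column of X. Because the model was created with `fit_intercept=False`, the predicted probability of class 1 is the sigmoid of the weighted sum of the features, with no constant term. The sketch below checks this on synthetic data; the random data, `true_w`, and the other names are assumptions for illustration and are unrelated to the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with noisy labels (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X_demo @ true_w)))
y_demo = (rng.random(200) < p_true).astype(int)

# Same settings as the lab's starter code: no intercept, very weak regularization.
model = LogisticRegression(fit_intercept=False, C=1e12)
model.fit(X_demo, y_demo)

# Linear score z = X . w, then the sigmoid turns it into a probability.
z = X_demo @ model.coef_.ravel()
manual_proba = 1.0 / (1.0 + np.exp(-z))

# scikit-learn's own probability for class 1.
sk_proba = model.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sk_proba))
```

A large `C` means a weak penalty, since `C` is the inverse of the regularization strength in scikit-learn.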

## Predict
Generate predictions for the train and test sets. Use the **predict** method from the logreg object.

In [50]:
y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)


0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
      ..
273    0
274    0
275    0
276    0
277    0
278    0
279    0
280    0
281    0
282    0
283    0
284    0
285    0
286    0
287    0
288    0
289    0
290    0
291    0
292    0
293    0
294    0
295    0
296    0
297    0
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

## Initial Evaluation
How many times was the classifier correct for the training set?

In [38]:
#We could subtract the two columns. If the values are equal, the difference will be zero. Then count the number of zeros.
residuals = y_train - y_hat_train
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))
#227 of 227 correct, 100% accuracy (see output below); a perfect score is a warning sign, since X was built from df and still contains 'target'

0    227
Name: target, dtype: int64
0    1.0
Name: target, dtype: float64


## How many times was the classifier correct for the test set?

In [39]:
len(y_test)

76

In [40]:
residuals = y_test - y_hat_test
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))
#76 of 76 correct, 100% accuracy (see output below)

0    76
Name: target, dtype: int64
0    1.0
Name: target, dtype: float64
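
The residual counts above amount to classification accuracy: the share of rows where the prediction equals the label. As a hedged sketch with made-up labels (the `y_true` and `y_pred` values are illustrative), the same number can be obtained from `accuracy_score`, and the sign of each nonzero residual shows the kind of error.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up labels and predictions (illustrative only).
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# 0 = correct, +1 = missed positive (false negative), -1 = false alarm (false positive).
residuals = y_true - y_pred

n_correct = int((residuals == 0).sum())
manual_acc = n_correct / len(y_true)
sk_acc = accuracy_score(y_true, y_pred)

print(n_correct, manual_acc, sk_acc)
```

For a fitted classifier, `logreg.score(X_test, y_test)` reports the same accuracy in one call.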


## Analysis
Describe how well you think this initial model performs, based on its train and test performance. In your description, note how you evaluated performance compared with our previous work with regression.

In [None]:
#Your answer here

## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into train and test sets, and fitting a logistic regression model. In the upcoming labs and lessons, we'll continue to investigate how to analyze and tune these models for various scenarios.