# Logistic Regression in scikit-learn - Lab

## Introduction 

In this lab, you are going to fit a logistic regression model to a dataset concerning heart disease. Whether or not a patient has heart disease is indicated in the column labeled `'target'`: 1 indicates that the patient has heart disease, and 0 indicates no heart disease.

## Objectives

In this lab you will: 

- Fit a logistic regression model using scikit-learn 


## Let's get started!

Run the following cells that import the necessary functions and import the dataset: 

In [117]:
# Import necessary functions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [118]:
# Import data
df = pd.read_csv('heart.csv')
df.head()

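| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |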


In [119]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [120]:
df['target'].nunique()

2

In [121]:
df.nunique()

age          41
sex           2
cp            4
trestbps     49
chol        152
fbs           2
restecg       3
thalach      91
exang         2
oldpeak      40
slope         3
ca            5
thal          4
target        2
dtype: int64

In [122]:
df['target'].unique()

array([1, 0], dtype=int64)

## Define appropriate `X` and `y` 

Recall that the column labeled `'target'` indicates whether or not a patient has heart disease. With that in mind, define appropriate `X` (predictors) and `y` (target) in order to model whether or not a patient has heart disease.

In [123]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [124]:
df.drop(labels='target',axis=1).shape,df.shape

((303, 13), (303, 14))

In [125]:
# Split the data into target and predictors
X = df.drop(labels='target',axis=1)
y = df['target']
X.shape, y.shape, X.columns

((303, 13),
 (303,),
 Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
        'exang', 'oldpeak', 'slope', 'ca', 'thal'],
       dtype='object'))

## Normalize the data 

Normalize the data (`X`) prior to fitting the model. 

In [126]:
df['age'].min()

29

In [127]:
min(df['age'])

29

In [128]:
np.min(df['age'])

29

In [129]:
(df['age'] - min(df['age'])) / ( max(df['age'])-min(df['age']) )

0      0.708333
1      0.166667
2      0.250000
3      0.562500
4      0.583333
         ...   
298    0.583333
299    0.333333
300    0.812500
301    0.583333
302    0.583333
Name: age, Length: 303, dtype: float64

In [130]:
X

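| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |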


In [131]:
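# First attempt at min-max scaling.
# Note: the expression below always uses X['age'] and ignores the loop variable,
# so every column in the result is a copy of the scaled age column.
t = X.columns
X = [(X['age'] - min(X['age'])) / (max(X['age']) - min(X['age'])) for age in X.columns]
# X.head()
X = pd.DataFrame(X)
X = X.transpose()
X.columns = t
X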

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 | 0.708333 |
| 1 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 |
| 2 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 | 0.250000 |
| 3 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 | 0.562500 |
| 4 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 |
| 299 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 |
| 300 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 | 0.812500 |
| 301 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 | 0.583333 |


In [132]:
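# Second attempt: reset X and scale each column with apply.
# Note: apply returns a new DataFrame; the result is not assigned here, so X is unchanged.
X = df.drop(labels='target', axis=1)
X.apply(lambda x: (x - min(x)) / (max(x) - min(x)), axis=0)  # x is a column of X
X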

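| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |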
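
Because `apply` returns a new DataFrame rather than modifying `X` in place, the scaled values only persist if the result is assigned. The sketch below illustrates min-max scaling on a small stand-in frame built from the first five rows of three columns shown earlier; the names `sample`, `scaled_apply`, and `scaled_vec` are hypothetical, and the resulting values differ from what the full 303-row dataset would give because the minimum and maximum are taken over these five rows only.

```python
import pandas as pd

# First five rows of three columns from the table above (illustrative subset only)
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
})

# Option 1: apply the min-max formula column by column and keep the returned frame
scaled_apply = sample.apply(lambda col: (col - col.min()) / (col.max() - col.min()), axis=0)

# Option 2: the same formula written with vectorized DataFrame arithmetic
scaled_vec = (sample - sample.min()) / (sample.max() - sample.min())

print(scaled_vec)
```

Either form maps each column onto the range 0 to 1. A column with a single repeated value would produce a division by zero, so this sketch assumes every column has at least two distinct values.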


## Train-test split 

- Split the data into training and test sets 
- Assign 25% to the test set 
- Set the `random_state` to 0 

In [133]:
# Split the data into training and test sets
# Note: random_state is not passed here, so the split differs from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
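
The instructions for this step ask for a fixed `random_state`. As a minimal sketch of what that argument does, the example below splits synthetic stand-in data (not the heart dataset) twice with `random_state=0` and shows that both calls select the same rows. The names `X_demo`, `y_demo`, `split_a`, and `split_b` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the heart predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# Hold out 25% of the rows; random_state=0 fixes the shuffle
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)  # (227, 13) (76, 13)

# Both calls produce identical test sets because the seed is the same
print(np.array_equal(split_a[1], split_b[1]))  # True
```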

## Fit a model

- Instantiate `LogisticRegression`
  - Make sure you don't include the intercept  
  - set `C` to a very large number such as `1e12` 
  - Use the `'liblinear'` solver 
- Fit the model to the training data 

In [134]:
# Instantiate the model
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')

# Fit the model
logreg.fit(X_train, y_train)


LogisticRegression(C=1000000000000.0, class_weight=None, dual=False,
                   fit_intercept=False, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
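
In scikit-learn, `C` is the inverse of the regularization strength, so a very large value such as `1e12` makes the default L2 penalty negligible. The following sketch, which uses synthetic noisy data rather than the heart dataset, compares a weakly and a strongly penalized model; `weak_penalty`, `strong_penalty`, and the chosen value `C=0.01` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
noise = rng.normal(scale=2.0, size=300)
y_demo = (X_demo @ np.array([1.5, -1.0, 0.5]) + noise > 0).astype(int)

# Huge C: the L2 penalty has almost no effect on the fitted coefficients
weak_penalty = LogisticRegression(fit_intercept=False, C=1e12, solver="liblinear")
# Small C: the L2 penalty shrinks the coefficients toward zero
strong_penalty = LogisticRegression(fit_intercept=False, C=0.01, solver="liblinear")
weak_penalty.fit(X_demo, y_demo)
strong_penalty.fit(X_demo, y_demo)

# With fit_intercept=False, no intercept is estimated and intercept_ is zero
print(weak_penalty.coef_, weak_penalty.intercept_)
print(strong_penalty.coef_, strong_penalty.intercept_)
```

The strongly penalized model should show coefficients with a noticeably smaller magnitude than the weakly penalized one.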

## Predict
Generate predictions for the training and test sets. 

In [135]:
# Generate predictions
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
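
`predict` returns hard class labels. For a binary model, those labels follow from the sign of the decision score, and `predict_proba` exposes the corresponding estimated probabilities. The sketch below illustrates the relationship on synthetic data; `model`, `labels`, `probs`, and `scores` are hypothetical names, and the data is not the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

labels = model.predict(X_demo)              # hard 0/1 class labels
probs = model.predict_proba(X_demo)[:, 1]   # estimated probability of class 1
scores = model.decision_function(X_demo)    # log-odds score for class 1

# predict returns class 1 exactly when the decision score is positive,
# which corresponds to an estimated probability above 0.5
print(labels[:5], probs[:5].round(3))
```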

## How many times was the classifier correct on the training set?

In [142]:
# Compute prediction minus actual for each training row:
# 0 = correct, 1 = false positive, -1 = false negative
y_train = pd.DataFrame(y_train)

y_train['pred'] = y_hat_train
y_train['diff'] = y_train['pred'] - y_train['target']

print(y_train['diff'].value_counts())
print(y_train['diff'].value_counts(normalize=True))


 0    196
 1     20
-1     11
Name: diff, dtype: int64
 0    0.863436
 1    0.088106
-1    0.048458
Name: diff, dtype: float64
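
The same count can be obtained without converting the labels into a DataFrame. As a small sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not taken from the lab), comparing predictions to actual values element by element gives the number of correct predictions, and `accuracy_score` from `sklearn.metrics` gives the fraction.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

n_correct = int((y_true == y_pred).sum())    # number of matching predictions
accuracy = accuracy_score(y_true, y_pred)    # fraction of matching predictions

print(n_correct, accuracy)  # 6 0.75
```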


## How many times was the classifier correct on the test set?

In [143]:
# Compute prediction minus actual for each test row:
# 0 = correct, 1 = false positive, -1 = false negative
y_test = pd.DataFrame(y_test)

y_test['pred'] = y_hat_test
y_test['diff'] = y_test['pred'] - y_test['target']

print(y_test['diff'].value_counts())
print(y_test['diff'].value_counts(normalize=True))


 0    62
 1    12
-1     2
Name: diff, dtype: int64
 0    0.815789
 1    0.157895
-1    0.026316
Name: diff, dtype: float64
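
In the difference column, a value of 1 means the model predicted disease for a patient without it (a false positive), and -1 means it missed a patient with disease (a false negative). The sketch below shows, on made-up labels, how those counts line up with `confusion_matrix` from `sklearn.metrics`; the arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Prediction minus actual: +1 is a false positive, -1 is a false negative
diff = y_pred - y_true
false_positives = int((diff == 1).sum())
false_negatives = int((diff == -1).sum())

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```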


## Analysis
Describe how well you think this initial model is performing based on the training and test performance. Within your description, make note of how you evaluated performance as compared to your previous work with regression.

In [144]:
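# Training set: 196 of 227 predictions correct (about 86% accuracy).
# Test set: 62 of 76 predictions correct (about 82% accuracy).
# Performance is evaluated differently than in linear regression: the prediction is a
# discrete class (0 or 1), so we count correct classifications instead of measuring
# the distance between the predicted and true y values.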

## Summary

In this lab, you practiced a standard data science pipeline: importing data, splitting it into training and test sets, and fitting a logistic regression model. In the upcoming labs and lessons, you'll continue to investigate how to analyze and tune these models for various scenarios.