# Logistic Regression (Binary Classification)


Logistic regression is a statistical method widely used for binary classification, such as predicting the presence or absence of heart disease from patient measurements. In this notebook, the model estimates the probability that a given patient has heart disease (the `AHD` column) from features such as age, chest-pain type, and cholesterol.
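As a minimal sketch of how a feature vector becomes a probability: the model computes a linear score (the log-odds) and passes it through the sigmoid function. The intercept, weights, and feature values below are hypothetical and chosen only for illustration; they are not fitted values from the heart-disease data.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a two-feature model (illustration only,
# NOT coefficients learned from the heart-disease dataset).
intercept = -1.0
weights = [0.8, -0.5]
features = [1.5, 0.2]  # one example, assumed to be already scaled

# Linear score first, then squash it into a probability.
z = intercept + sum(w * x for w, x in zip(weights, features))
p = sigmoid(z)

# Usual decision rule: predict the positive class when p >= 0.5.
label = int(p >= 0.5)
print(f"log-odds={z:.3f}, probability={p:.3f}, predicted class={label}")
```

A score of zero corresponds to a probability of exactly 0.5, which is why thresholding the probability at 0.5 is the same as checking the sign of the linear score.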

Thanks to rashida048 for the dataset: [Heart.csv](https://github.com/rashida048/Datasets/blob/master/Heart.csv)

## Import Necessary Libraries

In [347]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

## Data Preparation

In [348]:
dataframe = pd.read_csv('../data/Heart.csv')
dataframe.head()

index,Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,1,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,2,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,3,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,4,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,5,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No


In [349]:
dataframe = dataframe.drop(columns=['Unnamed: 0'])
dataframe

index,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable,Yes
299,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable,Yes
300,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable,Yes
301,57,0,nontypical,130,236,0,2,174,0,0.0,2,1.0,normal,Yes


In [350]:
dataframe.isna().sum()

Age          0
Sex          0
ChestPain    0
RestBP       0
Chol         0
Fbs          0
RestECG      0
MaxHR        0
ExAng        0
Oldpeak      0
Slope        0
Ca           4
Thal         2
AHD          0
dtype: int64

In [351]:
dataframe = dataframe.dropna()
dataframe

index,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57,0,asymptomatic,140,241,0,0,123,1,0.2,2,0.0,reversable,Yes
298,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable,Yes
299,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable,Yes
300,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable,Yes


In [352]:
dataframe.isna().sum()

Age          0
Sex          0
ChestPain    0
RestBP       0
Chol         0
Fbs          0
RestECG      0
MaxHR        0
ExAng        0
Oldpeak      0
Slope        0
Ca           0
Thal         0
AHD          0
dtype: int64

In [353]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        297 non-null    int64  
 1   Sex        297 non-null    int64  
 2   ChestPain  297 non-null    object 
 3   RestBP     297 non-null    int64  
 4   Chol       297 non-null    int64  
 5   Fbs        297 non-null    int64  
 6   RestECG    297 non-null    int64  
 7   MaxHR      297 non-null    int64  
 8   ExAng      297 non-null    int64  
 9   Oldpeak    297 non-null    float64
 10  Slope      297 non-null    int64  
 11  Ca         297 non-null    float64
 12  Thal       297 non-null    object 
 13  AHD        297 non-null    object 
dtypes: float64(2), int64(9), object(3)
memory usage: 34.8+ KB


In [354]:
dataframe.shape

(297, 14)

## Converting non-numeric data into numeric data

In [355]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
dataframe = dataframe.copy()

# Encode each text column as integer category codes (categories are ordered alphabetically).
for column in ['ChestPain', 'Thal', 'AHD']:
    dataframe[column] = dataframe[column].astype('category').cat.codes

dataframe

index,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,63,1,3,145,233,1,2,150,0,2.3,3,0.0,0,0
1,67,1,0,160,286,0,2,108,1,1.5,2,3.0,1,1
2,67,1,0,120,229,0,2,129,1,2.6,2,2.0,2,1
3,37,1,1,130,250,0,0,187,0,3.5,3,0.0,1,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57,0,0,140,241,0,0,123,1,0.2,2,0.0,2,1
298,45,1,3,110,264,0,0,132,0,1.2,2,0.0,2,1
299,68,1,0,144,193,1,0,141,0,3.4,2,2.0,2,1
300,57,1,0,130,131,0,0,115,1,1.2,2,1.0,2,1
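Category codes are assigned in sorted category order, so the integers imply an ordering that a nominal feature such as `ChestPain` does not really have. The sketch below, on a tiny illustrative frame (not the full dataset), shows how to recover the code-to-label mapping and how one-hot encoding with `pd.get_dummies` could be used as an alternative. This is an optional variation, not the encoding used in this notebook.

```python
import pandas as pd

# Tiny illustrative frame; the category names match the ChestPain values shown above.
toy = pd.DataFrame(
    {"ChestPain": ["typical", "asymptomatic", "nonanginal", "nontypical"]}
)

# Integer codes follow the sorted (alphabetical) order of the categories.
chest_pain = toy["ChestPain"].astype("category")
code_map = dict(enumerate(chest_pain.cat.categories))
print(code_map)

# One-hot alternative: one 0/1 indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy, columns=["ChestPain"], dtype=int)
print(one_hot)
```

One-hot encoding adds columns, so a linear model gets a separate coefficient per category instead of a single slope across arbitrary integer codes.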


## Defining the Features (X) and Target (y)

In [356]:
X = dataframe.drop(columns=['AHD'])
X.head()

index,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal
0,63,1,3,145,233,1,2,150,0,2.3,3,0.0,0
1,67,1,0,160,286,0,2,108,1,1.5,2,3.0,1
2,67,1,0,120,229,0,2,129,1,2.6,2,2.0,2
3,37,1,1,130,250,0,0,187,0,3.5,3,0.0,1
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,1


In [357]:
y = dataframe['AHD']
y[:5]

0    0
1    1
2    1
3    0
4    0
Name: AHD, dtype: int8

## Splitting the Data (Training and Test Sets)

In [358]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
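A plain random split does not guarantee that both sets contain the same share of positive cases. As a hedged sketch on synthetic labels (not the heart-disease data), passing `stratify=` to `train_test_split` keeps the class ratio the same in both parts. The array names here are illustrative, and the split above does not use this option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60 negatives and 40 positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Stratification matters most for small or imbalanced datasets, where an unlucky split can noticeably shift the class balance of the test set.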

## Data Scaling

In [359]:
scaler = StandardScaler()

In [360]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [361]:
X_train_scaled

array([[ 0.72855974,  0.66143783, -0.84346533, ..., -1.0091956 ,
         0.29894064,  1.14642301],
       [ 0.28302985,  0.66143783, -0.84346533, ...,  0.6487686 ,
         0.29894064, -2.29284602],
       [-0.71941239,  0.66143783,  1.23507423, ...,  2.30673279,
        -0.73240456,  1.14642301],
       ...,
       [ 0.28302985,  0.66143783,  0.19580445, ...,  0.6487686 ,
         0.29894064,  1.14642301],
       [ 0.5057948 ,  0.66143783,  2.27434401, ..., -1.0091956 ,
         1.33028584, -0.5732115 ],
       [ 1.84238445, -1.51185789,  0.19580445, ..., -1.0091956 ,
         0.29894064, -0.5732115 ]])

In [362]:
X_test_scaled[:5]

array([[-1.0535598 , -1.51185789,  1.23507423, -1.14075454, -1.6650806 ,
        -0.43549417, -1.10179688, -0.44240594, -0.72253638, -0.90553097,
         0.6487686 , -0.73240456, -0.5732115 ],
       [-0.2738825 ,  0.66143783, -0.84346533, -1.14075454, -0.35624812,
        -0.43549417, -1.10179688,  0.49949058, -0.72253638, -0.90553097,
        -1.0091956 ,  0.29894064, -0.5732115 ],
       [-0.05111756, -1.51185789,  0.19580445,  0.20725548,  1.02737478,
         2.29624199, -1.10179688,  0.92762535, -0.72253638, -0.90553097,
        -1.0091956 , -0.73240456, -0.5732115 ],
       [ 1.73100198,  0.66143783, -0.84346533, -0.08579018,  1.3639317 ,
        -0.43549417,  0.916539  , -1.6839968 , -0.72253638,  1.23262391,
         0.6487686 ,  2.36163104, -0.5732115 ],
       [ 0.17164738,  0.66143783,  1.23507423, -0.67188149, -0.24406248,
        -0.43549417, -1.10179688,  1.27013318, -0.72253638, -0.19281268,
        -1.0091956 , -0.73240456, -0.5732115 ]])
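To make the scaling step concrete, the sketch below reproduces what `StandardScaler` does by hand on a small toy array: it learns the per-column mean and standard deviation from the training rows only, then applies the same shift and scale to the test rows. The toy values are for illustration and are not the notebook's actual train/test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "training" and "test" rows with two columns (for example, age and resting BP).
train = np.array([[63.0, 145.0], [67.0, 160.0], [37.0, 130.0], [41.0, 130.0]])
test = np.array([[57.0, 140.0]])

# Fit on the training rows only, so no test-set statistics leak into the model.
scaler_toy = StandardScaler().fit(train)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population std, ddof=0, which StandardScaler uses).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(scaler_toy.transform(test))
print(manual_test)
```

Both printed arrays should agree, which is why the notebook calls `fit_transform` on the training data but only `transform` on the test data.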

## Model Building

In [363]:
log_reg_model = LogisticRegression(C=0.1)

In [364]:
log_reg_model.fit(X_train_scaled, y_train)
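In scikit-learn, `C` is the inverse of the regularization strength, so `C=0.1` applies a stronger penalty than the default `C=1.0`. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the heart-disease data) to show how smaller values of `C` shrink the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the heart-disease dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Fit the same model with increasingly weak regularization (larger C).
norms = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<5} coefficient norm={norms[C]:.3f}")
```

A reasonable way to choose `C` in practice is cross-validation over a small grid of values rather than fixing it by hand.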

## Model Prediction

In [365]:
pred = log_reg_model.predict(X_test_scaled)
pred

array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 1], dtype=int8)
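The class labels returned by `predict` come from thresholding the predicted probability of the positive class at 0.5. As a sketch on synthetic data (not the heart-disease data), `predict_proba` exposes those probabilities, which makes it possible to apply a different threshold. The 0.3 threshold below is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the heart-disease dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_syn, y_syn)

# Column 1 of predict_proba holds the probability of class 1.
proba_pos = clf.predict_proba(X_syn)[:, 1]

# Thresholding at 0.5 reproduces the labels returned by predict().
manual_labels = (proba_pos > 0.5).astype(int)
default_labels = clf.predict(X_syn)

# A lower (hypothetical) threshold flags more cases as positive.
sensitive_labels = (proba_pos > 0.3).astype(int)
print("positives at 0.5:", manual_labels.sum())
print("positives at 0.3:", sensitive_labels.sum())
```

In a screening setting, a lower threshold trades more false alarms for fewer missed cases, so the threshold is worth treating as a tunable choice.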

## Model Evaluation

In [366]:
log_reg_model.score(X_train_scaled, y_train)

0.8454106280193237

In [367]:
log_reg_model.score(X_test_scaled, y_test)

0.8666666666666667
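The `score` method reports accuracy, which does not show which kinds of errors the model makes. The sketch below, again on synthetic data rather than the heart-disease data, shows one way a confusion matrix and a per-class precision/recall report could be computed with `sklearn.metrics`. The numbers it prints are not results for this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the heart-disease dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Same pipeline shape as above: scale, fit, predict.
scaler_syn = StandardScaler()
clf = LogisticRegression(C=0.1).fit(scaler_syn.fit_transform(X_tr), y_tr)
X_te_scaled = scaler_syn.transform(X_te)
y_pred = clf.predict(X_te_scaled)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy = (tn + tp) / cm.sum()
print(cm)
print(f"accuracy from confusion matrix: {accuracy:.3f}")
print(classification_report(y_te, y_pred, digits=3))
```

For a medical task, the false-negative count (patients with disease predicted as healthy) is often the more important number to watch than overall accuracy.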