# **Supervised Learning: Classification**

This project aims to develop a Machine Learning algorithm to predict a person's tendency to develop some type of heart disease based on some clinical and laboratory factors.

The data was extracted from the Kaggle website:

https://www.kaggle.com/fedesoriano/heart-failure-prediction/version/1

# **Preprocessing**

In [76]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import KFold, train_test_split, cross_val_score

from google.colab import drive

In [77]:
drive.mount('/content/drive')

data = pd.read_csv('/content/drive/MyDrive/colab_notebooks/pre_processing/pre_processed_heart_disease_dataset.csv',
                    sep=';', encoding='utf-8')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [78]:
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289.0,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180.0,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283.0,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214.0,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195.0,0,Normal,122,N,0.0,Up,0


In [79]:
data.shape

(917, 12)

In [80]:
data.dtypes

Unnamed: 0,0
Age,int64
Sex,object
ChestPainType,object
RestingBP,int64
Cholesterol,float64
FastingBS,int64
RestingECG,object
MaxHR,int64
ExerciseAngina,object
Oldpeak,float64


## **Transforming nominal categorical variables into ordinal categorical variables**

In [81]:
data2 = pd.DataFrame.copy(data)

In [82]:
data2['Sex'].replace({'M': 0, 'F': 1}, inplace=True)
data2['ChestPainType'].replace({'TA': 0, 'ATA': 1, 'NAP': 2, 'ASY': 3}, inplace=True)
data2['RestingECG'].replace({'Normal': 0, 'ST': 1, 'LVH': 2}, inplace=True)
data2['ExerciseAngina'].replace({'N': 0, 'Y': 1}, inplace=True)
data2['ST_Slope'].replace({'Up': 0, 'Flat': 1, 'Down': 2}, inplace=True)

In [83]:
data2.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,0,1,140,289.0,0,0,172,0,0.0,0,0
1,49,1,2,160,180.0,0,0,156,0,1.0,1,1
2,37,0,1,130,283.0,0,1,98,0,0.0,0,0
3,48,1,3,138,214.0,0,0,108,1,1.5,1,1
4,54,0,2,150,195.0,0,0,122,0,0.0,0,0


In [84]:
data2.dtypes

Unnamed: 0,0
Age,int64
Sex,int64
ChestPainType,int64
RestingBP,int64
Cholesterol,float64
FastingBS,int64
RestingECG,int64
MaxHR,int64
ExerciseAngina,int64
Oldpeak,float64


In [85]:
data2.shape

(917, 12)

## **Attributes Summary**

- Age = age (years)
- Sex = sex (0=M; 1=F)
- Chest Pain Type = type of chest pain (0=TA: typical angina; 1=ATA: atypical angina; 2=NAP: non-anginal pain; 3=ASY: asymptomatic)
- Resting BP = resting blood pressure (mmHg)
- Cholesterol = serum cholesterol (mg/dl)
- Fasting BS = fasting blood sugar (mg/dl)
  - 0: Fasting BS < 120 mg/dl (non-diabetic)
  - 1: Fasting BS >= 120 mg/dl, (diabetic)
- Resting ECG = resting electrocardiogram (0=Normal; 1=ST: ST-T wave abnormality; 2=LVH: Left ventricular hypertrophy)
- Max HR = maximum heart rate
- Exercise Angina = Angina Exercise-induced (0 = No; 1 = Yes)
- Old Peak = Exercise-induced ST depression relative to rest
- ST_Slope = ST segment slope (0 = UP; 1 = Flat; 2 = Down)
- Heart Disease = Heart disease (0 = Does not have heart disease; 1 = Has heart disease)

## **Predictor and target attributes**

In [86]:
data2.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,0,1,140,289.0,0,0,172,0,0.0,0,0
1,49,1,2,160,180.0,0,0,156,0,1.0,1,1
2,37,0,1,130,283.0,0,1,98,0,0.0,0,0
3,48,1,3,138,214.0,0,0,108,1,1.5,1,1
4,54,0,2,150,195.0,0,0,122,0,0.0,0,0


In [87]:
# The first value of `iloc` is the rows: in this case, we want all the rows
# The second value is the columns: we want all the attributes but the HeartDisease, which's the value we want to predict
predictors = data2.iloc[:, 0:11].values

In [88]:
predictors

array([[40. ,  0. ,  1. , ...,  0. ,  0. ,  0. ],
       [49. ,  1. ,  2. , ...,  0. ,  1. ,  1. ],
       [37. ,  0. ,  1. , ...,  0. ,  0. ,  0. ],
       ...,
       [57. ,  0. ,  3. , ...,  1. ,  1.2,  1. ],
       [57. ,  1. ,  1. , ...,  0. ,  0. ,  1. ],
       [38. ,  0. ,  2. , ...,  0. ,  0. ,  0. ]])

In [89]:
predictors.shape

(917, 11)

In [90]:
target = data2.iloc[:, 11].values

In [91]:
target

array([0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,

In [92]:
target.shape

(917,)

## **Analysis of attribute: Scaling**

In [93]:
data2.describe()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
count,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0
mean,53.509269,0.210469,2.251908,132.540894,244.635389,0.23337,0.604144,136.789531,0.40458,0.886696,0.63795,0.55289
std,9.437636,0.407864,0.931502,17.999749,53.347125,0.423206,0.806161,25.467129,0.491078,1.06696,0.60727,0.497466
min,28.0,0.0,0.0,80.0,85.0,0.0,0.0,60.0,0.0,-2.6,0.0,0.0
25%,47.0,0.0,2.0,120.0,214.0,0.0,0.0,120.0,0.0,0.0,0.0,0.0
50%,54.0,0.0,3.0,130.0,244.635389,0.0,0.0,138.0,0.0,0.6,1.0,1.0
75%,60.0,0.0,3.0,140.0,267.0,0.0,1.0,156.0,1.0,1.5,1.0,1.0
max,77.0,1.0,3.0,200.0,603.0,1.0,2.0,202.0,1.0,6.2,2.0,1.0


- Standardization (uses the mean and standard deviation as a reference).
- Normalization (uses the maximum and minimum values ​​as a reference).

In [94]:
scaled_predictors = StandardScaler().fit_transform(predictors)
scaled_predictors

array([[-1.43220634, -0.51630861, -1.34470119, ..., -0.82431012,
        -0.83150225, -1.05109458],
       [-0.47805725,  1.9368261 , -0.27058012, ..., -0.82431012,
         0.10625149,  0.59651863],
       [-1.75025603, -0.51630861, -1.34470119, ..., -0.82431012,
        -0.83150225, -1.05109458],
       ...,
       [ 0.37007527, -0.51630861,  0.80354095, ...,  1.21313565,
         0.29380223,  0.59651863],
       [ 0.37007527,  1.9368261 , -1.34470119, ..., -0.82431012,
        -0.83150225,  0.59651863],
       [-1.64423947, -0.51630861, -0.27058012, ..., -0.82431012,
        -0.83150225, -1.05109458]])

In [95]:
df = pd.DataFrame(scaled_predictors)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-1.432206,-0.516309,-1.344701,0.414627,0.832075,-0.551733,-0.749818,1.383339,-0.824310,-0.831502,-1.051095
1,-0.478057,1.936826,-0.270580,1.526360,-1.212261,-0.551733,-0.749818,0.754736,-0.824310,0.106251,0.596519
2,-1.750256,-0.516309,-1.344701,-0.141240,0.719543,-0.551733,0.491306,-1.523953,-0.824310,-0.831502,-1.051095
3,-0.584074,1.936826,0.803541,0.303453,-0.574578,-0.551733,-0.749818,-1.131075,1.213136,0.575128,0.596519
4,0.052026,-0.516309,-0.270580,0.970493,-0.930931,-0.551733,-0.749818,-0.581047,-0.824310,-0.831502,-1.051095
...,...,...,...,...,...,...,...,...,...,...,...
912,-0.902124,-0.516309,-2.418822,-1.252973,0.363191,-0.551733,-0.749818,-0.188170,-0.824310,0.293802,0.596519
913,1.536257,-0.516309,0.803541,0.636973,-0.968441,1.812470,-0.749818,0.165420,-0.824310,2.356860,0.596519
914,0.370075,-0.516309,0.803541,-0.141240,-2.131275,-0.551733,-0.749818,-0.856061,1.213136,0.293802,0.596519
915,0.370075,1.936826,-1.344701,-0.141240,-0.161960,-0.551733,1.732430,1.461915,-0.824310,-0.831502,0.596519


In [96]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
count,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0
mean,1.859654e-16,7.748558e-18,1.046055e-16,7.767929e-16,-1.86934e-16,4.649135e-17,0.0,-5.114048e-16,-1.046055e-16,7.748558000000001e-17,-3.8742790000000005e-17
std,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546
min,-2.704405,-0.5163086,-2.418822,-2.920572,-2.994023,-0.5517333,-0.749818,-3.016886,-0.8243101,-3.269662,-1.051095
25%,-0.6900904,-0.5163086,-0.2705801,-0.6971063,-0.5745784,-0.5517333,-0.749818,-0.6596226,-0.8243101,-0.8315022,-1.051095
50%,0.05202558,-0.5163086,0.803541,-0.1412398,0.0,-0.5517333,-0.749818,0.04755658,-0.8243101,-0.26885,0.5965186
75%,0.688125,-0.5163086,0.803541,0.4146267,0.4194568,-0.5517333,0.491306,0.7547357,1.213136,0.5751284,0.5965186
max,2.490407,1.936826,0.803541,3.749826,6.721265,1.81247,1.73243,2.561971,1.213136,4.982571,2.244132


## **Encoding of categorical variables**

### **LabelEncoder: transforming categorical variables into numeric variables programmatically**

In [97]:
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289.0,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180.0,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283.0,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214.0,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195.0,0,Normal,122,N,0.0,Up,0


In [98]:
label_encoded_predictors = data.iloc[:, 0:11].values
label_encoded_predictors

array([[40, 'M', 'ATA', ..., 'N', 0.0, 'Up'],
       [49, 'F', 'NAP', ..., 'N', 1.0, 'Flat'],
       [37, 'M', 'ATA', ..., 'N', 0.0, 'Up'],
       ...,
       [57, 'M', 'ASY', ..., 'Y', 1.2, 'Flat'],
       [57, 'F', 'ATA', ..., 'N', 0.0, 'Flat'],
       [38, 'M', 'NAP', ..., 'N', 0.0, 'Up']], dtype=object)

In [99]:
label_encoded_predictors[:,1] = LabelEncoder().fit_transform(label_encoded_predictors[:,1])
label_encoded_predictors[:,2] = LabelEncoder().fit_transform(label_encoded_predictors[:,2])
label_encoded_predictors[:,6] = LabelEncoder().fit_transform(label_encoded_predictors[:,6])
label_encoded_predictors[:,8] = LabelEncoder().fit_transform(label_encoded_predictors[:,8])
label_encoded_predictors[:,10] = LabelEncoder().fit_transform(label_encoded_predictors[:,10])
label_encoded_predictors

array([[40, 1, 1, ..., 0, 0.0, 2],
       [49, 0, 2, ..., 0, 1.0, 1],
       [37, 1, 1, ..., 0, 0.0, 2],
       ...,
       [57, 1, 0, ..., 1, 1.2, 1],
       [57, 0, 1, ..., 0, 0.0, 1],
       [38, 1, 2, ..., 0, 0.0, 2]], dtype=object)

### **OneHotEncoder: Creating Dummy variables.**

Beware of multicollinearity (variables that are highly correlated with each other).

A   B   C   D   
1   0   0   0   
0   1   0   0   
0   0   1   0   
0   0   0   1


ColumnTransformer Parameters
- name: name given to the transformation.
- transformer: type of estimator (OneHotEncoder).
- columns: columns that will be transformed.
- remainder: what will happen to the remaining unrelated columns:
1) drop = deletes the other columns.
2) passthrough = keeps the other columns. drop is the default.
- sparse_threshold: sparse matrix classification parameter. default is 0.3
- n_jobs: number of jobs to be executed in parallel. default is none
- transformer_weights: definition of weights for the transformers.
-verbose: default is False. if True, the execution is shown on the screen.

In [100]:
label_and_one_hot_encoding_predictors = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(), [1,2,6,8,10])],
                                                          remainder='passthrough').fit_transform(label_encoded_predictors)
label_and_one_hot_encoding_predictors

array([[0.0, 1.0, 0.0, ..., 0, 172, 0.0],
       [1.0, 0.0, 0.0, ..., 0, 156, 1.0],
       [0.0, 1.0, 0.0, ..., 0, 98, 0.0],
       ...,
       [0.0, 1.0, 1.0, ..., 0, 115, 1.2],
       [1.0, 0.0, 0.0, ..., 0, 174, 0.0],
       [0.0, 1.0, 0.0, ..., 0, 173, 0.0]], dtype=object)

In [101]:
pd.DataFrame(label_and_one_hot_encoding_predictors).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,40,140,289.0,0,172,0.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,49,160,180.0,0,156,1.0
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,37,130,283.0,0,98,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,48,138,214.0,0,108,1.5
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,54,150,195.0,0,122,0.0


## **Scaling**

In [102]:
scaled_label_and_one_hotel_encoding_predictors = StandardScaler().fit_transform(label_and_one_hot_encoding_predictors)
scaled_label_and_one_hotel_encoding_predictors

array([[-0.51630861,  0.51630861, -1.08542493, ..., -0.55173333,
         1.38333943, -0.83150225],
       [ 1.9368261 , -1.9368261 , -1.08542493, ..., -0.55173333,
         0.75473573,  0.10625149],
       [-0.51630861,  0.51630861, -1.08542493, ..., -0.55173333,
        -1.52395266, -0.83150225],
       ...,
       [-0.51630861,  0.51630861,  0.92129817, ..., -0.55173333,
        -0.85606123,  0.29380223],
       [ 1.9368261 , -1.9368261 , -1.08542493, ..., -0.55173333,
         1.46191489, -0.83150225],
       [-0.51630861,  0.51630861, -1.08542493, ..., -0.55173333,
         1.42262716, -0.83150225]])

In [103]:
df = pd.DataFrame(scaled_label_and_one_hotel_encoding_predictors)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.516309,0.516309,-1.085425,2.073784,-0.531524,-0.229810,-0.507826,0.815013,-0.490781,0.824310,-0.824310,-0.271607,-1.001091,1.149573,-1.432206,0.414627,0.832075,-0.551733,1.383339,-0.831502
1,1.936826,-1.936826,-1.085425,-0.482210,1.881384,-0.229810,-0.507826,0.815013,-0.490781,0.824310,-0.824310,-0.271607,0.998910,-0.869888,-0.478057,1.526360,-1.212261,-0.551733,0.754736,0.106251
2,-0.516309,0.516309,-1.085425,2.073784,-0.531524,-0.229810,-0.507826,-1.226974,2.037569,0.824310,-0.824310,-0.271607,-1.001091,1.149573,-1.750256,-0.141240,0.719543,-0.551733,-1.523953,-0.831502
3,1.936826,-1.936826,0.921298,-0.482210,-0.531524,-0.229810,-0.507826,0.815013,-0.490781,-1.213136,1.213136,-0.271607,0.998910,-0.869888,-0.584074,0.303453,-0.574578,-0.551733,-1.131075,0.575128
4,-0.516309,0.516309,-1.085425,-0.482210,1.881384,-0.229810,-0.507826,0.815013,-0.490781,0.824310,-0.824310,-0.271607,-1.001091,1.149573,0.052026,0.970493,-0.930931,-0.551733,-0.581047,-0.831502
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
912,-0.516309,0.516309,-1.085425,-0.482210,-0.531524,4.351412,-0.507826,0.815013,-0.490781,0.824310,-0.824310,-0.271607,0.998910,-0.869888,-0.902124,-1.252973,0.363191,-0.551733,-0.188170,0.293802
913,-0.516309,0.516309,0.921298,-0.482210,-0.531524,-0.229810,-0.507826,0.815013,-0.490781,0.824310,-0.824310,-0.271607,0.998910,-0.869888,1.536257,0.636973,-0.968441,1.812470,0.165420,2.356860
914,-0.516309,0.516309,0.921298,-0.482210,-0.531524,-0.229810,-0.507826,0.815013,-0.490781,-1.213136,1.213136,-0.271607,0.998910,-0.869888,0.370075,-0.141240,-2.131275,-0.551733,-0.856061,0.293802
915,1.936826,-1.936826,-1.085425,2.073784,-0.531524,-0.229810,1.969177,-1.226974,-0.490781,0.824310,-0.824310,-0.271607,0.998910,-0.869888,0.370075,-0.141240,-0.161960,-0.551733,1.461915,-0.831502


In [104]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
count,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0,917.0
mean,1.084798e-16,-1.472226e-16,1.937139e-17,-3.8742790000000005e-17,3.8742790000000005e-17,6.973702000000001e-17,0.0,-9.298269e-17,1.549712e-17,-4.2617070000000006e-17,4.2617070000000006e-17,8.523413e-17,0.0,-3.8742790000000005e-17,1.859654e-16,7.884157e-16,3.014189e-15,-1.549712e-17,-5.114048e-16,-1.859654e-16
std,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546,1.000546
min,-0.5163086,-1.936826,-1.085425,-0.4822104,-0.5315237,-0.2298105,-0.507826,-1.226974,-0.490781,-1.213136,-0.8243101,-0.2716072,-1.001091,-0.8698879,-2.704405,-2.920572,-2.994023,-0.5517333,-3.016886,-3.269662
25%,-0.5163086,0.5163086,-1.085425,-0.4822104,-0.5315237,-0.2298105,-0.507826,-1.226974,-0.490781,-1.213136,-0.8243101,-0.2716072,-1.001091,-0.8698879,-0.6900904,-0.6971063,-0.5745784,-0.5517333,-0.6596226,-0.8315022
50%,-0.5163086,0.5163086,0.9212982,-0.4822104,-0.5315237,-0.2298105,-0.507826,0.8150134,-0.490781,0.8243101,-0.8243101,-0.2716072,0.99891,-0.8698879,0.05202558,-0.1412398,3.19836e-15,-0.5517333,0.04755658,-0.26885
75%,-0.5163086,0.5163086,0.9212982,-0.4822104,-0.5315237,-0.2298105,-0.507826,0.8150134,-0.490781,0.8243101,1.213136,-0.2716072,0.99891,1.149573,0.688125,0.4146267,0.4194568,-0.5517333,0.7547357,0.5751284
max,1.936826,0.5163086,0.9212982,2.073784,1.881384,4.351412,1.969177,0.8150134,2.037569,0.8243101,1.213136,3.681787,0.99891,1.149573,2.490407,3.749826,6.721265,1.81247,2.561971,4.982571


## **Preprocessing Summary**

- target = variable that is intended to be achieved (does the person have heart disease or not).
- predictors = set of predictor variables with the categorical variables transformed into numerics manually, without scaling.
- scaled_predictors = set of predictor variables with the categorical variables transformed into numerics, scaled.
- label_encoded_predictors = set of predictor variables with the categorical variables transformed into numerics by the labelencoder.
- label_and_one_hot_encoding_predictors = set of predictor variables transformed by the labelencoder and onehotencoder, without scaling.
- scaled_label_and_one_hotel_encoding_predictors = set of predictor variables transformed by the labelencoder and onehotencoder, scaled.

## **Dimension Reduction**

The objective is to select the best components (attributes) for training the algorithm, through the analysis of correlations between variables.

### Principal Component Analysis (PCA)

**Feature Selection:** Selects the best attributes and uses them without transformations.

**Feature Extraction:** Finds the relationships of the best attributes and creates new attributes.

It is an unsupervised learning algorithm.

It is applied to linearly separable data.

In [105]:
# From 11 to 4 attributes with PCA
pca = PCA(n_components=4)

In [106]:
pca_predictors = pca.fit_transform(predictors)

In [107]:
pca_predictors.shape

(917, 4)

In [108]:
pca_predictors

array([[  44.01031323,  -36.16368188,   10.64655418,   -9.4858855 ],
       [ -63.99070205,  -13.9285156 ,   31.68531903,   -5.3197523 ],
       [  38.53828121,   33.89882653,  -12.48258193,  -21.52677046],
       ...,
       [-113.34768547,   23.48739283,   -2.51236375,    1.14310997],
       [  -9.11479572,  -35.9101508 ,    4.82792119,    9.14499845],
       [ -70.01231135,  -35.67713061,   12.10297998,  -10.53005398]])

In [109]:
pca.explained_variance_ratio_

array([0.72844082, 0.1718306 , 0.08121793, 0.01767869])

In [110]:
pca.explained_variance_ratio_.sum()

0.9991680439746082

### Kernel PCA

It is an unsupervised learning algorithm.

It can also be applied to linearly non-separable data.

In [111]:
kpca = KernelPCA(n_components=4, kernel='rbf')

In [112]:
kernel_predictors = kpca.fit_transform(predictors)

In [113]:
kernel_predictors.shape

(917, 4)

In [114]:
kernel_predictors

array([[-0.00161313, -0.00266007, -0.00186814, -0.00263223],
       [-0.00161382, -0.00266122, -0.00186901, -0.00263352],
       [-0.0016132 , -0.00266017, -0.00186822, -0.00263235],
       ...,
       [-0.00161315, -0.00266009, -0.00186816, -0.00263225],
       [-0.00161325, -0.00266026, -0.00186829, -0.00263245],
       [-0.00161314, -0.00266007, -0.00186814, -0.00263224]])

### **Linear Discriminant Analysis (LDA)**

Supervised learning algorithm, as it uses the class as a reference for selection.

Applied in situations with many predictor attributes and also with the target attribute with many classes.

In [115]:
lda = LinearDiscriminantAnalysis(n_components = 1)

In [116]:
lda_predictors = lda.fit_transform(predictors, target)
lda_predictors

array([[-1.84039906e+00],
       [-1.02850026e+00],
       [-1.31942421e+00],
       [ 5.44796136e-01],
       [-1.07056148e+00],
       [-1.33498689e+00],
       [-2.65709986e+00],
       [-1.71006712e+00],
       [ 1.05406169e+00],
       [-2.19691693e+00],
       [-2.10336229e+00],
       [ 5.60819162e-01],
       [-1.82642106e+00],
       [ 1.02429254e+00],
       [-2.04219711e+00],
       [-1.21482688e+00],
       [-2.02616486e-01],
       [-2.71291883e+00],
       [ 4.99934678e-01],
       [-4.11374508e-01],
       [-2.98063459e+00],
       [-7.04338045e-01],
       [-2.65710181e+00],
       [ 4.75746736e-01],
       [-1.28657725e+00],
       [-1.62448979e+00],
       [ 1.63352464e+00],
       [-1.39514243e+00],
       [-1.90715060e+00],
       [-1.76263888e+00],
       [ 3.45010914e-01],
       [-1.07924898e+00],
       [ 6.51134898e-01],
       [ 4.33824846e-01],
       [-2.60569643e+00],
       [-1.85455855e+00],
       [ 2.39630035e+00],
       [-2.47176887e+00],
       [-1.9

In [117]:
lda.explained_variance_ratio_

array([1.])

## **Training and Test Datasets**

train_test_split parameters:
- arrays: names of predictor and target attributes.
- test_size: size in percentage of test data. default is none.
- train_size: size in percentage of training data. default is none.
- random_state: naming of a random state.
- shuffle: shuffling of random data. Associated with random_state, the same shuffling always occurs. Default is True.
- stratify: Possibility of splitting the data in a stratified manner. Default is None (in this case, the proportion is maintained, that is, if there are 30% zeros and 70% 1s in the dataframe, this proportion will be maintained when separating into training and testing).

In [118]:
x_training, x_test, y_training, y_test = train_test_split(scaled_label_and_one_hotel_encoding_predictors, target, test_size = 0.3, random_state = 0)

In [119]:
x_training.shape, x_test.shape, y_training.shape, y_test.shape

((641, 20), (276, 20), (641,), (276,))

# **Naive Bayes**

### **Fitting the model**

Training the model: fitting the model to the data

In [120]:
naive = GaussianNB()
naive.fit(x_training, y_training);

Making the prediction

In [121]:
naive_bayes_prediction = naive.predict(x_test)
naive_bayes_prediction

array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1])

In [122]:
print("Test Accuracy: %.2f%%" % (accuracy_score(y_test, naive_bayes_prediction) * 100.0))

Test Accuracy: 84.78%


In [123]:
confusion_matrix(y_test, naive_bayes_prediction)

array([[100,  21],
       [ 21, 134]])

In [124]:
print(classification_report(y_test, naive_bayes_prediction))

              precision    recall  f1-score   support

           0       0.83      0.83      0.83       121
           1       0.86      0.86      0.86       155

    accuracy                           0.85       276
   macro avg       0.85      0.85      0.85       276
weighted avg       0.85      0.85      0.85       276



### **Training data analysis**

In [125]:
training_prediction = naive.predict(x_training)

In [126]:
print("Training Accuracy: %.2f%%" % (accuracy_score(y_training, training_prediction) * 100.0))

Training Accuracy: 86.12%


In [127]:
confusion_matrix(y_training, training_prediction)

array([[248,  41],
       [ 48, 304]])

### **Cross Validation**

In [128]:
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [129]:
gnb = GaussianNB()
prediction = cross_val_score(gnb, predictors, target, cv=kfold)

In [130]:
print("Mean accuracy: %.2f%%" % (prediction.mean() * 100.0))

Mean accuracy: 84.18%


### **All in one**

All types of preprocessed predictors together

In [136]:
all_predictors = {
  'Basic': predictors,
  'Scaled basic': scaled_predictors,
  'Encoded': label_encoded_predictors,
  'One hot encoded': label_and_one_hot_encoding_predictors,
  'Scaled one hot encoded': scaled_label_and_one_hotel_encoding_predictors
}

kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

for title, predictors in all_predictors.items():
  print(f"==== {title} ====")

  x_training, x_test, y_training, y_test = train_test_split(predictors, target, test_size = 0.3, random_state = 0)

  naive = GaussianNB()
  naive.fit(x_training, y_training);
  naive_bayes_prediction = naive.predict(x_test)
  print("Test Accuracy: %.2f%%" % (accuracy_score(y_test, naive_bayes_prediction) * 100.0))

  training_prediction = naive.predict(x_training)
  print("Training Accuracy: %.2f%%" % (accuracy_score(y_training, training_prediction) * 100.0))

  gnb = GaussianNB()
  prediction = cross_val_score(gnb, predictors, target, cv=kfold)
  print("Cross validation Accuracy: %.2f%%\n" % (prediction.mean() * 100.0))

==== Basic ====
Test Accuracy: 84.78%
Training Accuracy: 86.12%
Cross validation Accuracy: 85.17%

==== Scaled basic ====
Test Accuracy: 84.42%
Training Accuracy: 83.62%
Cross validation Accuracy: 84.18%

==== Encoded ====
Test Accuracy: 84.78%
Training Accuracy: 84.24%
Cross validation Accuracy: 84.17%

==== One hot encoded ====
Test Accuracy: 84.78%
Training Accuracy: 86.12%
Cross validation Accuracy: 85.17%

==== Scaled one hot encoded ====
Test Accuracy: 84.78%
Training Accuracy: 86.12%
Cross validation Accuracy: 85.17%

