Supervised Learning - Classification
-------------------------------------
Dataset : https://raw.githubusercontent.com/catharinamega/dataset/main/heart_failure_clinical_records_dataset.csv

classify the dataset using 3 methods: Logistic Regression, Naive Bayes, and K-Nearest Neighbor (with k=5)

## Import library

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Load Dataset

In [3]:
# Read CSV
df = pd.read_csv('https://raw.githubusercontent.com/catharinamega/dataset/main/heart_failure_clinical_records_dataset.csv')
df


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


In [4]:
print(df['creatinine_phosphokinase'].mode()[0])
print(df['creatinine_phosphokinase'].median())


582
250.0


Anemia - anemia is a condition in which you lack enough healthy red blood cells to carry adequate oxygen to your body's tissues. Having anemia, also referred to as low hemoglobin, can make you feel tired and weak. (there is not anemia - 0, there is anemia - 1)

Creatine_phosphokinase (CPK) - CPK is an enzyme in the body. It is found mainly in the heart, brain, and skeletal muscle. Total CPK normal values: 10 to 120 micrograms per liter (mcg/L)

Ejection_fraction (EF) - EF is a measurement, expressed as a percentage, of how much blood the left ventricle pumps out with each contraction. An ejection fraction of 60 percent means that 60 percent of the total amount of blood in the left ventricle is pushed out with each heartbeat. This indication of how well your heart is pumping out blood can help to diagnose and track heart failure. A normal heart’s ejection fraction may be between 50 and 70 percent.

Platelets - platelets are colorless blood cells that help blood clot. Platelets stop bleeding by clumping and forming plugs in blood vessel injuries. Thrombocytopenia might occur as a result of a bone marrow disorder such as leukemia or an immune system problem. The normal number of platelets in the blood is 150,000 to 400,000 platelets per microliter (mcL) or 150 to 400 × 109/L.

Serum_creatinine - The amount of creatinine in your blood should be relatively stable. An increased level of creatinine may be a sign of poor kidney function. Serum creatinine is reported as milligrams of creatinine to a deciliter of blood (mg/dL) or micromoles of creatinine to a liter of blood (micromoles/L). Here are the normal values by age: 0.9 to 1.3 mg/dL for adult males. 0.6 to 1.1 mg/dL for adult females. 0.5 to 1.0 mg/dL for children ages 3 to 18 years.

Serum_sodium - Measurement of serum sodium is routine in assessing electrolyte, acid-base, and water balance, as well as renal function. Sodium accounts for approximately 95% of the osmotically active substances in the extracellular compartment, provided that the patient is not in renal failure or does not have severe hyperglycemia. The normal range for blood sodium levels is 135 to 145 milliequivalents per liter (mEq/L).

Time - follow-up period (days)

High_blood_pressure - (True - 1, False - 0)

Age - between 40 - 95

Diabetes - (True - 1, False - 0)

Sex - (male - 1, female - 0)

Smoking - (True - 1, False - 0)

Death event - (True - 1, False - 0

### Arti dari Variabel

## Data Preprocessing

look for duplicated values 

In [5]:
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate

Duplicate Rows :


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT


find the missing value

In [6]:
df.isnull().sum()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


Separate the dataset into independent variables and dependent variables

In [7]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

Separate the train dataset and test dataset, with test size 0.1

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1)

scaling on the X_train and X_test variables.

In [9]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

LogisticRegression(random_state=0)

Test the model that already created with a test dataset

In [11]:
y_pred = classifier.predict(X_test)
print(y_pred)

[1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0]


Show the accuracy score and the confusion matrix

In [12]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
print("Confusion Matrix\n", cm)
print("accuracy score", accuracy_score(y_test,y_pred))

Confusion Matrix
 [[19  1]
 [ 3  7]]
accuracy score 0.8666666666666667


Coba prediksi data pasien dengan profil sebagai berikut, gunakan list pada bagian bawah soal untuk membantu:

* usia: 58 tahun
* memiliki anemia
* kadar creatinine phosphokinase: 60
* tidak menderita diabetes
* ejection fraction: 38
* tidak memiliki tekanan darah tinggi
* trombosit: 153000
* kadar serum creatinine: 5.8
* kadar serum sodium:134
* jenis kelamin laki-laki
* tidak merokok
* follow-up period time: 26 hari


    [58,1,60,0,38,0,153000,5.8,134,1,0,26]

In [13]:
print(classifier.predict([[58,1,60,0,38,0,153000,5.8,134,1,0,26]]))

[0]


# K-Nearest Neighbour (K-NN) Classifier

Create a K-NN model with n = 5, metric = 'euclidean', p = 2

In [14]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = "euclidean", p = 2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(metric='euclidean')

Test the model that already created with a test dataset

In [15]:
y_pred = classifier.predict(X_test)
print(y_pred)

[0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Show the accuracy score and the confusion matrix

In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score
# cm = confusion_matrix(y_test,y_pred)
# print("Confusion Matrix\n", cm)
# print("accuracy score", accuracy_score(y_test,y_pred))
cm2 = confusion_matrix(y_test, y_pred)
print("Confusion Matrix\n", cm2)

print("Accuracy Score: ", accuracy_score(y_test, y_pred))

Confusion Matrix
 [[19  1]
 [ 6  4]]
Accuracy Score:  0.7666666666666667


Coba prediksi dengan data pasien sebagai berikut:

* usia: 45 tahun
* tidak memiliki anemia
* kadar creatinine phosphokinase: 52
* menderita diabetes
* ejection fraction: 25
* memiliki tekanan darah tinggi
* trombosit: 276000
* kadar serum creatinine: 1.3
* kadar serum sodium:137
* jenis kelamin perempuan
* tidak merokok
* follow-up period time: 16 hari


    [45,0,52,1,25,1,276000,1.3,137,0,0,16]

In [17]:
print(classifier.predict(sc.transform([[45,0,52,1,25,1,276000,1.3,137,0,0,16]])))

[0]


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

Test the model that already created with a test dataset

In [19]:
y_pred = model.predict(X_test)
print(y_pred)

[1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]


Show the accuracy score and the confusion matrix

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm3 = confusion_matrix(y_test, y_pred)
print("Confusion Matrix\n", cm2)

print("Accuracy Score: ", accuracy_score(y_test, y_pred))

Confusion Matrix
 [[19  1]
 [ 6  4]]
Accuracy Score:  0.8666666666666667


Coba prediksi pasien dengan profil berikut:

* usia: 78 tahun
* memiliki anemia
* kadar creatinine phosphokinase: 159
* menderita diabetes
* ejection fraction: 38
* tidak memiliki tekanan darah tinggi
* trombosit: 148000
* kadar serum creatinine: 5.6
* kadar serum sodium:134
* jenis kelamin perempuan
* tidak merokok
* follow-up period time: 26 hari


    [78,1,159,1,38,0,148000,5.6,134,0,0,26]

In [21]:
print(classifier.predict([[78,1,159,1,38,0,148000,5.6,134,0,0,26]]))

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


[0]
