# Heart Disease Model Construction

Kelompok AI-ven <br>
William Juniarta Hadiman - 13516026 <br>
Mochammad Alghifari - 13516038 <br>
Dion Saputra - 13516045 <br>
Rifo Ahmad Genadi - 13516111 <br>
Ivan Fadillah - 13516128

### Import Necessary Library 

Library yang digunakan pada pembentukan model ini adalah scikit-learn, pandas, numpy, dan itertools. Scikit-learn digunakan untuk training model, pandas digunakan untuk menampung data ke dalam dataframe, numpy digunakan pada penanganan missing values, dan itertools digunakan untuk mengiterasi kombinasi yang mungkin dari feature yang ada

In [103]:
import pandas as pd
import numpy as np
import itertools
import math
from sklearn.externals import joblib
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

### Import Dataset

Dataset yang digunakan berasal dari file tubes2_HeartDisease_train.csv. Digunakan library pandas untuk membaca file csv tersebut ke dalam dataframe pandas

In [96]:
file = "tubes2_HeartDisease_train.csv"
df = pd.read_csv(file)

feature = df.drop("Column14",inplace=False,axis=1)
label = df["Column14"]

df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
0,54,1,4,125,216,0,0,140,0,0.0,?,?,?,1
1,55,1,4,158,217,0,0,110,1,2.5,2,?,?,1
2,54,0,3,135,304,1,0,170,0,0.0,1,0,3,0
3,48,0,3,120,195,0,0,125,0,0.0,?,?,?,0
4,50,1,4,120,0,0,1,156,1,0.0,1,?,6,3


### Handling Missing Values 

Pada dataset training yang digunakan terdapat missing values pada beberapa atribut. Missing values ditandai dengan character '?. Missing values tersebut perlu ditangani agar dapat dijalankan pada model yang akan digunakan. Pada eksplorasi ini, missing values diganti dengan nilai mean atau modus dari atribut tersebut. Penggantian dengan nilai mean digunakan untuk atribut dengan value kontinu, sedangkan nilai modus digunakan untuk atribut dengan value diskrit. Handling values hanya dilakukan untuk atribut yang merubakan feature dari data

In [97]:
header = feature.columns.values.tolist()
feature_impute = feature.replace('?',np.nan)

imputer_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

discrete_value = ['Column1','Column2','Column3','Column6','Column7','Column9','Column13']
continues_value = ['Column4','Column5','Column8','Column10','Column11','Column12']

imputer_mode.fit(feature_impute[discrete_value])
feature_impute[discrete_value] = imputer_mode.transform(feature_impute[discrete_value])

imputer_mean.fit(feature_impute[continues_value])
feature_impute[continues_value] = imputer_mean.transform(feature_impute[continues_value])

feature_impute['Column13'] = pd.to_numeric(feature_impute['Column13'])
feature_impute.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,54,1,4,125.0,216.0,0,0,140.0,0,0.0,1.762089,0.686792,3
1,55,1,4,158.0,217.0,0,0,110.0,1,2.5,2.0,0.686792,3
2,54,0,3,135.0,304.0,1,0,170.0,0,0.0,1.0,0.0,3
3,48,0,3,120.0,195.0,0,0,125.0,0,0.0,1.762089,0.686792,3
4,50,1,4,120.0,0.0,0,1,156.0,1,0.0,1.0,0.686792,6


### Penanganan Outlier

Outlier merupakan instance yang tidak normal terhadap data lainnya. Outlier perlu dihilangkan karena dikhawatirkan instance tersebut didapatkan dari pengukuran yang bersifat noisy. Pada eksplorasi ini, sebuah instance disebut outlier jika value pada atribut tertentu tidak berada dalam range $\mu \pm \sigma$.

In [98]:
idx_to_drop = []

for item in continues_value:
    mean = feature_impute[item].mean()
    std = feature_impute[item].std()
    low_threshold = mean - 2*std
    high_threshold = mean + 2*std
        
    for i in range(feature_impute[item].shape[0]):
        cur_value = feature_impute[item].iloc[i]
        if (cur_value < low_threshold or cur_value > high_threshold):
            idx_to_drop.append(i)

feature_impute.drop(feature_impute.index[idx_to_drop],inplace=True)
label.drop(label.index[idx_to_drop],inplace=True)

feature_impute.describe()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column8,Column10,Column11,Column12,Column13
count,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0
mean,52.369565,0.764214,3.217391,130.387932,200.704864,140.356731,2.347483,1.663715,0.567426,3.794314
std,9.233553,0.424844,0.918577,14.153952,102.705054,23.381618,4.142096,0.388179,0.301544,1.557228
min,28.0,0.0,1.0,96.0,0.0,88.0,-2.6,1.0,0.0,3.0
25%,46.0,1.0,2.0,120.0,183.25,122.0,0.0,1.762089,0.686792,3.0
50%,53.0,1.0,4.0,130.0,221.5,140.0,0.5,1.762089,0.686792,3.0
75%,59.0,1.0,4.0,140.0,268.0,159.0,2.95,2.0,0.686792,3.0
max,77.0,1.0,4.0,165.0,412.0,188.0,19.0,2.0,1.0,7.0


### Feature Scalling

Pada dataset yang digunakan antar-atribut pada feature memiliki range nilai yang berbeda. Terdapat atribut dengan range nilai satuan dan terdapat pula atribut dengan range nilai ratusan. Perbedaan range ini dapat menyebabkan atribut dengan range besar memiliki kontribusi yang besar terhadap perhitungan model. Feature Scalling akan merubah nilai atribut tersebut sedemikian sehingga range pada setiap atribut tersebut sama.

In [99]:
feature_scale = pd.DataFrame(preprocessing.scale(feature_impute), columns=header)
feature_scale.head()

  """Entry point for launching an IPython kernel.


Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,0.176725,0.555458,0.852692,-0.380985,0.149048,-0.387298,-0.693683,-0.01527,-0.700908,-0.567213,0.253637,0.396182,-0.510509
1,0.285116,0.555458,0.852692,1.952471,0.158792,-0.387298,-0.693683,-1.299403,1.426721,0.036852,0.86704,0.396182,-0.510509
2,0.176725,-1.800315,-0.236859,0.326123,1.006587,2.581989,-0.693683,1.268864,-0.700908,-0.567213,-1.711246,-1.883308,-0.510509
3,-0.473623,-1.800315,-0.236859,-0.734539,-0.055593,-0.387298,-0.693683,-0.657336,-0.700908,-0.567213,0.253637,0.396182,-0.510509
4,-0.25684,0.555458,0.852692,-0.734539,-1.955823,-0.387298,0.619047,0.669601,1.426721,-0.567213,-1.711246,0.396182,1.417604


### Feature Selection 

Feature Selection digunakan untuk memilih feature yang memiliki pengaruh terhadap kinerja model. Feature selection yang digunakan adalah Forward Selection. Sebelumnya didaftarkan terlebih dahulu semua kemungkinan dari kombinasi feature yang mungkin. Lalu dimulai dari 0 feature, kinerja model dihitung. Kemudian jumlah feature ditambahkan hingga kinerja model tidak meningkat lagi.

In [100]:
feature_combinations = []

for i in range(1,14):
    feature_combinations.append(list(itertools.combinations(header,i)))  

### Ukuran Kinerja Model 

Ukuran kinerja model akan ditentukan berdasarkan nilai accuracy, precission, dan recall dari model yang dihasilkan. 

### K-Nearest Neighboard

K-Nearest neighboard dibangun dengan memilih nilai k yang mengoptimalkan kinerja KNN dengan mengiterasi k dari 1 hingga $\sqrt(n)$. Untuk split datanya digunakan metode 10-fold. Berdasarkan eksperimen diperoleh nilai k yang optimal yaitu untuk k = 6 dengan feature yang dipilih adalah ['Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column8', 'Column9']. Model ini mencapai akurasi 80%

In [102]:
kf = KFold(n_splits=10)

choosen_k = 0;
best_accuracy_knn = 0.0
best_header_knn = None

best_label_knn = None
best_predict_knn = None

i = 7    # jumlah feature yang digunakan
s = 149
e = 150

for j in range(s,e):
    cur_header = []
    for k in range(len(feature_combinations[i][j])):
        cur_header.append(feature_combinations[i][j][k])

    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]       

        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 
        
        for t in range(1,int(math.sqrt(y_train.count()))):
            cur_model = KNeighborsClassifier(n_neighbors=t).fit(X_train, y_train)
            y_predict = cur_model.predict(X_test)
            cur_accuracy = accuracy_score(y_test, y_predict)

            if (cur_accuracy > best_accuracy_knn):
                choosen_k = t
                best_accuracy_knn = cur_accuracy
                best_header_knn = cur_header
                best_label_knn = y_test
                best_predict_knn = y_predict

print('Choosen k: ', choosen_k)
print('Choosen feature: ', best_header)
print('Accuracy: ', best_accuracy_knn)
print('Precission: ', precision_score(best_label_knn,best_predict_knn,average="micro"))
print('Recall: ', recall_score(best_label_knn,best_predict_knn,average="micro"))

Choosen k:  6
Choosen feature:  ['Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column8', 'Column9']
Accuracy:  0.8
Precission:  0.8
Recall:  0.8


### Naive Bayes 

Naive Bayes yang digunakan pada eksplorasi ini adalah Gaussian Naive Bayes. Pembangunan model menggunakan K-Fold dengan 10 Fold sambil mencari kombinasi feature yang mengoptimalkan accuracy model. Untuk naive bayes, jumlah feature yang mengoptimalkan kinerja adalah 10 feature yaitu ['Column1', 'Column3', 'Column4', 'Column5', 'Column7', 'Column9', 'Column10', 'Column11', 'Column12', 'Column13']. Akurasi dari model Naive Bayes yang didapatkan yaitu 75%

In [81]:
kf = KFold(n_splits=10)

best_model_nb = None
best_accuracy_nb = 0.0
best_header_nb = None

best_label_nb = None
best_predict_nb = None

i = 9    # jumlah feature yang digunakan
s = 191
e = 192

for j in range(s,e):
    cur_header = []
    for k in range(len(feature_combinations[i][j])):
        cur_header.append(feature_combinations[i][j][k])

    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]       

        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 

        cur_model = GaussianNB(var_smoothing=0.001).fit(X_train, y_train)
        y_predict = cur_model.predict(X_test)
        cur_accuracy = accuracy_score(y_test, y_predict)

        if (cur_accuracy > best_accuracy_nb):
            best_model_nb = cur_model
            best_accuracy_nb = cur_accuracy
            best_header_nb = cur_header
            best_label_nb = y_test
            best_predict_nb = y_predict
    
print('Choosen feature: ', best_header_nb)
print('Accuracy: ', best_accuracy_nb)
print('Precission: ', precision_score(best_label_nb,best_predict_nb,average="micro"))
print('Recall: ', recall_score(best_label_nb,best_predict_nb,average="micro"))

Choosen feature:  ['Column1', 'Column3', 'Column4', 'Column5', 'Column7', 'Column9', 'Column10', 'Column11', 'Column12', 'Column13']
Accuracy:  0.75
Precission:  0.75
Recall:  0.75


### Decision Tree ID3 

Decision Tree yang digunakan pada eksplorasi ini adalah Decision Tree ID3 dengan mengeset parameter criterion menjadi entropy. Pembangunan model menggunakan K-Fold dengan 10 Fold sambil mencari kombinasi feature yang mengoptimalkan accuracy model. Untuk Decision Tree, jumlah feature yang mengoptimalkan kinerja adalah 7 yaitu ['Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column8', 'Column9']. Akurasi dari model Decisin Tree ID3 yang didapatkan yaitu 73%

In [80]:
kf = KFold(n_splits=10)

best_model_dt = None
best_accuracy_dt = 0.0
best_header_dt = None

best_label_dt = None
best_predict_dt = None

i = 6    # jumlah feature yang digunakan
s = 930
e = 931

for j in range(s,e):
    cur_header = []
    for k in range(len(feature_combinations[i][j])):
        cur_header.append(feature_combinations[i][j][k])

    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]       

        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 

        cur_model = tree.DecisionTreeClassifier(
            criterion='entropy', 
            min_samples_leaf=34, 
            max_depth=5, 
        ).fit(X_train, y_train)
        
        y_predict = cur_model.predict(X_test)
        cur_accuracy = accuracy_score(y_test, y_predict)

        if (cur_accuracy > best_accuracy_dt):
            best_model_dt = cur_model
            best_accuracy_dt = cur_accuracy
            best_header_dt = cur_header
            best_label_dt = y_test
            best_predict_dt = y_predict
    
print('Choosen feature: ', best_header_dt)
print('Accuracy: ', best_accuracy_dt)
print('Precission: ', precision_score(best_label_dt,best_predict_dt,average="micro"))
print('Recall: ', recall_score(best_label_dt,best_predict_dt,average="micro"))

Choosen feature:  ['Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column8', 'Column9']
Accuracy:  0.7333333333333333
Precission:  0.7333333333333333
Recall:  0.7333333333333333


### Multi Layer Perceptron

Isu pertama dalam optimasi model MLP adalah menentukan parameter-parameter MLPClassifier. Untuk parameter solver, kami memilih lbfgs yang cocok digunakan untuk data berukuran kecil. Untuk parameter activation kami memilih logistic (sigmoid) karena itu yang sudah diajarkan di kelas. Dan random_state diisi 100 (bebas, asalkan bukan false) agar parameter-parameter initialization seperti weight, bias, dll sama. Sedangkan parameter alpha, hidden_layer_sizes, dan learning_rates kami coba mencari kombinasi yang terbaik dari beberapa nilai yang kami tawarkan. Hasilnya hidden_layer_sizes = (7, 11, 7), learning_rates = constant, dan alpha = 0.001.

Isu optimasi selanjutnya adalah pemilihan feature yang digunakan. Dengan pertimbangan jumlah feature yang tidak terlalu banyak, dalam pemilihan feature kami mencari kombinasi feature terbaik dari keseluruhan kombinasi yang ada. Kami juga menggunakan strategi K-Fold untuk mencari dataset yang menghasilkan akurasi paling baik.

Untuk MLP, jumlah feature yang menghasilkan akurasi model MLP terbaik adalah 5 yaitu ['Column4', 'Column5', 'Column6', 'Column11', 'Column12'] . Akurasi dari model ini adalah 81%

In [78]:
kf = KFold(n_splits=10)

best_model_mlp = None
best_accuracy_mlp = 0.0
best_header_mlp = None

best_label_mlp = None
best_predict_mlp = None

i = 4
s = 1053
e = 1054
for j in range(s,e):
    cur_header = []
    for k in range(len(feature_combinations[i][j])):
        cur_header.append(feature_combinations[i][j][k])
        
    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]      
        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 

        list_hidden_layer_sizes = [(7,11,7), (5)]
        list_alpha = [0.0001, 0.001]
        list_learning_rate = ['constant','adaptive']

        for hidden_layer_sizes_ in list_hidden_layer_sizes:
            for alpha_ in list_alpha:
                for learning_rate_ in list_learning_rate:                
                    cur_model = MLPClassifier(solver='lbfgs', activation='logistic', random_state=100, alpha=alpha_,
                                     hidden_layer_sizes=hidden_layer_sizes_,  learning_rate=learning_rate_).fit(X_train,y_train)
                    y_predict = cur_model.predict(X_test)
                    cur_accuracy = accuracy_score(y_test, y_predict)

                    if (cur_accuracy > best_accuracy_mlp):
                        best_model_mlp = cur_model
                        best_accuracy_mlp = cur_accuracy
                        best_header_mlp = cur_header
                        best_parameter = {
                            'alpha' : alpha_,
                            'hidden_layer_sizes' : hidden_layer_sizes_,
                            'learning_rate' : learning_rate_
                        }
                        best_label_mlp = y_test
                        best_predict_mlp = y_predict

print('Choosen feature: ', best_header_mlp)
print('Choosen parameter: ', best_parameter)
print('Accuracy: ', best_accuracy_mlp)
print('Precission: ', precision_score(best_label_mlp,best_predict_mlp,average="micro"))
print('Recall: ', recall_score(best_label_mlp,best_predict_mlp,average="micro"))

Choosen feature:  ['Column4', 'Column5', 'Column6', 'Column11', 'Column12']
Choosen parameter:  {'alpha': 0.001, 'hidden_layer_sizes': (7, 11, 7), 'learning_rate': 'constant'}
Accuracy:  0.8166666666666667
Precission:  0.8166666666666667
Recall:  0.8166666666666667


### Analisis Model 

Berdasarkan hasil eksperimen menggunakan model knn, naive bayes, dt, dan mlp, model mlp menghasilkan statistik terbaik dengan akurasi, presisi, recall masing-masing 81.67%.
Dilihat dari segi feature yang dipakai masing-masing model, model mlp memiliki jumlah feature yang paling sedikit diantara model yang lain yaitu 5. Sehingga, model mlp adalah model yang paling general diantara yang lain.
Oleh karena itu, model terbaik dalam tugas kali ini adalah model mlp.

### Save Best Model 

In [105]:
joblib.dump(best_model_mlp, 'best_model_mlp.pkl') 

['best_model_mlp.pkl']

### Predict Data Test 

Data test diambil dari file tubes2_HeartDisease_test.csv disimpan pada dataframe pandas. 

In [86]:
file_test = "tubes2_HeartDisease_test.csv"
df_test = pd.read_csv(file_test)

feature_test = df_test

In [88]:
feature_impute_test = feature_test.replace('?',np.nan)

imputer_mode.fit(feature_impute_test[discrete_value])
feature_impute_test[discrete_value] = imputer_mode.transform(feature_impute_test[discrete_value])

imputer_mean.fit(feature_impute_test[continues_value])
feature_impute_test[continues_value] = imputer_mean.transform(feature_impute_test[continues_value])

feature_impute_test['Column13'] = pd.to_numeric(feature_impute_test['Column13'])
feature_impute_test.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,60,1,2,160.0,267.0,1,1,157.0,0,0.5,2.0,0.613636,7
1,61,1,4,148.0,203.0,0,0,161.0,0,0.0,1.0,1.0,7
2,54,1,4,130.0,242.0,0,0,91.0,1,1.0,2.0,0.613636,7
3,48,1,4,120.0,260.0,0,0,115.0,0,2.0,2.0,0.613636,7
4,57,0,1,130.0,308.0,0,0,98.0,0,1.0,2.0,0.613636,7


In [90]:
feature_scale_test = pd.DataFrame(preprocessing.scale(feature_impute_test), columns=header)
feature_scale_test.head()

  """Entry point for launching an IPython kernel.


Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,0.722781,0.42997,-1.228609,1.634929,0.660909,2.054805,0.495034,1.01946,-0.7298,-0.424101,0.368695,0.0,0.47701
1,0.834283,0.42997,0.871196,0.961467,0.092894,-0.486664,-0.774053,1.189424,-0.7298,-0.488818,-1.669972,0.740049,0.47701
2,0.053774,0.42997,0.871196,-0.048726,0.439028,-0.486664,-0.774053,-1.784953,1.370238,-0.359384,0.368695,0.0,0.47701
3,-0.615234,0.42997,0.871196,-0.609944,0.598783,-0.486664,-0.774053,-0.765167,-0.7298,-0.229949,0.368695,0.0,0.47701
4,0.388278,-2.325745,-2.278512,-0.048726,1.024794,-0.486664,-0.774053,-1.487515,-0.7298,-0.359384,0.368695,0.0,0.47701


In [106]:
model = joblib.load("best_model_mlp.pkl")

y_predict_test = model.predict(feature_scale_test[best_header_mlp])

y_predict_test

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0])