# Heart Disease Model Construction

Group AI-ven <br>
William Juniarta Hadiman - 13516026 <br>
Mochammad Alghifari - 13516038 <br>
Dion Saputra - 13516045 <br>
Rifo Ahmad Genadi - 13516111 <br>
Ivan Fadillah - 13516128

### Import Necessary Libraries

The libraries used to build this model are scikit-learn, pandas, numpy, and itertools. Scikit-learn is used to train the models, pandas holds the data in a dataframe, numpy is used when handling missing values, and itertools is used to iterate over the possible combinations of the available features.

In [1]:
import pandas as pd
import numpy as np
import itertools
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn import tree

### Import Dataset

The dataset comes from the file tubes2_HeartDisease_train.csv. The pandas library is used to read the CSV file into a pandas dataframe. `Column14` is the label; the remaining columns are the features.

In [2]:
file = "tubes2_HeartDisease_train.csv"
df = pd.read_csv(file)

# Column14 is the label; every other column is a feature
feature = df.drop(columns="Column14")
label = df["Column14"]

df.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
0,54,1,4,125,216,0,0,140,0,0.0,?,?,?,1
1,55,1,4,158,217,0,0,110,1,2.5,2,?,?,1
2,54,0,3,135,304,1,0,170,0,0.0,1,0,3,0
3,48,0,3,120,195,0,0,125,0,0.0,?,?,?,0
4,50,1,4,120,0,0,1,156,1,0.0,1,?,6,3
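
The `?` entries in the preview are placeholders for missing values. As a minimal sketch, using a hypothetical inline sample rather than the real file, the `na_values` parameter of `pd.read_csv` can parse that placeholder as `NaN` while reading, so the affected columns stay numeric. The names `sample_csv`, `sample_df`, and `missing_per_column` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the last columns of the training file.
sample_csv = io.StringIO(
    "Column10,Column11,Column12,Column13,Column14\n"
    "0.0,?,?,?,1\n"
    "2.5,2,?,?,1\n"
    "0.0,1,0,3,0\n"
)

# na_values="?" makes pandas treat the placeholder as NaN while parsing.
sample_df = pd.read_csv(sample_csv, na_values="?")

# Count the missing entries per column and inspect the resulting dtypes.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
print(sample_df.dtypes)
```

Parsing the placeholder at read time is one option; replacing it afterwards with `DataFrame.replace` gives the same missing-value mask but leaves the affected columns as strings until they are converted.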


### Handling Missing Values 

The training dataset contains missing values in several attributes, marked with the character '?'. These must be handled before the data can be fed to the models. In this exploration, a missing value is replaced with the mean or the mode of its attribute: the mean for attributes with continuous values and the mode for attributes with discrete values. Missing values are handled only for the feature attributes of the data.

In [3]:
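header = feature.columns.values.tolist()
# Turn the '?' placeholder into NaN so the imputers can detect it
feature_impute = feature.replace('?',np.nan)

# Mode for discrete attributes, mean for continuous attributes
imputer_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_mean = SimpleImputer(missing_values=np.nan, strategy='mean')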

discrete_value = ['Column1','Column2','Column3','Column6','Column7','Column9','Column13']
continues_value = ['Column4','Column5','Column8','Column10','Column11','Column12']

imputer_mode.fit(feature_impute[discrete_value])
feature_impute[discrete_value] = imputer_mode.transform(feature_impute[discrete_value])

imputer_mean.fit(feature_impute[continues_value])
feature_impute[continues_value] = imputer_mean.transform(feature_impute[continues_value])

# Make sure Column13 is numeric after mode imputation
feature_impute['Column13'] = pd.to_numeric(feature_impute['Column13'])
feature_impute.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,54,1,4,125.0,216.0,0,0,140.0,0,0.0,1.762089,0.686792,3
1,55,1,4,158.0,217.0,0,0,110.0,1,2.5,2.0,0.686792,3
2,54,0,3,135.0,304.0,1,0,170.0,0,0.0,1.0,0.0,3
3,48,0,3,120.0,195.0,0,0,125.0,0,0.0,1.762089,0.686792,3
4,50,1,4,120.0,0.0,0,1,156.0,1,0.0,1.0,0.686792,6
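
To make the mean/mode rule concrete, here is a small sketch on hypothetical toy data (the frame `toy` and its column names are not part of the dataset). It applies `SimpleImputer` with `most_frequent` to a discrete column and with `mean` to a continuous one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one discrete and one continuous attribute,
# each with a single missing value.
toy = pd.DataFrame({
    "discrete": [3.0, 3.0, 7.0, np.nan],
    "continuous": [1.0, 2.0, np.nan, 3.0],
})

mode_imputer = SimpleImputer(strategy="most_frequent")
mean_imputer = SimpleImputer(strategy="mean")

# The discrete gap becomes the mode (3.0); the continuous gap becomes
# the mean of the observed values (2.0).
toy[["discrete"]] = mode_imputer.fit_transform(toy[["discrete"]])
toy[["continuous"]] = mean_imputer.fit_transform(toy[["continuous"]])
print(toy)
```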


### Penanganan Outlier

An outlier is an instance that is abnormal relative to the rest of the data. Outliers need to be removed because they may come from noisy measurements. In this exploration, an instance is treated as an outlier if its value for any continuous attribute falls outside the range $\mu \pm 2\sigma$.

In [4]:
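idx_to_drop = []    # positional indices of rows flagged as outliers

for item in continues_value:
    mean = feature_impute[item].mean()
    std = feature_impute[item].std()
    # Values more than two standard deviations from the mean are outliers
    low_threshold = mean - 2*std
    high_threshold = mean + 2*std
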
    for i in range(feature_impute[item].shape[0]):
        cur_value = feature_impute[item].iloc[i]
        if (cur_value < low_threshold or cur_value > high_threshold):
            idx_to_drop.append(i)

# Drop the flagged rows from both the features and the labels
feature_impute.drop(feature_impute.index[idx_to_drop],inplace=True)
label.drop(label.index[idx_to_drop],inplace=True)

feature_impute.describe()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column8,Column10,Column11,Column12,Column13
count,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0,598.0
mean,52.369565,0.764214,3.217391,130.387932,200.704864,140.356731,2.347483,1.663715,0.567426,3.794314
std,9.233553,0.424844,0.918577,14.153952,102.705054,23.381618,4.142096,0.388179,0.301544,1.557228
min,28.0,0.0,1.0,96.0,0.0,88.0,-2.6,1.0,0.0,3.0
25%,46.0,1.0,2.0,120.0,183.25,122.0,0.0,1.762089,0.686792,3.0
50%,53.0,1.0,4.0,130.0,221.5,140.0,0.5,1.762089,0.686792,3.0
75%,59.0,1.0,4.0,140.0,268.0,159.0,2.95,2.0,0.686792,3.0
max,77.0,1.0,4.0,165.0,412.0,188.0,19.0,2.0,1.0,7.0
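
The same two-standard-deviation rule can also be written without an explicit row loop. The sketch below is one possible vectorized form, run on hypothetical toy data; `outlier_mask`, `toy`, and `toy_label` are illustrative names, not part of the original notebook.

```python
import pandas as pd

def outlier_mask(frame, columns, n_std=2.0):
    """Return a boolean Series that is True for rows with any value
    outside mean +/- n_std * std in the given columns."""
    subset = frame[columns]
    low = subset.mean() - n_std * subset.std()
    high = subset.mean() + n_std * subset.std()
    return ((subset < low) | (subset > high)).any(axis=1)

# Hypothetical toy data: the last row of column "a" is far from the others.
toy = pd.DataFrame({"a": [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 100.0],
                    "b": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]})
toy_label = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Apply one mask to features and labels so that they stay aligned.
mask = outlier_mask(toy, ["a", "b"])
toy_clean = toy[~mask]
toy_label_clean = toy_label[~mask]
print(toy_clean.shape)
```

Filtering features and labels with the same boolean mask keeps the two aligned, which is the property the `idx_to_drop` list provides in the cell above.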


### Feature Scaling

In this dataset, the feature attributes have different value ranges: some are in the units, others in the hundreds. This difference can cause attributes with a large range to contribute disproportionately to the model's calculations. Feature scaling transforms the attribute values so that all attributes are on a comparable scale (zero mean and unit variance).

In [5]:
feature_scale = pd.DataFrame(preprocessing.scale(feature_impute), columns=header)
feature_scale.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13
0,0.176725,0.555458,0.852692,-0.380985,0.149048,-0.387298,-0.693683,-0.01527,-0.700908,-0.567213,0.253637,0.396182,-0.510509
1,0.285116,0.555458,0.852692,1.952471,0.158792,-0.387298,-0.693683,-1.299403,1.426721,0.036852,0.86704,0.396182,-0.510509
2,0.176725,-1.800315,-0.236859,0.326123,1.006587,2.581989,-0.693683,1.268864,-0.700908,-0.567213,-1.711246,-1.883308,-0.510509
3,-0.473623,-1.800315,-0.236859,-0.734539,-0.055593,-0.387298,-0.693683,-0.657336,-0.700908,-0.567213,0.253637,0.396182,-0.510509
4,-0.25684,0.555458,0.852692,-0.734539,-1.955823,-0.387298,0.619047,0.669601,1.426721,-0.567213,-1.711246,0.396182,1.417604
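
As a small illustration of what `preprocessing.scale` computes, the sketch below standardizes hypothetical toy data by hand with $z = (x - \mu) / \sigma$, using the population standard deviation, and compares the result with the library call. The names `toy`, `manual`, and `scaled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data with very different ranges per column.
toy = pd.DataFrame({"small": [0.0, 1.0, 2.0, 3.0],
                    "large": [100.0, 200.0, 300.0, 400.0]})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

# The library call should give the same values column by column.
scaled = pd.DataFrame(preprocessing.scale(toy), columns=toy.columns)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1 regardless of their original ranges.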


### Feature Selection 

Feature selection is used to choose the features that affect model performance. The method used is Forward Selection. First, all possible combinations of the features are listed. Then, starting from 0 features, model performance is measured, and the number of features is increased until performance no longer improves.

In [6]:
# feature_combinations[i] holds every feature subset of size i + 1
feature_combinations = []

for i in range(1,14):
    feature_combinations.append(list(itertools.combinations(header,i)))
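
The cell above enumerates every subset of every size. For comparison, here is a sketch of a greedy forward selection, which adds one feature at a time and stops when the mean cross-validated accuracy no longer improves. This is an illustrative variant on hypothetical synthetic data, not the procedure used in this notebook; `forward_selection` and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, model, cv=5):
    """Greedily add the feature that most improves mean CV accuracy;
    stop when no remaining feature improves it."""
    selected, best_score = [], 0.0
    remaining = list(X.columns)
    while remaining:
        # Score each candidate feature together with those already selected.
        scores = {
            col: cross_val_score(model, X[selected + [col]], y, cv=cv).mean()
            for col in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

# Hypothetical synthetic data: only "signal" is related to the label.
rng = np.random.default_rng(0)
y = pd.Series(np.repeat([0, 1], 50))
X = pd.DataFrame({
    "noise_a": rng.normal(size=100),
    "signal": y + rng.normal(scale=0.1, size=100),
    "noise_b": rng.normal(size=100),
})

selected, score = forward_selection(X, y, GaussianNB())
print(selected, round(score, 3))
```

The greedy variant evaluates far fewer subsets than exhaustive enumeration, at the cost of possibly missing a combination whose features are only useful together.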

### Naive Bayes 

The Naive Bayes variant used in this exploration is Gaussian Naive Bayes. The model is built using K-Fold with 10 folds while searching for the feature combination that maximizes model accuracy. For Naive Bayes, the number of features that optimizes performance is 10, namely ['Column1', 'Column3', 'Column4', 'Column5', 'Column7', 'Column9', 'Column10', 'Column11', 'Column12', 'Column13']. The accuracy obtained for the Naive Bayes model is 75%, which is the best single-fold accuracy found by the cell below.

In [7]:
kf = KFold(n_splits=10)

best_model = None
best_accuracy = 0.0
best_header = None

i = 9    # index into feature_combinations; i = 9 selects the 10-feature subsets

for j in range(len(feature_combinations[i])):
    cur_header = list(feature_combinations[i][j])

    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]       

        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 

        cur_model = GaussianNB(var_smoothing=0.001).fit(X_train, y_train)
        y_predict = cur_model.predict(X_test)
        cur_accuracy = accuracy_score(y_test, y_predict)

        if (cur_accuracy > best_accuracy):
            best_model = cur_model
            best_accuracy = cur_accuracy
            best_header = cur_header
    
print('Best Feature: ', best_header)
print('Best Accuracy: ', best_accuracy)

Best Feature:  ['Column1', 'Column3', 'Column4', 'Column5', 'Column7', 'Column9', 'Column10', 'Column11', 'Column12', 'Column13']
Best Accuracy:  0.75
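
The accuracy printed above is the highest score among individual folds. As a complementary sketch on hypothetical synthetic data, averaging the fold scores with `cross_val_score` gives an estimate that depends less on one favorable fold. The names `X_demo`, `y_demo`, and `kf_demo` are illustrative stand-ins for the scaled features, labels, and splitter.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data standing in for the scaled features and labels.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 3)) + y_demo[:, None]

# One accuracy value per fold, using the same model settings as above.
kf_demo = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    GaussianNB(var_smoothing=0.001), X_demo, y_demo, cv=kf_demo, scoring="accuracy"
)

print("Best single fold:", fold_scores.max())
print("Mean over folds: ", fold_scores.mean())
```

The mean can never exceed the best single fold, so comparing feature subsets by the mean is the more conservative choice.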


### Decision Tree ID3 

In [8]:
kf = KFold(n_splits=10)

best_model = None
best_accuracy = 0.0
best_header = None

i = 6    # index into feature_combinations; i = 6 selects the 7-feature subsets

for j in range(len(feature_combinations[i])):
    cur_header = list(feature_combinations[i][j])

    cur_feature = feature_scale[cur_header]
    for train_idx,test_idx in kf.split(cur_feature):
        X_train = cur_feature.iloc[train_idx] 
        y_train = label.iloc[train_idx]       

        X_test = cur_feature.iloc[test_idx]
        y_test = label.iloc[test_idx] 

        cur_model = tree.DecisionTreeClassifier(
            criterion='entropy', 
            min_samples_leaf=34, 
            max_depth=5, 
        ).fit(X_train, y_train)
        
        y_predict = cur_model.predict(X_test)
        cur_accuracy = accuracy_score(y_test, y_predict)

        if (cur_accuracy > best_accuracy):
            best_model = cur_model
            best_accuracy = cur_accuracy
            best_header = cur_header
    
print('Best Feature: ', best_header)
print('Best Accuracy: ', best_accuracy)

Best Feature:  ['Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column8', 'Column9']
Best Accuracy:  0.7333333333333333
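
scikit-learn's `DecisionTreeClassifier` implements an optimized version of CART; setting `criterion='entropy'` makes it choose splits by information gain, the measure that ID3 is based on. The sketch below computes that quantity by hand on hypothetical toy labels. `entropy` and `information_gain` are illustrative helpers, not scikit-learn functions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, left_mask):
    """Entropy reduction from splitting labels into left_mask / ~left_mask."""
    labels = np.asarray(labels)
    left_mask = np.asarray(left_mask, dtype=bool)
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy labels with two candidate binary splits.
toy_labels = np.array([0, 0, 1, 1])
perfect_split = np.array([True, True, False, False])   # separates the classes
useless_split = np.array([True, False, True, False])   # mixes them evenly

print(information_gain(toy_labels, perfect_split))
print(information_gain(toy_labels, useless_split))
```

A split that separates the classes completely recovers the full entropy of the parent node, while a split that leaves both sides evenly mixed gains nothing.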
