
In this exercise we use data to predict the survival of patients who have undergone breast cancer surgery. Using the information available about each patient, we build a model that predicts whether the patient will survive for 5 years or longer.
 
For more details about the dataset, see: https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.names
 
Build a Decision Tree classification model to predict patient status, following these steps:
1.     Split the data into training and test sets with test_size=0.25. (coding)
2.     Read about the roc_auc_score metric, then build a model and evaluate it with cross-validation using scoring 'roc_auc'. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html for how to use the roc_auc metric during cross-validation. (coding)
3.     What is the model's mean score under this cross-validation? (coding)
4.     Predict the test data with the model you built! (coding)
5.     What does the confusion matrix of those predictions look like? (coding)
6.     What does the classification report of those predictions look like? (coding)
7.     How well does your model predict that a patient has positive status? (Descriptive)
8.     How well does your model predict that a patient has negative status? (Descriptive)

link: https://www.kaggle.com/gilsousa/habermans-survival-data-set
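
Task 2 relies on ROC AUC, which can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting; the function name `roc_auc` and the toy labels and scores are invented for illustration and are not part of the notebook or of scikit-learn.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the cross-validation scores later in this notebook should be read.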

In [50]:
#import library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree,decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

#Load Dataset

In [51]:
df= pd.read_csv('haberman.csv', header=None,names=["Age", "Year", "Positive_axillary_nodes", "Survival_status"])

Attribute Information:
   1. Age of patient at time of operation (numerical)
   2. Patient's year of operation (year - 1900, numerical)
   3. Number of positive axillary nodes detected (numerical)
   4. Survival status (class attribute)
         1 = the patient survived 5 years or longer
         2 = the patient died within 5 years

In [52]:
df

     Age  Year  Positive_axillary_nodes  Survival_status
0     30    64                        1                1
1     30    62                        3                1
2     30    65                        0                1
3     31    59                        2                1
4     31    65                        4                1
...  ...   ...                      ...              ...
301   75    62                        1                1
302   76    67                        0                1
303   77    65                        3                1
304   78    65                        1                2


#Data Preprocess

In [53]:
#check data
df.describe()

             Age       Year  Positive_axillary_nodes  Survival_status
count      306.0      306.0                    306.0            306.0
mean   52.457516  62.852941                 4.026144         1.264706
std    10.803452   3.249405                 7.189654         0.441899
min         30.0       58.0                      0.0              1.0
25%         44.0       60.0                      0.0              1.0
50%         52.0       63.0                      1.0              1.0
75%        60.75      65.75                      4.0              2.0
max         83.0       69.0                     52.0              2.0


In [54]:
#check missing value
df.isnull().sum()

Age                        0
Year                       0
Positive_axillary_nodes    0
Survival_status            0
dtype: int64

In [55]:
# Recode the target: 2 (died within 5 years) becomes 0, everything else (survived) becomes 1
df['Survival_status'] = (df['Survival_status'] != 2).astype(int)

In [56]:
#check target
print(Counter(df['Survival_status']))

Counter({1: 225, 0: 81})
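
The counts above show an imbalanced target, so plain accuracy needs a reference point. A minimal sketch, using only the two counts printed in this cell, of the accuracy a model would reach by always predicting the majority class:

```python
from collections import Counter

# Class counts as printed above (1 = survived, 0 = died).
counts = Counter({1: 225, 0: 81})
total = sum(counts.values())

# A classifier that always answers with the most frequent class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(majority_class, round(baseline_accuracy, 3))  # 1 0.735
```

Any accuracy figure reported later is easier to interpret when set against this baseline of roughly 0.74.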


In [57]:
#separate feature data and target data
X=df.drop(columns=['Survival_status'])
y=df['Survival_status']

In [58]:
#no 1: Split the data into training and test sets with test_size=0.25
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)

In [59]:
Counter(y_train)

Counter({0: 61, 1: 168})
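
Because the split uses stratify=y, the class ratio should be nearly the same in every subset. The sketch below checks this from the two Counter outputs printed so far; the test-set counts are derived by subtraction rather than printed by the notebook, and `minority_share` is a helper name introduced here.

```python
from collections import Counter

full = Counter({1: 225, 0: 81})    # whole dataset, from cell 56
train = Counter({0: 61, 1: 168})   # training split, from cell 59
test = full - train                # rows left over for the test split

def minority_share(counts):
    """Share of class 0 (died within 5 years) in a set of counts."""
    return counts[0] / sum(counts.values())

print(test[0], test[1])
for name, counts in [("full", full), ("train", train), ("test", test)]:
    print(name, round(minority_share(counts), 3))
```

The derived test counts (20 and 57) agree with the support column of the test classification report further down.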

In [60]:
#handle the imbalanced dataset: oversample the minority class of the training split

smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X_train,y_train)
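
With its default settings SMOTE adds synthetic minority rows until both classes are the same size: here 168 - 61 = 107 new rows, which matches the support of 168 per class in the training report below. Each synthetic row lies on the line segment between a minority row and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation idea, not imblearn's implementation; `smote_like` and the toy rows are invented for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy minority rows (Age, Year, Positive_axillary_nodes), invented for illustration.
minority = [(40, 60, 2), (45, 62, 10), (50, 64, 5), (55, 66, 20)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Since synthetic rows are built from real training rows, any cross-validation run on the resampled data can see near-copies of its validation rows during fitting, so scores obtained that way tend to be optimistic.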



#Modelling

In [61]:
dec_tree = tree.DecisionTreeClassifier()
std_slc= StandardScaler()
pca = decomposition.PCA()

pipe = Pipeline(steps=[('std_slc', std_slc),
                        ('pca', pca),
                        ('dec_tree', dec_tree)])

In [62]:
n_components = list(range(1,X.shape[1]+1,1))
criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12]
parameters = dict(pca__n_components=n_components,
                  dec_tree__criterion=criterion,
                  dec_tree__max_depth=max_depth)
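
The parameter grid above is small enough to enumerate by hand. A quick sketch of how many pipelines the search will fit, assuming X has the three feature columns shown earlier and that GridSearchCV uses 5 folds when cv is left unset (the default in recent scikit-learn releases; older releases used 3):

```python
from itertools import product

n_components = [1, 2, 3]            # range(1, X.shape[1] + 1) with 3 feature columns
criterion = ['gini', 'entropy']
max_depth = [2, 4, 6, 8, 10, 12]

# Every combination of the three lists is one candidate pipeline.
grid = list(product(n_components, criterion, max_depth))

n_folds = 5  # assumption: default number of folds when cv=None
total_fits = len(grid) * n_folds

print(len(grid), total_fits)  # 36 180
```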

In [63]:
from sklearn.model_selection import GridSearchCV, cross_val_score
clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(X_smote, y_smote)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('std_slc',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('pca',
                                        PCA(copy=True, iterated_power='auto',
                                            n_components=None,
                                            random_state=None,
                                            svd_solver='auto', tol=0.0,
                                            whiten=False)),
                                       ('dec_tree',
                                        DecisionTreeClassifier(ccp_alpha=0.0,
                                                               class_weight=None,
                                                             

In [64]:
print('Best Criterion:', clf_GS.best_estimator_.get_params()['dec_tree__criterion'])
print('Best max_depth:', clf_GS.best_estimator_.get_params()['dec_tree__max_depth'])
print('Best Number Of Components:', clf_GS.best_estimator_.get_params()['pca__n_components'])
print(); print(clf_GS.best_estimator_.get_params()['dec_tree'])

Best Criterion: entropy
Best max_depth: 12
Best Number Of Components: 3

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=12, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


In [74]:
# Refit a tree with the best criterion and max_depth found by the grid search.
# min_impurity_split and presort are left out: both were removed in later scikit-learn releases.
# All other parameters keep their default values.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=12)

In [75]:
clf.fit(X_smote, y_smote)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=12, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

#Cross Validation

In [76]:
# Note: this scores a default DecisionTreeClassifier(), not the tuned clf from cell 74.
crossval_scores = cross_val_score(DecisionTreeClassifier(), X_smote, y_smote, scoring='roc_auc', cv=10)
crossval_scores

array([0.52941176, 0.75432526, 0.55882353, 0.73529412, 0.74913495,
       0.67647059, 0.81617647, 0.72426471, 0.75919118, 0.85294118])

In [77]:
crossval_scores.mean()

0.7156033737024222

no 2: cross-validation scores (ROC AUC, 10 folds) = [0.52941176, 0.75432526, 0.55882353, 0.73529412, 0.74913495, 0.67647059, 0.81617647, 0.72426471, 0.75919118, 0.85294118]

no 3: mean cross-validation score = 0.72
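
The mean alone hides how much the folds disagree: the scores above range from about 0.53 to 0.85. A small sketch that reports the spread alongside the mean, using the ten fold scores exactly as printed in cell 76:

```python
import statistics

fold_scores = [0.52941176, 0.75432526, 0.55882353, 0.73529412, 0.74913495,
               0.67647059, 0.81617647, 0.72426471, 0.75919118, 0.85294118]

mean_auc = statistics.mean(fold_scores)
std_auc = statistics.stdev(fold_scores)   # sample standard deviation across folds

print(f"ROC AUC: {mean_auc:.3f} +/- {std_auc:.3f} "
      f"(min {min(fold_scores):.3f}, max {max(fold_scores):.3f})")
```

With only a few dozen rows per fold, a wide spread like this is expected, so the mean of 0.72 is best read as a rough estimate.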

#Prediction

In [78]:
#no 4: Predict the test data with the model you built
y_test_pred = clf.predict(X_test)
y_test_pred

array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0])

#Confusion Matrix

In [79]:
#no 5: What does the confusion matrix of the predictions look like?
#(cross-validated predictions on the resampled training data)
y_train_pred = cross_val_predict(clf, X_smote, y_smote)
confusion_matrix(y_smote, y_train_pred)

array([[127,  41],
       [ 52, 116]])
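
In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, so precision comes from a column and recall from a row. A minimal sketch that derives both from the matrix printed above; `per_class_metrics` is a helper written for this illustration.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(matrix)
    metrics = {}
    for c in range(n_classes):
        true_pos = matrix[c][c]
        predicted_c = sum(matrix[r][c] for r in range(n_classes))  # column sum
        actual_c = sum(matrix[c])                                   # row sum
        metrics[c] = {
            "precision": true_pos / predicted_c if predicted_c else 0.0,
            "recall": true_pos / actual_c if actual_c else 0.0,
        }
    return metrics

cv_matrix = [[127, 41], [52, 116]]   # cross-validated training matrix printed above
metrics = per_class_metrics(cv_matrix)
for label, m in metrics.items():
    print(label, round(m["precision"], 2), round(m["recall"], 2))
# 0 0.71 0.76
# 1 0.74 0.69
```

These values agree with the training classification report printed in cell 81.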

In [80]:
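# Note: cross_val_predict refits clf on folds of the test set itself,
# so this matrix does not describe y_test_pred from cell 78.
predictions1 = cross_val_predict(clf, X_test, y_test)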
confusion_matrix(y_test, predictions1)

array([[ 7, 13],
       [12, 45]])

#Classification Report

In [81]:
#no 6: What does the classification report of the predictions look like?
print(classification_report(y_smote, y_train_pred))

              precision    recall  f1-score   support

           0       0.71      0.76      0.73       168
           1       0.74      0.69      0.71       168

    accuracy                           0.72       336
   macro avg       0.72      0.72      0.72       336
weighted avg       0.72      0.72      0.72       336



In [82]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.28      0.35      0.31        20
           1       0.75      0.68      0.72        57

    accuracy                           0.60        77
   macro avg       0.52      0.52      0.51        77
weighted avg       0.63      0.60      0.61        77
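
The test report above is computed from y_test_pred, while the test confusion matrix in cell 80 comes from cross_val_predict, so the two need not agree. The sketch below reconstructs the confusion matrix implied by the report from its recall and support columns. The reported values are rounded, so the counts are inferred rather than printed by the notebook; they are consistent with the 25 zeros in the y_test_pred array of cell 78.

```python
# Recall and support for each class, read off the test report above.
report = {0: {"recall": 0.35, "support": 20},
          1: {"recall": 0.68, "support": 57}}

# Correct predictions per class = recall * support, rounded to the nearest count.
hits = {label: round(row["recall"] * row["support"]) for label, row in report.items()}

# Rows are true classes, columns are predicted classes.
implied_matrix = [
    [hits[0], report[0]["support"] - hits[0]],
    [report[1]["support"] - hits[1], hits[1]],
]
cell_80_matrix = [[7, 13], [12, 45]]   # printed by cross_val_predict on the test set

print(implied_matrix)                     # [[7, 13], [18, 39]]
print(implied_matrix == cell_80_matrix)   # False
```

The implied matrix gives 46 correct predictions out of 77, which matches the reported accuracy of 0.60.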



notes: 1 = the patient survived 5 years or longer, 0 = the patient died within 5 years

7. How well does your model predict that a patient has positive status? (Descriptive)
This answer treats class 0 (the patient died within 5 years) as the positive status.
Precision on the training data (cross-validated) = 0.71 for patients who did not survive 5 years
Recall on the training data (cross-validated) = 0.76 for patients who did not survive 5 years
On the test data, however, precision and recall are low: only 0.28 and 0.35
This means that, given new data, the model cannot yet reliably predict whether a patient will die within 5 years

8. How well does your model predict that a patient has negative status? (Descriptive)
This answer treats class 1 (the patient survived 5 years or longer) as the negative status.
Precision on the training data (cross-validated) = 0.74 for patients who survived 5 years or longer
Recall on the training data (cross-validated) = 0.69 for patients who survived 5 years or longer
On the test data precision and recall stay close to the training values, at 0.75 and 0.68
This means the model predicts reasonably well whether a patient will survive 5 years or longer