# Evaluation Methods
Model evaluation is the process of assessing the performance of a machine learning model. Proper evaluation is essential to make sure the model is reliable and performs well on data it has never seen before. When building a machine learning model, the available dataset is usually split into three main parts (the percentages below are one common choice):

1. **Training Set (80%)**
   - Used to train the model so that it learns the patterns in the data.
   - The model learns from this data.

2. **Validation Set (10%)**
   - Used to evaluate the model during the training process.
   - Helps with hyperparameter tuning and with detecting overfitting.
   - The model does not learn from this data; it is only scored on it.

3. **Test Set (10%)**
   - Used to measure the final performance of the model once training is finished.
   - Simulates the model's performance on new, previously unseen data.

The purpose of this split is to make sure the resulting model works well on real data, not only on the data it saw during training.
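
As an illustration of the 80/10/10 split described above, here is a minimal sketch that calls `train_test_split` twice. The toy arrays `X` and `y` are assumptions made up for this example; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 1000 samples, 3 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Step 1: hold out 20% of the data, leaving 80% for training
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the held-out 20% in half -> 10% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold
)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

The rest of this notebook uses a simpler variant: a single train/test split, with cross validation on the training portion playing the role of the validation set.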

## Cross Validation
Cross validation is a model evaluation method that splits the data into several parts (folds), then trains and tests the model on each fold in turn. The goal is a more robust evaluation in which the model is always scored on data it has not seen.

**Explanation:**
- In supervised learning, the model is trained on training data and tested on data it has never seen (unseen data).
- If the model predicts well on unseen data, it generalizes well.
- Cross validation tests this generalization by splitting the data into several folds and rotating training and validation across them.

**Example: 5-Fold Cross Validation**
- The data is split into 5 parts.
- Each part takes a turn as the validation data while the remaining parts serve as training data.
- The scores from all folds are combined (typically averaged) to obtain a more stable final score.
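
The 5-fold procedure above can be sketched by hand with `KFold`. The toy dataset and the use of accuracy as the score are assumptions for illustration; the notebook itself uses `cross_val_score`, which wraps the same loop.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumed): 200 samples, label depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in kf.split(X):
    # Train on 4 folds, validate on the remaining fold
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_scores.append(accuracy_score(y[val_idx], pred))

# Combine the per-fold scores into a mean and a spread
mean_score = float(np.mean(fold_scores))
std_score = float(np.std(fold_scores))
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```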

**Commonly used types of cross-validation:**
- k-fold cross-validation (e.g. 2 folds, 5 folds, 10 folds)
- Leave One Out Cross-Validation (LOOCV)

With cross validation, we can check that the model is not just good on one part of the data but performs consistently across all of it.
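
To make the difference between k-fold and LOOCV concrete, the sketch below counts the splits each scheme produces on a tiny array of 10 samples. The array is a made-up placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 toy samples (assumed for illustration)

loo = LeaveOneOut()
kf = KFold(n_splits=5)

# Number of train/validation rounds for each scheme
n_loo = loo.get_n_splits(X)
n_kf = kf.get_n_splits(X)

# Size of the validation part in each round
val_sizes_loo = {len(val_idx) for _, val_idx in loo.split(X)}
val_sizes_kf = {len(val_idx) for _, val_idx in kf.split(X)}

print(n_loo, val_sizes_loo)  # 10 {1}
print(n_kf, val_sizes_kf)    # 5 {2}
```

LOOCV is k-fold with k equal to the number of samples, so it needs one model fit per row. That is why k-fold with a small k is usually the more practical choice for datasets of a few thousand rows.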

In [15]:
# Library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from category_encoders import BinaryEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Pipeline and Metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, f1_score, classification_report

import warnings
warnings.filterwarnings('ignore')

In [16]:
# Data - Load fresh data
fraud = pd.read_csv('Fraud.csv')
print(f"Shape: {fraud.shape}")
fraud.head()

Shape: (3000, 11)


Unnamed: 0,order_id,customer_tier,payment_method,num_items,order_value,customer_city,device_type,past_returns,account_age_days,shipping_speed,is_fraud
0,O0001,Gold,cod,1,417821.0,,Desktop,0.0,507,express,No
1,O0002,Platinum,COD,4,481832.0,,DESKTOP,1.0,258,regular,Yes
2,O0003,Gold,cod,3,495209.0,,DESKTOP,3.0,639,express,No
3,O0004,Silver,CoD,2,304920.0,,mobile,6.0,398,express,Yes
4,O0005,gold,COD,3,158201.0,,mobile,2.0,485,regular,No


In [17]:
# Check unique values
unique_df = pd.DataFrame({
    'Column': fraud.columns,
    'Unique Count': [fraud[col].nunique() for col in fraud.columns],
    'Unique List': [fraud[col].unique() for col in fraud.columns]
})
unique_df

Unnamed: 0,Column,Unique Count,Unique List
0,order_id,3000,"[O0001, O0002, O0003, O0004, O0005, O0006, O00..."
1,customer_tier,5,"[Gold, Platinum, Silver, gold, silver]"
2,payment_method,3,"[cod, COD, CoD]"
3,num_items,5,"[1, 4, 3, 2, 5]"
4,order_value,1804,"[417821.0, 481832.0, 495209.0, 304920.0, 15820..."
5,customer_city,7,"[nan, Bandung, Medan, surabaya, Surabaya, band..."
6,device_type,4,"[Desktop, DESKTOP, mobile, Mobile]"
7,past_returns,7,"[0.0, 1.0, 3.0, 6.0, 2.0, 5.0, 4.0, nan]"
8,account_age_days,1284,"[507, 258, 639, 398, 485, 1277, 1391, 435, 133..."
9,shipping_speed,2,"[express, regular]"


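**Columns that need standardization** (inconsistent casing)
1. customer_tier
2. payment_method
3. customer_city
4. device_type

The code below also lowercases `shipping_speed` and `is_fraud`, so every categorical column goes through the same cleanup.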

In [18]:
# Standardize selected columns: lowercase and strip surrounding whitespace
cols_to_standardize = ['customer_tier', 'payment_method', 'customer_city', 'device_type', 'shipping_speed', 'is_fraud']
for col in cols_to_standardize:
    fraud[col] = fraud[col].str.lower().str.strip()


In [19]:
# Re-check unique values after standardization
for col in ['customer_tier', 'payment_method', 'device_type', 'shipping_speed', 'is_fraud']:
    uniques = sorted(fraud[col].dropna().unique())
    print(f"{col}: {uniques}")


customer_tier: ['gold', 'platinum', 'silver']
payment_method: ['cod']
device_type: ['desktop', 'mobile']
shipping_speed: ['express', 'regular']
is_fraud: ['no', 'yes']


In [20]:
# Check missing values as a percentage of rows
fraud.isna().sum() / len(fraud) * 100

order_id             0.000000
customer_tier        0.000000
payment_method       0.000000
num_items            0.000000
order_value          5.133333
customer_city       93.300000
device_type          0.000000
past_returns         3.366667
account_age_days     0.000000
shipping_speed       0.000000
is_fraud             0.000000
dtype: float64

In [21]:
# Impute missing values with the median (fillna) instead of dropping rows (dropna)
fraud['past_returns'] = fraud['past_returns'].fillna(fraud['past_returns'].median())
fraud['order_value'] = fraud['order_value'].fillna(fraud['order_value'].median())

In [22]:
# Drop the customer_city column (93.3% missing)
fraud = fraud.drop('customer_city', axis=1)

In [23]:
fraud.isna().sum() / len(fraud) * 100

order_id            0.0
customer_tier       0.0
payment_method      0.0
num_items           0.0
order_value         0.0
device_type         0.0
past_returns        0.0
account_age_days    0.0
shipping_speed      0.0
is_fraud            0.0
dtype: float64

In [24]:
# Convert is_fraud to 0 and 1
print("Unique values of is_fraud before conversion:")
print(fraud['is_fraud'].unique())

fraud['is_fraud'] = fraud['is_fraud'].map({'yes': 1, 'no': 0})

print("\nAfter conversion:")
print(fraud['is_fraud'].value_counts())
print(f"Fraud rate: {fraud['is_fraud'].mean()*100:.2f}%")
fraud.head()

Unique values of is_fraud before conversion:
['no' 'yes']

After conversion:
is_fraud
0    2646
1     354
Name: count, dtype: int64
Fraud rate: 11.80%


Unnamed: 0,order_id,customer_tier,payment_method,num_items,order_value,device_type,past_returns,account_age_days,shipping_speed,is_fraud
0,O0001,gold,cod,1,417821.0,desktop,0.0,507,express,0
1,O0002,platinum,cod,4,481832.0,desktop,1.0,258,regular,1
2,O0003,gold,cod,3,495209.0,desktop,3.0,639,express,0
3,O0004,silver,cod,2,304920.0,mobile,6.0,398,express,1
4,O0005,gold,cod,3,158201.0,mobile,2.0,485,regular,0


**Choosing the metric and the preprocessing scheme**
- FN (false negative): the transaction is predicted as not fraud but is actually fraud (potentially large losses because a suspicious transaction slips through).
- FP (false positive): the transaction is predicted as fraud but is actually legitimate (a loss from rejecting a valid transaction).

**Suitable metric: Recall (sensitivity)**, since a missed fraud (FN) is the costlier error here.

**Preprocessing scheme:**
- One Hot Encoding: customer_tier, payment_method, device_type
- Ordinal Encoding: shipping_speed (1: regular, 2: express, 0: outside these options)
- StandardScaler: num_items, order_value, past_returns, account_age_days
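
To make the FN/FP distinction and the recall metric concrete, here is a small worked example. The labels are hypothetical values chosen for illustration, not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = fraud, 0 = not fraud
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=2 TP=2
print(f"recall={recall:.2f}")              # recall=0.50
print(f"precision={precision:.2f}")        # precision=0.67
```

Recall only drops when fraud cases are missed (FN), which is why it matches the cost structure described above. Precision would be the metric to watch if rejecting legitimate orders were the bigger concern.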

In [25]:
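# Explicit mapping for the category_encoders OrdinalEncoder; missing values map to 0.
# Named ordinal_mapping to avoid shadowing Python's built-in ord().
ordinal_mapping = [
    {'col': 'shipping_speed', 'mapping': {None: 0, 'regular': 1, 'express': 2}}
]

OE = OrdinalEncoder(cols=['shipping_speed'], mapping=ordinal_mapping)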

prepos = ColumnTransformer([
    ('standard', StandardScaler(), ['num_items', 'order_value', 'past_returns', 'account_age_days']),
    ('ordinal', OE, ['shipping_speed']),
    ('onehot', OneHotEncoder(), ['customer_tier', 'payment_method', 'device_type'])
], remainder='passthrough')

prepos

In [26]:
# Train Test Split

x = fraud.drop(['order_id', 'is_fraud'], axis=1)
y = fraud['is_fraud']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
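
The `stratify=y` argument matters here because only about 11.8% of the rows are fraud. The sketch below rebuilds a label vector with the same class counts reported earlier (2646 non-fraud, 354 fraud) and compares a stratified split with a plain random one. The feature column is a placeholder assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported above; X is a dummy placeholder feature
y = np.array([0] * 2646 + [1] * 354)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Plain random split: proportions can drift between the parts
_, _, y_tr_r, y_te_r = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"overall fraud rate:       {y.mean():.4f}")
print(f"stratified train / test:  {y_tr_s.mean():.4f} / {y_te_s.mean():.4f}")
print(f"random train / test:      {y_tr_r.mean():.4f} / {y_te_r.mean():.4f}")
```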

In [27]:
# Try a decision tree model first

dt = DecisionTreeClassifier(random_state=42, class_weight='balanced')

dt_pipe = Pipeline([
    ('Preprocessing', prepos),
    ('Model', dt)
])

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

score = cross_val_score(dt_pipe, 
                        x_train, y_train, 
                        cv=kfold, 
                        scoring='recall')
score

array([0.08928571, 0.10714286, 0.10526316, 0.10526316, 0.12280702])

In [28]:
# Benchmark several models to pick the best one

dt = DecisionTreeClassifier(random_state=42, class_weight='balanced')
rf = RandomForestClassifier(random_state=42)
logreg = LogisticRegression(random_state=42)
knn = KNeighborsClassifier()
svm = SVC()

model = [dt, rf, logreg, knn, svm]
score = []
score_mean = []
score_std = []

for i in model:
    pipe = Pipeline([
        ('Preprocessing', prepos),
        ('Model', i)])
    
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    cv_score = cross_val_score(pipe,
                            x_train, y_train,
                            cv = kfold,
                            scoring = 'recall')
    
    score.append(cv_score)
    # Keep the summaries numeric so sort_values ranks them by value, not as text
    score_mean.append(round(cv_score.mean() * 100, 2))
    score_std.append(round(cv_score.std() * 100, 2))

rangkuman = pd.DataFrame({
    'Model' : ['Decision Tree', 'Random Forest', 'LogReg', 'KNN', 'SVM'],
    'Recall Score AVG (%)' : score_mean,
    'Recall Score STD (%)' : score_std
}).sort_values('Recall Score AVG (%)', ascending=False)

rangkuman

Unnamed: 0,Model,Recall Score AVG (%),Recall Score STD (%)
0,Decision Tree,10.6,1.06
3,KNN,1.42,1.73
1,Random Forest,0.71,0.87
2,LogReg,0.0,0.0
4,SVM,0.0,0.0
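
A ranking table like the one above is only reliable when the score column holds numbers. As a small sketch with made-up scores, sorting the same values as formatted strings gives a lexicographic order that can put a lower score first:

```python
import pandas as pd

# Hypothetical scores, chosen to expose the difference
scores = [9.5, 10.6, 1.42]

as_text = pd.DataFrame({'score': [f"{s:.2f}" for s in scores]})
as_num = pd.DataFrame({'score': scores})

text_order = as_text.sort_values('score', ascending=False)['score'].tolist()
num_order = as_num.sort_values('score', ascending=False)['score'].tolist()

print(text_order)  # ['9.50', '10.60', '1.42']  (compared character by character)
print(num_order)   # [10.6, 9.5, 1.42]
```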


**Hyperparameter tuning of two models (Decision Tree and Random Forest) using GridSearchCV and RandomizedSearchCV**

In [32]:
# Tuning for the Decision Tree
dt_pipe = Pipeline([
    ('Preprocessing', prepos),
    ('Model', dt)
])

dt_param = {
    'Model__max_depth': [50, 100, 150, 200],
    'Model__criterion': ['gini', 'entropy'],
    'Model__min_samples_split': [5, 10, 15, 20]
}

grid = GridSearchCV(estimator=dt_pipe,
                    param_grid=dt_param,
                    scoring='recall',
                    cv=kfold,
                    n_jobs=-1,
                    verbose=3)

grid.fit(x_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


In [37]:
# Show the best parameters
print("Best Parameters: ", grid.best_params_)
print("Best Recall Score: ", f"{grid.best_score_*100:.2f}%")


Best Parameters:  {'Model__criterion': 'gini', 'Model__max_depth': 50, 'Model__min_samples_split': 20}
Best Recall Score:  30.04%


In [40]:
pd.DataFrame(grid.cv_results_).head()


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_Model__criterion,param_Model__max_depth,param_Model__min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.024826,0.001459,0.016412,0.003007,gini,50,5,"{'Model__criterion': 'gini', 'Model__max_depth...",0.107143,0.214286,0.122807,0.140351,0.192982,0.155514,0.041225,29
1,0.027115,0.000716,0.012004,0.000974,gini,50,10,"{'Model__criterion': 'gini', 'Model__max_depth...",0.142857,0.285714,0.263158,0.122807,0.263158,0.215539,0.068325,21
2,0.029969,0.001299,0.012711,0.001472,gini,50,15,"{'Model__criterion': 'gini', 'Model__max_depth...",0.196429,0.321429,0.298246,0.210526,0.298246,0.264975,0.051116,13
3,0.030304,0.006565,0.01185,0.000951,gini,50,20,"{'Model__criterion': 'gini', 'Model__max_depth...",0.196429,0.410714,0.333333,0.245614,0.315789,0.300376,0.073895,1
4,0.029631,0.004383,0.010598,0.000505,gini,100,5,"{'Model__criterion': 'gini', 'Model__max_depth...",0.107143,0.214286,0.122807,0.140351,0.192982,0.155514,0.041225,29


In [41]:
# Tuning for the Random Forest

rf = RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1)

rf_pipe = Pipeline([
    ('Preprocessing', prepos),
    ('Model', rf)
])

rf_param = {
    'Model__n_estimators': [100, 200, 300, 400, 500],
    'Model__max_depth': [None, 10, 20, 30, 40, 50],
    'Model__min_samples_split': [2, 5, 10],
    'Model__min_samples_leaf': [1, 2, 4],
    'Model__max_features': ['sqrt', 'log2', None],
    'Model__bootstrap': [True, False]
}

rf_random = RandomizedSearchCV(
    estimator=rf_pipe,
    param_distributions=rf_param,
    n_iter=25,
    scoring='recall',
    cv=kfold,
    n_jobs=-1,
    random_state=42,
    verbose=3
)

rf_random.fit(x_train, y_train)


Fitting 5 folds for each of 25 candidates, totalling 125 fits
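
The two searches above differ in how much of the parameter space they cover. As a rough check, the sketch below copies the two grids from the tuning cells and counts their combinations with `ParameterGrid`. The variable names `n_dt`, `n_rf`, and `coverage` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the Decision Tree and Random Forest tuning cells
dt_param = {
    'Model__max_depth': [50, 100, 150, 200],
    'Model__criterion': ['gini', 'entropy'],
    'Model__min_samples_split': [5, 10, 15, 20],
}
rf_param = {
    'Model__n_estimators': [100, 200, 300, 400, 500],
    'Model__max_depth': [None, 10, 20, 30, 40, 50],
    'Model__min_samples_split': [2, 5, 10],
    'Model__min_samples_leaf': [1, 2, 4],
    'Model__max_features': ['sqrt', 'log2', None],
    'Model__bootstrap': [True, False],
}

n_dt = len(ParameterGrid(dt_param))  # 4 * 2 * 4
n_rf = len(ParameterGrid(rf_param))  # 5 * 6 * 3 * 3 * 3 * 2
coverage = 25 / n_rf                 # share sampled when n_iter=25

print(n_dt)               # 32
print(n_rf)               # 1620
print(f"{coverage:.1%}")  # 1.5%
```

GridSearchCV evaluates all 32 Decision Tree candidates, while RandomizedSearchCV with `n_iter=25` samples only a small fraction of the Random Forest grid. The reported best Random Forest parameters are therefore the best among the sampled candidates, not necessarily the best in the full grid.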


In [42]:
# RF tuning results
print("Best Parameters:", rf_random.best_params_)
print("Best Recall Score:", f"{rf_random.best_score_ * 100:.2f}%")


Best Parameters: {'Model__n_estimators': 500, 'Model__min_samples_split': 10, 'Model__min_samples_leaf': 1, 'Model__max_features': None, 'Model__max_depth': 50, 'Model__bootstrap': False}
Best Recall Score: 21.55%


In [44]:
# Show the top RF parameter combinations by mean recall
pd.DataFrame(rf_random.cv_results_)[['params', 'mean_test_score', 'std_test_score']]\
    .sort_values('mean_test_score', ascending=False)\
    .head()

Unnamed: 0,params,mean_test_score,std_test_score
6,"{'Model__n_estimators': 500, 'Model__min_sampl...",0.215539,0.061697
8,"{'Model__n_estimators': 200, 'Model__min_sampl...",0.137845,0.045226
4,"{'Model__n_estimators': 100, 'Model__min_sampl...",0.12713,0.04609
2,"{'Model__n_estimators': 300, 'Model__min_sampl...",0.102569,0.040888
5,"{'Model__n_estimators': 100, 'Model__min_sampl...",0.095489,0.032862
