# Part 2: Training with XGBoost

## Contents
1. [Introduction](#Introduction)
2. [Prerequisites](#Prerequisites)
3. [Train a model using XGBoost](#Train-a-model-using-XGBoost)

## Introduction

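Detecting fraud in banking transactions matters a great deal to real banks: only a small fraction of transactions are fraudulent, but the damage is severe when one goes undetected. In this notebook we analyze synthetic financial transaction data and build a binary classifier. The data is also available from [kaggle](https://www.kaggle.com/); the dataset used here has already been preprocessed once, so the number of columns and rows may differ if you use the original data.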


## Prerequisites
### Install XGBoost module
Install the XGBoost module.

In [1]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.5.2-py3-none-manylinux2014_x86_64.whl (173.6 MB)
     |████████████████████████████████| 173.6 MB 4.2 kB/s             
Installing collected packages: xgboost
Successfully installed xgboost-1.5.2


### Import libraries

Import the libraries used in this notebook.

In [2]:
import os                                         # For manipulating filepath names  
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function

import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data

from sklearn.model_selection import train_test_split # import train_test_split function
from sklearn.metrics import classification_report # import classification metrics
from sklearn.ensemble import RandomForestClassifier # import RandomForestClassifier
import xgboost as xgb                               # import XGBoost

import boto3

### Read dataset

Load the dataset from the CSV file in the notebook's working directory.

In [3]:
df = pd.read_csv('./banking_fraud_final_dataset.csv')

df.shape

(3000, 19)

In [4]:
df.head()

| Unnamed: 0 | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | CASH_IN | CASH_OUT | DEBIT | PAYMENT | TRANSFER | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CASH_OUT | 140421.18 | C1667570766 | 16004.0 | 0.0 | C2102410298 | 0.0 | 140421.18 | 0 | 0 | 1 | 0 | 0 | 0 | 0.014042 | 0.000279 | 0.0 | 0.0 | 0.001334 |
| 1 | CASH_OUT | 216666.53 | C1495945377 | 50398.0 | 0.0 | C814408370 | 10119297.16 | 10335963.7 | 0 | 0 | 1 | 0 | 0 | 0 | 0.021667 | 0.000879 | 0.0 | 0.09603 | 0.098211 |
| 2 | CASH_OUT | 234636.2 | C269129885 | 74262.0 | 0.0 | C1389815469 | 166046.48 | 400682.68 | 0 | 0 | 1 | 0 | 0 | 0 | 0.023464 | 0.001296 | 0.0 | 0.001576 | 0.003807 |
| 3 | CASH_IN | 52816.29 | C129678616 | 117751.0 | 170567.29 | C842027837 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.005282 | 0.002054 | 0.003605 | 0.0 | 0.0 |
| 4 | CASH_OUT | 63871.25 | C1282823885 | 6012.0 | 0.0 | C1236511065 | 456488.36 | 520359.6 | 0 | 0 | 1 | 0 | 0 | 0 | 0.006387 | 0.000105 | 0.0 | 0.004332 | 0.004944 |


### Split dataset

Split the data into training and test sets, then train the model.

In [5]:
# Drop the columns that are not used as training features (isFraud is kept separately as the target)
columns_to_drop = ['type', 'isFraud', 'nameOrig', 'nameDest']

features = df.drop(columns_to_drop, axis=1)
target = df.isFraud

In [6]:
# Split the data into train and test sets at a 7:3 ratio
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
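
The split above sets no seed, so it changes on every run, and with rare fraud labels a plain random split can give the two sets slightly different fraud rates. The sketch below shows how the `stratify` and `random_state` arguments of `train_test_split` could address both. It runs on synthetic data, not the notebook's dataset; the 10% positive rate and the `demo_*`, `xs_*`, and `ys_*` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 3000 rows, 10% positives (assumed).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(3000, 4))
demo_target = np.zeros(3000, dtype=int)
demo_target[:300] = 1  # exactly 300 positive labels

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible across runs.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    demo_features,
    demo_target,
    test_size=0.3,
    stratify=demo_target,
    random_state=42,
)

# Fraction of positives in each split.
print(ys_train.mean(), ys_test.mean())
```

Applied to this notebook, the same two arguments would be added to the existing call, with `stratify=target`.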

## Train a model using XGBoost

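This notebook uses XGBoost, which is simple to use yet effective for binary classification. XGBoost is an open source library that performs gradient boosting. It offers strong computational performance, implements all the features we need, and has a track record of strong results in many machine learning competitions. Let's start building a model with the XGBoost module.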
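
To give a feel for what gradient boosting does, here is a minimal sketch in plain Python. It is a deliberately simplified illustration, not XGBoost's actual algorithm: it uses squared error and one-split "stumps" on a single feature, whereas XGBoost adds regularization, second-order gradient information, and full trees. All function names and the toy data are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals
    with one constant on each side (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a damped copy of it to the predictions."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, predictions


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data: one feature, binary-looking targets.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]

base, stumps, boosted = gradient_boost(xs, ys)
baseline_error = mean_squared_error(ys, [base] * len(ys))
boosted_error = mean_squared_error(ys, boosted)
print(baseline_error, boosted_error)
```

Each round corrects part of what the previous rounds got wrong, which is why the boosted error ends up below the error of the constant baseline.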

In [7]:
# Create the model object
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)

# Train the model
xgb_model.fit(x_train, y_train)

# Run predictions on the test set with the trained model
xgb_y_pred = xgb_model.predict(x_test)
y_real = y_test
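
Fraud labels are heavily imbalanced, and XGBoost exposes a `scale_pos_weight` parameter for that situation; its documentation suggests the ratio of negative to positive samples as a typical starting value. The notebook does not set this parameter, so the following is only a sketch of how the value could be computed. The helper name `negative_to_positive_ratio` and the labels are made up for illustration.

```python
def negative_to_positive_ratio(labels):
    """Return count(label == 0) / count(label == 1) for binary labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return negatives / positives


# Made-up labels: 9 legitimate transactions for every fraudulent one.
demo_labels = [0] * 900 + [1] * 100
weight = negative_to_positive_ratio(demo_labels)
print(weight)  # 9.0
```

In this notebook the ratio would be computed from `y_train` and could be passed as `xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=weight, random_state=42)`. Whether it improves results on this dataset would need to be checked on held-out data.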





In [8]:
# Check the evaluation metrics
print(classification_report(y_real, xgb_y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       809
           1       0.94      0.93      0.94        91

    accuracy                           0.99       900
   macro avg       0.97      0.96      0.97       900
weighted avg       0.99      0.99      0.99       900
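
The per-class numbers in a report like this follow from three counts: true positives, false positives, and false negatives. The sketch below computes them by hand for one class. The counts are hypothetical and are not the confusion matrix behind the output above; the function name is also made up for this example.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the fraud class (label 1).
tp, fp, fn = 80, 5, 20
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.94 recall=0.80 f1=0.86
```

For fraud detection, recall on class 1 is usually the number to watch, since a false negative is a fraudulent transaction that went undetected.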



### (Optional) Train a model using RandomForestClassifier

In [9]:
# Create the model object
rf_model = RandomForestClassifier(n_estimators=10)

# Train the model
rf_model.fit(x_train, y_train)

# Run predictions on the test set with the trained model
rf_y_pred = rf_model.predict(x_test)
y_real = y_test

In [10]:
# Check the evaluation metrics
print(classification_report(y_real, rf_y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       809
           1       0.94      0.91      0.93        91

    accuracy                           0.99       900
   macro avg       0.97      0.95      0.96       900
weighted avg       0.99      0.99      0.99       900

