# Kaggle Credit Card Fraud Detection (Google Drive Mount)
https://www.kaggle.com/mlg-ulb/creditcardfraud
## Credit Card Fraud Detection
* creditcard.csv (284,807 rows × 31 columns)
* Class: '0' (legitimate transaction), '1' (fraudulent transaction)
* Fraud Detection, Anomaly Detection

In [None]:
import warnings
warnings.filterwarnings('ignore')

# I. Google Drive Mount
* Upload 'creditCardFraud.zip' to Google Drive before proceeding

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


* Check the mount result

In [None]:
!ls -l '/content/drive/My Drive/Colab Notebooks/datasets/creditCardFraud.zip'

-r-------- 1 root root 69155672 Mar 10  2022 '/content/drive/My Drive/Colab Notebooks/datasets/creditCardFraud.zip'


# II. Data Preprocessing

> ## 1) Unzip 'creditCardFraud.zip'

* Extracts 'creditcard.csv' into the Colab file system

In [None]:
!unzip /content/drive/My\ Drive/Colab\ Notebooks/datasets/creditCardFraud.zip

Archive:  /content/drive/My Drive/Colab Notebooks/datasets/creditCardFraud.zip
  inflating: creditcard.csv          


* Confirm that creditcard.csv exists

In [None]:
!ls -l

total 147304
-rw-r--r-- 1 root root 150828752 Sep 20  2019 creditcard.csv
drwx------ 5 root root      4096 Sep 23 06:39 drive
drwxr-xr-x 1 root root      4096 Sep 14 13:44 sample_data


> ## 2) Load the Data

* pandas DataFrame

In [None]:
%%time

import pandas as pd

DF = pd.read_csv('creditcard.csv')

DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [None]:
DF.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


* Counts of class '0' (normal) and class '1' (fraud)

In [None]:
DF.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

* Percentage of class '0' (normal) and class '1' (fraud)

In [None]:
(DF.Class.value_counts() / DF.shape[0]) * 100

0    99.827251
1     0.172749
Name: Class, dtype: float64
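
With only about 0.17% of rows labeled as fraud, accuracy alone says little: a model that labels every transaction as normal would already score very high. The sketch below works this out from the class counts printed above. The variable names and the "always normal" baseline are illustrative assumptions, not part of the original notebook.

```python
# Class counts taken from the value_counts() output above
n_normal = 284315
n_fraud = 492
n_total = n_normal + n_fraud

# Percentage of each class
pct_normal = n_normal / n_total * 100
pct_fraud = n_fraud / n_total * 100

# Accuracy of a hypothetical baseline that always predicts 0 (normal):
# every normal row is counted as correct, every fraud row as wrong
baseline_accuracy = n_normal / n_total

print(f"normal: {pct_normal:.6f}%  fraud: {pct_fraud:.6f}%")
print(f"always-normal baseline accuracy: {baseline_accuracy:.6f}")
```

This is one reason the modeling section reports precision, recall, and F1 for class 1 rather than accuracy alone.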

> ## 3) Drop the Time Column

In [None]:
DF.drop('Time', axis = 1, inplace = True)

DF.head(1)

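| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |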


> ## 4) train_test_split

* Specify X (input) and y (output)

In [None]:
X = DF.iloc[:,:-1]
y = DF.iloc[:, -1]

X.shape, y.shape

((284807, 29), (284807,))

> ### (1) Without 'stratify'

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state = 2045)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((199364, 29), (199364,), (85443, 29), (85443,))

* The proportion of class 1 (fraud) differs between Train_Data and Test_Data

In [None]:
print('Train_Data :','\n', (y_train.value_counts() / y_train.shape[0]) * 100)
print('Test_Data :','\n', (y_test.value_counts() / y_test.shape[0]) * 100)

Train_Data : 
 0    99.825445
1     0.174555
Name: Class, dtype: float64
Test_Data : 
 0    99.831467
1     0.168533
Name: Class, dtype: float64


> ### (2) With 'stratify'

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    stratify = y,
                                                    random_state = 2045)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((199364, 29), (199364,), (85443, 29), (85443,))

* The proportion of class 1 (fraud) is nearly identical in Train_Data and Test_Data

In [None]:
print('Train_Data :','\n', (y_train.value_counts() / y_train.shape[0]) * 100)
print('Test_Data :','\n', (y_test.value_counts() / y_test.shape[0]) * 100)

Train_Data : 
 0    99.827451
1     0.172549
Name: Class, dtype: float64
Test_Data : 
 0    99.826785
1     0.173215
Name: Class, dtype: float64
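
To make the effect of `stratify` concrete, here is a minimal sketch of a stratified split on toy labels. It splits each class separately so that both parts keep the same class proportions. The helper `stratified_split` and the toy data are hypothetical and only illustrate the idea; this is not scikit-learn's actual implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 1,000 rows with 1% positives (illustrative, not the real data)
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=2045)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both parts end up with 1% positives, which mirrors the closely matched fraud ratios in the output above.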


# III. Modeling

> ## 1) Decision Tree - Without SMOTE

In [None]:
%%time

from sklearn.tree import DecisionTreeClassifier

Model_dt = DecisionTreeClassifier(random_state = 2045)
Model_dt.fit(X_train, y_train)

CPU times: user 21.2 s, sys: 14 ms, total: 21.3 s
Wall time: 21.4 s


DecisionTreeClassifier(random_state=2045)

In [None]:
y_hat = Model_dt.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_hat)

array([[85263,    32],
       [   28,   120]])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat, pos_label = 1))
print(recall_score(y_test, y_hat, pos_label = 1))

0.9992977774656788
0.7894736842105263
0.8108108108108109


In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_hat, pos_label = 1)

0.8
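
The four scores above can be re-derived by hand from the confusion matrix, which helps when reading the later results. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and uses the matrix printed above; the variable names are illustrative.

```python
# Confusion matrix reported above: rows = actual, columns = predicted
cm = [[85263, 32],
      [28, 120]]

tn, fp = cm[0]  # actual 0: correctly kept as 0, wrongly flagged as 1
fn, tp = cm[1]  # actual 1: missed as 0, correctly flagged as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud transactions that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```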

> ## 2) SMOTE

* Synthetic Minority Over-sampling TEchnique
* KNN (K-Nearest Neighbors): generates new minority-class samples by interpolating between a minority sample and one of its K nearest neighbors
* imbalanced-learn Package
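
The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the helper `smote_like_sample`, the toy 2-D points, and the choice of `k=3` are all hypothetical.

```python
import math
import random

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a near neighbor."""
    base = rng.choice(minority)
    # k nearest minority neighbors of the base point (excluding the point itself)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    # Random position on the line segment between base and neighbor
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

rng = random.Random(2045)
# Toy minority-class points with two features (illustrative only)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]]
synthetic = [smote_like_sample(minority, k=3, rng=rng) for _ in range(5)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.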

In [None]:
# Without SMOTE

X_train.shape, y_train.shape

((199364, 29), (199364,))

In [None]:
pd.Series(y_train).value_counts()

0    199020
1       344
Name: Class, dtype: int64

* imbalanced-learn Package

In [None]:
from imblearn.over_sampling import SMOTE 

* With SMOTE

In [None]:
%%time

OS = SMOTE(random_state = 2045)

X_train_OS, y_train_OS = OS.fit_resample(X_train, y_train)



CPU times: user 271 ms, sys: 196 ms, total: 467 ms
Wall time: 348 ms


In [None]:
X_train_OS.shape, y_train_OS.shape

((398040, 29), (398040,))

* Counts of class 0 (normal) and class 1 (fraud)

In [None]:
pd.Series(y_train_OS).value_counts()

0    199020
1    199020
Name: Class, dtype: int64

> ## 3) Decision Tree - With SMOTE

In [None]:
%%time 

from sklearn.tree import DecisionTreeClassifier

Model_dt = DecisionTreeClassifier(random_state = 2045)
Model_dt.fit(X_train_OS, y_train_OS)

CPU times: user 36.3 s, sys: 44.1 ms, total: 36.3 s
Wall time: 36.9 s


DecisionTreeClassifier(random_state=2045)

In [None]:
y_hat = Model_dt.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_hat)

array([[85140,   155],
       [   31,   117]])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat, pos_label = 1))
print(recall_score(y_test, y_hat, pos_label = 1))

0.9978231101436045
0.43014705882352944
0.7905405405405406


In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_hat, pos_label = 1)

0.5571428571428572

> ## 4) LightGBM - With SMOTE

* n_estimators : number of trees used in the model
* num_leaves : maximum number of terminal nodes (leaves) per tree
* boost_from_average : set to 'False' for imbalanced data
* learning_rate : a value between 0 and 1
* max_depth : maximum depth of each tree
* min_child_samples : minimum number of data points in a terminal node
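
As a rough sketch, the parameters listed above can be collected in one dictionary and passed to `LGBMClassifier`. The values for `n_estimators`, `num_leaves`, `n_jobs`, and `boost_from_average` match the call used in this notebook. The values for `learning_rate`, `max_depth`, and `min_child_samples` are assumptions: they are what I understand to be LightGBM's defaults, written out only to show where each parameter goes. The name `lgbm_params` is hypothetical.

```python
# Settings from the notebook's LGBMClassifier call
lgbm_params = {
    "n_estimators": 1500,
    "num_leaves": 64,
    "n_jobs": -1,
    "boost_from_average": False,
    # Assumed defaults, spelled out for illustration only
    "learning_rate": 0.1,
    "max_depth": -1,          # -1 means no depth limit
    "min_child_samples": 20,
}

try:
    from lightgbm import LGBMClassifier
    model = LGBMClassifier(**lgbm_params)
except (ImportError, OSError):
    model = None  # lightgbm is not available; the dict still documents the settings

print(lgbm_params)
```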

* Takes about 90 seconds to run

In [None]:
%%time

from lightgbm import LGBMClassifier

Model_lgbm = LGBMClassifier(n_estimators = 1500,
                            num_leaves = 64,
                            n_jobs = -1,
                            boost_from_average = False)

Model_lgbm.fit(X_train_OS, y_train_OS)

CPU times: user 3min 3s, sys: 471 ms, total: 3min 4s
Wall time: 1min 34s


LGBMClassifier(boost_from_average=False, n_estimators=1500, num_leaves=64)

In [None]:
y_hat = Model_lgbm.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_hat)

array([[85273,    22],
       [   19,   129]])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat, pos_label = 1))
print(recall_score(y_test, y_hat, pos_label = 1))

0.9995201479348805
0.8543046357615894
0.8716216216216216


In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_hat, pos_label = 1)

0.862876254180602
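
For a side-by-side view, the sketch below recomputes false positives, false negatives, and F1 for the three models from the confusion matrices reported in this notebook. The dictionary keys and variable names are illustrative labels, not identifiers from the original code.

```python
# Confusion matrices reported in this notebook: [[TN, FP], [FN, TP]]
results = {
    "Decision Tree (no SMOTE)": [[85263, 32], [28, 120]],
    "Decision Tree (SMOTE)": [[85140, 155], [31, 117]],
    "LightGBM (SMOTE)": [[85273, 22], [19, 129]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in results.items():
    # F1 written directly in terms of the confusion-matrix counts
    f1 = 2 * tp / (2 * tp + fp + fn)
    summary[name] = {"fp": fp, "fn": fn, "f1": f1}
    print(f"{name:26s} FP={fp:4d} FN={fn:3d} F1={f1:.4f}")
```

On this test split, SMOTE increased the Decision Tree's false positives from 32 to 155, while LightGBM trained on the same oversampled data produced the fewest errors of both kinds.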

# The End