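#### Sampling Techniques for Handling Imbalanced Data

**What is imbalanced data?**
- Data in which the number of observations in the normal class differs markedly from the number in the abnormal class
- For example, cancer patients are typically far fewer than people without cancer, and fraudulent credit card transactions are far fewer than legitimate ones
- Problem: classifying abnormal cases correctly is generally more important than classifying normal cases correctly, because the abnormal class is usually the target of interest.
- A model trained on an imbalanced dataset tends to favor the majority class and may fail to detect abnormal cases accurately.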
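
To make the problem concrete, the short calculation below uses the class counts that this document reports for creditcard.csv (284,315 normal and 492 fraudulent transactions). The "always predict normal" baseline is a hypothetical classifier introduced only for illustration.

```python
# Class counts as reported by value_counts() for creditcard.csv in this document
n_normal, n_fraud = 284315, 492
n_total = n_normal + n_fraud

fraud_ratio = n_fraud / n_total

# Hypothetical baseline: a classifier that labels every transaction as normal
baseline_accuracy = n_normal / n_total   # every normal case counts as correct
baseline_recall = 0 / n_fraud            # no fraud case is ever detected

print(f"fraud ratio:       {fraud_ratio:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

The baseline reaches very high accuracy while detecting no fraud at all, which is why accuracy alone is a misleading metric on imbalanced data.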

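**Sampling techniques that address imbalance by adjusting the data**
1. Undersampling: reduce the majority class to match the size of the minority class. It is used less often because discarding data loses information.
  - Random Sampling: draw samples at random from the majority class
    - Drawback: the result changes on every run unless the random seed is fixed
  - Tomek Links: find pairs of nearest neighbors that belong to different classes and remove the majority-class member of each pair, which cleans up the class boundary (addresses the drawback of Random Sampling)
  - CNN Rule: the Condensed Nearest Neighbor rule (not a convolutional neural network), which keeps only the subset of samples a 1-nearest-neighbor classifier needs to classify the remaining samples correctly
  - One Sided Selection: Tomek Links + CNN Rule
2. Oversampling: increase the minority class to match the size of the majority class (resampling)
  - SMOTE: generate synthetic minority-class samples by interpolating between a minority sample and its k nearest neighbors (kNN)
  - GAN: generate synthetic minority-class samples with a generative adversarial network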
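
The two families above can be sketched in plain Python. This is a minimal illustration on toy 2-D points, not the imbalanced-learn implementation: the function names `random_undersample` and `smote_like_oversample`, the toy data, and the parameter defaults are all assumptions made for the example.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random majority subset the same size as the minority class."""
    rng = random.Random(seed)  # fixing the seed makes the draw repeatable
    return rng.sample(majority, len(minority))

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority points, sorted by squared Euclidean distance to base
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

majority = [(float(i), float(i % 7)) for i in range(100)]    # toy majority class
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority class

kept_majority = random_undersample(majority, minority)
new_points = smote_like_oversample(minority, n_new=6)
print(len(kept_majority), len(new_points))
```

Undersampling throws away 96 of the 100 majority points here, which shows where the information loss comes from. The synthetic points all lie on line segments between existing minority points, so they stay inside the region the minority class already occupies.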

**Feature engineering**
- Log transformation
- IQR (Interquartile Range) = Q3 - Q1, used to detect outliers
   - value > Q3 + 1.5 * IQR
   - value < Q1 - 1.5 * IQR
- Undersampling, oversampling
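
The IQR rule above can be sketched with the standard library. The `values` list is made-up toy data, and the choice of `method="inclusive"` (linear interpolation between data points) is an assumption; other quartile conventions give slightly different bounds.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # toy data; 100 is an obvious outlier

# Quartiles with linear interpolation between data points
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this bound are flagged
upper = q3 + 1.5 * iqr   # values above this bound are flagged

outliers = [v for v in values if v < lower or v > upper]
print("bounds:", lower, upper)
print("outliers:", outliers)
```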

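# Credit Card Fraud Detection
- Dataset for identifying fraudulent transactions (creditcard.csv)
- A fraudulent transaction is one made with no intention of paying the card bill, one made with a stolen card, and so on.
- Dependent variable: whether the transaction is fraudulent
- Task type: classification
- Evaluation metrics: accuracy, confusion matrix, classification report, etc.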

# Module Loading

In [1]:
# Required libraries
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!pip install mglearn
import mglearn

# Render minus signs correctly in plots
plt.rcParams['axes.unicode_minus'] = False

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Magic command: display plots inline in the notebook
%matplotlib inline



# Dataset Loading

In [36]:
card_df = pd.read_csv('creditcard.csv')
card_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [37]:
card_df.shape

(284807, 31)

In [38]:
card_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

# Preprocessing

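## Handling Missing Values
- Amount: represents a monetary amount, so missing values are replaced with the mean amount
- Class: the target, so missing values are replaced with the more frequent class (0)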

In [39]:
# Amount
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
card_df['Amount'] = card_df['Amount'].fillna(card_df['Amount'].mean())

In [40]:
# Class
card_df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [41]:
# Fill with the majority class (0); assign back instead of using inplace=True
card_df['Class'] = card_df['Class'].fillna(0.0)

In [42]:
card_df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

## Dropping Unnecessary Columns
- Time

In [43]:
card_df.drop('Time', axis=1, inplace=True)

In [44]:
card_df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')

## Splitting data / target

In [45]:
data = card_df.iloc[:,:-1]
target = card_df.iloc[:,-1]

In [80]:
target.info()

<class 'pandas.core.series.Series'>
RangeIndex: 284807 entries, 0 to 284806
Series name: Class
Non-Null Count   Dtype
--------------   -----
284807 non-null  int64
dtypes: int64(1)
memory usage: 2.2 MB


## Train / test split

In [46]:
from sklearn.model_selection import train_test_split
# stratify=target keeps the class ratio the same in both splits
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, stratify=target, random_state=0)

In [47]:
x_train.shape, x_test.shape

((199364, 29), (85443, 29))

In [48]:
y_train.shape, y_test.shape

((199364,), (85443,))
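
As a quick check on the stratified split, the arithmetic below reproduces the split sizes and compares the fraud ratio in each part. The per-split fraud counts (344 and 148) are taken from outputs in this document (the LightGBM log and the confusion matrices); the rounding rule for the test size is an assumption chosen because it matches the reported shapes.

```python
import math

n_total, n_fraud = 284807, 492   # dataset size and fraud count from this document
test_size = 0.3

n_test = math.ceil(test_size * n_total)   # assumed rounding; matches the shapes above
n_train = n_total - n_test

# Fraud counts per split as they appear in this document's outputs
fraud_train, fraud_test = 344, 148

ratio_all = n_fraud / n_total
ratio_train = fraud_train / n_train
ratio_test = fraud_test / n_test

print(n_train, n_test)
print(f"overall: {ratio_all:.5f}, train: {ratio_train:.5f}, test: {ratio_test:.5f}")
```

The three ratios agree to within rounding, which is what `stratify=target` is meant to guarantee. Without it, a random split could leave the test set with too few fraud cases to evaluate recall reliably.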

# Modeling

## Building an evaluation function
- Prints accuracy, precision, recall, F1 score, and ROC AUC

In [49]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, roc_curve, precision_recall_curve

def get_eval(y_test, pred=None, pred_proba=None):
  # pred: predicted labels, pred_proba: predicted probability of the positive class
  confusion = confusion_matrix(y_test, pred)
  accuracy = accuracy_score(y_test, pred)
  precision = precision_score(y_test, pred)
  recall = recall_score(y_test, pred)
  f1 = f1_score(y_test, pred)
  # ROC AUC is computed from probabilities, not from hard labels
  roc_auc = roc_auc_score(y_test, pred_proba)
  print('Confusion matrix')
  print(confusion)
  print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
        'F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))


## LogisticRegression

In [50]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
# Train
lr.fit(x_train, y_train)

# Predict
lr_pred = lr.predict(x_test)
lr_pred_proba = lr.predict_proba(x_test)[:, 1]

# Evaluate
get_eval(y_test, lr_pred, lr_pred_proba)

Confusion matrix
[[85281    14]
 [   57    91]]
Accuracy: 0.9992, Precision: 0.8667, Recall: 0.6149, F1: 0.7194, AUC: 0.9704
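
The reported scores can be recomputed by hand from the confusion matrix above, which shows how each metric is defined. This is a plain-arithmetic sketch; the variable names are chosen for the example, and the layout assumed is scikit-learn's convention of rows as true classes and columns as predicted classes.

```python
# Confusion matrix from the logistic regression output above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 85281, 14
fn, tp = 57, 91

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, "
      f"recall={recall:.4f}, f1={f1:.4f}")
```

Accuracy is dominated by the 85,281 true negatives, while recall shows that 57 of the 148 fraud cases in the test set were missed.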


### 'Amount' feature scaling

**Log transformation of the Amount feature**
- Transforms data with a skewed distribution so that it is closer to a normal distribution
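
A small sketch of the transformation, using the first five `Amount` values shown in the dataset preview. It uses `math.log1p` from the standard library; `np.log1p` applies the same function element-wise to a whole column.

```python
import math

# First five Amount values from the dataset preview in this document
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

# log1p(x) = log(1 + x); safe for x = 0, unlike log(x)
log_amounts = [math.log1p(a) for a in amounts]

for a, la in zip(amounts, log_amounts):
    print(f"{a:>8.2f} -> {la:.6f}")

# expm1 is the inverse, so the original amounts can be recovered
restored = [math.expm1(v) for v in log_amounts]
```

The raw amounts span more than two orders of magnitude, while the transformed values fall in a narrow range, which is the compression of the long right tail that the log transformation is used for.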

In [69]:
card = pd.read_csv('creditcard.csv')

In [70]:
amount_n = np.log1p(card['Amount'])
card.drop(['Time', 'Amount'], axis=1, inplace=True)
card.insert(0, 'AmountScaled', amount_n)
card.head()

Unnamed: 0,AmountScaled,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,5.01476,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0
1,1.305626,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0
2,5.939276,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0
3,4.824306,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0
4,4.262539,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0


In [71]:
card.shape

(284807, 30)

In [72]:
card.isnull().sum()

AmountScaled    0
V1              0
V2              0
V3              0
V4              0
V5              0
V6              0
V7              0
V8              0
V9              0
V10             0
V11             0
V12             0
V13             0
V14             0
V15             0
V16             0
V17             0
V18             0
V19             0
V20             0
V21             0
V22             0
V23             0
V24             0
V25             0
V26             0
V27             0
V28             0
Class           0
dtype: int64

In [73]:
# Fill missing values, if any: column mean for the features, majority class (0) for Class
# Results are assigned back because inplace=True on a selected column is unreliable in recent pandas
mean_cols = ['AmountScaled', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
card[mean_cols] = card[mean_cols].fillna(card[mean_cols].mean())
card['Class'] = card['Class'].fillna(0.0)

In [74]:
card.isnull().sum()

AmountScaled    0
V1              0
V2              0
V3              0
V4              0
V5              0
V6              0
V7              0
V8              0
V9              0
V10             0
V11             0
V12             0
V13             0
V14             0
V15             0
V16             0
V17             0
V18             0
V19             0
V20             0
V21             0
V22             0
V23             0
V24             0
V25             0
V26             0
V27             0
V28             0
Class           0
dtype: int64

In [75]:
data = card.iloc[:,:-1]
target = card.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, stratify=target, random_state=0)

lr = LogisticRegression(max_iter=1000)
# Train
lr.fit(x_train, y_train)

# Predict
lr_pred = lr.predict(x_test)
lr_pred_proba = lr.predict_proba(x_test)[:, 1]

# Evaluate
get_eval(y_test, lr_pred, lr_pred_proba)

Confusion matrix
[[85283    12]
 [   59    89]]
Accuracy: 0.9992, Precision: 0.8812, Recall: 0.6014, F1: 0.7149, AUC: 0.9727


In [79]:
x_train.shape, y_test.shape

((199364, 29), (85443,))

## LightGBM

In [76]:
# Helper that trains a model, predicts, and prints the evaluation
def train_eval(model, ftr_train=None, ftr_test=None, tgt_train=None, tgt_test=None):
  model.fit(ftr_train, tgt_train)
  pred = model.predict(ftr_test)
  pred_proba = model.predict_proba(ftr_test)[:,1]
  get_eval(tgt_test, pred, pred_proba)

In [77]:
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(n_estimators=1000, num_leaves=64,
                      n_jobs=-1, boost_from_average=False)
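# num_leaves: maximum number of leaf nodes per decision tree
## higher values allow more complex trees (risk of overfitting)
## lower values give simpler trees (risk of underfitting)
# n_jobs=-1: use all available CPU cores
# boost_from_average=False: leaving the default (True) can hurt recall and ROC-AUC
## on extremely imbalanced labels, so it is turned off here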

train_eval(lgbm, ftr_train=x_train, ftr_test=x_test,
           tgt_train=y_train, tgt_test=y_test)

[LightGBM] [Info] Number of positive: 344, number of negative: 199020
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.042543 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7395
[LightGBM] [Info] Number of data points in the train set: 199364, number of used features: 29
Confusion matrix
[[85290     5]
 [   35   113]]
Accuracy: 0.9995, Precision: 0.9576, Recall: 0.7635, F1: 0.8496, AUC: 0.9796


## Training the model after SMOTE oversampling

In [78]:
from imblearn.over_sampling import SMOTE

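# Fix the seed so the synthetic samples are reproducible
smote = SMOTE(random_state=0)
# Resample the training set only; the test set must keep its original distribution
x_train_over, y_train_over = smote.fit_resample(x_train, y_train)

print('Train features/target before SMOTE: ', x_train.shape, y_train.shape)
print('Train features/target after SMOTE: ', x_train_over.shape, y_train_over.shape)

Train features/target before SMOTE:  (199364, 29) (199364,)
Train features/target after SMOTE:  (398040, 29) (398040,)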
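
The size of the resampled training set follows from the class counts. The sketch below assumes SMOTE's default behavior of adding minority samples until both classes are the same size, and takes the training-set class counts from the LightGBM log earlier in this document.

```python
# Training-set class counts as reported in the LightGBM log in this document
n_negative, n_positive = 199020, 344

# Assumed default: the minority class is oversampled up to the majority size
n_synthetic = n_negative - n_positive
n_after = n_negative + n_positive + n_synthetic

print("synthetic fraud samples added:", n_synthetic)
print("training rows after SMOTE:", n_after)
```

Nearly all of the fraud samples in the resampled set are therefore synthetic, generated from only 344 real cases.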
