# Under-sampling and Over-sampling

<span style='background-color: #f5f0ff'>Used when training on data with imbalanced labels.</span>  
Over-sampling often gives slightly better predictive performance, so it is used more frequently.

## Under-sampling

Reduces the larger (majority-class) data set down to the size of the smaller (minority-class) one.
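
As a minimal sketch of this idea (random under-sampling with NumPy only), the block below keeps every minority row and an equally sized random subset of the majority rows. The labels and features are made up for illustration and are not taken from the credit card data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 95 majority (0) and 5 minority (1) rows.
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep every minority row and an equally sized random subset of the majority.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # [5 5]
```

The price of this approach is that 90 of the 95 majority rows are thrown away, which is one reason over-sampling tends to be preferred.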

## Over-sampling

Multiplies the smaller (minority-class) data set to secure enough data for training.  
- A representative method is SMOTE (Synthetic Minority Over-sampling Technique), implemented in the imbalanced-learn package.
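
The core idea behind SMOTE can be sketched in a few lines: a synthetic point is placed on the line segment between a minority sample and one of its neighbors. The block below is a simplified illustration, not the imbalanced-learn implementation. It always uses the single nearest neighbor, whereas SMOTE picks randomly among the k nearest, and the function name `synthesize` and the sample values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def synthesize(samples, n_new, rng):
    """Create n_new points by interpolating toward each pick's nearest neighbor."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(samples))
        dists = np.linalg.norm(samples - samples[i], axis=1)
        dists[i] = np.inf                      # exclude the sample itself
        neighbor = samples[np.argmin(dists)]
        gap = rng.random()                     # position along the segment, in [0, 1)
        new_points.append(samples[i] + gap * (neighbor - samples[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6, rng=rng)
print(synthetic.shape)  # (6, 2)
```

Because every new point is an interpolation, the synthetic samples stay inside the region already covered by the minority class instead of being exact duplicates.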

In [2]:
pip install imbalanced-learn




# Classification Practice - Creditcard Fraud

In [3]:
import pandas as pd
card = pd.read_csv('creditcard2.csv')
card[:3]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0


In [4]:
card.shape

(284807, 31)

## Preprocessing the data

> Drop the Time column

In [5]:
card.drop(['Time'], axis = 1, inplace = True)

> Split the data set

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
features = card.iloc[:, :-1]
labels = card.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = .3, stratify = labels) 

# stratify = labels : split so that the training and test sets keep the class ratio of 'labels'

> Check that the two splits have similar class ratios

In [9]:
y_train.value_counts() / len(y_train) # label ratio in the training data

0    0.998275
1    0.001725
Name: Class, dtype: float64

In [10]:
y_test.value_counts() / len(y_test) # label ratio in the test data

0    0.998268
1    0.001732
Name: Class, dtype: float64

In [11]:
y_train.shape[0]

199364

In [12]:
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def get_clf_eval(y_test, pred, pred_proba):
    # Print the confusion matrix and the main classification metrics
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)
    
    print(confusion)
    print()
    print(f' Accuracy: {accuracy:.4f}, \n Precision: {precision:.4f}, \n Recall: {recall:.4f}, \n f1_score: {f1:.4f}, \n AUC: {roc_auc:.4f}')

> Predict credit card fraud with logistic regression

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

[[85283    12]
 [   65    83]]

 Accuracy: 0.9991, 
 Precision: 0.8737, 
 Recall: 0.5608, 
 f1_score: 0.6831, 
 AUC: 0.9570
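
To see where these numbers come from, the metrics can be recomputed by hand from the confusion matrix printed above, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. This is only a worked check of the arithmetic. AUC is left out because it needs the predicted probabilities, not just the four counts.

```python
# Counts from the logistic regression confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 85283, 12, 65, 83

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of the predicted frauds, how many were fraud
recall = tp / (tp + fn)               # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}')
# 0.9991 0.8737 0.5608 0.6831
```

Accuracy is dominated by the 85,283 true negatives, so it stays near 1 even though 65 of the 148 fraud cases are missed. That is why recall and f1_score are the more informative figures on data this imbalanced.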


> Predict with LightGBM

In [14]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [15]:
from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators = 1000, num_leaves = 64, n_jobs = -1, boost_from_average = False)
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
lgbm_pred_proba = lgbm_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, lgbm_pred, lgbm_pred_proba)

[[85289     6]
 [   32   116]]

 Accuracy: 0.9996, 
 Precision: 0.9508, 
 Recall: 0.7838, 
 f1_score: 0.8593, 
 AUC: 0.9846


=> Accuracy, precision, recall, f1_score, and AUC are all higher than with logistic regression

## Applying the models after data preprocessing

In [16]:
card['Amount'].describe()

count    284807.000000
mean         88.349619
std         250.120109
min           0.000000
25%           5.600000
50%          22.000000
75%          77.165000
max       25691.160000
Name: Amount, dtype: float64

### Scaling (standardization)

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
card['Amount'] = scaler.fit_transform(card['Amount'].values.reshape(-1,1))

In [18]:
card[:3]

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.244964,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,-0.342475,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.160686,0
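
As a quick check of what standardization does, the scaled `Amount` values in the first three rows can be approximately reproduced from the `describe()` output above with z = (x - mean) / std. One caveat: `describe()` reports the sample standard deviation (ddof=1) while `StandardScaler` uses the population standard deviation (ddof=0). With 284,807 rows the difference only shows up around the sixth decimal place.

```python
import numpy as np

# Summary statistics of Amount, copied from card['Amount'].describe() above.
mean, std = 88.349619, 250.120109

amounts = np.array([149.62, 2.69, 378.66])   # first three rows before scaling
z = (amounts - mean) / std                   # standardization: zero mean, unit variance
print(z.round(4))
```

The results agree with the `Amount` column shown above (0.244964, -0.342475, 1.160686) to about four decimal places.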


> Split the data set

In [19]:
features = card.iloc[:, :-1]
labels = card.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = .3, stratify = labels) 

> Predict and evaluate with logistic regression

In [20]:
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

[[85284    11]
 [   52    96]]

 Accuracy: 0.9993, 
 Precision: 0.8972, 
 Recall: 0.6486, 
 f1_score: 0.7529, 
 AUC: 0.9687


> Predict and evaluate with LightGBM

In [21]:
from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators = 1000, num_leaves = 64, n_jobs = -1, boost_from_average = False)
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
lgbm_pred_proba = lgbm_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, lgbm_pred, lgbm_pred_proba)

[[85288     7]
 [   29   119]]

 Accuracy: 0.9996, 
 Precision: 0.9444, 
 Recall: 0.8041, 
 f1_score: 0.8686, 
 AUC: 0.9745


--------

### Log transform (np.log1p)

In [22]:
card = pd.read_csv('creditcard2.csv')

# Drop the Time column
card = card.drop(['Time'], axis = 1)

In [23]:
import numpy as np

# Log transform
np.log1p(card['Amount'])

0         5.014760
1         1.305626
2         5.939276
3         4.824306
4         4.262539
            ...   
284802    0.570980
284803    3.249987
284804    4.232366
284805    2.397895
284806    5.384495
Name: Amount, Length: 284807, dtype: float64

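> **Why apply a log transform?**  => The Amount values are large and heavily right-skewed (maximum 25691.16 against a median of 22.00), and the log transform compresses that range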
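
A small sketch of the transform on a few `Amount` values that appear in this notebook (the minimum, two of the first rows, and the maximum). `np.log1p(x)` computes log(1 + x), so it is defined at 0 where a plain log is not, and `np.expm1` inverts it.

```python
import numpy as np

amounts = np.array([0.0, 2.69, 149.62, 25691.16])  # includes the min and max of Amount

logged = np.log1p(amounts)      # log(1 + x); defined at x = 0, unlike log(x)
restored = np.expm1(logged)     # inverse transform recovers the original values

print(logged.round(4))
```

A range of roughly 0 to 25,691 shrinks to roughly 0 to 10.15, which matches the `max` of `Amount` in the `describe()` table further down.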

In [24]:
card['Amount'] = np.log1p(card['Amount'])
card[:3]

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,5.01476,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,1.305626,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,5.939276,0


In [25]:
# Split the data set
features = card.iloc[:, :-1]
labels = card.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = .3, stratify = labels) 

# Logistic regression: create the model, fit, predict, evaluate
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

# LightGBM
lgbm_clf = LGBMClassifier(n_estimators = 1000, num_leaves = 64, n_jobs = -1, boost_from_average = False)
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
lgbm_pred_proba = lgbm_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, lgbm_pred, lgbm_pred_proba)

[[85279    16]
 [   57    91]]

 Accuracy: 0.9991, 
 Precision: 0.8505, 
 Recall: 0.6149, 
 f1_score: 0.7137, 
 AUC: 0.9808
[[85291     4]
 [   30   118]]

 Accuracy: 0.9996, 
 Precision: 0.9672, 
 Recall: 0.7973, 
 f1_score: 0.8741, 
 AUC: 0.9762


### Removing outliers

In [26]:
card.describe().round(2)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,3.15,0.0
std,1.96,1.65,1.52,1.42,1.38,1.33,1.24,1.19,1.1,1.09,...,0.73,0.73,0.62,0.61,0.52,0.48,0.4,0.33,1.66,0.04
min,-56.41,-72.72,-48.33,-5.68,-113.74,-26.16,-43.56,-73.22,-13.43,-24.59,...,-34.83,-10.93,-44.81,-2.84,-10.3,-2.6,-22.57,-15.43,0.0,0.0
25%,-0.92,-0.6,-0.89,-0.85,-0.69,-0.77,-0.55,-0.21,-0.64,-0.54,...,-0.23,-0.54,-0.16,-0.35,-0.32,-0.33,-0.07,-0.05,1.89,0.0
50%,0.02,0.07,0.18,-0.02,-0.05,-0.27,0.04,0.02,-0.05,-0.09,...,-0.03,0.01,-0.01,0.04,0.02,-0.05,0.0,0.01,3.14,0.0
75%,1.32,0.8,1.03,0.74,0.61,0.4,0.57,0.33,0.6,0.45,...,0.19,0.53,0.15,0.44,0.35,0.24,0.09,0.08,4.36,0.0
max,2.45,22.06,9.38,16.88,34.8,73.3,120.59,20.01,15.59,23.75,...,27.2,10.5,22.53,4.58,7.52,3.52,31.61,33.85,10.15,1.0


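=> The V columns all have a mean of 0, so they are already **centered**; their standard deviations are not 1, however, so they are not scaled to unit variance

#### Finding outliers (IQR method)

> IQR (Interquartile Range)  
  
    : a technique that uses the spread between quartiles, commonly visualized as a box plot

IQR = Q3 - Q1 (the 75th percentile minus the 25th percentile)  
  
Upper bound: Q3 + 1.5 × IQR  
Lower bound: Q1 - 1.5 × IQR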
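
The bounds above can be sketched as a small helper. The function name `iqr_bounds` and the sample array are made up for illustration. Values outside the two bounds are treated as outliers.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up values
lower, upper = iqr_bounds(data)
outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)  # -3.5 14.5 [100]
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and only the value 100 falls outside the range from -3.5 to 14.5.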

In [27]:
card.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

> Extract only the rows where Class is 1

In [28]:
fraud = card[card['Class'] == 1]
fraud

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,-2.772272,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.177840,0.261145,-0.143276,0.000000,1
623,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,-0.838587,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,6.272877,1
4920,-2.303350,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.562320,-0.399147,-0.238253,-1.525412,...,-0.294166,-0.932391,0.172726,-0.087330,-0.156114,-0.542628,0.039566,-0.153029,5.484506,1
6108,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,-4.801637,...,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,4.094345,1
6329,1.234235,3.019740,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,-2.447469,...,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,0.693147,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,-5.587794,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,5.968708,1
280143,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,-3.232153,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.565314,1
280149,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,-3.463891,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,4.368054,1
281144,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,-5.245984,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,5.505332,1


> Find the outliers

In [29]:
pct25 = np.percentile(fraud.V14, 25)
pct75 = np.percentile(fraud.V14, 75)

IQR = pct75 - pct25

upper_out = pct75 + 1.5 * IQR
under_out = pct25 - 1.5 * IQR

cond1 = fraud['V14'] >= upper_out
cond2 = fraud['V14'] <= under_out
fraud.V14[cond1 | cond2] 

8296   -19.214325
8615   -18.822087
9035   -18.493773
9252   -18.049998
Name: V14, dtype: float64

In [30]:
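# Another answer
pct75 = fraud['V14'].quantile(0.75)
pct25 = fraud['V14'].quantile(0.25)
IQR = pct75 - pct25
upper = pct75 + 1.5 * IQR
lower = pct25 - 1.5 * IQR
# Rows inside the bounds (non-outliers); with `|` the condition would be true for every row
fraud[(fraud['V14'] <= upper) & (fraud['V14'] >= lower)]

# To select the outliers instead:
# fraud['V14'][(fraud['V14'] >= upper) | (fraud['V14'] <= lower)]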

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,-2.772272,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.177840,0.261145,-0.143276,0.000000,1
623,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,-0.838587,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,6.272877,1
4920,-2.303350,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.562320,-0.399147,-0.238253,-1.525412,...,-0.294166,-0.932391,0.172726,-0.087330,-0.156114,-0.542628,0.039566,-0.153029,5.484506,1
6108,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,-4.801637,...,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,4.094345,1
6329,1.234235,3.019740,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,-2.447469,...,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,0.693147,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,-5.587794,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,5.968708,1
280143,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,-3.232153,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.565314,1
280149,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,-3.463891,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,4.368054,1
281144,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,-5.245984,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,5.505332,1


In [31]:
# Another answer
iqr = fraud['V14'].quantile(0.75) - fraud['V14'].quantile(0.25)
con1 = fraud['V14'] <= fraud['V14'].quantile(0.25) - 1.5*iqr
con2 = fraud['V14'] >= fraud['V14'].quantile(0.75) + 1.5*iqr

fraud['V14'][con1 | con2]

8296   -19.214325
8615   -18.822087
9035   -18.493773
9252   -18.049998
Name: V14, dtype: float64

In [32]:
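# Another answer
Q1 = np.percentile(fraud['V14'], 25)
Q3 = np.percentile(fraud['V14'], 75)
IQR = Q3 - Q1

# Store the result under a new name so that `fraud` still holds every fraud row
outlier_rows = fraud.query("V14 > @Q3 + 1.5 * @IQR or V14 < @Q1 - 1.5 * @IQR")
outlier_rows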

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
8296,-2.12549,5.973556,-11.034727,9.007147,-1.689451,-2.854415,-7.810441,2.03087,-5.902828,-12.840934,...,1.646518,-0.278485,-0.664841,-1.164555,1.701796,0.690806,2.119749,1.108933,0.693147,1
8615,-3.891192,7.098916,-11.426467,8.607557,-2.065706,-2.985288,-8.138589,2.973928,-6.27279,-13.193415,...,1.757085,-0.189709,-0.508629,-1.189308,1.188536,0.605242,1.881529,0.87526,0.693147,1
9035,-2.589617,7.016714,-13.705407,10.343228,-2.954461,-3.055116,-9.301289,3.349573,-5.654212,-11.853867,...,1.887738,0.333998,0.287659,-1.186406,-0.690273,0.631704,1.934221,0.789687,0.693147,1
9252,-5.454362,8.287421,-12.752811,8.594342,-3.106002,-3.179949,-9.252794,4.245062,-6.329801,-13.136698,...,1.846165,-0.267172,-0.310804,-1.201685,1.352176,0.608425,1.574715,0.808725,0.693147,1


#### Removing them

In [33]:
outlier = fraud[(fraud['V14'] > upper) | (fraud['V14'] < lower)]
outlier.index

Int64Index([8296, 8615, 9035, 9252], dtype='int64')

In [34]:
out_index = outlier.index.tolist()
# drop() returns a new DataFrame; the result is not assigned here, so card itself is unchanged
card.drop(out_index, axis=0)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,5.014760,0
1,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,1.305626,0
2,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,5.939276,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,4.824306,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,4.262539,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.356170,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.570980,0
284803,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,-0.975926,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,3.249987,0
284804,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,-0.484782,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,4.232366,0
284805,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,-0.399126,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,2.397895,0


In [35]:
card.shape[0]

284807
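
The row count above is still 284807 because `DataFrame.drop` returns a new frame unless the result is assigned back (or `inplace=True` is passed). A minimal sketch with a made-up frame standing in for `card`:

```python
import pandas as pd

# Made-up frame standing in for card; index labels 1 and 3 play the outliers.
df = pd.DataFrame({'V14': [0.1, -19.2, 0.3, -18.8, 0.2]})
out_index = [1, 3]

dropped = df.drop(out_index, axis=0)   # new DataFrame without the two rows
rows_in_df = df.shape[0]               # the original frame is untouched
rows_in_dropped = dropped.shape[0]

df = df.drop(out_index, axis=0)        # assigning back makes the removal stick
print(rows_in_df, rows_in_dropped, df.shape[0])  # 5 3 3
```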

In [36]:
# Another answer: this version would drop the rows from card in place
# card.drop(labels = [8296, 8615, 9035, 9252], axis = 0, inplace = True)

In [37]:
# Split the data set (card still has all 284807 rows, including the four outliers found above)
features = card.iloc[:, :-1]
labels = card.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = .3, stratify = labels) 

# Logistic regression: create the model, fit, predict, evaluate
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

# LightGBM
lgbm_clf = LGBMClassifier(n_estimators = 1000, num_leaves = 64, n_jobs = -1, boost_from_average = False)
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
lgbm_pred_proba = lgbm_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, lgbm_pred, lgbm_pred_proba)

[[85285    10]
 [   60    88]]

 Accuracy: 0.9992, 
 Precision: 0.8980, 
 Recall: 0.5946, 
 f1_score: 0.7154, 
 AUC: 0.9725
[[85290     5]
 [   28   120]]

 Accuracy: 0.9996, 
 Precision: 0.9600, 
 Recall: 0.8108, 
 f1_score: 0.8791, 
 AUC: 0.9880


## Over-sampling with SMOTE

In [38]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)

In [39]:
# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train_over, y_train_over)
preds = lr_clf.predict(X_test)
preds_proba = lr_clf.predict_proba(X_test)[: , 1]
get_clf_eval(y_test, preds, preds_proba)
print()

# Train, predict, and evaluate with LightGBM
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=64, n_jobs=-1, boost_from_average=False)
lgbm_clf.fit(X_train_over, y_train_over)
preds = lgbm_clf.predict(X_test)
preds_proba = lgbm_clf.predict_proba(X_test)[: , 1]
get_clf_eval(y_test, preds, preds_proba)

[[83212  2083]
 [   15   133]]

 Accuracy: 0.9754, 
 Precision: 0.0600, 
 Recall: 0.8986, 
 f1_score: 0.1125, 
 AUC: 0.9710

[[85277    18]
 [   26   122]]

 Accuracy: 0.9995, 
 Precision: 0.8714, 
 Recall: 0.8243, 
 f1_score: 0.8472, 
 AUC: 0.9721


---------

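# Stacking Ensemble

The prediction results of the individual algorithms are assembled into a final <span style='background-color: #dcffe4'>meta data set</span>, on which a separate ML algorithm performs the final training; <br>the final prediction is then made again on the test data.

    (Terminology) <span style='background-color: #dcffe4'>Meta model</span> : a model that is trained on, and predicts from, the data set of predictions produced by the individual models

> **Key point**: stack the predictions of several individual models to build the training feature set and the test feature set for the final meta model

## Basic stacking model - Wisconsin breast cancer data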

In [40]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

In [41]:
# Load the data
cancer = load_breast_cancer()
type(cancer)

# Prepare the features and labels
X_data = cancer.data

y_label = cancer.target

X_train , X_test , y_train , y_test = train_test_split(X_data , y_label , test_size=0.2 , random_state=0) 

# Create the classifiers for the individual ML models
knn_clf  = KNeighborsClassifier(n_neighbors=4)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)
dt_clf = DecisionTreeClassifier()
ada_clf = AdaBoostClassifier(n_estimators=100) 

# Train the individual models
knn_clf.fit(X_train, y_train)
rf_clf.fit(X_train , y_train)
dt_clf.fit(X_train , y_train)
ada_clf.fit(X_train, y_train)

# Generate predictions from each trained model and measure each model's accuracy
knn_pred = knn_clf.predict(X_test)
rf_pred = rf_clf.predict(X_test)
dt_pred = dt_clf.predict(X_test)
ada_pred = ada_clf.predict(X_test) 

print('KNN accuracy: {0:.4f}'.format(accuracy_score(y_test, knn_pred)))
print('Random forest accuracy: {0:.4f}'.format(accuracy_score(y_test, rf_pred)))
print('Decision tree accuracy: {0:.4f}'.format(accuracy_score(y_test, dt_pred)))
print('AdaBoost accuracy: {0:.4f}'.format(accuracy_score(y_test, ada_pred))) 

# Stack the predictions column-wise: one row per test sample, one column per model
pred = np.array([knn_pred, rf_pred, dt_pred, ada_pred])
pred = pred.T
print(pred.shape)
pred

# Create the classifier for the final stacking model
# Note: the meta model is fitted on the test-set predictions with y_test and scored on
# the same rows, so the accuracy below is optimistic
lr_final = LogisticRegression(C=10) 
lr_final.fit(pred, y_test)
final = lr_final.predict(pred)

# Measure the prediction accuracy of the final meta model
np.round(accuracy_score(y_test, final), 4)

KNN accuracy: 0.9211
Random forest accuracy: 0.9649
Decision tree accuracy: 0.9123
AdaBoost accuracy: 0.9561
(114, 4)


0.9737
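
The basic model above fits the meta model on predictions for the test set and scores it on those same rows. One common way to avoid this, sketched here under assumptions that are not part of the original notebook, is to build the meta model's training features from out-of-fold predictions on the training data with scikit-learn's `cross_val_predict`. The choice of three base models and `cv=5` is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = [
    KNeighborsClassifier(n_neighbors=4),
    RandomForestClassifier(n_estimators=100, random_state=0),
    DecisionTreeClassifier(random_state=0),
]

# Meta-features for training: out-of-fold predictions, so no base model
# predicts a row it was fitted on.
meta_train = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models]
)

# Meta-features for testing: base models refitted on the full training set.
meta_test = np.column_stack(
    [m.fit(X_tr, y_tr).predict(X_te) for m in base_models]
)

meta_model = LogisticRegression(C=10)
meta_model.fit(meta_train, y_tr)      # y_te is never used for fitting
stack_acc = accuracy_score(y_te, meta_model.predict(meta_test))
print(f'Stacking accuracy: {stack_acc:.4f}')
```

With this layout the test labels are used only for the final score, so the reported accuracy is an estimate on unseen data, not a fit to the test set.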

-----------