Task1_0726. Build and evaluate a credit card fraud detection model as described below.

- Initial data preprocessing, then model training/prediction/evaluation
  - Drop the Time column, then model and evaluate with logistic regression and LightGBM (done)

- Transform the distribution of the Amount column, then model training/prediction/evaluation
  - Standardize Amount, then model and evaluate with logistic regression and LightGBM

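- Remove outliers, then model training/prediction/evaluation
  - Visualize the feature correlations, confirm that V14 is strongly correlated with Class, remove the outliers in the V14 column, then model and evaluate with logistic regression and LightGBM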

- Apply SMOTE oversampling, then model training/prediction/evaluation
  - Install the imbalanced-learn library for handling imbalanced datasets
  - %pip install imbalanced-learn
  - Use SMOTE (Synthetic Minority Over-sampling Technique) to handle the imbalanced dataset
  - from imblearn.over_sampling import SMOTE
  - smote = SMOTE(random_state=0)
  - X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
  - Train a logistic regression model on the SMOTE-resampled training set and evaluate its predictive performance
  - Write a function that plots the Precision-Recall curve
  - Model and evaluate with LightGBM
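
SMOTE, as used in the last step of the list above, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The block below is a simplified NumPy illustration of that idea on made-up toy data, not the imbalanced-learn implementation. The function name `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration only; imblearn's SMOTE handles many more cases.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 5 points in 2-D (made-up values)
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_synthetic = smote_like_oversample(X_minority, n_new=20)
print(X_synthetic.shape)
```

Each synthetic row lies on a line segment between two real minority rows, so oversampling this way adds no information beyond the minority region that was already observed.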

In [17]:
import lightgbm as lgb
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_recall_curve,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [18]:
# Load the data
df = pd.read_csv('/content/drive/MyDrive/kdt_240224/m5_머신러닝/dataset/creditcard.csv')
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [25]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64
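
The counts above imply that fraud makes up well under 1% of the rows, so accuracy alone is a weak yardstick for this task. A small arithmetic sketch, using only the two counts shown, of the fraud rate and of the accuracy that a trivial "always predict 0" classifier would reach on the full dataset:

```python
# Class counts copied from the value_counts() output above
n_normal = 284315
n_fraud = 492
n_total = n_normal + n_fraud

fraud_rate = n_fraud / n_total
# A classifier that always predicts class 0 is right on every normal row
baseline_accuracy = n_normal / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"All-negative baseline accuracy: {baseline_accuracy:.4%}")
```

This is why precision, recall, and F1 are worth reading alongside accuracy in the evaluations that follow.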

In [26]:
# Initial data preprocessing and model training/prediction/evaluation
# Drop the Time column
# All of the steps above are wrapped in one function
def first_preprocessing_and_model(df):
    df = df.drop('Time', axis=1)

    X = df.drop('Class', axis=1)
    y = df['Class']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

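    # Train the logistic regression model and predict
    # (default max_iter=100; this run emits the convergence warning shown in the output)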
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Compute the evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
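    # Note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy;
    # pass model.predict_proba(X_test)[:, 1] to get a ranking-based ROC AUC
    roc_auc = roc_auc_score(y_test, y_pred)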
    confusion = confusion_matrix(y_test, y_pred)

    print(f"Accuracy:{accuracy:.4f}: ")
    print(f"Precision:{precision:.4f}: ")
    print(f"Recall:{recall:.4f}: ")
    print(f"F1 Score:{f1:.4f}: ")
    print(f"ROC AUC Score:{roc_auc:.4f}: ")
    print(f"Confusion Matrix:\n{confusion}")
    print('\n\n\n')


    # Train the LightGBM model and predict (the train/test split above is reused)
    lgb_model = lgb.LGBMClassifier(random_state=42, verbose=-1)

    lgb_model.fit(X_train, y_train)
    y_pred = lgb_model.predict(X_test)

    # Compute the evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Computed from hard 0/1 labels, so this is balanced accuracy rather than a ranking-based AUC
    roc_auc = roc_auc_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)

    print(f"Accuracy:{accuracy:.4f}: ")
    print(f"Precision:{precision:.4f}: ")
    print(f"Recall:{recall:.4f}: ")
    print(f"F1 Score:{f1:.4f}: ")
    print(f"ROC AUC Score:{roc_auc:.4f}: ")
    print(f"Confusion Matrix:\n{confusion}")



    return None

first_preprocessing_and_model(df)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
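
The warning above indicates that the default solver stopped at its iteration limit on the unscaled features. Below is a hedged sketch of the two remedies the message itself suggests: scaling the inputs and raising `max_iter`. It runs on synthetic data from `make_classification` rather than the credit card file. The dataset shape, the inflated first column, and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the credit card dataset)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X[:, 0] *= 1000.0  # mimic one feature on a much larger scale, like Amount

# With a pipeline, the scaler is fitted only on the data passed to fit()
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

n_iter = int(clf[-1].n_iter_[0])
print(f"lbfgs iterations used: {n_iter}")
```

If the iteration count stays far below the limit after scaling, the warning was most likely caused by feature scale rather than by a genuinely hard optimization problem.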


Accuracy:0.9990: 
Precision:0.8060: 
Recall:0.5510: 
F1 Score:0.6545: 
ROC AUC Score:0.7754: 
Confusion Matrix:
[[56851    13]
 [   44    54]]




Accuracy:0.9961: 
Precision:0.1556: 
Recall:0.2857: 
F1 Score:0.2014: 
ROC AUC Score:0.6415: 
Confusion Matrix:
[[56712   152]
 [   70    28]]
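
Every score printed above can be recomputed from the two confusion matrices, which makes a useful cross-check. The sketch below does that arithmetic in plain Python (the helper name `metrics_from_confusion` is hypothetical). It also shows that the reported "ROC AUC" equals the mean of recall and specificity, which is what `roc_auc_score` yields when it receives hard 0/1 predictions instead of probabilities.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Recompute the printed scores from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # AUC of a single-threshold (0/1) prediction equals balanced accuracy
        "auc_from_labels": (recall + specificity) / 2,
    }

# Counts copied from the output above, laid out as [[TN, FP], [FN, TP]]
logistic = metrics_from_confusion(tn=56851, fp=13, fn=44, tp=54)
lightgbm_scores = metrics_from_confusion(tn=56712, fp=152, fn=70, tp=28)

for name, scores in [("LogisticRegression", logistic), ("LightGBM", lightgbm_scores)]:
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Since both "ROC AUC" figures are reproduced exactly by this formula, they say nothing about how well either model ranks transactions by predicted probability.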


In [28]:
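# Transform the distribution of the Amount column, then train/predict/evaluate
# Standardize Amount, then model and evaluate with logistic regression and LightGBM
# Wrapped in a function

def sca_Amount_and_model(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()
    scaler = StandardScaler()
    df['scaled_amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
    # The modeling and evaluation steps are not written yet
    return df

# Correlation of each column with Amount
df.corr()['Amount']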

Time     -0.010596
V1       -0.227709
V2       -0.531409
V3       -0.210880
V4        0.098732
V5       -0.386356
V6        0.215981
V7        0.397311
V8       -0.103079
V9       -0.044246
V10      -0.101502
V11       0.000104
V12      -0.009542
V13       0.005293
V14       0.033751
V15      -0.002986
V16      -0.003910
V17       0.007309
V18       0.035650
V19      -0.056151
V20       0.339403
V21       0.105999
V22      -0.064801
V23      -0.112633
V24       0.005146
V25      -0.047837
V26      -0.003208
V27       0.028825
V28       0.010258
Amount    1.000000
Class     0.005632
Name: Amount, dtype: float64
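
Standardizing `Amount` subtracts its mean and divides by its standard deviation. Because that is a positive linear rescaling, Pearson correlations such as the `df.corr()['Amount']` column above would be the same for a standardized copy of the column. A minimal NumPy sketch on made-up numbers (the arrays `amount` and `other` are hypothetical, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up, right-skewed "amount"-like values and a loosely related feature
amount = rng.exponential(scale=90.0, size=1000)
other = 0.01 * amount + rng.normal(size=1000)

# Same formula StandardScaler applies: (x - mean) / population std (ddof=0)
scaled_amount = (amount - amount.mean()) / amount.std()

corr_raw = np.corrcoef(amount, other)[0, 1]
corr_scaled = np.corrcoef(scaled_amount, other)[0, 1]

print(f"mean={scaled_amount.mean():.3f}, std={scaled_amount.std():.3f}")
print(f"corr raw={corr_raw:.6f}, corr scaled={corr_scaled:.6f}")
```

Standardization therefore changes the scale that the logistic regression solver sees, not the linear relationships between columns.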