# Kaggle Project

## Describe Your Dataset

**URL:** https://www.kaggle.com/datasets/zeesolver/credit-card

**Task:**

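This dataset records information about each credit card transaction at the time it was made (time, location, transaction type, and so on), the transaction amount, and whether the transaction was fraudulent. Based on it, we carry out the following two tasks.

1. Build models that predict the transaction amount from the transaction features (V1-V28) (Regression)

    The transaction features are anonymized as V1-V28 and already normalized. Using them as inputs, we predict the transaction amount with the following models.
    * Linear Regression
    * Decision Tree
    * nn.Linear


2. Explore the relationship between the transaction amount (Amount) and fraud (Classification)

    One would expect a higher transaction amount to make fraud more likely. To check how strongly the amount is related to fraud, we analyze the data with the following models.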
    * Logistic Regression
    * Decision Tree
    * SVM

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

**Datasets**

*Data Info*

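- ID: unique identifier assigned to each transaction.
- V1-V28: features or attributes related to the credit card transaction (time, amount, location, transaction type, and various other details that can be used for analysis and fraud detection).
- Amount: the monetary value of the credit card transaction, i.e. the amount charged or credited to the card in that transaction.
- Class: fraudulent (1), legitimate (0)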

In [2]:
# Raw Data
data = pd.read_csv('creditcard_2023.csv')
data

Unnamed: 0,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,...,-0.110552,0.217606,-0.134794,0.165959,0.126280,-0.434824,-0.081230,-0.151045,17982.10,0
1,1,0.985100,-0.356045,0.558056,-0.429654,0.277140,0.428605,0.406466,-0.133118,0.347452,...,-0.194936,-0.605761,0.079469,-0.577395,0.190090,0.296503,-0.248052,-0.064512,6531.37,0
2,2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,...,-0.005020,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,2513.54,0
3,3,-0.152152,-0.508959,1.746840,-1.090178,0.249486,1.143312,0.518269,-0.065130,-0.205698,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.515950,-0.165316,0.048424,5384.44,0
4,4,-0.206820,-0.165280,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.212660,1.049921,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,14278.97,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,568625,-0.833437,0.061886,-0.899794,0.904227,-1.002401,0.481454,-0.370393,0.189694,-0.938153,...,0.167503,0.419731,1.288249,-0.900861,0.560661,-0.006018,3.308968,0.081564,4394.16,1
568626,568626,-0.670459,-0.202896,-0.068129,-0.267328,-0.133660,0.237148,-0.016935,-0.147733,0.483894,...,0.031874,0.388161,-0.154257,-0.846452,-0.153443,1.961398,-1.528642,1.704306,4653.40,1
568627,568627,-0.311997,-0.004095,0.137526,-0.035893,-0.042291,0.121098,-0.070958,-0.019997,-0.122048,...,0.140788,0.536523,-0.211100,-0.448909,0.540073,-0.755836,-0.487540,-0.268741,23572.85,1
568628,568628,0.636871,-0.516970,-0.300889,-0.144480,0.131042,-0.294148,0.580568,-0.207723,0.893527,...,-0.060381,-0.195609,-0.175488,-0.554643,-0.099669,-1.434931,-0.159269,-0.076251,10160.83,1


In [3]:
# Drop the id column
df = data.drop(['id'], axis=1)
df

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,0.637735,...,-0.110552,0.217606,-0.134794,0.165959,0.126280,-0.434824,-0.081230,-0.151045,17982.10,0
1,0.985100,-0.356045,0.558056,-0.429654,0.277140,0.428605,0.406466,-0.133118,0.347452,0.529808,...,-0.194936,-0.605761,0.079469,-0.577395,0.190090,0.296503,-0.248052,-0.064512,6531.37,0
2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,0.690708,...,-0.005020,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,2513.54,0
3,-0.152152,-0.508959,1.746840,-1.090178,0.249486,1.143312,0.518269,-0.065130,-0.205698,0.575231,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.515950,-0.165316,0.048424,5384.44,0
4,-0.206820,-0.165280,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.212660,1.049921,0.968046,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,14278.97,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,-0.833437,0.061886,-0.899794,0.904227,-1.002401,0.481454,-0.370393,0.189694,-0.938153,-1.161847,...,0.167503,0.419731,1.288249,-0.900861,0.560661,-0.006018,3.308968,0.081564,4394.16,1
568626,-0.670459,-0.202896,-0.068129,-0.267328,-0.133660,0.237148,-0.016935,-0.147733,0.483894,-0.210817,...,0.031874,0.388161,-0.154257,-0.846452,-0.153443,1.961398,-1.528642,1.704306,4653.40,1
568627,-0.311997,-0.004095,0.137526,-0.035893,-0.042291,0.121098,-0.070958,-0.019997,-0.122048,-0.144495,...,0.140788,0.536523,-0.211100,-0.448909,0.540073,-0.755836,-0.487540,-0.268741,23572.85,1
568628,0.636871,-0.516970,-0.300889,-0.144480,0.131042,-0.294148,0.580568,-0.207723,0.893527,-0.080078,...,-0.060381,-0.195609,-0.175488,-0.554643,-0.099669,-1.434931,-0.159269,-0.076251,10160.83,1


In [4]:
# Extract Amount and standardize it
Data_A = df[['Amount']]

scaler = StandardScaler()
Data_A = scaler.fit_transform(Data_A)

Data_A = pd.DataFrame(Data_A, columns = ['Amount'])

Data_A

Unnamed: 0,Amount
0,0.858447
1,-0.796369
2,-1.377011
3,-0.962119
4,0.323285
...,...
568625,-1.105231
568626,-1.067766
568627,1.666401
568628,-0.271853
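
As a rough illustration of what `StandardScaler` does to the `Amount` column, the sketch below standardizes a few amounts by hand and compares the result with scikit-learn. The five values are the raw amounts from the first rows shown earlier; the variable names are illustrative. Note that the notebook fits the scaler on the full dataset before splitting; fitting it on the training rows only would be the stricter choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Raw amounts from the first five rows of the dataset (illustration only).
amounts = np.array([[17982.10], [6531.37], [2513.54], [5384.44], [14278.97]])

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0).
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)
scaled = StandardScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))   # both routes give the same values
print(scaled.mean(), scaled.std())   # mean near 0, std near 1
```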


In [5]:
# Write the standardized values back to obtain the final data
df[['Amount']] = Data_A[['Amount']]
df

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,0.637735,...,-0.110552,0.217606,-0.134794,0.165959,0.126280,-0.434824,-0.081230,-0.151045,0.858447,0
1,0.985100,-0.356045,0.558056,-0.429654,0.277140,0.428605,0.406466,-0.133118,0.347452,0.529808,...,-0.194936,-0.605761,0.079469,-0.577395,0.190090,0.296503,-0.248052,-0.064512,-0.796369,0
2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,0.690708,...,-0.005020,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,-1.377011,0
3,-0.152152,-0.508959,1.746840,-1.090178,0.249486,1.143312,0.518269,-0.065130,-0.205698,0.575231,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.515950,-0.165316,0.048424,-0.962119,0
4,-0.206820,-0.165280,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.212660,1.049921,0.968046,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,0.323285,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,-0.833437,0.061886,-0.899794,0.904227,-1.002401,0.481454,-0.370393,0.189694,-0.938153,-1.161847,...,0.167503,0.419731,1.288249,-0.900861,0.560661,-0.006018,3.308968,0.081564,-1.105231,1
568626,-0.670459,-0.202896,-0.068129,-0.267328,-0.133660,0.237148,-0.016935,-0.147733,0.483894,-0.210817,...,0.031874,0.388161,-0.154257,-0.846452,-0.153443,1.961398,-1.528642,1.704306,-1.067766,1
568627,-0.311997,-0.004095,0.137526,-0.035893,-0.042291,0.121098,-0.070958,-0.019997,-0.122048,-0.144495,...,0.140788,0.536523,-0.211100,-0.448909,0.540073,-0.755836,-0.487540,-0.268741,1.666401,1
568628,0.636871,-0.516970,-0.300889,-0.144480,0.131042,-0.294148,0.580568,-0.207723,0.893527,-0.080078,...,-0.060381,-0.195609,-0.175488,-0.554643,-0.099669,-1.434931,-0.159269,-0.076251,-0.271853,1


*Train, Validation, Test Data Split*

In [6]:
# Split the data with scikit-learn, using the default 25% test size
from sklearn.model_selection import train_test_split
TrainData, TestData = train_test_split(df)
TrainData, ValidData = train_test_split(TrainData)

* Train dataset: 75% of the full dataset is first set aside as TrainData, and that TrainData is split again at 75% to give the final training set (319,854 rows, about 56% of the total)

In [7]:
TrainData

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
518348,-0.495639,0.454477,-0.620138,0.660881,-0.295332,-0.337207,-0.617931,0.314369,-0.890507,-0.891572,...,0.242071,-0.051242,-0.479115,-1.271132,0.267990,-0.073406,1.316795,0.722580,-0.828966,1
235973,1.896101,-0.416423,-0.227349,-0.503499,0.501203,-0.651186,0.790755,-0.281128,0.555018,0.718698,...,-0.008119,0.763037,-0.177079,0.032473,0.864306,2.036239,-0.405994,-0.347016,0.857327,0
268168,1.887993,-0.431264,-0.059018,-0.573114,0.478505,0.049544,0.527378,-0.238180,0.844081,0.589720,...,-0.022249,0.754114,-0.166073,-1.534388,0.633748,-0.085139,-0.249140,-0.259038,-0.120344,0
104972,-1.278446,1.255823,-0.316734,-1.352374,-0.414690,-0.485337,0.317768,0.186720,4.670871,7.189666,...,-0.607845,-1.054390,0.276969,0.523330,2.039273,0.089832,1.537255,0.608040,0.039284,0
161297,1.753944,-0.440097,-0.052299,-0.466661,0.359897,0.060265,0.431106,-0.157910,0.770686,0.528753,...,-0.232246,-0.832952,0.356213,1.359707,-0.655229,0.342731,-0.322333,-0.200821,-0.268505,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
453928,-0.096249,0.830215,-0.866871,1.299650,0.476853,-1.677279,-0.108748,-0.003386,-0.885057,-0.952892,...,0.082521,-0.536139,-0.508751,0.048383,2.242816,1.027382,0.526682,0.987138,-1.035786,1
332286,-0.172117,-0.120905,0.331051,0.333750,-0.512601,1.002699,-0.000242,-0.682353,0.201334,-0.264547,...,-0.568860,0.960466,0.439219,1.081938,0.489580,0.870225,0.223246,0.564959,1.297635,1
104953,-0.151353,-0.881294,2.823940,-1.840087,-0.188690,0.004886,0.137703,-0.110005,-0.282439,0.839012,...,-0.152352,-0.163886,-0.027192,1.412502,0.279795,-0.704066,-0.169103,0.109264,-0.264931,0
104051,0.129514,-0.114028,0.980735,-0.678621,0.560644,0.158645,0.740820,-0.189677,0.158512,0.342582,...,-0.174653,-0.457992,-0.035732,-0.787231,-1.235405,-0.008032,-0.052456,0.363742,0.012381,0



* Validation dataset: the remaining 25% of the first-stage TrainData forms the final validation set (106,618 rows, about 19% of the total)

In [8]:
ValidData

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
181355,-0.201115,-0.455263,2.644641,-0.251032,0.237466,0.916748,0.349663,-0.054073,0.358878,0.423148,...,0.056156,0.939732,-0.185795,-0.649504,0.637470,-0.669843,-0.108292,0.142460,0.848907,0
45312,1.270087,-0.843563,0.622304,-1.539165,-0.113737,0.488187,0.097336,-0.150855,-0.326766,1.341985,...,-0.218034,-0.399947,-0.000657,-1.337322,0.390593,-0.460065,-0.172955,-0.064393,0.006534,0
559375,-0.867241,0.393863,-0.618435,0.924443,-0.512070,0.071438,-0.486479,-0.046176,-0.501978,-0.161573,...,0.461515,0.207350,0.410877,1.792289,-0.918894,-0.735131,-1.389629,2.656682,0.092070,1
375512,-1.049081,0.944659,-1.087695,1.788320,-1.272935,-1.160872,-1.505037,0.906727,-1.823745,-1.690343,...,0.713633,0.298944,-0.066815,1.193237,-0.382906,0.735379,1.929919,1.113534,1.489265,1
68177,0.869813,-0.610575,1.300562,-0.215352,-0.212703,0.171778,0.227919,-0.119374,1.335542,0.512358,...,-0.126575,-0.219022,0.022056,1.539600,0.123102,0.658845,-0.244211,-0.012228,-0.068917,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421945,-0.634168,0.341296,-0.682572,1.237065,1.924131,-1.380151,0.172039,-0.455065,-1.126098,-0.555499,...,-0.343939,0.387603,0.748225,-0.465425,1.489312,1.284992,0.509531,-0.854412,-0.028630,1
179474,-0.656975,-2.027769,0.641117,-1.421587,2.074871,0.455195,-0.164722,-0.023284,0.404937,1.065479,...,-0.184855,0.177987,-0.472206,-1.411741,-1.970789,-1.164926,0.734797,0.843877,-1.316437,0
58169,1.018862,-0.478443,0.667415,-0.481894,0.089748,0.304681,0.364338,-0.139178,0.774621,0.560119,...,-0.192857,-0.498037,-0.044054,0.120654,0.576262,1.081125,-0.292286,-0.115946,-0.287039,0
411323,-0.208565,0.880510,-1.046973,1.529310,0.197751,-1.307658,-0.328357,0.235627,-1.323505,-0.791912,...,0.106347,-0.531228,0.159783,-1.172734,-0.069097,1.094623,0.253218,-0.180311,0.854308,1


* Test dataset: 25% of the full dataset is held out as the final test set (142,158 rows, about 25% of the total)

In [9]:
TestData

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
350096,-0.152211,0.896221,-1.047576,1.538384,0.182877,-1.306859,-0.345671,0.220146,-1.325161,-0.818043,...,0.113642,-0.553647,0.104382,-1.207856,-0.066080,1.110299,0.386325,0.021380,0.624993,1
212015,-0.402911,0.017965,-0.418487,-0.521193,1.240772,0.249488,-0.109004,-1.690601,-0.121527,-0.269485,...,-1.910601,1.614171,1.512210,0.783067,-0.332079,1.146840,0.075181,-0.240639,-1.520172,0
540166,-0.776356,0.472386,-0.682457,0.362014,-0.719252,0.087612,-0.335235,0.256552,-0.824313,-0.450878,...,0.306133,-0.568559,0.476562,-0.462301,-0.352350,2.134879,-0.878936,0.135576,1.264702,1
545775,-0.192993,0.262704,-0.522426,0.641258,-0.345948,-0.490668,-0.561483,0.040301,-0.794790,-0.832138,...,0.299623,0.364833,-0.208449,-0.426790,1.001178,0.470105,0.942626,0.899526,0.419748,1
129134,-0.253129,0.011571,1.178996,-0.612549,0.036497,-0.226087,0.565194,-0.091902,0.349163,0.594655,...,-0.174048,-0.544176,0.107624,1.488033,-0.363442,0.093773,-0.248535,0.194920,-0.814564,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
464087,-2.293311,3.272642,-2.138681,1.505985,-2.439060,0.621296,-3.312380,-3.134485,-2.894342,-2.839673,...,-5.272189,4.998306,2.002421,0.732100,0.314172,-1.461661,-3.556212,-1.229315,1.355830,1
380484,0.896330,-0.205123,-0.082832,0.069694,0.216499,0.421724,0.050399,-0.062116,-0.189213,0.150623,...,0.031783,0.300200,-0.172929,-0.811137,0.635922,0.259063,0.098574,0.251381,-0.251745,1
510352,0.882131,0.404084,-0.540109,0.857501,1.896414,-0.576111,1.075766,-0.301490,-0.702828,0.025514,...,0.005517,-0.382244,-0.320980,-1.068243,0.991537,0.841011,-0.180511,0.242658,1.341833,1
225960,-0.019367,-0.348821,0.240183,-2.053875,0.216841,-0.407492,0.870041,-0.131249,0.876081,-0.017056,...,-0.078485,-0.127811,0.179384,2.571579,0.071868,-0.991010,-0.386534,-0.190662,1.117874,0
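
The row counts quoted for the three subsets can be checked with a little arithmetic. The sketch below assumes that a fractional test size is rounded up for the test part and rounded down for the train part; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the train count is
    rounded down, which reproduces the row counts reported above.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test

n_total = 319854 + 106618 + 142158      # rows reported for the three subsets
n_rest, n_test = split_sizes(n_total)   # first split: train+valid vs test
n_train, n_valid = split_sizes(n_rest)  # second split: train vs valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```

Passing a fixed `random_state` to `train_test_split` would make the split, and therefore the scores reported later, reproducible between runs.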


**Features(x):**

1. V1-V28: features used for regression

2. Amount: feature used for classification

**Target(y):**

1. Amount: target for regression

2. Class: target for classification

---

## 1. Regression

### Data preprocessing

In [10]:
# Convert the training data to NumPy arrays

train_x = TrainData.drop(['Amount','Class'], axis=1).values
train_y = TrainData[['Amount']].values
train_y = train_y.ravel() # flatten to 1-D

train_x.shape, train_y.shape

((319854, 28), (319854,))

In [11]:
# Convert the validation data to NumPy arrays

val_x = ValidData.drop(['Amount','Class'], axis=1).values
val_y = ValidData[['Amount']].values
val_y = val_y.ravel() # flatten to 1-D

val_x.shape, val_y.shape

((106618, 28), (106618,))

In [12]:
# Convert the training data to PyTorch tensors

import torch
import torch.nn as nn
import torch.optim as optim

Ttensor_x = torch.Tensor(train_x)
Ttensor_y = torch.Tensor(train_y)
print(Ttensor_x), print(Ttensor_y)

tensor([[-0.4956,  0.4545, -0.6201,  ..., -0.0734,  1.3168,  0.7226],
        [ 1.8961, -0.4164, -0.2273,  ...,  2.0362, -0.4060, -0.3470],
        [ 1.8880, -0.4313, -0.0590,  ..., -0.0851, -0.2491, -0.2590],
        ...,
        [-0.1514, -0.8813,  2.8239,  ..., -0.7041, -0.1691,  0.1093],
        [ 0.1295, -0.1140,  0.9807,  ..., -0.0080, -0.0525,  0.3637],
        [ 0.9487, -0.5840,  0.9836,  ...,  0.8394, -0.2165, -0.0939]])
tensor([-0.8290,  0.8573, -0.1203,  ..., -0.2649,  0.0124,  0.2620])


(None, None)

In [13]:
# Reshape Ttensor_y to (N, 1) so it lines up with the rows of Ttensor_x
Ttensor_y = Ttensor_y.view(-1, 1)
print(Ttensor_y)

tensor([[-0.8290],
        [ 0.8573],
        [-0.1203],
        ...,
        [-0.2649],
        [ 0.0124],
        [ 0.2620]])


In [14]:
# Convert the validation data to PyTorch tensors

Vtensor_x = torch.Tensor(val_x)
Vtensor_y = torch.Tensor(val_y)
Vtensor_y = Vtensor_y.view(-1, 1)
print(Vtensor_x), print(Vtensor_y)

tensor([[-0.2011, -0.4553,  2.6446,  ..., -0.6698, -0.1083,  0.1425],
        [ 1.2701, -0.8436,  0.6223,  ..., -0.4601, -0.1730, -0.0644],
        [-0.8672,  0.3939, -0.6184,  ..., -0.7351, -1.3896,  2.6567],
        ...,
        [ 1.0189, -0.4784,  0.6674,  ...,  1.0811, -0.2923, -0.1159],
        [-0.2086,  0.8805, -1.0470,  ...,  1.0946,  0.2532, -0.1803],
        [-0.2758,  0.3123, -0.3899,  ..., -0.1577, -0.0224,  0.0586]])
tensor([[ 0.8489],
        [ 0.0065],
        [ 0.0921],
        ...,
        [-0.2870],
        [ 0.8543],
        [ 0.2217]])


(None, None)

### Model Construction

#### Linear Regression

In [15]:
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression(fit_intercept=True) # ordinary least squares, i.e. minimizes squared error

#### DecisionTree

In [16]:
from sklearn.tree import DecisionTreeRegressor
model_dt = DecisionTreeRegressor(
    criterion='squared_error', # use MSE as the split criterion
    splitter='best',
    max_depth=6,
    random_state=0)

#### nn.Linear

In [17]:
model = nn.Linear(28, 1, bias = False)

loss = nn.MSELoss() # declare the loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)

### Train Model & Select Model

In [18]:
models = [model_lr, model_dt]

In [19]:
def mse_loss(pre, y):
    return ((pre-y)**2).mean() # mean squared error

#### Train

In [20]:
for model in models:
    model.fit(train_x, train_y)
    pre = model.predict(train_x)
    
    loss_value = mse_loss(pre, train_y)
    
    print(model, loss_value)

LinearRegression() 1.0005264985938258
DecisionTreeRegressor(max_depth=6, random_state=0) 0.9991192205260215


In [21]:
from torch.utils.data import TensorDataset, DataLoader 

train_dataset = TensorDataset(Ttensor_x, Ttensor_y)
train_dataloader = DataLoader(train_dataset, batch_size=5000, shuffle=True, drop_last=True)

model = nn.Linear(28, 1, bias = False)

loss = nn.MSELoss() 
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 30

for epoch in range(num_epochs):
    
    for batch in train_dataloader:
        x, y = batch
        pre = model(x)
        cost = loss(pre, y)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()  
    if (epoch + 1) % 3 == 0:
        print('Epoch [%d/%d], Loss: %.4f'
              %(epoch+1, num_epochs, cost.item()))

Epoch [3/30], Loss: 1.0242
Epoch [6/30], Loss: 1.0139
Epoch [9/30], Loss: 0.9978
Epoch [12/30], Loss: 1.0063
Epoch [15/30], Loss: 1.0050
Epoch [18/30], Loss: 0.9835
Epoch [21/30], Loss: 1.0317
Epoch [24/30], Loss: 0.9881
Epoch [27/30], Loss: 1.0318
Epoch [30/30], Loss: 0.9944


#### Validate

In [22]:
for model in models:

    pre = model.predict(val_x)
    
    loss_value = mse_loss(pre, val_y)
    
    print(model, loss_value)

LinearRegression() 0.9966693864973103
DecisionTreeRegressor(max_depth=6, random_state=0) 0.9978255832387007


In [23]:
train_dataset = TensorDataset(Vtensor_x, Vtensor_y)
train_dataloader = DataLoader(train_dataset, batch_size=5000, shuffle=True, drop_last=True)

model = nn.Linear(28, 1, bias = False)

loss = nn.MSELoss() 
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 1

for epoch in range(num_epochs):
    
    for batch in train_dataloader:
        x, y = batch
        pre = model(x)
        cost = loss(pre, y)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()  
    if (epoch + 1) % 1 == 0:
        print('Epoch [%d/%d], Loss: %.4f'
              %(epoch+1, num_epochs, cost.item()))

Epoch [1/1], Loss: 1.0638
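
The cell above creates a new `nn.Linear` and trains it for one epoch on the validation tensors, so the printed value is a training loss for a fresh model rather than a validation loss for the model trained earlier. A minimal sketch of evaluating an already trained model without updating it follows; the random tensors stand in for `Ttensor_x`, `Ttensor_y`, `Vtensor_x`, and `Vtensor_y`, and their sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in data (assumption): random tensors with the notebook's 28 features.
train_x, train_y = torch.randn(1000, 28), torch.randn(1000, 1)
valid_x, valid_y = torch.randn(300, 28), torch.randn(300, 1)

model = nn.Linear(28, 1, bias=False)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train on the training tensors only (full-batch steps for brevity).
for _ in range(20):
    optimizer.zero_grad()
    loss_fn(model(train_x), train_y).backward()
    optimizer.step()

# Evaluate on the validation tensors: no optimizer step, no gradient tracking.
weights_before = model.weight.detach().clone()
model.eval()
with torch.no_grad():
    valid_loss = loss_fn(model(valid_x), valid_y).item()

print(f"validation MSE: {valid_loss:.4f}")
print(torch.equal(weights_before, model.weight.detach()))  # weights are unchanged
```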


#### Explain

In [24]:
print("w:", model_lr.coef_, ", b:", model_lr.intercept_)
# w: coefficient of each feature, b: intercept (bias)

w: [-2.57694976e-03  1.37766452e-03 -1.92295442e-03  2.68262106e-04
  3.58772154e-03  2.61066019e-03  4.04778067e-03  1.00408519e-03
 -2.79977776e-03  3.55294113e-03 -2.34984494e-03 -2.34632893e-03
 -5.23977785e-03 -2.75020180e-03  2.07312204e-03  9.03033426e-04
  9.78044360e-05 -4.28476288e-03 -1.47045562e-03 -1.95766098e-03
  2.84388333e-03  1.32736859e-03 -1.33387952e-03 -1.68413956e-04
 -1.67788936e-03 -2.34975295e-03 -1.01575978e-03 -2.26427155e-03] , b: -0.0010994662940821615


In [25]:
from sklearn.tree import export_graphviz 
export_graphviz(model_dt, out_file ='tree.dot') # writes tree.dot (attached)
# http://webgraphviz.com/
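
If rendering the `.dot` file is inconvenient, the fitted rules can also be inspected as plain text. The sketch below uses `sklearn.tree.export_text` on a small synthetic regression problem; the data, the shallow `max_depth=2`, and the three feature names are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (assumption): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text returns the fitted split rules as an indented string.
rules = export_text(tree, feature_names=["V1", "V2", "V3"])
print(rules)
```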

---

## 2. Classification

### Data preprocessing

In [26]:
# Convert the training data to NumPy arrays

train_x = TrainData[['Amount']].values
train_y = TrainData[['Class']].values
train_y = train_y.ravel()

train_x.shape, train_y.shape

((319854, 1), (319854,))

In [27]:
# Convert the validation data to NumPy arrays

val_x = ValidData[['Amount']].values
val_y = ValidData[['Class']].values
val_y = val_y.ravel()

val_x.shape, val_y.shape

((106618, 1), (106618,))

### Model Construction

#### Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(fit_intercept=True, # fit an intercept
                              solver='lbfgs', # L-BFGS (quasi-Newton) optimizer
                              random_state=0) # random seed

#### DecisionTree

In [29]:
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier(criterion='gini',
                                  splitter='best',
                                  max_depth=6,
                                  random_state=0)

#### SVC

In [30]:
from sklearn.svm import SVC
model_svc = SVC(C=1.0, kernel='rbf', gamma='scale')

### Train Model & Select Model

In [31]:
models = [model_lr, model_dt] # model_svc is left out: training took too long to finish
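
Kernel SVMs scale poorly with the number of rows, which is the likely reason `model_svc` did not finish on roughly 320,000 training samples. One common workaround, sketched here under the assumption that a random subsample is representative, is to fit `SVC` on a few thousand rows. The synthetic data and the `n_sub` value are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): one feature, binary labels, 50,000 rows.
X = rng.normal(size=(50_000, 1))
y = rng.integers(0, 2, size=50_000)

# Fit on a random subsample so that training time stays manageable.
n_sub = 2_000  # hypothetical subsample size
idx = rng.choice(len(X), size=n_sub, replace=False)
svc = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X[idx], y[idx])

# Score on rows that were not used for fitting.
mask = np.ones(len(X), dtype=bool)
mask[idx] = False
score = svc.score(X[mask][:5_000], y[mask][:5_000])
print(score)
```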

In [32]:
def accuracy(pre, y):
    return sum(pre==y)/len(y)

#### Train

In [33]:
for model in models:
    model.fit(train_x, train_y)
    pre = model.predict(train_x)
    
    acc = accuracy(pre, train_y)
    
    print(model, acc)

LogisticRegression(random_state=0) 0.5019790279314937
DecisionTreeClassifier(max_depth=6, random_state=0) 0.5037892288356562


#### Validate

In [34]:
for model in models:

    pre = model.predict(val_x)
    
    acc = accuracy(pre, val_y)
    
    print(model, acc)

LogisticRegression(random_state=0) 0.49890262432234705
DecisionTreeClassifier(max_depth=6, random_state=0) 0.49766455945525145


---

## Performance

### 1. Regression Model Validation

1. Linear Regression

    * Train : 1.0014283457720659
    * Valid : 0.9961702170798374

2. Decision Tree

    * Train : 1.0004487414682603
    * Valid : 0.9966568659371405    

3. nn.Linear

    * Train : 1.0194
    * Valid : 1.0782

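* All three models reach an MSE of about 1.0, with nn.Linear slightly worse than the other two. Because Amount was standardized to unit variance, an MSE of 1.0 is what a constant prediction of the mean would score, so none of the models predicts Amount from V1-V28 better than that baseline. Together with the near-zero coefficients of the linear model, this suggests that V1-V28 carry little information about Amount that these models can use.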
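
Because Amount was standardized before modeling, one way to put the MSE values above in context is to compare them with a predict-the-mean baseline. The sketch below does this on a synthetic standardized target; the exponential stand-in data and the `r2_from_mse` helper are assumptions for illustration, while `mse_loss` mirrors the notebook's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in target (assumption): skewed positive values, then standardized.
raw = rng.exponential(scale=100.0, size=10_000)
y = (raw - raw.mean()) / raw.std()

def mse_loss(pre, y):
    return ((pre - y) ** 2).mean()

# Baseline: always predict the mean of the target.
baseline = mse_loss(np.full_like(y, y.mean()), y)

def r2_from_mse(mse, y):
    """R^2 = 1 - MSE / Var(y); 0 means no better than the mean."""
    return 1.0 - mse / y.var()

print(baseline)                  # close to 1 for a standardized target
print(r2_from_mse(baseline, y))  # close to 0
```

Under this reading, a model only adds value when its MSE falls clearly below 1, which corresponds to an R^2 clearly above 0.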

### 2. Classification Model Analysis

1. Logistic Regression

    * Train : 0.5004408261269204
    * Valid : 0.49890262432234705

2. Decision Tree

    * Train : 0.5010598585604682
    * Valid : 0.49766455945525145

3. SVM

    * Train : (not computed)
    * Valid : (not computed)

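* Both evaluated models reach an accuracy of about 0.50 on the training and validation data, which is no better than chance if the two classes are roughly balanced (SVM could not be evaluated). Amount alone therefore shows almost no relationship with fraud in this dataset.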
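
An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label; `majority_baseline` is a hypothetical helper, and the random labels are stand-ins that assume roughly balanced classes.

```python
import numpy as np

def accuracy(pre, y):
    return np.mean(pre == y)

def majority_baseline(y_train, y_eval):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return accuracy(np.full_like(y_eval, majority), y_eval)

# Stand-in labels (assumption): roughly balanced 0/1 classes.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=10_000)
y_valid = rng.integers(0, 2, size=3_000)

baseline_acc = majority_baseline(y_train, y_valid)
print(baseline_acc)
```

A classifier whose validation accuracy sits at this baseline has learned nothing usable from its input feature.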


