**Kaggle Project**

https://www.kaggle.com/datasets/erdemtaha/cancer-data?resource=download

Task : binary classification

Number of samples : 569

Features x : 30

('radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst')

Target y : binary (B : benign, M : malignant)

A binary classification dataset in which the 30 features are used to classify a tumor as benign (B) or malignant (M).

Datasets

Train : 398 samples
Validation : 5-fold cross-validation through GridSearchCV, used for hyperparameter tuning
Test : 171 samples

70% of the full dataset is used as training data and 30% as test data.
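The sizes above follow from simple arithmetic, sketched here in plain Python. The helper `fold_sizes` is a hypothetical name and assumes a plain, non-stratified split of the training rows into five validation folds. GridSearchCV uses stratified folds for classifiers, so the real fold sizes may differ slightly.

```python
import math

N_SAMPLES = 569      # rows in the dataset
TEST_FRACTION = 0.3  # test_size passed to train_test_split
N_FOLDS = 5          # cv passed to GridSearchCV

# The test-set size is rounded up; the remainder goes to training.
n_test = math.ceil(N_SAMPLES * TEST_FRACTION)
n_train = N_SAMPLES - n_test

def fold_sizes(n_samples, n_folds):
    """Validation-fold sizes when n_samples are dealt out as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]

val_sizes = fold_sizes(n_train, N_FOLDS)
print(n_train, n_test)  # 398 171
print(val_sizes)        # [80, 80, 80, 79, 79]
```

Each hyperparameter candidate is therefore scored on roughly 80 held-out rows per fold, while the 171 test rows are not used during tuning.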

In [90]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

**Data preprocessing**

In [91]:
# Load the data
data = pd.read_csv('./Cancer_Data.csv')
data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In [92]:
# Split into features and target

x = data[data.columns[2:32]]        # the 33rd column contains only NaN values, so it is excluded from the features
y = data['diagnosis']               # target: B (benign) or M (malignant)


In [93]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 9)


# The classes are labeled M and B; this function maps them to 1 and 0 respectively
def class_index(data):
    class_ = []
    for i in data:
        if i =='M':
            class_.append(1)
        elif i =='B':
            class_.append(0)
    return np.array(class_)

y_train = class_index(y_train)
y_test = class_index(y_test)
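Note that `class_index` silently skips any value other than 'M' or 'B', which would leave the label array shorter than the feature matrix. The sketch below is one way to make the same encoding fail loudly instead. `LABEL_TO_INDEX` and `encode_labels` are hypothetical names, not part of the notebook.

```python
LABEL_TO_INDEX = {"B": 0, "M": 1}  # same encoding as class_index

def encode_labels(labels):
    """Map diagnosis strings to integers, raising on anything unexpected."""
    encoded = []
    for position, label in enumerate(labels):
        if label not in LABEL_TO_INDEX:
            raise ValueError(f"unexpected label {label!r} at position {position}")
        encoded.append(LABEL_TO_INDEX[label])
    return encoded

print(encode_labels(["M", "B", "B", "M"]))  # [1, 0, 0, 1]
```

With pandas, `y.map({'B': 0, 'M': 1})` gives a comparable encoding, turning unknown labels into NaN rather than dropping them.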

**Model Construction**

In [94]:
# Declare the models used for classification
lr = LogisticRegression(max_iter=2000)
DT = DecisionTreeClassifier()
Knn = KNeighborsClassifier()
svm = SVC()

In [95]:
# Hyperparameter search grids for each classification model

dt_params = {
    'min_samples_split': [i for i in range(5,10)],
    'min_samples_leaf': [i for i in range(5,10)],
    'criterion': ["gini", "entropy"]
}
dt_grid_search = GridSearchCV(estimator=DT, 
                           param_grid=dt_params,
                           cv=5, 
                           scoring = "accuracy")

knn_params = {
    'n_neighbors' : [i for i in range(1,30)]
}
knn_grid_search = GridSearchCV(estimator=Knn, 
                           param_grid=knn_params,
                           cv=5, 
                           scoring = "accuracy")

svm_params = {
    'C' : [10**i for i in range(-4,3)],
    'gamma' : [10**i for i in range(-4,3)]
}
svm_grid_search = GridSearchCV(estimator=svm, 
                           param_grid=svm_params,
                           cv=5, 
                           scoring = "accuracy")
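To get a feel for the cost of these searches, the following sketch counts the candidates in each grid and the number of cross-validation fits they imply. The grids are copied from the cell above; `count_candidates` is a hypothetical helper.

```python
from itertools import product

# Search spaces copied from the grids above.
grids = {
    "DecisionTree": {
        "min_samples_split": list(range(5, 10)),
        "min_samples_leaf": list(range(5, 10)),
        "criterion": ["gini", "entropy"],
    },
    "KNN": {"n_neighbors": list(range(1, 30))},
    "SVM": {
        "C": [10**i for i in range(-4, 3)],
        "gamma": [10**i for i in range(-4, 3)],
    },
}
N_FOLDS = 5  # cv=5 in every GridSearchCV call

def count_candidates(grid):
    """Number of hyperparameter combinations in one grid."""
    return sum(1 for _ in product(*grid.values()))

for name, grid in grids.items():
    n = count_candidates(grid)
    print(f"{name}: {n} candidates, {n * N_FOLDS} cross-validation fits")
# DecisionTree: 50 candidates, 250 cross-validation fits
# KNN: 29 candidates, 145 cross-validation fits
# SVM: 49 candidates, 245 cross-validation fits
```

With the default `refit=True`, each search also performs one extra fit of the best candidate on the full training set.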

**Train Model & Select Model**

In [96]:
# Fit the models on the training data
lr.fit(x_train, y_train)
dt_grid_search.fit(x_train, y_train)
knn_grid_search.fit(x_train, y_train)
svm_grid_search.fit(x_train, y_train)

# Store, for each GridSearchCV, the model whose hyperparameters gave the highest cross-validated accuracy
knn_best=knn_grid_search.best_estimator_
svm_best=svm_grid_search.best_estimator_  
dt_best=dt_grid_search.best_estimator_

# Inspect the stored models (hyperparameters)
print(dt_grid_search.best_estimator_)
print(knn_grid_search.best_estimator_)
print(svm_grid_search.best_estimator_)

DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=7)
KNeighborsClassifier(n_neighbors=2)
SVC(C=1, gamma=0.0001)


**Performance**

In [97]:
# Evaluate the accuracy on the test data of the models fitted on the training data

lr_acc = accuracy_score(y_test, lr.predict(x_test))
dt_acc = accuracy_score(y_test, dt_best.predict(x_test))
knn_acc = accuracy_score(y_test, knn_best.predict(x_test))
svm_acc = accuracy_score(y_test, svm_best.predict(x_test))

# Print the results
print('LogisticRegression Accuracy : ', lr_acc)
print('DecisionTree Accuracy : ', dt_acc)
print('KNN Accuracy : ', knn_acc)
print('SVM Accuracy : ', svm_acc)

LogisticRegression Accuracy :  0.9532163742690059
DecisionTree Accuracy :  0.9473684210526315
KNN Accuracy :  0.9239766081871345
SVM Accuracy :  0.9415204678362573
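Accuracy alone does not distinguish a missed malignant case from a false alarm. The notebook imports `confusion_matrix` but does not call it, so here is a minimal plain-Python sketch of the four counts it would report. The labels are hypothetical and `confusion_counts` is an illustrative helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = malignant."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif guess == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, for illustration only (not taken from the dataset).
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of malignant cases that were caught
print(tn, fp, fn, tp)    # 4 1 1 2
print(accuracy)          # 0.75
```

On the real data, `confusion_matrix(y_test, lr.predict(x_test))` should give the same four counts, laid out as a 2x2 array.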


**Result**

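- For a simple binary classification problem on a tabular dataset, classical machine learning models achieve good classification performance without resorting to deep learning.

- LogisticRegression, the simplest of the models, achieved the highest accuracy, although the margin is small.

- When analyzing a dataset, starting with simple, long-established and well-validated models, rather than reaching first for a complex recent model, can already yield good performance.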
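To put the size of that margin in concrete terms, the test accuracies printed in the Performance cell can be converted back into counts of correctly classified test samples. This is a small sketch that assumes only the 171-sample test set and the printed values.

```python
N_TEST = 171  # size of the test set

# Test accuracies as printed in the Performance cell.
accuracies = {
    "LogisticRegression": 0.9532163742690059,
    "DecisionTree": 0.9473684210526315,
    "KNN": 0.9239766081871345,
    "SVM": 0.9415204678362573,
}

# Accuracy is correct / N_TEST, so multiplying back recovers the count.
correct = {name: round(acc * N_TEST) for name, acc in accuracies.items()}
for name, n_correct in correct.items():
    print(f"{name}: {n_correct} correct, {N_TEST - n_correct} wrong")
# LogisticRegression: 163 correct, 8 wrong
# DecisionTree: 162 correct, 9 wrong
# KNN: 158 correct, 13 wrong
# SVM: 161 correct, 10 wrong
```

The gap between the best and second-best model is a single test sample, and the gap between the best and the worst is five.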

**Final Project**

In [98]:
import torch 
from torch import nn, optim

# Define the neural network
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Linear(30, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8,2),
            nn.ReLU()   # also applied to the output layer, so CrossEntropyLoss receives non-negative values
        )
    
    def forward(self, x):
        Net_out = self.layer(x)
        return Net_out
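For a sense of how small this network is, the number of trainable parameters can be worked out from the layer widths alone. The sketch below uses only the sizes given in `Net`; `linear_params` is a hypothetical helper.

```python
# Layer widths of Net: 30 inputs -> 16 -> 8 -> 2 outputs.
layer_sizes = [30, 16, 8, 2]

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)  # [496, 136, 18]
print(total)      # 650
```

With PyTorch available, `sum(p.numel() for p in Net().parameters())` should report the same total. The network therefore has more parameters (650) than there are training samples (398).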

In [100]:
torch.manual_seed(99)

model = Net()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = 1e-4)

x_train_tensor = torch.from_numpy(x_train.values).float()
y_train_tensor = torch.from_numpy(y_train).long()

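# Full-batch training: every iteration passes the whole training set through the model
for i in range(1,10000+1):
    
    pred = model(x_train_tensor)
    loss_ = loss_fn(pred, y_train_tensor)
    # number of training samples classified correctly in this iteration
    correct = (pred.argmax(1)==y_train_tensor).type(torch.float).sum().item()
    
    optimizer.zero_grad()
    loss_.backward()
    optimizer.step()
    
    # report the training accuracy every 1000 iterations
    if i%1000==0:
        print('Epoch : ',i)
        print('Train ACC',correct/len(y_train_tensor))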

Epoch :  1000
Train ACC 0.9170854271356784
Epoch :  2000
Train ACC 0.9246231155778895
Epoch :  3000
Train ACC 0.9396984924623115
Epoch :  4000
Train ACC 0.9346733668341709
Epoch :  5000
Train ACC 0.9447236180904522
Epoch :  6000
Train ACC 0.957286432160804
Epoch :  7000
Train ACC 0.964824120603015
Epoch :  8000
Train ACC 0.9698492462311558
Epoch :  9000
Train ACC 0.9673366834170855
Epoch :  10000
Train ACC 0.9723618090452262
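The loop above minimizes `nn.CrossEntropyLoss`, which applies a softmax to the two network outputs and takes the negative log-probability of the true class, averaged over the batch. The following is a minimal plain-Python sketch of that computation for a single sample. The logit values are hypothetical and `cross_entropy` is an illustrative helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    """Softmax followed by negative log-likelihood for one sample."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    prob_target = exps[target] / sum(exps)
    return -math.log(prob_target)

# Hypothetical network outputs for one sample whose true class is 1 (malignant).
confident_right = cross_entropy([0.0, 4.0], target=1)
undecided = cross_entropy([0.0, 0.0], target=1)  # equal outputs give ln 2
confident_wrong = cross_entropy([4.0, 0.0], target=1)
print(confident_right, undecided, confident_wrong)
```

The loss shrinks toward zero as the output for the true class pulls ahead of the other one, and it equals ln 2 when the two outputs are tied.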


In [101]:
x_test_tensor = torch.from_numpy(x_test.values).float()
y_test_tensor = torch.from_numpy(y_test).long()


model.eval()

with torch.no_grad():
    output = model(x_test_tensor)
    correct = (output.argmax(1)==y_test_tensor).type(torch.float).sum().item()
    
    print('Test ACC : ',correct/len(y_test_tensor))

Test ACC :  0.9649122807017544


In [106]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

scaler_mm = MinMaxScaler()
scaler_st = StandardScaler()
scaler_ro = RobustScaler()

x_train_mm = scaler_mm.fit_transform(x_train)
x_test_mm = scaler_mm.transform(x_test)

x_train_st = scaler_st.fit_transform(x_train)
x_test_st = scaler_st.transform(x_test)

x_train_ro = scaler_ro.fit_transform(x_train)
x_test_ro = scaler_ro.transform(x_test)
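The cell above fits each scaler on the training data and only transforms the test data. The sketch below reproduces, for one feature, the formulas these scalers use under their default settings. It relies on the standard library only, and the data and function names are hypothetical.

```python
import statistics

def fit_min_max(column):
    """MinMaxScaler: map the training minimum to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    return lambda v: (v - lo) / (hi - lo)

def fit_standard(column):
    """StandardScaler: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return lambda v: (v - mean) / std

def fit_robust(column):
    """RobustScaler: subtract the median, divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    return lambda v: (v - median) / (q3 - q1)

# Hypothetical single feature with one outlier in the training data.
train = [1.0, 2.0, 3.0, 4.0, 100.0]
test = [2.5, 50.0]

for name, fit in [("min-max", fit_min_max), ("standard", fit_standard), ("robust", fit_robust)]:
    transform = fit(train)  # the statistics come from the training data only
    scaled_train = [round(transform(v), 3) for v in train]
    scaled_test = [round(transform(v), 3) for v in test]
    print(name, scaled_train, scaled_test)
```

In this toy column the outlier squeezes the other min-max values into a narrow band near 0, while the robust scaling keeps them spread between -1 and 0.5.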

In [107]:
torch.manual_seed(99)

model_mm = Net()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_mm.parameters(), lr = 1e-4)

x_train_tensor = torch.from_numpy(x_train_mm).float()
y_train_tensor = torch.from_numpy(y_train).long()

for i in range(1,10000+1):
    
    pred = model_mm(x_train_tensor)
    loss_ = loss_fn(pred, y_train_tensor)
    correct = (pred.argmax(1)==y_train_tensor).type(torch.float).sum().item()
    
    optimizer.zero_grad()
    loss_.backward()
    optimizer.step()
    
    if i%1000==0:
        print('Epoch : ',i)
        print('Train ACC',correct/len(y_train_tensor))

Epoch :  1000
Train ACC 0.914572864321608
Epoch :  2000
Train ACC 0.9522613065326633
Epoch :  3000
Train ACC 0.9723618090452262
Epoch :  4000
Train ACC 0.9798994974874372
Epoch :  5000
Train ACC 0.9798994974874372
Epoch :  6000
Train ACC 0.9824120603015075
Epoch :  7000
Train ACC 0.9899497487437185
Epoch :  8000
Train ACC 0.992462311557789
Epoch :  9000
Train ACC 0.992462311557789
Epoch :  10000
Train ACC 0.992462311557789


In [110]:
x_test_tensor = torch.from_numpy(x_test_mm).float()
y_test_tensor = torch.from_numpy(y_test).long()


model_mm.eval()

with torch.no_grad():
    output = model_mm(x_test_tensor)
    correct = (output.argmax(1)==y_test_tensor).type(torch.float).sum().item()
    
    print('Test ACC : ',correct/len(y_test_tensor))

Test ACC :  0.9824561403508771


In [109]:
torch.manual_seed(99)

model_st = Net()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_st.parameters(), lr = 1e-4)

x_train_tensor = torch.from_numpy(x_train_st).float()
y_train_tensor = torch.from_numpy(y_train).long()

for i in range(1,10000+1):
    
    pred = model_st(x_train_tensor)
    loss_ = loss_fn(pred, y_train_tensor)
    correct = (pred.argmax(1)==y_train_tensor).type(torch.float).sum().item()
    
    optimizer.zero_grad()
    loss_.backward()
    optimizer.step()
    
    if i%1000==0:
        print('Epoch : ',i)
        print('Train ACC',correct/len(y_train_tensor))

Epoch :  1000
Train ACC 0.9723618090452262
Epoch :  2000
Train ACC 0.9849246231155779
Epoch :  3000
Train ACC 0.992462311557789
Epoch :  4000
Train ACC 0.9949748743718593
Epoch :  5000
Train ACC 0.9974874371859297
Epoch :  6000
Train ACC 1.0
Epoch :  7000
Train ACC 1.0
Epoch :  8000
Train ACC 1.0
Epoch :  9000
Train ACC 1.0
Epoch :  10000
Train ACC 1.0


In [111]:
x_test_tensor = torch.from_numpy(x_test_st).float()
y_test_tensor = torch.from_numpy(y_test).long()


model_st.eval()

with torch.no_grad():
    output = model_st(x_test_tensor)
    correct = (output.argmax(1)==y_test_tensor).type(torch.float).sum().item()
    
    print('Test ACC : ',correct/len(y_test_tensor))

Test ACC :  0.9883040935672515


In [112]:
torch.manual_seed(99)

model_ro = Net()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_ro.parameters(), lr = 1e-4)

x_train_tensor = torch.from_numpy(x_train_ro).float()
y_train_tensor = torch.from_numpy(y_train).long()

for i in range(1,10000+1):
    
    pred = model_ro(x_train_tensor)
    loss_ = loss_fn(pred, y_train_tensor)
    correct = (pred.argmax(1)==y_train_tensor).type(torch.float).sum().item()
    
    optimizer.zero_grad()
    loss_.backward()
    optimizer.step()
    
    if i%1000==0:
        print('Epoch : ',i)
        print('Train ACC',correct/len(y_train_tensor))

Epoch :  1000
Train ACC 0.957286432160804
Epoch :  2000
Train ACC 0.9824120603015075
Epoch :  3000
Train ACC 0.9874371859296482
Epoch :  4000
Train ACC 0.9899497487437185
Epoch :  5000
Train ACC 0.9974874371859297
Epoch :  6000
Train ACC 1.0
Epoch :  7000
Train ACC 1.0
Epoch :  8000
Train ACC 1.0
Epoch :  9000
Train ACC 1.0
Epoch :  10000
Train ACC 1.0


In [113]:
x_test_tensor = torch.from_numpy(x_test_ro).float()
y_test_tensor = torch.from_numpy(y_test).long()


model_ro.eval()

with torch.no_grad():
    output = model_ro(x_test_tensor)
    correct = (output.argmax(1)==y_test_tensor).type(torch.float).sum().item()
    
    print('Test ACC : ',correct/len(y_test_tensor))

Test ACC :  0.9824561403508771


**Midterm Project Results**

LogisticRegression Accuracy :  0.9532163742690059

DecisionTree Accuracy :  0.9473684210526315

KNN Accuracy :  0.9239766081871345

SVM Accuracy :  0.9415204678362573

**Final Project Results**

Neural Network Accuracy : 0.9649122807017544

MinMax scaling : 0.9824561403508771

Standard scaling : 0.9883040935672515

Robust scaling : 0.9824561403508771


Using a neural network, and scaling the features to reduce the influence of their differing scales, achieved better performance.