# Kaggle Project

## Describe Your Dataset

**URL:** https://www.kaggle.com/datasets/merishnasuwal/breast-cancer-prediction-dataset

**Task:**

1. Import the required libraries and load the data
2. Split the dataset
3. Select models (logistic regression, decision tree, support vector machine, neural network)
4. Train each model and evaluate its performance
5. Evaluate the models on the final test data

**Datasets**

* Train dataset: 341 samples (60%)
* Validation dataset: 114 samples (20%)
* Test dataset: 114 samples (20%)

**Features(x):**

mean_radius, mean_texture, mean_perimeter, mean_area, mean_smoothness

**Target(y):**

diagnosis

---

## Build Your Model

### Data preprocessing

In [69]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import hinge_loss
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import make_scorer
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

In [70]:
# Load the data
data = pd.read_csv('C:/Users/khwsk/Downloads/Breast_cancer_data.csv', header=0, sep=',', encoding='euc-kr')

In [71]:
data.head()

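|   | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | diagnosis |
|---|-------------|--------------|----------------|-----------|-----------------|-----------|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0 |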


In [72]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mean_radius      569 non-null    float64
 1   mean_texture     569 non-null    float64
 2   mean_perimeter   569 non-null    float64
 3   mean_area        569 non-null    float64
 4   mean_smoothness  569 non-null    float64
 5   diagnosis        569 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 26.8 KB


In [73]:
# Check for missing values
print(data.isnull().sum())

mean_radius        0
mean_texture       0
mean_perimeter     0
mean_area          0
mean_smoothness    0
diagnosis          0
dtype: int64


In [74]:
# Correlation of each feature with the target (printed in feature order)
mean_radius_correlation = data['mean_radius'].corr(data['diagnosis'])
mean_texture_correlation = data['mean_texture'].corr(data['diagnosis'])
mean_perimeter_correlation = data['mean_perimeter'].corr(data['diagnosis'])
mean_area_correlation = data['mean_area'].corr(data['diagnosis'])
mean_smoothness_correlation = data['mean_smoothness'].corr(data['diagnosis'])

print('Correlation:', mean_radius_correlation)
print('Correlation:', mean_texture_correlation)
print('Correlation:', mean_perimeter_correlation)
print('Correlation:', mean_area_correlation)
print('Correlation:', mean_smoothness_correlation)

Correlation: -0.7300285113754558
Correlation: -0.4151852998452039
Correlation: -0.7426355297258322
Correlation: -0.7089838365853892
Correlation: -0.3585599650859317
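
The five per-column calls above can also be written as a single `corrwith` call, which labels each value with its feature name. A minimal sketch on a small made-up frame (the values in `toy` are illustrative and are not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same column layout as the dataset (values are made up).
toy = pd.DataFrame({
    "mean_radius": [17.9, 20.5, 11.4, 12.0, 13.1, 9.8],
    "mean_texture": [10.3, 17.7, 20.3, 14.2, 15.6, 12.9],
    "diagnosis": [0, 0, 0, 1, 1, 1],
})

features = toy.drop(columns="diagnosis")

# Pearson correlation of every feature column with the target, in one call.
correlations = features.corrwith(toy["diagnosis"])
print(correlations.sort_values())
```

Applied to the notebook's `data`, the same pattern would be `data.drop(columns='diagnosis').corrwith(data['diagnosis'])`.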


### Model Construction

In [75]:
# Split the data into input features and the target variable
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

In [76]:
# Split the data into train, validation, and test sets (6:2:2 ratio)
X_train, X_inter, y_train, y_inter = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_inter, y_inter, test_size=0.5, random_state=42)

print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

X_train shape: (341, 5)
X_val shape: (114, 5)
X_test shape: (114, 5)
y_train shape: (341,)
y_val shape: (114,)
y_test shape: (114,)
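
The 341/114/114 counts come from applying `train_test_split` twice: first holding out 40% of the 569 rows, then halving the held-out part. The sketch below reproduces only the sizes, using placeholder arrays (`X_demo`, `y_demo` are hypothetical stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape: 569 rows, 5 features.
X_demo = np.zeros((569, 5))
y_demo = np.zeros(569, dtype=int)

# First split: 60% train, 40% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)
# Second split: halve the held-out part into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))  # 341 114 114
```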


In [77]:
# Scale the features (fit on the training set only, then apply to validation and test)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
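
Because the scaler is fitted on the training set only, validation and test values are mapped with the training minimum and maximum and may fall outside [0, 1]. A small sketch with made-up single-feature data illustrates this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration.
train = np.array([[10.0], [15.0], [20.0]])
val = np.array([[12.5], [25.0]])  # 25.0 lies above the training maximum

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns min=10, max=20
val_scaled = scaler_demo.transform(val)          # reuses the training min/max

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(val_scaled.ravel())    # values 0.25, 1.5 (the second exceeds 1)
```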

### Train Model & Select Model

In [78]:
# Initialize and train the model (logistic regression model)
logistic_regression_model = LogisticRegression(random_state=42)
logistic_regression_model.fit(X_train_scaled, y_train)

In [79]:
# Predict with the model (logistic regression model)
y_pred = logistic_regression_model.predict(X_val_scaled)
y_pred_proba = logistic_regression_model.predict_proba(X_val_scaled)[:, 1]
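
`y_pred_proba` is computed here but not used by the evaluation cell that follows. One possible use is the log loss (`log_loss` is already imported), which scores the predicted probabilities rather than the hard labels. A minimal sketch on synthetic data from `make_classification` (the data and the names `X_syn`, `clf`, `proba_hold` are assumptions for illustration, not the notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features (not the real dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

clf = LogisticRegression(random_state=0).fit(X_fit, y_fit)

# Probability of class 1 for each held-out row.
proba_hold = clf.predict_proba(X_hold)[:, 1]

# Log loss penalizes confident wrong probabilities; lower is better.
holdout_log_loss = log_loss(y_hold, proba_hold)
print("Hold-out log loss:", holdout_log_loss)
```

In the notebook itself the equivalent call would be `log_loss(y_val, y_pred_proba)`.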

In [80]:
# Evaluate model performance (logistic regression model)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("\nValidation Classification Report:\n", classification_report(y_val, y_pred))
print("\nValidation Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

Validation Accuracy: 0.9736842105263158

Validation Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.96        42
           1       0.96      1.00      0.98        72

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114


Validation Confusion Matrix:
 [[39  3]
 [ 0 72]]
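
The figures in the classification report can be recomputed by hand from the confusion matrix, where rows are true classes and columns are predicted classes. A short worked example using the validation matrix above:

```python
import numpy as np

# Validation confusion matrix reported above (rows: true class, columns: predicted).
cm = np.array([[39, 3],
               [0, 72]])

accuracy = np.trace(cm) / cm.sum()          # (39 + 72) / 114
precision = np.diag(cm) / cm.sum(axis=0)    # correct / column sums (per predicted class)
recall = np.diag(cm) / cm.sum(axis=1)       # correct / row sums (per true class)

print("accuracy :", round(accuracy, 4))     # 0.9737
print("precision:", precision.round(2))     # class 0: 1.0, class 1: 0.96
print("recall   :", recall.round(2))        # class 0: 0.93, class 1: 1.0
```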


### Performance


In [81]:
# Evaluate the model on the final test data (logistic regression model)
y_test_pred = logistic_regression_model.predict(X_test_scaled)
y_test_pred_proba = logistic_regression_model.predict_proba(X_test_scaled)[:, 1]

print("\nTest Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nTest Classification Report:\n", classification_report(y_test, y_test_pred))
print("\nTest Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


Test Accuracy: 0.9298245614035088

Test Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.79      0.88        38
           1       0.90      1.00      0.95        76

    accuracy                           0.93       114
   macro avg       0.95      0.89      0.92       114
weighted avg       0.94      0.93      0.93       114


Test Confusion Matrix:
 [[30  8]
 [ 0 76]]


### Train Model & Select Model

In [82]:
# Initialize and train the model (Decision Tree model)
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train_scaled, y_train) 

In [83]:
# Predict with the model (Decision Tree model)
y_pred = decision_tree_model.predict(X_val_scaled)

In [84]:
# Evaluate model performance (Decision Tree model)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("\nValidation Classification Report:\n", classification_report(y_val, y_pred))
print("\nValidation Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

Validation Accuracy: 0.8947368421052632

Validation Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.83      0.85        42
           1       0.91      0.93      0.92        72

    accuracy                           0.89       114
   macro avg       0.89      0.88      0.89       114
weighted avg       0.89      0.89      0.89       114


Validation Confusion Matrix:
 [[35  7]
 [ 5 67]]
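
The unconstrained tree reaches 0.89 validation accuracy, below the 0.97 of logistic regression, which may indicate overfitting. One way to check is to compare training and hold-out accuracy for several `max_depth` values. The sketch below does this on synthetic data; the depth values and the data are illustrative assumptions and say nothing about results on the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the breast cancer dataset).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=1,
    flip_y=0.1, random_state=0,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

results = {}
for depth in (None, 2, 3, 5):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_fit, y_fit)
    results[depth] = (
        accuracy_score(y_fit, tree.predict(X_fit)),
        accuracy_score(y_hold, tree.predict(X_hold)),
    )

for depth, (train_acc, hold_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, hold-out={hold_acc:.3f}")
```

A large gap between the training and hold-out columns for `max_depth=None` would suggest that a shallower tree is worth trying on the validation set.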


### Performance

In [85]:
# Evaluate the model on the final test data (Decision Tree model)
y_test_pred = decision_tree_model.predict(X_test_scaled)

print("\nTest Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nTest Classification Report:\n", classification_report(y_test, y_test_pred))
print("\nTest Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

Note: the output recorded below came from the original cell, which called `logistic_regression_model` instead of `decision_tree_model`, so it repeats the logistic regression test results. Re-run the corrected cell to obtain the decision tree's test metrics.

Test Accuracy: 0.9298245614035088

Test Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.79      0.88        38
           1       0.90      1.00      0.95        76

    accuracy                           0.93       114
   macro avg       0.95      0.89      0.92       114
weighted avg       0.94      0.93      0.93       114


Test Confusion Matrix:
 [[30  8]
 [ 0 76]]


### Train Model & Select Model

In [86]:
# Initialize and train the model (support vector model)
support_vector_model = SVC(kernel='linear', random_state=42)
support_vector_model.fit(X_train_scaled, y_train)

In [87]:
# Predict with the model (support vector model)
y_pred = support_vector_model.predict(X_val_scaled)

In [88]:
# Evaluate model performance (support vector model)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("\nValidation Classification Report:\n", classification_report(y_val, y_pred))
print("\nValidation Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

Validation Accuracy: 0.9736842105263158

Validation Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.96        42
           1       0.96      1.00      0.98        72

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114


Validation Confusion Matrix:
 [[39  3]
 [ 0 72]]


### Performance

In [89]:
# Evaluate the model on the final test data (support vector model)
y_test_pred = support_vector_model.predict(X_test_scaled)

print("\nTest Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nTest Classification Report:\n", classification_report(y_test, y_test_pred))
print("\nTest Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


Test Accuracy: 0.9385964912280702

Test Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.82      0.90        38
           1       0.92      1.00      0.96        76

    accuracy                           0.94       114
   macro avg       0.96      0.91      0.93       114
weighted avg       0.94      0.94      0.94       114


Test Confusion Matrix:
 [[31  7]
 [ 0 76]]
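
`hinge_loss` is imported at the top of the notebook but never called. For a linear SVM it can be computed from the signed margins returned by `decision_function`. A minimal sketch on synthetic data (the data and the names `svm` and `margins` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import hinge_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled features (not the real dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

svm = SVC(kernel="linear", random_state=0).fit(X_fit, y_fit)

# Signed distance-like score for each held-out row (positive means class 1).
margins = svm.decision_function(X_hold)

# Average hinge loss; it is 0 only when every row is classified with margin >= 1.
holdout_hinge = hinge_loss(y_hold, margins)
print("Hold-out hinge loss:", holdout_hinge)
```

In the notebook the equivalent call would be `hinge_loss(y_val, support_vector_model.decision_function(X_val_scaled))`.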


### Model Construction

In [90]:
# Split the data into input features and the target variable
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Reshape y into a 2-D column array of shape (n_samples, 1)
y = y.values.reshape(-1, 1)

# Split the data into train, validation, and test sets (6:2:2 ratio)
X_train, X_inter, y_train, y_inter = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_inter, y_inter, test_size=0.5, random_state=42)

# Normalize the features (fit on the training set only)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In [91]:
# Convert to tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_val = torch.tensor(X_val, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(np.array(y_train), dtype=torch.long) 
y_val = torch.tensor(np.array(y_val), dtype=torch.long)  
y_test = torch.tensor(np.array(y_test), dtype=torch.long) 

In [92]:
# Define the data loader
train_data = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
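
With 341 training rows and `batch_size=16`, each epoch consists of 22 mini-batches, the last of which holds only 5 rows. The sketch below checks this with placeholder tensors (`X_demo`, `y_demo`, and `loader_demo` are hypothetical names, not the notebook's variables):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors with the training set's shape: 341 rows, 5 features.
X_demo = torch.zeros(341, 5)
y_demo = torch.zeros(341, 1)

loader_demo = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16, shuffle=True)

# Collect the size of every batch in one pass over the loader.
batch_sizes = [inputs.shape[0] for inputs, _ in loader_demo]
print(len(loader_demo))   # 22 batches per epoch
print(batch_sizes[-1])    # 5 rows in the final, partial batch
```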

### Train Model & Select Model

In [93]:
# Define the neural network model
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

input_size = X.shape[1]
hidden_size = 64
output_size = 1
model = NeuralNetwork(input_size, hidden_size, output_size) 
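
For a sense of model size: with 5 inputs, two hidden layers of 64 units, and 1 output, the network has 4,609 trainable parameters. The sketch below counts them on an equivalent `nn.Sequential` stack (the `net` name and the `Sequential` form are used here only for brevity):

```python
import torch.nn as nn

# Same layer sizes as the model above, written with nn.Sequential for brevity.
net = nn.Sequential(
    nn.Linear(5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

# Each Linear layer has in*out weights plus out biases:
# (5*64 + 64) + (64*64 + 64) + (64*1 + 1) = 384 + 4160 + 65
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 4609
```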

In [94]:
# Define the loss function and the optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 
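
`nn.MSELoss` works with a sigmoid output, but binary cross-entropy (`nn.BCELoss`) is the more common pairing for a 0/1 target because it penalizes confident mistakes more heavily. The sketch below compares the two on made-up predictions. It is a suggestion only; the results reported in this notebook were obtained with `nn.MSELoss`.

```python
import torch
import torch.nn as nn

# Made-up sigmoid outputs and 0/1 targets, shaped (batch, 1) like the model's.
probs = torch.tensor([[0.9], [0.1], [0.6]])
targets = torch.tensor([[1.0], [0.0], [0.0]])

mse = nn.MSELoss()(probs, targets)   # mean of squared differences
bce = nn.BCELoss()(probs, targets)   # mean of -[t*log(p) + (1-t)*log(1-p)]

print("MSE:", mse.item())
print("BCE:", bce.item())
```

The third prediction (0.6 for a true 0) contributes 0.36 to the squared-error sum but about 0.92 to the cross-entropy sum, which is why the BCE value comes out larger.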

In [95]:
# Train the model (Neural network model)
for epoch in range(100):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets.float())
        loss.backward()
        optimizer.step()

# X_val is already a float32 tensor, so it is passed to the model directly
# (re-wrapping it with torch.tensor() triggers a UserWarning)

# Predict on the validation data (Neural network model)
with torch.no_grad():
    outputs = model(X_val)
    val_loss = criterion(outputs, y_val.float())
    print("\nValidation MSE Loss:", val_loss.item())
    predicted_val = (outputs > 0.5).float()
    val_accuracy = accuracy_score(y_val, predicted_val)
    print("\nValidation Accuracy:", val_accuracy)
    print("\nValidation Classification Report:\n", classification_report(y_val, predicted_val))
    print("\nValidation Confusion Matrix:\n", confusion_matrix(y_val, predicted_val))


Validation MSE Loss: 0.034588322043418884

Validation Accuracy: 0.956140350877193

Validation Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.95      0.94        42
           1       0.97      0.96      0.97        72

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114


Validation Confusion Matrix:
 [[40  2]
 [ 3 69]]



### Performance

In [41]:
# X_test is already a float32 tensor, so it is passed to the model directly
# (re-wrapping it with torch.tensor() triggers a UserWarning)

# Predict on the test data (Neural network model)
with torch.no_grad():
    outputs_test = model(X_test)
    test_loss = criterion(outputs_test, y_test.float())
    print("\nTest MSE Loss:", test_loss.item())
    predicted_test = (outputs_test > 0.5).float()
    test_accuracy = accuracy_score(y_test, predicted_test)
    print("\nTest Accuracy:", test_accuracy)
    print("\nTest Classification Report:\n", classification_report(y_test, predicted_test))
    print("\nTest Confusion Matrix:\n", confusion_matrix(y_test, predicted_test))


Test MSE Loss: 0.034406233578920364

Test Accuracy: 0.956140350877193

Test Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.92      0.93        38
           1       0.96      0.97      0.97        76

    accuracy                           0.96       114
   macro avg       0.95      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114


Test Confusion Matrix:
 [[35  3]
 [ 2 74]]



---