## team_01
# Basic Analysis of the Breast Cancer Data
---
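- Let's analyze the basic properties of the breast cancer data
- The breast cancer data has 30 attributes, which is quite a lot
- The value ranges of the attributes differ greatly, so normalization is needed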
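
The point about differing value ranges can be checked directly. The following is a minimal sketch, not part of the original notebook, that prints the minimum and maximum of each attribute; the variable names `feature_min` and `feature_max` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data

# Per-attribute minimum and maximum across all samples
feature_min = X.min(axis=0)
feature_max = X.max(axis=0)

for name, lo, hi in zip(cancer.feature_names, feature_min, feature_max):
    print(f"{name:25s} min={lo:10.4f} max={hi:10.4f}")

# How far apart the scales are: largest maximum vs. smallest maximum
print("largest max / smallest max =", feature_max.max() / feature_max.min())
```

Attributes such as the area values reach into the thousands while others stay well below 1, which is why a distance-based model like an RBF-kernel SVC tends to need scaled inputs.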

In [9]:
%pylab inline

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

Populating the interactive namespace from numpy and matplotlib


In [10]:
print(cancer.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

- Let's print the target names and the attributes

In [11]:
print(cancer.target_names) # malignant: 0, benign: 1
for i,name in enumerate(cancer.feature_names):
    print(i,name)

['malignant' 'benign']
0 mean radius
1 mean texture
2 mean perimeter
3 mean area
4 mean smoothness
5 mean compactness
6 mean concavity
7 mean concave points
8 mean symmetry
9 mean fractal dimension
10 radius error
11 texture error
12 perimeter error
13 area error
14 smoothness error
15 compactness error
16 concavity error
17 concave points error
18 symmetry error
19 fractal dimension error
20 worst radius
21 worst texture
22 worst perimeter
23 worst area
24 worst smoothness
25 worst compactness
26 worst concavity
27 worst concave points
28 worst symmetry
29 worst fractal dimension
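
The listing above follows the layout described in `DESCR`: ten base measurements, each reported as a mean, a standard error, and a "worst" value. As a small sketch (the `groups` variable is introduced here for illustration), the 30 names can be split into those three blocks of ten:

```python
from sklearn.datasets import load_breast_cancer

names = load_breast_cancer().feature_names

# The 30 names come in three consecutive blocks of ten:
# indices 0-9 (mean), 10-19 (error), 20-29 (worst)
groups = names.reshape(3, 10)

for label, block in zip(["mean", "error", "worst"], groups):
    print(label, "->", list(block))
```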


- Let's check the shapes of the data and the target values
- Let's check the ratio of malignant to benign samples

In [12]:
print('data =>',cancer.data.shape)
print('target =>',cancer.target.shape)

malignant = cancer.data[cancer.target==0]
benign = cancer.data[cancer.target==1]

print('malignant =>',malignant.shape)
print('benign =>',benign.shape)

data => (569, 30)
target => (569,)
malignant => (212, 30)
benign => (357, 30)
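
The shapes above give the class counts but not the ratio itself. One possible way to compute it, shown as a sketch rather than as part of the original notebook, is `np.bincount` on the target array (`counts` and `ratios` are names chosen here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count samples per class label (0 = malignant, 1 = benign)
counts = np.bincount(cancer.target)
ratios = counts / counts.sum()

for name, count, ratio in zip(cancer.target_names, counts, ratios):
    print(f"{name}: {count} samples ({ratio:.1%})")
```

With 212 malignant and 357 benign samples, the classes are moderately imbalanced: roughly 37% malignant to 63% benign.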


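- Let's classify the data with SVC
- Use all 30 attributes and apply normalization (standardization using the mean and standard deviation of the training set)
- Because the data is split randomly into training and test sets, repeat the run 10 times and look at the scores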

In [19]:
scores = []

for i in range(10):
    # Random split on every iteration (default: 75% train, 25% test)
    X_train,X_test,y_train,y_test = train_test_split(cancer.data,cancer.target)

    # Compute the scaling statistics from the training set only
    X_mean = X_train.mean(axis=0)
    X_std = X_train.std(axis=0)

    # Apply the same training statistics to both sets
    X_train_scaled = (X_train-X_mean)/X_std
    X_test_scaled = (X_test-X_mean)/X_std

    model=SVC(C=1,gamma=0.1)
    model.fit(X_train_scaled,y_train)

    # Accuracy on the held-out test set
    score = model.score(X_test_scaled,y_test)
    scores.append(score)

print('scores =', scores)

scores = [0.965034965034965, 0.972027972027972, 0.972027972027972, 0.9440559440559441, 0.9440559440559441, 0.965034965034965, 0.972027972027972, 0.9230769230769231, 0.951048951048951, 0.9370629370629371]
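
The ten scores recorded above range from about 0.923 to 0.972 and average about 0.955. As an alternative sketch of the same experiment, the manual scaling could be replaced by scikit-learn's `StandardScaler` inside a pipeline, with the scores summarized by their mean and standard deviation. This is not the notebook's original code: the use of `make_pipeline` and the `random_state=seed` argument are assumptions added here for reproducibility, so the individual scores will differ from those above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
scores = []

for seed in range(10):
    # random_state=seed is an assumption added so the splits are repeatable
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=seed)

    # StandardScaler learns mean/std from the training split only,
    # then the pipeline applies the same transform at prediction time
    model = make_pipeline(StandardScaler(), SVC(C=1, gamma=0.1))
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print("mean accuracy =", scores.mean())
print("std of accuracy =", scores.std())
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit the scaling statistics on test data, which is the same precaution the manual version takes by computing `X_mean` and `X_std` from `X_train` alone.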
