#### Problem 2. Breast Cancer Dataset
================

## Q1. Implementation
Classify the breast cancer (breast_cancer) dataset under the following conditions.

Beyond the specified conditions, you are free to add any other processing you think is necessary.

### Conditions
- Data: [breast_cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)
- Evaluation metric: accuracy
- Evaluation protocol: hold-out (cross-validation is not required)
- Algorithm: Support Vector Machine (SVM)

In [44]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd

cancer_data = load_breast_cancer()

# Inspect the dataset description
print(cancer_data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [0]:
X = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)

y = pd.Series(cancer_data.target)

In [46]:
y.value_counts()

1    357
0    212
dtype: int64

In [47]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 

In [48]:
X.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


In [49]:
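# Split the data into training and test sets (hold-out)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training data only, then scale the training data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# Train the SVM on the scaled training data
svc = SVC()
svc.fit(X_train_sc, y_train)

# Scale the test data with the scaler fitted on the training data
X_test_sc = scaler.transform(X_test)

# Evaluate: accuracy on the held-out test set
print(f"{svc.score(X_test_sc, y_test):.2f}")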



0.95




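## Q2. Evaluation, Part 1
Assuming you are explaining this to the client, also describe your final evaluation and analysis results.


- An SVM was trained on part of the 569 available samples and reached 95% accuracy on the held-out test set.
- The data has no missing values, but the features differ widely in magnitude, so they were rescaled (min-max scaling).
- Strictly speaking, accuracy alone is not a suitable metric for a breast cancer diagnosis model, because it does not distinguish a missed malignant case from a false alarm.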
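
The last point can be illustrated with the class counts printed by `y.value_counts()` (357 samples labeled 1, 212 labeled 0). The sketch below assumes the scikit-learn encoding of this dataset, where 0 is malignant and 1 is benign, and scores a hypothetical baseline that labels every sample benign. The helper name `accuracy_and_recall` is an illustrative choice, not part of the notebook.

```python
def accuracy_and_recall(tp, fn, fp, tn):
    """Return (accuracy, recall) with 'malignant' treated as the positive class.

    tp: malignant predicted malignant, fn: malignant predicted benign,
    fp: benign predicted malignant,    tn: benign predicted benign.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Recall = share of actual malignant cases that the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical baseline: call every one of the 569 samples benign.
# Counts come from y.value_counts(): 212 malignant (0), 357 benign (1).
baseline_accuracy, baseline_recall = accuracy_and_recall(tp=0, fn=212, fp=0, tn=357)

print(f"accuracy: {baseline_accuracy:.3f}")  # 357 / 569
print(f"malignant recall: {baseline_recall:.3f}")
```

Such a baseline is right about 63% of the time yet misses every malignant case, which is why recall on the malignant class is worth reporting alongside accuracy.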

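## Q3. Improving the Model
The client has requested an accuracy of at least 90%.

Improve the model so that its accuracy is 90% or higher. You may freely change the conditions from Q1 to do so.

If you already achieved an accuracy of 90% or higher in Q1, you do not need to answer Q3 and Q4.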


## Q4. Evaluation, Part 2
Once again, assuming you are explaining this to the client, describe your final evaluation and analysis results.

## Q5. Evaluation with a Confusion Matrix
Compute the confusion matrix for this model and explain the result.


In [0]:
from sklearn.metrics import confusion_matrix

# Predict on the scaled test data, since the model was trained on scaled features
svc_pred = svc.predict(X_test_sc)
confusion = confusion_matrix(y_test, svc_pred)
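
As an aid to reading the result, the following sketch rebuilds a 2x2 confusion matrix by hand in the layout that `confusion_matrix` uses: rows are true labels, columns are predicted labels, and labels appear in sorted order. The label lists are made-up toy values, not the notebook's predictions, and `confusion_2x2` is a hypothetical helper. It assumes the scikit-learn encoding 0 = malignant, 1 = benign.

```python
from collections import Counter


def confusion_2x2(y_true, y_pred):
    """Count (true, predicted) pairs into [[n00, n01], [n10, n11]]."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in (0, 1)] for t in (0, 1)]


# Toy labels for illustration only (0 = malignant, 1 = benign).
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0, 1]

matrix = confusion_2x2(toy_true, toy_pred)
# Row 0 = actually malignant, row 1 = actually benign.
(malignant_caught, malignant_missed), (benign_flagged, benign_cleared) = matrix

print(matrix)
print("malignant cases missed:", malignant_missed)
print("benign cases flagged as malignant:", benign_flagged)
```

In a screening setting the top-right cell (malignant predicted as benign) is usually the costliest error, so it deserves the closest look when explaining the matrix.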

In [0]:
# Started second-guessing this, so stopping here for now