# Classification: Wisconsin Breast Cancer
- Dataset from UCI repository
- https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

### Attribute Information:

1. Sample code number: id number 
2. Clump Thickness: 1 - 10 
3. Uniformity of Cell Size: 1 - 10 
4. Uniformity of Cell Shape: 1 - 10 
5. Marginal Adhesion: 1 - 10 
6. Single Epithelial Cell Size: 1 - 10 
7. Bare Nuclei: 1 - 10 
8. Bland Chromatin: 1 - 10 
9. Normal Nucleoli: 1 - 10 
10. Mitoses: 1 - 10 
11. Class: (2 for benign, 4 for malignant)

In [16]:
import pandas as pd

In [17]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
names = [
    'Code', 'Clump-Thickness', 'Cell-Size', 'Cell-Shape', 'Adhesion', 
    'Single-Cell-Size', 'Bare-Nuclei', 'Chromatin', 'Nucleoli', 'Mitoses', 'Class']
dataset = pd.read_csv(url, names=names)
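
One caveat when loading this file: in the UCI data the Bare Nuclei column marks missing values with `?`, so pandas reads that column as strings unless told otherwise. A minimal sketch of one way to handle this, assuming an inline three-row sample in place of the real URL (the rows are made up for illustration, not copied from the file):

```python
import io
import pandas as pd

names = [
    'Code', 'Clump-Thickness', 'Cell-Size', 'Cell-Shape', 'Adhesion',
    'Single-Cell-Size', 'Bare-Nuclei', 'Chromatin', 'Nucleoli', 'Mitoses', 'Class']

# Hypothetical rows in the same 11-column layout; the second row has a
# missing Bare-Nuclei value written as '?'.
sample_csv = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,4,5,1,2,?,7,3,1,4\n"
    "3,6,8,8,1,3,4,3,7,1,4\n"
)

# na_values='?' turns the placeholder into NaN so the column parses as numeric.
sample = pd.read_csv(sample_csv, names=names, na_values='?')
print(sample['Bare-Nuclei'].isna().sum())  # number of missing entries

# One simple option (an assumption, not the notebook's method): drop incomplete rows.
clean = sample.dropna()
print(len(sample), len(clean))
```

Dropping rows is only one choice; imputing a value would keep more samples and may suit a small dataset better.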

### Using the models you have learned so far, build a high-performing model that predicts whether a sample indicates breast cancer.

In [18]:
dataset.head()

      Code  Clump-Thickness  Cell-Size  Cell-Shape  Adhesion  \
0  1000025                5          1           1         1   
1  1002945                5          4           4         5   
2  1015425                3          1           1         1   
3  1016277                6          8           8         1   
4  1017023                4          1           1         3   

In [19]:
feature_cols = ['Clump-Thickness', 'Cell-Size', 'Single-Cell-Size', 'Chromatin']
X = dataset[feature_cols]
y = dataset.Class

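# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split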
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [20]:
from sklearn.linear_model import LogisticRegression
# C is the inverse regularization strength, so a very large C makes the penalty negligible
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
print(list(zip(feature_cols, logreg.coef_[0])))

[('Clump-Thickness', 0.58675019992341781), ('Cell-Size', 0.58710529283138757), ('Single-Cell-Size', 0.26353603540052462), ('Chromatin', 0.70392734562201209)]
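
Logistic regression coefficients are on the log-odds scale, so exponentiating each one gives the multiplicative change in the odds of the higher-sorted class (4, malignant) for a one-point increase in that feature, holding the others fixed. A small sketch using the coefficients printed above (the dictionary is typed in by hand here rather than read from `logreg`):

```python
import math

# Coefficients copied from the fitted model's printed output.
coefs = {
    'Clump-Thickness': 0.58675019992341781,
    'Cell-Size': 0.58710529283138757,
    'Single-Cell-Size': 0.26353603540052462,
    'Chromatin': 0.70392734562201209,
}

# exp(coefficient) = odds ratio per one-unit increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# Print the features from strongest to weakest effect on the odds.
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}")
```

All four features share the same 1 to 10 scale, so the ratios can be compared directly.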


In [22]:
y_pred_class = logreg.predict(X_test)

In [23]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.965714285714


In [24]:
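# Rows are actual classes and columns are predicted classes, in sorted label order:
#                        Predicted 2 (benign)   Predicted 4 (malignant)
# Actual 2 (benign)              TN                      FP
# Actual 4 (malignant)           FN                      TP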
print(metrics.confusion_matrix(y_test, y_pred_class))

[[108   1]
 [  5  61]]
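
scikit-learn orders the rows and columns of a confusion matrix by sorted label value, with actual classes on the rows and predicted classes on the columns. The following pure-Python sketch mimics that layout; it is an illustration rather than scikit-learn's implementation, and `confusion_counts` and the toy labels are hypothetical:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) with rows = actual class, columns = predicted class."""
    labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(actual, predicted)] for predicted in labels]
              for actual in labels]
    return labels, matrix

# Toy labels made up for illustration: 2 = benign, 4 = malignant.
toy_true = [2, 2, 2, 4, 4, 4, 4]
toy_pred = [2, 2, 4, 4, 4, 4, 2]

labels, matrix = confusion_counts(toy_true, toy_pred)
print(labels)
for row in matrix:
    print(row)
```

With labels 2 and 4, the malignant class lands in the second row and second column, which is why the true positives sit at index `[1][1]`.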


In [25]:
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1][1]
TN = confusion[0][0]
FP = confusion[0][1]
FN = confusion[1][0]

In [26]:
print('True Positives:', TP)
print('True Negatives:', TN)
print('False Positives:', FP)
print('False Negatives:', FN)

True Positives: 61
True Negatives: 108
False Positives: 1
False Negatives: 5


In [27]:
# calculate the sensitivity
print('Sensitivity:', TP / float(TP + FN))

Sensitivity: 0.924242424242


In [28]:
# calculate the specificity
print('Specificity:', TN / float(TN + FP))

Specificity: 0.990825688073
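
The same four counts also give precision and the F1 score, which can be useful alongside sensitivity and specificity when the classes are imbalanced. A short sketch that hard-codes the counts reported above instead of reading them from `confusion`:

```python
# Counts taken from the confusion matrix output (malignant = positive class).
TP, TN, FP, FN = 61, 108, 1, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # share of predicted malignant cases that are malignant
recall = TP / (TP + FN)      # identical to sensitivity
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
```

The accuracy computed this way agrees with the value returned by `metrics.accuracy_score` earlier.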


In [29]:
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

In [30]:
# The L1 penalty needs a solver that supports it, such as liblinear
classifier = LogisticRegression(penalty='l1', C=0.2, solver='liblinear')
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

In [31]:
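# y_test holds the labels 2 (benign) and 4 (malignant), so roc_curve must be told
# which one is the positive class; without pos_label it raises
# "ValueError: Data is not binary and pos_label is not specified".
fpr, tpr, _ = roc_curve(y_test, y_score, pos_label=4)
roc_auc = auc(fpr, tpr)
print(roc_auc)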
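
Because the class labels here are 2 and 4 rather than 0 and 1, any ROC computation has to be told which label counts as positive. The area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting one half. A pure-Python sketch of that definition (not scikit-learn's implementation; `pairwise_auc` and the toy data are hypothetical):

```python
def pairwise_auc(y_true, scores, pos_label):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == pos_label]
    neg = [s for y, s in zip(y_true, scores) if y != pos_label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores made up for illustration: 2 = benign, 4 = malignant.
toy_labels = [2, 2, 4, 4]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores, pos_label=4)
print(toy_auc)
```

Choosing label 2 as positive would score the reversed ranking, so the choice of `pos_label` has to match the direction of the scores.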