## Used Packages


numpy(numerical python) : 선형대수(배열,행렬) 연산에 효과적인 라이브러리

          An effective library for linear algebra(arrays, matrices)

pandas : 구조화된 데이터를 가공하는데 효과적인 라이브러리 (dataframe)

          An effective library for structured data processing

scikit-learn : 데이터 분석을 위한 라이브러리 (numpy, scipy, matplotlib 기반)

          library for data analysis



In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import warnings
from sklearn import svm
warnings.filterwarnings("ignore")

# SVM(classification)
### breast_cancer data  (569개의 데이터)


종속변수(dependent variable) : 양성1 , 음성0

독립변수(independent variable) : 심장의 이미지를 설명하는 30개의 변수

In [2]:

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data[:100, :])
y = pd.DataFrame(breast_cancer.target[:100])
y=y[0] 

breast_cancer_data=pd.concat([X, y], axis=1)
breast_cancer_data_df=pd.DataFrame(breast_cancer_data)
breast_cancer_data_df.shape


(100, 31)

커널 종류 : linear,polynomial(degree 설정필요),rbf(gamma 설정필요),sigmoid 
조율 파라미터 C 설정

In [3]:
for c in [0.01,0.1,1,10]:
    for i in ["linear","rbf","poly"]:
        clf = svm.SVC(kernel=i, C=c)
        scores = cross_val_score(clf, X, y, cv=10)
        print("kernel:",i,"\n"
              "C:",c,"\n","Accuracy: ",(scores.mean()),"\n" ,"-" *25)
    

kernel: linear 
C: 0.01 
 Accuracy:  0.926262626263 
 -------------------------
kernel: rbf 
C: 0.01 
 Accuracy:  0.651515151515 
 -------------------------
kernel: poly 
C: 0.01 
 Accuracy:  0.90404040404 
 -------------------------
kernel: linear 
C: 0.1 
 Accuracy:  0.90404040404 
 -------------------------
kernel: rbf 
C: 0.1 
 Accuracy:  0.651515151515 
 -------------------------
kernel: poly 
C: 0.1 
 Accuracy:  0.90404040404 
 -------------------------
kernel: linear 
C: 1 
 Accuracy:  0.915151515152 
 -------------------------
kernel: rbf 
C: 1 
 Accuracy:  0.651515151515 
 -------------------------
kernel: poly 
C: 1 
 Accuracy:  0.90404040404 
 -------------------------
kernel: linear 
C: 10 
 Accuracy:  0.915151515152 
 -------------------------
kernel: rbf 
C: 10 
 Accuracy:  0.651515151515 
 -------------------------
kernel: poly 
C: 10 
 Accuracy:  0.90404040404 
 -------------------------


### Split dataset

In [4]:
X = pd.DataFrame(breast_cancer.data[100:300, :])
y = pd.DataFrame(breast_cancer.target[100:300])
y

Unnamed: 0,0
0,0
1,1
2,1
3,1
4,1
5,0
6,1
7,1
8,0
9,1


In [5]:
X_train, X_test , y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)
X_train=X_train[:]
y_train=y_train[:]
dfXtrain=pd.DataFrame(X_train)
dfytrain=pd.DataFrame(y_train)
dfXtest=pd.DataFrame(X_test)
dfytest=pd.DataFrame(y_test)
print(dfXtrain.shape)
print(dfXtest.shape)


(140, 30)
(60, 30)


### model training

In [6]:
C = 0.01
kernel= "linear"
models = (svm.SVC(kernel=kernel, C=C))
clf.fit(dfXtrain, dfytrain)
svm_y_pred=clf.predict(dfXtest)
svm_y_pred

array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

### result

In [7]:
from sklearn.metrics import precision_score,recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

print ('Accuracy:', accuracy_score(y_test, svm_y_pred))
print ('F1 score:', f1_score(y_test, svm_y_pred))
print ('Recall:', recall_score(y_test, svm_y_pred))
print ('Precision:', precision_score(y_test, svm_y_pred))
print ('\n clasification report:\n', classification_report(y_test,svm_y_pred))
print ('\n confussion matrix:\n',confusion_matrix(y_test, svm_y_pred))

Accuracy: 0.9
F1 score: 0.916666666667
Recall: 0.970588235294
Precision: 0.868421052632

 clasification report:
              precision    recall  f1-score   support

          0       0.95      0.81      0.88        26
          1       0.87      0.97      0.92        34

avg / total       0.91      0.90      0.90        60


 confussion matrix:
 [[21  5]
 [ 1 33]]


# SVR(regression)
### breast_cancer data  (442개의 데이터)

종속변수(dependent variable) : 1년 후의 당뇨병 진행도 (diabetes progression after 1 year)

독립변수(independent variable) : 나이, 성별, BMI지수, 혈압 등 10개의 변수 (age, sex, BMI index, pressure pressure, etc.)


In [8]:
diabetes = datasets.load_diabetes()
dfX=pd.DataFrame(diabetes.data)
dfy=pd.DataFrame(diabetes.target)
dfdiabetes=pd.concat([dfX, dfy], axis=1)
dfdiabetes.columns=['X1','X2','X3','X4','X5','X6','X7','X8','X9','X10','y']

## 파라미터 선택을 위한 cross validation
### *가장 낮은 MSE값을 가지는 C와 kernel을 선택

In [9]:
for c in [0.01,0.1,1,10,100]:
    for i in ["linear","rbf","poly"]:
        for e in [0.001,0.01,0.1]:
            clf = svm.SVR(kernel=i, C=c, epsilon=e)
            scores = cross_val_score(clf, dfX, dfy, cv=10, scoring='mean_squared_error')
            print("kernel:",i,"\n","C:",c,"\n","MSE: ",np.mean(scores),"\n","epsilon:",e,"\n" ,"-" *25)

kernel: linear 
 C: 0.01 
 MSE:  -6090.05632778 
 epsilon: 0.001 
 -------------------------
kernel: linear 
 C: 0.01 
 MSE:  -6090.09219259 
 epsilon: 0.01 
 -------------------------
kernel: linear 
 C: 0.01 
 MSE:  -6090.60486718 
 epsilon: 0.1 
 -------------------------
kernel: rbf 
 C: 0.01 
 MSE:  -6091.37582335 
 epsilon: 0.001 
 -------------------------
kernel: rbf 
 C: 0.01 
 MSE:  -6091.42479295 
 epsilon: 0.01 
 -------------------------
kernel: rbf 
 C: 0.01 
 MSE:  -6091.94822551 
 epsilon: 0.1 
 -------------------------
kernel: poly 
 C: 0.01 
 MSE:  -6091.70593508 
 epsilon: 0.001 
 -------------------------
kernel: poly 
 C: 0.01 
 MSE:  -6091.75802023 
 epsilon: 0.01 
 -------------------------
kernel: poly 
 C: 0.01 
 MSE:  -6092.28154478 
 epsilon: 0.1 
 -------------------------
kernel: linear 
 C: 0.1 
 MSE:  -6075.26032092 
 epsilon: 0.001 
 -------------------------
kernel: linear 
 C: 0.1 
 MSE:  -6075.29051196 
 epsilon: 0.01 
 -------------------------
kern

### Split dataset

In [10]:
from sklearn import svm

X_train, X_test , y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.3, random_state=1)
dfXtrain=pd.DataFrame(X_train)
dfytrain=pd.DataFrame(y_train)
dfXtest=pd.DataFrame(X_test)
dfytest=pd.DataFrame(y_test)
print(dfXtrain.shape)
print(dfXtest.shape)

(309, 10)
(133, 10)


### model training

In [11]:
clf = svm.SVR(C=100,kernel= "linear", epsilon=0.1)
clf.fit(dfXtrain, dfytrain)
svr_y_pred=clf.predict(dfXtest)

### result

In [12]:
print ('MSE:', mean_squared_error(dfytest, svr_y_pred))
print ('MAE:', mean_absolute_error(dfytest, svr_y_pred))

MSE: 2931.89939913
MAE: 43.2634110623
