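### SVM hyperparameters
* C (cost)
    - A small cost value can lead to underfitting.
        * A small cost tolerates some misclassified training samples in exchange for a wider margin, which often helps the model classify new data well; if it is too small, the model becomes too simple.
    - A large cost value can lead to overfitting.
        * A large cost minimizes errors on the training data, but the model is then more likely to misclassify new data.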

![image](images/cost.png)
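
For reference, the role of C can be read off the standard soft-margin objective that a support vector classifier optimizes. The notation here (weights $w$, bias $b$, slack variables $\xi_i$, feature map $\phi$) follows the usual textbook formulation and is an addition, not something taken from the original notes.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

Each $\xi_i$ measures how far sample $i$ violates the margin. A small C makes violations cheap, so the optimizer prefers a wide margin; a large C makes them expensive, so the boundary bends to classify more training points correctly.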

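* gamma
    - Controls how far the influence of a single training sample reaches when the decision boundary is drawn.
    - A large gamma can lead to overfitting.
        - Each sample's influence is confined to a small neighborhood, so the decision boundary bends around individual points and becomes complex.
    - A small gamma can lead to underfitting.
        - Each sample's influence reaches far, so the decision boundary becomes smooth.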

![image](images/gamma.png)
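
To make the "reach of influence" concrete, the sketch below evaluates the RBF kernel value exp(-gamma * distance^2), the similarity measure used by `SVC(kernel='rbf')`, for a few gamma values. The helper name `rbf_similarity` and the sample distances are illustrative assumptions, not part of the original notebook.

```python
import math


def rbf_similarity(distance, gamma):
    """RBF kernel value for two points separated by `distance`."""
    return math.exp(-gamma * distance ** 2)


# Hypothetical distances between a support vector and another point.
distances = [0.5, 1.0, 2.0]

for gamma in [0.01, 1.0, 100.0]:
    values = [rbf_similarity(d, gamma) for d in distances]
    print(f"gamma={gamma:<6}", [f"{v:.3g}" for v in values])
```

With a small gamma the similarity stays close to 1 even for distant points, so every training sample affects a wide region and the boundary stays smooth. With a large gamma the similarity collapses toward 0 outside a point's immediate neighborhood, which lets the boundary wrap around individual samples.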

### Loading the dataset

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import pandas as pd



In [2]:
df = pd.read_csv('data/titanic_cleaning.csv')
df.head()

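| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 |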


In [3]:
df.columns


Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare'],
      dtype='object')

In [4]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare']
label = 'Survived'

X,y = df[features], df[label]

In [5]:
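# First split: unscaled features.
X_train , X_test, y_train , y_test =\
                        train_test_split(X, y, test_size = 0.2)

# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fitted on all rows, so it also sees rows that end up in the test set.
scaler = StandardScaler()
scaler.fit(X)
X_scaler = scaler.transform(X)

# Second split: scaled features.
# Note: no random_state is set, so this call shuffles differently from the first one
# and overwrites y_train / y_test; the unscaled X_train above no longer lines up with y_train.
X_train_scaler , X_test_scaler, y_train , y_test =\
                        train_test_split(X_scaler, y, test_size = 0.2)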
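
One possible variant of this split, sketched under assumptions: it calls `train_test_split` once with a fixed `random_state`, so features and labels stay row-aligned, and it fits the scaler on the training rows only, so no test-set statistics reach the model. The synthetic data from `make_classification`, the seed 42, and `stratify=y` are illustrative choices that do not come from the original notebook, and scores obtained this way will not match the outputs recorded in this document.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption: 6 numeric columns).
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# A single split keeps X_train and y_train aligned row by row.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

print(X_train_scaler.shape, X_test_scaler.shape)
```

Keeping the unscaled and scaled arrays from the same split also makes a before/after scaling comparison meaningful, because both models then see exactly the same rows and labels.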


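### Building a linear model and tuning its hyperparameter
* With `kernel='linear'`, C (cost) is the only one of the two hyperparameters above that has an effect; `gamma` is ignored by the linear kernel.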

In [6]:
svc = SVC(kernel='linear')
svc.fit( X_train, y_train)

print('=== before scaling ===')
print('train :', svc.score(X_train, y_train))
print('test :', svc.score(X_test, y_test))

=== before scaling ===
train : 0.6123595505617978
test : 0.6312849162011173


In [7]:
svc.fit( X_train_scaler, y_train)

print('=== after scaling ===')
print('train :', svc.score(X_train_scaler, y_train))
print('test :', svc.score(X_test_scaler, y_test))

=== after scaling ===
train : 0.7949438202247191
test : 0.7821229050279329


In [8]:
svc = SVC(kernel='linear', C = 1000)
svc.fit(X_train_scaler, y_train)

print('train :', svc.score(X_train_scaler, y_train))
print('test :', svc.score(X_test_scaler, y_test))

train : 0.7949438202247191
test : 0.7821229050279329


In [9]:
param_cost = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_list = []
test_list = []
# Fit a linear SVC for each cost value and record train/test accuracy.
for cost in param_cost:
    svc = SVC(kernel='linear', C=cost)
    svc.fit(X_train_scaler, y_train)

    train_list.append(svc.score(X_train_scaler, y_train))
    test_list.append(svc.score(X_test_scaler, y_test))

dic = {
    'cost': param_cost,
    'train accuracy': train_list,
    'test accuracy': test_list,
}
score_df = pd.DataFrame(dic)
score_df

| | cost | train accuracy | test accuracy |
|---|---|---|---|
| 0 | 0.001 | 0.655899 | 0.642458 |
| 1 | 0.01 | 0.787921 | 0.782123 |
| 2 | 0.1 | 0.794944 | 0.782123 |
| 3 | 1.0 | 0.794944 | 0.782123 |
| 4 | 10.0 | 0.794944 | 0.782123 |
| 5 | 100.0 | 0.794944 | 0.782123 |
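
Accuracy can stay flat across a range of C values even though the fitted model changes underneath. One way to look at that, sketched here on synthetic stand-in data (an assumption, not the Titanic file), is to count support vectors through the fitted model's `n_support_` attribute. A smaller C typically leaves more training points inside the margin, so more of them become support vectors.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), standardized as in the notebook.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

param_cost = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
support_counts = []
for cost in param_cost:
    svc = SVC(kernel='linear', C=cost).fit(X_scaled, y)
    # n_support_ holds the number of support vectors for each class.
    support_counts.append(int(svc.n_support_.sum()))

for cost, count in zip(param_cost, support_counts):
    print(f"C={cost:<7} support vectors={count}")
```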


In [10]:
svc = SVC(kernel='rbf')

svc.fit(X_train_scaler, y_train)

print('train :', svc.score(X_train_scaler, y_train))
print('test :', svc.score(X_test_scaler, y_test))

train : 0.8426966292134831
test : 0.7932960893854749


In [15]:
param_cost = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_list = []
test_list = []
# Fit an RBF SVC for each cost value and record train/test accuracy.
for cost in param_cost:
    svc = SVC(kernel='rbf', C=cost)
    svc.fit(X_train_scaler, y_train)

    train_list.append(svc.score(X_train_scaler, y_train))
    test_list.append(svc.score(X_test_scaler, y_test))

dic = {
    'cost': param_cost,
    'train accuracy': train_list,
    'test accuracy': test_list,
}
score_df = pd.DataFrame(dic)
score_df

| | cost | train accuracy | test accuracy |
|---|---|---|---|
| 0 | 0.001 | 0.61236 | 0.631285 |
| 1 | 0.01 | 0.61236 | 0.631285 |
| 2 | 0.1 | 0.817416 | 0.787709 |
| 3 | 1.0 | 0.842697 | 0.793296 |
| 4 | 10.0 | 0.853933 | 0.776536 |
| 5 | 100.0 | 0.867978 | 0.75419 |


In [18]:
param_gamma = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_list = []
test_list = []
# Keep C fixed at 100 and record train/test accuracy for each gamma value.
for g in param_gamma:
    svc = SVC(kernel='rbf', C=100, gamma=g)
    svc.fit(X_train_scaler, y_train)

    train_list.append(svc.score(X_train_scaler, y_train))
    test_list.append(svc.score(X_test_scaler, y_test))

dic = {
    'gamma': param_gamma,
    'train accuracy': train_list,
    'test accuracy': test_list,
}
score_df = pd.DataFrame(dic)
score_df

| | gamma | train accuracy | test accuracy |
|---|---|---|---|
| 0 | 0.001 | 0.797753 | 0.782123 |
| 1 | 0.01 | 0.838483 | 0.787709 |
| 2 | 0.1 | 0.859551 | 0.776536 |
| 3 | 1.0 | 0.900281 | 0.787709 |
| 4 | 10.0 | 0.945225 | 0.698324 |
| 5 | 100.0 | 0.963483 | 0.659218 |
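
The sweep above scores every gamma value on the test set, so picking gamma from that table lets the test set influence model selection. A hedged alternative is scikit-learn's `validation_curve`, which scores each value by cross-validation on the training data alone. The synthetic data, the pipeline step name `svc` (generated by `make_pipeline`), and `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 6 numeric features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

param_gamma = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Scaling lives inside the pipeline, so each fold is scaled on its own training part.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=100))

train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name='svc__gamma',
    param_range=param_gamma,
    cv=5,
)

# Each score array has one row per gamma value and one column per fold.
for g, tr, va in zip(param_gamma,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"gamma={g:<7} train={tr:.3f}  validation={va:.3f}")
```

A widening gap between the train and validation columns as gamma grows is the same overfitting pattern that the table above shows between train and test accuracy.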


In [20]:
from sklearn.model_selection import GridSearchCV

param_cost = [0.001, 0.01, 0.1,1.0,10.0, 100.0]
param_gamma = [0.001, 0.01, 0.1,1.0,10.0, 100.0]

params = {
    'C':param_cost,
    'gamma':param_gamma
}
svc = SVC(kernel='rbf')
grid_cv = GridSearchCV(svc, param_grid=params,cv=5,n_jobs = -1)

grid_cv.fit(X_train_scaler, y_train)

print('best hyperparameters : ', grid_cv.best_params_)
print('train : ',grid_cv.score(X_train_scaler,y_train))
print('test : ',grid_cv.score(X_test_scaler,y_test))

best hyperparameters :  {'C': 1.0, 'gamma': 0.1}
train :  0.8412921348314607
test :  0.7877094972067039
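
The grid search above runs on features that were scaled before cross-validation, so every fold is scaled with statistics that include its own validation rows. The following sketch shows one way to avoid that by placing the scaler inside a pipeline. The synthetic data, the reduced grid, and the `svc__` parameter prefix (derived from the step name that `make_pipeline` generates) are assumptions for illustration, not the original notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 6 numeric features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is refitted on the training part of every CV fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
params = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': [0.01, 0.1, 1.0],
}
grid_cv = GridSearchCV(pipe, param_grid=params, cv=5)
grid_cv.fit(X_train, y_train)

print('best hyperparameters :', grid_cv.best_params_)
print('mean CV accuracy     :', grid_cv.best_score_)
print('test                 :', grid_cv.score(X_test, y_test))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which is usually a more honest figure to compare against the test score than the accuracy on the full training set.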
