### Classification (SVM)
* This session covers a classification problem.


#### [RECALL] `PyCaret process`

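1. `Set up the environment`: Configure the environment for the analysis (independent variables, dependent variable, how to handle missing values, etc.)
> setup()
2. `Create Model`: Create a model and run cross-validation on the training data
> create_model()
3. `Tune Model`: Tune the model's hyperparameters (skipped for regression analysis because there were no parameters to optimize)
> tune_model()
4. `Evaluate Model`: Evaluate the performance of the models
> evaluate_model()
5. `Finalize model`: Retrain on the full dataset
> finalize_model()
6. `Predict Model`: Predict on new data
> predict_model()
7. `Save Model`: Save the model for later use
> save_model()
8. `Load Model`: Load a saved model
> load_model()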

### 1. Loading the data

- This exercise uses the 'wine quality' dataset (wine_quality.csv); the goal is to predict a wine's quality class from its other variables.

In [1]:
# Load the training data
import pandas as pd
dataset = pd.read_csv("wine_quality.csv", sep=',', header=0)
display(dataset)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,2
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,2
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,2
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,3
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,2
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,2
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,3
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,3
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,2


In [2]:
# Check the shape of the data
dataset.shape

(1599, 12)

In [3]:
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows as "new" data, keeping the quality class proportions (stratify)
dataset_train, dataset_new = train_test_split(dataset, test_size=0.1, stratify=dataset['quality'], random_state=2)
display(dataset_train)
display(dataset_new)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
810,7.3,0.49,0.10,2.60,0.068,4.0,14.0,0.99562,3.30,0.47,10.5,2
477,10.4,0.24,0.49,1.80,0.075,6.0,20.0,0.99770,3.18,1.06,11.0,3
1594,6.2,0.60,0.08,2.00,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,2
1002,9.1,0.29,0.33,2.05,0.063,13.0,27.0,0.99516,3.26,0.84,11.7,4
580,12.3,0.50,0.49,2.20,0.089,5.0,14.0,1.00020,3.19,0.44,9.6,2
...,...,...,...,...,...,...,...,...,...,...,...,...
730,9.5,0.55,0.66,2.30,0.387,12.0,37.0,0.99820,3.17,0.67,9.6,2
1516,6.1,0.32,0.25,2.30,0.071,23.0,58.0,0.99633,3.42,0.97,10.6,2
138,7.8,0.56,0.19,2.10,0.081,15.0,105.0,0.99620,3.33,0.54,9.5,2
1570,6.4,0.36,0.53,2.20,0.230,19.0,35.0,0.99340,3.37,0.93,12.4,3


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1077,8.6,0.370,0.65,6.4,0.080,3.0,8.0,0.99817,3.27,0.58,11.000000,2
794,10.1,0.270,0.54,2.3,0.065,7.0,26.0,0.99531,3.17,0.53,12.500000,3
1416,10.0,0.320,0.59,2.2,0.077,3.0,15.0,0.99940,3.20,0.78,9.600000,2
1353,7.6,0.645,0.03,1.9,0.086,14.0,57.0,0.99690,3.37,0.46,10.300000,2
1439,7.3,0.670,0.02,2.2,0.072,31.0,92.0,0.99566,3.32,0.68,11.066667,3
...,...,...,...,...,...,...,...,...,...,...,...,...
1385,8.0,0.810,0.25,3.4,0.076,34.0,85.0,0.99668,3.19,0.42,9.200000,2
1457,7.6,0.490,0.33,1.9,0.074,27.0,85.0,0.99706,3.41,0.58,9.000000,2
1408,8.1,0.290,0.36,2.2,0.048,35.0,53.0,0.99500,3.27,1.01,12.400000,4
1129,10.5,0.430,0.35,3.3,0.092,24.0,70.0,0.99798,3.21,0.69,10.500000,3
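
The `stratify` argument in the split above keeps the share of each `quality` class roughly equal in both parts. The following is a minimal standard-library sketch of that idea, not scikit-learn's implementation; the helper `stratified_split` and the made-up label counts are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up, imbalanced labels (not the wine data).
labels = [2] * 700 + [3] * 250 + [4] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.1)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both counters show the same 70% / 25% / 5% class mix, which is why a rare quality class is still represented in the held-out rows.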


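### 2. Environment setup

* In PyCaret, the first step of a machine learning experiment is to import the module that matches the task (here, classification) and set up the environment.
* The module used for classification in this exercise is pycaret.classification.
* The setup() function takes the DataFrame (`data`) and the dependent variable (`target`, here 'quality') and initializes the classification experiment.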
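
As a rough guide to what `train_size = 0.8` and `fold = 5` mean for the 1,439 rows of `dataset_train`, the arithmetic can be sketched as follows. The rounding rules are assumptions made for illustration; PyCaret's internal split may round differently.

```python
# Row counts implied by the setup() arguments used in this exercise.
n_rows = 1439      # rows in dataset_train
train_size = 0.8   # share of rows kept for training inside setup()
n_folds = 5        # value of the fold argument

# Assumption: the training part is the integer part of train_size * n_rows.
n_train = int(n_rows * train_size)
n_holdout = n_rows - n_train

# Assumption: folds are as equal as possible, the first few getting one extra row.
base, extra = divmod(n_train, n_folds)
fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]

print(n_train, n_holdout, fold_sizes)
```

Under these assumptions, cross-validation works on the training part only, and the remaining rows form the hold-out set that is scored later.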

In [4]:
from pycaret.classification import *  # import every function in pycaret.classification

# Data preprocessing
cla = setup(data=dataset_train, fold=5, target='quality', train_size=0.8,
            data_split_stratify=True, session_id=123,
            numeric_features=[], categorical_features=[], ignore_features=[])

Unnamed: 0,Description,Value
0,session_id,123
1,Target,quality
2,Target Type,Multiclass
3,Label Encoded,
4,Original Data,"(1439, 12)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


### 3. Creating a model

* Models are created with the create_model function.

In [5]:
# List the available classification models
models()

Unnamed: 0_level_0,Name,Reference,Turbo
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
lr,Logistic Regression,sklearn.linear_model._logistic.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors._classification.KNeighborsCl...,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree._classes.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model._stochastic_gradient.SGDC...,True
rbfsvm,SVM - Radial Kernel,sklearn.svm._classes.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process._gpc.GaussianProcessC...,False
mlp,MLP Classifier,sklearn.neural_network._multilayer_perceptron....,False
ridge,Ridge Classifier,sklearn.linear_model._ridge.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble._forest.RandomForestClassifier,True


### 3.1 SVM

In [6]:
# Create the SVM model (radial kernel)
svm = create_model('rbfsvm')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.5498,0.706,0.2374,0.501,0.5139,0.25,0.2564
1,0.5609,0.7166,0.2502,0.5204,0.5311,0.2701,0.2753
2,0.5739,0.7061,0.2557,0.5572,0.5432,0.2846,0.2915
3,0.5826,0.735,0.2548,0.5509,0.5536,0.3042,0.3101
4,0.5609,0.7232,0.2472,0.5241,0.5351,0.2772,0.2821
Mean,0.5656,0.7174,0.249,0.5307,0.5354,0.2772,0.2831
SD,0.0114,0.011,0.0066,0.0207,0.0132,0.0177,0.0177
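
The `Mean` and `SD` rows summarize the five fold scores. As a quick check, the short sketch below recomputes them for the Accuracy column. The reported SD is consistent with a population standard deviation (dividing by n); that reading is an inference from the numbers, not a statement about PyCaret's internals.

```python
from statistics import mean, pstdev

# Accuracy of each of the five folds, copied from the table above.
fold_accuracy = [0.5498, 0.5609, 0.5739, 0.5826, 0.5609]

mean_accuracy = mean(fold_accuracy)
# Population standard deviation: divide by n rather than n - 1.
sd_accuracy = pstdev(fold_accuracy)

print(f"Mean: {mean_accuracy:.4f}")
print(f"SD:   {sd_accuracy:.4f}")
```

A small SD relative to the mean suggests the model behaves similarly across folds.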


In [7]:
print(svm)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=True, random_state=123, shrinking=True, tol=0.001,
    verbose=False)


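### 4. Model tuning
- For SVM, tuning determines the best kernel, gamma, C, and so on.

* Models are tuned with the tune_model function.
* PyCaret runs a random grid search over a predefined search space and returns the tuned model.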
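
To illustrate what a random search over a grid means, the sketch below builds a search space shaped like the custom grid used in this exercise, counts every combination, and then draws only a few at random. It uses the standard library only and does not train any model; the value of `n_iter` and the seed are assumptions chosen for illustration.

```python
import itertools
import random

# Candidate values per hyperparameter (powers of ten from 1e-3 to 1e3).
grid = {
    "C": [10.0 ** k for k in range(-3, 4)],
    "gamma": [10.0 ** k for k in range(-3, 4)],
    "kernel": ["linear", "poly", "rbf"],
    "degree": [2, 3],
}

# Every combination an exhaustive grid search would have to evaluate.
all_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Random search: evaluate only a fixed number of randomly drawn combinations.
n_iter = 10  # assumption for illustration
rng = random.Random(123)
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates))  # 7 * 7 * 3 * 2 combinations
for candidate in sampled[:3]:
    print(candidate)
```

Because only a sample of the combinations is tried, the result is the best setting among those drawn, not necessarily the best in the whole grid.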



### 4.1 SVM

In [10]:
import numpy as np

# Derive the best parameters with PyCaret's default search space
# tuned_svm = tune_model(svm)

# To define the search space yourself
params = {"C": np.logspace(-3, 3, num=7, base=10),
          "gamma": np.logspace(-3, 3, num=7, base=10),
          "kernel": ['linear', 'poly', 'rbf'],
          "degree": [2, 3]}

tuned_svm = tune_model(svm, custom_grid=params)


Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6147,0.7453,0.276,0.5779,0.5841,0.3605,0.3668
1,0.6087,0.758,0.2778,0.5756,0.5847,0.3619,0.3678
2,0.5957,0.7652,0.2813,0.5648,0.5771,0.3461,0.349
3,0.6087,0.7671,0.277,0.5848,0.5882,0.3565,0.3607
4,0.6391,0.7888,0.3182,0.6061,0.6211,0.4187,0.4206
Mean,0.6134,0.7649,0.286,0.5818,0.591,0.3688,0.373
SD,0.0143,0.0142,0.0161,0.0137,0.0155,0.0256,0.0247


In [11]:
print(tuned_svm)

SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma=0.01, kernel='poly',
    max_iter=-1, probability=True, random_state=123, shrinking=True, tol=0.001,
    verbose=False)
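
The tuned model switched from the RBF kernel to a polynomial kernel of degree 2. As a reminder of what these settings control, here is a minimal sketch of the two kernel functions as scikit-learn defines them, written with the standard library. The two feature vectors are made up and are not rows of the wine data.

```python
import math


def poly_kernel(x, y, gamma, coef0, degree):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two made-up feature vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]

# Settings of the tuned model: kernel='poly', degree=2, gamma=0.01, coef0=0.0
k_poly = poly_kernel(x, y, gamma=0.01, coef0=0.0, degree=2)

# Settings of the untuned model: kernel='rbf', gamma='auto', i.e. 1 / n_features
k_rbf = rbf_kernel(x, y, gamma=1 / len(x))

print(k_poly, k_rbf)
```

Note that `degree` only matters for the polynomial kernel, and `gamma` is not used by the linear kernel, so some combinations in a search grid are effectively duplicates.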


### 4. Evaluating model performance
* The performance of a trained machine learning model is evaluated and diagnosed with the evaluate_model function.

### 4.1 evaluate_model

In [12]:
# Evaluate the performance of the SVM model
evaluate_model(tuned_svm)


### 5. Evaluation on test data

In [13]:
# Predict and evaluate on the test (hold-out) data set aside by setup()
predict_model(tuned_svm)


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,SVM - Radial Kernel,0.6181,0.7903,0.2905,0.5887,0.6004,0.3792,0.3813


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Label,Score
0,8.2,1.000,0.09,2.3,0.065,7.0,37.0,0.99685,3.32,0.55,9.0,3,2,0.7878
1,11.1,0.310,0.53,2.2,0.060,3.0,10.0,0.99572,3.02,0.83,10.9,4,4,0.4476
2,7.6,0.715,0.00,2.1,0.068,30.0,35.0,0.99533,3.48,0.65,11.4,3,3,0.6824
3,7.8,0.390,0.42,2.0,0.086,9.0,21.0,0.99526,3.39,0.66,11.6,3,3,0.5690
4,7.7,0.965,0.10,2.1,0.112,11.0,22.0,0.99630,3.26,0.50,9.5,2,2,0.7917
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
283,7.0,0.690,0.07,2.5,0.091,15.0,21.0,0.99572,3.38,0.60,11.3,3,3,0.5403
284,8.4,0.390,0.10,1.7,0.075,6.0,25.0,0.99581,3.09,0.43,9.7,3,2,0.5693
285,8.2,0.260,0.34,2.5,0.073,16.0,47.0,0.99594,3.40,0.78,11.3,4,3,0.6775
286,8.8,0.270,0.39,2.0,0.100,20.0,27.0,0.99546,3.15,0.69,11.2,3,3,0.5666


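### 6. Training on the full dataset

* The `finalize_model()` function retrains the model with the best parameters on the full dataset, that is, the training data plus the validation (hold-out) data.
* The purpose of this function is to train the model on the complete dataset before deploying it.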

In [14]:
# Retrain the model on the full dataset
final_svm = finalize_model(tuned_svm)
print(final_svm)

SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma=0.01, kernel='poly',
    max_iter=-1, probability=True, random_state=123, shrinking=True, tol=0.001,
    verbose=False)


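### 7. Saving the model
* One way to generate predictions for unseen new data with a trained model is to call the predict_model function in the same notebook / IDE in which the model was trained.
* PyCaret's save_model function saves the entire pipeline, including the trained model.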

In [15]:
# Save the model
save_model(final_svm,'Final_SVM_Model_20220512')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='quality',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_stra...
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  SVC(C=100.0, break_ties=False, cache_size=200,
                      class

### 8. Loading the saved model

In [16]:
# Load the saved model
saved_final_svm = load_model('Final_SVM_Model_20220512')

Transformation Pipeline and Model Successfully Loaded


In [17]:
# Generate predictions for new data (dataset_new is the new data)

# Specify the new data
new_data = dataset_new

# Predict the dependent variable (y) for the new data
new_prediction_svm = predict_model(saved_final_svm, data=new_data)


In [18]:
# Compare the actual values (quality) with the predicted values (Label)
new_prediction_svm.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Label,Score
1077,8.6,0.37,0.65,6.4,0.08,3.0,8.0,0.99817,3.27,0.58,11.0,2,2,0.2826
794,10.1,0.27,0.54,2.3,0.065,7.0,26.0,0.99531,3.17,0.53,12.5,3,3,0.593
1416,10.0,0.32,0.59,2.2,0.077,3.0,15.0,0.9994,3.2,0.78,9.6,2,3,0.478
1353,7.6,0.645,0.03,1.9,0.086,14.0,57.0,0.9969,3.37,0.46,10.3,2,2,0.5258
1439,7.3,0.67,0.02,2.2,0.072,31.0,92.0,0.99566,3.32,0.68,11.066667,3,3,0.5119
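
The `quality` and `Label` columns can be compared directly to see how often the prediction matches the actual class. The sketch below does this for the five rows displayed above only, so the result is not the accuracy on the full new dataset.

```python
# (actual quality, predicted Label) pairs copied from the five rows shown above.
rows = [(2, 2), (3, 3), (2, 3), (2, 2), (3, 3)]

# Count the rows where the predicted label equals the actual class.
correct = sum(actual == predicted for actual, predicted in rows)
accuracy = correct / len(rows)

print(f"{correct} of {len(rows)} correct, accuracy = {accuracy:.2f}")
```

On the full DataFrame the same comparison would be a column-wise equality check between `quality` and `Label`.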
