# ML Workflow Optimization
## Pipeline: chaining estimators
- Pipeline can be used to chain multiple estimators into one.
- Pipeline serves two purposes:
  - Convenience and encapsulation
  - Joint parameter selection
- All estimators in a pipeline, except the last one, must be transformers. 
  - The last estimator may be any type (transformer, classifier, etc.)
- Training and prediction procedure of the pipeline
<br>
<img align="left" src="http://drive.google.com/uc?export=view&id=1pIde-P6d7EnjL3xYo8eE3cWAUEvzV7tS" >
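The procedure in the figure can be spelled out by hand. The following is a minimal sketch, assuming a `MinMaxScaler` followed by an `SVC` on the breast cancer data; the `_demo` variable names are hypothetical and chosen only to avoid clashing with the cells below.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, random_state=0)

# Manual version: fit_transform every transformer, then fit the last estimator
scaler_demo = MinMaxScaler()
svm_demo = SVC()
svm_demo.fit(scaler_demo.fit_transform(X_tr_demo), y_tr_demo)
# At prediction time only transform() is called on the transformers
manual_pred = svm_demo.predict(scaler_demo.transform(X_te_demo))

# Pipeline version: the same sequence behind a single fit/predict call
pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe_pred = pipe_demo.fit(X_tr_demo, y_tr_demo).predict(X_te_demo)

same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

Both routes apply the scaler fitted on the training data to the test data, which is the bookkeeping the pipeline takes over.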

In [1]:
import pandas as pd
import copy
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)   

### Building Pipelines

In [3]:
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

**A pipeline is built from a list of (key, value) pairs,** where the key is a string naming the step and the value is an estimator object.

In [4]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
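The step names are more than labels: they can be used to look up the individual estimators later. A small sketch, using a hypothetical `pipe_named` variable so the notebook's `pipe` is left untouched:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_named = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# .steps is the original list of (name, estimator) pairs
step_names = [name for name, _ in pipe_named.steps]

# Individual steps can be fetched by name
scaler_step = pipe_named.named_steps["scaler"]
svm_step = pipe_named["svm"]

print(step_names)
```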

- You only need to call `fit` and `predict` once on the data to fit the entire sequence of estimators.

In [5]:
pipe.fit(X_train, y_train).score(X_test, y_test)

0.972027972027972

### Using Pipelines in Grid-searches

In [6]:
from sklearn.model_selection import GridSearchCV # search for the best parameters

Parameters of the estimators in a pipeline must be specified using the **`<estimator>__<parameter>`** syntax.

In [7]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
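To see which names are valid in such a grid, one option is to list the pipeline's own parameters. The sketch below assumes the same scaler and SVM steps; `pipe_params` is a hypothetical variable name.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_params = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# get_params() exposes every nested parameter as <step name>__<parameter>
available = sorted(pipe_params.get_params().keys())
svm_keys = [k for k in available if k.startswith("svm__")]
print(svm_keys)

# The same names work with set_params, which is what GridSearchCV relies on
pipe_params.set_params(svm__C=10, svm__gamma=0.1)
```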

In [8]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(
    grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.98
Test set score: 0.97
Best parameters: {'svm__C': 1, 'svm__gamma': 1}


### Convenient Pipeline Creation with *make_pipeline*

In [9]:
from sklearn.pipeline import make_pipeline

In [10]:
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), 
                      ("svm", SVC(C=100))])
# pipe_long = Pipeline([('pca', PCA(n_components=3)),
#                       ('univ_select', SelectKBest(k=10))])

In [11]:
# abbreviated syntax (build the pipeline without naming the steps)
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100)) 

In [12]:
print("pipe_long steps:\n{}".format(pipe_long.steps))

pipe_long steps:
[('scaler', MinMaxScaler()), ('svm', SVC(C=100))]


**`make_pipeline`** does not require, and does not permit, naming the estimators. <br>
Instead, each step name is set automatically to the **lowercase** name of the estimator's class.

In [13]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('svc', SVC(C=100))]


In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), 
                     StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]
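The automatically generated names are used in the `<estimator>__<parameter>` syntax just like hand-written ones. A short sketch with a hypothetical `pipe_auto` variable: because names such as `standardscaler-2` contain a hyphen, they cannot be written as keyword arguments and are passed through a dictionary instead.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_auto = make_pipeline(StandardScaler(), PCA(n_components=2),
                          StandardScaler())

# Hyphenated step names are not valid Python identifiers, so unpack a dict
pipe_auto.set_params(**{"standardscaler-2__with_mean": False,
                        "pca__n_components": 3})

print(pipe_auto.named_steps["pca"])
```

The same dictionary keys could be used in a `param_grid` for `GridSearchCV`.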


### Combining Features with *FeatureUnion*

<img align='left' src='https://image.slidesharecdn.com/featureengineeringpipelines1-161106200348/95/feature-engineering-pipelines-11-638.jpg?cb=1478462927' width=600 height=400>
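A `FeatureUnion` fits each of its transformers on the same input and concatenates their outputs column-wise. The sketch below illustrates this on the breast cancer data with hypothetical `_demo` variable names; the transformers are the same kind used in the cells that follow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_union_demo, y_union_demo = load_breast_cancer(return_X_y=True)

union_demo = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("univ_select", SelectKBest(k=10)),
])

# y is passed along because SelectKBest is a supervised selector
combined_demo = union_demo.fit_transform(X_union_demo, y_union_demo)

# 3 PCA components + 10 selected features = 13 columns
print(X_union_demo.shape, combined_demo.shape)
```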

In [15]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA # dimensionality reduction
from sklearn.feature_selection import SelectKBest # preprocessor that keeps only the best-scoring features

In [16]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('univ_select', SelectKBest(k=10)))
feature_union = FeatureUnion(features)

# create pipeline
estimators = []
estimators.append(('features', feature_union))
estimators.append(('scaler', MinMaxScaler()))
estimators.append(("svm", SVC()))
pipe = Pipeline(estimators)

In [17]:
pipe.fit(X_train, y_train).score(X_test, y_test)

0.958041958041958

In [18]:
# Do grid search
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[9, 10, 11],
                  svm__C=[0.1, 1, 10],
                  svm__gamma=[0.1, 1, 10])
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
print(grid_search.fit(X_train, y_train).score(X_test, y_test))
print(grid_search.best_estimator_)

0.951048951048951
Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=1)),
                                                ('univ_select',
                                                 SelectKBest())])),
                ('scaler', MinMaxScaler()), ('svm', SVC(C=10, gamma=10))])
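The grid above reaches into the `FeatureUnion` by chaining names with double underscores: pipeline step, then union member, then parameter. A minimal sketch of how those names resolve, using a hypothetical `pipe_nested` variable:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_nested = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=3)),
        ("univ_select", SelectKBest(k=10)),
    ])),
    ("scaler", MinMaxScaler()),
    ("svm", SVC()),
])

# Parameters two levels deep: <step>__<union member>__<parameter>
nested_keys = [k for k in pipe_nested.get_params() if k.count("__") == 2]

pipe_nested.set_params(features__pca__n_components=2,
                       features__univ_select__k=9)
union_members = dict(pipe_nested.named_steps["features"].transformer_list)
print(union_members["pca"], union_members["univ_select"])
```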


In [19]:
import pandas as pd
import numpy as np

train = pd.read_csv('titanic_train.csv')
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


#### Passenger list data from the sinking of the Titanic

- Survived: survival => 0 = No, 1 = Yes
- Pclass: ticket class => 1 = 1st, 2 = 2nd, 3 = 3rd
- Sex: sex
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation => C = Cherbourg, Q = Queenstown, S = Southampton

In [20]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [21]:
train.iloc[:,2:]

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [22]:
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:,2:], train.Survived, random_state=0)

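#### Preprocessing plan
Split the features `into numeric and categorical groups`<sup>1)</sup> and preprocess each group differently.
- Numeric features: replace missing values with the median -> standardization
- Categorical features: `apply one-hot encoding to every categorical feature that has no missing values`<sup>2)</sup>


*<sup>1),2)</sup> These steps are not available as ready-made transformers among the scikit-learn tools used here, so custom transformers have to be written.*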
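For comparison, recent scikit-learn versions also provide `ColumnTransformer` and `OneHotEncoder`, which can express the same plan without custom classes. The sketch below is an alternative, not the approach this notebook takes; the toy DataFrame and all variable names in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# Numeric columns: median imputation, then standardization
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route each column group to its own preprocessing
preprocess = ColumnTransformer(
    [
        ("con", numeric_steps, ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)
```

With `handle_unknown="ignore"`, categories that appear only at prediction time are encoded as all zeros instead of raising an error.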

In [23]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer

(Reference) https://databuzz-team.github.io/2018/11/11/make_pipeline/

In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
# Inheriting from BaseEstimator gives you get_params and set_params for free
# Inheriting from TransformerMixin means you write fit and transform and get fit_transform for free
# (Reference) https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65

# Custom transformer for 1): select DataFrame columns by name
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.feature_names].values
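A quick way to check what the selector does is to run it on a tiny DataFrame. In the sketch below the class is repeated so the snippet runs on its own, and `toy_df` is hypothetical data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Same class as above, repeated for self-containment
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.feature_names].values

toy_df = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
    "Sex": ["male", "female"],
})

selector = DataFrameSelector(["Age", "Fare"])
selected = selector.fit_transform(toy_df)  # fit_transform comes from TransformerMixin
params = selector.get_params()             # get_params comes from BaseEstimator

print(selected.shape, params)
```

The result is a plain NumPy array holding only the requested columns, which is what the downstream scikit-learn steps receive.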

In [25]:
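# Custom transformer for 2): one-hot encode every categorical column
class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def _dummies(self, X_cat):
        X_cat_df = pd.DataFrame(X_cat, columns=range(X_cat.shape[1]))
        return pd.get_dummies(X_cat_df, columns=X_cat_df.columns, dtype=float)
    def fit(self, X_cat, y=None):
        # remember the dummy columns seen during fit
        self.columns_ = self._dummies(X_cat).columns
        return self
    def transform(self, X_cat):
        # align to the fitted columns so train and test get the same width
        X_onehot_df = self._dummies(X_cat)
        return X_onehot_df.reindex(columns=self.columns_, fill_value=0.0).values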
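One property of `pd.get_dummies` is worth keeping in mind when it is used inside a transformer: the columns it produces depend on the categories present in the data it is given. The sketch below uses hypothetical toy arrays to show two inputs yielding different widths, and one possible way to align them with `reindex`.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical arrays standing in for a train and a test split
train_cat = np.array([["male", "A"], ["female", "B"], ["female", "C"]], dtype=object)
test_cat = np.array([["male", "A"], ["male", "D"]], dtype=object)

def one_hot(X_cat):
    df = pd.DataFrame(X_cat, columns=range(X_cat.shape[1]))
    return pd.get_dummies(df, columns=df.columns)

train_onehot = one_hot(train_cat)
test_onehot = one_hot(test_cat)

# Different categories in each split lead to a different number of columns
print(train_onehot.shape, test_onehot.shape)

# Aligning to the training columns: unseen categories are dropped,
# categories absent from the test split are filled with 0
aligned = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)
print(aligned.shape)
```

A model trained on the first layout cannot consume the second one unless the columns are aligned this way or an encoder that remembers its categories is used.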

In [26]:
# split the features into numeric and categorical groups
con = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] # numeric features
cat = ['Sex', 'Ticket']  # categorical features

In [27]:
# column index positions
con_idx = [train.columns.get_loc(c) for c in train.columns if c in con]
cat_idx = [train.columns.get_loc(c) for c in train.columns if c in cat]

In [28]:
print(con_idx, cat_idx)

[2, 5, 6, 7, 9] [4, 8]


In [29]:
# preprocessing for the numeric features
con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con)),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# preprocessing for the categorical features
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat)),
    ('encoder', CustomLabelBinarizer()),
])

# combine the preprocessed numeric and categorical features
full_pipeline = FeatureUnion([
    ('con_pipeline', con_pipeline),
    ('cat_pipeline', cat_pipeline),
])

In [30]:
print("con_pipeline steps:\n{}".format(con_pipeline.steps))

con_pipeline steps:
[('selector', DataFrameSelector(feature_names=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])), ('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]


In [31]:
print("cat_pipeline steps:\n{}".format(cat_pipeline.steps))

cat_pipeline steps:
[('selector', DataFrameSelector(feature_names=['Sex', 'Ticket'])), ('encoder', CustomLabelBinarizer())]


In [32]:
full_pipeline

FeatureUnion(transformer_list=[('con_pipeline',
                                Pipeline(steps=[('selector',
                                                 DataFrameSelector(feature_names=['Pclass',
                                                                                  'Age',
                                                                                  'SibSp',
                                                                                  'Parch',
                                                                                  'Fare'])),
                                                ('imputer',
                                                 SimpleImputer(strategy='median')),
                                                ('scaler', StandardScaler())])),
                               ('cat_pipeline',
                                Pipeline(steps=[('selector',
                                                 DataFrameSelector(feature_names=['Sex',
            

In [33]:
X_train_prepared = full_pipeline.fit_transform(X_train)
X_test_prepared = full_pipeline.transform(X_test)