<font color="#CC3D3D"><p>
# ML Pipeline: Basics

## Pipeline: chaining estimators   
- Pipeline can be used to chain multiple estimators into one.
- Pipeline serves two purposes:
  - Convenience and encapsulation
  - Joint parameter selection
- All estimators in a pipeline, except the last one, must be transformers. 
  - The last estimator may be any type (transformer, classifier, etc.)
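- Training and prediction procedure of the pipeline (see the figure below): `fit` calls `fit_transform` on each transformer in turn and then `fit` on the last estimator; `predict` calls `transform` on each transformer and then `predict` on the last estimator.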

<img align="left" src="http://drive.google.com/uc?export=view&id=1pIde-P6d7EnjL3xYo8eE3cWAUEvzV7tS" >
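The procedure can be sketched by hand for a two-step case. This is a minimal illustration, assuming a `MinMaxScaler` followed by an `SVC`; the names `pipe_demo`, `manual_pred`, and `pipe_pred` are made up for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: fit the transformer on the training data, then fit the model.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
svm = SVC()
svm.fit(X_train_scaled, y_train)
# At prediction time the transformer is only applied (transform), never refit.
manual_pred = svm.predict(scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict call.
pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe_demo.fit(X_train, y_train)
pipe_pred = pipe_demo.predict(X_test)

# Both routes should produce identical predictions.
same = np.array_equal(manual_pred, pipe_pred)
print(same)
```

The pipeline saves the bookkeeping of calling `transform` on the test data with the scaler that was fitted on the training data.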

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectPercentile
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# When training, use sampling; when testing, split the data and apply the model -> concat, ensemble
# Imbalanced data -> stratified sampling
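For imbalanced data, one common way to keep class proportions the same in the training and test sets is the `stratify` argument of `train_test_split`. The sketch below is only an illustration on the breast cancer data; the variable names are hypothetical and the notebook itself splits without stratification.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks the splitter to preserve the class ratio of y in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fraction of the positive class in the full data and in each part.
full_ratio = y.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(round(full_ratio, 3), round(train_ratio, 3), round(test_ratio, 3))
```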

### Building Pipelines

In [2]:
# load and split the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

<br><font color = "blue">
The **Pipeline** is built using a list of **(key, value)** pairs, where the **key** is a string containing the name you want to give this step and **value** is an estimator object:

In [3]:
# Build Pipelines
from sklearn.pipeline import Pipeline

pipe = Pipeline([('scaler', MinMaxScaler()), ('selector', SelectPercentile()), ("svm", SVC())])

# The model must come at the end of the pipeline (a pipeline may also end without a model)
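A pipeline that ends without a model behaves like a single composite transformer. A minimal sketch, assuming a scaler followed by a feature selector; `prep` and `X_reduced` are names invented for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Every step is a transformer, so the pipeline exposes fit_transform/transform.
prep = Pipeline([
    ("scaler", MinMaxScaler()),
    ("selector", SelectPercentile(percentile=50)),
])

# y is passed through because SelectPercentile scores features against the target.
X_reduced = prep.fit_transform(X, y)
print(X.shape, X_reduced.shape)
```

Such a preprocessing-only pipeline can itself be used as a step inside a larger pipeline.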

<br><font color = "blue">
You only need to call **fit** once on the training data to fit the whole sequence of estimators, and **predict** (or **score**) once to apply it.

In [4]:
pipe.fit(X_train, y_train).score(X_test, y_test)

0.9440559440559441

### Using Pipelines in Grid-searches

<br><font color = "blue">
Parameters of the estimators in the pipeline should be specified using the **estimator__parameter** syntax: the step name, two underscores, then the parameter name.

In [5]:
param_grid = {
    'selector__percentile': range(10, 100, 10),
    'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]
}
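To check which `step__parameter` names are valid for a grid, the pipeline's `get_params()` can be listed, and `set_params()` accepts the same names. This is a small sketch using the same three steps as above; `pipe_demo` and `valid_names` are illustrative names.

```python
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_demo = Pipeline([
    ("scaler", MinMaxScaler()),
    ("selector", SelectPercentile()),
    ("svm", SVC()),
])

# Keys containing "__" are the nested parameters that a grid search can tune.
valid_names = sorted(k for k in pipe_demo.get_params() if "__" in k)
print(valid_names)

# The same naming scheme works for setting parameters directly.
pipe_demo.set_params(svm__C=10, selector__percentile=50)
print(pipe_demo.named_steps["svm"].C)
```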

<font color='green'></p> 
##### C parameter: hard margin vs. soft margin
<img align='left' src='https://i.stack.imgur.com/GbW5S.png' style="width: 60%; height: auto;">

<font color='green'></p> 
##### Kernel Trick
<img align='left' src='https://t1.daumcdn.net/cfile/tistory/9989503359C62ECF0A' style="width: 70%; height: auto;">

<font color='green'></p> 
##### Gamma parameter   
<img src='https://t1.daumcdn.net/cfile/tistory/992DEB3359EACB9301' style="width: 40%; height: auto;">
<img align='left' src='https://cdn-images-1.medium.com/max/1600/1*r9CO-gp1uuRsYooCLL9UeQ.png' style="width: 70%; height: auto;">

In [6]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))
print(grid.score(X_test, y_test))

Best cross-validation accuracy: 0.98
Test set score: 0.98
Best parameters: {'selector__percentile': 70, 'svm__C': 100, 'svm__gamma': 0.01}
0.9790209790209791
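After the search, the refitted pipeline is available as `best_estimator_`, and its steps can be inspected through `named_steps`. The sketch below uses a deliberately small grid so that it runs quickly; the grid values and the names `grid_demo`, `best_pipe`, and `n_selected` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe_demo = Pipeline([
    ("scaler", MinMaxScaler()),
    ("selector", SelectPercentile()),
    ("svm", SVC()),
])

# Small illustrative grid: 2 x 2 candidates, 5-fold cross-validation.
small_grid = {"selector__percentile": [30, 70], "svm__C": [1, 100]}
grid_demo = GridSearchCV(pipe_demo, param_grid=small_grid, cv=5)
grid_demo.fit(X_train, y_train)

# best_estimator_ is the whole pipeline refitted on all of X_train.
best_pipe = grid_demo.best_estimator_
mask = best_pipe.named_steps["selector"].get_support()
n_selected = int(mask.sum())
print(grid_demo.best_params_, n_selected)
```

Because the scaler and selector sit inside the pipeline, they are refitted on the training folds of each cross-validation split, so the validation folds do not influence preprocessing.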


### Convenient Pipeline creation with *make_pipeline* 

In [7]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), 
                      ('selector', SelectPercentile(percentile=70)),
                      ("svm", SVC(C=100,gamma=0.01))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SelectPercentile(percentile=70), SVC(C=100,gamma=0.01))

# make_pipeline: if you do not name the steps, it assigns names automatically

In [8]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('selectpercentile', SelectPercentile(percentile=70)), ('svc', SVC(C=100, gamma=0.01))]


<br><font color = "blue">
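**make_pipeline** does not require, and does not permit, naming the estimators. Instead, their names are set automatically to the **lowercase of their types**. When the same type appears more than once, a numeric suffix is appended (for example, `standardscaler-1`).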

In [9]:
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]
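The automatically generated names are what `named_steps` and the `step__parameter` syntax expect. A brief sketch, assuming the same three steps as in the cell above; `pipe_demo`, `step_names`, and `components_shape` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)

# The generated name "pca" is used to reach into that step's parameters.
pipe_demo.set_params(pca__n_components=3)

# All steps are transformers here, so fitting needs no target.
pipe_demo.fit(X)
components_shape = pipe_demo.named_steps["pca"].components_.shape
print(components_shape)
```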


# End