In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Scaled data

* For many machine learning algorithms, the representation of the data matters a great deal.

* As a result, most machine learning applications require not just a single algorithm, but a chain of several preprocessing steps and machine learning models.



In [2]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# load and split the data
cancer = load_breast_cancer()

In [3]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
# next: compute the per-feature mean and standard deviation on the training data


In [4]:
scaler = StandardScaler().fit(X_train)
# rescale the training data
X_train_scaled = scaler.transform(X_train)

In [5]:
svm = SVC()
# learn an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scale the test data and score the scaled data
X_test_scaled = scaler.transform(X_test)

In [6]:
print("Test score: {:.2f}".format(svm.score(X_test_scaled, y_test)))

Test score: 0.97


### Grid search over SVC parameters on the scaled data

In [7]:
from sklearn.model_selection import GridSearchCV
# for illustration purposes only, don't use this code!
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test_scaled, y_test)))
print("Best parameters: ", grid.best_params_)

Best cross-validation accuracy: 0.99
Test set score: 0.98
Best parameters:  {'C': 10, 'gamma': 0.01}


* Here we searched over the SVC parameters using the scaled data.

* However, there is a subtle catch in what we just did.

* When scaling the data, we used all of the data in the training set.

* As a consequence, the part of the training set held out for validation in each cross-validation split has already influenced the scaler, so information from the validation fold leaks into model fitting.

* This is fundamentally different from the real situation.

* When we observe new data (for example, our test set), that data is not used to scale the training data, and it may have a different mean and standard deviation than the training data.
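
To make the problem concrete, here is a minimal sketch of leak-free cross-validation done by hand: the scaler is fit on the training part of each split only and then applied to the held-out fold. The choice of `StratifiedKFold` with 5 splits and names such as `fold_scaler` and `fold_scores` are assumptions made for illustration, not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

fold_scores = []
# assumption: 5 stratified folds, chosen to mirror cv=5 for a classifier
splitter = StratifiedKFold(n_splits=5)
for train_idx, valid_idx in splitter.split(X_train, y_train):
    # fit the scaler on the training part of this split only
    fold_scaler = StandardScaler().fit(X_train[train_idx])
    X_fold_train = fold_scaler.transform(X_train[train_idx])
    # the held-out fold is scaled with statistics it did not contribute to
    X_fold_valid = fold_scaler.transform(X_train[valid_idx])
    model = SVC().fit(X_fold_train, y_train[train_idx])
    fold_scores.append(model.score(X_fold_valid, y_train[valid_idx]))

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy: {:.3f}".format(np.mean(fold_scores)))
```

Writing this loop by hand for every model and every parameter combination quickly becomes tedious, which is the motivation for bundling the steps into one estimator.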


## Pipeline

* To handle this correctly in scikit-learn when using the **cross_val_score** function and **GridSearchCV**, we can use the **Pipeline** class.
* The Pipeline class lets us "glue" together multiple processing steps into a single scikit-learn estimator.


In [9]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()), 
    ("svm", SVC())
])

In [10]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [11]:
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Test score: 0.97


* Calling the score method on the **Pipeline** object first transforms the test data using the scaler and then calls the score method on the SVM using the scaled test data.

* As you can see, the result is identical to the one we got at the beginning, when we did the transformations by hand.

* Using **Pipeline**, we reduced the code needed for our "preprocessing + classification" process.

* The main benefit of using **Pipeline** is that we can now use this single estimator in **cross_val_score** or **GridSearchCV**.
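
As a small illustration of the **cross_val_score** case, the sketch below passes the whole pipeline as the estimator, so the scaler and the SVM are refit together on the training part of every split. The data split repeats the one used earlier; the variable name `cv_scores` is an assumption for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# each of the 5 splits clones the pipeline and fits scaler + SVM
# on the training part of that split only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Cross-validation scores:", cv_scores.round(3))
print("Mean cross-validation accuracy: {:.3f}".format(cv_scores.mean()))
```

Note that the unscaled `X_train` is passed in; the scaling happens inside the pipeline.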

In [12]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

In [13]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.99
Test set score: 0.98
Best parameters: {'svm__C': 10, 'svm__gamma': 0.01}


In contrast to the grid search we ran earlier, now for each split in the cross-validation the **StandardScaler** is fit only on the training part of the split, so there is no information leakage.

# Standard and abbreviated syntax (make_pipeline)

In [20]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", StandardScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(StandardScaler(), SVC(C=100))

In [22]:
print("Pipeline steps:\n{}".format(pipe_long.steps))

Pipeline steps:
[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]


In [21]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]
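
The automatically generated step names matter because they become the prefixes used to address parameters, in the form `<step name>__<parameter name>`. The sketch below, which reuses `pipe_short` from above, lists a few of these keys and changes one with `set_params`; the helper names `step_names` and `svc_params` are assumptions for this example.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe_short = make_pipeline(StandardScaler(), SVC(C=100))

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe_short.steps]
print("Step names:", step_names)

# parameters of a step are exposed as "<step name>__<parameter name>"
svc_params = sorted(k for k in pipe_short.get_params() if k.startswith("svc__"))
print("Some tunable keys:", svc_params[:3])

# the same key syntax is what a param_grid for GridSearchCV uses
pipe_short.set_params(svc__C=10)
print("C after set_params:", pipe_short.named_steps["svc"].C)
```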


# We can run GridSearchCV on the pipeline

In [24]:
from sklearn.linear_model import LogisticRegression
pipe = make_pipeline(StandardScaler(), LogisticRegression())

param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [25]:
print("Best estimator:\n{}".format(grid.best_estimator_))

Best estimator:
Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])


In [26]:
print("Logistic regression step:\n{}".format(grid.best_estimator_.named_steps["logisticregression"]))

Logistic regression step:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [28]:
print("Logistic regression coefficients:\n{}".format(grid.best_estimator_.named_steps["logisticregression"].coef_))

Logistic regression coefficients:
[[-0.38856355 -0.37529972 -0.37624793 -0.39649439 -0.11519359  0.01709608
  -0.3550729  -0.38995414 -0.05780518  0.20879795 -0.49487753 -0.0036321
  -0.37122718 -0.38337777 -0.04488715  0.19752816  0.00424822 -0.04857196
   0.21023226  0.22444999 -0.54669761 -0.52542026 -0.49881157 -0.51451071
  -0.39256847 -0.12293451 -0.38827425 -0.4169485  -0.32533663 -0.13926972]]
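
The raw coefficient array is hard to read on its own. One possible way to inspect it, sketched below, is to pair each coefficient with its feature name from `cancer.feature_names` and sort by absolute value. The `max_iter=1000` setting is an assumption added to avoid convergence warnings with newer default solvers, and `coefs` and `ordered` are illustrative names; the exact values may differ from the output above depending on the scikit-learn version.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

# assumption: max_iter raised so the solver converges without warnings
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# pull the fitted classifier out of the best pipeline
logreg = grid.best_estimator_.named_steps["logisticregression"]
# one coefficient per input feature (binary problem, so a single row)
coefs = pd.Series(logreg.coef_[0], index=cancer.feature_names)
# order features by the magnitude of their coefficient
ordered = coefs.loc[coefs.abs().sort_values(ascending=False).index]
print(ordered.head(5))
```

Because the features were standardized inside the pipeline, the coefficient magnitudes are on a comparable scale across features.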


# We can also grid-search over which model and preprocessing step to use

In [50]:
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

param_grid = [
    # first grid: SVC, with and without scaling
    {'classifier': [SVC()],
     'preprocessing': [StandardScaler(), None],
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    # second grid: random forest, no preprocessing
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None],
     'classifier__max_features': [1, 2, 3]}
]

In [51]:
grid.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__standardscaler', 'estimator__logisticregression', 'estimator__standardscaler__copy', 'estimator__standardscaler__with_mean', 'estimator__standardscaler__with_std', 'estimator__logisticregression__C', 'estimator__logisticregression__class_weight', 'estimator__logisticregression__dual', 'estimator__logisticregression__fit_intercept', 'estimator__logisticregression__intercept_scaling', 'estimator__logisticregression__max_iter', 'estimator__logisticregression__multi_class', 'estimator__logisticregression__n_jobs', 'estimator__logisticregression__penalty', 'estimator__logisticregression__random_state', 'estimator__logisticregression__solver', 'estimator__logisticregression__tol', 'estimator__logisticregression__verbose', 'estimator__logisticregression__warm_start', 'estimator', 'fit_params', 'iid', 'n_jobs', 'param_grid', 'pre_dispatch', 'refit', 'return_train_score', 'scoring', 'verbose'])

In [53]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classifier', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'classifier': [SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)], 'preprocessing': [StandardScaler(copy=True, with...=0,
            warm_start=False)], 'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [54]:
print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

Best params:
{'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'classifier__C': 10, 'classifier__gamma': 0.01, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True)}

Best cross-validation score: 0.99
Test-set score: 0.98
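
Beyond the single best candidate, it can be useful to compare the model families against each other. The sketch below shows one way to do that with `cv_results_` and pandas. It uses a deliberately reduced grid so it runs quickly, and the `random_state=0` on the forest, the `model` column, and the name `best_per_model` are assumptions for this example, so its numbers are not expected to match the output above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

pipe = Pipeline([("preprocessing", StandardScaler()), ("classifier", SVC())])

# assumption: a reduced grid so the sketch finishes in a few seconds
param_grid = [
    {"classifier": [SVC()],
     "preprocessing": [StandardScaler(), None],
     "classifier__gamma": [0.01, 0.1],
     "classifier__C": [1, 10]},
    {"classifier": [RandomForestClassifier(n_estimators=100, random_state=0)],
     "preprocessing": [None],
     "classifier__max_features": [1, 2, 3]},
]
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

results = pd.DataFrame(grid.cv_results_)
# label each candidate by the class of the classifier that was tried
results["model"] = [type(p["classifier"]).__name__
                    for p in grid.cv_results_["params"]]
# best mean cross-validation accuracy reached by each model family
best_per_model = results.groupby("model")["mean_test_score"].max()
print(best_per_model)
```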
