# ML Seminar 7

Data Preprocessing, Bayesian Optimization for ML and beyond

## Credit approval dataset
<center>
Build a [credit approval](http://archive.ics.uci.edu/ml/datasets/Credit+Approval) classifier!

<img src="misc/credit.svg" alt="Drawing" style="width: 800px;"/>

Requires some data preprocessing!
</center>

## Making data preprocessing a pipeline

* Compact code

* Reusable in other projects

* Tune the whole pipeline with `GridSearchCV` or similar tools

In [2]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_union, make_pipeline

# class that selects a single column
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, index):
        self.index = index
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[:, [self.index]]

# class that encodes the column 
class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None
    
    def fit(self, X, y=None):
        self.model = LabelBinarizer()
        self.model.fit(X[:, 0])
        return self
    
    def transform(self, X, y=None):
        return self.model.transform(X[:, 0])
        
    
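# read the file as csv; header=None is not passed here, so pandas
# treats the first row of the file as column names
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
Xy = ps.read_csv('data/credit-screening.csv').to_numpy()
X = Xy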

cs = ColumnSelector(4)
X = cs.fit_transform(X)
hot = OneHotEncoder()
X = hot.fit_transform(X)
print(X)

# this joins multiple extracted features into one feature set
features = make_union(
    ColumnSelector(1),
    make_pipeline(ColumnSelector(4), OneHotEncoder())
)

print(features.fit_transform(Xy))

[[0 1 0 0]
 [0 1 0 0]
 [0 1 0 0]
 ..., 
 [0 0 0 1]
 [0 1 0 0]
 [0 1 0 0]]
[['58.67' 0 1 0 0]
 ['24.50' 0 1 0 0]
 ['27.83' 0 1 0 0]
 ..., 
 ['25.25' 0 0 0 1]
 ['17.92' 0 1 0 0]
 ['35.00' 0 1 0 0]]
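
For comparison, the same "pass one numeric column through, one-hot encode one categorical column" step can be sketched with scikit-learn's built-in `ColumnTransformer` instead of the hand-written `ColumnSelector` and `OneHotEncoder` classes. This is only an illustration on a tiny made-up table: the column names `age` and `status` and the four rows are assumptions, not the layout of `credit-screening.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    'age': [58.67, 24.50, 27.83, 25.25],
    'status': ['g', 'g', 'p', 'gg'],
})

# each entry is (name, transformer, columns); 'passthrough' keeps a column as is
features_ct = ColumnTransformer(
    [
        ('age', 'passthrough', ['age']),
        ('status', OneHotEncoder(handle_unknown='ignore'), ['status']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

toy_features = features_ct.fit_transform(toy)
print(toy_features)
```

The result has one column for `age` followed by one indicator column per category of `status`, which mirrors what `make_union` produces above.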


## A general pipeline

Represent data preprocessing, data scaling and model fitting as a single pipeline.

In [3]:
import pandas as ps
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
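from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement, aliased so the code below still works
from sklearn.impute import SimpleImputer as Imputer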
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union

# class that selects a single column
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[:, [self.index]]


# class that encodes the column
class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None

    def fit(self, X, y=None):
        self.model = LabelBinarizer()
        self.model.fit(X[:, 0])
        return self

    def transform(self, X, y=None):
        return self.model.transform(X[:, 0])

# read the csv file
data = ps.read_csv('data/credit-screening.csv', header=None)
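# replace the missing-value marker '?' with the string 'NaN'
data = data.replace('?', 'NaN')
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
data = data.to_numpy()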

# split data 
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# helper functions
category = lambda idx: make_pipeline(ColumnSelector(idx), OneHotEncoder())
number = lambda idx: make_pipeline(ColumnSelector(idx), Imputer())

# feature extraction pipeline
features = make_union(
    category(0),
    number(1),
    number(2),
    category(3),
    category(4),
    category(5),
    category(6),
    number(7),
    category(8),
    category(9),
    number(10),
    category(11),
    category(12),
    number(13),
    number(14)
)

# estimator pipeline
model = Pipeline([
    ('features', features),
    ('scaler', StandardScaler()),
    ('est', SVC())
])

# model parameter grid definition
svc_grid = {
    'est': [SVC()],
    'est__C': [0.1, 1.0, 10.0],
    'est__gamma': [0.1, 1.0],
}

# grid search class
gsearch = GridSearchCV(
    estimator=model,
    param_grid=[svc_grid],
    verbose=1,
    cv=3,
    n_jobs=-1
)

# fit all the transformers and estimators in the pipeline
gsearch.fit(X_train, y_train)
print(gsearch.best_params_)
print(gsearch.score(X_test, y_test))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Done   3 out of  18 | elapsed:    0.2s remaining:    0.9s


{'est__C': 1.0, 'est': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'est__gamma': 0.1}
0.7976878612716763


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:    0.4s finished


## Assessing the quality of the classification model

Is roughly 80% test accuracy (the 0.798 score printed above) a good value?

Use `DummyClassifier` for comparison.
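
A minimal sketch of such a baseline is shown below. It runs on synthetic data so that it is self-contained; the arrays `X_demo` and `y_demo` and the roughly 55/45 class split are assumptions made for illustration. In the notebook, the same `DummyClassifier` would be fitted on `X_train, y_train` and scored on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 200 samples, two classes labelled '-' and '+'
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(rng.uniform(size=200) < 0.55, '-', '+')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)

baseline_accuracy = baseline.score(X_te, y_te)
print(baseline_accuracy)
```

A real model is only useful if it clearly beats this number: on a dataset where one class makes up most of the samples, a high accuracy can come from predicting that class alone.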

## A general pipeline

The code is already quite general and can be fairly easily adjusted to other datasets. 

*Task:* Adapt the pipeline to `titanic.csv` in the `data` folder.

## Brute force through high dimensions

Using [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize)!

An idea of how this works is [shown here](https://github.com/scikit-optimize/scikit-optimize/blob/master/examples/bayesian-optimization.ipynb).
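
The loop behind Bayesian optimization can also be sketched in a few lines. The code below is a simplified illustration and not scikit-optimize's implementation: it uses scikit-learn's `GaussianProcessRegressor` as the surrogate model and expected improvement as the acquisition function. The toy `objective` function, the search interval `[-2, 2]` and the candidate grid are assumptions standing in for an expensive cross-validation loss.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # hypothetical 1-D function to minimize (stand-in for a validation loss)
    return np.sin(3.0 * x) + 0.5 * x ** 2


def expected_improvement(candidates, gp, best_y):
    # expected amount by which each candidate improves on the best value so far
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.RandomState(0)
candidates = np.linspace(-2.0, 2.0, 401).reshape(-1, 1)

# start from a few random evaluations
X_seen = rng.uniform(-2.0, 2.0, size=(3, 1))
y_seen = objective(X_seen).ravel()

for _ in range(15):
    # 1. fit the surrogate model to all evaluations made so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_seen, y_seen)
    # 2. pick the candidate with the highest expected improvement
    ei = expected_improvement(candidates, gp, y_seen.min())
    x_next = candidates[np.argmax(ei)].reshape(1, 1)
    # 3. evaluate the true objective there and store the result
    X_seen = np.vstack([X_seen, x_next])
    y_seen = np.append(y_seen, objective(x_next).ravel())

best_x = X_seen[np.argmin(y_seen), 0]
best_y = y_seen.min()
print(best_x, best_y)
```

Each iteration spends one evaluation of the objective where the surrogate model expects the largest gain, which is why the approach needs far fewer evaluations than a dense grid when each evaluation is expensive.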

In [3]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Import necessary functionality from scikit-optimize!
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

import numpy as np

# read the file as csv
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)
    
# create a model class instance
estimator = make_pipeline(
    StandardScaler(),
    SVR(),
)

# Bayesian optimization class, which can be used instead of GridSearchCV
model = BayesSearchCV(
    estimator=estimator,
    search_spaces={
        "svr__C": Real(1e-6, 1e+3, 'log-uniform'),
        "svr__gamma": Real(1e-3, 1e+3, 'log-uniform'),
        "svr__kernel": Categorical(['rbf']),
        # note: degree only affects the 'poly' kernel, so it has no effect with 'rbf'
        "svr__degree": Integer(1, 2),
    },
    verbose=1,
    n_jobs=8,
)

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the held-out test data (R^2 score for SVR)
print(model.score(X_test, y_test))

# make predictions as usual
yp = model.predict(X_test)

# show (true value, predicted value) pairs for the first ten test samples
print("Example estimations")
print(list(zip(y_test[:10], yp[:10])))

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.2s remaining:    0.4s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    0.6s finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.2s remaining:    0.4s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    0.5s finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.3s remaining:    0.4s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    2.4s finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.3s remaining:    0.4s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    0.5s finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    0.6s finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=8)]: Done   9 out of  24 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=8)]: Done  24 out of  24 | elapsed:    0.5s finished


Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=8)]: Done   6 out of   6 | elapsed:    0.1s finished


0.376378052406
Example estimations
[(6.0, 5.2458061678985075), (5.0, 5.1795120012465548), (7.0, 7.089794322406445), (6.0, 4.8243386318352313), (5.0, 5.9433703014307051), (6.0, 5.1525783339156055), (5.0, 5.1025421328279013), (6.0, 5.9003136717746401), (4.0, 4.9975256227072622), (5.0, 5.054024756726526)]
