# ML Seminar 6

Grid Search, Data Preprocessing and Bayesian Optimization for ML

## The usual goal reminder
<center>
Build a [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) detector!

<img src="misc/wine.svg" alt="Drawing" style="width: 800px;"/>

We are going to finish this today (really).
</center>

## Where to look up the parameter ranges?

Published benchmarks are a good starting point:
* Model comparison: https://arxiv.org/pdf/1708.05070.pdf
* Corresponding parameter ranges: https://github.com/rhiever/sklearn-benchmarks/tree/master/model_code/grid_search
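
Ranges in references like these are usually spaced on a logarithmic scale. The sketch below shows how such a grid could be written down for an SVR inside a pipeline and how many candidates it produces. The concrete bounds are illustrative assumptions, not values copied from the linked repository.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Illustrative log-spaced ranges for an RBF-kernel SVR.
# These bounds are assumptions, not taken from the linked benchmark code.
param_grid = {
    "svr__C": np.logspace(-2, 2, 5).tolist(),      # 0.01, 0.1, 1, 10, 100
    "svr__gamma": np.logspace(-3, 1, 5).tolist(),  # 0.001, 0.01, 0.1, 1, 10
}

# ParameterGrid enumerates every combination a grid search would try
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # prints 25
```

The number of fits grows multiplicatively with every added parameter, so it is usually worth starting with a coarse grid and refining it around the best values.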



## Making your own transformer

* Often useful for dedicated feature extraction

*Task*: Write your own transformer that subtracts the column median from each column and then divides the column by its standard deviation. Use it in place of the normalization routine used so far.

In [1]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# base classes: BaseEstimator provides get_params / set_params,
# TransformerMixin provides fit_transform
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# read the file as csv
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

class MedianScaler(BaseEstimator, TransformerMixin):
    """
    Subtract the column median from every column of the
    dataset matrix, then divide every column by its
    standard deviation.
    
    Attributes
    ----------
    
    median_: numpy array of shape (n_features,)
        per-column median of the training data
    
    std_: numpy array of shape (n_features,)
        per-column standard deviation of the training data
    """
    def __init__(self):
        self.median_ = None
        self.std_ = None
    
    def fit(self, X, y=None):
        """
        Compute the per-column median and standard deviation of X.

        Parameters
        ----------
        
        X: array-like of shape (n_samples, n_features)
            Training data.
        """
        self.median_ = np.median(X, axis=0)
        X = X - self.median_
        self.std_ = np.std(X, axis=0)
        
        # important: fit must return self, otherwise fit_transform
        # and pipelines cannot chain fit(...).transform(...)
        return self
    
    def transform(self, X, y=None):
        """
        Center X with the fitted medians and scale it
        by the fitted standard deviations.
        """
        X = X - self.median_
        X = X / self.std_
        return X    
    
# create a model class instance
estimator = make_pipeline(
    MedianScaler(),
    SVR(),
)

# create an instance of a grid search class
model = GridSearchCV(
    estimator=estimator,
    param_grid={
        "svr__C": [0.1, 1.0, 10.0],
        "svr__gamma": [0.1, 1.0, 10.0],
    },
    verbose=1,
    n_jobs=8,
)

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the held-out test set (for a regressor the score is R^2)
print(model.score(X_test, y_test))

# make estimations as usual
yp = model.predict(X_test)

print("Example estimations")
print(list(zip(y_test[:10], yp[:10])))

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=8)]: Done  12 out of  27 | elapsed:    0.3s remaining:    0.4s
[Parallel(n_jobs=8)]: Done  27 out of  27 | elapsed:    0.7s finished


0.374776767158
Example estimations
[(6.0, 5.1192985837032978), (5.0, 5.1679546993499654), (7.0, 7.0942408979956681), (6.0, 4.8421969242225291), (5.0, 6.0050072789140261), (6.0, 5.2082930080415846), (5.0, 5.0528333146304876), (6.0, 5.9496966127585189), (4.0, 5.0921054035949984), (5.0, 5.0883908910564912)]
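
As a quick sanity check of the arithmetic performed by `fit` and `transform`, the same two steps can be applied by hand to a small matrix. This is a minimal sketch on made-up numbers (the toy matrix is an assumption, not part of the wine data): after scaling, every column should have median 0 and standard deviation 1.

```python
import numpy as np

# Toy data: the first column contains an outlier
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])

# Same statistics as in MedianScaler.fit
median = np.median(X, axis=0)
std = np.std(X, axis=0)

# Same arithmetic as in MedianScaler.transform
X_scaled = (X - median) / std

print(np.median(X_scaled, axis=0))  # medians are 0 up to rounding
print(np.std(X_scaled, axis=0))     # standard deviations are 1 up to rounding
```

Unlike mean centering, the median is not pulled towards the outlier in the first column, although the standard deviation used for scaling still is.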


## Automated feature processing

* Select a categorical column

* Transform it into numerical columns, one per category (one-hot encoding)

<center>
<img src="misc/onehot.svg" alt="Drawing" style="width: 400px;"/>
</center>
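
The encoding step can be tried in isolation on a handful of values. The sketch below uses scikit-learn's `LabelBinarizer` on a made-up list of colours (the data is an assumption chosen for illustration).

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up categorical column
colors = ["red", "green", "blue", "green"]

encoder = LabelBinarizer()
onehot = encoder.fit_transform(colors)

# Categories are stored in sorted order; each one becomes a column
print(encoder.classes_)  # ['blue' 'green' 'red']
print(onehot)
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]]
```

One caveat: when a column has only two distinct values, `LabelBinarizer` returns a single 0/1 column rather than two columns.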

## Credit approval dataset
<center>
Build a [credit approval](http://archive.ics.uci.edu/ml/datasets/Credit+Approval) classifier!

<img src="misc/credit.svg" alt="Drawing" style="width: 800px;"/>

Requires some data preprocessing!
</center>

In [2]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_union, make_pipeline

# class that selects a single column and keeps it 2-D, shape (n_samples, 1)
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, index):
        self.index = index
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[:, [self.index]]

# class that one-hot encodes a single-column array (wraps LabelBinarizer)
class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None
    
    def fit(self, X, y=None):
        self.model = LabelBinarizer()
        self.model.fit(X[:, 0])
        return self
    
    def transform(self, X, y=None):
        return self.model.transform(X[:, 0])
        
    
# read the file as csv
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
Xy = ps.read_csv('data/credit-screening.csv').to_numpy()
X = Xy

cs = ColumnSelector(4)
X = cs.fit_transform(X)
hot = OneHotEncoder()
X = hot.fit_transform(X)
print(X)

# this joins multiple extracted features into one feature set
features = make_union(
    ColumnSelector(1),
    make_pipeline(ColumnSelector(4), OneHotEncoder())
)

print(features.fit_transform(Xy))

[[0 1 0 0]
 [0 1 0 0]
 [0 1 0 0]
 ..., 
 [0 0 0 1]
 [0 1 0 0]
 [0 1 0 0]]
[['58.67' 0 1 0 0]
 ['24.50' 0 1 0 0]
 ['27.83' 0 1 0 0]
 ..., 
 ['25.25' 0 0 0 1]
 ['17.92' 0 1 0 0]
 ['35.00' 0 1 0 0]]
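
Note that the first column in the joined output is still text (`'58.67'`), so it would need converting to numbers before a model can use it. Recent scikit-learn versions (0.20 and later) also ship a built-in alternative to the hand-written selector and encoder classes: `ColumnTransformer` together with `sklearn.preprocessing.OneHotEncoder`. The sketch below is one possible way to use them, shown on a toy data frame because the column names `age` and `marital` are hypothetical stand-ins and not the real attribute names of the credit dataset.

```python
import pandas as ps
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the credit data; column names are hypothetical
df = ps.DataFrame({
    "age": [58.67, 24.50, 27.83, 25.25],
    "marital": ["u", "u", "y", "l"],
})

features = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["age"]),      # keep the numeric column as is
        ("cat", OneHotEncoder(), ["marital"]),  # one column per category
    ],
    sparse_threshold=0.0,  # always return a dense array
)

Xt = features.fit_transform(df)
print(Xt.shape)  # (4, 4): one numeric column plus three category columns
print(Xt[0])
```

The result is a plain float array, so it can be passed directly to an estimator or placed as the first step of a pipeline.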
