# ML Seminar 5

Grid Search and beyond in sklearn

## The usual goal reminder
<center>
Build a [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) detector!

<img src="misc/wine.svg" alt="Drawing" style="width: 800px;"/>

We are going to finish this today.
</center>

## Cross-validation

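Cross-validation is the next step after a single validation set:

The data is split into folds, and each fold is used in turn as the validation set while the model is trained on the remaining folds. The scores of the individual folds are then averaged.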
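To make the fold rotation concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` is a hypothetical name used only for illustration, not part of sklearn; it assumes contiguous, unshuffled folds, which is how sklearn's `KFold` splits by default.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train, validation) index lists, one pair per fold.

    Hypothetical helper for illustration; folds are contiguous and unshuffled.
    """
    # When n_samples is not divisible by n_folds, the first folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    sizes = [base + 1 if i < extra else base for i in range(n_folds)]

    splits = []
    start = 0
    for size in sizes:
        stop = start + size
        # The current fold is held out for validation...
        validation = list(range(start, stop))
        # ...and everything else is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        splits.append((train, validation))
        start = stop
    return splits


splits = k_fold_indices(n_samples=10, n_folds=4)
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Every sample lands in the validation set exactly once, so each score is computed on data the model did not see during that fit.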

In [7]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
model = make_pipeline(
    StandardScaler(),
    SVR(),
)

# setting parameters in the pipeline
model.set_params(
    standardscaler__with_std=True,
    svr__C=1.0,
)

# estimate the score with 4-fold cross-validation on the training data
sc = cross_val_score(model, X_train, y_train, cv=4)
print(sc.mean())

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the held-out test data
print(model.score(X_test, y_test))

0.384060958672
0.374563753236


## Grid Search

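Automatically search for good parameter values: every combination in the grid is scored with cross-validation, and the best one is refit on the whole training set.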
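As a rough sketch of what "every combination" means, the grid can be expanded by hand with `itertools.product`. The parameter values are copied from the cell below. The fold count of 3 is an assumption taken from the log printed by that cell; newer sklearn versions default to 5 folds.

```python
from itertools import product

# Same grid as in the GridSearchCV call below.
param_grid = {
    "standardscaler__with_std": [True, False],
    "svr__C": [0.1, 1.0, 10.0],
    "svr__gamma": [0.1, 1.0, 10.0],
}

# One candidate per element of the Cartesian product of all value lists.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 3  # assumption: matches the "Fitting 3 folds" line in the log
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why the value lists are kept short here.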

In [8]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
estimator = make_pipeline(
    StandardScaler(),
    SVR(),
)

# create an instance of a grid search class
model = GridSearchCV(
    estimator=estimator,
    param_grid={
        "standardscaler__with_std": [True, False],
        "svr__C": [0.1, 1.0, 10.0],
        "svr__gamma": [0.1, 1.0, 10.0],
    },
    verbose=1,
    n_jobs=8,
)

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

# make estimations as usual
yp = model.predict(X_test)

print("Example estimations")
print([v for v in zip(y_test[:10], yp[:10])])

Fitting 3 folds for each of 18 candidates, totalling 54 fits
0.374776767158
Example estimations
[(6.0, 5.119298583703296), (5.0, 5.1679546993499574), (7.0, 7.0942408979956673), (6.0, 4.8421969242225433), (5.0, 6.0050072789140199), (6.0, 5.2082930080415828), (5.0, 5.0528333146304885), (6.0, 5.9496966127585198), (4.0, 5.0921054035949931), (5.0, 5.0883908910564859)]


[Parallel(n_jobs=8)]: Done  54 out of  54 | elapsed:    1.0s finished


## Try more than one class of models

* Explore linear and kNN models in `MLSeminar2_2.pdf`

* Use these two model classes in `GridSearchCV`

*Note*: See example parameter ranges for models and their "comparison" here:
https://arxiv.org/pdf/1708.05070.pdf
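When `param_grid` is a list of dicts, each dict is expanded on its own and the results are concatenated, so parameters of one model are never combined with another model. The sketch below illustrates this in plain Python. The `expand` helper is a hypothetical name, and the estimator objects are replaced by strings so that the block runs without sklearn.

```python
from itertools import product


def expand(grid):
    """Expand one dict of value lists into a list of parameter dicts (hypothetical helper)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Strings stand in for the estimator instances used in the real grid.
param_grid = [
    {
        "model": ["KNeighborsRegressor"],
        "model__n_neighbors": [1, 2, 3],
        "model__metric": ["minkowski"],
    },
    {
        "model": ["Lasso"],
        "model__alpha": [0.01, 0.1, 1.0],
    },
]

# Sub-grids are expanded separately and then concatenated.
candidates = [params for grid in param_grid for params in expand(grid)]
print(len(candidates), "candidates")
for params in candidates:
    print(params)
```

Three kNN candidates plus three Lasso candidates give six in total, which agrees with the "6 candidates" line in the log of the cell below.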

In [7]:
import pandas as ps
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Lasso()),
])

# create an instance of a grid search class
model = GridSearchCV(
    estimator=estimator,
    param_grid=[  # a list of dicts: each dict is a separate subspace to search
        {
            "model": [KNeighborsRegressor()],  # replace the 'model' pipeline step with this estimator
            "model__n_neighbors": [1, 2, 3],  # parameters of that step
            "model__metric": ["minkowski"],
        },
        {
            "model": [Lasso()],
            "model__alpha": [0.01, 0.1, 1.0],
        },
    ],
    verbose=1,
    n_jobs=8,
)


# fit a model to the data
model.fit(X_train, y_train)

print(model.best_params_)

# evaluate the model on the data
print(model.score(X_test, y_test))

# make estimations as usual
yp = model.predict(X_test)

print("Example estimations")
print([v for v in zip(y_test[:10], yp[:10])])

Fitting 3 folds for each of 6 candidates, totalling 18 fits
{'model': Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False), 'model__alpha': 0.01}
0.338515504834656
Example estimations
[(6.0, 5.7686006218926025), (5.0, 5.034163979580779), (7.0, 6.5095248248564), (6.0, 5.38237161642421), (5.0, 5.8686735982789715), (6.0, 5.101823625659236), (5.0, 5.37781568807379), (6.0, 5.972112277618752), (4.0, 4.836185272129585), (5.0, 5.008708344116504)]


[Parallel(n_jobs=8)]: Done   3 out of  18 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  18 out of  18 | elapsed:    0.1s finished
