# ML Seminar 4

Data Science Pipeline

## Let's set a goal
<center>
Build a [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) detector!

<img src="misc/wine.svg" alt="Drawing" style="width: 800px;"/>

We already have a prototype!
</center>

## Use sklearn functions

* `train_test_split`

In [6]:
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# read the semicolon-separated CSV file into a NumPy array
Xy = pd.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1])

# create a model class instance
model = SVR()

# fit the model to the training data
model.fit(X_train, y_train)

# evaluate the model on the held-out test data (score is R^2)
print(model.score(X_test, y_test))

0.24785121402305202
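
The score above changes from run to run, because `train_test_split` shuffles the rows randomly before splitting. A minimal sketch of the default split size and of making the split repeatable with `random_state` follows. The synthetic array is an assumption standing in for the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the wine data (assumption): 100 rows, 3 features + target
rng = np.random.default_rng(0)
Xy = rng.normal(size=(100, 4))
X, y = Xy[:, :-1], Xy[:, -1]

# by default 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)

# a fixed random_state gives the same split on every call
a_train, a_test, _, _ = train_test_split(X, y, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(a_test, b_test))  # True
```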


## Model parameters in sklearn

* There are two ways to set them: in the constructor, or later with `set_params`

In [27]:
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# read the semicolon-separated CSV file into a NumPy array
Xy = pd.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# way #1: pass the parameter to the constructor
model = SVR(C=10.0)

# way #2: change it on an existing instance with set_params
model.set_params(C=1.0-0.4)

# get a dict of all parameter names and values
print(model.get_params())

# fit the model to the training data
model.fit(X_train, y_train)

# evaluate the model on the held-out test data (score is R^2)
print(model.score(X_test, y_test))

{'C': 0.6, 'kernel': 'rbf', 'epsilon': 0.1, 'verbose': False, 'tol': 0.001, 'shrinking': True, 'coef0': 0.0, 'gamma': 'auto', 'cache_size': 200, 'max_iter': -1, 'degree': 3}
0.27038267084148215
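
A small sketch of how the two ways interact, without needing the data file: `set_params` overrides the constructor value, returns the estimator itself, and rejects names that are not parameters of the estimator. The parameter name `not_a_parameter` is made up for illustration.

```python
from sklearn.svm import SVR

# way #1: constructor
model = SVR(C=10.0)

# way #2: set_params overrides the value and returns the same object
returned = model.set_params(C=0.6, epsilon=0.2)
print(returned is model)       # True
print(model.C, model.epsilon)  # 0.6 0.2

# an unknown parameter name (hypothetical) raises ValueError
try:
    model.set_params(not_a_parameter=1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```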


## Input normalization

A common trick in machine learning is to standardize each input column:

* Make the mean of each column = 0.0
* Make the standard deviation of each column = 1.0
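
As a sketch of what this means in practice: subtract each column's mean and divide by its standard deviation. The check below compares that manual computation with `StandardScaler`. The random matrix is an assumption used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic feature matrix (assumption): columns far from mean 0 / std 1
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 2))

# manual standardization, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# the same transformation with sklearn
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))          # True
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```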

In [32]:
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# read the semicolon-separated CSV file into a NumPy array
Xy = pd.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X = Xy[:, :-1]
y = Xy[:, -1]

# StandardScaler is a transformer (it implements TransformerMixin)
scaler = StandardScaler()
scaler.fit(X, y)  # learns the per-column mean and standard deviation (y is ignored)
X = scaler.transform(X)  # applies the learned scaling to the data

X_train, X_test, y_train, y_test = train_test_split(X, y)

# create a model class instance
model = SVR()

# fit the model to the training data
model.fit(X_train, y_train)

# evaluate the model on the held-out test data (score is R^2)
print(model.score(X_test, y_test))

0.4107145631253448


Is there an error in the code above?
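
One likely answer: the scaler is fit on all rows before the split, so statistics of the test rows leak into the training data. A possible fix, sketched on synthetic data (an assumption, so that it runs without the wine file), is to split first and fit the scaler on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# synthetic regression data (assumption), standing in for the wine data
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=4.0, size=(200, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# split first, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = SVR()
model.fit(X_train_s, y_train)
score = model.score(X_test_s, y_test)
print(score)
```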

## Pipelines in sklearn

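Simplify preprocessing by making it part of the predictive model instance. When the pipeline is fit, the scaler is fit on the training data only, and the same scaling is reused when predicting or scoring.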

In [6]:
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the semicolon-separated CSV file into a NumPy array
Xy = pd.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
model = make_pipeline(
    StandardScaler(),
    SVR(),
)

# set parameters of pipeline steps with the <step name>__<parameter> syntax
model.set_params(
    standardscaler__with_mean=False,
    svr__C=1.0,
)

# fit the whole pipeline (scaler + SVR) to the training data
model.fit(X_train, y_train)

# evaluate the pipeline on the held-out test data (score is R^2)
print(model.score(X_test, y_test))

0.374563755169
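
The prefixes `standardscaler__` and `svr__` used above come from the step names that `make_pipeline` generates, which are the lowercased class names. A short sketch of how to inspect them:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

model = make_pipeline(StandardScaler(), SVR())

# step names are derived from the class names
step_names = [name for name, _ in model.steps]
print(step_names)  # ['standardscaler', 'svr']

# <step name>__<parameter> reaches into the corresponding step
model.set_params(svr__C=2.0)
print(model.named_steps['svr'].C)    # 2.0
print(model.get_params()['svr__C'])  # 2.0
```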
