# ML Seminar 3

Data Science Pipeline

## Let's set a goal
<center>
Build a [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) detector!

<img src="misc/wine.svg" alt="Drawing" style="width: 800px;"/></center>

## Plan

[ x ] Install everything we need

[ x ] Understand the basics of predictive modelling

[   ] Load and preprocess data

[   ] Create a predictive model with sklearn

## Quantum leap to data science

* Load data from a CSV file

* Train a model on the data

## Solution

Minimal program to train an sklearn model on our data.

In [1]:
import pandas as ps
from sklearn.svm import SVR

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()

# split the data into inputs and outputs
X = Xy[:, :-1]
y = Xy[:, -1]

# create a model class instance
model = SVR()

# fit a model to the data
model.fit(X, y)

# make estimations with the model
yp = model.predict(X)
print(yp)

# evaluate the model on the data
print(model.score(X, y))

[5.09985909 5.31661002 4.90015287 ... 5.98878076 5.10019844 5.89988359]
0.6374359027519503


Any issues with this code?

## Data preprocessing

* Split the data into training and test sets

In [2]:
import pandas as ps
from sklearn.svm import SVR

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()

# split the data into inputs and outputs
X = Xy[:, :-1]
y = Xy[:, -1]

print(X.shape)
print(y.shape)

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

# create a model class instance
model = SVR()

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

(1599, 11)
(1599,)
0.0834225250112
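
The score drops from about 0.64 on the training rows to about 0.08 on held-out rows, which suggests the first number was an optimistic estimate. For regressors, `score` returns the coefficient of determination R². The sketch below reproduces the comparison on synthetic data (an assumption, since the wine file is not bundled here) and checks the R² formula by hand; the exact numbers will differ from the wine results.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for the wine data (assumption): 300 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Hold out the last 100 rows, mirroring the slicing used in the cell above.
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

model = SVR().fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# score() for regressors is R^2 = 1 - SS_res / SS_tot; verify it manually.
residual = y_test - model.predict(X_test)
ss_res = np.sum(residual ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(train_score, test_score, manual_r2)
```

Note that slicing the first 1000 rows only gives a representative split if the file is not ordered in some meaningful way; a shuffled split avoids relying on that.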


## Use sklearn's helper functions

* `train_test_split`

In [2]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1])

# create a model class instance
model = SVR()

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

0.21440009257313908
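
`train_test_split` shuffles the rows before splitting, and without `random_state` the split (and therefore the score) changes from run to run. The following sketch shows the main arguments on a placeholder array with the same shape as the wine table; the array contents are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array shaped like the wine table (assumption):
# 1599 rows, 11 feature columns plus 1 target column.
Xy = np.arange(1599 * 12, dtype=float).reshape(1599, 12)
X, y = Xy[:, :-1], Xy[:, -1]

# test_size=0.25 is the default; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)

# The same seed reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_test, X_test2))
```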


## Model parameters in sklearn

* two ways to set them

In [27]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# setting parameter in way #1
model = SVR(C=10.0)

# setting parameter in way #2
model.set_params(C=1.0-0.4)

# get a list of all parameter values
print(model.get_params())

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

{'C': 0.6, 'kernel': 'rbf', 'epsilon': 0.1, 'verbose': False, 'tol': 0.001, 'shrinking': True, 'coef0': 0.0, 'gamma': 'auto', 'cache_size': 200, 'max_iter': -1, 'degree': 3}
0.27038267084148215
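
In the cell above the constructor value `C=10.0` is overwritten by the later `set_params` call, which is why the printed dictionary shows `'C': 0.6`. As a small sketch, the two ways of setting a parameter can be compared directly:

```python
from sklearn.svm import SVR

# Way 1: pass the parameter to the constructor.
model_a = SVR(C=0.6)

# Way 2: change it on an existing instance.
# set_params returns the estimator itself, so calls can be chained.
model_b = SVR()
returned = model_b.set_params(C=0.6)

print(returned is model_b)
print(model_a.get_params() == model_b.get_params())
print(model_a.get_params()["C"])
```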


## Input normalization

A common trick in machine learning is to standardize every input column:

* Make the mean of each column 0.0
* Make the standard deviation of each column 1.0
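
The two bullets above can be written out by hand. This sketch uses a small made-up matrix (an assumption for illustration) and checks that the manual computation agrees with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix (assumption): 4 samples, 2 features on different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Manual standardization: subtract the column mean, divide by the column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same computation per column.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```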

In [32]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()

X = Xy[:, :-1]
y = Xy[:, -1]

# StandardScaler is a transformer (it inherits from TransformerMixin)
scaler = StandardScaler()
scaler.fit(X, y) # can be fit to the data
X = scaler.transform(X) # does transformation on the data

X_train, X_test, y_train, y_test = train_test_split(X, y)

# create a model class instance
model = SVR()

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

0.4107145631253448


Any error in the code above?
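
One possible answer: the scaler is fit on all rows before the split, so statistics of the test rows leak into the preprocessing. A minimal sketch of a leak-free ordering, using synthetic data as a stand-in for the wine file (an assumption), might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the wine data (assumption): 3 features on different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(400, 3))
y = X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = SVR().fit(X_train_s, y_train)
print(model.score(X_test_s, y_test))
```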

## Pipelines in sklearn

Simplify preprocessing by making it part of the predictive model itself.

In [6]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
model = make_pipeline(
    StandardScaler(),
    SVR(),
)

# setting parameters in the pipeline
model.set_params(
    standardscaler__with_mean=False,
    svr__C=1.0,    
)

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

0.374563755169
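
The parameter names used in `set_params` above follow from how `make_pipeline` names its steps. As a short sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

model = make_pipeline(StandardScaler(), SVR())

# make_pipeline names each step after its lowercased class name.
print(list(model.named_steps))

# Nested parameters are addressed as <step name>__<parameter name>.
model.set_params(svr__C=10.0, standardscaler__with_mean=False)
print(model.named_steps["svr"].C)
print(model.get_params()["standardscaler__with_mean"])
```

Because the scaler is a step of the pipeline, calling `fit` on the pipeline fits the scaler on the training rows only and reuses those statistics in `predict` and `score`.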


Anything missing here?

## Cross-validation

The next step beyond a single validation split:

The data is split into folds, and each fold is used in turn as the validation set while the model is trained on the remaining folds.

In [7]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
model = make_pipeline(
    StandardScaler(),
    SVR(),
)

# setting parameters in the pipeline
model.set_params(
    standardscaler__with_std=True,
    svr__C=1.0,
)

# get the cross-validation score estimate
sc = cross_val_score(model, X_train, y_train, cv=4)
print(sc.mean())

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

0.384060958672
0.374563753236
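
To make the fold mechanics explicit, here is a sketch of the loop that `cross_val_score` runs for a regressor with `cv=4`. It uses synthetic data from `make_regression` as a stand-in for the wine table (an assumption) and compares the manual scores with the library result.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data as a stand-in for the wine table (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = make_pipeline(StandardScaler(), SVR())

# For a regressor, cv=4 means 4 consecutive folds without shuffling.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=4).split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

library_scores = cross_val_score(model, X, y, cv=4)
print(np.mean(manual_scores), library_scores.mean())
```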


## Grid Search

Search automatically for good parameter values.

In [8]:
import pandas as ps
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# read the file as csv
Xy = ps.read_csv('data/winequality-red.csv', sep=';').to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xy[:, :-1], Xy[:, -1], random_state=0)

# create a model class instance
estimator = make_pipeline(
    StandardScaler(),
    SVR(),
)

# create an instance of a grid search class
model = GridSearchCV(
    estimator=estimator,
    param_grid={
        "standardscaler__with_std": [True, False],
        "svr__C": [0.1, 1.0, 10.0],
        "svr__gamma": [0.1, 1.0, 10.0],
    },
    verbose=1,
    n_jobs=8,
)

# fit a model to the data
model.fit(X_train, y_train)

# evaluate the model on the data
print(model.score(X_test, y_test))

# make estimations as usual
yp = model.predict(X_test)

print("Example estimations")
print(list(zip(y_test[:10], yp[:10])))

Fitting 3 folds for each of 18 candidates, totalling 54 fits
0.374776767158
Example estimations
[(6.0, 5.119298583703296), (5.0, 5.1679546993499574), (7.0, 7.0942408979956673), (6.0, 4.8421969242225433), (5.0, 6.0050072789140199), (6.0, 5.2082930080415828), (5.0, 5.0528333146304885), (6.0, 5.9496966127585198), (4.0, 5.0921054035949931), (5.0, 5.0883908910564859)]


[Parallel(n_jobs=8)]: Done  54 out of  54 | elapsed:    1.0s finished
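
After fitting, the search object also records which combination won. The sketch below counts the candidates in the grid and inspects the result on synthetic data (an assumption, used so the example runs without the wine file); `cv=3` is set explicitly to match the 3 folds reported in the log above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

param_grid = {
    "standardscaler__with_std": [True, False],
    "svr__C": [0.1, 1.0, 10.0],
    "svr__gamma": [0.1, 1.0, 10.0],
}

# 2 * 3 * 3 combinations, matching the "18 candidates" in the log above.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)

# Synthetic stand-in for the wine data (assumption).
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(make_pipeline(StandardScaler(), SVR()), param_grid, cv=3)
search.fit(X, y)

# The winning combination and its mean cross-validated score.
print(search.best_params_)
print(search.best_score_)
```

By default the search refits the best combination on all the data passed to `fit`, which is why `predict` and `score` can be called on the search object directly.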
