# Model selection and evaluation

A typical approach is to split our data into training and test sets, fit a model to the training set and then test the model's performance on the test set.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.96666666666666667

In the example above, we could tweak the model's hyperparameters by hand (e.g. $C$) until the model produces the best score on the test set. The problem is that information about the test set then leaks into the model, so the test score no longer reflects performance on unseen data. This is known as **overfitting the test set**. One way to avoid it is to split the data into three subsets: train, validation and test. After fitting the model on the train set and choosing its hyperparameters on the validation set, we estimate the final model performance on the test set.
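The three-way split described above could be sketched as follows. The split proportions (60/20/20), the candidate values for $C$ and the use of `stratify` are assumptions made for illustration, not choices taken from this notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# First hold out a test set (20% here; the proportion is an arbitrary choice).
X_rest, X_test, y_rest, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target)

# Then split the remainder into train and validation sets (25% of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Choose C using the validation set only; the test set is never touched here.
candidate_Cs = [0.01, 0.1, 1, 10]  # illustrative grid
best_C, best_val_score = None, -1.0
for C in candidate_Cs:
    model = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_C, best_val_score = C, val_score

# Estimate the final performance once, on the untouched test set.
final_model = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_C, best_val_score, test_score)
```

Because the test set plays no part in choosing $C$, the final score is a fairer estimate than one obtained by tuning directly against the test set.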



## Cross-validation

scikit-learn supports _cross-validation_, a procedure that handles both training the model and estimating its accuracy on held-out validation data. In practice there is no separate validation set: the training set is split into $k$ subsets (folds), the model is trained on $k-1$ of them and evaluated on the remaining one. This is repeated $k$ times, so that each fold serves as the validation set exactly once. $cross\_val\_score$ returns the $k$ individual scores, and their mean is usually reported as the cross-validation score.

In [34]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
print(scores)

# Print the mean score and a rough 95% interval (two standard deviations)
print("F1 (macro): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.96658312  1.          0.96658312  0.96658312  1.        ]
F1 (macro): 0.98 (+/- 0.03)
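To make the mechanics of $k$-fold cross-validation explicit, here is a minimal hand-written loop. It is only a sketch: it uses a plain shuffled `KFold` and the estimator's default accuracy score, both of which are assumptions, so its numbers will not match the cell above exactly.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Shuffle before splitting: the iris samples are ordered by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train a fresh model on k-1 folds ...
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X[train_idx], y[train_idx])
    # ... and evaluate it on the one fold that was held out.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores)
print(fold_scores.mean())
```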


To evaluate several metrics at once (including custom scorers), use $cross\_validate$ instead of $cross\_val\_score$.
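As a rough illustration, several metrics can be passed to `cross_validate` as a dictionary. The metric names chosen here (`acc`, `f2`) and the F-beta scorer built with `make_scorer` are assumptions for the sake of the example.

```python
from sklearn import datasets, svm
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Map illustrative names to a built-in scorer and a custom one.
scoring = {
    'acc': 'accuracy',
    'f2': make_scorer(fbeta_score, beta=2, average='macro'),
}

results = cross_validate(clf, iris.data, iris.target, cv=5, scoring=scoring)

# One array of per-fold scores per metric, under the key 'test_<name>'.
print(results['test_acc'])
print(results['test_f2'])
```

The returned dictionary also contains `fit_time` and `score_time` for each fold.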

The function $model\_selection.cross\_val\_predict$, on the other hand, returns a prediction for every sample. Each sample appears in a validation fold exactly once, and its prediction is made by the model trained on the other folds:

In [36]:
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted) 

0.97333333333333338

**Important**: if the target classes are heavily imbalanced (e.g. there are several times more negative samples than positive ones), it is recommended to use stratified sampling as implemented in $StratifiedKFold$ and $StratifiedShuffleSplit$. This ensures that the relative class frequencies are approximately preserved in each train and validation fold.
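The effect of stratification can be checked on a small synthetic example. The labels below (90 negatives, 10 positives) and the placeholder features are hypothetical and serve only to show the class counts in each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, not real data

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

val_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each class ended up in each part.
    train_count = np.bincount(y[train_idx], minlength=2)
    val_count = np.bincount(y[val_idx], minlength=2)
    val_counts.append(val_count.tolist())
    print(train_count, val_count)
```

Every validation fold keeps the 9:1 ratio of the full label vector, which a plain random split would not guarantee.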

---------------

Now let's try scaling the data before fitting a model.

In [33]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(iris.data)
iris_data_scaled = scaler.transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    iris_data_scaled, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(C=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.93333333333333335
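Note that the cell above fits the scaler on the full dataset before splitting, so the test samples influence the scaling statistics. One way to avoid this, sketched here under the assumption that a `Pipeline` is acceptable for this workflow, is to combine the scaler and the classifier so that the scaler is refitted on the training part of each fold only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# The pipeline fits StandardScaler on the training folds only,
# then applies the same transformation to the validation fold.
pipe = make_pipeline(StandardScaler(), svm.SVC(C=1))

pipe_scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(pipe_scores)
print(pipe_scores.mean())
```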

## Tuning the estimator's hyper-parameters

Hyper-parameters are not learned when an estimator is fitted; they are set by the user as constructor arguments. They should be chosen so that they produce the best cross-validation score. The current values can be inspected with $get\_params$:

In [40]:
clf.get_params()

{'C': 1,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': None,
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'linear',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
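One common way to choose hyper-parameters by cross-validation score is an exhaustive grid search. The following is a minimal sketch using `GridSearchCV`; the parameter grid is an arbitrary assumption and not a recommendation.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Illustrative grid: every combination of these values is cross-validated.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

# Best combination found and its mean cross-validation score.
print(search.best_params_)
print(search.best_score_)
```

By default the search refits the best combination on the whole dataset, so `search` can then be used like a fitted estimator.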

** I'll skip this for now **