# Generate multiple classifier models for testing

Generate classifier models and pickle them under `/User/mlrun/models/<any>-classifier.cpkl`.

In principle, the pickle `load` method should return a class instance that we can predict with. This may not work in practice, so the purpose of this notebook is to find out which models survive a dump/load round trip and which don't. Several pickling packages will also be tested in case they behave differently.

### _extensions_
* cpkl for `cloudpickle`
* pkl for `pickle`
* dpkl for `dill`...
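
The round trip described above can be sketched independently of any model. This is a minimal sketch, not the notebook's own code: the `PICKLERS` mapping and the `available_picklers` and `roundtrip` helpers are hypothetical names introduced here, and the example assumes only that each package exposes `dump(obj, file)` and `load(file)`, which `pickle`, `cloudpickle` and `dill` all do. Packages that are not installed are skipped.

```python
import importlib
import io
import pickle

# Extension -> module name, following the naming convention listed above.
# (Hypothetical mapping, introduced for illustration only.)
PICKLERS = {"pkl": "pickle", "cpkl": "cloudpickle", "dpkl": "dill"}


def available_picklers():
    """Return {extension: module} for every pickling package that imports."""
    found = {}
    for ext, name in PICKLERS.items():
        try:
            found[ext] = importlib.import_module(name)
        except ImportError:
            # cloudpickle and dill are third-party and may not be installed
            continue
    return found


def roundtrip(obj, pickler):
    """Dump obj to an in-memory buffer and load it back with the same package."""
    buffer = io.BytesIO()
    pickler.dump(obj, buffer)
    buffer.seek(0)
    return pickler.load(buffer)


# Round-trip a plain object through every package that is available.
results = {ext: roundtrip({"a": [1, 2, 3]}, mod)
           for ext, mod in available_picklers().items()}
```

Passing a fitted classifier instead of the dictionary gives the same check that the cells below perform against files on disk.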

**gpc model:** adapted from **[Probabilistic predictions with Gaussian process classification (GPC)](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpc.html#sphx-glr-auto-examples-gaussian-process-plot-gpc-py)**



In [1]:
%matplotlib inline

In [2]:
from cloudpickle import dump as cdump, load as cload
from pickle import dump as pdump, load as pload
from dill import dump as ddump, load as dload

In [3]:
import numpy as np

from matplotlib import pyplot as plt

from sklearn.metrics import accuracy_score, log_loss
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

In [4]:
n_samples = 1000
train_size = 0.7

X, y = make_classification(
    n_samples=n_samples,
    n_features=28,
    random_state=1)

# seed the split as well, so the train/test partition is reproducible
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=1 - train_size, random_state=1)

In [5]:
kernel = 1.0*RBF(length_scale=1.0)

clf = GaussianProcessClassifier(kernel=kernel, random_state=1)

In [6]:
clf.fit(xtrain, ytrain)

GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),
                          max_iter_predict=100, multi_class='one_vs_rest',
                          n_jobs=None, n_restarts_optimizer=0,
                          optimizer='fmin_l_bfgs_b', random_state=1,
                          warm_start=False)

In [7]:
accuracy = accuracy_score(ytest, clf.predict(xtest))

logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])

In [8]:
with open('/User/mlrun/models/gpc-classifier.cpkl', 'wb') as fp:
    cdump(clf, fp)

In [9]:
with open('/User/mlrun/models/gpc-classifier.cpkl', 'rb') as fp:
    clf_loaded = cload(fp)

In [10]:
clf_loaded

GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),
                          max_iter_predict=100, multi_class='one_vs_rest',
                          n_jobs=None, n_restarts_optimizer=0,
                          optimizer='fmin_l_bfgs_b', random_state=1,
                          warm_start=False)

In [11]:
assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))
assert logloss == log_loss(ytest, clf_loaded.predict_proba(xtest)[:, 1])
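
The assertions above compare two summary metrics with exact float equality. A stricter and more tolerant check is to compare the raw predictions directly. The helper below is a sketch under assumptions: `same_predictions` and its `rtol` parameter are hypothetical names, not part of scikit-learn, and it assumes only that both objects expose `predict` and `predict_proba`.

```python
import numpy as np


def same_predictions(model, reloaded, X, rtol=1e-9):
    """Return True if both models give matching labels and probabilities on X."""
    # Class labels must agree exactly.
    labels_match = np.array_equal(model.predict(X), reloaded.predict(X))
    # Probabilities are floats, so allow a small relative tolerance.
    probas_match = np.allclose(
        model.predict_proba(X), reloaded.predict_proba(X), rtol=rtol)
    return bool(labels_match and probas_match)
```

In this notebook it could be called as `assert same_predictions(clf, clf_loaded, xtest)`.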

In [12]:
from sklearn.ensemble import AdaBoostClassifier

In [21]:
clf = AdaBoostClassifier(n_estimators=100, random_state=1)

In [22]:
clf.fit(xtrain, ytrain)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=100, random_state=1)

In [23]:
with open('/User/mlrun/models/ada-classifier.cpkl', 'wb') as fp:
    cdump(clf, fp)

In [24]:
with open('/User/mlrun/models/ada-classifier.cpkl', 'rb') as fp:
    clf_loaded = cload(fp)

In [25]:
accuracy = accuracy_score(ytest, clf.predict(xtest))

logloss = log_loss(ytest, clf.predict_proba(xtest)[:, 1])

In [26]:
assert accuracy == accuracy_score(ytest, clf_loaded.predict(xtest))
assert logloss == log_loss(ytest, clf_loaded.predict_proba(xtest)[:, 1])