In this notebook, we generate synthetic binary classification data and show how to train a supervised cadre model (SCM) on it. We first train a single model with fixed hyperparameters, and then show how cross-validation can be used for hyperparameter tuning to get better performance.

THIS NOTEBOOK IS INCOMPLETE

In [None]:
import numpy as np
import pandas as pd
import sys
import matplotlib.pyplot as plt
import seaborn as sns

sys.path.insert(0, '../cadreModels')

from classificationBinary import binaryCadreModel
from sklearn.datasets import make_classification
from scipy.stats import zscore, zmap

from sklearn.model_selection import train_test_split

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
sns.set_style('darkgrid')

Generate data with the `sklearn.datasets.make_classification` function. Bind `X` and `y` into a `pd.DataFrame`.

In [None]:
X, y = make_classification(n_samples=50000, random_state=2125615, n_clusters_per_class=10, 
                           n_features=50, n_informative=25, n_repeated=15)

data = pd.DataFrame(X)
data.columns = ['f'+str(p) for p in data.columns]
data = data.assign(target=y)
features = data.columns[data.columns != 'target']

In [None]:
data.head()

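Since the features are continuous, we standardize them. The validation features are scaled with the training set's mean and standard deviation (`zmap`), so no validation statistics leak into preprocessing.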

In [None]:
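D_tr, D_va = train_test_split(data, test_size=0.2, random_state=313616)
# work on explicit copies so the assignments below don't trigger SettingWithCopyWarning
D_tr, D_va = D_tr.copy(), D_va.copy()

# scale validation features with the training statistics first, while D_tr is still unscaled
D_va[features] = zmap(D_va[features], D_tr[features])
D_tr[features] = zscore(D_tr[features])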
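
To make the standardization step concrete, here is a minimal sketch on toy arrays (the names `train`, `valid`, `mu`, and `sigma` are illustrative and not part of the notebook). It checks that `zscore` and `zmap` agree with subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np
from scipy.stats import zscore, zmap

# toy stand-ins for the training and validation feature matrices
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
valid = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# column-wise statistics estimated on the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, which matches scipy's default

train_std = (train - mu) / sigma
valid_std = (valid - mu) / sigma

# zscore standardizes an array with its own statistics;
# zmap standardizes its first argument with the statistics of its second
assert np.allclose(train_std, zscore(train))
assert np.allclose(valid_std, zmap(valid, train))
```

The validation columns will generally not have exactly zero mean or unit variance after this step, which is expected: they are expressed in the training set's units.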

A `binaryCadreModel`'s initialization function takes the following arguments and default values:

* `M=2` -- number of cadres in model
* `gamma=10.` -- cadre-assignment sharpness
* `lambda_d=0.01` -- regularization strength for cadre-assignment weight parameter `d`
* `lambda_W=0.01` -- regularization strength for classification-weight parameter `W`
* `alpha_d=0.9` -- elastic net mixing weight for cadre-assignment weight parameter `d`
* `alpha_W=0.9` -- elastic net mixing weight for classification-weight parameter `W`
* `Tmax=10000` -- maximum number of SGD steps to take
* `record=100` -- during training, how often goodness-of-fit metrics should be evaluated on the data
* `eta=2e-3` -- initial stepsize / learning rate
* `Nba=64` -- minibatch size
* `eps=1e-3` -- convergence tolerance
* `termination_metric='ROC_AUC'` -- training terminates if the difference between the two most recent `termination_metric` values is less than `eps`
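
The stopping rule described by `eps` and `termination_metric` can be sketched as follows. This is only an illustration: the helper `should_stop` is hypothetical, and taking the absolute difference is our assumption, since the description above only says "difference".

```python
def should_stop(metric_history, eps=1e-3):
    """Return True when the two most recent metric values differ by less than eps.

    Hypothetical helper for illustration; not part of binaryCadreModel.
    """
    if len(metric_history) < 2:
        return False  # nothing to compare against yet
    return abs(metric_history[-1] - metric_history[-2]) < eps


# made-up ROC AUC values, one per `record` interval
history = [0.70, 0.80, 0.85, 0.8504]

# evaluate the rule after each recorded value
stops = [should_stop(history[:k]) for k in range(1, len(history) + 1)]
print(stops)
```

With a rule like this, a smaller `eps` or a larger `record` interval makes early termination less likely, because consecutive recorded values tend to differ more.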

Once you initialize a `binaryCadreModel`, you apply the `fit` method to train it. This method takes the following arguments and default values:

* `data` -- `pd.DataFrame` of training data
* `targetCol` -- string column-name of target feature in `data`
* `cadreFts=None` -- `pd.Index` of column-names used for cadre-assignment
* `predictFts=None` -- `pd.Index` of column-names used for target-prediction
* `dataVa=None` -- optional `pd.DataFrame` of validation data 
* `seed=16162` -- seed for parameter initialization and minibatch generation
* `store=False` -- whether or not copies of `data` and `dataVa` should be added as attributes of the `binaryCadreModel`
* `progress=False` -- whether or not goodness-of-fit metrics should be printed during training

Other attributes of the `binaryCadreModel` include:

* `W` -- matrix of cadre-specific classification weights
* `W0` -- vector of cadre-specific classification biases
* `C` -- matrix of cadre centers
* `d` -- vector of cadre-assignment weights
* `metrics` -- a `dict` with `'training'` and `'validation'` as keys. Each item is a `pd.DataFrame` of goodness-of-fit metrics evaluated during training. Metrics include loss, accuracy, ROC AUC, and precision-recall (PR) AUC
* `time` -- list of computer-time values, one per SGD step, recording how long each step took to evaluate
* `proportions` -- during training, the proportion of the training data assigned to each cadre is recorded. This is a `pd.DataFrame` of those proportions, which lets you see if cadre assignments have converged to a stable distribution.
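
To give some intuition for how `C`, `d`, `gamma`, `W`, and `W0` might fit together, here is a rough NumPy sketch of one common cadre-model formulation: soft cadre memberships come from a softmax over negative weighted squared distances to the cadre centers, and the final score is the membership-weighted mix of per-cadre linear scores. Whether `binaryCadreModel` computes exactly this, and the array shapes used here, are assumptions; the function names are hypothetical.

```python
import numpy as np

def soft_cadre_membership(X, C, d, gamma):
    """Assumed form: softmax over -gamma * weighted squared distance to each center.

    X: (N, P) observations, C: (M, P) cadre centers, d: (P,) feature weights.
    """
    diff = X[:, None, :] - C[None, :, :]              # (N, M, P)
    dist = np.sum(np.abs(d) * diff ** 2, axis=2)      # (N, M)
    logits = -gamma * dist
    logits -= logits.max(axis=1, keepdims=True)       # stabilize the exponentials
    G = np.exp(logits)
    return G / G.sum(axis=1, keepdims=True)           # each row sums to 1

def cadre_scores(X, G, W, W0):
    """Membership-weighted mix of per-cadre linear scores. W: (M, P), W0: (M,)."""
    per_cadre = X @ W.T + W0                          # (N, M)
    return np.sum(G * per_cadre, axis=1)              # (N,)

# toy example with two cadres and two features
X = np.array([[0.0, 0.0], [0.1, -0.1], [3.0, 3.0], [2.9, 3.2]])
C = np.array([[0.0, 0.0], [3.0, 3.0]])
d = np.array([1.0, 0.5])
W = np.array([[1.0, -1.0], [0.5, 0.5]])
W0 = np.array([0.0, -1.0])

G = soft_cadre_membership(X, C, d, gamma=10.0)
scores = cadre_scores(X, G, W, W0)
print(G.argmax(axis=1))  # most likely cadre for each row
```

In this sketch a larger `gamma` pushes each row of `G` toward a one-hot vector, which is one way to read "cadre-assignment sharpness".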

In [None]:
scm = binaryCadreModel(Tmax=17001, record=50, eps=1e-4, lambda_W=1e-3, lambda_d=1e-3, M=10)
scm.fit(D_tr, 'target', features, features, D_va, progress=True)

In [None]:
scm.metrics['validation'].drop('loss', axis=1).plot()

In [None]:
scm.metrics['training']['loss'].plot()

In [None]:
scm.scoreMetrics(D_va)

In [None]:
scm.entropy(D_va)

In [None]:
# the original bound both the 2nd and 5th outputs to `l`, so the 2nd was silently overwritten
f, l0, G, m, l = scm.predictFull(D_va)

In [None]:
pd.Series(m).value_counts()

In [None]:
from itertools import product
from joblib import Parallel, delayed

In [None]:
def scmCrossval(d_tr, d_va, d_te, M, l_W, l_d, cadre_fts, predict_fts, Tmax, record):
    mod = binaryCadreModel(
                Tmax=Tmax, record=record,
                M=M, alpha_d=0.99, alpha_W=0.99, lambda_d=l_d, lambda_W=l_W, gamma=1.)
        
    mod.fit(d_tr, 'target', cadre_fts, predict_fts, d_va, progress=False)
    
    ## evaluate on validation and test sets
    err_va = mod.scoreMetrics(d_va)
    err_te = mod.scoreMetrics(d_te)
    
    ## return the fitted model and both score tables as a tuple
    return mod, err_va, err_te

In [None]:
from sklearn.model_selection import KFold

In [None]:
l_ds = np.array([0.01, 0.001])
l_Ws = np.array([0.01, 0.001])
Ms = np.array([4,6,8,10])
n_folds = 5

In [None]:
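# random_state only takes effect when shuffle=True; recent scikit-learn versions raise an error otherwise
kf = KFold(n_splits=n_folds, shuffle=True, random_state=1414)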

n_jobs = np.minimum(12, n_folds * Ms.shape[0] * l_ds.shape[0] * l_Ws.shape[0])

results = (Parallel(n_jobs=n_jobs, backend='threading', verbose=11)(delayed(scmCrossval)
                    (D_tr.iloc[tr], D_tr.iloc[va], D_va, M, l_W, l_d, features, features, 20001, 1000) 
                    for (M, l_d, l_W, (fold, (tr, va))) in product(Ms, l_ds, l_Ws, enumerate(kf.split(D_tr)))))
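
The grid above pairs every hyperparameter combination with every fold. A small sketch on a toy index array (the `rows` and `jobs` names are illustrative) shows how many fits that produces and how the nested tuples unpack.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import KFold

Ms = [4, 6, 8, 10]
l_ds = [0.01, 0.001]
l_Ws = [0.01, 0.001]
n_folds = 5

rows = np.arange(20)  # toy stand-in for the rows of the training frame
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

# materialize the folds once so every configuration sees identical splits
folds = list(enumerate(kf.split(rows)))

# one job per (M, lambda_d, lambda_W, fold) combination
jobs = [
    (M, l_d, l_W, fold, len(tr), len(va))
    for M, l_d, l_W, (fold, (tr, va)) in product(Ms, l_ds, l_Ws, folds)
]
print(len(jobs))  # 4 * 2 * 2 * 5 combinations
```

`itertools.product` converts each input iterable to a tuple before iterating, so passing the `enumerate(kf.split(...))` generator directly, as the cell above does, also reuses the same folds for every configuration.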

In [None]:
def extract_scores(results):
    results_va, results_te = [], []
    for model, scores_va, scores_te in results:
        results_va.append(scores_va)
        results_va[-1] = results_va[-1].assign(M=model.M, lambda_d=model.lambda_d, lambda_W=model.lambda_W)
        
        results_te.append(scores_te)
        results_te[-1] = results_te[-1].assign(M=model.M, lambda_d=model.lambda_d, lambda_W=model.lambda_W)
    results_va = pd.concat(results_va).reset_index(drop=True)
    results_te = pd.concat(results_te).reset_index(drop=True)
    print(results_va.head())
    print(results_te.head())
    return results_va, results_te

In [None]:
# extract_scores returns a (validation, test) pair of score tables
results_va, results_te = extract_scores(results)
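
A natural next step, sketched here on made-up numbers, is to average each configuration's validation scores over the folds and pick the best one. The `ROC_AUC` column name is an assumption about what `scoreMetrics` returns, and the values below are toy data, not results from this notebook.

```python
import pandas as pd

# toy stand-in for the validation score table: two folds for each of two configurations
scores = pd.DataFrame({
    'M':        [4, 4, 6, 6],
    'lambda_d': [0.01, 0.01, 0.01, 0.01],
    'lambda_W': [0.01, 0.01, 0.01, 0.01],
    'ROC_AUC':  [0.80, 0.82, 0.85, 0.87],   # hypothetical column name and values
})

# mean and spread of the metric across folds for each hyperparameter setting
summary = (scores
           .groupby(['M', 'lambda_d', 'lambda_W'])['ROC_AUC']
           .agg(['mean', 'std'])
           .reset_index())

# configuration with the highest mean validation score
best = summary.loc[summary['mean'].idxmax()]
print(summary)
print(best)
```

Looking at the fold-to-fold standard deviation alongside the mean helps judge whether the gap between two configurations is larger than the noise across folds.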