# XGBoost trainer

This notebook function handles training and logging of xgboost models **only**, exposing both the sklearn and low level api's.

## steps
1. generate an xgboost model configuration by selecting one of 5 available types
2. get a sample of data from a data source (random rows, consecutive rows, or the entire dataset)
3. split the data into train, validation, and test sets.  

> _PLEASE NOTE_:  there are many approaches to cross validation (cv) and as many ways to implement cv in scikit learn.  In this third stage, an alternative, two-way train and test split can be created.  The training set would then, for example, serve as input to a cross validation splitter.  The latter creates multiple training and validation subsets, called folds. These folds are then input, either in sequence or in parallel into the fit algorithm.

4. train the model
5. dump the model
6. generate predictions and probabilities
7. (calibrate probabilities if needed, wip)
8. calculate evaluation statistics and plots

All these steps have been separated here into independent functions since many can be reused for other model types. Some of the following functions will be transferred in the `mlrun.mlutils` module. Additionally, each function contains its own imports in order to isolate and identify dependencies.

In [3]:
# nuclio: ignore
import nuclio

In [4]:
%nuclio config kind = "job"
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting kind to 'job'
%nuclio: setting spec.image to 'mlrun/ml-models'


In [5]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [6]:
from mlrun.datastore import DataItem

## generate an xgb model

In [19]:
def _gen_xgb_model(model_type: str, xgb_params: dict):
    """generate an xgboost model
    
    Multiple model types that can be estimated using
    the XGBoost Scikit-Learn API
    
    :param model_type: one of "classifier", "regressor",
                       "ranker", "rf_classifier", or
                      "rf_regressor"
    :param xgb_params: parameters passed through the 
                       function execution context
    """
    from json import load
    from mlrun.mlutils import get_class_fit, create_class

    # generate model and fit function
    mtypes = {
        "classifier"   : "xgboost.XGBClassifier",
        "regressor"    : "xgboost.XGBRegressor",
        "ranker"       : "xgboost.XGBRanker",
        "rf_classifier": "xgboost.XGBRFClassifier",
        "rf_regressor" : "xgboost.XGBRFRegressor"
    }
    if model_type not in mtypes.keys():
        raise Exception("unrecognized model types, see help documentation")
    
    model_config = get_class_fit(mtypes[model_type])

    for k, v in xgb_params:
        if k.startswith("CLASS_"):
            model_config["CLASS"][k[6:]] = v
        if k.startswith("FIT_"):
            model_config["FIT"][k[4:]] = v

    ClassifierClass = create_class(model_config["META"]["class"])
    model = ClassifierClass(**model_config["CLASS"])

    return model, model_config

## get a sample of data

In [20]:
def _get_sample(src:DataItem, sample: int, label: str, reader=None):
    """generate data sample to be split (candidate for mlrun)
     
    Returns features matrix and header (x), and labels (y)
    :param src:    full path and filename of data artifact
    :param sample: sample size from data source, use negative 
                   integers to sample randomly, positive to
                   sample consecutively from the first row
    :param label:  label column title
    :param reader: pandas type reader (read_csv, read_parquet, ...) returning
                   a pandas dataframe, and with a `dropna` attribute
    """
    import pandas as pd
    table = src.as_df()
    
    # get sample
    if (sample == -1) or (sample >= 1):
        # get all rows, or contiguous sample starting at row 1.
        raw = table.dropna()
        labels = raw.pop(label)
        raw = raw.iloc[:sample, :]
        labels = labels.iloc[:sample]
    else:
        # grab a random sample
        raw = table.dropna().sample(sample * -1)
        labels = raw.pop(label)

    return raw, labels, raw.columns.values

## split data into train, validate and test

In [21]:
def _get_splits(
    raw, 
    labels, 
    n_ways: int = 3,
    test_size: float = 0.15,
    valid_size: float = 0.30,
    label_names: list = ["labels"],
    random_state: int = 1
):
    """generate train and test sets (candidate for mlrun)

    cross validation:
    1. cut out a test set
    2a. use the training set in a cross validation scheme, or
    2b. make another split to generate a validation set
    
    2 parts (n_ways=2): train and test set only
    3 parts (n_ways=3): train, validation and test set
    4 parts (n_ways=4): n_ways=3 + a held-out probability calibration set
    
    :param raw:            dataframe or numpy array of raw features
    :param labels:         dataframe or numpy array of raw labels
    :param n_ways:         (3) split data into 2, 3, or 4 parts
    :param test_size:      proportion of raw data to set asid as test data
    :param valid_size:     proportion of remaining data to be set as validation
    :param label_names:         label names
    :param random_state:   (1) random number seed
    """
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    if isinstance(raw, np.ndarray):
        if labels.ndim==1:
            labels=labels.reshape(-1,1)
        xy = np.concatenate([raw, labels], axis=1)
    else:
        if isinstance(labels, pd.Series):
            labels = pd.DataFrame(data=labels, columns=label_names)
        xy = pd.concat([raw, labels], axis=1)
        
    x, xte, y, yte = train_test_split(xy, labels, test_size=test_size,
                                      random_state=random_state)
    if n_ways==2:
        return (x, y), (xte, yte), None, None
    elif n_ways==3:
        xtr, xva, ytr, yva = train_test_split(x, y,train_size=valid_size,
                                              random_state=random_state)
        return (xtr, ytr), (xva, yva), (xte, yte), None
    elif n_ways==4:
        xt, xva, yt, yva = train_test_split(x, y,train_size=valid_size,
                                              random_state=random_state)
        xtr, xcal, ytr, ycal = train_test_split(xt, yt, train_size=0.8,
                                              random_state=random_state)
        return (xtr, ytr), (xva, yva), (xte, yte), (xcal, ycal)
    else:
        raise Exception("n_ways must be in the range [2,4]")

## save the test data separately


In [22]:
def _save_test_set(
    context, 
    xtest, 
    ytest, 
    header: list, 
    label: str = "labels", 
    file_ext: str = "parquet", 
    index: bool = False,
    debug: bool = False
):
    """log a held out test set

    :param context:    the function execution context
    :param xtest:      test features, as np.ndarray output from `get_splits`
    :param ytest:      test labels, as np.ndarray output from `get_splits`
    :param header:     ([])features header if required
    :param label:      ("labels") name of label column
    :param file_ext:   format of test set file
    :param index:      preserve index column
    :param debug:      (False)
    """
    import pandas as pd
    from mlrun import mlconf
    
    test_set = pd.concat(
        [pd.DataFrame(data=xtest, columns=header),
         pd.DataFrame(data=ytest.values, columns=[label])],
        axis=1,)
    
    if debug:
        test_set.to_parquet(mlconf.artifact_path+'/test_set-dev.parquet')
        
    context.log_dataset("test_set", df=test_set, format=file_ext, index=index)

## dump an xgb model

In [23]:
def _dump_xgb_model(
    context, 
    model,
    dump_type: str,
    dest_folder: str,
    dest_name: str
):
    """serialize/log model
    
    XGBoost model can be save in 3 different ways:
    1. pickle the internal _booster object, inside the model
    2. using model.save_model("fn.bin") using a legacy binary xgb format
    2. using model.save_model("fn.json") using a portable json format
    
    :param context:     the function"s execution context
    :param model:       the fitted xgboost model
    :param dump_type:   "pickle" legacy", or "json", 
    :param dest_folder: path for serialized model 
    :param dest_name:   name for serialized model file
    """
    from cloudpickle import dumps, dump
    import os
    
    os.makedirs(dest_folder, exist_ok=True)
    #try:
    # https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.save_model
    if dump_type == "pickle":
        filename = f"{dest_name}.pkl"
    else:
        # this save all contents as json
        # see https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
        filename = f"{dest_name}.json"
    model.save_model(f"{dest_folder}/{filename}")

    # else:
    # this saves all internal contents as pickle
    #_booster = model.get_booster()
    #dump(_booster, open(f"{dest_folder}/{dest_name}-dump.pkl", "wb"))
    #dump(model, open(f"{dest_folder}/{dest_name}-dump-model.pkl", "wb"))

    # this will be saved as pickle but doesn't include file extension so we added it in key
    context.log_model("xgb-model", model_dir=dest_folder,
                      model_file=filename, metrics=context.results)
    #except Exception as e:
    #    print("xgboost model serialization error", str(e))

## plots

In [24]:
def plot_confusion_matrix(
    labels,
    predictions,
    classes,
    normalize="all",
    title='Confusion matrix',
    cmap=None
):
    """prints and plots the confusion matrix.
    
    """
    import matplotlib.pyplot as plt
    from sklearn import metrics
    import numpy as np
    import itertools
    
    if not cmap:
        cmap = plt.cm.Blues

    cm = metrics.confusion_matrix(labels, predictions, normalize=normalize)
    
    # plt.gcf().set_size_inches(30, 10)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round(cm[i, j], 2),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    #plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt.gcf()

In [25]:
def plot_roc(
    context,
    y_labels,
    y_probs,
    fpr_label: str = "false positive rate",
    tpr_label: str = "true positive rate",
    title: str = "roc curve",
    legend_loc: str = "best",
):
    """plot roc curves

    TODO:  add averaging method (as string) that was used to create probs, 
    display in legend

    :param context:      the function context
    :param y_labels:     ground truth labels, hot encoded for multiclass  
    :param y_probs:      model prediction probabilities
    :param key:          ("roc") key of plot in artifact store
    :param plots_dir:    ("plots") destination folder relative path to artifact path
    :param fmt:          ("png") plot format
    :param fpr_label:    ("false positive rate") x-axis labels
    :param tpr_label:    ("true positive rate") y-axis labels
    :param title:        ("roc curve") title of plot
    :param legend_loc:   ("best") location of plot legend
    """
    from sklearn import metrics
    import matplotlib.pyplot as plt
    from mlrun.mlutils import gcf_clear
    
    # clear matplotlib current figure
    gcf_clear(plt)

    # draw 45 degree line
    plt.plot([0, 1], [0, 1], "k--")

    # labelling
    plt.xlabel(fpr_label)
    plt.ylabel(tpr_label)
    plt.title(title)
    plt.legend(loc=legend_loc)

    # single ROC or mutliple
    if y_labels.ndim > 2:
        # data accummulators by class
        fpr = dict()
        tpr = dict()
        roc_auc = dict()
        for i in range(y_labels[:, :-1].shape[1]):
            fpr[i], tpr[i], _ = metrics.roc_curve(
                y_labels[:, i], y_probs[:, i], pos_label=1
            )
            roc_auc[i] = metrics.auc(fpr[i], tpr[i])
            plt.plot(fpr[i], tpr[i], label=f"class {i}")
    else:
        fpr, tpr, _ = metrics.roc_curve(y_labels, y_probs[:,-1])
        plt.plot(fpr, tpr, label=f"positive class")

    return plt.gcf()

## probabilities

### generate probabilities

In [26]:
def gen_proba(
    context,
    feats,
    labels,
    model,
    score_method,
    plots_dest,
    ntree_limit=None,
    validate_features=True,
    base_margin=None
):
    """ generate predictions and validation stats
    
    :param context:           the function execution context
    :param feats:             validation features array 
    :param labels:            validation ground-truth labels
    :param model:             estimated model
    :param scrore_method:     ("average") multiclass scoring
    :param plots_dest:        destination subfolder for plot artifacts
    :param ntree_limit:       (None) limit no. trees used in prediction
    :param validate_features: (True) ensure consistent feature names 
                              between model and input data
    :param base_margin:       (None) undefined
    """
    from sklearn import metrics
    from mlrun.artifacts import PlotArtifact
    from mlrun.mlutils import gcf_clear
    from xgboost import XGBClassifier
    import matplotlib.pyplot as plt
    
    ypred = model.predict(feats, False, ntree_limit, validate_features, base_margin)
    
    y_proba = []
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(feats, ntree_limit, validate_features, base_margin)
    ypred_binary = [round(value) for value in y_proba[:,-1]]
    
    average_precision = metrics.average_precision_score(labels, y_proba[:,-1], average=score_method)
    context.log_result(f"avg_precision", average_precision)
    context.log_result(f"rocauc", metrics.roc_auc_score(labels, y_proba[:,-1]))
    context.log_result(f"accuracy_score", float(metrics.accuracy_score(labels, ypred_binary)))
    context.log_result(f"f1_score", metrics.f1_score(labels, ypred_binary, average=score_method))
    
    # ROC plot
    context.log_artifact(
        PlotArtifact("roc", body=plot_roc(context, labels, y_proba)), local_path=f"{plots_dest}/roc")
    gcf_clear(plt)

    body = plot_confusion_matrix(labels, ypred_binary, classes=labels.labels.unique()) 
    context.log_artifact(PlotArtifact("confusion", body=body), local_path=f"{plots_dest}/confusion")
    
    return y_proba

## train

In [27]:
def train_model(
    context,
    model_type: str,
    dataset: DataItem,
    label_column: str = "labels",
    sample: int = -1,
    test_size: float = 0.05,
    valid_size: float = 0.75,
    random_state: int = 1,
    model_filename: str = "xgb-model",
    models_dest: str = "",
    plots_dest: str = "",
    score_method: str = "micro",
    file_ext: str = "parquet",
    model_pkg_file: str = "",    
) -> None:
    """train an xgboost model.

    :param context:           the function context
    :param model_type:        the model type to train, 'classifier', 'regressor'...
    :param dataset:           ("data") name of raw data file
    :param label_column:      ground-truth (y) labels
    :param sample:            Selects the first n rows, or select a sample
                              starting from the first. If negative <-1, select
                              a random sample
    :param model_filename:    model file filename,
                              points to a directory
    :param test_size:         (0.05) test set size
    :param valid_size:          (0.75) Once the test set has been removed the
                              training set gets this proportion.
    :param random_state:      (1) sklearn rng seed
    :param models_dest:       models subfolder on artifact path
    :param plots_dest:        plot subfolder on artifact path
    :param score_method:      for multiclass classification
    
    :param file_ext:          format for test_set_key hold out data
    :param model_pkg_file:    json model config file                                  
    """
    # deprecate:
    models_dest = models_dest or "models"
    plots_dest = plots_dest or f"plots/{context.name}"
    
    # get a sample from the raw data
    raw, labels, header = _get_sample(dataset, sample, label_column)
    
    # split the sample into train validate, test and calibration sets:
    (xtr,ytr), (xva,yva), (xte,yte), (xcal, ycal) = _get_splits(raw, labels, 4,
                                                                test_size, 
                                                                valid_size, 
                                                                ["labels"],
                                                                random_state)
        
    # get xgboost model and model config
    model, model_config = _gen_xgb_model(model_type, context.parameters.items())
    
    # update the model config with training data and callbacks
    model_config["FIT"].update({"X": xtr,"y": ytr.values})
    
    # run the fit
    model.fit(**model_config["FIT"])
    
    # generate predictions
    y_proba = gen_proba(context, xva, yva, model, score_method, plots_dest)
    
    # serialize the model
    context.logger.info(f'saving model to {models_dest}/{model_filename}..')
    print(model)
    _dump_xgb_model(context, model, "json", models_dest, model_filename)

    # calibrate probabilities
    # y_proba_cal = proba_calibration(model, xcal, ycal)

In [28]:
# nuclio: end-code

### mlconfig

In [1]:
from mlrun import mlconf
import os

mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'
mlconf.artifact_path = mlconf.artifact_path or f'{os.environ["HOME"]}/artifacts'

### save

In [2]:
from mlrun import code_to_function 
# create job function object from notebook code
fn = code_to_function("xgb_trainer")

# add metadata (for templates and reuse)
fn.spec.default_handler = "train_model"
fn.spec.description = "train any classifier using scikit-learn's API"
fn.metadata.categories = ["training", "ml"]
fn.metadata.labels = {"author": "yjb", "framework": "xgboost"}
fn.export("function.yaml")

[mlrun] 2020-05-02 19:50:15,157 function spec saved to path: function.yaml


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f71a91f9588>

### test function

In [3]:
# load function from marketplacen
from mlrun import import_function

# vcs_branch = 'development'
# base_vcs = f'https://raw.githubusercontent.com/mlrun/functions/{vcs_branch}/'
# mlconf.hub_url = mlconf.hub_url or base_vcs + f'{name}/function.yaml'
# fn = import_function("hub://xgb_trainer")

In [4]:
from mlrun import mount_v3io, NewTask, run_local

if "V3IO_HOME" in list(os.environ):
    # mlrun on the iguazio platform
    from mlrun import mount_v3io
    fn.apply(mount_v3io())
else:
    # mlrun is setup using the instructions at 
    # https://github.com/mlrun/mlrun/blob/master/hack/local/README.md
    from mlrun.platforms import mount_pvc
    fn.apply(mount_pvc("nfsvol", "nfsvol", "/home/joyan/data"))

In [5]:
gpus = False

task_params = {
    "name" : "tasks xgb cpu trainer",
    "params" : {
        "model_type"         : "classifier",
        #"num_class"          : 3,
        "CLASS_tree_method"  : "gpu_hist" if gpus else "hist",
        "CLASS_objective"    : "binary:logistic",
        "CLASS_random_state" : 1,
        "sample"             : -1,
        "label_column"       : "labels",
        "test_size"          : 0.10,
        "valid_size"         : 0.75,
        "score_method"       : "weighted",
        "models_dest"        : "xgb_trainer/models",
        "plots_dest"         : "xgb_trainer/plots",
    }}

### run remotely

In [7]:
run = fn.run(
    NewTask(**task_params),
    inputs={"dataset"  : os.path.join(mlconf.artifact_path, "breast_cancer.parquet")})

[mlrun] 2020-05-02 19:53:05,985 starting run tasks xgb cpu trainer uid=0b24dfec740a49ccb65d522e6e5dfaee  -> http://10.196.88.27:80
[mlrun] 2020-05-02 19:53:06,247 Job is running in the background, pod: tasks-xgb-cpu-trainer-rtxr4
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
No handles with labels found to put in legend.
[mlrun] 2020-05-02 19:53:15,162 log artifact roc at /User/artifacts/xgb_trainer/plots/roc, size: 30870, db: Y
[mlrun] 2020-05-02 19:53:15,324 log artifact confusion at /User/artifacts/xgb_trainer/plots/confusion, size: 25332, db: Y
[mlrun] 2020-05-02 19:53:15,325 saving model to xgb_trainer/models/xgb-model..
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...6e5dfaee,0,May 02 19:53:13,completed,tasks xgb cpu trainer,host=tasks-xgb-cpu-trainer-rtxr4kind=jobowner=adminv3io_user=admin,dataset,CLASS_objective=binary:logisticCLASS_random_state=1CLASS_tree_method=histlabel_column=labelsmodel_type=classifiermodels_dest=xgb_trainer/modelsplots_dest=xgb_trainer/plotssample=-1score_method=weightedtest_size=0.1valid_size=0.75,accuracy_score=1.0avg_precision=1.0f1_score=1.0rocauc=1.0,rocconfusionxgb-model


to track results use .show() or .logs() or in CLI: 
!mlrun get run 0b24dfec740a49ccb65d522e6e5dfaee  , !mlrun logs 0b24dfec740a49ccb65d522e6e5dfaee 
[mlrun] 2020-05-02 19:53:18,633 run executed, status=completed


# tests

WIP

In [19]:
def test_gen_xgb_model():
    import xgboost
    c, j = _gen_xgb_model("rf_classifier", {})
    assert isinstance(c, xgboost.XGBRFClassifier)
test_gen_xgb_model()    

In [None]:
from mlrun.run import get_dataitem
breast_cancer = get_dataitem(mlconf.artifact_path+"/breast_cancer.parquet")
classifier-data = get_dataitem(mlconf.artifact_path+"/classifier-data.csv")
iris = get_dataitem(mlconf.artifact_path+"/iris.parquet")

In [20]:
def test_get_sample():
    from mlrun import mlconf
    r, l, h = _get_sample(breast_cancer, -1, "labels")
    assert r.shape[0]==l.shape[0]
test_get_sample()

In [21]:
def test_get_splits():
    from mlrun import mlconf
    r, l, h = _get_sample(classifier-data, -1, "labels")
    (xtr, ytr), (xva, yva), (xte, yte), (xcal, ycal) = _get_splits(r, l, 4)

    assert xtr.shape[0]+xva.shape[0]+xte.shape[0]+xcal.shape[0] == r.shape[0]
test_get_splits()

In [22]:
#def test_save_test_set():
r, l, h = _get_sample(iris, -1, "labels")
A = _get_splits(r,l,3)
from mlrun import get_or_create_ctx
_save_test_set(get_or_create_ctx("test"), A[2][1], A[2][1], h, debug=True)
import pandas as pd
    # pd.read_parquet()
    # assert
#test_save_test_set()

[mlrun] 2020-04-30 20:47:21,886 logging run results to: http://mlrun-api:8080
[mlrun] 2020-04-30 20:47:21,970 log artifact test_set at test_set.parquet, size: 3508, db: Y
