# deploying yaml on optimized python images

* one node
* lightgbm
* 10 million samples / 20 features
* code stored as YAML in GitHub
* precompiled images using CPU-optimized Python libraries
    * **[yjbds/mlrun-ds](https://hub.docker.com/repository/docker/yjbds/mlrun-ds)** a data science stack
    * **[yjbds/mlrun-files](https://hub.docker.com/repository/docker/yjbds/mlrun-files)** a parquet/pandas stack

## imports

In [1]:
import mlrun
import os
import numpy as np
mlrun.mlconf.dbpath = 'http://mlrun-api:8080'

## parameters

In [2]:
CODE_BASE          = '/User/repos/functions/'
N_SAMPLES          = 100_000  # number of simulated samples
M_FEATURES         = 20
NEG_WEIGHT         = 0.5
TARGET_DATA_PATH   = '/User/mlrun/sklearn-classifier'
FILE_NAME          = 'simdata.pqt'
KEY                = 'simdata'
RNG                = 1
SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'
MODEL_KEY          = 'model'
MODEL_NAME         = 'lgb-classifier.pkl'
VERBOSE            = False

## generate some binary classification data

In [3]:
binarydatagen = mlrun.import_function(
    os.path.join(CODE_BASE, 'datagen', 'classification', 'binary.yaml')
).apply(mlrun.mount_v3io())

In [4]:
binarydatagen.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [5]:
task1 = mlrun.NewTask()
task1.with_params(
    n_samples=N_SAMPLES,
    m_features=M_FEATURES,
    weight=NEG_WEIGHT,
    target_path=TARGET_DATA_PATH,
    filename=FILE_NAME,
    key=KEY,
    random_state=RNG)

<mlrun.model.RunTemplate at 0x7f78fcc10748>

In [6]:
tsk1 = binarydatagen.run(task1, handler='create_binary_classification')

[mlrun] 2020-01-26 13:15:51,762 starting run create_binary_classification uid=39417bbf476c45b7a5cb0809e883b979  -> http://mlrun-api:8080
[mlrun] 2020-01-26 13:15:51,849 Job is running in the background, pod: create-binary-classification-vsdvh
[mlrun] 2020-01-26 13:16:02,850 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y

[mlrun] 2020-01-26 13:16:02,862 run executed, status=completed
  result = infer_dtype(pandas_collection)
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...83b979 | 0 | Jan 26 13:16:02 | completed | binary | host=create-binary-classification-vsdvh; kind=job; owner=admin | | filename=simdata.pqt; key=simdata; m_features=20; n_samples=100000; random_state=1; target_path=/User/mlrun/sklearn-classifier; weight=0.5 | | simdata |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 39417bbf476c45b7a5cb0809e883b979  , !mlrun logs 39417bbf476c45b7a5cb0809e883b979 
[mlrun] 2020-01-26 13:16:11,051 run executed, status=completed
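
The source of `binary.yaml` is not shown in this notebook. As a rough illustration of what a generator driven by `n_samples`, `m_features`, `weight` and `random_state` might do, here is a minimal standard-library sketch. The name `make_binary_data`, the reading of `weight` as the expected share of negative samples, and the way the label shifts the feature mean are all assumptions made for illustration, not the handler's actual implementation.

```python
import random

def make_binary_data(n_samples, m_features, weight=0.5, random_state=1):
    """Hypothetical stand-in for a binary-classification data generator.

    Returns (X, y): X is a list of feature rows, y a list of 0/1 labels.
    `weight` is treated as the expected fraction of negative (0) samples.
    """
    rng = random.Random(random_state)  # seeded so repeated calls agree
    X, y = [], []
    for _ in range(n_samples):
        label = 0 if rng.random() < weight else 1
        # Shift the feature mean by the label so the classes are separable.
        row = [rng.gauss(float(label), 1.0) for _ in range(m_features)]
        X.append(row)
        y.append(label)
    return X, y

X, y = make_binary_data(n_samples=1000, m_features=20, weight=0.5, random_state=1)
```

Seeding a local `random.Random` instance, rather than the module-level generator, keeps the output reproducible for a given `random_state` without affecting other code.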


______

## split the generated data

In [7]:
splitter = mlrun.import_function(
    os.path.join(CODE_BASE, 'datagen', 'splitters', 'train_valid_test.yaml')
).apply(mlrun.mount_v3io())

In [8]:
splitter.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [9]:
task1 = mlrun.NewTask()
task1.with_params(
    src_file=TARGET_DATA_PATH + '/' + FILE_NAME,
    sample=20_000,
    target_path=TARGET_DATA_PATH,
    random_state=RNG)

<mlrun.model.RunTemplate at 0x7f78fc0d4e48>

In [10]:
tsk1 = splitter.run(task1, handler='train_valid_test_splitter')

[mlrun] 2020-01-26 13:16:11,109 starting run train_valid_test_splitter uid=ecb802dffb1b43269c25f43fd7a4919a  -> http://mlrun-api:8080
[mlrun] 2020-01-26 13:16:11,191 Job is running in the background, pod: train-valid-test-splitter-7k25p
[mlrun] 2020-01-26 13:16:21,068 log artifact header at /User/mlrun/sklearn-classifier/header.pkl, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,156 log artifact xtrain at /User/mlrun/sklearn-classifier/xtrain.pqt, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,220 log artifact xvalid at /User/mlrun/sklearn-classifier/xvalid.pqt, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,262 log artifact xtest at /User/mlrun/sklearn-classifier/xtest.pqt, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,280 log artifact ytrain at /User/mlrun/sklearn-classifier/ytrain.pqt, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,298 log artifact yvalid at /User/mlrun/sklearn-classifier/yvalid.pqt, size: None, db: Y
[mlrun] 2020-01-26 13:16:21,312 log artifact ytest at /User/mlrun/skl

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...a4919a | 0 | Jan 26 13:16:20 | completed | train-valid-test | host=train-valid-test-splitter-7k25p; kind=job; owner=admin | | random_state=1; sample=20000; src_file=/User/mlrun/sklearn-classifier/simdata.pqt; target_path=/User/mlrun/sklearn-classifier | | header; xtrain; xvalid; xtest; ytrain; yvalid; ytest |


to track results use .show() or .logs() or in CLI: 
!mlrun get run ecb802dffb1b43269c25f43fd7a4919a  , !mlrun logs ecb802dffb1b43269c25f43fd7a4919a 
[mlrun] 2020-01-26 13:16:30,357 run executed, status=completed


In [11]:
tsk1.outputs

{'header': '/User/mlrun/sklearn-classifier/header.pkl',
 'xtrain': '/User/mlrun/sklearn-classifier/xtrain.pqt',
 'xvalid': '/User/mlrun/sklearn-classifier/xvalid.pqt',
 'xtest': '/User/mlrun/sklearn-classifier/xtest.pqt',
 'ytrain': '/User/mlrun/sklearn-classifier/ytrain.pqt',
 'yvalid': '/User/mlrun/sklearn-classifier/yvalid.pqt',
 'ytest': '/User/mlrun/sklearn-classifier/ytest.pqt'}
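
The splitter's code is not included here either. The sketch below shows one plausible way a seeded three-way split with a `sample` parameter could work: subsample row indices, then carve off test and validation portions. The function name and the `test_size` and `valid_size` fractions are assumptions for illustration; the real `train_valid_test_splitter` may use different proportions or logic.

```python
import random

def train_valid_test_split(n_rows, sample, test_size=0.1, valid_size=0.25,
                           random_state=1):
    """Hypothetical sketch: subsample row indices, then split them three ways.

    test_size is a fraction of the sample; valid_size is a fraction of the
    rows that remain once the test rows are removed. Both defaults are
    assumptions, not values taken from the splitter used in this notebook.
    """
    rng = random.Random(random_state)
    idx = rng.sample(range(n_rows), sample)  # without replacement, random order
    n_test = int(len(idx) * test_size)
    n_valid = int((len(idx) - n_test) * valid_size)
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test

train, valid, test = train_valid_test_split(n_rows=100_000, sample=20_000)
```

Because the three slices come from a single shuffled index list, they are disjoint by construction.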

_____
## train a classifier

In [12]:
yaml_name = os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')
# build and export the function spec only if the YAML does not exist yet
if not os.path.isfile(yaml_name):
    testfn = mlrun.code_to_function(
        kind='job',
        image='yjbds/mlrun-ds:latest',
        filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))
    testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])
    testfn.export(yaml_name)

[mlrun] 2020-01-26 13:16:30,608 function spec saved to path: /User/repos/functions/train/sklearn-classifier.yaml


In [13]:
trainfn = mlrun.import_function(
    os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')
).apply(mlrun.mount_v3io())

In [14]:
trainfn.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [19]:
task2 = mlrun.NewTask()
task2.with_params(
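    # tsk1 now holds the splitter run, which has no 'simdata' output,
    # so this resolves to None (see src_file=None in the run summary)
    src_file=tsk1.output(KEY),
    SKClassifier=SKLEARN_CLASSIFIER,
    callbacks=[],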
    xtrain=tsk1.outputs['xtrain'],
    ytrain=tsk1.outputs['ytrain'],
    xvalid=tsk1.outputs['xvalid'],
    yvalid=tsk1.outputs['yvalid'],
    target_path='/User/mlrun/models',
    name=MODEL_NAME,
    key=MODEL_KEY,
    verbose=VERBOSE)

<mlrun.model.RunTemplate at 0x7f78fbf3ef98>
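
The `SKClassifier` parameter is passed as a dotted string rather than as a class. How `sklearn-classifier.py` turns that string into a class is not shown in this notebook; a common approach, sketched here under that assumption, is to split off the class name and import the module dynamically. The helper name `resolve_class` is hypothetical.

```python
import importlib

def resolve_class(dotted_path):
    """Import and return the object named by 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Demonstrated with a standard-library class so the example runs without
# lightgbm installed; 'lightgbm.sklearn.LGBMClassifier' would be resolved
# the same way in an environment where that package is available.
OrderedDict = resolve_class('collections.OrderedDict')
instance = OrderedDict(a=1)
```

Passing the class as a string keeps the task parameters serializable, so the same training function can be pointed at a different estimator without changing its code.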

In [20]:
tsk2 = trainfn.run(task2, handler='train')

[mlrun] 2020-01-26 13:17:59,993 starting run train uid=b57510063377418ab0f90b33d14b6117  -> http://mlrun-api:8080
[mlrun] 2020-01-26 13:18:00,097 Job is running in the background, pod: train-wr4m2
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[mlrun] 2020-01-26 13:18:15,384 log artifact training-validation-plot.html at training-validation-plot.html, size: 32700, db: Y
[mlrun] 2020-01-26 13:18:15,498 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y

[mlrun] 2020-01-26 13:18:15,519 run executed, status=completed
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...4b6117 | 0 | Jan 26 13:18:08 | completed | sklearn-classifier | host=train-wr4m2; kind=job; owner=admin | | SKClassifier=lightgbm.sklearn.LGBMClassifier; callbacks=[]; key=model; name=lgb-classifier.pkl; src_file=None; target_path=/User/mlrun/models; verbose=False; xtrain=/User/mlrun/sklearn-classifier/xtrain.pqt; xvalid=/User/mlrun/sklearn-classifier/xvalid.pqt; ytrain=/User/mlrun/sklearn-classifier/ytrain.pqt; yvalid=/User/mlrun/sklearn-classifier/yvalid.pqt | train_accuracy=0.9781481481481481 | training-validation-plot.html; model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run b57510063377418ab0f90b33d14b6117  , !mlrun logs b57510063377418ab0f90b33d14b6117 
[mlrun] 2020-01-26 13:18:19,266 run executed, status=completed


In [22]:
tsk2.outputs

{'train_accuracy': 0.9781481481481481,
 'training-validation-plot.html': 'training-validation-plot.html',
 'model': '/User/mlrun/models/lgb-classifier.pkl'}
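
The run reports a `train_accuracy` result. The training function's metric code is not shown, so the following is only a sketch of the usual definition of accuracy, the fraction of predictions that match the true labels, and it assumes that this is what the reported value measures. The helper name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if not y_true:
        raise ValueError('cannot compute accuracy on empty input')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 3 of the 4 predictions match the labels
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 0.75
```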