# deploying yaml on optimized python images

* one node
* lightgbm
* 10 million samples / 20 features
* code stored as yaml in github
* precompiled images built on CPU-optimized python libraries
    * **[yjbds/mlrun-ds](https://hub.docker.com/repository/docker/yjbds/mlrun-ds)** a data science stack
    * **[yjbds/mlrun-files](https://hub.docker.com/repository/docker/yjbds/mlrun-files)** a parquet/pandas stack

## imports

In [23]:
import mlrun
import os
import numpy as np
mlrun.mlconf.dbpath = 'http://mlrun-api:8080'

## parameters

In [29]:
CODE_BASE          = '/User/repos/functions/'
N_SAMPLES          = 100_000  # rows to simulate
M_FEATURES         = 28       # matches the HIGGS feature count
NEG_WEIGHT         = 0.5
TARGET_DATA_PATH   = '/User/mlrun/models'
FILE_NAME          = 'simdata.pqt'
KEY                = 'simdata'
RNG                = 1
SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'
MODEL_KEY          = 'model'
MODEL_NAME         = 'lgb-classifier.pkl'
VERBOSE            = False
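
`CODE_BASE` ends with a slash, and the cells that follow build function paths from it. A minimal sketch of why that trailing slash matters: joining path components tolerates its absence, while plain string concatenation does not. `posixpath` is used here only so the result is the same on any OS; the `broken` variable is a hypothetical illustration.

```python
import posixpath

CODE_BASE = '/User/repos/functions/'  # value from the parameters cell

# join() adds a separator only where one is missing
joined = posixpath.join(CODE_BASE, 'datagen', 'classification', 'binary.yaml')

# concatenation relies on CODE_BASE keeping its trailing slash
concatenated = CODE_BASE + 'datagen/classification/binary.yaml'

# hypothetical: drop the trailing slash and concatenation glues two names together
broken = CODE_BASE.rstrip('/') + 'datagen/classification/binary.yaml'

print(joined)
print(concatenated)
print(broken)
```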

## generate some binary classification data

In [30]:
binarydatagen = mlrun.import_function(
    os.path.join(CODE_BASE, 'datagen', 'classification', 'binary.yaml')
).apply(mlrun.mount_v3io())

In [31]:
binarydatagen.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [32]:
task1 = mlrun.NewTask()
task1.with_params(
    n_samples=N_SAMPLES,
    m_features=M_FEATURES,
    weight=NEG_WEIGHT,
    target_path=TARGET_DATA_PATH,
    filename=FILE_NAME,
    key=KEY,
    random_state=RNG)

<mlrun.model.RunTemplate at 0x7f78fbf67e80>

In [33]:
tsk1 = binarydatagen.run(task1, handler='create_binary_classification')

[mlrun] 2020-01-26 14:35:40,509 starting run create_binary_classification uid=245e550ff213469681114228327a8e02  -> http://mlrun-api:8080
[mlrun] 2020-01-26 14:35:40,606 Job is running in the background, pod: create-binary-classification-7295j
[mlrun] 2020-01-26 14:35:53,548 log artifact simdata at /User/mlrun/models/simdata.pqt, size: None, db: Y

[mlrun] 2020-01-26 14:35:53,560 run executed, status=completed
  result = infer_dtype(pandas_collection)
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...7a8e02 | 0 | Jan 26 14:35:52 | completed | binary | host=create-binary-classification-7295j<br>kind=job<br>owner=admin | | filename=simdata.pqt<br>key=simdata<br>m_features=28<br>n_samples=100000<br>random_state=1<br>target_path=/User/mlrun/models<br>weight=0.5 | | simdata |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 245e550ff213469681114228327a8e02  , !mlrun logs 245e550ff213469681114228327a8e02 
[mlrun] 2020-01-26 14:35:59,827 run executed, status=completed
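
The source of the `create_binary_classification` handler is not shown in this notebook. The following is only a rough stand-in for what such a generator could do, written with the standard library so it runs anywhere. It assumes that `weight` is the fraction of negative samples, which the name `NEG_WEIGHT` suggests but the notebook does not confirm; the function name `make_binary_data` and the class-shifted Gaussian features are hypothetical.

```python
import random

def make_binary_data(n_samples, m_features, weight, random_state):
    """Hypothetical stand-in; `weight` is ASSUMED to be the negative-class fraction."""
    rng = random.Random(random_state)
    n_neg = int(round(n_samples * weight))
    labels = [0] * n_neg + [1] * (n_samples - n_neg)
    rng.shuffle(labels)
    # shift each feature's mean by the class label so the classes are separable
    features = [[rng.gauss(float(y), 1.0) for _ in range(m_features)]
                for y in labels]
    return features, labels

X, y = make_binary_data(n_samples=1000, m_features=28, weight=0.5, random_state=1)
print(len(X), len(X[0]), sum(y) / len(y))
```

Seeding a local `random.Random` rather than the global generator keeps the output reproducible for a given `random_state`.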


______

## split the generated data

In [34]:
splitter = mlrun.import_function(
    os.path.join(CODE_BASE, 'datagen', 'splitters', 'train_valid_test.yaml')
).apply(mlrun.mount_v3io())

In [35]:
splitter.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [36]:
task1 = mlrun.NewTask()
task1.with_params(
    src_file=TARGET_DATA_PATH + '/' + FILE_NAME,
    sample=20_000,
    target_path=TARGET_DATA_PATH,
    random_state=RNG)

<mlrun.model.RunTemplate at 0x7f7937ee2400>

In [37]:
tsk1 = splitter.run(task1, handler='train_valid_test_splitter')

[mlrun] 2020-01-26 14:35:59,880 starting run train_valid_test_splitter uid=907ad4a876fa4205a40a668956446468  -> http://mlrun-api:8080
[mlrun] 2020-01-26 14:35:59,974 Job is running in the background, pod: train-valid-test-splitter-vdn6h
[mlrun] 2020-01-26 14:36:09,842 log artifact header at /User/mlrun/models/header.pkl, size: None, db: Y
[mlrun] 2020-01-26 14:36:09,953 log artifact xtrain at /User/mlrun/models/xtrain.pqt, size: None, db: Y
[mlrun] 2020-01-26 14:36:10,052 log artifact xvalid at /User/mlrun/models/xvalid.pqt, size: None, db: Y
[mlrun] 2020-01-26 14:36:10,104 log artifact xtest at /User/mlrun/models/xtest.pqt, size: None, db: Y
[mlrun] 2020-01-26 14:36:10,146 log artifact ytrain at /User/mlrun/models/ytrain.pqt, size: None, db: Y
  labels = getattr(columns, 'labels', None) or [
  return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
  labels, = index.labels
  result = infer_dtype(pandas_collection)
[mlrun] 2020-01-26 14:36:10,182 log artifact yvalid

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...446468 | 0 | Jan 26 14:36:09 | completed | train-valid-test | host=train-valid-test-splitter-vdn6h<br>kind=job<br>owner=admin | | random_state=1<br>sample=20000<br>src_file=/User/mlrun/models/simdata.pqt<br>target_path=/User/mlrun/models | | header<br>xtrain<br>xvalid<br>xtest<br>ytrain<br>yvalid<br>ytest |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 907ad4a876fa4205a40a668956446468  , !mlrun logs 907ad4a876fa4205a40a668956446468 
[mlrun] 2020-01-26 14:36:19,229 run executed, status=completed


In [38]:
tsk1.outputs

{'header': '/User/mlrun/models/header.pkl',
 'xtrain': '/User/mlrun/models/xtrain.pqt',
 'xvalid': '/User/mlrun/models/xvalid.pqt',
 'xtest': '/User/mlrun/models/xtest.pqt',
 'ytrain': '/User/mlrun/models/ytrain.pqt',
 'yvalid': '/User/mlrun/models/yvalid.pqt',
 'ytest': '/User/mlrun/models/ytest.pqt'}
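
The splitter's code lives in the imported yaml and is not shown here. As a sketch of the idea behind the six x/y artifacts above: subsample `sample` rows, then cut the sample into disjoint test, validation and training parts. The function below is hypothetical, and the `test_size` and `valid_size` fractions are illustrative assumptions, not values taken from the real function.

```python
import random

def train_valid_test_split(rows, labels, sample, random_state,
                           test_size=0.1, valid_size=0.25):
    """Hypothetical sketch; the split fractions are illustrative assumptions."""
    rng = random.Random(random_state)
    # draw `sample` distinct row indices in random order
    idx = rng.sample(range(len(rows)), sample)
    n_test = int(sample * test_size)
    n_valid = int((sample - n_test) * valid_size)
    parts = {
        'test': idx[:n_test],
        'valid': idx[n_test:n_test + n_valid],
        'train': idx[n_test + n_valid:],
    }
    # keep features and labels aligned by indexing both with the same ids
    return {
        name: ([rows[i] for i in ids], [labels[i] for i in ids])
        for name, ids in parts.items()
    }

rows = [[float(i)] for i in range(100)]
labels = [i % 2 for i in range(100)]
splits = train_valid_test_split(rows, labels, sample=20, random_state=1)
print({name: len(x) for name, (x, _) in splits.items()})
```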

_____

## train a classifier

In [39]:
yaml_name = os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')
# export the function spec only if the yaml does not exist yet
if not os.path.isfile(yaml_name):
    testfn = mlrun.code_to_function(
        kind='job',
        filename=os.path.join(CODE_BASE, 'train', 'sklearn-classifier.py'))
    testfn.build_config(base_image='yjbds/mlrun-ds:latest', commands=[])
    testfn.export(yaml_name)

In [40]:
trainfn = mlrun.import_function(
    os.path.join(CODE_BASE, 'train', 'sklearn-classifier.yaml')
).apply(mlrun.mount_v3io())

In [41]:
trainfn.deploy(skip_deployed=True, with_mlrun=False)

'ready'

In [42]:
task2 = mlrun.NewTask()
task2.with_params(
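    # tsk1 was reassigned to the splitter run, which logs no 'simdata'
    # artifact, so this resolves to None (see src_file=None in the run summary)
    src_file=tsk1.output(KEY),
    SKClassifier=SKLEARN_CLASSIFIER,
    callbacks=[],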
    xtrain=tsk1.outputs['xtrain'],
    ytrain=tsk1.outputs['ytrain'],
    xvalid=tsk1.outputs['xvalid'],
    yvalid=tsk1.outputs['yvalid'],
    target_path=TARGET_DATA_PATH,
    name=MODEL_NAME,
    key=MODEL_KEY,
    verbose=VERBOSE)

<mlrun.model.RunTemplate at 0x7f7937ee2c50>
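
The classifier is passed to the task as a dotted string (`SKLEARN_CLASSIFIER`), not as a class object. How the `train` handler resolves it is not shown in this notebook; one common way to turn such a string into a class is `importlib`, sketched below. `load_class` is a hypothetical helper, and a standard-library class is used in the demo so the sketch runs without lightgbm installed.

```python
import importlib

def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# stand-in for 'lightgbm.sklearn.LGBMClassifier'
cls = load_class('collections.OrderedDict')
print(cls)
```

Passing the class by name keeps the task parameters plain strings, so they can be stored and displayed in the run summary like any other parameter.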

In [43]:
tsk2 = trainfn.run(task2, handler='train')

[mlrun] 2020-01-26 14:36:19,413 starting run train uid=6eb38fb5166b40099e6e579f00c1ad22  -> http://mlrun-api:8080
[mlrun] 2020-01-26 14:36:19,496 Job is running in the background, pod: train-n4qgh
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[mlrun] 2020-01-26 14:36:31,442 log artifact training-validation-plot.html at training-validation-plot.html, size: 32968, db: Y
[mlrun] 2020-01-26 14:36:31,512 log artifact model at /User/mlrun/models/lgb-classifier.pkl, size: None, db: Y

[mlrun] 2020-01-26 14:36:31,540 run executed, status=completed
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...c1ad22 | 0 | Jan 26 14:36:25 | completed | sklearn-classifier | host=train-n4qgh<br>kind=job<br>owner=admin | | SKClassifier=lightgbm.sklearn.LGBMClassifier<br>callbacks=[]<br>key=model<br>name=lgb-classifier.pkl<br>src_file=None<br>target_path=/User/mlrun/models<br>verbose=False<br>xtrain=/User/mlrun/models/xtrain.pqt<br>xvalid=/User/mlrun/models/xvalid.pqt<br>ytrain=/User/mlrun/models/ytrain.pqt<br>yvalid=/User/mlrun/models/yvalid.pqt | train_accuracy=0.9856296296296296 | training-validation-plot.html<br>model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 6eb38fb5166b40099e6e579f00c1ad22  , !mlrun logs 6eb38fb5166b40099e6e579f00c1ad22 
[mlrun] 2020-01-26 14:36:38,665 run executed, status=completed


In [45]:
tsk2.outputs

{'train_accuracy': 0.9856296296296296,
 'training-validation-plot.html': 'training-validation-plot.html',
 'model': '/User/mlrun/models/lgb-classifier.pkl'}
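
The only metric logged is `train_accuracy`. Assuming it is the usual fraction of correctly classified rows (the handler's code is not shown, so this is an assumption), it can be sketched as below. The name suggests it is measured on the training split, in which case it is likely to overstate how the model does on unseen data; the validation and test splits are the ones to check for that.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# toy labels, not data from the run above
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```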