# deploying yaml on optimized python images

* one node
* lightgbm
* 10 million samples / 20 features
* code stored as yaml in github
* precompiled images using CPU-optimized python libraries
    * **[yjbds/mlrun-ds](https://hub.docker.com/repository/docker/yjbds/mlrun-ds)** a data science stack
    * **[yjbds/mlrun-files](https://hub.docker.com/repository/docker/yjbds/mlrun-files)** a parquet/pandas stack

## imports

In [1]:
import mlrun
import os
import numpy as np
mlrun.mlconf.dbpath = 'http://mlrun-api:8080'

## parameters

In [2]:
TARGET_CODE_BASE   = '/User/repos/functions/'           
N_SAMPLES          = 100_000  # size of HIGGS data
M_FEATURES         = 20
NEG_WEIGHT         = 0.5
TARGET_DATA_PATH   = '/User/mlrun/sklearn-classifier'
FILE_NAME          = 'simdata.pqt'
KEY                = 'simdata'
RNG                = 1
SKLEARN_CLASSIFIER = 'lightgbm.sklearn.LGBMClassifier'
MODEL_KEY          = 'model'
MODEL_NAME         = MODEL_KEY
VERBOSE            = False
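
`SKLEARN_CLASSIFIER` is passed to the training function as a dotted string rather than as a class. The training code itself is not shown in this notebook, so the following is only a sketch of one common way such a string could be resolved into a class. The helper name `resolve_class` is hypothetical, and `collections.OrderedDict` stands in for the classifier so the example runs without lightgbm installed.

```python
import importlib


def resolve_class(dotted_path):
    """Import and return the object named by a 'package.module.ClassName' string."""
    module_name, _, class_name = dotted_path.rpartition('.')
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# stdlib stand-in for 'lightgbm.sklearn.LGBMClassifier' (assumption for illustration)
OrderedDict = resolve_class('collections.OrderedDict')
instance = OrderedDict(a=1)
```

With lightgbm available, `resolve_class(SKLEARN_CLASSIFIER)` would return the `LGBMClassifier` class in the same way.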

## generate some binary classification data

In [3]:
binarydatagen = mlrun.import_function(
    os.path.join(TARGET_CODE_BASE, 'datagen/classification', 'binary.yaml')
).apply(mlrun.mount_v3io())

In [5]:
binarydatagen.deploy(skip_deployed=True)

'ready'

In [6]:
task1 = mlrun.NewTask()
task1.with_params(
    n_samples=N_SAMPLES,
    m_features=M_FEATURES,
    weight=NEG_WEIGHT,
    target_path=TARGET_DATA_PATH,
    filename=FILE_NAME,
    key=KEY,
    random_state=RNG)

<mlrun.model.RunTemplate at 0x7fe98897f4a8>

In [7]:
tsk1 = binarydatagen.run(task1, handler='create_binary_classification')

[mlrun] 2020-01-23 11:46:49,385 starting run create_binary_classification uid=e1164e49ef22478791f5b23fea2de60b  -> http://mlrun-api:8080
[mlrun] 2020-01-23 11:46:49,486 Job is running in the background, pod: create-binary-classification-j6gng
[mlrun] 2020-01-23 11:47:00,255 log artifact simdata at /User/mlrun/sklearn-classifier/simdata.pqt, size: None, db: Y

[mlrun] 2020-01-23 11:47:00,268 run executed, status=completed
  result = infer_dtype(pandas_collection)
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...2de60b | 0 | Jan 23 11:46:59 | completed | binary | host=create-binary-classification-j6gng, kind=job, owner=admin | | filename=simdata.pqt, key=simdata, m_features=20, n_samples=100000, random_state=1, target_path=/User/mlrun/sklearn-classifier, weight=0.5 | | simdata |


to track results use .show() or .logs() or in CLI: 
!mlrun get run e1164e49ef22478791f5b23fea2de60b  , !mlrun logs e1164e49ef22478791f5b23fea2de60b 
[mlrun] 2020-01-23 11:47:08,728 run executed, status=completed


____
## tests

In [8]:
import pandas as pd
df = pd.read_parquet(os.path.join(TARGET_DATA_PATH, FILE_NAME), engine='pyarrow')

In [9]:
assert tsk1.output(KEY) == os.path.join(TARGET_DATA_PATH, FILE_NAME), "binary.yaml failed to create a file"
assert df.shape == (N_SAMPLES, M_FEATURES+1), "simulation data artifact is not of the correct dimensions"
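
The shape check above does not verify the class balance requested through `NEG_WEIGHT`. A minimal sketch of such a check follows. It assumes that `weight` is the fraction of negative samples and that labels are encoded as 0/1; neither is confirmed by this notebook. The helper names and the tolerance are hypothetical, and a plain list stands in for the label column of `df`, whose name is not shown here.

```python
from collections import Counter


def negative_fraction(labels, negative_label=0):
    """Return the share of samples carrying the negative label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return counts[negative_label] / total


def check_balance(labels, expected, tol=0.02):
    """Assert that the negative share is within tol of the expected weight."""
    frac = negative_fraction(labels)
    assert abs(frac - expected) <= tol, (
        f"negative fraction {frac:.3f} differs from expected {expected}")
    return frac


# stand-in labels; in the notebook this would be the label column of df
labels = [0] * 50 + [1] * 50
observed = check_balance(labels, expected=0.5)
```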

_____
## train a classifier

In [17]:
trainfn = mlrun.import_function(
    os.path.join(TARGET_CODE_BASE, 'train', 'sklearn-classifier.yaml')
).apply(mlrun.mount_v3io())

In [18]:
trainfn.deploy()

[mlrun] 2020-01-23 11:49:20,163 starting remote build, image: .mlrun/func-default-sklearn-classifier-latest
INFO[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest
INFO[0000] Resolved base name yjbds/mlrun-ds:latest to yjbds/mlrun-ds:latest
INFO[0000] Downloading base image yjbds/mlrun-ds:latest
INFO[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory
INFO[0000] Downloading base image yjbds/mlrun-ds:latest
INFO[0000] Built cross stage deps: map[]
INFO[0000] Downloading base image yjbds/mlrun-ds:latest
INFO[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:e4dd2f2f98d45ea9b78e8776e998e0c5f4d19099676464c0dd486139d6f391dc: no such file or directory
INFO[0000] Downloading base image yjbds/mlrun-ds:latest

True

In [19]:
task2 = mlrun.NewTask()
task2.with_params(
    src_file=tsk1.output(KEY),
    SKClassifier=SKLEARN_CLASSIFIER,
    name=MODEL_NAME,
    key=MODEL_KEY,
    verbose=VERBOSE,
    random_state=RNG,
    callbacks=[])

<mlrun.model.RunTemplate at 0x7fe978656748>

In [20]:
tsk2 = trainfn.run(task2, handler='train')

[mlrun] 2020-01-23 11:50:58,444 starting run train uid=d7118c8161b9487ea79b136cd2d4a0cc  -> http://mlrun-api:8080
[mlrun] 2020-01-23 11:50:58,533 Job is running in the background, pod: train-s9w4j
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[mlrun] 2020-01-23 11:51:12,955 log artifact model at model, size: None, db: Y
[mlrun] 2020-01-23 11:51:12,974 log artifact xtest at xtest.pkl, size: None, db: Y
[mlrun] 2020-01-23 11:51:12,998 log artifact ytest at ytest.pkl, size: None, db: Y

[mlrun] 2020-01-23 11:51:13,022 run executed, status=completed
  labels = getattr(columns, 'labels', None) or [
  return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
  labels, = index.labels
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...d4a0cc | 0 | Jan 23 11:51:07 | completed | sklearn-classifier | host=train-s9w4j, kind=job, owner=admin | | SKClassifier=lightgbm.sklearn.LGBMClassifier, callbacks=[], key=model, name=model, random_state=1, src_file=/User/mlrun/sklearn-classifier/simdata.pqt, verbose=False | train_accuracy=0.9546808100860753 | model, xtest, ytest |


to track results use .show() or .logs() or in CLI: 
!mlrun get run d7118c8161b9487ea79b136cd2d4a0cc  , !mlrun logs d7118c8161b9487ea79b136cd2d4a0cc 
[mlrun] 2020-01-23 11:51:17,725 run executed, status=completed


In [21]:
tsk2.outputs

{'train_accuracy': 0.9546808100860753,
 'model': 'model',
 'xtest': 'xtest.pkl',
 'ytest': 'ytest.pkl'}

_____
## train another classifier


In [24]:
task3 = mlrun.NewTask()
task3.with_params(
    src_file=tsk1.output(KEY),
    SKClassifier='xgboost.XGBClassifier',
    name='xgb-classifier.pkl',
    key='xgb-classifier',
    verbose=VERBOSE,
    random_state=RNG,
    callbacks=[])

<mlrun.model.RunTemplate at 0x7fe978656cc0>

In [25]:
tsk3 = trainfn.run(task3, handler='train')

[mlrun] 2020-01-23 11:52:46,121 starting run train uid=3539274893904935adea979b410bf135  -> http://mlrun-api:8080
[mlrun] 2020-01-23 11:52:46,218 Job is running in the background, pod: train-qwzg9
[mlrun] 2020-01-23 11:52:56,785 Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mlrun/runtimes/local.py", line 174, in exec_from_params
    val = handler(*args_list)
  File "main.py", line 91, in train
    verbose=verbose)
TypeError: fit() got an unexpected keyword argument 'eval_names'


[mlrun] 2020-01-23 11:52:56,796 exec error - fit() got an unexpected keyword argument 'eval_names'
[mlrun] 2020-01-23 11:52:56,830 run executed, status=error
runtime error: fit() got an unexpected keyword argument 'eval_names'
  labels = getattr(columns, 'labels', None) or [
  return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
  labels, = index.labels
fit() got an unexpected keyword argument 'eval_names'
final state: failed


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...0bf135 | 0 | Jan 23 11:52:52 | error | sklearn-classifier | host=train-qwzg9, kind=job, owner=admin | | SKClassifier=xgboost.XGBClassifier, callbacks=[], key=xgb-classifier, name=xgb-classifier.pkl, random_state=1, src_file=/User/mlrun/sklearn-classifier/simdata.pqt, verbose=False | | |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 3539274893904935adea979b410bf135  , !mlrun logs 3539274893904935adea979b410bf135 
[mlrun] 2020-01-23 11:53:05,425 run executed, status=error
runtime error: fit() got an unexpected keyword argument 'eval_names'


RunError: fit() got an unexpected keyword argument 'eval_names'
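
The traceback shows that the `train` handler passes `eval_names` to `fit()`, which the LightGBM run accepted but `xgboost.XGBClassifier` rejects. One possible workaround, sketched below, is to pass only the keyword arguments that the chosen estimator's `fit` signature declares. This is not the handler's actual code: `accepted_kwargs` is a hypothetical helper, and the two stand-in classes only imitate the signature difference so the example runs without either library installed.

```python
import inspect


def accepted_kwargs(func, candidate_kwargs):
    """Return only the entries of candidate_kwargs that func can accept."""
    params = inspect.signature(func).parameters
    # a **kwargs parameter means the callable takes any keyword
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidate_kwargs)
    return {k: v for k, v in candidate_kwargs.items() if k in params}


class LightGBMLike:
    """Stand-in whose fit() accepts eval_names (assumption for illustration)."""
    def fit(self, X, y, eval_set=None, eval_names=None, verbose=False):
        return self


class XGBoostLike:
    """Stand-in whose fit() has no eval_names parameter, as in the traceback."""
    def fit(self, X, y, eval_set=None, verbose=False):
        return self


fit_kwargs = {'eval_set': [([[0.0]], [0])],
              'eval_names': ['valid'],
              'verbose': False}

for cls in (LightGBMLike, XGBoostLike):
    model = cls()
    kwargs = accepted_kwargs(model.fit, fit_kwargs)
    model.fit([[0.0]], [0], **kwargs)  # no TypeError for either stand-in
```

Silently dropping arguments can hide configuration mistakes, so a real handler might log which keys were removed.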

In [None]:
tsk3.outputs

## evaluation

run plots here

## model optimization

onnx here