# Record metadata on Kubeflow from Notebooks
> Demonstration of how lineage tracking works

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]

![](my_icons/kubeflow-metadata.png)

# Record metadata about your models on Kubeflow from Notebooks

This notebook shows you how to use Kubeflow to record metadata as you build, train, and deploy models on Kubernetes.
It walks you through the following:
 
* Building an XGBoost model inside a notebook
* Training the model inside the notebook
* Performing inference using the model inside the notebook
* Using Kubeflow Fairing to launch training jobs on Kubernetes
* Using Kubeflow Fairing to build and deploy a model using [Seldon Core](https://www.seldon.io/)
* Using [Kubeflow metadata](https://github.com/kubeflow/metadata) to record metadata about your models



## Install Required Libraries

Import the libraries required to record metadata about the model.

In [2]:
from kubeflow.metadata import metadata

## Code to record metadata 

* In the cell below we define a function that creates the workspace used to record metadata about the model
* This function could just as easily be defined in a separate Python module

In [6]:
def create_workspace():
    METADATA_STORE_HOST = "metadata-grpc-service.kubeflow"  # default DNS name of the Kubeflow Metadata gRPC service
    METADATA_STORE_PORT = 8080
    return metadata.Workspace(
        store=metadata.Store(grpc_host=METADATA_STORE_HOST, grpc_port=METADATA_STORE_PORT),
        name="xgboost-synthetic",
        description="workspace for xgboost-synthetic artifacts and executions")
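The host and port above are hard-coded. If the function were moved into a separate module, one possible variation is to read the address from environment variables and fall back to the defaults used here. The following is a minimal sketch under that assumption; the variable names `METADATA_GRPC_HOST` and `METADATA_GRPC_PORT` and the helper `metadata_store_address` are hypothetical and are not defined by Kubeflow.

```python
import os

# Defaults copied from create_workspace above.
DEFAULT_HOST = "metadata-grpc-service.kubeflow"
DEFAULT_PORT = 8080


def metadata_store_address(env=None):
    """Return (host, port) for the metadata gRPC service.

    METADATA_GRPC_HOST and METADATA_GRPC_PORT are hypothetical names
    chosen for this sketch; Kubeflow does not define them.
    """
    env = os.environ if env is None else env
    host = env.get("METADATA_GRPC_HOST", DEFAULT_HOST)
    # Environment values are strings, so convert the port to an int.
    port = int(env.get("METADATA_GRPC_PORT", DEFAULT_PORT))
    return host, port
```

The returned pair could then be passed as `grpc_host` and `grpc_port` to `metadata.Store`, exactly as `create_workspace` does with its constants.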

## Wrap Training and record metadata in a class

* In the cell below we wrap training in a class and record metadata


In [7]:
# fairing:include-cell
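import os
from datetime import datetime

# read_synthetic_input, train_model, eval_model, and save_model are assumed
# to be defined in earlier cells of the notebook.


class ModelServe(object):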
    def __init__(self, model_file=None):
        self.n_estimators = 50
        self.learning_rate = 0.1
        if not model_file:
            if "MODEL_FILE" in os.environ:
                print("model_file not supplied; checking environment variable")
                model_file = os.getenv("MODEL_FILE")
            else:
                print("model_file not supplied; using the default")
                model_file = "mockup-model.dat"
        
        self.model_file = model_file
        print("model_file={0}".format(self.model_file))
        
        self.model = None
        self._workspace = None
        self.exec = self.create_execution()

    def train(self):
        (train_X, train_y), (test_X, test_y) = read_synthetic_input()
        
        # Here we use Kubeflow's metadata library to record information
        # about the training run to Kubeflow's metadata store.
        self.exec.log_input(metadata.DataSet(
            description="xgboost synthetic data",
            name="synthetic-data",
            owner="someone@kubeflow.org",
            uri="file://path/to/dataset",
            version="v1.0.0"))
        
        model = train_model(train_X,
                          train_y,
                          test_X,
                          test_y,
                          self.n_estimators,
                          self.learning_rate)

        mae = eval_model(model, test_X, test_y)
        
        # Here we log metrics about the model to Kubeflow's metadata store.
        self.exec.log_output(metadata.Metrics(
            name="xgboost-synthetic-traing-eval",
            owner="someone@kubeflow.org",
            description="training evaluation for xgboost synthetic",
            uri="gcs://path/to/metrics",
            metrics_type=metadata.Metrics.VALIDATION,
            values={"mean_absolute_error": mae}))
        
        save_model(model, self.model_file)
        self.exec.log_output(metadata.Model(
            name="housing-price-model",
            description="housing price prediction model using synthetic data",
            owner="someone@kubeflow.org",
            uri=self.model_file,
            model_type="linear_regression",
            training_framework={
                "name": "xgboost",
                "version": "0.9.0"
            },
            hyperparameters={
                "learning_rate": self.learning_rate,
                "n_estimators": self.n_estimators
            },
            version=datetime.utcnow().isoformat("T")))
        
    @property
    def workspace(self):
        if not self._workspace:
            self._workspace = create_workspace()
        return self._workspace
    
    def create_execution(self):                
        r = metadata.Run(
            workspace=self.workspace,
            name="xgboost-synthetic-faring-run" + datetime.utcnow().isoformat("T"),
            description="a notebook run")

        return metadata.Execution(
            name = "execution" + datetime.utcnow().isoformat("T"),
            workspace=self.workspace,
            run=r,
            description="execution for training xgboost-synthetic")

MetadataStore with gRPC connection initialized
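Two small conventions in `ModelServe` are easy to miss: the model file is resolved from the constructor argument, then the `MODEL_FILE` environment variable, then a default, and run and execution names are made unique by appending a UTC timestamp. The sketch below isolates both so they can be tried without a metadata server. The helper names `resolve_model_file` and `unique_name` are hypothetical and do not appear in the notebook.

```python
import os
from datetime import datetime, timezone


def resolve_model_file(model_file=None, env=None):
    """Mirror the lookup order used in ModelServe.__init__."""
    env = os.environ if env is None else env
    if model_file:
        return model_file
    # Fall back to the environment variable, then to the default file name.
    return env.get("MODEL_FILE", "mockup-model.dat")


def unique_name(prefix, now=None):
    """Append an ISO-8601 timestamp to prefix, as create_execution does."""
    if now is None:
        # Naive UTC timestamp, matching the format datetime.utcnow() produces.
        now = datetime.now(timezone.utc).replace(tzinfo=None)
    return prefix + now.isoformat("T")
```

Passing a fixed `now` makes the naming reproducible, which is convenient when checking that a run name matches an entry in the metadata store.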


## Track Models and Artifacts

> youtube: https://youtu.be/DYsIrpgaucg

* Using Kubeflow's metadata server you can track models and artifacts
* The ModelServe code was instrumented to log executions and outputs
* You can access Kubeflow's metadata UI by selecting **Artifact Store** from the central dashboard
  * See [here](https://www.kubeflow.org/docs/other-guides/accessing-uis/) for instructions on connecting to Kubeflow's UIs
* You can also use the Python SDK to read and write entries
* This [notebook](https://github.com/kubeflow/metadata/blob/master/sdk/python/demo.ipynb) demonstrates more of the metadata SDK's functionality

### Create a workspace

* Kubeflow metadata uses workspaces as a logical grouping for artifacts, executions, and datasets that belong together
* Earlier in the notebook we defined the function `create_workspace` to create a workspace for this example
* You can use that function to return a workspace object and then call `list()` on it to see all the artifacts in that workspace
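Because every training run logs a new model entry, a workspace quickly accumulates several versions of the same model. The sketch below picks the most recent one, assuming `list()` returns plain dictionaries with `name` and `create_time` keys, as in the output shown in this section. The helper `latest_artifact` is hypothetical, and the sample entries reuse only the `id`, `name`, and `create_time` values from that output.

```python
def latest_artifact(entries, name):
    """Return the newest entry with the given name, or None if absent.

    Assumes create_time values are ISO-8601 UTC strings in one format,
    so comparing them as strings orders them chronologically.
    """
    matches = [e for e in entries if e.get("name") == name]
    if not matches:
        return None
    return max(matches, key=lambda e: e["create_time"])


# Trimmed-down entries based on the ws.list() output in this section.
entries = [
    {"id": 3, "name": "housing-price-model",
     "create_time": "2020-02-26T23:26:36.660887Z"},
    {"id": 6, "name": "housing-price-model",
     "create_time": "2020-02-26T23:27:11.458520Z"},
]

newest = latest_artifact(entries, "housing-price-model")
print(newest["id"])  # prints 6
```

With a live workspace, `entries` would be replaced by the result of `ws.list()`.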

In [22]:
ws = create_workspace()
ws.list()

MetadataStore with gRPC connection initialized


[{'id': 3,
  'workspace': 'xgboost-synthetic',
  'run': 'xgboost-synthetic-faring-run2020-02-26T23:26:36.443396',
  'version': '2020-02-26T23:26:36.660862',
  'owner': 'someone@kubeflow.org',
  'description': 'housing price prediction model using synthetic data',
  'name': 'housing-price-model',
  'model_type': 'linear_regression',
  'create_time': '2020-02-26T23:26:36.660887Z',
  'uri': 'mockup-model.dat',
  'training_framework': {'name': 'xgboost', 'version': '0.9.0'},
  'hyperparameters': {'learning_rate': 0.1, 'n_estimators': 50},
  'labels': None,
  'kwargs': {}},
 {'id': 6,
  'workspace': 'xgboost-synthetic',
  'run': 'xgboost-synthetic-faring-run2020-02-26T23:27:11.144500',
  'create_time': '2020-02-26T23:27:11.458520Z',
  'version': '2020-02-26T23:27:11.458480',
  'owner': 'someone@kubeflow.org',
  'description': 'housing price prediction model using synthetic data',
  'name': 'housing-price-model',
  'model_type': 'linear_regression',
  'uri': 'mockup-model.dat',
  'training_f