## Running in Vertex Pipelines

To make our model training more reproducible, we want to run it in an automated pipeline that explicitly defines each step (e.g. training and evaluating the model) and captures the artifacts it produces (e.g. the trained model). On Google Cloud, we can use Vertex Pipelines for this purpose.

Vertex Pipelines lets you define a pipeline as a graph of containerized tasks, in which each task performs one specific step needed to train, evaluate, or deploy a model.
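To make the "graph of tasks" idea concrete, here is a minimal plain-Python sketch. It is not the Vertex Pipelines or KFP API: the task functions, the `TASKS` and `DEPENDENCIES` mappings, and `run_pipeline` are all hypothetical names used only for illustration, and each "task" is an ordinary function instead of a container. It only shows how declaring dependencies between steps determines the execution order and which artifacts each step receives.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical stand-ins for containerized steps. Each one receives the
# artifacts of its upstream tasks and returns a label for its own artifact.
def train(inputs):
    return "model"


def evaluate(inputs):
    return f"metrics({inputs['train']})"


def predict(inputs):
    return f"predictions({inputs['train']})"


TASKS = {"train": train, "evaluate": evaluate, "predict": predict}

# Maps each task to the set of tasks whose outputs it consumes.
DEPENDENCIES = {"train": set(), "evaluate": {"train"}, "predict": {"train"}}


def run_pipeline(tasks, dependencies):
    """Runs every task after all of its upstream tasks have finished."""
    artifacts = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: artifacts[dep] for dep in dependencies[name]}
        artifacts[name] = tasks[name](upstream)
    return artifacts


artifacts = run_pipeline(TASKS, DEPENDENCIES)
print(artifacts)
```

In the real pipeline further down, the same dependency information is derived from how task outputs are passed as inputs to other tasks, so we never spell out the graph by hand.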

Exercise:
* Build the Docker image
* Run the pipeline
* Implement the evaluate component
* Add an evaluate step to the pipeline
* Add a prediction component and a matching pipeline step

Bonus:
* Deploy the trained model as an API
* ...

In [None]:
! mkdir -p _artifacts

In [None]:
GCP_REGION = "europe-west3"

# Enter your name here. We'll use this to tag your unique
# Docker image to avoid clashing with other people.
USER_NAME = input("Your user name:")

In [None]:
! make -C ../ USER_NAME=$USER_NAME docker-push

In [None]:
from typing import Optional, NamedTuple

import kfp
from kfp import components
from kfp.v2 import compiler
from kfp.v2.dsl import (
    component,
    Input,
    InputPath,
    OutputPath,
    Output,
    Dataset,
    Metrics,
    Model
)

@component(
    base_image=f"{GCP_REGION}-docker.pkg.dev/gdd-cb-vertex/docker/fancy-fashion-{USER_NAME}",
    output_component_file="_artifacts/train.yaml",
)
def train(train_data_path: str, model: Output[Model]) -> None:
    """Trains the model on the given dataset."""
    
    from pathlib import Path
    import joblib
    
    from fancy_fashion.model import train_model
    from fancy_fashion.util import local_gcs_path
    
    trained_model = train_model(local_gcs_path(train_data_path))

    model_dir = Path(model.path)
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(trained_model, model_dir / "model.pkl")

    
@component(
    base_image=f"{GCP_REGION}-docker.pkg.dev/gdd-cb-vertex/docker/fancy-fashion-{USER_NAME}",
    output_component_file="_artifacts/evaluate.yaml",
)
def evaluate(
    test_data_path: str, model: InputPath("Model"), metrics: Output[Metrics]
) -> NamedTuple("EvalModelOutput", [("roc", float)]):
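    # TODO: Implement the actual evaluation.
    #       Tip: we can use the evaluate_model function from our package.
    roc = 0.9  # Placeholder value until the evaluation is implemented.
    metrics.log_metric("roc", roc)
    # Return a value for the declared "roc" output so the stub matches its signature.
    return (roc,)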

    
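# Use your own user name in the pipeline name to avoid clashing with other people.
@kfp.dsl.pipeline(name=f"fancy-fashion-{USER_NAME}")
def pipeline(train_path: str):
    # Pass component inputs by keyword so the call is explicit about which input is set.
    train_task = train(train_data_path=train_path)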
    
    # TODO: Add an evaluate task that uses the evaluate component above.

compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="_artifacts/pipeline.json",
)
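Before submitting the job, it can help to check which tasks ended up in the compiled pipeline and how they depend on each other. The sketch below is not part of the KFP SDK; `task_dependencies` is a hypothetical helper. It assumes the compiled file stores its tasks under `root.dag.tasks`, optionally wrapped in a top-level `pipelineSpec` key, and that each task lists its upstream tasks under `dependentTasks`. If your SDK version produces a different layout, adjust the keys accordingly.

```python
import json
from pathlib import Path


def task_dependencies(spec: dict) -> dict:
    """Maps each task name in a compiled pipeline spec to its upstream tasks."""
    # Assumption: some SDK versions wrap the spec in a "pipelineSpec" key.
    pipeline_spec = spec.get("pipelineSpec", spec)
    tasks = pipeline_spec.get("root", {}).get("dag", {}).get("tasks", {})
    return {name: task.get("dependentTasks", []) for name, task in tasks.items()}


spec_path = Path("_artifacts/pipeline.json")
if spec_path.exists():
    # Only runs once the pipeline has been compiled by the cell above.
    print(task_dependencies(json.loads(spec_path.read_text())))
```

Once the evaluate step is wired in, it should show up here with the train task listed as one of its dependencies.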

In [None]:
from google.cloud.aiplatform.pipeline_jobs import PipelineJob

job = PipelineJob(
    display_name=f"fancy-fashion-{USER_NAME}",
    enable_caching=False,
    template_path="_artifacts/pipeline.json",
    parameter_values={
        "train_path": "gs://gdd-cb-vertex-fashion-inputs/train"
    },
    pipeline_root="gs://gdd-cb-vertex-fashion-artifacts/pipelines",
    location=GCP_REGION,
)

job.run(
    service_account="vmd-fashion@gdd-cb-vertex.iam.gserviceaccount.com"
)