<a href="https://colab.research.google.com/github/joahofmann/gcp-notebooks/blob/main/vertex_ai_wine_classification_pipeline_ok.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vertex AI Pipelines Example

This notebook demonstrates how to create and run a simple Kubeflow pipeline on Vertex AI.

# 1. Setup and Authentication

In [5]:
# Install necessary libraries
!pip install --upgrade google-cloud-aiplatform google-cloud-storage kfp google-cloud-pipeline-components --quiet

In [6]:
# Restart runtime (Colab only)
import sys
if "google.colab" in sys.modules:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

In [None]:
# Authenticate to Google Cloud
# If you are running this in a Colab environment, this will open a browser window for authentication.
import sys
if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

In [2]:
# --- User-defined variables ---
# Replace with your actual project ID and region
PROJECT_ID = "vertex-test-id" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}
BUCKET_NAME = "gcs-bucket-name-wine2" # @param {type:"string"}

In [3]:
# Validate inputs
if PROJECT_ID == "your-gcp-project-id" or not PROJECT_ID:
    raise ValueError("Please replace 'your-gcp-project-id' with your actual GCP project ID.")
if BUCKET_NAME == "your-gcs-bucket-name" or not BUCKET_NAME:
    raise ValueError("Please replace 'your-gcs-bucket-name' with your actual GCS bucket name.")

In [4]:
BUCKET_URI = f"gs://{BUCKET_NAME}"
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root_simple_example"

print(f"Project ID: {PROJECT_ID}")
print(f"Region: {REGION}")
print(f"Bucket URI: {BUCKET_URI}")
print(f"Pipeline Root: {PIPELINE_ROOT}")

Project ID: vertex-test-id
Region: us-central1
Bucket URI: gs://gcs-bucket-name-wine2
Pipeline Root: gs://gcs-bucket-name-wine2/pipeline_root_simple_example


### Create a Cloud Storage bucket (if it doesn't exist)

Create a storage bucket to store intermediate artifacts such as datasets.

In [5]:
# You only need to run this if your bucket doesn't already exist
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://gcs-bucket-name-wine2/...
ServiceException: 409 A Cloud Storage bucket named 'gcs-bucket-name-wine2' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


In [6]:
# Initialize Vertex AI SDK
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

In [7]:
# Get the service account
SERVICE_ACCOUNT = !gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
SERVICE_ACCOUNT = f"{SERVICE_ACCOUNT[0].strip()}-compute@developer.gserviceaccount.com"
print(f"Service Account: {SERVICE_ACCOUNT}")

Service Account: 219162896674-compute@developer.gserviceaccount.com


Grant the Compute Engine default service account the `roles/storage.objectAdmin` and `roles/aiplatform.user` roles at the project level.

In [8]:
# Grant necessary permissions to the Compute Engine default service account at the project level
!gcloud projects add-iam-policy-binding {PROJECT_ID} --member="serviceAccount:{SERVICE_ACCOUNT}" --role="roles/storage.objectAdmin"
!gcloud projects add-iam-policy-binding {PROJECT_ID} --member="serviceAccount:{SERVICE_ACCOUNT}" --role="roles/aiplatform.user"

#!gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectAdmin {BUCKET_URI}

Updated IAM policy for project [vertex-test-id].
bindings:
- members:
  - serviceAccount:219162896674-compute@developer.gserviceaccount.com
  - serviceAccount:vertex-test-id@appspot.gserviceaccount.com
  - user:Joachim.Hofmann@bluewin.ch
  role: roles/aiplatform.admin
- members:
  - serviceAccount:service-219162896674@gcp-sa-vertex-nb.iam.gserviceaccount.com
  role: roles/aiplatform.colabServiceAgent
- members:
  - serviceAccount:service-219162896674@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  role: roles/aiplatform.customCodeServiceAgent
- members:
  - serviceAccount:service-219162896674@gcp-sa-aiplatform-vm.iam.gserviceaccount.com
  role: roles/aiplatform.notebookServiceAgent
- members:
  - serviceAccount:service-219162896674@gcp-sa-aiplatform.iam.gserviceaccount.com
  role: roles/aiplatform.serviceAgent
- members:
  - serviceAccount:219162896674-compute@developer.gserviceaccount.com
  - serviceAccount:vertex-test-id@appspot.gserviceaccount.com
  role: roles/aiplatform.user
- memb

Set service account access for Vertex AI Pipelines. The following commands grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run them once per service account. They are commented out here because the project-level `roles/storage.objectAdmin` binding above already covers this bucket.

In [None]:
#! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

#! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries and define constants

In [9]:
import google.cloud.aiplatform as aip
from kfp import compiler, dsl
from kfp.dsl import ClassificationMetrics, Metrics, Output, component

In [10]:
import json
import os  # path manipulation, if needed
from datetime import datetime
from typing import NamedTuple

from google.cloud.aiplatform import pipeline_jobs
from kfp import dsl
from kfp.dsl import (
    Artifact,
    ClassificationMetrics,
    Dataset,
    Input,
    InputPath,
    Metrics,
    Model,
    Output,
    OutputPath,
    component,
)

# --- Global configuration ---
# PROJECT_ID, REGION and PIPELINE_ROOT are defined in the setup cells above.
# Make sure the bucket exists and the service account has write permissions on it.

# Generate a timestamp for unique display names for pipeline runs
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
DISPLAY_NAME = f'pipeline-winequality-job-{TIMESTAMP}'

# Create pipeline

We create four components:
- Load the data
- Train a model
- Evaluate the model
- Deploy the model

The components depend on `pandas` and `scikit-learn`.

#### Vertex AI constants

Set up the following constants for Vertex AI pipelines:
- `PIPELINE_NAME`: Set name for the pipeline.
- `PIPELINE_ROOT`: Cloud Storage bucket path to store pipeline artifacts.

In [None]:
#PIPELINE_NAME = "metrics-pipeline-v2"
#PIPELINE_ROOT = "{}/pipeline_root/iris".format(BUCKET_URI)

Let's look at our data.

In [11]:
import pandas as pd
df_wine = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
df_wine.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [12]:
df_wine.quality.describe()

|   | quality |
|---|---|
| count | 4898.0 |
| mean | 5.877909 |
| std | 0.885639 |
| min | 3.0 |
| 25% | 5.0 |
| 50% | 6.0 |
| 75% | 6.0 |
| max | 9.0 |


Initialize Vertex AI SDK for Python: To get started using Vertex AI, you must [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

In [13]:
aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)


Note the use of the `@component()` decorator in the definitions below. Optionally, you can pass it a list of packages for the component to install, the base image to use (a Python 3.7 image by default in older KFP releases), and the name of a component YAML file to generate, so that the component definition can be shared and reused.


## First component: read the wine quality dataset and store it in Google Cloud Storage
Also let's do some preprocessing as we always do in ML tasks.



In [14]:
@component(
  packages_to_install=["pandas", "numpy==1.23.5", "pyarrow", "scikit-learn==1.2.2"],
  base_image="python:3.9",
  #output_component_file="get_wine_data.yaml"
)
def get_wine_data(
  url: str,  # URL of the semicolon-delimited wine quality CSV
  # Use Output[T] to get a metadata-rich handle to the output artifact of type `Dataset`.
  # The artifact's path is already set to a location under the pipeline root.
  dataset_train: Output[Dataset],
  dataset_test: Output[Dataset]
):
  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split

  df_wine = pd.read_csv(url, delimiter=";")
  df_wine['best_quality'] = df_wine.quality.apply(lambda x: int(x>=7))
  df_wine['target'] = df_wine.best_quality
  df_wine.drop(
      columns=['quality', 'total sulfur dioxide', 'best_quality'],
      inplace=True
  )

  train, test = train_test_split(df_wine, test_size=0.3)
  train.to_csv(dataset_train.path + ".csv" , index=False)
  test.to_csv(dataset_test.path + ".csv" , index=False)
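
The preprocessing in the component above boils down to two steps: turn `quality` into a binary `target` (1 when quality is at least 7) and split the rows 70/30. Here is a minimal standard-library sketch of the same idea, without pandas or scikit-learn. The helper names `label_row` and `split_rows`, the demo rows, and the fixed seed are assumptions made for illustration. The component itself calls `train_test_split` without a `random_state`, so its split differs from run to run.

```python
import random


def label_row(row: dict) -> dict:
    """Drop `quality` and `total sulfur dioxide`, add a binary `target`."""
    dropped = ("quality", "total sulfur dioxide")
    out = {name: value for name, value in row.items() if name not in dropped}
    out["target"] = int(row["quality"] >= 7)  # 1 = "best quality" wine
    return out


def split_rows(rows, test_size=0.3, seed=42):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows that only mimic the shape of the wine dataset.
demo_rows = [
    {"alcohol": 8.8 + i * 0.1, "total sulfur dioxide": 170.0, "quality": 3 + i % 7}
    for i in range(10)
]
demo_labelled = [label_row(r) for r in demo_rows]
demo_train, demo_test = split_rows(demo_labelled)
print(len(demo_train), len(demo_test))
```

Fixing the seed makes the split reproducible, which helps when comparing metrics across pipeline runs.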

## Train the wine quality model


In [15]:
@component(
  packages_to_install = [
      "pandas",
      "numpy==1.23.5",
      "scikit-learn==1.2.2"
  ], base_image="python:3.9",
)
def train_winequality(
  # Use Input[T] to get a metadata-rich handle to the
  # input artifact of type `Dataset`.
  dataset:  Input[Dataset],
  model: Output[Model],
):
  import pickle
  import pandas as pd
  from sklearn.ensemble import RandomForestClassifier

  data = pd.read_csv(dataset.path+".csv")
  model_rf = RandomForestClassifier(n_estimators=10)
  model_rf.fit(
      data.drop(columns=["target"]),
      data.target,
  )
  model.metadata["framework"] = "RF"
  file_name = model.path + ".pkl"
  with open(file_name, 'wb') as file:
      pickle.dump(model_rf, file)

## Evaluate the model
The evaluation results are logged as metrics artifacts and written to the pipeline root in Cloud Storage.

In [16]:
@component(
  packages_to_install = [
      "pandas",
      "numpy==1.23.5",
      "scikit-learn==1.2.2"
  ], base_image="python:3.9",
)
def winequality_evaluation(
  test_set:  Input[Dataset],
  rf_winequality_model: Input[Model],
  thresholds_dict_str: str,
  metrics: Output[ClassificationMetrics],
  kpi: Output[Metrics]
) -> NamedTuple("output", [("deploy", str)]):

  from sklearn.ensemble import RandomForestClassifier
  import pandas as pd
  import logging
  import pickle
  from sklearn.metrics import roc_curve, confusion_matrix, accuracy_score
  import json
  import typing

  def threshold_check(val1, val2):
      cond = "false"
      if val1 >= val2 :
          cond = "true"
      return cond

  data = pd.read_csv(test_set.path+".csv")
  file_name = rf_winequality_model.path + ".pkl"
  with open(file_name, 'rb') as file:
      model = pickle.load(file)

  X_test = data.drop(columns=["target"])
  y_target = data.target
  y_pred = model.predict(X_test)

  y_scores =  model.predict_proba(X_test)[:, 1]
  fpr, tpr, thresholds = roc_curve(
        y_true=data.target.to_numpy(), y_score=y_scores, pos_label=True
  )
  metrics.log_roc_curve(fpr.tolist(), tpr.tolist(), thresholds.tolist())

  metrics.log_confusion_matrix(
      ["False", "True"],
      confusion_matrix(
          data.target, y_pred
      ).tolist(),
  )

  accuracy = accuracy_score(data.target, y_pred.round())
  thresholds_dict = json.loads(thresholds_dict_str)
  rf_winequality_model.metadata["accuracy"] = float(accuracy)
  kpi.log_metric("accuracy", float(accuracy))
  # Compare as floats: int(0.8) truncates the threshold to 0, so every model would be deployed
  deploy = threshold_check(float(accuracy), float(thresholds_dict['roc']))
  return (deploy,)
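
The deploy decision in the component above depends on comparing the accuracy with the threshold parsed from `thresholds_dict_str`. The sketch below isolates that gate in plain Python. `should_deploy` is a hypothetical helper name, not part of the pipeline. It parses the threshold as a float, because truncating 0.8 to an integer gives 0 and would let every model pass.

```python
import json


def should_deploy(accuracy: float, thresholds_json: str, key: str = "roc") -> str:
    """Return "true" or "false" as strings, the form the pipeline condition compares."""
    threshold = float(json.loads(thresholds_json)[key])
    return "true" if accuracy >= threshold else "false"


print(should_deploy(0.86, '{"roc": 0.8}'))  # above the threshold
print(should_deploy(0.75, '{"roc": 0.8}'))  # below the threshold
print(int(0.8))  # shows why an integer cast is unsafe here
```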

## Deploy model

In [17]:
@component(
  packages_to_install=["google-cloud-aiplatform", "scikit-learn==1.0.0",  "kfp"],
  base_image="python:3.9",
  #output_component_file="model_winequality_coponent.yml"
)
def deploy_winequality(
  model: Input[Model],
  project: str,
  region: str,
  serving_container_image_uri : str,
  vertex_endpoint: Output[Artifact],
  vertex_model: Output[Model]
):
  from google.cloud import aiplatform
  aiplatform.init(project=project, location=region)

  DISPLAY_NAME  = "winequality"
  MODEL_NAME = "winequality-rf"
  ENDPOINT_NAME = "winequality_endpoint"

  def create_endpoint():
      endpoints = aiplatform.Endpoint.list(
        filter='display_name="{}"'.format(ENDPOINT_NAME),
        order_by='create_time desc',
        project=project,
        location=region,
      )
      if len(endpoints) > 0:
          return endpoints[0]  # most recently created
      else:
          return aiplatform.Endpoint.create(
            display_name=ENDPOINT_NAME, project=project, location=region
        )
  endpoint = create_endpoint()

  #Import a model programmatically
  model_upload = aiplatform.Model.upload(
      display_name = DISPLAY_NAME,
      artifact_uri = model.uri.replace("model", ""),
      serving_container_image_uri = serving_container_image_uri,
      serving_container_health_route=f"/v1/models/{MODEL_NAME}",
      serving_container_predict_route=f"/v1/models/{MODEL_NAME}:predict",
      serving_container_environment_variables={
      "MODEL_NAME": MODEL_NAME,
  },
  )
  model_deploy = model_upload.deploy(
      machine_type="n1-standard-4",
      endpoint=endpoint,
      traffic_split={"0": 100},
      deployed_model_display_name=DISPLAY_NAME,
  )

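  # Save the resource names to the output artifacts.
  # Model.deploy() returns the Endpoint, so the model name comes from model_upload.
  vertex_endpoint.uri = model_deploy.resource_name
  vertex_model.uri = model_upload.resource_name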
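
The `artifact_uri` passed to `Model.upload` above is derived with `model.uri.replace("model", "")`, which removes every occurrence of the substring, not just the final path segment. A small sketch of a narrower alternative follows. `artifact_directory` is a hypothetical helper and the bucket name is made up for illustration.

```python
def artifact_directory(model_uri: str) -> str:
    """Return the parent "directory" of an artifact URI, with a trailing slash."""
    return model_uri.rstrip("/").rsplit("/", 1)[0] + "/"


# Hypothetical URI whose bucket name happens to contain the word "model".
demo_uri = "gs://my-models-bucket/pipeline_root/123/train-winequality_456/model"

print(artifact_directory(demo_uri))
print(demo_uri.replace("model", ""))  # also mangles the bucket name
```

Splitting on the last `/` only touches the final segment, so bucket or folder names that contain "model" are left alone.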

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
DISPLAY_NAME = 'pipeline-winequality-job-{}'.format(TIMESTAMP)

In [18]:
DISPLAY_NAME

'pipeline-winequality-job-20250713164903'

## Create the Pipeline itself

Once you have created all the needed components, define the pipeline and then compile it into a `.json` file.

In [19]:
PIPELINE_ROOT

'gs://gcs-bucket-name-wine2/pipeline_root_simple_example'

In [20]:
@dsl.pipeline(
  # Default pipeline root. You can override it when submitting the pipeline.
  pipeline_root=PIPELINE_ROOT,
  # A name for the pipeline. Used to determine the pipeline context.
  name="pipeline-winequality",
)
def pipeline(
  url: str = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
  project: str = PROJECT_ID,
  region: str = REGION,
  display_name: str = DISPLAY_NAME,
  api_endpoint: str = REGION+"-aiplatform.googleapis.com",
  thresholds_dict_str: str = '{"roc":0.8}',
  serving_container_image_uri: str = "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
):
  data_op = get_wine_data(url=url)
  train_model_op = train_winequality(dataset=data_op.outputs["dataset_train"])
  model_evaluation_op = winequality_evaluation(
      test_set=data_op.outputs["dataset_test"],
      rf_winequality_model=train_model_op.outputs["model"],
      thresholds_dict_str = thresholds_dict_str, # the model is deployed only if its performance is above the threshold
  )

  with dsl.Condition(
      model_evaluation_op.outputs["deploy"]=="true",
      name="deploy-winequality",
  ):
      deploy_model_op = deploy_winequality(
        model=train_model_op.outputs['model'],
        project=project,
        region=region,
        serving_container_image_uri = serving_container_image_uri,
      )


### Compile and run the pipeline

In [21]:
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='ml_winequality.json'
)

The pipeline compilation generates the **ml_winequality.json** job spec file.

### Create a run

In [22]:
aiplatform.init(project=PROJECT_ID, location=REGION)

In [None]:
# might be needed if we restarted the notebook before
# from google.colab import auth
# auth.authenticate_user()

In [23]:
start_pipeline = pipeline_jobs.PipelineJob(
  display_name="winequality-pipeline",
  template_path="ml_winequality.json",
  enable_caching=True,
  location=REGION
)

In [None]:
#stop execution
#raise SystemExit(1)

In [None]:
start_pipeline.run(service_account=SERVICE_ACCOUNT)

### List all models

In [None]:
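# The uploaded model is named "winequality" in deploy_winequality; DISPLAY_NAME holds the pipeline job name
! gcloud ai models list --region={REGION} --project={PROJECT_ID} --filter="display_name=winequality"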

### Schedule pipeline

Scheduled runs created with the code below are backed by Cloud Scheduler and Cloud Functions.
Check that the Cloud Scheduler and Cloud Functions APIs are enabled.

The code below creates a scheduled pipeline run.

In [None]:
# NOTE: AIPlatformClient shipped with the KFP 1.8.x SDK only; this import fails on KFP 2.x
from kfp.v2.google.client import AIPlatformClient

api_client = AIPlatformClient(
                project_id=PROJECT_ID,
                region=REGION,
                )

response = api_client.create_schedule_from_job_spec(
    enable_caching=True,
    job_spec_path="ml_winequality.json",
    schedule="0 0 * * 1", # once per week on Monday
    time_zone="Europe/Brussels",  # change this as necessary
    parameter_values={"display_name": DISPLAY_NAME},
    pipeline_root=PIPELINE_ROOT,  # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
    #service_account=SERVICE_ACCOUNT,
)


Once the scheduled job is created, you can see it listed in the Cloud Scheduler panel in the Console.
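
The cron expression `0 0 * * 1` used above fires at 00:00 every Monday. As a quick sanity check, the next firing time can be computed with the standard library. `next_monday_midnight` is a hypothetical helper, and time zones are ignored here for brevity, so treat the result as naive local time.

```python
from datetime import datetime, timedelta


def next_monday_midnight(now: datetime) -> datetime:
    """Next time "0 0 * * 1" fires strictly after `now` (naive local time)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_ahead = (0 - midnight.weekday()) % 7  # Monday is weekday 0
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now:
        candidate += timedelta(days=7)  # already past this week's slot
    return candidate


print(next_monday_midnight(datetime(2025, 7, 13, 16, 49)))
```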

# Get predictions from endpoint

We can check the model's performance by calling the endpoint, and we can of course call it for new incoming data too.

In [None]:
! pip install gcsfs

In [None]:
# here is how to read data from GCS. You need to be authorized!

import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

data_path = 'gs://gcs-bucket-name-wine2/pipeline_root_simple_example/219162896674/pipeline-winequality-20250712224306/get-wine-data_1228575498100015104/dataset_test.csv'

with fs.open(data_path, 'rb') as f:
    test_df = pd.read_csv(f, nrows=10) # let's read just a chunk of data to speed up data load
test_df.head()

In [None]:
# create instances
instances = test_df.drop(columns='target').values.tolist()

In [None]:
instances
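
For reference, `instances` is a list of rows, each a list of feature values in the same column order used for training, with the label removed. Here is a minimal sketch of that shape without pandas. `build_instances` and the two demo rows are hypothetical, and the JSON printed at the end mirrors the `{"instances": [...]}` body used by the prediction REST API.

```python
import json


def build_instances(rows, label="target"):
    """Drop the label and keep the feature values in each row's column order."""
    return [
        [value for name, value in row.items() if name != label]
        for row in rows
    ]


# Hypothetical rows with only two of the feature columns.
demo_records = [
    {"fixed acidity": 7.0, "alcohol": 8.8, "target": 0},
    {"fixed acidity": 6.3, "alcohol": 9.5, "target": 1},
]
demo_instances = build_instances(demo_records)
print(json.dumps({"instances": demo_instances}))
```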

In [None]:
ENDPOINT_NAME = "winequality_endpoint"  # must match the name used in deploy_winequality

ENDPOINT_ID = !(gcloud ai endpoints list --region=$REGION \
              --format='value(ENDPOINT_ID)'\
              --filter=display_name=$ENDPOINT_NAME \
              --sort-by=creationTimeStamp)

In [None]:
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION)

ENDPOINT_NAME="winequality_endpoint"

# get the endpoint id
endpoint_output = !(gcloud ai endpoints list --region=$REGION \
              --format='value(ENDPOINT_ID)'\
              --filter=display_name=$ENDPOINT_NAME \
              --sort-by=creationTimeStamp --project=$PROJECT_ID)

if not endpoint_output:
    raise ValueError(f"No endpoint found with display name {ENDPOINT_NAME}")

# Extract the actual endpoint ID from the command output
# The actual ID should be the last line of the output
ENDPOINT_ID = endpoint_output[-1].strip()


# aiplatform.init(project=PROJECT_ID, location=REGION)
endpoint = aiplatform.Endpoint(ENDPOINT_ID)
prediction = endpoint.predict(instances=instances)

In [None]:
prediction.predictions

In [None]:
list(map(int, prediction.predictions)), test_df.target.tolist()

So the model performance is quite good. At least on those 10 records :)
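
To put a number on that comparison, the share of matching predictions can be computed directly. This is a small sketch with made-up values, and `accuracy` is a hypothetical helper rather than anything from the Vertex AI SDK.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the targets after casting to int."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one record is required")
    hits = sum(int(p) == int(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Made-up values in the same shape as prediction.predictions and test_df.target
demo_predictions = [0.0, 1.0, 0.0, 0.0, 1.0]
demo_targets = [0, 1, 0, 1, 1]
print(accuracy(demo_predictions, demo_targets))
```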

# Test the batch prediction

This takes some time, but you will end up with working code and an understanding of how to run it.



In [None]:
# Define variables
job_display_name = "winequality-batch-prediction-job"
MODEL_NAME="winequality"
ENDPOINT_NAME="winequality_endpoint"
BUCKET_URI="gs://tokyo-charge-378510-bucket-winequality/pipeline_root_wine/24871937313/pipeline-winequality-20230222111722/get_wine_data_5671759263626690560"
input_file_name="dataset_test.csv"

# Get model id
MODEL_ID=!(gcloud ai models list --region=$REGION \
           --filter=display_name=$MODEL_NAME)
MODEL_ID=MODEL_ID[2].split(" ")[0]

model_resource_name = f'projects/{PROJECT_ID}/locations/{REGION}/models/{MODEL_ID}'
gcs_source= [f"{BUCKET_URI}/{input_file_name}"]
gcs_destination_prefix=f"{BUCKET_URI}/output"

def batch_prediction_job(
    project: str,
    location: str,
    model_resource_name: str,
    job_display_name: str,
    gcs_source: list,  # list of gs:// URIs
    gcs_destination_prefix: str,
    machine_type: str,
    starting_replica_count: int = 1, # The number of nodes for this batch prediction job.
    max_replica_count: int = 1,
):
    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_resource_name)

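    # Use the function arguments instead of reading the module-level globals
    batch_prediction_job = model.batch_predict(
        job_display_name=job_display_name,
        instances_format='csv', #json
        gcs_source=gcs_source,
        gcs_destination_prefix=gcs_destination_prefix,
        machine_type=machine_type, # must be present
        starting_replica_count=starting_replica_count,
        max_replica_count=max_replica_count,
    )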
    batch_prediction_job.wait()
    print(batch_prediction_job.display_name)
    print(batch_prediction_job.state)
    return batch_prediction_job

batch_prediction_job(PROJECT_ID, REGION, model_resource_name, job_display_name, gcs_source, gcs_destination_prefix, machine_type="n1-standard-2")
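
The lookup `MODEL_ID[2].split(" ")[0]` above relies on the model row always being the third line of the `gcloud` output. A slightly more defensive sketch is shown below. `first_model_id` is a hypothetical helper, and the sample lines only assume the layout implied by that index (a status line, a header row starting with `MODEL_ID`, then one row per model); they are not captured output.

```python
def first_model_id(lines):
    """Return the first column of the first data row after the MODEL_ID header."""
    header_seen = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not header_seen:
            header_seen = fields[0] == "MODEL_ID"
            continue  # skip status lines and the header itself
        return fields[0]
    raise ValueError("no model rows found in gcloud output")


# Assumed layout for illustration only.
demo_output = [
    "Using endpoint [https://us-central1-aiplatform.googleapis.com/]",
    "MODEL_ID             DISPLAY_NAME",
    "1234567890123456789  winequality",
]
print(first_model_id(demo_output))
```

Raising an error when no row is found makes a missing model visible immediately instead of surfacing later as an `IndexError`.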

# Clean up!

In [None]:
# set the following parameters manually

DISPLAY_NAME = "pipeline-winequality-20230222111722"
BUCKET_URI = "gs://tokyo-charge-378510-bucket-winequality/pipeline_root_wine/24871937313/aaa"

In [None]:
import os
delete_pipeline = True
delete_bucket = True

try:
    if delete_pipeline and "DISPLAY_NAME" in globals():
        pipelines = aiplatform.PipelineJob.list(
            filter=f"display_name={DISPLAY_NAME}", order_by="create_time"
        )
        # delete() is an instance method; a separate name avoids shadowing pipeline()
        pipeline_job = pipelines[0]
        job_name = pipeline_job.resource_name
        pipeline_job.delete()
        print("Deleted pipeline:", job_name)
except Exception as e:
    print(e)

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI


Then we need to go to Vertex AI and delete the deployed models.
This is important, as everything we store in the cloud costs something.

In [None]:
# Define the get_wine_data component using a YAML specification
get_wine_data_yaml = """
name: Get Wine Data
description: Reads the wine quality dataset and splits it into train and test sets.
inputs:
  - name: url
    type: String
outputs:
  - name: dataset_train
    type: Dataset
  - name: dataset_test
    type: Dataset
implementation:
  container:
    image: python:3.9
    command:
      - python
      - -c
      - |
        import pandas as pd
        from sklearn.model_selection import train_test_split
        import argparse

        parser = argparse.ArgumentParser()
        parser.add_argument('--url', type=str)
        parser.add_argument('--dataset_train_path', type=str)
        parser.add_argument('--dataset_test_path', type=str)
        args = parser.parse_args()

        df_wine = pd.read_csv(args.url, delimiter=";")
        df_wine['best_quality'] = df_wine.quality.apply(lambda x: int(x>=7))
        df_wine['target'] = df_wine.best_quality
        df_wine.drop(
            columns=['quality', 'total sulfur dioxide', 'best_quality'],
            inplace=True
        )

        train, test = train_test_split(df_wine, test_size=0.3)
        train.to_csv(args.dataset_train_path + ".csv" , index=False)
        test.to_csv(args.dataset_test_path + ".csv" , index=False)

    args:
      - --url
      - {inputValue: url}
      - --dataset_train_path
      - {outputPath: dataset_train}
      - --dataset_test_path
      - {outputPath: dataset_test}
"""

from kfp import components
# The YAML is held in a string, so load it with load_component_from_text
get_wine_data_op = components.load_component_from_text(get_wine_data_yaml)

In [None]:
@dsl.pipeline(
  # Default pipeline root. You can override it when submitting the pipeline.
  pipeline_root=PIPELINE_ROOT,
  # A name for the pipeline. Used to determine the pipeline context.
  name="pipeline-winequality",
)
def pipeline(
  url: str = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
  project: str = PROJECT_ID,
  region: str = REGION,
  display_name: str = DISPLAY_NAME,
  api_endpoint: str = REGION+"-aiplatform.googleapis.com",
  thresholds_dict_str: str = '{"roc":0.8}',
  serving_container_image_uri: str = "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
):
  data_op = get_wine_data_op(url=url)
  train_model_op = train_winequality(dataset=data_op.outputs["dataset_train"])
  model_evaluation_op = winequality_evaluation(
      test_set=data_op.outputs["dataset_test"],
      rf_winequality_model=train_model_op.outputs["model"],
      thresholds_dict_str = thresholds_dict_str, # the model is deployed only if its performance is above the threshold
  )

  with dsl.Condition(
      model_evaluation_op.outputs["deploy"]=="true",
      name="deploy-winequality",
  ):
      deploy_model_op = deploy_winequality(
        model=train_model_op.outputs['model'],
        project=project,
        region=region,
        serving_container_image_uri = serving_container_image_uri,
      )