<a href="https://colab.research.google.com/github/joahofmann/gcp-notebooks/blob/main/vertex_ai_pipeline_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vertex AI Pipelines

In [None]:
# Reinstall google-colab to resolve potential dependency conflicts with the requests library.
!pip install --upgrade google-colab

from google.colab import auth
auth.authenticate_user()



In [None]:
!gcloud auth login



You are running on a Google Compute Engine virtual machine.
It is recommended that you use service accounts for authentication.

You can run:

  $ gcloud config set account `ACCOUNT`

to switch accounts if necessary.

Your credentials may be visible to others with access to this
virtual machine. Are you sure you want to authenticate with
your personal account?

Do you want to continue (Y/n)?  Y

Go to the following link in your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.

In [None]:
PROJECT_ID='vertex-test-id'
REGION='us-west2'

In [None]:
!gcloud config set project vertex-test-id

Updated property [core/project].


In [None]:
SERVICE_ACCOUNT='219162896674-compute@developer.gserviceaccount.com'

In [None]:
#!gcloud storage rm --recursive gs://my-training-artifacts-45567

In [None]:
BUCKET_URI='gs://my-training-artifacts-4567'
#!gsutil mb -l us-west2 gs://my-training-artifacts-45567

### Important Note:
This notebook may deploy and consume cloud resources in your Google Cloud project(s), for which you will be billed. It is your responsibility to verify the impact of this code before you run it, and to monitor and delete any resources afterwards to avoid ongoing charges.

In [None]:
#!pip install kfp
#!pip install google-cloud-pipeline-components
#!pip install --upgrade google-cloud-pipeline-components

!pip3 install google-cloud-aiplatform kfp google_cloud_pipeline_components



### Dependencies

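#### Before running this notebook, make sure the following libraries are installed with the correct versions.

Note: the list below pins `kfp==1.8.21`, but the components in this notebook use the KFP v2 SDK (`Input`, `Output`, and `Dataset` imported from `kfp.dsl`, plus `dsl.If`), which that release does not provide. Running the code as written needs a `kfp` 2.x release and a compatible `google-cloud-pipeline-components`.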

- pandas==1.3.5
- numpy==1.21.6
- google-cloud-aiplatform==1.24.1
- google-cloud-storage==2.9.0
- google-cloud-bigquery==2.34.4
- kfp==1.8.21
- google-cloud-pipeline-components==1.0.42
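
To see how the current environment compares with the pins listed above, something like the following may help. It is a minimal sketch using only the standard library; the helper names `installed_version` and `compare_with_pins` are made up for illustration, and the `PINNED` mapping simply copies the list above.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the list above; adjust if you install different releases.
PINNED = {
    "pandas": "1.3.5",
    "numpy": "1.21.6",
    "google-cloud-aiplatform": "1.24.1",
    "google-cloud-storage": "2.9.0",
    "google-cloud-bigquery": "2.34.4",
    "kfp": "1.8.21",
    "google-cloud-pipeline-components": "1.0.42",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_with_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from its pin to (pinned, installed)."""
    mismatches = {}
    for package, pinned in pins.items():
        found = installed_version(package)
        if found != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


for package, (pinned, found) in compare_with_pins(PINNED).items():
    print(f"{package}: pinned {pinned}, installed {found}")
```

A package that is not installed at all is reported with `None` as its installed version.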

## Import useful libraries

In [None]:
from typing import NamedTuple
import typing
from kfp import dsl
from kfp.dsl import (Artifact,
                        Dataset,
                        Input,
                        Model,
                        Output,
                        Metrics,
                        ClassificationMetrics,
                        component,
                        OutputPath,
                        InputPath)

from kfp import compiler
from google.cloud import bigquery
from google.cloud import aiplatform
from google.cloud.aiplatform import pipeline_jobs
# from google_cloud_pipeline_components import aiplatform as gcc_aip
# from google_cloud_pipeline_components.v1.vertex_pipelines import ModelDeployOp

In [None]:
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

## Configurations

In [None]:
import vertexai
vertexai.init(project=PROJECT_ID, location=REGION)

## Load open-source wine quality dataset

In [None]:
import pandas as pd

df_wine = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delimiter=";")
df_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


Available columns:
- volatile acidity :   The gaseous acids present in wine.
- fixed acidity :   The primary fixed acids found in wine are tartaric, succinic, citric, and malic.
- residual sugar :   Amount of sugar left after fermentation.
- citric acid :   A weak organic acid found naturally in citrus fruits.
- chlorides :   Amount of salt present in wine.
- free sulfur dioxide :   SO2 is used to protect wine from oxidation and microbial spoilage.
- total sulfur dioxide
- pH :   Used to check the acidity of the wine.
- density
- sulphates :   Added sulfites preserve freshness and protect wine from oxidation and bacteria.
- alcohol :   Percentage of alcohol present in wine.

## Create pipeline Components

### We will create 4 components:  
- Load data   
- Train a model
- Evaluate the model
- Deploy the model

## Component1: Load the wine quality dataset

In [None]:
@component(
    # scikit-learn is left unpinned, as in the other components: the original
    # pin (1.0.0) predates Python 3.11, so no prebuilt wheel exists for this image.
    packages_to_install=["pandas", "pyarrow", "scikit-learn"],
    base_image="python:3.11",
    output_component_file="load_data_component.yaml"
)
def get_wine_data(
    url: str,
    dataset_train: Output[Dataset],
    dataset_test: Output[Dataset]
):
    import pandas as pd
    from sklearn.model_selection import train_test_split as tts

    # The source CSV is semicolon-delimited.
    df_wine = pd.read_csv(url, delimiter=";")
    # Binary target: 1 for wines rated 7 or higher, 0 otherwise.
    df_wine['target'] = [1 if x >= 7 else 0 for x in df_wine.quality]
    df_wine = df_wine.drop(['quality', 'total sulfur dioxide'], axis=1)
    # Hold out 30% of the rows for evaluation.
    train, test = tts(df_wine, test_size=0.3)
    # Write both splits as CSV to the artifact paths managed by the pipeline.
    train.to_csv(
        dataset_train.path,
        index=False,
        encoding='utf-8-sig',
    )
    test.to_csv(
        dataset_test.path,
        index=False,
        encoding='utf-8-sig',
    )
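
The labelling rule inside `get_wine_data` (quality of 7 or higher becomes the positive class) can be checked in isolation before running the pipeline. The following is a minimal sketch in plain Python; the helper name `to_binary_target` and the sample scores are made up for illustration and are not rows from the dataset.

```python
def to_binary_target(qualities, cutoff=7):
    """Label a wine 1 when its quality score is at or above the cutoff, else 0."""
    return [1 if q >= cutoff else 0 for q in qualities]


# Made-up quality scores, for illustration only.
sample_qualities = [6, 6, 7, 5, 8, 4, 7]
targets = to_binary_target(sample_qualities)
print(targets)

# The share of positive labels gives a quick sense of class balance.
positive_share = sum(targets) / len(targets)
print(f"share of positive labels: {positive_share:.2f}")
```

Running the same check on the real `quality` column shows how imbalanced the two classes are, which is worth knowing when reading the accuracy reported later.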



## Component2: Train the model

In [None]:
@component(
    packages_to_install = [
        "pandas",
        "scikit-learn"
    ],
    base_image="python:3.11",
    output_component_file="model_training_component.yml",
)
def train_winequality(
    dataset:  Input[Dataset],
    model: Output[Model],
):
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd
    import pickle

    data = pd.read_csv(dataset.path)
    model_rf = RandomForestClassifier(n_estimators=10)
    model_rf.fit(
        data.drop(columns=["target"]),
        data.target,
    )
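    model.metadata["framework"] = "RF"
    # Pickle the fitted model next to the artifact path; the evaluation step
    # loads it from the same "<path>.pkl" location.
    file_name = model.path + ".pkl"
    with open(file_name, 'wb') as file:
        pickle.dump(model_rf, file)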



## Component3: Evaluate the model

In [None]:
@component(
    packages_to_install = [
        "pandas",
        "scikit-learn"
    ],
    base_image="python:3.11",
    output_component_file="model_evaluation_component.yml",
)
def winequality_evaluation(
    test_set:  Input[Dataset],
    rf_winequality_model: Input[Model],
    thresholds_dict_str: str,
    metrics: Output[ClassificationMetrics],
    kpi: Output[Metrics]
) -> NamedTuple("output", [("deploy", str)]):

    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd
    import logging
    import pickle
    from sklearn.metrics import roc_curve, confusion_matrix, accuracy_score
    import json
    import typing

    def threshold_check(val1, val2):
        cond = "false"
        if val1 >= val2 :
            cond = "true"
        return cond

    # The load step writes the test split to the artifact path itself (no extension).
    data = pd.read_csv(test_set.path)
    # Load the pickled model written by the training component.
    file_name = rf_winequality_model.path + ".pkl"
    with open(file_name, 'rb') as file:
        model = pickle.load(file)

    x_test = data.drop(columns=["target"])
    y_target = data.target
    y_pred = model.predict(x_test)

    y_scores =  model.predict_proba(
        data.drop(columns=["target"])
    )[:, 1]
    fpr, tpr, thresholds = roc_curve(
         y_true=data.target.to_numpy(),
        y_score=y_scores, pos_label=True
    )
    metrics.log_roc_curve(
        fpr.tolist(),
        tpr.tolist(),
        thresholds.tolist()
    )

    metrics.log_confusion_matrix(
       ["False", "True"],
       confusion_matrix(
           data.target, y_pred
       ).tolist(),
    )

    accuracy = accuracy_score(data.target, y_pred.round())
    thresholds_dict = json.loads(thresholds_dict_str)
    rf_winequality_model.metadata["accuracy"] = float(accuracy)
    kpi.log_metric("accuracy", float(accuracy))
    # Compare against the threshold as a float; int() would truncate 0.8 to 0
    # and let every model through.
    deploy = threshold_check(float(accuracy), float(thresholds_dict['roc']))
    return (deploy,)
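
The deployment gate depends on comparing the accuracy with a threshold parsed from a JSON string. The sketch below isolates that comparison so it can be tried without a pipeline run. The helper name `should_deploy` is hypothetical; the `"roc"` key and the `"true"`/`"false"` strings follow the component above. The threshold is parsed with `float()`, since `int()` truncates a fractional value such as 0.8 to 0.

```python
import json


def should_deploy(accuracy: float, thresholds_dict_str: str, key: str = "roc") -> str:
    """Return "true" when accuracy meets the threshold stored under `key`, else "false"."""
    threshold = float(json.loads(thresholds_dict_str)[key])
    return "true" if accuracy >= threshold else "false"


thresholds = '{"roc":0.8}'
print(should_deploy(0.85, thresholds))  # accuracy above the threshold
print(should_deploy(0.75, thresholds))  # accuracy below the threshold

# Truncating the threshold with int() turns 0.8 into 0, so every model would pass.
print(int(json.loads(thresholds)["roc"]))
```

The result is returned as a string because the pipeline condition compares the component output against the literal `"true"`.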



## Component4: Deploy model

In [None]:
@component(
    packages_to_install=["google-cloud-aiplatform", "scikit-learn",  "kfp"],
    base_image="python:3.11",
    output_component_file="model_winequality_component.yml"
)
def deploy_winequality(
    model: Input[Model],
    project: str,
    region: str,
    serving_container_image_uri : str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform
    aiplatform.init(project=project, location=region)

    DISPLAY_NAME  = "winequality"
    MODEL_NAME = "winequality-rf"
    ENDPOINT_NAME = "winequality_endpoint"

    def create_endpoint():
        # Reuse the most recent endpoint with this display name, or create one.
        endpoints = aiplatform.Endpoint.list(
            filter='display_name="{}"'.format(ENDPOINT_NAME),
            order_by='create_time desc',
            project=project,
            location=region,
        )
        if len(endpoints) > 0:
            endpoint = endpoints[0]  # most recently created
        else:
            endpoint = aiplatform.Endpoint.create(
                display_name=ENDPOINT_NAME, project=project, location=region
            )
        # Without this return, the caller would receive None.
        return endpoint
    endpoint = create_endpoint()

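    # Import a model programmatically.
    # artifact_uri must point at the directory holding model.pkl, which is the
    # parent of the "model" artifact path. Dropping the last path segment avoids
    # str.replace, which would also strip "model" anywhere else in the URI.
    model_upload = aiplatform.Model.upload(
        display_name=DISPLAY_NAME,
        artifact_uri=model.uri.rsplit("/", 1)[0],
        serving_container_image_uri=serving_container_image_uri,
        serving_container_health_route=f"/v1/models/{MODEL_NAME}",
        serving_container_predict_route=f"/v1/models/{MODEL_NAME}:predict",
        serving_container_environment_variables={
            "MODEL_NAME": MODEL_NAME,
        },
    )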
    model_deploy = model_upload.deploy(
        machine_type="n1-standard-4",
        endpoint=endpoint,
        traffic_split={"0": 100},
        deployed_model_display_name=DISPLAY_NAME,
    )

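    # Record the resource names on the output artifacts.
    # Model.deploy() returns the Endpoint, so the model name comes from model_upload.
    vertex_endpoint.uri = model_deploy.resource_name
    vertex_model.uri = model_upload.resource_name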
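
The upload step needs the storage directory that contains the pickled model, which is the parent of the model artifact's URI. A minimal sketch of deriving that directory follows; the helper name `artifact_directory` and the example URI are hypothetical, chosen only to show why removing the last path segment is safer than a plain string replacement.

```python
def artifact_directory(model_uri: str) -> str:
    """Return the parent 'directory' of a storage URI by dropping its last path segment."""
    return model_uri.rstrip("/").rsplit("/", 1)[0]


# Hypothetical URI, shaped like a pipeline output artifact path.
example_uri = "gs://example-bucket/model-runs/train-winequality/model"

print(artifact_directory(example_uri))

# str.replace removes every occurrence, including the one inside "model-runs".
print(example_uri.replace("model", ""))
```

The difference only shows up when the word "model" appears elsewhere in the bucket or pipeline root path, so it is easy to miss in a first test run.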



## Pipeline Definition

In [None]:
DISPLAY_NAME = 'pipeline-winequality-job{}'.format(TIMESTAMP)

Once you have created all the needed components, define the pipeline and compile it into a `.json` file.

In [None]:
@dsl.pipeline(
    pipeline_root=BUCKET_URI,
    name="pipeline-winequality",
)
def pipeline(
    url: str = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
    project: str = PROJECT_ID,
    region: str = REGION,
    display_name: str = DISPLAY_NAME,
    api_endpoint: str = REGION+"-aiplatform.googleapis.com",
    thresholds_dict_str: str = '{"roc":0.8}',
    serving_container_image_uri: str = "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    ):

    # adding first component
    data_op = get_wine_data(url=url)
    # second component uses output of first component as input
    train_model_op = train_winequality(dataset=data_op.outputs["dataset_train"])
    # add third component (uses outputs of comp1 and comp2 as input)
    model_evaluation_op = winequality_evaluation(
        test_set=data_op.outputs["dataset_test"],
        rf_winequality_model=train_model_op.outputs["model"],
        # We deploy the model only if the model performance is above the threshold
        thresholds_dict_str = thresholds_dict_str,
    )

    # condition to deploy the model
    with dsl.If(
        model_evaluation_op.outputs["deploy"]=="true",
        name="deploy-winequality",
    ):
        deploy_model_op = deploy_winequality(
            model=train_model_op.outputs['model'],
            project=project,
            region=region,
            serving_container_image_uri = serving_container_image_uri,
        )
        # The outputs from deploy_model_op are not directly returned as pipeline outputs in this version.
        # You can find the deployed endpoint and model resource names in Vertex AI after the pipeline runs.

## Compile the pipeline

In [None]:
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='ml_winequality.json',
)

The pipeline compilation generates the **ml_winequality.json** job spec file.
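
Before submitting the job, it can be useful to confirm what ended up in the compiled file. The sketch below summarizes a spec using only the standard library. The key names `pipelineInfo` and `root.dag.tasks` are an assumption about the layout of the compiled spec and should be checked against your own `ml_winequality.json`; the helper names and the toy spec are made up for illustration.

```python
import json


def summarize_pipeline_spec(spec: dict) -> dict:
    """Collect the pipeline name and top-level task names from a compiled spec dict."""
    # Assumed layout: {"pipelineInfo": {"name": ...}, "root": {"dag": {"tasks": {...}}}}
    name = spec.get("pipelineInfo", {}).get("name")
    tasks = spec.get("root", {}).get("dag", {}).get("tasks", {})
    return {"name": name, "tasks": sorted(tasks)}


def summarize_pipeline_file(path: str) -> dict:
    """Load a compiled JSON job spec from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_pipeline_spec(json.load(handle))


# Hand-written stand-in for a compiled spec, for illustration only.
toy_spec = {
    "pipelineInfo": {"name": "pipeline-winequality"},
    "root": {"dag": {"tasks": {"get-wine-data": {}, "train-winequality": {}}}},
}
print(summarize_pipeline_spec(toy_spec))
```

After compiling, `summarize_pipeline_file("ml_winequality.json")` would apply the same summary to the real file. Missing keys yield `None` and an empty task list rather than an exception.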

## Run the pipeline

In [None]:
pipeline_job = pipeline_jobs.PipelineJob(
    display_name="winequality-pipeline",
    template_path="ml_winequality.json",
    enable_caching=False,
    location=REGION,
    project=PROJECT_ID,
)

In [None]:
pipeline_job.run()

RuntimeError: Job failed with:
code: 9
message: " The DAG failed because some tasks failed. The failed tasks are: [get-wine-data].; Job (project_id = vertex-test-id, job_id = 7117053004687081472) is failed due to the above error.; Failed to handle the job: {project_number = 219162896674, job_id = 7117053004687081472}"
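
The error above names the task that failed (`get-wine-data`), which tells you which component's logs to open first. When a run has many tasks, pulling those names out of the message can save some scrolling. This is a minimal sketch with a hypothetical helper `failed_tasks`; the pattern is based only on the single message shown above, and the assumption that several failed tasks would be comma-separated is unverified.

```python
import re


def failed_tasks(message: str) -> list:
    """Extract task names from a 'The failed tasks are: [a, b]' fragment, if present."""
    match = re.search(r"The failed tasks are: \[([^\]]*)\]", message)
    if match is None:
        return []
    # Assumption: multiple task names, if any, are separated by commas.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


error_message = (
    "The DAG failed because some tasks failed. "
    "The failed tasks are: [get-wine-data].; Job (project_id = vertex-test-id, "
    "job_id = 7117053004687081472) is failed due to the above error."
)
print(failed_tasks(error_message))
```

The message itself does not state the cause; the logs of the named task do.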


In [None]:
#!pip install kfp==1.8.21 google-cloud-pipeline-components==1.0.42 google-cloud-aiplatform==1.24.1 google-cloud-storage==2.9.0 google-cloud-bigquery==2.34.4 pandas==1.3.5 numpy==1.21.6