# Setup

## Requirements

To follow along with this guide, you need:

1. A Google Cloud Platform project with billing enabled.
2. Conda installed on your system.
3. Docker installed on your system.

## Environment setup

In this section we set up the environment for the notebook:
1. Create a conda environment with the required packages, then activate it and use it as the notebook kernel.
2. Authenticate with Google Cloud Platform. This must be done from the terminal, because the notebook does not support the interactive authentication process.
3. Obtain local Application Default Credentials (ADC) for Google Cloud Platform. This also needs to be done from the terminal.

The following cells will guide you through the process.


In [None]:
%%writefile environment.yml
name: gcp-prediction-env
channels:
  - defaults
dependencies:
  - python=3.10
  - jupyter
  - pandas
  - conda-forge::scikit-learn
  - numpy
  - matplotlib
  - seaborn
  - xgboost>=1.7.0,<2.0.0
  - conda-forge::google-cloud-sdk
  - pyarrow
  - db-dtypes
  - pip
  - pip:
    - google-cloud-aiplatform[prediction]
    - fastapi[standard]

> **Note**: The `%%writefile` magic command will write the content of the cell to a file.

In [None]:
!conda env create -f environment.yml

In [None]:
#check if we are in the right conda environment
import json
conda_info= !conda info --json
conda_info = json.loads(''.join(conda_info))
conda_env = conda_info['active_prefix_name']
if conda_env == 'gcp-prediction-env':
    print('Conda environment is set up correctly!')
else:
    raise Exception('Please use the conda environment gcp-prediction-env')

In [None]:
# Check whether `gcloud auth login` has been run
active_string = "*"  # marks the currently active account
gcloud_auth = !gcloud auth list | grep "$active_string"
# grep prints nothing when no account is active, so guard against an empty result
if gcloud_auth and active_string in gcloud_auth[-1]:
    account = gcloud_auth[-1].split(' ')[-1]
    print(f'Active account is {account}')
    print('gcloud has been authenticated correctly!')
else:
    raise Exception('Please run `gcloud auth login` from the terminal')
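
The check above relies on the active account being marked with `*` in the `gcloud auth list` output. As a minimal sketch of that parsing step outside of IPython, the helper below scans captured output lines for the marker. The function name `find_active_account` and the sample output are hypothetical and only illustrate the assumed layout.

```python
from typing import Iterable, Optional


def find_active_account(lines: Iterable[str]) -> Optional[str]:
    """Return the account on the line marked with '*', or None if there is none."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("*"):
            parts = stripped.split()
            if len(parts) >= 2:
                return parts[-1]
    return None


# Illustrative sample only; real output may differ between gcloud versions.
sample_output = [
    "  Credentialed Accounts",
    "ACTIVE  ACCOUNT",
    "*       user@example.com",
    "        other@example.com",
]
print(find_active_account(sample_output))  # user@example.com
```

Returning `None` instead of indexing into an empty list makes the "not authenticated" case explicit.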

In [None]:
# check if there is an ADC for gcp
gcloud_adk = !gcloud auth application-default print-access-token
if 'ERROR' not in gcloud_adk[-1]:
    print('ADC has been set up correctly!')
else:
    raise Exception('Please run `gcloud auth application-default login` from the terminal')

## Config

This section defines some global variables that will be used throughout the notebook.

In [100]:
PROJECT_ID = "my-project" # @param {type:"string"}
LOCATION = "us-west1" # @param {type:"string"}

In [101]:
if PROJECT_ID == "my-project":
    PROJECT_ID = !gcloud config get-value project
    PROJECT_ID=PROJECT_ID[0]

In [102]:
repository = "iris-artifact-repo"
image = "xgboost-predictor"
image_uri = f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{repository}/{image}"

In [None]:
!gcloud config set project $PROJECT_ID --quiet

In [104]:
seed=42 # for reproducibility

## Imports

This section imports the required libraries for the notebook.

In [105]:
import pandas as pd
import numpy as np
import os

# Model training
from sklearn import datasets
from sklearn.model_selection import train_test_split
import xgboost as xgb

# HTTP server and making requests
import subprocess
from time import sleep
import requests
import json

# BigQuery
from google.cloud import bigquery

#aiplatform
from google.cloud.aiplatform.prediction import LocalModel
from google.cloud import aiplatform

#logging
import logging
logging.getLogger().setLevel(logging.INFO)


## Utils

This section defines some utility functions that are used throughout the notebook. Reviewing them is not required, but it is recommended for understanding what happens in the notebook.


In [106]:
import re

def replace_variable_in_file(filepath: str, variable_name: str, variable_value: str) -> None:
    """
    Replaces all occurrences of a variable name in a given file with the specified value wrapped in double quotes.

    Parameters:
    filepath (str): The path to the file where the replacement should occur.
    variable_name (str): The name of the variable to be replaced.
    variable_value (str): The value to replace the variable with.

    Returns:
    None
    """
    with open(filepath, "rt") as file:
        s = file.read()

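    # Escape the name so it is matched literally, and use a function so that
    # backslashes in the value are not interpreted as regex escapes.
    s = re.sub(re.escape(variable_name), lambda _match: f'"{variable_value}"', s)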

    with open(filepath, "wt") as file:
        file.write(s)
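
As a rough illustration of what this helper does, the sketch below applies the same idea to a throwaway file. The function name `substitute_placeholder` and the sample file content are hypothetical. It escapes the placeholder with `re.escape` and passes the replacement through a function, so that backslashes in the value (for example in Windows paths) are not interpreted by `re.sub`.

```python
import os
import re
import tempfile


def substitute_placeholder(filepath: str, placeholder: str, value: str) -> None:
    """Replace every occurrence of `placeholder` in the file with `value` in double quotes."""
    with open(filepath, "rt") as file:
        text = file.read()

    # re.escape matches the placeholder literally; the lambda keeps `value` verbatim.
    text = re.sub(re.escape(placeholder), lambda _match: f'"{value}"', text)

    with open(filepath, "wt") as file:
        file.write(text)


# Hypothetical example: a one-line script that references an undefined name.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.py")
    with open(demo_path, "wt") as file:
        file.write("model.load_model(artifact_local_path)\n")

    substitute_placeholder(demo_path, "artifact_local_path", "model-artifacts/model.json")

    with open(demo_path, "rt") as file:
        demo_result = file.read()

print(demo_result)
```

Note that the replacement is purely textual: every occurrence of the name is replaced, including ones inside comments or longer identifiers.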

In [107]:
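def upload_to_bigquery(data_test: pd.DataFrame) -> tuple[str, str]:
    """
    Uploads a DataFrame to a BigQuery table, creating the dataset if needed.

    Parameters:
    data_test (pd.DataFrame): The DataFrame containing the data to be uploaded.

    Returns:
    tuple[str, str]: The dataset name and the fully qualified table ID.
    """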
    dataset_name = "iris_predictor"
    table_name = "test_data"
    table_id = f"{PROJECT_ID}.{dataset_name}.{table_name}"

    bq_client = bigquery.Client(project=PROJECT_ID)
    
    # Create dataset
    dataset = bigquery.Dataset(f"{PROJECT_ID}.{dataset_name}")
    dataset.location = LOCATION.split("-")[0].upper()
    dataset = bq_client.create_dataset(dataset, exists_ok=True)
    
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField(field, "FLOAT") for field in fields],
        write_disposition="WRITE_TRUNCATE"  # Overwrite the table
    )
        
    # Upload data to BigQuery
    job = bq_client.load_table_from_dataframe(data_test, table_id, job_config=job_config)
    job.result()  # Waits for the job to complete
    
    print(f"Loaded {job.output_rows} rows into {table_id}")

    return dataset_name, table_id
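
The dataset location above is derived with `LOCATION.split("-")[0].upper()`, which yields `US` for `us-west1`. A minimal sketch of that derivation, using a hypothetical helper name, suggests the shortcut does not generalize to every region: for `europe-west1` it would yield `EUROPE`, whereas the BigQuery multi-region is named `EU`. The mapping below is an assumption that covers only those two multi-regions; any other region is passed through unchanged.

```python
def bigquery_location_for(region: str) -> str:
    """Map a region such as 'us-west1' to a BigQuery dataset location (illustrative)."""
    prefix = region.split("-")[0].lower()
    # Assumption: only the US and EU multi-regions are handled specially here.
    multi_regions = {"us": "US", "europe": "EU"}
    return multi_regions.get(prefix, region)


print(bigquery_location_for("us-west1"))      # US
print(bigquery_location_for("europe-west1"))  # EU
print(bigquery_location_for("asia-east1"))    # asia-east1
```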

In [108]:
def fetch_predictions_from_bigquery(dataset_name: str) -> pd.DataFrame:
    """
    Fetches predictions from a BigQuery table and returns them as a DataFrame.

    Parameters:
    dataset_name (str): The name of the dataset containing the predictions table.

    Returns:
    pd.DataFrame: DataFrame containing the predictions.
    """
    query = f"""
    SELECT *
    FROM {dataset_name}.predictions
    """
    
    bq_client = bigquery.Client(project=PROJECT_ID)
    df = bq_client.query(query).to_dataframe()
    return df

# Train


In [109]:
iris = datasets.load_iris()
fields = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

In [110]:
artifact_local_folder = "model-artifacts"
if not os.path.exists(artifact_local_folder):
    os.makedirs(artifact_local_folder)
model_filename = "model.json"
artifact_local_path = os.path.join(artifact_local_folder, model_filename)

In [111]:
clf = xgb.XGBClassifier(seed=seed)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
clf.save_model(artifact_local_path)

In [None]:
clf = xgb.XGBClassifier()
clf.load_model(artifact_local_path)
print("Test set predictions: ", clf.predict(X_test))

# Local HTTP server


In [19]:
app_folder = "app"
if not os.path.exists(app_folder):
    os.makedirs(app_folder)
app_filepath=os.path.join(app_folder, "app.py")

In [None]:
%%writefile $app_filepath

import numpy as np
import xgboost as xgb
from fastapi import FastAPI, HTTPException, Request
import os

import logging
logging.getLogger().setLevel(logging.INFO)

# import your functions

app = FastAPI()

# load model before running app (to be saved in memory)
model_ = xgb.XGBClassifier()
model_.load_model(artifact_local_path)

try:
    HEALTH_ROUTE = os.environ["AIP_HEALTH_ROUTE"]
    PREDICTIONS_ROUTE = os.environ["AIP_PREDICT_ROUTE"]
except KeyError:
    HEALTH_ROUTE = "/health"
    PREDICTIONS_ROUTE = "/predictions"

def preprocess(content):
    preprocessed_content = np.asarray(content)
    return preprocessed_content

def prediction(ingest_data):
    return model_.predict(ingest_data)

def postprocess(predictions):
    predictions = predictions.tolist()
    return {"predictions": predictions}

@app.get(HEALTH_ROUTE, status_code=200)
def health():
    return {"status": "Server is up and running!"}


@app.post(PREDICTIONS_ROUTE)
async def predict(request: Request):

    request_json = await request.json()
    content = request_json["instances"]

    ingest_data = preprocess(content)

    predictions = prediction(ingest_data)

    output = postprocess(predictions)

    return output

> **Note**: The `%%writefile` magic command does not expand `$local_variable` references inside the cell body, so `artifact_local_path` is written to the file literally. The following cell replaces it with the actual path.

In [None]:
replace_variable_in_file(app_filepath, r"artifact_local_path", artifact_local_path)

# Check if replacement was successful
!cat $app_filepath -n | grep $artifact_local_path
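
The predict route written to `app.py` is a three-step pipeline: preprocess the `instances` payload, run the model, and postprocess the result into a `predictions` payload. The sketch below traces that flow without FastAPI or XGBoost. `StubModel` and `handle_request` are hypothetical stand-ins, and the stub's rule (pick the index of the largest feature) is purely illustrative, not the trained classifier.

```python
import json


class StubModel:
    """Hypothetical stand-in for the classifier: returns the index of the largest feature."""

    def predict(self, rows):
        return [max(range(len(row)), key=row.__getitem__) for row in rows]


def handle_request(body: str, model) -> str:
    """Mirror the route's steps: parse, preprocess, predict, postprocess."""
    request_json = json.loads(body)
    # Preprocess: convert every instance to a list of floats.
    instances = [list(map(float, row)) for row in request_json["instances"]]
    # Predict: delegate to the model object.
    predictions = model.predict(instances)
    # Postprocess: wrap the predictions in the response payload.
    return json.dumps({"predictions": list(predictions)})


request_body = json.dumps({"instances": [[5.1, 3.5, 1.4, 0.2], [0.3, 6.7, 3.1, 4.4]]})
print(handle_request(request_body, StubModel()))  # {"predictions": [0, 1]}
```

Keeping the three steps as separate functions, as `app.py` does, makes it easy to swap the model or the payload format independently.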

In [None]:
# Start the FastAPI server in a subprocess.
# Running it directly in a cell would block the notebook indefinitely.
fastapi_process = subprocess.Popen(["fastapi", "run"])

# Wait 30 seconds for the HTTP server to start
sleep(30)
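
A fixed 30-second sleep works, but it waits longer than needed on fast machines and may be too short on slow ones. One possible alternative, sketched below under the assumption that the server exposes the `/health` route on port 8000, is to poll until the route answers. The helper names `http_probe` and `wait_until_ready` are hypothetical.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (assumes the server started above listens on localhost:8000):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/health"))
```

Passing the probe as a callable keeps the waiting logic independent of HTTP, so the same loop could also wait for the container started later in the notebook.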

In [None]:
data = {
    "instances": X_test.tolist()
}

response = requests.post("http://localhost:8000/predictions", json=data)

In [None]:
print("Response status code: ", response.status_code)
content_json = json.loads(response.content)
print("Response content: ", content_json)
df_local = pd.DataFrame(X_test, columns=fields)
df_local["prediction"] = content_json["predictions"]

In [25]:
# Terminate the subprocess
fastapi_process.terminate()

# Container HTTP server


In [26]:
requirements_filepath = os.path.join(app_folder, "requirements.txt")

In [None]:
%%writefile $requirements_filepath
fastapi[standard]
scikit-learn
xgboost>=1.7.0,<2.0.0

In [None]:
%%writefile Dockerfile
FROM python:3.10
COPY ./app /app
COPY ./model-artifacts /model-artifacts

RUN pip install -r /app/requirements.txt

CMD ["fastapi", "run"]
EXPOSE 8000

In [None]:
# build docker
!docker build -t xgboost-predictor .

In [None]:
# Start the container in a subprocess
docker_process = subprocess.Popen(["docker", "run", "-p", "8000:8000", "xgboost-predictor"])

#wait 30 seconds for http server to start
sleep(30)

> **Note:** If you are running in WSL2 with Docker Desktop, you need to configure the `.wslconfig` file to use mirrored networking mode. Otherwise the WSL2 kernel cannot reach `localhost` directly, only through mDNS. See https://stackoverflow.com/questions/64763147/access-a-localhost-running-in-windows-from-inside-wsl-2


In [None]:
data = {
    "instances": X_test.tolist()
}

response = requests.post("http://localhost:8000/predictions", json=data)

In [None]:
print("Response status code: ", response.status_code)

content_json = json.loads(response.content)
print("Response content: ", content_json)

In [33]:
docker_process.terminate()

# Custom Container

## Push to Artifact Registry

To push the container to Artifact Registry we need to follow these steps:
1. Build the container. (Already done in the previous section)
2. Enable the Artifact Registry API.
3. Create the repository in Artifact Registry.
4. Configure Docker to authenticate with Artifact Registry.
5. Tag the container.
6. Push the container.
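
The tag and push steps depend on the image URI following the naming scheme used for `image_uri` in the Config section: `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{repository}/{image}`. As a small sketch, the hypothetical helper below assembles such a URI and optionally appends a tag. The optional `tag` parameter is an addition for illustration; the notebook itself pushes without an explicit tag.

```python
from typing import Optional


def artifact_registry_image_uri(
    location: str,
    project_id: str,
    repository: str,
    image: str,
    tag: Optional[str] = None,
) -> str:
    """Build an image URI in the form LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE[:TAG]."""
    uri = f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}"
    return f"{uri}:{tag}" if tag else uri


print(artifact_registry_image_uri("us-west1", "my-project", "iris-artifact-repo", "xgboost-predictor"))
# us-west1-docker.pkg.dev/my-project/iris-artifact-repo/xgboost-predictor
```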

In [None]:
!gcloud services enable artifactregistry.googleapis.com

In [None]:
!gcloud artifacts repositories create $repository --repository-format=docker \
    --location=$LOCATION \
    --project=$PROJECT_ID

In [None]:
!gcloud artifacts repositories list --project=$PROJECT_ID

In [None]:
# Configure docker to authenticate with the repository endpoint
!gcloud auth configure-docker {LOCATION}-docker.pkg.dev --quiet

In [38]:
# Tag the image
!docker tag xgboost-predictor $image_uri

In [None]:
# Push the image
!docker push $image_uri

## Add to model registry

To add the model to the model registry we need to follow these steps:
1. Enable the Vertex AI API (`aiplatform.googleapis.com`).
2. Add the Artifact Registry reader role to the Vertex AI service account.
3. Upload the model to the Model Registry.


In [40]:
!gcloud services enable aiplatform.googleapis.com

In [None]:
project_number = !gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
vertex_ai_service_agent = f"service-{project_number[0]}@gcp-sa-aiplatform.iam.gserviceaccount.com"

!gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$vertex_ai_service_agent"\
    --role="roles/artifactregistry.reader"

In [42]:
model_display_name = "xgboost-iris-model"

## Check local deployment (Optional)

In [None]:
# (Optional) Check the model locally
local_model = LocalModel(
    serving_container_image_uri=image_uri,
    serving_container_predict_route="/predictions",
    serving_container_health_route="/health",
    serving_container_ports=[8000]
)

request_dict = dict(instances=X_test.tolist())   
json_request = json.dumps(request_dict) 

# Deploy the model locally and make a prediction
with local_model.deploy_to_local_endpoint(
    host_port="8000"
) as local_endpoint:
    predict_response = local_endpoint.predict(
        request=json_request,
        headers={"Content-Type": "application/json"},
        verbose=True,
    )

print(predict_response, predict_response.content)
df_custom_local = pd.DataFrame(X_test, columns=fields)
df_custom_local["prediction"] = json.loads(predict_response.content)["predictions"]

In [None]:
model = aiplatform.Model.upload(
    project=PROJECT_ID,
    location=LOCATION,
    display_name=model_display_name,
    serving_container_image_uri=image_uri,
    serving_container_predict_route="/predictions",
    serving_container_health_route="/health",
    serving_container_ports=[8000],
)

model.wait()

print(model.display_name)
print(model.resource_name)

# Confirm the upload (the earlier `response` variable belongs to the HTTP test, not to this call)
print("Model uploaded successfully:", model.resource_name)

> **Note**: Vertex AI batch prediction requires its input data to come from either a GCS bucket or a BigQuery table. Here we copy the data to a BigQuery table, which lets us inspect it from the BigQuery console before running the batch prediction, or afterwards if something goes wrong.

In [54]:
# Data to copy
data_test = pd.DataFrame(X_test, columns=fields)

In [None]:
# Use the upload_to_bigquery utility defined in Setup -> Utils
dataset_name, table_id = upload_to_bigquery(data_test)

In [None]:
# get model id
model_id = !gcloud ai models list --region=$LOCATION --format="value(name)" --filter="display_name=$model_display_name"
print("Command output: ", model_id)
model_id = model_id[-1]
print("Model ID: ", model_id)
try:
    int(model_id)
except ValueError:
    print("model_id is not an integer")
    raise Exception("Seems the model ID could not be fetched!")
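
The check above assumes that the last line of the command output is a bare numeric ID. Depending on the output format, the value might instead be a full resource name of the form `projects/.../locations/.../models/...`; this is an assumption, not something the notebook output confirms. A minimal sketch that accepts both shapes, using the hypothetical helper `extract_model_id`, could look like this:

```python
def extract_model_id(name: str) -> str:
    """Return the trailing numeric ID from a bare ID or a slash-separated resource name."""
    candidate = name.strip().rsplit("/", 1)[-1]
    if not candidate.isdigit():
        raise ValueError(f"Could not find a numeric model ID in {name!r}")
    return candidate


# Both sample values are made up for illustration.
print(extract_model_id("1234567890"))                                          # 1234567890
print(extract_model_id("projects/42/locations/us-west1/models/1234567890"))  # 1234567890
```

Raising `ValueError` with the offending value in the message makes a failed lookup easier to diagnose than a generic exception.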

In [57]:
model_resource_name = f"projects/{PROJECT_ID}/locations/{LOCATION}/models/{model_id}"
bigquery_source_input_uri = f"bq://{table_id}"
bigquery_destination_output_uri = f"bq://{PROJECT_ID}.{dataset_name}.predictions"

In [None]:
aiplatform.init(project=PROJECT_ID, location=LOCATION)

model = aiplatform.Model(model_resource_name)

batch_prediction_job = model.batch_predict(
    job_display_name="iris_batch_prediction",
    bigquery_source=bigquery_source_input_uri,
    bigquery_destination_prefix=bigquery_destination_output_uri,
    machine_type="e2-standard-2",
    starting_replica_count=1,
    max_replica_count=1,
)

batch_prediction_job.wait()

print(batch_prediction_job.display_name)
print(batch_prediction_job.resource_name)
print(batch_prediction_job.state)

In [None]:
# Use the fetch_predictions_from_bigquery utility defined in Setup -> Utils
df = fetch_predictions_from_bigquery(dataset_name)

In [None]:
print("Predictions:")
print(df["prediction"].astype(int).tolist())

> **Note:** The predictions are not in the same order as the input data for the batch prediction call. This is because the data is processed in multiple calls in an asynchronous operation and the order of the predictions is not guaranteed. We could force the order by setting the batch size to the size of the test set, so that it is processed in a single batch. Here we instead just postprocess the predictions to match the order of the input data.


In [None]:
# merge data_test to df
df_ordered = pd.merge(df, data_test, how="right")
print("Ordered Predictions:")
print(df_ordered["prediction"].astype(int).tolist())
df_docker = df_ordered.copy()
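
The reordering step matches each prediction row to its input row through the feature values. The sketch below shows the same idea with plain Python data structures; the helper `reorder_predictions` and the shortened two-column schema are hypothetical. It assumes feature values round-trip through BigQuery exactly, since rows are matched by equality.

```python
from collections import defaultdict, deque


def reorder_predictions(inputs, results, feature_names):
    """Return predictions in the order of `inputs`, matching rows on their feature values."""
    # Group predictions by feature tuple; a deque per key copes with duplicate rows.
    by_features = defaultdict(deque)
    for row in results:
        key = tuple(row[name] for name in feature_names)
        by_features[key].append(row["prediction"])

    ordered = []
    for row in inputs:
        key = tuple(row[name] for name in feature_names)
        ordered.append(by_features[key].popleft())
    return ordered


feature_names = ["sepal_length", "petal_length"]  # shortened, hypothetical schema
inputs = [
    {"sepal_length": 5.1, "petal_length": 1.4},
    {"sepal_length": 6.3, "petal_length": 4.9},
    {"sepal_length": 5.8, "petal_length": 5.1},
]
results = [  # same rows, returned in a different order
    {"sepal_length": 5.8, "petal_length": 5.1, "prediction": 2},
    {"sepal_length": 5.1, "petal_length": 1.4, "prediction": 0},
    {"sepal_length": 6.3, "petal_length": 4.9, "prediction": 1},
]
print(reorder_predictions(inputs, results, feature_names))  # [0, 1, 2]
```

Unlike `pd.merge`, which produces one output row per matching pair, this sketch consumes one prediction per input row, so duplicated feature rows do not multiply.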

# Custom Prediction Routine


In [62]:
cpr_folder = "custom_prediction_routine"
bucket_name= f"iris-bucket-{PROJECT_ID}"
bucket_cpr_path = f"gs://{bucket_name}/{cpr_folder}"

In [None]:
# create gcs bucket
!gsutil mb -l $LOCATION gs://$bucket_name

In [None]:
# upload model.json to cloud storage
!gsutil cp $artifact_local_path $bucket_cpr_path/$model_filename

In [None]:
#list contents of the bucket folder
!gsutil ls $bucket_cpr_path

In [66]:
artifact_uri = f"{bucket_cpr_path}/{model_filename}"

In [67]:
if not os.path.exists(cpr_folder):
    os.makedirs(cpr_folder)

In [None]:
%%writefile $cpr_folder/requirements.txt
fastapi[standard]
scikit-learn
xgboost>=1.7.0,<2.0.0

In [None]:
%%writefile $cpr_folder/predictor.py

import numpy as np

from google.cloud.aiplatform.prediction.sklearn.predictor import SklearnPredictor
from google.cloud.aiplatform.utils import prediction_utils
import xgboost as xgb
import os

import logging
logging.getLogger().setLevel(logging.INFO)

class CprPredictor(SklearnPredictor):
    
    def __init__(self):
        return
    
    def load(self, artifacts_uri: str):
        """Downloads the model artifacts and loads the XGBoost model."""
        prediction_utils.download_model_artifacts(artifacts_uri)

        self._model = xgb.XGBClassifier()
        self._model.load_model("model.json")
    
    def preprocess(self, prediction_input):
        instances = prediction_input["instances"]
        return np.asarray(instances)

    def predict(self, inputs):
        prediction_results = self._model.predict(inputs)
        return prediction_results
    
    def postprocess(self, prediction_results):
        predictions = prediction_results.tolist()
        return {"predictions": predictions}

In [70]:
#import CprPredictor from cpr_folder
try:
    current_directory = os.getcwd()
    os.chdir(cpr_folder)
    from predictor import CprPredictor
except ImportError:
    print("Error importing the CPR predictor")
finally:
    os.chdir(current_directory)


In [None]:
# Call the LocalModel build_cpr_model method. This will build the custom prediction routine docker image.
from google.cloud.aiplatform.prediction import LocalModel

local_model = LocalModel.build_cpr_model(
    cpr_folder,
    image_uri,
    base_image="python:3.10",
    predictor=CprPredictor,
    requirements_path=os.path.join(cpr_folder, "requirements.txt")
)

In [None]:
local_model.get_serving_container_spec()

In [73]:
request_dict = dict(instances=X_test.tolist())   
json_request = json.dumps(request_dict) 

In [74]:
# Deploy the model locally and make a prediction
with local_model.deploy_to_local_endpoint(
    artifact_uri=artifact_local_folder, # You can also use the GCS path here
    host_port="8000"
) as local_endpoint:
    predict_response = local_endpoint.predict(
        request=json_request,
        headers={"Content-Type": "application/json"},
        verbose=True,
    )

In [None]:
print(predict_response, predict_response.content)
df_cpr_local = pd.DataFrame(X_test, columns=fields)
df_cpr_local["prediction"] = json.loads(predict_response.content)["predictions"]

In [None]:
local_model.push_image()

In [None]:
from google.cloud import aiplatform

model = aiplatform.Model.upload(
    project=PROJECT_ID,
    location=LOCATION,
    local_model=local_model,
    display_name=model_display_name,
    artifact_uri=bucket_cpr_path,
    parent_model=model_resource_name, # This allows us to upload this model as a new version of the existing model
)

In [78]:
# delete bigquery prediction table, otherwise the results will be appended
with bigquery.Client() as bq_client:
    job = bq_client.delete_table(f"{PROJECT_ID}.{dataset_name}.predictions", not_found_ok=True)

In [None]:
batch_prediction_job = model.batch_predict(
    job_display_name="iris_batch_prediction",
    bigquery_source=bigquery_source_input_uri,
    bigquery_destination_prefix=bigquery_destination_output_uri,
    machine_type="e2-standard-2",
    starting_replica_count=1,
    max_replica_count=1,
)

batch_prediction_job.wait()

print(batch_prediction_job.display_name)
print(batch_prediction_job.resource_name)
print(batch_prediction_job.state)

In [None]:
df = fetch_predictions_from_bigquery(dataset_name)

In [None]:
print("Predictions:")
print(df["prediction"].astype(int).tolist())

In [None]:
# merge data_test to df
ordered_df = pd.merge(df, data_test, how="right")
print("Ordered Predictions:")
print(ordered_df["prediction"].astype(int).tolist())
df_cpr = ordered_df.copy()

# Prebuilt Container

The required steps to use the prebuilt container are:
1. Transform the model to the booster format and upload the artifacts to a GCS bucket.
2. Create a model resource in Vertex AI Model Registry.
3. Test a batch prediction.

> **Note**: The XGBoost prebuilt container requires the model in the booster format. So far we have used the scikit-learn wrapper, which has a nicer interface but cannot be used with the prebuilt container. In this section we convert the model to the booster format and then use the prebuilt container to make a batch prediction. See the Vertex AI [documentation](https://cloud.google.com/vertex-ai/docs/training/exporting-model-artifacts#xgboost) for more information on the prebuilt container requirements.


In [117]:
prebuilt_image_uri = "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-7:latest"

In [None]:
# Transforming the xgboost model to the booster format

# load xgboost model
clf = xgb.XGBClassifier()
clf.load_model(artifact_local_path)

#convert to plain xgboost model (not sklearn wrapper)
# ref: 
clf = clf.get_booster()

# save model
booster_filename = "model.bst"
booster_model_filepath = os.path.join(artifact_local_folder, booster_filename)
clf.save_model(booster_model_filepath)

# save to cloud storage
booster_folder="booster"
booster_folder_uri = f"gs://{bucket_name}/{booster_folder}"
# Build the GCS URI with "/" explicitly; os.path.join uses the OS path separator
booster_model_uri = f"{booster_folder_uri}/{booster_filename}"
!gsutil cp $booster_model_filepath $booster_model_uri

In [None]:
# (Optional) Test it locally

local_model = LocalModel(
    serving_container_image_uri=prebuilt_image_uri
)

# Note: for reasons that are unclear, the local deployment expects model.bst to be in a folder named "model"
local_booster_model_uri = f"{booster_folder_uri}/model/{booster_filename}"
!gsutil cp $booster_model_filepath $local_booster_model_uri

# Deploy the model locally and make a prediction
with local_model.deploy_to_local_endpoint(
     # The prebuilt container does not automatically pick up the ADC
    credential_path=os.path.expanduser("~/.config/gcloud/application_default_credentials.json"),
    # The prebuilt container expects the model to be in a folder named model inside the artifact_uri
    # Using a local folder here fails, hence we are using the GCS path here for the local model
    artifact_uri=booster_folder_uri,
    host_port="8000"
) as local_endpoint:
    predict_response = local_endpoint.predict(
        request=json_request,
        headers={"Content-Type": "application/json"},
        verbose=True,
    )

print(predict_response, predict_response.content)
df_custom_local = pd.DataFrame(X_test, columns=fields)
df_custom_local["prediction"] = json.loads(predict_response.content)["predictions"]

!gsutil rm -r $local_booster_model_uri

In [None]:
# use a gcp vertex prebuilt container 
model = aiplatform.Model.upload(
    project=PROJECT_ID,
    location=LOCATION,
    display_name=model_display_name,
    serving_container_image_uri=prebuilt_image_uri,
    parent_model=model_resource_name, # This allows us to upload this model as a new version of the existing model
    artifact_uri=booster_folder_uri
)

In [87]:
# delete bigquery prediction table, otherwise the results will be appended
with bigquery.Client() as bq_client:
    job = bq_client.delete_table(f"{PROJECT_ID}.{dataset_name}.predictions", not_found_ok=True)

In [None]:
# test batch prediction
batch_prediction_job = model.batch_predict(
    job_display_name="iris_batch_prediction",
    bigquery_source=bigquery_source_input_uri,
    bigquery_destination_prefix=bigquery_destination_output_uri,
    machine_type="e2-standard-2",
    starting_replica_count=1,
    max_replica_count=1,
)

In [None]:
df = fetch_predictions_from_bigquery(dataset_name)

> **Note:** The prebuilt container returns the prediction scores (i.e. the predicted probabilities for each class), so we need to postprocess the predictions to get the predicted class.

In [90]:
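# Convert a prediction value (assumed to be a bracketed, comma-separated string) to a class label
def get_prediction(x):
    # Strip whitespace and brackets instead of slicing fixed positions, which can drop digits
    float_array = [float(a) for a in x.strip().strip("[]").split(",")]
    label = int(np.argmax(float_array))
    return label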
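
Turning per-class scores into a label amounts to taking the index of the highest score. As a minimal sketch that does not depend on NumPy, the hypothetical helper below accepts either a list of scores or its JSON string form. That the prediction column holds a JSON-style array string is an assumption about the output format.

```python
import json


def predicted_class(raw_scores) -> int:
    """Return the index of the highest score; accepts a list or its JSON string form."""
    scores = json.loads(raw_scores) if isinstance(raw_scores, str) else list(raw_scores)
    return max(range(len(scores)), key=lambda i: float(scores[i]))


print(predicted_class("[0.02, 0.95, 0.03]"))  # 1
print(predicted_class([0.7, 0.2, 0.1]))       # 0
```

Parsing with `json.loads` avoids depending on the exact position of brackets or trailing whitespace in the string.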

In [None]:
print("Raw Predictions:")
print(df["prediction"].tolist())
print("Postprocessed Predictions:")
print(df["prediction"].apply(get_prediction).tolist())

In [None]:
new_df = df.copy()
new_df["prediction"] = new_df["prediction"].apply(get_prediction)
ordered_df= pd.merge(new_df, data_test, how="right")
print("Ordered Predictions:")
print(ordered_df["prediction"].astype(int).tolist())
df_prebuilt = ordered_df.copy()

# Compare results

In [None]:
#compare results from each df on a single table
df_local["source"] = "local"
df_docker["source"] = "docker"
df_cpr_local["source"] = "cpr_local"
df_cpr["source"] = "cpr"
df_prebuilt["source"] = "prebuilt"

df_all = pd.concat([df_local, df_docker, df_cpr_local, df_cpr, df_prebuilt])
df_all["prediction"] = df_all["prediction"].astype(int)
df_pivot = df_all.pivot_table(index=fields, columns="source", values="prediction", aggfunc="first")
# add a column checking if all have the same value
df_pivot["all_equal"] = df_pivot.apply(lambda x: len(set(list(x)))==1, axis=1)
# print length
print("df_pivot.shape: ", df_pivot.shape)
print("Value counts of all_equal:")
print(df_pivot["all_equal"].value_counts())
df_pivot.head()
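
The comparison boils down to checking, row by row, whether every deployment option produced the same label. The sketch below expresses that check with plain lists. The helper `all_sources_agree` and the sample labels are hypothetical, and it assumes the predictions from every source are already aligned in the same row order.

```python
def all_sources_agree(predictions_by_source):
    """Return, for each row, whether every source produced the same label."""
    columns = list(predictions_by_source.values())
    if len({len(column) for column in columns}) != 1:
        raise ValueError("All sources must provide the same number of predictions")
    return [len(set(row)) == 1 for row in zip(*columns)]


# Made-up labels for illustration only.
sample_predictions = {
    "local": [0, 1, 2],
    "docker": [0, 1, 2],
    "prebuilt": [0, 2, 2],
}
print(all_sources_agree(sample_predictions))  # [True, False, True]
```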


# Clean up

In [None]:
delete_resources = True

if delete_resources:
    # delete local files
    !rm -r $cpr_folder
    !rm -r $artifact_local_folder
    !rm -r $app_folder
    !rm Dockerfile
    !rm environment.yml
    # delete bucket
    !gsutil rm -r gs://$bucket_name
    # delete docker images
    !docker rmi xgboost-predictor
    !docker rmi $image_uri --force
    # delete bigquery dataset
    !bq rm -r -f $PROJECT_ID:$dataset_name
    # delete model
    !gcloud ai models delete $model_id --region=$LOCATION --quiet
    # delete artifact registry repository
    !gcloud artifacts repositories delete $repository --location=$LOCATION --quiet
    