# Evaluate LLMs with Vertex AutoSxS Model Evaluation

## Learning Objectives

1) Learn how to create evaluation data
1) Learn how to set up an AutoSxS model evaluation pipeline
1) Learn how to run the evaluation pipeline job
1) Learn how to check the autorater judgments
1) Learn how to evaluate how much AutoSxS is aligned with human judgment


In this notebook, we use Vertex AI Model Evaluation AutoSxS (pronounced Auto Side-by-Side) to compare the predictions of two LLMs on a summarization task and determine which model produced the better summaries. Then, given human judgments of which model is better for part of the dataset, we show how to evaluate how well AutoSxS aligns with human judgment. (Note that AutoSxS can compare both Google first-party and third-party LLMs, provided the model responses are stored in a JSONL evaluation file.)



## Setup


In [None]:
import datetime
import os
import pprint

import pandas as pd
from google.cloud import aiplatform
from google.protobuf.json_format import MessageToDict
from IPython.display import HTML, display

In [None]:
project_id_list = !gcloud config get-value project 2> /dev/null
PROJECT_ID = project_id_list[0]
BUCKET = PROJECT_ID
REGION = "us-central1"

# Evaluation data containing the competing models' responses
EVALUATION_FILE_URI = "gs://cloud-training/specialized-training/llm_eval/sum_eval_gemini_dataset_001.jsonl"
HUMAN_EVALUATION_FILE_URI = "gs://cloud-training/specialized-training/llm_eval/sum_human_eval_gemini_dataset_001.jsonl"

# AutoSxS Vertex Pipeline template
TEMPLATE_URI = (
    "https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/default"
)

print(f"Your project ID is set to {PROJECT_ID}")

Let us make sure the bucket where AutoSxS will export the data exists, and if not, let us create it:

In [None]:
! gsutil ls gs://{BUCKET} > /dev/null || gsutil mb -l {REGION} -p {PROJECT_ID} gs://{BUCKET}

Finally, let us initialize the `aiplatform` client in the cell below:

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET)

## Evaluate LLMs using Vertex AI Model Evaluation AutoSxS

Suppose you have already obtained LLM-generated predictions for a summarization task. To evaluate one LLM, such as Gemini Pro on Vertex AI, against another using [AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval), follow these steps:

1.   **Prepare the Evaluation Dataset:** Gather the prompts, contexts, generated responses, and human preferences required for the evaluation.

2.   **Convert the Evaluation Dataset:** Convert the dataset into the JSONL format and store it in a Cloud Storage bucket. (Alternatively, you can save the dataset to a BigQuery table.)

3.   **Run a Model Evaluation Job:** Use Vertex AI to run a model evaluation job to assess the performance of the LLM.
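As a rough illustration of step 2, the sketch below serializes a small DataFrame to JSONL with pandas and reads it back. The records are hypothetical toy data; only the column names (`id`, `document`, `response_a`, `response_b`) mirror the ones used in this notebook.

```python
import io

import pandas as pd

# Hypothetical toy records; the column names mirror the ones used in this notebook.
records = [
    {
        "id": 1,
        "document": "First article text.",
        "response_a": "Summary A1.",
        "response_b": "Summary B1.",
    },
    {
        "id": 2,
        "document": "Second article text.",
        "response_a": "Summary A2.",
        "response_b": "Summary B2.",
    },
]
toy_eval_df = pd.DataFrame(records)

# orient="records" together with lines=True writes one JSON object per line (JSONL).
jsonl_text = toy_eval_df.to_json(orient="records", lines=True)

# Reading the text back should give the same rows and columns.
round_trip_df = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(jsonl_text)
```

To store the file in Cloud Storage, one option is to pass a path to `to_json` instead of capturing the string. Writing directly to a `gs://` URI is assumed to require the optional `gcsfs` dependency.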


### Dataset

The dataset is a modified sample of the [XSum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset for evaluation of abstractive single-document summarization systems.

### Read the evaluation data

In this summarization use case, you use `sum_eval_gemini_dataset_001`, a JSONL-formatted evaluation dataset that contains content-response pairs without human preferences.

In the dataset, each row represents a single example. The dataset includes ID fields, such as "id" and "document," which are used to identify each unique example. The "document" field contains the newspaper articles to be summarized.

While the dataset does not have [data fields](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#prep-eval-dataset) for prompts and contexts, it does include pre-generated predictions. These predictions contain the responses generated for the LLM task, with "response_a" and "response_b" holding the article summaries generated by two different LLMs.

**Note: For experimentation, you can provide only a few examples. The documentation recommends at least 400 examples to ensure high-quality aggregate metrics.**
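Before launching a job, it can help to check that the evaluation data has the expected fields and enough rows. The helper below is a hypothetical sketch, not part of AutoSxS; the column list and the 400-example threshold are taken from the description and the note above.

```python
import pandas as pd

REQUIRED_COLUMNS = ["id", "document", "response_a", "response_b"]
RECOMMENDED_MIN_EXAMPLES = 400  # figure quoted in the note above


def check_evaluation_frame(
    df, required=REQUIRED_COLUMNS, min_examples=RECOMMENDED_MIN_EXAMPLES
):
    """Return (missing_columns, has_enough_examples) for an evaluation DataFrame."""
    missing = [col for col in required if col not in df.columns]
    return missing, len(df) >= min_examples


# Hypothetical toy frame that lacks the "response_b" column.
toy_df = pd.DataFrame(
    {"id": [1, 2], "document": ["doc one", "doc two"], "response_a": ["a1", "a2"]}
)
missing, enough = check_evaluation_frame(toy_df)
print(f"Missing columns: {missing}")
print(f"At least {RECOMMENDED_MIN_EXAMPLES} examples: {enough}")
```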


In [None]:
evaluation_gemini_df = pd.read_json(EVALUATION_FILE_URI, lines=True)
evaluation_gemini_df.head()

### Run a model evaluation job

AutoSxS relies on Vertex AI Pipelines to run model evaluation. Some of the required pipeline parameters are:

*   `evaluation_dataset` to indicate the location of the evaluation dataset. In this case, it is the Cloud Storage URI of the JSONL file.

*   `id_columns` to distinguish unique evaluation examples. Here, you have the `id` and `document` fields. These fields will be added to the judgments table generated by AutoSxS.

*   `task` to indicate the task type you want to evaluate. It can be `summarization` or `question_answering`. In this case you have `summarization`.

*   `autorater_prompt_parameters` to configure the autorater's behavior. You can specify the inference instruction that guides task completion, as well as the inference context the autorater should refer to during the task. For the summarization task below, `autorater_prompt_parameters` is a dictionary that contains the name of the field holding the summarization context (i.e. the document to summarize) and the summarization instruction itself:
```python
    {
        "inference_context": {"column": "document"},
        "inference_instruction": {"template": "Summarize the following text: "},
    },
```

Lastly, you have to provide `response_column_a` and `response_column_b` with the names of the columns containing the pre-generated predictions used to calculate the evaluation metrics. In this case, these are `response_a` and `response_b` respectively. Note that instead of providing pre-generated responses (through the `response_column_a` and `response_column_b` fields), you can specify the actual models through the `model_a` and `model_b` fields, as long as these models are stored in Vertex AI Model Registry. To learn more about all supported parameters and their usage, see the [official documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#perform-eval).
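To make the shape of these parameters concrete, here is one possible sketch that assembles them into a dictionary. The helper name `build_autosxs_parameters` and the example URI are hypothetical; the keys are the ones described above, and the full set of keys accepted by the pipeline template should be checked against the official documentation.

```python
def build_autosxs_parameters(
    evaluation_dataset,
    id_columns,
    task,
    context_column,
    instruction,
    response_column_a,
    response_column_b,
):
    """Assemble the AutoSxS parameters described above into one dictionary."""
    return {
        "evaluation_dataset": evaluation_dataset,
        "id_columns": list(id_columns),
        "task": task,
        "autorater_prompt_parameters": {
            # Column that holds the text the models were asked to summarize.
            "inference_context": {"column": context_column},
            # Instruction given to the models for the task.
            "inference_instruction": {"template": instruction},
        },
        "response_column_a": response_column_a,
        "response_column_b": response_column_b,
    }


example_parameters = build_autosxs_parameters(
    evaluation_dataset="gs://my-bucket/my_eval_dataset.jsonl",  # hypothetical URI
    id_columns=["id", "document"],
    task="summarization",
    context_column="document",
    instruction="Summarize the following text: ",
    response_column_a="response_a",
    response_column_b="response_b",
)
print(example_parameters)
```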



### Exercise

In the first code cell below, configure the `parameters` argument so that it launches a summarization evaluation job comparing the two LLMs' responses stored in the evaluation data located at `EVALUATION_FILE_URI`. In the second cell below, configure the `PipelineJob` to launch an AutoSxS Vertex Pipeline using the pipeline template stored at `TEMPLATE_URI`.

In [None]:
timestamp = str(datetime.datetime.now().timestamp()).replace(".", "")
display_name = f"autosxs-{timestamp}"
pipeline_root = os.path.join("gs://", BUCKET, display_name)

parameters = None  # TODO

After you define the model evaluation parameters, you can run a model evaluation pipeline job from the predefined pipeline template with the Vertex AI Python SDK.
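As a hedged sketch, the keyword arguments for `aiplatform.PipelineJob` can be gathered in a plain dictionary first. The helper name and the bucket name below are hypothetical; `display_name`, `template_path`, `pipeline_root`, and `parameter_values` are arguments of `PipelineJob` in the Vertex AI Python SDK.

```python
import datetime


def build_pipeline_job_kwargs(template_uri, bucket, parameters, prefix="autosxs"):
    """Collect the PipelineJob keyword arguments for an AutoSxS run."""
    timestamp = str(datetime.datetime.now().timestamp()).replace(".", "")
    display_name = f"{prefix}-{timestamp}"
    return {
        "display_name": display_name,
        "template_path": template_uri,
        # An f-string avoids OS-specific path separators in the gs:// URI.
        "pipeline_root": f"gs://{bucket}/{display_name}",
        "parameter_values": parameters,
    }


job_kwargs = build_pipeline_job_kwargs(
    template_uri="https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/default",
    bucket="my-bucket",  # hypothetical bucket name
    parameters={"task": "summarization"},  # truncated for illustration
)
print(job_kwargs["pipeline_root"])

# With the SDK initialized, the job could then be created and run with:
# job = aiplatform.PipelineJob(**job_kwargs)
# job.run(sync=True)
```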

In [None]:
job = aiplatform.PipelineJob(None)  # TODO
job.run(sync=True)

### Evaluate the results

After the evaluation pipeline runs successfully, you can review the evaluation results either by looking at the artifacts generated by the pipeline in the Vertex AI Pipelines UI or in the notebook environment using the Vertex AI Python SDK.

AutoSxS produces three types of evaluation results: a judgments table, aggregated metrics, and alignment metrics (if human preferences are provided).


### AutoSxS Judgments

The judgments table contains metrics that offer insight into LLM performance on each example.

For each response pair, the judgments table includes a `choice` column indicating the better response according to the evaluation criteria used by the autorater.

Each choice comes with a `confidence` column holding a score between 0 and 1, which represents the autorater's level of confidence in the evaluation.

Last but not least, AutoSxS provides an explanation of why the autorater preferred one response over the other.

The cell below loads the AutoSxS judgments output.

In [None]:
online_eval_task = [
    task
    for task in job.task_details
    if task.task_name == "online-evaluation-pairwise"
][0]


judgments_uri = MessageToDict(online_eval_task.outputs["judgments"]._pb)[
    "artifacts"
][0]["uri"]

judgments_df = pd.read_json(judgments_uri, lines=True)

The next cell contains a helper function to print a sample from AutoSxS judgments nicely in the notebook.

In [None]:
def print_autosxs_judgments(df, n=3, min_confidence=0.5):
    """Display up to n randomly sampled AutoSxS judgments in the notebook"""

    style = "white-space: pre-wrap; width: 800px; overflow-x: auto;"
    # Section title -> judgments column to display under it.
    fields = {
        "Document": "document",
        "Response A": "response_a",
        "Response B": "response_b",
        "Explanation": "explanation",
        "Confidence score": "confidence",
    }

    # Filter before sampling so that up to n confident judgments are shown,
    # and cap n so that sampling does not fail on small tables.
    confident_df = df[df["confidence"] >= min_confidence]
    sample_df = confident_df.sample(n=min(n, len(confident_df)))

    for _, row in sample_df.iterrows():
        for title, column in fields.items():
            display(
                HTML(
                    f"<h2>{title}:</h2> <div style='{style}'>{row[column]}</div>"
                )
            )
        display(HTML("<hr>"))


print_autosxs_judgments(judgments_df)

### AutoSxS Aggregate metrics

AutoSxS also provides aggregated metrics as an additional evaluation result. These win-rate metrics are calculated from the judgments table as the percentage of times the autorater preferred each model's response.

These metrics are useful for quickly finding out which model performs best on the evaluated task.
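The idea behind a win rate can be sketched on a toy judgments table. The data below is hypothetical, and the sketch assumes the `choice` column holds the labels `"A"` and `"B"`; the pipeline's own computation may differ in its details.

```python
import pandas as pd

# Hypothetical judgments; "choice" is assumed to hold the label of the preferred response.
toy_judgments_df = pd.DataFrame(
    {
        "choice": ["A", "B", "B", "A", "B"],
        "confidence": [0.9, 0.7, 0.8, 0.6, 1.0],
    }
)

# Fraction of examples on which each model's response was preferred.
win_rates = toy_judgments_df["choice"].value_counts(normalize=True)
model_a_win_rate = float(win_rates.get("A", 0.0))
model_b_win_rate = float(win_rates.get("B", 0.0))

print(f"Model A win rate: {model_a_win_rate:.0%}")
print(f"Model B win rate: {model_b_win_rate:.0%}")
```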

The cell below retrieves the AutoSxS aggregate metrics.

In [None]:
metrics_eval_task = [
    task
    for task in job.task_details
    if task.task_name == "model-evaluation-text-generation-pairwise"
][0]


win_rate_metrics = MessageToDict(
    metrics_eval_task.outputs["autosxs_metrics"]._pb
)["artifacts"][0]["metadata"]

The next cell contains a helper function to print AutoSxS aggregated metrics nicely in the notebook.

In [None]:
def print_aggregated_metrics(scores):
    """Print AutoSxS aggregated metrics"""

    # Use the function argument rather than the global variable.
    score_b = round(scores["autosxs_model_b_win_rate"] * 100)
    display(
        HTML(
            f"<h3>AutoSxS Autorater prefers Model B over Model A {score_b}% of the time</h3>"
        )
    )


print_aggregated_metrics(win_rate_metrics)

### Human-preference alignment metrics

After reviewing the results of your initial AutoSxS evaluation, you may wonder how reliably the autorater's assessments align with the views of human raters.

AutoSxS supports human-preference data to validate the autorater's evaluation.

To check alignment with a human-preference dataset, you need to add the ground truths as a column to the `evaluation_dataset` and pass the column name to `human_preference_column`.
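A minimal sketch of this step, using hypothetical toy data: the human labels are joined onto the evaluation rows by `id`, and the name of the new column is recorded under `human_preference_column`. The column name `actual` follows the dataset used later in this notebook, and the `"A"`/`"B"` label values are an assumption.

```python
import pandas as pd

# Hypothetical evaluation rows and human labels keyed by the same "id".
toy_eval_df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "response_a": ["a1", "a2", "a3"],
        "response_b": ["b1", "b2", "b3"],
    }
)
toy_human_df = pd.DataFrame({"id": [1, 2, 3], "actual": ["A", "B", "A"]})

# validate="one_to_one" raises if an id appears more than once on either side.
toy_human_eval_df = toy_eval_df.merge(
    toy_human_df, on="id", how="left", validate="one_to_one"
)

# Extra pipeline parameter that points AutoSxS at the ground-truth column.
human_preference_parameters = {"human_preference_column": "actual"}

print(toy_human_eval_df)
```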

#### Read the evaluation data

In this case, the `sum_human_eval_gemini_dataset_001` evaluation dataset also includes human preferences.

The cell below shows a sample of the dataset.

In [None]:
human_evaluation_gemini_df = pd.read_json(HUMAN_EVALUATION_FILE_URI, lines=True)
human_evaluation_gemini_df.head()

#### Run a model evaluation job

For the AutoSxS pipeline, you must specify the human-preference column in the pipeline parameters.

Then, you can run the evaluation pipeline job using the Vertex AI Python SDK as shown below.


### Exercise


In the first code cell below, configure the `parameters` argument so that it launches a human-alignment evaluation job using the human judgments stored in the evaluation data located at `HUMAN_EVALUATION_FILE_URI` in the `actual` field. In the second cell below, configure the `PipelineJob` to launch the AutoSxS Vertex Pipeline using the pipeline template stored at `TEMPLATE_URI`.

In [None]:
timestamp = str(datetime.datetime.now().timestamp()).replace(".", "")
display_name = f"autosxs-human-eval-{timestamp}"
pipeline_root = os.path.join("gs://", BUCKET, display_name)

parameters = None  # TODO

In [None]:
job = aiplatform.PipelineJob(None)  # TODO
job.run(sync=True)

### Get human-aligned aggregated metrics

Compared with the aggregated metrics you obtained earlier, the pipeline now returns additional measurements that use the human-preference data you provided.
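To build intuition for what alignment means, the sketch below computes two common agreement measures between autorater choices and human preferences: the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. The labels are hypothetical toy data, and the metrics reported by AutoSxS may be defined differently.

```python
from collections import Counter


def agreement_and_kappa(autorater, human):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(autorater) != len(human) or not autorater:
        raise ValueError("Both sequences must be non-empty and of equal length.")

    n = len(autorater)
    observed = sum(a == h for a, h in zip(autorater, human)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each chose independently according to its own label frequencies.
    auto_counts = Counter(autorater)
    human_counts = Counter(human)
    labels = set(auto_counts) | set(human_counts)
    expected = sum(
        (auto_counts[label] / n) * (human_counts[label] / n) for label in labels
    )

    # When chance agreement is 1, both raters always use the same single label.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical toy labels.
autorater_choices = ["A", "B", "B", "A", "B", "A"]
human_choices = ["A", "B", "A", "A", "B", "B"]

observed, kappa = agreement_and_kappa(autorater_choices, human_choices)
print(f"Agreement: {observed:.3f}, Cohen's kappa: {kappa:.3f}")
```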

The cell below extracts the resulting human-aligned aggregated metrics, which compare the models' win rates according to both the human preferences and the autorater's judgments.


In [None]:
human_eval_task = [
    task
    for task in job.task_details
    if task.task_name == "model-evaluation-text-generation-pairwise"
][0]

human_aligned_metrics = {
    k: round(v, 3)
    for k, v in MessageToDict(human_eval_task.outputs["autosxs_metrics"]._pb)[
        "artifacts"
    ][0]["metadata"].items()
    if "win_rate" in k
}

The next cell contains a helper function to print AutoSxS alignment metrics nicely in the notebook.

In [None]:
def print_human_preference_metrics(metrics):
    """Print AutoSxS human-preference alignment metrics as percentages"""

    # Iterate over whatever win-rate metrics were extracted above.
    for name, value in sorted(metrics.items()):
        display(HTML(f"<h3>{name}: {value:.1%}</h3>"))


print_human_preference_metrics(human_aligned_metrics)

## Acknowledgement 

This notebook is adapted from a [tutorial](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_gemini_with_autosxs.ipynb)
written by Ivan Nardini.

Copyright 2024 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.