
# Vertex AI Evaluation
---



In [2]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]

Variable definitions

In [3]:
import sys

PROJECT_ID = "jkwng-vertex-playground"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
EXPERIMENT_NAME = "stories-dataset-experiment"  # @param {type:"string"}
BUCKET = "jkwng-vertex-experiments" # @param {type:"string"}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    raise ValueError("Please set your PROJECT_ID")


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

Import statements

In [4]:
# General
import pandas as pd

# Main
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate, MetricPromptTemplateExamples

Helper functions

In [5]:
from IPython.display import Markdown, display


def display_eval_result(eval_result, metrics=None):
    """Display the evaluation results."""
    summary_metrics, metrics_table = (
        eval_result.summary_metrics,
        eval_result.metrics_table,
    )

    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        metrics_table = metrics_table.filter(
            [
                metric
                for metric in metrics_table.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the summary metrics
    display(Markdown("### Summary Metrics"))
    display(metrics_df)
    # Display the metrics table
    display(Markdown("### Row-based Metrics"))
    display(metrics_table)
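
The `metrics` argument of the helper filters by substring, so passing `"fluency"` keeps every column whose name contains that text. Here is a minimal sketch of the same matching rule on plain lists. The function `select_columns` and the column names are illustrative assumptions, not output from a real run.

```python
def select_columns(columns, selected_metrics):
    """Keep columns whose name contains any of the selected metric names."""
    return [
        column
        for column in columns
        if any(selected in column for selected in selected_metrics)
    ]


# Hypothetical column names, chosen only to illustrate the matching rule.
columns = ["story", "fluency/score", "fluency/explanation", "story_quality/score"]

print(select_columns(columns, ["fluency"]))
print(select_columns(columns, ["story"]))
```

Because the match is a substring test, a short name such as `"story"` also selects `story_quality/score`, so it may be worth passing the full metric name when filtering.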

Vertex AI includes the built-in metrics listed below for LLM evaluation. Each metric uses a predefined prompt and the LLM-as-judge methodology to score model responses.

The cell below also prints the metric prompt template for `fluency` for review.



In [46]:
from pprint import pprint
# View all the available examples of model-based metrics
pprint(MetricPromptTemplateExamples.list_example_metric_names())
print(f"\nFluency:\n {MetricPromptTemplateExamples.get_prompt_template('fluency')}")



['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']

Fluency:
 
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation

Below we define our own metrics based on our dataset:

1. Story quality: is the story understandable and relatable to children?
2. Faithfulness: is the generated story grounded in the original short story? We especially care whether character dialogue was changed, or whether new characters or major plot points were introduced.

We will later map the `story` column in our dataset to `response`, because these templates assume that the AI response to be evaluated is named `response`.

In [20]:
# Our own definition of story quality.
story_quality_template = PointwiseMetricPromptTemplate(
    criteria={
        "entertaining": "The text is engaging, has enough descriptive text for children to immerse themselves in the story.",
        "relatable": "The text contains a story that children can understand and relate to their daily lives.",
    },
    rating_rubric={
        "5": "The story is very engaging to children.",
        "4": "The story is mostly engaging to children with some minor exceptions.",
        "3": "The story contains some parts that are entertaining or relatable to children.",
        "2": "The story contains things that children can relate to, but is not very entertaining.",
        "1": "The story contains details that children may be bored by, not understand, or not be able to relate to.",
        "0": "The story is incomprehensible to young children.",
    },
)

story_quality = PointwiseMetric(
    metric="story_quality",
    metric_prompt_template=story_quality_template,
)

faithfulness_quality_template = PointwiseMetricPromptTemplate(
    criteria={
      "faithfulness": "The text contains the same events from the original and does not introduce new characters, events or change any of the dialogue.",
    },
    rating_rubric={
        "5": "The story is true to the original with no details added or removed, and no character dialogue changed.",
        "4": "The story has minor details added, such as characters' inner thoughts or descriptions of settings, but the events remain the same and the character dialogue is exactly the same as the original.",
        "3": "The story introduces minor changes to the plot, setting or character dialogue, but the plot of the story is the same as the original.",
        "2": "The story introduces a minor new character or introduces or excludes minor plot points from the original, but the plot of the story is close to the original.",
        "1": "The story has major plot points added, changed, or removed which changes the story significantly from the original.",
        "0": "The story is completely different from the original.",
    },
    input_variables=["original"]
)
faithfulness = PointwiseMetric(
    metric="faithfulness",
    metric_prompt_template=faithfulness_quality_template,
)

print(story_quality.metric_prompt_template)


INFO:vertexai.evaluation.metrics.metric_prompt_template:The `input_variables` parameter is empty. Only the `response` column is used for computing this model-based metric.


# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
entertaining: The text is engaging, has enough descriptive text for children to immerse themselves in the story.
relatable: The text contains a story that children can understand and relate to their daily lives.

## Rating Rubric
0: The story is incomprehensible to young children.
1: The story contains details that children may be bored by, not understand, or not be able to relate to.
2: The story contains thin
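
The printed template is cut off above, but it shows the general layout: the criteria are listed by name, and the rubric is ordered from the lowest rating to the highest even though we defined it from 5 down to 0. The sketch below reproduces that layout in plain Python. It is not the SDK's implementation, and `render_evaluation_section` is a hypothetical helper written only for illustration.

```python
def render_evaluation_section(criteria, rating_rubric):
    """Render criteria and rubric in the layout of the printed template."""
    criteria_lines = [f"{name}: {text}" for name, text in criteria.items()]
    # Sort ratings numerically so the rubric reads from lowest to highest.
    rubric_lines = [
        f"{rating}: {text}"
        for rating, text in sorted(rating_rubric.items(), key=lambda kv: int(kv[0]))
    ]
    return "\n".join(
        [
            "# Evaluation",
            "## Criteria",
            *criteria_lines,
            "",
            "## Rating Rubric",
            *rubric_lines,
        ]
    )


# Shortened, made-up criteria and rubric text for illustration only.
print(
    render_evaluation_section(
        criteria={"relatable": "Children can relate to the story."},
        rating_rubric={"1": "Weak.", "0": "Incomprehensible."},
    )
)
```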

We also created a third metric:

3. Lesson quality: evaluate the quality of the lesson at the end of the story. Is it insightful, and does it follow from the details of the story?

We wrote a custom metric prompt instead of using the template, because the fields used are `story` and `lesson`, which differ from the `prompt` and `response` fields the template expects. In our dataset, the model is asked to produce both a story and a lesson as JSON, which we have separated into two columns.

In [11]:

lesson_prompt_template = """
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
applicability: The lesson is drawn directly from details contained in the story.
insightfulness: The text contains an actionable, useful life lesson applicable to living a happy and fulfilling life with others and oneself.

## Rating Rubric
0: The lesson is not insightful and is not applicable to the story.
1: The lesson is not insightful, and can only be drawn from minor details in the story.
2: The lesson is not insightful, and can be directly drawn from the major plot in the story.
3: The lesson is somewhat insightful, and can be directly drawn from the major plot in the story.
4: The lesson is insightful, and can be directly drawn from the major plot in the story.
5: The lesson is very insightful and can be directly drawn from the major plot of the story.

## Evaluation Steps
Step 1: Assess the response in aspects of all criteria provided. Provide assessment according to each criterion.
Step 2: Score based on the rating rubric. Give a brief rationale to explain your evaluation considering each individual criterion.


# User Inputs and AI-generated Response
## User Inputs
### story
{story}

### lesson
{lesson}
"""

lesson_quality = PointwiseMetric(
    metric="lesson_quality",
    metric_prompt_template=lesson_prompt_template,

)

print(lesson_quality.metric_prompt_template)


# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
applicability: The lesson is drawn directly from details contained in the story.
insightfulness: The text contains an actionable, useful life lesson applicable to living a happy and fulfilling life with others and oneself.

## Rating Rubric
0: The lesson is not insightful and is not applicable to the story.
1: The lesson is not insightful, and can only be drawn from minor details in the story.
2: The lesson is
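
Because the custom prompt refers to dataset columns through `{story}` and `{lesson}` placeholders, a quick check that every placeholder has a matching column can catch typos before any evaluation requests are sent. The sketch below uses only the standard library. The helper names and the shortened template are assumptions made for illustration.

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name
    }


def missing_columns(template, columns):
    """Return placeholders that have no matching dataset column, sorted."""
    return sorted(template_fields(template) - set(columns))


# A shortened stand-in for the lesson prompt, using the same placeholders.
example_template = "### story\n{story}\n\n### lesson\n{lesson}\n"
dataset_columns = ["instruction", "original", "system_instruction", "story", "lesson"]

print(missing_columns(example_template, dataset_columns))  # []
```

In the notebook, the same check could be run as `missing_columns(lesson_prompt_template, eval_dataset.columns)` once the dataset is loaded.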

Load the dataset into a DataFrame. Each record in our dataset looks like the following:


```
{
  "instruction": <prompt instruction>,
  "original": <original story>,
  "system_instruction": <system prompt>,
  "story": <AI generated story based on the original story>,
  "lesson": <AI generated lesson drawn from the story>,
}
```

We load 25 records into a pandas DataFrame to reduce the time it takes to run the evaluation.


In [8]:
from google.cloud import storage
import json

stories_prefix = 'stories'
storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET)
dataset_path = f"{stories_prefix}/stories_dataset.jsonl"
dataset_blob = bucket.blob(dataset_path)

dataset = []

with dataset_blob.open(mode='r') as f:
  for line in f:
    json_line = json.loads(line)
    dataset.append(json_line)

eval_dataset = pd.DataFrame.from_dict(dataset)
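
The loop above trusts every line to be valid JSON with the expected fields. A slightly more defensive variant is sketched below, assuming the five keys shown in the record layout earlier. `parse_jsonl` is a hypothetical helper, not part of any library.

```python
import json

REQUIRED_KEYS = {"instruction", "original", "system_instruction", "story", "lesson"}


def parse_jsonl(lines, required_keys=REQUIRED_KEYS):
    """Parse JSONL lines, skipping blanks and checking each record's keys."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = required_keys - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records
```

Since an open file is iterable line by line, this could replace the loop body with `dataset = parse_jsonl(f)` inside the `with` block.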

Here is an example output of one of the entries in our dataset:

In [9]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 2000)
display(eval_dataset.loc[0])

Unnamed: 0,0
instruction,"\nMake the story more interesting. Do not add, remove, or change any details from the story. Only add sentences to elaborate\non what is already written. **Do not change any of the character's dialog.** Include a lesson at the end that children can\nlearn from the story.\n\nStory:\n"
original,"Jack was very enthusiastic. He wanted to do something special. He asked his mom what he could do.\n\n""Why don't you try to prevent something bad from happening,"" his mom said.\n\nJack looked at her with confusion.\n\n""What does that mean?"" he asked.\n\n""It means that you can use your shoulder to stop something from happening,"" his mom explained.\n\nJack was excited. He put his shoulder against the door to prevent it from opening. His mom was proud of him.\n\nJack felt a sense of accomplishment. He was pleased that he could prevent something bad from happening. He was glad that he could help.\n"
system_instruction,\nYou are a children's story editor. You will be shown a simple children's story that you will be asked to add\ndetails to the story to make it more engaging for children.\n
story,"Jack was very enthusiastic. He was a young boy with a lot of energy, and he was always looking for new things to do. He wanted to do something special. He wanted to do something that would make his mom proud and make him feel good about himself. He asked his mom what he could do. He bounced up and down, eager to hear her suggestion.\n\n""Why don't you try to prevent something bad from happening,"" his mom said. She smiled at her son and placed a hand on his shoulder.\n\nJack looked at her with confusion. His eyebrows furrowed together as he tried to understand what she meant. He tilted his head, his eyes wide.\n\n""What does that mean?"" he asked. He truly wanted to understand what his mom was saying so he could do what she wanted. \n\n""It means that you can use your shoulder to stop something from happening,"" his mom explained. She looked around the room to think of an example of how Jack could prevent something from happening.\n\nJack was excited. His face lit up with excitement at the thought of using his shoulder to help. He put his shoulder against the door to prevent it from opening. He pushed as hard as he could. The door rattled but held firm. His mom was proud of him. She gave him a big hug and a kiss on the forehead. She knew he truly wanted to be a helper.\n\nJack felt a sense of accomplishment. He was pleased that he could prevent something bad from happening. He imagined all the bad things that could have come in through the door, and he felt strong that he was able to stop it. He was glad that he could help. He looked around the room for more things to prevent."
lesson,"The lesson of this story is that everyone can do something helpful. Even if it seems small, any effort to make things better can lead to big results. Just like Jack used his shoulder to help prevent something bad from happening, we can all use our talents and abilities to contribute to a greater good and improve the world around us, one small act at a time."


Let's test one of the metrics on one of the examples in our dataset.

In [21]:
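from vertexai.generative_models import GenerationConfig, GenerativeModel

# Sampling settings for the judge model.
max_tokens = 8192
temperature = 1
top_p = 0.95

model = GenerativeModel(
    "gemini-1.5-pro-002",
    generation_config=GenerationConfig(
        max_output_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
    ),
)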

response = model.generate_content(
    contents=[
        faithfulness.metric_prompt_template.format(
            original=eval_dataset.loc[0]['original'],
            response=eval_dataset.loc[0]['story']
        )
    ]
)

print(response.text)


Step 1:
Faithfulness: The AI response adds descriptions of Jack's feelings, actions, and his mom's actions. While these additions provide more context and imagery, they don't change the original events or dialogue.  The core storyline remains intact.


Step 2:
Rating: 4

Rationale: The AI-generated story adds details like Jack's enthusiasm being described as "bouncing up and down," his mom placing a hand on his shoulder, and Jack pushing hard against the door. These are minor additions that enhance the descriptions but don't alter the plot or dialogue. Therefore, a rating of 4 is appropriate, as it signifies minor details added while keeping the core story intact.
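
When calling the judge model by hand like this, the score arrives as free text. The sketch below pulls the number out of a `Rating:` line with a regular expression. It assumes the reply follows the format shown above, which is not guaranteed for free-form model output, and `extract_rating` is a hypothetical helper. The `EvalTask` used later handles score extraction itself.

```python
import re

# Matches a line that starts with "Rating:" followed by an integer.
RATING_PATTERN = re.compile(r"^\s*Rating:\s*(\d+)", re.MULTILINE)


def extract_rating(judge_text):
    """Return the integer after 'Rating:' or None if no such line exists."""
    match = RATING_PATTERN.search(judge_text)
    return int(match.group(1)) if match else None


sample = "Step 2:\nRating: 4\n\nRationale: minor details were added."
print(extract_rating(sample))  # 4
```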



Now let's evaluate the four metrics `[fluency, story_quality, faithfulness, lesson_quality]`. The `EvalTask` sends the metric prompts to the judge LLM to determine the scores, then aggregates them into an Experiment tracked in Vertex AI.

If you're running inside Colab Enterprise, you can view the experiment inline in the notebook interface and inspect each of the metric scores.

Note that we map the `story` column to `response` in the `EvalTask`, as discussed above.

In [13]:
eval_task = EvalTask(
    dataset=eval_dataset.sample(10),
    metric_column_mapping={
        "prompt": "instruction",
        "response": "story",
    },
    metrics=[
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        story_quality,
        faithfulness,
        lesson_quality
    ],
    experiment=EXPERIMENT_NAME
)

eval_result = eval_task.evaluate()

INFO:google.cloud.aiplatform.metadata.experiment_resources:Associating projects/205512073711/locations/us-central1/metadataStores/default/contexts/stories-dataset-experiment-60ced10d-35d3-4531-82ca-a8c736def105 to Experiment: stories-dataset-experiment


INFO:vertexai.evaluation._evaluation:Computing metrics with a total of 40 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 40/40 [00:48<00:00,  1.21s/it]
INFO:vertexai.evaluation._evaluation:Evaluation Took:48.37069001699996 seconds
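
The 40 requests in the log are consistent with one judge request per sampled row and metric: 10 rows times 4 metrics. The sketch below spells out that arithmetic and shows one way per-row scores could be averaged into a summary. The `<metric>/mean` key naming and the scores are assumptions for illustration, not results from this run.

```python
from statistics import mean


def expected_request_count(row_count, metric_count):
    """One judge request per (row, model-based metric) pair."""
    return row_count * metric_count


def summarize(scores_by_metric):
    """Average the per-row scores of each metric."""
    return {f"{name}/mean": mean(scores) for name, scores in scores_by_metric.items()}


print(expected_request_count(row_count=10, metric_count=4))  # 40

# Hypothetical scores, not results from this run.
print(summarize({"faithfulness": [4, 5, 3, 4]}))
```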
