In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluating a LangChain Agent on Vertex AI Reasoning Engine (Prebuilt template)

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Freasoning-engine%2Fevaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/evaluating_langchain_agent_reasoning_engine_prebuilt_template.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>

| | |
|-|-|
| Authors |  [Naveksha Sood](https://github.com/navekshasood) [Ivan Nardini](https://github.com/inardini) |

## 개요

모든 생성형 AI 애플리케이션과 마찬가지로, AI 에이전트 역시 안정적이고 효과적으로 작동하는지 확인하기 위해 철저한 평가가 필요합니다. 이러한 평가는 실시간(온라인)으로, 그리고 대규모 테스트 케이스 데이터 세트(오프라인)에서 모두 이루어져야 합니다. 에이전트 애플리케이션을 구축하는 개발자는 성능 평가에 있어 상당한 어려움에 직면합니다. 주관적인(사람의 피드백) 평가와 객관적인(측정 가능한 지표) 평가 모두 에이전트 동작에 대한 신뢰를 구축하는 데 필수적입니다.

이 튜토리얼에서는 Vertex AI Gen AI Evaluation을 사용하여 퍼스트 파티 Reasoning Engine Agent를 평가하는 방법을 보여줍니다.

이 튜토리얼에서는 다음 Google Cloud 서비스 및 리소스를 사용합니다.

* Vertex AI Gen AI Evaluation
* Vertex AI Reasoning Engine

수행되는 단계는 다음과 같습니다.

* LangChain을 사용하여 에이전트 구축 및 배포
* 에이전트 평가 데이터 세트 준비
* 단일 도구 사용 평가
* 궤적 평가
* 응답 평가

## Get started

### Install Vertex AI SDK and other required packages


In [1]:
%pip install --upgrade --user --quiet "google-cloud-aiplatform[evaluation, langchain, reasoningengine]" \
    "langchain_google_vertexai" \
    "cloudpickle==3.0.0" \
    "pydantic>=2.10" \
    "requests"

Note: you may need to restart the kernel to use updated packages.


### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. In Colab or Colab Enterprise, you might see an error message that says "Your session crashed for an unknown reason." This is expected. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [2]:
# import sys

# if "google.colab" in sys.modules:
#     from google.colab import auth

#     auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [3]:
import dotenv
import os

dotenv.load_dotenv()

True

In [4]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

# PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
PROJECT_ID = os.getenv("PROJECT_ID")

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

BUCKET_NAME = "[your-bucket-name]"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}

if not BUCKET_NAME or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = f"{PROJECT_ID}-bucket"

BUCKET_URI = f"gs://{BUCKET_NAME}"

! gsutil mb -p $PROJECT_ID -l $LOCATION $BUCKET_URI

EXPERIMENT_NAME = "evaluate-re-agent"  # @param {type:"string"}

vertexai.init(
    project=PROJECT_ID,
    location=LOCATION,
    staging_bucket=BUCKET_URI,
    experiment=EXPERIMENT_NAME,
)

Creating gs://weverse-dev-mint-bucket/...
Creating Tensorboard
Create Tensorboard backing LRO: projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/operations/6043520534572957696
Tensorboard created. Resource name: projects/377376242639/locations/us-central1/tensorboards/6791546985330507776
To use this Tensorboard in another session:
tb = aiplatform.Tensorboard('projects/377376242639/locations/us-central1/tensorboards/6791546985330507776')


## Import libraries

Import tutorial libraries.

In [5]:
# General
import random
import string

from IPython.display import HTML, Markdown, display

# Build agent
from google.cloud import aiplatform
import pandas as pd
import plotly.graph_objects as go
from vertexai.preview import reasoning_engines

# Evaluate agent
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    TrajectorySingleToolUse,
)

In [12]:
import langchain

langchain.debug = True

## 헬퍼 함수 정의

튜토리얼 결과를 출력하기 위한 헬퍼 함수 집합을 초기화합니다.

In [6]:
def get_id(length: int = 8) -> str:
    """Generate a uuid of a specified length (default=8)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))  # 지정된 길이(기본값=8)의 uuid를 생성합니다.


def display_eval_report(eval_result: pd.DataFrame) -> None:
    """Display the evaluation results."""
    # eval_result.summary_metrics 딕셔너리에서 Pandas DataFrame을 생성합니다.
    # orient="index"는 딕셔너리의 키를 DataFrame의 인덱스로 사용하도록 지정합니다.
    # .T는 DataFrame을 전치하여 요약 지표가 올바른 형식으로 표시되도록 합니다.
    metrics_df = pd.DataFrame.from_dict(eval_result.summary_metrics, orient="index").T
    display(Markdown("### Summary Metrics"))  # 요약 지표를 표시합니다.
    display(metrics_df)  # 지표 데이터프레임을 표시합니다.

    display(Markdown(f"### Row-wise Metrics"))  # 행별 지표를 표시합니다.
    display(eval_result.metrics_table)  # 지표 테이블을 표시합니다.


def display_drilldown(row: pd.Series) -> None:
    """Displays a drill-down view for trajectory data within a row."""

    style = "white-space: pre-wrap; width: 800px; overflow-x: auto;"  # 스타일을 정의합니다.

    if not (
        isinstance(row["predicted_trajectory"], list)
        and isinstance(row["reference_trajectory"], list)
    ):  # 예측된 궤적과 참조 궤적이 모두 리스트인지 확인합니다.
        return  # 그렇지 않으면 반환합니다.

    for predicted_trajectory, reference_trajectory in zip(
        row["predicted_trajectory"], row["reference_trajectory"]
    ):  # 예측된 궤적과 참조 궤적을 반복합니다.
        display(
            HTML(
                f"<h3>Tool Names:</h3><div style='{style}'>{predicted_trajectory['tool_name'], reference_trajectory['tool_name']}</div>"
            )
        )  # 도구 이름을 표시합니다.

        if not (
            isinstance(predicted_trajectory.get("tool_input"), dict)
            and isinstance(reference_trajectory.get("tool_input"), dict)
        ):  # 예측된 궤적과 참조 궤적의 도구 입력이 모두 딕셔너리인지 확인합니다.
            continue  # 그렇지 않으면 계속합니다.

        for tool_input_key in predicted_trajectory["tool_input"]:  # 예측된 궤적의 도구 입력을 반복합니다.
            print("Tool Input Key: ", tool_input_key)  # 도구 입력 키를 출력합니다.

            if tool_input_key in reference_trajectory["tool_input"]:  # 도구 입력 키가 참조 궤적에 있는지 확인합니다.
                print(
                    "Tool Values: ",
                    predicted_trajectory["tool_input"][tool_input_key],
                    reference_trajectory["tool_input"][tool_input_key],
                )  # 도구 값을 출력합니다.
            else:
                print(
                    "Tool Values: ",
                    predicted_trajectory["tool_input"][tool_input_key],
                    "N/A",
                )  # 도구 값을 출력합니다.
        print("\n")  # 새 줄을 출력합니다.
    display(HTML("<hr>"))  # 가로선을 표시합니다.


def display_dataframe_rows(
    df: pd.DataFrame,
    columns: list[str] | None = None,
    num_rows: int = 3,
    display_drilldown: bool = False,
) -> None:
    """Displays a subset of rows from a DataFrame, optionally including a drill-down view."""

    if columns:  # 열이 지정된 경우
        df = df[columns]  # 데이터프레임을 해당 열로 필터링합니다.

    base_style = "font-family: monospace; font-size: 14px; white-space: pre-wrap; width: auto; overflow-x: auto;"  # 기본 스타일을 정의합니다.
    header_style = base_style + "font-weight: bold;"  # 헤더 스타일을 정의합니다.

    for _, row in df.head(num_rows).iterrows():  # 데이터프레임의 처음 num_rows 행을 반복합니다.
        for column in df.columns:  # 데이터프레임의 열을 반복합니다.
            display(
                HTML(
                    f"<span style='{header_style}'>{column.replace('_', ' ').title()}: </span>"
                )
            )  # 헤더를 표시합니다.
            display(HTML(f"<span style='{base_style}'>{row[column]}</span><br>"))  # 값을 표시합니다.

        display(HTML("<hr>"))  # 가로선을 표시합니다.

        if (
            display_drilldown
            and "predicted_trajectory" in df.columns
            and "reference_trajectory" in df.columns
        ):  # 드릴다운을 표시해야 하는 경우
            display_drilldown(row)  # 드릴다운을 표시합니다.


def plot_bar_plot(
    eval_result: pd.DataFrame, title: str, metrics: list[str] = None
) -> None:
    fig = go.Figure()  # Figure 객체를 생성합니다.
    data = []  # 데이터를 저장할 리스트를 초기화합니다.

    summary_metrics = eval_result.summary_metrics  # 요약 지표를 가져옵니다.
    if metrics:  # 지표가 지정된 경우
        summary_metrics = {
            k: summary_metrics[k]
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }  # 지정된 지표만 필터링합니다.

    data.append(
        go.Bar(
            x=list(summary_metrics.keys()),
            y=list(summary_metrics.values()),
            name=title,
        )
    )  # 막대 그래프 데이터를 추가합니다.

    fig = go.Figure(data=data)  # Figure 객체를 생성합니다.

    # Change the bar mode
    fig.update_layout(barmode="group")  # 레이아웃을 업데이트합니다.
    fig.show()  # 그래프를 표시합니다.


def display_radar_plot(eval_results, title: str, metrics=None):
    """Plot the radar plot."""
    fig = go.Figure()  # Figure 객체를 생성합니다.
    summary_metrics = eval_results.summary_metrics  # 요약 지표를 가져옵니다.
    if metrics:  # 지표가 지정된 경우
        summary_metrics = {
            k: summary_metrics[k]
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }  # 지정된 지표만 필터링합니다.

    min_val = min(summary_metrics.values())  # 최소값을 가져옵니다.
    max_val = max(summary_metrics.values())  # 최대값을 가져옵니다.

    fig.add_trace(
        go.Scatterpolar(
            r=list(summary_metrics.values()),
            theta=list(summary_metrics.keys()),
            fill="toself",
            name=title,
        )
    )  # 레이더 차트 데이터를 추가합니다.
    fig.update_layout(
        title=title,
        polar=dict(radialaxis=dict(visible=True, range=[min_val, max_val])),
        showlegend=True,
    )  # 레이아웃을 업데이트합니다.
    fig.show()  # 그래프를 표시합니다.

## Vertex AI Reasoning Engine의 사전 구축된 템플릿을 사용하여 LangChain 에이전트 구축 및 배포

Gemini 모델 및 사용자가 정의한 사용자 정의 도구를 포함하여 LangChain을 사용하여 애플리케이션을 구축하고 배포합니다.


### 도구 설정

시작하려면 에이전트가 작업을 수행하는 데 필요한 몇 가지 도구를 만드십시오. 이 Colab에는 데이터베이스가 있다고 가정하지만 실제 에이전트의 경우 데이터베이스 또는 타사 시스템에 연결합니다.

In [7]:
def get_product_details(product_name: str):
    """Gathers basic details about a product."""
    details = {
        "smartphone": "A cutting-edge smartphone with advanced camera features and lightning-fast processing.",
        "usb charger": "A super fast and light usb charger",
        "shoes": "High-performance running shoes designed for comfort, support, and speed.",
        "headphones": "Wireless headphones with advanced noise cancellation technology for immersive audio.",
        "speaker": "A voice-controlled smart speaker that plays music, sets alarms, and controls smart home devices.",
    }
    return details.get(product_name, "Product details not found.")


def get_product_price(product_name: str):
    """Gathers price about a product."""
    details = {
        "smartphone": 500,
        "usb charger": 10,
        "shoes": 100,
        "headphones": 50,
        "speaker": 80,
    }
    return details.get(product_name, "Product price not found.")

### 모델 설정

에이전트가 사용할 Gemini AI 모델을 선택하세요. Gemini와 다양한 기능에 대해 궁금하다면 [공식 문서](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models)에서 자세한 내용을 확인하세요.

In [8]:
model = "gemini-1.5-pro"

### 에이전트 조립(assemble)

[Vertex AI Reasoning Engine](https://cloud.google.com/vertex-ai/generative-ai/docs/reasoning-engine/deploy)을 사용하여 LangChain 에이전트를 만들려면 LangchainAgent 클래스를 사용하세요. 이 클래스는 표준 템플릿으로 에이전트를 빠르게 실행하는 도움이 됩니다. 처음부터 시작할 필요 없이 에이전트를 구축하는 바로 가기라고 생각하세요. LangchainAgent는 기본 구조와 초기 구성을 처리하므로 에이전트를 바로 사용할 수 있습니다.

> 에이전트가 수행한 도구 호출을 반환하여 평가할 수 있도록 하는 추가 매개변수 `agent_executor_kwargs`에 유의하세요.

Vertex AI Gen AI 평가는 'Queryable' 에이전트(이 경우와 같이)와 직접 작동하며 특정 구조(서명)로 사용자 정의 함수를 추가할 수도 있습니다.

In [13]:
local_1p_agent = reasoning_engines.LangchainAgent(
    model=model,
    tools=[get_product_details, get_product_price],
    agent_executor_kwargs={"return_intermediate_steps": True},
)

### 로컬 에이전트 테스트

에이전트에 쿼리하세요.

In [14]:
response = local_1p_agent.query(input="Get product details for shoes")
display(Markdown(response["output"]))

[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Get product details for shoes"
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> 

High-performance running shoes designed for comfort, support, and speed. 


In [15]:
response = local_1p_agent.query(input="Get product price for shoes")
display(Markdown(response["output"]))

[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Get product price for shoes"
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > 

The price for shoes is 100. 


### Vertex AI Reasoning Engine에 로컬 에이전트 배포

Vertex AI Reasoning Engine에 로컬 에이전트를 배포하려면 에이전트와 몇 가지 특정 종속성(외부 PyPI 패키지에 대한 `requirements` 및 로컬 패키지에 대한 `extra_packages`)을 전달하여 `create` 메서드를 사용할 수 있습니다.

자세한 내용은 [애플리케이션 배포](https://cloud.google.com/vertex-ai/generative-ai/docs/reasoning-engine/deploy#create_a_reasoningengine_instance) 문서 페이지를 참조하세요.

> Vertex AI Reasoning Engine에 에이전트를 배포하는 데 약 10분이 소요됩니다.

In [22]:
remote_1p_agent = reasoning_engines.ReasoningEngine.create(
    local_1p_agent,
    requirements=[
        "google-cloud-aiplatform[langchain,reasoningengine]",
        "langchain_google_vertexai",
        "cloudpickle==3.0.0",
        "pydantic>=2.10",
        "requests",
    ],
)

Using bucket weverse-dev-mint-bucket
Writing to gs://weverse-dev-mint-bucket/reasoning_engine/reasoning_engine.pkl
Writing to gs://weverse-dev-mint-bucket/reasoning_engine/requirements.txt
Creating in-memory tarfile of extra_packages
Writing to gs://weverse-dev-mint-bucket/reasoning_engine/dependencies.tar.gz
Creating ReasoningEngine
Create ReasoningEngine backing LRO: projects/377376242639/locations/us-central1/reasoningEngines/6125581588479606784/operations/7899003581049602048
ReasoningEngine created. Resource name: projects/377376242639/locations/us-central1/reasoningEngines/6125581588479606784
To use this ReasoningEngine in another session:
reasoning_engine = vertexai.preview.reasoning_engines.ReasoningEngine('projects/377376242639/locations/us-central1/reasoningEngines/6125581588479606784')


### Test the remote agent

Query your remote agent.

In [None]:
response = remote_1p_agent.query(input="Get product details for shoes")
display(Markdown(response["output"]))

## Vertex AI Gen AI Evaluation을 사용한 에이전트 평가

AI 에이전트를 사용할 때는 **에이전트의 성능**과 **작동 상태를 추적**하는 것이 중요합니다. 이는 주로 **모니터링**(monitoring)과 **관찰 가능성**(observability)이라는 두 가지 방법으로 확인할 수 있습니다.

모니터링은 에이전트가 특정 작업을 얼마나 잘 수행하는지에 중점을 둡니다.

* **단일 도구 선택**: 에이전트가 작업에 적합한 도구를 선택하고 있습니까?

* **다중 도구 선택 (또는 궤적)**: 에이전트가 도구를 사용하는 순서에서 논리적인 선택을 하고 있습니까?

* **응답 생성**: 에이전트의 출력이 양호하며 사용된 도구를 기반으로 합리적입니까?

관찰 가능성은 에이전트의 전반적인 상태를 이해하는 데 중점을 둡니다.

* **지연 시간**: 에이전트가 응답하는 데 얼마나 걸립니까?

* **실패율**: 에이전트가 응답을 생성하지 못하는 빈도는 얼마나 됩니까?

Vertex AI Gen AI Evaluation 서비스는 에이전트를 프로토타입하는 동안 또는 프로덕션 환경에 배포한 후에 이러한 모든 측면을 평가하는 데 도움을 줍니다. 이 서비스는 [미리 빌드된 평가 기준 및 메트릭](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval)을 제공하므로 에이전트의 성능을 정확하게 파악하고 개선할 영역을 식별할 수 있습니다.

### 에이전트 평가 데이터세트 준비

Vertex AI Gen AI Evaluation 서비스를 사용하여 AI 에이전트를 평가하려면 에이전트의 어떤 측면을 평가할지에 따라 특정 데이터세트가 필요합니다.

이 데이터세트에는 에이전트에 제공된 프롬프트가 포함되어야 합니다. 또한 이상적이거나 예상되는 응답(ground truth)과 에이전트가 수행해야 하는 도구 호출의 의도된 시퀀스(reference trajectory)를 포함할 수 있습니다. reference trajectory는 각 프롬프트에 대해 에이전트가 호출할 것으로 예상되는 도구의 시퀀스를 나타냅니다.

> 선택적으로 생성된 응답과 예측된 trajectory를 모두 제공할 수 있습니다 (**bring-your-own-dataset 시나리오**).

아래에는 사용자 프롬프트와 reference trajectory를 사용하여 고객 지원 에이전트가 가질 수 있는 데이터세트의 예가 있습니다.

In [16]:
eval_data = {
    "prompt": [
        "Get price for smartphone",
        "Get product details and price for headphones",
        "Get details for usb charger",
        "Get product details and price for shoes",
        "Get product details for speaker?",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "smartphone"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "headphones"},
            },
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "headphones"},
            },
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "usb charger"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "shoes"},
            },
            {"tool_name": "get_product_price", "tool_input": {"product_name": "shoes"}},
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "speaker"},
            }
        ],
    ],
}

eval_sample_dataset = pd.DataFrame(eval_data)

데이터세트에서 몇 가지 샘플을 출력합니다.

In [17]:
display_dataframe_rows(eval_sample_dataset, num_rows=3)

### 단일 도구 사용 평가

AI 에이전트와 평가 데이터세트를 설정한 후, 에이전트가 주어진 작업에 대해 올바른 단일 도구를 선택하는지 평가하기 시작합니다.


#### 단일 도구 사용 메트릭 설정

Vertex AI Gen AI 평가의 `trajectory_single_tool_use` 메트릭은 에이전트가 특정 도구 순서에 관계없이 예상대로 도구를 사용하는지 신속하게 평가할 수 있는 방법을 제공합니다. 이는 에이전트의 프로세스 중 특정 시점에 올바른 도구가 사용되었는지 평가하는 기본적이지만 유용한 방법입니다.

`trajectory_single_tool_use` 메트릭을 사용하려면 특정 사용자 요청에 대해 어떤 도구를 사용해야 하는지 설정해야 합니다. 예를 들어, 사용자가 "이메일 보내기"를 요청하는 경우 에이전트가 "send_email" 도구를 사용할 것으로 예상하고, 이 메트릭을 사용할 때 해당 도구의 이름을 지정합니다.


In [23]:
single_tool_usage_metrics = [TrajectorySingleToolUse(tool_name="get_product_price")]

#### 평가 작업 실행

평가를 실행하려면 실험 내에서 미리 정의된 데이터세트(`eval_sample_dataset`)와 메트릭(`single_tool_usage_metrics` in this case)을 사용하여 `EvalTask`를 시작합니다. 그런 다음 원격 에이전트를 사용하여 평가를 실행하고 이 특정 평가 실행에 고유한 식별자를 할당하여 평가 결과를 저장하고 시각화합니다.


In [24]:
EXPERIMENT_RUN = f"single-metric-eval-{get_id()}"

로컬 에이전트 실험하기

In [21]:
single_tool_call_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=single_tool_usage_metrics,
    experiment=EXPERIMENT_NAME,
)

single_tool_call_eval_result = single_tool_call_eval_task.evaluate(
    runnable=local_1p_agent, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(single_tool_call_eval_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-single-metric-eval-t3aictws to Experiment: evaluate-re-agent


Logging Eval experiment evaluation metadata: {'model_name': 'gemini-1.5-pro', 'tools': [<function get_product_details at 0x35dc63ec0>, <function get_product_price at 0x35dc63f60>]}
Experiment evaluation metadata logging failed: Value for key tools is of type list but must be one of float, int, str


  0%|          | 0/5 [00:00<?, ?it/s]

[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Get price for smartphone"
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad>] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[32;1m[1;3m[chain/start][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > chain:RunnableLambda] Entering Chain run with input:
[0m{
  "input": ""
}
[36;1m[1;3m[chain/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > chain:RunnableParallel<input,agent_scratchpad> > cha

This model can reply with multiple function calls in one response. Please don't rely on `additional_kwargs.function_call` as only the last one will be saved.Use `tool_calls` instead.
This model can reply with multiple function calls in one response. Please don't rely on `additional_kwargs.function_call` as only the last one will be saved.Use `tool_calls` instead.


[36;1m[1;3m[llm/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > llm:ChatVertexAI] [2.19s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "",
        "generation_info": {
          "safety_ratings": [
            {
              "category": "HARM_CATEGORY_HATE_SPEECH",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.103515625,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.11279296875
            },
            {
              "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.171875,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.11279296875
            },
            {
              "category": "HARM_CATEGORY_HARASSMENT",
              "probability_label": "NEGLIGIBLE",
  

 20%|██        | 1/5 [00:02<00:10,  2.70s/it]

[36;1m[1;3m[llm/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > llm:ChatVertexAI] [839ms] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "The price for smartphone is 500. \n",
        "generation_info": {
          "safety_ratings": [
            {
              "category": "HARM_CATEGORY_HATE_SPEECH",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.044677734375,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.1298828125
            },
            {
              "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.125,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.162109375
            },
            {
              "category": "HARM_CATEGORY_HARASSMENT",
              "prob

 60%|██████    | 3/5 [00:03<00:01,  1.15it/s]

[36;1m[1;3m[llm/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > llm:ChatVertexAI] [1.40s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "A super fast and light usb charger. \n",
        "generation_info": {
          "safety_ratings": [
            {
              "category": "HARM_CATEGORY_HATE_SPEECH",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.04345703125,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.11279296875
            },
            {
              "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.054931640625,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.146484375
            },
            {
              "category": "HARM_CATEGORY_HARASSMENT",
       

100%|██████████| 5/5 [00:03<00:00,  1.32it/s]

[36;1m[1;3m[llm/end][0m [1m[chain:AgentExecutor > chain:RunnableSequence > llm:ChatVertexAI] [1.47s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "The product details for shoes are: High-performance running shoes designed for comfort, support, and speed. The price is: 100. \n",
        "generation_info": {
          "safety_ratings": [
            {
              "category": "HARM_CATEGORY_HATE_SPEECH",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.09130859375,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.0888671875
            },
            {
              "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
              "probability_label": "NEGLIGIBLE",
              "probability_score": 0.046630859375,
              "blocked": false,
              "severity": "HARM_SEVERITY_NEGLIGIBLE",
              "severity_score": 0.0461425781


100%|██████████| 5/5 [00:04<00:00,  1.17it/s]

All 5 metric requests are successfully computed.
Evaluation Took:4.277733208145946 seconds





### Summary Metrics

Unnamed: 0,row_count,trajectory_single_tool_use/mean,trajectory_single_tool_use/std,latency_in_seconds/mean,latency_in_seconds/std,failure/mean,failure/std
0,5.0,0.6,0.547723,3.375335,0.426647,0.0,0.0


### Row-wise Metrics

Unnamed: 0,prompt,reference_trajectory,response,latency_in_seconds,failure,predicted_trajectory,trajectory_single_tool_use/score
0,Get price for smartphone,"[{'tool_name': 'get_product_price', 'tool_inpu...",The price for smartphone is 500. \n,2.698278,0,"[{'tool_name': 'get_product_price', 'tool_inpu...",1.0
1,Get product details and price for headphones,"[{'tool_name': 'get_product_details', 'tool_in...",The headphones are wireless with advanced nois...,3.783446,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0
2,Get details for usb charger,"[{'tool_name': 'get_product_details', 'tool_in...",A super fast and light usb charger. \n,3.279385,0,"[{'tool_name': 'get_product_details', 'tool_in...",0.0
3,Get product details and price for shoes,"[{'tool_name': 'get_product_details', 'tool_in...",The product details for shoes are: High-perfor...,3.674767,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0
4,Get product details for speaker?,"[{'tool_name': 'get_product_details', 'tool_in...",A voice-controlled smart speaker that plays mu...,3.440797,0,"[{'tool_name': 'get_product_details', 'tool_in...",0.0


리모트 에이전스 실험하기

In [25]:
single_tool_call_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=single_tool_usage_metrics,
    experiment=EXPERIMENT_NAME,
)

single_tool_call_eval_result = single_tool_call_eval_task.evaluate(
    runnable=remote_1p_agent, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(single_tool_call_eval_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-single-metric-eval-0tbqvw55 to Experiment: evaluate-re-agent


100%|██████████| 5/5 [00:03<00:00,  1.44it/s]

All 5 responses are successfully generated from the runnable.
Computing metrics with a total of 5 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 5/5 [00:04<00:00,  1.15it/s]

All 5 metric requests are successfully computed.
Evaluation Took:4.362451583147049 seconds





### Summary Metrics

Unnamed: 0,row_count,trajectory_single_tool_use/mean,trajectory_single_tool_use/std,latency_in_seconds/mean,latency_in_seconds/std,failure/mean,failure/std
0,5.0,0.6,0.547723,3.041032,0.353834,0.0,0.0


### Row-wise Metrics

Unnamed: 0,prompt,reference_trajectory,response,latency_in_seconds,failure,predicted_trajectory,trajectory_single_tool_use/score
0,Get price for smartphone,"[{'tool_name': 'get_product_price', 'tool_inpu...",The price for smartphone is 500. \n,2.749151,0,"[{'tool_name': 'get_product_price', 'tool_inpu...",1.0
1,Get product details and price for headphones,"[{'tool_name': 'get_product_details', 'tool_in...",The headphones are wireless with advanced nois...,3.376055,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0
2,Get details for usb charger,"[{'tool_name': 'get_product_details', 'tool_in...",A super fast and light usb charger. \n,2.726993,0,"[{'tool_name': 'get_product_details', 'tool_in...",0.0
3,Get product details and price for shoes,"[{'tool_name': 'get_product_details', 'tool_in...",The product details for shoes are: High-perfor...,3.466347,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0
4,Get product details for speaker?,"[{'tool_name': 'get_product_details', 'tool_in...",A voice-controlled smart speaker that plays mu...,2.886612,0,"[{'tool_name': 'get_product_details', 'tool_in...",0.0


#### 평가 결과 시각화

몇 가지 헬퍼 함수를 사용하여 평가 결과 샘플을 시각화합니다.

In [26]:
display_dataframe_rows(single_tool_call_eval_result.metrics_table, num_rows=3)

### 궤적(Trajectory) 평가

주어진 작업에 대해 에이전트가 가장 적절한 도구를 선택하는 능력을 평가한 후, 사용자 입력(궤적)과 관련된 도구 시퀀스 선택을 분석하여 평가를 일반화합니다. 이는 에이전트가 올바른 도구를 선택할 뿐만 아니라 합리적이고 효과적인 순서로 사용하는지 평가합니다.

#### 궤적(Trajectory) 메트릭 설정

에이전트의 궤적을 평가하기 위해 Vertex AI Gen AI Evaluation은 다음과 같은 몇 가지 ground-truth 기반 메트릭을 제공합니다.

* `trajectory_exact_match`: 동일한 궤적 (동일한 액션, 동일한 순서)

* `trajectory_in_order_match`: 예측된 궤적에 참조 액션이 순서대로 존재 (추가 액션 허용)

* `trajectory_any_order_match`: 예측된 궤적에 모든 참조 액션이 존재 (순서, 추가 액션은 중요하지 않음).

* `trajectory_precision`: 참조 액션 중 예측된 액션의 비율

* `trajectory_recall`: 예측된 액션 중 참조 액션의 비율

모든 메트릭은 0 또는 1점을 반환하며, `trajectory_precision` 및 `trajectory_recall`은 0에서 1 사이의 값을 가집니다.

In [27]:
trajectory_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "trajectory_any_order_match",
    "trajectory_precision",
    "trajectory_recall",
]

#### 평가 작업 실행

새로운 `EvalTask`의 `evaluate` 메서드를 실행하여 평가를 제출합니다.

In [28]:
EXPERIMENT_RUN = f"trajectory-{get_id()}"

trajectory_eval_task = EvalTask(
    dataset=eval_sample_dataset, metrics=trajectory_metrics, experiment=EXPERIMENT_NAME
)

trajectory_eval_result = trajectory_eval_task.evaluate(runnable=remote_1p_agent)

display_eval_report(trajectory_eval_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-0a72a189-07f1-41e9-aaf9-9319fec09dfa to Experiment: evaluate-re-agent


100%|██████████| 5/5 [00:03<00:00,  1.48it/s]

All 5 responses are successfully generated from the runnable.
Computing metrics with a total of 25 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 25/25 [00:24<00:00,  1.03it/s]

All 25 metric requests are successfully computed.
Evaluation Took:24.341351083945483 seconds





### Summary Metrics

Unnamed: 0,row_count,trajectory_exact_match/mean,trajectory_exact_match/std,trajectory_in_order_match/mean,trajectory_in_order_match/std,trajectory_any_order_match/mean,trajectory_any_order_match/std,trajectory_precision/mean,trajectory_precision/std,trajectory_recall/mean,trajectory_recall/std,latency_in_seconds/mean,latency_in_seconds/std,failure/mean,failure/std
0,5.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.927058,0.386137,0.0,0.0


### Row-wise Metrics

Unnamed: 0,prompt,reference_trajectory,response,latency_in_seconds,failure,predicted_trajectory,trajectory_exact_match/score,trajectory_in_order_match/score,trajectory_any_order_match/score,trajectory_precision/score,trajectory_recall/score
0,Get price for smartphone,"[{'tool_name': 'get_product_price', 'tool_inpu...",The price for smartphone is 500.,2.593548,0,"[{'tool_name': 'get_product_price', 'tool_inpu...",1.0,1.0,1.0,1.0,1.0
1,Get product details and price for headphones,"[{'tool_name': 'get_product_details', 'tool_in...",The headphones are wireless with advanced nois...,3.377865,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,1.0,1.0,1.0
2,Get details for usb charger,"[{'tool_name': 'get_product_details', 'tool_in...",A super fast and light usb charger. \n,2.56575,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,1.0,1.0,1.0
3,Get product details and price for shoes,"[{'tool_name': 'get_product_details', 'tool_in...",The product details for shoes are: High-perfor...,3.295677,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,1.0,1.0,1.0
4,Get product details for speaker?,"[{'tool_name': 'get_product_details', 'tool_in...",A voice-controlled smart speaker that plays mu...,2.802448,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,1.0,1.0,1.0


#### 평가 결과 시각화

평가 결과 샘플을 출력하고 시각화합니다.

In [29]:
display_dataframe_rows(trajectory_eval_result.metrics_table, num_rows=3)

In [30]:
plot_bar_plot(
    trajectory_eval_result,
    title="Trajectory Metrics",
    metrics=[f"{metric}/mean" for metric in trajectory_metrics],
)

### 최종 응답 평가

모델 평가와 유사하게, Vertex AI Gen AI Evaluation을 사용하여 에이전트의 최종 응답을 평가할 수 있습니다.

#### 응답 메트릭 설정

에이전트 추론 후, Vertex AI Gen AI Evaluation은 생성된 응답을 평가하기 위한 여러 메트릭을 제공합니다. 계산 기반 메트릭을 사용하여 응답을 참조(필요한 경우)와 비교하고, 기존 또는 사용자 정의 모델 기반 메트릭을 사용하여 최종 응답의 품질을 판단할 수 있습니다.

자세한 내용은 [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval)을 참조하십시오.


In [31]:
response_metrics = ["safety", "coherence"]

#### 평가 작업 실행

에이전트가 생성한 응답을 평가하려면 `EvalTask` 클래스의 `evaluate` 메서드를 사용하세요.

In [32]:
EXPERIMENT_RUN = f"response-{get_id()}"

response_eval_task = EvalTask(
    dataset=eval_sample_dataset, metrics=response_metrics, experiment=EXPERIMENT_NAME
)

response_eval_result = response_eval_task.evaluate(runnable=remote_1p_agent)

display_eval_report(response_eval_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-e4263426-a2b5-46d0-8b03-8fca3996d7a1 to Experiment: evaluate-re-agent


100%|██████████| 5/5 [00:03<00:00,  1.50it/s]

All 5 responses are successfully generated from the runnable.
Computing metrics with a total of 10 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 10/10 [00:23<00:00,  2.32s/it]

All 10 metric requests are successfully computed.
Evaluation Took:23.206598249962553 seconds





### Summary Metrics

Unnamed: 0,row_count,safety/mean,safety/std,coherence/mean,coherence/std,latency_in_seconds/mean,latency_in_seconds/std,failure/mean,failure/std
0,5.0,1.0,0.0,2.0,0.707107,2.931708,0.309561,0.0,0.0


### Row-wise Metrics

Unnamed: 0,prompt,reference_trajectory,response,latency_in_seconds,failure,predicted_trajectory,safety/explanation,safety/score,coherence/explanation,coherence/score
0,Get price for smartphone,"[{'tool_name': 'get_product_price', 'tool_inpu...",The price for smartphone is 500. \n,2.666916,0,"[{'tool_name': 'get_product_price', 'tool_inpu...",The response is safe. It provides the price fo...,1.0,STEP 1: The purpose is to get the price for a ...,2.0
1,Get product details and price for headphones,"[{'tool_name': 'get_product_details', 'tool_in...",The headphones are wireless with advanced nois...,3.181032,0,"[{'tool_name': 'get_product_details', 'tool_in...",The response is safe as it provides product de...,1.0,STEP 1: The purpose is to provide product deta...,3.0
2,Get details for usb charger,"[{'tool_name': 'get_product_details', 'tool_in...",A super fast and light usb charger. \n,2.669133,0,"[{'tool_name': 'get_product_details', 'tool_in...",The response is safe because it does not conta...,1.0,STEP 1: The purpose is to get details for a US...,1.0
3,Get product details and price for shoes,"[{'tool_name': 'get_product_details', 'tool_in...",The product details for shoes are: High-perfor...,3.338532,0,"[{'tool_name': 'get_product_details', 'tool_in...",The response is safe. It provides information ...,1.0,STEP 1: The prompt seeks specific product deta...,2.0
4,Get product details for speaker?,"[{'tool_name': 'get_product_details', 'tool_in...",A voice-controlled smart speaker that plays mu...,2.802927,0,"[{'tool_name': 'get_product_details', 'tool_in...",The response is safe. It describes a smart spe...,1.0,STEP 1: The prompt seeks basic product details...,2.0


#### Visualize evaluation results


Print new evaluation result sample.

In [33]:
display_dataframe_rows(response_eval_result.metrics_table, num_rows=3)

### 도구 선택에 따라 생성된 응답 평가

환경과 상호 작용하는 AI 에이전트를 평가할 때 일관성과 같은 표준 텍스트 생성 메트릭은 충분하지 않을 수 있습니다. 이러한 메트릭은 주로 텍스트 구조에 중점을 두지만, 에이전트 응답은 환경 내에서의 효과를 기준으로 평가해야 하기 때문입니다.

대신, 이 섹션에서와 같이 에이전트 응답이 도구 선택으로부터 논리적으로 도출되는지 평가하는 사용자 정의 메트릭을 사용하십시오.

#### 사용자 정의 메트릭 정의

[설명서](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#model-based-metrics)에 따르면, AI 에이전트의 응답이 해당 작업으로부터 논리적으로 도출되는지 여부를 평가하기 위한 프롬프트 템플릿을 정의하여 평가 기준과 평가 시스템을 할 수 있습니다.

평가 지침을 설정하는 `criteria`와 점수 시스템(1 또는 0)을 제공하는 `pointwise_rating_rubric`을 정의합니다. 그런 다음 이러한 구성 요소를 사용하여 템플릿을 만드는 데 `PointwiseMetricPromptTemplate`을 사용합니다.


In [34]:
criteria = {
    "Follows trajectory": (
        "Evaluate whether the agent's response logically follows from the "
        "sequence of actions it took. Consider these sub-points:\n"
        "  - Does the response reflect the information gathered during the trajectory?\n"
        "  - Is the response consistent with the goals and constraints of the task?\n"
        "  - Are there any unexpected or illogical jumps in reasoning?\n"
        "Provide specific examples from the trajectory and response to support your evaluation."
    )
}

pointwise_rating_rubric = {
    "1": "Follows trajectory",
    "0": "Does not follow trajectory",
}

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria=criteria,
    rating_rubric=pointwise_rating_rubric,
    input_variables=["prompt", "predicted_trajectory"],
)

In [35]:
print(response_follows_trajectory_prompt_template.prompt_data)

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
Follows trajectory: Evaluate whether the agent's response logically follows from the sequence of actions it took. Consider these sub-points:
  - Does the response reflect the information gathered during the trajectory?
  - Is the response consistent with the goals and constraints of the task?
  - Are there any unexpected or illogical jumps in reasoning?
Provide specific examples from the trajectory and response

In [45]:
response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)

In [43]:
criteria_ko = {
    "Follows trajectory": (
        "에이전트의 응답이 수행한 작업 순서로부터 논리적으로 도출되는지 평가합니다. 다음 하위 사항을 고려하십시오.\n"
        "  - 응답이 궤적 동안 수집된 정보를 반영합니까?\n"
        "  - 응답이 작업의 목표 및 제약 조건과 일치합니까?\n"
        "  - 추론에 예기치 않거나 비논리적인 도약이 있습니까?\n"
        "평가를 뒷받침하기 위해 궤적 및 응답에서 구체적인 예를 제공하십시오."
    )
}

pointwise_rating_rubric_ko = {
    "1": "궤적을 따름",
    "0": "궤적을 따르지 않음",
}

response_follows_trajectory_prompt_template_ko = PointwiseMetricPromptTemplate(
    criteria=criteria_ko,
    rating_rubric=pointwise_rating_rubric_ko,
    input_variables=["prompt", "predicted_trajectory"],
)

In [44]:
print(response_follows_trajectory_prompt_template_ko.prompt_data)

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step by step explanations for your rating, and only choose ratings from the Rating Rubric.


# Evaluation
## Criteria
Follows trajectory: 에이전트의 응답이 수행한 작업 순서로부터 논리적으로 도출되는지 평가합니다. 다음 하위 사항을 고려하십시오.
  - 응답이 궤적 동안 수집된 정보를 반영합니까?
  - 응답이 작업의 목표 및 제약 조건과 일치합니까?
  - 추론에 예기치 않거나 비논리적인 도약이 있습니까?
평가를 뒷받침하기 위해 궤적 및 응답에서 구체적인 예를 제공하십시오.

## Rating Rubric
0: 궤적을 따르지 않음
1: 궤적을 따름

## Evaluation Steps
Step 1: Assess the response in aspects of all criteria provided. Provide assessment according to each criterion.
Step 2: Score based on the 

Print the prompt_data of this template containing the combined criteria and rubric information ready for use in an evaluation.

After you define the evaluation prompt template, set up the associated metric to evaluate how well a response follows a specific trajectory. The `PointwiseMetric` creates a metric where `response_follows_trajectory` is the metric's name and `response_follows_trajectory_prompt_template` provides instructions or context for evaluation you set up before.


In [46]:
response_follows_trajectory_metric_ko = PointwiseMetric(
    metric="response_follows_trajectory_ko",
    metric_prompt_template=response_follows_trajectory_prompt_template_ko,
)

#### Set response metrics

Set new generated response evaluation metrics by including the custom metric.


In [47]:
response_tool_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "safety",
    response_follows_trajectory_metric,
]

#### Run an evaluation task

Run a new agent's evaluation.

In [48]:
EXPERIMENT_RUN = f"response-over-tools-{get_id()}"

response_eval_tool_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_tool_metrics,
    experiment=EXPERIMENT_NAME,
)

response_eval_tool_result = response_eval_tool_task.evaluate(runnable=remote_1p_agent)

display_eval_report(response_eval_tool_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-3a00c379-0f19-4ebe-8a8c-822a39763c4f to Experiment: evaluate-re-agent


100%|██████████| 5/5 [00:03<00:00,  1.49it/s]

All 5 responses are successfully generated from the runnable.
Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 20/20 [00:30<00:00,  1.55s/it]

All 20 metric requests are successfully computed.
Evaluation Took:30.961094042053446 seconds





### Summary Metrics

Unnamed: 0,row_count,trajectory_exact_match/mean,trajectory_exact_match/std,trajectory_in_order_match/mean,trajectory_in_order_match/std,safety/mean,safety/std,response_follows_trajectory/mean,response_follows_trajectory/std,latency_in_seconds/mean,latency_in_seconds/std,failure/mean,failure/std
0,5.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.966488,0.343044,0.0,0.0


### Row-wise Metrics

Unnamed: 0,prompt,reference_trajectory,response,latency_in_seconds,failure,predicted_trajectory,trajectory_exact_match/score,trajectory_in_order_match/score,safety/explanation,safety/score,response_follows_trajectory/explanation,response_follows_trajectory/score
0,Get price for smartphone,"[{'tool_name': 'get_product_price', 'tool_inpu...",The price for smartphone is 500. \n,2.660166,0,"[{'tool_name': 'get_product_price', 'tool_inpu...",1.0,1.0,The response is safe. It provides the price fo...,1.0,The response logically follows the trajectory....,1.0
1,Get product details and price for headphones,"[{'tool_name': 'get_product_details', 'tool_in...",The headphones are wireless with advanced nois...,3.280945,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,The response is safe as it provides harmless i...,1.0,The response mentions the product details (wir...,1.0
2,Get details for usb charger,"[{'tool_name': 'get_product_details', 'tool_in...",A super fast and light usb charger. \n,2.624864,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,"The response is safe, as it does not contain a...",1.0,The AI's response provides a brief description...,1.0
3,Get product details and price for shoes,"[{'tool_name': 'get_product_details', 'tool_in...",The product details for shoes are: High-perfor...,3.363108,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,The response provides harmless product details...,1.0,The response accurately reflects the informati...,1.0
4,Get product details for speaker?,"[{'tool_name': 'get_product_details', 'tool_in...",A voice-controlled smart speaker that plays mu...,2.903357,0,"[{'tool_name': 'get_product_details', 'tool_in...",1.0,1.0,The response is safe. It provides a harmless a...,1.0,The response accurately reflects the informati...,1.0


#### Visualize evaluation results

Visualize evaluation result sample.

In [49]:
display_dataframe_rows(response_eval_tool_result.metrics_table, num_rows=3)

## 보너스: Bring-Your-Own-Dataset (BYOD)을 사용하여 Vertex AI Gen AI 평가를 통해 LangGraph 에이전트 평가하기

Bring Your Own Dataset (BYOD) [시나리오](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-dataset)에서는 에이전트로부터 예측된 궤적과 생성된 응답을 모두 제공합니다.


### 자체 평가 데이터세트 가져오기

예측된 궤적과 생성된 응답을 사용하여 평가 데이터세트를 정의합니다.

In [50]:
byod_eval_data = {
    "prompt": [
        "Get price for smartphone",
        "Get product details and price for headphones",
        "Get details for usb charger",
        "Get product details and price for shoes",
        "Get product details for speaker?",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "smartphone"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "headphones"},
            },
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "headphones"},
            },
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "usb charger"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "shoes"},
            },
            {"tool_name": "get_product_price", "tool_input": {"product_name": "shoes"}},
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "speaker"},
            }
        ],
    ],
    "predicted_trajectory": [
        [
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "smartphone"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "headphones"},
            },
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "headphones"},
            },
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "usb charger"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "shoes"},
            },
            {"tool_name": "get_product_price", "tool_input": {"product_name": "shoes"}},
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "speaker"},
            }
        ],
    ],
    "response": [
        500,
        50,
        "A super fast and light usb charger",
        100,
        "A voice-controlled smart speaker that plays music, sets alarms, and controls smart home devices.",
    ],
}

byod_eval_sample_dataset = pd.DataFrame(eval_data)

### Run an evaluation task

Run a new agent's evaluation using your own dataset and the same setting of the latest evaluation.

In [51]:
EXPERIMENT_RUN_NAME = f"response-over-tools-byod-{get_id()}"

byod_response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
    experiment=EXPERIMENT_NAME,
)

byod_response_eval_tool_result = byod_response_eval_tool_task.evaluate(
    experiment_run_name=EXPERIMENT_RUN_NAME
)

display_eval_report(byod_response_eval_tool_result)

Associating projects/377376242639/locations/us-central1/metadataStores/default/contexts/evaluate-re-agent-response-over-tools-byod-51yzs9rv to Experiment: evaluate-re-agent


ValueError: Cannot find the `response` column in the evaluation dataset to fill the metric prompt template for `safety` metric. Please check if the column is present in the evaluation dataset, or provide a key-value pair in `metric_column_mapping` parameter of `EvalTask` to map it to a different column name. The evaluation dataset columns are ['prompt', 'reference_trajectory'].

#### Visualize evaluation results

Visualize evaluation result sample.


In [None]:
display_dataframe_rows(byod_response_eval_tool_result.metrics_table, num_rows=3)

In [52]:
display_radar_plot(
    byod_response_eval_tool_result,
    title="Agent evaluation metrics",
    metrics=[f"{metric}/mean" for metric in response_tool_metrics],
)

NameError: name 'byod_response_eval_tool_result' is not defined

## Cleaning up


In [53]:
delete_experiment = True
delete_remote_agent = True

if delete_experiment:
    try:
        experiment = aiplatform.Experiment(EXPERIMENT_NAME)
        experiment.delete(delete_backing_tensorboard_runs=True)
    except Exception as e:
        print(e)

if delete_remote_agent:
    try:
        remote_1p_agent.delete()
    except Exception as e:
        print(e)

Deleting TensorboardRun : projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/experiments/evaluate-re-agent/runs/response-over-tools-byod-51yzs9rv
TensorboardRun deleted. . Resource name: projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/experiments/evaluate-re-agent/runs/response-over-tools-byod-51yzs9rv
Deleting TensorboardRun resource: projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/experiments/evaluate-re-agent/runs/response-over-tools-byod-51yzs9rv
Delete TensorboardRun backing LRO: projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/experiments/evaluate-re-agent/operations/3410322127444770816
TensorboardRun resource projects/377376242639/locations/us-central1/tensorboards/6791546985330507776/experiments/evaluate-re-agent/runs/response-over-tools-byod-51yzs9rv deleted.
Deleting Artifact : projects/377376242639/locations/us-central1/metadataStores/default/artifacts/evaluate-re-