Gen AI eval: MULTI_TURN_GENERAL_QUALITY fails with "conversation_history is required but not provided" even when present #6317

@zacaries-benamiar-pm

Description

Summary

Gen AI eval: MULTI_TURN_GENERAL_QUALITY fails with "conversation_history is required but not provided" even when the column is present in the dataset


Environment details

  • OS type and version: Colab / Colab Enterprise (Linux, managed runtime)
  • Python version: Python 3.10 (from Colab)
  • pip version: pip 24.x (from Colab)
  • google-cloud-aiplatform version: 1.134.0
    (also reproducible on 1.135.0)
  • google-genai version: 1.61.0
  • API usage: vertexai.Client(..., http_options=genai_types.HttpOptions(api_version="v1beta1"))

Steps to reproduce

  1. Install the SDK in a fresh Colab / Colab Enterprise runtime

    %pip install -U -q "google-cloud-aiplatform[evaluation]==1.134.0"
    
    import google.cloud.aiplatform as aiplatform
    import google.genai as genai
    print("aiplatform:", aiplatform.__version__)   # 1.134.0
    print("genai:", genai.__version__)            # 1.61.0
  2. Initialize Vertex and define agent_info

    import os
    import vertexai
    from google.cloud import storage
    from google.genai import types as genai_types
    from vertexai import Client
    from vertexai import types
    import pandas as pd
    
    PROJECT_ID = os.getenv("PROJECT_ID", "<project-id>")
    LOCATION = os.getenv("LOCATION", "us-central1")
    AGENT = os.getenv("AGENT", "<reasoningEngine resource name>")
    GCS_DEST = os.getenv("GCS_DEST", "<bucket-or-prefix>")
    AUTOMATED_RUN = os.getenv("AUTOMATED_RUN", "false")
    AGENT_DISPLAY_NAME = os.getenv("AGENT_DISPLAY_NAME", AGENT.split("/")[-1])
    
    vertexai.init(project=PROJECT_ID, location=LOCATION)
    
    client = Client(
        project=PROJECT_ID,
        location=LOCATION,
        http_options=genai_types.HttpOptions(api_version="v1beta1"),
    )
    
    # Define agent_info (simplified)
    agent_info = types.evals.AgentInfo(
        agent_resource_name=AGENT,
        name="orchestrator_agent",
        # instruction + tools omitted for brevity
    )
  3. Build a multi‑turn dataset with history and run inference

    from vertexai import generative_models as genai_models
    
    multi_turn_conversations = [
        {
            "history": [
                genai_models.Content(
                    role="user",
                    parts=[genai_models.Part.from_text("First question")]
                ),
                genai_models.Content(
                    role="model",
                    parts=[genai_models.Part.from_text("First response")]
                ),
            ],
            "prompt": "Second question",
            "session_inputs": types.evals.SessionInput(
                user_id="user_1",
                state={"agent_type": "engineering", "conversation_id": "1"},
            ),
        },
        # ... a few more rows ...
    ]
    
    prompts = [conv["prompt"] for conv in multi_turn_conversations]
    histories = [conv["history"] for conv in multi_turn_conversations]
    session_inputs_list = [conv["session_inputs"] for conv in multi_turn_conversations]
    
    def content_to_dict(content):
        parts_list = []
        for part in content.parts:
            if hasattr(part, "text"):
                parts_list.append({"text": part.text})
        return {
            "role": content.role,
            "parts": parts_list,
        }
    
    histories_as_dicts = [
        [content_to_dict(content) for content in history]
        for history in histories
    ]
    
    # Create DataFrame with required columns
    df = pd.DataFrame({
        "prompt": prompts,
        "history": histories_as_dicts,      # list[dict] with role + parts[text]
        "session_inputs": session_inputs_list,
    })
    
    multi_turn_dataset = types.EvaluationDataset(eval_dataset_df=df)
    
    print(multi_turn_dataset.eval_dataset_df.columns)
    # Index(['prompt', 'history', 'session_inputs'], dtype='object')
    
    # Run inference
    eval_dataset = client.evals.run_inference(
        src=multi_turn_dataset,
        agent=AGENT,
    )
    
    # Workaround attempt: explicitly add conversation_history
    df2 = eval_dataset.eval_dataset_df.copy()
    df2["conversation_history"] = df2["history"]
    eval_dataset = types.EvaluationDataset(eval_dataset_df=df2)
    
    print(eval_dataset.eval_dataset_df.columns)
    # Index([... 'history', 'conversation_history', ...], dtype='object')
  4. Create the multi‑turn evaluation run with MULTI_TURN_GENERAL_QUALITY

    import datetime
    import time
    
    print("📊 Running multi-turn evaluation...")
    run_type = "Auto" if AUTOMATED_RUN.lower() == "true" else "Manual"
    
    evaluation_run = client.evals.create_evaluation_run(
        display_name=f"{run_type}-MultiTurn-Eval-{AGENT_DISPLAY_NAME}-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}",
        dataset=eval_dataset,
        agent_info=agent_info,
        metrics=[
            types.RubricMetric.MULTI_TURN_GENERAL_QUALITY,
        ],
        dest=GCS_DEST,
    )
    
    evaluation_run.show()
    
    # Poll to completion
    while evaluation_run.state not in {"SUCCEEDED", "FAILED", "CANCELLED"}:
        evaluation_run = client.evals.get_evaluation_run(name=evaluation_run.name)
        time.sleep(10)
    
    eval_result = client.evals.get_evaluation_run(
        name=evaluation_run.name,
        include_evaluation_items=True,
    )
    
    eval_result.show()

Stack trace / error

The evaluation run consistently fails with FAILED_PRECONDITION:

📈 Multi-Turn Evaluation Results:

Status: FAILED

Error:
code=9 details=None message='code=FAILED_PRECONDITION, message=Evaluation items failed with errors:
Item ...: INVALID_ARGUMENT: code=INVALID_ARGUMENT, message=Error rendering metric prompt template: Variable conversation_history is required but not provided.., cause=null,
Item ...: INVALID_ARGUMENT: code=INVALID_ARGUMENT, message=Error rendering metric prompt template: Variable conversation_history is required but not provided.., cause=null,
...
cause=null'

This happens even though eval_dataset.eval_dataset_df clearly contains a conversation_history column that is a copy of history (each value is a list of {"role": ..., "parts": [{"text": ...}]} dicts).
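
For reference, a quick sanity check along these lines (a minimal sketch, reusing eval_dataset from step 3) shows the column and its structure right before create_evaluation_run is called:

    # Illustrative sanity check: the conversation_history column exists and each
    # cell is a list of {"role": ..., "parts": [{"text": ...}]} dicts.
    df_check = eval_dataset.eval_dataset_df
    assert "conversation_history" in df_check.columns
    print(type(df_check["conversation_history"].iloc[0]))   # <class 'list'>
    print(df_check["conversation_history"].iloc[0][0])      # {'role': 'user', 'parts': [{'text': 'First question'}]}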

Single‑turn evaluations in the same environment work as expected.


Expected behavior

  • Either MULTI_TURN_GENERAL_QUALITY should accept the documented history field for multi‑turn datasets, or
  • If conversation_history is now required by the metric prompt template, then providing conversation_history = history in the dataset should allow the metric to run successfully instead of returning:

Error rendering metric prompt template: Variable conversation_history is required but not provided.

If a different schema is now required for multi‑turn agent evaluation (e.g. a request object wrapper or a different field name/structure), updated documentation or a validation error before running the metric would be very helpful.
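
Purely as an illustration of what I expected to work (a hypothetical row shape; the actual required schema is exactly what I'm asking to have documented):

    # Hypothetical dataset row, assuming the metric prompt template reads a
    # top-level "conversation_history" variable; the real required structure
    # may differ (e.g. a request wrapper or different field names).
    expected_row = {
        "prompt": "Second question",
        "conversation_history": [
            {"role": "user", "parts": [{"text": "First question"}]},
            {"role": "model", "parts": [{"text": "First response"}]},
        ],
    }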

Thanks!

    Labels

    api: vertex-ai (Issues related to the googleapis/python-aiplatform API.)
