Gen AI eval: MULTI_TURN_GENERAL_QUALITY fails with "conversation_history is required but not provided" even when present #6317

@zacaries-benamiar-pm

Description

Summary

Gen AI eval: MULTI_TURN_GENERAL_QUALITY fails with "conversation_history is required but not provided" even when the column is present in the dataset


Environment details

  • OS type and version: Colab / Colab Enterprise (Linux, managed runtime)
  • Python version: Python 3.10 (from Colab)
  • pip version: pip 24.x (from Colab)
  • google-cloud-aiplatform version: 1.134.0
    (also reproducible on 1.135.0)
  • google-genai version: 1.61.0
  • API usage: vertexai.Client(..., http_options=genai_types.HttpOptions(api_version="v1beta1"))

Steps to reproduce

  1. Install the SDK in a fresh Colab / Colab Enterprise runtime

    %pip install -U -q "google-cloud-aiplatform[evaluation]==1.134.0"
    
    import google.cloud.aiplatform as aiplatform
    import google.genai as genai
    print("aiplatform:", aiplatform.__version__)   # 1.134.0
    print("genai:", genai.__version__)            # 1.61.0
  2. Initialize Vertex and define agent_info

    import os
    import vertexai
    from google.cloud import storage
    from google.genai import types as genai_types
    from vertexai import Client
    from vertexai import types
    import pandas as pd
    
    PROJECT_ID = os.getenv("PROJECT_ID", "<project-id>")
    LOCATION = os.getenv("LOCATION", "us-central1")
    AGENT = os.getenv("AGENT", "<reasoningEngine resource name>")
    GCS_DEST = os.getenv("GCS_DEST", "<bucket-or-prefix>")
    AUTOMATED_RUN = os.getenv("AUTOMATED_RUN", "false")
    AGENT_DISPLAY_NAME = os.getenv("AGENT_DISPLAY_NAME", AGENT.split("/")[-1])
    
    vertexai.init(project=PROJECT_ID, location=LOCATION)
    
    client = Client(
        project=PROJECT_ID,
        location=LOCATION,
        http_options=genai_types.HttpOptions(api_version="v1beta1"),
    )
    
    # Define agent_info (simplified)
    agent_info = types.evals.AgentInfo(
        agent_resource_name=AGENT,
        name="orchestrator_agent",
        # instruction + tools omitted for brevity
    )
  3. Build a multi‑turn dataset with history and run inference

    from vertexai import generative_models as genai_models
    
    multi_turn_conversations = [
        {
            "history": [
                genai_models.Content(
                    role="user",
                    parts=[genai_models.Part.from_text("First question")]
                ),
                genai_models.Content(
                    role="model",
                    parts=[genai_models.Part.from_text("First response")]
                ),
            ],
            "prompt": "Second question",
            "session_inputs": types.evals.SessionInput(
                user_id="user_1",
                state={"agent_type": "engineering", "conversation_id": "1"},
            ),
        },
        # ... a few more rows ...
    ]
    
    prompts = [conv["prompt"] for conv in multi_turn_conversations]
    histories = [conv["history"] for conv in multi_turn_conversations]
    session_inputs_list = [conv["session_inputs"] for conv in multi_turn_conversations]
    
    def content_to_dict(content):
        parts_list = []
        for part in content.parts:
            if hasattr(part, "text"):
                parts_list.append({"text": part.text})
        return {
            "role": content.role,
            "parts": parts_list,
        }
    
    histories_as_dicts = [
        [content_to_dict(content) for content in history]
        for history in histories
    ]
    
    # Create DataFrame with required columns
    df = pd.DataFrame({
        "prompt": prompts,
        "history": histories_as_dicts,      # list[dict] with role + parts[text]
        "session_inputs": session_inputs_list,
    })
    
    multi_turn_dataset = types.EvaluationDataset(eval_dataset_df=df)
    
    print(multi_turn_dataset.eval_dataset_df.columns)
    # Index(['prompt', 'history', 'session_inputs'], dtype='object')
    
    # Run inference
    eval_dataset = client.evals.run_inference(
        src=multi_turn_dataset,
        agent=AGENT,
    )
    
    # Workaround attempt: explicitly add conversation_history
    df2 = eval_dataset.eval_dataset_df.copy()
    df2["conversation_history"] = df2["history"]
    eval_dataset = types.EvaluationDataset(eval_dataset_df=df2)
    
    print(eval_dataset.eval_dataset_df.columns)
    # Index([... 'history', 'conversation_history', ...], dtype='object')
  4. Create the multi‑turn evaluation run with MULTI_TURN_GENERAL_QUALITY

    import datetime
    import time
    
    print("📊 Running multi-turn evaluation...")
    run_type = "Auto" if AUTOMATED_RUN.lower() == "true" else "Manual"
    
    evaluation_run = client.evals.create_evaluation_run(
        display_name=f"{run_type}-MultiTurn-Eval-{AGENT_DISPLAY_NAME}-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}",
        dataset=eval_dataset,
        agent_info=agent_info,
        metrics=[
            types.RubricMetric.MULTI_TURN_GENERAL_QUALITY,
        ],
        dest=GCS_DEST,
    )
    
    evaluation_run.show()
    
    # Poll to completion
    while evaluation_run.state not in {"SUCCEEDED", "FAILED", "CANCELLED"}:
        evaluation_run = client.evals.get_evaluation_run(name=evaluation_run.name)
        time.sleep(10)
    
    eval_result = client.evals.get_evaluation_run(
        name=evaluation_run.name,
        include_evaluation_items=True,
    )
    
    eval_result.show()

Stack trace / error

The evaluation run consistently fails with FAILED_PRECONDITION:

📈 Multi-Turn Evaluation Results:

Status: FAILED

Error:
code=9 details=None message='code=FAILED_PRECONDITION, message=Evaluation items failed with errors:
Item ...: INVALID_ARGUMENT: code=INVALID_ARGUMENT, message=Error rendering metric prompt template: Variable conversation_history is required but not provided.., cause=null,
Item ...: INVALID_ARGUMENT: code=INVALID_ARGUMENT, message=Error rendering metric prompt template: Variable conversation_history is required but not provided.., cause=null,
...
cause=null'

This happens even though eval_dataset.eval_dataset_df clearly contains a conversation_history column that is a copy of history (each value is a list of {"role": ..., "parts": [{"text": ...}]} dicts).
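
For reference, a quick sanity check along these lines (a minimal sketch, reusing eval_dataset from step 3) shows the column and its structure right before create_evaluation_run is called:

    # Illustrative sanity check: the conversation_history column exists and each
    # cell is a list of {"role": ..., "parts": [{"text": ...}]} dicts.
    df_check = eval_dataset.eval_dataset_df
    assert "conversation_history" in df_check.columns
    print(type(df_check["conversation_history"].iloc[0]))   # <class 'list'>
    print(df_check["conversation_history"].iloc[0][0])      # {'role': 'user', 'parts': [{'text': 'First question'}]}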

Single‑turn evaluations in the same environment work as expected.


Expected behavior

  • Either MULTI_TURN_GENERAL_QUALITY should accept the documented history field for multi‑turn datasets, or
  • If conversation_history is now required by the metric prompt template, then providing conversation_history = history in the dataset should allow the metric to run successfully instead of returning:

Error rendering metric prompt template: Variable conversation_history is required but not provided.

If a different schema is now required for multi‑turn agent evaluation (e.g. a request object wrapper or a different field name/structure), updated documentation or a validation error before running the metric would be very helpful.
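
Purely as an illustration of what I expected to work (a hypothetical row shape; the actual required schema is exactly what I'm asking to have documented):

    # Hypothetical dataset row, assuming the metric prompt template reads a
    # top-level "conversation_history" variable; the real required structure
    # may differ (e.g. a request wrapper or different field names).
    expected_row = {
        "prompt": "Second question",
        "conversation_history": [
            {"role": "user", "parts": [{"text": "First question"}]},
            {"role": "model", "parts": [{"text": "First response"}]},
        ],
    }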

Thanks!

    Labels

    api: vertex-ai (Issues related to the googleapis/python-aiplatform API.)
