# Create Google ADK EvalSet from Chat Logs

**Objective:** This notebook processes a raw chat log CSV, identifies conversation sessions, and uses the Gemini 2.5 Pro model to generate a structured Google Agent Development Kit (ADK) EvalSet JSON.

---

### Technical Workflow

1.  **Session Identification (Pandas):**
    *   Loads `chats.csv`.
    *   Identifies sessions based on `User in Session` and a 30-minute inactivity threshold.
    *   Assigns a unique `session_id` to each session.
    *   Saves the processed DataFrame to `chats_with_sessions.csv`.

2.  **ADK EvalSet Generation (google-genai):**
    *   Constructs a prompt with the sessionized data and a target ADK JSON schema.
    *   Calls the Gemini model to convert the tabular data into structured JSON.
    *   Saves the final, validated output to `gemini_response.json`.

### I/O
*   **Input:** `chats.csv`
*   **Output:** `gemini_response.json`

Disclaimer: This is not an official Google service/product and is intended for reference purposes only. Do not use in a production environment.

Questions? mateuswagner@google.com

In [None]:
!pip install -U google-genai pandas

In [None]:
from google import genai
from google.genai import types
import pandas as pd
import random
from datetime import timedelta
import json


### **Define Model Configuration and Prompts for EvalSet Generation**

This cell sets up the configuration for a `gemini-2.5-pro` model. It defines two key prompts:

1.  `sys_prompt`: Sets the model's persona to be an expert in Google's Agent Development Kit (ADK).
2.  `create_evalset_prompt`: Provides detailed instructions for the model to convert agent chat logs from a pandas DataFrame into a structured JSON file compliant with the ADK `EvalSet` schema. This format is used for evaluating the performance of AI agents.

It also sets the model name to `gemini-2.5-pro` and turns off safety filtering (threshold `OFF`) for the four harm categories listed in `safety_settings`.

In [None]:
sys_prompt = """You are an expert developer specializing in Google's Agent Development Kit (ADK)."""

create_evalset_prompt = """
Your sole function is to convert the agent chat logs into a single, complete, and perfectly valid json string compliant with the Google ADK EvalSet schema.

Critical Requirements:

json-Only Output: Your entire response MUST be the raw json content. Do not include any explanations, introductory text, or markdown code fences (like ```json).
Strict Schema Adherence: The output MUST be a valid ADK EvalSet. Include ALL required schema fields, even if it means generating plausible placeholder data for fields not present in the source DataFrame.
Chronological Integrity: Messages within each conversation MUST be maintained in their exact original chronological order.
Content Preservation: The text from the MESSAGE column MUST be preserved verbatim. Do not summarize, alter, or rephrase any message content.
Unique ID Generation: Generate a unique identifier for the top-level eval_set_id and for every invocation_id field within the conversations.
Graceful Edge Case Handling: If the input DataFrame is empty, output a valid json structure with an empty eval_cases list. If a session is incomplete, process the turns that are present. The output must always be valid json.

Input Source:
A CSV export of a pandas DataFrame with the columns: date, session_id, User in Session, ROLE, MESSAGE.

Conversion Logic:
Session Grouping: Group the rows by session_id. Each unique session_id maps to a single eval_case object.
Turn Processing: Within each session, process messages chronologically (ordered by date) to build the conversation list. A turn consists of a user's message and all subsequent model messages before the next user input.
Role Mapping & Content Analysis:
If ROLE is 'user': Map the MESSAGE to the user_content.parts.text field.
If ROLE is 'model': Analyze the MESSAGE content to determine its type:
Tool Call: If the message indicates a function call (contains function names, parameters, etc.), structure it as an entry in the intermediate_data.tool_uses list, correctly extracting the tool name and args.
Final Response: If the message is the concluding natural language answer for that turn, map it to the final_response.parts.text field.
Metadata Population:
Use the session_id value as the eval_id for its corresponding eval_case.
Populate session_input with appropriate placeholder data (e.g., app_name: "my_agent", user_id: "test_user").
Use "role": "user" or "model"

Task:
Process the provided logs and convert it to a complete, valid ADK EvalSet json file according to all requirements and logic specified above.


Example of Evalset:
# Do note that some fields are removed for sake of making this doc readable.
{
  "eval_set_id": "eval_set_example_with_multiple_sessions",
  "name": "Eval set with multiple sessions",
  "description": "This eval set is an example that shows that an eval set can have more than one session.",
  "eval_cases": [
    {
      "eval_id": "session_01",
      "conversation": [
        {
          "invocation_id": "e-0067f6c4-ac27-4f24-81d7-3ab994c28768",
          "user_content": {
            "parts": [
              {
                "text": "What can you do?"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {

                "text": "I can roll dice of different sizes and check if numbers are prime."
              }
            ],
            "role": null
          },
          "intermediate_data": {
            "tool_uses": [],
            "intermediate_responses": []
          }
        }
      ],
      "session_input": {
        "app_name": "hello_world",
        "user_id": "user",
        "state": {}
      }
    },
    {
      "eval_id": "session_02",
      "conversation": [
        {
          "invocation_id": "e-92d34c6d-0a1b-452a-ba90-33af2838647a",
          "user_content": {
            "parts": [
              {
                "text": "Roll a 19 sided dice"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "I rolled a 17."
              }
            ],
            "role": null
          },
          "intermediate_data": {
            "tool_uses": [],
            "intermediate_responses": []
          }
        },
        {
          "invocation_id": "e-bf8549a1-2a61-4ecc-a4ee-4efbbf25a8ea",
          "user_content": {
            "parts": [
              {
                "text": "Roll a 10 sided dice twice and then check if 9 is a prime or not"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "I got 4 and 7 from the dice roll, and 9 is not a prime number.\n"
              }
            ],
            "role": null
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "id": "adk-1a3f5a01-1782-4530-949f-07cf53fc6f05",
                "args": {
                  "sides": 10
                },
                "name": "roll_die"
              },
              {
                "id": "adk-52fc3269-caaf-41c3-833d-511e454c7058",
                "args": {
                  "sides": 10
                },
                "name": "roll_die"
              },
              {
                "id": "adk-5274768e-9ec5-4915-b6cf-f5d7f0387056",
                "args": {
                  "nums": [
                    9
                  ]
                },
                "name": "check_prime"
              }
            ],
            "intermediate_responses": [
              [
                "data_processing_agent",
                [
                  {
                    "text": "I have rolled a 10 sided die twice. The first roll is 5 and the second roll is 3.\n"
                  }
                ]
              ]
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "hello_world",
        "user_id": "user",
        "state": {}
      }
    }
  ]
}

"""
model = "gemini-2.5-pro"
safety_settings = [
        types.SafetySetting(category="HARM_CATEGORY_HATE_SPEECH", threshold="OFF"),
        types.SafetySetting(category="HARM_CATEGORY_DANGEROUS_CONTENT", threshold="OFF"),
        types.SafetySetting(category="HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold="OFF"),
        types.SafetySetting(category="HARM_CATEGORY_HARASSMENT", threshold="OFF")]


### Identify and Assign Chat Session IDs
This script reads a chat log from `INPUT_FILE`, groups messages into sessions for each user, and assigns a unique `session_id`. A new session begins when the time between a user's consecutive messages exceeds the `TIME_GAP_MINUTES` threshold. The processed data, including the new `session_id` column, is saved to `OUTPUT_FILE`.

Columns:

- date: Timestamp of the message.

- User in Session: The user's name.

- ROLE: Role of the message sender (user or assistant).

- MESSAGE: The content of the message.
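
The session rule described above (a new session at a user's first message or after a gap longer than the threshold) can also be written without explicit loops. The following is a minimal sketch, not the notebook's own implementation: `assign_sessions` is a hypothetical helper, it numbers sessions sequentially instead of drawing random 8-digit IDs, and it treats a row with a missing timestamp as the start of a new session.

```python
import pandas as pd

def assign_sessions(df: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    """Label each row with a sequential session number (hypothetical helper)."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], utc=True)
    out = out.sort_values(["User in Session", "date"]).reset_index(drop=True)
    # Time since the same user's previous message; NaT for each user's first row.
    gap = out.groupby("User in Session")["date"].diff()
    # A session starts at a user's first message or after a long pause.
    starts = gap.isna() | (gap > pd.Timedelta(minutes=gap_minutes))
    # The running count of session starts gives a unique number per session.
    out["session_id"] = starts.cumsum()
    return out

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "date": ["2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 11:00", "2024-01-01 10:02"],
    "User in Session": ["ana", "ana", "ana", "bob"],
    "ROLE": ["user", "model", "user", "user"],
    "MESSAGE": ["hi", "hello", "back again", "hey"],
})
labeled = assign_sessions(sample)
print(labeled[["User in Session", "date", "session_id"]])
```

Because the IDs come from a cumulative sum over the sorted rows, they are unique across users by construction, so no collision check is needed.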

In [None]:
# Configuration
INPUT_FILE = "chats.csv"
OUTPUT_FILE = "chats_with_sessions.csv"
TIME_GAP_MINUTES = 30
random.seed(42)

def process_chat_sessions(file_path, time_gap_minutes=30):
    """Process chat data to identify sessions and assign unique IDs."""
    
    # Load and validate data
    try:
        df = pd.read_csv(file_path)
        required_cols = ['date', 'User in Session', 'ROLE', 'MESSAGE']
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        if df.empty:
            df['session_id'] = pd.Series(dtype='int64')
            return df
            
        print(f"Loaded {len(df)} rows")
    except Exception as e:
        print(f"Error loading data: {e}")
        return None
    
    # Preprocess data
    df['date'] = pd.to_datetime(df['date'], utc=True)
    df = df.sort_values(['User in Session', 'date']).reset_index(drop=True)
    
    # Initialize session tracking
    session_ids = []
    used_ids = set()
    
    # Process each user's messages
    for user, user_group in df.groupby('User in Session'):
        user_group = user_group.sort_values('date').reset_index(drop=True)
        time_diffs = user_group['date'].diff()
        current_session = None
        
        for idx in range(len(user_group)):
            # Start new session if: first message OR time gap > threshold
            if idx == 0 or (pd.notna(time_diffs.iloc[idx]) and 
                           time_diffs.iloc[idx] > timedelta(minutes=time_gap_minutes)):
                # Generate unique 8-digit session ID
                while True:
                    current_session = random.randint(10000000, 99999999)
                    if current_session not in used_ids:
                        used_ids.add(current_session)
                        break
            
            session_ids.append(current_session)
    
    # Add session_id column after date
    df['session_id'] = session_ids
    cols = list(df.columns)
    cols.remove('session_id')
    date_idx = cols.index('date')
    cols.insert(date_idx + 1, 'session_id')
    df = df[cols]
    
    return df

# Execute processing
try:
    result_df = process_chat_sessions(INPUT_FILE, TIME_GAP_MINUTES)
    
    if result_df is not None:
        # Save results
        result_df.to_csv(OUTPUT_FILE, index=False)
        
        # Display summary
        print(f"\nProcessing completed!")
        print(f"Sessions identified: {result_df['session_id'].nunique()}")
        print(f"Users processed: {result_df['User in Session'].nunique()}")
        print(f"Output saved to: {OUTPUT_FILE}")
        
        # Show sample
        print(f"\nSample output:")
        print(result_df[['date', 'session_id', 'User in Session', 'ROLE', 'MESSAGE']].head(10))
        
        # Validation
        sessions_per_user = result_df.groupby('session_id')['User in Session'].nunique()
        if (sessions_per_user == 1).all():
            print("Validation passed: All sessions belong to single users")
        else:
            print("Warning: Some sessions span multiple users")
            
except Exception as e:
    print(f"Error: {e}")


### Evaluation Dataset Schema
This JSON schema defines the structure for a model evaluation dataset. It contains a set of evaluation cases (`eval_cases`), where each case captures a complete multi-turn `conversation` plus its `session_input`. Each turn in the conversation includes the user's input (`user_content`), the model's `final_response`, and `intermediate_data` holding any `tool_uses` and `intermediate_responses`.
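
As an illustration of the required fields in this structure, the sketch below walks a parsed EvalSet dictionary and reports anything missing. It is an assumption-laden example rather than part of the ADK: `find_evalset_problems` and the sample data are hypothetical, and the check covers only the presence of required keys, not their types.

```python
# Required keys at each level, taken from the "required" lists in the schema.
REQUIRED_TOP = ("eval_set_id", "name", "description", "eval_cases")
REQUIRED_CASE = ("eval_id", "conversation", "session_input")
REQUIRED_TURN = ("invocation_id", "user_content", "final_response", "intermediate_data")

def find_evalset_problems(eval_set: dict) -> list:
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = [f"missing top-level field: {k}" for k in REQUIRED_TOP if k not in eval_set]
    for case in eval_set.get("eval_cases", []):
        eval_id = case.get("eval_id", "<no eval_id>")
        problems += [f"{eval_id}: missing {k}" for k in REQUIRED_CASE if k not in case]
        for i, turn in enumerate(case.get("conversation", [])):
            problems += [f"{eval_id} turn {i}: missing {k}" for k in REQUIRED_TURN if k not in turn]
    return problems

# Made-up sample: the second turn deliberately lacks two required fields.
sample_eval_set = {
    "eval_set_id": "demo", "name": "Demo", "description": "Two-turn sample",
    "eval_cases": [{
        "eval_id": "session_01",
        "session_input": {"app_name": "my_agent", "user_id": "test_user", "state": {}},
        "conversation": [
            {"invocation_id": "e-1",
             "user_content": {"parts": [{"text": "hi"}], "role": "user"},
             "final_response": {"parts": [{"text": "hello"}], "role": "model"},
             "intermediate_data": {"tool_uses": [], "intermediate_responses": []}},
            {"invocation_id": "e-2",
             "user_content": {"parts": [{"text": "bye"}], "role": "user"}},
        ],
    }],
}
problems = find_evalset_problems(sample_eval_set)
print(problems)
```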

In [None]:
response_schema = {
  "type": "OBJECT",
  "properties": {
    "eval_set_id": {"type": "STRING"},
    "name": {"type": "STRING"},
    "description": {"type": "STRING"},
    "eval_cases": {
      "type": "ARRAY",
      "items": {
        "type": "OBJECT",
        "properties": {
          "eval_id": {"type": "STRING"},
          "conversation": {
            "type": "ARRAY",
            "items": {
              "type": "OBJECT",
              "properties": {
                "invocation_id": {"type": "STRING"},
                "user_content": {
                  "type": "OBJECT",
                  "properties": {
                    "parts": {
                      "type": "ARRAY",
                      "items": {
                        "type": "OBJECT",
                        "properties": {
                          "text": {"type": "STRING"}
                        },
                        "required": ["text"]
                      }
                    },
                    "role": {"type": "STRING"}
                  },
                  "required": ["parts", "role"]
                },
                "final_response": {
                  "type": "OBJECT",
                  "properties": {
                    "parts": {
                      "type": "ARRAY",
                      "items": {
                        "type": "OBJECT",
                        "properties": {
                          "text": {"type": "STRING"}
                        },
                        "required": ["text"]
                      }
                    },
                    "role": {"type": "STRING"}
                  },
                  "required": ["parts"]
                },
                "intermediate_data": {
                  "type": "OBJECT",
                  "properties": {
                    "tool_uses": {
                      "type": "ARRAY",
                      "items": {
                        "type": "OBJECT",
                        "properties": {
                          "id": {"type": "STRING"},
                          "args": {"type": "OBJECT"},
                          "name": {"type": "STRING"}
                        },
                        "required": ["id", "args", "name"]
                      }
                    },
                    "intermediate_responses": {
                      "type": "ARRAY",
                      "items": {
                        "type": "ARRAY",
                        "items": {"type": "STRING"}
                      }
                    }
                  },
                  "required": ["tool_uses", "intermediate_responses"]
                }
              },
              "required": ["invocation_id", "user_content", "final_response", "intermediate_data"]
            }
          },
          "session_input": {
            "type": "OBJECT",
            "properties": {
              "app_name": {"type": "STRING"},
              "user_id": {"type": "STRING"},
              "state": {"type": "OBJECT"}
            },
            "required": ["app_name", "user_id", "state"]
          }
        },
        "required": ["eval_id", "conversation", "session_input"]
      }
    }
  },
  "required": ["eval_set_id", "name", "description", "eval_cases"]
}

In [None]:
# Init client
client = genai.Client(vertexai=True, location="global", project="matt-demos")  # TODO: replace with your own Google Cloud project ID

### Calculate & Verify Token Count
This cell calculates the total token count for the prompt, CSV data, and system prompt to ensure the combined input is within the model's context limit. It provides a detailed breakdown and a final validation check.
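
The budget check itself is simple arithmetic, and can be tried offline before calling the API. The sketch below is a rough illustration under stated assumptions: `estimate_tokens` and `context_usage` are hypothetical helpers, the three-characters-per-token ratio is only the coarse heuristic this notebook uses as a fallback, and the limit passed in is a made-up number rather than a real model limit.

```python
def estimate_tokens(text: str, chars_per_token: int = 3) -> int:
    """Coarse token estimate from character count (heuristic, not exact)."""
    return len(text) // chars_per_token

def context_usage(parts: dict, context_limit: int) -> tuple:
    """Return (total estimated tokens, fraction of limit used, fits flag)."""
    total = sum(estimate_tokens(text) for text in parts.values())
    return total, total / context_limit, total <= context_limit

# Made-up inputs standing in for the prompt, CSV data, and system prompt.
parts = {"prompt": "x" * 3000, "csv": "y" * 9000, "system": "z" * 300}
total, fraction, fits = context_usage(parts, context_limit=10_000)
print(f"{total:,} estimated tokens, {fraction:.1%} of the limit, fits={fits}")
```

An estimate like this is only a first filter; the API's own token count remains the number to trust when the input is close to the limit.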

In [None]:
# Read the CSV file
df = pd.read_csv('chats_with_sessions.csv')
csv_content = df.to_csv(index=False)

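# Count tokens for each part
def count_tokens_safe(text, model_name):
    """Count tokens via the API, falling back to a rough character-based estimate."""
    try:
        # google-genai exposes token counting on the client (there is no GenerativeModel class)
        response = client.models.count_tokens(model=model_name, contents=text)
        return response.total_tokens
    except Exception as e:
        # Fallback to character estimation if the API call fails
        print(f"Token counting failed ({e}); using a character-based estimate")
        return len(str(text)) // 3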

# Token counting
prompt_tokens = count_tokens_safe(create_evalset_prompt, model_name=model)
csv_tokens = count_tokens_safe(csv_content, model_name=model)
sys_prompt_tokens = count_tokens_safe(sys_prompt, model_name=model)
total_input_tokens = prompt_tokens + csv_tokens + sys_prompt_tokens

print(f"Token Analysis:")
print(f"Prompt tokens:      {prompt_tokens:8,}")
print(f"CSV data tokens:    {csv_tokens:8,}")
print(f"System prompt:      {sys_prompt_tokens:8,}")
print(f"Total input tokens: {total_input_tokens:8,}")
CONTEXT_LIMIT = 1_048_576  # Input token limit for gemini-2.5-pro; adjust for other models
print(f"Context limit:      {CONTEXT_LIMIT:8,}")
print(f"Usage:              {(total_input_tokens/CONTEXT_LIMIT)*100:7.1f}%")

# Check if within limits
if total_input_tokens > CONTEXT_LIMIT:
    print("WARNING: Exceeds context limit!")
else:
    print("OK: Within context limits")

### **Configure Model for Structured JSON Output**

This cell prepares and configures a request to a generative AI model, specifically instructing it to return a response in a structured JSON format.

1.  **Load and Prepare Data**: It reuses `csv_content`, the string form of `chats_with_sessions.csv` that the previous cell loaded through a pandas DataFrame.

2.  **Construct the Prompt**: It creates the full prompt (`contents`) that will be sent to the model. This prompt combines a set of instructions (`create_evalset_prompt`) with the chat log data.

3.  **Define Generation Parameters**: It sets up a detailed `GenerateContentConfig` to control the model's behavior:
    * Sets `temperature` to 0 and fixes the `seed` for more deterministic and less random outputs.
    * Provides the model with a `GoogleSearch` tool.
    * Assigns a `system_instruction` to guide the model's overall function.
    * **Crucially, it specifies a `response_schema` and sets the `response_mime_type` to `"application/json"`, which forces the model to generate its output in a predefined JSON structure.**

In [None]:
contents = [
    types.Content(role="user", parts=[
        types.Part.from_text(text=create_evalset_prompt),
        types.Part.from_text(text=f"\nCHAT LOGS:\n{csv_content}")]
                 )
]
# Enable tools
tools = [types.Tool(google_search=types.GoogleSearch()),]

# Config
generate_content_config = types.GenerateContentConfig(
  temperature = 0,
  top_p = 0.95,
  seed = 1000,
  max_output_tokens = 65535,
  safety_settings = safety_settings,
  tools = tools,
  system_instruction=[
    types.Part.from_text(text=sys_prompt)],
  thinking_config=types.ThinkingConfig(thinking_budget=-1,),
  response_schema=response_schema,
  response_mime_type="application/json"  # Ensure JSON output
)


In [None]:
# Accumulate all chunks. 
full_response = ""

for chunk in client.models.generate_content_stream(
    model=model,
    contents=contents,
    config=generate_content_config,
):
    if not chunk.candidates or not chunk.candidates[0].content or not chunk.candidates[0].content.parts:
        continue
    #print(chunk.text, end="")
    full_response += chunk.text or ""  # chunk.text can be None when a chunk carries no text

# Save to JSON file
try:
    # Parse the response as JSON to validate it
    response_json = json.loads(full_response)
    
    # Save to file
    with open('gemini_response.json', 'w', encoding='utf-8') as f:
        json.dump(response_json, f, indent=2, ensure_ascii=False)
    
    print(f"\nResponse saved to gemini_response.json")
    
except json.JSONDecodeError as e:
    print(f"\nError parsing JSON: {e}")
    # Save as raw text if JSON parsing fails
    with open('gemini_response_raw.txt', 'w', encoding='utf-8') as f:
        f.write(full_response)
    print("Raw response saved to gemini_response_raw.txt")
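
Parsing with `json.loads` confirms only that the output is valid JSON. The prompt also requires message text to be preserved verbatim, which can be spot-checked afterwards. The sketch below is one possible check under stated assumptions: `collect_evalset_texts` and `missing_messages` are hypothetical helpers, the sample data is made up, and messages that the model restructured into `tool_uses` entries will show up as missing by design.

```python
def collect_evalset_texts(eval_set: dict) -> set:
    """Gather every user and final-response text found in an EvalSet dict."""
    texts = set()
    for case in eval_set.get("eval_cases", []):
        for turn in case.get("conversation", []):
            for section in ("user_content", "final_response"):
                for part in (turn.get(section) or {}).get("parts") or []:
                    if part.get("text") is not None:
                        texts.add(part["text"])
    return texts

def missing_messages(messages: list, eval_set: dict) -> list:
    """Return source messages that do not appear verbatim in the EvalSet."""
    texts = collect_evalset_texts(eval_set)
    return [m for m in messages if m not in texts]

# Made-up sample: the third message was a tool call, so it is not a text part.
source_messages = ["hi", "hello", "call roll_die(6)"]
generated = {
    "eval_cases": [{
        "eval_id": "session_01",
        "conversation": [{
            "invocation_id": "e-1",
            "user_content": {"parts": [{"text": "hi"}], "role": "user"},
            "final_response": {"parts": [{"text": "hello"}], "role": "model"},
            "intermediate_data": {
                "tool_uses": [{"id": "adk-1", "name": "roll_die", "args": {"sides": 6}}],
                "intermediate_responses": [],
            },
        }],
    }],
}
not_found = missing_messages(source_messages, generated)
print(not_found)
```

In the notebook, `source_messages` would correspond to `df['MESSAGE'].tolist()` and `generated` to `response_json`; anything reported here is worth a manual look rather than an automatic failure.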