# Data and Eval Example

In this Colab we show how to use the `tool_simulation` library to create a finetuning dataset or to evaluate a given model.

For our example we will create a dataset related to a healthcare app with an AI form filler feature. To do this, we will set up a few patient scenarios and use them to create conversations with an agent.

We will also show an example of how to evaluate a model that we've trained.

In [None]:
!pip install ai-edge-tool-simulation

In [None]:
import random
import json
from google.genai import types

import tool_simulation.stages.utils.margin as margin
from tool_simulation.core.aistudio_backend import AIStudioModel
from tool_simulation.core.aistudio_prompt_builder import AIStudioPromptBuilder, ChunkKind
from tool_simulation.core.tool2str import tool2str
from tool_simulation.stages.data_generation import seed_data
from tool_simulation.stages.function_calling.replier_prompt_builder import ReplierPromptBuilder
from tool_simulation.core.str2call import FunctionCall
from tool_simulation.stages.function_calling.session import SyntheticSession, FunctionReply
from tool_simulation.stages.function_calling import datagen_prompt_builder
from tool_simulation.stages.function_calling import function_calling_episode
from tool_simulation.stages.exports.export_hf_chat import export_hf_chat


## Part 1: Data Generation
Here we will generate patient scenarios and use them to create conversations with an agent.

In [None]:
API_KEY = "YOUR_API_KEY"
MODEL = "gemini-2.5-flash-preview-04-17"
# How many examples to use when generating scenarios
SUBSAMPLE_SIZE = 3
# How many scenario generation steps to make
SIMULATION_STEPS = 5

### Scenario Generation
We will create a set of sample scenarios and iteratively generate more, each time taking a random sample from the existing pool to use as examples in the prompt.
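
The prompts in this notebook are written with a margin convention: a line starting with `|` begins a new line of the prompt, and a line starting with `>` continues the previous one. The sketch below is a hypothetical re-implementation inferred from how the prompts are laid out; it is not the source of `margin.trim_margin`, and the real function may differ in details such as spacing or trailing newlines. The name `trim_margin_sketch` is illustrative only.

```python
def trim_margin_sketch(text: str) -> str:
  """Assumed behaviour: '|' starts a new output line, '>' continues the last."""
  lines = []
  for raw in text.splitlines():
    stripped = raw.lstrip()
    if stripped.startswith("|"):
      lines.append(stripped[1:])
    elif stripped.startswith(">"):
      if lines:
        # Join the continuation onto the previous line with a single space.
        lines[-1] = lines[-1] + " " + stripped[1:]
      else:
        lines.append(stripped[1:])
    # Lines without a marker (such as the trailing indentation) are ignored.
  return "\n".join(lines)


demo = trim_margin_sketch("""\
    |Your job is to generate patient scenarios. We are trying to model a
    >health intake form.
    |List only the examples.
    """)
print(demo)
```

Under these assumptions the first two source lines collapse into one prompt line, which keeps long sentences readable in the notebook without inserting line breaks into the prompt.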

In [None]:
def _scenario_prompt(example_str: str) -> str:
  return margin.trim_margin(f"""\
    |Your job is to generate patient scenarios. We are trying to model a
    >health intake form.
    |Please generate short patient scenarios/biographies that include things
    >like name, birth date, health issues, job, marital status, but also
    >general biographical details like city of birth, favorite
    >things, activity levels etc.
    |The scenarios should be formatted in plain text and be in the first person.
    |Please vary the sentence structure and syntax when generating everything.
    |Here are a few examples (don't forget the <END> tags after each scenario):
    |
    |{example_str}
    |
    |Please generate 3 to 5 new scenarios in the above
    >format (but don't repeat them).
    |List only the examples, do not add any other text or enumeration.
    """)

In [None]:
# We recommend having some seed data. For this demo we will write a few
# examples by hand.
example_pool = []
example_pool.append(
    "My name is John Doe, I am a car mechanic suffering from back pain. I"
    " am married and I am 35. I am not super active.<END>"
)
example_pool.append(
    "I'm Aisha Khan, an 18-year-old student dealing with anxiety. I am"
    " currently single.<END>"
)
example_pool.append(
    "Hi I am Kenji Tanaka. I work as a high school teacher and have type 2"
    " diabetes. I'm 25 and divorced. I am from Tokyo originally.<END>"
)
example_pool.append(
    "Call me Chloe Dubois. I'm a retired librarian, 68 years old, widowed,"
    " and I manage my arthritis daily.<END>"
)

In [None]:
model = AIStudioModel(api_key=API_KEY, model_name=MODEL)

In [None]:
for _ in range(SIMULATION_STEPS):
  try:
    random.shuffle(example_pool)
    examples = min(SUBSAMPLE_SIZE, len(example_pool))
    example_str = "\n".join(example_pool[:examples])
    # We create a prompt builder object to hold the query
    pb = AIStudioPromptBuilder()
    pb.user_turn(_scenario_prompt(example_str))
    # `generate_seed_data` queries the model, splits the reply on the
    # <END> delimiter, drops empty pieces, and re-appends the "<END>"
    # token to each remaining piece.
    extra_examples = seed_data.generate_seed_data(
        pb,
        model,
        delimiter="<END>",
        filter_fn=lambda x: x.strip(),
        post_process_fn=lambda x: x.strip()
        .replace("\n", "")
        .replace("<END>", "")
        + "<END>",
    )
    example_pool.extend(extra_examples)
  except Exception as e:
    print(e)
    break

scenarios = [x.replace("<END>", "") for x in example_pool]
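
The comments in the loop describe what happens to the model reply: it is split on the delimiter, empty pieces are dropped, and each remaining piece is post-processed. A minimal offline sketch of that pipeline follows, using a canned reply instead of a model call. `split_and_clean` and `raw_reply` are hypothetical names, and the library's own implementation may differ.

```python
from typing import Callable, List


def split_and_clean(
    reply: str,
    delimiter: str,
    filter_fn: Callable[[str], object],
    post_process_fn: Callable[[str], str],
) -> List[str]:
  """Splits a reply on `delimiter`, keeps truthy pieces, then post-processes."""
  pieces = reply.split(delimiter)
  kept = [p for p in pieces if filter_fn(p)]
  return [post_process_fn(p) for p in kept]


# Canned stand-in for a model reply containing two scenarios.
raw_reply = (
    "I'm Maria Lopez, a 42-year-old nurse with asthma.<END>\n"
    "Call me Tom Reed. I am 57, married, and have hypertension.<END>\n"
)
cleaned = split_and_clean(
    raw_reply,
    delimiter="<END>",
    filter_fn=lambda x: x.strip(),
    post_process_fn=lambda x: x.strip()
    .replace("\n", "")
    .replace("<END>", "")
    + "<END>",
)
for scenario in cleaned:
  print(scenario)
```

The trailing newline after the last `<END>` produces a whitespace-only piece, which is why the filter step matters.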

### Conversation Generation
Now we can run the data generation loop. To start, we will define an API schema as well as a simple environment (for simplicity, it just returns an OK signal). In addition, we will define a parser function. For this example we assume that the API we are calling performs sufficient checks and returns a response in a structured form.

In [None]:
health_conditions_options = [
  "Hypertension",
  "Diabetes",
  "Asthma",
  "Arthritis",
  "Migraine",
  "Depression",
  "Kidney Disease",
  "Anxiety",
  "Allergies",
  "Heart Disease",
]

marital_status_options = [
  "Single",
  "Married",
  "Divorced",
  "Widowed",
  "Separated",
  "Domestic Partnership",
]
sex_options = ["Male", "Female"]




my_demo_tool = types.Tool(
    function_declarations=[
        types.FunctionDeclaration(
            name="record_personal_information",
            description="Records the user's personal information.",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "first_name": types.Schema(
                        type=types.Type.STRING,
                        description="The user's first name.",
                    ),
                    "last_name": types.Schema(
                        type=types.Type.STRING,
                        description="The user's last name.",
                    ),
                    "date_of_birth": types.Schema(
                        type=types.Type.STRING,
                        description=(
                            "The user's date of birth in MM/DD/YYYY format."
                        ),
                    ),
                    "occupation": types.Schema(
                        type=types.Type.STRING,
                        description="The user's occupation.",
                    ),
                },
                required=[
                    "first_name",
                    "last_name",
                    "date_of_birth",
                    "occupation",
                ],
            ),
        ),
        types.FunctionDeclaration(
            name="record_demographics",
            description="Records the user's sex and marital status.",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "sex": types.Schema(
                        type=types.Type.STRING,
                        description="The user's sex.",
                        enum=sex_options,
                    ),
                    "marital_status": types.Schema(
                        type=types.Type.STRING,
                        description="The user's marital status.",
                        enum=marital_status_options,
                    ),
                },
                required=["sex", "marital_status"],
            ),
        ),
        types.FunctionDeclaration(
            name="record_medical_history",
            description=(
                "Records the user's past or present medical history conditions."
            ),
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "conditions": types.Schema(
                        type=types.Type.ARRAY,
                        description=(
                            "A list of medical conditions the user checks."
                        ),
                        items=types.Schema(
                            type=types.Type.STRING, enum=health_conditions_options
                        ),
                    )
                },
                required=["conditions"],
            ),
        ),
    ]
)

In [None]:
class SimpleEnvironment(SyntheticSession):
  def reply(self, function_call: FunctionCall) -> FunctionReply:
    function_call_unwrapped = json.loads(function_call.raw_string)
    reply = {"name": function_call_unwrapped["name"], "response": {"status": "ok"}}
    return FunctionReply(reply = reply, raw_reply = json.dumps(reply))
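
The environment above acknowledges every call with an OK status. If the real API's checks need to be mirrored during simulation, one option is to validate the arguments against the same option lists before replying. The sketch below is standalone and hypothetical: `validate_call`, the `ALLOWED` table, and the `"args"` key are assumptions about the call payload, not part of `tool_simulation`.

```python
import json
from typing import Any, Dict

# Assumed copy of the option lists for one of the declared functions.
ALLOWED = {
    "record_demographics": {
        "sex": {"Male", "Female"},
        "marital_status": {
            "Single", "Married", "Divorced",
            "Widowed", "Separated", "Domestic Partnership",
        },
    },
}


def validate_call(raw_call: str) -> Dict[str, Any]:
  """Returns an ok/error reply for a JSON-encoded call (hypothetical format)."""
  call = json.loads(raw_call)
  name = call.get("name")
  args = call.get("args", {})
  errors = []
  # Check every constrained field of the named function against its options.
  for field, allowed in ALLOWED.get(name, {}).items():
    value = args.get(field)
    if value not in allowed:
      errors.append(f"{field}={value!r} is not one of {sorted(allowed)}")
  status = "error" if errors else "ok"
  return {"name": name, "response": {"status": status, "errors": errors}}


good = validate_call(json.dumps({
    "name": "record_demographics",
    "args": {"sex": "Female", "marital_status": "Single"},
}))
bad = validate_call(json.dumps({
    "name": "record_demographics",
    "args": {"sex": "Female", "marital_status": "Engaged"},
}))
print(good["response"]["status"], bad["response"]["status"])
```

Returning the error list in the reply gives the function calling model something concrete to react to, which may produce more realistic recovery turns in the generated data.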

In [None]:
def parse_output(text: str) -> datagen_prompt_builder.ParseResult:
  if not text:
    return datagen_prompt_builder.ParseResult(function_call=None, forward=None)
  try:
    output_json = json.loads(text)
    function_call = FunctionCall(name=output_json["name"], args={}, raw_string=text)
    return datagen_prompt_builder.ParseResult(function_call=function_call, forward=None)
  except (json.JSONDecodeError, KeyError, TypeError):
    # Not a JSON object with a "name" field: treat the reply as plain text.
    return datagen_prompt_builder.ParseResult(function_call=None, forward=text)

In [None]:
def run_loop(scenario: str):
  # Make the replier model prompt. The replier model will pretend to be the human
  replier_pbuilder = ReplierPromptBuilder()
  replier_pbuilder.system_turn(margin.trim_margin(f"""\
      |You are using a mobile app with a voice feature attached to an agent. You
      >want to fill out a healthcare intake form which has three screens.
      |The screens are as follows:
      |Screen 1: Asks for first and last name, date of birth, and occupation
      |Screen 2: Asks for marital status ({", ".join(marital_status_options)})
      >and sex ({", ".join(sex_options)})
      |Screen 3: Asks for health conditions ({", ".join(health_conditions_options)})
      |
      |You are given the following patient scenario:
      |{scenario}
      |
      |Please talk to the agent (you should go screen by screen). First you reply
      >to the first screen, wait for its reply, then you reply to the second, etc.
      |Your messages are prefaced by the `User` role, while the agent's responses
      >are prefaced by the `Assistant` role.
      >If the agent is able to successfully complete the task, please
      >reply with STOP. If it made an error, reply with ERROR.
      |Do not ask the assistant any follow-up questions.
      |When you return your reply, do not preface it with any role, that happens
      >automatically.
      |NOTE: Only give a single reply at a time, do not complete the entire
      >conversation.
      |NOTE: Give your replies in natural language only.
      """))
  # Set models
  replier_model = AIStudioModel(api_key=API_KEY, model_name=MODEL)
  function_calling_model = AIStudioModel(api_key=API_KEY, model_name=MODEL, tools=my_demo_tool)
  function_calling_model.config.system_instruction = "Please help the user fill out an intake form. The user will give you some information and you need to call the correct functions. Only fill out what they have given you. You are only allowed to do function calls and summarize the result of the given function calls."

  # Get a scenario -> initial query
  initial_query = replier_model.query_model(replier_pbuilder.get_prompt())
  initial_query = initial_query.replace("User:", "").replace("User", "").strip()
  # Add the initial query to the user turn
  replier_pbuilder.user_turn(initial_query)

  # Create a prompt builder to represent the agent
  inner_fc_pb = AIStudioPromptBuilder()
  inner_fc_pb.user_turn(initial_query)
  datagen_pbuilder = datagen_prompt_builder.DataGenerationPromptBuilder(
    inner_prompt_builder=inner_fc_pb,
    session=SimpleEnvironment(),
    parse_fn=parse_output,
  )
  # Run episode
  try:
    result = function_calling_episode.run_function_calling_episode(
          fc_prompt_builder=datagen_pbuilder,
          replier_prompt_builder=replier_pbuilder,
          function_calling_model=function_calling_model,
          replier_model=replier_model,
          max_steps=6,
    )
  except Exception as e:
    print(e)
    return []
  # Format the conversation according to the Hammer model's format.
  hf_format = []
  for turn in result.get_state():
    for chunk in turn.content:
      if chunk.kind == ChunkKind.CONTENT:
        hf_format.append({"role": "user" if turn.role == "user" else "assistant",
                          "content": chunk.content.text})
      elif chunk.kind == ChunkKind.TOOL_CALL:
        hf_format.append({"role": "assistant",
                    "content": f"```[{chunk.content.function_call.to_json_dict()}]```"})
      elif chunk.kind == ChunkKind.TOOL_RESULT:
        hf_format.append({"role": "tool",
              "content": f"{chunk.content.function_response.to_json_dict()}"})
  return hf_format


In [None]:
conversations = []
for scenario in scenarios:
  conversation = run_loop(scenario)
  if conversation:
    conversations.append({"messages" : conversation})
  print("Done with scenario ", scenario)
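
Each entry in `conversations` has the shape `{"messages": [...]}`, a common layout for chat finetuning data. The sketch below writes such records to a JSON Lines file using only the standard library. The notebook imports `export_hf_chat`, but its signature is not shown here, so it is not used; `write_jsonl`, the output path, and the sample record are all hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def write_jsonl(records: List[Dict[str, Any]], path: str) -> int:
  """Writes one JSON object per line and returns the number of records."""
  with open(path, "w", encoding="utf-8") as f:
    for record in records:
      f.write(json.dumps(record, ensure_ascii=False) + "\n")
  return len(records)


# Illustrative record in the same shape as the entries built above.
sample = [{
    "messages": [
        {"role": "user", "content": "My name is John Doe, born 01/02/1990."},
        {"role": "assistant", "content": "Recorded your personal information."},
    ]
}]
out_path = os.path.join(tempfile.gettempdir(), "conversations_demo.jsonl")
written = write_jsonl(sample, out_path)

# Read the file back to confirm that it round-trips.
with open(out_path, encoding="utf-8") as f:
  loaded = [json.loads(line) for line in f]
print(written, loaded == sample)
```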

## Part 2: Evaluation

For this demo we will evaluate the Hammer function calling models. The evaluation loop will be similar, with the only difference being a change in the tool calling model. While not shown above, we assume that we have a separate test set we can use in the case of model training. For simplicity we will be using the same set of scenarios as above.
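
One simple way to obtain such a separate test set is to split the scenario pool before generating any conversations. A minimal sketch, assuming a fixed seed and a hypothetical `split_scenarios` helper (neither is part of the library):

```python
import random
from typing import List, Tuple


def split_scenarios(
    scenarios: List[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
  """Shuffles a copy of `scenarios` and splits it into (train, test)."""
  shuffled = list(scenarios)
  random.Random(seed).shuffle(shuffled)
  # Keep at least one test scenario whenever the pool is non-empty.
  n_test = max(1, int(len(shuffled) * test_fraction)) if shuffled else 0
  return shuffled[n_test:], shuffled[:n_test]


demo_scenarios = [f"scenario {i}" for i in range(10)]
train, test = split_scenarios(demo_scenarios)
print(len(train), len(test))
```

Splitting at the scenario level, rather than at the conversation level, keeps every conversation derived from a test scenario out of the training data.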

In [None]:
import copy
from typing import Any, Dict, List, Optional

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    PreTrainedTokenizerBase,
)

from tool_simulation.core import base_prompt_builder
from tool_simulation.core.tool2str import tool2str
from tool_simulation.core.model_instance import ModelInstance

First we will define a prompt builder that works with the Hugging Face chat format. We will use this to prompt the tool calling model. We will also define a `ModelInstance` that works with Hugging Face.

In [None]:
"""Prompt builder for Hugging Face's chat template."""

BaseTurn = base_prompt_builder.BaseTurn
ChunkKind = base_prompt_builder.ChunkKind
BasePromptBuilder = base_prompt_builder.BasePromptBuilder
BaseChunk = base_prompt_builder.BaseChunk


class HfChunk(BaseChunk[str]):
  """A chunk of text in a Hugging Face prompt."""

  def __init__(self, content: str, kind: ChunkKind = ChunkKind.CONTENT):
    super().__init__(content=content, kind=kind)

  def __str__(self) -> str:
    return self.content


class HfTurn(BaseTurn[HfChunk]):
  """A turn in a Hf prompt.

  Since the hf format is List[Dict[str, str]] we don't enforce any particular
  structure on the content. We use the abstraction already in the code but
  mock the Hf format.
  """

  def __init__(self, role: str, content: Optional[List[HfChunk]] = None):
    if content and len(content) > 1:
      raise ValueError(
          "Multiple chunks in a turn are not supported in Hugging Face format."
      )
    super().__init__(role=role, content=content)

  @property
  def content(self) -> List[HfChunk]:
    return list(self._content)

  @content.setter
  def content(self, value: List[HfChunk]) -> None:
    if len(value) > 1:
      raise ValueError(
          "Multiple chunks in a turn are not supported in Hugging Face format."
      )
    self._content = list(value)

  def add_chunk(self, chunk: HfChunk) -> None:
    if not isinstance(chunk, HfChunk):
      raise TypeError(f"Expected a HfChunk object, got {type(chunk)}")
    if len(self._content) >= 1:
      raise ValueError(
          "Multiple chunks in a turn are not supported in Hugging Face format."
      )
    self._content.append(chunk)

  @property
  def inner_content(self) -> str:
    if not self._content:
      raise ValueError("Turn has no content.")
    return str(self._content[0].content)

  def __str__(self) -> str:
    raise NotImplementedError("Prompts are generated via `get_prompt`.")

  def to_dict(self) -> Dict[str, str]:
    return {
        "role": self.role,
        "content": self.inner_content,
    }


class HfChatPromptBuilder(BasePromptBuilder[HfChunk, HfTurn, str]):
  """Builds prompts compatible with Hugging Face's apply_chat_template.

  Manages conversation history as a list of dictionaries suitable for
  tokenizer.apply_chat_template.

  The base prompt builder class is generic with respect to the
  chunk/turn/content types. For this example we subclass it with the types we
  define above. For simplicity the inner content type is str.
  """

  _tokenizer: PreTrainedTokenizerBase
  _tools: Optional[List[Dict[str, Any]]]

  _USER_ROLE = "user"
  _ASSISTANT_ROLE = "assistant"
  _TOOL_ROLE = "tool"

  def __init__(
      self,
      tokenizer: PreTrainedTokenizerBase,
      tools: Optional[List[Dict[str, Any]]] = None,
  ):

    self._tokenizer = tokenizer
    self._tools = tools
    super().__init__(turn_class=HfTurn, chunk_class=HfChunk)

  @property
  def user_role(self) -> str:
    return self._USER_ROLE

  @property
  def model_role(self) -> str:
    return self._ASSISTANT_ROLE

  @property
  def tool_role(self) -> str:
    return self._TOOL_ROLE

  def get_state(self) -> List[HfTurn]:
    return copy.deepcopy(self._state)

  def get_state_mutable(self) -> List[HfTurn]:
    return self._state

  def get_chunk(
      self, content: str, kind: ChunkKind = ChunkKind.CONTENT
  ) -> HfChunk:
    return HfChunk(content, kind=kind)

  def get_prompt(self, inference: bool = False, tokenize: bool = False) -> str:
    """Generates the prompt string using the tokenizer's chat template."""
    if self._current_turn is not None:
      raise ValueError("Cannot get the prompt while in the middle of a turn.")

    try:
      return self._tokenizer.apply_chat_template(
          [x.to_dict() for x in self._state],
          tools=self._tools,
          add_generation_prompt=inference,
          tokenize=tokenize,
      )
    except Exception as e:
      raise RuntimeError(f"Error applying chat template: {e}") from e


In [None]:
class HfModel(ModelInstance):
  def __init__(self, model, tokenizer, tools, temperature=0.3, top_k=10):
    self.model = model
    self.tokenizer = tokenizer
    self.tools = tools
    self.temperature = temperature
    self.top_k = top_k

  def query_model(self, prompt: str | base_prompt_builder.BasePromptBuilder) -> str | None:
    if isinstance(prompt, str):
      messages = [{"role": "user", "content": prompt}]
    elif isinstance(prompt, HfChatPromptBuilder):
      messages = [x.to_dict() for x in prompt.get_state()]
    else:
      # Without this branch `messages` would be undefined below.
      raise TypeError(f"Unsupported prompt type: {type(prompt)}")
    try:
      inputs = self.tokenizer.apply_chat_template(
          messages,
          tools=self.tools,
          add_generation_prompt=True,
          return_dict=True,
          return_tensors="pt",
      )
      inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
      max_tokens = 256  # Adjust as needed based on observation
      out = self.model.generate(
          **inputs,
          max_new_tokens=max_tokens,
          pad_token_id=self.tokenizer.pad_token_id,
          # temperature and top_k only take effect when sampling is enabled.
          do_sample=True,
          temperature=self.temperature,
          top_k=self.top_k,
      )

      # The output holds the prompt tokens followed by the new tokens, so
      # slice off the prompt before decoding.
      input_token_len = inputs["input_ids"].shape[1]
      generated_tokens = out[0][input_token_len:]
      generated_text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
      return generated_text
    except Exception as e:
      print(e)
      return None
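
For decoder-only models, `generate` returns the prompt ids followed by the newly generated ids, which is why `query_model` slices the output at the prompt length before decoding. The idea is sketched here with plain lists and made-up token ids; `strip_prompt_tokens` is a hypothetical helper used only for illustration.

```python
from typing import List


def strip_prompt_tokens(output_ids: List[int], prompt_len: int) -> List[int]:
  """Returns only the ids that were generated after the prompt."""
  return output_ids[prompt_len:]


prompt_ids = [101, 7592, 2088]          # Stand-in ids for the encoded prompt.
output_ids = prompt_ids + [2023, 2003]  # Prompt echoed back plus new tokens.
new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```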

### Running Evals

We can now run the eval. For simplicity we will run it on the `scenarios` from above. The eval loop is very similar to the one above. The difference is that we changed the model and added an autorater call.

In [None]:
MODEL_NAME = "MadeAgents/Hammer2.1-1.5b"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# device_map and torch_dtype are model-loading options, so they are set here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)


In [None]:
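def parse_output_eval(text: str) -> datagen_prompt_builder.ParseResult:
  if not text:
    return datagen_prompt_builder.ParseResult(function_call=None, forward=None)
  try:
    # The model may wrap its call in ``` fences; strip them before parsing.
    output_json = json.loads(text.replace("```", "").strip())
    # Hammer-style output may be a list of calls, e.g. [{...}]; use the first.
    if isinstance(output_json, list):
      output_json = output_json[0]
    # Re-serialize the single call so that `SimpleEnvironment.reply` can parse
    # `raw_string` with json.loads.
    function_call = FunctionCall(
        name=output_json["name"], args={}, raw_string=json.dumps(output_json)
    )
    return datagen_prompt_builder.ParseResult(function_call=function_call, forward=None)
  except (json.JSONDecodeError, KeyError, IndexError, TypeError):
    # Anything that is not a well-formed call is treated as plain text.
    return datagen_prompt_builder.ParseResult(function_call=None, forward=text)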

In [None]:
def run_eval_loop(scenario: str):
  # Make the replier model prompt. The replier model will pretend to be the human
  replier_pbuilder = ReplierPromptBuilder()
  replier_pbuilder.system_turn(margin.trim_margin(f"""\
      |You are using a mobile app with a voice feature attached to an agent. You
      >want to fill out a healthcare intake form which has three screens.
      |The screens are as follows:
      |Screen 1: Asks for first and last name, date of birth, and occupation
      |Screen 2: Asks for marital status ({", ".join(marital_status_options)})
      >and sex ({", ".join(sex_options)})
      |Screen 3: Asks for health conditions ({", ".join(health_conditions_options)})
      |
      |You are given the following patient scenario:
      |{scenario}
      |
      |Please talk to the agent (you should go screen by screen). First you reply
      >to the first screen, wait for its reply, then you reply to the second, etc.
      |Your messages are prefaced by the `User` role, while the agent's responses
      >are prefaced by the `Assistant` role.
      >If the agent is able to successfully complete the task, please
      >reply with STOP. If it made an error, reply with ERROR.
      |Do not ask the assistant any follow-up questions.
      |When you return your reply, do not preface it with any role, that happens
      >automatically.
      |NOTE: Only give a single reply at a time, do not complete the entire
      >conversation.
      |NOTE: Give your replies in natural language only.
      """))
  # Set models
  replier_model = AIStudioModel(api_key=API_KEY, model_name=MODEL)
  function_calling_model = HfModel(model=model, tokenizer=tokenizer, tools=json.loads(tool2str(my_demo_tool)))
  validation_model = AIStudioModel(api_key=API_KEY, model_name=MODEL)

  # Get a scenario -> initial query
  initial_query = replier_model.query_model(replier_pbuilder.get_prompt())
  initial_query = initial_query.replace("User:", "").replace("User", "").strip()
  # Add the initial query to the user turn
  replier_pbuilder.user_turn(initial_query)

  # Create a prompt builder to represent the agent
  inner_hf_pb = HfChatPromptBuilder(tokenizer=tokenizer, tools=json.loads(tool2str(my_demo_tool)))
  inner_hf_pb.begin_turn(inner_hf_pb.user_role)
  inner_hf_pb.add_content(inner_hf_pb.get_chunk(initial_query))
  inner_hf_pb.end_turn()
  datagen_pbuilder = datagen_prompt_builder.DataGenerationPromptBuilder(
    inner_prompt_builder=inner_hf_pb,
    session=SimpleEnvironment(),
    parse_fn=parse_output_eval,
  )
  # Run episode
  try:
    result = function_calling_episode.run_function_calling_episode(
          fc_prompt_builder=datagen_pbuilder,
          replier_prompt_builder=replier_pbuilder,
          function_calling_model=function_calling_model,
          replier_model=replier_model,
          max_steps=6,
    )
  except Exception as e:
    print(e)
    return []


  output_state = [x.to_dict() for x in result.get_state()]

  validation_result = validation_model.query_model(margin.trim_margin(f"""\
      |You are given the following conversation between a user and an AI voice
      >feature for filling out a healthcare intake form:
      |{json.dumps(output_state)}
      |
      |If the agent correctly filled out the form, reply with YES, otherwise
      >reply with NO.
      |
      |NOTE: Only reply with YES or NO.
      """))
  # Guard against an empty or missing reply before the substring check.
  if validation_result and "YES" in validation_result:
    return output_state
  return []
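
The autorater is asked to reply with only YES or NO, but model replies can carry extra whitespace, punctuation, or different casing, and a plain substring check also accepts words that merely contain those letters. A stricter verdict parser is sketched below; `parse_verdict` is a hypothetical helper, not part of the library.

```python
from typing import Optional


def parse_verdict(reply: Optional[str]) -> bool:
  """Returns True only when the reply's first word is YES (case-insensitive)."""
  if not reply:
    return False
  words = reply.strip().split()
  if not words:
    return False
  # Drop surrounding punctuation such as a trailing period.
  return words[0].strip(".,!:;\"'").upper() == "YES"


for reply in ["YES", " yes.\n", "NO", "EYES ON THE FORM", None, ""]:
  print(repr(reply), parse_verdict(reply))
```

Treating anything other than a clear YES as a failure is a conservative choice: an ambiguous autorater reply lowers the measured score rather than inflating it.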


In [None]:
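conversations = []
for scenario in scenarios:
  conversation = run_eval_loop(scenario)
  if conversation:
    # Keep the accepted conversations so they can be inspected afterwards.
    conversations.append(conversation)
    print("Done with scenario ", scenario)
  else:
    print("Failed scenario ", scenario)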
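
Because `run_eval_loop` returns an empty list for failed or rejected episodes, an aggregate score can be derived from its return values. A minimal sketch with a hypothetical `pass_rate` helper and placeholder outcomes (not real evaluation results):

```python
from typing import List, Sequence


def pass_rate(results: Sequence[List[dict]]) -> float:
  """Fraction of episodes whose result is non-empty (i.e. accepted)."""
  if not results:
    return 0.0
  passed = sum(1 for r in results if r)
  return passed / len(results)


# Placeholder outcomes: two accepted conversations and one failure.
demo_results = [
    [{"role": "user", "content": "hi"}],
    [],
    [{"role": "user", "content": "hello"}],
]
print(f"{pass_rate(demo_results):.2f}")
```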