### GAIA

Example code for running the GAIA benchmark through the `agentquest` driver.

In [1]:
from agentquest.benchmarks.gaia import GaiaDriver, GaiaUtils, GaiaAction


In [2]:
# valid_categories = ["2023_all", "2023_level1", "2023_level2", "2023_level3"]
# valid_splits = ["validation", "test", "all"]
dataset = GaiaUtils.load_data(category="2023_all", split="validation")

### Non-interactive mode
In non-interactive mode, the agent gets a single attempt to answer each problem.

In [11]:
accumulated_metrics = []
for problem in dataset.to_dict(orient="records"):
    print(problem)
    if problem.get("file_name", ""):
        extra_filepath = GaiaUtils.get_filepath(problem.get("file_name", ""))
        # This is left for the agent developer to work with the multimodal file input.
        print("File path for the problem: ", extra_filepath)

    driver = GaiaDriver(
        problem=problem["Question"], goal=problem["Final answer"], interactive=False
    )
    obs = driver.reset()
    print(obs.output)
    human_input = input(obs.output)  # Can be replaced by agent input
    # obs = driver.step(GaiaAction(value=str(human_input)))
    obs = driver.step_raw(str(human_input))
    print(obs.output)
    accumulated_metrics.append(driver.metrics.export())

{'task_id': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466', 'Question': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?', 'Level': '2', 'Final answer': 'egalitarian', 'file_name': '', 'file_path': '', 'Annotator Metadata': {'Steps': '1. Go to arxiv.org and navigate to the Advanced Search page.\n2. Enter "AI regulation" in the search box and select "All fields" from the dropdown.\n3. Enter 2022-06-01 and 2022-07-01 into the date inputs, select "Submission date (original)", and submit the search.\n4. Go through the search results to find the article that has a figure with three axes and labels on each end of the axes, titled "Fairness in Agreement With European Values: An Interdisciplinary Perspective on AI Regulation".\n5. Note the six words used a

Could not parse agent output. Make sure it is in the format: `Answer: <Answer>\n`
{'task_id': '17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc', 'Question': 'I’m researching species that became invasive after people who kept them as pets released them. There’s a certain species of fish that was popularized as a pet by being the main character of the movie Finding Nemo. According to the USGS, where was this fish found as a nonnative species, before the year 2020? I need the answer formatted as the five-digit zip codes of the places the species was found, separated by commas if there is more than one place.', 'Level': '2', 'Final answer': '34689', 'file_name': '', 'file_path': '', 'Annotator Metadata': {'Steps': '1. Search the web for “finding nemo main character”.\n2. Note the results, which state that the main character is a clownfish.\n3. Search the web for “usgs nonnative species database”.\n4. Click result for the Nonindigenous Aquatic Species site.\n5. Click “Marine Fishes”.\n6. Click “Specie

KeyboardInterrupt: Interrupted by user
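The `Could not parse agent output` message in the output above suggests that raw text passed to `step_raw` should follow the `Answer: <Answer>\n` shape. A minimal sketch of a helper that normalizes free-form agent text into that shape is shown below. The helper name `format_raw_answer` is hypothetical and not part of `agentquest`, and the exact parsing rules of `step_raw` are an assumption based only on that message.

```python
def format_raw_answer(agent_text: str) -> str:
    """Wrap free-form agent text in an 'Answer: <Answer>' line (hypothetical helper)."""
    text = agent_text.strip()
    # If the agent already wrote an "Answer:" prefix, keep only what follows it.
    if text.lower().startswith("answer:"):
        text = text[len("answer:"):].strip()
    # Assumption: the driver expects a single line terminated by a newline.
    return f"Answer: {text}\n"


print(repr(format_raw_answer("  egalitarian ")))
print(repr(format_raw_answer("answer: 34689")))
```

The result could then be passed to `driver.step_raw(...)` in place of `str(human_input)`.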

In [12]:
accumulated_metrics

[{'problem': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?',
  'goal': 'egalitarian',
  'success': False,
  'actions': [],
  'states': [],
  'observations': [],
  'progress': []},
 {'problem': 'I’m researching species that became invasive after people who kept them as pets released them. There’s a certain species of fish that was popularized as a pet by being the main character of the movie Finding Nemo. According to the USGS, where was this fish found as a nonnative species, before the year 2020? I need the answer formatted as the five-digit zip codes of the places the species was found, separated by commas if there is more than one place.',
  'goal': '34689',
  'success': False,
  'actions': [{'value': 'egalitarian'}],
  'states': [{'v

### Interactive mode
In interactive mode, the agent can make multiple attempts at each problem. The loop below caps them at five.

In [13]:
accumulated_metrics = []
for problem in dataset.to_dict(orient="records"):
    print(problem)
    if problem.get("file_name", ""):
        extra_filepath = GaiaUtils.get_filepath(problem.get("file_name", ""))
        print("File path for the problem: ", extra_filepath)

    driver = GaiaDriver(
        problem=problem["Question"], goal=problem["Final answer"], interactive=True
    )
    obs = driver.reset()

    num_attempts = 1
    while not obs.success and obs.can_proceed and (num_attempts <= 5):
        human_input = input(obs.output)  # Can be replaced by agent input
        obs = driver.step(GaiaAction(value=str(human_input)))
        # obs = driver.step_raw(str(human_input))
        print(obs.output)
        num_attempts += 1

    accumulated_metrics.append(
        driver.metrics.export(
            repetition_function_kwargs={"theta_a": 1, "num_execution_steps": 5}
        )
    )

{'task_id': 'c61d22de-5f6c-4958-a7f6-5e9707bd3466', 'Question': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?', 'Level': '2', 'Final answer': 'egalitarian', 'file_name': '', 'file_path': '', 'Annotator Metadata': {'Steps': '1. Go to arxiv.org and navigate to the Advanced Search page.\n2. Enter "AI regulation" in the search box and select "All fields" from the dropdown.\n3. Enter 2022-06-01 and 2022-07-01 into the date inputs, select "Submission date (original)", and submit the search.\n4. Go through the search results to find the article that has a figure with three axes and labels on each end of the axes, titled "Fairness in Agreement With European Values: An Interdisciplinary Perspective on AI Regulation".\n5. Note the six words used a

KeyboardInterrupt: Interrupted by user

In [14]:
accumulated_metrics

[{'problem': 'A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?',
  'goal': 'egalitarian',
  'success': True,
  'actions': [{'value': 'egalitarian'}],
  'states': [{'value': 'egalitarian'}],
  'observations': [{'output': 'Correct answer!!!',
    'success': True,
    'can_proceed': False}]}]
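The exported metrics are plain dictionaries, so they can be summarized without any extra tooling. The sketch below computes a success rate and the mean number of actions per problem. It assumes only the `success` and `actions` keys visible in the output above, the helper names are hypothetical, and the sample records are trimmed stand-ins rather than real results.

```python
def success_rate(metrics: list[dict]) -> float:
    """Fraction of exported records whose 'success' flag is true."""
    if not metrics:
        return 0.0
    solved = sum(1 for record in metrics if record.get("success"))
    return solved / len(metrics)


def mean_actions(metrics: list[dict]) -> float:
    """Average number of recorded actions (attempts) per problem."""
    if not metrics:
        return 0.0
    return sum(len(record.get("actions", [])) for record in metrics) / len(metrics)


# Trimmed stand-in records with the same keys as driver.metrics.export().
example_metrics = [
    {"goal": "egalitarian", "success": True, "actions": [{"value": "egalitarian"}]},
    {"goal": "34689", "success": False, "actions": [{"value": "egalitarian"}]},
]

print("success rate:", success_rate(example_metrics))
print("mean actions:", mean_actions(example_metrics))
```

In the notebook, `accumulated_metrics` would take the place of `example_metrics`.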

### Making it ready for submission
For GAIA, only the validation set includes the correct answers. The test set does not, so to get a score on it you submit your answers to the GAIA leaderboard.

Link: https://huggingface.co/spaces/gaia-benchmark/leaderboard


The code below is for non-interactive mode.

In [None]:
dataset = GaiaUtils.load_data(category="2023_all", split="test")

submissions = []
for problem in dataset.to_dict(orient="records"):
    print(problem)
    if problem.get("file_name", ""):
        extra_filepath = GaiaUtils.get_filepath(problem.get("file_name", ""))
        print("File path for the problem: ", extra_filepath)

    driver = GaiaDriver(
        problem=problem["Question"], goal=problem["Final answer"], interactive=False
    )
    obs = driver.reset()
    # answer = agent.invoke(obs.output)
    # Parse the agent output if needed
    answer = input(obs.output)  # To be replaced by agent input
    submissions.append({"task_id": problem["task_id"], "model_answer": answer})

In [None]:
dataset = GaiaUtils.load_data(category="2023_all", split="test")

submissions = []
for problem in dataset.to_dict(orient="records"):
    print(problem)
    driver = GaiaDriver(
        problem=problem["Question"], goal=problem["Final answer"], interactive=False
    )
    obs = driver.reset()
    # TODO: answer = agent.invoke(obs.output)
    # Parse the agent output if needed
    answer = input(obs.output)  # Can be replaced by agent input
    submissions.append({"task_id": problem["task_id"], "model_answer": answer})

{'task_id': '6af95c8f-8cbf-4c12-b02c-f9a23cc1ecb9', 'Question': 'Here\'s a fun riddle that I\'d like you to try.\n\nAn adventurer exploring an ancient tomb came across a horde of gold coins, all neatly stacked in columns. As he reached to scoop them into his backpack, a mysterious voice filled the room. "You have fallen for my trap adventurer," the voice began, and suddenly the doorway to the chamber was sealed by a heavy rolling disk of stone. The adventurer tried to move the stone disk but was unable to budge the heavy stone. Trapped, he was startled when the voice again spoke. \n\n"If you solve my riddle, I will reward you with a portion of my riches, but if you are not clever, you will never leave this treasure chamber. Before you are 200 gold coins. I pose a challenge to you, adventurer. Within these stacks of coins, all but 30 are face-up. You must divide the coins into two piles, one is yours, and one is mine. You may place as many coins as you like in either pile. You may flip 

KeyboardInterrupt: Interrupted by user

In [4]:
submissions

[{'task_id': '6af95c8f-8cbf-4c12-b02c-f9a23cc1ecb9', 'model_answer': 'nepal'},
 {'task_id': 'c80ed443-b494-4e86-bec8-10ecb41c2326', 'model_answer': 'home'},
 {'task_id': 'e14448e9-5243-4b07-86e1-22e657f96bcf', 'model_answer': 'there'},
 {'task_id': '198ffd8f-6041-458d-bacc-fe49872cfa43',
  'model_answer': 'in the whole world'},
 {'task_id': '6583799b-573a-4e95-8b28-4f0397bd45c2', 'model_answer': 'okay'},
 {'task_id': '12a682d7-8e8e-4d4c-8102-a97628027441', 'model_answer': ''}]
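To upload the collected answers, the list has to be saved to a file. The sketch below writes one JSON object per line with the `task_id` and `model_answer` fields. The JSON Lines layout is an assumption about what the leaderboard accepts, so check the submission instructions on the leaderboard page before uploading. The function name `write_submission` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_submission(submissions: list[dict], path: str) -> Path:
    """Write one JSON object per line, keeping only task_id and model_answer."""
    out_path = Path(path)
    with out_path.open("w", encoding="utf-8") as handle:
        for row in submissions:
            record = {
                "task_id": row["task_id"],
                # Strip stray whitespace left over from typed or generated answers.
                "model_answer": str(row["model_answer"]).strip(),
            }
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


# Demonstration with a stand-in record, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = [{"task_id": "example-task-id", "model_answer": " nepal "}]
    written = write_submission(demo, str(Path(tmp_dir) / "gaia_submission.jsonl"))
    print(written.read_text(encoding="utf-8"))
```

In the notebook, `write_submission(submissions, "gaia_submission.jsonl")` would save the answers gathered by the loop above.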
