[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/earthengine-community/blob/master/experimental/cbgb_benchmark/cbgb_eval_pipeline.ipynb)

In [None]:
#@title Copyright 2025 The Earth Engine Community Authors { display-mode: "form" }
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lightweight Evaluation Pipeline for the "Cloud-Based Geospatial Benchmark" (CBGB)

## Overview

This Colab notebook implements a lightweight evaluation pipeline for the "Cloud-Based Geospatial Benchmark" (CBGB); a link to the corresponding paper is forthcoming. It is designed to assess how well various Large Language Models (LLMs) generate Earth Engine Python code to solve geospatial problems.

The notebook imports a series of challenges, each with a corresponding question and expected answer. It then initializes and runs a set of "Code Generating Agents", each powered by a different LLM (including models from Gemini, OpenAI, Anthropic, and DeepSeek), to generate Python code that answers these questions.

The pipeline includes functionality for code execution, error correction with multiple retries, and recording of results such as code execution time, the raw and cleaned code, and the final answer.

The questions themselves are hosted in a public Google Cloud Storage bucket at:

- gs://cbgb-1/eval_set_2025_05_08.csv
- Direct download at: https://storage.googleapis.com/cbgb-1/eval_set_2025_05_08.csv


## Setup Details and Billing

You will need:

- A Google Cloud project with the Earth Engine API enabled. ([Details](https://developers.google.com/earth-engine/cloud/earthengine_cloud_project_setup)).

- API keys for any of the proprietary LLMs you'd like to run evals against. Options include Gemini, OpenAI, Anthropic, and DeepSeek.


API keys and project IDs can be stored in the [Colab "Secrets" panel](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75). Add the following secrets:

 - Use `EE_PROJECT_ID` for the Cloud project ID to use for EE initialization.
 - Use `GOOGLE_API_KEY` for the Gemini API key
 - Use `ANTHROPIC_API_KEY` for Anthropic
 - Use `DEEPSEEK_API_KEY` for DeepSeek
 - Use `OPENAI_API_KEY` for OpenAI
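
The notebook reads these secrets with `google.colab.userdata.get`. As a minimal sketch, assuming you may also want to run parts of the pipeline outside Colab, a helper could fall back to environment variables of the same name. `get_secret` is a hypothetical name used only for illustration; it is not part of the notebook or of any library.

```python
import os


def get_secret(name: str) -> str:
  """Returns a secret from Colab's Secrets panel, else from the environment.

  Hypothetical helper for illustration; the notebook itself calls
  userdata.get() directly.
  """
  try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import userdata
    return userdata.get(name)
  except Exception:
    # Not running in Colab, or the secret is missing or not shared with this
    # notebook: fall back to an environment variable of the same name.
    value = os.environ.get(name)
    if value is None:
      raise KeyError(f'Secret {name!r} was not found.')
    return value


# Demo with a placeholder value (not a real project ID).
os.environ['EE_PROJECT_ID'] = 'my-demo-project'
project_id = get_secret('EE_PROJECT_ID')
print(project_id)
```

Catching a broad `Exception` keeps the sketch short; a stricter version would distinguish a missing secret from a missing Colab runtime.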

## Caveats

 - THIS TOOL IS UNSAFE, AS IT AUTOMATICALLY RUNS LLM-GENERATED
PYTHON CODE USING THE NOTEBOOK KERNEL! USE AT YOUR OWN RISK.


 - For assistance, please email the corresponding author (TODO decide what to add here).
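
One way to reduce (not eliminate) this risk is to run generated code in a separate Python process rather than in the notebook kernel. The sketch below is an assumption about how that could look, not what the notebook does: `run_in_subprocess` is a hypothetical helper, and a child process still has the same file and network access as the kernel, so this gives process isolation and a hard timeout rather than a sandbox.

```python
import subprocess
import sys


def run_in_subprocess(code: str, timeout_seconds: float = 60) -> str:
  """Runs code in a fresh interpreter and returns stdout or an error string."""
  try:
    completed = subprocess.run(
        [sys.executable, '-c', code],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
  except subprocess.TimeoutExpired:
    # subprocess.run kills the child process when the timeout expires.
    return 'Exception: EE CODE TIMEOUT.'
  if completed.returncode != 0:
    # Mirror the notebook's convention that failures start with "Exception".
    return f'Exception occurred: {completed.stderr.strip()}'
  return completed.stdout


print(run_in_subprocess('print(6 * 7)'))
print(run_in_subprocess('raise ValueError("boom")').startswith('Exception'))
```

Unlike a thread-based timeout, a subprocess timeout actually terminates the running code, at the cost of re-initializing Earth Engine in every child process.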

In [None]:
!pip install "git+https://github.com/google/earthengine-community.git#egg=functionsmith&subdirectory=experimental/functionsmith" --quiet


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
#@title Imports
# Standard Library Imports
import asyncio
import ast
import copy
import io
import IPython
import json
import logging
import math
import os
import re
import sys
import threading
import time
from contextlib import redirect_stdout
from dataclasses import asdict, dataclass
from datetime import datetime as dt
from enum import Enum

# Third-Party Imports
import google.generativeai as genai
import numpy as np
import pandas as pd
import tenacity
from google.colab import auth, syntax, userdata
from jinja2 import Template
from tqdm.asyncio import tqdm

# First-Party / Project-Specific Imports
# See https://github.com/google/earthengine-community/blob/master/experimental/functionsmith/README.md
from functionsmith import code_parser, executor, llm as fs_llm

In [None]:
#@title Initialize Earth Engine and Google libraries
import ee

# Trigger the authentication flow.
ee.Authenticate(scopes=['https://www.googleapis.com/auth/earthengine.readonly'])

# Initialize the library.
ee.Initialize(project=userdata.get('EE_PROJECT_ID'))
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))

auth.authenticate_user()

# Used so that logging from functionsmith doesn't clutter output.
logging.basicConfig(level=logging.CRITICAL)

In [None]:
#@title Constants/Configs
# The location in a Google Cloud Storage bucket of the original eval questions.
BENCHMARK_QUESTIONS = 'gs://cbgb-1/eval_set_2025_05_08.csv'

# If true, only runs a small subset of questions and agents since the full
# pipeline is somewhat slow due to LLM and EE quota limits, and long running
# EE code more generally.
SIMPLE_DEMO = True

# The maximum time in seconds to wait for a piece of generated Earth Engine
# code to execute before timing out.
if SIMPLE_DEMO:
  EE_CODE_EXEC_TIMEOUT = 60 # 1 min
else:
  EE_CODE_EXEC_TIMEOUT = 600 # 10 min

# Determines whether to include "error correcting" agents in the eval run.
INCLUDE_ERROR_CORRECTION = True
# Suffix added to the agent name to distinguish the error-correction run.
ERROR_CORRECTION_SUFFIX = 'error_correction'
# The maximum number of times an agent will try to fix its code if an error occurs.
MAX_RETRIES = 3


# Results for each question and agent are always written to
# LOCAL_ANSWERS_DIRECTORY as JSON files. If True, subsequent runs read answers
# from this local cache instead of regenerating them.
USE_LOCAL_ANSWERS = False

# If True (and USE_LOCAL_ANSWERS is also True), the local answer cache is stored
# in your Google Drive. This is recommended for persistence, as Colab's local
# disk can be cleared.
USE_GOOGLE_DRIVE = True

# A list of the specific Large Language Model identifiers to be used for the evaluation.
# The code will create an agent for each model in this list.
if SIMPLE_DEMO:
  LLMS_FOR_AGENTS = ['gemini-2.5-flash', 'claude-3-5-haiku-20241022', 'o4-mini-2025-04-16']
else:
  # The full set used in the paper
  LLMS_FOR_AGENTS = [
    'claude-3-5-haiku-20241022',
    'claude-3-5-sonnet-20241022',
    'claude-3-7-sonnet-20250219',
    'deepseek-chat',
    'deepseek-reasoner',
    'gemini-2.5-pro',
    'gemini-2.5-flash',
    'gemini-2.0-flash',
    'o3-2025-04-16',
    'o4-mini-2025-04-16'
  ]

# Defines the base name for the directory where cached agent answers will be stored.
LOCAL_ANSWERS_DIRECTORY = f'agent_answers_{dt.now().strftime("%Y%m%d")}'

# Mounts Google Drive and points the answers directory at a folder in "My Drive" when both USE_LOCAL_ANSWERS and USE_GOOGLE_DRIVE are True.
if USE_LOCAL_ANSWERS and USE_GOOGLE_DRIVE:
  from google.colab import drive
  drive.mount('/content/drive')
  LOCAL_ANSWERS_DIRECTORY = f'/content/drive/MyDrive/{LOCAL_ANSWERS_DIRECTORY}'
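
The answer cache configured above is one JSON file per agent and challenge. The following is a minimal sketch of that round trip, assuming a cut-down stand-in dataclass (`CachedAnswer`, hypothetical) and a temporary directory in place of `LOCAL_ANSWERS_DIRECTORY`. The path layout `<directory>/<agent_name>/<challenge_id>.json` follows the notebook's `answer_question` method.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class CachedAnswer:
  """Hypothetical, cut-down stand-in for the notebook's ChallengeInfo."""
  id: str
  question: str
  final_answer: list = field(default_factory=list)


def save_answer(directory: str, agent_name: str, answer: CachedAnswer) -> str:
  """Writes one answer to <directory>/<agent_name>/<id>.json."""
  path = os.path.join(directory, agent_name, f'{answer.id}.json')
  os.makedirs(os.path.dirname(path), exist_ok=True)
  with open(path, 'w') as f:
    json.dump(asdict(answer), f)
  return path


def load_answer(path: str) -> CachedAnswer:
  """Rebuilds the dataclass from its JSON representation."""
  with open(path) as f:
    return CachedAnswer(**json.load(f))


with tempfile.TemporaryDirectory() as cache_dir:
  original = CachedAnswer(id='demo_1', question='What is 6 * 7?', final_answer=[42.0])
  path = save_answer(cache_dir, 'demo-agent', original)
  restored = load_answer(path)

print(restored == original)
```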

## Agent Code

In [None]:
#@title Helper Methods

@dataclass
class ChallengeInfo:
  """A dataclass to hold all information about a single challenge, including the question, expected answer, and the results from an agent."""
  id: str
  title: str
  question: str
  difficulty: str
  expected_answer: float
  code_execution_time: float | None = None
  total_attempts: int | None = None
  raw_code: str = ''
  clean_code: str = ''
  raw_code_results: str = ''
  exception: str = ''
  # A list of extracted numbers, or a one-element list holding a status string.
  final_answer: list | None = None

def simplify_df(full_results_df):
  """Removes detailed columns from the full results DataFrame to create a more readable summary."""
  drop_suffixes = ['question', '_raw_code', '_raw_code_results', '_clean_code',  '_code_execution_time', '_total_attempts', 'exception', 'solution_script']
  cols_to_drop = []
  for col in full_results_df.columns:
      if any(col.endswith(suffix) for suffix in drop_suffixes):
          cols_to_drop.append(col)
  simple_results_df = full_results_df.drop(columns=cols_to_drop)

  common_cols = ['id', 'title',  'difficulty', 'expected_answer']
  all_cols = simple_results_df.columns.tolist()
  other_cols = [col for col in all_cols if col not in common_cols]
  other_cols_sorted = sorted(other_cols)
  new_col_order = common_cols + other_cols_sorted
  simple_results_df = simple_results_df.reindex(columns=new_col_order)

  return simple_results_df

def make_llm_interface(model_name: str, system_instruction=''):
    """Creates and returns a specific LLM interface from the 'functionsmith' library based on the model name."""
    if 'gemini' in model_name:
      llm_interface = fs_llm.Gemini(system_instruction, model_name=model_name, api_key=userdata.get('GOOGLE_API_KEY'))
    elif 'claude' in model_name:
      llm_interface = fs_llm.Claude(system_instruction, model_name=model_name, api_key=userdata.get('ANTHROPIC_API_KEY'))
    elif 'deepseek' in model_name:
      llm_interface = fs_llm.DeepSeek(system_instruction, model_name=model_name, api_key=userdata.get('DEEPSEEK_API_KEY'))
    else:
      llm_interface = fs_llm.ChatGPT(system_instruction, model_name=model_name, api_key=userdata.get('OPENAI_API_KEY'))
    return llm_interface

def clean_up_ee_code(code: str) -> str:
  """Clean up the Earth Engine code.

  Removes EE imports, authentication and authorization code.

  Args:
    code: The Earth Engine code to clean up.

  Returns:
    The cleaned up Earth Engine code.
  """
  regex_patterns_to_remove = [
      r'^(\s*).*import ee.*$',
      r'^(\s*).*ee\.Authenticate\(.*\).*$',
      r'^(\s*).*ee\.Initialize\(.*\).*$',
  ]

  for regex_pattern in regex_patterns_to_remove:
      code = re.sub(regex_pattern, r'\1pass', code, flags=re.MULTILINE)


  # Import EE is inconsistently included by llm, so we make sure it's added exactly
  # once.
  code = code.replace('import ee', '').replace('```Python', '').replace('```', '')
  return 'import ee\n' + code.strip()


def list_to_df(obj_list) -> pd.DataFrame:
    """Converts a list of data objects into a pandas DataFrame."""
    return pd.DataFrame([asdict(o) for o in obj_list])



class CodeExecutionTimeoutError(Exception):
    """Custom exception for code execution that exceeds the specified timeout.

    Named so that it does not shadow Python's built-in TimeoutError.
    """
    pass

def extract_all_numbers(text, timeout_seconds):
    """Extracts and rounds all numerical values from a given string."""
    if not text.startswith('Exception'):
        matches = re.findall(r'-?\d*\.?\d+', text)
        return [round(float(match), 3) for match in matches] if matches else []
    elif 'TIMEOUT' in text:
        return [f'EE CODE TIMEOUT - {timeout_seconds} seconds.']
    return []

def truncate_cell(val):
  """Truncates a string value if it exceeds a maximum character limit to keep DataFrame cells readable."""
  MAX_CELL_CHARS = 50000
  if val is None:
      return val

  # Convert value to string if it's not already
  str_val = str(val) if not isinstance(val, str) else val

  # Check if the string length exceeds the limit
  if len(str_val) > MAX_CELL_CHARS:
      return "OUTPUT TOO BIG"
  return val
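
To see what the number extraction does to typical program output, the sketch below copies the regular expression from `extract_all_numbers` into a standalone function (`extract_numbers`, a simplified stand-in without the timeout handling). The last call shows a caveat: the pattern does not understand scientific notation, so a value printed as `1e-3` is split into two numbers.

```python
import re


def extract_numbers(text: str) -> list:
  """Simplified stand-in for extract_all_numbers (same regex, no timeout case)."""
  if text.startswith('Exception'):
    return []
  return [round(float(m), 3) for m in re.findall(r'-?\d*\.?\d+', text)]


print(extract_numbers('2990.6138607417993\n'))         # [2990.614]
print(extract_numbers('Exception occurred: boom 42'))  # []
print(extract_numbers('1e-3'))                         # [1.0, -3.0]
```

This is one reason the system instructions ask the model to print only the final number: any other digits in the output end up in the extracted list as well.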

In [None]:
#@title CodeGenAgent definition

ERROR_CORRECTION_TEMPLATE = Template("""Here is the question:
{{ question }}

You previously wrote this code which had an error:
```python
{{ bad_code }}
```

This was the error:
{{ error_msg }}

Please try again and rewrite the code.""")
stdout_mutex = asyncio.Lock()

class CodeGenAgent:
  """An agent that uses a Large Language Model to generate and execute Earth
   Engine code to answer a specific question."""

  def __init__(self, agent_name: str, model_name: str, system_instructions=''):
    """Initializes the CodeGenAgent.

    Args:
      agent_name: A unique name for this agent instance.
      model_name: The identifier for the LLM to be used (e.g., 'gemini-2.5-flash').
      system_instructions: The system prompt to guide the LLM's behavior.
    """
    self.logger = logging.Logger(agent_name, level=logging.CRITICAL)
    self.parser = code_parser.Parser(self.logger)
    self.executor = executor.Executor(self.logger)
    self.agent_name = agent_name
    self.model_name = model_name
    self.system_instructions = system_instructions


  async def answer_question(self, original_challenge: ChallengeInfo,
                            timeout_seconds=EE_CODE_EXEC_TIMEOUT,
                            use_local_answer=USE_LOCAL_ANSWERS) -> ChallengeInfo:
    '''Generates and executes code to answer a challenge, handling retries and caching.
    This is the main method for the agent. It takes a challenge, attempts to
    generate and run code to solve it, and will retry with error correction
    if configured to do so. Results are cached locally to avoid re-computation.

    Args:
      original_challenge: The ChallengeInfo object containing the question.
      timeout_seconds: The maximum time allowed for code execution.
      use_local_answer: A boolean to control whether to use cached answers.

    Returns:
      A ChallengeInfo object populated with the agent's results, including the
      generated code, execution output, and final answer.
    '''

    local_answer_path = f'{LOCAL_ANSWERS_DIRECTORY}/{self.agent_name}/{original_challenge.id}.json'
    os.makedirs(os.path.dirname(local_answer_path), exist_ok=True)

    if use_local_answer and os.path.exists(local_answer_path):
      with open(local_answer_path, 'r') as f:
        challenge_json = json.load(f)
        return ChallengeInfo(**challenge_json)

    challenge = copy.deepcopy(original_challenge)
    if not challenge.question.strip():
      challenge.final_answer = ['MISSING QUESTION']
      return challenge

    if self.agent_name.endswith(ERROR_CORRECTION_SUFFIX):
      max_attempts = MAX_RETRIES
    else:
      max_attempts = 1

    challenge.total_attempts = 0
    while challenge.total_attempts < max_attempts:
      challenge.total_attempts += 1

      code_text = await self.generate_code(challenge)

      # Parse the generated code
      challenge.raw_code = self.parser.extract_python_code_blocks(code_text).code
      challenge.clean_code = clean_up_ee_code(challenge.raw_code)

      # Execute the code
      async with stdout_mutex:
        start_time = time.perf_counter()
        try:
          challenge = await asyncio.wait_for(self.run_code(challenge), timeout_seconds)
        except asyncio.TimeoutError:
          challenge.raw_code_results = 'Exception: EE CODE TIMEOUT.'
          challenge.final_answer = ['Exception: EE CODE TIMEOUT.']
        end_time = time.perf_counter()
        code_execution_time = end_time - start_time
        challenge.code_execution_time = code_execution_time

      # If the executed code did not raise an exception, we have our solution!
      if not challenge.raw_code_results.startswith('Exception'):
        with open(local_answer_path, 'w') as f:
          json.dump(asdict(challenge), f)
        return challenge

    # If we've hit the max retries, return even if there was an exception.
    with open(local_answer_path, 'w') as f:
      json.dump(asdict(challenge), f)
    return challenge

  @tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_fixed(1),
  )
  async def generate_code(self, challenge) -> str:
    """Generates Python code by prompting the LLM.
    It constructs the prompt, which includes the original question and, if
    applicable, the previous code and error message for correction. It uses
    a retry mechanism for robustness against transient LLM API issues.

    Args:
        challenge: The ChallengeInfo object with question and error context.
    Returns:
      The raw text response from the language model.
    """
    # Initialize a new llm client. We want a fresh chat experience for each question, even on retries.
    llm = make_llm_interface(self.model_name, self.system_instructions)

    # Prompt the LLM to generate code. If a previous attempt failed, include
    # the failing code and its error message so the LLM can try to fix it.
    question = challenge.question
    if challenge.raw_code_results.startswith('Exception') and challenge.raw_code:
      # The code has already run and there was at least one failure.
      question = ERROR_CORRECTION_TEMPLATE.render(
          question=challenge.question,
          bad_code=challenge.raw_code,
          error_msg=challenge.raw_code_results)
    code_text = await asyncio.to_thread(llm.chat, question)
    return code_text


  @tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_fixed(1),
  )
  async def run_code(self, original_challenge: ChallengeInfo, timeout_seconds=EE_CODE_EXEC_TIMEOUT):
    """Executes the cleaned Python code, captures the output, and updates relevant challenge fields."""
    challenge = copy.deepcopy(original_challenge)
    if challenge.question and not challenge.clean_code:
      # This shouldn't happen.
      challenge.final_answer = ['NO CODE TO RUN']
      return challenge

    answer_str = await asyncio.to_thread(self.executor.run_code, challenge.clean_code, syscalls={}, code_globals={'ee': ee})
    challenge.raw_code_results = answer_str
    challenge.final_answer = extract_all_numbers(answer_str, timeout_seconds)
    return challenge
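
The timeout in `answer_question` combines `asyncio.wait_for` with `asyncio.to_thread`. A minimal sketch of that pattern follows, with a hypothetical `slow_job` standing in for generated Earth Engine code. Note that a Python thread cannot be forcibly stopped: the timeout ends the waiting, but the job itself keeps running in the background until it finishes.

```python
import asyncio
import time

finished = []


def slow_job() -> str:
  """Hypothetical stand-in for long-running generated code."""
  time.sleep(0.3)
  finished.append('slow_job')
  return 'done'


async def run_with_timeout(func, timeout_seconds: float) -> str:
  """Runs func in a worker thread, giving up after timeout_seconds."""
  try:
    return await asyncio.wait_for(asyncio.to_thread(func), timeout_seconds)
  except asyncio.TimeoutError:
    return 'Exception: EE CODE TIMEOUT.'


async def main():
  result = await run_with_timeout(slow_job, timeout_seconds=0.05)
  still_running = not finished  # The thread has not completed yet.
  await asyncio.sleep(0.5)      # Give the abandoned thread time to finish.
  return result, still_running, list(finished)


# In a notebook, which already runs an event loop, use `await main()` instead.
result, still_running, completed_jobs = asyncio.run(main())
print(result, still_running, completed_jobs)
```

In practice this means a timed-out Earth Engine request may continue to consume quota after the agent has moved on.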

In [None]:
#@title Helper Methods to run CodeGen Agent

async def run_agent(agent, challenges, timeout_seconds=EE_CODE_EXEC_TIMEOUT):
    """Runs a single agent against a list of challenges and returns the results as a DataFrame."""

    async def process_challenge(challenge):
        result = await agent.answer_question(challenge, timeout_seconds=timeout_seconds)
        return result

    results = await tqdm.gather(*[process_challenge(c) for c in challenges],
                                 desc=f"Running agent {agent.agent_name} on challenges")

    results_df = list_to_df(results)
    return results_df


async def run_all_agents(agent_dicts, challenges, timeout_seconds=EE_CODE_EXEC_TIMEOUT):
  """Runs each agent in turn against the challenges and merges the results into one DataFrame."""
  full_results_df_dict = {}
  for llm_model, agent in agent_dicts.items():
    results_df = await run_agent(agent, challenges, timeout_seconds=timeout_seconds)
    full_results_df_dict[llm_model] = results_df
  # Merge once, after every agent has finished.
  return merge_df_dict(full_results_df_dict)

def merge_df_dict(df_dict):
  """Merges a dictionary of DataFrames from multiple agent runs into a single DataFrame."""
  common_cols = ['id', 'title', 'question',  'difficulty', 'expected_answer']
  custom_cols = ['raw_code', 'clean_code', 'raw_code_results', 'final_answer', 'code_execution_time', 'total_attempts', 'exception']

  # Create a copy of the first dataframe to start with
  first_key = list(df_dict.keys())[0]
  result_df = df_dict[first_key].copy()
  result_df.rename(columns={col: f"{first_key}_{col}" for col in custom_cols}, inplace=True)

  # Process each subsequent dataframe
  for key, df in list(df_dict.items())[1:] :
    rename_dict = {col: f"{key}_{col}" for col in custom_cols}
    temp_df = df.copy()
    # drop solution script col if it exists
    if 'solution_script' in temp_df.columns:
      temp_df.drop(columns=['solution_script'], inplace=True)
    temp_df.rename(columns=rename_dict, inplace=True)
    result_df = pd.merge(result_df, temp_df, on=common_cols, how='outer')
  return result_df

## Main



In [None]:
#@title Read Benchmark Questions and initialize Challenge objects
df = pd.read_csv(BENCHMARK_QUESTIONS)

challenges = [
  ChallengeInfo(
      id=row["id"],
      title=row["title"],
      difficulty=row["difficulty"],
      question=row["question"],
      expected_answer=row["expected_answer"],
  )
  for _, row in df.iterrows()
]

print(f"Loaded {len(challenges)} total challenges.")
df.head()

Loaded 45 total challenges.


Unnamed: 0,id,title,difficulty,expected_answer,question
0,EBA_A1.2_A1a,Land Cover Classification Using Random Forest,Intermediate,158.799,ID: EBA_A1.2_A1a \n\n### Title : Land Cover Cl...
1,EBA_A1.2_A1b,Land Cover Classification Using SVM,Intermediate,39.956,**ID : EBA_A1.2_A1b**\n\n### Title : Land Cove...
2,EBA_A1.3_A1,Calculating Road Length in Lesotho,Intermediate,3300.914,**ID : EBA_A1.3_A1**\n\n**Title: Calculating R...
3,EBA_A1.5_A1,"Summer SUHI Regional Change in New Haven, CT (...",Difficult,0.934,**ID : EBA_A1.5_A1**\n\n**Title :****Summer SU...
4,EBA_F1.3_A3,Image Count Comparison: MODIS vs. Sentinel-2 (...,Intermediate,52.0,**ID :****EBA_F1.3_A3**\n\n**Title: Image Coun...


In [None]:
#@title Display a single question
IPython.display.Markdown(df.iloc[4]['question'])

**ID :****EBA_F1.3_A3**

**Title: Image Count Comparison: MODIS vs. Sentinel-2 (2017)**

**Objective :** Geometry: Use the point [-122.30144, 37.80215] as the focal point for analysisAccess the following collections: 

* MODIS Surface Reflectance (MOD09A1): ee.ImageCollection("MODIS/061/MOD09A1") 
* Sentinel-2 Surface Reflectance (Harmonized S2): ee.ImageCollection("COPERNICUS/S2_HARMONIZED") 

Task: Count the number of images in both collections in 2017 that intersect the given point. Determine which collection has the larger number of intersecting images. Print only the larger of the two counts to the console. 


In [None]:
#@title System Instructions

SYSTEM_INSTRUCTIONS = '''
You are an expert Earth Engine developer. Write Earth Engine Python code to answer the following question. No extra explanation is needed. ONLY return the Earth Engine python code in a ```python section.

Call getInfo() when you need to get the value of the final computation.
The answer must be printed to stdout.

In cases where the answer is a number, print ONLY the number corresponding to
the answer, no other explanatory text.

For example, don't do this:
`print('here is my answer: ' + x)`

Just do:
`print(x)`

Output:
If a single number is requested, be sure to produce an answer that is a single number.
In that case, do not produce a list of numbers that contains the requested specific answer.
For example,if a single number is requested, “[1.0, 5.0, 2222.0,..., 0.356]” is not a proper answer.

'''

In [None]:
#@title Initialize Agents

agents = {}
for model_name in LLMS_FOR_AGENTS:
  agent_name = model_name
  agents[model_name] = CodeGenAgent(agent_name=agent_name, model_name=model_name, system_instructions=SYSTEM_INSTRUCTIONS)

  if INCLUDE_ERROR_CORRECTION:
    agent_name = f'{model_name}_{ERROR_CORRECTION_SUFFIX}'
    agents[agent_name] = CodeGenAgent(
        agent_name=agent_name, model_name=model_name, system_instructions=SYSTEM_INSTRUCTIONS)

print('Created the following coding agents:')
list(agents.keys())

Created the following coding agents:


['gemini-2.5-flash',
 'gemini-2.5-flash_error_correction',
 'claude-3-5-haiku-20241022',
 'claude-3-5-haiku-20241022_error_correction',
 'o4-mini-2025-04-16',
 'o4-mini-2025-04-16_error_correction']
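
Each model therefore gets two agents: the plain one makes a single attempt, while the `_error_correction` variant may make up to `MAX_RETRIES` attempts, feeding the failed code and its error message back to the model. The sketch below illustrates that loop in isolation. `fake_llm` and `execute` are hypothetical stand-ins (no real LLM or Earth Engine call is made), and the prompt is a shortened version of the notebook's error-correction template.

```python
import contextlib
import io


def execute(code: str) -> str:
  """Runs code and returns captured stdout, or an 'Exception...' string."""
  buffer = io.StringIO()
  try:
    with contextlib.redirect_stdout(buffer):
      exec(code, {})
  except Exception as e:
    return f'Exception occurred: {e}'
  return buffer.getvalue()


def fake_llm(prompt: str) -> str:
  """Hypothetical LLM: buggy code first, fixed code once shown an error."""
  if 'This was the error' in prompt:
    return 'print(6 * 7)'
  return 'print(undefined_name)'


def answer_with_retries(question: str, max_attempts: int):
  """Generates and runs code, re-prompting with the error on failure."""
  prompt, attempts, output = question, 0, ''
  while attempts < max_attempts:
    attempts += 1
    code = fake_llm(prompt)
    output = execute(code)
    if not output.startswith('Exception'):
      break
    prompt = (f'Here is the question:\n{question}\n\n'
              f'You previously wrote this code which had an error:\n{code}\n\n'
              f'This was the error:\n{output}\n\nPlease try again.')
  return attempts, output


print(answer_with_retries('What is 6 times 7?', max_attempts=1))
print(answer_with_retries('What is 6 times 7?', max_attempts=3))
```

With one attempt the failure is returned as is; with three, the second attempt succeeds and the loop stops early.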

In [None]:
#@title Run all the Agents!
if SIMPLE_DEMO:
  easy_challenges = [c for c in challenges if c.difficulty == 'Easy']
  full_results_df = await run_all_agents(agents, easy_challenges[0:5])
else:
  full_results_df = await run_all_agents(agents, challenges)

Running agent gemini-2.5-flash on challenges: 100%|██████████| 5/5 [00:47<00:00,  9.59s/it]
Running agent gemini-2.5-flash_error_correction on challenges: 100%|██████████| 5/5 [00:24<00:00,  5.00s/it]
Running agent claude-3-5-haiku-20241022 on challenges: 100%|██████████| 5/5 [00:13<00:00,  2.61s/it]
Running agent claude-3-5-haiku-20241022_error_correction on challenges: 100%|██████████| 5/5 [00:53<00:00, 10.65s/it]
Running agent o4-mini-2025-04-16 on challenges: 100%|██████████| 5/5 [00:30<00:00,  6.05s/it]
Running agent o4-mini-2025-04-16_error_correction on challenges: 100%|██████████| 5/5 [01:21<00:00, 16.37s/it]


In [None]:
# @title Optional hack to make sure stdout doesn't get eaten during EE code execution.
# Sometimes there's a weird interaction between the functionsmith library
# and colab where print statements stop outputting in notebook cells.
# This line will make print statements start working again without requiring a
# restart to the notebook.

# print = lambda *args, **kwargs: display(' '.join(str(arg) for arg in args)) if args else display('')

In [None]:
full_results_df.head()

Unnamed: 0,id,title,question,difficulty,expected_answer,gemini-2.5-flash_code_execution_time,gemini-2.5-flash_total_attempts,gemini-2.5-flash_raw_code,gemini-2.5-flash_clean_code,gemini-2.5-flash_raw_code_results,...,o4-mini-2025-04-16_raw_code_results,o4-mini-2025-04-16_exception,o4-mini-2025-04-16_final_answer,o4-mini-2025-04-16_error_correction_code_execution_time,o4-mini-2025-04-16_error_correction_total_attempts,o4-mini-2025-04-16_error_correction_raw_code,o4-mini-2025-04-16_error_correction_clean_code,o4-mini-2025-04-16_error_correction_raw_code_results,o4-mini-2025-04-16_error_correction_exception,o4-mini-2025-04-16_error_correction_final_answer
0,EBA_F2.0_A2,Calculating Iron Oxide Ratio (IOR) for Hydroth...,**ID: EBA_F2.0_A2**\n\n### Title: Calculating ...,Easy,0.994,0.399559,1,import ee\n\nee.Initialize()\n\n# 1. Focus on ...,import ee\npass\n\npass\n\n# 1. Focus on this ...,0.994\n,...,0.994\n,,[0.994],0.526946,1,import ee\nee.Initialize()\n\n# Define the poi...,import ee\npass\npass\n\n# Define the point of...,0.994\n,,[0.994]
1,EBA_F5.1_A1,Forest growth calculation by raster extraction...,**ID : EBA_F5.1_A1**\n\n**Title: Forest Growth...,Easy,2990.609,20.887174,1,import ee\n\nee.Initialize()\n\n# Data Import\...,import ee\npass\n\npass\n\n# Data Import\ngfc ...,2990.6138607417993\n,...,2990.6138607417984\n,,[2990.614],3.086721,1,import ee\nee.Initialize()\n\n# Load datasets\...,import ee\npass\npass\n\n# Load datasets\ngfc ...,2990.6138607418047\n,,[2990.614]
2,EBA_F5.2_A2,Zonal Mean Slope Calculation Using Buffered Po...,**ID: EBA_F5.2_A2**\n\n**Title: Maximum Zonal ...,Easy,34.5,0.326691,1,import ee\n\nee.Initialize()\n\n# Define the p...,import ee\npass\n\npass\n\n# Define the point ...,34.5\n,...,Exception occurred: Reducer can only be used a...,,[],0.364431,1,import ee\nee.Initialize()\n\n# Define points\...,import ee\npass\npass\n\n# Define points\npoin...,34.5\n,,[34.5]
3,EBD_F2.0_C1,Calculating Normalized Difference Water Index,**ID: EBD_F2.0_C1**\n\n**Title:****Calculating...,Easy,-0.002,0.551164,1,import ee\n\nee.Initialize()\n\n# 1. Define th...,import ee\npass\n\npass\n\n# 1. Define the poi...,-0.002\n,...,-0.002\n,,[-0.002],0.530126,1,import ee\nee.Initialize()\n\n# Define the poi...,import ee\npass\npass\n\n# Define the point of...,-0.002\n,,[-0.002]
4,EBD_F2.1_C1,Land Cover Classification of Milan using Lands...,**ID: EBD_F2.1_C1**\n\n**Title:****Land Cover ...,Easy,4012839.22,19.111872,1,import ee\n\n# Initialize Earth Engine\nee.Ini...,import ee\npass\n\n# Initialize Earth Engine\n...,4012839.2196078426\n,...,4012839.2196078426\n,,[4012839.22],24.74384,2,import ee\nee.Initialize()\n\n# Define trainin...,import ee\npass\npass\n\n# Define training poi...,4012839.2196078426\n,,[4012839.22]


In [None]:
simple_df = simplify_df(full_results_df)
simple_df.head()

Unnamed: 0,id,title,difficulty,expected_answer,claude-3-5-haiku-20241022_error_correction_final_answer,claude-3-5-haiku-20241022_final_answer,gemini-2.5-flash_error_correction_final_answer,gemini-2.5-flash_final_answer,o4-mini-2025-04-16_error_correction_final_answer,o4-mini-2025-04-16_final_answer
0,EBA_F2.0_A2,Calculating Iron Oxide Ratio (IOR) for Hydroth...,Easy,0.994,[0.994],[0.994],[0.994],[0.994],[0.994],[0.994]
1,EBA_F5.1_A1,Forest growth calculation by raster extraction...,Easy,2990.609,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2023.0, 1....","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2023.0, 1....",[],[2990.614],[2990.614],[2990.614]
2,EBA_F5.2_A2,Zonal Mean Slope Calculation Using Buffered Po...,Easy,34.5,[34.474],[],[34.5],[34.5],[34.5],[]
3,EBD_F2.0_C1,Calculating Normalized Difference Water Index,Easy,-0.002,[-0.002],[-0.002],[-0.002],[-0.002],[-0.002],[-0.002]
4,EBD_F2.1_C1,Land Cover Classification of Milan using Lands...,Easy,4012839.22,[],[],[4012839.22],[4012839.22],[4012839.22],[4012839.22]


# Analysis

In [None]:
def check_answer_within_nsf(agent_answer_str, expected_answer, nsf=2):
    """
    Parses an agent's string-formatted list and checks if any answer
    matches the expected answer to n significant figures.
    """
    try:
        # Safely evaluate the string to a list of numbers
        answers = ast.literal_eval(str(agent_answer_str))
        if not isinstance(answers, list) or not answers:
            return False
    except (ValueError, SyntaxError):
        return False # Handles empty or malformed strings

    def round_to_nsf(n):
        """Helper to round a number to n significant figures."""
        if n == 0: return 0
        return round(n, nsf - int(math.floor(math.log10(abs(n)))) - 1)

    # Check if any provided answer matches the expected one when rounded
    return any(
        round_to_nsf(float(ans)) == round_to_nsf(expected_answer)
        for ans in answers
        if isinstance(ans, (int, float))
    )


# Identify agent columns and create a results DataFrame
agent_cols = [col for col in simple_df.columns if 'final_answer' in col]
results_df = simple_df.copy()

# Use apply() to efficiently generate boolean results for each agent
for col in agent_cols:
    results_df[f'{col}_correct'] = simple_df.apply(
        lambda row: check_answer_within_nsf(row[col], row['expected_answer'], nsf=2),
        axis=1
    )

# --- Output ---

# Identify the newly created boolean columns
correct_cols = [f'{col}_correct' for col in agent_cols]

# Group by difficulty and sum the True values (since True=1, False=0)
summary = results_df.groupby('difficulty')[correct_cols].sum().astype(int)

# To give context, let's also get the total number of questions per difficulty
difficulty_counts = results_df['difficulty'].value_counts().rename("total_questions")

# Combine the counts with the summary for a comprehensive view
final_summary = summary.join(difficulty_counts)

# drop final_answer_correct suffixes from all columns
final_summary.columns = final_summary.columns.str.replace('_final_answer_correct', '')

# replace error_correction with ec
final_summary.columns = final_summary.columns.str.replace('error_correction', 'ec')
# --- Output ---
print("Summary of Correct Answers by Difficulty:")
final_summary

Summary of Correct Answers by Difficulty:


difficulty,claude-3-5-haiku-20241022_ec,claude-3-5-haiku-20241022,gemini-2.5-flash_ec,gemini-2.5-flash,o4-mini-2025-04-16_ec,o4-mini-2025-04-16,total_questions
Easy,3,2,4,5,5,4,5
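
Because the comparison rounds both values to two significant figures before testing equality, answers that differ only in later digits still count as correct. A short worked sketch, reusing the rounding formula from `round_to_nsf` as a standalone function:

```python
import math


def round_to_nsf(n: float, nsf: int = 2) -> float:
  """Rounds n to nsf significant figures (same formula as in the notebook)."""
  if n == 0:
    return 0
  return round(n, nsf - int(math.floor(math.log10(abs(n)))) - 1)


print(round_to_nsf(2990.609), round_to_nsf(2990.614))  # 3000.0 3000.0
print(round_to_nsf(34.5), round_to_nsf(34.474))        # 34.0 34.0
print(round_to_nsf(0.994))                             # 0.99
print(round_to_nsf(-0.002))                            # -0.002
```

Python's `round` rounds exact halves to the nearest even digit, which is why 34.5 becomes 34.0 here and why an answer of 34.474 would match an expected value of 34.5. Two significant figures is a loose tolerance; passing a larger `nsf` makes the check stricter.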
