Summary
RecipeDeleteMultipleRecipesWithConstraint can generate task instances where the user is asked to delete recipes using an ingredient, but there are no matching recipes in the initialized Broccoli database. In the attached retained run, the goal is:
Delete the recipes from Broccoli app that use spirulina in the directions.
The generated parameters contain ingredient: "spirulina" and row_objects: []. The 29 initialized noise_row_objects also do not contain spirulina in either title or directions. The agent therefore deleted nothing, and the native evaluator returned success: 1.0.
This is not the same as the mixed-case Parmesan bug where matching rows exist but are missed. This issue is about vacuous task generation: the task asks the agent to delete matching recipes, but the instance contains no matching target at all.
Expected behavior
For a deletion task class with n_rows = 3, generated instances should normally contain at least one target recipe to delete, preferably the requested number of target rows. If no matching recipes can be generated for a selected ingredient, the generator should choose a different ingredient or reject/regenerate the instance.
Alternatively, if empty-target tasks are intentionally allowed, the user-facing goal and metadata should make that explicit, because otherwise the benchmark can award success for doing nothing on a task that appears to require deletion.
Actual behavior in the attached run
native_evaluator_input.json records:
"ingredient": "spirulina",
"row_objects": []
A direct check of the retained noise_row_objects finds zero rows whose title or directions contain spirulina.
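The direct check can be reproduced with a few lines of Python. This is a minimal sketch, not the script used for the retained run: it assumes the parameters are available as a dict whose row_objects and noise_row_objects entries are dicts with title and directions keys, and the helper name count_ingredient_matches is hypothetical. The directions text in the example is illustrative only.

```python
from typing import Any, Mapping


def count_ingredient_matches(params: Mapping[str, Any], ingredient: str) -> int:
    """Counts rows whose title or directions mention the ingredient (case-insensitive)."""
    needle = ingredient.lower()
    # ASSUMPTION: both lists hold dicts with "title" and "directions" keys.
    rows = list(params.get("row_objects", [])) + list(params.get("noise_row_objects", []))
    return sum(
        1
        for row in rows
        if needle in str(row.get("title", "")).lower()
        or needle in str(row.get("directions", "")).lower()
    )


# Illustrative data only; these are not the retained run's 29 noise rows.
example_params = {
    "ingredient": "spirulina",
    "row_objects": [],
    "noise_row_objects": [
        {"title": "Beef Stir Fry", "directions": "Stir fry beef with vegetables."},
        {"title": "Chickpea Vegetable Soup", "directions": "Simmer chickpeas in broth."},
    ],
}

print(count_ingredient_matches(example_params, example_params["ingredient"]))  # 0
```

A count of zero across both lists means the instance has no recipe the agent could legitimately delete.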
The trajectory shows the agent inspected visible recipes such as Beef Stir Fry, Cauliflower Fried "Rice", Chicken Alfredo Pasta, and Chickpea Vegetable Soup, found no spirulina, deleted nothing, and marked the task complete. The official evaluator output still reports success: 1.0.
Likely root cause
In RecipeDeleteMultipleRecipesWithConstraint.generate_random_params, the generator starts with n_rows = cls.n_rows, then decrements n_rows on ValueError until it reaches zero:
targets = []
n_rows = cls.n_rows
while n_rows > 0:
  try:
    targets = sqlite_schema_utils.get_random_items(
        n_rows,
        _generate_random_recipe,
        replacement=False,
        filter_fn=lambda r: ingredient in r.directions.lower(),
    )
    break
  except ValueError:
    n_rows -= 1
return {
    sqlite_validators.ROW_OBJECTS: targets,
    sqlite_validators.NOISE_ROW_OBJECTS: noise,
    'ingredient': ingredient,
}
_COMMON_INGREDIENTS also includes exotic ingredients that are "likely not in the existing recipes", including spirulina. When such an ingredient is selected and no matching recipes exist, the loop can return an empty targets list. The task then becomes a no-op, and the deletion validator can pass because there are no target IDs to remove.
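The fall-through can be illustrated with a small self-contained simulation. This is a sketch under stated assumptions, not the AndroidWorld code: get_random_items_stub is a hypothetical stand-in that is assumed to raise ValueError when fewer than n items pass the filter, deletion_check is a simplified stand-in for the deletion validator, and POOL is illustrative directions text.

```python
import random


# Hypothetical stand-in for sqlite_schema_utils.get_random_items.
# ASSUMPTION: it raises ValueError when fewer than n items satisfy filter_fn.
def get_random_items_stub(n, pool, filter_fn):
    matches = [item for item in pool if filter_fn(item)]
    if len(matches) < n:
        raise ValueError(f"Only {len(matches)} matching items, need {n}.")
    return random.sample(matches, n)


# Illustrative directions text, not the benchmark's recipe data.
POOL = [
    "Saute garlic, then add rice.",
    "Grill the chicken and season with salt.",
    "Boil pasta and toss with butter.",
]


def pick_targets(ingredient, n_rows=3):
    """Mirrors the decrementing loop quoted above."""
    targets = []
    while n_rows > 0:
        try:
            targets = get_random_items_stub(
                n_rows, POOL, lambda d: ingredient in d.lower()
            )
            break
        except ValueError:
            n_rows -= 1
    return targets


def deletion_check(before_ids, after_ids, target_ids):
    """Simplified stand-in: passes when exactly the target IDs were removed."""
    return set(after_ids) == set(before_ids) - set(target_ids)


print(len(pick_targets("garlic")))  # 1: n_rows falls from 3 to 1
print(pick_targets("spirulina"))    # []: the loop exits with no targets
print(deletion_check([1, 2, 3], [1, 2, 3], []))  # True although nothing was deleted
```

With an empty target list, an untouched database satisfies the check, which matches the success: 1.0 observed in the retained run.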
Suggested fix
Reject empty target sets for this task class. For example:
if not targets:
  raise ValueError(
      f'Could not generate any target recipes for ingredient {ingredient!r}.'
  )
or retry ingredient selection until at least one target row exists:
for _ in range(max_attempts):
  ingredient = random.choice(_COMMON_INGREDIENTS)
  ingredient_lower = ingredient.lower()
  targets = ...
  if targets:
    break
else:
  raise ValueError('Could not generate a non-empty target set.')
If the intended invariant is exactly cls.n_rows targets, the generator should require len(targets) == cls.n_rows instead of silently reducing to fewer rows.
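One possible shape for the strict variant is sketched below. It is an illustration rather than a proposed patch: choose_ingredient_with_targets, find_matches, and the MATCHES table are hypothetical names and data, and the sketch tries each ingredient once in shuffled order instead of retrying random.choice up to max_attempts, so it terminates deterministically when no ingredient qualifies.

```python
import random


def choose_ingredient_with_targets(ingredients, find_matches, n_rows, rng=None):
    """Returns (ingredient, targets) with exactly n_rows targets, or raises ValueError."""
    rng = rng or random.Random()
    # Try every candidate ingredient at most once, in random order.
    for ingredient in rng.sample(list(ingredients), k=len(ingredients)):
        targets = find_matches(ingredient, n_rows)
        if len(targets) == n_rows:
            return ingredient, targets
    raise ValueError(f"No ingredient yields exactly {n_rows} target recipes.")


# Hypothetical lookup standing in for recipe generation plus filtering.
MATCHES = {
    "garlic": ["Garlic Rice", "Garlic Bread", "Garlic Soup"],
    "spirulina": [],
}


def find_matches(ingredient, n_rows):
    """Returns up to n_rows recipe titles that use the ingredient."""
    return MATCHES.get(ingredient, [])[:n_rows]


ingredient, targets = choose_ingredient_with_targets(
    ["spirulina", "garlic"], find_matches, n_rows=3
)
print(ingredient, len(targets))  # garlic 3
```

Raising instead of returning a shorter list makes a vacuous instance fail loudly at generation time rather than pass silently at evaluation time.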
Files in the evidence package
evidence/native_evaluator_input.json: shows ingredient: "spirulina" and row_objects: [].
evidence/native_evaluator_output.json: shows success: 1.0.
evidence/trajectory_steps.json: shows the agent inspected recipes, found no spirulina, and completed without deletion.
source/recipe.py: local copy of the relevant AndroidWorld task source.
source/sqlite_validators.py: local copy of the deletion validator used by the task.
RecipeDeleteMultipleRecipesWithConstraint_agent_b_spirulina_empty_target_issue.tar.gz