v0.3.0
[0.3.0] - 2025-01-18
Added
Parallel Execution
- Added parallel task execution with
num_workersparameter inBenchmark.run()usingThreadPoolExecutor(PR: #14) - Added
ComponentRegistryclass for thread-safe component registration with thread-local storage (PR: #14) - Added
TaskContextfor cooperative timeout checking withcheck_timeout(),elapsed,remaining, andis_expiredproperties (PR: #14) - Added
TaskProtocoldataclass withtimeout_seconds,timeout_action,max_retries,priority, andtagsfields for task-level execution control (PR: #14) - Added
TimeoutActionenum (SKIP,RETRY,RAISE) for configurable timeout behavior (PR: #14) - Added
TaskTimeoutErrorexception withelapsed,timeout, andpartial_tracesattributes (PR: #14) - Added
TASK_TIMEOUTtoTaskExecutionStatusenum for timeout classification (PR: #14)
Task Queue Abstraction
- Added
TaskQueueabstract base class with iterator interface for flexible task scheduling (PR: #14) - Added
SequentialQueuefor simple FIFO task ordering (PR: #14) - Added
PriorityQueuefor priority-based task scheduling usingTaskProtocol.priority(PR: #14) - Added
AdaptiveTaskQueueabstract base class for feedback-based adaptive scheduling withinitial_state(),select_next_task(remaining, state), andupdate_state(task, report, state)methods (PR: #14)
ModelAdapter Chat Interface
- Added
chat()method toModelAdapteras the primary interface for LLM inference, accepting a list of messages in OpenAI format and returning aChatResponseobject and accepting tools - Added
ChatResponsedataclass containingcontent,tool_calls,role,usage,model, andstop_reasonfields for structured response handling
AnthropicModelAdapter
- New
AnthropicModelAdapterfor direct integration with Anthropic Claude models via the official Anthropic SDK - Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
- Added
anthropicoptional dependency:pip install maseval[anthropic]
Benchmarks
- Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
Tau2Benchmark,Tau2Environment,Tau2User,Tau2Evaluatorcomponents for framework-agnostic evaluation (PR: #16)DefaultAgentTau2Benchmarkusing an agent setup closely resembeling to the original tau2-bench implementation (PR: #16)- Data loading utilities:
load_tasks(),ensure_data_exists(),configure_model_ids()(PR: #16) - Metrics:
compute_benchmark_metrics(),compute_pass_at_k(),compute_pass_hat_k()for tau2-style scoring (PR: #16) - Domain implementations with tool kits:
AirlineTools,RetailTools,TelecomToolswith full database simulation (PR: #16)
User
AgenticUserclass for users that can use tools during conversations (PR: #16)- Multiple stop token support:
Usernow acceptsstop_tokens(list) instead of singlestop_token, enabling different termination reasons (PR: #16) - Stop reason tracking:
Usertraces now includestop_reason,max_turns,turns_used, andstopped_by_userfor detailed termination analysis (PR: #16)
Simulator
AgenticUserLLMSimulatorfor LLM-based user simulation with tool use capabilities (PR: #16)
Examples
- Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)
Changed
Benchmark
Benchmark.agent_dataparameter is now optional (defaults to empty dict) (PR: #16)- Refactored
Benchmarkto delegate registry operations toComponentRegistryclass (PR: #) Benchmark.run()now accepts optionalqueueparameter (BaseTaskQueue) for custom task scheduling (PR: #14)
Task
Task.idis nowstrtype instead ofUUID. Benchmarks can provide human-readable IDs directly (e.g.,Task(id="retail_001", ...)). Auto-generates UUID string if not provided. (PR: #16)
Fixed
- Task reports now use
task.iddirectly instead ofmetadata["task_id"](PR: #16)