Release Summary: Version 0.6.0
This release introduces significant capabilities for multimodal benchmarking, OTel/Weka trace & session replays, and major improvements to streaming measurement correctness and robustness.
1. Major Features
Multimodal Benchmarking & Vision Datasets
- Multimodal Payloads: Added native support for generating and benchmarking image, video, and audio inputs (#450, #477).
- VisionArena Dataset: Integrated the VisionArena dataset/generator format out of the box to validate multimodal visual reasoning workflows (#525).
- Prefix Caching for Media: Fixed cache key hashing for media inputs in shared-prefix workloads (#485).
Advanced Session & Trace Replays (OTel & Weka)
- Trace Replay Engines: Added support for replaying OpenTelemetry (OTel) and Weka JSON transaction traces (#468, #550).
- Agentic Session Features: Added tool-call simulation, session-to-worker affinity, and session context recovery to reproduce the behavior of real-world multi-turn conversational agents.
- Filtering & Personalization: Enabled lambda-based query filters for trace records (#507) and injected custom headers for multi-tenant and session routing (#504, #523).
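As a rough illustration of what a lambda-based filter over trace records looks like, the sketch below applies a predicate to a list of records. The record fields (`session_id`, `model`, `input_tokens`) and the `filter_records` helper are hypothetical and are not taken from the inference-perf API or configuration schema.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical record shape for illustration; not the tool's actual schema.
TraceRecord = Dict[str, Any]


def filter_records(
    records: Iterable[TraceRecord],
    predicate: Callable[[TraceRecord], bool],
) -> List[TraceRecord]:
    """Keep only the records for which the predicate returns True."""
    return [record for record in records if predicate(record)]


records = [
    {"session_id": "a", "model": "m1", "input_tokens": 120},
    {"session_id": "b", "model": "m2", "input_tokens": 4000},
    {"session_id": "a", "model": "m1", "input_tokens": 900},
]

# Replay only short prompts that were sent to model "m1".
selected = filter_records(
    records,
    lambda r: r["model"] == "m1" and r["input_tokens"] < 1000,
)
```

Expressing the filter as a predicate keeps the selection logic independent of where the records were loaded from.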
2. Metrics, Correctness & Streaming Enhancements
- Server-Sourced Token Counts: Added server-sourced usage statistics (e.g. prompt cached tokens, completion tokens) to eliminate discrepancies from client-side re-tokenization (#473, #565).
- SSE Stream Correctness: Addressed inter-token latency (ITL) deflation caused by leading BOS tokens (#566) and resolved SSE stream parsing issues for failed request bodies (#495, #530).
- CLI Report Enhancements: Added session-level stats and aggregate summaries to the standard CLI table layout (#493).
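To see why stray BOS tokens deflate ITL, consider a minimal sketch in which the client tokenizes each streamed chunk separately and the tokenizer prepends a BOS token every time. This mechanism is an assumption made for illustration, and the tokenizer, the `mean_itl` helper, and the timings are all hypothetical rather than the code changed in #566.

```python
from typing import List

BOS = "<s>"  # hypothetical beginning-of-sequence marker


def toy_tokenize(text: str) -> List[str]:
    """Toy whitespace tokenizer that always prepends BOS (illustrative assumption)."""
    return [BOS] + text.split()


def mean_itl(chunk_times: List[float], chunk_texts: List[str], drop_bos: bool) -> float:
    """Mean ITL: time from first to last chunk divided by (tokens counted - 1)."""
    total_tokens = 0
    for text in chunk_texts:
        tokens = toy_tokenize(text)
        if drop_bos and tokens and tokens[0] == BOS:
            tokens = tokens[1:]  # discard the per-chunk BOS before counting
        total_tokens += len(tokens)
    if total_tokens < 2:
        raise ValueError("need at least two tokens to compute ITL")
    return (chunk_times[-1] - chunk_times[0]) / (total_tokens - 1)


# Four chunks of one word each, arriving 100 ms apart.
times = [0.0, 0.1, 0.2, 0.3]
texts = ["Hello", "world", "from", "stream"]

naive_itl = mean_itl(times, texts, drop_bos=False)     # counts 8 tokens
corrected_itl = mean_itl(times, texts, drop_bos=True)  # counts 4 tokens
```

In this toy case the naive count doubles the number of tokens, so the reported ITL drops from roughly 100 ms to roughly 43 ms even though the stream itself is unchanged.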
3. Bug Fixes & Correctness
- Distribution Types Fix: Fixed RandomDataGenerator and SyntheticDataGenerator to respect configured distribution types (e.g., fixed, uniform) instead of always defaulting to a normal distribution (#572).
- RNG Synchronization: Fixed RNG synchronization issues across multiple subprocess workers to guarantee workload determinism (#539).
- Prefix Off-by-One: Composed shared prompt prefixes at the token level rather than by string slicing, avoiding off-by-one token-count mismatches across vocabularies (#491).
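The distribution and determinism fixes can be pictured with a small sketch: a length sampler that branches on the configured distribution type and draws from a seeded generator. The function name, parameters, and distribution labels below are hypothetical and chosen for illustration; they are not the signatures of RandomDataGenerator or SyntheticDataGenerator.

```python
import random
from typing import List


def sample_lengths(
    dist_type: str,
    mean: int,
    std_dev: float,
    lo: int,
    hi: int,
    n: int,
    seed: int,
) -> List[int]:
    """Draw n token lengths according to dist_type, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so results depend only on the seed
    if dist_type == "fixed":
        return [mean] * n
    if dist_type == "uniform":
        return [rng.randint(lo, hi) for _ in range(n)]
    if dist_type == "normal":
        # Clamp each draw into [lo, hi] so that lengths stay valid.
        return [min(hi, max(lo, round(rng.gauss(mean, std_dev)))) for _ in range(n)]
    raise ValueError(f"unknown distribution type: {dist_type}")


fixed = sample_lengths("fixed", mean=100, std_dev=10.0, lo=50, hi=150, n=5, seed=0)
uniform_a = sample_lengths("uniform", mean=100, std_dev=10.0, lo=50, hi=150, n=5, seed=42)
uniform_b = sample_lengths("uniform", mean=100, std_dev=10.0, lo=50, hi=150, n=5, seed=42)
```

Giving each sampler its own seeded generator, instead of relying on shared global state, is one common way to keep workloads reproducible across worker processes.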
4. Developer Experience & Cleanups
- Config Restructuring: Segmented config files for cleaner imports and validation (#505).
- Package Reorganizations: Relocated and consolidated logging and distribution utility components under structured namespaces (#541, #544).
Docker Image
quay.io/inference-perf/inference-perf:v0.6.0
Python Package
pip install inference-perf==v0.6.0
What's Changed
- Re-prime conversation_replay sessions after clear_instances() by @kaushikmitr in #466
- Add Multimodal Benchmarking by @Bslabe123 in #450
- feat: tool-call replay, HuggingFace trace loading, and session replay hardening by @alonh in #468
- Increase context length on code-generation use case by @achandrasekar in #470
- fix: per-group cache_key for shared_prefix multimodal bytes by @Bslabe123 in #485
- Make per-stage progress visible in non-TTY logs by @Bslabe123 in #487
- Log synthetic datagen materialization progress by @Bslabe123 in #496
- Surface prompt cache token metrics in lifecycle summary by @MikeTomlin19 in #473
- Payload specs and measurement rules for each (media, provenance) pair by @Bslabe123 in #477
- Add session-level metrics to CLI summary tables by @alonh in #493
- fix: _record_otel_metrics() correctly parse SSE streaming responses by @alonh in #495
- fix: shared_prefix off-by-one by composing at token level by @Bslabe123 in #491
- Fixed failing e2e test for nix pdm sync error by @tico88612 in #521
- Analyze charts font size increase and paper update by @SachinVarghese in #513
- Add K8s Slack invitation link by @tico88612 in #510
- Split config file by @Bslabe123 in #505
- Feat: Add filtering support for OTel trace replay across all data sources by @lenadankin in #507
- regenerate system prompts per stage in conversation replay by @zetxqx in #480
- feat: add reasoning output support for OTEL trace replay by @oritht in #499
- Copy-editing of article text by @logological in #532
- Copy-editing of bibliography by @logological in #533
- Inject session identity header for session replay requests by @pavanipenumalla in #504
- Add Support for VisionArena Dataset by @Bslabe123 in #525
- Support multi-tenant headers and OTel mapping by @LukeAVanDrie in #523
- fix: RNG state synchronization across multiworkers by @changminbark in #539
- cleanup: move logger.py to /observability/logging by @Bslabe123 in #541
- Address JOSS paper reviewer feedback: clarify extensions sentence, co… by @jjk-g in #534
- fix: preserve partial response body when a streaming request fails by @Bslabe123 in #530
- docs(otel-replay): update for accuracy and feature coverage by @alonh in #536
- Add weka trace replay support by @achandrasekar in #550
- fix: catch TimeoutError in predecessor wait to prevent session hang by @pavanipenumalla in #555
- Add SLOs to workload catalog by @namasl in #554
- Test/optional live tier by @Bslabe123 in #529
- cleanup: Move utils/distribution.py to utils/numeric/distribution by @Bslabe123 in #544
- Skip code coverage check on markdown-only changes by @jjk-g in #557
- Addressed inflated streamed output_len by adding server-sourced output_tokens metric by @Bslabe123 in #565
- Address ITL deflation from per-chunk BOS in streamed timing by @Bslabe123 in #566
- Fix: Randomize prompt prefixes in SyntheticDataGenerator by @jjk-g in #570
- fix(datagen): respect configured distribution types in Random/Synthetic generators by @jjk-g in #572
New Contributors
- @MikeTomlin19 made their first contribution in #473
- @tico88612 made their first contribution in #521
- @oritht made their first contribution in #499
- @logological made their first contribution in #532
- @pavanipenumalla made their first contribution in #504
Full Changelog: v0.5.0...v0.6.0