What's Changed
- Full conversion to Python, with a new CLI and a new declarative specification language for experiment description.
- Plugin architecture makes adding new stages to the life cycle fluent and scalable for future features.
- User experience is enhanced with much more meaningful logging and message display.
- Extensive health checking during and at the end of the deployment.
- New standup method available: "Fast Model Actuation" (FMA)
- Fast Model Actuation (FMA) is a Kubernetes-native system for efficiently managing LLM inference servers that reduces model startup latency from minutes to seconds. FMA uses two techniques: vLLM sleep/wake, where model instances move tensors from GPU to CPU memory, freeing accelerator resources while keeping the process alive for rapid wake-up; and model swapping, where a persistent launcher process handles initialization upfront so instances can be swapped without full cold starts.
- Significant improvements to performance data collection, including relevant changes to the benchmark report.
- "Time-series" metrics in version 0.2 of the benchmark report now include both a statistical summary and a link to the raw collected data in CSV format.
- Tighter integration with Workload Variant Autoscaler (WVA), including the ability to deploy multiple models in the same namespace as defined within a scenario. In the same vein, one or more stacks in a scenario can be deployed and torn down based on user preference.
- Ability to provide different parameters for the `vllm` process on different `pods` (by using the `LeaderWorkerSet` (LWS) Kubernetes API).
- Allow filling in stack details from a YAML file from the harness pod.
- Assorted corrections and robustness improvements.
- The "capacity planner" and "configuration explorer" are now part of a new project: https://github.com/llm-d-incubation/llm-d-planner
- Strongly enhanced development constructs, including pre-commit and CI/CD, that safeguard existing library patterns and functionality.
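
To illustrate the "time-series" item above, here is a minimal sketch of how a series of samples could be reduced to a statistical summary while the raw data is kept in a CSV file. This is not the actual benchmark report v0.2 schema: the function names, the `summary` and `raw_data` keys, and the file path are all hypothetical and chosen only for illustration.

```python
import csv
import io
import statistics


def summarize_series(samples):
    """Return summary statistics for a list of (timestamp, value) samples."""
    values = [value for _, value in samples]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }


def to_csv(samples):
    """Serialize the raw samples as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "value"])
    writer.writerows(samples)
    return buf.getvalue()


# Hypothetical samples: (seconds since start, metric value).
samples = [(0, 10.0), (5, 14.0), (10, 12.0)]

# Hypothetical report entry: the summary is embedded, the raw data is linked.
entry = {
    "summary": summarize_series(samples),
    "raw_data": "metrics/example_metric.csv",  # assumed relative path
}
raw_csv = to_csv(samples)
```

Keeping only the summary inline and linking to the raw CSV keeps the report small while still letting readers recompute any statistic from the original samples.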
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @jjk-g
- @maugustosilva
New Contributors
- @DolevAdas made their first contribution in #742
- @michael-desmond made their first contribution in #853
- @jia-gao made their first contribution in #859
- @ruocco made their first contribution in #867
- @adinilfeld made their first contribution in #874
- @Luka-D made their first contribution in #917
- @forfreedomforrich-eng made their first contribution in #899
- @Copilot made their first contribution in #951
- @aavarghese made their first contribution in #995
What's Changed
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #778
- ⬆️ Bump yq from v4.45.4 to v4.45.5 by @github-actions[bot] in #748
- Fix logs for new vllm on nop harness by @manoelmarques in #781
- Add memory and cache metrics #2 by @DolevAdas in #742
- [Experimental] Add a new production trace replay for real-world multi-turn chat workflow by @achandrasekar in #761
- Update GAIE InferencePool v1.3.0 to v1.3.1 by @diegocastanibm in #830
- Fix partial metrics by @mengmeiye in #834
- update istio by @diegocastanibm in #840
- update vllm by @diegocastanibm in #837
- update yq by @diegocastanibm in #836
- update inferencemax by @diegocastanibm in #835
- update kgateway by @diegocastanibm in #839
- update helmfile to v1.4.1 by @diegocastanibm in #832
- update wva by @diegocastanibm in #838
- update inference-perf by @diegocastanibm in #833
- v0.5.3 tagged release by @diegocastanibm in #831
- [Standup] Add the ability to use initContainers. by @maugustosilva in #851
- [Standup] Additional fixes (accelerator automatic selection) by @maugustosilva in #852
- 🌱 Add missing governance files per CNCF audit by @clubanderson in #783
- Feat/small cluster config by @michael-desmond in #853
- [Standup] Consolidate all sim scenarios (with small gateway pod) by @maugustosilva in #856
- Fix metrics scrape by @mengmeiye in #854
- Fix standalone preprocess env. variable by @manoelmarques in #860
- Epp log scrape by @mengmeiye in #855
- [Run] Add --repeat flag to repeat experiments N times with aggregation by @jia-gao in #859
- remove accessLogging for helm chart schema validation error by @mengmeiye in #861
- workload, inference-perf: increase tokens in sanity check. by @ruocco in #867
- Stack discovery tool by @namasl in #762
- AI generated scenarios POC by @kalantar in #674
- [Standup] Fix for GKE with new `v0.5.1` `llm-d-cuda` image by @maugustosilva in #868
- [Run] Add pre/post workload hooks to run_only.sh by @jia-gao in #873
- Declarative Python Package by @Vezio in #848
- Use --serviceaccount value when creating model verification pod by @adinilfeld in #874
- [Docs] Add Note for Previous Library by @Vezio in #875
- feat: introduce harness namespace step in run sequence by @adinilfeld in #877
- fix pd-disaggregation by @mengmeiye in #878
- feat: Fix secret for monitoring epp by @Vezio in #879
- fix: Extract IP values through standard status.addresses object lookups by @adinilfeld in #883
- fix template for pd-disaggregation by @mengmeiye in #884
- fix: Configuration file concatenation bugs and Crane resolution fallback by @adinilfeld in #888
- docs: Add inline comments providing recommended storage classes by @adinilfeld in #887
- Fix: Re-Enable CICD via Kind Deployment for PRs by @Vezio in #885
- Remove unneeded config explorer components, consolidate analysis notebook by @namasl in #886
- feat: Split CI benchmark into parallel standalone and modelservice jobs by @Vezio in #889
- Fix harness metadata loss from subshell variable scoping by @Vezio in #880
- [Run] Updated trivy scanner version by @maugustosilva in #890
- Remove redundant metrics by @mengmeiye in #891
- Add GCS results metadata injection skill and standardize skills directory by @adinilfeld in #895
- Remove public IP address from Gemini skill by @adinilfeld in #896
- Auto-provision RBAC and enable pod-native auth for run-only mode by @adinilfeld in #894
- add metrics stat to benchmark report v0.2 by @mengmeiye in #897
- Enhance Smoketest Cleanup and Document ModelService Protocols by @Vezio in #901
- feature: Allows ModelService K8 Manifests to be Rendered BEFORE being Applied by @Vezio in #902
- fix: Removes Redundant Version References by @Vezio in #903
- fix: Render K8 Manifests for MS Early in Plan Phase by @Vezio in #907
- fix(standup): skip accelerator validators for CPU-only scenarios by @Vezio in #911
- fix the detection of whether a cluster is an OpenShift one by @mengmeiye in #913
- fix: Update CPU Scenario by @Vezio in #912
- fix: Standalone Rendering by @Vezio in #916
- fix: Brings Back Pre Commit and Updates Getting Started by @Vezio in #919
- fix capacity_validator on the number of accelerators by @mengmeiye in #920
- fix: helm-diff install uses verify=false flag by @Luka-D in #917
- add parser to replace model.name automatically by @mengmeiye in #918
- fix: Quickstart Guide Fix by @Vezio in #921
- Add description and keywords metadata to experiment config by @jia-gao in #898
- Update production-trace-replay-qwen.py by @forfreedomforrich-eng in #899
- fix issue 922 by @mengmeiye in #924
- auto render host and add VLLM_INFERENCE_PORT in default template by @mengmeiye in #929
- Add Fast Model Actuation Mode and Fix Standalone mode with nop harness by @manoelmarques in #900
- Feature: Added tuneable cli arguments for timeouts during standup and run by @Luka-D in #934
- Remove reference to custom image in fma and launcher specs by @manoelmarques in #933
- Phase config-explorer out of llm-d-benchmark and import llm-d-planner by @jgchn in #930
- deps(actions): bump actions/github-script from 7 to 9 by @dependabot[bot] in #935
- deps(actions): bump actions/upload-artifact from 7.0.0 to 7.0.1 by @dependabot[bot] in #936
- feat: Pull in AgentGateway and Re-Enable all Scenarios and LWS by @maugustosilva in #937
- fix: Fix Documentation for QuickStart by @Vezio in #939
- chore: bump llm-d-infra chart version v1.3.8 → v1.4.0 by @Copilot in #951
- chore: bump kgateway v2.1.1 → v2.2.3 by @Copilot in #949
- update curl image path by @mengmeiye in #954
- add support for context length aware router by @mengmeiye in #955
- [Standup] Ensure `simulated-accelerators` is in sync with guides by @maugustosilva in #956
- feat: Generate SBOM Automatically and Add Precommit to Installer by @Vezio in #957
- Monitor replicas and standup time by @mengmeiye in #958
- update metrics documentation by @mengmeiye in #959
- fix: Remove some Pre-Push Requirement and Fix Python Version by @Vezio in #960
- [Standup] Update istio and gaie by @maugustosilva in #961
- fix the bugs for configMap under sidecar and preprocess script by @mengmeiye in #964
- Fix spyre smoketest by @mengmeiye in #965
- fix bug when harness waitTimeout is defined in scenario by @mengmeiye in #986
- [Optional] Uv install by @mengmeiye in #963
- Upgrade FMA to next release v0.5.1-alpha.7 by @aavarghese in #995
- ignore pycache in dockerbuild by @mengmeiye in #994
- feat: Workload Variant Autoscaler Scenario and Infra Implementation by @Vezio in #999
- Additional `nop` harness metrics by @manoelmarques in #990
- feat: Add Check for PVC Creation Instead of Timing Out Only by @Vezio in #1001
- teardown leaderworkerset and statefulset by @mengmeiye in #1003
- Monitor startup time and replicas by @mengmeiye in #1002
- Add FMA mode to pull request CI by @manoelmarques in #1000
- fix spyre standalone by @mengmeiye in #1004
- deployment: add chat-template parameter by @ruocco in #1017
- Re-enabled all CI/CD workflows by @maugustosilva in #1018
- improve replica monitoring by @mengmeiye in #1019
- fix: Delay Metric Validation via Prom. Adapt. to Post Install by @Vezio in #1020
- fix: Removes Final Duplicate Reference to Benchmark Report by @Vezio in #1021
- fix: Bump Pckg Version for llmdbenchmark by @Vezio in #1024
- feat: Enable Flow Control for Inference-Scheduling + WVA Scenario by @Vezio in #1025
- add epp pool monitoring by @mengmeiye in #1026
- update spyre scenario by @mengmeiye in #1027
- ⚠️ Switch OCP nightly benchmark runners from platform-eval to pokprod01 by @clubanderson in #1028
- [cicd] Multiple updates to restore cicd (nightly) by @maugustosilva in #1030
- feat: llm-d Multi Model + WVA Enablement per Namespace by @Vezio in #1029
- [Standup] Consolidate versioning information by @maugustosilva in #1034
- make monitoring enabled as default for standup by @mengmeiye in #904
- feat: Persist WVA Resources on per-stack Teardown by @Vezio in #1035
- Remove -f in llmdbenchmark run by @mengmeiye in #1037
- Add FMA to nightly CI by @manoelmarques in #1036
- skip monitoring check for dry run by @mengmeiye in #1038
- Release v0.6.0 by @maugustosilva in #1039
Full Changelog: v0.5.0...v0.6.0