v0.11.0: Prefix caching, VLMs, BERT (embed, NER), FP8
🎉 Enhancements
- Add prefix caching by @tgaddair in #581
- Add Llava Next (VLM) by @tgaddair in #586
- Embedder Service v0 with FlashBert by @magdyksaleh in #385
- Added eager prefill option by @tgaddair in #524
- BERT NER support by @magdyksaleh in #531
- Preload adapters during init by @tgaddair in #543
- Add support for batching to embedder models by @tgaddair in #503
- Move BERT to GPU by @magdyksaleh in #507
- Add distilbert by @magdyksaleh in #508
- feat: return usage in ChatCompletionStreamResponse by @GirinMan in #506
- Added Gemma2 by @tgaddair in #530
- Move kv cache allocation to router to ensure correct block allocation by @tgaddair in #545
- Tokenize inputs in router by @tgaddair in #548
- Add support for Llama 3 rotary embeddings by @tgaddair in #551
- Apply chat template in router to properly validate input length by @tgaddair in #538
- Allow eager_prefill to be set in Helm chart by @bdalal in #557
- Support FP8 for Mistral by @ajtejankar in #559
- Support FP8 for Llama by @ajtejankar in #562
- Support classify batch by @magdyksaleh in #577
- Add LongRoPE support for serving Phi-3 by @huytuong010101 in #576
- Add new agnostic health endpoint by @magdyksaleh in #588
- Support FlashInfer for BERT by @tgaddair in #597
- Speed up NER inference by @magdyksaleh in #598
- Disable healthcheck tracing and add metrics to classify + classify_batch endpoints by @magdyksaleh in #603
- Added launcher args for preloaded_adapter_source and backend by @tgaddair in #604
- Parallelize tokenization for /classify_batch and remove block allocator for non-causal LMs by @tgaddair in #609
- Support bge-base-en-v1.5 by @magdyksaleh in #593
🐛 Bugfixes
- Fix for the LM_HEAD issue by @ajtejankar in #475
- fix: load tokenizer/config with trust_remote_code by @thincal in #476
- Fix issue with Medusa batch load signature by @tgaddair in #492
- Add missing dtypes for 8-bit KV cache by @flozi00 in #490
- Fix quant cache OOM by @flozi00 in #494
- Add retries on common session errors for the client by @gyanesh-mishra in #495
- Revert AWQ to stable commit by @tgaddair in #498
- Fixed phi-3 with Su Rotary Embedding by @tgaddair in #499
- Fixed case where loaded lora adapter has no segments by @tgaddair in #510
- fix batching bug by @magdyksaleh in #513
- Fix issue with GQA initialization for Qwen2 by @arnavgarg1 in #514
- Disable fp8 kv cache for lovelace by @tgaddair in #520
- Fix illegal memory access error when running Medusa LoRA and plain LoRAs in parallel by @ajtejankar in #525
- bug: fix the type checking errors thrown by the new ruff version by @ajtejankar in #533
- bug: fix Qwen-2 sliding_window config bug by @ajtejankar in #532
- Infer dtype from model config when not explicitly specified by @arnavgarg1 in #534
- Fix gemma2 by @Infernaught in #539
- Fix: compile bug causing models to error with 'lora' key not found by @ajtejankar in #547
- Fix: short circuit download, load, offload for preloaded adapters by @tgaddair in #552
- Fix the attention bug caused by upgrading vLLM by @ajtejankar in #555
- Fix LM head interaction with Medusa by @tgaddair in #567
- Fix adapter mask when using speculative decoding + LM head LoRA by @tgaddair in #570
- Fix outlines compatibility with speculative decoding by @tgaddair in #578
- Fix qwen lora by @magdyksaleh in #585
- Fix classify and classify_batch for Python client by @tgaddair in #608
- Fix NER entity merging by @magdyksaleh in #596
- Fix class ner by @magdyksaleh in #602
- Fix dependencies to address high urgency dependabot alerts by @magdyksaleh in #612
📝 Docs
- docs: update development_env.md by @eltociear in #515
- Doc updates for Medusa training by @arnavgarg1 in #544
- Add "pbase" to adapter_source docstrings by @alexsherstinsky in #583
- Add prerequisites to readme by @csabakecskemeti in #584
🔧 Maintenance
- chore: update infer.rs by @eltociear in #487
- Start porting latest TGI by @flozi00 in #480
- Bump client to v0.6.1 by @tgaddair in #496
- Update Makefile-awq by @flozi00 in #493
- hqq upgrades by @flozi00 in #491
- try out an integration test workflow by @noyoshi in #516
- no warm up by @magdyksaleh in #540
- Update PyTorch, CUDA, vLLM, and Bitsandbytes by @ajtejankar in #553
- Added missing nvidia-ml-py package by @tgaddair in #558
- parse headers for errored requests by @noyoshi in #564
- handle folders for predibase by @noyoshi in #565
- enable mistral nemo by @magdyksaleh in #568
- bump version by @noyoshi in #569
- Install flashinfer in Docker by @tgaddair in #582
- feat: pass --no-cache-dir flag to pip in Dockerfiles to save space by @rajpratik71 in #587
- Add missing configs by @magdyksaleh in #590
- Address rust compiler warnings by @magdyksaleh in #589
New Contributors
- @eltociear made their first contribution in #487
- @ajtejankar made their first contribution in #475
- @bdalal made their first contribution in #557
- @rajpratik71 made their first contribution in #587
- @csabakecskemeti made their first contribution in #584
Full Changelog: v0.10.0...v0.11.0