Release v0.11.0: Prefix caching, VLMs, BERT (embed, NER), FP8 · predibase/lorax

🎉 Enhancements

Add prefix caching by @tgaddair in #581
Add Llava Next (VLM) by @tgaddair in #586
Embedder Service v0 with FlashBert by @magdyksaleh in #385
Added eager prefill option by @tgaddair in #524
BERT NER support by @magdyksaleh in #531
Preload adapters during init by @tgaddair in #543
Add support for batching to embedder models by @tgaddair in #503
Bert to gpu by @magdyksaleh in #507
Add distilbert by @magdyksaleh in #508
feat: return usage in ChatCompletionStreamResponse by @GirinMan in #506
Added Gemma2 by @tgaddair in #530
Move kv cache allocation to router to ensure correct block allocation by @tgaddair in #545
Tokenize inputs in router by @tgaddair in #548
Add support for Llama 3 rotary embeddings by @tgaddair in #551
Apply chat template in router to properly validate input length by @tgaddair in #538
Allow eager_prefill to be set in Helm chart by @bdalal in #557
Support FP8 for Mistral by @ajtejankar in #559
Support FP8 for LLaMa by @ajtejankar in #562
Support classify batch by @magdyksaleh in #577
Add LongRoPE for serving Phi-3 by @huytuong010101 in #576
Add new agnostic health endpoint by @magdyksaleh in #588
Support FlashInfer for BERT by @tgaddair in #597
Speed up NER inference by @magdyksaleh in #598
Disable healthcheck tracing and add metrics to classify + classify_batch endpoints by @magdyksaleh in #603
Added launcher args for preloaded_adapter_source and backend by @tgaddair in #604
Parallelize tokenization for /classify_batch and remove block allocator for non-causal LMs by @tgaddair in #609
Support bge-base-en-v1.5 by @magdyksaleh in #593
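
As a rough illustration of the idea behind the prefix caching entry above, the sketch below shows how requests that share a leading run of tokens (for example, a common system prompt) can reuse cached blocks instead of recomputing them. It is a toy model and not LoRAX's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and the integer block ids are all hypothetical names chosen for this example.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, an assumption)


class ToyPrefixCache:
    """Toy model: maps each full-block token prefix to a pretend KV-cache block id."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], int] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self._blocks:
                break
            matched = end
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Register every full block of this sequence, keyed by its whole prefix."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._blocks.setdefault(tuple(tokens[:end]), len(self._blocks))


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]     # shared by both requests
first = system_prompt + [9, 10, 11]          # first request
second = system_prompt + [12, 13, 14, 15]    # second request, same prefix

cache.insert(first)
reused = cache.match(second)  # tokens of `second` that would skip prefill
print(f"reused {reused} of {len(second)} tokens")
```

Keying each block by the entire prefix that precedes it (rather than by the block's own tokens) is what makes a hit safe in this sketch: a block is reused only when everything before it is identical too.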

🐛 Bugfixes

Fix for the LM_HEAD issue by @ajtejankar in #475
fix: load tokenizer/config with trust_remote_code by @thincal in #476
Fix issue with Medusa batch load signature by @tgaddair in #492
Add missing dtypes for 8-bit KV cache by @flozi00 in #490
Fix quant cache OOM by @flozi00 in #494
Add retries on common session errors for the client by @gyanesh-mishra in #495
Revert AWQ to stable commit by @tgaddair in #498
Fixed phi-3 with Su Rotary Embedding by @tgaddair in #499
Fixed case where loaded lora adapter has no segments by @tgaddair in #510
Fix batching bug by @magdyksaleh in #513
Fix issue with GQA initialization for Qwen2 by @arnavgarg1 in #514
Disable fp8 kv cache for lovelace by @tgaddair in #520
Fix illegal memory access error when running Medusa LoRA and plain LoRAs in parallel by @ajtejankar in #525
Fix type-checking errors thrown by the new ruff version by @ajtejankar in #533
Fix Qwen-2 sliding_window config bug by @ajtejankar in #532
Infer dtype from model config when not explicitly specified by @arnavgarg1 in #534
Fix gemma2 by @Infernaught in #539
Fix compile bug causing models to error with 'lora' key not found by @ajtejankar in #547
Fix: short circuit download, load, offload for preloaded adapters by @tgaddair in #552
Fix the attention bug caused by upgrading vLLM by @ajtejankar in #555
Fix LM head interaction with Medusa by @tgaddair in #567
Fix adapter mask when using speculative decoding + LM head LoRA by @tgaddair in #570
Fix outlines compatibility with speculative decoding by @tgaddair in #578
Fix qwen lora by @magdyksaleh in #585
Fix classify and classify_batch for Python client by @tgaddair in #608
Fix NER entity merging by @magdyksaleh in #596
Fix class NER by @magdyksaleh in #602
Fix dependencies to address high urgency dependabot alerts by @magdyksaleh in #612

📝 Docs

docs: update development_env.md by @eltociear in #515
Doc updates for Medusa training by @arnavgarg1 in #544
Add "pbase" to adapter_source docstrings by @alexsherstinsky in #583
Add prerequisites to readme by @csabakecskemeti in #584

🔧 Maintenance

chore: update infer.rs by @eltociear in #487
start porting latest tgi by @flozi00 in #480
Bump client to v0.6.1 by @tgaddair in #496
Update Makefile-awq by @flozi00 in #493
hqq upgrades by @flozi00 in #491
try out an integration test workflow by @noyoshi in #516
no warm up by @magdyksaleh in #540
Update PyTorch, CUDA, vLLM, and Bitsandbytes by @ajtejankar in #553
Added missing nvidia-ml-py package by @tgaddair in #558
parse headers for errored requests by @noyoshi in #564
handle folders for predibase by @noyoshi in #565
enable mistral nemo by @magdyksaleh in #568
bump version by @noyoshi in #569
Install flashinfer in Docker by @tgaddair in #582
feat: pass the --no-cache-dir flag to pip in Dockerfiles to save space by @rajpratik71 in #587
Add missing configs by @magdyksaleh in #590
Address rust compiler warnings by @magdyksaleh in #589

New Contributors

@eltociear made their first contribution in #487
@ajtejankar made their first contribution in #475
@bdalal made their first contribution in #557
@rajpratik71 made their first contribution in #587
@csabakecskemeti made their first contribution in #584

Full Changelog: v0.10.0...v0.11.0
