
Releases: neuralmagic/nm-vllm

v0.2.0

10 Apr 19:10
e752ec7

Key Features

This release is based on vllm==0.4.0.post1

  • New model architectures supported! DbrxForCausalLM, CohereForCausalLM (Command-R), JAISLMHeadModel, LlavaForConditionalGeneration (experimental vision LM), OrionForCausalLM, Qwen2MoeForCausalLM, StableLmForCausalLM, Starcoder2ForCausalLM, XverseForCausalLM
  • Automated benchmarking
  • Code coverage reporting
  • lm-evaluation-harness nightly accuracy testing
  • Layerwise Profiling for the inference graph (#124)
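
All of the newly added architectures load through the standard vLLM entrypoint. A minimal sketch, using a StableLM checkpoint as an illustrative (not release-note) model ID:

from vllm import LLM, SamplingParams

# Illustrative model ID only; any of the newly supported architectures
# can be loaded the same way through the standard entrypoint.
model = LLM("stabilityai/stablelm-2-1_6b", max_model_len=1024)

sampling_params = SamplingParams(max_tokens=50, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)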

What's Changed

New Contributors

Full Changelog: 0.1.0...0.2.0

v0.1.0

05 Mar 17:08
007ada5

Initial release of 🪄 nm-vllm 🪄

nm-vllm is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.

This release is based on vllm==0.3.2

Key Features

This first release focuses on our initial LLM performance contributions through support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and weight sparsity acceleration.

Model Inference with Marlin (4-bit Quantization)

Marlin is enabled automatically if a quantized model has the "is_marlin_format": true flag present in its quant_config.json

from vllm import LLM
model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))

Optionally, you can specify it explicitly by setting quantization="marlin".
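
A minimal sketch of the explicit form, reusing the Marlin model from above:

from vllm import LLM

# Pass quantization="marlin" explicitly instead of relying on auto-detection
# from the model's quantization config.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))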

Marlin Performance

Model Inference with Weight Sparsity

nm-vllm includes support for newly developed sparse inference kernels, which provide both memory savings and inference acceleration for sparse models.

Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:

from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

There is also support for semi-structured 2:4 sparsity using the sparsity="semi_structured_sparse_w16a16" argument:

from vllm import LLM, SamplingParams

model = LLM("nm-testing/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Sparse Memory Compression

Sparse Inference Performance

What's Changed

New Contributors

Full Changelog: https://github.com/neuralmagic/nm-vllm/commits/0.1.0