
Releases: neuralmagic/nm-vllm

v0.2.0

10 Apr 19:10
e752ec7

Key Features

This release is based on vllm==0.4.0.post1

  • New model architectures supported! DbrxForCausalLM, CohereForCausalLM (Command-R), JAISLMHeadModel, LlavaForConditionalGeneration (experimental vision LM), OrionForCausalLM, Qwen2MoeForCausalLM, StableLmForCausalLM, Starcoder2ForCausalLM, XverseForCausalLM
  • Automated benchmarking
  • Code coverage reporting
  • lm-evaluation-harness nightly accuracy testing
  • Layerwise Profiling for the inference graph (#124)
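
All of the newly added architectures load through the standard vLLM entrypoint. A minimal sketch, using a StableLM checkpoint as an illustrative (not release-note) model ID:

from vllm import LLM, SamplingParams

# Illustrative model ID only; any of the newly supported architectures
# can be loaded the same way through the standard entrypoint.
model = LLM("stabilityai/stablelm-2-1_6b", max_model_len=1024)

sampling_params = SamplingParams(max_tokens=50, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)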

What's Changed

New Contributors

Full Changelog: 0.1.0...0.2.0

v0.1.0

05 Mar 17:08
007ada5

Initial release of 🪄 nm-vllm 🪄

nm-vllm is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.

This release is based on vllm==0.3.2

Key Features

This first release focuses on our initial LLM performance contributions through support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and weight sparsity acceleration.

Model Inference with Marlin (4-bit Quantization)

Marlin is enabled automatically if a quantized model has the "is_marlin_format": true flag present in its quant_config.json

from vllm import LLM
model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))

Optionally, you can specify it explicitly by setting quantization="marlin".
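
A minimal sketch of the explicit form, reusing the Marlin model from above:

from vllm import LLM

# Pass quantization="marlin" explicitly instead of relying on auto-detection
# from the model's quantization config.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))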

Marlin Performance

Model Inference with Weight Sparsity

nm-vllm includes support for newly developed sparse inference kernels, which provide both memory savings and inference acceleration for sparse models.

Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:

from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

There is also support for semi-structured 2:4 sparsity using the sparsity="semi_structured_sparse_w16a16" argument:

from vllm import LLM, SamplingParams

model = LLM("nm-testing/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Sparse Memory Compression

Sparse Inference Performance

What's Changed

New Contributors

Full Changelog: https://github.com/neuralmagic/nm-vllm/commits/0.1.0