andy/bump main to v0.3.2 #49

andy-neuma · 2024-02-23T13:35:30Z

[ROCm] add support to ROCm 6.0 and MI300 (vllm-project/vllm#2274)
Support for Stable LM 2 (vllm-project/vllm#2598)
Don't build punica kernels by default (vllm-project/vllm#2605)
AWQ: Up to 2.66x higher throughput (vllm-project/vllm#2566)
Use head_dim in config if exists (vllm-project/vllm#2622)
Implement custom all reduce kernels (vllm-project/vllm#2192)
[Minor] Fix warning on Ray dependencies (vllm-project/vllm#2630)
Speed up Punica compilation (vllm-project/vllm#2632)
Small async_llm_engine refactor (vllm-project/vllm#2618)
Update Ray version requirements (vllm-project/vllm#2636)
Support FP8-E5M2 KV Cache (vllm-project/vllm#2279)
Fix error when tp > 1 (vllm-project/vllm#2644)
No repeated IPC open (vllm-project/vllm#2642)
ROCm: Allow setting compilation target (vllm-project/vllm#2581)
DeepseekMoE support with Fused MoE kernel (vllm-project/vllm#2453)
Fused MOE for Mixtral (vllm-project/vllm#2542)
Fix 'Actor methods cannot be called directly' when using --engine-use-ray (vllm-project/vllm#2664)
Add swap_blocks unit tests (vllm-project/vllm#2616)
[Minor] Fix a small typo (Fix a small typo (tenosr -> tensor) vllm-project/vllm#2672)
[Minor] Fix false warning when TP=1 (vllm-project/vllm#2674)
Add quantized mixtral support (vllm-project/vllm#2673)
Bump up version to v0.3.0 (vllm-project/vllm#2656)
Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len (vllm-project/vllm#2688)
fix some bugs (fix some bugs about parameter description vllm-project/vllm#2689)
[Minor] Fix test_cache.py CI test failure (vllm-project/vllm#2684)
Add unit test for Mixtral MoE layer (vllm-project/vllm#2677)
Refactor Prometheus and Add Request Level Metrics (vllm-project/vllm#2316)
Add Internlm2 (vllm-project/vllm#2666)
Fix compile error when using rocm (vllm-project/vllm#2648)
fix python 3.8 syntax (vllm-project/vllm#2716)
Update README for meetup slides (vllm-project/vllm#2718)
Use revision when downloading the quantization config file (vllm-project/vllm#2697)
Remove hardcoded device="cuda" to support more devices (vllm-project/vllm#2503)
Fix default length_penalty to 1.0 (vllm-project/vllm#2667)
Add one example to run batch inference distributed on Ray (vllm-project/vllm#2696)
docs: fix langchain (docs: update langchain serving instructions vllm-project/vllm#2736)
set&get llm internal tokenizer instead of the TokenizerGroup (vllm-project/vllm#2741)
Remove eos tokens from output by default (vllm-project/vllm#2611)
Require triton >= 2.1.0 (vllm-project/vllm#2746)
[Minor] Fix benchmark_latency script (vllm-project/vllm#2765)
[ROCm] Fix some kernels failed unit tests (vllm-project/vllm#2498)
Set local logging level via env variable (vllm-project/vllm#2774)
[ROCm] Fixup arch checks for ROCM (vllm-project/vllm#2627)
Add fused top-K softmax kernel for MoE (vllm-project/vllm#2769)
modelscope: fix issue when model parameter is not a model id but path of the model. (vllm-project/vllm#2489)
[Minor] More fix of test_cache.py CI test failure (vllm-project/vllm#2750)
[ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support (vllm-project/vllm#2790)
Add documentation on how to do incremental builds (vllm-project/vllm#2796)
[Ray] Integration compiled DAG off by default (vllm-project/vllm#2471)
Disable custom all reduce by default (vllm-project/vllm#2808)
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (vllm-project/vllm#2768)
Add documentation section about LoRA (vllm-project/vllm#2834)
Refactor 2 awq gemm kernels into m16nXk32 (vllm-project/vllm#2723)
Serving Benchmark Refactoring (vllm-project/vllm#2433)
[CI] Ensure documentation build is checked in CI (vllm-project/vllm#2842)
Refactor llama family models (vllm-project/vllm#2637)
Revert "Refactor llama family models (vllm-project/vllm#2637)" (vllm-project/vllm#2851)
Use CuPy for CUDA graphs (vllm-project/vllm#2811)
Remove Yi model definition, please use LlamaForCausalLM instead (vllm-project/vllm#2854)
Add LoRA support for Mixtral (vllm-project/vllm#2831)
Migrate InternLMForCausalLM to LlamaForCausalLM (vllm-project/vllm#2860)
Fix internlm after vllm-project/vllm#2860 (vllm-project/vllm#2861)
[Fix] Fix memory profiling when GPU is used by multiple processes (vllm-project/vllm#2863)
Fix docker python version (vllm-project/vllm#2845)
Migrate AquilaForCausalLM to LlamaForCausalLM (vllm-project/vllm#2867)
Don't use cupy NCCL for AMD backends (vllm-project/vllm#2855)
Align LoRA code between Mistral and Mixtral (fixes vllm-project/vllm#2875) (vllm-project/vllm#2880)
[BugFix] Fix GC bug for LLM class (vllm-project/vllm#2882)
Fix DeciLM (Fix decilm.py vllm-project/vllm#2883)
[ROCm] Dockerfile fix for flash-attention build (vllm-project/vllm#2885)
Prefix Caching- fix t4 triton error (vllm-project/vllm#2517)
Bump up to v0.3.1 (vllm-project/vllm#2887)
Defensively copy sampling_params (vllm-project/vllm#2881)
multi-LoRA as extra models in OpenAI server (vllm-project/vllm#2775)
Add code-revision config argument for Hugging Face Hub (vllm-project/vllm#2892)
[Minor] Small fix to make distributed init logic in worker looks cleaner (vllm-project/vllm#2905)
[Test] Add basic correctness test (vllm-project/vllm#2908)
Support OLMo models. (vllm-project/vllm#2832)
Add warning to prevent changes to benchmark api server (vllm-project/vllm#2858)
Fix vllm:prompt_tokens_total metric calculation (vllm-project/vllm#2869)
[ROCm] include gfx908 as supported (vllm-project/vllm#2792)
[FIX] Fix beam search test (vllm-project/vllm#2930)
Make vLLM logging formatting optional (vllm-project/vllm#2877)
Add metrics to RequestOutput (vllm-project/vllm#2876)
Add Gemma model (vllm-project/vllm#2964)
Upgrade transformers to v4.38.0 (vllm-project/vllm#2965)
[FIX] Add Gemma model to the doc (vllm-project/vllm#2966)
[ROCm] Upgrade transformers to v4.38.0 (vllm-project/vllm#2967)
Support per-request seed (vllm-project/vllm#2514)
Bump up version to v0.3.2 (vllm-project/vllm#2968)
Add sparsity support based with magic_wand GPU kernels
Update README.md
Semi-structured 2:4 sparsity via SparseSemiStructuredTensor (#4)
Sparse fused gemm integration (#12)
Abf149/fix semi structured sparse (#16)
Enable bfloat16 for sparse_w16a16 (#18)
seed workflow (#19)
Add bias support for sparse layers (#25)
Use naive decompress for SM<8.0 (#32)
Varun/benchmark workflow (#28)
initial GHA workflows for "build test" and "remote push" (#27)
Only import magic_wand if sparsity is enabled (#37)
manually reverted requirements to match v0.3.2
reverted requirements
removed duplicate
format
added noqa to upstream scripts for linter
format
Sparsity fix (#40)
Rs/marlin downstream v0.3.2 (#43)
additional updates to "bump-to-v0.3.2" (#39)
move to 4 x gpu

Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>

Co-authored-by: roy <jasonailu87@gmail.com>

Co-authored-by: chen shen <scv119@gmail.com>

…e-ray` (vllm-project#2664) * fix: engine-useray complain * fix: typo

…uld respect prefix_len (vllm-project#2688) Signed-off-by: Tao He <sighingnow@gmail.com>

This version is for more model support. Add support for Gemma models (vllm-project#2964) and OLMo models (vllm-project#2832).

magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
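For readers unfamiliar with the format, 2:4 semi-structured sparsity means that at most two of every four consecutive weights are nonzero. The following is a minimal pure-Python sketch of that constraint only. `prune_2_4` and `is_2_4_sparse` are hypothetical helper names, not functions from magic_wand or vLLM, and choosing the survivors by magnitude is an assumption made for illustration.

```python
from typing import List

GROUP = 4  # size of each group in the 2:4 pattern
KEEP = 2   # maximum number of nonzeros allowed per group


def prune_2_4(row: List[float]) -> List[float]:
    """Zero all but the two largest-magnitude values in each group of four."""
    if len(row) % GROUP != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), GROUP):
        group = row[start:start + GROUP]
        # Positions of the KEEP largest-magnitude entries in this group.
        order = sorted(range(GROUP), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:KEEP])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_4_sparse(row: List[float]) -> bool:
    """Check that every group of four holds at most two nonzero values."""
    if len(row) % GROUP != 0:
        return False
    return all(
        sum(1 for v in row[start:start + GROUP] if v != 0.0) <= KEEP
        for start in range(0, len(row), GROUP)
    )
```

A real kernel would additionally store the kept values together with their in-group positions; that storage layout is not described in this PR and is left out here.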

Summary: Initial integration for the sparse-fused gemm. To achieve this, we need to ensure that we compress the weight matrix only once and never decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in `MergedColumnParallelLinear` and `QKVParallelLinear`, every time a new shard was loaded by the `weight_loader` (e.g., the "q" portion of `QKVParallelLinear`), we would decompress the tensor in order to use `narrow` to update the appropriate section of the weight tensor. With this change, `SparseParameter(SparseTensor)` is replaced with `LazyCompressedParameter`, which allows us to operate on `uncompressed_data` until we explicitly compress it. At that point, the `uncompressed_data` is compressed into `compressed_data` and freed.

Currently, the detection of when to call compress is somewhat hacky. For `QKVParallelLinear`, we compress only after inserting "q", "k", and "v" shard ids, and for `MergedColumnParallelLinear`, we compress once we've inserted the same number of shards as outputs (determined by `len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that `SparseTensor` no longer handles dispatching to the custom ops; instead, this is handled by `SparseW16A16LinearMethod`. I believe this is a positive change overall. `SparseTensor` was an unnecessary extra layer of abstraction/indirection originally designed for the SLoRA work, not vLLM.

This did result in the 2:4 sparse implementation breaking. However, it turns out it was already broken (i.e., it was decompressing and running dense within `SparseTensor`), so we "disable" it for now ("disable" meaning decompress and run dense instead). We should revisit all of this infrastructure post-MVP.

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
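The compress-once behavior described above can be illustrated with a small pure-Python sketch. It is not the PR's code: `LazyCompressed` and `QKVShardLoader` are hypothetical names, plain lists stand in for tensors, and the "compression" is a toy list of (index, value) pairs. Only the control flow is meant to match the description: shards are written into the dense buffer, and compression happens once, after "q", "k", and "v" have all been loaded.

```python
from typing import List, Optional, Set, Tuple


class LazyCompressed:
    """Hypothetical stand-in for the LazyCompressedParameter idea."""

    def __init__(self, size: int) -> None:
        self.uncompressed_data: Optional[List[float]] = [0.0] * size
        self.compressed_data: Optional[List[Tuple[int, float]]] = None

    def compress(self) -> None:
        if self.uncompressed_data is None:
            raise RuntimeError("already compressed; decompression is unsupported")
        # Toy compression: keep (index, value) pairs of the nonzero entries.
        self.compressed_data = [
            (i, v) for i, v in enumerate(self.uncompressed_data) if v != 0.0
        ]
        # Free the dense copy so it is never touched again.
        self.uncompressed_data = None


class QKVShardLoader:
    """Writes q/k/v shards into the dense buffer, then compresses once."""

    SHARD_IDS = ("q", "k", "v")

    def __init__(self, param: LazyCompressed, shard_size: int) -> None:
        self.param = param
        self.shard_size = shard_size
        self.loaded: Set[str] = set()

    def load_shard(self, shard_id: str, weights: List[float]) -> None:
        if self.param.uncompressed_data is None:
            raise RuntimeError("cannot load shards after compression")
        if len(weights) != self.shard_size:
            raise ValueError("unexpected shard size")
        # Analogous to narrowing the weight tensor to this shard's slice.
        offset = self.SHARD_IDS.index(shard_id) * self.shard_size
        self.param.uncompressed_data[offset:offset + self.shard_size] = weights
        self.loaded.add(shard_id)
        # Compress only after every shard id has been seen.
        if self.loaded == set(self.SHARD_IDS):
            self.param.compress()
```

The same counting idea would apply to the merged-column case, with the trigger being the number of loaded shards reaching `len(output_sizes)`.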

SUMMARY:
- Fix bug whereby 2:4 is not being invoked
- Eschew SparseTensor based implementation

TESTING:
- examples/offline_inference_semi_structured_sparse.py

Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>

SUMMARY
* add callable seed workflow for initial boundary testing

Co-authored-by: marcella-found <marcella.found@gmail.com>

A warning will be printed out if this case is triggered:
```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
Works on a T4 with:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4",
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```
Test within colab: https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
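The fallback described here amounts to a check on the GPU's compute capability. The sketch below shows one way such a dispatch could look. `select_sparse_kernel` and the returned labels are hypothetical and do not come from the PR; the capability tuple is passed in explicitly so the function runs without a GPU (on a real system it could be obtained from `torch.cuda.get_device_capability()`).

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)

# Assumed threshold, taken from the warning text ("SM < 8.0").
MIN_OPTIMIZED_CAPABILITY = (8, 0)


def select_sparse_kernel(capability: Tuple[int, int]) -> str:
    """Pick a kernel family from a (major, minor) compute capability."""
    if capability < MIN_OPTIMIZED_CAPABILITY:
        logger.warning(
            "Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. "
            "Naive decompress kernels will be used and can be slower than "
            "dense models"
        )
        return "naive_decompress"
    return "optimized"
```

A T4 reports compute capability 7.5, so it would take the naive-decompress path in this sketch.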

Add initial benchmark workflow

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

SUMMARY:
* initial set of "actions with a little a" that are the building blocks for eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical order

NOTE: this PR is still somewhat nebulous as I'm still working through building and testing "neuralmagic-vllm" in our automation environment.

TEST: currently, I'm working through various workflow components, i.e. "actions with a little a". The bits making up the actions in this PR have been constructed from my notes along the way. We can do a "complete" run that includes: linting, building, installing, and running tests.

GHA link ... https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
`testmo` ... https://neuralmagic.testmo.net/automation/runs/view/8097
Latest GHA link ... https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982

Co-authored-by: andy-neuma <andy@neuralmagic.com>

Tested by making sure magic_wand was uninstalled and this code for a dense model runs fine:
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/opt-125m-pruned2.4", enforce_eager=True)
```
Then testing with a sparse model run:
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/opt-125m-pruned2.4", sparsity="sparse_w16a16", enforce_eager=True)
```
output:
```
...
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/weight_utils.py", line 93, in get_sparse_config
    from vllm.model_executor.layers.sparsity import get_sparsity_config
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/layers/sparsity/__init__.py", line 6, in <module>
    raise ValueError(
ValueError: magic_wand is not available and required for sparsity support. Please install it with `pip install magic_wand`
```
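The behavior tested above (dense models never touch magic_wand, sparse models fail fast if it is missing) can be sketched with a guarded dependency check. This is an illustration only: `check_sparsity_dependency` is a hypothetical helper, not the code in `vllm/model_executor/layers/sparsity/__init__.py`, and the `module_name` parameter exists just to make the sketch testable.

```python
import importlib.util
from typing import Optional


def check_sparsity_dependency(
    sparsity: Optional[str], module_name: str = "magic_wand"
) -> bool:
    """Return True when sparsity is enabled and its dependency is importable.

    Dense runs (sparsity is None) return False without looking for the module.
    """
    if sparsity is None:
        return False
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ValueError(
            f"{module_name} is not available and required for sparsity "
            f"support. Please install it with `pip install {module_name}`"
        )
    return True
```

The design point is that the check is driven by the `sparsity` argument, so installing the optional package is only required on the code path that actually uses it.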

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>

SUMMARY
* update `TORCH_CUDA_ARCH_LIST` to match `magic_wand`
* update "test vllm" action to run tests serially
* add helper script to find *.py tests, run them serially, and output JUnit formatted xml

TEST
working through changes manually on debug instance

Co-authored-by: andy-neuma <andy@neuralmagic.com>
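The helper script itself is not shown on this page. As a rough sketch of the idea (find test files, run them one at a time, write one JUnit XML report per file), something like the following would work with pytest's `--junitxml` option. The function names, the `test_*.py` glob, and the report layout are all assumptions, not details taken from the PR.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def find_tests(test_dir: Path) -> List[Path]:
    """Collect test files in a stable order (glob pattern is an assumption)."""
    return sorted(test_dir.rglob("test_*.py"))


def build_command(test_file: Path, report_dir: Path) -> List[str]:
    """Build a pytest invocation that writes a JUnit XML report for one file."""
    report = report_dir / f"{test_file.stem}.xml"
    return [sys.executable, "-m", "pytest", str(test_file), f"--junitxml={report}"]


def run_serially(test_dir: Path, report_dir: Path) -> int:
    """Run each test file in its own process, one after another.

    Returns the number of files whose pytest run exited with a nonzero code.
    """
    report_dir.mkdir(parents=True, exist_ok=True)
    failures = 0
    for test_file in find_tests(test_dir):
        result = subprocess.run(build_command(test_file, report_dir))
        if result.returncode != 0:
            failures += 1
    return failures
```

Running each file in a separate process keeps GPU state from one test file from leaking into the next, which is one plausible reason to prefer serial execution here.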

andy-neuma · 2024-02-23T13:36:10Z

wrong command.

hongxiayang and others added 30 commits January 26, 2024 12:41

[ROCm] add support to ROCm 6.0 and MI300 (vllm-project#2274)

6b7de1a

Support for Stable LM 2 (vllm-project#2598)

3a0e1fc

Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Don't build punica kernels by default (vllm-project#2605)

390b495

AWQ: Up to 2.66x higher throughput (vllm-project#2566)

beb89f6

Use head_dim in config if exists (vllm-project#2622)

220a476

Implement custom all reduce kernels (vllm-project#2192)

3801700

[Minor] Fix warning on Ray dependencies (vllm-project#2630)

5f036d2

Speed up Punica compilation (vllm-project#2632)

f8ecb84

Small async_llm_engine refactor (vllm-project#2618)

89be30f

Update Ray version requirements (vllm-project#2636)

7d64841

Support FP8-E5M2 KV Cache (vllm-project#2279)

9090bf0

Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Fix error when tp > 1 (vllm-project#2644)

b72af8f

Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>

No repeated IPC open (vllm-project#2642)

1b20639

ROCm: Allow setting compilation target (vllm-project#2581)

ea8489f

DeepseekMoE support with Fused MoE kernel (vllm-project#2453)

5d60def

Co-authored-by: roy <jasonailu87@gmail.com>

Fused MOE for Mixtral (vllm-project#2542)

ab40644

Co-authored-by: chen shen <scv119@gmail.com>

Fix 'Actor methods cannot be called directly' when using `--engine-us…

d79ced3

…e-ray` (vllm-project#2664) * fix: engine-useray complain * fix: typo

Add swap_blocks unit tests (vllm-project#2616)

4f65af0

[Minor] Fix a small typo (vllm-project#2672)

bbe9bd9

[Minor] Fix false warning when TP=1 (vllm-project#2674)

105a40f

Add quantized mixtral support (vllm-project#2673)

3dad944

Bump up version to v0.3.0 (vllm-project#2656)

1af090b

Fixes assertion failure in prefix caching: the lora index mapping sho…

d69ff0c

…uld respect prefix_len (vllm-project#2688) Signed-off-by: Tao He <sighingnow@gmail.com>

fix some bugs (vllm-project#2689)

c664b0e

[Minor] Fix test_cache.py CI test failure (vllm-project#2684)

89efcf1

Add unit test for Mixtral MoE layer (vllm-project#2677)

d0d93b9

Refactor Prometheus and Add Request Level Metrics (vllm-project#2316)

93b38be

Add Internlm2 (vllm-project#2666)

cd9e60c

Fix compile error when using rocm (vllm-project#2648)

923797f

fix python 3.8 syntax (vllm-project#2716)

b9e96b1

WoosukKwon and others added 28 commits February 21, 2024 09:38

Upgrade transformers to v4.38.0 (vllm-project#2965)

c20ecb6

[FIX] Add Gemma model to the doc (vllm-project#2966)

a9c8212

[ROCm] Upgrade transformers to v4.38.0 (vllm-project#2967)

dc903e7

Support per-request seed (vllm-project#2514)

7d2dcce

Bump up version to v0.3.2 (vllm-project#2968)

8fbd84b

This version is for more model support. Add support for Gemma models (vllm-project#2964) and OLMo models (vllm-project#2832).

Add sparsity support based with magic_wand GPU kernels

7c4304b

Update README.md

5344a01

Abf149/fix semi structured sparse (#16)

7527b9c

SUMMARY:
- Fix bug whereby 2:4 is not being invoked
- Eschew SparseTensor based implementation

TESTING:
- examples/offline_inference_semi_structured_sparse.py

Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>

Enable bfloat16 for sparse_w16a16 (#18)

3c11f56

seed workflow (#19)

8147811

SUMMARY
* add callable seed workflow for initial boundary testing

Co-authored-by: marcella-found <marcella.found@gmail.com>

Add bias support for sparse layers (#25)

e802bc2

Varun/benchmark workflow (#28)

78ba5c1

Add initial benchmark workflow

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

manually reverted requirements to match v0.3.2

acf16bf

Merge branch 'main' into rs/bump-main-to-v0.3.2

dbf3cab

reverted requirements

0feedf9

removed duplicate

ce8164d

format

166c13b

added noqa to upstream scripts for linter

1b395b4

format

8d935be

Sparsity fix (#40)

acb8615

Rs/marlin downstream v0.3.2 (#43)

4b44479

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>

move to 4 x gpu

b1e14c2

andy-neuma closed this Feb 23, 2024
