Upstream merge jan09 #151

Merged · 275 commits · Jan 9, 2024

Conversation

@masahi (Member) commented on Jan 9, 2024

@vinx13 Please verify that Mixtral support is not broken.

MasterJH5574 and others added 30 commits October 15, 2023 11:02
PR mlc-ai#1048 updated the signature of softmax in the built model library
and changed the temperature buffer shape in ChatModule. This caused
some existing demos to fail to run, since we did not do a round of model
library updates.

This PR reverts the ChatModule change and adds back the softmax
function in the non-batching case. With this PR, the regression should
be fixed.
…ai#1074)

This PR lifts the device string parsing (just a few lines) into a
standalone function, so that the serving side can make use of it as
well.

Tested the Python API; it does not seem to introduce a regression.
The pass `fuse-split-rotary` assumes the compute dtype is fp16, which
it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the
compute is based on fp32 instead. This PR strengthens the guard for this check.
This PR establishes the compiler components in MLC-Chat Python API,
which currently includes two primary components: models and parameters.

The models component is the `nn.Module`-based definition of an LLM which,
as a very first stab, contains only `LlamaForCasualLM`. It is decomposed into
three files:
- `llama_config.py`: common configurations for Llama, where we define
  relevant configurations of its architecture, as well as include
  standard config file for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like
`nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and
  PyTorch parameters.

The parameters component contains the basic functionality of parameter
mapping, and the loaders that effectively convert parameters from PyTorch
to MLC according to the specified mapping. Currently, only `HFTorchLoader`
is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite
straightforward to add given the existing design.
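
To make the mapping idea concrete, here is a purely illustrative, NumPy-based sketch of mapping-driven conversion. The `ParameterMapping` dataclass, the `convert` helper, and the example parameter names are hypothetical, not the actual `HFTorchLoader` API (which operates on torch checkpoints and hooks into quantization):

```python
# Illustrative sketch only; all names here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import numpy as np


@dataclass
class ParameterMapping:
    """Maps one MLC parameter to the PyTorch parameters it is derived from."""

    mlc_name: str
    torch_names: List[str]
    # How to combine the source tensors into the MLC tensor (default: identity).
    transform: Callable[..., np.ndarray] = field(default=lambda *xs: xs[0])


def convert(torch_params: Dict[str, np.ndarray],
            mappings: List[ParameterMapping]) -> Dict[str, np.ndarray]:
    """Produce MLC parameters from a dict of PyTorch weights per the mapping."""
    mlc_params = {}
    for m in mappings:
        sources = [torch_params[name] for name in m.torch_names]
        mlc_params[m.mlc_name] = m.transform(*sources)
    return mlc_params


# Example: fuse separate q/k/v projections into one MLC-side tensor.
mappings = [
    ParameterMapping(
        "model.layers.0.attn.qkv_proj.weight",
        ["model.layers.0.self_attn.q_proj.weight",
         "model.layers.0.self_attn.k_proj.weight",
         "model.layers.0.self_attn.v_proj.weight"],
        transform=lambda q, k, v: np.concatenate([q, k, v], axis=0),
    )
]
```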

On top of this PR, on-the-fly quantization could be defined as a loading
time transformation on MLC parameters, while pre-quantized parameter
loading is effectively parameter loading after MLC's `nn.Module` is
quantized.

Two unit tests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module`
using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load
parameters from HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions; it
currently contains two utilities:
- `config.py`, which supports reading configurations into dataclasses
  from a JSON file or Python dict. On top of the Python dataclass, it
  stashes irrelevant fields into `cls.kwargs`, which is helpful when
  loading HuggingFace configuration files (a rough sketch of the idea
  follows this list);
- `tqdm.py` which contains tqdm-related utilities, primarily redirecting
logging and printing to work nicely with tqdm.
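
As an illustration of the `config.py` idea, here is a rough sketch, not the actual `mlc_chat.support.config` code, of building a dataclass from a dict while stashing unknown keys into `kwargs`; `config_from_dict` and `TinyConfig` are made-up names:

```python
import dataclasses
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


def config_from_dict(cls: Type[T], src: Dict[str, Any]) -> T:
    """Instantiate a dataclass from a dict, stashing unknown keys in `kwargs`."""
    known = {f.name for f in dataclasses.fields(cls)}
    picked = {k: v for k, v in src.items() if k in known and k != "kwargs"}
    extra = {k: v for k, v in src.items() if k not in known}
    obj = cls(**picked)
    obj.kwargs = extra  # irrelevant fields kept around, e.g. from a HuggingFace config.json
    return obj


@dataclasses.dataclass
class TinyConfig:  # hypothetical config for illustration
    hidden_size: int
    num_hidden_layers: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)


cfg = config_from_dict(TinyConfig, {"hidden_size": 4096, "num_hidden_layers": 32, "rope_theta": 1e6})
assert cfg.kwargs == {"rope_theta": 1e6}
```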
…ages (mlc-ai#1086)

* Support the lib_path option in the C++ CLI. Disable the ChatConfig.model_lib override in the Python API. Improve helper and error messages.

* Update docs

* Rename lib_path -> model_lib_path
Co-authored-by: Varshith <varshith.bathini@sprinklr.com>
[Format] Apply isort and black on `python/`

The commands I am using are:

```bash
isort --profile black python/
black python/
```

It is always recommended to format the code before submission, given we
don't have a linter CI yet.
This PR enables two Python formatters, "black" and "isort", on the following directories:
- `./python/`
- `./tests/python/`

Enabling pylint and mypy is left for future work.
Add pylint/mypy tooling into pyproject.toml

This PR establishes the initial Python tooling infra with Pylint and
Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and
`mlc_chat.compiler`, are covered; we expect to cover the entire
package, as tracked in mlc-ai#1101.
…1052)

Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a
single function.  This commit modifies it to instead be a transform
operating on any pattern matches within an `IRModule`.
…#1056)

* [ParamManager] Use BundleModelParams for transform_quantize

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on an
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

* Correct type annotation
fix error introduced by recent code changes

fixes mlc-ai#1116
…lc-ai#1119)

* Add doc for max and mean gen len, shift factor

* Update python docs for BuildArgs
mlc-ai#1120)

Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)"

This reverts commit e5927ce.

This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment)
This PR removes an inaccurate warning from mlc-ai#1086, which warned about
`model_lib` being overridden regardless of whether or not it was actually
overridden. With this commit, we only warn if its value is not None.
* Add presence and frequency penalties (a generic sketch of how such penalties apply to logits follows this list)

* Add support for passing conversation history in the /v1/chat/completions endpoint

* Add support for the REST API parameters max_gen_len, n, and stop_str

* Add presence and frequency penalties to the generation config; refactor the generation config

* Add documentation for the parameters

* Replace lib_path with model_lib_path in rest.py

* Fix black/isort issues

* Fix lib_path
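
For context, presence and frequency penalties are conventionally applied to the logits of tokens that have already been generated (the OpenAI-style definition). The snippet below is a generic sketch of that convention, not the MLC runtime implementation:

```python
from collections import Counter
from typing import List

import numpy as np


def apply_penalties(logits: np.ndarray, generated: List[int],
                    presence_penalty: float, frequency_penalty: float) -> np.ndarray:
    """Subtract a fixed amount for any token already seen (presence) plus an
    amount proportional to how often it was seen (frequency)."""
    out = logits.copy()
    for token_id, count in Counter(generated).items():
        out[token_id] -= presence_penalty + frequency_penalty * count
    return out
```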
…lc-ai#1127)

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on an
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

This commit is a repeat of the reverted PR
mlc-ai#1056.  This PR resolves the bug
in the earlier implementation by removing the call to
`.without_attr("num_input")` in `ParamReplacer.rewrite_func`.  This
follows an analogous update in `LiftTransformParams`, preserving the
`"num_input"` attribute for use in `BundleModelParams`.
The 32-bit version of the zstd.dll library was causing issues, so the doc was updated to be more specific and point to the 64-bit version.
CharlieFRuan and others added 29 commits December 29, 2023 12:08
Integrate fused RoPE into the gpt_neox and phi models.

Add an optional parameter `rotary_dim` to `llama_rope`. `rotary_dim` indicates the number of dimensions in the embedding that RoPE is applied to. By default, `rotary_dim` is the same as `head_dim`. In the `Phi` model, `rotary_dim` is set to a different number based on the config.
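
For illustration, a NumPy sketch of partial RoPE, where only the first `rotary_dim` dimensions of each head are rotated and the rest pass through. The rotate-half pairing and base `theta` here are assumptions, not necessarily the exact convention used by `llama_rope`:

```python
import numpy as np


def partial_rope(x: np.ndarray, position: int, rotary_dim: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate only the first `rotary_dim` dims of a (head_dim,) vector; the rest pass through."""
    rot, rest = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    freqs = theta ** (-2.0 * np.arange(half) / rotary_dim)   # per-pair inverse frequencies
    cos, sin = np.cos(position * freqs), np.sin(position * freqs)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```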
This PR addresses a package name conflict introduced by mlc-ai#1502,
where `mlc_chat.operator` collides with Python's built-in `operator`
module.

More details:
mlc-ai#1502 (comment).
A minor path fix in the Android docs, as the file `prepare_libs.sh` is
under the `library` folder.
…-ai#1522)

This PR introduces an environment variable `MLC_JIT_POLICY` as a
follow-up item to PR [mlc-ai#1508](mlc-ai#1508 (comment)).
It allows enabling/disabling the JIT behavior:
- `OFF`: never JIT; an error is thrown if `model_lib` is missing;
- `ON` (default): JIT whenever the model lib is missing and there is
  a cache miss;
- `REDO`: whenever the model lib is missing, always do JIT
  compilation, even on a cache hit;
- `READONLY`: never do JIT compilation, but look up the JIT cache
  whenever the model lib is missing.
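
A minimal sketch of how such a policy switch could be consulted; the helper below and its arguments are hypothetical, and the real JIT/cache logic in mlc_chat is more involved:

```python
import os


def decide_jit(model_lib_missing: bool, cache_hit: bool) -> bool:
    """Hypothetical sketch: decide whether to JIT-compile under MLC_JIT_POLICY."""
    policy = os.environ.get("MLC_JIT_POLICY", "ON").upper()
    if not model_lib_missing:
        return False  # an explicit model_lib always wins
    if policy == "OFF":
        raise RuntimeError("model_lib is missing and MLC_JIT_POLICY=OFF")
    if policy == "ON":
        return not cache_hit          # JIT only on a cache miss
    if policy == "REDO":
        return True                   # always recompile, even on a cache hit
    if policy == "READONLY":
        if not cache_hit:
            raise RuntimeError("JIT cache miss under MLC_JIT_POLICY=READONLY")
        return False                  # use the cached lib, never compile
    raise ValueError(f"Unknown MLC_JIT_POLICY: {policy}")
```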

It also dissolves the newly introduced `JITOption` into `ChatConfig` so
that it can be used more seamlessly with the existing APIs.
By doing so, users can simply specify `context_window_size` and
`prefill_chunk_size` to control the VRAM used by each model without
having to recompile the model lib themselves.

Example: if one focuses on developing the compiler/runtime rather than
quantization, one can simply run

```bash
MLC_JIT_POLICY=REDO python main.py
```

to test if the compiler/runtime work smoothly together, where `main.py`
is:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()
MODEL="HF://junrushao/Llama-2-7b-chat-hf-q4f16_1-MLC",

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
* Add support for loading weights from a safetensor file

* Set pylint to ignore the import error

* Move pylint-disable line

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

---------

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
This PR introduces a command that reports the estimated upper-bound
memory usage based on the metadata section of an SLM-compiled model.

Example:

```bash
>> python -m mlc_chat.cli.model_metadata /path/to/model_lib.so --memory-only
[2023-12-31 18:40:43] INFO model_metadata.py:49: Parameter size: 3885.14 MB
[2023-12-31 18:40:43] INFO model_metadata.py:58: Temporary buffer size: 7184.15 MB
[2023-12-31 18:40:43] INFO model_metadata.py:71: KVCache size when context/sliding window size is 4096: 512.00 MB
[2023-12-31 18:40:43] INFO model_metadata.py:79: Total memory usage: 11581.29 MB
[2023-12-31 18:40:43] INFO model_metadata.py:84: Tweaking `prefill_chunk_size`, `context_window_size` and `sliding_window_size` to reduce memory usage
```
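
The reported total in the log above is simply the sum of the three components (parameters + temporary buffers + KV cache). The snippet below sketches that back-of-the-envelope arithmetic; the KV-cache formula is a generic single-sequence estimate, not necessarily the exact computation in `model_metadata.py`:

```python
def kv_cache_mb(num_layers: int, num_kv_heads: int, head_dim: int,
                window_size: int, dtype_bytes: int = 2) -> float:
    """Generic upper bound for a single-sequence KV cache: K and V per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * window_size * dtype_bytes / (1 << 20)


def total_memory_mb(param_mb: float, temp_buffer_mb: float, kv_mb: float) -> float:
    # e.g. 3885.14 + 7184.15 + 512.00 = 11581.29 MB, matching the log above.
    return param_mb + temp_buffer_mb + kv_mb
```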

Addresses both B1 and B2 in mlc-ai#1516 (comment).

Another demo using Python API:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL="HF://junrushao/NeuralHermes-2.5-Mistral-7B-q4f16_1-MLC"

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        sliding_window_size=4096,
        prefill_chunk_size=1024,
        opt="O2",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```

```bash
>>> MLC_JIT_POLICY=REDO python main.py
```

<img width="958" alt="image" src="https://github.com/mlc-ai/mlc-llm/assets/22515877/8fcf1fb2-53b3-4768-91b4-89f90712dea8">
1. Support n-dimensional tensor sharding.
2. Remove the unnecessary `row`, `col` and `group` fields.
This PR turns on FlashInfer in O2 mode given it has been relatively
stable over the past few weeks.

This commit also brings a few misc improvements:
- Pass in scratch memory managed by RelaxVM's memory pool - this change
  depends on TVM's [PR #16327](apache/tvm#16327)
  and FlashInfer's [PR mlc-ai#43](flashinfer-ai/flashinfer#43)
- Enable FlashInfer for group size = 4, which is a setting used in
  Mistral models;
- Slightly shorten and clarify the log message on memory usage on model
  lib loading.
- Integrate FlashInfer into GPT-BigCode models.

With this PR, FlashInfer is integrated into Mistral, Llama, GPT-NeoX,
GPT-BigCode and Phi. The only one left out is GPT-2, which has a special
flag `scale_attn_by_inverse_layer_idx` that applies an elementwise
normalization term `1.0 / layer_id` to attention scores before the
masked softmax.
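
For reference, a tiny NumPy sketch of what that GPT-2 flag does, scaling attention scores by `1.0 / layer_id` before the masked softmax; the 1-based layer indexing is an assumption for illustration:

```python
import numpy as np


def masked_softmax_with_layer_scaling(scores: np.ndarray, mask: np.ndarray, layer_id: int) -> np.ndarray:
    """Scale attention scores by 1.0 / layer_id, apply the mask, then softmax."""
    scaled = scores / float(layer_id)
    scaled = np.where(mask, scaled, -np.inf)        # masked positions contribute nothing
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```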
This PR enables the FasterTransformer quantization `q4f16_ft`.
This PR includes two minor fixes to support TinyLlama:

- Fix BF16 loading via SafeTensors: it was broken because NumPy does not
  natively support bf16, which led to an exception inside safetensors
  (a sketch of the usual widening workaround follows this list).
- Skip FlashInfer in this PR, since it doesn't support `head_dim == 64`.
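
Since NumPy has no native bfloat16 dtype, one common workaround (an illustrative sketch, not necessarily what this PR does) is to reinterpret the bf16 payload as uint16 and widen it to float32 by shifting the bits into the high half:

```python
import numpy as np


def bf16_bytes_to_fp32(raw: bytes) -> np.ndarray:
    """Widen a bfloat16 buffer to float32: bf16 is just the top 16 bits of fp32."""
    u16 = np.frombuffer(raw, dtype=np.uint16)
    u32 = u16.astype(np.uint32) << 16     # place the bf16 bits in the fp32 high half
    return u32.view(np.float32)
```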

After this PR, the following snippet runs TinyLlama pretty conveniently:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC"

def main():
    cm = ChatModule(
        MODEL,
        device="metal",
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )

if __name__ == "__main__":
    main()
```
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/Mistral-7B-Instruct-v0.2-q4f16_1-MLC"
TP_SHARDS = 2

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
        tensor_parallel_shards=TP_SHARDS,
        opt="flashinfer=0;cublas_gemm=1;cudagraph=0",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
This PR introduces the batched llama modeling with Paged KV cache
in SLM flow.
This is a quick fix to mlc-ai#1547. Sorry for missing the init file
in the nn subpackage.
This PR enables FasterTransformer dequantize matmul epilogue fusion.
Introduce Mixtral MoE Model

This PR introduces support for Mixtral MoE models with MLC's latest SLM
quantization/compilation pipeline. It includes the following pieces of
changes:

**Operators.** We implemented a list of operators in TIR's TVMScript
format in two files, `moe_misc.py` and `moe_matmul.py`. Those TIR kernels
implement "transpose indices" and "blocked-CSR-COO" as described in
MegaBlocks [1].

`moe_misc.py` primarily concerns sparsity-related operators, including:
- `get_indices`, `get_indptr` and `scatter_output`: CSR-style index
  manipulation and array shuffling that make the input range each
  expert has to deal with contiguous;
- `moe_sum`, `moe_cumsum` and `topk`, which are standard operators
  specialized for MoE use cases, e.g. where the number of experts and
  of activated experts is small.

`moe_matmul.py` includes non-quantized and quantized GEMV and group GEMM
operators used in MoE model serving. Typically, in single-batch
decoding, GEMV operators should suffice, but group GEMM is a necessary
dependency in both prefilling and batched decoding.
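
A toy NumPy sketch of the CSR-style bookkeeping described above: given per-token expert assignments, build an `indptr` over experts and a permutation of token indices so each expert sees a contiguous slice. This is purely illustrative; the actual `get_indices`/`get_indptr`/`scatter_output` TIR kernels are heavily optimized:

```python
import numpy as np


def group_tokens_by_expert(expert_ids: np.ndarray, num_experts: int):
    """Return (indptr, order): tokens reordered so each expert's inputs are contiguous."""
    counts = np.bincount(expert_ids, minlength=num_experts)
    indptr = np.concatenate([[0], np.cumsum(counts)])
    order = np.argsort(expert_ids, kind="stable")   # token indices grouped by expert
    return indptr, order


# Tokens routed to experts [2, 0, 2, 1]: expert 0 gets token 1, expert 2 gets tokens 0 and 2.
indptr, order = group_tokens_by_expert(np.array([2, 0, 2, 1]), num_experts=4)
# A scatter step would later use `order` to put expert outputs back in token order.
```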

**Model architecture.** We reuse the attention block from Mistral and
implement the MoE MLP in `mixtral_model.py`. In Mixtral, there are
three groups of experts in each MLP, where `e1` and `e3` are the gate/up
projections (project-in) and `e2` is the down projection (project-out).

**Weight quantization.** We batch all experts of the same kind into a
single tensor of shape `(Ne, N, K)`, where `Ne` is the total number of
experts, `N` is the number of output features and `K` the number of
input features. Applying group quantization, we compress along the `K`
dimension, consistent with the rest of the project.

**Performance.** The current TIR is highly optimized for non-tensor-core
scenarios (Metal, WebGPU, non-TensorCore CUDA, AMD, etc.), and tensor
core performance is left for a PR in the near future.

**Try out MLC's Mixtral Model.** The int4-quantized Mixtral model takes
24.5 GB for its parameters.

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging
logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1

def main():
    cm = ChatModule(MODEL, device="cuda:0", chat_config=ChatConfig(
        sliding_window_size=1024,
        tensor_parallel_shards=NUM_GPU,
    ))
    cm.generate("What is the meaning of life?", progress_callback=callback.StreamToStdout(callback_interval=2))

if __name__ == "__main__":
    main()
```

Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)

The 3-bit version can be run comfortably on a 24 GB GPU (e.g. 4090,
3090 Ti).

**Convert Mixtral to MLC format from scratch.** The following instructions
are only needed for advanced users to quantize Mixtral from scratch.

```bash
SRC_DIR=/path/to/Mixtral-8x7B-v0.1 # raw model downloaded from HuggingFace
MODEL_DIR=/mlc_models/mixtral-q4f16_1 # destination directory

mlc_chat gen_config $SRC_DIR -o $MODEL_DIR --quantization q4f16_1 \
  --conv-template LM  # "LM" (lang model) means no conversation template yet

mlc_chat convert_weight $SRC_DIR --quantization q4f16_1 -o $MODEL_DIR
```

[1] Gale, Trevor, Deepak Narayanan, Cliff Young, and Matei Zaharia.
"MegaBlocks: Efficient Sparse Training with Mixture-of-Experts."
Proceedings of MLSys 2023.

Co-authored-by: Junru Shao <junrushao@apache.org>
A follow-up to my previous PR (mlc-ai#1529).

This PR makes Mixtral work on the Metal GPUs that macOS comes with.
Honestly, not much change was needed, except that Metal doesn't
support fp64 data types.

A python script to run Mixtral:

```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging
logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1

def main():
    cm = ChatModule(MODEL, chat_config=ChatConfig(
        sliding_window_size=1024,
        tensor_parallel_shards=NUM_GPU,
    ))
    cm.generate("What is the meaning of life?", progress_callback=callback.StreamToStdout(callback_interval=2))

if __name__ == "__main__":
    main()
```

Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)
…i#1555)

We recently noticed that when FlashInfer is not built due to an
unsupported CUDA architecture or platform, running the single-sequence
ChatModule hits a VM function initialization error. The offending
function is used in `create_flashinfer_paged_kv_cache`, which is not
actually invoked in the single-sequence flow.

This is because the Relax VM eagerly initializes all used PackedFuncs
at initialization time (instead of loading them lazily). Therefore,
even when `create_flashinfer_paged_kv_cache` is not invoked, the
PackedFuncs will still be looked up, so whenever FlashInfer is not
available the issue occurs.

This PR adds a compiler pass that removes
`create_flashinfer_paged_kv_cache` (and other similar functions that
may be introduced in the future) based on the target. This pass
effectively addresses the issue.
@masahi merged commit 5efaa53 into octoml:batch-serving on Jan 9, 2024
1 check passed