forked from mlc-ai/mlc-llm
Upstream merge jan09 #151
Merged
Conversation
PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This broke some existing demos because we did not do a round of model library updates. This PR reverts the ChatModule change and adds back the softmax function for the non-batching case. With this PR, the regression should be fixed.
…ai#1074) This PR lifts the device string parsing (just a few lines) into a standalone function, so that the serving side can make use of it as well. Tested with the Python API; it does not appear to introduce a regression.
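A minimal sketch of what such a standalone device-string parser could look like (the name `parse_device_str` and its return shape are assumptions for illustration, not the actual MLC API):
```python
from typing import Tuple


def parse_device_str(device: str) -> Tuple[str, int]:
    """Split a device string like "cuda" or "metal:1" into (device_name, device_id).

    Hypothetical helper for illustration; the real function lives in the MLC codebase.
    """
    if ":" in device:
        name, id_str = device.split(":", maxsplit=1)
        return name, int(id_str)
    return device, 0


# Example usage: both ChatModule and the serving side could share one parser.
print(parse_device_str("cuda"))     # ('cuda', 0)
print(parse_device_str("metal:1"))  # ('metal', 1)
```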
The pass `fuse-split-rotary` assumes the compute dtype is fp16, which it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the compute is based on fp32 instead. This PR strengthens the check guard.
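In spirit, the strengthened guard amounts to a dtype check like the one below (a hypothetical helper for illustration; the real check lives inside the `fuse-split-rotary` pass):
```python
def can_apply_fuse_split_rotary(compute_dtype: str) -> bool:
    """The fused split-rotary kernel assumes fp16 compute; skip the rewrite otherwise."""
    # e.g. the q0f32 and q4f32_1 quantization schemes compute in fp32, so the pass must bail out.
    return compute_dtype == "float16"


print(can_apply_fuse_split_rotary("float16"))  # True
print(can_apply_fuse_split_rotary("float32"))  # False (q0f32 / q4f32_1 cases)
```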
This PR establishes the compiler components in the MLC-Chat Python API, which currently include two primary components: models and parameters. The models are `nn.Module`-based definitions of an LLM; as the very first stab, there is only `LlamaForCasualLM`. It is decomposed into three files:
- `llama_config.py`: common configurations for Llama, where we define the relevant architecture configurations and include standard config files for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and PyTorch parameters.
The parameters component contains the basic parameter-mapping functionality and the loaders that convert parameters from PyTorch to MLC according to the specified mapping. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward given the existing design. On top of this PR, on-the-fly quantization can be defined as a load-time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized. Two unittests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra and then convert it to a TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from the HuggingFace PyTorch format.
Besides, `mlc_chat.support` is established for utility functions, which now contains two utils:
- `config.py`, which supports reading configurations into dataclasses from a JSON file or Python dict. On top of the Python dataclass, it throws irrelevant fields into `cls.kwargs`, which is helpful when loading HuggingFace configuration files (see the sketch after this message);
- `tqdm.py`, which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm.
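A rough sketch of the `config.py` idea described above: populate a dataclass from a dict while collecting unrecognized fields into `kwargs`. The names here (`LlamaConfig`, `from_dict`) are illustrative, not the exact `mlc_chat.support.config` API:
```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class LlamaConfig:
    hidden_size: int
    num_hidden_layers: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)


def from_dict(cls, source: Dict[str, Any]):
    """Populate a dataclass from a dict, stashing unknown fields into `kwargs`."""
    known = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
    fields = {k: v for k, v in source.items() if k in known}
    extras = {k: v for k, v in source.items() if k not in known}
    return cls(**fields, kwargs=extras)


# HuggingFace config files usually carry extra keys; they end up in `kwargs`.
cfg = from_dict(LlamaConfig, {"hidden_size": 4096, "num_hidden_layers": 32, "rope_theta": 10000})
print(cfg.kwargs)  # {'rope_theta': 10000}
```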
…ages (mlc-ai#1086)
* Support the lib_path option in the C++ CLI. Disable the ChatConfig.model_lib override in the Python API. Improve help and error messages.
* Update docs.
* Rename lib_path -> model_lib_path.
Co-authored-by: Varshith <varshith.bathini@sprinklr.com>
Update `benchmark.py`
[Format] Apply isort and black on `python/`. The commands I am using are:
```
isort --profile black python/
black python/
```
It is always recommended to format the code before submission, given we don't have a linter CI yet.
This PR enables two Python formatters, "black" and "isort", on the following directories:
- `./python/`
- `./tests/python/`
Enabling pylint and mypy is left for future work.
Add pylint/mypy tooling into pyproject.toml. This PR establishes the initial Python tooling infra with Pylint and Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and `mlc_chat.compiler`, are covered; we expect to cover the entire package, as tracked in mlc-ai#1101.
…1052) Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a single function. This commit modifies it to instead be a transform operating on any pattern matches within an `IRModule`.
…#1056) * [ParamManager] Use BundleModelParams for transform_quantize. Prior to this commit, the `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow. This commit updates the `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657. In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on an `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return an `ir.transform.Pass`. * Correct type annotation
Fix error introduced by recent code changes. Fixes mlc-ai#1116.
…lc-ai#1119) * Add doc for max and mean gen len, shift factor * Update python docs for BuildArgs
mlc-ai#1120) Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)" This reverts commit e5927ce. This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment)
This PR removes an inaccurate warning from mlc-ai#1086, which warns about `model_lib` overriding regardless of whether or not it's actually overridden. With this commit, we only warn if its value is not None.
* Add presence and frequency penalty (a sketch of the penalty math follows after this list).
* Add support for passing conversation history in the /v1/chat/completions endpoint.
* Add support for the REST API parameters max_gen_len, n, and stop_str.
* Add presence and frequency penalty to the generation config; refactor the generation config.
* Add documentation for the parameters.
* Replace lib_path with model_lib_path in rest.py.
* Fix black/isort issues.
* Fix lib_path.
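For context, presence and frequency penalties are typically applied to the logits before sampling, roughly as in the OpenAI-style formulation sketched below (a simplified illustration, not the exact code added in this PR):
```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[int, float],
    generated_tokens: List[int],
    presence_penalty: float,
    frequency_penalty: float,
) -> Dict[int, float]:
    """Penalize tokens that already appeared in the generated output."""
    counts = Counter(generated_tokens)
    penalized = dict(logits)
    for token_id, count in counts.items():
        if token_id in penalized:
            # presence penalty applies once; frequency penalty scales with the count
            penalized[token_id] -= presence_penalty + frequency_penalty * count
    return penalized


# A token generated 3 times gets presence_penalty + 3 * frequency_penalty subtracted.
print(apply_penalties({7: 2.0, 9: 1.5}, [7, 7, 7], presence_penalty=0.5, frequency_penalty=0.1))
```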
…lc-ai#1127) Prior to this commit, the `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow. This commit updates the `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657. In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on an `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return an `ir.transform.Pass`. This commit is a repeat of the reverted PR mlc-ai#1056. This PR resolves the bug in the earlier implementation by removing the call to `.without_attr("num_input")` in `ParamReplacer.rewrite_func`. This follows an analogous update in `LiftTransformParams`, preserving the `"num_input"` attribute for use in `BundleModelParams`.
The 32-bit version of the zstd.dll library was causing issues, so the doc was updated to be more specific and point to the 64-bit version.
Use sys executable in delivery
add mistral android lib url
Fix tensor parallelism for GPT-BigCode.
* cublas * fix
Integrate fused RoPE into the gpt_neox and phi models. Add an optional parameter `rotary_dim` to `llama_rope`. `rotary_dim` indicates the number of dimensions in the embedding that RoPE is applied to; by default, `rotary_dim` is the same as `head_dim`. In the Phi model, `rotary_dim` is set to a different number based on the config.
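A small numpy sketch of what `rotary_dim` controls: RoPE rotates only the first `rotary_dim` dimensions of each head vector, and the remaining dimensions pass through unchanged (a simplified illustration, not the fused TIR kernel itself):
```python
import numpy as np


def apply_rope(x: np.ndarray, position: int, rotary_dim: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate the first `rotary_dim` dims of a (head_dim,) vector; leave the rest as-is."""
    half = rotary_dim // 2
    freqs = position / theta ** (np.arange(half) / half)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:half], x[half:rotary_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rotary_dim:]])  # tail untouched when rotary_dim < head_dim


head_dim = 8
x = np.ones(head_dim, dtype=np.float32)
print(apply_rope(x, position=3, rotary_dim=4))  # only the first 4 dims are rotated
```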
This PR addresses a package name conflict issue introduced by mlc-ai#1502, where `mlc_chat.operator` collides with python's official `operator` library. More details: mlc-ai#1502 (comment).
A minor path fix in the Android docs, as the file `prepare_libs.sh` is under the `library` folder.
…-ai#1522) This PR introduces an environment variable `MLC_JIT_POLICY` as a follow-up item to PR [mlc-ai#1508](mlc-ai#1508 (comment)). It allows enabling/disabling the JIT behavior via:
- `OFF`: never JIT, and throw an error if `model_lib` is missing;
- `ON` (default): JIT whenever the model lib is missing and there is a cache miss;
- `REDO`: whenever the model lib is missing, always do JIT compilation even on a cache hit;
- `READONLY`: never do JIT compilation, but look up the JIT cache whenever the model lib is missing.
It also dissolves the newly introduced `JITOption` into `ChatConfig` so that it can be used more seamlessly with exactly the existing APIs. By doing so, users can simply specify `context_window_size` and `prefill_chunk_size` to control the VRAM used by each model without having to recompile the model lib themselves. Example: if one focuses on developing the compiler/runtime rather than quantization, one can simply run
```bash
MLC_JIT_POLICY=REDO python main.py
```
to test whether the compiler and runtime work smoothly together, where `main.py` is:
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/Llama-2-7b-chat-hf-q4f16_1-MLC"

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
* Add support for loading weights from a safetensor file.
* Set pylint to ignore the import error.
* Move the pylint-disable line.
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
This PR introduces a command that reports the estimated upper-bound memory usage based on the metadata section of an SLM-compiled model. Example:
```bash
>> python -m mlc_chat.cli.model_metadata /path/to/model_lib.so --memory-only
[2023-12-31 18:40:43] INFO model_metadata.py:49: Parameter size: 3885.14 MB
[2023-12-31 18:40:43] INFO model_metadata.py:58: Temporary buffer size: 7184.15 MB
[2023-12-31 18:40:43] INFO model_metadata.py:71: KVCache size when context/sliding window size is 4096: 512.00 MB
[2023-12-31 18:40:43] INFO model_metadata.py:79: Total memory usage: 11581.29 MB
[2023-12-31 18:40:43] INFO model_metadata.py:84: Tweaking `prefill_chunk_size`, `context_window_size` and `sliding_window_size` to reduce memory usage
```
Addresses both B1 and B2 in mlc-ai#1516 (comment). Another demo using the Python API:
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/NeuralHermes-2.5-Mistral-7B-q4f16_1-MLC"

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        sliding_window_size=4096,
        prefill_chunk_size=1024,
        opt="O2",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
Run with:
```bash
MLC_JIT_POLICY=REDO python main.py
```
(The original message included a screenshot of the resulting memory report.)
1. Support n-dimension tensor sharding (a conceptual sketch follows below).
2. Remove the unnecessary `row`, `col` and `group` fields.
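A conceptual sketch of n-dimensional tensor sharding: splitting a weight tensor of any rank along an arbitrary dimension across tensor-parallel shards (plain numpy for illustration; the actual change concerns the sharding spec in the compiler):
```python
from typing import List

import numpy as np


def shard_along_dim(weight: np.ndarray, dim: int, num_shards: int) -> List[np.ndarray]:
    """Split a tensor of any rank into equal chunks along `dim`, one per shard."""
    assert weight.shape[dim] % num_shards == 0, "dimension must divide evenly across shards"
    return np.split(weight, num_shards, axis=dim)


# Works for any rank, e.g. a stacked MoE weight of shape (num_experts, out_features, in_features).
w = np.arange(2 * 4 * 6).reshape(2, 4, 6)
shards = shard_along_dim(w, dim=1, num_shards=2)
print([s.shape for s in shards])  # [(2, 2, 6), (2, 2, 6)]
```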
This PR turns on FlashInfer in O2 mode, given it has been relatively stable over the past few weeks. This commit also brings a few misc improvements:
- Pass in scratch memory managed by RelaxVM's memory pool; this change depends on TVM's [PR #16327](apache/tvm#16327) and FlashInfer's [PR mlc-ai#43](flashinfer-ai/flashinfer#43);
- Enable FlashInfer for group size = 4, which is a setting used in Mistral models;
- Slightly shorten and clarify the log message on memory usage during model lib loading;
- Integrate FlashInfer into GPT-BigCode models.
With this PR, FlashInfer is integrated into Mistral, Llama, GPT-NeoX, GPT-BigCode and Phi. The only one left out is GPT-2, which has a special flag `scale_attn_by_inverse_layer_idx` that applies an elementwise normalization term `1.0 / layer_id` to the attention scores before the masked softmax.
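For reference, the GPT-2 flag mentioned above scales attention scores by a per-layer factor before the masked softmax, roughly as sketched below (a simplified illustration; whether the divisor is `layer_id` or `layer_id + 1` depends on the layer indexing convention):
```python
import numpy as np


def scaled_attn_scores(query: np.ndarray, key: np.ndarray, layer_id: int) -> np.ndarray:
    """Attention scores with a scale_attn_by_inverse_layer_idx-style normalization."""
    head_dim = query.shape[-1]
    scores = query @ key.T / np.sqrt(head_dim)
    return scores / float(layer_id)  # extra elementwise per-layer term before masked softmax


q = np.random.rand(4, 64).astype(np.float32)
k = np.random.rand(4, 64).astype(np.float32)
print(scaled_attn_scores(q, k, layer_id=3).shape)  # (4, 4)
```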
This PR enables the FasterTransformer quantization format `q4f16_ft`.
This PR includes two minor fixes to support TinyLlama:
- Fix BF16 loading via SafeTensor; it was broken because numpy does not support bf16, which led to an exception inside safetensor.
- FlashInfer doesn't support `head_dim == 64`, so we skip it in this PR.
After this PR, the following snippet runs TinyLlama pretty conveniently:
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC"


def main():
    cm = ChatModule(
        MODEL,
        device="metal",
        chat_config=ChatConfig(context_window_size=1024),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )


if __name__ == "__main__":
    main()
```
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/Mistral-7B-Instruct-v0.2-q4f16_1-MLC"
TP_SHARDS = 2

cm = ChatModule(
    MODEL,
    device="cuda",
    chat_config=ChatConfig(
        context_window_size=1024,
        prefill_chunk_size=1024,
        tensor_parallel_shards=TP_SHARDS,
        opt="flashinfer=0;cublas_gemm=1;cudagraph=0",
    ),
)
cm.generate(
    "What is the meaning of life?",
    progress_callback=callback.StreamToStdout(callback_interval=2),
)
```
This PR introduces the batched llama modeling with Paged KV cache in SLM flow.
This is a quick fix to mlc-ai#1547. Sorry for missing the init file in the nn subpackage.
This PR enables FasterTransformer dequantize matmul epilogue fusion.
Introduce Mixtral MoE Model. This PR introduces support for Mixtral MoE models with MLC's latest SLM quantization/compilation pipeline. It includes the following pieces:

**Operators.** We implemented a list of operators in TIR's TVMScript format in two files, `moe_misc` and `moe_matmul`. Those TIR kernels implement "transpose indices" and "blocked-CSR-COO" as described in MegaBlocks [1]. `moe_misc.py` primarily concerns sparsity-related operators, including:
- `get_indices`, `get_indptr` and `scatter_output`: CSR-style index manipulation and array shuffling that make the input range each expert handles contiguous;
- `moe_sum`, `moe_cumsum`, `topk`: standard operators specialized for MoE use cases, e.g. where #experts and #activated-experts are small.
`moe_matmul.py` includes the non-quantized and quantized GEMV and group GEMM operators used in MoE model serving. Typically, in single-batch decoding, GEMV operators suffice, but group GEMM is a necessary dependency in both prefilling and batched decoding. (A plain-Python sketch of the routing bookkeeping follows after this message.)

**Model architecture.** We reuse the attention building block from Mistral and implement the MoE MLP in `mixtral_model.py`. In Mixtral, there are three groups of experts in each MLP, where `e1` and `e3` are the gate/up projections (project-in) and `e2` is the down projection (project-out).

**Weight quantization.** We batch all experts of the same kind into a single tensor of shape `(Ne, N, K)`, where `Ne` is the total number of experts, `N` is the number of out-features and `K` is the number of in-features. Applying group quantization, we compress along the `K` dimension, consistent with the rest of the project.

**Performance.** The current TIR is highly optimized for non-tensor-core scenarios (Metal, WebGPU, non-TensorCore CUDA, AMD, etc.); tensor core performance is left for a PR in the near future.

**Try out MLC's Mixtral Model.** The int4-quantized Mixtral model has 24.5G of parameters.
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1


def main():
    cm = ChatModule(
        MODEL,
        device="cuda:0",
        chat_config=ChatConfig(
            sliding_window_size=1024,
            tensor_parallel_shards=NUM_GPU,
        ),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )


if __name__ == "__main__":
    main()
```
Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)
The 3-bit version can be run comfortably on a 24G GPU (e.g. 4090, 3090 Ti).

**Convert Mixtral to MLC format from scratch.** The following instructions are only needed for advanced users who want to quantize Mixtral from scratch.
```bash
SRC_DIR=/path/to/Mixtral-8x7B-v0.1     # raw model downloaded from HuggingFace
MODEL_DIR=/mlc_models/mixtral-q4f16_1  # destination directory

mlc_chat gen_config $SRC_DIR -o $MODEL_DIR --quantization q4f16_1 \
    --conv-template LM  # "LM" (lang model) means no conversation template yet
mlc_chat convert_weight $SRC_DIR --quantization q4f16_1 -o $MODEL_DIR
```
[1] Gale, Trevor, Deepak Narayanan, Cliff Young, and Matei Zaharia. "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts." Proceedings of MLSys 2023.

Co-authored-by: Junru Shao <junrushao@apache.org>
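A plain-numpy sketch of the routing bookkeeping referenced above: pick the top-k experts per token and group token slots by expert so each expert sees a contiguous range (a conceptual illustration only; the real kernels are TIR implementations of `topk`, `get_indptr`, `get_indices`, etc.):
```python
import numpy as np


def route_tokens(gate_logits: np.ndarray, num_experts: int, top_k: int):
    """Return (expert_ids, sorted_token_ids, indptr) grouping token slots by expert."""
    # top-k experts per token (argsort descending, keep the first k)
    expert_ids = np.argsort(-gate_logits, axis=-1)[:, :top_k]  # (num_tokens, top_k)
    flat_experts = expert_ids.reshape(-1)                      # one slot per (token, k) pair
    order = np.argsort(flat_experts, kind="stable")            # "transpose indices"
    sorted_token_ids = order // top_k                          # which token each sorted slot came from
    counts = np.bincount(flat_experts, minlength=num_experts)
    indptr = np.concatenate([[0], np.cumsum(counts)])          # CSR-style offsets per expert
    return expert_ids, sorted_token_ids, indptr


gate = np.random.rand(5, 8)  # 5 tokens, 8 experts
ids, tokens, indptr = route_tokens(gate, num_experts=8, top_k=2)
print(indptr)  # expert e handles tokens[indptr[e]:indptr[e+1]]
```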
A follow-up to my previous PR (mlc-ai#1529). This PR makes Mixtral work on the Metal GPUs that macOS ships with. Not much change is needed, except that Metal doesn't support fp64 data types. A Python script to run Mixtral:
```python
from mlc_chat import ChatConfig, ChatModule, callback
from mlc_chat.support import logging

logging.enable_logging()

MODEL = "HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"
NUM_GPU = 1


def main():
    cm = ChatModule(
        MODEL,
        chat_config=ChatConfig(
            sliding_window_size=1024,
            tensor_parallel_shards=NUM_GPU,
        ),
    )
    cm.generate(
        "What is the meaning of life?",
        progress_callback=callback.StreamToStdout(callback_interval=2),
    )


if __name__ == "__main__":
    main()
```
Quantization formats:
- 3-bit (19.662 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q3f16_1-MLC)
- 4-bit (24.466 GB): ["HF://junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC"](https://huggingface.co/junrushao/Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC)
…i#1555) We recently noticed that when FlashInfer is not built, due to an unsupported CUDA architecture or platform, running the single-sequence ChatModule hits a VM function initialization error for a function used in `create_flashinfer_paged_kv_cache`, even though that function is never invoked in the single-sequence flow. This is because the Relax VM eagerly initializes all used PackedFuncs at initialization time (instead of loading them lazily). Therefore, even when `create_flashinfer_paged_kv_cache` is not invoked, its PackedFuncs are still looked up, so the issue arises whenever FlashInfer is unavailable. This PR adds a compiler pass that removes `create_flashinfer_paged_kv_cache` (and other similar functions that may be introduced in the future) based on the target, which effectively addresses the issue.
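Conceptually, the pass does something like the following: drop functions whose backend-specific dependencies are unavailable on the compilation target (a loose Python sketch with made-up names, not the actual TVM pass code):
```python
from typing import Callable, Dict


def prune_unsupported_functions(functions: Dict[str, Callable], target_kind: str) -> Dict[str, Callable]:
    """Drop KV-cache creation functions whose backend is unavailable on this target."""
    # Hypothetical table: function name -> target kinds where the backend exists.
    backend_requirements = {
        "create_flashinfer_paged_kv_cache": {"cuda"},
    }
    pruned = {}
    for name, func in functions.items():
        required = backend_requirements.get(name)
        if required is not None and target_kind not in required:
            continue  # skip it, so the VM never tries to resolve its PackedFuncs
        pruned[name] = func
    return pruned


funcs = {"prefill": print, "decode": print, "create_flashinfer_paged_kv_cache": print}
print(sorted(prune_unsupported_functions(funcs, target_kind="metal")))  # FlashInfer entry dropped
```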
@vinx13 Please verify that Mixtral support is not broken.