
Merge with mlc-ai/main (d3d264d4b05d73e9757375013b842254f052c6ed, April 29th 2024) #265

Merged
merged 252 commits into mlc-serve-v0.2.0 on Apr 29, 2024

Conversation

@sunggg (Member) commented Apr 29, 2024

No description provided.

Ubospica and others added 30 commits February 24, 2024 08:35
This PR introduces logprobs support with OpenAI API
compatibility. It enhances the sampler with a function to get
the top-probability tokens (supporting at most 5 tokens for now).

To make it easy to pass logprob results back from the serving engine
to the frontend, we pass the logprob results as a JSON string following
the OpenAI API spec.

Unit tests are added to ensure the correctness of logprobs.
The logprobs support also works with speculative decoding.
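As a rough illustration of the shape of such a payload (the field names follow the OpenAI Chat Completions logprobs spec; the tokens and values below are made up, not produced by the engine):

```
import json

# Hypothetical logprob entries for two sampled tokens; values are illustrative only.
logprob_results = {
    "content": [
        {
            "token": "Hello",
            "logprob": -0.12,
            # At most 5 top-probability tokens are reported per position for now.
            "top_logprobs": [
                {"token": "Hello", "logprob": -0.12},
                {"token": "Hi", "logprob": -2.31},
            ],
        },
        {"token": "!", "logprob": -0.56, "top_logprobs": [{"token": "!", "logprob": -0.56}]},
    ]
}

# The engine passes this back to the frontend serialized as a JSON string.
print(json.dumps(logprob_results))
```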
This PR supports Mixtral in MLC serve. The main change is
introducing the Mistral conversation template to the Python registry
so that MLC Serve can use it.

Besides that, this PR updates the KV cache capacity analysis to
make it more accurate in terms of usage calculation, while staying
conservative since there is a known issue regarding batch-prefill
embedding taking that may lead to OOM. We will follow up
on the issue with a fix in the future and then enable the estimation
to use more GPU vRAM.
Prior to this PR, `u_char` was used, but it is not a standard
type in C++, which caused Windows build failures.

This PR fixes it by using `unsigned char`.
…#1852)

Instead of a Python function that returns an updated `IRModule`, the
new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`,
which can be applied to an `IRModule`.
* Create __init__.py

* Add files via upload

* Update model.py

* Update model_preset.py

* Update conv_templates.cc

* Update internlm_loader.py

* Update internlm_quantization.py

* fix name of notes

* Update model.py

* Migration

* fix pylint issue

* fix pylint issue

* fix pylint error

* Update internlm_loader.py

* Update __init__.py

* Update __init__.py

* Delete python/mlc_chat/model/internlm/__init__.py

* Add files via upload
Prior to this commit, a model name with multiple path
components (e.g. `dist/models/group_name/model_name`) would have
duplicated path components
(e.g. `dist/group_name/artifact_path/group_name/libname.so`).
This commit resolves the duplication.
* [KVCache] Add max num threads to KVCache kernels, fix WebGPU

* Read max_num_threads_per_block when available

* Change merge state in place kernel

* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>

* Change util function name

---------

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>
…1860)

This PR moves the import of transformers into the function body
of the tiktoken tokenizer conversion, so we do not have a forced
dependency on transformers.
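The pattern here is simply a deferred import; a minimal sketch, with a hypothetical function name:

```
def convert_tiktoken_tokenizer(model_path: str):
    """Convert a tiktoken tokenizer; transformers is imported only when needed."""
    # Importing inside the function body avoids a hard dependency on transformers
    # for users who never hit this conversion path.
    from transformers import AutoTokenizer  # pylint: disable=import-outside-toplevel

    return AutoTokenizer.from_pretrained(model_path)
```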
This PR adds RWKV5 support with RNNState, an interface similar to
PagedAttention.

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Following mlc-ai#1854, this PR registers the ChatML conversation template.
Sets the entry functions for a module.  This utility is intended for
cases where a module contains several externally-exposed functions
and only one is desired for use (e.g. separating out a
`transform_params` function from an `IRModule` that also contains
inference functions).  This commit only updates the external
visibility, after which `relax.transform.DeadCodeElimination()` can be
applied.
…i#1856)

This allows it to be used as part of an optimization pipeline specified
as a `tvm.ir.transform.Sequential`.
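A sketch of how the returned pass composes into a pipeline; the exact arguments of `optimize_mod_pipeline` and the `mod` variable are assumed here:

```
import tvm

# Sequential is itself a Pass, so the pipeline can be applied to an IRModule directly.
seq = tvm.ir.transform.Sequential(
    [
        optimize_mod_pipeline(),                    # assumed to need no arguments here
        tvm.relax.transform.DeadCodeElimination(),
    ]
)

with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)  # `mod` is an existing IRModule
```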
mlc-ai#1867)

This PR is the 3rd part of the grammar-guided generation.
It integrates the grammar framework into the generation
process and supports JSON output for now.

The API this PR provides is compatible with the OpenAI API.

### APIs
#### Python API
```
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    # Defaults to ResponseFormat(type="text")
    response_format: ResponseFormat = field(default_factory=ResponseFormat)
```

#### REST API
```
response_format: { "type": "text" }  # text generation, by default
response_format: { "type": "json_object" }  # JSON generation
response_format: { "type": "json_object", "json_schema": "..." }  # JSON generation with schema
```

JSON generation with schema is not supported yet,
but is planned for the future.
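For illustration, a request against an OpenAI-compatible endpoint could look like the sketch below; the server URL and model name are placeholders:

```
import requests

payload = {
    "model": "local-model",  # placeholder model name
    "messages": [{"role": "user", "content": "List three fruits as a JSON object."}],
    # Constrain the engine to emit valid JSON.
    "response_format": {"type": "json_object"},
}

resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```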

### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode.
This will be further optimized in the future.
This PR brings field `n` to the generation config and thereby
supports parallel generation. This parallel generation effectively
leverages the "fork" functionality of the paged KV cache.

This PR supports specifying the number of parallel generations
`n` in the standard OpenAI ChatCompletion API. This is the last
feature toward OpenAI API feature completeness.
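A hedged usage sketch with the standard `openai` client pointed at a local OpenAI-compatible server (base URL and model name are placeholders):

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")  # placeholder server

response = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    n=3,  # request three parallel generations of the same prompt
)

for choice in response.choices:
    print(choice.index, choice.message.content)
```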
Sometimes the scm checkout can time out; this PR adds a retry for that.
Prior to this PR, the TIR attention kernels did not cast matmul
operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V values (at the level
of a few hundred), the fp16 multiplication exceeds the range of
fp16 and sometimes leads to NaN attention results.

This PR fixes this issue.
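A tiny NumPy illustration of the failure mode and the fix; the numbers are made up but at the magnitude described above:

```
import numpy as np

q_val = np.float16(300.0)  # Q/K values at the level of a few hundred
k_val = np.float16(300.0)

# A single fp16 product already exceeds the fp16 maximum (~65504) and overflows to inf,
# which can then propagate into NaN through the attention softmax.
print(q_val * k_val)                          # inf

# Casting the operands to fp32 before multiplying keeps the result in range.
print(np.float32(q_val) * np.float32(k_val))  # 90000.0
```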
…lc-ai#1857)

Prior to this commit, the `ReorderTransformFunc` required several
components of the `ParamManager` in order to be used.  The functionality it
provides, reordering dataflow blocks to minimize the liveset, is
useful outside of the context of the `ParamManager`.  This commit
makes the following changes, allowing it to be used independently of
the `ParamManager`.

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`

- Allow parameters to be separate `func.params`, rather than a single
  bundled tuple parameter.
This PR migrates Phi-2 to paged KV cache attention as part of the model definition migration according to mlc-ai#1749.

Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
…c-ai#1874)

The use of `call_inplace_packed` and `call_pure_packed` in the old
flow is outdated due to signature changes. This PR fixes the issue.
PR mlc-ai#1852 missed applying the BundleModelParams pass and thus made
the compiled models not runnable through ChatModule (mlc-ai#1864). This PR
fixes the issue.
As pointed out by mlc-ai#1830, this PR fixes the Android app download
link in docs.
This PR adopts suggestions from the OpenAI API parallel-generation
`n` support in mlc-ai#1868. The main update in this PR is to make
RequestState a standalone class, which was previously a typedef
of `std::vector<RequestStateEntry>`.

This PR also fixes a bug in prefill that could cause engine failure
when `n` is large.
Kartik14 and others added 29 commits April 22, 2024 17:22
…lc-ai#2190)

This PR adds conversation template support to the JSON FFI Engine.
It also adds function calling and passes stop strings to the generation config.

Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
This PR introduces the Paged Radix Tree data structure, as the foundation and prerequisite of prefix caching.
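Not the actual implementation, but a toy sketch of the idea behind prefix reuse with a token-level trie (the real structure is paged and radix-compressed):

```
class PrefixNode:
    """Toy trie node keyed by token id."""

    def __init__(self):
        self.children = {}


class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def longest_shared_prefix(self, tokens):
        """Return how many leading tokens are already cached (reusable KV entries)."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


tree = PrefixTree()
tree.insert([1, 2, 3, 4])                        # tokens of a previously served request
print(tree.longest_shared_prefix([1, 2, 3, 9]))  # 3 leading tokens can reuse cached KV
```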
This PR removes the mandatory model check in the server, since as of now
we serve at most one engine, which means there is always a unique
engine being served. As issue mlc-ai#2155 points out, the model check
in the server can be a bad experience when the model string mismatches.
* [Eagle] Attach gpu verifier to model

* WIP

* WIP

* fix

* Enable GPU verifier

* lint

* lint
* [Eagle] Make BatchSelectLastHidden able to run on the controller
…lc-ai#2206)

This PR updates the draft verification of normal-mode speculative
decoding. Prior to this PR, we did not effectively leverage all the
draft tokens, and this PR fixes the issue.
)

This PR introduces a renormalization interface with regard to top-p
values for speculative decoding. This helps simplify the
logic of the speculative decoding verification stage, as all probabilities
have already been updated with the top-p values, so top-p no longer needs
to be taken into consideration.

So for speculative decoding, we always renormalize the probability
distribution before sampling/verifying. For the non-speculative decoding
mode, we keep using the previous flow, which applies top-p together
when sampling.

Co-authored-by: Wuwei Lin <wuwei@apache.org>
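A NumPy sketch of the renormalization step described above; it mirrors the idea rather than the actual GPU kernel:

```
import numpy as np

def renorm_top_p(probs, top_p):
    """Zero out tokens outside the top-p nucleus and renormalize the rest."""
    order = np.argsort(probs)[::-1]                  # token ids by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest prefix reaching top_p mass
    keep = order[:cutoff]
    renormed = np.zeros_like(probs)
    renormed[keep] = probs[keep]
    return renormed / renormed.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(renorm_top_p(probs, top_p=0.9))                # [0.5263 0.3158 0.1579 0.    ]
```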
This commit renames the LLMEngine to MLCEngine.
This commit returns a list of integers and adds an assert to check that the CUDA architecture string contains only numbers.

Co-authored-by: msyu <msyu@pllab.cs.nthu.edu.tw>
Take advantage of the OpenCL host pointer to improve copy performance.
It gives a 2x improvement for TIR-based paged attention on OpenCL Adreno.
…#2226)

This PR removes the imports of functions in `cli.model_metadata` from
engine_base.py. The file `cli.model_metadata` is not designed to be
imported directly, and importing functions from it repeatedly reports
warnings like

```
RuntimeWarning: 'mlc_llm.cli.model_metadata' found in sys.modules after
import of package 'mlc_llm.cli', but prior to execution of
'mlc_llm.cli.model_metadata'; this may result in unpredictable behaviour
```
…onfig values to NOT_GIVEN (mlc-ai#2225)

* Change OpenAI protocol default value to None in JSON FFI engine

* [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN
This PR adds an early exit for the GPU sampler, which prior to this
commit launched GPU kernels even when the batch size is 0.

The zero batch size case can happen when parallel generation of a request
coexists with engine preemption. In this case, the GPU sampler should
just synchronize and return, and not launch any GPU kernel.
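A toy CPU-side sketch of the early-exit guard (the real sampler additionally synchronizes the device before returning):

```
import numpy as np

def sample_batch(probs, rng):
    """Toy sampler with the zero-batch early exit; the real one runs GPU kernels."""
    if probs.shape[0] == 0:
        # Preemption plus parallel generation can leave an empty batch:
        # skip all sampling kernels and return immediately.
        return []
    return [int(rng.choice(probs.shape[1], p=row)) for row in probs]

rng = np.random.default_rng(0)
print(sample_batch(np.empty((0, 8)), rng))                    # []
print(sample_batch(np.array([[0.1, 0.9], [0.7, 0.3]]), rng))  # e.g. [1, 0]
```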
This PR introduces the compiler pass that rewrites the normal softmax
to a two-stage softmax. This is based on our finding that when the
vocabulary size is large, the normal softmax cannot achieve high enough
parallelism on the GPU. So we partition the workload into two stages
for better parallelism and better performance.
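A NumPy sketch of the two-stage idea: stage one computes a per-chunk max and exp-sum, stage two combines them and normalizes. This mirrors the math only; the actual pass emits GPU kernels over the vocabulary dimension:

```
import numpy as np

def two_stage_softmax(logits, num_chunks=4):
    """Numerically stable softmax computed over vocabulary chunks."""
    chunks = np.array_split(logits, num_chunks)

    # Stage 1: each chunk independently produces its local max and local exp-sum.
    local_max = np.array([c.max() for c in chunks])
    local_sum = np.array([np.exp(c - m).sum() for c, m in zip(chunks, local_max)])

    # Stage 2: combine per-chunk statistics into the global max and denominator,
    # then normalize every element with them.
    global_max = local_max.max()
    denom = (local_sum * np.exp(local_max - global_max)).sum()
    return np.concatenate([np.exp(c - global_max) for c in chunks]) / denom

logits = np.random.default_rng(0).normal(size=32000)
reference = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
assert np.allclose(two_stage_softmax(logits), reference)
```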
remove model metadata step (#1)

* remove model metadata step and make minor fixes
This commit introduces the GPU top-p cutoff operator for efficient
probability renormalization under top-p.
This PR supports creating EngineConfig from a JSON string, which
is useful for JSONFFIEngine and its API bindings.

This commit also removes the device from the EngineConfig for better
clarity.
This PR migrates JSONFFIEngine to a formal namespace.
It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function from FlashInfer.
We integrate the one without top-p for now.
* add model lib delivery

* fix lint
@sunggg sunggg merged commit 50e2686 into mlc-serve-v0.2.0 Apr 29, 2024