
Merge with mlc-ai/main (68cd794d02bbff9842f08b6b2ff37eb582f411c0, 2024-08-01) #277

Merged: 532 commits merged into mlc-serve-v0.2.0 on Aug 2, 2024

Conversation

@sunggg (Member) commented on Aug 1, 2024

Summary

  • `python/mlc_llm/grammar/grammar.py` changes `find_next_token_bitmask_as_ndarray`, which may come with a corresponding change on the C++ side as well (see the packing sketch below). cc @adstraw
  • The other changes relate to their engine or to Llama 3.1 support, so they do not impact our internal flow.
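
For context, a next-token bitmask packs one bit per vocabulary entry into 32-bit words. The sketch below is plain NumPy with hypothetical helper names; it is not the actual signature of `find_next_token_bitmask_as_ndarray`:

```python
import numpy as np

def pack_token_bitmask(allowed_token_ids, vocab_size):
    """Pack allowed token ids into ceil(vocab_size / 32) uint32 words."""
    bitmask = np.zeros((vocab_size + 31) // 32, dtype=np.uint32)
    for token_id in allowed_token_ids:
        bitmask[token_id // 32] |= np.uint32(1 << (token_id % 32))
    return bitmask

def is_token_allowed(bitmask, token_id):
    return bool(bitmask[token_id // 32] & (1 << (token_id % 32)))

mask = pack_token_bitmask([0, 5, 100], vocab_size=128)
assert is_token_allowed(mask, 5) and not is_token_allowed(mask, 6)
```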

vinx13 and others added 30 commits May 7, 2024 19:43
This PR introduces the packaging CLI `mlc_llm package`, which
reads `mlc-package-config.json`, compiles the models, and
prepares the model/runtime libraries automatically.

With this PR, we remove the prebuilt model library dependency
for the iOS app build.

Validated that the iOS build works. The iOS documentation is updated
according to this latest change. The same flow should work
for Android as well, though it still needs verification for the
Android app build.
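
As a sketch of what such a package config might contain (field names here are assumptions based on the description above, not a verified schema):

```python
import json

package_config = {
    "device": "iphone",  # hypothetical: target platform for `mlc_llm package`
    "model_list": [
        {
            # hypothetical entry: a model to compile and bundle into the app
            "model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
            "model_id": "Mistral-7B-Instruct-v0.2-q4f16_1",
        }
    ],
}

with open("mlc-package-config.json", "w") as f:
    json.dump(package_config, f, indent=2)
```

Running `mlc_llm package` in the directory containing this file would then compile the listed models and prepare the runtime libraries.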
1. Avoid the CPU softmax for different penalty configs by
   copying (with sync) to the GPU and using the GPU softmax (sketched below).
2. Disable the decode token time counter for the first token.
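
A minimal sketch of the first item, using PyTorch-like APIs purely for illustration (MLC's actual kernels are TVM-based) and assuming a CUDA device:

```python
import torch

def gpu_penalized_softmax(logits_cpu, generated_token_ids, repetition_penalty=1.1):
    logits = logits_cpu.to("cuda", non_blocking=True)  # one copy to the GPU
    token_ids = torch.tensor(generated_token_ids, device=logits.device)
    picked = logits[token_ids]
    # Standard repetition-penalty rule: shrink positive logits, grow negative ones.
    logits[token_ids] = torch.where(
        picked > 0, picked / repetition_penalty, picked * repetition_penalty)
    return torch.softmax(logits, dim=-1)  # softmax stays on the GPU
```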
Move MLCChat to its own subfolder; minor improvements to packaging.
…lc-ai#2289)

* [KVCACHE][TIR] Improved TIR schedule for decode page attention

 1. Improved the TIR schedule of page attention (a 30% improvement to
this function).
 2. Enabled the missing dequant+matmul fusion in the Phi-2 model

* Updated K_local to QK_local

* Update kv_cache.py

* Increase max threads for android:adreno
This PR lifts the existing `library` of the Android app into a standalone
`mlc4j` directory, which can be referenced by an Android app at any
location.

On the app side, this PR moves the Android app into a subfolder
`MLCChat`, which itself is a well-formed Android app. This folder
contains two core files for the app build:

* `MLCChat/mlc-package-config.json`: the config file that specifies
the models to build into the app.
* `MLCChat/prepare_package.py`: the Python script that helps
automatically prepare/build mlc4j and the model libraries.

This PR also updates the Android app documentation to reflect this
latest change.
Shorten titles so they fit into one line of the navbar; add a mention of the JIT cache.
Remove old project overview
Avoid showing the full tree and mention what dist/lib/mlc4j stands for
Use python directly instead of python3, since python3 sometimes
points to the system Python.
This PR removes mentions of legacy modules
and prebuilt libraries in favor of JIT.
This PR adds the `-j` option to the cmake build to parallelize the
build job across CPU cores.
This PR sets clearer instructions for the Android JDK setup
This PR modifies the MLCEngine chatCompletion to take in structured data.

Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
This PR refactors the JSONFFI conv template to use immutable processing.
This helps prevent bugs from multiple requests concurrently accessing
the conversation data structure.

It also reduces the need to deep-copy the struct.
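
A minimal Python sketch of the immutable-processing idea (the actual change is in the C++ JSONFFI code): each request derives a new conversation value instead of mutating a shared one, so concurrent requests cannot corrupt the template and no defensive deep copy is needed.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Conversation:
    system: str = ""
    messages: Tuple[Tuple[str, str], ...] = ()

    def with_message(self, role: str, content: str) -> "Conversation":
        # Returns a new Conversation; `self` is never modified.
        return replace(self, messages=self.messages + ((role, content),))

template = Conversation(system="You are a helpful assistant.")
req_a = template.with_message("user", "hi")
req_b = template.with_message("user", "hello")
assert template.messages == ()  # the shared template is untouched
```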
…lc-ai#2335)

This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer.

Common tokenizers, including those of Phi-2, Gemma, LLaMA-2, etc., are
also supported.

The performance is optimized for the LLaMA-3 tokenizer, since its token
table has 128k entries, much larger than the LLaMA-2 tokenizer's.

These changes are introduced to the grammar library:
1. Introduce ByteString rule expression and simplify CharacterClass
   and CharacterClassStar
2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and
   mutating grammar rules
3. GrammarStateMatcherBase, the internal implementation of
   GrammarStateMatcher, now accepts input char by char instead of
   codepoint by codepoint, so it supports any valid UTF-8 string, even
   if a token is not a complete codepoint (illustrated after the
   performance notes below).
4. Support lookahead assertions for rules, specifying that a rule must
   be followed by a given sequence. This can eliminate some uncertain
   tokens during preprocessing.

Minor changes:
1. Introduce the template hash function HashCombine
2. Update the UTF-8 encoding handling functions

Performance:
1. For JSON, finding the mask takes <30 µs on a 5900X with a single
   thread. The number of uncertain tokens is <30 in most cases.
2. For a JSON schema, finding the mask likewise takes <30 µs on a 5900X
   with a single thread, with <30 uncertain tokens in most cases.
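
To see why the char-by-char (byte-level) acceptance in change 3 matters, consider a BPE token that splits a UTF-8 codepoint; the matcher must consume raw bytes (`accept_byte` below is a hypothetical stand-in for its per-char step):

```python
text = "héllo"
data = text.encode("utf-8")            # b'h\xc3\xa9llo'
token_a, token_b = data[:2], data[2:]  # b'h\xc3' cuts the 'é' codepoint in half

try:
    token_a.decode("utf-8")            # a codepoint-based matcher is stuck here
except UnicodeDecodeError:
    pass

consumed = []
for byte in token_a + token_b:         # accept_byte(byte) in the real matcher
    consumed.append(byte)
assert bytes(consumed).decode("utf-8") == text  # byte-level matching recovers
```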
…older (mlc-ai#2342)

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder

* Fix lint
* [Serving] Add Medusa speculative decoding
tlopex and others added 28 commits July 16, 2024 22:41
This PR supports the TP (tensor parallelism) function of StarCoder2 and fixes two typos.
This PR defers the collection of decode inputs in hybrid prefill,
as collecting decode inputs may incur significant CPU overhead
even when it turns out that no prefill can be performed. By deferring
the collection of decode inputs, we can quickly decide whether prefill
is doable, and this decision does not involve much CPU overhead.
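
A minimal, self-contained sketch of the deferral idea (names are hypothetical, not the engine's actual code): run the cheap feasibility check first, and only collect decode inputs, the expensive step, once prefill is known to happen.

```python
def schedule_hybrid_prefill(pending_prefill, running_decode, token_budget):
    # Cheap check: can any pending prefill fit into the token budget at all?
    fitting = [req for req in pending_prefill if req["prompt_len"] <= token_budget]
    if not fitting:
        return None  # bail out before paying the decode-collection cost

    # Only now pay the comparatively expensive decode-input collection.
    decode_inputs = [{"req": req, "last_token": req["tokens"][-1]}
                     for req in running_decode]
    return {"prefill": fitting[0], "decode": decode_inputs}

pending = [{"prompt_len": 2048, "tokens": []}]
running = [{"tokens": [1, 2, 3]}]
assert schedule_hybrid_prefill(pending, running, token_budget=1024) is None
```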
* Update starcoder2_quantization.py

* Update qwen2_loader.py

* Update qwen2_model.py

* Update qwen2_moe_loader.py

* Update rwkv5_loader.py

* Update rwkv6_loader.py

* Update qwen_loader.py

* Update phi3_quantization.py

* Update phi_quantization.py

* Update phi3_model.py

* Update phi3_model.py

* Update phi3_quantization.py

* Fix TP
This PR supports the [Llama 3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
family.

In particular, we introduce the conversation template and RoPE scaling
for Llama 3.1. In the future we will bring support for more RoPE
scaling variants.

Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
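
For reference, a sketch of the Llama 3.1 RoPE frequency scaling as published for the reference implementation (the constants below are the published defaults, not necessarily what this PR hard-codes): high-frequency dimensions are kept, low-frequency ones are divided by the scaling factor, and the band in between is smoothly interpolated.

```python
import math

def llama31_scaled_freq(freq, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, original_max_len=8192):
    wavelen = 2 * math.pi / freq
    if wavelen < original_max_len / high_freq_factor:
        return freq                  # high-frequency band: unchanged
    if wavelen > original_max_len / low_freq_factor:
        return freq / factor         # low-frequency band: fully rescaled
    # middle band: interpolate smoothly between the two regimes
    smooth = (original_max_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq

base = 500000.0                      # Llama 3.1 rope_theta
freqs = [base ** (-2 * i / 128) for i in range(64)]
scaled = [llama31_scaled_freq(f) for f in freqs]
```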
…-ai#2689)

This PR revamps the FuseAddRMSNorm pass with manual pattern matching,
in order to avoid `rewrite_bindings`, which is recursive and may
take an unaffordable amount of time when the model is large.
…2694)

If `x` has `nan` or `-inf` values, the condition `x[vi,vk] >
local_top_k[0]` may be false.  Falling back to the condition `x[vi,vk]
> local_top_k[1]` then reads the uninitialized value in
`local_top_k[1]`.

This can also result in out-of-bounds memory access.  If all values in
`x[vi,vk]` are `nan` or `-inf` along some row `vi`, then
`local_top_k_index[1]` is never populated.  For mixture-of-experts
models, when `gating_softmax_topk` is used to select the expert, this
uninitialized value is then used as an array index.

This commit updates the `top2_softmax_norm_func` implementation in
`gating_softmax_topk` to initialize both elements of the `local_top_k`
and `local_top_k_index` arrays, matching the implementation of
`top4_softmax_norm_func`.
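
A NumPy sketch of the fix (the real kernel is TIR): with both slots of the value and index arrays initialized up front, a row of all `nan`/`-inf` yields `nan` probabilities but valid, in-bounds indices.

```python
import numpy as np

def top2_softmax_norm(x):
    n, m = x.shape
    top_vals = np.full((n, 2), -np.inf)         # both slots initialized
    top_idx = np.zeros((n, 2), dtype=np.int64)  # never left as garbage
    for i in range(n):
        for k in range(m):
            v = x[i, k]
            if v > top_vals[i, 0]:
                top_vals[i, 1], top_idx[i, 1] = top_vals[i, 0], top_idx[i, 0]
                top_vals[i, 0], top_idx[i, 0] = v, k
            elif v > top_vals[i, 1]:
                top_vals[i, 1], top_idx[i, 1] = v, k
    e = np.exp(top_vals - top_vals[:, :1])      # softmax over the two picks
    return e / e.sum(axis=1, keepdims=True), top_idx

probs, idx = top2_softmax_norm(np.array([[0.1, np.nan, 0.7, 0.3]]))
assert list(idx[0]) == [2, 3]                   # nan is skipped, indices valid
```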
This PR adds a context manager to properly clean up
when an exception occurs during `async for`.

Naively using the try/except pattern results in bugs when we chain
async generators and the exception is raised between iterations,
outside the try/except.
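
A minimal runnable sketch of the pattern (hypothetical names, not the actual MLC API): wrapping the generator in an async context manager guarantees cleanup even when the exception fires between iterations, where a try/except inside the generator never sees it.

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_stream(gen):
    try:
        yield gen
    finally:
        await gen.aclose()  # cleanup runs even for exceptions raised outside

async def numbers():
    for i in range(10):
        yield i

async def main():
    async with managed_stream(numbers()) as stream:
        async for n in stream:
            if n == 3:
                raise RuntimeError("raised between iterations")

try:
    asyncio.run(main())
except RuntimeError:
    pass  # the generator was still closed by managed_stream
```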
This PR exposes the prefill mode option, selecting between chunked
prefill and hybrid prefill with split-fuse decode.
This PR fixes mlc-ai#2701, where the prefill mode is chunked but the prefill requests are not collected.
This PR fixes the index error when hybrid prefill is enabled.

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
This PR revamps the benchmark submodule with a `__main__` entry
that enables running the benchmark.
This commit updates `mlc_llm.cli.worker` to be compatible with
upstream TVM apache/tvm#17180, which adds a
`num_groups` argument to the disco worker function.

To decouple this compatibility from a general TVM version bump, this
commit checks the number of `worker.py` arguments provided to
determine whether the `num_groups` argument is present. After the TVM
version used by MLC-LLM is updated to include the upstream changes,
this check can be removed.
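
A sketch of what such an argument-count check can look like (the argument list here is illustrative, not `worker.py`'s real one):

```python
def parse_worker_args(argv):
    if len(argv) == 5:
        # Newer TVM: includes the num_groups argument.
        worker_id, num_workers, num_groups, host, port = argv
    else:
        # Older TVM: no num_groups; fall back to a single group.
        worker_id, num_workers, host, port = argv
        num_groups = 1
    return int(worker_id), int(num_workers), int(num_groups), host, int(port)

assert parse_worker_args(["0", "2", "localhost", "9000"])[2] == 1
assert parse_worker_args(["0", "2", "4", "localhost", "9000"])[2] == 4
```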
* Add support for Gemma2

* Update Gemma2 impl

This commit updates the Gemma2 implementation, including the following
aspects:

1. We reuse as much code as possible from the Gemma model for
overall code clarity and maintainability.
2. We properly set the scaling factor for attention.
3. We add the final logit soft-capping for Gemma2 (sketched after this
commit message).

---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
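
For item 3, the soft-capping itself is a simple tanh squash; a NumPy sketch (30.0 is Gemma2's published final-logit cap):

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    # Squash logits smoothly into (-cap, cap); small values stay ~linear.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(x))  # every value now lies strictly within (-30, 30)
```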
Add Gemma2 2B, 9B, and 27B to the presets; remove the Gemma1 preset.
* Update the Android package config from Gemma 2B to Gemma2 2B

* Revert the Phi-3 model definition for backward compatibility
This commit switches the Gemma model in the iOS app to Gemma2.
This PR adds LLMPerf to the benchmark module.
This commit adds `<bos>` to Gemma's conversation template.
The `system_prefix_token_ids` field of the conv template usually
already contains the BOS token, which should be processed when
converting a message list to a single prompt. However, the C++ side
did not respect this field well before.
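
A Python sketch of the intended behavior (hypothetical helper, the real fix is in C++): when flattening a message list into one prompt, the ids in `system_prefix_token_ids`, e.g. Gemma's `<bos>`, must be prepended exactly once.

```python
def messages_to_prompt_tokens(messages, tokenize, system_prefix_token_ids=None):
    tokens = list(system_prefix_token_ids or [])  # e.g. [bos_token_id], once
    for role, content in messages:
        # Gemma-style turn markup, used here only for illustration.
        tokens.extend(tokenize(f"<start_of_turn>{role}\n{content}<end_of_turn>\n"))
    return tokens

fake_tokenize = lambda s: list(s.encode())        # stand-in tokenizer for demo
toks = messages_to_prompt_tokens([("user", "hi")], fake_tokenize,
                                 system_prefix_token_ids=[2])  # 2 = Gemma <bos>
assert toks[0] == 2
```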
@masahi merged commit 9903beb into mlc-serve-v0.2.0 on Aug 2, 2024