forked from mlc-ai/mlc-llm
Merge with mlc-ai/main (68cd794d02bbff9842f08b6b2ff37eb582f411c0, 2024-08-01) #277
Merged
Conversation
This PR introduces the packaging CLI `mlc_llm package`, which reads `mlc-package-config.json`, compiles the models, and prepares the model/runtime libraries automatically. With this PR we remove the prebuilt model library dependency for the iOS app build. The iOS build has been validated, and the iOS documentation is updated accordingly. The same flow is expected to work for Android as well, but the Android app build still needs verification.
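For reference, here is a minimal sketch of what preparing such a config and invoking the packaging CLI can look like. The field names and the model URL are illustrative, based on the typical `mlc-package-config.json` layout, and may not match this PR exactly.

```python
import json
import subprocess

# Hypothetical minimal package config; field names and the model URL are
# illustrative examples, not taken verbatim from this PR.
package_config = {
    "device": "iphone",
    "model_list": [
        {
            "model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
            "model_id": "Mistral-7B-Instruct-v0.2-q4f16_1",
            "estimated_vram_bytes": 4000000000,
        }
    ],
}

with open("mlc-package-config.json", "w", encoding="utf-8") as f:
    json.dump(package_config, f, indent=2)

# `mlc_llm package` reads the config, JIT-compiles the model libraries,
# and prepares the packaged artifacts for the app build.
subprocess.run(["mlc_llm", "package"], check=True)
```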
1. Avoid the CPU softmax for per-request penalty configs by synchronously copying to the GPU and using the GPU softmax. 2. Disable the decode token time counter for the first token.
Move MLCChat to its own subfolder; minor improvements to packaging.
…lc-ai#2289) [KVCACHE][TIR] Improved TIR schedule for the decode page attention kernel: 1. Improved the TIR schedule of page attention (about a 30% improvement to this function). 2. Enabled the missing dequant+matmul fusion in the Phi-2 model. Also renames K_local to QK_local, updates kv_cache.py, and increases the max thread count for android:adreno.
This PR lifts the existing `library` of the Android app into a standalone `mlc4j` directory, which can be referenced by an Android app at any location. On the app side, this PR moves the Android app into a subfolder `MLCChat`, which is itself a well-formed Android app. This folder contains two core files for the app build: * `MLCChat/mlc-package-config.json`: the config file that specifies the models to build into the app. * `MLCChat/prepare_package.py`: the Python script that automatically prepares/builds mlc4j and the model libraries. This PR also updates the Android app documentation to reflect this change.
Shorten titles so they fit into one line of the navbar, add a mention of the JIT cache, and remove the old project overview.
Avoid showing the full tree and mention what dist/lib/mlc4j stands for.
Avoid showing the full tree and mention what dist/lib/mlc4j stands for. Avoid `python3`; use `python` directly, since `python3` sometimes points to the system Python.
This PR removes mentions of legacy modules and prebuilt libraries in favor of JIT compilation.
This PR adds the `-j` option to the cmake build to parallelize the build over CPU cores.
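For illustration, the effect is roughly equivalent to invoking the build as below; this is a sketch, not the project's actual build script.

```python
import os
import subprocess

# Parallelize the native build over all available CPU cores.
num_jobs = os.cpu_count() or 1
subprocess.run(["cmake", "--build", "build", f"-j{num_jobs}"], check=True)
```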
This PR provides clearer instructions for Android JDK setup.
This PR modifies the MLCEngine chatCompletion to take in structured data. Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
This PR refactors the JSONFFI conv template to use immutable processing. This helps prevent bugs from multiple requests concurrently accessing the conversation data structure, and it reduces the need to deep-copy the struct.
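A minimal Python analogue of the idea (the actual change is in the C++ JSONFFI layer; the names here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conversation:
    system_message: str = ""
    messages: tuple = ()  # immutable message history

def with_user_message(conv: Conversation, text: str) -> Conversation:
    # Return a new Conversation instead of appending in place, so a shared
    # base template is never mutated by concurrent requests.
    return Conversation(conv.system_message, conv.messages + (("user", text),))

base = Conversation(system_message="You are a helpful assistant.")
req_conv = with_user_message(base, "Hello!")  # `base` stays untouched
```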
…lc-ai#2335) This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer. Common tokenizers, including Phi-2, Gemma, LLaMA-2, etc., are also supported. Performance is optimized for the LLaMA-3 tokenizer, since its token table has 128k entries, much larger than that of the LLaMA-2 tokenizer. These changes are introduced to the grammar library:
1. Introduce the ByteString rule expression and simplify CharacterClass and CharacterClassStar.
2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and mutating grammar rules.
3. GrammarStateMatcherBase, the internal implementation of GrammarStateMatcher, now accepts input char by char instead of codepoint by codepoint, so it supports any valid UTF-8 string even if a token is not a complete codepoint.
4. Support lookahead assertions for rules, specifying that a rule must be followed by a given sequence. This can eliminate some uncertain tokens during preprocessing.
Minor changes:
1. Introduce the template hash function HashCombine.
2. Update the UTF-8 encoding handling functions.
Performance:
1. For JSON, finding the mask requires <30us on a 5900X with a single thread. The number of uncertain tokens is <30 in most cases.
2. For JSON schema, finding the mask requires <30us on a 5900X with a single thread. The number of uncertain tokens is <30 in most cases.
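As a small illustration of why byte-level acceptance matters (an example, not code from the PR): BPE tokens can split a multi-byte UTF-8 character across token boundaries.

```python
# "é" (U+00E9) encodes to two UTF-8 bytes; a tokenizer may emit them in
# separate tokens, so neither token alone is a complete codepoint.
data = "é".encode("utf-8")            # b'\xc3\xa9'
token_a, token_b = data[:1], data[1:]
# A matcher that consumes input byte by byte can accept token_a on its own;
# a codepoint-based matcher cannot, since b'\xc3' is not a complete codepoint.
print(token_a, token_b)
```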
…older (mlc-ai#2342) [DebugChat] Fix the DebugChat softmax function and save logits to the debug folder; fix lint.
[Serving] Add Medusa speculative decoding.
This PR supports tensor parallelism for StarCoder2 and fixes two typos.
This PR defers the collection of decode inputs in hybrid prefill, since collecting decode inputs can incur significant CPU overhead only for it to turn out that no prefill can be performed. By deferring the collection of decode inputs, we can quickly decide whether prefill is doable, and that decision does not involve much CPU overhead.
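Schematically, the deferred collection looks like the sketch below; the function names are hypothetical and only illustrate the ordering of the two steps.

```python
def hybrid_prefill_step(engine):
    # Decide cheaply whether any prefill can be performed at all.
    prefill_inputs = engine.collect_prefill_inputs()
    if not prefill_inputs:
        # No prefill is doable: skip the expensive decode-input collection.
        return None
    # Only now pay the CPU cost of gathering decode inputs to fuse in.
    decode_inputs = engine.collect_decode_inputs()
    return engine.run_hybrid_prefill(prefill_inputs, decode_inputs)
```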
* Update starcoder2_quantization.py * Update qwen2_loader.py * Update qwen2_model.py * Update qwen2_moe_loader.py * Update rwkv5_loader.py * Update rwkv6_loader.py * Update qwen_loader.py * Update phi3_quantization.py * Update phi_quantization.py * Update phi3_model.py * Update phi3_model.py * Update phi3_quantization.py * fix tp
This PR supports the [Llama3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) family. In particular, we introduce the conversation template and RoPE scaling for Llama 3.1. In the future we will bring support for more RoPE scaling modes. Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
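For reference, a sketch of Llama 3.1 style RoPE frequency scaling; the constants below are the published Llama 3.1 defaults and are not necessarily the exact values used in this PR.

```python
import math

def llama31_scaled_freq(freq, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, original_max_position=8192):
    wavelen = 2 * math.pi / freq
    if wavelen < original_max_position / high_freq_factor:
        return freq                   # high-frequency band: left unchanged
    if wavelen > original_max_position / low_freq_factor:
        return freq / factor          # low-frequency band: fully scaled
    # Smooth interpolation between the two bands.
    smooth = (original_max_position / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq
```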
Introduce microsoft/Phi-3-vision from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct.
…-ai#2689) This PR revamps the FuseAddRMSNorm pass with manual pattern matching, in order to avoid `rewrite_bindings`, which is recursive and can take an unaffordable amount of time when the model is large.
…2694) If `x` has `nan` or `-inf` values, the condition `x[vi,vk] > local_top_k[0]` may be false. Falling back to the condition `x[vi,vk] > local_top_k[1]` then reads the uninitialized value in `local_top_k[1]`. This can also result in out-of-bounds memory access. If all values in `x[vi,vk]` are `nan` or `-inf` along some row `vi`, then `local_top_k_index[1]` is never populated. For mixture-of-experts models, when `gating_softmax_topk` is used to select the expert, this uninitialized value is then used as an array index. This commit updates the `top2_softmax_norm_func` implementation in `gating_softmax_topk` to initialize both elements of the `local_top_k` and `local_top_k_index` arrays, matching the implementation of `top4_softmax_norm_func`.
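A NumPy analogue of the fixed selection logic (the real kernel is TIR; this only illustrates why initializing both slots keeps the indices in bounds even for rows that are all `nan`/`-inf`):

```python
import numpy as np

def top2_select(x: np.ndarray):
    n_rows, n_experts = x.shape
    top_vals = np.full((n_rows, 2), -np.inf)         # both value slots initialized
    top_idx = np.zeros((n_rows, 2), dtype=np.int64)  # default indices are in bounds
    for i in range(n_rows):
        for k in range(n_experts):
            v = x[i, k]
            if v > top_vals[i, 0]:
                top_vals[i, 1], top_idx[i, 1] = top_vals[i, 0], top_idx[i, 0]
                top_vals[i, 0], top_idx[i, 0] = v, k
            elif v > top_vals[i, 1]:
                top_vals[i, 1], top_idx[i, 1] = v, k
    return top_vals, top_idx
```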
This PR adds a context manager to properly clean up when an exception occurs during an async for. Naively using the try/except pattern results in bugs when we chain async generators and an exception is raised between iterations, outside the try/except.
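A small sketch of the pattern using a standard-library context manager (`contextlib.aclosing`, Python 3.10+); the engine-side details are omitted.

```python
import asyncio
from contextlib import aclosing

async def token_stream():
    try:
        for tok in ["Hello", ",", " world"]:
            yield tok
    finally:
        # Guaranteed to run when the context manager closes the generator,
        # even if the consumer raised between iterations.
        print("stream cleaned up")

async def main():
    try:
        async with aclosing(token_stream()) as stream:
            async for tok in stream:
                if tok == ",":
                    raise RuntimeError("consumer failed mid-stream")
    except RuntimeError:
        pass  # cleanup in the generator's `finally` has already run

asyncio.run(main())
```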
This PR exposes a prefill mode option to choose between chunked prefill and hybrid prefill with split-fuse decode.
This PR fixes mlc-ai#2701, where the prefill mode is chunked but the prefill requests are not collected.
This PR fixes the index error when hybrid prefill is enabled. Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
This PR revamps the benchmark submodule with a `__main__` entry that enables running the benchmark.
This commit updates `mlc_llm.cli.worker` to be compatible with upstream TVM apache/tvm#17180, which adds a `num_groups` argument to the disco worker function. To de-couple this compatibility from a general TVM version bump, this commit has a check on the number of `worker.py` arguments provided, to determine whether the `num_groups` argument is present. After the TVM version used by MLC-LLM is updated to include the upstream changes, this check can be removed.
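Schematically, the check looks something like the sketch below; the argument names and ordering are illustrative, not the exact `mlc_llm.cli.worker` signature.

```python
import sys

args = sys.argv[1:]
if len(args) == 5:
    # Newer TVM (apache/tvm#17180) passes an extra `num_groups` argument.
    worker_id, num_workers, num_groups, reader_fd, writer_fd = map(int, args)
else:
    # Older TVM: no `num_groups`, so default to a single group.
    worker_id, num_workers, reader_fd, writer_fd = map(int, args)
    num_groups = 1
```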
Add support for Gemma2 and update the Gemma2 implementation, including the following aspects: 1. We reuse as much code as possible from the Gemma model for overall code structure clarity and maintainability. 2. We properly set the scaling factor for attention. 3. We add the final logit soft-capping for Gemma2. Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
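For reference, final logit soft-capping squashes the logits smoothly instead of hard clipping them; a NumPy sketch is below. The cap value 30.0 is the published Gemma2 default and may differ from this PR.

```python
import numpy as np

def soft_cap_logits(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    # Map logits smoothly into the open interval (-cap, cap).
    return cap * np.tanh(logits / cap)
```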
Add Gemma2 2B, 9B, and 27B to the presets; remove the Gemma1 preset.
Update the Android package config from Gemma 2B to Gemma2 2B; revert the Phi-3 model definition for backward compatibility.
This commit switches the Gemma model in iOS app to Gemma2.
This PR adds LLMPerf to the benchmark module.
This commit adds `<bos>` to Gemma's conversation template.
The `system_prefix_token_ids` of the conv template usually already contains the BOS token, which should be handled when converting the message list to a single prompt. However, the C++ side did not respect this field well before.
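Illustratively (the actual fix is in the C++ conversation code; the names below are hypothetical):

```python
def messages_to_prompt_tokens(conv, tokenizer, messages):
    # Prepend the prefix tokens (usually just BOS) exactly once, then
    # append the encoded messages.
    token_ids = list(getattr(conv, "system_prefix_token_ids", None) or [])
    for role, text in messages:
        token_ids += tokenizer.encode(f"{role}: {text}\n")
    return token_ids
```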
masahi approved these changes on Aug 2, 2024.
Summary: `python/mlc_llm/grammar/grammar.py` has a change in `find_next_token_bitmask_as_ndarray`, and this might come with a change on the C++ side as well. cc @adstraw