
Merge with mlc-ai/main (68cd794d02bbff9842f08b6b2ff37eb582f411c0, 2024-08-01) #277

Merged: 532 commits merged into mlc-serve-v0.2.0 on Aug 2, 2024

Conversation

@sunggg (Member) commented on Aug 1, 2024

Summary

  • `python/mlc_llm/grammar/grammar.py` changes `find_next_token_bitmask_as_ndarray`, which may come with a corresponding change on the C++ side as well (see the packing sketch below). cc @adstraw
  • The other changes relate to their engine or to Llama 3.1 support, so they do not impact our internal flow.
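
For context, a next-token bitmask packs one bit per vocabulary entry into 32-bit words. The sketch below is plain NumPy with hypothetical helper names; it is not the actual signature of `find_next_token_bitmask_as_ndarray`:

```python
import numpy as np

def pack_token_bitmask(allowed_token_ids, vocab_size):
    """Pack allowed token ids into ceil(vocab_size / 32) uint32 words."""
    bitmask = np.zeros((vocab_size + 31) // 32, dtype=np.uint32)
    for token_id in allowed_token_ids:
        bitmask[token_id // 32] |= np.uint32(1 << (token_id % 32))
    return bitmask

def is_token_allowed(bitmask, token_id):
    return bool(bitmask[token_id // 32] & (1 << (token_id % 32)))

mask = pack_token_bitmask([0, 5, 100], vocab_size=128)
assert is_token_allowed(mask, 5) and not is_token_allowed(mask, 6)
```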

vinx13 and others added 30 commits May 7, 2024 19:43
This PR introduces the packaging CLI `mlc_llm package`, which
reads `mlc-package-config.json`, compiles the models, and
prepares the model/runtime libraries automatically.

With this PR, we remove the prebuilt model library dependency
for the iOS app build.

Validated that the iOS build works. The iOS documentation is updated
according to this latest change. The same flow should work
for Android as well, though it still needs verification for the
Android app build.
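
As a sketch of what such a package config might contain (field names here are assumptions based on the description above, not a verified schema):

```python
import json

package_config = {
    "device": "iphone",  # hypothetical: target platform for `mlc_llm package`
    "model_list": [
        {
            # hypothetical entry: a model to compile and bundle into the app
            "model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
            "model_id": "Mistral-7B-Instruct-v0.2-q4f16_1",
        }
    ],
}

with open("mlc-package-config.json", "w") as f:
    json.dump(package_config, f, indent=2)
```

Running `mlc_llm package` in the directory containing this file would then compile the listed models and prepare the runtime libraries.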
1. Avoid the CPU softmax for different penalty configs by
   copying (with sync) to the GPU and using the GPU softmax (sketched below).
2. Disable the decode token time counter for the first token.
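
A minimal sketch of the first item, using PyTorch-like APIs purely for illustration (MLC's actual kernels are TVM-based) and assuming a CUDA device:

```python
import torch

def gpu_penalized_softmax(logits_cpu, generated_token_ids, repetition_penalty=1.1):
    logits = logits_cpu.to("cuda", non_blocking=True)  # one copy to the GPU
    token_ids = torch.tensor(generated_token_ids, device=logits.device)
    picked = logits[token_ids]
    # Standard repetition-penalty rule: shrink positive logits, grow negative ones.
    logits[token_ids] = torch.where(
        picked > 0, picked / repetition_penalty, picked * repetition_penalty)
    return torch.softmax(logits, dim=-1)  # softmax stays on the GPU
```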
Move MLCChat to its own subfolder; minor improvements to packaging.
…lc-ai#2289)

* [KVCACHE][TIR] Improved TIR schedule for decode page attention

 1. Improved the TIR schedule of page attention (a 30% improvement to
this function).
 2. Enabled the missing dequant+matmul fusion in the Phi-2 model

* Updated K_local to QK_local

* Update kv_cache.py

* Increase max threads for android:adreno
This PR lifts the existing `library` of the Android app into a standalone
`mlc4j` directory, which can be referenced by an Android app at any
location.

On the app side, this PR moves the Android app into a subfolder
`MLCChat`, which itself is a well-formed Android app. This folder
contains two core files for the app build:

* `MLCChat/mlc-package-config.json`: the config file that specifies
the models to build into the app.
* `MLCChat/prepare_package.py`: the Python script that helps
automatically prepare/build mlc4j and the model libraries.

This PR also updates the Android app documentation to reflect this
latest change.
Shorten titles so they fit into one line of the navbar; add a mention of the JIT cache.
Remove old project overview
Avoid showing the full tree and mention what dist/lib/mlc4j stands for
Use python directly instead of python3, since python3 sometimes
points to the system Python.
This PR removes mentions of legacy modules
and prebuilt libraries in favor of JIT.
This PR adds the `-j` option to the cmake build to parallelize the
build job across CPU cores.
This PR sets clearer instructions for the Android JDK setup
This PR modifies the MLCEngine chatCompletion to take in structured data.

Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
This PR refactors the JSONFFI conv template to use immutable processing.
This helps prevent bugs from multiple requests concurrently accessing
the conversation data structure.

It also reduces the need to deep-copy the struct.
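
A minimal Python sketch of the immutable-processing idea (the actual change is in the C++ JSONFFI code): each request derives a new conversation value instead of mutating a shared one, so concurrent requests cannot corrupt the template and no defensive deep copy is needed.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Conversation:
    system: str = ""
    messages: Tuple[Tuple[str, str], ...] = ()

    def with_message(self, role: str, content: str) -> "Conversation":
        # Returns a new Conversation; `self` is never modified.
        return replace(self, messages=self.messages + ((role, content),))

template = Conversation(system="You are a helpful assistant.")
req_a = template.with_message("user", "hi")
req_b = template.with_message("user", "hello")
assert template.messages == ()  # the shared template is untouched
```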
…lc-ai#2335)

This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer.

Common tokenizers, including those of Phi-2, Gemma, LLaMA-2, etc., are
also supported.

The performance is optimized for the LLaMA-3 tokenizer, since its token
table has 128k entries, much larger than the LLaMA-2 tokenizer's.

These changes are introduced to the grammar library:
1. Introduce ByteString rule expression and simplify CharacterClass
   and CharacterClassStar
2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and
   mutating grammar rules
3. GrammarStateMatcherBase, the internal implementation of
   GrammarStateMatcher, now accepts input char by char instead of
   codepoint by codepoint, so it supports any valid UTF-8 string, even
   if a token is not a complete codepoint (illustrated after the
   performance notes below).
4. Support lookahead assertions for rules, specifying that a rule must
   be followed by a given sequence. This can eliminate some uncertain
   tokens during preprocessing.

Minor changes:
1. Introduce the template hash function HashCombine
2. Update the UTF-8 encoding handling functions

Performance:
1. For JSON, finding the mask takes <30 µs on a 5900X with a single
   thread. The number of uncertain tokens is <30 in most cases.
2. For a JSON schema, finding the mask likewise takes <30 µs on a 5900X
   with a single thread, with <30 uncertain tokens in most cases.
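
To see why the char-by-char (byte-level) acceptance in change 3 matters, consider a BPE token that splits a UTF-8 codepoint; the matcher must consume raw bytes (`accept_byte` below is a hypothetical stand-in for its per-char step):

```python
text = "héllo"
data = text.encode("utf-8")            # b'h\xc3\xa9llo'
token_a, token_b = data[:2], data[2:]  # b'h\xc3' cuts the 'é' codepoint in half

try:
    token_a.decode("utf-8")            # a codepoint-based matcher is stuck here
except UnicodeDecodeError:
    pass

consumed = []
for byte in token_a + token_b:         # accept_byte(byte) in the real matcher
    consumed.append(byte)
assert bytes(consumed).decode("utf-8") == text  # byte-level matching recovers
```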
…older (mlc-ai#2342)

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder

* Fix lint
* [Serving] Add Medusa speculative decoding
tlopex and others added 28 commits July 16, 2024 22:41
This PR supports the TP (tensor parallelism) function of StarCoder2 and fixes two typos.
This PR defers the collection of decode inputs in hybrid prefill,
as collecting decode inputs may incur significant CPU overhead
even when it turns out that no prefill can be performed. By deferring
the collection of decode inputs, we can quickly decide whether prefill
is doable, and this decision does not involve much CPU overhead.
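
A minimal, self-contained sketch of the deferral idea (names are hypothetical, not the engine's actual code): run the cheap feasibility check first, and only collect decode inputs, the expensive step, once prefill is known to happen.

```python
def schedule_hybrid_prefill(pending_prefill, running_decode, token_budget):
    # Cheap check: can any pending prefill fit into the token budget at all?
    fitting = [req for req in pending_prefill if req["prompt_len"] <= token_budget]
    if not fitting:
        return None  # bail out before paying the decode-collection cost

    # Only now pay the comparatively expensive decode-input collection.
    decode_inputs = [{"req": req, "last_token": req["tokens"][-1]}
                     for req in running_decode]
    return {"prefill": fitting[0], "decode": decode_inputs}

pending = [{"prompt_len": 2048, "tokens": []}]
running = [{"tokens": [1, 2, 3]}]
assert schedule_hybrid_prefill(pending, running, token_budget=1024) is None
```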
* Update starcoder2_quantization.py

* Update qwen2_loader.py

* Update qwen2_model.py

* Update qwen2_moe_loader.py

* Update rwkv5_loader.py

* Update rwkv6_loader.py

* Update qwen_loader.py

* Update phi3_quantization.py

* Update phi_quantization.py

* Update phi3_model.py

* Update phi3_model.py

* Update phi3_quantization.py

* Fix TP
This PR supports the [Llama 3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
family.

In particular, we introduce the conversation template and RoPE scaling
for Llama 3.1. In the future we will bring support for more RoPE
scaling variants.

Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
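
For reference, a sketch of the Llama 3.1 RoPE frequency scaling as published for the reference implementation (the constants below are the published defaults, not necessarily what this PR hard-codes): high-frequency dimensions are kept, low-frequency ones are divided by the scaling factor, and the band in between is smoothly interpolated.

```python
import math

def llama31_scaled_freq(freq, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, original_max_len=8192):
    wavelen = 2 * math.pi / freq
    if wavelen < original_max_len / high_freq_factor:
        return freq                  # high-frequency band: unchanged
    if wavelen > original_max_len / low_freq_factor:
        return freq / factor         # low-frequency band: fully rescaled
    # middle band: interpolate smoothly between the two regimes
    smooth = (original_max_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq

base = 500000.0                      # Llama 3.1 rope_theta
freqs = [base ** (-2 * i / 128) for i in range(64)]
scaled = [llama31_scaled_freq(f) for f in freqs]
```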
…-ai#2689)

This PR revamps the FuseAddRMSNorm pass with manual pattern matching,
in order to avoid `rewrite_bindings`, which is recursive and may
take an unaffordable amount of time when the model is large.
…2694)

If `x` has `nan` or `-inf` values, the condition `x[vi,vk] >
local_top_k[0]` may be false.  Falling back to the condition `x[vi,vk]
> local_top_k[1]` then reads the uninitialized value in
`local_top_k[1]`.

This can also result in out-of-bounds memory access.  If all values in
`x[vi,vk]` are `nan` or `-inf` along some row `vi`, then
`local_top_k_index[1]` is never populated.  For mixture-of-experts
models, when `gating_softmax_topk` is used to select the expert, this
uninitialized value is then used as an array index.

This commit updates the `top2_softmax_norm_func` implementation in
`gating_softmax_topk` to initialize both elements of the `local_top_k`
and `local_top_k_index` arrays, matching the implementation of
`top4_softmax_norm_func`.
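
A NumPy sketch of the fix (the real kernel is TIR): with both slots of the value and index arrays initialized up front, a row of all `nan`/`-inf` yields `nan` probabilities but valid, in-bounds indices.

```python
import numpy as np

def top2_softmax_norm(x):
    n, m = x.shape
    top_vals = np.full((n, 2), -np.inf)         # both slots initialized
    top_idx = np.zeros((n, 2), dtype=np.int64)  # never left as garbage
    for i in range(n):
        for k in range(m):
            v = x[i, k]
            if v > top_vals[i, 0]:
                top_vals[i, 1], top_idx[i, 1] = top_vals[i, 0], top_idx[i, 0]
                top_vals[i, 0], top_idx[i, 0] = v, k
            elif v > top_vals[i, 1]:
                top_vals[i, 1], top_idx[i, 1] = v, k
    e = np.exp(top_vals - top_vals[:, :1])      # softmax over the two picks
    return e / e.sum(axis=1, keepdims=True), top_idx

probs, idx = top2_softmax_norm(np.array([[0.1, np.nan, 0.7, 0.3]]))
assert list(idx[0]) == [2, 3]                   # nan is skipped, indices valid
```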
This PR adds a context manager to properly clean up
when an exception occurs during `async for`.

Naively using the try/except pattern results in bugs when we chain
async generators and the exception is raised between iterations,
outside the try/except.
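
A minimal runnable sketch of the pattern (hypothetical names, not the actual MLC API): wrapping the generator in an async context manager guarantees cleanup even when the exception fires between iterations, where a try/except inside the generator never sees it.

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_stream(gen):
    try:
        yield gen
    finally:
        await gen.aclose()  # cleanup runs even for exceptions raised outside

async def numbers():
    for i in range(10):
        yield i

async def main():
    async with managed_stream(numbers()) as stream:
        async for n in stream:
            if n == 3:
                raise RuntimeError("raised between iterations")

try:
    asyncio.run(main())
except RuntimeError:
    pass  # the generator was still closed by managed_stream
```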
This PR exposes the prefill mode option, selecting between chunked
prefill and hybrid prefill with split-fuse decode.
This PR fixes mlc-ai#2701, where the prefill mode is chunked but the prefill requests are not collected.
This PR fixes the index error when hybrid prefill is enabled.

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
This PR revamps the benchmark submodule with a `__main__` entry
that enables running the benchmark.
This commit updates `mlc_llm.cli.worker` to be compatible with
upstream TVM apache/tvm#17180, which adds a
`num_groups` argument to the disco worker function.

To decouple this compatibility from a general TVM version bump, this
commit checks the number of `worker.py` arguments provided to
determine whether the `num_groups` argument is present. After the TVM
version used by MLC-LLM is updated to include the upstream changes,
this check can be removed.
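
A sketch of what such an argument-count check can look like (the argument list here is illustrative, not `worker.py`'s real one):

```python
def parse_worker_args(argv):
    if len(argv) == 5:
        # Newer TVM: includes the num_groups argument.
        worker_id, num_workers, num_groups, host, port = argv
    else:
        # Older TVM: no num_groups; fall back to a single group.
        worker_id, num_workers, host, port = argv
        num_groups = 1
    return int(worker_id), int(num_workers), int(num_groups), host, int(port)

assert parse_worker_args(["0", "2", "localhost", "9000"])[2] == 1
assert parse_worker_args(["0", "2", "4", "localhost", "9000"])[2] == 4
```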
* Add support for Gemma2

* Update Gemma2 impl

This commit updates the Gemma2 implementation, including the following
aspects:

1. We reuse as much code as possible from the Gemma model for
overall code clarity and maintainability.
2. We properly set the scaling factor for attention.
3. We add the final logit soft-capping for Gemma2 (sketched after this
commit message).

---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
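
For item 3, the soft-capping itself is a simple tanh squash; a NumPy sketch (30.0 is Gemma2's published final-logit cap):

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    # Squash logits smoothly into (-cap, cap); small values stay ~linear.
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(x))  # every value now lies strictly within (-30, 30)
```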
Add Gemma2 2B, 9B, and 27B to the presets; remove the Gemma1 preset.
* Update the Android package config from Gemma 2B to Gemma2 2B

* Revert the Phi-3 model definition for backward compatibility
This commit switches the Gemma model in the iOS app to Gemma2.
This PR adds LLMPerf to the benchmark module.
This commit adds `<bos>` to Gemma's conversation template.
The `system_prefix_token_ids` field of the conv template usually
already contains the BOS token, which should be processed when
converting a message list to a single prompt. However, the C++ side
did not respect this field well before.
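
A Python sketch of the intended behavior (hypothetical helper, the real fix is in C++): when flattening a message list into one prompt, the ids in `system_prefix_token_ids`, e.g. Gemma's `<bos>`, must be prepended exactly once.

```python
def messages_to_prompt_tokens(messages, tokenize, system_prefix_token_ids=None):
    tokens = list(system_prefix_token_ids or [])  # e.g. [bos_token_id], once
    for role, content in messages:
        # Gemma-style turn markup, used here only for illustration.
        tokens.extend(tokenize(f"<start_of_turn>{role}\n{content}<end_of_turn>\n"))
    return tokens

fake_tokenize = lambda s: list(s.encode())        # stand-in tokenizer for demo
toks = messages_to_prompt_tokens([("user", "hi")], fake_tokenize,
                                 system_prefix_token_ids=[2])  # 2 = Gemma <bos>
assert toks[0] == 2
```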
@masahi merged commit 9903beb into mlc-serve-v0.2.0 on Aug 2, 2024