
[Serving] PagedKVCache tree-attention integration #2487

Merged
merged 1 commit into mlc-ai:main on Jun 4, 2024

Conversation

MasterJH5574
Member

This PR integrates the recent support of tree-attention in PagedKVCache into the speculative decoding in MLC. Right now only chains are supported. Tree-based speculative decoding is on the project roadmap, and we plan to support it in the near future.
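
For reference, here is a minimal Python sketch of chain-style draft verification; all names and interfaces below are hypothetical illustrations, not the MLC API.

```python
# Minimal sketch of chain-style speculative decoding verification.
# All names and interfaces here are illustrative, not the MLC API.
from typing import Callable, List

def verify_chain(
    draft_tokens: List[int],
    main_model_pick: Callable[[int], int],
    prefix_last_token: int,
) -> List[int]:
    """Accept the longest prefix of the draft chain that the main model agrees with.

    `main_model_pick(token)` stands in for "the token the main model picks after
    seeing `token`"; in practice all positions are scored in one batched verify
    step using the chain/tree attention mask in PagedKVCache.
    """
    accepted: List[int] = []
    prev = prefix_last_token
    for draft in draft_tokens:
        target = main_model_pick(prev)
        if target != draft:
            accepted.append(target)  # take the main model's correction and stop
            return accepted
        accepted.append(draft)
        prev = draft
    # Fully accepted: the main model contributes one extra "bonus" token.
    accepted.append(main_model_pick(prev))
    return accepted

# Example: the draft proposes [5, 7, 9]; the main model diverges at the third token.
picks = {1: 5, 5: 7, 7: 8}
print(verify_chain([5, 7, 9], lambda t: picks[t], prefix_last_token=1))  # [5, 7, 8]
```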

@MasterJH5574 MasterJH5574 marked this pull request as ready for review June 4, 2024 00:11
This PR integrates the recent support of tree-attention in PagedKVCache
into the speculative decoding in MLC. Right now only chains are
supported. Tree-based speculative decoding is on the project roadmap
and we plan to support it in the near future.
@tqchen tqchen merged commit c0c33a5 into mlc-ai:main Jun 4, 2024
2 checks passed
sunggg added a commit to octoml/mlc-llm that referenced this pull request Jul 8, 2024
… July 2nd 2024) (#272)

* [Bugfix] layer_norm_eps in GPT2Config should be float (#2240)

* [REFACTOR] Migrate JSONFFIEngine to formal namespace (#2241)

This PR migrates JSONFFIEngine to a formal namespace.
Also list TODOs to further simplify the JSONFFIEngine.

* [Serving] Share disco sessions among multiple model function tables (#2242)

* [DOC] Improve Install via environment variable (#2245)

improve Install via environment variable

* [Sampler] FlashInfer sampling func integration (#2224)

This PR integrates the sampling function in FlashInfer.
We integrate the one without top-p for now.

* Model Library Delivery (#2139)

* add model lib delivery

* fix lint

* [Support] Simplify function names in encoding.h (#2251)

This PR simplifies the tool function names in encoding.h. The new names are
- PrintAsUTF8
- PrintAsEscaped
- ParseNextUTF8
- ParseUTF8
- ParseNextUTF8OrEscaped

Also makes ParseNextUTF8 return the new char pointer instead of the number of
chars processed, to make the interface simpler.

* [Serving] Introduce DraftTokenWorkspaceManager (#2250)

Use DraftTokenWorkspaceManager to maintain the workspace for draft probs
and hidden states (if needed). This allows the draft token states to
be kept fully on the GPU.
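
As a rough illustration of the idea only, the sketch below shows a manager handing out reusable slot indices into preallocated GPU buffers; the class and methods are hypothetical, not the actual MLC C++ implementation.

```python
# Illustrative sketch of a draft-token workspace manager that hands out reusable
# slot indices into preallocated GPU buffers for probs/hidden states.
# Hypothetical Python, not the actual MLC C++ implementation.
class DraftTokenWorkspaceManagerSketch:
    def __init__(self, capacity: int):
        self._free_slots = list(range(capacity))

    def allocate(self, num_tokens: int) -> list:
        if num_tokens > len(self._free_slots):
            raise RuntimeError("draft token workspace exhausted")
        return [self._free_slots.pop() for _ in range(num_tokens)]

    def free(self, slots: list) -> None:
        self._free_slots.extend(slots)

manager = DraftTokenWorkspaceManagerSketch(capacity=8)
slots = manager.allocate(3)  # draft step writes probs/hidden states into these rows
manager.free(slots)          # released once verification consumes them
```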

* [Fix] fix a typo in event_trace_recorder (#2253)

* Fix typo in event_tracer

* [Tokenizer] Support ByteLevel BPE in tokenizer token table (#2248)

* [Eagle] Avoid worker - engine transfer for hidden states (#2256)

* [Serving] Add engine stats for speculative decoding (#2257)

* [Serving] Fix lints (#2258)

* [Sampler] Avoid unnecessary sync in GPU verifier (#2260)

* Fix typo in token_postproc_method names (#2261)

* [Sampler] Add missing sync in gpu verifier (#2262)

* [Model] Remove redundant space in llama2 tokenizer (#2263)

* [Model] Fix llama2 chat template and remove redundant separator added by engine (#2264)

* [Model] Fix llama2 chat template and remove redundant separator added by engine

* [Refactor][Serving] EngineConfig refactor and "model-lib-path" rename (#2268)

* This PR refactors the EngineConfig to allow minimal JSON string
passing. This is helpful for the JSONFFIEngine construction.
* This PR moves the automatic engine config inference from Python side
to C++ side, so that we don't have duplicate code on multiple platforms.
* This PR renames `model_lib_path` to `model_lib`.
* This PR makes the reload/unload of ThreadedEngine act in a blocking
style.
* This PR refactors the default generation config process flow,
and unifies everything to C++.

* [Serving] Add some try-except captures in AsyncMLCEngine (#2265)

* [Serving] Add some try-except captures in AsyncMLCEngine

* [Eagle] Fix token shifting for prefill step (#2266)

* [Fix] Fix the two-stage softmax func by removing log2e (#2269)

* [Fix] Fix the two-stage softmax func by removing log2e

When the two-stage softmax was introduced, we used a log2e numeric
transformation for potentially better performance.

However, at low temperature the log2e transformation is not numerically
stable, which may cause the softmax result to not sum up to 1.

This PR fixes this by removing all log2e-related calculation.

* Remove redundant import
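
A minimal NumPy sketch of the two-stage softmax described above, computed with exp() directly (no log2e rewrite) and assuming a single chunk for brevity; this is illustrative code, not the actual TIR kernel.

```python
# Sketch of the two-stage softmax with temperature, using exp() directly.
import numpy as np

def two_stage_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    x = (logits / temperature).astype(np.float32)
    # Stage 1: per-chunk running max and sum of exp (one chunk here for brevity).
    m = np.max(x, axis=-1, keepdims=True)
    s = np.sum(np.exp(x - m), axis=-1, keepdims=True)
    # Stage 2: normalize with the globally reduced max/sum.
    return np.exp(x - m) / s

probs = two_stage_softmax(np.array([10.0, 9.5, 1.0]), temperature=0.05)
assert abs(probs.sum() - 1.0) < 1e-6  # sums to 1 even at low temperature
```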

* [Eagle] Fix missing broadcast in hidden states gather/scatter (#2271)

* [Eagle] Fix missing broadcast in hidden states gather/scatter

* [Sampler] Use pivot-based renormalization for top-p sampling (#2272)

This PR integrates the pivot-based prob renormalization for top-p
sampling, which is a few times faster than the current
sort-based top-p sampling on CUDA.
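
A rough NumPy sketch of the pivot idea: binary-search a probability cutoff instead of sorting. This is illustrative only, not the CUDA kernel.

```python
# Sketch of pivot-based top-p renormalization: binary-search a probability pivot
# so that the kept mass just reaches top_p, then zero the rest and renormalize.
import numpy as np

def renorm_top_p(probs: np.ndarray, top_p: float, iters: int = 32) -> np.ndarray:
    lo, hi = 0.0, float(probs.max())
    for _ in range(iters):
        pivot = (lo + hi) / 2
        kept_mass = probs[probs >= pivot].sum()
        if kept_mass >= top_p:
            lo = pivot   # can afford a higher cutoff
        else:
            hi = pivot   # cutoff too aggressive
    kept = np.where(probs >= lo, probs, 0.0)
    return kept / kept.sum()

p = np.array([0.5, 0.25, 0.15, 0.10])
print(renorm_top_p(p, top_p=0.7))  # keeps 0.5 and 0.25, renormalized to [0.667, 0.333, 0, 0]
```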

* [JSONFFI] Update JSONFFI error checking with the Result class (#2275)

This PR updates the error checking in JSONFFIEngine and related request
parsing to use the Result class.

* [Bugfix] fix _kv_cache_transpose_append buffer read region error (#2277)

* improve Install via environment variable

* [HotFix] fix kv_cache_transpose_append buffer region

* [GenConfig] Set upper bound for prefill chunk size (#2278)

By default the prefill chunk size is set to the context window size
or the sliding window size. When that number is large, our memory
planning during model compilation will allocate a lot of memory.

Given that we support input chunking, we can reduce the prefill
chunk size to a small value to save runtime memory.

This PR sets the prefill chunk size to be at most 2048.
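
A minimal sketch of the default described above (illustrative helper, not the MLC config code):

```python
# Cap the prefill chunk size at 2048 instead of inheriting the full
# context/sliding window size.
from typing import Optional

def default_prefill_chunk_size(context_window: int, sliding_window: Optional[int]) -> int:
    base = sliding_window if sliding_window is not None else context_window
    return min(base, 2048)

print(default_prefill_chunk_size(context_window=128_000, sliding_window=None))  # 2048
```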

* [iOS] Initial scaffolding of MLCEngine in Swift (#2279)

[iOS] Initial scaffolding of LLMEngine in Swift

This PR adds the initial scaffolding of LLMEngine in Swift.
We wrap callbacks into AsyncStream so they can be accessed using the for-await API.

We also added a minimal example app to showcase the new MLCEngine;
the old ChatModule is still used in the MLCChat app.

The return value is already structured.
We still need to structure the chat completion interface.

* Rename READMD.md to README.md

* [Serving] Image support in JSONFFIEngine (#2208)

Using new Result interface

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

* [Pass] Attach manual softmax-with-temperature (#2280)

This PR updates all the models to use the new softmax-with-temperature
function, which inlines the temperature division (or argmax if
temperature is 0) process into the two-stage softmax.

A unit benchmark shows that inlining the division does no harm to the
softmax. When the batch size is large, the inlined softmax can have better
performance than a standalone divide kernel, which otherwise takes
significant time.

* [Model] Remove unused import to fix lint (#2284)

This PR removes the unused import in llava model to fix lint.

* [Serving] Fix BatchVerify to feed the extra token when fully accepted (#2285)

This PR fixes a bug in the BatchVerify action.
When a draft model's proposal is fully accepted by the main model, there
is an extra token which is already in the main model's KV cache but not
in the draft model's KV cache.

Prior to this PR, the BatchVerify action did not feed this extra token into
the draft model's KV cache, which caused a size mismatch between the
main model's KV cache and the draft model's KV cache.

This PR fixes this issue by adding an additional BatchDecode step for
the requests whose draft proposals are fully accepted by the main model.
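
A sketch of the control flow described above; the interfaces are hypothetical Python, not the actual MLC C++ action.

```python
# After verification, requests whose whole draft was accepted get one extra
# BatchDecode step on the draft model so its KV cache catches up with the
# bonus token already present in the main model's KV cache.
def batch_verify_step(requests, draft_model, main_model):
    fully_accepted = []
    for req in requests:
        accepted = main_model.verify(req, req.draft_tokens)
        req.commit(accepted)
        if len(accepted) > len(req.draft_tokens):
            # All draft tokens accepted plus one bonus token from the main model:
            # that token is in the main KV cache but not yet in the draft's.
            fully_accepted.append(req)
    if fully_accepted:
        # Extra decode keeps both KV caches at the same length.
        draft_model.batch_decode(fully_accepted)
```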

* Update engine.cc

* [CMAKE][BUILD] Add config option to enable OpenCL Host ptr (#2287)

[CMAKE][BUILD] Add user option to enable OpenCL Host ptr

* [Serving][Fix] Pass draft length when constructing draft action (#2291)

This PR fixes a bug which does not pass the speculative decoding
draft length to the draft generation stage.

* [Pass] Fix sampling func attachment to not read existing vocab size (#2292)

This PR updates the AttachGPUSamplingFunc pass to make each sampling
func have an independent dynamic vocab size var, so we do not have to
read the vocab size from the prefill function.

* [SLM] Introduce microsoft/Phi-3 (#2222)

Introduce microsoft/Phi-3 from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

* [Eagle] Run additional decode for draft model when all proposals are accepted (#2294)

* [iOS] Introducing package CLI for iOS app packaging (#2297)

This PR introduces the packaging CLI `mlc_llm package` which
reads from a `mlc-package-config.json` and compiles model
and prepares model/runtime libraries automatically.

With this PR, we get rid of prebuilt model library dependency
for iOS app build.

Validated that the iOS build can work. iOS documentation is updated
according to this latest change. The same flow is supposed to work
for Android as well, while it still needs verification for Android
app build.

* Increase the timeout in PopenServer (#2298)

* [LLM-CHAT] Enable gpu softmax for penality softmax (#2288)

1. Avoid the CPU softmax for different penalty configs by
  copy-syncing to the GPU and using the GPU softmax.
2. Disable the decode token time counter for the first token.

* [iOS][REFACTOR] Restructure the iOS folders (#2299)

Move MLCChat to its own subfolder; minor improvements to packaging.

* [KVCACHE][TIR] Improved tir schedule for decode tir page attention (#2289)

* [KVCACHE][TIR] Improved tir schedule for decode tir page attention

 1. Improved the TIR schedule of page attention (this improves the
function by 30%).
 2. Enable the missing dequant+matmul fusion in the phi-2 model

* Updated K_local to QK_local

* Update kv_cache.py

* Increase max thread for android:adreno

* [Sampler] Remove unneeded output_prob_dist param (#2300)

* Enable cuda graph for batch_verify (#2304)

* [Android] Introducing mlc4j and app packaging (#2305)

This PR lifts the existing `library` of android app into a standalone
`mlc4j` directory, which can be referenced by android app at any
location.

On the app side, this PR moves the android app into a subfolder
`MLCChat` which itself is a well-formed Android app. This folder
contains two core files for app build:

* `MLCChat/mlc-package-config.json` the config file that specifies
the models to build into the app.
* `MLCChat/prepare_package.py` the Python script that helps
automatically prepare/build mlc4j and model libraries.

This PR also updates the android app documentation to reflect this
latest change.

* [DOCS] Minor cleanup (#2308)

Shorten titles so they fit into one line of the navbar, add a mention of the JIT cache.
Remove the old project overview.

* [DOCS] Update android doc (#2309)

Avoid showing the full tree and mention what dist/lib/mlc4j stands for

* [DOCS] Update android doc (#2310)

Avoid showing the full tree and mention what dist/lib/mlc4j stands for.
Use python directly instead of python3, since python3 sometimes
points to the system Python.

* [SLM] Support BERT architecture. Implement a text embedding module (#2249)

* [Serving] Log batch size in NVTX (#2312)

* [Model] Removing unnecessary reshapes in get_logits (#2314)

* Skip cublas dispatch for single batch (#2315)

* Auto updated submodule references

* [DOCS] Remove mention of legacy modules (#2318)

This PR removes mention of legacy modules
and prebuilt in favor of JIT.

* [Android] Add `-j` option to cmake build (#2321)

This PR adds the `-j` option to cmake build to parallelize the
build job over CPU cores.

* [DOCS] More clear android instruction (#2327)

This PR provides clearer instructions for the Android JDK setup

* [Serving] Refactor to consolidate new request prefill (#2329)

* [iOS] Make MLCEngine input to take in structured data (#2330)

This PR modifies the MLCEngine chatCompletion to take in structured data.

Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>

* [REFACTOR] Refactor JSONFFI Conv template (#2331)

This PR refactors JSONFFI conv template to use immutable processing.
This helps to prevent bugs from multiple requests and concurrent
access to the conversation data structure.

It also reduces the need to deep copy the struct.

* [Eagle] Fix the requests for additional decode in eagle verify (#2336)

* [Serving][Grammar] Refactor GrammarStateMatcher and support LLaMA-3 (#2335)

This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer.

Common tokenizers, including Phi-2, Gemma, LLaMA-2, etc. are also
supported.

The performance is optimized for the LLaMA-3 tokenizer since its token table
has 128k entries, much larger than that of the LLaMA-2 tokenizer.

These changes are introduced to the grammar library:

1. Introduce ByteString rule expression and simplify CharacterClass
   and CharacterClassStar
2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and
   mutating grammar rules
3. Now GrammarStateMatcherBase, the internal implementation of the
   GrammarStateMatcher, accepts input char by char instead of codepoint by
   codepoint, so it supports any valid UTF-8 string, even if the token
   is not a complete codepoint.
4. Support lookahead assertions for rules to specify that the rule must be
   followed by a sequence. This can eliminate some uncertain tokens
   in preprocessing.

Minor changes:
1. Introduce template hash function HashCombine
2. Update the UTF8 encoding handling functions

Performance:
1. For JSON, finding the mask requires <30us on a 5900X with a single thread.
   The number of uncertain tokens is <30 in most cases.
2. For JSON schema, finding the mask requires <30us on a 5900X with a single
   thread. The number of uncertain tokens is <30 in most cases.

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder (#2342)

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder

* Fix lint

* [Serving] Add Medusa speculative decoding (#2337)


* [Serving] Add Medusa speculative decoding

* Fix cublas offloading (#2343)

* Add false for arg worker0_only in disco.empty (#2344)

* Auto updated submodule references

* [JSONFFIEngine] Refactor device argument and request_stream_callback argument (#2334)

* 1. Refactor init_background_engine in JSONFFIEngine to use device_type and device_id arguments.
2. request_stream_callback is called on each string of the array of strings.

* Call the callback on a string containing the list of JSON dicts, instead of calling it multiple times on each JSON dict string

---------

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

* [Serving] Add reset_engine in debug_entrypoints (#2347)

* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue (#2358)

* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue

* [JSON FFI] Example Android Application using JSON FFI Engine (#2322)

* pass str to callback and not List[str]

add JSON FFI android example

fix lint

Refactor MLCEngineExample and MLCEngine.kt

Use ChatCompletionMessageContent class

ChatCompletionMessageContent: text and parts

* JSONFFIEngine: Cast request_stream_callback argument to std::string. Decode in Android as List<ChatCompletionStreamResponse>

---------

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

* [iOS] Update MLCEngine API to latest JSON FFI convention (#2359)

This PR updates the MLCEngine API to latest JSON FFI convention.

* [JSONFFI] Fix JSONFFI conv template. Add unit tests (#2360)

* [Fix][Serving] Fix prefill chunk in interactive mode (#2363)

This PR fixes a bug in prefill chunking in the interactive mode.
The bug counted requests with remaining inputs as running requests,
which ended up disabling the prefill of the remaining inputs.

This PR fixes it by no longer counting requests with unfinished inputs
as running requests for decode.

* [Fix][Serving] Respect sliding window size in config inference (#2364)

This PR fixes the automatic engine config inference, which did not
respect the sliding window size and thus led to higher-than-expected
memory usage in the interactive mode for the Mistral model.

* [iOS] Add padding to app icon (#2365)

* [Serving] Fix the self-ref in engine (#2367)

This PR fixes the self-reference in the engine and enables auto-terminate in the deleter.

* [Serving] Prefix Cache (#2295)

* [Serving] Prefix Cache

This PR introduces the prefix cache into the serving engine, to manage prefixes and accelerate the prefill process.

* [Fix] Use static_cast for `.size()` for safety (#2369)

This PR updates the occurrences of `.size() - 1` to use static_cast
to avoid integer underflow.

* [Serving] Sliding-window-aware request prefill (#2370)

This PR supports the prefill conditions with sliding window awareness.
Now when the input length is larger than the sliding window size,
the prefill can still be processed without error.

* [iOS] Update MLCSwift to fully follow OAI style. (#2371)

It also refactors MLCSwift to follow the
engine.chat.completions.create style, as per
other OpenAI APIs.

It also removes the cyclic dependencies in
the closure capture by having a separate EngineState.

* Add nvtx in logic update (#2372)

* [Test] Use HF model for JIT as much as possible (#2373)

This PR updates the test files to use JIT by default as much as
possible, in order to make tests runnable out of the box.

Of course, they can be locally tweaked to use local models.

For Eagle/Llava/RWKV, given we have not delivered them yet, they
keep using a local model lib for now.

* [Fix] Fix prefix cache reset and forking logic (#2374)

This PR refactors the reset logic in the prefix cache and disables forking from sequences with sliding window enabled.

* [CLI] Migrate CLI to use the new Engine (#2375)

* [CLI] Migrate CLI to use the new Engine

This PR migrates the CLI to the new JSON FFI Engine.
The resulting generation will be faster; we still need to ensure
we can enable sliding window support when needed.

It also refactors the JSONFFI Engine to be OpenAI compatible.

* Fix lint and remove bench which is stale

* [TESTING] Introduce testing util to manage models (#2377)

This PR introduces a new env var MLC_TEST_MODEL_PATH that allows a list of model paths
to be specified for test model search purposes.

If the model is not found, an error message appears and we auto-skip the test in both
pytest and normal running settings.

The path defaults to the cached HF path, so as long as we have run mlc_llm chat
the model can be found. But we do not automatically download to avoid
excessive networking in CI settings.

A follow-up PR is needed for the remaining test cases.
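
A sketch of the described behavior, assuming a hypothetical helper name and an assumed cache location (not the exact MLC util):

```python
# Search MLC_TEST_MODEL_PATH entries plus the local cache, and skip the test
# when the model is not found instead of downloading it.
import os
import pytest
from pathlib import Path

def require_test_model_sketch(model_dir_name: str) -> Path:
    search_paths = os.environ.get("MLC_TEST_MODEL_PATH", "").split(os.pathsep)
    # Assumed cache location for illustration only.
    search_paths.append(os.path.expanduser("~/.cache/mlc_llm/model_weights"))
    for base in filter(None, search_paths):
        candidate = Path(base) / model_dir_name
        if candidate.exists():
            return candidate
    pytest.skip(f"Test model {model_dir_name!r} not found; set MLC_TEST_MODEL_PATH")
```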

* [REFACTOR][Rename]  MLC_LLM_SOURCE_DIR and TVM_SOURCE_DIR source directory env  (#2378)

* [REFACTOR] Rename use MLC_LLM_SOURCE_DIR for source directory

This PR updates the code to use MLC_LLM_SOURCE_DIR to specify the
MLC LLM source directory.

The reason for this update is that the term XXX_HOME is usually
used for different purposes in ML frameworks.

For example, both torch and huggingface have TORCH_HOME and HF_HOME
pointing to their local cache directory.

The variable MLC_LLM_SOURCE_DIR is aligned with cmake naming convention
(CMAKE_SOURCE_DIR).

We will have a follow-up PR to update MLC_CACHE_DIR to MLC_LLM_HOME, following
existing practice.

* Update env to point to TVM_SOURCE_DIR

* [REFACTOR][ENV] MLC_CACHE_DIR to MLC_LLM_HOME (#2379)

This PR changes the MLC_CACHE_DIR env to MLC_LLM_HOME.
This change aligns with most of the packages.

* [iOS] Switch MLC Chat to use MLCEngine (#2380)

This PR switches MLC Chat to use MLCEngine.

Also did a minor refactoring to make serve side more
flexible in dealing with compile time overrides.

* [REFACTOR] Cleanup legacy code (#2381)

This PR cleans up legacy code and reorganizes some of the project structure.

- Removed the stale interface
- Removed stale examples
- Temporarily removed Rust as it depends on the chat module that we plan to phase out
- Moved embeddings to contrib (experimental)

* [Fix] Update prefix cache config (#2382)

This PR updates the prefix cache config to prefix cache mode and prefix cache max number of recycling sequences. Also this PR adds the missing `final` keyword in member methods.

* [PREFIX-CACHE] Fix some issues with prefix cache (#2384)

This PR fixes issues with prefix cache when used together with MLCEngine.
It also fixes an issue when prefix_cache_max_num_recycling_seqs == 0

* [FIX] Typo on OpenAI Chat class in engine (#2385)

This commit fixes a typo on JSONFFIEngine Python side.

* [Serving][Refactor] Metrics and stats for CLI (#2387)

This PR introduces the `Metric` class for convenient metric update
and management in MLC. The previous `EngineStats` class is renamed
to `EngineMetrics` accordingly.

This PR brings the metric support to JSONFFIEngine, and implements
the `/stats` command in CLI.

Besides, this PR

* fixes a bug in time measurement when parallel generation exists.
* aligns the metric names with LLMPerf (particularly, we now use
`num_input_tokens`, `num_output_tokens`, `sum_num_input_tokens`, etc.)
* measures the time of a single step of BatchDecode, a single step
of draft generation in BatchDraft, and a single step of BatchVerify
when the effective batch size is less than 64 (hardcoded as a constant
as of now). This helps build an understanding of the performance
of the key actions across a range of batch sizes.

* [REFACTOR] Organize metrics (#2390)

This PR performs one round of reorganization of metrics into
a centralized metrics header.

It also updates the ChatState to include overrides that can be used
in future cases to run chat tests.

* [Fix] Avoid ref capture in prefix cache construction (#2391)

This PR fixes the prefix cache construction in Engine, which captured
references to the models and thus prevented the GPU memory from being
freed when the Engine is destructed.

* [REFACTOR] Cleanup Metrics (#2392)

This PR runs another round of cleanup of metrics.

- Remove less useful ones
- Reorganize by labels in prometheus style

* [FIX] Fix mlc llm source dir argument (#2394)

This PR fixes the mlc llm source dir argument
in android packaging.

* [Fix] Fix the serialization of SpecDecodeMetrics (#2395)

This commit fixes a bug when serializing SpecDecodeMetrics.

* [Fix] Update missing change in engine ffi func name (#2396)

This PR updates the missing change in the engine FFI func name from #2390.

* Auto updated submodule references

* [Fix] Fix no prefix cache (#2397)

This PR fixes the no-prefix-cache mode, to avoid double adding of a new sequence.

* add hasattr safecheck for MLCEngineBase (#2400)

Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>

* [Refactor] Expose EngineConfig in engine constructor (#2399)

This PR lifts the EngineConfig into the engine constructor, so that
we can hide the less important arguments in EngineConfig and thus
focus the user's attention on the few key arguments.

`mlc_llm serve` CLI and PopenServer are updated accordingly.
Documentation is updated accordingly.

* [REFACTOR] Introduce RequestMetrics and metrics endpoint (#2401)

This PR introduces RequestMetrics to collect aggregated metrics for each request.
We also introduce a Prometheus endpoint.

Finally, we fixed a cyclic dependency in engine states.

* [Fix] Fix format issue of MLCEngineBase (#2402)

This PR fixes a format issue caused by #2400.

* [FIX] fix comments in radix_tree.py (#2403)

The function descriptions for `PagedRadixTree.add` and `PagedRadixTree.extend`
seemed misleading.

Fixed them according to the implementations in radix_tree.cc.

* [Fix] Fix metric names in tests and static PrefixCacheModes (#2404)

* This PR fixes the metric names referenced in tests which were not
updated together with previous PRs.

* This PR fixes the static PrefixCacheMode member introduced in #2397.
The fix using static class members is not correct and essentially
disables PrefixCache forever, because when checking the `mode` member
of a PrefixCache instance, it is always the base class mode
(which is `kDisabled`) that is returned.

* This PR also adds a missing header for chrono.

* [Op] Tree attention (#2376)

* [REFACTOR] Reorganize GenerationConfig DebugConfig and FFI (#2407)

This PR reorganizes GenerationConfig, DebugConfig and FFI.

- Internally, we now directly use the config object instead of a JSON stream.
- Request construction moves to the engine side so it can make use of debug_config.
- Ignore-eos now moves to a debug_config option.
- Removes most string-based re-export of gen config.

* [Fix] Fix vector OOB when no inputs can be prefilled in spec decode (#2408)

This PR fixes an issue that causes a vector index out of bounds.
This happens in speculative decoding, when one model can accept inputs
while the other cannot.

We still need to look into this inconsistency. Ideally all models should
behave the same.

* [Fix] Update number of available pages after prefix cache free (#2409)

This PR fixes an issue that causes inconsistent CanPrefill
results from different models.

* [REFACTOR] Enable validation logic in GenerationConfig (#2411)

This PR enables a centralized validation logic in GenerationConfig.

* [Chat] Support chat completion config override (#2412)

This PR supports chat CLI with arguments override.

Right now, arguments supported are: `top_p`, `temperature`,
`presence_penalty`, `frequency_penalty`, `max_tokens`, `seed`,
`stop`.

This PR adds the corresponding support to the ChatCompletion request
parsing for JSONFFIEngine.

* Change name RedixPage -> RadixPage in RadixTree.cc (#2413)

change name RedixPage -> RadixPage

* [Fix] Fix ignore_eos support (#2414)

The ignore_eos support was broken during recent refactors. This PR
fixes the support.

* [Test][Refactor] Update tests to use require_test_model (#2415)

This PR updates tests to use the `require_test_model` testing util
for better out-of-the-box testing while avoiding automatic downloading.

Some tests that require manual model compilation are kept in the
old test style (e.g., with model "llava", "eagle", etc.).

This PR also fixes some typing issues suggested by mypy.

* [Serving] Enable GPU Sampling (#2368)

enable gpu sampling

* [REFACTOR] Support latest include_usage and DebugOptions (#2417)

This PR refactors the mechanism of request end detection
and also attaches the request metrics in the response usage field.

Request/response usage field:
- include_usage can be passed to the API. When include_usage is on,
  metrics are now streamed back in usage.extra
- Changed the debug_option parameter to extra_body, so it is fully compatible with the OpenAI client
- Support special requests in debug options; engine metrics are now streamed back via a special request

We also change the FFI mechanism for detecting response finish. Previously
we kept track of the number of stopped streams. Now the FFI always streams
back a final chunk which has no choices and contains usage. We use the usage
field to detect the final chunk. Code paths are updated accordingly.

We also make Chat CLI a helper class that can be reused.

iOS app now comes with stats support.
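
A small client-side sketch of that detection convention, using illustrative dict-shaped chunks rather than the exact MLC response classes:

```python
# The final streamed chunk carries no choices and a populated `usage` field,
# which signals the end of the response.
def consume_stream(chunks):
    text = ""
    usage = None
    for chunk in chunks:
        if chunk.get("usage") is not None:
            usage = chunk["usage"]  # final chunk: no choices, metrics in usage.extra
            break
        for choice in chunk["choices"]:
            text += choice["delta"].get("content", "")
    return text, usage

text, usage = consume_stream([
    {"choices": [{"delta": {"content": "Hello"}}], "usage": None},
    {"choices": [], "usage": {"completion_tokens": 1, "extra": {"decode_tokens_per_s": 100.0}}},
])
print(text, usage["completion_tokens"])  # Hello 1
```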

* [DOWNLOAD] MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES (#2421)

This PR introduces support for MLC_DOWNLOAD_POLICY
and MLC_LLM_READONLY_WEIGHT_CACHES.

It allows reading from a read-only cache besides MLC_LLM_HOME,
and introduces a domain subfolder in the cached weights.

* [REFACTOR] Rename MLC_LLM_READONLY_WEIGHT_CACHES (#2423)

This PR renames MLC_LLM_READONLY_WEIGHT_CACHES => MLC_LLM_READONLY_WEIGHT_CACHE
to be consistent with the rest of the env var convention.

* [Tokenizer] Auto-detect TokenizerInfo from tokenizer.json (#2416)

This PR adds a new `TokenizerInfo` class that contains useful information
about the tokenizer during generation. It is auto-detected from
tokenizer.json if it exists. Otherwise it raises a warning and uses
the default value (byte fallback tokenizer, not prepend/strip space).

* [REFACTOR] Remove dependencies on legacy chat_module (#2424)

This PR removes all dependencies on chat_module.py
so we can prepare for deprecating this module.

This PR refactors and moves MLCChatConfig to protocol.
This helps us to consolidate all API spec and config files
under the protocol folder.

The protocol folder mainly keeps the data schema and metadata;
most of the actions (gen_config) are still kept in their current location.

* [REFACTOR] Terminology download=>download_cache (#2425)

This PR renames download to download_cache for better clarity.

* [REFACTOR] Move GenerationConfig to protocol (#2427)

This PR moves GenerationConfig to protocol.
As we move towards the OAI-style API, GenerationConfig becomes more of an internal API.

This change reflects that and also removes the duplicated definitions of ResponseFormat
and DebugConfig.

* Update README.md

* [site] Add hero section to website (#2430)

* [Compile] Skip CUDA graph rewrite when target is not CUDA (#2433)

This PR rewrites the CUDA graph compiler flag to false when the
backend is not CUDA. Otherwise, CUDA graph may be enabled for other
backends and cause result errors.

* [DOCS] Simplify read me (#2435)

This PR simplifies the README so most attention
can be directed to our docs page.

* [DOCS] Update title to focus on engine feature

This commit updates the docs to focus on engine feature

* [Metadata] Remove stale KV cache size (#2434)

This PR removes the KV cache size from model metadata. This is because
we have fully switched to the new compilation flow with PagedKVCache
and MLCEngine as backend, where KV cache size is runtime dependent and
will be estimated at runtime.

* [iOS] Update the MLCSwift APIs to async (#2436)

This PR updates all MLCSwift APIs to be async
for consistency purposes.

* [Android] Switch MLC Chat to use MLCEngine (#2410)

* [Android] Switch MLC Chat to use MLCEngine

* [Serving] Add helper function - TotalDetectGlobalMemory

* [iOS] Remove Legacy ChatModule (#2437)

This PR removes the legacy chat module in iOS.

* [Delivery] Update model delivery script to support specifying the output and hf directory (#2431)

* Update model delivery script to support specifying the output directory

* [Android] Remove Legacy ChatModule (#2438)

* [Refactor] Remove ChatModule (#2440)

This PR formally removes ChatModule from the codebase, given all
the frontends have fully switched to use MLCEngine.

* [Fix][REST] Fix usage-related server tests (#2441)

This PR fixes some server tests which were broken due to recent
refactors.

* [Site] Enlarge hero image in small screens

* Fix lint

* [ANDROID] Patches to enable windows usescase (#2443)

This PR adds a few patches to enable building under Windows.

* [DOCS] Guides for android on windows (#2444)

* [DOCS] mention git-lfs (#2445)

* Fix Llama-3 conversation template. Add unit test (#2442)

* Fix Llama-3 conversation template. Add unit test

* [Grammar][Wasm] Update new grammar to wasm runtime (#2446)

* [Model] Use float32 for RoPE calculation (#2449)

This PR updates the RoPE calculation to use float32 for multiplication
and addition. This is motivated by the observation that calculating
RoPE in float16 may cause accuracy issues.

* [LogitProcessor] Use min float value as the mask value (#2451)

This PR updates the mask values in LogitProcessor to the min value
of float32. Prior to this PR it was -1e10. This update is the safest
for softmax as long as the masking is always the last step in logit
processor.
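
A tiny NumPy illustration of why the float32 minimum works as a mask value when softmax comes last (illustrative code, not the LogitProcessor kernel):

```python
import numpy as np

mask_value = np.finfo(np.float32).min        # instead of the previous -1e10
logits = np.array([2.0, 1.0, 0.5], dtype=np.float32)
banned = np.array([False, True, False])

masked = np.where(banned, mask_value, logits)
probs = np.exp(masked - masked.max())
probs /= probs.sum()
print(probs)  # the banned token gets exactly zero probability
```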

* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2450)

This PR sets the parameter `by_alias=True` for all the `model_dump_json`
of pydantic classes, so that aliases are always respected.

* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2452)

This PR sets the parameter `by_alias=True` for all the `model_dump`
of pydantic classes, so that aliases are always respected.
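
A minimal Pydantic illustration of why `by_alias=True` matters; the field and alias names here are just examples, not the actual protocol classes:

```python
from pydantic import BaseModel, Field

class Delta(BaseModel):
    content: str
    schema_: dict = Field(default_factory=dict, alias="schema")

d = Delta(content="hi")
print(d.model_dump())               # {'content': 'hi', 'schema_': {}}
print(d.model_dump(by_alias=True))  # {'content': 'hi', 'schema': {}}  <- respects the alias
```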

* [DOCS] Updates the URL of the Android APK (#2453)

* Auto updated submodule references

* [Fix][Phi3] Add `</s>` as stop token for phi3 (#2455)

[Fix][Phi3] Add </s> as stop token for phi3

* [Site] Add GitHub link to hero section

* Update README.md

* [Hermes2] Add conv template for Hermes2-Pro-Llama3 (#2457)

* [Compile] Add max_batch_size to metadata (#2463)

This PR adds the max_batch_size at compile time to metadata for
runtime to read.

**Note.** This may be a breaking change for compiled model
libraries. Please set the environment variable `MLC_JIT_POLICY=REDO`
to recompile the models with JIT, or manually recompile the model
libraries.

This PR also adds the max_batch_size to qwen2.

* [REFACTOR] Re-organize the modules after transition to MLCEngine (#2464)

This PR reorganizes the modules after transition to MLCEngine.

- grammar is a root-level module
- streamers and tokenizers are in the tokenizers namespace
- conversation_template is a module

Testcases are restructured accordingly. We also removed some of the stale files.

* [Serving] Add ICHECK for running batch size (#2465)

This PR adds ICHECKs to make sure that the running batch size
in BatchDecode and BatchDraft does not exceed the `max_num_sequence`
in the engine config.

The prefill actions should keep this invariant, and the added ICHECKs
mainly serve internal error detection and reporting purposes.

* Auto updated submodule references

* [TEST] Start to categorize tests (#2466)

* [TEST] Start to categorize tests

This PR adds test categorization via pytestmark

For now we have five categories of tests

unittest
op_correctness
engine
endpoint
uncategorized

We should start to fix some of the broken tests
and move them to these categories. When possible
we should cover a bug under unittest, since those run on every PR
as part of the CI.

* Implemented FP8 calibration (#2454)

* Implemented FP8 calibration

* update

* add transformers

* Use encode_batch


---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [CI] Update CUDA build script with FlashInfer options (#2469)

This PR updates the CI CUDA build script with FlashInfer compile
options after a recent bump of FlashInfer version.

* [Serving] Use preferred host memory for host NDArrays (#2468)

This PR updates the host memory in model, logit processor and GPUSampler
with the support of preferred host device, so that for CUDA and ROCm
the pinned memory will be used for the host arrays, which may be faster
than the default CPU memory during copying.

* [TEST] Temp disable UT stage

This PR temporarily disables the UT stage until we can get a fix for the docker execution

* [CUDA] Turn on cuda graph at O2 (#2467)

* [CI] Enable GPU env in CI  (#2476)

* [CI] Enable GPU env in CI

This PR enables GPU env in ci docker/bash.sh

* remove dep on tvm testing plugin

* [CMake] Update config.cmake generation script (#2478)

This PR updates the config.cmake generation script to provide
the FlashInfer compile options explicitly.

* [TEST] MockEchoEngine (#2479)

This PR introduces a MockEchoEngine that echoes the
input prompt and the generation config (as part of usage.extra).

The engine can be used to create unit-test cases that cover engine API handling.
Note that mock tests cannot replace real engine tests.

* Auto updated submodule references

* [Fix] Fix JSONFFI MemoryBufferStream after dmlc bump (#2480)

A recent bump in dmlc has changed the `Write` signature of
`dmlc::Stream`. This commit updates the codebase to follow the
upstream change.

* [JSON-FFI] Enable n generation and pass in json schema (#2481)

This PR enables n-generation and passing in a JSON schema in the JSON FFI.

* Refactor model delivery script to use pydantic (#2482)

* Fix tokenizers encode batch (#2484)

* [Bugfix] Fix delivered log issue in delivery cli (#2489)

* Support Qwen2-MoE Architecture (#2089)

* [3rdparty] Bump tokenizers-cpp to include HF tokenizers bump (#2490)

This PR bumps the 3rdparty tokenizers-cpp to include the HuggingFace
tokenizers package bump, in order to support some latest models such
as Mistral v0.3.

* [Bench] Add mlc bench (#2474)

This PR adds an initial pass of the bench infra

* Auto updated submodule references

* Enable n-sampling for Medusa spec decoding (#2495)

* Fix get_num_available_pages for model without kv cache


* Enable n-sampling for Medusa spec decoding

* [CONFIG] Remove mean_gen_len from the config (#2493)

This PR removes legacy mean_gen_len from the config

* Update ios android docs (#2497)

* [Bench] Add seed to __init__ and some minor change (#2496)

* [Fix][Config] Max total sequence length overflow with sliding window (#2500)

This PR fixes an issue which causes the int64 multiplication overflow
when sliding window is enabled.

* [Serving] PagedKVCache tree-attention integration (#2487)

This PR integrates the recent support of tree-attention in PagedKVCache
into the speculative decoding in MLC. Right now only chains are
supported. Tree-based speculative decoding is on the project roadmap
and we plan to support it in the near future.

* [Sampler] Enhance checks for whether FlashInfer is enabled (#2502)

This PR improves the check in the GPU sampler for whether FlashInfer is
enabled. Previously we did not check the CUDA compute capability,
which made the GPU sampler unable to run properly on Colab, where
the T4 GPU has compute capability 7.5, which FlashInfer does not
support.

With this PR, when the compute capability is less than 8.0, we
do not use FlashInfer in the GPU sampler.
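
The check boils down to a comparison like the following sketch (illustrative only, not the actual MLC code):

```python
# FlashInfer sampling is only used when the GPU's CUDA compute capability is
# at least 8.0; e.g. a Colab T4 at 7.5 falls back to the non-FlashInfer path.
def flashinfer_sampling_enabled(built_with_flashinfer: bool, compute_version: str) -> bool:
    major, minor = (int(v) for v in compute_version.split("."))
    return built_with_flashinfer and (major, minor) >= (8, 0)

print(flashinfer_sampling_enabled(True, "7.5"))  # False -> T4 uses the fallback sampler
print(flashinfer_sampling_enabled(True, "8.0"))  # True
```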

* [Android] Updates the default model list and the APK link in the document (#2503)

* [Android] Update default model list

Update the default model list in Android to include the following models
1. Phi-3-mini-4k-instruct-q4f16_1-MLC
2. Llama-3-8B-Instruct-q3f16_1-MLC
3. Mistral-7B-Instruct-v0.3-q4f16_1-MLC

* [DOCS] Updates the URL of the Android APK

* [Fix] Fix the global func name of TokenizerDecode (#2514)

This PR fixes the global func name for `TokenizerDecode`, which was
not updated when adding the namespace `tokenizers`.

* [Fix] Use the correct model to validate stream_options (#2508)

* [Fix] Typo in docs/install/tvm.rst (#2507)

Fix a typo in serve/engine.py

* [FP8] Use f32 scale to enable better fusion (#2505)

* [Metrics] Add ttft and itl to server metrics (#2510)

* Add ttft and itl to server metrics

* Fix ITL

* Fix clang-format

* Keep mobile and interface.chat untouched

* [Model] Fix config detection for Mistral (#2504)

The Mistral model has removed the sliding window since v0.2, while
in MLC we always enabled the sliding window. This PR updates the config
detection so that when the sliding window is disabled, we instead check
the context window size and make sure it is properly set.

* [Fix] Provide a GetTokenId API for SampleResult (#2516)

Currently we use `sampled_token_id.first` to find the sampled token id
of a SampleResult object, which is obscure. This PR provides a
`GetTokenId` API for SampleResult to get the sampled token id.

This PR also updates the testing model path to include `./dist/`.

* [Reapply][BUGFIX] Fix rare deadlock in threaded engine (#2429) (#2518)

This PR reapplies #2429, which is missing in the main branch.

Below is the original commit message:

This PR fixes rare deadlock cases when engine unload/reload

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Fix] Fix metrics division by 0 (#2519)

This PR fixes an issue in the per-request metrics, where division-by-0
may happen when the request does not run any decode step.

The division-by-0 results in `inf`, which is added into a JSON file.
However, `inf` is usually not recognized as a float value by the JSON
grammar. Thus JSON parsers fail to parse any JSON string that contains
`inf` without being quoted.

* Corrected the folder path for Android Studio Project (#2520)

Update android.rst

Android project path corrected

* Update tvm.rst

* [iOS] Update model list (#2524)

Update the model list of iOS in `mlc-package-config.json`.

* [Android] Updates the order of the model list and the APK link in the document (#2526)

[Android] Updates the default model list and the APK link in the document

1. Qwen1.5-1.8B-Chat-q4f16_1-MLC

* [Sampler] Skip top-p renormalization if top-p is 1 in CPUSampler (#2528)

This PR adds a shortcut in the top-p renormalization in CPU sampler,
which skips the renormalization when top-p is 1.0.

* [Docs] Rename javascript.rst to webllm.rst (#2531)

* [Conv] Add tinyLlama v1.0 conv template (#2530)

* [Conv] Add tinyLlama v1.0 conv template

* Fix lint

* [iOS] correct mistral q3 url and handle screen switch off (#2529)

This PR corrects the mistral q3 url

This PR also adds a handler for screen switch-off.
For now we just reset if the app is generating;
we will update to pause/resume once they are supported.

* [Grammar] Fix include protection and paths in docstring (#2515)

Following #2464, this PR fixes the include protection in the header
files and the paths in the docstrings of the header files.

This PR also fixes tests that were broken after the refactor.

* [Tokenizer][Fix] Fix SegFault when analyzing tokenizers without tokenizer.json (#2532)

Previously the tokenizer would segfault when analyzing a tokenizer
that did not have a tokenizer.json file.

This is because `TokenizerInfo()` was called previously, which creates
a null object. This PR fixes the problem.

* [Serving] Use stop strs and token ids for completions (#2534)

This PR applies the stop strings and stop token ids defined in the
conversation template to the raw text completions, so that whenever
the model outputs a stop token id or stop string, the raw generation
can stop.

Prior to this commit, raw text generation never stopped when max tokens
was not given. This commit helps reduce the frequency of such events.
Nevertheless, if the model does not output a stop string/token id,
the generation still will not stop.

* [Serving] Support tensor parallel shards override in command line (#2533)

This PR supports command line overrides for model JIT compilation.
This is especially helpful for enabling tensor parallelism out of the box,
so people don't need to manually tweak `mlc-chat-config.json` to
use tensor parallelism.

* Add tie_word_embedding option for Qwen2 model (#2535)

* [Bench] Defaults to aiohttp client, add ServerMetrics (#2527)

* [Bench] Defaults to aiohttp client

* Add ServerMetrics to summary

* Remove duplicate servermetric def

* [Android] Remove var capture in TVM_SOURCE_DIR (#2538)

This PR fixes the TVM_SOURCE_DIR parsing issue on Windows.

* [Fix] Fix inconsistent system prompt handling (#2539)

This PR fixes the conversation template of ChatML, whose
system prompt ends with `<|im_end|>`.

An inconsistent handling of system prompt between the JSONFFI side
and the Python side is also corrected.

* [Attention] Fix attn kernel for general GQA group size (#2543)

This PR fixes the TIR prefill attention kernels to support a broader
list of GQA group sizes.

* fix: typo error (#2544)

* [Fix] Fix attn kernel build issue (#2545)

This PR fixes TIR issues in the attn kernels.

* [iOS] Add Qwen2 support (#2547)

This PR adds Qwen2 support to MLC Chat.

* [Android] Add Qwen2 support (#2548)

* [Android] Escape backslashes and quotation marks (#2546)

This commit escapes the backslashes and quotation marks in Android
package build.

* [EngineConfig] Add override options (#2550)

This PR introduces override options to the Python side EngineConfig
so that they'll be reflected in JIT model compilation.

* [Site] Update link to webllm

* [Site] Update heading

* [Preset] Add model preset for model delivery (#2553)

[Preset] Add model preset for wasm delivery

* Update docs to remove mention of older models (#2557)

* [Docs] Fix typo in mlc_llm chat command (#2560)

* Fix compilation for gcc 13.2 (#2561)

* [Tokenizer] Prioritize HuggingFace/SentencePiece over ByteLevelBPE (#2559)

This PR updates the tokenizer load logic, so that we prioritize
the use of HuggingFace and SentencePiece tokenizers over the
ByteLevelBPE tokenizer.

This fixes the issue that the token `<im_start>` in the Qwen model is
tokenized into multiple tokens when the ByteLevelBPE tokenizer
is chosen when available.

* [Serving][Grammar] Jump-forward decoding (#2551)

[Serve][Grammar] Jump-forward decoding

This PR supports the jump-forward decoding as described in
<https://lmsys.org/blog/2024-02-05-compressed-fsm/>. The jump-forward
decoding uses the grammar constraint to predict the next output string and
tokenize the string into tokens, and therefore speeds up the decoding.

This PR implements these optimizations to ensure the output quality:
- Retokenization in jump-forward: tokenize the last k tokens as a string appended with the predicted
  string (see the sketch after this list). If the tokenization result differs from the old tokens, roll back
  these tokens and accept the new ones.
- Retokenization in decoding: tokenize the last k tokens as a string appended with
  the decoded token. This happens in the decoding stage when jump-forward decoding happened
  in the last round. If the result differs, the old tokens will be rolled back.
- Skip prefix tokens in jump-forward: we call tokens that are a prefix of another token
  prefix tokens. If the last token from jump-forward is a prefix token, it is highly possible
  that it will be rolled back in the next decode stage, as it may be combined with the
  decoded token. This also affects the output distribution, as such a pattern is rare in training data.
  Therefore, we skip the last prefix token in jump-forward decoding.
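
A toy sketch of the first optimization, retokenization in jump-forward; the tokenizer interface is hypothetical, not the MLC implementation.

```python
def jump_forward(tokens, predicted_str, tokenizer, k=2):
    """Append the grammar-predicted string, re-tokenizing the last k tokens with it."""
    tail, rest = tokens[-k:], tokens[:-k]
    retokenized = tokenizer.encode(tokenizer.decode(tail) + predicted_str)
    num_rolled_back = 0 if retokenized[:k] == tail else k  # old tail replaced if it differs
    return rest + retokenized, num_rolled_back

class _FakeTokenizer:
    # Stand-in tokenizer over characters, just to make the sketch runnable.
    def encode(self, s): return list(s)
    def decode(self, toks): return "".join(toks)

print(jump_forward(list('{"na'), 'me": ', _FakeTokenizer()))
# (['{', '"', 'n', 'a', 'm', 'e', '"', ':', ' '], 0)
```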

This PR also includes the following changes:
- Add several metrics for request and engine, especially about the jumpforward decoding
- Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from early return

Performance and benchmark:

Schema(Pydantic):
```
class Product(BaseModel):
    product_id: int
    is_available: bool
    price: float
    is_featured: Literal[True]
    category: Literal["Electronics", "Clothing", "Food"]
    tags: List[str]
    stock: Dict[str, int]
```

Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G

Results:
```
Jump forward: False, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.4988938220000001,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 66,
    "decode_tokens_sum": 66,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 132.2926785010378,
}
Jump forward: True, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.37242740600000007,
    "engine_jump_forward_time_sum": 0.027989265000000006,
    "completion_tokens_sum": 68,
    "decode_tokens_sum": 68,
    "jump_forward_tokens_sum": 28,
    "decode_tokens_per_s": 182.58591850246378,
}
Jump forward: False, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.9106805410000002,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 261,
    "decode_tokens_sum": 261,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 286.5988546470984,
}
Jump forward: True, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.6843025599999999,
    "engine_jump_forward_time_sum": 0.028089531999999997,
    "completion_tokens_sum": 266,
    "decode_tokens_sum": 266,
    "jump_forward_tokens_sum": 112,
    "decode_tokens_per_s": 388.71694415405966,
}
Jump forward: False, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.62462493,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 538,
    "decode_tokens_sum": 538,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 331.1533573475325,
}
Jump forward: True, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.0509048310000002,
    "engine_jump_forward_time_sum": 0.027971332000000022,
    "completion_tokens_sum": 525,
    "decode_tokens_sum": 525,
    "jump_forward_tokens_sum": 224,
    "decode_tokens_per_s": 499.5694990767436,
}
Jump forward: False, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 2.317279175,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 1068,
    "decode_tokens_sum": 1068,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 460.8853398080531,
}
Jump forward: True, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 1.3962938819999997,
    "engine_jump_forward_time_sum": 0.030129287999999994,
    "completion_tokens_sum": 1059,
    "decode_tokens_sum": 1059,
    "jump_forward_tokens_sum": 448,
    "decode_tokens_per_s": 758.4363246533227,
}
```

* [Delivery] Update model delivery script (#2565)

Some improvements of the delivery script:

- provide different overrides for different quantizations, e.g. we can change
the prefill chunk size for q0/q3/q4
- rerun gen config only if conv_template changes
- do NOT recreate the HF repo when the repo already exists. This preserves
the commit history
- dry-run validation

* [Model] Enhance error reporting for invalid tensor-parallel settings (#2566)

This PR enhances the error reporting for multi-GPU model compilation,
so we can provide as many error reasons as possible before loading and
running the models.

* [Serving] Apply tree structure in draft token verification (#2563)

This adds the interface to the draft token state and sampler to allow the tree
structure to be recorded and used for verification.

* [Bench] Json mode bench (#2552)

* [Bench] Json mode bench

This PR refactors mlc bench to enable json mode in dataset.

* upd

* fix lint

* [Model] Support Multi-GPU for Qwen-MoE model (#2573)

This PR introduces the multi-GPU support for the Qwen-MoE model.
Validated on 4090x2.

* [Metrics] Add missing fields in `Reset` (#2574)

This PR adds the missing fields that were not cleared up in
`EngineMetrics::Reset`.

* [Doc] Update WebLLM doc (#2578)

Update documentation for WebLLM. Currently we only provide a high-level view of the WebLLM runtime here, and refer users to the WebLLM repo README for more. The documentation focuses on adding your own model variant / model library for WebLLM. We will follow up with more thorough runtime documentation.

* [Op] Top-4 implementation for MoE model (#2586)

This PR introduces a top-4 kernel for MoE models (particularly for
the Qwen-MoE) at this moment.

This is still a manual implementation and has some duplication
with the existing top-2 kernel. In the future we'll consider leveraging
meta-programming of TIR to unify the top-k kernel implementations.

* [Model] Gemma 1.1 compatibility (#2594)

This PR updates the Gemma config so that MLC can work properly with
Gemma 1.1.

* [Serving] Hybrid prefill (#2604)

This PR adds support for hybrid prefill, so during the prefill
engine action it will also do decode for running requests.

* Update quick_start.rst to fix broken links (#2607)

Update quick_start.rst

Fix broken links for convert weights and compile model pages

* [Fix] Set the missed prefill finish time (#2613)

This PR fixes a bug which fails to set the prefill finish time
and results in metric error.

* [Android] Reduce binary size (#2606)

This PR updates the Android app to reduce the binary size.
Right now the size can be reduced to 108MB when building with only the
Phi-3-mini-4k model.

* [Fix] Gemma hidden_activation compatibility (#2614)

This PR fixes the Gemma config compatibility issue.

* Update debug_compare (#2612)

This PR fixes a bug of the debug_compare.py script.

* [SLM] Add support for InternLM2 architecture (#2608)

This commit introduces the InternLM2 model support.

* [Fix] Prefix cache only enables sliding window on leaf sequence (#2615)

This PR updates the prefix cache to align the logic of enabling the sliding window. Now sliding window attention is enabled only for leaf sequences.

* [Android] Update include path for tvm runtime src (#2616)

This PR updates the include directories for the Android app
so that we can avoid using macros for src file include.

* remove

* works

* seems working

---------

Co-authored-by: Rick Zhou <rickzhoucmu@gmail.com>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Kevin_Xiong <kevin_xiong1997@outlook.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Animesh Bohara <ani.bohara@gmail.com>
Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
Co-authored-by: krishnaraj36 <quic_kvegiraj@quicinc.com>
Co-authored-by: Mengshiun Yu <mengshyu@gmail.com>
Co-authored-by: Git bot <bot@noreply.github.com>
Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
Co-authored-by: Nestor Qin <imba.qxy@gmail.com>
Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com>
Co-authored-by: Faolain <Faolain@users.noreply.github.com>
Co-authored-by: Bodhi <3882561+BodhiHu@users.noreply.github.com>
Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>
Co-authored-by: Hyunsung Lee <ita9naiwa@gmail.com>
Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: tqchen <tqchenml@gmail.com>
Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Co-authored-by: rmstc <ramees025@gmail.com>
Co-authored-by: KEL <me@iamkel.net>
Co-authored-by: Andrey Malyshev <ma_elvin@mail.ru>
Co-authored-by: Gunjan Dhanuka <d.gunjan@iitg.ac.in>
Co-authored-by: Shushi Hong <820958424@qq.com>
masahi pushed a commit to octoml/mlc-llm that referenced this pull request Aug 2, 2024
… 2024-08-01) (#277)

* [Eagle] Run additional decode for draft model when all proposals are accepted (#2294)

* [iOS] Introducing package CLI for iOS app packaging (#2297)

This PR introduces the packaging CLI `mlc_llm package` which
reads from a `mlc-package-config.json` and compiles model
and prepares model/runtime libraries automatically.

With this PR, we get rid of prebuilt model library dependency
for iOS app build.

Validated that the iOS build can work. iOS documentation is updated
according to this latest change. The same flow is supposed to work
for Android as well, while it still needs verification for Android
app build.

* Increase the timeout in PopenServer (#2298)

* [LLM-CHAT] Enable gpu softmax for penality softmax (#2288)

1. Avoid the cpu softmax for different penality config by
  having copy sync to gpu and use gpu softmax.
2. Disable decode token time counter for first token.

* [iOS][REFACTOR] Restructure the iOS folders (#2299)

Move MLCChat to its own sub folder minor improvements to package.

* [KVCACHE][TIR] Improved tir schedule for decode tir page attention (#2289)

* [KVCACHE][TIR] Improved tir schedule for decode tir page attention

 1. Improved tir schedule of page attention (It improved 30% to this
function).
 2. Enable missing dequant+matmul fusion in ph-2 model

* Updated K_local to QK_local

* Update kv_cache.py

* Increase max thread for android:adreno

* [Sampler] Remove unneeded output_prob_dist param (#2300)

* Enable cuda graph for batch_verify (#2304)

* [Android] Introducing mlc4j and app packaging (#2305)

This PR lifts the existing `library` of android app into a standalone
`mlc4j` directory, which can be referenced by android app at any
location.

On the app side, this PR moves the android app into a subfolder
`MLCChat` which itself is a well-formed Android app. This folder
contains two core files for app build:

* `MLCChat/mlc-package-config.json` the config file that specifies
the models to build into the app.
* `MLCChat/prepare_package.py` the Python script that helps
automatically prepare/build mlc4j and model libraries.

This PR also updates the android app documentation to reflect this
latest change.

* [DOCS] Minor cleanup (#2308)

Shorten titles so they fit into one line of navbar, add mention of jit cache.
Remote old project overview

* [DOCS] Update android doc (#2309)

Avoid showing full tree and mention what the dist/lib/mlc4j stands for

* [DOCS] Update android doc (#2310)

Avoid showing full tree and mention what the dist/lib/mlc4j stands for
Avoid python3 instead directly use python, since python3 sometimes
will points to system python.

* [SLM] Support BERT architecture. Implement a text embedding module (#2249)

* [Serving] Log batch size in NVTX (#2312)

* [Model] Removing unnecessary reshapes in get_logits (#2314)

* Skip cublas dispatch for single batch (#2315)

* Auto updated submodule references

* [DOCS] Remove mention of legacy modules (#2318)

This PR removes mention of legacy modules
and prebuilt in favor of JIT.

* [Android] Add `-j` option to cmake build (#2321)

This PR adds the `-j` option to cmake build to parallelize the
build job over CPU cores.

* [DOCS] More clear android instruction (#2327)

This PR sets a more clear instruction for android JDK setup

* [Serving] Refactor to consolidate new request prefill (#2329)

* [iOS] Make MLCEngine input to take in structured data (#2330)

This PR modifies the MLCEngine chatCompletion to take in structured data.

Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>

* [REFACTOR] Refactor JSONFFI Conv template (#2331)

This PR refactors JSONFFI conv template to use immutable processing.
This helps to prevent bugs from multiple requests and concurrent
access to the conversation data structure.

It also reduces the need to deep copy the struct.

* [Eagle] Fix the requests for additional decode in eagle verify (#2336)

* [Serving][Grammar] Refactor GrammarStateMatcher and support LLaMA-3 (#2335)

This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer.

Common tokenizers, including Phi-2, Gemma, LLaMA-2, etc., are also
supported.

The performance is optimized for the LLaMA-3 tokenizer since its token
table has 128k entries, much larger than that of the LLaMA-2 tokenizer.

These changes are introduced to the grammar library:
1. Introduce ByteString rule expression and simplify CharacterClass
   and CharacterClassStar
2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and
   mutating grammar rules
3. Now GrammarStateMatcherBase, the internal implementation of
   GrammarStateMatcher, accepts input char by char instead of codepoint
   by codepoint, so it supports any valid UTF-8 string even if a token
   is not a complete codepoint.
4. Support lookahead assertions for rules to specify that a rule must be
   followed by a given sequence. This can eliminate some uncertain tokens
   during preprocessing.

Minor changes:
1. Introduce template hash function HashCombine
2. Update the UTF8 encoding handling functions

Performance:
1. For JSON, finding the mask takes <30us on a 5900X with a single thread.
   The number of uncertain tokens is <30 in most cases.
2. For JSON schema, finding the mask takes <30us on a 5900X with a single
   thread. The number of uncertain tokens is <30 in most cases.

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder (#2342)

* [DebugChat] Fix DebugChat softmax function and save logits to debug folder

* Fix lint

* [Serving] Add Medusa speculative decoding (#2337)


* [Serving] Add Medusa speculative decoding

* Fix cublas offloading (#2343)

* Add false for arg worker0_only in disco.empty (#2344)

* Auto updated submodule references

* [JSONFFIEngine] Refactor device argument and request_stream_callback argument (#2334)

* 1. Refactor init_background_engine in JSONFFIEngine to use device_type and device_id arguments.
2. request_stream_callback is called on each string of the array of strings.

* Call the callback once with a string containing a list of JSON dicts, instead of calling it multiple times with one JSON dict string each

---------

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

* [Serving] Add reset_engine in debug_entrypoints (#2347)

* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue (#2358)

* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue

* [JSON FFI] Example Android Application using JSON FFI Engine (#2322)

* pass str to callback and not List[str]

add json ffif android example

fix lint

Refactor MLCEngineExample and MLCEngine.kt

Use ChatCompletionMessageContent class

ChatCompletionMessageContent: text and parts

* JSONFFIEngine: Cast request_stream_callback argument to std::string. Decode in Android as List<ChatCompletionStreamResponse>

---------

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

* [iOS] Update MLCEngine API to latest JSON FFI convention (#2359)

This PR updates the MLCEngine API to latest JSON FFI convention.

* [JSONFFI] Fix JSONFFI conv template. Add unit tests (#2360)

* [Fix][Serving] Fix prefill chunk in interactive mode (#2363)

This PR fixes a bug of prefill chunking in the interactive mode.
The bug counted requests with remaining inputs as running requests,
which ended up disabling the prefill of those remaining inputs.

This PR fixes it by no longer counting requests with unfinished inputs
as running requests for decode.

* [Fix][Serving] Respect sliding window size in config inference (#2364)

This PR fixes the automatic engine config inference, which did not
respect the sliding window size and thus led to higher-than-expected
memory usage in interactive mode for the Mistral model.

* [iOS] Add padding to app icon (#2365)

* [Serving] Fix the self-ref in engine (#2367)

This PR fixes the self-reference in the engine and enables auto-termination in the deleter.

* [Serving] Prefix Cache (#2295)

* [Serving] Prefix Cache

This PR introduces the prefix cache into the serving engine to manage prefixes and accelerate the prefill process.

* [Fix] Use static_cast for `.size()` for safety (#2369)

This PR updates the occurrences of `.size() - 1` with a static_cast
to avoid integer underflow.

* [Serving] Sliding-window-aware request prefill (#2370)

This PR supports the prefill conditions with sliding window awareness.
Now when the input length is larger than the sliding window size,
the prefill can still be processed without error.

* [iOS] Update MLCSwift to fully follow OAI style. (#2371)

It refactors MLCSwift to follow the
engine.chat.completions.create style, as in
other OpenAI APIs.

It also removes the cyclic dependencies in
the closure capture by introducing a separate EngineState.

* Add nvtx in logic update (#2372)

* [Test] Use HF model for JIT as much as possible (#2373)

This PR updates the test files to use JIT by default as much as
possible, in order to make tests runnable out of the box.

Of course, they can be locally tweaked to use local models.

For Eagle/Llava/rwkv, given we don't have them delivered yet, they
are kept as using local model lib now.

* [Fix] Fix prefix cache reset and forking logic (#2374)

This PR refactors the reset logic in prefix cache and disable forking from sequences with sliding windows enabled.

* [CLI] Migrate CLI to use the new Engine (#2375)

* [CLI] Migrate CLI to use the new Engine

This PR migrates the CLI to the new JSON FFI Engine.
The resulting generation will be faster, we still need to ensure
we can enable sliding window support when needed.

Also Refactors JSONFFI Engine to be OpenAI compatible.

* Fix lint and remove bench which is stale

* [TESTING] Introduce testing util to manage models (#2377)

This PR introduces a new env var MLC_TEST_MODEL_PATH that allows specifying
a list of model paths for test model search purposes.

If a model is not found, an error message appears and the test is
automatically skipped in both pytest and normal running settings.

The path defaults to the cached HF path, so as long as we have run mlc_llm chat
the model can be found. We do not download automatically, to avoid
excessive networking in CI settings.

A follow-up PR is needed for the remaining test cases.
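
For illustration, here is a minimal sketch of how such a helper could resolve
models from the env var or skip the test otherwise. The helper name and search
order are simplified assumptions, not the exact implementation:

```python
import os
import pytest

def require_test_model_sketch(model_dir_name: str) -> str:
    """Hypothetical sketch: look up a model directory via MLC_TEST_MODEL_PATH or skip."""
    search_roots = os.environ.get("MLC_TEST_MODEL_PATH", "").split(os.pathsep)
    for root in filter(None, search_roots):
        candidate = os.path.join(root, model_dir_name)
        if os.path.isdir(candidate):
            return candidate
    pytest.skip(f"Test model {model_dir_name!r} not found; set MLC_TEST_MODEL_PATH")
```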

* [REFACTOR][Rename]  MLC_LLM_SOURCE_DIR and TVM_SOURCE_DIR source directory env  (#2378)

* [REFACTOR] Rename to use MLC_LLM_SOURCE_DIR for the source directory

This PR switches to MLC_LLM_SOURCE_DIR for specifying the
MLC LLM source directory.

The reason for this update is that the term XXX_HOME is usually
used for different purposes in ML frameworks.

For example, both torch and huggingface have TORCH_HOME and HF_HOME
pointing to their local cache directory.

The variable MLC_LLM_SOURCE_DIR is aligned with the cmake naming convention
(CMAKE_SOURCE_DIR).

A follow-up PR will update MLC_CACHE_DIR to MLC_LLM_HOME, following
the existing practices.

* Update env to point to TVM_SOURCE_DIR

* [REFACTOR][ENV] MLC_CACHE_DIR to MLC_LLM_HOME (#2379)

This PR changes the MLC_CACHE_DIR env to MLC_LLM_HOME.
This change aligns with the convention of most other packages.

* [iOS] Switch MLC Chat to use MLCEngine (#2380)

This PR switches MLC Chat to use MLCEngine.

Also does a minor refactoring to make the serve side more
flexible in dealing with compile-time overrides.

* [REFACTOR] Cleanup legacy code (#2381)

This PR cleans up legacy code and reorganizes some of the project structure.

- Removed stale interfaces
- Removed stale examples
- Temporarily removed Rust as it depends on the chat module that we plan to phase out
- Moved embeddings to contrib (experimental)

* [Fix] Update prefix cache config (#2382)

This PR updates the prefix cache config to include the prefix cache mode and the maximum number of recycling sequences. It also adds the missing `final` keyword to member methods.

* [PREFIX-CACHE] Fix some issues with prefix cache (#2384)

This PR fixes issues with prefix cache when used together with MLCEngine.
It also fixes an issue when prefix_cache_max_num_recycling_seqs == 0

* [FIX] Typo on OpenAI Chat class in engine (#2385)

This commit fixes a typo on JSONFFIEngine Python side.

* [Serving][Refactor] Metrics and stats for CLI (#2387)

This PR introduces the `Metric` class for convenient metric update
and management in MLC. The previous `EngineStats` class is renamed
to `EngineMetrics` accordingly.

This PR brings the metric support to JSONFFIEngine, and implements
the `/stats` command in CLI.

Besides, this PR

* fixes a bug of time measurement when parallel generation exists.
* aligns the metric names with LLMPerf (particularly, we now use
`num_input_tokens`, `num_output_tokens`, `sum_num_input_tokens`, etc.)
* measures the time of a single step of BatchDecode, a single step
of draft generation in BatchDraft, and a single step of BatchVerify
when the effective batch size is less than 64 (hardcoded as a constant
as of now). This can help build the understanding of the performance
of the key actions under a series of batch size.

* [REFACTOR] Organize metrics (#2390)

This PR performs one round of reorganization of metrics into
a centralized metrics header.

Also updates the ChatState to include overrides that can be used
in future cases to run chat test.

* [Fix] Avoid ref capture in prefix cache construction (#2391)

This PR fixes the prefix cache construction in Engine, which captured
references to the models and thus prevented the GPU memory from being
freed when the Engine is destructed.

* [REFACTOR] Cleanup Metrics (#2392)

This PR runs another round of cleanup of metrics.

- Remove less useful ones
- Reorganize by labels in prometheus style

* [FIX] Fix mlc llm source dir argument (#2394)

This PR fixes the mlc llm source dir argument
in android packaging.

* [Fix] Fix the serialization of SpecDecodeMetrics (#2395)

This commit fixes a bug when serializing SpecDecodeMetrics.

* [Fix] Update missing change in engine ffi func name (#2396)

This PR applies the missing change to the engine FFI func name from #2390.

* Auto updated submodule references

* [Fix] Fix no prefix cache (#2397)

This PR fixes the no-prefix-cache path to avoid double-adding a new sequence.

* add hasattr safecheck for MLCEngineBase (#2400)

Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>

* [Refactor] Expose EngineConfig in engine constructor (#2399)

This PR lifts EngineConfig into an engine constructor argument, so that
we can hide most of the less important arguments in EngineConfig and thus
focus user attention on the few key arguments.

`mlc_llm serve` CLI and PopenServer are updated accordingly.
Documentation is updated accordingly.

* [REFACTOR] Introduce RequestMetrics and metrics endpoint (#2401)

This PR introduces RequestMetrics to collect aggregated metrics for each request.
We also introduce a Prometheus endpoint.

Finally, we fix a cyclic dependency in engine states.

* [Fix] Fix format issue of MLCEngineBase (#2402)

This PR fixes a format issue caused by #2400.

* [FIX] fix comments in radix_tree.py (#2403)

The function descriptions for `PagedRadixTree.add` and `PagedRadixTree.extend`
were misleading.

Fixed according to the implementations in radix_tree.cc.

* [Fix] Fix metric names in tests and static PrefixCacheModes (#2404)

* This PR fixes the metric names referenced in tests which were not
updated together with previous PRs.

* This PR fixes the static PrefixCacheMode member introduced in #2397.
The previous fix using a static class member is not correct and
essentially disables PrefixCache forever: when checking the `mode`
member of a PrefixCache instance, the base class mode (`kDisabled`)
is always returned.

* This PR also adds a missing header for chrono.

* [Op] Tree attention (#2376)

* [REFACTOR] Reorganize GenerationConfig DebugConfig and FFI (#2407)

This PR reorganizes GenerationConfig, DebugConfig and FFI.

- Internally, we now use the config object directly instead of a JSON stream.
- Request construction moves to the engine side so it can make use of debug_config.
- ignore_eos now moves to a debug_config option.
- Removes most string-based re-exports of the generation config.

* [Fix] Fix vector OOB when no inputs can be prefilled in spec decode (#2408)

This PR fixes an issue that causes a vector index out of bound.
This happens in speculative decoding, when one model can accept inputs
while the other cannot.

We still need to look into this inconsistency. Ideally all models should
behave the same.

* [Fix] Update number of available pages after prefix cache free (#2409)

This PR fixes an issue that causes inconsistent CanPrefill
results across different models.

* [REFACTOR] Enable validation logic in GenerationConfig (#2411)

This PR enables a centralized validation logic in GenerationConfig.

* [Chat] Support chat completion config override (#2412)

This PR supports chat CLI with arguments override.

Right now, arguments supported are: `top_p`, `temperature`,
`presence_penalty`, `frequency_penalty`, `max_tokens`, `seed`,
`stop`.

This PR adds the corresponding support to the ChatCompletion request
parsing for JSONFFIEngine.

* Change name RedixPage -> RadixPage in RadixTree.cc (#2413)

change name RedixPage -> RadixPage

* [Fix] Fix ignore_eos support (#2414)

The ignore_eos support was broken during recent refactors. This PR
fixes the support.

* [Test][Refactor] Update tests to use require_test_model (#2415)

This PR updates tests to use the `require_test_model` testing util
for better out-of-the-box testing while avoiding automatic downloading.

Some tests that require manually model compilation are kept in the
old test style (e.g., with model "llava", "eagle", etc.).

This PR also fixes some typing issues suggested by mypy.

* [Serving] Enable GPU Sampling (#2368)

enable gpu sampling

* [REFACTOR] Support latest include_usage and DebugOptions (#2417)

This PR refactors the mechanism of request end detection
and also attaches the request metrics to the response usage field.

Request/response usage field:
- include_usage can be passed to the API. When include_usage is on,
  metrics are now streamed back in usage.extra
- Changed the debug_option parameter to extra_body, so requests are fully compatible with the OpenAI client
- Support special requests in debug options; engine metrics are now streamed back via a special request

We also change the FFI mechanism for detecting response finish. Previously
we kept track of the number of stopped streams. Now the FFI always streams
back a final chunk which has no choices and contains the usage. We use the usage
field to detect the final chunk. Code paths are updated accordingly.

We also make Chat CLI a helper class that can be reused.

iOS app now comes with stats support.
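
As a rough usage sketch against an OpenAI-compatible endpoint: the base URL,
port, and model id below are assumptions that depend on how the server was
launched, and the sketch only illustrates the streaming/usage convention
described above:

```python
from openai import OpenAI

# assumed local server address and model id; adjust to your deployment
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
stream = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},  # ask for the final usage chunk
)
for chunk in stream:
    if chunk.usage is not None:
        # the final chunk has no choices and carries usage (with engine metrics in usage.extra)
        print(chunk.usage)
    elif chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
```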

* [DOWNLOAD] MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES (#2421)

This PR introduces support for MLC_DOWNLOAD_POLICY
and MLC_LLM_READONLY_WEIGHT_CACHES.

It allows reading from a read-only cache besides MLC_LLM_HOME,
and also introduces a domain subfolder in the cached weights.

* [REFACTOR] Rename MLC_LLM_READONLY_WEIGHT_CACHES (#2423)

This PR renames MLC_LLM_READONLY_WEIGHT_CACHES => MLC_LLM_READONLY_WEIGHT_CACHE
to be consistent with the rest of the env var naming convention.

* [Tokenizer] Auto-detect TokenizerInfo from tokenizer.json (#2416)

This PR adds a new `TokenizerInfo` class that contains useful information
about the tokenizer during generation. It is auto-detected from
tokenizer.json if it exists. Otherwise it raises a warning and uses
the default value (byte fallback tokenizer, not prepend/strip space).

* [REFACTOR] Remove dependencies on legacy chat_module (#2424)

This PR removes all dependencies on chat_module.py
so we can prepare to deprecate this module.

This PR refactors and moves MLCChatConfig to protocol.
This helps us consolidate all API spec and config files
under the protocol folder.

The protocol folder mainly keeps the data schema and metadata;
most of the actions (gen_config) are still kept in their current location.

* [REFACTOR] Terminology download=>download_cache (#2425)

This PR renames download to download_cache for better clarity.

* [REFACTOR] Move GenerationConfig to protocol (#2427)

This PR moves GenerationConfig to protocol.
As we move towards the OAI-style API, GenerationConfig becomes more of an internal API.

This change reflects that and also removes the duplicated definitions of ResponseFormat
and DebugConfig.

* Update README.md

* [site] Add hero section to website (#2430)

* [Compile] Skip CUDA graph rewrite when target is not CUDA (#2433)

This PR sets the CUDA graph compiler flag to false when the
backend is not CUDA. Otherwise, CUDA graph may be enabled for other
backends and cause incorrect results.

* [DOCS] Simplify read me (#2435)

This PR simplifies the README so attention can be directed
to our docs page.

* [DOCS] Update title to focus on engine feature

This commit updates the docs to focus on engine feature

* [Metadata] Remove stale KV cache size (#2434)

This PR removes the KV cache size from model metadata. This is because
we have fully switched to the new compilation flow with PagedKVCache
and MLCEngine as backend, where KV cache size is runtime dependent and
will be estimated at runtime.

* [iOS] Update the MLCSwift APIs to async (#2436)

This PR updates all MLCSwift APIs to be async
for consistency purposes.

* [Android] Switch MLC Chat to use MLCEngine (#2410)

* [Android] Switch MLC Chat to use MLCEngine

* [Serving] Add helper function - TotalDetectGlobalMemory

* [iOS] Remove Legacy ChatModule (#2437)

This PR removes the legacy chat module in iOS.

* [Delivery] Update model delivery script to support specifying the output and hf directory (#2431)

* Update model delivery script to support specifying the output directory

* [Android] Remove Legacy ChatModule (#2438)

* [Refactor] Remove ChatModule (#2440)

This PR formally removes ChatModule from the codebase, given all
the frontends have fully switched to use MLCEngine.

* [Fix][REST] Fix usage-related server tests (#2441)

This PR fixes some server tests which were broken due to recent
refactors.

* [Site] Enlarge hero image in small screens

* Fix lint

* [ANDROID] Patches to enable windows usescase (#2443)

This PR adds a few patches to enable building under Windows.

* [DOCS] Guides for android on windows (#2444)

* [DOCS] mention git-lfs (#2445)

* Fix Llama-3 conversation template. Add unit test (#2442)

* Fix Llama-3 conversation template. Add unit test

* [Grammar][Wasm] Update new grammar to wasm runtime (#2446)

* [Model] Use float32 for RoPE calculation (#2449)

This PR updates the RoPE calculation to use float32 for multiplication
and addition. This is motivated by the observation that calculating
RoPE in float16 may cause accuracy issues.
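
A toy NumPy illustration (not the MLC kernel) of the float16 issue; the
position and inverse-frequency values are arbitrary placeholders:

```python
import numpy as np

# at large positions, fp16 cannot represent the rotation angle exactly,
# so cos/sin of the angle drift noticeably from the fp32 result
position, inv_freq = 8191.0, 1.0
angle_fp16 = np.float16(position) * np.float16(inv_freq)
angle_fp32 = np.float32(position) * np.float32(inv_freq)
print(np.cos(angle_fp16), np.cos(angle_fp32))  # noticeably different values
```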

* [LogitProcessor] Use min float value as the mask value (#2451)

This PR updates the mask values in LogitProcessor to the min value
of float32. Prior to this PR it was -1e10. This update is the safest
for softmax as long as the masking is always the last step in the logit
processor.
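
A small NumPy sketch of the effect (not the actual LogitProcessor code):
masking with the float32 minimum right before softmax drives the masked
probability to exactly zero.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5], dtype=np.float32)
logits[1] = np.finfo(np.float32).min  # mask token 1 as the final step before softmax
print(softmax(logits))                # the masked entry becomes exactly 0
```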

* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2450)

This PR sets the parameter `by_alias=True` for all the `model_dump_json`
of pydantic classes, so that aliases are always respected.

* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2452)

This PR sets the parameter `by_alias=True` for all the `model_dump`
of pydantic classes, so that aliases are always respected.
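
For instance, a hypothetical pydantic model with an aliased field (not the
actual protocol class) behaves as follows:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ResponseFormatSketch(BaseModel):
    # hypothetical aliased field for illustration
    json_schema: Optional[str] = Field(default=None, alias="schema")

r = ResponseFormatSketch(schema="{}")
print(r.model_dump_json())               # {"json_schema":"{}"}  -- alias ignored
print(r.model_dump_json(by_alias=True))  # {"schema":"{}"}       -- alias respected
```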

* [DOCS] Updates the URL of the Android APK (#2453)

* Auto updated submodule references

* [Fix][Phi3] Add `</s>` as stop token for phi3 (#2455)

[Fix][Phi3] Add </s> as stop token for phi3

* [Site] Add GitHub link to hero section

* Update README.md

* [Hermes2] Add conv template for Hermes2-Pro-Llama3 (#2457)

* [Compile] Add max_batch_size to metadata (#2463)

This PR adds the max_batch_size at compile time to metadata for
runtime to read.

**Note.** This may be a breaking change for the compiled model
libraries. And please set environment variable `MLC_JIT_POLICY=REDO`
to recompile the models with JIT, or manually recompile the model
libraries.

This PR also adds the max_batch_size to qwen2.

* [REFACTOR] Re-organize the modules after transition to MLCEngine (#2464)

This PR reorganizes the modules after transition to MLCEngine.

- grammar is a root level module
- streamers and tokenizers are in the tokenizers namespace
- conversation_template is now a module

Testcases are restructured accordingly. We also removed some of the stale files.

* [Serving] Add ICHECK for running batch size (#2465)

This PR adds ICHECKs to make sure that the running batch size
in BatchDecode and BatchDraft does not exceed the `max_num_sequence`
in the engine config.

The prefill actions should keep this invariant, and the added ICHECKs
mainly serve internal error detection and reporting purposes.

* Auto updated submodule references

* [TEST] Start to categorize tests (#2466)

* [TEST] Start to categorize tests

This PR adds test categorization via pytestmark.

For now we have five categories of tests

unittest
op_correctness
engine
endpoint
uncategorized

We should start to fix some of the broken tests
and move them to these categories. When possible,
we should cover a bug under unittest, since those run on every PR
as part of the CI.

* Implemented FP8 calibration (#2454)

* Implemented FP8 calibration

* update

* add transformers

* Use encode_batch


---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [CI] Update CUDA build script with FlashInfer options (#2469)

This PR updates the CI CUDA build script with FlashInfer compile
options after a recent bump of FlashInfer version.

* [Serving] Use preferred host memory for host NDArrays (#2468)

This PR updates the host memory in the model, logit processor, and GPUSampler
to use the preferred host device, so that for CUDA and ROCm
pinned memory is used for the host arrays, which may be faster
than the default CPU memory during copying.

* [TEST] Temp disable UT stage

This PR temporarily disables the UT stage until we can get a fix for the Docker execution.

* [CUDA] Turn on cuda graph at O2 (#2467)

* [CI] Enable GPU env in CI  (#2476)

* [CI] Enable GPU env in CI

This PR enables GPU env in ci docker/bash.sh

* remove dep on tvm testing plugin

* [CMake] Update config.cmake generation script (#2478)

This PR updates the config.cmake generation script to provide
the FlashInfer compile options explicitly.

* [TEST] MockEchoEngine (#2479)

This PR introduces a MockEchoEngine that echoes the
input prompt and the generation config (as part of usage.extra).

The engine can be used to create unit-test cases that cover engine API handling.
Note that mock tests cannot replace real engine tests.

* Auto updated submodule references

* [Fix] Fix JSONFFI MemoryBufferStream after dmlc bump (#2480)

A recent bump in dmlc has changed the `Write` signature of
`dmlc::Stream`. This commit updates the codebase to follow the
upstream change.

* [JSON-FFI] Enable n generation and pass in json schema (#2481)

This PR enables n-generation and passing in a JSON schema in the JSON FFI.

* Refactor model delivery script to use pydantic (#2482)

* Fix tokenizers encode batch (#2484)

* [Bugfix] Fix delivered log issue in delivery cli (#2489)

* Support Qwen2-MoE Architecture (#2089)

* [3rdparty] Bump tokenizers-cpp to include HF tokenizers bump (#2490)

This PR bumps the 3rdparty tokenizers-cpp to include the HuggingFace
tokenizers package bump, in order to support some latest models such
as Mistral v0.3.

* [Bench] Add mlc bench (#2474)

This PR adds an initial pass of the bench infra

* Auto updated submodule references

* Enable n-sampling for Medusa spec decoding (#2495)

* Fix get_num_available_pages for model without kv cache


* Enable n-sampling for Medusa spec decoding

* [CONFIG] Remove mean_gen_len from the config (#2493)

This PR removes legacy mean_gen_len from the config

* Update ios android docs (#2497)

* [Bench] Add seed to __init__ and some minor change (#2496)

* [Fix][Config] Max total sequence length overflow with sliding window (#2500)

This PR fixes an issue which causes the int64 multiplication overflow
when sliding window is enabled.

* [Serving] PagedKVCache tree-attention integration (#2487)

This PR integrates the recent support of tree-attention in PagedKVCache
into the speculative decoding in MLC. Right now only chains are
supported. Tree-based speculative decoding is on the project road map
and we are planning to support it in recent future.

* [Sampler] Enhance checks for whether FlashInfer is enabled (#2502)

This PR improves the check in the GPU sampler for whether FlashInfer is
enabled. Previously we did not check the CUDA compute capability,
which made the GPU sampler unable to run properly on Colab, where
the T4 GPU has compute capability 7.5, which FlashInfer does not
support.

With this PR, when the compute capability is less than 8.0, we
do not use FlashInfer in the GPU sampler.

* [Android] Updates the default model list and the APK link in the document (#2503)

* [Android] Update default model list

Update the default model list in Android to include the following models
1. Phi-3-mini-4k-instruct-q4f16_1-MLC
2. Llama-3-8B-Instruct-q3f16_1-MLC
3. Mistral-7B-Instruct-v0.3-q4f16_1-MLC

* [DOCS] Updates the URL of the Android APK

* [Fix] Fix the global func name of TokenizerDecode (#2514)

This PR fixes the global func name for `TokenizerDecode`, which was
not updated when adding the namespace `tokenizers`.

* [Fix] Use the correct model to validate stream_options (#2508)

* [Fix] Typo in docs/install/tvm.rst (#2507)

Fix a typo in serve/engine.py

* [FP8] Use f32 scale to enable better fusion (#2505)

* [Metrics] Add ttft and itl to server metrics (#2510)

* Add ttft and itl to server metrics

* Fix ITL

* Fix clang-format

* Keep mobile and interface.chat untouched

* [Model] Fix config detection for Mistral (#2504)

The Mistral model has removed the sliding window since v0.2, while
in MLC we always enabled the sliding window. This PR updates the config
detection so that when the sliding window is disabled, we check
the context window size and make sure it is properly set.

* [Fix] Provide a GetTokenId API for SampleResult (#2516)

Currently we use `sampled_token_id.first` to find the sampled token id
of a SampleResult object, which is obscure. This PR provides a
`GetTokenId` API for SampleResult to get the sampled token id.

This PR also updates the testing model path to include `./dist/`.

* [Reapply][BUGFIX] Fix rare deadlock in threaded engine (#2429) (#2518)

This PR reapplies #2429, which is missing in the main branch.

Below is the original commit message:

This PR fixes rare deadlock cases when engine unload/reload

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Fix] Fix metrics division by 0 (#2519)

This PR fixes an issue in the per-request metrics, where division by zero
may happen when the request does not run any decode step.

The division by zero results in `inf`, which is added into a JSON file.
However, `inf` is usually not recognized as a float value by the JSON
grammar, so JSON parsers fail on parsing any JSON string that contains
`inf` without being quoted.

* Corrected the folder path for Android Studio Project (#2520)

Update android.rst

Android project path corrected

* Update tvm.rst

* [iOS] Update model list (#2524)

Update the model list of iOS in `mlc-package-config.json`.

* [Android] Updates the order of the model list and the APK link in the document (#2526)

[Android] Updates the default model list and the APK link in the document

1. Qwen1.5-1.8B-Chat-q4f16_1-MLC

* [Sampler] Skip top-p renormalization if top-p is 1 in CPUSampler (#2528)

This PR adds a shortcut in the top-p renormalization in CPU sampler,
which skips the renormalization when top-p is 1.0.
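
A NumPy sketch of the idea (not the MLC CPU sampler itself):

```python
import numpy as np

def renormalize_top_p(probs: np.ndarray, top_p: float) -> np.ndarray:
    # shortcut: top-p of 1.0 keeps the full distribution, so renormalization is a no-op
    if top_p >= 1.0:
        return probs
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```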

* [Docs] Rename javascript.rst to webllm.rst (#2531)

* [Conv] Add tinyLlama v1.0 conv template (#2530)

* [Conv] Add tinyLlama v1.0 conv template

* Fix lint

* [iOS] correct mistral q3 url and handle screen switch off (#2529)

This PR corrects the Mistral q3 URL.

This PR also adds a handler for screen switch-off.
For now we just reset if the app is generating;
we will update to pause/resume once they are supported.

* [Grammar] Fix include protection and paths in docstring (#2515)

Following #2464, this PR fixes the include protection in the header
files and the paths in the docstrings of the header files.

This PR also fixes tests that were broken after the refactor.

* [Tokenizer][Fix] Fix SegFault when analyzing tokenizers without tokenizer.json (#2532)

Previously the tokenizer would segfault when analyzing a tokenizer
that did not have a tokenizer.json file.

This is because `TokenizerInfo()` was called previously, which creates
a null object. This PR fixes this problem.

* [Serving] Use stop strs and token ids for completions (#2534)

This PR applies the stop strings and stop token ids defined in the
conversation template to raw text completions, so that whenever
the model outputs a stop token id or stop string, the raw generation
can stop.

Prior to this commit, raw text generation never stopped when max tokens
was not given. This commit helps reduce the frequency of such events.
Nevertheless, if the model does not output a stop string/token id,
the generation will still not stop.

* [Serving] Support tensor parallel shards override in command line (#2533)

This PR supports command line overrides for model JIT compilation.
This is especially helpful for enabling tensor parallelism out of the box,
so people don't need to manually tweak `mlc-chat-config.json` to
use tensor parallelism.

* Add tie_word_embedding option for Qwen2 model (#2535)

* [Bench] Defaults to aiohttp client, add ServerMetrics (#2527)

* [Bench] Defaults to aiohttp client

* Add ServerMetrics to summary

* Remove duplicate servermetric def

* [Android] Remove var capture in TVM_SOURCE_DIR (#2538)

This PR fixes the TVM_SOURCE_DIR parsing issue on Windows.

* [Fix] Fix inconsistent system prompt handling (#2539)

This PR fixes the conversation template of ChatML, whose
system prompt ends with `<|im_end|>`.

An inconsistent handling of system prompt between the JSONFFI side
and the Python side is also corrected.

* [Attention] Fix attn kernel for general GQA group size (#2543)

This PR fixes the TIR prefill attention kernels to support a broader
list of GQA group sizes.

* fix: typo error (#2544)

* [Fix] Fix attn kernel build issue (#2545)

This PR fixes TIR issues in the attn kernels.

* [iOS] Add Qwen2 support (#2547)

This PR adds Qwen2 support to MLC Chat.

* [Android] Add Qwen2 support (#2548)

* [Android] Escape backslashes and quotation marks (#2546)

This commit escapes the backslashes and quotation marks in Android
package build.

* [EngineConfig] Add override options (#2550)

This PR introduces override options to the Python side EngineConfig
so that they'll be reflected in JIT model compilation.

* [Site] Update link to webllm

* [Site] Update heading

* [Preset] Add model preset for model delivery (#2553)

[Preset] Add model preset for wasm delivery

* Update docs to remove mention of older models (#2557)

* [Docs] Fix typo in mlc_llm chat command (#2560)

* Fix compilation for gcc 13.2 (#2561)

* [Tokenizer] Prioritize HuggingFace/SentencePiece over ByteLevelBPE (#2559)

This PR updates the tokenizer load logic, so that we prioritize
the use of HuggingFace and SentencePiece tokenizers over the
ByteLevelBPE tokenizer.

This fixes the issue where the token `<im_start>` in the Qwen model is
tokenized into multiple tokens when the ByteLevelBPE tokenizer
is chosen when available.

* [Serving][Grammar] Jump-forward decoding (#2551)

[Serve][Grammar] Jump-forward decoding

This PR supports the jump-forward decoding as described in
<https://lmsys.org/blog/2024-02-05-compressed-fsm/>. The jump-forward
decoding uses the grammar constraint to predict the next output string and
tokenize the string into tokens, and therefore speeds up the decoding.

This PR implements these optimizations to ensure the output quality:
- Retokenization in jump-forward (see the sketch after this list): tokenize the last k tokens
  as a string with the predicted string appended. If the tokenization result differs from the
  old tokens, roll back these tokens and accept the new ones.
- Retokenization in decoding: tokenize the last k tokens as a string with the decoded token
  appended. This happens in the decoding stage when jump-forward decoding happened in the last
  round. If the result differs, the old tokens will be rolled back.
- Skip prefix tokens in jump-forward: we call tokens that are a prefix of another token
  prefix tokens. If the last token from jump-forward is a prefix token, it is highly likely to
  be rolled back in the next decode stage, as it may be combined with the decoded token. It
  also affects the output distribution, as such a pattern is rare in training data.
  Therefore, we skip the last prefix token in jump-forward decoding.
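
A rough Python sketch of the retokenization check in the first bullet above,
assuming an HF-style tokenizer with `encode`/`decode`; this is a conceptual
illustration, not the MLC implementation:

```python
def jump_forward_sketch(tokenizer, generated_ids, predicted_string, k=5):
    """Append the predicted string, re-tokenizing the last k tokens to stay consistent."""
    last_k = generated_ids[-k:]
    prefix_text = tokenizer.decode(last_k)
    new_ids = tokenizer.encode(prefix_text + predicted_string)
    if new_ids[: len(last_k)] == last_k:
        return generated_ids + new_ids[len(last_k):]   # old tokens unchanged: just append
    return generated_ids[:-k] + new_ids                # roll back the last k tokens, accept new ones
```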

This PR also includes the following changes:
- Add several metrics for request and engine, especially about the jumpforward decoding
- Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from early return

Performance and benchmark:

Schema(Pydantic):
```
class Product(BaseModel):
    product_id: int
    is_available: bool
    price: float
    is_featured: Literal[True]
    category: Literal["Electronics", "Clothing", "Food"]
    tags: List[str]
    stock: Dict[str, int]
```

Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G

Results:
```
Jump forward: False, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.4988938220000001,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 66,
    "decode_tokens_sum": 66,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 132.2926785010378,
}
Jump forward: True, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.37242740600000007,
    "engine_jump_forward_time_sum": 0.027989265000000006,
    "completion_tokens_sum": 68,
    "decode_tokens_sum": 68,
    "jump_forward_tokens_sum": 28,
    "decode_tokens_per_s": 182.58591850246378,
}
Jump forward: False, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.9106805410000002,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 261,
    "decode_tokens_sum": 261,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 286.5988546470984,
}
Jump forward: True, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.6843025599999999,
    "engine_jump_forward_time_sum": 0.028089531999999997,
    "completion_tokens_sum": 266,
    "decode_tokens_sum": 266,
    "jump_forward_tokens_sum": 112,
    "decode_tokens_per_s": 388.71694415405966,
}
Jump forward: False, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.62462493,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 538,
    "decode_tokens_sum": 538,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 331.1533573475325,
}
Jump forward: True, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.0509048310000002,
    "engine_jump_forward_time_sum": 0.027971332000000022,
    "completion_tokens_sum": 525,
    "decode_tokens_sum": 525,
    "jump_forward_tokens_sum": 224,
    "decode_tokens_per_s": 499.5694990767436,
}
Jump forward: False, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 2.317279175,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 1068,
    "decode_tokens_sum": 1068,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 460.8853398080531,
}
Jump forward: True, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 1.3962938819999997,
    "engine_jump_forward_time_sum": 0.030129287999999994,
    "completion_tokens_sum": 1059,
    "decode_tokens_sum": 1059,
    "jump_forward_tokens_sum": 448,
    "decode_tokens_per_s": 758.4363246533227,
}
```

* [Delivery] Update model delivery script (#2565)

Some improvements to the delivery script:

- provide different overrides for different quantizations, e.g. we can change
the prefill chunk size for q0/q3/q4
- rerun gen config only if the conv_template changes
- do NOT recreate the HF repo when the repo already exists. This preserves
the commit history
- dry-run validation

* [Model] Enhance error reporting for invalid tensor-parallel settings (#2566)

This PR enhances the error reporting for multi-GPU model compilation,
so we can provide as many error reasons as possible before loading and
running the models.

* [Serving] Apply tree structure in draft token verification (#2563)

This adds the interface to the draft token state and sampler to allow the tree
structure to be recorded and used for verification.

* [Bench] Json mode bench (#2552)

* [Bench] Json mode bench

This PR refactors mlc bench to enable json mode in dataset.

* upd

* fix lint

* [Model] Support Multi-GPU for Qwen-MoE model (#2573)

This PR introduces the multi-GPU support for the Qwen-MoE model.
Validated on 4090x2.

* [Metrics] Add missing fields in `Reset` (#2574)

This PR adds the missing fields that were not cleared up in
`EngineMetrics::Reset`.

* [Doc] Update WebLLM doc (#2578)

Update documentation for WebLLM. Currently we only provide a high-level view of the WebLLM runtime here, and refer users to the WebLLM repo README for more. The documentation focuses on adding one's own model variant / model library for WebLLM. We will follow up with more thorough runtime documentation.

* [Op] Top-4 implementation for MoE model (#2586)

This PR introduces a top-4 kernel for MoE models (particularly for
Qwen-MoE) at this moment.

This is still a manual implementation and has some duplication
with the existing top-2 kernel. In the future we'll consider leveraging
meta-programming of TIR to unify the top-k kernel implementations.

* [Model] Gemma 1.1 compatibility (#2594)

This PR updates the Gemma config so that MLC can work properly with
Gemma 1.1.

* [Serving] Hybrid prefill (#2604)

This PR adds support for hybrid prefill, so during the prefill
engine action it will also do the decode for running requests.

* Update quick_start.rst to fix broken links (#2607)

Update quick_start.rst

Fix broken links for convert weights and compile model pages

* [Fix] Set the missed prefill finish time (#2613)

This PR fixes a bug which fails to set the prefill finish time
and results in a metrics error.

* [Android] Reduce binary size (#2606)

This PR updates the Android app to reduce the binary size.
Right now it can be reduced to 108MB when building with only the
Phi-3-mini-4k model.

* [Fix] Gemma hidden_activation compatibility (#2614)

This PR fixes the Gemma config compatibility issue.

* Update debug_compare (#2612)

This PR fixes a bug of the debug_compare.py script.

* [SLM] Add support for InternLM2 architecture (#2608)

This commit introduces the InternLM2 model support.

* [Fix] Prefix cache only enables sliding window on leaf sequence (#2615)

This PR updates the prefix cache to align the logic of enabling the sliding window. Now only the leaf sequence has sliding window attention enabled.

* [Android] Update include path for tvm runtime src (#2616)

This PR updates the include directories for the Android app
so that we can avoid using macros for src file include.

* [Fix] Mark the decode requests in hybrid prefill (#2621)

This PR fixes an issue that may cause duplicate prefix updates
for the decode requests in the hybrid prefill action.

* [Fix] Fix the chunked prefill condition (#2628)

This PR fixes a bug of the prefill chunking which may cause the
running batch size exceeding the maximum allowed batch size.

* [SLM] Internlm2 Multi-GPU support (#2626)

This PR enables the TP function of the InternLM2 model.

* [Serving] Merge multiple token embedding lookup into one (#2629)

This PR supports merging multiple token embedding lookups into a single
one, since each token embedding lookup needs to go through the model,
and multiple lookups introduce extra overhead.
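
Conceptually (a NumPy sketch with assumed shapes, not the engine code), the
merge concatenates all token ids, performs one lookup, and splits the result
back per request:

```python
import numpy as np

embedding_table = np.random.rand(32000, 4096).astype("float32")  # hypothetical vocab/hidden sizes
request_token_ids = [np.array([1, 5, 9]), np.array([2, 2]), np.array([7])]

merged_ids = np.concatenate(request_token_ids)       # one lookup for all requests
merged_embeddings = embedding_table[merged_ids]
lengths = np.cumsum([len(ids) for ids in request_token_ids])[:-1]
per_request_embeddings = np.split(merged_embeddings, lengths)
```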

* [Model] Support Internlm2.5 (#2630)

The InternLM2.5 series, which has outstanding features, was released just
days ago, and this PR supports InternLM2.5 by adding the model preset
internlm_2_5_7b.

* Fix for RWKV new config and new format vocab (#2632)

* [Fix] Fix KV cache single-page copy kernel (#2644)

The current single-page copy kernel misses a predicate, which may
cause incorrect attention results in serving, when RemoveRequest
is involved.

* [Fix][Tokenizer] Fix failure in decoding tokens for ByteLevel BPE (#2649)

This PR fixes the issue where the tokenizer would fail in
decoding tokens for ByteLevel BPE when the token is not recognized by
ByteLevel. E.g. in decoding,

```
"hello" -> "hello" (recognized by ByteLevel)
"Ġthere" -> " there" (recognized by ByteLevel)
"\n" -> not recognized by ByteLevel
"\u203c" -> not recognized by ByteLevel
```

This PR adds the logic that in decoding, when the token is not
recognized by ByteLevel, the original token will be returned. Then

```
"hello" -> "hello" (recognized by ByteLevel)
"Ġthere" -> " there" (recognized by ByteLevel)
"\n" -> "\n" (not recognized by ByteLevel)
"\u203c" -> "\u203c" (not recognized by ByteLevel)
```

This behavior is aligned with HuggingFace tokenizers.

* [Fix][Bitmask] Mask dummy padded tokens for grammar (#2651)

* [Engine] Reduce action post-process overhead (#2653)

This PR reduces the post-processing overhead and adds more detailed
NVTX instrumentation.

* [PrefixCache] Defer sequence extension (#2654)

This PR defers the prefix cache sequence extension.
Previously, the prefix cache update was committed after every action,
which is unnecessary. We can defer this sequence extension and
commit the extensions when the prefix cache is used again.

This PR also changes the IntTuple used in PrefixCache to
`std::vector<int32_t>` for less data structure construction overhead.

* [Model] Support Starcoder2 (#2657)

This PR supports Starcoder2 model.

* [Engine] Lazy recompute in GetRunningRequestStateEntries (#2655)

This PR updates GetRunningRequestStateEntries to make it lazy.
We use a dirty flag to check whether the running request state entries
have changed since the last recompute.

We make this improvement due to the observation that this function
may cause some CPU overhead. During consecutive rounds of batch decode,
the running requests don't change, so we can effectively use this
dirty flag to avoid recomputation.
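
The dirty-flag pattern described above, in a small Python sketch (the class
and method names are illustrative, not the actual C++ code):

```python
class RunningEntriesCache:
    """Illustrative dirty-flag cache: recompute only when the running set changed."""

    def __init__(self, recompute_fn):
        self._recompute_fn = recompute_fn
        self._entries = []
        self._dirty = True

    def mark_dirty(self):
        # call whenever requests are added, finished, or preempted
        self._dirty = True

    def get(self):
        if self._dirty:
            self._entries = self._recompute_fn()
            self._dirty = False
        return self._entries
```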

* [Fix] Fix prefix cache reuse with eagle mode (#2664)

This PR fixes the prefix cache bug with eagle mode on.
The prefilled offset was not shifted in this case.

* [Model] Support SmolLM (#2667)

This PR supports HuggingFace's SmolLM. The only change needed
is to support `tie_word_embeddings` in `llama_model.py`.
Currently we extend an `nn.Embedding`, following our approach for
QWen2. In the future we can think about abstracting it out, perhaps
implementing `forward_as_linear()` for `nn.Embedding`.
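
Weight tying in a nutshell (a NumPy sketch under assumed shapes, not the
actual `llama_model.py` code): the same table serves as the input embedding
and the output projection.

```python
import numpy as np

vocab_size, hidden_size = 1000, 64                  # hypothetical sizes
embed_table = np.random.randn(vocab_size, hidden_size).astype("float32")

token_ids = np.array([3, 42, 7])
hidden_states = embed_table[token_ids]              # embedding lookup ("forward")
logits = hidden_states @ embed_table.T              # reuse as the lm_head ("forward_as_linear")
```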

* [SLM] Starcoder2 Multi-GPU support (#2662)

This PR supports the TP function of Starcoder2 and fixes two typos.

* [Engine] Defer the collection of decode inputs in prefill (#2668)

This PR defers the collection of decode inputs in hybrid prefill,
as collecting the decode inputs may cause significant CPU overhead
even when it turns out no prefill can be performed. By deferring the
collection of decode inputs, we can quickly decide whether prefill
is doable, and this decision does not involve too much CPU overhead.

* support mistral-nemo (#2676)

* [Model] Fix annotation typos  (#2672)

* Update starcoder2_quantization.py

* Update qwen2_loader.py

* Update qwen2_model.py

* Update qwen2_moe_loader.py

* Update rwkv5_loader.py

* Update rwkv6_loader.py

* Update qwen_loader.py

* Update phi3_quantization.py

* Update phi_quantization.py

* Update phi3_model.py

* Update phi3_model.py

* Update phi3_quantization.py

* fix tp

* [Model] Support Llama3.1 (#2682)

This PR supports the [Llama3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
family.

Particularly we introduced the conversation template and RoPE scaling
for Llama3.1. In the future we will bring the support of more RoPE
scaling.

Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>

* [SLM] Introduce microsoft/Phi-3 vision (#2658)

Introduce microsoft/Phi-3 vision from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

* [Preset] Add llama3.1 to preset, comment out llama3 (#2683)

* [Pass] Rewrite FuseAddRMSNorm to avoid binding rewrite recursion (#2689)

This PR revamps the FuseAddRMSNorm pass with manual pattern matching,
in order to avoid `rewrite_bindings`, which is recursive and may
take an unaffordable amount of time when the model is large.

* Initialize all `local_top_k` values in `gating_softmax_topk` (#2694)

If `x` has `nan` or `-inf` values, the condition `x[vi,vk] >
local_top_k[0]` may be false.  Falling back to the condition `x[vi,vk]
> local_top_k[1]` then reads the uninitialized value in
`local_top_k[1]`.

This can also result in out-of-bounds memory access.  If all values in
`x[vi,vk]` are `nan` or `-inf` along some row `vi`, then
`local_top_k_index[1]` is never populated.  For mixture-of-experts
models, when `gating_softmax_topk` is used to select the expert, this
uninitialized value is then used as an array index.

This commit updates the `top2_softmax_norm_func` implementation in
`gating_softmax_topk` to initialize both elements of the `local_top_k`
and `local_top_k_index` arrays, matching the implementation of
`top4_softmax_norm_func`.
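
A scalar Python analogue of the fix (not the TIR kernel): initializing both
value slots and both index slots keeps the selected indices in bounds even
when every input is `nan` or `-inf`.

```python
import math

def top2_with_init(xs):
    # initialize BOTH value slots and BOTH index slots before scanning
    best_vals = [-math.inf, -math.inf]
    best_idx = [0, 0]
    for i, x in enumerate(xs):
        if x > best_vals[0]:
            best_vals[1], best_idx[1] = best_vals[0], best_idx[0]
            best_vals[0], best_idx[0] = x, i
        elif x > best_vals[1]:
            best_vals[1], best_idx[1] = x, i
    return best_vals, best_idx

print(top2_with_init([float("nan")] * 4))  # indices stay valid instead of being garbage
```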

* [Serving] Fix spec decoding call packed with rvalue (#2699)

* [ASYNC] Properly abort cleanup in async handling (#2698)

This PR adds a context manager to properly clean up
during async-for exceptions.

Naively using the try/except pattern results in bugs when we chain up
async generators and an exception gets raised between iterations,
outside the try/except.

* [Serve] Expose prefill mode option (#2701)

This PR exposes the prefill mode option: either chunked prefill, or
hybrid prefill with split-fuse decode.

* [Fix] Fix hybrid prefill disabled (#2705)

This PR fixes #2701 for the case where the prefill mode is chunked but the prefill requests are not collected.

* Turn on custom allreduce by default in O3 (#2706)

* [Fix] Fix hybrid prefill index error (#2707)

This PR fixes the index error when hybrid prefill is enabled.

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [Bench] Revamp benchmark submodule (#2702)

This PR revamps the benchmark submodule with a `__main__` entry
that enables running the benchmark.

* [Serving] Fix handling of num_tokens_for_next_decode in spec decoding (#2709)

* Update worker.py for compatibility with upstream TVM (#2712)

This commit updates `mlc_llm.cli.worker` to be compatible with
upstream TVM https://github.com/apache/tvm/pull/17180, which adds a
`num_groups` argument to the disco worker function.

To de-couple this compatibility from a general TVM version bump, this
commit has a check on the number of `worker.py` arguments provided, to
determine whether the `num_groups` argument is present.  After the TVM
version used by MLC-LLM is updated to include the upstream changes,
this check can be removed.
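
A rough sketch of such an argument-count check; the argument names and counts
below are placeholders, not the real worker.py interface:

```python
import sys

def parse_worker_args(argv):
    # hypothetical layouts; the real worker.py arguments may differ
    if len(argv) == 5:
        worker_id, num_workers, num_groups, read_fd, write_fd = argv   # newer TVM passes num_groups
    else:
        worker_id, num_workers, read_fd, write_fd = argv               # older TVM does not
        num_groups = 1
    return worker_id, num_workers, num_groups, read_fd, write_fd

if __name__ == "__main__":
    print(parse_worker_args(sys.argv[1:]))
```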

* Add support for Gemma2 (#2674)

* Add support for Gemma2

* Update Gemma2 impl

This commit updates the Gemma2 implementation, including the following
aspects:

1. We try to reuse as much code as possible from the Gemma model for
the overall code structure clarity and management.
2. We properly set the scaling factor for attention.
3. We add the final logit soft-capping for Gemma2.

---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [Preset] Add gemma2 preset (#2715)

Add gemma2 2b 9b and 27b to preset, remove gemma1 preset.

* [Android] Update model for Android APK (#2718)

* Update android package config from gemma 2b to gemma 2 2b

  * Revert phi3 model definition for backward compatibility

* [iOS] Add Gemma2 for iOS app (#2717)

This commit switches the Gemma model in iOS app to Gemma2.

* Default bundle gemma2 (#2721)

* [Bench] LLMPerf dataset (#2713)

This PR adds the LLMPerf into benchmark module.

* [ConvTemplate] Update Gemma template with <bos> (#2722)

This commit adds `<bos>` to the gemma's conversation template.

* [C++] Handle system_prefix_token_ids in C++ Conv template (#2723)

The `system_prefix_token_ids` of the conv template usually already contains
the bos token, which should be processed when converting the message
list to a single prompt. However, the C++ side did not respect
this field well before.

* Delete .gitmodules

---------

Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: krishnaraj36 <quic_kvegiraj@quicinc.com>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: Rick Zhou <rickzhoucmu@gmail.com>
Co-authored-by: Git bot <bot@noreply.github.com>
Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Animesh Bohara <ani.bohara@gmail.com>
Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com>
Co-authored-by: Nestor Qin <imba.qxy@gmail.com>
Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com>
Co-authored-by: Faolain <Faolain@users.noreply.github.com>
Co-authored-by: Bodhi <3882561+BodhiHu@users.noreply.github.com>
Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>
Co-authored-by: Hyunsung Lee <ita9naiwa@gmail.com>
Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: tqchen <tqchenml@gmail.com>
Co-authored-by: Mengshiun Yu <mengshyu@gmail.com>
Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: rmstc <ramees025@gmail.com>
Co-authored-by: KEL <me@iamkel.net>
Co-authored-by: Andrey Malyshev <ma_elvin@mail.ru>
Co-authored-by: Gunjan Dhanuka <d.gunjan@iitg.ac.in>
Co-authored-by: Shushi Hong <820958424@qq.com>
Co-authored-by: Yao Yujian <yyjhao@gmail.com>
Co-authored-by: Eric Lunderberg <Lunderberg@users.noreply.github.com>