[Serving] PagedKVCache tree-attention integration #2487

MasterJH5574 · 2024-06-02T04:45:35Z

This PR integrates the recently added tree-attention support in PagedKVCache into speculative decoding in MLC. For now, only chains are supported. Tree-based speculative decoding is on the project roadmap, and we plan to support it in the near future.
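To illustrate why a chain is a special case of a tree here, the following is a minimal Python sketch and not code from this PR. It assumes a draft is described by a parent-index array, and the names `tree_attention_mask` and `chain_parents` are hypothetical helpers, not MLC or PagedKVCache APIs. Each draft token attends to itself and its ancestors; for a chain this reduces to the usual causal (lower-triangular) mask.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[int]]:
    """Return mask[i][j] = 1 if draft token j is token i itself or an ancestor of i.

    parents[i] is the index of the parent of draft token i, or -1 for a root.
    (Hypothetical representation, chosen only for this illustration.)
    """
    n = len(parents)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk up from token i to the root, marking every node on the path.
        while j != -1:
            mask[i][j] = 1
            j = parents[j]
    return mask


def chain_parents(length: int) -> List[int]:
    """A chain is the degenerate tree in which token i's parent is token i - 1."""
    return [i - 1 for i in range(length)]


# A chain of 4 draft tokens gives a lower-triangular (causal) mask.
chain_mask = tree_attention_mask(chain_parents(4))

# A small branching draft: tokens 1 and 2 are children of 0, token 3 is a child of 1.
branch_mask = tree_attention_mask([-1, 0, 0, 1])
```

In the branching case, token 3 attends to tokens 0, 1, and 3 but not to its sibling branch (token 2), which is the behavior a general tree-based draft would need.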

… July 2nd 2024) (#272)
* [Bugfix] layer_norm_eps in GPT2Config should be float (#2240)
* [REFACTOR] Migrate JSONFFIEngine to formal namespace (#2241)
  This PR migrates JSONFFIEngine to a formal namespace. Also list TODOs to further simplify the JSONFFIEngine.
* [Serving] Share disco sessions among multiple model function tables (#2242)
* [DOC] Improve Install via environment variable (#2245)
  improve Install via environment variable
* [Sampler] FlashInfer sampling func integration (#2224)
  This PR integrates the sampling function in FlashInfer. We integrate the one without top-p for now.
* Model Library Delivery (#2139)
* add model lib delivery
* fix lint
* [Support] Simplify function names in encoding.h (#2251)
  This PR simplifies the tool function names in encoding.h. The new names are
  - PrintAsUTF8
  - PrintAsEscaped
  - ParseNextUTF8
  - ParseUTF8
  - ParseNextUTF8OrEscaped
  Also make ParseNextUTF8 return the new char pointer instead of the number of chars processed to make the interface simpler.
* [Serving] Introduce DraftTokenWorkspaceManager (#2250)
  Using DraftTokenWorkspaceManager to maintain workspace for draft probs and hidden states (if needed). This allows states of the draft token to be kept fully on GPU.
* [Fix] fix a typo in event_trace_recorder (#2253)
* Fix typo in event_tracer
* [Tokenizer] Support ByteLevel BPE in tokenizer token table (#2248)
* [Eagle] Avoid worker - engine transfer for hidden states (#2256)
* [Serving] Add engine stats for speculative decoding (#2257)
* [Serving] Fix lints (#2258)
* [Sampler] Avoid unnecessary sync in GPU verifier (#2260)
* Fix typo in token_postproc_method names (#2261)
* [Sampler] Add missing sync in gpu verifier (#2262)
* [Model] Remove redundant space in llama2 tokenizer (#2263)
* [Model] Fix llama2 chat template and remove redundant separator added by engine (#2264)
* [Model] Fix llama2 chat template and remove redundant separator added by engine
* [Refactor][Serving] EngineConfig refactor and "model-lib-path" rename (#2268)
  * This PR refactors the EngineConfig to allow minimal JSON string passing. This is helpful for the JSONFFIEngine construction.
  * This PR moves the automatic engine config inference from Python side to C++ side, so that we don't have duplicate code on multiple platforms.
  * This PR renames `model_lib_path` to `model_lib`.
  * This PR makes the reload/unload of ThreadedEngine act in a blocking style.
  * This PR refactors the default generation config process flow, and unifies everything to C++.
* [Serving] Add some try-except captures in AsyncMLCEngine (#2265)
* [Serving] Add some try-except captures in AsyncMLCEngine
* [Eagle] Fix token shifting for prefill step (#2266)
* [Fix] Fix the two-stage softmax func by removing log2e (#2269)
* [Fix] Fix the two-stage softmax func by removing log2e
  When two-stage softmax was introduced, we use a log2e numeric transformation for some potentially better performance. However, under the case of low temperature, the log2e transformation is not numerically stable, which may cause the softmax result not summing up to 1. This PR fixes this by removing all the log2e related calculation.
* Remove redundant import
* [Eagle] Fix missing broadcast in hidden states gather/scatter (#2271)
* [Eagle] Fix missing broadcast in hidden states gather/scatter
* [Sampler] Use pivot-based renormalization for top-p sampling (#2272)
  This PR integrates the pivot-based prob renormalization for top-p sampling, whose performance is a few times faster than the current sort-based top-p sampling on CUDA.
* [JSONFFI] Update JSONFFI error checking with the Result class (#2275)
  This PR updates the error checking in JSONFFIEngine and related request parsing to use the Result class.
* [Bugfix] fix _kv_cache_transpose_append buffer read region error (#2277)
* improve Install via environment variable
* [HotFix] fix kv_cache_transpose_append buffer region
* [GenConfig] Set upper bound for prefill chunk size (#2278)
  By default the prefill chunk size is set to the context window size or the sliding window size. When the number is large, our memory planning during model compilation will allocate a lot memory. Given we have support for input chunking, we can reduce the prefill chunk size to a small value to save runtime memory. This PR sets the prefill chunk size to be at most 2048.
* [iOS] Initial scaffolding of MLCEngine in Swift (#2279)
  [iOS] Initial scaffolding of LLMEngine in Swift
  This PR adds initial scaffolding of LLMEngine in swift. We wraps callback to AsyncStream so it can be accessed using for await API. We also added an minimal example app to showcase the new MLCEngine, the old ChatModule is still used in the MLCChat App. The return value is structified already. We will still need to structurify the chat completion interface.
* Rename READMD.md to README.md
* [Serving] Image support in JSONFFIEngine (#2208)
  Using new Result interface
  Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
* [Pass] Attach manual softmax-with-temperature (#2280)
  This PR updates all the models to use the new softmax-with-temperature function, which inlines the temperature division (or argmax if temperature is 0) process into the two-stage softmax. Unit benchmark shows that the inline of division does no harm to the softmax. When batch size is large, the inlined softmax can have better performance than a standalone divide kernel, which takes much time when batch size is large.
* [Model] Remove unused import to fix lint (#2284)
  This PR removes the unused import in llava model to fix lint.
* [Serving] Fix BatchVerify to feed the extra token when fully accepted (#2285)
  This PR fixes a bug in the BatchVerify action. When a draft model's proposal is fully accepted by the main model, there is an extra token which is already in the main model's KV cache but not in the draft model's KV cache. Prior to this PR, BatchVerify action does not feed this extra token into the draft model's KV cache, which causes size mismatch between the main model's KV cache and draft model's KV cache. This PR fixes this issue by adding an additional BatchDecode step for the requests whose draft proposals are fully accepted by the main model.
* Update engine.cc
* [CMAKE][BUILD] Add config option to enable OpenCL Host ptr (#2287)
  [CMAKE][BUILD] Add user option to enable OpenCL Host ptr
* [Serving][Fix] Pass draft length when constructing draft action (#2291)
  This PR fixes a bug which does not pass the speculative decoding draft length to the draft generation stage.
* [Pass] Fix sampling func attachment to not read existing vocab size (#2292)
  This PR updates the AttachGPUSamplingFunc pass to make each sampling func have independent dynamic vocab size var. So we do not have to read the vocab size from the prefill function.
* [SLM] Introduce microsoft/Phi-3 (#2222)
  Introduce microsoft/Phi-3 from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
* [Eagle] Run additional decode for draft model when all proposals are accepted (#2294)
* [iOS] Introducing package CLI for iOS app packaging (#2297)
  This PR introduces the packaging CLI `mlc_llm package` which reads from a `mlc-package-config.json` and compiles model and prepares model/runtime libraries automatically. With this PR, we get rid of prebuilt model library dependency for iOS app build. Validated that the iOS build can work. iOS documentation is updated according to this latest change. The same flow is supposed to work for Android as well, while it still needs verification for Android app build.
* Increase the timeout in PopenServer (#2298)
* [LLM-CHAT] Enable gpu softmax for penality softmax (#2288)
  1. Avoid the cpu softmax for different penality config by having copy sync to gpu and use gpu softmax.
  2. Disable decode token time counter for first token.
* [iOS][REFACTOR] Restructure the iOS folders (#2299)
  Move MLCChat to its own sub folder minor improvements to package.
* [KVCACHE][TIR] Improved tir schedule for decode tir page attention (#2289)
* [KVCACHE][TIR] Improved tir schedule for decode tir page attention
  1. Improved tir schedule of page attention (It improved 30% to this function).
  2. Enable missing dequant+matmul fusion in ph-2 model
* Updated K_local to QK_local
* Update kv_cache.py
* Increase max thread for android:adreno
* [Sampler] Remove unneeded output_prob_dist param (#2300)
* Enable cuda graph for batch_verify (#2304)
* [Android] Introducing mlc4j and app packaging (#2305)
  This PR lifts the existing `library` of android app into a standalone `mlc4j` directory, which can be referenced by android app at any location. On the app side, this PR moves the android app into a subfolder `MLCChat` which itself is a well-formed Android app. This folder contains two core files for app build:
  * `MLCChat/mlc-package-config.json` the config file that specifies the models to build into the app.
  * `MLCChat/prepare_package.py` the Python script that helps automatically prepare/build mlc4j and model libraries.
  This PR also updates the android app documentation to reflect this latest change.
* [DOCS] Minor cleanup (#2308)
  Shorten titles so they fit into one line of navbar, add mention of jit cache. Remote old project overview
* [DOCS] Update android doc (#2309)
  Avoid showing full tree and mention what the dist/lib/mlc4j stands for
* [DOCS] Update android doc (#2310)
  Avoid showing full tree and mention what the dist/lib/mlc4j stands for
  Avoid python3 instead directly use python, since python3 sometimes will points to system python.
* [SLM] Support BERT architecture. Implement a text embedding module (#2249)
* [Serving] Log batch size in NVTX (#2312)
* [Model] Removing unnecessary reshapes in get_logits (#2314)
* Skip cublas dispatch for single batch (#2315)
* Auto updated submodule references
* [DOCS] Remove mention of legacy modules (#2318)
  This PR removes mention of legacy modules and prebuilt in favor of JIT.
* [Android] Add `-j` option to cmake build (#2321)
  This PR adds the `-j` option to cmake build to parallelize the build job over CPU cores.
* [DOCS] More clear android instruction (#2327)
  This PR sets a more clear instruction for android JDK setup
* [Serving] Refactor to consolidate new request prefill (#2329)
* [iOS] Make MLCEngine input to take in structured data (#2330)
  This PR modifies the MLCEngine chatCompletion to take in structured data.
  Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
* [REFACTOR] Refactor JSONFFI Conv template (#2331)
  This PR refactors JSONFFI conv template to use immutable processing. This helps to prevent bugs from multiple requests and concurrent access to the conversation data structure. It also reduces the need to deep copy the struct.
* [Eagle] Fix the requests for additional decode in eagle verify (#2336)
* [Serving][Grammar] Refactor GrammarStateMatcher and support LLaMA-3 (#2335)
  This PR refactors GrammarStateMatcher and support the LLaMA-3 tokenizer. Common tokenizers, including Phi-2, Gemma, LLaMA-2, etc. are also supported. The performance is optimized for LLaMA-3 tokenizer since its token table has size 128k, much larger than LLaMA-2 tokenizer.
  These changes are introduced to the grammar library:
  1. Introduce ByteString rule expression and simplify CharacterClass and CharacterClassStar
  2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and mutating grammar rules
  3. Now GrammarStateMatcherBase, the internally impl of the GrammarStateMatcher, accepts char by char, instead of codepoint by codepoint. So it supports any valid UTF-8 string, even if the token is not a complete codepoint.
  4. Support lookahead assertion for rules to specify the rule must be followed by a sequence. This can eliminate some uncertain tokens in preprocessing.
  Minor changes:
  1. Introduce template hash function HashCombine
  2. Update the UTF8 encoding handling functions
  Performance:
  1. For JSON, finding mask requires <30us on 5900X with single thread. The uncertain tokens is <30 in most cases.
  2. For JSON schema, finding mask requires <30us on 5900X with single thread. The uncertain tokens is <30 in most cases.
* [DebugChat] Fix DebugChat softmax function and save logits to debug folder (#2342)
* [DebugChat] Fix DebugChat softmax function and save logits to debug folder
* Fix lint
* [Serving] Add Medusa speculative decoding (#2337)
* [Serving] Add Medusa speculative decoding
* Fix cublas offloading (#2343)
* Add false for arg worker0_only in disco.empty (#2344)
* Auto updated submodule references
* [JSONFFIEngine] Refactor device argument and request_stream_callback argument (#2334)
* 1. Refactor init_background_engine in JSONFFIEngine to use device_type and device_id arguments.
  2. request_stream_callback is called on each string of the array of strings.
* Calling callback on string of list of JSON dicts instead of each string of JSON dict multiple times
  ---------
  Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
* [Serving] Add reset_engine in debug_entrypoints (#2347)
* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue (#2358)
* [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue
* [JSON FFI] Example Android Application using JSON FFI Engine (#2322)
* pass str to callback and not List[str] add json ffif android example fix lint Refactor MLCEngineExample and MLCEngine.kt Use ChatCompletionMessageContent class ChatCompletionMessageContent: text and parts
* JSONFFIEngine: Cast request_stream_callback argument to std::string. Decode in Android as List<ChatCompletionStreamResponse>
  ---------
  Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
* [iOS] Update MLCEngine API to latest JSON FFI convention (#2359)
  This PR updates the MLCEngine API to latest JSON FFI convention.
* [JSONFFI] Fix JSONFFI conv template. Add unit tests (#2360)
* [Fix][Serving] Fix prefill chunk in interactive mode (#2363)
  This PR fixes a bug of prefill chunking in the interactive mode. The bug counts requests with remaining inputs as running requests which turns out disabling the prefill of the remaining inputs. This PR fixes by no longer counting requests with unfinished inputs as running requests for decode.
* [Fix][Serving] Respect sliding window size in config inference (#2364)
  This PR fixes the automatic engine config inference which did not respect the sliding window size, which led to memory usage higher than expected in the interactive mode for mistral model.
* [iOS] Add padding to app icon (#2365)
* [Serving] Fix the self-ref in engine (#2367)
  This PR fixes the self ref in engine and enable auto terminate in deleter.
* [Serving] Prefix Cache (#2295)
* [Serving] Prefix Cache
  This PR introduces the prefix cache into serving engine, to manage prefix and accelerate prefill process.
* [Fix] Use static_cast for `.size()` for safety (#2369)
  This PR updates the occurences of `.size() - 1` with static_cast to avoid the integer underflow.
* [Serving] Sliding-window-aware request prefill (#2370)
  This PR supports the prefill conditions with sliding window awareness. Now when the input length is larger than the sliding window size, the prefill can still be processed without error.
* [iOS] Update MLCSwift to fully follow OAI style. (#2371)
  It also refactors the MLCSwift to be follow engine.chat.completions.create style as per other OpenAI APIs. It also removes the cyclic dependencies in the closure capture by having a separate EngineState
* Add nvtx in logic update (#2372)
* [Test] Use HF model for JIT as much as possible (#2373)
  This PR updates the test files to use JIT by default as much as possible, in order to make tests runnable out of the box. Of course, they can be locally tweaked to use local models. For Eagle/Llava/rwkv, given we don't have them delivered yet, they are kept as using local model lib now.
* [Fix] Fix prefix cache reset and forking logic (#2374)
  This PR refactors the reset logic in prefix cache and disable forking from sequences with sliding windows enabled.
* [CLI] Migrate CLI to use the new Engine (#2375)
* [CLI] Migrate CLI to use the new Engine
  This PR migrates the CLI to the new JSON FFI Engine. The resulting generation will be faster, we still need to ensure we can enable sliding window support when needed. Also Refactors JSONFFI Engine to be OpenAI compatible.
* Fix lint and remove bench which is stale
* [TESTING] Introduce testing util to manage models (#2377)
  This PR introduce a new env var MLC_TEST_MODEL_PATH to allow a list of model path specified for test model search purposes. If not found, an error message would appear and we auto skip test in both pytest and normal running settings. The path defaults to the cached HF path so as long as we run mlc_llm chat the model can be found. But we do not automatically download to avoid excessive networking in CI settings. Followup PR needed for remaining testcases
* [REFACTOR][Rename] MLC_LLM_SOURCE_DIR and TVM_SOURCE_DIR source directory env (#2378)
* [REFACTOR] Rename use MLC_LLM_SOURCE_DIR for source directory
  This PR updates to use MLC_LLM_SOURCE_DIR to specify the directory of mlc llm source directory. The reason for this update is that the term XXX_HOME was usually meant to be used in different scenarios in ML frameworks. For example, both torch and huggingface have TORCH_HOME and HF_HOME pointing to their local cache directory. The variable MLC_LLM_SOURCE_DIR is aligned with cmake naming convention (CMAKE_SOURCE_DIR). We will have followup PR to udpate MLC_CACHE_DIR to MLC_LLM_HOME, following the existing practices.
* Update env to point to TVM_SOURCE_DIR
* [REFACTOR][ENV] MLC_CACHE_DIR to MLC_LLM_HOME (#2379)
  This PR changes the MLC_CACHE_DIR env to MLC_LLM_HOME. This change aligns with most of the packages.
* [iOS] Switch MLC Chat to use MLCEngine (#2380)
  This PR switchs MLC Chat to use MLC Engine Also did a minor refactoring to make serve side more flexible in dealing with compile time overrides.
* [REFACTOR] Cleanup legacy code (#2381)
  This PR cleans up legacy code and reorgaizes some of the project structure.
  - Removed stale interface
  - Removed stale examples
  - Temp remove rust as it depends on chat module that we plan to phase out
  - Move embeddings to contrib(experimental)
* [Fix] Update prefix cache config (#2382)
  This PR updates the prefix cache config to prefix cache mode and prefix cache max number of recycling sequences. Also this PR adds the missing `final` keyword in member methods.
* [PREFIX-CACHE] Fix some issues with prefix cache (#2384)
  This PR fixes issues with prefix cache when used together with MLCEngine. It also fixes an issue when prefix_cache_max_num_recycling_seqs == 0
* [FIX] Typo on OpenAI Chat class in engine (#2385)
  This commit fixes a typo on JSONFFIEngine Python side.
* [Serving][Refactor] Metrics and stats for CLI (#2387)
  This PR introduces the `Metric` class for convenient metric update and management in MLC. The previous `EngineStats` class is renamed to `EngineMetrics` accordingly. This PR brings the metric support to JSONFFIEngine, and implements the `/stats` command in CLI. Besides, this PR
  * fixes a bug of time measurement when parallel generation exists.
  * aligns the metric names with LLMPerf (particularly, we now use `num_input_tokens`, `num_output_tokens`, `sum_num_input_tokens`, etc.)
  * measures the time of a single step of BatchDecode, a single step of draft generation in BatchDraft, and a single step of BatchVerify when the effective batch size is less than 64 (hardcoded as a constant as of now). This can help build the understanding of the performance of the key actions under a series of batch size.
* [REFACTOR] Organize metrics (#2390)
  This PR perform one round of reorganization of metrics into a centralized metrics header. Also updates the ChatState to include overrides that can be used in future cases to run chat test.
* [Fix] Avoid ref capture in prefix cache contruction (#2391)
  This PR fixes the prefix cache construction in Engine, which captured the references of models and thus led to the GPU memory unable to be freed when the Engine is destructed.
* [REFACTOR] Cleanup Metrics (#2392)
  This PR run another round of cleanup of metrics.
  - Remove less useful ones
  - Reorganize by labels in prometheus style
* [FIX] Fix mlc llm source dir argument (#2394)
  This PR fixes the mlc llm source dir argument in android packaging.
* [Fix] Fix the serialization of SpecDecodeMetrics (#2395)
  This commit fixes a bug when serializing SpecDecodeMetrics.
* [Fix] Update missing change in engine ffi func name (#2396)
  This PR updates the missange change in engine ffi func name from #2390.
* Auto updated submodule references
* [Fix] Fix no prefix cache (#2397)
  This PR fixes the no prefix cache, to avoid double adding of new sequence.
* add hasattr safecheck for MLCEngineBase (#2400)
  Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>
* [Refactor] Expose EngineConfig in engine constructor (#2399)
  This PR lifts the EngineConfig as one engine constructor, so that we can hide most less important arguments in EngineConfig, and thus focus the user attention to the few key arguments. `mlc_llm serve` CLI and PopenServer are updated accordingly. Documentation is updated accordingly.
* [REFACTOR] Introduce RequestMetrics and metrics endpoint (#2401)
  This PR introduces RequestMetrics to collect aggregated metrics for each request. We also introduces a prometheus end point. Finally, we fixed a cylic dependency in engine states.
* [Fix] Fix format issue of MLCEngineBase (#2402)
  This PR fixes a format issue caused by #2400.
* [FIX] fix comments in radix_tree.py (#2403)
  Seems function descriptions for `PagedRadixTree.add` and `PagedRadixTree.extend` are misleading. Fixed according to implementations in radix_tree.cc
* [Fix] Fix metric names in tests and static PrefixCacheModes (#2404)
  * This PR fixes the metric names referenced in tests which were not updated together with previous PRs.
  * This PR fixes the static PrefixCacheMode member introduced in #2397. The way of fix using the static class members is not correct, which essentially disables PrefixCache forever. This is because when checking the `mode` member of a PrefixCache instance, it is always the base class mode (which is `kDisabled`) being returned.
  * This PR also adds a missing header for chrono.
* [Op] Tree attention (#2376)
* [REFACTOR] Reorganize GenerationConfig DebugConfig and FFI (#2407)
  This PR reorganizes GenerationConfig, DebugConfig and FFI.
  - Internally, we now directly use the config object instead of json stream.
  - Request construction turns into engine side so it can make use of debug_config.
  - Ignore eos now moves to debug_config option.
  - Removes most string based re-export of gen conifg.
* [Fix] Fix vector OOB when no inputs can be prefilled in spec decode (#2408)
  This PR fixes an issue that causes vector index out of bound. This happens in speculative decoding, when an model can accept inputs while the other cannot. We still need to look into this inconsistency. Ideally all models should behave the same.
* [Fix] Update number of available pages after prefix cache free (#2409)
  This PR fixes an issue that causes the inconsistency of CanPrefill result from different models.
* [REFACTOR] Enable validation logic in GenerationConfig (#2411)
  This PR enables a centralized validation logic in GenerationConfig.
* [Chat] Support chat completion config override (#2412)
  This PR supports chat CLI with arguments override. Right now, arguments supported are: `top_p`, `temperature`, `presence_penalty`, `frequency_penalty`, `max_tokens`, `seed`, `stop`. This PR adds the corresponding support to the ChatCompletion request parsing for JSONFFIEngine.
* Change name RedixPage -> RadixPage in RadixTree.cc (#2413)
  change name RedixPage -> RadixPage
* [Fix] Fix ignore_eos support (#2414)
  The ignore_eos support was broken during recent refactors. This PR fixes the support.
* [Test][Refactor] Update tests to use require_test_model (#2415)
  This PR updates tests to use the `require_test_model` testing util for better out-of-box testing while avoid automatic downloading. Some tests that require manually model compilation are kept in the old test style (e.g., with model "llava", "eagle", etc.). This PR also fixes some typing issues suggested by mypy.
* [Serving] Enable GPU Sampling (#2368)
  enable gpu sampling
* [REFACTOR] Support latest include_usage and DebugOptions (#2417)
  This PR refactors the mechanism of request end detection and also attaches the request metrics in response usage field. RequestResponse usage field:
  - include_usage can be passed to API. When include usage is on, metrics are now streamed back in the usage.extra
  - Changed debug_option parameter to extra_body, so they are fully compatible with OpenAI client
  - Support special requests in debug options, engine metrics are now streamed back via a special request
  We also change the FFI mechanism to detect response finish. Previously we keep track of number of stoppped streams. Now that the FFI always stream back the final chunk which have no choices and contains usage. We use the usage field to detect the final chunk. Code path are updated according. We also make Chat CLI a helper class that can be reused. iOS app now comes with stats support.
* [DOWNLOAD] MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES (#2421)
  This PR introduces support for MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES Allows reading from readonly cache besides MLC_LLM_HOME. Also introduces a domain subfolder in cached weights
* [REFACTOR] Rename MLC_LLM_READONLY_WEIGHT_CACHES (#2423)
  This PR renames MLC_LLM_READONLY_WEIGHT_CACHES=>MLC_LLM_READONLY_WEIGHT_CACHE to be consistent with rest of env var convention
* [Tokenizer] Auto-detect TokenizerInfo from tokenizer.json (#2416)
  This PR adds a new `TokenizerInfo` class that contains useful information about the tokenizer during generation. It is auto-detected from tokenizer.json if it exists. Otherwise it raises a warning and uses the default value (byte fallback tokenizer, not prepend/strip space).
* [REFACTOR] Remove dependencies on legacy chat_module (#2424)
  This PR removes the all dependencies from chat_module.py So we can prepare for deprecating this module. This PR refactors and moves MLCChatConfig to protocol. This helps us to consolidate all API spec and config files under the protocol folder. The protocol folder mainly keeps the data schema and metadata, most of the actions(gen_config) are still kept in their current location.
* [REFACTOR] Terminology download=>download_cache (#2425)
  This PR renames download to download_cache for better clarity.
* [REFACTOR] Move GenerationConfig to protocol (#2427)
  This PR moves GenerationConfig to protocol. As we move towards OAI style API GenerationConfig becomes more like an internal API. This change reflects that and also removes duplicated definition of ResponseFormat and DebugConfig
* Update README.md
* [site] Add hero section to website (#2430)
* [Compile] Skip CUDA graph rewrite when target is not CUDA (#2433)
  This PR rewrites the CUDA graph compiler flag to false when the backend is not CUDA. Otherwise, CUDA graph may be enabled for other backends and causes result error.
* [DOCS] Simplify read me (#2435)
  This PR simplifies readme so most attention can be pointed to our docs page.
* [DOCS] Update title to focus on engine feature
  This commit updates the docs to focus on engine feature
* [Metadata] Remove stale KV cache size (#2434)
  This PR removes the KV cache size from model metadata. This is because we have fully switched to the new compilation flow with PagedKVCache and MLCEngine as backend, where KV cache size is runtime dependent and will be estimated at runtime.
* [iOS] Update the MLCSwift APIs to async (#2436)
  This PR updates all MLCSwift APIs to be async for consistency purposes.
* [Android] Switch MLC Chat to use MLCEngine (#2410)
* [Android] Switch MLC Chat to use MLCEngine
* [Serving] Add helper function - TotalDetectGlobalMemory
* [iOS] Remove Legacy ChatModule (#2437)
  This PR removes the legacy chat module in iOS.
* [Delivery] Update model delivery script to support specifying the output and hf directory (#2431)
* Update model delivery script to support specifying the output directory
* [Android] Remove Legacy ChatModule (#2438)
* [Refactor] Remove ChatModule (#2440)
  This PR formally removes ChatModule from the codebase, given all the frontends have fully switched to use MLCEngine.
* [Fix][REST] Fix usage-related server tests (#2441)
  This PR fixes some server tests which were broken due to recent refactors.
* [Site] Enlarge hero image in small screens
* Fix lint
* [ANDROID] Patches to enable windows usescase (#2443)
  This PR add a few patches to enable build under windows
* [DOCS] Guides for android on windows (#2444)
* [DOCS] mention git-lfs (#2445)
* Fix Llama-3 conversation template. Add unit test (#2442)
* Fix Llama-3 conversation template. Add unit test
* [Grammar][Wasm] Update new grammar to wasm runtime (#2446)
* [Model] Use float32 for RoPE calculation (#2449)
  This PR updates the RoPE calculation to use float32 for multiplication and addition. This is motivated by the observation that calculating RoPE in float16 may cause accuracy issue.
* [LogitProcessor] Use min float value as the mask value (#2451)
  This PR updates the mask values in LogitProcessor to the min value of float32. Prior to this PR it was -1e10. This update is the safest for softmax as long as the masking is always the last step in logit processor.
* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2450)
  This PR sets the parameter `by_alias=True` for all the `model_dump_json` of pydantic classes, so that aliases are always respected.
* [Protocol] Use `by_alias=True` when dumping pydantic classes (#2452)
  This PR sets the parameter `by_alias=True` for all the `model_dump` of pydantic classes, so that aliases are always respected.
* [DOCS] Updates the URL of the Android APK (#2453)
* Auto updated submodule references
* [Fix][Phi3] Add `</s>` as stop token for phi3 (#2455)
  [Fix][Phi3] Add </s> as stop token for phi3
* [Site] Add GitHub link to hero section
* Update README.md
* [Hermes2] Add conv template for Hermes2-Pro-Llama3 (#2457)
* [Compile] Add max_batch_size to metadata (#2463)
  This PR adds the max_batch_size at compile time to metadata for runtime to read. **Note.** This may be a breaking change for the compiled model libraries. And please set environment variable `MLC_JIT_POLICY=REDO` to recompile the models with JIT, or manually recompile the model libraries. This PR also adds the max_batch_size to qwen2.
* [REFACTOR] Re-organize the modules after transition to MLCEngine (#2464)
  This PR reorganizes the modules after transition to MLCEngine.
  - grammar is a root level module
  - streamers and tokenizers are in the tokenizers namespace
  - conversation_template is module
  Testcases are restructured accordingly. We also removed some of the stale files.
* [Serving] Add ICHECK for running batch size (#2465)
  This PR adds ICHECK to make sure that the running batch size in BatchDecode and BatchDraft does not exceed the `max_num_sequence` as in the engine config. The prefill actions should keep this invariant. And the ICHECKs added mainly serve for internal error detection and report purpose.
* Auto updated submodule references
* [TEST] Start to categorize tests (#2466)
* [TEST] Start to categorize tests
  This PR add test categorization via pytestmark For now we have five categories of tests unittest op_correctness engine endpoint uncategorized We should start to fix some of the broken tests and move them to these categories. When possible we should cover a bug under unittest, since they get run every PR, as part of the CI.
* Implemented FP8 calibration (#2454)
* Implemented FP8 calibration
* update
* add transformers
* Use encode_batch
  ---------
  Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [CI] Update CUDA build script with FlashInfer options (#2469)
  This PR updates the CI CUDA build script with FlashInfer compile options after a recent bump of FlashInfer version.
* [Serving] Use preferred host memory for host NDArrays (#2468)
  This PR updates the host memory in model, logit processor and GPUSampler with the support of preferred host device, so that for CUDA and ROCm the pinned memory will be used for the host arrays, which may be faster than the default CPU memory during copying.
* [TEST] Temp disable UT stage
  This PR temp disables the UT stage for now before we can get a fix on the docker execution
* [CUDA] Turn on cuda graph at O2 (#2467)
* [CI] Enable GPU env in CI (#2476)
* [CI] Enable GPU env in CI
  This PR enables GPU env in ci docker/bash.sh
* remove dep on tvm testing plugin
* [CMake] Update config.cmake generation script (#2478)
  This PR updates the config.cmake generation script to provide the FlashInfer compile options explicitly.
* [TEST] MockEchoEngine (#2479)
  This PR introduces a MockEchoEngine that echos the inputs prompt and the generation conflig(as part of usage.extra). The engine can be used to create unit-test cases that covers engine API handling. Note that mock tests cannot replace real engine tests.
* Auto updated submodule references * [Fix] Fix JSONFFI MemoryBufferStream after dmlc bump (#2480) A recent bump in dmlc has changed the `Write` signature of `dmlc::Stream`. This commit updates the codebase to follow the upstream change. * [JSON-FFI] Enable n generation and pass in json schema (#2481) This PR enables n generation and pass in json schema in JSON FFI. * Refactor model delivery script to use pydantic (#2482) * Fix tokenizers encode batch (#2484) * [Bugfix] Fix delivered log issue in delivery cli (#2489) * Support Qwen2-MoE Architecture (#2089) * [3rdparty] Bump tokenizers-cpp to include HF tokenizers bump (#2490) This PR bumps the 3rdparty tokenizers-cpp to include the HuggingFace tokenizers package bump, in order to support some latest models such as Mistral v0.3. * [Bench] Add mlc bench (#2474) This PR adds an initial pass of the bench infra * Auto updated submodule references * Enable n-sampling for Medusa spec decoding (#2495) * Fix get_num_available_pages for model without kv cache * Enable n-sampling for Medusa spec decoding * [CONFIG] Remove mean_gen_len from the config (#2493) This PR removes legacy mean_gen_len from the config * Update ios android docs (#2497) * [Bench] Add seed to __init__ and some minor change (#2496) * [Fix][Config] Max total sequence length overflow with sliding window (#2500) This PR fixes an issue which causes the int64 multiplication overflow when sliding window is enabled. * [Serving] PagedKVCache tree-attention integration (#2487) This PR integrates the recent support of tree-attention in PagedKVCache into the speculative decoding in MLC. Right now only chains are supported. Tree-based speculative decoding is on the project road map and we are planning to support it in recent future. * [Sampler] Enhance checks for whether FlashInfer is enabled (#2502) This PR improves the check in GPU sampler for whether FlashInfer is enabled. 
Previously we did not check the CUDA compute capability, which makes the GPU sampler not able to properly run on Colab where the T4 GPU has a compute version of 7.5 which FlashInfer does not support. With this PR, when the compute capability is less than 8.0, we will not use FlashInfer in GPU sampler. * [Android] Updates the default mode list and the APK link in the document (#2503) * [Android] Update default model list Update the default model list in Android to include the following models 1. Phi-3-mini-4k-instruct-q4f16_1-MLC 2. Llama-3-8B-Instruct-q3f16_1-MLC 3. Mistral-7B-Instruct-v0.3-q4f16_1-MLC * [DOCS] Updates the URL of the Android APK * [Fix] Fix the global func name of TokenizerDecode (#2514) This PR fixes the global func name for `TokenizerDecode`, which was not updated when adding the namespace `tokenizers`. * [Fix] Use the correct model to validate stream_options (#2508) * [Fix] Typo in docs/install/tvm.rst (#2507) Fix a typo in serve/engine.py * [FP8] Use f32 scale to enable better fusion (#2505) * [Metrics] Add ttft and itl to server metrics (#2510) * Add ttft and itl to server metrics * Fix ITL * Fix clang-format * Keep mobile and interface.chat untouched * [Model] Fix config detection for Mistral (#2504) The Mistral model has removed sliding window since its v0.2, while in MLC we always enable sliding window. This PR updates the config detection so that when sliding window is disabled, we turn to checking the context window size and make sure it is properly set. * [Fix] Provide a GetTokenId API for SampleResult (#2516) Currently we use `sampled_token_id.first` to find the sampled token id of a SampleResult object, which is obscure. This PR provides a `GetTokenId` API for SampleResult to get the sampled token id. This PR also updates the testing model path to include `./dist/`. * [Reapply][BUGFIX] Fix rare deadlock in threaded engine (#2429) (#2518) This PR reapplies #2429, which is missing in the main branch. 
Below is the original commit message: This PR fixes rare deadlock cases during engine unload/reload. Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> * [Fix] Fix metrics division by 0 (#2519) This PR fixes an issue of the per-request metrics, where division-by-0 may happen when the request does not run any decode step. The division-by-0 results in `inf`, which is then written into a JSON file. However, `inf` is not a valid number in the JSON grammar, so JSON parsers fail on any JSON string that contains an unquoted `inf`. * Corrected the folder path for Android Studio Project (#2520) Update android.rst Android project path corrected * Update tvm.rst * [iOS] Update model list (#2524) Update the model list of iOS in `mlc-package-config.json`. * [Android] Updates the order of mode list and the APK link in the document (#2526) [Android] Updates the default mode list and the APK link in the document 1. Qwen1.5-1.8B-Chat-q4f16_1-MLC * [Sampler] Skip top-p renormalization if top-p is 1 in CPUSampler (#2528) This PR adds a shortcut in the top-p renormalization in CPU sampler, which skips the renormalization when top-p is 1.0. * [Docs] Rename javascript.rst to webllm.rst (#2531) * [Conv] Add tinyLlama v1.0 conv template (#2530) * [Conv] Add tinyLlama v1.0 conv template * Fix lint * [iOS] correct mistral q3 url and handle screen switch off (#2529) This PR corrects the mistral q3 url. This PR also adds a handler for screen switch off. For now we just reset if the app is generating; we will update to pause/resume once they are supported. * [Grammar] Fix include protection and paths in docstring (#2515) Following #2464, this PR fixes the include protection in the header files and the paths in the docstrings of the header files. This PR also fixes tests that were broken after the refactor.
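For the division-by-0 fix in #2519 above, the failure mode can be reproduced in a few lines of Python. This is only a sketch of the problem and of one possible guard: the helper `safe_rate` and the metric key are illustrative names, and the actual fix in the engine may be implemented differently.

```python
import json
from typing import Optional


def safe_rate(num_tokens: int, elapsed_s: float) -> Optional[float]:
    """Return tokens per second, or None when no time was recorded."""
    if elapsed_s <= 0:
        return None  # serialized as JSON null rather than inf
    return num_tokens / elapsed_s


# Python's json module emits the non-standard token Infinity by default.
unsafe = json.dumps({"decode_tokens_per_s": float("inf")})

# allow_nan=False makes the encoder reject inf/nan instead of emitting them.
try:
    json.dumps({"decode_tokens_per_s": float("inf")}, allow_nan=False)
    rejected = False
except ValueError:
    rejected = True

# A request with no decode step yields null, which every JSON parser accepts.
safe = json.dumps({"decode_tokens_per_s": safe_rate(0, 0.0)})
print(unsafe, rejected, safe)
```

Guarding the division at the source keeps the emitted file valid JSON regardless of which parser consumes it later.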
* [Tokenizer][Fix] Fix SegFault when analyzing tokenizers without tokenizer.json (#2532) Previously the tokenizer would segfault when analyzing a tokenizer that did not have a tokenizer.json file. This was because `TokenizerInfo()` was called, which creates a null object. This PR fixes this problem. * [Serving] Use stop strs and token ids for completions (#2534) This PR applies the stop strings and stop token ids defined in the conversation template to the raw text completions, so that whenever the model outputs a stop token id or stop string, the raw generation can stop. Prior to this commit, the raw text generation never stopped when max tokens was not given. This commit helps reduce the frequency of such events. Nevertheless, if the model does not output a stop string/token id, the generation will still not stop. * [Serving] Support tensor parallel shards override in command line (#2533) This PR supports the command line overrides for model JIT compilation. This is especially helpful for enabling tensor parallelism out of box, so people don't need to manually tweak `mlc-chat-config.json` to use tensor parallelism. * Add tie_word_embedding option for Qwen2 model (#2535) * [Bench] Defaults to aiohttp client, add ServerMetrics (#2527) * [Bench] Defaults to aiohttp client * Add ServerMetrics to summary * Remove duplicate servermetric def * [Android] Remove var capture in TVM_SOURCE_DIR (#2538) This PR fixes the TVM_SOURCE_DIR parsing issue on Windows. * [Fix] Fix inconsistent system prompt handling (#2539) This PR fixes the conversation template of ChatML, whose system prompt ends with `<|im_end|>`. An inconsistent handling of system prompt between the JSONFFI side and the Python side is also corrected. * [Attention] Fix attn kernel for general GQA group size (#2543) This PR fixes the TIR prefill attention kernels to support a broader list of GQA group sizes.
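The stop-string behavior described in #2534 above can be pictured with a small post-hoc truncation helper. This is a simplified sketch on a finished string: `truncate_at_stop` is a hypothetical name, the stop strings are example values, and the real engine works on streamed tokens and also checks stop token ids, which this sketch does not model.

```python
from typing import List, Tuple


def truncate_at_stop(text: str, stop_strs: List[str]) -> Tuple[str, bool]:
    """Cut `text` at the earliest stop string; report whether one was found."""
    # Positions of every stop string that actually occurs in the text.
    hits = [pos for pos in (text.find(s) for s in stop_strs if s) if pos != -1]
    if not hits:
        return text, False  # no stop string: generation would keep going
    return text[: min(hits)], True


stop_strs = ["</s>", "<|im_end|>"]  # example values, assumed for illustration
print(truncate_at_stop("Paris.<|im_end|>ignored</s>", stop_strs))
print(truncate_at_stop("no stop here", stop_strs))
```

The second call shows the caveat from the commit message: when no stop string appears, nothing ends the generation unless a max-token limit is set.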
* fix: typo error (#2544) * [Fix] Fix attn kernel build issue (#2545) This PR fixes TIR issues in the attn kernels. * [iOS] Add Qwen2 support (#2547) This PR adds Qwen2 support to MLC Chat. * [Android] Add Qwen2 support (#2548) * [Android] Escape backslashes and quotation marks (#2546) This commit escapes the backslashes and quotation marks in Android package build. * [EngineConfig] Add override options (#2550) This PR introduces override options to the Python side EngineConfig so that they'll be reflected in JIT model compilation. * [Site] Update link to webllm * [Site] Update heading * [Preset] Add model preset for model delivery (#2553) [Preset] Add model preset for wasm delivery * Update docs to remove mention of older models (#2557) * [Docs] Fix typo in mlc_llm chat command (#2560) * Fix compilation for gcc 13.2 (#2561) * [Tokenizer] Priorize HuggingFace/SentencePiece over ByteLevelBPE (#2559) This PR updates the tokenizer load logic, so that we prioritize the use of HuggingFace and SentencePiece tokenizers over the ByteLevelBPE tokenizer. This fixes the issue that the token `<im_start>` in the Qwen model is tokenized into multiple tokens when the ByteLevelBPE tokenizer is chosen. * [Serving][Grammar] Jump-forward decoding (#2551) [Serve][Grammar] Jump-forward decoding This PR supports the jump-forward decoding as described in <https://lmsys.org/blog/2024-02-05-compressed-fsm/>. The jump-forward decoding uses the grammar constraint to predict the next output string and tokenizes the string into tokens, and therefore speeds up the decoding. This PR implements these optimizations to ensure the output quality: - Retokenization in jumpforward: Tokenize the last k tokens as a string with the predicted string appended. If the tokenization result differs from the old tokens, roll back these tokens and accept the new ones. - Retokenization in decoding: Tokenize the last k tokens as a string with the decoded token appended.
This will happen in decoding stage when the jumpforward decoding happens in the last round. If the result differs, the old tokens will be rolled back. - Skip prefix tokens in jumpforward: We call tokens that is a prefix of another token as prefix tokens. If the last token from jumpforward is a prefix token, it's highly possible that it will be rolled back in the next decode stage, as it may be combined with the decoded token. It also effects the output distribution as such pattern is rare in training data. Therefore, we skip the last prefix token in jumpforward decoding. This PR also includes the following changes: - Add several metrics for request and engine, especially about the jumpforward decoding - Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from early return Performance and benchmark: Schema(Pydantic): ``` class Product(BaseModel): product_id: int is_available: bool price: float is_featured: Literal[True] category: Literal["Electronics", "Clothing", "Food"] tags: List[str] stock: Dict[str, int] ``` Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G Results: ``` Jump forward: False, Batch: 1 Engine metrics: { "engine_decode_time_sum": 0.4988938220000001, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 66, "decode_tokens_sum": 66, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 132.2926785010378, } Jump forward: True, Batch: 1 Engine metrics: { "engine_decode_time_sum": 0.37242740600000007, "engine_jump_forward_time_sum": 0.027989265000000006, "completion_tokens_sum": 68, "decode_tokens_sum": 68, "jump_forward_tokens_sum": 28, "decode_tokens_per_s": 182.58591850246378, } Jump forward: False, Batch: 4 Engine metrics: { "engine_decode_time_sum": 0.9106805410000002, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 261, "decode_tokens_sum": 261, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 286.5988546470984, } Jump forward: True, Batch: 4 Engine metrics: { "engine_decode_time_sum": 0.6843025599999999, 
"engine_jump_forward_time_sum": 0.028089531999999997, "completion_tokens_sum": 266, "decode_tokens_sum": 266, "jump_forward_tokens_sum": 112, "decode_tokens_per_s": 388.71694415405966, } Jump forward: False, Batch: 8 Engine metrics: { "engine_decode_time_sum": 1.62462493, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 538, "decode_tokens_sum": 538, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 331.1533573475325, } Jump forward: True, Batch: 8 Engine metrics: { "engine_decode_time_sum": 1.0509048310000002, "engine_jump_forward_time_sum": 0.027971332000000022, "completion_tokens_sum": 525, "decode_tokens_sum": 525, "jump_forward_tokens_sum": 224, "decode_tokens_per_s": 499.5694990767436, } Jump forward: False, Batch: 16 Engine metrics: { "engine_decode_time_sum": 2.317279175, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 1068, "decode_tokens_sum": 1068, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 460.8853398080531, } Jump forward: True, Batch: 16 Engine metrics: { "engine_decode_time_sum": 1.3962938819999997, "engine_jump_forward_time_sum": 0.030129287999999994, "completion_tokens_sum": 1059, "decode_tokens_sum": 1059, "jump_forward_tokens_sum": 448, "decode_tokens_per_s": 758.4363246533227, } ``` * [Delivery] Update model delivery script (#2565) Some improvements of the delivery script: - provide different overrides for different quantization. e.g. we can change prefill chunk size for q0/q3/q4 - rerun gen config only if only conv_template changes - do NOT recreate HF repo when the repo already exists. This will preserve commit history - dry-run validation * [Model] Enhance error reporting for invalid tensor-parallel settings (#2566) This PR enhances the error reporting for multi-GPU model compilation, so we can provide as many error reasons as possible before loading and running the models. 
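For the jump-forward benchmark in #2551 above, the speedup at each batch size can be read off by dividing the reported `decode_tokens_per_s` values. The short script below does only that arithmetic; the numbers are copied from the benchmark output, and the dictionary names are introduced here for illustration.

```python
# decode_tokens_per_s values copied from the benchmark output of #2551.
baseline = {
    1: 132.2926785010378,
    4: 286.5988546470984,
    8: 331.1533573475325,
    16: 460.8853398080531,
}
jump_forward = {
    1: 182.58591850246378,
    4: 388.71694415405966,
    8: 499.5694990767436,
    16: 758.4363246533227,
}

# Ratio of decode throughput with jump-forward enabled vs. disabled.
speedups = {batch: jump_forward[batch] / baseline[batch] for batch in baseline}
for batch, ratio in speedups.items():
    print(f"batch {batch:>2}: {ratio:.2f}x")
```

On these figures the ratio is roughly 1.38x at batch 1, 1.36x at batch 4, 1.51x at batch 8 and 1.65x at batch 16. The comparison uses the reported decode throughput only, so the separately listed `engine_jump_forward_time_sum` is not folded in.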
* [Serving] Apply tree structure in draft token verification (#2563) This adds the interface to draft token state and sampler to allow the tree structure to be recorded and used for verification * [Bench] Json mode bench (#2552) * [Bench] Json mode bench This PR refactors mlc bench to enable json mode in dataset. * upd * fix lint * [Model] Support Multi-GPU for Qwen-MoE model (#2573) This PR introduces the multi-GPU support for the Qwen-MoE model. Validated on 4090x2. * [Metrics] Add missing fields in `Reset` (#2574) This PR adds the missing fields that were not cleared up in `EngineMetrics::Reset`. * [Doc] Update WebLLM doc (#2578) Update documentation for WebLLM. Currently we only provide a high-level view for WebLLM runtime here, and refer users to the WebLLM repo README for more. The documentation focuses on adding their own model variant / model library for WebLLM. Will follow up with more thorough runtime documentation. * [Op] Top-4 implementation for MoE model (#2586) This PR introduces a top-4 kernel for MoE models (particularly for the Qwen-MoE at this moment). This is still a manual implementation and has some duplication with the existing top-2 kernel. In the future we'll consider leveraging meta-programming of TIR to unify the top-k kernel implementations. * [Model] Gemma 1.1 compatibility (#2594) This PR updates the Gemma config so that MLC can work properly with Gemma 1.1. * [Serving] Hybrid prefill (#2604) This PR adds the support for the hybrid prefill. So during the prefill engine action, it will do the decode for running requests as well. * Update quick_start.rst to fix broken links (#2607) Update quick_start.rst Fix broken links for convert weights and compile model pages * [Fix] Set the missed prefill finish time (#2613) This PR fixes a bug which fails to set the prefill finish time and results in metric error. * [Android] Reduce binary size (#2606) This PR updates the Android app to reduce the binary size.
Right now it can be reduced to 108MB when only building with the Phi-3-mini-4k model. * [Fix] Gemma hidden_activation compatibility (#2614) This PR fixes the Gemma config compatibility issue. * Update debug_compare (#2612) This PR fixes a bug of the debug_compare.py script. * [SLM] Add support for InternLM2 architecture (#2608) This commit introduces the InternLM2 model support. * [Fix] Prefix cache only enables sliding window on leaf sequence (#2615) This PR updates the prefix cache to align the logic of enabling sliding window. Now only leaf sequence is enabled sliding window attention. * [Android] Update include path for tvm runtime src (#2616) This PR updates the include directories for the Android app so that we can avoid using macros for src file include. * remove * works * seems working --------- Co-authored-by: Rick Zhou <rickzhoucmu@gmail.com> Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Wei Tao <1136862851@qq.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com> Co-authored-by: Yixin Dong <ubospica@gmail.com> Co-authored-by: Kevin_Xiong <kevin_xiong1997@outlook.com> Co-authored-by: zifeitong <zifeitong@gmail.com> Co-authored-by: Yong Wu <yongcale@gmail.com> Co-authored-by: Animesh Bohara <ani.bohara@gmail.com> Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> Co-authored-by: krishnaraj36 <quic_kvegiraj@quicinc.com> Co-authored-by: Mengshiun Yu <mengshyu@gmail.com> Co-authored-by: Git bot <bot@noreply.github.com> Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com> Co-authored-by: Nestor Qin <imba.qxy@gmail.com> Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> Co-authored-by: Faolain <Faolain@users.noreply.github.com> Co-authored-by: Bodhi <3882561+BodhiHu@users.noreply.github.com> Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com> Co-authored-by: Hyunsung Lee <ita9naiwa@gmail.com> 
Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu> Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: tqchen <tqchenml@gmail.com> Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com> Co-authored-by: rmstc <ramees025@gmail.com> Co-authored-by: KEL <me@iamkel.net> Co-authored-by: Andrey Malyshev <ma_elvin@mail.ru> Co-authored-by: Gunjan Dhanuka <d.gunjan@iitg.ac.in> Co-authored-by: Shushi Hong <820958424@qq.com>

… 2024-08-01) (#277) * [Eagle] Run additional decode for draft model when all proposals are accepted (#2294) * [iOS] Introducing package CLI for iOS app packaging (#2297) This PR introduces the packaging CLI `mlc_llm package` which reads from a `mlc-package-config.json` and compiles model and prepares model/runtime libraries automatically. With this PR, we get rid of prebuilt model library dependency for iOS app build. Validated that the iOS build can work. iOS documentation is updated according to this latest change. The same flow is supposed to work for Android as well, while it still needs verification for Android app build. * Increase the timeout in PopenServer (#2298) * [LLM-CHAT] Enable gpu softmax for penality softmax (#2288) 1. Avoid the cpu softmax for different penality config by having copy sync to gpu and use gpu softmax. 2. Disable decode token time counter for first token. * [iOS][REFACTOR] Restructure the iOS folders (#2299) Move MLCChat to its own sub folder minor improvements to package. * [KVCACHE][TIR] Improved tir schedule for decode tir page attention (#2289) * [KVCACHE][TIR] Improved tir schedule for decode tir page attention 1. Improved tir schedule of page attention (It improved 30% to this function). 2. Enable missing dequant+matmul fusion in ph-2 model * Updated K_local to QK_local * Update kv_cache.py * Increase max thread for android:adreno * [Sampler] Remove unneeded output_prob_dist param (#2300) * Enable cuda graph for batch_verify (#2304) * [Android] Introducing mlc4j and app packaging (#2305) This PR lifts the existing `library` of android app into a standalone `mlc4j` directory, which can be referenced by android app at any location. On the app side, this PR moves the android app into a subfolder `MLCChat` which itself is a well-formed Android app. This folder contains two core files for app build: * `MLCChat/mlc-package-config.json` the config file that specifies the models to build into the app. 
* `MLCChat/prepare_package.py` the Python script that helps automatically prepare/build mlc4j and model libraries. This PR also updates the android app documentation to reflect this latest change. * [DOCS] Minor cleanup (#2308) Shorten titles so they fit into one line of navbar, add mention of jit cache. Remote old project overview * [DOCS] Update android doc (#2309) Avoid showing full tree and mention what the dist/lib/mlc4j stands for * [DOCS] Update android doc (#2310) Avoid showing full tree and mention what the dist/lib/mlc4j stands for Avoid python3 instead directly use python, since python3 sometimes will points to system python. * [SLM] Support BERT architecture. Implement a text embedding module (#2249) * [Serving] Log batch size in NVTX (#2312) * [Model] Removing unnecessary reshapes in get_logits (#2314) * Skip cublas dispatch for single batch (#2315) * Auto updated submodule references * [DOCS] Remove mention of legacy modules (#2318) This PR removes mention of legacy modules and prebuilt in favor of JIT. * [Android] Add `-j` option to cmake build (#2321) This PR adds the `-j` option to cmake build to parallelize the build job over CPU cores. * [DOCS] More clear android instruction (#2327) This PR sets a more clear instruction for android JDK setup * [Serving] Refactor to consolidate new request prefill (#2329) * [iOS] Make MLCEngine input to take in structured data (#2330) This PR modifies the MLCEngine chatCompletion to take in structured data. Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com> * [REFACTOR] Refactor JSONFFI Conv template (#2331) This PR refactors JSONFFI conv template to use immutable processing. This helps to prevent bugs from multiple requests and concurrent access to the conversation data structure. It also reduces the need to deep copy the struct. 
* [Eagle] Fix the requests for additional decode in eagle verify (#2336) * [Serving][Grammar] Refactor GrammarStateMatcher and support LLaMA-3 (#2335) This PR refactors GrammarStateMatcher and supports the LLaMA-3 tokenizer. Common tokenizers, including Phi-2, Gemma, LLaMA-2, etc. are also supported. The performance is optimized for the LLaMA-3 tokenizer since its token table has size 128k, much larger than the LLaMA-2 tokenizer. These changes are introduced to the grammar library: 1. Introduce ByteString rule expression and simplify CharacterClass and CharacterClassStar 2. Refactor BNFGrammarVisitor and BNFGrammarMutator for visiting and mutating grammar rules 3. Now GrammarStateMatcherBase, the internal implementation of the GrammarStateMatcher, accepts char by char, instead of codepoint by codepoint. So it supports any valid UTF-8 string, even if the token is not a complete codepoint. 4. Support lookahead assertion for rules to specify the rule must be followed by a sequence. This can eliminate some uncertain tokens in preprocessing. Minor changes: 1. Introduce template hash function HashCombine 2. Update the UTF8 encoding handling functions Performance: 1. For JSON, finding mask requires <30us on 5900X with single thread. The number of uncertain tokens is <30 in most cases. 2. For JSON schema, finding mask requires <30us on 5900X with single thread. The number of uncertain tokens is <30 in most cases. * [DebugChat] Fix DebugChat softmax function and save logits to debug folder (#2342) * [DebugChat] Fix DebugChat softmax function and save logits to debug folder * Fix lint * [Serving] Add Medusa speculative decoding (#2337) * [Serving] Add Medusa speculative decoding * Fix cublas offloading (#2343) * Add false for arg worker0_only in disco.empty (#2344) * Auto updated submodule references * [JSONFFIEngine] Refactor device argument and request_stream_callback argument (#2334) * 1.
Refactor init_background_engine in JSONFFIEngine to use device_type and device_id arguments. 2. request_stream_callback is called on each string of the array of strings. * Calling callback on string of list of JSON dicts instead of each string of JSON dict multiple times --------- Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> * [Serving] Add reset_engine in debug_entrypoints (#2347) * [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue (#2358) * [Bugfix] Make sequence_length dtype int64 in EngineConfig. Fix Mistral engine serving issue * [JSON FFI] Example Android Application using JSON FFI Engine (#2322) * pass str to callback and not List[str] add json ffif android example fix lint Refactor MLCEngineExample and MLCEngine.kt Use ChatCompletionMessageContent class ChatCompletionMessageContent: text and parts * JSONFFIEngine: Cast request_stream_callback argument to std::string. Decode in Android as List<ChatCompletionStreamResponse> --------- Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu> * [iOS] Update MLCEngine API to latest JSON FFI convention (#2359) This PR updates the MLCEngine API to latest JSON FFI convention. * [JSONFFI] Fix JSONFFI conv template. Add unit tests (#2360) * [Fix][Serving] Fix prefill chunk in interactive mode (#2363) This PR fixes a bug of prefill chunking in the interactive mode. The bug counts requests with remaining inputs as running requests which turns out disabling the prefill of the remaining inputs. This PR fixes by no longer counting requests with unfinished inputs as running requests for decode. * [Fix][Serving] Respect sliding window size in config inference (#2364) This PR fixes the automatic engine config inference which did not respect the sliding window size, which led to memory usage higher than expected in the interactive mode for mistral model. 
* [iOS] Add padding to app icon (#2365) * [Serving] Fix the self-ref in engine (#2367) This PR fixes the self ref in engine and enable auto terminate in deleter. * [Serving] Prefix Cache (#2295) * [Serving] Prefix Cache This PR introduces the prefix cache into serving engine, to manage prefix and accelerate prefill process. * [Fix] Use static_cast for `.size()` for safety (#2369) This PR updates the occurences of `.size() - 1` with static_cast to avoid the integer underflow. * [Serving] Sliding-window-aware request prefill (#2370) This PR supports the prefill conditions with sliding window awareness. Now when the input length is larger than the sliding window size, the prefill can still be processed without error. * [iOS] Update MLCSwift to fully follow OAI style. (#2371) It also refactors the MLCSwift to be follow engine.chat.completions.create style as per other OpenAI APIs. It also removes the cyclic dependencies in the closure capture by having a separate EngineState * Add nvtx in logic update (#2372) * [Test] Use HF model for JIT as much as possible (#2373) This PR updates the test files to use JIT by default as much as possible, in order to make tests runnable out of the box. Of course, they can be locally tweaked to use local models. For Eagle/Llava/rwkv, given we don't have them delivered yet, they are kept as using local model lib now. * [Fix] Fix prefix cache reset and forking logic (#2374) This PR refactors the reset logic in prefix cache and disable forking from sequences with sliding windows enabled. * [CLI] Migrate CLI to use the new Engine (#2375) * [CLI] Migrate CLI to use the new Engine This PR migrates the CLI to the new JSON FFI Engine. The resulting generation will be faster, we still need to ensure we can enable sliding window support when needed. Also Refactors JSONFFI Engine to be OpenAI compatible. 
* Fix lint and remove bench which is stale * [TESTING] Introduce testing util to manage models (#2377) This PR introduces a new env var MLC_TEST_MODEL_PATH that allows a list of model paths to be specified for test model search purposes. If a model is not found, an error message appears and we auto skip the test in both pytest and normal running settings. The path defaults to the cached HF path, so as long as we run mlc_llm chat the model can be found. But we do not automatically download to avoid excessive networking in CI settings. Followup PR needed for remaining testcases * [REFACTOR][Rename] MLC_LLM_SOURCE_DIR and TVM_SOURCE_DIR source directory env (#2378) * [REFACTOR] Rename use MLC_LLM_SOURCE_DIR for source directory This PR updates to use MLC_LLM_SOURCE_DIR to specify the mlc llm source directory. The reason for this update is that the term XXX_HOME is usually meant to be used in different scenarios in ML frameworks. For example, torch and huggingface have TORCH_HOME and HF_HOME, respectively, pointing to their local cache directories. The variable MLC_LLM_SOURCE_DIR is aligned with cmake naming convention (CMAKE_SOURCE_DIR). We will have a followup PR to update MLC_CACHE_DIR to MLC_LLM_HOME, following the existing practices. * Update env to point to TVM_SOURCE_DIR * [REFACTOR][ENV] MLC_CACHE_DIR to MLC_LLM_HOME (#2379) This PR changes the MLC_CACHE_DIR env to MLC_LLM_HOME. This change aligns with most of the packages. * [iOS] Switch MLC Chat to use MLCEngine (#2380) This PR switches MLC Chat to use MLC Engine. Also did a minor refactoring to make serve side more flexible in dealing with compile time overrides. * [REFACTOR] Cleanup legacy code (#2381) This PR cleans up legacy code and reorganizes some of the project structure.
- Removed stale interface - Removed stale examples - Temp remove rust as it depends on chat module that we plan to phase out - Move embeddings to contrib(experimental) * [Fix] Update prefix cache config (#2382) This PR updates the prefix cache config to prefix cache mode and prefix cache max number of recycling sequences. Also this PR adds the missing `final` keyword in member methods. * [PREFIX-CACHE] Fix some issues with prefix cache (#2384) This PR fixes issues with prefix cache when used together with MLCEngine. It also fixes an issue when prefix_cache_max_num_recycling_seqs == 0 * [FIX] Typo on OpenAI Chat class in engine (#2385) This commit fixes a typo on JSONFFIEngine Python side. * [Serving][Refactor] Metrics and stats for CLI (#2387) This PR introduces the `Metric` class for convenient metric update and management in MLC. The previous `EngineStats` class is renamed to `EngineMetrics` accordingly. This PR brings the metric support to JSONFFIEngine, and implements the `/stats` command in CLI. Besides, this PR * fixes a bug of time measurement when parallel generation exists. * aligns the metric names with LLMPerf (particularly, we now use `num_input_tokens`, `num_output_tokens`, `sum_num_input_tokens`, etc.) * measures the time of a single step of BatchDecode, a single step of draft generation in BatchDraft, and a single step of BatchVerify when the effective batch size is less than 64 (hardcoded as a constant as of now). This can help build the understanding of the performance of the key actions under a series of batch size. * [REFACTOR] Organize metrics (#2390) This PR perform one round of reorganization of metrics into a centralized metrics header. Also updates the ChatState to include overrides that can be used in future cases to run chat test. 
* [Fix] Avoid ref capture in prefix cache contruction (#2391) This PR fixes the prefix cache construction in Engine, which captured the references of models and thus prevented the GPU memory from being freed when the Engine is destructed. * [REFACTOR] Cleanup Metrics (#2392) This PR runs another round of cleanup of metrics. - Remove less useful ones - Reorganize by labels in prometheus style * [FIX] Fix mlc llm source dir argument (#2394) This PR fixes the mlc llm source dir argument in android packaging. * [Fix] Fix the serialization of SpecDecodeMetrics (#2395) This commit fixes a bug when serializing SpecDecodeMetrics. * [Fix] Update missing change in engine ffi func name (#2396) This PR updates the missing change in engine ffi func name from #2390. * Auto updated submodule references * [Fix] Fix no prefix cache (#2397) This PR fixes the no prefix cache, to avoid double adding of new sequence. * add hasattr safecheck for MLCEngineBase (#2400) Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com> * [Refactor] Expose EngineConfig in engine constructor (#2399) This PR lifts the EngineConfig as one engine constructor, so that we can hide most less important arguments in EngineConfig, and thus focus the user attention on the few key arguments. `mlc_llm serve` CLI and PopenServer are updated accordingly. Documentation is updated accordingly. * [REFACTOR] Introduce RequestMetrics and metrics endpoint (#2401) This PR introduces RequestMetrics to collect aggregated metrics for each request. We also introduce a prometheus endpoint. Finally, we fixed a cyclic dependency in engine states. * [Fix] Fix format issue of MLCEngineBase (#2402) This PR fixes a format issue caused by #2400. * [FIX] fix comments in radix_tree.py (#2403) Seems function descriptions for `PagedRadixTree.add` and `PagedRadixTree.extend` are misleading.
Fixed according to implementations in radix_tree.cc * [Fix] Fix metric names in tests and static PrefixCacheModes (#2404) * This PR fixes the metric names referenced in tests which were not updated together with previous PRs. * This PR fixes the static PrefixCacheMode member introduced in #2397. The way of fix using the static class members is not correct, which essentially disables PrefixCache forever. This is because when checking the `mode` member of a PrefixCache instance, it is always the base class mode (which is `kDisabled`) being returned. * This PR also adds a missing header for chrono. * [Op] Tree attention (#2376) * [REFACTOR] Reorganize GenerationConfig DebugConfig and FFI (#2407) This PR reorganizes GenerationConfig, DebugConfig and FFI. - Internally, we now directly use the config object instead of json stream. - Request construction turns into engine side so it can make use of debug_config. - Ignore eos now moves to debug_config option. - Removes most string based re-export of gen conifg. * [Fix] Fix vector OOB when no inputs can be prefilled in spec decode (#2408) This PR fixes an issue that causes vector index out of bound. This happens in speculative decoding, when an model can accept inputs while the other cannot. We still need to look into this inconsistency. Ideally all models should behave the same. * [Fix] Update number of available pages after prefix cache free (#2409) This PR fixes an issue that causes the inconsistency of CanPrefill result from different models. * [REFACTOR] Enable validation logic in GenerationConfig (#2411) This PR enables a centralized validation logic in GenerationConfig. * [Chat] Support chat completion config override (#2412) This PR supports chat CLI with arguments override. Right now, arguments supported are: `top_p`, `temperature`, `presence_penalty`, `frequency_penalty`, `max_tokens`, `seed`, `stop`. This PR adds the corresponding support to the ChatCompletion request parsing for JSONFFIEngine. 
* Change name RedixPage -> RadixPage in RadixTree.cc (#2413) change name RedixPage -> RadixPage * [Fix] Fix ignore_eos support (#2414) The ignore_eos support was broken during recent refactors. This PR fixes the support. * [Test][Refactor] Update tests to use require_test_model (#2415) This PR updates tests to use the `require_test_model` testing util for better out-of-box testing while avoid automatic downloading. Some tests that require manually model compilation are kept in the old test style (e.g., with model "llava", "eagle", etc.). This PR also fixes some typing issues suggested by mypy. * [Serving] Enable GPU Sampling (#2368) enable gpu sampling * [REFACTOR] Support latest include_usage and DebugOptions (#2417) This PR refactors the mechanism of request end detection and also attaches the request metrics in response usage field. RequestResponse usage field: - include_usage can be passed to API. When include usage is on, metrics are now streamed back in the usage.extra - Changed debug_option parameter to extra_body, so they are fully compatible with OpenAI client - Support special requests in debug options, engine metrics are now streamed back via a special request We also change the FFI mechanism to detect response finish. Previously we keep track of number of stoppped streams. Now that the FFI always stream back the final chunk which have no choices and contains usage. We use the usage field to detect the final chunk. Code path are updated according. We also make Chat CLI a helper class that can be reused. iOS app now comes with stats support. * [DOWNLOAD] MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES (#2421) This PR introduces support for MLC_DOWNLOAD_POLICY and MLC_LLM_READONLY_WEIGHT_CACHES Allows reading from readonly cache besides MLC_LLM_HOME. 
Also introduces a domain subfolder in cached weights * [REFACTOR] Rename MLC_LLM_READONLY_WEIGHT_CACHES (#2423) This PR renames MLC_LLM_READONLY_WEIGHT_CACHES=>MLC_LLM_READONLY_WEIGHT_CACHE to be consistent with rest of env var convention * [Tokenizer] Auto-detect TokenizerInfo from tokenizer.json (#2416) This PR adds a new `TokenizerInfo` class that contains useful information about the tokenizer during generation. It is auto-detected from tokenizer.json if it exists. Otherwise it raises a warning and uses the default value (byte fallback tokenizer, not prepend/strip space). * [REFACTOR] Remove dependencies on legacy chat_module (#2424) This PR removes the all dependencies from chat_module.py So we can prepare for deprecating this module. This PR refactors and moves MLCChatConfig to protocol. This helps us to consolidate all API spec and config files under the protocol folder. The protocol folder mainly keeps the data schema and metadata, most of the actions(gen_config) are still kept in their current location. * [REFACTOR] Terminology download=>download_cache (#2425) This PR renames download to download_cache for better clarity. * [REFACTOR] Move GenerationConfig to protocol (#2427) This PR moves GenerationConfig to protocol. As we move towards OAI style API GenerationConfig becomes more like an internal API. This change reflects that and also removes duplicated definition of ResponseFormat and DebugConfig * Update README.md * [site] Add hero section to website (#2430) * [Compile] Skip CUDA graph rewrite when target is not CUDA (#2433) This PR rewrites the CUDA graph compiler flag to false when the backend is not CUDA. Otherwise, CUDA graph may be enabled for other backends and causes result error. * [DOCS] Simplify read me (#2435) This PR simplifies readme so most attention can be pointed to our docs page. 
* [DOCS] Update title to focus on engine feature This commit updates the docs to focus on engine feature * [Metadata] Remove stale KV cache size (#2434) This PR removes the KV cache size from model metadata. This is because we have fully switched to the new compilation flow with PagedKVCache and MLCEngine as backend, where KV cache size is runtime dependent and will be estimated at runtime. * [iOS] Update the MLCSwift APIs to async (#2436) This PR updates all MLCSwift APIs to be async for consistency purposes. * [Android] Switch MLC Chat to use MLCEngine (#2410) * [Android] Switch MLC Chat to use MLCEngine * [Serving] Add helper function - TotalDetectGlobalMemory * [iOS] Remove Legacy ChatModule (#2437) This PR removes the legacy chat module in iOS. * [Delivery] Update model delivery script to support specifying the output and hf directory (#2431) * Update model delivery script to support specifying the output directory * [Android] Remove Legacy ChatModule (#2438) * [Refactor] Remove ChatModule (#2440) This PR formally removes ChatModule from the codebase, given all the frontends have fully switched to use MLCEngine. * [Fix][REST] Fix usage-related server tests (#2441) This PR fixes some server tests which were broken due to recent refactors. * [Site] Enlarge hero image in small screens * Fix lint * [ANDROID] Patches to enable windows usescase (#2443) This PR add a few patches to enable build under windows * [DOCS] Guides for android on windows (#2444) * [DOCS] mention git-lfs (#2445) * Fix Llama-3 conversation template. Add unit test (#2442) * Fix Llama-3 conversation template. Add unit test * [Grammar][Wasm] Update new grammar to wasm runtime (#2446) * [Model] Use float32 for RoPE calculation (#2449) This PR updates the RoPE calculation to use float32 for multiplication and addition. This is motivated by the observation that calculating RoPE in float16 may cause accuracy issue. 
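To make the float16 accuracy concern behind the RoPE change in #2449 concrete, the following minimal sketch rounds a rotary angle through IEEE half precision using only the Python standard library. It is not the MLC kernel: the helpers `to_half` and `to_single`, the position 4097, and the frequency 1.0 are illustrative assumptions.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (float32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Illustrative values (assumptions): a long-context position and the
# highest-frequency rotary component, whose inverse frequency is 1.0.
position = 4097.0
inv_freq = 1.0

angle_exact = position * inv_freq
# float16 has an 11-bit significand, so integers above 2048 are not all representable.
angle_fp16 = to_half(to_half(position) * to_half(inv_freq))
# float32 has a 24-bit significand and represents this position exactly.
angle_fp32 = to_single(to_single(position) * to_single(inv_freq))

cos_err_fp16 = abs(math.cos(angle_fp16) - math.cos(angle_exact))
cos_err_fp32 = abs(math.cos(angle_fp32) - math.cos(angle_exact))
print(angle_fp16, angle_fp32, cos_err_fp16, cos_err_fp32)
```

In this sketch the half-precision angle lands on a neighbouring representable value, so the rotation applied to the query/key pair is visibly off, while the single-precision angle is exact for this input.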
* [LogitProcessor] Use min float value as the mask value (#2451) This PR updates the mask values in LogitProcessor to the min value of float32. Prior to this PR it was -1e10. This update is the safest for softmax as long as the masking is always the last step in logit processor. * [Protocol] Use `by_alias=True` when dumping pydantic classes (#2450) This PR sets the parameter `by_alias=True` for all the `model_dump_json` of pydantic classes, so that aliases are always respected. * [Protocol] Use `by_alias=True` when dumping pydantic classes (#2452) This PR sets the parameter `by_alias=True` for all the `model_dump` of pydantic classes, so that aliases are always respected. * [DOCS] Updates the URL of the Android APK (#2453) * Auto updated submodule references * [Fix][Phi3] Add `</s>` as stop token for phi3 (#2455) [Fix][Phi3] Add </s> as stop token for phi3 * [Site] Add GitHub link to hero section * Update README.md * [Hermes2] Add conv template for Hermes2-Pro-Llama3 (#2457) * [Compile] Add max_batch_size to metadata (#2463) This PR adds the max_batch_size at compile time to metadata for runtime to read. **Note.** This may be a breaking change for the compiled model libraries. And please set environment variable `MLC_JIT_POLICY=REDO` to recompile the models with JIT, or manually recompile the model libraries. This PR also adds the max_batch_size to qwen2. * [REFACTOR] Re-organize the modules after transition to MLCEngine (#2464) This PR reorganizes the modules after transition to MLCEngine. - grammar is a root level module - streamers and tokenizers are in the tokenizers namespace - conversation_template is module Testcases are restructured accordingly. We also removed some of the stale files. * [Serving] Add ICHECK for running batch size (#2465) This PR adds ICHECK to make sure that the running batch size in BatchDecode and BatchDraft does not exceed the `max_num_sequence` as in the engine config. The prefill actions should keep this invariant. 
And the ICHECKs added mainly serve for internal error detection and report purpose. * Auto updated submodule references * [TEST] Start to categorize tests (#2466) * [TEST] Start to categorize tests This PR add test categorization via pytestmark For now we have five categories of tests unittest op_correctness engine endpoint uncategorized We should start to fix some of the broken tests and move them to these categories. When possible we should cover a bug under unittest, since they get run every PR, as part of the CI. * Implemented FP8 calibration (#2454) * Implemented FP8 calibration * update * add transformers * Use encode_batch --------- Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> * [CI] Update CUDA build script with FlashInfer options (#2469) This PR updates the CI CUDA build script with FlashInfer compile options after a recent bump of FlashInfer version. * [Serving] Use preferred host memory for host NDArrays (#2468) This PR updates the host memory in model, logit processor and GPUSampler with the support of preferred host device, so that for CUDA and ROCm the pinned memory will be used for the host arrays, which may be faster than the default CPU memory during copying. * [TEST] Temp disable UT stage This PR temp disables the UT stage for now before we can get a fix on the docker execution * [CUDA] Turn on cuda graph at O2 (#2467) * [CI] Enable GPU env in CI (#2476) * [CI] Enable GPU env in CI This PR enables GPU env in ci docker/bash.sh * remove dep on tvm testing plugin * [CMake] Update config.cmake generation script (#2478) This PR updates the config.cmake generation script to provide the FlashInfer compile options explicitly. * [TEST] MockEchoEngine (#2479) This PR introduces a MockEchoEngine that echos the inputs prompt and the generation conflig(as part of usage.extra). The engine can be used to create unit-test cases that covers engine API handling. Note that mock tests cannot replace real engine tests. 
* Auto updated submodule references * [Fix] Fix JSONFFI MemoryBufferStream after dmlc bump (#2480) A recent bump in dmlc has changed the `Write` signature of `dmlc::Stream`. This commit updates the codebase to follow the upstream change. * [JSON-FFI] Enable n generation and pass in json schema (#2481) This PR enables n generation and pass in json schema in JSON FFI. * Refactor model delivery script to use pydantic (#2482) * Fix tokenizers encode batch (#2484) * [Bugfix] Fix delivered log issue in delivery cli (#2489) * Support Qwen2-MoE Architecture (#2089) * [3rdparty] Bump tokenizers-cpp to include HF tokenizers bump (#2490) This PR bumps the 3rdparty tokenizers-cpp to include the HuggingFace tokenizers package bump, in order to support some latest models such as Mistral v0.3. * [Bench] Add mlc bench (#2474) This PR adds an initial pass of the bench infra * Auto updated submodule references * Enable n-sampling for Medusa spec decoding (#2495) * Fix get_num_available_pages for model without kv cache * Enable n-sampling for Medusa spec decoding * [CONFIG] Remove mean_gen_len from the config (#2493) This PR removes legacy mean_gen_len from the config * Update ios android docs (#2497) * [Bench] Add seed to __init__ and some minor change (#2496) * [Fix][Config] Max total sequence length overflow with sliding window (#2500) This PR fixes an issue which causes the int64 multiplication overflow when sliding window is enabled. * [Serving] PagedKVCache tree-attention integration (#2487) This PR integrates the recent support of tree-attention in PagedKVCache into the speculative decoding in MLC. Right now only chains are supported. Tree-based speculative decoding is on the project road map and we are planning to support it in recent future. * [Sampler] Enhance checks for whether FlashInfer is enabled (#2502) This PR improves the check in GPU sampler for whether FlashInfer is enabled. 
Previously we did not check the CUDA compute capability, which makes the GPU sampler not able to properly run on Colab where the T4 GPU has a compute version of 7.5 which FlashInfer does not support. With this PR, when the compute capability is less than 8.0, we will not use FlashInfer in GPU sampler. * [Android] Updates the default mode list and the APK link in the document (#2503) * [Android] Update default model list Update the default model list in Android to include the following models 1. Phi-3-mini-4k-instruct-q4f16_1-MLC 2. Llama-3-8B-Instruct-q3f16_1-MLC 3. Mistral-7B-Instruct-v0.3-q4f16_1-MLC * [DOCS] Updates the URL of the Android APK * [Fix] Fix the global func name of TokenizerDecode (#2514) This PR fixes the global func name for `TokenizerDecode`, which was not updated when adding the namespace `tokenizers`. * [Fix] Use the correct model to validate stream_options (#2508) * [Fix] Typo in docs/install/tvm.rst (#2507) Fix a typo in serve/engine.py * [FP8] Use f32 scale to enable better fusion (#2505) * [Metrics] Add ttft and itl to server metrics (#2510) * Add ttft and itl to server metrics * Fix ITL * Fix clang-format * Keep mobile and interface.chat untouched * [Model] Fix config detection for Mistral (#2504) The Mistral model has removed sliding window since its v0.2, while in MLC we always enable sliding window. This PR updates the config detection so that when sliding window is disabled, we turn to checking the context window size and make sure it is properly set. * [Fix] Provide a GetTokenId API for SampleResult (#2516) Currently we use `sampled_token_id.first` to find the sampled token id of a SampleResult object, which is obscure. This PR provides a `GetTokenId` API for SampleResult to get the sampled token id. This PR also updates the testing model path to include `./dist/`. * [Reapply][BUGFIX] Fix rare deadlock in threaded engine (#2429) (#2518) This PR reapplies #2429, which is missing in the main branch. 
Below is the original commit message: This PR fixes rare deadlock cases during engine unload/reload. Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> * [Fix] Fix metrics division by 0 (#2519) This PR fixes an issue of the per-request metrics, where division-by-0 may happen when the request does not run any decode step. The division-by-0 results in `inf`, which is then written into a JSON file. However, `inf` is usually not recognized as a float value in JSON grammar. Thus JSON parsers fail on parsing any JSON string that comes with `inf` without being quoted. * Corrected the folder path for Android Studio Project (#2520) Update android.rst Android project path corrected * Update tvm.rst * [iOS] Update model list (#2524) Update the model list of iOS in `mlc-package-config.json`. * [Android] Updates the order of mode list and the APK link in the document (#2526) [Android] Updates the default mode list and the APK link in the document 1. Qwen1.5-1.8B-Chat-q4f16_1-MLC * [Sampler] Skip top-p renormalization if top-p is 1 in CPUSampler (#2528) This PR adds a shortcut in the top-p renormalization in CPU sampler, which skips the renormalization when top-p is 1.0. * [Docs] Rename javascript.rst to webllm.rst (#2531) * [Conv] Add tinyLlama v1.0 conv template (#2530) * [Conv] Add tinyLlama v1.0 conv template * Fix lint * [iOS] correct mistral q3 url and handle screen switch off (#2529) This PR corrects the mistral q3 URL. This PR also adds a handler for screen switch off. For now we just reset if the app is generating; we will update to pause/resume once they are supported. * [Grammar] Fix include protection and paths in docstring (#2515) Following #2464, this PR fixes the include protection in the header files and the paths in the docstrings of the header files. This PR also fixes tests that were broken after the refactor.
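Returning to the division-by-0 fix in #2519 above: the sketch below shows, with Python's standard `json` module, why an unquoted infinity is a problem and one way to guard a rate computation. The actual fix lives in the C++ metrics code and may differ; `safe_rate` and its fallback value of 0.0 are hypothetical.

```python
import json
import math


def safe_rate(num_tokens: int, elapsed_s: float) -> float:
    """Tokens per second, falling back to 0.0 when no time was spent.

    Hypothetical guard for illustration; not the engine's implementation.
    """
    return num_tokens / elapsed_s if elapsed_s > 0 else 0.0


# By default Python's encoder emits the literal `Infinity`,
# which is not part of the JSON grammar.
non_standard = json.dumps({"decode_tokens_per_s": math.inf})

# With allow_nan=False the encoder refuses to emit it at all.
try:
    json.dumps({"decode_tokens_per_s": math.inf}, allow_nan=False)
    strict_encoding_succeeded = True
except ValueError:
    strict_encoding_succeeded = False

# A guarded metric serializes cleanly even in strict mode.
guarded = json.dumps({"decode_tokens_per_s": safe_rate(0, 0.0)}, allow_nan=False)
print(non_standard, strict_encoding_succeeded, guarded)
```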
* [Tokenizer][Fix] Fix SegFault when analyzing tokenizers without tokenizer.json (#2532) Previously the tokenizer would segfault when analyzing a tokenizer that did not have a tokenizer.json file. This was because `TokenizerInfo()` was called in that case, which creates a null object. This PR fixes this problem. * [Serving] Use stop strs and token ids for completions (#2534) This PR applies the stop strings and stop token ids defined in the conversation template to the raw text completions, so that whenever the model outputs a stop token id or stop string, the raw generation can stop. Prior to this commit, the raw text generation never stopped when the max tokens was not given. This commit helps reduce the frequency of such events. Nevertheless, if the model does not output a stop string/token id, the generation will still not stop. * [Serving] Support tensor parallel shards override in command line (#2533) This PR supports the command line overrides for model JIT compilation. This is especially helpful for enabling tensor parallelism out of the box, so people don't need to manually tweak `mlc-chat-config.json` to use tensor parallelism. * Add tie_word_embedding option for Qwen2 model (#2535) * [Bench] Defaults to aiohttp client, add ServerMetrics (#2527) * [Bench] Defaults to aiohttp client * Add ServerMetrics to summary * Remove duplicate servermetric def * [Android] Remove var capture in TVM_SOURCE_DIR (#2538) This PR fixes the TVM_SOURCE_DIR parsing issue on Windows. * [Fix] Fix inconsistent system prompt handling (#2539) This PR fixes the conversation template of ChatML, whose system prompt ends with `<|im_end|>`. An inconsistent handling of system prompt between the JSONFFI side and the Python side is also corrected. * [Attention] Fix attn kernel for general GQA group size (#2543) This PR fixes the TIR prefill attention kernels to support a broader list of GQA group sizes.
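As a rough illustration of the stop handling described for raw completions in #2534 above, the sketch below ends a stream when either a stop token id or a stop string shows up. It is a simplified model, not the engine's code: `generate_until_stop`, the `(token_id, piece)` stream shape, and the `"stop"` / `"length"` labels are assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple


def generate_until_stop(
    stream: Iterable[Tuple[int, str]],
    stop_token_ids: Set[int],
    stop_strs: List[str],
) -> Tuple[str, str]:
    """Consume (token_id, decoded_piece) pairs until a stop condition fires.

    Returns the text produced before the stop condition and a finish label.
    Hypothetical helper for illustration only.
    """
    text = ""
    for token_id, piece in stream:
        if token_id in stop_token_ids:
            return text, "stop"
        text += piece
        # A stop string may span several tokens, so search the accumulated text.
        hits = [text.find(s) for s in stop_strs if s and s in text]
        if hits:
            return text[: min(hits)], "stop"
    # The stream ran out without any stop condition (e.g. a token budget was hit).
    return text, "length"


chatml_stream = [(1, "Hello"), (2, " wor"), (3, "ld<|im"), (4, "_end|>"), (5, " extra")]
by_string = generate_until_stop(chatml_stream, set(), ["<|im_end|>"])
by_token = generate_until_stop([(1, "a"), (2, "b"), (99, "c")], {99}, [])
unstopped = generate_until_stop([(1, "a"), (2, "b")], {99}, ["<|im_end|>"])
print(by_string, by_token, unstopped)
```

The last call mirrors the caveat in the commit message: when the model never emits a stop string or stop token id, only an external limit ends the generation.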
* fix: typo error (#2544) * [Fix] Fix attn kernel build issue (#2545) This PR fixes TIR issues in the attn kernels. * [iOS] Add Qwen2 support (#2547) This PR add Qwen2 support to MLC Chat * [Android] Add Qwen2 support (#2548) * [Android] Escape backslashes and quotation marks (#2546) This commit escapes the backslashes and quotation marks in Android package build. * [EngineConfig] Add override options (#2550) This PR introduces override options to the Python side EngineConfig so that they'll be reflected in JIT model compilation. * [Site] Update link to webllm * [Site] Update heading * [Preset] Add model preset for model delivery (#2553) [Preset] Add model preset for wasm delivery * Update docs to remove mention of older models (#2557) * [Docs] Fix typo in mlc_llm chat command (#2560) * Fix compilation for gcc 13.2 (#2561) * [Tokenizer] Priorize HuggingFace/SentencePiece over ByteLevelBPE (#2559) This PR updates the tokenzier load logic, so that we prioritize the use of HuggingFace and SentencePiece tokenizers over the ByteLevelBPE tokenizer. This fixes the issue that token `<im_start>` in Qwen model is tokenized into multiple tokens when the ByteLevelBPE tokenizer is chosen when available. * [Serving][Grammar] Jump-forward decoding (#2551) [Serve][Grammar] Jump-forward decoding This PR supports the jump-forward decoding as described in <https://lmsys.org/blog/2024-02-05-compressed-fsm/>. The jump-forward decoding uses the grammar constraint to predict the next output string and tokenize the string into tokens, and therefore speeds up the decoding. This PR implements these optimizations to ensure the output quality: - Retokenization in jumpforward: Tokenize the last k token as string appended with the predicted string. If the tokenization result differs from the old tokens, roll back these tokens and accept the new ones. - Retokenization in decoding: Tokenize the last k token as string appended with the decoded token. 
This will happen in decoding stage when the jumpforward decoding happens in the last round. If the result differs, the old tokens will be rolled back. - Skip prefix tokens in jumpforward: We call tokens that is a prefix of another token as prefix tokens. If the last token from jumpforward is a prefix token, it's highly possible that it will be rolled back in the next decode stage, as it may be combined with the decoded token. It also effects the output distribution as such pattern is rare in training data. Therefore, we skip the last prefix token in jumpforward decoding. This PR also includes the following changes: - Add several metrics for request and engine, especially about the jumpforward decoding - Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from early return Performance and benchmark: Schema(Pydantic): ``` class Product(BaseModel): product_id: int is_available: bool price: float is_featured: Literal[True] category: Literal["Electronics", "Clothing", "Food"] tags: List[str] stock: Dict[str, int] ``` Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G Results: ``` Jump forward: False, Batch: 1 Engine metrics: { "engine_decode_time_sum": 0.4988938220000001, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 66, "decode_tokens_sum": 66, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 132.2926785010378, } Jump forward: True, Batch: 1 Engine metrics: { "engine_decode_time_sum": 0.37242740600000007, "engine_jump_forward_time_sum": 0.027989265000000006, "completion_tokens_sum": 68, "decode_tokens_sum": 68, "jump_forward_tokens_sum": 28, "decode_tokens_per_s": 182.58591850246378, } Jump forward: False, Batch: 4 Engine metrics: { "engine_decode_time_sum": 0.9106805410000002, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 261, "decode_tokens_sum": 261, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 286.5988546470984, } Jump forward: True, Batch: 4 Engine metrics: { "engine_decode_time_sum": 0.6843025599999999, 
"engine_jump_forward_time_sum": 0.028089531999999997, "completion_tokens_sum": 266, "decode_tokens_sum": 266, "jump_forward_tokens_sum": 112, "decode_tokens_per_s": 388.71694415405966, } Jump forward: False, Batch: 8 Engine metrics: { "engine_decode_time_sum": 1.62462493, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 538, "decode_tokens_sum": 538, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 331.1533573475325, } Jump forward: True, Batch: 8 Engine metrics: { "engine_decode_time_sum": 1.0509048310000002, "engine_jump_forward_time_sum": 0.027971332000000022, "completion_tokens_sum": 525, "decode_tokens_sum": 525, "jump_forward_tokens_sum": 224, "decode_tokens_per_s": 499.5694990767436, } Jump forward: False, Batch: 16 Engine metrics: { "engine_decode_time_sum": 2.317279175, "engine_jump_forward_time_sum": 0, "completion_tokens_sum": 1068, "decode_tokens_sum": 1068, "jump_forward_tokens_sum": 0, "decode_tokens_per_s": 460.8853398080531, } Jump forward: True, Batch: 16 Engine metrics: { "engine_decode_time_sum": 1.3962938819999997, "engine_jump_forward_time_sum": 0.030129287999999994, "completion_tokens_sum": 1059, "decode_tokens_sum": 1059, "jump_forward_tokens_sum": 448, "decode_tokens_per_s": 758.4363246533227, } ``` * [Delivery] Update model delivery script (#2565) Some improvements of the delivery script: - provide different overrides for different quantization. e.g. we can change prefill chunk size for q0/q3/q4 - rerun gen config only if only conv_template changes - do NOT recreate HF repo when the repo already exists. This will preserve commit history - dry-run validation * [Model] Enhance error reporting for invalid tensor-parallel settings (#2566) This PR enhances the error reporting for multi-GPU model compilation, so we can provide as many error reasons as possible before loading and running the models. 
* [Serving] Apply tree structure in draft token verification (#2563) This adds the interface to draft token state and sampler to allow tree structure being recorded and used for verification * [Bench] Json mode bench (#2552) * [Bench] Json mode bench This PR refactors mlc bench to enable json mode in dataset. * upd * fix lint * [Model] Support Multi-GPU for Qwen-MoE model (#2573) This PR introduces the multi-GPU support for the Qwen-MoE model. Validated on 4090x2. * [Metrics] Add missing fields in `Reset` (#2574) This PR adds the missing fields that were not cleared up in `EngineMetrics::Reset`. * [Doc] Update WebLLM doc (#2578) Update documentation for WebLLM. Currently we only provide a high-level view for WebLLM runtime here, and refer user to the WebLLM repo README for more. The documentation focuses on adding their own model variant / model library for WebLLM. Will follow up with more thorough runtime documentation. * [Op] Top-4 implementation for MoE model (#2586) This PR introduces a top-4 kernel for MoE model (particularly for the Qwen-MoE) at this moment. This is still a manual implementation and has some duplication with the existing top-2 kernel. In the future we'll consider leveraging meta-programming of TIR to unify the top-k kernel implementations. * [Model] Gemma 1.1 compatibility (#2594) This PR updates the Gemma config so that MLC can work properly with Gemma 1.1. * [Serving] Hybrid prefill (#2604) This PR adds the support for the hybrid prefill. So during the prefill engine action, it will do the decode for running requests as well. * Update quick_start.rst to fix broken links (#2607) Update quick_start.rst Fix broken links for convert weights and compile model pages * [Fix] Set the missed prefill finish time (#2613) This PR fixes a bug which fails to set the prefill finish time and results in metric error. * [Android] Reduce binary size (#2606) This PR updates the Android app the reduce the binary size. 
Right now it can be reduced to 108MB when only building with the Phi-3-mini-4k model. * [Fix] Gemma hidden_activation compatibility (#2614) This PR fixes the Gemma config compatibility issue. * Update debug_compare (#2612) This PR fixes a bug of the debug_compare.py script. * [SLM] Add support for InternLM2 architecture (#2608) This commit introduces the InternLM2 model support. * [Fix] Prefix cache only enables sliding window on leaf sequence (#2615) This PR updates the prefix cache to align the logic of enabling sliding window. Now only leaf sequence is enabled sliding window attention. * [Android] Update include path for tvm runtime src (#2616) This PR updates the include directories for the Android app so that we can avoid using macros for src file include. * [Fix] Mark the decode requests in hybrid prefill (#2621) This PR fixes an issue that may cause duplicate prefix updates for the decode requests in the hybrid prefill action. * [Fix] Fix the chunked prefill condition (#2628) This PR fixes a bug of the prefill chunking which may cause the running batch size exceeding the maximum allowed batch size. * [SLM] Internlm2 Multi-GPU support (#2626) This PR enable TP function of internlm2 model. * [Serving] Merge multiple token embedding lookup into one (#2629) This PR supports merging multiple token embedding lookup into a single one, since each token embedding lookup needs to go through the model, and multiple lookup will introduces extra overhead. * [Model] Support Internlm2.5 (#2630) InternLM2.5 series that have outstanding features were released just days ago, and this PR support Internlm2.5 by adding model preset of internlm_2_5_7b. * Fix for RWKV new config and new format vocab (#2632) * [Fix] Fix KV cache single-page copy kernel (#2644) The current single-page copy kernel misses a predicate, which may cause incorrect attention results in serving, when RemoveRequest is involved. 
* [Fix][Tokenizer] Fix failure in decoding tokens for ByteLevel BPE (#2649) This PR fixes the issue where the tokenizer would fail in decoding tokens for ByteLevel BPE when the token is not recognized by ByteLevel. E.g. in decoding, ``` "hello" -> "hello" (recognized by ByteLevel) "Ġthere" -> " there" (recognized by ByteLevel) "\n" -> not recognized by ByteLevel "\u203c" -> not recognized by ByteLevel ``` This PR adds the logic that in decoding, when the token is not recognized by ByteLevel, the original token will be returned. Then ``` "hello" -> "hello" (recognized by ByteLevel) "Ġthere" -> " there" (recognized by ByteLevel) "\n" -> "\n" (not recognized by ByteLevel) "\u203c" -> "\u203c" (not recognized by ByteLevel) ``` This behavior aligns with HuggingFace tokenizers. * [Fix][Bitmask] Mask dummy padded tokens for grammar (#2651) * [Engine] Reduce action post-process overhead (#2653) This PR optimizes the post-process overhead and adds more detailed NVTX instrumentation. * [PrefixCache] Defer sequence extension (#2654) This PR defers the prefix cache sequence extension. Previously, the prefix cache update was committed after every action, which is unnecessary. We can defer this sequence extension and commit the extensions when the prefix cache is used again. This PR also changes the IntTuple used in PrefixCache to `std::vector<int32_t>` for less data structure construction overhead. * [Model] Support Starcoder2 (#2657) This PR supports the Starcoder2 model. * [Engine] Lazy recompute in GetRunningRequestStateEntries (#2655) This PR updates GetRunningRequestStateEntries to make it lazy. We use a dirty flag to check whether the running request state entries have changed since the last recompute. We make this improvement due to the observation that this function may cause some CPU overhead. During consecutive rounds of batch decode, the running requests don't change, so we can effectively use this dirty flag to avoid recomputation.
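The dirty-flag idea behind the lazy recompute in #2655 can be sketched in a few lines. This is a generic Python illustration, not the engine's C++ code: the class `RunningRequests`, its methods, and the `recompute_count` counter are hypothetical names introduced for the example.

```python
from typing import Dict, List


class RunningRequests:
    """Caches a flattened view of per-request state entries (hypothetical)."""

    def __init__(self) -> None:
        self._requests: Dict[str, List[str]] = {}
        self._cached_entries: List[str] = []
        self._dirty = True  # the cache starts out stale
        self.recompute_count = 0  # instrumentation for this example only

    def add_request(self, request_id: str, entries: List[str]) -> None:
        self._requests[request_id] = list(entries)
        self._dirty = True  # any mutation invalidates the cache

    def remove_request(self, request_id: str) -> None:
        if self._requests.pop(request_id, None) is not None:
            self._dirty = True

    def get_running_entries(self) -> List[str]:
        # Rebuild the flattened list only if something changed since the last call.
        if self._dirty:
            self._cached_entries = [
                entry for entries in self._requests.values() for entry in entries
            ]
            self._dirty = False
            self.recompute_count += 1
        return self._cached_entries


state = RunningRequests()
state.add_request("req-0", ["req-0/branch-0"])
state.add_request("req-1", ["req-1/branch-0", "req-1/branch-1"])
for _ in range(3):  # consecutive decode rounds with an unchanged request set
    entries = state.get_running_entries()
print(entries, state.recompute_count)
```

Repeated calls between mutations return the cached list, which is the situation the commit message describes for consecutive batch decode rounds.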
* [Fix] Fix prefix cache reuse with eagle mode (#2664) This PR fixes the prefix cache bug with eagle mode on. The prefilled offset is forgotten to be shifted in this case. * [Model] Support SmolLM (#2667) This PR supports HuggingFace's SmolLM. The only change needed is to support `tie_word_embeddings` in `llama_model.py`. Currently we extend an `nn.Embedding`, following our approach for QWen2. In future we can think about abstracting it out, perhaps implementing `forward_as_linear()` for `nn.Embedding`. * [SLM] Starcoder2 Multi-GPU support (#2662) This PR supports TP function of starcoder2 and fixes two typos. * [Engine] Defer the collection of decode inputs in prefill (#2668) This PR defers the collection of decode inputs in hybrid prefill, as the collection of decode inputs may cause much CPU overhead while it ends up no prefill can be performed. By deferring the collection of decode inputs, we can quickly decide whether prefill is doable, and this decision does not involve too much CPU overhead. * support mistral-nemo (#2676) * [Model] Fix annotation typos (#2672) * Update starcoder2_quantization.py * Update qwen2_loader.py * Update qwen2_model.py * Update qwen2_moe_loader.py * Update rwkv5_loader.py * Update rwkv6_loader.py * Update qwen_loader.py * Update phi3_quantization.py * Update phi_quantization.py * Update phi3_model.py * Update phi3_model.py * Update phi3_quantization.py * fix tp * [Model] Support Llama3.1 (#2682) This PR supports the [Llama3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) family. Particularly we introduced the conversation template and RoPE scaling for Llama3.1. In the future we will bring the support of more RoPE scaling. 
  Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
* [SLM] Introduce microsoft/Phi-3 vision (#2658)
  Introduce microsoft/Phi-3 vision from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
* [Preset] Add llama3.1 to preset, comment out llama3 (#2683)
* [Pass] Rewrite FuseAddRMSNorm to avoid binding rewrite recursion (#2689)
  This PR revamps the FuseAddRMSNorm pass with manual pattern matching, in order to avoid `rewrite_bindings`, which is recursive and may take prohibitively long when the model is large.
* Initialize all `local_top_k` values in `gating_softmax_topk` (#2694)
  If `x` has `nan` or `-inf` values, the condition `x[vi,vk] > local_top_k[0]` may be false. Falling back to the condition `x[vi,vk] > local_top_k[1]` then reads the uninitialized value in `local_top_k[1]`.
  This can also result in out-of-bounds memory access. If all values in `x[vi,vk]` are `nan` or `-inf` along some row `vi`, then `local_top_k_index[1]` is never populated. For mixture-of-experts models, when `gating_softmax_topk` is used to select the expert, this uninitialized value is then used as an array index.
  This commit updates the `top2_softmax_norm_func` implementation in `gating_softmax_topk` to initialize both elements of the `local_top_k` and `local_top_k_index` arrays, matching the implementation of `top4_softmax_norm_func`.
* [Serving] Fix spec decoding call packed with rvalue (#2699)
* [ASYNC] Properly abort cleanup in async handling (#2698)
  This PR adds a context manager to properly clean up when an exception occurs during async iteration. Naively using the try-except pattern results in bugs when we chain async generators and an exception is raised between iterations, outside the try-except.
* [Serve] Expose prefill mode option (#2701)
  This PR exposes the prefill mode option, which can be set to chunked prefill or hybrid prefill with split-fuse decode.
* [Fix] Fix hybrid prefill disabled (#2705)
  This PR fixes #2701 for the case where the prefill mode is chunked but the prefill requests are not collected.
* Turn on custom allreduce by default in O3 (#2706)
* [Fix] Fix hybrid prefill index error (#2707)
  This PR fixes the index error when hybrid prefill is enabled.
  Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [Bench] Revamp benchmark submodule (#2702)
  This PR revamps the benchmark submodule with a `__main__` entry that enables running the benchmark.
* [Serving] Fix handling of num_tokens_for_next_decode in spec decoding (#2709)
* Update worker.py for compatibility with upstream TVM (#2712)
  This commit updates `mlc_llm.cli.worker` to be compatible with upstream TVM https://github.com/apache/tvm/pull/17180, which adds a `num_groups` argument to the disco worker function.
  To decouple this compatibility from a general TVM version bump, this commit checks the number of arguments provided to `worker.py` to determine whether the `num_groups` argument is present. After the TVM version used by MLC-LLM is updated to include the upstream changes, this check can be removed.
* Add support for Gemma2 (#2674)
* Add support for Gemma2
* Update Gemma2 impl
  This commit updates the Gemma2 implementation, including the following aspects:
  1. We try to reuse as much code as possible from the Gemma model, for overall clarity and maintainability of the code structure.
  2. We properly set the scaling factor for attention.
  3. We add the final logit soft-capping for Gemma2.
  ---------
  Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [Preset] Add gemma2 preset (#2715)
  Add gemma2 2b, 9b, and 27b to preset; remove gemma1 preset.
* [Android] Update model for Android APK (#2718)
* Update android package config from gemma 2b to gemma 2 2b
* Revert phi3 model definition for backward compatibility
* [iOS] Add Gemma2 for iOS app (#2717)
  This commit switches the Gemma model in the iOS app to Gemma2.
* Default bundle gemma2 (#2721)
* [Bench] LLMPerf dataset (#2713)
  This PR adds the LLMPerf dataset to the benchmark module.
* [ConvTemplate] Update Gemma template with <bos> (#2722)
  This commit adds `<bos>` to the Gemma conversation template.
* [C++] Handle system_prefix_token_ids in C++ Conv template (#2723)
  The `system_prefix_token_ids` of a conv template usually already contains the bos token, which should be processed when converting a message list to a single prompt. However, the C++ side did not respect this field before.
* Delete .gitmodules

---------
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: krishnaraj36 <quic_kvegiraj@quicinc.com>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: Rick Zhou <rickzhoucmu@gmail.com>
Co-authored-by: Git bot <bot@noreply.github.com>
Co-authored-by: Vivian Zhai <98248913+YiyanZhai@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Animesh Bohara <ani.bohara@gmail.com>
Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
Co-authored-by: Kartik Khandelwal <kartikkhandelwal1998@gmail.com>
Co-authored-by: Nestor Qin <imba.qxy@gmail.com>
Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com>
Co-authored-by: Faolain <Faolain@users.noreply.github.com>
Co-authored-by: Bodhi <3882561+BodhiHu@users.noreply.github.com>
Co-authored-by: Huaishun Hu <huaishun.hu@mthreads.com>
Co-authored-by: Hyunsung Lee <ita9naiwa@gmail.com>
Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: tqchen <tqchenml@gmail.com>
Co-authored-by: Mengshiun Yu <mengshyu@gmail.com>
Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: rmstc <ramees025@gmail.com>
Co-authored-by: KEL <me@iamkel.net>
Co-authored-by: Andrey Malyshev <ma_elvin@mail.ru>
Co-authored-by: Gunjan Dhanuka <d.gunjan@iitg.ac.in>
Co-authored-by: Shushi Hong <820958424@qq.com>
Co-authored-by: Yao Yujian <yyjhao@gmail.com>
Co-authored-by: Eric Lunderberg <Lunderberg@users.noreply.github.com>
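
The ByteLevel BPE decoding fallback described for #2649 in the log above can be sketched in Python roughly as follows. This is an illustrative sketch, not the MLC-LLM implementation: the names `byte_level_table` and `decode_byte_level_token` are hypothetical, the table follows the GPT-2 byte-to-unicode convention, and returning a `str` instead of raw bytes is a simplification made for readability.

```python
def byte_level_table():
    """Build a GPT-2 style char -> byte table, as used by ByteLevel BPE."""
    # Bytes that are represented by their own (printable, non-space) character.
    kept = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    char_to_byte = {chr(b): b for b in kept}
    # Every remaining byte is represented by a code point starting at 256,
    # e.g. the space byte (32) becomes "Ġ" (U+0120).
    shift = 0
    for b in range(256):
        if b not in kept:
            char_to_byte[chr(256 + shift)] = b
            shift += 1
    return char_to_byte


CHAR_TO_BYTE = byte_level_table()


def decode_byte_level_token(token: str) -> str:
    """Decode one token; return it unchanged if ByteLevel does not recognize it."""
    if any(ch not in CHAR_TO_BYTE for ch in token):
        # Fallback: the token is not valid ByteLevel output, keep it as-is.
        return token
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # Simplification: a single token may hold a partial UTF-8 sequence,
    # which is replaced here rather than carried over to the next token.
    return raw.decode("utf-8", errors="replace")
```

With this sketch, `decode_byte_level_token("Ġthere")` gives `" there"`, while `"\n"` and `"\u203c"` contain characters outside the table and come back unchanged, mirroring the examples in the commit message.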

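The uninitialized-read problem described for #2694 in the log above can be illustrated with a plain-Python analogue of a top-2 selection loop. This is a sketch under assumptions, not the TIR kernel in MLC-LLM: the function name `top2_indices` and the chosen defaults (`-inf` for values, `0` and `1` for indices) are hypothetical and only show why both slots need a defined starting value.

```python
import math


def top2_indices(row):
    """Return indices of the two largest entries; always in range for len(row) >= 2."""
    # Initialize BOTH slots up front. Every comparison against nan is False,
    # and -inf is never greater than -inf, so a row of nan / -inf values
    # never overwrites these defaults. They must therefore be valid indices.
    top_val = [-math.inf, -math.inf]
    top_idx = [0, 1]  # assumed defaults, hypothetical choice
    for i, x in enumerate(row):
        if x > top_val[0]:
            # New best: demote the previous best to second place.
            top_val[1], top_idx[1] = top_val[0], top_idx[0]
            top_val[0], top_idx[0] = x, i
        elif x > top_val[1]:
            top_val[1], top_idx[1] = x, i
    return top_idx
```

For an ordinary row such as `[0.1, 0.7, 0.2, 0.5]` the loop selects indices `[1, 3]`. For a row of all `nan` or all `-inf` neither branch is ever taken, so the result is whatever the slots were initialized to, which is why leaving the second slot uninitialized could produce an arbitrary expert index.
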
MasterJH5574 marked this pull request as ready for review June 4, 2024 00:11

MasterJH5574 force-pushed the 06-02-tree-attn-kv-cache branch from 3834082 to 63b67cf Compare June 4, 2024 00:12

tqchen merged commit c0c33a5 into mlc-ai:main Jun 4, 2024
2 checks passed
