Merge with mlc-ai/main (d3d264d4b05d73e9757375013b842254f052c6ed, April 29th 2024) #265

Merged 252 commits on Apr 29, 2024
Changes from all commits
Commits
252 commits
bcb9b6a
[Serving][Grammar] BNF grammar simplifier and matcher (#1801)
Ubospica Feb 24, 2024
ce42880
[Serving] LogProbs support (#1832)
MasterJH5574 Feb 24, 2024
1cbd67b
[Serving] Support Mixtral in MLC Serve (#1840)
MasterJH5574 Feb 26, 2024
607dc5a
[Fix] Fix `u_char` for Windows build (#1848)
MasterJH5574 Feb 27, 2024
c4d1b69
Auto updated submodule references
Feb 27, 2024
31e0571
[Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849)
CharlieFRuan Feb 28, 2024
89f3e41
[Build] Replace mod_transform_before_build with IRModule pass (#1852)
Lunderberg Feb 28, 2024
6ce1759
[SLM] Add support for InternLM architecture (#1835)
tlopex Feb 28, 2024
1497744
[Bugfix] Handle model names with multiple path components (#1851)
Lunderberg Feb 28, 2024
7456314
[KVCache] Add max num threads awareness to KVCache kernels (#1822)
CharlieFRuan Feb 28, 2024
52d002f
[KVCache] Migrate Baichuan model to PagedKVCache (#1854)
tlopex Feb 28, 2024
ac57c03
[Python] Lazy import of transformers for tiktoken conversion (#1860)
MasterJH5574 Feb 29, 2024
1f70d71
[SLM] RWKV5 World Support (#1787)
Hzfengsy Feb 29, 2024
eb465ec
[Serving] Register the ChatML conversation template (#1862)
tlopex Feb 29, 2024
5bbe204
[Utils][Transform] Added SetEntryFuncs transform (#1855)
Lunderberg Mar 1, 2024
eb66452
[Build] Update transform_params_for_each_rank to IRModule pass (#1856)
Lunderberg Mar 1, 2024
5f2a06e
[Serving][Grammar] Integrate JSON grammar into the generation pipelin…
Ubospica Mar 2, 2024
7806dee
[Serving] Support "n" for parallel generation (#1868)
MasterJH5574 Mar 2, 2024
63c338b
[CI] Add retry to scm checkout (#1869)
tqchen Mar 2, 2024
e8b5b0b
[Attn] Use float32 accumulation in attention kernel (#1870)
MasterJH5574 Mar 2, 2024
91008ae
[Utils] Allow ReorderTransformFunc to be used without param manager (…
Lunderberg Mar 3, 2024
731616e
[SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872)
Kartik14 Mar 3, 2024
e4341b3
[Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1…
MasterJH5574 Mar 3, 2024
c0606ec
[Fix] Add the missing BundleModelParams pass (#1875)
MasterJH5574 Mar 3, 2024
07af0f9
[Docs] Update Android APK download link (#1876)
MasterJH5574 Mar 3, 2024
837869a
Fix MLC-LLM website link weight convert not accessible (#1877)
DiegoCao Mar 3, 2024
d2cfb1e
[Serving][Grammar] Support termination state in GrammarStateMatcher (…
Ubospica Mar 4, 2024
65ec85d
[Serving] Make RequestState as a standalone object class (#1878)
MasterJH5574 Mar 4, 2024
ffef890
[SLM] Update StableLM model and migrate it to paged KV Cache (#1882)
tlopex Mar 4, 2024
ef2db85
[KVCache] Qwen 1.0 Model PagedKV Support (#1887)
DiegoCao Mar 4, 2024
25877f9
[Serving] Estimate KV cache memory usage with metadata (#1888)
MasterJH5574 Mar 4, 2024
aeb55f1
[KVCache] Migrate bigcode arch to PagedKVCache (#1891)
davidpissarra Mar 5, 2024
e7b6cbc
[Serving] Add Phi-2 conv template to mlc serve (#1890)
Kartik14 Mar 5, 2024
8a8c529
[Attn] Fix attention kernel for head dim not divisble by 32 (#1889)
MasterJH5574 Mar 5, 2024
b345a9e
[Python] Enable "thrust" for CUDA by default (#1866)
MasterJH5574 Mar 5, 2024
2f26e05
[Serving] Fix loading presharded weights (#1894)
vinx13 Mar 6, 2024
a41f903
[Serving] Address embedding lookup OOM issue (#1899)
MasterJH5574 Mar 7, 2024
88ac813
[Model] Remove redundant `batch_forward` and move broadcast (#1900)
MasterJH5574 Mar 7, 2024
1eaef7c
[KVCache]Migrate Qwen2 model to PagedKVCache (#1903)
tlopex Mar 7, 2024
068d5ea
[CI] Skip not supported quantization in model compilation test (#1904)
MasterJH5574 Mar 7, 2024
655ae5c
[Serving] Add missing header for `std::iota` (#1905)
MasterJH5574 Mar 8, 2024
068091c
[Serving] Fix Model TokenEmbed function with TP (#1906)
MasterJH5574 Mar 8, 2024
73fa4a2
[SLM] Add support for Orion architecture. (#1883)
gesanqiu Mar 8, 2024
3f3e3fd
[Model] Eliminate the reshape in embedding func (#1908)
MasterJH5574 Mar 8, 2024
3f05a1f
[Pass] Low batch GEMM using GEMV-like schedule (#1769)
jinhongyii Mar 8, 2024
c2258ae
Auto updated submodule references
Mar 8, 2024
1b3cfd5
[Serving] Avoid unnecessary worker sync in Model (#1909)
MasterJH5574 Mar 9, 2024
448c5c4
[Serving][Grammar] Enhance GrammarStateMatcher to support general gra…
Ubospica Mar 9, 2024
b44cdc5
[Android] Improve perf of TIR PagedAttn kernel on Android (#1915)
spectrometerHBH Mar 10, 2024
20efccb
Deprecate old flow (#1928)
tqchen Mar 11, 2024
716b8e1
[Serving] Register the StableLM3B conversation template (#1920)
tlopex Mar 11, 2024
2e6f9cb
Remove deprecated build.py
tqchen Mar 11, 2024
9c80105
[Fix] KVCache creation with call_pure_packed (#1930)
MasterJH5574 Mar 11, 2024
d8fedd1
[KVCache] Update FlashInfer PackedFunc names (#1931)
MasterJH5574 Mar 11, 2024
4290a05
[REFACTOR] remove tests/legacy-python (#1933)
tqchen Mar 12, 2024
8beed7a
[REFACTOR] rename mlc_chat => mlc_llm (#1932)
tqchen Mar 12, 2024
c268f95
Auto updated submodule references
Mar 12, 2024
d6d972c
[Docs] Deprecating CUDA 11.7/11.8 support (#1939)
MasterJH5574 Mar 12, 2024
9df8f03
[Fix] Fix KV cache call in mistral (#1938)
MasterJH5574 Mar 12, 2024
4893415
[ChatModule] Remove eos_token_ids (#1940)
MasterJH5574 Mar 12, 2024
738e353
[SLM] Weight conversion with generator (#1916)
MasterJH5574 Mar 12, 2024
5b8c529
[Serve] Introducing GPU sampler for CUDA (#1934)
MasterJH5574 Mar 12, 2024
73b9965
[Serve] Constrain KV cache capacity on Metal (#1943)
MasterJH5574 Mar 13, 2024
8a29ee1
[CI] Add windows ci (#1942)
tqchen Mar 13, 2024
5c29f02
Auto updated submodule references
Mar 13, 2024
8d192ef
[Fix] Fix embedding shape check in ChatModule (#1953)
MasterJH5574 Mar 13, 2024
c0b2ccd
[Fix] Fetching the Git-LFS tokenizer files (#1954)
MasterJH5574 Mar 14, 2024
2872f70
[LogitProcessor] Add max thread awareness to logit processing kernels…
CharlieFRuan Mar 14, 2024
d546134
[Model] Use static hidden size in mixtral scatter_output (#1959)
vinx13 Mar 14, 2024
01527e9
Auto updated submodule references
Mar 15, 2024
09fe1bc
[CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941)
MasterJH5574 Mar 15, 2024
c7d52c4
[Serving][Grammar] Add grammar termination as a stop condition (#1964)
Ubospica Mar 15, 2024
994f928
Unify schema for conversation template and embed into mlc-chat-config…
rickzx Mar 15, 2024
73f2b27
[SLM] Small correction on Stablelm and Qwen2. (#1958)
tlopex Mar 16, 2024
d6b86d1
[Serving][Fix] Fix JSON output check in test_server.py (#1966)
Ubospica Mar 16, 2024
edffce4
[Model] Migrate Mistral to use PagedKVCache (#1967)
MasterJH5574 Mar 16, 2024
8f5e25d
Auto updated submodule references
Mar 18, 2024
386af8d
[REST] Update Rest API docs for the latest serve flow (#1972)
Kartik14 Mar 18, 2024
4db4373
[Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1…
rickzx Mar 18, 2024
949ff2d
[Model][Serve] Add support for LLaVa model in serving engine (#1974)
anibohara2000 Mar 18, 2024
058c583
[Serve] Hot fix for the mixtral serving (#1975)
yongwww Mar 19, 2024
3cbc169
[REST] REST API Deprecated (#1973)
shreygupta2809 Mar 19, 2024
587e341
[Fix] Fix handling of non-numerical cuda arch (#1976)
vinx13 Mar 19, 2024
bed4f53
[Serving][Grammar] Support specifying the main rule in grammar (#1982)
Ubospica Mar 19, 2024
5485782
[Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984)
cyx-6 Mar 19, 2024
06d6115
Fix Llama-2 and Mistral conversation template. Update ConvTemplateReg…
rickzx Mar 20, 2024
39d0865
[SpecDecode] Fix sampler selection. (#1971)
KnowingNothing Mar 20, 2024
a0484bd
[Serving][Grammar] Utility to convert json schema to EBNF grammar (#1…
Ubospica Mar 20, 2024
3b9b51a
Auto updated submodule references
Mar 20, 2024
d4ec25e
[Fix] Fix serve model to adapt the latest Allocator signature (#1989)
MasterJH5574 Mar 20, 2024
c74f176
[Model] Use optimized group gemm for Mixtral (#1988)
vinx13 Mar 20, 2024
244c2e7
[Attn] Fix the construction of attn result merge kernel (#1995)
MasterJH5574 Mar 21, 2024
ddfbcda
[iOS][Android] Add validation of library file for iOS and Android bui…
tqchen Mar 21, 2024
cc36324
Auto updated submodule references
Mar 21, 2024
96d9c8b
[Serve] add allocator in Storage as the upstream change (#1997)
yongwww Mar 21, 2024
0772940
[Compiler] Support IPC memory and customized all-reduce kernels (#1990)
MasterJH5574 Mar 22, 2024
ae97b8d
Auto updated submodule references
Mar 22, 2024
8405cb1
[Model] Fix the top-k TIR script for well-formedness (#2002)
MasterJH5574 Mar 22, 2024
64badb5
Fix invalid use of dataflow var in sampler output (#2003)
vinx13 Mar 22, 2024
837ee53
[Fix] Fix KV cache creation pass after nn.Module changes (#2011)
MasterJH5574 Mar 24, 2024
10f2d00
[iOS] Fix typo in prepare_model_lib.py (#2013)
HuitingLiu Mar 24, 2024
a6de1ff
Remove unstable assertion in KV cache creation dispatch (#2017)
MasterJH5574 Mar 24, 2024
1c8b72e
Auto updated submodule references
Mar 25, 2024
ab9fa81
[SLM] Qwen2 Multi-GPU support (#1985)
tlopex Mar 25, 2024
f04cd3e
more info for preshard (#2027)
na20215 Mar 25, 2024
1c975de
Register stablelm-2 conversation template (#2029)
rickzx Mar 25, 2024
8796fb4
[Serving][Fix] Fix problems in PopenServer (#2032)
Ubospica Mar 26, 2024
a6d31d7
[Quantization] Skip MoE gate layer (#2012)
MasterJH5574 Mar 26, 2024
f2518ab
[Serving][Grammar] Integration of JSON schema generation (#2030)
Ubospica Mar 27, 2024
0a23af5
[Compiler] Support AUTO mode for all-reduce strategy (#2034)
MasterJH5574 Mar 27, 2024
47c8350
[LLaVa] Follow-up for TODOs in LLaVa model (#2010)
anibohara2000 Mar 27, 2024
2d68e64
[Pipeline] Defer GPU IPC memory lowering (#2038)
MasterJH5574 Mar 27, 2024
be42bec
[Model] Add missing broadcast of logit_position for multigpu (#2040)
vinx13 Mar 28, 2024
5ebcda1
[Preshard] apply presharding after quantization (#2039)
vinx13 Mar 28, 2024
a0c0f21
[SLM] Baichuan Multi-GPU support (#2037)
tlopex Mar 28, 2024
34497ea
Auto updated submodule references
Mar 28, 2024
cf8d458
[Model] Skip TVMSynchronize when tracing is not enabled (#2041)
MasterJH5574 Mar 28, 2024
4255a45
[Serving] Support NVTX for benchmarking (#2043)
MasterJH5574 Mar 28, 2024
2b82091
Update huggingface_loader.py
tqchen Mar 28, 2024
522db05
[Serve] Separate callback invocation to another thread in AsyncEngine…
MasterJH5574 Mar 29, 2024
ad068c2
[LLaVa] Fix random token output after first sentence (#2048)
anibohara2000 Mar 29, 2024
b4b8e91
Auto updated submodule references
Mar 29, 2024
1acd5f5
[Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2…
MasterJH5574 Mar 29, 2024
2f171b4
Auto updated submodule references
Mar 29, 2024
55d7dc3
[Serving] CLI Support for SERVE (#2014)
Kartik14 Mar 29, 2024
203afab
[Pipeline] Insert hints to enable cuda graph symbolic capture (#2050)
vinx13 Mar 29, 2024
6431bda
[Loader] Print message when multi-GPU loader is finished (#2051)
vinx13 Mar 30, 2024
12c9808
[KVCache] Support matching arbitrary element offset for aux data (#2057)
MasterJH5574 Mar 30, 2024
af7ef3e
[Serving] Support copy stream in LogitProcessor and GPUSampler (#2058)
MasterJH5574 Mar 30, 2024
2600a70
[SLM] Stablelm Multi-GPU support (#2052)
tlopex Mar 30, 2024
9ecc00e
[KVCache] Introducing single page copy func for KV cache fork (#2060)
MasterJH5574 Mar 30, 2024
e370ac7
[Python] Implement testing.DebugChat for end-to-end model debugging (…
rickzx Mar 30, 2024
069b73a
[Docs] Fix docs for python server and rest call (#2066)
yogeshg Mar 31, 2024
3e91e70
[CI] Enable submodule clone for WASM model compilation (#2068)
MasterJH5574 Mar 31, 2024
ed62796
[Serve] Fork sequence at specified positions (#2067)
MasterJH5574 Mar 31, 2024
5243b27
[SLM] Add support for RWKV6 model (#1977)
Celve Mar 31, 2024
8cac74c
[Quantization] Reorganize utils code in group_quantization (#2055)
vinx13 Apr 1, 2024
8a82f93
[Serving] Bugfix for empty stop string (#2070)
Kartik14 Apr 1, 2024
eb3d1e4
[SLM] Internlm Multi-GPU support (#2072)
tlopex Apr 1, 2024
10017db
[WebGPU] Add mlc wasm runtime, support grammar in web (#2061)
CharlieFRuan Apr 1, 2024
9121126
[Build] Use TVM_HOME environment variable (#2073)
Lunderberg Apr 1, 2024
b7416c0
[Serving] Support input chunking (#2069)
MasterJH5574 Apr 1, 2024
52de798
[Docs] API Code Completion Guide (#2054)
davidpissarra Apr 2, 2024
12ca8fd
Allow "mlc_llm --host" option to override host triple the model compi…
yuxuanchiadm Apr 2, 2024
63fc972
[Web] Move prep emcc deps script to web folder (#2077)
CharlieFRuan Apr 2, 2024
5bc3ffa
[SLM] Qwen Multi-GPU support (#2075)
tlopex Apr 2, 2024
96b8c33
Fix mismatch of metadata func and global symbol (#2078)
vinx13 Apr 3, 2024
1d34527
[Disco] Set worker CPU affinity with env variable (#2042)
MasterJH5574 Apr 3, 2024
7f1aacc
[Quantization] Introduce PerTensor and F8 quantization (#2079)
vinx13 Apr 4, 2024
700206b
[Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081)
MasterJH5574 Apr 4, 2024
2e9cc1c
[Serving] Add cuda profiling in benchmark test (#2084)
yongwww Apr 5, 2024
41da87a
[Grammar] Fix broken grammar tests (#2083)
MasterJH5574 Apr 5, 2024
791623a
[Serving][Fix] Fix chunked prefill condition (#2082)
MasterJH5574 Apr 5, 2024
7e0f102
[Conversation] Fix RedPajama conversation template (#2087)
MasterJH5574 Apr 5, 2024
c2f2e59
[Serving][Refactor] Python interface refactor (#2085)
MasterJH5574 Apr 5, 2024
5cf700b
[Serving] Separating ThreadedEngine creation and initialization (#2090)
MasterJH5574 Apr 5, 2024
d6d3d7e
[Serving] Enhance robustness with small KV capacity (#2091)
MasterJH5574 Apr 5, 2024
a73eae2
[REST] Update REST API docs (#2092)
Kartik14 Apr 5, 2024
466fa8a
[DOCS] Clarify vulkan loader dependency (#2095)
tqchen Apr 5, 2024
a75eb0b
[SLM] Add support for Chatglm3 architecture (#2096)
tlopex Apr 6, 2024
3d564f3
[Quantization] Add OpenCL device (#2097)
mengshyu Apr 6, 2024
61f76c7
[Serving] Support stream=True for Python API (#2098)
MasterJH5574 Apr 6, 2024
50766fd
[Serving][Refactor] OpenAI API Python interface alignment (#2099)
MasterJH5574 Apr 7, 2024
fb24fcf
[DOC] fix small python env install error (#2102)
DiegoCao Apr 7, 2024
cc8b747
[JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101)
anibohara2000 Apr 8, 2024
95d268b
[Model] Use tanh approximation of GeLU in Gemma MLP (#2106)
jeethu Apr 8, 2024
36d0e6a
Auto updated submodule references
Apr 8, 2024
3e71b70
[Quantization] Stricter checks for MoE gate (#2109)
MasterJH5574 Apr 9, 2024
623ed62
Auto updated submodule references
Apr 10, 2024
021c29c
[LLaVa] Fix allowed text model value in config (#2062)
anibohara2000 Apr 10, 2024
c4169d8
Auto updated submodule references
Apr 10, 2024
f832bde
Revert "Allow "mlc_llm --host" option to override host triple the mod…
tqchen Apr 10, 2024
716a5ed
Revert "Auto updated submodule references" (#2117)
MasterJH5574 Apr 10, 2024
6c48755
[Metadata] Include picojson rather than forward declaring (#2118)
MasterJH5574 Apr 10, 2024
39dfa3e
Auto updated submodule references
Apr 10, 2024
7f7c01f
Auto updated submodule references
Apr 11, 2024
a815148
[Serving][Grammar] Porting the json schema converter from python to C…
Ubospica Apr 11, 2024
9b71443
[Model] Use R.topk/cumsum for mixtral (#2107)
vinx13 Apr 11, 2024
880c68a
Enable flashinfer when group_size == 6 (#2124)
vinx13 Apr 12, 2024
4dfb9f0
[SpecDecode] Support Eagle in speculative decoding (#2080)
KnowingNothing Apr 12, 2024
65e4a56
[Pass] Attach non-negative TIR var attributes (#2125)
MasterJH5574 Apr 12, 2024
8e8a921
[Serving][Refactor] Engine constructor interface refactor (#2126)
MasterJH5574 Apr 12, 2024
8139a47
[Serving] Revamp engine mode selection logging info (#2128)
MasterJH5574 Apr 13, 2024
a361119
[SLM] Chatglm3 Multi-GPU support (#2123)
tlopex Apr 14, 2024
661abb2
[Serving] Fix support of large `n` under low max batch size (#2136)
MasterJH5574 Apr 14, 2024
3403a4e
[Docs] Revamp landing page with Engine Python API and server (#2137)
MasterJH5574 Apr 15, 2024
4cbda04
[Target] Update Target tags (#2141)
Hzfengsy Apr 16, 2024
8f33c30
[Util] Support debug debug_compare (#2142)
Hzfengsy Apr 16, 2024
3d25d9d
[Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146)
zxybazh Apr 17, 2024
2de2875
[Serving] fix hardcoded host and port in popen_server (#2147)
yongwww Apr 17, 2024
8c673b4
[Docs] Introductory tutorial (#2145)
MasterJH5574 Apr 17, 2024
9f9436b
[Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2…
MasterJH5574 Apr 17, 2024
2a24f13
[DOCS] Update introduction (#2151)
tqchen Apr 17, 2024
5a37e55
[Serving][Python] Rename Engine to LLMEngine (#2152)
MasterJH5574 Apr 17, 2024
751783b
Auto updated submodule references
Apr 17, 2024
e9a4a0b
[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)
vinx13 Apr 17, 2024
7d3f34e
Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2…
tqchen Apr 18, 2024
8352235
[Serving] EngineConfig refactor (#2159)
MasterJH5574 Apr 18, 2024
ad770d8
[Llama3] Support Llama 3 (#2163)
CharlieFRuan Apr 18, 2024
bee1928
[Fix] Fix llama 3 conv template (#2164)
CharlieFRuan Apr 18, 2024
d6724b1
Auto updated submodule references
Apr 18, 2024
c6edba8
[Serving][HotFix] No `std::move()` for disco CallPacked (#2166)
MasterJH5574 Apr 19, 2024
de98524
[Docs] Update example for Llama3 (#2169)
MasterJH5574 Apr 19, 2024
3dbc1d5
[README] Fix broken link to Python API (#2168)
simonw Apr 19, 2024
856204e
[Docs] Update README (#2170)
MasterJH5574 Apr 19, 2024
855f9a2
[Docs] Documentation of LLMEngine in Python API (#2172)
MasterJH5574 Apr 19, 2024
f87745d
[Docs] Update project website (#2175)
MasterJH5574 Apr 19, 2024
b3b7f23
[Docs][Fix] Update index.md for jekyll failure (#2176)
MasterJH5574 Apr 19, 2024
9216467
[Quantization] Add e4m3 mode and enable fp8 storage type (reland #215…
vinx13 Apr 19, 2024
a50fae0
[Docs] Fix API reference not displayed (#2177)
MasterJH5574 Apr 19, 2024
675319f
[Docs] Update project website (#2180)
MasterJH5574 Apr 19, 2024
0ec6c7a
[Misc] Pass env along when calling `subprocess.run` (#2179)
MasterJH5574 Apr 19, 2024
132ad03
Change OpenAI protocol default value to None and supply using model c…
rickzx Apr 20, 2024
d43e10e
[Serving][Spec] Fix the output inconsistent bug of q0f32 spec decodin…
DearFishi Apr 20, 2024
54a6794
[Serving] Support ThreadedEngine Reload/Unload/Reset (#2185)
MasterJH5574 Apr 21, 2024
8186203
[WASM] Support grammar schema in wasm (#2187)
CharlieFRuan Apr 21, 2024
4994c5c
[Serving] Support loading system library (#2189)
MasterJH5574 Apr 21, 2024
830c908
[Op] Batch verify for speculative decoding (#2186)
spectrometerHBH Apr 22, 2024
a1830c1
[JIT] Better organize JIT and AOT handling (#2191)
tqchen Apr 22, 2024
f1f5cd1
Fix prefill and context flag names in doc (#2192)
ollmer Apr 22, 2024
17a2c6a
[Docs] Update quick start to mention Llama 3 8B (#2196)
EwoutH Apr 22, 2024
253cd0d
[SERVING] Add Conv Template and Function Calling support to JSON FFI …
Kartik14 Apr 22, 2024
12647d5
[Serving] Paged Radix Tree for Prefix Caching (#2183)
cyx-6 Apr 23, 2024
dc3988a
[Serving] Remove mandatory model check in server (#2195)
MasterJH5574 Apr 23, 2024
651c2a0
[Sampler] Enable GPU sampler for draft verification (#2198)
vinx13 Apr 23, 2024
0ed4bcb
[Eagle] Make eagle disco compatible (#2197)
vinx13 Apr 23, 2024
af8206b
[Serving][Spec] Fix normal mode verification for extra draft token (#…
MasterJH5574 Apr 23, 2024
d7c5a6e
[Sampler] Prob renormalization with top p for spec decoding (#2201)
MasterJH5574 Apr 23, 2024
9ec75ee
[Python] Rename LLMEngine to MLCEngine (#2210)
MasterJH5574 Apr 24, 2024
e115dde
[Fix] CUDA architecture detection bug fix (#2211)
mengshyu Apr 24, 2024
55b5c00
[Android ] Enable OpenCL host pointer usage (#2215)
srkreddy1238 Apr 25, 2024
85fffee
[PYTHON][KVCACHE] Enhance the thread limit for opencl (#2216)
krishnaraj36 Apr 25, 2024
71c7b3c
[Serving] Support RWKV for serving (#2111)
Celve Apr 25, 2024
fab0dd3
[Serving] Remove `cli.model_metadata` import from engine base (#2226)
MasterJH5574 Apr 26, 2024
1cdd0f9
[JSONFFIEngine] Support generation config in JSONFFIEngine. Default c…
rickzx Apr 26, 2024
6850529
[Sampler] Fix GPU sampler behavior when batch size is 0 (#2234)
MasterJH5574 Apr 26, 2024
ff72113
[Pass] Support two-stage softmax (#2220)
MasterJH5574 Apr 26, 2024
3139fd7
Auto updated submodule references
Apr 26, 2024
470a42a
[Docs] Update deploy/ios#bring-your-own-model-library (#2235)
nobuhiroYamakado Apr 26, 2024
93c560b
[Op] Top-p cutoff pivot (#2221)
spectrometerHBH Apr 27, 2024
8e7b38a
[Op] Batch Verify: accept proposal when p and q are close enough (#2236)
spectrometerHBH Apr 27, 2024
135bcf9
[Serving] Creating EngineConfig from JSON (#2237)
MasterJH5574 Apr 27, 2024
fd65973
[Bugfix] layer_norm_eps in GPT2Config should be float (#2240)
rickzx Apr 27, 2024
63a3804
[REFACTOR] Migrate JSONFFIEngine to formal namespace (#2241)
tqchen Apr 27, 2024
1a8bad0
[Serving] Share disco sessions among multiple model function tables (…
vinx13 Apr 28, 2024
5a26795
[DOC] Improve Install via environment variable (#2245)
JackWeiw Apr 29, 2024
3cb2ee8
[Sampler] FlashInfer sampling func integration (#2224)
MasterJH5574 Apr 29, 2024
d3d264d
Model Library Delivery (#2139)
Kartik14 Apr 29, 2024
0b7864e
merged
sunggg Apr 29, 2024
b7c93fb
fixed
sunggg Apr 29, 2024
180 changes: 100 additions & 80 deletions README.md
@@ -50,98 +50,118 @@
</table>


**Scalable.** MLC LLM scales universally on NVIDIA and AMD GPUs, cloud and gaming GPUs. Below
showcases our single batch decoding performance with prefilling = 1 and decoding = 256.
## Quick Start

Performance of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 and two AMD Radeon 7900 XTX:
<p float="left">
<img src="site/img/multi-gpu/figure-1.svg" width="40%"/>
<img src="site/img/multi-gpu/figure-3.svg" width="30%"/>
</p>
Here we introduce quick start examples for the chat CLI, Python API, and REST server of MLC LLM.
We use the 4-bit quantized Llama-3 8B model for demonstration purposes.
The pre-quantized Llama-3 weights are available at https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC.
You can also try out the unquantized Llama-3 model by replacing `q4f16_1` with `q0f16` in the examples below.
Please visit our [documentation](https://llm.mlc.ai/docs/index.html) for a detailed quick start and introduction.

Scaling of fp16 and 4-bit CodeLlama-34 and Llama2-70B on A100-80G-PCIe and A10G-24G-PCIe, up to 8 GPUs:
<p float="center">
<img src="site/img/multi-gpu/figure-2.svg" width="100%"/>
</p>
### Installation

## News
MLC LLM is available via [pip](https://llm.mlc.ai/docs/install/mlc_llm.html#install-mlc-packages).
It is always recommended to install it in an isolated conda virtual environment.

* [10/18/2023] [[Post]](https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs) Scalable multi-GPU support for CUDA and ROCm are official.
* [09/02/2023] Prebuilt ROCm 5.7 and CUDA 12.2 package is [available](https://llm.mlc.ai/docs/install/tvm.html#option-1-prebuilt-package).
* [08/25/2023] CodeLlama support is up.
* [08/14/2023] [[Post]](https://blog.mlc.ai/2023/08/09/GPU-Accelerated-LLM-on-Orange-Pi) Mali GPU support is up on Orange Pi.
* [08/09/2023] [[Post]](https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference) ROCm backend is mature to use.
* [08/02/2023] [Dockerfile](https://github.com/mlc-ai/llm-perf-bench/) is released for CUDA performance benchmarking.
* [07/19/2023] Support for Llama2-7B/13B/70B is up.
* [05/22/2023] [[Post]](https://blog.mlc.ai/2023/05/22/bringing-open-large-language-models-to-consumer-devices) RedPajama support is up.
* [05/08/2023] [[Post]](https://blog.mlc.ai/2023/05/08/bringing-hardware-accelerated-language-models-to-android-devices) MLC LLM is now available on Android.
* [05/01/2023] [[Post]](https://blog.mlc.ai/2023/05/01/bringing-accelerated-llm-to-consumer-hardware) MLC LLM is released with Metal, Vulkan and CUDA backends.
* [04/14/2023] [WebLLM](https://github.com/mlc-ai/web-llm) is released prior to MLC LLM with WebGPU and WebAssembly backend.
To verify the installation, activate your virtual environment and run

## Getting Started
```bash
python -c "import mlc_llm; print(mlc_llm.__path__)"
```

Please visit our [documentation](https://llm.mlc.ai/docs/index.html#getting-started) for detailed instructions.
You are expected to see the installation path of the MLC LLM Python package.

## Model Support
### Chat CLI

MLC LLM supports a wide range of model architectures and variants. We have the following prebuilts which you can
use off-the-shelf. Visit [Prebuilt Models](https://llm.mlc.ai/docs/prebuilt_models.html) to see the full list, and [Compile Models via MLC](https://llm.mlc.ai/docs/compilation/compile_models.html) to see how to use models not on this list.
We can try out the chat CLI in MLC LLM with the 4-bit quantized Llama-3 8B model.

<table style="width:100%">
<thead>
<tr>
<th style="width:40%">Architecture</th>
<th style="width:60%">Prebuilt Model Variants</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama</td>
<td>Llama-2, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored</td>
</tr>
<tr>
<td>GPT-NeoX</td>
<td>RedPajama</td>
</tr>
<tr>
<td>GPT-J</td>
<td></td>
</tr>
<tr>
<td>RWKV</td>
<td>RWKV-raven</td>
</tr>
<tr>
<td>MiniGPT</td>
<td></td>
</tr>
<tr>
<td>GPTBigCode</td>
<td>WizardCoder</td>
</tr>
<tr>
<td>ChatGLM</td>
<td></td>
</tr>
<tr>
<td>StableLM</td>
<td></td>
</tr>
<tr>
<td>Mistral</td>
<td></td>
</tr>
<tr>
<td>Phi</td>
<td></td>
</tr>
</tbody>
</table>
```bash
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

Running this command for the first time may take 1-2 minutes.
After that, it launches a chat interface where you can enter your prompt and chat with the model.

```
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;max_gen_len=100;stop=end,stop`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.

user: What's the meaning of life
assistant:
What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.

The concept of the meaning of life has been debated and...
```

### Python API

We can run the Llama-3 model with the chat completion Python API of MLC LLM.
You can save the code below into a Python file and run it.

```python
from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
messages=[{"role": "user", "content": "What is the meaning of life?"}],
model=model,
stream=True,
):
for choice in response.choices:
print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()
```

**The Python API of `mlc_llm.MLCEngine` fully aligns with the OpenAI API**.
You can use MLCEngine in the same way as
[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
for both synchronous and asynchronous generation.
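
Since the interface mirrors OpenAI's, a non-streaming call returns a complete response object. The sketch below is a minimal example under that assumption; the `choices[0].message.content` access path follows the standard OpenAI chat-completion schema and is not taken verbatim from the MLC LLM docs.

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Non-streaming request: assumes the returned object follows the OpenAI
# chat-completion schema, with the full reply at choices[0].message.content.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize MLC LLM in one sentence."}],
    model=model,
)
print(response.choices[0].message.content)

engine.terminate()
```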

If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncMLCEngine` instead.
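
The snippet below is a minimal sketch of concurrent asynchronous generation, assuming `AsyncMLCEngine` exposes the same `chat.completions.create` interface as `MLCEngine` in awaitable form; the exact streaming call signature is an assumption rather than a verbatim excerpt from the docs.

```python
import asyncio

from mlc_llm import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def main():
    engine = AsyncMLCEngine(model)
    # Assumed interface: awaiting create(stream=True) yields an async
    # iterator of chat-completion chunks, mirroring the sync example above.
    async for response in await engine.chat.completions.create(
        messages=[{"role": "user", "content": "What is the meaning of life?"}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content, end="", flush=True)
    print()
    engine.terminate()


asyncio.run(main())
```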

### REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests.
The server is fully compatible with the OpenAI API.

```bash
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```

The server listens at `http://127.0.0.1:8000` by default, and you can use `--host` and `--port`
to set a different host and port.
When the server is ready (it prints `INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)`),
we can open a new shell and send a cURL request with the following command:

```bash
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"messages": [
{"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
]
}' \
http://127.0.0.1:8000/v1/chat/completions
```
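
The same request can also be sent from Python. The sketch below uses the `requests` package and assumes the response body follows the OpenAI chat-completion schema, with the generated text at `choices[0]["message"]["content"]`.

```python
import requests

# Mirrors the cURL example above, sent from Python instead.
payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [
        {
            "role": "user",
            "content": "Hello! Our project is MLC LLM. What is the name of our project?",
        }
    ],
}
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=300
)
resp.raise_for_status()
# Assumes an OpenAI-style response body.
print(resp.json()["choices"][0]["message"]["content"])
```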

## Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include
* [Python API](https://llm.mlc.ai/docs/deploy/python.html)
* [Python API](https://llm.mlc.ai/docs/deploy/python_engine.html)
* [OpenAI-compatible Rest-API](https://llm.mlc.ai/docs/deploy/rest.html)
* [C++ API](https://llm.mlc.ai/docs/deploy/cli.html)
* [JavaScript API](https://llm.mlc.ai/docs/deploy/javascript.html) and [Web LLM](https://github.com/mlc-ai/web-llm)
@@ -165,7 +185,7 @@ The underlying techniques of MLC LLM include:

<details>
<summary>References (Click to expand)</summary>

```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
1 change: 1 addition & 0 deletions android/library/prepare_libs.sh
@@ -27,6 +27,7 @@ cmake .. \
-DMLC_LLM_INSTALL_STATIC_LIB=ON \
-DCMAKE_SKIP_INSTALL_ALL_DEPENDENCY=ON \
-DUSE_OPENCL=ON \
-DUSE_OPENCL_ENABLE_HOST_PTR=ON \
-DUSE_CUSTOM_LOGGING=ON \

cmake --build . --target tvm4j_runtime_packed --config release