
Merge with mlc-ai/main (d3d264d4b05d73e9757375013b842254f052c6ed, April 29th 2024) #265

Merged
merged 252 commits into mlc-serve-v0.2.0 on Apr 29, 2024

Conversation

@sunggg (Member) commented Apr 29, 2024

No description provided.

Ubospica and others added 30 commits February 24, 2024 08:35
This PR introduces logprobs support with OpenAI API
compatibility. It enhances the sampler with a function to get
the top-probability tokens (supporting at most 5 tokens for now).

To make it easy to pass logprob results back from the serving engine
to the frontend, we pass the logprob results as a JSON string following
the OpenAI API spec.

Unit tests are added to ensure the correctness of logprobs.
The logprobs support also works with speculative decoding.
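As a rough illustration of the shape of such a payload (the field names follow the OpenAI Chat Completions logprobs spec; the tokens and values below are made up, not produced by the engine):

```
import json

# Hypothetical logprob entries for two sampled tokens; values are illustrative only.
logprob_results = {
    "content": [
        {
            "token": "Hello",
            "logprob": -0.12,
            # At most 5 top-probability tokens are reported per position for now.
            "top_logprobs": [
                {"token": "Hello", "logprob": -0.12},
                {"token": "Hi", "logprob": -2.31},
            ],
        },
        {"token": "!", "logprob": -0.56, "top_logprobs": [{"token": "!", "logprob": -0.56}]},
    ]
}

# The engine passes this back to the frontend serialized as a JSON string.
print(json.dumps(logprob_results))
```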
This PR supports Mixtral in MLC serve. The main change is
introducing the Mistral conversation template to the Python registry
so that MLC Serve can use it.

Besides that, this PR updates the KV cache capacity analysis to
make it more accurate in terms of usage calculation, while staying
conservative since there is a known issue regarding batch-prefill
embedding taking that may lead to OOM. We will follow up
on the issue with a fix in the future and then enable the estimation
to use more GPU vRAM.
Prior to this PR, `u_char` was used, but it is not a standard
type in C++, which caused Windows build failures.

This PR fixes it by using `unsigned char`.
…#1852)

Instead of a Python function that returns an updated `IRModule`, the
new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`,
which can be applied to an `IRModule`.
* Create __init__.py

* Add files via upload

* Update model.py

* Update model_preset.py

* Update conv_templates.cc

* Update internlm_loader.py

* Update internlm_quantization.py

* fix name of notes

* Update model.py

* Migration

* fix pylint issue

* fix pylint issue

* fix pylint error

* Update internlm_loader.py

* Update __init__.py

* Update __init__.py

* Delete python/mlc_chat/model/internlm/__init__.py

* Add files via upload
Prior to this commit, a model name with multiple path
components (e.g. `dist/models/group_name/model_name`) would have
duplicated path components
(e.g. `dist/group_name/artifact_path/group_name/libname.so`).
This commit resolves the duplication.
* [KVCache] Add max num threads to KVCache kernels, fix WebGPU

* Read max_num_threads_per_block when available

* Change merge state in place kernel

* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>

* Change util function name

---------

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>
…1860)

This PR moves the import of transformers into the function body
of the tiktoken tokenizer conversion, so we do not have a forced
dependency on transformers.
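The pattern here is simply a deferred import; a minimal sketch, with a hypothetical function name:

```
def convert_tiktoken_tokenizer(model_path: str):
    """Convert a tiktoken tokenizer; transformers is imported only when needed."""
    # Importing inside the function body avoids a hard dependency on transformers
    # for users who never hit this conversion path.
    from transformers import AutoTokenizer  # pylint: disable=import-outside-toplevel

    return AutoTokenizer.from_pretrained(model_path)
```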
This PR adds RWKV5 support with RNNState, an interface similar to
PagedAttention.

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Following mlc-ai#1854, this PR registers the ChatML conversation template.
Sets the entry functions for a module.  This utility is intended for
cases where a module contains several externally-exposed functions
and only one is desired for use (e.g. separating out a
`transform_params` function from an `IRModule` that also contains
inference functions).  This commit only updates the external
visibility, after which `relax.transform.DeadCodeElimination()` can be
applied.
…i#1856)

This allows it to be used as part of an optimization pipeline specified
as a `tvm.ir.transform.Sequential`.
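A sketch of how the returned pass composes into a pipeline; the exact arguments of `optimize_mod_pipeline` and the `mod` variable are assumed here:

```
import tvm

# Sequential is itself a Pass, so the pipeline can be applied to an IRModule directly.
seq = tvm.ir.transform.Sequential(
    [
        optimize_mod_pipeline(),                    # assumed to need no arguments here
        tvm.relax.transform.DeadCodeElimination(),
    ]
)

with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)  # `mod` is an existing IRModule
```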
mlc-ai#1867)

This PR is the 3rd part of the grammar-guided generation.
It integrates the grammar framework into the generation
process and supports JSON output for now.

The API this PR provides is compatible with the OpenAI API.

### APIs
#### Python API
```
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    # Defaults to ResponseFormat(type="text")
    response_format: ResponseFormat = field(default_factory=ResponseFormat)
```

#### REST API
```
response_format: { "type": "text" }  # text generation, by default
response_format: { "type": "json_object" }  # JSON generation
response_format: { "type": "json_object", "json_schema": "..." }  # JSON generation with schema
```

JSON generation with schema is not supported yet,
but is planned for the future.
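For illustration, a request against an OpenAI-compatible endpoint could look like the sketch below; the server URL and model name are placeholders:

```
import requests

payload = {
    "model": "local-model",  # placeholder model name
    "messages": [{"role": "user", "content": "List three fruits as a JSON object."}],
    # Constrain the engine to emit valid JSON.
    "response_format": {"type": "json_object"},
}

resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```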

### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode.
This will be further optimized in the future.
This PR brings field `n` to the generation config and thereby
supports parallel generation. This parallel generation effectively
leverages the "fork" functionality of the paged KV cache.

This PR supports specifying the number of parallel generations
`n` in the standard OpenAI ChatCompletion API. This is the last
feature toward OpenAI API feature completeness.
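A hedged usage sketch with the standard `openai` client pointed at a local OpenAI-compatible server (base URL and model name are placeholders):

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")  # placeholder server

response = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    n=3,  # request three parallel generations of the same prompt
)

for choice in response.choices:
    print(choice.index, choice.message.content)
```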
Sometimes the scm checkout can time out; this PR adds a retry for that.
Prior to this PR, the TIR attention kernels did not cast matmul
operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V values (at the level
of a few hundred), the fp16 multiplication exceeds the range of
fp16 and sometimes leads to NaN attention results.

This PR fixes this issue.
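A tiny NumPy illustration of the failure mode and the fix; the numbers are made up but at the magnitude described above:

```
import numpy as np

q_val = np.float16(300.0)  # Q/K values at the level of a few hundred
k_val = np.float16(300.0)

# A single fp16 product already exceeds the fp16 maximum (~65504) and overflows to inf,
# which can then propagate into NaN through the attention softmax.
print(q_val * k_val)                          # inf

# Casting the operands to fp32 before multiplying keeps the result in range.
print(np.float32(q_val) * np.float32(k_val))  # 90000.0
```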
…lc-ai#1857)

Prior to this commit, the `ReorderTransformFunc` required several
components of the `ParamManager` in order to be used.  The functionality it
provides, reordering dataflow blocks to minimize the liveset, is
useful outside of the context of the `ParamManager`.  This commit
makes the following changes, allowing it to be used independently of
the `ParamManager`.

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`

- Allow parameters to be separate `func.params`, rather than a single
  bundled tuple parameter.
This PR migrates Phi-2 to paged KV cache attention as part of the model definition migration according to mlc-ai#1749.

Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
…c-ai#1874)

The use of `call_inplace_packed` and `call_pure_packed` in the old
flow is outdated due to signature changes. This PR fixes the issue.
PR mlc-ai#1852 missed applying the BundleModelParams pass and thus made
the compiled models not runnable through ChatModule (mlc-ai#1864). This PR
fixes the issue.
As pointed out by mlc-ai#1830, this PR fixes the Android app download
link in docs.
This PR adopts suggestions from the OpenAI API parallel-generation
`n` support in mlc-ai#1868. The main update in this PR is to make
RequestState a standalone class, which was previously a typedef
of `std::vector<RequestStateEntry>`.

This PR also fixes a bug in prefill that could cause engine failure
when `n` is large.
Kartik14 and others added 29 commits April 22, 2024 17:22
…lc-ai#2190)

This PR adds conversation template support to the JSON FFI Engine.
It also adds function calling and passes stop strings to the generation config.

Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
This PR introduces the Paged Radix Tree data structure, as the foundation and prerequisite of prefix caching.
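Not the actual implementation, but a toy sketch of the idea behind prefix reuse with a token-level trie (the real structure is paged and radix-compressed):

```
class PrefixNode:
    """Toy trie node keyed by token id."""

    def __init__(self):
        self.children = {}


class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def longest_shared_prefix(self, tokens):
        """Return how many leading tokens are already cached (reusable KV entries)."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


tree = PrefixTree()
tree.insert([1, 2, 3, 4])                        # tokens of a previously served request
print(tree.longest_shared_prefix([1, 2, 3, 9]))  # 3 leading tokens can reuse cached KV
```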
This PR removes the mandatory model check in the server, since as of now
we serve at most one engine, which means there is always a unique
engine being served. As issue mlc-ai#2155 points out, the model check
in the server can be a bad experience when the model string mismatches.
* [Eagle] Attach gpu verifier to model

* WIP

* WIP

* fix

* Enable GPU verifier

* lint

* lint
* [Eagle] Make BatchSelectLastHidden able to run on the controller
…lc-ai#2206)

This PR updates the draft verification of normal-mode speculative
decoding. Prior to this PR, we did not effectively leverage all the
draft tokens, and this PR fixes the issue.
)

This PR introduces a renormalization interface with regard to top-p
values for speculative decoding. This helps simplify the
logic of the speculative decoding verification stage, as all probabilities
have already been updated with the top-p values, so top-p no longer needs
to be taken into consideration.

So for speculative decoding, we always renormalize the probability
distribution before sampling/verifying. For the non-speculative decoding
mode, we keep using the previous flow, which applies top-p together
when sampling.

Co-authored-by: Wuwei Lin <wuwei@apache.org>
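A NumPy sketch of the renormalization step described above; it mirrors the idea rather than the actual GPU kernel:

```
import numpy as np

def renorm_top_p(probs, top_p):
    """Zero out tokens outside the top-p nucleus and renormalize the rest."""
    order = np.argsort(probs)[::-1]                  # token ids by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest prefix reaching top_p mass
    keep = order[:cutoff]
    renormed = np.zeros_like(probs)
    renormed[keep] = probs[keep]
    return renormed / renormed.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(renorm_top_p(probs, top_p=0.9))                # [0.5263 0.3158 0.1579 0.    ]
```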
This commit renames the LLMEngine to MLCEngine.
This commit returns a list of integers and adds an assert to check that the CUDA architecture string contains only numbers.

Co-authored-by: msyu <msyu@pllab.cs.nthu.edu.tw>
Take advantage of the OpenCL host pointer to improve copy performance.
It gives a 2x improvement for TIR-based paged attention on OpenCL Adreno.
…#2226)

This PR removes the imports of functions in `cli.model_metadata` from
engine_base.py. The file `cli.model_metadata` is not designed to be
imported directly, and importing functions from it repeatedly reports
warnings like

```
RuntimeWarning: 'mlc_llm.cli.model_metadata' found in sys.modules after
import of package 'mlc_llm.cli', but prior to execution of
'mlc_llm.cli.model_metadata'; this may result in unpredictable behaviour
```
…onfig values to NOT_GIVEN (mlc-ai#2225)

* Change OpenAI protocol default value to None in JSON FFI engine

* [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN
This PR adds an early exit for the GPU sampler, which prior to this
commit launched GPU kernels even when the batch size is 0.

The zero batch size case can happen when parallel generation of a request
coexists with engine preemption. In this case, the GPU sampler should
just synchronize and return, and not launch any GPU kernel.
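A toy CPU-side sketch of the early-exit guard (the real sampler additionally synchronizes the device before returning):

```
import numpy as np

def sample_batch(probs, rng):
    """Toy sampler with the zero-batch early exit; the real one runs GPU kernels."""
    if probs.shape[0] == 0:
        # Preemption plus parallel generation can leave an empty batch:
        # skip all sampling kernels and return immediately.
        return []
    return [int(rng.choice(probs.shape[1], p=row)) for row in probs]

rng = np.random.default_rng(0)
print(sample_batch(np.empty((0, 8)), rng))                    # []
print(sample_batch(np.array([[0.1, 0.9], [0.7, 0.3]]), rng))  # e.g. [1, 0]
```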
This PR introduces the compiler pass that rewrites the normal softmax
to a two-stage softmax. This is based on our finding that when the
vocabulary size is large, the normal softmax cannot achieve high enough
parallelism on the GPU. So we partition the workload into two stages
for better parallelism and better performance.
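A NumPy sketch of the two-stage idea: stage one computes a per-chunk max and exp-sum, stage two combines them and normalizes. This mirrors the math only; the actual pass emits GPU kernels over the vocabulary dimension:

```
import numpy as np

def two_stage_softmax(logits, num_chunks=4):
    """Numerically stable softmax computed over vocabulary chunks."""
    chunks = np.array_split(logits, num_chunks)

    # Stage 1: each chunk independently produces its local max and local exp-sum.
    local_max = np.array([c.max() for c in chunks])
    local_sum = np.array([np.exp(c - m).sum() for c, m in zip(chunks, local_max)])

    # Stage 2: combine per-chunk statistics into the global max and denominator,
    # then normalize every element with them.
    global_max = local_max.max()
    denom = (local_sum * np.exp(local_max - global_max)).sum()
    return np.concatenate([np.exp(c - global_max) for c in chunks]) / denom

logits = np.random.default_rng(0).normal(size=32000)
reference = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
assert np.allclose(two_stage_softmax(logits), reference)
```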
remove model metadata step (#1)

* remove model metadata step and make minor fixes
This commit introduces the GPU top-p cutoff operator for efficient
probability renormalization under top-p.
This PR supports creating EngineConfig from a JSON string, which
is useful for JSONFFIEngine and its API bindings.

This commit also removes the device from the EngineConfig for better
clarity.
This PR migrates JSONFFIEngine to a formal namespace.
It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function from FlashInfer.
We integrate the one without top-p for now.
* add model lib delivery

* fix lint
@sunggg sunggg merged commit 50e2686 into mlc-serve-v0.2.0 Apr 29, 2024