Merge upstream nov7 #52

Merged
merged 108 commits into from
Nov 7, 2023

Conversation

masahi
Member

@masahi masahi commented Nov 6, 2023

No description provided.

davidpissarra and others added 30 commits October 7, 2023 22:36
Fix two bugs in kv-cache pop loop

Bug 1: old code would stop early because output_ids was shortened in-place during the loop

Bug 2: off-by-one in backoff size due to break
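The two bugs can be illustrated with a minimal Python sketch. This is not the project's actual code: `pop_trailing_buggy`, `pop_trailing`, and `should_drop` are hypothetical names, and the loop is only assumed to resemble the original in that it pops trailing tokens while counting how far to back off.

```python
def pop_trailing_buggy(output_ids, should_drop):
    """Hypothetical buggy loop: the bound is re-read from a list that shrinks."""
    backoff = 0
    i = 0
    while i < len(output_ids):  # len() decreases on every pop, so the loop ends early
        if not should_drop(output_ids[-1]):
            break
        output_ids.pop()
        backoff += 1
        i += 1
    return backoff


def pop_trailing(output_ids, should_drop):
    """Hypothetical fix: test the tail directly and count each pop exactly once."""
    backoff = 0
    while output_ids and should_drop(output_ids[-1]):
        output_ids.pop()
        backoff += 1
    return backoff


ids = [1, 2, 9, 9, 9, 9]
buggy_count = pop_trailing_buggy(list(ids), lambda t: t == 9)  # stops after 3 pops
fixed_count = pop_trailing(list(ids), lambda t: t == 9)        # removes all 4
```

Putting the drop test in the loop condition also removes the `break`, so the counter cannot drift from the number of pops actually performed.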
…1017)

This commit adds an optional `--pdb` flag to the `build.py` script. If
passed, any exception raised that would otherwise terminate the script
will first enter a pdb post-mortem, allowing the error to be
inspected.
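A flag like this can be wired up roughly as follows. This is a sketch under assumptions, not the code in `build.py`: the helper names `run_with_optional_pdb` and `parse_args` are hypothetical, and only the standard-library `argparse` and `pdb.post_mortem` calls are taken as given.

```python
import argparse
import pdb
import sys


def parse_args(argv=None):
    """Parse a single optional --pdb flag (other build arguments omitted)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pdb",
        action="store_true",
        help="Enter a pdb post-mortem on any uncaught exception.",
    )
    return parser.parse_args(argv)


def run_with_optional_pdb(main, enable_pdb, post_mortem=pdb.post_mortem):
    """Run main(); on failure, optionally open the debugger, then re-raise."""
    try:
        return main()
    except Exception:
        if enable_pdb:
            # sys.exc_info()[2] is the traceback of the exception being handled.
            post_mortem(sys.exc_info()[2])
        raise
```

A script would call `run_with_optional_pdb(main, parse_args().pdb)`. The `post_mortem` parameter exists only so the behavior can be exercised without an interactive session.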
…ai#1040)

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model
Support for the stablelm-3b-4e1t model
* Iterate model prebuilts docs

* small fix
This PR separates the tokenizer creation function and the
random number generator out of `llm_chat.cc` as a preparation
step for batching inference support, since these functions/modules
are used in the same way in batching inference.
* add verbose stats to mlc-chat REST API

* update docs
* [Transform] Apply split_rotary optimization on prefill

Prior to this commit, the `transform.fuse_split_rotary_embedding`
function was only applicable to the `decode` function of a Llama-type
model.  This was due to the sequence length being restricted to one,
both in the pattern-match rule and in the `split_rotary` function, and
the function being restricted to operate only on the `decode`
function.

This commit updates the `transform.fuse_split_rotary_embedding` pass
to be a `tvm.ir.transform.Pass`, operating on all applicable matches
in the `IRModule`.  The `split_rotary` function is now produced as a
fully-generic function, with static parameters substituted in
afterwards.  At this stage, the sequence length is retained as a
dynamic parameter, such that it can be used by the `prefill` function.

* Avoid multiple kernel launches for split_rotary
…i#1055)

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
…ma-2 families (mlc-ai#1032)

* fix

* reflect feedback

---------

Co-authored-by: “Sunghyun <sunggg@umich.com>
`--force-reinstall` will reinstall all dependencies to a python package,
which is unnecessary. `-U` is a better choice in this case.
This PR introduces the initial batched input support for llama
models. To make the code manageable, we keep both the single-sequence
handling flow and the batching handling flow in the Llama modeling.

Now, with `--enable-batching` as a build argument, we build Llama
for the batched version.

NOTE: The paged attention kernel/TIR func are not included in this PR,
so currently the built library with batching enabled is not runnable.
We will follow up with the attention kernel in the future.

This PR guarantees that the existing single-sequence inference (Python
API, CLI, etc.) is not broken.

P.S. The batching flow is subject to bug fixes as we integrate with
the attention function and run the e2e flow in the future.
* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size"

* Add get_num_key_value_heads method to StableLM3bConfig
This commit removes the `if`/`elif` chain in `core.py`, where the body
of each conditional assigns the same `mod, param_manager, params,
model_config`, and is identical except for the choice of model being
built.
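One common way to collapse such a chain is a lookup table keyed by model name. The sketch below assumes that pattern and is not the actual `core.py` code: `MODEL_BUILDERS`, `build_model`, and the stand-in model objects are all hypothetical.

```python
from types import SimpleNamespace


def _make_fake_model(name):
    """Stand-in for a model module exposing a get_model(args, config) function."""
    def get_model(args, config):
        return f"mod<{name}>", f"param_manager<{name}>", [], config
    return SimpleNamespace(get_model=get_model)


# Hypothetical registry: one entry per supported model family.
MODEL_BUILDERS = {
    "llama": _make_fake_model("llama"),
    "gpt_neox": _make_fake_model("gpt_neox"),
}


def build_model(model_category, args, config):
    """Select the model by name, then run the one shared assignment."""
    try:
        model = MODEL_BUILDERS[model_category]
    except KeyError:
        raise ValueError(f"Model {model_category!r} is not supported") from None
    mod, param_manager, params, model_config = model.get_model(args, config)
    return mod, param_manager, params, model_config
```

With this layout, supporting a new model means adding one registry entry rather than another branch.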
This commit replaces the single-parameter
`relax_model.param_manager.create_quantize_func` function with a
method on the `ParamManager`, `create_parameter_transformation`.  This
avoids potential typos between `param_manager` as the imported Python
module `mlc_llm.relax_model.param_manager` and an instance of the
`ParamManager` class named `param_manager`, and makes the
functionality easier to find.

This function also takes an optional `optimize_parameter_order` flag,
defaulting to `True`, which applies the `ReorderTransformFunc` pass.
Since the `ReorderTransformFunc` is intended to be used with several
configuration objects owned by `ParamManager`, this simplifies the
common path of producing an optimally-ordered parameter transformation
module.
PR mlc-ai#1048 updated the signature of softmax in the built model library
and changed the temperature buffer shape in ChatModule. This left
some existing demos unable to run, since we did not do a round of model
library updates.

This PR reverts the ChatModule change, and adds back the softmax
function in non-batching case. With this PR, the regression should
be fixed.
…ai#1074)

This PR lifts the device string parsing (just a few lines)
to a standalone function, so that the serving side
can make use of this function as well.

Tested the Python API; it does not seem to incur any regression.
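As an illustration of what a standalone device-string parser might look like, here is a small sketch. The function name `parse_device_str`, the `"name:index"` format, and the tuple return value are assumptions made for the example; the commit message does not state the real signature.

```python
from typing import Tuple


def parse_device_str(device: str) -> Tuple[str, int]:
    """Split a string such as "cuda:1" into ("cuda", 1); the index defaults to 0."""
    name, sep, index = device.partition(":")
    if not name:
        raise ValueError(f"Invalid device string: {device!r}")
    # int() raises ValueError for a malformed index such as "cuda:x".
    device_id = int(index) if sep else 0
    return name, device_id
```

Keeping the parser free of any chat-module state is what lets both the chat entry points and a serving layer call it.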
The pass `fuse-split-rotary` assumes the compute dtype is fp16, which
it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the
compute is based on fp32 instead. This PR strengthens the check guard.
This PR establishes the compiler components in MLC-Chat Python API,
which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definition of an LLM, which, as the
very first stab, contains only `LlamaForCasualLM`. It is decomposed into
three files:
- `llama_config.py`: common configurations for Llama, where we define
  relevant configurations of its architecture, as well as include
  standard config file for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like
`nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and
  pytorch parameters.

The parameters component contains the basic functionality of parameter mapping,
and the loaders that effectively convert parameters from PyTorch to MLC
according to the mapping specified. Currently, only `HFTorchLoader` is
implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite
straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading
time transformation on MLC parameters, while pre-quantized parameter
loading is effectively parameter loading after MLC's `nn.Module` is
quantized.

Two unittests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module`
using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load
parameters from HuggingFace PyTorch format.

In addition, `mlc_chat.support` is established for utility functions, which
now contains two utils:
- `config.py` which supports reading configurations into dataclasses
from JSON file or Python dict. On top of Python dataclass, it throws
irrelevant fields into `cls.kwargs`, which is helpful when loading
HuggingFace configuration file;
- `tqdm.py` which contains tqdm-related utilities, primarily redirecting
logging and printing to work nicely with tqdm.
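The idea behind the `config.py` helper, keeping known fields as dataclass attributes and collecting everything else in `kwargs`, can be sketched as below. `ToyConfig`, `from_dict`, and `from_json_str` are hypothetical names chosen for illustration; the actual helper in `mlc_chat.support` may be structured differently.

```python
import dataclasses
import json
from typing import Any, Dict


@dataclasses.dataclass
class ToyConfig:
    """Hypothetical config with two known fields plus a catch-all dict."""
    hidden_size: int
    num_hidden_layers: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ToyConfig":
        # Every declared field except the catch-all itself counts as "known".
        known_names = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        known = {k: v for k, v in source.items() if k in known_names}
        extra = {k: v for k, v in source.items() if k not in known_names}
        return cls(**known, kwargs=extra)

    @classmethod
    def from_json_str(cls, text: str) -> "ToyConfig":
        return cls.from_dict(json.loads(text))


cfg = ToyConfig.from_dict(
    {"hidden_size": 4096, "num_hidden_layers": 32, "torch_dtype": "float16"}
)
# cfg.kwargs now holds {"torch_dtype": "float16"}
```

This is convenient for HuggingFace-style config files, which tend to carry many keys that a given model definition does not need.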
anibohara2000 and others added 29 commits November 1, 2023 12:16
Use scoped storage instead of Downloads directory

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
This PR fixes the group quantization and adds related unit tests.
This PR enables weight conversion in command line.
Sample command: `python3 -m mlc_chat.cli.convert_weight --config dist/models/llama-2-13b-chat-hf/ --quantization "q4f16_1" --output dist/test/`
…lc-ai#1178)

[Fix] Update q4f16 quantization with the new mutator name rule
…tral (mlc-ai#1087)

* mistral base

* Add sliding window mask making and its tests

* Small changes for sliding window mask

* Clean up mask making

* Remove kv_seq_len

* Add prefill chunking, handle max window size in SWA

* Add interleave kv

* Temporary fix for kv seq len

* Pass in more shapes to SWA prefill and decode in runtime

* mistral var fix

* Small changes regarding shape passing

* Small fix on chunk size

* Add build args, fix mlc chat config dump

* mistral system prompt
---------

Co-authored-by: David Pissarra <david.pissarra@tecnico.ulisboa.pt>
Co-authored-by: David Pissarra <61968959+davidpissarra@users.noreply.github.com>
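For readers unfamiliar with the sliding-window mask mentioned in the commit list above, the following is a minimal pure-Python sketch of the usual definition. It assumes the convention that token `i` may attend to itself and to the `window - 1` tokens before it; the function name `sliding_window_mask` is hypothetical, and the PR's implementation (with prefill chunking and interleaved KV) may well differ.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions (i - j < window).
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=4, window=2)
# mask[3] == [False, False, True, True]: position 3 sees only positions 2 and 3.
```

With `window >= seq_len` this reduces to an ordinary causal mask.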
This PR primarily does a major refactoring to introduce a Python API that
is consistent with the CLI API. In addition, it includes the following
fixes and enhancements:

- More info provided to `isort` for better formatting in `pyproject.toml`;
- Print out the default value of all arguments in argparse command line;
- Ensure `--device` is always available locally when doing weight
  conversion;
- Add argument echoing in weight conversion to be consistent with its
  counterpart in compilation;
- Add a consistency checker to make sure the shapes/dtypes of all
  tensors from weight conversion are consistent with compilation;
- Echo the total size of parameters;
- Better logging of each parameter's shape and dtype, and whether or not
  it is quantized;
- More structure robustification, renaming `parameter/` to `loader/` to
  be more explicit about its intention;
- Inline and remove `ParamQuantizer` into the loader to improve logging
  and the logic flow;
- Always add instructions "Use `--xxx` to override" for any options that
  are auto detected to be more informative to end users;
- Fix wrong shape calculation when quantizing `nn.Embedding`;
- Fix wrong dtype calculation in group quantization when the input dtype
  is different from model dtype (e.g. "float32" in torch, but the model
  dtype in quantization is fp16 in `q4f16_1`);
- Fix inconsistent param names in layers such as `GroupQuantizeLinear`;
- Fix dtype inconsistency when a parameter is not quantized;
- Fix existing unittests.
Add docs for RestAPI

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
This PR enables `llm-vscode` extension API support for copilot-like code completion, following [HF's LSP](https://github.com/huggingface/llm-ls). It is fully compatible with `CodeLlama` and `starcoder` on mlc-llm.

- huggingface/llm-vscode#103 enhances extension user experience when used with mlc-llm rest api.

Thanks @pacman100, who came up with this in his latest blog post: https://huggingface.co/blog/personal-copilot
PR mlc-ai#1203 introduced some unnecessary and redundant logging messages.
This PR removes them.
The error message below

```
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h: In member function 'std::string picojson::value::to_str() const':
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:494:37: error: expected ')' before 'PRId64'
  494 |       SNPRINTF(buf, sizeof(buf), "%" PRId64, u_.int64_);
      |               ~                     ^~~~~~~
      |                                     )
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:81:1: note: 'PRId64' is defined in header '<cinttypes>'; did you forget to '#include <cinttypes>'?
   80 | #include <errno.h>
  +++ |+#include <cinttypes>
   81 | #include <inttypes.h>

```

indicates that the `__STDC_FORMAT_MACROS` flag is not turned on for some
reason.
The breakage resulted from newer syntax being used for type
annotations, as part of mlc-ai#592.
So long as `mlc_chat.interface.openai_api` wasn't imported, the
breaking changes were not encountered.  In
mlc-ai#1107, the addition of `from
.interface.openai_api import ChatMessage` caused this module to be
imported, breaking compatibility of `mlc_chat.ChatModule` with
Python3.8.

This commit updates the type annotations to the supported syntax.
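The commit message does not say which constructs were involved. Two frequent causes of this kind of breakage are subscripting built-in containers (`list[str]`, evaluated at runtime only on Python 3.9+) and the `X | None` union syntax (Python 3.10+). The sketch below shows 3.8-compatible spellings; the function `collect_contents` is a hypothetical example, not code from `openai_api`.

```python
from typing import Dict, List, Optional


# Newer spellings that fail when evaluated on Python 3.8:
#     def collect_contents(messages: list[dict[str, str]], limit: int | None = None) -> list[str]
# Equivalent annotations using the typing module, which work on 3.8:
def collect_contents(
    messages: List[Dict[str, str]],
    limit: Optional[int] = None,
) -> List[str]:
    """Return the "content" of each message, optionally truncated to `limit` items."""
    contents = [message["content"] for message in messages]
    return contents if limit is None else contents[:limit]
```

Because such annotations are evaluated when the defining module is imported, the failure only surfaces once something imports that module, which matches the behavior described above.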
* [SLM] Enable loading from AWQ pre-quantized weight.

* remove awq_loader.py

* Update to the latest commit

* Delete llama_parameter.py

* update unittest

* fix lint

* upd

* add Llama-2-7B-AWQ
@masahi masahi merged commit f369d7f into octoml:batch-serving Nov 7, 2023
5 checks passed