# Add a New Model Architecture in MLC-LLM Using the SLM Workflow

In this tutorial, we demonstrate how to add a new model architecture in MLC-LLM using the new SLM workflow. SLM is the new model compilation workflow that brings modularized, Python-first compilation to MLC-LLM, allowing users and developers to support new models and features more seamlessly.

As an example, under SLM, the amount of code required to define a Mistral model architecture is only about half of that under the old workflow.

We still recommend reading through the [old tutorial](https://github.com/mlc-ai/notebooks/blob/main/tutorial/How_to_add_model_architeture_in_MLC_LLM.ipynb) for background on the TVM Unity core and TensorIR.

Here, we use [GPT-2](https://huggingface.co/gpt2) for demonstration purposes. GPT-2 is a Transformer model pretrained on a very large corpus of English data in a self-supervised fashion, and it can be used to guess the next word in a sentence. Its model definition in Huggingface can be found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py).

Learn more about MLC LLM here: https://mlc.ai/mlc-llm/docs.

Click the button below to get started:

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_add_new_model_architecture_in_SLM.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

We start by setting up the environment. First, let us create a new Conda environment, in which we will run the rest of the notebook.

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab**
- If you are running this in a Google Colab notebook, you do not need to create a conda environment.
- However, be sure to change your runtime to GPU by going to `Runtime` > `Change runtime type` and setting the Hardware accelerator to "GPU".
- Compiling GPT-2 **may** require more RAM than the default Colab allocates. You may need to either upgrade Colab to a paid plan (so that `runtime shape` can be set to `High RAM`), or use another environment.
  - We have also noticed that rerunning just the build portion several times sometimes succeeds without exceeding the default RAM amount.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the driver version number as well as what GPUs are currently available for use.

In [1]:
!nvidia-smi

Thu Dec 28 01:58:01 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Next, let's download the MLC-AI and MLC-Chat nightly build packages. If you are running in a Colab environment, then you can just run the following command. Otherwise, go to https://mlc.ai/package/ and replace the command below with the one that is appropriate for your hardware and OS.

**Google Colab**: If you are using Colab, you may see red warnings such as **"You must restart the runtime in order to use newly installed versions."** For our purposes, we can disregard them; the notebook will still run correctly.

In [2]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu122 mlc-chat-nightly-cu122 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu122
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu122-0.12.dev1940-cp310-cp310-manylinux_2_28_x86_64.whl (471.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 471.9/471.9 MB 3.3 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu122
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu122-0.1.dev712-cp310-cp310-manylinux_2_28_x86_64.whl (65.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.8/65.8 MB 9.8 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu122)
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting cloudpickle (from mlc-ai-nightly-cu122)
  Using cached cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Collecting decorator (from mlc-ai-nightly-cu122)
  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting

**Google Colab**: Since we ignored the warnings/errors in the previous cell, run the following cell to verify the installation did in fact occur properly.

In [3]:
!python -c "import tvm; print('tvm installed properly!')"
!python -c "import mlc_chat; print('mlc_chat installed properly!')"

tvm installed properly!
mlc_chat installed properly!


Then, we clone the [mlc-llm repository](https://github.com/mlc-ai/mlc-llm).

**Google Colab**: Note, this will install into the mlc-llm folder. You can click the folder icon on the left menu bar to see the local file system and verify that the repository was cloned successfully.

In [4]:
!git clone --recursive https://github.com/mlc-ai/mlc-llm.git

fatal: destination path 'mlc-llm' already exists and is not an empty directory.


We then install `mlc-llm` as a package, so that we can use its functions outside of this directory.

In [5]:
!cd mlc-llm && pip install -e . && cd -

Obtaining file:///content/mlc-llm
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: mlc_llm
  Building editable for mlc_llm (pyproject.toml) ... done
  Created wheel for mlc_llm: filename=mlc_llm-0.1.dev714+g09ec207-0.editable-py3-none-any.whl size=7384 sha256=17a4f12bfeb1c0f0c416b4e1fc138df38c9dcd7db3b008f7d90ce9573721384e
  Stored in directory: /tmp/pip-ephem-wheel-cache-_66don6s/wheels/60/f6/e4/f9ebad71d5663623c41caead0eb5663a07b045d94af8e40d00
Successfully built mlc_llm
Installing collected packages: mlc_llm
  Attempting uninstall: mlc_llm
    Found existing installation: mlc_llm 0.1.dev714+g09ec207
    Uninstalling mlc_llm-0.1.dev714+g09ec207:
      Successfully uninstalled mlc_llm-0.1.dev714+g09ec207
Successfully installed mlc

## Define the GPT-2 Model

Create a `gpt2` folder under `mlc-llm/python/mlc_chat/model/`. Its structure will look like the following:

```
mlc-llm/python/mlc_chat/model/gpt2/
├── gpt2_loader.py          # Load and convert the weights from Huggingface
├── gpt2_model.py           # Define the model architecture and configuration
├── gpt2_quantization.py    # Define quantization schemes
└── __init__.py
```

We first focus on `gpt2_model.py`. This file defines the GPT-2 model architecture in a modularized fashion using `tvm.relax.frontend.nn.Module`, similar to the PyTorch counterpart.

### Define a Config Class in gpt2_model.py

Let's first define a config class that is almost a direct translation of Huggingface's [GPT2Config](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/configuration_gpt2.py). The attributes of this class must have the same names as the corresponding attributes in the Huggingface config; otherwise, the Huggingface config won't be loaded properly.

The `__post_init__` function is called after all the dataclass attributes are initialized.

In [6]:
import dataclasses
import math

from mlc_chat.support.config import ConfigBase

from tvm import te, tir
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import Tensor, op
from typing import Any, Dict, Optional

@dataclasses.dataclass
class GPT2Config(ConfigBase):  # pylint: disable=too-many-instance-attributes
    """Configuration of the GPT-2 model."""

    vocab_size: int
    n_embd: int
    n_layer: int
    n_head: int
    layer_norm_epsilon: float
    n_inner: int = -1
    scale_attn_by_inverse_layer_idx: bool = False
    # Internal configs used by MLC-LLM
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    def __post_init__(self):
        if self.n_inner is None or self.n_inner == -1:
            self.n_inner = 4 * self.n_embd

        self.context_window_size = self.kwargs["n_positions"]

        # Internal configs initialization
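
To illustrate how a field such as `n_positions`, which is not declared on the class, can end up in `kwargs` and then be consumed by `__post_init__`, here is a minimal standard-library sketch. `ToyGPT2Config` and its `from_dict` are hypothetical stand-ins written for this illustration; the real `ConfigBase.from_dict` in MLC-LLM may be implemented differently.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ToyGPT2Config:
    """Hypothetical, trimmed-down config used only for illustration."""

    n_embd: int
    n_inner: int = -1
    context_window_size: int = 0
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    def __post_init__(self):
        # Runs after the generated __init__ has assigned every field.
        if self.n_inner is None or self.n_inner == -1:
            self.n_inner = 4 * self.n_embd
        if self.context_window_size == 0:
            self.context_window_size = self.kwargs["n_positions"]

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ToyGPT2Config":
        # Assumed behavior: declared fields are passed through,
        # everything else is collected into `kwargs`.
        declared = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        known = {k: v for k, v in source.items() if k in declared}
        extra = {k: v for k, v in source.items() if k not in declared}
        return cls(**known, kwargs=extra)


toy = ToyGPT2Config.from_dict(
    {"n_embd": 768, "n_positions": 1024, "hidden_act": "gelu_new"}
)
print(toy.n_inner, toy.context_window_size)
```

This also shows why attribute names must match the Huggingface config: a misspelled field would silently land in `kwargs` instead of being assigned.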

### Define the Model Architecture in gpt2_model.py

With `tvm.relax.frontend.nn.Module`, we are able to define the model architecture in a modularized fashion. It looks very similar to the PyTorch style, except that the forward function does not actually perform the computation. Instead, it traces the operator graph using the placeholders that are passed as inputs.
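
The idea of tracing rather than computing can be sketched in plain Python. The `Placeholder` class below is a hypothetical toy written for this illustration and is not part of TVM: it carries only a name and a shape, and each operation records a graph node instead of touching any data.

```python
class Placeholder:
    """Toy symbolic tensor: a name and a shape, but no values."""

    def __init__(self, name, shape, graph):
        self.name = name
        self.shape = tuple(shape)
        self.graph = graph  # shared list that collects the traced operators

    def matmul(self, other):
        if self.shape[-1] != other.shape[-2]:
            raise ValueError("inner dimensions do not match")
        out_shape = self.shape[:-1] + (other.shape[-1],)
        out = Placeholder(f"t{len(self.graph)}", out_shape, self.graph)
        # Record the operator; nothing is multiplied here.
        self.graph.append(("matmul", self.name, other.name, out.name))
        return out


def forward(x, w):
    # Reads like eager code, but only builds the graph.
    return x.matmul(w)


graph = []
x = Placeholder("x", (1, 2, 768), graph)
w = Placeholder("w", (768, 2304), graph)
y = forward(x, w)
print(y.shape)
print(graph)
```

Running `forward` yields an output shape and a one-node graph, which is roughly the kind of information a compiler needs before any real tensor exists.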

Here we only present the GPT2Attention module. The entire model definition can be found [here](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_chat/compiler/model/gpt2/gpt2_model.py).

In [7]:
class GPT2Attention(nn.Module):
    def __init__(self, config, layer_idx: Optional[int] = None):
        self.embed_dim = config.n_embd
        self.num_heads = config.n_head
        self.head_dim = self.embed_dim // self.num_heads
        self.scale_attn_by_inverse_layer_idx = config.scale_attn_by_inverse_layer_idx
        self.layer_idx = layer_idx

        self.c_attn = nn.Linear(
            in_features=self.embed_dim,
            out_features=3 * self.num_heads * self.head_dim,
            bias=True,
        )
        self.c_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)

        self.k_cache = nn.KVCache(config.context_window_size, [self.num_heads, self.head_dim])
        self.v_cache = nn.KVCache(config.context_window_size, [self.num_heads, self.head_dim])

    def forward(
        self,
        hidden_states: Tensor,
        attention_mask: Tensor,
    ):
        d, h, t = self.head_dim, self.num_heads, 2
        b, s, _ = hidden_states.shape
        assert b == 1, "Only support batch size 1 at this moment."

        qkv = self.c_attn(hidden_states)
        qkv = op.reshape(qkv, (b, s, 3 * h, d))
        q, k, v = op.split(qkv, 3, axis=2)

        self.k_cache.append(op.squeeze(k, axis=0))
        self.v_cache.append(op.squeeze(v, axis=0))
        k = op.reshape(self.k_cache.view(t), (b, t, h, d))
        v = op.reshape(self.v_cache.view(t), (b, t, h, d))

        q = q.permute_dims([0, 2, 1, 3])  # [b, h, s, d]
        k = k.permute_dims([0, 2, 1, 3])  # [b, h, t, d]
        v = v.permute_dims([0, 2, 1, 3])  # [b, h, t, d]

        attn_weights = op.matmul(
            q, k.permute_dims([0, 1, 3, 2])  # [b, h, s, d] x [b, h, d, t] = [b, h, s, t]
        ) / math.sqrt(d)

        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        dtype = attn_weights.dtype
        if attention_mask is not None:
            attn_weights = attn_weights.maximum(tir.min_value(dtype)).minimum(attention_mask)

        if dtype == "float32":
            attn_weights = op.softmax(attn_weights, axis=-1)
        else:
            attn_weights = op.softmax(attn_weights.astype("float32"), axis=-1).astype(dtype)
        # [b, h, s, t] x [b, h, t, d] => [b, h, s, d] => [b, s, h, d]
        output = op.matmul(attn_weights, v)
        return self.c_proj(output.permute_dims([0, 2, 1, 3]).reshape((b, s, h * d)))
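
The masking step in `forward` clamps the scores from below and then takes an element-wise minimum with the mask, so positions where the mask holds the lowest representable value receive zero weight after softmax. The following is a minimal pure-Python sketch of that idea for a single head; it uses Python's double-precision floats rather than `float32`, and the helper names are hypothetical.

```python
import math
import sys

LOWEST = -sys.float_info.max   # stands in for tir.min_value(dtype)
HIGHEST = sys.float_info.max   # "keep this position" marker in the mask


def mask_scores(scores, mask):
    """Clamp from below, then take the element-wise minimum with the mask."""
    return [
        [min(max(s, LOWEST), m) for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(scores, mask)
    ]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # masked entries underflow to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Two query positions attending over two key positions (s = t = 2).
scores = [[0.2, 0.9], [0.4, 0.1]]
# Causal mask: the first query may not look at the second key.
mask = [[HIGHEST, LOWEST], [HIGHEST, HIGHEST]]

probs = [softmax(row) for row in mask_scores(scores, mask)]
print(probs)
```

The first row puts all of its weight on the first key, while the second row is an ordinary softmax over both keys.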

Note that some common built-in modules are already provided. For example, the `nn.Linear` and `nn.KVCache` modules used here are both built-in modules from `tvm.relax.frontend.nn`. A full list of built-in modules can be found [here](https://github.com/apache/tvm/blob/unity/python/tvm/relax/frontend/nn/modules.py).

Similarly, many common built-in operations that operate on Tensors are provided, for example `op.reshape`, `op.matmul`, and `op.softmax`. A full list of built-in operations can be found [here](https://github.com/apache/tvm/blob/unity/python/tvm/relax/frontend/nn/op.py).

### Validating the Correctness of an `nn.Module`

Once you have finished defining an `nn.Module`, you can compare it against its Huggingface PyTorch counterpart to make sure it behaves correctly. We can do so with the following steps:
- Initialize an MLC `nn.Module` from the class we just defined
- Load the corresponding module in Huggingface PyTorch model
- Copy the parameter weights from Huggingface module to MLC module
- Use `jit` to provide a TVM run-time that converts the MLC module to a PyTorch-compatible runnable module
- Feed the same PyTorch tensor as input to both modules and compare the output

In [40]:
# 1. Initialize an MLC `nn.Module` from the class we just defined

from tvm.relax.frontend.nn import spec

config_dict = {
    "architectures": ["GPT2LMHeadModel"],
    "bos_token_id": 50256,
    "eos_token_id": 50256,
    "hidden_act": "gelu_new",
    "n_ctx": 1024,
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 12,
    "n_positions": 1024,
    "layer_norm_epsilon": 1e-05,
    "scale_attn_by_inverse_layer_idx": False,
    "vocab_size": 50257,
}

attn_spec = {
    "forward": {
        "hidden_states": spec.Tensor([1, 2, 768], dtype="float32"),
        "attention_mask": spec.Tensor([1, 1, 2, 2], dtype="float32"),
    }
}

config = GPT2Config.from_dict(config_dict)
mlc_attn = GPT2Attention(config, layer_idx=5)

Note that we have also defined a JSON-like dictionary as the ModuleSpec, which describes how the placeholders in the module's forward function are defined. For example, here we define `hidden_states` to be of shape [1, 2, 768], which corresponds to [batch_size, sequence_length, n_embd].

Now, we can export this module to TVM IRModule and parameters, and we can do a sanity check on the shape and data type of the parameters.

In [41]:
mod, named_params = mlc_attn.export_tvm(spec=attn_spec)

for name, param in named_params:
    print(name, param.shape, param.dtype)

c_attn.weight [2304, 768] float32
c_attn.bias [2304] float32
c_proj.weight [768, 768] float32
c_proj.bias [768] float32


In [42]:
# 2. Load the corresponding module in Huggingface PyTorch model
from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [43]:
hf_attn = hf_model.transformer.h[5].attn

In [44]:
# 3. Copy the parameter weights from Huggingface module to MLC module

hf_state_dict = hf_attn.state_dict()
new_state_dict = {}

# Transpose the weight in attention layer since Huggingface implementation uses Conv1D instead of Linear
for k, v in hf_state_dict.items():
    if "weight" in k:
        new_state_dict[k] = v.T
    else:
        new_state_dict[k] = v

mlc_attn.load_state_dict(new_state_dict, strict=True)

([], [])

In [45]:
# 4. Use jit to provide a TVM run-time that converts the MLC module to a PyTorch-compatible runnable module

torch_attn = mlc_attn.jit(spec=attn_spec, device="cpu")

In [46]:
# 5. Feed the same PyTorch tensor as input to both modules and compare the output

import torch

x = torch.rand((1, 2, 768), dtype=torch.float32)

mask = torch.full((1, 1, 2, 2), torch.finfo(torch.float32).max, dtype=torch.float32)
mask[0, 0, 0, 1] = torch.finfo(torch.float32).min

hf_y = hf_attn.forward(x)    # In Huggingface attention implementation, causal mask is automatically applied
mlc_y = torch_attn["forward"](x, mask)
assert torch.allclose(hf_y[0], mlc_y, atol=1e-5)

### Define a Loader in gpt2_loader.py

In `gpt2_loader.py`, we define how to convert the parameters from Huggingface to the format used by the MLC model.

The loader class will return an [`ExternMapping`](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_chat/loader/mapping.py) that contains two kinds of mappings:
- Source -> MLC parameter mapping: for example, parameter renaming, parameter transformation, etc.
- Unused mapping: parameters in the source that are not used in the MLC model definition.

In GPT-2, we need to transpose the c_attn, c_proj and c_fc weights since GPT-2 uses Conv1D. To do so, we supply a mapping function as follows:

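```
# This snippet runs inside the loader, once per transformer layer;
# `i` is the layer index, and `functools` must be imported.
for conv1d_weight_name in ["attn.c_attn", "attn.c_proj", "mlp.c_proj", "mlp.c_fc"]:
    src_name = f"h.{i}.{conv1d_weight_name}.weight"
    mlc_name = f"transformer.{src_name}"
    mapping.add_mapping(
        mlc_name,
        [src_name],
        # Transpose the Conv1D weight and cast it to the MLC parameter's dtype.
        functools.partial(
            lambda x, dtype: x.transpose().astype(dtype),
            dtype=named_parameters[mlc_name].dtype,
        ),
    )
```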
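
To see why the transpose is needed, the sketch below compares a Conv1D-style layer, which stores its weight as `[in_features, out_features]` and computes `x @ W + b`, with a Linear-style layer, which stores `[out_features, in_features]` and computes `x @ W.T + b`. It is a pure-Python illustration with hypothetical helper names and tiny hand-picked matrices, not the actual Huggingface or TVM code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add_bias(rows, bias):
    return [[v + b for v, b in zip(row, bias)] for row in rows]


def conv1d_forward(x, weight, bias):
    # Conv1D convention: weight has shape [in_features, out_features].
    return add_bias(matmul(x, weight), bias)


def linear_forward(x, weight, bias):
    # Linear convention: weight has shape [out_features, in_features].
    return add_bias(matmul(x, transpose(weight)), bias)


x = [[1.0, 2.0]]                               # [batch, in_features]
conv_weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # [in_features=2, out_features=3]
bias = [0.5, 0.5, 0.5]

conv_out = conv1d_forward(x, conv_weight, bias)
linear_out = linear_forward(x, transpose(conv_weight), bias)
print(conv_out, linear_out)
```

Both calls produce the same output, which is why copying a transposed Conv1D weight into a Linear layer preserves the layer's behavior.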

Some renaming is also needed for the GPT-2 parameter conversion to work. Please refer to [gpt2_loader.py](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_chat/model/gpt2/gpt2_loader.py).

## Add the Model to the Supported Pre-built Model Workflow

Once the entire model is defined in SLM, including the model architecture, model loader and model quantizer, we can add it to the supported pre-built model workflow.

In [`mlc-llm/python/mlc_chat/model/model.py`](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_chat/model/model.py), add the GPT-2 model to the `MODELS` dictionary:

```
"gpt2": Model(
    name="gpt2",
    model=gpt2_model.GPT2LMHeadModel,
    config=gpt2_model.GPT2Config,
    source={
        "huggingface-torch": gpt2_loader.huggingface,
        "huggingface-safetensor": gpt2_loader.huggingface,
    },
    quantize={
        "no-quant": gpt2_quantization.no_quant,
        "group-quant": gpt2_quantization.group_quant,
    },
)
```
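
A registry like this lets the tooling look up everything it needs from a model type string. The sketch below shows the general pattern with standard-library Python only; `ToyModel`, `TOY_MODELS`, `dispatch` and the two quantization functions are hypothetical names invented for this illustration and do not mirror the real `Model` class.

```python
import dataclasses
from typing import Callable, Dict


@dataclasses.dataclass
class ToyModel:
    """Hypothetical registry entry: a name plus named quantization handlers."""

    name: str
    quantize: Dict[str, Callable[[str], str]]


def no_quant(model_name: str) -> str:
    return f"{model_name}: weights kept as-is"


def group_quant(model_name: str) -> str:
    return f"{model_name}: weights group-quantized"


TOY_MODELS: Dict[str, ToyModel] = {
    "gpt2": ToyModel(
        name="gpt2",
        quantize={"no-quant": no_quant, "group-quant": group_quant},
    ),
}


def dispatch(model_type: str, quant_kind: str) -> str:
    """Look up the model entry, then the handler for the quantization kind."""
    if model_type not in TOY_MODELS:
        raise ValueError(f"Unknown model type: {model_type}")
    return TOY_MODELS[model_type].quantize[quant_kind](model_type)


result = dispatch("gpt2", "group-quant")
print(result)
```

Adding a new architecture then amounts to adding one more entry to the dictionary, without touching the dispatch code.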

## Compile GPT-2 Model Libraries and Weights

The following steps will be the same as the general model compilation workflow [here](https://llm.mlc.ai/docs/compilation/compile_models.html).

In [48]:
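# Create the directory and clone the HF weights into it.
# Each `!` command runs in its own subshell, so `cd` is chained with the commands that depend on it.
!mkdir -p dist/models && cd dist/models && git lfs install && git clone https://huggingface.co/gpt2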

Git LFS initialized.
Cloning into 'gpt2'...
remote: Enumerating objects: 84, done.
remote: Counting objects: 100% (84/84), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 84 (delta 31), reused 69 (delta 28), pack-reused 0
Unpacking objects: 100% (84/84), 1.66 MiB | 3.81 MiB/s, done.
Filtering content: 100% (11/11), 5.23 GiB | 30.99 MiB/s, done.


In [49]:
# Convert weight
!mlc_chat convert_weight ./dist/models/gpt2/ --quantization q4f16_1 -o dist/gpt2-q4f16_1-MLC

[2023-12-28 03:12:59] INFO utils.py:160: NumExpr defaulting to 2 threads.
[2023-12-28 03:13:02] INFO auto_config.py:116: Found model configuration: dist/models/gpt2/config.json
[2023-12-28 03:13:02] INFO auto_device.py:75: Found device: cuda:0
[2023-12-28 03:13:03] INFO auto_device.py:84: Not found device: rocm:0
[2023-12-28 03:13:03] INFO auto_device.py:84: Not found device: metal:0
[2023-12-28 03:13:04] INFO auto_device.py:84: Not found device: vulkan:0
[2023-12-28 03:13:04] INFO auto_device.py:84: Not found device: opencl:0
[2023-12-28 03:13:04] INFO auto_device.py:33: Using device: cuda:0
[2023-12-28 03:13:04] INFO auto_weight.py:70: Finding weights in: dist/models/gpt2
[2023-12-28 03:13:04] INFO auto_weight.py:129: Found source weight format: huggingface-torch. Source configuration: dist/models/gpt2/pytorch_model.bin
[2023-12-28 03:13:04] INFO auto_weight.py:149: Not found Huggingface Safetensor
[2023-

In [52]:
# 1. gen_config: generate mlc-chat-config.json and process tokenizers
!mlc_chat gen_config ./dist/models/gpt2 \
    --quantization q4f16_1 --conv-template gpt2 \
    -o dist/gpt2-q4f16_1-MLC/

# 2. compile: compile model library with specification in mlc-chat-config.json
!mlc_chat compile ./dist/gpt2-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/gpt2-q4f16_1-MLC/gpt2-q4f16_1-cuda.so

[2023-12-28 03:22:31] INFO utils.py:160: NumExpr defaulting to 2 threads.
[2023-12-28 03:22:32] INFO auto_config.py:116: Found model configuration: dist/models/gpt2/config.json
[2023-12-28 03:22:32] INFO auto_config.py:155: Found model type: gpt2. Use `--model-type` to override.
[2023-12-28 03:22:32] INFO gpt2_model.py:44: context_window_size not found in config.json. Falling back to n_positions (1024)
[2023-12-28 03:22:32] INFO gen_config.py:114: [generation_config.json] Setting bos_token_id: 50256
[2023-12-28 03:22:32] INFO gen_config.py:114: [generation_config.json] Setting eos_token_id: 50256
[2023-12-28 03:22:32] INFO gen_config.py:128: Not found tokenizer config: dist/models/gpt2/tokenizer.model
[2023-12-28 03:22:32] INFO gen_config.py:126: Found tokenizer config: dist/models/gpt2/tokenizer.json. Copying to dist/gpt2-q4f16_1-MLC/tokenizer.json
[2023-12-28 03:22:32] INFO gen_config.py:126: Fou