# Extensions to More Model Variants

In the previous tutorial [Compiling Llama-2 with MLC-LLM in Python](https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_compile_llama2_with_mlc_llm.ipynb), we saw how to compile a model variant explicitly supported by MLC-LLM (i.e. listed in the [supported model variants](https://mlc.ai/mlc-llm/docs/prebuilt_models.html#supported-model-architectures)). "Explicitly supporting" a model variant primarily means defining its own [conversation template](https://github.com/mlc-ai/mlc-llm/blob/main/cpp/conv_templates.cc) (e.g. [Gorilla](https://github.com/mlc-ai/mlc-llm/pull/288), [Guanaco](https://github.com/mlc-ai/mlc-llm/pull/497), [WizardLM](https://github.com/mlc-ai/mlc-llm/pull/489)).

In this tutorial, we demonstrate that compiling a model variant not on the list is quite simple, as long as its architecture is [supported](https://mlc.ai/mlc-llm/docs/prebuilt_models.html#supported-model-architectures) (e.g. `llama`, `rwkv`, `gpt-neox`, etc.). We follow these steps:
0. Environment setup
1. Download the weights and build the model
2. Update MLC chat configuration JSON
3. Chat with the compiled model
4. (Optional) Upload the compiled model weights
5. (Optional) Use the pre-built model weights you uploaded

If you would like to define a new model architecture, you can follow [this tutorial](https://mlc.ai/mlc-llm/docs/tutorials/customize/define_new_models.html), which is considerably more involved.

Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_extensions_to_more_model_variants.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Step 0. Environment setup

We will start by setting up the environment. First, let us create a new Conda environment, in which we will run the rest of the notebook.

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab**
- If you are running this in a Google Colab notebook, you do not need to create a conda environment.
- However, be sure to change your runtime to GPU by going to `Runtime` > `Change runtime type` and setting the Hardware accelerator to "GPU".
- Compiling some models **may** require more RAM than the default Colab runtime allocates. You may need to either upgrade Colab to a paid plan (so that `runtime shape` can be set to `High RAM`) or use another environment.
  - We have also noticed that rerunning just the build portion several times sometimes succeeds without exceeding the default RAM.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the driver version number as well as what GPUs are currently available for use.

In [None]:
!nvidia-smi
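If you prefer to do this check from Python (for example, to branch on the result later in the notebook), the following is a minimal sketch. The helper name `query_gpus` is our own and not part of MLC-LLM; it only wraps the `nvidia-smi` command-line tool and returns an empty list when the tool is unavailable.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, driver_version' string per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; check the runtime type.")
```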

Next, let's install the MLC-AI and MLC-Chat nightly build packages. If you are running in a Colab environment, you can just run the following command. Otherwise, go to https://mlc.ai/package/ and replace the command below with the one that is appropriate for your hardware and OS.

**Google Colab**: If you are using Colab, you may see red warnings such as **"You must restart the runtime in order to use newly installed versions."** For our purposes, we can disregard them; the notebook will still run correctly.

In [None]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

**Google Colab**: Since we ignored the warnings/errors in the previous cell, run the following cell to verify the installation did in fact occur properly.

In [None]:
!python -c "import tvm; print('tvm installed properly!')"
!python -c "import mlc_chat; print('mlc_chat installed properly!')"
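The same check can be done without spawning a new interpreter. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name introduced here for illustration. It reports which modules cannot be located, without actually importing them.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["tvm", "mlc_chat"])
print("All packages found." if not missing else f"Missing: {missing}")
```

Note that finding a module is a weaker guarantee than importing it successfully, so the `python -c` cell above remains the more thorough check.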

Then, we clone the [mlc-llm repository](https://github.com/mlc-ai/mlc-llm).

**Google Colab**: Note, this will install into the mlc-llm folder. You can click the folder icon on the left menu bar to see the local file system and verify that the repository was cloned successfully.

In [None]:
!git clone --recursive https://github.com/mlc-ai/mlc-llm.git

We then install `mlc-llm` as a package, so that we can use its functions outside of this directory.

In [None]:
%cd mlc-llm
!pip install -e .

We then create a folder to store the downloaded parameters and compiled models. Typically, we store the compiled models under `dist`, and downloaded (i.e. uncompiled) parameters under `dist/models`. This is also the default directory setup for `mlc-llm`.

In [None]:
!mkdir -p dist/models

We have now finished setting up the environment. If you are working in a notebook, you need to run the `exit()` cell below to restart the runtime; otherwise, the notebook cannot find the modules right after installing them. Simply run this cell, then run the subsequent cells after the runtime finishes restarting.

In [None]:
exit()

## Step 1. Download the weights and build the model

This is the main section of the tutorial. In order to build the model using the Python function `build_model()`, we use a dataclass `BuildArgs` to organize the arguments for building the model. There are generally two ways of building the model:
1. Specify the `hf_path` in the `BuildArgs`, which allows `build_model()` to first download the parameters from Hugging Face before compiling them.
2. Download the parameters yourself, and specify `model` in the `BuildArgs`, so that `build_model()` can locate the downloaded parameters locally.

In this tutorial, we will use the first method.

**Note**: Many model variants post the **parameter delta** on Hugging Face rather than the actual parameters. For instance, look at the [instructions for compiling WizardLM](https://github.com/mlc-ai/mlc-llm/pull/489). In cases like WizardLM, we have to proceed with the second method after reconstructing the parameters from the delta.

For more details on the arguments, please see [the docs for the CLI's arguments](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html#compile-command-specification) for now, which are equivalent to those of `BuildArgs`. We will update the documentation for `BuildArgs` soon. (Alternatively, you can look at the source code.)

As mentioned above, we will use `hf_path` to specify which model variant we would like to compile. Feel free to enter the Hugging Face path of the model you are interested in below, but please make sure that it contains the actual parameters and not the delta.

We will use [GOAT-AI's GOAT-7B-Community](https://huggingface.co/GOAT-AI/GOAT-7B-Community) for this tutorial. Note, however, that we give other options in the dropdown menu, and any Hugging Face path can also be entered directly.

(Note: Rerun from this cell if **Google Colab** crashes)

In [None]:
%cd mlc-llm
!ls

In [None]:
# @title Model Parameters
hf_path = 'GOAT-AI/GOAT-7B-Community' # @param ["georgesung/llama2_7b_chat_uncensored", "GOAT-AI/GOAT-7B-Community"] {allow-input: true}

We import `mlc_llm`, which we installed using `pip install -e .`. `mlc_chat` and `tvm` are included in the nightly packages we installed earlier.

In [None]:
import mlc_llm
import mlc_chat
import tvm

We then specify the arguments for building the model.

In [None]:
build_args = mlc_llm.BuildArgs(
    hf_path=hf_path,
    quantization="q4f16_1",
    target="cuda"
)

print(build_args)

`mlc_llm.build_model` is the main entry point here. It takes in a `BuildArgs` to start the entire model compilation workflow.

**Google Colab**: If you are using Colab, the line below may require more RAM than the default Colab runtime provides. You may need to either upgrade to a paid Colab plan or run it in another environment. Sometimes, rerunning just the build portion a few times eventually succeeds without exceeding the RAM Colab provides.

**The cell may take ~15 minutes to finish, mainly because downloading the parameters from Hugging Face takes a while.**

In [None]:
lib_path, model_path, chat_config_path = mlc_llm.build_model(build_args)

`mlc_llm.build_model(build_args)` returns a tuple of three paths.

`lib_path` is the path to the specific binary that has been built.

`model_path` is the path to the folder containing the compiled model parameters and other model specific configuration needed for other `mlc` modules.

`chat_config_path` is the path to the specific `.json` configuration needed to have this model work with `mlc_chat`, which we discuss in the next section.
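As a rough sketch of where these files end up, the helper below derives the artifact locations from the Hugging Face path and the quantization mode. The `dist/<model>-<quantization>/params` layout is an assumption inferred from the paths used later in this tutorial, and `expected_artifacts` is a hypothetical helper, not an `mlc_llm` function. In practice, prefer the paths returned by `build_model()`.

```python
from pathlib import PurePosixPath


def expected_artifacts(hf_path, quantization, dist_root="dist"):
    """Guess artifact locations, assuming the dist/<model>-<quantization>/params layout."""
    # The model name is the last component of the Hugging Face path.
    model_name = hf_path.rstrip("/").split("/")[-1]
    artifact_dir = PurePosixPath(dist_root) / f"{model_name}-{quantization}"
    params_dir = artifact_dir / "params"
    return {
        "artifact_dir": str(artifact_dir),
        "params_dir": str(params_dir),
        "chat_config": str(params_dir / "mlc-chat-config.json"),
    }


paths = expected_artifacts("GOAT-AI/GOAT-7B-Community", "q4f16_1")
print(paths["chat_config"])
# prints: dist/GOAT-7B-Community-q4f16_1/params/mlc-chat-config.json
```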

## Step 2. Update MLC chat configuration in Python

We first take a look at the `mlc-chat-config.json` file we generated.

In [None]:
!cat dist/GOAT-7B-Community-q4f16_1/params/mlc-chat-config.json

We can see that this file contains several parameters that `mlc_chat` needs when running the chat application with this specific model. For example, the `conv_template` we are using for GOAT-7B is `llama_default`, which is defined in [cpp/conv_templates.cc](https://github.com/mlc-ai/mlc-llm/blob/main/cpp/conv_templates.cc).

The current logic is that, whenever we compile a model that does not have its own conversation template defined in `cpp/conv_templates.cc` (which is the case for GOAT-7B, unlike, say, WizardLM), we concatenate its `model_category` with `_default`, hence `llama_default`.

Note that we have not developed a default template for other model categories. In that case, you might need to modify the `mlc-chat-config.json` manually:
- First, point `"conv_template"` to one of the conversation templates defined in `cpp/conv_templates.cc`.
- Then, if needed, customize the options in `mlc-chat-config.json` by following the [tutorial here](https://mlc.ai/mlc-llm/docs/get_started/mlc_chat_config.html#configure-mlc-chat-json).
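The fallback rule can be sketched as follows. This is an illustration of the behaviour described above, not the actual `mlc_llm` implementation: the function name, the prefix-matching strategy, and the `"wizardlm"` template name in the example mapping are all assumptions made for the sake of the example.

```python
def choose_conv_template(model_name, model_category, dedicated_templates):
    """Pick a dedicated template if one matches the model name, else '<category>_default'.

    `dedicated_templates` maps lowercase model-name prefixes to template names.
    """
    lowered = model_name.lower()
    for prefix, template in dedicated_templates.items():
        if lowered.startswith(prefix):
            return template
    # No dedicated template: fall back to the category default.
    return f"{model_category}_default"


# Illustrative mapping only; the real template names live in cpp/conv_templates.cc.
dedicated = {"wizardlm": "wizardlm"}
print(choose_conv_template("GOAT-7B-Community", "llama", dedicated))
# prints: llama_default
```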


Determining the correct `conv_config` may involve some trial and error. Sometimes, the model's developers provide useful information, such as the expected prompt format, on their website.

In the case of GOAT-AI, after referring to [the tutorial](https://mlc.ai/mlc-llm/docs/get_started/mlc_chat_config.html#configure-mlc-chat-json), we will change the `stop_str` and `system` entries, so that the JSON config will end up being:

```json
{
    "model_lib": "GOAT-7B-Community-q4f16_1",
    "local_id": "GOAT-7B-Community-q4f16_1",
    "conv_template": "llama_default",
    "temperature": 0.7,
    "repetition_penalty": 1.0,
    "top_p": 0.95,
    "mean_gen_len": 128,
    "max_gen_len": 512,
    "shift_fill_factor": 0.3,
    "tokenizer_files": [],
    "model_category": "llama",
    "model_name": "GOAT-7B-Community",
    "conv_config": {
        "stop_str": "\n\n",
        "system": ""
    }
}
```

We can use the `mlc_chat.ConvConfig` and `mlc_chat.ChatConfig` objects to do this in Python, so that we do not have to edit the JSON file manually and can instead change what we need programmatically at runtime. The following example builds a `chat_config` that, once passed to `chat_module.reset_chat(chat_config)`, gives us the desired chat configuration.

In [None]:
conv_config = mlc_chat.ConvConfig(
    stop_str='\n\n',
    system=''
)
chat_config = mlc_chat.ChatConfig(
    conv_template="llama_default",
    conv_config=conv_config
)

## (Optional) Step 2b. Update MLC chat configuration JSON

It is also possible to modify the JSON directly rather than use the Python API to make these modifications. Simply copy and paste the above modified JSON into the appropriate JSON file to have the changes take effect.
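If you would rather script the edit than paste it by hand, the sketch below uses only Python's standard `json` module. The helper `update_chat_config` is hypothetical (it is not part of `mlc_chat`), and the file path assumes the `dist` layout used in this tutorial. It merges the given overrides into the `conv_config` entry and leaves every other field untouched.

```python
import json
from pathlib import Path


def update_chat_config(config_path, conv_overrides):
    """Merge `conv_overrides` into the "conv_config" entry of an mlc-chat-config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    conv_config = config.get("conv_config") or {}
    conv_config.update(conv_overrides)
    config["conv_config"] = conv_config
    path.write_text(json.dumps(config, indent=4))
    return config


# Assumed location, based on the paths used earlier in this tutorial.
config_file = Path("dist/GOAT-7B-Community-q4f16_1/params/mlc-chat-config.json")
if config_file.exists():
    updated = update_chat_config(config_file, {"stop_str": "\n\n", "system": ""})
    print(updated["conv_config"])
```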

**Google Colab**: To modify the file in Google Colab, click the folder icon on the left and locate the file. Clicking on the file opens an editor on the right.

## Step 3. Chat with the compiled model

Now we can chat using `mlc_chat`'s `ChatModule`. Note that `mlc_llm.build_model` returns the path to the generated files, and we can directly pass them in to the workflow below.

For more details on `ChatModule`, please see the other tutorial [Getting Started with MLC-LLM](https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb), or its documentation [here](https://mlc.ai/mlc-llm/docs/deploy/python.html#api-reference).

In [None]:
# Directly use the returned paths to launch `ChatModule`
chat_mod = mlc_chat.ChatModule(model=model_path)
chat_mod.reset_chat(chat_config)

In [None]:
prompt = "Tell me a joke"
chat_mod.generate(prompt=prompt)

## (Optional) Step 4. Upload the compiled model weights

Next, we can upload the compiled model weights (the files in `dist/GOAT-7B-Community-q4f16_1/params`) to Hugging Face:

```bash
# First, please create a repository on Hugging Face.
# With the repository created, run
git lfs install
git clone https://huggingface.co/my-huggingface-account/my-goat7b-weight-huggingface-repo
cd my-goat7b-weight-huggingface-repo
cp /path/to/dist/GOAT-7B-Community-q4f16_1/params/* .
git add . && git commit -m "Add goat-7b compiled model weights"
git push origin main
```

An example of a distributed `GOAT-7B-Community-q4f16_1` is available on [mlc-ai's Hugging Face](https://huggingface.co/mlc-ai/mlc-chat-GOAT-7B-Community-q4f16_1/tree/main).

We do not need to upload the `.so` file because we can reuse an existing model library, as we will see in the next section.

## (Optional) Step 5. Use the pre-built model weights you uploaded

Finally, we will show you how to use the model weights you just uploaded. This is similar to what is shown in the tutorial of [Getting Started with Chat Module](https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb), and we have come full circle.

Before proceeding, we should restart the runtime again, because the notebook may otherwise crash. Simply run the next cell, and proceed with the following cells after the runtime restarts.

In [None]:
exit()

To demonstrate the usage of prebuilt weights, we first delete the weights we have downloaded and compiled.

In [None]:
!cd mlc-llm && rm -rf dist

Next, we download all the pre-built model libraries (the `.so` file we will use is in here).

In [None]:
!mkdir -p mlc-llm/dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git mlc-llm/dist/prebuilt/lib

Then, download the pre-built weights you have uploaded to Hugging Face. Here, we use the example uploaded to mlc-ai's Hugging Face repo.

In [None]:
!cd mlc-llm/dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-GOAT-7B-Community-q4f16_1

Here are the model weights we just downloaded from Hugging Face:

In [None]:
!cd mlc-llm/dist/prebuilt && ls

Here are all the pre-built model libraries we cloned.

Note that there isn't one for GOAT-7B. However, that is fine because GOAT-7B shares the same architecture with Llama. As long as the model architecture is the same, and the quantization choice is the same, we can reuse the model library! This is why we did not need to upload the `.so` file in Step 4.

In [None]:
!cd mlc-llm/dist/prebuilt/lib && ls
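To narrow the listing down to libraries that could be reused, you can filter by quantization and target. The sketch below assumes library files follow the `<model>-<quantization>-<target>.so` naming seen in this tutorial (e.g. `Llama-2-7b-chat-hf-q4f16_1-cuda.so`); `find_reusable_libs` is a hypothetical helper. It does not check the architecture, so you still need to pick a library built for the same architecture as your model.

```python
from pathlib import Path


def find_reusable_libs(lib_names, quantization, target):
    """Return library file names whose quantization and target match (architecture not checked)."""
    suffix = f"-{quantization}-{target}.so"
    return [name for name in lib_names if name.endswith(suffix)]


# Assumed location of the cloned libraries; yields an empty list if the folder is absent.
lib_dir = Path("mlc-llm/dist/prebuilt/lib")
names = sorted(p.name for p in lib_dir.iterdir()) if lib_dir.is_dir() else []
print(find_reusable_libs(names, "q4f16_1", "cuda"))
```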

Finally, we follow the same code as in Step 3 and chat with the prebuilt model library and weights!

Notice that the target is still `cuda`, but the paths now point to the files we just downloaded.

In [None]:
import mlc_llm
import mlc_chat
import tvm

In [None]:
chat_mod = mlc_chat.ChatModule(model="mlc-llm/dist/prebuilt/mlc-chat-GOAT-7B-Community-q4f16_1", lib_path="mlc-llm/dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so", device_name="cuda")

In [None]:
prompt = "Write a short poem about Pittsburgh"
chat_mod.generate(prompt=prompt)