# Compiling Llama-2 with MLC-LLM in Python

This notebook demonstrates how to compile a model via [MLC-LLM](https://github.com/mlc-ai/mlc-llm) with a Python API. The `mlc-llm` package lets you compile a model from any directory. In this tutorial, we will compile the newly released Llama-2 and then chat with the model we build. You can also chat with [many other models](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html#more-model-compile-commands) using the same method.

Note that you can also compile models from the command line (as opposed to Python), as shown in [the docs](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html).

If you are interested in learning how the compilation works behind the scenes, you may find [this course on machine learning compilation](https://mlc.ai/) helpful.

Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_compile_llama2_with_mlc_llm.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

We will start by setting up the environment. First, let us create a new Conda environment in which we will run the rest of the notebook.

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab**
- If you are running this in a Google Colab notebook, you do not need to create a Conda environment.
- However, be sure to change your runtime to GPU by going to `Runtime` > `Change runtime type` and setting the hardware accelerator to "GPU".
- Compiling Llama-2 **may** also require more RAM than the default Colab runtime allocates. You may need to either upgrade to a paid Colab plan (so that `runtime shape` can be set to `High RAM`) or use another environment.
  - That said, we have noticed that rerunning just the build portion several times sometimes succeeds without exceeding the default RAM.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the driver version number as well as what GPUs are currently available for use.

In [1]:
!nvidia-smi

Mon Jul 24 04:57:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
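
If you would rather read the driver and CUDA versions programmatically (for example, to decide which wheel to install in the next step), the banner line of `nvidia-smi` can be parsed from Python. The following is a minimal sketch; the helper names `parse_smi_header` and `detect_cuda` are hypothetical and not part of any MLC package, and the regular expression assumes the banner format shown in the output above.

```python
import re
import shutil
import subprocess

def parse_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi's banner, or None."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    return (m.group(1), m.group(2)) if m else None

def detect_cuda():
    """Run nvidia-smi if it is on PATH and parse its banner; otherwise return None."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    return parse_smi_header(out)

# Banner line copied from the output above
sample = "| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |"
print(parse_smi_header(sample))
```

Calling `detect_cuda()` on a machine without an NVIDIA driver simply returns `None`, which makes it easy to branch on.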

Next, let's install the MLC-AI and MLC-Chat nightly build packages. If you are running in a Colab environment, you can just run the following command. Otherwise, go to https://mlc.ai/package/ and replace the command below with the one that is appropriate for your hardware and OS.

**Google Colab**: If you are using Colab, you may see red warnings such as **"You must restart the runtime in order to use newly installed versions."** For our purposes, we can disregard them; the notebook will still run correctly.

In [2]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1300-cp310-cp310-manylinux_2_28_x86_64.whl (81.2 MB)
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev288-cp310-cp310-manylinux_2_28_x86_64.whl (20.4 MB)
Collecting attrs (from mlc-ai-nightly-cu118)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Downloading cloudpickle-2.2.1-py3-none-any.whl (2

**Google Colab**: Since we ignored the warnings/errors in the previous cell, run the following cell to verify the installation did in fact occur properly.

In [3]:
!python -c "import tvm; print('tvm installed properly!')"
!python -c "import mlc_chat; print('mlc_chat installed properly!')"

tvm installed properly!
mlc_chat installed properly!
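
The same check can be done from Python without spawning a subprocess. The sketch below uses `importlib.util.find_spec`, which only looks the module up on the import path and does not actually import it, so it is a lighter (and slightly weaker) check than the cell above. The helper name `missing_modules` is hypothetical.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# `tvm` and `mlc_chat` are the modules this tutorial relies on.
missing = missing_modules(["tvm", "mlc_chat"])
print("all packages found" if not missing else f"missing: {missing}")
```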


Then, we clone the [mlc-llm repository](https://github.com/mlc-ai/mlc-llm).

**Google Colab**: Note that this clones the repository into the `mlc-llm` folder. You can click the folder icon on the left menu bar to see the local file system and verify that the repository was cloned successfully.

In [4]:
!git clone --recursive https://github.com/mlc-ai/mlc-llm.git

Cloning into 'mlc-llm'...
remote: Enumerating objects: 5432, done.
remote: Counting objects: 100% (1435/1435), done.
remote: Compressing objects: 100% (441/441), done.
remote: Total 5432 (delta 1106), reused 1161 (delta 989), pack-reused 3997
Receiving objects: 100% (5432/5432), 20.04 MiB | 17.49 MiB/s, done.
Resolving deltas: 100% (3413/3413), done.
Submodule '3rdparty/argparse' (https://github.com/p-ranav/argparse) registered for path '3rdparty/argparse'
Submodule '3rdparty/googletest' (https://github.com/google/googletest.git) registered for path '3rdparty/googletest'
Submodule '3rdparty/tokenizers-cpp' (https://github.com/mlc-ai/tokenizers-cpp) registered for path '3rdparty/tokenizers-cpp'
Submodule '3rdparty/tvm' (https://github.com/mlc-ai/relax.git) registered for path '3rdparty/tvm'
Cloning into '/content/mlc-llm/3rdparty/argparse'...
remote: Enumerating objects: 2421, done.        
remote: Counting objects: 100% (27/27), done.        
remote: Compressing objects: 10

We then install `mlc-llm` as a package, so that we can use its functions outside of this directory.

In [5]:
!cd mlc-llm && pip install -e . && cd -

Obtaining file:///content/mlc-llm
  Preparing metadata (setup.py) ... done
Collecting transformers (from mlc-llm==0.1.dev290+gd4c3a17)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting timm (from mlc-llm==0.1.dev290+gd4c3a17)
  Downloading timm-0.9.2-py3-none-any.whl (2.2 MB)
Collecting huggingface-hub (from timm->mlc-llm==0.1.dev290+gd4c3a17)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting safetensors (from timm->mlc-llm==0.1.dev290+gd4c3a17)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Download the Llama-2 model
After setting up the environment, we need to download the model we will compile. In this case, it is [Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Note: you do not need to download it from this link; the notebook downloads the model for you.

To demonstrate that we can compile models with the `mlc-llm` package from any directory, we will create a separate directory to perform our work.

In [6]:
!mkdir -p ./my_workspace && ls

mlc-llm  my_workspace  sample_data


In [7]:
%cd my_workspace

/content/my_workspace


In order to download the large weights, we'll have to use `git lfs`.

In [8]:
!git lfs install

Git LFS initialized.


Now we will download the Llama-2 7B model from Hugging Face. Please first [request access](https://huggingface.co/meta-llama) to the Llama-2 weights from Meta, using the email address of your Hugging Face account (i.e. open [Llama-2 7B](https://huggingface.co/meta-llama/Llama-2-7b) and click the button near the top of the model card to request access to the repo). Once access is granted, your Hugging Face account will be able to download the model.

Since this particular model requires permission, we need to log in to our Hugging Face account. To log in from Colab or another notebook environment, create an [Access Token](https://huggingface.co/settings/tokens) and paste it when prompted below.

(Note: if the command appears to be taking a long time, the model is most likely still being downloaded. Check your filesystem to see whether the directory `Llama-2-7b-chat-hf` has been created and is being populated.)

In [9]:
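import getpass, subprocess

# Prompt for the username and access token (getpass hides the token as you type)
username = input("Enter your huggingface username: ")
token = getpass.getpass(prompt="Huggingface CLI Access Token: ")
command = ['git', 'clone', f'https://{username}:{token}@huggingface.co/meta-llama/Llama-2-7b-chat-hf']
p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
command = token = None  # do not keep the credentials around in the notebook
# git reports progress on stderr; print each line until the stream closes
for l in p.stderr:
  print(l.decode('utf-8'))
p.wait()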

Enter your huggingface username: CharlieRuan0130
Huggingface CLI Access Token: ··········
Cloning into 'Llama-2-7b-chat-hf'...

Filtering content: 100% (5/5), 9.10 GiB | 8.07 MiB/s, done.

Encountered 2 file(s) that may not have been copied correctly on Windows:

	pytorch_model-00001-of-00002.bin

	model-00001-of-00002.safetensors



See: `git lfs help smudge` for more details.




**Google Colab:** If you are on the free version of Colab, you will most likely run out of disk space, so please run the following command, which deletes the `.safetensors` files, to free some up.

In [10]:
!rm -rf Llama-2-7b-chat-hf/*.safetensors
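
To confirm how much space is left before starting the build, the standard library's `shutil.disk_usage` reports the free space of the filesystem holding the workspace. A minimal sketch, with `free_gib` as a hypothetical helper name:

```python
import shutil

def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30

print(f"{free_gib():.1f} GiB free")
```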

## Compile Llama-2 with `mlc_llm`

Finally, we can compile the model we just downloaded in Python.

In [11]:
# Need to restart runtime since notebooks cannot find the module right after installing
# Simply run this cell, then run the next cells after runtime finishes restarting
exit()

After restarting the runtime of the notebook, first go into the workspace we created. After this cell, all code below will be in Python!

In [5]:
%cd my_workspace

/content/my_workspace


We import `mlc_llm`, which we installed earlier with `pip install -e .`. `mlc_chat` and `tvm` are included in the nightly packages we installed earlier.

In [6]:
import mlc_llm, mlc_chat, tvm

We use the dataclass `BuildArgs` to organize the arguments for building the model. Besides specifying the model with `model` (when the weights are stored locally), you can also use the argument `hf_path`, which downloads the model from Hugging Face directly.

For more details on the arguments, please see [the docs for the CLI](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html#compile-command-specification) for now, or look at the source code. We will update the documentation for `BuildArgs` soon.

In [7]:
build_args = mlc_llm.BuildArgs(
    model="Llama-2-7b-chat-hf",
    quantization="q4f16_1",
    target="cuda")

`mlc_llm.build_model` is the main entry point here. It takes a `BuildArgs` and runs the entire model compilation workflow.

**Google Colab**: If you are using Colab, the line below may require more RAM than the default runtime provides. You may need to either upgrade to a paid Colab plan or run it in another environment. (Sometimes, rerunning just the build portion eventually succeeds without exceeding the RAM Colab provides.)

The call `lib_path, model_path, chat_config_path = mlc_llm.build_model(build_args)` returns a tuple of three paths.

`lib_path` is the path to the specific binary that has been built.

`model_path` is the path to the folder containing the compiled model parameters and other model specific configuration needed for other `mlc` modules.

`chat_config_path` is the path to the specific `.json` configuration needed to have this model work with `mlc_chat`.

In [8]:
lib_path, model_path, chat_config_path = mlc_llm.build_model(build_args)

Using path "Llama-2-7b-chat-hf" for model "Llama-2-7b-chat-hf"
Target configured: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.5313796997070312 GB
Start storing to cache dist/Llama-2-7b-chat-hf-q4f16_1/params
[0327/0327] saving param_326
All finished, 115 total shards committed, record saved to dist/Llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json
Save a cached module to dist/Llama-2-7b-chat-hf-q4f16_1/mod_cache_before_build_cuda.pkl.
Finish exporting to dist/Llama-2-7b-chat-hf-q4f16_1/Llama-2-7b-chat-hf-q4f16_1-cuda.so
Finish exporting chat config to dist/Llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json


## Now we can chat!

Now we can chat using `mlc_chat`'s `ChatModule`. Note that `mlc_llm.build_model` returns the paths to the generated files, which we can pass directly to the workflow below.

For more details on `ChatModule`, please see the other tutorial [Getting Started with MLC-LLM](https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb), or its documentation [here](https://mlc.ai/mlc-llm/docs/deploy/python.html#api-reference).

In [9]:
# Directly use the returned paths to launch `ChatModule`
lib = tvm.runtime.load_module(lib_path)
chat_mod = mlc_chat.ChatModule(target="cuda")
chat_mod.reload(lib=lib, model_path=model_path)

In [10]:
from IPython.display import clear_output

prompt = "Prompt: Write me a poem about the city Pittsburgh"
chat_mod.prefill(input=prompt)

msg = None
while not chat_mod.stopped():
    chat_mod.decode()
    msg = chat_mod.get_message()
    clear_output()
    print(msg, flush=True)

Of course! Here is a poem about the city of Pittsburgh:
Pittsburgh, city of steel and might,
Where the rivers flow and the bridges take flight.
A city of contrasts, where the skyline meets the night,
A place of beauty, where the grit and grime take flight.
From the hills to the valleys, the views are simply divine,
A city that's rich in history, and full of life.
From the Steelers to the Pirates, and the Penguins too,
Pittsburgh's sports teams have won the hearts of many a crew.
The cultural scene is thriving, with art and music in the air,
From the Carnegie Museum to the Mattress Factory, and the theaters beyond compare.
The people of Pittsburgh, a melting pot of cultures and hues,
A city that's welcoming, and full of surprises, and news.
From the Strip District to the North Side, and everywhere in between,
Pittsburgh's neighborhoods are full of life, and stories untold and unseen.
So here's to Pittsburgh, a city of character and grace,
A place that's home to many, and a city that's i
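
If you plan to send several prompts, the prefill/decode loop above can be wrapped in a small generator. This is a sketch that uses only the `ChatModule` methods already shown in this notebook (`prefill`, `decode`, `stopped`, `get_message`); the function name `stream_reply` is hypothetical.

```python
def stream_reply(chat_mod, prompt):
    """Yield the partial reply after every decode step until generation stops."""
    chat_mod.prefill(input=prompt)
    while not chat_mod.stopped():
        chat_mod.decode()
        yield chat_mod.get_message()
```

With the `chat_mod` created earlier, you could then write `for partial in stream_reply(chat_mod, prompt): ...` and display each `partial` however you like, for example with `clear_output()` followed by `print(partial)` as in the cell above.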