# Compiling Llama-2 with MLC-LLM in Python

This notebook demonstrates how to compile a model with [MLC-LLM](https://github.com/mlc-ai/mlc-llm) through its Python API. The `mlc-llm` package lets you compile a model from any directory; see the [compilation documentation](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html#more-model-compile-commands) for more compile commands.

In [None]:
!nvidia-smi

Tue Dec 12 08:24:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1900-cp310-cp310-manylinux_2_28_x86_64.whl (544.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 544.7/544.7 MB 1.6 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev669-cp310-cp310-manylinux_2_28_x86_64.whl (60.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.5/60.5 MB 11.1 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu118)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 1.4 MB/s eta 0:00:00
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Downloading cloudpickle-3.0.0-py3-none-any.whl

In [None]:
!git clone --recursive https://github.com/mlc-ai/mlc-llm.git

Cloning into 'mlc-llm'...
remote: Enumerating objects: 12280, done.
remote: Counting objects: 100% (1644/1644), done.
remote: Compressing objects: 100% (558/558), done.
remote: Total 12280 (delta 1284), reused 1221 (delta 1085), pack-reused 10636
Receiving objects: 100% (12280/12280), 23.65 MiB | 15.52 MiB/s, done.
Resolving deltas: 100% (7946/7946), done.
Submodule '3rdparty/argparse' (https://github.com/p-ranav/argparse) registered for path '3rdparty/argparse'
Submodule '3rdparty/googletest' (https://github.com/google/googletest.git) registered for path '3rdparty/googletest'
Submodule '3rdparty/tokenizers-cpp' (https://github.com/mlc-ai/tokenizers-cpp) registered for path '3rdparty/tokenizers-cpp'
Submodule '3rdparty/tvm' (https://github.com/mlc-ai/relax.git) registered for path '3rdparty/tvm'
Cloning into '/content/mlc-llm/3rdparty/argparse'...
remote: Enumerating objects: 2822, done.        
remote: Counting objects: 100% (2822/2822), done.        
remote: Compressing o

We then install `mlc-llm` as a package, so that we can use its functions outside of this directory.

In [None]:
!cd mlc-llm && pip install -e . && cd -

Obtaining file:///content/mlc-llm
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting timm (from mlc-llm==0.1.dev677+g53e159b)
  Downloading timm-0.9.12-py3-none-any.whl (2.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 24.5 MB/s eta 0:00:00
Building wheels for collected packages: mlc-llm
  Building editable for mlc-llm (pyproject.toml) ... done
  Created wheel for mlc-llm: filename=mlc_llm-0.1.dev677+g53e159b-0.editable-py3-none-any.whl size=7384 sha256=b1604002aaabf8f62cd44d5b652ed95d0fab21e3a0442704315391ad504da26b
  Stored in directory: /tmp/pip-ephem-wheel-cache-w5zwte6z/wheels/60/f6/e4/f9ebad71d5663623c41caead0eb5663a07b045d94af8e40d00
Successfully built mlc-llm
Installing collected packages: timm, mlc

## Download the Llama-2 model
After setting up the environment, we need to download the model we will compile: [Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Note: you do not need to download it manually from this link; the cells below download the model within this notebook.

To demonstrate that we can compile models using the `mlc-llm` package from anywhere, we will create a separate directory to perform our work.

In [None]:
!mkdir -p ./my_workspace && ls

mlc-llm  my_workspace  sample_data


In [None]:
%cd my_workspace

/content/my_workspace


In order to download the large weights, we'll have to use `git lfs`.

In [None]:
!git lfs install

Git LFS initialized.


Now we will download the Llama-2 7B model from Hugging Face. First, [request access](https://huggingface.co/meta-llama) to the Llama-2 weights from Meta, using the email address of your Hugging Face account: open [Llama-2 7B](https://huggingface.co/meta-llama/Llama-2-7b) and click the request-access button near the top of the model card. Once the request is granted, your Hugging Face account will have access to the model.

Since this particular model requires permission, we need to log in to our Hugging Face account. To log in from Colab or another notebook environment, create an [Access Token](https://huggingface.co/settings/tokens) and paste it when prompted below.

(Note: if the command appears to take a long time, the model is most likely still downloading. Check your filesystem to see whether the directory `Llama-2-7b-chat-hf` has been created and is being populated.)

In [None]:
import os

# Specify the folder name
folder_name = 'Llama-2-7b-chat-hf'

# Specify the path where you want to create the folder
workspace_path = '/content/my_workspace'  # Update this with your actual workspace path

# Combine the workspace path and folder name to create the full path
folder_path = os.path.join(workspace_path, folder_name)
# Check if the folder already exists
if not os.path.exists(folder_path):
    # Create the folder if it doesn't exist
    os.makedirs(folder_path)
    print(f'Folder "{folder_name}" created in "{workspace_path}"')
else:
    print(f'Folder "{folder_name}" already exists in "{workspace_path}"')

Folder "Llama-2-7b-chat-hf" created in "/content/my_workspace"


In [None]:
import getpass
import subprocess

# Prompt for credentials at run time; never hard-code a username or token in the notebook
username = input("Enter your huggingface username: ")
token = getpass.getpass(prompt="Huggingface CLI Access Token: ")
command = ['git', 'clone', f'https://{username}:{token}@huggingface.co/meta-llama/Llama-2-7b-chat-hf']
p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
command = []  # Drop the list that holds the token
# git writes its progress messages to stderr; stream them until the clone exits
for l in p.stderr:
    print(l.decode('utf-8'), end='')
p.wait()

Enter your huggingface username: <your-username>
Huggingface CLI Access Token: ··········
Cloning into 'Llama-2-7b-chat-hf'...
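
A username or token placed directly into a URL can break it if it contains characters such as `@` or `/`. The sketch below shows one way to assemble the clone URL with percent-encoding and to mask the token before anything is printed. The helper names `build_clone_url` and `mask_credentials` are hypothetical and are not part of any Hugging Face or MLC-LLM API.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_clone_url(username, token, repo_id, host="huggingface.co"):
    """Assemble an HTTPS clone URL, percent-encoding the credentials."""
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{repo_id}"


def mask_credentials(url):
    """Return `url` with the password part replaced, so it is safe to print."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    # Note: this simple sketch drops an explicit port, which these URLs do not use.
    netloc = f"{parts.username}:***@{parts.hostname or ''}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder credentials for illustration only
url = build_clone_url("my-user", "hf_example_token", "meta-llama/Llama-2-7b-chat-hf")
print(mask_credentials(url))
```

Passing the unmasked `url` to `git clone` and printing only the masked form keeps the token out of the notebook output.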



In [None]:
!rm -rf Llama-2-7b-chat-hf/*.safetensors
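
The command above deletes the `.safetensors` files from the cloned repository. A minimal Python sketch of the same cleanup, which also reports what was removed, might look like this. The function name `remove_by_suffix` is hypothetical, and the demonstration runs on a throwaway temporary directory with dummy files rather than on the real checkpoint.

```python
import tempfile
from pathlib import Path


def remove_by_suffix(directory, suffix=".safetensors"):
    """Delete files ending in `suffix` directly inside `directory`.

    Returns the removed file names (sorted) and the total bytes freed.
    """
    removed, freed = [], 0
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        if path.is_file():
            freed += path.stat().st_size
            path.unlink()
            removed.append(path.name)
    return removed, freed


# Demonstration on a temporary directory with two dummy 16-byte files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model-00001.safetensors").write_bytes(b"\0" * 16)
    (Path(tmp) / "pytorch_model-00001.bin").write_bytes(b"\0" * 16)
    names, freed = remove_by_suffix(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(f"Removed {names}, freed {freed} bytes, kept {remaining}")
```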

## Compile Llama-2 with `mlc_llm`

Finally, we can compile the model we just downloaded in Python.

In [None]:
# Need to restart runtime since notebooks cannot find the module right after installing
# Simply run this cell, then run the next cells after runtime finishes restarting
exit()

After the notebook runtime restarts, first return to the workspace we created. After this cell, all code below will be in Python!

In [None]:
%cd my_workspace

We import `mlc_llm`, which we installed with `pip install -e .`. `mlc_chat` and `tvm` are included in the nightly packages we installed earlier.

In [None]:
import mlc_llm, mlc_chat, tvm
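
If the import fails even after the restart, it can help to see which of the three modules the interpreter cannot locate. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the check only tells you whether a top-level module can be found, not whether it loads without errors.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["mlc_llm", "mlc_chat", "tvm"])
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

A non-empty result usually means the runtime has not been restarted since installation, or that the install cells ran in a different environment.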

In [None]:
build_args = mlc_llm.BuildArgs(
    model="Llama-2-7b-chat-hf",
    quantization="q4f16_1",
    target="cuda")

`mlc_llm.build_model(build_args)` returns a tuple of three paths, which we unpack as `lib_path, model_path, chat_config_path`.

`lib_path` is the path to the specific binary that has been built.

`model_path` is the path to the folder containing the compiled model parameters and other model specific configuration needed for other `mlc` modules.

`chat_config_path` is the path to the specific `.json` configuration needed to have this model work with `mlc_chat`.

In [None]:
lib_path, model_path, chat_config_path = mlc_llm.build_model(build_args)

Using path "Llama-2-13b-chat-hf" for model "Llama-2-13b-chat-hf"
Target configured: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32


Get old param:   0%|          | 0/245 [00:00<?, ?tensors/s]
Get old param:   0%|          | 1/245 [00:02<10:33,  2.60s/tensors]

Start computing and quantizing weights... This may take a while.



Get old param:   1%|          | 2/245 [00:08<17:15,  4.26s/tensors]
Get old param:   2%|▏         | 4/245 [00:52<1:02:30, 15.56s/tensors]
Get old param:   2%|▏         | 5/245 [00:52<43:14, 10.81s/tensors]  
Get old param:   3%|▎         | 7/245 [00:53<23:08,  5.83s/tensors]
Set new param:   3%|▎         | 12/407 [00:53<15:07,  2.30s/tensors]
Get old param:   4%|▍         | 11/245 [00:53<09:52,  2.53s/tensors]
Get old param:   5%|▌         | 13/245 [00:53<07:09,  1.85s/tensors]
Set new param:   5%|▌         | 22/407 [00:53<04:12,  1.52tensors/s]
Get old param:   7%|▋         | 17/245 [00:54<03:57,  1.04s/tensors]
Get old param:   8%|▊         | 19/245 [00:54<03:07,  1.20tensors/s]
Set new param:   8%|▊         | 32/407 [00:54<01:34,  3.98tensors/s]
Get old param:   9%|▉         | 23/245 [00:54<01:54,  1.93tensors/s]
Get old param:  10%|█         | 25/245 [00:55<01:37,  2.27tensors/s]
Set new param:  10%|█         | 42/407 [00:55<00:45,  8.05tensors/s]
Get old param:  12%|█

Finish computing and quantizing weights.
Total param size: 6.820138931274414 GB
Start storing to cache dist/Llama-2-13b-chat-hf-q4f16_1/params
[0176/0407] saving param_175


Set new param: 100%|██████████| 407/407 [02:50<00:00, 20.07tensors/s]

[0407/0407] saving param_406
All finished, 163 total shards committed, record saved to dist/Llama-2-13b-chat-hf-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to dist/Llama-2-13b-chat-hf-q4f16_1/params/mlc-chat-config.json
Save a cached module to dist/Llama-2-13b-chat-hf-q4f16_1/mod_cache_before_build.pkl.
Finish exporting to dist/Llama-2-13b-chat-hf-q4f16_1/Llama-2-13b-chat-hf-q4f16_1-cuda.so
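
Judging from the log above, the build writes everything under `dist/<model>-<quantization>/`. The sketch below reconstructs that layout so the returned paths can be sanity-checked. The helper `expected_artifacts` is hypothetical, and the layout is inferred from this one log rather than taken from the MLC-LLM documentation, so it may differ between versions.

```python
from pathlib import Path


def expected_artifacts(model, quantization, target, root="dist"):
    """Guess where the build places its outputs (assumption based on the log above)."""
    name = f"{model}-{quantization}"
    base = Path(root) / name
    return {
        "lib_path": base / f"{name}-{target}.so",
        "model_path": base / "params",
        "chat_config_path": base / "params" / "mlc-chat-config.json",
    }


# Same model, quantization, and target as in `build_args`
artifacts = expected_artifacts("Llama-2-7b-chat-hf", "q4f16_1", "cuda")
for label, path in artifacts.items():
    print(f"{label}: {path.as_posix()} (exists: {path.exists()})")
```

Comparing these guesses with the values actually returned by `mlc_llm.build_model` is a quick way to confirm that the build finished and wrote its files where expected.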


## Content Generation

In [None]:
# Directly use the returned paths to launch `ChatModule`
chat_mod = mlc_chat.ChatModule(model=model_path)

In [None]:
prompt = "Write me a poem about the city Pittsburgh"
chat_mod.generate(prompt=prompt)