# Getting Started with MLC-LLM in Python

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Vicuna-7B](https://huggingface.co/lmsys/vicuna-7b-delta-v1.1) model, which LMSYS developed by fine-tuning LLaMA.

## Environment Setup

Let's set up your environment so you can run the `ChatModule`. First, let's create the Conda environment that this notebook will run in.

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one appropriate for your hardware and OS. This example assumes CUDA 11.6 on Linux.

In [1]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu116 mlc-chat-nightly-cu116 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu116
  Using cached https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu116-0.12.dev1265-cp310-cp310-manylinux_2_28_x86_64.whl (98.6 MB)
Collecting mlc-chat-nightly-cu116
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu116-0.1.dev257-cp310-cp310-manylinux_2_28_x86_64.whl (19.8 MB)
Collecting attrs (from mlc-ai-nightly-cu116)
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting cloudpickle (from mlc-ai-nightly-cu116)
  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting decorator (from mlc-ai-nightly-cu116)
  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting ml-dtypes (from mlc-ai-nightly-cu116)
  Using cached ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manyl
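
Once the install finishes, it can help to confirm that the packages are visible to the notebook's Python interpreter before going further. The following is a minimal sketch, not part of the MLC-LLM tooling: `missing_modules` is a hypothetical helper, and the module names `tvm` and `mlc_chat` are the ones imported later in this notebook.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


# `tvm` and `mlc_chat` are the modules imported later in this notebook.
missing = missing_modules(["tvm", "mlc_chat"])
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("tvm and mlc_chat are importable")
```

If either module is reported as not importable, the notebook kernel is probably not running inside the `mlc-llm` Conda environment created above.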

Next, we can clone the [MLC-LLM project](https://github.com/mlc-ai/mlc-llm).

In [2]:
!git clone git@github.com:mlc-ai/mlc-llm.git
# Each `!` command runs in its own subshell, so a separate `!cd` does not persist.
!git -C mlc-llm submodule update --init --recursive

Cloning into 'mlc-llm'...
remote: Enumerating objects: 5149, done.
remote: Counting objects: 100% (1151/1151), done.
remote: Compressing objects: 100% (361/361), done.
remote: Total 5149 (delta 907), reused 879 (delta 787), pack-reused 3998
Receiving objects: 100% (5149/5149), 19.90 MiB | 23.47 MiB/s, done.
Resolving deltas: 100% (3217/3217), done.


Next, let's download the prebuilt model libraries from GitHub and the Vicuna-7B model weights from Hugging Face. To download the large weight files, we need `git lfs`.

In [3]:
!conda install -y git git-lfs
!git lfs install

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.9.0
  latest version: 23.5.2

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Retrieving notices: ...working... done
Updated git hooks.
Git LFS initialized.


In [4]:
!mkdir -p mlc-llm/dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git mlc-llm/dist/prebuilt/lib

Cloning into 'mlc-llm/dist/prebuilt/lib'...
remote: Enumerating objects: 142, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 142 (delta 1), reused 4 (delta 1), pack-reused 135
Receiving objects: 100% (142/142), 40.07 MiB | 22.36 MiB/s, done.
Resolving deltas: 100% (94/94), done.


In [5]:
!git clone https://huggingface.co/mlc-ai/mlc-chat-vicuna-v1-7b-q3f16_0
!mv mlc-chat-vicuna-v1-7b-q3f16_0 mlc-llm/dist/prebuilt

Cloning into 'mlc-chat-vicuna-v1-7b-q3f16_0'...
remote: Enumerating objects: 196, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (188/188), done.
remote: Total 196 (delta 7), reused 196 (delta 7), pack-reused 0
Receiving objects: 100% (196/196), 36.70 KiB | 6.12 MiB/s, done.
Resolving deltas: 100% (7/7), done.
Filtering content: 100% (131/131), 2.84 GiB | 25.55 MiB/s, done.
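
Before moving on, you may want to check that the downloads landed where the later cells expect them. This is a rough sketch using only the standard library. The two paths are the ones passed to `load_module` and `reload` later in this notebook, and `missing_paths` is a hypothetical helper, not an MLC-LLM function.

```python
from pathlib import Path

# Paths taken from the cells in this notebook; adjust them if you cloned elsewhere.
PREBUILT = Path("mlc-llm/dist/prebuilt")
EXPECTED = [
    PREBUILT / "lib" / "vicuna-v1-7b-q3f16_0-vulkan.so",
    PREBUILT / "mlc-chat-vicuna-v1-7b-q3f16_0",
]


def missing_paths(paths, root=Path(".")):
    """Return the paths (relative to root) that do not exist on disk."""
    return [p for p in paths if not (Path(root) / p).exists()]


for path in missing_paths(EXPECTED):
    print("Missing:", path)
```

The loop prints nothing when both the library file and the weights directory are in place.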


## Let's Chat

Before we can chat with the model, we must first import a few libraries and instantiate a `ChatModule` instance.

In [6]:
from mlc_chat import ChatModule
import tvm

from IPython.display import clear_output

We must instantiate the `ChatModule` with the appropriate device type, such as `vulkan` or `cuda`. This notebook uses `vulkan`, matching the prebuilt library loaded in the next step.

In [7]:
cm = ChatModule(target="vulkan")

To load the model weights and the prebuilt model library into the `ChatModule`, we first call the `reload` function.

In [8]:
lib = tvm.runtime.load_module("mlc-llm/dist/prebuilt/lib/vicuna-v1-7b-q3f16_0-vulkan.so")
cm.reload(lib=lib, model_path="mlc-llm/dist/prebuilt/mlc-chat-vicuna-v1-7b-q3f16_0")

That's all that's needed to set up the `ChatModule`. You can now chat with the model by inputting any prompt you'd like. Try it out below!

In [9]:
prompt = input("Prompt: ")
cm.prefill(input=prompt)

msg = None
while not cm.stopped():
    cm.decode()
    msg = cm.get_message()
    clear_output(wait=True)
    print(msg, flush=True)

Hello! How can I help you today?


To evaluate the speed of the chatbot, you can print some statistics.

In [10]:
cm.runtime_stats_text()

'prefill: 115.6 tok/s, decode: 30.9 tok/s'
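
To compare runs, it can be convenient to turn this string into numbers. The sketch below assumes the text keeps the `name: value tok/s` shape shown in the output above, which may differ in other builds. `parse_runtime_stats` is a hypothetical helper, not part of `mlc_chat`.

```python
import re


def parse_runtime_stats(text):
    """Extract `<name>: <number> tok/s` pairs into a dict of floats."""
    pattern = r"(\w+):\s*(\d+(?:\.\d+)?)\s*tok/s"
    return {name: float(value) for name, value in re.findall(pattern, text)}


# The sample string is the output shown above.
stats = parse_runtime_stats("prefill: 115.6 tok/s, decode: 30.9 tok/s")
print(stats)
```

In the notebook, you would pass `cm.runtime_stats_text()` instead of the literal string.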

By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [11]:
cm.reset_chat()