# Getting Started with MLC-LLM using the Llama 2 Model

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Llama 2](https://ai.meta.com/llama/) model. For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up your environment so you can run the `ChatModule`. First, let's create the Conda environment in which we'll run this notebook.

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, switch your runtime to GPU: go to Runtime > Change runtime type and set the Hardware accelerator to "GPU". Then select "Connect" at the top right to start your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
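
If you want to pick the matching wheel programmatically, the release number can be pulled out of the `nvcc` output. The following is a minimal sketch: the helper name `parse_cuda_release` is hypothetical, and it assumes the output contains a `release X.Y` fragment like the one shown above.

```python
import re


def parse_cuda_release(nvcc_output: str):
    """Return the CUDA release (e.g. "11.5") found in `nvcc --version` output, or None."""
    # Assumption: the output contains a fragment such as "release 11.5".
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


sample = "Cuda compilation tools, release 11.5, V11.5.119"
print(parse_cuda_release(sample))  # prints 11.5
```

In a notebook, the text passed to the helper could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.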


**Google Colab:** If you are running this in a Google Colab notebook, you will also need to install Vulkan drivers. You can skip this step if you are running locally and already have Vulkan support, or if you are not using Vulkan.

In [None]:
!sudo apt install -y vulkan-tools libnvidia-gl-525

**Google Colab:** You can run the following command to confirm that the Vulkan drivers have installed successfully.

In [None]:
!vulkaninfo
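
Before choosing a device type, it can help to see which of these command-line tools are present at all. This is a rough sketch with a hypothetical helper name, `available_tools`. It only checks that the executables are on `PATH`, which does not prove that the driver behind them works.

```python
import shutil


def available_tools(names=("nvcc", "vulkaninfo")):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


# The result depends on the machine, e.g. {'nvcc': True, 'vulkaninfo': False}.
print(available_tools())
```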

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one appropriate for your hardware and OS.

In [1]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu116 mlc-chat-nightly-cu116 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu116
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu116-0.12.dev1297-cp310-cp310-manylinux_2_28_x86_64.whl (81.2 MB)
Collecting mlc-chat-nightly-cu116
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu116-0.1.dev278-cp310-cp310-manylinux_2_28_x86_64.whl (20.4 MB)
Collecting attrs (from mlc-ai-nightly-cu116)
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting cloudpickle (from mlc-ai-nightly-cu116)
  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting decorator (from mlc-ai-nightly-cu116)
  Using cached decorator-5.1.1-p
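
Once the install finishes, a quick check that the two modules imported later in this tutorial (`tvm` and `mlc_chat`) can be found may save a confusing error further down. This is a minimal sketch and `missing_modules` is a hypothetical helper. It only looks the modules up and does not import them, so it cannot catch a wheel built for the wrong CUDA version.

```python
import importlib.util


def missing_modules(names=("tvm", "mlc_chat")):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("tvm and mlc_chat were found")
```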

Next, we can clone the [MLC-LLM project](https://github.com/mlc-ai/mlc-llm).

In [2]:
!git clone https://github.com/mlc-ai/mlc-llm.git
!cd mlc-llm && git submodule update --init --recursive

Cloning into 'mlc-llm'...
remote: Enumerating objects: 5287, done.
remote: Counting objects: 100% (1289/1289), done.
remote: Compressing objects: 100% (389/389), done.
remote: Total 5287 (delta 1013), reused 1031 (delta 897), pack-reused 3998
Receiving objects: 100% (5287/5287), 19.94 MiB | 16.42 MiB/s, done.
Resolving deltas: 100% (3323/3323), done.


Next, let's download the model weights for the Llama 2 model and the prebuilt model libraries from GitHub. To download the large weight files, we'll need `git lfs`.

In [3]:
!conda install -y git git-lfs
!git lfs install

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.9.0
  latest version: 23.5.2

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Retrieving notices: ...working... done
Updated git hooks.
Git LFS initialized.


In [4]:
!mkdir -p mlc-llm/dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git mlc-llm/dist/prebuilt/lib

Cloning into 'mlc-llm/dist/prebuilt/lib'...
remote: Enumerating objects: 154, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 154 (delta 7), reused 14 (delta 4), pack-reused 135
Receiving objects: 100% (154/154), 41.48 MiB | 18.77 MiB/s, done.
Resolving deltas: 100% (100/100), done.


In [5]:
!cd mlc-llm/dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

Cloning into 'mlc-chat-Llama-2-7b-chat-hf-q4f16_1'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (124/124), done.
remote: Total 126 (delta 0), reused 123 (delta 0), pack-reused 0
Receiving objects: 100% (126/126), 497.08 KiB | 9.04 MiB/s, done.
Filtering content: 100% (116/116), 3.53 GiB | 33.14 MiB/s, done.
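
With both clones done, it may be worth confirming that the files the rest of this tutorial relies on are where it expects them. The sketch below uses the Vulkan library path and model directory that appear in the `reload` call later on. The helper `missing_paths` is hypothetical, and the `.so` file name would differ for other targets.

```python
from pathlib import Path

PREBUILT = Path("mlc-llm/dist/prebuilt")
EXPECTED = [
    # Prebuilt model library for the Vulkan target (path taken from this tutorial).
    PREBUILT / "lib" / "Llama-2-7b-chat-hf-q4f16_1-vulkan.so",
    # Directory holding the quantized Llama 2 weights.
    PREBUILT / "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [str(p) for p in paths if not (root / p).exists()]


for path in missing_paths(EXPECTED):
    print("missing:", path)
```

An empty result only shows that the paths exist. It does not verify that `git lfs` fetched the full weight files rather than pointer stubs.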


## Let's Chat

Before we can chat with the model, we must import a few libraries and create a `ChatModule` instance.

In [6]:
from mlc_chat import ChatModule
import tvm

from IPython.display import clear_output

We must instantiate the `ChatModule` with the appropriate device type, such as `vulkan` or `cuda`.

In [7]:
cm = ChatModule(target="vulkan")

To load the model weights and prebuilt model library into the `ChatModule`, we first call its `reload` method.

In [8]:
lib = tvm.runtime.load_module("mlc-llm/dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-vulkan.so")
cm.reload(lib=lib, model_path="mlc-llm/dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1")

That's all that's needed to set up the `ChatModule`. You can now chat with the model by inputting any prompt you'd like. Try it out below!

In [10]:
prompt = input("Prompt: ")
cm.prefill(input=prompt)

msg = None
while not cm.stopped():
    cm.decode()
    msg = cm.get_message()
    clear_output()
    print(msg, flush=True)

Certainly! Here is a poem I came up with based on the theme of "nature":

In the grand expanse of nature's embrace,
Where trees stretch high and skies are wide and vast,
The world is full of beauty, full of grace.
A place where creatures roam and plants bloom with haste.

From the tiniest flower to the loftiest tree,
Each one unique, a work of art divine,
Nature's handiwork, a masterpiece to see.
A world where life thrives, where love and light shine.

In nature's arms, we find our peaceful place,
Where worries fade and joy takes their space,
A sanctuary where we can find our grace.
A place where love and wonder fill the space.

Please let me know if you would like me to modify the poem in any way!
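
The prefill/decode loop above can be wrapped in a small function so it is easy to reuse across prompts. The sketch below calls only the methods already used in this tutorial (`prefill`, `stopped`, `decode`, `get_message`). The names `generate` and `EchoChatModule` are hypothetical, and the stub class exists only so the example runs without a GPU or the `mlc_chat` package.

```python
def generate(cm, prompt, on_update=None):
    """Run one prefill/decode cycle on `cm` and return the final message."""
    cm.prefill(input=prompt)
    msg = None
    while not cm.stopped():
        cm.decode()
        msg = cm.get_message()
        if on_update is not None:
            on_update(msg)  # e.g. clear the cell output and print the partial reply
    return msg


class EchoChatModule:
    """Hypothetical stand-in, NOT part of mlc_chat: replays the prompt one word per step."""

    def __init__(self):
        self._words = []
        self._pos = 0

    def prefill(self, input):
        self._words = input.split()
        self._pos = 0

    def stopped(self):
        return self._pos >= len(self._words)

    def decode(self):
        self._pos += 1

    def get_message(self):
        return " ".join(self._words[: self._pos])


print(generate(EchoChatModule(), "hello from the stub"))  # prints: hello from the stub
```

With the real module, the call would be `generate(cm, prompt, on_update=...)`, passing a callback that clears the output and prints the message to reproduce the streaming display.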


To evaluate the speed of the chatbot, you can print its runtime statistics.

In [11]:
cm.runtime_stats_text()

'prefill: 42.5 tok/s, decode: 81.3 tok/s'
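
To compare runs numerically, the statistics string can be turned into floats. This is a minimal sketch with a hypothetical helper, `parse_runtime_stats`. It assumes the `name: value tok/s` layout of the sample output above, which may differ in other versions of the package.

```python
import re


def parse_runtime_stats(text):
    """Parse text like 'prefill: 42.5 tok/s, decode: 81.3 tok/s' into a dict of floats."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)\s*tok/s", text)
    return {name: float(value) for name, value in pairs}


stats = parse_runtime_stats("prefill: 42.5 tok/s, decode: 81.3 tok/s")
print(stats)  # prints {'prefill': 42.5, 'decode': 81.3}

# Rough time to decode a 200-token reply at the measured rate.
print(f"{200 / stats['decode']:.1f} s")
```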

By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [12]:
cm.reset_chat()