<a href="https://colab.research.google.com/github/inchiosa/mlc-llm/blob/main/tutorial_chat_module_getting_started_7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with MLC-LLM using the Llama 2 Model

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Llama 2](https://ai.meta.com/llama/) model. For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up your environment so that you can run the `ChatModule`. First, create the Conda environment in which this notebook will run (not required if you are running in Google Colab).

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, be sure to change your runtime to GPU by going to Runtime > Change runtime type and setting the Hardware accelerator to be "GPU". Select "Connect" on the top right to instantiate your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [None]:
!nvidia-smi

Thu Aug 10 21:03:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
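
If you would rather read the versions programmatically, the banner line of the `nvidia-smi` output can be parsed with a regular expression. This is a minimal sketch: `parse_nvidia_smi_header` is a hypothetical helper name, and it assumes the banner keeps the layout shown in the output above.

```python
import re

# Hypothetical helper: pull the driver and CUDA versions out of the
# nvidia-smi banner line. Assumes the banner format shown above.
_HEADER = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version), or None if no banner is found."""
    match = _HEADER.search(text)
    return match.groups() if match else None


# Banner line copied from the output above.
sample = "| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |"
print(parse_nvidia_smi_header(sample))
```

Keep in mind that the CUDA version in this banner is the highest version the installed driver supports, which is not necessarily the version of any CUDA toolkit on the machine.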

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one that matches your hardware and OS.

In [None]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1395-cp310-cp310-manylinux_2_28_x86_64.whl (97.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 MB 8.1 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev351-cp310-cp310-manylinux_2_28_x86_64.whl (20.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.6/20.6 MB 49.8 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu118)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 1.9 MB/s eta 0:00:00
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Downloading cloudpickle-2.2.1-py3-none-any.whl (2

**Google Colab:** If in Google Colab, you may see a message warning you to restart the runtime. Simply run the following code in a new code cell to restart the runtime.

```python
import os
# Kill the current kernel process (signal 9) so that Colab restarts the runtime.
os.kill(os.getpid(), 9)
```
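
Once the runtime is back, it can be worth confirming that the freshly installed package is importable before going further. The sketch below uses only the standard library. `missing_modules` is a hypothetical helper name, and `mlc_chat` is the module imported later in this tutorial. You could also pass `"tvm"`, on the assumption that the `mlc-ai-nightly` wheel provides that module.

```python
import importlib.util


def missing_modules(names=("mlc_chat",)):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("all packages importable" if not missing else f"missing: {missing}")
```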

Next, let's download the model weights for the Llama 2 model from Hugging Face and the prebuilt model libraries from GitHub. In order to download the large weights, we'll have to use `git lfs`.

Note: If you are NOT running in **Google Colab**, you may need to run `!conda install git git-lfs` first to install `git` and `git-lfs`. The following cell then completes the `git lfs` setup.

In [None]:
!git lfs install

Git LFS initialized.


These commands download many prebuilt libraries as well as the chat configuration for Llama-2-7b that `mlc_chat` needs, which may take a long time. In **Google Colab**, you can verify that the files are being downloaded by clicking the folder icon on the left and navigating to the `dist` and then `prebuilt` folders, which should update as the files arrive.

In [None]:
!rm -rf dist

In [None]:
!mkdir -p dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

Cloning into 'dist/prebuilt/lib'...
remote: Enumerating objects: 215, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 215 (delta 55), reused 53 (delta 30), pack-reused 135
Receiving objects: 100% (215/215), 51.36 MiB | 16.82 MiB/s, done.
Resolving deltas: 100% (148/148), done.
Updating files: 100% (57/57), done.


In [None]:
!cd dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

Cloning into 'mlc-chat-Llama-2-7b-chat-hf-q4f16_1'...
remote: Enumerating objects: 126, done.
remote: Total 126 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (126/126), 497.08 KiB | 4.07 MiB/s, done.
Filtering content: 100% (116/116), 3.53 GiB | 70.03 MiB/s, done.
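
If `git lfs` was not set up before cloning, the weight files arrive as small text pointers rather than real data. A Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`, so a quick scan can catch this. The following is a sketch rather than an official check. `find_lfs_pointers` is a hypothetical helper name, and the directory is the clone target used above.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still LFS pointers, not real data."""
    pointers = []
    for path in sorted(Path(model_dir).rglob("*")):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


model_dir = Path("dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1")
if model_dir.is_dir():
    leftover = find_lfs_pointers(model_dir)
    print(f"{len(leftover)} file(s) still look like LFS pointers")
```

If any files are reported, running `git lfs pull` inside the cloned folder should fetch the actual contents.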


## Let's Chat!

Before we can chat with the model, we must first import `ChatModule` and create an instance of it. The `ChatModule` must be initialized with the appropriate model name.

In [None]:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

System automatically detected device: cuda
Using model folder: /content/dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
Using mlc chat config: /content/dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1/mlc-chat-config.json
Using library model: /content/dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so



Note that the above invocation abstracts away the logic for finding the relevant model directory and prebuilt library paths. To specify these manually, you could run the following instead (which would be equivalent to the above).

```python
cm = ChatModule(model="dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1", lib_path="dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so")
```
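
To make the naming convention explicit, the sketch below rebuilds both paths from the model name and device and reports whether they exist. It only mirrors the layout visible in the log output above and is not the library's actual lookup logic. `expected_paths` is a hypothetical helper name, and the `.so` suffix assumes a Linux build.

```python
from pathlib import Path


def expected_paths(model_name, device="cuda", root="dist/prebuilt"):
    """Build the model folder and library path following the layout above."""
    root = Path(root)
    model_dir = root / f"mlc-chat-{model_name}"
    lib_path = root / "lib" / f"{model_name}-{device}.so"
    return model_dir, lib_path


model_dir, lib_path = expected_paths("Llama-2-7b-chat-hf-q4f16_1")
# Report whether the chat config and the prebuilt library are in place.
for path in (model_dir / "mlc-chat-config.json", lib_path):
    print(f"{path}: {'found' if path.exists() else 'missing'}")
```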

That is all that is needed to set up the `ChatModule`. You can now chat with the model by entering any prompt you'd like. Try it out below!

In [None]:
output = cm.generate(
    prompt="When was Python released?",
    progress_callback=StreamToStdout(callback_interval=2),
)

Great, thank you for asking! Python was first released in 1991 by Guido van Rossum. He was a Dutch computer programmer and he created the Python programming language as an interpreted, object-oriented, and high-level language. The first version of Python was released on February 20, 1991, and it has been continuously updated and improved since then. 🐍 📚


You can also rerun the code block below as many times as you like to interact with the model in a chat style.

In [None]:
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Prompt: When was Python released?
Great, thank you for asking! Python was first released in 1991 by Guido van Rossum. He was a Dutch computer programmer and he created the Python programming language as an interpreted, object-oriented, and high-level language. The first version of Python was released on February 20, 1991, and it has been continuously updated and improved since then. 🐍 📚
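
Instead of rerunning the cell by hand, the rounds can be wrapped in a small loop. The sketch below is illustrative only. `chat_rounds` and `echo` are hypothetical names, and `echo` is a stand-in so the example runs without a model. It also assumes that `cm.generate` returns the reply text, as the `output = ...` assignment above suggests.

```python
def chat_rounds(generate, prompts):
    """Send each prompt in order and collect (prompt, reply) pairs."""
    transcript = []
    for prompt in prompts:
        reply = generate(prompt)
        transcript.append((prompt, reply))
    return transcript


# Stand-in generator so the sketch runs without a model. With the
# ChatModule from above you could pass instead:
#   lambda p: cm.generate(prompt=p, progress_callback=StreamToStdout(callback_interval=2))
def echo(prompt):
    return f"(reply to: {prompt})"


for prompt, reply in chat_rounds(echo, ["When was Python released?", "Summarize that."]):
    print(f"Prompt: {prompt}\n{reply}\n")
```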


In [None]:
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Of course! Here is a summary of my previous response:

Python was first released in 1991 by Guido van Rossum. He created the Python programming language as an interpreted, object-oriented, and high-level language. The first version of Python was released on February 20, 1991, and it has been continuously updated and improved since then.


To check the generation speed of the chatbot, you can print the statistics.

In [None]:
print(cm.stats())

prefill: 101.9 tok/s, decode: 39.5 tok/s
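
The statistics come back as a plain string, so comparing runs numerically needs a little parsing. Here is a minimal sketch, assuming the string keeps the `name: value tok/s` format shown above. `parse_stats` is a hypothetical helper name.

```python
import re


def parse_stats(stats_text):
    """Turn 'prefill: X tok/s, decode: Y tok/s' into a dict of floats."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)\s*tok/s", stats_text)
    return {name: float(value) for name, value in pairs}


# String copied from the output above; with a live module, pass cm.stats().
rates = parse_stats("prefill: 101.9 tok/s, decode: 39.5 tok/s")
print(rates)
```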


By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [None]:
cm.reset_chat()

### Benchmark Performance

To benchmark performance, we can use the `benchmark_generate` method of `ChatModule`. It takes an input prompt and the number of tokens to generate. Unlike `generate`, it ignores the system prompt and the model's stop criterion: it generates tokens as a plain language-model continuation and stops only once the requested number of tokens has been produced. After calling `benchmark_generate`, we can use `stats` to check the performance.

In [None]:
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()

Benchmarking is the process of selecting, using, and comparing measures of performance or quality against which a particular organization, product, or service can be evaluated. sierpina / iStock / Getty Images Benchmarking is the process of evaluating how well an organization, product, or service performs in relation to industry standards or best practices. It involves identifying key performance indicators (KPIs) and measuring progress or success against those indicators. Benchmarking can be used to identify areas for improvement, measure progress over time, or compare performance across different organizations. There are several types of benchmarking, including: 1. Internal benchmarking: This involves measuring the performance of an organization against its own past performance or against a target or goal. 2. External benchmarking: This involves measuring the performance of an organization against industry standards or best practices. 3. Competitive benchmarking: This involves measur

'prefill: 44.7 tok/s, decode: 38.5 tok/s'
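
As a rough sanity check, the decode rate can be turned into an estimated generation time by dividing the token count by the rate. The sketch below assumes the decode rate stays roughly constant over the run and ignores prefill time. `decode_seconds` is a hypothetical helper name.

```python
def decode_seconds(num_tokens, tokens_per_second):
    """Approximate wall-clock time spent decoding num_tokens tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# 512 generated tokens at the decode rate reported above (38.5 tok/s).
print(f"{decode_seconds(512, 38.5):.1f} s")
```

With the numbers from this run, that works out to roughly 13 seconds for the 512-token benchmark.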