# MLC-LLM Structured Generation with XGrammar

This tutorial gives a quick overview of how to generate structured text with XGrammar in MLC LLM using Python.
We will chat with the Llama 3.2 model.
For the easiest setup, we recommend running it in a Google Colab notebook.

Structured generation lets LLMs go beyond basic chat and plain text generation.
When the output format can be controlled, an LLM can serve as a standard tool and integrate more easily into production applications.
MLCEngine offers state-of-the-art structured generation through its XGrammar integration.
Importantly, structured generation support is built into the engine, so it is available on every API platform that MLCEngine supports.

Learn more about:
* MLC LLM: https://mlc.ai/mlc-llm/docs
* XGrammar: https://xgrammar.mlc.ai/docs

Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_mlc_xgrammar_structured_generation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Install MLC LLM

We will start by setting up the environment. First, create a new Conda environment in which to run the rest of the notebook.

```
conda create --name mlc-llm python=3.11
conda activate mlc-llm
```

**Google Colab**

- If you are running this in a Google Colab notebook, you do not need to create a Conda environment.
- However, be sure to change your runtime to GPU by going to `Runtime` > `Change runtime type` and setting the hardware accelerator to "GPU".

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the driver version number as well as what GPUs are currently available for use.

In [1]:
!nvidia-smi

Fri Nov 22 01:44:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
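
The same check can be scripted from Python, which is handy outside a notebook where the `!` shell escape is unavailable. The following is a minimal sketch, assuming `nvidia-smi` is on `PATH` whenever an NVIDIA driver is installed; `gpu_summary` is a hypothetical helper name and not part of MLC LLM.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per GPU reported by nvidia-smi, or None if unavailable."""
    # Hypothetical helper: treats a missing nvidia-smi binary as "no CUDA GPU".
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
if gpus is None:
    print("No NVIDIA GPU detected; check the runtime type or driver installation.")
else:
    for gpu in gpus:
        print(gpu)
```

Returning `None` instead of raising keeps the check usable on machines without a GPU, where you would pick a different install command anyway.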
                                                                    

Next, let's install the MLC-AI and mlc-llm nightly build packages. If you are running in a Colab environment, you can run the following command as is. Otherwise, go to https://llm.mlc.ai/docs/install/mlc_llm.html and replace the command below with the one appropriate for your hardware and OS.

**Google Colab**: If you are using Colab, you may see red warnings such as "You must restart the runtime in order to use newly installed versions." For our purposes, you can disregard them; the notebook will still run correctly.

In [2]:
!pip install --pre mlc-ai-nightly-cu123 mlc-llm-nightly-cu123 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu123
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu123-0.18.dev249-cp310-cp310-manylinux_2_28_x86_64.whl (1026.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 987.8 kB/s eta 0:00:00
Collecting mlc-llm-nightly-cu123
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_llm_nightly_cu123-0.18.dev71-cp310-cp310-manylinux_2_28_x86_64.whl (177.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.1/177.1 MB 5.9 MB/s eta 0:00:00
Collecting fastapi (from mlc-llm-nightly-cu123)
  Downloading fastapi-0.115.5-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn (from mlc-llm-nightly-cu123)
  Downloading uvicorn-0.32.1-py3-none-any.whl.metadata (6.6 kB)
Collecting shortuuid (from mlc-llm-nightly-cu123)
  Downloading shortuuid-1.0.13-py3-none-any.whl.metadata

Let's confirm we have installed the packages successfully!

In [3]:
!python -c "import tvm; print('tvm installed properly!')"
!python -c "import mlc_llm; print('mlc_llm installed properly!')"

tvm installed properly!
mlc_llm installed properly!
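
If you prefer a check that reports missing packages instead of failing on the first import error, the lookup can be done without importing anything. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the module names `tvm` and `mlc_llm` are the ones imported in the cell above.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["tvm", "mlc_llm"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("tvm and mlc_llm are both available.")
```

Note that this only confirms the modules can be located. Importing them, as in the cell above, is still the stronger test, because it also catches broken native libraries.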


## General JSON Text Generation

MLC LLM supports two levels of structured generation: general JSON mode and schema customization. General JSON mode constrains the response to conform to JSON grammar. To use it, pass the argument `response_format={"type": "json_object"}` to the chat completion call. The request example in this section uses JSON mode.


Note: If you are NOT running in **Google Colab**, you may need to run `!conda install git git-lfs` to install `git` and `git-lfs` before running the following cell.

In [4]:
!git lfs install

Git LFS initialized.


In [5]:
from mlc_llm import MLCEngine

# Create the MLCEngine. The model will be automatically downloaded.
model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Generate JSON text with MLCEngine, backed by XGrammar.
prompt = "List 3 must-see places of interest in United States in JSON."
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    stream=True,
):
    print(chunk.choices[0].delta.content, end="", flush=True)

0it [00:00, ?it/s]
100%|██████████| 58/58 [00:18<00:00,  3.11it/s]


{"places": [
  {
    "name": "Grand Canyon",
    "location": "Arizona",
    "description": "One of the most iconic natural wonders in the United States, the Grand Canyon is a breathtaking example of erosion and geological history."
  },
  {
    "name": "Statue of Liberty",
    "location": "New York/New Jersey",
    "description": "A symbol of freedom and democracy, the Statue of Liberty is a must-see attraction on Liberty Island in New York Harbor."
  },
  {
    "name": "Golden Gate Bridge",
    "location": "California",
    "description": "An engineering marvel and iconic symbol of San Francisco, the Golden Gate Bridge is a must-see for its stunning views and rich history."
  }
]}
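
Since JSON mode constrains the output to valid JSON, the streamed pieces can usually be joined and handed straight to `json.loads`. The sketch below illustrates this with hard-coded strings standing in for the `chunk.choices[0].delta.content` values, so it runs without a model; `collect_json` is a hypothetical helper name.

```python
import json


def collect_json(pieces):
    """Join streamed text pieces and parse the result as JSON."""
    # Skip empty or None pieces defensively before joining.
    text = "".join(piece for piece in pieces if piece)
    return json.loads(text)


# Stand-ins for the text deltas a streaming response would yield.
pieces = [
    '{"places": [',
    '{"name": "Grand Canyon", ',
    '"location": "Arizona"}',
    "]}",
]

result = collect_json(pieces)
print(result["places"][0]["name"])
```

With a real engine you would append each `delta.content` to a list inside the loop and call the helper once the stream ends. If generation is cut short, for example by a token limit, the joined text may be incomplete, so it is still worth catching `json.JSONDecodeError`.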

## Structured Generation with Schema

MLCEngine also lets you customize the response JSON schema for each individual request. When a JSON schema is provided, MLCEngine generates responses that strictly adhere to it. Below is a request example with a customized JSON schema:

In [6]:
import json
import pydantic
from typing import List


class Country(pydantic.BaseModel):
    name: str
    capital: str


class Countries(pydantic.BaseModel):
    countries: List[Country]


# Get the JSON schema of "Countries"
schema = json.dumps(Countries.model_json_schema())
prompt = "Randomly list three countries and their capitals in JSON."

for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object", "schema": schema},
    stream=True,
):
    print(chunk.choices[0].delta.content, end="", flush=True)


{"countries": [{"name": "Japan", "capital": "Tokyo"}, {"name": "Australia", "capital": "Canberra"}, {"name": "Brazil", "capital": "Brasilia"}]}
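
Because structured generation is built into the engine, the same `response_format` should carry over to the other API surfaces. As a rough sketch, the code below builds (but does not send) an equivalent request for an OpenAI-compatible REST server. Several things here are assumptions rather than facts from this tutorial: the server address `http://127.0.0.1:8000`, the `/v1/chat/completions` path, and that the REST endpoint accepts the schema as a JSON string just like the Python API. The hand-written schema is an inlined approximation of what `Countries.model_json_schema()` produces, and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Hand-written JSON Schema, roughly equivalent to the Countries model above.
schema = {
    "type": "object",
    "properties": {
        "countries": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["name", "capital"],
            },
        }
    },
    "required": ["countries"],
}


def build_request(model, prompt, schema, base_url="http://127.0.0.1:8000"):
    """Build, without sending, a chat completion request with a schema constraint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the schema is passed as a JSON string, as in the Python API.
        "response_format": {"type": "json_object", "schema": json.dumps(schema)},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC",
    "Randomly list three countries and their capitals in JSON.",
    schema,
)
print(request.full_url)
```

Once a server is running, the request object can be passed to `urllib.request.urlopen`. Check the MLC LLM REST documentation for the actual address and accepted fields before relying on this.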