# LM Format Enforcer Integration with LlamaIndex

<a target="_blank" href="https://colab.research.google.com/github/noamgat/lm-format-enforcer/blob/main/samples/colab_llamacpppython_integration.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to integrate LM Format Enforcer with the [LlamaIndex](https://github.com/run-llama/llama_index) library. LlamaIndex abstracts the underlying LLM but exposes an interface for passing parameters to it, so we reuse our existing integrations with `transformers` and `llama-cpp-python`.

### Setting up the COLAB runtime (user action required)

This Colab-friendly notebook demonstrates the enforcer on Llama 2. It can run on a free GPU on Google Colab.
Make sure that your runtime is set to GPU:

Menu Bar -> Runtime -> Change runtime type -> T4 GPU (at the time of writing this notebook). [Guide here](https://www.codesansar.com/deep-learning/using-free-gpu-tpu-google-colab.htm).

## Installing dependencies

We begin by installing the dependencies.



In [1]:
!pip install llama-index lm-format-enforcer torch transformers llama-cpp-python accelerate bitsandbytes cpm_kernels

# When running from source / developing the library, use this instead
# %load_ext autoreload
# %autoreload 2
# import sys
# import os
# sys.path.append(os.path.abspath('..'))
## os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

A few helper functions to format the displayed prompts and answers.

In [2]:
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))



### Preparing our prompt and target output format

We set up the prompting style according to the [Llama2 demo](https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat/blob/main/app.py). We simplify the implementation a bit as we don't need chat history for this demo.
We use JSON Schema output for this example, but regex output is also available.

In [3]:

from pydantic import BaseModel
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

import json

question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
# model_json_schema() is the Pydantic V2 replacement for the deprecated schema_json()
question_with_schema = f'{question}{json.dumps(AnswerFormat.model_json_schema())}'
prompt = get_prompt(question_with_schema)
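
To make the prompt layout concrete, here is a minimal stand-alone sketch of the same template applied to a short message and a small hand-written schema. The name `build_llama2_prompt`, the example system prompt, and the one-field `toy_schema` are illustrative assumptions and are not part of the notebook's cells; only the standard library is used.

```python
import json

# Hypothetical stand-alone copy of the notebook's template (illustration only).
def build_llama2_prompt(message: str, system_prompt: str) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

# Hand-written one-field schema, standing in for the Pydantic-generated one.
toy_schema = {
    'properties': {'first_name': {'title': 'First Name', 'type': 'string'}},
    'required': ['first_name'],
    'title': 'ToyAnswer',
    'type': 'object',
}

toy_question = 'Who is Michael Jordan? You MUST answer using the following json schema: '
toy_prompt = build_llama2_prompt(toy_question + json.dumps(toy_schema), 'Answer briefly.')
print(toy_prompt)
```

The schema is embedded in the prompt only as a hint to the model. The guarantee that the answer conforms to it comes from the enforcer applied during generation, as the following sections show.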


# LlamaIndex + HuggingFace Transformers

This demo uses Llama 2, so you will need to create a free Hugging Face account, request access to the Llama 2 model, create an access token, and enter it when the next cell prompts for it.

Links:

- [Request access to llama model](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). See the "Access Llama 2 on Hugging Face" section.
- [Create huggingface access token](https://huggingface.co/settings/tokens)

In [None]:
!huggingface-cli login

### Loading the model

We load the model directly with the `transformers` API so that we can pass precise quantization parameters to it. Afterwards, we initialize the LlamaIndex `HuggingFaceLLM` from it.

In [4]:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

model_id = 'meta-llama/Llama-2-7b-chat-hf'
device = 'cuda'

if torch.cuda.is_available():
    config = AutoConfig.from_pretrained(model_id)
    config.pretraining_tp = 1
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        config=config,
        torch_dtype=torch.float16,
        load_in_8bit=True,
        device_map='auto'
    )
else:
    raise Exception('GPU not available')
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token_id is None:
    # Required for batching example
    tokenizer.pad_token_id = tokenizer.eos_token_id 

llm_huggingface = HuggingFaceLLM(model=model, tokenizer=tokenizer)

If the previous cell executed successfully, you have properly set up your Colab runtime and Hugging Face account!

### Integrating LM Format Enforcer and generating JSON Schema

Now we demonstrate using `JsonSchemaParser`. With enforcement enabled, the output will always be in a format that the parser accepts.
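
Before the real integration, a toy sketch may help show the idea behind a prefix-allowed-tokens function: at every step it receives the tokens so far and returns the ids that may legally come next, so a model that would rather chat is forced to stay inside the format. Everything here (`VOCAB`, `VALID_OUTPUTS`, `fake_scores`, `greedy_decode`) is a hypothetical stand-in written for this illustration. It is not how LM Format Enforcer is implemented, and unlike the real `transformers` callback it works on plain lists of generated ids rather than tensors that include the prompt.

```python
from typing import List

# Toy vocabulary: the list index is the token id (an assumption for this sketch).
VOCAB = ['{', '}', '"', 'a', '1', ':', ' ', 'Sure']
# Toy "format": the complete outputs we are willing to accept.
VALID_OUTPUTS = ['{"a":1}', '{"a":11}']

def is_valid_prefix(text: str) -> bool:
    return any(target.startswith(text) for target in VALID_OUTPUTS)

def allowed_tokens_fn(batch_id: int, input_ids: List[int]) -> List[int]:
    """Same call shape as a prefix_allowed_tokens_fn: return the permitted next ids."""
    text = ''.join(VOCAB[i] for i in input_ids)
    return [tid for tid, tok in enumerate(VOCAB) if is_valid_prefix(text + tok)]

def fake_scores(input_ids: List[int]) -> List[float]:
    # A fixed toy "model" that always prefers the chatty token 'Sure'.
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

def greedy_decode(constrained: bool, max_tokens: int = 10) -> str:
    ids: List[int] = []
    for _ in range(max_tokens):
        candidates = allowed_tokens_fn(0, ids) if constrained else list(range(len(VOCAB)))
        if not candidates:  # nothing can legally follow: the output is complete
            break
        scores = fake_scores(ids)
        ids.append(max(candidates, key=lambda tid: scores[tid]))
    return ''.join(VOCAB[i] for i in ids)

unconstrained_text = greedy_decode(constrained=False, max_tokens=3)
constrained_text = greedy_decode(constrained=True)
print(repr(unconstrained_text))
print(repr(constrained_text))
```

The unconstrained run keeps emitting the model's favourite token, while the constrained run can only produce one of the accepted strings. The real enforcer replaces the hard-coded list of valid outputs with a character-level parser driven by the JSON schema.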

In [6]:
from typing import Optional
from lmformatenforcer import CharacterLevelParser, JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

def llamaindex_huggingface_lm_format_enforcer(llm: HuggingFaceLLM, prompt: str, character_level_parser: Optional[CharacterLevelParser]) -> str:
    prefix_allowed_tokens_fn = None
    if character_level_parser:
        prefix_allowed_tokens_fn = build_transformers_prefix_allowed_tokens_fn(llm._tokenizer, character_level_parser)
    
    # If the character level parser changes on each call, inject it before calling complete(). If it's the same
    # format every time, you can set it once after creating the HuggingFaceLLM model.
    llm.generate_kwargs['prefix_allowed_tokens_fn'] = prefix_allowed_tokens_fn
    output = llm.complete(prompt)
    text: str = output.text
    return text

display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without json schema enforcing:")
result = llamaindex_huggingface_lm_format_enforcer(llm_huggingface, prompt, None)
display_content(result)

display_header("Answer, With json schema enforcing:")
# model_json_schema() is the Pydantic V2 replacement for the deprecated schema()
result = llamaindex_huggingface_lm_format_enforcer(llm_huggingface, prompt, JsonSchemaParser(AnswerFormat.model_json_schema()))
display_content(result)


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Please give me information about Michael Jordan. You MUST answer using the following json schema: {"properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"], "title": "AnswerFormat", "type": "object"} [/INST]
```

**Answer, Without json schema enforcing:**

```
  Of course! I'd be happy to help you with information about Michael Jordan. Here is the information in the format you requested:
{
"title": "AnswerFormat",
"type": "object",
"properties": {
"first_name": {"title": "First Name", "type": "string"},
"last_name": {"title": "Last Name", "type": "string"},
"year_of_birth": {"title":
```

**Answer, With json schema enforcing:**

```
 { "first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15 }
```
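
The difference between the two answers can also be checked programmatically. The sketch below uses only the standard library. `matches_answer_format` and `EXPECTED_TYPES` are hypothetical helpers written for this illustration (they mirror the fields of `AnswerFormat` by hand), and the two strings are copied from the outputs above, with the unenforced one shortened.

```python
import json

# Field names and Python types mirroring AnswerFormat (maintained by hand here).
EXPECTED_TYPES = {
    'first_name': str,
    'last_name': str,
    'year_of_birth': int,
    'num_seasons_in_nba': int,
}

def matches_answer_format(text: str) -> bool:
    """Return True if text is a JSON object with all required fields and types."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not set(EXPECTED_TYPES) <= set(data):
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(data[key], expected) and not isinstance(data[key], bool)
        for key, expected in EXPECTED_TYPES.items()
    )

# Beginning of the unenforced answer shown above.
unenforced_output = """  Of course! I'd be happy to help you with information about Michael Jordan. Here is the information in the format you requested:
{
"title": "AnswerFormat",
"type": "object","""

enforced_output = ' { "first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15 }'

print(matches_answer_format(unenforced_output))
print(matches_answer_format(enforced_output))
```

The unenforced answer fails at the very first step because it starts with conversational text, and it then repeats the schema itself rather than filling it in.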

# LlamaIndex + LlamaCPP

This demo uses [Llama2 gguf weights by TheBloke](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF). We will use huggingface hub to download the model.

In [2]:
from llama_index.llms import LlamaCPP
from huggingface_hub import hf_hub_download
downloaded_model_path = hf_hub_download(repo_id="TheBloke/Llama-2-7b-Chat-GGUF", filename="llama-2-7b-chat.Q5_K_M.gguf")
llm_llamacpp = LlamaCPP(model_path=downloaded_model_path)

### Integrating LM Format Enforcer and generating JSON Schema

Now we demonstrate using `JsonSchemaParser`. With enforcement enabled, the output will always be in a format that the parser accepts.
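
With `llama-cpp-python` the constraint is applied through a logits processor: a callable that receives the token ids so far together with the scores for the next token, and returns adjusted scores. A common way to forbid a token is to set its score to negative infinity. The following toy sketch illustrates that masking step with plain Python lists; `make_masking_processor`, `TOY_VOCAB` and `digits_only` are hypothetical names for this illustration, and the real processors operate on NumPy arrays over the full vocabulary.

```python
import math
from typing import Callable, List, Sequence

def make_masking_processor(allowed_ids_fn: Callable[[Sequence[int]], List[int]]):
    """Build a callable taking (input_ids, scores) and returning masked scores."""
    def processor(input_ids: Sequence[int], scores: List[float]) -> List[float]:
        allowed = set(allowed_ids_fn(input_ids))
        # Disallowed tokens get -inf so they can never be sampled.
        return [s if i in allowed else -math.inf for i, s in enumerate(scores)]
    return processor

# Toy vocabulary: the list index is the token id.
TOY_VOCAB = ['Hello', '1', ' ', '9', '}']

def digits_only(input_ids: Sequence[int]) -> List[int]:
    # Toy constraint: only tokens made of digits may be generated.
    return [i for i, tok in enumerate(TOY_VOCAB) if tok.isdigit()]

masking_processor = make_masking_processor(digits_only)
raw_scores = [5.0, 1.0, 4.0, 2.0, 3.0]  # the "model" prefers 'Hello'
masked_scores = masking_processor([], raw_scores)
best_id = max(range(len(masked_scores)), key=lambda i: masked_scores[i])
print(masked_scores, TOY_VOCAB[best_id])
```

After masking, the highest-scoring token is the best one among those the constraint allows, even though the model originally preferred a different token.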

In [22]:
from typing import Optional
from llama_cpp import LogitsProcessorList
from lmformatenforcer import CharacterLevelParser, JsonSchemaParser
from lmformatenforcer.integrations.llamacpp import build_llamacpp_logits_processor

def llamaindex_llamacpp_lm_format_enforcer(llm: LlamaCPP, prompt: str, character_level_parser: Optional[CharacterLevelParser]) -> str:
    logits_processors: Optional[LogitsProcessorList] = None
    if character_level_parser:
        logits_processors = LogitsProcessorList([build_llamacpp_logits_processor(llm._model, character_level_parser)])
    
    # If the character level parser changes on each call, inject it before calling complete(). If it's the same
    # format every time, you can set it once after creating the LlamaCPP model.
    llm.generate_kwargs['logits_processor'] = logits_processors
    output = llm.complete(prompt)
    text: str = output.text
    return text

display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without json schema enforcing:")
result = llamaindex_llamacpp_lm_format_enforcer(llm_llamacpp, prompt, None)
display_content(result)

display_header("Answer, With json schema enforcing:")
# model_json_schema() is the Pydantic V2 replacement for the deprecated schema()
result = llamaindex_llamacpp_lm_format_enforcer(llm_llamacpp, prompt, JsonSchemaParser(AnswerFormat.model_json_schema()))
display_content(result)


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Please give me information about Michael Jordan. You MUST answer using the following json schema: {"properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"], "title": "AnswerFormat", "type": "object"} [/INST]
```

**Answer, Without json schema enforcing:**

Llama.generate: prefix-match hit

llama_print_timings:        load time = 18009.19 ms
llama_print_timings:      sample time =    33.98 ms /    99 runs   (    0.34 ms per token,  2913.05 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  9239.67 ms /    99 runs   (   93.33 ms per token,    10.71 tokens per second)
llama_print_timings:       total time =  9415.06 ms


```
  Of course! I'd be happy to help you with information about Michael Jordan. Here is the information in the format you requested:
{
"title": "AnswerFormat",
"type": "object",
"properties": {
"first_name": {"title": "First Name", "type": "string"},
"last_name": {"title": "Last Name", "type": "string"},
"year_of_birth": {"title":
```

**Answer, With json schema enforcing:**

Llama.generate: prefix-match hit

llama_print_timings:        load time = 18009.19 ms
llama_print_timings:      sample time =    17.44 ms /    53 runs   (    0.33 ms per token,  3038.82 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  4980.95 ms /    53 runs   (   93.98 ms per token,    10.64 tokens per second)
llama_print_timings:       total time =  5452.10 ms


```
  { "first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15 }
```

As you can see, the enforced output matches the required schema, while the unenforced output does not. We have successfully integrated LM Format Enforcer with LlamaIndex on top of llama.cpp!

A closing note: the last cell probably took quite a long time to run, because this notebook uses CPU inference with `llama-cpp-python`. LM Format Enforcer's runtime footprint is negligible compared to the model's runtime.
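
A rough way to check this against the `llama_print_timings` output above is to compare the total time per generated token in the two runs. The numbers below are copied from those logs; the dictionary layout and variable names are just for this sketch. Total time is used rather than eval time on the assumption that work done outside the model's forward pass, such as a logits processor, is not counted in the eval figure.

```python
# Figures copied from the llama_print_timings output above (one run each).
runs = {
    'unenforced': {'total_ms': 9415.06, 'eval_ms': 9239.67, 'tokens': 99},
    'enforced': {'total_ms': 5452.10, 'eval_ms': 4980.95, 'tokens': 53},
}

# Wall-clock cost per generated token, and the model-only (eval) cost per token.
total_per_token = {name: r['total_ms'] / r['tokens'] for name, r in runs.items()}
eval_per_token = {name: r['eval_ms'] / r['tokens'] for name, r in runs.items()}

overhead_ms = total_per_token['enforced'] - total_per_token['unenforced']
overhead_pct = 100 * overhead_ms / total_per_token['unenforced']

for name in runs:
    print(f"{name}: {total_per_token[name]:.2f} ms/token total, "
          f"{eval_per_token[name]:.2f} ms/token eval")
print(f"extra cost with enforcing: {overhead_ms:.2f} ms/token ({overhead_pct:.1f}%)")
```

In this single pair of runs the enforced answer costs a few extra milliseconds per token, while the model's own evaluation time per token is practically unchanged. One run per setting is too small a sample for firm conclusions, so treat the result as an indication only.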