# LM Format Enforcer Integration with llama.cpp (python bindings)

<a target="_blank" href="https://colab.research.google.com/github/noamgat/lm-format-enforcer/blob/main/samples/colab_llamacpppython_integration.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to integrate LM Format Enforcer with the llama.cpp library via its [python bindings](https://github.com/abetlen/llama-cpp-python). We do this through the bindings' ```LogitsProcessor``` interface; the connection itself takes about 30 lines of code.

This sample notebook focuses on simplicity and ease of setup. It therefore uses the CPU version of llama.cpp, which makes inference slower. For production use, you should use the GPU version of llama.cpp.

## Installing dependencies

We begin by installing the dependencies.



In [5]:
!pip install llama-cpp-python lm-format-enforcer huggingface-hub pandas numpy

# When running from source / developing the library, use this instead
# %load_ext autoreload
# %autoreload 2
# import sys
# import os
# sys.path.append(os.path.abspath('..'))
## os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

## Loading the model

This demo uses [Llama2 gguf weights by TheBloke](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF). We will use huggingface hub to download the model.

In [3]:
from llama_cpp import Llama
from huggingface_hub import hf_hub_download
downloaded_model_path = hf_hub_download(repo_id="TheBloke/Llama-2-7b-Chat-GGUF", filename="llama-2-7b-chat.Q5_K_M.gguf")
llm = Llama(model_path=downloaded_model_path)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/noamgat/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))


If the previous cell executed successfully, you have properly set up your Colab runtime and loaded the llama.cpp model!

A few helper functions to make display nicer.

In [2]:
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))

## Setting up the prompt for the specific language model

We set up the prompting style according to the [Llama2 demo](https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat/blob/main/app.py). We simplify the implementation a bit as we don't need chat history for this demo.

In [3]:
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

## Generating text with the LM Format Enforcer Logits Processor

llama.cpp's python bindings have a ```LogitsProcessor``` interface similar to the one in Hugging Face Transformers. We will connect to this API and set the logits of disallowed tokens to negative infinity, ensuring they are never selected.

We use the high level llama.cpp python interface to create a ```TokenEnforcer```, and a ```LogitsProcessor``` that uses it.
The integration can be found in [lmformatenforcer/integrations/llamacpp.py](https://github.com/noamgat/lm-format-enforcer/blob/main/lmformatenforcer/integrations/llamacpp.py).

To integrate our logits processor with LlamaCpp, we create a ```LogitsProcessorList``` and pass it as the `logits_processor` keyword argument when calling the ```Llama``` object.
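The masking idea described above can be sketched in a few lines. This is only an illustration of the technique, not the library's implementation: `mask_disallowed_logits` and `softmax` are hypothetical helpers, the scores and allowed token ids are made up, and plain Python lists stand in for the score arrays a real logits processor receives.

```python
import math
from typing import Iterable, List


def mask_disallowed_logits(scores: List[float], allowed_token_ids: Iterable[int]) -> List[float]:
    """Keep the score of every allowed token and set all others to -inf."""
    allowed = set(allowed_token_ids)
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]


def softmax(scores: List[float]) -> List[float]:
    """Convert logits to probabilities; -inf logits get probability 0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 4 tokens; the model prefers token 0,
# but only tokens 1 and 2 are allowed at this step (hypothetical values).
scores = [2.0, 0.5, 1.0, -1.0]
masked = mask_disallowed_logits(scores, allowed_token_ids=[1, 2])
probs = softmax(masked)

print(masked)
print(probs)
```

After masking, tokens 0 and 3 have zero probability, so sampling can only pick an allowed token, and the relative preference between the allowed tokens is unchanged.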


In [6]:
from typing import Optional
from llama_cpp import LogitsProcessorList
from lmformatenforcer import CharacterLevelParser
from lmformatenforcer.integrations.llamacpp import build_llamacpp_logits_processor, build_token_enforcer_tokenizer_data

tokenizer_data = build_token_enforcer_tokenizer_data(llm)

def llamacpp_with_character_level_parser(prompt: str, character_level_parser: Optional[CharacterLevelParser]) -> str:
    logits_processors: Optional[LogitsProcessorList] = None
    if character_level_parser:
        logits_processors = LogitsProcessorList([build_llamacpp_logits_processor(tokenizer_data, character_level_parser)])
    
    output = llm(prompt, logits_processor=logits_processors, max_tokens=100)
    text: str = output['choices'][0]['text']
    return text

## LlamaCpp + JSON Use case

Now we demonstrate using ```JsonSchemaParser```. We create a pydantic model, generate the schema from it, and use that to enforce the format.
The output will always be in a format that can be parsed by the parser.

In [7]:
from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel

from typing import List

class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
question_with_schema = f'{question}{AnswerFormat.schema_json()}'
prompt = get_prompt(question_with_schema)

display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without json schema enforcing:")
result = llamacpp_with_character_level_parser(prompt, None)
display_content(result)

display_header("Answer, With json schema enforcing:")
result = llamacpp_with_character_level_parser(prompt, JsonSchemaParser(AnswerFormat.schema()))
display_content(result)

display_header("Answer, With json mode (json output, no specific schema) enforcing:")
result = llamacpp_with_character_level_parser(prompt, JsonSchemaParser(None))
display_content(result)

/tmp/ipykernel_395615/2888867492.py:13: PydanticDeprecatedSince20: The `schema_json` method is deprecated; use `model_json_schema` and json.dumps instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.4/migration/
  question_with_schema = f'{question}{AnswerFormat.schema_json()}'


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Please give me information about Michael Jordan. You MUST answer using the following json schema: {"properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"], "title": "AnswerFormat", "type": "object"} [/INST]
```

**Answer, Without json schema enforcing:**

Llama.generate: prefix-match hit

llama_print_timings:        load time = 16719.92 ms
llama_print_timings:      sample time =    43.66 ms /   128 runs   (    0.34 ms per token,  2931.54 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 11262.72 ms /   128 runs   (   87.99 ms per token,    11.36 tokens per second)
llama_print_timings:       total time = 11447.94 ms


```
  Of course! I'd be happy to provide information about Michael Jordan in a safe and respectful manner. Here is the answer in the format requested:
{
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15
}
Here are some key facts about Michael Jordan:
* First Name: Michael
* Last Name: Jordan
* Year of Birth: 1963
* Number of Seasons in the
```

**Answer, With json schema enforcing:**

/tmp/ipykernel_395615/2888867492.py:24: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.4/migration/
  result = llamacpp_with_character_level_parser(llm, prompt, JsonSchemaParser(AnswerFormat.schema()))
Llama.generate: prefix-match hit

llama_print_timings:        load time = 16760.31 ms
llama_print_timings:      sample time =    18.00 ms /    55 runs   (    0.33 ms per token,  3055.39 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  5362.52 ms /    55 runs   (   97.50 ms per token,    10.26 tokens per second)
llama_print_timings:       total time =  5582.66 ms


```
  {
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15 }



```

**Answer, With json mode (json output, no specific schema) enforcing:**

Llama.generate: prefix-match hit

llama_print_timings:        load time = 16719.92 ms
llama_print_timings:      sample time =    42.56 ms /   128 runs   (    0.33 ms per token,  3007.80 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 11260.41 ms /   128 runs   (   87.97 ms per token,    11.37 tokens per second)
llama_print_timings:       total time = 11722.26 ms


```
  {
"properties": {
"first_name": {"title": "First Name", "type": "string"},
"last_name": {"title": "Last Name", "type": "string"},
"year_of_birth": {"title": "Year Of Birth", "type": "integer"},
"num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}
},
"required": ["first_name", "last_name", "year_of_birth", "num_seasons_
```

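As you can see, the schema-enforced output is a JSON object that matches the required schema. The unenforced answer wraps its JSON in extra prose, and the schema-less JSON mode answer echoes the schema instead of following it. We have successfully integrated with llama.cpp!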
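One way to check this claim programmatically is to try parsing each answer. The sketch below uses only the standard library; `matches_answer_format` and `EXPECTED_FIELDS` are hypothetical names, the field list is copied from the `AnswerFormat` model above, and the unenforced sample is abbreviated from the output shown earlier.

```python
import json

# Field names and types copied from the AnswerFormat model above.
EXPECTED_FIELDS = {
    "first_name": str,
    "last_name": str,
    "year_of_birth": int,
    "num_seasons_in_nba": int,
}


def matches_answer_format(text: str) -> bool:
    """Return True if text is a JSON object with every expected field and type."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Covers surrounding prose, truncated output, and other invalid JSON.
        return False
    if not isinstance(data, dict):
        return False
    return all(type(data.get(name)) is expected for name, expected in EXPECTED_FIELDS.items())


enforced = '  {\n"first_name": "Michael",\n"last_name": "Jordan",\n"year_of_birth": 1963,\n"num_seasons_in_nba": 15 }\n'
# Abbreviated version of the unenforced answer: JSON wrapped in prose.
unenforced = '  Of course! Here is the answer in the format requested:\n{\n"first_name": "Michael"\n}'

print(matches_answer_format(enforced))    # True
print(matches_answer_format(unenforced))  # False
```

In the notebook itself, the Pydantic model can do the same job: with Pydantic V2, `AnswerFormat.model_validate_json(result)` raises a validation error when the text does not match the model.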

Note: the last cell probably took quite a long time to run, because this notebook uses CPU inference. LM Format Enforcer's runtime overhead is negligible compared to the model's runtime.

## Analyzing the impact of the format enforcer

There is an API to analyze the impact of the format enforcer on the model's output. We will use it to see how many tokens were changed by the format enforcer.

Passing `analyze=True` to `build_llamacpp_logits_processor()` makes the returned object expose an `analyzer` property, from which we can get a report dictionary after the run.

In [8]:
import pandas as pd

parser = JsonSchemaParser(AnswerFormat.schema())
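# Note the analyze=True flag, which is required to collect the information for the analysis later.
# We reuse the tokenizer_data built earlier, as in llamacpp_with_character_level_parser().
logits_processors = LogitsProcessorList([build_llamacpp_logits_processor(tokenizer_data, parser, analyze=True)])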

display_header("Prompt:")
display_content(prompt)

display_header("Answer:")
output = llm(prompt, logits_processor=logits_processors)
text: str = output['choices'][0]['text']
display_content(text)

output_tokens = list(llm.eval_tokens)
# These two lines are possible because of the analyze=True flag above
analyzer = logits_processors[0].analyzer
report = analyzer.generate_report_dict(output_tokens)

enforced_scores = pd.DataFrame(report)
# Setting some display options to make the table more readable
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 999)
pd.set_option('display.float_format', ' {:,.5f}'.format)
display(enforced_scores)

/tmp/ipykernel_395615/4231260298.py:3: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.4/migration/
  parser = JsonSchemaParser(AnswerFormat.schema())


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Please give me information about Michael Jordan. You MUST answer using the following json schema: {"properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"], "title": "AnswerFormat", "type": "object"} [/INST]
```

**Answer:**

Llama.generate: prefix-match hit

llama_print_timings:        load time = 16760.31 ms
llama_print_timings:      sample time =    17.89 ms /    55 runs   (    0.33 ms per token,  3074.52 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  5362.62 ms /    55 runs   (   97.50 ms per token,    10.26 tokens per second)
llama_print_timings:       total time =  5685.62 ms


```
  {
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15
}
```

| | generated_token | generated_token_idx | generated_score | leading_token | leading_token_idx | leading_score |
|---|---|---|---|---|---|---|
| 0 | | 29871 | 0.99998 | | 29871 | 0.99998 |
| 1 | `{` | 426 | 6e-05 | `Of` | 4587 | 0.90464 |
| 2 | `\n` | 13 | 0.99852 | `\n` | 13 | 0.99852 |
| 3 | `"` | 29908 | 0.98 | `"` | 29908 | 0.98 |
| 4 | `first` | 4102 | 0.35435 | `first` | 4102 | 0.35435 |
| 5 | `_` | 29918 | 0.99989 | `_` | 29918 | 0.99989 |
| 6 | `name` | 978 | 1.0 | `name` | 978 | 1.0 |
| 7 | `":` | 1115 | 0.99993 | `":` | 1115 | 0.99993 |
| 8 | `"` | 376 | 0.99887 | `"` | 376 | 0.99887 |
| 9 | `Michael` | 24083 | 0.97372 | `Michael` | 24083 | 0.97372 |


The interesting timestep is timestep 1: the model's leading token was `Of` (score 0.90464), but the format enforcer forced the `{` character (score 6e-05). In every other row shown, the generated token is the model's own leading token, so the model produced in-format JSON on its own and LM Format Enforcer did not have to intervene. Because the enforcer lets the model control whitespace, the output uses the JSON layout the LLM knows best, without the enforcer needing explicit knowledge of it.
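Finding such timesteps by eye does not scale to long outputs. The sketch below picks them out automatically by comparing the generated and leading token ids. It assumes the report is a dictionary mapping the column names shown in the table to equal-length lists; `find_interventions` is a hypothetical helper, and the sample data is copied from the first five rows of the table.

```python
from typing import Dict, List


def find_interventions(report: Dict[str, List]) -> List[int]:
    """Return the timesteps where the generated token differs from the model's leading token."""
    generated = report["generated_token_idx"]
    leading = report["leading_token_idx"]
    return [step for step, (g, l) in enumerate(zip(generated, leading)) if g != l]


# First five rows of the report table above.
sample_report = {
    "generated_token_idx": [29871, 426, 13, 29908, 4102],
    "leading_token_idx": [29871, 4587, 13, 29908, 4102],
}

interventions = find_interventions(sample_report)
print(interventions)  # [1]
```

With the DataFrame from the cell above, the same filter can be written as `enforced_scores[enforced_scores["generated_token_idx"] != enforced_scores["leading_token_idx"]]`.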