# LM Format Enforcer Integration with Haystack v2

<a target="_blank" href="https://colab.research.google.com/github/noamgat/lm-format-enforcer/blob/main/samples/colab_haystackv2_integration.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how you can integrate [LM Format Enforcer](https://github.com/noamgat/lm-format-enforcer) with the [Haystack](https://github.com/deepset-ai/haystack) library. Since Haystack abstracts the underlying LLM but opens an interface to pass parameters to it, we will use our existing integrations with `transformers` and `vLLM` to integrate with Haystack.

### Setting up the Colab runtime (user action required)

This Colab-friendly notebook demonstrates the enforcer on Llama 2. It can run on a free GPU on Google Colab.
Make sure that your runtime is set to GPU:

Menu Bar -> Runtime -> Change runtime type -> T4 GPU (at the time of writing this notebook). [Guide here](https://www.codesansar.com/deep-learning/using-free-gpu-tpu-google-colab.htm).

## Installing dependencies

We begin by installing the dependencies.



In [1]:
!pip install haystack-ai "transformers>=4.43.1" lm-format-enforcer accelerate bitsandbytes cpm_kernels huggingface_hub canals openai

# When running from source / developing the library, use this instead
# %load_ext autoreload
# %autoreload 2
# import sys
# import os
# sys.path.append(os.path.abspath('..'))
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

A few helper functions to render headers and model output nicely in the notebook.

In [2]:
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))



### Preparing our prompt and target output format

We set up the prompting style according to the [Llama2 demo](https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat/blob/main/app.py). We simplify the implementation a bit as we don't need chat history for this demo.

In [3]:
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'
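
To make the template concrete, here is a small standalone sketch that renders it with a short message. The system prompt and the question are placeholder values chosen for illustration; they are not part of the demo itself.

```python
# Same template as get_prompt above, repeated so the snippet runs on its own.
def get_prompt(message: str, system_prompt: str) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

# Placeholder values, used only to show the rendered layout.
example = get_prompt('What is the capital of France?', system_prompt='Answer briefly.')
print(example)
```

The system prompt sits between the `<<SYS>>` markers, and the user message follows it inside the same `[INST] ... [/INST]` block.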

# Haystack + LM Format Enforcer

This demo uses Llama 2, so you will have to create a free Hugging Face account, request access to the Llama 2 model, and create an access token. The next cell will ask for the token when you execute it.

Links:

- [Request access to llama model](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). See the "Access Llama 2 on Hugging Face" section.
- [Create huggingface access token](https://huggingface.co/settings/tokens)

In [None]:
import huggingface_hub
huggingface_hub.notebook_login()

### Loading the model

We load the model by creating a `HuggingFaceLocalGenerator` and passing the quantization parameters (8-bit loading with `float16` as the compute dtype) through `huggingface_pipeline_kwargs`.

In [17]:
import torch
from transformers import AutoConfig
from haystack.components.generators import HuggingFaceLocalGenerator

model_id = 'meta-llama/Llama-2-7b-chat-hf'
device = 'cuda'

if torch.cuda.is_available():
    config = AutoConfig.from_pretrained(model_id)
    config.pretraining_tp = 1
    model_kwargs = {
        'config': config,
        'torch_dtype': torch.float16,
        'model_kwargs': {'load_in_8bit': True},
        'device_map': 'auto'
    }
    model = HuggingFaceLocalGenerator(model_id, huggingface_pipeline_kwargs=model_kwargs)
    model.warm_up()
else:
    raise Exception('GPU not available')


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.10s/it]


### Creating a Haystack V2 Pipeline

The integration to Haystack V2 is done via `LMFormatEnforcerLocalGenerator`. It receives the local LLM model component and the Character Level Parser as parameters. Only the wrapped model is added to the pipeline.

In [45]:
from typing import Optional
from haystack import Pipeline
from lmformatenforcer.characterlevelparser import CharacterLevelParser
from lmformatenforcer.integrations.haystackv2 import LMFormatEnforcerLocalGenerator

def build_pipeline(character_level_parser: Optional[CharacterLevelParser] = None):
    format_enforcer = LMFormatEnforcerLocalGenerator(model, character_level_parser)
    pipeline = Pipeline()
    pipeline.add_component(instance=format_enforcer, name='model')
    return pipeline

def run_pipeline(pipeline, prompt):
    result = pipeline.run({
        "model": {"prompt": prompt}
    })
    return result['model']['replies'][0]


If the previous cell executed successfully, you have properly set up your Colab runtime and Hugging Face account!
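
For intuition about what a character level parser contributes during generation, the following toy sketch filters a vocabulary down to the tokens that keep the text a valid prefix of some allowed output. This is only an illustration of the general idea and not the library's implementation; `allowed_next_tokens`, the vocabulary, and the list of valid outputs are all hypothetical.

```python
from typing import Iterable, List


def allowed_next_tokens(prefix: str, vocabulary: Iterable[str], valid_outputs: Iterable[str]) -> List[str]:
    """Return the tokens that can follow `prefix` without leaving the allowed format.

    Toy stand-in for a character level parser: the "format" here is simply a
    fixed list of complete strings that are considered valid.
    """
    valid_outputs = list(valid_outputs)
    return [
        token
        for token in vocabulary
        if token and any(output.startswith(prefix + token) for output in valid_outputs)
    ]


# Hypothetical tiny vocabulary and format.
vocabulary = ["y", "ye", "yes", "n", "no", "maybe", "s", "o"]
valid_outputs = ["yes", "no"]

print(allowed_next_tokens("", vocabulary, valid_outputs))    # ['y', 'ye', 'yes', 'n', 'no']
print(allowed_next_tokens("ye", vocabulary, valid_outputs))  # ['s']
```

A real enforcer applies this kind of filter at every decoding step, so the model can only sample tokens that keep the output inside the target format.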

### Integrating LM Format Enforcer and generating JSON Schema

Now we demonstrate using `JsonSchemaParser`. With it, the enforced output is always in a format the parser accepts, that is, JSON that conforms to the given schema.

In [48]:
from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel

class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
schema_json_str = AnswerFormat.schema_json()
question_with_schema = f'{question}{schema_json_str}'
prompt = get_prompt(question_with_schema)

vanilla_pipeline = build_pipeline(None)
enforced_pipeline = build_pipeline(JsonSchemaParser(AnswerFormat.schema()))

display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without json schema enforcing:")
result = run_pipeline(vanilla_pipeline, prompt=prompt)
display_content(result)

display_header("Answer, With json schema enforcing:")
result = run_pipeline(enforced_pipeline, prompt=prompt)
display_content(result)


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Please give me information about Michael Jordan. You MUST answer using the following json schema: {"title": "AnswerFormat", "type": "object", "properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"]} [/INST]
```

**Answer, Without json schema enforcing:**

```


Michael Jordan was born on February 17, 1963, in Fort Greene, Brooklyn, New York.
```

**Answer, With json schema enforcing:**



```












{
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15
}












```
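
Because the enforced reply is JSON matching the schema, downstream code can load it directly. The sketch below uses only the standard library; `reply` is copied from the enforced output above, and `PlayerInfo` is a hypothetical dataclass standing in for `AnswerFormat` so the snippet runs without pydantic.

```python
import json
from dataclasses import dataclass


@dataclass
class PlayerInfo:
    # Hypothetical stand-in mirroring the fields of AnswerFormat.
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int


# Enforced reply copied from the output above, including surrounding whitespace.
reply = """

{
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15
}

"""

# json.loads tolerates the leading and trailing whitespace around the object.
player = PlayerInfo(**json.loads(reply))
print(player)
```

The same approach fails on the unenforced reply, which is free text rather than JSON.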

# Regular expressions

We can also use regular expressions to enforce the output format. This yields precise outputs that the rest of the pipeline can parse reliably.

In [51]:
from lmformatenforcer.regexparser import RegexParser


date_regex = r'(0?[1-9]|1[0-2])\/(0?[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}'
answer_regex = ' In mm/dd/yyyy format, Michael Jordan was born in ' + date_regex
question = 'When was Michael Jordan Born? Please answer in mm/dd/yyyy format.'
prompt = get_prompt(question)

vanilla_pipeline = build_pipeline(None)
enforced_pipeline = build_pipeline(RegexParser(answer_regex))


display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without regex enforcing:")
# Pass the formatted prompt (not the raw question), matching what is displayed above
result = run_pipeline(vanilla_pipeline, prompt)
display_content(result)

display_header("Answer, With regex enforcing:")
result = run_pipeline(enforced_pipeline, prompt)
display_content(result)


**Prompt:**

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

When was Michael Jordan Born? Please answer in mm/dd/yyyy format. [/INST]
```

**Answer, Without regex enforcing:**

```

Michael Jordan was born on February 17, 1963, in Fort Greene, Brooklyn, New York, USA.
```

**Answer, With regex enforcing:**

```
 In mm/dd/yyyy format, Michael Jordan was born in 02/17/1963
```

As you can see, with `RegexParser` we get the date in the exact output format, ready for parsing.
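
As a sketch of that downstream parsing step, the enforced reply can be turned into a `date` object using the same `date_regex`. The `reply` string is copied from the enforced output above; the rest uses only the standard library.

```python
import re
from datetime import date, datetime

date_regex = r'(0?[1-9]|1[0-2])\/(0?[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}'
answer_regex = ' In mm/dd/yyyy format, Michael Jordan was born in ' + date_regex

# Enforced reply copied from the output above.
reply = ' In mm/dd/yyyy format, Michael Jordan was born in 02/17/1963'

# The whole reply matches the enforced pattern.
assert re.fullmatch(answer_regex, reply) is not None

# Extract the date portion and convert it to a date object.
match = re.search(date_regex, reply)
birth_date = datetime.strptime(match.group(0), '%m/%d/%Y').date()
print(birth_date.isoformat())  # 1963-02-17
```

Note that the regex only constrains the shape of the date, so it would still admit impossible dates such as 02/31/1999; `strptime` raises `ValueError` for those.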