# LM Format Enforcer Integration with ExLlamaV2

<a target="_blank" href="https://colab.research.google.com/github/noamgat/lm-format-enforcer/blob/main/samples/colab_exllamav2_integration.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to integrate LM Format Enforcer with the [ExLlamaV2](https://github.com/turboderp/exllamav2/) library. We do this using its Sampler Filter interface and the integration class in this repository.

ExLlamaV2 is one of the fastest inference engines, but does not support any of the popular constrained decoding libraries.

## Installing dependencies

We begin by installing the dependencies.



In [1]:
!pip install exllamav2 lm-format-enforcer huggingface-hub

# When running from source / developing the library, use this instead
# %load_ext autoreload
# %autoreload 2
# import sys
# import os
# sys.path.append(os.path.abspath('..'))
## os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

## Loading the model

This demo uses the [Llama2-7B exl2 weights by turboderp](https://huggingface.co/turboderp/Llama2-7B-exl2/tree/8.0bpw). We will use the Hugging Face Hub client to download the model.

In [2]:
from huggingface_hub import snapshot_download
model_directory = snapshot_download(repo_id="turboderp/Llama2-7B-exl2", revision="6463dd96f3694a87b777852f8bd979dbaeb2b839")

Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 217885.92it/s]


## Preparing ExLlamaV2

We follow the [inference.py example](https://github.com/turboderp/exllamav2/blob/master/examples/inference.py) from the ExLlamaV2 repo. There is no one-liner setup at the moment, so the next cell will contain quite a bit of code. It is all from the example.

In [3]:
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler
)

# Initialize model and cache

config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
print("Loading model: " + model_directory)

cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# Initialize generator

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Prepare settings

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

max_new_tokens = 150

generator.warmup()

Loading model: /mnt/wsl/PHYSICALDRIVE1p3/huggingface/hub/models--turboderp--Llama2-7B-exl2/snapshots/6463dd96f3694a87b777852f8bd979dbaeb2b839


If the previous cell executed successfully, you have properly set up your Colab runtime and loaded the ExLlamaV2 model!

A few helper functions to make display nicer.

In [5]:
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))


## Generating text with the LM Format Enforcer Logits Processor

ExLlamaV2's `ExLlamaV2Sampler.Settings` has a `filters` interface, similar to the one that exists in Hugging Face Transformers. We connect to this API to filter out the logits of forbidden tokens.

The integration class `ExLlamaV2TokenEnforcerFilter` does just that. This is the ONLY integration point between lm-format-enforcer and ExLlamaV2.

Note that in this notebook we use `generate_simple()`, but the integration works with all ExLlamaV2 generation methods.
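To make the idea of "filtering the forbidden logits" concrete, here is a minimal, library-free sketch. It is not the actual `ExLlamaV2TokenEnforcerFilter` implementation. The toy vocabulary, the logits, and the helper names (`allowed_token_ids`, `mask_logits`, `digits_only`) are all hypothetical and exist only for illustration.

```python
import math

def allowed_token_ids(vocab, generated, is_valid_prefix):
    """Return the ids of tokens that keep the generated text valid."""
    return [i for i, tok in enumerate(vocab) if is_valid_prefix(generated + tok)]

def mask_logits(logits, allowed):
    """Set the logit of every token outside `allowed` to -inf."""
    allowed_set = set(allowed)
    return [x if i in allowed_set else -math.inf for i, x in enumerate(logits)]

# Hypothetical toy setup: a 5-token vocabulary and made-up logits.
vocab = ["1", "9", "a", "63", " hello"]
logits = [0.1, 0.3, 2.5, 0.2, 1.7]

def digits_only(text):
    # Stand-in for a real format parser: the "format" accepts digits only.
    return text.isdigit()

# Text generated so far is "19"; find which tokens may legally follow it.
allowed = allowed_token_ids(vocab, "19", digits_only)
masked = mask_logits(logits, allowed)

# Greedy pick over the masked logits: forbidden tokens can never win.
best = max(range(len(masked)), key=masked.__getitem__)
print(vocab[best])
```

Without the mask, the highest logit belongs to `"a"`, which would break the format. With the mask, only the digit tokens remain candidates, so sampling settings such as temperature or top-k still apply, but only among allowed tokens.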


In [9]:
from lmformatenforcer.characterlevelparser import CharacterLevelParser
from lmformatenforcer.integrations.exllamav2 import ExLlamaV2TokenEnforcerFilter
from typing import Optional

def exllamav2_with_format_enforcer(prompt: str, parser: Optional[CharacterLevelParser] = None) -> str:
    if parser is None:
        settings.filters = []
    else:
        settings.filters = [ExLlamaV2TokenEnforcerFilter(parser, tokenizer)]
    result = generator.generate_simple(prompt, settings, max_new_tokens, seed = 1234)
    return result[len(prompt):]

## ExLlamaV2 + JSON Use Case

Now we demonstrate using `JsonSchemaParser`. We create a Pydantic model, generate the schema from it, and use that to enforce the format.
The output will always be in a format that can be parsed by the parser.

In [14]:
from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel



class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
question_with_schema = f'{question}{AnswerFormat.schema_json()}'
prompt = question_with_schema

display_header("Prompt:")
display_content(prompt)

display_header("Answer, Without json schema enforcing:")
result = exllamav2_with_format_enforcer(prompt, parser=None)
display_content(result)

display_header("Answer, With json schema enforcing:")
parser = JsonSchemaParser(AnswerFormat.schema())
result = exllamav2_with_format_enforcer(prompt, parser=parser)
display_content(result)

display_header("Answer, With json mode (json output, no specific schema) enforcing:")
parser = JsonSchemaParser(None)
result = exllamav2_with_format_enforcer(prompt, parser=parser)
display_content(result)

**Prompt:**

```
Please give me information about Michael Jordan. You MUST answer using the following json schema: {"title": "AnswerFormat", "type": "object", "properties": {"first_name": {"title": "First Name", "type": "string"}, "last_name": {"title": "Last Name", "type": "string"}, "year_of_birth": {"title": "Year Of Birth", "type": "integer"}, "num_seasons_in_nba": {"title": "Num Seasons In Nba", "type": "integer"}}, "required": ["first_name", "last_name", "year_of_birth", "num_seasons_in_nba"]}
```

**Answer, Without json schema enforcing:**

```

The JSON schema that you need to provide is in this format, but with your own data and the correct name of the person. For example, if I wanted the answer for Michael Jordan, my response would look like this: { 'first-name': 'Michael', 'last-name': 'Jordan', 'year-of-birth': 1963, 'num-seasons-in-nba': 5 }. The first line contains the title (which can be anything) followed by a colon (:). Afterwards are three properties. Each property has its own key value pair as well as an associated schema that describes what type of object it should contain. Finally there's one required element per
```

**Answer, With json schema enforcing:**

```


    {
        "first_name": "Michael",
        "last_name": "Jordan",
        "year_of_birth": 1963,
        "num_seasons_in_nba": 15
    }

   

   


```

**Answer, With json mode (json output, no specific schema) enforcing:**

```


    [["Michael Jordan"],
     ["1963-02-17", 45, 15],
     ["Charlotte Hornets", 8, 15]]
```

As you can see, the enforced output matches the required schema, while the unenforced output does not. We have successfully integrated with ExLlamaV2!
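One way to check this claim programmatically is to parse each answer with the standard `json` module and compare the fields against `AnswerFormat`. The sketch below uses only the standard library. The helper `matches_answer_format` is a hypothetical name, and the strings are copied from the cell outputs above (the unenforced one is just the JSON-like fragment from that answer).

```python
import json

# Enforced output copied from the cell output above (whitespace included).
enforced_output = """

    {
        "first_name": "Michael",
        "last_name": "Jordan",
        "year_of_birth": 1963,
        "num_seasons_in_nba": 15
    }

"""

# JSON-like fragment taken from the unenforced answer above.
unenforced_fragment = (
    "{ 'first-name': 'Michael', 'last-name': 'Jordan', "
    "'year-of-birth': 1963, 'num-seasons-in-nba': 5 }"
)

# Output of the schema-less JSON mode: valid JSON, but not the schema.
json_mode_output = (
    '[["Michael Jordan"], ["1963-02-17", 45, 15], ["Charlotte Hornets", 8, 15]]'
)

# Field names and types taken from the AnswerFormat model.
expected_types = {
    "first_name": str,
    "last_name": str,
    "year_of_birth": int,
    "num_seasons_in_nba": int,
}

def matches_answer_format(text):
    """Return True if `text` is a JSON object holding every expected field
    with the expected type."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(k), t) for k, t in expected_types.items())

for name, text in [
    ("schema enforced", enforced_output),
    ("unenforced", unenforced_fragment),
    ("json mode", json_mode_output),
]:
    print(name, matches_answer_format(text))
```

In a real pipeline you would more likely validate with the Pydantic model itself. The manual check is used here only to keep the sketch dependency-free.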

## ExLlamaV2 + Regular Expressions Use Case

In [19]:
from lmformatenforcer import RegexParser

date_regex = r'(0?[1-9]|1[0-2])\/(0?[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}'
answer_regex = ' In mm/dd/yyyy format, Michael Jordan was born in ' + date_regex
question = 'Q: When was Michael Jordan Born? Please answer in mm/dd/yyyy format. A:'
prompt = question

display_header("Prompt:")
display_content(prompt)


display_header("Without format forcing:")
result = exllamav2_with_format_enforcer(prompt, parser=None)
display_content(result)


display_header(f"With regex force. Regex: ```{answer_regex}```")
parser = RegexParser(answer_regex)
result = exllamav2_with_format_enforcer(prompt, parser=parser)
display_content(result)



**Prompt:**

```
Q: When was Michael Jordan Born? Please answer in mm/dd/yyyy format. A:
```

**Without format forcing:**

```
 The birthday of basketball legend, Michael Jordan is 17th February 1963
Q: What was the first name of his father? A: James R. Jordan Sr.
Q: In which city did he complete both high school and college? A: Wilmington, North Carolina
Q: Who were the Chicago Bulls head coaches during his playing career with them? A: Phil Jackson (1984-1998) and Doug Collins (1998–2002).
Q: How many times did he win an NBA championship while a member of the Chicago Bulls? A: Six times. He won three consecutive championships from 1991 to
```

**With regex force. Regex: ``` In mm/dd/yyyy format, Michael Jordan was born in (0?[1-9]|1[0-2])\/(0?[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}```**

```
  In mm/dd/yyyy format, Michael Jordan was born in 2/17/1963
```

As you can see, with regex forcing enabled, the output follows the requested structure. Without it, the model answered in a different date format and kept generating unrelated questions.
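The same `date_regex` can be applied with Python's standard `re` module to confirm which answers contain a date in the requested format. This is a small sketch: `extract_date` is a hypothetical helper, and the two strings are taken from the outputs above (only the first line of the unenforced answer is used).

```python
import re

# Same pattern as in the cell above.
date_regex = r'(0?[1-9]|1[0-2])\/(0?[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}'

enforced = 'In mm/dd/yyyy format, Michael Jordan was born in 2/17/1963'
unenforced = 'The birthday of basketball legend, Michael Jordan is 17th February 1963'

def extract_date(text):
    """Return the first mm/dd/yyyy date found in `text`, or None."""
    match = re.search(date_regex, text)
    return match.group(0) if match else None

print(extract_date(enforced))
print(extract_date(unenforced))

# fullmatch checks that a whole string is a valid date under the pattern.
print(re.fullmatch(date_regex, '2/17/1963') is not None)
print(re.fullmatch(date_regex, '13/01/2000') is not None)
```

The enforced answer yields `2/17/1963`, while the unenforced answer yields no match because its date is written as free text. A month of `13` is rejected by the pattern.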