# RAG with source highlighting using Constrained generation
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

**Constrained generation** is a method that forces the LLM output to follow certain constraints, for instance to follow a specific pattern.

This has numerous use cases:
- ✅ Output a dictionary with specific keys
- 📏 Make sure the output will be longer than N characters
- ⚙️ More generally, force the output to follow a certain regex pattern for downtream processing.
- 💡 Highlight sources supporting the answer in Retrieval-Augmented Generation (RAG)


In this notebook, we demonstrate specifically the last use case:

**➡️ We build a RAG system that not only provides an answer, but also highlights the supporting snippets that this answer is based on.**

_If you need an introduction to RAG, you can check out [this other cookbook](advanced_rag)._

This notebook demonstrates structured generation on two inference methods:
- Using HuggingFace Inference Endpoints ([serverless](https://huggingface.co/docs/api-inference/quicktour) or [dedicated](https://huggingface.co/docs/inference-endpoints/en/guides/access))
- Or locally using [outlines](https://github.com/outlines-dev/outlines)

In [None]:
# `json` ships with the Python standard library, so it is not installed through pip
!pip install pandas huggingface_hub pydantic outlines accelerate -q

In [2]:
import pandas as pd
import json
from huggingface_hub import InferenceClient

pd.set_option("display.max_colwidth", None)

In [3]:
repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"

llm_client = InferenceClient(model=repo_id, timeout=120)

# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

" I hope you're having a great day! I just wanted to check in and see how things are"

## First try: prompting the model

To get structured outputs from your model, you can simply prompt a powerful enough model with appropriate guidelines, and it should work directly... most of the time.

In this case, we want the RAG model to generate not only an answer, but also a confidence score and some source snippets.

In [20]:
RELEVANT_CONTEXT = """
Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.

"""

In [21]:
RAG_PROMPT_TEMPLATE_JSON = """
Answer the user query based on the source documents.

Here are the source documents: {context}


You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.

Answer:
{{
  "answer": your_answer,
  "confidence_score": your_confidence_score,
  "source_snippets": ["snippet_1", "snippet_2", ...]
}}
End of answer.

Now begin!
Here is the user question: {user_query}.
Answer:
"""

In [22]:
USER_QUERY = "How can I define a stop sequence in Transformers?"

In [23]:
prompt = RAG_PROMPT_TEMPLATE_JSON.format(
    context=RELEVANT_CONTEXT, user_query=USER_QUERY
)
print(prompt)


Answer the user query based on the source documents.

Here are the source documents: 
Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.




You should provide your answer as a JSON blob, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows, it must contain the "Answer:" and "End of answer." sequences.

Answer:
{
  "answer": your_answer,
  "confidence_score": your_confidence_score,
  "source_snippets": ["snippet_1", "snippet_2", ...]
}
End of answer.

Now begin!
Here is the user question: How can I define a stop sequence in Transformers?.
Answer:



In [8]:
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000,
)

answer = answer.split("End of answer.")[0]
print(answer)

{
  "answer": "You should pass the stop_sequence argument in your pipeline or model.",
  "confidence_score": 0.9,
  "source_snippets": ["stop_sequence", "pipeline or model"]
}



In [9]:
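# The model was asked for a JSON blob, so parse it with the JSON parser
# (`json` is already imported above). Unlike `ast.literal_eval`, this also
# accepts JSON literals such as `true`, `false` and `null`.
parsed_answer = json.loads(answer)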

In [10]:
def highlight(s):
    return "\x1b[1;32m" + s + "\x1b[0m"


def print_results(answer, source_text, highlight_snippets):
    print("Answer:", highlight(answer))
    print("\n\n", "=" * 10 + " Source documents " + "=" * 10)
    for snippet in highlight_snippets:
        source_text = source_text.replace(snippet.strip(), highlight(snippet.strip()))
    print(source_text)


print_results(
    parsed_answer["answer"], RELEVANT_CONTEXT, parsed_answer["source_snippets"]
)

Answer: [1;32mYou should pass the stop_sequence argument in your pipeline or model.[0m



Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the [1;32mstop_sequence[0m argument in your [1;32mpipeline or model[0m.




This works! 🥳
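
One thing to keep in mind: `print_results` relies on `str.replace`, which leaves the text unchanged when a snippet does not appear verbatim in the source. The sketch below is one possible way to separate snippets that really occur in the context from those that do not. The helper name `find_snippet_spans` and the third test snippet are hypothetical, introduced only for illustration.

```python
def find_snippet_spans(source_text, snippets):
    """Split snippets into those found verbatim in source_text and those missing.

    Returns (found, missing), where found holds (snippet, start, end) tuples.
    """
    found, missing = [], []
    for snippet in snippets:
        cleaned = snippet.strip()
        start = source_text.find(cleaned) if cleaned else -1
        if start == -1:
            missing.append(snippet)
        else:
            found.append((cleaned, start, start + len(cleaned)))
    return found, missing


context = (
    "The weather is really nice in Paris today.\n"
    "To define a stop sequence in Transformers, you should pass the "
    "stop_sequence argument in your pipeline or model."
)

# The last snippet is a made-up example of a snippet that is not in the context.
found, missing = find_snippet_spans(
    context, ["stop_sequence", "pipeline or model", "stop_tokens argument"]
)
print([snippet for snippet, _, _ in found])
print(missing)
```

Snippets that end up in `missing` could be dropped or logged, depending on how strict the application needs to be.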

But what about using a less powerful model?

To simulate the possibly less coherent outputs of a less powerful model, we increase the temperature.

In [11]:
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=250,
    temperature=1.6,
    return_full_text=False,
)
print(answer)




Now, the output is not even in correct JSON.
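
A simple way to detect this failure programmatically is to attempt to parse the output and catch the error. The snippet below is a minimal sketch. The helper `try_parse_json` and the two sample strings are hypothetical and are not actual model outputs.

```python
import json


def try_parse_json(text):
    """Return the parsed object, or None if `text` is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hand-written samples: one well-formed blob, one with a broken string and a missing value.
good = '{"answer": "Pass the stop_sequence argument.", "confidence_score": 0.9}'
bad = '{"answer": "Pass the stop_sequence argument, "confidence_score": }'

print(try_parse_json(good) is not None)
print(try_parse_json(bad) is None)
```

Such a check only tells you that generation went wrong after the fact; it does not prevent malformed outputs.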

## 👉 Solution: use a grammar

To force a JSON output, we'll have to use a **grammar**.

A **grammar** is a set of rules that constrains generation: at each step, the LLM can only generate tokens that keep the output consistent with the format defined by the rules (the "schema").

The grammar can be defined using Pydantic models, JSON schema, or regular expressions. The model will then generate a response that conforms to the specified grammar.

Here for instance we follow [Pydantic types](https://docs.pydantic.dev/latest/api/types/).

In [32]:
from pydantic import BaseModel, confloat, Field, StringConstraints
from typing import List, Annotated


class AnswerWithSnippets(BaseModel):
    answer: Annotated[str, StringConstraints(min_length=10, max_length=100)]
    confidence: Annotated[float, confloat(ge=0.0, le=1.0)]
    source_snippets: List[Annotated[str, StringConstraints(max_length=30)]]

I advise inspecting the generated schema to check that it correctly represents your requirements:

In [33]:
AnswerWithSnippets.schema()

{'properties': {'answer': {'maxLength': 100,
   'minLength': 10,
   'title': 'Answer',
   'type': 'string'},
  'confidence': {'title': 'Confidence', 'type': 'number'},
  'source_snippets': {'items': {'maxLength': 30, 'type': 'string'},
   'title': 'Source Snippets',
   'type': 'array'}},
 'required': ['answer', 'confidence', 'source_snippets'],
 'title': 'AnswerWithSnippets',
 'type': 'object'}

You can pass the grammar either through the client's `text_generation` method or through its `post` method.

In [34]:
# Using text_generation
answer = llm_client.text_generation(
    prompt,
    grammar={"type": "json", "value": AnswerWithSnippets.schema()},
    max_new_tokens=250,
    temperature=1.6,
    return_full_text=False,
)
print(answer)

# Using post
data = {
    "inputs": prompt,
    "parameters": {
        "temperature": 1.6,
        "return_full_text": False,
        "grammar": {"type": "json", "value": AnswerWithSnippets.schema()},
        "max_new_tokens": 250,
    },
}
answer = json.loads(llm_client.post(json=data))[0]["generated_text"]
print(answer)

{
"answer": "you should pass the stop_quote_candidates argument in your tokenizer Ïù¥Îü∞ basemannee langeffgence_hr Î≥Ä",
"confidence": 0.98,
"source_snippets": ["stop_sequence –≤–æ–∑–¥—É—Ö–∞ize Ìä∏ ee\‰Ω†"]
}
{
  "answer": "lifetimeviders>To define a stop sequence in Transformers, you should pass the stop_sequence argument",
  "confidence": 1.0,
  "source_snippets": ["runassyÊå∫‚Äù,pass‚Äô.proto rgba bas" ]
}


✅ Although the answer is still nonsensical due to the high temperature, the generated output is now correct JSON, with the exact keys and types we defined in our grammar!

It can then be parsed for further processing.
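
As a sketch of what that further processing might look like: the schema printed above lists `confidence` as a plain `number` without bounds, so a range check after parsing can still be worthwhile. The helper `validate_answer` and the sample string `raw` are hypothetical. The key names are taken from the `AnswerWithSnippets` schema.

```python
import json


def validate_answer(raw_json):
    """Parse a generated answer and check keys plus the confidence range.

    Assumes the three keys of the AnswerWithSnippets schema shown above.
    """
    parsed = json.loads(raw_json)
    for key in ("answer", "confidence", "source_snippets"):
        if key not in parsed:
            raise ValueError(f"missing key: {key}")
    if not 0.0 <= parsed["confidence"] <= 1.0:
        raise ValueError(f"confidence out of range: {parsed['confidence']}")
    return parsed


# Hand-written sample in the shape produced by the grammar (not a real model output).
raw = (
    '{"answer": "Pass the stop_sequence argument.", '
    '"confidence": 0.98, "source_snippets": ["stop_sequence"]}'
)
parsed = validate_answer(raw)
print(parsed["source_snippets"])
```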

### Grammar on a local pipeline with Outlines

[Outlines](https://github.com/outlines-dev/outlines/) is the library that runs under the hood on our Inference API to constrain output generation. You can also use it locally.

It works by [applying a bias on the logits](https://github.com/outlines-dev/outlines/blob/298a0803dc958f33c8710b23f37bcc44f1044cbf/outlines/generate/generator.py#L143) so that only tokens conforming to your constraint can be selected.
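
To make the idea concrete, here is a toy illustration of logit masking in plain Python. It is not how Outlines is implemented: the four-token vocabulary, the scores, and the helpers `mask_logits` and `greedy_pick` are all hypothetical, and the "constraint" is reduced to a single rule (the output must start with an opening brace).

```python
import math


def mask_logits(logits, vocab, is_allowed):
    """Set the logit of every disallowed token to -inf so it can never be picked."""
    return [
        logit if is_allowed(token) else -math.inf
        for logit, token in zip(logits, vocab)
    ]


def greedy_pick(logits, vocab):
    """Return the token with the highest logit."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best_index]


# Hypothetical vocabulary and raw scores, for illustration only.
vocab = ["Sure", "{", "Answer", "["]
logits = [3.2, 0.5, 2.1, 0.1]

# Toy constraint: a JSON object must start with an opening brace.
masked = mask_logits(logits, vocab, lambda token: token == "{")

print(greedy_pick(logits, vocab))  # choice without the constraint
print(greedy_pick(masked, vocab))  # choice with the constraint applied
```

In a real setup, the set of allowed tokens is recomputed at every generation step from the grammar and the text produced so far.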

In [None]:
import outlines

repo_id = "mustafaaljadery/gemma-2B-10M"
# Load model locally
model = outlines.models.transformers(repo_id)

schema_as_str = json.dumps(AnswerWithSnippets.schema())

generator = outlines.generate.json(model, schema_as_str)

# Use the `generator` to sample an output from the model
result = generator(prompt)
print(result)

You can also use [Text-Generation-Inference](https://huggingface.co/docs/text-generation-inference/en/index) with constrained generation: the [documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/guidance) explains how to do this in detail, with further examples.

Now we've demonstrated a specific RAG use-case, but constrained generation is helpful for much more than that.

For instance in your [LLM judge](llm_judge) workflows, you can also use constrained generation to output a JSON, as follows:
```
{
    "rationale": "The answer does not match the true answer at all.",
    "score": 1,
    "confidence_level": 0.85
}
```
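
A grammar for this judge output could be written by hand as a JSON schema and wrapped in the same `{"type": "json", "value": ...}` dictionary used with `text_generation` earlier. The sketch below only builds that dictionary. The field types are assumptions inferred from the example above, and the names `judge_schema` and `judge_grammar` are hypothetical.

```python
import json

# Hand-written JSON schema for the judge output shown above.
# Field names come from the example; the types are inferred from its values.
judge_schema = {
    "type": "object",
    "properties": {
        "rationale": {"type": "string"},
        "score": {"type": "integer"},
        "confidence_level": {"type": "number"},
    },
    "required": ["rationale", "score", "confidence_level"],
}

# Same shape as the grammar argument passed to text_generation earlier in this notebook.
judge_grammar = {"type": "json", "value": judge_schema}
print(json.dumps(judge_grammar, indent=2))
```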

That's all for today, congrats on following along! 👏