# Validators


Pydantic offers a customizable and expressive validation framework for Python. Instructor leverages it to provide a uniform developer experience for both code-based and LLM-based validation, as well as a reasking mechanism for correcting LLM outputs based on validation errors. To learn more, check out the Pydantic [docs](https://docs.pydantic.dev/latest/) on validators.

Note: For most of this notebook we won't call OpenAI; we'll just use validators to see how we can control the validation of objects.

Then we'll bring it all together into the context of RAG from the previous notebook.


Validators will enable us to control outputs by defining a function like so:

```python
def validation_function(value):
    if condition(value):
        raise ValueError("Value is not valid")
    return mutation(value)
```
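
As a concrete instance of this shape, here is a minimal sketch in plain Python with no Pydantic involved. The function name `normalize_username` and its rules are made up for illustration; the point is only that a validator either raises `ValueError` or returns a (possibly transformed) value.

```python
def normalize_username(value: str) -> str:
    """Hypothetical validator: reject blank input, then return a cleaned-up value."""
    cleaned = value.strip()
    if not cleaned:
        # The "condition" branch: signal invalid input by raising ValueError.
        raise ValueError("Username must not be empty")
    # The "mutation" branch: return the normalized value.
    return cleaned.lower()


print(normalize_username("  Jason Liu "))

try:
    normalize_username("   ")
except ValueError as exc:
    print(f"rejected: {exc}")
```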

Before we get started, let's look at how this general shape translates into a concrete validator:


## Defining Validator Functions


In [None]:
from pydantic import BaseModel
from typing_extensions import Annotated
from pydantic import AfterValidator


def name_must_contain_space(v: str) -> str:
    if " " not in v:
        raise ValueError("Name must contain a space.")
    return v.lower()


class UserDetail(BaseModel):
    age: int
    name: Annotated[str, AfterValidator(name_must_contain_space)]


# "Jason" contains no space, so this raises a ValidationError.
person = UserDetail(age=29, name="Jason")
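
The cell above fails because `"Jason"` contains no space. As a rough sketch of how the failure can be inspected, the snippet below redefines the same model, catches Pydantic's `ValidationError`, and looks at the structured error list. It assumes Pydantic v2; the variable names `valid_person` and `name_errors` are only for illustration.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, ValidationError


def name_must_contain_space(v: str) -> str:
    if " " not in v:
        raise ValueError("Name must contain a space.")
    return v.lower()


class UserDetail(BaseModel):
    age: int
    name: Annotated[str, AfterValidator(name_must_contain_space)]


# A valid name passes and is lowercased by the validator.
valid_person = UserDetail(age=29, name="Jason Liu")
print(valid_person.name)

# An invalid name raises ValidationError; errors() returns one dict per failure.
try:
    UserDetail(age=29, name="Jason")
except ValidationError as exc:
    name_errors = exc.errors()
    print(name_errors[0]["loc"], name_errors[0]["msg"])
```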

## Using Field

We can also use the `Field` function to declare validation constraints. This is useful for primitive fields, such as strings or integers, which support only a limited set of built-in constraints.


In [None]:
from pydantic import Field


class UserDetail(BaseModel):
    age: int = Field(..., gt=0)
    name: str


# age must be greater than 0, so this raises a ValidationError.
person = UserDetail(age=-10, name="Jason")

In [None]:
class AssistantMessage(BaseModel):
    message: str = Field(..., min_length=10)


# "Hey" is shorter than 10 characters, so this raises a ValidationError.
message = AssistantMessage(message="Hey")
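
When several fields violate their constraints, Pydantic reports all of the failures in a single `ValidationError` rather than stopping at the first one. A small sketch, assuming Pydantic v2; the `Profile` model is invented for illustration and simply combines the two constraints used in the cells above.

```python
from pydantic import BaseModel, Field, ValidationError


class Profile(BaseModel):
    # Hypothetical model combining both constraints from the cells above.
    age: int = Field(..., gt=0)
    bio: str = Field(..., min_length=10)


try:
    Profile(age=-10, bio="Hey")
except ValidationError as exc:
    # Map each failing field name to the error type Pydantic reported.
    failures = {err["loc"][0]: err["type"] for err in exc.errors()}
    print(failures)
```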

## Providing Context


In [None]:
from pydantic import ValidationInfo, field_validator


class Response(BaseModel):
    message: str

    @field_validator("message")
    @classmethod
    def message_cannot_have_blacklisted_words(cls, v: str, info: ValidationInfo) -> str:
        # The blacklist is supplied at validation time through the context.
        blacklist = info.context.get("blacklist", [])
        for word in blacklist:
            assert word not in v.lower(), f"`{word}` was found in the message `{v}`"
        return v


# "hurt" is in the blacklist, so this raises a ValidationError.
Response.model_validate(
    {"message": "I will hurt them."},
    context={
        "blacklist": {
            "rob",
            "steal",
            "hurt",
            "kill",
            "attack",
        }
    },
)

In [None]:
# "rob" is in the blacklist, so this raises a ValidationError.
Response.model_validate(
    {"message": "My name is rob."},
    context={
        "blacklist": {
            "rob",
            "steal",
            "hurt",
            "kill",
            "attack",
        }
    },
)
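
One caveat with this pattern: `info.context` is `None` when the caller passes no context, so `info.context.get(...)` would fail with an `AttributeError`. A defensive variant, sketched here under the assumption of Pydantic v2 (the class name `SafeResponse` is hypothetical), falls back to an empty blacklist and raises `ValueError` instead of relying on `assert`:

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class SafeResponse(BaseModel):
    message: str

    @field_validator("message")
    @classmethod
    def message_cannot_have_blacklisted_words(cls, v: str, info: ValidationInfo) -> str:
        # info.context is None when the caller passes no context.
        blacklist = (info.context or {}).get("blacklist", [])
        for word in blacklist:
            if word in v.lower():
                raise ValueError(f"`{word}` was found in the message `{v}`")
        return v


# No context: nothing is blacklisted, so validation passes.
no_context = SafeResponse.model_validate({"message": "I will hurt them."})

# With context: the blacklisted word is rejected.
try:
    SafeResponse.model_validate(
        {"message": "I will hurt them."}, context={"blacklist": {"hurt"}}
    )
    rejected = False
except ValidationError:
    rejected = True

print(no_context.message, rejected)
```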

## Using OpenAI Moderation


To enhance our validation measures, we'll extend the scope to flag any answer that contains hateful content, harassment, or similar issues. OpenAI offers a moderation endpoint that addresses these concerns, and it's freely available when using OpenAI models.


With the `instructor` library, this is just one function edit away:


In [None]:
from typing import Annotated
from pydantic import AfterValidator
from instructor import openai_moderation

import instructor
from openai import OpenAI

client = instructor.patch(OpenAI())

# Annotated (added to `typing` in Python 3.9) attaches custom metadata,
# such as validators, to a type hint.
ModeratedStr = Annotated[str, AfterValidator(openai_moderation(client=client))]


class Response(BaseModel):
    message: ModeratedStr


Response(message="I want to make them suffer the consequences")

## General Validator


In [None]:
from instructor import llm_validator

HealthTopicStr = Annotated[
    str,
    AfterValidator(
        llm_validator(
            "don't talk about any other topic except health best practices and topics",
            openai_client=client,
        )
    ),
]


class AssistantMessage(BaseModel):
    message: HealthTopicStr


AssistantMessage(
    message="I would suggest you to visit Sicily as they say it is very nice in winter."
)

### Avoiding hallucination with citations


When incorporating external knowledge bases, it's crucial to ensure that the agent uses the provided context accurately and doesn't fabricate responses. Validators can be effectively used for this purpose. We can illustrate this with an example where we validate that a provided citation is actually included in the referenced text chunk:


In [None]:
from pydantic import ValidationInfo


class AnswerWithCitation(BaseModel):
    answer: str
    citation: str

    @field_validator("citation")
    @classmethod
    def citation_exists(cls, v: str, info: ValidationInfo) -> str:
        context = info.context
        if context:
            # Fall back to an empty string if no text chunk was provided.
            text_chunk = context.get("text_chunk", "")
            if v not in text_chunk:
                raise ValueError(f"Citation `{v}` not found in text")
        return v

Here we assume that the validation context contains a `text_chunk` key holding the text that the model is supposed to use as context. We then use the `field_validator` decorator to define a validator that checks whether the citation is included in the text chunk. If it's not, we raise a `ValueError`, whose message is reported in the resulting validation error.


In [None]:
AnswerWithCitation.model_validate(
    {
        "answer": "Blueberries are packed with protein",
        "citation": "Blueberries contain high levels of protein",
    },
    context={"text_chunk": "Blueberries are very rich in antioxidants"},
)

In practice there are many ways to implement this: we could use a regex to check if the citation is included in the text chunk, or we could use a more sophisticated approach like a semantic similarity check. The important thing is that we have a way to validate that the model is using the provided context accurately.
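
As one possible sketch of the regex idea, the helper below treats a citation as present if its words appear in the text chunk in sequence, separated only by whitespace, ignoring case. It uses only the standard library; `citation_in_text` is a made-up name, not part of Pydantic or Instructor.

```python
import re


def citation_in_text(citation: str, text_chunk: str) -> bool:
    """Return True if the citation's words occur in sequence in text_chunk,
    ignoring case and differences in whitespace."""
    words = citation.split()
    if not words:
        return False
    # Escape each word and allow any run of whitespace between words.
    pattern = r"\s+".join(re.escape(word) for word in words)
    return re.search(pattern, text_chunk, flags=re.IGNORECASE) is not None


text_chunk = "Blueberries are very rich\nin antioxidants"
print(citation_in_text("rich in   Antioxidants", text_chunk))
print(citation_in_text("high levels of protein", text_chunk))
```

A function like this could be called from `citation_exists` in place of the plain `in` check.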


In [None]:
class AnswerWithCitation(BaseModel):
    answer: str
    citations: list[str]

    @field_validator("citations")
    @classmethod
    def citation_exists(cls, v: list[str], info: ValidationInfo) -> list[str]:
        text_chunk = info.context.get("text_chunk")
        for citation in v:
            if citation not in text_chunk:
                raise ValueError(f"Citation `{citation}` not found in text")
        return v

In [None]:
class Citation(BaseModel):
    span_start: str = Field(
        ...,
        description="The start of the citation, use a 3-4 word phrase that is unique to the citation",
    )
    span_end: str = Field(
        ...,
        description="The end of the citation, use a 3-4 word phrase that is unique to the citation",
    )

    def check(self, text: str) -> bool:
        index_start = text.find(self.span_start)
        index_end = text.find(self.span_end)
        if index_start == -1 or index_end == -1:
            return False

        if index_start > index_end:
            return False

        return True


class AnswerWithCitation(BaseModel):
    answer: str
    citations: list[Citation]

    @field_validator("citations")
    @classmethod
    def citation_exists(cls, v: list[Citation], info: ValidationInfo) -> list[Citation]:
        text_chunk = info.context.get("text_chunk")
        for citation in v:
            if not citation.check(text_chunk):
                raise ValueError(f"Citation `{citation}` not found in text")
        return v
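
One thing to note about `check`: `str.find` returns the first occurrence, so a `span_end` phrase that also appears earlier in the text would fail the ordering test even if it occurs again after `span_start`. A possible refinement, shown here as a standalone sketch with a hypothetical helper `extract_span`, searches for the end phrase only after the start phrase and returns the quoted span:

```python
from typing import Optional


def extract_span(text: str, span_start: str, span_end: str) -> Optional[str]:
    """Return the text from span_start through span_end, or None if not found."""
    index_start = text.find(span_start)
    if index_start == -1:
        return None
    # Look for the end phrase only after the start phrase finishes.
    index_end = text.find(span_end, index_start + len(span_start))
    if index_end == -1:
        return None
    return text[index_start : index_end + len(span_end)]


text = "Blueberries are very rich in antioxidants and low in calories."
print(extract_span(text, "very rich", "in antioxidants"))
print(extract_span(text, "low in", "very rich"))
```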

## Reasking with validators

So far we have only defined validation logic, which can be kept separate from generation. But when validation fails, we shouldn't stop there! Instead, Instructor allows us to collect all the validation errors and reask the LLM to rewrite its answer.
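
To make the idea concrete, here is a deliberately simplified sketch of a reask loop in plain Python. It is not Instructor's implementation: `reask_until_valid`, `fake_llm`, and `no_shouting` are hypothetical stand-ins, and the "model" is a stub that returns canned answers. The sketch only shows the control flow: validate, feed the error back as a message, and try again up to `max_retries` times.

```python
from typing import Callable


def reask_until_valid(
    generate: Callable[[list], str],
    validate: Callable[[str], str],
    messages: list,
    max_retries: int = 2,
) -> str:
    """Call generate, validate the result, and retry with the error appended."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    messages = list(messages)  # avoid mutating the caller's list
    for attempt in range(max_retries + 1):
        answer = generate(messages)
        try:
            return validate(answer)
        except ValueError as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the last validation error
            # Feed the failed answer and the error back to the model.
            messages.append({"role": "assistant", "content": answer})
            messages.append(
                {"role": "user", "content": f"Fix this validation error: {exc}"}
            )
    raise AssertionError("unreachable")


def no_shouting(answer: str) -> str:
    # Hypothetical validator: reject answers written entirely in uppercase.
    if answer.isupper():
        raise ValueError("Answer must not be all uppercase")
    return answer


canned = iter(["LIFE IS SIN", "Life is what you make of it."])


def fake_llm(messages: list) -> str:
    # Stub standing in for a real model call; it ignores the messages.
    return next(canned)


result = reask_until_valid(
    fake_llm,
    no_shouting,
    [{"role": "user", "content": "What is the meaning of life?"}],
)
print(result)
```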

Let's use an extreme example to illustrate this point:


In [None]:
class QuestionAnswer(BaseModel):
    question: str
    answer: str


question = "What is the meaning of life?"
context = (
    "According to the devil, the meaning of life is a life of sin and debauchery."
)


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=QuestionAnswer,
    messages=[
        {
            "role": "system",
            "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
        },
        {
            "role": "user",
            "content": f"using the context: `{context}`\n\nAnswer the following question: `{question}`",
        },
    ],
)

resp.answer

In [None]:
class QuestionAnswer(BaseModel):
    question: str
    answer: Annotated[
        str,
        AfterValidator(
            llm_validator("don't say objectionable things", openai_client=client)
        ),
    ]


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=QuestionAnswer,
    max_retries=2,
    messages=[
        {
            "role": "system",
            "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
        },
        {
            "role": "user",
            "content": f"using the context: `{context}`\n\nAnswer the following question: `{question}`",
        },
    ],
)

resp.answer