## Get Started in Minutes

Install Instructor and the IBM watsonx.ai SDK:

```bash
pip install -U instructor
pip install ibm-watsonx-ai
```

Now, let's see Instructor in action with a simple example:


In [None]:
import os
import instructor
from pydantic import BaseModel
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Model

client = Model(
    model_id="meta-llama/llama-3-1-70b-instruct",
    credentials=Credentials(
        api_key = os.environ.get("WATSONX_API_KEY"),
        url = os.environ.get("WATSONX_URL")),
    project_id= os.environ.get("WATSONX_PROJECT_ID")
)

# Wrap the Watsonx model with Instructor
instructor_client=instructor.from_watsonx(
    client=client,
    mode=instructor.Mode.WATSONX_TOOLS
)
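
The client above reads three environment variables with `os.environ.get`, which returns `None` when a variable is unset, so a missing value may only surface later as a less obvious error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical and not part of Instructor or the watsonx SDK:

```python
import os

REQUIRED_VARS = ("WATSONX_API_KEY", "WATSONX_URL", "WATSONX_PROJECT_ID")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Set these variables before creating the client: {', '.join(missing)}")
```

Passing a plain dict as `env` makes the helper easy to test without touching the real environment.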

In [None]:
# Define your desired output structure
class UserInfo(BaseModel):
    name: str
    age: int

user_info = instructor_client.messages.create(
    response_model=UserInfo,
    max_retries=3,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user_info)

In [None]:
print(user_info.name)
print(user_info.age)
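
The object returned through `response_model` is a validated Pydantic instance, and in Instructor a validation failure is what triggers a retry when `max_retries` is set. The sketch below reproduces only the validation step offline. The JSON strings are hand-written stand-ins for model output (an assumption for illustration; no Watsonx call is made):

```python
from pydantic import BaseModel, ValidationError


class UserInfo(BaseModel):
    name: str
    age: int


# Hand-written stand-in for a well-formed model response.
good = UserInfo.model_validate_json('{"name": "John Doe", "age": 30}')
print(good)

# A response with a non-numeric age fails validation.
try:
    UserInfo.model_validate_json('{"name": "John Doe", "age": "thirty"}')
    failed = False
except ValidationError as exc:
    failed = True
    print(f"{exc.error_count()} validation error(s)")
```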

## Using Hooks
Instructor provides a powerful hooks system that allows you to intercept and log various stages of the LLM interaction process. Here's a simple example demonstrating how to use hooks:

In [None]:
# Define hook functions
def log_kwargs(**kwargs):
    print(f"Function called with kwargs: {kwargs}")

def log_exception(exception: Exception):
    print(f"An exception occurred: {str(exception)}")

instructor_client.on("completion:kwargs", log_kwargs)
instructor_client.on("completion:error", log_exception)

user_info = instructor_client.messages.create(
    response_model=UserInfo,
    messages=[{"role": "user", "content": "Extract the user name: 'John is 20 years old'"}],
)

In [None]:
print(f"Name: {user_info.name}, Age: {user_info.age}")
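
To make the hook pattern concrete, here is a toy event registry in plain Python. It is not Instructor's implementation; the `HookRegistry` class and its `emit` method are hypothetical names used only to show how `on(event, handler)` style callbacks are typically stored and dispatched:

```python
from collections import defaultdict
from typing import Any, Callable


class HookRegistry:
    """Toy event registry: handlers are stored per event name and called in order."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[..., Any]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[..., Any]) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, *args: Any, **kwargs: Any) -> None:
        for handler in self._handlers[event]:
            handler(*args, **kwargs)


seen: list[str] = []
hooks = HookRegistry()
hooks.on("completion:kwargs", lambda **kw: seen.append(f"kwargs: {sorted(kw)}"))
hooks.on("completion:error", lambda exc: seen.append(f"error: {exc}"))

# Simulate the two events that the handlers above subscribe to.
hooks.emit("completion:kwargs", messages=[], max_retries=3)
hooks.emit("completion:error", ValueError("bad output"))
print(seen)
```

The handler signatures mirror the ones in the cell above: the kwargs hook receives keyword arguments, the error hook receives the exception object.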

# Text Classification using Watsonx and Pydantic

This tutorial shows how to implement text classification tasks, specifically single-label and multi-label classification, using the Watsonx API and Pydantic models. For full examples, check out the hub examples for single classification and multi classification.

## Single-Label Classification
### Defining the Structures

For single-label classification, we define a Pydantic model with a Literal field for the possible labels.

In [None]:
from pydantic import BaseModel, Field
from typing import Literal

class ClassificationResponse(BaseModel):
    """
    A few-shot example of text classification:

    Examples:
    - "Buy cheap watches now!": SPAM
    - "Meeting at 3 PM in the conference room": NOT_SPAM
    - "You've won a free iPhone! Click here": SPAM
    - "Can you pick up some milk on your way home?": NOT_SPAM
    - "Increase your followers by 10000 overnight!": SPAM
    """

    chain_of_thought: str = Field(
        ...,
        description="The chain of thought that led to the prediction.",
    )
    label: Literal["SPAM", "NOT_SPAM"] = Field(
        ...,
        description="The predicted class label.",
    )

### Classifying Text

The function classify will perform the single-label classification.

In [None]:
def classify(data: str) -> ClassificationResponse:
    """Perform single-label classification on the input text."""
    return instructor_client.messages.create(
        response_model=ClassificationResponse,
        messages=[
            {
                "role": "user",
                "content": f"Classify the following text: <text>{data}</text>",
            },
        ],
    )

### Testing and Evaluation

Let's run examples to see if it correctly identifies spam and non-spam messages.

In [None]:
if __name__ == "__main__":
    for text, label in [
        ("Hey Jason! You're awesome", "NOT_SPAM"),
        ("I am a nigerian prince and I need your help.", "SPAM"),
    ]:
        prediction = classify(text)
        assert prediction.label == label
        print(f"Text: {text}, Predicted Label: {prediction.label}")
        #> Text: Hey Jason! You're awesome, Predicted Label: NOT_SPAM
        #> Text: I am a nigerian prince and I need your help., Predicted Label: SPAM
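
The loop above asserts on two examples and stops at the first mismatch. For a larger labeled set it is usually more informative to collect an accuracy figure and the misclassified items. The sketch below uses a keyword-based `classify_stub` (hypothetical, a stand-in so the code runs without an API call) where `classify` would go; the third example text is also made up for illustration:

```python
from typing import Callable

LABELED = [
    ("Hey Jason! You're awesome", "NOT_SPAM"),
    ("I am a nigerian prince and I need your help.", "SPAM"),
    ("You've won a free cruise! Click here", "SPAM"),
]


def classify_stub(text: str) -> str:
    """Stand-in classifier; swap in `lambda t: classify(t).label` for the real model."""
    spam_markers = ("prince", "free", "click here")
    return "SPAM" if any(m in text.lower() for m in spam_markers) else "NOT_SPAM"


def evaluate(predict: Callable[[str], str], examples):
    """Return (accuracy, misses), where each miss is (text, expected, predicted)."""
    misses = []
    for text, expected in examples:
        predicted = predict(text)
        if predicted != expected:
            misses.append((text, expected, predicted))
    accuracy = 1 - len(misses) / len(examples)
    return accuracy, misses


accuracy, misses = evaluate(classify_stub, LABELED)
print(f"accuracy={accuracy:.2f}, misses={misses}")
```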

## Multi-Label Classification
### Defining the Structures

For multi-label classification, we'll update our approach to use Literals instead of enums, and include few-shot examples in the model's docstring.

In [None]:
from pydantic import BaseModel, Field
from typing import Literal

class MultiClassPrediction(BaseModel):
    """
    Class for a multi-class label prediction.

    Examples:
    - "My account is locked": ["TECH_ISSUE"]
    - "I can't access my billing info": ["TECH_ISSUE", "BILLING"]
    - "When do you close for holidays?": ["GENERAL_QUERY"]
    - "My payment didn't go through and now I can't log in": ["BILLING", "TECH_ISSUE"]
    """

    chain_of_thought: str = Field(
        ...,
        description="The chain of thought that led to the prediction.",
    )

    class_labels: list[Literal["TECH_ISSUE", "BILLING", "GENERAL_QUERY"]] = Field(
        ...,
        description="The predicted class labels for the support ticket.",
    )

### Classifying Text

The function multi_classify is responsible for multi-label classification.

In [None]:
def multi_classify(data: str) -> MultiClassPrediction:
    """Perform multi-label classification on the input text."""
    return instructor_client.messages.create(
        response_model=MultiClassPrediction,
        messages=[
            {
                "role": "user",
                "content": f"Classify the following support ticket: <ticket>{data}</ticket>",
            },
        ],
    )

### Testing and Evaluation

Finally, we test the multi-label classification function using a sample support ticket.



In [None]:
# Test multi-label classification
ticket = "My account is locked and I can't access my billing info."
prediction = multi_classify(ticket)
assert "TECH_ISSUE" in prediction.class_labels
assert "BILLING" in prediction.class_labels
print(f"Ticket: {ticket}")
#> Ticket: My account is locked and I can't access my billing info.
print(f"Predicted Labels: {prediction.class_labels}")
#> Predicted Labels: ['TECH_ISSUE', 'BILLING']

By using Literals and including few-shot examples, we've improved both the single-label and multi-label classification implementations. These changes enhance type safety and provide better guidance for the AI model, potentially leading to more accurate classifications.
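
When comparing multi-label predictions against a gold set, it helps to distinguish an exact match from a partial one. The sketch below uses set operations for that; the function name `label_metrics` and the sample prediction are illustrative, not output from the model:

```python
def label_metrics(predicted: list[str], expected: list[str]) -> dict:
    """Compare two label lists as sets, ignoring order and duplicates."""
    pred, gold = set(predicted), set(expected)
    hits = pred & gold
    return {
        "exact_match": pred == gold,
        "precision": len(hits) / len(pred) if pred else 0.0,
        "recall": len(hits) / len(gold) if gold else 0.0,
        "missing": sorted(gold - pred),
        "extra": sorted(pred - gold),
    }


# Hand-written prediction standing in for `prediction.class_labels`.
result = label_metrics(
    ["BILLING", "TECH_ISSUE", "GENERAL_QUERY"],
    ["TECH_ISSUE", "BILLING"],
)
print(result)
```

Here recall is perfect but precision is not, because the stand-in prediction carries one extra label.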

# Answering Questions with Validated Citations

For the full code example, check out examples/citation_fuzzy_match.py

## Overview

This example shows how to use Instructor with validators to add citations to generated answers and to guard against hallucinations: every statement made by the LLM must be backed by a direct quote from the provided context, and each quote is checked to actually exist in that context.
Two Python classes, Fact and QuestionAnswer, are defined to encapsulate the information of individual facts and the entire answer, respectively.

## Data Structures
### The Fact Class

The Fact class encapsulates a single statement or fact. It contains two fields:

- fact: A string representing the body of the fact or statement.
- substring_quote: A list of strings. Each string is a direct quote from the context that supports the fact.

#### Validation Method: validate_sources

This method validates the sources (substring_quote) in the context. It utilizes regex to find the span of each substring quote in the given context. If the span is not found, the quote is removed from the list.

In [None]:
from pydantic import Field, BaseModel, model_validator, ValidationInfo
import re

class Fact(BaseModel):
    fact: str = Field(...)
    substring_quote: list[str] = Field(...)

    @model_validator(mode="after")
    def validate_sources(self, info: ValidationInfo) -> "Fact":
        text_chunks = info.context.get("text_chunk", None)
        spans = list(self.get_spans(text_chunks))
        self.substring_quote = [text_chunks[span[0] : span[1]] for span in spans]
        return self

    def get_spans(self, context):
        for quote in self.substring_quote:
            yield from self._get_span(quote, context)

    def _get_span(self, quote, context):
        for match in re.finditer(re.escape(quote), context):
            yield match.span()
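
The span lookup is plain `re` matching on an escaped quote, so it can be tried in isolation. In this sketch the `find_spans` helper and the sample strings are illustrative. Note that the match is exact, so a quote that differs from the context by even one character (including case) yields no span:

```python
import re


def find_spans(quote: str, context: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every exact occurrence of quote in context."""
    return [m.span() for m in re.finditer(re.escape(quote), context)]


context = "I studied Computational Mathematics and physics. I also studied music."

# A quote that occurs twice produces two spans.
print(find_spans("studied", context))

# re.escape keeps the trailing "." literal instead of "any character".
print(find_spans("Computational Mathematics and physics.", context))

# Case differs from the context, so nothing is found.
print(find_spans("computational mathematics", context))
```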

### The QuestionAnswer Class

This class encapsulates the question and its corresponding answer. It contains two fields:

- question: The question asked.
- answer: A list of Fact objects that make up the answer.

#### Validation Method: validate_sources

This method checks that each Fact object in the answer list has at least one valid source. If a Fact object has no valid sources, it is removed from the answer list.

In [None]:
from pydantic import BaseModel, Field, model_validator

class QuestionAnswer(BaseModel):
    question: str = Field(...)
    answer: list[Fact] = Field(...)

    @model_validator(mode="after")
    def validate_sources(self) -> "QuestionAnswer":
        self.answer = [fact for fact in self.answer if len(fact.substring_quote) > 0]
        return self
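
Both validators can be exercised without calling a model by passing a hand-written payload and a validation context to `model_validate`. The sketch below is a simplified, self-contained variant of the two classes (it keeps a quote if it occurs verbatim, rather than rebuilding it from spans); the payload and context strings are made up for illustration:

```python
import re

from pydantic import BaseModel, ValidationInfo, model_validator


class Fact(BaseModel):
    fact: str
    substring_quote: list[str]

    @model_validator(mode="after")
    def validate_sources(self, info: ValidationInfo) -> "Fact":
        context = (info.context or {}).get("text_chunk", "")
        # Keep only quotes that occur verbatim in the context.
        self.substring_quote = [
            q for q in self.substring_quote if re.search(re.escape(q), context)
        ]
        return self


class QuestionAnswer(BaseModel):
    question: str
    answer: list[Fact]

    @model_validator(mode="after")
    def validate_sources(self) -> "QuestionAnswer":
        # Drop facts that ended up with no supporting quote.
        self.answer = [f for f in self.answer if f.substring_quote]
        return self


context = "In university I studied Computational Mathematics and physics."
raw = {
    "question": "What did the author study?",
    "answer": [
        {"fact": "He studied physics.", "substring_quote": ["Computational Mathematics and physics"]},
        {"fact": "He studied law.", "substring_quote": ["studied law at Harvard"]},
    ],
}

qa = QuestionAnswer.model_validate(raw, context={"text_chunk": context})
print([f.fact for f in qa.answer])
```

The second fact is dropped because its only quote does not appear in the context.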

## Function to Ask AI a Question
### The ask_ai Function

This function takes a string question and a string context and returns a QuestionAnswer object. It uses the Watsonx API to fetch the answer and then validates the sources using the defined classes.

To understand how validation context works in Pydantic, check out Pydantic's docs.

In [None]:
#tool_choice_option="auto",
def ask_ai(question: str, context: str) -> QuestionAnswer:
    
    return instructor_client.messages.create(
        response_model=QuestionAnswer,
        messages=[
            {
                'role': 'system', 
                'content': 'You are a world class assistant to answer questions with correct and exact citations based on a given CONTEXT.'
            }, 
            {
                'role': 'user', 
                'content': 'You are a world class assistant to answer questions with correct and exact citations based on a given CONTEXT.'
            }, 
            {
                'role': 'assistant',
                'content': 'I have the following context upon which I have to base my answers.\n' + f"CONTEXT:{context}\n"
            },
            {
                'role': 'user',
                'content': 'Question: ' + f'{question}'
            }
        ],
        validation_context={"text_chunk": context},
    )

### Example

Here's an example of using these classes and functions to ask a question and validate the answer.

In [None]:
question = "What did the author do during college?"
context = """My name is Jason Liu, and I grew up in Toronto Canada but I was born in China. I went to an arts high school but in university I studied Computational Mathematics and physics. As part of coop I worked at many companies including Stitchfix, Facebook. I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years."""

In [None]:
# Set logging to DEBUG
#logging.basicConfig(level=logging.DEBUG)
print(ask_ai(question=question,context=context))

# Entity Resolution and Visualization for Legal Documents

In this guide, we demonstrate how to extract and resolve entities from a sample legal contract. Then, we visualize these entities and their dependencies as an entity graph. This approach can be invaluable for legal tech applications, aiding in the understanding of complex documents.

## Defining the Data Structures

The Entity and Property classes model extracted entities and their attributes. DocumentExtraction encapsulates a list of these entities.

In [None]:
from pydantic import BaseModel, Field

class Property(BaseModel):
    key: str
    value: str
    resolved_absolute_value: str


class Entity(BaseModel):
    id: int = Field(
        ...,
        description="Unique identifier for the entity, used for deduplication; design a scheme that allows multiple entities",
    )
    subquote_string: list[str] = Field(
        ...,
        description="Correctly resolved value of the entity, if the entity is a reference to another entity, this should be the id of the referenced entity, include a few more words before and after the value to allow for some context to be used in the resolution",
    )
    entity_title: str
    properties: list[Property] = Field(
        ..., description="List of properties of the entity"
    )
    dependencies: list[int] = Field(
        ...,
        description="List of entity ids that this entity depends or relies on to resolve it",
    )


class DocumentExtraction(BaseModel):
    entities: list[Entity] = Field(
        ...,
        description="Body of the answer, each fact should be a separate object with a body and a list of sources",
    )

## Entity Extraction and Resolution

The ask_ai function uses the Watsonx API to extract and resolve entities from the input content.

In [None]:
# Reuse the Watsonx-backed instructor_client created at the start of this notebook;
# it already supports the response_model keyword.


def ask_ai(content) -> DocumentExtraction:
    return instructor_client.messages.create(
        #tool_choice_option="auto",
        response_model=DocumentExtraction,
        messages=[
            {
                "role": "system",
                "content": "Extract and resolve a list of entities from the following document:",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
    )

## Graph Visualization

generate_graph takes the extracted entities and visualizes them using Graphviz. It creates nodes for each entity and edges for their dependencies.

In [None]:
from graphviz import Digraph

def generate_html_label(entity: Entity) -> str:
    rows = [
        f"<tr><td>{prop.key}</td><td>{prop.resolved_absolute_value}</td></tr>"
        for prop in entity.properties
    ]
    table_rows = "".join(rows)
    return f"<<table border='0' cellborder='1' cellspacing='0'><tr><td colspan='2'><b>{entity.entity_title}</b></td></tr>{table_rows}</table>>"


def generate_graph(data: DocumentExtraction):
    dot = Digraph(comment="Entity Graph", node_attr={"shape": "plaintext"})

    for entity in data.entities:
        label = generate_html_label(entity)
        dot.node(str(entity.id), label)

    for entity in data.entities:
        for dep_id in entity.dependencies:
            dot.edge(str(entity.id), str(dep_id))

    dot.render("entity.gv", view=True)
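
Because the model produces the ids, a dependency can point at an id that no extracted entity carries. Graphviz implicitly creates nodes for ids that appear only in edges, so such references do not raise an error. A small pre-render check can flag them; the sketch below works on plain dicts, and the helper name `dangling_dependencies` and the sample entities are hypothetical:

```python
def dangling_dependencies(entities: list[dict]) -> dict[int, list[int]]:
    """Map each entity id to the dependency ids that match no extracted entity."""
    known = {e["id"] for e in entities}
    report: dict[int, list[int]] = {}
    for e in entities:
        missing = [d for d in e["dependencies"] if d not in known]
        if missing:
            report[e["id"]] = missing
    return report


# Hand-written sample; entity 3 refers to id 7, which does not exist.
entities = [
    {"id": 1, "entity_title": "Agreement Date", "dependencies": []},
    {"id": 2, "entity_title": "Delivery Date", "dependencies": [1]},
    {"id": 3, "entity_title": "Final Payment", "dependencies": [1, 7]},
]
print(dangling_dependencies(entities))
```

With a real `DocumentExtraction`, the same function could be fed `[e.model_dump() for e in data.entities]`.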

## Execution

Finally, execute the code to visualize the entity graph for the sample legal contract.

In [None]:
content = """
Sample Legal Contract
Agreement Contract

This Agreement is made and entered into on 2020-01-01 by and between Company A ("the Client") and Company B ("the Service Provider").

Article 1: Scope of Work

The Service Provider will deliver the software product to the Client 30 days after the agreement date.

Article 2: Payment Terms

The total payment for the service is $50,000.
An initial payment of $10,000 will be made within 7 days of the signed date.
The final payment will be due 45 days after [SignDate].

Article 3: Confidentiality

The parties agree not to disclose any confidential information received from the other party for 3 months after the final payment date.

Article 4: Termination

The contract can be terminated with a 30-day notice, unless there are outstanding obligations that must be fulfilled after the [DeliveryDate].
"""  # Your legal contract here
model = ask_ai(content)
model

In [None]:
generate_graph(model)

# PII Data Extraction and Scrubbing
## Overview

This example demonstrates how to use a Watsonx model with Instructor to extract and scrub Personally Identifiable Information (PII) from a document. The code defines Pydantic models to manage the PII data and offers methods for both extraction and sanitization.

## Defining the Structures

First, Pydantic models are defined to represent the PII data and the overall structure for PII data extraction.

In [None]:
from pydantic import BaseModel

# Define Schemas for PII data
class Data(BaseModel):
    index: int
    data_type: str
    pii_value: str


class PIIDataExtraction(BaseModel):
    """
    Extracted PII data from a document, all data_types should try to have consistent property names
    """

    private_data: list[Data]

    def scrub_data(self, content: str) -> str:
        """
        Iterates over the private data and replaces the value with a placeholder in the form of
        <{data_type}_{i}>
        """
        for i, data in enumerate(self.private_data):
            content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
        return content
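
One thing to watch with `str.replace` in a loop: if one extracted value is a substring of another (for example a first name and a full name), the result depends on list order. The sketch below shows one way around it, replacing longer values first while keeping the original index in the placeholder. The `scrub` function and the sample data are illustrative and operate on plain tuples rather than the `Data` model:

```python
def scrub(content: str, private_data: list[tuple[str, str]]) -> str:
    """Replace each (data_type, pii_value) pair with <data_type_i>, longest value first."""
    indexed = sorted(
        enumerate(private_data),
        key=lambda item: len(item[1][1]),  # length of the pii_value
        reverse=True,
    )
    for i, (data_type, value) in indexed:
        content = content.replace(value, f"<{data_type}_{i}>")
    return content


doc = "John Doe wrote to john.doe@example.com. John replied later."
found = [("name", "John"), ("name", "John Doe"), ("email", "john.doe@example.com")]
print(scrub(doc, found))
```

Replacing in list order instead would turn "John Doe" into "<name_0> Doe" and leave the surname in the output.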

## Extracting PII Data

The Watsonx API is utilized to extract PII information from a given document.

In [None]:
EXAMPLE_DOCUMENT = """
# Fake Document with PII for Testing PII Scrubbing Model
# (The content here)
"""

pii_data = instructor_client.messages.create(
    response_model=PIIDataExtraction,
    messages=[
        {
            "role": "system",
            "content": "You are a world class PII scrubbing model, Extract the PII data from the following document",
        },
        {
            "role": "user",
            "content": EXAMPLE_DOCUMENT,
        },
    ],
)

print("Extracted PII Data:")
#> Extracted PII Data:
print(pii_data.model_dump_json())
"""
{"private_data":[{"index":0,"data_type":"Name","pii_value":"John Doe"},{"index":1,"data_type":"Email","pii_value":"john.doe@example.com"},{"index":2,"data_type":"Phone Number","pii_value":"(555) 123-4567"},{"index":3,"data_type":"Address","pii_value":"123 Main St, Anytown, USA"},{"index":4,"data_type":"Social Security Number","pii_value":"123-45-6789"}]}
"""

### Scrubbing PII Data

After extracting the PII data, the scrub_data method is used to sanitize the document.

In [None]:
print("Scrubbed Document:")
#> Scrubbed Document:
print(pii_data.scrub_data(EXAMPLE_DOCUMENT))
"""
# Fake Document with PII for Testing PII Scrubbing Model
# He was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.
"""

# Extracting Tables using Watsonx.ai

This post demonstrates how to use Python's type annotations and a Watsonx vision model to extract tables from images and convert them into markdown format. This method is particularly useful for data analysis and automation tasks.

The full code is available on GitHub

## Building the Custom Type for Markdown Tables

First, we define a custom type, MarkdownDataFrame, to handle pandas DataFrames formatted in markdown. This type combines Python's Annotated with Pydantic's InstanceOf, BeforeValidator, PlainSerializer, and WithJsonSchema to parse incoming markdown into a DataFrame, serialize the DataFrame back to markdown, and describe the field to the model as a string.

In [None]:
from io import StringIO
from typing import Annotated, Any
from pydantic import BeforeValidator, PlainSerializer, InstanceOf, WithJsonSchema
import pandas as pd


def md_to_df(data: Any) -> Any:
    # Convert markdown to DataFrame
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Process data
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .applymap(lambda x: x.strip())
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(lambda df: df.to_markdown()),
    WithJsonSchema(
        {
            "type": "string",
            "description": "The markdown representation of the table, each one should be tidy, do not try to join tables that should be separate",
        }
    ),
]
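
To see what the `BeforeValidator` step has to undo, it helps to look at the raw shape of a markdown table: a header row, a separator row of dashes, then data rows, all pipe-delimited with padding. The sketch below performs the same kind of steps as `md_to_df` (split on `|`, skip the separator row, strip whitespace) using only the standard library. `parse_markdown_table` is a hypothetical helper for illustration, not a replacement for the pandas version:

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    # Remove the outer pipes, split on the inner ones, and trim cell padding.
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


table_md = """
| Rank | App     | Category      |
|------|---------|---------------|
| 1    | Disney+ | Entertainment |
| 2    | Roblox  | Games         |
"""
parsed = parse_markdown_table(table_md)
print(parsed)
```

All values stay strings here, just as they do after the `strip` pass in `md_to_df`.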

## Defining the Table Class

The Table class is essential for organizing the extracted data. It includes a caption and a dataframe, processed as a markdown table. Since most of the complexity is handled by the MarkdownDataFrame type, the Table class is straightforward!

In [None]:
from pydantic import BaseModel

class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame

## Extracting Tables from Images

The extract_table function uses a Watsonx vision model to process an image URL and extract tables in markdown format. We use the instructor library to wrap the Watsonx client for this purpose.

In [None]:
import instructor
import base64
import requests
from collections.abc import Iterable
from ibm_watsonx_ai.foundation_models.schema import TextChatParameters

# Apply the patch to the client to support response_model
# Also use MD_JSON mode since the vision model does not support any special structured output mode

params = TextChatParameters(
    temperature=1,
    max_tokens=2000,
    time_limit=600000
)

client = Model(
    model_id='meta-llama/llama-3-2-90b-vision-instruct',
    params=params,
    credentials=Credentials(
        api_key = os.environ.get("WATSONX_API_KEY"),
        url = os.environ.get("WATSONX_URL")),
    project_id= os.environ.get("WATSONX_PROJECT_ID")
)

# Wrap the Watsonx vision model with Instructor
instructor_client=instructor.from_watsonx(
    client=client,
    mode=instructor.Mode.WATSONX_MD_JSON
)


def extract_table(url: str) -> Iterable[Table]:
    # Download the image and fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    encoded_string = base64.b64encode(response.content).decode('utf-8')

    return instructor_client.messages.create(
        response_model=Iterable[Table],
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract table from image."},
                    {
                        "type": "image_url",
                        "image_url": {
                        "url": "data:image/jpeg;base64," + encoded_string,
                        }
                    }
                ],
            }
        ],
    )
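
The cell above hardcodes `image/jpeg` in the data URL, while the sample image in this notebook is a PNG. Whether the vision endpoint is sensitive to that is not something this notebook establishes, but deriving the MIME type from the URL is cheap. A sketch using the standard library follows; the helper name `to_data_url` and the example URL are assumptions for illustration:

```python
import base64
import mimetypes


def to_data_url(content: bytes, source_name: str, default_mime: str = "image/jpeg") -> str:
    """Build a base64 data URL, guessing the MIME type from the file name or URL."""
    mime, _ = mimetypes.guess_type(source_name)
    encoded = base64.b64encode(content).decode("utf-8")
    return f"data:{mime or default_mime};base64,{encoded}"


# Three placeholder bytes stand in for real image content.
print(to_data_url(b"abc", "https://example.com/table.png"))
```

Inside `extract_table`, the result of `to_data_url(response.content, url)` could take the place of the concatenated string.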

## Practical Example

In this example, we apply the method to extract data from an image showing the top grossing apps in Ireland for October 2023.

<img src="https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png" width="50%">

In [None]:
url="https://a.storyblok.com/f/47007/2400x2000/bf383abc3c/231031_uk-ireland-in-three-charts_table_v01_b.png"
# Set logging to DEBUG
#logging.basicConfig(level=logging.DEBUG)
tables = extract_table(url)


In [None]:
tables

In [None]:
for table in tables:
    print(table.dataframe)
    """
           Android                                      ... Category
     Rank                                               ...
    1                                       Google One  ...      Social networking
    2                                          Disney+  ...          Entertainment
    3                    TikTok - Videos, Music & LIVE  ...          Entertainment
    4                                 Candy Crush Saga  ...          Entertainment
    5                   Tinder: Dating, Chat & Friends  ...                  Games
    6                                      Coin Master  ...          Entertainment
    7                                           Roblox  ...                 Dating
    8                   Bumble - Dating & Make Friends  ...                  Games
    9                                      Royal Match  ...               Business
    10                     Spotify: Music and Podcasts  ...              Education

    [10 rows x 4 columns]
    """

In [None]:
table.dataframe