# Using LangChain for Feature Extraction and Tagging

Getting structured output from raw LLM generations is hard.

For example, suppose you need the model output formatted with a specific schema for:

- Extracting different parts of a user query (e.g., for semantic vs keyword search)


![Image description](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/extraction.png?raw=1)

## Overview

There are two primary approaches for this:

- `Functions`: Some LLMs can call [functions](https://openai.com/blog/function-calling-and-other-api-updates) to extract arbitrary entities from LLM responses.

- `Pydantic`: The Pydantic library for Python is used to define the schema of the features we want to extract and to validate the extracted values.

Only some LLMs (e.g., OpenAI models) support functions, and functions are more general than parsers.

Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).

Functions can infer things beyond a provided schema (e.g., attributes about a person that you did not ask for).

## Quickstart

OpenAI functions are one way to get started with extraction.

Define a schema that specifies the properties we want to extract from the LLM output.

Then, we can use `create_extraction_chain` to extract our desired schema using an OpenAI function call.

In [None]:
!pip install --upgrade --quiet langchain langchain-openai

In [None]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY']=userdata.get('OPENAI_API_KEY')

In [None]:
from langchain.chains import create_extraction_chain
from langchain_openai import ChatOpenAI

# Schema
schema = {
    "properties": {
        "name": {"type": "string"}, #"description":"person name", required=True
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}

# Input
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""

# Run chain
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 #openai_api_key=os.environ['OPENAI_API_KEY'],
                 temperature=0)
chain = create_extraction_chain(schema, llm)
output=chain.invoke(inp)
output

{'input': 'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.',
 'text': [{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},
  {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]}

In [None]:
# Just output
output["text"]

[{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},
 {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]
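
Because the model fills in the schema, it can be worth checking each returned record before using it. The following is a minimal sketch, using only the standard library, that checks the `required` keys and declared types of the records shown above. The helper name `check_record` and the `JSON_TYPES` mapping are illustrative assumptions, not part of LangChain.

```python
# Map JSON-schema type names to Python types (only the ones needed here).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

schema = {
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}


def check_record(record: dict, schema: dict) -> list:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required property: {key}")
    for key, value in record.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected property: {key}")
            continue
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        bool_mismatch = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"{key} should be of type {spec['type']}")
    return problems


records = [
    {"name": "Alex", "height": 5, "hair_color": "blonde"},
    {"name": "Claudia", "height": 6, "hair_color": "brunette"},
]
for record in records:
    print(record["name"], check_record(record, schema))
```

An empty list means the record satisfies the schema. A record that lacks `height`, or that carries it as a string, would be reported instead of silently passed along.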

In [None]:
chain.prompt

ChatPromptTemplate(input_variables=['input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template="Extract and save the relevant entities mentioned in the following passage together with their properties.\n\nOnly extract the properties mentioned in the 'information_extraction' function.\n\nIf a property is not present and is not required in the function parameters, do not include it in the output.\n\nPassage:\n{input}\n"))])

In [None]:
print(chain.prompt.messages[0].prompt.template)

Extract and save the relevant entities mentioned in the following passage together with their properties.

Only extract the properties mentioned in the 'information_extraction' function.

If a property is not present and is not required in the function parameters, do not include it in the output.

Passage:
{input}



In [None]:
chain.llm_kwargs

{'functions': [{'name': 'information_extraction',
   'description': 'Extracts the relevant information from the passage.',
   'parameters': {'type': 'object',
    'properties': {'info': {'type': 'array',
      'items': {'type': 'object',
       'properties': {'name': {'title': 'name', 'type': 'string'},
        'height': {'title': 'height', 'type': 'integer'},
        'hair_color': {'title': 'hair_color', 'type': 'string'}},
       'required': ['name', 'height']}}},
    'required': ['info']}}],
 'function_call': {'name': 'information_extraction'}}

In [None]:
chain.llm_kwargs["functions"][0]["description"]

'Extracts the relevant information from the passage.'
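
The `llm_kwargs` output above shows that the flat schema we passed in ends up nested inside an `info` array within a function definition. As a rough illustration of that shape, the sketch below rebuilds the same structure from the schema. It is reconstructed from the printed output only, it is not LangChain's actual implementation, and the helper name `wrap_schema_as_function` is hypothetical.

```python
def wrap_schema_as_function(schema: dict) -> dict:
    """Build a function definition shaped like the `chain.llm_kwargs` output above."""
    # Each property gets a `title` equal to its name, as in the printed output.
    properties = {
        name: {"title": name, **spec} for name, spec in schema["properties"].items()
    }
    return {
        "name": "information_extraction",
        "description": "Extracts the relevant information from the passage.",
        "parameters": {
            "type": "object",
            "properties": {
                # The schema describes ONE entity; wrapping it in an array
                # lets the model return several entities per passage.
                "info": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": properties,
                        "required": schema.get("required", []),
                    },
                }
            },
            "required": ["info"],
        },
    }


schema = {
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}
function_def = wrap_schema_as_function(schema)
print(function_def["parameters"]["properties"]["info"]["items"]["required"])
```

The array wrapper is what allows a single call to return both Alex and Claudia.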

### Looking under the hood

Let's dig into what is happening when we call [create_extraction_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.openai_functions.extraction.create_extraction_chain.html#langchain.chains.openai_functions.extraction.create_extraction_chain).

### Pydantic with "create_extraction_chain_pydantic" toolkits

The Pydantic library for Python lets us describe the features we want to extract from the data as a typed model.

It allows you to create data classes with attributes that are automatically validated when you instantiate an object.

Let's define a class with attributes annotated with types.

In [None]:
from typing import List, Optional

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.chains import create_extraction_chain_pydantic

inp= 'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette.'

class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(description="The person's name", default=None)
    height: Optional[float] = Field(description="Height measured in feet", default=None)
    hair_color: Optional[str] = Field(description="The color of the person's hair if known", default=None)

chain2 = create_extraction_chain_pydantic(Person, llm)
chain2.invoke(inp)
# Note: 1 foot = 0.3048 m

{'input': 'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette.',
 'text': [Person(name='Alex', height=5.0, hair_color=None),
  Person(name='Claudia', height=6.0, hair_color='brunette')]}
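
The cell above notes the feet-to-meters factor. If downstream code needs metric heights, the conversion can be applied to the extracted values after the fact. This is a minimal sketch under that assumption. The helper `feet_to_meters` and the two-decimal rounding are illustrative choices, and the heights are copied from the output above.

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot


def feet_to_meters(height_ft):
    """Convert a height in feet to meters; pass None through unchanged."""
    if height_ft is None:
        return None
    return round(height_ft * FEET_TO_METERS, 2)


# Heights as extracted above (in feet).
heights_ft = {"Alex": 5.0, "Claudia": 6.0}
heights_m = {name: feet_to_meters(h) for name, h in heights_ft.items()}
print(heights_m)
```

Passing `None` through keeps the "model declined to extract this field" signal intact.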

### Pydantic with "with_structured_output" Method

In [None]:
from typing import List, Optional

inp= 'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette.'

class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(description="The person's name", default=None)
    height: Optional[float] = Field(description="Height measured in feet", default=None)
    hair_color: Optional[str] = Field(description="The color of the person's hair if known", default=None)

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human",
         "Only extract relevant information from the '{text}'. "
         "If you do not know the value of an attribute asked "
         "to extract, return null for the attribute's value."),
    ]
)

llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 temperature=0,
                 model_kwargs={"top_p":1})

In [None]:
inp

'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette.'

In [None]:
# with_structured_output requires a model that supports function/tool calling.
# With schema=Person, the model returns a single Person.
chain = prompt | llm.with_structured_output(schema=Person)
chain.invoke({"text": inp})

Person(name='Alex', height=5.0, hair_color=None)

In [None]:
# With schema=Data, the model returns every person found in the text.
chain = prompt | llm.with_structured_output(schema=Data)
output = chain.invoke({"text": inp})
output

Data(people=[Person(name='Alex', height=5.0, hair_color=None), Person(name='Claudia', height=6.0, hair_color='brunette')])

In [None]:
output.people[0]

Person(name='Alex', height=5.0, hair_color=None)

### Another example via "with_structured_output" method

#### Without Examples

In [None]:
# We will be using tool calling mode, which
# requires a tool calling capable model.
llm = ChatOpenAI(
    # Consider benchmarking with a good model to get
    # a sense of the best possible quality.
    model="gpt-3.5-turbo-0125",
    # Remember to set the temperature to 0 for extractions!
    temperature=0,
    model_kwargs={"top_p":1}
)


runnable = prompt | llm.with_structured_output(schema=Data)

In [None]:
for _ in range(10):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": []}))

people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name='The solar system is large, but earth has only 1 moon.', height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name='The solar system is large, but earth has only 1 moon.', height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
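
In the runs above, the model sometimes returns a record with every field empty, and sometimes copies the whole input sentence into `name`. Independently of prompting, such records can be filtered out afterwards. The sketch below is one possible post-processing step using only the standard library. `PersonRecord` is a plain dataclass standing in for the Pydantic `Person` model, and `clean_people` is a hypothetical helper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PersonRecord:
    """Plain stand-in for the Pydantic `Person` model (an assumption for this sketch)."""
    name: Optional[str] = None
    height: Optional[float] = None
    hair_color: Optional[str] = None


def is_empty(person: PersonRecord) -> bool:
    """True when every field of the record is None."""
    return all(getattr(person, f.name) is None for f in fields(person))


def clean_people(people: List[PersonRecord], source_text: str) -> List[PersonRecord]:
    """Drop empty records and records whose name is just the input text."""
    kept = []
    for person in people:
        if is_empty(person):
            continue
        if person.name is not None and person.name.strip() == source_text.strip():
            continue
        kept.append(person)
    return kept


text = "The solar system is large, but earth has only 1 moon."
print(clean_people([PersonRecord()], text))
print(clean_people([PersonRecord(name=text)], text))
```

Both calls leave nothing behind, which matches the fact that the sentence mentions no person.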


#### With Examples

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm."
        ),
        # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
        MessagesPlaceholder("examples"),  # <-- EXAMPLES!
        # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
        ("human",
         "Only extract relevant information from the '{text}'. "
         "If you do not know the value of an attribute asked "
         "to extract, return null for the attribute's value."),
    ]
)

In [None]:
import uuid
from typing import Dict, List, TypedDict

from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)
from langchain_core.pydantic_v1 import BaseModel, Field


class Example(TypedDict):
    """A representation of an example consisting of text input and expected tool calls.

    For extraction, the tool calls are represented as instances of pydantic model.
    """

    input: str  # This is the example text
    tool_calls: List[BaseModel]  # Instances of pydantic model that should be extracted


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert an example into a list of messages that can be fed into an LLM.

    This code is an adapter that converts our example to a list of messages
    that can be fed into a chat model.

    The list of messages per example corresponds to:

    1) HumanMessage: contains the content from which content should be extracted.
    2) AIMessage: contains the extracted information from the model
    3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.

    The ToolMessage is required because some of the chat models are hyper-optimized for agents
    rather than for an extraction use case.
    """
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    openai_tool_calls = []
    for tool_call in example["tool_calls"]:
        openai_tool_calls.append(
            {
                "id": str(uuid.uuid4()),
                "type": "function",
                "function": {
                    # The name of the function right now corresponds
                    # to the name of the pydantic model
                    # This is implicit in the API right now,
                    # and will be improved over time.
                    "name": tool_call.__class__.__name__,
                    "arguments": tool_call.json(),
                },
            }
        )
    messages.append(
        AIMessage(content="", additional_kwargs={"tool_calls": openai_tool_calls})
    )
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(openai_tool_calls)
    for output, tool_call in zip(tool_outputs, openai_tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

In [None]:
examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Person(name=None, height=None, hair_color=None),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Person(name="Fiona", height=None, hair_color=None),
    ),
]


messages = []

for text, tool_call in examples:
    messages.extend(
        tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
    )

In [None]:
messages

[HumanMessage(content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it."),
 AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'ed89ebc6-b17c-4189-bc80-0281b8942f00', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{"name": null, "height": null, "hair_color": null}'}}]}, tool_calls=[{'name': 'Person', 'args': {'name': None, 'height': None, 'hair_color': None}, 'id': 'ed89ebc6-b17c-4189-bc80-0281b8942f00'}]),
 ToolMessage(content='You have correctly called this tool.', tool_call_id='ed89ebc6-b17c-4189-bc80-0281b8942f00'),
 HumanMessage(content='Fiona traveled far from France to Spain.'),
 AIMessage(content='', additional_kwargs={'tool_calls': [{'id': '58fab6dd-afff-42b4-b19f-43babc352965', 'type': 'function', 'function': {'name': 'Person', 'arguments': '{"name": "Fiona", "height": null, "hair_color": null}'}}]}, tool_calls=[{'name': 'Person', 'args': {'name': 'Fiona', 'height': None, 'hair_color': None}, 'id': '58fab6dd

In [None]:
llm = ChatOpenAI(
    # Consider benchmarking with a good model to get
    # a sense of the best possible quality.
    model="gpt-3.5-turbo-0125",
    # Remember to set the temperature to 0 for extractions!
    temperature=0,
)


runnable = prompt | llm.with_structured_output(schema=Data)

In [None]:
for _ in range(10):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": messages}))

people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]
people=[Person(name=None, height=None, hair_color=None)]


### Multiple entity types

We can extend this further.

Let's say we want to differentiate between dogs and people.

We can add `person_` and `dog_` prefixes to each property.

#### With "create_extraction_chain_pydantic"

In [None]:
from typing import List, Optional


class Person_and_Dog(BaseModel):
    """Information about a person and dog."""

    person_name: Optional[str] = Field(description="The person's name", default=None)
    person_height: Optional[float] = Field(description="Height measured in feet", default=None)
    person_hair_color: Optional[str] = Field(description="The color of the person's hair if known", default=None)
    dog_name: Optional[str] = Field(description="The name of the dog", default=None)
    dog_breed: Optional[str] = Field(description="The breed of the dog", default=None)

class Data(BaseModel):
    """Extracted data about people and dogs."""

    # Creates a model so that we can extract multiple entities.
    people_dogs: List[Person_and_Dog]

In [None]:
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 temperature=0,
                 model_kwargs={"top_p":1})
inp = "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."

chain = create_extraction_chain_pydantic(Data, llm)
chain.invoke(inp)

{'input': 'Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.',
 'text': [Data(people_dogs=[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name=None, dog_breed=None), Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None)])]}

In [None]:
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Alex's dog Frosty is a labrador and likes to play hide and seek."""

chain.invoke(inp)["text"][0].people_dogs

[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name='Frosty', dog_breed='labrador'),
 Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None)]
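
With the prefix convention, each record mixes person and dog fields. If separate records per entity type are more convenient downstream, the prefixes can be used to split them again. The following is a small sketch on plain dictionaries, where `split_by_prefix` is a hypothetical helper. For a Pydantic v1 model like the ones above, `.dict()` yields such a dictionary.

```python
def split_by_prefix(record: dict, prefixes=("person", "dog")) -> dict:
    """Group prefixed fields into one sub-dict per entity type, skipping None values."""
    grouped = {prefix: {} for prefix in prefixes}
    for key, value in record.items():
        # "person_hair_color" -> ("person", "_", "hair_color")
        prefix, _, field = key.partition("_")
        if prefix in grouped and value is not None:
            grouped[prefix][field] = value
    return grouped


# Field values copied from the first record in the output above.
row = {
    "person_name": "Alex",
    "person_height": 5.0,
    "person_hair_color": "blonde",
    "dog_name": "Frosty",
    "dog_breed": "labrador",
}
print(split_by_prefix(row))
```

An entity type with no extracted values comes back as an empty dictionary, which is easy to test for.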

#### Via "with_structured_output"

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human",
         "Only extract relevant information from the '{text}'. "
         "If you do not know the value of an attribute asked "
         "to extract, return null for the attribute's value."),
    ]
)

runnable = prompt | llm.with_structured_output(schema=Data)

In [None]:
output= runnable.invoke({"text":inp})
output.people_dogs

[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name='Frosty', dog_breed='labrador'),
 Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None)]

#### Via "create_extraction_chain_pydantic"

In [None]:
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by."""

chain = create_extraction_chain_pydantic(Data, llm)
chain.invoke(inp)["text"][0].people_dogs

[Person_and_Dog(person_name='Alex', person_height=1.52, person_hair_color='blonde', dog_name=None, dog_breed=None),
 Person_and_Dog(person_name='Claudia', person_height=1.67, person_hair_color='brunette', dog_name=None, dog_breed=None),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Willow', dog_breed='German Shepherd'),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Milo', dog_breed='Border Collie')]

#### Via "with_structured_output"

In [None]:
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by."""

output= runnable.invoke({"text":inp})
output.people_dogs

[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name=None, dog_breed=None),
 Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Willow', dog_breed='German Shepherd'),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Milo', dog_breed='Border Collie')]

### Extra information

The power of functions (relative to using parsers alone) lies in the ability to perform semantic extraction.

In particular, we can ask for things that are not explicitly enumerated in the schema.

Suppose we want unspecified additional information about dogs.

We can add a placeholder for unstructured extraction, `dog_extra_info`.

In [None]:
class Person_and_Dog(BaseModel):
    """Information about a person and dog."""

    person_name: Optional[str] = Field(description="The person's name", default=None)
    person_height: Optional[float] = Field(description="Height measured in feet", default=None)
    person_hair_color: Optional[str] = Field(description="The color of the person's hair if known", default=None)
    dog_name: Optional[str] = Field(description="The name of the dog", default=None)
    dog_breed: Optional[str] = Field(description="The breed of the dog", default=None)
    dog_extra_info: Optional[str] = Field(description="extra information about the dog", default=None)

class Data(BaseModel):
    """Extracted data about people and dogs."""

    # Creates a model so that we can extract multiple entities.
    people_dogs: List[Person_and_Dog]

In [None]:
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 temperature=0,
                 model_kwargs={"top_p":1})

chain = create_extraction_chain_pydantic(Data, llm)
chain.invoke(inp)["text"][0].people_dogs

[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name=None, dog_breed=None, dog_extra_info=None),
 Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None, dog_extra_info=None),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Willow', dog_breed='German Shepherd', dog_extra_info='likes to play with other dogs'),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Milo', dog_breed='border collie', dog_extra_info='lives close by')]

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
        ),
        # Please see the how-to about improving performance with reference examples.
        # MessagesPlaceholder('examples'),
        ("human",
         "Only extract relevant information from the '{text}'. "
         "If you do not know the value of an attribute asked "
         "to extract, return null for the attribute's value."),
    ]
)


runnable = prompt | llm.with_structured_output(schema=Data)

In [None]:
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by."""

output= runnable.invoke({"text":inp})
output.people_dogs

[Person_and_Dog(person_name='Alex', person_height=5.0, person_hair_color='blonde', dog_name=None, dog_breed=None, dog_extra_info=None),
 Person_and_Dog(person_name='Claudia', person_height=6.0, person_hair_color='brunette', dog_name=None, dog_breed=None, dog_extra_info=None),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Willow', dog_breed='German Shepherd', dog_extra_info='likes to play with other dogs'),
 Person_and_Dog(person_name=None, person_height=None, person_hair_color=None, dog_name='Milo', dog_breed='Border Collie', dog_extra_info=None)]

This gives us additional information about the dogs.

# Tagging


## Use case

Tagging means labeling a document with classes such as:

- sentiment
- language
- style (formal, informal etc.)
- covered topics
- political tendency

![Image description](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/tagging.png?raw=1)

## Overview

Tagging has a few components:

* `function`: tagging uses [functions](https://openai.com/blog/function-calling-and-other-api-updates) to specify how the model should tag a document
* `schema and pydantic`: defines how we want to tag the document

## Quickstart

Let's see a very straightforward example of how we can use OpenAI functions for tagging in LangChain.

In [None]:
from langchain.chains import create_tagging_chain
from langchain_openai import ChatOpenAI

We specify a few properties with their expected type in our schema.

In [None]:
# Schema
schema = {
    "properties": {
        "sentiment": {"type": "string"},
        "aggressiveness": {"type": "integer"},
        "language": {"type": "string"},
    }
}

# LLM
llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125",
                 #openai_api_key=os.environ['OPENAI_API_KEY']
                 )
chain = create_tagging_chain(schema, llm)

In [None]:
inp = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"
chain.invoke(inp)["text"]  # aggressiveness is usually omitted because the schema marks no property as required

{'sentiment': 'positive', 'language': 'Spanish'}

In [None]:
inp = "Estoy muy enojado con vos! Te voy a dar tu merecido!"
chain.invoke(inp)["text"]

{'sentiment': 'enojado', 'aggressiveness': 3, 'language': 'Spanish'}

As we can see in the examples, it correctly interprets what we want.

The results vary, however: for example, the sentiment comes back in different languages ('positive', 'enojado', etc.).

We will see how to control these results in the next section.

## Finer control

Careful schema definition gives us more control over the model's output.

Specifically, we can define:

- possible values for each property
- description to make sure that the model understands the property
- required properties to be returned

Here is an example of how we can use `enum`, `description`, and `required` to control each of the previously mentioned aspects:

In [None]:
schema = {
    "properties": {
        "sentiment": {"type": "string",
                      "enum": ["positive", "neutral", "negative"],
                      "description": "The sentiment for text"},
        "aggressiveness": {
            "type": "integer",
            "enum": [1, 2, 3, 4, 5],
            "description": "describes how aggressive the statement is, the higher the number the more aggressive",
        },
        "language": {
            "type": "string",
            "enum": ["spanish", "english", "french", "german", "italian"],
        },
    },
    "required": ["language", "sentiment", "aggressiveness"],
}

In [None]:
llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125")
chain = create_tagging_chain(schema, llm)

Now the answers are much better!

In [None]:
inp = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"
chain.invoke(inp)["text"]

{'sentiment': 'positive', 'aggressiveness': 1, 'language': 'spanish'}

In [None]:
inp = "Weather is ok here, I can go outside without much more than a coat"
chain.invoke(inp)["text"]

{'sentiment': 'neutral', 'aggressiveness': 1, 'language': 'english'}
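
The `enum` and `required` entries also make the results easy to check programmatically. As a minimal sketch, the helper below (the name `check_tags` is an assumption, not a LangChain function) reports missing required properties and values outside the allowed set. The schema is repeated without descriptions so the block is self-contained.

```python
schema = {
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "aggressiveness": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
        "language": {
            "type": "string",
            "enum": ["spanish", "english", "french", "german", "italian"],
        },
    },
    "required": ["language", "sentiment", "aggressiveness"],
}


def check_tags(result: dict, schema: dict) -> list:
    """Return a list of violations of `required` and `enum` (empty if none)."""
    problems = []
    for key in schema.get("required", []):
        if key not in result:
            problems.append(f"missing: {key}")
    for key, value in result.items():
        allowed = schema["properties"].get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{key}={value!r} not in {allowed}")
    return problems


# A result produced with the constrained schema, and one from the Quickstart.
constrained = {"sentiment": "positive", "aggressiveness": 1, "language": "spanish"}
unconstrained = {"sentiment": "enojado", "aggressiveness": 3, "language": "Spanish"}
print(check_tags(constrained, schema))
print(check_tags(unconstrained, schema))
```

The Quickstart result fails on both `sentiment` and `language`, which is exactly the variation the constrained schema is meant to remove.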

## Pydantic

In [None]:
from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.chains import create_tagging_chain_pydantic



class Classification(BaseModel):
    sentiment: str = Field(description="The sentiment of the text",
                             enum=["happy", "neutral", "sad"])
    aggressiveness: int = Field(description="describes how aggressive the statement is, the higher the number the more aggressive",
                            enum=[1, 2, 3, 4, 5])
    language: str = Field(description="The language the text is written in",
                            enum=["spanish", "english", "french", "german", "italian"])

inp = "Weather is ok here, I can go outside without much more than a coat"
chain2 = create_tagging_chain_pydantic(Classification, llm)
chain2.invoke(inp)

{'input': 'Weather is ok here, I can go outside without much more than a coat',
 'text': Classification(sentiment='neutral', aggressiveness=1, language='english')}

In [None]:
inp = "damn!"
chain2.invoke(inp)["text"]

Classification(sentiment='sad', aggressiveness=5, language='english')

In [None]:
from langchain_core.prompts import ChatPromptTemplate

tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)

inp = "damn!"

llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125")

tagging_chain = tagging_prompt | llm.with_structured_output(Classification)
tagging_chain.invoke({"input": inp})

Classification(sentiment='sad', aggressiveness=5, language='english')

## Multi-Label Classification

Let's make our example more specific and challenging. Let's analyze customers' comments separately for both the product they purchased and for their interactions with customer service.

In [None]:
# Schema
schema = {
    "properties": {
        "sentiment_for_product": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "The sentiment for product",
        },
        "sentiment_for_customer_service_issue": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "The sentiment for customer service issues",
        },
        "tecnical_problems": {
            "type": "string",
            "description": "Details of the technical problem encountered with the products.",
        },
        "negative_customer_surves_experiences": {
            "type": "string",
            "description": "Details of the negative customer service experiences",
        },
    },
    "required": [
        "sentiment_for_product",
        "sentiment_for_customer_service_issue",
        "tecnical_problems",
        "negative_customer_surves_experiences",
    ],
}

# Input
inp = ["Although the phone's battery life is satisfactory, I had a frustrating experience with customer service when I needed help with a software issue.",
"The camera's low-light performance is excellent, but I encountered difficulties with the phone's software updates. Fortunately, the customer service team was \
helpful in resolving the issue.",
"The design of the phone is impressive, but I had to contact customer service multiple times to address an issue with the speaker.",
"I'm satisfied with the phone's performance overall, but the lack of timely software updates is disappointing. Customer service was responsive when \
I reached out for assistance. ",
"The phone's sleek design caught my eye, but I faced challenges with connectivity issues. Despite this, customer service was prompt in helping me \
troubleshoot the problem."]

# Run chain
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 temperature=0)

chain = create_tagging_chain(schema, llm)

In [None]:
for i in inp:
  print(chain.invoke(i)["text"])

{'sentiment_for_product': 'neutral', 'sentiment_for_customer_service_issue': 'negative', 'tecnical_problems': 'software issue', 'negative_customer_surves_experiences': 'frustrating experience'}
{'sentiment_for_product': 'positive', 'sentiment_for_customer_service_issue': 'positive', 'tecnical_problems': "difficulties with the phone's software updates", 'negative_customer_surves_experiences': ''}
{'sentiment_for_product': 'positive', 'sentiment_for_customer_service_issue': 'negative', 'tecnical_problems': 'issue with the speaker', 'negative_customer_surves_experiences': 'had to contact customer service multiple times'}
{'sentiment_for_product': 'positive', 'sentiment_for_customer_service_issue': 'positive', 'tecnical_problems': 'lack of timely software updates', 'negative_customer_surves_experiences': ''}
{'sentiment_for_product': 'neutral', 'sentiment_for_customer_service_issue': 'positive', 'tecnical_problems': 'connectivity issues', 'negative_customer_surves_experiences': ''}
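The `enum` and `required` entries in the schema guide the model, but the returned dictionaries may not be checked against them. A minimal sketch of such a check in plain Python follows. `check_result` and `mini_schema` are hypothetical names introduced here for illustration; they are not part of LangChain.

```python
def check_result(result, schema):
    """Return a list of problems found in one tagging result (empty if none)."""
    problems = []
    # Every key listed under "required" must be present in the result.
    for key in schema.get("required", []):
        if key not in result:
            problems.append(f"missing required key: {key}")
    # Any property that declares an "enum" must hold one of the allowed values.
    for key, spec in schema.get("properties", {}).items():
        allowed = spec.get("enum")
        if allowed is not None and key in result and result[key] not in allowed:
            problems.append(f"{key}={result[key]!r} not in {allowed}")
    return problems


# A trimmed copy of the schema above, kept short for illustration.
mini_schema = {
    "properties": {
        "sentiment_for_product": {"type": "string",
                                  "enum": ["positive", "neutral", "negative"]},
        "tecnical_problems": {"type": "string"},
    },
    "required": ["sentiment_for_product", "tecnical_problems"],
}

good = {"sentiment_for_product": "neutral", "tecnical_problems": "software issue"}
bad = {"sentiment_for_product": "mixed"}  # made-up failing example

print(check_result(good, mini_schema))
print(check_result(bad, mini_schema))
```

The same function can be applied to each `chain.invoke(i)["text"]` dictionary with the full schema.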


In [None]:
from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.chains import create_tagging_chain_pydantic



class Classification(BaseModel):
    sentiment_for_product: str = Field(description="The sentiment for product",
                                       enum=["positive", "neutral", "negative"])
    sentiment_for_customer_service_issue: str = Field(description="The sentiment for customer service issues",
                                                      enum=["positive", "neutral", "negative"])
    tecnical_problems: str = Field(description="Details of the technical problem encountered with the products.")
    negative_customer_surves_experiences: str = Field(description="Details of the negative customer service experiences")

chain2 = create_tagging_chain_pydantic(Classification, llm)
for i in inp:
  print(chain2.invoke(i)["text"])

sentiment_for_product='neutral' sentiment_for_customer_service_issue='negative' tecnical_problems='software issue' negative_customer_surves_experiences='frustrating experience with customer service'
sentiment_for_product='positive' sentiment_for_customer_service_issue='positive' tecnical_problems="difficulties with the phone's software updates" negative_customer_surves_experiences=''
sentiment_for_product='positive' sentiment_for_customer_service_issue='negative' tecnical_problems='issue with the speaker' negative_customer_surves_experiences='had to contact customer service multiple times'
sentiment_for_product='positive' sentiment_for_customer_service_issue='positive' tecnical_problems='lack of timely software updates' negative_customer_surves_experiences=''
sentiment_for_product='neutral' sentiment_for_customer_service_issue='positive' tecnical_problems='connectivity issues' negative_customer_surves_experiences=''


In [None]:
tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)

llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125")

tagging_chain = tagging_prompt | llm.with_structured_output(Classification)
for i in inp:
  print(tagging_chain.invoke({"input": i}))

sentiment_for_product='positive' sentiment_for_customer_service_issue='negative' tecnical_problems='software issue' negative_customer_surves_experiences='frustrating experience with customer service'
sentiment_for_product='positive' sentiment_for_customer_service_issue='positive' tecnical_problems="difficulties with the phone's software updates" negative_customer_surves_experiences=''
sentiment_for_product='positive' sentiment_for_customer_service_issue='negative' tecnical_problems='issue with the speaker' negative_customer_surves_experiences='had to contact customer service multiple times'
sentiment_for_product='positive' sentiment_for_customer_service_issue='positive' tecnical_problems='lack of timely software updates' negative_customer_surves_experiences=''
sentiment_for_product='neutral' sentiment_for_customer_service_issue='positive' tecnical_problems='connectivity issues' negative_customer_surves_experiences='None'
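Note that the last result above uses the string `'None'` where the other rows use an empty string. A minimal sketch of normalizing such values before further processing follows. `EMPTY_MARKERS`, `normalize_empty` and `normalize_record` are hypothetical names, and the set of "nothing found" spellings is an assumption.

```python
EMPTY_MARKERS = {"", "none", "n/a", "null"}  # assumed spellings of "nothing found"


def normalize_empty(value):
    """Map strings that mean 'nothing extracted' to None; leave other values unchanged."""
    if isinstance(value, str) and value.strip().lower() in EMPTY_MARKERS:
        return None
    return value


def normalize_record(record):
    """Apply normalize_empty to every field of a result dictionary."""
    return {key: normalize_empty(val) for key, val in record.items()}


# Fields copied from the last result above.
row = {
    "sentiment_for_product": "neutral",
    "tecnical_problems": "connectivity issues",
    "negative_customer_surves_experiences": "None",
}
cleaned = normalize_record(row)
print(cleaned)
```

For a Pydantic v1 object such as `Classification`, `obj.dict()` gives the dictionary to pass in.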


Now, using the patients' discharge summaries, let's determine whether each patient has diabetes and hypertension, and score the likelihood of each condition on a scale of 0 to 5.

In [None]:
patient_1="""Age: 50, Gender: Male,

Symptoms:

Experiencing persistent headaches lately.
Feeling numbness in hands and feet, especially in the evenings.
Frequently experiencing the need to urinate.

Findings:

Physical examination reveals retinal changes due to hypertension and signs of diabetic neuropathy.
Blood pressure measurement indicates 160/100 mmHg.
Blood tests reveal elevated blood sugar levels (fasting blood glucose of 200 mg/dL) and high cholesterol levels."""

patient_2="""Age: 60, Gender: Female

Symptoms:

Experiencing blurred vision lately.
Constantly feeling tired.
Experiencing weakness and fatigue.
Findings:

Physical examination reveals retinopathy due to hypertension and signs of diabetic nephropathy.
Blood pressure measurement indicates 170/110 mmHg.
Blood tests reveal high blood sugar levels (HbA1c levels above 9%) and elevated triglyceride levels."""

patient_3="""Age: 55, Gender: Male

Symptoms:

Experiencing frequent urination, especially during the night.
Feeling constantly thirsty and drinking large amounts of water.
Noticing unexplained weight loss despite normal or increased appetite.
Findings:

Physical examination reveals signs of peripheral neuropathy, such as tingling or numbness in the feet.
Blood tests show elevated fasting blood glucose levels (above 126 mg/dL) and HbA1c levels indicating poorly controlled diabetes.
Urinalysis indicates the presence of glucose in the urine, suggestive of uncontrolled diabetes."""

patient_4="""Age: 35
Gender: Female

Symptoms:

Experiencing frequent headaches, particularly in the morning.
Feeling dizzy or lightheaded, especially when standing up quickly.
Noticing chest pain or discomfort, especially during physical exertion.
Findings:

Blood pressure measurements consistently indicate elevated levels (systolic blood pressure above 140 mmHg and/or diastolic blood pressure above 90 mmHg).
Fundoscopic examination may reveal signs of hypertensive retinopathy, such as retinal hemorrhages or cotton-wool spots.
Laboratory tests may show elevated cholesterol or triglyceride levels, contributing to hypertension."""

patient_5="""Age: 45
Gender: Male

Symptoms:

Experiencing sudden onset of high fever, typically above 100.4°F (38°C).
Complaining of body aches and muscle soreness all over the body.
Having a persistent dry cough, sometimes accompanied by chest discomfort.
Feeling fatigued and weak, often experiencing extreme tiredness.
Findings:

Physical examination may reveal redness and inflammation of the throat.
Auscultation of the lungs may reveal crackles or wheezing due to inflammation.
Rapid antigen tests or PCR tests may confirm the presence of influenza virus in respiratory secretions."""

patients= [patient_1, patient_2, patient_3, patient_4, patient_5]

In [None]:
# Schema
schema = {
    "properties": {
        "diabetes": {
            "type": "string",
            "enum": ["yes", "no"],
            "description": "Classify the patient as having diabetes.",
        },
        "hypertension": {
            "type": "string",
            "enum": ["yes", "no"],
            "description": "Classify the patient as having hypertension.",
        },
        "diabetes_likelihood": {
            "type": "number",
            "description": "score of the patient having diabetes on a scale from 0 to 5.",
            "minimum": 0,
            "maximum": 5,
        },
        "hypertension_likelihood": {
            "type": "number",
            "description": "score of the patient having hypertension on a scale from 0 to 5.",
            "minimum": 0,
            "maximum": 5,
        },
    },
    "required": ["diabetes", "hypertension", "diabetes_likelihood", "hypertension_likelihood"],
}

In [None]:
# Run chain
llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125")
chain = create_tagging_chain(schema, llm)

In [None]:
for i in patients:
  print(chain.invoke(i)["text"])

{'diabetes': 'yes', 'hypertension': 'yes', 'diabetes_likelihood': 4.5, 'hypertension_likelihood': 4.0}
{'diabetes': 'yes', 'hypertension': 'yes', 'diabetes_likelihood': 4.5, 'hypertension_likelihood': 5}
{'diabetes': 'yes', 'hypertension': 'no', 'diabetes_likelihood': 4.5, 'hypertension_likelihood': 0.0}
{'diabetes': 'no', 'hypertension': 'yes', 'diabetes_likelihood': 0, 'hypertension_likelihood': 5}
{'diabetes': 'no', 'hypertension': 'no', 'diabetes_likelihood': 0, 'hypertension_likelihood': 0}
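All scores above fall inside the 0 to 5 range, but the `minimum` and `maximum` entries may not be enforced on the returned dictionary. A minimal sketch of checking them by hand follows. `out_of_range` and `bounds_schema` are hypothetical names, and the out-of-range record is made up for illustration.

```python
def out_of_range(result, schema):
    """Return {field: value} for numeric fields outside the schema's minimum/maximum."""
    flagged = {}
    for key, spec in schema["properties"].items():
        # Only look at numeric properties that are present in the result.
        if spec.get("type") != "number" or key not in result:
            continue
        value = result[key]
        low = spec.get("minimum", float("-inf"))
        high = spec.get("maximum", float("inf"))
        if not (low <= value <= high):
            flagged[key] = value
    return flagged


# Only the numeric properties of the schema above.
bounds_schema = {
    "properties": {
        "diabetes_likelihood": {"type": "number", "minimum": 0, "maximum": 5},
        "hypertension_likelihood": {"type": "number", "minimum": 0, "maximum": 5},
    }
}

ok = {"diabetes": "yes", "hypertension": "yes",
      "diabetes_likelihood": 4.5, "hypertension_likelihood": 4.0}
suspicious = {"diabetes": "yes", "hypertension": "no",
              "diabetes_likelihood": 7, "hypertension_likelihood": 0}  # made-up value

print(out_of_range(ok, bounds_schema))
print(out_of_range(suspicious, bounds_schema))
```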


In [None]:
class Classification(BaseModel):
    diabetes: str = Field(description="Classify the patient as having diabetes",
                          enum=["yes", "no"])
    hypertension: str = Field(description="Classify the patient as having hypertension",
                              enum=["yes", "no"])
    diabetes_likelihood: float = Field(description="score of the patient having diabetes on a scale from 0 to 5.",
                                       ge=0, le=5)  # ge = minimum, le = maximum
    hypertension_likelihood: float = Field(description="score of the patient having hypertension on a scale from 0 to 5",
                                           ge=0, le=5)

chain2 = create_tagging_chain_pydantic(Classification, llm)
for i in patients:
  print(chain2.invoke(i)["text"])

diabetes='yes' hypertension='yes' diabetes_likelihood=4.5 hypertension_likelihood=4.0
diabetes='yes' hypertension='yes' diabetes_likelihood=4.5 hypertension_likelihood=4.0
diabetes='yes' hypertension='no' diabetes_likelihood=4.5 hypertension_likelihood=0.0
diabetes='no' hypertension='yes' diabetes_likelihood=0.0 hypertension_likelihood=5.0
diabetes='no' hypertension='no' diabetes_likelihood=0.0 hypertension_likelihood=0.0


In [None]:
tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)

llm = ChatOpenAI(temperature=0,
                 model="gpt-3.5-turbo-0125")

tagging_chain = tagging_prompt | llm.with_structured_output(Classification)
for i in patients:
  print(tagging_chain.invoke({"input": i}))

diabetes='yes' hypertension='yes' diabetes_likelihood=4.5 hypertension_likelihood=4.0
diabetes='yes' hypertension='yes' diabetes_likelihood=4.5 hypertension_likelihood=4.0
diabetes='yes' hypertension='no' diabetes_likelihood=4.5 hypertension_likelihood=0.0
diabetes='no' hypertension='yes' diabetes_likelihood=0.0 hypertension_likelihood=4.5
diabetes='no' hypertension='no' diabetes_likelihood=0.0 hypertension_likelihood=0.0
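The approaches agree on the yes/no labels but not always on the scores. A minimal sketch of listing the disagreements between two runs follows, using the likelihood scores copied from the `create_tagging_chain` output and the `with_structured_output` output above. `diff_runs` is a hypothetical helper.

```python
def diff_runs(run_a, run_b):
    """List (patient_number, field, value_a, value_b) wherever two runs disagree."""
    differences = []
    for number, (a, b) in enumerate(zip(run_a, run_b), start=1):
        for field in a:
            if a[field] != b.get(field):
                differences.append((number, field, a[field], b.get(field)))
    return differences


# Scores from the dictionary-schema chain (create_tagging_chain).
dict_schema_run = [
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 4.0},
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 5},
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 0.0},
    {"diabetes_likelihood": 0, "hypertension_likelihood": 5},
    {"diabetes_likelihood": 0, "hypertension_likelihood": 0},
]

# Scores from the with_structured_output chain.
structured_run = [
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 4.0},
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 4.0},
    {"diabetes_likelihood": 4.5, "hypertension_likelihood": 0.0},
    {"diabetes_likelihood": 0.0, "hypertension_likelihood": 4.5},
    {"diabetes_likelihood": 0.0, "hypertension_likelihood": 0.0},
]

for row in diff_runs(dict_schema_run, structured_run):
    print(row)
# (2, 'hypertension_likelihood', 5, 4.0)
# (4, 'hypertension_likelihood', 5, 4.5)
```

Only the hypertension scores for patients 2 and 4 differ, so the scores are better read as rough indicators than as calibrated values.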


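## Custom Prompt

The chain built by `create_tagging_chain` exposes its default prompt, which can be printed and overwritten to add extra instructions.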

In [None]:
# Schema
schema = {
    "properties": {
        "sentiment_for_product": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "The sentiment for product",
        },
        "tecnical_problems": {
            "type": "string",
            "description": "Details of the technical problem encountered with the products.",
        },
    },
    "required": ["sentiment_for_product", "tecnical_problems"],
}

# Input
inp = ["Although the phone's battery life is satisfactory, I had a frustrating experience with customer service when I needed help with a software issue.",
"The camera's low-light performance is excellent, but I encountered difficulties with the phone's software updates. Fortunately, the customer service team was \
helpful in resolving the issue.",
"The design of the phone is impressive, but I had to contact customer service multiple times to address an issue with the speaker.",
"I'm satisfied with the phone's performance overall, but the lack of timely software updates is disappointing. Customer service was responsive when \
I reached out for assistance. ",
"The phone's sleek design caught my eye, but I faced challenges with connectivity issues. Despite this, customer service was prompt in helping me \
troubleshoot the problem."]


In [None]:
import os

llm = ChatOpenAI(temperature=0,
                 model="gpt-4-turbo-2024-04-09",
                 openai_api_key=os.environ['OPENAI_API_KEY'])
chain = create_tagging_chain(schema, llm)

In [None]:
#default chain prompt
print(chain.prompt.messages[0].prompt.template)

Extract the desired information from the following passage.

Only extract the properties mentioned in the 'information_extraction' function.

Passage:
{input}



In [None]:
template_prompt = """Extract the desired information from the following passage.

Only extract the properties mentioned in the 'information_extraction' function.

capitalize the first letter of the tecnical_problems properties.

Passage:
{input}
"""

chain.prompt.messages[0].prompt.template=template_prompt
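Instead of retyping the whole template, the extra instruction can be inserted into the default text. A minimal sketch follows, assuming the default template matches the text printed above. `add_instruction` is a hypothetical helper, not a LangChain function.

```python
# Default template text, copied from the printed output above.
DEFAULT_TEMPLATE = """Extract the desired information from the following passage.

Only extract the properties mentioned in the 'information_extraction' function.

Passage:
{input}
"""


def add_instruction(template, instruction, marker="Passage:"):
    """Insert an extra instruction paragraph just before the marker line of a template."""
    if marker not in template:
        raise ValueError(f"marker {marker!r} not found in template")
    head, tail = template.split(marker, 1)
    return f"{head}{instruction}\n\n{marker}{tail}"


custom = add_instruction(
    DEFAULT_TEMPLATE,
    "capitalize the first letter of the tecnical_problems properties.",
)

# The {input} placeholder must survive, otherwise the chain cannot fill in the passage.
assert "{input}" in custom
print(custom.format(input="The speaker stopped working."))
```

The resulting `custom` string can then be assigned to `chain.prompt.messages[0].prompt.template` in the same way as `template_prompt`.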

In [None]:
for i in inp:
  print(chain.invoke(i)["text"])

{'sentiment_for_product': 'neutral', 'tecnical_problems': 'Software issue'}
{'sentiment_for_product': 'positive', 'tecnical_problems': "Difficulties with the phone's software updates"}
{'sentiment_for_product': 'positive', 'tecnical_problems': 'Had to contact customer service multiple times to address an issue with the speaker.'}
{'sentiment_for_product': 'positive', 'tecnical_problems': 'Lack of timely software updates'}
{'sentiment_for_product': 'neutral', 'tecnical_problems': 'Connectivity issues'}
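Formatting rules like this one can also be applied deterministically after the call, rather than through the prompt. A minimal sketch follows. `capitalize_first` is a hypothetical helper, and the "USB port" string is a made-up example.

```python
def capitalize_first(text):
    """Upper-case only the first character, leaving the rest of the string untouched."""
    return text[:1].upper() + text[1:]


print(capitalize_first("difficulties with the phone's software updates"))
# Difficulties with the phone's software updates

# str.capitalize() also lower-cases the remaining characters, which may not be wanted.
print("issue with the USB port".capitalize())
# Issue with the usb port
print(capitalize_first("issue with the USB port"))
# Issue with the USB port
```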


In [None]:
class Sentiment(BaseModel):
    sentiment_for_product: str = Field(description="The sentiment for product",
                                              enum=["positive", "neutral", "negative"])

    tecnical_problems: str = Field(description="Details of the technical problem encountered with the product.")

In [None]:
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
                 temperature=0)
# Alternative: chain = tagging_prompt | llm.with_structured_output(schema=Sentiment)
chain = create_tagging_chain_pydantic(Sentiment, llm)
for i in inp:
  print(chain.invoke(i)["text"])

sentiment_for_product='neutral' tecnical_problems='frustrating experience with customer service'
sentiment_for_product='positive' tecnical_problems="difficulties with the phone's software updates"
sentiment_for_product='positive' tecnical_problems='issue with the speaker'
sentiment_for_product='positive' tecnical_problems='lack of timely software updates'
sentiment_for_product='neutral' tecnical_problems='connectivity issues'


In [None]:
inp

["Although the phone's battery life is satisfactory, I had a frustrating experience with customer service when I needed help with a software issue.",
 "The camera's low-light performance is excellent, but I encountered difficulties with the phone's software updates. Fortunately, the customer service team was helpful in resolving the issue.",
 'The design of the phone is impressive, but I had to contact customer service multiple times to address an issue with the speaker.',
 "I'm satisfied with the phone's performance overall, but the lack of timely software updates is disappointing. Customer service was responsive when I reached out for assistance. ",
 "The phone's sleek design caught my eye, but I faced challenges with connectivity issues. Despite this, customer service was prompt in helping me troubleshoot the problem."]

END OF THE PROJECT