# Extraction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/extraction/extraction.ipynb)

## Use case

---

Getting structured output from LLM generation is hard.

For example, suppose you need the model output formatted as JSON or in some other specified schema for:

- Extracting a structured row to insert into a database from a sentence
- Extracting multiple rows to insert into a database from a long document
- Extracting the correct API parameters from a user query


![Image description](../../../docs_skeleton/static/img/extraction.png)

## Overview 

--- 

There are two primary approaches for this:

- `Functions`: OpenAI [functions](https://openai.com/blog/function-calling-and-other-api-updates) can be used to extract entities from unstructured text. They give us control over which entities to extract and what data type each should have.


- `Parsing`: [Output parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. They are useful when you are using an LLM that does not support functions, or when you want fine-grained control over the output (e.g., a date in a specific datetime format).

![Image description](../../../docs_skeleton/static/img/extraction_options.png)
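
To make the `Functions` approach more concrete, the sketch below shows roughly what a function definition carrying an extraction schema can look like. The `name` / `description` / `parameters` layout follows the OpenAI function-calling format; the helper `build_function_definition`, the function name `record_entities`, and the `info` wrapper key are assumptions made for illustration, and LangChain may build its own definition differently.

```python
import json


def build_function_definition(schema: dict) -> dict:
    """Wrap a JSON schema of entity properties in a function definition.

    Hypothetical helper: entities are returned as a list under an "info"
    key so that several entities can be extracted from one passage.
    """
    entity = {"type": "object", **schema}
    return {
        "name": "record_entities",  # hypothetical function name
        "description": "Record the entities mentioned in the passage.",
        "parameters": {
            "type": "object",
            "properties": {"info": {"type": "array", "items": entity}},
            "required": ["info"],
        },
    }


schema = {
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
    },
    "required": ["name"],
}

function_definition = build_function_definition(schema)
print(json.dumps(function_definition, indent=2))
```

The model is then asked to "call" this function, and its arguments arrive as JSON that matches the schema, which is what makes the output machine-readable.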

## Quickstart

---

OpenAI functions are a nice way to get started with extraction.

To extract entities, we create a schema where we specify all the properties we want to find.

We can also specify which of these properties are required and which are optional.

In [None]:
! pip install openai 
! pip install langchain

import os
os.environ["OPENAI_API_KEY"] = "your-api-key"

In [4]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

# JSON schema
schema = {
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}

# Input 
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""

# Run chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)
chain.run(inp)

[{'name': 'Alex', 'height': 5, 'hair_color': 'blonde'},
 {'name': 'Claudia', 'height': 6, 'hair_color': 'brunette'}]
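
Because the model is not guaranteed to honor the schema, it can be worth checking the returned rows before using them. The following is a minimal sketch using only the standard library; `check_row` and `JSON_TYPES` are hypothetical names, only the two JSON types used in this schema are handled, and the second row is made up to show a failure.

```python
# Map JSON schema type names to Python types (only the ones used here).
JSON_TYPES = {"string": str, "integer": int}


def check_row(row: dict, schema: dict) -> list:
    """Return a list of problems found in one extracted row."""
    problems = []
    # Every required property must be present.
    for key in schema.get("required", []):
        if key not in row:
            problems.append(f"missing required property: {key}")
    # Every returned property must be declared and have the declared type.
    for key, value in row.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected property: {key}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            problems.append(f"{key} should be of type {spec['type']}")
    return problems


schema = {
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}

rows = [
    {"name": "Alex", "height": 5, "hair_color": "blonde"},
    {"name": "Claudia", "hair_color": "brunette"},  # hypothetical row without height
]

for row in rows:
    print(row["name"], check_row(row, schema))
```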

## Option 1: OpenAI functions

---

### Multiple entity types

Let's say we want to differentiate between dogs and people.

We can add `person_` and `dog_` prefixes to each property name.

In [8]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_height": {"type": "integer"},
        "person_hair_color": {"type": "string"},
        "dog_name": {"type": "string"},
        "dog_breed": {"type": "string"},
    },
    "required": ["person_name", "person_height"],
}

chain = create_extraction_chain(schema, llm)

inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Alex's dog Frosty is a labrador and likes to play hide and seek."""

chain.run(inp)

[{'person_name': 'Alex',
  'person_height': 5,
  'person_hair_color': 'blonde',
  'dog_name': 'Frosty',
  'dog_breed': 'labrador'},
 {'person_name': 'Claudia',
  'person_height': 6,
  'person_hair_color': 'brunette'}]
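
The prefixes keep the schema flat, so downstream code may want to regroup the properties by entity type. A minimal sketch, assuming the `person_` and `dog_` prefixes used above; `split_by_prefix` is a hypothetical helper, not part of LangChain.

```python
def split_by_prefix(record: dict, prefixes=("person", "dog")) -> dict:
    """Group the keys of a flat record by their entity prefix.

    Keys that match none of the prefixes are ignored.
    """
    grouped = {}
    for key, value in record.items():
        for prefix in prefixes:
            if key.startswith(prefix + "_"):
                # Strip the prefix and the underscore from the key.
                short_key = key[len(prefix) + 1:]
                grouped.setdefault(prefix, {})[short_key] = value
                break
    return grouped


record = {
    "person_name": "Alex",
    "person_height": 5,
    "person_hair_color": "blonde",
    "dog_name": "Frosty",
    "dog_breed": "labrador",
}

nested = split_by_prefix(record)
print(nested)
```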

### Unrelated entities

If we use `"required": []`, no property is mandatory, so the model can return **only** person attributes or **only** dog attributes for a single entity (person or dog).

In [9]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_height": {"type": "integer"},
        "person_hair_color": {"type": "string"},
        "dog_name": {"type": "string"},
        "dog_breed": {"type": "string"},
    },
    "required": [],
}

chain = create_extraction_chain(schema, llm)

inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.
Willow is a German Shepherd that likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by."""

chain.run(inp)

[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},
 {'person_name': 'Claudia',
  'person_height': 6,
  'person_hair_color': 'brunette'},
 {'dog_name': 'Willow', 'dog_breed': 'German Shepherd'},
 {'dog_name': 'Milo', 'dog_breed': 'border collie'}]
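
Since each returned entity now carries attributes of only one type, the rows map naturally onto separate database tables. The sketch below uses the standard library's `sqlite3` with an in-memory database; the table names and columns are assumptions chosen to mirror the schema, and the `results` list is copied from the output above rather than produced by a live chain.

```python
import sqlite3

results = [
    {"person_name": "Alex", "person_height": 5, "person_hair_color": "blonde"},
    {"person_name": "Claudia", "person_height": 6, "person_hair_color": "brunette"},
    {"dog_name": "Willow", "dog_breed": "German Shepherd"},
    {"dog_name": "Milo", "dog_breed": "border collie"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, height INTEGER, hair_color TEXT)")
conn.execute("CREATE TABLE dogs (name TEXT, breed TEXT)")

for row in results:
    # Route each entity to a table based on which name property it has.
    if "person_name" in row:
        conn.execute(
            "INSERT INTO people VALUES (?, ?, ?)",
            (row["person_name"], row.get("person_height"), row.get("person_hair_color")),
        )
    elif "dog_name" in row:
        conn.execute(
            "INSERT INTO dogs VALUES (?, ?)",
            (row["dog_name"], row.get("dog_breed")),
        )
conn.commit()

people = conn.execute("SELECT name, height FROM people ORDER BY name").fetchall()
dogs = conn.execute("SELECT name, breed FROM dogs ORDER BY name").fetchall()
print(people)
print(dogs)
```

Using `row.get(...)` for the non-name columns stores `NULL` when the model omits an optional property instead of raising a `KeyError`.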

### Extra information

To get more information about an entity, we can add a placeholder property for unstructured extraction.

Here, we add `dog_extra_info`.

In [10]:
schema = {
    "properties": {
        "person_name": {"type": "string"},
        "person_height": {"type": "integer"},
        "person_hair_color": {"type": "string"},
        "dog_name": {"type": "string"},
        "dog_breed": {"type": "string"},
        "dog_extra_info": {"type": "string"},
    },
}

chain = create_extraction_chain(schema, llm)

chain.run(inp)

[{'person_name': 'Alex', 'person_height': 5, 'person_hair_color': 'blonde'},
 {'person_name': 'Claudia',
  'person_height': 6,
  'person_hair_color': 'brunette'},
 {'dog_name': 'Willow',
  'dog_breed': 'German Shepherd',
  'dog_extra_info': 'likes to play with other dogs'},
 {'dog_name': 'Milo',
  'dog_breed': 'border collie',
  'dog_extra_info': 'lives close by'}]

This gives us additional information about the dogs.

### Pydantic 

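Functions are powerful and flexible, and the extraction schema can also be defined as a Pydantic class instead of a raw dictionary.

Pydantic is a Python library for defining data models: classes whose attributes are annotated with types and validated when an instance is created.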

First, we can define a class with attributes annotated with types.

In [17]:
from typing import Optional
from pydantic import BaseModel
from langchain.chains import create_extraction_chain_pydantic

# Pydantic class
class Properties(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    # Explicit None defaults keep these fields optional across Pydantic versions.
    dog_breed: Optional[str] = None
    dog_name: Optional[str] = None

# Extraction
chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm)

# Run 
inp = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde."""
chain.run(inp)

[Properties(person_name='Alex', person_height=5, person_hair_color='blonde', dog_breed=None, dog_name=None),
 Properties(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)]
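
One benefit of a Pydantic schema is that each record is validated when the object is created. The sketch below exercises the `Properties` class directly, without calling an LLM; it assumes `pydantic` is installed, and the malformed record is made up for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Properties(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str] = None
    dog_name: Optional[str] = None


# A well-formed record: omitted optional fields default to None.
alex = Properties(person_name="Alex", person_height=5, person_hair_color="blonde")
print(alex)

# A malformed record (hypothetical): the height cannot be read as an integer
# and the hair color is missing, so validation fails.
try:
    Properties(person_name="Claudia", person_height="tall")
    rejected = False
except ValidationError as err:
    rejected = True
    print(err)
```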

## Option 2: Parsing

--- 

[Output parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/) are classes that help structure language model responses. 

There are two main methods an output parser must implement:

* `get_format_instructions`: A method that returns a string containing instructions for how the output of a language model should be formatted.
* `parse`: A method that takes in a string (assumed to be the response from a language model) and parses it into some structure.
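
The two methods can be sketched with a small standalone class. `KeyValueJsonParser` below is a hypothetical stand-in written for illustration, not a LangChain class, and the `reply` string is hard-coded in place of a real model response.

```python
import json


class KeyValueJsonParser:
    """Hypothetical parser for a flat JSON object with known keys."""

    def __init__(self, keys):
        self.keys = list(keys)

    def get_format_instructions(self) -> str:
        # Text added to the prompt so the model knows the expected format.
        return (
            "Respond with a single JSON object with exactly these keys: "
            + ", ".join(self.keys)
        )

    def parse(self, text: str) -> dict:
        # Keep only the span from the first "{" to the last "}" so that
        # text around the JSON object does not break json.loads.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise ValueError("no JSON object found in model output")
        data = json.loads(text[start : end + 1])
        missing = [key for key in self.keys if key not in data]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        return data


sketch_parser = KeyValueJsonParser(["setup", "punchline"])
instructions = sketch_parser.get_format_instructions()
print(instructions)

# Hard-coded string standing in for a model response.
reply = (
    'Sure! {"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side!"}'
)
joke = sketch_parser.parse(reply)
print(joke)
```

The format instructions go into the prompt and `parse` runs on the response, which is the same division of labor the Pydantic parser example below relies on.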

### Pydantic

In [16]:
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from pydantic import BaseModel, Field, validator
from langchain.output_parsers import PydanticOutputParser


# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if not field.endswith("?"):
            raise ValueError("Badly formed question!")
        return field

# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

# Prompt
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Run
_input = prompt.format_prompt(query=joke_query)
model = OpenAI(temperature=0)
output = model(_input.to_string())
parser.parse(output)

Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')

As we can see, the output is an instance of the `Joke` class that follows the schema we defined, with `setup` and `punchline` fields.

### Going deeper

* The [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/) documentation includes various parser examples for specific types (e.g., lists, datetime, enum).
* Libraries like [JSONFormer](https://python.langchain.com/docs/integrations/llms/jsonformer_experimental) offer another way to do structured decoding of a subset of JSON Schema.