# Extract examples

Source: https://ai.plainenglish.io/harnessing-the-openai-api-with-langchain-and-pydantic-for-structured-data-extraction-30e3e6966699

Combining LangChain with Pydantic lets an application ask an LLM for structured output and validate that output against a typed schema before using it. The cells below first extract fields from a paper abstract with an OpenAI-functions extraction chain, then use LangChain's Pydantic output parser to turn a model's reply into a validated Pydantic object.


In [1]:
inp = """Large language models (LLMs) have demonstrated remarkable \
generalizability, such as understanding arbitrary entities and relations. \
Instruction tuning has proven effective for distilling LLMs \
into more cost-efficient models such as Alpaca and Vicuna. \
Yet such student models still trail the original LLMs by \
large margins in downstream applications. In this paper, \
we explore targeted distillation with mission-focused instruction \
tuning to train student models that can excel in a broad application \
class such as open information extraction. Using named entity \
recognition (NER) for case study, we show how ChatGPT can be distilled \
into much smaller UniversalNER models for open NER. For evaluation, \
we assemble the largest NER benchmark to date, comprising 43 datasets \
across 9 diverse domains such as biomedicine, programming, social media, \
law, finance. Without using any direct supervision, UniversalNER \
attains remarkable NER accuracy across tens of thousands of entity \
types, outperforming general instruction-tuned models such as Alpaca \
and Vicuna by over 30 absolute F1 points in average. With a tiny \
fraction of parameters, UniversalNER not only acquires ChatGPT's \
capability in recognizing arbitrary entity types, but also \
outperforms its NER accuracy by 7-9 absolute F1 points in average. \
Remarkably, UniversalNER even outperforms by a large margin \
state-of-the-art multi-task instruction-tuned systems such as \
InstructUIE, which uses supervised NER examples. \
We also conduct thorough ablation studies to assess the impact of \
various components in our distillation approach. We will release \
the distillation recipe, data, and UniversalNER models to facilitate \
future research on targeted distillation."""

In [2]:
import os
import openai

# Load your API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

# Never print the key: notebook outputs are saved and shared along with the code
if not openai.api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")


In [3]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

In [4]:
# Extraction using OpenAI functions
# Schema which will be filled using extracted information

schema = {
    "properties": {
        "research_topic": {"type": "string"},
        "problem_statement": {"type": "string"},
        "experiment_design": {"type": "string"},
        "finding": {"type": "string"},
    },
    "required": ["research_topic", "finding"],
}

# create model
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo", openai_api_key=openai.api_key)

# create extraction chain
chain = create_extraction_chain(schema, llm)

# run extraction chain; the result is a list of dicts, one per extracted record
vanilla_output = chain.run(inp)
print(vanilla_output)
print(vanilla_output[0])

[{'research_topic': 'targeted distillation with mission-focused instruction tuning', 'problem_statement': 'train student models that can excel in a broad application class such as open information extraction', 'experiment_design': 'distilling ChatGPT into smaller UniversalNER models for open NER', 'finding': 'UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points in average'}]
{'research_topic': 'targeted distillation with mission-focused instruction tuning', 'problem_statement': 'train student models that can excel in a broad application class such as open information extraction', 'experiment_design': 'distilling ChatGPT into smaller UniversalNER models for open NER', 'finding': 'UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 abso
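
The schema marks `research_topic` and `finding` as required, but the chain returns plain dicts, and the model's reply may not fill every field. A minimal sketch of a post-check using only the standard library; the helper name `missing_required` and the two sample records are hypothetical and not part of LangChain:

```python
# Schema copied from the extraction cell
schema = {
    "properties": {
        "research_topic": {"type": "string"},
        "problem_statement": {"type": "string"},
        "experiment_design": {"type": "string"},
        "finding": {"type": "string"},
    },
    "required": ["research_topic", "finding"],
}


def missing_required(schema, record):
    """Return the required keys that are absent or empty in one extracted record."""
    return [key for key in schema["required"] if not record.get(key)]


# Hypothetical records shaped like the chain output
complete = {"research_topic": "targeted distillation", "finding": "improved NER accuracy"}
incomplete = {"research_topic": "targeted distillation"}

print(missing_required(schema, complete))    # []
print(missing_required(schema, incomplete))  # ['finding']
```

Running such a check over every dict in `vanilla_output` would flag incomplete records before they reach downstream code.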

In [5]:
from typing import List
from pydantic import BaseModel, Field, field_validator

class CodeRisk(BaseModel):
    description: str = Field(description="A brief description of the security risk")
    severity: str = Field(description="The level of severity of the risk")
    recommendations: List[str] = Field(description="Recommended actions to mitigate the risk")

    @field_validator("severity")
    @classmethod
    def severity_must_be_valid(cls, v):
        # Reject any severity outside the four allowed levels
        if v not in ["Low", "Medium", "High", "Critical"]:
            raise ValueError("Invalid severity level!")
        return v
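
To see what the validator buys, it may help to exercise the model directly, without any LLM call. The sketch below assumes Pydantic v2 (which provides `field_validator`) and repeats the `CodeRisk` class so it runs on its own; the sample values are made up for illustration:

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError, field_validator


class CodeRisk(BaseModel):
    description: str = Field(description="A brief description of the security risk")
    severity: str = Field(description="The level of severity of the risk")
    recommendations: List[str] = Field(description="Recommended actions to mitigate the risk")

    @field_validator("severity")
    @classmethod
    def severity_must_be_valid(cls, v):
        if v not in ["Low", "Medium", "High", "Critical"]:
            raise ValueError("Invalid severity level!")
        return v


# A valid severity passes through unchanged (hypothetical sample values)
ok = CodeRisk(
    description="Hard-coded password",
    severity="High",
    recommendations=["Use a secrets store"],
)
print(ok.severity)

# An unknown severity is rejected with a ValidationError
try:
    CodeRisk(description="Hard-coded password", severity="Severe", recommendations=[])
    rejected = False
except ValidationError as exc:
    rejected = True
    print(len(exc.errors()), "validation error(s)")
```

The same check runs whenever an output parser builds a `CodeRisk` from a model reply, so a reply with an unexpected severity fails loudly instead of passing through.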

    

In [6]:
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

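# Note: OpenAI has since retired text-davinci-003; gpt-3.5-turbo-instruct is
# its suggested completion-model replacement.
model_name = "text-davinci-003"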
temperature = 0.0
model = OpenAI(
  model_name=model_name,
  temperature=temperature,
  openai_api_key=openai.api_key,
)

code_snippet_query = """
Analyze the following block of code for potential security risks:
public void authenticate(String username, String password) {
    if (username.equals("admin") && password.equals("password123")) {
    // Authentication logic
    }
}
"""

parser = PydanticOutputParser(pydantic_object=CodeRisk)
prompt = PromptTemplate(
    template="Identify any potential security risks in the code snippet provided.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
_input = prompt.format_prompt(query=code_snippet_query)
output = model(_input.to_string())

# The parsed_output variable now contains the structured data as a Pydantic model instance.
parsed_output = parser.parse(output)

In [7]:
print(parsed_output)

description='The code snippet is vulnerable to a hard-coded credentials attack, as it checks for a specific username and password combination.' severity='High' recommendations=['Implement a secure authentication mechanism that does not rely on hard-coded credentials.', 'Ensure that passwords are stored in a secure manner, such as using a one-way hashing algorithm.']
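
Roughly speaking, the parser's job is to take the JSON in the model's reply and validate it against `CodeRisk`. The sketch below shows that validation step in isolation using Pydantic v2's `model_validate_json`. It is not LangChain's actual implementation, and `raw_reply` is a made-up stand-in for a model response:

```python
from typing import List

from pydantic import BaseModel, Field, field_validator


class CodeRisk(BaseModel):
    description: str = Field(description="A brief description of the security risk")
    severity: str = Field(description="The level of severity of the risk")
    recommendations: List[str] = Field(description="Recommended actions to mitigate the risk")

    @field_validator("severity")
    @classmethod
    def severity_must_be_valid(cls, v):
        if v not in ["Low", "Medium", "High", "Critical"]:
            raise ValueError("Invalid severity level!")
        return v


# Hypothetical model reply: a JSON object matching the CodeRisk fields
raw_reply = (
    '{"description": "Hard-coded credentials", '
    '"severity": "High", '
    '"recommendations": ["Load credentials from a secrets store"]}'
)

# Parse and validate in one step; raises pydantic.ValidationError on bad input
risk = CodeRisk.model_validate_json(raw_reply)

# The result is a typed object, so fields are plain attributes
print(risk.severity)
for number, recommendation in enumerate(risk.recommendations, start=1):
    print(number, recommendation)

# model_dump() converts it back to a dict, e.g. for logging or storage
as_dict = risk.model_dump()
print(as_dict["description"])
```

The `parsed_output` object from the parser cell can be used the same way: `parsed_output.severity`, `parsed_output.recommendations`, and so on.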
