# Key Data Extraction App

## Intro
* We will create an app to **extract structured information from unstructured text**. Imagine, for example, that you want to extract the first name, last name, and country of users who submit comments on your company's website.

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [2]:
MODEL_GPT = 'gpt-4o-mini'

## Connect to the LLM

In [3]:
from langchain_openai import ChatOpenAI

# llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
llm = ChatOpenAI(model=MODEL_GPT)

## Define what information we want to extract
* **We'll use Pydantic to define a schema for extracting personal information**.
* **Document the attributes and the schema itself**: this information is sent to the LLM and is used to improve the quality of the extraction.
* Do not force the LLM to make up information! **We import `Optional` so that each attribute can be `None` when the LLM doesn't know the answer**.

In [4]:
from typing import Optional

# from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    lastname: Optional[str] = Field(
        default=None, description="The lastname of the person if known"
    )
    country: Optional[str] = Field(
        default=None, description="The country of the person if known"
    )
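
To see what the LLM is likely to receive, it can help to inspect the JSON schema that Pydantic generates from the class. The following is a minimal sketch, assuming Pydantic v2 (where `model_json_schema()` is available). It redefines a trimmed-down `Person` so the snippet runs on its own; the variable names are illustrative only.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    country: Optional[str] = Field(
        default=None, description="The country of the person if known"
    )


# The class docstring and the field descriptions both end up in the schema.
schema = Person.model_json_schema()
schema_description = schema["description"]
name_description = schema["properties"]["name"]["description"]

# Because every field is Optional with default=None, an empty Person is valid.
empty_person = Person()
print(schema_description)
print(name_description)
print(empty_person)
```

Since nothing is required, the model can leave a field as `None` instead of inventing a value.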

## Define Extractor

In [5]:
from langchain_core.prompts import ChatPromptTemplate

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

* We need to use a model that supports function/tool calling.
* Please review the [documentation](https://python.langchain.com/v0.2/docs/concepts/#function-tool-calling) for a list of some of the models that can be used with this API.

In [6]:
chain = prompt | llm.with_structured_output(schema=Person)

## Try the app

In [7]:
comment = "I absolutely love this product! It's been a game-changer for my daily routine. The quality is top-notch and the customer service is outstanding. I've recommended it to all my friends and family. - Sarah Johnson, USA"

In [8]:
chain.invoke({"text": comment})

Person(name='Sarah', lastname='Johnson', country='USA')

* **Note that this extraction capability is generative**, which means that our model can perform a variety of tasks beyond what is expected. For instance, the model could infer the gender of a user from their name, even when this information is not explicitly provided.

## Extracting a list of entities rather than a single entity
* In real projects you will usually extract a list of entities rather than a single entity. **This can be easily achieved with Pydantic by nesting models inside one another**.

In [9]:
from typing import List, Optional

# from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    lastname: Optional[str] = Field(
        default=None, description="The lastname of the person if known"
    )
    country: Optional[str] = Field(
        default=None, description="The country of the person if known"
    )

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]
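
The nesting can be checked without calling the LLM by validating a plain dictionary against the wrapper model. The sketch below assumes Pydantic v2 (`model_validate`) and uses a hypothetical payload with two people, one of whom has no known country; the models are trimmed copies so the snippet runs on its own.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    country: Optional[str] = Field(
        default=None, description="The country of the person if known"
    )


class Data(BaseModel):
    """Extracted data about people."""

    people: List[Person]


# Hypothetical raw payload; the second person has no country.
raw = {"people": [{"name": "Emily", "country": "Canada"}, {"name": "Tom"}]}

# Each item in the list is validated and converted into a Person.
data = Data.model_validate(raw)
people_count = len(data.people)
second_country = data.people[1].country
print(data)
```

Each entry in `people` becomes its own `Person`, and missing attributes fall back to `None`.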

In [10]:
chain = prompt | llm.with_structured_output(schema=Data)

In [11]:
comment = "I'm so impressed with this product! It has truly transformed how I approach my daily tasks. The quality exceeds my expectations, and the customer support is truly exceptional. I've already suggested it to all my colleagues and relatives. - Emily Clarke, Canada"

In [12]:
chain.invoke({"text": comment})

Data(people=[Person(name='Emily', lastname='Clarke', country='Canada')])
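
Once the chain returns a `Data` object, a typical next step is to turn it into plain Python structures for storage or reporting. The following sketch assumes Pydantic v2 (`model_dump`) and builds the `Data` object by hand as a stand-in for the result of `chain.invoke(...)`, so it runs without an API key; `rows` and `full_names` are illustrative names.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    lastname: Optional[str] = Field(
        default=None, description="The lastname of the person if known"
    )
    country: Optional[str] = Field(
        default=None, description="The country of the person if known"
    )


class Data(BaseModel):
    """Extracted data about people."""

    people: List[Person]


# Stand-in for the object returned by chain.invoke({"text": comment}).
result = Data(people=[Person(name="Emily", lastname="Clarke", country="Canada")])

# One plain dict per extracted person, e.g. for writing to a database or CSV.
rows = [person.model_dump() for person in result.people]

# Join the name parts that are present, skipping any that are None.
full_names = [
    " ".join(part for part in (person.name, person.lastname) if part)
    for person in result.people
]
print(rows)
print(full_names)
```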