<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/langchain/use_cases/Langchain_OpenAI_Use_cases_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain

LangChain is a framework for developing applications powered by language models.

https://python.langchain.com/docs/use_cases

## Extraction

https://python.langchain.com/docs/get_started/introduction

https://python.langchain.com/docs/use_cases/extraction/

Classical solutions to information extraction rely on a combination of people, (many) hand-crafted rules (e.g., regular expressions), and custom fine-tuned ML models.

Such systems tend to get complex over time and become progressively more expensive to maintain and more difficult to enhance.

LLMs can be adapted quickly to specific extraction tasks simply by giving them appropriate instructions and appropriate reference examples.
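
As a rough illustration of "instructions plus reference examples", the sketch below assembles a few-shot extraction prompt as a plain string. The helper name `build_extraction_prompt`, the `Input:`/`Output:` layout, and the sample sentences are assumptions made for this example; they are not part of LangChain.

```python
# Hypothetical helper: assembles an extraction prompt from instructions
# and reference examples. Names and layout are illustrative assumptions.
def build_extraction_prompt(instructions, examples, text):
    """Return a prompt made of instructions, worked examples, and the new input."""
    parts = [instructions.strip(), ""]
    for source, expected in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {expected}")
        parts.append("")
    # The new text goes last, with an open "Output:" slot for the model to fill.
    parts.append(f"Input: {text}")
    parts.append("Output:")
    return "\n".join(parts)


examples = [
    ("Alice joined Acme in 2019.", '{"person": "Alice", "company": "Acme"}'),
    ("Bob left Globex last year.", '{"person": "Bob", "company": "Globex"}'),
]
prompt = build_extraction_prompt(
    "Extract the person and company from each input as JSON.",
    examples,
    "Carol now works at Initech.",
)
print(prompt)
```

Changing the task then means editing the instruction string and the example pairs rather than rewriting rules or retraining a model.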


https://python.langchain.com/docs/modules/model_io/prompts/

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf


## Interesting Tool
https://github.com/langchain-ai/langchain-extract/tree/main


# Pydantic
Pydantic is the most widely used data validation library for Python.
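
A minimal sketch of what validation means here, assuming `pydantic` is installed. The `Person` model is hypothetical and exists only for this illustration: well-formed input is coerced to the declared types, and malformed input raises `ValidationError`.

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    # Hypothetical model, used only for illustration.
    name: str = Field(description="Full name of the person.")
    age: int = Field(description="Age in years.")


# Well-formed input: the numeric string "51" is coerced to an int.
person = Person(name="Ada", age="51")
print(person.age, type(person.age))

# Malformed input: "unknown" cannot be read as an int, so validation fails.
try:
    Person(name="Bob", age="unknown")
    failed = False
except ValidationError as err:
    failed = True
    print("validation failed:", len(err.errors()), "error(s)")
```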


In [3]:
!pip install langchain langchain-community tiktoken -q
!pip install -U accelerate -q
!pip install -U unstructured numpy -q
!pip install openai==0.27.7 chromadb -q
!pip install pypdf -q


In [4]:

from google.colab import output
output.enable_custom_widget_manager()

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from google.colab import userdata
openai_api_key = userdata.get('KEY_OPENAI')

In [13]:
import os
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.chat_models import ChatOpenAI

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

In [8]:
# To help construct our Chat Messages
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

# We will be using a chat model, defaults to gpt-3.5-turbo
from langchain_community.chat_models import ChatOpenAI

# To parse outputs and get structured data back
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

chat_model = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo', openai_api_key=openai_api_key)



In [55]:
instructions = """
You will be given a sentence with person names and a sentence, extract those names and assign an emoji to them based on the emotion extracted from the sentence assigned to each person. Provide a simple explanation of the emoji as well
Return the person name and emojis in a python dictionary
"""

text = """
John says that today he feels a bit under the water, Maria it is surprised and just try to cheer him up and Mathias it is not attending as he is very busy.
"""

In [56]:
# Build the prompt by combining the instructions with the input text
prompt = instructions + text

# Call the LLM
output = chat_model.invoke([HumanMessage(content=prompt)])

print(output.content)
print(type(output.content))

{
    "John": "😞",
    "Maria": "😲",
    "Mathias": "🕰️"
}

Explanation of emojis:
😞 - John is feeling a bit under the water, so this emoji represents sadness or feeling down.
😲 - Maria is surprised, so this emoji represents shock or surprise.
🕰️ - Mathias is very busy and not attending, so this emoji represents being preoccupied or occupied with something else.
<class 'str'>
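
Because the reply is a plain `str` that mixes a dictionary with free-form explanation, it has to be parsed before it can be used as data. The sketch below is one possible way to do that with the standard library; the helper name `extract_first_json_object` is an assumption for this example. It only works when the dictionary the model prints happens to be valid JSON, as in the reply above. A single-quoted Python literal would need a different approach, such as `ast.literal_eval`.

```python
import json


def extract_first_json_object(reply):
    """Return the first JSON object found in a free-form model reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(reply, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass
        start = reply.find("{", start + 1)
    return None


# Sample reply shaped like the output above (plain words instead of emojis).
reply = """
{
    "John": "sad",
    "Maria": "surprised",
    "Mathias": "busy"
}

Explanation of emojis: ...
"""
result = extract_first_json_object(reply)
print(result)
```

This kind of ad hoc parsing is fragile, which is the motivation for the schema-driven output parser used in the next cells.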


In [14]:
# Define a new Pydantic model with field descriptions and tailored for Twitter.
class TwitterUser(BaseModel):
    name: str = Field(description="Full name of the user.")
    handle: str = Field(description="Twitter handle of the user, without the '@'.")
    age: int = Field(description="Age of the user.")
    hobbies: List[str] = Field(description="List of hobbies of the user.")
    email: str = Field(description="Email address of the user.")
    bio: str = Field(description="Bio or short description about the user.")
    location: str = Field(description="Location or region where the user resides.")
    is_blue_badge: bool = Field(
        description="Boolean indicating if the user has a verified blue badge."
    )
    joined: str = Field(description="Date the user joined Twitter.")
    gender: str = Field(description="Gender of the user.")
    appearance: str = Field(description="Physical description of the user.")
    avatar_prompt: str = Field(
        description="Prompt for generating a photorealistic avatar image. The image should capture the essence of the user's appearance description, ideally in a setting that aligns with their interests or bio. Use professional equipment to ensure high quality and fine details."
    )
    banner_prompt: str = Field(
        description="Prompt for generating a banner image. This image should represent the user's hobbies, interests, or the essence of their bio. It should be high-resolution and captivating, suitable for a Twitter profile banner."
    )

In [58]:
# Instantiate the parser with the new model.
parser = PydanticOutputParser(pydantic_object=TwitterUser)

# Update the prompt to match the new query and desired format.
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template(
            "answer the users question as best as possible.\n{format_instructions}\n{question}"
        )
    ],
    input_variables=["question"],
    partial_variables={
        "format_instructions": parser.get_format_instructions(),
    },
)

In [57]:
file_path = "/content/drive/MyDrive/data/elon.pdf"


In [59]:
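# Import the loader from langchain_community, matching the ChatOpenAI import above
from langchain_community.document_loaders import PyPDFLoader
import pprint

# Load the PDF; the result is a list of documents, indexed by page below
loader = PyPDFLoader(file_path)
document = loader.load()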


In [60]:
document[0].metadata

{'source': '/content/drive/MyDrive/data/elon.pdf', 'page': 0}

In [61]:
document[0].page_content

"Elon MuskShort portrait.Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is best known for his ambitiousgoals in revolutionizing transportation and energy. Born in Pretoria, South Africa, Musk latermoved to the United States to pursue higher education. He attended Queen's University inKingston, Ontario, Canada for two years before transferring to the University of Pennsylvania.As a visionary with a normal build, short-cropped hair, and a trimmed beard, Musk often sportstailored suits or smart casual attire, giving him a confident yet approachable demeanor.Throughout his career, Musk has founded and led several successful companies, includingSpaceX, Tesla, Neuralink, and The Boring Company. His interests span across various fields suchas space exploration, electric vehicles, artificial intelligence, sustainable energy, tunnelconstruction, neural interfaces, Mars colonization, and hyperloop transportation. With hisdedication to advancing technology and sustainable solution

In [62]:
document_query = "Create a profile based on this description: " + document[0].page_content

_input = prompt.format_prompt(question=document_query)
output = chat_model.invoke(_input.to_messages())
parsed = parser.parse(output.content)

pprint.pprint(parsed)


TwitterUser(name='Elon Musk', handle='elonmusk', age=51, hobbies=['space exploration', 'electric vehicles', 'artificial intelligence', 'sustainable energy', 'tunnel construction', 'neural interfaces', 'Mars colonization', 'hyperloop transportation'], email='elonmusk@example.com', bio='Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is best known for his ambitious goals in revolutionizing transportation and energy.', location='United States', is_blue_badge=True, joined='2009-06-02', gender='male', appearance='normal build, short-cropped hair, trimmed beard, often in tailored suits or smart casual attire', avatar_prompt="Create a photorealistic avatar image capturing Elon Musk's visionary appearance, ideally in a setting that aligns with his interests in technology and innovation.", banner_prompt="Generate a high-resolution banner image representing Elon Musk's diverse interests in space exploration, electric vehicles, artificial intelligence, sustainable energy, and more.
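
The parse step above succeeds only if the model's reply matches the requested format, and a reply that is cut off or malformed makes parsing fail. A minimal retry sketch under stated assumptions: `parse_with_retry` is a hypothetical helper, `json.loads` stands in for the real parser, and a canned list of replies stands in for the model call. With the LangChain parser, the `except` clause would need to catch whatever exception that parser raises.

```python
import json


def parse_with_retry(call_model, parse, max_attempts=3):
    """Call the model until `parse` accepts the reply or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        reply = call_model()
        try:
            return parse(reply)
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            last_error = err
    raise RuntimeError(
        f"no parseable reply after {max_attempts} attempts"
    ) from last_error


# Stand-in for a model call: the first reply is cut off, the second is valid.
replies = iter(['{"name": "Elon Musk", "age": ', '{"name": "Elon Musk", "age": 51}'])
profile = parse_with_retry(lambda: next(replies), json.loads)
print(profile)
```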

In [63]:
pprint.pprint(parsed.dict())


{'age': 51,
 'appearance': 'normal build, short-cropped hair, trimmed beard, often in '
               'tailored suits or smart casual attire',
 'avatar_prompt': "Create a photorealistic avatar image capturing Elon Musk's "
                  'visionary appearance, ideally in a setting that aligns with '
                  'his interests in technology and innovation.',
 'banner_prompt': 'Generate a high-resolution banner image representing Elon '
                  "Musk's diverse interests in space exploration, electric "
                  'vehicles, artificial intelligence, sustainable energy, and '
                  'more.',
 'bio': 'Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is '
        'best known for his ambitious goals in revolutionizing transportation '
        'and energy.',
 'email': 'elonmusk@example.com',
 'gender': 'male',
 'handle': 'elonmusk',
 'hobbies': ['space exploration',
             'electric vehicles',
             'artificial intelligence',
   

# Second Example


In [64]:
file_path = "/content/drive/MyDrive/data/Uber-Q4-23-Prepared-Remarks.pdf"

In [65]:
loader = PyPDFLoader(file_path)
document = loader.load()

In [66]:
class FinancialData(BaseModel):
    name: str = Field(..., description="Name of the financial figure, such as revenue.")
    value: int = Field(..., description="Nominal earnings in local currency.")
    scale: str = Field(..., description="Scale of figure, such as MM, B, or percent.")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")
    evidence: str = Field(..., description="Verbatim sentence of text where figure was found.")

data = {
    "description": "Financial revenues and other figures.",
    "schema": FinancialData.schema(),
    "instruction": (
        "Extract standard financial figures, specifically earnings and "
        "revenue figures."
    )
}

In [67]:
# Instantiate the parser with the new model.
parser = PydanticOutputParser(pydantic_object=FinancialData)

# Update the prompt to match the new query and desired format.
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template(
            "answer the users question as best as possible.\n{format_instructions}\n{question}"
        )
    ],
    input_variables=["question"],
    partial_variables={
        "format_instructions": parser.get_format_instructions(),
    },
)

In [68]:
document_query = "Extract financial key data values from this report: " + document[0].page_content

_input = prompt.format_prompt(question=document_query)
output = chat_model.invoke(_input.to_messages())
parsed = parser.parse(output.content)

pprint.pprint(parsed)

FinancialData(name='Adjusted EBITDA', value=1300000000, scale='MM', period_start='2023-01-01', period_duration=12, evidence='These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.')


In [69]:
pprint.pprint(parsed.dict())

{'evidence': 'These strong top-line trends, combined with continued rigor on '
             'costs, translated to $1.3 billion in Adjusted EBITDA and $652 '
             'million in GAAP operating income.',
 'name': 'Adjusted EBITDA',
 'period_duration': 12,
 'period_start': '2023-01-01',
 'scale': 'MM',
 'value': 1300000000}
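
The query above sends only `document[0].page_content`, that is, the first page of the report. To cover a longer document, the text is usually split into pieces that fit the model's context window and each piece is queried in turn. The sketch below is a deliberately simple fixed-size character splitter with overlap, not the recursive splitter linked at the top of this notebook; `chunk_text` and its default sizes are assumptions for illustration.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Stand-in for the page texts returned by a PDF loader.
pages = ["First page of the report. ", "Second page with more figures."]
full_text = "".join(pages)
for piece in chunk_text(full_text, chunk_size=30, overlap=5):
    print(repr(piece))
```

Each chunk could then be passed through the same prompt and parser, with the per-chunk results collected in a list. The overlap reduces the chance that a sentence holding a figure is split across two chunks.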
