## **Entity Extraction from IBM's Quarterly Earnings Call Transcript using Granite-8B**

##### This notebook demonstrates two approaches to extracting entities from the transcript. The first approach defines the entities and their descriptions directly in the prompt. The second approach defines the entities in Pydantic classes, converts them into a function definition, and passes that definition to the LLM along with the prompt.

##### The model used in this notebook is IBM's Granite 3.3 8B Instruct (`ibm-granite/granite-3.3-8b-instruct`).
Authors: Anupam Chakraborty, Amogh Ranavade, Madhu Kanukula

### Install dependencies

In [None]:
! pip install git+https://github.com/ibm-granite-community/utils.git \
    langchain-community \
    langchain-docling \
    langchain-core \
    langchain-huggingface \
    langchain \
    pydantic \
    replicate

### Instantiate the model client

In [None]:
from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.3-8b-instruct"
model = Replicate(
    model=model_path,
    replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

---

### 1 - Entity Extraction by defining entities in the prompt

The first approach is straightforward and involves explicitly defining the entities within the prompt itself. In this method, we specify the entities to be extracted along with their descriptions directly in the prompt. This includes:  

<u>**Entity Definitions:**</u> Each entity, such as the company name or the operating pre-tax profit, is clearly outlined with a concise description of what it represents.  

<u>**Prompt Structure:**</u> The prompt is structured to guide the LLM in understanding exactly what information is needed. By providing detailed instructions, we aim to ensure that the model focuses on extracting only the relevant data.  

<u>**Output Format:**</u> The output is required to be in JSON format, which enforces a consistent structure for the extracted data. If any entity is not found, the model is instructed to return `data not available`, preventing ambiguity.  
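
Once a response has been parsed into a dictionary, the `data not available` placeholder can be mapped to `None` so that downstream code does not have to compare strings. The following is a minimal sketch, not part of the original notebook; the helper name `normalize_missing` and the sample dictionary are hypothetical and chosen only for illustration.

```python
from typing import Any

# The placeholder string requested in the prompt for entities that are not found.
SENTINEL = "data not available"


def normalize_missing(value: Any) -> Any:
    """Recursively replace the placeholder string with None."""
    if isinstance(value, dict):
        return {key: normalize_missing(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_missing(item) for item in value]
    # Compare case-insensitively, since the model may capitalize the phrase.
    if isinstance(value, str) and value.strip().lower() == SENTINEL:
        return None
    return value


# Hypothetical parsed output, not real model output.
example = {
    "company_name": "ExampleCorp",
    "total_revenue_security": "Data not available",
}
print(normalize_missing(example))
```

The comparison is case-insensitive because the model is not guaranteed to reproduce the placeholder with the exact capitalization used in the prompt.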

### Reading the transcript PDF

The transcript is taken from https://www.ibm.com/investor/att/pdf/IBM-3Q23-Earnings-Prepared-Remarks.pdf

We load the PDF directly from this link and split it into chunks with Docling.

In [None]:
import typing
from docling.chunking import BaseChunk, HybridChunker
from docling.datamodel.document import DoclingDocument
from langchain_docling.loader import BaseMetaExtractor, DoclingLoader, ExportType

EMBED_MODEL_ID = "ibm-granite/granite-embedding-125m-english"
EXPORT_TYPE = ExportType.DOC_CHUNKS

class MetaExtractor(BaseMetaExtractor):
    """Add a unique doc_id to each document's metadata"""
    def __init__(self) -> None:
        super().__init__()
        self.doc_id = 0
    def extract_chunk_meta(self, file_path: str, chunk: BaseChunk) -> dict[str, typing.Any]:
        self.doc_id = self.doc_id + 1
        return {
            "source": file_path,
            "doc_id": self.doc_id,
        }
    def extract_dl_doc_meta(
        self, file_path: str, dl_doc: DoclingDocument
    ) -> dict[str, typing.Any]:
        self.doc_id = self.doc_id + 1
        return {
            "source": file_path,
            "doc_id": self.doc_id,
        }

def extract_text_from_pages(pdf_path):
    loader = DoclingLoader(
        file_path=pdf_path,
        export_type=EXPORT_TYPE,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
        meta_extractor=MetaExtractor()
    )
    docs = loader.load()
    return docs

In [None]:
url = "https://www.ibm.com/investor/att/pdf/IBM-3Q23-Earnings-Prepared-Remarks.pdf"

transcript_content = extract_text_from_pages(url)
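
Before building the prompt, it can help to check how many chunks were produced and what they contain. The sketch below is an optional sanity check rather than part of the original notebook. The helper `summarize_chunks` is hypothetical, and it assumes only that each loaded document exposes `page_content` and `metadata`, as the code above relies on. Stand-in objects are used so the example runs on its own.

```python
from types import SimpleNamespace


def summarize_chunks(docs, preview_chars: int = 80) -> list[str]:
    """Return one summary line per chunk: doc_id, length, and a short preview."""
    lines = []
    for doc in docs:
        text = doc.page_content
        # Collapse whitespace so the preview stays on a single line.
        preview = " ".join(text.split())[:preview_chars]
        lines.append(
            f"doc_id={doc.metadata.get('doc_id')} chars={len(text)} preview={preview!r}"
        )
    return lines


# Stand-in chunks with placeholder text; in the notebook, pass transcript_content instead.
demo_docs = [
    SimpleNamespace(page_content="Placeholder chunk one.", metadata={"doc_id": 1}),
    SimpleNamespace(page_content="Placeholder   chunk\ntwo.", metadata={"doc_id": 2}),
]
for line in summarize_chunks(demo_docs):
    print(line)
```

In the notebook, `summarize_chunks(transcript_content)` would list the real chunks, which also gives a rough sense of how much text is passed to the model.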

All the entities to be extracted are defined in the prompt itself, each with its description.

In [None]:
transcript_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
- You are an AI Entity Extractor. You help extract entities from the given transcript documents.
- Analyze the transcript documents and extract the following entities:

1) `company_name` : This is the name of the company for which the transcript is given.
2) `pre_tax_profit_percentage`: This is the operating pre-tax profit in percentage.
3) `pre_tax_profit_number`: This is the operating pre-tax profit in numbers.
4) `total_revenue_transaction_processing`: This is the total revenue growth for Transaction Processing sector in percentage.
5) `total_revenue_data_ai`: This is the total revenue growth for Data and AI sector in percentage.
6) `total_revenue_security`: This is the total revenue growth/decline for security sector in percentage.
7) `total_revenue_automation`: This is total revenue growth/decline for automation sector in percentage.

- Your output should strictly be in a json format.
- If any entity is not found, your output should be `data not available`. Do not make up your own entities if it is not present
- Only strictly do what is asked to you. Do not give any explanations to your output.
""",
    }],
    documents=[{
        "doc_id": doc.metadata["doc_id"],
        "text": doc.page_content,
    } for doc in transcript_content],
    add_generation_prompt=True,
    tokenize=False,
)


Invoke the model to get the results.

In [None]:
response = model.invoke(transcript_prompt)
print(response)

In [None]:
import json

entities_transcript = json.loads(response)
entities_transcript
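
`json.loads` raises an error if the model wraps its answer in a Markdown code fence or adds text around the JSON, which LLMs sometimes do despite the instructions. The following is a minimal defensive sketch, not part of the original notebook; the helper name `parse_json_response` and the sample string are hypothetical.

```python
import json

# Three backticks, built programmatically to keep this example easy to embed.
FENCE = "`" * 3


def parse_json_response(text: str) -> dict:
    """Parse the JSON object contained in an LLM response.

    Tolerates Markdown code fences and extra text before or after the object.
    """
    cleaned = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    # Keep only the span from the first "{" to the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the response")
    return json.loads(cleaned[start : end + 1])


# Hypothetical fenced response, not real model output.
raw = FENCE + 'json\n{"company_name": "ExampleCorp"}\n' + FENCE
print(parse_json_response(raw))
```

In the notebook, `parse_json_response(response)` could stand in for `json.loads(response)` when the direct call fails. The sketch assumes the response contains a single top-level JSON object.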

---

### 2 - Pydantic Class-Based Entity Definition

The second approach takes advantage of object-oriented programming principles by defining entities within a class structure. This method involves several key steps:  

<u>**Class Definition:**</u> We create a class that encapsulates all the relevant entities as members. Each member corresponds to an entity, such as the company name or the pre-tax profit, and can include type annotations for better validation and clarity.  

<u>**Pydantic Integration:**</u> Using Pydantic, a data validation library, we define this class as a Pydantic model by subclassing `BaseModel`. This model not only defines the structure of our data but also provides built-in validation features, so that extracted data can be checked against the specified formats and types.  

<u>**Dynamic Prompting:**</u> The Pydantic model can then be integrated with the prompt sent to the LLM. This allows for a more dynamic interaction where the model can adapt based on the defined structure of entities. If new entities are added or existing ones modified, changes can be made at the class level without needing to rewrite the entire prompt.  

<u>**Enhanced Validation:**</u> By validating the parsed LLM output against the Pydantic model, we can check that the extracted data meets our predefined criteria, enhancing data integrity and reliability.  

This class-based approach offers greater flexibility and scalability compared to the first method. It allows for easier modifications and expansions as new requirements arise, making it particularly suitable for larger projects or those requiring frequent updates.

Importing Pydantic's `BaseModel` and `Field`, along with the LangChain helper that converts a Pydantic class into a function definition.

In [None]:
from pydantic import BaseModel, Field
from langchain_core.utils.function_calling import convert_to_openai_function

Defining all the entities in classes along with their descriptions.

In [None]:
class PreTaxProfit(BaseModel):
    "This contains information of the company's pre-tax profit in percentage as well as in numbers."
    pre_tax_profit_percentage: str = Field(description="Operating pre-tax profit in percentage.")
    pre_tax_profit_numbers: str = Field(description="Operating pre-tax profit in numbers.")


class RevenueGrowth(BaseModel):
    "This contains information of the company's revenue growth."
    total_revenue_change_transaction_processing: str = Field(description="Total revenue change for Transaction Processing sector in percentage.")
    total_revenue_change_data_ai: str = Field(description="Total revenue change for Data and AI sector in percentage.")
    total_revenue_change_security: str = Field(description="Total revenue change for security sector in percentage.")
    total_revenue_change_automation: str = Field(description="Total revenue change for automation sector in percentage.")

Wrapping all the classes in one parent class, which is then converted into a function definition.

In [None]:
class EarningCallReport(BaseModel):
    "This contains information about the company."
    company_name: str = Field(description="The public company name.")
    pre_tax_profit: PreTaxProfit = Field(description="Operating pre-tax profit.")
    revenue: RevenueGrowth = Field(description="All revenue growth details for all sectors.")

In [None]:
transcript_function = convert_to_openai_function(EarningCallReport)
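
Interpolating a dictionary into an f-string inserts its Python `repr`, which uses single quotes and is not valid JSON. A model may cope with either form, but serializing with `json.dumps` keeps the definition in the same format as the requested output. The sketch below illustrates the difference on a hypothetical, heavily trimmed stand-in for the dictionary; the exact keys of the real `transcript_function` are an assumption here and are best checked by printing it.

```python
import json

# Hypothetical, trimmed stand-in for the converted function definition.
schema = {
    "name": "EarningCallReport",
    "parameters": {
        "type": "object",
        "properties": {"company_name": {"type": "string"}},
        "required": ["company_name"],
    },
}

as_repr = f"{schema}"                   # Python repr: single-quoted keys and strings
as_json = json.dumps(schema, indent=2)  # valid JSON: double-quoted keys and strings

print(as_repr)
print(as_json)
```

In the notebook, `json.dumps(transcript_function, indent=2)` would produce the JSON form, which is also easier to read when inspecting the prompt.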

The prompt is similar to the previous one, but the function definition generated from the Pydantic classes is passed in place of the per-entity descriptions.

In [None]:
entity_prompt_with_pydantic = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": f"""\
- You are an AI Entity Extractor. You help extract entities from the given transcript documents.
- Analyze the transcript documents and extract the entities as per the following Pydantic definition:

{transcript_function}

- Your output should strictly be in a json format.
- If any entity is not found, your output should be `data not available`. Do not make up your own entities if it is not present
- Only strictly do what is asked to you. Do not give any explanations to your output.
""",
    }],
    documents=[{
        "doc_id": doc.metadata["doc_id"],
        "text": doc.page_content,
    } for doc in transcript_content],
    add_generation_prompt=True,
    tokenize=False,
)


Invoke the model to get the results.

In [None]:
response = model.invoke(entity_prompt_with_pydantic)
print(response)

In [None]:
entities_transcript_pydantic = json.loads(response)
entities_transcript_pydantic
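
Parsing the response with `json.loads` does not check it against the schema. The sketch below shows how a parsed dictionary could be validated with Pydantic, assuming Pydantic v2 (`model_validate`). To keep the example self-contained it uses compact, hypothetical stand-ins (`MiniPreTaxProfit`, `MiniReport`) for the notebook's models, and the sample payload is invented rather than real model output.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


# Compact, hypothetical stand-ins for the notebook's models (descriptions omitted).
class MiniPreTaxProfit(BaseModel):
    pre_tax_profit_percentage: str
    pre_tax_profit_numbers: str


class MiniReport(BaseModel):
    company_name: str
    pre_tax_profit: MiniPreTaxProfit


def validate_report(data: dict) -> Optional[MiniReport]:
    """Return a validated report, or None if the data does not match the schema."""
    try:
        return MiniReport.model_validate(data)
    except ValidationError as err:
        # Each entry in err.errors() describes one missing or mistyped field.
        print(f"Validation failed with {len(err.errors())} error(s)")
        return None


# Invented payload with placeholder values, not real model output.
sample = {
    "company_name": "ExampleCorp",
    "pre_tax_profit": {
        "pre_tax_profit_percentage": "data not available",
        "pre_tax_profit_numbers": "data not available",
    },
}
report = validate_report(sample)
print(report)
```

In the notebook, the equivalent call would be `EarningCallReport.model_validate(entities_transcript_pydantic)`. Whether it succeeds depends on the model returning the fields in exactly the nested shape the classes define, which is not guaranteed.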

---