## **Entity Extraction from IBM's Quarterly Earnings Call Transcript using Granite-8B**

##### This notebook demonstrates two approaches to extracting entities from the transcript. The first approach defines the entities and their descriptions directly in the prompt. The second approach defines the entities in Pydantic classes, converts them into a function definition, and passes that definition to the LLM along with the prompt.

##### The model used in this notebook is IBM's Granite 3.0 8B Instruct (`granite-3.0-8b-instruct`), accessed through Replicate.
Authors: Anupam Chakraborty, Amogh Ranavade

----

### Install dependencies

In [None]:
!pip install langchain-community
!pip install git+https://github.com/ibm-granite-community/granite-kitchen
!pip install pydantic
!pip install pymupdf  # provides the `fitz` module used below to read the PDF

### Instantiate the model client

In [2]:
import json
import requests
import fitz
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model = Replicate(
    model="ibm-granite/granite-3.0-8b-instruct:8d8fb55950fb8eb2817fc078b7b05a0bd3ecc612d6332d8009fb0c007839192e",
    replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
)

---

### 1 - Entity Extraction by defining entities in the prompt

The first approach is straightforward and involves explicitly defining the entities within the prompt itself. In this method, we specify the entities to be extracted along with their descriptions directly in the prompt. This includes:  

<u>**Entity Definitions:**</u> Each entity, such as the company name or the operating pre-tax profit, is clearly outlined with a concise description of what it represents.  

<u>**Prompt Structure:**</u> The prompt is structured to guide the LLM in understanding exactly what information is needed. By providing detailed instructions, we aim to ensure that the model focuses on extracting only the relevant data.  

<u>**Output Format:**</u> The output is required to be in JSON format, which enforces a consistent structure for the extracted data. If any entity is not found, the model is instructed to return "Data not available," preventing ambiguity.  
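
As a rough illustration of the three points above, the sketch below assembles such a prompt from a dictionary of entity names and descriptions. The helper `build_entity_prompt` and the `ENTITIES` dictionary are hypothetical names introduced here for illustration, and the transcript string is a placeholder rather than real data. The role tokens follow the ones used in this notebook's prompts.

```python
# Hypothetical entity catalogue: name -> description shown to the model.
ENTITIES = {
    "company_name": "This is the name of the company for which the transcript is given.",
    "pre_tax_profit_percentage": "This is the operating pre-tax profit in percentage.",
}


def build_entity_prompt(transcript: str, entities: dict) -> str:
    """Assemble a chat prompt that lists each entity with its description."""
    # Number the entities 1), 2), ... in the order they appear in the dict.
    entity_lines = "\n".join(
        f"{index}) `{name}`: {description}"
        for index, (name, description) in enumerate(entities.items(), start=1)
    )
    return (
        "<|start_of_role|>user<|end_of_role|>\n"
        f"-You are an AI Entity Extractor. Extract entities from the given transcript: {transcript}\n"
        "-Analyze this transcript and extract the following entities:\n\n"
        f"{entity_lines}\n\n"
        "-Your output should strictly be in a json format.\n"
        "-If any entity is not found, your output should be `data not available`.\n"
        "<|end_of_text|>\n"
        "<|start_of_role|>assistant<|end_of_role|>"
    )


prompt = build_entity_prompt("<transcript text goes here>", ENTITIES)
print(prompt)
```

Keeping the descriptions in one dictionary means an entity can be added or reworded without touching the surrounding instructions.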

### Reading the transcript PDF

The transcript is taken from https://www.ibm.com/investor/att/pdf/IBM-3Q23-Earnings-Prepared-Remarks.pdf

We download the PDF from this link and save it locally.

In [4]:
url = "https://www.ibm.com/investor/att/pdf/IBM-3Q23-Earnings-Prepared-Remarks.pdf"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
pdf_data = response.content

with open("IBM-3Q23-Earnings-Prepared-Remarks.pdf", "wb") as file:
    file.write(pdf_data)

In [None]:
def extract_text_from_pages(pdf_path, start_page, end_page):
    """Return the text of pages start_page..end_page (1-based, inclusive)."""
    text = ""
    with fitz.open(pdf_path) as doc:
        for page_num in range(start_page - 1, end_page):
            page = doc.load_page(page_num)
            text += page.get_text("text")
    return text

We extract the text from pages 8 through 14 of the PDF only, because of Granite's input token limit.

In [30]:
transcript_content = extract_text_from_pages("IBM-3Q23-Earnings-Prepared-Remarks.pdf", 8, 14)
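
Before sending the extracted text to the model, it can help to check roughly how large it is. The sketch below is a heuristic only: the ratio of about four characters per token is an assumption, not the output of Granite's tokenizer, and the names `estimate_tokens` and `fits_context` as well as the 4096-token limit and 512-token output reserve are example values introduced here.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, max_input_tokens: int, reserved_for_output: int = 512) -> bool:
    """Check whether the estimated size leaves room for the model's answer."""
    return estimate_tokens(text) <= max_input_tokens - reserved_for_output


sample_text = "word " * 1000  # 5000 characters standing in for transcript_content
estimated = estimate_tokens(sample_text)
within_limit = fits_context(sample_text, max_input_tokens=4096)
print(estimated, within_limit)
```

For an exact count, the model's own tokenizer would be needed; the estimate is only meant to flag a page range that is clearly too long.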

All the entities that need to be extracted are defined in the prompt itself, along with each entity's description.

In [85]:
transcript_prompt = f"""
<|start_of_role|>user<|end_of_role|>
-You are AI Entity Extractor. You help extracting entities from the given transcript: {transcript_content}
-Analyze this transcript and extract the following entities:

1) `company_name` : This is the name of the company for which the transcript is given.
2) `pre_tax_profit_percentage`: This is the operating pre-tax profit in percentage.
3) `pre_tax_profit_number`: This is the operating pre-tax profit in numbers.
4) `total_revenue_transaction_processing`: This is the total revenue growth for Transaction Processing sector in percentage.
5) `total_revenue_data_ai`: This is the total revenue growth for Data and AI sector in percentage.
6) `total_revenue_security`: This is the total revenue growth/decline for security sector in percentage.
7) `total_revenue_automation`: This is total revenue growth/decline for automation sector in percentage.

-Your output should strictly be in a json format.
-If any entity is not found, your output should be `data not available`. Do not make up your own entities if they are not present.
-Only strictly do what is asked to you. Do not give any explanations to your output.
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>   
"""

Invoking the model to get the results

In [87]:
response = model.invoke(transcript_prompt)
print(response)

{
  "company_name": "IBM",
  "pre_tax_profit_percentage": "17%",
  "pre_tax_profit_number": "$2.3 billion",
  "total_revenue_transaction_processing": "5%",
  "total_revenue_data_ai": "6%",
  "total_revenue_security": "-3%",
  "total_revenue_automation": "13%"
}


In [88]:
entities_transcript = json.loads(response)
entities_transcript

{'company_name': 'IBM',
 'pre_tax_profit_percentage': '17%',
 'pre_tax_profit_number': '$2.3 billion',
 'total_revenue_transaction_processing': '5%',
 'total_revenue_data_ai': '6%',
 'total_revenue_security': '-3%',
 'total_revenue_automation': '13%'}
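
`json.loads` works here because the response is a bare JSON object. A model may also wrap the JSON in extra text, or return the "not found" sentinel for some entities. The sketch below shows one defensive way to handle both cases. The helpers `parse_model_json` and `missing_entities` are hypothetical, and `raw_response` is a made-up string for illustration, not an actual model output.

```python
import json


def parse_model_json(raw: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in the response."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return json.loads(raw[start : end + 1])


def missing_entities(entities: dict, sentinel: str = "data not available") -> list:
    """Return the keys whose value is the 'not found' sentinel (case-insensitive)."""
    return [
        key
        for key, value in entities.items()
        if isinstance(value, str) and value.strip().lower() == sentinel
    ]


# Made-up response with leading text and one missing entity.
raw_response = 'Here is the result:\n{"company_name": "IBM", "total_revenue_security": "Data not available"}'
parsed = parse_model_json(raw_response)
missing = missing_entities(parsed)
print(parsed)
print(missing)
```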

---

### 2 - Pydantic Class-Based Entity Definition

The second approach takes advantage of object-oriented programming principles by defining entities within a class structure. This method involves several key steps:  

<u>**Class Definition:**</u> We create classes that encapsulate all the relevant entities as fields. Each field corresponds to an entity, such as the company name or the operating pre-tax profit, and carries a type annotation for better validation and clarity.  

<u>**Pydantic Integration:**</u> Each class subclasses `BaseModel` from Pydantic, a data validation library, which makes it a Pydantic model. The model defines the structure of our data and provides built-in validation, so extracted data can be checked against the specified formats and types.  

<u>**Dynamic Prompting:**</u> The Pydantic model is converted into a function definition that is embedded in the prompt sent to the LLM. If new entities are added or existing ones modified, the change is made at the class level without needing to rewrite the entire prompt.  

<u>**Enhanced Validation:**</u> Parsing the LLM's output with the Pydantic model lets us check that the extracted data matches the declared fields and types, which improves data integrity and reliability.  

This class-based approach offers greater flexibility and scalability compared to the first method. It allows for easier modifications and expansions as new requirements arise, making it particularly suitable for larger projects or those requiring frequent updates.

Importing Pydantic and the LangChain helper that converts a Pydantic model into a function definition.

In [39]:
from pydantic import BaseModel, Field
from typing import List
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

Defining all the entities in classes, along with their descriptions.

In [49]:
class PreTaxProfit(BaseModel):
    "This contains information of the company's pre-tax profit in percentage as well as in numbers."
    pre_tax_profit_percentage: str = Field(description="Operating pre-tax profit in percentage.")
    pre_tax_profit_numbers: str = Field(description="Operating pre-tax profit in numbers.")


class RevenueGrowth(BaseModel):
    "This contains information of the company's revenue growth."
    total_revenue_change_transaction_processing: float = Field(description="Total revenue change for Transaction Processing sector in percentage.")
    total_revenue_change_data_ai: float = Field(description="Total revenue change for Data and AI sector in percentage.")
    total_revenue_change_security: float = Field(description="Total revenue change for security sector in percentage.")
    total_revenue_change_automation: float = Field(description="Total revenue change for automation sector in percentage.")

Wrapping all the classes in one parent class, which is then converted into a function definition.

In [62]:
class EarningCallReport(BaseModel):
    "This contains information about the company."
    company_name: str = Field(description="The public company name.")
    pre_tax_profit: PreTaxProfit = Field(description="Operating pre-tax profit.")
    revenue: RevenueGrowth = Field(description="All revenue growth details for all sectors.")

In [63]:
transcript_function = convert_pydantic_to_openai_function(EarningCallReport)
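
When a dictionary such as `transcript_function` is interpolated into an f-string, Python inserts its `repr`, which uses single-quoted keys and is not valid JSON. One option, sketched below, is to serialize the dictionary with `json.dumps` before placing it in the prompt. The `function_definition` dictionary here is a small hand-written stand-in and an assumption about the general shape; the real dictionary comes from `convert_pydantic_to_openai_function` and is larger. Whether this changes the quality of the model's output is not measured in this notebook.

```python
import json

# Illustrative stand-in for `transcript_function` (not the real helper output).
function_definition = {
    "name": "EarningCallReport",
    "description": "This contains information about the company.",
    "parameters": {
        "type": "object",
        "properties": {
            "company_name": {"type": "string", "description": "The public company name."},
        },
        "required": ["company_name"],
    },
}

python_repr = f"{function_definition}"  # what an f-string inserts by default
json_text = json.dumps(function_definition, indent=2)  # double-quoted, valid JSON

print(python_repr[:30])
print(json_text)
```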


The prompt is the same as before, except that the function definition generated from the Pydantic model is passed in instead of listing each entity in the prompt.

In [70]:
entity_prompt_with_pydantic = f"""
<|start_of_role|>user<|end_of_role|>
-You are AI Entity Extractor. You help extracting entities from the given transcript: {transcript_content}
-Analyze this transcript and extract the following entities as per the following function definition: {transcript_function}
-Your output should strictly be in a json format.
-Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
-Only do what is asked to you. Do not start with or give any explanations to your output and do not hallucinate.
<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>   
"""


Invoking the model to get the results

In [73]:
response = model.invoke(entity_prompt_with_pydantic)
print(response)

{
  "company_name": "IBM",
  "pre_tax_profit": {
    "pre_tax_profit_percentage": "17%",
    "pre_tax_profit_numbers": "$2.3 billion"
  },
  "revenue": {
    "total_revenue_change_transaction_processing": "5%",
    "total_revenue_change_data_ai": "6%",
    "total_revenue_change_security": "-3%",
    "total_revenue_change_automation": "13%"
  }
}


In [74]:
entities_transcript_pydantic = json.loads(response)
entities_transcript_pydantic

{'company_name': 'IBM',
 'pre_tax_profit': {'pre_tax_profit_percentage': '17%',
  'pre_tax_profit_numbers': '$2.3 billion'},
 'revenue': {'total_revenue_change_transaction_processing': '5%',
  'total_revenue_change_data_ai': '6%',
  'total_revenue_change_security': '-3%',
  'total_revenue_change_automation': '13%'}}
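
The parsed dictionary above holds the revenue changes as strings such as `'13%'`, while `RevenueGrowth` declares those fields as `float`. `json.loads` does not check this. The sketch below shows one way the output could be validated with Pydantic. It assumes Pydantic v2 (`field_validator`, `model_validate`), and `RevenueGrowthChecked` is a hypothetical, shortened variant of `RevenueGrowth` introduced here so the notebook's own class is left unchanged.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class RevenueGrowthChecked(BaseModel):
    """Hypothetical variant of RevenueGrowth that accepts values like '13%'."""

    total_revenue_change_security: float = Field(
        description="Total revenue change for security sector in percentage."
    )
    total_revenue_change_automation: float = Field(
        description="Total revenue change for automation sector in percentage."
    )

    @field_validator("*", mode="before")
    @classmethod
    def strip_percent_sign(cls, value):
        # Turn '-3%' into '-3' before Pydantic coerces the value to float.
        if isinstance(value, str):
            return value.strip().rstrip("%")
        return value


# Values copied from the model response shown above.
llm_output = {
    "total_revenue_change_security": "-3%",
    "total_revenue_change_automation": "13%",
}
checked = RevenueGrowthChecked.model_validate(llm_output)
print(checked.total_revenue_change_security, checked.total_revenue_change_automation)

# A sentinel string or a missing field is rejected instead of passing silently.
try:
    RevenueGrowthChecked.model_validate(
        {"total_revenue_change_security": "Data not available"}
    )
    validation_failed = False
except ValidationError:
    validation_failed = True
print(validation_failed)
```

With a check like this, the `Data not available` sentinel surfaces as a validation error that the caller can handle explicitly.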

---