# Example: extracting structured data from earnings call transcripts

Most public companies host earnings calls, which give management an opportunity to discuss past financial results and future plans. Natural language transcripts of these calls can contain useful information, but that information usually has to be extracted and arranged into a structured form before it can be analyzed or compared across time periods and companies.

Here we demonstrate the use of an LLM-powered extraction service to extract information from Tesla's Q4 2023 earnings call. We show how important few-shot examples are for accurate extraction in a real-world context.

This transcript is available [online](https://www.fool.com/earnings/call-transcripts/2024/01/24/tesla-tsla-q4-2023-earnings-call-transcript/), and we have copied it into a plain text file for convenience.

We first load and inspect the transcript:

In [1]:
with open("tsla_q4_2023_transcript.txt", "r") as fp:
    doc_content = fp.read()

In [2]:
# head
print(doc_content[:1500])

source: https://www.fool.com/earnings/call-transcripts/2024/01/24/tesla-tsla-q4-2023-earnings-call-transcript/

Martin Viecha

Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.

During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and expectations as of today. Actual events or results could differ materially due to a number of risks and uncertainties, including those mentioned in our most recent filings with the SEC. [Operator instructions] But before we jump into Q&A, Elon has some opening remarks.

Elon.

Elon Musk -- Chief Executive Officer and Product Architect

Thank you. So, the Tesla team did an incredible jo

In [3]:
# tail
print(doc_content[-500:])

ons because we had enough, you know, NOLs, etc., wherein we didn't have to accrue book taxes.

Now that the valuation allowance has been released and we have recognized deferred tax assets on the books, that means your tax rate immediately goes up.

Martin Viecha

OK. I think that's all the time we have for today. Thank you so much for all of your questions, and we'll speak to you again in three months. Thank you.

Bye-bye.

Elon Musk -- Chief Executive Officer and Product Architect

Thank you.



To simplify and accelerate our analysis, we split the transcript into smaller chunks:

In [4]:
from langchain.text_splitter import CharacterTextSplitter


text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
)
docs = text_splitter.create_documents([doc_content])
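
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, standard-library sketch of separator-based chunking. It is not LangChain's implementation: `split_with_overlap` is a hypothetical helper, and its exact boundary behavior may differ from `CharacterTextSplitter`.

```python
def split_with_overlap(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    `chunk_size` characters, carrying trailing pieces (up to
    `chunk_overlap` characters) over into the next chunk.

    A single piece longer than `chunk_size` is kept whole, so a chunk
    can exceed the limit in that case.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only the trailing pieces that fit in the overlap budget.
            while current and joined_len(current) > chunk_overlap:
                current.pop(0)
            # Drop more if the carried-over pieces would overflow the next chunk.
            while current and joined_len(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Four 40-character paragraphs, packed two per chunk with one paragraph shared.
paragraphs = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
chunks = split_with_overlap("\n\n".join(paragraphs), chunk_size=100, chunk_overlap=50)
print(len(chunks))  # 3
```

Because consecutive chunks share text, the same sentence can reach the extractor more than once, which is worth keeping in mind when results from several chunks are combined.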

Let's examine one of the earlier sections of the transcript, which includes production statistics across vehicle categories for Tesla:

In [5]:
doc = docs[1]
doc

Document(page_content="Elon.\n\nElon Musk -- Chief Executive Officer and Product Architect\n\nThank you. So, the Tesla team did an incredible job in 2023. We achieved a record production and deliveries of over 1.8 million vehicles, in line with our official guidance. And in Q4, we're producing vehicles at an annualized run rate of almost 2 million cars a year.\n\nThis was really a phenomenal achievement. Looking at just the Fremont Factory alone, we made 560,000 cars. This is a record. In fact, it's the highest-output automotive plant in North America.\n\nAnd people are often surprised that the -- the highest-output factory -- car factory in North America is in the San Francisco Bay area. It's a little counterintuitive, perhaps. And the -- it's really had an incredibly positive impact on that entire area. What would have been a rundown strip mall is the highest-productivity car plant in the -- in the Americas.")

We're ready to extract. Because the service is deployed with [LangServe](https://python.langchain.com/docs/langserve), we can access it via a `RemoteRunnable`:

In [6]:
from langserve import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/extract_text/")

This paragraph of the transcript includes some aggregate statistics on Tesla's vehicle production. To be useful, a record of such a statistic needs to associate a time period with the production count. We can define a `VehicleProduction` record as a Pydantic `BaseModel` and attach a simple description to each field:

In [7]:
from typing import List, Optional
from pydantic import BaseModel, Field

class VehicleProduction(BaseModel):
    count: int = Field(..., description="The count of vehicles produced")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")
    annualized: bool = Field(..., description="Whether or not figure is an annualized estimate.")

class Root(BaseModel):
    records: List[VehicleProduction] = Field(None, description="List of VehicleProduction records")

Creating the `Root` class allows us to capture multiple records, when present. Let's try it:

In [8]:
response = runnable.invoke({"text": doc.page_content, "schema": Root.schema()})

In [9]:
response

{'extracted': {'records': [{'count': 1800000,
    'period_start': '2023-01-01',
    'period_duration': 12,
    'annualized': True},
   {'count': 560000,
    'period_start': '2023-01-01',
    'period_duration': 12,
    'annualized': False}]}}

Note that we pass both the natural-language text and the desired output schema to the runnable. In this case the service captured the 1.8 million full-year figure (along with the 560,000 vehicles made at the Fremont factory), and for many cases this might be sufficient. But the transcript also notes an annualized run rate of almost 2 million vehicles per year in Q4. Suppose we want to extract this figure as well. How can we convey our intent to the extraction service?

One option is to provide few-shot examples. Here we define an example of a similar record and pass it to the runnable in a second request:

In [10]:
examples = [
    {
        "text": (
            "2022 is off to a good start. In Q1, we produced "
            "vehicles at an annualized rate of 1 million."
        ),
        "output": Root(
            records=[
                VehicleProduction(
                    count=1_000_000,
                    period_start="2022-01-01",
                    period_duration=3,
                    annualized=True,
                )
            ]
        ).dict()
    }
]

response = runnable.invoke(
    {
        "text": doc.page_content,
        "schema": Root.schema(),
        "examples": examples,
    },
)

In [11]:
response

{'extracted': {'records': [{'count': 1800000,
    'period_start': '2023-01-01',
    'period_duration': 12,
    'annualized': True},
   {'count': 2000000,
    'period_start': '2023-10-01',
    'period_duration': 3,
    'annualized': True}]}}

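With this added context, the service picked up on our intent and extracted the run-rate record. Note, however, that the 560,000 Fremont record from the first response no longer appears, so examples can shift what the service returns in more than one way.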
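
The `period_start`, `period_duration`, and `annualized` fields are what make these records comparable downstream. As a minimal sketch, assuming periods always start on the first day of a month, the hypothetical helpers below (not part of the extraction service) compute where a period ends and scale an annualized figure back to the period it was observed in:

```python
from datetime import date


def period_end(period_start: str, period_duration: int) -> date:
    """Return the first day after a period of `period_duration` months.

    Assumes the period starts on the first day of a month, as in the
    responses shown here.
    """
    start = date.fromisoformat(period_start)
    months = start.month - 1 + period_duration
    return date(start.year + months // 12, months % 12 + 1, 1)


def implied_period_count(record: dict) -> float:
    """Scale an annualized figure down to the period it was observed in."""
    if record["annualized"]:
        return record["count"] * record["period_duration"] / 12
    return float(record["count"])


# The Q4 run-rate record from the response above.
q4 = {
    "count": 2_000_000,
    "period_start": "2023-10-01",
    "period_duration": 3,
    "annualized": True,
}
print(period_end(q4["period_start"], q4["period_duration"]))  # 2024-01-01
print(implied_period_count(q4))  # 500000.0
```

Read this way, an annualized run rate of almost 2 million corresponds to roughly 500,000 vehicles over the quarter itself.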

Let's consider another example. Earnings calls often include aggregate financial results for a reporting period. Below, Tesla's CFO discusses its revenue, free cash flow, and other figures:

In [12]:
doc = docs[9]
doc

Document(page_content='Martin Viecha\n\nThank you. And our CFO, Vaibhav, has some opening remarks as well.\n\nVaibhav Taneja -- Chief Financial Officer\n\nThanks, Martin. Good afternoon, everyone. As Elon mentioned, we had a record year in terms of both production and deliveries for auto business, as well as record deployments in our energy business. This was achieved despite 2023 being a challenging year in terms of higher interest rates and higher inflation.\n\nBig thanks to our customer for being with us through this challenging period. I would also like to thank the whole Tesla team for the resolve and dedication throughout. In terms of 2023 financials, we ended the year with over 96 billion of revenue and generated 4.4 billion of free cash flow to end the year with over 29 billion of cash and investments on hand. Our 2023 GAAP net income was impacted by the recognition of one-time, noncash benefit of 5.9 billion from the release of valuation allowance on certain deferred tax asset

Suppose we are looking specifically for earnings. What happens if we run such an extractor on this text?

In [13]:
class Earnings(BaseModel):
    value: int = Field(..., description="Nominal earnings in local currency.")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")

class Root(BaseModel):
    records: List[Earnings] = Field(None, description="List of Earnings records")

In [14]:
response = runnable.invoke({"text": doc.page_content, "schema": Root.schema()})

In [15]:
response

{'extracted': {'records': [{'value': 5900000000,
    'period_start': '2023-01-01',
    'period_duration': 12}]}}

We see that the service extracted a record even though the passage contains no earnings figure: the 5.9 billion is a one-time, noncash tax benefit that affected GAAP net income. Without the specific context of our use case, the model is overly enthusiastic about associating nearby financial figures with earnings. We can improve its behavior by offering it examples and additional instructions:

In [16]:
examples = [
    {
        "text": "Our revenue is $100 in 2022.",
        "output": Root(records=[]).dict()
    },
    {
        "text": "We generated one billion in free cash flow in 2022.",
        "output": Root(records=[]).dict()
    },
    {
        "text": "Our 2022 income was $32.",
        "output": Root(records=[]).dict()
    },
    {
        "text": "Our earnings in 2022 were $100 million.",
        "output": Root(
            records=[
                Earnings(value=100_000_000, period_start="2022-01-01", period_duration=12)
            ]
        ).dict()
    },
    {
        "text": "We had $6 of profit in 2022.",
        "output": Root(
            records=[
                Earnings(value=6, period_start="2022-01-01", period_duration=12)
            ]
        ).dict()
    },
]

response = runnable.invoke(
    {
        "text": doc.page_content,
        "schema": Root.schema(),
        "instructions": (
            "Note that earnings has a very specific meaning "
            "in a financial context as profit. Revenue, income, "
            "cash flow, are not earnings. Adjustments to earnings "
            "are also not earnings."
        ),
        "examples": examples,
    },
)

In [17]:
response

{'extracted': {'records': []}}

In this case we correctly extracted no records.

If we are interested in financial data more broadly, we can let the extractor name each figure it finds (such as revenue or free cash flow):

In [18]:
class FinancialData(BaseModel):
    name: str = Field(..., description="Name of the financial figure, such as revenue.")
    value: int = Field(..., description="Nominal value of the figure in local currency.")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")

class Root(BaseModel):
    records: List[FinancialData] = Field(None, description="List of FinancialData records")

In [19]:
response = runnable.invoke({"text": doc.page_content, "schema": Root.schema()})

In [20]:
response

{'extracted': {'records': [{'name': 'revenue',
    'value': 96000000000,
    'period_start': '2023-01-01',
    'period_duration': 12},
   {'name': 'free cash flow',
    'value': 4400000000,
    'period_start': '2023-01-01',
    'period_duration': 12},
   {'name': 'cash and investments',
    'value': 29000000000,
    'period_start': '2023-01-01',
    'period_duration': 12},
   {'name': 'GAAP net income',
    'value': 5900000000,
    'period_start': '2023-01-01',
    'period_duration': 12}]}}
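
So far we have called the service on one chunk at a time. When the same schema is applied to every chunk, overlapping chunks can yield the same record twice. The sketch below shows one way to merge results while dropping exact duplicates. `extract_all` is a hypothetical helper, and the `extract` callable stands in for a call such as `runnable.invoke({"text": text, "schema": Root.schema()})["extracted"]`; a stub is used here so the example runs without the service.

```python
def extract_all(chunks, extract):
    """Run `extract` on every chunk and merge the records,
    dropping exact duplicates while preserving first-seen order."""
    seen, merged = set(), []
    for chunk in chunks:
        for record in extract(chunk).get("records") or []:
            # Records are flat dicts of hashable values, so their sorted
            # items make a usable identity key.
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Stub extractor: two overlapping chunks both report the revenue figure.
revenue = {"name": "revenue", "value": 96_000_000_000,
           "period_start": "2023-01-01", "period_duration": 12}
fcf = {"name": "free cash flow", "value": 4_400_000_000,
       "period_start": "2023-01-01", "period_duration": 12}
canned = {"chunk-1": {"records": [revenue]},
          "chunk-2": {"records": [revenue, fcf]}}

merged = extract_all(["chunk-1", "chunk-2"], lambda text: canned[text])
print(len(merged))  # 2
```

Exact matching only catches identical records. Records that differ in naming or rounding would need a looser comparison.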

Different entities may refer to the same financial figure using different vocabulary, so in this case it may be helpful to advise the service to standardize names, where possible:

In [21]:
examples = [
    {
        "text": "Our revenue is $100 in 2022.",
        "output": Root(
            records=[
                FinancialData(
                    name="rev",
                    value=100,
                    period_start="2022-01-01",
                    period_duration=12,
                )
            ]
        ).dict()
    },
    {
        "text": "We generated one billion in free cash flow in 2022.",
        "output": Root(
            records=[
                FinancialData(
                    name="fcf",
                    value=1_000_000_000,
                    period_start="2022-01-01",
                    period_duration=12,
                )
            ]
        ).dict()
    },
    {
        "text": "Our 2022 Q1 revenue was $32 and we paid $3 in taxes.",
        "output": Root(
            records=[
                FinancialData(
                    name="rev",
                    value=32,
                    period_start="2022-01-01",
                    period_duration=3,
                ),
                FinancialData(
                    name="taxes",
                    value=3,
                    period_start="2022-01-01",
                    period_duration=3,
                ),
            ]
        ).dict()
    },
]


instructions = """
Standardize names according to the following mapping. If you detect
a financial figure that is not in the mapping, report it verbatim.

"revenue", "rev", etc. --> "rev"
"free cash flow", "fcf", etc. --> "fcf"
"financial assets", "investments", etc. --> "abc123"
"""

In [22]:
response = runnable.invoke(
    {
        "text": doc.page_content,
        "schema": Root.schema(),
        "instructions": instructions,
        "examples": examples,
    },
)

In [23]:
response

{'extracted': {'records': [{'name': 'rev',
    'value': 96000000000,
    'period_start': '2023-01-01',
    'period_duration': 12},
   {'name': 'fcf',
    'value': 4400000000,
    'period_start': '2023-01-01',
    'period_duration': 12},
   {'name': 'abc123',
    'value': 29000000000,
    'period_start': '2023-01-01',
    'period_duration': 12}]}}

Note that the mapping can assign arbitrary identifiers: here, cash and investments is reported as `abc123`.
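
For comparison, the same mapping can be applied on the client side after extraction. The following is a minimal sketch using an exact, case-insensitive lookup; `CANONICAL_NAMES` and `standardize` are hypothetical names, and the table contains only the aliases listed in the instructions above.

```python
# Alias -> canonical identifier, taken from the instructions above.
CANONICAL_NAMES = {
    "revenue": "rev",
    "rev": "rev",
    "free cash flow": "fcf",
    "fcf": "fcf",
    "financial assets": "abc123",
    "investments": "abc123",
}


def standardize(records):
    """Return copies of `records` with known names replaced by their
    canonical identifier; unknown names are kept verbatim."""
    result = []
    for record in records:
        key = record["name"].strip().lower()
        result.append({**record, "name": CANONICAL_NAMES.get(key, record["name"])})
    return result


names = ["revenue", "free cash flow", "cash and investments"]
print([r["name"] for r in standardize([{"name": n} for n in names])])
# ['rev', 'fcf', 'cash and investments']
```

The exact lookup leaves "cash and investments" untouched because it is not listed verbatim, whereas the service mapped it to `abc123`. Handling the "etc." in the mapping is where the model adds value over a static table.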

Is this cheating, because we are constructing examples after the fact to correct undesired behavior? Yes. However, looking beyond a proof of concept, we anticipate that combining this capability with feedback systems that gather incorrect extraction results and assemble them intelligently into batches of few-shot examples will allow LLM-based extraction systems to perform well on real-world tasks.
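
As a rough illustration of such a feedback loop, the sketch below collects human-corrected extractions and turns them into the `{"text": ..., "output": ...}` structure used for `examples` in this notebook. `ExampleStore` is hypothetical, and its "most recent first" selection is a placeholder assumption for a smarter policy (for instance, choosing examples similar to the input text).

```python
class ExampleStore:
    """Collect corrected extractions and serve them as few-shot examples."""

    def __init__(self):
        # Keyed by source text so a later correction replaces an earlier one.
        self._by_text = {}

    def add_correction(self, text, corrected_records):
        """Store the records a reviewer says should have been extracted."""
        self._by_text.pop(text, None)  # re-insert so the newest entry is last
        self._by_text[text] = {
            "text": text,
            "output": {"records": list(corrected_records)},
        }

    def examples(self, limit=5):
        """Return up to `limit` of the most recently corrected examples."""
        if limit <= 0:
            return []
        return list(self._by_text.values())[-limit:]


store = ExampleStore()
# A reviewer flags that revenue was wrongly extracted as earnings.
store.add_correction("Our revenue is $100 in 2022.", [])
store.add_correction(
    "Our earnings in 2022 were $100 million.",
    [{"value": 100_000_000, "period_start": "2022-01-01", "period_duration": 12}],
)
print(len(store.examples()))  # 2
```

The result of `store.examples()` could then be passed as the `examples` entry of the request payload in place of a hand-written list.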