# Example: extracting structured data from earnings call transcripts

Most public companies host earnings calls, which give management an opportunity to discuss past financial results and future plans. Natural language transcripts of these calls may contain useful information, but that information must usually be extracted and arranged into a structured form before it can be analyzed or compared across time periods and companies.

Here we demonstrate the use of an LLM-powered extraction service to extract information from Uber's Q4 2023 earnings call. We show how important few-shot examples are for accurate extraction in a real-world context.

Uber investor relations makes the prepared remarks for the call available [online](https://s23.q4cdn.com/407969754/files/doc_earnings/2023/q4/transcript/Uber-Q4-23-Prepared-Remarks.pdf).

First we start our local extraction service, as described in the [README](../../../README.md), and download the PDF document:

In [1]:
import requests

url = "http://localhost:8000"

In [2]:
# Uber transcripts from earnings calls and other events at https://investor.uber.com/news-events/default.aspx

pdf_url = "https://s23.q4cdn.com/407969754/files/doc_earnings/2023/q4/transcript/Uber-Q4-23-Prepared-Remarks.pdf"

In [3]:
# Get PDF bytes, raising an HTTPError if the download fails

pdf_response = requests.get(pdf_url)
pdf_response.raise_for_status()
pdf_bytes = pdf_response.content

We next specify the schema of what we intend to extract. Here we specify a record of financial data. We allow the LLM to infer various attributes, such as the time period for the record.

Note that we include an `evidence` attribute, which provides context for the predictions and supports downstream verification of the results.

Once we've defined our schema, we create an extractor by posting it to our database.

In [4]:
from uuid import uuid4

from pydantic import BaseModel, Field

class FinancialData(BaseModel):
    name: str = Field(..., description="Name of the financial figure, such as revenue.")
    value: int = Field(..., description="Nominal earnings in local currency.")
    scale: str = Field(..., description="Scale of figure, such as MM, B, or percent.")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")
    evidence: str = Field(..., description="Verbatim sentence of text where figure was found.")

user_id = str(uuid4())
headers = {"x-key": user_id}

data = {
    "user_id": user_id,
    "description": "Financial revenues and other figures.",
    "schema": FinancialData.schema(),
    "instruction": (
        "Extract standard financial figures, specifically earnings and "
        "revenue figures."
    )
}

response = requests.post(f"{url}/extractors", json=data, headers=headers)
response

<Response [200]>

In [5]:
extractor = response.json()
print(extractor)

{'uuid': '151db8c9-ec49-4c6c-a13d-b5335ede8cbb'}


We can now try the extractor on our PDF:

In [6]:
result = requests.post(
    f"{url}/extract",
    data={"extractor_id": extractor["uuid"]},
    files={"file": pdf_bytes},
    headers=headers,
)

result

<Response [200]>

In [7]:
result.json()

{'data': [{'name': 'Adjusted EBITDA',
   'scale': 'million',
   'value': 1300,
   'evidence': 'Q4 was a standout quarter to cap off a standout year... translated to $1.3 billion in Adjusted EBITDA',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'GAAP operating income',
   'scale': 'million',
   'value': 652,
   'evidence': 'translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'Gross Bookings',
   'scale': 'billion',
   'value': 37.6,
   'evidence': 'Gross Bookings of $37.6 billion',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'Revenue',
   'scale': 'billion',
   'value': 9.9,
   'evidence': 'we grew our revenue by 13% YoY on a constant-currency basis to $9.9 billion',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'Adjusted EBITDA',
   'scale': '$',
   'value': 1260000000,
   'evidence': 'We expect Adjusted E

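We've extracted several records capturing various earnings and revenue figures, each with the fields of the desired schema. The representation is inconsistent, though: `scale` appears as `'million'`, `'billion'`, and `'$'`, and some values (37.6, 9.9) are fractional even though the schema declares `value` as an `int`.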
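The `evidence` field can be used to spot-check records like these. The following is a minimal sketch, not part of the extraction service: the helper names `normalize` and `evidence_is_verbatim` are hypothetical, and `source_text` is a one-sentence stand-in for the transcript, taken from evidence strings the service returns in this notebook. In practice you would compare against the full text extracted from the PDF.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so line breaks in the
    extracted PDF text do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_is_verbatim(record: dict, source_text: str) -> bool:
    """Return True if the record's evidence occurs in the source text."""
    return normalize(record["evidence"]) in normalize(source_text)


# Stand-in for the transcript text (assumption: a single sentence only).
source_text = (
    "These strong top-line trends, combined with continued rigor on costs, "
    "translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP "
    "operating income."
)

# Two records shaped like the service output, reduced to the fields we need.
records = [
    {
        "name": "GAAP operating income",
        "evidence": "translated to $1.3 billion in Adjusted EBITDA and "
        "$652 million in GAAP operating income",
    },
    {
        "name": "Adjusted EBITDA",
        "evidence": "Q4 was a standout quarter to cap off a standout year... "
        "translated to $1.3 billion in Adjusted EBITDA",
    },
]

for record in records:
    found = evidence_is_verbatim(record, source_text)
    status = "verbatim" if found else "needs review"
    print(f"{record['name']}: {status}")
```

Against this stand-in, the first record is found and the second is flagged for review. A substring match is deliberately strict: evidence that the model has paraphrased or stitched together will not match and is routed to a human instead.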

We can convey additional instructions to the LLM efficiently via few-shot examples. For example, we can specify how the names of financial metrics should be normalized, or how scales (millions, billions, percentages, etc.) should be represented in different cases.
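Without consistent scales, downstream code has to reconcile the labels itself. As a rough sketch of that burden, the snippet below converts records from the first run into absolute amounts. The `SCALE_MULTIPLIERS` table and the `absolute_value` helper are hypothetical and cover only the labels seen in this notebook; they are not part of the extraction service.

```python
# Hypothetical mapping from scale labels to multipliers. It only covers
# labels that appear in this notebook's outputs.
SCALE_MULTIPLIERS = {
    "$": 1,
    "MM": 1_000_000,
    "million": 1_000_000,
    "B": 1_000_000_000,
    "billion": 1_000_000_000,
}


def absolute_value(record: dict) -> float:
    """Convert a record's (value, scale) pair into an absolute amount."""
    scale = record["scale"]
    if scale not in SCALE_MULTIPLIERS:
        # Fail loudly rather than guess at an unknown label such as "percent".
        raise ValueError(f"Unrecognized scale: {scale!r}")
    return record["value"] * SCALE_MULTIPLIERS[scale]


# Three records from the first extraction run, reduced to the relevant fields.
first_run = [
    {"name": "Adjusted EBITDA", "scale": "million", "value": 1300},
    {"name": "Revenue", "scale": "billion", "value": 9.9},
    {"name": "Adjusted EBITDA", "scale": "$", "value": 1260000000},
]

for record in first_run:
    print(f"{record['name']}: {absolute_value(record):,.0f}")
```

Every new label the model invents would need another entry in the table, which is why it is more robust to steer the model toward one convention with examples.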

The `examples` endpoint lets us associate few-shot examples with an extractor. We can specify examples by pairing text inputs with lists of `FinancialData` outputs:

In [8]:
examples = [
    {
        "text": "In 2022, Revenue was $1 million and EBIT was $2M.",
        "output": [
            FinancialData(
                name="revenue",
                value=1,
                scale="MM",
                period_start="2022-01-01",
                period_duration=12,
                evidence="In 2022, Revenue was $1 million and EBIT was $2M.",
            ).dict(),
            FinancialData(
                name="ebit",
                value=2,
                scale="MM",
                period_start="2022-01-01",
                period_duration=12,
                evidence="In 2022, Revenue was $1 million and EBIT was $2M.",
            ).dict()
        ],
    },
]

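responses = []
for example in examples:
    # Pair the raw input text with the records we expect back for it
    create_request = {
        "extractor_id": extractor["uuid"],
        "content": example["text"],
        "output": example["output"],
    }
    response = requests.post(f"{url}/examples", json=create_request, headers=headers)
    response.raise_for_status()  # surface a failed upload instead of continuing silently
    responses.append(response)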

Having posted the examples, we can re-run the extraction:

In [9]:
result = requests.post(
    f"{url}/extract",
    data={"extractor_id": extractor["uuid"]},
    files={"file": pdf_bytes},
    headers=headers,
)

result

<Response [200]>

In [10]:
result.json()

{'data': [{'name': 'adjusted ebitda',
   'scale': 'MM',
   'value': 1300,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'revenue',
   'scale': 'MM',
   'value': 9900,
   'evidence': 'We grew our revenue by 13% YoY on a constant-currency basis to $9.9 billion.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'gaap operating income',
   'scale': 'MM',
   'value': 652,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'adjusted ebitda',
   'scale': 'B',
   'value': 1260,
   'evidence': 'We expect Adjusted EBITDA of $1.26 billion to $1.34 billion.',
   'period_start': '2023-01