# Earnings Call Information Extraction POC

POC to extract structured information from earnings call transcripts.


## Requirements
#### Package Requirements
This notebook was created with the following packages:
- python                    3.11
- llama-index               0.12.25
- pandas                    2.2.2
- langchain                 0.3.21

#### Other Requirements
- Environment variable `OPENAI_API_KEY`. LlamaIndex needs this to use its default GPT-3.5 model to answer queries.
- Environment variable `DEEPINFRA_API_KEY`. This is needed for REST API access to the LLM models hosted on DeepInfra.

In [1]:
import pandas as pd

## Set up Environment

Set up environment-specific parameters. Modify these to suit your local environment.

In [2]:

# Locations of the data sources
#

data_root = "../data"         # Directory to the data
ec_dir = "earning_calls"
working_dir = "working"
index_dir = "indices"
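
The later cells write Parquet, TSV and index files under `working` and `indices`, so those directories need to exist before the first save. A minimal sketch of one way to create them is shown here; `ensure_dirs` is a hypothetical helper that is not part of the notebook, and the demo runs in a temporary directory rather than the real `data_root`.

```python
import os
import tempfile


def ensure_dirs(data_root: str, *subdirs: str) -> list[str]:
    """Create data_root/<subdir> for each subdir and return the full paths."""
    paths = [os.path.join(data_root, d) for d in subdirs]
    for p in paths:
        os.makedirs(p, exist_ok=True)  # no error if the directory already exists
    return paths


# Demo in a throwaway directory so nothing in the real data tree is touched
with tempfile.TemporaryDirectory() as demo_root:
    created = ensure_dirs(demo_root, "earning_calls", "working", "indices")
    print([os.path.basename(p) for p in created])
```

In the notebook itself the call would presumably be `ensure_dirs(data_root, ec_dir, working_dir, index_dir)`.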


In [3]:
import os

# Keys for LLM access
openai_key = os.environ.get("OPENAI_API_KEY")
# hf_key = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
di_key = os.environ.get("DEEPINFRA_API_KEY")

if not openai_key:
    raise EnvironmentError("OPENAI_API_KEY must be provided for this notebook to work.  Needed by LLaMA index.")

# if not hf_key:
#     raise EnvironmentError("Need HuggingFace token for this notebook to work.  Needed for query extension with DeepSeek-R1")

if not di_key:
    raise EnvironmentError("DEEPINFRA_API_KEY is needed to run models in DeepInfra")

In [4]:
#
# Tweak these values
#

# Chunking size
chunk_size = 500
chunk_overlap = 100

# Type of article
article_type = "transcript of the earnings call"
article_name = "MSFT_EC_2Q25"
article_file = "msft/MSFT_FY2Q25__1__m4a_Good_Tape_2025-03-19.txt"

# Query scopes
scope = "Microsoft financial and operational reports"

# LLM models
# llm_model_name = "gpt-4"
# llm_model_name = "gpt-4.5"
llm_model_name = "llama-3"
# llm_model_name = "gemini-2"

# Generation temperature
temperature = 0.3


In [5]:
#
# Select an embedding model for vector database.  Here I use LLaMA Index.
#

from llama_index.embeddings.openai import OpenAIEmbedding

# Initialize the OpenAI embedding model
# embed_model = OpenAIEmbedding(model="text-embedding-3-small")
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Testing
# text = "OpenAI's new embedding models at works"
# print(embed_model.get_text_embedding(text))

In [6]:
# These are the steps in this notebook that can be forced to refresh.
# Many of the steps are time-consuming, so I save their results in the data directory.
# If the saved results exist, I reload them instead of recalculating them.
# Setting any of the steps to True forces the code to recalculate the result for that step.
steps = {
    "chunking": False,                       # Input the article and do chunking
    "extract_values": False,                 # Extract values from the chunks
    "table_embedding": False,                 # Embed table rows
}

def refresh(what:str):
    return what in steps and steps[what]

In [7]:
step_dependencies = {
    "extract_values": ["chunking"],
}

more_to_resolve = True
while more_to_resolve:
    more_to_resolve = False
    for step in step_dependencies:
        if not steps[step] and any([steps[s] for s in step_dependencies[step]]):
            steps[step] = True
            more_to_resolve = True

print("Refresh the following steps:")
for s in steps:
    if steps[s]:
        print(f"- {s}")

Refresh the following steps:
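
The dependency loop above can be exercised in isolation to see how a refresh flag propagates downstream. The sketch below restates the same fixed-point loop as a function; `resolve_refresh` and the demo dictionaries are hypothetical names, and the `table_embedding` dependency is an assumption added only for illustration (the notebook declares a dependency for `extract_values` only).

```python
def resolve_refresh(steps: dict[str, bool], dependencies: dict[str, list[str]]) -> dict[str, bool]:
    """Return a copy of steps where every step downstream of a refreshed step is also refreshed."""
    resolved = dict(steps)
    changed = True
    while changed:  # repeat until a full pass makes no change
        changed = False
        for step, upstream in dependencies.items():
            if not resolved.get(step, False) and any(resolved.get(u, False) for u in upstream):
                resolved[step] = True
                changed = True
    return resolved


demo_steps = {"chunking": True, "extract_values": False, "table_embedding": False}
demo_deps = {
    "table_embedding": ["extract_values"],  # hypothetical dependency, not declared in the notebook
    "extract_values": ["chunking"],
}
resolved = resolve_refresh(demo_steps, demo_deps)
print(resolved)
```

Because the loop repeats until nothing changes, the result does not depend on the order in which the dependencies are listed.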


## Reading and Chunking

Read the transcript and chunk it.

In [8]:
from llama_index.core.node_parser import SentenceSplitter

article_path = os.path.join(data_root, ec_dir, article_file)
chunk_path = os.path.join(data_root, working_dir, f"{article_name}_info_chunks.parquet")

if refresh("chunking") or not os.path.exists(chunk_path):

    # Input
    with open(article_path, "r", encoding="utf-8") as tfd:
        transcript_content = tfd.read()

    # Initialize the SentenceSplitter
    sentence_splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    # Split the text into chunks
    chunks = sentence_splitter.split_text(transcript_content)

    # Put into Pandas
    chunk_ids = [f"{article_name}_{i:04d}" for i in range(len(chunks))]
    chunk_df = pd.DataFrame(zip(chunk_ids, chunks), columns=["chunk_id", "content"])
    chunk_df = chunk_df.set_index("chunk_id")

    # Save the results
    chunk_df.to_parquet(chunk_path)
else:
    chunk_df = pd.read_parquet(chunk_path)

In [9]:
chunk_df

chunk_id,content
MSFT_EC_2Q25_0000,"MSFT_FY2Q25 (1).m4a\n\noperator assistance, pl..."
MSFT_EC_2Q25_0001,"If you ask a question, it will be included in ..."
MSFT_EC_2Q25_0002,"From now on, it's a more\ncontinuous cycle gov..."
MSFT_EC_2Q25_0003,Now on to AI platform and tools. As we shared ...
MSFT_EC_2Q25_0004,When you look at customers who purchased Copil...
MSFT_EC_2Q25_0005,And we are leaning into this. With Dynamics 36...
MSFT_EC_2Q25_0006,"Now on to our consumer businesses, starting wi..."
MSFT_EC_2Q25_0007,All Up Game Pass set a new quarterly record fo...
MSFT_EC_2Q25_0008,Microsoft Cloud gross margin percentage was 70...
MSFT_EC_2Q25_0009,"M365 consumer cloud revenue increased 8%, slig..."
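
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is not how `SentenceSplitter` works internally: the real splitter counts tokens and tries to respect sentence boundaries, while this sketch counts whitespace-separated words. `chunk_words` and the demo text are hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words, each sharing chunk_overlap words with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo_text = " ".join(f"w{i}" for i in range(12))
demo_chunks = chunk_words(demo_text, chunk_size=5, chunk_overlap=2)
# Same ID scheme as the notebook: a name followed by a zero-padded counter
demo_ids = [f"DEMO_{i:04d}" for i in range(len(demo_chunks))]
for cid, chunk in zip(demo_ids, demo_chunks):
    print(cid, "|", chunk)
```

A larger overlap makes it less likely that a figure and its context are split across two chunks, at the cost of more chunks and more LLM calls.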


## Extract Values

Ask the LLM to extract the values into a structured format.

In [10]:
import bots

llm = bots.of(llm_model_name)

In [11]:
import re
import json
import time
from typing import List


def extract_values(df: pd.DataFrame, chunk_id: str) -> List[dict]:
    """ Extract information elements that contain a value """

    instruction = """
    The text below is extracted from a {article_type}.
    You are to extract a list of items and their values.  List only within the scope of {scope}.
    Use JSON format with the following fields:
    - entity: (e.g., IBM total, AWS division)
    - item: (e.g., revenue, net income, gross profit margin)
    - value: (e.g., $1,000,000, 75%) (use negative value for down trend)
    - unit: (e.g., USD, ea, %)
    - comments: (optional. for any supporting information, such as the contributing factor affecting the value)
    Output only the list.
    ===
    {text}
    """

    max_tries = 5
    text = df.loc[chunk_id]["content"]
    attempts = 0

    while attempts < max_tries:
        try:
            values = llm.react(
                instruction,
                arguments={
                    "article_type": article_type,
                    "scope": scope,
                    "text": text,
                },
                temperature=temperature
            )["content"]

            values = re.sub(r"^(['\"`]+)(.*?)(\1)$", r"\2", values, flags=re.DOTALL)
            values = json.loads(values)
            values = [{**d, "chunk_id": chunk_id} for d in values]

            return values

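        except (ValueError, TimeoutError) as e:
            pause = 5 * (attempts + 1)
            print(f"{e}. Retry in {pause} seconds...")
            time.sleep(pause)
            attempts += 1

        time.sleep(2)   # To avoid HuggingFace throttling frequent access

    # All attempts failed: return an empty list so that callers can still iterate over the result
    print(f"Giving up on {chunk_id} after {max_tries} attempts.")
    return []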

# Testing
# extracted = extract_values(chunk_df, "MSFT_EC_2Q25_0001")
# print(extracted)
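
The regular expression in `extract_values` strips matching quote or backtick runs, but a reply wrapped in a Markdown fence with a language tag would still leave the tag in front of the JSON and fail to parse. The sketch below shows one tolerant way to pull the list out of such a reply. `parse_json_list` is a hypothetical helper, not part of the notebook or of the `bots` module, and the sample reply is constructed by hand from a row shown in the values table.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_json_list(raw: str) -> list:
    """Parse a JSON list from an LLM reply, tolerating a Markdown fence or prose around it."""
    text = raw.strip()
    # Drop an opening fence (with an optional language tag) and a closing fence
    text = re.sub("^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + "$", "", text)
    # Keep only the outermost [...] span in case prose surrounds the list
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in reply")
    return json.loads(text[start:end + 1])  # json.JSONDecodeError is a ValueError


reply = (
    FENCE + "json\n"
    '[{"entity": "Microsoft Cloud", "item": "revenue", "value": "40", "unit": "billion USD"}]\n'
    + FENCE
)
records = parse_json_list(reply)
print(records)
```

Both failure modes raise `ValueError`, so a helper like this would fit the existing `except (ValueError, TimeoutError)` retry branch without further changes.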

In [12]:
values_path = os.path.join(data_root, working_dir, f"{article_name}_values.tsv")

if not os.path.exists(values_path) or refresh("extract_values"):
    values_df = pd.DataFrame(columns=["entity", "item", "value", "unit", "comments", "chunk_id"])
else:
    values_df = pd.read_table(values_path).reset_index()

values_df

,index,entity,item,value,unit,comments,chunk_id,company,context
0,0,Microsoft Cloud,revenue,40,billion USD,up 21% year over year,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
1,1,Microsoft AI,annual revenue run rate,13,billion USD,up 175% year over year,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
2,2,AI inference,price performance gain,2,times,for every hardware generation,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
3,3,AI model,price performance gain,10,times,for every model generation due to software opt...,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
4,4,Microsoft Azure,data center capacity,2,times,doubled in the last three years,MSFT_EC_2Q25_0002,Microsoft,2Q25 Earnings Call
...,...,...,...,...,...,...,...,...,...
177,177,Microsoft commercial,RPO increase,39,billion USD,sequential increase,MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
178,178,Microsoft commercial,bookings growth,75,%,"constant currency, sequential",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
179,179,Microsoft Azure,commitments,,,"OpenAI commitments, ongoing relationship",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
180,180,Microsoft commercial core,motions performance,,,"good performance, renewals and add-ons",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call


In [13]:
# Start from an empty table. This discards any rows loaded in the previous cell.
values_df = pd.DataFrame(columns=["entity", "item", "value", "unit", "comments", "chunk_id"])

In [14]:

from util.Scaler import Scaler
from tqdm.notebook import tqdm

if refresh("extract_values") or not os.path.exists(values_path):
    done_chunks = set(values_df["chunk_id"])
    existing_chunks = set(chunk_df.index)
    need_generating = list(existing_chunks - done_chunks)
    need_generating.sort()

    print(f"Generating for {len(need_generating)} chunks: {need_generating}")

    with tqdm(need_generating, desc="Extracting Values") as pbar:
        for cid in need_generating:
            pbar.set_postfix_str(cid)
            values = extract_values(chunk_df, cid)

            for v in values:
                # Normalize the amount
                v["value"], v["unit"] = Scaler.normalize(v["value"], v["unit"])

                # Ensure all columns exist
                for col in values_df.columns:
                    if col not in v:
                        v[col] = None

                # Create a mask for checking if the value combination exists
                mask = (values_df["entity"] == v["entity"]) & (values_df["item"] == v["item"])
                matching_rows = values_df[mask]

                if len(matching_rows):
                    # Check if any matching row has same value and unit
                    value_unit_match = (matching_rows['value'] == v['value']) & \
                                       (matching_rows['unit'] == v['unit'])

                    if value_unit_match.any():
                        # Get index of the matching row
                        match_idx = matching_rows[value_unit_match].index[0]

                        # Append the new comment to existing comment
                        existing_comment = values_df.at[match_idx, 'comments']
                        new_comment = v['comments']

                        if pd.isna(existing_comment):
                            values_df.at[match_idx, 'comments'] = new_comment
                        else:
                            values_df.at[match_idx, 'comments'] = f"{existing_comment}; {new_comment}"

                        existing_cid = values_df.at[match_idx, 'chunk_id']
                        new_cid = v['chunk_id']
                        values_df.at[match_idx, 'chunk_id'] = f"{existing_cid}, {new_cid}"

                    else:
                        # If no matching value and unit, add as new row
                        values_df = pd.concat([values_df, pd.DataFrame([v])], ignore_index=True)

                else:
                    # The entity-item does not exist, add the new row
                    values_df = pd.concat([values_df, pd.DataFrame([v])], ignore_index=True)

            values_df["company"] = "Microsoft"
            values_df["context"] = "2Q25 Earnings Call"

            # Save progress so far
            values_df.to_csv(values_path, sep="\t", index=False)
            pbar.update()

else:
    values_df = pd.read_table(values_path)


In [15]:
values_df["company"] = "Microsoft"
values_df["context"] = "2Q25 Earnings Call"

# Save progress so far
values_df.to_csv(values_path, sep="\t", index=False)


In [16]:
values_df

,entity,item,value,unit,comments,chunk_id,company,context
0,Microsoft Cloud,revenue,40,billion USD,up 21% year over year,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
1,Microsoft AI,annual revenue run rate,13,billion USD,up 175% year over year,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
2,AI inference,price performance gain,2,times,for every hardware generation,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
3,AI model,price performance gain,10,times,for every model generation due to software opt...,MSFT_EC_2Q25_0001,Microsoft,2Q25 Earnings Call
4,Microsoft Azure,data center capacity,2,times,doubled in the last three years,MSFT_EC_2Q25_0002,Microsoft,2Q25 Earnings Call
...,...,...,...,...,...,...,...,...
177,Microsoft commercial,RPO increase,39,billion USD,sequential increase,MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
178,Microsoft commercial,bookings growth,75,%,"constant currency, sequential",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
179,Microsoft Azure,commitments,,,"OpenAI commitments, ongoing relationship",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
180,Microsoft commercial core,motions performance,,,"good performance, renewals and add-ons",MSFT_EC_2Q25_0026,Microsoft,2Q25 Earnings Call
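
The merging rule used in the extraction loop (same entity, item, value and unit means the same fact, so comments and chunk IDs are concatenated) can be shown on plain dictionaries. This is a sketch of the idea rather than the notebook's DataFrame code; `merge_records` is a hypothetical helper, the demo records are made up, and unlike the loop above it skips empty comments instead of appending them.

```python
def merge_records(records: list[dict]) -> list[dict]:
    """Merge records that share entity, item, value and unit, joining their comments and chunk IDs."""
    merged: dict[tuple, dict] = {}
    for rec in records:
        key = (rec["entity"], rec["item"], rec["value"], rec["unit"])
        if key not in merged:
            merged[key] = dict(rec)  # first occurrence: keep a copy
            continue
        kept = merged[key]
        if rec.get("comments"):
            if kept.get("comments"):
                kept["comments"] = f'{kept["comments"]}; {rec["comments"]}'
            else:
                kept["comments"] = rec["comments"]
        kept["chunk_id"] = f'{kept["chunk_id"]}, {rec["chunk_id"]}'
    return list(merged.values())  # dicts preserve insertion order


# Made-up demo records, not taken from the transcript
demo_records = [
    {"entity": "Demo Corp", "item": "revenue", "value": 10, "unit": "billion USD",
     "comments": "first mention", "chunk_id": "DEMO_0001"},
    {"entity": "Demo Corp", "item": "revenue", "value": 10, "unit": "billion USD",
     "comments": "repeated in Q&A", "chunk_id": "DEMO_0005"},
    {"entity": "Demo Corp", "item": "revenue", "value": 12, "unit": "billion USD",
     "comments": None, "chunk_id": "DEMO_0007"},
]
merged_records = merge_records(demo_records)
for rec in merged_records:
    print(rec)
```

Keying a dictionary on the four fields avoids rescanning the whole table for every new record, which is what the boolean masks in the loop do.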


## Index Table Rows


In [17]:
from util.TableIndexer import TableIndexer

value_index_path = os.path.join(data_root, index_dir, f"{article_name}_values")

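# NOTE: the leading `True or` forces the index to be rebuilt on every run; remove it to reuse a saved index.
if True or refresh("table_embedding") or not os.path.exists(value_index_path):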
    indexer = TableIndexer()
    indexer.insert(values_df, metadata_fields=["chunk_id"])

    indexer.save(value_index_path)

else:
    indexer = TableIndexer.load(value_index_path)

In [34]:
# These values are in MSFT-FY2Q25-法說會memo.txt but not in MSFT_FY2Q25__1__m4a_Good_Tape_2025-03-19.txt
# print(indexer.query("What is Microsoft total FY2Q25營收季增? (cite chunk_id in [])")) # (6.2% not found in document)
# print(indexer.query("What is Microsoft total FY2Q25營收年增? (cite chunk_id in [])")) # (12.3 not found in document)
# print(indexer.query("What is Microsoft total FY2Q25營收優於財測中值? (cite chunk_id in [])")) # No value (not found in EC)
# print(indexer.query("What is Microsoft total FY2Q25營收市場預期? (cite chunk_id in [])")) # Wrong value (not found in EC)
# print(indexer.query("What is Microsoft total FY2Q25毛利率? (cite chunk_id in [])")) # Wrong value (not found in EC)


In [42]:
def ask(question: str):
    print(f"Q: {question}")
    question += " (cite chunk_id in [])"
    print(f"A: {indexer.query(question, top_k=10)}")
    print()

In [43]:
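ask("What is Microsoft revenue of 2Q25?")
ask("What is Microsoft total FY2Q25營收?")                      # 營收 = revenue
ask("Microsoft の 2025 年第 2 四半期の総収益はいくらですか?")        # Japanese: What is Microsoft's total revenue for Q2 2025?
ask("What is Microsoft FY2Q25營收?")                            # 營收 = revenue
ask("What is Microsoft total FY2Q25 gross margin?")
ask("What is Azure revenue growth?")
ask("智慧雲端營收年增%")                                          # Intelligent Cloud revenue year-over-year growth %
ask("智慧雲端營收")                                              # Intelligent Cloud revenue
ask("智慧雲端營收 USD forecast")                                 # Intelligent Cloud revenue, USD forecast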


Q: What is Microsoft revenue of 2Q25?
A: Microsoft revenue of 2Q25 is 69.6 billion USD [MSFT_EC_2Q25_0007].

Q: What is Microsoft total FY2Q25營收?
A: Microsoft total FY2Q25 revenue is 69.6 billion USD [MSFT_EC_2Q25_0007].

Q: Microsoft の 2025 年第 2 四半期の総収益はいくらですか?
A: Microsoftの2025年第2四半期の総収益は69.6十億米ドルです [MSFT_EC_2Q25_0007]。

Q: What is Microsoft FY2Q25營收?
A: Microsoft FY2Q25 revenue is 69.6 billion USD [MSFT_EC_2Q25_0007].

Q: What is Microsoft total FY2Q25 gross margin?
A: Microsoft total FY2Q25 gross margin is 13% [MSFT_EC_2Q25_0007].

Q: What is Azure revenue growth?
A: Azure revenue growth is 31% [MSFT_EC_2Q25_0012].

Q: 智慧雲端營收年增%
A: 智慧雲端營收年增為19% [MSFT_EC_2Q25_0009]。

Q: 智慧雲端營收
A: 智慧雲端營收為 25.5 億美元，增長了 19% [MSFT_EC_2Q25_0009]。

Q: 智慧雲端營收 USD forecast
A: 智慧雲端營收 USD 預測為 25.9 到 26.2 百萬美元 [MSFT_EC_2Q25_0012]。
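
Since every answer cites its source in square brackets, the cited chunk IDs can be pulled out and used to check an answer against the transcript. The sketch below shows one way to do that with a regular expression; `cited_chunk_ids` is a hypothetical helper, and the pattern assumes IDs consist only of letters, digits and underscores, as they do in the answers above.

```python
import re

# One or more comma-separated IDs inside a single pair of square brackets
CITATION_RE = re.compile(r"\[([A-Za-z0-9_]+(?:\s*,\s*[A-Za-z0-9_]+)*)\]")


def cited_chunk_ids(answer: str) -> list[str]:
    """Return the chunk IDs cited in an answer, in order of first appearance, without duplicates."""
    ids: list[str] = []
    for group in CITATION_RE.findall(answer):
        for cid in re.split(r"\s*,\s*", group):
            if cid not in ids:
                ids.append(cid)
    return ids


answer = "Azure revenue growth is 31% [MSFT_EC_2Q25_0012]."
print(cited_chunk_ids(answer))
```

Each returned ID could then be looked up with `chunk_df.loc[cid, "content"]` to read the passage the answer claims to rely on, which is a quick way to spot answers such as the gross margin one that deserve a second look.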



## Create an Agent