# Earnings Call Summary POC

A proof of concept (POC) for summarizing the transcript of an earnings call.

## Requirements
#### Package Requirements
This notebook was created with the following packages:
- python                    3.11
- llama-index               0.12.25
- ragas                     0.2.14

#### Other Requirements
- Environment variable `OPENAI_API_KEY`. LlamaIndex needs this to use its default GPT-3.5 model when answering a query.
- Environment variable `DEEPINFRA_API_KEY`. This is needed for REST API access to the LLMs hosted on DeepInfra.

In [66]:
import pandas as pd
%pip list

Package                                  Version
---------------------------------------- -----------
absl-py                                  2.1.0
aiohttp                                  3.9.5
aiosignal                                1.3.1
annotated-types                          0.6.0
anyio                                    4.2.0
appdirs                                  1.4.4
appnope                                  0.1.3
argon2-cffi                              23.1.0
argon2-cffi-bindings                     21.2.0
arrow                                    1.3.0
asgiref                                  3.8.1
asttokens                                2.4.1
astunparse                               1.6.3
async-lru                                2.0.4
async-timeout                            4.0.3
attrs                                    23.2.0
Babel                                    2.14.0
backoff                                  2.2.1
bcrypt                      

## Set up Environment

Set up environment-specific parameters. Modify these to suit your local environment.

In [93]:
#
# Locations of the data sources
#

data_root = "../data"         # Directory to the data
ec_dir = "earning_calls"
working_dir = "working"
reports_dir = "reports"
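
Later cells write to the working and reports directories configured above, and writing to a path whose parent directory is missing will fail. A minimal sketch for creating them up front follows. The `ensure_dirs` helper is hypothetical (it is not part of the original notebook), and the demo runs against a throwaway temporary root rather than the real `data_root`.

```python
import os
import tempfile


def ensure_dirs(root, subdirs):
    """Create root/<subdir> for every subdir and return the full paths."""
    paths = [os.path.join(root, sub) for sub in subdirs]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if it already exists
    return paths


# Demo against a temporary root; in the notebook you would pass data_root.
with tempfile.TemporaryDirectory() as demo_root:
    created = ensure_dirs(demo_root, ["earning_calls", "working", "reports"])
    all_exist = all(os.path.isdir(p) for p in created)
```

In the notebook itself this would be called as `ensure_dirs(data_root, [ec_dir, working_dir, reports_dir])` before any cell that saves results.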


In [68]:
import os

# Keys for LLM access
openai_key = os.environ.get("OPENAI_API_KEY")
# hf_key = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
di_key = os.environ.get("DEEPINFRA_API_KEY")

if not openai_key:
    raise EnvironmentError("OPENAI_API_KEY must be provided for this notebook to work.  Needed by LlamaIndex.")

# if not hf_key:
#     raise EnvironmentError(f"Need HuggingFace token for this notebook to work.  Needed for query extension with DeepSeek-R1"  )

if not di_key:
    raise EnvironmentError("DEEPINFRA_API_KEY is needed to run models in DeepInfra")

In [88]:
#
# Tweak these values
#

# Chunking size
chunk_size = 1000
chunk_overlap = 200

# Type of article
article_type = "transcript of the earnings call"
article_name = "MSFT_EC_2Q25"
article_file = "msft/MSFT_FY2Q25__1__m4a_Good_Tape_2025-03-19.txt"

# Summarization scopes
scope = "Microsoft financial and operational reports"

# Sections that shall be in the report
report_sections = [
    "Executive Summary", "Future Outlook",
]

In [86]:
# These are the steps in this notebook that can be forced to refresh.
# Many of the steps are time-consuming, so I save their results in the data directory.
# If the saved results exist, I reload them instead of recalculating them.
# Setting a step to True forces the code to recalculate the result for that step.
steps = {
    "chunking": False,         # Input the article and do chunking
    "chunk_summaries": False,  # Per chunk summarization
    "final_summary": True,     # Summarize the chunk summaries
}
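
The reload-or-recalculate pattern described in the comments above can be sketched as a small helper. This is only an illustration under assumptions: `load_or_compute` is a hypothetical name, and it caches to JSON for simplicity, whereas the notebook itself saves Parquet and plain-text files.

```python
import json
import os
import tempfile


def load_or_compute(path, compute, force=False):
    """Return the cached JSON at `path`, or call `compute()` and cache its result."""
    if not force and os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fd:
            return json.load(fd)
    result = compute()
    with open(path, "w", encoding="utf-8") as fd:
        json.dump(result, fd)
    return result


calls = []  # records how many times the expensive step really ran


def expensive_step():
    calls.append(1)
    return {"summary": "stub result"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "step.json")
    first = load_or_compute(cache_path, expensive_step)               # computes and saves
    second = load_or_compute(cache_path, expensive_step)              # reloads from disk
    third = load_or_compute(cache_path, expensive_step, force=True)   # forced recompute
```

The `force` argument plays the role of an entry in `steps`: passing `force=steps["final_summary"]` would recompute only when that flag is `True` or the file is missing.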

## Reading and Chunking

Read the transcript and split it into overlapping chunks with LlamaIndex's `SentenceSplitter`.

In [71]:
from llama_index.core.node_parser import SentenceSplitter

article_path = os.path.join(data_root, ec_dir, article_file)
chunk_path = os.path.join(data_root, working_dir, f"{article_name}_chunks.parquet")

if steps["chunking"] or not os.path.exists(chunk_path):

    # Input
    with open(article_path, "r", encoding="utf-8") as tfd:
        transcript_content = tfd.read()

    # Initialize the SentenceSplitter
    sentence_splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    # Split the text into chunks
    chunks = sentence_splitter.split_text(transcript_content)

    # Put into Pandas
    chunk_ids = [f"{article_name}_{i:04d}" for i in range(len(chunks))]
    chunk_df = pd.DataFrame(zip(chunk_ids, chunks), columns=["chunk_id", "content"])

    # Save the results
    chunk_df.to_parquet(chunk_path)
else:
    chunk_df = pd.read_parquet(chunk_path)
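
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sketch of overlapping windows. It is not how `SentenceSplitter` works internally: the real splitter measures size in tokens and prefers to break at sentence boundaries, while the hypothetical `sliding_window_chunks` helper below just slides a fixed window over a list of words.

```python
def sliding_window_chunks(words, chunk_size, chunk_overlap):
    """Split `words` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the input
    return chunks


# Toy input: ten "words", windows of four, one word shared between neighbors.
words = [f"w{i}" for i in range(10)]
demo_chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
```

With these toy values the windows start at words 0, 3, and 6, so each chunk repeats the last word of the previous one. The overlap is what keeps a sentence that straddles a boundary visible in both neighboring chunks.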

In [72]:
chunk_df

|   | chunk_id | content |
|---|----------|---------|
| 0 | MSFT_EC_2Q25_0000 | MSFT_FY2Q25 (1).m4a\n\noperator assistance, pl... |
| 1 | MSFT_EC_2Q25_0001 | From now on, it's a more\ncontinuous cycle gov... |
| 2 | MSFT_EC_2Q25_0002 | When you look at customers who purchased Copil... |
| 3 | MSFT_EC_2Q25_0003 | More professionals than ever are engaging in h... |
| 4 | MSFT_EC_2Q25_0004 | Microsoft Cloud gross margin percentage was 70... |
| 5 | MSFT_EC_2Q25_0005 | Segment gross margin dollars increased 12% and... |
| 6 | MSFT_EC_2Q25_0006 | We expect consistent execution\nacross our cor... |
| 7 | MSFT_EC_2Q25_0007 | We expect Xbox content services revenue growth... |
| 8 | MSFT_EC_2Q25_0008 | Please proceed. Thank you guys for taking the ... |
| 9 | MSFT_EC_2Q25_0009 | But what we're seeing is waiting to see just h... |


## Summarizing the Chunks

In [73]:
import bots

llm = bots.of("gpt-4")
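
The `bots` module is a local helper that is not included here. Judging only from how it is called in this notebook (`llm.react(prompt, arguments={...})` returning a mapping with a `"content"` key), a hypothetical stand-in for dry runs without API access might look like the sketch below. The `EchoBot` class, and the assumption that placeholders are filled with `str.format`, are guesses and not the real implementation.

```python
class EchoBot:
    """Hypothetical stand-in for the object returned by bots.of(...)."""

    def __init__(self, model_name):
        self.model_name = model_name

    def react(self, prompt, arguments=None):
        # Assumption: placeholders such as {text} are filled str.format-style.
        filled = prompt.format(**(arguments or {}))
        # A real bot would send `filled` to the model; this one echoes a preview.
        return {"content": f"[{self.model_name}] {filled.strip()[:60]}"}


fake_llm = EchoBot("gpt-4")
reply = fake_llm.react("Summarize: {text}", arguments={"text": "Revenue grew 12%."})
```

Swapping `llm` for such a stub lets the chunking, caching, and file-writing logic be exercised without spending tokens.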


In [79]:
from tqdm.notebook import tqdm

if steps["chunk_summaries"] or "summary" not in chunk_df.columns:

    # Ask LLM to summarize per chunk
    chunk_summary_prompt = """
    Concisely summarize the following section of a {article_type}.
    Only summarize within the scope of {scope}.
    ===
    {text}
    """

    chunk_summaries = []
    for chunk_content in tqdm(chunk_df["content"], desc="Summarize chunks"):
        result = llm.react(
            chunk_summary_prompt,
            arguments={
                "article_type": article_type,
                "text": chunk_content,
                "scope": scope,
            })
        chunk_summaries.append(result["content"])

    chunk_df["summary"] = chunk_summaries

    # Save the results
    chunk_df.to_parquet(chunk_path)
else:
    chunk_df = pd.read_parquet(chunk_path)

Summarize chunks:   0%|          | 0/14 [00:00<?, ?it/s]
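
Because the prompt above is defined inside an indented block, every line of it carries leading spaces that become part of the text. A possible cleanup, sketched here with the standard library's `textwrap.dedent`, strips that shared indentation before the placeholders are filled. Filling with `str.format` is an assumption on my part, since the internals of `llm.react` are not shown, and the `text` value is a made-up sample.

```python
import textwrap

raw_prompt = """
    Concisely summarize the following section of a {article_type}.
    Only summarize within the scope of {scope}.
    ===
    {text}
    """

# Remove the common leading whitespace, then the blank first and last lines.
clean_prompt = textwrap.dedent(raw_prompt).strip()

# Dedent before filling, so indentation inside the inserted text is untouched.
filled = clean_prompt.format(
    article_type="transcript of the earnings call",
    scope="Microsoft financial and operational reports",
    text="Revenue grew 12%.",  # made-up sample chunk
)
```

Whether the extra indentation actually affects the model's output is not something this notebook measures, so treat this as tidying rather than a fix.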

In [80]:
chunk_df

|   | chunk_id | content | summary |
|---|----------|---------|---------|
| 0 | MSFT_EC_2Q25_0000 | MSFT_FY2Q25 (1).m4a\n\noperator assistance, pl... | Microsoft CEO Satya Nadella reported a 21% YoY... |
| 1 | MSFT_EC_2Q25_0001 | From now on, it's a more\ncontinuous cycle gov... | Microsoft continues to grow across every layer... |
| 2 | MSFT_EC_2Q25_0002 | When you look at customers who purchased Copil... | In the first quarter of availability, customer... |
| 3 | MSFT_EC_2Q25_0003 | More professionals than ever are engaging in h... | Microsoft reported robust financial results wi... |
| 4 | MSFT_EC_2Q25_0004 | Microsoft Cloud gross margin percentage was 70... | Microsoft Cloud's gross margin percentage stoo... |
| 5 | MSFT_EC_2Q25_0005 | Segment gross margin dollars increased 12% and... | Microsoft reported a 12% increase in segment g... |
| 6 | MSFT_EC_2Q25_0006 | We expect consistent execution\nacross our cor... | Microsoft expects undulating quarterly savings... |
| 7 | MSFT_EC_2Q25_0007 | We expect Xbox content services revenue growth... | Microsoft expects Xbox content services revenu... |
| 8 | MSFT_EC_2Q25_0008 | Please proceed. Thank you guys for taking the ... | Microsoft reported solid commercial bookings i... |
| 9 | MSFT_EC_2Q25_0009 | But what we're seeing is waiting to see just h... | The discussion in the Microsoft earnings call ... |


In [83]:
# Print the chunk summaries in full so they are easier to read
for _, row in chunk_df.iterrows():
    print(f"==={row['chunk_id']}===\n{row['summary']}")

===MSFT_EC_2Q25_0000===
Microsoft CEO Satya Nadella reported a 21% YoY increase in Microsoft Cloud revenue, surpassing $40 billion for the first time. Furthermore, the company's artificial intelligence (AI) business has also grown significantly, reaching an annual revenue run rate of $13 billion, up 175% YoY. Microsoft is focused on scaling its fleet globally, addressing demand and balancing resources across training and inference. The company's cloud platform, Azure, continues to expand its data center capacity to meet demand, with total data center capacity more than doubling in the last three years. The report also noted significant efficiency gains in both training and inference due to software optimization and AI scaling laws.
===MSFT_EC_2Q25_0001===
Microsoft continues to grow across every layer of the technology stack, seeing progress in areas such as Azure, its AI initiatives, and the future of work. In terms of Azure, Microsoft has doubled data center capacity over three years

## Summarize the Summaries

In [95]:
report_en_path = os.path.join(data_root, reports_dir, f"{article_name}_report_en.txt")

if steps["final_summary"] or not os.path.exists(report_en_path):
    final_summary_prompt = """
    You are to generate a report from a {article_type}.
    Organize related points into sections.
    The sections shall include, but are not limited to, {sections}.
    ===
    {text}
    """

    chunk_summaries = "\n\n".join(list(chunk_df["summary"]))

    result = llm.react(final_summary_prompt, arguments={
        "article_type": article_type,
        "text": chunk_summaries,
        "scope": scope,
        "sections": ", ".join(report_sections),
    })
    report_en = result["content"]

    with open(report_en_path, "w", encoding="utf-8") as fd:
        fd.write(report_en)

else:
    with open(report_en_path, "r", encoding="utf-8") as fd:
        report_en = fd.read()

print(report_en)

Executive Summary:
- Microsoft reported substantial growth over the past year. Microsoft Cloud revenue rose by 21% YoY, surpassing $40 billion. 
- The primary growth areas include Azure, AI initiatives, and the future of work. Large corporations are migrating workloads to Microsoft's Azure, and more than 19,000 customers use Microsoft's Fabric.
- Microsoft's artificial intelligence (AI) business grew significantly, reaching an annual revenue run rate of $13 billion, up 175% YoY.
- The company's partnership with OpenAI and the launch of Azure AI Foundry has brought in more than 200,000 monthly active users.
- The future of work sees tools like Microsoft 365 Copilot being adopted. The number of daily users more than doubled quarter over quarter.
- Microsoft reported a solid financial result with a revenue of $69.6 billion, representing a 12% increase. Growth areas include LinkedIn, Edge, gaming, and AI sectors. 
- Upcoming priorities include continued investment in Microsoft Cloud and sc

In [96]:
report_zh_path = os.path.join(data_root, reports_dir, f"{article_name}_report_zh.txt")

if steps["final_summary"] or not os.path.exists(report_zh_path):
    translation_prompt = """
    Translate the following text into Traditional Chinese.
    Do not translate technical terms.
    ===
    {text}
    """

    result = llm.react(translation_prompt, arguments={
        "text": report_en,
    })
    report_zh = result["content"]

    with open(report_zh_path, "w", encoding="utf-8") as fd:
        fd.write(report_zh)

else:
    with open(report_zh_path, "r", encoding="utf-8") as fd:
        report_zh = fd.read()

print(report_zh)

執行總結:
- 微軟在過去一年內報告了大量的增長。微軟雲端業務的收入年增21%，超過400億美元。
- 主要的增長領域包括Azure，AI倡議及加速遠端工作的未來。大型公司將工作負載遷移到微軟的Azure，並且有超過19,000位客戶使用微軟的Fabric。
- 微軟的人工智慧(AI)業務顯著成長，達到年度收入運營額130億美元，年增率達175%。
- 該公司與OpenAI的合作及Azure AI工場的啟動為公司帶來超過200,000個每月活躍用戶。
- 當看到未來的工作使用像微軟365的Copilot這種工具時，每天的用戶數量比前季度增加一倍以上。
- 微軟報告了穩健的財務結果，收入為696億美元，增長率為12%。增長領域包括LinkedIn，Edge，遊戲以及AI領域。
- 即將要優先處理的任務包括繼續投資於微軟雲端及擴大AI設施。

財務表現:
- 該公司報告了穩健的財務表現，收入696億美元，年增12%。
- 微軟雲端的毛利率70%，由於AI設施的擴大導致稍微下降。
- LinkedIn的收入首次達到20億美元，增長率為9%。
- 效率與商業流程的收入增長14%，達到294億美元。
- 智慧雲端部門報告19%的收入增長，達到255億美元。
- 更個人化的計算收入報告為147億美元，與去年大致保持不變。
- 微軟通過股息及回購股份向股東回饋97億美元。

未來展望:
- 微軟預測，由於AI設施的擴大，雲端的毛利率將會降至69%，年度降幅。
- 預計效率與業務流程的收入增長率將介於11%至12%之間，或者294億至297億美元。
- 預期微軟365的商業雲端收入將較第二季度增長14%至15%。
- LinkedIn，Dynamics 365與智慧雲的期望收入增長分別處於個位數至青少年中期，及19%-20%之間。
- 預計Azure的收入增長將介於31%-32%之間。
- 對於個人計算，預期的收入為124億至128億美元。
- 預期Xbox服務的收入增長將在個位數至青少年中期範圍內，硬件收入較去年有所下降。

產品更新:
- GitHub Copilot日益受歡迎，第一週就有超過一百萬的註冊數。
- 最近推出的Azure AI工場有超過200,000個每月活躍用戶。
- 微軟的Fabric現在有超過19,000個付費用戶，而Power BI則有超過3,000萬個每