## CPSC 477 Final Project Part 1: Augmenting ECTSum With Gemini


We used Gemini 1.0 Pro, accessed through Google's Generative AI API, to augment the summaries in ECTSum. The bullet-point summaries provided in ECTSum are very terse and omit much of the useful information in the earnings call transcripts.

For example, this is the summary that ECTSum provides for `SNX_q2_2021.txt`:

```
synnex corporation sees q3 non-gaap earnings per share $1.90 to $2.10.
sees q3 non-gaap earnings per share $1.90 to $2.10.
sees q3 revenue $4.95 billion to $5.45 billion.
qtrly non-gaap diluted earnings per share from continuing operations $2.09.
q2 revenue rose 31 percent to $5.9 billion.
```

To address this, we prompted Gemini to summarize each earnings call transcript and instructed it to incorporate all the information in the provided ECTSum summary, which the prompt refers to as the *Key Points*.

Here is the prompt we used, where `[ect]` and `[summary]` stand for the transcript text and the ECTSum summary:

```
You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript.

Below is an earnings call transcript. Please summarize this transcript in exactly one paragraph using complete sentences. Keep the summary below 300 words. It is very important that you do not use any titles in the summary. Include relevant information and statistics from the Earnings Call Transcript in your summary. Furthermore, it is very important that you incorporate all the information and statistics from the Key Points and spread it out throughout your summary.

Earnings Call Transcript:
[ect]

Key Points:
[summary]
```
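The way the two placeholders are filled in can be sketched as follows. This is a minimal illustration that mirrors the string layout used in the generation cell of this notebook; the helper name `fill_prompt` and the short example strings are assumptions made for the sketch, not part of the original code.

```python
def fill_prompt(instructions, ect, summary):
    """Join the instructions with the transcript and Key Points sections.

    `fill_prompt` is a hypothetical helper; the section labels match the prompt above.
    """
    return (
        f"{instructions.strip()}\n\n"
        f"Earnings Call Transcript:\n{ect.strip()}\n\n"
        f"Key Points:\n{summary.strip()}\n"
    )


# Toy inputs (made up for illustration, not taken from ECTSum)
example = fill_prompt(
    "Please summarize this transcript in exactly one paragraph.",
    "  Thank you for joining our second quarter call.  ",
    "q2 revenue rose 31 percent to $5.9 billion.\n",
)
print(example)
```

Stripping each piece before joining keeps stray leading or trailing whitespace in the data files from changing the prompt layout.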

Here is the Gemini-augmented summary for the `SNX_q2_2021.txt` example shown above:

```
Synnex Corporation reported strong Q2 results, exceeding expectations with a 31% year-over-year revenue increase to $5.9 billion. Despite a lower gross margin, non-GAAP diluted earnings per share increased 68% to $2.09. The company's improved cash flow and available liquidity of $3.1 billion demonstrate financial strength. Regarding the proposed merger with Tech Data, regulatory approvals are progressing, and the transaction is expected to close in the second half of 2021. For Q3, Synnex forecasts revenue between $4.95 billion and $5.45 billion, with non-GAAP diluted earnings per share projected in the range of $1.90 to $2.10. While aware of supply chain challenges, the company remains optimistic about IT spending and expects tailwinds from market demand and enterprise project resumption.
```

After generating summaries with Gemini, we ran each one through three quality checks and kept only those that passed all of them.

For example, for the `test` fold, we generated 179 summaries. Of those, 171 were kept because they consisted of exactly one paragraph. Of those, 145 were kept because they contained at least one number and every number in the generated summary also appeared in the full transcript (i.e., no hallucinated numbers; "100% precision" in the table below). Of those, 50 were kept because they included at least 2 numbers from the *Key Points* ("recall ≥ 2"). Finally, we added these 50 high-quality summaries to the `test` fold. The table below lists how many summaries remained after each filter in every fold.
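The three checks can be sketched on toy strings as follows. This is a simplified illustration rather than the notebook's exact code: `numeric_tokens` and `passes_filters` are hypothetical helpers, the example strings are made up, recent years are not excluded, and tokens are compared as raw strings, so `31%` and `31 percent` would not match.

```python
def numeric_tokens(text):
    """Collect whitespace-separated tokens such as 31%, $5.9, or 2.09 (simplified)."""
    found = set()
    for token in text.split():
        token = token.rstrip(".,")            # drop trailing sentence punctuation
        core = token.lstrip("$").rstrip("%")  # bare number without $ or %
        if not core or not core[0].isdigit():
            continue
        try:
            float(core)
        except ValueError:
            continue
        found.add(token)
    return found


def passes_filters(summary, transcript, key_points, min_recall=2):
    """Apply the single-paragraph, precision, and recall checks in order."""
    paragraphs = [p for p in summary.split("\n") if p.strip()]
    if len(paragraphs) != 1:
        return False
    summary_numbers = numeric_tokens(summary)
    # Precision: at least one number, and every number must appear in the transcript
    if not summary_numbers or not summary_numbers <= numeric_tokens(transcript):
        return False
    # Recall: at least `min_recall` numbers shared with the Key Points
    return len(summary_numbers & numeric_tokens(key_points)) >= min_recall


# Toy data, invented for this sketch
transcript = "Revenue rose 31% to $5.9 billion. Non-GAAP diluted EPS was $2.09. Liquidity was $3.1 billion."
key_points = "q2 revenue rose 31% to $5.9 billion. qtrly non-gaap diluted eps $2.09."
good = "Revenue increased 31% to $5.9 billion and EPS reached $2.09."
hallucinated = "Revenue increased 45% to $5.9 billion and EPS reached $2.09."
two_paragraphs = "Revenue increased 31% to $5.9 billion.\n\nEPS reached $2.09."

for candidate in (good, hallucinated, two_paragraphs):
    print(passes_filters(candidate, transcript, key_points))
```

Only the first candidate passes: the second contains a number that is absent from the transcript, and the third spans two paragraphs.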

| filter         | train    | val     | test   |
|----------------| -------- | ------- | ------ |
| all            | 641      | 171     | 179    |
| 1 paragraph    | 610      | 168     | 171    |
| 100% precision | 530      | 145     | 145    |
| recall ≥ 2     | 200      | 50      | 50     |
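To see how much each filter removes, the counts in the table above can be turned into retention rates. The short sketch below uses only the numbers from the table; the names `counts`, `stages`, and `retention` are introduced here for illustration.

```python
# Counts copied from the table above: summaries remaining after each filter
counts = {
    "train": [641, 610, 530, 200],
    "val": [171, 168, 145, 50],
    "test": [179, 171, 145, 50],
}
stages = ["1 paragraph", "100% precision", "recall >= 2"]


def retention(fold_counts):
    """Fraction of summaries surviving each filter, relative to the previous stage."""
    return [after / before for before, after in zip(fold_counts, fold_counts[1:])]


for fold, fold_counts in counts.items():
    steps = ", ".join(
        f"{name}: {rate:.1%}" for name, rate in zip(stages, retention(fold_counts))
    )
    overall = fold_counts[-1] / fold_counts[0]
    print(f"{fold:5s} {steps} | overall: {overall:.1%}")
```

In every fold the recall filter removes the largest share of summaries, and roughly 28% to 31% of the generated summaries survive all three filters.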

In [2]:
# !pip install google-generativeai
import google.generativeai as genai
import random
from dotenv import load_dotenv
import os
import time
env = load_dotenv(override = True)

In [3]:
genai.configure(api_key = os.environ["API_KEY"])
model = genai.GenerativeModel("gemini-pro")

In [173]:
fold = "val"
ect_filenames = os.listdir(f"data/final/{fold}/ects")
gemini_filenames = os.listdir(f"data/final/{fold}/gemini_summaries")
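# Transcripts that do not yet have a Gemini summary
unused_filenames = list(set(ect_filenames).difference(set(gemini_filenames)))
# Sample without replacement so the same file cannot be picked twice when k > 1;
# min() avoids an error once every transcript has been summarized
chosen_filenames = random.sample(unused_filenames, k = min(1, len(unused_filenames)))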

ects = []
summaries = []

for filename in chosen_filenames:
    with open(f"data/final/{fold}/ects/{filename}") as f:
        ects.append(f.read().strip())
    with open(f"data/final/{fold}/gt_summaries/{filename}") as f:
        summaries.append(f.read().strip())

prompt = "You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript.\n\nBelow is an earnings call transcript. Please summarize this transcript in exactly one paragraph using complete sentences. Keep the summary below 300 words. It is very important that you do not use any titles in the summary. Include relevant information and statistics from the Earnings Call Transcript in your summary. Furthermore, it is very important that you incorporate all the information and statistics from the Key Points and spread it out throughout your summary."
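for i, filename in enumerate(chosen_filenames):
    # Skip transcripts that already have a saved Gemini summary
    if os.path.isfile(f"data/final/{fold}/gemini_summaries/{filename}"):
        continue
    response = model.generate_content(f"{prompt}\n\nEarnings Call Transcript:\n{ects[i]}\n\nKey Points:\n{summaries[i]}\n")
    try:
        gemini_summary = response.text
    except ValueError:
        # response.text raises ValueError when the response has no text part (e.g., a blocked response)
        print(f"No text returned for {filename}; skipping")
        continue
    with open(f"data/final/{fold}/gemini_summaries/{filename}", "w") as f:
        f.write(gemini_summary)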

In [183]:
fold = "test"
gemini_filenames = os.listdir(f"data/final/{fold}/gemini_summaries")

def is_float(s):
    # True if s parses as a float, e.g. "2.09" or "31"
    try:
        float(s)
        return True
    except ValueError:
        return False

def extract_statistic(s):
    # The string should not be empty
    if len(s) == 0:
        return None
    # A bare number counts unless it looks like a recent year (2015 through 2025)
    if is_float(s) and (float(s) < 2015 or float(s) > 2025):
        return (float(s), None)
    # In the form of $[valid number]
    if s[0] == "$" and is_float(s[1:]):
        return (float(s[1:]), "$")
    # Or in the form of [valid number]%
    if s[-1] == "%" and is_float(s[:-1]):
        return (float(s[:-1]), "%")
    return None

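def extract_numbers(summary):
    # Split on newlines and spaces, dropping empty tokens
    words = [s for line in summary.split("\n") for s in line.split(" ")]
    words = list(filter(len, words))
    # Strip a single trailing period or comma, e.g. "$2.09." -> "$2.09"
    words = list(map(lambda s: s[:-1] if s[-1] in [".", ","] else s, words))
    numbers = set(map(extract_statistic, words))
    # discard() instead of remove(): no KeyError if every token was a statistic
    numbers.discard(None)
    return list(numbers)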

ranking = []
for filename in gemini_filenames:
    if not filename.endswith(".txt"):
        continue
    with open(f"./data/final/{fold}/gemini_summaries/{filename}") as f:
        gemini_summary = f.read().strip()
        paragraph_count = len(list(filter(len, gemini_summary.split("\n"))))
        gemini_numbers = extract_numbers(gemini_summary)
        # print(gemini_numbers)
        with open(f"./data/final/{fold}/gt_summaries/{filename}") as g:
            summary = g.read().strip()
            # print(summary)
            summary_numbers = extract_numbers(summary)
            # print(summary_numbers)   
            summary_recall = 0
            for n in summary_numbers:
                if n in gemini_numbers:
                    summary_recall += 1
            with open(f"./data/final/{fold}/ects/{filename}") as h:
                ect = h.read().strip()
                ect_numbers = extract_numbers(ect)
                ect_precision = 0
                for n in gemini_numbers:
                    if n in ect_numbers:
                        ect_precision += 1
                # if filename == "MSI_q3_2021.txt":
                #     print(gemini_numbers, ect_numbers)
                ranking.append((filename, paragraph_count, (summary_recall, len(summary_numbers)), (ect_precision, len(gemini_numbers))))

print(len(ranking))
ranking = list(filter(lambda r: r[1] == 1, ranking))
print("Only keep single paragraph:", len(ranking))
ranking = list(filter(lambda r: r[3][1] > 0 and r[3][0] == r[3][1], ranking))
print("Only keep 100% precision:", len(ranking))
ranking_2 = sorted(ranking, key = lambda r: -r[2][0])
ranking = list(filter(lambda r: r[2][0] >= 2, ranking))
ranking_filenames = list(map(lambda r: r[0], ranking))
# print(ranking_filenames)
print("Only keep recall >= 2:", len(ranking))
for r in ranking_2:
    print(r)

179
Only keep single paragraph: 171
Only keep 100% precision: 145
Only keep recall >= 2: 50
('KFY_q2_2022.txt', 1, (8, 9), (10, 10))
('JBL_q1_2022.txt', 1, (7, 11), (8, 8))
('EW_q2_2021.txt', 1, (5, 20), (12, 12))
('HZO_q1_2021.txt', 1, (5, 9), (8, 8))
('BAH_q3_2022.txt', 1, (5, 8), (11, 11))
('MSI_q3_2021.txt', 1, (5, 10), (10, 10))
('GHM_q2_2021.txt', 1, (5, 8), (9, 9))
('CLX_q1_2021.txt', 1, (5, 10), (8, 8))
('VSH_q2_2021.txt', 1, (4, 6), (11, 11))
('CLX_q4_2021.txt', 1, (4, 7), (11, 11))
('LAZ_q2_2021.txt', 1, (4, 7), (9, 9))
('AAN_q4_2020.txt', 1, (4, 5), (16, 16))
('MMS_q4_2021.txt', 1, (4, 6), (5, 5))
('NX_q1_2021.txt', 1, (4, 9), (9, 9))
('KWR_q1_2020.txt', 1, (4, 9), (10, 10))
('SYK_q3_2021.txt', 1, (4, 5), (7, 7))
('WSO_q2_2021.txt', 1, (3, 4), (8, 8))
('ITT_q2_2021.txt', 1, (3, 4), (9, 9))
('SEM_q4_2021.txt', 1, (3, 5), (14, 14))
('GES_q2_2022.txt', 1, (3, 3), (6, 6))
('FSS_q2_2021.txt', 1, (3, 5), (9, 9))
('HII_q1_2021.txt', 1, (3, 3), (16, 16))
('NWN_q1_2021.txt', 1, (3, 3