## CPSC 477 Final Project Part 1: Augmenting ECTSum With Gemini


Ideas: keep the first 4,000 and last 4,000 characters of each transcript, rerun Gemini, and process transcripts in order of increasing length.

We used Gemini 1.0 Pro and Google's Generative AI API to augment the summaries in ECTSum. We noticed that the provided bullet-point style summaries in ECTSum were too concise and failed to capture the plethora of useful information in the earnings call transcripts.

For example, this was the provided summary in ECTSum for `SNX_q2_2021.txt`.

```
synnex corporation sees q3 non-gaap earnings per share $1.90 to $2.10.
sees q3 non-gaap earnings per share $1.90 to $2.10.
sees q3 revenue $4.95 billion to $5.45 billion.
qtrly non-gaap diluted earnings per share from continuing operations $2.09.
q2 revenue rose 31 percent to $5.9 billion.
```

To solve this problem, we prompted Gemini to summarize the earnings call transcripts. Furthermore, we emphasized that it should incorporate all the information in the provided summary (which we referred to as the *Key Points* in the prompt). 

Here is the prompt that we used:

```
You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript.

Below is an earnings call transcript. Please summarize this transcript in exactly one paragraph using complete sentences. Keep the summary below 300 words. It is very important that you do not use any titles in the summary. Include relevant information and statistics from the Earnings Call Transcript in your summary. Furthermore, it is very important that you incorporate all the information and statistics from the Key Points and spread it out throughout your summary.

Earnings Call Transcript:
[ect]

Key Points:
[summary]
```
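
As a minimal sketch of how the `[ect]` and `[summary]` placeholders might be filled, the snippet below builds the prompt string with plain string formatting. The names `PROMPT_TEMPLATE`, `truncate_middle`, and `build_prompt` are hypothetical, the instruction text is abbreviated, and the placeholders are written as `{ect}` and `{summary}` so that `str.format` can fill them. The default 8,000-character budget mirrors the `max_ect_length` setting in the notebook cells further down, where long transcripts are cut to their first and last 4,000 characters.

```python
# Hypothetical names throughout; the instruction text is abbreviated.
PROMPT_TEMPLATE = (
    "Please summarize this transcript in exactly one paragraph.\n\n"
    "Earnings Call Transcript:\n{ect}\n\n"
    "Key Points:\n{summary}\n"
)

def truncate_middle(text, max_chars=8000):
    """Keep only the first and last max_chars // 2 characters of a long text."""
    half = max_chars // 2
    if len(text) <= max_chars or half == 0:
        return text
    return f"{text[:half]}...\n...{text[-half:]}"

def build_prompt(ect, key_points, max_chars=8000):
    """Fill the template with the (possibly truncated) transcript and key points."""
    return PROMPT_TEMPLATE.format(
        ect=truncate_middle(ect, max_chars),
        summary=key_points,
    )

# Toy input: a 20-character "transcript" with an 8-character budget.
example = build_prompt("a" * 10 + "b" * 10, "q2 revenue rose 31 percent.", max_chars=8)
print(example)
```

Using `str.format` on the template means the transcript and key points are inserted verbatim and are never re-scanned for placeholders.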

Here is the Gemini-augmented summary for the `SNX_q2_2021.txt` example shown above:

```
Synnex Corporation reported strong Q2 results, exceeding expectations with a 31% year-over-year revenue increase to $5.9 billion. Despite a lower gross margin, non-GAAP diluted earnings per share increased 68% to $2.09. The company's improved cash flow and available liquidity of $3.1 billion demonstrate financial strength. Regarding the proposed merger with Tech Data, regulatory approvals are progressing, and the transaction is expected to close in the second half of 2021. For Q3, Synnex forecasts revenue between $4.95 billion and $5.45 billion, with non-GAAP diluted earnings per share projected in the range of $1.90 to $2.10. While aware of supply chain challenges, the company remains optimistic about IT spending and expects tailwinds from market demand and enterprise project resumption.
```

After generating summaries with Gemini, we applied three quality filters in sequence and kept only the summaries that passed all of them.

For example, for the `test` fold, we generated 179 summaries. Of those, 171 were kept because they consisted of exactly one paragraph. Of those, 145 were kept because they contained at least one number and every number in the generated summary was also present in the full transcript (i.e., no hallucinated numbers). Of those, 50 were kept because they included at least 2 numbers from the *Key Points*. Finally, we added these 50 high-quality summaries to the `test` fold.

|                | train    | val     | test   |
|----------------| -------- | ------- | ------ |
| all            | 641      | 171     | 179    |
| 1 paragraph    | 610      | 168     | 171    |
| 100% precision | 530      | 145     | 145    |
| recall ≥ 2     | 200      | 50      | 50     |
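
The three filters can be sketched on toy inputs as follows. This is a simplified illustration rather than the scoring cell itself: `numbers_in` is a hypothetical stand-in for the `extract_numbers` helper defined in the notebook (it does not skip recent years or track `$`/`%` markers), and `passes_filters` and the sample strings are made up for the example.

```python
def numbers_in(text):
    """Collect numeric values from whitespace-separated tokens (simplified)."""
    values = set()
    for token in text.split():
        token = token.strip("$%.,").replace(",", "")
        if not token[:1].isdigit():
            continue  # skip words such as "q2" or "billion"
        try:
            values.add(float(token))
        except ValueError:
            pass  # e.g. "3rd" starts with a digit but is not a number
    return values

def passes_filters(generated, transcript, key_points, min_recall=2):
    # Filter 1: the summary must be exactly one non-empty paragraph.
    paragraphs = [p for p in generated.split("\n") if p.strip()]
    if len(paragraphs) != 1:
        return False
    generated_numbers = numbers_in(generated)
    # Filter 2: 100% precision, i.e. at least one number and none of them
    # missing from the transcript.
    if not generated_numbers or not generated_numbers <= numbers_in(transcript):
        return False
    # Filter 3: at least min_recall numbers shared with the key points.
    return len(numbers_in(key_points) & generated_numbers) >= min_recall

transcript = "Revenue rose 31% to $5.9 billion. Liquidity was $3.1 billion."
key_points = "q2 revenue rose 31 percent to $5.9 billion."
good = "Revenue increased 31% to $5.9 billion."
hallucinated = "Revenue increased 35% to $5.9 billion."

print(passes_filters(good, transcript, key_points))          # True
print(passes_filters(hallucinated, transcript, key_points))  # False
```

The second summary is rejected at the precision step because 35 never appears in the transcript.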

In [1]:
# !pip install google-generativeai
import google.generativeai as genai
import random
from dotenv import load_dotenv
import os
import csv
import pandas as pd
env = load_dotenv(override = True)



In [2]:
genai.configure(api_key = os.environ["API_KEY"])
model = genai.GenerativeModel("gemini-pro")
max_ect_length = 8000

In [38]:
# Utility functions

def is_float(s):
    try: 
        float(s)
        return True
    except ValueError:
        return False

def extract_statistic(s):
    # The string should not be empty
    s = s.replace(",", "") # Remove commas
    if len(s) == 0:
        return None
    # It could be a valid number that is not a recent year (i.e., not in 2015-2025)
    if is_float(s) and (float(s) < 2015 or float(s) > 2025):
        return (float(s), None)
    # In the form of $[valid number]
    if s[0] == "$" and is_float(s[1:]):
        return (float(s[1:]), "$")
    # Or in the form of [valid number]%
    if s[-1] == "%" and is_float(s[:-1]):
        return (float(s[:-1]), "%")
    return None

def extract_numbers(text):
    words = [s for line in text.split("\n") for s in line.split(" ")]
    words = list(filter(len, words))
    words = list(map(lambda s: s[:-1] if s[-1] in ["."] else s, words))
    numbers = set(map(extract_statistic, words))
    numbers.discard(None)  # discard() does not raise a KeyError when None is absent (e.g. an empty line)
    return list(numbers)

def extract_key_points(ect, summary):
    ect_numbers = []
    if len(ect) > max_ect_length:
        front = ect[:max_ect_length // 2]
        back = ect[-max_ect_length // 2:]
        ect_numbers = extract_numbers(f"{front}\n{back}")
    else:
        ect_numbers = extract_numbers(ect)
    key_points = ""
    key_points_numbers = []
    for line in summary.split("\n"):
        line_numbers = extract_numbers(line)
        ect_precision = 0
        for n in line_numbers:
            if n in ect_numbers:
                ect_precision += 1
        # print(len(ect), ect_precision, len(line_numbers))
        if ect_precision > 0 and ect_precision == len(line_numbers):
            key_points += line + "\n"
            key_points_numbers.extend(line_numbers)
    return (key_points, list(set(key_points_numbers)))
            
            


In [49]:
fold = "train"
ect_filenames = os.listdir(f"ectsum/final/{fold}/ects")
gemini_filenames = os.listdir(f"ectsum/final/{fold}/gemini_summaries")
unused_filenames = list(set(ect_filenames).difference(set(gemini_filenames)))
if ".ipynb_checkpoints" in unused_filenames:
    unused_filenames.remove(".ipynb_checkpoints")

entries = []
N = 20 # Number of summaries to generate with Gemini

for filename in unused_filenames:
    ect = ""
    with open(f"ectsum/final/{fold}/ects/{filename}") as f:
        ect = f.read().strip()
    summary = ""
    with open(f"ectsum/final/{fold}/gt_summaries/{filename}") as f:
        summary = f.read().strip()
    entries.append((filename, ect, summary))

# Sort entries in increasing order by transcript length
entries = sorted(entries, key = lambda e: len(e[1]))    

prompt = "You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript.\n\nBelow is an earnings call transcript. Please summarize this transcript in exactly one paragraph using complete sentences. Keep the summary below 300 words. It is very important that you do not use any titles in the summary. Include relevant information and statistics from the Earnings Call Transcript in your summary. Furthermore, it is very important that you incorporate all the information and statistics from the Key Points and spread it out throughout your summary."
for filename, ect, summary in entries:
    if N == 0:
        break
    if os.path.isfile(f"ectsum/final/{fold}/gemini_summaries/{filename}"):
        continue
    # extract_key_points returns (text, numbers); only the text belongs in the prompt
    key_points, key_point_numbers = extract_key_points(ect, summary)
    if len(key_point_numbers) < 2:
        continue
    ect_length = len(ect)
    if ect_length > max_ect_length:
        ect = f"{ect[:max_ect_length // 2]}...\n...{ect[-max_ect_length // 2:]}"
    response = model.generate_content(f"{prompt}\n\nEarnings Call Transcript:\n{ect}\n\nKey Points:\n{key_points}\n")
    gemini_summary = response.text
    print(ect_length, len(key_point_numbers), filename)
    N -= 1
    with open(f"ectsum/final/{fold}/gemini_summaries/{filename}", "w") as f:
        f.write(gemini_summary)
    

15253 3 CBT_q1_2021.txt
15278 3 ICE_q3_2021.txt
15286 4 CHD_q3_2020.txt
15310 6 B_q3_2020.txt
15310 2 CBT_q4_2021.txt
15387 4 PPL_q3_2020.txt
15397 4 IEX_q3_2021.txt
15408 2 NHI_q4_2019.txt
15440 8 WST_q4_2020.txt
15523 5 FLT_q1_2021.txt
15555 4 IEX_q1_2021.txt
15559 3 LPX_q4_2020.txt
15607 2 DHR_q2_2021.txt
15618 2 WRB_q1_2021.txt
15669 2 EQR_q2_2021.txt
15697 3 BG_q2_2021.txt
15701 2 WMS_q4_2021.txt
15708 9 A_q2_2021.txt
15756 2 MSM_q1_2021.txt
15851 2 PNW_q1_2020.txt


In [52]:
fold = "train"
gemini_filenames = os.listdir(f"ectsum/final/{fold}/gemini_summaries")

def is_float(s):
    try: 
        float(s)
        return True
    except ValueError:
        return False

def extract_statistic(s):
    # The string should not be empty
    if len(s) == 0:
        return None
    # It could be a valid number that is not a recent year (i.e., not in 2015-2025)
    if is_float(s) and (float(s) < 2015 or float(s) > 2025):
        return (float(s), None)
    # In the form of $[valid number]
    if s[0] == "$" and is_float(s[1:]):
        return (float(s[1:]), "$")
    # Or in the form of [valid number]%
    if s[-1] == "%" and is_float(s[:-1]):
        return (float(s[:-1]), "%")
    return None

def extract_numbers(summary):
    words = [s for line in summary.split("\n") for s in line.split(" ")]
    words = list(filter(len, words))
    words = list(map(lambda s: s[:-1] if s[-1] in [".", ","] else s, words))
    numbers = set(map(extract_statistic, words))
    numbers.discard(None)  # discard() does not raise a KeyError when None is absent (e.g. an empty line)
    return list(numbers)

ranking = []
for filename in gemini_filenames:
    if ".txt" not in filename:
        continue
    with open(f"./ectsum/final/{fold}/gemini_summaries/{filename}") as f:
        gemini_summary = f.read().strip()
        paragraph_count = len(list(filter(len, gemini_summary.split("\n"))))
        gemini_numbers = extract_numbers(gemini_summary)
        # print(gemini_numbers)
        with open(f"./ectsum/final/{fold}/gt_summaries/{filename}") as g:
            summary = g.read().strip()
            # print(summary)
            summary_numbers = extract_numbers(summary)
            # print(summary_numbers)
            summary_recall = 0
            for n in summary_numbers:
                if n in gemini_numbers:
                    summary_recall += 1
            with open(f"./ectsum/final/{fold}/ects/{filename}") as h:
                ect = h.read().strip()
                ect_numbers = extract_numbers(ect)
                ect_precision = 0
                for n in gemini_numbers:
                    if n in ect_numbers:
                        ect_precision += 1
                # if filename == "MSI_q3_2021.txt":
                #     print(gemini_numbers, ect_numbers)
                ranking.append((filename, paragraph_count, (summary_recall, len(summary_numbers)), (ect_precision, len(gemini_numbers))))

print(len(ranking))
ranking = list(filter(lambda r: r[1] == 1, ranking))
print("Only keep single paragraph:", len(ranking))
ranking = list(filter(lambda r: r[3][1] > 0 and r[3][0] == r[3][1], ranking))
print("Only keep 100% precision:", len(ranking))
ranking_2 = sorted(ranking, key = lambda r: -r[2][0])
ranking = list(filter(lambda r: r[2][0] >= 2, ranking))
ranking_filenames = list(map(lambda r: r[0], ranking))
# print(ranking_filenames)
print("Only keep recall >= 2:", len(ranking))
for r in ranking_2:
    print(r)

336
Only keep single paragraph: 335
Only keep 100% precision: 302
Only keep recall >= 2: 200
('AME_q3_2021.txt', 1, (9, 10), (18, 18))
('DRI_q2_2021.txt', 1, (8, 9), (12, 12))
('JBL_q3_2021.txt', 1, (7, 12), (14, 14))
('UNF_q1_2022.txt', 1, (7, 8), (12, 12))
('DRI_q1_2022.txt', 1, (6, 7), (10, 10))
('DGX_q3_2021.txt', 1, (6, 10), (6, 6))
('FN_q1_2022.txt', 1, (6, 8), (7, 7))
('UNF_q2_2021.txt', 1, (6, 7), (9, 9))
('COF_q2_2021.txt', 1, (6, 9), (10, 10))
('SNX_q2_2021.txt', 1, (6, 7), (12, 12))
('WST_q4_2020.txt', 1, (6, 8), (8, 8))
('CCK_q3_2021.txt', 1, (6, 6), (10, 10))
('AXL_q3_2021.txt', 1, (5, 9), (10, 10))
('NX_q2_2021.txt', 1, (5, 7), (14, 14))
('ITW_q4_2019.txt', 1, (5, 7), (11, 11))
('MDU_q2_2021.txt', 1, (5, 5), (5, 5))
('VPG_q1_2020.txt', 1, (5, 11), (7, 7))
('KFY_q3_2022.txt', 1, (5, 7), (9, 9))
('MDC_q4_2020.txt', 1, (5, 11), (10, 10))
('PAY_q3_2021.txt', 1, (5, 7), (13, 13))
('VMI_q3_2021.txt', 1, (5, 10), (6, 6))
('HCA_q1_2021.txt', 1, (5, 7), (12, 12))
('NSP_q2_2021.txt

In [7]:
filenames = {}
filenames["train"] = ['KBH_q2_2021.txt', 'HVT_q2_2021.txt', 'ITT_q3_2020.txt', 'MFA_q3_2020.txt', 'CL_q1_2021.txt', 'MPC_q4_2021.txt', 'SPG_q4_2021.txt', 'CNP_q3_2021.txt', 'SKX_q1_2021.txt', 'WU_q1_2021.txt', 'EXR_q1_2021.txt', 'TUP_q3_2021.txt', 'TTC_q3_2021.txt', 'AVD_q2_2020.txt', 'APH_q2_2021.txt', 'PDM_q4_2021.txt', 'M_q3_2021.txt', 'DXC_q1_2022.txt', 'CNS_q4_2020.txt', 'ENR_q2_2021.txt', 'VSH_q3_2021.txt', 'DKS_q1_2021.txt', 'ABM_q4_2020.txt', 'ELY_q3_2020.txt', 'BR_q3_2021.txt', 'SAH_q1_2021.txt', 'KMI_q2_2021.txt', 'PHM_q3_2021.txt', 'TEN_q3_2021.txt', 'XYL_q1_2021.txt', 'LOW_q4_2021.txt', 'CPF_q2_2021.txt', 'DGX_q3_2021.txt', 'BDN_q3_2021.txt', 'DLB_q4_2020.txt', 'AVA_q2_2021.txt', 'OMI_q2_2021.txt', 'B_q4_2020.txt', 'HP_q1_2021.txt', 'FCN_q3_2020.txt', 'PEG_q2_2021.txt', 'TRI_q4_2021.txt', 'SMP_q3_2020.txt', 'WCN_q2_2021.txt', 'BDX_q3_2021.txt', 'UAA_q4_2021.txt', 'PWR_q3_2021.txt', 'PZN_q3_2020.txt', 'AEO_q1_2021.txt', 'IEX_q1_2021.txt', 'BRO_q3_2020.txt', 'AZO_q1_2022.txt', 'A_q3_2021.txt', 'MTN_q3_2021.txt', 'MTN_q2_2021.txt', 'PPG_q4_2021.txt', 'MAN_q4_2020.txt', 'CHE_q3_2021.txt', 'ALL_q3_2021.txt', 'SIG_q2_2022.txt', 'SIG_q3_2022.txt', 'AVA_q4_2020.txt', 'FCN_q4_2021.txt', 'OMC_q2_2021.txt', 'B_q2_2021.txt', 'KFY_q4_2021.txt', 'RTX_q2_2021.txt', 'WGO_q2_2021.txt', 'WM_q2_2021.txt', 'WFC_q3_2021.txt', 'RSG_q2_2021.txt', 'THG_q3_2020.txt', 'AAP_q2_2021.txt', 'AAP_q3_2021.txt', 'AAN_q1_2021.txt', 'CHE_q2_2020.txt', 'EOG_q2_2021.txt', 'TRI_q3_2021.txt', 'SFE_q4_2019.txt', 'SPB_q2_2021.txt', 'SNX_q2_2021.txt', 'WAT_q1_2021.txt', 'UNF_q4_2021.txt', 'PEG_q4_2020.txt', 'BRO_q4_2020.txt', 'KBH_q4_2021.txt', 'LII_q1_2021.txt', 'CMG_q3_2021.txt', 'GCO_q3_2022.txt', 'WNC_q1_2021.txt', 'VGR_q3_2020.txt', 'EAT_q1_2022.txt', 'TMO_q4_2021.txt', 'JBL_q2_2022.txt', 'CNS_q2_2021.txt', 'G_q3_2021.txt', 'EW_q4_2021.txt', 'AME_q3_2020.txt', 'AME_q2_2021.txt', 'AYI_q4_2021.txt', 'BC_q1_2021.txt', 'FLR_q1_2021.txt', 'APH_q4_2020.txt', 'OXM_q1_2021.txt', 'CLH_q4_2020.txt', 
'NFG_q3_2020.txt', 'PII_q2_2021.txt', 'CTS_q2_2021.txt', 'PEG_q1_2021.txt', 'WEC_q2_2021.txt', 'WEC_q3_2021.txt', 'ZTS_q1_2021.txt', 'PKG_q2_2021.txt', 'HBI_q2_2020.txt', 'WWW_q1_2021.txt', 'WAT_q4_2020.txt', 'GTY_q4_2020.txt', 'PPL_q2_2021.txt', 'PKG_q2_2020.txt', 'TNC_q4_2020.txt', 'AEO_q3_2021.txt', 'RPM_q4_2021.txt', 'HUM_q3_2021.txt', 'SWI_q3_2021.txt', 'MTZ_q4_2020.txt', 'SYK_q4_2020.txt', 'FCN_q1_2021.txt', 'TOL_q3_2021.txt', 'CMA_q1_2021.txt', 'VMI_q3_2021.txt', 'SKT_q3_2021.txt', 'MAS_q3_2021.txt', 'MAS_q2_2021.txt', 'POR_q1_2021.txt', 'YUM_q2_2021.txt', 'UNF_q1_2022.txt', 'BHE_q2_2020.txt', 'RS_q1_2021.txt', 'LW_q2_2021.txt', 'TDY_q3_2021.txt', 'WSM_q4_2021.txt', 'FLR_q4_2020.txt', 'CIR_q4_2020.txt', 'ABC_q3_2021.txt', 'RPT_q3_2021.txt', 'DRI_q2_2021.txt', 'DRI_q3_2021.txt', 'RWT_q3_2021.txt', 'WSO_q1_2021.txt', 'FR_q4_2019.txt', 'CBT_q4_2021.txt', 'HZO_q4_2021.txt', 'MFA_q1_2020.txt', 'HCA_q1_2021.txt', 'IDA_q4_2019.txt', 'CUZ_q3_2021.txt', 'CF_q4_2020.txt', 'CL_q2_2021.txt', 'FMC_q1_2021.txt', 'NX_q1_2022.txt', 'AAT_q3_2021.txt', 'OGE_q1_2021.txt', 'RHI_q1_2021.txt', 'BC_q2_2021.txt', 'ANF_q2_2021.txt', 'MSI_q2_2021.txt', 'CIM_q2_2020.txt', 'WWE_q4_2020.txt', 'VZ_q3_2021.txt', 'WSR_q3_2021.txt', 'SXT_q3_2020.txt', 'PBI_q2_2021.txt', 'EMR_q2_2021.txt', 'KRG_q3_2021.txt', 'JLL_q2_2021.txt', 'HII_q2_2021.txt', 'DG_q1_2021.txt', 'JCI_q1_2022.txt', 'BMI_q1_2021.txt', 'CHH_q3_2021.txt', 'OI_q2_2021.txt', 'IEX_q4_2020.txt', 'MAN_q1_2021.txt', 'IDA_q3_2021.txt', 'GTY_q3_2021.txt', 'DBD_q4_2019.txt', 'DOV_q1_2021.txt', 'DIS_q1_2022.txt', 'KO_q1_2021.txt', 'PPL_q4_2021.txt', 'TRV_q1_2021.txt', 'PRA_q2_2021.txt', 'WAB_q3_2021.txt', 'MED_q3_2021.txt', 'CHD_q3_2020.txt', 'SUN_q4_2020.txt', 'SKT_q4_2020.txt', 'FUL_q1_2021.txt', 'AN_q4_2021.txt', 'FE_q1_2021.txt']
filenames["val"] = ['CNS_q4_2021.txt', 'AIR_q2_2021.txt', 'AIR_q3_2021.txt', 'RS_q3_2020.txt', 'FN_q1_2020.txt', 'MDC_q3_2020.txt', 'ROG_q1_2021.txt', 'COO_q2_2021.txt', 'USB_q3_2021.txt', 'DLB_q2_2021.txt', 'MAN_q3_2021.txt', 'FTV_q3_2020.txt', 'UPS_q3_2021.txt', 'RGR_q3_2021.txt', 'MPC_q2_2021.txt', 'SJW_q1_2021.txt', 'MSI_q1_2021.txt', 'RS_q4_2020.txt', 'TXT_q4_2020.txt', 'WST_q2_2020.txt', 'COO_q1_2022.txt', 'CTS_q3_2020.txt', 'NSC_q3_2021.txt', 'MYE_q3_2021.txt', 'MMC_q3_2021.txt', 'NWN_q4_2020.txt', 'SSD_q4_2020.txt', 'NSP_q3_2020.txt', 'WOR_q1_2022.txt', 'SNV_q3_2020.txt', 'CBT_q4_2020.txt', 'CPT_q3_2021.txt', 'EXR_q3_2021.txt', 'LCI_q4_2021.txt', 'FLR_q2_2021.txt', 'OXM_q3_2021.txt', 'WHG_q1_2020.txt', 'EMR_q3_2021.txt', 'FSS_q1_2021.txt', 'CL_q4_2020.txt', 'SIX_q1_2021.txt', 'IPG_q2_2021.txt', 'AFG_q3_2021.txt', 'TFX_q2_2020.txt', 'ROL_q1_2021.txt', 'GHM_q3_2022.txt', 'NJR_q1_2021.txt', 'BG_q1_2021.txt', 'TFX_q3_2021.txt', 'ABM_q1_2021.txt']
filenames["test"] = ['WSO_q2_2021.txt', 'ITT_q2_2021.txt', 'SEM_q4_2021.txt', 'ITW_q4_2021.txt', 'EW_q2_2021.txt', 'VSH_q2_2021.txt', 'FE_q4_2020.txt', 'CLX_q4_2021.txt', 'HRL_q2_2021.txt', 'KFY_q2_2022.txt', 'GES_q2_2022.txt', 'FE_q3_2020.txt', 'NNN_q1_2021.txt', 'TRV_q3_2021.txt', 'LAZ_q2_2021.txt', 'HZO_q1_2021.txt', 'FSS_q2_2021.txt', 'HII_q1_2021.txt', 'NWN_q1_2021.txt', 'AVD_q4_2020.txt', 'SCL_q4_2020.txt', 'AAN_q4_2020.txt', 'NNN_q4_2020.txt', 'BAH_q3_2022.txt', 'FDP_q3_2020.txt', 'DNB_q2_2021.txt', 'BHE_q3_2020.txt', 'SCI_q4_2020.txt', 'MMS_q4_2021.txt', 'NX_q1_2021.txt', 'LPX_q1_2021.txt', 'AIR_q1_2022.txt', 'JLL_q4_2021.txt', 'S_q3_2022.txt', 'DRI_q2_2022.txt', 'WU_q3_2021.txt', 'JBL_q1_2022.txt', 'MAC_q3_2020.txt', 'MSI_q3_2021.txt', 'OXM_q2_2021.txt', 'DNB_q4_2020.txt', 'GHM_q2_2021.txt', 'TNC_q3_2021.txt', 'GHL_q1_2021.txt', 'UNM_q2_2021.txt', 'KWR_q1_2020.txt', 'CLX_q1_2021.txt', 'NI_q3_2021.txt', 'SYK_q3_2021.txt', 'HXL_q1_2021.txt']
for fold in filenames:
    for filename in filenames[fold]:
        if not os.path.isfile(f"dataset/{fold}/ects/{filename}"):
            ect = ""
            with open(f"ectsum/final/{fold}/ects/{filename}") as f:
                ect = f.read()
            with open(f"dataset/{fold}/ects/{filename}", "w") as f:
                f.write(ect)
        if not os.path.isfile(f"dataset/{fold}/gemini_summaries/{filename}"):
            gemini_summary = ""
            with open(f"ectsum/final/{fold}/gemini_summaries/{filename}") as f:
                gemini_summary = f.read()
            with open(f"dataset/{fold}/gemini_summaries/{filename}", "w") as f:
                f.write(gemini_summary)


In [13]:
filenames = {}
filenames["train"] = os.listdir("dataset/train/gemini_summaries")
filenames["val"] = os.listdir("dataset/val/gemini_summaries")
filenames["test"] = os.listdir("dataset/test/gemini_summaries")
for fold in filenames:
    if ".ipynb_checkpoints" in filenames[fold]:        
        filenames[fold].remove(".ipynb_checkpoints")
        
# Earnings call transcripts:
#   - Train: dataset/train/ects/[filename]
#   - Val: dataset/val/ects/[filename]
#   - Test: dataset/test/ects/[filename]
# Gemini summaries:
#   - Train: dataset/train/gemini_summaries/[filename]
#   - Val: dataset/val/gemini_summaries/[filename]
#   - Test: dataset/test/gemini_summaries/[filename]
print(len(filenames["train"]), len(filenames["val"]), len(filenames["test"]))

for fold in filenames:
    if not os.path.isfile(f"dataset/{fold}.csv"):
        # newline="" lets the csv module control line endings, as its documentation recommends
        with open(f"dataset/{fold}.csv", "w", newline="") as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames = ["filename", "ect", "gemini_summary"])
            entries = []
            for filename in filenames[fold]:
                entry = {}
                entry["filename"] = filename
                with open(f"dataset/{fold}/ects/{filename}") as ect_file:
                    entry["ect"] = ect_file.read()
                gemini_summary = ""
                with open(f"dataset/{fold}/gemini_summaries/{filename}") as gemini_file:
                    entry["gemini_summary"] = gemini_file.read()
                entries.append(entry)
            writer.writeheader()
            writer.writerows(entries)
                

200 50 50


In [24]:
df = pd.read_csv("dataset/train.csv")
print(sorted(list(df["ect"].str.len()))) # Transcript lengths

[3653, 6077, 6461, 6664, 7000, 7125, 7432, 8341, 8893, 8989, 9060, 9290, 9817, 9926, 10082, 10440, 10461, 10468, 10503, 10629, 10630, 10649, 10671, 10783, 10807, 10877, 11053, 11131, 11292, 11393, 11393, 11526, 11551, 11620, 11713, 11799, 12291, 12570, 13103, 13158, 13170, 13189, 13310, 13357, 13393, 13460, 13701, 13710, 13731, 13868, 13921, 13928, 13933, 13958, 14131, 14195, 14205, 14210, 14477, 14481, 14568, 14581, 14651, 14759, 14907, 15046, 15116, 15207, 15235, 15300, 15314, 15419, 15449, 15545, 15689, 15996, 16062, 16093, 16099, 16127, 16260, 16313, 16363, 16397, 16639, 16734, 16789, 16905, 17015, 17022, 17075, 17090, 17267, 17415, 17420, 17444, 17615, 17631, 17661, 17732, 17772, 17832, 17875, 17965, 17973, 17980, 17985, 18034, 18058, 18085, 18136, 18194, 18484, 18647, 18690, 18847, 18868, 18895, 18971, 18989, 19046, 19205, 19207, 19251, 19305, 19335, 19450, 19546, 19627, 19679, 20142, 20486, 20492, 20546, 20689, 20769, 20873, 20921, 20986, 21021, 21045, 21054, 21094, 21241, 21241

In [17]:
import json

fold = "train"
df = pd.read_csv(f"dataset/{fold}.csv")
my_json = []
prompt_1 = "You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript."
prompt_2 = "Please summarize this transcript in exactly one paragraph using complete sentences. Keep the summary below 300 words. It is very important that you do not use any titles in the summary. Include relevant information and statistics from the Earnings Call Transcript in your summary."
for index, row in df.iterrows():
    entry = {}
    entry["filename"] = row["filename"]
    entry["text"] = f"<s>[INST] {prompt_1}\n\nEarnings Call Transcript:\n {row['ect'][:8000]} {prompt_2} [/INST] # {row['gemini_summary']}</s>" 
    my_json.append(entry)

with open(f"dataset/{fold}.jsonl", "w") as f:
    for entry in my_json:
        json.dump(entry, f)
        f.write("\n")