# Earnings Call Transcripts Sentiment

For: Tan Cheen Hao!

The transcripts are already provided per company per quarter, so no aggregation is needed.

In its most basic form, the output should be a CSV file in the format below (ideally ordered by `quarter_year`, then by `ticker`, though the order is not essential). `transcript_sentiment` should take values either from 0 to 1, where the value roughly represents the probability of a positive sentiment, or from -1 to 1, where -1 is negative and 1 is positive. The choice is yours, but *make it clear in a markdown cell at the end.*


| ticker | quarter_year  | transcript_sentiment |
|--------|---------------|----------------------|
| BAC    | Q1 2001       | 0.2                  |
| JPM    | Q1 2001       | 0.67                 |
| WFC    | Q1 2001       | 0.97                 |
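
The ordering and scale conventions above can be sketched as follows. This is a minimal sketch, not a required implementation: the helper names `to_unit_interval` and `sort_by_quarter_then_ticker` are hypothetical, and the rows in `example` are made-up values used only to show the ordering. One thing worth noting is that sorting the `quarter_year` strings directly would place `Q1 2002` before `Q2 2001`, so the sketch parses them into quarterly periods first.

```python
import pandas as pd

def to_unit_interval(score: float) -> float:
    """Map a score from the [-1, 1] convention to the [0, 1] convention."""
    return (score + 1.0) / 2.0

def sort_by_quarter_then_ticker(df: pd.DataFrame) -> pd.DataFrame:
    """Sort chronologically by quarter, then alphabetically by ticker."""
    # "Q1 2001" -> "2001Q1", which pandas can parse as a quarterly period.
    parts = df["quarter_year"].str.split(" ", expand=True)
    period = pd.PeriodIndex(list(parts[1] + parts[0]), freq="Q")
    return (
        df.assign(_period=period)
        .sort_values(["_period", "ticker"])
        .drop(columns="_period")
        .reset_index(drop=True)
    )

# Made-up rows, only to illustrate the ordering.
example = pd.DataFrame({
    "ticker": ["WFC", "BAC", "JPM", "BAC"],
    "quarter_year": ["Q1 2001", "Q2 2001", "Q1 2001", "Q1 2002"],
    "transcript_sentiment": [0.97, 0.5, 0.67, 0.2],
})
ordered = sort_by_quarter_then_ticker(example)
```

Whichever scale is chosen, `ordered.to_csv(..., index=False)` would then write the rows in the intended order.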

Now, you could also explore using LLMs and prompt engineering to extract specific information from the text first. For example, you could use an LLM to separate company-specific information from market information, or ask it how "confident" the speaker is before extracting the sentiment.
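
One possible shape for the extraction step is sketched below. It only builds the prompt and parses a reply; the call to an actual LLM is deliberately left out, since no provider is specified here. The template wording, the three JSON keys, and the function names `build_extraction_prompt` and `parse_extraction_response` are all assumptions made for illustration.

```python
import json

# Hypothetical prompt wording; adjust freely.
PROMPT_TEMPLATE = """You are analysing an earnings call transcript excerpt.
Return a JSON object with exactly these keys:
  "company_specific": sentences about the company itself,
  "market_wide": sentences about the market or economy in general,
  "speaker_confidence": a number between 0 and 1.

Transcript excerpt:
{excerpt}
"""

def build_extraction_prompt(excerpt: str, max_chars: int = 4000) -> str:
    """Fill the template, truncating the excerpt to keep the prompt short."""
    return PROMPT_TEMPLATE.format(excerpt=excerpt[:max_chars])

def parse_extraction_response(raw: str) -> dict:
    """Parse the model's reply; fall back to None values if it is not a JSON object."""
    empty = {"company_specific": None, "market_wide": None, "speaker_confidence": None}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(parsed, dict):
        return empty
    # Keep only the expected keys; missing ones stay None.
    return {key: parsed.get(key) for key in empty}
```

The parsed fields could then be fed to the sentiment step instead of the raw transcript, for example scoring only the `company_specific` text.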

For earnings calls, instead of deciding whether the tone is positive or negative, you could also measure the degree of complexity or the degree of confidence. Also look into **aspect-based sentiment analysis**; it could be useful. Ideally, you should have two output files: one for revenue and one for CAR.
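
To make the aspect-based idea concrete, here is a deliberately crude sketch: it splits the text into sentences, assigns each sentence to any aspect whose keywords it mentions, and scores that aspect from positive and negative word counts. The aspects, keyword lists, and word lists are hypothetical and far too small for real use; a trained model would normally replace the counting step.

```python
import re

# Hypothetical, deliberately tiny keyword and sentiment word lists.
ASPECT_KEYWORDS = {
    "revenue": {"revenue", "sales"},
    "costs": {"cost", "costs", "expense", "expenses"},
}
POSITIVE_WORDS = {"grew", "strong", "improved"}
NEGATIVE_WORDS = {"declined", "weak", "fell"}

def aspect_sentiment(text: str) -> dict:
    """Score each aspect in [-1, 1]; None if no sentiment words accompany the aspect."""
    counts = {aspect: [0, 0] for aspect in ASPECT_KEYWORDS}  # [positive, negative]
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                counts[aspect][0] += len(words & POSITIVE_WORDS)
                counts[aspect][1] += len(words & NEGATIVE_WORDS)
    scores = {}
    for aspect, (pos, neg) in counts.items():
        scores[aspect] = (pos - neg) / (pos + neg) if pos + neg else None
    return scores

scores = aspect_sentiment("Revenue grew on strong demand. Cost control was weak this quarter.")
```

Each aspect score could become its own column, which is one way to produce separate signals from the same transcript.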

Be creative!

In [1]:
import pandas as pd
import numpy as np
import os
import json
from tqdm.auto import tqdm

In [7]:
# Directory containing the JSON files
json_folder_path = 'data/text/earning_call_transcripts'

# List to store transcripts
transcripts = []

# Loop through all files in the folder
for filename in tqdm(os.listdir(json_folder_path)):
    if filename.endswith(".json"):
        file_path = os.path.join(json_folder_path, filename)
        
        # Explicit encoding so the read does not depend on the platform default
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        # Combine all component texts into one document
        components = data.get("components", [])
        full_text = "[BREAK]".join(component["text"] for component in components if "text" in component)
        most_important_date = data.get("mostimportantdate", np.nan)
        company_id = data.get("companyid", np.nan)

        transcripts.append({
            "company_id": company_id,
            "date": most_important_date,
            "transcript": full_text,
        })

# Create DataFrame
transcripts_data = pd.DataFrame(transcripts)



In [8]:
print("first transcript date:", transcripts_data["date"].min())
print("last transcript date:", transcripts_data["date"].max())
print("number of transcripts:", len(transcripts_data))

first transcript date: 2006-10-19
last transcript date: 2025-02-05
number of transcripts: 3868
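
The target table is keyed by `quarter_year`, while the loaded frame only has a `date`. A minimal sketch of one way to derive the label is shown below, assuming the calendar quarter of the call date is what is wanted. That is an assumption: a call usually discusses the quarter that has just ended, so a one-quarter shift may be more appropriate. The helper name `add_quarter_year` is hypothetical.

```python
import pandas as pd

def add_quarter_year(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Add a 'quarter_year' column such as 'Q4 2006' from the calendar quarter of date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    # Nullable integers avoid labels like "Q4.0" when some dates are missing.
    quarter = dates.dt.quarter.astype("Int64").astype(str)
    year = dates.dt.year.astype("Int64").astype(str)
    out["quarter_year"] = "Q" + quarter + " " + year
    # Rows whose date could not be parsed get a missing label.
    out.loc[dates.isna(), "quarter_year"] = pd.NA
    return out

# The two dates printed above, used as a quick check.
demo = add_quarter_year(pd.DataFrame({"date": ["2006-10-19", "2025-02-05"]}))
```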


In [9]:
def text_preprocessing_transcript(text):
    """Write the text preprocessing function here. This should work through the `df.apply()` function"""
    return text
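
As a starting point for the preprocessing step, the sketch below only normalises whitespace inside each component and drops empty components, keeping the `[BREAK]` separators introduced when the transcripts were loaded. What else to clean (speaker names, boilerplate, disclaimers) is left open; the function name `preprocess_transcript_sketch` is hypothetical.

```python
import re

def preprocess_transcript_sketch(text: str) -> str:
    """Normalise whitespace in each component while keeping the [BREAK] separators."""
    if not isinstance(text, str):
        # Missing transcripts (e.g. NaN) become an empty string.
        return ""
    segments = text.split("[BREAK]")
    cleaned = [re.sub(r"\s+", " ", segment).strip() for segment in segments]
    # Drop components that are empty after cleaning.
    return "[BREAK]".join(segment for segment in cleaned if segment)
```

It takes one string and returns one string, so it can be used as `transcripts_data["transcript"].apply(preprocess_transcript_sketch)`.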

In [10]:
def sentiment_analysis_transcript(transcripts_data: pd.DataFrame):
    """This function should take in the transcripts data and output the final csv file dataframe"""
    output_data = transcripts_data.copy()
    return output_data
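
For orientation, here is a rough sketch of how the final table could be assembled. It is a placeholder baseline, not the intended method: the scorer is a toy word count with add-one smoothing that yields values in (0, 1), the word lists are invented, and `ticker_by_company_id` is a hypothetical lookup, since the loaded data only contains `company_id`. The sketch also assumes every `date` parses and labels each row with the calendar quarter of that date.

```python
import pandas as pd

# Toy word lists, purely illustrative.
POSITIVE = {"strong", "growth"}
NEGATIVE = {"weak", "loss"}

def toy_lexicon_score(text: str) -> float:
    """Smoothed share of positive words among sentiment words, in (0, 1)."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos + 1) / (pos + neg + 2)

def sentiment_table_sketch(transcripts_data: pd.DataFrame, ticker_by_company_id: dict) -> pd.DataFrame:
    """Build the ticker / quarter_year / transcript_sentiment table."""
    dates = pd.to_datetime(transcripts_data["date"])  # assumes all dates parse
    return pd.DataFrame({
        "ticker": transcripts_data["company_id"].map(ticker_by_company_id),
        "quarter_year": "Q" + dates.dt.quarter.astype(str) + " " + dates.dt.year.astype(str),
        "transcript_sentiment": transcripts_data["transcript"].apply(toy_lexicon_score),
    })

# Made-up input with placeholder ids and tickers.
demo_input = pd.DataFrame({
    "company_id": [1, 2],
    "date": ["2006-10-19", "2025-02-05"],
    "transcript": ["Strong growth.", "Weak results."],
})
table = sentiment_table_sketch(demo_input, {1: "AAA", 2: "BBB"})
```

Swapping `toy_lexicon_score` for a model-based scorer leaves the rest of the table construction unchanged.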

In [11]:
## save the final output

# output_data = sentiment_analysis_transcript(transcripts_data)
# output_data.to_csv("output_transcript_sentiment.csv", index=False)