# Earning Call Transcripts Sentiment

For: Tan Cheen Hao!

The transcripts are already given to us by quarter by company so aggregation is not needed.

In the very basic form we basically want the output to be a csv file in the format below. (ideally order by quarter_year then by ticker but doesn't matter). `transcript_sentiment` should be values between 0 to 1 where the value vaguely represents the probability of a positive sentiment. Or -1 to 1 where -1 is neg and 1 is pos. This depends on you but _make it clear with a markdown at the end._

| ticker | quarter_year | transcript_sentiment |
| ------ | ------------ | -------------------- |
| BAC    | Q1 2001      | 0.2                  |
| JPM    | Q1 2001      | 0.67                 |
| WFC    | Q1 2001      | 0.97                 |

Now, you could also explore the use of LLMs and prompt engineering to extract specific information from the text first. For example, you could look into using LLMs to extract company specific info vs market info or ask the LLM to find how "confident" the announcer is before extracting the sentiment.

For earning calls, instead of finding whether its positive or negative, you could also find the degree of complexity, or even degree of confidence. Also, look into **aspect based sentiment analysis**, it could be useful. Ideally, you should have 2 output files; 1 for revenue and 1 for CAR.

Be creative!


In [1]:
import os
import json
import numpy as np
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

  from .autonotebook import tqdm as notebook_tqdm
  warn(f"Failed to load image Python extension: {e}")


In [2]:
# Directory containing the JSON files
json_folder_path ="E:/Users/Walze/Downloads/data/data/text/earning_call_transcripts"

# List to store transcripts
transcripts = []

# Loop through all files in the folder
for filename in os.listdir(json_folder_path):
    if filename.endswith(".json"):
        file_path = os.path.join(json_folder_path, filename)
        
        with open(file_path, 'r') as f:
            data = json.load(f)
        
        # Combine all component texts into one document
        components = data.get("components", [])
        full_text = " ".join(component["text"] for component in components if "text" in component)
        
        transcripts.append({
            "filename": filename,
            "transcript": full_text
        })

# Create DataFrame
df = pd.DataFrame(transcripts)

# Ensure you have the necessary NLTK data files
nltk.download('stopwords')
nltk.download('wordnet')

# Text Preprocessing
def preprocess_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['processed_text'] = df['transcript'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
df

Unnamed: 0,filename,transcript,processed_text
0,companyid_1038351 headline_First Security Gro...,"Michael, this is Chip. We took in the rate cap...","michael, chip. took rate cap fund deposit litt..."
1,companyid_1038351 headline_First Security Gro...,And what are your allocated reserves versus yo...,allocated reserve versus unallocated reserves?...
2,companyid_1038351 headline_First Security Gro...,Your next question comes from Sam Caldwell wit...,next question come sam caldwell kbw. concludes...
3,companyid_1038351 headline_First Security Gro...,"I know it's hard to forecast, but, loan balanc...","know hard forecast, but, loan balance shrunk l..."
4,companyid_1038352 headline_Banc of California...,I don't think it's unreasonable. I mean I thin...,think unreasonable. mean think definitely -- a...
...,...,...,...
3863,companyid_98045865 headline_Capital Bank Fina...,"Okay. Also, can you provide us any commentary ...","okay. also, provide u commentary around conver..."
3864,companyid_98045865 headline_Capital Bank Fina...,"Okay. Yes, that was my follow-up question was,...","okay. yes, follow-up question was, given obvio..."
3865,companyid_98045865 headline_Capital Bank Fina...,"Thanks, Chris. Chris mentioned earlier, we gen...","thanks, chris. chris mentioned earlier, genera..."
3866,companyid_98045865 headline_Capital Bank Fina...,I wanted to dig down a little further into the...,wanted dig little expense quarter see anything...


Feed transcipt thu FinBert NLP model


In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [8]:
import textstat
from sklearn.preprocessing import StandardScaler
import torch
from scipy.special import softmax
import warnings
import re

scale = StandardScaler()
synthetic_texts = df['processed_text'].tolist()

# Compute raw complexity scores
raw_complexities = []
for synthetic_text in synthetic_texts:
    complexity = textstat.flesch_reading_ease(synthetic_text)
    raw_complexities.append([complexity])  # 2D for scaler

# Scale complexity scores
scaled_complexities = scale.fit_transform(raw_complexities)

# Sentiment analysis and result assembly
sentiment_results = []
for idx, synthetic_text in enumerate(synthetic_texts):
    # Tokenize the synthetic text
    inputs = tokenizer(synthetic_text, return_tensors="pt", truncation=True, max_length=512, padding=True)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = softmax(logits, axis=1)

        sentiment_score = (-1 * probs[0][0].item()) + (1 * probs[0][2].item())
        confidence = torch.max(torch.tensor(probs)).item()

    complexity = scaled_complexities[idx][0]

    # Parse filename to get company, quarter, and year
    filename = df.loc[idx, 'filename'].replace(".json", "")
    quarter_match = re.search(r'(Q[1-4]) (\d{4})', filename)

    if quarter_match:
        quarter = quarter_match.group(1)
        year = quarter_match.group(2)
    else:
        quarter = "Unknown"
        year = "Unknown"

    company_match = re.search(r'companyid_(\d+)', filename)
    company_id = company_match.group(1) if company_match else "Unknown"

    sentiment_results.append({
        "company": company_id,
        "quarter": quarter,
        "year": year,
        "sentiment_score": sentiment_score,
        "confidence": confidence,
        "complexity": complexity
    })

    print(f"{company_id} | {quarter} | {year} | Sentiment: {sentiment_score:.3f} | Confidence: {confidence:.3f} | Complexity: {complexity:.2f}")

sentiment_df = pd.DataFrame(sentiment_results)
sentiment_df.to_csv("sentiment_results.csv", index=False)


1038351 | Q1 | 2010 | Sentiment: -0.518 | Confidence: 0.749 | Complexity: -2.03
1038351 | Q2 | 2010 | Sentiment: -0.237 | Confidence: 0.603 | Complexity: -2.04
1038351 | Q3 | 2008 | Sentiment: 0.202 | Confidence: 0.571 | Complexity: -0.69
1038351 | Q4 | 2009 | Sentiment: 0.346 | Confidence: 0.665 | Complexity: -2.09
1038352 | Q1 | 2024 | Sentiment: -0.793 | Confidence: 0.885 | Complexity: 0.53
1038352 | Q2 | 2024 | Sentiment: 0.794 | Confidence: 0.886 | Complexity: 0.44
1038352 | Q3 | 2024 | Sentiment: -0.008 | Confidence: 0.496 | Complexity: 0.47
1038352 | Q4 | 2023 | Sentiment: 0.806 | Confidence: 0.879 | Complexity: 0.49
1038352 | Q4 | 2024 | Sentiment: 0.843 | Confidence: 0.909 | Complexity: 1.89
1038352 | Q1 | 2023 | Sentiment: 0.791 | Confidence: 0.867 | Complexity: 0.71
1038352 | Q2 | 2022 | Sentiment: 0.452 | Confidence: 0.708 | Complexity: 1.92
1038352 | Q3 | 2022 | Sentiment: 0.074 | Confidence: 0.463 | Complexity: 3.36
1038352 | Q4 | 2022 | Sentiment: -0.833 | Confidence: 0.