# News Sentiment

For: Tan Cheen Hao!

The news are given as different news across all dates with a list of companies in a field. We will need to accumulate this up to a quarter level by company.

In the very basic form we basically want the output to be a csv file in the format below. (ideally order by quarter_year then by ticker but doesn't matter). `news_sentiment` should be values between 0 to 1 where the value vaguely represents the probability of a positive sentiment. Or -1 to 1 where -1 is neg and 1 is pos. This depends on you but *make it clear with a markdown at the end.*


| ticker | quarter_year  | news_sentiment |
|--------|---------------|----------------|
| BAC    | Q1 2001       | 0.2            |
| JPM    | Q1 2001       | 0.67           |
| WFC    | Q1 2001       | 0.97           |


Of course some averaging will be needed so to prevent data loss, you could have multiple columns representing upper quartile sentiment, mean sentiment lower quartile sentiment for example. Perhaps news closer to announcement date might affect CAR more than older news so you might take the average of the most recent 3 news before annoucement. These are just some ideas to get you started but its completely up to you. You will need to gauge how to aggregate this up. Ideally, you should have 2 output files; 1 for revenue and 1 for CAR.

Now, you could also explore the use of LLMs and prompt engineering to extract specific information from the text first. For example, you could look into using LLMs to extract company specific news vs market news. Both news will affect revenue prediction but the latter should not affect CAR prediction.

Be creative!

In [None]:
import os
import json
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure NLTK data is downloaded
nltk.download('stopwords')
nltk.download('wordnet')

# JSON file path
json_file_path = r"data/text/news/news.json"

# Load JSON file (assuming it's a single JSON file with a list of articles)
with open(json_file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# List to hold parsed article data
articles = []

# Loop through each article in the JSON data
for item in data:
    article = {
        "date": item.get("date"),
        #"title": item.get("title"),
        #"author": ", ".join(item.get("author", [])),
        "content": item.get("content"),
        #"company_names": ", ".join(item.get("company_names", [])),
        "tickers": ", ".join([ticker[:-3] for ticker in item.get("tickers", [])]),
        #"company_ids": ", ".join(map(str, item.get("company_ids", [])))
    }
    articles.append(article)

# Convert to DataFrame
df = pd.DataFrame(articles)

# Show a sample of the processed DataFrame
print(df.head())

In [36]:
df2 = df.copy()

In [40]:
from companies import *
all_banks = set(large_banks + medium_banks + small_banks)
# Explode the company_ids list into multiple rows

# Ensure all tickers are lists (if not already)
df2['tickers'] = df2['tickers'].apply(lambda x: [ticker.strip() for ticker in x.split(',')] if isinstance(x, str) else [])

# Explode the 'tickers' column into multiple rows
df2 = df2.explode('tickers').reset_index(drop=True)

# Filter rows where 'tickers' are in 'all_banks'
df2 = df2[df2['tickers'].isin(all_banks)].reset_index(drop=True)

In [41]:
df2

Unnamed: 0,date,content,tickers,processed_content
0,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,WFC,ara mahdessian vahe kuzoyan simple plan gradua...
1,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,C,ara mahdessian vahe kuzoyan simple plan gradua...
2,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,WFC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...
3,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,BAC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...
4,2024-11-27,Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...,WFC,autodesk ( adsk -0.45% ) q3 2025 earnings call...
...,...,...,...,...
1179,2024-09-23,FRENCH shipping giant CMA CGM plans to launch ...,C,french shipping giant cma cgm plan launch full...
1180,2024-12-22,Public Employees Retirement System of Ohio inc...,C,public employee retirement system ohio increas...
1181,2024-12-20,XTX Topco Ltd purchased a new stake in Cadence...,C,xtx topco ltd purchased new stake cadence bank...
1182,2024-12-10,Shares of Cadence Bank ( NYSE:CADE – Get Free ...,C,share cadence bank ( nyse:cade – get free repo...


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import textstat
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load FinBERT model
finbert_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
finbert_model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone')

# Initialize scaler
scaler = MinMaxScaler(feature_range=(-1, 1))

def get_sentiment_score(text):
    inputs = finbert_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = finbert_model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    sentiment_score = (-1 * probs[0][0].item()) + (1 * probs[0][2].item())
    confidence = torch.max(probs).item()

    raw_complexity = textstat.flesch_reading_ease(text) if isinstance(text, str) else 50

    return sentiment_score, confidence, raw_complexity

# Step 1: Compute sentiment, confidence, and raw complexity
sentiment_list = []
confidence_list = []
raw_complexity_list = []

for idx, row in df2.iterrows():
    text = row['content'][:512] if isinstance(row['content'], str) else ""
    sentiment, confidence, raw_complexity = get_sentiment_score(text)
    
    sentiment_list.append(sentiment)
    confidence_list.append(confidence)
    raw_complexity_list.append(raw_complexity)

# Step 2: Assign collected results
df2['sentiment_score'] = sentiment_list
df2['confidence'] = confidence_list
df2['raw_complexity'] = raw_complexity_list

# Step 3: Fit scaler now
scaler.fit(df2[['raw_complexity']])

# Step 4: Transform scaled complexity
df2['complexity_score'] = scaler.transform(df2[['raw_complexity']])

In [45]:
df2

Unnamed: 0,date,content,tickers,processed_content,sentiment_score,confidence,raw_complexity,complexity_score
0,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,WFC,ara mahdessian vahe kuzoyan simple plan gradua...,-9.999221e-01,0.999953,52.56,0.571613
1,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,C,ara mahdessian vahe kuzoyan simple plan gradua...,-9.999221e-01,0.999953,52.56,0.571613
2,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,WFC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...,-9.999647e-01,0.999980,47.75,0.502583
3,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,BAC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...,-9.999647e-01,0.999980,47.75,0.502583
4,2024-11-27,Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...,WFC,autodesk ( adsk -0.45% ) q3 2025 earnings call...,-9.999264e-01,0.999931,29.21,0.236510
...,...,...,...,...,...,...,...,...
1179,2024-09-23,FRENCH shipping giant CMA CGM plans to launch ...,C,french shipping giant cma cgm plan launch full...,-9.681520e-01,0.968170,31.89,0.274971
1180,2024-12-22,Public Employees Retirement System of Ohio inc...,C,public employee retirement system ohio increas...,-9.975859e-01,0.997589,45.42,0.469145
1181,2024-12-20,XTX Topco Ltd purchased a new stake in Cadence...,C,xtx topco ltd purchased new stake cadence bank...,-9.999671e-01,0.999972,32.29,0.280712
1182,2024-12-10,Shares of Cadence Bank ( NYSE:CADE – Get Free ...,C,share cadence bank ( nyse:cade – get free repo...,-4.828224e-07,0.999999,42.58,0.428387


In [46]:
import pandas as pd

# Make sure 'date' is datetime
df2['date'] = pd.to_datetime(df2['date'])

# Create 'quarter' column
df2['quarter'] = df2['date'].dt.to_period('Q').astype(str)

# Sort data by company, quarter, and date (for EWMA to make sense)
df2 = df2.sort_values(['tickers', 'quarter', 'date'])

# Function to apply EWMA per group
def compute_ewma(x):
    return pd.Series({
        'sentiment_score': x['sentiment_score'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'confidence': x['confidence'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'complexity_score': x['complexity_score'].ewm(span=len(x), adjust=False).mean().iloc[-1]
    })

# Group by company and quarter, and apply EWMA
quarterly_ewma = df2.groupby(['tickers', 'quarter']).apply(compute_ewma).reset_index()

# Split 'quarter' into 'year' and 'quarter' columns
quarterly_ewma[['year', 'quarter']] = quarterly_ewma['quarter'].str.extract(r'(\d{4})Q(\d)')
quarterly_ewma['year'] = quarterly_ewma['year'].astype(int)
quarterly_ewma['quarter'] = quarterly_ewma['quarter'].astype(int)

# Reorder columns
quarterly_ewma = quarterly_ewma[['tickers', 'year', 'quarter', 'sentiment_score', 'confidence', 'complexity_score']]

In [49]:
quarterly_ewma

Unnamed: 0,tickers,year,quarter,sentiment_score,confidence,complexity_score
0,AMAL,2024,4,-0.999966,0.999969,0.640069
1,AMTB,2024,4,-0.176004,0.809328,0.340987
2,BAC,2024,2,-0.632320,0.911313,0.238646
3,BAC,2024,3,-0.639418,0.961712,0.175384
4,BAC,2024,4,-0.528727,0.967111,0.173048
...,...,...,...,...,...,...
142,WFC,2024,3,-0.692210,0.975001,0.155569
143,WFC,2024,4,-0.499377,0.941427,0.252412
144,WFC,2025,1,-0.096596,0.914015,0.080549
145,WSBC,2024,3,-0.666597,0.999922,-0.155329


In [50]:
# Save the quarterly_ewma DataFrame to a CSV file
quarterly_ewma.to_csv("quarterly_ewma_results.csv", index=False)