# News Sentiment

For: Tan Cheen Hao!

The news are given as different news across all dates with a list of companies in a field. We will need to accumulate this up to a quarter level by company.

In the very basic form we basically want the output to be a csv file in the format below. (ideally order by quarter_year then by ticker but doesn't matter). `news_sentiment` should be values between 0 to 1 where the value vaguely represents the probability of a positive sentiment. Or -1 to 1 where -1 is neg and 1 is pos. This depends on you but *make it clear with a markdown at the end.*


| ticker | quarter_year  | news_sentiment |
|--------|---------------|----------------|
| BAC    | Q1 2001       | 0.2            |
| JPM    | Q1 2001       | 0.67           |
| WFC    | Q1 2001       | 0.97           |


Of course some averaging will be needed so to prevent data loss, you could have multiple columns representing upper quartile sentiment, mean sentiment lower quartile sentiment for example. Perhaps news closer to announcement date might affect CAR more than older news so you might take the average of the most recent 3 news before annoucement. These are just some ideas to get you started but its completely up to you. You will need to gauge how to aggregate this up. Ideally, you should have 2 output files; 1 for revenue and 1 for CAR.

Now, you could also explore the use of LLMs and prompt engineering to extract specific information from the text first. For example, you could look into using LLMs to extract company specific news vs market news. Both news will affect revenue prediction but the latter should not affect CAR prediction.

Be creative!

In [19]:
import os
import json
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure NLTK data is downloaded
nltk.download('stopwords')
nltk.download('wordnet')

# JSON file path
json_file_path = r"E:/Users/Walze/Downloads/data/data/text/news/news.json"

# Load JSON file (assuming it's a single JSON file with a list of articles)
with open(json_file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# List to hold parsed article data
articles = []

# Loop through each article in the JSON data
for item in data:
    article = {
        "date": item.get("date"),
        #"title": item.get("title"),
        #"author": ", ".join(item.get("author", [])),
        "content": item.get("content"),
        #"company_names": ", ".join(item.get("company_names", [])),
        "tickers": ", ".join([ticker[:-3] for ticker in item.get("tickers", [])]),
        #"company_ids": ", ".join(map(str, item.get("company_ids", [])))
    }
    articles.append(article)

# Convert to DataFrame
df = pd.DataFrame(articles)

# Preprocessing function
def preprocess_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Apply preprocessing to content
df['processed_content'] = df['content'].apply(preprocess_text)

# Optional: Save the DataFrame to a CSV
# df.to_csv('processed_articles.csv', index=False)

# Show a sample of the processed DataFrame
print(df.head())



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


         date                                            content  \
0  2024-11-27  Ara Mahdessian and Vahe Kuzoyan had a simple p...   
1  2024-12-10  MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...   
2  2024-11-27  Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...   
3  2024-12-20  Rohit Chopra may not have much time left as di...   
4  2024-12-20  online trading platform Market Size, Share, Tr...   

                                             tickers  \
0                   NDAQ, TTAN, 9844, GS, MS, WFC, C   
1  MDB, 4388, ATCO, MET, ALLAM, CARG, VSCO, ALV, ...   
2  ADSK, VP/, 4388, 2597Q, RPO, PW, EN, ACC, PTC,...   
3                                                WFC   
4             FDLI, AMTD, HOOD, BAC, WFC, CMCX, GCAP   

                                   processed_content  
0  ara mahdessian vahe kuzoyan simple plan gradua...  
1  mongodb ( mdb 1.96% ) q3 2025 earnings call de...  
2  autodesk ( adsk -0.45% ) q3 2025 earnings call...  
3  rohit chopra may much time left

In [20]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import textstat
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load FinBERT model
finbert_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
finbert_model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone')

# Initialize scaler
scaler = MinMaxScaler(feature_range=(-1, 1))

def get_sentiment_score(text):
    inputs = finbert_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = finbert_model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    sentiment_score = (-1 * probs[0][0].item()) + (1 * probs[0][2].item())
    confidence = torch.max(probs).item()

    raw_complexity = textstat.flesch_reading_ease(text) if isinstance(text, str) else 50

    return sentiment_score, confidence, raw_complexity

# Step 1: Compute sentiment, confidence, and raw complexity
sentiment_list = []
confidence_list = []
raw_complexity_list = []

for idx, row in df.iterrows():
    text = row['content'][:512] if isinstance(row['content'], str) else ""
    sentiment, confidence, raw_complexity = get_sentiment_score(text)
    
    sentiment_list.append(sentiment)
    confidence_list.append(confidence)
    raw_complexity_list.append(raw_complexity)

# Step 2: Assign collected results
df['sentiment_score'] = sentiment_list
df['confidence'] = confidence_list
df['raw_complexity'] = raw_complexity_list

# Step 3: Fit scaler now
scaler.fit(df[['raw_complexity']])

# Step 4: Transform scaled complexity
df['complexity_score'] = scaler.transform(df[['raw_complexity']])


In [21]:
df

Unnamed: 0,date,content,tickers,processed_content,sentiment_score,confidence,raw_complexity,complexity_score
0,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,"NDAQ, TTAN, 9844, GS, MS, WFC, C",ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
1,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,"MDB, 4388, ATCO, MET, ALLAM, CARG, VSCO, ALV, ...",mongodb ( mdb 1.96% ) q3 2025 earnings call de...,-0.999944,0.999964,53.37,0.432466
2,2024-11-27,Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...,"ADSK, VP/, 4388, 2597Q, RPO, PW, EN, ACC, PTC,...",autodesk ( adsk -0.45% ) q3 2025 earnings call...,-0.999812,0.999901,49.82,0.388069
3,2024-12-20,Rohit Chopra may not have much time left as di...,WFC,rohit chopra may much time left director consu...,0.996311,0.998027,49.35,0.382191
4,2024-12-20,"online trading platform Market Size, Share, Tr...","FDLI, AMTD, HOOD, BAC, WFC, CMCX, GCAP","online trading platform market size, share, tr...",-0.757345,0.757511,-11.27,-0.375938
...,...,...,...,...,...,...,...,...
27144,2024-11-27,Natixis Advisors LLC increased its holdings in...,A,natixis advisor llc increased holding share in...,-0.990368,0.990422,39.33,0.256878
27145,2024-11-28,Natixis Advisors LLC lifted its position in sh...,", , A",natixis advisor llc lifted position share alle...,-0.983903,0.985238,39.13,0.254377
27146,2024-11-27,Natixis Advisors LLC grew its stake in REV Gro...,A,"natixis advisor llc grew stake rev group, inc....",-0.993760,0.993813,51.44,0.408329
27147,2010-04-12,Glimcher Realty Trust topped the list of Bigge...,A,glimcher realty trust topped list biggest perc...,-0.991367,0.991922,69.38,0.632691


In [25]:
df2 = df 

In [35]:

from companies import *
all_banks = set(large_banks + medium_banks + small_banks)
# Explode the company_ids list into multiple rows

# Ensure all tickers are lists (if not already)
df2['tickers'] = df2['tickers'].apply(lambda x: [ticker.strip() for ticker in x] if isinstance(x, list) else [])

# Explode the 'tickers' column into multiple rows
df2 = df2.explode('tickers').reset_index(drop=True)

# Filter rows where 'tickers' are in 'all_banks'
df2 = df2[df2['tickers'].isin(all_banks)].reset_index(drop=True)

In [None]:
df2 #dataframe with only bank tickers

Unnamed: 0,date,content,tickers,processed_content,sentiment_score,confidence,raw_complexity,complexity_score
0,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,WFC,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
1,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,C,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
2,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,WFC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...,-0.999944,0.999964,53.37,0.432466
3,2024-12-10,MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...,BAC,mongodb ( mdb 1.96% ) q3 2025 earnings call de...,-0.999944,0.999964,53.37,0.432466
4,2024-11-27,Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...,WFC,autodesk ( adsk -0.45% ) q3 2025 earnings call...,-0.999812,0.999901,49.82,0.388069
...,...,...,...,...,...,...,...,...
1179,2024-09-23,FRENCH shipping giant CMA CGM plans to launch ...,C,french shipping giant cma cgm plan launch full...,-0.703139,0.703450,20.39,0.020010
1180,2024-12-22,Public Employees Retirement System of Ohio inc...,C,public employee retirement system ohio increas...,-0.995843,0.995894,54.83,0.450725
1181,2024-12-20,XTX Topco Ltd purchased a new stake in Cadence...,C,xtx topco ltd purchased new stake cadence bank...,-0.999959,0.999968,34.97,0.202351
1182,2024-12-10,Shares of Cadence Bank ( NYSE:CADE – Get Free ...,C,share cadence bank ( nyse:cade – get free repo...,-0.000055,0.999944,50.16,0.392321


In [42]:
df2.describe()

Unnamed: 0,sentiment_score,confidence,raw_complexity,complexity_score
count,1184.0,1184.0,1184.0,1184.0
mean,-0.4392202,0.957959,40.333699,0.269431
std,0.7110968,0.102359,24.223003,0.302939
min,-0.9999997,0.461687,-61.17,-1.0
25%,-0.9998781,0.988028,28.8725,0.126094
50%,-0.9690527,0.99981,43.73,0.311906
75%,-1.765744e-08,0.999972,57.3,0.481616
max,0.9999999,1.0,86.1,0.841796


In [40]:
import pandas as pd

# Make sure 'date' is datetime
df2['date'] = pd.to_datetime(df2['date'])

# Create 'quarter' column
df2['quarter'] = df2['date'].dt.to_period('Q').astype(str)

# Sort data by company, quarter, and date (for EWMA to make sense)
df2 = df2.sort_values(['tickers', 'quarter', 'date'])

# Function to apply EWMA per group
def compute_ewma(x):
    return pd.Series({
        'sentiment_score': x['sentiment_score'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'confidence': x['confidence'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'complexity_score': x['complexity_score'].ewm(span=len(x), adjust=False).mean().iloc[-1]
    })

# Group by company and quarter, and apply EWMA
quarterly_ewma = df2.groupby(['tickers', 'quarter']).apply(compute_ewma).reset_index()

# Split 'quarter' into 'year' and 'quarter' columns
quarterly_ewma[['year', 'quarter']] = quarterly_ewma['quarter'].str.extract(r'(\d{4})Q(\d)')
quarterly_ewma['year'] = quarterly_ewma['year'].astype(int)
quarterly_ewma['quarter'] = quarterly_ewma['quarter'].astype(int)

# Reorder columns
quarterly_ewma = quarterly_ewma[['tickers', 'year', 'quarter', 'sentiment_score', 'confidence', 'complexity_score']]

In [41]:
# Save the quarterly_ewma DataFrame to a CSV file
quarterly_ewma.to_csv("quarterly_ewma_results.csv", index=False)