# News Sentiment

For: Tan Cheen Hao!

The news are given as different news across all dates with a list of companies in a field. We will need to accumulate this up to a quarter level by company.

In the very basic form we basically want the output to be a csv file in the format below. (ideally order by quarter_year then by ticker but doesn't matter). `news_sentiment` should be values between 0 to 1 where the value vaguely represents the probability of a positive sentiment. Or -1 to 1 where -1 is neg and 1 is pos. This depends on you but *make it clear with a markdown at the end.*


| ticker | quarter_year  | news_sentiment |
|--------|---------------|----------------|
| BAC    | Q1 2001       | 0.2            |
| JPM    | Q1 2001       | 0.67           |
| WFC    | Q1 2001       | 0.97           |


Of course some averaging will be needed so to prevent data loss, you could have multiple columns representing upper quartile sentiment, mean sentiment lower quartile sentiment for example. Perhaps news closer to announcement date might affect CAR more than older news so you might take the average of the most recent 3 news before annoucement. These are just some ideas to get you started but its completely up to you. You will need to gauge how to aggregate this up. Ideally, you should have 2 output files; 1 for revenue and 1 for CAR.

Now, you could also explore the use of LLMs and prompt engineering to extract specific information from the text first. For example, you could look into using LLMs to extract company specific news vs market news. Both news will affect revenue prediction but the latter should not affect CAR prediction.

Be creative!

In [28]:
import os
import json
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure NLTK data is downloaded
nltk.download('stopwords')
nltk.download('wordnet')

# JSON file path
json_file_path = r"E:/Users/Walze/Downloads/data/data/text/news/news.json"

# Load JSON file (assuming it's a single JSON file with a list of articles)
with open(json_file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# List to hold parsed article data
articles = []

# Loop through each article in the JSON data
for item in data:
    article = {
        "date": item.get("date"),
        #"title": item.get("title"),
        #"author": ", ".join(item.get("author", [])),
        "content": item.get("content"),
        #"company_names": ", ".join(item.get("company_names", [])),
        #"tickers": ", ".join(item.get("tickers", [])),
        "company_ids": ", ".join(map(str, item.get("company_ids", [])))
    }
    articles.append(article)

# Convert to DataFrame
df = pd.DataFrame(articles)

# Preprocessing function
def preprocess_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Apply preprocessing to content
df['processed_content'] = df['content'].apply(preprocess_text)

# Optional: Save the DataFrame to a CSV
# df.to_csv('processed_articles.csv', index=False)

# Show a sample of the processed DataFrame
print(df.head())



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Walze\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


         date                                            content  \
0  2024-11-27  Ara Mahdessian and Vahe Kuzoyan had a simple p...   
1  2024-12-10  MongoDB ( MDB 1.96% ) Q3 2025 Earnings Call De...   
2  2024-11-27  Autodesk ( ADSK -0.45% ) Q3 2025 Earnings Call...   
3  2024-12-20  Rohit Chopra may not have much time left as di...   
4  2024-12-20  online trading platform Market Size, Share, Tr...   

                                         company_ids  \
0     47656, 9249, 26975, 44699, 33170, 27885, 27990   
1  62160, 199661, 202825, 84138, 199931, 197747, ...   
2  28391, 60624, 199661, 126743, 128625, 31372, 6...   
3                                              27885   
4   6456, 42170, 207140, 27833, 27885, 193415, 38021   

                                   processed_content  
0  ara mahdessian vahe kuzoyan simple plan gradua...  
1  mongodb ( mdb 1.96% ) q3 2025 earnings call de...  
2  autodesk ( adsk -0.45% ) q3 2025 earnings call...  
3  rohit chopra may much time left

In [29]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import textstat
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load FinBERT model
finbert_tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
finbert_model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone')

# Initialize scaler
scaler = MinMaxScaler(feature_range=(-1, 1))

def get_sentiment_score(text):
    inputs = finbert_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = finbert_model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    sentiment_score = (-1 * probs[0][0].item()) + (1 * probs[0][2].item())
    confidence = torch.max(probs).item()

    raw_complexity = textstat.flesch_reading_ease(text) if isinstance(text, str) else 50

    return sentiment_score, confidence, raw_complexity

# Step 1: Compute sentiment, confidence, and raw complexity
sentiment_list = []
confidence_list = []
raw_complexity_list = []

for idx, row in df.iterrows():
    text = row['content'][:512] if isinstance(row['content'], str) else ""
    sentiment, confidence, raw_complexity = get_sentiment_score(text)
    
    sentiment_list.append(sentiment)
    confidence_list.append(confidence)
    raw_complexity_list.append(raw_complexity)

# Step 2: Assign collected results
df['sentiment_score'] = sentiment_list
df['confidence'] = confidence_list
df['raw_complexity'] = raw_complexity_list

# Step 3: Fit scaler now
scaler.fit(df[['raw_complexity']])

# Step 4: Transform scaled complexity
df['complexity_score'] = scaler.transform(df[['raw_complexity']])


In [30]:
# Ensure company_ids are split into a list of ints (if they're stored as strings)
df['company_ids'] = df['company_ids'].apply(lambda x: [int(i.strip()) for i in str(x).replace('[', '').replace(']', '').split(',')])

# Explode the company_ids list into multiple rows
df = df.explode('company_ids').reset_index(drop=True)

In [31]:
df

Unnamed: 0,date,content,company_ids,processed_content,sentiment_score,confidence,raw_complexity,complexity_score
0,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,47656,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
1,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,9249,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
2,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,26975,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
3,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,44699,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
4,2024-11-27,Ara Mahdessian and Vahe Kuzoyan had a simple p...,33170,ara mahdessian vahe kuzoyan simple plan gradua...,-0.999974,0.999977,58.62,0.498124
...,...,...,...,...,...,...,...,...
40819,2024-11-28,Natixis Advisors LLC lifted its position in sh...,164598,natixis advisor llc lifted position share alle...,-0.983903,0.985238,39.13,0.254377
40820,2024-11-28,Natixis Advisors LLC lifted its position in sh...,199991,natixis advisor llc lifted position share alle...,-0.983903,0.985238,39.13,0.254377
40821,2024-11-27,Natixis Advisors LLC grew its stake in REV Gro...,199991,"natixis advisor llc grew stake rev group, inc....",-0.993760,0.993813,51.44,0.408329
40822,2010-04-12,Glimcher Realty Trust topped the list of Bigge...,32092,glimcher realty trust topped list biggest perc...,-0.991367,0.991922,69.38,0.632691


In [32]:
import pandas as pd

# Make sure 'date' is datetime
df['date'] = pd.to_datetime(df['date'])

# Create 'quarter' column
df['quarter'] = df['date'].dt.to_period('Q').astype(str)

# Sort data by company, quarter, and date (for EWMA to make sense)
df = df.sort_values(['company_ids', 'quarter', 'date'])

# Function to apply EWMA per group
def compute_ewma(x):
    return pd.Series({
        'sentiment_score': x['sentiment_score'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'confidence': x['confidence'].ewm(span=len(x), adjust=False).mean().iloc[-1],
        'complexity_score': x['complexity_score'].ewm(span=len(x), adjust=False).mean().iloc[-1]
    })

# Group by company and quarter, and apply EWMA
quarterly_ewma = df.groupby(['company_ids', 'quarter']).apply(compute_ewma).reset_index()

# Split 'quarter' into 'year' and 'quarter' columns
quarterly_ewma[['year', 'quarter']] = quarterly_ewma['quarter'].str.extract(r'(\d{4})Q(\d)')
quarterly_ewma['year'] = quarterly_ewma['year'].astype(int)
quarterly_ewma['quarter'] = quarterly_ewma['quarter'].astype(int)

# Reorder columns
quarterly_ewma = quarterly_ewma[['company_ids', 'year', 'quarter', 'sentiment_score', 'confidence', 'complexity_score']]

In [36]:
# Save the quarterly_ewma DataFrame to a CSV file
quarterly_ewma.to_csv("quarterly_ewma_results.csv", index=False)

PAOPAO code below

In [34]:
import pandas as pd

In [35]:
news_data = pd.read_json("data/text/news/news.json")
news_data["date"] = pd.to_datetime(news_data["date"])

FileNotFoundError: File data/text/news/news.json does not exist

In [None]:
print("first news date:", news_data["date"].min())
print("last news date:", news_data["date"].max())
print("number of news:", len(news_data))

first news date: 1998-06-03 00:00:00
last news date: 2025-03-02 00:00:00
number of news: 27149


In [None]:
def text_preprocessing_news(text):
    """Write the text preprocessing function here. This should work through the `df.apply()` function"""
    return text

In [None]:
def sentiment_analysis_news(news_data: pd.DataFrame):
    """This function should take in the news data and output the final csv file dataframe"""
    output_data = news_data.copy()
    return output_data

In [None]:
## save the final output

# output_data = sentiment_analysis_news(news_data)
# output_data.to_csv("output_news_sentiment.csv", index=False)