# Sentiment Feature Engineering

This notebook focuses on extracting and engineering sentiment features from news headlines. We use a pre-trained FinBERT model to generate daily sentiment scores, then create additional features to capture sentiment trends, reversals, and recent sentiment shocks.

## 1. Libraries and Model Setup

We use HuggingFace's transformers and FinBERT for sentiment inference, along with pandas and numpy for data wrangling.

In [2]:
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load FinBERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('yiyanghkust/finbert-tone')
model = AutoModelForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone')

## 2. Headline Sentiment Scoring

We use FinBERT to classify each headline as positive, negative, or neutral.


In [3]:
# Load S&P 500 news dataset
df = pd.read_csv('../data/sp500_news.csv', parse_dates=['Date'])

def get_finbert_sentiment(texts):
    tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**tokens)
        probs = torch.nn.functional.softmax(outputs.logits, dim=1).cpu().numpy()
    # FinBERT label order: [positive, negative, neutral]
    labels = np.array(['positive', 'negative', 'neutral'])
    preds = labels[np.argmax(probs, axis=1)]
    return preds, probs

# Score all headlines
batch_size = 32
sentiments, scores = [], []
for i in range(0, len(df), batch_size):
    batch = df['Title'].iloc[i:i+batch_size].tolist()
    pred, prob = get_finbert_sentiment(batch)
    sentiments.extend(pred)
    scores.extend(prob)

df['sentiment'] = sentiments
df[['positive_score','negative_score','neutral_score']] = np.array(scores)


  return forward_call(*args, **kwargs)


## 3. Aggregate Sentiment to Daily Level

We summarize sentiment by day to build features for modeling.

In [4]:
# Encode sentiment as numeric for aggregation
sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
df['sentiment_num'] = df['sentiment'].map(sentiment_map)

daily_sentiment = df.groupby('Date').agg(
    n_headlines=('Title', 'count'),
    sentiment_mean=('sentiment_num', 'mean'),
    sentiment_std=('sentiment_num', 'std'),
    positive_share=('sentiment', lambda x: (x == 'positive').mean()),
    negative_share=('sentiment', lambda x: (x == 'negative').mean()),
    neutral_share=('sentiment', lambda x: (x == 'neutral').mean()),
    pos_score_mean=('positive_score', 'mean'),
    neg_score_mean=('negative_score', 'mean'),
).fillna(0).reset_index()

## 4. Sentiment Trend & Reversal Features

Capture short-term changes and sentiment reversals.

In [5]:
window = 3  # You can try 3, 5, 7 for more features

daily_sentiment['sentiment_trend'] = daily_sentiment['sentiment_mean'].rolling(window).mean()
daily_sentiment['sentiment_reversal'] = (
    daily_sentiment['sentiment_mean'] * daily_sentiment['sentiment_mean'].shift(1) < 0
).astype(int)
daily_sentiment['sentiment_change'] = daily_sentiment['sentiment_mean'].diff()


## 5. Sentiment Features Sanity Check

Let’s quickly run a sanity-check on the resulting daily_sentiment.csv to see if the output makes sense given the pipeline we just built.

Here’s what you should expect in the file:
- **Date**: Each row represents a single day in your chosen horizon.

- **n_headlines**: Number of headlines for that day.

- **sentiment_mean**: Average sentiment (should be between -1 and 1; negative = bearish, positive = bullish).

- **sentiment_std**: Variability of daily sentiment. Tells you if the day’s news was “all of a kind” or mixed.

- **positive_share, negative_share, neutral_share**: % of headlines classified as each type.

- **pos_score_mean, neg_score_mean**: Average FinBERT confidence scores.

- **sentiment_trend**: Rolling average (should smooth the sentiment_mean).

- **sentiment_reversal**: 1 if sentiment flipped sign since the last day; 0 otherwise.

- **sentiment_change**: The raw difference in sentiment_mean from previous day.

In [8]:
print(daily_sentiment.head(10))
print(daily_sentiment.describe())

# Spot-check: any negative headline counts?
assert (daily_sentiment['n_headlines'] >= 0).all(), "Some days have negative headline counts!"

# Are sentiment values reasonable?
print(daily_sentiment[['sentiment_mean', 'sentiment_trend', 'sentiment_change']].describe())

        Date  n_headlines  sentiment_mean  sentiment_std  positive_share  \
0 2020-01-02            5        0.000000       0.707107        0.200000   
1 2020-01-03            9        0.111111       1.054093        0.555556   
2 2020-01-06            5        0.200000       1.095445        0.600000   
3 2020-01-07            4        0.250000       0.957427        0.500000   
4 2020-01-08            6        1.000000       0.000000        1.000000   
5 2020-01-09            7        0.142857       0.690066        0.285714   
6 2020-01-10            3        0.333333       1.154701        0.666667   
7 2020-01-13            4        1.000000       0.000000        1.000000   
8 2020-01-14            7        0.142857       1.069045        0.571429   
9 2020-01-15            7        0.428571       0.975900        0.714286   

   negative_share  neutral_share  pos_score_mean  neg_score_mean  \
0        0.200000       0.600000        0.200092        0.202745   
1        0.444444       0.0

## 6. Notebook Summary

The sentiment feature engineering step successfully generated daily sentiment statistics, including rolling trends and reversals. The results are well-distributed, with no missing or constant values. These features are now ready to be merged with price data and used for predictive modeling.

In [None]:
# Export for use in subsequent modeling notebooks.
daily_sentiment.to_csv('../data/daily_sentiment.csv', index=False)