# News Headline Tagging and Sentiment Analysis with DistilBERT

This notebook processes news headlines to:
- Perform sentiment analysis (positive/negative) using DistilBERT.
- Tag headlines with categories (Positive Sentiment, Negative Sentiment, New Products, Layoffs, Analyst Comments, Stocks, Dividends, Corporate Earnings, Mergers & Acquisitions, Store Openings, Product Recalls, Adverse Events, Personnel Changes, Stock Rumors) using DistilBERT embeddings.
- Map headlines to S&P 500 stocks.
- Output results as a CSV for Excel integration.

**Prerequisites**:
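- Install dependencies: `!pip install transformers pandas torch scikit-learn`. The optional fine-tuning section at the end also needs `datasets` and, with recent `transformers` releases, `accelerate`.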
- Run in a Jupyter Notebook environment.
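- A small sample list of headlines is included for reference, but the notebook reads its headlines from `guardian_headlines.csv`; change the path in the code cell to use your own data.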

**Note**: Category tagging simulates a fine-tuned multi-label DistilBERT model using embeddings. For better accuracy, fine-tune with labeled data (see instructions at the end).


In [3]:
pip install transformers pandas torch scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import libraries
import pandas as pd
import numpy as np
from transformers import pipeline, DistilBertTokenizer, DistilBertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import re
from datetime import datetime

# Initialize DistilBERT sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Initialize DistilBERT tokenizer and model for embeddings
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# S&P 500 tickers and company names (partial list, extend as needed)
sp500_tickers = {
    'NVDA': 'Nvidia', 'TSLA': 'Tesla', 'ULTA': 'Ulta Beauty', 'GPS': 'The Gap',
    'FDS': 'FactSet', 'COST': 'Costco', 'OKTA': 'Okta', 'M': 'Macy’s',
    'AWK': 'American Water Works'
}

# Sample headlines (kept for reference; the DataFrame below is built from the CSV)
headlines = [
    {'date': '2025-06-15', 'text': 'FactSet’s ASV growth slows, raising analyst concerns.', 'source': 'Investing.com'},
    {'date': '2025-05-30', 'text': 'Nvidia shares pop after lower-cost chip for China announced.', 'source': 'Yahoo Finance'},
    {'date': '2025-05-31', 'text': 'Ulta Beauty jumps 8.3% after beating Q1 earnings.', 'source': 'Yahoo Finance'},
    {'date': '2025-05-30', 'text': 'The Gap stock falls 14.8% due to tariff fears.', 'source': 'Yahoo Finance'},
    {'date': '2025-06-10', 'text': 'Tesla stock rises on positive sentiment for EV vision.', 'source': 'AInvest'},
    {'date': '2025-06-15', 'text': 'FactSet acquires new firm, but EPS guidance may drop.', 'source': 'Investing.com'},
    {'date': '2025-06-12', 'text': 'Costco to open new stores in Q3.', 'source': 'Yahoo Finance'},
    {'date': '2025-06-13', 'text': 'American Water Works faces regulatory scrutiny after Thames Water news.', 'source': 'Simulated'},
    {'date': '2025-06-16', 'text': 'Okta announces layoffs to streamline operations.', 'source': 'Yahoo Finance'},
    {'date': '2025-06-20', 'text': 'Rumors swirl about Macy’s potential merger.', 'source': 'Simulated'}
]

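# Load the Guardian headlines dataset (adjust the path to your local copy).
# To run on the sample list above instead, use: df = pd.DataFrame(headlines)
df1 = pd.read_csv('data/Kraggle Datasets/Financial News Dataset/guardian_headlines.csv')
# Keep only the columns used below, renamed to 'date' and 'text'
df = pd.DataFrame({
    'date': df1['Time'],
    'text': df1['Headlines'],
})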

# Define categories
categories = [
    'News - Positive Sentiment', 'News - Negative Sentiment', 'News - New Products',
    'News - Layoffs', 'News - Analyst Comments', 'News - Stocks', 'News - Dividends',
    'News - Corporate Earnings', 'News - Mergers & Acquisitions', 'News - Store Openings',
    'News - Product Recalls', 'News - Adverse Events', 'News - Personnel Changes',
    'News - Stock Rumors'
]

# Category descriptions for embedding-based tagging (simulating fine-tuned model)
category_descriptions = {
    'News - Positive Sentiment': 'Stock price increases, optimistic outlook, strong performance',
    'News - Negative Sentiment': 'Stock price declines, negative outlook, poor performance',
    'News - New Products': 'Launch or announcement of new products or services',
    'News - Layoffs': 'Company announces job cuts or downsizing',
    'News - Analyst Comments': 'Analyst reports, forecasts, or concerns about the company',
    'News - Stocks': 'General news about stock price or equity movements',
    'News - Dividends': 'Announcements about dividend payouts or changes',
    'News - Corporate Earnings': 'Reports on company earnings, revenue, or EPS',
    'News - Mergers & Acquisitions': 'Mergers, acquisitions, or buyouts involving the company',
    'News - Store Openings': 'New store openings or business expansion',
    'News - Product Recalls': 'Product defects, recalls, or safety issues',
    'News - Adverse Events': 'Lawsuits, regulatory scrutiny, or negative events',
    'News - Personnel Changes': 'Changes in executives, CEO, or key personnel',
    'News - Stock Rumors': 'Speculation or rumors about stock or company actions'
}

# Function to get DistilBERT embeddings
def get_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # CLS token embedding

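# Function to map headlines to tickers
def map_to_ticker(text):
    text_lower = text.lower()
    for ticker, name in sp500_tickers.items():
        # Match the ticker as a whole, case-sensitive word so that short symbols
        # such as 'M' do not match ordinary letters inside other words
        ticker_found = re.search(rf'\b{re.escape(ticker)}\b', text) is not None
        if ticker_found or name.lower() in text_lower:
            return ticker
    return 'Unknown'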

# Embed each category description once, rather than once per headline
category_embeddings = {
    category: get_embeddings(desc) for category, desc in category_descriptions.items()
}

# Function for multi-label tagging using embeddings
def tag_categories(text):
    text_embedding = get_embeddings(text)
    tags = []
    for category, desc_embedding in category_embeddings.items():
        similarity = cosine_similarity(text_embedding, desc_embedding)[0][0]
        if similarity > 0.85:  # Threshold (adjust based on testing)
            tags.append(category)
    return tags if tags else ['None']

# Apply sentiment analysis
def get_sentiment(text):
    result = sentiment_analyzer(text)[0]
    label = result['label']
    score = result['score']
    return label, score

# Process headlines
df['Ticker'] = df['text'].apply(map_to_ticker)
df['Categories'] = df['text'].apply(tag_categories)
df[['Sentiment', 'Sentiment_Score']] = df['text'].apply(get_sentiment).apply(pd.Series)

# Expand categories into binary columns
for category in categories:
    df[category] = df['Categories'].apply(lambda x: 1 if category in x else 0)

# Save to CSV
df.to_csv('news_tags_distilbert.csv', index=False)
print('Output saved to news_tags_distilbert.csv')
df


Device set to use cpu
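
The 0.85 cutoff in `tag_categories` is worth checking against the actual similarity distribution. The [CLS] vectors of a base DistilBERT model are not trained for sentence similarity, so scores may cluster in a narrow band and a fixed cutoff can tag almost everything or almost nothing. Below is a minimal sketch for inspecting this, assuming headline and category embeddings have already been stacked into two NumPy arrays. The helper names `cosine_matrix` and `tags_at_threshold` are hypothetical, and the toy vectors stand in for real embeddings.

```python
import numpy as np

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def tags_at_threshold(sims, names, threshold):
    """For each headline (row), list the category names scoring above threshold."""
    return [[names[j] for j in np.flatnonzero(row > threshold)] for row in sims]

# Toy stand-ins for real embeddings: 2 headlines, 3 categories, 3 dimensions
headline_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
category_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
names = ['A', 'B', 'C']

sims = cosine_matrix(headline_embs, category_embs)
print(np.round(sims, 3))                      # inspect the spread before picking a cutoff
print(tags_at_threshold(sims, names, 0.85))   # strict cutoff
print(tags_at_threshold(sims, names, 0.5))    # looser cutoff
```

With real data, `headline_embs` could be built with `np.vstack` over the outputs of `get_embeddings`, and the printed matrix shows whether any cutoff separates the categories at all.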


## Results

The output DataFrame (`news_tags_distilbert.csv`) contains:
- **date**: Date of the headline.
- **text**: Headline text.
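- **source**: Source of the headline. This column is present only when the DataFrame is built from the sample `headlines` list; the Guardian CSV supplies only `date` and `text`.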
- **Ticker**: S&P 500 ticker (e.g., NVDA, AWK) or 'Unknown'.
- **Categories**: List of assigned categories (from DistilBERT embeddings).
- **Sentiment**: POSITIVE or NEGATIVE (from DistilBERT).
- **Sentiment_Score**: Confidence score from DistilBERT.
- Binary columns for each category (1 if present, 0 otherwise).
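
One practical detail when reading the file back: `to_csv` writes each `Categories` list as its string representation, so the column holds plain strings after reloading. Here is a small sketch of restoring the lists with `ast.literal_eval`. It uses an in-memory buffer as a stand-in for `news_tags_distilbert.csv`, and the two rows are illustrative.

```python
import ast
import io
import pandas as pd

# Illustrative frame shaped like the notebook output
original = pd.DataFrame({
    'text': ['Costco to open new stores in Q3.',
             'Okta announces layoffs to streamline operations.'],
    'Categories': [['News - Store Openings'], ['None']],
})

# Round-trip through CSV; a real run would use the file path instead of a buffer
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

raw_value = reloaded.loc[0, 'Categories']
print(type(raw_value))  # the list came back as a string

# Parse the string form back into real Python lists
reloaded['Categories'] = reloaded['Categories'].apply(ast.literal_eval)
print(reloaded.loc[0, 'Categories'])
```

The binary category columns do not have this issue, which makes them the easier choice for filtering in Excel.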

Import `news_tags_distilbert.csv` into Excel to join with your S&P 500 dataset (e.g., match on the `date` and `Ticker` columns).
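
The same join can also be done in pandas before exporting. This is a minimal sketch, assuming a hypothetical price table with `date`, `Ticker`, and `close` columns; the column names and values are invented for illustration.

```python
import pandas as pd

# Tagged headlines, shaped like the notebook's output (values are illustrative)
news = pd.DataFrame({
    'date': ['2025-06-12', '2025-06-16', '2025-06-16'],
    'Ticker': ['COST', 'OKTA', 'Unknown'],
    'Sentiment': ['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
})

# Hypothetical S&P 500 price table; the 'close' values are made up
prices = pd.DataFrame({
    'date': ['2025-06-12', '2025-06-16'],
    'Ticker': ['COST', 'OKTA'],
    'close': [100.0, 50.0],
})

# Normalize both date columns to the same dtype so the join keys compare equal
news['date'] = pd.to_datetime(news['date'])
prices['date'] = pd.to_datetime(prices['date'])

# Left join keeps every headline; rows without a price match get NaN in 'close'
merged = news.merge(prices, on=['date', 'Ticker'], how='left')
print(merged)
```

If the `Time` values in the source CSV are not ISO-formatted dates, `pd.to_datetime` may need an explicit `format` argument for the keys to line up.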


## Fine-Tuning DistilBERT for Multi-Label Tagging

The above script uses embeddings and cosine similarity as a placeholder for a fine-tuned model. For better accuracy, fine-tune DistilBERT on labeled data. Steps:

1. **Prepare Labeled Data**:
   - Create a CSV with columns: `text` (headline), and binary columns for each category (1 if present, 0 otherwise).
   - Example:
     ```csv
     text,News - Positive Sentiment,News - Negative Sentiment,...,News - Stock Rumors
     Nvidia shares pop after lower-cost chip,1,0,...,0
     FactSet’s ASV growth slows,0,1,...,0
     ```
   - Label at least 500-1000 headlines for robust training.

2. **Fine-Tune DistilBERT**:
   - Use Hugging Face’s `transformers` library.
   - Example code (add to a new cell):
     ```python
     from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
     from datasets import load_dataset

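     # Load labeled data. A single CSV yields only a 'train' split,
     # so hold out 20% of the rows for validation.
     dataset = load_dataset('csv', data_files='labeled_headlines.csv')['train']
     dataset = dataset.train_test_split(test_size=0.2, seed=42)

     # Tokenize and pack the binary category columns into one float 'labels'
     # vector per headline, as multi-label classification expects
     def tokenize_function(examples):
         encoded = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)
         encoded['labels'] = [
             [float(examples[c][i]) for c in categories]
             for i in range(len(examples['text']))
         ]
         return encoded

     tokenized_dataset = dataset.map(
         tokenize_function, batched=True,
         remove_columns=dataset['train'].column_names
     )

     # Define model for multi-label classification
     model = DistilBertForSequenceClassification.from_pretrained(
         'distilbert-base-uncased',
         num_labels=len(categories),
         problem_type='multi_label_classification'
     )

     # Training arguments
     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=8,
         per_device_eval_batch_size=8,
         warmup_steps=50,  # keep well below the total step count on small datasets
         weight_decay=0.01,
         logging_dir='./logs',
         logging_steps=10,
         eval_strategy='epoch'  # older transformers releases name this evaluation_strategy
     )

     # Trainer
     trainer = Trainer(
         model=model,
         args=training_args,
         train_dataset=tokenized_dataset['train'],
         eval_dataset=tokenized_dataset['test']
     )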

     # Train
     trainer.train()

     # Save model
     model.save_pretrained('fine_tuned_distilbert_multi_label')
     tokenizer.save_pretrained('fine_tuned_distilbert_multi_label')
     ```

3. **Use Fine-Tuned Model**:
   - Replace the `tag_categories` function with:
     ```python
     fine_tuned_model = DistilBertForSequenceClassification.from_pretrained('fine_tuned_distilbert_multi_label')
     def tag_categories(text):
         inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
         with torch.no_grad():
             outputs = fine_tuned_model(**inputs)
         probs = torch.sigmoid(outputs.logits).numpy()[0]
         tags = [categories[i] for i in range(len(probs)) if probs[i] > 0.5]
         return tags if tags else ['None']
     ```

4. **Resources**:
   - Labeling tools such as Prodigy can speed up building the labeled dataset.
   - A GPU is strongly recommended for fine-tuning (e.g., via Google Colab); training on CPU works but is slow.
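
The `Trainer` setup in step 2 evaluates once per epoch but, without a metrics function, reports only the loss. Below is a hedged sketch of a micro-averaged F1 metric for the multi-label case, using the same sigmoid and 0.5 cutoff as step 3. The function names are hypothetical, and it assumes `Trainer` hands `compute_metrics` a (logits, labels) pair that can be unpacked, as in the Hugging Face examples.

```python
import numpy as np

def micro_f1_from_logits(logits, labels, threshold=0.5):
    """Micro-averaged F1 for multi-label predictions given raw logits."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # sigmoid
    preds = probs > threshold
    truth = np.asarray(labels) > 0.5
    tp = np.sum(preds & truth)     # predicted and actually present
    fp = np.sum(preds & ~truth)    # predicted but absent
    fn = np.sum(~preds & truth)    # present but missed
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def compute_metrics(eval_pred):
    """Shaped for Trainer(compute_metrics=...): unpacks (logits, labels)."""
    logits, labels = eval_pred
    return {'micro_f1': micro_f1_from_logits(logits, labels)}

# Toy check: 2 headlines, 3 categories
logits = np.array([[2.0, -2.0, -2.0], [-2.0, 2.0, 2.0]])
labels = np.array([[1, 0, 0], [0, 1, 0]])
result = compute_metrics((logits, labels))
print(result)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would add this score to each evaluation log. Micro-averaging is one choice among several; with rare categories, per-label scores may be more informative.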
