# Fine-tuning BERT for NER on Video Comments

In this notebook, we will:

1. Load video comment data from a CSV file.
2. Preprocess the data and align token-level labels.
3. Fine-tune a pre-trained BERT model (using Hugging Face Transformers) for Named Entity Recognition.
4. Evaluate the model.

### Baseline Model: BERT-NER

This notebook implements a baseline Named Entity Recognition (NER) model using a fine-tuned BERT architecture. It serves as the foundation for evaluating improvements introduced by contextual embeddings and clustering methods in our final pipeline.

We evaluate the model using standard NER metrics (precision, recall, F1-score) on a held-out test set.

In [None]:
!pip3 install transformers seqeval torch "accelerate>=0.26.0" spacy pandas scikit-learn
!python -m spacy download en_core_web_sm

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import torch
if not hasattr(torch, "get_default_device"):
    torch.get_default_device = lambda: torch.device("cuda" if torch.cuda.is_available() else "cpu")
from torch.utils.data import Dataset, DataLoader
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)
import json
import random
from sklearn.model_selection import train_test_split
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
import spacy
from datetime import datetime

In [None]:
# Import the dataset
nlp = spacy.load("en_core_web_sm")
df = pd.read_csv('../data/4698969/Dataset_updated.csv')
df = df.dropna()
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.head()

In [None]:
print("Dataset shape:", df.shape)
print("Columns:", df.columns)

---

## Data Format Assumptions

For this notebook we assume:

- **Comment:** contains the raw comment text.
- **combined_labels_str:** contains a string representation of a list of token-level BIO labels (aligned to a whitespace tokenization of the comment).
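
As a minimal sketch of the format described above, a stringified label list can be parsed with `ast.literal_eval` and checked against a whitespace split of the comment. The helper name `parse_row` and the example row are hypothetical; the actual column contents may differ.

```python
import ast

def parse_row(comment, labels_str):
    """Split a comment on whitespace and parse its stringified label list."""
    tokens = comment.split()
    labels = ast.literal_eval(labels_str)  # "['O', 'B-PER']" -> ['O', 'B-PER']
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels for: {comment!r}"
        )
    return tokens, labels

# Hypothetical example row, not taken from the dataset
tokens, labels = parse_row(
    "Thanks Alice for the review", "['O', 'B-PER', 'O', 'O', 'O']"
)
print(list(zip(tokens, labels)))
```

Raising on a length mismatch surfaces misaligned rows early, before they cause an index error during label alignment.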

## Building the Label Set

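The cells below use spaCy to extract named entities from the **Comment** column, collect them into keyword sets per entity type, and then assign a synthetic label to each token by keyword lookup. The label set is derived from these synthetic labels later in the notebook.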

In [None]:
# Extract named entities from comments
comment_entities = []

for text in df['Comment'].dropna():
    doc = nlp(text)
    for ent in doc.ents:
        comment_entities.append((ent.text.strip(), ent.label_))

# Create DataFrame of entities
entity_df = pd.DataFrame(comment_entities, columns=["Entity", "Label"])
people_keywords = set(entity_df[entity_df["Label"] == "PERSON"]["Entity"].str.lower())
org_keywords = set(entity_df[entity_df["Label"] == "ORG"]["Entity"].str.lower())
brand_keywords = set(entity_df[entity_df["Label"].isin(["PRODUCT", "WORK_OF_ART"])]["Entity"].str.lower())

# Show top 10 most frequent entities per type
top_entities_by_type = entity_df.groupby("Label")["Entity"].value_counts().groupby(level=0).head(10)
print(top_entities_by_type.reset_index(name="Count"))

In [None]:
# --- Assign synthetic labels by keyword lookup (single tokens only, so every entity tag is B-) ---
def generate_synthetic_labels(tokens):
    return [
        "B-PER" if token.lower() in people_keywords else
        "B-ORG" if token.lower() in org_keywords else
        "B-PROD" if token.lower() in brand_keywords else
        "O"
        for token in tokens
    ]

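# Apply labeling to the token column; build it by whitespace-splitting the
# comments if the dataset does not already provide one
if "tokens" not in df.columns:
    df["tokens"] = df["Comment"].astype(str).str.split()
df["synthetic_labels"] = df["tokens"].apply(generate_synthetic_labels)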

# Save to variables for training
texts = df["tokens"].tolist()
labels = df["synthetic_labels"].tolist()

# Preview one example
for token, label in zip(texts[0], labels[0]):
    print(f"{token:>10}  →  {label}")

### Custom PyTorch Dataset for NER

This dataset:
- Takes each comment as a list of word-level tokens (`texts`) together with one label per token (`labels`).
- Tokenizes the words using BERT's tokenizer with `is_split_into_words=True` and aligns the provided labels with the sub-tokens.
- Assigns each label to the first sub-token of its word and masks the remaining sub-tokens and the special tokens with `-100`.

Note: The labels must come from the same word-level tokenization as the tokens; otherwise they will not line up after sub-word tokenization.
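
The alignment rule can be sketched without loading a tokenizer. In this illustration the `word_ids` list is written by hand to mimic what a fast tokenizer returns (`None` for special tokens, a repeated index for sub-tokens of the same word); the example words and the helper name `align_labels` are assumptions.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_labels, label_to_id):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token or continuation sub-token: excluded from the loss
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label_to_id[word_labels[word_id]])
        previous = word_id
    return aligned

# Hypothetical input: two words, the second split into three sub-tokens
word_ids = [None, 0, 1, 1, 1, None]
word_labels = ["O", "B-PER"]
label_to_id = {"B-PER": 0, "O": 1}
print(align_labels(word_ids, word_labels, label_to_id))
```

Because only first sub-tokens carry a label, both the training loss and the evaluation metrics are computed per word rather than per sub-token.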

In [None]:
class CustomNERDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, label_to_id, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.label_to_id = label_to_id
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        tokens = self.texts[idx]
        tags = self.labels[idx]

        encoding = self.tokenizer(
            tokens,
            is_split_into_words=True,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )

        word_ids = encoding.word_ids(batch_index=0)
        label_ids = []

        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(self.label_to_id[tags[word_idx]])
            else:
                label_ids.append(-100)  # mask subword tokens
            previous_word_idx = word_idx

        encoding["labels"] = torch.tensor(label_ids)
        # Drop only the batch dimension added by return_tensors="pt"
        return {k: v.squeeze(0) for k, v in encoding.items()}

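### Splitting the Dataset into Training, Validation, and Test Sets

We use scikit-learn's `train_test_split` twice: first to hold out 30% of the data, then to divide that 30% evenly into validation and test sets, giving a 70/15/15 split.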

In [None]:
# === 70/15/15 Train/Validation/Test Split ===

# First split: 70% train, 30% temp (val + test)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    texts, labels, test_size=0.30, random_state=42
)

# Second split: 15% val, 15% test (from 30% temp)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42
)

print(f"Train size: {len(train_texts)}, Validation size: {len(val_texts)}, Test size: {len(test_texts)}")

---

## Train the Model

In [None]:
# Flatten all label lists and get the unique labels
unique_labels = sorted({label for label_seq in labels for label in label_seq})
print("Label Set:", unique_labels)

In [None]:
# Create mappings for the model
label_to_id = {label: i for i, label in enumerate(unique_labels)}
id_to_label = {i: label for label, i in label_to_id.items()}

In [None]:
# Load tokenizer and model with correct label mappings
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(unique_labels),
    id2label=id_to_label,
    label2id=label_to_id
)

In [None]:
# Create dataset objects for training and validation
train_dataset = CustomNERDataset(train_texts, train_labels, tokenizer, label_to_id)
val_dataset = CustomNERDataset(val_texts, val_labels, tokenizer, label_to_id)
print("Number of training examples:", len(train_dataset))
print("Number of validation examples:", len(val_dataset))

In [None]:
test_dataset = CustomNERDataset(test_texts, test_labels, tokenizer, label_to_id)
print("Number of test examples:", len(test_dataset))

In [None]:
# Inspect one tokenized sample from the training dataset.
print(train_dataset)
sample = train_dataset[0]
print("Tokenized input keys:", sample.keys())
print("Tokens:", tokenizer.convert_ids_to_tokens(sample["input_ids"]))
# Convert the label tensor to plain ints so they can be used as dictionary keys
print("Aligned Labels:", [id_to_label[l] if l != -100 else "-100" for l in sample["labels"].tolist()])

### Data Collator and Evaluation Metrics

We use the Hugging Face DataCollator for token classification and define a compute_metrics function using seqeval.
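
seqeval scores whole entity spans rather than individual tokens: a predicted entity counts as correct only when both its boundaries and its type match the gold annotation. The following is a simplified pure-Python illustration of that idea, not seqeval's implementation. It ignores `I-` tags that do not continue a matching entity, and the helper names and example tags are assumptions.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities

def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: one of two gold entities is recovered
y_true = [["B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_prf(y_true, y_pred))
```

Since the synthetic labels in this notebook contain only `B-` tags, every entity here is a single token, so span-level and token-level matches coincide for the non-`O` classes.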

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = []
    true_predictions = []
    for pred_seq, label_seq in zip(predictions, labels):
        curr_labels = []
        curr_preds = []
        for pred, label in zip(pred_seq, label_seq):
            if label != -100:
                curr_labels.append(id_to_label[label])
                curr_preds.append(id_to_label[pred])
        true_labels.append(curr_labels)
        true_predictions.append(curr_preds)
    
    precision = precision_score(true_labels, true_predictions)
    recall = recall_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions)
    # Uncomment the following line for a detailed report:
    # print(classification_report(true_labels, true_predictions))
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

### Training Arguments and Trainer Setup

Adjust the training parameters as needed.
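
When adjusting batch size or epoch count, it helps to know how many optimizer steps a run will take, since `logging_steps` is expressed in steps. A rough sketch, assuming a single device and no gradient accumulation; the dataset size is a hypothetical placeholder for `len(train_dataset)`.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs

# Hypothetical training-set size; substitute len(train_dataset)
steps = total_optimizer_steps(num_examples=7000, batch_size=8, num_epochs=3)
print("Estimated optimizer steps:", steps)
print("Estimated log entries at logging_steps=10:", steps // 10)
```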

In [None]:
training_args = TrainingArguments(
    output_dir="../report/bert-ner-video-comments",
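    # Note: this argument was renamed to `eval_strategy` in transformers 4.41;
    # switch to the new name if your installed version rejects the old one.
    evaluation_strategy="epoch",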
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='../report/logs',
    logging_steps=10,
    save_strategy="epoch",
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

### Training the Model

In [None]:
trainer.train()

---

## Evaluation

In [None]:
metrics = trainer.evaluate(test_dataset)
print("Test Metrics:", metrics)
# Save the model and tokenizer under a shared timestamp
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
trainer.save_model(f"../report/bert-ner-video-comments/model-{timestamp}")
tokenizer.save_pretrained(f"../report/bert-ner-video-comments/tokenizer-{timestamp}")

In [None]:
# === Evaluation on Test Set ===
test_output = trainer.predict(test_dataset)
predictions = test_output.predictions
true_labels = test_output.label_ids

In [None]:
# Convert logits to predicted class indices
predictions = predictions.argmax(axis=-1)

In [None]:
# Map predictions and true labels to tag names
predicted_tags = []
true_tags = []

In [None]:
for pred_seq, label_seq in zip(predictions, true_labels):
    pred_labels = []
    true_labels_cleaned = []
    for pred, label in zip(pred_seq, label_seq):
        if label != -100:
            pred_labels.append(id_to_label[pred])
            true_labels_cleaned.append(id_to_label[label])
    predicted_tags.append(pred_labels)
    true_tags.append(true_labels_cleaned)

In [None]:
# Print individual metrics
precision = precision_score(true_tags, predicted_tags)
recall = recall_score(true_tags, predicted_tags)
f1 = f1_score(true_tags, predicted_tags)

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")

In [None]:
# Print full report
print("NER Evaluation on Test Set:")
print(classification_report(true_tags, predicted_tags))

### Inference Example

Test the model on a new comment.
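
The model predicts one label per WordPiece token, but it was trained with labels on the first sub-token of each word only. One way to read the output at word level is to merge `##` continuation pieces and keep the label of the first piece. The sketch below uses hand-written tokens and labels shaped like the output of the inference cells; the helper name `merge_wordpieces` and the example values are assumptions.

```python
def merge_wordpieces(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # special tokens carry no meaningful prediction
        if token.startswith("##") and words:
            words[-1] += token[2:]  # append the continuation to the current word
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))

# Hypothetical tokens and per-token predictions
tokens = ["[CLS]", "Thanks", "Ze", "##nda", "##ya", "!", "[SEP]"]
labels = ["O", "O", "B-PER", "O", "O", "O", "O"]
print(merge_wordpieces(tokens, labels))
```

Predictions on continuation pieces are discarded because those positions were masked with `-100` during training and are therefore unconstrained.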

In [None]:
test_text = "This new update totally changed the way I see the future of tech!"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [None]:
# Tokenize input and keep the character offsets of each token
encoding = tokenizer(test_text, return_tensors="pt", truncation=True, return_offsets_mapping=True, is_split_into_words=False)
offset_mapping = encoding.pop("offset_mapping")  # not a model input
encoding = {k: v.to(device) for k, v in encoding.items()}

In [None]:
# Run through model
model.eval()  # disable dropout for inference
with torch.no_grad():
    outputs = model(**encoding).logits
predictions = torch.argmax(outputs, dim=2).squeeze(0).tolist()

In [None]:
# Convert to labels, skipping special tokens such as [CLS] and [SEP]
all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze(0).tolist())
tokens = []
predicted_labels = []
for token, pred_id in zip(all_tokens, predictions):
    if token in tokenizer.all_special_tokens:
        continue
    tokens.append(token)
    predicted_labels.append(id_to_label[pred_id])

In [None]:
# Print results
print("Tokens:", tokens)
print("Predicted Labels:", predicted_labels)

---

### Discussion of Baseline Results

The BERT-NER model performs reasonably well on standard entities like people and organizations, but struggles with informal/slang terms and context-dependent mentions often seen in video comments.

This highlights the need for incorporating contextual embeddings and clustering approaches to handle variant spellings and implicit references, which we address in the extended model pipeline.