# Fine-tuning BERT for NER on Video Comments

In this notebook, we will:

1. Load video comment data from a CSV file.
2. Preprocess the data and align token-level labels.
3. Fine-tune a pre-trained BERT model (using Hugging Face Transformers) for Named Entity Recognition.
4. Evaluate the model.

### Baseline Model: BERT-NER

This notebook implements a baseline Named Entity Recognition (NER) model using a fine-tuned BERT architecture. It serves as the foundation for evaluating improvements introduced by contextual embeddings and clustering methods in our final pipeline.

We evaluate the model using standard NER metrics (precision, recall, F1-score) on a held-out test set.

In [None]:
!pip3 install transformers seqeval torch "accelerate>=0.26.0"
!python -m spacy download en_core_web_sm



In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import torch
if not hasattr(torch, "get_default_device"):
    torch.get_default_device = lambda: torch.device("cuda" if torch.cuda.is_available() else "cpu")
from torch.utils.data import Dataset, DataLoader
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)
import json
import random
from sklearn.model_selection import train_test_split
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
import spacy
from sklearn.model_selection import train_test_split

In [None]:
# Import the dataset
nlp = spacy.load("en_core_web_sm")
df = pd.read_csv('../data/4698969/Dataset_updated.csv')
df = df.dropna()
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,ID,Date,Author,Likes,Replies,Comment,Relevance,Polarity,Feature request,Problem report,Efficiency,Safety,tokens,labels,num_tokens,has_entity,entity_tokens,combined_labels_str
0,UghhPYDEB6B173gCoAEC,2017-04-28T18:12:45Z,Aaron Brown,1679,30,i want what he's smoking,spam,neutral,False,False,False,False,"['i', 'want', 'what', 'he', ""'s"", 'smoking']","['spam', 'neutral', False, False, False, 'false']",6,True,"['i', 'want', 'what', 'he', ""'s"", 'smoking']","False, False, False, false, neutral, spam"
1,Ugh6WAPQinruAHgCoAEC,2017-04-28T18:15:14Z,Felician Cadar,684,22,I love how Musk always makes seemingly wild cl...,spam,positive,False,False,False,False,"['I', 'love', 'how', 'Musk', 'always', 'makes'...","['spam', 'positive', False, False, False, 'fal...",23,True,"['I', 'love', 'how', 'Musk', 'always', 'makes']","False, False, False, false, positive, spam"
2,Ugj9xobHmVeDEHgCoAEC,2017-04-28T18:24:53Z,Kelvin Yang,0,0,No.3,spam,neutral,False,False,False,False,['No.3'],"['spam', 'neutral', False, False, False, 'false']",1,True,['No.3'],"False, False, False, false, neutral, spam"
3,Ugj39PRg5dVn8XgCoAEC,2017-04-28T18:25:31Z,Kelvin Yang,140,4,Could be the start of a historical company,spam,neutral,False,False,False,False,"['Could', 'be', 'the', 'start', 'of', 'a', 'hi...","['spam', 'neutral', False, False, False, 'false']",8,True,"['Could', 'be', 'the', 'start', 'of', 'a']","False, False, False, false, neutral, spam"
4,Ugiu9jMmiWts1HgCoAEC,2017-04-28T18:31:52Z,serendipity42,675,9,Gotta start somewhere before making tunnels on...,spam,neutral,False,False,False,False,"['Got', 'ta', 'start', 'somewhere', 'before', ...","['spam', 'neutral', False, False, False, 'false']",9,True,"['Got', 'ta', 'start', 'somewhere', 'before', ...","False, False, False, false, neutral, spam"


In [4]:
print("Dataset shape:", df.shape)
print("Columns:", df.columns)

Dataset shape: (4275, 18)
Columns: Index(['ID', 'Date', 'Author', 'Likes', 'Replies', 'Comment', 'Relevance',
       'Polarity', 'Feature request', 'Problem report', 'Efficiency', 'Safety',
       'tokens', 'labels', 'num_tokens', 'has_entity', 'entity_tokens',
       'combined_labels_str'],
      dtype='object')


---

## Data Format Assumptions

For this notebook we assume:

- **Comment:** contains the raw comment text.
- **combined_labels_str:** contains a string representation of a list of token-level BIO labels (aligned to a whitespace tokenization of the comment).

## Building the Label Set

We scan through the dataset to extract all unique labels from the combined_labels_str column.

In [None]:
# Extract named entities from comments
comment_entities = []

for text in df['Comment'].dropna():
    doc = nlp(text)
    for ent in doc.ents:
        comment_entities.append((ent.text.strip(), ent.label_))

# Create DataFrame of entities
entity_df = pd.DataFrame(comment_entities, columns=["Entity", "Label"])
people_keywords = set(entity_df[entity_df["Label"] == "PERSON"]["Entity"].str.lower())
org_keywords = set(entity_df[entity_df["Label"] == "ORG"]["Entity"].str.lower())
brand_keywords = set(entity_df[entity_df["Label"].isin(["PRODUCT", "WORK_OF_ART"])]["Entity"].str.lower())

# Show top 10 most frequent entities per type
top_entities_by_type = entity_df.groupby("Label")["Entity"].value_counts().groupby(level=0).head(10)
print(top_entities_by_type.reset_index(name="Count"))

In [None]:
# --- Function to assign BIO-style labels ---
def generate_synthetic_labels(tokens):
    return [
        "B-PER" if token.lower() in people_keywords else
        "B-ORG" if token.lower() in org_keywords else
        "B-PROD" if token.lower() in brand_keywords else
        "O"
        for token in tokens
    ]

# Apply labeling to the token column
df["synthetic_labels"] = df["tokens"].apply(generate_synthetic_labels)

# Save to variables for training
texts = df["tokens"].tolist()
labels = df["synthetic_labels"].tolist()

# Preview one example
for token, label in zip(texts[0], labels[0]):
    print(f"{token:>10}  →  {label}")

### Custom PyTorch Dataset for NER

This dataset:
- Uses the **Comment** column as the raw text.
- Uses the **combined_labels_str** column (parsed into a list) as the token-level labels.
- Tokenizes the text using BERT's tokenizer with `is_split_into_words=True` and aligns the provided labels with the sub-tokens.

Note: The text is first split by whitespace so that the provided labels (which were created with a whitespace tokenization) align with the tokens.

### Splitting the Dataset into Training and Validation

We'll use scikit-learn’s train_test_split to separate the data.

In [None]:
# === 70/15/15 Train/Validation/Test Split ===

# First split: 70% train, 30% temp (val + test)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    texts, labels, test_size=0.30, random_state=42
)

# Second split: 15% val, 15% test (from 30% temp)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42
)

print(f"Train size: {len(train_texts)}, Validation size: {len(val_texts)}, Test size: {len(test_texts)}")

---

## Train the Model

In [9]:
# Load the tokenizer and model.
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(labels_list),
    id2label=id_to_label,
    label2id=label_to_id
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Create our custom datasets for training and validation.
train_dataset = CustomNERDataset(train_df, tokenizer, label_to_id)
val_dataset = CustomNERDataset(val_df, tokenizer, label_to_id)
print("Number of training examples:", len(train_dataset))
print("Number of validation examples:", len(val_dataset))

Number of training examples: 3420
Number of validation examples: 855


In [11]:
# Inspect one tokenized sample from the training dataset.
print(train_dataset)
sample = train_dataset[0]
print("Tokenized input keys:", sample.keys())
print("Tokens:", tokenizer.convert_ids_to_tokens(sample["input_ids"]))
print("Aligned Labels:", [id_to_label[l] if l != -100 else "-100" for l in sample["labels"]])

<__main__.CustomNERDataset object at 0x18ab00890>
Tokenized input keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
Tokens: ['[CLS]', 'Te', '##sla', 'claims', 'the', 'future', 'is', 'self', 'driving', 'car', ',', 'but', 'now', 'shows', 'their', 'future', 'is', 'a', 'car', 'on', 'a', 'flat', 'bed', 'rail', 'car', 'that', 'follows', 'a', 'rail', '.', 'What', 'a', 'mi', '##s', '##fire', '.', 'For', 'such', 'a', 'simple', 'and', 'regulated', 'environment', 'as', 'a', 'car', 'only', 'tunnel', ',', 'it', 'would', 'actually', 'be', 'much', 'easier', 'to', 'make', 'car', 'self', 'drive', 'than', 'on', 'a', 'open', 'road', '.', '[SEP]']
Aligned Labels: ['-100', 'False', 'False', 'False', 'False', 'false', 'neutral', 'spam', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', 

### Data Collator and Evaluation Metrics

We use the Hugging Face DataCollator for token classification and define a compute_metrics function using seqeval.

In [12]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [13]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = []
    true_predictions = []
    for pred_seq, label_seq in zip(predictions, labels):
        curr_labels = []
        curr_preds = []
        for pred, label in zip(pred_seq, label_seq):
            if label != -100:
                curr_labels.append(id_to_label[label])
                curr_preds.append(id_to_label[pred])
        true_labels.append(curr_labels)
        true_predictions.append(curr_preds)
    
    precision = precision_score(true_labels, true_predictions)
    recall = recall_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions)
    # Uncomment the following line for a detailed report:
    # print(classification_report(true_labels, true_predictions))
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

### Training Arguments and Trainer Setup

Adjust the training parameters as needed.

In [14]:
training_args = TrainingArguments(
    output_dir="../report/bert-ner-video-comments",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='../report/logs',
    logging_steps=10,
    save_strategy="epoch",
)



In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

  trainer = Trainer(


### Training the Model

In [16]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.5262,0.52381,0.625496,0.528057,0.572661
2,0.4187,0.427282,0.641906,0.626047,0.633877
3,0.2362,0.425186,0.687881,0.660804,0.674071




TrainOutput(global_step=1284, training_loss=0.5447256332988679, metrics={'train_runtime': 365.0513, 'train_samples_per_second': 28.106, 'train_steps_per_second': 3.517, 'total_flos': 319729063427856.0, 'train_loss': 0.5447256332988679, 'epoch': 3.0})

---

## Evaluation

In [17]:
results = trainer.evaluate()
print("Evaluation results:", results)

Evaluation results: {'eval_loss': 0.4251856803894043, 'eval_precision': 0.6878814298169137, 'eval_recall': 0.6608040201005025, 'eval_f1': 0.6740709098675779, 'eval_runtime': 5.6937, 'eval_samples_per_second': 150.165, 'eval_steps_per_second': 18.793, 'epoch': 3.0}




In [None]:
# === Evaluation on Test Set ===
y_pred = predict_entities(test_texts)  # Should return List[List[str]] same as test_labels

print("NER Evaluation on Test Set:")
print(classification_report(test_labels, y_pred))

# Print metrics explicitly
precision = precision_score(test_labels, y_pred)
recall = recall_score(test_labels, y_pred)
f1 = f1_score(test_labels, y_pred)

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")

### Inference Example

Test the model on a new comment.

In [20]:
test_text = "This new update totally changed the way I see the future of tech!"
device = torch.device("cpu")
model.to(device)
inputs = tokenizer(test_text, return_tensors="pt")
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)
predicted_labels = [id_to_label[p] for p in predictions[0].tolist()]
tokens = tokenizer.tokenize(test_text)
print("Tokens:", tokens)
print("Predicted Labels:", predicted_labels)

Tokens: ['This', 'new', 'update', 'totally', 'changed', 'the', 'way', 'I', 'see', 'the', 'future', 'of', 'tech', '!']
Predicted Labels: ['False', 'False', 'False', 'False', 'false', 'neutral', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


---

### Discussion of Baseline Results

The BERT-NER model performs reasonably well on standard entities like people and organizations, but struggles with informal/slang terms and context-dependent mentions often seen in video comments.

This highlights the need for incorporating contextual embeddings and clustering approaches to handle variant spellings and implicit references, which we address in the extended model pipeline.