# Named Entity Recognition (NER) Project

## Problem Statement
This project implements Named Entity Recognition using the IOB2 tagging scheme. The dataset contains sentences with words mapped to their respective NER tags, where:
- **I**: Inside - word is inside a chunk
- **O**: Outside - word belongs to no chunk  
- **B**: Beginning - word is the beginning of a chunk

## Dataset Overview
- **Columns**:  
• Sentences # : sentence number     
• Word : word to be classified     
• POS : POS tags for respective word      
• Tag : NER tags for respective word      
- **Format**: Each sentence is broken down word by word with corresponding NER tags
- **Task**: Predict NER tags for each word in a sentence

## Approach
1. Data preprocessing and sentence preparation
2. Train/Validation/Test split (70/10/20%)
3. Baseline models will be:
 - Pretrained SpaCy NER (evaluate as-is)
 - Pretrained BERT NER (evaluate as-is)
4. Advanced model implementation we will fine-tune SpaCy and bert on training and validation set and compare on test.
5. Performance comparison and analysis

## Evaluation
Evaluating Named Entity Recognition (NER) models isn’t just about accuracy — because accuracy can be misleading when most tokens are “O” (non-entity).     
**Most Common:**      
Precision → Of the entities the model predicted, how many are correct?     
Recall → Of the true entities, how many did the model find?     
F1-score → Balance between precision and recall.     
Usually reported per entity type (e.g., PER, ORG, LOC) and as micro/macro averages.     
👉 These are the most standard usually done on entity rather than token level.     



In [2]:
#For uploading data on colab
from google.colab import files
uploaded = files.upload()

Saving ner_dataset.csv to ner_dataset.csv


In [1]:
import torch ##For colab
print(torch.cuda.is_available())  # should be True
print(torch.cuda.get_device_name(0))  # usually "Tesla T4"
# import torch ##for local
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

True
Tesla T4
cuda


##Install Dependencies

In [None]:
pip install seqeval,transformers,torch,spacy
# on terminal run: python -m spacy download en_core_web_lg

# Data Preparation

In [3]:
# Import data handling libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Load and examine the dataset
print("Loading dataset...")
df = pd.read_csv('ner_dataset.csv',encoding='ISO-8859-1')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst 10 rows:")
print(df.head(10))

print("\nDataset info:")
print(df.info())

print("\nUnique NER tags:")
print(df['Tag'].value_counts())


Loading dataset...
Dataset shape: (1048575, 4)
Columns: ['Sentence #', 'Word', 'POS', 'Tag']

First 10 rows:
    Sentence #           Word  POS    Tag
0  Sentence: 1      Thousands  NNS      O
1          NaN             of   IN      O
2          NaN  demonstrators  NNS      O
3          NaN           have  VBP      O
4          NaN        marched  VBN      O
5          NaN        through   IN      O
6          NaN         London  NNP  B-geo
7          NaN             to   TO      O
8          NaN        protest   VB      O
9          NaN            the   DT      O

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  47959 non-null    object
 1   Word        1048565 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB
None

Uni

In [5]:
# Forward-fill sentence identifiers
_df = df.copy()
_df['Sentence #'] = _df['Sentence #'].ffill()
_df.head(3)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O


In [6]:
_df.isnull().sum()

Unnamed: 0,0
Sentence #,0
Word,10
POS,0
Tag,0


In [7]:
# Drop rows without Word/Tag
_df = _df.dropna(subset=['Word', 'Tag'])
_df.isnull().sum()

Unnamed: 0,0
Sentence #,0
Word,0
POS,0
Tag,0


In [8]:
# Group by sentence and aggregate lists
sentence_df = _df.groupby('Sentence #').agg(lambda x: list(x)).reset_index()
print(f"Sentence-level dataframe created with shape: {sentence_df.shape}")
print('\n\n')
print(sentence_df.head(3))

Sentence-level dataframe created with shape: (47959, 4)



      Sentence #                                               Word  \
0    Sentence: 1  [Thousands, of, demonstrators, have, marched, ...   
1   Sentence: 10  [Iranian, officials, say, they, expect, to, ge...   
2  Sentence: 100  [Helicopter, gunships, Saturday, pounded, mili...   

                                                 POS  \
0  [NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ...   
1  [JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J...   
2  [NN, NNS, NNP, VBD, JJ, NNS, IN, DT, NNP, JJ, ...   

                                                 Tag  
0  [O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo...  
1  [B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,...  
2  [O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O...  


In [12]:
# Basic validation to ensure aligned lengths(lists in Word,POS and Tag columns should have same length)
_oks = all(len(w) == len(p) == len(t) for w, p, t in zip(sentence_df['Word'], sentence_df['POS'], sentence_df['Tag']))
_oks

True

In [9]:
# Summary stats from sentence_df
_lengths = sentence_df['Word'].apply(len)
print(f"Total number of sentences: {len(sentence_df)}")
print(f"Average sentence length: {float(np.mean(_lengths)):.2f} words")
print(f"Max sentence length: {int(np.max(_lengths))} words")
print(f"Min sentence length: {int(np.min(_lengths))} words")

# Show example rows
print("\nExample sentence-level rows:")
for i in range(3):
    print(f"Sentence id: {sentence_df.loc[i, 'Sentence #']}")
    print(f"Words: {sentence_df.loc[i, 'Word'][:]}")
    print(f"POS:   {sentence_df.loc[i, 'POS'][:]}")
    print(f"Tags:  {sentence_df.loc[i, 'Tag'][:]}")
    print()

Total number of sentences: 47959
Average sentence length: 21.86 words
Max sentence length: 104 words
Min sentence length: 1 words

Example sentence-level rows:
Sentence id: Sentence: 1
Words: ['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.']
POS:   ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP', 'TO', 'VB', 'DT', 'NN', 'IN', 'NNP', 'CC', 'VB', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']
Tags:  ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

Sentence id: Sentence: 10
Words: ['Iranian', 'officials', 'say', 'they', 'expect', 'to', 'get', 'access', 'to', 'sealed', 'sensitive', 'parts', 'of', 'the', 'plant', 'Wednesday', ',', 'after', 'an', 'IAEA', 'surveillance', 'system', 'begins', 'functioning', '.']
POS:   ['JJ', 'NNS', 'VBP',

In [10]:
# Split data into train/validation/test (70/10/20%)
print("Splitting data into train/validation/test sets...")

train_val_df, test_df = train_test_split(
    sentence_df, test_size=0.2, random_state=42, shuffle=True
)

train_df, val_df = train_test_split(
    train_val_df, test_size=0.125, random_state=42, shuffle=True  # 0.125 of 80% = 10%
)

print(f"Training sentences: {len(train_df)} ({len(train_df)/len(sentence_df)*100:.1f}%)")
print(f"Validation sentences: {len(val_df)} ({len(val_df)/len(sentence_df)*100:.1f}%)")
print(f"Test sentences: {len(test_df)} ({len(test_df)/len(sentence_df)*100:.1f}%)")

Splitting data into train/validation/test sets...
Training sentences: 33571 (70.0%)
Validation sentences: 4796 (10.0%)
Test sentences: 9592 (20.0%)


# Common Helper and Evaluation Setup

https://huggingface.co/spaces/evaluate-metric/seqeval

In [13]:
# Helper and evaluation setup
import os
import math
from typing import List, Tuple, Dict, Any
from seqeval.metrics import classification_report as seqeval_classification_report
from seqeval.metrics import f1_score as seqeval_f1
from seqeval.metrics import precision_score as seqeval_precision
from seqeval.metrics import recall_score as seqeval_recall

# Build plain-text sentence and word character offsets(start and end position of every word)
def words_to_text_and_offsets(words: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    text_parts: List[str] = []
    offsets: List[Tuple[int, int]] = []
    cursor: int = 0
    for i, w in enumerate(words):
        if i > 0:
            text_parts.append(" ")
            cursor += 1
        start = cursor
        text_parts.append(w)
        cursor += len(w)
        end = cursor
        offsets.append((start, end))
    return "".join(text_parts), offsets

# Map entity spans (start,end,label) back to IOB2 per word offsets
def spans_to_iob2(words: List[str], spans: List[Tuple[int, int, str]], offsets: List[Tuple[int, int]]) -> List[str]:
    # _, offsets = words_to_text_and_offsets(words)
    out: List[str] = ["O"] * len(words)
    for (s, e, lab) in spans:
        began = False
        for i, (ws, we) in enumerate(offsets):
            if we <= s:
                continue
            if ws >= e:
                break
            if not began:
                out[i] = f"B-{lab}"
                began = True
            else:
                out[i] = f"I-{lab}"
    return out

# evaluator given a predict function over words -> tags
def evaluate_on_dataframe(predict_fn, df_subset: pd.DataFrame, max_samples: int | None = None, desc: str = "Eval") -> Dict[str, Any]:
    y_true: List[List[str]] = []
    y_pred: List[List[str]] = []
    iterable = df_subset.itertuples(index=False)
    count = 0
    for row in iterable:
        words: List[str] = list(row.Word)
        gold: List[str] = list(row.Tag)
        pred: List[str] = predict_fn(words)
        if len(pred) != len(gold):
            # length guard: pad/truncate to align
            if len(pred) < len(gold):
                pred = pred + ["O"] * (len(gold) - len(pred))
            else:
                pred = pred[: len(gold)]
        y_true.append(gold)
        y_pred.append(pred)
        count += 1
        if max_samples is not None and count >= max_samples:
            break
    report = seqeval_classification_report(y_true, y_pred, zero_division=0, digits=4)
    metrics = {
        "precision": float(seqeval_precision(y_true, y_pred)),
        "recall": float(seqeval_recall(y_true, y_pred)),
        "f1": float(seqeval_f1(y_true, y_pred)),
        "report": report,
    }
    print(f"{desc} — precision: {metrics['precision']:.4f}, recall: {metrics['recall']:.4f}, f1: {metrics['f1']:.4f}")
    print(report)
    return metrics


## Baseline 1: Pretrained spaCy NER (en_core_web_sm, md, or lg)


In [15]:
# on terminal run: python -m spacy download en_core_web_lg
import spacy
print("Loading spaCy pretrained NER model...")
nlp = spacy.load("en_core_web_lg")

In [31]:
text, _ = words_to_text_and_offsets(sentence_df["Word"].iloc[0])
doc = nlp(text)
# Note the labels are different from our data
for ent in doc.ents:
    print(ent.text, ent.label_)

Thousands CARDINAL
London GPE
Iraq GPE
British NORP


In [22]:
LABEL_MAP = {
    "PERSON": "per",
    "ORG": "org",
    "GPE": "gpe",          # countries/cities/states
    "LOC": "geo",          # non-GPE locations, mountains, rivers
    "DATE": "tim",
    "TIME": "tim",
    "EVENT": "eve",
    "WORK_OF_ART": "art",
    "NORP": "nat",         # nationalities/religions; optional fit for your 'nat'
    # Optional extras if they appear; map or ignore:
    "LANGUAGE": "nat",     # or drop if not desired
}

def spacy_predict(words: List[str]) -> List[str]:
    text, offsets = words_to_text_and_offsets(words)
    doc = nlp(text)
    spans=[]
    for ent in doc.ents:
        mapped = LABEL_MAP.get(ent.label_)
        if mapped is None:
            continue
        spans.append((ent.start_char, ent.end_char, mapped))
    return spans_to_iob2(words, spans, offsets)

In [23]:
print("Evaluating spaCy on validation set as it is smaller than test")
spacy_val_metrics = evaluate_on_dataframe(spacy_predict, val_df, desc="spaCy Validation")

Evaluating spaCy on validation set as it is smaller than test
spaCy Validation — precision: 0.2555, recall: 0.2741, f1: 0.2645
              precision    recall  f1-score   support

         art     0.1290    0.0755    0.0952        53
         eve     0.1263    0.3429    0.1846        35
         geo     0.5862    0.0403    0.0755      3792
         gpe     0.0303    0.0751    0.0432      1517
         nat     0.0000    0.0000    0.0000         9
         org     0.4414    0.3634    0.3986      1929
         per     0.4248    0.4248    0.4248      1683
         tim     0.5164    0.6506    0.5758      2052

   micro avg     0.2555    0.2741    0.2645     11070
   macro avg     0.2818    0.2466    0.2247     11070
weighted avg     0.4432    0.2741    0.2736     11070



## Baseline 2: Pretrained BERT NER via transformers pipeline

In [24]:
from transformers import pipeline

In [32]:
text, _ = words_to_text_and_offsets(sentence_df["Word"].iloc[0])
ents = bert_ner(text)
# Note the labels are different from our data
for e in ents:
    print(str(e["entity_group"]))

LOC
LOC
MISC


In [25]:
print("Loading pretrained BERT NER pipeline (dslim/bert-base-NER)...")
bert_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple", device=0 if device.type=="cuda" else -1)
BERT_LABEL_MAP = {
    "PER": "per",
    "ORG": "org",
    "LOC": "geo",
    "MISC": "art",  # or choose better mapping for your data
    "GPE": "gpe"    # some pipelines may yield GPE
}
def bert_pipeline_predict(words: List[str]) -> List[str]:
    text, offsets = words_to_text_and_offsets(words)
    ents = bert_ner(text)
    spans = []
    for e in ents:
        src = str(e["entity_group"])
        mapped = BERT_LABEL_MAP.get(src)
        if mapped is None:
            continue
        spans.append((int(e["start"]), int(e["end"]), mapped))
    return spans_to_iob2(words, spans, offsets)

Loading pretrained BERT NER pipeline (dslim/bert-base-NER)...


config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cpu


In [26]:
print("Evaluating BERT pipeline on validation set as it is smaller than test")
bert_val_metrics = evaluate_on_dataframe(bert_pipeline_predict, val_df, desc="BERT-pipeline Validation")

Evaluating BERT pipeline on validation set as it is smaller than test
BERT-pipeline Validation — precision: 0.5157, recall: 0.4500, f1: 0.4807
              precision    recall  f1-score   support

         art     0.0055    0.2453    0.0107        53
         eve     0.0000    0.0000    0.0000        35
         geo     0.7929    0.8782    0.8333      3792
         gpe     0.0000    0.0000    0.0000      1517
         nat     0.0000    0.0000    0.0000         9
         org     0.6511    0.4624    0.5408      1929
         per     0.4381    0.4439    0.4410      1683
         tim     0.0000    0.0000    0.0000      2052

   micro avg     0.5157    0.4500    0.4807     11070
   macro avg     0.2359    0.2537    0.2282     11070
weighted avg     0.4517    0.4500    0.4468     11070



## Summary for Baseline Results

### **Metrics Summary**
**spaCy (pretrained)**: Precision ~0.256, Recall ~0.274, F1 ~0.265 on validation set (macro averages show similar trends).   
**BERT (pretrained pipeline)**: Precision ~0.516, Recall ~0.450, F1 ~0.481 on validation set, almost doubling spaCy’s F1.   

Scores are computed with seqeval on word-level IOB2 tags after mapping model outputs to standard tags.   

### **Why spaCy and BERT Were Selected as Baseline**
- spaCy was chosen for its ease of use, speed, and widespread adoption in production NLP pipelines, offering a lightweight yet effective baseline.    
- BERT was selected as a strong transformer-based baseline representing state-of-the-art contextual embeddings and high potential accuracy.    

### **Baseline Shortcomings (from the validation results)**
**Domain mismatch:** Both models were pretrained on general corpora (CoNLL, OntoNotes) and struggle with dataset-specific entities like art, eve, nat, leading to very low (often zero) recall and precision.    
**Tokenization misalignment:** BERT’s subword tokenizer may cause label projection errors, reducing recall on complex tokens.    
**Class imbalance:** Rare entity types (nat, eve, gpe, tim) suffer from near-zero detection by both baselines due to limited training examples.    
**Recall limits in spaCy:** spaCy shows moderate precision but very low recall on high-support classes (e.g., geo recall only 4%), indicating over-conservatism.    
**Limited recall on some BERT classes:** Although BERT performs better overall, it completely misses some classes like gpe, nat, and tim.    

### **How Fine-tuning Helps**
**Vocabulary and domain adaptation:** Fine-tuning aligns embeddings to dataset-specific entities and jargon, boosting recall and precision especially on common classes.     
**Label distribution learning:** Models learn dataset-specific priors and handle rare classes better.     
**Better calibration and optimization:** Task heads adapt to the exact label set and dataset patterns, improving F1 substantially.

### **Benefit of spaCy if it Had Performed Well**
spaCy offers a fast, memory-efficient, and production-ready pipeline that excels in real-time applications and easy integration—even without requiring heavy GPU infrastructure.

## Advance: Finetuning Hugging Face's bert-base-NER on our data

https://huggingface.co/dslim/bert-base-NER

In [16]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, Trainer, TrainingArguments, pipeline
from transformers.utils import logging as hf_logging
from datasets import Dataset

hf_logging.set_verbosity_error()

# Derive full tag set from training data
ALL_TAGS: List[str] = sorted({t for tags in train_df["Tag"] for t in set(tags)})
TAG2ID: Dict[str, int] = {t: i for i, t in enumerate(ALL_TAGS)}
ID2TAG: Dict[int, str] = {i: t for t, i in TAG2ID.items()}
MODEL_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Align labels with word-piece tokenization
def tokenize_and_align_labels(examples: Dict[str, List[Any]]):
    tokenized = tokenizer(examples["words"], is_split_into_words=True, truncation=True, padding=False)
    labels = []
    for i, word_labels in enumerate(examples["labels"]):
        word_ids = tokenized.word_ids(batch_index=i)
        prev_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)
            elif word_id != prev_word_id:
                label_ids.append(TAG2ID[word_labels[word_id]])
            else:
                # For subsequent wordpieces, set to -100 to ignore
                label_ids.append(-100)
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

# Build Hugging Face Datasets from existing dataframes
def to_hf_dataset(df_in: pd.DataFrame) -> Dataset:
    records = [{"words": words, "labels": tags} for words, tags in zip(df_in["Word"], df_in["Tag"])]
    return Dataset.from_list(records)



tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [17]:
# Prepare datasets for BERT fine-tuning
hf_train = to_hf_dataset(train_df)
hf_val = to_hf_dataset(val_df)
hf_test = to_hf_dataset(test_df)

hf_train_tok = hf_train.map(tokenize_and_align_labels, batched=True)
hf_val_tok = hf_val.map(tokenize_and_align_labels, batched=True)
hf_test_tok = hf_test.map(tokenize_and_align_labels, batched=True)

data_collator = DataCollatorForTokenClassification(tokenizer)

num_labels = len(ALL_TAGS)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=num_labels, id2label=ID2TAG, label2id=TAG2ID)
model.to(device)

Map:   0%|          | 0/33571 [00:00<?, ? examples/s]

Map:   0%|          | 0/4796 [00:00<?, ? examples/s]

Map:   0%|          | 0/9592 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [21]:
# Fine-tune BERT token classifier
EPOCHS = 2
BATCH_SIZE = 16
LR = 5e-5
WARMUP_RATIO = 0.1

logging_steps = max(10, len(hf_train_tok) // (BATCH_SIZE * 5))

training_args = TrainingArguments(
    output_dir="./ner-bert-output",
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    eval_strategy="epoch",  # Corrected argument name
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=logging_steps,
    fp16=(device.type == "cuda"),
    report_to=["tensorboard"], # Changed to report to TensorBoard
)

# compute metrics using seqeval at token level -> convert back to entities via simple IOB compare
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    true_labels = []
    pred_labels = []
    # Corrected line to convert hf_val["words"] to a list before concatenation
    for p_row, l_row, words in zip(preds, labels, list(hf_val["words"]) + hf_train["words"][:0]):
        # labels contain -100 for subwords; filter
        true_seq = []
        pred_seq = []
        for p, l in zip(p_row.tolist(), l_row.tolist()):
            if l == -100:
                continue
            true_seq.append(ID2TAG[l])
            pred_seq.append(ID2TAG[p])
        true_labels.append(true_seq)
        pred_labels.append(pred_seq)
    return {
        "precision": seqeval_precision(true_labels, pred_labels),
        "recall": seqeval_recall(true_labels, pred_labels),
        "f1": seqeval_f1(true_labels, pred_labels),
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [22]:
trainer.train()

{'loss': 0.0655, 'grad_norm': 1.052993655204773, 'learning_rate': 4.50214387803716e-05, 'epoch': 0.19961886612672702}
{'loss': 0.0516, 'grad_norm': 0.717755913734436, 'learning_rate': 4.003096712720344e-05, 'epoch': 0.39923773225345405}
{'loss': 0.0503, 'grad_norm': 2.1456356048583984, 'learning_rate': 3.504049547403526e-05, 'epoch': 0.598856598380181}
{'loss': 0.0482, 'grad_norm': 1.074541687965393, 'learning_rate': 3.0050023820867078e-05, 'epoch': 0.7984754645069081}
{'loss': 0.0487, 'grad_norm': 1.1420683860778809, 'learning_rate': 2.505955216769891e-05, 'epoch': 0.9980943306336351}
{'eval_loss': 0.10318918526172638, 'eval_precision': 0.8104649138081143, 'eval_recall': 0.8409214092140921, 'eval_f1': 0.8254123071466571, 'eval_runtime': 5.3707, 'eval_samples_per_second': 892.991, 'eval_steps_per_second': 55.859, 'epoch': 1.0}
{'loss': 0.0519, 'grad_norm': 1.3102813959121704, 'learning_rate': 2.006908051453073e-05, 'epoch': 1.197713196760362}
{'loss': 0.0469, 'grad_norm': 0.59472066164

TrainOutput(global_step=4198, training_loss=0.049501477848523684, metrics={'train_runtime': 392.2279, 'train_samples_per_second': 171.181, 'train_steps_per_second': 10.703, 'train_loss': 0.049501477848523684, 'epoch': 2.0})

In [32]:
# Evaluate fine-tuned BERT on validation and test
def bert_finetuned_predict(words: List[str]) -> List[str]:
    # tokenize as a single sentence
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)

    # Get word_ids from the tokenized input BEFORE moving to device
    word_ids = enc.word_ids(batch_index=0)

    # Move tensors to device
    enc = {k: v.to(device) for k, v in enc.items()}

    with torch.no_grad():
        logits = model(**enc).logits[0] ##model call

    preds: List[str] = []
    previous_word_id = None
    for token_id, wid in enumerate(word_ids):
        if wid is None:
            # Special tokens
            continue
        elif wid != previous_word_id:
            # First token of a new word
            label_id = int(torch.argmax(logits[token_id]).item())
            preds.append(ID2TAG[label_id])
        else:
            # Subsequent tokens of the same word, append "O" or handle based on IOB2 rules if needed
            # For this task, we take the first token's prediction for the word
            pass # We already appended the prediction for the first token
        previous_word_id = wid

    # length guard: This is less likely to be needed with the improved logic, but keep for safety
    if len(preds) != len(words):
        warnings.warn(f"Prediction length mismatch: predicted {len(preds)}, expected {len(words)}. Padding/truncating.")
        if len(preds) < len(words):
            preds = preds + ["O"] * (len(words) - len(preds))
        else:
            preds = preds[: len(words)]

    return preds

In [33]:
# Example usage of the fine-tuned BERT model

sample_sentence = "Barack Obama visited London last week."
sample_words = sample_sentence.split() # Simple split for demonstration

predicted_tags = bert_finetuned_predict(sample_words)

print(f"Sentence: {sample_sentence}")
print(f"Words:    {sample_words}")
print(f"Predicted Tags: {predicted_tags}")

Sentence: Barack Obama visited London last week.
Words:    ['Barack', 'Obama', 'visited', 'London', 'last', 'week.']
Predicted Tags: ['B-per', 'I-per', 'O', 'B-geo', 'O', 'O']


In [34]:
print("Evaluating fine-tuned BERT on validation set...")
finetuned_val_metrics = evaluate_on_dataframe(bert_finetuned_predict, val_df, desc="BERT-finetuned Validation")

Evaluating fine-tuned BERT on validation set...
BERT-finetuned Validation — precision: 0.8302, recall: 0.8482, f1: 0.8391
              precision    recall  f1-score   support

         art     0.4118    0.2642    0.3218        53
         eve     0.3111    0.4000    0.3500        35
         geo     0.8520    0.9001    0.8754      3792
         gpe     0.9561    0.9473    0.9517      1517
         nat     0.4286    0.6667    0.5217         9
         org     0.7100    0.7133    0.7117      1929
         per     0.7822    0.8087    0.7952      1683
         tim     0.8710    0.8621    0.8665      2052

   micro avg     0.8302    0.8482    0.8391     11070
   macro avg     0.6653    0.6953    0.6742     11070
weighted avg     0.8303    0.8482    0.8389     11070




## Performance Explanation     
### Overall Micro Averages:
Precision: 0.8302    
Recall: 0.8482    
F1-score: 0.8391    
These indicate balanced and high-quality predictions on the validation set, with recall slightly higher than precision (good for NER as capturing more entities is often desired).

### Class-wise Detail:   
**geo** (Geographical names): Excellent F1=0.8754 with strong precision and recall (~85-90%). This is the most frequent class, and the model handles it very well.    
**gpe** (Geopolitical entities): Very strong results (F1=0.9517).   
**per** (Person) and **org** (Organization) also have solid performances (F1 ~0.79 and 0.71 respectively).   
**tim** (Time) is very well recognized (F1=0.87).    
Rare classes nat (nature), art (art), and eve (event) have lower F1-scores (0.52, 0.32, 0.35). Though these classes improved from previous baselines, they remain challenging due to low support (number of occurrences).

### Macro and Weighted Averages:
**Macro average F1** (average over classes without weighting): 0.6742 indicates room for improvement, especially in rare classes.   
**Weighted average F1** (~0.839) closely matches micro average, reflecting dominance of frequent classes.



In [40]:
#This will overwrite epochs parameter, trying once more with increased epochs
EPOCHS = 3

In [41]:
# Re-load the pretrained model
num_labels = len(ALL_TAGS)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=num_labels, id2label=ID2TAG, label2id=TAG2ID)
model.to(device)

# Re-initialize the Trainer
# Ensure EPOCHS and other training_args are set as desired before this cell
training_args = TrainingArguments(
    output_dir="./ner-bert-output",
    learning_rate=LR, # Ensure LR is set
    per_device_train_batch_size=BATCH_SIZE, # Ensure BATCH_SIZE is set
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS, # Ensure EPOCHS is set to the desired number for the fresh run
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=logging_steps, # Ensure logging_steps is set
    fp16=(device.type == "cuda"),
    report_to=["tensorboard"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

{'loss': 0.2046, 'grad_norm': 1.0525790452957153, 'learning_rate': 4.6680959186914405e-05, 'epoch': 0.19961886612672702}
{'loss': 0.1127, 'grad_norm': 0.9182557463645935, 'learning_rate': 4.335397808480229e-05, 'epoch': 0.39923773225345405}
{'loss': 0.1066, 'grad_norm': 1.607831597328186, 'learning_rate': 4.002699698269018e-05, 'epoch': 0.598856598380181}
{'loss': 0.0984, 'grad_norm': 1.0379698276519775, 'learning_rate': 3.670001588057805e-05, 'epoch': 0.7984754645069081}
{'loss': 0.0925, 'grad_norm': 0.7895373702049255, 'learning_rate': 3.337303477846594e-05, 'epoch': 0.9980943306336351}
{'eval_loss': 0.09240315109491348, 'eval_precision': 0.8153303964757709, 'eval_recall': 0.8359530261969287, 'eval_f1': 0.8255129348795718, 'eval_runtime': 5.7659, 'eval_samples_per_second': 831.785, 'eval_steps_per_second': 52.03, 'epoch': 1.0}
{'loss': 0.0745, 'grad_norm': 1.303756833076477, 'learning_rate': 3.0046053676353818e-05, 'epoch': 1.197713196760362}
{'loss': 0.0702, 'grad_norm': 0.516150414

In [43]:
print("Evaluating fine-tuned BERT on validation set with epochs=3")
finetuned_val_metrics = evaluate_on_dataframe(bert_finetuned_predict, val_df, desc="BERT-finetuned Validation")

Evaluating fine-tuned BERT on validation set with epochs=3
BERT-finetuned Validation — precision: 0.8349, recall: 0.8491, f1: 0.8419
              precision    recall  f1-score   support

         art     0.3902    0.3019    0.3404        53
         eve     0.3158    0.3429    0.3288        35
         geo     0.8607    0.8995    0.8797      3792
         gpe     0.9568    0.9479    0.9523      1517
         nat     0.3846    0.5556    0.4545         9
         org     0.7165    0.7206    0.7185      1929
         per     0.7839    0.7974    0.7906      1683
         tim     0.8720    0.8699    0.8709      2052

   micro avg     0.8349    0.8491    0.8419     11070
   macro avg     0.6601    0.6794    0.6670     11070
weighted avg     0.8348    0.8491    0.8417     11070



**Per-Class Observations**
Slight overall improvements with 3 epochs in precision, recall, and F1 (micro and weighted averages).     
Macro average F1 decreased slightly from 0.6742 to 0.6670, indicating slight degradation in rare classes (art, nat).

**Rare classes show mixed signals:**   
*art* improved recall but precision slightly dropped, netting a small F1 gain (0.3218 → 0.3404).    
*nat* dropped notably on both precision and recall, lowering F1 from 0.5217 to 0.4545.

**Interpretation**   
-The extra epoch marginally improves overall model balance and frequent classes.    
-There may be some trade-offs with rare classes due to potential overfitting or model shifting focus.    
-In practice, 3 epochs seems a slight improvement in global metrics, but monitoring rare class performance closely is vital.

**Recommendation**    
-Given primary objective (maximize overall F1), 3 epochs seems the best choice.   
-If rare class detection is critical, consider additional targeted approaches (data augmentation, class-weighted loss) along with monitoring training for class-specific performance.    
-Use early stopping or validation checkpoints to balance overall and rare class learning.

In [44]:
print("Evaluating fine-tuned BERT on test set...")
finetuned_test_metrics = evaluate_on_dataframe(bert_finetuned_predict, test_df, desc="BERT-finetuned Test")

Evaluating fine-tuned BERT on test set...
BERT-finetuned Test — precision: 0.8438, recall: 0.8553, f1: 0.8495
              precision    recall  f1-score   support

         art     0.2254    0.1702    0.1939        94
         eve     0.3793    0.3143    0.3438        70
         geo     0.8626    0.9086    0.8850      7558
         gpe     0.9628    0.9484    0.9556      3142
         nat     0.5517    0.4000    0.4638        40
         org     0.7465    0.7234    0.7348      4151
         per     0.7858    0.8091    0.7973      3400
         tim     0.8822    0.8869    0.8845      4077

   micro avg     0.8438    0.8553    0.8495     22532
   macro avg     0.6745    0.6451    0.6573     22532
weighted avg     0.8424    0.8553    0.8485     22532



## Test Set Performance Overview
**Micro Average:**  

-Precision: 0.8438   
-Recall: 0.8553   
-F1-score: 0.8495   
These show the model accurately predicts entities with balanced precision and recall on the larger test set, even slightly improved compared to validation.

**Class-wise Breakdown:**   

-geo (Geographic): Highest support (7558), strong F1 of 0.8850 reflecting excellent recognition.   
-gpe (Geopolitical): Also high support (3142) with F1 of 0.9556, very accurate.   
-per (Person) and org (Organization) show good performance (F1 ~0.80 and 0.73).   
-tim (Time) has very high F1 (0.8845).   
-Rare classes like nat, art, and eve still have lower F1 (0.46, 0.19, 0.34) due to fewer examples and inherent difficulty.   

**Macro Average F1** (0.6573) is lower than micro, highlighting challenges in rare/low-support classes.

**Weighted Average F1**(0.8485) closely tracks micro-average, driven by frequent classes.

## Interpretation and Next Steps
-The model performs strongly overall and generalizes well from validation to test without overfitting.   
-Rare classes remain the main challenge; pushing improvements here requires more balanced data, augmentation, or specialized loss functions.   
-High-frequency classes (geo, gpe, per, tim) are robustly detected, making the model reliable for most practical uses.   
-Consider strategies targeted at rare classes if critical for your application.

In [46]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Define the directory where the model was saved
save_directory = "./ner-bert-output/final_model"

# Load the model and tokenizer
loaded_model = AutoModelForTokenClassification.from_pretrained(save_directory)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)

# Move the loaded model to the appropriate device (GPU if available)
loaded_model.to(device)

print(f"Model and tokenizer loaded from {save_directory}")

Model and tokenizer loaded from ./ner-bert-output/final_model


In [47]:
# Use the loaded model to predict NER tags for a new sentence

def predict_with_loaded_model(sentence: str, model, tokenizer, id2tag: Dict[int, str], device) -> List[str]:
    words = sentence.split() # Simple split for demonstration
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    enc = {k: v.to(device) for k, v in enc.items()}

    with torch.no_grad():
        logits = model(**enc).logits[0]

    wi = tokenizer(words, is_split_into_words=True).word_ids()
    preds: List[str] = []
    token_i = 0
    seen_word = set()
    for idx, wid in enumerate(wi):
        if wid is None:
            continue
        if wid in seen_word:
            continue
        seen_word.add(wid)
        while token_i < logits.size(0) and logits[token_i].shape[-1] != len(id2tag):
            token_i += 1
        if token_i >= logits.size(0):
            preds.append("O")
            continue
        label_id = int(torch.argmax(logits[token_i]).item())
        preds.append(id2tag[label_id])
        token_i += 1

    # length guard
    if len(preds) != len(words):
        if len(preds) < len(words):
            preds = preds + ["O"] * (len(words) - len(preds))
        else:
            preds = preds[: len(words)]

    return preds

# Sample sentence for prediction
test_sentence = "Apple Inc. is looking at buying U.K. startup for $1 billion."

# Get predictions using the loaded model
predicted_tags_loaded = predict_with_loaded_model(test_sentence, loaded_model, loaded_tokenizer, ID2TAG, device)

print(f"Sentence: {test_sentence}")
print(f"Predicted Tags (Loaded Model): {predicted_tags_loaded}")

Sentence: Apple Inc. is looking at buying U.K. startup for $1 billion.
Predicted Tags (Loaded Model): ['O', 'B-org', 'I-org', 'I-org', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'I-geo']
