# Always-On NER Baseline

This notebook implements the "always-on" baseline for near real-time Named Entity Recognition (NER). In this approach, we run NER every time a new token arrives, with no selective inference strategy to skip or defer model calls.

Because the model is invoked once per incoming token, this is the most computationally expensive approach, but it gives us the earliest possible detection of entities.
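The always-on loop described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it assumes that "running NER as each token arrives" means re-running the tagger on the growing prefix of the stream, and `always_on_ner` and `toy_ner` are hypothetical names. The toy tagger only stands in for a real model so the sketch runs without downloading anything.

```python
from typing import Callable, List


def always_on_ner(tokens: List[str], ner_fn: Callable[[List[str]], list]) -> List[list]:
    """Call ner_fn once per arriving token, on the prefix seen so far.

    Assumption: the tagger is re-run on the whole prefix at every step.
    """
    results = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]              # everything received up to this step
        results.append(ner_fn(prefix))   # one model call per incoming token
    return results


def toy_ner(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in tagger: treats capitalized tokens as entities."""
    return [t for t in tokens if t[:1].isupper()]


stream = ["John", "works", "at", "Microsoft"]
steps = always_on_ner(stream, toy_ner)
for i, step in enumerate(steps, start=1):
    print(f"after token {i}: {step}")
```

With a stream of n tokens this makes n tagger calls, which is why the baseline is costly. In the notebook, a wrapper around the real pipeline would take the place of `toy_ner`.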

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [1]:
%pip install datasets transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import sys


In [3]:
sys.path.append("./src/")
from utils import convert_predictions

## 2. Load OntoNotes Dataset

We'll use the OntoNotes 5.0 dataset, which contains texts from various genres with named entity annotations.

In [8]:
# Load the English portion of OntoNotes 5.0
ontonotes = load_dataset(
    "conll2012_ontonotesv5",
    "english_v12",
    cache_dir="./dataset/ontonotes",
)
print(f"Dataset loaded with splits: {ontonotes.keys()}")

# Get basic statistics for each split
for split_name in ontonotes.keys():
    print(f"{split_name.capitalize()} set: {len(ontonotes[split_name])} examples")

Dataset loaded with splits: dict_keys(['train', 'validation', 'test'])
Train set: 10539 examples
Validation set: 1370 examples
Test set: 1200 examples
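Each example in this dataset is a whole document rather than a single sentence. The sketch below shows one way to pull out per-sentence tokens and entity tags. It assumes the field names `sentences`, `words`, and `named_entities` used by the Hub version of the dataset, with `named_entities` stored as integer class ids. The helper `iter_sentences` and the small `fake_doc` record are hypothetical and exist only so the example runs without the download.

```python
from typing import Dict, Iterator, List, Tuple


def iter_sentences(document: Dict) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (words, named_entity_ids) for each sentence in one document.

    Assumes the 'sentences' / 'words' / 'named_entities' layout.
    """
    for sentence in document["sentences"]:
        yield sentence["words"], sentence["named_entities"]


# Hypothetical record shaped like one dataset example; the ids are made up.
fake_doc = {
    "document_id": "demo",
    "sentences": [
        {"words": ["John", "Smith", "spoke"], "named_entities": [1, 2, 0]},
        {"words": ["He", "left"], "named_entities": [0, 0]},
    ],
}

pairs = list(iter_sentences(fake_doc))
for words, tag_ids in pairs:
    print(words, tag_ids)
```

On the real data the same function could be applied to an element of `ontonotes["train"]`. The integer ids would still need to be mapped to label names through the dataset's feature metadata.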





## 3. Set Up NER Model

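We'll use a pre-trained transformer model for NER. For this baseline, we'll use a BERT-based model, `dslim/bert-base-NER`. Note that this checkpoint was fine-tuned on CoNLL-2003 and predicts four entity types (PER, ORG, LOC, MISC), so its labels do not map one-to-one onto the richer OntoNotes tag set.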

In [5]:
# Load pre-trained NER model
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Wrapper function for NER that takes tokens and returns predictions
def run_ner_on_tokens(tokens):
    """Run NER on a list of tokens."""
    text = " ".join(tokens)
    pipeline_output = ner_pipeline(text)
    return pipeline_output

# Test NER model on a sample sentence
sample_text = "John Smith works at Microsoft in Seattle."
print(f"Sample text: {sample_text}")
test_output = ner_pipeline(sample_text)
print(f"NER pipeline output: {test_output}")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Sample text: John Smith works at Microsoft in Seattle.
NER pipeline output: [{'entity_group': 'PER', 'score': np.float32(0.9996886), 'word': 'John Smith', 'start': 0, 'end': 10}, {'entity_group': 'ORG', 'score': np.float32(0.9989378), 'word': 'Microsoft', 'start': 20, 'end': 29}, {'entity_group': 'LOC', 'score': np.float32(0.9988439), 'word': 'Seattle', 'start': 33, 'end': 40}]


In [7]:
import re

# Strip punctuation so the whitespace-split tokens line up with the entity
# words returned by the pipeline, then split the sample text into tokens
tokens = re.sub(r"[.,?!]+", "", sample_text).split()

bio_tags = convert_predictions(tokens, test_output)
print(f"Tokens: {tokens}")
print(f"Converted BIO tags: {bio_tags}")

Tokens: ['John', 'Smith', 'works', 'at', 'Microsoft', 'in', 'Seattle']
Converted BIO tags: ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC']
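The source of `convert_predictions` is not shown in this notebook. The sketch below is one plausible way to do this kind of conversion, not the actual implementation in `utils`. The hypothetical function `spans_to_bio` assumes that the entity `start`/`end` offsets refer to the tokens joined by single spaces, and it marks every token overlapping an entity span with `B-` or `I-`.

```python
from typing import Dict, List


def spans_to_bio(tokens: List[str], entities: List[Dict]) -> List[str]:
    """Map character-level entity spans onto whitespace tokens as BIO tags."""
    # Character offsets of each token, assuming tokens are joined by one space.
    offsets = []
    cursor = 0
    for token in tokens:
        offsets.append((cursor, cursor + len(token)))
        cursor += len(token) + 1  # skip the joining space

    tags = ["O"] * len(tokens)
    for entity in entities:
        first = True
        for i, (start, end) in enumerate(offsets):
            # A token belongs to the entity if the two character spans overlap.
            if start < entity["end"] and end > entity["start"]:
                tags[i] = ("B-" if first else "I-") + entity["entity_group"]
                first = False
    return tags


demo_tokens = ["John", "Smith", "works", "at", "Microsoft", "in", "Seattle"]
demo_entities = [
    {"entity_group": "PER", "start": 0, "end": 10},
    {"entity_group": "ORG", "start": 20, "end": 29},
    {"entity_group": "LOC", "start": 33, "end": 40},
]
demo_tags = spans_to_bio(demo_tokens, demo_entities)
print(demo_tags)
```

The offset assumption holds for the sample sentence because only the trailing period was removed. If punctuation were stripped from the middle of the text, the pipeline's offsets would no longer line up with the tokens and a more careful alignment would be needed.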
