# Window size model (approach 2): baseline

This notebook implements the sliding window baseline for near real-time Named Entity Recognition (NER).

## 1. Setup and preparation

First, let's import the necessary libraries and set up our environment.

In [2]:
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [15]:
from datasets import load_dataset
import random

## 2. Loading the dataset

In [5]:
# Load the English portion of OntoNotes 5.0
ontonotes = load_dataset(
    "conll2012_ontonotesv5",
    "english_v12",
    cache_dir="./dataset/ontonotes",
)
print(f"Dataset loaded with splits: {ontonotes.keys()}")

Dataset loaded with splits: dict_keys(['train', 'validation', 'test'])


## 3. Creating all windows of size 6 across the test split

In [16]:
windows = []
SPAN_LENGTH = 6

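# Iterate through the test split
for doc in ontonotes["test"]:
    # Start a fresh window for each document; within a document,
    # windows may span sentence boundaries
    curr_window = []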
    # Fix: Sometimes doc['sentences'] is a list of lists, so we need to flatten it
    if isinstance(doc['sentences'], list) and isinstance(doc['sentences'][0], list):
        doc['sentences'] = [sentence for sublist in doc['sentences'] for sentence in sublist]
    for sentence in doc["sentences"]:
        for word in sentence['words']:
            curr_window.append(word)
            # If the current window reaches the defined span length, add it to the list
            if len(curr_window) == SPAN_LENGTH:
                windows.append(curr_window.copy())
                curr_window = curr_window[1:]  # Slide the window by one word

print(f"Total windows created: {len(windows)}")
print(f"Example window: {windows[random.randint(0, len(windows) - 1)]}")
    

Total windows created: 224128
Example window: ['We', 'have', 'the', 'right', 'to', 'bring']
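
The windowing step above can also be written as a small reusable generator. The following is a minimal sketch rather than the notebook's own implementation: the helper name `sliding_windows` and the toy token list are hypothetical, and it assumes the tokens of one document have already been flattened into a single sequence. It uses `collections.deque` with `maxlen` so that the oldest token is dropped automatically when a new one arrives.

```python
from collections import deque
from typing import Iterable, Iterator, List


def sliding_windows(tokens: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield every run of `size` consecutive tokens, sliding by one token.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    # A bounded deque discards its oldest item once it is full.
    window = deque(maxlen=size)
    for token in tokens:
        window.append(token)
        if len(window) == size:
            # Copy to a list so later appends do not alter emitted windows.
            yield list(window)


# Toy token sequence (made up for this example, not taken from the dataset).
toy_tokens = ["We", "have", "the", "right", "to", "bring", "it", "up"]
toy_windows = list(sliding_windows(toy_tokens, 6))
for w in toy_windows:
    print(w)
```

Because the window is reset for every document, a document with `n` tokens contributes `max(0, n - 6 + 1)` windows; the eight toy tokens above give three, and a document shorter than six tokens gives none.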
