# CRF + expert.ai Edge NL API for Named Entity Recognition

> **IMPORTANT NOTICE**: This notebook was developed using the expert.ai Edge NL API, which was **retired in 2023**. The code is preserved for educational and archival purposes but will not function without access to the legacy Edge NL API runtime.
>
> For information about current expert.ai NLP solutions, please contact **info@expert.ai** or visit [expert.ai](https://www.expert.ai).

---

## Overview

This notebook demonstrates how to enhance Named Entity Recognition (NER) performance using a **Conditional Random Field (CRF)** model combined with linguistic features extracted from the **expert.ai Edge NL API**.

### Approach
- **Dataset**: CoNLL 2003 corpus (standard benchmark for NER tasks)
- **Feature Engineering**: Linguistic features (POS tags, dependency labels, syncons, knowledge labels) from expert.ai's NLP engine
- **Model**: CRF (Conditional Random Field) via `sklearn-crfsuite`

### Key Results
The model achieves the following F1 scores on the CoNLL 2003 test set:
- **LOC** (Location): 0.893
- **PER** (Person): 0.882
- **ORG** (Organization): 0.801
- **MISC** (Miscellaneous): 0.747
- **Overall micro-averaged F1**: 0.845

---

## Data Preparation

The **CoNLL 2003** corpus is a standard benchmark dataset for Named Entity Recognition tasks. It contains news articles annotated with four entity types:
- **PER** (Person names)
- **LOC** (Locations)
- **ORG** (Organizations)
- **MISC** (Miscellaneous entities)

The dataset uses the **BIO tagging scheme** (Beginning-Inside-Outside) to mark entity boundaries.
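
As a minimal sketch of how BIO tags map to entity spans (the helper name `bio_to_spans` is hypothetical and not part of this notebook), the following decodes a label sequence into `(entity_type, start, end)` tuples with an exclusive `end`. It assumes a lenient reading in which a stray `I-` tag opens a new entity.

```python
def bio_to_spans(labels):
    """Convert a BIO label sequence into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an open entity of
    the same type is treated as the start of a new entity (lenient choice).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(labels):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            # Close the entity that was open, if any
            if current is not None:
                spans.append((current, start, i))
            # Open a new entity on B- (or on a stray I-), otherwise reset
            if prefix in ("B", "I"):
                start, current = i, etype
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
labels = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
for etype, start, end in bio_to_spans(labels):
    print(etype, tokens[start:end])
```

Working with spans rather than individual tags is what makes entity-level evaluation possible: a prediction only counts as correct when both the type and the boundaries match.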

The corpus is downloaded from the [nluninja/nlp_datasets](https://github.com/nluninja/nlp_datasets) repository.

### Helper Functions for CoNLL Corpus Processing

The following functions handle loading and parsing the CoNLL 2003 data format, where each line contains a token with its POS tag, syntactic chunk tag, and NER label.
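
To make the file layout concrete without downloading anything, here is a small sketch that parses an in-memory snippet in the same four-column format. The `SAMPLE` text and the `parse_conll` helper are illustrative assumptions, not the loader used in this notebook.

```python
# Illustrative snippet in CoNLL 2003 layout: word, POS tag, chunk tag, NER tag
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text):
    """Split CoNLL-formatted text into sentences of (word, pos, chunk, ner)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        if word == "-DOCSTART-":
            continue  # Document separator, not real content
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)  # Flush a sentence with no trailing blank line
    return sentences


sentences = parse_conll(SAMPLE)
print(len(sentences), "sentences")
print(sentences[1])
```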

In [None]:
# URL to the CoNLL 2003 dataset hosted on GitHub
CONLL_URL_ROOT = "https://raw.githubusercontent.com/nluninja/nlp_datasets/be9fd23409f1443790f6e1eab91d28b105769368/conll2003/data/"

In [None]:
# Standard library and third-party imports
import os
import re
import urllib.request  # urlopen lives in the urllib.request submodule
import pandas as pd
from math import nan

# CRF model and evaluation libraries
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer

In [None]:
def load_conll_data(filename, url_root=CONLL_URL_ROOT, only_tokens=False):
    """
    Load and parse a CoNLL 2003 formatted dataset file.
    
    The CoNLL format structures data with one word per line, where each line
    contains: word, POS tag, syntactic chunk tag, and NER entity tag, all
    separated by whitespace. Sentences are separated by empty lines.
    
    Args:
        filename: Name of the file to load (e.g., 'train.txt', 'test.txt')
        url_root: Base URL where the dataset files are hosted
        only_tokens: If True, return only token strings; if False, return
                     tuples containing all features (word, POS, chunk tag)
    
    Returns:
        X: List of sentences, where each sentence is a list of tokens/tuples
        Y: List of label sequences corresponding to each sentence
        output_labels: Set of all unique NER labels found in the data
    """
    lines = read_raw_conll(url_root, filename)
    X = []
    Y = []
    sentence = []
    labels = []
    output_labels = set()
    
    for line in lines:
        if not line.strip():
            # Empty line indicates end of sentence
            if len(sentence) != len(labels):
                print(f"Error: we have {len(sentence)} words but {len(labels)} labels")
            if sentence and is_real_sentence(only_tokens, sentence):
                X.append(sentence)
                Y.append(labels)
            sentence = []
            labels = []
        else:
            # Parse line: word POS chunk_tag NER_tag
            features = line.split()
            tag = features.pop()  # Last element is the NER tag
            labels.append(tag)
            output_labels.add(tag)
            if only_tokens:
                sentence.append(features.pop(0))  # First element is the word
            else:
                sentence.append(tuple(features))
    
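    # Flush the final sentence if the file does not end with a blank line
    if sentence and is_real_sentence(only_tokens, sentence):
        X.append(sentence)
        Y.append(labels)
    print(f"Read {len(X)} sentences")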
    if len(X) != len(Y):
        print("ERROR in reading data.")
    return X, Y, output_labels

In [None]:
def read_raw_conll(url_root, filename):
    """
    Fetch and read a CoNLL 2003 dataset file from a URL.
    
    Args:
        url_root: Base URL where the file is hosted
        filename: Name of the file to read
        
    Returns:
        List of lines from the file, excluding the first two header lines
    """
    lines = []
    full_url = url_root + filename
    lines = open_read_from_url(full_url)
    return lines[2:]  # Skip the header lines

In [None]:
def open_read_from_url(url):
    """
    Download and read a text file from a URL.
    
    Args:
        url: Full URL to the text file
        
    Returns:
        List of lines from the file, decoded as UTF-8
    """
    print(f"Read file from {url}")
    # Close the connection once all lines are read and decoded as UTF-8
    with urllib.request.urlopen(url) as response:
        return [line.decode("utf-8") for line in response]

In [None]:
def is_real_sentence(only_token, sentence):
    """
    Check if a sentence is actual content or a document separator.
    
    CoNLL 2003 uses '-DOCSTART-' markers and dashed lines to separate
    documents within the corpus. These should be filtered out.
    
    Args:
        only_token: Whether the sentence contains only token strings
        sentence: The sentence to check
        
    Returns:
        True if this is a real sentence, False if it's a document separator
    """
    first_word = ""
    if only_token:
        first_word = sentence[0]
    else:
        first_word = sentence[0][0]

    if '---------------------' in first_word or first_word == '-DOCSTART-':
        return False
    else:
        return True

### Data Loading

Load the train, validation, and test splits of the CoNLL 2003 dataset. We extract only the tokens (words) since the expert.ai API will provide its own linguistic annotations.

In [None]:
# Load the three dataset splits (train, validation, test)
# Setting only_tokens=True to get just the word strings, as expert.ai will
# provide its own POS tags and other linguistic features
raw_train, y_train, output_labels = load_conll_data('train.txt', only_tokens=True)
raw_valid, y_valid, _ = load_conll_data('valid.txt', only_tokens=True)
raw_test, y_test, _ = load_conll_data('test.txt', only_tokens=True)

In [None]:
# Example: Display first sentence and its NER labels
# Labels use BIO format: B-ORG = Beginning of Organization, O = Outside (not an entity)
print("Sentence:", raw_train[0])
print("Labels:  ", y_train[0])

---

## Feature Generation with expert.ai Edge NL API

> **NOTE**: This section requires the expert.ai Edge NL API runtime, which was **discontinued in 2023**. The code below will not execute without access to the legacy Edge API.

The expert.ai Edge NL API provides rich linguistic features for each token:
- **POS tags**: Part-of-speech tagging (PROPN, NOUN, VERB, etc.)
- **Dependency labels**: Syntactic dependency relations (nsubj, dobj, root, etc.)
- **Syncons**: Semantic concept IDs from the expert.ai knowledge graph
- **Knowledge labels**: Semantic categories (e.g., 'town', 'person', 'company')
- **Type classes**: Combined POS and entity type information (e.g., 'NPR.GEO' for proper noun geographic entity)

These features are intended to give the CRF stronger evidence for recognizing named entities than simple word-based features alone.

In [None]:
# Authentication credentials for expert.ai API
# NOTE: These credentials are for the deprecated Edge NL API (retired 2023)
# The Edge API required local deployment via expert.ai Studio
import os
os.environ["EAI_USERNAME"] = 'your_username@example.com'
os.environ["EAI_PASSWORD"] = 'your_password'

In [None]:
# Initialize the expert.ai Edge NL API client
# NOTE: This import will fail as the Edge API has been discontinued (2023)
# Contact info@expert.ai for information about current NLP solutions
from expertai.nlapi.edge.client import ExpertAiClient
client = ExpertAiClient()

### Helper Functions for Tokenization and Feature Extraction

The following functions process text through the expert.ai NLP engine and extract linguistic features that will be used as input to the CRF model.

In [None]:
# Progress bar library for tracking long-running operations
from tqdm import tqdm, trange

In [None]:
def tokens_to_docs(raw, eai):
    """
    Analyze sentences using the expert.ai NLP engine.
    
    Takes tokenized sentences, reconstructs them as strings, and processes
    them through the expert.ai full_analysis endpoint to obtain rich
    linguistic annotations.
    
    Args:
        raw: List of sentences, where each sentence is a list of token strings
        eai: expert.ai client instance
        
    Returns:
        docs: List of expert.ai Document objects containing NLP analysis results
    """
    docs = []
    for sent in tqdm(raw):
        # Join tokens back into a sentence string and analyze
        docs.append(eai.full_analysis(' '.join(sent)))
    return docs

In [None]:
def _get_label(doc, syncon):
    """
    Extract the knowledge graph label for a syncon from the document.
    
    Syncons (semantic concepts) in expert.ai's knowledge graph can have
    associated labels that provide semantic categorization (e.g., 'town',
    'person', 'company'). This function retrieves that label.
    
    Args:
        doc: expert.ai Document object
        syncon: Syncon ID to look up
        
    Returns:
        The simplified label (last component after '.') or empty string if not found
    """
    label = ''
    if hasattr(doc, 'knowledge'):
        for element in doc.knowledge:
            if element.syncon == syncon:
                label = element.label
                break
        # Extract the most specific part of hierarchical labels
        # e.g., 'geography.town' -> 'town'
        if label and '.' in label:
            label = label.split('.')[-1]
    return label

In [None]:
def features_from_docs(sentences, docs):
    """
    Extract CRF features from expert.ai analyzed documents.
    
    Aligns the original tokenization with expert.ai's tokenization and extracts
    linguistic features for each token. Handles cases where tokenization differs
    (e.g., expert.ai may split or merge tokens differently).
    
    Features extracted per token:
        - word: The original token text
        - pos: Part-of-speech tag (PROPN, NOUN, VERB, etc.)
        - dep: Dependency relation label (nsubj, dobj, root, etc.)
        - syncon: Semantic concept ID from expert.ai knowledge graph
        - label: Knowledge graph category label
        - typeclass: Combined POS and entity type (e.g., ['NPR', 'GEO'])
    
    Args:
        sentences: List of sentences (lists of token strings)
        docs: List of expert.ai Document objects
        
    Returns:
        eai_sents: List of sentences, where each sentence is a list of
                   feature dictionaries for each token
    """
    eai_sents = []
    for sent_idx in trange(len(sentences)):
        seek = 0  # Track position in the document content string
        eai_tokenlist = []
        
        for tk_idx in range(len(sentences[sent_idx])):
            # Get the original token and find its position in the doc
            token = sentences[sent_idx][tk_idx]
            index_start = docs[sent_idx].content.find(token, seek)
            index_end = index_start + len(token)
            
            # Find expert.ai tokens that overlap with this token's span
            # This handles tokenization mismatches between original and expert.ai
            possible_tokens = []
            for t in docs[sent_idx].tokens:
                if (t.start <= index_start and t.end >= index_end) or \
                   (t.start >= index_start and t.start <= index_end) or \
                   (t.end >= index_start and t.end <= index_end):
                    possible_tokens.append(t)
            
            if not possible_tokens:
                print('ERROR: expertai tokenization not found for token', token)
                eai_tokenlist.append(_voidtoken())
            else:
                # If multiple tokens match, prefer the one with a valid syncon
                if len(possible_tokens) > 1:
                    possible_tokens.sort(key=lambda t: t.syncon, reverse=True)
                
                # Build feature dictionary from the best matching token
                new_token = {
                    'word': token,
                    'pos': possible_tokens[0].pos,
                    'syncon': possible_tokens[0].syncon,
                    'ancestor': -1,
                    'label': _get_label(docs[sent_idx], possible_tokens[0].syncon),
                    'dep': possible_tokens[0].dependency.label,
                    'typeclass': possible_tokens[0].type_.split('.')
                }
                eai_tokenlist.append(new_token)
            
            # Advance position tracker past whitespace
            seek = index_end
            content = docs[sent_idx].content
            while seek < len(content) and content[seek] == ' ':
                seek += 1
        
        eai_sents.append(eai_tokenlist)
    return eai_sents
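
The span-overlap alignment can be tried without the Edge runtime. The sketch below is an assumption-laden illustration: `EngineToken` is a hypothetical stand-in for the engine's token objects, its offsets are invented, and `end` is assumed to be exclusive. The names `char_spans` and `overlapping` are also introduced only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for an engine token; offsets are invented for illustration
EngineToken = namedtuple("EngineToken", "start end text")


def char_spans(tokens, content):
    """Locate each original token in `content`, scanning left to right."""
    spans, seek = [], 0
    for tok in tokens:
        start = content.find(tok, seek)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {seek}")
        end = start + len(tok)
        spans.append((start, end))
        seek = end  # Continue searching after this token
    return spans


def overlapping(span, engine_tokens):
    """Return engine tokens whose [start, end) range intersects `span`."""
    s, e = span
    return [t for t in engine_tokens if t.start < e and s < t.end]


tokens = ["New", "York", "-based", "firm"]
content = " ".join(tokens)  # "New York -based firm"
engine = [
    EngineToken(0, 8, "New York"),   # engine merged two original tokens
    EngineToken(9, 10, "-"),         # engine split one original token in two
    EngineToken(10, 15, "based"),
    EngineToken(16, 20, "firm"),
]

for tok, span in zip(tokens, char_spans(tokens, content)):
    print(tok, span, [t.text for t in overlapping(span, engine)])
```

The example covers both mismatch directions: a merged engine token is shared by two original tokens, and a split original token matches several engine tokens, which is the case where a tie-breaking rule such as preferring a valid syncon is needed.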

In [None]:
def features_from_word(sentence, idx):
    """
    Build CRF feature dictionary for a single token with context.
    
    Creates features from the current token plus its immediate neighbors,
    capturing local context that's crucial for sequence labeling.
    
    Feature categories:
        - Word shape features: lowercase, suffix, capitalization, digit patterns
        - expert.ai features: POS tag, dependency label, syncon, knowledge label
        - Context features: Same features for previous (-1) and next (+1) tokens
        - Boundary markers: BOS (beginning of sentence), EOS (end of sentence)
    
    Args:
        sentence: List of token feature dictionaries
        idx: Index of the current token
        
    Returns:
        Dictionary of features for the CRF model
    """
    token = sentence[idx]
    
    # Core features for the current token
    features = {
        'bias': 1.0,  # Constant feature for learning label priors
        'word.lower()': token['word'].lower(),
        'word[-3:]': token['word'][-3:],  # 3-character suffix
        'word[-2:]': token['word'][-2:],  # 2-character suffix
        'word.isupper()': token['word'].isupper(),
        'word.istitle()': token['word'].istitle(),
        'word.isdigit()': token['word'].isdigit(),
        # expert.ai linguistic features
        'eai.postag': token['pos'],
        'eai.postag[:2]': token['pos'][:2],  # POS tag prefix
        'eai.deptag': token['dep'],
        'eai.deptag[-2:]': token['dep'][-2:],  # Dep tag suffix
        # Syncon IDs scaled down by a factor of 10,000; -1 marks a missing syncon
        'eai.syncon': -1 if token['syncon'] == -1 else token['syncon'] / 10000.,
        'eai.ancestor': -1 if token['ancestor'] == -1 else token['ancestor'] / 10000.,
        'eai.labels': token['label'],
        'eai.typeclass': token['typeclass'],
    }
    
    # Previous token context (position -1)
    if idx > 0:
        token1 = sentence[idx - 1]
        features.update({
            '-1:word.lower()': token1['word'].lower(),
            '-1:word.istitle()': token1['word'].istitle(),
            '-1:word.isupper()': token1['word'].isupper(),
            '-1:eai.postag': token1['pos'],
            '-1:eai.deptag': token1['dep'],
            '-1:eai.labels': token1['label'],
            '-1:eai.typeclass': token1['typeclass'],
        })
    else:
        features['BOS'] = True  # Beginning of sentence marker
    
    # Next token context (position +1)
    if idx < len(sentence) - 1:
        token1 = sentence[idx + 1]
        features.update({
            '+1:word.lower()': token1['word'].lower(),
            '+1:word.istitle()': token1['word'].istitle(),
            '+1:word.isupper()': token1['word'].isupper(),
            '+1:eai.postag': token1['pos'],
            '+1:eai.deptag': token1['dep'],
            '+1:eai.labels': token1['label'],
            '+1:eai.typeclass': token1['typeclass'],
        })
    else:
        features['EOS'] = True  # End of sentence marker
    
    return features

In [None]:
def features_from_sentence(sentence):
    """
    Generate CRF features for all tokens in a sentence.
    
    Args:
        sentence: List of token feature dictionaries
        
    Returns:
        Tuple of feature dictionaries, one per token
    """
    return tuple(features_from_word(sentence, index) for index in range(len(sentence)))
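
To see the shape of the context-window features without any expert.ai output, here is a reduced sketch that uses only word-level features. The function `window_features` is a hypothetical simplification of `features_from_word`, not a replacement for it.

```python
def window_features(words, idx):
    """Toy feature dict for words[idx] plus its immediate neighbours."""
    word = words[idx]
    feats = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
    }
    if idx > 0:
        feats["-1:word.lower()"] = words[idx - 1].lower()
    else:
        feats["BOS"] = True  # No left neighbour: beginning of sentence
    if idx < len(words) - 1:
        feats["+1:word.lower()"] = words[idx + 1].lower()
    else:
        feats["EOS"] = True  # No right neighbour: end of sentence
    return feats


words = ["Germany", "imported", "lamb"]
for i in range(len(words)):
    print(window_features(words, i))
```

Prefixing neighbour features with `-1:` and `+1:` keeps them distinct from the current token's features, so the CRF can learn separate weights for the same value at different positions.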

In [None]:
def _voidtoken():
    """
    Create a placeholder token with empty features.
    
    Used when expert.ai tokenization doesn't align with the original
    tokenization and no matching token can be found.
    
    Returns:
        Dictionary with empty/default values for all token features
    """
    return {
        'word': '',
        'pos': '',
        'syncon': -1,
        'ancestor': -1,
        'dep': '',
        'label': '',
        'typeclass': []  # Required by features_from_word; empty when no token matched
    }

### Process Data Through expert.ai and Extract Features

This step analyzes all sentences through the expert.ai NLP engine and extracts the linguistic features. Processing ~20,000 sentences takes approximately 1 hour.

> **NOTE**: This cell requires the Edge NL API runtime to be running locally.

In [None]:
# Analyze all sentences through expert.ai NLP engine
# This is the most time-consuming step (~1 hour for the full dataset)
train_docs = tokens_to_docs(raw_train, client)
test_docs = tokens_to_docs(raw_test, client)
valid_docs = tokens_to_docs(raw_valid, client)

In [None]:
# Extract structured features from the analyzed documents
# Aligns original tokens with expert.ai tokens and creates feature dictionaries
train = features_from_docs(raw_train, train_docs)
test = features_from_docs(raw_test, test_docs)
valid = features_from_docs(raw_valid, valid_docs)

In [None]:
# Inspect example features for a sample sentence
# Shows how expert.ai enriches tokens with semantic information
import pprint
p_idx = 2

print("Original sentence:", raw_train[p_idx])
print("NER labels:", y_train[p_idx])
print('')
print("Extracted features per token:")
pprint.pprint(train[p_idx])
print('')
print("Raw expert.ai token objects:")
pprint.pprint([tk.__dict__ for tk in train_docs[p_idx].tokens])

### Build Final Feature Vectors for CRF

Convert the extracted features into the format expected by sklearn-crfsuite, adding context window features (previous and next token features).

In [None]:
# Build final feature vectors with context windows for each dataset split
X_train = [features_from_sentence(sentence) for sentence in train]
X_test = [features_from_sentence(sentence) for sentence in test]
X_valid = [features_from_sentence(sentence) for sentence in valid]

# Display example feature vector for one sentence
print("Example feature vectors for sentence:")
pprint.pprint(X_train[1])

---

## Training the CRF Model

Train a Conditional Random Field model using the L-BFGS optimization algorithm. The CRF learns to predict NER labels by modeling the conditional probability of label sequences given the input features.

**Hyperparameters:**
- `c1=0.1`: L1 regularization coefficient (promotes sparse features)
- `c2=0.5`: L2 regularization coefficient (prevents overfitting)
- `max_iterations=800`: Maximum optimization iterations
- `all_possible_transitions=True`: Include transitions not observed in training data
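
As a rough sketch of what these settings control, assume the standard linear-chain CRF formulation. The symbols $f_k$, $\lambda_k$, and $Z$ are notation introduced here, and CRFsuite's exact scaling of the penalty terms may differ by constant factors. The model scores a label sequence $\mathbf{y}$ for an input sequence $\mathbf{x}$ of length $T$ as:

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
$$

Training then minimizes the regularized negative log-likelihood over the $N$ training sentences, with `c1` and `c2` weighting the L1 and L2 penalties:

$$
\min_{\boldsymbol{\lambda}} \; -\sum_{i=1}^{N} \log p\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\right) + c_1 \lVert \boldsymbol{\lambda} \rVert_1 + c_2 \lVert \boldsymbol{\lambda} \rVert_2^2
$$

Under this reading, a larger `c1` drives more weights $\lambda_k$ to exactly zero, while a larger `c2` shrinks all weights toward zero without eliminating them.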

In [None]:
%%time
# Train the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',      # Limited-memory BFGS optimization
    c1=0.1,                 # L1 regularization
    c2=0.5,                 # L2 regularization
    max_iterations=800,     # Max training iterations
    all_possible_transitions=True,  # Allow unseen label transitions
    verbose=True
)

# Fit with development set for monitoring convergence
crf.fit(X_train, y_train, X_dev=X_valid, y_dev=y_valid)

---

## Model Evaluation

Evaluate the trained model on both the test and validation sets using standard NER metrics:
- **Precision**: Proportion of predicted entities that are correct
- **Recall**: Proportion of actual entities that were found
- **F1-score**: Harmonic mean of precision and recall
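
The following sketch shows how these three metrics can be computed at the entity level from sets of `(entity_type, start, end)` spans. The function `entity_prf` and the sample spans are hypothetical and only approximate what an entity-level evaluator does. For a whole corpus, each span would also need a sentence index so that spans from different sentences do not collide.

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision, recall and F1 from exact span matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)  # Same type and same boundaries
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans for a single sentence: (entity_type, start, end)
gold = {("ORG", 0, 1), ("MISC", 2, 3), ("PER", 5, 7)}
pred = {("ORG", 0, 1), ("MISC", 2, 3), ("PER", 5, 6), ("LOC", 8, 9)}

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that the `PER` prediction with a wrong right boundary counts as both a false positive and a false negative, which is why entity-level scores are stricter than token-level accuracy.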

In [None]:
import time

def compute_prediction_latency(dataset, model, n_instances=-1):
    """
    Measure the average prediction latency per instance.
    
    Args:
        dataset: Input data to predict on
        model: Trained model with a predict() method
        n_instances: Number of instances to average over (-1 = all)
        
    Returns:
        Average prediction time in seconds per instance
    """
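    if n_instances != -1:
        # Time only the requested number of instances
        dataset = dataset[:n_instances]
    # perf_counter measures elapsed wall-clock time rather than CPU time
    start_time = time.perf_counter()
    model.predict(dataset)
    total_latency = time.perf_counter() - start_time
    return total_latency / len(dataset)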

In [None]:
# Model size (size_ reports the size of the serialized CRF model in bytes)
print('Model size: {:0.2f} MB'.format(crf.size_ / 1000000))

In [None]:
# Inference speed (average time to predict one sentence)
print(f'Prediction latency: {compute_prediction_latency(X_test, crf):.3} seconds per sentence')

In [None]:
# Evaluate on test and validation sets using seqeval
# (entity-level evaluation that accounts for BIO boundaries)
from seqeval.metrics import classification_report

datasets = [('Test Set', X_test, y_test), ('Validation Set', X_valid, y_valid)]

for title, X, Y in datasets:
    Y_pred = crf.predict(X)
    print(f"=== {title} ===")
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')