# Named Entity Recognition (NER) System - R&D Notebook
Author: Juan Roesel

#### **TABLE OF CONTENTS**
1. [Introduction](#1)
2. [Set Up](#2)
3. [Utils](#3)
4. [Data Processing](#4)
5. [Model Training](#5)
6. [Hyperparameter Optimization](#6)
7. [Evaluation on Test Data](#7)
8. [Inference](#8)
9. [Limitations](#9)
10. [References](#10)

**NOTE:** Be sure to run this notebook using the Kernel enabled by the virtual environment set up by Poetry. For reference, it should start with `ner-system-XXX` and run using Python 3.11.0.

## 1. Introduction
<a id='1'></a>

Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) aimed at identifying and categorizing important information in text, such as names of people, places, and organizations. The main challenges in NER stem from the complexity of language, including its variability and the context-dependent nature of how entities are represented. NER is essential for applications ranging from information retrieval and content classification to enhancing user interfaces.

Among the models best suited for this task stands [Conditional Random Fields (CRFs)](https://en.wikipedia.org/wiki/Conditional_random_field), which are statistical models designed for structured prediction, making them ideal for NER. Unlike models that treat predictions independently, CRFs leverage the context and sequence of words, significantly improving the accuracy of identifying and classifying named entities. They are also fast to train and cost-efficient to deploy.

The data used to train the model was the [CoNLL2003 dataset](https://paperswithcode.com/dataset/conll-2003), a standard NER research benchmark comprising English news articles annotated with named entities. Its rigorous annotation and diversity make it a robust dataset for evaluating NER models.

Our experiment thus involved training a CRF model on the CoNLL2003 dataset, achieving a **Macro-F1 score of 0.88 on the validation set and 0.82 on the test set**. These results suggest that the model recognizes and classifies named entities in news articles reliably and efficiently. For reference, the [state-of-the-art results](https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003) achieved on this dataset range from 0.91 to 0.94, primarily using deep neural networks that are more expensive to train and more complex to deploy.

This notebook contains the code used to read and process the data, transform it into relevant features, and use it to train the CRF model with hyperparameter optimization.


## 2. Set Up
<a id='2'></a>

In [1]:
import sys
import time
import pickle
import logging
from pathlib import Path
from functools import wraps
from typing import Optional, List, Dict

import transformers
import scipy.stats
from sklearn_crfsuite import CRF
from sklearn_crfsuite.utils import flatten
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer

from tqdm import tqdm
from datasets import load_dataset

import spacy
import nltk
from nltk.corpus import names, gazetteers, stopwords
from nltk import pos_tag



In [2]:
# NLTK resources needed for feature engineering

nltk.download('gazetteers')
nltk.download('names')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package gazetteers to
[nltk_data]     C:\Users\juanr\AppData\Roaming\nltk_data...
[nltk_data]   Package gazetteers is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\juanr\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\juanr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\juanr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
# Global Config

class NERConfig:
    """
    Class to hold all the configuration parameters along the NER pipeline.
    """
    def __init__(
        self,
        data_path: str | Path,
        models_path: Optional[str | Path] = None,
        artifacts_path: Optional[str | Path] = None,
        label_types: Optional[List[str]] = None,
        id2label: Optional[Dict[int, str]] = None,
        dataset: Optional[str] = None,
        train_file: Optional[str] = None,
        dev_file: Optional[str] = None,
        test_file: Optional[str] = None,
        base_model_path: Optional[str] = None,
        tokenizer: Optional[transformers.BertTokenizer] = None,
        spacy_model: Optional[spacy.Language] = None,
        **kwargs
    ):
        # Normalise paths: Path() accepts both str and Path inputs, so the
        # attributes are always set (optional ones default to None).
        self.DATA_PATH = Path(data_path)
        self.MODEL_PATH = Path(models_path) if models_path is not None else None
        self.ARTIFACTS_PATH = Path(artifacts_path) if artifacts_path is not None else None
        # DATA CONFIG
        if dataset:  # Must be a valid HuggingFace dataset
            self.dataset = load_dataset(dataset)
            self.LABEL_TYPES = self.dataset["train"].features["ner_tags"].feature.names
            self.ID2LABEL = {i: label for i, label in enumerate(self.LABEL_TYPES)}
        else:
            self.LABEL_TYPES = label_types
            self.ID2LABEL = id2label
        if train_file: 
            self.train_file = self.DATA_PATH / train_file
        if dev_file:
            self.dev_file = self.DATA_PATH / dev_file
        if test_file:
            self.test_file = self.DATA_PATH / test_file
        # BASE BERT CONFIG
        self.BASE_MODEL_PATH = base_model_path or "bert-base-cased"
        self.MAX_LEN = 128
        self.TRAIN_BATCH_SIZE = 64
        self.VALID_BATCH_SIZE = 32
        self.EPOCHS = 15
        self.OUT_DIM = 768  # bert-base-cased: 768, bert-large-cased: 1024
        self.TOKENIZER = tokenizer or transformers.BertTokenizer.from_pretrained(
            self.BASE_MODEL_PATH, do_lower_case=False
        )
        # BASE CRF CONFIG
        self.SPACY = spacy_model or spacy.load('en_core_web_sm')
        self.CRF_ALGORITHM = "lbfgs"
        self.CRF_C1 = 0.1
        self.CRF_C2 = 0.1
        self.CRF_MAX_ITER = 100
        self.logger = logging.getLogger(self.__class__.__name__)
        # KWARGS
        if kwargs:
            for attr, value in kwargs.items():
                setattr(self, attr, value)

In [4]:
# Instantiate NERConfig for downstream use

ner_config = NERConfig(
    data_path="data",
    models_path="models",
    artifacts_path="artifacts",
    dataset="conll2003"
)

## 3. Utils
<a id='3'></a>

In [5]:
def get_root_directory() -> Path:
    current_path = Path.cwd()
    root_directory = current_path.parents[0]
    return root_directory

def _flattens_y(func):
    @wraps(func)
    def wrapper(y_true, y_pred, *args, **kwargs):
        y_true_flat = flatten(y_true)
        y_pred_flat = flatten(y_pred)
        return func(y_true_flat, y_pred_flat, *args, **kwargs)
    return wrapper

@_flattens_y
def flat_classification_report(y_true, y_pred, labels=None, **kwargs):
    """
    Return classification report for sequence items.
    NOTE: Adapted from the library source code to work around a TypeError raised by the original function.
    See: https://www.reddit.com/r/learnpython/comments/swwplz/sklearn_crfsuite_issue_using_metricsflat/
    """
    from sklearn import metrics
    return metrics.classification_report(y_true, y_pred, labels=labels, **kwargs)

## 4. Data Processing
<a id='4'></a>

In [6]:
ner_config.dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
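The integer `ner_tags` in the record above index into the dataset's label list. As a minimal sketch of that mapping, assuming the label order used in the Inference section of this notebook (`LABEL_TYPES`, `ID2LABEL` and `ids_to_iob` are standalone illustrative names here, not the notebook's own objects):

```python
# Label order assumed to match the CoNLL2003 label list used later in this notebook.
LABEL_TYPES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
ID2LABEL = dict(enumerate(LABEL_TYPES))


def ids_to_iob(label_ids):
    """Map integer label ids to their IOB tag strings."""
    return [ID2LABEL[label_id] for label_id in label_ids]


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Print each token next to its decoded tag
for token, tag in zip(tokens, ids_to_iob(ner_tags)):
    print(f"{token:10s} {tag}")
```

Under this label order, `EU` decodes to `B-ORG` and both `German` and `British` decode to `B-MISC`.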

In [7]:
class CRFDataProcessor:
    """
    Class to process the raw data into a format that can be used
    to train a CRF model using the sklearn_crfsuite library.
    """

    def __init__(self, config: NERConfig, debug: bool = False, persist_data: bool = False):
        self.config = config
        self.debug = debug
        self.persist_data = persist_data
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.addHandler(logging.StreamHandler(stream=sys.stdout))
    
    def _token2features(self, tokens: List[str], idx: int):
        """
        Generates a feature dictionary for a given token, using the following features:
        - Features that look at neighbouring words.
        - Features that look at word morphology.
        - Features that consider the "shape" of the word.
        - Features that include POS tags.
        - Gazetteer features using the NLTK Gazetteers corpus.
        - Name features using the NLTK Names corpus.
        - Stop-word features using the NLTK Stopwords corpus.

        """
        # load reference vocabularies and POS tags
        _gazetteer_words = set(gazetteers.words())
        _names_words = set(names.words())
        _stop_words = set(stopwords.words())
        _pos_tags = pos_tag(tokens)

        token_features = {}
        
        # token
        token_features["token"] = tokens[idx]

        # POS tags
        token_features["pos_tag_full"] = _pos_tags[idx][1]
        token_features["pos_tag_short"] = _pos_tags[idx][1][:2]

        # lower case
        token_features["token_lower"] = tokens[idx].lower()

        # upper case (boolean)
        # NOTE: a "first word" flag previously wrote to this same key and was always
        # overwritten here; sentence position 0 is already captured by "BOS" below.
        token_features["token_upper"] = tokens[idx].isupper()

        # mixed case: neither all-upper nor all-lower, excluding the first token (boolean)
        if not tokens[idx].isupper() and not tokens[idx].islower() and idx != 0:
            token_features["token_tile"] = True
        else:
            token_features["token_tile"] = False

        # digit case (boolean)
        token_features["token_digit"] = tokens[idx].isdigit()

        # all alpha case (boolean)
        token_features["token_alpha"] = tokens[idx].isalpha()

        # token is title (boolean)
        token_features["token_title"] = tokens[idx].istitle()

        # space case (boolean)
        token_features["token_space"] = tokens[idx].isspace()

        # word morphology - 3 chars
        if len(tokens[idx]) > 3:
            token_features["token_morph_-3"] = tokens[idx][:3]
            token_features["token_morph_+3"] = tokens[idx][-3:]

        # word morphology - 2 chars
        if len(tokens[idx]) > 2:
            token_features["token_morph_-2"] = tokens[idx][:2]
            token_features["token_morph_+2"] = tokens[idx][-2:]
        
        # neighbours with corresponding POS tags
        if idx > 0:
            token_features["prev_token"] = tokens[idx - 1]
            token_features["prev_token_pos_full"] = _pos_tags[idx - 1][1]
            token_features["prev_token_pos_short"] = _pos_tags[idx - 1][1][:2]
        else:
            token_features["BOS"] = True

        if idx > 1:
            token_features["prev_prev_token"] = tokens[idx - 2]
            token_features["prev_prev_token_pos_full"] = _pos_tags[idx - 2][1]
            token_features["prev_prev_token_pos_short"] = _pos_tags[idx - 2][1][:2]

        if idx < len(tokens) - 2:
            token_features["next_next_token"] = tokens[idx + 2]
            token_features["next_next_token_pos_full"] = _pos_tags[idx + 2][1]
            token_features["next_next_token_pos_short"] = _pos_tags[idx + 2][1][:2]

        if idx < len(tokens) - 1:
            token_features["next_token"] = tokens[idx + 1]
            token_features["next_token_pos_full"] = _pos_tags[idx + 1][1]
            token_features["next_token_pos_short"] = _pos_tags[idx + 1][1][:2]
        else:
            token_features["EOS"] = True
        
        # gazetteer features
        if tokens[idx] in _gazetteer_words:
            token_features["gazetteer"] = tokens[idx]

        # name features
        if tokens[idx] in _names_words:
            token_features["name"] = tokens[idx]

        # stop words features
        if tokens[idx] in _stop_words:
            token_features["stop_word"] = tokens[idx]
        
        return token_features
    
    def _sentence2features(self, tokens: List[str]):
        """
        Generates a list of feature dictionaries for a given list of tokens.
        """
        return [self._token2features(tokens, idx) for idx in range(len(tokens))]
    

    def _labels_to_iob(self, label_ids: List[int]):
        """
        Convert a list of label ids to IOB format
        """
        return [self.config.ID2LABEL[label_id] for label_id in label_ids]
    
    def run(self, split: str):
        """
        Runs a simple data processing/transformation pipeline to convert
        the raw data into a format that can be used to train a CRF model.
        """
        if split not in ["train", "validation", "test"]:
            raise ValueError("split must be one of 'train', 'validation', 'test'")
        
        # self.logger.info(f"Initializing data processing for {split} split...")
        print(f"Initializing data processing for {split} split...")
        _start_time = time.perf_counter()
        
        if self.debug:
            _data = self.config.dataset[split][:10]
            split_feature_dicts = [self._sentence2features(tokens) for tokens in tqdm(_data["tokens"])]
            split_tags = [self._labels_to_iob(tags) for tags in tqdm(_data["ner_tags"])]

        else:
            _data = self.config.dataset[split]
            split_feature_dicts = [self._sentence2features(item["tokens"]) for item in tqdm(_data)]
            split_tags = [self._labels_to_iob(item["ner_tags"]) for item in tqdm(_data)]

        _end_time = time.perf_counter()
        # self.logger.info(f"Data processing for {split} split completed in {_end_time - _start_time:0.4f} seconds")
        print(f"Data processing for {split} split completed in {_end_time - _start_time:0.4f} seconds")

        if self.persist_data:
            _output_path = self.config.ARTIFACTS_PATH / f"{split}_data.pkl"
            _output_path_tags = self.config.ARTIFACTS_PATH / f"{split}_tags.pkl"
            
            self.logger.info(f"Persisting {split} features to {_output_path}...")
            self.logger.info(f"Persisting {split} tags to {_output_path_tags}...")

            with open(get_root_directory() / _output_path, "wb") as f:
                pickle.dump(split_feature_dicts, f)
            with open(get_root_directory() / _output_path_tags, "wb") as f:
                pickle.dump(split_tags, f)
        
        return split_feature_dicts, split_tags
    

In [8]:
crf_processor = CRFDataProcessor(ner_config, debug=True)

# Test _labels_to_iob
check_sentence = ner_config.dataset["train"][0]["tokens"] # "EU rejects German call to boycott British lamb ."
check_label_ids = ner_config.dataset["train"][0]["ner_tags"]
check_tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

out_tags = crf_processor._labels_to_iob(check_label_ids)
assert out_tags == check_tags

# Test _sentence2features (and, through it, _token2features)
check_tokens = ner_config.dataset["train"][0]["tokens"]
expected_features = {
    "token": "EU",
    "pos_tag_full": "NNP",
    "pos_tag_short": "NN",
    "token_lower": "eu",
    "token_upper": True,
    "token_tile": False,
    "token_digit": False,
    "token_alpha": True,
    "token_title": False,
    "token_space": False,
    "BOS": True,
    "next_token": "rejects",
    "next_token_pos_full": "VBZ",
    "next_token_pos_short": "VB",
    "next_next_token": "German",
    "next_next_token_pos_full": "JJ",
    "next_next_token_pos_short": "JJ"
}

out_features = crf_processor._sentence2features(check_tokens)
assert out_features[0] == expected_features

## 5. Model Training
<a id='5'></a>

In [9]:
crf_processor = CRFDataProcessor(
    ner_config, debug=False, persist_data=True
)

train_data, train_tags = crf_processor.run("train")
valid_data, valid_tags = crf_processor.run("validation")
test_data, test_tags = crf_processor.run("test")

Initializing data processing for train split...


100%|██████████| 14041/14041 [32:10<00:00,  7.27it/s] 
100%|██████████| 14041/14041 [00:00<00:00, 14868.02it/s]


Data processing for train split completed in 1931.6256 seconds
Initializing data processing for validation split...


100%|██████████| 3250/3250 [08:04<00:00,  6.71it/s]
100%|██████████| 3250/3250 [00:00<00:00, 15185.06it/s]


Data processing for validation split completed in 484.2565 seconds
Initializing data processing for test split...


100%|██████████| 3453/3453 [07:28<00:00,  7.71it/s]
100%|██████████| 3453/3453 [00:00<00:00, 17339.10it/s]


Data processing for test split completed in 448.2328 seconds
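Much of the processing time above is likely spent inside `_token2features`, which rebuilds the three reference vocabularies and re-runs `pos_tag` on the whole sentence for every single token. One possible remedy is to load such resources once and reuse them. The sketch below illustrates the idea only: `load_vocab` is a hypothetical stand-in for an expensive corpus call such as `set(names.words())`, and the call counter exists purely to make the caching visible.

```python
from functools import lru_cache

LOAD_CALLS = {"count": 0}


def load_vocab():
    """Hypothetical stand-in for an expensive corpus load."""
    LOAD_CALLS["count"] += 1
    return frozenset({"Matthew", "Andy", "Michael"})


@lru_cache(maxsize=None)
def cached_vocab():
    # The body runs once; later calls return the cached frozenset.
    return load_vocab()


def sentence_name_flags(tokens):
    """Flag tokens found in the vocabulary, fetching the vocabulary once per sentence."""
    vocab = cached_vocab()
    return [token in vocab for token in tokens]


flags = sentence_name_flags(["By", "Matthew", "Knight"])
sentence_name_flags(["Andy", "Garcia"])
print(flags, LOAD_CALLS["count"])
```

The same pattern would apply to the POS tags: tagging each sentence once in `_sentence2features` and passing the result down avoids repeating the work per token.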


### Vanilla CRF

In [10]:
crf = CRF(
    max_iterations=100,
    algorithm=ner_config.CRF_ALGORITHM,
    verbose=True,
)
crf.keep_tempfiles = False
crf.model_filename = ner_config.MODEL_PATH / "crf_model_vanilla.pkl"

crf.fit(train_data, train_tags)


loading training data to CRFsuite:   0%|          | 0/14041 [00:00<?, ?it/s]

loading training data to CRFsuite: 100%|██████████| 14041/14041 [00:03<00:00, 4515.57it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 185691
Seconds required: 0.611

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.32  loss=238244.34 active=185678 feature_norm=1.00
Iter 2   time=0.16  loss=198254.00 active=185682 feature_norm=3.17
Iter 3   time=0.16  loss=153776.21 active=185682 feature_norm=2.85
Iter 4   time=0.31  loss=112579.27 active=185682 feature_norm=2.75
Iter 5   time=0.17  loss=102476.05 active=185682 feature_norm=3.16
Iter 6   time=0.18  loss=78499.07 active=185682 feature_norm=4.79
Iter 7   time=0.17  loss=64762.83 active=185682 feature_norm=6.39
Iter 8   time=0.19  loss=57228.42 active=185682 feature_norm=9.69
Iter 9   time=0.18  loss=50112.79 active=185682 feature_norm=10.7

In [11]:
y_pred = crf.predict(valid_data)

print(f"Micro F1:{flat_f1_score(valid_tags, y_pred, average='micro')}")
print(f"Macro F1:{flat_f1_score(valid_tags, y_pred, average='macro')}")
print(f"Classification Report:\n{flat_classification_report(valid_tags, y_pred, labels=ner_config.LABEL_TYPES)}")

Micro F1:0.9757602897083447
Macro F1:0.8808505497261443
Classification Report:
              precision    recall  f1-score   support

           O       0.99      1.00      0.99     42759
       B-PER       0.90      0.89      0.89      1842
       I-PER       0.94      0.96      0.95      1307
       B-ORG       0.84      0.81      0.83      1341
       I-ORG       0.81      0.80      0.80       751
       B-LOC       0.92      0.92      0.92      1837
       I-LOC       0.90      0.82      0.86       257
      B-MISC       0.93      0.82      0.88       922
      I-MISC       0.89      0.73      0.80       346

    accuracy                           0.98     51362
   macro avg       0.90      0.86      0.88     51362
weighted avg       0.98      0.98      0.98     51362



In [12]:
# Printing top 10 and bottom 10 transitions to evaluate model empirically

transition_feats = crf.transition_features_
# top 10
top_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[:10]
# bottom 10
bottom_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[-10:]
print(f"Top 10 transitions: {top_10}")
print()
print(f"Bottom 10 transitions: {bottom_10}")

Top 10 transitions: [(('B-PER', 'I-PER'), 6.814966), (('B-ORG', 'I-ORG'), 6.774296), (('I-ORG', 'I-ORG'), 6.296536), (('B-MISC', 'I-MISC'), 6.090638), (('O', 'O'), 5.814416), (('B-LOC', 'I-LOC'), 5.765354), (('I-MISC', 'I-MISC'), 5.675503), (('O', 'B-PER'), 5.524776), (('I-LOC', 'I-LOC'), 5.031712), (('I-PER', 'I-PER'), 4.403127)]

Bottom 10 transitions: [(('B-ORG', 'B-MISC'), -0.821103), (('I-MISC', 'B-PER'), -0.829498), (('B-MISC', 'B-MISC'), -0.87536), (('I-LOC', 'B-ORG'), -0.995225), (('I-MISC', 'B-LOC'), -1.000655), (('B-ORG', 'B-PER'), -1.074974), (('B-LOC', 'B-PER'), -1.24679), (('B-ORG', 'B-LOC'), -1.344094), (('I-ORG', 'B-LOC'), -1.521077), (('I-ORG', 'B-ORG'), -1.557601)]


### **General Observations**

* The Vanilla CRF implementation reached a solid `macro avg F1` score of $0.88$ on the validation set, which highlights the effectiveness of the CRF model for sequential learning tasks like NER.

* The best performing label was `O`, which is not surprising given the imbalanced nature of the dataset.

* The worst performing labels were `I-ORG` and `I-MISC` (F1 of 0.80 each). For the former, one possible cause is semantic ambiguity in some of the entities (e.g., `Apple` as a company and `Apple` as a fruit), which can be mitigated through well-known approaches such as Entity Linking.

* In the case of `I-MISC`, a potential reason for the underperformance is the amount of semantic variance among the entities associated with the label (e.g., `World Cup`, `Grand Prix`, `Peace Prize`, `Timorese-born` and `Barcelona-Madrid` are all categorized as `I-MISC` entities). This could be mitigated by feature engineering or by breaking down the `MISC` category into more cohesive entities (at the expense of increasing model complexity and requiring a wider range of labelled examples).

* The highest-weighted transitions (e.g., `B-PER -> I-PER` or `B-ORG -> I-ORG`) suggest that the model has learned reasonably well how likely the `i`th tag is to transition to the `j`th tag. In the same vein, the bottom 10 transitions correspond to transitions that make little sense (e.g., from a `B-MISC` tag to another `B-MISC` tag, both denoting the beginning of an entity).
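The gap between the micro and macro scores reported above follows from the label imbalance: micro F1 is dominated by the frequent `O` label, while macro F1 weights every label equally. A minimal pure-Python sketch on toy tags (not the notebook's data; the helper names are illustrative) shows the effect:

```python
def f1_per_label(y_true, y_pred, label):
    """F1 for one label, computed from token-level counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_label(y_true, y_pred, label) for label in labels) / len(labels)


def micro_f1(y_true, y_pred):
    # With exactly one label per token and all labels counted,
    # micro F1 reduces to plain accuracy.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy sequence: eight "O" tokens plus a two-token organisation, one token missed
y_true = ["O"] * 8 + ["B-ORG", "I-ORG"]
y_pred = ["O"] * 8 + ["B-ORG", "O"]

print(f"micro F1: {micro_f1(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

A single missed `I-ORG` token barely moves the micro score but removes a third of the macro score, which is why the macro figure is the more informative one for the rarer entity labels.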

## 6. Hyperparameter Optimization
<a id='6'></a>

In [17]:
_crf = CRF(algorithm=ner_config.CRF_ALGORITHM)

# use weighted F1 as the search metric (the evaluation cells report micro and macro F1)
f1_scorer = make_scorer(flat_f1_score, average='weighted')

# define fixed parameters and parameters to search
params = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
    'epsilon': [0.00001, 0.00001, 0.000005, 0.00005],
    'min_freq': [0, 10, 25, 20, 100, 200],
    'max_iterations': [50, 100, 250, 500, 750],
    'all_possible_states': [True, False],
    'all_possible_transitions': [True, False]
}

# implement randomized search
crf_rs = RandomizedSearchCV(
    _crf, 
    params,
    cv=5,
    verbose=1,
    n_iter=5,
    scoring=f1_scorer,
    n_jobs=-1
)

crf_rs.keep_tempfiles = False
crf_rs.model_filename = ner_config.MODEL_PATH / "crf_model_optim.pkl"
crf_rs.fit(train_data, train_tags)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


In [18]:
best_crf = crf_rs.best_estimator_
y_pred = best_crf.predict(valid_data)

print(f"Micro F1:{flat_f1_score(valid_tags, y_pred, average='micro')}")
print(f"Macro F1:{flat_f1_score(valid_tags, y_pred, average='macro')}")
print(f"Classification Report:\n{flat_classification_report(valid_tags, y_pred, labels=ner_config.LABEL_TYPES)}")

Micro F1:0.9772010435730696
Macro F1:0.8844588409277788
Classification Report:
              precision    recall  f1-score   support

           O       0.99      1.00      0.99     42759
       B-PER       0.90      0.90      0.90      1842
       I-PER       0.94      0.95      0.95      1307
       B-ORG       0.86      0.81      0.84      1341
       I-ORG       0.82      0.82      0.82       751
       B-LOC       0.93      0.93      0.93      1837
       I-LOC       0.90      0.82      0.86       257
      B-MISC       0.92      0.84      0.88       922
      I-MISC       0.88      0.72      0.79       346

    accuracy                           0.98     51362
   macro avg       0.91      0.87      0.88     51362
weighted avg       0.98      0.98      0.98     51362



In [19]:
# Printing top 10 and bottom 10 transitions to evaluate model empirically
transition_feats = crf_rs.best_estimator_.transition_features_
# top 10
top_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[:10]
# bottom 10
bottom_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[-10:]
print(f"Top 10 transitions: {top_10}")
print()
print(f"Bottom 10 transitions: {bottom_10}")

Top 10 transitions: [(('B-ORG', 'I-ORG'), 6.952282), (('B-LOC', 'I-LOC'), 6.471124), (('I-ORG', 'I-ORG'), 6.397533), (('I-MISC', 'I-MISC'), 6.109443), (('B-PER', 'I-PER'), 5.872573), (('B-MISC', 'I-MISC'), 5.779496), (('I-LOC', 'I-LOC'), 5.356029), (('O', 'B-PER'), 4.723923), (('O', 'O'), 4.456403), (('I-PER', 'I-PER'), 3.96032)]

Bottom 10 transitions: [(('I-MISC', 'B-LOC'), -2.086034), (('B-MISC', 'B-LOC'), -2.100947), (('B-LOC', 'B-PER'), -2.24063), (('B-MISC', 'B-MISC'), -2.291549), (('I-LOC', 'B-LOC'), -2.586804), (('B-PER', 'B-LOC'), -2.67583), (('B-PER', 'B-MISC'), -3.00232), (('I-ORG', 'B-ORG'), -3.689932), (('B-ORG', 'B-LOC'), -4.214494), (('I-ORG', 'B-LOC'), -4.982422)]


### **General Observations**

* Despite searching the parameter space for a combination of parameters that improves model performance, the `macro avg F1` score barely improved (from 0.881 to 0.884 on the validation set). This might be due to the inherent limitations of statistical models in the context of language modeling. See the Limitations section below for more on this point.

* The F1 score of the `I-ORG` tag increased by two percentage points, from 0.80 to 0.82. On the other hand, the F1 score of the `I-MISC` tag decreased slightly, from 0.80 to 0.79.
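Another factor worth keeping in mind is that the search above draws only five candidates (`n_iter=5`) from a seven-dimensional space. The sketch below mimics how such candidates are drawn, using the standard library instead of `scipy.stats` and covering only a subset of the parameters; `sample_candidate` is an illustrative helper, not part of scikit-learn.

```python
import random


def sample_candidate(rng):
    """Draw one candidate, mirroring the shape of the search space above."""
    return {
        # expovariate takes the rate (1 / scale), so the means are 0.5 and 0.05
        "c1": rng.expovariate(1 / 0.5),
        "c2": rng.expovariate(1 / 0.05),
        "max_iterations": rng.choice([50, 100, 250, 500, 750]),
        "all_possible_transitions": rng.choice([True, False]),
    }


rng = random.Random(0)  # fixed seed so the draw is repeatable
candidates = [sample_candidate(rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

With so few draws, large regions of the space go unexplored, so a larger `n_iter` might be worth trying before concluding that the model has plateaued.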

## 7. Evaluation on Test Data
<a id='7'></a>

In [20]:
y_pred = best_crf.predict(test_data)

print(f"Micro F1: {flat_f1_score(test_tags, y_pred, average='micro')}")
print(f"Macro F1: {flat_f1_score(test_tags, y_pred, average='macro')}")

Micro F1: 0.9622698395606762
Macro F1: 0.8243243839841066


### **Observations**

Both `Micro F1` and `Macro F1` decreased on the test set (from 0.977 to 0.962 and from 0.884 to 0.824, respectively). This might indicate that the model overfits the training data, and the test scores are probably more indicative of its true performance on unseen data from a similar distribution (i.e., news articles in English).

Adding more news datasets into the training mix would help the model become more robust to language variance.

In [21]:
# Persist the Best CRF model
with open(get_root_directory() / crf_rs.model_filename, 'wb') as f:
    pickle.dump(best_crf, f)

## 8. Inference
<a id='8'></a>

In [22]:
import pickle

import spacy

from ner_system.models.pipeline import CRFPipeline
from ner_system.config import NERConfig
from ner_system.utils import get_root_directory

In [25]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("sentencizer")

ner_config = NERConfig(
    label_types=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"],
    data_path="data"
)

with open(get_root_directory() / "models/crf_model_optim.pkl", "rb") as f:
    best_crf = pickle.load(f)

resources = {"space_model": nlp}

crf_processor = CRFPipeline(ner_config, resources, debug=False, persist_data=False)

test = "Hollywood A-listers hit the green ...By Matthew Knight, CNNUpdated 1203 GMT (2003 HKT) October 5, 2011  Photos: Celebs and sports starsMichael Douglas and Andy Garcia take to world golf's oldest and arguably greatest stage as they practise on the Old Course at St. Andrews, Scotland in preparation for the Alfred Dunhill Links Championships."
sents = [sent for sent in nlp(test).sents]
for sent in sents:
    tokens = [token.text for token in sent]
    features = crf_processor._sentence2features(tokens)
    tags = best_crf.predict([features])
    print(f"Sentence: {sent}")
    print(f"Tokens: {tokens}")
    print(f"Predicted Tags: {tags}")

Sentence: Hollywood A-listers hit the green ...
Tokens: ['Hollywood', 'A', '-', 'listers', 'hit', 'the', 'green', '...']
Predicted Tags: [['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
Sentence: By Matthew Knight, CNNUpdated 1203 GMT (2003 HKT) October 5, 2011  Photos: Celebs and sports starsMichael Douglas and Andy Garcia take to world golf's oldest and arguably greatest stage as they practise on the Old Course at St. Andrews, Scotland in preparation for the Alfred Dunhill Links Championships.
Tokens: ['By', 'Matthew', 'Knight', ',', 'CNNUpdated', '1203', 'GMT', '(', '2003', 'HKT', ')', 'October', '5', ',', '2011', ' ', 'Photos', ':', 'Celebs', 'and', 'sports', 'starsMichael', 'Douglas', 'and', 'Andy', 'Garcia', 'take', 'to', 'world', 'golf', "'s", 'oldest', 'and', 'arguably', 'greatest', 'stage', 'as', 'they', 'practise', 'on', 'the', 'Old', 'Course', 'at', 'St.', 'Andrews', ',', 'Scotland', 'in', 'preparation', 'for', 'the', 'Alfred', 'Dunhill', 'Links', 'Championships', '.']
Predict
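For downstream use, token-level tags like the ones predicted above are usually grouped into entity spans. A minimal sketch of that grouping, assuming IOB tags of the form `B-XXX` / `I-XXX` / `O` (`iob_to_spans` is an illustrative helper, not part of `ner_system`):

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the open entity
            current.append(token)
        else:
            # "O" closes any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ['Hollywood', 'A', '-', 'listers', 'hit', 'the', 'green', '...']
tags = ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
print(iob_to_spans(tokens, tags))
```

An `I-` tag that does not continue an entity of the same type is treated here as the start of a new entity, which is one of several reasonable ways to handle malformed sequences.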

## 9. Model Limitations
<a id='9'></a>

CRFs are statistical models that directly model the conditional probability of the label sequence given a sequence of input tokens. CRFs make predictions based on the entire input sequence, taking into account the context and transition probabilities between labels in the sequence, which is particularly useful for sequence labeling tasks like NER.
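For reference, the standard linear-chain CRF formulation of this conditional probability can be written as follows, where $\mathbf{x}$ is the token sequence, $\mathbf{y}$ the label sequence of length $T$, and the feature functions $f_k$ and weights $\lambda_k$ are generic symbols rather than names used elsewhere in this notebook:

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
$$

The normalizer $Z(\mathbf{x})$ sums over all possible label sequences, which is what makes the model score a sequence as a whole rather than each token in isolation.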

However, there are some inherent limitations in both the model's architecture and the training approach overall that might explain the negligible boost in performance when running hyperparameter optimization. Here are some of them:

- **Lack of contextual embeddings:** CRFs model the sequence and its labels considering the local context and transitions between neighboring labels, but they do not inherently understand the context the way deep learning models like BERT do. Their understanding of context is limited to the features provided to them.

- **Label interdependence:** When making predictions, CRFs score the label sequence as a whole, so the label at a given position depends on the labels at neighboring (previous and following) positions. As a result, a poorly scored label at one position can pull adjacent labels toward the wrong tags, propagating errors along the sequence.

- **Limited training data:** The learned parameters of the current CRF model are constrained to the language distribution present in CoNLL2003, which can lead to overfitting and make the model overly sensitive to small fluctuations in the data.
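To make the interaction between per-token evidence and label transitions concrete, the sketch below implements Viterbi decoding, the procedure typically used to pick the best label sequence in a linear-chain CRF. It is written in plain Python with toy scores (loosely echoing the `B-PER -> I-PER` weight printed earlier); it is not how `sklearn_crfsuite` is implemented internally, and all names are illustrative.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: list of {label: score} dicts, one per token.
    transitions: {(prev_label, label): score}; missing pairs score 0.0.
    """
    # best[t][label] = (score of the best path ending in label at t, previous label)
    best = [{label: (emissions[0][label], None) for label in labels}]
    for t in range(1, len(emissions)):
        column = {}
        for label in labels:
            prev, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, label), 0.0)) for p in labels),
                key=lambda pair: pair[1],
            )
            column[label] = (score + emissions[t][label], prev)
        best.append(column)
    # Backtrack from the best final label
    label = max(labels, key=lambda l: best[-1][l][0])
    path = [label]
    for t in range(len(emissions) - 1, 0, -1):
        label = best[t][label][1]
        path.append(label)
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
transitions = {("B-PER", "I-PER"): 6.8, ("O", "I-PER"): -5.0}  # toy values
emissions = [
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},  # first token looks like a name
    {"O": 1.0, "B-PER": 0.0, "I-PER": 0.5},  # second token alone favours "O"
]
print(viterbi(emissions, transitions, labels))
```

In this toy case the second token would be tagged `O` on its own evidence, but the strong `B-PER -> I-PER` transition flips it to `I-PER`. The same mechanism works against the model when the first label is wrong.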

## 10. References
<a id='10'></a>