# Named Entity Recognition - Kaggle Competition

## Objective

Train a Conditional Random Field (CRF) model to perform Named Entity Recognition (NER) on the Ontonotes V5.0 dataset. The Kaggle competition page can be [found here](https://www.kaggle.com/c/colx-563-lab-assignment-1/overview).

## Results

* Private Leaderboard: **1st place** (micro F1 score: 0.96819)
* Public Leaderboard: **2nd place** (micro F1 score: 0.96838)


## Project Setup

In [32]:
import os
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score, classification_report
from bs4 import BeautifulSoup
from nltk.corpus import names, gazetteers, stopwords
from nltk import pos_tag
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score, flat_classification_report
import re

In [2]:
ontonotes_path = "data/ner_kaggle_competition/"

In [3]:
train_data = ['train/' + filename for filename in os.listdir(ontonotes_path + 'train')]
dev_data = ['dev/' + filename for filename in os.listdir(ontonotes_path + 'dev')]

In [4]:
train_data[:10]

['train/wsj_2164.name',
 'train/wsj_0509.name',
 'train/abc_0025.name',
 'train/wsj_2021.name',
 'train/mnb_0007.name',
 'train/wsj_1877.name',
 'train/cnn_0331.name',
 'train/wsj_0765.name',
 'train/wsj_0335.name',
 'train/wsj_1174.name']

## Initial Data Processing

The following code block converts data from the `.name` file format to standard IOB (also called BIO) tags suitable for Named Entity Recognition (NER).

The `.name` files contain sentences in which XML tags mark the named entities. For instance, in the following example:

`<ENAMEX TYPE="GPE">Hong Kong</ENAMEX>`

The tag for 'Hong' is *B-GPE* and the tag for 'Kong' is *I-GPE* (GPE stands for Geopolitical Entity).

In [5]:
sentence1 = 'While <ENAMEX TYPE="PERSON">Galloway</ENAMEX> \'s <ENAMEX TYPE="ORG" S_OFF="4">pro-Wal-Mart</ENAMEX> film introduces us to grateful employees /-'
sentence2 = '<ENAMEX TYPE="GPE">Moscow</ENAMEX> , overcast changing to moderate snow , <ENAMEX TYPE="QUANTITY">2 degrees below zero</ENAMEX> to <ENAMEX TYPE="QUANTITY">1 degree</ENAMEX> .'
soup = BeautifulSoup(sentence1, "html.parser")

In [6]:
import bs4

def sentence2iob(sentence):
    '''Input sentence is a string from the Ontonotes corpus, with xml tags indicating named entities
    output is a list of tokens and a list of NER IOB-tags corresponding to those tokens'''
    soup = BeautifulSoup(sentence, "html.parser")
    curr_tokens = []
    curr_tags = []
    for child in soup:
        if child.name == "enamex" and not isinstance(child, bs4.element.NavigableString):
            ner_toks = child.get_text().split()
            for i, tok in enumerate(ner_toks):
                curr_tokens.append(tok)
                if i == 0:
                    curr_tags.append("B-" + child.get("type"))
                else:
                    curr_tags.append("I-" + child.get("type"))
        else:
            for tok in child.split():
                curr_tags.append("O")
                curr_tokens.append(tok)
    
    return curr_tokens, curr_tags
    

In [7]:
check_sentence = 'While <ENAMEX TYPE="PERSON">Galloway</ENAMEX> \'s <ENAMEX TYPE="ORG" S_OFF="4">pro-Wal-Mart</ENAMEX> film introduces us to grateful employees /-'
curr_tokens, curr_tags = sentence2iob(check_sentence)
assert curr_tags == ['O', 'B-PERSON', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

check_sentence = '<ENAMEX TYPE="GPE">Moscow</ENAMEX> , overcast changing to moderate snow , <ENAMEX TYPE="QUANTITY">2 degrees below zero</ENAMEX> to <ENAMEX TYPE="QUANTITY">1 degree</ENAMEX> .'
curr_tokens, curr_tags = sentence2iob(check_sentence)
assert curr_tags == ['B-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'O', 'B-QUANTITY', 'I-QUANTITY', 'O']

# If tests passed, then create tokens and tags
token_count = 0
for filename in train_data:
    with open(ontonotes_path + filename, encoding="utf-8") as f:
        f.readline()
        for sentence in f:
            curr_tokens, curr_tags = sentence2iob(sentence)
            token_count += len(curr_tokens)
            assert "" not in curr_tokens # checking for empty strings

assert token_count == 1096878

print("Success!")

" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

Success!
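The Beautiful Soup message in the output above is a warning, not an error: the run still reaches `Success!`. A minimal sketch of one way to silence it with the standard-library `warnings` module follows. The helper name `silence_url_lookalike_warning` is hypothetical, and the filter assumes the warning text contains the phrase shown in the output.

```python
import warnings


def silence_url_lookalike_warning():
    """Ignore warnings whose text contains 'looks like a URL'.

    Assumption: the wording matches the message shown in the notebook output.
    """
    # The pattern is matched against the start of the message, so it begins
    # with '.*'; (?s) lets '.' also match a newline inside the quoted line.
    warnings.filterwarnings("ignore", message=r"(?s).*looks like a URL")


silence_url_lookalike_warning()
```

Calling the helper once, before the corpus files are parsed, should be enough. Other warnings are left untouched.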


## Feature Engineering

The following function `word2features` builds a feature dictionary for each token. The engineered features include:

- Features which look at neighbouring words (up to two tokens on each side).
- Features which look at word morphology (2- and 3-character prefixes and suffixes).
- Features which consider the "shape" of the word.
- Features which include POS tags.
- Gazetteer features using the `nltk` Gazetteer corpus.
- Name features using the `nltk` Names corpus.
- Stop word features using the `nltk` StopWords corpus.
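To illustrate what a word "shape" feature can look like, here is a minimal sketch that compresses a token into a pattern string. This is an alternative formulation for illustration only: the notebook encodes shape through separate boolean features such as `isupper`, `istitle` and `isdigit`, and the helper name `word_shape` is hypothetical.

```python
def word_shape(token):
    """Map each character to a class symbol: X (upper), x (lower), d (digit)."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # punctuation and other symbols are kept as-is
    return "".join(shape)


print(word_shape("Wal-Mart"))  # Xxx-Xxxx
print(word_shape("2021"))      # dddd
```

A single shape string like this could be added to the feature dictionary next to the boolean features, at the cost of a larger feature space.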

In [87]:
gazetteer_words = set(gazetteers.words())
names_words = set(names.words())
# no language is specified, so this loads the stop words of every language in the corpus
stop_words = set(stopwords.words())

def word2features(sentence, idx):
    """Engineers informative features for the NER model training."""
    # NOTE: the whole sentence is re-tagged on every call, i.e. once per token
    pos_tags = pos_tag(sentence)
    word_features = {}
    # token
    word_features["word"] = sentence[idx]
    # POS tags
    word_features["pos_tag_full"] = pos_tags[idx][1]
    word_features["pos_tag_short"] = pos_tags[idx][1][:2]
    # lower case
    word_features["word_lowercase"] = sentence[idx].lower()
    # sentence-initial token (boolean)
    if idx == 0:
        word_features["word_uppercase"] = True
    # upper case (boolean)
    if sentence[idx].isupper():
        word_features["word_alluppercase"] = True
    # neither all upper case nor all lower case, and not sentence-initial (boolean)
    if not sentence[idx].isupper() and not sentence[idx].islower() and idx != 0:
        word_features["word_tilecase"] = True
    else:
        word_features["word_tilecase"] = False
    # digit case (boolean)
    word_features["word_isdigit"] = sentence[idx].isdigit()
    # all alphabet case (boolean)
    word_features["word_isalpha"] = sentence[idx].isalpha()
    # title case (boolean)
    word_features["word_istitle"] = sentence[idx].istitle()
    # space case (boolean)
    word_features["word_isspace"] = sentence[idx].isspace()
    # word morphology - 3 chars
    if len(sentence[idx]) > 3:
        word_features["word_init_3chars"] = sentence[idx][:3]
        word_features["word_end_3chars"] = sentence[idx][-3:]
    # word morphology - 2 chars
    if len(sentence[idx]) > 2:
        word_features["word_init_2chars"] = sentence[idx][:2]
        word_features["word_end_2chars"] = sentence[idx][-2:]
    # neighbours
    if idx > 0:
        word_features["word_left_neighbour"] = sentence[idx-1]
        word_features["pos_tag_full_left"] = pos_tags[idx-1][1]
        word_features["pos_tag_short_left"] = pos_tags[idx-1][1][:2]
    if idx < len(sentence) - 1:
        word_features["word_right_neighbour"] = sentence[idx+1]
        word_features["pos_tag_full_right"] = pos_tags[idx+1][1]
        word_features["pos_tag_short_right"] = pos_tags[idx+1][1][:2]
    if idx > 1:
        word_features["word_left_left_neighbour"] = sentence[idx-2]
        word_features["pos_tag_full_left_left"] = pos_tags[idx-2][1]
        word_features["pos_tag_short_left_left"] = pos_tags[idx-2][1][:2]
    if idx < len(sentence) - 2:
        word_features["word_right_right_neighbour"] = sentence[idx+2]
        word_features["pos_tag_full_right_right"] = pos_tags[idx+2][1]
        word_features["pos_tag_short_right_right"] = pos_tags[idx+2][1][:2]
    # gazetteer
    if sentence[idx] in gazetteer_words:
        word_features["gazetteers"] = sentence[idx]
    # names
    if sentence[idx] in names_words:
        word_features["names"] = sentence[idx] 
    # stopwords
    if sentence[idx] in stop_words:
        word_features["stopwords"] = sentence[idx]
    
    return word_features
    
def sentence2features(sentence):
    """Takes a sentence, iterates over its words, and creates a list of engineered features."""
    return [word2features(sentence, idx) for idx in range(len(sentence))]

The following code block prepares the feature dictionaries required to train the CRF model.

In [93]:
def prepare_ner_feature_dicts(ner_files):
    """ner_files is a list of Ontonotes files with NER annotations. Returns feature dictionaries and 
    IOB tags for each token in the entire dataset, grouped by sentence as required by a CRF model."""
    train_dicts = []
    train_tags = []
    for filename in ner_files:
        with open(ontonotes_path + filename, encoding="utf-8") as f:
            f.readline()  # skip the first line of the file
            for sentence in f:
                curr_tokens, curr_tags = sentence2iob(sentence)
                if curr_tokens:
                    # one entry per sentence: the CRF is trained on sequences, not on single tokens
                    train_dicts.append(sentence2features(curr_tokens))
                    train_tags.append(curr_tags)

    # every sentence needs exactly one tag sequence (the function is used for both train and dev data)
    assert len(train_dicts) == len(train_tags)

    return train_dicts, train_tags
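sklearn-crfsuite's `CRF.fit` takes sequences: a list of sentences, each a list of per-token feature dicts, with the tags nested the same way. The toy sketch below contrasts that nested layout with a flattened one. The sentences and the one-key feature dicts are made up for illustration.

```python
# Toy data: two sentences with their IOB tags (hypothetical examples)
sentences = [["Moscow", "is", "cold"], ["Hong", "Kong"]]
tags = [["B-GPE", "O", "O"], ["B-GPE", "I-GPE"]]


def toy_features(sentence):
    """Stand-in for a real feature function: one tiny dict per token."""
    return [{"word": tok} for tok in sentence]


X_seq, y_seq = [], []    # nested: one entry per sentence
X_flat, y_flat = [], []  # flat: one entry per token, sentence boundaries lost
for sent, sent_tags in zip(sentences, tags):
    feats = toy_features(sent)
    X_seq.append(feats)
    y_seq.append(sent_tags)
    X_flat.extend(feats)
    y_flat.extend(sent_tags)

print(len(X_seq), len(X_flat))  # 2 5
```

`list.append` keeps one entry per sentence, which is the layout a sequence model needs, whereas `list.extend` flattens everything into a single token list.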

In [94]:
train_dicts, train_tags = prepare_ner_feature_dicts(train_data)
dev_dicts, dev_tags = prepare_ner_feature_dicts(dev_data)

" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

## CRF Model Training

In [102]:
# Random Search
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats
from sklearn.metrics import make_scorer

params = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
    'min_freq': [0, 10, 25, 20, 100],
    'algorithm': ['lbfgs', 'l2sgd'],
    'all_possible_states': [True, False],
    'all_possible_transitions': [True, False]
    
}

crf = CRF(max_iterations=100)

# weighted F1 for model selection (note: the leaderboard metric is micro F1)
f1_scorer = make_scorer(flat_f1_score, average='weighted')

# search
rs = RandomizedSearchCV(crf, 
                        params,
                        cv=3,
                        verbose=1,
                        n_iter=5,
                        scoring=f1_scorer)

rs.fit(train_dicts, train_tags)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Traceback (most recent call last):
  File "/Users/JJR/opt/miniconda3/envs/573/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/JJR/opt/miniconda3/envs/573/lib/python3.8/site-packages/sklearn_crfsuite/estimator.py", line 307, in fit
    trainer = self._get_trainer()
  File "/Users/JJR/opt/miniconda3/envs/573/lib/python3.8/site-packages/sklearn_crfsuite/estimator.py", line 530, in _get_trainer
    return trainer_cls(
  File "pycrfsuite/_pycrfsuite.pyx", line 260, in pycrfsuite._pycrfsuite.BaseTrainer.__init__
  File "pycrfsuite/_pycrfsuite.pyx", line 390, in pycrfsuite._pycrfsuite.BaseTrainer.set_params
  File "pycrfsuite/_pycrfsuite.pyx", line 421, in pycrfsuite._pycrfsuite.BaseTrainer.set
ValueError: Parameter not found: c1 = 0.4573919215958079

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapse

RandomizedSearchCV(cv=3, estimator=CRF(keep_tempfiles=None, max_iterations=100),
                   n_iter=5,
                   param_distributions={'algorithm': ['lbfgs', 'l2sgd'],
                                        'all_possible_states': [True, False],
                                        'all_possible_transitions': [True,
                                                                     False],
                                        'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fec3a76eeb0>,
                                        'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fec3a76e3a0>,
                                        'min_freq': [0, 10, 25, 20, 100]},
                   scoring=make_scorer(flat_f1_score, average=weighted),
                   verbose=1)
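The `ValueError: Parameter not found: c1` in the traceback above is what CRFsuite raises when `c1` is passed to a trainer that has no such parameter. In CRFsuite the L1 coefficient `c1` belongs to `lbfgs`, while `l2sgd` only takes `c2`, so candidates that pair `l2sgd` with `c1` fail to fit. A minimal sketch of algorithm-aware sampling follows, using only the standard library. `sample_crf_params` is a hypothetical helper, and `random.expovariate` stands in for `scipy.stats.expon`.

```python
import random


def sample_crf_params(rng):
    """Draw one candidate; `c1` is only included when the algorithm supports it."""
    algorithm = rng.choice(["lbfgs", "l2sgd"])
    params = {
        "algorithm": algorithm,
        # expovariate(1 / scale) has mean `scale`, like scipy.stats.expon(scale=scale)
        "c2": rng.expovariate(1 / 0.05),
        "min_freq": rng.choice([0, 10, 25, 20, 100]),
        "all_possible_states": rng.choice([True, False]),
        "all_possible_transitions": rng.choice([True, False]),
    }
    if algorithm == "lbfgs":
        params["c1"] = rng.expovariate(1 / 0.5)
    return params


rng = random.Random(0)  # fixed seed so the candidates are reproducible
candidates = [sample_crf_params(rng) for _ in range(5)]
```

Each candidate could then be passed on as `CRF(max_iterations=100, **params)`. Recent scikit-learn versions also accept a list of dicts as `param_distributions`, which allows the same per-algorithm split inside `RandomizedSearchCV`.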

In [111]:
print(rs.best_params_)

{'algorithm': 'lbfgs', 'all_possible_states': True, 'all_possible_transitions': True, 'c1': 0.3584489507233674, 'c2': 0.11652449959538214, 'min_freq': 0}


In [105]:
# Training CRF model with best parameters from random search
crf3 = CRF(algorithm='lbfgs', 
           max_iterations=900, 
           verbose=True, 
           c1=0.3584489507233674, 
           c2=0.11652449959538214, 
           all_possible_states=True,
           all_possible_transitions=True)
crf3.fit(train_dicts, train_tags)

loading training data to CRFsuite: 100%|██████████| 55468/55468 [01:14<00:00, 739.98it/s] 



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 1
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 9810624
Seconds required: 360.361

L-BFGS optimization
c1: 0.358449
c2: 0.116524
num_memories: 6
max_iterations: 900
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=17.10 loss=2551961.91 active=1826650 feature_norm=1.00
Iter 2   time=17.61 loss=1835232.23 active=1774130 feature_norm=5.98
Iter 3   time=6.21  loss=1483518.10 active=1251247 feature_norm=5.13
Iter 4   time=35.15 loss=840770.63 active=927167 feature_norm=3.82
Iter 5   time=17.76 loss=773395.51 active=877555 feature_norm=3.90
Iter 6   time=6.29  loss=694636.59 active=912348 feature_norm=4.82
Iter 7   time=6.30  loss=620411.38 active=880450 feature_norm=6.25
Iter 8   time=6.26  loss=522743.32 active=758688 feature_norm=9.10
Iter 9   time=6.23  loss=445575.92 active=736237 feat

CRF(algorithm='lbfgs', all_possible_states=True, all_possible_transitions=True,
    c1=0.3584489507233674, c2=0.11652449959538214, keep_tempfiles=None,
    max_iterations=900, verbose=True)

In [106]:
y_pred = crf3.predict(dev_dicts)
print(f"Micro F1:{flat_f1_score(dev_tags, y_pred, average='micro')}")
print(f"Macro F1:{flat_f1_score(dev_tags, y_pred, average='macro')}")
print(f"{flat_classification_report(dev_tags, y_pred, digits=3)}")

Micro F1:0.9692182046792559
Macro F1:0.7218686604128397
               precision    recall  f1-score   support

   B-CARDINAL      0.837     0.882     0.859      1216
       B-DATE      0.886     0.890     0.888      2230
      B-EVENT      0.744     0.492     0.593       130
        B-FAC      0.516     0.322     0.397       149
        B-GPE      0.924     0.944     0.934      2738
   B-LANGUAGE      0.944     0.588     0.724       114
        B-LAW      0.515     0.362     0.425        47
        B-LOC      0.776     0.688     0.729       231
      B-MONEY      0.936     0.923     0.929       712
       B-NORP      0.897     0.918     0.907       928
    B-ORDINAL      0.794     0.883     0.836       222
        B-ORG      0.868     0.827     0.847      3024
    B-PERCENT      0.930     0.920     0.925       574
     B-PERSON      0.882     0.900     0.891      2082
    B-PRODUCT      0.463     0.307     0.369       101
   B-QUANTITY      0.894     0.672     0.767       125
       B

In [107]:
# Printing top 10 and bottom 10 transitions to evaluate model empirically

transition_feats = crf3.transition_features_
# top 10
top_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[:10]
# bottom 10
bottom_10 = sorted(transition_feats.items(), key=lambda x: x[1], reverse=True)[-10:]
print(f"Top 10 transitions: {top_10}")
print()
print(f"Bottom 10 transitions: {bottom_10}")

Top 10 transitions: [(('I-MONEY', 'I-MONEY'), 11.153637), (('I-LAW', 'I-LAW'), 10.41666), (('I-EVENT', 'I-EVENT'), 10.382468), (('I-LOC', 'I-LOC'), 10.356205), (('B-TIME', 'I-TIME'), 10.105098), (('I-FAC', 'I-FAC'), 10.019691), (('I-TIME', 'I-TIME'), 9.916139), (('I-GPE', 'I-GPE'), 9.833531), (('I-QUANTITY', 'I-QUANTITY'), 9.638192), (('I-CARDINAL', 'I-CARDINAL'), 9.549802)]

Bottom 10 transitions: [(('O', 'I-EVENT'), -5.593083), (('O', 'I-FAC'), -5.797168), (('O', 'I-MONEY'), -5.865978), (('O', 'I-LOC'), -5.916598), (('O', 'I-GPE'), -6.105288), (('O', 'I-PERSON'), -6.139959), (('O', 'I-WORK_OF_ART'), -6.240348), (('O', 'I-CARDINAL'), -6.426434), (('O', 'I-DATE'), -7.908284), (('O', 'I-ORG'), -9.424757)]


> Note: The highest-weighted transitions are the ones the IOB scheme expects: an entity either continues with an `I-` tag of the same type or moves from its `B-` tag to the matching `I-` tag. For example, `('B-TIME', 'I-TIME')` may represent a time in HH:MM format, just as `('B-DATE', 'I-DATE')` may represent a date in MM/DD format.
>
> Conversely, transitions that should never occur are ranked lowest, such as an `O` tag followed directly by an `I-` tag.

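The rule behind the note above can be written down directly. The sketch below is a hypothetical helper, not part of the notebook, that checks whether one tag may follow another under the IOB scheme.

```python
def is_valid_iob_transition(prev_tag, next_tag):
    """Return True if next_tag may follow prev_tag under the IOB scheme."""
    if not next_tag.startswith("I-"):
        return True  # 'O' and any 'B-*' tag can follow any tag
    entity = next_tag[2:]
    # an 'I-X' tag must continue an entity of the same type X
    return prev_tag in ("B-" + entity, "I-" + entity)


print(is_valid_iob_transition("B-TIME", "I-TIME"))  # True
print(is_valid_iob_transition("O", "I-ORG"))        # False
```

All ten lowest-weighted transitions in the output fail this check, which suggests the model has learned the constraint from the data rather than having it imposed.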
## Evaluating on Test Set and Submitting to Kaggle

In [108]:
test_data = [ontonotes_path + 'test_untagged/' + file for file in os.listdir(ontonotes_path + 'test_untagged/')]
test_data = sorted(test_data) # ensures the files are in a standard order for consistency
test_dicts = []
# creates feature dicts
for file in test_data:
    with open(file, encoding="utf-8") as f:
        for sentence in f:
            curr_tokens = sentence.split()
            test_dicts.append(sentence2features(curr_tokens))

In [109]:
import csv

# makes predictions and saves them in csv file
header = ['Id', 'Predicted']
y_pred = crf3.predict(test_dicts)
i = 0
with open("test_tags.csv", "w", newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(header) 
    for pred in y_pred:
        for tag in pred:
            writer.writerow([i, tag])
            i += 1

After iterating several times over the training procedure, I reached a `micro f1` score of `0.96838` on the public leaderboard (calculated on approx. 50% of the test data) and `0.96819` on the private leaderboard (calculated on the remaining 50% of test data), reaching 2nd and 1st place respectively. See the [competition's Leaderboard](https://www.kaggle.com/c/colx-563-lab-assignment-1/leaderboard) for the final standings.
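For reference, here is a minimal sketch of how micro F1 behaves on nested tag sequences, written without any library. It assumes every label, including `O`, is counted, as in the `flat_f1_score(..., average='micro')` call used on the dev set. In that setting each wrong token is one false positive and one false negative, so micro F1 equals token accuracy. The function name and toy data are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all labels of nested tag sequences."""
    tp = fp = fn = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp += 1
            else:
                fp += 1  # the predicted label is a false positive
                fn += 1  # the true label is a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [["B-GPE", "I-GPE", "O", "O"], ["O", "B-ORG"]]
y_pred = [["B-GPE", "O", "O", "O"], ["O", "B-ORG"]]
print(micro_f1(y_true, y_pred))  # 5 of 6 tokens are correct
```

This is likely why the micro score on the dev set (about 0.969) sits well above the macro score (about 0.722): frequent tags dominate the micro average, while rare entity types weigh equally in the macro average.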