In [None]:
#|hide
#|default_exp ner_crf

# NER with Conditional Random Fields (CRF)

Conditional Random Fields (CRFs) were a powerful technique for sequence labelling before Deep Learning became popular. As in POS tagging, we tag the tokens of a sequence, here in order to label the most important information related to the problem at hand: generic named entities (locations, people, and organizations) or more specialized entities such as diseases or symptoms.

For this we use `sklearn-crfsuite`.

Probably because of the rise of Deep Learning, `sklearn-crfsuite` is no longer updated. So instead of the first (commented-out) install command below, we install an updated fork of the library in order to be able to produce reports:

In [None]:
#| hide
#!pip install sklearn-crfsuite # Run only once

In [None]:
#| export
#%pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git\#egg=sklearn_crfsuite

The data comes from NLTK (the Dutch part of the CoNLL-2002 corpus):

In [None]:
#| export
import nltk
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn_crfsuite as crfsuite
from sklearn_crfsuite import metrics

In [None]:
#| export
#nltk.download('conll2002') # Just run this line once
train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))
dev_sents = list(nltk.corpus.conll2002.iob_sents('ned.testa'))
test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))

Let's have a look at the data. It is a list of tokenized sentences in which every token comes with its string, its POS tag and its entity tag. POS tags are rarely used in deep learning nowadays, but for a CRF they carry useful information: nouns denote entities more often than verbs do.

In [None]:
#| export
train_sents[0]

[('De', 'Art', 'O'),
 ('tekst', 'N', 'O'),
 ('van', 'Prep', 'O'),
 ('het', 'Art', 'O'),
 ('arrest', 'N', 'O'),
 ('is', 'V', 'O'),
 ('nog', 'Adv', 'O'),
 ('niet', 'Adv', 'O'),
 ('schriftelijk', 'Adj', 'O'),
 ('beschikbaar', 'Adj', 'O'),
 ('maar', 'Conj', 'O'),
 ('het', 'Art', 'O'),
 ('bericht', 'N', 'O'),
 ('werd', 'V', 'O'),
 ('alvast', 'Adv', 'O'),
 ('bekendgemaakt', 'V', 'O'),
 ('door', 'Prep', 'O'),
 ('een', 'Art', 'O'),
 ('communicatiebureau', 'N', 'O'),
 ('dat', 'Conj', 'O'),
 ('Floralux', 'N', 'B-ORG'),
 ('inhuurde', 'V', 'O'),
 ('.', 'Punc', 'O')]
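
The entity tags follow the IOB scheme: `B-` opens an entity, `I-` continues it and `O` marks tokens outside any entity. As a minimal sketch of how to read such a sentence, the helper below (`iob_to_spans` is a hypothetical name, not part of NLTK) collects the entities from the `(token, postag, tag)` triples:

```python
def iob_to_spans(sent):
    """Collect (entity_text, entity_type) pairs from (token, postag, iob_tag) triples."""
    spans = []
    current_tokens, current_type = [], None
    for token, _postag, tag in sent:
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and current_type != tag[2:]
        )
        if starts_new:
            # Close any open entity, then open a new one.
            # (An I- tag without a matching open entity is treated as a start.)
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-'):
            current_tokens.append(token)
        else:
            # 'O' closes any open entity
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans

# The tail of the sentence shown above
example = [('een', 'Art', 'O'), ('communicatiebureau', 'N', 'O'), ('dat', 'Conj', 'O'),
           ('Floralux', 'N', 'B-ORG'), ('inhuurde', 'V', 'O'), ('.', 'Punc', 'O')]
print(iob_to_spans(example))
```

For this sentence the only entity is the organization `Floralux`.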

How does a CRF work? Deep-learning neural nets learn their relevant features from the input texts themselves. CRFs learn the relationship between **the features we give them** and the **label of a token in a given context**. They **do not** learn these features themselves, so the quality of the model depends heavily on the features we present to them.

We therefore:

- use a method to collect the features for every token:
  - the word itself + POS tag
  - completely uppercase? starts with digit? starts with capital?
  - character bigram or trigram the word ends with
  - use a `bias` feature that always has the same value (through it the CRF model can learn the relative freq. of each label type in the training data)

- use word embeddings to give the model more information about the meaning of a word (500 clusters of Wikipedia word embeddings). We read the clusters from file and map each word to the ID of the cluster it belongs to; `read_clusters` does this.

- let the CRF look at the context of a token. We provide the CRF with the two words on either side of the token: their lowercased form, their case and their POS tag. If there is no left or right neighbour, we tell it so with a `BOS` or `EOS` feature.
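
To make the idea of learning a relationship between features and labels more concrete, here is a minimal sketch of how a linear-chain CRF scores one candidate label sequence: it adds up a weight for every (feature, label) pair and for every (previous label, label) transition. The weights are invented for illustration and `sequence_score` is a hypothetical helper, not part of `sklearn-crfsuite`. Roughly speaking, training means finding weights that give the correct sequences a high score.

```python
# Hypothetical weights; a trained CRF would learn these from the data.
state_weights = {
    ('word.istitle=True', 'B-ORG'): 1.5,
    ('word.istitle=True', 'O'): -0.5,
    ('postag=V', 'O'): 1.0,
}
transition_weights = {
    ('B-ORG', 'O'): 0.5,
    ('O', 'I-ORG'): -2.0,  # an I- tag directly after O is implausible
}

def sequence_score(feature_seq, label_seq):
    """Unnormalised score of a label sequence given one feature list per token."""
    score = 0.0
    prev_label = None
    for features, label in zip(feature_seq, label_seq):
        # State features: how well does each feature of this token fit the label?
        score += sum(state_weights.get((f, label), 0.0) for f in features)
        # Transition feature: how plausible is this label after the previous one?
        if prev_label is not None:
            score += transition_weights.get((prev_label, label), 0.0)
        prev_label = label
    return score

X_toy = [['word.istitle=True'], ['postag=V']]
print(sequence_score(X_toy, ['B-ORG', 'O']))
print(sequence_score(X_toy, ['O', 'O']))
```

With these toy weights the sequence `B-ORG O` scores higher than `O O`, because the title-cased first token and the `B-ORG` to `O` transition both contribute positively.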

In [None]:
#| export
# Making use of the Wikipedia word embeddings
def read_clusters(cluster_file):
  word2cluster = {}
  with open(cluster_file) as i:
    for line in i:
      word, cluster = line.strip().split('\t')
      word2cluster[word] = cluster
  return word2cluster

# Using features of the words AND looking at the context of a token (neighbours +/- 2)
def word2features(sent, i, word2cluster):
  word = sent[i][0]
  postag = sent[i][1]
  features = [
    'bias',
    'word.lower=' + word.lower(),
    'word[-3]=' + word[-3:], # looking at the last 3 chars of the token
    'word[-2]=' + word[-2:], # looking at the last 2 chars of the token
    'word.isupper=%s' % word.isupper(),
    'word.istitle=%s' % word.istitle(),
    'word.isdigit=%s' % word.isdigit(),
    # Cluster ID of the word, or '0' if the word is not in any cluster
    'word.cluster=%s' % word2cluster.get(word.lower(), '0'),
    'postag=' + postag
  ]
  # Look at the first neighbour token to the left
  if i > 0:
    word1 = sent[i-1][0]
    postag1 = sent[i-1][1]
    features.extend([
      '-1:word.lower=' + word1.lower(),
      '-1:word.istitle=%s' % word1.istitle(),
      '-1:word.isupper=%s' % word1.isupper(),
      '-1:postag=' + postag1
    ])
  else:
    features.append('BOS')
  # Look at the second neighbour to the left
  if i > 1: 
    word2 = sent[i-2][0]
    postag2 = sent[i-2][1]
    features.extend([
      '-2:word.lower=' + word2.lower(),
      '-2:word.istitle=%s' % word2.istitle(),
      '-2:word.isupper=%s' % word2.isupper(),
      '-2:postag=' + postag2
    ])
  # Look at the first neighbour to the right
  if i < len(sent)-1:
    word1 = sent[i+1][0]
    postag1 = sent[i+1][1]  # POS tag is at index 1 of the (token, postag, label) tuple
    features.extend([
      '+1:word.lower=' + word1.lower(),
      '+1:word.istitle=%s' % word1.istitle(),
      "+1:word.isupper=%s" % word1.isupper(),
      '+1:postag=' + postag1
    ])
  else:
    features.append('EOS')
  # Look at the second neighbour to the right
  if i < len(sent)-2:
    word2 = sent[i+2][0]
    postag2 = sent[i+2][1]  # POS tag is at index 1 of the (token, postag, label) tuple
    features.extend([
      '+2:word.lower=' + word2.lower(),
      '+2:word.istitle=%s' % word2.istitle(),
      "+2:word.isupper=%s" % word2.isupper(),
      '+2:postag=' + postag2
    ])
  return features

# Now we define the functions to do all the work
def sent2features(sent, word2cluster):
  return [word2features(sent, i, word2cluster) for i in range(len(sent))]

def sent2labels(sent):
  return [label for token, postag, label in sent]

def sent2tokens(sent):
  return [token for token, postag, label in sent]

word2cluster = read_clusters('/home/peter/Documents/data/nlp/clusters_nl.tsv')

Let's try out the `sent2features` function on the first word of the first training sentence:

In [None]:
#| export
train_sents[0][0]

('De', 'Art', 'O')

In [None]:
#| export
sent2features(train_sents[0], word2cluster)[0]

['bias',
 'word.lower=de',
 'word[-3]=De',
 'word[-2]=De',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'word.cluster=38',
 'BOS',
 '+1:word.lower=tekst',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag=tekst',
 '+2:word.lower=van',
 '+2:word.istitle=False',
 '+2:word.isupper=False',
 '+2:postag=van']

Now we build the feature sequences (`X`) and the label sequences (`y`) for the training, development and test sets:

In [None]:
#| export
X_train = [sent2features(s, word2cluster) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_dev = [sent2features(s, word2cluster) for s in dev_sents]
y_dev = [sent2labels(s) for s in dev_sents]

X_test = [sent2features(s, word2cluster) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]


Next we create a CRF model and start the training using the standard `lbfgs` algorithm for parameter estimation and run it for 100 iterations.

When done, we save the model using `joblib`.

In [None]:
#| export
crf = crfsuite.CRF(
    verbose=True,
    algorithm='lbfgs',
    max_iterations=100
)

crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)

loading training data to CRFsuite: 100%|██████████| 15806/15806 [00:01<00:00, 10135.78it/s]
loading dev data to CRFsuite: 100%|██████████| 2895/2895 [00:00<00:00, 9396.00it/s]

Holdout group: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 171320
Seconds required: 0.414

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.29  loss=104683.26 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.00
Iter 2   time=0.16  loss=96793.85 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.15
Iter 3   time=0.16  loss=92785.91 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.26
Iter 4   time=0.16  loss=87079.17 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.46
Iter 5   time=0.16  loss=74874.43 active=17

There is an error, but this has to do with the scikit-learn version we use; to make the error disappear we should use scikit-learn < 0.24.

Let's see whether we can write our model to file:

In [None]:
#| export
import joblib
import os

OUTPUT_PATH = '/home/peter/Documents/data/nlp/models'
OUTPUT_FILE = 'crf_model'

# Create the output directory (and any missing parents) if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))

['/home/peter/Documents/data/nlp/models/crf_model']

With our model saved, we can now evaluate the output of our CRF model. We load the model from file and test it on the full test set.

We will have a look at the first sentence.

In [None]:
#| export
crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))
y_pred = crf.predict(X_test)

example_sent = test_sents[0]
print("Sentence:", ' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(crf.predict([sent2features(example_sent, word2cluster)])[0]))
print("Correct:", ' '.join(sent2labels(example_sent)))

Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .
Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O
Correct: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O


We are now ready to evaluate the whole test set. We print a classification report for all labels except 'O'. The 'O' labels form the large majority and are most probably assigned correctly, so including them would skew the results.
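
To illustrate how much the 'O' labels can flatter the numbers, the sketch below takes the predicted and correct labels of the example sentence above and compares plain token accuracy with and without 'O'. The `accuracy` helper is hypothetical and deliberately simpler than the per-label precision and recall that the classification report computes.

```python
# Labels copied from the example sentence printed above
predicted = "O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O".split()
correct = "O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O".split()

def accuracy(y_true, y_pred, ignore=None):
    """Share of tokens labelled correctly, optionally skipping one gold label."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore]
    return sum(t == p for t, p in pairs) / len(pairs)

print(round(accuracy(correct, predicted), 3))              # over all 22 tokens
print(round(accuracy(correct, predicted, ignore='O'), 3))  # over the 6 entity tokens
```

Over all tokens the sentence looks almost perfect (20 of 22 correct), but only 4 of the 6 entity tokens are right, which is the number we actually care about.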

In [None]:
#| export
labels = list(crf.classes_)
labels.remove('O')
y_pred = crf.predict(X_test)
sorted_labels = sorted(
  labels,
  key=lambda name: (name[1:], name[0])
)
# The following code only runs with the updated metrics.py module in `sklearn_crfsuite` library.
# Here: pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git\#egg=sklearn_crfsuite
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

       B-LOC       0.83      0.82      0.83       774
       I-LOC       0.35      0.45      0.40        49
      B-MISC       0.81      0.61      0.70      1187
      I-MISC       0.54      0.41      0.46       410
       B-ORG       0.79      0.70      0.74       882
       I-ORG       0.75      0.61      0.67       551
       B-PER       0.82      0.88      0.85      1098
       I-PER       0.90      0.95      0.93       807

   micro avg       0.80      0.74      0.77      5758
   macro avg       0.72      0.68      0.70      5758
weighted avg       0.80      0.74      0.76      5758



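The scores are good for the most frequent entity types: `B-LOC` (F1 0.83), `B-PER` (0.85) and `I-PER` (0.93) score best, whereas `I-LOC` (0.40) and `I-MISC` (0.46) lag behind.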

Next we will use the `eli5` library to have a look at the most likely transitions the CRF model has identified. `eli5` helps us explain the predictions of our CRF model.
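
As a lighter-weight alternative to `eli5` (a swapped-in approach, not the one described in the text), the transition weights can also be ranked by hand. A fitted `sklearn_crfsuite.CRF` exposes them as the dictionary `transition_features_`, mapping `(label_from, label_to)` to a weight. The sketch below uses invented weights so that it runs on its own; `top_transitions` is a hypothetical helper.

```python
from collections import Counter

def top_transitions(transition_features, n=3):
    """Return the n highest- and the n lowest-weighted label transitions."""
    ranked = Counter(transition_features).most_common()  # sorted by weight, descending
    return ranked[:n], ranked[-n:]

# Hypothetical weights for illustration only; with the trained model one would
# pass crf.transition_features_ instead.
toy_transitions = {
    ('B-PER', 'I-PER'): 6.2,
    ('B-LOC', 'I-LOC'): 4.0,
    ('O', 'O'): 3.1,
    ('B-ORG', 'I-LOC'): -3.0,
    ('O', 'I-PER'): -5.5,
}

likely, unlikely = top_transitions(toy_transitions, n=2)
for (label_from, label_to), weight in likely + unlikely:
    print('%-6s -> %-6s %6.2f' % (label_from, label_to, weight))
```

In a well-trained model one would expect transitions such as `B-PER` to `I-PER` near the top and ill-formed ones such as `O` to `I-PER` near the bottom.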

## Finding the optimal hyperparameters

So far we've trained a model with the default parameters. It's unlikely that these will give us the best performance possible. Therefore we're going to search automatically for the best hyperparameter settings by iteratively training different models and evaluating them. Eventually we'll pick the best one.

Here we'll focus on two parameters: c1 and c2. These are the parameters for L1 and L2 regularization, respectively. Regularization prevents overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is the sum of the absolute values of the weights; in L2 regularization, it is the sum of the squared weights. L1 regularization performs a type of feature selection, as it assigns 0 weight to irrelevant features. L2 regularization, by contrast, makes the weight of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 is called the Ridge method, and the linear combination of both is called Elastic Net regularization.
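
The two penalties can be written down in a few lines. This is a minimal sketch of the idea, assuming the penalty is simply `c1` times the L1 norm plus `c2` times the squared L2 norm; the exact scaling inside CRFsuite may differ, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(weights, c1, c2):
    """Penalty added to the loss: c1 * sum(|w|) + c2 * sum(w^2)."""
    l1 = sum(abs(w) for w in weights)  # Lasso term: pushes weights to exactly zero
    l2 = sum(w * w for w in weights)   # Ridge term: shrinks large weights
    return c1 * l1 + c2 * l2

toy_weights = [0.5, -2.0, 0.0]
print(elastic_net_penalty(toy_weights, c1=0.0, c2=1.0))   # pure L2
print(elastic_net_penalty(toy_weights, c1=0.1, c2=0.01))  # mix of both (Elastic Net)
```

The first call corresponds to the settings reported in the training log earlier (`c1: 0.000000`, `c2: 1.000000`), that is, L2 regularization only.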

We define the parameter space for c1 and c2 and use the flat F1-score to compare the individual models. We'll rely on three-fold cross validation to score each of the 50 candidates. We use a randomized search, which means we're not going to try out all specified parameter settings, but instead, we'll let the process sample randomly from the distributions we've specified in the parameter space. It will do this 50 (n_iter) times. This process takes a while, but it's worth the wait.
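
What the randomized search does with the parameter space can be mimicked with the standard library. The sketch below draws 50 candidate settings from exponential distributions with the same scales as used here (0.5 for `c1`, 0.05 for `c2`). `sample_candidates` and its `seed` argument are hypothetical, and the draws will not match those of `RandomizedSearchCV`, which samples from the `scipy` distributions itself.

```python
import random

def sample_candidates(n_iter, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) settings from exponential distributions."""
    rng = random.Random(seed)
    return [
        {
            # expovariate takes the rate (1 / scale); the mean of the draws is `scale`
            'c1': rng.expovariate(1 / c1_scale),
            'c2': rng.expovariate(1 / c2_scale),
        }
        for _ in range(n_iter)
    ]

candidates = sample_candidates(n_iter=50)
n_folds = 3
print(len(candidates) * n_folds)  # number of models that have to be trained
```

Each candidate is trained once per fold, so 50 candidates with three-fold cross-validation mean 150 fits, which is why the search takes a while.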

In [None]:
#| export
import scipy
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = crfsuite.CRF(
  algorithm='lbfgs',
  max_iterations=100,
  all_possible_transitions=True,
  keep_tempfiles=True
)

params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


This runs now, since we set up a conda environment with Python 3.8. Surprisingly, that environment uses scikit-learn 1.1.1.

In [None]:
#| export
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.011442612624257335, 'c2': 0.010103491512999968}
best CV score: 0.7538097172556367
model size: 2.13M


In [None]:
#| export
best_crf = rs.best_estimator_
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.849     0.858     0.853       774
       I-LOC      0.389     0.571     0.463        49
      B-MISC      0.836     0.613     0.707      1187
      I-MISC      0.658     0.417     0.510       410
       B-ORG      0.827     0.724     0.772       882
       I-ORG      0.798     0.652     0.717       551
       B-PER      0.827     0.898     0.861      1098
       I-PER      0.880     0.965     0.921       807

   micro avg      0.824     0.756     0.789      5758
   macro avg      0.758     0.712     0.726      5758
weighted avg      0.821     0.756     0.781      5758

