# Named Entity Recognition with Conditional Random Fields

One of the classic challenges of Natural Language Processing is sequence labelling. In sequence labelling, the goal is to label each word in a text with a word class. In part-of-speech tagging, these word classes are parts of speech, such as noun or verb. In named entity recognition (NER), they're types of generic named entities, such as locations, people or organizations, or more specialized entities, such as diseases or symptoms in the healthcare domain. In this way, sequence labelling can help us extract the most important information from a text and improve the performance of analytics, search or matching applications. 

In this notebook we'll explore Conditional Random Fields, the most popular approach to sequence labelling before Deep Learning arrived. Deep Learning may get all the attention right now, but Conditional Random Fields are still a powerful tool to build a simple sequence labeller. 

The tool we're going to use is `sklearn-crfsuite`. This is a wrapper around `python-crfsuite`, which itself is a Python binding of [CRFSuite](http://www.chokkan.org/software/crfsuite/). The reason we're using `sklearn-crfsuite` is that it provides a number of handy utility functions, for example for evaluating the output of the model. You can install it with `pip install sklearn-crfsuite`.

## Data

First we get some data. A well-known data set for training and testing NER models is the CoNLL-2002 data, which has Spanish and Dutch texts labelled with four types of entities: locations (LOC), persons (PER), organizations (ORG) and miscellaneous entities (MISC). Both corpora are split into three portions: a training portion and two smaller test portions, one of which we'll use as development data. It's easy to collect the data from NLTK; if the corpus isn't on your machine yet, `nltk.download('conll2002')` fetches it.

In [1]:
import nltk
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn_crfsuite as crfsuite
from sklearn_crfsuite import metrics

In [2]:
train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))
dev_sents = list(nltk.corpus.conll2002.iob_sents('ned.testa'))
test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))

The data consists of a list of tokenized sentences. For each of the tokens we have the string itself, its part-of-speech tag and its entity tag, which follows the BIO convention. In the deep learning world we live in today, it's common to ignore the part-of-speech tags. However, since CRFs rely on good feature extraction, we'll gladly make use of this information. After all, the part of speech of a word tells us a lot about its possible status as a named entity: nouns will more often be entities than verbs, for example.

In [3]:
train_sents[0]

[('De', 'Art', 'O'),
 ('tekst', 'N', 'O'),
 ('van', 'Prep', 'O'),
 ('het', 'Art', 'O'),
 ('arrest', 'N', 'O'),
 ('is', 'V', 'O'),
 ('nog', 'Adv', 'O'),
 ('niet', 'Adv', 'O'),
 ('schriftelijk', 'Adj', 'O'),
 ('beschikbaar', 'Adj', 'O'),
 ('maar', 'Conj', 'O'),
 ('het', 'Art', 'O'),
 ('bericht', 'N', 'O'),
 ('werd', 'V', 'O'),
 ('alvast', 'Adv', 'O'),
 ('bekendgemaakt', 'V', 'O'),
 ('door', 'Prep', 'O'),
 ('een', 'Art', 'O'),
 ('communicatiebureau', 'N', 'O'),
 ('dat', 'Conj', 'O'),
 ('Floralux', 'N', 'B-ORG'),
 ('inhuurde', 'V', 'O'),
 ('.', 'Punc', 'O')]
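
The BIO convention marks the first token of an entity with a `B-` prefix, any following tokens of the same entity with `I-`, and all other tokens with `O`. As a minimal sketch of how such labels group into entities, the helper below collects entity spans from a label sequence. The name `bio_to_spans` is introduced here for illustration and is not part of NLTK or `sklearn-crfsuite`; the example sentence is made up and labelled by hand.

```python
def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans; `end` is exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # An I- label of the currently open type simply extends that entity.
        if label.startswith("I-") and label[2:] == current:
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new entity; a stray I- is treated leniently as a start too.
        if label != "O":
            start, current = i, label[2:]
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Hypothetical sentence, labelled by hand for this illustration.
tokens = ["Gisteren", "was", "Jan", "Peeters", "in", "Gent"]
labels = ["O", "O", "B-PER", "I-PER", "O", "B-LOC"]

spans = bio_to_spans(labels)
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in spans]
print(entities)
```

Treating a stray `I-` label as the start of an entity is one possible choice; a stricter variant could discard such labels instead.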

## Feature Extraction

Whereas today neural networks are expected to learn the relevant features of the input texts themselves, this is very different with Conditional Random Fields. CRFs learn the relationship between the features we give them and the label of a token in a given context. They're not going to learn these features themselves. Instead, the quality of the model will depend highly on the relevance of the features we show it.

The most important method in this tutorial is therefore the one that collects the features for every token. What information could be useful? The word itself, of course, together with its part-of-speech tag. It can also be interesting to know whether the word is completely uppercase, whether it starts with a capital or is a digit. In addition, we take a look at the character bigram and trigram the word ends with, and at the word cluster the word belongs to, which the code below looks up in a tab-separated file (`data/clusters_nl.tsv`). Finally, we give every token a `bias` feature, which always has the same value. This bias feature helps the CRF learn the relative frequency of each label type in the training data.

Apart from the token itself, we also want the CRF to look at its context. More specifically, we're going to give it some extra information about the two words to the left and the right of the target word. We'll tell the CRF what these words are, whether they start with a capital or are completely uppercase, and what their part-of-speech tag is. If there is no left or right context, we'll tell the CRF that the token is at the beginning or end of the sentence (`BOS` or `EOS`).

In [8]:
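def clusters(cluster_file):
    """Read a tab-separated file of `word<TAB>cluster` lines into a dict."""
    word2cluster = {}
    # Assumes the file is UTF-8, so Dutch characters such as "ë" are read consistently.
    with open(cluster_file, encoding="utf-8") as f:
        for line in f:
            word, cluster = line.strip().split('\t')
            word2cluster[word] = cluster
    return word2cluster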

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        # Look up the cluster first, so unknown words yield 'word.cluster=0' rather than a bare "0".
        'word.cluster=%s' % word2cluster.get(word.lower(), "0"),
        'postag=' + postag
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1
        ])
    else:
        features.append('BOS')

    if i > 1: 
        word2 = sent[i-2][0]
        postag2 = sent[i-2][1]
        features.extend([
            '-2:word.lower=' + word2.lower(),
            '-2:word.istitle=%s' % word2.istitle(),
            '-2:word.isupper=%s' % word2.isupper(),
            '-2:postag=' + postag2
        ])        

        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1
        ])
    else:
        features.append('EOS')

    if i < len(sent)-2:
        word2 = sent[i+2][0]
        postag2 = sent[i+2][1]
        features.extend([
            '+2:word.lower=' + word2.lower(),
            '+2:word.istitle=%s' % word2.istitle(),
            '+2:word.isupper=%s' % word2.isupper(),
            '+2:postag=' + postag2
        ])

        
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

word2cluster = clusters("data/clusters_nl.tsv")

In [9]:
sent2features(train_sents[0])[0]

['bias',
 'word.lower=de',
 'word[-3:]=De',
 'word[-2:]=De',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'word.cluster=38',
 'postag=Art',
 'BOS',
 '+1:word.lower=tekst',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag=N',
 '+2:word.lower=van',
 '+2:word.istitle=False',
 '+2:word.isupper=False',
 '+2:postag=Prep']

In [10]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_dev = [sent2features(s) for s in dev_sents]
y_dev = [sent2labels(s) for s in dev_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

We now create a CRF model and train it. We'll use the standard [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm for our parameter estimation and run it for 100 iterations. When we're done, we save the model with `joblib`.

In [11]:
crf = crfsuite.CRF(
    verbose='true',
    algorithm='lbfgs',
    max_iterations=100
)

crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)


Holdout group: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 152117
Seconds required: 2.626

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=1.03  loss=104214.83 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.00
Iter 2   time=0.58  loss=96997.81 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.13
Iter 3   time=0.59  loss=92085.38 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.26
Iter 4   time=0.58  loss=84277.67 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.51
Iter 5   time=0.74  loss=67577.53 active=15

Iter 59  time=0.67  loss=10836.31 active=152117 precision=0.773  recall=0.713  F1=0.736  Acc(item/seq)=0.969 0.781  feature_norm=43.74
Iter 60  time=0.66  loss=10712.24 active=152117 precision=0.779  recall=0.719  F1=0.744  Acc(item/seq)=0.970 0.788  feature_norm=44.41
Iter 61  time=0.55  loss=10602.81 active=152117 precision=0.789  recall=0.709  F1=0.740  Acc(item/seq)=0.970 0.789  feature_norm=44.68
Iter 62  time=0.56  loss=10508.84 active=152117 precision=0.782  recall=0.711  F1=0.739  Acc(item/seq)=0.970 0.787  feature_norm=45.37
Iter 63  time=0.56  loss=10458.88 active=152117 precision=0.783  recall=0.717  F1=0.744  Acc(item/seq)=0.970 0.788  feature_norm=45.51
Iter 64  time=0.56  loss=10420.78 active=152117 precision=0.763  recall=0.711  F1=0.730  Acc(item/seq)=0.969 0.786  feature_norm=45.67
Iter 65  time=0.56  loss=10315.28 active=152117 precision=0.766  recall=0.721  F1=0.735  Acc(item/seq)=0.969 0.786  feature_norm=46.30
Iter 66  time=0.56  loss=10204.10 active=152117 precisi

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=None, averaging=None, c=None, c1=None, c2=None,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose='true')

In [12]:
import joblib
import os

OUTPUT_PATH = "models/ner/"
OUTPUT_FILE = "crf_model"

# makedirs also creates the parent "models" directory and does nothing if the path already exists.
os.makedirs(OUTPUT_PATH, exist_ok=True)

joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))

['models/ner/crf_model']

## Evaluation

Let's evaluate the output of our CRF. We'll load the model from the output file above and have it predict labels for the full test set.

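As a sanity check, let's take a look at its predictions for the first test sentence. This output looks pretty good: the CRF predicts all four locations correctly. Its only mistake is that it labels the person name "Der Kaiser" as a miscellaneous entity (`B-MISC I-MISC`) instead of a person (`B-PER I-PER`).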

In [13]:
crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))
y_pred = crf.predict(X_test)

example_sent = test_sents[0]

print("Sentence:", ' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(crf.predict([sent2features(example_sent)])[0]))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .
Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O
Correct:   O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O
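
For longer sentences it is easier to list only the tokens where prediction and gold label disagree than to compare the two label rows by eye. The sketch below does this for the sentence above, with the tokens and labels copied from the printed output; `label_mismatches` is a helper name made up for this illustration.

```python
tokens = ("Dat is in Italië , Spanje of Engeland misschien geen probleem , "
          "maar volgens ' Der Kaiser ' in Duitsland wel .").split()
predicted = ("O O O B-LOC O B-LOC O B-LOC O O O O O O O "
             "B-MISC I-MISC O O B-LOC O O").split()
correct = ("O O O B-LOC O B-LOC O B-LOC O O O O O O O "
           "B-PER I-PER O O B-LOC O O").split()


def label_mismatches(tokens, predicted, correct):
    """Return (token, predicted_label, correct_label) for every disagreement."""
    return [
        (tok, pred, gold)
        for tok, pred, gold in zip(tokens, predicted, correct)
        if pred != gold
    ]


mismatches = label_mismatches(tokens, predicted, correct)
for tok, pred, gold in mismatches:
    print(f"{tok}: predicted {pred}, correct {gold}")
```

With the notebook's own objects, the same helper could be called as `label_mismatches(sent2tokens(s), crf.predict([sent2features(s)])[0], sent2labels(s))` for any sentence `s`.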


Now we evaluate on the full test set. We'll print out a classification report for all labels except `O`. If we were to include `O`, which far outnumbers the entity labels in our data, the average scores would be inflated artificially, simply because there's an inherently high probability that the `O` labels from our CRF are correct.
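
To make the inflation effect concrete, here is a small self-contained sketch with made-up labels. It computes a token-level micro-averaged F1-score in plain Python, once with `O` included and once without. The `micro_f1` function is a simplified stand-in written for this illustration, not the implementation behind `metrics.flat_classification_report`.

```python
def micro_f1(y_true, y_pred, labels):
    """Token-level micro-averaged F1, counting only the given labels."""
    labels = set(labels)
    tp = fp = fn = 0
    for gold, pred in zip(y_true, y_pred):
        if pred == gold and pred in labels:
            tp += 1
        else:
            if pred in labels:
                fp += 1  # predicted one of the labels, but wrongly
            if gold in labels:
                fn += 1  # missed a gold label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up data: eight O tokens, one location found, one person missed.
gold = ["O"] * 8 + ["B-LOC", "B-PER"]
pred = ["O"] * 8 + ["B-LOC", "O"]

with_o = micro_f1(gold, pred, ["O", "B-LOC", "B-PER"])
without_o = micro_f1(gold, pred, ["B-LOC", "B-PER"])
print(f"with O: {with_o:.3f}, without O: {without_o:.3f}")
```

The model in this toy example finds only one of the two entities, yet the score that includes `O` is much higher, because the eight easy `O` tokens dominate the counts.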

In [14]:
labels = list(crf.classes_)
labels.remove("O")
y_pred = crf.predict(X_test)
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)

print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

       B-LOC       0.83      0.83      0.83       774
       I-LOC       0.29      0.41      0.34        49
      B-MISC       0.84      0.61      0.71      1187
      I-MISC       0.59      0.42      0.49       410
       B-ORG       0.80      0.69      0.74       882
       I-ORG       0.74      0.66      0.70       551
       B-PER       0.80      0.90      0.85      1098
       I-PER       0.87      0.95      0.91       807

   micro avg       0.80      0.74      0.77      5758
   macro avg       0.72      0.68      0.70      5758
weighted avg       0.80      0.74      0.76      5758



Now we can also look at the most likely transitions the CRF has identified, and at the top features for every label. We'll do this with the `eli5` library, which helps us explain the predictions of machine learning models.

The top transitions are quite intuitive: the most likely transitions are those within the same entity type (from a B-label or I-label to the I-label of that type), and those where a B-label follows an O-label.

The features, too, make sense. For example, if a word does not start with an uppercase letter, it is unlikely to be an entity. By contrast, a word is very likely to be a location if it ends in `ië`, which is indeed a very common suffix for locations in Dutch.

In [42]:
import eli5

eli5.show_weights(crf, top=30)

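| From \ To | O | B-LOC | I-LOC | B-MISC | I-MISC | B-ORG | I-ORG | B-PER | I-PER |
|---|---|---|---|---|---|---|---|---|---|
| O | 4.135 | 4.248 | 0.0 | 4.058 | 0.0 | 4.146 | 0.0 | 3.366 | 0.0 |
| B-LOC | -0.431 | -0.356 | 6.741 | 0.0 | 0.0 | 0.0 | 0.0 | -0.902 | 0.0 |
| I-LOC | -1.242 | -0.249 | 6.054 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| B-MISC | -0.723 | 0.688 | 0.0 | -0.365 | 6.325 | 0.7 | 0.0 | 0.7 | 0.0 |
| I-MISC | -1.456 | 0.0 | 0.0 | -0.372 | 6.419 | 1.266 | 0.0 | -0.65 | 0.0 |
| B-ORG | -0.444 | 0.0 | 0.0 | -0.724 | 0.0 | 0.0 | 6.68 | 0.179 | 0.0 |
| I-ORG | -0.824 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.488 | 0.22 | 0.0 |
| B-PER | 0.18 | -0.429 | 0.0 | -0.594 | 0.0 | 0.0 | 0.0 | -1.134 | 7.566 |
| I-PER | -0.157 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.38 |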
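
The transition weights can also be ranked in code rather than read from the table. A fitted `sklearn-crfsuite` model exposes them through its `transition_features_` attribute, a dictionary that maps `(from_label, to_label)` pairs to weights. The sketch below is written under that assumption: it uses a small hand-copied subset of the weights shown above instead of the fitted model, and `top_transitions` is a helper name made up here.

```python
def top_transitions(transition_features, n=3, most_likely=True):
    """Return the n highest (or lowest) weighted label transitions."""
    ranked = sorted(
        transition_features.items(),
        key=lambda item: item[1],
        reverse=most_likely,
    )
    return ranked[:n]


# A few weights copied by hand from the transition table above.
example_weights = {
    ("B-PER", "I-PER"): 7.566,
    ("B-LOC", "I-LOC"): 6.741,
    ("B-ORG", "I-ORG"): 6.68,
    ("O", "B-LOC"): 4.248,
    ("B-PER", "B-PER"): -1.134,
    ("I-LOC", "O"): -1.242,
    ("I-MISC", "O"): -1.456,
}

likely = top_transitions(example_weights, n=3)
unlikely = top_transitions(example_weights, n=3, most_likely=False)

for (label_from, label_to), weight in likely + unlikely:
    print(f"{label_from:7} -> {label_to:7} {weight:+.3f}")
```

On the real model, passing `crf.transition_features_` instead of `example_weights` should give the full ranking.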

Top features for `O`:

| Weight | Feature |
|---|---|
| +3.733 | word.istitle=False |
| +3.331 | word.isupper=False |
| +2.205 | word[-2:]=ag |
| +2.142 | -1:word.lower=" |
| +2.126 | word[-3:]=dag |
| +1.978 | BOS |
| +1.776 | -1:word.lower=+ |
| +1.767 | postag=Pron |
| +1.733 | word.lower=u |
| +1.726 | word[-2:]=U |

Top features for `B-LOC`:

| Weight | Feature |
|---|---|
| +3.153 | word[-2:]=ië |
| +2.072 | -1:word.lower=in |
| +2.053 | word.lower=brussel |
| +2.013 | word.lower=antwerpen |
| +1.873 | word[-2:]=VS |
| +1.873 | word[-3:]=VS |
| +1.873 | word.lower=vs |
| +1.867 | -1:word.lower=uit |
| +1.654 | word.lower=gent |
| +1.628 | word[-3:]=sel |

Top features for `I-LOC`:

| Weight | Feature |
|---|---|
| +1.572 | +2:word.lower=m |
| +1.150 | -1:word.lower=col |
| +1.062 | word[-2:]=rk |
| +0.982 | -1:word.lower=san |
| +0.960 | word[-2:]=al |
| +0.884 | word.lower=york |
| +0.884 | word[-3:]=ork |
| +0.882 | word[-3:]=eum |
| +0.846 | -1:word.istitle=False |
| +0.807 | word.lower=staten |

Top features for `B-MISC`:

| Weight | Feature |
|---|---|
| +2.316 | word.lower=sport |
| +2.128 | +2:word.lower=1 |
| +2.009 | word[-2:]=se |
| +2.004 | word[-2:]='s |
| +1.922 | word.lower=buitenland |
| +1.786 | word[-3:]=nse |
| +1.678 | postag=Adj |
| +1.638 | word.lower=journaal |
| +1.621 | word.lower=belgische |
| +1.577 | word.lower=tobin-taks |

Top features for `I-MISC`:

| Weight | Feature |
|---|---|
| +1.892 | -2:word.lower=ronde |
| +1.861 | -1:word.isupper=True |
| +1.772 | -1:word.lower=ronde |
| +1.709 | word.lower=leven |
| +1.376 | -1:word.istitle=True |
| +1.304 | -1:word.lower=euro |
| +1.274 | +1:word.lower=ned |
| +1.261 | word[-2:]=00 |
| +1.209 | -1:word.lower=financiële |
| +1.142 | word.lower=2000 |

Top features for `B-ORG`:

| Weight | Feature |
|---|---|
| +2.568 | word[-3:]=com |
| +2.033 | word.lower=quizpeople |
| +1.985 | word[-3:]=ple |
| +1.916 | +1:word.lower=morgen |
| +1.635 | BOS |
| +1.631 | word[-2:]=ga |
| +1.623 | -2:word.lower=minister |
| +1.576 | word.isupper=True |
| +1.563 | word[-3:]=bel |
| +1.520 | word.lower=agalev |

Top features for `I-ORG`:

| Weight | Feature |
|---|---|
| +1.620 | word.lower=morgen |
| +1.368 | -1:word.lower=radio |
| +1.314 | word[-3:]=gen |
| +1.171 | -1:word.lower=vlaams |
| +1.164 | word[-3:]=ion |
| +1.026 | -1:word.lower=ned |
| +0.967 | -1:word.lower=europese |
| +0.895 | -2:word.lower=voor |
| +0.887 | word[-2:]=3 |
| +0.887 | word.lower=3 |

Top features for `B-PER`:

| Weight | Feature |
|---|---|
| +1.757 | postag=V |
| +1.678 | word.lower=bode |
| +1.540 | +2:word.lower=-- |
| +1.509 | +1:word.lower=( |
| +1.465 | word[-3:]=par |
| +1.432 | word.lower=laurent |
| +1.425 | word.lower=strupar |
| +1.416 | word.lower=vriens |
| +1.416 | word[-2:]=ff |
| +1.398 | word.istitle=True |

Top features for `I-PER`:

| Weight | Feature |
|---|---|
| +1.519 | -1:word.lower=van |
| +1.157 | +1:word.lower=( |
| +1.132 | word.lower=gucht |
| +1.068 | word.lower=grauwe |
| +1.046 | +2:word.lower=die |
| +1.037 | word[-3:]=uwe |
| +1.031 | word[-2:]=we |
| +1.002 | word.lower=shenfu |
| +1.002 | -1:word.lower=pu |
| +1.002 | word[-3:]=nfu |


## Finding the optimal hyperparameters

So far we've trained a model with the default parameters. It's unlikely that these will give us the best performance possible. Therefore we're going to search automatically for the best hyperparameter settings by iteratively training different models and evaluating them. Eventually we'll pick the best one.

Here we'll focus on two parameters: `c1` and `c2`. These are the parameters for L1 and L2 regularization, respectively. Regularization prevents overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is the sum of the absolute values of the weights; in L2 regularization, it is the sum of the squared weights. L1 regularization performs a type of feature selection, as it assigns 0 weight to irrelevant features. L2 regularization, by contrast, makes the weight of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 is called the Ridge method, and the linear combination of both is called Elastic Net regularization.
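
As a rough illustration of the two penalty terms, the sketch below computes them for a small, made-up weight vector. It follows the description above (`c1` times the sum of absolute weights, plus `c2` times the sum of squared weights). The exact scaling CRFsuite applies internally is not shown in this notebook, so treat this as the general elastic-net form rather than the library's code.

```python
def elastic_net_penalty(weights, c1, c2):
    """Return (l1_term, l2_term, total) for a list of feature weights."""
    l1_term = c1 * sum(abs(w) for w in weights)
    l2_term = c2 * sum(w * w for w in weights)
    return l1_term, l2_term, l1_term + l2_term


# Made-up weights: one feature is already switched off (weight 0).
weights = [1.0, -2.0, 0.0]

l1_term, l2_term, total = elastic_net_penalty(weights, c1=0.5, c2=0.1)
print(f"L1 term: {l1_term}, L2 term: {l2_term}, total penalty: {total}")
```

Note how the L2 term punishes the large weight (-2.0) more than proportionally, while the L1 term grows linearly with every non-zero weight, which is what pushes small weights all the way to zero.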

We define the parameter space for `c1` and `c2` and use the flat F1-score to compare the individual models. We use a randomized search, which means we don't try out every possible parameter setting. Instead, the process samples 50 (`n_iter`) candidate settings at random from the distributions we've specified in the parameter space, and scores each candidate with three-fold cross-validation. This process takes a while, but it's worth the wait.
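
To get a feeling for what a randomized search draws, the sketch below samples candidate settings from two exponential distributions using only the standard library. It mimics the sampling step, not `RandomizedSearchCV` itself. The scales 0.5 and 0.05 mirror the parameter space used in this notebook; the function name and the fixed seed are choices made for this illustration.

```python
import random


def sample_candidates(n_iter=50, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) settings from exponential distributions.

    random.expovariate takes a rate, which is 1 / scale, so the mean
    of each draw equals the given scale.
    """
    rng = random.Random(seed)
    return [
        {
            "c1": rng.expovariate(1.0 / c1_scale),
            "c2": rng.expovariate(1.0 / c2_scale),
        }
        for _ in range(n_iter)
    ]


candidates = sample_candidates()
for candidate in candidates[:3]:
    print(candidate)
```

Because the exponential distribution puts most of its mass near zero, the search mostly tries weak regularization and only occasionally a strong one.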

In [33]:
import scipy.stats  # a bare "import scipy" does not guarantee that scipy.stats is loaded
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)

params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 11.5min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 46.6min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=None, c2=None,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error...e,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=-1,
          param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a49263160>, 'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a4df2ce48>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn',
          scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-MISC', 'I-ORG', 'I-LOC']),
          verbose=1)

Let's take a look at the best hyperparameter settings. Our random search got the best performance with `c1` set to around 0.073 and `c2` set to around 0.017.

In [34]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.07252549611546448, 'c2': 0.016841709573397014}
best CV score: 0.7333391320135255
model size: 1.37M


To find out what precision, recall and F1-score this translates to, we take the best estimator from our random search and evaluate it on the test set. The tuned model reaches a micro-averaged F1-score of 77.1%, against the 0.77 in the report for our initial model, so in the reports shown here the overall gain is modest. The picture per label is mixed: I-LOC, B-ORG and I-ORG improve, while B-LOC, B-PER and I-PER drop slightly.

In [38]:
best_crf = rs.best_estimator_
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.849     0.789     0.818       774
       I-LOC      0.404     0.429     0.416        49
      B-MISC      0.841     0.626     0.718      1187
      I-MISC      0.609     0.407     0.488       410
       B-ORG      0.821     0.706     0.759       882
       I-ORG      0.802     0.661     0.724       551
       B-PER      0.776     0.880     0.825      1098
       I-PER      0.835     0.954     0.891       807

   micro avg      0.803     0.741     0.771      5758
   macro avg      0.742     0.682     0.705      5758
weighted avg      0.802     0.741     0.764      5758

