[View in Colaboratory](https://colab.research.google.com/github/kjsr7/business_intelligence/blob/master/pycrf_conll2003.ipynb)

# Named Entity Recognition using python-crfsuite

In [36]:
!pip install python-crfsuite



In [37]:
!python --version

Python 3.6.6


In [38]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pycrfsuite

print(sklearn.__version__)

0.19.2


In [39]:
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.


True

## Let's use CoNLL 2003 data to build a NER system
NLTK provides corpus-processing functions for CoNLL 2002. We reuse these functions to read the CoNLL 2003 dataset.

In [40]:
nltk.corpus.conll2002.fileids()

['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']

In [0]:
!cp eng.* /root/nltk_data/corpora/conll2002/

## 1. Training data
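The CoNLL 2003 dataset contains English sentences annotated with named entities. As the sample below shows, the tags follow the IOB1 scheme rather than IOB2: an entity normally starts with an I- tag, and B- is used only to separate two adjacent entities of the same type. The data also provides POS tags.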

In [43]:
train_sents = list(nltk.corpus.conll2002.iob_sents('eng.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('eng.testb'))
len(train_sents)

14986

In [44]:
train_sents[0]

[('EU', 'NNP', 'I-ORG'),
 ('rejects', 'VBZ', 'O'),
 ('German', 'JJ', 'I-MISC'),
 ('call', 'NN', 'O'),
 ('to', 'TO', 'O'),
 ('boycott', 'VB', 'O'),
 ('British', 'JJ', 'I-MISC'),
 ('lamb', 'NN', 'O'),
 ('.', '.', 'O')]
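
To make the tag encoding concrete, here is a minimal sketch that rewrites a tag sequence so that every entity starts with a B- tag (the IOB2 convention). The helper name `iob1_to_iob2` is hypothetical and not part of NLTK or python-crfsuite; the notebook itself trains on the tags exactly as they appear in the data.

```python
def iob1_to_iob2(tags):
    """Return a copy of `tags` in which every entity starts with a B- tag."""
    converted = []
    prev = 'O'  # original tag of the previous token
    for tag in tags:
        if tag.startswith('I-') and (prev == 'O' or prev[2:] != tag[2:]):
            # An I- tag that follows 'O' or a different entity type
            # opens a new entity, so it becomes B-.
            converted.append('B-' + tag[2:])
        else:
            # 'O', an explicit B- tag, or a continuation of the same entity.
            converted.append(tag)
        prev = tag
    return converted


# Tags of the first training sentence shown above.
tags = ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
print(iob1_to_iob2(tags))
```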

## Feature Extraction
Next, define some features. In this example we use word identity, word suffixes, simple word-shape flags (upper case, title case, digits) and the POS tag, plus the same kind of information from the neighbouring words. This is a simple baseline; adding or removing features may improve the results, so experiment with it.

In [0]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')

    return features

In [0]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

In [0]:
def sent2labels(sent):
    return [label for token, postag, label in sent]

In [0]:
def sent2tokens(sent):
    return [token for token, postag, label in sent]

This is what word2features extracts:

In [49]:
sent2features(train_sents[0])[1]

['bias',
 'word.lower=rejects',
 'word[-3:]=cts',
 'word[-2:]=ts',
 'word.isupper=False',
 'word.istitle=False',
 'word.isdigit=False',
 'postag=VBZ',
 'postag[:2]=VB',
 '-1:word.lower=eu',
 '-1:word.istitle=False',
 '-1:word.isupper=True',
 '-1:postag=NNP',
 '-1:postag[:2]=NN',
 '+1:word.lower=german',
 '+1:word.istitle=True',
 '+1:word.isupper=False',
 '+1:postag=JJ',
 '+1:postag[:2]=JJ']
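
The features above describe word shape only through three boolean flags. As one possible extension, the sketch below computes an explicit shape string that could be added to the feature list, for example as `'word.shape=' + word_shape(word)`. Both `word_shape` and `collapsed_shape` are hypothetical helpers, not part of the original notebook, and whether they improve the scores has not been measured here.

```python
import re


def word_shape(word):
    """Map upper-case letters to 'X', lower-case letters to 'x', digits to 'd'."""
    shape = re.sub(r'[A-Z]', 'X', word)
    shape = re.sub(r'[a-z]', 'x', shape)
    shape = re.sub(r'[0-9]', 'd', shape)
    return shape


def collapsed_shape(word):
    """Same as word_shape, but with runs of the same symbol collapsed to one."""
    return re.sub(r'(.)\1+', r'\1', word_shape(word))


for w in ['EU', 'rejects', 'German', '1996-08-22']:
    print(w, word_shape(w), collapsed_shape(w))
```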

Extract the features from the data:

In [50]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

CPU times: user 1.66 s, sys: 327 ms, total: 1.99 s
Wall time: 1.99 s


In [0]:
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Train the model
To train the model, we create a pycrfsuite.Trainer, load the training data and call its train method. First, create the trainer and pass the training data to CRFsuite:

In [0]:
trainer = pycrfsuite.Trainer(verbose=False)

In [0]:
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Set the training parameters. We use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.

In [0]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
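
To see how the two coefficients interact, the sketch below computes an Elastic Net penalty for a toy weight vector. It assumes the penalty has the form c1 * ||w||_1 + c2 * ||w||_2^2 added to the training loss; the exact scaling used inside CRFsuite should be checked against its documentation. The function `elastic_net_penalty` is hypothetical and only illustrates the arithmetic.

```python
def elastic_net_penalty(weights, c1, c2):
    """Return c1 * sum(|w|) + c2 * sum(w^2) for a list of feature weights."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return c1 * l1_norm + c2 * l2_norm_squared


# Toy weights (assumed values, not taken from the trained model).
toy_weights = [1.0, -2.0, 0.0]
print(elastic_net_penalty(toy_weights, c1=1.0, c2=1e-3))
```

With c1 = 1.0 and c2 = 1e-3 as in the cell above, the L1 term dominates, which tends to push many weights to exactly zero and keeps the model sparse.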

Possible parameters for the default training algorithm:

In [55]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

Train the model:

In [0]:
# The file name mentions CoNLL 2002 Spanish, but the model saved here
# was trained on eng.train.
trainer.train('conll2002-esp.crfsuite')

We can also get information about the final state of the model from the trainer's logparser. If we had assigned the input sequences to groups with the optional group argument of append, and had passed the optional holdout argument to train, the log would also contain the trainer's performance on the holdout group.

In [57]:
trainer.logparser.last_iteration

{'active_features': 8902,
 'error_norm': 769.368767,
 'feature_norm': 108.543567,
 'linesearch_step': 1.0,
 'linesearch_trials': 1,
 'loss': 14546.233784,
 'num': 50,
 'scores': {},
 'time': 0.254}

We can also get this information for every iteration from trainer.logparser.iterations:

In [58]:
print(len(trainer.logparser.iterations), trainer.logparser.iterations[-1])

50 {'num': 50, 'scores': {}, 'loss': 14546.233784, 'feature_norm': 108.543567, 'error_norm': 769.368767, 'active_features': 8902, 'linesearch_trials': 1, 'linesearch_step': 1.0, 'time': 0.254}
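
The empty 'scores' entries above are what one would expect when no holdout group is used. A minimal sketch of the group/holdout mechanism follows. The helper `append_with_holdout` and the choice of sending every tenth sentence to group 1 are assumptions made for illustration; the pycrfsuite calls appear only in comments so that the block runs without the library installed.

```python
def append_with_holdout(trainer, X, y, holdout_every=10):
    """Append all sequences, putting every `holdout_every`-th one in group 1.

    `trainer` is expected to offer append(xseq, yseq, group), as
    pycrfsuite.Trainer does. Returns the number of sequences per group.
    """
    counts = {0: 0, 1: 0}
    for idx, (xseq, yseq) in enumerate(zip(X, y)):
        group = 1 if idx % holdout_every == holdout_every - 1 else 0
        trainer.append(xseq, yseq, group)
        counts[group] += 1
    return counts


# Intended usage with the objects defined earlier in this notebook:
#   trainer = pycrfsuite.Trainer(verbose=False)
#   append_with_holdout(trainer, X_train, y_train)
#   trainer.train('model-with-holdout.crfsuite', holdout=1)  # hypothetical file name
#   trainer.logparser.last_iteration['scores']
```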


## Make predictions
To use the trained model, create a pycrfsuite.Tagger, open the model and call its tag method:

In [0]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

example_sent = test_sents[0]

In [60]:
print(' '.join(sent2tokens(example_sent)), end='\n\n')

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .



In [61]:
print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Predicted: O O I-LOC O O O O I-LOC O O O O
Correct:   O O I-LOC O O O O I-PER O O O O


## Evaluate the model

In [0]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.

    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )


Predict entity labels for all sentences in our testing set ('testb' data):

In [0]:
y_pred = [tagger.tag(xseq) for xseq in X_test]

Check the result:

In [64]:
print(bio_classification_report(y_test, y_pred))

             precision    recall  f1-score   support

      B-LOC       0.00      0.00      0.00         6
      I-LOC       0.83      0.76      0.79      1919
     B-MISC       0.00      0.00      0.00         9
     I-MISC       0.74      0.73      0.74       909
      B-ORG       0.00      0.00      0.00         5
      I-ORG       0.73      0.74      0.74      2491
      I-PER       0.83      0.88      0.85      2773

avg / total       0.79      0.79      0.79      8112
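
The report above is computed per token. NER systems are also often scored per entity, where a prediction counts only if both the span boundaries and the type match. The sketch below is one way to do that; `extract_spans` and `span_prf` are hypothetical helpers, not the official CoNLL evaluation script, so their numbers may differ from it in edge cases.

```python
def extract_spans(tags):
    """Return a set of (start, end, type) entity spans; `end` is exclusive."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last span
        prefix, _, ttype = tag.partition('-')
        # A span starts at a B- tag or when the entity type changes.
        starts_new = tag != 'O' and (prefix == 'B' or ttype != etype)
        if etype is not None and (tag == 'O' or starts_new):
            spans.add((start, i, etype))
            start, etype = None, None
        if starts_new:
            start, etype = i, ttype
    return spans


def span_prf(y_true, y_pred):
    """Entity-level precision, recall and F1 over lists of tag sequences."""
    true_spans, pred_spans = set(), set()
    for n, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(n,) + span for span in extract_spans(true_tags)}
        pred_spans |= {(n,) + span for span in extract_spans(pred_tags)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# With the notebook's variables this would be: span_prf(y_test, y_pred)
```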





## Let's check what the classifier learned

In [0]:
from collections import Counter
info = tagger.info()

In [0]:
def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

In [0]:
print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

In [0]:
print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])
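
Counter is used in these cells only as a convenient way to sort a dict by value: most_common(k) returns the k largest weights, and slicing the full list from the end returns the smallest ones. A small sketch with made-up weights (the numbers below are hypothetical, not output of the trained model):

```python
from collections import Counter


def top_and_bottom(weights, k):
    """Return the k highest-weighted and k lowest-weighted items of a dict."""
    ranked = Counter(weights).most_common()  # sorted by weight, descending
    return ranked[:k], ranked[-k:]


# Hypothetical transition weights keyed by (label_from, label_to).
toy_transitions = {
    ('I-PER', 'I-PER'): 3.0,
    ('O', 'O'): 1.5,
    ('O', 'I-PER'): -0.5,
}
top, bottom = top_and_bottom(toy_transitions, 1)
print(top)
print(bottom)
```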

Check the state features:

In [0]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))

In [0]:
print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

In [0]:
print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])