In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [6]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
import pycrfsuite
import sklearn_crfsuite
from sklearn_crfsuite import metrics, scorers
#from sklearn.metrics import accuracy_score as scorers



## Let's use CoNLL 2002 data to build a NER system

The CoNLL 2002 corpus is available in NLTK. We use the Spanish data.

In [7]:
import nltk
nltk.download('conll2002')
#nltk.corpus.conll2002.fileids()

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\moha\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


True

In [29]:
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

Wall time: 1.61 s


In [31]:
train_sents=train_sents[0:4000]
test_sents=test_sents[0:4000]

In [32]:
len(train_sents)  # "sents" is short for "sentences"

4000

In [33]:
type(train_sents)

list

In [62]:
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
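
Each token above carries a POS tag and an IOB label: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. As a rough sketch of how to read such a sentence back into entities, the hypothetical helper `iob_to_spans` below (not part of NLTK or sklearn-crfsuite) groups tokens into `(entity_type, text)` pairs. It assumes a stray `I-` tag should simply start a new entity.

```python
def iob_to_spans(sent):
    """Collect (entity_type, text) pairs from (token, postag, iob_label) triples."""
    spans = []
    current_type, current_tokens = None, []
    for token, _postag, label in sent:
        starts_new = label.startswith('B-') or (
            label.startswith('I-') and label[2:] != current_type
        )
        if starts_new:
            # close the open entity, if any, and start a new one
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith('I-'):
            # continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # 'O' closes any open entity
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


example = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'),
           ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'),
           ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'),
           ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
print(iob_to_spans(example))
```

For the first training sentence this yields two locations and one organization.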

## Features

Next, define some features. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used. 

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.

sklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.

In [34]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [57]:
# sent and i must be bound to real values first;
# here we take the first token of the first training sentence
word2features(train_sents[0], 0)

This is what word2features extracts:

In [35]:
sent2features(train_sents[0])[0]

{'bias': 1.0,
 'word.lower()': 'melbourne',
 'word[-3:]': 'rne',
 'word[-2:]': 'ne',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'NP',
 'postag[:2]': 'NP',
 'BOS': True,
 '+1:word.lower()': '(',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'Fpa',
 '+1:postag[:2]': 'Fp'}
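
The dict above captures casing only through `isupper()` and `istitle()`. One possible extension, sketched here under our own assumptions, is a compressed "word shape" feature. The helper name `word_shape` and the feature key `word.shape` are hypothetical and not part of sklearn-crfsuite.

```python
def word_shape(word):
    """Map characters to classes (X, x, d) and collapse repeated classes."""
    shape = []
    for ch in word:
        if ch.isupper():
            cls = 'X'
        elif ch.islower():
            cls = 'x'
        elif ch.isdigit():
            cls = 'd'
        else:
            cls = ch  # keep punctuation and other symbols as-is
        # append only when the class changes, so 'Melbourne' becomes 'Xx'
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return ''.join(shape)


for w in ['Melbourne', 'EFE', '25', 'PSOE-Progresistas']:
    print(w, '->', word_shape(w))
```

To try it, one could add `'word.shape': word_shape(word)` to the `features` dict in `word2features` and re-run feature extraction and training; whether it helps on this data is something to check experimentally.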

Extract features from the data:

In [36]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Wall time: 530 ms


## Training

To see all possible CRF parameters, check its docstring. Here we use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.

In [37]:
%%time

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

Wall time: 12.6 s
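
To get a feel for what `c1` and `c2` control, here is a minimal sketch of an elastic-net penalty on a toy weight vector. It assumes the usual form, `c1 * ||w||_1 + c2 * ||w||_2^2`, added to the negative log-likelihood; the function name and the toy weights are ours, and the exact scaling used inside CRFsuite should be checked against its documentation.

```python
def elastic_net_penalty(weights, c1, c2):
    """Return c1 * ||w||_1 + c2 * ||w||_2^2 for a flat list of feature weights."""
    l1 = sum(abs(w) for w in weights)   # pushes individual weights to exactly zero
    l2 = sum(w * w for w in weights)    # shrinks large weights more strongly
    return c1 * l1 + c2 * l2


toy_weights = [2.0, -1.0, 0.0, 0.5]  # made-up values, for illustration only
print(elastic_net_penalty(toy_weights, c1=0.1, c2=0.1))
print(elastic_net_penalty(toy_weights, c1=1.0, c2=0.0))  # pure L1
print(elastic_net_penalty(toy_weights, c1=0.0, c2=1.0))  # pure L2
```

Larger `c1` tends to produce sparser (smaller) models, while larger `c2` mostly shrinks weights without zeroing them.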


## Evaluation

There are many more O labels in the data set than entity labels, but we're more interested in the entities. To account for this, we'll use an averaged F1 score computed over all labels except O. The ``sklearn_crfsuite.metrics`` module provides some useful metrics for sequence classification tasks, including this one.

In [38]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']

In [39]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, 
                      average='weighted', labels=labels)

0.7410293084158585
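
To make the metric less of a black box, here is a pure-Python sketch of a support-weighted F1 over a chosen set of labels, computed on flattened sequences. It is an illustrative re-implementation under our own assumptions (the name `flat_weighted_f1` is hypothetical), not the code that `metrics.flat_f1_score` actually runs.

```python
from collections import Counter
from itertools import chain


def flat_weighted_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-label F1, restricted to `labels`."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed the true label t
    total_support = sum(tp[l] + fn[l] for l in labels)
    if total_support == 0:
        return 0.0
    score = 0.0
    for l in labels:
        denom = 2 * tp[l] + fp[l] + fn[l]
        f1 = 2 * tp[l] / denom if denom else 0.0
        score += f1 * (tp[l] + fn[l])  # weight by number of true occurrences
    return score / total_support


# toy data, not taken from the corpus
toy_true = [['B-LOC', 'O', 'B-ORG'], ['O', 'B-LOC']]
toy_pred = [['B-LOC', 'O', 'O'], ['B-LOC', 'B-LOC']]
print(flat_weighted_f1(toy_true, toy_pred, labels=['B-LOC', 'B-ORG']))
```

Because 'O' is left out of `labels`, correctly predicted 'O' tokens do not inflate the score, although wrong predictions on 'O' tokens still count as false positives for the entity labels.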

Inspect per-class results in more detail:

In [40]:
# group B and I results
sorted_labels = sorted(
    labels, 
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.753     0.750     0.752      1084
       I-LOC      0.548     0.471     0.507       325
      B-MISC      0.583     0.434     0.497       339
      I-MISC      0.568     0.478     0.519       557
       B-ORG      0.797     0.781     0.789      1400
       I-ORG      0.797     0.766     0.782      1104
       B-PER      0.800     0.861     0.830       735
       I-PER      0.858     0.924     0.890       634

   micro avg      0.758     0.734     0.746      6178
   macro avg      0.713     0.683     0.696      6178
weighted avg      0.751     0.734     0.741      6178



## Hyperparameter Optimization

To improve quality, try selecting the regularization parameters using randomized search and 3-fold cross-validation.

It takes quite a lot of CPU time and RAM (we're fitting a model ``50 * 3 = 150`` times), so grab a tea and be patient, or reduce n_iter in RandomizedSearchCV, or fit the model only on a subset of the training data.

In [41]:
%%time
# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    max_iterations=100, 
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score, 
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space, 
                        cv=3, 
                        verbose=1, 
                        n_jobs=-1, 
                        n_iter=50, 
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 18.7min finished


Wall time: 18min 54s


Best result:

In [42]:
# crf = rs.best_estimator_
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.0034985964210545823, 'c2': 0.044106226635262986}
best CV score: 0.7432078715482601
model size: 1.44M
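
The search draws `c1` and `c2` from exponential distributions with means 0.5 and 0.05, so small values are sampled most often. The sketch below mimics that sampling with the standard library only, using `random.expovariate` as a stand-in for `scipy.stats.expon(scale=...)`; the helper `sample_candidates` and the seed are our own choices, and the draws will not match the ones RandomizedSearchCV made.

```python
import random


def sample_candidates(n_iter, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) pairs from exponential distributions."""
    rng = random.Random(seed)
    # expovariate takes the rate (1 / scale), so each draw has mean == scale
    return [
        {'c1': rng.expovariate(1.0 / c1_scale),
         'c2': rng.expovariate(1.0 / c2_scale)}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(n_iter=50)
n_folds = 3
print('total fits:', len(candidates) * n_folds)
print('first candidate:', candidates[0])
print('smallest c1:', min(c['c1'] for c in candidates))
```

Lowering `n_iter` shrinks the number of fits proportionally, which is the cheapest way to cut the search time.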


### Check parameter space

A chart showing which ``c1`` and ``c2`` values RandomizedSearchCV checked. Red means better results, blue means worse.

In [44]:
# cv_results_ is a dict of arrays; 'params' holds one dict per sampled candidate
_x = [p['c1'] for p in rs.cv_results_['params']]
_y = [p['c2'] for p in rs.cv_results_['params']]
_c = list(rs.cv_results_['mean_test_score'])

fig = plt.figure()
fig.set_size_inches(12, 12)
ax = plt.gca()
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('C1')
ax.set_ylabel('C2')
ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(
    min(_c), max(_c)
))

# a blue-to-red colormap so that the colors match the description above
ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors='black', cmap='coolwarm')

print("Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c)))

## Check best estimator on our test data

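In this run the tuned model does not beat the baseline on the test set: the weighted-average F1 is 0.739, versus 0.741 before tuning, so the two are effectively on par.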

In [45]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.752     0.749     0.750      1084
       I-LOC      0.534     0.462     0.495       325
      B-MISC      0.579     0.445     0.503       339
      I-MISC      0.587     0.461     0.517       557
       B-ORG      0.793     0.785     0.789      1400
       I-ORG      0.798     0.755     0.776      1104
       B-PER      0.801     0.852     0.825       735
       I-PER      0.871     0.920     0.895       634

   micro avg      0.759     0.730     0.744      6178
   macro avg      0.714     0.679     0.694      6178
weighted avg      0.752     0.730     0.739      6178



## Let's check what the classifier learned

In [46]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-MISC -> I-MISC  9.434681
I-MISC -> I-MISC  8.305446
B-LOC  -> I-LOC   7.032559
B-PER  -> I-PER   6.343906
I-LOC  -> I-LOC   5.945340
B-ORG  -> I-ORG   5.448386
I-ORG  -> I-ORG   4.801332
I-PER  -> I-PER   4.544744
O      -> O       4.137090
O      -> B-ORG   2.258654
O      -> B-PER   1.557009
O      -> B-LOC   1.293416
O      -> B-MISC  0.942316
I-PER  -> B-LOC   0.634488
B-LOC  -> B-LOC   0.395998
B-MISC -> O       0.313307
B-MISC -> B-LOC   0.139787
B-ORG  -> B-LOC   -0.040442
B-ORG  -> O       -0.069276
B-MISC -> B-ORG   -0.201349

Top unlikely transitions:
I-LOC  -> B-PER   -2.113020
B-MISC -> I-ORG   -2.169655
I-PER  -> B-ORG   -2.383408
I-PER  -> I-ORG   -2.415725
I-MISC -> B-ORG   -2.468051
I-MISC -> B-LOC   -2.526295
I-LOC  -> I-ORG   -2.550440
B-LOC  -> I-ORG   -2.606115
I-PER  -> B-PER   -2.650260
I-LOC  -> B-LOC   -2.794816
I-ORG  -> I-PER   -2.820029
I-ORG  -> B-MISC  -2.955805
B-PER  -> B-PER   -3.106794
I-MISC -> I-ORG   -3.304232
I-ORG  -> B-LO

We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside organization name (I-ORG), but transitions to I-ORG from tokens with other labels are penalized.
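
The pattern the model learned mirrors the IOB convention itself. As a small sketch, the hypothetical helper below encodes that rule directly: an `I-X` label is only well-formed after `B-X` or `I-X`. It is written for illustration and is not something the CRF uses internally.

```python
def is_valid_iob_transition(prev_label, next_label):
    """I-X may only follow B-X or I-X; B-X and O may follow anything."""
    if not next_label.startswith('I-'):
        return True
    entity = next_label[2:]
    return prev_label in ('B-' + entity, 'I-' + entity)


for pair in [('B-ORG', 'I-ORG'), ('I-ORG', 'I-ORG'),
             ('O', 'I-ORG'), ('B-LOC', 'I-ORG')]:
    print('%-6s -> %-6s valid: %s' % (pair[0], pair[1], is_valid_iob_transition(*pair)))
```

Comparing this rule against the learned weights is a quick sanity check: transitions it rejects, such as `B-LOC -> I-ORG`, show up with negative weights in the list above.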

Check the state features:

In [47]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

Top positive:
8.599360 B-ORG    word.lower():psoe-progresistas
6.585475 B-ORG    word.lower():petrobras
6.217121 O        BOS
6.140465 B-ORG    word.lower():coag-extremadura
5.591812 B-ORG    -1:word.lower():distancia
5.573142 B-LOC    +1:word.lower():finalizaron
5.254197 B-ORG    +1:word.lower():plasencia
5.236096 B-MISC   word.lower():justicia
5.058713 B-MISC   word.lower():diversia
5.055395 B-ORG    word[-2:]:-e
4.882500 B-MISC   word.lower():exteriores
4.879345 B-MISC   word.lower():competencia
4.826606 O        word.lower():r.
4.826606 O        word[-3:]:R.
4.810205 O        word.lower():v
4.801424 B-ORG    word.lower():telefónica
4.771291 B-LOC    word.lower():estrecho
4.639619 B-ORG    word.lower():terra
4.637544 B-ORG    word.lower():esquerra
4.627735 B-LOC    word.lower():líbano
4.611645 O        word.lower():b
4.611645 O        word[-3:]:B
4.611645 O        word[-2:]:B
4.580382 B-MISC   word.lower():cc2305001730
4.580382 B-MISC   word[-3:]:730
4.553537 O        bias
4.549363 



Some observations:

   * **8.599360 B-ORG word.lower():psoe-progresistas** - the model remembered the names of some entities; maybe it is overfit, or maybe our features are not adequate, or maybe remembering is indeed helpful;
   * **4.636151 I-LOC -1:word.lower():calle** - "calle" means "street" in Spanish; the model learns that if the previous word was "calle", the token is likely part of a location;
   * **-5.632036 O word.isupper()**, **-8.215073 O word.istitle()** - UPPERCASED or TitleCased words are likely entities of some kind;
   * **-2.097561 O postag:NP** - proper nouns (NP is a proper noun in the Spanish tagset) are often entities.
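
Observations like these are easier to make when the weights are grouped per label. The sketch below assumes a dict shaped like `crf.state_features_`, that is `{(attribute, label): weight}`; the helper `top_features_by_label` is hypothetical, and the toy dict reuses a few weights from the output printed above.

```python
from collections import defaultdict


def top_features_by_label(state_features, n=2):
    """Return the n highest-weighted (weight, attribute) pairs for each label."""
    grouped = defaultdict(list)
    for (attr, label), weight in state_features.items():
        grouped[label].append((weight, attr))
    return {label: sorted(items, reverse=True)[:n]
            for label, items in grouped.items()}


toy_state_features = {
    ('word.lower():petrobras', 'B-ORG'): 6.585475,
    ('word.lower():psoe-progresistas', 'B-ORG'): 8.599360,
    ('word.lower():telefónica', 'B-ORG'): 4.801424,
    ('BOS', 'O'): 6.217121,
    ('bias', 'O'): 4.553537,
    ('word.lower():líbano', 'B-LOC'): 4.627735,
}
for label, items in top_features_by_label(toy_state_features).items():
    print(label, items)
```

With a trained model, passing `crf.state_features_` instead of the toy dict should give the same kind of per-label summary.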

## What to do next

   * Load 'testa' Spanish data.
   * Use it to develop better features and to find the best model parameters.
   * Apply the model to 'testb' data again.

The model in this notebook is just a starting point; you certainly can do better!

