In [36]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pycrfsuite

print(sklearn.__version__)

0.18.1


# Let's use CoNLL 2002 data to build a NER system

CoNLL2002 corpus is available in NLTK. We use Spanish data.

In [37]:
nltk.corpus.conll2002.fileids()

[u'esp.testa',
 u'esp.testb',
 u'esp.train',
 u'ned.testa',
 u'ned.testb',
 u'ned.train']

In [75]:
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
test_sents_2 = list(nltk.corpus.conll2002.iob_sents('esp.testa'))

CPU times: user 1.93 s, sys: 28 ms, total: 1.96 s
Wall time: 1.96 s


Data format:

In [39]:
train_sents[0]

[(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),
 (u'25', u'Z', u'O'),
 (u'may', u'NC', u'O'),
 (u'(', u'Fpa', u'O'),
 (u'EFE', u'NC', u'B-ORG'),
 (u')', u'Fpt', u'O'),
 (u'.', u'Fp', u'O')]

## Features

Next, define some features. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used. 

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.

In [40]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]    

This is what word2features extracts:

In [41]:
sent2features(train_sents[0])[0]

['bias',
 u'word.lower=melbourne',
 u'word[-3:]=rne',
 u'word[-2:]=ne',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 u'postag=NP',
 u'postag[:2]=NP',
 'BOS',
 u'+1:word.lower=(',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 u'+1:postag=Fpa',
 u'+1:postag[:2]=Fp']

Extract the features from the data:

In [76]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

X_test_2 = [sent2features(s) for s in test_sents_2]
y_test_2 = [sent2labels(s) for s in test_sents_2]

CPU times: user 7.02 s, sys: 504 ms, total: 7.52 s
Wall time: 7.43 s


## Train the model

To train the model, we create pycrfsuite.Trainer, load the training data and call 'train' method. 
First, create pycrfsuite.Trainer and load the training data to CRFsuite:

In [43]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 4.61 s, sys: 12 ms, total: 4.62 s
Wall time: 4.59 s


Set training parameters. We will use L-BFGS training algorithm (it is default) with Elastic Net (L1 + L2) regularization.

In [44]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

Possible parameters for the default training algorithm:

In [45]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

Train the model:

In [46]:
%%time
trainer.train('conll2002-esp.crfsuite')

CPU times: user 18 s, sys: 0 ns, total: 18 s
Wall time: 18 s


trainer.train saves model to a file:

In [47]:
!ls -lh ./conll2002-esp.crfsuite

-rw-rw-r-- 1 matthieu matthieu 601K mars  17 22:01 ./conll2002-esp.crfsuite


We can also get information about the final state of the model by looking at the trainer's logparser. If we had tagged our input data using the optional group argument in add, and had used the optional holdout argument during train, there would be information about the trainer's performance on the holdout set as well. 

In [48]:
trainer.logparser.last_iteration

{'active_features': 11346,
 'error_norm': 1262.912078,
 'feature_norm': 79.110017,
 'linesearch_step': 1.0,
 'linesearch_trials': 1,
 'loss': 14807.577946,
 'num': 50,
 'scores': {},
 'time': 0.368}

We can also get this information for every step using trainer.logparser.iterations

In [49]:
print len(trainer.logparser.iterations), trainer.logparser.iterations[-1]

50 {'loss': 14807.577946, 'error_norm': 1262.912078, 'linesearch_trials': 1, 'active_features': 11346, 'num': 50, 'time': 0.368, 'scores': {}, 'linesearch_step': 1.0, 'feature_norm': 79.110017}


## Make predictions

To use the trained model, create pycrfsuite.Tagger, open the model and use "tag" method:

In [50]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

<contextlib.closing at 0x7f4de43f1750>

Let's tag a sentence to see how it works:

In [51]:
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)) + '\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

La Coruña , 23 may ( EFECOM ) .


('Predicted:', 'B-LOC I-LOC O O O O B-ORG O O')
('Correct:  ', u'B-LOC I-LOC O O O O B-ORG O O')


## Evaluate the model

In [52]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

Predict entity labels for all sentences in our testing set ('testb' Spanish data):

In [53]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 528 ms, sys: 12 ms, total: 540 ms
Wall time: 538 ms


..and check the result. Note this report is not comparable to results in CONLL2002 papers because here we check per-token results (not per-entity). Per-entity numbers will be worse.  

In [54]:
print(bio_classification_report(y_test, y_pred))

             precision    recall  f1-score   support

      B-LOC       0.78      0.75      0.76      1084
      I-LOC       0.66      0.60      0.63       325
     B-MISC       0.69      0.47      0.56       339
     I-MISC       0.61      0.49      0.54       557
      B-ORG       0.79      0.81      0.80      1400
      I-ORG       0.80      0.79      0.80      1104
      B-PER       0.82      0.87      0.84       735
      I-PER       0.87      0.93      0.90       634

avg / total       0.77      0.76      0.76      6178



## Let's check what classifier learned

In [55]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
B-ORG  -> I-ORG   8.631963
I-ORG  -> I-ORG   7.833706
B-PER  -> I-PER   6.998706
B-LOC  -> I-LOC   6.913675
I-MISC -> I-MISC  6.129735
B-MISC -> I-MISC  5.538291
I-LOC  -> I-LOC   4.983567
I-PER  -> I-PER   3.748358
B-ORG  -> B-LOC   1.727090
B-PER  -> B-LOC   1.388267
B-LOC  -> B-LOC   1.240278
O      -> O       1.197929
O      -> B-ORG   1.097062
I-PER  -> B-LOC   1.083332
O      -> B-MISC  1.046113

Top unlikely transitions:
I-PER  -> B-ORG   -2.056130
I-LOC  -> I-ORG   -2.143940
B-ORG  -> I-MISC  -2.167501
I-PER  -> I-ORG   -2.369380
B-ORG  -> I-PER   -2.378110
I-MISC -> I-PER   -2.458788
B-LOC  -> I-PER   -2.516414
I-ORG  -> I-MISC  -2.571973
I-LOC  -> B-PER   -2.697791
I-LOC  -> I-PER   -3.065950
I-ORG  -> I-PER   -3.364434
O      -> I-PER   -7.322841
O      -> I-MISC  -7.648246
O      -> I-ORG   -8.024126
O      -> I-LOC   -8.333815


We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside organization name (I-ORG), but transitions to I-ORG from tokens with other labels are penalized. Also note I-PER -> B-LOC transition: a positive weight means that model thinks that a person name is often followed by a location.

Check the state features:

In [56]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
8.886516 B-ORG  word.lower=efe-cantabria
8.743642 B-ORG  word.lower=psoe-progresistas
5.769032 B-LOC  -1:word.lower=cantabria
5.195429 I-LOC  -1:word.lower=calle
5.116821 O      word.lower=mayo
4.990871 O      -1:word.lower=día
4.910915 I-ORG  -1:word.lower=l
4.721572 B-MISC word.lower=diversia
4.676259 B-ORG  word.lower=telefónica
4.334354 B-ORG  word[-2:]=-e
4.149862 B-ORG  word.lower=amena
4.141370 B-ORG  word.lower=terra
3.942852 O      word.istitle=False
3.926397 B-ORG  word.lower=continente
3.924672 B-ORG  word.lower=acesa
3.888706 O      word.lower=euro
3.856445 B-PER  -1:word.lower=según
3.812373 B-MISC word.lower=exteriores
3.807582 I-MISC -1:word.lower=1.9
3.807098 B-MISC word.lower=sanidad

Top negative:
-1.965379 O      word.lower=fundación
-1.981541 O      -1:word.lower=británica
-2.118347 O      word.lower=061
-2.190653 B-PER  word[-3:]=nes
-2.226373 B-ORG  postag=SP
-2.226373 B-ORG  postag[:2]=SP
-2.260972 O      word[-3:]=uia
-2.384920 O      -1:word.lower

Some observations:

* **8.743642 B-ORG  word.lower=psoe-progresistas** - the model remembered names of some entities - maybe it is overfit, or maybe our features are not adequate, or maybe remembering is indeed helpful;
* **5.195429 I-LOC  -1:word.lower=calle**: "calle" is a street in Spanish; model learns that if a previous word was "calle" then the token is likely a part of location;
* **-3.529449 O      word.isupper=True**, ** -2.913103 O      word.istitle=True **: UPPERCASED or TitleCased words are likely entities of some kind;
* **-2.585756 O      postag=NP** - proper nouns (NP is a proper noun in the Spanish tagset) are often entities.

# TP - CRF

## Pickling  test dataset

In [57]:
import cPickle as pickle
pickle.dump({'X': X_test, 'y': y_test}, open('../conll2002/conll2002-esp_crfsuite-test-data.dump', 'wb'))   

## Hyper-parameter tuning

In the following we will test different values for the two regularization l1 and l2.

In [59]:
from itertools import product

trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
    
c1_values = [0, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c2_values = [0, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]

#c1_values = [0, 1e-4, 1e0, 1e1]
#c2_values = [0, 1e-4, 1e0, 1e1]

# loop over the regularization values
for c1, c2 in product(c1_values, c2_values):
    trainer.set_params({
        'c1': c1,   # coefficient for L1 penalty
        'c2': c2,  # coefficient for L2 penalty
        'max_iterations': 150,  
        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
    })

    #model_name = 'conll2002-esp.crfsuite_%.1e_%.1e' % (c1, c2)
    model_name = 'conll2002-esp.crfsuite_tmp'
    trainer.train(model_name)

    log = trainer.logparser.last_iteration
    nb_nonzero = log['active_features']
    feature_norm = log['feature_norm']

    tagger = pycrfsuite.Tagger()
    tagger.open(model_name)

    y_pred = [tagger.tag(xseq) for xseq in X_test]

    print 'c1: %.1e - c2: %.1e \n Nb non-zero features : %.0f & Coefficient vector norm %.1f \n' % (c1, c2, nb_nonzero, feature_norm)
    print(bio_classification_report(y_test, y_pred))


c1: 0.0e+00 - c2: 0.0e+00 
 Nb non-zero features : 96180 & Coefficient vector norm 146.9 

             precision    recall  f1-score   support

      B-LOC       0.79      0.74      0.76      1084
      I-LOC       0.65      0.60      0.62       325
     B-MISC       0.58      0.55      0.57       339
     I-MISC       0.65      0.60      0.62       557
      B-ORG       0.79      0.81      0.80      1400
      I-ORG       0.85      0.75      0.79      1104
      B-PER       0.82      0.87      0.84       735
      I-PER       0.88      0.93      0.90       634

avg / total       0.78      0.76      0.77      6178

c1: 0.0e+00 - c2: 1.0e-04 
 Nb non-zero features : 96180 & Coefficient vector norm 113.9 

             precision    recall  f1-score   support

      B-LOC       0.78      0.77      0.78      1084
      I-LOC       0.59      0.62      0.60       325
     B-MISC       0.64      0.53      0.58       339
     I-MISC       0.68      0.57      0.62       557
      B-ORG       0

### Analysis :  

We note that the behavior of the parameter vector as function of c1 and c2 is as expected. 
Indeed, the greater c1 the sparser the vector. And the greater c2 the smaller the norm of the vector.
These are the classical behaviors corresponding to these kind of regularizations i.e. l1 and l2.  
There are several maximum (0.80) for the mean f1-score. All of them with a value of c2 = 1e-01 (and c1=1e-4, 1e-3, 1e-2, 1e-1). Therefore the l2 regularization seems to be most important for this task.


## Viterbi decoding

Here, we implement our version of the viterbi algorithm. We use tools from the flexcrf suite.  
We obtain the same results than with crfsuite.

In [62]:
import cPickle as pickle
import numpy as np
from flexcrf_tp.models.linear_chain import (_feat_fun_values,
                                            _compute_all_potentials,
                                            _forward_score,
                                            _backward_score,
                                            _partition_fun_value,
                                            _posterior_score)

from flexcrf_tp.crfsuite2flexcrf import convert_data_to_flexcrf

# -- Define vitrebi_decoder here:

def viterbi_decoder(m_xy, n=None, log_version=True):
    """
    Performs MAP inference, determining $y = \argmax_y P(y|x)$, using the
    Viterbi algorithm.

    Parameters
    ----------
    m_xy : ndarray, shape (n_obs, n_labels, n_labels)
        Values of log-potentials ($\log M_i(y_{i-1}, y_i, x)$)
        computed based on feature functions f_xy and/or user-defined potentials
        `psi_xy`. At t=0, m_xy[0, 0, :] contains values of $\log M_1(y_0, y_1)$
        with $y_0$ the fixed initial state.

    n : integer, default=None
        Time position up to which to decode the optimal sequence; if not
        specified (default) the score is computed for the whole sequence.

    Returns
    -------
    y_pred : ndarray, shape (n_obs,)
        Predicted optimal sequence of labels.

    """

    n_seq, n_label, _ = m_xy.shape 
    record_argmax = np.zeros((n_seq - 1, n_label), dtype=int)
    y_pred = np.zeros(n_seq, dtype=int)

    # init
    score_max = m_xy[0, 0, :].copy()

    # loop - computing the new scores along the sequence and recording the paths
    for i_seq in range(n_seq - 1):
        tmp = score_max.reshape(-1, 1) + m_xy[i_seq + 1] # using broadcasting from numpy to make the code efficient
        record_argmax[i_seq] = np.argmax(tmp, axis=0)
        score_max = tmp[record_argmax[i_seq], np.arange(n_label)]

    # back-tracking - find the path leading to the best sequence
    y_pred[-1] = np.argmax(score_max)
    for i_seq in np.arange(n_seq - 1)[::-1]:
        y_pred[i_seq] = record_argmax[i_seq, y_pred[i_seq + 1]]

    return y_pred, score_max[y_pred[-1]]


# -- Load data and crfsuite model and convert them-------------------------

RECREATE = True  # set to True to recreate flexcrf data with new model

CRFSUITE_MODEL_FILE = '../conll2002/conll2002-esp.crfsuite'
CRFSUITE_TEST_DATA_FILE = '../conll2002/conll2002-esp_crfsuite-test-data.dump'
FLEXCRF_TEST_DATA_FILE = '../conll2002/conll2002-esp_flexcrf-test-data.dump'

# crfsuite model
tagger = pycrfsuite.Tagger()
tagger.open(CRFSUITE_MODEL_FILE)
model = tagger.info()

data = {'X': X_test, 'y': y_test} #pickle.load(open(CRFSUITE_TEST_DATA_FILE))
print "test data loaded."

if RECREATE:
    dataset, thetas = convert_data_to_flexcrf(data, model, n_seq=3)
    pickle.dump({'dataset': dataset, 'thetas': thetas}, open(FLEXCRF_TEST_DATA_FILE, 'wb'))
else:
    dd = pickle.load(open(FLEXCRF_TEST_DATA_FILE))
    dataset = dd['dataset']
    thetas = dd['thetas']

# -- Start classification ------------------------------------------------

for seq in range(len(dataset)):
    # -- with crfsuite
    s_ = tagger.tag(data['X'][seq])
    y_ = np.array([int(model.labels[s]) for s in s_])
    prob_ = tagger.probability(s_)

    print "\n-- With crfsuite:"
    print "labels:\n", s_, "\n", y_
    print "probability:\t %f" % prob_

    # -- with flexcrf
    f_xy, y = dataset[seq]
    theta = thetas[seq]
    
    # pre-compute potentials
    m_xy, f_m_xy = _compute_all_potentials(f_xy, theta)
    
    # find the mode 
    y_pred, s_max = viterbi_decoder(m_xy)
    
    # compute the forward variables and the partition function
    alpha = _forward_score(m_xy, n=None, log_version=True)
    Z = np.sum(np.exp(alpha[-1]))
    
    # posterior probability of the best sequence i.e p(y|x)
    prob = np.exp(s_max) / Z

    print "-- With flexcrf:"
    print "labels:\n", y_pred
    print "equal predictions: ", all(y_pred == y_)
    print "probability:\t %f" % prob
    print "delta:\t %f" % abs(prob - prob_)

tagger.close()


test data loaded.

converting to flexcrf format...
f_xy_desc created.
t_xyy_desc created
Processing sentence 1/3...
Processing sentence 2/3...
Processing sentence 3/3...

-- With crfsuite:
labels:
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'] 
[0 7 1 1 1 1 2 1 1]
probability:	 0.930321
-- With flexcrf:
labels:
[0 7 1 1 1 1 2 1 1]
equal predictions:  True
probability:	 0.930321
delta:	 0.000000

-- With crfsuite:
labels:
['O'] 
[1]
probability:	 0.999996
-- With flexcrf:
labels:
[1]
equal predictions:  True
probability:	 0.999996
delta:	 0.000000

-- With crfsuite:
labels:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O'] 
[1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 3 4 4 1 1 1 1 1 0 7 1]
probability:	 0.437425
-- With flexcrf:

## Chuncking task

We implement a chuncker based on the Conll2000 dataset using the crfsuite.  
Our features  are mainly inspired from the tutorial (http://www.chokkan.org/software/crfsuite/tutorial.html).

In [69]:
# load Conll2000 dataset
train_sents = nltk.corpus.conll2000.iob_sents('train.txt')
test_sents = nltk.corpus.conll2000.iob_sents('test.txt')

In [70]:
# feature function

def word2features(sent, i):
    
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2]
    ]
    
    if i > 1:
        word_b_1 = sent[i - 1][0]
        postag_b_1 = sent[i - 1][1]
        
        word_b_2 = sent[i - 2][0]
        postag_b_2 = sent[i - 2][1]
        
        features.extend([
            '-1:word.lower=' + word_b_1.lower(),
            '-1:word.istitle=%s' % word_b_1.istitle(),
            '-1:word.isupper=%s' % word_b_1.isupper(),
            '-1:postag=' + postag_b_1,
            '-1:postag[:2]=' + postag_b_1[:2],
                
            # two words before
            '-2:word.lower=' + word_b_2.lower(),
            '-2:word.istitle=%s' % word_b_2.istitle(),
            '-2:word.isupper=%s' % word_b_2.isupper(),
            '-2:postag=' + postag_b_2,
            '-2:postag[:2]=' + postag_b_2[:2],
                
            # word sequence
            '-1:word.word=' + word_b_1.lower() + word.lower(),
                
            '-2:word.word=' + word_b_2.lower() + word_b_1.lower(),
            '-2:word.word.word=' + word_b_2.lower() + word_b_1.lower() + word.lower(),
                
            # tag sequence    
            '-1:postag.postag=' + postag_b_1 + postag,
                
            '-2:postag.postag=' + postag_b_2 + postag_b_1,
            '-2:postag.postag.postag=' + postag_b_2 + postag_b_1 + postag
                
        ])
        
    if i < len(sent) - 2:
        word_a_1 = sent[i + 1][0]
        postag_a_1 = sent[i + 1][1]
        word_a_2 = sent[i + 2][0]
        postag_a_2 = sent[i + 2][1]
        features.extend([
            '+1:word.lower=' + word_a_1.lower(),
            '+1:word.istitle=%s' % word_a_1.istitle(),
            '+1:word.isupper=%s' % word_a_1.isupper(),
            '+1:postag=' + postag_a_1,
            '+1:postag[:2]=' + postag_a_1[:2],
            
            # two words before
            '+2:word.lower=' + word_a_2.lower(),
            '+2:word.istitle=%s' % word_a_2.istitle(),
            '+2:word.isupper=%s' % word_a_2.isupper(),
            '+2:postag=' + postag_a_2,
            '+2:postag[:2]=' + postag_a_2[:2],
                
            # word sequence
            '+1:word.word=' + word.lower() + word_a_1.lower(),
                
            '+2:word.word=' + word_a_1.lower() + word_a_2.lower(),
            '+2:word.word.word=' + word.lower() + word_a_1.lower() + word_a_2.lower() ,
                
            # tag sequence    
            '+1:postag.postag=' + postag + postag_a_1,
                
            '+2:postag.postag=' + postag_a_1 + postag_a_2,
            '+2:postag.postag.postag=' + postag + postag_a_1 + postag_a_2
        ])
    
    if i > 0 and i < len(sent) - 1:
        word_a_1 = sent[i + 1][0]
        postag_a_1 = sent[i + 1][1]
        word_b_1 = sent[i - 1][0]
        postag_b_1 = sent[i - 1][1]
        
        features.extend([
            # word sequence
            '+1:word.word='  + word.lower() + word_a_1.lower(),
            '-1:word.word='  + word_b_1.lower() + word.lower(),

            'word.word.word=' + word_b_1.lower() + word.lower() + word_a_1.lower(), 

            # tag sequence    
            '+1:postag.postag='  + postag + postag_a_1,
            '-1:postag.postag=' + postag_b_1 + postag, 

            'postag.postag.postag=' + postag_b_1 + postag + postag_a_1
        ])
        
    if i==0:
        features.append('BOS')
        
    
    if i==len(sent):
        features.append('EOS')
        
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]    


### Compute features

In [71]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

### Train Model

We take the parameters values found in the above task. Normally we should also do some hyper-parameter tuning.

In [72]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
    
trainer.set_params({
    'c1': 1e-1,   # coefficient for L1 penalty
    'c2': 1e-1,  # coefficient for L2 penalty
    'max_iterations': 150,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

trainer.train('conll2000-eng.crfsuite')

### Prediction

In [73]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2000-eng.crfsuite')

y_pred = [tagger.tag(xseq) for xseq in X_test]
print 'Test dataset evaluation'
print(bio_classification_report(y_test, y_pred))


Test dataset evaluation
             precision    recall  f1-score   support

     B-ADJP       0.81      0.74      0.78       438
     I-ADJP       0.79      0.65      0.71       167
     B-ADVP       0.85      0.82      0.83       866
     I-ADVP       0.64      0.58      0.61        89
    B-CONJP       0.50      0.56      0.53         9
    I-CONJP       0.67      0.77      0.71        13
     B-INTJ       1.00      1.00      1.00         2
      B-LST       0.00      0.00      0.00         5
      I-LST       0.00      0.00      0.00         2
       B-NP       0.97      0.97      0.97     12422
       I-NP       0.97      0.97      0.97     14376
       B-PP       0.97      0.98      0.97      4811
       I-PP       0.77      0.71      0.74        48
      B-PRT       0.82      0.82      0.82       106
     B-SBAR       0.89      0.85      0.87       535
     I-SBAR       0.17      0.75      0.27         4
       B-VP       0.96      0.96      0.96      4658
       I-VP       0.9

In [74]:
y_pred = [tagger.tag(xseq) for xseq in X_train]
print 'Train dataset evaluation'
print(bio_classification_report(y_train, y_pred))


Train dataset evaluation
             precision    recall  f1-score   support

     B-ADJP       1.00      1.00      1.00      2060
     I-ADJP       1.00      1.00      1.00       643
     B-ADVP       1.00      1.00      1.00      4227
     I-ADVP       1.00      1.00      1.00       443
    B-CONJP       1.00      1.00      1.00        56
    I-CONJP       1.00      1.00      1.00        73
     B-INTJ       1.00      0.97      0.98        31
     I-INTJ       1.00      1.00      1.00         9
      B-LST       1.00      1.00      1.00        10
       B-NP       1.00      1.00      1.00     55081
       I-NP       1.00      1.00      1.00     63307
       B-PP       1.00      1.00      1.00     21281
       I-PP       1.00      1.00      1.00       291
      B-PRT       1.00      1.00      1.00       556
      I-PRT       1.00      1.00      1.00         2
     B-SBAR       1.00      1.00      1.00      2207
     I-SBAR       1.00      1.00      1.00        70
      B-UCP       1.

### Analysis :

The mean f1-score (geometric average between the precision and the recall values) for the test set is good (0.96) but we can see that some kind of labels like (B-LST or I-SBAR) have very low score and even zero.
The zero score are not really relevant because they have very low support (below 10).  
We note that the labels having a support of at least 10000 in the train have very good f1-score.
These labels are very frequent and so have a big impact on the loss function.
A improvement would be to apply some weights to loss to compensate for the non uniformity of the label distribution. 
We can reasonably assume that with higher support the labels will have better score.
It is possible also that there is some overfitting as the training are very high.