# Block 3: Sequence Level. Building a NER chunker with CRF.
Jordi Armengol - Joan Llop

In this block we want to discover the name entities of sentences using conditional random fields. 
#### CRFs 
Usually a neural network is model that takes a single input and returns the most likely label, but with conditional random fields, the previous inputs and the next inputs matter in the task of assigning a label to an instance. Therefore, we can think of them as a way of modeling the join distribution of a whole sequence of inputs.

In [112]:
import nltk
from nltk.corpus.reader import ConllCorpusReader
import pycrfsuite
# gensim can be downloaded using pip install -U gensim
import gensim # m'ha semblat inutil, crec que ho hauriem de borrar de cara a l'entrega
from gensim.models import Word2Vec # " "
from sklearn import svm, metrics

#### Data
We will use the english files of the conll2003 corpus. A 'conll2003' folder with the files 'eng.train', 'eng.testa' and 'eng.testb' is required to be in the same folder as the notebook.

The data will be a list of sentences of tripplets: (word, pos, ne). we will use the testa for validation purposes and the testb as our test data.

In [113]:
train = ConllCorpusReader('conll2003', 'eng.train', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]
testa = ConllCorpusReader('conll2003', 'eng.testa', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]
testb = ConllCorpusReader('conll2003', 'eng.testb', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]

## Preprocessing
#### The features
In order to train our CRF we need to create the features (from the data we have the words and the pos). We have decided to use the following features:
- The word in lowercase
- The POS
- The lenght of the word
- A bool that indicates if the word is the beginning of the sentence
- A bool that indicates if the word is the end of the sentence
- A bool that indicates if the word is all in uppercase
- A bool that indicates if the word is a digit
- A bool that indicates if the word is a title

We repeat all these features for the two previous words and for the two next words of the sentence (if they exist).

In [114]:
def get_words_from_sent(sent):
    return [words for words, postag, label in sent]


# used when all features are embeddings
def get_embedded_word_features(i, words, model):
    word_features = []
    for j in range(len(words)):
        word_features.append(str(model.wv.similarity(words[i], words[j])))
    return word_features


# used when all features are embeddings
def get_embedded_sentence_features(sent, model):
    words = get_words_from_sent(sent)
    features = [get_embedded_word_features(i, words, model) for i in range(len(words))]
    return features


# used when all features are embeddings
def get_embedded_features(corpus):
    words = [get_words_from_sent(sent) for sent in corpus]
    model = gensim.models.Word2Vec(words, min_count = 1, size = 100, window = 5)
    return [get_embedded_sentence_features(sent, model) for sent in corpus]


def get_word_features(i, sent):
    words = get_words_from_sent(sent)
    word = words[i]
    features = []
    features.append('lowercase word: ' + word.lower()) # word in lowercase
    features.append('postag: ' + str(sent[i][1])) # Postag                   
    features.append('lenght of word: ' + str(len(words))) # lenght of word
    features.append('BOS: ' + str(i==0)) # beggining of a sentence
    features.append('EOF: ' + str(i==len(words)-1)) # end of sentence
    features.append('word is upper: ' + str(word.isupper())) # is uppercase
    features.append('word is digit: ' + str(word.isdigit())) # is a digit
    features.append('word is title: ' + str(word.istitle())) # is a title
    if (i > 0):
        previous_word = words[i-1]
        features.append('lowercase previous word: ' + previous_word.lower())
        features.append('postag previous word: ' + str(sent[i-1][1]))
        features.append('lenght of previous word: ' + str(len(previous_word)))
        features.append('previous word is BOS: ' + str(i-1==0))
        features.append('previous word is EOF: ' + str(i-1==len(words)-1))
        features.append('previous word is upper: ' + str(previous_word.isupper()))
        features.append('previous word is digit: ' + str(previous_word.isdigit()))
        features.append('previous word is title: ' + str(previous_word.istitle()))
    if (i > 1):
        previous_word = words[i-2]
        features.append('lowercase second previous word: ' + previous_word.lower())
        features.append('postag second previous word: ' + str(sent[i-2][1]))
        features.append('lenght of second previous word: ' + str(len(previous_word)))
        features.append('second previous word is BOS: ' + str(i-2==0))
        features.append('second previous word is EOF: ' + str(i-2==len(words)-1))
        features.append('second previous word is upper: ' + str(previous_word.isupper()))
        features.append('second previous word is digit: ' + str(previous_word.isdigit()))
        features.append('second previous word is title: ' + str(previous_word.istitle()))
    if (i < len(words)-1):
        next_word = words[i+1]
        features.append('lowercase next word: ' + next_word.lower())
        features.append('postag next word: ' + str(sent[i+1][1]))
        features.append('lenght of next word: ' + str(len(next_word)))
        features.append('next word is BOS: ' + str(i+1==0))
        features.append('next word is EOF: ' + str(i+1==len(words)-1))
        features.append('next word is upper: ' + str(next_word.isupper()))
        features.append('next word is digit: ' + str(next_word.isdigit()))
        features.append('next word is title: ' + str(next_word.istitle()))
    if (i < len(words)-2):
        next_word = words[i+2]
        features.append('lowercase second next word: ' + next_word.lower())
        features.append('postag second next word: ' + str(sent[i+2][1]))
        features.append('lenght of second next word: ' + str(len(next_word)))
        features.append('second next word is BOS: ' + str(i+2==0))
        features.append('second next word is EOF: ' + str(i+2==len(words)-1))
        features.append('second next word is upper: ' + str(next_word.isupper()))
        features.append('second next word is digit: ' + str(next_word.isdigit()))
        features.append('second next word is title: ' + str(next_word.istitle()))
    return features


def get_sentence_features(sent):
    return [get_word_features(i, sent) for i in range(len(sent))]


def get_features(corpus):
    return [get_sentence_features(sent) for sent in corpus]
    
                        
def get_sentence_labels(sent):
    return [label for words, postag, label in sent]
                        
                        
def get_labels(corpus):
    return [get_sentence_labels(sent) for sent in corpus]


## Train phase
We separate the data in features and the labels:

In [115]:
%%time
# train_features = get_embedded_features(train)

train_features = get_features(train)
train_labels = get_labels(train)

# testa_features = get_embedded_features(testa)

testa_features = get_features(testa)
testa_labels = get_labels(testa)

# testb_features = get_embedded_features(testb)

testb_features = get_features(testb)
testb_labels = get_labels(testb)

CPU times: user 9.07 s, sys: 500 ms, total: 9.57 s
Wall time: 9.9 s


We add all sentences (where each word has been converted to features) to the CRF

In [116]:
%%time
CRF = pycrfsuite.Trainer(verbose=False)

for x, y in zip(train_features, train_labels):
    CRF.append(x, y)

CPU times: user 4.7 s, sys: 7.99 ms, total: 4.71 s
Wall time: 4.71 s


We train the CRF with the sentences that we have add before. The resulting model will be saved with the name 'conll2003-eng.model'

In [117]:
%%time
CRF.train('conll2003-eng.model')

CPU times: user 3min 12s, sys: 108 ms, total: 3min 12s
Wall time: 3min 12s


## Test phase

We have used the testa data for validation purposes and the testb for the final test.

In [118]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2003-eng.model')
y_pred = []

# use testa_features for validation and testb_features for testing
for sentence_pred in [tagger.tag(x) for x in testb_features]:
    for pred in sentence_pred:
        y_pred.append(pred)
        
# use testa_labels for validation and testb_labels for testing
y_test = []
for sentence_labels in testb_labels:
    for label in sentence_labels:
        y_test.append(label)
        
# Print results 
print('accuracy =', metrics.accuracy_score(y_true=y_test, y_pred=y_pred))
print(metrics.classification_report(y_true=y_test, y_pred=y_pred))
print('confusion matrix: ')
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))


accuracy = 0.952341983418


  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

       B-LOC       0.00      0.00      0.00         6
      B-MISC       0.00      0.00      0.00         9
       B-ORG       0.00      0.00      0.00         5
       I-LOC       0.81      0.84      0.83      1919
      I-MISC       0.72      0.68      0.70       909
       I-ORG       0.75      0.74      0.75      2491
       I-PER       0.84      0.89      0.86      2773
           O       0.99      0.98      0.98     38323

    accuracy                           0.95     46435
   macro avg       0.51      0.52      0.51     46435
weighted avg       0.95      0.95      0.95     46435

confusion matrix: 
[[    0     0     0     0     0     4     2     0]
 [    0     0     0     0     5     2     2     0]
 [    0     0     0     0     0     4     1     0]
 [    0     0     0  1608    35   105    76    95]
 [    0     0     0    55   616    61    37   140]
 [    0     0     0   150    72  1851   220   198]
 [    0     0     0    6

## Conclusions

We can see that the more data the better. In the classes where there is almost no data (B-LOC, B-MISC and B-ORG) we are not able to predict any instance, while with the classes with more data we perform much better. In the class 'O' (there is no name entity) we get more than 98 percent of precision and recall (we have to remark that a classifier that assign all claases to 'O' will get 0.82 of precision). We can see that we predict better the class I-LOC than the class I-ORG, but this last class have more that than the class I-LOC, One possible explanation to this anomaly is the lenght of each class: the names of organizations are usually longer than the locations (name of cities, for instance).