# Block 3: Sequence Level. Building a NER with CRF.
Jordi Armengol - Joan Llop

In this block, we want to perform Named Entity Recognition and Classification (NERC) in the given corpus, Conll2003, for the following entities:
- Miscellaneous.
- Organizations.
- Locations.
- Persons.

NERC usually requires sequence-level models, since words alone are not enough for disambiguating some cases. For this reason, we will use a sequence-level model, CRFs.

#### CRFs 

Conditional Random Fields are a kind of discriminative model (ie. not generative) based on undirected probabilistic graphs. We can think of them as a way of modeling the joint distribution of a whole sequence of inputs.

We will use pycrfsuite, a Python library for CRFs.

In [2]:
import nltk
from nltk.corpus.reader import ConllCorpusReader
import pycrfsuite
from sklearn import metrics

#### Data
We will use the english files of the Conll2003 corpus. A 'conll2003' folder with the files 'eng.train', 'eng.testa' and 'eng.testb' is required to be in the same folder as the notebook.

The data will be a list of sentences of tripplets: (word, pos, ne). we will use the testa for validation purposes and the testb as our test data.

In [3]:
train = ConllCorpusReader('conll2003', 'eng.train', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]
testa = ConllCorpusReader('conll2003', 'eng.testa', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]
testb = ConllCorpusReader('conll2003', 'eng.testb', ['words', 'pos', 'ne', 'chunk']).iob_sents()[1:]

## Preprocessing
#### Data cleaning, tokenization and other preprocessing issues
ConllCorpusReader already provides the tokenized sentences, Part-of-Speech and annotations. No further preprocessing is required as far as these matters are concerned.

## Feature engineering
In order to train our CRF, we need to create our own feature set to represent samples. We have done some research for knowing which are the usual features used in these cases (ie. NER) and examined examples of using pycrfsuite. Some of these features are pretty obvious, like whether the word starts with uppercase, in which case, the probability of being a named entity will be way higher.

We have decided to experiment with the following features:
- The word in lowercase
- The POS
- The length of the word
- A bool that indicates if the word is the beginning of the sentence
- A bool that indicates if the word is the end of the sentence
- A bool that indicates if the word is all in uppercase
- A bool that indicates if the word is a digit
- A bool that indicates if the word starts with an uppercase character.

We repeat all these features for the two previous words and for the two next words of the sentence (if they exist). We have used testa (the validation set) for assuring that these features are actually useful.

In [9]:
def get_words_from_sent(sent):
    return [words for words, postag, label in sent]

def get_individual_word_features(word, words, sent, index, id_):
    features = {}
    features[id_ + 'lowercase_word'] = word.lower() # word in lowercase
    features[id_ + 'postag'] = str(sent[index][1]) # Part-of-Speech                   
    features[id_ + 'length'] = str(len(words)) # length of word
    features[id_ + 'BOS'] = str(index==0) # beggining of a sentence
    features[id_ + 'EOF'] = str(index==len(words)-1) # end of sentence
    features[id_ + 'is_upper'] = str(word.isupper()) # is uppercase
    features[id_ + 'is_digit'] = str(word.isdigit()) # is a digit
    features[id_ + 'starts_upper'] = str(word.istitle()) # starts with uppercase
    return features
    
def get_word_features(sent, i):
    words = get_words_from_sent(sent)
    word = words[i]
    features = get_individual_word_features(word, words, sent, i, '')
    if i > 0:
        previous_word1 = words[i-1]
        previous1_word_features = get_individual_word_features(previous_word1, words, sent, i-1, 'previous1_')
        features = dict(**features, **previous1_word_features)
    if i > 1:
        previous_word2 = words[i-2]
        previous2_word_features = get_individual_word_features(previous_word2, words, sent, i-2, 'previous2_')
        features = dict(**features, **previous2_word_features)
    if i < len(words)-1:
        next_word1 = words[i+1]
        next_word1_features = get_individual_word_features(next_word1, words, sent, i+1, 'next1_')
        features = dict(**features, **next_word1_features)
    if i < len(words)-2:
        next_word2 = words[i+2]
        next_word2_features = get_individual_word_features(next_word2, words, sent, i+2, 'next2_')
        features = dict(**features, **next_word2_features)
    return features


def get_sentence_features(sent):
    #return pycrfsuite.ItemSequence([get_word_features(sent, i) for i in range(len(sent))])
    return [get_word_features(sent, i) for i in range(len(sent))]


def get_features(corpus):
    return [get_sentence_features(sent) for sent in corpus]
    
                        
def get_sentence_labels(sent):
    return [label for words, postag, label in sent]
                        
                        
def get_labels(corpus):
    return [get_sentence_labels(sent) for sent in corpus]


## Train phase
We separate the data in features and the labels and use the previous functions for building the feature sets. We will use the validation set for assuring that the features that we have thought of are actually useful. However, for computational constraints, we will not check all the combinations. Instead, we will test feature by feature and we will assume that if they are useful individually, the aggregate will be useful as well. We will not test the length of the context (the 2 previous words and the 2 next words) for the same reason (computational time).

We add all sentences (where each word has been converted to features) to the respective CRFs. The resulting models will be saved in disk.

In [31]:
%%time

def select(instances, feature_names_to_select):
    res = []
    for instance in instances:
        res.append(pycrfsuite.ItemSequence([{k: s[k] for k in feature_names_to_select} for s in instance]))
    return res
    
train_features = get_features(train)
train_labels = get_labels(train)
feature_names = ['postag', 'length', 'BOS', 'EOF', 'is_upper', 'is_digit', 'starts_upper']
for feature_name in feature_names:
    print('Training with', feature_name + '...')
    train_features_ = select(train_features, ['lowercase_word', feature_name])
    CRF = pycrfsuite.Trainer(verbose=False)
    for x, y in zip(train_features, train_labels):
        CRF.append(x, y)
    CRF.train('conll2003-eng-' + feature_name + '.model')
print('Training with words only...')
train_features_ = select(train_features, ['lowercase_word'])
CRF = pycrfsuite.Trainer(verbose=False)
for x, y in zip(train_features_, train_labels):
    CRF.append(x, y)
CRF.train('conll2003-eng-words.model')
print('Training with words and all features...')
CRF = pycrfsuite.Trainer(verbose=False)
for x, y in zip(train_features, train_labels):
    CRF.append(x, y)
CRF.train('conll2003-eng-all.model')

Training with postag...
Training with length...
Training with BOS...
Training with EOF...
Training with is_upper...
Training with is_digit...
Training with starts_upper...
Training with words only...
Training with words and all features...
CPU times: user 34min 49s, sys: 1.04 s, total: 34min 50s
Wall time: 34min 51s


## Validation and model selection phase
We select the model with the highest accuracy in the validation (testa) set.

In [43]:
testa_features = get_features(testa)
testa_labels = get_labels(testa)
feature_names = ['postag', 'length', 'BOS', 'EOF', 'is_upper', 'is_digit', 'starts_upper']
y_true = []
for sentence_labels in testa_labels:
    for label in sentence_labels:
        y_true.append(label)
best_accuracy = 0
best_model = ''
for feature_name in feature_names:
    testa_features_ = select(testa_features, ['lowercase_word', feature_name])
    tagger = pycrfsuite.Tagger()
    tagger.open('conll2003-eng-' + feature_name + '.model')
    y_pred = []
    for sentence_pred in [tagger.tag(x) for x in testa_features_]:
        for pred in sentence_pred:
            y_pred.append(pred)
    accuracy = metrics.accuracy_score(y_true=y_true, y_pred=y_pred)
    print('accuracy using ' + feature_name + ' = ' + str(accuracy))
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = 'conll2003-eng-' + feature_name + '.model'

        
tagger = pycrfsuite.Tagger()
tagger.open('conll2003-eng-words.model')
testa_features_ = select(testa_features, ['lowercase_word'])
y_pred = []
for sentence_pred in [tagger.tag(x) for x in testa_features_]:
    for pred in sentence_pred:
        y_pred.append(pred)
accuracy = metrics.accuracy_score(y_true=y_true, y_pred=y_pred)
print('accuracy using words only = ' + str(accuracy))
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = 'conll2003-eng-words.model'

        
tagger = pycrfsuite.Tagger()
tagger.open('conll2003-eng-all.model')
y_pred = []
for sentence_pred in [tagger.tag(x) for x in testa_features]:
    for pred in sentence_pred:
        y_pred.append(pred)
accuracy = metrics.accuracy_score(y_true=y_true, y_pred=y_pred)
print('accuracy using words and all features = ' + str(accuracy))
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = 'conll2003-eng-all.model'

accuracy using postag = 0.07526965460846541
accuracy using length = 0.06545695261087964
accuracy using BOS = 0.0630427164051244
accuracy using EOF = 0.05737704918032787
accuracy using is_upper = 0.13657957244655583
accuracy using is_digit = 0.03775164518515634
accuracy using starts_upper = 0.43722985865036407
accuracy using words only = 0.9040535804680503
accuracy using words and all features = 0.969705229547136


## Test phase

We have used testb for the final test, applying the selected model. This set has not been used for training or model selection.

In [34]:
testb_features = get_features(testb)
testb_labels = get_labels(testb)

tagger = pycrfsuite.Tagger()
tagger.open(best_model)
y_pred = []
for sentence_pred in [tagger.tag(x) for x in testb_features]:
    for pred in sentence_pred:
        y_pred.append(pred)
        
y_test = []
for sentence_labels in testb_labels:
    for label in sentence_labels:
        y_test.append(label)
        
# Print results 
print('accuracy =', metrics.accuracy_score(y_true=y_test, y_pred=y_pred))
print(metrics.classification_report(y_true=y_test, y_pred=y_pred))
print('confusion matrix: ')
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))


accuracy = 0.9534618283622268


  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

       B-LOC       0.00      0.00      0.00         6
      B-MISC       0.00      0.00      0.00         9
       B-ORG       0.00      0.00      0.00         5
       I-LOC       0.82      0.83      0.82      1919
      I-MISC       0.73      0.69      0.71       909
       I-ORG       0.76      0.75      0.76      2491
       I-PER       0.84      0.89      0.87      2773
           O       0.99      0.98      0.99     38323

   micro avg       0.95      0.95      0.95     46435
   macro avg       0.52      0.52      0.52     46435
weighted avg       0.95      0.95      0.95     46435

confusion matrix: 
[[    0     0     0     0     0     4     2     0]
 [    0     0     0     0     5     1     2     1]
 [    0     0     0     0     0     4     1     0]
 [    0     0     0  1585    33   122    82    97]
 [    0     0     0    56   625    57    40   131]
 [    0     0     0   145    66  1876   218   186]
 [    0     0     0    5

## Conclusions

We can see that the more data the better. In the classes where there are almost no data (B-LOC, B-MISC and B-ORG, since apparently most entities are composed of a single word) we are not able to predict any instance, while with the classes with more data our model performs much better. In the class 'O' (there is no name entity) we get more than 98 percent of precision and recall (we have to remark that a classifier that assign all claases to 'O' will get 0.82 of precision). We can see that we predict better the class I-LOC than the class I-ORG, but the latter class has more instances than the former. One possible explanation to this anomaly is the length of each class: the names of organizations are usually longer than the locations (name of cities, for instance).

With regard to model selection, we have seen that all the features we experimented with seem to be useful, as we can see in our partial ablation study in the model selection and validation phase. However, some features seem to be way more important than others, specially the feature denoting whether the given word starts with capital letters, which seems to be logical, in the sense that named entities will usually start with a capital letter. The importance of some other features seems to be marginal.

Finally, we conclude that we have not overfitted to the validation set, since the results with the best model are consistent with the ones obtained in the test set.