# LAB4a - NERC with Conditional Random Fields (CRF)

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

### Credits

The content of this notebook is an adaptation of:
https://www.depends-on-the-definition.com/named-entity-recognition-conditional-random-fields-python/

which is itself based on:

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

In this notebook, we are going to use Conditional Random Fields (CRF) to train a Named Entity Recognition and Classification (NERC) system. CRF classifiers have been particularly successful for this task for several reasons:

<ol>
    <li>They can take a wide variety of features into account</li>
    <li>They take both the sequence of words with their annotations and the sequence of features into account to make predictions</li>
</ol>

We can see the task of NERC as a special sequence annotation task, in which some tokens in a sentence fall outside named-entity expressions, while others are part of a named-entity expression. As such, it has similarities with part-of-speech tagging and phrase structure chunking, but also with semantic classification when a named-entity phrase is assigned a type: person, organisation, location, time expression, etc. Due to the nature of the task, a wide range of features can contribute, but there is also a strong sequence dependency: the features of one token help predict the tags of the next token and vice versa. As in part-of-speech tagging, these sequence dependencies typically hold within the boundaries of a sentence. That is why CRF models for NERC typically use the sentence as the unit for representing features.


### Preparation

You first need to install the sklearn-crfsuite package, which does not come with scikit-learn. Open a command line within the Anaconda install environment and run the following command:

>pip install sklearn-crfsuite

For the evaluation of sequence tagging, we are going to use the package *seqeval*, which was tested on CoNLL tasks:

https://github.com/chakki-works/seqeval

> pip install seqeval[cpu]


To analyse the features used we also need another package:

>pip install eli5

See: https://eli5.readthedocs.io/en/latest/

### eli5 Fix

The eli5 library is no longer supported, and in order to get it to work, you might need to modify two files which contain an outdated import.
To do so, run eli5_patch.py from your terminal (while located in your working directory, run "python eli5_patch.py"). After that, the library should work.

### Background

We first present a formal model for the typical properties of the data that our classifier needs to annotate. If you are not familiar with the mathematical modeling of such problems, you can skip this subsection. The formalization helps to explain how a model can be adapted to avoid overfitting to the training set by forcing it to down-rank certain features and generalise more.


We typically represent the data as a sequence of words and as a sequence of tags, which are the output states of each word token in the sequence, i.e. being part of a named-entity expression or not.

We denote the input sequence (the words in a sentence):

$$x = (x_1,\dots, x_m)$$

The sequence of output states, i.e. the named entity tags, is represented as:

$$s = (s_1,\dots, s_m)$$

In conditional random fields we model the conditional probability for a sequence *1..m*:

$$p(s_1,\dots,s_m|x_1,\dots,x_m)$$

We do this by defining a feature map that maps an entire input sequence *x* paired with an entire state sequence *s* to some d-dimensional feature vector:

$$\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d$$

Then we can model the probability as a log-linear model with the parameter vector `w`:

$$p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},$$

Here *s'* ranges over all possible output sequences. For the estimation of *w*, we assume that we have a set of *n* labeled examples. Now we define the regularized log-likelihood function L:



$$L(w) = \sum_{i=1}^n \log p(s^i|x^i; w) - \frac{\lambda_2}{2}\|w\|_2^2 - \lambda_1 \|w\|_1.$$

The $\lambda$ terms force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as **regularization**. The parameters $\lambda_2$ and $\lambda_1$ allow us to control the extent of regularization. The parameter vector $w^*$ is then estimated as

$$w^* = \text{arg max}_{w\in \mathbb{R}^d} L(w)$$
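
To make the role of the two penalty terms more concrete, the following minimal sketch evaluates only the penalty part of $L(w)$ for two weight vectors. The function name `penalty` and the example vectors are illustrative assumptions and are not part of sklearn-crfsuite.

```python
def penalty(w, lambda_1, lambda_2):
    """Penalty that is subtracted from the log-likelihood in L(w)."""
    l1_norm = sum(abs(v) for v in w)
    l2_norm_squared = sum(v * v for v in w)
    return lambda_2 / 2 * l2_norm_squared + lambda_1 * l1_norm

# Two hypothetical weight vectors with the same L1 norm (4.0)
w_spread = [1.0, 1.0, 1.0, 1.0]  # weight spread over four features
w_peaked = [4.0, 0.0, 0.0, 0.0]  # all weight on a single feature

for name, w in [("spread", w_spread), ("peaked", w_peaked)]:
    print(name,
          "| L1 only:", penalty(w, lambda_1=0.1, lambda_2=0.0),
          "| L2 only:", penalty(w, lambda_1=0.0, lambda_2=0.1))
```

In this toy case the L1 term penalizes both vectors equally, while the L2 term penalizes the single large weight more strongly.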

Once we have estimated the vector $w^*$, we can find the most likely tag sequence $s^*$ for a sentence $x$ by



$$s^* = \text{arg max}_{s} p(s|x; w^*).$$
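
The formulas above can be illustrated with a brute-force sketch that enumerates every tag sequence for a three-word sentence. The feature map `phi`, the two-tag inventory and the weights `w` are invented for illustration; they are not the features or weights used later in this notebook, and practical CRF implementations use dynamic programming instead of explicit enumeration.

```python
import itertools
import math

TAGS = ["O", "B-geo"]  # assumed toy tag inventory

def phi(x, s):
    """Hypothetical feature map: counts of three hand-picked feature types."""
    capitalized_geo = sum(1 for word, tag in zip(x, s)
                          if word.istitle() and tag == "B-geo")
    lowercase_o = sum(1 for word, tag in zip(x, s)
                      if word.islower() and tag == "O")
    geo_after_geo = sum(1 for prev, cur in zip(s, s[1:])
                        if prev == "B-geo" and cur == "B-geo")
    return [capitalized_geo, lowercase_o, geo_after_geo]

def score(w, x, s):
    """The dot product w . Phi(x, s)."""
    return sum(wi * fi for wi, fi in zip(w, phi(x, s)))

def probabilities(w, x):
    """p(s|x; w) for every tag sequence s, by explicit enumeration."""
    candidates = list(itertools.product(TAGS, repeat=len(x)))
    unnormalized = {s: math.exp(score(w, x, s)) for s in candidates}
    z = sum(unnormalized.values())  # the denominator: sum over all s'
    return {s: value / z for s, value in unnormalized.items()}

x = ["rally", "in", "London"]
w = [2.0, 1.0, -1.0]  # assumed weights, not learned from data

p = probabilities(w, x)
s_star = max(p, key=p.get)  # arg max over all tag sequences
print(s_star, round(p[s_star], 3))
```

With $m$ words and $k$ tags there are $k^m$ candidate sequences, which is why enumeration is only feasible for toy examples like this one.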

### Implementation

#### Step 0: Install the needed modules
1.`sklearn_crfsuite`

Run `pip install sklearn_crfsuite` or 

`conda install -c derickl sklearn-crfsuite`


2.`eli5`

ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It provides support for several machine learning frameworks and packages, including scikit-learn and sklearn-crfsuite.

https://eli5.readthedocs.io/en/latest/overview.html

Run `pip install eli5` or 

`conda install -c conda-forge eli5`

#### Step I: Loading the data

Now we want to apply this model. Let’s start by loading the data.

In [1]:
import pandas as pd
import numpy as np

We are going to load an entity data set in CSV format that is provided through Kaggle and which follows a specifically adapted IOB annotation format. You can download the data set and the documentation from the following URL:

https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus#ner_dataset.csv


We use the pandas framework to load the CSV data as a table with columns.

In [2]:
#Adapt the path to load your local copy of the data set
data = pd.read_csv("nerc_datasets/kaggle/ner_dataset.csv", encoding="latin1")


The annotation has 4 columns, where the final column has the named entity tags and the first column is special as it represents a sentence identifier that is given for the first token of a sentence:

```
Sentence: 3,They,PRP,O
,marched,VBD,O
,from,IN,O
,the,DT,O
,Houses,NNS,O
,of,IN,O
,Parliament,NN,O
,to,TO,O
,a,DT,O
,rally,NN,O
,in,IN,O
,Hyde,NNP,B-geo
,Park,NNP,I-geo
,.,.,O
```
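
In the IOB scheme, a *B-* tag marks the first token of an entity, an *I-* tag marks a token inside the same entity, and *O* marks tokens outside any entity. As a minimal sketch of how to read this format, the hypothetical helper `iob_to_spans` below collects the entity phrases from the example sentence above. It is an illustration only and is not used in the rest of the notebook.

```python
def iob_to_spans(tokens, tags):
    """Collect (entity text, entity type) pairs from parallel token/tag lists."""
    spans = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        # Close the open entity unless this token continues it
        if current_words and not continues:
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix == "B" or (prefix == "I" and not current_words):
            # Start a new entity (a stray I- tag is treated leniently as a start)
            current_words, current_type = [token], entity_type
        elif continues:
            current_words.append(token)
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans

tokens = ["They", "marched", "from", "the", "Houses", "of", "Parliament",
          "to", "a", "rally", "in", "Hyde", "Park", "."]
tags = ["O"] * 11 + ["B-geo", "I-geo", "O"]

spans = iob_to_spans(tokens, tags)
print(spans)
```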

The pandas framework is very powerful and provides many different options for data manipulation and conversion. Please consult the online documentation for more details.

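We are going to forward-fill the missing values so that we get a uniform representation: the sentence identifier, which is only given for the first token of a sentence, is copied to all following tokens of that sentence. More details on filling missing values are provided here: 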

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html


In [3]:
# Forward-fill NA/NaN values: each empty cell takes the last valid value above it.

data = data.ffill()
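
As a small illustration of what forward-filling does, the sketch below applies `ffill` to a toy data frame in which the sentence identifier is only present on the first token of each sentence. The frame `toy` and the tokens of its second sentence are made up for this example.

```python
import pandas as pd

# Toy frame mimicking the raw CSV layout (second sentence is invented)
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 3", None, None, "Sentence: 4", None],
    "Word": ["They", "marched", "from", "Police", "arrived"],
})

filled = toy.ffill()
print(filled)
```

After the call, every row carries the identifier of the sentence it belongs to, which is what makes grouping by sentence possible later on.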

#### Step II: Initial analysis

Let's see how many rows we have in our data

In [4]:
print(len(data))

1048575


We see that we have over a million rows with tokens as data. This is quite a lot.

Through the *data.head(10)* and *data.tail(10)* methods, we can inspect the start and the end of the data frame.

In [5]:
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


Let's print the last 10 rows of the data:

In [6]:
data.tail(10)

Unnamed: 0,Sentence #,Word,POS,Tag
1048565,Sentence: 47958,impact,NN,O
1048566,Sentence: 47958,.,.,O
1048567,Sentence: 47959,Indian,JJ,B-gpe
1048568,Sentence: 47959,forces,NNS,O
1048569,Sentence: 47959,said,VBD,O
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O
1048574,Sentence: 47959,attack,NN,O


We have *47,959* sentences in our data set. For a CRF approach, sentences are the text units to model sequences of words.

As further analysis, we can make a set of all unique words:

In [7]:
words = list(set(data["Word"].values))

In [8]:
n_words = len(words); n_words

35177

So we have 47959 sentences containing 35177 unique words. We need the sentences as a unit for the CRF approach which assumes that sentences have some predictive sequence of words and likewise tags.

We will use a class called SentenceGetter to retrieve sentences with their labels. Don't worry about the details of this.

In the same way, we can get a list of all the values for the column with the part-of-speech values.

In [9]:
pos = list(set(data["POS"].values))

In [10]:
print(pos)

['RB', 'RRB', 'FW', 'WP', 'EX', 'NN', 'TO', 'VBZ', ',', 'JJR', 'NNPS', 'CC', 'VBN', 'VBP', '``', 'CD', 'PRP$', '$', 'RBR', 'UH', ':', 'NNS', 'NNP', 'WRB', 'MD', 'VBD', 'PDT', 'PRP', '.', 'WDT', 'VB', 'RBS', ';', 'WP$', 'JJS', 'RP', 'VBG', 'JJ', 'LRB', 'DT', 'POS', 'IN']


Finally, we extract the list of unique annotation tags, in this case the named-entity IOB tags

In [11]:
labels = list(set(data["Tag"].values))

In [12]:
print(labels)

['B-art', 'I-art', 'I-per', 'B-tim', 'B-geo', 'O', 'B-per', 'B-nat', 'B-org', 'B-eve', 'B-gpe', 'I-org', 'I-nat', 'I-gpe', 'I-tim', 'I-eve', 'I-geo']


It is important to learn about the prior distribution of the tags. For this, we can pass the list of tags to the *Counter* class to generate the frequency counts.

In [13]:
import collections
label_counts = collections.Counter(list(data["Tag"].values))
print(label_counts)

Counter({'O': 887908, 'B-geo': 37644, 'B-tim': 20333, 'B-org': 20143, 'I-per': 17251, 'B-per': 16990, 'I-org': 16784, 'B-gpe': 15870, 'I-geo': 7414, 'I-tim': 6528, 'B-art': 402, 'B-eve': 308, 'I-art': 297, 'I-eve': 253, 'B-nat': 201, 'I-gpe': 198, 'I-nat': 51})



We see that *O* is by far the most dominant tag. The other tags are much less frequent: the entity types *geo*, *tim*, *org*, *per* and *gpe* occur tens of thousands of times, whereas the special types *art*, *eve* and *nat* occur only a few hundred times each. Note that *gpe* is frequent as *B-gpe* but rare as *I-gpe*. Such data distributions are important for understanding the biases of systems trained on the data.
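
To quantify this imbalance, the following sketch converts the counts printed above into relative frequencies and aggregates the *B-* and *I-* tags per entity type. The counts are copied from the output above; the variable names are only illustrative.

```python
from collections import Counter

# Tag counts copied from the Counter output above
label_counts = Counter({
    'O': 887908, 'B-geo': 37644, 'B-tim': 20333, 'B-org': 20143,
    'I-per': 17251, 'B-per': 16990, 'I-org': 16784, 'B-gpe': 15870,
    'I-geo': 7414, 'I-tim': 6528, 'B-art': 402, 'B-eve': 308,
    'I-art': 297, 'I-eve': 253, 'B-nat': 201, 'I-gpe': 198, 'I-nat': 51,
})

total = sum(label_counts.values())
print(f"share of O: {label_counts['O'] / total:.3f}")

# Merge the B- and I- counts of each entity type
type_counts = Counter()
for tag, count in label_counts.items():
    if tag != "O":
        type_counts[tag.split("-", 1)[1]] += count

for entity_type, count in type_counts.most_common():
    print(f"{entity_type}: {count} tokens ({count / total:.4f})")
```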

The next class retrieves from the data frame a list of tuples for each separate sentence, where each tuple consists of the word, the part-of-speech tag and the entity tag.

In [15]:
# Class that groups the token rows of the data frame into sentences
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:  # there is no sentence with this number
            return None

In [16]:
getter = SentenceGetter(data)


In [17]:
sent = getter.get_next()

This is an example sentence we get with our SentenceGetter:

In [18]:
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


We can get all sentences as follows:

In [19]:
sentences = getter.sentences

In [20]:
print(len(sentences))

47959


In [21]:
sentence= sentences[3]
print(sentence)

[('They', 'PRP', 'O'), ('left', 'VBD', 'O'), ('after', 'IN', 'O'), ('a', 'DT', 'O'), ('tense', 'NN', 'O'), ('hour-long', 'JJ', 'O'), ('standoff', 'NN', 'O'), ('with', 'IN', 'O'), ('riot', 'NN', 'O'), ('police', 'NNS', 'O'), ('.', '.', 'O')]
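
One detail worth knowing: by default, pandas `groupby` sorts the group keys, and the sentence identifiers are strings. Assuming that default applies here, `sentences` is ordered lexicographically by identifier, so `sentences[3]` is not necessarily the fourth sentence of the corpus. The sketch below reproduces the effect with plain Python sorting; the identifiers are generated for illustration.

```python
# Generate identifiers in the same format as the "Sentence #" column
ids = [f"Sentence: {n}" for n in range(1, 47960)]

as_strings = sorted(ids)  # lexicographic order of the strings
as_numbers = sorted(ids, key=lambda s: int(s.split(": ")[1]))  # numeric order

print(as_strings[:4])
print(as_numbers[:4])
```

This does not affect training, because every sentence is still a complete unit, but it matters when you look up a specific sentence by its position in the list.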


#### Step III: Feature engineering

Now we craft a set of features and prepare the dataset. We define some typical features for NERC: the actual word (lowercase), the word ending (its last two and three characters), word shape features and the part-of-speech information. If there is a preceding word (i > 0), we add some properties of the preceding word. If there is a following word in the sentence (i < len(sent)-1), we add similar properties for the following word. A special feature is added for the first and last word.

In [None]:
# input is a sentence in the structure shown above
# and the index i of a word in that sentence; returns the features for that word

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    # data structure consisting of a feature name and value for the token
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(), # lower case variant of the token
        'word[-3:]': word[-3:], #suffix of 3 characters
        'word[-2:]': word[-2:], #suffix of 2 characters
        'word.isupper()': word.isupper(), # token is entirely upper case
        'word.istitle()': word.istitle(), # token starts with a capital letter (title case)
        'word.isdigit()': word.isdigit(), # token consists of digits only
        'postag': postag,
        'postag[:2]': postag[:2], #first two characters of the PoS Tag
    }
    if i > 0:
        # adding features for the word based on the previous word
        word1 = sent[i-1][0] # previous word
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True # Beginning of sentence as a feature

    if i < len(sent)-1:
        # adding features for the word based on the next word
        word1 = sent[i+1][0] # next word
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True # end of sentence as a feature

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
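
The word shape features rely on Python's built-in string methods, whose behaviour is not always intuitive for names and abbreviations. The following sketch prints the shape features for a few hand-picked example tokens, which are chosen for illustration and do not necessarily occur in the data set.

```python
examples = ["London", "USA", "2005", "McCain", "U.S.", "rally"]

# (isupper, istitle, isdigit) for each example token
shapes = {word: (word.isupper(), word.istitle(), word.isdigit())
          for word in examples}

for word, (upper, title, digit) in shapes.items():
    print(f"{word:8} isupper={upper!s:5} istitle={title!s:5} "
          f"isdigit={digit!s:5} suffix3={word[-3:]!r}")
```

Note that a mixed-case name such as "McCain" is not title case according to `str.istitle()`, while an abbreviation such as "U.S." counts as both upper case and title case.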

The following code extracts features with our functions above. It also prepares all labels from the original dataset.
First, try processing the full data (the first two lines, which are commented out below). If that fails, restart the kernel and use the bottom two lines instead, which process only the first 10,000 sentences.

In [23]:
sentence = sentences[0]
print(sentence)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


In [29]:
# X = [sent2features(s) for s in sentences]
# y = [sent2labels(s) for s in sentences]

# If your environment breaks here, it might be because of very large lists being held in memory. Try loading the first 10000 examples with:
X = [sent2features(s) for s in sentences[:10000]]
y = [sent2labels(s) for s in sentences[:10000]]

We can now inspect the first data representation in X.

In [30]:
print(X[0])

[{'bias': 1.0, 'word.lower()': 'thousands', 'word[-3:]': 'nds', 'word[-2:]': 'ds', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNS', 'postag[:2]': 'NN', 'BOS': True, '+1:word.lower()': 'of', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'IN', '+1:postag[:2]': 'IN'}, {'bias': 1.0, 'word.lower()': 'of', 'word[-3:]': 'of', 'word[-2:]': 'of', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'IN', 'postag[:2]': 'IN', '-1:word.lower()': 'thousands', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNS', '-1:postag[:2]': 'NN', '+1:word.lower()': 'demonstrators', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NNS', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'demonstrators', 'word[-3:]': 'ors', 'word[-2:]': 'rs', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NNS', 'postag[:2]': 'NN', '-1:word.lower()': '

#### Step IV: Initialize CRF

Now we can initialize the algorithm. We use the conditional random field (CRF) implementation provided by sklearn-crfsuite.

In [31]:
import sklearn_crfsuite

from sklearn_crfsuite import CRF

# different parameters are used for training
# check https://sklearn-crfsuite.readthedocs.io/en/latest/api.html?highlight=CRF
crf = CRF(algorithm='lbfgs',
          c1=0.1, # The coefficient for L1 regularization.
          c2=0.1, # The coefficient for L2 regularization.
          max_iterations=100,
          # When True, CRFsuite generates transition features for all possible
          # label pairs, including ones that never occur in the training data.
          # With L labels in the training data, this yields (L * L) transition features.
          all_possible_transitions=False)

We have now defined an instance *crf* to train and test on our data. We are going to use 5-fold cross-validation, which means that we keep 20% of the data for testing and 80% for training, and repeat this 5 times so that each part of the data is tested once and used four times for training. The predictions for the five test parts are then combined, so that we obtain one prediction for every sentence.
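
The splitting scheme can be sketched in plain Python. The helper `k_fold_indices` below is a hypothetical illustration that uses contiguous, unshuffled folds; this is an assumption about the default behaviour, so consult the scikit-learn documentation for how the folds are actually constructed.

```python
def k_fold_indices(n_items, n_folds=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i not in test]
        yield train, test
        start += size

folds = list(k_fold_indices(10))

# Record which fold's model would produce the prediction for each item
held_out_by = [None] * 10
for fold, (train, test) in enumerate(folds):
    for i in test:
        held_out_by[i] = fold
    print(f"fold {fold}: train={train} test={test}")

print(held_out_by)
```

Every item ends up in exactly one test part, so each prediction comes from a model that did not see that item during training.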

In [32]:
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report

We will use the sklearn_crfsuite classification report to evaluate the tagger, because we are mainly interested in precision, recall and the f1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
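
As a reminder of how these metrics are defined at the token level, here is a minimal pure-Python sketch for a single tag. The helper `precision_recall_f1` and the toy tag lists are illustrative assumptions; the report used in this notebook computes the same quantities for every tag.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy gold and predicted tags, flattened over all sentences
y_true = ["O", "O", "B-geo", "I-geo", "O", "B-geo"]
y_pred = ["O", "B-geo", "B-geo", "O", "O", "B-geo"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "B-geo")
print(f"B-geo: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```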

#### Step V: Train and test the CRF algorithm

We use *cross_val_predict* to do the cross-validation. This takes a while, as we have over a million data points, defined a rich feature set and need to repeat the training 5 times. It takes a few minutes on a pretty decent laptop to run the cross-validation. If you are not sure your machine can handle it, or if you cannot wait, you could go back and apply the sentence and label extraction to a subset of the sentences, e.g.:

X = [sent2features(s) for s in sentences[:10000]]
y = [sent2labels(s) for s in sentences[:10000]]

In [33]:
# given the model "crf" and
# the feature representations of the sentences X and their labels y,
# apply 5-fold cross-validation, testing 5 times on an 80% train / 20% test split
# this may take up to half an hour, depending on the machine you are running it on
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)

If you get "AttributeError: 'CRF' object has no attribute 'keep_tempfiles'", downgrade your scikit-learn package with: 

pip uninstall scikit-learn <br>
conda install -c anaconda scikit-learn==0.23.2


Next, we can apply the *flat_classification_report* function from sklearn_crfsuite to the *pred* variable to obtain the report per IOB tag at the token level.

In [34]:
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)

              precision    recall  f1-score   support

       B-art       0.20      0.03      0.05        76
       B-eve       0.61      0.34      0.43        74
       B-geo       0.82      0.88      0.85      7715
       B-gpe       0.95      0.93      0.94      3257
       B-nat       0.18      0.05      0.08        39
       B-org       0.75      0.68      0.71      4329
       B-per       0.80      0.78      0.79      3469
       B-tim       0.91      0.86      0.88      4244
       I-art       0.25      0.03      0.06        60
       I-eve       0.42      0.17      0.25        63
       I-geo       0.79      0.74      0.77      1502
       I-gpe       0.85      0.38      0.52        29
       I-nat       0.20      0.08      0.11        13
       I-org       0.77      0.74      0.75      3524
       I-per       0.82      0.88      0.85      3600
       I-tim       0.82      0.72      0.76      1380
           O       0.99      0.99      0.99    186878

    accuracy              

This report shows that the performance varies considerably across the different types of entities. Also note that the class "O" has an F1 of 0.99 and is the dominant class. The support is the number of tokens whose true tag is that class.
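
Because "O" dominates the support, any average that weights the tags by their support looks much better than an average that treats all tags equally. The sketch below illustrates this with the f1-scores and supports copied from the report above; the variable names are only illustrative.

```python
# tag: (f1-score, support), copied from the report above
report_rows = {
    "B-art": (0.05, 76), "B-eve": (0.43, 74), "B-geo": (0.85, 7715),
    "B-gpe": (0.94, 3257), "B-nat": (0.08, 39), "B-org": (0.71, 4329),
    "B-per": (0.79, 3469), "B-tim": (0.88, 4244), "I-art": (0.06, 60),
    "I-eve": (0.25, 63), "I-geo": (0.77, 1502), "I-gpe": (0.52, 29),
    "I-nat": (0.11, 13), "I-org": (0.75, 3524), "I-per": (0.85, 3600),
    "I-tim": (0.76, 1380), "O": (0.99, 186878),
}

total_support = sum(support for _, support in report_rows.values())

# Macro average: every tag counts equally
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: every tag counts in proportion to its support
weighted_f1 = sum(f1 * support
                  for f1, support in report_rows.values()) / total_support

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

The gap between the two averages is a reminder to look at the per-tag scores rather than at a single overall number.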

#### Step VI: Inspect features

The nice thing about CRFs is that we can look into the model and visualize the learned transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform the investigation: https://eli5.readthedocs.io/en/latest/

In order to analyse the features, we need to train a model on the whole data set. For this, we need to call the *fit* function on our data *X* and tags *y* again. This will take a few minutes as well (unless you limited the data!).

In [35]:
crf.fit(X, y)

In [36]:
import eli5

CRFsuite CRF models use two kinds of features: state features and transition features. Let’s check their weights using eli5.show_weights:

In [37]:
eli5.show_weights(crf, top=30)

From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,4.204,2.099,0.0,2.112,0.0,1.829,0.0,1.129,0.0,1.283,0.0,2.28,0.0,3.437,0.0,2.722,0.0
B-art,-0.412,0.0,6.921,0.0,0.0,0.218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0
I-art,-1.155,0.0,6.748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-eve,-1.074,0.0,0.0,0.0,6.844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-eve,-0.773,0.0,0.0,0.0,6.401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-geo,0.831,0.0,0.0,0.0,0.0,0.0,8.517,0.678,0.0,0.0,0.0,0.714,0.0,0.589,0.0,2.281,0.0
I-geo,0.09,0.0,0.0,0.0,0.0,0.0,7.226,-0.006,0.0,0.0,0.0,0.98,0.0,1.441,0.0,1.019,0.0
B-gpe,1.151,0.0,0.0,0.0,0.0,0.351,0.0,0.0,5.429,0.0,0.0,2.15,0.0,1.728,0.0,0.006,0.0
I-gpe,-0.101,0.0,0.0,0.0,0.0,0.021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.475,0.0,0.0,0.0
B-nat,-0.902,0.0,0.0,0.0,0.0,1.242,0.0,0.0,0.0,0.0,5.933,0.0,0.0,0.0,0.0,0.0,0.0

Weight?,Feature
+6.148,word.lower():month
+5.590,word.lower():year
+5.190,word.lower():week
+4.931,word.lower():last
+4.649,bias
+4.579,word.lower():jordanian
+4.445,word.lower():christian
+4.167,word.lower():jewish
+4.134,word.lower():kurdish
+4.133,word.lower():secretary

Weight?,Feature
+4.343,word.lower():spaceshipone
+3.864,word[-3:]:One
+2.975,+1:word.lower():al-arabiya
+2.537,-1:word.lower():site
+2.514,-1:word.lower():shown
+2.442,word.lower():frankenstadion
+2.422,-1:word.lower():key
+2.327,word.lower():journal
+2.234,word.lower():sidnaya
+2.138,word.lower():turkish

Weight?,Feature
+2.338,+1:word.lower():reported
+1.675,word[-2:]:ll
+1.534,-1:word.lower():boeing
+1.394,+1:word.lower():is
+1.219,-1:word.lower():for
+1.200,word.lower():mustard
+1.181,word[-3:]:Zoo
+1.181,word.lower():zoo
+1.160,-1:word.lower():balad
+1.120,+1:word.lower():mustard

Weight?,Feature
+3.529,word.lower():ramadan
+2.833,+1:word.lower():men
+2.777,word.lower():olympic
+2.776,word[-3:]:pic
+2.664,-1:word.lower():war
+2.481,word.lower():hopman
+2.275,word.lower():nanmadol
+2.196,word[-3:]:II
+2.196,word.lower():ii
+2.195,word[-2:]:II

Weight?,Feature
+2.799,+1:word.lower():caused
+2.250,word.lower():games
+2.151,word[-3:]:mes
+1.891,-1:postag:NNPS
+1.786,word[-3:]:Day
+1.778,+1:word.lower():tournament
+1.755,word.lower():day
+1.738,word.lower():open
+1.737,word[-3:]:pen
+1.569,-1:word.lower():war

Weight?,Feature
+5.137,word.lower():caribbean
+3.994,word.lower():martian
+3.864,word.lower():colombia
+3.859,word.lower():nativity
+3.773,word[-3:]:the
+3.669,-1:word.lower():mr.
+3.375,word.lower():second-in-command
+3.331,-1:word.lower():bordeaux
+3.261,word.lower():iraq
+3.215,word.lower():greenpeace

Weight?,Feature
+3.207,word.lower():airport
+2.841,-1:word.lower():western
+2.759,word.lower():city
+2.578,word.lower():island
+2.549,-1:word.lower():yucatan
+2.520,-1:word.lower():east
+2.478,-1:word.lower():orthodox
+2.370,-1:word.lower():surma
+2.341,word.lower():station
+2.220,word[-3:]:ort

Weight?,Feature
+4.932,word.lower():nepal
+4.557,word[-3:]:pal
+4.174,word.lower():niger
+4.009,-1:word.lower():high-level
+3.992,word.lower():madagonia
+3.960,word.lower():korean
+3.842,word.lower():liechtenstein
+3.776,word.lower():jordan
+3.742,word.lower():afghan
+3.714,word.istitle()

Weight?,Feature
+3.239,+1:word.lower():began
+2.890,-1:word.lower():democratic
+2.797,word[-2:]:bs
+2.618,-1:word.lower():bosnian
+2.520,+1:word.lower():mayor
+2.334,-1:word.lower():south
+2.294,+1:word.lower():since
+2.029,-1:word.lower():panama
+1.951,-1:word.lower():israeli
+1.774,word.lower():city

Weight?,Feature
+3.700,word.lower():katrina
+2.922,word.lower():rita
+2.841,word[-3:]:ita
+2.349,word[-2:]:ta
+2.159,word.lower():h5n1
+2.159,word[-3:]:5N1
+2.157,word[-2:]:N1
+2.100,+1:word.lower():river
+1.934,-1:word.lower():often-deadly
+1.798,+1:word.lower():strain

Weight?,Feature
+2.549,+1:word.lower():slammed
+1.825,word.lower():rita
+1.821,word[-3:]:ita
+1.532,-1:word.lower():type
+1.464,word.lower():diabetes
+1.416,+1:word.lower():relief
+1.384,-1:word.lower():heart
+1.302,-1:word.lower():hurricanes
+1.208,word[-3:]:ase
+1.192,word.lower():disease

Weight?,Feature
+5.744,word.lower():hamas
+5.162,-1:word.lower():rice
+5.013,word.lower():philippine
+4.150,word.lower():al-qaida
+4.020,word.lower():taleban
+3.860,word.lower():westerners
+3.707,-1:word.lower():nepal
+3.569,-1:word.lower():olympic
+3.470,-1:word.lower():vladimir
+3.423,-1:word.lower():concern

Weight?,Feature
+3.109,word.lower():member-countries
+2.983,word.lower():raiders
+2.981,word.lower():member-states
+2.951,word.lower():times
+2.805,-1:word.lower():shi'ites
+2.733,+1:word.lower():mulgueta
+2.683,word.lower():nations
+2.666,word.lower():ministry
+2.624,word[-3:]:for
+2.595,+1:word.lower():maung

Weight?,Feature
+5.501,word.lower():president
+4.637,word.lower():obama
+4.375,word.lower():koran
+3.983,word.lower():madeleine
+3.894,word.lower():mccain
+3.844,word.lower():clinton
+3.684,word.lower():western
+3.644,word.lower():prime
+3.298,+1:word.lower():leads
+3.193,word.lower():secretary

Weight?,Feature
+3.224,+1:word.lower():surayud
+2.865,+1:word.lower():george
+2.773,word.lower():christians
+2.427,+1:word.lower():h.w.
+2.307,+1:word.lower():philippe
+2.251,-1:word.lower():condoleezza
+2.198,-1:word.lower():viktor
+1.891,-1:word.lower():interim
+1.830,+1:word.lower():udi
+1.824,-1:word.lower():uri

Weight?,Feature
+5.658,word[-3:]:day
+5.100,word.lower():today
+5.031,word.lower():multi-candidate
+4.575,word[-3:]:Day
+4.498,word.lower():weekend
+4.187,+1:word.lower():weeks
+4.141,word.lower():midnight
+4.098,word.lower():february
+4.085,word.lower():january
+3.935,word.lower():eucharist

Weight?,Feature
+3.838,word[-3:]:day
+3.764,-1:word.lower():this
+3.352,word[-2:]:ay
+3.347,+1:word.lower():4.2
+3.054,word.lower():evening
+3.031,-1:word.lower():past
+2.745,+1:word.lower():asia
+2.729,+1:word.lower():ii
+2.709,word.lower():january
+2.696,+1:word.lower():india


The first table shows the learned weights for the transitions between tags. We see, for example, that *B-art* is most likely followed by *I-art* (8.442), while *I-art* is never followed by *B-art* or by any of the other *I* tags, which makes sense. Check the table for other regularities and see whether they make sense.
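
As a point of comparison for the learned transitions, the structural constraint of the tagging scheme can be written down directly. The sketch below assumes the tags follow the IOB2 convention (every entity starts with a *B-* tag); the function name `is_valid_transition` is hypothetical and not part of any library used in this notebook.

```python
def is_valid_transition(prev_tag, next_tag):
    """Return True if next_tag may follow prev_tag under the IOB2 scheme."""
    if not next_tag.startswith("I-"):
        # "O" and any "B-X" tag are allowed after every tag.
        return True
    entity_type = next_tag[2:]
    # "I-X" must continue an entity of the same type X.
    return prev_tag in ("B-" + entity_type, "I-" + entity_type)


print(is_valid_transition("B-art", "I-art"))  # True
print(is_valid_transition("B-art", "I-geo"))  # False
print(is_valid_transition("O", "I-per"))      # False
```

Transitions that the scheme allows but that receive a zero or negative weight in the table (for instance an entity directly followed by another entity of the same type) reflect regularities of this particular data set rather than a rule of the scheme.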

The second table shows, for each category, the features that contributed most positively. Here we see that the CRF is largely memorizing words (we have not used any gazetteers for creating features). For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. This is called overfitting. It works for this data but not for other data in which other presidents rule.
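
To make the difference between memorized and generic features concrete, here is a minimal sketch. It assumes a reduced, hypothetical feature function `token_features` that only mimics three of the feature names shown in the tables (`word.lower()`, `word[-3:]`, `word.istitle()`); it is not the feature function used earlier in this notebook.

```python
def token_features(word):
    """Build a tiny feature dict for one token (illustrative subset only)."""
    return {
        "word.lower()": word.lower(),       # lexical: tied to the exact word
        "word[-3:]": word[-3:],             # suffix: still partly lexical
        "word.istitle()": word.istitle(),   # shape: independent of the word
    }


def shared_features(word_a, word_b):
    """List the feature names on which two tokens get the same value."""
    feats_a, feats_b = token_features(word_a), token_features(word_b)
    return sorted(name for name in feats_a if feats_a[name] == feats_b[name])


# A name seen in training shares all features with itself ...
print(shared_features("Obama", "Obama"))
# ... but an unseen name only shares the generic shape feature.
print(shared_features("Obama", "Macron"))  # ['word.istitle()']
```

A model whose weight is concentrated on `word.lower():obama` has nothing to go on for the unseen name, whereas a weight on `word.istitle()` still applies.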

Instead of evaluating the IOB tags at the token level, we can also evaluate the complete sequence of an entity phrase.
For sequence evaluation, we are going to use the *seqeval* package which is specifically designed for sequence annotations. 
In our case, it will return scores for the complete phrases instead of the IOB tags for the tokens. It also ignores the "O" tag, which is dominant.
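
The idea behind phrase-level scoring can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption of well-formed IOB2 tags, not the actual *seqeval* implementation; the helper names `extract_entities` and `phrase_scores` are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one IOB2 tag sequence."""
    entities = set()
    start, current_type = None, None
    # The extra "O" is a sentinel that closes an entity at the sentence end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i - 1))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
        # Stray "I-" tags without a matching "B-" are ignored in this sketch,
        # which may differ from what seqeval does.
    return entities


def phrase_scores(y_true, y_pred):
    """Precision, recall and F1 over complete entity phrases."""
    gold, predicted = set(), set()
    for sent_id, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
        gold |= {(sent_id,) + e for e in extract_entities(true_tags)}
        predicted |= {(sent_id,) + e for e in extract_entities(pred_tags)}
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


y_true = [["B-per", "I-per", "O", "B-geo"]]
y_pred = [["B-per", "O", "O", "B-geo"]]
print(phrase_scores(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

In this toy sentence three of the four tokens are tagged correctly, but only one of the two phrases matches exactly, so a single wrong token costs the whole phrase.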

We use the functions *precision_score*, *recall_score*, and *f1_score* from the *seqeval* package to get the overall sequence annotation results for the total set.

In [38]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

print("precision-score: {:.1%}".format(precision_score(y, pred)))
print("recall-score: {:.1%}".format(recall_score(y, pred)))
print("F1-score: {:.1%}".format(f1_score(y, pred)))

precision-score: 81.7%
recall-score: 80.3%
F1-score: 81.0%


The *seqeval* package also provides an option to derive a specific classification report for the entity types at phrase level instead of the token level:

In [39]:
print(classification_report(y, pred))

              precision    recall  f1-score   support

         art       0.20      0.03      0.05        76
         eve       0.61      0.34      0.43        74
         geo       0.82      0.88      0.85      7715
         gpe       0.95      0.93      0.94      3257
         nat       0.18      0.05      0.08        39
         org       0.72      0.65      0.68      4329
         per       0.74      0.72      0.73      3469
         tim       0.88      0.83      0.85      4244

   micro avg       0.82      0.80      0.81     23203
   macro avg       0.64      0.55      0.58     23203
weighted avg       0.81      0.80      0.81     23203



We can see that the results for the complete sequences are somewhat lower than for the token-level annotation. Also note that the "O" tags are ignored. On the other hand, the overall macro-averaged results are somewhat higher.
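
The difference between the averages in the report can be checked by hand. The following sketch recomputes the macro and weighted F1 from the per-type F1 scores and support counts copied from the report above; because the inputs are already rounded to two decimals, small deviations from the report would not be surprising in general.

```python
# (f1-score, support) per entity type, copied from the report above
report = {
    "art": (0.05, 76),
    "eve": (0.43, 74),
    "geo": (0.85, 7715),
    "gpe": (0.94, 3257),
    "nat": (0.08, 39),
    "org": (0.68, 4329),
    "per": (0.73, 3469),
    "tim": (0.85, 4244),
}

# Macro average: every entity type counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every type counts in proportion to its support.
total_support = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")     # 0.58
print(f"weighted F1: {weighted_f1:.2f}")  # 0.81
```

The macro average is pulled down by the rare types *art*, *eve* and *nat*, which together account for fewer than 200 of the 23203 phrases.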

#### Step VII: Tuning the model

To counter the CRF's tendency to memorize words, we can tune its parameters, especially the regularization parameters. The c1 and c2 parameters of the CRF algorithm are the regularization parameters $\lambda_1$ and $\lambda_2$: c1 weights the L1 regularization, while c2 weights the L2 regularization. We now limit the number of features used by enforcing sparsity on the parameter vector $w$. To do this, we increase the L1 regularization parameter c1. Reducing the number of features helps to prevent the system from overfitting. If we regularize the CRF more, we can expect that only generic features will remain and that memorized tokens will disappear. With L1 regularization (the c1 parameter), the coefficients of most features should be driven to zero. Let’s check what effect regularization has on the CRF weights:

In [40]:
crf = CRF(algorithm='lbfgs',
          c1=10,  # L1 regularization is now set to 10
          c2=0.1,
          max_iterations=20,
          all_possible_transitions=False)
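
To build some intuition for why a larger c1 drives many weights to exactly zero while c2 only shrinks them, consider a single shrinkage step on a handful of weights. This is a sketch of the underlying math only, not the optimizer that CRFsuite runs; the function names `l1_shrink` and `l2_shrink` and the example weights are made up for illustration.

```python
def l1_shrink(weight, strength):
    """Soft-thresholding: the proximal step for the penalty strength * |w|."""
    magnitude = abs(weight) - strength
    if magnitude <= 0:
        return 0.0  # small weights are cut off completely
    return magnitude if weight > 0 else -magnitude


def l2_shrink(weight, strength):
    """Proximal step for the penalty strength * w**2: rescales, never zeroes."""
    return weight / (1.0 + 2.0 * strength)


weights = [5.5, 0.8, -0.3, 0.05]  # hypothetical feature weights
print([l1_shrink(w, 1.0) for w in weights])  # [4.5, 0.0, 0.0, 0.0]
print([round(l2_shrink(w, 1.0), 3) for w in weights])
```

With the L1 step only the strongest weight survives, while the L2 step keeps every weight non-zero. This is why L1 regularization is expected to remove features that are only supported by a few training instances.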

#### Note!
The next command will take another half an hour to carry out the training and testing 5 times

In [41]:
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)

For the details at the IOB tag level, we again use the *flat_classification_report* function from sklearn-crfsuite.

In [42]:
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)


              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        76
       B-eve       0.00      0.00      0.00        74
       B-geo       0.73      0.87      0.79      7715
       B-gpe       0.88      0.83      0.85      3257
       B-nat       0.00      0.00      0.00        39
       B-org       0.69      0.54      0.61      4329
       B-per       0.75      0.68      0.72      3469
       B-tim       0.90      0.69      0.78      4244
       I-art       0.00      0.00      0.00        60
       I-eve       0.00      0.00      0.00        63
       I-geo       0.65      0.59      0.61      1502
       I-gpe       0.00      0.00      0.00        29
       I-nat       0.00      0.00      0.00        13
       I-org       0.59      0.66      0.63      3524
       I-per       0.75      0.84      0.79      3600
       I-tim       0.74      0.48      0.58      1380
           O       0.98      0.99      0.99    186878

    accuracy              



We see that the evaluation results are not really better than before. For example, *B-per* now scores 0.75 precision and 0.68 recall, while it scored 0.84 precision and 0.81 recall before. We also see that the macro-averaged results are lower overall.

But let's look at the features before we jump to conclusions.

To inspect the features again, we need to call the *fit* function again. Take another break in case you did not limit the data.

In [43]:
crf.fit(X, y)

Now we look again at the features.

In [44]:
eli5.show_weights(crf, top=30)

From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,2.492,0.072,0.0,0.072,0.0,1.027,0.0,0.774,0.0,0.0,0.0,1.473,0.0,1.164,0.0,1.733,0.0
B-art,-0.044,0.0,0.288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-art,-0.052,0.0,0.229,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-eve,-0.059,0.0,0.0,0.0,0.367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-eve,0.0,0.0,0.0,0.0,0.096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-geo,0.697,0.0,0.0,0.0,0.0,0.0,5.254,0.0,0.0,0.0,0.0,-0.202,0.0,-0.797,0.0,0.456,0.0
I-geo,0.151,0.0,0.0,0.0,0.0,0.0,1.371,0.0,0.0,0.0,0.0,0.0,0.0,-0.024,0.0,0.0,0.0
B-gpe,0.601,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.156,0.0,0.0,0.176,0.0,0.318,0.0,-0.049,0.0
I-gpe,-0.004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-nat,-0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074,0.0,0.0,0.0,0.0,0.0,0.0

Weight?,Feature
+3.511,bias
+2.605,postag[:2]:VB
+2.417,BOS
+1.575,EOS
+1.501,word.lower():a
+1.429,word[-2:]:al
+1.332,postag[:2]:PR
+1.095,word[-2:]:er
+1.036,postag:PRP
+0.956,+1:postag[:2]:PR

Weight?,Feature
0.127,postag:NNP
0.116,word.istitle()
0.114,-1:postag:DT
0.114,-1:postag[:2]:DT
0.103,postag[:2]:NN
0.075,-1:word.lower():the
0.068,+1:word.istitle()
0.06,+1:postag:NNP
0.035,+1:postag[:2]:NN
-0.011,BOS

Weight?,Feature
0.183,-1:word.istitle()
0.128,-1:postag[:2]:NN
0.107,-1:postag:NNP
0.073,postag:NNP
0.041,word.istitle()
0.038,postag[:2]:NN
0.025,+1:word.istitle()
0.022,+1:postag:NNP
-0.105,bias

Weight?,Feature
0.182,+1:word.istitle()
0.123,postag:NNP
0.104,postag[:2]:NN
0.098,+1:postag:NNP
0.097,word.isupper()
0.096,word.lower():ii
0.096,word[-3:]:II
0.096,word[-2:]:II
0.095,-1:word.lower():war
0.086,-1:postag[:2]:DT

Weight?,Feature
0.199,-1:postag[:2]:NN
0.175,-1:word.istitle()
0.078,-1:postag:NNP
0.06,postag:NNP
0.046,word.istitle()
0.044,+1:postag:NNP
0.036,postag[:2]:NN
0.018,-1:word.lower():world
-0.004,+1:postag[:2]:NN
-0.007,-1:postag[:2]:IN

Weight?,Feature
+1.803,-1:word.lower():in
+1.381,word[-2:]:ia
+1.328,word.lower():iran
+1.322,word.lower():u.s.
+1.321,word[-3:]:.S.
+1.320,word[-2:]:S.
+1.107,postag:NNP
+1.074,word.istitle()
+0.967,postag[:2]:NN
+0.885,word[-3:]:ran

Weight?,Feature
+0.938,word.lower():states
+0.580,-1:word.lower():south
+0.538,word[-3:]:tes
+0.471,word[-2:]:ea
+0.443,word[-3:]:ast
+0.439,word.istitle()
+0.419,-1:word.lower():southern
+0.419,word.lower():korea
+0.399,postag:NNP
+0.393,-1:postag:NNP

Weight?,Feature
+3.351,word.istitle()
+1.406,postag:NNS
+1.129,word[-3:]:ese
+1.110,word.lower():iraqi
+1.101,word[-2:]:li
+1.095,word[-3:]:ans
+1.047,word.lower():palestinian
+1.009,postag:JJ
+1.004,+1:word.lower():president
+0.995,postag[:2]:JJ

Weight?,Feature
0.102,-1:postag:JJ
0.057,postag[:2]:JJ
0.052,postag:JJ
0.051,-1:postag[:2]:JJ
0.021,-1:word.istitle()
0.012,-1:word.lower():bosnian
-0.001,+1:word.istitle()
-0.005,+1:postag[:2]:VB
-0.011,-1:postag[:2]:NN
-0.021,+1:postag[:2]:NN

Weight?,Feature
0.04,-1:postag:IN
0.04,-1:postag[:2]:IN
0.038,word.isupper()
0.02,postag:NNP
-0.0,-1:postag:NNP
-0.011,+1:postag[:2]:NN
-0.016,-1:word.istitle()
-0.061,-1:postag[:2]:NN
-0.2,bias

Weight?,Feature
-0.006,+1:word.istitle()
-0.008,+1:postag[:2]:VB
-0.01,-1:postag[:2]:NN
-0.032,word.istitle()
-0.063,+1:postag[:2]:NN
-0.064,postag[:2]:NN
-0.804,bias

Weight?,Feature
+2.471,word.isupper()
+1.454,word[-3:]:ban
+1.184,postag:NNP
+1.034,word.lower():taleban
+1.031,postag[:2]:NN
+0.883,word.lower():hamas
+0.802,word[-3:]:mas
+0.768,+1:word.lower():nations
+0.767,word[-2:]:da
+0.748,word[-2:]:as

Weight?,Feature
+0.885,word.lower():nations
+0.827,-1:word.istitle()
+0.691,-1:postag:NNP
+0.674,word[-3:]:ion
+0.656,word[-3:]:ons
+0.621,postag:NNPS
+0.590,-1:postag:POS
+0.590,-1:postag[:2]:PO
+0.584,-1:word.lower():european
+0.578,-1:word.lower():of

Weight?,Feature
+1.329,word[-2:]:r.
+1.206,postag:NNP
+1.098,word[-3:]:Mr.
+1.098,word.lower():mr.
+1.055,word.lower():president
+1.005,word.lower():prime
+0.995,word[-3:]:ime
+0.983,word[-3:]:ent
+0.927,+1:postag:NNP
+0.827,-1:postag:NN

Weight?,Feature
+1.662,-1:word.lower():president
+0.955,-1:postag:NNP
+0.845,-1:word.istitle()
+0.744,-1:postag[:2]:NN
+0.744,postag:NNP
+0.558,-1:word.lower():mr.
+0.513,word.istitle()
+0.505,postag[:2]:NN
+0.350,word[-2:]:ez
+0.254,+1:postag:VBD

Weight?,Feature
+3.047,word[-3:]:day
+2.465,word[-2:]:ay
+1.995,word[-2:]:er
+1.981,-1:word.lower():in
+1.389,word[-3:]:ber
+1.193,+1:word.lower():years
+1.189,postag:CD
+1.189,postag[:2]:CD
+0.952,word.isdigit()
+0.890,-1:postag:NNS

Weight?,Feature
+1.642,postag:NN
+0.931,word[-2:]:ay
+0.797,postag[:2]:CD
+0.797,postag:CD
+0.771,word.isdigit()
+0.705,word[-3:]:day
+0.645,+1:postag:CD
+0.645,+1:postag[:2]:CD
+0.616,-1:postag:NN
+0.600,-1:word.lower():since


As expected, we see that the model relies less on individual words and more on the context, because context features generalize better and are useful across multiple training instances. This is an effect of the L1 regularization. Looking again at *B-per* and *I-per*, we see that the names dropped out and that parts-of-speech and words such as "mr." and "president" remain as the top scoring features.
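
The same inspection can be done without *eli5* by grouping the weights per label. A fitted sklearn-crfsuite model exposes them in the `state_features_` attribute, a dictionary that maps `(attribute, label)` pairs to weights. The sketch below runs on a small hand-made dictionary instead of the fitted model; its values are copied from the tables above under the assumption that the tables follow the label order, and the helper name `top_features_per_label` is hypothetical.

```python
from collections import defaultdict


def top_features_per_label(state_features, k=3):
    """Group (attribute, label) -> weight entries and keep the k largest per label."""
    by_label = defaultdict(list)
    for (attribute, label), weight in state_features.items():
        by_label[label].append((weight, attribute))
    return {
        label: [attribute for _, attribute in sorted(items, reverse=True)[:k]]
        for label, items in by_label.items()
    }


# Toy stand-in for crf.state_features_ (a few values copied from the tables)
toy_weights = {
    ("word[-2:]:r.", "B-per"): 1.329,
    ("postag:NNP", "B-per"): 1.206,
    ("word.lower():mr.", "B-per"): 1.098,
    ("-1:word.lower():president", "I-per"): 1.662,
    ("-1:postag:NNP", "I-per"): 0.955,
}

print(top_features_per_label(toy_weights, k=2))
```

On the real model you would pass `crf.state_features_` after calling *fit*, which makes it easy to compare the top features of the two regularization settings side by side.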

On regularization: "Regularization is a technique to discourage the complexity of the model. It does this by penalizing the loss function. This helps to solve the overfitting problem."

In particular, L1-regularization acts as a feature selector, simply removing some of the features. You can read more on regularization [here](https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2).

## Conclusion

We can thus conclude that, although the model seems to perform worse than before, it is still a better model because it did not overfit on the names in the training set.

For entity phrase evaluation, we use the functions from the *seqeval* package which is specifically designed for sequence annotations. 

In [45]:
print("precision-score: {:.1%}".format(precision_score(y, pred)))
print("recall-score: {:.1%}".format(recall_score(y, pred)))
print("F1-score: {:.1%}".format(f1_score(y, pred)))

precision-score: 74.4%
recall-score: 70.7%
F1-score: 72.5%


In [46]:
print(classification_report(y, pred))


              precision    recall  f1-score   support

         art       0.00      0.00      0.00        76
         eve       0.00      0.00      0.00        74
         geo       0.72      0.86      0.79      7715
         gpe       0.88      0.82      0.85      3257
         nat       0.00      0.00      0.00        39
         org       0.63      0.49      0.55      4329
         per       0.69      0.63      0.66      3469
         tim       0.85      0.65      0.74      4244

   micro avg       0.74      0.71      0.72     23203
   macro avg       0.47      0.43      0.45     23203
weighted avg       0.74      0.71      0.72     23203



Remarkably, the sequence evaluation 

The original notebook on which this notebook was based can be found here:

https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb

It describes a similar process for building a CRF-NERC classifier from the CoNLL-2002 dataset, which has Spanish and Dutch texts. You can follow this notebook to create your own NERC system for these languages.

## End of this notebook