### Hands-on 3:

* Download and preprocess the Named Entity Recognition (NER) corpus (CoNLL 2002)
* Prepare a CRF model for NER
* Train and evaluate the CRF model


## Named Entity Recognition (NER) using CRF

The task of Named Entity Recognition (NER) involves recognizing:<br>
* names of persons
* locations
* organizations
* dates
* ...


#### Example 

For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for location) and ORG (for organization):

<br>

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

<br>


_______________

<b>[PER Wolff ] </b> , currently a journalist in <b> [LOC Argentina ] </b> , played with <b> [PER Del Bosque ] </b> in the final years of the seventies in <b> [ORG Real Madrid ] </b> .

_______________

<br>

### NER Sub-tasks

NER involves two sub-tasks: <br>

* identifying the boundaries of such expressions (the open and close brackets), and
* labeling the expressions (with tags such as PER, LOC or ORG).

As with chunking, this sequence labeling task is reduced to assigning one classification tag per token, using a BIO encoding of the data.

### BIO Tagging:

The BIO / IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named-entity recognition).

* A B- prefix before a tag indicates that the token is the beginning of a chunk.
* An I- prefix before a tag indicates that the token is inside a chunk.
* An O tag indicates that the token belongs to no entity / chunk.

The following example shows what a BIO-tagged sentence looks like:



```
    Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O
```
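Reading the entities back out of a BIO-tagged sentence is a matter of grouping each B- token with the I- tokens that follow it. The helper below is a minimal sketch of that idea; `bio_to_spans` is a hypothetical name introduced here for illustration, and it simply ignores an I- tag that does not continue an open chunk of the same type.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            # anything other than a matching I- tag closes the open span
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if starts_new:
            current_type, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
        # "O" tags and stray I- tags fall through and are skipped
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Wolff", ",", "currently", "a", "journalist", "in", "Argentina", ",",
          "played", "with", "Del", "Bosque", "in", "the", "final", "years",
          "of", "the", "seventies", "in", "Real", "Madrid", "."]
tags = ["B-PER", "O", "O", "O", "O", "O", "B-LOC", "O",
        "O", "O", "B-PER", "I-PER", "O", "O", "O", "O",
        "O", "O", "O", "O", "B-ORG", "I-ORG", "O"]

spans = bio_to_spans(tokens, tags)
print(spans)
# [('PER', 'Wolff'), ('LOC', 'Argentina'), ('PER', 'Del Bosque'), ('ORG', 'Real Madrid')]
```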

### DataSet

Let’s use the CoNLL 2002 data to build an NER system.

The CoNLL 2002 corpus is available in NLTK.

In [1]:
# download corpus

import nltk
nltk.download('conll2002')

# get training/testing datasets
from nltk.corpus import conll2002

[nltk_data] Error loading conll2002: <urlopen error [Errno 61]
[nltk_data]     Connection refused>


In [2]:
conll2002

<ConllChunkCorpusReader in '/Users/Yam/nltk_data/corpora/conll2002'>

### Data Preparation

In [3]:
# Training and test splits (Spanish portion of CoNLL 2002)

train_sents = list(conll2002.iob_sents('esp.train'))
test_sents = list(conll2002.iob_sents('esp.testb'))

print(train_sents[0])
# each tuple contains (token, POS tag, NER label)


[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]


In [24]:
len(train_sents), len(test_sents)

(8323, 1517)

### Features

- word level features 
    - word shape
    - word suffix 
 
- Current/previous word context 
    - some information from nearby words is used.
    
- word POS tag
- label context 

This is a simple baseline. You can add or remove features to try to get better results, so experiment with it.

sklearn-crfsuite (and python-crfsuite) supports several feature formats; 
here we use feature dicts.
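One way to make the "word shape" idea more explicit is to map each token to a short pattern string. The function below is a sketch under that assumption; `word_shape` and the `X`/`x`/`d` alphabet are illustrative choices, not part of sklearn-crfsuite or of the feature set used in this notebook.

```python
def word_shape(word):
    """Map a token to a coarse shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    for ch in word:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation and other characters unchanged
        # collapse runs of the same symbol, so 'Melbourne' becomes 'Xx'
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


for token in ["Melbourne", "EFE", "25", "("]:
    print(token, word_shape(token))
```

Such a value could be added to a feature dict under a key like `'word.shape'`; whether it helps on this corpus would need to be checked experimentally.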


In [4]:
# helper functions that turn a sentence into feature, label and token sequences
def word2features(sent, i):
    
    word = sent[i][0]
    postag = sent[i][1]


    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the first token of the sentence: 'beginning of sentence' (BOS)
        features['BOS'] = True
        
    
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the last token of the sentence: 'end of sentence' (EOS)
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

This is what word2features extracts:

In [5]:
sample_sentence = " ".join([s for s,c,d in train_sents[2]])
sample_sentence

'El Abogado General del Estado , Daryl Williams , subrayó hoy la necesidad de tomar medidas para proteger al sistema judicial australiano frente a una página de internet que imposibilita el cumplimiento de los principios básicos de la Ley .'

In [6]:
train_sents[2][:10]

[('El', 'DA', 'O'),
 ('Abogado', 'NC', 'B-PER'),
 ('General', 'AQ', 'I-PER'),
 ('del', 'SP', 'I-PER'),
 ('Estado', 'NC', 'I-PER'),
 (',', 'Fc', 'O'),
 ('Daryl', 'VMI', 'B-PER'),
 ('Williams', 'NC', 'I-PER'),
 (',', 'Fc', 'O'),
 ('subrayó', 'VMI', 'O')]

In [9]:
word2features(train_sents[2], 1)

{'bias': 1.0,
 'word.lower()': 'abogado',
 'word[-3:]': 'ado',
 'word[-2:]': 'do',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'NC',
 'postag[:2]': 'NC',
 '-1:word.lower()': 'el',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:postag': 'DA',
 '-1:postag[:2]': 'DA',
 '+1:word.lower()': 'general',
 '+1:word.istitle()': True,
 '+1:word.isupper()': False,
 '+1:postag': 'AQ',
 '+1:postag[:2]': 'AQ'}

In [10]:
sent2features(train_sents[2])

[{'bias': 1.0,
  'word.lower()': 'el',
  'word[-3:]': 'El',
  'word[-2:]': 'El',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'DA',
  'postag[:2]': 'DA',
  'BOS': True,
  '+1:word.lower()': 'abogado',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'NC',
  '+1:postag[:2]': 'NC'},
 {'bias': 1.0,
  'word.lower()': 'abogado',
  'word[-3:]': 'ado',
  'word[-2:]': 'do',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'NC',
  'postag[:2]': 'NC',
  '-1:word.lower()': 'el',
  '-1:word.istitle()': True,
  '-1:word.isupper()': False,
  '-1:postag': 'DA',
  '-1:postag[:2]': 'DA',
  '+1:word.lower()': 'general',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'AQ',
  '+1:postag[:2]': 'AQ'},
 {'bias': 1.0,
  'word.lower()': 'general',
  'word[-3:]': 'ral',
  'word[-2:]': 'al',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  

### Feature Extraction:

Extract features from the training data and testing data

In [11]:
# sentence representations for sequence labelling
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [12]:
train_sents[0], y_train[0]

([('Melbourne', 'NP', 'B-LOC'),
  ('(', 'Fpa', 'O'),
  ('Australia', 'NP', 'B-LOC'),
  (')', 'Fpt', 'O'),
  (',', 'Fc', 'O'),
  ('25', 'Z', 'O'),
  ('may', 'NC', 'O'),
  ('(', 'Fpa', 'O'),
  ('EFE', 'NC', 'B-ORG'),
  (')', 'Fpt', 'O'),
  ('.', 'Fp', 'O')],
 ['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'])

In [13]:
X_train[0], y_train[0]

([{'bias': 1.0,
   'word.lower()': 'melbourne',
   'word[-3:]': 'rne',
   'word[-2:]': 'ne',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   'postag': 'NP',
   'postag[:2]': 'NP',
   'BOS': True,
   '+1:word.lower()': '(',
   '+1:word.istitle()': False,
   '+1:word.isupper()': False,
   '+1:postag': 'Fpa',
   '+1:postag[:2]': 'Fp'},
  {'bias': 1.0,
   'word.lower()': '(',
   'word[-3:]': '(',
   'word[-2:]': '(',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': False,
   'postag': 'Fpa',
   'postag[:2]': 'Fp',
   '-1:word.lower()': 'melbourne',
   '-1:word.istitle()': True,
   '-1:word.isupper()': False,
   '-1:postag': 'NP',
   '-1:postag[:2]': 'NP',
   '+1:word.lower()': 'australia',
   '+1:word.istitle()': True,
   '+1:word.isupper()': False,
   '+1:postag': 'NP',
   '+1:postag[:2]': 'NP'},
  {'bias': 1.0,
   'word.lower()': 'australia',
   'word[-3:]': 'lia',
   'word[-2:]': 'ia',
   'word.isupper()': False,
   'word

### Training
Here we use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.
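As a rough guide to what the two regularization coefficients control, Elastic Net training can be written as minimizing a penalized negative log-likelihood. The symbols $w$ (feature weights), $N$ (number of training sentences) and $(x^{(n)}, y^{(n)})$ (feature and label sequences) are notation assumed here for illustration; $c_1$ and $c_2$ stand for the L1 and L2 coefficients, and the exact constant factors used by the implementation may differ.

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

A larger $c_1$ pushes more weights to exactly zero (sparser models), while a larger $c_2$ shrinks all weights toward zero more smoothly.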

In [14]:
# train CRF model
# !pip install sklearn_crfsuite
import sklearn_crfsuite
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)



In [15]:
crf



CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

In [16]:
crf.fit(X_train, y_train)

# training model parameters

CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

### Evaluation
The data set contains far more O labels than entity labels, but we are more interested in the entities. To account for this, we use an averaged F1 score computed over all labels except O. The sklearn_crfsuite.metrics module provides several useful metrics for sequence classification, including this one.

In [17]:
# get label set
labels = list(crf.classes_)
labels.remove('O')
print(labels)

['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']


In [18]:
# evaluate CRF model
from sklearn_crfsuite import metrics

y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.7964686316443963
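To make the metric concrete, the sketch below computes a support-weighted F1 by hand on toy data: per-label F1 over the flattened token tags, averaged with weights equal to each label's number of true occurrences. The function name `weighted_f1` and the toy sequences are assumptions for illustration; it is intended to mirror the weighted average restricted to the given labels, not to replace the library call.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted average of per-label F1 over flattened tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    # walk over token-level (true, predicted) pairs across all sentences
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_seq, pred_seq):
            if true_tag == pred_tag:
                tp[true_tag] += 1
            else:
                fp[pred_tag] += 1
                fn[true_tag] += 1
    total_support = sum(tp[label] + fn[label] for label in labels)
    if total_support == 0:
        return 0.0
    score = 0.0
    for label in labels:
        support = tp[label] + fn[label]
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += f1 * support
    return score / total_support


toy_true = [["B-LOC", "O", "B-PER", "I-PER"], ["O", "B-ORG", "O"]]
toy_pred = [["B-LOC", "O", "B-PER", "O"], ["O", "B-LOC", "O"]]
toy_labels = ["B-LOC", "B-ORG", "B-PER", "I-PER"]  # 'O' is left out on purpose

toy_score = weighted_f1(toy_true, toy_pred, toy_labels)
print(round(toy_score, 4))  # 0.4167
```

Because O is excluded from the label list, the three correctly predicted O tokens do not raise the score; only the entity labels count.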

In [19]:
y_pred[0]

['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']

### Inspect per-class results in more detail:

In [27]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.810     0.784     0.797      1084
       I-LOC      0.690     0.637     0.662       325
      B-MISC      0.731     0.569     0.640       339
      I-MISC      0.699     0.589     0.639       557
       B-ORG      0.807     0.832     0.820      1400
       I-ORG      0.852     0.786     0.818      1104
       B-PER      0.850     0.884     0.867       735
       I-PER      0.893     0.943     0.917       634

   micro avg      0.813     0.787     0.799      6178
   macro avg      0.791     0.753     0.770      6178
weighted avg      0.809     0.787     0.796      6178
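The row order in the report comes from the sort key `(name[1:], name[0])`, which orders labels by entity type first and by the B/I prefix second. The short sketch below replays that key on the label list printed earlier; `example_labels` and `group_key` are names introduced here for illustration.

```python
example_labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER',
                  'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']


def group_key(name):
    # 'B-LOC' -> ('-LOC', 'B'): entity type first, then the B/I prefix
    return (name[1:], name[0])


example_sorted = sorted(example_labels, key=group_key)
print(example_sorted)
# ['B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']
```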

