# Sequence Labeling on a NER (Named Entity Recognition) model using w4 in Python

**Data can be downloaded from** https://github.com/dice-group/n3-collection/blob/master/reuters.xml

**Sequence labeling** = assigning a label to each member in the sequence

In [0]:
input = ["Nepal", "is", "a", "neighbor", "of", "India"]
output = ["C", "I", "I", "I", "I", "C"]

## Named Entity Recognition

To train a NER model, we need some labeled data. We'll use the Reuters-128 dataset, which is an English corpus in the NLP Interchange Format (NIF).

---


It contains **128 economic news articles** annotated with **880 named entities**, each with its position in the document and a URI identifying the entity.

**Sample document from the XML file:**

In [0]:
<document id="8">
  <documenturi>http://www.research.att.com/~lewis/Reuters-21578/15009</documenturi>
  <documentsource>Reuters-21578</documentsource>
  <textwithnamedentities>
    <namedentityintext uri="http://aksw.org/notInWiki/Home_Intensive_Care_Inc">Home Intensive Care Inc</namedentityintext>
    <simpletextpart> said it has opened a Dialysis at Home office in </simpletextpart>
    <namedentityintext uri="http://dbpedia.org/resource/Philadelphia">Philadelphia</namedentityintext>
    <simpletextpart>, its 12th nationwide.</simpletextpart>
  </textwithnamedentities>
</document>

In [0]:
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs

In [3]:
from google.colab import files

uploaded = files.upload()

Saving reuters.xml to reuters.xml


## Preparing the dataset for training

To prepare the dataset for training, we need to label every word (token) in the sentences as either **irrelevant (I)** or part of a **named entity (N)**.

---

As the data is in XML format, we'll use the ***BeautifulSoup*** library to parse the file and extract the data as follows:

In [0]:
# Reading the data file and parsing the XML

with codecs.open("reuters.xml", "r", "utf-8") as infile:
  soup = bs(infile, "html5lib")
  
docs = []
for elem in soup.find_all("document"):
  texts = []
  
  # Looping through each child of the element under 
  # "textwithnamedentities"
  for c in elem.find("textwithnamedentities").children:
    if isinstance(c, Tag):
      if c.name == "namedentityintext":
        label = "N"     # part of a named entity
      else:
        label = "I"     # irrelevant word
      for w in c.text.split(" "):
        if len(w) > 0:
          texts.append((w,label))
          
  docs.append(texts)

### The result will be a list of documents, each of which contains a list of (word, label) tuples, e.g.

In [5]:
docs[0][:10]

[('Paxar', 'N'),
 ('Corp', 'N'),
 ('said', 'I'),
 ('it', 'I'),
 ('has', 'I'),
 ('acquired', 'I'),
 ('Thermo-Print', 'N'),
 ('GmbH', 'N'),
 ('of', 'I'),
 ('Lohn', 'N')]
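
The labelling logic can also be checked in isolation on the sample document shown earlier. The sketch below is an illustration, not the notebook's method: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, uses a trimmed copy of the sample document, and splits on any whitespace rather than on single spaces. The names `SAMPLE` and `label_tokens` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (documenturi/documentsource omitted)
SAMPLE = """<document id="8">
  <textwithnamedentities>
    <namedentityintext uri="http://aksw.org/notInWiki/Home_Intensive_Care_Inc">Home Intensive Care Inc</namedentityintext>
    <simpletextpart> said it has opened a Dialysis at Home office in </simpletextpart>
    <namedentityintext uri="http://dbpedia.org/resource/Philadelphia">Philadelphia</namedentityintext>
    <simpletextpart>, its 12th nationwide.</simpletextpart>
  </textwithnamedentities>
</document>"""


def label_tokens(xml_string):
    """Hypothetical helper: return (word, label) pairs for one <document>."""
    root = ET.fromstring(xml_string)
    pairs = []
    for child in root.find("textwithnamedentities"):
        # Entity spans get "N"; every other text part gets "I"
        label = "N" if child.tag == "namedentityintext" else "I"
        for w in (child.text or "").split():
            pairs.append((w, label))
    return pairs


sample_pairs = label_tokens(SAMPLE)
print(sample_pairs[:6])
# [('Home', 'N'), ('Intensive', 'N'), ('Care', 'N'), ('Inc', 'N'), ('said', 'I'), ('it', 'I')]
```

Note that "Home" appears twice with different labels: once inside the company name (N) and once in "Dialysis at Home" (I). The label depends on the enclosing tag, not on the word itself.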

## Generating POS Tags

To train a w4 model, we need to create features for each of the tokens in the sentences.

---

The POS tags of the words can indicate whether a word is a noun, verb, adjective etc. (Fun fact: a POS tagger is a trained w4 model)

---

We'll use **NLTK's POS tagger** to generate the POS tags for the tokens in our docs as follows:

In [6]:
import nltk
nltk.download('averaged_perceptron_tagger')

data = []

for i, doc in enumerate(docs):
  
  # Obtain the list of tokens in the document
  tokens = [t for t, label in doc]
  
  # Performing POS tagging
  tagged = nltk.pos_tag(tokens)
  
  # Take the word, POS tag, and its label
  data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


### The output will be a list of documents, each of which is a list of tuples *with the word, its POS tag, and its label*

In [7]:
data[0]

[('Paxar', 'NNP', 'N'),
 ('Corp', 'NNP', 'N'),
 ('said', 'VBD', 'I'),
 ('it', 'PRP', 'I'),
 ('has', 'VBZ', 'I'),
 ('acquired', 'VBN', 'I'),
 ('Thermo-Print', 'NNP', 'N'),
 ('GmbH', 'NNP', 'N'),
 ('of', 'IN', 'I'),
 ('Lohn', 'NNP', 'N'),
 (',', ',', 'I'),
 ('West', 'NNP', 'N'),
 ('Germany', 'NNP', 'N'),
 (',', ',', 'I'),
 ('a', 'DT', 'I'),
 ('distributor', 'NN', 'I'),
 ('of', 'IN', 'I'),
 ('Paxar', 'NNP', 'N'),
 ('products,', 'NN', 'I'),
 ('for', 'IN', 'I'),
 ('undisclosed', 'JJ', 'I'),
 ('terms.', 'NN', 'I')]

## Generating Features

Given the POS tags, we can continue to generate more features for each of the tokens in the dataset. Which features are useful in training depends on the task at hand.

Some of the commonly used features for a word ***w*** in NER are as follows:


*   The word ***w*** itself (converted to lowercase for normalization)
*   The prefix / suffix of ***w*** (e.g. -ion)
*   The words surrounding ***w***
*   Whether ***w*** is in uppercase or lowercase
*   Whether ***w*** is a number, or contains digits
*   POS tag of ***w***, and those of the surrounding words
*   Whether ***w*** is or contains a special character (e.g. hyphen, dollar sign)



Below is a function for generating features for our documents. 

---

It takes a doc (a list of (word, POS tag, label) tuples) and an index ***i*** (the position of a token in that document), and returns the list of features for that token.

In [0]:
def word2features(doc, i):
  word = doc[i][0]
  postag = doc[i][1]
  
  # Common features for all words
  features = [
      'bias',
      'word.lower = ' + word.lower(),
      'word[-3:] = ' + word[-3:],
      'word[-2:] = ' + word[-2:],
      'word.isupper = %s' % word.isupper(),
      'word.istitle = %s' % word.istitle(),
      'word.isdigit = %s' % word.isdigit(),
      'postag = ' + postag
  ]
  
  
  # Features for words that are not at the
  # beginning of a document
  if i > 0:
    word1 = doc[i - 1][0]
    postag1 = doc[i - 1][1]
    features.extend([
        '-1:word.lower = ' + word1.lower(),
        '-1:word.istitle = %s' % word1.istitle(),
        '-1:word.isupper = %s' % word1.isupper(),
        '-1:word.isdigit = %s' % word1.isdigit(),
        '-1:postag = ' + postag1
    ])
  else:
    # Indicate that it's the beginning of the document
    features.append('BOS')
    
  
  # Features for words that are not at the
  # end of a document
  if i < len(doc) - 1:
    word1 = doc[i + 1][0]
    postag1 = doc[i + 1][1]
    features.extend([
        '+1:word.lower = ' + word1.lower(),
        '+1:word.istitle = %s' % word1.istitle(),
        '+1:word.isupper = %s' % word1.isupper(),
        '+1:word.isdigit = %s' % word1.isdigit(),
        '+1:postag = ' + postag1
    ])
  else:
    # Indicate that it is the end of a document
    features.append('EOS')
    
  return features
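
The feature list above also mentions prefixes, tokens that contain digits, and special characters, none of which `word2features` emits. A minimal sketch of how such features might be produced is shown below. The helper name `extra_word_features` and the feature names are assumptions made for illustration; they are not part of the original notebook or of pycrfsuite.

```python
import string


def extra_word_features(word):
    """Hypothetical helper: extra string features for a single token."""
    return [
        # First three characters as a crude prefix feature
        'word[:3] = ' + word[:3],
        # True if any character is a digit (e.g. "12th", "13.06")
        'word.hasdigit = %s' % any(ch.isdigit() for ch in word),
        # True if any character is ASCII punctuation (e.g. hyphen, dollar sign)
        'word.haspunct = %s' % any(ch in string.punctuation for ch in word),
    ]


print(extra_word_features("Thermo-Print"))
# ['word[:3] = The', 'word.hasdigit = False', 'word.haspunct = True']
```

The returned list could be appended to the common features with `features.extend(extra_word_features(word))`; whether these features actually help on this dataset would need to be checked on the test set.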

## Training the Model

To train the model, we need to first prepare the training data and the corresponding labels.

---

Also, to be able to measure the accuracy of the model, we need to separate the data into training and test sets, for which we'll use the **train_test_split** function from the **scikit-learn** library.

In [0]:
from sklearn.model_selection import train_test_split

# A function for extracting features from documents
def extract_features(doc):
  return [word2features(doc, i) for i in range(len(doc))]

# A function for generating the list of labels for each document
def get_labels(doc):
  return [label for (token, postag, label) in doc]

X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2)
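
Because no `random_state` is passed above, the split changes from run to run, so the sample inspected later and the reported scores will vary. The sketch below illustrates what a document-level split does, using only the standard library and a fixed seed. The function `split_docs` and its `seed` parameter are hypothetical stand-ins, not scikit-learn's implementation.

```python
import math
import random


def split_docs(X, y, test_size=0.2, seed=42):
    """Hypothetical stand-in for train_test_split: shuffle whole documents,
    then cut off a fraction of them as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)       # fixed seed => repeatable split
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data: 10 "documents", each paired with a matching label sequence
toy_X = [["doc%d" % i] for i in range(10)]
toy_y = [["label%d" % i] for i in range(10)]
tr_X, te_X, tr_y, te_y = split_docs(toy_X, toy_y)
print(len(tr_X), len(te_X))
# 8 2
```

Documents are kept whole, so a token sequence and its label sequence always land in the same set. With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of repeatability.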

In **pycrfsuite**, a CRF model can be trained by first creating a trainer, and then by submitting the training data and the corresponding labels to it.

---

After that, set the parameters and call **train()** to start the training process.

As the dataset used here is very small, training with **max_iterations = 200** finishes in a few seconds.

In [10]:
!pip install python-crfsuite
import pycrfsuite as P

trainer = P.Trainer(verbose = True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
  trainer.append(xseq, yseq)
  
# Set the parameters of the model
trainer.set_params({
    
    # coefficient for L1 penalty
    'c1': 0.1,
    
    # coefficient for L2 penalty
    'c2': 0.01,
    
    # maximum number of iterations
    'max_iterations': 200,
    
    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Giving a file name as a parameter to the train function, so that
# the model will be saved to the file when training is finished
trainer.train('w4.model')

Collecting python-crfsuite
  Downloading https://files.pythonhosted.org/packages/2f/86/cfcd71edca9d25d3d331209a20f6314b6f3f134c29478f90559cee9ce091/python_crfsuite-0.9.6-cp36-cp36m-manylinux1_x86_64.whl (754kB)

## Checking the Results

Once we have the model trained, we can apply it to our test data and see whether it gives reasonable results.

We load the model named **w4.model** and apply it to our test data:

In [11]:
tagger = P.Tagger()

tagger.open('w4.model')

y_pred = [tagger.tag(xseq) for xseq in X_test]

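# Let's take a look at one sample from the test set
i = 14
# Recover each word from its 'word.lower = ...' feature (index 1)
words = [feats[1].split("=")[1] for feats in X_test[i]]
# Use names that do not shadow the label list y defined earlier
for pred, word in zip(y_pred[i], words):
  print("%s (%s)" % (word, pred))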

 nv (N)
 philips (N)
 gloielampenfabrieken (N)
 pglo.as (N)
 expects (I)
 volume (I)
 turnover (I)
 to (I)
 show (I)
 a (I)
 satisfactory (I)
 increase (I)
 in (I)
 the (I)
 first (I)
 1987 (I)
 quarter, (I)
 chairman (I)
 cor (I)
 van (I)
 der (I)
 klugt (N)
 told (I)
 the (I)
 annual (I)
 shareholders (I)
 meeting. (I)
 but (I)
 an (I)
 average (I)
 dollar (I)
 rate (I)
 of (I)
 only (I)
 2.07 (I)
 guilders (I)
 against (I)
 the (I)
 2.69 (I)
 guilders (I)
 in (I)
 the (I)
 first (I)
 quarter (I)
 of (I)
 1986 (I)
 would (I)
 take (I)
 turnover (I)
 for (I)
 january (I)
 to (I)
 march (I)
 this (I)
 year (I)
 in (I)
 guilder (I)
 terms (I)
 to (I)
 less (I)
 than (I)
 the (I)
 13.06 (I)
 billion (I)
 guilders (I)
 posted (I)
 in (I)
 the (I)
 comparable (I)
 1986 (I)
 period. (I)
 he (I)
 said (I)
 all (I)
 the (I)
 first (I)
 quarter (I)
 figures (I)
 were (I)
 not (I)
 yet (I)
 available (I)
 and (I)
 would (I)
 be (I)
 released (I)
 on (I)
 april (I)
 29. (I)


### To study the performance of the w4 tagger trained above in more depth, we can check the precision and recall on the test data.

We'll do this using the **classification_report** function in **scikit-learn**.

### But, given that the predictions are sequences of tags, we need to transform the data into a list of labels before feeding them into the function.

In [12]:
import numpy as np
from sklearn.metrics import classification_report

# Creating a mapping of labels to indices
labels = {"N": 1, "I": 0}

# Convert the sequences of tags into a 1-D array
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

# Printing out the classification report
print(classification_report(
    truths, predictions,
    target_names = ["I", "N"]
))

              precision    recall  f1-score   support

           I       0.98      0.99      0.98      3069
           N       0.89      0.83      0.86       388

   micro avg       0.97      0.97      0.97      3457
   macro avg       0.94      0.91      0.92      3457
weighted avg       0.97      0.97      0.97      3457
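
To make the columns of the report concrete, precision, recall, and F1 for a single class can be computed by hand from flat lists of tags. The sketch below is an illustration on made-up toy sequences, not scikit-learn's implementation; the function name `precision_recall_f1` and the toy data are assumptions.

```python
def precision_recall_f1(truths, predictions, positive="N"):
    """Hypothetical helper: per-class precision, recall and F1 from flat tag lists."""
    pairs = list(zip(truths, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of the tokens predicted as `positive`, how many were right
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the true `positive` tokens, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: 3 true entities; 2 found, 1 missed, 1 false alarm
toy_truth = ["N", "N", "I", "I", "N", "I"]
toy_pred = ["N", "I", "I", "N", "N", "I"]
p, r, f = precision_recall_f1(toy_truth, toy_pred)
print("%.2f %.2f %.2f" % (p, r, f))
# 0.67 0.67 0.67
```

In the report above, recall for **N** (0.83) is lower than precision (0.89), meaning the tagger misses entity tokens more often than it labels irrelevant tokens as entities.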

