In this notebook, we use a recurrent neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing: it extracts entities such as persons, organizations, and locations from text. We will build a NER model that recognizes named entities in tweets.

For example, suppose we want to extract the names of persons and organizations from a text. Then, for the input text:

    Ian Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* stands for "outside", i.e. no tag. Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
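To make the BIO scheme concrete, here is a minimal sketch that groups a tag sequence back into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of this notebook), and treating a stray *I-* tag as the start of a new entity is a lenient assumption rather than a rule of the scheme.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an entity of the
    same type is treated as the start of a new entity (a lenient choice).
    """
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        # "B-geo-loc" -> ("B", "-", "geo-loc"); "O" -> ("O", "", "")
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and current_type == ent_type:
            continue  # still inside the current entity
        # Close the entity that was open, if any.
        if current_type is not None:
            spans.append((current_type, start, i))
            start, current_type = None, None
        # Open a new entity on B- (or on a stray I-).
        if prefix in ("B", "I"):
            start, current_type = i, ent_type
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


tokens = "Ian Goodfellow works for Google Brain".split()
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
for ent_type, start, end in bio_to_spans(tags):
    print(ent_type, " ".join(tokens[start:end]))
```

On the example sentence this prints `PER Ian Goodfellow` and `ORG Google Brain`. Two adjacent entities of the same type, such as `B-PER B-PER`, come out as two separate spans, which is exactly what the *B-* prefix is for.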

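Our solution is based on recurrent neural networks, in particular Long Short-Term Memory networks (LSTMs). Bi-directional LSTMs (Bi-LSTMs) are a common choice for this task; note that the model defined in this notebook uses a single unidirectional `nn.LSTM` layer.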

In [1]:
import os 
import time 
import random
import warnings

import numpy as np 
from collections import defaultdict

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.tensorboard import SummaryWriter


warnings.simplefilter('ignore')

data_path = "./data/twitter"
path_to_logdir = './logdir'
path_to_model = "./models"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if not os.path.exists(path_to_logdir):
    os.makedirs(path_to_logdir)
    
if not os.path.exists(path_to_model):
    os.makedirs(path_to_model)


# I/- Load the Twitter Named Entity Recognition corpus

We will work with a corpus of tweets annotated with named-entity tags. Every line of a file contains a token (a word or punctuation symbol) and its tag, separated by whitespace. Tweets are separated by an empty line.
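As an illustration of this layout, the sketch below parses a tiny in-memory sample. The two tweets and the helper name `parse_corpus` are made up for the example; they are not taken from the corpus or from this notebook.

```python
import io

# Hypothetical two-tweet sample in the "token tag" format described above.
sample = (
    "@alice O\n"
    "visits O\n"
    "Paris B-geo-loc\n"
    "\n"
    "Nice O\n"
    "http://example.com O\n"
)


def parse_corpus(handle):
    """Group 'token tag' lines into tweets; an empty line ends a tweet."""
    tweets, current = [], []
    for line in handle:
        line = line.strip()
        if line:
            token, tag = line.split()
            current.append((token, tag))
        elif current:
            tweets.append(current)
            current = []
    if current:  # the last tweet may not be followed by an empty line
        tweets.append(current)
    return tweets


tweets = parse_corpus(io.StringIO(sample))
print(len(tweets))
print(tweets[0])
```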

## 1) Read data
The function *read_data* reads a corpus from *file_path* and returns two lists: one with the tokens and one with the corresponding tags. It also replaces every user nickname (a token starting with `@`) with the `<USR>` token and every URL with the `<URL>` token.

In [2]:
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
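    # Use a context manager so the file is closed after reading
    with open(file_path, encoding='utf-8') as corpus_file:
        for line in corpus_file:
            line = line.strip()
            if not line:
                # An empty line marks the end of the current tweet
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()

                # Normalize URLs and user mentions to special tokens
                if token.startswith('http://') or token.startswith("https://"):
                    token = "<URL>"
                if token.startswith("@"):
                    token = "<USR>"

                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet when the file does not end with an empty line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags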

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameter tuning;
 - *test* data for final evaluation of the model.

In [3]:
train_tokens, train_tags = read_data(os.path.join(data_path, 'train.txt'))
validation_tokens, validation_tags = read_data(os.path.join(data_path, 'validation.txt'))
test_tokens, test_tags = read_data(os.path.join(data_path, 'test.txt'))

## 2) Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: selects the row of the embedding matrix for the current token;
- {tag}$\to${tag id}: identifies the ground-truth tag (equivalently, a one-hot probability distribution) used to compute the loss at the output of the network.
 

In [4]:
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
    # Missing keys fall back to index 0, i.e. the first special token
    tok2idx = defaultdict(lambda: 0)
    idx2tok = {}
    
    index = 0
    for special_token in special_tokens:
        tok2idx[special_token] = index
        idx2tok[index] = special_token
        index += 1
    
   
    for seq in tokens_or_tags:
        for tok in seq:
            if tok not in tok2idx:
                tok2idx[tok] = index
                idx2tok[index] = tok
                index += 1
    
    return tok2idx, idx2tok


# create the mapping between tokens and ids for a sentence
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]

After implementing the function *build_dict* we can make dictionaries for tokens and tags. Special tokens in our case will be:
 - `<UNK>` token for out of vocabulary tokens;
 - `<PAD>` token for padding sentences to the same length when we create batches of sentences.
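The `<UNK>` fallback relies on the mapping returning index 0 for tokens it has never seen. The sketch below shows this on a toy vocabulary (not the Twitter corpus). It also illustrates a side effect that is easy to overlook: indexing a `defaultdict` with a missing key stores that key, whereas `dict.get` leaves the mapping unchanged.

```python
from collections import defaultdict

# Toy vocabulary; index 0 plays the role of the unknown-token id.
tok2idx = defaultdict(lambda: 0)
for index, tok in enumerate(["<UNK>", "<PAD>", "hello", "world"]):
    tok2idx[tok] = index

sentence = ["hello", "there", "world"]

# dict.get falls back to 0 without modifying the mapping.
ids_get = [tok2idx.get(tok, 0) for tok in sentence]
size_before = len(tok2idx)

# Indexing a defaultdict also returns 0, but stores the missing key.
ids_index = [tok2idx[tok] for tok in sentence]
size_after = len(tok2idx)

print(ids_get, ids_index)
print(size_before, size_after)
```

Both lookups give the same ids, but the vocabulary grows by one entry after the indexed lookup. This matters if the vocabulary size is read after unseen text has been looked up, for example to size an embedding layer.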

In [5]:
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Create dictionaries 
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)

## 3) Create dataset and dataloader

We now create a batch generator that lets us load batches of tweets. The tricky part is that all sequences within a batch need to have the same length, so we pad them with the special `<PAD>` token.

In [6]:
def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))
            
        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths
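The generator above also yields `lengths`, the true length of each sequence before padding. Since the padded label positions are filled with the id of the `O` tag, they can count as correct whenever the model predicts `O` there. A minimal sketch of how `lengths` could be used to score only real tokens is shown below; the helper `masked_accuracy` and the toy tag ids are assumptions made for illustration, not part of this notebook.

```python
def masked_accuracy(y_true, y_pred, lengths):
    """Token accuracy over real positions only; padded positions are skipped."""
    correct = total = 0
    for true_row, pred_row, length in zip(y_true, y_pred, lengths):
        for t, p in zip(true_row[:length], pred_row[:length]):
            correct += int(t == p)
            total += 1
    return correct / total if total else 0.0


# Two padded sequences of tag ids (0 stands for 'O' and is also the pad value).
y_true = [[1, 2, 0, 0], [0, 3, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 3, 0, 0]]
lengths = [2, 3]

# Plain accuracy counts all 8 positions, including the padded ones.
plain = sum(t == p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)) / 8
masked = masked_accuracy(y_true, y_pred, lengths)
print(plain, masked)
```

Here the plain accuracy is 7/8 while the masked accuracy is 4/5, because the three padded positions are trivially correct.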

# II/- Create and train the model

In this section, we create the model, i.e. the class **NERTagger**. The model is a simple one with only three layers: an embedding layer that transforms token indices into vectors, an LSTM layer, and a linear layer that maps the output of the LSTM to the required dimension, i.e. the number of possible tags. All of these layers are implemented in PyTorch and are easy to use.

We also define the **train_one_epoch** function, which handles the training procedure: it loads batches, computes the model output for each batch, computes the loss by comparing the predicted output with the true target, computes the gradients of the loss with *.backward()*, and finally takes an optimization step with *.step()*. The **test_model** function evaluates the model on the validation or test set.
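The loss used in this notebook combines a log-softmax over the tag scores with a negative log-likelihood (NLL) criterion. To show what that computes, here is a small plain-Python sketch for two tokens and three tags. The raw scores are hypothetical and the helpers `log_softmax` and `nll` are written for illustration only; they mimic, but are not, the PyTorch functions.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll(log_probs_per_token, targets):
    """Mean negative log-likelihood of the target tag at each token."""
    picked = [lp[t] for lp, t in zip(log_probs_per_token, targets)]
    return -sum(picked) / len(picked)


# Hypothetical raw scores for 2 tokens over 3 tags, and the true tag ids.
scores = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
targets = [0, 2]

log_probs = [log_softmax(row) for row in scores]
print(round(nll(log_probs, targets), 4))
```

The loss is small when the score of the true tag dominates the other scores at each token, and it grows as probability mass moves to wrong tags.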

In [7]:
class NERTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, n_tags):
        super(NERTagger, self).__init__()
        
        self.vocab_size = vocab_size
        self.n_tags = n_tags
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(self.hidden_dim,  self.n_tags)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.hidden2tag(lstm_out)
        tag_scores = F.log_softmax(tag_space, dim=2)
        return tag_scores
    
    
    
def train_one_epoch(model, train_tokens, train_tags, batch_size, criterion, optimizer, writer, epoch):
    n_samples = len(train_tokens)
    model.train()
    correct = 0
    running_loss = 0
    for batch_id, (x_batch, y_batch, lengths) in enumerate(batches_generator(batch_size, train_tokens, train_tags)):
        x_batch = torch.tensor(x_batch).long().to(device)
        y_batch = torch.tensor(y_batch).long().to(device)
    
        optimizer.zero_grad()
    
        # Use the model passed as argument rather than the global ner_tagger
        outputs = model(x_batch).permute(0, 2, 1)
    
        loss = criterion(outputs, y_batch)
    
        loss.backward()

        optimizer.step()
    
        y_pred = outputs.data.max(1)[1]
    
        correct += y_pred.eq(y_batch.data).cpu().sum() / y_batch.shape[1]
        running_loss += loss.item()
    
        if batch_id % 100 == 99:
            writer.add_scalar('training loss',
                              running_loss / 100,
                              epoch * (n_samples // batch_size) + batch_id)

            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_id * x_batch.shape[0], n_samples,
                       100. * batch_id / (n_samples // batch_size), running_loss / 100))

            running_loss = 0.0

            writer.flush()

    writer.add_scalar('Train/Loss', loss.item(), epoch)
    writer.flush()
    
    
    
def test_model(model, tokens, tags, criterion, writer=None, device=None, epoch=None):
    model.eval()
    i, loss, correct, n = [0, 0, 0, 0]
    n_samples = 0

    print("Testing..")
    with torch.no_grad():
        for batch_id, (x_batch, y_batch, lengths) in enumerate(batches_generator(batch_size, tokens, tags)):
            x_batch = torch.tensor(x_batch).long().to(device)
            y_batch = torch.tensor(y_batch).long().to(device)

            outputs = model(x_batch).permute(0, 2, 1)

            loss += criterion(outputs, y_batch)

            y_pred = outputs.data.max(1)[1]
            correct += y_pred.eq(y_batch.data).cpu().sum() / y_batch.shape[1]
            n += 1
            n_samples += int(x_batch.shape[0])

    loss /= n  # loss function already averages over batch size
    accuracy = 100. * correct / (n_samples)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        loss, correct, n_samples,
        accuracy))

    if writer:
        # Record loss and accuracy into the writer
        writer.add_scalar('Test/Loss', loss, epoch)
        writer.add_scalar('Test/Accuracy', accuracy, epoch)
        writer.flush()
    return accuracy


def train_model(model, train_tokens, train_tags, validation_tokens, validation_tags, batch_size, criterion, optimizer, writer, n_epochs=10):
    best_acc = 0.
    for epoch in range(0, n_epochs):
        print("Epoch %d" % epoch)
        train_one_epoch(model, train_tokens, train_tags, batch_size, criterion, optimizer, writer, epoch)
        acc = test_model(model, validation_tokens, validation_tags, criterion, writer, device, epoch)
        if acc > best_acc:
            best_acc = acc
            torch.save(model, os.path.join(path_to_model, "ner_best.pth"))

    # Close the writer once, after the last epoch
    writer.close()
        

def predict_sentence(sentence, tags=None, verbosity=0):
    X = torch.tensor(words2idxs(sentence)).to(device).unsqueeze(0)
    
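    # Inference only, so gradients are not needed
    with torch.no_grad():
        y_pred = ner_tagger(X)
    # squeeze(0) drops only the batch dimension, so one-token sentences also work
    y_pred = y_pred.squeeze(0).max(1)[1].cpu()
    
    res = list(y_pred.numpy())
    res = idxs2tags(res)

    if tags:
        # Compare against the tags passed as argument rather than a global variable
        y_true = torch.tensor(tags2idxs(tags))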
        correct = y_pred.eq(y_true.data).cpu().sum()
        if verbosity:
            print("Correct tags {}/{}".format(int(correct), len(res)))
    
    return res

We now train our model. Note that TensorBoard can be used to monitor the progress of the training process.

In [8]:
PATH_to_log_dir = './logdir'
if not os.path.exists(PATH_to_log_dir):
    os.makedirs(PATH_to_log_dir)
    
timestr = time.strftime("%Y%m%d_%H%M%S")
writer = SummaryWriter(os.path.join(path_to_logdir, timestr))

TensorBoard can be started by running the following command in a terminal:

    tensorboard --logdir {PATH_to_log_dir}

In [9]:
batch_size = 32
n_epochs = 10
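# Note: learning_rate, learning_rate_decay and dropout_keep_probability are not
# used by the cells shown here; Adam runs with its default learning rate.
learning_rate = 0.005
learning_rate_decay = 2
dropout_keep_probability = 0.5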
hidden_dim = 512
embedding_dim = 128


ner_tagger = NERTagger(embedding_dim=embedding_dim, hidden_dim=hidden_dim,
                       vocab_size=len(token2idx), n_tags=len(tag2idx))

ner_tagger = ner_tagger.to(device)

criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(ner_tagger.parameters())



train_model(ner_tagger, train_tokens, train_tags, validation_tokens,
            validation_tags, batch_size, criterion, optimizer,
            writer, n_epochs=n_epochs)


Epoch 0
Testing..

Test set: Average loss: 0.2572, Accuracy: 680/724 (93%)

Epoch 1
Testing..

Test set: Average loss: 0.2190, Accuracy: 682/724 (94%)

Epoch 2
Testing..

Test set: Average loss: 0.1920, Accuracy: 683/724 (94%)

Epoch 3
Testing..

Test set: Average loss: 0.1827, Accuracy: 687/724 (94%)

Epoch 4
Testing..

Test set: Average loss: 0.1792, Accuracy: 687/724 (94%)

Epoch 5
Testing..

Test set: Average loss: 0.1794, Accuracy: 686/724 (94%)

Epoch 6
Testing..

Test set: Average loss: 0.1894, Accuracy: 685/724 (94%)

Epoch 7
Testing..

Test set: Average loss: 0.2019, Accuracy: 687/724 (94%)

Epoch 8
Testing..

Test set: Average loss: 0.2098, Accuracy: 684/724 (94%)

Epoch 9
Testing..

Test set: Average loss: 0.2339, Accuracy: 682/724 (94%)



# III/- Evaluation

In this section, we report the performance of the neural NER tagger on the test set. We also give some examples of applying the trained model to sentences from the test set.

In [10]:
acc = test_model(ner_tagger, test_tokens, test_tags, criterion, device=device)

Testing..

Test set: Average loss: 0.2280, Accuracy: 676/724 (93%)



We notice that our model gives good results, with approximately 93% of correct tags on the test set. This result could be improved by using more complex architectures and by spending more time on tuning the hyperparameters. However, accuracy alone is not enough to judge the quality of the model, because 'O' tags are far more frequent than the other tags. We therefore also report precision, recall, and F1 score.
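The effect of the dominant 'O' class can be seen on a small made-up example: a tagger that misses half of the entities still reaches a high accuracy. The sketch below computes token-level precision, recall and F1 for one label in plain Python; the helper `precision_recall_f1` and the tag sequences are illustrative assumptions, not data from this notebook.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Token-level precision, recall and F1 for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up tags: 18 'O' tokens and 2 'B-person' tokens.
y_true = ["O"] * 18 + ["B-person", "B-person"]
# A tagger that finds only one of the two persons.
y_pred = ["O"] * 18 + ["B-person", "O"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(precision_recall_f1(y_true, y_pred, "B-person"))
```

Accuracy is 0.95 here, while recall for `B-person` is only 0.5. A tagger that predicts 'O' everywhere would still reach 0.90 accuracy with an F1 of 0 on every entity label.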

In [11]:
y_pred = []
y_test = []
for ii, sent in enumerate(test_tokens):
    if len(sent) > 1:
        y_pred.append(predict_sentence(sent))
        y_test.append(test_tags[ii])

In [12]:
labels = list(tag2idx)
labels.remove('O')

In [13]:
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

metrics.flat_f1_score(y_test, y_pred,
                      average='weighted', labels=labels)

0.3486622946970767
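With `average='weighted'`, the per-label F1 scores are averaged with weights proportional to each label's support, i.e. its number of true instances. A minimal sketch of that arithmetic follows, using made-up scores and supports rather than the values of this notebook; the helper `weighted_f1` is hypothetical.

```python
def weighted_f1(per_label):
    """Support-weighted mean of per-label F1 scores.

    `per_label` maps label -> (f1, support).
    """
    total = sum(support for _, support in per_label.values())
    return sum(f1 * support for f1, support in per_label.values()) / total


# Hypothetical per-label results: (F1, number of true instances).
per_label = {
    "B-person": (0.5, 100),
    "B-geo-loc": (0.8, 50),
    "B-movie": (0.0, 10),
}

macro = sum(f1 for f1, _ in per_label.values()) / len(per_label)
print(round(macro, 4), round(weighted_f1(per_label), 4))
```

Frequent labels dominate the weighted average, so a rare label with an F1 of 0 lowers it much less than it lowers the unweighted (macro) mean.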

In [14]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

               precision    recall  f1-score   support

    B-company      0.707     0.345     0.464        84
    I-company      0.611     0.275     0.379        40
   B-facility      0.537     0.468     0.500        47
   I-facility      0.765     0.426     0.547        61
    B-geo-loc      0.631     0.497     0.556       165
    I-geo-loc      0.760     0.365     0.494        52
      B-movie      0.000     0.000     0.000         8
      I-movie      0.333     0.100     0.154        10
B-musicartist      0.167     0.111     0.133        27
I-musicartist      0.000     0.000     0.000        24
      B-other      0.418     0.320     0.363       103
      I-other      0.277     0.247     0.261        93
     B-person      0.103     0.471     0.169       104
     I-person      0.340     0.258     0.293        66
    B-product      0.500     0.107     0.176        28
    I-product      0.349     0.250     0.291        60
 B-sportsteam      0.250     0.032     0.057        31
 I-sports

## Some Examples

We now apply the trained NER tagger to a few sentences sampled from the test set.

In [15]:
ner_tagger.eval()


indexes = random.sample(range(1, len(test_tokens)), 10)

for idx in indexes:
    sentence = test_tokens[idx]
    true_tags = test_tags[idx]
    print("We consider")
    print("Input sentence: ", ' '.join(sentence))
    print("Predicted NER tag", ' '.join(predict_sentence(sentence, tags=true_tags, verbosity=1)))
    print("*"*40)

We consider
Input sentence:  Prayers going out to the victims and families of the Wilmington Courthouse shooting .
Correct tags 12/14
Predicted NER tag O O O O O O O O O O B-other I-other O O
****************************************
We consider
Input sentence:  Prayers going out to the victims of the San Bernardino shooting , can't understand what drives a group of people to do something so vile .
Correct tags 26/26
Predicted NER tag O O O O O O O O B-geo-loc I-geo-loc O O O O O O O O O O O O O O O O
****************************************
We consider
Input sentence:  x1x_ne_x1x <URL> August 11 , 2015 at 02:02 AM 4
Correct tags 10/10
Predicted NER tag O O O O O O O O O O
****************************************
We consider
Input sentence:  havoc sia flight tomorrow at 1145am and i havent even packed
Correct tags 7/11
Predicted NER tag B-person B-person O O O B-facility I-facility O O O O
****************************************
We consider
Input sentence:  RT <USR> : There will be no 