<a href="https://colab.research.google.com/github/humblethinker/ltsm-based-ner/blob/master/Named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data

The data is read from Google Drive, which is mounted into the Colab runtime.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load the Twitter Named Entity Recognition corpus


In [None]:
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip()
        if not line:
            if tweet_tokens:
                tokens.append(tweet_tokens)
                tags.append(tweet_tags)
            tweet_tokens = []
            tweet_tags = []
        else:
            token, tag = line.split()

            if token.startswith('@'):
                token = '<USR>'
            elif token.lower().startswith('http://') or token.lower().startswith('https://'):
                token = '<URL>'
            tweet_tokens.append(token)
            tweet_tags.append(tag)
            
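    # Keep the last tweet when the file does not end with a blank line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags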

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameter tuning;
 - *test* data for final evaluation of the model.

In [None]:
train_tokens, train_tags = read_data('/content/drive/My Drive/data/train.txt')
# Assumes the validation split is stored as validation.txt next to train.txt
validation_tokens, validation_tags = read_data('/content/drive/My Drive/data/validation.txt')
test_tokens, test_tags = read_data('/content/drive/My Drive/data/test.txt')

Let us print the first three tweets together with their tags:

In [None]:
for i in range(3):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print()

RT	O
<USR>	O
:	O
Online	O
ticket	O
sales	O
for	O
Ghostland	B-musicartist
Observatory	I-musicartist
extended	O
until	O
6	O
PM	O
EST	O
due	O
to	O
high	O
demand	O
.	O
Get	O
them	O
before	O
they	O
sell	O
out	O
...	O

Apple	B-product
MacBook	I-product
Pro	I-product
A1278	I-product
13.3	I-product
"	I-product
Laptop	I-product
-	I-product
MD101LL/A	I-product
(	O
June	O
,	O
2012	O
)	O
-	O
Full	O
read	O
by	O
eBay	B-company
<URL>	O
<URL>	O

Happy	O
Birthday	O
<USR>	O
!	O
May	O
Allah	B-person
s.w.t	O
bless	O
you	O
with	O
goodness	O
and	O
happiness	O
.	O



### Prepare dictionaries

To train a neural network, we will use two mappings:
- {token}$\to${token id}: addresses the row of the embedding matrix that corresponds to the current token;
- {tag}$\to${tag id}: used to build the one-hot ground-truth probability distributions for computing the loss at the output of the network.
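
As a small illustration of the two mappings, the sketch below uses a hypothetical toy vocabulary and tag set. The names `toy_token2idx` and `toy_tag2idx` and the random `embeddings` matrix are made up for this example and are not taken from the corpus. It shows token ids selecting rows of an embedding matrix and tag ids being turned into one-hot vectors.

```python
import numpy as np

# Hypothetical toy mappings, for illustration only
toy_token2idx = {'<UNK>': 0, '<PAD>': 1, 'Apple': 2, 'MacBook': 3}
toy_tag2idx = {'O': 0, 'B-product': 1, 'I-product': 2}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(toy_token2idx), 5))  # one row per token id

tokens = ['Apple', 'MacBook', 'Pro']  # 'Pro' is not in the toy vocabulary
tags = ['B-product', 'I-product', 'I-product']

# Unknown tokens fall back to the <UNK> id
token_ids = [toy_token2idx.get(t, toy_token2idx['<UNK>']) for t in tokens]
tag_ids = [toy_tag2idx[t] for t in tags]

token_vectors = embeddings[token_ids]             # shape: (3, 5)
one_hot_tags = np.eye(len(toy_tag2idx))[tag_ids]  # shape: (3, 3)

print(token_ids)  # [2, 3, 0]
print(one_hot_tags)
```

Each row of `one_hot_tags` plays the role of the ground-truth distribution for one token when the loss is computed.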

In [None]:
from collections import defaultdict

In [None]:
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
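    # Unknown keys map to index 0, i.e. the first special token
    tok2idx = defaultdict(lambda: 0)

    # Exclude special tokens from the vocabulary so that none of them
    # (for example the 'O' tag) is assigned two different indices
    vocab = {item for seq in tokens_or_tags for item in seq} - set(special_tokens)
    idx2tok = [''] * (len(vocab) + len(special_tokens))

    # Special tokens take the first indices
    for i, token in enumerate(special_tokens):
        tok2idx[token] = i
        idx2tok[i] = token

    # Sort the vocabulary for a reproducible index assignment
    for i, token in enumerate(sorted(vocab), len(special_tokens)):
        tok2idx[token] = i
        idx2tok[i] = token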
        
    return tok2idx, idx2tok

In [None]:
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Create dictionaries 
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)

The following helper functions convert a sentence between tokens or tags and their ids.

In [None]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]

### Generate batches

In [None]:
import numpy as np


def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))
            
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths

## Build a recurrent neural network

This is the most important part of the project. Here we specify the network architecture from TensorFlow building blocks. We create an LSTM network that produces a probability distribution over tags for each token in a sentence. To take into account both the left and right context of a token, we use a bidirectional LSTM (Bi-LSTM). A dense layer on top performs the tag classification.
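
To make the data flow concrete, here is a minimal NumPy sketch of a bidirectional recurrent layer followed by a dense layer. It is a simplification, not the notebook's model: it uses plain tanh recurrent cells instead of LSTM cells, random weights, and no dropout, and all names (`simple_rnn`, `W_x_fw`, `W_dense`, and so on) are hypothetical. It only illustrates how the outputs of the two directions are aligned and concatenated before classification.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h):
    """Runs a plain tanh RNN over inputs of shape (time, features)."""
    hidden = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ W_x + hidden @ W_h)
        outputs.append(hidden)
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, emb_dim, n_hidden, n_tags = 4, 3, 5, 2
embedded = rng.normal(size=(seq_len, emb_dim))  # embeddings of one sentence

# Separate weights for the forward and backward directions
W_x_fw = rng.normal(size=(emb_dim, n_hidden))
W_h_fw = rng.normal(size=(n_hidden, n_hidden))
W_x_bw = rng.normal(size=(emb_dim, n_hidden))
W_h_bw = rng.normal(size=(n_hidden, n_hidden))
W_dense = rng.normal(size=(2 * n_hidden, n_tags))

out_fw = simple_rnn(embedded, W_x_fw, W_h_fw)              # left-to-right pass
out_bw = simple_rnn(embedded[::-1], W_x_bw, W_h_bw)[::-1]  # right-to-left pass, re-aligned to token order
features = np.concatenate([out_fw, out_bw], axis=1)        # (seq_len, 2 * n_hidden)
logits = features @ W_dense                                # (seq_len, n_tags)

print(features.shape, logits.shape)  # (4, 10) (4, 2)
```

Every row of `features` combines a summary of the tokens to the left (forward pass) with a summary of the tokens to the right (backward pass).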

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
class BiLSTMModel():
    pass

First, we need to create [placeholders](https://www.tensorflow.org/api_docs/python/tf/compat/v1/placeholder) to specify what data we are going to feed into the network during the execution time.

In [None]:
def declare_placeholders(self):
    """Specifies placeholders for the model."""

    self.input_batch = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch') 
    self.ground_truth_tags = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None, None], name='ground_truth_tags')
  
    self.lengths = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None], name='lengths') 
    
    self.dropout_ph = tf.compat.v1.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    
    self.learning_rate_ph = tf.compat.v1.placeholder(dtype=tf.float32, shape=[], name='learning_rate')

In [None]:
BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)

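Now, let us specify the layers of the neural network. The function below creates a randomly initialized embedding matrix, builds forward and backward LSTM cells wrapped with dropout, looks up the embeddings for the input batch, runs a bidirectional dynamic RNN over them, concatenates the outputs of both directions, and applies a dense layer to obtain the logits for each tag.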
 

In [None]:
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    """Specifies bi-LSTM architecture and computes logits for inputs."""
    
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_value=initial_embedding_matrix, name='embeddings_matrix', dtype=tf.float32)
    
    # LSTM cells for both directions, each wrapped with dropout on inputs, outputs and states
    forward_cell = tf.compat.v1.nn.rnn_cell.DropoutWrapper(
        tf.compat.v1.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)
    backward_cell = tf.compat.v1.nn.rnn_cell.DropoutWrapper(
        tf.compat.v1.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)

    embeddings = tf.compat.v1.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
    
    (rnn_output_fw, rnn_output_bw), _ = tf.compat.v1.nn.bidirectional_dynamic_rnn(forward_cell, backward_cell, embeddings, self.lengths, dtype=tf.float32)
    rnn_output = tf.compat.v1.concat([rnn_output_fw, rnn_output_bw], axis=2)

    self.logits = tf.compat.v1.layers.dense(rnn_output, n_tags, activation=None)

In [None]:
BiLSTMModel.__build_layers = classmethod(build_layers)

To compute the actual predictions of the neural network, we apply [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) to the last layer and find the most probable tags with [argmax](https://www.tensorflow.org/api_docs/python/tf/argmax).
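
The two steps can be illustrated outside TensorFlow with a short NumPy sketch. The logits below are hypothetical values for one sentence of three tokens and four tags, and the `softmax` helper is written here for illustration only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits: 3 tokens, 4 tags
toy_logits = np.array([[2.0, 0.5, 0.1, -1.0],
                       [0.0, 3.0, 0.2, 0.1],
                       [0.3, 0.2, 0.1, 1.5]])

probabilities = softmax(toy_logits)        # each row sums to 1
predicted = probabilities.argmax(axis=-1)  # most probable tag id per token

print(predicted)  # [0 1 3]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly yields the same tag ids. The softmax is only needed when the probabilities themselves are of interest.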

In [None]:
def compute_predictions(self):
    """Transforms logits to probabilities and finds the most probable tags."""
    
    softmax_output = tf.nn.softmax(self.logits)
    
    self.predictions = tf.math.argmax(softmax_output, axis=-1)

In [None]:
BiLSTMModel.__compute_predictions = classmethod(compute_predictions)

We will use [cross-entropy loss](http://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy), efficiently implemented in TF as 
[cross entropy with logits](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2).
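
As a rough NumPy illustration of a masked cross-entropy computed from logits, the sketch below zeroes the loss at padded positions and then averages over all positions, which mirrors the `tf.reduce_mean` over the masked loss tensor used in this notebook. The function name `masked_cross_entropy` and the toy inputs are hypothetical.

```python
import numpy as np

def masked_cross_entropy(logits, tag_ids, mask):
    """Per-token softmax cross-entropy, zeroed where mask is 0, averaged over all positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    one_hot = np.eye(logits.shape[-1])[tag_ids]
    per_token = -(one_hot * log_probs).sum(axis=-1)
    return (per_token * mask).mean()

toy_logits = np.zeros((1, 3, 2))        # uniform predictions over 2 tags
toy_tags = np.array([[1, 0, 0]])
toy_mask = np.array([[1.0, 1.0, 0.0]])  # the last position is padding

loss = masked_cross_entropy(toy_logits, toy_tags, toy_mask)
print(loss)
```

With uniform predictions over two tags, each real token contributes ln 2 and the padded token contributes 0, so the result is 2 ln 2 / 3. Dividing by the number of unmasked tokens instead of all positions is a common alternative that keeps the loss scale independent of the amount of padding.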

In [None]:
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    
    
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
  
    self.loss = tf.reduce_mean(tf.multiply(loss_tensor, mask))

In [None]:
BiLSTMModel.__compute_loss = classmethod(compute_loss)

The last thing to specify is how we want to optimize the loss. 
We use [Adam](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) optimizer with a learning rate from the corresponding placeholder. 
We also clip each gradient by its norm to guard against exploding gradients.
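
Clipping by norm can be sketched in a few lines of NumPy. The helper below is a hypothetical stand-in that follows the usual definition (rescale a gradient whose L2 norm exceeds the threshold and leave smaller gradients untouched). It is meant as an illustration, not as TensorFlow's implementation.

```python
import numpy as np

def clip_by_norm(gradient, clip_norm):
    """Rescales gradient so that its L2 norm does not exceed clip_norm."""
    norm = np.linalg.norm(gradient)
    if norm > clip_norm:
        return gradient * (clip_norm / norm)
    return gradient

large = np.array([3.0, 4.0])  # L2 norm 5.0, will be rescaled
small = np.array([0.3, 0.4])  # L2 norm 0.5, left unchanged

print(clip_by_norm(large, 1.0))  # [0.6 0.8]
print(clip_by_norm(small, 1.0))  # [0.3 0.4]
```

The direction of the gradient is preserved and only its length is bounded, so a single very large update cannot destabilize training.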

In [None]:
def perform_optimization(self):
    """Specifies the optimizer and train_op for the model."""
    
    self.optimizer = tf.compat.v1.train.AdamOptimizer(self.learning_rate_ph)
    self.grads_and_vars = self.optimizer.compute_gradients(self.loss)
    
    clip_norm = tf.cast(1.0, tf.float32)
    self.grads_and_vars = [(tf.clip_by_norm(grad, clip_norm), var) for grad, var in self.grads_and_vars]
    
    self.train_op = self.optimizer.apply_gradients(self.grads_and_vars)

In [None]:
BiLSTMModel.__perform_optimization = classmethod(perform_optimization)

In [None]:
def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
    self.__declare_placeholders()
    self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)
    self.__compute_predictions()
    self.__compute_loss(n_tags, PAD_index)
    self.__perform_optimization()

In [None]:
BiLSTMModel.__init__ = classmethod(init_model)

## Train the network and predict tags

In [None]:
def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
    feed_dict = {self.input_batch: x_batch,
                 self.ground_truth_tags: y_batch,
                 self.learning_rate_ph: learning_rate,
                 self.dropout_ph: dropout_keep_probability,
                 self.lengths: lengths}
    
    session.run(self.train_op, feed_dict=feed_dict)

In [None]:
BiLSTMModel.train_on_batch = classmethod(train_on_batch)

In [None]:
def predict_for_batch(self, session, x_batch, lengths):

    feed_dict = {self.input_batch: x_batch,
                 self.dropout_ph: 1.0,
                 self.lengths: lengths}

    predictions = session.run(self.predictions, feed_dict=feed_dict)
    
    return predictions

In [None]:
BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)



### Evaluation 

In [None]:
from evaluation import precision_recall_f1

In [None]:
def predict_tags(model, session, token_idxs_batch, lengths):
    """Performs predictions and transforms indices to tokens and tags."""
    
    tag_idxs_batch = model.predict_for_batch(session, token_idxs_batch, lengths)
    
    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, token_idxs_batch):
        tags, tokens = [], []
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):
            tags.append(idx2tag[tag_idx])
            tokens.append(idx2token[token_idx])
        tags_batch.append(tags)
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch
    
    
def eval_conll(model, session, tokens, tags, short_report=True):
    """Computes NER quality measures using the CoNLL shared task script."""
    
    y_true, y_pred = [], []
    for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
        tags_batch, tokens_batch = predict_tags(model, session, x_batch, lengths)
        if len(x_batch[0]) != len(tags_batch[0]):
            raise Exception("Incorrect length of prediction for the input, "
                            "expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
        predicted_tags = []
        ground_truth_tags = []
        for gt_tag_idx, pred_tag, token in zip(y_batch[0], tags_batch[0], tokens_batch[0]): 
            if token != '<PAD>':
                ground_truth_tags.append(idx2tag[gt_tag_idx])
                predicted_tags.append(pred_tag)

        y_true.extend(ground_truth_tags + ['O'])
        y_pred.extend(predicted_tags + ['O'])
        
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results

## Run your experiment

In [None]:
tf.compat.v1.reset_default_graph()
tf.compat.v1.disable_eager_execution()

model = BiLSTMModel(vocabulary_size=len(token2idx), n_tags=len(tag2idx), embedding_dim=200,
                    n_hidden_rnn=200, PAD_index=token2idx['<PAD>'])

batch_size = 32
n_epochs = 7
learning_rate = 0.005
learning_rate_decay = np.sqrt(2)
dropout_keep_probability = 0.5

Finally, we are ready to run the training!

In [None]:
sess = tf.compat.v1.Session()
sess.run(tf.compat.v1.global_variables_initializer())

print('Start training... \n')
for epoch in range(n_epochs):

    print('-' * 20 + ' Epoch {} '.format(epoch+1) + 'of {} '.format(n_epochs) + '-' * 20)
    print('Train data evaluation:')
    eval_conll(model, sess, train_tokens, train_tags, short_report=True)
    print('Validation data evaluation:')
    eval_conll(model, sess, validation_tokens, validation_tags, short_report=True)
    
    # Train the model
    for x_batch, y_batch, lengths in batches_generator(batch_size, train_tokens, train_tags):
        model.train_on_batch(sess, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability)
        
    learning_rate = learning_rate / learning_rate_decay
    
print('...training finished.')

Start training... 

-------------------- Epoch 1 of 7 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 71100 phrases; correct: 157.

precision:  0.22%; recall:  3.50%; F1:  0.42

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 8589 phrases; correct: 25.

precision:  0.29%; recall:  4.66%; F1:  0.55

-------------------- Epoch 2 of 7 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 2488 phrases; correct: 470.

precision:  18.89%; recall:  10.47%; F1:  13.47

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 221 phrases; correct: 40.

precision:  18.10%; recall:  7.45%; F1:  10.55

-------------------- Epoch 3 of 7 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 4523 phrases; correct: 1698.

precision:  37.54%; recall:  37.83%; F1:  37.68

Validation data evaluation:
processed 12836 tokens with 537 ph

Now let us see full quality reports for the final model on train, validation, and test sets.

In [None]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)

print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = eval_conll(model, sess, validation_tokens, validation_tags, short_report=False)

print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)

-------------------- Train set quality: --------------------
processed 105778 tokens with 4489 phrases; found: 4694 phrases; correct: 3780.

precision:  80.53%; recall:  84.21%; F1:  82.33

	     company: precision:   90.50%; recall:   93.31%; F1:   91.88; predicted:   663

	    facility: precision:    7.32%; recall:    6.69%; F1:    6.99; predicted:   287

	     geo-loc: precision:   88.97%; recall:   96.39%; F1:   92.53; predicted:  1079

	       movie: precision:   40.68%; recall:   35.29%; F1:   37.80; predicted:    59

	 musicartist: precision:   75.21%; recall:   78.45%; F1:   76.79; predicted:   242

	       other: precision:   81.77%; recall:   90.09%; F1:   85.73; predicted:   834

	      person: precision:   90.95%; recall:   96.39%; F1:   93.59; predicted:   939

	     product: precision:   84.40%; recall:   86.79%; F1:   85.58; predicted:   327

	  sportsteam: precision:   71.06%; recall:   76.96%; F1:   73.89; predicted:   235

	      tvshow: precision:   48.28%; recall:  