# Named Entity Recognition for Dialogue Systems

Named Entity Recognition (NER), also known as entity extraction, locates the named entities in a text and classifies them into predefined categories such as “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, and “product terminologies”. It adds semantic information to your content and helps you quickly identify the subject of any given text.

<b> Use-Cases of Named Entity Recognition </b>
* Classifying content for news providers: Named Entity Recognition can automatically scan entire articles and reveal the major people, organizations, and places discussed in them. Knowing the relevant tags for each article helps in automatically categorizing the articles in defined hierarchies and enables smooth content discovery.
* Efficient Search Algorithms: Let’s suppose you are designing an internal search algorithm for an online publisher that has millions of articles. If for every search query the algorithm ends up searching all the words in millions of articles, the process will take a lot of time. Instead, if Named Entity Recognition can be run once on all the articles and the relevant entities (tags) associated with each of those articles are stored separately, this could speed up the search process considerably. With this approach, a search term will be matched with only the small list of entities discussed in each article leading to faster search execution.
* Powering Content Recommendations: One of the major use cases of Named Entity Recognition is automating the recommendation process. Recommendation systems dominate how we discover new content and ideas today. The example of Netflix shows that an effective recommendation system can work wonders for the fortunes of a media company by making its platform more engaging and even addictive. For news publishers, using Named Entity Recognition to recommend similar articles is a proven approach: extract the entities from a given article and recommend the other articles that mention the most similar entities.
* Understanding user intent: When building dialogue systems, reliable named entity recognition is a vital component of understanding user intent. Here is an example of a conversation between a bot and a user where the entities to recognize are highlighted:

<br>
<img src="img/entity_extraction.png" style= "width:400px;height:400px" />
<br>

In this tutorial, we’re going to implement and train a Named Entity Recognition model with Keras that can be used as part of a dialogue system.


# Build the dataset from Kaggle

The dataset that we are going to use to train and evaluate our model is the Annotated Corpus for Named Entity Recognition. It is hosted on Kaggle and can be downloaded <a href="https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus"> here </a>. 

Number of tagged entities:

'O': 1146068, 'geo-nam': 58388, 'org-nam': 48034, 'per-nam': 23790, 'gpe-nam': 20680, 'tim-dat': 12786, 'tim-dow': 11404, 'per-tit': 9800, 'per-fam': 8152, 'tim-yoc': 5290, 'tim-moy': 4262, 'per-giv': 2413, 'tim-clo': 891, 'art-nam': 866, 'eve-nam': 602, 'nat-nam': 300, 'tim-nam': 146, 'eve-ord': 107, 'per-ini': 60, 'org-leg': 60, 'per-ord': 38, 'tim-dom': 10, 'per-mid': 1, 'art-add': 1

Essential info about entities:

    geo = Geographical Entity
    org = Organization
    per = Person
    gpe = Geopolitical Entity
    tim = Time indicator
    art = Artifact
    eve = Event
    nat = Natural Phenomenon

Total word count: 1354149

Target data column: "tag"
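Counts like the ones listed above can be produced with a few lines of Python. The sketch below is illustrative only: `toy_tags` is a hypothetical stand-in for the real "tag" column, and `category_of` is a helper name introduced here that assumes tags follow the `category-subtype` pattern shown in the list.

```python
from collections import Counter

# Hypothetical miniature "tag" column (the real one has over a million entries)
toy_tags = ["O", "geo-nam", "O", "per-nam", "geo-nam", "tim-dow", "O"]


def category_of(tag):
    """Return the coarse category of a tag, e.g. 'geo-nam' -> 'geo'."""
    return tag if tag == "O" else tag.split("-")[0]


# Count full tags and coarse categories separately
tag_counts = Counter(toy_tags)
category_counts = Counter(category_of(t) for t in toy_tags)

print(tag_counts.most_common())
print(category_counts.most_common())
```

Grouping by category makes the class imbalance easy to see: the "O" tag (no entity) dwarfs every entity category in the real corpus.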

In [None]:
"""Read, split and save the kaggle dataset for our model"""

import csv
import os
import sys
import numpy as np

def load_dataset(path_csv):
    """Loads dataset into memory from csv file"""
    # Open the csv file, need to specify the encoding for python3
    use_python3 = sys.version_info[0] >= 3
    with (open(path_csv, encoding="windows-1252") if use_python3 else open(path_csv)) as f:
        csv_file = csv.reader(f, delimiter=',')
        dataset = []
        words, tags = [], []

        # Each line of the csv corresponds to one word
        for idx, row in enumerate(csv_file):
            if idx == 0: continue
            sentence, word, pos, tag = row
            # If the first column is non empty it means we reached a new sentence
            if len(sentence) != 0:
                if len(words) > 0:
                    assert len(words) == len(tags)
                    dataset.append((words, tags))
                    words, tags = [], []
            try:
                word, tag = str(word), str(tag)
                words.append(word)
                tags.append(tag)
            except UnicodeDecodeError as e:
                print("An exception was raised, skipping a word: {}".format(e))
                pass

    return dataset


def save_dataset(dataset, save_dir):
    """Writes sentences.txt and labels.txt files in save_dir from dataset
    Args:
        dataset: ([(["a", "cat"], ["O", "O"]), ...])
        save_dir: (string)
    """
    # Create directory if it doesn't exist
    print("Saving in {}...".format(save_dir))
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Export the dataset
    with open(os.path.join(save_dir, 'sentences.txt'), 'w') as file_sentences:
        with open(os.path.join(save_dir, 'labels.txt'), 'w') as file_labels:
            for words, tags in dataset:
                file_sentences.write("{}\n".format(" ".join(words)))
                file_labels.write("{}\n".format(" ".join(tags)))
    print("- done.")


# Check that the dataset exists (make sure you downloaded `ner_dataset.csv`, not `ner.csv`)
path_dataset = 'data/ner/ner_dataset.csv'
msg = "{} file not found. Make sure you have downloaded the right dataset".format(path_dataset)
assert os.path.isfile(path_dataset), msg

# Load the dataset into memory
print("Loading Kaggle dataset into memory...")
dataset = load_dataset(path_dataset)
print("- done.")

# Split the dataset into train and hold (dummy split with no shuffle)
train_dataset = dataset[:int(0.7*len(dataset))]
hold_dataset = dataset[int(0.7*len(dataset)):]

# Save the datasets to files
save_dataset(train_dataset, 'data/ner/train')
save_dataset(hold_dataset, 'data/ner/hold')
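
The split above keeps the original sentence order. If the corpus is ordered by source or topic, a shuffled split may give a more representative hold-out set. Here is a minimal sketch of that alternative; `shuffled_split`, its `seed` parameter, and `toy_dataset` are hypothetical names introduced for illustration and are not part of the original pipeline.

```python
import random


def shuffled_split(dataset, train_fraction=0.7, seed=42):
    """Return (train, hold) after shuffling a copy of the dataset."""
    shuffled = list(dataset)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy dataset in the same (words, tags) format as load_dataset()
toy_dataset = [(["w{}".format(i)], ["O"]) for i in range(10)]
toy_train, toy_hold = shuffled_split(toy_dataset)
print(len(toy_train), len(toy_hold))
```

Fixing the seed keeps the split identical across runs, so results remain comparable.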

### Load the training data to memory

In [None]:
data_path = 'data/ner/train/'

# The files end with a newline, so strip it to avoid an empty last entry
with open(data_path+'sentences.txt', 'r') as f:
    sentences = f.read().rstrip('\n').split('\n')
with open(data_path+'labels.txt', 'r') as f:
    sentence_tags = f.read().rstrip('\n').split('\n')

print(len(sentences), ' examples for training')
print('\nExample of training sentence:',sentences[0])
print('\nLabel',sentence_tags[0])


In [None]:
# reduce the size of the dataset
sentences = sentences[:10000]
sentence_tags = sentence_tags[:10000]

### Pre-processing

First, let's tokenize all the sentences:

In [None]:
sentences =[sent.split(' ') for sent in sentences]
sentence_tags =[sent.split(' ') for sent in sentence_tags]

In [None]:
print(sentences[0])
print(sentence_tags[0])

Before training a model, we need to split the data into training and testing sets. Let’s use the train_test_split function from Scikit-Learn:

In [None]:
from sklearn.model_selection import train_test_split
 
(train_sentences, 
test_sentences, 
train_tags, 
test_tags) = train_test_split(sentences, sentence_tags, test_size=0.1)
 

Keras needs to work with numbers, not with words (or tags), so let’s assign a unique integer to each word and each tag. We compute the set of unique words (and tags), turn it into a list, and index its elements in a dictionary. These dictionaries are the word vocabulary and the tag vocabulary. We’ll also add a special value for padding the sequences (more on that later), and another one for unknown words (OOV, out of vocabulary).

In [None]:

words, tags = set([]), set([])
 
for s in train_sentences:
    for w in s:
        words.add(w.lower())

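# The tag set is a fixed label inventory, so collect it from all the data;
# otherwise a rare tag that only occurs in the test split raises a KeyError later
for ts in sentence_tags:
    for t in ts:
        tags.add(t)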

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs
 
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0  # The special value used for padding

Let’s now convert both the words and the tags to integers.

In [None]:
train_sentences_X, test_sentences_X, train_tags_y, test_tags_y = [], [], [], []
 
for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    train_sentences_X.append(s_int)

for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    test_sentences_X.append(s_int)

for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])

for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])

print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])
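
The same word-to-id lookup is repeated for every split above. As a more compact alternative, the sketch below wraps it in a helper that uses `dict.get` for the OOV fallback. The name `encode_sentence` and the toy vocabulary are assumptions made for illustration.

```python
def encode_sentence(words, word2index, oov_token='-OOV-'):
    """Map words to ids, falling back to the OOV id for unseen words."""
    oov_id = word2index[oov_token]
    # Lowercase each word to match how the vocabulary was built
    return [word2index.get(w.lower(), oov_id) for w in words]


# Hypothetical toy vocabulary, for illustration only
toy_word2index = {'-PAD-': 0, '-OOV-': 1, 'london': 2, 'attacks': 3}
print(encode_sentence(['London', 'attacks', 'yesterday'], toy_word2index))  # [2, 3, 1]
```

With such a helper, each split could be encoded in a single list comprehension, for example `[encode_sentence(s, word2index) for s in train_sentences]`.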

Keras can only deal with fixed-size sequences. We will pad all the sequences on the right with a special value (0 as the index and “-PAD-” as the corresponding word/tag) up to the length of the longest sequence in the training set. Let’s compute the maximum length of all the sequences.

In [None]:
MAX_LENGTH = len(max(train_sentences_X, key=len))
print(MAX_LENGTH) 

Now we can use Keras’s convenient pad_sequences utility function:

In [None]:
from keras.preprocessing.sequence import pad_sequences
 
train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
 
print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])
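
To make the padding behaviour concrete, here is a plain-Python sketch of what `pad_sequences(..., padding='post')` does with its default truncation setting (`truncating='pre'`). The helper name `pad_post` is hypothetical, and the sketch returns lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad short sequences; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # 'pre' truncation drops items from the front
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


print(pad_post([[5, 8], [3, 9, 4, 7]], maxlen=3))  # [[5, 8, 0], [9, 4, 7]]
```

Because MAX_LENGTH is computed on the training sentences only, a longer test sentence would lose its leading tokens in this way.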

## Network architecture

Let’s now define the model. Here’s what we need to have in mind:

* We’ll need an <b>embedding</b> layer that maps each word id to a dense word vector. 
* We’ll need an <b>LSTM layer</b> with a <b>Bidirectional modifier</b>. The bidirectional modifier lets the LSTM see the next values in the sequence, not just the previous ones.
* We need to set the return_sequences=True parameter so that the LSTM outputs a value for every position in the sequence, not only the final one.
* After the LSTM layer we need a <b>Dense layer</b> (or fully-connected layer) that picks the appropriate entity tag. Since this dense layer needs to run on each element of the sequence, we need to add the TimeDistributed modifier.



In [None]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam
 

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))
 
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
 
model.summary()
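
To help read the output of `model.summary()`, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM layer (four gates, each with input weights, recurrent weights, and one bias vector). The vocabulary size and number of tags are hypothetical placeholders, so the totals will differ from your run.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim


vocab_size, n_tags = 10000, 18        # hypothetical values
emb = embedding_params(vocab_size, 128)
bilstm = 2 * lstm_params(128, 256)    # forward and backward LSTMs
dense = dense_params(2 * 256, n_tags) # both directions are concatenated
print(emb, bilstm, dense, emb + bilstm + dense)
```

The embedding table tends to dominate the total, since it grows linearly with the vocabulary size.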

There’s one more thing to do before training. We need to transform the sequences of tags to sequences of <b>One-Hot Encoded tags</b>. This is what the Dense Layer outputs. Here’s a function that does that:

In [None]:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)

Here’s what the one-hot encoded tags look like:

In [None]:
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
print(cat_train_tags_y[0])

The moment we’ve all been waiting for, training the model:

In [None]:
model.fit(train_sentences_X,
          to_categorical(train_tags_y, len(tag2index)), 
          batch_size=128, 
          epochs=1, 
          validation_split=0.2)

Let’s evaluate our model on the data we’ve kept aside for testing:

In [None]:
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")   
 

If you got a very high accuracy, don’t get overexcited. There’s a catch: much of that accuracy comes from the padding, which makes up a large share of every sequence and is really easy to get right. Let’s set aside this issue for now.

Let’s take two test sentences:

In [None]:
test_samples = [
    "At the Group of Eight summit in Scotland , Japanese Prime Minister Junichiro Koizumi said he is outraged by the London attacks . He noted terrorist acts must not be forgivable .".split(),
    "Sarin gas attacks on the Tokyo subway system in 1995 killed 12 people and injured thousands .".split(),
    
]
print(test_samples)

Let’s transform them into padded sequences of word ids:

In [None]:
test_samples_X = []
for s in test_samples:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
    test_samples_X.append(s_int)

test_samples_X = pad_sequences(test_samples_X, maxlen=MAX_LENGTH, padding='post')
print(test_samples_X)

Let’s make our first predictions:

In [None]:
predictions = model.predict(test_samples_X)
print(predictions, predictions.shape)

Pretty hard to read, right? We need the “reverse” operation of to_categorical, which maps each vector of class probabilities back to a tag:

In [None]:
def logits_to_tokens(sequences, index):
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])
 
        token_sequences.append(token_sequence)
 
    return token_sequences

Here’s how the predictions look:

In [None]:
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))

For most of the sentences, the largest part is “padding tokens”. These are really easy to guess, hence the super high performance. Let’s write a custom accuracy metric that ignores the padding:

In [None]:
from keras import backend as K
 
def ignore_class_accuracy(to_ignore=0):
    def ignore_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)
 
        # Mask on the true labels so that exactly the padding positions are ignored
        ignore_mask = K.cast(K.not_equal(y_true_class, to_ignore), 'float32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'float32') * ignore_mask
        accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1.0)
        return accuracy
    return ignore_accuracy
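
A small worked example shows how much padding can inflate the plain accuracy. This is a plain-Python sketch of the idea, not the Keras metric itself: the helper names and the toy label sequences are made up for illustration, and the masked version skips positions whose true tag is the padding id.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def masked_accuracy(y_true, y_pred, to_ignore=0):
    """Accuracy over positions whose true label is not `to_ignore`."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != to_ignore]
    if not kept:
        return 0.0
    return sum(t == p for t, p in kept) / len(kept)


# 4 real tokens followed by 6 padding positions (tag id 0)
y_true = [3, 1, 1, 2, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(plain_accuracy(y_true, y_pred))   # 0.8
print(masked_accuracy(y_true, y_pred))  # 0.5
```

Only half of the real tokens are tagged correctly here, yet the plain accuracy reports 80% because every padding position counts as a hit.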

Let’s rebuild the model, adding the ignore_class_accuracy metric at the compile stage:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam
 

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))
 
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy', ignore_class_accuracy(0)])
 
model.summary()

Let’s now retrain:

In [None]:
model.fit(train_sentences_X, to_categorical(train_tags_y, len(tag2index)), batch_size=128, epochs=1, validation_split=0.2)

In [None]:
predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))
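
For a dialogue system, the useful output is usually the list of entity tokens rather than the full tag sequence. The sketch below pairs each word with its predicted tag and drops non-entity and padding positions. The function name `extract_entities` and the toy prediction are hypothetical, and the sketch assumes the tag vocabulary uses 'O' for non-entities and '-PAD-' for padding.

```python
def extract_entities(words, tags, outside='O', pad='-PAD-'):
    """Pair each word with its predicted tag, keeping only entity tokens."""
    # zip stops at the shorter input, which drops trailing padding predictions
    return [(w, t) for w, t in zip(words, tags) if t not in (outside, pad)]


# Hypothetical prediction for a short utterance, padded to length 8
toy_words = "Book a flight to London on Monday".split()
toy_tags = ['O', 'O', 'O', 'O', 'geo-nam', 'O', 'tim-dow', '-PAD-']
print(extract_entities(toy_words, toy_tags))
# [('London', 'geo-nam'), ('Monday', 'tim-dow')]
```

Applied to the predictions above, this would be called once per sample, passing `test_samples[i]` and the i-th tag sequence returned by `logits_to_tokens`. The alignment holds as long as the sample is no longer than MAX_LENGTH.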