## Named Entity Recognition Using a Recurrent Neural Network


Named entity recognition (NER) is the task of labeling the named entities in a sentence with their categories. For example, Paris, London, and Tokyo share the label "Location," while the European Commission, the BBC, and the Foreign Ministry are all "Organizations." These tags can then support other components of a larger system.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, optimizers
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Preprocessing
To build a model for the named entity recognition task, we first need to convert the words in each sentence, and their corresponding named entity tags, to numbers. Then we need to make all sentences the same length by adding zeros to the end of every sentence shorter than the longest sentence in the dataset (padding).

### Step 1: Build Corpus & Tags dictionary
To convert words to numbers, we need to: <br>
(1) read in the sentences and named entity tags from the training dataset.<br>
(2) break the sentences into lists of words (tokenization) and lists of tags.<br>
(3) create a dictionary "Corpus" that maps every word to a number, plus an "UNK" entry for unknown words, and a second dictionary "Tags" that maps every named entity tag to a number.

In [2]:
#Sets to store unique words and named entity tags
Corpus_Train = set()
NE_tag_Train = set()

#The context manager closes the file even if an error occurs while reading
with open('./conll2003/train.txt') as Train_text:
    for line in Train_text:
        splits = line.split(' ')
        splits[-1] = splits[-1].rstrip("\n")
        #Token lines have several space-separated columns; blank lines separate sentences
        if len(splits) > 1:
            word = splits[0]
            NE_tag = splits[-1]
            Corpus_Train.add(word)
            NE_tag_Train.add(NE_tag)

#Create the word and tag dictionaries
AllWords_l = sorted(Corpus_Train)
AllNE_tag_l = sorted(NE_tag_Train)

#Map every word and tag to its position in the sorted list
Corpus = {word: i for i, word in enumerate(AllWords_l)}
Tags = {tag: j for j, tag in enumerate(AllNE_tag_l)}

#Reserve the last index of each dictionary for unknown entries
Corpus['UNK'] = len(AllWords_l)
Tags['UNK'] = len(AllNE_tag_l)
num_labels = len(Tags)

In [3]:
print("The resulting dictionary of named entity tags looks like:")
Tags

The resulting dictionary of named entity tags looks like:


{'B-LOC': 0,
 'B-MISC': 1,
 'B-ORG': 2,
 'B-PER': 3,
 'I-LOC': 4,
 'I-MISC': 5,
 'I-ORG': 6,
 'I-PER': 7,
 'O': 8,
 'UNK': 9}
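
The mapping above gives index 0 to a real entry (the tag 'B-LOC', and the first word in sorted order). A common variation, sketched here on a toy word list, reserves index 0 for a padding token so that zero-padding later cannot be confused with a real word or tag. The names `build_index`, `PAD`, and `toy_words` are hypothetical and not part of the notebook.

```python
def build_index(items, specials=("PAD", "UNK")):
    """Map each unique item to an integer, reserving the lowest indices for special tokens."""
    index = {token: i for i, token in enumerate(specials)}
    for item in sorted(set(items)):
        if item not in index:
            index[item] = len(index)
    return index

# Toy data for illustration only
toy_words = ["EU", "rejects", "German", "call", "EU"]
toy_index = build_index(toy_words)
print(toy_index)
# {'PAD': 0, 'UNK': 1, 'EU': 2, 'German': 3, 'call': 4, 'rejects': 5}
```

With this layout, `num_labels` and the embedding `input_dim` would still be `len(index)`, but index 0 would never stand for a real token.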

### Step 2: Turn sentences into lists of tokens and named entity tags
The function below tokenizes the sentences in a dataset file. It can be applied to the training, validation, and test sets.

In [4]:
def SentenceToList(filePath):
    Train_text = open(filePath)
    
    #list of lists of sentences
    SenteceCollect = list()
    NE_tagCollect = list()

    #list to store individual sentences
    Sentence_l = list()
    NE_tag_l = list()

    #max length
    max_sent_len = 0

    for line in Train_text:
    
        splits = line.split(' ')
        splits[-1] = splits[-1].rstrip("\n")
    
        if len(splits)  > 1:
            word = splits[0]
            NE_tag = splits[-1]
            Sentence_l.append(word)
            NE_tag_l.append(NE_tag)
        else:
            SenteceCollect.append(Sentence_l)
            if len(Sentence_l) > max_sent_len:
                max_sent_len = len(Sentence_l) 
            NE_tagCollect.append(NE_tag_l)
            Sentence_l =list()
            NE_tag_l = list()
    Train_text.close()
    return SenteceCollect, NE_tagCollect, max_sent_len

In [5]:
Train_Sent_l, Train_Tag_l, Train_MaxLen = SentenceToList('./conll2003/train.txt')
Val_Sent_l, Val_Tag_l, Val_MaxLen = SentenceToList('./conll2003/valid.txt')
Test_Sent_l, Test_Tag_l, Test_MaxLen = SentenceToList('./conll2003/test.txt')

print("The first two sentences in the train set tokenized:")
print(Train_Sent_l[1:3])
print("Named entity labels of the first two sentences in the train set:")
print(Train_Tag_l[1:3])

print("Max sentence length of train set:")
print(Train_MaxLen)

print("Max sentence length of Validation set:")
print(Val_MaxLen)
print("Max sentence length of Test set:")
print(Test_MaxLen)

#Pad to the longest sentence across all three sets (here the test set's maximum)
Max_Len = max(Train_MaxLen, Val_MaxLen, Test_MaxLen)

The first two sentences in the train set tokenized:
[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn']]
Named entity labels of the first two sentences in the train set:
[['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'], ['B-PER', 'I-PER']]
Max sentence length of train set:
113
Max sentence length of Validation set:
109
Max sentence length of Test set:
124
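
As a rough illustration of the parsing logic, the sketch below reads the same kind of column format from an in-memory string. It differs from `SentenceToList` in two ways that are assumptions of this sketch rather than the notebook's behavior: it skips empty sentences, and it flushes a final sentence that is not followed by a blank line. The function name `read_conll` and the sample text are hypothetical.

```python
import io

def read_conll(handle):
    """Return (sentences, tags) from lines of 'word ... tag' separated by blank lines."""
    sentences, tags = [], []
    words, labels = [], []
    for line in handle:
        parts = line.split()
        if len(parts) > 1:
            # Token line: first column is the word, last column is the entity tag
            words.append(parts[0])
            labels.append(parts[-1])
        elif words:
            # Blank line closes a non-empty sentence
            sentences.append(words)
            tags.append(labels)
            words, labels = [], []
    if words:
        # Flush a final sentence that has no trailing blank line
        sentences.append(words)
        tags.append(labels)
    return sentences, tags

# Illustrative sample; the middle columns stand in for the part-of-speech and chunk columns
sample = (
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)
toy_sents, toy_tags = read_conll(io.StringIO(sample))
print(toy_sents)
print(toy_tags)
```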


### Step 3: Map words and tags to numbers
The function below maps tokens to their indices in the Corpus and Tags dictionaries.

In [6]:
def MapTokenInd(sentences, TokenDic):
    SentencesTokenInd = []
    for sentence in sentences:
        TokenIndices = []
        for token in sentence:
            #token = token.lower()
            if token in TokenDic:
                TokenIdx = TokenDic[token]
            else:
                TokenIdx = TokenDic['UNK']
            TokenIndices.append(TokenIdx)
        SentencesTokenInd.append(TokenIndices)
    return SentencesTokenInd

In [7]:
#Training set
Train_Sent_Ind_l = MapTokenInd(Train_Sent_l, Corpus)
Train_Tag_Ind_l = MapTokenInd(Train_Tag_l, Tags)
#Validation set
Val_Sent_Ind_l = MapTokenInd(Val_Sent_l, Corpus)
Val_Tag_Ind_l = MapTokenInd(Val_Tag_l, Tags)
#Test Set
Test_Sent_Ind_l = MapTokenInd(Test_Sent_l, Corpus)
Test_Tag_Ind_l = MapTokenInd(Test_Tag_l, Tags)


print("The first two sentences in the training set mapped to indices:")
print(Train_Sent_Ind_l[1:3])
print("Named entity labels of the first two sentences in the train set mapped to label indices:")
print(Train_Tag_Ind_l[1:3])

The first two sentences in the training set mapped to indices:
[[6419, 20820, 7228, 14821, 22699, 14672, 5083, 18389, 124], [10720, 4910]]
Named entity labels of the first two sentences in the train set mapped to label indices:
[[2, 8, 1, 8, 8, 8, 1, 8, 8], [3, 7]]
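
The same lookup-with-fallback idea can be written more compactly with `dict.get`. This is a minimal sketch on a toy vocabulary; `map_tokens` and `toy_vocab` are hypothetical names, not part of the notebook.

```python
def map_tokens(sentences, token_to_index, unknown="UNK"):
    """Replace each token by its index, falling back to the index of the unknown token."""
    fallback = token_to_index[unknown]
    return [[token_to_index.get(token, fallback) for token in sentence]
            for sentence in sentences]

# Toy vocabulary for illustration only
toy_vocab = {"EU": 0, "rejects": 1, "UNK": 2}
toy_mapped = map_tokens([["EU", "rejects", "lamb"]], toy_vocab)
print(toy_mapped)  # [[0, 1, 2]] because "lamb" is not in the toy vocabulary
```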


### Step 4: Padding
The model expects all input sequences to have the same length, so we append zeros to the end of each sentence (padding) until it matches the longest sentence in the dataset, which here is 124 tokens. Note that 0 is also a valid word index and the index of the 'B-LOC' tag, so padded positions are not distinguishable from those entries.

In [8]:
#Padding
Train_Sent_Padded = pad_sequences(Train_Sent_Ind_l, maxlen=Max_Len, padding='post')
Train_Tag_Padded = pad_sequences(Train_Tag_Ind_l, maxlen=Max_Len, padding='post')
Train_Labels_OneHot = [to_categorical(i, num_classes=num_labels) for i in Train_Tag_Padded]


Val_Sent_Padded = pad_sequences(Val_Sent_Ind_l, maxlen=Max_Len, padding='post')
Val_Tag_Padded = pad_sequences(Val_Tag_Ind_l, maxlen=Max_Len, padding='post')
Val_Labels_OneHot = [to_categorical(j, num_classes=num_labels) for j in Val_Tag_Padded]

Test_Sent_Padded = pad_sequences(Test_Sent_Ind_l, maxlen=Max_Len, padding='post')
Test_Tag_Padded = pad_sequences(Test_Tag_Ind_l, maxlen=Max_Len, padding='post')
Test_Labels_OneHot = [to_categorical(k, num_classes=num_labels) for k in Test_Tag_Padded]


print("After padding, the first two sentences and tags of the training set become:")
print(Train_Sent_Padded[1:3])
print(Train_Tag_Padded[1:3])

After padding, the first two sentences and tags of the training set become:
[[ 6419 20820  7228 14821 22699 14672  5083 18389   124     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [10720  4910     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0  
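
To make the padding and one-hot steps concrete, here is a small NumPy sketch of what happens to two toy sequences. It is an approximation of `pad_sequences(..., padding='post')` under one stated assumption: sequences longer than `max_len` are cut at the end, whereas `pad_sequences` by default truncates from the beginning. The helper `pad_post` and the toy data are hypothetical.

```python
import numpy as np

def pad_post(sequences, max_len, value=0):
    """Right-pad every sequence with `value` up to `max_len` (cutting longer ones at the end)."""
    padded = np.full((len(sequences), max_len), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        padded[row, :len(trimmed)] = trimmed
    return padded

toy_ids = [[5, 3, 7], [2]]
toy_padded = pad_post(toy_ids, max_len=4)
print(toy_padded)
# [[5 3 7 0]
#  [2 0 0 0]]

# One-hot encode by indexing an identity matrix with the padded label indices
toy_one_hot = np.eye(8, dtype=np.float32)[toy_padded]
print(toy_one_hot.shape)  # (2, 4, 8)
```

Each padded position becomes a one-hot vector for class 0, which is why the choice of what index 0 means matters for the labels.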

## Step 5: Model Buildup
Here, we use TensorFlow to build a sequential model consisting of an embedding layer, a bidirectional Long Short-Term Memory (LSTM) layer, and a dense output layer. The architecture (the layers in the model) and the other settings (e.g. learning rate, activation function, dropout rate, regularization) are all hyperparameters that can be tuned to optimize evaluation metrics on the training and validation sets.

In [9]:
def get_bilstm_lstm_model():
    model = tf.keras.Sequential()

    # Add Embedding layer
    model.add(layers.Embedding(input_dim=len(Corpus), output_dim=64, 
                        input_length=Max_Len))

    # Add bidirectional LSTM
    model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))

    # Add LSTM
    #model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    # Add timeDistributed Layer
    model.add(layers.TimeDistributed(layers.Dense(num_labels, activation="softmax")))

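    #Optimizer
    #Note: this optimizer object is not passed to compile() below, which receives the
    #string 'adam' and therefore uses the default Adam settings (learning_rate=0.001).
    #Pass optimizer=adam instead to train with the learning rate set here.
    adam = optimizers.Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999)

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])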
    model.summary()
    
    return model

In [10]:
print("Model structure summary:")
model = get_bilstm_lstm_model()
#plot_model(model)

Model structure summary:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 124, 64)           1512000   
                                                                 
 bidirectional (Bidirectiona  (None, 124, 128)         66048     
 l)                                                              
                                                                 
 time_distributed (TimeDistr  (None, 124, 10)          1290      
 ibuted)                                                         
                                                                 
Total params: 1,579,338
Trainable params: 1,579,338
Non-trainable params: 0
_________________________________________________________________
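
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic; the vocabulary size of 23,625 is inferred from the embedding row (1,512,000 / 64) rather than printed anywhere in the notebook, and the variable names are illustrative.

```python
vocab_size = 23625   # inferred: 1,512,000 embedding parameters / 64 dimensions
embed_dim = 64
lstm_units = 64
num_tags = 10

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_one_direction

# Dense layer applied at every time step: input is the concatenated forward/backward state
dense_params = (2 * lstm_units) * num_tags + num_tags

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 1512000 66048 1290 1579338
```

Almost all of the parameters sit in the embedding table, so vocabulary size is the main driver of model size here.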




## Step 6: Model training
We train the model defined above on the training set. The number of epochs (complete passes over the training data) and the batch size (the number of examples processed per weight update; within an epoch the data is processed one batch at a time until the whole training set has been seen) can be adjusted to maximize performance.

In [11]:
def train_model(X, y, model):
    loss = list()
    for i in range(5):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=500, verbose=1, epochs=1)
        loss.append(hist.history['loss'][0])
    return loss

In [12]:
#Train the Model
results = pd.DataFrame()
results['with_add_lstm'] = train_model(Train_Sent_Padded, np.array(Train_Labels_OneHot), model)
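
To see how epochs and batch size interact, the following sketch counts weight updates for the settings used above (5 epochs, batch size 500). The training-set size of 14,200 sentences is a hypothetical figure chosen for illustration, not a number taken from the notebook.

```python
import math

num_sentences = 14200   # hypothetical training-set size, for illustration only
batch_size = 500
epochs = 5

# The last batch of an epoch may be smaller than batch_size, hence the ceiling
batches_per_epoch = math.ceil(num_sentences / batch_size)
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 29
print(total_updates)      # 145
```

A smaller batch size gives more updates per epoch at the cost of noisier gradient estimates, which is why the two settings are usually tuned together.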



## Step 7: Validation
Hyperparameters can be tuned to maximize model performance on the validation set. Comparing performance on the training and validation sets guides this tuning, for example by indicating whether the model underfits (high bias) or overfits (high variance).

In [13]:
Val_pred = model.predict(Val_Sent_Padded)

Val_pred = np.argmax(Val_pred, axis=2)

Val_Labels = np.argmax(Val_Labels_OneHot, axis=2)

Val_Accuracy = (Val_pred == Val_Labels).mean()

print("Validation Accuracy: {:8f}/".format(Val_Accuracy))

Validation Accuracy: 0.976360/
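
The accuracy above is computed over all 124 positions of every padded sequence, so padding positions count as predictions too. One way to measure accuracy on real tokens only is to mask out the padding using the original sentence lengths. This is a minimal sketch on toy arrays; `masked_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def masked_accuracy(pred, gold, lengths):
    """Token accuracy restricted to the first `lengths[i]` positions of each row."""
    pred = np.asarray(pred)
    gold = np.asarray(gold)
    positions = np.arange(pred.shape[1])
    # mask[i, t] is True when position t is a real token of sentence i
    mask = positions[None, :] < np.asarray(lengths)[:, None]
    return float((pred[mask] == gold[mask]).mean())

# Toy label indices: two sentences of length 3 and 2, padded to length 6
toy_gold = np.array([[2, 8, 1, 0, 0, 0],
                     [3, 7, 0, 0, 0, 0]])
toy_pred = np.array([[2, 8, 8, 0, 0, 0],
                     [3, 8, 0, 0, 0, 0]])
toy_lengths = [3, 2]

padded_acc = float((toy_pred == toy_gold).mean())
real_acc = masked_accuracy(toy_pred, toy_gold, toy_lengths)
print(round(padded_acc, 4))  # 0.8333 (10 of 12 positions, padding included)
print(round(real_acc, 4))    # 0.6 (3 of 5 real tokens)
```

In the notebook, the lengths could be taken as `[len(s) for s in Val_Sent_l]`, assuming the sentence order is unchanged.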


## Step 8: Test the Model with Unseen Test Data
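Finally, we evaluate the model on a held-out test set. Because hyperparameters are tuned against the validation set, validation performance can be optimistically biased; the test set gives an estimate of performance on data that played no role in tuning.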

In [14]:
Test_pred = model.predict(Test_Sent_Padded)

Test_pred = np.argmax(Test_pred, axis=2)

Test_Labels = np.argmax(Test_Labels_OneHot, axis=2)

Test_Accuracy = (Test_pred == Test_Labels).mean()

print("Test Accuracy: {:.8f}/".format(Test_Accuracy))

Test Accuracy: 0.97909881/
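
To inspect individual predictions rather than a single accuracy figure, the predicted indices can be mapped back to tag names by inverting the Tags dictionary. The sketch below copies the dictionary printed in Step 1; the `decode` helper and the toy prediction row are hypothetical.

```python
# Tag dictionary as printed in Step 1
tags = {'B-LOC': 0, 'B-MISC': 1, 'B-ORG': 2, 'B-PER': 3, 'I-LOC': 4,
        'I-MISC': 5, 'I-ORG': 6, 'I-PER': 7, 'O': 8, 'UNK': 9}

# Invert the mapping: index -> tag name
index_to_tag = {index: tag for tag, index in tags.items()}

def decode(pred_row, length):
    """Convert the first `length` predicted indices of one sentence to tag names."""
    return [index_to_tag[int(i)] for i in pred_row[:length]]

# Toy prediction row for a 4-token sentence padded to length 6
toy_pred_row = [2, 8, 1, 8, 0, 0]
toy_decoded = decode(toy_pred_row, 4)
print(toy_decoded)  # ['B-ORG', 'O', 'B-MISC', 'O']
```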


## Compare performance on sentences of different lengths

In [21]:
#Find cutpoint for long and short sentences in the test set

Test_sent_l_dic = dict()
for sentence in Test_Sent_l:
    if len(sentence) not in Test_sent_l_dic:
        Test_sent_l_dic[len(sentence)] = 1
    else:
        Test_sent_l_dic[len(sentence)] += 1

num_short_sentences = 0
num_long_sentences = 0

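#Note: range(1, Test_MaxLen) stops at Test_MaxLen - 1, so sentences of exactly
#Test_MaxLen tokens are not counted; use Test_MaxLen + 1 to include them.
for i in range(1, Test_MaxLen):
    #.get returns 0 for lengths that do not occur in the test set
    if i < 11:
        num_short_sentences += Test_sent_l_dic.get(i, 0)
    else:
        num_long_sentences += Test_sent_l_dic.get(i, 0)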

print("Number of sentences with less than or equal to 10 tokens in the test set:")
print(num_short_sentences)
print("Number of sentences with more than 10 tokens in the test set:")
print(num_long_sentences)
    

Number of sentences with less than or equal to 10 tokens in the test set:
2295
Number of sentences with more than 10 tokens in the test set:
1388
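
The same split can be counted directly from the sentence lengths with `collections.Counter`. This sketch runs on toy sentences; unlike the loop above it also counts empty sentences as short and includes sentences of the maximum length. The name `count_by_length` and the toy data are hypothetical.

```python
from collections import Counter

def count_by_length(sentences, cutoff=10):
    """Return (number of sentences with <= cutoff tokens, number with > cutoff tokens)."""
    length_counts = Counter(len(sentence) for sentence in sentences)
    short = sum(n for length, n in length_counts.items() if length <= cutoff)
    long_ = sum(n for length, n in length_counts.items() if length > cutoff)
    return short, long_

# Toy sentences of length 3, 10, 11, and 40
toy_sentences = [["a"] * 3, ["b"] * 10, ["c"] * 11, ["d"] * 40]
toy_counts = count_by_length(toy_sentences)
print(toy_counts)  # (2, 2)
```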


In [23]:
def SentToListCat(filePath):
    Train_text = open(filePath)
    
    #list of lists of sentences
    SentCollectShort = list()
    SentCollectLong = list()
    NE_tagCollectShort = list()
    NE_tagCollectLong = list()


    #list to store individual sentences
    Sentence_l = list()
    NE_tag_l = list()


    for line in Train_text:
    
        splits = line.split(' ')
        splits[-1] = splits[-1].rstrip("\n")
    
        if len(splits)  > 1:
            word = splits[0]
            NE_tag = splits[-1]
            Sentence_l.append(word)
            NE_tag_l.append(NE_tag)
        else:
            
            #separating long and short sentences
            if len(Sentence_l)>10:
                SentCollectLong.append(Sentence_l)
                NE_tagCollectLong.append(NE_tag_l)                
            else:
                SentCollectShort.append(Sentence_l)
                NE_tagCollectShort.append(NE_tag_l)

                
            Sentence_l =list()
            NE_tag_l = list()
    Train_text.close()

    return SentCollectLong, NE_tagCollectLong, SentCollectShort, NE_tagCollectShort

In [25]:
#Categorize test sentences by length
TestSentLong_l, TestTagLong_l, TestSentShort_l, TestTagShort_l = SentToListCat('./conll2003/test.txt')

In [29]:
#Map to token
TestSentLong_Ind_l = MapTokenInd(TestSentLong_l, Corpus)
TestTagLong_Ind_l = MapTokenInd(TestTagLong_l, Tags)
TestSentShort_Ind_l = MapTokenInd(TestSentShort_l, Corpus)
TestTagShort_Ind_l = MapTokenInd(TestTagShort_l, Tags)

In [30]:
#Pad long sentences
TestSentLong_Padded = pad_sequences(TestSentLong_Ind_l, maxlen=Max_Len, padding='post')
TestTagLong_Padded = pad_sequences(TestTagLong_Ind_l, maxlen=Max_Len, padding='post')
TestLabelsLong_OneHot = [to_categorical(k, num_classes=num_labels) for k in TestTagLong_Padded]
#Pad short sentences
TestSentShort_Padded = pad_sequences(TestSentShort_Ind_l, maxlen=Max_Len, padding='post')
TestTagShort_Padded = pad_sequences(TestTagShort_Ind_l, maxlen=Max_Len, padding='post')
TestLabelsShort_OneHot = [to_categorical(k, num_classes=num_labels) for k in TestTagShort_Padded]

In [31]:
TestPredLong = model.predict(TestSentLong_Padded)

TestPredLong = np.argmax(TestPredLong, axis=2)

TestLabelLong = np.argmax(TestLabelsLong_OneHot, axis=2)

TestAccLong = (TestPredLong == TestLabelLong).mean()

print("Test Accuracy Long Sentences: {:.8f}/".format(TestAccLong))

Test Accuracy Long Sentences: 0.96715553/


In [33]:
TestPredShort = model.predict(TestSentShort_Padded)

TestPredShort = np.argmax(TestPredShort, axis=2)

TestLabelShort = np.argmax(TestLabelsShort_OneHot, axis=2)

TestAccShort = (TestPredShort == TestLabelShort).mean()

print("Test Accuracy Short Sentences: {:.8f}/".format(TestAccShort))

Test Accuracy Short Sentences: 0.98632722/
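
Both figures above include padding positions, and short sentences contain proportionally more padding than long ones, which affects the comparison. The sketch below shows one way to compare the two groups on real tokens only, from a single set of predictions. The function `accuracy_by_bucket`, the toy arrays, and the toy cutoff of 2 are hypothetical.

```python
import numpy as np

def accuracy_by_bucket(pred, gold, lengths, cutoff=10):
    """Token accuracy over real (unpadded) positions for short and long sentences."""
    pred, gold, lengths = np.asarray(pred), np.asarray(gold), np.asarray(lengths)
    # mask[i, t] is True when position t is a real token of sentence i
    mask = np.arange(pred.shape[1])[None, :] < lengths[:, None]
    correct = (pred == gold) & mask
    results = {}
    for name, rows in (("short", lengths <= cutoff), ("long", lengths > cutoff)):
        n_tokens = mask[rows].sum()
        results[name] = float(correct[rows].sum() / n_tokens) if n_tokens else float("nan")
    return results

# Toy data: one 2-token sentence and one 4-token sentence, padded to length 4
toy_gold = np.array([[1, 2, 0, 0],
                     [3, 4, 5, 6]])
toy_pred = np.array([[1, 9, 0, 0],
                     [3, 4, 5, 0]])
toy_result = accuracy_by_bucket(toy_pred, toy_gold, lengths=[2, 4], cutoff=2)
print(toy_result)  # {'short': 0.5, 'long': 0.75}
```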


## References

1. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. Association for Computational Linguistics.
2. Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.