# Training a Custom NER Algorithm

In this exercise, we will train our own RNN-based Named Entity Recognition algorithm, using the CoNLL-2003 tagged dataset.

## Part 1: Loading CoNLL-2003 data

The [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) shared task was a joint effort by academics to provide approaches to named entity recognition, using a tagged dataset of named entities in English and German. We will be using the tagged English data from CoNLL-2003, found in the accompanying file *conll2003.zip*.

After uploading this file to the current directory, access the data as follows:

In [1]:
! unzip conll2003.zip

Archive:  conll2003.zip
   creating: conll2003/
  inflating: conll2003/train.txt     
   creating: __MACOSX/
   creating: __MACOSX/conll2003/
  inflating: __MACOSX/conll2003/._train.txt  
  inflating: conll2003/valid.txt     
  inflating: __MACOSX/conll2003/._valid.txt  
  inflating: conll2003/test.txt      
  inflating: __MACOSX/conll2003/._test.txt  


In [2]:
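import pandas as pd

def read_conll(filename):
    # quoting=3 (csv.QUOTE_NONE) keeps literal quote characters as tokens;
    # blank lines are kept because they mark sentence boundaries.
    df = pd.read_csv(filename,
                     sep=' ', header=None, keep_default_na=False,
                     names=['TOKEN', 'POS', 'CHUNK', 'NE'],
                     quoting=3, skip_blank_lines=False)
    # Each blank line increments the running sentence counter.
    df['SENTENCE'] = (df.TOKEN == '').cumsum()
    # Drop the blank separator rows; copy so columns can be added later without warnings.
    return df[df.TOKEN != ''].copy()
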
train_df = read_conll('conll2003/train.txt')
valid_df = read_conll('conll2003/valid.txt')
test_df = read_conll('conll2003/test.txt')

Note that the CoNLL-2003 data contains part-of-speech (POS) and chunk tags, but we will only be using the token text and named entity (NE) tags that are provided.
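
The `SENTENCE` column in `read_conll` comes from a running count of blank lines. The following is a minimal plain-Python sketch of the same idea on a toy token list; the list is made up for illustration and the notebook itself does this with pandas `cumsum()`.

```python
from itertools import accumulate

# Toy CoNLL-style lines; an empty string stands for the blank line between sentences.
tokens = ["-DOCSTART-", "", "EU", "rejects", "German", "call", "", "Peter", "Blackburn"]

# Running count of blank lines seen so far, the same idea as (df.TOKEN == '').cumsum().
sentence_ids = list(accumulate(int(tok == "") for tok in tokens))

# Drop the blank separator rows, keeping (token, sentence id) pairs.
rows = [(tok, sid) for tok, sid in zip(tokens, sentence_ids) if tok != ""]
print(rows)
```

Every token after the first blank line gets sentence id 1, every token after the second gets id 2, and so on, which is why `-DOCSTART-` ends up alone in sentence 0.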

**Questions:**
  1. What percentages of the CoNLL-2003 data are training, validation, and testing data? (calculate directly)

In [3]:
df_size = train_df.shape[0] + valid_df.shape[0] + test_df.shape[0]
print(f"Train percentage: {round((train_df.shape[0]/df_size)*100,1)}%")
print(f"Validation percentage: {round((valid_df.shape[0]/df_size)*100,1)}%")
print(f"Test percentage: {round((test_df.shape[0]/df_size)*100,1)}%")


Train percentage: 67.6%
Validation percentage: 17.0%
Test percentage: 15.4%


  2. What do the tags in column 'NE' mean? Explain in words.

In [4]:
train_df.head()

Unnamed: 0,TOKEN,POS,CHUNK,NE,SENTENCE
0,-DOCSTART-,-X-,-X-,O,0
2,EU,NNP,B-NP,B-ORG,1
3,rejects,VBZ,B-VP,O,1
4,German,JJ,B-NP,B-MISC,1
5,call,NN,I-NP,O,1


In [5]:
set(train_df.NE)

{'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'}

* B-LOC - beginning of a location entity
* I-LOC - inside of a location entity
* B-MISC - beginning of a miscellaneous entity
* I-MISC - inside of a miscellaneous entity
* B-ORG - beginning of an organization entity
* I-ORG - inside of an organization entity
* B-PER - beginning of a person entity
* I-PER - inside of a person entity
* O - Outside of an entity
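
To make the B/I/O scheme concrete, here is a small sketch that groups tagged tokens back into entity spans. The helper name `bio_to_spans` is hypothetical and not part of the exercise; the example sentence and tags are chosen for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A B- tag, or an I- tag whose type differs from the open span, starts a new span.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = entity_type
            current_tokens.append(token)
    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["I", "live", "in", "the", "United", "States", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('LOC', 'United States')]
```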

## Part 2: Feature calculation

In order to learn named entity recognition using RNNs, we must transform our input and output into numeric vectors by calculating relevant features. For our basic NER algorithm, we will simply use word indices as input and one-hot embeddings of NER tags as output.

**Questions:**

3. Save a list of the 5000 most common word tokens (values from column `TOKEN`) in our training data as a list `vocab`, and save a list of all unique entity tags (values from column `NE`) as a list `ne_tags`. 

In [6]:
from collections import Counter

count_words = Counter(train_df.TOKEN)
# most_common(5000) returns the 5000 most frequent (token, count) pairs, most frequent first.
vocab = [token for token, count in count_words.most_common(5000)]
ne_tags = list(set(train_df.NE))
ne_tags

['B-ORG', 'I-MISC', 'I-LOC', 'O', 'I-ORG', 'B-LOC', 'B-PER', 'B-MISC', 'I-PER']

4. Create a function `token2index(token)` that takes in the value of a word token and returns a unique integer. It should return 1 for any token which is not found in `vocab` (i.e. which is out-of-vocabulary) and a number >= 2 for every token found in `vocab`.

In [7]:
def token2index(token):
    if token in vocab:
        return vocab.index(token)+2
    else:
        return 1
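
`vocab.index(token)` scans the list on every call, which adds up when it is applied to every row of the dataset. A possible alternative, sketched here on toy data, is to precompute a dictionary once. The names `toy_vocab`, `token_to_index` and `token2index_fast` are hypothetical and only illustrate the idea; index 0 stays reserved for padding and 1 for out-of-vocabulary tokens.

```python
from collections import Counter

# Toy corpus standing in for train_df.TOKEN.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
toy_vocab = [tok for tok, _ in Counter(toy_tokens).most_common(2)]

# Build the lookup table once: in-vocabulary tokens get indices 2, 3, ...
token_to_index = {tok: i + 2 for i, tok in enumerate(toy_vocab)}

def token2index_fast(token):
    # Dictionary lookup with 1 as the out-of-vocabulary fallback.
    return token_to_index.get(token, 1)

print([token2index_fast(t) for t in ["the", "cat", "dog"]])  # [2, 3, 1]
```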

5. Create a function `ne_tag2index(ne_tag)` which returns a unique integer >= 1 for every entity tag.

In [8]:
def ne_tag2index(ne_tag):
    if ne_tag in ne_tags:
        return ne_tags.index(ne_tag)+1

6. Add new columns `token_index` and `ne_index` to the CoNLL data DataFrames containing the values of `token2index()` and `ne_tag2index()` for each token and entity tag.

In [9]:
# Apply the same two mappings to each of the three splits.
for df in (train_df, valid_df, test_df):
    df['token_index'] = df.TOKEN.apply(token2index)
    df['ne_index'] = df.NE.apply(ne_tag2index)

7. Generate training data feature matrix `X_train` of size (14987, 50) as follows:
  * Use `train_df.groupby('SENTENCE').token_index.apply(list)` to get a list of lists of token indices, one list for each sentence.
  * Use `pad_sequences()` from `tensorflow.keras.preprocessing.sequence` to pad every list of token indices with the value `0` at the beginning so they are all of length 50.

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = train_df.groupby('SENTENCE').token_index.apply(list)
X_train = pad_sequences(X_train, maxlen=50)
X_train.shape

(14987, 50)
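
By default `pad_sequences()` pads at the beginning (`padding='pre'`) and, for sequences longer than `maxlen`, also drops items from the beginning (`truncating='pre'`). The plain-Python helper below, `pad_left`, is a hypothetical stand-in that mimics those defaults for a single sequence, so the effect is easy to see.

```python
def pad_left(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence."""
    # Keep only the last `maxlen` items (truncation from the front).
    seq = list(seq)[-maxlen:]
    # Fill the remaining positions on the left with the padding value.
    return [value] * (maxlen - len(seq)) + seq

print(pad_left([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_left([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

One consequence is that sentences longer than 50 tokens lose their first tokens, while the real tokens of shorter sentences always sit at the end of the row.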

8. Generate output data feature matrix `Y_train` of size (14987, 50, 10) by applying the same method to the entity tag indices (column `ne_index`), and then one-hot encoding using `to_categorical()` from `tensorflow.keras.utils`.

In [11]:
from tensorflow.keras.utils import to_categorical

Y_train = train_df.groupby('SENTENCE').ne_index.apply(list)
Y_train = pad_sequences(Y_train, maxlen=50)
Y_train = to_categorical(Y_train, num_classes=10)
Y_train.shape

(14987, 50, 10)
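
The last dimension is 10 rather than 9 because index 0 is reserved for padding and the nine tags use indices 1 to 9. As a rough illustration of what one-hot encoding does to a single padded tag sequence, here is a plain-Python sketch; the helper `one_hot_sequence` and the toy sequence are made up, and the notebook itself relies on `to_categorical()`.

```python
def one_hot_sequence(indices, num_classes):
    """Turn a list of class indices into a list of one-hot rows."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

# Two padding positions followed by two real tag indices.
padded_tags = [0, 0, 3, 1]
encoded = one_hot_sequence(padded_tags, num_classes=10)
print(len(encoded), len(encoded[0]))  # 4 10
print(encoded[2])
```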

9. Apply 7-8 on the validation and testing data as well to generate matrices `X_valid`, `Y_valid`, `X_test`, `Y_test`.

In [12]:
X_valid = valid_df.groupby('SENTENCE').token_index.apply(list)
X_valid = pad_sequences(X_valid, maxlen=50)

Y_valid = valid_df.groupby('SENTENCE').ne_index.apply(list)
Y_valid = pad_sequences(Y_valid, maxlen=50)
Y_valid = to_categorical(Y_valid, num_classes=10)

X_test = test_df.groupby('SENTENCE').token_index.apply(list)
X_test = pad_sequences(X_test, maxlen=50)

Y_test = test_df.groupby('SENTENCE').ne_index.apply(list)
Y_test = pad_sequences(Y_test, maxlen=50)
Y_test = to_categorical(Y_test, num_classes=10)

In [13]:
X_valid.shape, Y_valid.shape, X_test.shape, Y_test.shape

((3466, 50), (3466, 50, 10), (3684, 50), (3684, 50, 10))

## Part 3: Building and training the model

Now we are ready to build our network that will predict NER tags from the input words. The architecture will be roughly similar to our previous exercise on RNNs.

The following imports will help you:

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Bidirectional

**Questions:**

10. Build a sequential model `model` with the following layers:
  * Embedding – use embedding dimension 200, and make sure to set `input_length=50` and `mask_zero=True` (to ignore the padding indices).
  * LSTM – use hidden state dimension 128, and return the hidden state at each time step (`return_sequences=True`).
  * Fully-connected layer (`Dense()`) with softmax activation. Hint: The output dimension of `Dense()` is the number of possible output labels, including the padding label `0`.

  Compile the model with loss function `categorical_crossentropy` and optimizer `adam`, and using accuracy as a metric. Print a summary of the model (`model.summary()`). What is the expected shape of input for the model?

In [15]:
model = Sequential([
    Embedding(output_dim=200, input_dim=5002, input_length=50, mask_zero=True),
    LSTM(128, return_sequences=True),
    Dense(10, activation='softmax')
])

In [16]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 200)           1000400   
_________________________________________________________________
lstm (LSTM)                  (None, 50, 128)           168448    
_________________________________________________________________
dense (Dense)                (None, 50, 10)            1290      
Total params: 1,170,138
Trainable params: 1,170,138
Non-trainable params: 0
_________________________________________________________________
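
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what we intended. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per index (5000 words + padding + out-of-vocabulary).
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

counts = [embedding_params(5002, 200), lstm_params(200, 128), dense_params(128, 10)]
print(counts, sum(counts))  # [1000400, 168448, 1290] 1170138
```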


In [18]:
print(f'the expected input for the model: {model.input_shape}')

the expected input for the model: (None, 50)


11. Train the model on `X_train` and `Y_train`, using `X_valid` and `Y_valid` as validation data. Use whatever batch size and number of epochs work best for you. Train the model until the validation loss starts increasing (or the validation accuracy stops improving). How many epochs did you use for training?

In [19]:
from tensorflow.keras.callbacks import EarlyStopping

callback = EarlyStopping(monitor='val_loss',  patience=1, 
                         verbose=1, restore_best_weights=True)

model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
          epochs=100, batch_size=64, callbacks=[callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
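
With `patience=1`, training stops at the first epoch whose validation loss fails to improve on the best value so far, and `restore_best_weights=True` rolls the model back to that best epoch. The sketch below simulates this rule in plain Python. The function `early_stopping_epoch` is hypothetical and the loss values are invented for illustration, since the log above does not show them.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses that improve for six epochs, then get worse.
toy_losses = [0.30, 0.20, 0.15, 0.12, 0.11, 0.10, 0.105]
print(early_stopping_epoch(toy_losses, patience=1))  # (7, 6)
```

Under this rule, a run that stops at epoch 7 with `patience=1` keeps the weights from epoch 6.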


<keras.callbacks.History at 0x7f49dcfd5e90>

12. Create a model *model2* that is the same as *model* but with the LSTM layer wrapped by `Bidirectional()`, so the model becomes a BiLSTM model. How does this change the final validation loss? Does the model improve?

In [20]:
model2 = Sequential([
    Embedding(output_dim=200, input_dim=5002, input_length=50, mask_zero=True),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dense(10, activation='softmax')
])

In [21]:
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [22]:
model2.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
           epochs=100, batch_size=64, callbacks=[callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


<keras.callbacks.History at 0x7f4965422f10>

13. Compare the performance of the two models on the test set data `X_test` and `Y_test` (Hint: use `model.evaluate()`). Mention important metrics, overfitting, early stopping, etc... 

In [23]:
eval1 = model.evaluate(X_test, Y_test)
eval2 = model2.evaluate(X_test, Y_test)

print(f'model 1 evaluation: {eval1}\nmodel 2 evaluation: {eval2}')

model 1 evaluation: [0.06601439416408539, 0.9199905395507812]
model 2 evaluation: [0.05508552864193916, 0.9373894929885864]


Comparing test performance, the second model has both lower loss (0.055 vs. 0.066) and higher accuracy (0.937 vs. 0.920).  
The first model scored better on the validation set than on the test set, which can indicate overfitting.  
This did not happen with the second model, so the BiLSTM model looks better overall.

## Running on custom input

14. What does your model predict as NER tags for the following test sentences?

Hint: Try using the following pipeline on each sentence:

* Tokenize with `nltk.word_tokenize()`
* Convert to an array of indices with `token2index()` defined above
* Pad to length 50 with `pad_sequences()` from Keras
* Predict probabilities of NER tags with `model2.predict()`
* Find maximum likelihood tags using `np.argmax()`, and ignore padding values

In [24]:
test_sentences = [
  "This is a test.",
  "I live in the United States.",
  "Israel is a country in the Middle East.",
  "UK joins US in Gulf mission after Iran taunts American allies",
  "The project was funded by the Portuguese Foundation for Science and Technology and the Israel Cancer Research Fund."
]

In [28]:
import nltk
import numpy as np
nltk.download('punkt')

def predict(sentence):
    # Tokenize and map each token to its vocabulary index (1 = out-of-vocabulary).
    tokens = nltk.word_tokenize(sentence)
    token_indices = [token2index(token) for token in tokens]
    # pad_sequences pads on the left, so the real tokens occupy the last positions.
    padded = pad_sequences([token_indices], maxlen=50)
    pred = model2.predict(padded)
    # Keep only the rows for the real tokens and take the most likely tag index.
    return [np.argmax(probs) for probs in pred[0][-len(tokens):]]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [29]:
for sentence in test_sentences:
    pred = predict(sentence)
    print(sentence)
    for i in pred:
        print(ne_tags[i-1], end=' ')
    print('')

This is a test.
O O O O O 
I live in the United States.
O O O O B-LOC I-LOC O 
Israel is a country in the Middle East.
B-LOC O O O O O B-LOC I-LOC O 
UK joins US in Gulf mission after Iran taunts American allies
B-LOC B-LOC B-LOC O B-LOC O O B-LOC O I-MISC I-MISC 
The project was funded by the Portuguese Foundation for Science and Technology and the Israel Cancer Research Fund.
O O O O O O B-MISC O O O O O O O B-LOC O I-ORG I-ORG O 
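
The loop above maps a predicted index back to a tag with `ne_tags[i-1]`. Index 0 is the padding label, and in that case `ne_tags[-1]` would silently return the last tag in the list. A possible decoding helper that makes the padding case explicit is sketched below; `index2ne_tag`, the `<PAD>` marker, the tag list and the probabilities are all illustrative assumptions.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def index2ne_tag(index, ne_tags):
    """Inverse of ne_tag2index; index 0 is the padding label."""
    return "<PAD>" if index == 0 else ne_tags[index - 1]

toy_tags = ["B-LOC", "I-LOC", "O"]
# One row of class probabilities per position: [padding, B-LOC, I-LOC, O].
toy_probs = [
    [0.90, 0.05, 0.03, 0.02],
    [0.00, 0.80, 0.10, 0.10],
    [0.00, 0.10, 0.70, 0.20],
    [0.00, 0.10, 0.10, 0.80],
]
decoded = [index2ne_tag(argmax(row), toy_tags) for row in toy_probs]
print(decoded)  # ['<PAD>', 'B-LOC', 'I-LOC', 'O']
```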


## Bonus: Adding features

**Bonus question:**

In (A) below, add code to add a new column 'SHAPE' to the dataset. This column should represent the shape of the word token by:
* Replacing all capital letters with 'X'
* Replacing all lowercase letters with 'x'
* Replacing all digits with 'd'

For example, we should have the following:

* 'house' => 'xxxxx'
* 'Apple' => 'Xxxxx'
* 'R2D2' => 'XdXd'
* 'U.K.' => 'X.X.'

Hint: for a pandas Series, you can use `series.str.replace()` to easily replace text.

In [35]:
def series2shape(series):
    # Order matters: digits go last, otherwise the inserted 'd' would be turned into 'x'.
    series = series.str.replace('[a-z]', 'x', regex=True)
    series = series.str.replace('[A-Z]', 'X', regex=True)
    series = series.str.replace(r'\d', 'd', regex=True)
    return series

In [36]:
train_df['SHAPE'] = series2shape(train_df.TOKEN)
valid_df['SHAPE'] = series2shape(valid_df.TOKEN)
test_df['SHAPE'] = series2shape(test_df.TOKEN)
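
As a quick check of the four examples from the question, the same three replacements can be applied to single strings with the standard `re` module. This is a plain-Python sketch, not the pandas code used above; `word_shape` and `word_shape_wrong_order` are hypothetical helpers, and the second one shows why the digit replacement should not come first.

```python
import re

def word_shape(token):
    # Lowercase, then uppercase, then digits.
    token = re.sub(r"[a-z]", "x", token)
    token = re.sub(r"[A-Z]", "X", token)
    return re.sub(r"\d", "d", token)

def word_shape_wrong_order(token):
    # Digits first: the inserted 'd' is later rewritten as 'x'.
    token = re.sub(r"\d", "d", token)
    token = re.sub(r"[a-z]", "x", token)
    return re.sub(r"[A-Z]", "X", token)

for token in ["house", "Apple", "R2D2", "U.K."]:
    print(token, word_shape(token), word_shape_wrong_order(token))
```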

Once you complete this, run the following code to see how adding this as a feature improves the performance of the model. For simplicity we only use the top 100 word shapes. How does the final loss change?

In [38]:
from collections import Counter

shape_vocab = [w for w, f in Counter(train_df.SHAPE).most_common(100)]
shape_set = set(shape_vocab)
def shape2index(shape):
  if shape in shape_set:
    return shape_vocab.index(shape) + 2
  else: # out-of-vocabulary shape
    return 1

n_words = 50
def df2features2(df):
  df['shape_index'] = df.SHAPE.apply(shape2index)
  token_index_lists = df.groupby('SENTENCE').token_index.apply(list)
  ne_index_lists = df.groupby('SENTENCE').ne_index.apply(list)
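  # The second input must come from shape_index; building it from ne_index
  # would feed the gold tags to the model as an input feature (label leakage).
  shape_index_lists = df.groupby('SENTENCE').shape_index.apply(list)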
  X = np.stack([
      pad_sequences(token_index_lists, maxlen=n_words, value=0),
      pad_sequences(shape_index_lists, maxlen=n_words, value=0)
  ])
  Y = to_categorical(pad_sequences(ne_index_lists, maxlen=n_words, value=0))
  return X, Y

X2_train, Y2_train = df2features2(train_df)
X2_valid, Y2_valid = df2features2(valid_df)
X2_test, Y2_test = df2features2(test_df)
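
Since the second input is supposed to carry only word-shape information, a cheap sanity check before training is to confirm that it is not simply a copy of the label indices. The sketch below uses toy per-sentence lists and a hypothetical helper `fraction_identical`; neither is part of the exercise code.

```python
def fraction_identical(feature_lists, label_lists):
    """Fraction of sentences whose feature sequence equals the label sequence."""
    pairs = list(zip(feature_lists, label_lists))
    same = sum(1 for feat, lab in pairs if list(feat) == list(lab))
    return same / len(pairs)

# Toy per-sentence index lists (made up for illustration).
toy_ne_index_lists = [[1, 4, 8, 4], [7, 9]]
toy_shape_index_lists = [[3, 2, 5, 2], [5, 5]]

print(fraction_identical(toy_shape_index_lists, toy_ne_index_lists))  # 0.0
print(fraction_identical(toy_ne_index_lists, toy_ne_index_lists))     # 1.0
```

A value near 1.0 would mean the model can read the answer directly from its input, in which case a near-perfect score says nothing about the feature itself.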

In [39]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Concatenate

input1 = Input(shape=(50,))
input2 = Input(shape=(50,))
embedded1 = Embedding(
    len(vocab) + 2, 200,
    input_length=50, mask_zero=True)(input1)
embedded2 = Embedding(
    len(shape_vocab) + 2, 8,
    input_length=50, mask_zero=True)(input2)
x = Concatenate()([embedded1, embedded2])
x = Bidirectional(LSTM(128, return_sequences=True))(x)
output = Dense(len(ne_tags) + 1, activation='softmax')(x)
model3 = Model(inputs=[input1, input2], outputs=[output])
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [40]:
from tensorflow.keras.callbacks import EarlyStopping

model3.fit(
    [X2_train[0], X2_train[1]],
    Y2_train,
    validation_data=([X2_valid[0], X2_valid[1]], Y2_valid),
    epochs=100, batch_size=32,
    callbacks=[EarlyStopping(monitor='val_loss', patience=3, verbose=1)]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 00038: early stopping


<keras.callbacks.History at 0x7f494f784bd0>

In [41]:
print("Model3 loss on test data:")
model3.evaluate([X2_test[0], X2_test[1]], Y2_test)

Model3 loss on test data:


[0.0004642434942070395, 0.9997628331184387]

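model3's test loss is almost 0 (about 0.0005) and its accuracy is extremely high (about 0.9998). These numbers come from a run in which `df2features2` built `shape_index_lists` from the `ne_index` column rather than from `shape_index`, so the second input was the gold tag sequence itself. The result therefore very likely reflects label leakage rather than the benefit of the shape feature, and the model has to be retrained on the real shape indices to measure how the loss actually changes.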