# Named Entity Recognition with LSTM - Students

Named entity recognition (NER) — sometimes referred to as entity chunking, extraction, or identification — is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.

NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than by design, as programming languages have.

![image](https://miro.medium.com/max/720/0*GZ9EzgeviitRHAT8)

At the heart of any NER model is a two-step process:
* Detect a named entity
* Categorize the entity


In [None]:
"""
(Practical tip) A table of contents can be generated directly in Jupyter notebooks with the following code.
The import is wrapped in a try/except: if the package is already installed it is imported,
otherwise it is installed first and then imported.
"""
try:
    from jyquickhelper import add_notebook_menu
except ImportError:
    !pip install jyquickhelper
    from jyquickhelper import add_notebook_menu

In [2]:
"""
Display the table of contents to navigate the notebook easily.
For interested readers, the package also includes IPython magic commands that jump back to this cell
from anywhere in the notebook, which makes it faster to find other cells.
"""
add_notebook_menu()

## Imports

In [3]:
import numpy as np
import matplotlib.pyplot as plt

In [4]:
import sklearn

In [5]:
#!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git#egg=sklearn_crfsuite
from sklearn_crfsuite import metrics

In [6]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras import layers, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model



## The Dataset

A simple sentence NER example:

[**ORG** U.N. ] official [**PER** Ekeus ] heads for [**LOC** Baghdad ] 

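We will concentrate on three types of named entities, plus a catch-all label for every other token:
 * persons (**PER**)
 * locations (**LOC**)
 * organizations (**ORG**)
 * others (**O**), which here also absorbs the dataset's `MISC` entities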
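
In the data files these categories are stored one tag per token, with a `B-` prefix on the first token of an entity and an `I-` prefix on the following ones (the `B-LOC` / `I-LOC` tags printed later in this notebook). As a minimal sketch, the bracketed sentence above could be turned into such tags as follows; `spans_to_bio` is a hypothetical helper written for illustration, not part of the dataset tooling.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end, type) spans (end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"          # continuation tokens
    return tags

# The example sentence, with entity spans given as token positions
tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad"]
spans = [(0, 1, "ORG"), (2, 3, "PER"), (5, 6, "LOC")]
example_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, example_tags)))
```

Tokens outside any span keep the `O` tag, which is why `O` dominates the label counts in the evaluation reports.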

In [7]:
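def _generate_examples(filepath):
    """Yield one sentence at a time as a list of (token, pos_tag, ner_tag) triples."""
    with open(filepath, encoding="utf-8") as f:
        sent = []
        for line in f:
            # Document markers and blank lines separate sentences
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if sent:
                    yield sent
                    sent = []
            else:
                # Space-separated columns; the third one (index 2) is not used
                splits = line.split(" ")
                token = splits[0]
                pos_tag = splits[1]
                ner_tag = splits[3].rstrip()
                # MISC entities are folded into the 'O' (other) class
                if 'MISC' in ner_tag:
                    ner_tag = 'O'

                sent.append((token, pos_tag, ner_tag))
        # Emit the last sentence when the file does not end with a blank line
        if sent:
            yield sent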

In [8]:
%%time
train_sents = list(_generate_examples('NER Dataset/train.txt'))
test_sents = list(_generate_examples('NER Dataset/test.txt'))

CPU times: user 299 ms, sys: 45.8 ms, total: 345 ms
Wall time: 441 ms


In [15]:
test_sents[2]

[('United', 'NNP', 'B-LOC'),
 ('Arab', 'NNP', 'I-LOC'),
 ('Emirates', 'NNPS', 'I-LOC'),
 ('1996-12-06', 'CD', 'O')]
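
The `B-`/`I-` prefixes make it possible to recover whole entities from the token-level tags. The sketch below groups consecutive tagged tokens back into (text, type) pairs; `bio_to_entities` is a hypothetical helper, and treating a stray `I-` tag as the start of a new entity is an assumption made here for robustness.

```python
def bio_to_entities(sent):
    """Collect (text, type) pairs from a list of (token, pos_tag, ner_tag) triples."""
    entities, current_tokens, current_type = [], [], None
    for token, _pos, tag in sent:
        # A B- tag, or an I- tag whose type differs from the open entity, starts a new entity
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = tag[2:]
    # Flush an entity that runs to the end of the sentence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

sample = [("United", "NNP", "B-LOC"), ("Arab", "NNP", "I-LOC"),
          ("Emirates", "NNPS", "I-LOC"), ("1996-12-06", "CD", "O")]
print(bio_to_entities(sample))
```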

In [11]:
# reduced features: the only feature kept is the lowercased token
def reduced_word2features(sent, i):
    word = sent[i][0]
    return word.lower()

In [13]:
def sent2features(sent):
    return [reduced_word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [16]:
sent2features(test_sents[2])

['united', 'arab', 'emirates', '1996-12-06']

In [17]:
sent2labels(test_sents[2])

['B-LOC', 'I-LOC', 'I-LOC', 'O']

In [18]:
sent2tokens(test_sents[2])

['United', 'Arab', 'Emirates', '1996-12-06']

In [19]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [20]:
X_test[2]

['united', 'arab', 'emirates', '1996-12-06']

In [21]:
y_test[2]

['B-LOC', 'I-LOC', 'I-LOC', 'O']

In [None]:
def lengths(data):
    return max([len(sent) for sent in data])

max_length = lengths(X_train)

In [31]:
from tensorflow.keras.layers import TextVectorization

# For this lab session, we truncate sequences to a length of 20
max_length = 20

X_vectorizer = TextVectorization(standardize=None,
                                 split="whitespace",
                                 output_mode="int",
                                 output_sequence_length=max_length)

x_ = [' '.join(sent) for sent in X_train]
X_vectorizer.adapt(x_)
X_train_enc = X_vectorizer(x_)

x_ = [' '.join(sent) for sent in X_test]
X_test_enc = X_vectorizer(x_)

In [32]:
vocabulary = X_vectorizer.get_vocabulary()

In [33]:
y_vectorizer = TextVectorization(standardize=None,
                                 split="whitespace",
                                 output_mode="int",
                                 output_sequence_length=max_length)

x_ = [' '.join(sent) for sent in y_train]
y_vectorizer.adapt(x_)
y_train_enc = y_vectorizer(x_)

x_ = [' '.join(sent) for sent in y_test]
y_test_enc = y_vectorizer(x_)

In [42]:
labels = y_vectorizer.get_vocabulary()
labels

['', '[UNK]', 'O', 'B-LOC', 'B-PER', 'B-ORG', 'I-PER', 'I-ORG', 'I-LOC']
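
The vocabulary above shows the two reserved entries of `TextVectorization`: index 0 (`''`) is the padding value and index 1 (`'[UNK]'`) stands for out-of-vocabulary tokens; the learned tokens follow in order of decreasing frequency. The pure-Python sketch below imitates that behaviour on toy data to make the encoding explicit. `build_vocab` and `encode` are hypothetical helpers, not Keras functions, and the ordering of equally frequent tokens may differ from what Keras produces.

```python
from collections import Counter

def build_vocab(sentences):
    """Index 0 is padding, index 1 is unknown, then tokens by decreasing frequency."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def encode(sent, vocab, max_length):
    """Map tokens to ids, truncate to max_length, and right-pad with 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in sent][:max_length]   # unknown tokens map to 1
    return ids + [0] * (max_length - len(ids))

toy_labels = [["B-LOC", "I-LOC", "I-LOC", "O"], ["O", "O", "B-PER", "O"]]
toy_vocab = build_vocab(toy_labels)
print(toy_vocab)
print(encode(["O", "B-MISC"], toy_vocab, max_length=4))
```

Because every sequence is cut or padded to `max_length`, tokens beyond position 20 are never seen by the model, and the padding positions appear as a `[PAD]` class in the evaluation.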

## Model

In [35]:
# Constants
vocab_size = len(vocabulary)
nb_labels = len(labels)

# The sequence length is max_length (20), already fixed when the vectorizers were built.
embedding_dim = 50
lstm_hidden = 100

In [36]:
# define the model
input_ = layers.Input(shape=(max_length,), dtype=tf.int32)
x = layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="Embedding")(input_)
x = layers.LSTM(lstm_hidden, return_sequences=True, name="hidden")(x)
output_ = layers.Dense(nb_labels, activation='softmax')(x)
model = Model(input_, output_)
# summarize the model
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20)]              0         
                                                                 
 Embedding (Embedding)       (None, 20, 50)            1050550   
                                                                 
 hidden (LSTM)               (None, 20, 100)           60400     
                                                                 
 dense_1 (Dense)             (None, 20, 9)             909       
                                                                 
Total params: 1,111,859
Trainable params: 1,111,859
Non-trainable params: 0
_________________________________________________________________
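
The parameter counts in this summary can be checked by hand. An embedding layer stores one vector per vocabulary entry; an LSTM has four gates, each with input weights, recurrent weights and a bias; the dense layer has one weight per hidden unit plus a bias for each label. The sketch below redoes the arithmetic. The vocabulary size is not printed in the notebook, so it is inferred here from the embedding parameter count, which is an assumption.

```python
embedding_dim = 50
lstm_hidden = 100
nb_labels = 9
# Inferred from the summary (1,050,550 embedding parameters), not printed in the notebook
vocab_size = 1_050_550 // embedding_dim

embedding_params = vocab_size * embedding_dim
# 4 gates x (input weights + recurrent weights + bias) x hidden units
lstm_params = 4 * (embedding_dim + lstm_hidden + 1) * lstm_hidden
# (hidden units + bias) x output classes
dense_params = (lstm_hidden + 1) * nb_labels
total_params = embedding_params + lstm_params + dense_params
print(vocab_size, embedding_params, lstm_params, dense_params, total_params)
```

Almost all parameters sit in the embedding table, so the vocabulary size dominates the model size.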


In [37]:
callbacks_list = [EarlyStopping(monitor='val_accuracy', min_delta=0.0005, patience=10, verbose=1, mode='max', restore_best_weights=True)
                 ]

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
hist = model.fit(X_train_enc, y_train_enc, validation_split=0.2,
                 epochs=1000, batch_size=250, callbacks=callbacks_list, verbose=1)

Epoch 1/1000
...
Epoch 39/1000
Epoch 39: early stopping


In [None]:
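y_pred_enc = model.predict(X_test_enc)
y_pred_enc = np.argmax(y_pred_enc, axis=2)
# The vectorizer returns a tf.Tensor; convert it to NumPy so that .flatten() works below
y_test_enc = np.asarray(y_test_enc)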

In [98]:
labels = ['[PAD]']+labels[1:]
labels_id = [i for i, _ in enumerate(labels)]
labels, labels_id

(['[PAD]', '[UNK]', 'O', 'B-LOC', 'B-PER', 'B-ORG', 'I-PER', 'I-ORG', 'I-LOC'],
 [0, 1, 2, 3, 4, 5, 6, 7, 8])

In [86]:
sklearn.metrics.f1_score(y_test_enc.flatten(), y_pred_enc.flatten(), average='weighted')

0.9494480058628227

In [99]:
print(sklearn.metrics.classification_report(y_test_enc.flatten(), y_pred_enc.flatten(),
                                            target_names=labels, labels=labels_id, digits=3))

              precision    recall  f1-score   support

       [PAD]      0.998     1.000     0.999     31800
       [UNK]      0.000     0.000     0.000         0
           O      0.948     0.971     0.959     31139
       B-LOC      0.767     0.739     0.753      1436
       B-PER      0.862     0.554     0.674      1363
       B-ORG      0.566     0.627     0.595      1464
       I-PER      0.854     0.544     0.664       925
       I-ORG      0.525     0.389     0.447       686
       I-LOC      0.430     0.573     0.491       227

   micro avg      0.951     0.951     0.951     69040
   macro avg      0.661     0.600     0.620     69040
weighted avg      0.950     0.951     0.949     69040





In [103]:
# Scores are usually reported without the [PAD], [UNK] and 'O' classes, so we drop them here.
print(sklearn.metrics.classification_report(y_test_enc.flatten(), y_pred_enc.flatten(),
                                            target_names=labels[3:], labels=labels_id[3:], digits=3))

              precision    recall  f1-score   support

       B-LOC      0.767     0.739     0.753      1436
       B-PER      0.862     0.554     0.674      1363
       B-ORG      0.566     0.627     0.595      1464
       I-PER      0.854     0.544     0.664       925
       I-ORG      0.525     0.389     0.447       686
       I-LOC      0.430     0.573     0.491       227

   micro avg      0.688     0.596     0.638      6101
   macro avg      0.667     0.571     0.604      6101
weighted avg      0.713     0.596     0.640      6101
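
Restricting `labels` lowers the averages because the many correctly predicted `[PAD]` and `O` tokens no longer count. As a rough illustration, micro-averaged precision, recall and F1 over a subset of labels can be computed as below; `micro_prf` is a hypothetical helper and the toy ids are invented for the example, they are not taken from the dataset.

```python
def micro_prf(y_true, y_pred, keep):
    """Micro-averaged precision, recall and F1 restricted to the labels in `keep`."""
    tp = sum(t == p and t in keep for t, p in zip(y_true, y_pred))
    n_pred = sum(p in keep for p in y_pred)   # tokens predicted as one of the kept labels
    n_true = sum(t in keep for t in y_true)   # tokens whose true label is kept
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy ids following the notebook's convention: 0 = [PAD], 2 = O, 3 = B-LOC, 4 = B-PER
toy_true = [3, 2, 4, 2, 0, 0, 0, 0]
toy_pred = [3, 2, 2, 4, 0, 0, 0, 0]
print(micro_prf(toy_true, toy_pred, keep={0, 2, 3, 4}))  # all classes
print(micro_prf(toy_true, toy_pred, keep={3, 4}))        # entity classes only
```

On this toy input the score over all classes is higher than the score over the entity classes alone, the same effect seen between the two reports above.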



In [111]:
# Group the B- and I- tags of the same entity type.
# The label vocabulary is ordered by frequency, so B-/I- ids are not aligned:
# `id % 3` would pair B-LOC with I-PER. Map through the tag names instead.
entity_types = ["LOC", "PER", "ORG"]
group = {i: entity_types.index(name[2:]) for i, name in enumerate(labels) if name[:2] in ("B-", "I-")}
y_test2 = [group.get(int(item), -1) for sublist in y_test_enc for item in sublist]
y_pred2 = [group.get(int(item), -1) for sublist in y_pred_enc for item in sublist]

print(sklearn.metrics.classification_report(y_test2, y_pred2,
                                            target_names=entity_types,
                                            labels=[0, 1, 2], digits=3))



## Your work

<font color='red'>
<br>
**$TO DO - Students$**
    
Before modifying the code, take the time to understand it well.
    

* Try to improve the F1 score using other **LSTM** architectures:
    * Use a bidirectional RNN (`tf.keras.layers.Bidirectional`)
    * Use stacked bidirectional RNNs
    * Optionally, replace the `softmax` activation function with a [CRF layer](https://www.tensorflow.org/addons/api_docs/python/tfa/layers/CRF)
</font>