## Sequence Tagging using Sequential Models

Sequence tagging assigns a label to every token in a sequence. Here it is used for named entity recognition, an information extraction technique that identifies and classifies named entities in text. The entity types are predefined and generic, such as locations, organizations, and times.

In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.5'

#### Desired Sample Output

<img src="https://miro.medium.com/max/2400/1*8LOMipM-fmszClg-AwATkQ.png">

### The required files are available at the link below

https://drive.google.com/drive/folders/1m9JjfsAEN50flYwFPCgZWQ5nHXGt0ZwI?usp=sharing



In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")

from google.colab import drive
drive.mount('/content/drive/')



Mounted at /content/drive/


In [5]:
%cd '/content/drive/My Drive/NLP/'

/content/drive/My Drive/NLP


In [0]:

data_filename = "ner_dataset.csv.zip"

import zipfile
z = zipfile.ZipFile(data_filename, 'r')
z.extractall()
z.close()

In [8]:
import tensorflow as tf
print(tf.__version__)

1.15.0


In [9]:
pwd

'/content/drive/My Drive/NLP'

In [10]:
data = pd.read_csv("ner_dataset.csv", encoding="latin-1")
# data = data.drop(['POS'], axis =1)
# data = data.fillna(method="ffill")
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


### Forward-fill the NaN values in `Sentence #` so that every word carries its sentence id.

In [0]:
# Forward-fill: each missing value takes the last non-missing value above it
data = data.ffill()

### Drop the POS column, as only the entity tags are needed for sentence tagging.

In [0]:
data = data.drop(columns=["POS"])

In [13]:
data.head(30)

Unnamed: 0,Sentence #,Word,Tag
0,Sentence: 1,Thousands,O
1,Sentence: 1,of,O
2,Sentence: 1,demonstrators,O
3,Sentence: 1,have,O
4,Sentence: 1,marched,O
5,Sentence: 1,through,O
6,Sentence: 1,London,B-geo
7,Sentence: 1,to,O
8,Sentence: 1,protest,O
9,Sentence: 1,the,O


Now we can see from the above result that the words belonging to the same sentence have the same sentence id.
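
To make the effect of the forward fill concrete, here is a minimal plain-Python sketch of the same idea on a handful of rows. The helper `forward_fill_ids` and the second sentence are hypothetical and only serve as illustration; the notebook itself relies on pandas for this step.

```python
# Rows shaped like the CSV: (sentence id or None, word, tag)
rows = [
    ("Sentence: 1", "Thousands", "O"),
    (None, "of", "O"),
    (None, "demonstrators", "O"),
    ("Sentence: 2", "Families", "O"),  # hypothetical second sentence
    (None, "of", "O"),
    (None, "soldiers", "O"),
]


def forward_fill_ids(rows):
    """Replace a missing sentence id with the most recent non-missing one."""
    filled, last_id = [], None
    for sent_id, word, tag in rows:
        if sent_id is not None:
            last_id = sent_id
        filled.append((last_id, word, tag))
    return filled


filled_rows = forward_fill_ids(rows)

# Once every row carries an id, collecting the words of one sentence is a simple lookup
sentences_by_id = {}
for sent_id, word, tag in filled_rows:
    sentences_by_id.setdefault(sent_id, []).append((word, tag))

print(sentences_by_id["Sentence: 2"])
```

Without the fill, only the first word of each sentence would be attached to its id, and the remaining words could not be grouped.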

In [14]:
tags = list(set(data["Tag"].values))
n_tags = len(tags)
n_tags

17

In [15]:
print(tags)

['B-gpe', 'I-nat', 'I-geo', 'I-eve', 'I-gpe', 'I-org', 'I-per', 'O', 'B-per', 'I-tim', 'B-org', 'B-art', 'B-tim', 'B-nat', 'B-eve', 'I-art', 'B-geo']
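
The tags follow the B-/I-/O scheme: `B-` marks the first token of an entity, `I-` a continuation of the same entity, and `O` a token outside any entity. The sketch below shows one way to turn such tags back into entities. `extract_entities` is a hypothetical helper and the sample sentence is made up for illustration.

```python
def extract_entities(tagged_tokens):
    """Group BIO-tagged (word, tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new entity, closing any open one first
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], ent_type
        elif prefix == "I" and current_type == ent_type:
            # An I- tag of the same type continues the open entity
            current_words.append(word)
        else:
            # "O" (or an I- tag that does not match) closes any open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Made-up sample using tags from the list above
sample = [("She", "O"), ("visited", "O"), ("New", "B-geo"), ("York", "I-geo"),
          ("with", "O"), ("British", "B-gpe"), ("officials", "O")]
print(extract_entities(sample))
```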


In [16]:
words = set(list(data['Word'].values))
words.add('dummy')
n_words = len(words)
n_words

35179

### Group the rows by sentence and combine each sentence's words and tags using `groupby` and `apply` on the DataFrame

In [0]:
combining_words_tags = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),s["Tag"].values.tolist())]
d = data.groupby("Sentence #").apply(combining_words_tags)

In [0]:
sentences = [s for s in d]

In [19]:
sentences[0]

[('Thousands', 'O'),
 ('of', 'O'),
 ('demonstrators', 'O'),
 ('have', 'O'),
 ('marched', 'O'),
 ('through', 'O'),
 ('London', 'B-geo'),
 ('to', 'O'),
 ('protest', 'O'),
 ('the', 'O'),
 ('war', 'O'),
 ('in', 'O'),
 ('Iraq', 'B-geo'),
 ('and', 'O'),
 ('demand', 'O'),
 ('the', 'O'),
 ('withdrawal', 'O'),
 ('of', 'O'),
 ('British', 'B-gpe'),
 ('troops', 'O'),
 ('from', 'O'),
 ('that', 'O'),
 ('country', 'O'),
 ('.', 'O')]

In [20]:
print(len(sentences))
sentences = sentences[:3200]

47959


### Map words and tags to integers

In [21]:
words2index = {w:i for i,w in enumerate(words)}
tags2index = {t:i for i,t in enumerate(tags)}
print(words2index['India'])
print(tags2index['B-geo'])

12410
16
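
Because `words` and `tags` are built from Python sets, the integer assigned to each word or tag can differ from one interpreter session to the next. If reproducible indices matter (for example, to reuse a saved model), one option is to sort the vocabulary before enumerating it. This is a sketch only: `build_index` is a hypothetical helper, not part of the notebook.

```python
def build_index(items):
    """Map each distinct item to an integer, in sorted order so the mapping is stable."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


sample_tags = ["O", "B-geo", "O", "B-gpe", "I-geo", "B-geo"]
tag_index = build_index(sample_tags)
index_tag = {i: t for t, i in tag_index.items()}  # inverse mapping for decoding
print(tag_index)
```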


#### Make all sentences the same length: pad short sentences with a `dummy` token at the end, and keep only the first `max_len` words of long ones.

In [22]:
max_len = 30
X = [[w[0] for w in s] for s in sentences]
# Keep the first max_len words, then pad short sentences on the right with "dummy"
new_X = [seq[:max_len] + ["dummy"] * (max_len - len(seq)) for seq in X]
new_X[0]

['Thousands',
 'of',
 'demonstrators',
 'have',
 'marched',
 'through',
 'London',
 'to',
 'protest',
 'the',
 'war',
 'in',
 'Iraq',
 'and',
 'demand',
 'the',
 'withdrawal',
 'of',
 'British',
 'troops',
 'from',
 'that',
 'country',
 '.',
 'dummy',
 'dummy',
 'dummy',
 'dummy',
 'dummy',
 'dummy']

#### Similarly pad labels with `O` tag

In [23]:
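from keras.preprocessing.sequence import pad_sequences
y = [[tags2index[w[1]] for w in s] for s in sentences]
# truncating="post" keeps the first max_len labels, matching how the words are truncated above
# (pad_sequences truncates from the front by default)
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", truncating="post", value=tags2index["O"])
y[0]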

array([ 7,  7,  7,  7,  7,  7, 16,  7,  7,  7,  7,  7, 16,  7,  7,  7,  7,
        7,  0,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7], dtype=int32)
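
Words and labels have to stay aligned position by position, so both must be padded and truncated on the same side. The words keep their first `max_len` entries, whereas Keras `pad_sequences` truncates from the front unless `truncating="post"` is passed. The plain-Python sketch below applies one hypothetical helper, `pad_or_truncate`, to both sequences; the small `max_len` is chosen only for illustration.

```python
def pad_or_truncate(seq, max_len, pad_value):
    """Keep the first max_len items and pad on the right with pad_value."""
    return list(seq[:max_len]) + [pad_value] * (max_len - len(seq))


max_len = 5  # small value for illustration; the notebook uses 30
words = ["Thousands", "of", "demonstrators", "have", "marched", "through", "London"]
labels = ["O", "O", "O", "O", "O", "O", "B-geo"]

# Long sentence: both sequences lose the same trailing positions
padded_words = pad_or_truncate(words, max_len, "dummy")
padded_labels = pad_or_truncate(labels, max_len, "O")

# Short sentence: both sequences gain padding at the same positions
short_words = pad_or_truncate(["London", "."], max_len, "dummy")
short_labels = pad_or_truncate(["B-geo", "O"], max_len, "O")
for w, t in zip(short_words, short_labels):
    print(w, t)
```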

### Split the dataset into train and test sets

In [0]:
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(new_X, y, test_size=0.2, random_state=10)

In [25]:
len(X_tr)

2560

In [26]:
np.array(X_tr).shape

(2560, 30)

In [27]:
np.array(y_tr).shape

(2560, 30)

### Using ELMo (deep contextualized word representations) as pre-trained word embeddings, similar in role to word2vec

Link to paper: https://arxiv.org/pdf/1802.05365.pdf

The embeddings are loaded from TensorFlow Hub: https://tfhub.dev/google/elmo/1

In [0]:
batch_size = 32
# import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
sess = tf.Session()
K.set_session(sess)

In [29]:
tf.__version__, hub.__version__

('1.15.0', '0.7.0')

In [0]:
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)

In [0]:
def ElmoEmbedding(x):
    return elmo_model(inputs={
                            "tokens": tf.squeeze(tf.cast(x, tf.string)),
                            "sequence_len": tf.constant(batch_size*[max_len])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]
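
Note that `sequence_len` is a constant with exactly `batch_size` entries, so the embedding function presumably expects every batch to contain exactly `batch_size` sentences. This is consistent with the later `predict` call, which uses `32*10` rows. A small sketch, with a hypothetical helper name, of trimming a sample count to whole batches:

```python
def usable_sample_count(n_samples, batch_size):
    """Largest number of samples that splits into full batches only."""
    return (n_samples // batch_size) * batch_size


batch_size = 32
for n in (2560, 640, 650):
    print(n, "->", usable_sample_count(n, batch_size))
```

The train and test sizes used here (2560 and 640) are already multiples of 32, so no samples would be dropped.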

## Model

In [0]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed, Bidirectional, Lambda

In [33]:
input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(input_text)
x = Bidirectional(LSTM(units=50, return_sequences=True, dropout=0.2))(embedding)
# x = Bidirectional(LSTM(units=50, return_sequences=True))(x)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

In [0]:
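def custom_sparse_categorical_accuracy(y_true, y_pred):
    # y_true has shape (batch, max_len, 1); K.max over the last axis drops that axis.
    # A token counts as correct when the argmax of its predicted distribution equals its label.
    return K.cast(K.equal(K.max(y_true, axis=-1),
                          K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                  K.floatx())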
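
As an illustration of what this metric measures, the following plain-Python sketch computes the same per-token comparison on nested lists instead of tensors. `token_accuracy` and the toy scores are hypothetical.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose highest-scoring class equals the true label.

    y_true: sentences of [label] singletons, i.e. shape (n, length, 1)
    y_pred: sentences of per-class score lists, i.e. shape (n, length, n_classes)
    """
    correct = total = 0
    for true_sent, pred_sent in zip(y_true, y_pred):
        for (label,), scores in zip(true_sent, pred_sent):
            predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
            correct += int(predicted == label)
            total += 1
    return correct / total


y_true = [[[0], [2], [1]]]
y_pred = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]]
print(token_accuracy(y_true, y_pred))  # two of the three tokens match
```

Padded positions carry the `O` label and are counted as well, so token accuracy tends to look better than the entity-level scores reported later.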

In [36]:
model = Model(input_text, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=[custom_sparse_categorical_accuracy])

In [0]:
# y_tr = (np.arange(y_tr.max()+1) == y_tr[...,None]).astype(int)
y_tr = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)

In [0]:
y_te = y_te.reshape(y_te.shape[0], y_te.shape[1], 1)

In [45]:
y_tr.shape

(2560, 30, 1)

In [46]:
y_te.shape

(640, 30, 1)

In [47]:
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 30)                0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 30, 1024)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 100)           430000    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 30, 17)            1717      
Total params: 431,717
Trainable params: 431,717
Non-trainable params: 0
_________________________________________________________________
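
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias; the bidirectional wrapper doubles that. The helper names below are hypothetical and the sketch only reproduces the arithmetic.

```python
def lstm_params(input_dim, units):
    """One LSTM layer: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Dense layer: weight matrix plus bias vector."""
    return input_dim * units + units


bilstm = 2 * lstm_params(input_dim=1024, units=50)  # forward + backward LSTM
dense = dense_params(input_dim=100, units=17)       # 2 * 50 BiLSTM outputs -> 17 tags
print(bilstm, dense, bilstm + dense)
```

The ELMo weights do not appear in these totals: the `Lambda` layer is listed with 0 parameters.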


In [48]:
history = model.fit(np.array(X_tr), y_tr, batch_size=batch_size, epochs=1, verbose=1)

Epoch 1/1


In [49]:
!pip install seqeval

Successfully installed seqeval-0.0.12


In [50]:
np.array(X_te[:]).shape

(640, 30)

In [0]:
idx2tag = {i: w for w, i in tags2index.items()}

In [56]:
np.array(X_te[:320])

array([['Mr.', 'Abbas', 'was', ..., 'dummy', 'dummy', 'dummy'],
       ['In', 'his', 'peace', ..., 'forgiveness', '.', 'dummy'],
       ['Officials', 'say', 'the', ..., 'dummy', 'dummy', 'dummy'],
       ...,
       ['Aid', 'is', 'being', ..., 'dummy', 'dummy', 'dummy'],
       ['Soldiers', 'used', 'loud', ..., 'dummy', 'dummy', 'dummy'],
       ['The', 'new', 'measures', ..., 'dummy', 'dummy', 'dummy']],
      dtype='<U21')

In [57]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

test_pred = model.predict(np.array(X_te[:32*10]), verbose=1)

idx2tag = {i: w for w, i in tags2index.items()}

def pred2label(pred):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            p_i = np.argmax(p)
            out_i.append(idx2tag[p_i].replace("dummy", "O"))
        out.append(out_i)
    return out



In [0]:
def test2label(pred):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            out_i.append(idx2tag[p[0]].replace("dummy", "O"))
        out.append(out_i)
    return out

In [59]:
test_pred

array([[[5.45897521e-03, 1.89164199e-03, 2.75707245e-03, ...,
         1.03210774e-03, 2.60246661e-03, 1.18223745e-02],
        [4.65417979e-03, 2.00981158e-03, 4.20976803e-03, ...,
         2.20777374e-03, 3.78012075e-03, 1.98435299e-02],
        [1.21530506e-03, 1.09286630e-04, 8.89657822e-04, ...,
         1.33106441e-04, 2.40229463e-04, 1.53204508e-03],
        ...,
        [6.30277646e-05, 3.35958794e-06, 3.97593030e-05, ...,
         6.22620109e-06, 6.79542063e-06, 5.37094420e-05],
        [7.17398216e-05, 4.24541031e-06, 4.88255901e-05, ...,
         7.82687403e-06, 8.77640650e-06, 6.33322124e-05],
        [2.31050974e-04, 1.89355269e-05, 1.91066487e-04, ...,
         3.67700886e-05, 3.70664311e-05, 2.28486359e-04]],

       [[3.88388592e-03, 3.06456670e-04, 4.01611812e-03, ...,
         3.34000855e-04, 7.45332800e-04, 1.61627699e-02],
        [4.21891687e-03, 3.08939780e-04, 2.69517605e-03, ...,
         4.21482284e-04, 1.22448604e-03, 2.08394676e-02],
        [2.75270687e-03, 

In [0]:
pred_labels = pred2label(test_pred)

In [61]:
test_labels = test2label(y_te[:32*10])
print(classification_report(test_labels, pred_labels))

           precision    recall  f1-score   support

      per       0.46      0.48      0.47        95
      org       0.37      0.24      0.29       141
      tim       0.61      0.57      0.59       138
      gpe       0.73      0.71      0.72        97
      geo       0.59      0.67      0.63       242
      art       0.00      0.00      0.00         4
      eve       0.00      0.00      0.00         2
      nat       0.00      0.00      0.00         2

micro avg       0.57      0.54      0.55       721
macro avg       0.55      0.54      0.54       721
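
seqeval scores predictions at the entity level: a predicted entity counts as correct only if both its boundaries and its type match a gold entity. A minimal sketch of that computation, assuming entities are already available as `(start, end, type)` spans; `entity_prf` and the spans are hypothetical.

```python
def entity_prf(gold, predicted):
    """Precision, recall and F1 over sets of (start, end, type) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical spans: token start, token end (exclusive), entity type
gold = [(6, 7, "geo"), (12, 13, "geo"), (18, 19, "gpe")]
predicted = [(6, 7, "geo"), (12, 13, "org"), (18, 19, "gpe"), (20, 21, "per")]
precision, recall, f1 = entity_prf(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The same F1 formula applied to the micro-average row gives 2 × 0.57 × 0.54 / (0.57 + 0.54) ≈ 0.55, which matches the reported value.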



The scores would likely improve by training on the full dataset (only the first 3,200 of the 47,959 sentences are used here) and for more than one epoch.

### To visualize tags on text

In [62]:
!pip install ipymarkup

Successfully installed ipymarkup-0.5.0


In [0]:
from ipymarkup import show_box_markup
from ipymarkup.palette import palette, BLUE, RED, GREEN, ORANGE, PURPLE

In [0]:
test = X_te[2]
text = ' '.join(test)

spans = []
current_pos2 = 0
for index, word in enumerate(test):
  if index < max_len and word != "dummy":
    # Strip the B-/I- prefix ("B-geo" -> "geo"); "O" stays "O"
    tag = idx2tag[y_te[2][index][0]].split('-')[-1]

    # Character offsets of this word in `text`; the end offset includes the trailing space
    current_pos1 = current_pos2
    current_pos2 += len(word) + 1
    spans.append((current_pos1, current_pos2, tag))

In [65]:
spans

[(0, 10, 'O'),
 (10, 14, 'O'),
 (14, 18, 'O'),
 (18, 23, 'O'),
 (23, 27, 'O'),
 (27, 36, 'O'),
 (36, 39, 'O'),
 (39, 43, 'O'),
 (43, 50, 'O'),
 (50, 54, 'O'),
 (54, 63, 'O'),
 (63, 73, 'tim'),
 (73, 76, 'O'),
 (76, 80, 'O'),
 (80, 88, 'O'),
 (88, 97, 'O'),
 (97, 105, 'O'),
 (105, 109, 'O'),
 (109, 116, 'geo'),
 (116, 123, 'O'),
 (123, 128, 'O'),
 (128, 135, 'O'),
 (135, 139, 'O'),
 (139, 146, 'gpe'),
 (146, 153, 'O'),
 (153, 155, 'O')]
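
Each span above ends where the next one starts, because the end offset includes the space that follows the word. If exact word boundaries are preferred, a variant along the following lines could be used. `token_spans` is a hypothetical helper, shown on a short sample rather than the test sentence.

```python
def token_spans(tokens, sep=" "):
    """Return (start, end) character offsets of each token in sep.join(tokens)."""
    spans, pos = [], 0
    for token in tokens:
        spans.append((pos, pos + len(token)))  # end offset excludes the separator
        pos += len(token) + len(sep)
    return spans


tokens = ["Thousands", "of", "demonstrators"]
text = " ".join(tokens)
spans = token_spans(tokens)
print(spans)

# Slicing the text with each span gives back the token
for (start, end), token in zip(spans, tokens):
    assert text[start:end] == token
```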

In [66]:
show_box_markup(text, spans, palette=palette(tim=BLUE, geo=RED, gpe=ORANGE, O=PURPLE))