# Relation Extraction using DNN

This notebook predicts the relation between the two entities in a sentence. The model receives the tokenized sentence, padded or truncated to max_sent_len, together with the positions of the entities. The entity positions are encoded as a vector of zeros of size max_sent_len in which the positions of the entity words are set to 1.

The model uses a CNN and a GRU to extract word-level information. The features from these layers are then passed through a mask max pooling layer, which pools the features belonging to the entity words in the sentence. The word-level features are also passed through a self-attention layer. The globally max-pooled word-level features, the mask max-pooled entity-level features and the attended features are concatenated and passed to dense layers, which classify the input into one of the relations.

In [1]:
import keras, tensorflow, sys
keras.__version__, tensorflow.__version__, sys.version

Using TensorFlow backend.


('2.2.4',
 '1.11.0',
 '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')

In [2]:
# import the necessary packages

from keras.models import Model, Input

from keras.layers import Dense, LSTM, Dropout, Embedding,  concatenate, Flatten, Permute
from keras.layers import GlobalMaxPooling1D, Convolution1D, CuDNNGRU, Activation, Lambda
from keras.layers import GlobalAveragePooling1D, Concatenate, SpatialDropout1D, Bidirectional

from keras.optimizers import Adam, RMSprop

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.engine.topology import Layer
from keras import initializers as initializers, regularizers, optimizers

from keras import backend as K

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

import regex as re
import pickle
import numpy as np

import tensorflow as tf

from sklearn.preprocessing import LabelEncoder

from nltk.tokenize import word_tokenize

In [3]:
max_sent_len = 100

## Get mask for entity words

Build a mask vector for each sentence: 1 at the positions of entity words, 0 everywhere else.

In [4]:
def get_mask_entities(x, word_index):
    ''' 1 for entity words, 0 otherwise '''
    
    ret = np.zeros_like(x)
    for i in range(x.shape[0]): 
        e1 = [0, 0]
        e2 = [0, 0]
        for j in range(x.shape[1]):
            if x[i][j] == word_index["e1_start"]:
                e1[0] = j
            elif x[i][j] == word_index["e1_end"]: 
                e1[1] = j
            elif x[i][j] == word_index["e2_start"]: 
                e2[0] = j
            elif x[i][j] == word_index["e2_end"]: 
                e2[1] = j
                break
        for j in range(e1[0]+1, e1[1]): 
            ret[i][j] = 1
        for j in range(e2[0]+1, e2[1]): 
            ret[i][j] = 1
    
    return ret
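
As a quick sanity check of the masking rule, the sketch below rebuilds the mask for a single toy row with NumPy. The marker ids and the helper name `toy_entity_mask` are made up for illustration; in the notebook the ids come from `tokenizer.word_index`.

```python
import numpy as np

# Hypothetical marker ids; in the notebook these come from tokenizer.word_index.
E1_START_ID, E1_END_ID, E2_START_ID, E2_END_ID = 1, 2, 3, 4

def toy_entity_mask(row):
    """Return 1 for tokens strictly between a start/end marker pair, else 0."""
    row = np.asarray(row)
    mask = np.zeros_like(row)
    for start_id, end_id in ((E1_START_ID, E1_END_ID), (E2_START_ID, E2_END_ID)):
        starts = np.flatnonzero(row == start_id)
        ends = np.flatnonzero(row == end_id)
        if starts.size and ends.size:
            # The markers themselves stay 0; only the words between them are set.
            mask[starts[0] + 1:ends[0]] = 1
    return mask

# Toy ids: 9 = ordinary word, 0 = padding
row = [9, 1, 9, 2, 9, 3, 9, 9, 4, 0]
mask = toy_entity_mask(row)
print(mask)
```

As in `get_mask_entities`, the start and end markers are not part of the mask, so a multi-word entity produces a run of ones between its two markers.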

## Load Glove Embedding

Load the embedding file 'glove.840B.300d.txt' and compute the mean and standard deviation over all values of the word vectors. Then, for every word in the vocabulary, copy the corresponding vector from the embedding file. Words that have no vector in the embedding file are initialized from a normal distribution with that mean and standard deviation.

In [5]:
def load_glove(word_index):
    EMBEDDING_FILE = '../../embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8"))

    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    num_words = len(word_index)
    embedding_matrix = np.random.normal(emb_mean, emb_std, (num_words, embed_size))
    for word, i in word_index.items():
        if i >= num_words: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 
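
The initialization scheme can be tried on a tiny in-memory stand-in for the GloVe file. This is only a sketch: the vectors and the word index below are invented. Unlike `load_glove`, it allocates one extra row so that id 0 (padding) and the highest word id each get a row; that is a choice of this sketch, not something taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for the GloVe file: word -> vector (hypothetical values).
toy_vectors = {
    "cat": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "dog": np.array([0.2, 0.1, 0.4], dtype="float32"),
}
# Keras-style word index: ids start at 1, id 0 is reserved for padding.
toy_word_index = {"cat": 1, "dog": 2, "e1_start": 3}

all_embs = np.stack(list(toy_vectors.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()  # scalars over all values
embed_size = all_embs.shape[1]

# Start from random rows, then overwrite the rows of known words.
matrix = rng.normal(emb_mean, emb_std, (len(toy_word_index) + 1, embed_size))
for word, idx in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[idx] = vec  # known word: copy the pretrained vector

print(matrix.shape)
```

Rows for words such as the entity markers, which have no pretrained vector, keep their random values and are learned during training when the embedding layer is trainable.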


## Load and Process data file.

SemEval2010_task8: In this task of the SemEval 2010 challenge, the purpose is to find which one of the 19 possible relations holds between the entities in a sentence. The entities in the sentences are marked by <ei>, </ei> tags. Each sample occupies four lines in the file: the first line holds the sample id and the quoted sentence, separated by a tab, and the second line holds the relation.

The data files are uploaded along with the code. The data is loaded and the preprocessing steps are performed.

In [6]:
with open("SemEval2010_task8_all_data/SemEval2010_task8_training/TRAIN_FILE.TXT") as f:
    train_file = f.readlines()


Get the sentences and relations from the lines in the text file and replace the <ei>, </ei> tags with the words Ei_START and Ei_END, so that the tokenizer doesn't treat the '<' and '>' symbols as punctuation. Then word-tokenize the sentences using NLTK word_tokenize.

In [7]:
       
lines = [line.strip() for line in train_file]
sentences, relations = [], []
for idx in range(0, len(lines), 4):
    sentence = lines[idx].split("\t")[1][1:-1]
    label = lines[idx+1]

    sentence = sentence.replace("<e1>", " E1_START ").replace("</e1>", " E1_END ")
    sentence = sentence.replace("<e2>", " E2_START ").replace("</e2>", " E2_END ")

    tokens = word_tokenize(sentence)        
    
    sentences.append(tokens)
    relations.append(label)

print("Number of sentences and relations, i.e. no. of samples:", len(sentences))


Number of sentences and relations, i.e. no. of samples: 8000
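
The tag replacement can be checked on a single line. In this sketch the sentence is made up, and a plain whitespace split stands in for `word_tokenize` so that the example runs without NLTK data; it therefore handles punctuation less carefully than the notebook does.

```python
# A line in the id<TAB>"sentence" layout parsed above (the sentence is made up).
line = '1\t"The <e1>cat</e1> sat on the <e2>mat</e2> ."'

# Drop the leading id and the surrounding double quotes.
sentence = line.split("\t")[1][1:-1]

# Turn the entity tags into ordinary words, padded with spaces.
for tag, word in (("<e1>", " E1_START "), ("</e1>", " E1_END "),
                  ("<e2>", " E2_START "), ("</e2>", " E2_END ")):
    sentence = sentence.replace(tag, word)

# Whitespace split stands in for word_tokenize (an assumption of this sketch).
tokens = sentence.split()
print(tokens)
```

Padding the marker words with spaces keeps them from fusing with the neighboring entity words.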


## Sample text after pre-processing

In [8]:
# Sample text
print(sentences[0])

['The', 'system', 'as', 'described', 'above', 'has', 'its', 'greatest', 'application', 'in', 'an', 'arrayed', 'E1_START', 'configuration', 'E1_END', 'of', 'antenna', 'E2_START', 'elements', 'E2_END', '.']


## Tokenize and pad the text sequences

Tokenize -> map each word to its integer id.

Pad -> truncate or pad with zeros so that all sentences have the same length.

In [9]:
# Tokenize the sentences and then pad/truncate them
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(sentences))
print("Vocab size:", len(tokenizer.word_counts))

sentences_idx = tokenizer.texts_to_sequences(sentences)
sentences_idx = pad_sequences(sentences_idx, maxlen=max_sent_len, padding="post")


Vocab size: 19938
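
What the two steps do can be mimicked in a few lines of plain Python and NumPy. This is a rough sketch, not the Keras implementation: the helper names `build_word_index` and `to_padded_ids` are hypothetical, and the sketch cuts long sentences at the end, whereas `pad_sequences` by default drops tokens from the start.

```python
from collections import Counter
import numpy as np

def build_word_index(token_lists):
    """Map lower-cased words to ids 1..V, most frequent first (0 is padding)."""
    counts = Counter(w.lower() for toks in token_lists for w in toks)
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i + 1 for i, (w, _) in enumerate(ordered)}

def to_padded_ids(token_lists, word_index, maxlen):
    """Convert tokens to ids, then zero-pad or cut at the end to maxlen."""
    out = np.zeros((len(token_lists), maxlen), dtype="int32")
    for r, toks in enumerate(token_lists):
        ids = [word_index[w.lower()] for w in toks if w.lower() in word_index]
        ids = ids[:maxlen]
        out[r, :len(ids)] = ids
    return out

toy_sentences = [["The", "cat", "sat"], ["the", "dog"]]
toy_index = build_word_index(toy_sentences)
toy_ids = to_padded_ids(toy_sentences, toy_index, maxlen=4)
print(toy_index)
print(toy_ids)
```

Because id 0 is never assigned to a word, it can safely serve as the padding value.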


In [10]:
# Get dictionary of word indexes.
word_index = tokenizer.word_index

# Build the entity mask: 1 for entity words, 0 for all other words.
mask_entites = get_mask_entities(sentences_idx, word_index)

In [11]:
# Sample mask
print(mask_entites[0])

# Note the two ones at indices 13 and 18, marking the positions of the entity words.

[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [12]:
# Load Embedding for all the words in the vocab.
embedding = load_glove(word_index)
print("Embedding matrix shape:", embedding.shape)

Embedding matrix shape: (19938, 300)


## Different Relations

Label-encode the relations and list the possible relation classes.

In [13]:
# Encoding the relations to indexes.
label_encoder = LabelEncoder()
relations_idx = label_encoder.fit_transform(relations)
print("Total Number of relations:", len(label_encoder.classes_))
print("Relations:\n", label_encoder.classes_)

Total Number of relations: 19
Relations:
 ['Cause-Effect(e1,e2)' 'Cause-Effect(e2,e1)' 'Component-Whole(e1,e2)'
 'Component-Whole(e2,e1)' 'Content-Container(e1,e2)'
 'Content-Container(e2,e1)' 'Entity-Destination(e1,e2)'
 'Entity-Destination(e2,e1)' 'Entity-Origin(e1,e2)' 'Entity-Origin(e2,e1)'
 'Instrument-Agency(e1,e2)' 'Instrument-Agency(e2,e1)'
 'Member-Collection(e1,e2)' 'Member-Collection(e2,e1)'
 'Message-Topic(e1,e2)' 'Message-Topic(e2,e1)' 'Other'
 'Product-Producer(e1,e2)' 'Product-Producer(e2,e1)']
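
The encoding sorts the distinct label strings and replaces each label by its position in that sorted list. A minimal NumPy sketch of the same idea on a handful of labels, using `np.unique` in place of `LabelEncoder`:

```python
import numpy as np

labels = ["Other", "Cause-Effect(e1,e2)", "Other", "Cause-Effect(e2,e1)"]

# classes: sorted distinct labels; label_ids: index of each label in classes
classes, label_ids = np.unique(labels, return_inverse=True)
print(classes)
print(label_ids)

# Going back from ids to label strings, e.g. to read model predictions
decoded = classes[label_ids]
```

The last step mirrors what `label_encoder.inverse_transform` would be used for when inspecting predicted relations.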


## Building the model

Class MaskMaxPoolingLayer is used to perform max pooling for the masked entity words from the feature vector of the sentence.

In [14]:


class MaskMaxPoolingLayer(Layer):
    
    def __init__(self, **kwargs):
        super(MaskMaxPoolingLayer, self).__init__(**kwargs)
    
    def build(self, input_shape):
        super(MaskMaxPoolingLayer, self).build(input_shape)

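    def call(self, x):
        # x[0]: features (batch, time, channels); x[1]: entity mask (batch, time)
        x_1_float32 = K.cast(x[1], dtype='float32')
        # Move channels first so the (batch, time) mask broadcasts over them
        x_0 = K.permute_dimensions(x[0], pattern=[2, 0, 1])
        x_0 = tf.multiply(x_0, x_1_float32)
        # Restore (batch, time, channels) and take the max over the time axis
        x_0 = K.permute_dimensions(x_0, pattern=[1, 2, 0])
        x_0 = K.max(x_0, axis=-2)
        return x_0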
    
    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], input_shape[0][-1])
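
The computation in `call` can be reproduced with NumPy on made-up numbers. The sketch below uses broadcasting instead of the two permutations, which should give the same result: non-entity time steps are zeroed out, then the maximum over time is taken per channel.

```python
import numpy as np

# Features for one sentence: 4 time steps, 2 channels (made-up numbers).
features = np.array([[[ 0.9, -0.2],
                      [-0.5, -0.7],
                      [ 0.3,  0.8],
                      [ 0.1, -0.4]]])           # shape (batch=1, time=4, channels=2)
mask = np.array([[0, 1, 0, 1]])                 # entity words at steps 1 and 3

# Zero out the non-entity steps, then take the max over the time axis.
masked = features * mask[:, :, None].astype("float32")
pooled = masked.max(axis=1)                     # shape (batch, channels)
print(pooled)
```

One property worth noticing: the masked-out steps contribute zeros to the maximum, so a channel whose entity values are all negative (the second channel here) pools to 0 rather than to its largest negative value.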



Using preloaded glove vectors as embedding weights for the model.

The embedded word vectors are first passed through a 1D convolution and then through a bidirectional GRU. The GRU captures sequential information, while the CNN enriches each embedding with information from neighboring words.

The global max pooling layer takes the maximum of each feature channel over the whole sequence, unlike regular max pooling, where the pool size determines how many values are pooled together.

Mask maxpool layers pool the feature vectors corresponding to the entity words. 

Global max pooled, mask max pooled and Self-attended features of the RNN output are all concatenated and passed to the dense layers.
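
The three pooled views and their concatenation can be sketched for a single sentence in NumPy. The sizes, the random stand-in for the RNN output and the attention weights `w`, `b` are all invented for illustration; with a bidirectional GRU of 64 units per direction the channel count would be 128 rather than 4.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 5, 4                                    # time steps, channels (toy sizes)
h = np.tanh(rng.normal(size=(T, C)))           # stand-in for the RNN output
mask = np.array([0, 1, 0, 0, 1], dtype="float32")

global_max = h.max(axis=0)                     # (C,) max over all words
mask_max = (h * mask[:, None]).max(axis=0)     # (C,) max over entity words

# Attention: one score per time step, softmax over time, weighted sum of h.
w, b = rng.normal(size=(C,)), 0.0              # hypothetical Dense(1) weights
scores = h @ w + b                             # (T,)
alpha = np.exp(scores - scores.max())          # numerically stable softmax
alpha /= alpha.sum()
attended = alpha @ h                           # (C,)

features = np.concatenate([global_max, mask_max, attended])
print(features.shape)
```

The concatenated vector is three times as wide as the RNN output, and it is this vector that the dense layers classify.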

Finally multiple fully-connected layers are used to classify the incoming query into one of the possible relations.

The Adadelta optimizer and sparse categorical crossentropy loss are used.
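
Sparse categorical crossentropy takes integer class ids directly, which is why the label-encoded relations need no one-hot conversion. A small worked example in NumPy, with made-up probabilities:

```python
import numpy as np

# Softmax outputs for 2 samples over 3 classes (made-up numbers).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
y_true = np.array([0, 1])                      # integer class ids, not one-hot

# Per-sample loss: negative log of the probability given to the true class.
per_sample = -np.log(probs[np.arange(len(y_true)), y_true])
loss = per_sample.mean()
print(per_sample, loss)
```

The second sample is penalized more because the model gave its true class a probability of only 0.3.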

In [15]:
# Model

words_input = Input(shape=(max_sent_len,), dtype='int32')
words_input_mask = Input(shape=(max_sent_len,), dtype='int32')

words = Embedding(input_dim=embedding.shape[0], output_dim=embedding.shape[1], weights=[embedding], trainable=True,
                  embeddings_regularizer=regularizers.l2(0.00001))(words_input)

words = Dropout(rate=0.5)(words)

output = Convolution1D(filters=256, kernel_size=3, activation="tanh", padding='same', strides=1)(words)
output = Dropout(rate=0.3)(output)

output = Bidirectional(CuDNNGRU(units=64, return_sequences=True, recurrent_regularizer=regularizers.l2(0.00001)),
                       merge_mode='concat') (output)

output_h = Activation('tanh')(output)

output1 = GlobalMaxPooling1D()(output_h) 

output2 = MaskMaxPoolingLayer()([output_h, words_input_mask]) 

# Applying attention to RNN output
output = Dense(units=1, kernel_regularizer=regularizers.l2(0.00001))(output_h)
output = Permute((2, 1))(output)
output = Activation('softmax', name="attn_softmax")(output)
output = Lambda(lambda x: tf.matmul(x[0], x[1]))([output, output_h])
output3 = Flatten()(output)

# Concatenating maxpooled, mask maxpooled and self attended features.
output = Concatenate()([output1, output2, output3])

output = Dropout(rate=0.3)(output)

output = Dense(units=300, kernel_regularizer=regularizers.l2(0.00001), activation='tanh')(output)

output = Dense(units=len(label_encoder.classes_), kernel_regularizer=regularizers.l2(0.00001))(output)
output = Activation('softmax')(output)

model = Model(inputs=[words_input, words_input_mask], outputs=[output])

model.compile(loss='sparse_categorical_crossentropy', optimizer= optimizers.Adadelta(lr=1.0, decay=0.0),
              metrics=['accuracy'])

model.summary()



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 100, 300)     5981400     input_1[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 100, 300)     0           embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 100, 256)     230656      dropout_1[0][0]                  
__________________________________________________________________________________________________
dropout_2 

In [16]:
model.fit([sentences_idx, mask_entites], relations_idx, shuffle=True, batch_size=128, epochs=100, verbose=0) 

<keras.callbacks.History at 0x179f02ae828>

## Load Test data and evaluate the model.

In [17]:
with open("SemEval2010_task8_all_data/SemEval2010_task8_testing_keys/TEST_FILE_FULL.TXT") as f:
    test_file = f.readlines()

test = [line.strip() for line in test_file]
test_x, test_y = [], []
for idx in range(0, len(test), 4):
    sentence = test[idx].split("\t")[1][1:-1]
    label = test[idx+1]

    sentence = sentence.replace("<e1>", " E1_START ").replace("</e1>", " E1_END ")
    sentence = sentence.replace("<e2>", " E2_START ").replace("</e2>", " E2_END ")

    tokens = word_tokenize(sentence)     

    test_x.append(tokens)
    test_y.append(label)
    
print("Number of sentences and relations, i.e. no. of samples:", len(test_x))

test_x_idx = tokenizer.texts_to_sequences(test_x)
# Note: without padding="post", pad_sequences pads at the start, unlike the training cell.
test_x_idx = pad_sequences(test_x_idx, maxlen=max_sent_len)

test_mask_entites = get_mask_entities(test_x_idx, word_index)

test_y_idx = label_encoder.transform(test_y)

Number of sentences and relations, i.e. no. of samples: 2717


In [18]:
pred = model.predict([test_x_idx, test_mask_entites])
pred = np.argmax(pred, axis=-1)

In [19]:
print("f1_score (macro):", f1_score(test_y_idx, pred, average="macro"),
      "accuracy:", accuracy_score(test_y_idx, pred))

f1_score (macro): 0.7349170060047774 accuracy: 0.7670224512329775
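
The two numbers measure different things: accuracy counts correct predictions, while macro F1 averages the per-class F1 scores so that rare relations weigh as much as frequent ones. A minimal sketch of both on toy labels follows; `macro_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in either array."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
f1 = macro_f1(y_true, y_pred)
acc = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print("macro F1:", f1, "accuracy:", acc)
```

A class that is never predicted gets an F1 of 0 under this definition, which pulls the macro average below the accuracy.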


