## Sequence-to-Sequence Use Case

For the seq2seq use case, we use the **textual content of the
annotated corpus** from the research paper _“Development of a
benchmark corpus to support the automatic extraction of drug-related
adverse effects from medical case reports”_
(www.sciencedirect.com/science/article/pii/S1532046412000615), by H. Gurulingappa et al.

The work presented there can support the development and validation
of methods for the automatic extraction of drug-related
adverse effects from medical case reports. 

The documents are systematically double annotated in various rounds to ensure
consistent annotations. The annotated documents are finally
harmonized to generate representative consensus annotations.

The authors used an open source skip-gram model provided by NLPLab
(http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin),
which was trained on all the PubMed abstracts and PMC full texts (4.08
million distinct words). The output of the skip-gram model is a set
of word vectors of 200 dimensions.

The ADE corpus from the paper by Gurulingappa et al. is distributed
as three files: DRUG-AE.rel, DRUG-DOSE.rel, and ADE-NEG.txt. We
use the DRUG-AE.rel file, which provides relationships between
drugs and adverse effects.

The format of the DRUG-AE.rel file is as follows. Fields are separated by
pipe delimiters:

- Column-1: PubMed-ID : 10030778
- Column-2: Sentence : Intravenous azithromycin-induced ototoxicity.
- Column-3: Adverse-Effect : ototoxicity
- Column-4: Begin offset of Adverse-Effect at ‘document level’: 43
- Column-5: End offset of Adverse-Effect at ‘document level’: 54
- Column-6: Drug: azithromycin
- Column-7: Begin offset of Drug at ‘document level’: 22
- Column-8: End offset of Drug at ‘document level’: 34
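
As a minimal sketch of how one such line could be read, the snippet below splits a record on the pipe delimiter and converts the offsets to integers. The sample line is assembled from the example values in the list above, and the names `DrugAERecord` and `parse_drug_ae_line` are illustrative, not part of the corpus or the notebook.

```python
from typing import NamedTuple


class DrugAERecord(NamedTuple):
    """One row of DRUG-AE.rel (field names are illustrative)."""
    pubmed_id: str
    sentence: str
    adverse_effect: str
    ae_begin: int      # document-level offset
    ae_end: int        # document-level offset
    drug: str
    drug_begin: int    # document-level offset
    drug_end: int      # document-level offset


def parse_drug_ae_line(line: str) -> DrugAERecord:
    """Split one pipe-delimited line into typed fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    pid, sent, ae, ae_b, ae_e, drug, d_b, d_e = fields
    return DrugAERecord(pid, sent, ae, int(ae_b), int(ae_e),
                        drug, int(d_b), int(d_e))


# Sample line assembled from the example values listed above.
sample = ("10030778|Intravenous azithromycin-induced ototoxicity.|"
          "ototoxicity|43|54|azithromycin|22|34")
record = parse_drug_ae_line(sample)

# The offsets are document-level, so the position inside the sentence
# is looked up separately with str.find.
ae_start_in_sentence = record.sentence.find(record.adverse_effect)
print(record.drug, record.adverse_effect, ae_start_in_sentence)
```

Note that the difference between the begin and end offsets equals the length of the entity string, which gives a quick sanity check when reading the file.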

**Download Data (4.1 GB) from:** http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin

**Importing the required packages**

In [1]:
# Importing the required packages
import os
import re
import csv
import codecs
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from string import punctuation
from gensim.models import KeyedVectors

**Check the Keras and TensorFlow version**

In [2]:
import keras
print(keras.__version__)            # Book -> 2.1.2
import tensorflow
print(tensorflow.__version__)       # Book -> 1.13.1

Using TensorFlow backend.


2.2.4
1.13.1


In [None]:
EMBEDDING_FILE = 'wikipedia-pubmed-and-PMC-w2v.bin'
print('Indexing word vectors')
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE,binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))
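
The `vocab` attribute and the `word_vec` method used in this notebook belong to the gensim releases of the same period as Keras 2.2.4. In gensim 4.x the vocabulary is exposed as `key_to_index` instead. The helpers below are a small sketch of a lookup that works with either attribute; `vocab_size`, `lookup`, and the stand-in class are hypothetical names used only for illustration.

```python
def _index(kv):
    """Return the word index of a KeyedVectors-like object."""
    # gensim 4.x exposes `key_to_index`; older releases expose `vocab`.
    index = getattr(kv, "key_to_index", None)
    return kv.vocab if index is None else index


def vocab_size(kv):
    """Number of words known to the vector model."""
    return len(_index(kv))


def lookup(kv, word):
    """Return the vector for `word`, or None when it is out of vocabulary."""
    return kv[word] if word in _index(kv) else None


class _OldStyleVectors(dict):
    """Stand-in that mimics the older `vocab` attribute (illustration only)."""
    @property
    def vocab(self):
        return self


# Toy two-dimensional vector so the sketch runs without gensim installed.
toy = _OldStyleVectors({"ototoxicity": [0.1, 0.2]})
print(vocab_size(toy), lookup(toy, "ototoxicity"), lookup(toy, "unknown"))
```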

In [None]:
import copy
from keras.preprocessing.sequence import pad_sequences

**Reading the text file 'DRUG-AE.rel'**

In [None]:
# Reading the text file 'DRUG-AE.rel' which provides relations between drugs and adverse effects.
TEXT_FILE = 'DRUG-AE.rel'

**Creating input for the model**

The input to our model is a sequence of word vectors, one 200-dimensional
skip-gram vector per token of a sentence. Tokens that are punctuation or
contain digits are dropped.

For each token, the output is a one-hot label over three classes:
[0,0,1] for the token that begins the adverse effect, [0,1,0] for the
remaining tokens of the adverse effect, and [1,0,0] for every other token.
We will append the input data fields, along with their corresponding labels, to the **input_data_ae** and **op_labels_ae** tensors.
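
The labeling scheme can be illustrated on a toy sentence. The sketch below is a simplification under stated assumptions: it tokenizes on whitespace rather than with NLTK, it locates the entity inside the sentence with `str.find` instead of using the offsets from the file, and both the sentence and the function name `label_tokens` are hypothetical.

```python
import re

# One-hot labels in the notebook's order: other, inside, begin.
OTHER, INSIDE, BEGIN = [1, 0, 0], [0, 1, 0], [0, 0, 1]


def label_tokens(sentence, entity):
    """Return (token, label) pairs for one sentence and one entity string."""
    begin = sentence.find(entity)
    if begin < 0:
        raise ValueError("entity not found in sentence")
    end = begin + len(entity)
    pairs = []
    for match in re.finditer(r"\S+", sentence):   # whitespace tokenization
        start = match.start()
        if start == begin:
            label = BEGIN        # first token of the entity
        elif begin < start < end:
            label = INSIDE       # later tokens of the entity
        else:
            label = OTHER        # everything else
        pairs.append((match.group(), label))
    return pairs


# Hypothetical sentence, not taken from the corpus.
pairs = label_tokens("Severe skin rash after amoxicillin therapy.", "skin rash")
for token, label in pairs:
    print(f"{token:12s} {label}")
```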

In [None]:


# Creating lists for the input fields and corresponding labels
input_data_ae = []
op_labels_ae = []

sentences = []

with open(TEXT_FILE, 'r') as f:
    for each_line in f:
        sent_list = np.zeros([0,200])
        labels = np.zeros([0,3])
        tokens = each_line.split("|")
        sent = tokens[1]
        if sent in sentences:            # keep only the first relation per sentence
            continue
        sentences.append(sent)
        begin_offset = int(tokens[3])
        end_offset = int(tokens[4])
        mid_offset = range(begin_offset+1, end_offset)
        word_tokens = nltk.word_tokenize(sent)
        offset = 0
        for each_token in word_tokens:
            offset = sent.find(each_token, offset)
            offset1 = offset             # start offset of the token
            offset += len(each_token)    # end offset of the token
            if each_token in punctuation or re.search(r'\d', each_token):
                continue
            each_token = each_token.lower()
            each_token = re.sub(r"[^A-Za-z\-]+", "", each_token)
            if each_token not in word2vec.vocab:
                continue                 # skip out-of-vocabulary tokens
            new_word = word2vec.word_vec(each_token)
            if offset1 == begin_offset:
                label = [0,0,1]          # first token of the adverse effect
            elif offset == end_offset or offset in mid_offset:
                label = [0,1,0]          # remaining tokens of the adverse effect
            else:
                label = [1,0,0]          # any other token
            sent_list = np.append(sent_list, np.array([new_word]), axis=0)
            labels = np.append(labels, np.array([label]), axis=0)

        input_data_ae.append(sent_list)
        op_labels_ae.append(labels)
input_data_ae = np.array(input_data_ae)
op_labels_ae  = np.array(op_labels_ae)

**Add padding to the input text**, so that every sentence has a fixed
length of 30 time steps (a safe bet!).

In [None]:
input_data_ae = pad_sequences(input_data_ae, maxlen=30, dtype='float64', padding='post')
op_labels_ae = pad_sequences(op_labels_ae, maxlen=30, dtype='float64', padding='post')
print(len(input_data_ae))
print(len(op_labels_ae))
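
To make the effect of the padding call concrete, here is a small NumPy sketch that mimics it for a single sentence. It assumes the behavior of `pad_sequences` with `padding='post'` and the default `truncating='pre'`: zeros are appended at the end, and sequences longer than `maxlen` keep their last `maxlen` rows. The helper `pad_post` and the toy 4-dimensional vectors are illustrative only.

```python
import numpy as np


def pad_post(seq, maxlen, dim):
    """Pad (or truncate) a (n, dim) array to exactly (maxlen, dim)."""
    seq = np.asarray(seq, dtype="float64").reshape(-1, dim)
    seq = seq[-maxlen:]                  # like truncating='pre': keep the last rows
    out = np.zeros((maxlen, dim), dtype="float64")
    out[:len(seq)] = seq                 # like padding='post': zeros at the end
    return out


short = np.ones((2, 4))                                   # 2 tokens
long = np.arange(28, dtype="float64").reshape(7, 4)       # 7 tokens

padded_short = pad_post(short, maxlen=5, dim=4)
padded_long = pad_post(long, maxlen=5, dim=4)
print(padded_short.shape, padded_long.shape)
```

Because truncation removes rows from the front, a sentence with more than 30 retained tokens loses its first tokens rather than its last ones.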

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation,Bidirectional, TimeDistributed
from keras.layers.merge import concatenate
from keras.models import Model, Sequential
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint

**Creating Train and Validation datasets**

In [None]:
# Creating Train and Validation datasets, for 4271 entries, 4000 in train dataset, and 271 in validation dataset
x_train= input_data_ae[:4000]
x_test = input_data_ae[4000:]
y_train = op_labels_ae[:4000]
y_test =op_labels_ae[4000:]

**Defining the model architecture**

We are going to use one hidden layer of a bidirectional LSTM network, with 300 hidden
units and a dropout probability of 0.2. In addition to this, we are making use
of a TimeDistributed dense layer of 60 units, with a dropout probability of 0.2,
followed by a TimeDistributed dense output layer of 3 units.

_Dropout_ is a regularization technique in which, during each training
update, a random subset of a layer's units is set to zero, so those units
neither contribute to the forward pass nor receive an update. That is,
each unit is kept with a probability of 1-dropout and dropped with a
probability of dropout. At prediction time, no units are dropped.
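
The following NumPy sketch shows the idea on a small array. It uses the common "inverted dropout" formulation, in which the surviving units are rescaled by 1/(1-dropout) during training so that the expected activation is unchanged; the function name and the toy input are assumptions made for illustration.

```python
import numpy as np


def dropout(x, rate, training=True, rng=None):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x                          # nothing is dropped at prediction time
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.2, rng=np.random.default_rng(0))
print(dropped)
```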

_Time distributed layers_ are used with RNNs (and LSTMs) to maintain a
one-to-one mapping between input and output time steps. Assume each
sample has 30 time steps with 200 features per step, i.e., 30 × 200, and we
want an output of size 3 at every time step. With a batch of N samples, the
bidirectional LSTM (with return_sequences=True) emits one vector per time
step, and the _TimeDistributed_ wrapper applies the same fully connected
dense layer to each of the 30 time steps independently. The result is an
N × 30 × 3 tensor, with the output kept separate for each time step rather
than flattened across time steps.
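
A short NumPy sketch may help here. It applies one shared dense layer to every time step of a single toy sequence, first step by step and then as a single matrix product, and checks that both give the same result. The random weights and inputs are placeholders, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, classes = 30, 200, 3

x = rng.normal(size=(timesteps, features))    # one padded sentence
W = rng.normal(size=(features, classes))      # shared dense weights
b = np.zeros(classes)                         # shared dense bias

# Apply the same dense layer to every time step, one step at a time.
per_step = np.stack([x[t] @ W + b for t in range(timesteps)])

# The same computation written as one matrix product over the sequence.
vectorized = x @ W + b

print(per_step.shape)
```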

We are also using:
- Loss Function: categorical_crossentropy 
- Optimizer: adam
- Activation Function: softmax
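
For reference, the softmax activation and the categorical cross-entropy loss can be written in a few lines of NumPy. This is a sketch of the standard definitions, not the Keras implementation; the clipping constant `eps` and the toy scores are assumptions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())


logits = np.array([[2.0, 0.5, 0.1],     # toy scores for two tokens
                   [0.0, 0.0, 0.0]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1]])
probs = softmax(logits)
loss = categorical_crossentropy(targets, probs)
print(probs.sum(axis=-1), loss)
```

When all three scores are equal, each class receives a probability of 1/3 and the loss for that token is log(3).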

In [None]:
batch = 1      # Batch size of 1, so the model sees the instances one by one

# Adding Bidirectional LSTM with Dropout, and Time Distributed layer with Dropout
# Finally using Adam optimizer for training purpose
xin = Input(batch_shape=(batch,30,200), dtype='float')
seq = Bidirectional(LSTM(300, return_sequences=True),merge_mode='concat')(xin)
mlp1 = Dropout(0.2)(seq)
mlp2 = TimeDistributed(Dense(60, activation='softmax'))(mlp1)
mlp3 = Dropout(0.2)(mlp2)
mlp4 = TimeDistributed(Dense(3, activation='softmax'))(mlp3)
model = Model(inputs=xin, outputs=mlp4)
model.compile(optimizer='Adam', loss='categorical_crossentropy')

**Training the model**

We are going to train our model for 50 epochs with a batch size of 1.
You can always increase the number of epochs, as long as the model keeps
on improving. One can also create checkpoints, so that later, the model
can be retrieved and used. The idea behind creating the checkpoint is to
save the model weights while training, so that later, you do not have to go
through the same process again. **This has been left as an exercise for the
reader.**
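
As one possible starting point for that exercise, the sketch below builds the two callbacks already imported earlier in the notebook, `ModelCheckpoint` and `EarlyStopping`. The file name pattern, the `patience` value, and the helper name `build_callbacks` are assumptions chosen for illustration.

```python
# Hypothetical file name pattern; Keras fills in the epoch and logged metrics.
CHECKPOINT_PATH = "ade_bilstm.{epoch:02d}-{val_loss:.4f}.h5"


def build_callbacks(path=CHECKPOINT_PATH, patience=5):
    """Return callbacks that save the best weights and stop when val_loss stalls."""
    # Imported lazily so the function can be defined without Keras installed.
    from keras.callbacks import EarlyStopping, ModelCheckpoint

    checkpoint = ModelCheckpoint(path, monitor="val_loss",
                                 save_best_only=True, save_weights_only=True)
    early_stop = EarlyStopping(monitor="val_loss", patience=patience)
    return [checkpoint, early_stop]


# Usage, assuming `model`, `x_train`, `y_train`, `x_test`, `y_test`, and `batch`
# are defined as in the notebook:
# model.fit(x_train, y_train, batch_size=batch, epochs=50,
#           validation_data=(x_test, y_test), callbacks=build_callbacks())
# model.load_weights("<path of a saved checkpoint>")   # restore the weights later
```

With `save_best_only=True`, a file is written only when the monitored validation loss improves, so the most recent file holds the best weights seen so far.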

In [None]:
model.fit(x_train, y_train,
          batch_size=batch,
          epochs=50,
          validation_data=(x_test, y_test))

**Validation of the Model**

We validate the model on the validation dataset of 271 entries.

In [None]:
val_pred = model.predict(x_test,batch_size=batch)
labels = []
for i in range(len(val_pred)):
    b = np.zeros_like(val_pred[i])
    b[np.arange(len(val_pred[i])), val_pred[i].argmax(1)] = 1
    labels.append(b)
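
The indexing trick in the loop above turns each row of softmax probabilities into a one-hot row. The toy example below shows the same operation on a hand-written 4 × 3 array, so the result can be checked by eye; the probabilities are made up.

```python
import numpy as np

# Toy softmax output for one sentence: 4 time steps, 3 classes.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.30, 0.60],
                  [0.20, 0.50, 0.30],
                  [0.90, 0.05, 0.05]])

one_hot = np.zeros_like(probs)
# For each row (time step), set the column with the highest probability to 1.
one_hot[np.arange(len(probs)), probs.argmax(axis=1)] = 1

class_ids = probs.argmax(axis=1)   # the same decision as integer class indices
print(one_hot)
print(class_ids)
```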

**Check the model performance using F1-score, along with precision and recall**

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [None]:
score = []
precision = []
recall = []
point = []      # indices of sentences with a weighted F1-score above 0.6

for i in range(len(y_test)):
    # scikit-learn expects the arguments in the order (y_true, y_pred)
    f1 = f1_score(y_test[i], labels[i], average='weighted')
    if f1 > .6:
        point.append(i)
    score.append(f1)
    precision.append(precision_score(y_test[i], labels[i], average='weighted'))
    recall.append(recall_score(y_test[i], labels[i], average='weighted'))


In [None]:
print(np.mean(score))
print(np.mean(precision))
print(np.mean(recall))
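
The weighted average gives each class a weight proportional to its number of true instances, so the frequent "other" class dominates the per-sentence scores. As a complementary view, the sketch below computes precision, recall, and F1 for a single class from integer label arrays (the argmax of the one-hot rows). The helper `prf_for_class` and the toy labels are hypothetical and are not part of the original evaluation.

```python
import numpy as np


def prf_for_class(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class from integer label arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == cls) & (y_true == cls)))
    fp = int(np.sum((y_pred == cls) & (y_true != cls)))
    fn = int(np.sum((y_pred != cls) & (y_true == cls)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy token labels: 0 = other, 1 = inside, 2 = begin.
true_ids = [0, 2, 1, 0, 0, 0]
pred_ids = [0, 2, 0, 0, 2, 0]
precision, recall, f1 = prf_for_class(true_ids, pred_ids, cls=2)
print(precision, recall, f1)
```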

In [None]:
print(score)
print("\n------x------\n")
print(precision)
print("\n------x------\n")
print(recall)

**Note:** To get better results, we can try
building a denser network, increasing the number of epochs, and increasing the
size of the dataset.