## Argument Mining Practical Session

## PART II - Glove

To use this on Colab, run the following cells. (Authentication required).

**IMPORTANT**: the folder `share` exists on **my** Google Drive. Make sure to change this with a folder of your choice on your Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


The cell above should result in an `already mounted` message.

Folder `share` exists on my Google Drive. I navigate to this folder to store there the data.

In [2]:
%cd gdrive/MyDrive/share

/content/gdrive/MyDrive/share


Make sure you have the content downloaded in the previous notebook.

In [3]:
!ls

abstrct		   glove.6B.200d.txt  glove.6B.50d.txt
glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip


If you haven't already:

In [None]:
! unzip glove.6B.zip

Then (if you want) you can delete from your Google Drive all the extracted files **except** the file `glove_6B_100d.txt`. This will be the one we use.

We're good to go.

## Premise vs Claim Classification

**IMPORTANT** If you run locally (NOT COLAB) start from here.

Let's begin with some imports...

In [4]:
import numpy as np
import pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

I spare you some time by importing functions we implemented in the last notebook.

In [5]:
import os
import re

datadir = './abstrct/AbstRCT_corpus/data/train/neoplasm_train'
annotated = [f for f in os.listdir(datadir) if f.endswith('.ann')]

testdir = './abstrct/AbstRCT_corpus/data/test/neoplasm_test'
testannotated = [f for f in os.listdir(testdir) if f.endswith('.ann')]

def extract_annotated(fpath):
    res = []
    with open(fpath, 'r') as infile:
        for row in infile.readlines():
            row = row.strip()
            if re.match('^T\d', row):
                name, annotation, text = row.split('\t')
                fname = os.path.basename(fpath).replace('.ann', '')    
                name = f'{fname}-{name}'
                is_arg, start, end = annotation.split()
                start = int(start)
                end = int(end)
                res.append((name, is_arg, start, end, text))
    return res

In [6]:
from collections import defaultdict

train_ann_dict = defaultdict(list)
test_ann_dict = defaultdict(list)

for f in annotated:
  fname = f.replace('.ann', '')
  filepath = os.path.join(datadir, f)
  train_ann_dict[fname] = extract_annotated(filepath)

for f in testannotated:
  fname = f.replace('.ann', '')
  filepath = os.path.join(testdir, f)
  test_ann_dict[fname] = extract_annotated(filepath)


In [7]:
train_data = []
for k, list_ in train_ann_dict.items():
  for tup in list_:
    train_data.append((tup[1], tup[4]))

test_data = []
for k, list_ in test_ann_dict.items():
  for tup in list_:
    test_data.append((tup[1], tup[4]))


In [10]:
train_data[0]

('MajorClaim',
 'A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.')

In [11]:
# ('sentence', 1)
train_data = [ (tup[1], 0) if tup[0] == 'Premise' else (tup[1], 1) for tup in train_data ]
test_data = [ (tup[1], 0) if tup[0] == 'Premise' else (tup[1], 1) for tup in test_data ]


In [12]:
train_data[0]

('A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.',
 1)

In [13]:
train = pd.DataFrame(train_data, columns=['text', 'label'])
test = pd.DataFrame(test_data, columns=['text', 'label'])

In [15]:
test.head()

Unnamed: 0,text,label
0,"octreotide, the long acting somatostatin analo...",1
1,The patients treated with octreotide showed a ...,0
2,Monitoring of tumour size changes (US-TC) over...,0
3,whereas neoplasm grew according to an almost e...,0
4,survival was better in treated patients,0


In [16]:
docs = list(train['text'])
labels = list(train['label'])
test_docs = list(test['text'])
test_labels = list(test['label'])

In [17]:
t = Tokenizer()
t.fit_on_texts(docs)

In [18]:
vocab_size = len(t.word_index) + 1

In [20]:
encoded_docs = t.texts_to_sequences(docs)

In [23]:
len(encoded_docs)

2267

In [24]:
encoded_docs[0]

[12,
 160,
 4,
 1064,
 145,
 362,
 48,
 1828,
 8,
 362,
 177,
 11,
 189,
 4,
 65,
 2,
 335,
 7,
 323,
 429,
 1492,
 404,
 31]

In [25]:
t = Tokenizer()
t.fit_on_texts(test_docs)
test_encoded_docs = t.texts_to_sequences(test_docs)
print(len(test_encoded_docs))

686


In [30]:
max([len(x) for x in encoded_docs])

122

In [47]:
!ls

abstrct		   glove.6B.200d.txt  glove.6B.50d.txt
glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip


In [54]:
max_length = 122
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
len(padded_docs)

2267

In [55]:
test_padded_docs = pad_sequences(test_encoded_docs, maxlen=max_length, padding='post')

In [34]:
padded_docs[0]

array([  12,  160,    4, 1064,  145,  362,   48, 1828,    8,  362,  177,
         11,  189,    4,   65,    2,  335,    7,  323,  429, 1492,  404,
         31,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)

In [48]:
embeddings_index = dict()
f = open('./glove.6B.300d.txt')
for line in f:
  values = line.split()
  word = values[0]
  coefs = np.asarray(values[1:], dtype='float32')
  embeddings_index[word] = coefs
f.close()
print(f'{len(embeddings_index)} word vectors loaded')

400000 word vectors loaded


In [None]:
embeddings_index['the']

In [49]:
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in t.word_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

In [None]:
embedding_matrix

In [41]:
# 1 - loaded all annotation
# 2 - tokenize -> encoded_docs (train and test)
# 3 - pad to 130 length (depends on the dataset (max was 122))
# 4 - loaded word embeddings for 400K words {word: [coeffs]}
# 5 - embedding_matrix: words in dataset as rows, and all coeffs for each rows (4K, 100)
# 6 - build a neural network using the pre-trained weights for the WE features

In [56]:
model = Sequential()

e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=122, trainable=False)

model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 122, 300)          1266600   
_________________________________________________________________
flatten_2 (Flatten)          (None, 36600)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 36601     
Total params: 1,303,201
Trainable params: 36,601
Non-trainable params: 1,266,600
_________________________________________________________________
None


In [57]:
model.fit(padded_docs, np.array(labels), epochs=50, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fcb946bac18>

In [58]:
loss, accuracy = model.evaluate(test_padded_docs, np.array(test_labels), verbose=0)

In [59]:
accuracy * 100

63.55684995651245