# NAACL 2018 Shared Task - Metaphor Detection

This notebook implements a method for metaphor detection using Keras, as part of the [thesis](TODO). It is based on the [NAACL 2018 Shared Task for Metaphor Detection](TODO), but the model was not entered in the competition.

For further details on the Shared Task, visit: https://github.com/EducationalTestingService/metaphor/tree/master/NAACL-FLP-shared-task

## Prerequisites 

Install the Python 3 requirements from `requirements.txt`:

```
pip3 install -r requirements.txt
```

Download word embeddings for encoding lexical items (Gensim KeyedVectors or Pymagnitude):

```
curl TODO
```

Download the VUAM Corpus (it cannot be included here for copyright reasons):

```
curl http://ota.ahds.ac.uk/headers/2541.xml
# Or use python functions provided in utils module (see below)
```

In [None]:
# Importing custom modules
import utils
import corpus
import evaluate
import features

# Import general dependencies
import numpy
import os
import collections
from keras.utils import to_categorical
from keras.layers import TimeDistributed, Bidirectional, LSTM, Input, Masking, Dense
from keras.models import Model
from keras import backend as kerasbackend
from sklearn.model_selection import KFold

# VUAM Corpus

The VUAMC is the basis for this task. However, it cannot be included in the repository for copyright reasons.

The next cell checks whether the VUAMC has been downloaded and downloads it if necessary. It also generates the CSV files using the converter provided for the NAACL shared task.

In [None]:
# Regenerate the CSV files if either the test or the training file is missing
if not os.path.exists('source/vuamc_corpus_test.csv') or not os.path.exists('source/vuamc_corpus_train.csv'):
    print('VUAMC training and test data not found. Generating...')
    utils.download_vuamc_xml()
    utils.generate_vuamc_csv()
    print('VUAMC CSV generated')

# Test and Training Corpus

The next cell converts the training and test CSV files into Corpus objects. A Corpus object manages the sentences of the given corpus and provides helper functions, for example to list all labels or all tokens.

The validation step checks that the tokens in the corpus align with the tokens in the training/test files.

In [None]:
# Load Train Corpus from CSV
c_train = corpus.VUAMC('source/vuamc_corpus_train.csv', 'source/verb_tokens_train_gold_labels.csv')
c_train.validate_corpus()
print('Loaded and validated training corpus')

# Load Test Corpus from CSV
c_test = corpus.VUAMC('source/vuamc_corpus_test.csv', 'source/verb_tokens_test.csv', mode='test')
c_test.validate_corpus()
print('Loaded and validated test corpus')

# Corpus Validation

The next cell will show that the training data is highly imbalanced. The model is trained as a binary classifier, using 0 to encode non-metaphor tokens and 1 to encode metaphor tokens.

The training set, however, contains significantly more non-metaphor tokens than metaphor tokens. With an unweighted loss, training fails because the model learns to predict 0 almost everywhere. The ratio calculated here is therefore used to set up a *weighted_categorical_crossentropy* loss function that counteracts this imbalance.
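
As a rough illustration of the ratio calculation, the sketch below works on a toy label list. It assumes that `utils.simplify_ratio` reduces the two percentages by their greatest common divisor; the helper `class_ratio_sketch` is hypothetical and not part of the project's `utils` module.

```python
from collections import Counter
from math import gcd


def class_ratio_sketch(labels):
    """Return the simplified (non-metaphor, metaphor) percentage ratio.

    Hypothetical helper: assumes the ratio is reduced by the greatest
    common divisor of the two rounded percentages.
    """
    counts = Counter(labels)
    pct_metaphor = round(counts[1] / len(labels) * 100)
    pct_non_metaphor = 100 - pct_metaphor
    divisor = gcd(pct_non_metaphor, pct_metaphor) or 1
    return pct_non_metaphor // divisor, pct_metaphor // divisor


# Toy data: 8 non-metaphor tokens and 2 metaphor tokens
toy_labels = [0] * 8 + [1] * 2
print(class_ratio_sketch(toy_labels))
```

With 80% non-metaphor and 20% metaphor tokens, the percentages reduce to a 4:1 ratio.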

In [None]:
number_of_all_labels = len(c_train.label_list)
count_of_label_classes = collections.Counter(c_train.label_list)

percentage_of_non_metaphor_tokens = round(count_of_label_classes[0] / number_of_all_labels * 100)
percentage_of_metaphor_tokens = round(count_of_label_classes[1] / number_of_all_labels * 100)
ratio = utils.simplify_ratio(percentage_of_non_metaphor_tokens, percentage_of_metaphor_tokens)
assert(percentage_of_non_metaphor_tokens + percentage_of_metaphor_tokens == 100)

print('Percentage of metaphor tokens: {}'.format(percentage_of_metaphor_tokens))
print('Percentage of non-metaphor tokens: {}'.format(percentage_of_non_metaphor_tokens))
print('Ratio: {}:{}'.format(ratio[0], ratio[1]))

# Model Configuration

The next cell is the primary configuration for the model.

## Weighted Categorical Crossentropy

As described above, the training set is highly imbalanced. Therefore we use a weighted categorical crossentropy (`weighted_categorical_crossentropy`) as the training loss. The class weights passed to this loss can be adjusted here.
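
To make the idea concrete, here is a minimal pure-Python sketch of a class-weighted crossentropy for a single token. It is not the implementation in `utils`; the function name, the weight values, and the way weights are assigned to classes are assumptions made for illustration.

```python
import math


def weighted_crossentropy_sketch(y_true, y_pred, class_weights, eps=1e-7):
    """Loss for one token: -sum_c w_c * y_true_c * log(y_pred_c)."""
    loss = 0.0
    for target, prob, weight in zip(y_true, y_pred, class_weights):
        prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= weight * target * math.log(prob)
    return loss


# Hypothetical weights: errors on metaphor tokens cost four times as much
weights = (1, 4)
uncertain = (0.5, 0.5)  # the model is undecided between both classes

loss_non_metaphor = weighted_crossentropy_sketch((1, 0), uncertain, weights)
loss_metaphor = weighted_crossentropy_sketch((0, 1), uncertain, weights)
print(loss_non_metaphor, loss_metaphor)
```

For the same uncertain prediction, the metaphor token contributes four times the loss of the non-metaphor token, which pushes the model away from always predicting 0.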

In [None]:
MAX_SENTENCE_LENGTH = 50
EMBEDDING_DIM = 300
KFOLD_SPLIT = 5
KERAS_OPTIMIZER = 'rmsprop'
KERAS_METRICS = [utils.f1]
KERAS_EPOCHS = 1
KERAS_BATCH_SIZE = 32
KERAS_ACTIVATION = 'softmax'
KERAS_DROPOUT = 0.25

KERAS_LOSS = utils.weighted_categorical_crossentropy(ratio)
print('loss_weights: {}'.format(ratio))

# Word Embeddings

The model uses word embeddings to encode lexical items as real-valued vectors that capture their semantics.

The next cell loads the embeddings for the training and test corpus. This is done through a polymorphic class that implements the *Embeddings* interface, so switching to different embeddings only requires changing the Embeddings object.

After the corpora are encoded, the Embeddings object is deleted to free up memory (this matters less for embedding models that use lazy loading, since they keep little in memory to begin with).
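
The interface idea can be sketched as follows. The class and method names here (`EmbeddingsSketch`, `get_vector`, `encode_sentence`) are hypothetical and only illustrate the pattern; the actual interface in the `features` module may look different.

```python
from abc import ABC, abstractmethod


class EmbeddingsSketch(ABC):
    """Hypothetical interface: every embedding source maps a token to a vector."""

    @abstractmethod
    def get_vector(self, token):
        ...


class ConstantEmbeddings(EmbeddingsSketch):
    """Returns the same zero vector for every token (useful for dry runs)."""

    def __init__(self, dimensions):
        self.dimensions = dimensions

    def get_vector(self, token):
        return [0.0] * self.dimensions


class LookupEmbeddings(EmbeddingsSketch):
    """Looks tokens up in an in-memory table, with a zero vector as fallback."""

    def __init__(self, table, dimensions):
        self.table = table
        self.dimensions = dimensions

    def get_vector(self, token):
        return self.table.get(token.lower(), [0.0] * self.dimensions)


def encode_sentence(tokens, vectors):
    # Works with any object that implements the interface
    return [vectors.get_vector(token) for token in tokens]


lookup = LookupEmbeddings({'time': [0.1, 0.2, 0.3]}, dimensions=3)
print(encode_sentence(['Time', 'flies'], lookup))
print(encode_sentence(['Time', 'flies'], ConstantEmbeddings(3)))
```

Because `encode_sentence` only relies on `get_vector`, the embedding source can be exchanged without touching the encoding code.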

In [None]:
# Uncomment to use different Embeddings
# embeddings = features.Word2Vec()
# embeddings = features.Magnitude()
embeddings = features.DummyEmbeddings(EMBEDDING_DIM)

x_input, y_labels = features.generate_input_and_labels(c_train.sentences, Vectors=embeddings)
x_test, y_test = features.generate_input_and_labels(c_test.sentences, Vectors=embeddings)

# Free up some memory
del embeddings
print('Deleted Word Embeddings')

# Training labels need to be categorical, with 2 classes (0-non-metaphor, 1-metaphor)
y_labels = to_categorical(y_labels, 2)

# The Model

This cell builds and compiles the model used in the task.

 - Input: The input layer receives the encoded sentences. Shape: (MAX_SENTENCE_LENGTH, EMBEDDING_DIM)
 - Core: The core of the model is a bidirectional LSTM with recurrent dropout
 - Output: The output layer is a time-distributed dense layer that predicts one of 2 classes (0|1) for each token
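
Since the input layer expects a fixed shape, every sentence has to be brought to the same length. The sketch below shows one way to do this, padding with vectors of -1 to match the mask value used by the `Masking` layer in the next cell. Whether `features.generate_input_and_labels` pads and truncates in exactly this way is an assumption; `pad_sentence` is a hypothetical helper.

```python
def pad_sentence(vectors, max_length, embedding_dim, pad_value=-1.0):
    """Truncate or pad a list of token vectors to exactly max_length rows."""
    padded = [list(vector) for vector in vectors[:max_length]]
    missing = max_length - len(padded)
    # Padding rows consist only of pad_value, so a Masking layer can skip them
    padded.extend([pad_value] * embedding_dim for _ in range(missing))
    return padded


# Toy sentence with 2 tokens and 3-dimensional embeddings, padded to length 4
sentence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded_sentence = pad_sentence(sentence, max_length=4, embedding_dim=3)
for row in padded_sentence:
    print(row)
```

The first two rows keep the original token vectors, while the last two rows are pure padding that the masking step is meant to ignore.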

In [None]:
inputs = Input(shape=(MAX_SENTENCE_LENGTH, EMBEDDING_DIM))
model = Masking(mask_value=[-1] * EMBEDDING_DIM)(inputs)
model = Bidirectional(LSTM(100, return_sequences=True, dropout=0, recurrent_dropout=KERAS_DROPOUT))(model)
outputs = TimeDistributed(Dense(2, activation=KERAS_ACTIVATION))(model)
model = Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=KERAS_OPTIMIZER, loss=KERAS_LOSS, metrics=KERAS_METRICS)
model.summary()

# Generate Training and Validation split

To further optimize the training we use a K-fold split of the training data. The input data and labels are split *n* times into a training and a validation subset, and the model is fitted on the training subset of each split.
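
The splitting itself can be illustrated without any dependencies. The following sketch is a simplified stand-in for scikit-learn's `KFold` and does not reproduce its exact shuffling or fold sizes; `kfold_indices` is a hypothetical helper.

```python
import random


def kfold_indices(n_samples, n_splits, seed=1337):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_splits folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for held_out in range(n_splits):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


splits = list(kfold_indices(n_samples=10, n_splits=5))
for train, validation in splits:
    print('train:', sorted(train), 'validation:', sorted(validation))
```

Every sample ends up in the validation subset of exactly one fold and in the training subset of all other folds.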

In [None]:
kfold = KFold(n_splits=KFOLD_SPLIT, shuffle=True, random_state=1337)
history = []

for train, test in kfold.split(x_input, y_labels):
    x_train = x_input[train]
    x_val = x_input[test]
    y_train = y_labels[train]
    y_val = y_labels[test]

    # Fit the model for each split and keep the per-epoch metrics of this fold
    fold_history = model.fit(x_train, y_train,
                             batch_size=KERAS_BATCH_SIZE,
                             epochs=KERAS_EPOCHS,
                             validation_data=(x_val, y_val))
    history.append(fold_history.history)

    # evaluate() returns [loss, f1], matching the loss and metrics passed to compile()
    scores = model.evaluate(x_val, y_val)
    print('Validation loss: {:.4f}'.format(scores[0]))
    print('Validation F1: {:.2%}'.format(scores[1]))

# Prediction and Evaluation

To evaluate the model we use the test corpus and generate predictions (labels) for the input sentences. Each sentence receives a list of binary classes (0|1), one for each of its tokens.
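
Turning the per-token class probabilities into binary labels amounts to picking the index of the most probable class. A minimal sketch on toy values, where `to_binary_labels` is a hypothetical helper:

```python
def to_binary_labels(token_probabilities):
    """Map each [p(non-metaphor), p(metaphor)] pair to the label 0 or 1."""
    return [max(range(len(probs)), key=probs.__getitem__)
            for probs in token_probabilities]


# Toy softmax output for a sentence with three tokens
sentence_probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
binary_labels = to_binary_labels(sentence_probabilities)
print(binary_labels)
```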

The predictions are saved in a CSV file in a format similar to the gold labels of the NAACL shared task. Using both files (predictions and gold standard) we evaluate the performance of the model.

Performance is measured in terms of precision and recall, combined into the F1 score.
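
For reference, the three measures can be computed from two label lists as sketched below, treating the metaphor class (1) as the positive class. This is an illustration only and not the code in the `evaluate` module; `prf1_sketch` is a hypothetical name.

```python
def prf1_sketch(gold, predicted):
    """Return (precision, recall, f1) for the positive class 1."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)

    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold_labels = [1, 1, 0, 0]
predicted_labels = [1, 0, 1, 0]
print(prf1_sketch(gold_labels, predicted_labels))
```

In this toy case one of two predicted metaphors is correct and one of two actual metaphors is found, so precision, recall and F1 are all 0.5.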

In [None]:
# Get float predictions and turn them into binary labels (index of the most probable class)
float_predictions = model.predict(x_test, batch_size=KERAS_BATCH_SIZE)
label_predictions = numpy.argmax(float_predictions, axis=-1)

# Output file for the predictions and the gold-standard file to compare against
predictions_file = 'predictions.csv'
standard_file = 'source/verb_tokens_test_gold_labels.csv'

# Write the predictions.csv and compare to gold standard
rows = evaluate.corpus_evaluation(c_test, label_predictions, MAX_SENTENCE_LENGTH)
evaluate.csv_evalutation(rows, predictions_file)
results = evaluate.precision_recall_f1(predictions_file, standard_file)

print(results)

# Model Training Plot

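The following plot shows the loss and the F1 score on the training and validation data per epoch. Note that `model.history` only holds the metrics of the most recent `fit` call, i.e. the last fold.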

In [None]:
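# Import the submodules explicitly so that plotly.graph_objs and plotly.offline are available
import plotly.graph_objs
import plotly.offline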
plotly.offline.init_notebook_mode(connected=True)

loss_p = plotly.graph_objs.Scatter(
    y = model.history.history['loss'],
    mode = 'lines+markers',
    name = 'Loss'
)

val_loss_p = plotly.graph_objs.Scatter(
    y = model.history.history['val_loss'],
    mode = 'lines+markers',
    name = 'Validation Loss'
)

acc_p = plotly.graph_objs.Scatter(
    y = model.history.history['f1'],
    mode = 'lines+markers',
    name = 'F1'
)

val_acc_p = plotly.graph_objs.Scatter(
    y = model.history.history['val_f1'],
    mode = 'lines+markers',
    name = 'Validation F1'
)

layout = plotly.graph_objs.Layout(title="Training History",
                yaxis=dict(title='Value'),
                xaxis=dict(title='Epoch'))

data = [loss_p, acc_p, val_loss_p, val_acc_p]
fig = plotly.graph_objs.Figure(data=data, layout=layout)

plotly.offline.iplot(fig, filename='jupyter-train-history')