# NAACL 2018 Shared Task - Metaphor Detection

This notebook implements a neural network method for metaphor detection using Keras, as part of my [bachelor thesis](https://github.com/martialblog/bachelor-thesis-code). It is based on the [NAACL 2018 Shared Task for Metaphor Detection](https://sites.google.com/site/figlangworkshop/shared-task), but it was not entered in the competition.

For further details on the Shared Task and the training data visit: https://github.com/EducationalTestingService/metaphor/tree/master/NAACL-FLP-shared-task

## Table of contents

- [Prerequisites](#prerequisites)
- [Download VUAM Corpus](#vuamc_generation)
- [Generate Training and Test Data](#corpus_generation)
- [Validate Training and Test Data](#corpus_validation)
- [Keras Model Configuration](#model_configuration)
- [Load Word Embeddings](#word_embeddings)
- [Keras Model Compilation](#model_compilation)
- [Model Training](#training)
- [Model Evaluation](#evaluation)
- [Plot Training Phase](#training_plot)

<a id='prerequisites'></a>
## Prerequisites 

Install the Python 3 dependencies listed in *requirements.txt*:

```
pip3 install -r requirements.txt
```

Download the Word Embeddings for encoding lexical items (Gensim KeyedVectors, or pymagnitude) into the *source/* directory. Example for pymagnitude:

```
cd source/
curl -O http://magnitude.plasticity.ai/fasttext+subword/wiki-news-300d-1M.magnitude
curl -O http://magnitude.plasticity.ai/word2vec+subword/GoogleNews-vectors-negative300.magnitude
```

- https://github.com/plasticityai/magnitude
- https://code.google.com/archive/p/word2vec/

Download the VUAM Corpus as XML (can't be included due to licence restrictions) into the *starterkits/* directory. **Hint**: There is a cell in this Notebook that will do that. See [VUAM Corpus](#vuamc_generation).

```
cd starterkits/
curl -O http://ota.ahds.ac.uk/headers/2541.xml

# Or use the Python functions provided in the utils module
python3 -i utils.py
download_vuamc_xml()
```

The VUAMC needs to be converted into a CSV file and placed into the *source/* directory. This is done by using the starterkit scripts provided by the NAACL, which are included in the repository, or a Python function.

```
cd starterkits/
python3 vua_xml_parser.py
python3 vua_xml_parser_test.py

# Or use the Python functions provided in the utils module
python3 -i utils.py
generate_vuamc_csv()
```

In [None]:
# Importing custom modules
import utils
import corpus
import evaluate
import features

# Import general dependencies
import numpy
import os
import collections
from keras.utils import to_categorical
from keras.layers import TimeDistributed, Bidirectional, LSTM, Input, Masking, Dense
from keras.models import Model
from keras import backend as kerasbackend
from sklearn.model_selection import KFold
from keras.preprocessing.text import Tokenizer

<a id='vuamc_generation'></a>
# VUAM Corpus

The VUAMC is the training set for this task. However, it could not be included in the repository due to its licence.

The next cell will check if the VUAMC is downloaded correctly and will take care of it if necessary. It will also generate the CSV files using the converter provided by the NAACL.

In [None]:
if not os.path.exists('source/vuamc_corpus_test.csv') or not os.path.exists('source/vuamc_corpus_train.csv'):
    print('VUAMC training and test data not found. Generating...')
    utils.download_vuamc_xml()
    utils.generate_vuamc_csv()
    print('VUAMC CSV generated')

<a id='corpus_generation'></a>
# Test and Training Corpus

The next cell will load the training and test CSV files into *Corpus* objects. A *Corpus* object manages the sentences of the given corpus at runtime and provides functions such as listing all labels or listing all tokens.

The validation checks if the tokens in the corpus and the tokens in the training/test files align.

In [None]:
# Load Train Corpus from CSV
# c_train = corpus.VUAMC('source/vuamc_corpus_train.csv', 'source/verb_tokens_train_gold_labels.csv', 'source/vuamc_corpus_train_pos.csv')
c_train = corpus.VUAMC('source/vuamc_corpus_train.csv', 'source/all_pos_tokens_train_gold_labels.csv', 'source/vuamc_corpus_train_pos.csv')
c_train.validate_corpus()
print('Loaded and validated training corpus')

# Load Test Corpus from CSV
# c_test = corpus.VUAMC('source/vuamc_corpus_test.csv', 'source/verb_tokens_test.csv', 'source/vuamc_corpus_test_pos.csv', mode='test')
c_test = corpus.VUAMC('source/vuamc_corpus_test.csv', 'source/all_pos_tokens_test.csv', 'source/vuamc_corpus_test_pos.csv', mode='test')
c_test.validate_corpus()
print('Loaded and validated test corpus')

<a id='corpus_validation'></a>
# Corpus Validation

For the training of the model we will use a binary classification, using 0 to encode non-metaphor tokens and 1 to encode metaphor tokens. 

The next cell demonstrates that the training data is highly imbalanced: the training set contains far more non-metaphor tokens than metaphor tokens. Left uncorrected, this imbalance undermines the training, because a model that almost always predicts 0 is still right most of the time.

To mitigate the imbalance, a *weighted_categorical_crossentropy* loss will be introduced later.

In [None]:
number_of_all_labels = len(c_train.label_list)
count_of_label_classes = collections.Counter(c_train.label_list)

percentage_of_non_metaphor_tokens = round(count_of_label_classes[0] / number_of_all_labels * 100)
percentage_of_metaphor_tokens = round(count_of_label_classes[1] / number_of_all_labels * 100)
ratio = utils.simplify_ratio(percentage_of_non_metaphor_tokens, percentage_of_metaphor_tokens)
assert(percentage_of_non_metaphor_tokens + percentage_of_metaphor_tokens == 100)

print('Percentage of metaphor tokens: {}%'.format(percentage_of_metaphor_tokens))
print('Percentage of non-metaphor tokens: {}%'.format(percentage_of_non_metaphor_tokens))
print('Ratio: {}:{}'.format(ratio[0], ratio[1]))
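
The same bookkeeping can be tried on a toy label list without loading the corpus. This is only a sketch: `label_balance` is a hypothetical helper, and reducing the ratio with the greatest common divisor is an assumption about what `utils.simplify_ratio` does, since its implementation is not shown here.

```python
import collections
import math


def label_balance(labels):
    """Return per-class percentages and the reduced ratio of class 0 to class 1."""
    counts = collections.Counter(labels)
    total = len(labels)
    percentages = {cls: counts[cls] / total * 100 for cls in (0, 1)}
    # Reduce the raw counts by their greatest common divisor (assumed behaviour).
    divisor = math.gcd(counts[0], counts[1]) or 1
    ratio = (counts[0] // divisor, counts[1] // divisor)
    return percentages, ratio


# Toy data: 9 non-metaphor tokens and 3 metaphor tokens
toy_labels = [0] * 9 + [1] * 3
percentages, ratio = label_balance(toy_labels)
print(percentages)  # {0: 75.0, 1: 25.0}
print(ratio)        # (3, 1)
```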

<a id='model_configuration'></a>
# Model Configuration

The next cell is the primary configuration for the model. Change the parameters here to change the training.

## Weighted Categorical Crossentropy

As described above, the training set is highly imbalanced. Therefore, we will use a *weighted categorical crossentropy* to calculate the loss during the training.  The weights for the classes can be adjusted using the **WEIGHT_SMOOTHING** constant.

In [None]:
MAX_SENTENCE_LENGTH = 50
WEIGHT_SMOOTHING = 0.0
EMBEDDING_DIM = 300
KFOLD_SPLIT = 8
KERAS_OPTIMIZER = 'rmsprop'
KERAS_METRICS = [utils.precision, utils.recall, utils.f1]
KERAS_EPOCHS = 5
KERAS_BATCH_SIZE = 32
KERAS_ACTIVATION = 'softmax'
KERAS_DROPOUT = 0.25

# See help(utils.get_class_weights) for details
class_weights = list(utils.get_class_weights(c_train.label_list, WEIGHT_SMOOTHING).values())
print('loss_weight {}'.format(class_weights))
KERAS_LOSS = utils.weighted_categorical_crossentropy(class_weights)
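
To make the idea behind the weighted loss concrete, here is a minimal NumPy sketch. It assumes inverse-frequency class weights and a per-class weighting of the usual cross-entropy term. The actual `utils.get_class_weights` and `utils.weighted_categorical_crossentropy` are not shown in this notebook and may differ, in particular in how **WEIGHT_SMOOTHING** is applied. Both function names below are illustrative only.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes=2):
    """Weight each class by total / (num_classes * class_count).

    Assumes every class occurs at least once in `labels`.
    """
    counts = np.bincount(labels, minlength=num_classes)
    return len(labels) / (num_classes * counts)


def weighted_categorical_crossentropy(y_true, y_pred, weights, eps=1e-7):
    """Mean over tokens of -sum_c weights[c] * y_true[c] * log(y_pred[c])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_token = -np.sum(y_true * np.log(y_pred) * weights, axis=-1)
    return per_token.mean()


labels = np.array([0, 0, 0, 1])              # 3:1 imbalance
weights = inverse_frequency_weights(labels)  # rare class gets the larger weight
y_true = np.eye(2)[labels]                   # one-hot gold labels
y_pred = np.full((4, 2), 0.5)                # an uninformative prediction
loss = weighted_categorical_crossentropy(y_true, y_pred, weights)
print(weights, loss)
```

With these weights, a mistake on the rare metaphor class costs three times as much as a mistake on the frequent class, so always predicting 0 is no longer a cheap strategy.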

<a id='word_embeddings'></a>
# Word Embeddings

The model uses Word Embeddings to encode lexical items as real-valued vectors. The next cell will load the Embeddings for the training and test corpus.

This is done by using a polymorphic class that implements the *Embeddings* interface. That way, changing embeddings is as simple as changing the *Embeddings Object*. Some examples are given in the comments.

After both corpora are encoded, the embeddings object is deleted to free up memory (some embedding libraries load vectors lazily, in which case little memory is held in the first place).

In [None]:
# Uncomment to use different Embeddings
# embeddings = features.Word2Vec()
# embeddings = features.Magnitudes(filepath='customembeddings.magnitude')
# embeddings = features.DummyEmbeddings(EMBEDDING_DIM)
embeddings = features.Magnitudes()

x_input, y_labels, z_postags = features.generate_input_and_labels(c_train.sentences, Vectors=embeddings)
x_test, y_test, z_testtags = features.generate_input_and_labels(c_test.sentences, Vectors=embeddings)
print('Generated Word Embeddings')

# Free up some memory
del embeddings
print('Deleted Embeddings Object')

# POS Tags to numerical sequences
pos_tokenizer = Tokenizer()
pos_tokenizer.fit_on_texts(z_postags)
pos_sequences = pos_tokenizer.texts_to_sequences(z_postags)
pos_test_sequences = pos_tokenizer.texts_to_sequences(z_testtags)

# Training labels need to be categorical, with 2 classes (0-non-metaphor, 1-metaphor)
y_labels = to_categorical(y_labels, 2)
z_pos = to_categorical(pos_sequences)
z_test = to_categorical(pos_test_sequences)
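
The model expects every sentence to have the same length. The sketch below shows one way to bring a sentence to a fixed length and to one-hot encode labels with plain NumPy. It is an illustration under assumptions: how `features.generate_input_and_labels` pads is not shown here, and the padding value of -1 is inferred from the `mask_value` of the `Masking` layer in the model cell. `pad_sentence`, `one_hot` and `timestep_mask` are hypothetical names.

```python
import numpy as np


def pad_sentence(vectors, max_length, embedding_dim, pad_value=-1.0):
    """Truncate or right-pad a (tokens, embedding_dim) array to max_length rows."""
    padded = np.full((max_length, embedding_dim), pad_value, dtype=np.float32)
    vectors = np.asarray(vectors, dtype=np.float32).reshape(-1, embedding_dim)
    vectors = vectors[:max_length]
    padded[:len(vectors)] = vectors
    return padded


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows, like to_categorical does."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]


def timestep_mask(batch, mask_value=-1.0):
    """True where a timestep carries data, False where every feature equals mask_value."""
    return np.any(batch != mask_value, axis=-1)


sentence = np.ones((3, 4))  # 3 tokens with 4-dimensional toy embeddings
padded = pad_sentence(sentence, max_length=5, embedding_dim=4)
mask = timestep_mask(padded)
print(padded.shape)              # (5, 4)
print(mask)                      # [ True  True  True False False]
print(one_hot([0, 1, 0], 2).shape)  # (3, 2)
```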

<a id='model_compilation'></a>
# The Model

This cell compiles the model used in the Task.

 - Input: The input layer will receive the encoded sentences. Shape: Sentence Length * Embedding Dimensions
 - POS Tags: POS tags can be excluded by removing the Input layer for them
 - Core: The core of the model is a bidirectional LSTM with recurrent dropout
 - Output: The output layer is a time-distributed dense layer with predictions for 2 classes (0|1)

In [None]:
postags = Input(shape=(MAX_SENTENCE_LENGTH, 17))
sentences = Input(shape=(MAX_SENTENCE_LENGTH, EMBEDDING_DIM))
model = Masking(mask_value=[-1] * EMBEDDING_DIM)(sentences)
model = Bidirectional(LSTM(100, return_sequences=True, dropout=0, recurrent_dropout=KERAS_DROPOUT))(model)
outputs = TimeDistributed(Dense(2, activation=KERAS_ACTIVATION))(model)
model = Model(inputs=[sentences, postags], outputs=outputs)
# Model without POS tags:
# model = Model(inputs=[sentences], outputs=outputs)

model.compile(optimizer=KERAS_OPTIMIZER, loss=KERAS_LOSS, metrics=KERAS_METRICS)
model.summary()

<a id='training'></a>
# Generate Training and Validation split

To further optimize the training we will use a k-fold split on the training data. This will split the input data and labels *k* times (into a training and a validation subset) and fit the model each time on the training subset.

**Hint:** If the model should not use POS tags, the POS input and validation data need to be removed here as well.

In [None]:
kfold = KFold(n_splits=KFOLD_SPLIT, shuffle=True, random_state=1337)
histories = []

for train, test in kfold.split(x_input, y_labels):
    x_train = x_input[train]
    x_val = x_input[test]
    y_train = y_labels[train]
    y_val = y_labels[test]
    pos_val = z_pos[test]
    pos_train = z_pos[train]

    # Fit the model for each split
    history = model.fit([x_train, pos_train], y_train,
                  batch_size=KERAS_BATCH_SIZE,
                  epochs=KERAS_EPOCHS,
                  validation_data=([x_val, pos_val], y_val))
    
    histories.append(history)

    # Evaluation after each split
    scores = model.evaluate([x_val, pos_val], y_val)
    print('Loss: {:.4f}'.format(scores[0]))
    print('Precision: {:.2%}'.format(scores[1]))
    print('Recall: {:.2%}'.format(scores[2]))
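
To see what a k-fold split does to the sample indices, the following NumPy sketch reproduces the idea on toy data. It is not scikit-learn's implementation: `kfold_indices` is a hypothetical helper, and its shuffling will not match `KFold` even with the same seed.

```python
import numpy as np


def kfold_indices(n_samples, n_splits, seed=1337):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i, val_idx in enumerate(folds):
        # Training indices are all folds except the current validation fold.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


x = np.arange(16).reshape(16, 1)  # 16 toy "sentences"
splits = list(kfold_indices(len(x), n_splits=8))
for train_idx, val_idx in splits:
    x_train, x_val = x[train_idx], x[val_idx]
    print(len(x_train), len(x_val))  # 14 2 for every fold
```

Note that the training loop above keeps fitting the same `model` instance, so the weights carry over from one split to the next and the validation samples of a later split may already have been seen during an earlier one.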

<a id='evaluation'></a>
# Prediction and Evaluation

To evaluate the model, we will use the test corpus and generate predictions (labels) for the input sentences. Each sentence will receive a list of binary classes (0|1) for its tokens.

The predictions will be saved in a CSV file, which will be similar to the *Gold Labels* from the NAACL. Using both of these files (predictions and gold standard), we will evaluate the performance of the model.

Performance is measured in precision, recall and F1 score.

In [None]:
# Get float predictions and turn them into binaries
float_predictions = model.predict([x_test, z_test], batch_size=KERAS_BATCH_SIZE)

# Without POS tags
# float_predictions = model.predict([x_test], batch_size=KERAS_BATCH_SIZE)

binary_predictions = kerasbackend.argmax(float_predictions)
label_predictions = kerasbackend.eval(binary_predictions)

# Write prediction to CSV file
predictions_file = 'fasttest_all_predictions_pos.csv'
# standard_file = 'source/verb_tokens_test_gold_labels.csv'
standard_file = 'source/all_pos_tokens_test_gold_labels.csv'

# Write the predictions.csv and compare to gold standard
rows = evaluate.corpus_evaluation(c_test, label_predictions, MAX_SENTENCE_LENGTH)
evaluate.csv_evalutation(rows, predictions_file)
results = evaluate.precision_recall_f1(predictions_file, standard_file)

print(results)
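
The two steps of this cell, turning softmax outputs into labels and scoring them against gold labels, can be sketched on toy arrays. This is an illustration only: `evaluate.precision_recall_f1` works on CSV files and its internals are not shown here, so the function below is a hypothetical stand-in that uses the standard definitions for the positive (metaphor) class.

```python
import numpy as np


def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for the positive class (label 1)."""
    gold = np.asarray(gold)
    predicted = np.asarray(predicted)
    tp = int(np.sum((gold == 1) & (predicted == 1)))
    fp = int(np.sum((gold == 0) & (predicted == 1)))
    fn = int(np.sum((gold == 1) & (predicted == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy softmax outputs for four tokens: columns are P(non-metaphor), P(metaphor)
float_predictions = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
label_predictions = float_predictions.argmax(axis=-1)  # [0, 1, 1, 0]
gold = [0, 1, 0, 1]

print(precision_recall_f1(gold, label_predictions))  # (0.5, 0.5, 0.5)
```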

<a id='training_plot'></a>
# Model Training Plot

The following plot shows the loss and F1 score recorded in the first epoch of each k-fold split, for both the training and the validation subset.

In [None]:
import plotly 
plotly.offline.init_notebook_mode(connected=True)

loss_p = plotly.graph_objs.Scatter(
    y = [history.history['loss'][0] for history in histories],
    mode = 'lines+markers',
    name = 'Loss'
)

val_loss_p = plotly.graph_objs.Scatter(
    y = [history.history['val_loss'][0] for history in histories],
    mode = 'lines+markers',
    name = 'Validation Loss'
)

acc_p = plotly.graph_objs.Scatter(
    y = [history.history['f1'][0] for history in histories],
    mode = 'lines+markers',
    name = 'F1'
)

val_acc_p = plotly.graph_objs.Scatter(
    y = [history.history['val_f1'][0] for history in histories],
    mode = 'lines+markers',
    name = 'Validation F1'
)

# Each point is one k-fold split (one History object), not one epoch
layout = plotly.graph_objs.Layout(title="Training History",
                yaxis=dict(title='Value'),
                xaxis=dict(title='Fold'))

data = [loss_p, val_loss_p, acc_p, val_acc_p]
fig = plotly.graph_objs.Figure(data=data, layout=layout)

plotly.offline.iplot(fig, filename='jupyter-train-history')