
Multi-task Learning of Negation and Speculation for Targeted Sentiment Classification

This repository contains the code, the challenge datasets for negation and speculation for Targeted Sentiment Analysis (TSA), and links to the models created from the code described in the following paper: Multi-task Learning of Negation and Speculation for Targeted Sentiment Classification.

Table of contents:

  1. Paper Abstract
  2. Installation/Requirements, Datasets, and Resources used
    1. Install/Requirements
    2. Datasets
      1. Main train and evaluate datasets
      2. Negated and Speculative challenge datasets (evaluate only datasets)
      3. Auxiliary datasets
    3. Resources
  3. Experiments
    1. Hyperparameter tuning
      1. Multi Task Learning Tuning
      2. Single Task Learning Tuning
      3. Plotting the expected validation score
    2. Example of how to Train the Single-Task System using AllenNLP train command
    3. Example of how to Train the Multi-Task System using AllenNLP train command
    4. Mass experiments setup
    5. Predicting on the Negation Challenge corpus
    6. Predicting on the Speculation Challenge Corpus
    7. Number of parameters
    8. Inference time
  4. Models
  5. Analysis/Notebooks
  6. Acknowledgements

Paper Abstract

The majority of work in targeted sentiment analysis has concentrated on finding better methods to improve the overall results. Within this paper we show that these models are not robust to linguistic phenomena, specifically negation and speculation. In this paper, we propose a multi-task learning method to incorporate information from syntactic and semantic auxiliary tasks, including negation and speculation scope detection, to create English-language models that are more robust to these phenomena. Further we create two challenge datasets to evaluate model performance on negated and speculative samples. We find that multi-task models and transfer learning via language modelling can improve performance on these challenge datasets, but the overall performances indicate that there is still much room for improvement. We release both the datasets and the source code at https://github.com/jerbarnes/multitask_negation_for_targeted_sentiment.

Installation/Requirements, Datasets, and Resources

Install/Requirements

  1. Python >= 3.6.1
  2. Requires PyTorch version 1.2.0, which needs to be installed first. Which build you install depends on whether you would like the GPU or CPU version; see the following to install PyTorch 1.2.0 and its GPU or CPU variants.
  3. pip install -r requirements.txt
  4. pip install .

If you want to, run the tests:

python -m pytest

Datasets

For more details on the datasets see ./dataset_readme.md.

Main train and evaluate datasets

The following TSA datasets were used for evaluation:

  1. The SemEval 2014 Laptop dataset.
  2. The combination of the SemEval 2014, 2015, and 2016 Restaurant datasets.
  3. The MAMS restaurant dataset from Jiang et al. 2019.
  4. The MPQA dataset from Wiebe et al. 2005 in CONLL format, which can be found within ./data/main_task/en/mpqa, split into train, development, and test splits.

The first two sentiment datasets are from Li et al. 2019. The first three datasets can be downloaded and converted into CONLL format using the following script:

python targeted_sentiment_downloader_converter.py

All of these datasets can be found in folders laptop, restaurant, MAMS, and mpqa within the ./data/main_task/en directory. We use the BIOUL format for all of these datasets.

An example of the TSA task in BIOUL format (this example comes from the MAMS development split):

The basil pepper mojito was a little daunting in concept , but I was refreshed at the flavour .
O B-NEG I-NEG L-NEG O O O O O O O O O O O O O U-POS O
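
To make the tag scheme concrete, below is a minimal Python sketch (not the repository's dataset reader) that recovers target spans and their sentiment labels from a BIOUL-tagged sentence such as the one above:

# Minimal, illustrative sketch: extract (target tokens, sentiment) pairs from
# BIOUL tags. It is not the dataset reader used in this repository.
def bioul_to_spans(tokens, tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith('U-'):                           # single-token target
            spans.append((tokens[i:i + 1], tag[2:]))
        elif tag.startswith('B-'):                         # start of a multi-token target
            start = i
        elif tag.startswith('L-') and start is not None:   # end of that target
            spans.append((tokens[start:i + 1], tag[2:]))
            start = None
    return spans

tokens = "The basil pepper mojito was a little daunting in concept , but I was refreshed at the flavour .".split()
tags = "O B-NEG I-NEG L-NEG O O O O O O O O O O O O O U-POS O".split()
print(bioul_to_spans(tokens, tags))
# [(['basil', 'pepper', 'mojito'], 'NEG'), (['flavour'], 'POS')]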

The dataset statistics for these four datasets can be seen below, split into train, development, and test splits. NOTE: only the MPQA dataset contains the BOTH label. Furthermore, within the MPQA data itself the labels are written as positive, neutral, negative, and both, corresponding to the POS, NEU, NEG, and BOTH shown in the table below. The table below can be generated using the following script (the script can also produce the table in markdown, latex, or, without any options, as a pandas dataframe):

python data/main_task/en/sentiment_dataset_stats.py --main-datasets --to-html

Train split:

dataset      sents.  targs.  len.  mult.  POS    NEU    NEG    BOTH
laptop       2741    2044    1.5   136    19.86  43.20  36.94  0.00
restaurant   3490    3896    1.4   312    15.79  60.04  24.18  0.00
MAMS         4297    11162   1.3   4287   45.06  30.22  24.72  0.00
mpqa         4195    1264    6.3   94     13.29  43.91  39.08  3.72

Development split:

dataset      sents.  targs.  len.  mult.  POS    NEU    NEG    BOTH
laptop       304     256     1.5   18     17.97  40.62  41.41  0.0
restaurant   387     414     1.4   34     12.32  65.22  22.46  0.0
MAMS         500     1329    1.3   498    45.45  30.25  24.30  0.0
mpqa         1389    400     5.4   29     17.00  42.50  37.00  3.5

Test split:

dataset      sents.  targs.  len.  mult.  POS    NEU    NEG    BOTH
laptop       800     634     1.6   38     26.03  53.47  20.50  0.0
restaurant   2158    2288    1.4   136    11.49  66.61  21.90  0.0
MAMS         500     1332    1.3   499    45.50  29.88  24.62  0.0
mpqa         1620    365     6.7   22     19.18  33.15  41.37  6.3

Negated and Speculative challenge datasets (evaluate only datasets)

The Development and Test splits for the negated and speculative only TSA datasets that have been annotated by one of the authors of this work can be found here:

  1. LaptopNeg -- Development, Test
  2. LaptopSpec -- Development, Test
  3. RestaurantNeg -- Development, Test
  4. RestaurantSpec -- Development, Test

Within these 4 datasets/splits only negated (Neg) or speculative (Spec) sentiments exist. All of the samples within these datasets have come from the development/test splits of the standard Laptop or Restaurant dataset and in some cases have been changed so that the sentiment is either negated or speculative.

Below are three sentences: the original, a negated version, and a speculative version. These sentences showcase the negated and speculative sentiment found within these negated and speculative datasets. The tokens in bold are those that have been added to the original sentence; the target sushi is either positive (:smile:), negative (:disappointed:), or neutral (:expressionless:) in the original, negated, and speculative cases respectively.

Type Sentence Sentiment towards sushi
original this is good, inexpensive sushi. positive (:smile:)
negated this is not good, inexpensive sushi. negative (:disappointed:)
speculative I'm not sure if this is good, inexpensive sushi. neutral (:expressionless:)

The dataset statistics for these negated and speculative TSA datasets can be seen below, split into development and test splits. The table below can be generated using the following script (the script can also produce the table in markdown, latex, or, without any options, as a pandas dataframe):

python data/main_task/en/sentiment_dataset_stats.py --challenge-datasets --to-html

Development split:

dataset          sents.  targs.  len.  mult.  POS    NEU    NEG    BOTH
laptop_neg       147     181     1.5   41     17.13  47.51  35.36  0.0
laptop_spec      110     142     1.4   10     50.70  33.10  16.20  0.0
restaurant_neg   198     274     1.4   61     16.42  51.09  32.48  0.0
restaurant_spec  138     200     1.3   35     30.00  41.00  29.00  0.0

Test split:

dataset          sents.  targs.  len.  mult.  POS    NEU    NEG    BOTH
laptop_neg       401     464     1.6   79     26.72  50.00  23.28  0.0
laptop_spec      208     220     1.5   19     38.18  41.36  20.45  0.0
restaurant_neg   818     1013    1.4   161    15.00  52.81  32.18  0.0
restaurant_spec  400     451     1.4   49     16.85  43.46  39.69  0.0

Auxiliary datasets

Dataset Task Format Split locations
(CD) Conan Doyle Negation scope detection BIO CONLL format Train, Development, Test
(SFU) SFU review corpus Negation scope detection BIO CONLL format Train, Development, Test
(SPEC) SFU review corpus Speculation scope detection BIO CONLL format Train, Development, Test
(UPOS) Streusle review corpus Universal Part Of Speech (UPOS) tagging CONLL format Train, Development, Test
(DR) Streusle review corpus Dependency Relation (DR) prediction CONLL format Train, Development, Test
(LEX) Streusle review corpus Lexical analysis (LEX) prediction BIO (style) CONLL format Train, Development, Test

The SFU review corpus was split into 80%, 10%, and 10% train, development, and test splits respectively using the following script: ./scripts/sfu_data_splits.sh. For more details on the complex task of lexical analysis (LEX), see point 19 of the following README, which comes from the Streusle review corpus.
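
For illustration, a minimal Python sketch of such an 80/10/10 document-level split is shown below; the split actually used comes from ./scripts/sfu_data_splits.sh, and the shuffling seed here is a hypothetical choice:

# Illustrative 80/10/10 train/dev/test split over a list of documents.
# This is not the repository's ./scripts/sfu_data_splits.sh.
import random

def split_80_10_10(documents, seed=42):
    rng = random.Random(seed)
    shuffled = documents[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end, dev_end = int(0.8 * n), int(0.9 * n)
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]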

An example of all of the tasks can be seen in the table below:

you might not like the service
CD Bscope Iscope Bcue Bscope Iscope Iscope
SFU Bscope Iscope Bcue Bscope Iscope Iscope
SPEC Bscope Bcue Bscope Iscope Iscope Iscope
UPOS PRON AUX PART VERB DET NOUN
DR nsubj aux advmod root det obj
LEX OPRON OAUX OADV BV-v.emotion ODET BN-n.ACT

Resources

All resources such as word embeddings (including Contextualised Word Representation (CWR) models) and AllenNLP model configurations are stored within ./resources.

  1. All of the AllenNLP model configurations used for the main experiments can be found at ./resources/model_configs. The configurations used for hyperparameter tuning can be found at ./resources/tuning/tuning_configs. Lastly, the configurations used for getting some basic dataset statistics can be found at ./resources/statistic_configs/en.
  2. The embeddings are not stored in this repository due to their size.
  • The 300D 840B token GloVe embedding needs to be downloaded to the following path: ./resources/embeddings/en/glove.840B.300d.txt.
  • The standard Transformer ELMo, which was used as the CWR embedding for the MPQA dataset experiments, can be downloaded from this link and should be saved to ./resources/embeddings/en/transformer-elmo-2019.01.10.tar.gz.
  • For the MAMS and Restaurant dataset CWR experiments, the fine-tuned Transformer ELMo was used; it can be downloaded from here, and this repository explains in more detail how it was fine-tuned on the Yelp restaurant review dataset. This model should be downloaded to ./resources/embeddings/en/restaurant_model.tar.gz.
  • For the Laptop dataset CWR experiments, the fine-tuned Transformer ELMo was used; it can be downloaded from here, and this repository explains in more detail how it was fine-tuned on the Amazon electronics review dataset. This model should be downloaded to ./resources/embeddings/en/laptop_model.tar.gz.

Experiments

Additional experiments can be found in the ./experiments_readme.md.

We experiment with a single-task baseline (STL) and a hierarchical multi-task model with a skip connection (MTL), both of which can be seen in the Figure below. For the STL model, we first embed a sentence and then pass the embeddings to a Bidirectional LSTM (Bi-LSTM). These features are then concatenated with the input embeddings and fed to the second Bi-LSTM layer, ending with the token-wise sentiment predictions from the CRF tagger. For the MTL model, we additionally use the output of the first Bi-LSTM layer as features for the separate auxiliary-task CRF tagger. As can be seen from the Figure below, the STL model and the MTL main task model use the same green layers; the MTL model additionally uses the pink layer for the auxiliary task. At inference time the MTL model is as efficient as the STL model, given that it only uses the green layers when predicting targeted sentiment, as is shown empirically in the Inference time section.

Single and multi task model architecture
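
For illustration only, below is a minimal PyTorch-style sketch of the architecture described above. It is not the repository's AllenNLP implementation: the CRF taggers are replaced by plain linear projections, and all dimensions and label counts are hypothetical.

# Sketch of the shared Bi-LSTM with a skip connection. The real models use
# CRF taggers; here the taggers are stand-in linear layers.
import torch
import torch.nn as nn

class MTLSketch(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=300, hidden=65,
                 num_sentiment_tags=13, num_aux_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # First (shared) Bi-LSTM layer.
        self.shared_bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                     bidirectional=True)
        # Second Bi-LSTM layer takes the skip connection [embeddings; shared features].
        self.sentiment_bilstm = nn.LSTM(emb_dim + 2 * hidden, hidden,
                                        batch_first=True, bidirectional=True)
        # Stand-ins for the CRF taggers.
        self.sentiment_tagger = nn.Linear(2 * hidden, num_sentiment_tags)  # main task ("green" layers)
        self.aux_tagger = nn.Linear(2 * hidden, num_aux_tags)              # auxiliary task ("pink" layer, MTL only)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                      # (batch, seq, emb_dim)
        shared, _ = self.shared_bilstm(emb)              # (batch, seq, 2 * hidden)
        aux_logits = self.aux_tagger(shared)             # auxiliary task predictions
        skip = torch.cat([emb, shared], dim=-1)          # skip connection
        upper, _ = self.sentiment_bilstm(skip)
        sentiment_logits = self.sentiment_tagger(upper)  # targeted sentiment predictions
        return sentiment_logits, aux_logits

# Dummy forward pass: a batch of 2 sentences, each 5 tokens long.
sentiment_logits, aux_logits = MTLSketch()(torch.randint(0, 1000, (2, 5)))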

Before running any of the experiments for the single and multi task models we perform a hyperparameter search for both models.

Also, before running any of the code, the following bash command needs to be run. Because the stanford-nlp package is installed and allentune relies on Ray, the STANFORDNLP_TEST_HOME environment variable has to be set before using allentune:

export STANFORDNLP_TEST_HOME=~/stanfordnlp_test

Hyperparameter tuning

We used the allentune package.

The tuning is performed on the smallest datasets, which are the Laptop dataset for the main task (TSA) and the Conan Doyle (CD) corpus for the negation/auxiliary task, when tuning the multi- and single-task models. The parameters we tune are the following:

  1. Dropout rate - between 0 and 0.5
  2. Hidden size for shared/first layer of the Bi-LSTM - between 30 and 110
  3. Starting learning rate for Adam - between 0.01 (1e-2) and 0.0001 (1e-4)

The tuning is performed separately for the single- and multi-task models. The single-task model is only tuned for the sentiment task and not the negation task. Furthermore, we tune the models by randomly sampling the parameters stated above within the ranges specified, changing the random seed each time; the parameters are sampled 30 times in total for each model. From the 30 model runs, the parameters of the best run, based on the F1-Span/F1-i measure on the validation set, are selected for all of the experiments for that model.
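
A minimal sketch of this random-search procedure is shown below; it is not the allentune implementation, train_and_validate is a hypothetical stand-in for training a model and returning its validation F1-Span, and the log-scale learning-rate sampling is an assumption:

# Illustrative random search over the three hyperparameters described above.
import random

def sample_config(rng):
    return {
        "dropout": rng.uniform(0.0, 0.5),
        "hidden_size": rng.randint(30, 110),
        "lr": 10 ** rng.uniform(-4, -2),   # between 1e-4 and 1e-2 (log scale assumed)
    }

def random_search(train_and_validate, num_samples=30, seed=0):
    best_score, best_config = float("-inf"), None
    for run in range(num_samples):
        rng = random.Random(seed + run)      # change the random seed each time
        config = sample_config(rng)
        score = train_and_validate(config)   # validation F1-Span for this run
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score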

Multi Task Learning Tuning

Run the following:

allentune search \
    --experiment-name multi_task_laptop_conan_search \
    --num-cpus 5 \
    --num-gpus 1 \
    --cpus-per-trial 5 \
    --gpus-per-trial 1 \
    --search-space resources/tuning/tuning_configs/multi_task_search_space.json \
    --num-samples 30 \
    --base-config resources/tuning/tuning_configs/multi_task_laptop_conan.jsonnet \
    --include-package multitask_negation_target
allentune report \
    --log-dir logs/multi_task_laptop_conan_search/ \
    --performance-metric best_validation_f1-measure-overall \
    --model multi-task

The multi-task model found the following as the best parameters from run number 24 with a validation F1-Span score of 60.17%:

  1. lr = 0.0019
  2. shared/first layer hidden size = 65
  3. dropout = 0.27

Single Task Learning Tuning

Run the following:

allentune search \
    --experiment-name single_task_laptop_search \
    --num-cpus 5 \
    --num-gpus 1 \
    --cpus-per-trial 5 \
    --gpus-per-trial 1 \
    --search-space resources/tuning/tuning_configs/single_task_search_space.json \
    --num-samples 30 \
    --base-config resources/tuning/tuning_configs/single_task_laptop.jsonnet \
    --include-package multitask_negation_target
allentune report \
    --log-dir logs/single_task_laptop_search/ \
    --performance-metric best_validation_f1-measure-overall \
    --model single-task

The single-task model found the following as the best parameters from run number 7 with a validation F1-Span score of 61.56%:

  1. lr = 0.0015
  2. shared/first layer hidden size = 60
  3. dropout = 0.5

Plotting the expected validation score

To get a plot of the STL and MTL models' expected validation scores, you first have to copy the results from the STL and MTL runs together into a new file, which we have done here. With this new combined file, run the following to create the plot, which can be found here (the PNG version is shown below):

allentune plot \
    --data-name Laptop \
    --subplots 1 1 \
    --figsize 10 10 \
    --plot-errorbar \
    --result-file logs/other_result.jsonl \
    --output-file resources/tuning/combined_tuning_laptop_performance.pdf \
    --performance-metric-field best_validation_f1-measure-overall \
    --performance-metric F1-Span

Expected validation score

Example of how to Train the Single-Task System using AllenNLP train command

You can use the allennlp train command here:

allennlp train resources/model_configs/targeted_sentiment_laptop_baseline.jsonnet -s /tmp/any --include-package multitask_negation_target

Example of how to Train the Multi-Task System using AllenNLP train command

You can use the allennlp train command here:

allennlp train resources/model_configs/multi_task_trainer.jsonnet -s /tmp/any --include-package multitask_negation_target

Mass experiments setup

In all experiments the embedding, whether that is GloVe or a CWR, is frozen, i.e. the embedding layer(s) do not get tuned during training. This can be changed within the model configurations.

The previous two subsections describe how to train just one model on one dataset. In the paper we trained each model 5 times, and there were numerous models (1 STL and 6 MTL) and 4 datasets. To do this we created two scripts. The first script trains a model, e.g. STL, on one dataset 5 times and then saves the 5 models, including the respective auxiliary task models where applicable, and also saves the results. The second script runs the first script across all of the models and datasets.

The first Python script has the following argument signature:

  1. Model config file path
  2. Main task test data file path
  3. Main task development/validation data file path
  4. Folder to save the results to. This folder will contain two files, test.conll and dev.conll; each of these files will contain the predicted results for the associated data split. The files will have the following structure: Token#GOLD_Label#Predicted_Label_1#Predicted_Label_2, where # indicates whitespace and the number of predicted labels is determined by the number of times the model has been run (see the reading sketch after this list).
  5. Number of times to run the model -- in all of our experiments we run the model 5 times thus this is always 5 in our case.
  6. Folder to save the trained model(s) to. If you are training an MTL model then the auxiliary task model(s) will also be saved here.
  7. OPTIONAL FLAG --mtl is required if you are training an MTL model.
  8. OPTIONAL FLAG --aux_name: the name of the auxiliary task, required if training an MTL model. By default this is negation, but if a negation task is not being trained then the name of the task from the model config is required, e.g. for u_pos the task name is task_u_pos, thus you remove the task_ prefix to get the aux_name, which in this case is u_pos.
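
Below is a minimal sketch of reading the test.conll/dev.conll prediction files described in point 4 above; it assumes one token per line with whitespace-separated labels and blank lines between sentences, and it is not a script from this repository:

# Illustrative reader for prediction files of the form:
# Token GOLD_Label Predicted_Label_1 ... Predicted_Label_N
from typing import List, Tuple

def read_predictions(path: str) -> List[List[Tuple[str, str, List[str]]]]:
    sentences, current = [], []
    with open(path, encoding="utf-8") as conll_file:
        for line in conll_file:
            line = line.strip()
            if not line:                       # assumed sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, gold, *predicted = line.split()
            current.append((token, gold, predicted))
    if current:
        sentences.append(current)
    return sentences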

An example of running this script is shown below; it runs the STL model with GloVe embeddings 5 times on the Laptop dataset:

python ./scripts/train_and_generate.py ./resources/model_configs/stl/en/laptop.jsonnet ./data/main_task/en/laptop/test.conll ./data/main_task/en/laptop/dev.conll ./data/results/en/stl/laptop 5 ./data/models/en/stl/laptop

The MTL models can be run in a similar way but do require a few extra flags. The example below runs the MTL (UPOS) model 5 times with the CWR embedding on the MAMS dataset:

python ./scripts/train_and_generate.py ./resources/model_configs/mtl/en/u_pos/mams_contextualized.jsonnet ./data/main_task/en/MAMS/test.conll ./data/main_task/en/MAMS/dev.conll ./data/results/en/mtl/u_pos/MAMS_contextualized 5 ./data/models/en/mtl/u_pos/MAMS_contextualized --mtl --aux_name upos

The second Python script, which trains all of the models and makes the predictions for the standard datasets (it does not make predictions on the negated or speculative TSA datasets), is:

./run_all.sh

Predicting on the Negation Challenge corpus

These are the Neg datasets from the Negated and Speculative challenge datasets (evaluate only datasets) section.

./scripts/generate_negation_only_predictions.sh

Predicting on the Speculation Challenge Corpus

These are the Spec datasets from the Negated and Speculative challenge datasets (evaluate only datasets) section.

./scripts/generate_spec_only_predictions.sh

Number of parameters

(We assume that all of the models are stored in the following directory: ./data/models. See the Models section for more details on how to download the trained models.)

To find the statistics for the number of parameters in the different models run:

python number_parameters.py
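
For reference, counting parameters for a single loaded model can be sketched as below; this is not the repository's number_parameters.py, and the model path is just one example following the scheme in the Models section:

# Illustrative parameter count for one downloaded AllenNLP model archive.
from allennlp.models.archival import load_archive

archive = load_archive('./data/models/en/stl/laptop/model_0.tar.gz', cuda_device=-1)
model = archive.model
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'total parameters: {total}, trainable parameters: {trainable}')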

Inference time

(We assume that all of the models are stored in the following directory: ./data/models. See the Models section for more details on how to download the trained models.)

This tests the inference time for the following models after they have been loaded into memory:

  1. STL GloVe
  2. STL CWR
  3. MTL SFU GloVe
  4. MTL SFU CWR

NOTE: if you go to any of the model links, we use model_0.tar.gz.

All of these models have been trained on the Laptop dataset. Additionally, the links associated with the models above will take you to the location where you can download them. The inference times are tested on the Laptop test dataset, which contains 800 sentences. Further, the models are tested on the following hardware:

  1. GPU - GeForce GTX 1060 6GB
  2. CPU - AMD Ryzen 5 1600

And with the following batch sizes:

  1. 1
  2. 8
  3. 16
  4. 32

The computer also had 16GB of RAM. Additionally, each model is run 5 times, each run is timed, and the minimum and maximum run times are reported. Minimum times are recommended by the Python timeit library, and the maximum is reported to show the potential spread.
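
The 5-run minimum/maximum timing described above can be sketched with Python's timeit module as follows; predict_on_split is a hypothetical stand-in for running a loaded model over the Laptop test split at a given batch size, and this is not the repository's inference_time.py:

# Illustrative min/max timing over 5 runs of a prediction callable.
import timeit

def time_model(predict_on_split, runs=5):
    # Each repeat times one full pass (number=1) over the test split.
    times = timeit.repeat(predict_on_split, repeat=runs, number=1)
    return min(times), max(times)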

To run the inference time tests, run the following:

python inference_time.py

It will print out a LaTeX table of results, which, when converted to markdown, looks like the following:

Embedding Model Batch Size Device Min Time Max Time
GloVe STL 1 CPU 10.24 10.45
GloVe STL 8 CPU 7.00 7.21
GloVe STL 16 CPU 6.67 6.91
GloVe STL 32 CPU 6.35 6.51
GloVe MTL 1 CPU 10.06 10.26
GloVe MTL 8 CPU 7.05 7.19
GloVe MTL 16 CPU 6.90 6.99
GloVe MTL 32 CPU 6.41 6.46
GloVe STL 1 GPU 9.24 9.26
GloVe STL 8 GPU 6.58 6.67
GloVe STL 16 GPU 6.34 6.36
GloVe STL 32 GPU 6.12 6.26
GloVe MTL 1 GPU 9.43 9.49
GloVe MTL 8 GPU 6.60 6.70
GloVe MTL 16 GPU 6.26 6.55
GloVe MTL 32 GPU 6.10 6.20
CWR STL 1 CPU 64.79 71.26
CWR STL 8 CPU 43.62 49.70
CWR STL 16 CPU 47.06 48.41
CWR STL 32 CPU 56.76 62.77
CWR MTL 1 CPU 64.01 67.90
CWR MTL 8 CPU 49.05 50.00
CWR MTL 16 CPU 53.74 56.42
CWR MTL 32 CPU 55.33 55.79
CWR STL 1 GPU 23.26 23.79
CWR STL 8 GPU 8.82 9.09
CWR STL 16 GPU 8.57 8.86
CWR STL 32 GPU 8.45 9.78
CWR MTL 1 GPU 23.81 23.97
CWR MTL 8 GPU 9.19 9.49
CWR MTL 16 GPU 8.54 8.92
CWR MTL 32 GPU 8.43 8.70

This data is also stored in the following file: ./inference_save.json

Models

All of the models from the Mass experiments setup section, which are all of the models created from the experiments declared in the paper, can be found at https://ucrel-web.lancs.ac.uk/moorea/research/multitask_negation_for_targeted_sentiment/models/en/. These models are saved as AllenNLP models and can be loaded using load_archive, as shown in the documentation. An example of loading a model in Python (assuming you have saved a model to ./data/models/en/stl/laptop_contextualized/model_0.tar.gz):

from pathlib import Path
from allennlp.models.archival import load_archive

cuda_device = -1 # 0 for GPU -1 for CPU
model_path = Path('./data/models/en/stl/laptop_contextualized/model_0.tar.gz')
loaded_model = load_archive(str(model_path.resolve()), cuda_device=cuda_device)

The ./inference_time.py script shows how to load a model and make predictions so that the model can be used to benchmark inference time.

The link takes you to a page with the single-task models in one folder with the following folder structure:

stl/DATASET_NAME_EMBEDDING/model_RUN_NUMBER.tar.gz

Whereby DATASET_NAME can be one of the following, which refer to the 4 Main train and evaluate datasets:

  1. MAMS
  2. laptop
  3. mpqa
  4. restaurant

EMBEDDING is either an empty string for the GloVe embedding or _contextualized for the CWR that matches the relevant DATASET_NAME (see the Resources section).

RUN_NUMBER can be 0, 1, 2, 3, or 4, which represents the five different runs for each experiment. An example path to the STL model trained on the MAMS dataset using the GloVe embeddings, from the 2nd training run:

stl/MAMS/model_1.tar.gz

The multi task models have the following structure:

mtl/AUXILIARY_DATASET/DATASET_NAME_EMBEDDING/model_RUN_NUMBER.tar.gz

Whereby AUXILIARY_DATASET is the auxiliary task that the model was also trained on, and can be one of the following, which refer to the 6 Auxiliary datasets:

  1. conan_doyle
  2. dr
  3. lextag
  4. sfu
  5. sfu_spec
  6. u_pos

An example path to the MTL model trained on the MAMS dataset, with the auxiliary task of speculation prediction, using a CWR, from the 1st training run:

mtl/sfu_spec/MAMS_contextualized/model_0.tar.gz

Each of these folders also contains the saved auxiliary task model, which in this example is saved as:

mtl/sfu_spec/MAMS_contextualized/task_speculation_model_0.tar.gz
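
As an illustration of the naming scheme above, the sketch below builds the paths for all 5 runs of one MTL experiment; the base directory is an assumption and nothing is downloaded for you:

# Illustrative helper that follows the folder structure described in this section.
from pathlib import Path

def run_paths(base, aux, dataset, contextualized):
    embedding = '_contextualized' if contextualized else ''
    folder = Path(base) / 'mtl' / aux / f'{dataset}{embedding}'
    return [folder / f'model_{run}.tar.gz' for run in range(5)]

print(run_paths('./data/models/en', 'sfu_spec', 'MAMS', True))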

Analysis/Notebooks

The notebooks in ./notebooks (all notebooks can be loaded using Google Colab) store all of the evaluation results that generate the tables within the paper, and run/produce the statistical significance test results that appear in those tables.

For the results on the 4 main datasets (Laptop, Restaurant, MAMS, and MPQA), see the ./notebooks/Main_Evaluation.ipynb notebook.

For the results on the Laptop and Restaurant negation and speculation challenge datasets created in this work, see the ./notebooks/Negation_Evaluation.ipynb and ./notebooks/Speculation_Evaluation.ipynb notebooks.

Acknowledgements

This work has been carried out as part of the SANT project (Sentiment Analysis for Norwegian Text), funded by the Research Council of Norway (grant number 270908). Andrew has been funded by Lancaster University by an EPSRC Doctoral Training Grant. The authors thank the UCREL research centre for hosting the models created from this research.
