This repository contains the source code for the paper "Sequence Labeling Parsing by Learning Across Representations", accepted at ACL 2019.
NEW VERSION OF THE CODE WITH PYTHON 3+ AND PYTORCH 1+ IS AVAILABLE HERE. The new code includes multi-task learning, although it currently works only for dependency parsing.
Recent research has shown that constituency and dependency parsing can be recast as a sequence labeling problem, resulting in fast and accurate parsing. We extend this idea with multi-task learning (MTL), which enables learning both constituency and dependency syntactic abstractions and leveraging their complementary nature. When one parsing paradigm is added as an auxiliary loss, the performance of the other paradigm consistently improves. In other words, our model achieves better results for dependency parsing when information about the syntactic representations of constituency trees is available during training, and vice versa. Moreover, a single sequence labeling model can parse both constituency and dependency trees at almost no cost in terms of performance or speed (when both paradigms are treated as main tasks).
- 1. Constituency and Dependency Parsing as Sequence Labeling explained
- 2. Requirements
- 3. Encoding Trees into Labels
- 4. Usage (train and parse)
- 5. Reference
Various encodings have been proposed for linearizing syntactic trees. This code employs the relative encoding proposed in "Constituent Parsing as Sequence Labeling" and the relative PoS-based encoding from "Viable Dependency Parsing as Sequence Labeling". Below are some examples with an intuitive explanation of how constituency and dependency trees can be encoded as labels. More detailed descriptions can be found in the papers.
(a) a constituency tree with the relative variation of the encoding.
Example for the word "control" with the label (-2,S,Ø): First, we count how many ancestors are shared between the target word "control" and the next token ("."), which is 1, since they only share "S". Next, we look at the relative variation with respect to the previous word: "good" and "control" share 3 levels. Subtracting the previous count (3) from the current one (1) gives -2. The second element, "S", is the last ancestor shared between the target word and the next one. The third element of the tuple is empty (Ø) because the target word is not part of a leaf unary chain.
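A minimal sketch (ours, not the repository's encoder) of how the relative values are derived from absolute common-ancestor counts; the counts for the first two word pairs of "He has good control ." are assumed for illustration, and we keep the first value absolute here:

def to_relative(absolute):
    # absolute[i] = number of ancestors shared by word i and word i+1
    return [absolute[0]] + [absolute[i] - absolute[i - 1]
                            for i in range(1, len(absolute))]

# "good"/"control" share 3 levels and "control"/"." share 1 (see above),
# so "control" gets 1 - 3 = -2.
print(to_relative([1, 2, 3, 1]))  # [1, 1, 1, -2]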
(b) a dependency tree with the relative PoS-based encoding.
Example for the word "control" with the label (-1,V,dobj): the head of the target word is the first token to the left (-1) that has PoS tag V, and the dependency relation is dobj.
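To make the encoding concrete, here is a minimal sketch of a relative PoS-based encoder (ours, not the repository's script; the root convention is an assumption) that reproduces the labels shown in the usage examples below:

def encode_rel_pos(tokens):
    # tokens: list of (form, pos, head, deprel); head is 1-based, 0 = root
    pos_of = {i + 1: pos for i, (_, pos, _, _) in enumerate(tokens)}
    labels = []
    for i, (form, pos, head, deprel) in enumerate(tokens, 1):
        if head == 0:
            labels.append((-1, "ROOT", deprel))  # assumed convention for the root
        elif head > i:
            # head to the right: +o means the o-th token with the head's PoS tag
            offset = sum(1 for j in range(i + 1, head + 1) if pos_of[j] == pos_of[head])
            labels.append((offset, pos_of[head], deprel))
        else:
            # head to the left: count matching PoS tags between head and target
            offset = sum(1 for j in range(head, i) if pos_of[j] == pos_of[head])
            labels.append((-offset, pos_of[head], deprel))
    return labels

sentence = [("He", "N", 2, "nsubj"), ("has", "V", 0, "root"),
            ("good", "J", 4, "amod"), ("control", "N", 2, "dobj"),
            (".", ".", 2, "punct")]
print(encode_rel_pos(sentence))
# [(1, 'V', 'nsubj'), (-1, 'ROOT', 'root'), (1, 'N', 'amod'),
#  (-1, 'V', 'dobj'), (-1, 'V', 'punct')]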
Our code is an extension of the original code for constituency and dependency parsing. We also rely on NCRF++: An Open-source Neural Sequence Labeling Toolkit, which we have slightly modified. In the MTL setup, our model jointly learns several tasks covering dependency and constituency parsing, and in that way it can benefit from their shared syntactic representations.
We recommend creating a virtual environment to keep the installed packages separate and avoid conflicts with other programs.
Python 2.7
PyTorch 0.3
The program was tested on Ubuntu 16.04, Python 2.7.12, PyTorch 0.3.1.
In order to train and run the model, the input files have to contain trees encoded as labels. This can easily be done with the following scripts for dependency and constituency trees. We follow the format proposed in Constituent Parsing as Sequence Labeling: better, faster, stronger sequence tagging constituent parsers. For dependency trees:
python encode_trees2labels.py --file_to_encode $PATH_TO_THE_CONLL-X_FILE --output $PATH_TO_OUTPUTFILE --task SINGLE/COMBINED/MULTI
The file to encode has to be in the CoNLL-X format:
1 He _ N N _ 2 nsubj _ _
2 has _ V V _ 0 root _ _
3 good _ J J _ 4 amod _ _
4 control _ N N _ 2 dobj _ _
5 . _ . . _ 2 punct _ _
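For reference, a minimal sketch (ours, not part of the repository) of reading the columns used here, i.e. form, PoS (column 5), head index, and dependency relation, from such tab-separated CoNLL-X lines:

def read_conllx(lines):
    # CoNLL-X columns: id, form, lemma, cpostag, postag, feats, head, deprel, ...
    sentence = []
    for line in lines:
        cols = line.strip().split("\t")
        # yields the (form, pos, head, deprel) tuples used by the encoder sketch above
        sentence.append((cols[1], cols[4], int(cols[6]), cols[7]))
    return sentence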
The output will differ depending on the chosen MTL setup:
single
the whole label is treated as a single task; its components are separated by the symbol "@"
He N +1@nsubj@V
has V -1@root@ROOT
good J +1@amod@N
control N -1@dobj@V
. . -1@punct@V
multi
each component of the label, separated by "{}", is treated as a separate task
He N +1{}nsubj{}V
has V -1{}root{}ROOT
good J +1{}amod{}N
control N -1{}dobj{}V
. . -1{}punct{}V
combined
the best performance is achieved when the relative position and the head's PoS tag (e.g. +1@V) are combined into one task, with the dependency relation as the second task
He N +1@V{}nsubj
has V -1@ROOT{}root
good J +1@N{}amod
control N -1@V{}dobj
. . -1@V{}punct
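All three formats carry the same information; the following illustrative sketch (the function name is ours) shows how one dependency label is rendered under each setup:

def render(offset, deprel, head_pos, setup):
    off = "%+d" % offset                      # "+1", "-1", ...
    if setup == "single":                     # one task, "@"-separated
        return "@".join([off, deprel, head_pos])
    if setup == "multi":                      # every component a separate task
        return "{}".join([off, deprel, head_pos])
    if setup == "combined":                   # position + head PoS as one task
        return "{}".join([off + "@" + head_pos, deprel])

print(render(-1, "dobj", "V", "single"))    # -1@dobj@V
print(render(-1, "dobj", "V", "multi"))     # -1{}dobj{}V
print(render(-1, "dobj", "V", "combined"))  # -1@V{}dobj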
For constituency trees:
python dataset.py
Check the usage here.
Labels for constituency and dependency parsing have to be merged in order to train both parsing paradigms.
python merge_dep_and_cons_files.py --encoded_file_constituency $PATH_TO_ENCODED_FILE_CONSTITUENCY --encoded_file_dependency $PATH_TO_ENCODED_DEPENDENCY --merged_output $PATH_TO_OUTPUT
An example of the output:
Monday NNP 1ROOT{}S{}NP{}-1@IN{}pobj
The format of each line is: a token, followed by its PoS tag as a self-defined feature, and a label with 5 tasks separated by "{}".
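Downstream, such a merged label is simply five "{}"-separated tasks, the first three for constituency parsing and the last two for dependency parsing (a sketch):

label = "1ROOT{}S{}NP{}-1@IN{}pobj"
tasks = label.split("{}")
constituency, dependency = tasks[:3], tasks[3:]
print(constituency)  # ['1ROOT', 'S', 'NP']
print(dependency)    # ['-1@IN', 'pobj']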
python main.py --config $PATH_TO_CONFIG_FILE_FOR_TRAINING
Config files are provided for both single- and multi-task models for constituency and dependency parsing. The names of the models follow the original naming used in the paper.
For a multi-task label:

multi-task label:  3 {} NP {} -EMPTY- {} +1@NNS {} amod
index of task:     0     1       2          3        4

With 5 tasks in total, where the first 3 tasks are for constituency (3{}NP{}-EMPTY-) and the last 2 for dependency (+1@NNS{}amod), we can specify:
an MTL setup with constituency parsing as the main task and dependency parsing as an auxiliary task:
###MTL setup###
index_of_main_tasks=0,1,2
tasks=5
tasks_weights=1|1|1|0.1|0.1
dependency_parsing=False
constituency_parsing=True
or an MTL setup with dependency parsing as the main task and constituency parsing as an auxiliary task:
###MTL setup###
index_of_main_tasks=3,4
tasks=5
tasks_weights=0.2|0.2|0.2|1|1
dependency_parsing=True
constituency_parsing=False
or an MTL setup where both dependency and constituency parsing are considered as main tasks:
###MTL setup###
index_of_main_tasks=0,1,2,3,4
tasks=5
tasks_weights=1|1|1|1|1
dependency_parsing=True
constituency_parsing=True
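For intuition, the tasks_weights entries scale each task's loss before summation, so auxiliary tasks contribute a down-weighted signal while main tasks count in full. A minimal sketch (ours, not the repository's training loop):

def mtl_loss(task_losses, weights):
    # weighted sum of the per-task losses, one weight per task
    return sum(w * l for w, l in zip(weights, task_losses))

weights = [float(w) for w in "1|1|1|0.1|0.1".split("|")]
losses = [0.9, 1.2, 0.8, 1.5, 1.1]  # hypothetical per-task losses
print(mtl_loss(losses, weights))     # 0.9 + 1.2 + 0.8 + 0.15 + 0.11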
python decode.py --test $PATH_ENCODED_TESTSET [--gold_constituency] $PATH_GOLD_TESTSET_FOR_CONST [--gold_dependency] $PATH_GOLD_TESTSET_FOR_DEPEN --model $PATH_TO_MODEL --status test --gpu False [--multitask] [--output_constituency] $PATH_OUTPUT_CONS [--output_dependency] $PATH_OUTPUT_DEPEN --ncrfpp $PATH_TO_NCRFPP [--evalb_param] $PATH_EVALB_PARAM
--test : path to the encoded test set
--gold_constituency : path to the gold constituency file. Required when constituency_parsing=True is defined in the config file
--gold_dependency : path to the gold dependency file. Required when dependency_parsing=True
--model : path to the model, ending with "/mod"
--gpu : True/False
--multitask : use when multitask labels are used
--output_constituency : path to the constituency output file. Only if constituency_parsing=True
--output_dependency : path to the dependency output file. Only if dependency_parsing=True
--ncrfpp : path to the directory containing main.py and decode.py
--evalb_param : path to "cons2labels/EVALB/COLLINS.prm" when evaluating on the PTB, or "cons2labels/EVAL_SPRML/evalb_spmrl2013.final/spmrl.prm" on the SPMRL. Only if constituency_parsing=True
This work has received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150).
If you wish to use our work for research purposes, please cite us!
@inproceedings{strzyz-etal-2019-sequence,
    title = "Sequence Labeling Parsing by Learning across Representations",
    author = "Strzyz, Michalina and
      Vilares, David and
      G{\'o}mez-Rodr{\'\i}guez, Carlos",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1531",
    doi = "10.18653/v1/P19-1531",
    pages = "5350--5357",
    abstract = "We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Secondly, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that on average MTL models with auxiliary losses for constituency parsing outperform single-task ones by 1.05 F1 points, and for dependency parsing by 0.62 UAS points.",
}
Any questions? Bugs? Comments? Contact me at michalina.strzyz[at]udc.es