This repository contains the implementation of the joint syntacto-discourse parser and the syntacto-discourse treebank. For more details, please refer to the paper *Joint Syntacto-Discourse Parsing and the Syntacto-Discourse Treebank*.
Due to copyright restrictions, we cannot provide the joint treebank in a form that can be directly used to train a parser. Instead, we provide a patch toolkit to generate the Syntacto-Discourse Treebank given the RST Discourse Treebank and the Penn Treebank.
The treebank generation scripts depend on:

- `python-gflags` for parsing script arguments.
- `nltk` for tokenization.
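Both packages are available from PyPI. Below is a minimal setup sketch, assuming a standard `pip` environment; the `punkt` download is an assumption about which `nltk` tokenizer models the scripts need:

```
pip install python-gflags nltk
# nltk's default word tokenizer uses the punkt model data;
# download it once (assumption: tokenize_rst.py relies on it)
python -m nltk.downloader punkt
```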
Please follow the steps below to generate the treebank:
- Place the RST Discourse Treebank in folder `dataset/rst`. Put the discourse trees (the `wsj_xxxx.out.dis` files) from the training and testing portions of the RST Discourse Treebank into `dataset/rst/train` and `dataset/rst/test`, respectively. Each `wsj_xxxx.out.dis` file corresponds to one WSJ article, where `xxxx` is the article number.
- Place the Penn Treebank trees in folder `dataset/ptb`. These constituency trees are in parenthesized format and are grouped into one treebank file (named `wsj_xxxx.cleangold`) per WSJ article.
- Apply patches to the RST Discourse Treebank files and the Penn Treebank files. This step is necessary because there are some small mismatches between the RST Discourse tree texts and the Penn tree texts.

```
cd dataset/rst/train
patch -p0 < ../../../patches/rst-ptb.train.patch
cd ../test
patch -p0 < ../../../patches/rst-ptb.test.patch
cd ../../ptb
patch -p0 < ../../patches/ptb-rst.patch
cd ../..
```
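If you would like to verify that a patch applies cleanly before it modifies any files, GNU `patch` offers a dry-run mode (a general `patch` feature, not something specific to this toolkit), e.g.:

```
cd dataset/rst/train
# report what would change without actually changing the files
patch -p0 --dry-run < ../../../patches/rst-ptb.train.patch
cd ../../..
```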
- Run tokenization.

```
python src/tokenize_rst.py --rst_path dataset/rst/train
python src/tokenize_rst.py --rst_path dataset/rst/test
```
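The tokenization relies on `nltk`. As a rough illustration of the Penn Treebank-style tokenization involved (a sketch only; the exact tokenizer and options used by `tokenize_rst.py` may differ):

```
import nltk

# PTB-style tokenization splits punctuation and contractions, which is
# what aligning RST discourse text against Penn Treebank trees requires
print(nltk.word_tokenize("The company's stock won't fall."))
# ['The', 'company', "'s", 'stock', 'wo', "n't", 'fall', '.']
```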
- Generate the training and testing sets for the joint treebank separately:

```
mkdir dataset/joint
python src/aligner.py --rst_path dataset/rst/train --const_path dataset/ptb > dataset/joint/train.txt
python src/aligner.py --rst_path dataset/rst/test --const_path dataset/ptb > dataset/joint/test.txt
```
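As a quick sanity check (our suggestion, not part of the original pipeline), you can confirm that both output files were generated and are non-empty, and peek at the output format:

```
wc -l dataset/joint/train.txt dataset/joint/test.txt
head -c 500 dataset/joint/train.txt
```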
To test the scripts above, you can play with the sample data:

```
python src/tokenize_rst.py --rst_path sampledata/rst
python src/aligner.py --rst_path sampledata/rst --const_path sampledata/ptb > sampledata/joint.txt
```
Since the joint parser is based on the Span-based Constituency Parser, please install the following dependencies:

- DyNet for the underlying neural model.
- numpy for interacting with DyNet.
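Both are available from PyPI; a minimal install sketch (for GPU support or a specific DyNet backend, refer to DyNet's own installation instructions instead):

```
pip install dynet numpy
```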
To train on the provided sample data, you can simply run:

```
mkdir exps
python src/trainer.py --train sampledata/joint.txt --dev sampledata/joint.txt --epoch 200 --save exps/sampledata.model
```
You can find the training parameters and their descriptions in `src/trainer.py`.
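Since the scripts parse their arguments with `python-gflags`, the full flag list with descriptions can usually also be printed from the command line (assuming the scripts use gflags' standard `--help` handling):

```
python src/trainer.py --help
```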
To evaluate the trained model on the sample data, you can run:

```
python src/parser.py --train sampledata/joint.txt --test sampledata/joint.txt --model exps/sampledata.model --verbose
```