This repository contains the implementation of the joint syntacto-discourse parser and the syntacto-discourse treebank. For more details, please refer to the paper *Joint Syntacto-Discourse Parsing and the Syntacto-Discourse Treebank*.
Due to copyright restrictions, we cannot provide the joint treebank in a form that can be directly used to train a parser. Instead, we provide a patch toolkit to generate the Syntacto-Discourse Treebank given the RST Discourse Treebank and the Penn Treebank.
The treebank generation scripts depend on:

- `python-gflags` for parsing script arguments.
- `nltk` for tokenization.
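Both packages are available from PyPI. Below is a minimal setup sketch, assuming a standard `pip` environment; the `punkt` download is an assumption about which `nltk` tokenizer models the scripts need:

```
pip install python-gflags nltk
# nltk's default word tokenizer uses the punkt model data;
# download it once (assumption: tokenize_rst.py relies on it)
python -m nltk.downloader punkt
```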
Please follow the steps below to generate the treebank:
- Place the RST Discourse Treebank in folder `dataset/rst`. Put the discourse trees (the `wsj_xxxx.out.dis` files) from the training and testing portions of the RST Discourse Treebank into `dataset/rst/train` and `dataset/rst/test`, respectively. Each `wsj_xxxx.out.dis` file corresponds to one WSJ article, where `xxxx` is the article number.
- Place the Penn Treebank trees in folder `dataset/ptb`. These constituency trees are in parenthesized format and are grouped into one treebank file (named `wsj_xxxx.cleangold`) per WSJ article.
- Apply patches to the RST Discourse Treebank files and the Penn Treebank files. This step is necessary because there are some small mismatches between the RST Discourse tree texts and the Penn tree texts.

```
cd dataset/rst/train
patch -p0 < ../../../patches/rst-ptb.train.patch
cd ../test
patch -p0 < ../../../patches/rst-ptb.test.patch
cd ../../ptb
patch -p0 < ../../patches/ptb-rst.patch
cd ../..
```
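If you would like to verify that a patch applies cleanly before it modifies any files, GNU `patch` offers a dry-run mode (a general `patch` feature, not something specific to this toolkit), e.g.:

```
cd dataset/rst/train
# report what would change without actually changing the files
patch -p0 --dry-run < ../../../patches/rst-ptb.train.patch
cd ../../..
```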
- Run tokenization.

```
python src/tokenize_rst.py --rst_path dataset/rst/train
python src/tokenize_rst.py --rst_path dataset/rst/test
```
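The tokenization relies on `nltk`. As a rough illustration of the Penn Treebank-style tokenization involved (a sketch only; the exact tokenizer and options used by `tokenize_rst.py` may differ):

```
import nltk

# PTB-style tokenization splits punctuation and contractions, which is
# what aligning RST discourse text against Penn Treebank trees requires
print(nltk.word_tokenize("The company's stock won't fall."))
# ['The', 'company', "'s", 'stock', 'wo', "n't", 'fall', '.']
```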
- Generate the training and testing sets for the joint treebank separately:

```
mkdir dataset/joint
python src/aligner.py --rst_path dataset/rst/train --const_path dataset/ptb > dataset/joint/train.txt
python src/aligner.py --rst_path dataset/rst/test --const_path dataset/ptb > dataset/joint/test.txt
```
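As a quick sanity check (our suggestion, not part of the original pipeline), you can confirm that both output files were generated and are non-empty, and peek at the output format:

```
wc -l dataset/joint/train.txt dataset/joint/test.txt
head -c 500 dataset/joint/train.txt
```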
To test the scripts above, you can play with the sample data:

```
python src/tokenize_rst.py --rst_path sampledata/rst
python src/aligner.py --rst_path sampledata/rst --const_path sampledata/ptb > sampledata/joint.txt
```
Since the joint parser is based on the Span-based Constituency Parser, please install the following dependencies:

- DyNet for the underlying neural model.
- numpy for interacting with DyNet.
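Both are available from PyPI; a minimal install sketch (for GPU support or a specific DyNet backend, refer to DyNet's own installation instructions instead):

```
pip install dynet numpy
```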
To train on the provided sample data, you can simply run:

```
mkdir exps
python src/trainer.py --train sampledata/joint.txt --dev sampledata/joint.txt --epoch 200 --save exps/sampledata.model
```
You can find the training parameters and their descriptions in `src/trainer.py`.
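Since the scripts parse their arguments with `python-gflags`, the full flag list with descriptions can usually also be printed from the command line (assuming the scripts use gflags' standard `--help` handling):

```
python src/trainer.py --help
```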
To evaluate the trained model on the sample data, you can run:

```
python src/parser.py --train sampledata/joint.txt --test sampledata/joint.txt --model exps/sampledata.model --verbose
```