Implementation of Neural RST Parser

A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing

Paper: URL for arXiv

Abstract: To promote and further develop RST-style discourse parsing models, we need a strong baseline that can be regarded as a reference for reporting reliable experimental results. This paper explores a strong baseline by integrating existing simple parsing strategies, top-down and bottom-up, with various transformer-based pre-trained language models. The experimental results obtained from two benchmark datasets demonstrate that the parsing performance strongly relies on the pre-trained language models rather than the parsing strategies. In particular, the bottom-up parser achieves large performance gains compared to the current best parser when employing DeBERTa. We further reveal that language models with a span-masking scheme especially boost the parsing performance through our analysis within intra- and multi-sentential parsing, and nuclearity prediction.

Setup

  1. Prepare a Python environment with conda.
  2. Clone this repository:
    git clone https://github.com/nttcslab-nlp/RSTParser_EMNLP22
    cd RSTParser_EMNLP22
  3. Manually install PyTorch to enable GPU support (replace XXX with your CUDA version):
    conda install pytorch cudatoolkit=XXX -c pytorch
    conda install torchtext -c pytorch
  4. Install the remaining dependencies:
    pip install -r requirements.txt

Preprocessing the datasets

RSTDT

If you have the RSTDT dataset already preprocessed with Heilman's script, you can use it as-is.

Each document in Heilman's format has the following elements:

- doc_id
- edu_start_indices
- edu_starts_paragraph
- edu_strings
- path_basename         # not necessary
- pos_tags              # not necessary
- rst_tree
- syntax_trees          # not necessary
- token_tree_positions  # not necessary
- tokens

There is a sample at data/rstdt/sample.json.
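
As a quick sanity check, you can load the sample and inspect these fields. A minimal Python sketch, assuming the file holds a single JSON object per document (check the actual layout of data/rstdt/sample.json before relying on this):

import json

# Inspect the fields of one preprocessed document.
# Assumption: data/rstdt/sample.json stores a single JSON object;
# adjust if it stores a list or one object per line.
with open("data/rstdt/sample.json") as f:
    doc = json.load(f)

print(doc["doc_id"])
print(len(doc["edu_strings"]), "EDUs")
print(doc["edu_start_indices"][:3])  # presumably where each EDU begins in doc["tokens"]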

Instr-DT

The Instr-DT dataset contains documents that are annotated with multiple trees. We handle such documents by combining the multiple trees into a single tree using the "Nucleus-Nucleus" nuclearity and "topic-change?" relation labels, as sketched below.
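
The joining step is conceptually simple. A minimal Python sketch, using a hypothetical dict-based tree layout rather than the repository's actual tree format (the real logic lives in tools/preprocess.py):

def join_trees(trees):
    # Fold a document's trees into one left-branching binary tree.
    # Each new internal node carries the "Nucleus-Nucleus" nuclearity
    # and the "topic-change?" relation label.
    joined = trees[0]
    for right in trees[1:]:
        joined = {
            "nuclearity": "Nucleus-Nucleus",
            "relation": "topic-change?",
            "children": [joined, right],
        }
    return joined

The actual preprocessing is run as follows: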

cd data/instrdt
git clone https://github.com/EducationalTestingService/rstfinder.git
rm rstfinder/rstfinder/__init__.py  # its imports require the zpar package, which we do not need
ln -nfs $PWD/rstfinder/rstfinder/ tools/

python tools/preprocess.py \
    --input-dir PATH/TO/instr-discourse-data/discourse_annotation/ \
    --output-file all.json \
    --joint-with-nn
for tgt in train valid test; do
    python tools/extract.py \
        --src all.json \
        --tgt ${tgt}.json \
        --ids ${tgt}_ids.txt
done

Train and test

bash scripts/run_shift_reduce_v1.deberta-base.1e-5.sh

This script was generated by scripts/generator_rstdt.sh. Some environment variables (e.g., CUDA_VISIBLE_DEVICES) are hard-coded in the script.

Although the maximum is 20 epochs, training converges in about 5 epochs. (Training takes about 1 hour per epoch on a GeForce RTX 3090.)

Models are saved into ./models/rstdt/shift_reduce_v1.deberta-base.1e-5/version_?/checkpoints/.

The saved checkpoints are the following:

PATH/TO/checkpoints/
- epoch=3-step=?????.ckpt  # saved during training
- epoch=3-step=?????.ckpt  # saved during training
- epoch=4-step=?????.ckpt  # saved during training
- last.ckpt                # saved at the end of training
- best.ckpt                # best model by validation score, selected during evaluation
- average.ckpt             # output of checkpoint weight averaging (CPA) during evaluation; see the sketch below
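
For reference, checkpoint weight averaging amounts to a parameter-wise mean over the saved checkpoints. A minimal sketch, assuming PyTorch Lightning-style files whose top-level dict contains a "state_dict" entry (the repository's own implementation may differ):

import torch

def average_checkpoints(ckpt_paths, out_path):
    # Parameter-wise mean over several checkpoints (the idea behind
    # average.ckpt). Assumes each file stores a "state_dict" mapping.
    avg = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    avg = {k: v / len(ckpt_paths) for k, v in avg.items()}
    torch.save({"state_dict": avg}, out_path)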

Test only a single checkpoint

python src/test.py --ckpt-path PATH/TO/CKPT --save-dir PATH/TO/TREES/ --metrics OriginalParseval

Parse raw document (sequence of EDUs)

TBU

Performance and Checkpoints

These results were obtained with OriginalParseval. (Scores differ slightly from those in the paper due to model retraining.)

Model         LM       CPA  Span  Nuc.  Rel.  Full  ckpt
Shift-Reduce  DeBERTa       77.0  67.2  57.5  55.5  Google Drive
Shift-Reduce  DeBERTa  x    78.3  69.0  58.2  56.2  Google Drive

Reference

TBU

LICENSE

This software is released under the NTT License; see LICENSE.txt.

Under this license, creating pull requests is not permitted. Please feel free to open issues instead.
