This project provides a system to perform Japanese bridging reference resolution.
poetry install
cd /somewhere
-
KWDLC
mkdir kwdlc git clone https://github.com/ku-nlp/KWDLC kwdlc/KWDLC
kyoto-reader, which this project depends on,
provides some commands to preprocess corpora.
For example, kyoto configure
creates Makefile
to add features for this analyzer,
and kyoto idsplit
splits corpus documents into train/valid/test datasets.
$ git clone https://github.com/ku-nlp/JumanDIC
$ kyoto configure --corpus-dir /somewhere/kwdlc/KWDLC/knp \
--data-dir /somewhere/kwdlc \
--juman-dic-dir /somewhere/JumanDIC/dic
created Makefile at /somewhere/kwdlc
$ cd /somewhere/kwdlc && make -i
$ kyoto idsplit --corpus-dir /somewhere/kwdlc/knp \
--output-dir /somewhere/kwdlc \
--train /somewhere/kwdlc/KWDLC/id/split_for_pas/train.id \
--valid /somewhere/kwdlc/KWDLC/id/split_for_pas/dev.id \
--test /somewhere/kwdlc/KWDLC/id/split_for_pas/test.id
After adding features to the corpora, you need to load and pickle them.
python src/preprocess.py \
--kwdlc /somewhere/kwdlc \
--kc /somewhere/kc \
--out /somewhere/dataset \
--bert-name roberta \
--bert-path nlp-waseda/roberta-base-japanese
src/preprocess.py
creates a file config.yml
in which preprocessing settings and statistics of the dataset are written.
Copy this file to the config directory.
cp data/bldsample/config.yaml conf/dataset/example.yaml
-
Copy
conf/example.yaml
and edit the contents.cp conf/example.yaml conf/custom.yaml
-
Run training
python src/train.py -cn custom devices=[0,1] num_workers=4
poetry run python src/test.py checkpoint=/path/to/trained/checkpoint eval_set=valid devices=[0,1]
BRG_CACHE_DIR
: A directory where processed documents are cached. Default value is/data/$USER/brg_cache
.BRG_OVERWRITE_CACHE
: If set, the data loader does not load cache even if it exists.BRG_DISABLE_CACHE
: If set, the data loader does not load or save cache.WANDB_PROJECT
: Project name in Weights & Biases.
- Kyoto University Web Document Leads Corpus (KWDLC)
@inproceedings{ueda-kurohashi-2022-improving,
title = "Improving Bridging Reference Resolution using Continuous Essentiality from Crowdsourcing",
author = "Ueda, Nobuhiro and
Kurohashi, Sadao",
booktitle = "Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.crac-1.8",
pages = "74--87",
abstract = "Bridging reference resolution is the task of finding nouns that complement essential information of another noun. The essentiality varies depending on noun combination and context and has a continuous distribution. Despite the continuous nature of essentiality, existing datasets of bridging reference have only a few coarse labels to represent the essentiality. In this work, we propose a crowdsourcing-based annotation method that considers continuous essentiality. In the crowdsourcing task, we asked workers to select both all nouns with a bridging reference relation and a noun with the highest essentiality among them. Combining these annotations, we can obtain continuous essentiality. Experimental results demonstrated that the constructed dataset improves bridging reference resolution performance. The code is available at https://github.com/nobu-g/bridging-resolution.",
}
MIT
Nobuhiro Ueda <ueda at nlp.ist.i.kyoto-u.ac.jp>