# Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars

This repository provides the code for the paper:

> Ryo Yoshida, Hiroshi Noji, and Yohei Oseki. Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars. EMNLP 2021.
## Requirements

- python==3.9.13
- R==4.2.2
## Installation

```
git clone git@github.com:osekilab/RNNG-LC.git
cd RNNG-LC
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
```
## Data preparation

- Download the NPCMJ corpus from the NPCMJ website.
  - The version from 2021-03-02 was used in the paper. This version is no longer available on the website; if you need it, please contact the authors.
- Unzip the corpus and place the data under `data/`.
- Run the following command to preprocess the data:

```
python src/parser-data-gen/preprocess.py \
    --file_root data/npcmj \
    --save_root data/npcmj_preprocessed \
    --error_root data/npcmj_error
```
- Run the following command to split the data into train/dev/test sets:

```
python src/parser-data-gen/split.py \
    --preprocessed_root data/npcmj_preprocessed \
    --save_root data/npcmj_split
```
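`split.py` produces the train/dev/test portions of the preprocessed trees. A minimal sketch of what such a split looks like (the 8:1:1 proportions and the `split_corpus` helper are illustrative assumptions, not the script's actual behavior):

```python
def split_corpus(trees, dev_frac=0.1, test_frac=0.1):
    """Split a list of bracketed trees into train/dev/test portions."""
    n = len(trees)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    train = trees[: n - n_dev - n_test]
    dev = trees[n - n_dev - n_test : n - n_test]
    test = trees[n - n_test :]
    return train, dev, test

trees = [f"(S (NP w{i}))" for i in range(100)]
train, dev, test = split_corpus(trees)
print(len(train), len(dev), len(test))  # 80 10 10
```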
## Training

If you want to download the models trained in the paper, please contact the authors.
### RNNGs

- Install rnng-pytorch:

```
git clone https://github.com/aistairc/rnng-pytorch.git
cd rnng-pytorch
git checkout f9a5663
```

  - Version `f9a5663` was used in the paper.
- Run the following command to preprocess the data for RNNGs:

```
python preprocess.py \
    --vocabsize 8000 \
    --unkmethod subword \
    --subword_type bpe \
    --trainfile ../data/npcmj_split/train.mrg \
    --valfile ../data/npcmj_split/dev.mrg \
    --testfile ../data/npcmj_split/test.mrg \
    --outputfile ./data/npcmj \
    --keep_ptb_bracket
```
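With `--unkmethod subword`, rare words are decomposed into BPE pieces rather than mapped to a single `<unk>` token. A toy greedy longest-match segmenter, purely illustrative (the `segment` helper and the toy vocabulary are hypothetical; real BPE applies learned merge operations):

```python
def segment(word, vocab):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no piece covers this character
            i += 1
    return pieces

vocab = {"un", "know", "now", "n", "k"}
print(segment("unknown", vocab))  # ['un', 'know', 'n']
```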
- Train RNNGs:

```
mkdir models
cd ..
bash scripts/train_rnng.sh
```
### LSTMs

- Install neural-complexity:

```
git clone https://github.com/vansky/neural-complexity.git
cd neural-complexity
git checkout tags/v1.1.0
```

  - Version `v1.1.0` was used in the paper.
- Run the following command to preprocess the training data for LSTMs:

```
python src/lstm-data-gen/train_data_gen.py \
    --rnng_data_root rnng-pytorch/data/ \
    --train_file npcmj-train.json \
    --val_file npcmj-val.json \
    --test_file npcmj-test.json
```

- Train LSTMs:

```
mkdir models
cd ..
bash scripts/train_lstm.sh
```
## Evaluation

We cannot share the original text of the evaluation data due to copyright issues; please contact the authors of [BCCWJ-EyeTrack](https://github.com/masayu-a/BCCWJ-EyeTrack).

After obtaining the data:

- tokenize each sentence in the original text with HARUNIWA2, and
- replace round brackets with PTB-style brackets.

Then, place the text data of each unit (`[A-D].txt`) under `data/bccwj/`.

- NOTE: We manually corrected some tokenization errors in which the boundaries of phrasal units were not split.
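Round brackets in the raw text would collide with the bracketed tree format, which is why they are escaped with the PTB tokens `-LRB-` and `-RRB-`. A minimal sketch of the replacement over tokenized text (the `escape_brackets` helper is hypothetical):

```python
def escape_brackets(tokens):
    """Replace literal round brackets with PTB escape tokens."""
    table = {"(": "-LRB-", ")": "-RRB-"}
    return [table.get(t, t) for t in tokens]

print(escape_brackets(["a", "(", "b", ")"]))  # ['a', '-LRB-', 'b', '-RRB-']
```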
- Calculate surprisals with RNNGs:

```
bash scripts/calc_surp_rnng.sh
```
- Run the following command to preprocess the evaluation data for LSTMs:

```
python src/lstm-data-gen/eval_data_gen.py \
    --eval_data_root data/bccwj \
    --A_file A.txt \
    --B_file B.txt \
    --C_file C.txt \
    --D_file D.txt \
    --spm_model rnng-pytorch/data/npcmj-spm.model
```

- Calculate surprisals with LSTMs:

```
bash scripts/calc_surp_lstm.sh
```
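Surprisal is the negative log probability of a word given its preceding context, -log2 p(w | context); with a subword vocabulary, a word's surprisal is the sum of its pieces' surprisals by the chain rule. A toy illustration with made-up probabilities:

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 p."""
    return -math.log2(prob)

# A word split into two subword pieces, with conditional probabilities
# p(piece1 | context) and p(piece2 | context, piece1):
piece_probs = [0.25, 0.5]
word_surprisal = sum(surprisal(p) for p in piece_probs)
print(word_surprisal)  # 3.0 (= 2.0 + 1.0 bits)
```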
- Download BCCWJ-EyeTrack (see the repository linked above).
- Add the original text of each phrasal unit to the `surface` column in `fpt.csv`.
- Place the `fpt.csv` file under `data/BCCWJ-EyeTrack/`.
- Aggregate RNNG surprisals:

```
bash scripts/aggregate_surp_rnng.sh
```

- Aggregate LSTM surprisals:

```
bash scripts/aggregate_surp_lstm.sh
```

- Aggregate the number of unknown words:

```
bash scripts/aggregate_num_unk.sh
```

  - NOTE: The number of unknown words may vary depending on the version of the evaluation data.
- Concatenate the above data:

```
bash scripts/concat.sh
```

- Run the following command to post-process the data:

```
python src/aggregate/post_process.py \
    --input_path data/aggregated/concat/fpt.csv \
    --save_path data/aggregated/concat/fpt-del.csv
```
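The unknown-word counts aggregated above depend on which surface forms fall outside the model's vocabulary, which is why they can vary with the evaluation-data version. A toy sketch of counting out-of-vocabulary tokens (`count_unks` and the vocabulary are made up):

```python
def count_unks(tokens, vocab):
    """Count tokens that would map to <unk> under a closed vocabulary."""
    return sum(1 for t in tokens if t not in vocab)

vocab = {"the", "cat", "sat"}
print(count_unks(["the", "cat", "sat", "on", "mats"], vocab))  # 2
```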
## Statistical analysis

The script for the evaluation is in `r-workspace/modeling.R`. We recommend using RStudio to run it.
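`modeling.R` fits the regression models of reading times on surprisal in R. As a rough, purely illustrative analogue, here is an ordinary least-squares fit on toy data in Python (all names and numbers here are invented, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
surprisal = rng.uniform(2.0, 12.0, size=200)  # toy predictor (bits)
# Toy reading times (ms): intercept 150, slope 20, Gaussian noise.
reading_time = 150.0 + 20.0 * surprisal + rng.normal(0.0, 5.0, 200)

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones_like(surprisal), surprisal])
coef, *_ = np.linalg.lstsq(X, reading_time, rcond=None)
print(coef)  # approximately [150, 20]: intercept and surprisal slope
```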
## License

MIT
## Contact

If you want to download the 2021-03-02 version of the NPCMJ or the models trained in our paper, please contact yoshiryo0617 [at] g.ecc.u-tokyo.ac.jp.