# Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars

This repository provides the code for the paper:

> Ryo Yoshida, Hiroshi Noji, and Yohei Oseki. Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars. EMNLP 2021.
## Requirements

- python==3.9.13
- R==4.2.2
## Installation

```
git clone git@github.com:osekilab/RNNG-LC.git
cd RNNG-LC
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
```
## Data preparation

- Download the NPCMJ corpus from the NPCMJ website.
  - The version from 2021-03-02 was used in the paper. This version is no longer available on the website; if you need it, please contact the authors.
- Unzip the corpus and place the data under `data/`.
- Run the following command to preprocess the data:

```
python src/parser-data-gen/preprocess.py \
    --file_root data/npcmj \
    --save_root data/npcmj_preprocessed \
    --error_root data/npcmj_error
```
- Run the following command to split the data into train/dev/test sets:

```
python src/parser-data-gen/split.py \
    --preprocessed_root data/npcmj_preprocessed \
    --save_root data/npcmj_split
```
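`split.py` produces the train/dev/test portions of the preprocessed trees. A minimal sketch of what such a split looks like (the 8:1:1 proportions and the `split_corpus` helper are illustrative assumptions, not the script's actual behavior):

```python
def split_corpus(trees, dev_frac=0.1, test_frac=0.1):
    """Split a list of bracketed trees into train/dev/test portions."""
    n = len(trees)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    train = trees[: n - n_dev - n_test]
    dev = trees[n - n_dev - n_test : n - n_test]
    test = trees[n - n_test :]
    return train, dev, test

trees = [f"(S (NP w{i}))" for i in range(100)]
train, dev, test = split_corpus(trees)
print(len(train), len(dev), len(test))  # 80 10 10
```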
## Training

If you want to download the models trained in the paper, please contact the authors.
### RNNGs

- Install rnng-pytorch:

```
git clone https://github.com/aistairc/rnng-pytorch.git
cd rnng-pytorch
git checkout f9a5663
```

  - Version `f9a5663` was used in the paper.
- Run the following command to preprocess the data for RNNGs:

```
python preprocess.py \
    --vocabsize 8000 \
    --unkmethod subword \
    --subword_type bpe \
    --trainfile ../data/npcmj_split/train.mrg \
    --valfile ../data/npcmj_split/dev.mrg \
    --testfile ../data/npcmj_split/test.mrg \
    --outputfile ./data/npcmj \
    --keep_ptb_bracket
```
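With `--unkmethod subword`, rare words are decomposed into BPE pieces rather than mapped to a single `<unk>` token. A toy greedy longest-match segmenter, purely illustrative (the `segment` helper and the toy vocabulary are hypothetical; real BPE applies learned merge operations):

```python
def segment(word, vocab):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no piece covers this character
            i += 1
    return pieces

vocab = {"un", "know", "now", "n", "k"}
print(segment("unknown", vocab))  # ['un', 'know', 'n']
```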
- Train RNNGs:

```
mkdir models
cd ..
bash scripts/train_rnng.sh
```
### LSTMs

- Install neural-complexity:

```
git clone https://github.com/vansky/neural-complexity.git
cd neural-complexity
git checkout tags/v1.1.0
```

  - Version `v1.1.0` was used in the paper.
- Run the following command to preprocess the training data for LSTMs:

```
python src/lstm-data-gen/train_data_gen.py \
    --rnng_data_root rnng-pytorch/data/ \
    --train_file npcmj-train.json \
    --val_file npcmj-val.json \
    --test_file npcmj-test.json
```

- Train LSTMs:

```
mkdir models
cd ..
bash scripts/train_lstm.sh
```
## Evaluation

We cannot share the original text of the evaluation data due to copyright issues; please contact the authors of [BCCWJ-EyeTrack](https://github.com/masayu-a/BCCWJ-EyeTrack).

After obtaining the data:

- tokenize each sentence in the original text with HARUNIWA2, and
- replace round brackets with PTB-style brackets.

Then, place the text data of each unit (`[A-D].txt`) under `data/bccwj/`.

- NOTE: We manually corrected some tokenization errors in which the boundaries of phrasal units were not split.
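Round brackets in the raw text would collide with the bracketed tree format, which is why they are escaped with the PTB tokens `-LRB-` and `-RRB-`. A minimal sketch of the replacement over tokenized text (the `escape_brackets` helper is hypothetical):

```python
def escape_brackets(tokens):
    """Replace literal round brackets with PTB escape tokens."""
    table = {"(": "-LRB-", ")": "-RRB-"}
    return [table.get(t, t) for t in tokens]

print(escape_brackets(["a", "(", "b", ")"]))  # ['a', '-LRB-', 'b', '-RRB-']
```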
- Calculate surprisals with RNNGs:

```
bash scripts/calc_surp_rnng.sh
```
- Run the following command to preprocess the evaluation data for LSTMs:

```
python src/lstm-data-gen/eval_data_gen.py \
    --eval_data_root data/bccwj \
    --A_file A.txt \
    --B_file B.txt \
    --C_file C.txt \
    --D_file D.txt \
    --spm_model rnng-pytorch/data/npcmj-spm.model
```

- Calculate surprisals with LSTMs:

```
bash scripts/calc_surp_lstm.sh
```
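Surprisal is the negative log probability of a word given its preceding context, -log2 p(w | context); with a subword vocabulary, a word's surprisal is the sum of its pieces' surprisals by the chain rule. A toy illustration with made-up probabilities:

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 p."""
    return -math.log2(prob)

# A word split into two subword pieces, with conditional probabilities
# p(piece1 | context) and p(piece2 | context, piece1):
piece_probs = [0.25, 0.5]
word_surprisal = sum(surprisal(p) for p in piece_probs)
print(word_surprisal)  # 3.0 (= 2.0 + 1.0 bits)
```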
- Download BCCWJ-EyeTrack (see the repository linked above).
- Add the original text of each phrasal unit to the `surface` column in `fpt.csv`.
- Place the `fpt.csv` file under `data/BCCWJ-EyeTrack/`.
- Aggregate RNNG surprisals:

```
bash scripts/aggregate_surp_rnng.sh
```

- Aggregate LSTM surprisals:

```
bash scripts/aggregate_surp_lstm.sh
```

- Aggregate the number of unknown words:

```
bash scripts/aggregate_num_unk.sh
```

  - NOTE: The number of unknown words may vary depending on the version of the evaluation data.
- Concatenate the above data:

```
bash scripts/concat.sh
```

- Run the following command to post-process the data:

```
python src/aggregate/post_process.py \
    --input_path data/aggregated/concat/fpt.csv \
    --save_path data/aggregated/concat/fpt-del.csv
```
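The unknown-word counts aggregated above depend on which surface forms fall outside the model's vocabulary, which is why they can vary with the evaluation-data version. A toy sketch of counting out-of-vocabulary tokens (`count_unks` and the vocabulary are made up):

```python
def count_unks(tokens, vocab):
    """Count tokens that would map to <unk> under a closed vocabulary."""
    return sum(1 for t in tokens if t not in vocab)

vocab = {"the", "cat", "sat"}
print(count_unks(["the", "cat", "sat", "on", "mats"], vocab))  # 2
```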
## Statistical analysis

The script for the evaluation is in `r-workspace/modeling.R`. We recommend using RStudio to run it.
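`modeling.R` fits the regression models of reading times on surprisal in R. As a rough, purely illustrative analogue, here is an ordinary least-squares fit on toy data in Python (all names and numbers here are invented, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
surprisal = rng.uniform(2.0, 12.0, size=200)  # toy predictor (bits)
# Toy reading times (ms): intercept 150, slope 20, Gaussian noise.
reading_time = 150.0 + 20.0 * surprisal + rng.normal(0.0, 5.0, 200)

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones_like(surprisal), surprisal])
coef, *_ = np.linalg.lstsq(X, reading_time, rcond=None)
print(coef)  # approximately [150, 20]: intercept and surprisal slope
```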
## License

MIT
## Contact

If you want to download the 2021-03-02 version of the NPCMJ or the models trained in our paper, please contact yoshiryo0617 [at] g.ecc.u-tokyo.ac.jp.