COFFTEA

Code for EMNLP2023 paper "Coarse-to-Fine Dual Encoders are Better Frame Identification Learners"

Paper link: Coarse-to-Fine Dual Encoders are Better Frame Identification Learners

Data Preparation

You can prepare data for COFFTEA from scratch, or can directly download from link and place into the coresponding directory.

Firstly, follow Open-sesame to handle data in the XML format specified under FrameNet, and get processed CONLL files, including the spilt of train/eval/test.

Then, run the python code in process_data directory to construct the dataset we used:

Note: Please specify FrameNet VERSION in global_config.json

Extract frame definitions,lexical unit definitions and frame relations.

python process_data/extra_definition.py
python process_data/frame_relation.py

Construct In-batch Learning dataset.

python process_data/in-batch.py

Construct In-candidate Learning dataset，pad_model comes from ["random", "lu", "lu+random", "lu+sib+random"].

python process_data/in-candidate.py  {pad_mode}

For example, your data directory will be same as the following:

.
├── fn1.7
│   ├── in_batch
│   │   └── with_exemplars
│   │       ├── dev_lexical_filter.csv
│   │       ├── dev_wo_lexical_filter.csv
│   │       ├── examplar.csv
│   │       ├── frame_definition.csv
│   │       ├── test_lexical_filter.csv
│   │       ├── test_wo_lexical_filter.csv
│   │       └── train_wo_lexical_filter.csv
│   └── in_candidate
│       └── with_exemplars
│           ├── lu_random
│           │   ├── dev_lexical_filter.csv
│           │   ├── dev_wo_lexical_filter.csv
│           │   ├── examplar.csv
│           │   ├── frame_definition.csv
│           │   ├── test_lexical_filter.csv
│           │   ├── test_wo_lexical_filter.csv
│           │   └── train_lexical_filter.csv
│           └── lu_sib_random
│               ├── dev_lexical_filter.csv
│               ├── dev_wo_lexical_filter.csv
│               ├── examplar.csv
│               ├── frame_definition.csv
│               ├── test_lexical_filter.csv
│               ├── test_wo_lexical_filter.csv
│               └── train_lexical_filter.csv
└── raw_data
    ├── fe_relation
    │   └── fn1.7
    │       └── fe_relation.csv
    ├── fndata-1.7
        └── ...
    ├── frame_def
    │   └── fn1.7
    │       ├── frame2def.json
    │       ├── frame2lu_def.json
    │       └── frame2lu.json
    ├── frame_relation
    │   └── fn1.7
    │       ├── frame_relation.csv
    │       └── frame_relation.pkl
    ├── lu_def
    │   └── fn1.7
    │       ├── lu2def.json
    │       ├── lu2frame_def.json
    │       └── lu2frame.json
    └── open_sesame_v1_data
        └── fn1.7
            ├── fn1.7.dev.syntaxnet.conll
            ├── fn1.7.dev.syntaxnet.conll.sents
            ├── fn1.7.exemplar.train.syntaxnet.conll
            ├── fn1.7.exemplar.train.syntaxnet.conll.sents
            ├── fn1.7.fulltext.train.syntaxnet.conll
            ├── fn1.7.fulltext.train.syntaxnet.conll.sents
            ├── fn1.7.test.syntaxnet.conll
            └── fn1.7.test.syntaxnet.conll.sents

Training

Note: the TRAIN_MODE and TRAIN_DATA_MODE are connected, and you can choose the TEST_DATA_MODE to determine the evaluation setting on lexical filtering or without lexical filtering.

In-batch Learning

To effectively facilitate one target with all frames and update frame representations simultaneously, we employ In-batch Learning to realize this.

Please run bash code/train_batch.sh.

In-candidate Learning

To further distinguish frames that are more likely to be confused with one another, we employ In-candidate to incorporate more complicated frame relations to construct the hard negative frames, following the order of candidate frames, sibling frames, and random frames.

Please run bash code/train_candidate.sh.

Coarse-to-fine Two-stage Learning

As our paper claimed, there is a trade-off between in-batch learning $L_B$ and in-candidate learning $L_C$, the former is better on without lexical filtering and the latter excels at lexical filtering. We train COFFTEA with $L_B$ on exemplars and then train COFFTEA with $L_C$ with train split, which is equivalent to performing coarse-grained pretraining on a large-scale corpus and then fine-tuning on a smaller dataset for domain adaptation at a finer granularity.

firstly, run code/pretrain.sh and then replace the pretrain_model_path with the pretraining model path and run code/train_candidate_from_pretrain.sh.

Contact Us

For any questions or issues, please feel free to contact ankaikai@stu.pku.edu.cn.

We sincerely thank the open-source projects FIDO and Open-sesame for their invaluable contributions in helping us complete our work.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
code		code
data		data
process_data		process_data
LICENSE		LICENSE
README.md		README.md
method.pdf		method.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

data

data

process_data

process_data

LICENSE

LICENSE

README.md

README.md

method.pdf

method.pdf

Repository files navigation

COFFTEA

Data Preparation

Training

In-batch Learning

In-candidate Learning

Coarse-to-fine Two-stage Learning

Contact Us

About

Releases

Packages

Languages

License

pkunlp-icler/COFFTEA

Folders and files

Latest commit

History

Repository files navigation

COFFTEA

Data Preparation

Training

In-batch Learning

In-candidate Learning

Coarse-to-fine Two-stage Learning

Contact Us

About

Resources

License

Stars

Watchers

Forks

Languages