Full Abstract Relation Extraction from Biological Texts with Bi-affine Relation Attention Networks

This code was used in the paper:

Simultaneously Self-attending to All Mentions for Full-Abstract Biological Relation Extraction
Patrick Verga, Emma Strubell, and Andrew McCallum.
North American Chapter of the Association for Computational Linguistics (NAACL), 2018.


Requirements: Python 2.7, TensorFlow 1.0.1.
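One way to set up a matching environment, sketched here as an assumption rather than an official install procedure (the environment name and the GPU variant are not from this README):

# Create an isolated Python 2.7 environment and pin TensorFlow.
virtualenv -p python2.7 bran-env        # "bran-env" is an arbitrary name
source bran-env/bin/activate
pip install tensorflow==1.0.1           # or tensorflow-gpu==1.0.1 for GPU training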

Setup Environment Variables

From this directory, run: source set_environment.sh
Note: this only sets the paths for the current shell session.
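As a rough sketch of what the script does, assuming it simply exports the repository root used throughout this README (the PYTHONPATH line is a further assumption, not verified against the actual file):

# Assumed contents of set_environment.sh, for illustration only.
export CDR_IE_ROOT="$(pwd)"                            # repository root
export PYTHONPATH="${CDR_IE_ROOT}/src:${PYTHONPATH}"   # assumed: make src/ importable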

Processing Data


Process the CDR dataset

Process the CDR dataset including additional weakly labeled data

These scripts use byte-pair encoding (BPE) tokenization; there are also scripts that tokenize with the Genia tokenizer instead. Example invocations for the CDR processing scripts are sketched below.
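If the processing scripts follow the repository's bin/ layout, the invocations look roughly like the following; the script names here are assumptions, so check bin/ for the actual files:

# Assumed script paths, shown for illustration only.
${CDR_IE_ROOT}/bin/process_CDR/process_CDR.sh               # assumed: CDR only, BPE tokenization
${CDR_IE_ROOT}/bin/process_CDR/process_CDR_extra_data.sh    # assumed: CDR plus weakly labeled data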

Run Model

Train a model locally on GPU id 0:
${CDR_IE_ROOT}/bin/run.sh ${CDR_IE_ROOT}/configs/cdr/relex/cdr_2500 0

If you are using a cluster with Slurm, you can instead use this command:
${CDR_IE_ROOT}/bin/srun.sh ${CDR_IE_ROOT}/configs/cdr/relex/cdr_2500

Saving and loading models

By default, the model is evaluated on the CDR dev set. To save the best model to the file model.tf, add the --save_model flag:
${CDR_IE_ROOT}/bin/run.sh ${CDR_IE_ROOT}/configs/cdr/relex/cdr_2500 0 --save_model model.tf

To load a saved model, run:
${CDR_IE_ROOT}/bin/run.sh ${CDR_IE_ROOT}/configs/cdr/relex/cdr_2500 0 --load_model path/to/model.tf

Pretrained Models

You can download some pretrained models here.

Generating the CTD dataset

The CTD processing script generates the full CTD dataset, tokenizing with BPE using a budget of 50k tokens. There is also a variant that tokenizes using the Genia tokenizer; example invocations for both are sketched below.

By default, abstracts with more than 500 tokens are discarded. To disable this filtering, set the MAX_LEN variable to a very large number.
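If the CTD scripts follow the same bin/ layout as the CDR ones, the commands look roughly like this; the script names are assumptions, so check bin/ for the real ones:

# Assumed script paths, for illustration only.
${CDR_IE_ROOT}/bin/process_CTD/process_CTD.sh          # assumed: BPE, 50k-token budget
${CDR_IE_ROOT}/bin/process_CTD/process_CTD_genia.sh    # assumed: Genia tokenizer variant
# To keep long abstracts, raise MAX_LEN inside the script, e.g. MAX_LEN=1000000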