This project provides a convolutional neural network model for relation extraction.
Please read CONTRIBUTING.md before filing an issue or creating a pull request.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- copy the project to your local machine
git clone https://github.com/ncbi-nlp/DeepRel.git
- install the required packages
pip install -r requirements.txt
Follow the instructions to install
Download Stanford CoreNLP and unpack the compressed file to
Prepare the dataset
The program needs three separate datasets in JSON format: training,
development, and test. Each file contains sentences with annotations and
deeprel_schema.json describes the data format. The folder
examples contains some examples.
To validate the dataset format, run
jsonschema -i examples/aimed-dev.json deeprel_schema.json
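Before running the schema check, it can help to confirm that each dataset file is readable and parses as JSON at all. The sketch below does only that, using the Python standard library. It does not check the files against deeprel_schema.json, and the function name check_json_files is illustrative rather than part of DeepRel.

```python
import json


def check_json_files(paths):
    """Map each path to None if it parses as JSON, else to an error message."""
    results = {}
    for path in paths:
        try:
            with open(path, encoding="utf-8") as fp:
                json.load(fp)
            results[str(path)] = None
        except (OSError, ValueError) as exc:
            # OSError: missing or unreadable file.
            # ValueError: malformed JSON or an encoding problem.
            results[str(path)] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # examples/aimed-dev.json is the file used in the command above.
    for path, error in check_json_files(["examples/aimed-dev.json"]).items():
        print(path, "OK" if error is None else error)
```

A file that passes this check can still fail the jsonschema validation, so the command above remains the authoritative test of the format.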
Prepare the configuration file
The program needs
INI_FILE to configure the locations of
Stanford CoreNLP, etc. An example of
INI_FILE can be found in
It is good practice to place the INI_FILE in the same folder as model_dir, but it is not required.
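An INI file of this kind can be inspected with Python's standard configparser module. The sketch below is illustrative only: the section name [paths] and the key corenlp_dir are hypothetical placeholders, not DeepRel's actual layout. Only model_dir is a name taken from this README, so consult the example INI_FILE for the real sections and keys.

```python
import configparser

# Hypothetical contents; the real section and key names may differ.
EXAMPLE_INI = """
[paths]
corenlp_dir = /path/to/stanford-corenlp
model_dir = /path/to/model_dir
"""


def read_paths(ini_text):
    """Return the key/value pairs of the (hypothetical) [paths] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["paths"])


if __name__ == "__main__":
    for key, value in read_paths(EXAMPLE_INI).items():
        print(f"{key} = {value}")
```

Printing the parsed values this way is a quick check that the paths in the configuration point where you expect before starting a long preprocessing run.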
Preparse the datasets
In most cases, running the following command will parse the datasets and create the input matrices for training and testing.
python run.py -pfvmsd INI_FILE
The program will generate intermediate files in
model_dir specified in the
- all - stores parsed documents in JSON
- DATASET.npz - input matrix of sentences
- DATASET-sp.npz - input matrix of shortest paths between two annotations
- DATASET-doc.npz - input matrix of doc2vec
- vocabs.json - vocabulary
- word2vec.npz - maps from words to vectors
- pos.npz - maps from part-of-speech tags to vectors
- chunk.npz - maps from chunks to vectors
- arg1_dis.npz - maps from the distances between argument1 and current word to vectors
- arg2_dis.npz - maps from the distances between argument2 and current word to vectors
- dependency.npz - maps from dependencies to vectors
- type.npz - maps from named entities to vectors
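To confirm that preprocessing produced everything in the list above, a small check like the following may be useful. It is a sketch under two assumptions: that the files sit directly inside model_dir, and that DATASET in the file names stands for a dataset name you supply. The function name and the dataset names in the usage line are hypothetical.

```python
from pathlib import Path

# File names that do not depend on the dataset, as listed above.
SHARED_FILES = [
    "vocabs.json", "word2vec.npz", "pos.npz", "chunk.npz",
    "arg1_dis.npz", "arg2_dis.npz", "dependency.npz", "type.npz",
]
# Per-dataset patterns from the list above; {} is replaced by the dataset name.
DATASET_PATTERNS = ["{}.npz", "{}-sp.npz", "{}-doc.npz"]


def missing_intermediate_files(model_dir, datasets):
    """Return the expected file names that are not present in model_dir."""
    expected = list(SHARED_FILES)
    for name in datasets:
        expected.extend(pattern.format(name) for pattern in DATASET_PATTERNS)
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # "model_dir" and the dataset names below are placeholders.
    print(missing_intermediate_files("model_dir", ["training", "development", "test"]))
```

An empty list means every expected file was found. The check looks only at file names, not at the contents of the matrices.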
You can also run the run.py program step by step, so that you can modify and check different parts of the inputs.
For example, to check how different parsers affect performance, you can replace the parse tree in each JSON file in all and rerun run.py with -fvmstd to regenerate the matrices.
python deeprel/run.py -h

Usage:
    run.py [options] INI_FILE

Options:
    --log <str>  Log option. One of DEBUG, INFO, WARNING, ERROR, and CRITICAL. [default: INFO]
    -p           preparse [default: False]
    -f           create features [default: False]
    -v           create vocabularies [default: False]
    -m           create matrix [default: False]
    -s           create shortest path matrix [default: False]
    -d           create doc2vec [default: False]
    -t           test matrix format [default: False]
    -k           skip pre-parsed documents [default: False]
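When running individual steps from a script, it may be convenient to build the combined flag from step names instead of typing it by hand. The sketch below does this with a hypothetical helper. The single-letter flags come from the usage text, while the step names, the function name, and config.ini are labels chosen here for illustration.

```python
# Flags taken from the run.py usage text; the step names are illustrative labels.
STEP_FLAGS = {
    "preparse": "p",
    "features": "f",
    "vocabularies": "v",
    "matrix": "m",
    "shortest_path": "s",
    "doc2vec": "d",
    "test_matrix": "t",
    "skip_preparsed": "k",
}


def build_run_command(steps, ini_file, script="deeprel/run.py"):
    """Return the argument list that runs the given steps with one combined flag."""
    if not steps:
        raise ValueError("at least one step is required")
    unknown = [step for step in steps if step not in STEP_FLAGS]
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")
    flags = "".join(STEP_FLAGS[step] for step in steps)
    return ["python", script, "-" + flags, ini_file]


if __name__ == "__main__":
    # Everything except preparsing, e.g. after editing the parsed documents.
    steps = ["features", "vocabularies", "matrix", "shortest_path", "test_matrix", "doc2vec"]
    print(" ".join(build_run_command(steps, "config.ini")))
```

The returned list can be passed to subprocess.run, which avoids shell quoting problems when the INI path contains spaces.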
Train the model
python deeprel/train.py INI_FILE
The program will train a CNN model using the training and development sets.
The model will be stored at
model_dir specified in the
Test the model
python deeprel/test.py model_dir
This will print a report of results using the model and test set.
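The preprocessing, training, and testing commands can be chained from a single script so that a failure in one step stops the rest. The following is a minimal sketch, not part of DeepRel: the command lines are copied from this README, while the function name run_pipeline, the injectable runner argument, and the paths in the usage line are assumptions made for illustration.

```python
import subprocess


def run_pipeline(ini_file, model_dir, runner=subprocess.run):
    """Run preparse, train, and test in order; stop at the first failure."""
    commands = [
        ["python", "run.py", "-pfvmsd", ini_file],
        ["python", "deeprel/train.py", ini_file],
        ["python", "deeprel/test.py", model_dir],
    ]
    for command in commands:
        # With subprocess.run, check=True raises CalledProcessError
        # when a command exits with a non-zero status.
        runner(command, check=True)
    return commands


if __name__ == "__main__":
    import sys

    # Usage: python pipeline.py INI_FILE MODEL_DIR (both paths are placeholders).
    if len(sys.argv) == 3:
        run_pipeline(sys.argv[1], sys.argv[2])
```

The runner parameter exists only so that the sequence can be exercised without launching the real programs. In normal use the default, subprocess.run, is what executes each command.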
Please read CONTRIBUTING for details on our code of conduct, and the process for submitting pull requests to us.
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. We are also grateful to Robert Leaman for the helpful discussion.
- Peng Y, Rios A, Kavuluru R, Lu Z. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database. 2018, 1-9. bay073.
- Peng Y, Lu Z. Deep learning for extracting protein-protein interactions from biomedical literature. In Proceedings of BioNLP workshop. 2017.