# PhenoRerank
This repository contains the source code and pre-processed datasets for benchmarking phenotype annotators and re-ranking their results. To facilitate the benchmarking, we provide wrapper code for several existing annotators, including OBO, NCBO, Monarch Initiative, Doc2hpo, MetaMap, ClinPhen, NeuralCR, and TrackHealth. We also developed a re-ranking model that boosts annotator performance, particularly precision, by filtering out false positives based on contextual information. The model is pre-trained on a pretext task defined over the textual data in the Human Phenotype Ontology (i.e., term names, synonyms, definitions, and comments), and it can be fine-tuned on a specific dataset for further improvement.
The following instructions will help you set up the programs and datasets, and reproduce the benchmarking results.
First, install a Python interpreter (tested with 3.6.10) and the following packages:
- numpy (tested 1.18.5)
- scipy (tested 1.8.0)
- pandas (tested 1.0.5)
- ftfy (tested 5.7)
- apiclient (tested 1.0.4)
- pyyaml (tested 6.0.1)
- pymetamap [optional] (tested 0.1)
- clinphen [optional] (tested 1.28)
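A minimal installation sketch using pip (package names as listed above; pin to the tested versions if you need exact reproducibility, and note that the optional packages may need to be installed from their source repositories if they are not available on PyPI):

```sh
# Core dependencies (pin to the tested versions above for exact reproducibility)
pip install numpy scipy pandas ftfy apiclient pyyaml
# Optional wrappers for MetaMap and ClinPhen
pip install pymetamap clinphen
```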
- Run the script `install.sh` to download and configure the external programs used in the benchmark.
- Follow the instructions here to install MetaMap, and make sure the locations of the programs `skrmedpostctl` and `wsdserverctl` are added to `$PATH` (see the server-startup sketch after this list).
- Follow the instructions here to install the dependencies of NeuralCR and download the model parameters. Then make a copy of, or create a soft link to, the `model_params` folder in the folder where you are going to run the benchmark.
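Before benchmarking MetaMap, its supporting servers need to be running. A minimal sketch, assuming a standard MetaMap installation with both control scripts on `$PATH`:

```sh
# Start the part-of-speech tagger server and the word sense disambiguation server
skrmedpostctl start
wsdserverctl start
```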
Follow the guidelines to get the API keys for NCBO and TrackHealth. Then assign them to the `API_KEY` global variable in the wrappers `util/ncbo.py` and `util/trkhealth.py`, respectively.
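For example, in `util/ncbo.py` (the key value below is a placeholder for your own):

```python
# util/ncbo.py
API_KEY = 'your-ncbo-api-key'  # replace with the key issued to you by NCBO
```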
After cloning the repository and configuring the programs, you can download the pre-generated datasets and pre-trained model here.
| Filename | Description |
| --- | --- |
| biolarkgsc.csv | Pre-processed BiolarkGSC+ dataset with document-level annotations |
| biolarkgsc_locs.csv | Pre-processed BiolarkGSC+ dataset with mention-level annotations |
| copd.csv | Pre-processed COPD-HPO dataset with document-level annotations |
| copd_locs.csv | Pre-processed COPD-HPO dataset with mention-level annotations |
You can load a dataset into a pandas DataFrame using the following code snippet:

```python
import pandas as pd

# Datasets are tab-separated; read the 'id' column as strings rather than numbers
data = pd.read_csv('XXX.csv', sep='\t', dtype={'id': str}, encoding='utf-8')
```
You can benchmark the annotator `ncbo` on the `biolarkgsc` dataset using the following command:

```sh
python benchmark.py ncbo biolarkgsc -i ./data
```

This command will search for the dataset file `biolarkgsc.csv` in the path `./data` and write the results of the annotator `ncbo` to the file `biolarkgsc_ncbo_preds.csv`.
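To sanity-check the output, you can load the prediction file the same way as the datasets (this assumes the output is tab-separated like the inputs; adjust `sep` if yours differs):

```python
import pandas as pd

# Peek at the first few predictions produced by the benchmark run
preds = pd.read_csv('biolarkgsc_ncbo_preds.csv', sep='\t', dtype={'id': str}, encoding='utf-8')
print(preds.head())
```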
Please download the pre-trained model `hpo_bert_onto.pth` or copy your own to the working folder in advance. Also place the pre-processed HPO dictionary file `hpo_labels.csv` and your prediction file in the folder where you run the following command:

```sh
python rerank.py --model bert_onto -u biolarkgsc --onto hpo_labels.csv --resume hpo_bert_onto.pth
```
Once the prediction files are ready, please rename them appropriately. Then you can evaluate the results for comparison using the following command:

```sh
python eval.py biolarkgsc method1.csv method2.csv method3.csv
```
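For example, to compare the raw NCBO output with its re-ranked version (the re-ranked filename below is purely illustrative; use whatever name your re-ranking run produced):

```sh
mv biolarkgsc_ncbo_preds.csv ncbo.csv
mv biolarkgsc_ncbo_preds_reranked.csv ncbo_reranked.csv  # illustrative filename
python eval.py biolarkgsc ncbo.csv ncbo_reranked.csv
```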
For the best performance, you can fine-tune the re-ranking model on your own dataset, provided the dataset has sentence-/mention-level annotations. First convert the dataset into the appropriate format for training the re-ranking model (see the data-generation commands below), then fine-tune with the following command:

```sh
python rerank.py -m train --noeval --model bert_onto --pretrained true -u biolarkgsc -f csv --onto hpo_labels.csv --pooler none --pdrop 0.1 --do_norm --norm_type batch --initln --earlystop --lr 0.0002 --maxlen 384 -j 10 -z 8 -g 0
```
You can regenerate the datasets from the annotations of BiolarkGSC+ and COPD using the following commands:

```sh
python gendata.py -u biolarkgsc
python gendata.py -u copd
```