DTranNER

Biomedical Named Entity Recognizer

DTranNER is a deep-learning-based method for biomedical named entity recognition that achieves state-of-the-art performance on five biomedical benchmark corpora (BC2GM, BC4CHEMD, BC5CDR-disease, BC5CDR-chemical, and NCBI-Disease). DTranNER is equipped with a deep-learning-based label-label transition model that captures the context-varying relations between neighboring labels. Please refer to our paper, DTranNER: biomedical named entity recognition with deep learning-based label-label transition model, for more details.
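The core idea of a label-label transition model can be illustrated with Viterbi decoding: instead of one fixed K x K transition matrix (as in a vanilla CRF), each position gets its own context-dependent transition matrix, combined with per-token unary scores. The sketch below is a minimal illustration of this decoding scheme, not the paper's actual architecture; all names and shapes are our own assumptions.

```python
import numpy as np

def viterbi_decode(unary, trans):
    """Viterbi decoding with position-dependent transition matrices.

    unary: (T, K) emission scores for T tokens and K labels.
    trans: (T-1, K, K) context-dependent transition scores, where
           trans[t, i, j] scores moving from label i at t to label j at t+1.
           (In a plain CRF, trans[t] would be identical for every t.)
    Returns the highest-scoring label sequence as a list of label indices.
    """
    T, K = unary.shape
    score = unary[0].copy()               # best score ending in each label
    backptr = np.zeros((T - 1, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j] = best path ending in label i, then i -> j at step t
        candidate = score[:, None] + trans[t - 1] + unary[t][None, :]
        backptr[t - 1] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Backtrack from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 2, -1, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# With all-zero transitions, decoding reduces to per-token argmax.
path = viterbi_decode(np.array([[1., 0.], [0., 1.], [1., 0.]]),
                      np.zeros((2, 2, 2)))
```
In DTranNER the per-position transition scores are produced by a learned network from the surrounding context; here they are simply given as an input array.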


Updates

  • (29 August 2019) A new version of DTranNER is now available. It has been entirely rewritten in PyTorch and provides significant performance improvements over the scores reported in the submitted manuscript.

Initial Setup

To use DTranNER, set up a Python 3 environment with packages including PyTorch v1.1.0, NumPy, and gensim.

Usage

Download the specified word embedding file (wikipedia-pubmed-and-PMC-w2v.bin) from here and put it in the w2v directory under the project root:

mkdir $PROJECT_ROOT/w2v
mv wikipedia-pubmed-and-PMC-w2v.bin $PROJECT_ROOT/w2v/

Model Training

For model training, we recommend using a GPU.

python train.py \
    --DTranNER \
    --dataset_name ['BC5CDR-disease','BC5CDR-chem','BC2GM','BC4CHEMD', or 'NCBI-disease'] \
    --hidden_dim [e.g., 500] \
    --pp_hidden_dim [e.g., 500] \
    --bilinear_dim [e.g., 500] \
    --pp_bilinear_pooling \
    --gpu [e.g., 0]

You can adjust the arguments as needed.

Download Word Embedding

We initialize the word embedding matrix with the pre-trained word vectors from Pyysalo et al., 2013, which can be obtained from here. They were trained on PubMed abstracts, PubMed Central (PMC) full texts, and a Wikipedia dump. More recently, contextualized word embeddings have emerged; we incorporate ELMo (https://arxiv.org/abs/1802.05365) into our token embedding layer.
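Initializing an embedding matrix from pre-trained vectors typically follows the pattern sketched below: copy the pre-trained vector when the word is in the vocabulary, and fall back to a small random vector otherwise. The gensim loading call shown in the comment is the real API for the .bin file above; the toy vocabulary, dimension, and fallback range are illustrative assumptions, not values from the repository.

```python
import numpy as np

# In the real pipeline the vectors would be loaded with gensim, e.g.:
#   from gensim.models import KeyedVectors
#   w2v = KeyedVectors.load_word2vec_format(
#       "w2v/wikipedia-pubmed-and-PMC-w2v.bin", binary=True)
# Here a tiny dict stands in for the PubMed/PMC/Wikipedia vectors.

def build_embedding_matrix(vocab, w2v, dim, seed=0):
    """Initialize an embedding matrix row by row: the pre-trained vector
    if available, otherwise a small uniform random vector for OOV words."""
    rng = np.random.RandomState(seed)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for idx, word in enumerate(vocab):
        if word in w2v:
            matrix[idx] = w2v[word]
        else:
            matrix[idx] = rng.uniform(-0.25, 0.25, dim)
    return matrix

toy_w2v = {"protein": np.ones(4, dtype=np.float32)}
emb = build_embedding_matrix(["protein", "unkword"], toy_w2v, dim=4)
```
The resulting matrix can be handed to an embedding layer (e.g., torch.nn.Embedding.from_pretrained) for fine-tuning during training.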

Datasets

The pre-processed datasets come from https://github.com/cambridgeltl/MTL-Bioinformatics-2016; we use the biomedical corpora collected by Crichton et al. The datasets are publicly available and can be downloaded from here. In our implementation, they are accessed via $PROJECT_HOME/data/. For details on the NER datasets, please refer to A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition (Crichton et al., 2017).

Tagging Scheme

In this study, we use the IOBES tagging scheme: O denotes a non-entity token, B the first token of a multi-token entity, I an inside token of such an entity, E its last token, and S a single-token entity. We are currently running experiments with the IOB tagging scheme and will report the results soon.
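The relationship between the two schemes can be made concrete with a small conversion routine: IOB2 tags map to IOBES by turning a lone B into S and the final I of an entity into E. This helper is our own sketch, not code from the repository.

```python
def iob_to_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES.

    A B followed by I of the same type stays B; a lone B becomes S;
    the final I of a multi-token entity becomes E; O is unchanged.
    """
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype   # entity extends to the next token
        if prefix == "B":
            iobes.append(("B-" if continues else "S-") + etype)
        else:  # prefix == "I"
            iobes.append(("I-" if continues else "E-") + etype)
    return iobes

converted = iob_to_iobes(["B-Chem", "I-Chem", "I-Chem", "O", "B-Dis"])
# -> ["B-Chem", "I-Chem", "E-Chem", "O", "S-Dis"]
```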

Benchmarks

Here we compare our model with recent state-of-the-art models on the five biomedical corpora mentioned above, using F1 score as the evaluation metric. The results are shown in the table below.

| Model | BC2GM | BC4CHEMD | BC5CDR-Chemical | BC5CDR-Disease | NCBI-disease |
|---|---|---|---|---|---|
| Att-BiLSTM-CRF (2017) | - | 91.14 | 92.57 | - | - |
| D3NER (2018) | - | - | 93.14 | 84.68 | 84.41 |
| CollaboNet (2018) | 79.73 | 88.85 | 93.31 | 84.08 | 86.36 |
| Wang et al. (2018) | 80.74 | 89.37 | 93.03 | 84.95 | 86.14 |
| BioBERT v1.0 | 84.40 | 91.41 | 93.44 | 86.56 | 89.36 |
| BioBERT v1.1 | 84.72 | 92.36 | 93.47 | 87.15 | 89.71 |
| DTranNER | 84.56 | 91.99 | 94.16 | 87.22 | 88.62 |
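The F1 scores above are entity-level: a prediction counts as correct only if the entity's span and type both match exactly. A minimal sketch of that metric over IOBES tag sequences is shown below; the helper names are ours, not from the repository or the official evaluation scripts.

```python
def extract_spans(tags):
    """Extract (start, end, type) entity spans from an IOBES tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag.startswith("S-"):
            spans.add((i, i, tag[2:]))          # single-token entity
        elif tag.startswith("B-"):
            start = i                           # entity opens here
        elif tag.startswith("E-") and start is not None:
            spans.add((start, i, tag[2:]))      # entity closes here
            start = None
    return spans

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between gold and predicted IOBES tags."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

score = entity_f1(["B-Chem", "E-Chem", "O", "S-Dis"],
                  ["B-Chem", "E-Chem", "O", "O"])
# One of two gold entities found: precision 1.0, recall 0.5, F1 = 2/3
```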

Contact

Please open a GitHub issue or contact skhong831@kaist.ac.kr or skhong0831@gmail.com if you have any questions.
