Skip to content

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Notifications You must be signed in to change notification settings


Folders and files

Last commit message
Last commit date

Latest commit



17 Commits

Repository files navigation

BioBERT Pre-trained Weights

This repository provides pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. Please refer to our paper BioBERT: a pre-trained biomedical language representation model for biomedical text mining for more details.

Downloading pre-trained weights

Go to releases section of this repository or click links below to download pre-trained weights of BioBERT. We provide three combinations of pre-trained weights: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC). Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. Currently available versions of pre-trained weights are as follows:

Make sure to specify the versions of pre-trained weights used in your works. If you have difficulty choosing which one to use, we recommend using BioBERT-Base v1.1 (+ PubMed 1M) or BioBERT-Large v1.1 (+ PubMed 1M) depending on your GPU resources. Note that for BioBERT-Base, we are using WordPiece vocabulary (vocab.txt) provided by Google as any new words in biomedical corpus can be represented with subwords (for instance, Leukemia => Leu + ##ke + ##mia). More details are in the closed issue #1.

Pre-training corpus

We do not provide pre-processed version of each corpus. However, each pre-training corpus could be found in the following links:

  • PubMed Abstracts1:
  • PubMed Abstracts2:
  • PubMed Central Full Texts:

Estimated size of each corpus is 4.5 billion words for PubMed Abstracts1 + PubMed Abstracts2, and 13.5 billion words for PubMed Central Full Texts.

Fine-tuning BioBERT

To fine-tunine BioBERT on biomedical text mining tasks using provided pre-trained weights, refer to the DMIS GitHub repository for BioBERT.


    author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
    title = "{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}",
    journal = {Bioinformatics},
    year = {2019},
    month = {09},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btz682},
    url = {},

Contact information

For help or issues using pre-trained weights of BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee (, or Sungdong Kim ( for communication related to pre-trained weights of BioBERT.