MedNLI - A Natural Language Inference Dataset For The Clinical Domain

MedNLI - Natural Language Inference in Clinical Texts


This repository contains the code to fully reproduce the experiments in the paper. As such, it has quite a few dependencies and is not trivial to install. If you just want a simple, ready-to-use baseline with pre-trained models, please have a look at our baselines repository:


  1. Clone this repo: git clone ...
  2. Install NumPy: pip install numpy==1.13.3
  3. Install PyTorch v0.2.0: pip install (see for details)
  4. Install requirements: pip install -r requirements.txt
  5. Install MetaMap:
    • Make sure to set METAMAP_BINARY_PATH in the to your MetaMap binary installation
  6. Install PyMetaMap:
  7. Install UMLS Metathesaurus:
    • Make sure to set UMLS_INSTALLATION_DIR in the pointing to your UMLS installation

Downloading the datasets

  1. Download SNLI:
  2. Download MultiNLI: (we experimented with MultiNLI v0.9)
  3. Download MedNLI:

Put all of the data inside the ./data/ dir so it has the following structure:

$ ls data/
mednli_1.0  multinli_0.9  snli_1.0
$ ls data/snli_1.0/
README.txt  snli_1.0_dev.jsonl  snli_1.0_dev.txt  snli_1.0_test.jsonl  snli_1.0_test.txt  snli_1.0_train.jsonl  snli_1.0_train.txt
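
Before preprocessing, it can help to confirm that the layout matches the listing above. The following is a minimal sketch, not part of this repository: the function name missing_data_paths is hypothetical, and it only checks the three dataset directories and the SNLI .jsonl files shown in the listing.

```python
from pathlib import Path

# Directory and file names are taken from the listing above.
# The MedNLI and MultiNLI file names are not listed there, so they are not checked.
EXPECTED_DIRS = ["mednli_1.0", "multinli_0.9", "snli_1.0"]
EXPECTED_SNLI_FILES = [
    "snli_1.0_train.jsonl",
    "snli_1.0_dev.jsonl",
    "snli_1.0_test.jsonl",
]


def missing_data_paths(data_dir="./data"):
    """Return the expected paths that do not exist under data_dir."""
    root = Path(data_dir)
    expected = [root / name for name in EXPECTED_DIRS]
    expected += [root / "snli_1.0" / name for name in EXPECTED_SNLI_FILES]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = missing_data_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Data layout looks complete.")
```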

Downloading the word embeddings

Word Embedding         Link
glove                  glove.840B.300d.pickled
mimic                  mimic.fastText.no_clean.300d.pickled
bio_asq                bio_asq.no_clean.300d.pickled
wiki_en                wiki_en.fastText.300d.pickled
wiki_en_mimic          wiki_en_mimic.fastText.no_clean.300d.pickled
glove_bio_asq          glove_bio_asq.no_clean.300d.pickled
glove_bio_asq_mimic    glove_bio_asq_mimic.no_clean.300d.pickled

Put all embeddings inside the ./data/word_embeddings/ dir so it has the following structure:

$ ls data/word_embeddings/
glove.840B.300d.pickled		glove_bio_asq_mimic.no_clean.300d.pickled 	mimic.fastText.no_clean.300d.pickled

Running the code

Code tested on Python 3.4 and Python 3.6.3

  1. Configuration:
  2. Preprocess the data: python
    • This script will create files genre_*.pkl in the ./data/nli_processed/ directory
    • Preprocess the test data: python process_test
  3. Extract concepts: python
    • Make sure to run MetaMap servers first before executing this script
    • The script above works only for the MedNLI dataset. Rename the files genre_*.pkl to genre_concepts_*.pkl for SNLI and all MultiNLI domains.
    • Call main_data_test as the main function to process the test data
  4. Create word embeddings cache: python <path_to_glove/word2vec file> ./data/word_embeddings/<name>
    • See WORD_VECTORS_FILENAME in the for file namings
  5. Create UMLS graph cache: python
  6. Optional: to create input data for the official retrofitting script run python
  7. Train the model: python
    • You can change the parameters in the config function or in the command line: python with use_umls_attention=True use_token_level_attention=True (see the Sacred documentation for details)
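
The renaming described in step 3 (genre_*.pkl to genre_concepts_*.pkl for SNLI and the MultiNLI domains) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the exact file names produced by preprocessing are not documented here, and the exclude argument is just one way to leave the MedNLI files untouched. It defaults to a dry run so the planned renames can be inspected first.

```python
from pathlib import Path


def concepts_filename(name):
    """Map 'genre_X.pkl' to 'genre_concepts_X.pkl'; return None if not applicable."""
    if not name.startswith("genre_") or not name.endswith(".pkl"):
        return None
    if name.startswith("genre_concepts_"):
        return None  # already carries the concepts prefix
    return "genre_concepts_" + name[len("genre_"):]


def rename_to_concepts(directory="./data/nli_processed", exclude=(), dry_run=True):
    """Return the planned (old, new) names; rename on disk only if dry_run is False.

    Files whose name contains any substring in `exclude` are skipped, as are
    files whose target name already exists.
    """
    planned = []
    for path in sorted(Path(directory).glob("genre_*.pkl")):
        target = concepts_filename(path.name)
        if target is None or any(part in path.name for part in exclude):
            continue
        if path.with_name(target).exists():
            continue  # never overwrite an existing concepts file
        planned.append((path.name, target))
        if not dry_run:
            path.rename(path.with_name(target))
    return planned
```

For example, rename_to_concepts(exclude=("clinical",)) lists the planned renames while skipping any file with "clinical" in its name. Whether the MedNLI files actually contain that substring is an assumption to verify against your own ./data/nli_processed/ directory.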

Using a pre-trained model

  1. Download the model weights and the model-specific tokenizer and embeddings (see the table below).
  2. Put the model weights into the ./data/saved_models/ dir.
  3. Put the tokenizer and the embeddings into the ./data/ dir.
  4. Create an input file that contains premises and hypotheses, delimited by the \t character (see example).
  5. Run the script and provide the input data in STDIN: python < data/input.txt. The resulting probabilities of the contradiction, neutral, and entailment classes, in that order, will be printed to STDOUT. If you do not want to see the logging output and wish to save the results to a file, redirect STDERR to /dev/null and STDOUT to a file: python < data/test_input.txt 2>/dev/null > data/test_input_probabilities.txt
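
The input and output handling around steps 4 and 5 might look like the sketch below. The helper names are hypothetical. The input format (premise, tab, hypothesis) and the class order come from the steps above, but the exact layout of each output line is an assumption: the parser expects three numbers per line, separated by whitespace or commas, optionally wrapped in brackets.

```python
LABELS = ("contradiction", "neutral", "entailment")  # order stated in step 5


def write_input_file(pairs, path):
    """Write one 'premise<TAB>hypothesis' pair per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for premise, hypothesis in pairs:
            for text in (premise, hypothesis):
                if "\t" in text or "\n" in text:
                    raise ValueError("sentences must not contain tabs or newlines")
            handle.write("{}\t{}\n".format(premise, hypothesis))


def parse_probabilities(line):
    """Parse one output line into a {label: probability} dict.

    Assumption: the line holds exactly three numbers in LABELS order.
    """
    cleaned = line.strip().strip("[]").replace(",", " ")
    values = [float(value) for value in cleaned.split()]
    if len(values) != len(LABELS):
        raise ValueError("expected {} values, got {}".format(len(LABELS), len(values)))
    return dict(zip(LABELS, values))


def predicted_label(line):
    """Return the label with the highest probability on this line."""
    probabilities = parse_probabilities(line)
    return max(probabilities, key=probabilities.get)
```

If the script prints a different layout, only parse_probabilities needs to change.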

You can configure the model weights, tokenizer, and the embeddings filename using the command line arguments:

python with model_class=PyTorchInferSentModel model_weights_filename=PyTorchInferSentModel_50_glove_bio_asq_mimic_clinical__.slysamwq.h5 tokenizer_filename=tokenizer_clinical_.pickled embeddings_filename=embeddings_clinical_.pickled

Model description: InferSent model, trained on MedNLI only, using the glove_bio_asq_mimic word vectors
Model files and parameters: model_class: PyTorchInferSentModel; model weights

More models coming soon!

Configuration options

model_class = 'PyTorchInferSentModel' # class name of the model to run. See the `create_model` function for the available models
max_len = 50 # max sentence length
lowercase = False # lowercase input data or not
clean = False # remove punctuation etc or not
stem = False # do stemming or not
word_vectors_type = 'glove'  # word vectors - see the `WORD_VECTORS_FILENAME` in `` for details
word_vectors_replace_cui = ''  # filename with retrofitted embeddings for CUIs, e.g. cui.glove.cbow_most_common.CHD-PAR.SNOMEDCT_US.retrofitted.pkl
downsample_source = 0 # down sample the source domain data to the size of the MedNLI

# transfer learning settings
genre_source = 'clinical' # source domain for transfer learning. target='' and tune='' - no transfer
genre_target = '' # target domain - always MedNLI in the experiments in the paper
genre_tune = '' # fine-tuning domain
lambda_multi_task = -1 # whether to use dynamically sampled batches from different domains or not.
uniform_batches = True # a batch will contain samples from just one domain

rnn_size = 300 # size of LSTM
rnn_cell = 'LSTM' # LSTM is used in the experiments in the paper
regularization = 0.000001 # regularization strength
dropout = 0.5 # dropout
hidden_size = 300 # size of the hidden fully-connected layers
trainable_embeddings = False # train embeddings or not

# knowledge-based attention
# set both to true to reproduce the token-level UMLS attention used in the paper
use_umls_attention = False # whether to use the knowledge-based attention or not
use_token_level_attention = False # use CUIs or separate tokens for attention

batch_size = 512 # batch size
epochs = 40 # number of epochs for training
learning_rate = 0.001 # learning rate for the Adam optimizer
training_loop_mode = 'best_loss'  # best_loss or best_acc - the model is saved based on the best validation loss or accuracy, respectively
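
To illustrate what uniform_batches = True means (each batch contains samples from just one domain), here is a small sketch. It is not the repository's implementation: the function signature and the data layout (a dict from domain name to a list of examples) are assumptions made for illustration only.

```python
import random


def uniform_batches(examples_by_domain, batch_size, seed=0):
    """Return a shuffled list of (domain, batch) pairs.

    Every batch is drawn from a single domain, so no batch mixes domains.
    """
    rng = random.Random(seed)
    batches = []
    for domain, examples in examples_by_domain.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)  # shuffle within the domain
        for start in range(0, len(shuffled), batch_size):
            batches.append((domain, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave the domains across the epoch
    return batches
```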

Experiments in the paper


To run the BOW, InferSent, and ESIM models with default settings, use the following commands, respectively:

python with model_class=PyTorchSimpleModel
python with model_class=PyTorchInferSentModel
python with model_class=PyTorchESIMModel
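
When running several configurations in a row, the Sacred-style "with key=value" arguments can be generated instead of typed by hand. The sketch below only builds and prints the commands. TRAIN_SCRIPT.py is a hypothetical placeholder for the training script, and sacred_command is a helper name introduced here, not something provided by Sacred or by this repository.

```python
import shlex

# Model classes named in the commands above.
MODEL_CLASSES = ["PyTorchSimpleModel", "PyTorchInferSentModel", "PyTorchESIMModel"]


def sacred_command(script, **overrides):
    """Build ['python', script, 'with', 'key=value', ...] as an argument list."""
    command = ["python", script]
    if overrides:
        command.append("with")
        command += ["{}={}".format(key, value) for key, value in sorted(overrides.items())]
    return command


for model_class in MODEL_CLASSES:
    # TRAIN_SCRIPT.py is a placeholder; substitute the actual training script.
    args = sacred_command("TRAIN_SCRIPT.py", model_class=model_class)
    print(" ".join(shlex.quote(arg) for arg in args))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting issues.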

Transfer learning

To pre-train the model on the Slate domain, fine-tune it on MedNLI, and test on the dev set of MedNLI (Sequential transfer in the paper), run the following command:

python with genre_source=slate genre_tune=clinical genre_target=clinical

To run multi-target transfer learning, specify the genres and use the corresponding versions of the models: PyTorchMultiTargetSimpleModel, PyTorchMultiTargetInferSentModel, and PyTorchMultiTargetESIMModel.

Word embeddings

All word embeddings have to be pickled first - see the script. To run the model with specific embeddings, use the word_vectors_type parameter:

python with word_vectors_type=wiki_en_mimic
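
For orientation, pickling text-format vectors (GloVe .txt or fastText .vec) could look like the sketch below. This does not reproduce the repository's pickling script: the on-disk structure it expects is not documented here, so the plain {token: list of floats} dict is an assumption, as are the function names. Use the provided script to produce files the training code can actually load.

```python
import pickle


def read_text_vectors(lines, dim=300):
    """Parse 'token v1 ... v_dim' lines into a {token: [floats]} dict.

    Lines with too few fields (such as the 'count dim' header of fastText
    .vec files) are skipped. The last `dim` fields are taken as the vector,
    so tokens that themselves contain spaces are kept intact.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue
        token = " ".join(parts[:-dim])
        vectors[token] = [float(value) for value in parts[-dim:]]
    return vectors


def pickle_vectors(text_path, pickle_path, dim=300):
    """Read a text vector file and pickle the resulting dict; return its size."""
    with open(text_path, encoding="utf-8") as source:
        vectors = read_text_vectors(source, dim)
    with open(pickle_path, "wb") as target:
        pickle.dump(vectors, target, protocol=pickle.HIGHEST_PROTOCOL)
    return len(vectors)
```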


Retrofitting

  • First, create the input data for retrofitting with the script.
  • Second, run the official script from GitHub.
  • Next, pickle the resulting word vectors with the script.
  • Finally, set the word_vectors_replace_cui parameter to the pickled retrofitted vectors:
    • python with word_vectors_replace_cui=cui.glove.cbow_most_common.CHD-PAR.SNOMEDCT_US.retrofitted.pkl

Knowledge-directed attention

Set both use_umls_attention and use_token_level_attention to True to reproduce the token-level UMLS attention experiments:

python with use_umls_attention=True use_token_level_attention=True


The paper was accepted to EMNLP 2018! Meanwhile, here is an extended arXiv version:

Romanov, A., & Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. arXiv preprint arXiv:1808.06752.

	title = {Lessons from Natural Language Inference in the Clinical Domain},
	url = {},
	abstract = {State of the art models using deep neural networks have become very good in learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, and knowledge intensive domains, where training data is limited. To address this gap, we introduce {MedNLI} - a dataset annotated by doctors, performing a natural language inference task ({NLI}), grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain, (e.g. {SNLI}) and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.},
	journaltitle = {{arXiv}:1808.06752 [cs]},
	author = {Romanov, Alexey and Shivade, Chaitanya},
	urldate = {2018-08-27},
	date = {2018-08-21},
	eprinttype = {arxiv},
	eprint = {1808.06752},