A Python library to de-identify medical records with state-of-the-art NLP methods. Pre-trained models for the Dutch language are available.

This repository shares the resources developed in the following paper:

J. Trienes, D. Trieschnigg, C. Seifert, and D. Hiemstra. Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records. In: Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop (HSDM), 2020.

You can get the authors' version of the paper from this link: paper.

Quick Start


Create a new virtual environment with an environment manager of your choice. Then, install deidentify:

pip install deidentify

We use the spaCy tokenizer. For good compatibility with the pre-trained models, we recommend using the same spaCy tokenization models that were used at de-identification model training time:

pip install

Example Usage

Below, we will create an example document and run a pre-trained de-identification model over it. First, let's download a pre-trained model and save it in the model cache at ~/.deidentify. See below for a list of available models.

python -m deidentify.util.download_model model_bilstmcrf_ons_fast-v0.1.0

Then, we can create a document, load the tagger with the pre-trained model, and finally annotate the document.

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

# Create some text
text = (
    "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: "
    ", t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 "
    "oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
)

# Wrap text in document
documents = [
    Document(name='doc_01', text=text)
]

# Select downloaded model
model = 'model_bilstmcrf_ons_fast-v0.1.0'

# Instantiate tokenizer
tokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=("tagger", "ner"))

# Load tagger with a downloaded model file and tokenizer
tagger = FlairTagger(model=model, tokenizer=tokenizer, verbose=False)

# Annotate your documents
annotated_docs = tagger.annotate(documents)

This completes the annotation stage. Let's inspect the entities that the tagger found:

from pprint import pprint

first_doc = annotated_docs[0]
pprint(first_doc.annotations)

This should print the entities of the first document.

[Annotation(text='Jan Jansen', start=39, end=49, tag='Name', doc_id='', ann_id='T0'),
 Annotation(text='J. Jansen', start=62, end=71, tag='Name', doc_id='', ann_id='T1'),
 Annotation(text='', start=76, end=93, tag='Email', doc_id='', ann_id='T2'),
 Annotation(text='06-12345678', start=98, end=109, tag='Phone_fax', doc_id='', ann_id='T3'),
 Annotation(text='64 jaar', start=114, end=121, tag='Age', doc_id='', ann_id='T4'),
 Annotation(text='Utrecht', start=143, end=150, tag='Address', doc_id='', ann_id='T5'),
 Annotation(text='10 oktober', start=164, end=174, tag='Date', doc_id='', ann_id='T6'),
 Annotation(text='Peter de Visser', start=185, end=200, tag='Name', doc_id='', ann_id='T7'),
 Annotation(text='UMCU', start=234, end=238, tag='Hospital', doc_id='', ann_id='T8')]
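The start and end values in this output appear to be character offsets into the document text, with the end offset exclusive. A minimal sketch that checks this reading against the first sentence of the example text (the helper name `span_matches` is hypothetical and not part of deidentify):

```python
# First sentence of the example document, copied verbatim.
text = "Dit is stukje tekst met daarin de naam Jan Jansen."


def span_matches(text, start, end, expected):
    """Return True if text[start:end] equals the annotated string."""
    return text[start:end] == expected


# Offsets taken from the first annotation printed above.
print(text[39:49])
print(span_matches(text, 39, 49, "Jan Jansen"))
```

A check like this can be useful when converting annotations from another format, where off-by-one errors in offsets are easy to introduce.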

Afterwards, you can mask the discovered entities in the document using a utility function:

from deidentify.util import mask_annotations

masked_doc = mask_annotations(first_doc)
print(masked_doc.text)

Which should print:

Dit is stukje tekst met daarin de naam [NAME]. De patient [NAME] (e: [EMAIL], t: [PHONE_FAX]) is [AGE] oud en woonachtig in [ADDRESS]. Hij werd op [DATE] door arts [NAME] ontslagen van de kliniek van het [HOSPITAL].
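To illustrate the idea behind masking, here is a minimal sketch that replaces character spans with an upper-cased tag placeholder. It is not the implementation of `mask_annotations`; the function `mask_spans` and the `(start, end, tag)` tuple format are assumptions made for this example.

```python
def mask_spans(text, spans):
    """Replace each (start, end, tag) span in text with a [TAG] placeholder.

    Spans are processed from right to left so that the offsets of spans
    that have not been replaced yet stay valid.
    """
    for start, end, tag in sorted(spans, reverse=True):
        text = text[:start] + f"[{tag.upper()}]" + text[end:]
    return text


# Prefix of the example document with the offsets of its first two names.
text = "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen"
spans = [(39, 49, "Name"), (62, 71, "Name")]

print(mask_spans(text, spans))
```

The sketch assumes that spans do not overlap; overlapping spans would need to be merged first.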

Available Taggers

There are currently three taggers that you can use:

  • deidentify.taggers.DeduceTagger: A wrapper around the DEDUCE tagger by Menger et al. (2018, code, paper)
  • deidentify.taggers.CRFTagger: A CRF tagger using the feature set by Liu et al. (2015, paper)
  • deidentify.taggers.FlairTagger: A wrapper around the Flair SequenceTagger allowing the use of neural architectures such as BiLSTM-CRF. The pre-trained models below use contextualized string embeddings by Akbik et al. (2018, paper)

All taggers implement the deidentify.taggers.TextTagger interface which you can implement to provide your own taggers.
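As a rough illustration of what a custom tagger could look like, the sketch below defines a regex-based tagger for phone numbers. To keep it runnable without deidentify installed, `Annotation`, `Document` and `TextTagger` are local stand-ins: the `Annotation` fields mirror the output shown earlier, while the `annotations` field on `Document` and the single `annotate` method on the interface are assumptions. Check the actual `deidentify.taggers.TextTagger` class for the methods it requires. `PhoneRegexTagger` is a hypothetical name.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


# Local stand-ins so the sketch runs on its own (not the library classes).
@dataclass
class Annotation:
    text: str
    start: int
    end: int
    tag: str
    doc_id: str = ''
    ann_id: str = ''


@dataclass
class Document:
    name: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)


class TextTagger(ABC):
    """Assumed shape of the interface: annotate a list of documents."""

    @abstractmethod
    def annotate(self, documents: List[Document]) -> List[Document]:
        ...


class PhoneRegexTagger(TextTagger):
    """Hypothetical tagger that flags strings such as 06-12345678."""

    PATTERN = re.compile(r'\b\d{2}-\d{8}\b')

    def annotate(self, documents):
        annotated = []
        for doc in documents:
            annotations = [
                Annotation(text=m.group(), start=m.start(), end=m.end(),
                           tag='Phone_fax', ann_id=f'T{i}')
                for i, m in enumerate(self.PATTERN.finditer(doc.text))
            ]
            annotated.append(Document(doc.name, doc.text, annotations))
        return annotated


docs = PhoneRegexTagger().annotate([Document('doc_01', 't: 06-12345678')])
print(docs[0].annotations)
```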

Pre-trained Models

We provide a number of pre-trained models for the Dutch language. The models were developed on the Nedap/University of Twente (NUT) dataset. The dataset consists of 1260 documents from three domains of Dutch healthcare: elderly care, mental care and disabled care (note: in the codebase we sometimes also refer to this dataset as ons). More information on the design of the dataset can be found in our paper.

| Name | Tagger | Language | Dataset | F1* | Precision* | Recall* | Tags |
|---|---|---|---|---|---|---|---|
| DEDUCE (Menger et al., 2018)** | DeduceTagger | Dutch | NUT | 0.7564 | 0.9092 | 0.6476 | 8 PHI Tags |
| model_crf_ons_tuned-v0.1.0 | CRFTagger | Dutch | NUT | 0.9048 | 0.9632 | 0.8530 | 15 PHI Tags |
| model_bilstmcrf_ons_fast-v0.1.0 | FlairTagger | Dutch | NUT | 0.9461 | 0.9591 | 0.9335 | 15 PHI Tags |
| model_bilstmcrf_ons_large-v0.1.0 | FlairTagger | Dutch | NUT | 0.9505 | 0.9683 | 0.9333 | 15 PHI Tags |

*All scores are micro-averaged, blind token-level precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.

**DEDUCE was developed on a dataset of psychiatric nursing notes and treatment plans. The numbers reported here were obtained by applying DEDUCE to our NUT dataset. For more information on the development of DEDUCE, see the paper by Menger et al. (2018).
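The scores in the table can be read with the following sketch of micro-averaged token-level precision, recall and F1. It assumes that "blind" means the PHI type is ignored, so a token counts as correct whenever both gold and prediction mark it as PHI. This is an illustration only, not the evaluation code used for the reported numbers; the function name and the `'O'` label for non-PHI tokens are assumptions.

```python
def blind_token_scores(gold_docs, pred_docs, outside='O'):
    """Micro-averaged precision/recall/F1 over all tokens of all documents.

    Each document is a list of per-token tags. Any tag other than
    `outside` counts as PHI; the specific PHI type is ignored.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        for g, p in zip(gold, pred):
            gold_phi, pred_phi = g != outside, p != outside
            if gold_phi and pred_phi:
                tp += 1
            elif pred_phi:
                fp += 1
            elif gold_phi:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: two documents with one missed token and one spurious token.
gold = [['Name', 'Name', 'O', 'Date'], ['O', 'Address']]
pred = [['Name', 'O', 'O', 'Date'], ['Age', 'Address']]

print(blind_token_scores(gold, pred))  # (0.75, 0.75, 0.75)
```

Micro-averaging pools the counts over all documents before computing the ratios, so long documents weigh more than short ones. The F1 column is the harmonic mean of the other two: for DEDUCE, 2 · 0.9092 · 0.6476 / (0.9092 + 0.6476) ≈ 0.7564.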

Running Experiments and Training Models

If you have your own dataset of annotated documents and you want to train your own models on it, you can take a look at the following guides:

If you want more information on the experiments in our paper, have a look here:

Computational Environment

When you want to run your own experiments, we assume that you clone this code base locally and execute all scripts under deidentify/ within the following conda environment:

# Install package dependencies and add local files to the Python path of that environment.
conda env create -f environment.yml
conda activate deidentify && export PYTHONPATH="${PYTHONPATH}:$(pwd)"


Please cite the following paper when using deidentify:

  title={Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records},
  author={Trienes, Jan and Trieschnigg, Dolf and Seifert, Christin and Hiemstra, Djoerd},
  booktitle = {Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop},
  series = {{HSDM} 2020},
  year = {2020}


If you have any questions, please contact Jan Trienes at
