GATE plugin to expand, annotate and coreference biomedical abbreviations and acronyms
Latest commit 44c02f7 Aug 4, 2014 Phil Gooch Initial support for lemmatisation
Attempts to lemmatise both terms and abbreviations, so that e.g.
malignant tumor cells is lemmatised against MTCs and MTC (with lemma
‘malignant tumor cell’)

Copyright (c) 2012, Phil Gooch. This software is licensed under the GNU Lesser General Public License, Version 3, 29 June 2007.

See LICENSE.txt file for license details.

BADREX: Biomedical Abbreviation Expander

BADREX is a GATE plugin that identifies term-abbreviation pairs using dynamic regular expressions that generalise and extend the Schwartz-Hearst algorithm. In addition it uses a subset of the inner-outer selection rules described in Ao & Takagi's ALICE algorithm. Rather than simply extracting terms and their abbreviations, it annotates them in situ and adds the corresponding long form and short form text as features on each.
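For orientation, the baseline Schwartz-Hearst long-form search that BADREX generalises can be sketched as below. This is a minimal illustration of the published baseline only, not BADREX's own code: the class name SchwartzHearstSketch and the example strings are hypothetical, and BADREX's dynamic regular expressions and ALICE-style selection rules are not reproduced here.

```java
/**
 * Sketch of the baseline Schwartz-Hearst long-form search.
 * Assumption: illustrative only; BADREX itself uses dynamic regular expressions.
 */
final class SchwartzHearstSketch {

    /**
     * Walks the short form and the candidate text from right to left.
     * Each alphanumeric short-form character must appear, in order, in the
     * candidate; the first short-form character must also start a word.
     * Returns the matched long form, or null if no match exists.
     */
    static String findBestLongForm(String shortForm, String candidate) {
        int l = candidate.length() - 1;
        for (int s = shortForm.length() - 1; s >= 0; s--) {
            char c = Character.toLowerCase(shortForm.charAt(s));
            if (!Character.isLetterOrDigit(c)) {
                continue; // punctuation in the short form is ignored
            }
            // Move left until the character matches; for the first short-form
            // character, also require that it sits at the start of a word.
            while ((l >= 0 && Character.toLowerCase(candidate.charAt(l)) != c)
                    || (s == 0 && l > 0 && Character.isLetterOrDigit(candidate.charAt(l - 1)))) {
                l--;
            }
            if (l < 0) {
                return null; // ran out of candidate text
            }
            l--;
        }
        // Extend the match back to the beginning of the word it starts in.
        int start = candidate.lastIndexOf(' ', l) + 1;
        return candidate.substring(start);
    }

    public static void main(String[] args) {
        System.out.println(findBestLongForm("MTC", "of malignant tumor cells"));
    }
}
```

With the inputs in main, the search skips the leading "of" and returns "malignant tumor cells"; a short form whose characters cannot be found in order, such as "XYZ", yields null.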

It can optionally expand all abbreviations in the text that match the short form of the most recently matched long-form/short-form pair. It can also annotate and classify common medical abbreviations extracted from Wikipedia.
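The idea of expanding later occurrences of a short form can be sketched as follows. This is a hedged illustration rather than the plugin's implementation: the class ExpandAllSketch, the method findLaterShortForms and the sample sentence are hypothetical, and plain character offsets stand in for GATE annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: once a long-form/short-form pair is known, locate later stand-alone
 * occurrences of the short form so they can be tagged with the long form.
 * Assumption: offsets stand in for the annotations a GATE plugin would create.
 */
final class ExpandAllSketch {

    /** Returns {start, end} offsets of whole-word short-form matches at or after fromIndex. */
    static List<int[]> findLaterShortForms(String text, String shortForm, int fromIndex) {
        // Word boundaries stop "MTC" from matching inside a longer token such as "MTCX".
        Pattern p = Pattern.compile("\\b" + Pattern.quote(shortForm) + "\\b");
        Matcher m = p.matcher(text);
        List<int[]> spans = new ArrayList<>();
        boolean found = m.find(fromIndex);
        while (found) {
            spans.add(new int[] {m.start(), m.end()});
            found = m.find();
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Malignant tumor cells (MTC) were isolated. MTC counts rose, but MTCX did not.";
        int afterDefinition = text.indexOf(')') + 1;
        for (int[] span : findLaterShortForms(text, "MTC", afterDefinition)) {
            System.out.println(span[0] + "-" + span[1] + " -> malignant tumor cells");
        }
    }
}
```

Only the stand-alone "MTC" after the defining parentheses is reported; tracking which pair is the most recent one when several share a short form is left out of this sketch.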

Against the Medstract corpus it achieves precision and recall of 98% and 97% respectively.

You can download the corrected BioText gold standard markables and the corrected Medstract gold standard markables. BioText corpus reproduced with kind permission of Prof. Marti Hearst.

A white paper describing BADREX and its evaluation is available.

How to use BADREX

BADREX is compatible with GATE version 6.1 and higher. The plugin can be loaded via the GATE Java API. Alternatively, in GATE Developer, go to File -> Manage CREOLE Plugins, click the 'Add new CREOLE repository' button and select the 'BiomedicalAbbreviationExpander' directory.

Be sure to add a RegEx Sentence Splitter to your pipeline before running this plugin. (The ANNIE Sentence Splitter can also be used, although this also requires a Tokenizer.)

To favour precision over recall, set maxInner and maxOuter to low values, e.g. 5, and set the threshold to 1.0 or 0.9. To favour recall over precision, set maxInner and maxOuter to high values, e.g. 10, and set the threshold to 0.75 or below.
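To make the effect of the threshold concrete, the sketch below computes one plausible reading of it: the fraction of alphanumeric short-form characters that can be found, in order, in the long form. How BADREX actually scores a match is not documented here, so the scoring rule, the class ThresholdSketch and its method names are assumptions for illustration only.

```java
import java.util.Locale;

/**
 * Sketch of a threshold check.
 * Assumption: the score is the fraction of short-form letters and digits that
 * occur, in order and ignoring case, in the long form. BADREX may score differently.
 */
final class ThresholdSketch {

    static double matchFraction(String shortForm, String longForm) {
        String lf = longForm.toLowerCase(Locale.ROOT);
        int pos = 0;      // next position in the long form to search from
        int total = 0;    // alphanumeric characters in the short form
        int matched = 0;  // how many of them were found in order
        for (char raw : shortForm.toCharArray()) {
            if (!Character.isLetterOrDigit(raw)) {
                continue; // ignore hyphens, dots and similar
            }
            total++;
            int hit = lf.indexOf(Character.toLowerCase(raw), pos);
            if (hit >= 0) {
                matched++;
                pos = hit + 1;
            }
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    static boolean accept(String shortForm, String longForm, double threshold) {
        return matchFraction(shortForm, longForm) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(matchFraction("MTC", "malignant tumor cell"));
        System.out.println(accept("AT1R", "angiotensin receptor", 0.75));
        System.out.println(accept("AT1R", "angiotensin receptor", 0.9));
    }
}
```

Under this reading, "AT1R" against "angiotensin receptor" matches three of its four characters, so the pair is accepted at a threshold of 0.75 but rejected at 0.9. This illustrates why a lower threshold trades precision for recall.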

Init-time parameters


  • configFileURL: Location of configuration file that lists the stop-words and lookup files
  • gazetteerListsURL: Location of gazetteer definition file for lists of common medical abbreviations

Run-time parameters

  • inputASName: Input annotation set name
  • outputASName: Output annotation set name
  • sentenceType: Annotation type for Sentence annotations. Defaults to Sentence.
  • longType: Annotation type to mark the term's long form
  • longTypeFeature: Feature name to contain the expanded abbreviation on the short form annotation
  • shortType: Annotation type to mark the term's short form
  • shortTypeFeature: Feature name to contain the abbreviation on the long form annotation
  • expandAllShortFormInstances: Once a term-abbreviation pair has been identified, should all instances of that abbreviation be annotated and expanded? Defaults to false.
  • maxInner: Maximum length of inner string (text inside parentheses)
  • maxOuter: Maximum length of outer string (text before parentheses)
  • threshold: Fraction of short form characters that must match the long form to count as a match
  • swapShortest: Swap annotation types if the outer phrase is shorter than the inner phrase? Defaults to true (some datasets always annotate the outer phrase the same way, even if the inner phrase is the abbreviation)
  • useLookups: Set to true to run a gazetteer lookup of common medical abbreviations
  • useBidirectionMatch: In the event of the first character of the inner not matching the first character of the outer, attempts a match against the first character of the last word of the outer against the last character in the outer. Defaults to false. Can boost recall but reduce precision if set to true.
  • underlyingAnnots: Use these annotations to tag the term and abbreviation if they contain or are contained in the matched long-form of the term. Clear this list to use the default longType and shortType annotations.