NegBio User Guide
Run the pipeline step-by-step
The step-by-step pipeline generates all intermediate documents, so you can easily rerun a single step if it fails. The steps are:
text2bioc combines text into a BioC XML file.
normalize removes noisy text such as
section_split splits the report into sections based on titles at
ssplit splits text into sentences.
- Named entity recognition
dner_mm detects UMLS concepts using MetaMap.
dner_chexpert detects concepts using the CheXpert vocabularies at
parse parses each sentence using the Bllip parser.
ptb2ud converts the parse tree to universal dependencies using the Stanford converter.
- Negation detection
neg detects negative and uncertain findings.
neg_chexpert detects positive, negative and uncertain findings (recommended).
cleanup removes intermediate information.
Every step after text2bioc processes the input files one by one and writes its results to the output directory. The normalize and section_split steps can be skipped. For named entity recognition, you can choose either dner_mm or dner_chexpert.
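As a rough sketch of how the steps chain together, the script below feeds each step's output directory into the next step. The WORK_DIR layout, the subdirectory names, the DRY_RUN switch, and the run_step and run_pipeline helpers are assumptions made for illustration; only the negbio_pipeline subcommands and the --output option come from this guide. It assumes the BioC files produced by text2bioc are already in $WORK_DIR/bioc, and it picks the CheXpert variants for named entity recognition and negation detection.

```shell
#!/bin/sh
# Sketch only: chain the pipeline steps, one directory per intermediate result.
# WORK_DIR, the subdirectory names, and DRY_RUN are hypothetical conventions.

WORK_DIR=${WORK_DIR:-/path/to/work}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = actually run them

# run_step STEP INPUT_SUBDIR OUTPUT_SUBDIR
run_step() {
    step=$1
    in_dir=$WORK_DIR/$2
    out_dir=$WORK_DIR/$3
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "negbio_pipeline $step --output=$out_dir $in_dir/*.xml"
    else
        mkdir -p "$out_dir" &&
            negbio_pipeline "$step" --output="$out_dir" "$in_dir"/*.xml
    fi
}

# Each step reads the previous step's output; the chain stops at the first failure.
run_pipeline() {
    run_step normalize     bioc       normalized &&
    run_step section_split normalized sections   &&
    run_step ssplit        sections   sentences  &&
    run_step dner_chexpert sentences  ner        &&
    run_step parse         ner        parsed     &&
    run_step ptb2ud        parsed     ud         &&
    run_step neg_chexpert  ud         neg        &&
    run_step cleanup       neg        final
}

run_pipeline
```

Because every intermediate directory is kept, rerunning a single step only requires calling run_step again with that step's input and output subdirectories.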
1. Convert text files to BioC format
You can skip this step if the reports are already in the BioC format. If you have lots of reports, it is recommended to put them into several BioC files, for example, 100 reports per BioC file.
$ export BIOC_DIR=/path/to/bioc
$ export TEXT_DIR=/path/to/text
$ negbio_pipeline text2bioc --output=$BIOC_DIR/test.xml $TEXT_DIR/*.txt
Another commonly used command is:
$ find $TEXT_DIR -type f | negbio_pipeline text2bioc --output=$BIOC_DIR
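The advice to keep roughly 100 reports per BioC file can be scripted. The following is a minimal sketch, not part of NegBio: the batch_reports helper, the batch_* file names, and the DRY_RUN switch are hypothetical, and the script assumes that report paths end in .txt and contain no whitespace. It uses only the text2bioc form shown above, with one output file and a list of text files as arguments.

```shell
#!/bin/sh
# Sketch only: group text reports into BioC files of at most N reports each.
# batch_reports and the batch_* names are illustrative assumptions.

DRY_RUN=${DRY_RUN:-1}    # 1 = print the planned output files only

# batch_reports TEXT_DIR BIOC_DIR [BATCH_SIZE]
batch_reports() {
    text_dir=$1
    bioc_dir=$2
    batch_size=${3:-100}
    list_dir=$(mktemp -d) || return 1

    # Write one list file per batch: batch_aa, batch_ab, ...
    find "$text_dir" -type f -name '*.txt' | sort |
        split -l "$batch_size" - "$list_dir/batch_"

    for list in "$list_dir"/batch_*; do
        [ -e "$list" ] || continue    # nothing matched: no reports found
        out=$bioc_dir/$(basename "$list").xml
        if [ "$DRY_RUN" = 1 ]; then
            printf '%s\n' "$out"
        else
            # Unquoted on purpose: each listed path becomes one argument.
            negbio_pipeline text2bioc --output="$out" $(cat "$list")
        fi
    done
    rm -rf "$list_dir"
}
```

Calling batch_reports "$TEXT_DIR" "$BIOC_DIR" with DRY_RUN=1 lists the BioC files that would be written, which is a cheap way to check the batch count before converting anything.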
2. Normalize reports
This step removes noisy text such as [**Patterns**] in the MIMIC-III reports.
$ negbio_pipeline normalize --output=$OUTPUT_DIR $INPUT_DIR/*.xml
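To get a feel for the kind of text this step targets, the bracketed [**...**] placeholders can be approximated with a regular expression. The strip_deid helper below is a hypothetical illustration using sed; it is not NegBio's implementation, and the normalize command may treat these spans differently.

```shell
#!/bin/sh
# Sketch only: delete [**...**] placeholders from text read on stdin.
# strip_deid is an illustrative helper, NOT what negbio_pipeline normalize runs.
strip_deid() {
    # \[\*\*   literal opening "[**"
    # [^]]*    any run of characters other than "]"
    # \*\*\]   literal closing "**]"
    sed -E 's/\[\*\*[^]]*\*\*\]//g'
}

printf '%s\n' 'Admitted on [**2101-10-20**] to [**Hospital1 18**].' | strip_deid
```

Note that this simply deletes each placeholder, so the surrounding whitespace is left untouched.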
3. Split each report into sections
This step splits the report into sections.
The default section titles are at
You can specify customized section titles using the option
$ negbio_pipeline section_split --output=$OUTPUT_DIR $INPUT_DIR/*.xml
4. Split each report into sentences
This step splits the report into sentences using the NLTK splitter (nltk.tokenize.sent_tokenize).
$ negbio_pipeline ssplit --output=$OUTPUT_DIR $INPUT_DIR/*.xml
5. Named entity recognition
This step recognizes named entities (e.g., findings, diseases, devices) from the reports. The first version of NegBio uses MetaMap to detect UMLS concepts.
MetaMap can be downloaded from https://metamap.nlm.nih.gov/MainDownload.shtml.
Installation instructions can be found at https://metamap.nlm.nih.gov/Installation.shtml.
Before using MetaMap, please make sure that both
wsdserverctl are started.
MetaMap attempts to extract all UMLS concepts.
Many of them are irrelevant to radiology.
Therefore, it is better to specify the UMLS concepts of interest via
$ export METAMAP_BIN=META_MAP_HOME/bin/metamap16
$ negbio_pipeline dner_mm --metamap=$METAMAP_BIN --output=$OUTPUT_DIR $INPUT_DIR/*.xml
NegBio also integrates the CheXpert vocabularies to recognize the presence of 14 observations.
All vocabularies can be found at
Each file in the folder represents one type of named entity and lists its various text expressions.
So far, NegBio does not support adding new types to the folder, but you can add more text expressions to an existing type.
$ negbio_pipeline dner_chexpert --output=$OUTPUT_DIR $INPUT_DIR/*.xml
In general, MetaMap is more comprehensive, while CheXpert is more accurate on its 14 types of findings. MetaMap is also slower and more likely to fail than CheXpert.
6. Parse the sentence
This step parses each sentence using the Bllip parser.
$ negbio_pipeline parse --output=$OUTPUT_DIR $INPUT_DIR/*.xml
7. Convert the parse tree to UD
This step converts the parse tree to universal dependencies using the Stanford converter.
$ negbio_pipeline ptb2ud --output=$OUTPUT_DIR $INPUT_DIR/*.xml
8. Detect negative and uncertain findings
This step detects negative and uncertain findings using patterns.
By default, the program uses the negation and uncertainty patterns in the
However, you are free to create your own patterns via
The pattern is a semgrex-type pattern for matching nodes in the dependency graph.
Currently, we only support
A detailed grammar specification (using PLY, Python Lex-Yacc) can be found in
$ negbio_pipeline neg --output=$OUTPUT_DIR $INPUT_DIR/*.xml
NegBio also integrates the CheXpert algorithms. Unlike the original NegBio, CheXpert uses a 3-phase pipeline consisting of pre-negation uncertainty, negation, and post-negation uncertainty (Irvin et al., 2019). Each phase consists of rules that are matched against the mention; if a match is found, the mention is classified accordingly (as uncertain in the first or third phase, and as negative in the second phase). If a mention is not matched in any of the phases, it is classified as positive.
Generally, CheXpert contains more rules and is more accurate than the original NegBio.
$ negbio_pipeline neg_chexpert --output=$OUTPUT_DIR $INPUT_DIR/*.xml
Similarly, you are free to create patterns via
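The control flow of the three CheXpert phases described above can be pictured with a toy classifier. This is only a sketch of the phase ordering: the classify_mention helper and its substring rules are invented for illustration, expect lowercase input, and have nothing to do with the real rules, which are matched against the dependency graph.

```shell
#!/bin/sh
# Sketch only: toy version of the 3-phase decision order.
# classify_mention and all rule strings below are hypothetical.
classify_mention() {
    sentence=$1

    # Phase 1: pre-negation uncertainty.
    case $sentence in
        *"cannot exclude"*|*"cannot rule out"*) echo uncertain; return ;;
    esac

    # Phase 2: negation.
    case $sentence in
        *"no "*|*"not "*|*"without "*) echo negative; return ;;
    esac

    # Phase 3: post-negation uncertainty.
    case $sentence in
        *"may "*|*"possible "*|*"could "*) echo uncertain; return ;;
    esac

    # No rule matched in any phase.
    echo positive
}

classify_mention "cannot exclude pneumonia"
classify_mention "no pleural effusion"
classify_mention "possible atelectasis"
classify_mention "cardiomegaly is present"
```

The first example shows why the order matters in this toy version: "cannot exclude pneumonia" also contains the negation cue "not ", so it would be labeled negative if the negation phase ran before the pre-negation uncertainty phase.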
9. Clean up intermediate information
This step removes intermediate information (sentence annotations) from the BioC files.
$ negbio_pipeline cleanup --output=$OUTPUT_DIR $INPUT_DIR/*.xml