A program to recognize self-acknowledged limitation sentences in biomedical articles
This repository contains the source code for the system described in the article "Automatic recognition of self-acknowledged limitations in clinical research literature". It includes the best-performing rule-based system (gov.nih.nlm.limitations.RuleBasedLimitationSentenceRecognizer) as well as the rule-based baseline (gov.nih.nlm.limitations.RuleBasedLimitationSentenceRecognizerBaseline).
To replicate the results, run gov.nih.nlm.limitations.RuleBasedLimitationSentenceRecognizer with three arguments:
- DATA/XML: directory that contains the parsed XML of the test set
- DATA/limitation_sentences_final.txt: gold annotations
- Output file name (after the run, this file should match DATA/rule_based_test.out.txt)
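One way to confirm that a run reproduced the reported results is to compare the output file with DATA/rule_based_test.out.txt line by line. The following is a minimal sketch, not part of the repository: the class name OutputCheck and the method firstDifference are hypothetical, and both files are assumed to be UTF-8 text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not part of the repository): compares the output of a
// run against the reference output shipped in DATA.
public class OutputCheck {

    /**
     * Returns the 1-based number of the first line that differs,
     * or -1 if both files contain exactly the same lines.
     */
    public static int firstDifference(Path produced, Path expected) throws IOException {
        // Assumption: both files are UTF-8 encoded text.
        List<String> a = Files.readAllLines(produced, StandardCharsets.UTF_8);
        List<String> b = Files.readAllLines(expected, StandardCharsets.UTF_8);
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i + 1;
            }
        }
        // All shared lines match; a length mismatch means one file has extra lines.
        return a.size() == b.size() ? -1 : shared + 1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: OutputCheck <produced file> <expected file>");
            return;
        }
        int line = firstDifference(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(line == -1
                ? "Output matches the reference file."
                : "First difference at line " + line + ".");
    }
}
```

Here the first argument would be the output file name passed to the recognizer and the second would be DATA/rule_based_test.out.txt.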
The parsed XML is generated from PubMed Central XML using gov.nih.nlm.limitations.CorpusParser.
To process articles in plain text, run gov.nih.nlm.limitations.CombinedPreprintLimitationRecognizer with two arguments:
- Input directory: a directory of plain text files
- Output file: the file for output (output is in JSON format)
The Stanford CoreNLP model jar file (stanford-corenlp-3.3.1-models.jar), which is needed to extract lexical and syntactic information from raw text, is not included in the distribution due to its size. It can be downloaded from http://stanfordnlp.github.io/CoreNLP/ and copied to the lib directory.
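Before running either recognizer, it may help to check that the model jar is actually in place. The sketch below is an illustration only: the class ModelJarCheck and its method are hypothetical, and it assumes the program is started from the repository root so that lib resolves relative to the working directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check (not part of the repository).
public class ModelJarCheck {

    // File name taken from this README.
    static final String MODEL_JAR = "stanford-corenlp-3.3.1-models.jar";

    /** Returns true if the CoreNLP model jar exists as a regular file in the given directory. */
    public static boolean isModelJarPresent(Path libDir) {
        return Files.isRegularFile(libDir.resolve(MODEL_JAR));
    }

    public static void main(String[] args) {
        // Assumption: the working directory is the repository root.
        Path libDir = Paths.get("lib");
        if (isModelJarPresent(libDir)) {
            System.out.println("Found " + MODEL_JAR + " in " + libDir.toAbsolutePath());
        } else {
            System.out.println("Missing " + MODEL_JAR + "; download it and copy it to "
                    + libDir.toAbsolutePath());
        }
    }
}
```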
- Halil Kilicoglu (halil (at) illinois.edu)