GitHub - h4ste/trec-pm

Medical Information Retrieval System developed for the 2017 and 2018 Text REtrieval Conference (TREC) Precision Medicine Track (TREC-PM)

🚧 Documentation under construction! 🚧

🔧 Building the project

Compiling the project

To compile the project, execute the following command:

./gradlew build

This will download what seems like the entire internet, then compile the Java sources and generate Windows (.bat) and Linux (.sh) scripts in the bin/ directory.

Downloading the data

🚧 pending 🚧

Obtaining knowledge bases and resources

Obtaining the COSMIC data

🚧 pending 🚧

Obtaining the FDA label data

🚧 pending 🚧

Obtaining DGIdb access

🚧 pending 🚧

Obtaining NCI Thesaurus data

🚧 pending 🚧

Installing runtime dependencies

1️⃣ Installing GATE

mkdir -p tools
pushd tools
wget 'https://downloads.sourceforge.net/project/gate/gate/8.0/gate-8.0-build4825-BIN.zip'
unzip gate-8.0-build4825-BIN.zip && rm gate-8.0-build4825-BIN.zip
export GATE_HOME=$PWD/gate-8.0-build4825-BIN/
popd

2️⃣ Installing GENIA

mkdir -p tools
wget -qO- http://www.nactem.ac.uk/tsujii/GENIA/tagger/geniatagger-3.0.2.tar.gz | tar -C tools -xvzf -
pushd tools/geniatagger-3.0.2
make
popd

3️⃣ Installing LingScope

mkdir -p tools/lingscope
pushd tools/lingscope
wget -q 'https://downloads.sourceforge.net/project/lingscope/negation_models.zip' 'https://downloads.sourceforge.net/project/lingscope/hedge_models.zip'
unzip negation_models.zip && rm negation_models.zip
unzip hedge_models.zip && rm hedge_models.zip
popd
mkdir -p lib
pushd lib
wget 'https://downloads.sourceforge.net/project/lingscope/lingscope_v3/dist/lingscope.jar' 'https://downloads.sourceforge.net/project/lingscope/lingscope_v3/dist/lib/abner.jar'
popd
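Once the three installs have finished, a quick sanity check can confirm that the expected files and directories are in place. The sketch below is not part of the project: the path list is inferred from the commands above (archive names and target directories), and the function name `missing_dependencies` is hypothetical. It only checks that the paths exist, not that the tools work.

```python
from pathlib import Path

# Paths implied by the install commands above (an assumption; adjust them
# if you unpacked the tools somewhere else).
EXPECTED_PATHS = [
    "tools/gate-8.0-build4825-BIN",
    "tools/geniatagger-3.0.2",
    "tools/lingscope",
    "lib/lingscope.jar",
    "lib/abner.jar",
]


def missing_dependencies(project_root="."):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected tool paths are present.")
```

Run it from the project root. Note that GATE_HOME is exported only for the current shell session, so it must be set again in any new shell before running the system.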

📚 Indexing the collections

Indexing MEDLINE

Two MEDLINE indexes are supported: a basic index, which stores documents in the inverted index (i.e., as Lucene stored fields), and a lazy index, which stores JAXB-parsed articles in a separate forward index (i.e., Lucene DocValues). The lazy JAXB-based index is preferred.

To create a lazy JAXB-based index, issue the following command:

sh bin/index_medline.sh INDEX_DIR [--replace] INPUT_DIR1, INPUT_DIR2, ...

This will create a Lucene index at INDEX_DIR.

Indexing Cancer Abstracts

Execute:

sh bin/index_abstracts.sh INDEX_DIR INPUT_DIR1, INPUT_DIR2, ...

Indexing ClinicalTrials.gov

Execute:

sh bin/index_trials.sh [-n|--negate] [-D|--delete] INDEX_DIR TRIAL_INPUT_DIR1, TRIAL_INPUT_DIR2, ...

🔍 Running the system

Execute:

sh bin/search-topics.sh [-m|--model] [-t|--runtag] TOPICS_FILE OUTPUT_DIR

where

the model option is one of
- JOINT
- ASPECT_FUSION
- SIMILARITY_FUSION
- FUSION_FUSION
- SIMPLE
- STUPID

and the runtag option is the runtag to be used in TREC-style submission files (e.g., UTDHLTRI).

The system will produce TREC-style run/submission files in OUTPUT_DIR:

medline_submission.txt includes the results of the system for Task A (i.e., operating on MEDLINE and conference proceedings)
clinical_trial_submission.txt includes the results of the system for Task B (i.e., operating on ClinicalTrials.gov)
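To inspect a submission file programmatically, it helps to group its lines by topic. The sketch below assumes the files follow the conventional six-column TREC run layout (topic, the literal `Q0`, document ID, rank, score, run tag); the README only says "TREC-style", so verify this against an actual output file. The names `RunEntry` and `parse_run` are hypothetical, and the sample document IDs are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class RunEntry(NamedTuple):
    topic: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_run(lines):
    """Group run-file lines by topic, preserving the order in the file.

    Assumes six whitespace-separated columns per line:
    topic, Q0, document ID, rank, score, run tag.
    """
    by_topic = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        topic, _q0, doc_id, rank, score, run_tag = fields
        by_topic[topic].append(
            RunEntry(topic, doc_id, int(rank), float(score), run_tag)
        )
    return dict(by_topic)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "1 Q0 12345678 1 9.87 UTDHLTRI",
        "1 Q0 23456789 2 8.10 UTDHLTRI",
    ]
    for topic, entries in parse_run(sample).items():
        print(topic, [e.doc_id for e in entries])
```

To read a real file, pass an open file handle, e.g. `parse_run(open("OUTPUT_DIR/medline_submission.txt"))`.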

Merging runs

The different retrieval models trade recall for precision. In order to return up to 1,000 articles/trials per topic, we append the retrieved documents from multiple runs.

sh bin/merge_runs.sh [--runTag] [-L|--limit] OUTPUT_FILE RUN_1, RUN_2, ...

This will produce a single output file in which the results for each topic are those retrieved by RUN_1, followed by those retrieved by RUN_2, then RUN_3, etc.
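The append-in-order behaviour can be illustrated with a small sketch. This is not the code behind merge_runs.sh: the function name `merge_runs` is hypothetical, and skipping documents already contributed by an earlier run, the reciprocal-rank scores, and the six-column output layout are all assumptions made for illustration.

```python
def merge_runs(runs, limit=1000, run_tag="merged"):
    """Append runs in priority order, topic by topic.

    runs: list of runs, each a list of (topic, doc_id) pairs in ranked order.
    Returns lines in the conventional six-column TREC run layout.
    """
    merged = {}  # topic -> ordered list of doc IDs
    seen = {}    # topic -> set of doc IDs already added
    for run in runs:
        for topic, doc_id in run:
            docs = merged.setdefault(topic, [])
            ids = seen.setdefault(topic, set())
            # Assumption: duplicates from later runs are dropped, and each
            # topic is capped at `limit` documents.
            if doc_id in ids or len(docs) >= limit:
                continue
            docs.append(doc_id)
            ids.add(doc_id)

    lines = []
    for topic, docs in merged.items():
        for rank, doc_id in enumerate(docs, start=1):
            score = 1.0 / rank  # synthetic score that decreases with rank
            lines.append(f"{topic} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
    return lines


if __name__ == "__main__":
    # Made-up document IDs, for illustration only.
    run_1 = [("1", "docA"), ("1", "docB")]
    run_2 = [("1", "docB"), ("1", "docC")]
    print("\n".join(merge_runs([run_1, run_2], limit=3, run_tag="UTDHLTRI")))
```

Because earlier runs take priority, listing the most precise model first keeps its results at the top of each topic, with higher-recall runs filling the remaining slots.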

Visualizing search results

Static HTML pages visualizing (1) retrieved articles, (2) topic analysis, and (3) generated Lucene queries will be generated when the system executes. These files may be found in OUTPUT_DIR:

medline_results.html includes the results of the system for Task A (i.e., operating on MEDLINE and conference proceedings)
clinical_trial_results.html includes the results of the system for Task B (i.e., operating on ClinicalTrials.gov)
