Overview

This repository contains research code from the Lindvall Lab at Dana-Farber Cancer Institute for extracting current symptoms reported by patients from their electronic health record (EHR). Symptoms are vital outcomes for cancer clinical trials, observational research, and population-level surveillance. We sought to develop, test, and externally validate a deep learning model that extracts symptoms from unstructured clinical notes in the EHR.

Project Pipeline

Processing
- Process label-studio annotation output for model input
Training
Inference
- Use best model for predictions
Copyright

How to process annotation output label/text for model input

Processing the label-studio output

python processing/label_output.py \
  --input {location of the label-studio output json files} \
  --label_config {configuration used to set up label-studio; xml file} \
  --label all OR --keep goals_or care \
  --hpi \
  --stratified_split 0.3 \
  --test

Without the --test argument, the data is split (stratified) into train/valid sets at 0.7/0.3.
With the --test argument, the data is split (stratified) into train/valid/test sets at 0.7/0.15/0.15.
Loading the spaCy en_core_sci_lg model takes around 17 seconds, so expect a short delay at startup.
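
The split ratios described above can be illustrated with a minimal, standard-library sketch of a stratified split. This is not the code in processing/label_output.py; the function name `stratified_split`, the toy labels, and the fixed seed are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions, seed=0):
    """Split indices into len(fractions) parts with similar label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    parts = [[] for _ in fractions]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        start = 0
        cumulative = 0.0
        for part_no, fraction in enumerate(fractions):
            cumulative += fraction
            if part_no == len(fractions) - 1:
                # The last part takes whatever remains, so no index is dropped.
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            parts[part_no].extend(indices[start:end])
            start = end
    return parts


# Hypothetical toy labels: 80 notes without a symptom, 20 with one.
labels = ["none"] * 80 + ["symptom"] * 20

# Without a test set: train/valid at 0.7/0.3.
train, valid = stratified_split(labels, [0.7, 0.3])
print(len(train), len(valid))  # 70 30

# With a test set: train/valid/test at 0.7/0.15/0.15.
train3, valid3, test3 = stratified_split(labels, [0.7, 0.15, 0.15])
print(len(train3), len(valid3), len(test3))  # 70 15 15
```

Splitting each label group separately is what keeps the rare class represented in every part; here the validation set of the two-way split holds 6 of the 20 "symptom" examples.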

Training

Run the models

Transformer model choices: 'bert', 'xlnet', 'roberta', 'xlm-roberta', 'camembert', 'distilbert', 'electra'

conda activate transformers
python ner.py \
  --dset {location of the data that has been converted to CoNLL format} \
  --model_class electra \
  --pretrained_model google/electra-base-discriminator \
  --lr 6e-5 \
  --decay 0.02 \
  --warmups 500
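
For readers unfamiliar with the CoNLL format expected by --dset, the sketch below parses a small sample. It assumes the common layout of one token per line with the tag in the last column and a blank line between sentences; the exact columns written by this repository's processing step are not documented here, and the tag names (PAIN, NAUSEA) and the helper `read_conll` are hypothetical.

```python
def read_conll(text):
    """Parse CoNLL-style text into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample with BIO tags; real label names depend on the annotation config.
sample = """\
Patient O
reports O
severe B-PAIN
back I-PAIN
pain I-PAIN
. O

Also O
reports O
nausea B-NAUSEA
. O
"""

sentences = read_conll(sample)
print(len(sentences))  # 2
print(sentences[1])
```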

Optimize the hyperparameters

Bayesian optimization with Gaussian processes
- Open the interactive plots (contour_plot, slice_plot, cv_plot, etc.) in a browser

python optimization.py \
  --model bert \
  --lr 1e-6 1e-4 \
  --decay 0.01 0.1 \
  --warmups 0 3000 \
  --eps 1e-9 1e-7
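
Each pair of values in the command above gives the lower and upper bound for one hyperparameter. The sketch below only shows how such a box-shaped search space could be written down and sampled for the random starting points that a Gaussian-process optimizer usually evaluates before fitting its surrogate; it does not implement the Gaussian-process step itself and is not taken from optimization.py. Sampling lr and eps on a log scale and treating warmups as an integer are assumptions.

```python
import math
import random

# Bounds copied from the command above; the scale of each dimension is an assumption.
SEARCH_SPACE = {
    "lr": (1e-6, 1e-4, "log"),
    "decay": (0.01, 0.1, "linear"),
    "warmups": (0, 3000, "int"),
    "eps": (1e-9, 1e-7, "log"),
}


def sample_point(space, rng):
    """Draw one random configuration from the box-shaped search space."""
    point = {}
    for name, (low, high, scale) in space.items():
        if scale == "log":
            # Uniform in log space, so each order of magnitude is equally likely.
            point[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif scale == "int":
            point[name] = rng.randint(low, high)
        else:
            point[name] = rng.uniform(low, high)
    return point


rng = random.Random(0)
initial_points = [sample_point(SEARCH_SPACE, rng) for _ in range(5)]
for point in initial_points:
    print(point)
```

A log scale is a common choice for parameters such as the learning rate whose bounds span several orders of magnitude, because uniform sampling would otherwise concentrate almost all draws near the upper bound.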

Load model outputs back into the server hosting label-studio (for active learning)

python processing/model_output.py \
  --model_output processing/output/symptoms_hpi_all/prediction_test.txt \
  --label_output_dir symptoms/storage/label-studio/project/completions/ \
  --label_config symptoms/storage/label-studio/project/config.xml
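
Turning token-level predictions back into annotations generally requires grouping tagged tokens into entity spans first. The sketch below shows that grouping step for BIO tags. The format of prediction_test.txt and the internals of processing/model_output.py are not described in this README, so the function `bio_to_spans`, the tag names, and the token-index output are assumptions for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, start, end_exclusive, text) spans."""
    spans = []
    start, label = None, None
    # A sentinel "O" at the end flushes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # An I- tag with no open span of the same label is treated as a span start.
            start, label = i, tag[2:]
    return spans


# Hypothetical tokens and tags; real label names come from the label config.
tokens = ["Patient", "reports", "severe", "back", "pain", "and", "nausea", "."]
tags = ["O", "O", "B-PAIN", "I-PAIN", "I-PAIN", "O", "B-NAUSEA", "O"]

spans = bio_to_spans(tokens, tags)
print(spans)
# [('PAIN', 2, 5, 'severe back pain'), ('NAUSEA', 6, 7, 'nausea')]
```

The spans here use token indices; an annotation tool that stores character offsets would need one more step that maps each token back to its position in the original note.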

Inference

Use raw CSV files with a column containing the clinical notes; there is no need to convert them into CoNLL format.

python inference/run_and_predict.py -ipf {location of the input file} -opf {location of dummy output file} -cn {name of the column containing the clinical note}
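
As a rough picture of the input the command above expects, the sketch below reads clinical notes from a named CSV column and passes each one to a predictor. It is not the code in inference/run_and_predict.py: `predict_symptoms` is a hypothetical keyword-matching stand-in for the trained model, and the column name `note_text` and the sample rows are invented for illustration.

```python
import csv
import io


def predict_symptoms(note):
    """Hypothetical stand-in for the trained model; returns keyword hits only."""
    keywords = ("pain", "nausea", "fatigue")
    return [keyword for keyword in keywords if keyword in note.lower()]


def run_inference(csv_file, column_name):
    """Read notes from one CSV column and return (row_number, predictions) pairs."""
    reader = csv.DictReader(csv_file)
    if column_name not in (reader.fieldnames or []):
        raise KeyError(f"column {column_name!r} not found in input file")
    return [
        (row_no, predict_symptoms(row[column_name]))
        for row_no, row in enumerate(reader, start=1)
    ]


# In-memory CSV so the sketch runs without touching disk.
sample = io.StringIO(
    "note_id,note_text\n"
    '1,"Patient reports nausea and back pain."\n'
    '2,"Feeling well today."\n'
)

results = run_inference(sample, "note_text")
print(results)  # [(1, ['pain', 'nausea']), (2, [])]
```

Checking for the column up front gives a clear error when the name passed with -cn does not match the file header, instead of a failure partway through the file.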

Copyright

All code is modified from

License

The GNU GPL v2 version of PathML is made available via Open Source licensing. The user is free to use, modify, and distribute under the terms of the GNU General Public License version 2.

Commercial license options are also available.

Contact

Questions? Comments? Suggestions? Get in touch!

CHARLOTTA_LINDVALL@DFCI.HARVARD.EDU
