
Big Data for Health Informatics

Knowledge Transfer with Transformers

Overview

Project repository for the Big Data for Health Informatics course (CSE 6250) at Georgia Tech.

New medical studies provide a rich source of material that doctors can use to improve patient outcomes in novel ways. In this study, we explore whether patient mortality predictions can be improved using text features generated by NLP transformers and, if so, whether the improvements in prediction scores are attributable to the specific transformer used. We used two transformers: a generic transformer trained on PubMed text (BlueBERT) and a use-case-specific transformer trained on coronavirus text (CORD-19). For comparison, we also trained two other patient mortality models: (1) trained on structured data only; (2) trained on structured data plus text features generated with TF-IDF. Results show that the model trained on structured data and TF-IDF text features outperforms the BlueBERT-based model, and that there is no significant difference in performance between the BlueBERT- and CORD-19-based models.
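The transformer-derived text features can be thought of as fixed-size embeddings of each clinical note. Below is a minimal sketch of how such features might be extracted with a Hugging Face pretrained model; the model identifier and the pooling choice ([CLS] token) are assumptions for illustration, not taken from this repository's code.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed_note(note: str) -> torch.Tensor:
    # Tokenize and truncate to the 512-token limit of BERT-style models
    inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's final hidden state as a fixed-size note embedding
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

features = embed_note("Patient admitted with acute respiratory distress.")
print(features.shape)  # torch.Size([768]) for a BERT-base model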

Structure

All project-related files are contained in the src directory.

  1. src/ML_Results.ipynb: Jupyter notebook containing the results of the study.
  2. src/preprocessing.py: PySpark script containing the preprocessing logic; meant to run on a Spark cluster.
  3. src/preprocessing2.py: Python script for feature preprocessing (downstream of #2).
  4. src/ml.py: Python script containing the ML model training logic, derived from the analysis in #1 (downstream of #3).
  5. src/preprocessing-py.py: (not a main component of this project) Python script for preprocessing a data sample small enough to load on a single machine.

The outputs of preprocessing.py and preprocessing2.py will appear under data/processed/spark-etl/ and data/processed/spark-processed-features/, respectively.
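A minimal sketch of this handoff, assuming the Spark stage writes Parquet (the file format and the pass-through feature step below are assumptions, not the repository's actual logic):

import pandas as pd

# preprocessing.py (Spark) is assumed to write its ETL output as Parquet
etl_df = pd.read_parquet("data/processed/spark-etl/")

# preprocessing2.py would turn the ETL output into model-ready features
features_df = etl_df  # placeholder for the actual feature engineering
features_df.to_parquet("data/processed/spark-processed-features/features.parquet")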

Setup

This project makes use of the following:

  • MIMIC-III dataset (access permissions required). Once you have access, place the data under data/raw/ (relative to the repository root); the project files that take the data as input (preprocessing.py and preprocessing2.py) can then be executed
  • Spark 3.2 (PySpark, along with its pandas API)
  • Scikit-learn's RandomForestClassifier (see the sketch after this list)
  • Hugging Face pretrained models
  • DVC, for data and model version control
  • GCP storage
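To show how the baseline from the overview fits together, here is a minimal sketch that concatenates structured features with TF-IDF note features and feeds them to scikit-learn's RandomForestClassifier. The toy notes, structured columns, and labels are illustrative assumptions, not the repository's actual feature set.

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy inputs: one note and two structured features per ICU stay (illustrative)
notes = ["pt stable, extubated overnight", "acute respiratory failure, intubated"]
structured = np.array([[65.0, 0.0], [78.0, 1.0]])  # e.g. age, ventilation flag
mortality = np.array([0, 1])  # in-hospital mortality label

# TF-IDF features from the note text
tfidf = TfidfVectorizer(max_features=5000)
text_features = tfidf.fit_transform(notes)

# Concatenate structured and text features into one sparse design matrix
X = hstack([csr_matrix(structured), text_features])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, mortality)
print(clf.predict_proba(X)[:, 1])  # predicted mortality probabilities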

An exhaustive list of project requirements can be found in the repository's requirements file.

Test

After setting up the environment, run the files in the following order (dependency graph):

preprocessing.py -> preprocessing2.py -> ml.py

python preprocessing.py   # can also be run on a Spark cluster for parallelism
python preprocessing2.py  # should be run on a local machine
python -i ml.py           # for experimenting with the ML models
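For a cluster run, preprocessing.py would typically be launched with spark-submit; the master URL below is a deployment-specific assumption:

spark-submit --master yarn src/preprocessing.py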
