
Big Data for Health Informatics

Knowledge Transfer with Transformers

Overview

Project repository for the Big Data for Health Informatics course (CSE 6250) at Georgia Tech.

New medical studies provide a rich source of material that doctors can use to improve patient outcomes in novel ways. In this study, we explore whether patient mortality predictions can be improved using text features generated by NLP transformers and, if so, whether the improvements in prediction scores are attributable to the specific transformer used. We used two transformers: a generic transformer trained on PubMed text (BlueBERT) and a use-case-specific transformer trained on coronavirus text (CORD-19). For comparison, we also trained two other patient mortality models: (1) trained on structured data only; (2) trained on structured data plus text features generated with TF-IDF. Results show that the model trained on structured data and TF-IDF text features outperforms the BlueBERT-based model, and that there is no significant difference in performance between the BlueBERT- and CORD-19-based models.
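The transformer-derived text features can be thought of as fixed-size embeddings of each clinical note. Below is a minimal sketch of how such features might be extracted with a Hugging Face pretrained model; the model identifier and the pooling choice ([CLS] token) are assumptions for illustration, not taken from this repository's code.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed_note(note: str) -> torch.Tensor:
    # Tokenize and truncate to the 512-token limit of BERT-style models
    inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's final hidden state as a fixed-size note embedding
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

features = embed_note("Patient admitted with acute respiratory distress.")
print(features.shape)  # torch.Size([768]) for a BERT-base model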

Structure

All project-related files are contained in the src directory.

  1. src/ML_Results.ipynb: Jupyter notebook containing the results of the study.
  2. src/preprocessing.py: PySpark script containing the preprocessing logic; meant to run on a Spark cluster.
  3. src/preprocessing2.py: Python script for feature preprocessing (downstream of #2).
  4. src/ml.py: Python script containing the ML model training logic, derived from the analysis in #1 (downstream of #3).
  5. src/preprocessing-py.py: (not a main component of this project) Python script for preprocessing a data sample small enough to load on a single machine.

The outputs of preprocessing.py and preprocessing2.py will appear under data/processed/spark-etl/ and data/processed/spark-processed-features/, respectively.
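A minimal sketch of this handoff, assuming the Spark stage writes Parquet (the file format and the pass-through feature step below are assumptions, not the repository's actual logic):

import pandas as pd

# preprocessing.py (Spark) is assumed to write its ETL output as Parquet
etl_df = pd.read_parquet("data/processed/spark-etl/")

# preprocessing2.py would turn the ETL output into model-ready features
features_df = etl_df  # placeholder for the actual feature engineering
features_df.to_parquet("data/processed/spark-processed-features/features.parquet")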

Setup

This project makes use of the following:

  • MIMIC-III dataset (access permissions required). Once you have access, place the data under data/raw/ (relative to the repository root); the project files that take the data as input (preprocessing.py and preprocessing2.py) can then be executed
  • Spark 3.2 (PySpark, along with its pandas API)
  • Scikit-learn's RandomForestClassifier (see the sketch after this list)
  • Hugging Face pretrained models
  • DVC, for data and model version control
  • GCP storage
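To show how the baseline from the overview fits together, here is a minimal sketch that concatenates structured features with TF-IDF note features and feeds them to scikit-learn's RandomForestClassifier. The toy notes, structured columns, and labels are illustrative assumptions, not the repository's actual feature set.

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy inputs: one note and two structured features per ICU stay (illustrative)
notes = ["pt stable, extubated overnight", "acute respiratory failure, intubated"]
structured = np.array([[65.0, 0.0], [78.0, 1.0]])  # e.g. age, ventilation flag
mortality = np.array([0, 1])  # in-hospital mortality label

# TF-IDF features from the note text
tfidf = TfidfVectorizer(max_features=5000)
text_features = tfidf.fit_transform(notes)

# Concatenate structured and text features into one sparse design matrix
X = hstack([csr_matrix(structured), text_features])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, mortality)
print(clf.predict_proba(X)[:, 1])  # predicted mortality probabilities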

An exhaustive list of project requirements can be found in the repository's requirements file.

Test

After setting up the environment, run the files in the following order (dependency graph):

preprocessing.py -> preprocessing2.py -> ml.py

python preprocessing.py   # can also be run on a Spark cluster for parallelism
python preprocessing2.py  # should be run on a local machine
python -i ml.py           # for experimenting with the ML models
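For a cluster run, preprocessing.py would typically be launched with spark-submit; the master URL below is a deployment-specific assumption:

spark-submit --master yarn src/preprocessing.py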
