
Assessment of Pre-trained Observational Large Longitudinal models in OHDSI (APOLLO)

Build Status

Introduction

This Python package is for building and evaluating large general pre-trained models on data in the OMOP Common Data Model (CDM) format. The models are fitted on the structured data (concepts) in the CDM, not any natural language. We aim to evaluate these models on various tasks, such as patient-level prediction (either zero-shot or fine-tuned).
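As a toy illustration of what "fitting on structured data (concepts)" means (this is not code from this package; the record contents and concept IDs below are made up), each patient's CDM records can be viewed as a date-ordered sequence of concept-ID tokens, analogous to word tokens in a natural-language model:

```python
# Illustrative sketch only: hypothetical CDM records for one patient.
patient_records = [
    {"concept_id": 201826, "date": "2020-01-05"},   # e.g. a condition occurrence
    {"concept_id": 1503297, "date": "2020-01-05"},  # e.g. a drug exposure
    {"concept_id": 201826, "date": "2021-03-02"},   # the same condition again
]

# Order the records by date and map each distinct concept ID to a token index:
records = sorted(patient_records, key=lambda r: r["date"])
vocab = {}
tokens = [vocab.setdefault(r["concept_id"], len(vocab)) for r in records]

print(tokens)  # [0, 1, 0]
```

The resulting token sequences are what a large language-model architecture can be pre-trained on, in place of natural-language text.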

Overview

This package assumes the GeneralPretrainedModelTools R package has been executed to retrieve (a sample of) the CDM data to local Parquet files. After this, a 'cdm_processor' must be run to convert the data to sequence data suitable for a large language model. TODO: how to go from here.

Getting Started

Pre-requisite

The project is built in Python 3.10, and the project dependencies need to be installed.

Create a new Python virtual environment

python -m venv venv
source venv/bin/activate

Install the packages in requirements.txt

pip install -r requirements.txt

Simulate CDM data

In real-world applications, the CDM data can be retrieved from a database using the GeneralPretrainedModelTools R package. For testing purposes, we can simulate CDM data using a built-in simulator:

  1. Edit simulator.ini so the root_folder argument points to a folder on the local file system.

  2. Run:

    PYTHONPATH=./: python simulating/simulator.py simulator.ini

By default, the simulation script will generate pretraining data in a subfolder called 'pretraining'.

In addition, data will be generated for a patient-level prediction task, where patient data up to an index date is used to predict whether a patient will have a certain condition in the prediction window (default = 365 days) after the index date. Training data, for fine-tuning the pretrained model, will be generated in a subfolder called 'train'. Test data, for evaluating the fine-tuned model, will be generated in a subfolder called 'test'. In both 'train' and 'test' folders, subfolders will be generated for a subset of simulated concept IDs with labels indicating whether the concept was observed in the prediction window.
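The labeling rule described above can be sketched as follows (a minimal illustration; the function name and signature are hypothetical, not the package's API):

```python
from datetime import date, timedelta

def label(concept_dates, index_date, prediction_window_days=365):
    """Return 1 if the concept is observed in the prediction window
    (index_date, index_date + prediction_window_days], else 0."""
    window_end = index_date + timedelta(days=prediction_window_days)
    return int(any(index_date < d <= window_end for d in concept_dates))

idx = date(2022, 1, 1)
print(label([date(2022, 6, 1)], idx))  # 1: observed within 365 days after index
print(label([date(2021, 6, 1)], idx))  # 0: observed before the index date
print(label([date(2023, 6, 1)], idx))  # 0: observed after the window closes
```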

Processing CDM data for CEHR-BERT pre-training

  1. Edit cdm_processor.ini to point to folders on the local file system, e.g. the 'pretraining' folder generated by the simulation script.

  2. Run:

    PYTHONPATH=./: python cdm_processing/cdm_processor.py cdm_processor.ini

Pre-train model

  1. Edit model_trainer.ini to point to folders on the local file system, e.g. the 'patient_sequence' folder generated by the CDM processing script.

  2. Run:

    PYTHONPATH=./: python training/train_model.py model_trainer.ini

On macOS, you may need to set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to avoid an error.

License

Apollo is licensed under Apache License 2.0.

Development status

Under development. Do not use.
