
Assessment of Pre-trained Observational Large Longitudinal models in OHDSI (APOLLO)

Build Status

Introduction

This Python package is for building and evaluating large general pre-trained models on data in the OMOP Common Data Model (CDM) format. The models are fitted on the structured data (concepts) in the CDM, not any natural language. We aim to evaluate these models on various tasks, such as patient-level prediction (either zero-shot or fine-tuned).
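As a toy illustration of what "fitting on structured data (concepts)" means (this is not code from this package; the record contents and concept IDs below are made up), each patient's CDM records can be viewed as a date-ordered sequence of concept-ID tokens, analogous to word tokens in a natural-language model:

```python
# Illustrative sketch only: hypothetical CDM records for one patient.
patient_records = [
    {"concept_id": 201826, "date": "2020-01-05"},   # e.g. a condition occurrence
    {"concept_id": 1503297, "date": "2020-01-05"},  # e.g. a drug exposure
    {"concept_id": 201826, "date": "2021-03-02"},   # the same condition again
]

# Order the records by date and map each distinct concept ID to a token index:
records = sorted(patient_records, key=lambda r: r["date"])
vocab = {}
tokens = [vocab.setdefault(r["concept_id"], len(vocab)) for r in records]

print(tokens)  # [0, 1, 0]
```

The resulting token sequences are what a large language-model architecture can be pre-trained on, in place of natural-language text.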

Overview

This package assumes the GeneralPretrainedModelTools R package has been executed to retrieve (a sample of) the CDM data to local Parquet files. After this, a 'cdm_processor' must be run to convert the data to sequence data suitable for a large language model. TODO: how to go from here.

Getting Started

Pre-requisite

The project is built in Python 3.10, and the project dependencies need to be installed.

Create a new Python virtual environment

python -m venv venv
source venv/bin/activate

Install the packages in requirements.txt

pip install -r requirements.txt

Simulate CDM data

In real-world applications, the CDM data can be retrieved from a database using the GeneralPretrainedModelTools R package. For testing purposes, we can simulate CDM data using a built-in simulator:

  1. Edit simulator.ini so the root_folder argument points to a folder on the local file system.

  2. Run:

    PYTHONPATH=./: python simulating/simulator.py simulator.ini

By default, the simulation script will generate pretraining data in a subfolder called 'pretraining'.

In addition, data will be generated for a patient-level prediction task, where patient data up to an index date is used to predict whether a patient will have a certain condition in the prediction window (default = 365 days) after the index date. Training data, for fine-tuning the pretrained model, will be generated in a subfolder called 'train'. Test data, for evaluating the fine-tuned model, will be generated in a subfolder called 'test'. In both 'train' and 'test' folders, subfolders will be generated for a subset of simulated concept IDs with labels indicating whether the concept was observed in the prediction window.
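The labeling rule described above can be sketched as follows (a minimal illustration; the function name and signature are hypothetical, not the package's API):

```python
from datetime import date, timedelta

def label(concept_dates, index_date, prediction_window_days=365):
    """Return 1 if the concept is observed in the prediction window
    (index_date, index_date + prediction_window_days], else 0."""
    window_end = index_date + timedelta(days=prediction_window_days)
    return int(any(index_date < d <= window_end for d in concept_dates))

idx = date(2022, 1, 1)
print(label([date(2022, 6, 1)], idx))  # 1: observed within 365 days after index
print(label([date(2021, 6, 1)], idx))  # 0: observed before the index date
print(label([date(2023, 6, 1)], idx))  # 0: observed after the window closes
```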

Processing CDM data for CEHR-BERT pre-training

  1. Edit cdm_processor.ini to point to folders on the local file system, e.g. the 'pretraining' folder generated by the simulation script.

  2. Run:

    PYTHONPATH=./: python cdm_processing/cdm_processor.py cdm_processor.ini

Pre-train model

  1. Edit model_trainer.ini to point to folders on the local file system, e.g. the 'patient_sequence' folder generated by the CDM processing script.

  2. Run:

    PYTHONPATH=./: python training/train_model.py model_trainer.ini

On macOS, you may need to set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to avoid an error.

License

Apollo is licensed under Apache License 2.0.

Development status

Under development. Do not use.
