This repository contains the package code for the ML4H (2025) submission *Bayesian Event-Based Model for Disease Subtype and Stage Inference*. If you use this work, please cite:
```bibtex
@inproceedings{Hao2025JointProgression,
  author = {Hongtao Hao and Joseph L. Austerweil},
  title = {Bayesian Event-Based Model for Disease Subtype and Stage Inference},
  booktitle = {Proceedings of the 5th Machine Learning for Health Symposium},
  volume = {297},
  pages = {??--??}, % Page numbers are not provided now, will add later.
  year = {2025},
  publisher = {PMLR},
}
```
```bash
pip install bebms
```

or git clone this project and, at the root, run

```bash
pip install -e .
```

If you need quick examples of data usable for bebms for testing purposes, you can use the sample data available at `bebms/data/samples`.
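For a quick look at what that sample data contains (no specific file name is assumed here; the snippet just lists whatever CSVs ship in that folder):

```python
import os
import pandas as pd

# List the sample CSV files shipped with the package, then peek at one.
sample_dir = "bebms/data/samples"
csv_files = [f for f in os.listdir(sample_dir) if f.endswith(".csv")]
print(csv_files)

df = pd.read_csv(os.path.join(sample_dir, csv_files[0]))
print(df.head())  # the column layout your own data should follow
```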
If you need to generate synthetic data, git clone this repository and, at the root, run

```bash
bash gen.sh
```

The generated data will be found at `bebms/test/my_data` as `.csv` files.
The parameters are pre-set and can be found at `bebms/data/params.json`; you can change them by editing that JSON file.
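For example, a small sketch of editing it programmatically (the available keys depend on the shipped file, so none are assumed here):

```python
import json
from pathlib import Path

# Load the pre-set generation parameters, tweak them, and write them back.
path = Path("bebms/data/params.json")
params = json.loads(path.read_text())
# ...modify entries here; print(params) to see the available keys...
path.write_text(json.dumps(params, indent=2))
```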
You can also change parameters in `config.toml` to adjust what data to generate.
You can look into `gen.py` and `generate_data.py` for more details. By default, `keep_all_cols = False` and the generated data is in wide format. If you set `keep_all_cols = True`, the data will be in long (tidy) format instead. This is because the wide format cannot accommodate the `affected` column, which describes each biomarker measurement rather than each participant; see the sketch below.
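A minimal illustration of the two layouts (the biomarker column names and `affected` values here are made up for the example, not the actual schema produced by `gen.sh`):

```python
import pandas as pd

# Wide format: one row per participant, one column per biomarker.
wide = pd.DataFrame({
    "participant": [0, 1],
    "biomarker_A": [1.2, 3.4],  # hypothetical biomarker names
    "biomarker_B": [0.7, 2.1],
})

# Long (tidy) format: one row per (participant, biomarker) pair, which
# leaves room for a per-measurement `affected` flag.
long_df = wide.melt(id_vars="participant", var_name="biomarker", value_name="measurement")
long_df["affected"] = [True, False, True, True]  # hypothetical values
print(long_df)
```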
After git cloning this repository and generating synthetic data, to run bebms, at the root, run

```bash
bash test.sh
```

You can check `bebms/test/test.py` to learn how to use the `run_bebms` function. The results will be saved in `bebms/test/algo_results`.
You can also compare the results of bebms with those of SuStaIn.
First, you need to install packages required by SuStaIn:
```bash
pip install git+https://github.com/noxtoby/awkde
pip install git+https://github.com/hongtaoh/ucl_kde_ebm
pip install git+https://github.com/hongtaoh/pySuStaIn
```

Then, at the root of this repository, run

```bash
bash test_sustain.sh
```

You can check the details in `bebms/test/test_sustain.py`.
The results will be saved in `bebms/test/sustain_results`.
You can save the results of bebms along with those of SuStaIn by running, at the root:

```bash
python3 save_csv.py
```

The results will be found at the root as `all_results.csv`.
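A quick way to eyeball the comparison from there (the grouping column name is an assumption; inspect `all_results.csv` for the actual schema):

```python
import pandas as pd

# Load the combined bebms + SuStaIn results and summarize them.
df = pd.read_csv("all_results.csv")
print(df.columns.tolist())  # check the actual schema first
# "method" is a hypothetical column name; adjust to what the file actually uses.
if "method" in df.columns:
    print(df.groupby("method").mean(numeric_only=True))
```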
You can use your own data, but make sure it follows the same format as the data in `bebms/data/samples`.
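One way to sanity-check the format, assuming you compare against one of the bundled samples (pick a real file from `bebms/data/samples`; the name below is a placeholder):

```python
import pandas as pd

# Compare your columns against a bundled sample's columns.
sample = pd.read_csv("bebms/data/samples/<some_sample>.csv")  # placeholder name
mine = pd.read_csv("path/to/your/data.csv")
missing = set(sample.columns) - set(mine.columns)
print("Missing columns:", missing or "none")
```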
After you have your own data, the first step is to find the optimal number of subtypes.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bebms.cross_validate import cross_validatation
import bebms.utils as utils

data_file = 'path/to/your/data.csv'

cvic_scores, optimal_n = cross_validatation(
    data_file=data_file,
    iterations=10000,      # how many MCMC iterations to run
    n_shuffle=2,           # how many biomarkers to shuffle in each subtype; 2 is recommended
    n_subtype_shuffle=2,   # how many subtypes to shuffle; 2 is recommended
    burn_in=200,
    prior_n=1,             # strength of the prior belief in the prior estimate of the mean (μ); default 1
    prior_v=1,             # prior degrees of freedom, influencing the certainty of the prior estimate of the variance (σ²); default 1
    max_n_subtypes=6,      # the max number of subtypes to consider
    N_FOLDS=5,             # K-fold cross-validation; choose K here
    seed=42,               # random seed
    with_labels=True,      # whether to assume knowledge of diagnosis labels, i.e., healthy or not
    z_score_norm=False     # whether to z-score normalize all biomarker data; default is False
)

# Get the optimal number of subtypes
ml_n_subtypes = utils.choose_optimal_subtypes(cvic_scores)
print(ml_n_subtypes)

# Summarize results
df_cvic = pd.DataFrame({
    "n_subtypes": np.arange(1, 7),  # 1 through max_n_subtypes
    "CVIC": cvic_scores
})
print(df_cvic)

# Plot the CVIC curve
plt.figure(figsize=(6, 4))
plt.plot(df_cvic["n_subtypes"], df_cvic["CVIC"], marker="o")
plt.xlabel("Number of subtypes")
plt.ylabel("CVIC (lower is better)")
plt.title("Cross-validated model selection (BEBMS)")
plt.grid(True)
plt.show()
```

After you know the optimal number of subtypes, you can start running bebms on your dataset.
Ideally, try several different random seeds and see which one leads to the highest data log likelihood:
```python
import numpy as np
from bebms import run_bebms
from collections import defaultdict

data_file = 'path/to/your/data.csv'
dic = defaultdict(float)

for _ in range(10):  # try 10 random seeds; modify the number as you wish
    x = np.random.randint(1, 2**32 - 1)
    results = run_bebms(
        data_file=data_file,
        n_subtypes=3,          # the optimal number of subtypes you identified above
        output_dir='bebms_results',
        n_iter=20000,          # number of MCMC iterations
        n_shuffle=2,
        n_subtype_shuffle=2,
        burn_in=200,
        thinning=1,
        seed=x,
        obtain_results=True,   # to get the results
        save_results=False,    # no need to save: here we only need the data log likelihood
        with_labels=True,      # we assume knowledge of diagnosis labels
        save_plots=False,      # we do not save plots
        z_score_norm=False     # whether to z-score normalize all biomarker data; default is False
    )
    dic[x] = results['max_log_likelihood']

# By checking dic, you can see which random seed led to the highest data log likelihood:
best_seed = max(dic, key=dic.get)
print(best_seed)

# Finally, run bebms with that seed to get the results.
seed = 12345  # suppose this is the optimal seed you identified above
results, all_orders, all_loglikes, best_order_matrix, biomarker_names, ml_stage, ml_subtype = run_bebms(
    data_file=data_file,
    n_subtypes=3,
    output_dir='bebms_results',  # where results will be saved
    n_iter=20000,
    n_shuffle=2,
    n_subtype_shuffle=2,
    burn_in=200,
    thinning=1,
    seed=seed,
    obtain_results=True,
    save_results=True,   # now we do save the results
    with_labels=True,
    save_plots=True      # and the result plots
)
```

Both `run_bebms` and `cross_validatation` default to `z_score_norm=False`, but you can try `z_score_norm=True` and see which results are more plausible; this might require domain expertise.
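For instance, a minimal sketch (reusing `data_file` and `seed` from the run above) that fits both settings and writes each run into its own folder so you can compare side by side:

```python
# Run bebms twice, once per normalization setting; each run saves its
# results and plots into a separate output directory.
for z_norm in (False, True):
    run_bebms(
        data_file=data_file,
        n_subtypes=3,
        output_dir=f'bebms_results_znorm_{z_norm}',
        n_iter=20000,
        n_shuffle=2,
        n_subtype_shuffle=2,
        burn_in=200,
        thinning=1,
        seed=seed,
        obtain_results=True,
        save_results=True,
        with_labels=True,
        save_plots=True,
        z_score_norm=z_norm,
    )
```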
- 2025-08-21 (V 0.0.3)
  - Did `generate_data.py`.
- 2025-08-22 (V 0.0.5)
  - Did `mh.py`; corrected the `conjugate_priors` implementation.
- 2025-08-23 (V 0.1.2)
  - Improved functions in `utils.py`.
- 2025-08-29 (V 0.1.3)
  - Didn't change much.
- 2025-08-30 (V 0.1.8)
  - Optimized `compute_likelihood_and_posteriors` such that healthy participants' log likelihood is calculated only once each time.
  - Made sure subtype assignment accuracy does not apply to healthy participants at all.
  - Fixed a major bug in data generation. The very low subtype assignment accuracy might have been due to this error.
  - Included both subtype accuracy metrics in `run.py`.
- 2025-08-31 (V 0.2.5)
  - Rescaled event times and disease stages for exp7-9 such that `max(event_times) = max_stage - 1` and `max(disease_stages) = max_stage`.
  - Changed the experiments and some of the implementation.
  - Forced `max(event_times) = max_stage - 1`, but did not force disease stages.
- 2025-09-01 (V 0.2.9)
  - Removed the forcing of `max(event_times) = max_stage - 1`.
  - Modified `run.py`.
- 2025-09-02 (V 0.3.3.1)
  - Redid the staging and subtyping.
  - Integrated both the with-labels and without-labels cases.
- 2025-09-04 (V 0.3.3.2)
  - Made sure that in staging with labels, the `new_order` indices start from 1 instead of 0, because participant stages now start from 0.
- 2025-09-06 (V 0.3.5.6)
  - Added the plot function back.
- 2025-09-08 (V 0.3.5.8)
  - Added `ml_subtype` to the output results.
  - Added `all_logs` to the output returned in `run.py`.
- 2025-09-21 (V 0.3.9)
  - Removed `iteration >= burn_in` when updating `best_*`.
- 2025-11-03 (V 0.4.1)
  - Changed the package name to `bebms`.
  - Edited README.
- 2025-11-06 (V 0.4.3)
  - Updated README.
  - Allowed `keep_all_cols=True` when generating synthetic data; the long format is used in that case.
- 2025-11-07 (V 0.4.5)
  - Added `z_score_norm` to `run.py` and `cross_validation.py` to let users z-score normalize the data matrix.