Federated Learning for Clinical Structured Data: A Benchmark Comparison of Engineering and Statistical Approaches
- Supplementary materials
- Introduction
- System requirements
- A demo for generating and analyzing simulated data
- Citation
- Contact
Python and R workflow for generating and analyzing simulated datasets for benchmark comparisons of engineering-based FL algorithms (FedAvg, FedAvgM, q-FedAvg and FedProx) and the statistics-based FL algorithm (GLORE).
See our preprint for the full story.
This repository incorporates some code from FedProx and GLORE.
Download Supplementary materials
Federated Learning (FL) has shown promising potential for safeguarding data privacy in healthcare collaborations. Although the term “FL” was originally coined by engineers, the statistical community has also explored similar privacy-preserving algorithms. Statistical FL algorithms, however, remain considerably less recognized than their engineering counterparts. Our goal was to bridge the gap by presenting the first comprehensive comparison of FL frameworks from both engineering and statistical domains. We evaluated five FL frameworks using both simulated and real-world data. The results indicate that statistical FL algorithms yield less biased point estimates for model coefficients and offer convenient confidence interval estimations. In contrast, engineering-based methods tend to generate more accurate predictions, sometimes surpassing central pooled and statistical FL models. This study underscores the relative strengths and weaknesses of both types of methods, emphasizing the need for increased awareness and their integration in future FL applications.
- R packages: 'cowplot', 'dplyr', 'ggplot2', 'grid', 'gridExtra', 'pROC', 'rstudioapi', 'stringr'.
- Java: version 8 or higher.
- Python: version 3.9 (macOS) or 3.7 (Windows).
To install the required Python packages, run:
pip install -r requirements.txt
(use requirements_win.txt for Windows).
Extra note on TensorFlow: the current version is written for macOS. Windows users should replace the current TensorFlow import lines with: import tensorflow as tf
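The choice between the two requirements files can be scripted. The following is a minimal sketch and not part of the repository: the helper names requirements_file and pip_install_command are hypothetical, and falling back to requirements.txt on every non-Windows system (including Linux) is an assumption, since only macOS and Windows are mentioned above.

```python
import platform


def requirements_file(system=None):
    """Return the requirements file named in this README for the given OS."""
    # platform.system() returns "Windows", "Darwin" (macOS) or "Linux".
    system = system or platform.system()
    # Assumption: anything that is not Windows uses the default file.
    return "requirements_win.txt" if system == "Windows" else "requirements.txt"


def pip_install_command(system=None):
    """Build the pip command as an argument list (nothing is executed here)."""
    return ["pip", "install", "-r", requirements_file(system)]


print(" ".join(pip_install_command()))
```

The command is only printed; pass the list to subprocess.run if you want to execute it from Python.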
In this section, we will walk through a demonstration of generating and analyzing simulated data using three clients (site 1, site 2, and site 3).
Run script scripts/R/Sim/main.R to generate simulated data for 50 seeds, with the output saved in the data/simulated directory.
- Run script scripts/R/train.local.R to produce local results. Point estimate results such as Coef.local.Site1.csv are stored in each seed folder.
- Run script scripts/R/centralized.R to produce global results. The point estimate results (Coef_central.csv) are stored in each seed folder.
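After the two steps above, it can be useful to confirm that every seed folder received both a local and a central result. The sketch below is illustrative only: the function names are hypothetical, and it assumes nothing about the folder layout beyond the two file names mentioned above, so it simply searches recursively under a root directory such as data/simulated.

```python
from pathlib import Path


def find_result_files(root, filename):
    """Return the sorted paths of every file called `filename` under `root`."""
    return sorted(p for p in Path(root).rglob(filename) if p.is_file())


def folders_missing_central(root,
                            local_name="Coef.local.Site1.csv",
                            central_name="Coef_central.csv"):
    """List folders that hold a local result but no central result."""
    local_dirs = {p.parent for p in find_result_files(root, local_name)}
    central_dirs = {p.parent for p in find_result_files(root, central_name)}
    return sorted(local_dirs - central_dirs)
```

For example, folders_missing_central("data/simulated") returns an empty list once both scripts have run for every seed.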
- Run the following commands to compile Server.java and Client.java:
cd scripts/GLORE
javac -cp Jama-1.0.2.jar Server.java Client.java
- Run script run_glore.py (macOS) or run_glore_win.py (Windows) to start the server and clients, with the output file output_glore.txt stored in each seed folder.
python run_glore.py [path]
For example:
python run_glore.py ../../data/simulated/homogenous
- Run script scripts/data_LR/extract_glore_all.py to extract model coefficients and total training time for all datasets and seeds, with output files Coef_glore.csv and Cov_glore.csv stored in each seed folder.
cd scripts/data_LR
python extract_glore_all.py ../../data/simulated
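To make the idea behind the GLORE steps above concrete, the following sketch shows a distributed Newton-Raphson fit of a logistic regression in which each site shares only its gradient and information-matrix summaries. It is a simplified illustration under stated assumptions, not the Java implementation in scripts/GLORE: the model has just an intercept and one covariate, the function names and toy data are hypothetical, and the inverse of the summed information matrix plays the role of the covariance estimate that Cov_glore.csv presumably stores.

```python
import math


def site_summaries(beta, xs, ys):
    """One site's gradient and information-matrix entries at the current beta."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)


def federated_newton(sites, iterations=25):
    """Fit intercept and slope by summing per-site summaries on the server."""
    beta = [0.0, 0.0]
    cov = None
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xs, ys in sites:
            (a, b), (c, d, e) = site_summaries(beta, xs, ys)
            g0, g1 = g0 + a, g1 + b
            h00, h01, h11 = h00 + c, h01 + d, h11 + e
        # Invert the 2x2 symmetric information matrix; evaluated at the beta
        # before this update, which matches the final beta once converged.
        det = h00 * h11 - h01 * h01
        cov = ((h11 / det, -h01 / det), (-h01 / det, h00 / det))
        beta = [beta[0] + cov[0][0] * g0 + cov[0][1] * g1,
                beta[1] + cov[1][0] * g0 + cov[1][1] * g1]
    return beta, cov


# Hypothetical toy data for three sites: (covariate values, binary outcomes).
example_sites = [([-2.0, 1.0], [0, 0]), ([-1.0, 2.0], [0, 1]), ([0.0, 3.0], [1, 1])]
beta_hat, cov_hat = federated_newton(example_sites)
print("coefficients:", beta_hat)
print("standard errors:", [math.sqrt(cov_hat[0][0]), math.sqrt(cov_hat[1][1])])
```

Because the server only adds up the per-site summaries, the result matches a fit on the pooled data up to floating-point rounding, and the standard errors give the confidence intervals mentioned in the introduction.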
- Change the strategy in scripts/Flower/FL_run_win.py (Windows) or scripts/Flower/FL_run.py (macOS) to select the FL method:
  - Strategy 1: FedAvg
  - Strategy 2: q-FedAvg
  - Strategy 3: FedAvgM
- Run python scripts/Flower/run_flwr_all_win.py [path] (Windows) or python scripts/Flower/run_flwr_all.py [path] (macOS), with the output file output_flwr_fedavg.txt stored in each seed folder.
For example:
python scripts/Flower/run_flwr_all_win.py data/simulated/homogenous
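For readers unfamiliar with the strategies listed above, the sketch below illustrates the server-side aggregation behind FedAvg and the momentum variant FedAvgM. It is a generic illustration and not the Flower implementation used by the scripts: the function names, the flat-list weight format, and the default momentum of 0.9 are assumptions made for this example. q-FedAvg, which reweights clients according to their loss, is not shown.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


def fedavgm_step(global_w, averaged_w, velocity, momentum=0.9, server_lr=1.0):
    """Apply server momentum to the FedAvg result; returns (weights, velocity)."""
    # The pseudo-gradient points from the averaged weights to the old global ones.
    delta = [g - a for g, a in zip(global_w, averaged_w)]
    velocity = [momentum * v + d for v, d in zip(velocity, delta)]
    new_w = [g - server_lr * v for g, v in zip(global_w, velocity)]
    return new_w, velocity


# Two hypothetical clients holding 1 and 3 samples.
averaged = fedavg([[1.0, 0.0], [3.0, 2.0]], [1, 3])
print(averaged)
```

With momentum set to 0 and a server learning rate of 1, fedavgm_step returns the plain FedAvg result.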
- Run python scripts/data_LR/extract_flower_fedavg.py [path] to extract coefficients and communication cost. The same applies to scripts/data_LR/extract_flower_fedavgM.py and scripts/data_LR/extract_flower_Qfedavg.py. For example:
python scripts/data_LR/extract_flower_fedavg.py data/simulated/homogenous
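The three extraction scripts named above take the same path argument, so their invocations can be generated in one place. This is a hedged sketch: the helper extraction_commands is hypothetical, it assumes the commands are launched from the repository root as in the example above, and it assumes each strategy has already been run so that the corresponding output exists.

```python
import sys

# Script paths as listed in this README.
EXTRACT_SCRIPTS = [
    "scripts/data_LR/extract_flower_fedavg.py",
    "scripts/data_LR/extract_flower_fedavgM.py",
    "scripts/data_LR/extract_flower_Qfedavg.py",
]


def extraction_commands(data_path, python=sys.executable):
    """Return one argument list per extraction script for `data_path`."""
    return [[python, script, data_path] for script in EXTRACT_SCRIPTS]


for cmd in extraction_commands("data/simulated/homogenous"):
    # Printed only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```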
- Convert training and testing data to JSON format and copy them to the correct FedProx input data folder.
cd scripts/data_LR
python convert_data_to_json.py
python move_data.py ../../data/simulated/[path]
- Run script scripts/FedProx/fedprox.py, with output files such as fedprox_lr0.01_drop0_mu0 stored in each seed folder.
cd scripts/FedProx
python fedprox.py [path]
- Run python extract_fedprox.py [output path] to extract model coefficients and communication cost, with output files stored in each seed folder.
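FedProx differs from FedAvg by adding a proximal term to each client's local objective, which penalizes drift away from the current global model. The sketch below shows that term in isolation; it is not taken from scripts/FedProx, and the function names and flat-list weight format are assumptions for illustration. The mu0 suffix in the example output name above presumably records a proximal coefficient of 0, in which case the penalty vanishes.

```python
def proximal_penalty(local_w, global_w, mu):
    """Return (mu / 2) * squared Euclidean distance between the two vectors."""
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(local_w, global_w))


def fedprox_objective(local_loss, local_w, global_w, mu):
    """Local loss plus the proximal penalty that a FedProx client minimizes."""
    return local_loss + proximal_penalty(local_w, global_w, mu)


# Hypothetical values: a client whose weights moved away from the global model.
print(fedprox_objective(0.7, [1.0, 2.0], [0.0, 0.0], mu=0.5))
```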
- AUC of prediction task
- Coefficient estimate
- Communication
Run script scripts/Evaluation/extract_time.R to extract the communication results for GLORE. The averaged round results are written to scripts/Evaluation/time_rounds.
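As a reference for the AUC metric listed above, the following sketch computes it directly from its pairwise definition: the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half. It is independent of the repository's evaluation scripts (which list the R package pROC among their requirements); the function name and the toy inputs are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop is fine for small test sets; a rank-based formulation is preferable for large ones.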
S. Li, et al. "Federated learning for clinical structured data: A benchmark comparison of engineering and statistical approaches." arXiv preprint arXiv:2311.03417 (2023).
- Siqi Li (Email: siqili@u.duke.nus.edu)
- Nan Liu (Email: liu.nan@duke-nus.edu.sg)