Communication-Efficient Training Workload Balancing for Decentralized Multi-Agent Learning

Fully Distributed Split Federated Learning

This repository contains the Python implementation of ComDML, a novel approach to communication-efficient training workload balancing in Decentralized Multi-agent Learning (DML). ComDML tackles the challenge of uneven training workloads in decentralized systems, enabling efficient model training without relying on a central server. By letting agents collaborate and share the training burden, ComDML supports scalable and resource-optimized DML applications. The paper describing ComDML will be published at the 2024 IEEE International Conference on Distributed Computing Systems (ICDCS 2024) in New Jersey.

The ComDML training workflow is illustrated in the Training Process figure included in the repository.

Implementation

This repository includes the necessary code to replicate experiments and run ComDML on various datasets and models. To run ComDML, ensure you have the following dependencies installed:

  • Python
  • PyTorch
  • NumPy
  • torchvision
  • pandas
  • scikit-learn
  • matplotlib
  • wandb

To install the required Python packages, you can use pip:

pip install torch torchvision numpy pandas scikit-learn matplotlib wandb
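
After installation, a quick sanity check (an illustrative snippet, not part of the ComDML codebase) can confirm that the core libraries are importable and whether a GPU is visible:

```python
# Minimal environment check (illustrative only; not part of this repository).
import torch
import torchvision

print(f"PyTorch version:     {torch.__version__}")
print(f"torchvision version: {torchvision.__version__}")
print(f"CUDA available:      {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device:          {torch.cuda.get_device_name(0)}")
```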

Abstract

Decentralized Multi-agent Learning (DML) enables collaborative model training while preserving data privacy. However, inherent heterogeneity in agents’ resources may lead to substantial variations in training times, causing straggler effects and resource wastage. To address this, ComDML balances workload among agents through decentralized workload offloading, leveraging local-loss split training. This approach optimizes workload balancing by considering both communication and computation capacities of agents. A dynamic decentralized pairing scheduler efficiently pairs agents and determines optimal offloading amounts. ComDML demonstrates robustness in heterogeneous environments, significantly reducing training time while maintaining model accuracy.

Introduction

Effective training of Deep Neural Networks (DNNs) often requires access to vast amounts of data, which raises privacy concerns and communication costs. Federated Learning (FL) algorithms have emerged as a solution, but they struggle in heterogeneous environments, and traditional FL methods depend on a central server that is prone to bottlenecks and single points of failure. Decentralized systems offer improved resilience and security. ComDML therefore introduces a decentralized approach that balances the workload across agents to minimize training time without a central coordinator.

Workload Balancing for Decentralized Multi-agent Learning

ComDML employs local-loss-based split training to achieve workload balancing in DML systems. This approach allows slower agents to offload part of their workload to faster agents, ensuring efficient utilization of resources. The workload-balancing problem, which accounts for both the communication and computation capacities of agents, is formulated as an integer programming problem, and a dynamic decentralized pairing scheduler efficiently pairs agents based on their observed capabilities.
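
To make the idea concrete, the following is a minimal sketch of local-loss-based split training between one slow agent and one fast agent. The module names, the cut layer, and the toy data are assumptions made purely for illustration; they are not the repository's actual models or API.

```python
# Illustrative sketch of local-loss-based split training between a slow and a
# fast agent (hypothetical module/variable names; not the repository's API).
import torch
import torch.nn as nn

# Split a small CNN at an assumed cut layer: the slow agent holds `front`,
# the fast agent holds `back`. The slow agent also keeps a tiny auxiliary
# head so it can compute a local loss without waiting for the fast agent.
front = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
aux_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 16, 10))              # slow agent
back = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))   # fast agent

opt_slow = torch.optim.SGD(list(front.parameters()) + list(aux_head.parameters()), lr=0.01)
opt_fast = torch.optim.SGD(back.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One toy batch of CIFAR-10-sized inputs.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))

# --- Slow agent: forward through its partition and update with a local loss.
smashed = front(x)                                   # activations at the cut layer
local_loss = criterion(aux_head(smashed), y)
opt_slow.zero_grad()
local_loss.backward()
opt_slow.step()

# --- Fast agent: receives the detached activations (simulated here by .detach())
# and labels, then trains the remaining layers with the main loss.
sent = smashed.detach()
main_loss = criterion(back(sent), y)
opt_fast.zero_grad()
main_loss.backward()
opt_fast.step()

print(f"local loss: {local_loss.item():.4f}, offloaded-part loss: {main_loss.item():.4f}")
```

In this toy setup the slow agent updates its partition using only its local auxiliary loss, so it never waits for gradients from the fast agent; only the detached cut-layer activations and labels are communicated, which is what makes offloading part of the model to a faster peer cheap.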

Decentralized Workload Balancing

To effectively implement workload balancing, ComDML utilizes a dynamic decentralized pairing scheduler. This scheduler pairs agents based on their computation and communication capacities, minimizing overall training time. The training workflow involves split model profiling, agent pairing, and model aggregation, achieving resource optimization with minimal overhead.
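
As a rough intuition for the pairing step, the sketch below greedily matches the slowest remaining agent with the fastest remaining one using a toy speed score. The agent profiles, the scoring function, and the greedy rule are hypothetical simplifications; the scheduler described in the paper determines pairs and offloading amounts jointly via the integer-programming formulation above.

```python
# Illustrative greedy pairing of slow and fast agents by a combined
# computation/communication score (hypothetical profiles; not the actual
# ComDML scheduler).
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    compute: float    # e.g., relative compute throughput of the device
    bandwidth: float  # e.g., relative link capacity to its peers

def speed_score(a: Agent) -> float:
    # Toy score: an agent is only as fast as its weaker resource.
    return min(a.compute, a.bandwidth)

agents = [
    Agent("phone", 1.0, 2.0), Agent("laptop", 4.0, 3.0),
    Agent("edge-server", 8.0, 6.0), Agent("raspberry-pi", 0.5, 1.0),
]

# Sort by score, then greedily pair the slowest with the fastest so that each
# slow agent has a strong helper to offload part of its model to.
ranked = sorted(agents, key=speed_score)
pairs = []
while len(ranked) >= 2:
    slow, fast = ranked.pop(0), ranked.pop(-1)
    pairs.append((slow.name, fast.name))

print(pairs)  # [('raspberry-pi', 'edge-server'), ('phone', 'laptop')]
```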
