
MU-SplitFed: A straggler-resilient SFL algorithm in zeroth-order optimization

Project Structure

MU-SplitFed/
├── cezo_fl/                    # Core CeZO-FL implementation
│   ├── __init__.py
│   ├── server.py              # Federated learning server
│   ├── client_test.py         # Client testing utilities
│   ├── random_gradient_estimator.py  # ZO gradient estimation
│   ├── run_client_jobs.py     # Client job execution
│   ├── shared.py              # Shared utilities
│   └── util/                  # Utility modules
│       ├── checkpoint.py      # Model checkpointing
│       ├── compression.py     # Gradient compression
│       ├── data_split.py      # Data splitting utilities
│       ├── dataloaders.py     # Data loading utilities
│       ├── dataset.py         # Dataset implementations
│       ├── language_utils.py  # Language model utilities
│       ├── metrics.py         # Evaluation metrics
│       └── model_helpers.py   # Model helper functions
├── config.py                  # Configuration management
├── preprocess.py              # Data preprocessing
├── run.py                     # Main execution script
├── sl_main_new.py             # Split learning main implementation
├── zo_optimizer_new.py        # Zeroth-order optimizer
└── dev_tools/                 # Development tools
    ├── dev-requirement.txt
    └── README.md

Installation

  1. Clone the repository:
git clone https://github.com/Johnny-Zip/MU-SplitFed.git
cd MU-SplitFed
  2. Install dependencies:
pip install -r requirements.txt
  3. For development dependencies:
pip install -r dev_tools/dev-requirement.txt

Usage

Basic Training

Run split learning with zeroth-order optimization:

python sl_main_new.py --dataset sst2 --large-model opt-125m --iterations 1000 --server-iter 5 --splitted-layer 12

Key Parameters

  • --dataset: Dataset to use (sst2, cb, wsc, wic, multirc, rte, boolq, squad, drop, xsum)
  • --large-model: Model size (opt-125m, opt-1.3b, opt-2.7b, opt-6.7b, opt-13b, opt-30b)
  • --iterations: Number of training iterations
  • --server-iter: Number of server-side iterations per round
  • --splitted-layer: Index of the transformer layer at which to split the model between client and server
  • --lr: Learning rate (default: 1e-4)
  • --mu: Perturbation magnitude for ZO gradient estimation (default: 1e-3)
  • --num-pert: Number of perturbations per gradient estimate (default: 1; see the sketch after this list)
  • --lora: Enable LoRA fine-tuning
  • --lora-r: LoRA rank (default: 8)
  • --lora-alpha: LoRA alpha (default: 16)
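
For reference, the estimator that --mu and --num-pert control lives in cezo_fl/random_gradient_estimator.py. Below is a minimal, self-contained sketch of a two-point (central-difference) ZO estimator of the kind this project uses; the function name and signature are illustrative, not the project's actual API.

import torch

@torch.no_grad()
def zo_gradient_estimate(params, eval_loss, mu=1e-3, num_pert=1):
    """Two-point (central-difference) zeroth-order gradient estimate.

    params:    list of trainable tensors (theta)
    eval_loss: zero-argument closure that runs a forward pass at the
               current parameter values and returns a scalar loss
    """
    grads = [torch.zeros_like(p) for p in params]
    for _ in range(num_pert):
        # Sample one random Gaussian direction z across all parameters.
        zs = [torch.randn_like(p) for p in params]
        # Evaluate L(theta + mu * z).
        for p, z in zip(params, zs):
            p.add_(z, alpha=mu)
        loss_plus = eval_loss()
        # Evaluate L(theta - mu * z).
        for p, z in zip(params, zs):
            p.sub_(z, alpha=2 * mu)
        loss_minus = eval_loss()
        # Restore the original parameters.
        for p, z in zip(params, zs):
            p.add_(z, alpha=mu)
        # Project the directional derivative back onto z, averaging
        # over the num_pert sampled directions.
        coeff = float(loss_plus - loss_minus) / (2 * mu * num_pert)
        for g, z in zip(grads, zs):
            g.add_(z, alpha=coeff)
    return grads

Because only forward passes are required, no activations are kept for backpropagation; with --no-optim the resulting estimate can be applied in place (p -= lr * g) without allocating any optimizer state.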

Example Commands

Small model training:

python sl_main_new.py --dataset sst2 --large-model opt-125m --iterations 500 --server-iter 3 --splitted-layer 8 --lr 1e-4

Large model with LoRA:

python sl_main_new.py --dataset sst2 --large-model opt-1.3b --iterations 1000 --server-iter 5 --splitted-layer 12 --lora --lora-r 16 --lora-alpha 32

Generation tasks:

python sl_main_new.py --dataset squad --large-model opt-125m --iterations 200 --server-iter 2 --splitted-layer 10

Configuration

Configuration is centralized in config.py. Key options include (an illustrative sketch follows the list):

  • Model settings: Model type, dtype, LoRA parameters
  • Training settings: Batch size, learning rate, momentum
  • ZO settings: Perturbation magnitude, number of perturbations, gradient estimation method
  • Split learning: Server iterations, split layer
  • Hardware: CUDA/MPS support, device selection
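
For orientation, the options above map naturally onto a single settings object. The dataclass below is an illustrative sketch only, not the actual contents of config.py; fields marked as documented take their defaults from the flag list above, and the rest are assumptions.

from dataclasses import dataclass

@dataclass
class MUSplitFedConfig:
    # Model settings
    large_model: str = "opt-125m"
    dtype: str = "float32"      # "float16" and "bfloat16" also supported
    lora: bool = False
    lora_r: int = 8             # documented default
    lora_alpha: int = 16        # documented default
    # Training settings
    lr: float = 1e-4            # documented default
    momentum: float = 0.0       # assumption
    batch_size: int = 8         # assumption; no documented default
    # ZO settings
    mu: float = 1e-3            # documented default
    num_pert: int = 1           # documented default
    # Split learning
    server_iter: int = 5        # value from the basic-training example
    splitted_layer: int = 12    # value from the basic-training example
    # Hardware
    device: str = "cuda"        # falls back to MPS/CPU when unavailable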

Supported Models and Datasets

Models

  • OPT-125M, OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-13B, OPT-30B

Datasets

  • Classification: SST-2, CB, WSC, WIC, MultiRC, RTE, BoolQ
  • Generation: SQuAD, DROP, XSum

Memory Optimization

The implementation includes several memory optimization features:

  • Split Learning: Reduces per-device memory by splitting the model between client and server (see the sketch after this list)
  • Gradient Compression: Optional gradient compression techniques
  • Mixed Precision: Support for float16 and bfloat16
  • No Optim Mode: Memory-efficient training without PyTorch optimizers
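
To make the split concrete, here is a minimal sketch of partitioning at --splitted-layer, assuming the model's transformer blocks are exposed as an nn.ModuleList (the function name is illustrative):

import torch.nn as nn

def split_at(layers: nn.ModuleList, split_index: int):
    """Partition a stack of transformer blocks at split_index.

    The client runs layers [0, split_index) and sends the resulting
    activations to the server, which runs the remaining layers. Each
    side holds only its own slice of the model, which is what lets a
    large model fit on memory-constrained clients.
    """
    client_part = nn.Sequential(*layers[:split_index])
    server_part = nn.Sequential(*layers[split_index:])
    return client_part, server_part

In MU-SplitFed, the server side additionally performs --server-iter local iterations per communication round (see Key Parameters above).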

Development

Running Tests

python -m pytest cezo_fl/

Code Style

The codebase follows standard Python style conventions and uses type hints throughout; new contributions should do the same.

Citation

If you use this code in your research, please cite:

@misc{cezo-fl-2024,
  title={CeZO-FL: Communication-Efficient Zeroth-Order Federated Learning},
  author={[Your Name]},
  year={2024},
  howpublished={GitHub Repository},
  url={https://github.com/Johnny-Zip/MU-SplitFed}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments

  • Built on top of Hugging Face Transformers
  • Uses PEFT for LoRA implementation
  • Inspired by federated learning and zeroth-order optimization research

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce batch size or use smaller models
  2. MPS Issues on macOS: The code automatically falls back to CPU if MPS is not available
  3. Model Loading: Ensure you have sufficient disk space for large model downloads

Performance Tips

  • Use --no-optim flag for memory efficiency
  • Adjust --splitted-layer based on your memory constraints
  • Use smaller --num-pert values for faster training
  • Enable LoRA for large models to reduce memory usage

For more detailed information, please refer to the individual module documentation or open an issue.
