[BUG] Unable to train a bloat16-compressed model #545

Open
the-beee opened this issue Jan 4, 2023 · 1 comment
Labels
bug Something isn't working

the-beee commented Jan 4, 2023

Describe the bug

Jan 04 22:30:14.302 [INFO] test-run-1112b accumulated 10 samples for epoch #0 from 2 peers. ETA 0.00 sec (refresh in 0.50 sec)
Jan 04 22:30:14.476 [INFO] Beginning optimizer step #0
Jan 04 22:31:26.924 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_P' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_Q' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:35:47.094 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_P' did not finish.
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_Q' did not finish.
Jan 04 22:35:47.095 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:40:07.221 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
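
The traceback points at the `out=` argument of `torch.matmul`: the inputs are bfloat16 while the preallocated output buffer is float32. A minimal sketch of that kind of mismatch, assuming only PyTorch is installed; the helper name `matmul_into` and the tensor shapes are hypothetical and not taken from hivemind:

```python
import torch

def matmul_into(m, q, p):
    """Mirror the shape of the failing call: write m @ q into the buffer p."""
    return torch.matmul(m, q, out=p)

# bfloat16 inputs, as in a model whose tensors were cast to bfloat16.
m = torch.randn(4, 3).to(torch.bfloat16)
q = torch.randn(3, 2).to(torch.bfloat16)

# A float32 output buffer does not match the bfloat16 result dtype.
p_fp32 = torch.zeros(4, 2, dtype=torch.float32)
try:
    matmul_into(m, q, p_fp32)
    mismatch_error = None
except RuntimeError as e:
    mismatch_error = str(e)

# With a bfloat16 buffer the same call goes through.
p_bf16 = torch.zeros(4, 2, dtype=torch.bfloat16)
matmul_into(m, q, p_bf16)

print("error with float32 buffer:", mismatch_error)
```

This only illustrates the dtype rule for `out=`; which buffers hivemind allocates in float32 is not shown in the logs above.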

To Reproduce

git clone https://github.com/the-beee/naifu-diffusion
cd naifu-diffusion
pip install -r requirements.txt
python trainer.py

Before starting the second peer, please update config/distributed.yaml so that the hivemind section includes the first peer's address.

Environment

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.6
[pip3] torch==1.13.1
[pip3] torch-ema==0.3
[pip3] torchmetrics==0.11.0
[pip3] torchvision==0.14.1
[pip3] hivemind==1.1.4
[conda] Could not collect
the-beee added the bug label Jan 4, 2023
justheuristic (Member) commented

Hi! Thanks for the detailed report! It is indeed a bug, and we'll fix it in the next release.
In the meantime, I'm afraid the only workaround is to keep float32 params with hivemind.Optimizer while the on-device model is in bfloat16.
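
One way to read that workaround is the usual "float32 master weights" pattern: the optimizer only ever sees float32 copies of the parameters, while forward and backward run on a bfloat16 model. The sketch below is an assumption about how this could look, not code from hivemind or from the reporter's repository; `torch.optim.SGD` stands in for `hivemind.Optimizer`, and `master_params` and `train_step` are hypothetical names.

```python
import torch

torch.manual_seed(0)

# bfloat16 "on-device" model used for forward and backward passes.
model = torch.nn.Linear(4, 2).to(torch.bfloat16)

# float32 master copies of the parameters; only these go to the optimizer.
master_params = [
    p.detach().clone().float().requires_grad_(True) for p in model.parameters()
]

# Stand-in optimizer (assumption): in the real setup, hivemind.Optimizer
# would be given the float32 parameters instead.
opt = torch.optim.SGD(master_params, lr=0.1)

def train_step(x, y):
    model.zero_grad(set_to_none=True)
    # Forward in bfloat16; the loss itself is computed in float32.
    loss = torch.nn.functional.mse_loss(model(x).float(), y)
    loss.backward()
    # Hand the bfloat16 gradients to the float32 master parameters.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float()
    opt.step()
    opt.zero_grad(set_to_none=True)
    # Copy the updated float32 weights back into the bfloat16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
    return loss.item()

x = torch.randn(8, 4).to(torch.bfloat16)
y = torch.randn(8, 2)
loss_value = train_step(x, y)
print("loss:", loss_value)
```

With this layout, any gradient averaging done by the optimizer would operate on float32 tensors only, which is the condition the workaround asks for.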

@the-beee the-beee changed the title [BUG] Enable to train a bloat16-compressed model [BUG] Unable to train a bloat16-compressed model Jan 10, 2023