distributed sampler fix #467

Merged
jpata merged 14 commits into main from jp_20260325_samplerfix2
Mar 28, 2026


Conversation


@jpata jpata commented Mar 25, 2026

The distributed sampler was causing issues on LUMI with 8 GPUs when different workers exhausted the dataset at different times. The CI tests are expected to fail: the main branch contains an outdated test, which this PR removes.
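For context, a minimal sketch (hypothetical, not the PR's actual code) of the general failure mode: if ranks are handed shards of unequal length, the ranks with shorter shards finish the epoch early, and the next collective operation (e.g. the gradient all-reduce) blocks forever waiting on them. Truncating every shard to a common length, as PyTorch's `DistributedSampler(drop_last=True)` does, restores equal lengths:

```python
def shard_lengths(dataset_len, world_size, drop_last):
    """Number of samples each rank sees under round-robin sharding.

    Toy model of distributed sharding; names and logic are illustrative,
    not taken from the PR.
    """
    if drop_last:
        # Truncate the tail so every rank gets exactly the same count.
        per_rank = dataset_len // world_size
        return [per_rank] * world_size
    # Without dropping, the first (dataset_len % world_size) ranks get one
    # extra sample, so ranks run out of data at different times -- a
    # subsequent collective op then deadlocks.
    base = dataset_len // world_size
    extra = dataset_len % world_size
    return [base + (1 if rank < extra else 0) for rank in range(world_size)]

uneven = shard_lengths(1_000_003, 8, drop_last=False)
even = shard_lengths(1_000_003, 8, drop_last=True)
print(len(set(uneven)) > 1)  # True: ranks disagree -> potential hang
print(len(set(even)) == 1)   # True: all ranks agree -> safe
```

Whether the actual fix drops the tail, pads shards, or restructures the sampler is determined by the PR's diff; this sketch only illustrates why uneven exhaustion hangs multi-GPU training.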

TODO:

  • wait until pyg-cld-hits-v1_cld_20260326_093401_088146 has trained on LUMI for 24h and post the loss curves to check that it doesn't crash now

The learning rate might need tuning, but there seem to be no further issues with the sampler. EDIT: found a crash on 2x L40S after 40k steps; fix attempt in be6b00a.

[Screenshot of loss curves, 2026-03-27 08:09:03]
  • wait until pyg-cld-hits-v1_cld_20260327_081645_454635 has trained on LUMI with the fix from be6b00a
  • wait until pyg-cld-hits-v1_cld_20260327_081548_787211 has trained in Tallinn, same as above
[Screenshot of loss curves, 2026-03-28 08:07:00]

@jpata jpata merged commit b4e03f2 into main Mar 28, 2026
3 checks passed
