RDMA issue: failed to create extended queue pair (QP): Operation not supported #493

@lewtun

🐛 Describe the bug

Hello torchforgers!

I've successfully run the install command on a single node with 8 x H100s, but when I then try to run the GRPO example with:

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

I get a peculiar RDMA error:

[RLTrainer-0/1] 2025-10-22 23:21:36 INFO Pushing weights for policy version 1
failed to create extended queue pair (QP): Operation not supported
[0]E1022 23:21:36.634458 2876119 hyperactor/src/proc.rs:1175] unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: actor failure: serving unix:@JEdzy6nlnlU4L0ytzv3i4UOA,anon_0_1BsCGXqAsYRV,rdma_manager[0]: processing error: could not create loopback QP for device rdmap79s0: failed to create queue pair (QP): Invalid argument (os error 22)

Any idea what might cause this?
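
A minimal sketch that may help narrow this down, assuming the node exposes its RDMA devices through the standard Linux sysfs tree at /sys/class/infiniband. It lists each device and port with its reported link layer and state, so the entry for rdmap79s0 (the device named in the log above) can be included in the report. The helper names `read_attr` and `list_rdma_devices` are hypothetical and not part of torchforge or Monarch, and the output by itself does not explain why queue pair creation fails.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def list_rdma_devices(root="/sys/class/infiniband"):
    """Collect per-port link info for every RDMA device found under `root`.

    Assumes the usual layout <root>/<device>/ports/<port>/{link_layer,state}.
    Returns an empty list when `root` does not exist (no RDMA stack loaded).
    """
    base = Path(root)
    if not base.is_dir():
        return []
    devices = []
    for dev in sorted(base.iterdir()):
        ports_dir = dev / "ports"
        ports = sorted(ports_dir.iterdir()) if ports_dir.is_dir() else []
        for port in ports:
            devices.append({
                "device": dev.name,
                "port": port.name,
                "link_layer": read_attr(port / "link_layer"),
                "state": read_attr(port / "state"),
            })
    return devices


if __name__ == "__main__":
    # Print one line per device port; paste the output into the issue.
    for entry in list_rdma_devices():
        print(entry)
```

Running the script directly on the affected node prints one dictionary per device port. An empty result would suggest that no RDMA devices are visible to the process at all.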

Full stack trace attached:
forge-bug.txt

Versions

No response
