[BUG]: Multi-machine, multi-card training gets stuck at "Launch model Parallel" #4189

@shileims

Description

🐛 Describe the bug

image

Environment

pytorch==1.11, cuda==11.3, colossalai==0.3.0
transformers==4.28.1
