
train.py: error: unrecognized arguments: --local-rank=0 #134

Open
davidvct opened this issue Jan 23, 2024 · 4 comments

Comments

@davidvct

davidvct commented Jan 23, 2024

I get this error when trying to train on the GoPro dataset:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/GoPro/NAFNet-width32.yml --launcher pytorch

I searched train.py and it contains no --local-rank=0.

How can I fix this?

@txy00001

Add this in train.py:
image

@sentinel8b

sentinel8b commented Apr 24, 2024

Change

parser.add_argument('--local_rank', type=int, default=0)

To

parser.add_argument('--local-rank', type=int, default=0)

And I didn't add

os.environ['RANK'] = str(0)
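
For anyone who needs the script to work with both older and newer launchers, here is a minimal sketch that registers both spellings on one argument. It assumes, per the PyTorch documentation for torch.distributed.launch, that PyTorch 2.0 and later pass --local-rank=<rank> while earlier versions pass --local_rank=<rank>. The function name build_parser is hypothetical and this is not the project's own parser. The LOCAL_RANK fallback covers torchrun, which sets that environment variable rather than passing a flag.

```python
import argparse
import os


def build_parser():
    """Hypothetical helper: a parser that accepts either spelling of the flag."""
    parser = argparse.ArgumentParser()
    # Both option strings map to the same destination, args.local_rank.
    # If neither is passed, fall back to the LOCAL_RANK environment variable.
    parser.add_argument(
        '--local_rank', '--local-rank',
        dest='local_rank',
        type=int,
        default=int(os.environ.get('LOCAL_RANK', 0)),
    )
    return parser


# Either spelling parses to the same attribute.
new_style = build_parser().parse_args(['--local-rank=0'])
old_style = build_parser().parse_args(['--local_rank=0'])
print(new_style.local_rank, old_style.local_rank)
```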

@FogSue

FogSue commented May 10, 2024

> Change
>
> parser.add_argument('--local_rank', type=int, default=0)
>
> To
>
> parser.add_argument('--local-rank', type=int, default=0)
>
> And I didn't add
>
> os.environ['RANK'] = str(0)

Thanks. When I tried to use torchrun, it reported "can not open python: no such file". After following your change, it works!

@tobymuller233

> Change
>
> parser.add_argument('--local_rank', type=int, default=0)
>
> To
>
> parser.add_argument('--local-rank', type=int, default=0)
>
> And I didn't add
>
> os.environ['RANK'] = str(0)

It seems that "local-rank", with a "-" in the middle instead of "_", doesn't follow Python's naming rules.
I'm trying to debug a multi-GPU program in VS Code and configured launch.json as follows:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Distributed Training (GPU 0)",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/train.py",
      "console": "integratedTerminal",
      "args": [
        "~/stu_motion/scrfd/configs/scrfd/scrfd_1g.py",
        "--launcher",
        "pytorch",
      ],
      "env": {
        "PYTHONPATH": "${workspaceFolder}/..:${env:PYTHONPATH}",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "0"
      },
      "pythonArgs": [
        "-m",
        "torch.distributed.launch",
        "--nproc_per_node=2",
        "--master_port=29500"
      ]
    },
    {
      "name": "Debug Distributed Training (GPU 1)",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/train.py",
      "console": "integratedTerminal",
      "args": [
        "~/stu_motion/scrfd/configs/scrfd/scrfd_1g.py",
        "--launcher",
        "pytorch",
      ],
      "env": {
        "PYTHONPATH": "${workspaceFolder}/..:${env:PYTHONPATH}",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "1"
      },
      "pythonArgs": [
        "-m",
        "torch.distributed.launch",
        "--nproc_per_node=2",
        "--master_port=29500"
      ]
    }
  ]
}
I don't know whether the naming is the cause, but the program failed to run correctly.
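
On the naming question, a small standard-library check may help. argparse derives the attribute name from the option string and replaces "-" with "_", so a flag declared as --local-rank is still read as args.local_rank, which is a valid Python identifier. The sketch below only demonstrates that behavior. It does not reproduce the launch.json setup above and is not a diagnosis of why that run failed.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local-rank', type=int, default=0)

# The hyphenated flag is stored under the underscore attribute name.
args = parser.parse_args(['--local-rank=1'])
print(args.local_rank)

# The underscore spelling is a different option string, so this parser
# does not recognize it; parse_known_args returns it as a leftover
# instead of exiting with "unrecognized arguments".
known, unknown = parser.parse_known_args(['--local_rank=1'])
print(known.local_rank, unknown)
```

The same mismatch in the other direction (parser declares --local_rank, launcher passes --local-rank) is what produces the error in the issue title.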
