
bad substitution error #10

Open
WeiCL7777 opened this issue Jul 5, 2023 · 7 comments

Comments
@WeiCL7777

~/WCL/KD/DIST_KD-main/classification$ sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml ${cifar_resnet20} --teacher-model ${cifar_resnet56} --experiment ${checkpoint} --teacher-ckpt ${'./ckpt/ckpt_epoch_240.pth'}
bash: ${'./ckpt/ckpt_epoch_240.pth'}: bad substitution
Hello author, when running the CIFAR experiment I had already downloaded the ckpt file and set its path, but I get the bad substitution error shown above. Could you advise how to fix it? Thanks!

@hunto
Owner

hunto commented Jul 5, 2023

Try removing the ${} and use the command below:

sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
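
For context, ${...} in sh/bash is parameter expansion and must wrap a variable name, so ${'./ckpt/ckpt_epoch_240.pth'} (a quoted literal inside the braces) is rejected as a bad substitution. A minimal sketch of the two valid ways to pass the values (the variable names below are hypothetical, only for illustration):

    # Option 1: pass the values literally, as in the command above.
    # Option 2: define shell variables first, then expand them with ${...}.
    MODEL=cifar_resnet20            # hypothetical variable names
    TEACHER=cifar_resnet56
    EXP=dist_cifar_r20_from_r56
    CKPT=./ckpt/ckpt_epoch_240.pth

    sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml "${MODEL}" \
        --teacher-model "${TEACHER}" --experiment "${EXP}" --teacher-ckpt "${CKPT}"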

@WeiCL7777
Author

Hello author, I tried your suggestion, but I still get the following error:
~/WCL/KD/DIST_KD-main/classification$ sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
tools/dist_train.sh: 5: tools/dist_train.sh: Bad substitution
I looked into it for a while but couldn't find a solution, so I need to ask you again.

@hunto
Owner

hunto commented Jul 5, 2023

It's probably an issue with your sh version. You can run readlink -f $(which sh) to check which shell it actually is.

Alternatively, try bash tools/dist_train.sh or ./tools/dist_train.sh.
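
A quick sketch of that check, assuming a Debian/Ubuntu-style setup where /bin/sh often points to dash (dash does not support several bash-only expansions, hence the "Bad substitution" from the script):

    # See which shell "sh" actually resolves to; dash is a common culprit.
    readlink -f "$(which sh)"    # e.g. /usr/bin/dash on Debian/Ubuntu

    # Run the script with bash explicitly instead of sh:
    bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth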

@WeiCL7777
Author

Thank you, the sh version problem is solved. The next run hits a distributed-training error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24696). I'd like to ask for your advice. My first guess was that the --local_rank=0 argument used to select the GPU is the problem; I initially thought the specified GPU was occupied and tried changing --local_rank, but it had no effect. The full error is below:
bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • python -m torch.distributed.launch --nproc_per_node=1 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    /remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects --local_rank argument to be set, please
    change it to read from os.environ['LOCAL_RANK'] instead. See
    https://pytorch.org/docs/stable/distributed.html#launch-utility for
    further instructions

    warnings.warn(
    usage: train.py [-h] [--dataset {cifar10,cifar100,imagenet}] [--data-path DATA_PATH] [--model MODEL] [--model-config MODEL_CONFIG] [--resume RESUME] [-b BATCH_SIZE]
    [--val-batch-size-multiplier VAL_BATCH_SIZE_MULTIPLIER] [--auxiliary] [--auxiliary_weight AUXILIARY_WEIGHT] [--smoothing SMOOTHING] [--opt OPT] [--opt-eps OPT_EPS]
    [--opt-no-filter] [--momentum MOMENTUM] [--sgd-no-nesterov] [--weight-decay WEIGHT_DECAY] [--clip-grad-norm] [--clip-grad-max-norm CLIP_GRAD_MAX_NORM] [--amp]
    [--sched SCHED] [--decay-epochs DECAY_EPOCHS] [--lr LR] [--warmup-lr WARMUP_LR] [--min-lr MIN_LR] [--epochs EPOCHS] [--warmup-epochs WARMUP_EPOCHS]
    [--decay-rate DECAY_RATE] [--decay_by_epoch] [--image-mean IMAGE_MEAN IMAGE_MEAN IMAGE_MEAN] [--image-std IMAGE_STD IMAGE_STD IMAGE_STD]
    [--interpolation {bilinear,bicubic}] [--color-jitter COLOR_JITTER] [--cutout-length CUTOUT_LENGTH] [--aa AA] [--reprob REPROB] [--remode REMODE] [--drop DROP]
    [--drop-path-rate DROP_PATH_RATE] [--drop-path-strategy {const,linear}] [--mixup MIXUP] [--cutmix CUTMIX] [--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]]
    [--mixup-prob MIXUP_PROB] [--mixup-switch-prob MIXUP_SWITCH_PROB] [--mixup-mode MIXUP_MODE] [--model-ema] [--model-ema-decay MODEL_EMA_DECAY] [--seed SEED]
    [--log-interval LOG_INTERVAL] [-j WORKERS] [--experiment EXPERIMENT] [--slurm] [--local-rank LOCAL_RANK] [--dist-port DIST_PORT] [--kd KD] [--teacher-model TEACHER_MODEL]
    [--teacher-pretrained] [--teacher-no-pretrained] [--teacher-ckpt TEACHER_CKPT] [--kd-loss-weight KD_LOSS_WEIGHT] [--ori-loss-weight ORI_LOSS_WEIGHT]
    [--teacher-module TEACHER_MODULE] [--student-module STUDENT_MODULE] [--dbb] [--dyrep] [--dyrep-adjust-interval DYREP_ADJUST_INTERVAL]
    [--dyrep-max-adjust-epochs DYREP_MAX_ADJUST_EPOCHS] [--dyrep-recal-bn-iters DYREP_RECAL_BN_ITERS] [--dyrep-recal-bn-every-epoch] [--edgenn-config EDGENN_CONFIG]
    train.py: error: unrecognized arguments: --local_rank=0
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24696) of binary: /remote-home/clwei/anaconda3/envs/py39/bin/python
    Traceback (most recent call last):
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in
    main()
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    tools/train.py FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-07-05_13:21:58
host : 04100b78df6c
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 24696)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
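
One detail visible in the usage text above: train.py exposes --local-rank (with a hyphen), while the deprecated torch.distributed.launch appends --local_rank=0 (with an underscore), which argparse then rejects as an unrecognized argument. The deprecation warning points to torchrun, which passes the rank through the LOCAL_RANK environment variable rather than a CLI flag. A sketch of such a launch (same config and paths as the command above):

    # torchrun does not append --local_rank to the script's arguments; workers
    # read their rank from the LOCAL_RANK environment variable instead.
    torchrun --nproc_per_node=1 tools/train.py \
        -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth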

@WeiCL7777
Author

Hello author, the distributed-training problem I mentioned above is basically solved: changing python -m torch.distributed.launch in dist_train.sh directly to torchrun did the trick. The current error is mainly "The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use)", detailed below. Specifying different GPUs hasn't helped so far. I'd appreciate your thoughts, thanks.
bash tools/dist_train.sh 2 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • torchrun --nproc_per_node=2 --master_port=25641 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:25641 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:25641 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/remote-home/clwei/anaconda3/envs/py39/bin/torchrun", line 8, in
sys.exit(main())
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:25641 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:25641 (errno: 98 - Address already in use).

@hunto
Owner

hunto commented Jul 5, 2023

My guess is that changing --nproc_per_node=2 to --nproc_per_node=1 in your latest command should be enough. Your earlier run had a port conflict, which is why --master_port=25641 is used to change the port.
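
As a side note, a generic sketch for checking whether a candidate rendezvous port is free before passing it via --master_port (standard tools only; the port number is just an example):

    # Is anything already listening on the candidate port?
    ss -tln | grep 25641 || echo "port 25641 looks free"

    # Or pick a random high port and hand it to the launcher:
    MASTER_PORT=$(( (RANDOM % 20000) + 20000 ))
    torchrun --nproc_per_node=1 --master_port="${MASTER_PORT}" tools/train.py \
        -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth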

@WeiCL7777
Author

Hello author, I had already tried that as well. I just ran it again and also tried reducing the batch size, but it still times out. Could this be caused by the GPU being occupied?
$ bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • torchrun --nproc_per_node=1 --master_port=25641 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    [E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 12345).
    Traceback (most recent call last):
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/train.py", line 365, in
    main()
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/train.py", line 39, in main
    init_dist(args)
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/lib/utils/dist_utils.py", line 46, in init_dist
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
    TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 12345).
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2540) of binary: /remote-home/clwei/anaconda3/envs/py39/bin/python
    Traceback (most recent call last):
    File "/remote-home/clwei/anaconda3/envs/py39/bin/torchrun", line 8, in
    sys.exit(main())
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
    return f(*args, **kwargs)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-07-06_01:28:25
host : 04100b78df6c
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2540)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
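
One observation on this last log: torchrun was started with --master_port=25641, yet init_process_group (with init_method='env://', which reads MASTER_ADDR/MASTER_PORT) times out connecting to 127.0.0.1:12345, so the script seems to rendezvous on a different port than the launcher (the usage text above also lists a --dist-port option). A generic sketch for inspecting both ports and spotting stale runs left over from earlier failed launches (the <pid> placeholder is hypothetical):

    # See whether anything is listening on either port seen in the logs.
    ss -tln | grep -E '12345|25641' || echo "no listener on 12345 or 25641"

    # Look for leftover training processes from earlier failed runs that may
    # still hold a port or a GPU.
    ps aux | grep -E 'torchrun|tools/train\.py' | grep -v grep

    # If a stale run is found, stop it (<pid> is a placeholder, not a real id).
    # kill <pid>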
