
bad substitution error #10

Open
WeiCL7777 opened this issue Jul 5, 2023 · 7 comments

Comments
@WeiCL7777

~/WCL/KD/DIST_KD-main/classification$ sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml ${cifar_resnet20} --teacher-model ${cifar_resnet56} --experiment ${checkpoint} --teacher-ckpt ${'./ckpt/ckpt_epoch_240.pth'}
bash: ${'./ckpt/ckpt_epoch_240.pth'}: bad substitution
Hello author, when running the CIFAR experiment I had already downloaded the ckpt file and set its path, but I get the bad substitution error shown above. Could you advise how to fix it? Thanks!

@hunto
Owner

hunto commented Jul 5, 2023

Try removing the ${} and use the command below:

sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
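
For context, ${...} in sh/bash is parameter expansion and must wrap a variable name, so ${'./ckpt/ckpt_epoch_240.pth'} (a quoted literal inside the braces) is rejected as a bad substitution. A minimal sketch of the two valid ways to pass the values (the variable names below are hypothetical, only for illustration):

    # Option 1: pass the values literally, as in the command above.
    # Option 2: define shell variables first, then expand them with ${...}.
    MODEL=cifar_resnet20            # hypothetical variable names
    TEACHER=cifar_resnet56
    EXP=dist_cifar_r20_from_r56
    CKPT=./ckpt/ckpt_epoch_240.pth

    sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml "${MODEL}" \
        --teacher-model "${TEACHER}" --experiment "${EXP}" --teacher-ckpt "${CKPT}"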

@WeiCL7777
Author

Hello author, I tried your suggestion, but I still get the following error:
~/WCL/KD/DIST_KD-main/classification$ sh tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
tools/dist_train.sh: 5: tools/dist_train.sh: Bad substitution
I looked into it for a while but couldn't find a solution, so I need to ask you again.

@hunto
Owner

hunto commented Jul 5, 2023

It's probably an issue with your sh version. You can run readlink -f $(which sh) to check which shell it actually is.

Alternatively, try bash tools/dist_train.sh or ./tools/dist_train.sh.
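
A quick sketch of that check, assuming a Debian/Ubuntu-style setup where /bin/sh often points to dash (dash does not support several bash-only expansions, hence the "Bad substitution" from the script):

    # See which shell "sh" actually resolves to; dash is a common culprit.
    readlink -f "$(which sh)"    # e.g. /usr/bin/dash on Debian/Ubuntu

    # Run the script with bash explicitly instead of sh:
    bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth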

@WeiCL7777
Author

Thank you, the sh version problem is solved. The next run hits a distributed-training error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24696). I'd like to ask for your advice. My first guess was that the --local_rank=0 argument used to select the GPU is the problem; I initially thought the specified GPU was occupied and tried changing --local_rank, but it had no effect. The full error is below:
bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • python -m torch.distributed.launch --nproc_per_node=1 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    /remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects --local_rank argument to be set, please
    change it to read from os.environ['LOCAL_RANK'] instead. See
    https://pytorch.org/docs/stable/distributed.html#launch-utility for
    further instructions

    warnings.warn(
    usage: train.py [-h] [--dataset {cifar10,cifar100,imagenet}] [--data-path DATA_PATH] [--model MODEL] [--model-config MODEL_CONFIG] [--resume RESUME] [-b BATCH_SIZE]
    [--val-batch-size-multiplier VAL_BATCH_SIZE_MULTIPLIER] [--auxiliary] [--auxiliary_weight AUXILIARY_WEIGHT] [--smoothing SMOOTHING] [--opt OPT] [--opt-eps OPT_EPS]
    [--opt-no-filter] [--momentum MOMENTUM] [--sgd-no-nesterov] [--weight-decay WEIGHT_DECAY] [--clip-grad-norm] [--clip-grad-max-norm CLIP_GRAD_MAX_NORM] [--amp]
    [--sched SCHED] [--decay-epochs DECAY_EPOCHS] [--lr LR] [--warmup-lr WARMUP_LR] [--min-lr MIN_LR] [--epochs EPOCHS] [--warmup-epochs WARMUP_EPOCHS]
    [--decay-rate DECAY_RATE] [--decay_by_epoch] [--image-mean IMAGE_MEAN IMAGE_MEAN IMAGE_MEAN] [--image-std IMAGE_STD IMAGE_STD IMAGE_STD]
    [--interpolation {bilinear,bicubic}] [--color-jitter COLOR_JITTER] [--cutout-length CUTOUT_LENGTH] [--aa AA] [--reprob REPROB] [--remode REMODE] [--drop DROP]
    [--drop-path-rate DROP_PATH_RATE] [--drop-path-strategy {const,linear}] [--mixup MIXUP] [--cutmix CUTMIX] [--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]]
    [--mixup-prob MIXUP_PROB] [--mixup-switch-prob MIXUP_SWITCH_PROB] [--mixup-mode MIXUP_MODE] [--model-ema] [--model-ema-decay MODEL_EMA_DECAY] [--seed SEED]
    [--log-interval LOG_INTERVAL] [-j WORKERS] [--experiment EXPERIMENT] [--slurm] [--local-rank LOCAL_RANK] [--dist-port DIST_PORT] [--kd KD] [--teacher-model TEACHER_MODEL]
    [--teacher-pretrained] [--teacher-no-pretrained] [--teacher-ckpt TEACHER_CKPT] [--kd-loss-weight KD_LOSS_WEIGHT] [--ori-loss-weight ORI_LOSS_WEIGHT]
    [--teacher-module TEACHER_MODULE] [--student-module STUDENT_MODULE] [--dbb] [--dyrep] [--dyrep-adjust-interval DYREP_ADJUST_INTERVAL]
    [--dyrep-max-adjust-epochs DYREP_MAX_ADJUST_EPOCHS] [--dyrep-recal-bn-iters DYREP_RECAL_BN_ITERS] [--dyrep-recal-bn-every-epoch] [--edgenn-config EDGENN_CONFIG]
    train.py: error: unrecognized arguments: --local_rank=0
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24696) of binary: /remote-home/clwei/anaconda3/envs/py39/bin/python
    Traceback (most recent call last):
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in
    main()
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    tools/train.py FAILED


Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-07-05_13:21:58
host : 04100b78df6c
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 24696)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
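
One detail visible in the usage text above: train.py exposes --local-rank (with a hyphen), while the deprecated torch.distributed.launch appends --local_rank=0 (with an underscore), which argparse then rejects as an unrecognized argument. The deprecation warning points to torchrun, which passes the rank through the LOCAL_RANK environment variable rather than a CLI flag. A sketch of such a launch (same config and paths as the command above):

    # torchrun does not append --local_rank to the script's arguments; workers
    # read their rank from the LOCAL_RANK environment variable instead.
    torchrun --nproc_per_node=1 tools/train.py \
        -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth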

@WeiCL7777
Author

Hello author, the distributed-training problem I mentioned above is basically solved: changing python -m torch.distributed.launch in dist_train.sh directly to torchrun did the trick. The current error is mainly "The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use)", detailed below. Specifying different GPUs hasn't helped so far. I'd appreciate your thoughts, thanks.
bash tools/dist_train.sh 2 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • torchrun --nproc_per_node=2 --master_port=25641 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:25641 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:25641 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/remote-home/clwei/anaconda3/envs/py39/bin/torchrun", line 8, in
sys.exit(main())
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:25641 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:25641 (errno: 98 - Address already in use).

@hunto
Owner

hunto commented Jul 5, 2023

My guess is that changing --nproc_per_node=2 to --nproc_per_node=1 in your latest command should be enough. Your earlier run had a port conflict, which is why --master_port=25641 is used to change the port.
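
As a side note, a generic sketch for checking whether a candidate rendezvous port is free before passing it via --master_port (standard tools only; the port number is just an example):

    # Is anything already listening on the candidate port?
    ss -tln | grep 25641 || echo "port 25641 looks free"

    # Or pick a random high port and hand it to the launcher:
    MASTER_PORT=$(( (RANDOM % 20000) + 20000 ))
    torchrun --nproc_per_node=1 --master_port="${MASTER_PORT}" tools/train.py \
        -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 \
        --teacher-model cifar_resnet56 --experiment checkpoint \
        --teacher-ckpt ./ckpt/ckpt_epoch_240.pth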

@WeiCL7777
Author

Hello author, I had already tried that as well. I just ran it again and also tried reducing the batch size, but it still times out. Could this be caused by the GPU being occupied?
$ bash tools/dist_train.sh 1 configs/strategies/distill/dist_cifar.yaml cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth

  • torchrun --nproc_per_node=1 --master_port=25641 tools/train.py -c configs/strategies/distill/dist_cifar.yaml --model cifar_resnet20 --teacher-model cifar_resnet56 --experiment checkpoint --teacher-ckpt ./ckpt/ckpt_epoch_240.pth
    [E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 12345).
    Traceback (most recent call last):
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/train.py", line 365, in
    main()
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/train.py", line 39, in main
    init_dist(args)
    File "/remote-home/clwei/WCL/KD/DIST_KD-main/classification/tools/lib/utils/dist_utils.py", line 46, in init_dist
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
    TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 12345).
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2540) of binary: /remote-home/clwei/anaconda3/envs/py39/bin/python
    Traceback (most recent call last):
    File "/remote-home/clwei/anaconda3/envs/py39/bin/torchrun", line 8, in
    sys.exit(main())
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
    return f(*args, **kwargs)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/remote-home/clwei/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-07-06_01:28:25
host : 04100b78df6c
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2540)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
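
One observation on this last log: torchrun was started with --master_port=25641, yet init_process_group (with init_method='env://', which reads MASTER_ADDR/MASTER_PORT) times out connecting to 127.0.0.1:12345, so the script seems to rendezvous on a different port than the launcher (the usage text above also lists a --dist-port option). A generic sketch for inspecting both ports and spotting stale runs left over from earlier failed launches (the <pid> placeholder is hypothetical):

    # See whether anything is listening on either port seen in the logs.
    ss -tln | grep -E '12345|25641' || echo "no listener on 12345 or 25641"

    # Look for leftover training processes from earlier failed runs that may
    # still hold a port or a GPU.
    ps aux | grep -E 'torchrun|tools/train\.py' | grep -v grep

    # If a stale run is found, stop it (<pid> is a placeholder, not a real id).
    # kill <pid>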
