
Crashed after the first training epoch on Windows 11 #1

Closed
izhx opened this issue Dec 2, 2022 · 4 comments
Labels: bug (Something isn't working)

Comments

izhx (Collaborator) commented Dec 2, 2022

Run: python scripts/train.py -c examples/bert_crf/configs/resume.yaml

Environment: Windows 11, i7-12700H, NVIDIA RTX 3070 Laptop GPU.
modelscope==1.0.3 is installed; the same command works fine on Linux.

2022-12-03 00:15:25,329 - modelscope - INFO - epoch [1][200/239]        lr: 5.000e-05, eta: 0:26:17, iter_time: 0.319, data_load_time: 0.005, memory: 4263, loss: 17.1283
2022-12-03 00:15:37,843 - modelscope - WARNING - ('METRICS', 'default', 'ner-metric') not found in ast index file
2022-12-03 00:15:37,843 - modelscope - WARNING - ('METRICS', 'default', 'ner-dumper') not found in ast index file
Total test samples:   0%|                                                                      | 0/463 [00:00<?, ?it/s]
2022-12-03 00:15:38,091 - modelscope - INFO - PyTorch version 1.12.0 Found.
2022-12-03 00:15:38,093 - modelscope - INFO - Loading ast index from C:\Users\zx920\.cache\modelscope\ast_indexer
2022-12-03 00:15:38,147 - modelscope - INFO - Loading done! Current index file version is 1.0.3, with md5 ab126a3e272314963017d9feade29ae0
Total test samples:   0%|                                                                      | 0/463 [00:02<?, ?it/s]

Spawned worker process:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Main process:

Traceback (most recent call last):
  File "scripts/train.py", line 54, in <module>
    main(args)
  File "scripts/train.py", line 21, in main
    trainer.train(args.checkpoint_path)
  File "C:\Users\zx920\workspace\AdaSeq\adaseq\trainers\default_trainer.py", line 354, in train
    return super().train(checkpoint_path=checkpoint_path, *args, **kwargs)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 459, in train
    self.train_loop(self.train_dataloader)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 871, in train_loop
    self.invoke_hook(TrainerStages.after_train_epoch)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 977, in invoke_hook
    getattr(hook, fn_name)(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\hooks\evaluation_hook.py", line 31, in after_train_epoch
    self.do_evaluate(trainer)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\hooks\evaluation_hook.py", line 35, in do_evaluate
    eval_res = trainer.evaluate()
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 484, in evaluate
    metric_classes)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 921, in evaluation_loop
    data_loader_iters=self._eval_iters_per_epoch)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\utils\inference.py", line 51, in single_gpu_test
    for i, data in enumerate(data_loader):
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 1048, in __init__
    w.start()
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\multiprocessing\reductions.py", line 145, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
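For reference, the RuntimeError above is PyTorch refusing to pickle a non-leaf tensor that still requires grad into a DataLoader worker process. Below is a minimal standalone sketch of what the error message suggests (not AdaSeq code; the names are made up for illustration): detach the tensor before it crosses a process boundary.

import torch
import torch.multiprocessing as mp

def consume(queue):
    # Runs in the spawned worker; receives a plain tensor with no autograd graph attached.
    print(queue.get())

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # the start method Windows uses
    w = torch.randn(3, requires_grad=True)
    y = w * 2                       # non-leaf tensor that still requires grad

    queue = ctx.Queue()
    # queue.put(y) would trigger the "Cowardly refusing to serialize ..." RuntimeError shown above;
    # detach() first so only the data crosses the process boundary, as the message suggests.
    queue.put(y.detach())

    worker = ctx.Process(target=consume, args=(queue,))
    worker.start()
    worker.join()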
izhx (Collaborator, Author) commented Dec 2, 2022

Every modelscope version from 1.0.0 to 1.1.0 crashes with the same messages.
That's a tricky one. 😿

I'm afraid we can't use it on Windows until this is fixed.

liuyhwangyh commented

Please change workers_per_gpu from 1 to 0 in examples/bert_crf/configs/resume.yaml.

izhx (Collaborator, Author) commented Dec 3, 2022

Please change workers_per_gpu from 1 to 0 in examples/bert_crf/configs/resume.yaml.

Thanks, we located the same problem.
Disabling the DataLoader's multi-process loading, i.e. setting workers_per_gpu: 0, works fine on Windows.

We will change the default configurations.
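For context, workers_per_gpu appears to map to the num_workers argument of the underlying PyTorch DataLoader (this mapping is my reading of the modelscope convention, not something confirmed in this thread). A minimal sketch of why 0 avoids the crash: with num_workers=0 every batch is produced in the main process, so nothing has to be pickled into a spawned worker on Windows.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the real one in resume.yaml is built by AdaSeq.
dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# num_workers=0 (what workers_per_gpu: 0 should translate to) keeps data loading
# in the main process: no worker processes are spawned, so nothing is pickled.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in loader:
    print(batch)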

huangshenno1 added the bug label on Dec 3, 2022
izhx (Collaborator, Author) commented Dec 20, 2022

Fixed

izhx closed this as completed on Dec 20, 2022