
Crashed after the first training epoch on Windows 11 #1

Closed
izhx opened this issue Dec 2, 2022 · 4 comments
Labels: bug (Something isn't working)

Comments

izhx (Collaborator) commented Dec 2, 2022

Run: python scripts/train.py -c examples/bert_crf/configs/resume.yaml

Environment: Windows 11, i7-12700H, NVIDIA RTX 3070 Laptop GPU.
modelscope==1.0.3 is installed; the same command works fine on Linux.

2022-12-03 00:15:25,329 - modelscope - INFO - epoch [1][200/239]        lr: 5.000e-05, eta: 0:26:17, iter_time: 0.319, data_load_time: 0.005, memory: 4263, loss: 17.1283
2022-12-03 00:15:37,843 - modelscope - WARNING - ('METRICS', 'default', 'ner-metric') not found in ast index file
2022-12-03 00:15:37,843 - modelscope - WARNING - ('METRICS', 'default', 'ner-dumper') not found in ast index file
Total test samples:   0%|                                                                      | 0/463 [00:00<?, ?it/s]
2022-12-03 00:15:38,091 - modelscope - INFO - PyTorch version 1.12.0 Found.
2022-12-03 00:15:38,093 - modelscope - INFO - Loading ast index from C:\Users\zx920\.cache\modelscope\ast_indexer
2022-12-03 00:15:38,147 - modelscope - INFO - Loading done! Current index file version is 1.0.3, with md5 ab126a3e272314963017d9feade29ae0
Total test samples:   0%|                                                                      | 0/463 [00:02<?, ?it/s]

Spawned worker process:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Main process:

Traceback (most recent call last):
  File "scripts/train.py", line 54, in <module>
    main(args)
  File "scripts/train.py", line 21, in main
    trainer.train(args.checkpoint_path)
  File "C:\Users\zx920\workspace\AdaSeq\adaseq\trainers\default_trainer.py", line 354, in train
    return super().train(checkpoint_path=checkpoint_path, *args, **kwargs)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 459, in train
    self.train_loop(self.train_dataloader)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 871, in train_loop
    self.invoke_hook(TrainerStages.after_train_epoch)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 977, in invoke_hook
    getattr(hook, fn_name)(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\hooks\evaluation_hook.py", line 31, in after_train_epoch
    self.do_evaluate(trainer)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\hooks\evaluation_hook.py", line 35, in do_evaluate
    eval_res = trainer.evaluate()
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 484, in evaluate
    metric_classes)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\trainer.py", line 921, in evaluation_loop
    data_loader_iters=self._eval_iters_per_epoch)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\modelscope\trainers\utils\inference.py", line 51, in single_gpu_test
    for i, data in enumerate(data_loader):
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\utils\data\dataloader.py", line 1048, in __init__
    w.start()
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\zx920\.conda\envs\adas\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Users\zx920\.conda\envs\adas\lib\site-packages\torch\multiprocessing\reductions.py", line 145, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
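For reference, the RuntimeError above is PyTorch refusing to pickle a non-leaf tensor that still requires grad into a DataLoader worker process. Below is a minimal standalone sketch of what the error message suggests (not AdaSeq code; the names are made up for illustration): detach the tensor before it crosses a process boundary.

import torch
import torch.multiprocessing as mp

def consume(queue):
    # Runs in the spawned worker; receives a plain tensor with no autograd graph attached.
    print(queue.get())

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # the start method Windows uses
    w = torch.randn(3, requires_grad=True)
    y = w * 2                       # non-leaf tensor that still requires grad

    queue = ctx.Queue()
    # queue.put(y) would trigger the "Cowardly refusing to serialize ..." RuntimeError shown above;
    # detach() first so only the data crosses the process boundary, as the message suggests.
    queue.put(y.detach())

    worker = ctx.Process(target=consume, args=(queue,))
    worker.start()
    worker.join()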
izhx (Collaborator, Author) commented Dec 2, 2022

Every modelscope version from 1.0.0 to 1.1.0 crashes with the same messages.
That's a tricky one. 😿

I'm afraid we can't use it on Windows until this is fixed.

liuyhwangyh commented

Please change workers_per_gpu from 1 to 0 in examples/bert_crf/configs/resume.yaml.

izhx (Collaborator, Author) commented Dec 3, 2022

Please change workers_per_gpu from 1 to 0 in examples/bert_crf/configs/resume.yaml.

Thanks, we located the same problem.
Disabling the DataLoader's multi-process loading, i.e. setting workers_per_gpu: 0, works fine on Windows.

We will change the default configurations.
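For context, workers_per_gpu appears to map to the num_workers argument of the underlying PyTorch DataLoader (this mapping is my reading of the modelscope convention, not something confirmed in this thread). A minimal sketch of why 0 avoids the crash: with num_workers=0 every batch is produced in the main process, so nothing has to be pickled into a spawned worker on Windows.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the real one in resume.yaml is built by AdaSeq.
dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# num_workers=0 (what workers_per_gpu: 0 should translate to) keeps data loading
# in the main process: no worker processes are spawned, so nothing is pickled.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in loader:
    print(batch)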

huangshenno1 added the bug label on Dec 3, 2022
izhx (Collaborator, Author) commented Dec 20, 2022

Fixed

izhx closed this as completed on Dec 20, 2022