My training process is frozen #107

Open
whansk50 opened this issue Jun 7, 2023 · 5 comments

@whansk50

whansk50 commented Jun 7, 2023

Hello,
The training process starts without any problem, but after some time it freezes, like this:

(screenshot: console output stuck mid-training)

The console doesn't show any further output.

I also checked the GPUs at that moment, and GPU utilization (not memory) is at 100% while the process is frozen, which I think is a clue to the problem:

(screenshot: GPU utilization pinned at 100% while the process is frozen)

I tried adjusting parameters like batch_size, the number of workers, etc., but it doesn't help.

Can anyone help?

My environment is miniconda3 with CUDA 11.8, so the versions are:
PyTorch 2.0.0
PyTorch Lightning 2.0.2

@jiangxiluning
Owner

Did you use DDP training? If so, a spawned process may have hit an error and quit, while the main process was still waiting on it.
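One way to check for that (a minimal sketch, not from the original thread, assuming the default NCCL backend on the GPUs) is to turn on distributed debug logging before the Trainer is created, so a crashed or stuck rank shows up in the logs:

    import os

    # Print NCCL collective activity; a hang usually shows one rank missing from a collective.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Make torch.distributed report detailed information about mismatched or stalled collectives.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"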

@whansk50
Author

whansk50 commented Jul 7, 2023

Yeah, I modified the Trainer like this to fit the Lightning version (currently running with a single GPU):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
        logger=wandb_logger,
        callbacks=[checkpoint_callback],
        max_epochs=config.trainer.epochs,
        default_root_dir=root_dir,
        devices=gpus,
        accelerator='cuda',
        benchmark=True,
        sync_batchnorm=True,
        precision=config.precision,
        log_every_n_steps=config.trainer.log_every_n_steps,
        overfit_batches=config.trainer.overfit_batches,
        fast_dev_run=config.trainer.fast_dev_run,
        inference_mode=True,
        check_val_every_n_epoch=config.trainer.check_val_every_n_epoch,
        strategy=DDPStrategy(find_unused_parameters=True)  # DDP requested explicitly
        )

Then is "DDPStrategy" related to this problem? I'm new at using Lightning, so how can I fix it?

@jiangxiluning
Owner

You can run in CPU mode or single-GPU mode to check whether anything bad happens.
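As a minimal sketch of that check (assuming the rest of the setup from the snippet above stays the same): dropping DDPStrategy and pinning the run to one device keeps everything in a single process, so any exception surfaces in the console instead of silently killing a spawned rank:

    from pytorch_lightning import Trainer

    trainer = Trainer(
            accelerator='gpu',   # or 'cpu' to rule out CUDA entirely
            devices=1,           # single process, no DDP involved
            max_epochs=1,        # a short run is enough to reproduce the hang
            )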

@whansk50
Author

whansk50 commented Jul 9, 2023

I already tried it with a single GPU and had no problem.
So isn't there any solution for multiple GPUs?

@jiangxiluning
Owner

Try setting the DataLoader's num_workers to 0 and see.
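For reference, a minimal sketch of that suggestion (the dataset variable and batch size here are placeholders, not from this project): with num_workers=0 the batches are loaded in the main process, which rules out worker subprocesses deadlocking and stalling the DDP ranks:

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
            train_dataset,    # placeholder: the project's own Dataset instance
            batch_size=16,    # placeholder value
            shuffle=True,
            num_workers=0,    # load batches in the main process; no worker subprocesses
            )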
