My training process is frozen #107

Open
whansk50 opened this issue Jun 7, 2023 · 5 comments

@whansk50

whansk50 commented Jun 7, 2023

Hello,
The training process starts without any problem, but after some time it freezes, like this:

(screenshot: console output stuck mid-training)

The console doesn't show any further output.

I also checked the GPUs at that moment, and GPU utilization (not memory) is at 100% while the process is frozen, which I think is a clue to the problem:

(screenshot: GPU utilization pinned at 100% while the process is frozen)

I tried adjusting parameters like batch_size, the number of workers, etc., but it doesn't help.

Can anyone help?

My environment is miniconda3 with CUDA 11.8, so the versions are:
PyTorch 2.0.0
PyTorch Lightning 2.0.2

@jiangxiluning
Owner

Did you use DDP training? If so, a spawned process may have hit an error and quit, while the main process was still waiting on it.
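One way to check for that (a minimal sketch, not from the original thread, assuming the default NCCL backend on the GPUs) is to turn on distributed debug logging before the Trainer is created, so a crashed or stuck rank shows up in the logs:

    import os

    # Print NCCL collective activity; a hang usually shows one rank missing from a collective.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Make torch.distributed report detailed information about mismatched or stalled collectives.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"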

@whansk50
Author

whansk50 commented Jul 7, 2023

Yeah, I modified the Trainer like this to fit the Lightning version (currently running with a single GPU):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
        logger=wandb_logger,
        callbacks=[checkpoint_callback],
        max_epochs=config.trainer.epochs,
        default_root_dir=root_dir,
        devices=gpus,
        accelerator='cuda',
        benchmark=True,
        sync_batchnorm=True,
        precision=config.precision,
        log_every_n_steps=config.trainer.log_every_n_steps,
        overfit_batches=config.trainer.overfit_batches,
        fast_dev_run=config.trainer.fast_dev_run,
        inference_mode=True,
        check_val_every_n_epoch=config.trainer.check_val_every_n_epoch,
        strategy=DDPStrategy(find_unused_parameters=True)  # DDP requested explicitly
        )

Then is "DDPStrategy" related to this problem? I'm new at using Lightning, so how can I fix it?

@jiangxiluning
Owner

You can run in CPU mode or single-GPU mode to check whether anything bad happens.
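As a minimal sketch of that check (assuming the rest of the setup from the snippet above stays the same): dropping DDPStrategy and pinning the run to one device keeps everything in a single process, so any exception surfaces in the console instead of silently killing a spawned rank:

    from pytorch_lightning import Trainer

    trainer = Trainer(
            accelerator='gpu',   # or 'cpu' to rule out CUDA entirely
            devices=1,           # single process, no DDP involved
            max_epochs=1,        # a short run is enough to reproduce the hang
            )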

@whansk50
Author

whansk50 commented Jul 9, 2023

I already tried it with a single GPU and had no problem.
So isn't there any solution for multiple GPUs?

@jiangxiluning
Owner

Try setting the DataLoader's num_workers to 0 and see.
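For reference, a minimal sketch of that suggestion (the dataset variable and batch size here are placeholders, not from this project): with num_workers=0 the batches are loaded in the main process, which rules out worker subprocesses deadlocking and stalling the DDP ranks:

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
            train_dataset,    # placeholder: the project's own Dataset instance
            batch_size=16,    # placeholder value
            shuffle=True,
            num_workers=0,    # load batches in the main process; no worker subprocesses
            )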
