Stops after 1000 steps #13

Closed
GuusDeKroon opened this issue Sep 29, 2022 · 8 comments

Comments

GuusDeKroon commented Sep 29, 2022

Hi!
I've been having this issue where the program stops training at 1000 iters.
Everything else seems to be fine.

Here's the code output:

Another one bites the dust...

Traceback (most recent call last):
  File "main.py", line 852, in <module>
    trainer.test(model, data)
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 938, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1160, in _run
    verify_loop_configurations(self)
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py", line 46, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "C:\Users\Guus\miniconda3\envs\ldm\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py", line 197, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
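
For context, pytorch_lightning raises this MisconfigurationException when `Trainer.test()` is called but the model/datamodule passed in defines no `test_dataloader()` hook. A minimal, self-contained sketch of the hook Lightning looks for (a toy datamodule for illustration, not the one in this repo's main.py):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class SketchDataModule(pl.LightningDataModule):
    """Toy datamodule used only to illustrate the missing hook."""

    def setup(self, stage=None):
        # Placeholder tensors; in the real project this would be the actual dataset.
        self.test_dataset = TensorDataset(torch.randn(8, 3), torch.zeros(8))

    def test_dataloader(self):
        # Without this method, Trainer.test() fails with the
        # MisconfigurationException shown in the traceback above.
        return DataLoader(self.test_dataset, batch_size=4)
```

Conversely, the `--no-test` flag mentioned further down presumably skips the `trainer.test(model, data)` call from main.py entirely, which avoids needing the hook at all.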
@Kallamamran

Do you have "max_training_steps = 1000"?
I don't get mine to run at all, so :/

@GuusDeKroon (Author)

Nope, max training steps is set to 3000.

@djbielejeski (Collaborator)

I've been seeing this too, looking into it right now.

Repository owner deleted a comment from 1blackbar Sep 29, 2022
@djbielejeski (Collaborator)

I tried removing the --no-test param, but I still get this error:

Here comes the checkpoint...
Another one bites the dust...

Traceback (most recent call last):
  File "main.py", line 847, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
    self.val_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 134, in advance
    self._on_evaluation_batch_end(output, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 267, in _on_evaluation_batch_end
    self.trainer._call_callback_hooks(hook_name, output, *kwargs.values())
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 470, in on_validation_batch_end
    self.log_img(pl_module, batch, batch_idx, split="val")
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 434, in log_img
    images = pl_module.log_images(batch, split=split, **self.log_images_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 1328, in log_images
    batch = batch[0]
KeyError: 0
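
The `KeyError: 0` comes from `batch = batch[0]` in `log_images`: indexing a plain dict with `0` is a key lookup, so that line only works when the dataloader wraps the dict in a list or tuple. A hedged sketch of a defensive guard (the helper name is made up for illustration; the actual fix in the repo may look different):

```python
def unwrap_batch(batch):
    """Return the inner batch dict whether or not it arrives wrapped in a list/tuple."""
    if isinstance(batch, (list, tuple)):
        # e.g. [{"image": ..., "caption": ...}] -> {"image": ..., "caption": ...}
        return batch[0]
    # Already a dict (or tensor); indexing it with 0 would raise KeyError, as above.
    return batch
```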

@1blackbar

It's running fine here, past 1000 steps.
[screenshot attached]

@djbielejeski (Collaborator)

Pretty sure I found the issue. It happens when an epoch finishes (the epoch-end write), and the epoch size depends on your training_samples count and your regularization_images count, so you can trigger it by going over 1 epoch. I think I found the spot in the code; testing it now.
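
A rough illustration of that arithmetic; the variable names and counts below are assumptions for the example, not values read from main.py:

```python
# Example numbers only.
training_samples = 20          # instance/training images
regularization_images = 1500   # class/regularization images
batch_size = 1

# With batch_size = 1, one epoch is roughly one pass over the combined dataset,
# so epoch length grows with both image counts.
steps_per_epoch = (training_samples + regularization_images) // batch_size

# Going past one epoch is the trigger described above: the end-of-epoch
# validation/image-logging path runs and hits the KeyError from the traceback.
max_training_steps = 3000
print(steps_per_epoch, max_training_steps > steps_per_epoch)
```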

@djbielejeski (Collaborator)

Fixed here

@djbielejeski (Collaborator)

At least my issue...
