
Assorted Issues #1331

Closed

vedantroy opened this issue Jul 30, 2022 · 13 comments

Labels: bug (Something isn't working)

@vedantroy

**Environment**

  • OS: Ubuntu 22.04 LTS
  • GPU:
  *-display                 
       description: VGA compatible controller
       product: GP102 [GeForce GTX 1080 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:58 memory:fa000000-faffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff
  *-graphics
       product: EFI VGA
       physical id: 2
       logical name: /dev/fb0
       capabilities: fb
       configuration: depth=32 resolution=1024,768
  • Cuda: 11.7
  • Composer: 0.8.2

**Issues**

  1. loss should specify micro_batch instead of batch
  2. Cannot log multiple losses; allow me to return a dictionary of losses
  3. Cannot log things at the batch level (not micro-batch level) inside of the loss method
  4. (Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works
  5. (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set
@hanlint
Contributor

hanlint commented Jul 31, 2022

Thanks @vedantroy for reporting these, we'll take a look at each carefully.

For #5 above, you should see something in the logs (e.g.):

INFO:composer.trainer.trainer:CUDA out of memory detected.
Gradient Accumulation increased from 1 -> 2, and the batch
will be retrained.

Let us know if that's not the case!

@vedantroy
Author

> Thanks @vedantroy for reporting these, we'll take a look at each carefully.
>
> For #5 above, you should see something in the logs (e.g.):
>
> INFO:composer.trainer.trainer:CUDA out of memory detected.
> Gradient Accumulation increased from 1 -> 2, and the batch
> will be retrained.
>
> Let us know if that's not the case!

I do not see that 😔, which is why I added it to the list of issues.

@hanlint
Contributor

hanlint commented Jul 31, 2022

Sounds good, tagging @mvpatel2000 for auto grad accum.

Curious what the use case is for logging the loss within the microbatch. Was this to debug an error? Ideally, microbatching is just an implementation detail, and for convergence purposes you can just look at the batch loss.

@vedantroy
Author

vedantroy commented Jul 31, 2022

@hanlint
My trainer has this:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return mse_loss + vb_loss

I want to log both mse_loss and vb_loss; right now composer only logs the final sum.
Ideally, I could return a dictionary like this:

return {  "mse_loss: mse_loss, "vb_loss: vb_loss }

and composer would log both of the losses. This feature would be very useful for debugging, since I need to see how both of my losses change.

Right now, I can't do this. So to work around the issue, I do the following:

        self.logger.data_batch({"mse_loss": mse_loss, "vb_loss": vb_loss})
        return mse_loss + vb_loss

However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only logs the data point from the last micro-batch instead of doing some sort of reduction.
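
For reference, the kind of manual reduction I have in mind looks roughly like this (just a sketch; the loss_buffer attribute and the callback are made up for illustration and are not composer API). loss() would append the detached per-micro-batch losses to self.loss_buffer, and a small callback would log their mean at batch_end:

import torch
from composer import Callback

class MicrobatchLossReducer(Callback):
    # Averages the per-micro-batch losses buffered by the model's loss() method,
    # logs them once per batch, then clears the buffer.
    def __init__(self, model):
        self.model = model  # the ComposerModel whose loss() fills `loss_buffer`

    def batch_end(self, state, logger):
        buffer = getattr(self.model, "loss_buffer", None)
        if buffer:
            mse = torch.stack([m for m, _ in buffer]).mean().item()
            vb = torch.stack([v for _, v in buffer]).mean().item()
            logger.data_batch({"mse_loss": mse, "vb_loss": vb})
            buffer.clear()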

Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

See:

[screenshot: learning-rate plot]

However, my learning rate stays flat:
[screenshot: flat learning-rate plot]
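
For comparison, the decay I expect looks like what plain PyTorch's CosineAnnealingLR produces (using torch's scheduler here as a stand-in, not composer's CosineAnnealingScheduler):

# Sanity check of the expected cosine decay shape with plain PyTorch.
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

for step in range(1000):
    opt.step()
    sched.step()
    if step % 250 == 0:
        # lr should fall smoothly from 1e-4 toward 0, not stay flat
        print(step, opt.param_groups[0]["lr"])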

@ravi-mosaicml
Contributor

Hi there! Thanks for raising these issues -- appreciate it!

> loss should specify micro_batch instead of batch

> Cannot log things at the batch level (not micro-batch level) inside of the loss method

Can you elaborate? If I understand correctly, you would like to be able to access the loss from each microbatch?

> Cannot log multiple losses; allow me to return a dictionary of losses

Noted, we'll see if we can allow custom types (i.e. dictionaries) for the loss output. We do support a tuple for the loss function output. In the meantime, you can do something like:

    def loss(self, out, micro_batch):
        mse_loss, vb_loss = self.diffusion.training_losses(
            out.model_out, x_0=out.x_0, x_t=out.x_t, t=out.t, noise=out.noise
        )
        return (mse_loss, vb_loss)

> However, logger.data_batch logs once per batch (instead of every micro-batch). That's totally fine by me, except it looks like the logger only logs the data point from the last micro-batch instead of doing some sort of reduction.

This is likely because the optimization step (rather than the microbatch step) is given to wandb.log as the step argument (https://docs.wandb.ai/guides/track/log#stepwise-and-incremental-logging). We do not support metric logging at the microbatch level, as the number of microbatches should not affect convergence. As a workaround, you could log each microbatch to a separate plot.
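
If you do want the per-microbatch values, a plain-wandb workaround (not something our WandBLogger does for you; the metric names below are placeholders) is to give them their own step axis via wandb.define_metric, so they are not dropped for having a smaller step than the trainer's:

# Log per-microbatch values against their own x-axis in wandb.
import wandb

wandb.init(project="my-project")  # placeholder project name
wandb.define_metric("microbatch_step")
wandb.define_metric("mse_loss_micro", step_metric="microbatch_step")

for i, mse in enumerate([0.9, 0.7, 0.6, 0.5]):  # placeholder values
    wandb.log({"mse_loss_micro": mse, "microbatch_step": i})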

> (Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works

Can you share the stack trace? CC @mvpatel2000

> (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

Noted, we'll make sure the print statement is always visible (currently it requires the log level to be configured correctly). CC @mvpatel2000.
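
In the meantime, you can raise the log level yourself so the message shows up; a minimal sketch using the standard logging module (the logger name is taken from the message quoted earlier in the thread):

# Make INFO-level trainer messages (like the grad accum message above) reach stdout.
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger("composer.trainer.trainer").setLevel(logging.INFO)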

> Also, another bug: the CosineAnnealingScheduler does not decrease the learning rate. However, the CosineAnnealingWithWarmupScheduler does.

Noted, thanks! We'll take a look.

@mvpatel2000
Contributor

Thanks for bringing this up! Feedback is super helpful.

> (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set

We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!

> (Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works

If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

@abhi-mosaic
Member

abhi-mosaic commented Aug 2, 2022

Hey @vedantroy, I'm having a bit of trouble reproducing the LR issue. Could you try running it again, printing the logs to the console via ProgressBarLogger(progress_bar=False, log_to_console=True, log_level='BATCH'), and report back the raw float values for lr?

I'm trying to figure out if there is something missing in my testing, or if it's just WandB's logging of float values that is truncated... e.g., in the early stage of a cosine decay, the true value might be 0.9999912... but it's just getting displayed as 1.0.
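
For reference, roughly how I'd wire that up (the argument names are from memory of the 0.8-era API, and everything other than loggers is a placeholder for your own setup):

from composer import Trainer
from composer.loggers import ProgressBarLogger

trainer = Trainer(
    model=model,                        # your ComposerModel
    train_dataloader=train_dataloader,  # your dataloader
    max_duration="10ep",
    loggers=[ProgressBarLogger(progress_bar=False, log_to_console=True, log_level="BATCH")],
)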

@vedantroy
Author

@abhi-mosaic, I turned on ProgressBarLogger, but I did not find any difference in the console output.

@vedantroy
Author

> Thanks for bringing this up! Feedback is super helpful.
>
> (Bug) Nothing is printed indicating that composer is restarting the forward method when grad_accum="auto" is set
>
> We just merged a PR resolving this by raising the log level, so you should always see it in stdout. Please let me know if this doesn't work!
>
> (Bug) grad_accum fails with CUDA OOM even though batch_size=1 w/ no grad_accum works
>
> If you could send the logs and trace from this, that'd be awesome. I can take a look and fix any issues.

Ahhh, I can't reproduce the bug anymore, as my codebase has evolved and I don't know which version caused that particular bug. :/

@vedantroy
Author

vedantroy commented Aug 5, 2022

I don't care about logging things at the micro-batch level. I was saying that the parameter name batch in the loss method should be renamed to micro_batch to be more descriptive of what it actually is.

@ravi-mosaicml

@hanlint
Contributor

hanlint commented Aug 5, 2022

@mvpatel2000 for the auto grad accum issue, it may be that we need to do a CUDA cache clear after the grad accum adjustment; otherwise the repeated restarts can cause memory fragmentation on the GPU. This can be triggered by increasing the image resolution so that a known small batch size fits into memory, then greatly increasing the batch size to force multiple grad accum adjustments.
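
Roughly the pattern I mean (an illustrative sketch, not our actual trainer loop; run_batch is a hypothetical helper that splits the batch into grad_accum micro-batches):

# On CUDA OOM: double grad_accum, clear the CUDA cache so fragmentation from the
# failed attempt is released, and retry the batch.
import logging

import torch

log = logging.getLogger(__name__)

def train_batch_with_auto_grad_accum(run_batch, batch, grad_accum=1, max_grad_accum=128):
    while True:
        try:
            return run_batch(batch, grad_accum)
        except RuntimeError as e:
            if "out of memory" not in str(e) or grad_accum >= max_grad_accum:
                raise
            grad_accum *= 2
            torch.cuda.empty_cache()  # the cache clear suggested above
            log.info("CUDA out of memory detected. Gradient accumulation increased to %d; "
                     "the batch will be retrained.", grad_accum)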

@mvpatel2000
Contributor

Noting here that we merged cache clearing and are seeing far lower rates of cache fragmentation. It looks like this has basically resolved the auto grad accum issues, but please let us know if you are still seeing problems.

@mvpatel2000
Contributor

Closing this because it seems resolved -- please feel free to reopen if you feel any point was not addressed.
