ZeroDivisionError in backward #26
Comments
@carlc-nv is the primary developer of Amp; I've let him know.
Hi @furybubu, thanks for reporting this issue. A couple of questions:
That last step will log exactly what casts amp is inserting into the model. The specific issue you are observing is that the "fp16 loss scale" is becoming increasingly small until it becomes zero. This suggests to me there is a different fp16-related issue, since the loss scale decreases only when there is an overflow (Inf or NaN) in the fp16 gradients.
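To illustrate the mechanism described above (a simplified sketch, not apex's actual implementation): dynamic loss scaling backs off the scale, typically halving it, each time an overflow is detected. If every step overflows, repeated halving eventually underflows the scale to exactly 0.0, and the unscale step's `1. / scale` then raises the `ZeroDivisionError` reported here.

```python
# Sketch of dynamic loss-scale backoff (illustrative only, not apex code).
# With overflows on every step, repeated halving of a float eventually
# underflows to exactly 0.0, at which point 1. / scale blows up.

def backoff(scale, has_overflow, factor=0.5):
    """Reduce the loss scale after an overflow; otherwise keep it."""
    return scale * factor if has_overflow else scale

scale = 2.0 ** 15
for _ in range(1100):            # every single step overflows
    scale = backoff(scale, has_overflow=True)

print(scale)                     # 0.0 -- the scale has underflowed
try:
    inv_scale = 1.0 / scale      # the unscale step
except ZeroDivisionError as e:
    print(e)                     # float division by zero
```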
Hi @cbcase, I will try to rerun it with the verbose flag to see if I get more clues in the output.
So I basically get things like this:
Hi @furybubu, I'm looking at adding better debugging support for when there are mixed precision issues. If you're interested in being a guinea pig, I've pushed work-in-progress changes to this branch: https://github.com/NVIDIA/apex/tree/amp_debug. You can check it out and install it in the usual way. Right now, there's just one function, `handle.run_debug`.
Here's what that looks like in practice. Sample original code:

```python
data, target = load_data()  # However you load data
output = model(data)
loss = criterion(output, target)
...
```

To run debug:

```python
data, target = load_data()

def loss_fn():
    output = model(data)
    return criterion(output, target)

handle.run_debug(model, loss_fn)
```

The debug script will do three things:
Let us know if you're able to try this out and what you learn! In particular, I would be interested to hear:
Hi all! Wondering if other people have reported a similar issue and what the solution was. Using the approach suggested here, the scale reduces gradually from 2^15 to 8 and then breaks:

```
     36     if p.grad is not None:
     37         self._has_overflow = scale_check_overflow(p.grad.data,
---> 38                                                   1. / scale)
     39     if self._has_overflow:
     40         break

ZeroDivisionError: float division by zero
```
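One stopgap that avoids the crash itself (though not the underlying overflow that keeps shrinking the scale) is to clamp the dynamic scale at a floor before unscaling. This is a hypothetical sketch; `MIN_LOSS_SCALE` is an assumed constant, not an apex option:

```python
# Hypothetical workaround sketch: never let the dynamic loss scale reach
# zero, so the unscale step (1. / scale) cannot divide by zero. The real
# problem -- gradients overflowing on every step -- still needs fixing.

MIN_LOSS_SCALE = 2.0 ** -14  # assumed floor, not an apex constant

def backoff_with_floor(scale, has_overflow, factor=0.5):
    """Halve the scale on overflow, but never below MIN_LOSS_SCALE."""
    if has_overflow:
        scale = max(scale * factor, MIN_LOSS_SCALE)
    return scale

scale = 2.0 ** 15
for _ in range(10000):               # even with endless overflows...
    scale = backoff_with_floor(scale, has_overflow=True)
print(scale == MIN_LOSS_SCALE)       # True: the scale bottoms out
print(1.0 / scale)                   # finite, no ZeroDivisionError
```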
Hi, I am having an error when I implement the amp procedure on a working CNN like this:

```python
self.optimizer.zero_grad()
```
And here is the error I get:
```
    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/handle.py", line 53, in scale_loss
    optimizer.param_groups, loss_scale)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/scaler.py", line 21, in unscale_and_update
    1. / scale,
ZeroDivisionError: float division by zero
```
Any suggestion would be appreciated.
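For context, the `scale_loss` pattern visible in the traceback (scale the loss for `backward()`, then unscale gradients when the context manager exits) can be mimicked with a self-contained stand-in. This is a sketch of the mechanism only; `ToyScaler` is not part of apex:

```python
from contextlib import contextmanager

class ToyScaler:
    """Minimal stand-in for an amp-style loss scaler (illustrative only)."""

    def __init__(self, scale=2.0 ** 15):
        self.scale = scale

    @contextmanager
    def scale_loss(self, loss):
        # Hand the caller a scaled loss to call backward() on ...
        yield loss * self.scale
        # ... then unscale on context exit; if the scale ever decayed to
        # zero, this line reproduces the reported ZeroDivisionError.
        self.inv_scale = 1.0 / self.scale

healthy = ToyScaler()
with healthy.scale_loss(2.0) as scaled:
    pass                            # real code would call scaled.backward()
print(healthy.inv_scale)            # 1 / 2**15

broken = ToyScaler(scale=0.0)       # the state this issue describes
try:
    with broken.scale_loss(2.0) as scaled:
        pass
except ZeroDivisionError as e:
    print(e)                        # float division by zero
```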