Fixing slow pipeline tests #14260
Conversation
output_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(-1)).to(torch.long)
# Effectively attention_mask.sum(-1), but not inplace to be able to run
# on inference mode.
mask = attention_mask.cumsum(dim=-1)[:, -1]
(nit) - could we call it `non_padded_lengths`? The idea here is to extract the sub-sampled length from the "real", non-padded input length.
Suggested change:
- mask = attention_mask.cumsum(dim=-1)[:, -1]
+ non_padded_lengths = attention_mask.cumsum(dim=-1)[:, -1]
True. Do you know any other way to do that operation? It's very surprising that `.sum` is in-place, and I'm worried that using `cumsum` instead is super wasteful.
I tried to grep for it in our code, but I couldn't find anything of that sort.
Hmm - I don't really know, to be honest... `torch.sum(...)` doesn't work either? But I think using `.cumsum(...)` is totally fine as well.
No, `torch.sum(..)` doesn't work.
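For context, a minimal sketch (not taken from the PR) showing that the last column of the cumulative sum gives the same per-row lengths as `.sum(-1)`:

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# The last column of the running sum equals the per-row total,
# i.e. the number of non-padded tokens in each sequence.
non_padded_lengths = attention_mask.cumsum(dim=-1)[:, -1]

assert torch.equal(non_padded_lengths, attention_mask.sum(-1))
print(non_padded_lengths)  # tensor([3, 5])
```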
src/transformers/models/unispeech_sat/modeling_unispeech_sat.py
def sequential_inference(self, **inputs):
    """
    Inference used for models that need to process sequences in a sequential fashion, like the SQA models which
    handle conversational query related to a table.
    """
    with torch.no_grad():
Nice!
Would be happy if we could give `mask` a better name. Apart from that, thanks a lot for enabling inference mode for all models :-)
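For illustration, here is a rough sketch of the general pattern this PR moves the pipelines towards (the helper name is hypothetical, not the actual pipeline code): the forward pass runs under `torch.inference_mode()` rather than `torch.no_grad()`.

```python
import torch
from torch import nn

def run_inference(model: nn.Module, model_inputs: dict):
    # torch.inference_mode() (PyTorch >= 1.9) disables autograd like
    # torch.no_grad(), but also skips version-counter and view bookkeeping,
    # so it is slightly faster. The trade-off: tensors created inside it
    # cannot be modified in-place outside of it.
    with torch.inference_mode():
        return model(**model_inputs)
```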
if self.training:
    if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
@stas00 Is it okay to remove this at inference time?
In theory, yes. In practice, it depends on how the model was pre-trained.
The model weights don't change during inference, so we don't need to keep things in check all the time.
However, if the pre-trained model's weights lead to an overflow within a single iteration during training, as is the case with some mt5 models under mixed precision, then this can occur just as well during inference.
This is primarily an issue with models pre-trained in bf16 and then fine-tuned or run for inference in fp16 (mixed or non-mixed precision).
If a model was pre-trained with fp16/mixed precision, it's pretty certain the clamping won't be needed.
To give you a more informed answer I'd need to run some tests with the actual DETR models and check their activation magnitudes at the point you're asking about. That should be fairly trivial using https://huggingface.co/transformers/debugging.html#underflow-and-overflow-detection, which can be plugged into the HF Trainer and the examples with a single command-line arg: --debug underflow_overflow.
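For reference, the detector linked above can also be attached to a model directly outside the Trainer (a sketch based on the linked docs; the checkpoint name is just an example):

```python
from transformers import AutoModel
from transformers.debug_utils import DebugUnderflowOverflow

model = AutoModel.from_pretrained("t5-small")  # example checkpoint suspected of overflowing

# Registers forward hooks on every module; when an inf/nan shows up in
# activations or weights it reports the offending batch along with the
# per-module min/max absolute values of the preceding frames.
debug_overflow = DebugUnderflowOverflow(model)
```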
To be honest, I think this code was just badly copy-pasted, so I'm more in favor of disabling this hack for training (as it is done now)
OK, if everyone is in favor, then let's do this.
> To be honest, I think this code was just badly copy-pasted, so I'm more in favor of disabling this hack for training (as it is done now)

You must have meant for inference, right, Patrick?
Good for merge for me
* Fiixng slow pipeline tests
* Remove the image-segmentaiton override.
* Fixing clamping only in training.
* Wav2vec2.
* Remove last mention of `no_grad`.
* Fixing copies.
* Rename.
Some tests were broken because of pytorch `inference_mode`. This should cover all cases of `inplace` tensor modifications afaik. Let me know if there are better ways to fix those.
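To make the failure mode concrete, here is a minimal illustration (not code from this PR) of the in-place restriction that was breaking the tests:

```python
import torch

with torch.inference_mode():
    mask = torch.ones(2, 5)

# In-place updates to tensors created under inference_mode are rejected
# once we are back outside of it; PyTorch raises a RuntimeError along the
# lines of "Inplace update to inference tensor outside InferenceMode is
# not allowed", which is why the fix avoids in-place ops on such tensors.
try:
    mask.add_(1)
except RuntimeError as err:
    print(err)
```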
What does this PR do?
Fixes # (issue)
Who can review?
@stas00
@patrickvonplaten
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.