Swin support for any input size #15986
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
sgugger left a comment:
I had missed it inside the MaskFormer PR, but the hidden_states_spatial_dimensions in the model outputs is not a nested tensor, which will make distributed training fail for MaskFormer when using the Trainer.
Otherwise, looks good to refactor this inside Swin.
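A hypothetical illustration of that failure mode (nested_concat below is a simplified stand-in, not the actual Trainer code):

import torch

# Simplified stand-in for the Trainer's recursive gathering of model outputs
# across processes: it recurses into tuples, but the leaves must be tensors
# for the concatenation to work.
def nested_concat(a, b):
    if isinstance(a, (tuple, list)):
        return type(a)(nested_concat(x, y) for x, y in zip(a, b))
    return torch.cat((a, b), dim=0)

hidden = torch.randn(2, 4)
nested_concat(hidden, hidden)  # fine: the leaf is a tensor

dims = ((56, 56), (28, 28))    # plain ints, like hidden_states_spatial_dimensions
# nested_concat(dims, dims)    # TypeError: the leaves are ints, not tensors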
  attention_windows = attention_output.view(-1, self.window_size, self.window_size, channels)
- shifted_windows = window_reverse(attention_windows, self.window_size, height, width)  # B H' W' C
+ shifted_windows = window_reverse(attention_windows, self.window_size, height_pad, width_pad)  # B H' W' C
The comment could go above and be more helpful. I have no idea what H' and W' are, for instance.
I've removed the comment since it is clear from the code what is going on (IMHO).
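For reference, a sketch of the helper under discussion, based on the Swin implementation at the time (details may differ):

import torch

def window_reverse(windows, window_size, height, width):
    # merge per-window tensors back into one feature map:
    # (num_windows * batch, window_size, window_size, C) -> (batch, height, width, C)
    num_channels = windows.shape[-1]
    windows = windows.view(
        -1, height // window_size, width // window_size, window_size, window_size, num_channels
    )
    return windows.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, height, width, num_channels)

Since the view divides height and width by window_size, they have to be the padded dimensions, hence the change in the diff above.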
tests/swin/test_modeling_swin.py
Outdated
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

def set_nan_tensor_to_zero(t):
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
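A hypothetical sketch of how a caller could consume the field as documented at this point in the PR (names and shapes assumed from the docstring above):

import torch
from transformers import SwinConfig, SwinModel

model = SwinModel(SwinConfig())
outputs = model(torch.randn(1, 3, 224, 224), output_hidden_states=True)

# each hidden state is flat, (batch_size, height * width, hidden_size); the
# recorded spatial dimensions let us fold it back into a feature map
for hidden_state, (height, width) in zip(
    outputs.hidden_states, outputs.hidden_states_spatial_dimensions
):
    batch_size, _, hidden_size = hidden_state.shape
    feature_map = hidden_state.view(batch_size, height, width, hidden_size).permute(0, 3, 1, 2)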
Not sure whether it makes sense to do this. If the model just returns hidden states of shape (B, C, H, W), then we get both hidden states and spatial dimensions for free.
+1 on this, very nice idea.
From a design point of view, if we say output_hidden_states the user will expect to receive the hidden_states untouched. But I see why this is more convenient.
Pinging @sgugger for feedback.
I had already suggested this on the MaskFormer PR, but it was ignored, so I thought it wasn't possible. If it indeed is, it would kill two birds with one stone.
Apologies if it was ignored in MaskFormer. It is indeed possible but requires some refactoring. Let me try to change this.
Swin now returns a list of reshaped hidden states:

from transformers import SwinConfig, SwinModel
import torch

model = SwinModel(SwinConfig(image_size=384))
x = torch.randn((1, 3, 1024, 640))
out = model(x, output_hidden_states=True)
[print(e.shape) for e in out.hidden_states]

torch.Size([1, 96, 256, 160])
torch.Size([1, 192, 128, 80])
torch.Size([1, 384, 64, 40])
torch.Size([1, 768, 32, 20])
torch.Size([1, 768, 32, 20])

Maybe I am missing something, kindly pinging @sgugger
pixel_values = self.maybe_pad(pixel_values, height, width)
embeddings = self.projection(pixel_values)
_, _, height, width = embeddings.shape
output_dimensions = (height, width)
embeddings = embeddings.flatten(2).transpose(1, 2)

return embeddings, output_dimensions
Can't you derive the output dimensions from the embeddings tensor here? Probably only for square sizes?
I think it is easier and more robust to just get the shape.
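For context, a sketch of what such a padding helper can look like (the patch_size default here is an assumption; the merged code may differ):

import torch
from torch import nn

def maybe_pad(pixel_values, height, width, patch_size=4):
    # pad the bottom/right edges so height and width become divisible by the
    # patch size; a no-op when they already are
    pad_right = (patch_size - width % patch_size) % patch_size
    pad_bottom = (patch_size - height % patch_size) % patch_size
    if pad_right or pad_bottom:
        # F.pad takes pairs from the last dim inward: (left, right, top, bottom)
        pixel_values = nn.functional.pad(pixel_values, (0, pad_right, 0, pad_bottom))
    return pixel_values

padded = maybe_pad(torch.randn(1, 3, 223, 222), 223, 222)
print(padded.shape)  # torch.Size([1, 3, 224, 224])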
Following @NielsRogge's suggestion, we now return the reshaped hidden states inside reshaped_hidden_states.
Force-pushed from b6db2aa to 9d5d277.
Last layer hidden-state of the first token of the sequence (classification token) after further processing
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
the classification token after processing through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction (classification) objective during pretraining.
This should be updated; it comes from BERT.
Updated. I am wondering why we are not supporting CLS token pooling.
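For the record, Swin pools by averaging over the patch sequence, since there is no [CLS] token; a minimal sketch contrasting the two options (shapes are illustrative):

import torch
from torch import nn

# sequence_output: (batch_size, seq_len, hidden_size)
sequence_output = torch.randn(1, 49, 768)

# mean pooling over the patch sequence, roughly what Swin's pooler does
avg_pool = nn.AdaptiveAvgPool1d(1)
pooled = avg_pool(sequence_output.transpose(1, 2)).flatten(1)  # (1, 768)

# BERT-style CLS pooling would instead take a dedicated prepended token,
# which the Swin patch sequence does not have:
# pooled = sequence_output[:, 0]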
Thanks to all the reviewers. I've resolved all the conversations and renamed some layers to match our naming convention.
LGTM! Some minor comments left.
Note that the tests for Swin can be cleaned up a bit; you can remove all the chunk_length and encoder_decoder stuff, since those don't apply to Swin.
* padding done
* correctly return one attention per layer
* almost correct, attentions are not flatten one tuple per stage
* tests green
* doc
* conversations
* reshaping hidden_states
* view in the test
* reshape_hidden_states in Encoder and Model
* new outputs with reshaped_hidden_states
* conversations
* doc
* Update docs/source/model_doc/swin.mdx

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* conversations
* fix tests
* minor changes
* resolved conversations
* attentions one per stage
* typo
* typos
* typos
* function signature
* CI
* clean up tests

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
What does this PR do?
This PR adds padding to Swin, allowing it to support any input size (if divisible by 32).
Moreover, it adds a new field to the outputs, hidden_states_spatial_dimensions, containing the spatial dimensions of all the stages' inputs.
Example:
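A minimal sketch (the printed shape is an assumption based on the snippet earlier in the thread):

import torch
from transformers import SwinConfig, SwinModel

model = SwinModel(SwinConfig(image_size=384))
# a non-square input; each side is divisible by 32
pixel_values = torch.randn(1, 3, 1024, 640)
outputs = model(pixel_values, output_hidden_states=True)
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 640, 768])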