Conversation

@FrancescoSaverioZuppichini (Contributor)

What does this PR do?

This PR adds padding to Swin, allowing it to support any input size (as long as it is divisible by 32).

Example:

from transformers import SwinConfig, SwinModel
import torch

# Even though the config uses image_size=384, the model now accepts a
# non-square 1024 x 640 input thanks to the added padding.
model = SwinModel(SwinConfig(image_size=384))
x = torch.randn((1, 3, 1024, 640))
out = model(x)

Moreover, it adds a new field to the outputs, hidden_states_spatial_dimensions, containing the spatial dimensions of all the stages' inputs.
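
A rough sketch of how the new field might be consumed (purely illustrative; it assumes the field is a tuple of (height, width) pairs, one per stage, and that it is populated on a plain forward pass as described above):

out = model(x)
for stage_index, (height, width) in enumerate(out.hidden_states_spatial_dimensions):
    print(f"stage {stage_index}: {height} x {width}")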

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sgugger (Collaborator) left a comment


I had missed it in the MaskFormer PR, but hidden_states_spatial_dimensions in the model outputs is not a nested tensor, which will make distributed training fail for MaskFormer when using the Trainer.

Otherwise, it looks good to refactor this inside Swin.


attention_windows = attention_output.view(-1, self.window_size, self.window_size, channels)
- shifted_windows = window_reverse(attention_windows, self.window_size, height, width)  # B H' W' C
+ shifted_windows = window_reverse(attention_windows, self.window_size, height_pad, width_pad)  # B H' W' C
Collaborator:

The comment could go above and be more helpful. I have no idea what H' and W' are for instance.

Contributor Author:

I've removed the comment since it is clear from the code what is going on (IMHO).
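
As context for the B H' W' C notation discussed above: a minimal sketch of what a Swin-style window_reverse typically does (an illustrative paraphrase, not the exact code from this PR). H' and W' denote the possibly padded feature-map height and width after the windows are merged back:

def window_reverse(windows, window_size, height, width):
    # windows: (num_windows * batch_size, window_size, window_size, channels)
    # Merge the non-overlapping windows back into a full feature map of
    # shape (batch_size, height, width, channels), i.e. B H' W' C.
    num_channels = windows.shape[-1]
    windows = windows.view(-1, height // window_size, width // window_size, window_size, window_size, num_channels)
    windows = windows.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, height, width, num_channels)
    return windows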

config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

def set_nan_tensor_to_zero(t):

Collaborator:

Suggested change

shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
Contributor:

Not sure whether it makes sense to do this. If the model just returns hidden states of shape (B, C, H, W), then we get both hidden states and spatial dimensions for free.

Contributor Author:

+1 on this, very nice idea.

From a design point of view, if we say output_hidden_states, the user will expect to receive the hidden_states untouched. But I see why this is more convenient.

Pinging @sgugger for feedback.

Collaborator:

I have already suggested this on the MaskFormer PR but it was ignored, so I thought it wasn't possible. If it indeed is, it would kill two birds with one stone.

Contributor Author:

Apologies if it was ignored in MaskFormer. It is indeed possible but requires some refactoring; let me try to change this.

@FrancescoSaverioZuppichini (Contributor Author)

FrancescoSaverioZuppichini commented Mar 9, 2022

Swin now returns a list of reshaped hidden_states with shape (B, C, H, W). However, due to the view operation, the viewed tensor won't have .grad on it, so test_retain_grad_hidden_states_attentions fails. Not sure how to proceed.

from transformers import SwinConfig, SwinModel
import torch

model = SwinModel(SwinConfig(image_size=384))
x = torch.randn((1, 3, 1024, 640))
out = model(x, output_hidden_states=True)
[print(e.shape) for e in out.hidden_states]

torch.Size([1, 96, 256, 160])
torch.Size([1, 192, 128, 80])
torch.Size([1, 384, 64, 40])
torch.Size([1, 768, 32, 20])
torch.Size([1, 768, 32, 20])

Maybe I am missing something; kindly pinging @sgugger.
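
As an illustration of the issue (not part of the PR): only leaf tensors keep their .grad after backward; a tensor produced by view is a non-leaf and needs an explicit retain_grad() call, which is what the retain-grad test relies on:

import torch

x = torch.randn(2, 4, requires_grad=True)  # leaf tensor
y = x.view(2, 2, 2)                        # non-leaf: the result of an op on x
loss = (y * 2).sum()
loss.backward()

print(x.grad is None)  # False: leaf tensors accumulate .grad
print(y.grad is None)  # True: non-leaf tensors drop .grad unless y.retain_grad() was called before backward()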

Comment on lines +182 to +331
pixel_values = self.maybe_pad(pixel_values, height, width)
embeddings = self.projection(pixel_values)
_, _, height, width = embeddings.shape
output_dimensions = (height, width)
embeddings = embeddings.flatten(2).transpose(1, 2)

return embeddings, output_dimensions
@NielsRogge (Contributor) commented Mar 11, 2022

Can't you derive the output dimensions from the embeddings tensor here? Probably only for square sizes?

Contributor Author:

I think it is easier and more robust to just get the shape.
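
As an aside (illustrative only): after flatten(2).transpose(1, 2) the embeddings are (batch_size, height * width, channels), so for non-square inputs the individual height and width cannot be unambiguously recovered from the flattened tensor, which is why they are read from the shape before flattening:

import torch

embeddings = torch.randn(1, 96, 64, 40)        # (B, C, H, W) after the patch projection
_, _, height, width = embeddings.shape         # grab H and W while they are still explicit
flattened = embeddings.flatten(2).transpose(1, 2)
print(flattened.shape)                         # torch.Size([1, 2560, 96])
# 2560 == 64 * 40 == 40 * 64 == 80 * 32, so (H, W) has to be carried separately.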

@FrancescoSaverioZuppichini (Contributor Author)

Following @NielsRogge's suggestion, we now return the reshaped hidden states in reshaped_hidden_states in all four outputs (Encoder/Model/MaskedImage/ImageClassifier).
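
A rough sketch of how this might look from the user side (the field name comes from the commit messages below; the exact shapes follow the earlier example and are an assumption):

from transformers import SwinConfig, SwinModel
import torch

model = SwinModel(SwinConfig(image_size=384))
x = torch.randn((1, 3, 1024, 640))
out = model(x, output_hidden_states=True)
# hidden_states stay untouched as (B, sequence_length, C);
# reshaped_hidden_states carry the same activations as (B, C, H, W).
for hidden_state in out.reshaped_hidden_states:
    print(hidden_state.shape)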

Comment on lines 106 to 109
Last layer hidden-state of the first token of the sequence (classification token) after further processing
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
the classification token after processing through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction (classification) objective during pretraining.
Contributor:

This should be updated; it comes from BERT.

Contributor Author:

Updated. I am wondering why we are not supporting cls token pooling.

@FrancescoSaverioZuppichini (Contributor Author)

Thanks to all the reviewers. I've resolved all the conversations and renamed some layers to match our convention of Stage and Layer.

@NielsRogge (Contributor) left a comment

LGTM! Some minor comments left.

Note that the tests of Swin can be cleaned up a bit: you can remove all the chunk_length and encoder_decoder stuff, since those don't apply to Swin.

@FrancescoSaverioZuppichini FrancescoSaverioZuppichini merged commit 667b823 into master Mar 16, 2022
@FrancescoSaverioZuppichini FrancescoSaverioZuppichini deleted the swin_padding branch March 16, 2022 17:38
FrancescoSaverioZuppichini added a commit that referenced this pull request Mar 21, 2022
* padding done

* correctly return one attention per layer

* almost correct, attentions are not flatten one tuple per stage

* tests green

* doc

* conversations

* reshaping hidden_states

* view in the test

* reshape_hidden_states in Encoder and Model

* new outputs with reshaped_hidden_states

* conversations

* doc

* Update docs/source/model_doc/swin.mdx

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* conversations

* fix tests

* minor changes

* resolved conversations

* attentions one per stage

* typo

* typos

* typos

* function signature

* CI

* clean up tests

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>