
confused about the function forward_chop #15

Closed
cmhungsteve opened this issue Aug 30, 2021 · 22 comments

@cmhungsteve

Thank you for providing the wonderful code.
However, I am confused about the function forward_chop in model/__init__.py.
It seems that this function unfolds the input image into several patches and then feeds those patches into the IPT model, but I didn't find any detailed explanation in the paper or in the code comments.
For example, what does shave mean here?
If I want to unfold the input image into non-overlapping patches, how should I do it?

Thank you.

@HantingChen
Collaborator

In the paper, we mentioned that "During the test, we crop the images in the test set into 48 × 48 patches with a 10 pixels overlap." The reason is that our transformer model can only handle inputs with a fixed shape.

shave means the overlap in pixels. You can set shave=0 to unfold the image into non-overlapping patches for testing, but the performance may drop a little.
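
A minimal sketch of that kind of overlapped test-time cropping with torch.nn.functional.unfold (illustrative sizes, not the repository's exact code):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 96, 96)                  # B, C, H, W
padsize, shave = 48, 10                        # patch size and overlap in pixels
stride = padsize - shave                       # neighbouring patches share 10 pixels
patches = F.unfold(x, kernel_size=padsize, stride=stride)   # (1, 3*48*48, L)
patches = patches.transpose(1, 2).reshape(-1, 3, padsize, padsize)
print(patches.shape)                           # torch.Size([4, 3, 48, 48]) for this 96x96 input
# borders not covered by a full patch still need separate handling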

@cmhungsteve
Author

From lines 153-161 of the code:

padsize = int(self.patch_size)
shave = int(self.patch_size/2)
scale = self.scale[self.idx_scale]
h_cut = (h-padsize)%(int(shave/2))
w_cut = (w-padsize)%(int(shave/2))
x_unfold = torch.nn.functional.unfold(x, padsize, stride=int(shave/2)).transpose(0,2).contiguous()

Let's say padsize is 48.
Doesn't that mean shave is 24 and stride is 12?
I am also confused by the relation between shave and stride.

If we set shave to 0, then the stride will be 0, which still seems weird.

@HantingChen
Collaborator

HantingChen commented Aug 31, 2021

Sorry for the confusion. I just found that we uploaded another version of the chop function (its performance is slightly higher).

In this version, the patches are extracted with a stride of one quarter of the patch size (12 for 48x48 inputs), 12 pixels are dropped from each edge of every output, and the remaining 24x24 centres are folded back with a 12-pixel overlap.

If you want the original version (i.e., the code that unfolds the input image into non-overlapping patches), we can upload it as an option.
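
For concreteness, the numbers in the current version work out as follows (my reading of the snippet quoted above, for a 48x48 patch):

padsize = 48
shave = padsize // 2      # 24
stride = shave // 2       # 12 -> patches are extracted every 12 pixels
kept = padsize - shave    # 24 -> after dropping shave/2 = 12 pixels on each side
overlap = kept - stride   # 12 -> the kept 24x24 centres still overlap by 12 pixels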

@cmhungsteve
Author

Thank you for the prompt reply.
If I want to modify the current code into a "non-overlapping unfold/fold" version, how should I do that?
If you can upload the code for this version, that would be perfect!

Thank you.

@HantingChen
Collaborator

You can find that in

def forward_chop_new(self, x, shave=12, batchsize = 64):

@cmhungsteve
Author

Thank you for sharing!
Another question is:
In lines 193-196 of forward_chop, there is a piece of code like the following:

y_ones = torch.ones(y_inter.shape, dtype=y_inter.dtype)
divisor = torch.nn.functional.fold(torch.nn.functional.unfold(y_ones, padsize*scale-shave*scale, stride=int(shave/2*scale)),((h-h_cut-shave)*scale,(w-w_cut-shave)*scale), padsize*scale-shave*scale, stride=int(shave/2*scale))
y_inter = y_inter/divisor

I checked the official PyTorch documentation; it seems this code is needed because unfold and fold are not inverses of each other (fold sums the values of overlapping patches, so you divide by the overlap count to average them).

However, in forward_chop_new there is no such code.
Is there a reason for that?

Thank you.

@HantingChen
Collaborator

In forward_chop, the unfolded patches overlap (more specifically, we not only cut the edge pixels but also merge the overlapping outputs), while in forward_chop_new we cut the edge pixels directly and the cut patches do not overlap, so no divisor is needed.
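
A standalone check of that normalization trick, independent of the repository's code: fold sums the values of overlapping patches, and dividing by fold(unfold(ones)) recovers the original values.

import torch
import torch.nn.functional as F

x = torch.arange(36.).reshape(1, 1, 6, 6)          # toy 6x6 image
kernel, stride = 4, 2                              # overlapping 4x4 patches, step 2
summed = F.fold(F.unfold(x, kernel, stride=stride), (6, 6), kernel, stride=stride)
ones = torch.ones_like(x)
divisor = F.fold(F.unfold(ones, kernel, stride=stride), (6, 6), kernel, stride=stride)
print(torch.allclose(summed / divisor, x))         # True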

@cmhungsteve
Author

I see. Thank you for the explanation.

x_unfold = torch.nn.functional.unfold(x, padsize, stride=padsize-shave).transpose(0,2).contiguous()

But why do we set stride=padsize-shave instead of stride=padsize?
Doesn't that mean there are still overlaps between the 48*48 blocks?

@HantingChen
Collaborator

Yes, there are overlaps, but we cut the pixels at the edges:

y_unfold = y_unfold[...,int(shave/2*scale):padsize*scale-int(shave/2*scale),int(shave/2*scale):padsize*scale-int(shave/2*scale)].contiguous()
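
For anyone following the thread, a simplified sketch of this unfold-cut-fold strategy (scale=1, an identity stand-in for the model, my own toy code rather than the repository's):

import torch
import torch.nn.functional as F

def chop_sketch(x, model, padsize=48, shave=12):
    b, c, h, w = x.shape
    stride = padsize - shave                           # e.g. 36: overlapped extraction
    patches = F.unfold(x, padsize, stride=stride)
    patches = patches.transpose(1, 2).reshape(-1, c, padsize, padsize)
    y = model(patches)                                 # process patch by patch
    cut = shave // 2                                   # drop 6 pixels on every side
    y = y[..., cut:padsize - cut, cut:padsize - cut]   # kept centres tile with no overlap
    y = y.reshape(b, -1, c * stride * stride).transpose(1, 2)
    return F.fold(y, (h - shave, w - shave), stride, stride=stride)

x = torch.randn(1, 3, 120, 120)
print(chop_sketch(x, lambda t: t).shape)               # torch.Size([1, 3, 108, 108])
# the image borders (the outer shave/2 pixels) still need the separate y_h_cut / y_w_cut handling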

@cmhungsteve
Author

What happens if we unfold the image into non-overlapping 48*48 blocks?
If it's doable, how should I modify the code?
Thank you.

@HantingChen
Collaborator

What happens if we unfold the image into non-overlapping 48*48 blocks?
If it's doable, how should I modify the code?
Thank you.

Just set shave=0 in forward_chop_new

@cmhungsteve
Author

Cool, thank you.
What's the drawback if we set shave=0 (e.g., does the performance drop a lot)?

@HantingChen
Collaborator

Yes, the performance will drop a lot (since the pixels at the edge of each patch are restored poorly). Besides, the transitions between different patches will be sharp and uneven.

@cmhungsteve
Author

cmhungsteve commented Sep 2, 2021

Got it. That's clearer.
Thank you so much!

One last question:
It seems that the final output y is the combination of y_inter, y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut.
However, there are overlaps between them.
How do you deal with the overlaps when generating the final output y?
Are there any details explained in the paper?

Thank you.

@HantingChen
Collaborator

The edge of each patch is cut, so when we put the patches together, the edge of the whole image is also cut. That is why we compute y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut and place them at the edges of the whole image. Besides, the size of the whole image may not be an integer multiple of the patch size, so there must be some overlap (we handle it in y_h_cut and y_w_cut).

If you set shave=0, it is not necessary to compute some of them, but you can still use this code, since the final output is exactly the same.
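
A hypothetical 1D picture of that border overlap (made-up numbers, not from the repository): when the image length is not a multiple of the stride, the last patch is taken flush with the border and therefore overlaps its neighbour.

length, padsize, stride = 100, 48, 36
starts = list(range(0, length - padsize + 1, stride)) + [length - padsize]
print(starts)   # [0, 36, 52]: the patch starting at 52 overlaps the one starting at 36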

@cmhungsteve
Author

cmhungsteve commented Sep 5, 2021

I am trying forward_chop_new with shave=0 (i.e., there are exactly no overlaps).
And I thought that in this line:

y[...,:padsize*scale,:] = y_h_top

y[...,:padsize*scale,:] and y_h_top should be the same, so that lines 286 & 287 are not necessary.
However, I found that y[...,:padsize*scale,:] and y_h_top are slightly different
(i.e., cropping then feeding to IPT != feeding to IPT then cropping).

Do you know why this happens?

Thank you.

@HantingChen
Collaborator

I am trying forward_chop_new with shave=0 (i.e., there are exactly no overlaps).
And I thought that in this line:

y[...,:padsize*scale,:] = y_h_top

y[...,:padsize*scale,:] and y_h_top should be the same, so that lines 286 & 287 are not necessary.
However, I found that y[...,:padsize*scale,:] and y_h_top are slightly different
(i.e., cropping then feeding to IPT != feeding to IPT then cropping).
Do you know why this happens?

Thank you.

That's strange. Maybe nn.LayerNorm is computed slightly differently when it is fed different batches. I think this will not affect the performance.

@cmhungsteve
Author

cmhungsteve commented Sep 7, 2021

I also have a question about the inputs that you feed into the Transformer:

y_unfold.append(self.model.forward(x_unfold[i*batchsize:(i+1)*batchsize,...]))

Assume x_unfold has the shape [25, 3, 48, 48] (25 patches, each one has resolution 48x48).
And assume after the head encoder, the input becomes [25, 32, 48, 48].

The part that confuses me is that there is another unfold function before the Transformer encoder:

x = torch.nn.functional.unfold(x,self.patch_dim,stride=self.patch_dim).transpose(1,2).transpose(0,1).contiguous()

Since self.patch_dim=3, this makes x become [(16*16), 25, (32*3*3)], which is then the input to the multi-head attention.
This means that the sequence length for the attention is 256 here, not 25.
Does that mean the attention mechanism is not used to learn the relation between the 25 patches, but rather the relation between the tiny 3x3 patches inside each 48x48 patch?
If so, this seems different from the original ViT paper, which uses attention to learn the relation between the patches of an image.
Why is there this difference? Is it because IPT is for low-level vision tasks, not classification like ViT?

Thank you.

@HantingChen
Collaborator

I think there might be some misunderstanding.

The input of our IPT model is exactly a 48*48*3 image (the counterpart of the 224*224*3 input in ViT). The 48*48 image is then cropped into 3*3 small patches, giving a sequence length of 16*16.

The forward_chop function is used for handling different input sizes, since our IPT can only take 48*48 images.
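
To connect this to the shapes discussed above, a small sketch of that inner tokenisation (using the 32-channel feature map assumed in the question; the real channel count after the head may differ):

import torch
import torch.nn.functional as F

feat = torch.randn(25, 32, 48, 48)      # 25 cropped 48x48 patches after the head (assumed shapes)
patch_dim = 3
tokens = F.unfold(feat, patch_dim, stride=patch_dim)        # (25, 32*3*3, (48/3)**2)
tokens = tokens.transpose(1, 2).transpose(0, 1).contiguous()
print(tokens.shape)                     # torch.Size([256, 25, 288]): seq_len (48/3)^2 = 256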

@cmhungsteve
Author

I see. Now it's much clearer.
I also want to check if I understand the paper correctly.
In Section 3.1, H and W are both 48, and P is 3. Is that correct?

@HantingChen
Collaborator

I see. Now it's much clearer.
I also want to check if I understand the paper correctly.
In Section 3.1, H and W are both 48, and P is 3. Is that correct?

Yes

@cmhungsteve
Author

Got it. Thank you so much!!!
