
confused about the function forward_chop #15

Closed
cmhungsteve opened this issue Aug 30, 2021 · 22 comments

@cmhungsteve

Thank you for providing the wonderful code.
However, I am confused about the function forward_chop in model/__init__.py.
It seems that this function unfolds the input image into several patches and then feeds those patches into the IPT model, but I didn't find any detailed explanation in the paper or in the code comments.
For example, what does shave mean here?
If I want to unfold the input image into non-overlapping patches, how should I do it?

Thank you.

@HantingChen
Collaborator

In the paper, we mentioned that "During the test, we crop the images in the test set into 48 × 48 patches with a 10 pixels overlap." The reason is that our transformer model can only handle inputs with a fixed shape.

shave means the overlap in pixels. You can set shave=0 to unfold the image into non-overlapping patches for testing, but the performance may drop a little.
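
A minimal sketch of that kind of overlapped test-time cropping with torch.nn.functional.unfold (illustrative sizes, not the repository's exact code):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 96, 96)                  # B, C, H, W
padsize, shave = 48, 10                        # patch size and overlap in pixels
stride = padsize - shave                       # neighbouring patches share 10 pixels
patches = F.unfold(x, kernel_size=padsize, stride=stride)   # (1, 3*48*48, L)
patches = patches.transpose(1, 2).reshape(-1, 3, padsize, padsize)
print(patches.shape)                           # torch.Size([4, 3, 48, 48]) for this 96x96 input
# borders not covered by a full patch still need separate handling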

@cmhungsteve
Author

From lines 153-161 of the code:

padsize = int(self.patch_size)
shave = int(self.patch_size/2)
scale = self.scale[self.idx_scale]
h_cut = (h-padsize)%(int(shave/2))
w_cut = (w-padsize)%(int(shave/2))
x_unfold = torch.nn.functional.unfold(x, padsize, stride=int(shave/2)).transpose(0,2).contiguous()

Let's say padsize is 48.
Doesn't that mean shave is 24 and stride is 12?
I am also confused by the relation between shave and stride.

If we set shave to 0, then the stride will be 0, which still seems weird.

@HantingChen
Collaborator

HantingChen commented Aug 31, 2021

Sorry for the confusion. I just found that we uploaded another version of the chop function (its performance is slightly higher).

In this version, the patches are extracted with a stride of one quarter of the patch size (12 for 48x48 inputs), 12 pixels are dropped from each edge of every output, and the remaining 24x24 centres are folded back with a 12-pixel overlap.

If you want the original version (i.e., the code that unfolds the input image into non-overlapping patches), we can upload it as an option.
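
For concreteness, the numbers in the current version work out as follows (my reading of the snippet quoted above, for a 48x48 patch):

padsize = 48
shave = padsize // 2      # 24
stride = shave // 2       # 12 -> patches are extracted every 12 pixels
kept = padsize - shave    # 24 -> after dropping shave/2 = 12 pixels on each side
overlap = kept - stride   # 12 -> the kept 24x24 centres still overlap by 12 pixels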

@cmhungsteve
Author

Thank you for the prompt reply.
If I want to modify the current code into a "non-overlapping unfold/fold" version, how should I do that?
If you can upload the code for this version, that would be perfect!

Thank you.

@HantingChen
Collaborator

You can find that in

def forward_chop_new(self, x, shave=12, batchsize = 64):

@cmhungsteve
Author

Thank you for sharing!
Another question is:
In lines 193-196 of forward_chop, there is a piece of code like the following:

y_ones = torch.ones(y_inter.shape, dtype=y_inter.dtype)
divisor = torch.nn.functional.fold(torch.nn.functional.unfold(y_ones, padsize*scale-shave*scale, stride=int(shave/2*scale)),((h-h_cut-shave)*scale,(w-w_cut-shave)*scale), padsize*scale-shave*scale, stride=int(shave/2*scale))
y_inter = y_inter/divisor

I checked the official PyTorch documentation; it seems this code is needed because unfold and fold are not inverses of each other (fold sums the values of overlapping patches, so you divide by the overlap count to average them).

However, in forward_chop_new there is no such code.
Is there a reason for that?

Thank you.

@HantingChen
Collaborator

In forward_chop, the unfolded patches overlap (more specifically, we not only cut the edge pixels but also merge the overlapping outputs), while in forward_chop_new we cut the edge pixels directly and the cut patches do not overlap, so no divisor is needed.
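
A standalone check of that normalization trick, independent of the repository's code: fold sums the values of overlapping patches, and dividing by fold(unfold(ones)) recovers the original values.

import torch
import torch.nn.functional as F

x = torch.arange(36.).reshape(1, 1, 6, 6)          # toy 6x6 image
kernel, stride = 4, 2                              # overlapping 4x4 patches, step 2
summed = F.fold(F.unfold(x, kernel, stride=stride), (6, 6), kernel, stride=stride)
ones = torch.ones_like(x)
divisor = F.fold(F.unfold(ones, kernel, stride=stride), (6, 6), kernel, stride=stride)
print(torch.allclose(summed / divisor, x))         # True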

@cmhungsteve
Author

I see. Thank you for the explanation.

x_unfold = torch.nn.functional.unfold(x, padsize, stride=padsize-shave).transpose(0,2).contiguous()

But why do we set stride=padsize-shave instead of stride=padsize?
Doesn't that mean there are still overlaps between the 48*48 blocks?

@HantingChen
Collaborator

Yes, there are overlaps, but we cut the pixels at the edges:

y_unfold = y_unfold[...,int(shave/2*scale):padsize*scale-int(shave/2*scale),int(shave/2*scale):padsize*scale-int(shave/2*scale)].contiguous()
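
For anyone following the thread, a simplified sketch of this unfold-cut-fold strategy (scale=1, an identity stand-in for the model, my own toy code rather than the repository's):

import torch
import torch.nn.functional as F

def chop_sketch(x, model, padsize=48, shave=12):
    b, c, h, w = x.shape
    stride = padsize - shave                           # e.g. 36: overlapped extraction
    patches = F.unfold(x, padsize, stride=stride)
    patches = patches.transpose(1, 2).reshape(-1, c, padsize, padsize)
    y = model(patches)                                 # process patch by patch
    cut = shave // 2                                   # drop 6 pixels on every side
    y = y[..., cut:padsize - cut, cut:padsize - cut]   # kept centres tile with no overlap
    y = y.reshape(b, -1, c * stride * stride).transpose(1, 2)
    return F.fold(y, (h - shave, w - shave), stride, stride=stride)

x = torch.randn(1, 3, 120, 120)
print(chop_sketch(x, lambda t: t).shape)               # torch.Size([1, 3, 108, 108])
# the image borders (the outer shave/2 pixels) still need the separate y_h_cut / y_w_cut handling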

@cmhungsteve
Author

What happens if we unfold the image into non-overlapping 48*48 blocks?
If it's doable, how should I modify the code?
Thank you.

@HantingChen
Collaborator

What happens if we unfold the image into non-overlapping 48*48 blocks?
If it's doable, how should I modify the code?
Thank you.

Just set shave=0 in forward_chop_new

@cmhungsteve
Author

Cool, thank you.
What's the drawback if we set shave=0 (e.g., does the performance drop a lot)?

@HantingChen
Collaborator

Yes, the performance will drop a lot (since the pixels at the edge of each patch are restored poorly). Besides, the transitions between different patches will be sharp and uneven.

@cmhungsteve
Author

cmhungsteve commented Sep 2, 2021

Got it. That's clearer.
Thank you so much!

One last question:
It seems that the final output y is the combination of y_inter, y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut.
However, there are overlaps between them.
How do you deal with the overlaps when generating the final output y?
Are there any details explained in the paper?

Thank you.

@HantingChen
Collaborator

The edge of each patch is cut, so when we put the patches together, the edge of the whole image is also cut. That is why we compute y_h_cut, y_w_cut, y_h_top, y_w_top, and y_hw_cut and place them at the edges of the whole image. Besides, the size of the whole image may not be an integer multiple of the patch size, so there must be some overlap (we handle it in y_h_cut and y_w_cut).

If you set shave=0, it is not necessary to compute some of them, but you can still use this code, since the final output is exactly the same.
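
A hypothetical 1D picture of that border overlap (made-up numbers, not from the repository): when the image length is not a multiple of the stride, the last patch is taken flush with the border and therefore overlaps its neighbour.

length, padsize, stride = 100, 48, 36
starts = list(range(0, length - padsize + 1, stride)) + [length - padsize]
print(starts)   # [0, 36, 52]: the patch starting at 52 overlaps the one starting at 36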

@cmhungsteve
Author

cmhungsteve commented Sep 5, 2021

I am trying forward_chop_new with shave=0 (i.e., there are exactly no overlaps).
And I thought that in this line:

y[...,:padsize*scale,:] = y_h_top

y[...,:padsize*scale,:] and y_h_top should be the same, so that lines 286 & 287 are not necessary.
However, I found that y[...,:padsize*scale,:] and y_h_top are slightly different
(i.e., cropping then feeding to IPT != feeding to IPT then cropping).

Do you know why this happens?

Thank you.

@HantingChen
Collaborator

I am trying forward_chop_new with shave=0 (i.e., there are exactly no overlaps).
And I thought that in this line:

y[...,:padsize*scale,:] = y_h_top

y[...,:padsize*scale,:] and y_h_top should be the same, so that lines 286 & 287 are not necessary.
However, I found that y[...,:padsize*scale,:] and y_h_top are slightly different
(i.e., cropping then feeding to IPT != feeding to IPT then cropping).
Do you know why this happens?

Thank you.

That's strange. Maybe nn.LayerNorm is computed slightly differently when it is fed different batches. I think this will not affect the performance.

@cmhungsteve
Author

cmhungsteve commented Sep 7, 2021

I also have a question about the inputs that you feed into the Transformer:

y_unfold.append(self.model.forward(x_unfold[i*batchsize:(i+1)*batchsize,...]))

Assume x_unfold has the shape [25, 3, 48, 48] (25 patches, each one has resolution 48x48).
And assume after the head encoder, the input becomes [25, 32, 48, 48].

The part that confuses me is that there is another unfold function before the Transformer encoder:

x = torch.nn.functional.unfold(x,self.patch_dim,stride=self.patch_dim).transpose(1,2).transpose(0,1).contiguous()

Since self.patch_dim=3, this makes x become [(16*16), 25, (32*3*3)], which is then the input to the multi-head attention.
This means that the sequence length for the attention is 256 here, not 25.
Does that mean the attention mechanism is not used to learn the relation between the 25 patches, but rather the relation between the tiny 3x3 patches inside each 48x48 patch?
If so, this seems different from the original ViT paper, which uses attention to learn the relation between the patches of an image.
Why is there this difference? Is it because IPT is for low-level vision tasks, not classification like ViT?

Thank you.

@HantingChen
Collaborator

I think there might be some misunderstanding.

The input of our IPT model is exactly a 48*48*3 image (the counterpart of the 224*224*3 input in ViT). The 48*48 image is then cropped into 3*3 small patches, giving a sequence length of 16*16.

The forward_chop function is used for handling different input sizes, since our IPT can only take 48*48 images.
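
To connect this to the shapes discussed above, a small sketch of that inner tokenisation (using the 32-channel feature map assumed in the question; the real channel count after the head may differ):

import torch
import torch.nn.functional as F

feat = torch.randn(25, 32, 48, 48)      # 25 cropped 48x48 patches after the head (assumed shapes)
patch_dim = 3
tokens = F.unfold(feat, patch_dim, stride=patch_dim)        # (25, 32*3*3, (48/3)**2)
tokens = tokens.transpose(1, 2).transpose(0, 1).contiguous()
print(tokens.shape)                     # torch.Size([256, 25, 288]): seq_len (48/3)^2 = 256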

@cmhungsteve
Author

I see. Now it's much clearer.
I also want to check if I understand the paper correctly.
In Section 3.1, H and W are both 48, and P is 3. Is that correct?

@HantingChen
Collaborator

I see. Now it's much clearer.
I also want to check if I understand the paper correctly.
In Section 3.1, H and W are both 48, and P is 3. Is that correct?

Yes

@cmhungsteve
Author

Got it. Thank you so much!!!
