
about the gamma parameter #1

Open
Entonytang opened this issue May 31, 2018 · 18 comments

@Entonytang

In your code, gamma has shape [batchsize, 1, 1, 1]. I think its shape should be [1].
Besides, the attention score you compute seems to differ from the one in Han's paper. Did you calculate the attention score using the same equation as Eqn. (1) in the paper?
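For reference, a minimal sketch of the suggested fix: gamma as a single learnable scalar of shape [1] that broadcasts across the batch, rather than a per-sample [batchsize, 1, 1, 1] tensor.

import torch
from torch import nn

# One learnable scalar shared across the batch; it broadcasts against a
# (B, C, H, W) attention output in an expression like gamma * attn + input.
gamma = nn.Parameter(torch.zeros(1))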

@heykeetae
Owner

heykeetae commented May 31, 2018

Thank you very much for your comment! I think you are right about gamma; I'll make sure to correct it and update the code.
About the attention map: it has dimension batchsize x number_of_features (o in the paper, which I interpreted as the total number of pixels), which is the same as batchsize x H x W. In the code, H = W (= f in the code, which is perhaps the source of confusion). Since each pixel owns its own attention map, the total required dimension is batchsize x f^2 x f x f.
Sorry for the confusing notation. Please point out any other mistakes.
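To make the shape bookkeeping concrete, here is a small sketch (sizes made up) of the dimensions described above, where each of the f^2 pixels owns its own f x f attention map:

import torch

b_size, f_size = 4, 8  # batch size; H = W = f
# One f x f attention map per pixel gives batchsize x f^2 x f x f,
# which carries the same information as a (batchsize, N, N) matrix with N = f^2.
attn_maps = torch.zeros(b_size, f_size ** 2, f_size, f_size)
attn_matrix = attn_maps.view(b_size, f_size ** 2, f_size ** 2)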

@Entonytang
Author

Entonytang commented May 31, 2018

Based on my understanding, the self-attention operation should look like this. Why did you choose your method for calculating the attention scores?

import torch
from torch import nn
from torch.nn import functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_channel):
        super().__init__()
        # 1x1 convolutions implement the f (query), g (key), h (value) projections
        self.query = nn.Conv1d(in_channel, in_channel // 8, 1)
        self.key = nn.Conv1d(in_channel, in_channel // 8, 1)
        self.value = nn.Conv1d(in_channel, in_channel, 1)
        # learnable scalar weighting the attention output, initialized to zero
        self.gamma = nn.Parameter(torch.tensor(0.0))

    def forward(self, input):
        shape = input.shape
        # flatten spatial dims: (B, C, H, W) -> (B, C, N) with N = H * W
        flatten = input.view(shape[0], shape[1], -1)
        query = self.query(flatten).permute(0, 2, 1)  # (B, N, C // 8)
        key = self.key(flatten)                       # (B, C // 8, N)
        value = self.value(flatten)                   # (B, C, N)
        query_key = torch.bmm(query, key)             # (B, N, N) pairwise scores
        attn = F.softmax(query_key, 1)                # normalize over source pixels
        attn = torch.bmm(value, attn)                 # (B, C, N) weighted sum of values
        attn = attn.view(*shape)                      # back to (B, C, H, W)
        out = self.gamma * attn + input               # gated residual connection

        return out
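A quick usage check of the module above (shapes are arbitrary): the output keeps the input shape, and since gamma is initialized to zero, the layer starts out as an identity mapping.

layer = SelfAttention(64)
x = torch.randn(4, 64, 16, 16)   # (B, C, H, W)
y = layer(x)
assert y.shape == x.shape
assert torch.allclose(y, x)      # gamma starts at 0, so out == input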

@heykeetae
Owner

heykeetae commented May 31, 2018

Great suggestion! I'll sleep on it. However, that approach is similar to what I tried at first, where I realized it makes more sense for each pixel to look at different locations of the previous layer through its own attention map, since there are N resulting features o_j.

@Entonytang
Author

And regarding f_ready = f_x.contiguous().view(b_size, -1, f_size ** 2, f_size, f_size).permute(0, 1, 2, 4, 3): why do you transpose f_ready and then multiply it with g_ready? (Why the transpose here?)

@heykeetae
Owner

That part is to reflect f(x)^T * g(x) in the paper :)

@Entonytang
Author

Entonytang commented May 31, 2018

This operation aims to produce a scalar value (vector^T * vector = scalar), but the transpose operation in your code doesn't have that effect.
This is just my understanding.

@heykeetae
Owner

heykeetae commented May 31, 2018

That's a very good point. The calculation involves the depth of the feature map, so the multiplication alone does not end up with a scalar per pixel. But look at the line attn_dist = torch.mul(f_ready, g_ready).sum(dim=1).contiguous().view(-1, f_size ** 2): the .sum(dim=1) following the multiplication sums depth-wise, which does yield a scalar per pixel.
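A small check (with made-up shapes) that element-wise multiplication followed by .sum(dim=1) over the channel axis is exactly the per-pixel dot product f(x)^T g(x) discussed above:

import torch

B, C, H, W = 2, 8, 4, 4
f = torch.randn(B, C, H, W)
g = torch.randn(B, C, H, W)

per_pixel = torch.mul(f, g).sum(dim=1)   # (B, H, W): one scalar score per pixel

# The same quantity written as an explicit dot product over flattened pixels:
dots = torch.einsum('bcn,bcn->bn', f.view(B, C, -1), g.view(B, C, -1))
assert torch.allclose(per_pixel.view(B, -1), dots)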

@leehomyc

leehomyc commented Jun 4, 2018

If every pixel has its own attention map, the memory will be consumed quickly as the image size goes up. I agree with @Entonytang's interpretation.

@heykeetae
Owner

@leehomyc I'm still not sure that having only one attention score map is justified. Looking at the paper, Figs. 1 and 5 show attention results where a 'particular' area takes hints from different regions. In @Entonytang's implementation, that sort of visualization is not possible.

@hythbr

hythbr commented Jun 6, 2018

I think the attention score from @Entonytang agrees with Han's paper. But, based on my understanding, attn = torch.bmm(value, attn) should look like this:

value = value.permute(0, 2, 1)
attn = torch.bmm(attn, value)
attn = attn.permute(0, 2, 1)

What do you think? @Entonytang @heykeetae

@leehomyc

leehomyc commented Jun 6, 2018

Why the permute? @hythbr

@hythbr

hythbr commented Jun 7, 2018

According to Eqn. (2) in the paper, I think the matrix-matrix product after the permute may capture the meaning of the equation. However, I am not sure it is right. Please point out any errors. @leehomyc
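For reference, a minimal sketch (shapes made up, not from the repo) of Eqn. (2), o_j = sum_i beta_{j,i} h(x_i) with beta_{j,i} = softmax over i of f(x_i)^T g(x_j). It checks that the bmm(value, attn) form and the permuted form agree once the indexing of beta is kept consistent (the permuted form must use the transpose of beta):

import torch
import torch.nn.functional as F

B, C, Cp, N = 2, 16, 2, 9              # batch, channels, C // 8, number of pixels
f = torch.randn(B, Cp, N)              # f(x)
g = torch.randn(B, Cp, N)              # g(x)
h = torch.randn(B, C, N)               # h(x)

s = torch.bmm(f.permute(0, 2, 1), g)   # s[b, i, j] = f(x_i)^T g(x_j)
beta = F.softmax(s, dim=1)             # normalize over i, so beta[b, i, j] = beta_{j,i}

o1 = torch.bmm(h, beta)                # o1[b, c, j] = sum_i h(x_i) * beta_{j,i}
o2 = torch.bmm(beta.permute(0, 2, 1), h.permute(0, 2, 1)).permute(0, 2, 1)
assert torch.allclose(o1, o2, atol=1e-6)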

@heykeetae
Owner

We have updated the whole self-attention module; please check it out! The memory problem is solved, and we are convinced it agrees with the paper too.

@Entonytang
Author

With this implementation, do you get better performance than with the previous method? And can you tell me the final gamma value after training?

@heykeetae heykeetae reopened this Jun 8, 2018
@heykeetae
Owner

The performance, in honesty, is not distinguishable by eye; we should try IS or FID to quantify it. About gamma: the original authors' intent is unclear, and under this implementation it keeps increasing (or decreasing). It does not seem to converge for now, but one can try longer training to find out!
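A minimal sketch (toy objective, not the GAN loss) of how one might watch gamma evolve during training, using the SelfAttention module from above:

import torch

attn = SelfAttention(16)
opt = torch.optim.Adam(attn.parameters(), lr=1e-3)
x = torch.randn(4, 16, 8, 8)
for step in range(200):
    loss = attn(x).pow(2).mean()   # stand-in objective for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, attn.gamma.item())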

@liangbh6

@Entonytang @heykeetae Hi, I read the code and wonder how gamma changes during training. It is defined as self.gamma = nn.Parameter(torch.zeros(1)) in line 39 of sagan_model.py.

@liangbh6

Well, I have figured out that gamma is treated as a learnable parameter.
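Indeed, because it is wrapped in nn.Parameter, gamma is registered on the module and returned by parameters(), so any optimizer built from them will update it. A quick check with the SelfAttention module above:

attn = SelfAttention(16)
print([name for name, _ in attn.named_parameters()])
# ['gamma', 'query.weight', 'query.bias', 'key.weight', 'key.bias',
#  'value.weight', 'value.bias']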

@valillon

Related
