Questions from aiueogawa #158

Closed
JiahuiYu opened this issue Oct 15, 2018 · 45 comments
Comments

@JiahuiYu
Owner

@aiueogawa Hi, I have opened a dedicated issue for you. You have asked five questions and I have answered all of them. If anything is unclear, please ask here and we can continue the discussion here. Thanks.

@JiahuiYu
Owner Author

@aiueogawa Your questions are:

@JiahuiYu Thanks for the revised information.

Q1:
Your SN-PatchGAN discriminator produces HxWxC outputs, and the loss given in the following definition has the same HxWxC shape, because ReLU is an element-wise operation and the expectations are taken over samples.

[image: the SN-PatchGAN hinge loss definition from the DeepFillv2 paper]

How are the loss values aggregated into a single total loss: a simple sum or an average?
This difference (sum vs. average) affects the relative contributions of the reconstruction loss and the SN-PatchGAN loss to the generator's final objective, because you mentioned,

our final objective function for inpainting network is only composed of pixel-wise l1 reconstruction loss and SN-PatchGAN loss with default loss balancing hyper-parameter as 1 : 1.
in the DeepFillv2 paper.

Q2:
You describe the following two options for the memory and computational efficiency of the contextual attention layer in DeepFillv1:

extracting background patches with strides to reduce the number of filters
downscaling resolution of foreground inputs before convolution and upscaling attention map after propagation
and another option in DeepFillv2.

restricting the search range of contextual attention module from the whole image to a local neighborhood
In DeepFillv2, what values did you use as stride and downscale_rate?
And did you restrict the search range in CelebA-HQ training without user-guidance channels?

Q3:
OK, then in DeepFillv2, what values did you use as stride and downscale_rate for contextual attention?

Q4: How many discriminator training iterations did you run per generator training iteration?

Q5:
In DeepFillv2 paper,

The overall mask generation algorithm is illustrated in Algorithm 1. Additionally we can sample multiple strokes in single image to mask multiple regions.
Are multiple strokes used in training?

P.S.

Can you please ask all your questions at once instead of in a long back-and-forth?
I already tried to ask all my related questions at once, but you always answered only part of them, which is why I asked multiple times.
BTW, when a comment is edited to add questions, do you notice the update?

@JiahuiYu
Owner Author

@aiueogawa My answers are

@aiueogawa I have merged your questions and deleted redundant ones. Can you please ask all your questions at once instead of in a long back-and-forth, so that others who read this issue get a clean view?

Q1:
Reduce mean, as shown here.
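For readers following along, here is a minimal NumPy sketch of what "reduce mean" implies for the HxWxC discriminator output, assuming the hinge-loss form given in the DeepFillv2 paper; the tensor names `d_real` and `d_fake` are placeholders, not the repository's variables.

```python
import numpy as np

def sn_patchgan_losses(d_real, d_fake):
    """Hedged sketch of SN-PatchGAN hinge losses with reduce-mean aggregation.

    d_real, d_fake: discriminator outputs of shape (N, H, W, C) for real and
    generated (completed) images. Every element acts as a separate "patch"
    score; the mean collapses the H x W x C map into a single scalar loss.
    """
    relu = lambda t: np.maximum(t, 0.0)
    d_loss = np.mean(relu(1.0 - d_real)) + np.mean(relu(1.0 + d_fake))
    g_loss = -np.mean(d_fake)  # generator hinge loss
    return d_loss, g_loss
```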

Q2:
Option 3 is not used in the CelebA-HQ case; all results in the paper are without option 3. Option 3 is for real user cases where images are very large.

Q3:
The same as the one in this repo by default.

@aiueogawa
Sorry for the delay. I appreciate your understanding in condensing the questions. Thanks! Here are some answers.

Q4:
I use a 1:1 ratio for training the discriminator and generator.

Q5:
I use a random number of strokes between 1 and 4.
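As an illustration only (not the repository's mask generator), sampling 1 to 4 strokes could look like the sketch below; `draw_single_stroke` is a hypothetical helper standing in for one pass of Algorithm 1.

```python
import numpy as np

def sample_free_form_mask(h, w, draw_single_stroke, rng=np.random):
    """Illustrative only: union of 1-4 random strokes in an h x w binary mask.

    draw_single_stroke(mask, rng) is a hypothetical helper that draws one
    stroke (one pass of Algorithm 1 in the paper) into `mask` in place.
    """
    mask = np.zeros((h, w), dtype=np.float32)
    for _ in range(rng.randint(1, 5)):  # uniform over {1, 2, 3, 4} strokes
        draw_single_stroke(mask, rng)
    return mask

# Dummy stroke drawer just to make the sketch runnable: marks a random square.
def dummy_stroke(mask, rng):
    y, x = rng.randint(0, mask.shape[0] - 32), rng.randint(0, mask.shape[1] - 32)
    mask[y:y + 32, x:x + 32] = 1.0

mask = sample_free_form_mask(256, 256, dummy_stroke)
```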

@aiueogawa

@JiahuiYu Thanks for all the answers. I got it.

@JiahuiYu
Owner Author

@aiueogawa Great. Let me know if you still have questions.

@aiueogawa

@JiahuiYu Is it OK to ask questions about DeepFillv1 in this issue, or would another issue be better?

@JiahuiYu
Owner Author

@aiueogawa Sure, feel free to ask any question here. I apologize for merging your questions without your permission; the reason is that I would like to keep that issue about DeepFill v2 clean and clear for others (which is also why I keep it open). You can post any question here; this issue is open for you. :)

@JiahuiYu
Owner Author

@aiueogawa And we can have long conversations here as well.

@aiueogawa

aiueogawa commented Oct 15, 2018

@JiahuiYu Please do not apologize. I appreciate that you provide a place to ask questions.

It seems that there are gaps between the paper (DeepFillv1) and the implementation of contextual attention.

Q1.
In the paper, cosine similarity, i.e. the inner product of two normalized vectors, is used to match patches extracted from the foreground against those from the background.
The implementation, however, normalizes only the background patches and does not normalize the foreground patches. As a result, the similarity computed this way is not an actual cosine similarity and depends on the norm of the foreground patches.
Why don't you normalize the foreground patches?

Q2.
In the paper, the attention map is passed through a softmax along the background-patch dimension, so the sum of s* along that dimension should be 1.
In the implementation, however, yi is multiplied after the softmax by mm, a mask indicating the valid background region, so the sum of yi along the background-patch dimension becomes less than 1 whenever the valid region is not the whole background.
Why don't you adjust for this difference in the sum of the attention scores?

Q3.
For memory efficiency, the paper says

downscaling resolution of foreground inputs before convolution and upscaling attention map after propagation

The contextual_attention implementation, however, downscales not only the foreground but also the background before the convolution, and uses a strided deconvolution instead of upscaling the attention map after propagation.
Why do you do it this way?

@JiahuiYu
Owner Author

JiahuiYu commented Oct 15, 2018

@aiueogawa

Q1:
It is indeed a good question. The foreground normalization is dropped in our implementation because a follow-up softmax operation selects the best-matching background patch. Assume the norm of a foreground patch is N; normalizing would divide all of that patch's background matching scores by the same N, so the relative ranking is unchanged if we omit N.
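A quick NumPy check of this point (illustrative, not the repo's code): scaling one foreground patch's matching scores by a positive constant N changes the softmax values but not the ranking, so the selected background patch is the same.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([0.8, 2.3, 1.1, 0.2])  # matching scores of one fg patch vs. bg patches
N = 3.7                                  # hypothetical norm of that foreground patch

# The softmax values change (it acts like a temperature), but the ordering does
# not, so the best-matching background patch is the same either way.
assert np.argmax(softmax(scores)) == np.argmax(softmax(scores / N))
```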

Q3:
If I understand your question correctly, upscaling the attention map is also an option. You may also want to know why we downscale before computing the matching scores but use high resolution in the deconvolution. The principle is that we want to paste high-resolution background patches when reconstructing the foreground (the matching scores can be "low-resolution" because they usually have spatial coherence).

@aiueogawa

@JiahuiYu Thanks! I'm reading your answer now. BTW, I'm sorry, but I made a mistake and submitted the comment while still editing it. I updated it after several minutes, but you may have missed the update.

@JiahuiYu
Owner Author

@aiueogawa No problem.

Q2:
You are very careful. The fact is that it does not matter much. In practice we find the softmax scores are usually very discriminative (e.g. something like [0.99, 0.001, 0.001, ..., 0.002]). In those cases, multiplying by zeros should not have much effect. And you are correct that re-normalizing the scores would be more accurate, but it needs additional computation.
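For completeness, a small NumPy sketch of the trade-off discussed here (illustrative numbers, not the repository's implementation): when the softmax is peaked, masking after the softmax and masking-plus-renormalizing give almost the same result.

```python
import numpy as np

attn = np.array([0.99, 0.004, 0.004, 0.002])  # softmax scores over bg patches
valid = np.array([1.0, 1.0, 0.0, 1.0])        # mask of valid background patches

masked = attn * valid                          # what the code does: sum < 1
renormed = masked / (masked.sum() + 1e-8)      # optional extra step: sum == 1
# With scores this peaked, `masked` and `renormed` are nearly identical, which
# is why skipping the extra division costs little accuracy.
```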

@JiahuiYu
Owner Author

@aiueogawa No worries, you can ask any question here anytime and I will answer as soon as I see it. I can tell you have looked into the fine details, and I appreciate it. Admittedly the implementation of contextual attention is complicated and I have tried to simplify it, but it seems all components, including score fusion, are necessary for a good result.

@aiueogawa

@JiahuiYu Thanks. I understand all of the answers above. BTW, I have already read almost all of the previous issues in this repository, so I don't need answers to questions that are already covered there.

Q4.
In the paper, the attention map is propagated after the sigmoid function.
In the implementation, however, the attention map is propagated over the convolution outputs, before the sigmoid function.
What do you think about this?

@JiahuiYu
Owner Author

JiahuiYu commented Oct 15, 2018

@aiueogawa

Q4:

I think you mean the softmax function rather than the sigmoid function here.

Good question! At first glance it seems like we should do score fusion after the softmax. However, if we did, we would need to re-normalize the fused scores so that they sum to one.

By the way, you may wonder why the re-normalization is not important in Q2. It is because there the foreground is already masked before the softmax, so those entries end up with very small score values. Q4 is not that case.
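A compact way to see the two orderings side by side (illustrative NumPy; the element-wise reweighting below is only a stand-in for the layer's spatial score propagation): fusing before the softmax yields a proper distribution for free, while fusing after it requires an explicit re-normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw_scores = np.array([2.0, 0.5, 1.2])   # pre-softmax matching scores
fusion = np.array([1.1, 0.9, 1.0])       # stand-in for propagated fusion weights

a = softmax(raw_scores * fusion)         # fuse before softmax: sums to 1 by design

b = softmax(raw_scores) * fusion         # fuse after softmax: sum drifts from 1
b = b / b.sum()                          # so an explicit re-normalization is needed
print(a.sum(), b.sum())                  # both equal 1.0 only after the extra step
```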

@dgrid

dgrid commented Oct 20, 2018

@JiahuiYu
Q5:

It seems that in DeepFillv1 the ELU activation is used for the generator; did you still use ELU as the activation in DeepFillv2?

UPDATE:
I'm sorry, but I commented from a different GitHub account.
dgrid is the same person as aiueogawa.

@JiahuiYu
Owner Author

Yes. Unless explicitly mentioned in the DeepFill v2 paper, all settings are the same.

@aiueogawa

@JiahuiYu
Q5-2:
According to #62 (comment), you said

In your code, the default activation seems None. We use ReLU as activation.

but in your answer to Q5 above, you said you used the ELU activation.

Which is right?

@JiahuiYu
Owner Author

Sorry for the confusion. I have checked the code and confirmed that we use ELU as the activation. I have updated that issue as well. Thanks for pointing it out.

@JiahuiYu
Owner Author

@aiueogawa Have you asked a Q6? I saw part of the question in my notification feed.

@aiueogawa

@JiahuiYu I had asked a Q6, but I noticed you had already answered it in another thread (it was about a typo in the free-form mask algorithm), so I deleted the question. Sorry for the confusion.

@JiahuiYu
Owner Author

@aiueogawa No need to apologize. You have contributed a lot by asking these questions and helping me fix errors and typos. Feel free to ask any question. :)

@aiueogawa

aiueogawa commented Oct 21, 2018

@JiahuiYu
Q7: (Q6 was deleted, as noted above)
#62 (comment) says you used batch size 24 on an NVIDIA Tesla V100 with 16GB memory.
I use exactly the same GPU and experimental settings as you, but I cannot run the experiments without an OOM error.
I did some debugging experiments, and even just a forward pass through the coarse generator requires around 30GB at batch size 24.
How were you able to run such a big model with a large batch size?
Did you perhaps use multiple GPUs?

@aiueogawa

aiueogawa commented Oct 21, 2018

@JiahuiYu
Q8:
In DeepFillv1, the generator's inputs in the implementation are a combination of the incomplete image, ones, and the mask.
According to another issue, the ones are required for the mirror-padding trick, and that part is fine.
My question is about the mask.
The mask is computed as ones_x*mask (https://github.com/JiahuiYu/generative_inpainting/blob/master/inpaint_model.py#L42), but I don't think this makes sense.
mask is 1 in the masked region and 0 otherwise, while ones_x is 1 everywhere.
So ones_x*mask should be exactly the same as mask.
Why do you compute the mask this way?

UPDATE:
It seems that the mask is replicated across the batch dimension by broadcasting.
Is this right?

@JiahuiYu
Owner Author

Hi @aiueogawa

Q7:
I did not use multiple GPUs. What's your image resolution? Have you slimmed the channels by 25%, as mentioned in the DeepFillv2 paper, page 6, last line of the left column?

Q8:
You are correct that the values do not change, but the output tensor shape does change, which is needed for concatenation. The original mask has batch size 1.
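A tiny NumPy illustration of that shape change (the tensor names are assumptions, not the exact repo code): multiplying a batch-1 mask by a batch-N tensor of ones broadcasts the mask to the batch size so it can be concatenated with the image channels.

```python
import numpy as np

batch, h, w = 4, 256, 256
ones_x = np.ones((batch, h, w, 1), dtype=np.float32)   # same batch size as the images
mask   = np.ones((1,     h, w, 1), dtype=np.float32)   # a single sampled mask

mask_b = ones_x * mask            # values unchanged, shape -> (4, 256, 256, 1)
x = np.zeros((batch, h, w, 3), dtype=np.float32)       # stand-in incomplete images
net_in = np.concatenate([x, ones_x, mask_b], axis=3)   # (4, 256, 256, 5)
```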

@aiueogawa

aiueogawa commented Oct 21, 2018

@JiahuiYu

I did not use multiple GPUs. What's your image resolution? Have you slimmed the channels by 25%, as mentioned in the DeepFillv2 paper, page 6, last line of the left column?

I'm now working on reproducing a CelebA-HQ experiment, so the image resolution is 512 x 512.
Of course, I've slimmed the channels by 25%, e.g. cnum = 32 * 0.75, i.e. from 32 channels down to 24.

UPDATE:
Furthermore, my generator has 4.2M parameters, which is only slightly more than your 4.1M, so the implementations do not seem very different.

@JiahuiYu
Owner Author

@aiueogawa I see. I use an image resolution of 256x256 for training, so the feature maps are much smaller.
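A back-of-the-envelope way to see the gap (rough arithmetic only, ignoring weights and optimizer state): every feature map at 512x512 holds four times as many activations as at 256x256, so at the same batch size the activation memory grows by roughly 4x.

```python
# Rough activation-memory ratio between the two training resolutions.
# This ignores weights, optimizer state, and workspace memory; it only shows
# why batch size 24 can fit at 256x256 but overflow 16GB at 512x512.
pixels_256 = 256 * 256
pixels_512 = 512 * 512
print(pixels_512 / pixels_256)  # 4.0 -> roughly 4x more activation memory per image
```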

@aiueogawa

@JiahuiYu
Q9:
Your free-form mask algorithm sometimes moves outside the image.
For example, assuming the image resolution is 256x256 and (startX, startY) = (1, 1), when the angle is chosen toward the nearby corner of the image, the line stroke moves out of the image.
How do you deal with such a situation, even though it is of course rare?

@aiueogawa

@JiahuiYu
Q10:
If the generator is trained at an image resolution of 256x256, can inference (actual use) be done at 512x512 or larger resolutions?

@JiahuiYu
Owner Author

Q9:
It is trivial, since one can just clip the part of the stroke that falls outside the image.

Q10:
Yes for the case of natural scenes. But for faces, I don’t think so.
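For what it's worth, a minimal sketch of the clipping mentioned in the Q9 answer (illustrative, not the repository's mask code): clamp each stroke vertex to the image bounds before drawing the line segment.

```python
import numpy as np

def clip_vertex(x, y, w, h):
    """Clamp a stroke vertex so the drawn line segment stays inside a w x h image."""
    return int(np.clip(x, 0, w - 1)), int(np.clip(y, 0, h - 1))

# A vertex proposed at (300, -12) in a 256x256 image becomes (255, 0).
print(clip_vertex(300, -12, 256, 256))
```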

@aiueogawa

@JiahuiYu
Q11:
In train.py, you use the same AdamOptimizer for both the discriminator and the generator, so they share the optimizer's internal variables, e.g. the moving-average estimates of the gradients' mean and variance.
I think one usually uses separate optimizers for them.
Why do you use a shared optimizer?

@aiueogawa

aiueogawa commented Oct 25, 2018

@JiahuiYu
Q12:
I'm now training with the same settings as DeepFillv2 on CelebA-HQ at 256x256 resolution.
However, the discriminator loss, in other words the SN-PatchGAN loss, hovers around a high level.
In my experience with vanilla GAN settings, the discriminator loss moves over a much broader range, from high to low.
For my experiment with your settings, the discriminator loss and related information are shown in the following TensorBoard screenshot.

adversarial_loss: the hinge loss, ranging from 0 to 1
discriminator_loss: exactly the same as adversarial_loss
real_scores: mean discriminator output for real images (drawn from the dataset)
real_loss: the real-image part of the hinge loss (before being multiplied by 0.5)
fake_scores: mean discriminator output for fake images (produced by the generator)
fake_loss: the fake-image part of the hinge loss (before being multiplied by 0.5)

[TensorBoard screenshot: discriminator loss and related curves]

Is this behavior of the discriminator loss desirable?

UPDATE: this TensorBoard summary covers epochs 1 to 40 with batch size 24.

@JiahuiYu
Owner Author

@aiueogawa

Q11:

It is fine because with Adam, each parameter has its own gradient statistics. At update time, we designate which parameters are being updated. In other words, using one optimizer or two is equivalent.
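Regarding this shared-optimizer point, here is a sketch of why it works, assuming the TensorFlow 1.x graph API used in this repository (the variables and losses below are made-up placeholders): a single AdamOptimizer creates separate moment slots for every variable it updates, so calling minimize twice with different var_list arguments behaves like using two optimizers.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Made-up placeholder variables/losses standing in for the real networks.
g_w = tf.get_variable('g_w', shape=[3], initializer=tf.zeros_initializer())
d_w = tf.get_variable('d_w', shape=[3], initializer=tf.zeros_initializer())
g_loss = tf.reduce_sum(tf.square(g_w - 1.0))
d_loss = tf.reduce_sum(tf.square(d_w + 1.0))

opt = tf.train.AdamOptimizer(1e-3)           # one shared optimizer object
g_op = opt.minimize(g_loss, var_list=[g_w])  # updates only generator variables
d_op = opt.minimize(d_loss, var_list=[d_w])  # updates only discriminator variables
# Adam allocates per-variable slots (m, v), so g_w and d_w never share statistics,
# which is why one shared optimizer is equivalent to two separate ones.
```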

Q12:

The curves look good.

@aiueogawa

@JiahuiYu
Q13:
Did you use mirror padding not only in the generator but also in the discriminator in DeepFillv2?

@aiueogawa

@JiahuiYu
Q14:
In your code, the same mask is used for all images in a batch.
When I use a different mask for each image in the batch, the loss behaves differently from the setting above.
Why do you use the same mask for all images in a batch?

@JiahuiYu
Owner Author

@aiueogawa

Q13:
For DeepFillv2, we just use 'SAME' padding.

Q14:
Using the same mask is an engineering concern: random mask sampling does take time. I have tried using different masks within the same mini-batch, and the curves are similar. Have you verified (visually) the correctness of the mask-sampling code you modified?

@aiueogawa

@JiahuiYu
Q13:
According to #62 (comment), you said you use ones as a part of the inputs.
The ones indicate the border, and convolving them with 'SAME' padding simulates(?) mirror padding, according to another issue.

For DeepFillv2, we just use 'SAME' padding.

I'm confused about whether or not you use the ones channel.

Q14:
I verified that my code is correct, and after enough iterations the behavior becomes similar to that of the original code.

@JiahuiYu
Owner Author

@aiueogawa I see why you are confused. In TensorFlow, 'SAME' padding means the output has the same spatial shape as the input, not that the padded values are the same as the input values. For your reference, from the TensorFlow docs:

The spatial semantics of the convolution ops depend on the padding scheme chosen: 'SAME' or 'VALID'. Note that the padding values are always zero.

Thus, we can use the ones channel to indicate the border. Hope that resolves your confusion. :)
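A small sketch of the trick (illustrative code, not the repository's layers): because 'SAME' convolution pads with zeros, an all-ones input channel produces smaller responses exactly at the border after the first convolution, which lets the network distinguish real content from padding.

```python
import numpy as np

def same_pad_box_filter(img, k=3):
    """k x k box filter with zero ('SAME') padding, as TensorFlow conv2d uses."""
    p = k // 2
    padded = np.pad(img, p, mode='constant', constant_values=0.0)
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].sum()
    return out

ones_channel = np.ones((5, 5), dtype=np.float32)
print(same_pad_box_filter(ones_channel))
# Interior responses are 9, border responses drop to 6 or 4: the ones channel
# encodes where the zero padding (i.e. the image border) is.
```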

@aiueogawa

@JiahuiYu No, I already know the behavior of 'SAME' and 'VALID' padding in TensorFlow.

For DeepFillv2, we just use 'SAME' padding.

This sounds like you don't use ones in DeepFillv2.
However, in another issue, you said you do use ones in DeepFillv2.
I can't tell which is right.

@JiahuiYu
Owner Author

JiahuiYu commented Oct 28, 2018

We DO use ones.

@aiueogawa

@JiahuiYu Thanks. Then the original Q13 can be restated as:

Did you use the ones channel not only in the generator but also in the discriminator in DeepFillv2?

@JiahuiYu
Owner Author

Not in discriminator.

@aiueogawa

@JiahuiYu
Q15:
In DeepFillv2, for CelebA-HQ, you detect facial landmarks and connect related nearby landmarks to form the training sketches.
What detector did you use? A pretrained one, or did you train a detector from scratch?
How many facial landmark points are detected?

@JiahuiYu
Owner Author

JiahuiYu commented Nov 1, 2018

I use a non-deep-learning method, so there is no pretrained deep model for detection. Dlib is the library I use to detect face landmarks, and the default number of key points is used.
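For reference, a minimal dlib landmark sketch (my own example, not the authors' preprocessing script; the 68-point model file name and the image path are assumptions):

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumed model file: dlib's commonly distributed 68-point shape predictor.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("face.png")          # any RGB face image (assumed path)
for rect in detector(img, 1):                   # upsample once while detecting
    shape = predictor(img, rect)
    points = np.array([[p.x, p.y] for p in shape.parts()])
    # Nearby related landmarks (eyes, brows, nose, lips, jawline) can then be
    # connected with line segments to form a sketch channel.
```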

@aiueogawa

@JiahuiYu
Q16:
Are the results in the paper inpainted from in-data or out-of-data images?

@JiahuiYu
Owner Author

Of course out-of-data, which means the network never sees the test data. This is answered in the FAQ in the README.
