Optim-wip: Fix failing tests, model download link, & more #656
Conversation
Thank you for the fixes, @ProGamerGov! It doesn't seem to fail on master. Are those Python files different on the optim-wip branch?
Looks like lint_test_py36_pip and lint_test_py37_conda tests are now failing due to this error:
I think the error could be a Python bug or something, as I haven't seen it until today.
I found the source of the error. I made an issue post about it on the PyTorch repo here: pytorch/pytorch#57421
* Remove `ImageTensor` test skips as the `torch.Tensor`'s `__new__` function has been fixed.
* Add tests for `ImageTensor` functions.
* Removed old `AlphaChannelLoss` code.
The issue with `torch.Tensor`'s `__new__` function has now been fixed, so the test skips are no longer needed.
I have discovered a new issue with the code. Running these two lines of code:
Results in this error:
If I add a line to set the init variable to the device that the color decorrelation transform is on, then I get the following error:
We don't have to solve this issue in this PR though, as it doesn't affect any of the current tutorials.
Edit: I have resolved the issue! Though my fix makes it harder to disable color decorrelation.
* Fix `NaturalImage` device bug.
* Set `decorrelate_init` default to `False`.
* Fix `NaturalImage` size type.
So, currently, if you want to disable color decorrelation / recorrelation, you need to set the relevant arguments yourself. We can resolve this specific issue in a future PR.
It now fits with the `Optional` type hint that it was given.
* Fix issue where the final value in a list was not selectable.
* Fix error when lists have a size of 1.
Well I have no idea what happened here:
My fix for the previous issue may have caused this, though I'm not sure how.
Edit: I can't reproduce the error at all, so maybe it's an issue with CircleCI?
Second edit: I think the test was failing because the parameters it used were tuned for the old behavior. I've fixed the issue now!
* Tests showed that using only 5 iterations was no longer sufficient to ensure the final loss values were less than the first loss values.
@NarineK The lines I altered in those files are exactly the same as the changes on master.
I looked into the FFTImage size thing, and there is a clear deviation between Ludwig's first Captum iteration of FFTImage and the Lucid code. The Captum code creates the image with the shape given by the specified size:
While Lucid derives the shape of the image from the shape of the frequency tensor (which the if statement changes when the width value is odd):
https://github.com/tensorflow/lucid/blob/master/lucid/optvis/param/spatial.py#L66-L67
Therefore, I think the special behavior for odd width values is required for the Lucid way of doing things, and not for the new Captum FFTImage. I think that Ludwig may have just forgotten to update that part of the code.
Thank you for the explanation, @ProGamerGov! Is there a specific reason why we chose to use a buffer instead of an instance field? If we use it as an instance field, it shouldn't cause any problems. I meant in ToRGB.
If they are exactly the same, then there won't be merge conflicts, but we will have the same changes in different commits. I think it is fine for now since those are minor changes, but it is better to sync up with master for larger changes.
@NarineK Ludwig's original Captum code had both of them set up as buffers.
Yeah, we can sync up with the master branch in a separate PR. I was just trying to avoid it in this PR so that the commit history wasn't filled with all the master commits, like what happened with SK's PR.
Oh, I think I understand why buffers were used now. They make it so that when a module is moved to a different device, the registered tensors are moved along with it. This code works when the module itself is placed on the target device. I guess we could override that behavior if needed.
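As a quick illustration of that buffer behavior (a minimal sketch, not the actual Captum code):

```python
import torch
import torch.nn as nn


class BufferDemo(nn.Module):
    """Minimal demo: a registered buffer follows the module across devices."""

    def __init__(self) -> None:
        super().__init__()
        # Buffers are part of the module's state, so .to() / .cuda() moves
        # them together with any parameters.
        self.register_buffer("transform", torch.eye(3))


demo = BufferDemo()
print(demo.transform.device)  # cpu
if torch.cuda.is_available():
    demo = demo.cuda()
    print(demo.transform.device)  # cuda:0
```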
Thank you for looking deeper into this:
I think that it is tricky to put a module on a device because PyTorch puts the actual tensors (parameters and buffers) related to that module on the device, and not the module per se. For example, we cannot query a device from the module itself the way we can from a tensor. To be clear, we can explicitly put the tensor on the right device before using it and use instance fields instead. Would that work?
Thank you for looking into this, @ProGamerGov! As you mentioned, for coeffs_shape Ludwig used the given size rather than deriving it from the frequency tensor. In terms of results, not shapes, are we getting different outcomes? Also, there is a division by a magic number of 50 (this isn't related to the dimensionality) that I don't see in Lucid.
@NarineK Yeah, moving the tensor to the input device for every forward call would work! But would that incur any sort of performance penalty for constantly moving the tensor to the target device? I tried testing it out, and there didn't seem to be any obvious effect on the time it takes to complete 512 iterations. Alternatively, we could just include an additional parameter for specifying the device. I've been disabling the color decorrelation like this for experiments:
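A rough sketch of one way this can be done, assuming the `NaturalImage` constructor accepts a `decorrelation_module` argument as in the snippet quoted later in this thread (argument names and import path are assumptions, not the exact code referenced above):

```python
# Path assumed from the review snippet captum/optim/_param/image/images.py
from captum.optim._param.image.images import NaturalImage

# Passing None for the decorrelation module should skip the ToRGB
# recorrelation step entirely.
image_param = NaturalImage(size=(224, 224), decorrelation_module=None)
```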
Ludwig hasn't been active lately, so we might have to wait a bit to ask him why he made certain design choices. Chris may be able to help, but I think he is busy at the moment as well. Our version of FFTImage deviates from the Lucid implementation, and I think that he intended Captum's version to work a bit differently.
Yes, I think it's necessary, as that way we get the right matching size for 5-dimensional FFT tensors:
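For illustration (not the snippet from the PR), the width ambiguity behind that special case can be seen with the real-valued FFT: widths 224 and 225 both produce a frequency tensor whose last dimension is 113, so the spatial width cannot be recovered from the frequency shape alone. The older `torch.rfft` API additionally returned a 5-dimensional real tensor with a trailing real/imaginary dimension of size 2, but the same ambiguity applies.

```python
import torch

# A real 2D FFT keeps only (w // 2) + 1 frequency columns.
even = torch.fft.rfft2(torch.randn(1, 3, 224, 224))
odd = torch.fft.rfft2(torch.randn(1, 3, 224, 225))
print(even.shape[-1], odd.shape[-1])  # 113 113
```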
I don't know or think that there will be performance issues; it is simply linking the device, and I don't think it will be noticeable. Alternatively, we could also pass the device as an optional argument to the constructor. The user would need to specify it explicitly, but we can specifically describe that the device is used for the transformation tensor. I think that users need to understand how the device is being used.
@NarineK I just realized that there may be a better way that we could solve this with minimal changes to the existing code: have ToRGB move its transform to the input tensor's device in the forward call. This way we would keep the buffer behavior as it is.
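A hypothetical sketch of that approach (not the actual `ToRGB` implementation): the transform stays a registered buffer, but it is aligned with the input's device on every forward call, which is a no-op when the buffer already lives on that device.

```python
import torch
import torch.nn as nn


class ToRGBSketch(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Placeholder 3x3 color transform stored as a buffer.
        self.register_buffer("transform", torch.eye(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # .to() is a no-op when the devices already match.
        transform = self.transform.to(x.device)
        # Apply the 3x3 transform across the channel dimension.
        return torch.einsum("...chw,cd->...dhw", x, transform)
```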
See: meta-pytorch#675 for more details.
Resolve the `ToRGB` device issue as mentioned in: meta-pytorch#656
@NarineK I've implemented the fix for the ToRGB device issue.
This means that the tensor coming out of decorrelation_module will always be on the CPU, even if the user explicitly puts it on the GPU in the decorrelation_module; we will override it with the CPU. Since this is not critical, we can leave it as is. I haven't reviewed the rest of the PR yet. Let me have a quick look.
captum/optim/_param/image/images.py

    decorrelation_module.cpu() if decorrelation_module is not None else None
)
if init is not None:
    assert not init.is_cuda
Enforcing the tensor to be on the CPU without a warning about why will make the user think that there is a bug in the code. We explicitly put it on the CPU; I assume this is because of decorrelation_module.cpu(). I think that it would be good to print a warning message about why we put decorrelation_module on the CPU and to explain it in the documentation. Perhaps also add a TODO so that we can make the module more flexible and don't have to move it to the CPU device.
@NarineK The init tensor had to be put on the CPU because FFTImage expected it to be on the CPU, but I just fixed it so that spectrum_scale is now placed on the init tensor's device.
In order to avoid any future issues with the decorrelation module being created in the function signature, I added a deepcopy line that only runs when it's a ToRGB instance.
I also removed the device assertion and made ToRGB place the transform on the input tensor's device as well, so that it becomes a no-op when the buffer is placed on the target device.
The device issues should now be resolved, I think?
I removed the deepcopy line as it was redundant.
It looks cleaner now.
Here squash_func is not assigned, right? We could also represent it as a one-line lambda function. I wonder why we use different squash_funcs depending on whether init is provided or not.
if squash_func is None:
    def squash_func(x: torch.Tensor) -> torch.Tensor:
        return x.clamp(0, 1)
Also, do you know what is meant by the comment # TODO: the world is not yet ready at line 460?
@NarineK The different squash function for when init tensors are provided was something I found to produce better results with images. Lambda functions were used previously, but SK thought inner functions were better.
The TODO line is because named dimensions are not fully supported by PyTorch yet.
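For reference, the one-line lambda equivalent of the inner-function default quoted above would look like this (a sketch only, and the name `clamp_squash` is just for this sketch; flake8's E731 rule discourages assigning lambdas to names, which is one possible reason the `def` form was preferred):

```python
import torch

# One-line lambda form of the default squash function.
squash_func = lambda x: x.clamp(0, 1)  # noqa: E731

# Equivalent inner-function form, as used in the PR.
def clamp_squash(x: torch.Tensor) -> torch.Tensor:
    return x.clamp(0, 1)
```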
Interesting, why are inner functions better in this case?
rename is used in line 440 too. I think we can leave it, but ideally we should only use it for the PyTorch versions that support it.
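For context on the named-dimension TODO, here is a small illustration of PyTorch's named tensor API (still an experimental feature, which is why it isn't relied on yet):

```python
import torch

# Dimensions can be labeled at creation time and renamed later.
x = torch.randn(1, 3, 8, 8, names=("B", "C", "H", "W"))
y = x.rename(C="channels")
print(y.names)  # ('B', 'channels', 'H', 'W')
```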
""" | ||
|
||
def forward(self, x: torch.Tensor) -> torch.Tensor: | ||
def __init__(self, *args, **kwargs) -> None: |
Why are the args and kwargs passed if they are not used? Are there any real scenarios where we need them? Is this because we replace ReLU with SkipLayer?
I could imagine, for example, that x is a tuple of tensors and we would need to return that tuple.
I made the change because users may have models using activ(inplace=False) or activ(False), where activ = torch.nn.ReLU. This is the same way that torch.nn.Identity works: https://pytorch.org/docs/stable/generated/torch.nn.Identity.html
I'll add the type hint for tuples of tensors.
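To illustrate why the constructor needs to tolerate extra arguments, here is a sketch of the replacement scenario; `replace_relu` is a hypothetical helper, not part of Captum:

```python
import torch.nn as nn


def replace_relu(model: nn.Module, replacement_cls) -> None:
    """Swap every ReLU for a stand-in layer, forwarding the kind of
    constructor argument the original activation would receive."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            # The stand-in must accept arguments like inplace=True,
            # just as nn.Identity does.
            setattr(model, name, replacement_cls(inplace=True))
        else:
            replace_relu(child, replacement_cls)


model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(inplace=True))
replace_relu(model, nn.Identity)  # nn.Identity ignores extra constructor args
```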
I've improved the documentation, added the type hints, and also provided a link to the nn.Identity class!
captum/optim/models/_common.py

def __init__(self, *args, **kwargs) -> None:
    super().__init__()

def forward(self, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
Why are the args and kwargs passed if they are not used? We should document it, because it is unclear why this is needed.
I made the change for situations like replacing the decorrelation module, as it specifies an additional argument in the forward pass. I'll add some documentation explaining why I added args and kwargs. Without this change, SkipLayer is exactly the same as torch.nn.Identity.
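Putting the two pieces together, a rough sketch of the behavior being described (not the exact Captum implementation): extra constructor and forward arguments are accepted and ignored, and the input tensor is returned unchanged.

```python
import torch
import torch.nn as nn
from typing import Any


class SkipLayerSketch(nn.Module):
    """Pass-through layer: extra constructor arguments (e.g. inplace=True from
    a replaced activation) and extra forward arguments (e.g. a flag passed to a
    replaced decorrelation module) are accepted and ignored."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__()

    def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
        return x


skip = SkipLayerSketch(inplace=True)
x = torch.randn(1, 3, 8, 8)
# 'inverse' stands in for whatever extra argument a replaced module might receive.
assert torch.equal(skip(x, inverse=True), x)
```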
Yes, this will work for ReLU with one input tensor and specific cases like ReLUs, but it will have issues with custom modules that take tuples of tensors and with modules such as nn.Linear, etc. I think that perhaps it would be good to give it a more specific name, or to describe that this is a skip layer that assumes the inputs and outputs have the same shape and that the first tensor is what gets returned as the output.
LGTM! Thank you for addressing the comments, @ProGamerGov!
I left two minor comments.
* Quick fix for the model file path, as it's breaking all the notebooks & tests. It also looks like `black` was updated and is now reporting an issue with `captum/attr/_core/feature_ablation.py` & `captum/metrics/_core/infidelity.py`, so I've resolved that as well. The files in the master branch may be a bit different, but we need to at least keep the tests passing for the other areas of Captum until we merge into master.
* Added new `ImageTensor` tests.
* Added `detach()` to `InputOptimization` to fix out-of-memory crashes.
* Made sure that `SkipLayer` can handle any additional arguments to its init and forward functions. This adds to the usefulness of the layer and improves it beyond just copying `nn.Identity`.
* Fixed various bugs.