Add TF ViT MAE #16255
Conversation
* partially ported PT methods and classes of ViT MAE to TensorFlow.
* ported TFViTMAEIntermediate and TFViTMAEOutput.
* chore: addressed PR feedback.
* added TFViTMAEModel and started TFViTMAEDecoder.
* add: initial implementation of TF ViT MAE.
* fix: model output type.
* fix: a bunch of inconsistencies, but need to investigate tf.repeat().
* chore: PR feedback.
* fix: gather error.
* chore: resorted to the original PT vit_mae model.
* fix: fixed the gather error partially (real this time).
* feat: adding TF ViT MAE; model initializes but weight porting is flawed.
* partial fix: fixed the parameter names of the dropout layers, decoder_embed, and the decoder_layers. This helps with the proper cross-loading of the weights. Even with proper cross-loading, the debug test does not pass; my intuition is to dive a little deeper into the computations of the model.
* chore: applied make style.

Co-authored-by: ariG23498 <aritra.born2fly@gmail.com>
I just had a quick look and left a comment. I feel it is strange that there are style changes, as I thought you already updated the version (during your work on
Maybe you were in a different virtual Python environment while working on this PR?
So, first I deactivate the current Python virtual environment, then run the installation and run
I think I should fetch upstream before that and rebase.
Yeah, if your current venv is specific to your other work/projects and you don't want to change its installed packages.
You can try it. I always try to have a (very) recent commit from master to work on for a new PR. Hope the rebase is smooth in your case here.
torch.zeros(1, self.num_patches + 1, config.hidden_size),
requires_grad=False,
@ydshieh I am unable to do the rebasing as mentioned in #16255 (comment) as @ariG23498 is the repo owner.
Here's what I did before committing the changes:
- I cloned https://github.com/huggingface/transformers in a separate directory.
- Initialized a new Python virtual environment.
- `cd`'d to this cloned `transformers` and ran `pip install -e .[quality]`.
- After that, from the forked `transformers` directory (from the `tf-mae` branch), I ran `make style` while in the new virtual environment mentioned above.

Let me know if these steps are good to go.
I am also new to this use case. It seems logical and the effects are there. Therefore I think it is OK.
The documentation is not available anymore as the PR was closed or merged.
Updates: the repo-consistency check tells us that copied components have inconsistencies. Sorry if this sounds foolish, but we could not actually figure out those inconsistencies. Class-name changes and other relevant changes in the comment do not count here, I hope. To ensure the copying was indeed right, we went ahead and used a comparator tool to verify that's the case. Here are the individual links to the comparison results for the components that are said to have copy inconsistencies:
Note that we did run
We have also followed the copy-comment format from the PT script, ensuring they are well aligned. What else are we missing here?
Probably my previous comment didn't explain things clearly enough. Let's take an example with:
This
In order to fix the issue, there are 2 potential options:
So I would suggest trying option 1 first. If there are remaining places, we can use option 2 for those.
For the remaining # copied from issue,
to
should work.
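For illustration, the `# Copied from` markers discussed here follow a fixed format that the repo-consistency check (`make fix-copies` / `utils/check_copies.py`) parses; a minimal sketch, where the exact referenced module path and class name are assumptions for this example:

```python
# Hypothetical sketch of the "# Copied from" marker convention; the
# referenced path/class below is an assumption for illustration only.

# Copied from transformers.models.vit.modeling_tf_vit.TFViTSelfOutput with ViT->ViTMAE
class TFViTMAESelfOutput:
    """The consistency check compares this class body, line by line,
    against the referenced class after applying the ViT->ViTMAE renames;
    any remaining difference is reported as a copy inconsistency."""
    pass

print(TFViTMAESelfOutput.__name__)
```

This is why renaming a class inside the copied block without updating the `with ...->...` clause (or vice versa) trips the check even when the code itself is functionally fine.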
It turns out that there is still
to be changed... I didn't check the whole block previously, sorry. I don't always pull the latest commit from this PR. In general, it would be easier to detect if you run
Thanks for the prompt feedback, @ydshieh. It was my bad not to check the entire code block.
Don't worry. And just a potential correction (not really important): my previous comment "it would be easier to detect ..." might not be true for this particular line: I don't think there will be a diff visible on this particular line after running
All green ✅ Thanks @ydshieh for your prompt and valuable feedback!
# source: https://discuss.pytorch.org/t/random-seed-that-spans-across-devices/19735
torch.manual_seed(2)
# make random mask reproducible across the PT and TF model
np.random.seed(2)
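To make concrete why NumPy is seeded here, a minimal sketch (shapes below are illustrative, not the model's actual config): noise drawn from NumPy is deterministic under the seed, so the PT and TF tests can build an identical random mask instead of relying on each framework's own PRNG.

```python
import numpy as np

batch_size, seq_length = 2, 196  # illustrative shapes, not the real config

# make random mask reproducible across the PT and TF model
np.random.seed(2)
noise = np.random.uniform(size=(batch_size, seq_length))

# Re-seeding NumPy reproduces the exact same noise, which is what lets
# both frameworks' tests agree on the mask
np.random.seed(2)
assert np.array_equal(noise, np.random.uniform(size=(batch_size, seq_length)))

# Each framework can then argsort this shared noise to choose masked
# patches, sidestepping any framework-specific random number generation
ids_shuffle = np.argsort(noise, axis=1)
print(ids_shuffle.shape)  # (2, 196)
```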
For this change, it would be great to hear from @NielsRogge too.
Previously we only saw the issue caused by CPU/GPU PRNGs (for the PyTorch ViTMAE).
But now we also see the issue caused by the different frameworks.
I think using `numpy` is a good choice to address this issue. However, let's wait for Niels' opinion.
It was discussed with @NielsRogge beforehand, though. But I don't think this PR is contingent on this change, as it was all well discussed and reviewed beforehand. I might be wrong though.

> But now we also see the issue caused by the different frameworks.

Additionally, could you elaborate on the above?
I know we discussed on Slack. As far as I know, Niels knows we are going to use the new `noise` argument and he is happy with this change. However, regarding using `numpy` instead of `torch` (and updating the expected slice), I don't remember if Niels is aware of this change and expressed his opinion. I might be wrong (?).

Regarding "But now we also see the issue caused by the different frameworks.", it is just that we decided to use `numpy` because the PRNGs from torch and tf give different results even with the same seed, and we don't want to import torch in a TF (non PT-TF cross) test.
Thanks for clarifying!
> because the PRNGs from torch and tf will give different results even with the same seed,

I think you probably meant that the TF and PT PRNGs won't produce the same results even if they were seeded with the same number?
Yes, that's what I meant.
Once Niels is OK with using NumPy, I think it is better to have a comment like:

# make random mask reproducible across different frameworks (PyTorch, TensorFlow, etc.), as well as across different accelerators (CPU, GPU, etc.).
# note that the same seed on CPU and on GPU doesn't mean they spew the same random number sequences,
# as they both have fairly different PRNGs (for efficiency reasons).
# source (for PyTorch): https://discuss.pytorch.org/t/random-seed-that-spans-across-devices/19735

(just at this place and the corresponding place in `test_modeling_tf_vit_mae.py` should be fine; no need to apply it to other places in this PR)
Much more comprehensive. Thanks for suggesting 👌
So Niels is OK with this change; I will just let you update the comment, @sayakpaul. Thank you!
About
I was a bit worried it would break some testing at this moment, so I tested with
Full error log:

with tempfile.TemporaryDirectory() as tmpdirname:
> model.save_pretrained(tmpdirname, saved_model=True)
tests\vit_mae\test_modeling_tf_vit_mae.py:616:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src\transformers\modeling_tf_utils.py:1418: in save_pretrained
self.save(saved_model_dir, include_optimizer=False, signatures=self.serving)
..\..\..\..\miniconda3\envs\py39\lib\site-packages\keras\utils\traceback_utils.py:67: in error_handler
raise e.with_traceback(filtered_tb) from None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ({'pixel_values': <tf.Tensor 'pixel_values:0' shape=(None, None, None, None) dtype=float32>},)
kwargs = {}
def autograph_handler(*args, **kwargs):
"""Calls a converted version of original_func."""
# TODO(mdan): Push this block higher in tf.function's call stack.
try:
return autograph.converted_call(
original_func,
args,
kwargs,
options=autograph.ConversionOptions(
recursive=True,
optional_features=autograph_options,
user_requested=True,
))
except Exception as e: # pylint:disable=broad-except
if hasattr(e, "ag_error_metadata"):
> raise e.ag_error_metadata.to_exception(e)
E ValueError: in user code:
E
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 728, in serving *
E return self.call(inputs)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\modeling_tf_utils.py", line 816, in run_call_with_unpacked_inputs *
E return func(self, **unpacked_inputs)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 1074, in call *
E loss = self.forward_loss(pixel_values, logits, mask)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 1007, in forward_loss *
E target = self.patchify(imgs)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 975, in patchify *
E imgs = tf.cond(
E
E ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.
..\..\..\..\miniconda3\envs\py39\lib\site-packages\tensorflow\python\framework\func_graph.py:1129: ValueError
Changing transformers/tests/utils/test_modeling_tf_core.py, lines 124 to 135 in b320d87, fails with a similar error:
E   in user code:
E
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\modeling_tf_utils.py", line 816, in run_call_with_unpacked_inputs *
E return func(self, **unpacked_inputs)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 1074, in call *
E loss = self.forward_loss(pixel_values, logits, mask)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 1007, in forward_loss *
E target = self.patchify(imgs)
E File "C:\Users\33611\Desktop\Projects\transformers-huggingface\transformers\src\transformers\models\vit_mae\modeling_tf_vit_mae.py", line 980, in patchify *
E tf.debugging.assert_equal(imgs.shape[1], imgs.shape[2])
E
E             ValueError: None values not supported.

(This test is currently only used for just a few core NLP models.) (This test will still fail with
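For context, the `patchify` step appearing in both tracebacks reshapes images into flattened patches. A NumPy sketch of that operation (patch size, shapes, and the channels-last layout are assumptions for illustration): in NumPy every shape is concrete, which is exactly what the traced serving signature lacks; its `(None, None, None, None)` dims are what make `tf.cond` and `tf.debugging.assert_equal` fail above.

```python
import numpy as np

def patchify(imgs, patch_size=16):
    """Reshape (batch, height, width, channels) images into
    (batch, num_patches, patch_size**2 * channels) flattened patches.
    Illustrative NumPy sketch; not the model's actual implementation."""
    n, h, w, c = imgs.shape
    # These checks need concrete ints; with a traced TF signature h/w are
    # None, which is what the assertions in the traceback choke on.
    assert h == w and h % patch_size == 0
    grid = h // patch_size
    x = imgs.reshape(n, grid, patch_size, grid, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (n, grid, grid, p, p, c)
    return x.reshape(n, grid * grid, patch_size * patch_size * c)

patches = patchify(np.zeros((2, 224, 224, 3)))
print(patches.shape)  # (2, 196, 768)
```

In TF code, the usual workaround for this class of error is to read dynamic dims via `tf.shape(x)` at runtime instead of the static `x.shape`, so tracing with unknown dims does not see `None`.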
Thanks for investigating it, @ydshieh. Is there anything we can do in this PR to mitigate the problem?
Let's see what @gante thinks.
This is why I used
For the second one (#16255 (comment)), I am not too sure, since you mentioned that the test is only applied to some core NLP models.
My comments above are not about
It is more a question (to @gante).
Now I understand. Appreciate the class clarification.
@sayakpaul
@gante thank you! Changes made.
Can you confirm that the tests run with
btw ignore the failure in
Yes, I did run the tests before pushing the changes.
@gante over to you to take the reins.
CI green, slow tests were run, all TF weights on the hub 👉 merging. Great work @sayakpaul @ariG23498 💪
This PR adds the MAE [1] model in TensorFlow. It was developed by @ariG23498 and myself.
Fun facts about this PR:
transformers.

References:
[1] Masked Autoencoders Are Scalable Vision Learners
Update
The PR is now ready for review. @gante @Rocketknight1 @sgugger