[WIP] Add UDOP models #21239
Conversation
The documentation is not available anymore as the PR was closed or merged.
@sgugger @NielsRogge The model weights are here: https://huggingface.co/ZinengTang/Udop/tree/main, but how do we get the config for these models?
@raghavanone For reference, someone asked the same question on the UDOP repo: microsoft/i-Code#17
Note: Cannot proceed further without Microsoft releasing the entire weights. Currently the vision decoder weights have not been released.
If I'm not mistaken, the vision decoder weights should not be needed when using only the text-layout decoder part.
@raghavanone is there anything else blocking? It sounds like we can proceed with the given weights, assuming that we notify users that the vision decoder is not trained.
@logan-markewich Yes, I will work on closing this within a couple of days.
@sgugger Need some pointers on how this model should be tested. Can I follow the tests used for the T5 model and replicate similar tests?
@NielsRogge Any pointers here?
I hope it gets merged soon @raghavanone. Nice work :)
Thanks for the PR and the work adding this model! I've done my best to address everything in the first PR - let me know if anything is unclear.
Some general notes:
- There are numerous references to other models, in particular T5, throughout this PR. Can you make sure to update all of these to reflect the new model?
- Make sure to add docstrings to the public classes and functions that can be directly imported from the init.
- Consistent naming patterns: some of the classes' prefix is `Udop`, others `UDOP`. We're moving away from variable casing in model names, and all should be updated to be `Udop`.
- The model folder should be all lower case `udop`, i.e. `src/transformers/models/udop`.
- All custom layers in the modeling file should have a `Udop` prefix, e.g. `UdopBlock`.
- All building-block layers should take the config and initialise themselves with the values it contains, rather than having many keywords in the init.
- Use variable names that are as expressive as possible: e.g. use `batch_size` rather than `B`.
- Utilise `# Copied from` wherever possible.
- Make sure to add tests for all possible input modalities.
- There should be integration tests added for the models checking their logits.
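The config-driven initialisation pattern asked for above can be sketched as follows. Note this is a minimal illustration, not the actual UDOP implementation; the field names are assumed from the T5-style conventions discussed elsewhere in this PR:

```python
from dataclasses import dataclass


@dataclass
class UdopConfig:
    # Illustrative subset of configuration fields.
    d_model: int = 512
    d_ff: int = 2048
    num_layers: int = 6


class UdopBlock:
    """A building block that reads everything it needs from the config
    object, instead of taking many keyword arguments in __init__."""

    def __init__(self, config: UdopConfig):
        self.hidden_size = config.d_model
        self.intermediate_size = config.d_ff


config = UdopConfig()
block = UdopBlock(config)
print(block.hidden_size, block.intermediate_size)  # 512 2048
```

The benefit of this pattern is that adding a new hyperparameter only touches the config class, not every layer's constructor.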
Forgive my naiveté, why do all the tests call
Ah, I see that the test script they provide also uses T5-large, I expected it to use one of those checkpoints
@raghavanone how are things going with this so far? I'm very interested in using this model as soon as it gets integrated - if you need a hand with anything let me know! And thanks for bringing it into the library 😄
@thefirebanks I am working on fixing the last few tests. Hoping to close this PR very soon. Sorry for the delay.
@raghavanone I am currently trying to finetune
I explained what appears to be happening in this comment. It looks like the EDIT: Happy to make the changes myself with repo permissions.
@plamb-viso Yes, removing those parameters was not done in all places; I have fixed it locally. I am working on fixing the failing tests. This is the last step pending for merging. Fixing these tests is taking more time than expected.
README.md
Outdated
@@ -419,6 +419,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine
1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[Udop](model_doc/udop)** (from Microsoft) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/pdf/2212.02623.pdf) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
1. **[Udop](model_doc/udop)** (from Microsoft) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/pdf/2212.02623.pdf) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
1. **[UDOP](https://huggingface.co/docs/transformers/main/model_doc/udop)** (from Microsoft) released with the paper [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
Please run `make fix-copies` again to update the other READMEs.
docs/source/en/model_doc/udop.mdx
Outdated
specific language governing permissions and limitations under the License.
-->

# Udop
# Udop
# UDOP
docs/source/en/model_doc/udop.mdx
Outdated

## Overview
The udop models was presented in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/pdf/2212.02623.pdf) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. |
The udop models was presented in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/pdf/2212.02623.pdf) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
The UDOP model was presented in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
docs/source/en/model_doc/udop.mdx
Outdated
T5 comes in two variations :

- [udop-uni]

- [udop-dual]
Links will have to be added here. Also T5 -> UDOP
@@ -175,6 +175,8 @@
("transfo-xl", "TransfoXLConfig"),
("trocr", "TrOCRConfig"),
("tvlt", "TvltConfig"),
# ("udop_dual", "UdopConfig"),
@NielsRogge I am not sure how to handle this here. There are two models in UDOP, and both share the same config.
How do I set up `MODEL_FOR_PRETRAINING_MAPPING_NAMES`?
How do I set up `CONFIG_MAPPING_NAMES`?
This just needs to be `("udop", "UdopConfig")`. "udop" is the `model_type`.
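To illustrate why several model classes sharing one config is not a problem here: the auto classes resolve a single `model_type` string to a single config class. Below is a toy sketch of that kind of string-to-config mapping, not the real `transformers` machinery:

```python
# Toy version of the model_type -> config-class mapping kept by the auto classes.
# One "udop" entry suffices even if several model classes share UdopConfig.
CONFIG_MAPPING_NAMES = {
    "t5": "T5Config",
    "udop": "UdopConfig",
}


def config_class_for(model_type: str) -> str:
    """Look up the config class name for a given model_type string."""
    try:
        return CONFIG_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognised model_type: {model_type!r}")


print(config_class_for("udop"))  # UdopConfig
```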
with tempfile.TemporaryDirectory() as tmpdirname:
    torch.onnx.export(
        model,
        (config_and_inputs[1], config_and_inputs[3], config_and_inputs[2]),
@raghavanone Trying to understand how you are exporting the model here. Judging from `prepare_config_and_inputs`, you are passing values into the model for (the forward function parameters) `input_ids`, `seg_data` and `image`. But these parameters are not the first 3 parameters of the forward function. How does ONNX know that you are inputting values for `input_ids`, `seg_data` and `image` in this tuple?
In short, I'm having a lot of problems trying to `torch.jit.trace` or `torch.onnx.export` `UdopUniModelForConditionalGeneration` because (I believe) the forward function parameters I'm passing data into are not at the beginning of the forward function and in consecutive order. If you try to pass in `None` for the parameters you aren't filling, you get a `Only Tensors and (possibly nested) Lists, Dicts, and Tuples of Tensors can be traced` error (can't pass in a NoneType).
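For what it's worth, exporters pass the example-input tuple to `forward` purely by position, which is why a tuple of `(input_ids, seg_data, image)` lands in the wrong parameters when those are not the first three. A framework-free sketch of that positional binding using the standard library (the signature below is made up for illustration, not UDOP's real one):

```python
import inspect


# Stand-in for a forward() whose desired parameters are NOT the first three.
def forward(input_ids=None, attention_mask=None, encoder_outputs=None,
            seg_data=None, image=None):
    return input_ids, seg_data, image


# Exporters effectively call forward(*inputs) with the example-input tuple:
inputs = ("IDS", "SEG", "IMG")
bound = inspect.signature(forward).bind_partial(*inputs)

# The tuple fills the *first three* parameters, not the ones we intended:
print(dict(bound.arguments))
# {'input_ids': 'IDS', 'attention_mask': 'SEG', 'encoder_outputs': 'IMG'}
```

A common workaround is a small wrapper module whose `forward` takes exactly the filled inputs, in order, and passes them on to the real model by keyword.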
In fact, when I run this test in my env I get:

```
File "python3.9/site-packages/transformers/models/udop/modeling_udop.py", line 1512, in forward
  bbox = torch.clip(bbox, 0.0, 1.0)
TypeError: clip() received an invalid combination of arguments - got (NoneType, float, float), but expected one of:
 * (Tensor input, Number min, Number max, *, Tensor out)
 * (Tensor input, Tensor min, Tensor max, *, Tensor out)
```

Which is the same error I'm getting while trying to export my finetuned version of it.
I am able to get past this error if I rearrange the parameters I'm filling to the beginning of the forward function, in the same order as passed to `torch.onnx.export`. However, more errors arise:

```
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":614, please report a bug to PyTorch. We don't have an op for aten::full_like but it isn't a special case. Argument types: Tensor, bool, int, int, Device, bool, NoneType,
```
@plamb-viso Exporting to ONNX is yet to be done.
@raghavanone you can remove all ONNX export related code from this PR, we can add ONNX support for UDOP in a follow-up PR
Also ONNX support now lives fully in optimum, not Transformers.
Thanks @NielsRogge, @sgugger, I'll try not to muddy this PR up anymore with unrelated questions, but I'm curious: what would be Hugging Face's process for testing tracing/ONNX export of a new model (e.g. UDOP) after the PR supporting the model's inclusion into transformers has been merged? Just curious if you can set expectations at all for when users could expect some test code that verifies the UDOP model inside transformers can properly export.
This is the configuration class to store the configuration of a [`UdopDualForConditionalGeneration`] or a
[`UdopUnimodelForConditionalGeneration`]. It is used to instantiate a UdopDualForConditionalGeneration or
UdopUnimodelForConditionalGeneration model according to the specified arguments, defining the model architecture.
This is the configuration class to store the configuration of a [`UdopForConditionalGeneration`] or a
[`UdopDualForConditionalGeneration`]. It is used to instantiate a UdopForConditionalGeneration or
UdopDualForConditionalGeneration model according to the specified arguments, defining the model architecture.
Maybe let's simplify the names of the model classes a bit. Let's call the default model `UdopForConditionalGeneration` and the other variant `UdopDualForConditionalGeneration`.
Arguments:
    vocab_size (`int`, *optional*, defaults to 32128):
        Vocabulary size of the Udop model. Defines the number of different tokens that can be represented by the
Vocabulary size of the Udop model. Defines the number of different tokens that can be represented by the
Vocabulary size of the UDOP model. Defines the number of different tokens that can be represented by the
Vocabulary size of the Udop model. Defines the number of different tokens that can be represented by the
`inputs_ids`.
d_model (`int`, *optional*, defaults to 512):
    Size of the encoder layers and the pooler layer.
This probably is the hidden size of both the encoder and decoder?
Size of the intermediate feed forward layer in each `UdopBlock`.
num_layers (`int`, *optional*, defaults to 6):
    Number of hidden layers in the Transformer encoder.
num_decoder_layers (`int`, *optional*):
Any reason there's no `num_encoder_layers` argument?
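One possible convention here (T5 uses it, and UDOP could plausibly mirror it) is that `num_layers` sets the encoder depth and `num_decoder_layers` defaults to it when not given. A hedged sketch of that fallback, not the actual UDOP config code:

```python
class UdopConfigSketch:
    """Illustrative only: T5-style symmetry default for decoder depth."""

    def __init__(self, num_layers=6, num_decoder_layers=None):
        self.num_layers = num_layers
        # Fall back to a symmetric encoder/decoder when not specified.
        self.num_decoder_layers = (
            num_decoder_layers if num_decoder_layers is not None else num_layers
        )


print(UdopConfigSketch().num_decoder_layers)                       # 6
print(UdopConfigSketch(num_decoder_layers=12).num_decoder_layers)  # 12
```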
@raghavanone I saw you closed this PR. Skimming over your work, the PR seemed to be in a rather good state. Were there any blockers you encountered? IMO, it would be nice to add UDOP models to Hugging Face at some point.
@maxjeblick @NielsRogge feels that the code in the original repo is a bit hacky; he is working on a separate PR with a better UDOP implementation, so I closed this one in consultation with him. He should open a PR soon. @NielsRogge please do add more details for the benefit of folks following this PR.
Thanks a lot for the fast reply!
@NielsRogge @raghavanone please link the new PR when it's available for people subscribed to this one
Hi yes I'll open a PR soon! Thanks a lot for your work already @raghavanone, will ping you on the PR
Hi @NielsRogge I saw the large amount of commits on your new UDOP branch, curious if you have any idea on when you think a PR might be ready
Sorry to keep hammering on this, but again I have noticed a flurry of activity on that branch and then almost 2 weeks off. Curious what the plan is for it @NielsRogge
Hi @plamb-viso sorry for the late reply, the model is working, I only have limited time to work on it. I'll open a PR this weekend/Monday. For now you can already use the model if you're curious; check this code example regarding usage. The model is already on the hub here.
Out of curiosity @NielsRogge: did you ever use your implementation to fine-tune it on a task like CORD?
I've fine-tuned the model on a toy dataset of RVL-CDIP; it works well but the model is pretty heavy, got OOM on Google Colab even with batch size = 1 so had to use a bigger GPU. The author only released large variants.
In my original work on @raghavanone's version of the model, I also had to use a batch size of 1 to get it to not OOM on 40GB GPUs.
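When memory forces the per-step batch down to 1, gradient accumulation is the usual way to recover a larger effective batch: averaging per-micro-batch gradients reproduces the full-batch gradient while holding only one example's activations in memory at a time. A framework-agnostic numeric sketch of that equivalence (not UDOP-specific):

```python
# Gradient of mean-squared-error loss w.r.t. a scalar weight w for y_hat = w * x.
def grad(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*x*(w*x - y))
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Full-batch gradient over all four examples at once.
full = grad(w, xs, ys)

# Accumulate over micro-batches of size 1, then average: the same gradient,
# but only one example is ever "in the batch" at a time.
accum = sum(grad(w, [x], [y]) for x, y in zip(xs, ys)) / len(xs)

print(abs(full - accum) < 1e-12)  # True
```

In a training framework the same idea is one optimizer step per N micro-batch backward passes, with the loss scaled by 1/N.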
#20650