Finetuning on Due-Benchmark #71

Open
swathikirans opened this issue Jul 5, 2023 · 10 comments

Comments

@swathikirans

Hi,

I have been trying to finetune the model on due-benchmark using the provided script. However, the performance is quite low compared to the reported numbers. For example, DocVQA results in an ANLS score of 75 instead of the reported 84. I have two main queries.

  1. The provided checkpoint is missing one parameter: special_vis_token. For now, this parameter is initialized randomly. I am not sure if this has a significant impact on the final score.
  2. As per the paper, the input is prepended with a task-specific prompt. However, it seems this is not done for the due-benchmark tasks. Could this be the reason for the low performance?
@zinengtang
Collaborator

I think the main thing to focus on is the prompt. Finetuning with a different prompt affects the performance. Properly adding the 2D and 1D position embeddings is also important. Anything missing could result in a performance drop.

@swathikirans
Author

Thank you for the quick reply. So is it not possible to reproduce the results reported in the paper by running the published code without any changes? What is the exact prompt used for DocVQA? The prompt used in the RVL-CDIP code is different from the one mentioned in the paper, so I am not sure whether the prompt used for training DocVQA matches the paper either. It would be really helpful if you could provide all the details required to obtain the results reported in the paper.

@zinengtang
Collaborator

The prompt should be the same as in the paper: "question answering on DocVQA. [question]. [context]".
I am mostly curious about the position embedding/bias addition to the model, which matters a lot if not set up properly. Could you provide some more information? How many epochs did you run? If it still doesn't work, let me try to push the DocVQA finetuning code.
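For reference, a minimal sketch of how that prompt string might be assembled before tokenization; the helper name is hypothetical, and only the prompt format itself comes from this thread:

```python
def build_docvqa_input(question: str, context: str) -> str:
    # Hypothetical helper: task-specific prefix, followed by the question and the OCR context.
    return f"question answering on DocVQA. {question}. {context}"

# Example:
# build_docvqa_input("What is the invoice date?", "INVOICE NO. 123 DATE 05/07/2023 ...")
# -> "question answering on DocVQA. What is the invoice date?. INVOICE NO. 123 DATE 05/07/2023 ..."
```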

@swathikirans
Author

I used the same prompt as above. The modifications I made, applied after the point linked here, are as follows (a rough code sketch is included at the end of this comment):

  1. prepend the input_ids (item_dict["input_ids"]) with prompt_token_ids
  2. prepend the attention mask (item_dict["attention_mask"]) with N True values where N is the length of the prompt_token_ids
  3. prepend the bounding boxes (item_dict["seg_data"]["tokens"]["bboxes"]) with an Nx4 array of zero values where N is the length of the prompt_token_ids

I used this script for finetuning. Training always stops at around 4 epochs due to the early-stopping criterion.
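A rough sketch of the three modifications listed above, assuming `item_dict` has the structure described in this comment and that `prompt_token_ids` is a 1-D tensor of tokenized prompt ids; the function name is hypothetical and this is not the repo's actual code:

```python
import torch

def prepend_prompt(item_dict, prompt_token_ids):
    # Number of prompt tokens to prepend.
    n = prompt_token_ids.shape[0]

    # 1. prepend the prompt token ids to the input ids
    item_dict["input_ids"] = torch.cat([prompt_token_ids, item_dict["input_ids"]], dim=0)

    # 2. prepend N True/1 values to the attention mask
    prompt_mask = torch.ones(n, dtype=item_dict["attention_mask"].dtype)
    item_dict["attention_mask"] = torch.cat([prompt_mask, item_dict["attention_mask"]], dim=0)

    # 3. prepend an N x 4 block of zero bounding boxes for the prompt tokens
    bboxes = item_dict["seg_data"]["tokens"]["bboxes"]
    zero_boxes = torch.zeros(n, 4, dtype=bboxes.dtype)
    item_dict["seg_data"]["tokens"]["bboxes"] = torch.cat([zero_boxes, bboxes], dim=0)

    return item_dict
```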

@swathikirans
Author

I was using the Unimodal 224 model. However, according to the paper, the performance of the various models varies by at most ±2. Anyway, I will try the other models as well. Thanks for the input.

@swathikirans
Author

Hi, I tried the other two variants (512 and dual) as well. These models also did not yield any significant improvement. So far the best score obtained on the DocVQA task in the due-benchmark is 76.29, with the 512-resolution model.

@swathikirans
Author

Could you please provide the following details?

  1. Which model is used for preprocessing the data (generating memmaps)? Is it the t5-large provided by due-benchmark or the UDOP pretrained model?
  2. Which transformers version is used to train the model?

@zinengtang
Collaborator

  1. T5-base is used for preprocessing the data; the t5-large is the one from Hugging Face transformers.
  2. I've tested with 4.20 and 4.30.

By the way, which checkpoint did you use for evaluation, the one with the lowest validation loss or the last checkpoint? I am asking because loss is usually not a good indicator of the language score, and we usually use the last checkpoint.

@swathikirans
Author

  1. I used the T5-Large provided by due-benchmark for preprocessing the data.
  2. The recommended transformers version, 4.30.0, was giving a "loss does not have a grad function" error, so I had to replace the AdamW optimizer from transformers with the PyTorch one (a rough sketch of this swap is at the end of this comment). I also tried 4.20 with AdamW from transformers; however, there was no change in performance.

I used the last checkpoint (last.ckpt) to get the test predictions. Not sure what exactly is going wrong.
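The optimizer swap mentioned above amounts to something like the following hedged sketch; the model and hyperparameters are placeholders, and only the idea of using torch.optim.AdamW in place of transformers.AdamW comes from this comment:

```python
import torch
from torch.optim import AdamW  # PyTorch's AdamW, replacing the deprecated transformers.AdamW

model = torch.nn.Linear(8, 8)  # placeholder stand-in for the actual UDOP model
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=1e-2)  # placeholder hyperparameters
```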

@eshani-agrawal

What are the resource requirements for finetuning on the DocVQA task?
