
Fine tuning TrOCR on 22 Indian Languages #25132

Closed · AnustupOCR opened this issue Jul 27, 2023 · 4 comments

@AnustupOCR

Yeah, that definitely will change behaviour. If you check
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

print(model.config.decoder.decoder_start_token_id)

you'll see that it's set to 2.

However, if you set it to processor.tokenizer.cls_token_id, you are setting it to 0, whereas the model was trained with ID=2 as the decoder start token ID.

Originally posted by @NielsRogge in #15823 (comment)


Hi,
I have been working on TrOCR recently, and I am very new to these things.
I am trying to extend TrOCR to all 22 scheduled Indian languages.
I have used the AutoImageProcessor and AutoTokenizer classes, and for the encoder and decoder I have used BEiT and IndicBERTv2 respectively, since IndicBERTv2 supports all 22 languages.

In the above-mentioned reply there seems to be a mismatch: the model was originally trained with decoder_start_token_id=2, but when fine-tuning it is set to tokenizer.cls_token_id, which is 0. So should we explicitly set it to 2 before training?
Because after running 3 epochs on a 20M-example dataset, inference only generates dots and commas.

@ydshieh
Collaborator

ydshieh commented Jul 27, 2023

@AnustupOCR

This question is better suited to the Hugging Face Forum; the issue page here is for bug reports and feature requests.


However, it makes sense to try decoder_start_token_id=2, and to monitor the generation results earlier (don't wait for 3 epochs on a 20M-example dataset).
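
For example, a minimal sketch (assuming model is the VisionEncoderDecoderModel from your script; also verify that 2 really is the start token ID your decoder was trained with):

# Pin the start token the pretrained decoder expects (ID 2 for
# microsoft/trocr-base-stage1), rather than tokenizer.cls_token_id (ID 0).
# Setting it both at the top level and on the decoder config is harmless.
model.config.decoder_start_token_id = 2
model.config.decoder.decoder_start_token_id = 2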

BTW, you are using microsoft/trocr-base-stage1, which has a RobertaTokenizer (with an English-only vocabulary). It will be difficult for this model to learn the new languages. It may be better to use a TrOCR checkpoint with an XLMRobertaTokenizer, if there is one on the Hub.
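
If you want to quickly check which tokenizer class a checkpoint resolves to (and how it handles Indic text), something like this works:

from transformers import AutoTokenizer

# The stage1 checkpoint resolves to a RoBERTa byte-level BPE tokenizer with an
# English-only vocabulary, so Indic text is shredded into many byte-level
# pieces instead of meaningful subwords.
tok = AutoTokenizer.from_pretrained("microsoft/trocr-base-stage1")
print(type(tok).__name__)
print(tok.tokenize("নমস্কার"))  # expect long, unreadable byte-level fragments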

@AnustupOCR
Author

@ydshieh Sorry, I will surely shift to the Forum for my future queries.
But to clarify, I am not using microsoft/trocr-base-stage1 as the checkpoint.
I will attach the model, tokenizer, and image processor I am using below:
from transformers import VisionEncoderDecoderModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Encoder: BEiT image model; decoder: IndicBERTv2 (covers all 22 languages)
enc = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
dec = 'ai4bharat/IndicBERTv2-MLM-only'
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(enc, dec)
model.to(device)


from transformers import AutoTokenizer, TrOCRProcessor, BeitFeatureExtractor

image_processor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")

# Combine the image processor and tokenizer into a single TrOCR-style processor
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)
#processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")

# IAMDataset is my own Dataset class (definition omitted here)
train_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/synthtiger-1.2.1/results/bnnewtst/images/',
                           df=train_df,
                           processor=processor)
eval_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/synthtiger-1.2.1/results/bnnewtst/images/',
                          df=test_df,
                          processor=processor)

Any kind of help would really mean a lot.
Thank you so much!

@ydshieh
Collaborator

ydshieh commented Jul 27, 2023

So it's not a pretrained TrOCR (decoder) model, but just a VisionEncoderDecoderModel assembled from separate encoder and decoder checkpoints.

Note that ai4bharat/IndicBERTv2-MLM-only is actually an encoder model (I believe so, but you can verify), not a decoder model meant for generation. It should still be able to generate something, though.
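
Also note that from_encoder_decoder_pretrained does not wire up the generation-related token IDs for you; the usual TrOCR fine-tuning recipes set them explicitly on the config. A minimal sketch with the objects from your script (IndicBERTv2 is BERT-style, so I'm assuming [CLS]/[SEP] serve as the start/end tokens; verify the IDs against your tokenizer):

# These are not set automatically by from_encoder_decoder_pretrained:
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size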

The best suggestions I could provide:

  • Run generation on a small example and check which token ID is used as the starting token (see the sketch below).
  • Run a dummy training and inspect what the examples look like after encoding, and what the model actually receives as inputs (especially whether the first token matches the one seen above).
  • Run the real training, but do generation at an earlier stage. You can use predict_with_generate=True (and set do_eval) to verify whether there is some progress.
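
A minimal sketch of the first check (the file name is just a placeholder; use any sample from your dataset, with processor and model as defined in your script):

from PIL import Image
import torch

# Run a short generation on a single line image.
image = Image.open("sample_line.png").convert("RGB")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(model.device)

with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_new_tokens=20)

# Index 0 of the output is the decoder start token; compare it with the config.
# (If generate() complains that decoder_start_token_id is not set, that is
# already your answer.)
print(generated_ids[0][:5])
print(model.config.decoder_start_token_id)
print(processor.batch_decode(generated_ids, skip_special_tokens=False))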

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 3, 2023