
Fine tuning TrOCR on 22 Indian Languages #25132

Closed · AnustupOCR opened this issue Jul 27, 2023 · 4 comments

@AnustupOCR

Yeah, that definitely will change behaviour. If you check
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

print(model.config.decoder.decoder_start_token_id)

you'll see that it's set to 2.

However, if you set it to processor.tokenizer.cls_token_id, you are setting it to 0, whereas the model was trained with ID=2 as the decoder start token ID.

Originally posted by @NielsRogge in #15823 (comment)


Hi,
I have been working on TrOCR recently, and I am very new to these things.
I am trying to extend TrOCR to all 22 scheduled Indian languages.
I have used the AutoImageProcessor and AutoTokenizer classes, and for the encoder and decoder I have used BEiT and IndicBERTv2 respectively, since IndicBERTv2 supports all 22 languages.

In the above-mentioned reply there seems to be a mismatch: the model was originally trained with decoder_start_token_id=2, but when fine-tuning it is set to tokenizer.cls_token_id, which is 0. So should we explicitly set it to 2 before training?
Because after running 3 epochs on a 20M-example dataset, inference only generates dots and commas.

@ydshieh
Collaborator

ydshieh commented Jul 27, 2023

@AnustupOCR

This question is better suited to the Hugging Face Forum; the issue page here is for bug reports and feature requests.


However, it makes sense to try decoder_start_token_id=2, and to monitor the generation results earlier (don't wait for 3 epochs on a 20M-example dataset).
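
For example, a minimal sketch (assuming model is the VisionEncoderDecoderModel from your script; also verify that 2 really is the start token ID your decoder was trained with):

# Pin the start token the pretrained decoder expects (ID 2 for
# microsoft/trocr-base-stage1), rather than tokenizer.cls_token_id (ID 0).
# Setting it both at the top level and on the decoder config is harmless.
model.config.decoder_start_token_id = 2
model.config.decoder.decoder_start_token_id = 2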

BTW, you are using microsoft/trocr-base-stage1, which has a RobertaTokenizer (with an English-only vocabulary). It will be difficult for this model to learn the new languages. It may be better to use a TrOCR checkpoint with an XLMRobertaTokenizer, if there is one on the Hub.
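
If you want to quickly check which tokenizer class a checkpoint resolves to (and how it handles Indic text), something like this works:

from transformers import AutoTokenizer

# The stage1 checkpoint resolves to a RoBERTa byte-level BPE tokenizer with an
# English-only vocabulary, so Indic text is shredded into many byte-level
# pieces instead of meaningful subwords.
tok = AutoTokenizer.from_pretrained("microsoft/trocr-base-stage1")
print(type(tok).__name__)
print(tok.tokenize("নমস্কার"))  # expect long, unreadable byte-level fragments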

@AnustupOCR
Author

@ydshieh Sorry, I will surely shift to the Forum for my future queries.
But to clarify, I am not using microsoft/trocr-base-stage1 as the checkpoint.
I will attach the model, tokenizer, and image processor I am using below:
from transformers import VisionEncoderDecoderModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Encoder: BEiT image model; decoder: IndicBERTv2 (covers all 22 languages)
enc = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
dec = 'ai4bharat/IndicBERTv2-MLM-only'
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(enc, dec)
model.to(device)


from transformers import AutoTokenizer, TrOCRProcessor, BeitFeatureExtractor

image_processor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")

# Combine the image processor and tokenizer into a single TrOCR-style processor
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)
#processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")

# IAMDataset is my own Dataset class (definition omitted here)
train_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/synthtiger-1.2.1/results/bnnewtst/images/',
                           df=train_df,
                           processor=processor)
eval_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/synthtiger-1.2.1/results/bnnewtst/images/',
                          df=test_df,
                          processor=processor)

Any kind of help would really mean a lot.
Thank you so much!

@ydshieh
Collaborator

ydshieh commented Jul 27, 2023

So it's not a pretrained TrOCR (decoder) model, but just a VisionEncoderDecoderModel assembled from separate encoder and decoder checkpoints.

Note that ai4bharat/IndicBERTv2-MLM-only is actually an encoder model (I believe so, but you can verify), not a decoder model meant for generation. It should still be able to generate something, though.
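
Also note that from_encoder_decoder_pretrained does not wire up the generation-related token IDs for you; the usual TrOCR fine-tuning recipes set them explicitly on the config. A minimal sketch with the objects from your script (IndicBERTv2 is BERT-style, so I'm assuming [CLS]/[SEP] serve as the start/end tokens; verify the IDs against your tokenizer):

# These are not set automatically by from_encoder_decoder_pretrained:
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size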

The best suggestions I could provide:

  • Run generation on a small example and check which token ID is used as the starting token (see the sketch below).
  • Run a dummy training and inspect what the examples look like after encoding, and what the model actually receives as inputs (especially whether the first token matches the one seen above).
  • Run the real training, but do generation at an earlier stage. You can use predict_with_generate=True (and set do_eval) to verify whether there is some progress.
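
A minimal sketch of the first check (the file name is just a placeholder; use any sample from your dataset, with processor and model as defined in your script):

from PIL import Image
import torch

# Run a short generation on a single line image.
image = Image.open("sample_line.png").convert("RGB")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(model.device)

with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_new_tokens=20)

# Index 0 of the output is the decoder start token; compare it with the config.
# (If generate() complains that decoder_start_token_id is not set, that is
# already your answer.)
print(generated_ids[0][:5])
print(model.config.decoder_start_token_id)
print(processor.batch_decode(generated_ids, skip_special_tokens=False))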

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 3, 2023