Fine tuning TrOCR on 22 Indian Languages #25132
Comments
This question is better suited to the Hugging Face Forum; the issue page here is for bug reporting and feature requests. However, it makes sense to try … BTW, you use …
@ydshieh Sorry, I will surely shift to the Forum for my future queries.

```python
import torch
from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor, BeitFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# BEiT feature extractor for the image side, IndicBERTv2 tokenizer for the text side
image_processor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)
```
Any kind of help would really mean a lot.
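For reference, a minimal sketch of how a custom pair like this is usually assembled with `VisionEncoderDecoderModel`; the token-ID assignments below follow the common vision-encoder-decoder setup and are an assumption, not something confirmed in this thread:

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")

# Pair the BEiT encoder with IndicBERTv2 as the decoder; cross-attention
# layers are added to the decoder and initialized randomly.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/beit-base-patch16-224-pt22k-ft22k",
    "ai4bharat/IndicBERTv2-MLM-only",
)

# The special-token IDs must come from the new tokenizer,
# not from TrOCR's original RoBERTa defaults.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```

Since the cross-attention weights start out random, the decoder usually needs substantial fine-tuning before generation becomes coherent.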
So it's not from a pretrained TrOCRModel (decoder) model, but just a … Note that … The best suggestions I could provide: …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> If you print `model.config.decoder_start_token_id`, you'll see that it's set to 2. However, if you set it to `processor.tokenizer.cls_token_id`, then you set it to 0. But the model was trained with ID=2 as decoder start token ID.

Originally posted by @NielsRogge in #15823 (comment)
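The quoted values can be checked directly; a minimal sketch, assuming the `microsoft/trocr-base-handwritten` checkpoint (the checkpoint name is an assumption, not taken from the linked thread):

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

print(model.config.decoder_start_token_id)  # 2: what the decoder was trained with
print(processor.tokenizer.cls_token_id)     # 0: what cls_token_id would set it to
```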
Hi,
I have been working on TrOCR recently, and I am very new to these things.
I am trying to extend TrOCR to all 22 scheduled Indian languages.
From my understanding, I have used the AutoImageProcessor and AutoTokenizer classes, and for the encoder and decoder I have used BEiT and IndicBERTv2 respectively, as IndicBERTv2 supports all 22 languages.
In the above-mentioned reply, there seems to be a mismatch: the model was originally trained with decoder_start_token_id=2, but when fine-tuning it is being set to tokenizer.cls_token_id, which is 0. So should we explicitly set it to 2 before training?
Because after running 3 epochs on a dataset of 20M examples, when I run inference it's generating only dots and commas.
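If it helps, here is a sketch of the two options being weighed, continuing from the snippets above (the assignments and the test image are assumptions for illustration, not from this thread):

```python
from PIL import Image

# Option 1: when fine-tuning the original TrOCR decoder, keep the ID it was trained with.
model.config.decoder_start_token_id = 2

# Option 2: when the decoder is IndicBERTv2, derive the ID from that tokenizer instead,
# and use the same value consistently for training and generation:
# model.config.decoder_start_token_id = tokenizer.cls_token_id

# Quick inference check ("sample_line.png" is a hypothetical test image); degenerate
# output such as repeated dots and commas can be a symptom of a start-token mismatch.
image = Image.open("sample_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```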