Fine tune TrOCR using bert-base-multilingual-cased #15823
Hi, in case you want to train a TrOCR model on another language, you can warm-start (i.e. initialize the weights of) the encoder and decoder with pretrained weights from the hub, as follows:
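A minimal sketch of that warm-starting approach (the checkpoint names below are illustrative; any pretrained vision encoder and any Urdu language model from the hub would do):

```python
from transformers import VisionEncoderDecoderModel, AutoTokenizer

# warm-start: encoder from a pretrained ViT, decoder from an Urdu RoBERTa
# (the decoder checkpoint name is an assumption, pick any Urdu model on the hub)
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # vision encoder
    "urduhack/roberta-urdu-small",         # text decoder (assumed checkpoint)
)
tokenizer = AutoTokenizer.from_pretrained("urduhack/roberta-urdu-small")

# set the special tokens used for generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```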
Here, I'm initializing the weights of the decoder from a RoBERTa language model trained on the Urdu language from the hub. |
Thank you @NielsRogge. I was not adding the last two configurations. Luckily, it's working now. Thanks a lot again :) |
Hey there, I want to know whether not using the processor will affect training accuracy. I've tried to replace the TrOCR processor with a ViT feature extractor and a RoBERTa tokenizer, as follows:
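Roughly, a sketch of that setup (the checkpoint names, `image`, `text`, and the max_length value here are placeholders):

```python
from transformers import ViTFeatureExtractor, RobertaTokenizer

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# encode an (image, text) pair manually instead of via TrOCRProcessor
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
labels = tokenizer(text, padding="max_length", max_length=64,
                   truncation=True, return_tensors="pt").input_ids
```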
After training on 998 images (IAM Handwriting) with image-text pairs, the model can't even recognize text from a training image. Is this related to the size of the training dataset, or is the processor important for the OCR case? |
Hi, are you using the following tokenizer? Because that is required for the model to work on another language. |
Which version of Transformers are you using? We recently fixed a related issue. Can the model properly overfit the 20 image-text pairs? |
Transformers version: 4.17.0 |
Can the model properly overfit the 20 image-text pairs? Is the image you're testing on included in the 20 pairs? |
No, the model is not generating correct text, even for an image from the training set. |
OK, then I suggest first debugging that; see also this post. |
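As a first debugging step, one could check whether the model can memorize a handful of pairs; a minimal sketch, assuming `train_dataset` from the tutorial notebook:

```python
from torch.utils.data import Subset, DataLoader

# sanity check: the model should be able to memorize 20 pairs;
# if the training loss does not approach zero here, the pipeline
# (labels, special tokens, processor) is the likely culprit
tiny_dataset = Subset(train_dataset, range(20))
tiny_loader = DataLoader(tiny_dataset, batch_size=4, shuffle=True)
```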
Hey @NielsRogge, after double-checking the image-text list and saving the model via |
Hi @NielsRogge. Thank you for your wonderful tutorial on fine-tuning TrOCR. I am trying to tune TrOCR for the Arabic language. I have collected and arranged the data as explained in your tutorial. Which pre-trained model do I need to use? Like the one you mentioned above for Urdu, are any pretrained weights available for Arabic? |
Hi, you can filter by language by clicking on the "models" tab, then selecting a language on the left: https://huggingface.co/models?language=ar&sort=downloads So you could, for instance, initialize the weights of the decoder with those of https://huggingface.co/aubmindlab/bert-base-arabertv02 |
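For example, a minimal sketch of warm-starting with that Arabic BERT as decoder (the encoder checkpoint here is an assumption; any pretrained vision encoder works):

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # assumed vision encoder
    "aubmindlab/bert-base-arabertv02",     # Arabic BERT decoder
)
```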
Hi @NielsRogge Thanks a lot for the tutorial. When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable? |
Hi!
All weights are updated! You initialize the weights of the encoder with those of a pre-trained vision encoder (like ViT), initialize the weights of the decoder with those of a pre-trained text model (like BERT, GPT-2) and randomly initialize the weights of the cross-attention layers in the decoder. Next, all weights are updated based on a labeled dataset of (image, text) pairs. |
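A quick way to verify this (a sketch, assuming `model` is the warm-started VisionEncoderDecoderModel):

```python
# all parameters require gradients by default after warm-starting
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")

# if one wanted to freeze the vision encoder instead, it would look like this:
for param in model.encoder.parameters():
    param.requires_grad = False
```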
Thanks for your answer @NielsRogge.
I have the following image:
I notice that it starts and ends with the token id 2 (does that mean that cls = eos in the pretraining phase?).
But once I run the configuration cell (specifically the line setting the decoder start token id), generation starts with a token which is just a dot when decoded by the tokenizer. |
Yeah, that will definitely change behaviour. If you check:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")
print(model.config.decoder.decoder_start_token_id)
```

you'll see that it's set to 2. However, if you set it to a different token, generation will start from that token instead. |
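One can also check what id 2 corresponds to in this checkpoint's vocabulary (for the RoBERTa-style vocabulary used here, id 2 should decode to the end-of-sequence token):

```python
print(processor.tokenizer.decode([2]))  # '</s>' in a RoBERTa-style vocabulary
```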
Hi @NielsRogge, my objective is to recognize Arabic characters from license plate images. I have segmented the words using EAST. A few words are displayed below. I have trained TrOCR with my custom data (Arabic alphabets). There are 29 alphabet characters in the dataset (only one alphabet per image). Combinations of basic alphabets lead to new letters, and I need to include these letters in the dataset as well. The dataset contains 2900 images (100 images per alphabet). I have used the following models for the encoder and decoder:
I changed max_length to 4 and the n-gram size to 1. I have not changed the vocab_size. I trained the model for 3 epochs. The CER is fluctuating: at step 1400 I got a CER of 0.54, after which it increased and fluctuated but never dropped below 0.54.
I am using the following code to predict the above images with the trained model. Please let me know the mistake I am making. Is it because I am training the model with individual character images rather than word images? Do I need to make some modifications to the config settings, or do I need to do something with the tokenizer? I have attached my training code link here: |
Hi, if you're only training the model on individual character images, then I'm pretty sure it won't be able to produce a reasonable prediction when you give it a sequence of characters at inference time. You would have to train on sequences of characters as well. Also, a max length of 4 is rather low; I would increase it during training. |
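Concretely, the generation settings could be restored to something like the tutorial's defaults (the exact values below are illustrative):

```python
# word-level generation settings, in the ballpark of the tutorial's values
model.config.max_length = 64
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
```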
Thank you. I will collect images of words and train again. Any suggestion on the minimum number of images needed for reasonable training? |
I would start with at least 100 (image, text) pairs, and as usual, the more data you have, the better. |
Can you explain to me why the CLS token is used as the decoder start token? |
@Samreenhabib hi, can I get your contact? I want to ask more about fine-tuning for the multilingual case. |
I think that's because the TrOCR authors initialized the decoder with the weights of RoBERTa, an encoder-only Transformer model. Hence, they used the CLS token as start token. |
Hi @NielsRogge, I'm having a problem with fine-tuning base TrOCR. I launched the training twice, and each time it stopped at an intermediate training step, raising the error:
I'm not sure if it comes from the tokenizer or the feature extractor (both in the TrOCR processor from the tutorial), or if it's because of how we call the dataset's label output.
I see we are using the |
Hi, in order to stack tensors, they all need to have the same shape. It seems like you didn't truncate some of the labels. |
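A sketch of label encoding with truncation, so every label tensor ends up with the same length (the max_length value is illustrative):

```python
labels = processor.tokenizer(
    text,
    padding="max_length",
    truncation=True,      # without this, long texts yield longer label tensors
    max_length=64,        # illustrative value
    return_tensors="pt",
).input_ids

# make the loss ignore padding, as in the tutorial
labels[labels == processor.tokenizer.pad_token_id] = -100
```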
Hi @Samreenhabib, kindly share with me the model configuration for the Urdu image data. Thanks. |
Hey @NielsRogge, I am stuck at one place and need your help.
Here are the encoder-decoder configurations:
Upon |
Hey, apologies for the late reply. I don't know if you are still looking for the configurations. The code was exactly the same as in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
For more questions, please use the forum, as we'd like to keep GitHub issues for bugs/feature requests. |
Hi @Samreenhabib, can you please share your code with me? I am trying to fine-tune TrOCR on a Bangla dataset and, being a beginner, I am facing lots of problems. It would be very helpful if you could share your code. I will be grateful to you. Thanks! |
Hi @SamithaShetty, as @NielsRogge notes in the comment above, questions like these are best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports. |
Hi, I have been facing some issues with my training on Hindi+Bengali (20M). Thank you so much. I will attach the initialisation cell below:

```python
image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)

from transformers import VisionEncoderDecoderModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
|
@AnustupOCR, it seems like you are not saving the processor according to your requirements. Please take a look at the code here: https://github.com/Samreenhabib/Urdu-OCR/blob/main/Custom%20Transformer%20OCR/Custom%20TrOCR.ipynb |
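For reference, saving and reloading the custom processor together with the model might look like this (the directory name is a placeholder):

```python
# save both, so the custom tokenizer/feature extractor travel with the model
processor.save_pretrained("my-trocr-checkpoint")
model.save_pretrained("my-trocr-checkpoint")

# reload later
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained("my-trocr-checkpoint")
model = VisionEncoderDecoderModel.from_pretrained("my-trocr-checkpoint")
```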
How do I pick an encoder and a decoder to fine-tune TrOCR on a specific language? |
@NielsRogge Hello sir, I am trying TrOCR on Devanagari handwritten text. I would like to know which decoder would be best for this? |
I am trying to use TrOCR to recognize Urdu text from images. For the feature extractor I am using DeiT, with bert-base-multilingual-cased as the decoder. I can't figure out what the requirements will be if I want to fine-tune a pre-trained TrOCR model but with the multilingual cased decoder. I've followed the https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR tutorial, but it can't understand Urdu text as expected, I guess. Please guide me on how I should proceed. Should I create and train a new tokenizer built for Urdu? If yes, how can I integrate it with ViT?