
Fine tune TrOCR using bert-base-multilingual-cased #15823

Closed
Samreenhabib opened this issue Feb 24, 2022 · 36 comments

Comments
@Samreenhabib

I am trying to use TrOCR for recognizing Urdu text from images. For the feature extractor I am using DeiT, and bert-base-multilingual-cased as the decoder. I can't figure out the requirements for fine-tuning a pre-trained TrOCR model with a multilingual-cased decoder. I've followed the https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR tutorial, but the model can't understand Urdu text as expected. Please guide me on how I should proceed. Should I create and train a new tokenizer built for Urdu? If yes, how can I integrate it with ViT?

@NielsRogge
Contributor

NielsRogge commented Feb 25, 2022

Hi,

In case you want to train a TrOCR model on another language, you can warm-start (i.e. initialize the weights of) the encoder and decoder with pretrained weights from the hub, as follows:

from transformers import VisionEncoderDecoderModel

# Initialize the encoder from a pretrained ViT and the decoder from a pretrained language model.
# Note that the cross-attention layers will be randomly initialized and need to be fine-tuned on a downstream dataset.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "urduhack/roberta-urdu-small"
)

Here, I'm initializing the weights of the decoder from a RoBERTa language model trained on the Urdu language from the hub.
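For completeness, the tutorial also sets the special-token configuration on the composed model before training; a minimal sketch of that step, assuming the model and tokenizer checkpoints from the snippet above:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

# Tell the decoder which tokens start, pad, and end generation
# (these values mirror the TrOCR fine-tuning tutorial)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size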

@Samreenhabib
Copy link
Author

Thank you @NielsRogge. I was not adding the last two configuration lines. Luckily, it's working now. Thanks a lot again :)

@Samreenhabib
Author

Samreenhabib commented Mar 3, 2022

Hey there, I want to know whether not using the processor will affect training accuracy. I've tried to replace the TrOCR processor with the ViT feature extractor and the RoBERTa tokenizer as follows:

import torch
from PIL import Image
from torch.utils.data import Dataset

class IAMDataset(Dataset):
    def __init__(self, root_dir, df, feature_extractor, tokenizer, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.feature_extractor(image, return_tensors="pt").pixel_values
        labels = self.tokenizer(text, padding="max_length", max_length=self.max_target_length).input_ids
        # Replace padding token ids with -100 so they are ignored by the loss
        labels = [label if label != self.tokenizer.pad_token_id else -100 for label in labels]
        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
        return encoding

After training on 998 image-text pairs (IAM Handwriting), the model can't even recognize text from a training image. Is this related to the size of the training dataset, or is the processor important for the OCR case?

@Samreenhabib Samreenhabib reopened this Mar 3, 2022
@NielsRogge
Contributor

Hi,

Are you using the following tokenizer:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

?

Because that tokenizer is required for the model to work on another language.

@Samreenhabib
Author

Yes, I'm using ViT + urduhack/roberta-urdu-small as encoder and decoder. For testing purposes, I've trained this model on 20 image-text pairs. When I try to recognize text from an image, the output is composed of repeating words, as shown in the attached image.

[attached image: generated output with a repeating word]

I know the training sample is below the requirement; please point out what I'm doing wrong when recognizing text from an image:

model = VisionEncoderDecoderModel.from_pretrained("./wo_processor")
image = Image.open('/content/10.png').convert("RGB")
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values 
generated_ids = model.generate(pixel_values)
generated_text= decoder_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

@NielsRogge
Contributor

NielsRogge commented Mar 4, 2022

Which version of Transformers are you using? Because the eos_token_id must be properly set in the decoder.

We recently fixed the generate() method to take the eos_token_id of the decoder into account (see #14905).

Can the model properly overfit the 20 image-text pairs?

@Samreenhabib
Author

Transformers version: 4.17.0
Currently, model.config.eos_token_id = decoder_tokenizer.sep_token_id is set.

@NielsRogge
Contributor

NielsRogge commented Mar 4, 2022

> Can the model properly overfit the 20 image-text pairs?

Is the image you're testing on included in the 20 pairs?

@Samreenhabib
Author

Samreenhabib commented Mar 4, 2022

No, the model is not generating correct text even for an image from the training set.

@NielsRogge
Contributor

OK, then I suggest first debugging that; see also this post.

@Samreenhabib
Author

Hey @NielsRogge, after double-checking the image-text list and saving the model via trainer.save_model, the code is running and the output is as expected. Thanks for all the guidance. 👍

@tkseneee

> In case you want to train a TrOCR model on another language, you can warm-start the encoder and decoder with pretrained weights from the hub […] Here, I'm initializing the weights of the decoder from a RoBERTa language model trained on the Urdu language from the hub.

Hi @NielsRogge. Thank you for your wonderful tutorial on fine-tuning TrOCR. I am trying to tune TrOCR for the Arabic language, and I have collected and arranged the data as explained in your tutorial. Which pre-trained model do I need to use? Like the one you mentioned above for Urdu, are any pretrained weights available for Arabic?

@NielsRogge
Contributor

Hi,

You can filter on your language by clicking the "Models" tab, then selecting a language on the left: https://huggingface.co/models?language=ar&sort=downloads

So you can, for instance, initialize the weights of the decoder with those of https://huggingface.co/aubmindlab/bert-base-arabertv02
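A minimal sketch of that warm-start, following the same pattern as earlier in this thread (the ViT encoder checkpoint here is an assumption, not a prescription):

from transformers import VisionEncoderDecoderModel

# Encoder: pretrained ViT; decoder: pretrained Arabic BERT.
# The cross-attention layers are randomly initialized and must be fine-tuned.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "aubmindlab/bert-base-arabertv02"
)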

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 13, 2022

Hi @NielsRogge, thanks a lot for the tutorial. When using a VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?

@NielsRogge
Contributor

Hi!

> When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?

All weights are updated! You initialize the weights of the encoder with those of a pre-trained vision encoder (like ViT), initialize the weights of the decoder with those of a pre-trained text model (like BERT or GPT-2), and randomly initialize the weights of the cross-attention layers in the decoder. Next, all weights are updated based on a labeled dataset of (image, text) pairs.
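As a quick sanity check, a hedged sketch (with model the warm-started VisionEncoderDecoderModel): nothing is frozen unless you freeze it yourself.

# Every parameter of the composed model is trainable by default
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable}/{total} parameters are trainable")

# If you did want to freeze the encoder, you would do it explicitly:
# for p in model.encoder.parameters():
#     p.requires_grad = False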

@IlyasMoutawwakil
Member

Thanks for your answer @NielsRogge.
I have another question, about the configuration cell in the TrOCR tutorial: the cell in which we set the special tokens etc.
Is it normal that once we set model.config.decoder_start_token_id to processor.tokenizer.cls_token_id and go back to the model (still stage 1, no fine-tuning performed), the pretrained model's output changes (for the worse)?

Example:

I have the following image:

[attached image of printed French text]

I apply the pretrained model (trocr-base-stage1) to this image of printed text; it works fine, and the generated ids are:

tensor([[    2,   417,   108,   879,  3213,  4400, 10768,   438,  1069, 21454,
           579,   293,  4685,  4400, 40441, 20836,  5332,     2]])

I notice that it starts and ends with the id 2 (does that mean that cls=eos in the pretraining phase?).
When these generated_ids are decoded, I get exactly what's on the image:

d’un accident et décrivant ses causes et circonstances

But once I run the configuration cell (specifically the line setting model.config.decoder_start_token_id), the generated_ids for the same image become:

tensor([[0, 4, 2]])

which is just a dot when decoded by the tokenizer.
I want to know if this is normal/expected behavior.

@NielsRogge
Contributor

Yeah, that will definitely change the behaviour. If you check

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

print(model.config.decoder.decoder_start_token_id)

you'll see that it's set to 2.

However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as decoder start token ID.
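So for the stage-1 checkpoint, a hedged sketch of the safer choice is to keep the value the checkpoint was actually trained with rather than the tokenizer's CLS id:

# Keep the start token the checkpoint was trained with (ID=2)
model.config.decoder_start_token_id = model.config.decoder.decoder_start_token_id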

@tkseneee

tkseneee commented Apr 15, 2022

Hi @NielsRogge,
Thank you for your detailed responses. I followed your reply to my query and used your tutorial to train the TrOCR model.

My objective is to recognize Arabic characters from license plate images. I have segmented the words using EAST. A few words are displayed below:

[attached images: segmented Arabic words]

I have trained TrOCR with my custom data (the Arabic alphabet). There are 29 alphabet characters in the dataset (only one character per image). Combinations of the basic characters lead to new letters; I need to include these letters in the dataset.

[attached image: composite letters]

The dataset contains 2900 images (100 images for each character).

I have used the following models for the encoder and decoder:

  • "microsoft/trocr-base-printed" for the TrOCR processor
  • "google/vit-base-patch16-384" for the encoder
  • "aubmindlab/bert-base-arabertv02" for the decoder (Arabic warm start)

I changed the max_length to 4 and the n-gram size to 1. I have not changed the vocab_size.

I trained the model for 3 epochs. The CER is fluctuating: at step 1400 I got a CER of 0.54; after that it increased and fluctuated, but never dropped below 0.54.
With the saved model, I tested images with a single character, and it does reasonably well at predicting that character. But when I give it multiple characters in one image, it fails miserably.

[single-character image] -- predicted correctly, printing the correct text
[single-character image] -- predicted correctly, printing the correct text
[single-character image] -- predicted correctly, printing the correct text
[multi-character image] -- not predicting anything, returning null

I am using the following code (attached as a screenshot) to predict the above images with the trained model:

[screenshot of inference code]

Please let me know the mistake I am making. Is it because I am training the model with individual character images rather than word images? Do I need to make some modifications to the config settings, or do something with the tokenizer?

I have attached my training code link here:
https://colab.research.google.com/drive/11ARSwRinMj4l8qwGhux074G6RL9hCdrW?usp=sharing

@NielsRogge
Contributor

NielsRogge commented Apr 15, 2022

Hi,

If you're only training the model on individual character images, then I'm pretty sure it won't be able to produce a reasonable prediction if you give it a sequence of characters at inference time. You would have to train on sequences of characters as well.

Also, a max length of 4 is rather low; I would increase this during training.
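As a hedged sketch of that change, these are the generation settings the fine-tuning tutorial uses (the value 64 is the tutorial's, not a prescription for this dataset):

# Allow longer target sequences and use beam search, as in the tutorial
model.config.max_length = 64
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4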

@tkseneee

Thank you. I will collect images of words and train again. Any suggestion on the minimum number of images we need for reasonable training?

@NielsRogge
Contributor

I would start with at least 100 (image, text) pairs, and as usual, the more data you have, the better.

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 15, 2022

> you'll see that it's set to 2.
>
> However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as decoder start token ID.

Can you explain why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)? The latter seems to me like the less confusing option (for the model).

@dhea1323

Hi @Samreenhabib, can I get your contact? I want to ask more about fine-tuning for the multilingual case.

@NielsRogge
Contributor

> Can you explain why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)?

I think that's because the TrOCR authors initialized the decoder with the weights of RoBERTa, an encoder-only Transformer model. Hence, they used the CLS token as start token.
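A quick hedged check of those ids with the stage-1 processor (the values match what was reported earlier in this thread):

from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
print(processor.tokenizer.cls_token_id)  # 0 -- what the tutorial uses as decoder start
print(processor.tokenizer.sep_token_id)  # 2 -- the id the checkpoint's generations start/end with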

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 25, 2022

Hi @NielsRogge, I'm having a problem with fine-tuning base TrOCR. I launched training twice and it stopped at an intermediate training step, raising this error:


RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<command-2931901844376276> in <module>
     11 )
     12 
---> 13 training_results = trainer.train()

/databricks/python/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1394 
   1395             step = -1
-> 1396             for step, inputs in enumerate(epoch_iterator):
   1397 
   1398                 # Skip past any already trained steps if resuming training

/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

/databricks/python/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in default_data_collator(features, return_tensors)
     64 
     65     if return_tensors == "pt":
---> 66         return torch_default_data_collator(features)
     67     elif return_tensors == "tf":
     68         return tf_default_data_collator(features)

/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in torch_default_data_collator(features)
    126         if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
    127             if isinstance(v, torch.Tensor):
--> 128                 batch[k] = torch.stack([f[k] for f in features])
    129             else:
    130                 batch[k] = torch.tensor([f[k] for f in features])

RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3

I'm not sure if it's from the tokenizer or the feature extractor (both in the TrOCR processor from the tutorial), or whether it's because we are calling the dataset's label output labels instead of label/label_ids (see the last if statement in the traceback). I don't want to use more compute to test all my hypotheses. Can you help me with this one? I hope you're more familiar with the internals of the seq2seq trainer. Thank you very much in advance.

@IlyasMoutawwakil
Member

I see we are using the default_data_collator, but our dataset returns labels instead of label_ids; could that be the problem?
https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.default_data_collator
But it doesn't explain why one of the "features" would be of size 152.

@NielsRogge
Contributor

Hi, the default_data_collator just stacks the pixel values and labels along the first (batch) dimension.

However, in order to stack tensors, they all need to have the same shape. It seems like you didn't truncate some labels (the input_ids of the encoded text). Can you verify that you pad + truncate when creating the labels?
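A minimal sketch of that fix, reusing the dataset's tokenizer call from earlier in this thread; truncation=True guarantees every label tensor has exactly max_target_length entries, so the collator can stack them:

labels = self.tokenizer(
    text,
    padding="max_length",
    truncation=True,  # cut off labels longer than max_target_length
    max_length=self.max_target_length,
).input_ids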

@arain60gb

Hi @Samreenhabib, kindly share with me the model configuration for the Urdu image data. Thanks!

@Samreenhabib
Author

Hey @NielsRogge, I am stuck at one place and need your help.
I trained a tokenizer on Urdu text lines using the following:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files=paths,
    vocab_size=8192,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(tokenizer_folder)

Here are the encoder and decoder configurations:

from transformers import (RobertaConfig, RobertaModel, RobertaTokenizerFast,
                          ViTConfig, ViTModel, VisionEncoderDecoderModel)

config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
decoder = RobertaModel(config=config)
decoder_tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder, max_len=512)
decoder.resize_token_embeddings(len(decoder_tokenizer))

encoder_config = ViTConfig(image_size=384)
encoder = ViTModel(encoder_config)

model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder.is_decoder = True
model.config.decoder.add_cross_attention = True

Upon trainer.train(), this is what I'm receiving: 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits'
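A likely culprit, as a hedged aside for readers hitting the same error: RobertaModel is a bare encoder without a language-modeling head, so its outputs carry no .logits attribute. A sketch of the usual remedy, assuming the config object defined above:

from transformers import RobertaForCausalLM

# A decoder with an LM head returns .logits, which the
# VisionEncoderDecoderModel's loss computation expects
config.is_decoder = True
config.add_cross_attention = True
decoder = RobertaForCausalLM(config=config)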

@Samreenhabib Samreenhabib reopened this Nov 22, 2022
@Samreenhabib
Author

> Hi @Samreenhabib, kindly share with me the model configuration for the Urdu image data. Thanks!

Hey, apologies for the late reply. I don't know if you are still looking for configurations. The code was exactly the same as in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb
However, to use the trainer trained on your dataset, ensure you save it: trainer.save_model('./urdu_trainer'). Then simply call:

model = VisionEncoderDecoderModel.from_pretrained("./urdu_trainer")
image = Image.open('/content/40.png').convert("RGB")
image
pixel_values = processor.feature_extractor(image, return_tensors="pt").pixel_values 
print(pixel_values.shape)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

For more questions, please use the forum, as we'd like to keep GitHub issues for bugs/feature requests.

@rajsabi

rajsabi commented Jan 11, 2023

> I am trying to use TrOCR for recognizing Urdu text from images […] Should I create and train a new tokenizer built for Urdu? If yes, how can I integrate it with ViT?

> Hey, apologies for the late reply. […] ensure you save it: trainer.save_model('./urdu_trainer') […]

Hi @Samreenhabib, can you please share your code with me? I am trying to fine-tune TrOCR on a Bangla dataset and, being a beginner, I am facing lots of problems. It would be very helpful if you could share your code. I will be grateful to you. Thanks!

@amyeroberts
Collaborator

Hi @SamithaShetty, as @NielsRogge notes in the comment above, questions like these are best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

@AnustupOCR

> Hey there, I want to know whether not using the processor will affect training accuracy? […] After training on 998 image-text pairs (IAM Handwriting), the model can't even recognize text from a training image. Is it related to the size of the training dataset, or is the processor important for the OCR case?

Hi,
I have been working on TrOCR recently, and I am very new to these things.
I am trying to extend TrOCR to all 22 scheduled Indian languages.
From my understanding, I have used the AutoImageProcessor and AutoTokenizer classes, with BEiT as the encoder and IndicBERTv2 as the decoder, as IndicBERTv2 supports all 22 languages.

But I have been facing some issues.
I am using a synthetically generated dataset, in almost the same format as the IAM dataset.
I have been training the model with 2M examples for Bengali, and separately with 20M examples of Hindi+Bengali (10M each).
For my training with Bengali only (2M):
upon running inference after 10 epochs, I am facing the same error as mentioned by @Samreenhabib; the generated text is a repetition of the first word only.

For my training on Hindi+Bengali (20M):
upon running inference after 3 epochs, I am facing the same issue as mentioned by @IlyasMoutawwakil, where the generated texts are just dots and commas.
I am using the same code as in @NielsRogge's tutorial with PyTorch; I have just added Accelerate to train on multiple GPUs. Any kind of help or suggestions would help a lot, as my internship is getting over within a week, so I have to figure out the error as soon as possible.

Thank you so much

I will attach the initialization cell below:

from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor

image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")

processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)
# processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")

train_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/',
                           df=train_df,
                           processor=processor)
eval_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/',
                          df=test_df,
                          processor=processor)


from transformers import VisionEncoderDecoderModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = "cpu"  # overrides the line above
enc = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
dec = 'ai4bharat/IndicBERTv2-MLM-only'
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(enc, dec)
model.to(device)

Thank you again

@Samreenhabib
Author

> Hi, I have been working on TrOCR recently, and I am very new to these things. I am trying to extend TrOCR to all 22 scheduled Indian languages. […] Any kind of help or suggestions would help a lot, as my internship is getting over within a week. […]

@AnustupOCR, it seems like you are not saving the processor according to your requirements. Please take a look at the code here: https://github.com/Samreenhabib/Urdu-OCR/blob/main/Custom%20Transformer%20OCR/Custom%20TrOCR.ipynb
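For reference, a minimal sketch of saving the processor alongside the model so that inference uses the same tokenizer and image settings (the paths are placeholders):

# Save the processor next to the trained weights...
processor.save_pretrained("./urdu_trainer")
trainer.save_model("./urdu_trainer")

# ...and reload both at inference time
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained("./urdu_trainer")
model = VisionEncoderDecoderModel.from_pretrained("./urdu_trainer")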
