
Fine tune TrOCR using bert-base-multilingual-cased #15823

Closed · Samreenhabib opened this issue Feb 24, 2022 · 38 comments
@Samreenhabib

I am trying to use TrOCR to recognize Urdu text from images. For the feature extractor I am using DeiT, with bert-base-multilingual-cased as the decoder. I can't figure out what the requirements are if I want to fine-tune a pre-trained TrOCR model but with a multilingual-cased decoder. I've followed the https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR tutorial, but the model doesn't understand Urdu text as expected. Please guide me on how I should proceed. Should I create and train a new tokenizer built for Urdu? If yes, how can I integrate it with ViT?

@NielsRogge
Contributor

NielsRogge commented Feb 25, 2022

Hi,

In case you want to train a TrOCR model on another language, you can warm-start (i.e. initialize the weights of) the encoder and decoder with pretrained weights from the hub, as follows:

from transformers import VisionEncoderDecoderModel

# Initialize the encoder from a pretrained ViT and the decoder from a pretrained RoBERTa model.
# Note that the cross-attention layers will be randomly initialized and need to be fine-tuned on a downstream dataset.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "urduhack/roberta-urdu-small"
)

Here, I'm initializing the weights of the decoder from a RoBERTa language model trained on the Urdu language from the hub.
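In addition, the decoder's special tokens and vocab size need to be wired into the encoder-decoder config, as in the tutorial. A minimal sketch, assuming the RobertaTokenizer that matches the decoder checkpoint:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

# Wire the decoder's special tokens into the encoder-decoder config
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size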

@Samreenhabib
Author

Thank you @NielsRogge. I was not adding the last two configurations. Luckily, it's working now. Thanks a lot again :)

@Samreenhabib
Author

Samreenhabib commented Mar 3, 2022

Hey there, I want to know whether not using the processor will affect training accuracy. I've tried to replace the TrOCR processor with a ViT feature extractor and a RoBERTa tokenizer, as follows:

import torch
from PIL import Image
from torch.utils.data import Dataset

class IAMDataset(Dataset):
    def __init__(self, root_dir, df, feature_extractor, tokenizer, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.feature_extractor(image, return_tensors="pt").pixel_values
        labels = self.tokenizer(text, padding="max_length", max_length=self.max_target_length).input_ids
        # Replace pad tokens with -100 so they are ignored by the loss
        labels = [label if label != self.tokenizer.pad_token_id else -100 for label in labels]
        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
        return encoding

After training on 998 images (IAM Handwriting) with image-text pairs, the model can't even recognize text from a training image. Is this related to the size of the training dataset, or is the processor important for the OCR case?

@Samreenhabib Samreenhabib reopened this Mar 3, 2022
@NielsRogge
Contributor

Hi,

Are you using the following tokenizer?

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("urduhack/roberta-urdu-small")

Because that is required for the model to work on another language.

@Samreenhabib
Author

Yes, I'm using ViT + urduhack/roberta as encoder and decoder. For testing purposes, I've trained this model on 20 image-text pairs. When I try to recognize text from an image, the output text is composed of repeating words, as shown in the image below.

[screenshot: generated output with a repeated word]

I know the training sample does not meet the requirements; please highlight what I'm doing wrong while recognizing text from an image:

model = VisionEncoderDecoderModel.from_pretrained("./wo_processor")
image = Image.open('/content/10.png').convert("RGB")
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values 
generated_ids = model.generate(pixel_values)
generated_text= decoder_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

@NielsRogge
Contributor

NielsRogge commented Mar 4, 2022

Which version of Transformers are you using? Because the eos_token_id must be properly set in the decoder.

We recently fixed the generate() method to take the eos_token_id of the decoder into account (see #14905).

Can the model properly overfit the 20 image-text pairs?

@Samreenhabib
Author

Transformers version: 4.17.0
Currently, model.config.eos_token_id = decoder_tokenizer.sep_token_id is set.

@NielsRogge
Contributor

NielsRogge commented Mar 4, 2022

Can the model properly overfit the 20 image-text pairs?

Is the image you're testing on included in the 20 pairs?

@Samreenhabib
Author

Samreenhabib commented Mar 4, 2022

No, the model is not generating correct text even for an image from the training set.

@NielsRogge
Contributor

OK, then I suggest first debugging that; see also this post.
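A minimal overfitting sanity check along those lines (a sketch, assuming the model from above and a train_dataset instance of the IAMDataset class shown earlier; adjust names to your setup):

import torch
from torch.utils.data import DataLoader

# Take one small batch and check that the loss can be driven to ~0.
# If it can't, there is likely a data/config bug rather than a capacity problem.
train_dataloader = DataLoader(train_dataset, batch_size=4)
batch = next(iter(train_dataloader))

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for step in range(200):
    outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 50 == 0:
        print(step, outputs.loss.item())  # should steadily approach 0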

@Samreenhabib
Author

Hey @NielsRogge, after double-checking the image-text list and saving the model via trainer.save_model, the code is running and the output is as expected. Thanks for all the guidance. 👍

@tkseneee

Hi @NielsRogge. Thank you for your wonderful tutorial on fine-tuning TrOCR. I am trying to tune TrOCR for the Arabic language. I have collected and arranged the data as explained in your tutorial. Which pre-trained model do I need to use? Like the one you mentioned above for Urdu, are any pretrained weights available for Arabic?

@NielsRogge
Contributor

Hi,

You can filter on your language by clicking on the "models" tab, then selecting a language on the left: https://huggingface.co/models?language=ar&sort=downloads

So you can for instance initialize the weights of the decoder with those of https://huggingface.co/aubmindlab/bert-base-arabertv02
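The same filter can be applied programmatically (a sketch, assuming a huggingface_hub version where list_models accepts a language argument):

from huggingface_hub import HfApi

api = HfApi()
# List the five most-downloaded Arabic models on the hub
for m in api.list_models(language="ar", sort="downloads", direction=-1, limit=5):
    print(m.id)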

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 13, 2022

Hi @NielsRogge, thanks a lot for the tutorial. When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?

@NielsRogge
Contributor

Hi!

When using VisionEncoderDecoder and training it on a new dataset as in the tutorial, which parts of the model (encoder and decoder) are frozen and which are trainable?

All weights are updated! You initialize the weights of the encoder with those of a pre-trained vision encoder (like ViT), initialize the weights of the decoder with those of a pre-trained text model (like BERT, GPT-2) and randomly initialize the weights of the cross-attention layers in the decoder. Next, all weights are updated based on a labeled dataset of (image, text) pairs.
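A quick sketch to confirm that nothing is frozen (the commented-out lines show how you would freeze the encoder if you wanted to, which the tutorial does not do):

# Every parameter requires gradients by default, so nothing is frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")

# If you *did* want to freeze the encoder, you would opt in explicitly:
# for param in model.encoder.parameters():
#     param.requires_grad = False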

@IlyasMoutawwakil
Member

Thanks for your answer @NielsRogge.
I have another question, about the configuration cell in the TrOCR tutorial: the cell in which we set special tokens etc.
Is it normal that once we set model.config.decoder_start_token_id to processor.tokenizer.cls_token_id and go back to the model (still in stage 1, no fine-tuning performed), the pretrained model's output changes (for the worse)?

Example:

I have the following image:

[screenshot: a line of printed French text]

I apply the pretrained model (TrOCR-base-stage1) to this image of printed text; it works fine and the generated ids are:

tensor([[    2,   417,   108,   879,  3213,  4400, 10768,   438,  1069, 21454,
           579,   293,  4685,  4400, 40441, 20836,  5332,     2]])

I notice that it starts and ends with the id 2 (does that mean that cls=eos in the pretraining phase?).
When these generated_ids are decoded, I get exactly what's on the image:

d’un accident et décrivant ses causes et circonstances

But once I run the configuration cell (specifically the line setting model.config.decoder_start_token_id), the generated_ids for the same image become:

tensor([[0, 4, 2]])

which is just a dot when decoded by the tokenizer.
I want to know if this is normal/expected behavior?

@NielsRogge
Contributor

Yeah that definitely will change behaviour. If you check

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

print(model.config.decoder.decoder_start_token_id)

you'll see that it's set to 2.

However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as decoder start token ID.
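So for running the stage-1 checkpoint as-is, one option is to keep (or restore) the start token it was trained with, e.g.:

# Restore the decoder start token the checkpoint was trained with (id 2)
model.config.decoder_start_token_id = model.config.decoder.decoder_start_token_id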

@tkseneee

tkseneee commented Apr 15, 2022

Hi @NielsRogge,
Thank you for your detailed responses. I followed your reply to my query and used your tutorial to train the TrOCR model.

My objective is to recognize Arabic characters from license plate images. I have segmented the words alone using EAST. A few words are displayed below:

[screenshots: three segmented Arabic word images]

I have trained TrOCR with my custom data (Arabic alphabets). There are 29 alphabet characters in the dataset (only one alphabet per image). Combinations of basic alphabets lead to new letters; I need to include these letters in the dataset.

[screenshot: composite Arabic letters]

The dataset contains 2,900 images (100 images for each alphabet).

I have used the following models for encoder and decoder:

  • "microsoft/trocr-base-printed" for TrOCR Processor
  • "google/vit-base-patch16-384" for Encoder
  • "aubmindlab/bert-base-arabertv02" for decoder (Arabic letter warm start)

I changed max_length to 4 and the n-gram size to 1. I have not changed the vocab_size.

I trained the model for 3 epochs. The CER is fluctuating: at step 1400 I got a CER of 0.54; after that it increased and fluctuated, never dropping below 0.54.
On the saved model, I tested images with a single character, and it does reasonably well at predicting that character. But when I give multiple characters in one single image, it fails miserably.

[screenshot: single-character image] -- predicted correctly by printing the correct text

[screenshot: single-character image] -- predicted correctly by printing the correct text

[screenshot: single-character image] -- predicted correctly by printing the correct text

[screenshot: multi-character image] -- NOT predicting anything, returning null

I am using the following code to predict the above images with the trained model:

[screenshot: inference code]

Please let me know the mistake I am committing. Is it because I am training the model with individual character images rather than word images? Do I need to make some modifications in the config settings, or do something with the tokenizer?

I have attached my training code here:
https://colab.research.google.com/drive/11ARSwRinMj4l8qwGhux074G6RL9hCdrW?usp=sharing

@NielsRogge
Contributor

NielsRogge commented Apr 15, 2022

Hi,

If you're only training the model on individual character images, then I'm pretty sure it won't be able to produce a reasonable prediction if you give it a sequence of characters at inference time. You would have to train on sequences of characters as well.

Also, a max length of 4 is rather low; I would increase this during training.
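For reference, the generation settings in the tutorial look roughly like this (a sketch; the exact numbers are tuning choices rather than requirements):

# Beam-search decoding parameters from the TrOCR tutorial
model.config.max_length = 64            # much higher than 4
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4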

@tkseneee

Thank you. I will collect images of words and train again. Any suggestion on the minimum number of images needed for reasonable training?

@NielsRogge
Contributor

I would start with at least 100 (image, text) pairs, and as usual, the more data you have, the better.

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 15, 2022

you'll see that it's set to 2.

However, if you set it to processor.tokenizer.cls_token_id, then you set it to 0. But the model was trained with ID=2 as decoder start token ID.

Can you explain to me why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)? That seems to me like the less confusing option (for the model).

@dhea1323

@Samreenhabib hi, can I get your contact? I want to ask more about fine-tuning for the multilingual case.

@NielsRogge
Contributor

Can you explain to me why model.config.decoder_start_token_id (null) is set to processor.tokenizer.cls_token_id (0) and not to model.decoder.config.decoder_start_token_id (2)? That seems to me like the less confusing option (for the model).

I think that's because the TrOCR authors initialized the decoder with the weights of RoBERTa, an encoder-only Transformer model. Hence, they used the CLS token as the start token.

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Apr 25, 2022

Hi @NielsRogge, I'm having a problem with fine-tuning base TrOCR. I launched the training two times, and each time it stopped at an intermediate training step raising the error:


RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<command-2931901844376276> in <module>
     11 )
     12 
---> 13 training_results = trainer.train()

/databricks/python/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1394 
   1395             step = -1
-> 1396             for step, inputs in enumerate(epoch_iterator):
   1397 
   1398                 # Skip past any already trained steps if resuming training

/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

/databricks/python/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

/databricks/python/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in default_data_collator(features, return_tensors)
     64 
     65     if return_tensors == "pt":
---> 66         return torch_default_data_collator(features)
     67     elif return_tensors == "tf":
     68         return tf_default_data_collator(features)

/databricks/python/lib/python3.7/site-packages/transformers/data/data_collator.py in torch_default_data_collator(features)
    126         if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
    127             if isinstance(v, torch.Tensor):
--> 128                 batch[k] = torch.stack([f[k] for f in features])
    129             else:
    130                 batch[k] = torch.tensor([f[k] for f in features])

RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [152] at entry 3

I'm not sure if it's from the tokenizer or the feature extractor (both in the TrOCR processor from the tutorial), or whether it's because our dataset returns its label output as labels instead of label/label_ids (see the last if statement in the traceback). I don't want to use more compute to test all my hypotheses. Can you help me with this one? I hope you're more familiar with the internals of the seq2seq trainer. Thank you very much in advance.

@IlyasMoutawwakil
Member

I see we are using the default_data_collator, but our dataset returns labels instead of label_ids. Could that be the problem?
https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.default_data_collator
But it doesn't explain why one of the "features" would be of size 152.

@NielsRogge
Contributor

Hi, the default_data_collator just stacks the pixel values and labels along the first (batch) dimension.

However, in order to stack tensors, they all need to have the same shape. It seems like you didn't truncate some labels (which are the input_ids of the encoded text). Can you verify that you pad + truncate when creating the labels?
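Concretely, in the IAMDataset.__getitem__ shown earlier in this thread, adding truncation=True guarantees every label tensor has the same fixed length (a sketch of the fix):

# Pad AND truncate so every label sequence is exactly max_target_length long
labels = self.tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=self.max_target_length,
).input_ids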

@arain60gb

Hi @Samreenhabib, kindly share with me your model configuration for the Urdu image data. Thanks.

@Samreenhabib
Author

Hey @NielsRogge, I am stuck at one place and need your help.
I trained a tokenizer on Urdu text lines using the following:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(files=paths, vocab_size=8192, min_frequency=2,
                show_progress=True,
                special_tokens=[
                                "<s>",
                                "<pad>",
                                "</s>",
                                "<unk>",
                                "<mask>",
])
tokenizer.save_model(tokenizer_folder)

Here are the encoder and decoder configurations:

from transformers import (RobertaConfig, RobertaModel, RobertaTokenizerFast,
                          ViTConfig, ViTModel, VisionEncoderDecoderModel)

config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
decoder = RobertaModel(config=config)
decoder_tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder, max_len=512)
decoder.resize_token_embeddings(len(decoder_tokenizer))

encoder_config = ViTConfig(image_size=384)
encoder = ViTModel(encoder_config)

model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder.is_decoder = True
model.config.decoder.add_cross_attention = True

Upon trainer.train(), this is what I'm receiving: 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits'.
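One likely cause, offered here as an assumption rather than a confirmed fix: RobertaModel is a bare encoder without a language-modeling head, so its output carries no logits. VisionEncoderDecoderModel expects a decoder with an LM head, e.g. RobertaForCausalLM built from the same config:

from transformers import RobertaConfig, RobertaForCausalLM

config = RobertaConfig(
    vocab_size=8192,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
    is_decoder=True,            # enable causal masking
    add_cross_attention=True,   # attend to the image encoder
)
decoder = RobertaForCausalLM(config=config)  # has an LM head, so it produces logits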

@Samreenhabib Samreenhabib reopened this Nov 22, 2022
@Samreenhabib
Author

Hi @Samreenhabib, kindly share with me your model configuration for the Urdu image data. Thanks.

Hey, apologies for the late reply. I don't know if you are still looking for the configurations. The code was exactly the same as in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb
However, to use a trainer trained on your dataset, ensure you save it: trainer.save_model('./urdu_trainer'). Then simply call

model = VisionEncoderDecoderModel.from_pretrained("./urdu_trainer")
image = Image.open('/content/40.png').convert("RGB")
image  # display the image in the notebook
pixel_values = processor.feature_extractor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

For more questions, please use the forum, as we'd like to keep GitHub issues for bugs/feature requests.

@rajsabi

rajsabi commented Jan 11, 2023


Hi @Samreenhabib, can you please share your code with me? I am trying to fine-tune TrOCR with a Bangla dataset and, being a beginner, I am facing lots of problems. It would be very helpful if you could share your code. I will be grateful to you. Thanks!

@amyeroberts
Collaborator

Hi @SamithaShetty, as @NielsRogge notes in the comment above, questions like these are best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

@AnustupOCR


Hi,
I have been working on TrOCR recently, and I am very new to these things.
I am trying to extend TrOCR to all 22 scheduled Indian languages.
From my understanding, I have used the AutoImageProcessor and AutoTokenizer classes, and for the encoder and decoder I have used BEiT and IndicBERTv2 respectively, as the latter supports all 22 languages.

But I have been facing some issues.
I am using a synthetically generated dataset, which has almost the same format as the IAM dataset.
I have been training the model with 2M examples for Bengali,
and separately with 20M examples of Hindi+Bengali (10M each).
For my training with Bengali only (2M):
upon running inference after 10 epochs, I am facing the same error as mentioned by @Samreenhabib; the generated text is a repetition of the first word only.

For my training on Hindi+Bengali (20M):
upon running inference after 3 epochs, I am facing the same issue as mentioned by @IlyasMoutawwakil, where the generated texts are just dots and commas.
I am using the same code as in @NielsRogge's tutorial with PyTorch; I have just added Accelerate to train on multiple GPUs.
Any kind of help or suggestions would really help a lot, as my internship is getting over within a week, so I have to figure out the error as soon as possible.

Thank you so much.

I will attach the initialisation cell below:

from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor

image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only")
processor = TrOCRProcessor(feature_extractor=image_processor, tokenizer=tokenizer)
# processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")

train_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/',
                           df=train_df,
                           processor=processor)
eval_dataset = IAMDataset(root_dir='/home/ruser1/Anustup/NewNOISE/bn/images/',
                          df=test_df,
                          processor=processor)

from transformers import VisionEncoderDecoderModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = "cpu"  # overrides the line above
enc = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
dec = 'ai4bharat/IndicBERTv2-MLM-only'
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(enc, dec)
model.to(device)

Thank you again

@Samreenhabib
Author

@AnustupOCR, it seems like you are not saving the processor according to your requirements. Please take a look at the code here: https://github.com/Samreenhabib/Urdu-OCR/blob/main/Custom%20Transformer%20OCR/Custom%20TrOCR.ipynb
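For reference, a minimal sketch of saving and reloading the processor alongside the trained model (assuming the processor object built earlier in the thread):

from transformers import TrOCRProcessor

# Save the processor next to the trained model so inference uses the
# exact same image preprocessing and tokenizer as training
processor.save_pretrained("./urdu_trainer")

# Later, at inference time:
processor = TrOCRProcessor.from_pretrained("./urdu_trainer")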

@MohamedLahmeri01

How do I pick an encoder and a decoder to fine-tune TrOCR on a specific language?

@tuladhar07

@NielsRogge Hello sir, I am trying TrOCR on Devanagari handwritten text. I would like to know which decoder would be best for this?
