
Adopt TrOCR #1384

Open · grinay opened this issue Sep 13, 2022 · 26 comments

@grinay commented Sep 13, 2022

Hello. Thank you guys for your effort on this amazing project. I'm currently working on an HTR (handwritten text recognition) task and want to adopt TrOCR, but I don't really understand where to start. Could you suggest a possible approach with MMOCR? How do you add new algorithms to it?

@gaotongxiao (Collaborator)

Hi, thanks for your interest! If you are already a user of MMOCR 0.x, I'd recommend working on MMOCR 1.x, which contains tons of major upgrades and provides a better, more consistent interface design for developers. We have been writing more documentation at https://mmocr.readthedocs.io/en/dev-1.x/ for developers, including the steps to add a model from scratch. Although these advanced tutorials are not ready yet, you can still plan this project by referring to the guideline below.

  1. Dataset: Check out our docs and see if the datasets used in the paper are already supported. If not, you'd have to prepare a script that converts the dataset into an MMOCR-ready format.
  2. Data Transforms (augmentations): Check whether the required image transformations are implemented in MMOCR by going through Data Transforms. If not, you might want to implement them referring to the official implementation of the paper.
  3. Model: A text recognition model usually consists of 3 modules: backbone, encoder (optional), and decoder. You'd need to implement these modules to get everything working.
  4. Prepare a config, which tells MMOCR how to connect everything you've implemented (a minimal sketch follows this list).
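
For step 4, a minimal sketch of such a config in MMOCR 1.x might look like the following. MyEncoder and MyDecoder are hypothetical placeholders for whatever modules you register, and the dict file path is illustrative; EncoderDecoderRecognizer, Dictionary, CEModuleLoss, and AttentionPostprocessor are real MMOCR 1.x components:

# Minimal recognizer config sketch for MMOCR 1.x.
dictionary = dict(
    type='Dictionary',
    dict_file='dicts/lower_english_digits.txt',  # illustrative path
    with_padding=True,
    with_unknown=True)

model = dict(
    type='EncoderDecoderRecognizer',
    backbone=None,                    # 1.x allows an empty backbone
    encoder=dict(type='MyEncoder'),   # hypothetical registered encoder
    decoder=dict(
        type='MyDecoder',             # hypothetical registered decoder
        dictionary=dictionary,
        module_loss=dict(type='CEModuleLoss'),
        postprocessor=dict(type='AttentionPostprocessor')))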

You might also want some references from the design doc for ASTER, though it was written for MMOCR 0.x. If you need any help from us, feel free to drop an email to mmocr@openmmlab.com and we will connect with you on Slack.

@jturner116

Thank you for typing this up, Tong! I am hoping to use PARSeq for text recognition and will be following the steps you laid out here. I am looking forward to those advanced tutorials :D

@Mekacher-Anis

@jturner116 I'm also looking into adding PARSeq. Did you make any progress on the implementation?

@jturner116

@Mekacher-Anis No, I've just finished fine-tuning PARSeq. I'm leaning towards DBNet++ for detection, but I haven't even tried fine-tuning it yet. I should be making more progress this week and next.

@gaotongxiao (Collaborator)

Awesome! We've been considering adding a projects/ folder to accept community implementations as part of this repository. Feel free to let us know if you need any assistance with this project. @Mekacher-Anis

@Mekacher-Anis

@gaotongxiao PARSeq uses only a Vision Transformer (ViT) as its "encoder", so it doesn't have a "backbone" in the traditional sense, like miniVGG or ResNet. What should the value of the backbone field in the config file be?

@gaotongxiao (Collaborator)

@Mekacher-Anis It depends on the MMOCR version. You may treat the encoder as the backbone in MMOCR 0.x; that is, fill in the backbone field with "ViT" and leave encoder empty. MMOCR 1.0 allows an empty backbone, so there it would be more natural to put ViT into the encoder field.
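
In config terms, the two placements might look roughly like this ('ViT' stands in for whatever encoder class you register; it is not a module shipped with MMOCR):

# MMOCR 0.x: treat the encoder as the backbone
model_0x = dict(
    backbone=dict(type='ViT'),  # hypothetical registered module
    encoder=None)

# MMOCR 1.x: an empty backbone is allowed, so ViT fits the encoder slot
model_1x = dict(
    backbone=None,
    encoder=dict(type='ViT'))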

@Mekacher-Anis

@gaotongxiao Great, thank you! Yes, I'm using MMOCR 1.0 and I've added it as an encoder.
Another question: I'm adapting the code from the paper's official repo https://github.com/baudm/parseq. They offer pretrained weights as a .pt file, and I've also fine-tuned the model, so I have a .ckpt file locally.
Is it possible to load these pretrained weights so I don't have to train the model?

@Mekacher-Anis commented Oct 30, 2022

@gaotongxiao @jturner116
I've been able to adapt the trained model to the new implementation in MMOCR.
I've added most of the PARSeq code on my fork in this commit 158ff8c
Todo:

  • Add the PARSeq encoder module. This has been adapted from the original repo and has a dependency on this external library for ViT.
  • Add the PARSeq decoder module with the required forward_train and forward_test functions, also adapted from the original repo. The only problem I faced is calculating the loss: it's based on CEModuleLoss, but forward_train computes logits for multiple permutations of the same input batch, so for now I simply return the loss for the first permutation.
  • Register all modules.
  • Add the required configuration files.
  • Add the training/testing pipeline.
  • Adapt the trained model from the original repo to MMOCR. I had to edit the keys in the state_dict of the checkpoint file for this (see the remapping sketch after this list); the adapted checkpoint file goes in load_from in the configuration file.
  • Add all the datasets used for training in the paper. I've only added "MJSynth" + "SynthText" + "COCO-Text", but the paper also uses "RCTW17" + "Uber-Text" + "ArT" + "LSVT" + "MLT19" + "ReCTS" + "TextOCR" + "Open Images".
  • The paper also uses an SWA scheduler starting at iteration 127260, but in the current implementation the scheduler doesn't change.
  • Add all the data augmentation techniques used during training. I've only added "RandAugment", "GaussianBlur", and "RandomInvert"; the paper also uses "PoissonNoise", for which I couldn't find an implementation.
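
For reference, the key editing mentioned above can be done with a short script along these lines; the prefixes in PREFIX_MAP are illustrative guesses, not the exact mapping used:

# Rough sketch of remapping checkpoint keys so MMOCR's load_from accepts them.
import torch

ckpt = torch.load('parseq_official.ckpt', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

PREFIX_MAP = {
    'model.encoder.': 'encoder.',  # hypothetical old -> new prefixes
    'model.decoder.': 'decoder.',
}

new_state_dict = {}
for key, value in state_dict.items():
    for old, new in PREFIX_MAP.items():
        if key.startswith(old):
            key = new + key[len(old):]
            break
    new_state_dict[key] = value

torch.save({'state_dict': new_state_dict}, 'parseq_mmocr.pth')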

@gaotongxiao (Collaborator)

It's an excellent implementation! Before proceeding to the next step, have you been able to use the pre-trained model on MMOCR and confirm that test-time accuracy looks alright?

I can see this model requires a lot of datasets. Currently, MMOCR supports some of them, but it takes a lot of manual effort to download the datasets and run the scripts. We'll be releasing Dataset Preparer soon, which will help users get every dataset ready with a single command. Since eventually all the scripts in tools/dataset_converters will be migrated into this module, we can synchronize the development plan before you start to support some of these datasets (if that's in your plan).

As for PoissonNoise, using ImgAugWrapper to invoke imgaug's implementation of AdditiveGaussianNoise may suffice.
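
In pipeline terms, that suggestion might look like the snippet below; the scale range is a guess to tune, not a value from the paper:

# Approximating PoissonNoise with imgaug's AdditiveGaussianNoise via
# MMOCR's ImgAugWrapper transform.
dict(
    type='ImgAugWrapper',
    args=[dict(cls='AdditiveGaussianNoise', scale=(0.0, 12.75))])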

@Mekacher-Anis

Yes, I've tested the pretrained model after importing it into MMOCR. The one I imported is the official pretrained model from their releases, but I've also fine-tuned it on the LVDB dataset.
I've tested the model with their test scripts and with the MMOCR test scripts (after adapting it to the current implementation) and got the same results on both.

{
  "LVDB/recog/1-N.E.D_ignore_case": 0.9123,
  "LVDB/recog/1-N.E.D_exact": 0.8929,
  "LVDB/recog/1-N.E.D_ignore_case_symbol": 0.9173,
  "LVDB/recog/word_acc": 0.7955,
  "LVDB/recog/word_acc_ignore_case": 0.8106,
  "LVDB/recog/word_acc_ignore_case_symbol": 0.8241,
  "LVDB/recog/char_recall": 0.9364,
  "LVDB/recog/char_precision": 0.9388
}

Thanks for the advice! I'll add PoissonNoise using ImgAugWrapper.

@jturner116

@Mekacher-Anis Wow, that is very clean. I was planning to just keep my detection and recognition fairly separate, but your implementation is inspiring! Thank you for your work on this

@Mekacher-Anis commented Nov 1, 2022

I added PoissonNoise using ImgAugWrapper: Mekacher-Anis@077f93d
I couldn't figure out from the paper how they apply the augmentations; currently, all augmentation techniques are applied randomly with 25% probability.
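
For context, one way to express "each augmentation applied independently with 25% probability" in an MMOCR 1.x pipeline might be the following; the wrappers exist, but the parameter values are illustrative rather than the ones in the linked commit:

# Each augmentation wrapped in mmcv's RandomApply with p=0.25.
train_aug = [
    dict(type='RandomApply', prob=0.25,
         transforms=[dict(type='TorchVisionWrapper', op='RandomInvert')]),
    dict(type='RandomApply', prob=0.25,
         transforms=[dict(type='TorchVisionWrapper', op='GaussianBlur',
                          kernel_size=5)]),
]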
By the way, is there a possibility of merging the implementation into MMOCR? It would be a great addition, as PARSeq is currently pretty much the best model out there on most datasets: https://paperswithcode.com/task/scene-text-recognition

@gaotongxiao (Collaborator)

I couldn't figure out from the paper how they apply the augmentations; currently, all augmentation techniques are applied randomly with 25% probability.

I see, it's a conservative choice :) Usually setting prob to 50% can also work well.

By the way, is there a possibility of merging the implementation into MMOCR? It would be a great addition, as PARSeq is currently pretty much the best model out there on most datasets: https://paperswithcode.com/task/scene-text-recognition

Sure! We would definitely like to. We are finalizing the rules for the projects/ folder that will be used to accept community implementations and will get back to you this week. BTW, we will be training your implementation from scratch to verify its correctness and to prepare some pretrained models, so let me know when your implementation is ready for training.

@Mekacher-Anis commented Nov 4, 2022

@gaotongxiao I have added a custom CEModuleLoss that can take the logits of the multiple permutations generated by PARSeq and calculate the average loss, like they do in their implementation. But when I tried to fine-tune my checkpoint on the IAM Handwriting Database, the model's performance got worse on the LVDB dataset, although the datasets are somewhat similar.
I can't really figure out whether the cause is my implementation of the PARSeq decoder and the CE module, or whether this is something that can happen when fine-tuning a model.
Training is on the IAM dataset and validation is on the LVDB validation set. I'm using load_from to load the migrated model discussed in earlier posts.
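
A minimal sketch of the permutation-averaged cross-entropy described above, loosely following PARSeq's objective (shapes and padding handling are illustrative):

# Average cross-entropy over the logits of each permutation.
import torch
import torch.nn.functional as F

def permutation_ce_loss(perm_logits, targets, pad_idx):
    # perm_logits: list of (B, T, C) tensors, one per permutation
    # targets: (B, T) target indices, padded with pad_idx
    losses = [
        F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                        ignore_index=pad_idx)
        for logits in perm_logits
    ]
    return torch.stack(losses).mean()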

@gaotongxiao (Collaborator) commented Nov 7, 2022

@Mekacher-Anis Have you tried increasing the model's robustness by using more augmentations like "ColorJitter"? The word images in LVDB are collected from diverse scenes, while IAM contains only handwritten words from scanned pages.

@Mekacher-Anis

ColorJitter didn't help either; performance is still worse than the original.
The color channels have a mean of 135 and a standard deviation of 91, which should represent some pretty good variation in color.
I also fine-tuned the model on IAM using the code from the PARSeq repository, to make sure the drop isn't caused by my adapted implementation, and the performance also got worse with their code.
Here are the test results using my adapted implementation in MMOCR fine-tuned on IAM (left is fine-tuned):
[screenshot of test results]
Something I found weird is that the overall word_accuracy dropped significantly in comparison to the ignore_case and ignore_case_symbol variants (although they also dropped).
I tried freezing the encoder/decoder and still saw a 10% decrease in accuracy.
I used the browse_dataset.py script to visualize both datasets, and they're extremely similar.
To be honest, I'm clueless as to what might be causing that huge drop in performance. Being a complete noob, I would have expected the performance to either get better or stay the same, not drop significantly like that.

@gaotongxiao (Collaborator)

As for the data distribution, IAM could be a subset of LVDB. For example, LVDB contains some printed characters that look quite different from handwritten ones, and fine-tuning the model on IAM, which has only the handwritten part, can degrade its performance on the printed ones. To get more intuition and insights, I'd suggest checking the samples whose predictions turn wrong after fine-tuning.

Something I found weird is that the overall word_accuracy dropped significantly in comparison to the ignore_case and ignore_case_symbol variants (although they also dropped).

It's not surprising, as word_accuracy is a stricter metric than the others. It implies that the model cares less about letter case after being fine-tuned. Make sure you set letter_case to 'unchanged' during fine-tuning:

letter_case: str = 'unchanged',
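
In config terms, letter_case is an argument that CEModuleLoss inherits from its base class, so keeping the default might look like this sketch:

# 'unchanged' keeps case; 'lower'/'upper' would fold it during training.
module_loss = dict(
    type='CEModuleLoss',
    letter_case='unchanged')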

@gaotongxiao (Collaborator)

Hi @Mekacher-Anis, the guidelines for projects/ are ready for review and will likely stay open for about a week before they are finalized. You are welcome to leave any comments and thoughts as a community member at any time. If it all looks good to you, you could also start preparing your project accordingly. We are looking forward to seeing your first project & PR :)

@Mekacher-Anis

@gaotongxiao Awesome, I'll start working on cleaning up the implementation and adding typings when I have time.

@gaotongxiao (Collaborator)

@Mekacher-Anis Great, but you don't need to submit a perfect project at the beginning. A PR with a project that just about works is also acceptable, and you can polish it (like adding typings) afterward.

@Mekacher-Anis

@gaotongxiao
I've made a lot of other unrelated changes to my 1.x branch, so there isn't really an easy way for me to create a pull request directly from it. I'm thinking of creating a new branch based on a "clean" 1.x branch and then cherry-picking the changes one by one.

@gaotongxiao (Collaborator)

@Mekacher-Anis I see, that makes sense :)

@gaotongxiao (Collaborator)

Hi @Mekacher-Anis, I was just wondering whether you've had some time to wrap up your implementation. We are looking forward to seeing your contribution!

@Mekacher-Anis

@gaotongxiao I'm sorry for the late response. I'm writing my bachelor's thesis; I will do my best to get my changes merge-ready as fast as possible after I'm done with the writing.

@gaotongxiao (Collaborator)

@Mekacher-Anis No problem. Good luck with your thesis!
