Adopt TrOCR #1384
Comments
Hi, thanks for your interest! If you are already a user of MMOCR 0.x, I'd recommend working on MMOCR 1.x, which contains many major upgrades and provides a better and more consistent interface design for developers. We have been writing more documentation at https://mmocr.readthedocs.io/en/dev-1.x/ for developers, including the steps to add a model from scratch. Though these advanced tutorials are not ready yet, you can still plan this project by referring to the guideline below.
You might also want to refer to the design doc for ASTER, though it was written for MMOCR 0.x. If you need any help from us, feel free to drop an email to mmocr@openmmlab.com and we will connect you on Slack. |
Thank you for typing this up, Tong! I am hoping to use PARSeq for text recognition and will be following the steps you laid out here. I am looking forward to those advanced tutorials :D |
@jturner116 I'm also looking into adding PARSeq. Did you make any progress on the implementation? |
@Mekacher-Anis No, I've just finished finetuning PARSeq. I'm leaning towards DBNet++ for detection, but I haven't even tried finetuning it yet. I should be making more progress this week and next |
Awesome! We've been considering adding a |
@gaotongxiao PARSeq uses only a Vision Transformer (ViT) as its "encoder", so it doesn't have a "backbone" in the traditional sense, like miniVGG or ResNet. What should the value of the backbone in the config file be? |
@Mekacher-Anis Depends on MMOCR's version. You may treat the encoder as the backbone in MMOCR 0.x. That is, fill in the |
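For illustration, here is one way to read the MMOCR 0.x suggestion of treating the encoder as the backbone. This is a sketch under assumptions, not the actual PARSeq config: `ViTEncoderAsBackbone` and `PARSeqDecoder` are hypothetical module names that you would have to implement and register yourself, and the hyperparameters are placeholders.

```python
# Config sketch for MMOCR 0.x (assumption: an encode-decode style recognizer).
# The backbone and decoder `type` names are hypothetical placeholders,
# not modules that ship with MMOCR.
model = dict(
    type='EncodeDecodeRecognizer',
    # The ViT does all the feature extraction, so it sits in the backbone slot.
    backbone=dict(
        type='ViTEncoderAsBackbone',  # hypothetical
        img_size=(32, 128),           # placeholder value
        patch_size=(4, 8),            # placeholder value
        embed_dim=384,                # placeholder value
    ),
    # No separate encoder stage is needed in this reading.
    encoder=None,
    decoder=dict(type='PARSeqDecoder'),  # hypothetical
)
```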
@gaotongxiao Great, thank you! Yes, I'm using MMOCR 1.0 and I've added it as an |
@gaotongxiao @jturner116
|
It's an excellent implementation! Before proceeding to the next step, have you been able to use the pre-trained model in MMOCR and confirm that the test-time accuracy looks alright? I can see this model requires a lot of datasets. MMOCR currently supports some of them, but downloading the datasets and running the scripts takes a lot of manual effort. We'll be releasing Dataset Preparer soon, which will help users get every dataset ready with a single command. Since eventually all the scripts in As for |
Yes, I've tested the pretrained model after importing it into MMOCR. The one I imported is the official pretrained model from their releases, but I've also fine-tuned it on the LVDB dataset.
{
  "LVDB/recog/1-N.E.D_ignore_case": 0.9123,
  "LVDB/recog/1-N.E.D_exact": 0.8929,
  "LVDB/recog/1-N.E.D_ignore_case_symbol": 0.9173,
  "LVDB/recog/word_acc": 0.7955,
  "LVDB/recog/word_acc_ignore_case": 0.8106,
  "LVDB/recog/word_acc_ignore_case_symbol": 0.8241,
  "LVDB/recog/char_recall": 0.9364,
  "LVDB/recog/char_precision": 0.9388
}
Thanks for the advice, I'll add the |
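For readers unfamiliar with the metric names above, here is a minimal sketch of how word accuracy and 1-N.E.D (one minus normalized edit distance) are commonly computed. It assumes the edit distance is normalized by the longer of the two strings; MMOCR's own metric classes may differ in details such as symbol filtering.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_acc(preds: Sequence[str], gts: Sequence[str],
             ignore_case: bool = False) -> float:
    """Fraction of predictions that match the ground truth exactly."""
    if ignore_case:
        preds = [p.lower() for p in preds]
        gts = [g.lower() for g in gts]
    correct = sum(p == g for p, g in zip(preds, gts))
    return correct / max(1, len(gts))


def one_minus_ned(preds: Sequence[str], gts: Sequence[str],
                  ignore_case: bool = False) -> float:
    """1 minus the mean edit distance, each normalized by the longer string."""
    total = 0.0
    for p, g in zip(preds, gts):
        if ignore_case:
            p, g = p.lower(), g.lower()
        total += edit_distance(p, g) / max(1, len(p), len(g))
    return 1.0 - total / max(1, len(gts))
```

Because 1-N.E.D gives partial credit for near misses, it is typically higher than word accuracy on the same predictions, which matches the pattern in the numbers above.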
@Mekacher-Anis Wow, that is very clean. I was planning to just keep my detection and recognition fairly separate, but your implementation is inspiring! Thank you for your work on this |
I added PoissonNoise using ImgAugWrapper Mekacher-Anis@077f93d |
I see, it's a conservative choice :) Usually setting prob as 50% can also work well.
Sure! We would definitely like to. We are finalizing the rules for |
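As a rough illustration of applying a noise augmentation with 50% probability, the pipeline fragment below wraps an imgaug augmenter in a random-apply step. It is a sketch, not the config from the linked commit: it assumes MMOCR 1.x's `ImgAugWrapper` accepts `dict(cls=...)` entries naming imgaug augmenters, that `RandomApply` takes a `prob` argument, and that imgaug's `AdditivePoissonNoise` stands in for the PoissonNoise mentioned above. The `lam` range is a placeholder.

```python
# Training-pipeline fragment (assumed MMOCR 1.x style config).
poisson_noise = dict(
    type='RandomApply',
    prob=0.5,  # apply the wrapped transforms to about half of the samples
    transforms=[
        dict(
            type='ImgAugWrapper',
            # Assumption: imgaug's AdditivePoissonNoise; lam is a placeholder.
            args=[dict(cls='AdditivePoissonNoise', lam=(0, 16))],
        ),
    ],
)

# The fragment would sit among the other transforms of a training pipeline.
train_pipeline_fragment = [poisson_noise]
```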
@gaotongxiao I have added a custom CEModuleLoss that takes the logits of the multiple permutations generated by PARSeq and averages the loss over them, as the original implementation does. However, when I fine-tuned my checkpoint on the IAM Handwriting Dataset, the model's performance on the LVDB dataset got worse, even though the datasets are somewhat similar. |
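The idea of averaging a cross-entropy loss over several permutations can be sketched in plain Python as follows. This is not the CEModuleLoss from the comment above or PARSeq's own code: the function names are hypothetical, the inputs are nested lists rather than tensors, and every permutation is weighted equally.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Mean per-character cross-entropy for one sequence.

    logits has shape [T][C] (T steps, C classes); targets has length T.
    """
    total = 0.0
    for step_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp.
        m = max(step_logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_sum_exp - step_logits[target]
    return total / max(1, len(targets))


def permutation_averaged_loss(
        logits_per_perm: Sequence[Sequence[Sequence[float]]],
        targets: Sequence[int]) -> float:
    """Average the cross-entropy over the logits of each permutation."""
    if not logits_per_perm:
        raise ValueError('expected logits for at least one permutation')
    losses = [cross_entropy(logits, targets) for logits in logits_per_perm]
    return sum(losses) / len(losses)
```

In a real module the same computation would run on batched tensors, with padding positions masked out before averaging.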
@Mekacher-Anis Have you tried to increase the model's robustness by using more augmentations like "ColorJitter"? The word images in LVDB are collected from comprehensive scenes, while IAM contains all handwritten words from scanned pages only. |
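A color-jitter step along these lines could be added to the training pipeline roughly as shown below. The fragment assumes MMOCR 1.x's `TorchVisionWrapper`, which forwards the named `op` and its keyword arguments to torchvision; the jitter strengths are placeholder values, not tuned recommendations.

```python
# Pipeline fragment (assumed MMOCR 1.x style config).
color_jitter = dict(
    type='TorchVisionWrapper',
    op='ColorJitter',  # torchvision.transforms.ColorJitter
    brightness=0.5,    # placeholder strength
    contrast=0.5,      # placeholder strength
    saturation=0.5,    # placeholder strength
    hue=0.1,           # placeholder strength
)
```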
As for the data distribution, IAM could be seen as a subset of LVDB. For example, LVDB contains some printed characters that look quite different from handwritten ones, so fine-tuning the model on IAM, which has only handwritten text, can degrade its performance on the printed ones. To get more intuition and insight, I'd suggest checking the samples whose predictions become wrong after fine-tuning.
It's not surprising as
|
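The suggestion to check samples whose predictions turn wrong after fine-tuning can be sketched as a small comparison helper. It assumes you have already collected the ground-truth strings and the predictions of both checkpoints in the same sample order; `find_regressions` is a hypothetical name, not an MMOCR utility.

```python
from typing import List, Sequence, Tuple


def find_regressions(
        gts: Sequence[str],
        preds_before: Sequence[str],
        preds_after: Sequence[str]) -> List[Tuple[int, str, str]]:
    """Return (index, ground truth, new prediction) for every sample that
    was correct before fine-tuning and is wrong afterwards."""
    regressions = []
    for idx, (gt, before, after) in enumerate(
            zip(gts, preds_before, preds_after)):
        if before == gt and after != gt:
            regressions.append((idx, gt, after))
    return regressions


# Toy usage with made-up strings: only sample 1 regresses.
gts = ['EXIT', 'hello', 'Cafe']
before = ['EXIT', 'hello', 'Cate']
after = ['EXIT', 'hallo', 'Cate']
for idx, gt, pred in find_regressions(gts, before, after):
    print(f'sample {idx}: expected {gt!r}, now predicts {pred!r}')
```

Looking at the images behind the returned indices should show whether the regressions cluster on printed text, as hypothesized above.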
Hi @Mekacher-Anis , the guideline for |
@gaotongxiao awesome, I'll start working on cleaning up the implementation and adding typings when I have the time. |
@Mekacher-Anis Great, but you don't need to submit a perfect project at the beginning. A PR with a project that is just about to work is also acceptable, and you can choose to polish it (like adding typings) afterward. |
@gaotongxiao |
@Mekacher-Anis I see, that makes sense :) |
Hi @Mekacher-Anis , I just wonder if you have some time to wrap up your implementation? We are looking forward to seeing your contribution! |
@gaotongxiao I'm sorry for the late response. I'm writing my bachelor's thesis; I will do my best to get my changes merge-ready as fast as possible after I'm done with the writing. |
@Mekacher-Anis No problem. Good luck with your thesis! |
Hello. Thank you all for your work on this amazing project. I am currently working on an HTR (handwritten text recognition) task and want to adopt TrOCR, but I don't really understand where to start. Could you suggest a possible approach with MMOCR? How do you add new algorithms to it?