Arabic data detection and recognition #490
Hi @haythemBD, thanks for your interest in the lib. For the moment we don't have an Arabic vocabulary among our supported vocabs. To train a recognition model on Arabic data, we first need to define and integrate an Arabic vocabulary; after that, it will be easy to train a recognition model with it, provided you have annotated Arabic data (at the word level for the recognition task). If you wish to help us with that 🙏, you are more than welcome to open a PR proposing an Arabic vocabulary. It will then be available for defining an Arabic recognition model, and you will just have to train it on your data 😄! Please read the contributing guide. Have a nice day!
@charlesmindee Thanks for creating this amazing library. So, by an Arabic vocab, you mean a list of all Arabic characters, Hindi numerals, punctuation marks, etc. Is that all you need to train a recognition model? Am I getting it right?
Hi @mzeidhassan, to define an Arabic recognition model we only need a vocab like these ones:
From there, if you have an Arabic dataset (pictures of Arabic words plus the corresponding annotations), you can train a fully operational recognition model and even plug it into one of our pretrained detection models. Have a nice day!
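To make the idea of a vocab concrete, here is a minimal standalone sketch, not docTR's actual definition. It assumes a vocab is simply a string of allowed characters and that each character in a word label is mapped to its index in that string. The names `ARABIC_LETTERS`, `ARABIC_VOCAB`, and `encode` are hypothetical, and the choice of characters (basic letters, both digit sets, three punctuation marks) is an illustrative assumption rather than an agreed-upon list.

```python
from typing import List

# Hypothetical vocab sketch; these names are illustrative, not docTR's API.
# Basic Arabic letters: U+0621..U+063A and U+0641..U+064A (tatweel U+0640 skipped).
ARABIC_LETTERS = "".join(
    chr(cp) for cp in [*range(0x0621, 0x063B), *range(0x0641, 0x064B)]
)
# Arabic-Indic digits U+0660..U+0669, plus the Western digits 0-9.
ARABIC_INDIC_DIGITS = "".join(chr(cp) for cp in range(0x0660, 0x066A))
WESTERN_DIGITS = "0123456789"
# Arabic comma, semicolon, and question mark (an assumed, non-exhaustive set).
ARABIC_PUNCTUATION = "\u060c\u061b\u061f"

ARABIC_VOCAB = (
    ARABIC_LETTERS + ARABIC_INDIC_DIGITS + WESTERN_DIGITS + ARABIC_PUNCTUATION
)


def encode(word: str, vocab: str) -> List[int]:
    """Map each character to its index in the vocab; fail on unknown characters."""
    index = {ch: i for i, ch in enumerate(vocab)}
    missing = [ch for ch in word if ch not in index]
    if missing:
        raise ValueError(f"characters not in vocab: {missing!r}")
    return [index[ch] for ch in word]
```

Running `encode` over every label in a dataset before training is a cheap way to find characters the vocab does not cover yet.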
One small clarification @mzeidhassan @haythemBD: if either of you is willing to open a small PR adding an Arabic entry to the VOCABS, that would be very helpful :)
Thanks @charlesmindee and @fg-mindee. Maybe I didn't do it the right way :-), so please feel free to reject this PR and let me know if you want me to upload a text file with all the characters, or maybe I can just update this file "https://github.com/mindee/doctr/blob/main/docs/source/datasets.rst" and add the Arabic vocab. Please let me know.

A note about Punctuation: Also, we usually use 'Hindi' numbers, but you will definitely find 'Arabic' numbers in Arabic text, so both should be included. Hindi: ٠١٢٣٤٥٦٧٨٩

Diacritics in Arabic are not isolated characters; they are just marks that can be placed above or below Arabic characters. You can think of them as 'vowels' in English, or maybe like accents in French, for example.
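The two points above (two digit sets, and diacritics as marks attached to letters) can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration of how labels might be inspected or normalized before a vocab is chosen. The helper names `strip_diacritics` and `to_western_digits` are hypothetical and are not part of docTR, and whether to strip diacritics or map digits at all is a dataset decision this thread does not settle.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), such as fatha or shadda."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_western_digits(text: str) -> str:
    """Replace any Unicode decimal digit (category Nd) with its ASCII digit."""
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )


# Fatha (U+064E) is a combining mark, not a standalone letter.
fatha_is_combining = unicodedata.combining("\u064e") > 0

# "kataba" written with three fathas, then with the marks removed.
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
bare = strip_diacritics(vocalized)

# Arabic-Indic digits U+0660..U+0669 mapped to 0-9.
hindi_digits = "".join(chr(cp) for cp in range(0x0660, 0x066A))
western = to_western_digits(hindi_digits)
```

If the labels keep their diacritics, each mark needs its own entry in the vocab; if they are stripped as above, the vocab can stay limited to base letters.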
@haythemBD you can find the instructions for training on your Arabic dataset here: https://github.com/mindee/doctr/tree/main/references/recognition
Hello, I want to ask whether we can train an end-to-end doctr model for detection and recognition on Arabic data?