Arabic data detection and recognition #490

Closed
haythemBD opened this issue Sep 22, 2021 · 6 comments
Assignees
Labels
module: datasets Related to doctr.datasets type: enhancement Improvement

Comments

@haythemBD

haythemBD commented Sep 22, 2021

Hello, I want to ask whether we can train an end-to-end docTR model on Arabic data for detection and recognition?

@charlesmindee charlesmindee self-assigned this Sep 22, 2021
@fg-mindee fg-mindee added type: enhancement Improvement module: datasets Related to doctr.datasets labels Sep 22, 2021
@charlesmindee
Collaborator

Hi @haythemBD,

Thanks for your interest in the library. For the moment, we don't have an Arabic vocabulary among our supported vocabs. To train a recognition model on Arabic data, we first need to define and integrate an Arabic vocabulary; after that, training a recognition model with it will be easy if you have annotated Arabic data (at the word level, for the recognition task).
For the detection part, you could also retrain a detection model on Arabic data, but our model pretrained on European data should already work well, since it detects patterns at the word level.

If you wish to help us with that 🙏, you are more than welcome to open a PR proposing an Arabic vocabulary. It will then be available for defining an Arabic recognition model, and you will just have to train it on your data 😄! Please read the contributing section to be aware of our guidelines!

Have a nice day!

@mzeidhassan
Contributor

@charlesmindee Thanks for creating this amazing library. So, by Arabic vocab, you mean a list of all Arabic characters, Hindi numbers, punctuation marks, etc.? Is that all you need to train a recognition model? Am I getting it right?

@charlesmindee
Collaborator

charlesmindee commented Sep 28, 2021

Hi @mzeidhassan, to define an Arabic recognition model we only need a vocab like these:

VOCABS['portuguese'] = VOCABS['english'] + 'áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ' + '¡¿'
VOCABS['spanish'] = VOCABS['english'] + 'áéíóúüñÁÉÍÓÚÜÑ' + '¡¿'
VOCABS['german'] = VOCABS['english'] + 'äöüßÄÖÜẞ'

From there, if you have an Arabic dataset (images of Arabic words plus the corresponding annotations), you can train a fully operational recognition model and even plug it into one of our pretrained detection models.
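Before training, it can help to confirm that every character in the annotations is covered by the chosen vocab. The snippet below is only a minimal sketch of that check in plain Python: `out_of_vocab_chars` is a hypothetical helper and the sample vocab and labels are made up for illustration; none of it is part of docTR.

```python
from collections import Counter


def out_of_vocab_chars(labels, vocab):
    """Count the characters in `labels` that do not appear in `vocab`."""
    allowed = set(vocab)
    return Counter(ch for label in labels for ch in label if ch not in allowed)


# Illustrative vocab built by concatenation, in the same spirit as the entries above
# (an assumption for this example, not docTR's actual definition)
vocab = "abcdefghijklmnopqrstuvwxyz" + "äöüß"

# Made-up word-level annotations; the last one contains U+00E9, which is not in the vocab
labels = ["straße", "über", "caf\u00e9"]

missing = out_of_vocab_chars(labels, vocab)
print(missing)  # Counter({'é': 1})
```

Any character reported here would need to be added to the vocab or cleaned from the annotations before training.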

Have a nice day!

@fg-mindee
Contributor

A small clarification, @mzeidhassan @haythemBD: if either of you is willing to make a small PR to add the Arabic entry to VOCABS, that would be very helpful :)

@mzeidhassan
Contributor

mzeidhassan commented Sep 29, 2021

Thanks @charlesmindee and @fg-mindee
I hope this one works.
#514

Maybe I didn't do it the right way :-), so please feel free to reject this PR and let me know if you would rather I upload a text file with all the characters, or just update this file "https://github.com/mindee/doctr/blob/main/docs/source/datasets.rst" to add the Arabic vocab. Please let me know.

A note about Punctuation:
Arabic uses almost the same punctuation as English, with two exceptions: the question mark (؟) and the semicolon (؛). As you can see, the Arabic question mark is reversed compared to the English one.

Also, we usually use 'Hindi' numbers, but you will definitely find 'Arabic' numbers in Arabic text as well. So, both should be included.

Hindi: ٠١٢٣٤٥٦٧٨٩
Arabic: 0123456789

Diacritics in Arabic are not isolated characters; they are just marks that can be placed above or below Arabic characters. You can think of them as 'vowels' in English, or perhaps as accents in French.
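The notes above can be sketched in plain Python. This is only an illustration under stated assumptions: the constant names, the choice of Unicode ranges, and the decision to include diacritics in the vocab are all hypothetical and are not taken from the docTR PR. It builds a candidate vocab from both digit sets and both punctuation sets, and shows that a diacritic is a separate combining code point attached to its base letter.

```python
import string
import unicodedata

# Core Arabic letters: U+0621..U+063A and U+0641..U+064A (tatweel U+0640 is skipped)
ARABIC_LETTERS = "".join(map(chr, range(0x0621, 0x063B))) + "".join(map(chr, range(0x0641, 0x064B)))
# 'Hindi' (Arabic-Indic) digits: U+0660..U+0669
ARABIC_INDIC_DIGITS = "".join(map(chr, range(0x0660, 0x066A)))
# Arabic question mark (U+061F) and Arabic semicolon (U+061B)
ARABIC_PUNCTUATION = "\u061F\u061B"
# Common diacritics, fathatan through sukun: U+064B..U+0652
ARABIC_DIACRITICS = "".join(map(chr, range(0x064B, 0x0653)))

# Hypothetical candidate vocab: both digit sets and both punctuation sets are included
arabic_vocab = (
    ARABIC_LETTERS
    + ARABIC_DIACRITICS
    + ARABIC_INDIC_DIGITS
    + string.digits
    + ARABIC_PUNCTUATION
    + string.punctuation
)

# A letter carrying a diacritic is two code points: meem (U+0645) followed by fatha (U+064E)
word = "\u0645\u064E"
print(len(word))  # 2
# unicodedata.combining() is non-zero for combining marks such as the fatha
print([unicodedata.combining(ch) != 0 for ch in word])  # [False, True]
```

Because each diacritic is its own code point, a vocab can list the marks once instead of enumerating every letter and mark combination; whether to keep them at all depends on how the training annotations are written.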

@fg-mindee
Contributor

Closed by #502 & #514

@haythemBD you can find the instructions for training on your Arabic dataset here: https://github.com/mindee/doctr/tree/main/references/recognition
You only have to specify the --vocab argument to select the new Arabic vocab :)


4 participants