Arabic data detection and recognition #490

haythemBD · 2021-09-22T13:02:19Z

Hello, I want to ask whether we can train an end-to-end docTR model for Arabic text detection and recognition?

charlesmindee · 2021-09-22T13:18:45Z

Thanks for your interest in the lib. For the moment we don't have an Arabic vocabulary among our supported vocabs. To train a recognition model on Arabic data, we first need to define and integrate an Arabic vocabulary; after that it will be easy to train a recognition model with it, provided you have annotated Arabic data (at word level for the recognition task).
For the detection part, you could also retrain a detection model on Arabic data, but our model pretrained on European data should already work well, since it detects patterns at word level.

If you wish to help us on that 🙏, you are more than welcome to open a PR proposing an Arabic vocabulary. It will then be available to define an Arabic recognition model, and you will just have to train it on your data 😄! Please read the contributing section to be aware of our guidelines!

Have a nice day!

mzeidhassan · 2021-09-28T03:28:28Z

@charlesmindee Thanks for creating this amazing library. So, by Arabic vocab, you mean a list of all Arabic characters, Hindi numbers, punctuation marks, etc. Is that all you need to train a recognition model? Am I getting it right?

charlesmindee · 2021-09-28T07:38:31Z

Hi @mzeidhassan, to define an Arabic recognition model we only need a vocab like these:

VOCABS['portuguese'] = VOCABS['english'] + 'áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ' + '¡¿'
VOCABS['spanish'] = VOCABS['english'] + 'áéíóúüñÁÉÍÓÚÜÑ' + '¡¿'
VOCABS['german'] = VOCABS['english'] + 'äöüßÄÖÜẞ'

From there, if you have an Arabic dataset (pictures of Arabic words plus the corresponding annotations), you can train a fully operational recognition model and even plug it into one of our pretrained detection models.
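For illustration, an Arabic entry following the same pattern could be sketched as below. This is only a sketch: the `VOCABS` dict here is a local stand-in and not the library's own registry, the Unicode ranges are one possible choice of characters, and the result is not necessarily the vocab that was eventually merged.

```python
import string

# Local stand-in for a vocab registry (hypothetical; not docTR's actual dict).
VOCABS = {
    "digits": string.digits,            # 0123456789
    "punctuation": string.punctuation,  # ASCII punctuation
}

# Basic Arabic letters: U+0621..U+063A and U+0641..U+064A (assumed selection).
ARABIC_LETTERS = "".join(chr(c) for c in range(0x0621, 0x063B)) + "".join(
    chr(c) for c in range(0x0641, 0x064B)
)
# Arabic-Indic ("Hindi") digits: U+0660..U+0669.
ARABIC_INDIC_DIGITS = "".join(chr(c) for c in range(0x0660, 0x066A))
# Common diacritics (fathatan .. sukun): U+064B..U+0652.
ARABIC_DIACRITICS = "".join(chr(c) for c in range(0x064B, 0x0653))
# Arabic question mark (U+061F) and semicolon (U+061B).
ARABIC_PUNCTUATION = "\u061F\u061B"

VOCABS["arabic"] = (
    ARABIC_LETTERS
    + ARABIC_INDIC_DIGITS
    + VOCABS["digits"]
    + ARABIC_DIACRITICS
    + VOCABS["punctuation"]
    + ARABIC_PUNCTUATION
)
```

Building the string from code point ranges rather than typing the characters avoids copy-paste and right-to-left rendering mistakes in the source file.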

Have a nice day!

fg-mindee · 2021-09-28T11:08:58Z

Small clarification @mzeidhassan @haythemBD: if either of you is willing to make a small PR to add the Arabic entry to VOCABS, that would be very helpful :)

mzeidhassan · 2021-09-29T23:16:39Z

Thanks @charlesmindee and @fg-mindee
I hope this one works.
#514

Maybe I didn't do it the right way :-), so please feel free to reject this PR and let me know if you want me to upload a text file with all the characters. Alternatively, I can just update this file "https://github.com/mindee/doctr/blob/main/docs/source/datasets.rst" and add the Arabic vocab. Please let me know.

A note about Punctuation:
Arabic uses almost the same punctuation as English, with 2 exceptions: the question mark (؟) and the semicolon (؛). As you can see, the Arabic question mark is reversed compared to the English one.

Also, we usually use 'Hindi' numbers, but you will definitely find 'Arabic' numbers in Arabic text. So, both should be included.

Hindi: ٠١٢٣٤٥٦٧٨٩
Arabic: 0123456789

Diacritics in Arabic are not isolated characters; they are marks placed above or below Arabic characters. You can think of them as the 'vowels' of English, or as something like accents in French.
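The points above about digits and diacritics can be checked with Python's standard `unicodedata` module. The helper names below are hypothetical and only meant as a small sketch: the "Hindi" digits are separate code points that carry the same numeric values, and diacritics are combining marks stored as their own code points after the base letter.

```python
import unicodedata

# "Hindi" (Arabic-Indic) digits, U+0660..U+0669.
HINDI_DIGITS = "".join(chr(c) for c in range(0x0660, 0x066A))


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic digits with their ASCII equivalents."""
    return "".join(
        str(unicodedata.digit(ch)) if ch in HINDI_DIGITS else ch for ch in text
    )


def is_diacritic(ch: str) -> bool:
    """True for non-spacing combining marks, such as Arabic fatha or kasra."""
    return unicodedata.category(ch) == "Mn"


# BEH followed by FATHA: one visible glyph, but two code points.
beh_with_fatha = "\u0628\u064E"
```

Because a diacritic is its own code point, a recognition vocab has to list diacritics explicitly if the annotations contain them.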

fg-mindee · 2021-10-04T11:22:44Z

Closed by #502 & #514

@haythemBD you can find the instructions to train on your Arabic dataset here: https://github.com/mindee/doctr/tree/main/references/recognition
You only have to specify the --vocab argument to select the new Arabic vocab :)
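Before launching a training run with a given vocab, it may be worth checking that every character in your word-level annotations is actually covered by it. The thread does not say how the training script handles out-of-vocab characters, so the helper below (a hypothetical name, independent of the library) is only a sanity-check sketch:

```python
from collections import Counter


def out_of_vocab_chars(labels, vocab):
    """Count characters that appear in the labels but not in the vocab."""
    allowed = set(vocab)
    counts = Counter(
        ch for label in labels for ch in label if ch not in allowed
    )
    return dict(counts)


# Toy example with a Latin vocab; real use would pass the Arabic vocab string
# and the list of word annotations from your dataset.
missing = out_of_vocab_chars(["ab", "ac", "zz"], "abc")
```

An empty result means the annotations use only characters the vocab knows about; otherwise the counts show which characters to add to the vocab or clean from the labels.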

charlesmindee self-assigned this Sep 22, 2021

fg-mindee added the "type: enhancement" and "module: datasets" labels Sep 22, 2021

charlesmindee mentioned this issue Oct 1, 2021

adding an Arabic vocab file #514

Merged

fg-mindee closed this as completed Oct 4, 2021
