mismatch in sequence of words in result.export() #528

PoornaSaiNagendra · 2021-10-13T12:13:23Z

🐛 Bug

The sequence of words returned by result.export() does not match the order of the words in the input image: the table columns are swapped.

To Reproduce

Steps to reproduce the behavior:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("</path/to/the/image>")
result = model(doc)
result.show(doc)
json_output = result.export()

# Only the first line of the first block on the first page is read here
words_dic = json_output['pages'][0]['blocks'][0]['lines'][0]['words']
words_list = [word['value'] for word in words_dic]
total_text = ' '.join(words_list)

I can't provide the complete image due to privacy issues but I am providing the desired part of the image for my use case.

Expected behavior

The output I am getting is:

HINDI (SPECIALI EVEN EIGHT 100 078 DISTIN 33 078 ENGLISH GENERAL) HIVE TWO 33 100 052 052 SANSKRIT GENERAL AIVEN TWO 100 33 072 072 MATHEMATICS 100 33 SUK ONE 061 061 SCIENCE 100 25 08 20 040 060 Sus ZERO SOCIAL SCIENCE 33 100 062 TWO 062

And the expected output is:

HINDI (SPECIALI) 100 33 078 078 SEVEN EIGHT DISTN ENGLISH (GENERAL) 100 33 052 052 FIVE TWO SANSKRIT GENERAL TWO 100 33 072 072 SEVEN TWO MATHEMATICS 100 33 061 061 SIX ONE SCIENCE 100 25 08 040 20 060 SIX ZERO SOCIAL SCIENCE 100 33 062 062 SIX TWO

Environment

I am using Google Colab free version

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

![cropped_dect](https://user-images.githubusercontent.com/42320447/137127595-1bbb42b6-3035-4f32-aeb1-6c0a8133baa8.jpeg)

wget https://raw.githubusercontent.com/mindee/doctr/main/scripts/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Collecting environment information...

DocTR version: 0.4.0
TensorFlow version: 2.6.0
PyTorch version: 1.9.0+cu111 (torchvision 0.10.0+cu111)
OpenCV version: 4.5.3
OS: Ubuntu 18.04.5 LTS
Python version: 3.7
Is CUDA available (TensorFlow): No
Is CUDA available (PyTorch): No
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5

Additional context

The above image is the cropped output from result.show(doc).

Thanks for any help you can provide in resolving this issue.


charlesmindee · 2021-10-13T13:43:58Z

Hi @PoornaSaiNagendra,

Thank you for your interest in doctr! If I understand correctly, your problem is the ordering of boxes in the output (boxes are not mapped to the correct lines/blocks and/or blocks are not ordered). We use the box coordinates to reconstruct lines, and hierarchical clustering of lines to find blocks, but this is not a very robust approach, especially when there are many columns on the page.

Since I don't have access to the document, could you plot or list the content of the different lines and/or blocks to help me with this?

Thanks a lot 🙏
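One way to list the content of each line and block, as requested above, is to walk the exported dictionary. This is only a minimal sketch: `list_lines` and `sample` are hypothetical names, and `sample` is a made-up dictionary that merely mimics the `pages` / `blocks` / `lines` / `words` nesting accessed in the reproduction snippet.

```python
from typing import Any, Dict, List


def list_lines(export: Dict[str, Any]) -> List[str]:
    """Return one string per (page, block, line), in the order of the export."""
    rows = []
    for p_idx, page in enumerate(export["pages"]):
        for b_idx, block in enumerate(page["blocks"]):
            for l_idx, line in enumerate(block["lines"]):
                text = " ".join(word["value"] for word in line["words"])
                rows.append(f"page {p_idx} / block {b_idx} / line {l_idx}: {text}")
    return rows


# Made-up stand-in for result.export(); the real export carries more keys.
sample = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {"words": [{"value": "HINDI"}, {"value": "100"}, {"value": "33"}]}
                    ]
                }
            ]
        }
    ]
}

for row in list_lines(sample):
    print(row)
```

With a real document, `list_lines(result.export())` would show at a glance how many blocks and lines were produced and which words ended up in each.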

PoornaSaiNagendra · 2021-10-14T06:47:58Z

Hi @charlesmindee,

Thanks for replying. Here I am providing you with the document duplicate I used. Hope that helps you in solving the issue.

Image source: Google images
Note: No copyright infringement is intended

The image above can be found at the link below:
(https://images.app.goo.gl/FQuYLc2GhUkNHz83A)

Thanks a lot

charlesmindee · 2021-10-14T15:12:31Z

Hi @PoornaSaiNagendra,

The option to resolve page lines and blocks is not activated by default. You need to activate it in the DocumentBuilder (models/utils/builder.py) to sort your document by blocks and lines; otherwise you get a single block containing a single line that encapsulates all the words of the page.

I activated the option and it does not work well with your document; as I mentioned above, our lines/blocks resolution algorithm is not very robust. You can try modifying the geometrical parameters of the line resolution function in the builder, or directly use the coordinates of the boxes in the output to reorder the boxes as you wish. I am sorry for this malfunction. We are going to work on table comprehension/reconstruction, as suggested in #524, in the next weeks, and it may help you with this! 😄

Best
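The second suggestion above, reordering the boxes from their coordinates, could be sketched roughly as follows. This is a simple row-grouping heuristic written for illustration, not doctr's own line resolution algorithm. It assumes each exported word carries a `geometry` entry of the form `((xmin, ymin), (xmax, ymax))` in relative coordinates (straight boxes only); `words_in_reading_order`, `row_tol` and the sample page are hypothetical.

```python
from typing import Any, Dict, List


def y_center(word: Dict[str, Any]) -> float:
    """Vertical centre of a word box given as ((xmin, ymin), (xmax, ymax))."""
    (_, ymin), (_, ymax) = word["geometry"]
    return (ymin + ymax) / 2


def words_in_reading_order(page: Dict[str, Any], row_tol: float = 0.5) -> List[List[str]]:
    """Group a page's words into rows by vertical position, then sort each row left to right.

    row_tol is an illustrative knob: a word joins the current row when its vertical
    centre lies within row_tol * (its own height) of that row's mean centre.
    """
    words = [w for b in page["blocks"] for ln in b["lines"] for w in ln["words"]]
    words.sort(key=y_center)

    rows: List[List[Dict[str, Any]]] = []
    for word in words:
        (_, ymin), (_, ymax) = word["geometry"]
        if rows:
            row_center = sum(y_center(w) for w in rows[-1]) / len(rows[-1])
            if abs(y_center(word) - row_center) <= row_tol * (ymax - ymin):
                rows[-1].append(word)
                continue
        rows.append([word])

    # Within each row, order words by their left edge (xmin).
    return [[w["value"] for w in sorted(r, key=lambda w: w["geometry"][0][0])] for r in rows]


# Made-up page: two table rows whose words arrive shuffled in a single line.
page = {"blocks": [{"lines": [{"words": [
    {"value": "HINDI", "geometry": ((0.05, 0.100), (0.20, 0.120))},
    {"value": "078", "geometry": ((0.60, 0.101), (0.66, 0.121))},
    {"value": "ENGLISH", "geometry": ((0.05, 0.140), (0.22, 0.160))},
    {"value": "33", "geometry": ((0.50, 0.099), (0.54, 0.119))},
    {"value": "100", "geometry": ((0.40, 0.100), (0.46, 0.120))},
    {"value": "052", "geometry": ((0.60, 0.141), (0.66, 0.161))},
    {"value": "100", "geometry": ((0.40, 0.139), (0.46, 0.159))},
    {"value": "33", "geometry": ((0.50, 0.140), (0.54, 0.160))},
]}]}]}

for row in words_in_reading_order(page):
    print(" ".join(row))
```

On a real page the tolerance would likely need tuning, and skewed scans or cells with wrapped text may still be grouped incorrectly.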

PoornaSaiNagendra · 2021-10-14T17:41:53Z

Thanks for the suggestion. Looking forward to table comprehension/reconstruction.

Regards

PoornaSaiNagendra · 2021-10-15T10:02:30Z

Hi @charlesmindee

Actually, I am looking to extract information from the table. I initially proceeded with regex, but because it was written around the current misalignment of words, the same regex might not be suitable in the long run once the issue is resolved.

Could you please let me know if there is any chance of including key information extraction (KIE) models in the pipeline at present, or suggest an alternative approach for building our own custom KIE that can be added as post-processing for docTR?

Thanks and Regards

felixdittrich92 · 2021-10-15T13:49:34Z

@PoornaSaiNagendra
Do you mean something like this LayoutLM-Example?
If yes, then take a look at Tut. For the moment, in this case, you can replace tesseract with doctr after detection :)

PoornaSaiNagendra · 2021-10-15T15:23:37Z

@felixdittrich92
Thanks for helping me get the materials I needed. In my case the data is inside a table, so I was looking for models similar to this that can help me integrate doctr with downstream tasks like key information extraction 😄 As of now I am using spaCy to add my own custom entities.

charlesmindee · 2021-10-20T15:56:06Z

I am moving this to a discussion so that we can keep discussing it there and close the bug issue.

PoornaSaiNagendra added the type: bug Something isn't working label Oct 13, 2021

charlesmindee self-assigned this Oct 13, 2021

fg-mindee added the topic: table comprehension Related to table comprehension label Oct 14, 2021

fg-mindee added this to the 1.0.0 milestone Oct 14, 2021

mindee locked and limited conversation to collaborators Oct 20, 2021

charlesmindee closed this as completed Oct 20, 2021

This issue was moved to a discussion.