This issue was moved to a discussion.

mismatch in sequence of words in result.export() #528

Closed
PoornaSaiNagendra opened this issue Oct 13, 2021 · 8 comments

Labels
topic: table comprehension Related to table comprehension type: bug Something isn't working

@PoornaSaiNagendra

🐛 Bug

The sequence of words returned by result.export() does not match the order of the words in the input image: the columns get swapped.

To Reproduce

Steps to reproduce the behavior:

  1. model = ocr_predictor(pretrained=True)

  2. doc = DocumentFile.from_images(</path/to/the/image>)

  3. result = model(doc)

  4. result.show(doc)

  5. json_output = result.export()

  6. num_words = len(json_output['pages'][0]['blocks'][0]['lines'][0]['words'])

  7. words_list = []
     words_dic = json_output['pages'][0]['blocks'][0]['lines'][0]['words']

     for word in range(num_words):
         res = words_dic[word]['value']
         words_list.append(res)

  8. total_text = ' '.join(words_list)
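The steps above read only the first line of the first block. A minimal sketch of collecting every word on a page follows, assuming the nested pages/blocks/lines/words layout that the steps index into. The `collect_words` helper and the hand-written `sample_export` dictionary are hypothetical stand-ins for illustration, not docTR API or real docTR output.

```python
def collect_words(export, page_idx=0):
    """Return the 'value' of every word on one page, in export order."""
    words = []
    for block in export["pages"][page_idx]["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                words.append(word["value"])
    return words


# Hypothetical, hand-written stand-in for the dictionary from result.export()
sample_export = {
    "pages": [{
        "blocks": [
            {"lines": [{"words": [{"value": "HINDI"}, {"value": "100"}]}]},
            {"lines": [{"words": [{"value": "33"}]}]},
        ]
    }]
}

total_text = " ".join(collect_words(sample_export))
print(total_text)
```

This only changes how many words are gathered; the order is still whatever the export provides.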

I can't share the complete image for privacy reasons, but I am providing the part of it that is relevant to my use case.

Expected behavior

The output I am getting is:

HINDI (SPECIALI EVEN EIGHT 100 078 DISTIN 33 078 ENGLISH GENERAL) HIVE TWO 33 100 052 052 SANSKRIT GENERAL AIVEN TWO 100 33 072 072 MATHEMATICS 100 33 SUK ONE 061 061 SCIENCE 100 25 08 20 040 060 Sus ZERO SOCIAL SCIENCE 33 100 062 TWO 062

And the expected output is:

HINDI (SPECIALI) 100 33 078 078 SEVEN EIGHT DISTN ENGLISH (GENERAL) 100 33 052 052 FIVE TWO SANSKRIT GENERAL TWO 100 33 072 072 SEVEN TWO MATHEMATICS 100 33 061 061 SIX ONE SCIENCE 100 25 08 040 20 060 SIX ZERO SOCIAL SCIENCE 100 33 062 062 SIX TWO

Environment

I am using the free version of Google Colab.

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

![cropped_dect](https://user-images.githubusercontent.com/42320447/137127595-1bbb42b6-3035-4f32-aeb1-6c0a8133baa8.jpeg)

wget https://raw.githubusercontent.com/mindee/doctr/main/scripts/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Collecting environment information...

DocTR version: 0.4.0
TensorFlow version: 2.6.0
PyTorch version: 1.9.0+cu111 (torchvision 0.10.0+cu111)
OpenCV version: 4.5.3
OS: Ubuntu 18.04.5 LTS
Python version: 3.7
Is CUDA available (TensorFlow): No
Is CUDA available (PyTorch): No
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5

Additional context

cropped_dect

The above image is the cropped output from result.show(doc).

Thanks for any help you can provide in resolving this issue.

@PoornaSaiNagendra PoornaSaiNagendra added the type: bug Something isn't working label Oct 13, 2021
@charlesmindee charlesmindee self-assigned this Oct 13, 2021
@charlesmindee
Collaborator

charlesmindee commented Oct 13, 2021

Hi @PoornaSaiNagendra,

Thank you for your interest in doctr! If I understand correctly, your problem is the ordering of boxes in the output (boxes are not mapped to the correct lines/blocks and/or blocks are not ordered). We use box coordinates to reconstruct lines and hierarchical clustering of lines to find blocks, but this approach is not very robust, especially when the page has many columns.

Since I don't have access to the document, could you plot or list the content of the different lines and/or blocks to help me investigate?

Thanks a lot 🙏

@PoornaSaiNagendra
Author

PoornaSaiNagendra commented Oct 14, 2021

Hi @charlesmindee,

Thanks for replying. Here is a copy of the document I used. I hope it helps you resolve the issue.

Chhattisgarh_BOSE

Image source: Google images
Note: No copyright infringement is intended

The above image can be found at the link below:
(https://images.app.goo.gl/FQuYLc2GhUkNHz83A)

Thanks a lot

@fg-mindee fg-mindee added the topic: table comprehension Related to table comprehension label Oct 14, 2021
@fg-mindee fg-mindee added this to the 1.0.0 milestone Oct 14, 2021
@charlesmindee
Collaborator

Hi @PoornaSaiNagendra,

The option to resolve page lines and blocks is not activated by default. You need to activate it in the DocumentBuilder (models/utils/builder.py) to sort your document by blocks and lines; otherwise you get a single block containing a single line that holds all the words of the page.

I activated the option and it does not work well with your document; as I mentioned above, our line/block resolution algorithm is not very robust. You can try modifying the geometrical parameters of the line resolution function in the builder, or use the coordinates of the boxes in the output directly to reorder the boxes as you wish. I am sorry for the inconvenience. We are going to work on table comprehension/reconstruction, as suggested in #524, in the coming weeks, and it may help you with this! 😄
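As a rough illustration of reordering boxes from their coordinates, here is a minimal sketch. It assumes each exported word carries a `geometry` entry of the form `((xmin, ymin), (xmax, ymax))` in relative page coordinates. The `reorder_words` helper, the `row_tol` tolerance, and the sample words are hypothetical and would need tuning on a real document; this is not the builder's own algorithm.

```python
def reorder_words(words, row_tol=0.01):
    """Group words into rows by vertical centre, then sort each row left to right.

    row_tol is an assumed tolerance in relative page units.
    """
    def y_center(word):
        (_, ymin), (_, ymax) = word["geometry"]
        return (ymin + ymax) / 2

    def x_min(word):
        return word["geometry"][0][0]

    rows = []
    for word in sorted(words, key=y_center):
        # Extend the current row while the word stays vertically close to the
        # last word added to it; otherwise start a new row.
        if rows and abs(y_center(word) - y_center(rows[-1][-1])) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=x_min) for row in rows]


# Hypothetical words, deliberately listed out of reading order
words = [
    {"value": "33", "geometry": ((0.60, 0.100), (0.65, 0.120))},
    {"value": "HINDI", "geometry": ((0.10, 0.102), (0.30, 0.122))},
    {"value": "100", "geometry": ((0.50, 0.101), (0.55, 0.121))},
    {"value": "ENGLISH", "geometry": ((0.10, 0.150), (0.30, 0.170))},
    {"value": "100", "geometry": ((0.50, 0.151), (0.55, 0.171))},
]

for row in reorder_words(words):
    print(" ".join(w["value"] for w in row))
```

Grouping by vertical centre keeps all cells of a table row together before sorting them horizontally, which is the behaviour the expected output in this issue asks for. Skewed or rotated scans would likely need a more careful row test.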

Best

@PoornaSaiNagendra
Author

Thanks for the suggestion. Looking forward to table comprehension/reconstruction.

Regards

@PoornaSaiNagendra
Author

Hi @charlesmindee

Actually, I am looking to extract information from the table. I initially proceeded with regex, but because the words are currently misaligned, the same regex might no longer be suitable once the issue is resolved.
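For illustration, a regex over a correctly ordered row could look like the sketch below. It assumes the layout of the expected output quoted earlier in this issue: a subject name, then a run of numeric fields, then trailing words. The `ROW_PATTERN` and `parse_row` names are hypothetical, and the pattern would not handle a subject name that itself contains digits.

```python
import re

# Subject (no digits), then one or more whitespace-separated numbers,
# then whatever text remains on the row.
ROW_PATTERN = re.compile(
    r"^(?P<subject>\D+?)\s+(?P<numbers>\d+(?:\s+\d+)*)\s*(?P<rest>.*)$"
)


def parse_row(row_text):
    """Split one table row into subject, numeric fields and trailing words."""
    match = ROW_PATTERN.match(row_text)
    if match is None:
        return None
    return {
        "subject": match.group("subject").strip(),
        "numbers": match.group("numbers").split(),
        "rest": match.group("rest").split(),
    }


row = "HINDI (SPECIALI) 100 33 078 078 SEVEN EIGHT DISTN"
parsed = parse_row(row)
print(parsed)
```

The numeric fields are returned as strings so that leading zeros such as `078` are preserved.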

Could you please let me know whether key information extraction (KIE) models could be included in the pipeline at present, or suggest an alternative approach for building our own custom KIE that can be added as a postprocessing step after docTR?

Thanks and Regards

@felixdittrich92
Contributor

felixdittrich92 commented Oct 15, 2021

@PoornaSaiNagendra
Do you mean something like this LayoutLM-Example?
If yes, then take a look at Tut. For the moment, in this case, you can replace tesseract with doctr after detection :)

@PoornaSaiNagendra
Author

PoornaSaiNagendra commented Oct 15, 2021

@felixdittrich92
Thanks for pointing me to the materials I needed. In my case the data sits inside a table, so I was looking for models like this that can help me integrate doctr with downstream tasks such as key information extraction 😄 For now I am using spaCy to add my own custom entities.

@charlesmindee
Collaborator

I am moving this to a discussion so that we can keep discussing it there and close the bug issue.

@mindee mindee locked and limited conversation to collaborators Oct 20, 2021


4 participants