Adds XML export method to DocumentBuilder #544

Merged: 15 commits into mindee:main, Nov 2, 2021

Conversation

felixdittrich92
Contributor

@fg-mindee
@charlesmindee
feat: adds the option to export the results in XML (hOCR) format, like Tesseract

This PR also makes it possible to convert documents into PDFs with a text layer.
As with render(), the results depend on a correct division into blocks and lines and on correct sorting of the boxes.

Resolves: #512
Note: #512 can be closed after adding an example/tutorial on how to use this output to generate PDF files with a text layer.
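
For readers unfamiliar with hOCR, here is a minimal sketch of what such an export could look like. It is not the code from this PR: `page_to_hocr`, `to_pixel_bbox`, and the input layout (lines of `(text, relative box, confidence)` tuples) are hypothetical names chosen for illustration. Only the hOCR conventions themselves (`ocr_page`, `ocr_line`, `ocrx_word`, and `bbox` / `x_wconf` in the `title` attribute) come from the format.

```python
import xml.etree.ElementTree as ET


def to_pixel_bbox(rel_box, width, height):
    """Turn ((xmin, ymin), (xmax, ymax)) in [0, 1] into an hOCR 'bbox x0 y0 x1 y1' string."""
    (xmin, ymin), (xmax, ymax) = rel_box
    coords = (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )
    return "bbox " + " ".join(str(c) for c in coords)


def page_to_hocr(lines, width, height):
    """Build a minimal hOCR string for one page.

    `lines` is a list of lines; each line is a list of
    (text, ((xmin, ymin), (xmax, ymax)), confidence) tuples (hypothetical layout).
    """
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    body = ET.SubElement(html, "body")
    page = ET.SubElement(
        body, "div",
        {"class": "ocr_page", "id": "page_1", "title": f"bbox 0 0 {width} {height}"},
    )
    for line_idx, words in enumerate(lines, start=1):
        # The line box is the union of the boxes of its words.
        line_box = (
            (min(w[1][0][0] for w in words), min(w[1][0][1] for w in words)),
            (max(w[1][1][0] for w in words), max(w[1][1][1] for w in words)),
        )
        line = ET.SubElement(
            page, "span",
            {"class": "ocr_line", "id": f"line_{line_idx}",
             "title": to_pixel_bbox(line_box, width, height)},
        )
        for word_idx, (text, box, conf) in enumerate(words, start=1):
            word = ET.SubElement(
                line, "span",
                {"class": "ocrx_word", "id": f"word_{line_idx}_{word_idx}",
                 "title": f"{to_pixel_bbox(box, width, height)}; x_wconf {round(conf * 100)}"},
            )
            word.text = text
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    sample = [[
        ("Hello", ((0.1, 0.1), (0.3, 0.2)), 0.99),
        ("world", ((0.35, 0.1), (0.6, 0.2)), 0.87),
    ]]
    print(page_to_hocr(sample, width=1000, height=500))
```

The sketch makes the dependency on sorting and grouping visible: words are written in the order given, and every word has to sit inside a line element.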

@fg-mindee fg-mindee self-requested a review October 26, 2021 09:45
@fg-mindee fg-mindee self-assigned this Oct 26, 2021
@fg-mindee fg-mindee added the module: io Related to doctr.io label Oct 26, 2021
@fg-mindee fg-mindee changed the title #512 add export_as_xml Added XML export method to DocumentBuilder Oct 26, 2021
@fg-mindee fg-mindee changed the title Added XML export method to DocumentBuilder Adds XML export method to DocumentBuilder Oct 26, 2021
Contributor

@fg-mindee fg-mindee left a comment

Thanks for the PR! Looks good to me overall, I only think we should move everything into a Page.export_as_xml method which will be called by Document.export_as_xml

Let me know what you think 👌
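
A minimal sketch of the delegation suggested above, assuming simplified stand-in classes. The `Page` and `Document` classes below are hypothetical and hold only a list of words; they are not the docTR elements, and the return type is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for a page element; it only holds words."""
    page_idx: int
    words: List[str] = field(default_factory=list)

    def export_as_xml(self) -> ET.Element:
        # All XML logic lives at page level.
        root = ET.Element("div", {"class": "ocr_page", "id": f"page_{self.page_idx + 1}"})
        for word in self.words:
            span = ET.SubElement(root, "span", {"class": "ocrx_word"})
            span.text = word
        return root


@dataclass
class Document:
    """Hypothetical stand-in for a document: an ordered list of pages."""
    pages: List[Page] = field(default_factory=list)

    def export_as_xml(self) -> List[ET.Element]:
        # The document holds no XML logic of its own; it delegates to each page.
        return [page.export_as_xml() for page in self.pages]


if __name__ == "__main__":
    doc = Document([Page(0, ["Hello", "world"]), Page(1, ["Bye"])])
    for element in doc.export_as_xml():
        print(ET.tostring(element, encoding="unicode"))
```

With this split, a single page can be exported without building a whole document, and the document-level method stays a one-liner.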

doctr/io/elements.py (outdated, resolved)
test/common/test_core.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/models/builder.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
@felixdittrich92
Contributor Author

@fg-mindee
By the way, I have checked the parser: this will only work if we set resolve_lines and resolve_blocks in builder.py to True :)
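
To illustrate why line resolution matters here: an hOCR consumer expects words nested inside lines, so ungrouped words leave it nothing to build lines from. The sketch below groups words into lines by vertical position. It is a deliberately simplified illustration and not docTR's resolution algorithm; `group_words_into_lines` and `y_tolerance` are hypothetical names.

```python
def group_words_into_lines(words, y_tolerance=0.01):
    """Group words into lines by comparing vertical centers.

    `words` is a list of (text, (xmin, ymin, xmax, ymax)) tuples in relative
    coordinates. Two consecutive words (in top-to-bottom order) share a line
    when their vertical centers differ by at most `y_tolerance`.
    """
    def y_center(word):
        _, (_, ymin, _, ymax) = word
        return (ymin + ymax) / 2

    lines = []
    for word in sorted(words, key=y_center):
        if lines and abs(y_center(word) - y_center(lines[-1][-1])) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Within each line, order the words from left to right.
    return [sorted(line, key=lambda w: w[1][0]) for line in lines]


if __name__ == "__main__":
    sample = [
        ("world", (0.4, 0.10, 0.6, 0.14)),
        ("Hello", (0.1, 0.10, 0.3, 0.14)),
        ("Bye", (0.1, 0.30, 0.2, 0.34)),
    ]
    for line in group_words_into_lines(sample):
        print(" ".join(text for text, _ in line))
```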

Contributor

@fg-mindee fg-mindee left a comment

Thanks! I added a few suggestions: would you mind adding more comments to your code, please? That way people will understand what the code is expected to do :)

doctr/models/builder.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (resolved)
doctr/io/elements.py (outdated, resolved)
test/common/test_io_elements.py (outdated, resolved)
@felixdittrich92
Contributor Author

@fg-mindee
@charlesmindee
I'm done on my side :)
What do you think?

@felixdittrich92
Contributor Author

@fg-mindee
Any further changes needed? 🤗

Have a nice day 😃

@codecov

codecov bot commented Oct 29, 2021

Codecov Report

Merging #544 (39460d0) into main (b27c3a6) will increase coverage by 0.03%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##             main     #544      +/-   ##
==========================================
+ Coverage   96.04%   96.08%   +0.03%     
==========================================
  Files         109      109              
  Lines        4198     4236      +38     
==========================================
+ Hits         4032     4070      +38     
  Misses        166      166              
Flag Coverage Δ
unittests 96.08% <97.36%> (+0.03%) ⬆️

Impacted Files Coverage Δ
doctr/io/elements.py 91.95% <97.36%> (+1.51%) ⬆️
...dels/detection/differentiable_binarization/base.py 91.82% <0.00%> (+0.62%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@fg-mindee fg-mindee left a comment

Thanks again! I added some comments, I think we're almost ready for merge

docs/source/using_models.rst (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
examples/generate_pdfa_with_doctr_output.py (outdated, resolved)
test/common/test_io_elements.py (outdated, resolved)
test/common/test_io_elements.py (outdated, resolved)
@felixdittrich92
Contributor Author

@fg-mindee
Please let me know what you think about the skip_rotated_boxes part 🤗

Contributor

@fg-mindee fg-mindee left a comment

A few last changes and we can merge :)

doctr/models/builder.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (resolved)
doctr/io/elements.py (outdated, resolved)
@fg-mindee fg-mindee added the labels topic: documentation (Improvements or additions to documentation) and ext: tests (Related to tests folder) Nov 1, 2021
@fg-mindee fg-mindee added the label type: enhancement (Improvement) Nov 1, 2021
@fg-mindee fg-mindee added this to the 0.6.0 milestone Nov 1, 2021
@felixdittrich92
Contributor Author

@fg-mindee
Now I have found this damn blank line 😂
I was fixated on the last line the whole time 🙈

Contributor

@fg-mindee fg-mindee left a comment

A few typos left 😅

doctr/io/elements.py (outdated, resolved)
doctr/io/elements.py (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment

Thanks for all the edits 🙏

@fg-mindee
Contributor

FYI @felixdittrich92
Not sure it was on purpose so I thought it's better to let you know: it looks like you're working with multiple Git accounts. Some of your commits come from felixdittrich92 but others are from felix (cf. https://github.com/mindee/doctr/pull/544/commits)

If that's not on purpose, I would suggest sticking to your main account ;)

@fg-mindee fg-mindee merged commit 92c1eeb into mindee:main Nov 2, 2021
@frgfm frgfm mentioned this pull request Jun 28, 2022
85 tasks
Labels
ext: tests (Related to tests folder)
module: io (Related to doctr.io)
topic: documentation (Improvements or additions to documentation)
type: enhancement (Improvement)
Development

Successfully merging this pull request may close these issues.

Adding hocr output to generate Pdf/A
2 participants