to_text() returns empty text when the document doesn't have sections #73

livelxw opened this issue Apr 17, 2024 · 1 comment

livelxw commented Apr 17, 2024

I found that to_text() only reads sections:

    def to_text(self):
        """
        Returns text of a document by iterating through all the sections '\n'
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

and self.sections() reads child nodes of root_node with tag header:

    def sections(self):
        """
        Returns all the sections in the block. This is useful for getting all the sections in a document.
        """
        sections = []
        def chunk_collector(node):
            if node.tag in ['header']:
                sections.append(node)
        self.iter_children(self, 0, chunk_collector)
        return sections

When the response from the nlm-ingestor server doesn't contain any sections, the function returns an empty string. Should it get the text from all children of root_node instead?
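The empty result can be reproduced without the server. The sketch below is a minimal stand-in, not the library's code: `Node`, `iter_nodes`, `sections_only_text` and `text_with_fallback` are hypothetical names that only mirror the logic quoted above, and the fallback is one assumed way to handle a document without headers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; not the library's class."""
    tag: str
    text: str
    children: List["Node"] = field(default_factory=list)

    def to_text(self) -> str:
        # Own text followed by the text of all descendants, one per line.
        parts = [self.text] if self.text else []
        parts.extend(child.to_text() for child in self.children)
        return "\n".join(parts)


def iter_nodes(node):
    """Yield every descendant of `node` in document order."""
    for child in node.children:
        yield child
        yield from iter_nodes(child)


def sections(root):
    # Mirrors the quoted logic: only 'header' nodes count as sections.
    return [n for n in iter_nodes(root) if n.tag == "header"]


def sections_only_text(root):
    # Mirrors the quoted to_text(): concatenates section text only.
    return "".join(s.to_text() + "\n" for s in sections(root))


def text_with_fallback(root):
    # Assumed fix: use the root's children when no headers exist.
    if not sections(root):
        return "".join(c.to_text() + "\n" for c in root.children)
    return sections_only_text(root)


# A document made only of paragraphs, with no header nodes.
root = Node("root", "", [
    Node("para", "First paragraph."),
    Node("para", "Second paragraph."),
])

print(repr(sections_only_text(root)))  # ''
print(repr(text_with_fallback(root)))  # 'First paragraph.\nSecond paragraph.\n'
```

The fallback only covers the case where no header exists at all; it does not change what happens for documents that do contain sections.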


thomastiotto commented Apr 25, 2024

I'd also like to add that Document.to_text() outputs duplicated text, because it calls to_text() on every section and sections can be children of other sections.

In this example, self.sections()[0] (block_idx=1) is the parent of self.sections()[1] (block_idx=2), so calling to_text() on both results in duplicated text.

[Screenshot, 2024-04-25 11:16:35]
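The duplication can be shown on a small hand-built tree. This is a sketch under assumptions: `Block`, `all_sections` and `top_level_sections` are hypothetical stand-ins rather than the library's API, and skipping headers nested inside other headers is just one possible way to avoid emitting a subsection twice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block; not the library's class."""
    tag: str
    text: str
    children: List["Block"] = field(default_factory=list)

    def to_text(self) -> str:
        # Own text followed by the text of all descendants, one per line.
        parts = [self.text]
        parts.extend(c.to_text() for c in self.children)
        return "\n".join(parts)


def all_sections(node):
    """Every 'header' descendant, nested ones included."""
    found = []
    for child in node.children:
        if child.tag == "header":
            found.append(child)
        found.extend(all_sections(child))
    return found


def top_level_sections(node):
    """Only headers that are not inside another header."""
    found = []
    for child in node.children:
        if child.tag == "header":
            # Do not descend: this header's text already covers its subsections.
            found.append(child)
        else:
            found.extend(top_level_sections(child))
    return found


sub = Block("header", "1.1 Background", [Block("para", "Details.")])
top = Block("header", "1 Introduction", [Block("para", "Overview."), sub])
root = Block("root", "", [top])

duplicated = "".join(s.to_text() + "\n" for s in all_sections(root))
deduplicated = "".join(s.to_text() + "\n" for s in top_level_sections(root))

print(duplicated.count("1.1 Background"))    # 2
print(deduplicated.count("1.1 Background"))  # 1
```

Here the subsection appears once through its parent and once on its own, which matches the parent/child relationship described above.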

I also think it would make more sense to implement to_text() on Document.root_node, which is the behaviour I expected before looking through the documentation.
Wouldn't it be more logical to simply do the following? It seems to work on a simple PDF:

    def to_text(self):
        text = ""
        for n in self.root_node.children:
            text = text + n.to_text(include_children=True, recurse=True) + "\n"
        return text
