`to_text()` returns empty text when the document doesn't have sections #73

livelxw · 2024-04-17T06:40:24Z

I found that to_text() reads only from sections:

    def to_text(self):
        """
        Returns text of a document by iterating through all the sections '\n'
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

and self.sections() reads child nodes of root_node with tag header:

    def sections(self):
        """
        Returns all the sections in the block. This is useful for getting all the sections in a document.
        """
        sections = []
        def chunk_collector(node):
            if node.tag in ['header']:
                sections.append(node)
        self.iter_children(self, 0, chunk_collector)
        return sections

When the response from the nlm-ingestor server contains no sections, the function returns an empty string. Should it instead get text from all children of root_node?


thomastiotto · 2024-04-25T09:16:47Z

I'd also like to add that calling Document.to_text() outputs duplicated text, because to_text() is called on every section and sections can be children of other sections.

In this example, self.sections()[0] (block_idx=1) is the parent of self.sections()[1] (block_idx=2), so obviously calling to_text() on both will result in duplicated text.
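
The duplication can be reproduced without a PDF. The following is only a minimal sketch: `Node`, `collect_sections` and `section_based_text` are hypothetical stand-ins that mimic the behaviour quoted above, not the library's actual classes, and the sample titles and paragraphs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; NOT the real library class."""
    tag: str
    text: str
    children: list = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        # Own text first, then (optionally) the text of the children.
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def collect_sections(root):
    """Collect every 'header' node at any depth, as sections() is described above."""
    found = []

    def visit(node):
        for child in node.children:
            if child.tag == "header":
                found.append(child)
            visit(child)

    visit(root)
    return found


def section_based_text(root):
    """Mimic the current to_text(): concatenate the recursive text of every section."""
    return "".join(
        section.to_text(include_children=True, recurse=True) + "\n"
        for section in collect_sections(root)
    )


# A section nested inside another section (made-up sample content).
inner = Node("header", "1.1 Details", [Node("para", "Inner paragraph.")])
outer = Node("header", "1 Overview", [Node("para", "Outer paragraph."), inner])
root = Node("root", "", [outer])

duplicated = section_based_text(root)
print(duplicated)
```

In this sketch the nested section is emitted once as part of its parent and once on its own, which is the same pattern as in the output described above.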

I also think it would make more sense to have to_text() implemented on the Document.root_node, which was the behaviour I was expecting before looking through the documentation.
Wouldn't it be more logical to simply do this? It seems to work on a simple PDF:

    def to_text(self):
        text = ""
        for n in self.root_node.children:
            text = text + n.to_text(include_children=True, recurse=True) + "\n"
        return text
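
As a rough check of the idea, here is a minimal sketch on a toy tree. `Node` and `root_based_text` are hypothetical stand-ins that only imitate the `to_text(include_children=True, recurse=True)` call used above; they are not the library's classes, and the sample content is made up. It covers both cases raised in this issue: a document with no sections, and a section nested in another section.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; NOT the real library class."""
    tag: str
    text: str
    children: list = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        # Own text first, then (optionally) the text of the children.
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def root_based_text(root):
    """Concatenate the recursive text of the root's direct children only."""
    return "".join(
        child.to_text(include_children=True, recurse=True) + "\n"
        for child in root.children
    )


# Case 1: no headers at all, only paragraphs directly under the root.
flat_root = Node("root", "", [
    Node("para", "First paragraph."),
    Node("para", "Second paragraph."),
])

# Case 2: a section nested inside another section.
inner = Node("header", "1.1 Details", [Node("para", "Inner paragraph.")])
outer = Node("header", "1 Overview", [Node("para", "Outer paragraph."), inner])
nested_root = Node("root", "", [outer])

flat_text = root_based_text(flat_root)
nested_text = root_based_text(nested_root)
print(flat_text)
print(nested_text)
```

Because each node is reached through exactly one path from the root, the sectionless document still produces text and the nested section appears only once in this toy setup. Whether the real tree behaves the same way for every block type is an assumption that would need checking against actual parser output.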

thomastiotto mentioned this issue Apr 25, 2024

Bug in API function: Incorrect behavior with repeated sections. #49

Open
