Use figure index rather than xml:id attribute this is not always present #51

elshimone · 2023-12-03T13:41:20Z

This is particularly true when loading papers which are parsed from PDFs, e.g. #46.

If this just needs to be unique within the scope of the document, then perhaps we can use an index instead.

davidmezzetti · 2023-12-03T14:53:40Z

Thank you for the PR! I know that issue has been open a while.

The xml:id has some value in that it shows whether it's a figure or a table. The section name doesn't matter all that much; in fact, in some cases it's empty.

How about going with this?

        # Extract text from tables
        for i, figure in enumerate(soup.find("text").find_all("figure")):
            # Use XML Id (if available) as figure name to ensure figures are uniquely named
            name = figure.get("xml:id")
            name = name.upper() if name else f"FIGURE_{i}"

            # Search for table
            table = figure.find("table")
            if table:
                sections.extend([(name, x) for x in Table.extract(table)])
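
For anyone who wants to try the naming fallback in isolation, here is a minimal standalone sketch. It is not the project's code: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, and the `SAMPLE` document and the `figure_names` helper are hypothetical names made up for illustration. It only shows that a figure without an xml:id gets an index-based name while the others keep their upper-cased id.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI-like sample: the second figure has no xml:id
SAMPLE = """
<TEI>
  <text>
    <figure xml:id="fig_0"><table/></figure>
    <figure><table/></figure>
    <figure xml:id="tab_1"><table/></figure>
  </text>
</TEI>
"""


def figure_names(xml):
    """Return one name per figure: upper-cased xml:id if present, else FIGURE_<index>."""
    root = ET.fromstring(xml)
    names = []
    for i, figure in enumerate(root.find("text").iter("figure")):
        # get() returns None when the attribute is missing, so guard before upper()
        name = figure.get(XML_ID)
        names.append(name.upper() if name else f"FIGURE_{i}")
    return names


names = figure_names(SAMPLE)
print(names)  # ['FIG_0', 'FIGURE_1', 'TAB_1']
```

One caveat with this approach: uniqueness is not strictly guaranteed, since a document could in principle carry an xml:id such as `figure_1` that collides with a generated fallback name.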

elshimone · 2023-12-03T15:30:46Z

Yes, sounds good. FYI, I did see a mixture of figures with and without ids (or rather, figure-like ids, e.g. fig_1, fig_2, , fig_3, fig_4, ...) within the same XML document, so I decided to bin it. Makes sense to preserve it where possible though.

Use figure index rather than xml:id attribute this is not always pres…

2cbfeb4

…ent. Particularly when loading papers which are parsed from PDFs

davidmezzetti added this to the v2.3.0 milestone Dec 3, 2023

davidmezzetti linked an issue Dec 3, 2023 that may be closed by this pull request

AttributeError: 'NoneType' object has no attribute 'upper' #46

Closed

davidmezzetti assigned elshimone Dec 3, 2023

Preserve xml:id where present

f05330c

davidmezzetti merged commit 1a09f28 into neuml:master Dec 3, 2023
3 checks passed
