Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use figure index rather than xml:id attribute this is not always present #51

Merged
merged 2 commits into from
Dec 3, 2023

Conversation

elshimone
Copy link
Contributor

This is particularly true when loading papers which are parsed from PDFs e.g #46

If this just needs to be unique within the scope of the document, then perhaps we can use an index instead.

…ent.

Particularly when loading papers which are parsed from PDFs
@davidmezzetti davidmezzetti added this to the v2.3.0 milestone Dec 3, 2023
@davidmezzetti davidmezzetti linked an issue Dec 3, 2023 that may be closed by this pull request
@davidmezzetti
Copy link
Member

Thank you for the PR! I know that issue has been open a while.

The xml:id has some value in that it shows if it's a figure or a table. The section name doesn't matter all that much, in fact in some cases it's empty.

How about going with this?

        # Extract text from tables
        for i, figure in enumerate(soup.find("text").find_all("figure")):
            # Use XML Id (if available) as figure name to ensure figures are uniquely named
            name = figure.get("xml:id")
            name = name.upper() if name else f"FIGURE_{i}"

            # Search for table
            table = figure.find("table")
            if table:
                sections.extend([(name, x) for x in Table.extract(table)])

@elshimone
Copy link
Contributor Author

Yes sounds good - fyi I did see a mixture of figures with and without ids (or rather, figure like ids, e.g. fig_1,fig_2, , fig_3, fig_4 .....) within the same xml document so I decided to bin it. Makes sense to preserve it where possible though.

@davidmezzetti davidmezzetti merged commit 1a09f28 into neuml:master Dec 3, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AttributeError: 'NoneType' object has no attribute 'upper'
2 participants