Invalid start/end positions for TextNode source range #1927

Closed
KennyWongPFPT opened this issue Mar 28, 2023 · 1 comment

@KennyWongPFPT

If I configure the parser to track positions and feed in the fragment <table>foo<tr><td>bar</td></tr></table>, both foo and bar are parsed as text nodes. However, the source range for foo has invalid start/end positions. I realise foo is misplaced text, but is there a specific reason why its positions are not populated? I also see the same unexpected behaviour if I add whitespace between <table> and <tr>: the resulting text node has invalid positions.

Here's a simple test program:

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.Range;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.HtmlTreeBuilder;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeTraversor;

public class Test {
    public static void main(String[] args) {
        // Build an HTML parser with source position tracking enabled.
        HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
        Parser parser = new Parser(treeBuilder);
        parser.setTrackPosition(true);
        // "foo" sits directly inside <table>, so it is misplaced text.
        Document document = parser.parseInput("<table>foo<tr><td>bar</td></tr></table>", "");
        // Print the tracked source range of every text node in the document.
        NodeTraversor.traverse((Node node, int depth) -> {
            if (node instanceof TextNode textNode) {
                Range sourceRange = textNode.sourceRange();
                System.out.printf("text=%s start=%d end=%d%n",
                    textNode.text(),
                    sourceRange.start().pos(),
                    sourceRange.end().pos());
            }
        }, document);
    }
}

And the unexpected output:

$ java -cp ~/.m2/repository/org/jsoup/jsoup/1.15.4/jsoup-1.15.4.jar Test.java 
text=foo start=0 end=-1    # start/end positions are invalid here
text=bar start=18 end=21
@jhy jhy closed this as completed in c93ea51 Mar 29, 2023
@jhy jhy self-assigned this Mar 29, 2023
@jhy jhy added bug Confirmed bug that we should fix fixed labels Mar 29, 2023
@jhy jhy added this to the 1.16.1 milestone Mar 29, 2023
@jhy
Owner

jhy commented Mar 29, 2023

Thanks, fixed. The parser was losing the Token start/end positions for fostered table text because we were only storing the pending string data rather than the original token.
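
The cause described above can be illustrated with a minimal, self-contained sketch. It is not jsoup's actual code: CharToken, TextNode, UNTRACKED, and both flush methods are hypothetical names made up for this illustration. The sketch only shows the general idea that buffering bare strings discards offsets, while buffering the original tokens keeps them. The offsets 7 and 10 for foo are an assumption derived from the input string (<table> is 7 characters long), using the same convention as the 18 and 21 reported for bar.

```java
import java.util.ArrayList;
import java.util.List;

public class FosteredTextSketch {
    /** Hypothetical stand-in for a character token that carries source offsets. */
    record CharToken(String data, int startPos, int endPos) {}

    /** Hypothetical stand-in for a text node with a tracked source range. */
    record TextNode(String text, int start, int end) {}

    /** Marker used in this sketch for "position not tracked". */
    static final int UNTRACKED = -1;

    /** Lossy buffering: keeps only the string data, so offsets cannot be recovered. */
    static List<TextNode> flushFromStrings(List<CharToken> tableText) {
        List<String> pending = new ArrayList<>();
        for (CharToken token : tableText) {
            pending.add(token.data()); // the offsets are dropped here
        }
        List<TextNode> nodes = new ArrayList<>();
        for (String data : pending) {
            nodes.add(new TextNode(data, UNTRACKED, UNTRACKED));
        }
        return nodes;
    }

    /** Position-preserving buffering: keeps the whole token until the flush. */
    static List<TextNode> flushFromTokens(List<CharToken> tableText) {
        List<CharToken> pending = new ArrayList<>(tableText);
        List<TextNode> nodes = new ArrayList<>();
        for (CharToken token : pending) {
            nodes.add(new TextNode(token.data(), token.startPos(), token.endPos()));
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Assumed offsets: "<table>" is 7 characters, so "foo" spans 7 to 10.
        List<CharToken> tableText = List.of(new CharToken("foo", 7, 10));
        for (TextNode node : flushFromStrings(tableText)) {
            System.out.printf("strings: text=%s start=%d end=%d%n",
                node.text(), node.start(), node.end());
        }
        for (TextNode node : flushFromTokens(tableText)) {
            System.out.printf("tokens:  text=%s start=%d end=%d%n",
                node.text(), node.start(), node.end());
        }
    }
}
```

In this sketch the only difference between the two methods is what gets buffered. Once the pending list holds plain strings, nothing downstream can restore the range, which matches the symptom reported for foo.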
