Textarea contents are parsed as empty if the source input is sufficiently large and contain un-escaped closing HTML tags #1929

Closed
chenbingbing01 opened this issue Mar 29, 2023 · 7 comments

@chenbingbing01

chenbingbing01 commented Mar 29, 2023

Parsing this HTML produces an incorrect result. The textarea tag's inner text does not stay inside the element; it ends up after the textarea tag instead.
(Two screenshots attached.)

test code:

import org.jsoup.Jsoup;

public class Test {

	public static void main(String[] args) {
		String sourceHtml="\r\n"
// JH: snipped 2000 lines - moved to 1929-source.html.gz
		;
		// Print the input size, then parse and look up the textarea elements
		System.out.println(sourceHtml.length());
		org.jsoup.nodes.Document document = Jsoup.parse(sourceHtml);
		document.getElementsByTag("textarea");
	}

}
@jhy
Owner

jhy commented Mar 29, 2023

I can't follow this - can you please simplify this to a testcase with the few pertinent HTML lines, and then an assertEquals for what you are expecting (vs what you are getting). Images of code are not helpful.

@jhy
Owner

jhy commented Mar 29, 2023

Your report is 2174 lines long. This is still at least 2170 lines more than "a few".

Please, work out the minimum amount of HTML that is triggering your issue, and clean up this report (edit the first submission, and delete your second) to only include that.

You may use https://try.jsoup.org/ to easily see the parse results for a given input.

@chenbingbing01
Author

Sorry, this bug only occurs with very long HTML. I used https://try.jsoup.org/ to see the parse results for the given input, and it is still parsed incorrectly.
(Screenshot attached.)

@jhy
Owner

jhy commented Apr 25, 2023

1929-source.html.gz

(Attached the reporter's original source HTML.)

@jhy
Owner

jhy commented Apr 25, 2023

OK, I was able to reproduce this. Here's a simpler repro:

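    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class Repro {
        public static void main(String[] args) {
            // Build a large textarea body that contains un-escaped closing tags
            StringBuilder sb = new StringBuilder();
            int num = 2000;
            for (int i = 0; i <= num; i++) {
                sb.append("\n<text>foo</text>\n");
            }
            String textContent = sb.toString();
            String sourceHtml = "<textarea>" + textContent + "</textarea>";

            Document doc = Jsoup.parse(sourceHtml);
            Element textArea = doc.expectFirst("textarea");

            // Expected: the textarea's text is exactly the content placed inside it
            System.out.println("Text area parsed: " + textArea.wholeText().equals(textContent));
        }
    }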

Produces:

Text area parsed: false

It looks like a buffering issue. If I set num=1000, the result is true, but goes false at 2000.
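
To pin down where the result flips between num=1000 and num=2000, one option is to bisect on num. The following is only a sketch: the class name `ThresholdSearch`, the method `firstFailing`, and the stand-in threshold of 1500 are invented for illustration, and the search assumes the result flips exactly once in the range, which this thread does not confirm. In a real run the predicate would build the textarea input with n repetitions, parse it, and compare `wholeText()` with the input.

```java
import java.util.function.IntPredicate;

public class ThresholdSearch {
    /**
     * Returns the smallest n in (lo, hi] for which passes.test(n) is false.
     * Assumes passes.test(lo) is true, passes.test(hi) is false, and the
     * result flips exactly once in between.
     */
    static int firstFailing(int lo, int hi, IntPredicate passes) {
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (passes.test(mid)) {
                lo = mid; // still passes: the flip is above mid
            } else {
                hi = mid; // already fails: the flip is at or below mid
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        // Stand-in predicate with a made-up threshold, used only to show the search.
        // Replace it with a check that parses the generated HTML for a given n.
        IntPredicate standIn = n -> n < 1500;
        System.out.println("First failing num: " + firstFailing(1000, 2000, standIn));
    }
}
```

With the stand-in predicate this prints `First failing num: 1500`. Against the real parser, the number found would only be meaningful for that exact input shape.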

jhy changed the title from "html parse bug" to "Textarea contents are empty if the source input is sufficiently large" on Apr 25, 2023
jhy self-assigned this on Apr 25, 2023
jhy added the "bug" label (Confirmed bug that we should fix) on Apr 25, 2023
jhy closed this as completed in f0ae81b on Apr 25, 2023
@jhy
Owner

jhy commented Apr 25, 2023

Thanks, fixed!

jhy changed the title from "Textarea contents are empty if the source input is sufficiently large" to "Textarea contents are parsed as empty if the source input is sufficiently large and contain un-escaped closing HTML tags" on Apr 25, 2023
jhy added this to the 1.16.1 milestone on Apr 25, 2023
@chenbingbing01
Author

Nice work! Thanks.
