Textarea contents are parsed as empty if the source input is sufficiently large and contain un-escaped closing HTML tags #1929

chenbingbing01 · 2023-03-29T06:09:48Z

I hit an error while testing HTML parsing. The textarea tag's inner text does not stay inside the element; it ends up after the textarea tag instead.

test code:

import org.jsoup.Jsoup;

public class Test {

	public static void main(String[] args) {
		String sourceHtml="\r\n"
// JH: snipped 2000 lines - moved to 1929-source.html.gz
		System.out.println(sourceHtml.length());
		org.jsoup.nodes.Document document = Jsoup.parse(sourceHtml);
		document.getElementsByTag("textarea");
	}

}

jhy · 2023-03-29T06:49:10Z

I can't follow this - can you please simplify this to a testcase with the few pertinent HTML lines, and then an assertEquals for what you are expecting (vs what you are getting). Images of code are not helpful.

jhy · 2023-03-29T08:40:17Z

Your report is 2174 lines long. This is still at least 2170 lines more than "a few".

Please, work out the minimum amount of HTML that is triggering your issue, and clean up this report (edit the first submission, and delete your second) to only include that.

You may use https://try.jsoup.org/ to easily see the parse results for a given input.

chenbingbing01 · 2023-03-29T09:02:07Z

Sorry, this bug only occurs with very long HTML. I used https://try.jsoup.org/ to see the parse results for the given input, and the parse is still wrong.

jhy · 2023-04-25T01:30:47Z

1929-source.html.gz

(Attached the reporter's original source HTML.)

jhy · 2023-04-25T01:48:52Z

OK, I was able to reproduce this. Here's a simpler repro:

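    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TextareaRepro {
        public static void main(String[] args) {
            // Build a large textarea body that contains un-escaped closing tags
            StringBuilder sb = new StringBuilder();
            int num = 2000;
            for (int i = 0; i <= num; i++) {
                sb.append("\n<text>foo</text>\n");
            }
            String textContent = sb.toString();
            String sourceHtml = "<textarea>" + textContent + "</textarea>";

            Document doc = Jsoup.parse(sourceHtml);
            Element textArea = doc.expectFirst("textarea");

            // Expected: true (the textarea's text should come back unchanged)
            System.out.println("Text area parsed: " + textArea.wholeText().equals(textContent));
        }
    }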

Produces:

Text area parsed: false

It looks like a buffering issue. If I set num=1000, the result is true, but goes false at 2000.
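
For reference, a minimal sketch of the input sizes involved in those two cases. It uses only the JDK, rebuilds the same string as the repro, and prints its length. The class name `TextareaReproSizes` and the helper `buildSource` are made up for illustration and are not part of jsoup. It does not show where the parser goes wrong, only how large the input is on either side of the threshold.

```java
class TextareaReproSizes {
    // One repetition of the payload used in the repro (18 characters).
    static final String UNIT = "\n<text>foo</text>\n";

    // Hypothetical helper: builds the same source as the repro,
    // i.e. (num + 1) repetitions of UNIT wrapped in a textarea.
    static String buildSource(int num) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= num; i++) {
            sb.append(UNIT);
        }
        return "<textarea>" + sb + "</textarea>";
    }

    public static void main(String[] args) {
        // Compare the passing case (1000) with the failing case (2000).
        for (int num : new int[] {1000, 2000}) {
            int length = buildSource(num).length();
            System.out.println("num=" + num + " -> " + length + " chars");
        }
    }
}
```

With the loop as written (`i <= num`), the source is 18039 characters for num=1000 and 36039 characters for num=2000, so the failing input is roughly twice the size of the passing one.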

jhy · 2023-04-25T02:33:07Z

Thanks, fixed!

chenbingbing01 · 2023-04-26T01:42:37Z

Nice work! Thanks.

jhy changed the title ~~html parse bug~~ Textarea contents are empty if the source input is sufficiently large Apr 25, 2023

jhy self-assigned this Apr 25, 2023

jhy added the bug Confirmed bug that we should fix label Apr 25, 2023

jhy closed this as completed in f0ae81b Apr 25, 2023

jhy changed the title ~~Textarea contents are empty if the source input is sufficiently large~~ Textarea contents are parsed as empty if the source input is sufficiently large and contain un-escaped closing HTML tags Apr 25, 2023

jhy added this to the 1.16.1 milestone Apr 25, 2023
