Check the buffer is fully read if looking for absent content
Fixes #1929

In this case we are testing for a missing `</textarea>`, but if the buffer hasn't been fully read, the end tag may sit beyond the buffered content and the check could never find it.

For the normal case that this code is looking for - a missing `</title>` in brief HTML, a best-effort check (assuming the buffer is complete) is sufficient.
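
The failure mode described above can be sketched without jsoup. This is a minimal illustration, not the library's implementation: `BufferedWindow`, its 64-character buffer, and the two `endTagMissing…` helpers are hypothetical names made up for the example. Only the `readFully()` / `containsIgnoreCase()` shape mirrors the methods touched by this commit.

```java
import java.util.Locale;

// Hypothetical stand-in for a reader that holds only a window of the input in memory.
// It is NOT jsoup's CharacterReader; the names and the tiny buffer size are assumptions.
class BufferedWindow {
    private final String buffered;   // the part of the input currently in the buffer
    private final boolean readFully; // true once the buffer holds all of the input

    BufferedWindow(String input, int maxBufferLen) {
        int len = Math.min(input.length(), maxBufferLen);
        this.buffered = input.substring(0, len);
        this.readFully = len == input.length();
    }

    boolean readFully() {
        return readFully;
    }

    // Searches the buffered window only, never the unread remainder of the input.
    boolean containsIgnoreCase(String seq) {
        return buffered.toLowerCase(Locale.ENGLISH).contains(seq.toLowerCase(Locale.ENGLISH));
    }

    // Unguarded check: treats "not in the buffer" as "not in the input".
    boolean endTagMissingUnguarded(String endTag) {
        return !containsIgnoreCase(endTag);
    }

    // Guarded check: only trusts absence when the whole input has been buffered.
    boolean endTagMissingGuarded(String endTag) {
        return readFully() && !containsIgnoreCase(endTag);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<textarea>");
        for (int i = 0; i < 100; i++) sb.append("<text>foo</text>");
        sb.append("</textarea>");

        // Large input: the end tag exists, but lies beyond the 64-char window.
        BufferedWindow large = new BufferedWindow(sb.toString(), 64);
        System.out.println(large.endTagMissingUnguarded("</textarea>")); // true (wrong)
        System.out.println(large.endTagMissingGuarded("</textarea>"));   // false

        // Brief input that fits in the buffer: the best-effort check still fires.
        BufferedWindow brief = new BufferedWindow("<title>One", 64);
        System.out.println(brief.endTagMissingGuarded("</title>"));      // true
    }
}
```

With the guard, a partially buffered input never triggers the early break-out, while a brief document with a truly missing end tag is still handled as before.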
jhy committed Apr 25, 2023
1 parent 78aeac1 commit f0ae81b
Showing 4 changed files with 26 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGES
@@ -41,6 +41,10 @@ Release 1.16.1 [PENDING]
* Bugfix: <br> tags should be wrap-indented when in block tags (and not when in inline tags).
<https://github.com/jhy/jsoup/issues/1911>

* Bugfix: the contents of a sufficiently large <textarea> with un-escaped HTML closing tags may be incorrectly parsed
to an empty node.
<https://github.com/jhy/jsoup/issues/1929>

Release 1.15.4 [18-Feb-2023]
* Improvement: added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow
regular CSS syntax. For example, to match by classname <p class="one.two">, use document.select("p.one\\.two");
5 changes: 5 additions & 0 deletions src/main/java/org/jsoup/parser/CharacterReader.java
@@ -116,6 +116,11 @@ public int pos() {
return readerPos + bufPos;
}

/** Tests if the buffer has been fully read. */
boolean readFully() {
return readFully;
}

/**
Enables or disables line number tracking. By default, will be <b>off</b>.Tracking line numbers improves the
legibility of parser error messages, for example. Tracking should be enabled before any content is read to be of
2 changes: 1 addition & 1 deletion src/main/java/org/jsoup/parser/TokeniserState.java
@@ -186,7 +186,7 @@ void read(Tokeniser t, CharacterReader r) {
if (r.matches('/')) {
t.createTempBuffer();
t.advanceTransition(RCDATAEndTagOpen);
- } else if (r.matchesAsciiAlpha() && t.appropriateEndTagName() != null && !r.containsIgnoreCase(t.appropriateEndTagSeq())) {
+ } else if (r.readFully() && r.matchesAsciiAlpha() && t.appropriateEndTagName() != null && !r.containsIgnoreCase(t.appropriateEndTagSeq())) {
// diverge from spec: got a start tag, but there's no appropriate end tag (</title>), so rather than
// consuming to EOF; break out here
t.tagPending = t.createTagPending(false).name(t.appropriateEndTagName());
16 changes: 16 additions & 0 deletions src/test/java/org/jsoup/parser/HtmlParserTest.java
@@ -1732,4 +1732,20 @@ private boolean didAddElements(String input) {
//assertEquals("OneTwo", doc.expectFirst("body > div").text());
System.out.println(doc.html());
}

@Test void largeTextareaContents() {
// https://github.com/jhy/jsoup/issues/1929
StringBuilder sb = new StringBuilder();
int num = 2000;
for (int i = 0; i <= num; i++) {
sb.append("\n<text>foo</text>\n");
}
String textContent = sb.toString();
String sourceHtml = "<textarea>" + textContent + "</textarea>";

Document doc = Jsoup.parse(sourceHtml);
Element textArea = doc.expectFirst("textarea");

assertEquals(textContent, textArea.wholeText());
}
}
