
Websites with large amounts of data fail to parse. #1218

Closed
Derek-Baum opened this issue May 21, 2019 · 18 comments
Assignees
Labels
bug Confirmed bug that we should fix
Milestone

Comments

@Derek-Baum

Derek-Baum commented May 21, 2019

Currently using jsoup on some large websites, and it throws a "Mark invalid" IOException, which I take to mean the bufref is negative?

I tried using both Jsoup.connect(url).get()
and Jsoup.connect(url).execute().parse()
Both cause the same exception.

java.io.IOException: Mark invalid
	at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
	at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
	at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
	at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
	at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
	at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)

If anybody would like to reproduce, here are some URLs that fail to parse:
https://www.spec.org/cpu2006/results/res2014q4/
https://www.spec.org/cpu2006/results/res2012q3/
https://www.spec.org/cpu2006/results/res2014q1/
https://www.spec.org/cpu2006/results/res2014q3/
https://www.spec.org/cpu2006/results/res2011q2/
https://www.spec.org/cpu2006/results/res2010q3/
https://www.spec.org/cpu2006/results/res2017q2/
https://www.spec.org/cpu2006/results/res2016q3/
https://www.spec.org/cpu2006/results/res2015q4/
https://www.spec.org/cpu2006/results/res2007q4/
https://www.spec.org/cpu2006/results/res2009q4/
https://www.spec.org/cpu2006/results/res2012q2/
https://www.spec.org/cpu2006/results/res2014q2/
https://www.spec.org/cpu2006/results/res2012q4/
https://www.spec.org/cpu2006/results/res2011q1/


Thanks
@jhy
Owner

jhy commented May 21, 2019

Hi, thanks for the report. Can you confirm which version of jsoup you're using?

Works in 1.12.1 on try.jsoup - http://try.jsoup.org/~MnrEgz1lWYn4NMN2ObClDfXz7D8

Might be a slight code path difference which I can dig into there. But there were some fixes around this in 1.12.1 from the previous version.

@Derek-Baum
Author

@jhy Using version 1.12.1, downloaded hours ago.

@krystiangorecki
Contributor

Same here, version 1.12.1.

Response execute = Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").execute();
Jsoup.parse(execute.body());

works every time, but
Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").get(); and Jsoup.connect("https://www.spec.org/cpu2006/results/res2014q4/").execute().parse();
fail every time with:

Exception in thread "main" java.io.IOException: Mark invalid
	at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
	at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
	at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
	at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
	at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
	at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)
	at JsoupIssue1218.main(JsoupIssue1218.java:7)

@krystiangorecki
Contributor

krystiangorecki commented May 23, 2019

I suspect it happens when an entity falls on the boundary of a buffer read, though not always.
When using this crafted file it happens at the &ccif; entity:
Jsoup.connect("https://gist.githubusercontent.com/krystiangorecki/d3bad50ef5615f06b077438607423533/raw/71adfdf81121282ea936510ed6cfe440adeb2d83/JsoupIssue1218.html").execute().parse();

[Screenshot attachment omitted]

@csaboka
Contributor

csaboka commented Jun 1, 2019

@krystiangorecki : Your suspicion is correct. I could reproduce this reliably in unit tests by using a custom Reader that only returns a very small amount of data for each read() call. This simulates real readers that can return a similarly small amount in edge cases.

Opened #1225 for my proposed fix.
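A drip-feeding Reader along these lines (the class name and shape here are my own illustration, not the code from the PR) is enough to simulate the short-read edge case deterministically:

```java
import java.io.IOException;
import java.io.Reader;

// Wraps another Reader but returns at most `chunk` characters per read()
// call, simulating a slow stream that never fills the caller's buffer in
// one go. Useful for forcing the buffer-boundary code paths in unit tests.
class DripReader extends Reader {
    private final Reader in;
    private final int chunk;

    DripReader(Reader in, int chunk) {
        this.in = in;
        this.chunk = chunk;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // Never hand back more than `chunk` chars, regardless of `len`.
        return in.read(cbuf, off, Math.min(len, chunk));
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Feeding a parser from such a reader makes every bufferUp() a partial fill, so marks set near the end of the buffered data are quickly invalidated.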

@wx2020

wx2020 commented Jun 26, 2019

Hello jhy, when I use jsoup-1.12.1, I get the same error:

java.io.IOException: Mark invalid
	at org.jsoup.parser.CharacterReader.rewindToMark(CharacterReader.java:132)
	at org.jsoup.parser.Tokeniser.consumeCharacterReference(Tokeniser.java:182)
	at org.jsoup.parser.TokeniserState.readCharRef(TokeniserState.java:1698)
	at org.jsoup.parser.TokeniserState.access$100(TokeniserState.java:8)
	at org.jsoup.parser.TokeniserState$2.read(TokeniserState.java:36)
	at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
	at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:55)
	at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:47)
	at org.jsoup.parser.Parser.parseInput(Parser.java:35)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
	at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)
	at JsoupTest.main(JsoupTest.java:12)

This is my code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
	public static void main(String[] args) {
		String url = "https://www.zhihu.com/rss";
		try {
			Document doc = Jsoup.connect(url).get();
			Elements items = doc.getElementsByTag("item");
			Element item = items.first();
			
			String title = item.getElementsByTag("title").first().text();
			System.out.println("title = " + title);
			String link = item.getElementsByTag("link").first().text();
			System.out.println("link = " + link);
			String datetime = item.getElementsByTag("pubDate").first().text();
			System.out.println("datetime = " + datetime);
			
			Element descriptionElement = Jsoup.parse(item.getElementsByTag("description").first().text()).body();
		
			String image = null;
			try {
				image = item.getElementsByTag("image").first().text();
			} catch (NullPointerException e1) {
				image = descriptionElement.getElementsByTag("img").first().attr("src");
			} 
			System.out.println("image = " + image);

			
		} catch (IOException e) {
			
			e.printStackTrace();
		}
	}
}

However, when I use jsoup-1.11.3.jar, everything is fine.
Hope it helps you.
Best wishes, wx2020

@jhy
Owner

jhy commented Jul 3, 2019

Thanks all for your reviews on this.

@csaboka in looking at your notes and the PR -- the only local repro I can get is with your test in TokeniserTest: https://github.com/jhy/jsoup/pull/1225/files#diff-886e5a4f90143c287a5dfcf4c27964ea . The test in src/test/java/org/jsoup/parser/CharacterReaderTest.java already passes in master. Did it fail for you before your changes?

I can repro with the gist @krystiangorecki posted if loading from that URL, but not if I serve the same file from the Jetty integ test server.

I understand the bug, but there is a sensitivity in the returned read length that I am not quite following yet -- I want to make sure we get to a 100% fix here.

@csaboka I like the approach of your PR generally. I am conscious that this is a memory- and latency-optimized area, so I'm reluctant to add a bunch of new string buffers (and ongoing bufferUp() calls) if we can help it. I have a mostly working solution that doesn't add another buffer for marks: instead of invalidating the mark during bufferUp(), it moves the marked region to the front of the buffer. Because we know the largest expected mark size (the longest character reference), we can guarantee it won't fail.

@wx2020 thanks for your note. For clarity in 1.11.3 everything was not fine; the same buffer underrun happened, we just didn't throw a validation error. It just would have silently corrupted the input data.
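The mark-preserving bufferUp() that jhy describes above could be sketched roughly like this (my own simplified illustration, not jsoup's actual CharacterReader; it assumes the marked span stays shorter than the buffer, which holds if marks are bounded by the longest character reference):

```java
import java.io.IOException;
import java.io.Reader;

// Sketch of a buffered reader whose refill preserves the marked region:
// unread characters from the mark onward are shifted to the front of the
// buffer before reading more, so rewindToMark() can always succeed as long
// as the marked span is shorter than the buffer.
class MarkPreservingBuffer {
    private final Reader reader;
    private final char[] buf;
    private int pos;      // index of the next char to return
    private int length;   // number of valid chars in buf
    private int mark = -1;

    MarkPreservingBuffer(Reader reader, int bufSize) {
        this.reader = reader;
        this.buf = new char[bufSize];
    }

    void mark() { mark = pos; }

    void unmark() { mark = -1; }

    void rewindToMark() {
        if (mark == -1) throw new IllegalStateException("Mark invalid");
        pos = mark;
    }

    int read() throws IOException {
        if (pos >= length) bufferUp();
        return pos < length ? buf[pos++] : -1;
    }

    private void bufferUp() throws IOException {
        // Keep everything from the mark (or the current position) onward.
        int keepFrom = (mark != -1) ? mark : pos;
        int keep = length - keepFrom;
        System.arraycopy(buf, keepFrom, buf, 0, keep);
        if (mark != -1) mark -= keepFrom;
        pos -= keepFrom;
        length = keep;
        int read = reader.read(buf, length, buf.length - length);
        if (read > 0) length += read;
    }
}
```

The tokeniser would call unmark() once a character reference is fully consumed, so the preserved span never grows past the longest entity.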

@jhy
Copy link
Owner

jhy commented Jul 3, 2019

What would be great is a local repro that doesn't artificially limit read(). I want to verify that the solution I'm looking at still works for the code paths that expect a larger read (because a buffered input will try to fill the full requested length).

@csaboka
Contributor

csaboka commented Jul 3, 2019

The test in src/test/java/org/jsoup/parser/CharacterReaderTest.java passes already in master. Did it fail for you before your changes?

No, now that I've looked at it again, the test passes without my changes. If I change the consumeTo() line from this:
r.consumeTo("four");
to this:
assertEquals("one two three ", r.consumeTo("four"));
it fails, though. Without my changes, consumeTo() only consumes to the end of the currently buffered data, so it returns just "one". Unfortunately, it's still broken with my changes: the code still doesn't handle the case where the search text straddles a buffer boundary, so it reads until the end in that case. I don't know how to fix this properly, and I don't have time at the moment to investigate further.

I am conscious that this is a memory and latency optimized area, so am reluctant to add a bunch of new string buffers (and ongoing bufferUp() calls) if we can help it.

Buffers should only be created in the hopefully rare case of reads straddling buffer boundaries. In the real world, when you buffer up kilobytes of markup at a time, it probably wouldn't cause that much garbage.

I have a mostly working solution which doesn't add another buffer for marks, but during bufferUp() will not invalidate the mark, but will move it to the front of the buffer instead.

Fixing marking will deal with this specific exception, but it won't solve all issues with large inputs. The various consumeXxx() methods don't "see beyond" what's currently buffered, so they can fail to find things and return all the remaining buffer (not all data until the end of the stream, just what's in the buffer) if what they're looking for ends beyond the currently buffered data. The second commit of my PR tried to fix that, but apparently it's still not working for all cases. In general, it's pretty hard to consume up to some character sequence without using a temporary StringBuilder if you need to consider that you may have to read multiple "buffer-fuls" of data until you find what you're looking for.

What would be great is if we can get a local repro without modifying the read() limit artificially.

I can try coming up with a case that breaks the current code without introducing a custom Reader implementation tomorrow. It should just generate enough padding so that the "interesting" part straddles the buffer boundary.
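A padding generator along those lines can be written without a custom Reader. The sketch below places a character reference so it straddles the 32K buffer boundary (32K being the maximum buffer length mentioned later in this thread); the exact offset that triggers the bug depends on jsoup's internal buffering, so treat the numbers as an illustrative guess:

```java
// Builds an HTML document in which a character reference ("&amp;") starts
// just before the 32K buffer boundary, so entity parsing has to mark,
// refill the buffer, and rewind across the refill.
class StraddleRepro {
    static final int BUFFER = 32 * 1024; // assumed maximum buffer length

    static String buildHtml() {
        StringBuilder sb = new StringBuilder("<html><body><p>");
        // Pad so the '&' lands two chars before the boundary, leaving the
        // rest of the entity ("amp;") in the next buffer-ful.
        while (sb.length() < BUFFER - 2)
            sb.append('x');
        sb.append("&amp; tail</p></body></html>");
        return sb.toString();
    }
    // Feeding buildHtml() to the parser via a stream (rather than a String)
    // should exercise the rewindToMark() path.
}
```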

@csaboka
Contributor

csaboka commented Jul 3, 2019

because of using a buffered input, which will try to fill the larger length

Hmm, apparently it's not safe to assume that BufferedReader will always try to fill the buffer completely. This is what the Javadoc says:

This iterated read continues until one of the following conditions becomes true:

The specified number of characters have been read,
The read method of the underlying stream returns -1, indicating end-of-file, or
The ready method of the underlying stream returns false, indicating that further input requests would block.

So if you're feeding jsoup directly from an HTTP response stream, and the server is taking its time sending the next chunk, the reads will only report what's already available without blocking and you may only get a few bytes' worth of data in bufferUp() even if you have a generous buffer.

@csaboka
Contributor

csaboka commented Jul 8, 2019

@jhy : This has taken me a bit longer than originally expected, but I could spend some time on understanding CharacterReader again today. I have to correct myself: I don't think it's possible for marking (and therefore entity parsing) to fail with an in-memory reader, unless you have an absurdly long entity, longer than the maximum buffer length of 32K.

The mark() method does a very eager bufferUp() that always tries to fill the remaining buffer, i.e. it should normally have 32K characters of look-ahead. The only way this could fail is if the read() call invoked by bufferUp() doesn't fill the available buffer even though EOF hasn't been reached. In other words, BufferedReader must have stopped its loop because it saw that the underlying reader was not ready. This could be worked around by calling read() in a loop until the buffer is completely filled or EOF is reported.
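That read-in-a-loop workaround would look roughly like this generic helper (my own sketch, not jsoup's code):

```java
import java.io.IOException;
import java.io.Reader;

final class ReaderUtil {
    // Keeps calling read() until `len` chars are buffered or EOF is hit,
    // so a short read (e.g. a BufferedReader that stops because the
    // underlying stream isn't ready) can't leave the buffer partially
    // filled. Returns the number of characters actually read.
    static int readFully(Reader in, char[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n == -1)
                break; // EOF: return whatever was read so far
            total += n;
        }
        return total;
    }
}
```

The trade-off is that this blocks until the buffer is full, giving up the "return what's available without blocking" behavior that BufferedReader provides.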

While investigating the callers of various CharacterReader methods, I've also found an issue where a very long bogus comment could be misparsed. I've created a new branch that contains only new unit tests failing with the latest master (the two tests included in my previous PR plus the one for bogus comments).

I believe this issue could be resolved by switching from a single read() call to calling read() in a loop, and changing the BogusComment tokeniser state to handle underruns. I'll try to make those fixes later and add them to the new branch. If it works out, it would require just one unintrusive change instead of the bunch of changes required by my previous PR.

@csaboka
Contributor

csaboka commented Jul 16, 2019

Opened #1242 for another stab at fixing this issue. It's built upon the branch I mentioned in my previous comment, the one that adds unit tests demonstrating some other edge case failures.

spypunk added a commit to spypunk/sponge that referenced this issue Dec 11, 2019
JesusMtnez added a commit to JesusMtnez/dotfiles that referenced this issue Jan 1, 2020
Using Jsoup.connect and then Jsoup.parse to avoid issue
jhy/jsoup#1218
@sitetester

I'm facing a similar issue. It can't parse the verbs on this page: https://cooljugator.com/en/list/10
This is just an example link; it's happening for all similar links.

https://cooljugator.com/en/list/all is really big to load :)
Any ideas?

@sitetester

Just to help someone:

Now I'm able to parse the big HTML page (e.g. url = "https://cooljugator.com/en/list/all") using:

import scala.io.Source

val urlSource = Source.fromURL(url)
val dirtyHtml = urlSource.mkString
// HtmlCleanerHelper is a project-local wrapper around an HTML cleaner
val html = HtmlCleanerHelper.cleanHtml(dirtyHtml)
val xmlElement = scala.xml.XML.loadString(html)

val divs = xmlElement \\ "div"
divs.foreach(div => {
  val cls = div.attribute("class").getOrElse("").toString.trim
  if (cls == "ui segment stacked") {
    val aa = div \\ "a"
    println(aa.size)
  }
})

This prints 29402. I need to optimize it to load divs by class attribute only, but for now it works.

@jhy jhy closed this as completed in a0b87bf Jan 21, 2020
@jhy jhy self-assigned this Jan 21, 2020
@jhy jhy added the bug Confirmed bug that we should fix label Jan 21, 2020
@jhy jhy added this to the 1.12.2 milestone Jan 21, 2020
@jhy
Owner

jhy commented Jan 21, 2020

Hi, I believe this is fixed now with a0b87bf and will be available in 1.12.2.

Thanks to everyone who has helped on this with extra details and code, and my apologies for the time taken in getting here.

The fix I implemented works by making sure enough content has been read into the buffer to survive any mark resets. Marks are only used in HTML entity decoding, so the minimum buffered size is larger than the maximum entity size.

Other reads are safe because the Tokeniser will call bufferUp() in time before reading from the local charBuffer, and they can always make forward progress without needing to rewind more than 1 char.

If you see other MarkInvalids, please open an issue.

@btheu

btheu commented Jan 22, 2020

Hi, the 1.12.2 release is great news.

I have encountered an issue with the InputStream parser. It happens with an okhttp3 Response InputStream on a specific website: an HTML comment is misinterpreted, causing HTML nodes to be skipped by node.select().

I wanted to let you know about it; I plan to open a dedicated issue with a minimal test case showing the problem.

@jhy
Owner

jhy commented Jan 26, 2020

@btheu please go ahead and open that. I would try building jsoup from HEAD now with this fix to see if that was related.

@jhy
Owner

jhy commented Feb 9, 2020

@btheu and others watching, jsoup 1.12.2 is available now. https://jsoup.org/news/release-1.12.2

LucestDail added a commit to LucestDail/srtest that referenced this issue Aug 14, 2022
Changed the URL call method due to jsoup issue (jhy/jsoup#1218)

7 participants