Websites with large amounts of data fail to parse. #1218
Comments
|
Hi, thanks for the report. Can you confirm which version of jsoup you're using? It works in 1.12.1 on try.jsoup: http://try.jsoup.org/~MnrEgz1lWYn4NMN2ObClDfXz7D8. There might be a slight code-path difference there that I can dig into, but there were some fixes around this in 1.12.1 over the previous version.
|
@jhy Using version 1.12.1, downloaded hours ago. |
|
Same here, version 1.12.1. works every time, but |
|
I suspect it happens when an entity lies on the boundary of a buffer read, but not always.
|
@krystiangorecki : Your suspicion is correct. I could reproduce this reliably in unit tests, by using a custom Reader that only returns a very small amount of data for each read() call. This simulates real readers that can return a similarly small amount in edge cases. Opened #1225 for my proposed fix. |
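For anyone trying to reproduce this, a custom Reader along these lines forces the smallest possible reads. This is an illustrative sketch, not the actual test code from #1225; the class name is invented:

```java
import java.io.Reader;

// Hypothetical reproduction helper (not the actual test code from #1225):
// a Reader that returns at most one character per read() call, simulating
// an underlying stream that delivers data in the smallest legal chunks.
class OneCharAtATimeReader extends Reader {
    private final String data;
    private int pos = 0;

    OneCharAtATimeReader(String data) { this.data = data; }

    @Override
    public int read(char[] cbuf, int off, int len) {
        if (len == 0) return 0;
        if (pos >= data.length()) return -1; // EOF
        cbuf[off] = data.charAt(pos++);
        return 1; // deliberately ignore the larger requested length
    }

    @Override
    public void close() {}
}
```

Wrapping a document in such a reader makes every entity, comment, and tag straddle a "buffer boundary", which is exactly the edge case that real streams only hit occasionally.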
|
Hello @jhy, when I use jsoup-1.12.1 I get the same error. This is my code: However, when I use jsoup-1.11.3.jar, everything is fine.
|
Thanks all for your reviews on this.

@csaboka, looking at your notes and the PR: the only local repro I can get is with your test in TokeniserTest (https://github.com/jhy/jsoup/pull/1225/files#diff-886e5a4f90143c287a5dfcf4c27964ea). The test in src/test/java/org/jsoup/parser/CharacterReaderTest.java already passes in master. Did it fail for you before your changes? I can repro with the gist @krystiangorecki posted when loading from that URL, but not when I serve the same file from the Jetty integration test server. I understand the bug, but there is a sensitivity in the reader's returned length that I'm not quite following yet; I want to make sure we get a 100% fix here.

@csaboka, I generally like the approach of your PR. I'm conscious that this is a memory- and latency-optimized area, so I'm reluctant to add a bunch of new string buffers (and ongoing bufferUp() calls) if we can help it. I have a mostly working solution that doesn't add another buffer for marks: instead of invalidating the mark during bufferUp(), it moves the marked content to the front of the buffer. Because we know the largest expected mark size (the largest character reference), we can ensure it won't fail.

@wx2020, thanks for your note. For clarity: in 1.11.3 everything was not fine. The same buffer underrun happened; we just didn't throw a validation error, and it would have silently corrupted the input data.
|
What would be great is a local repro that doesn't artificially modify the read() limit. I want to verify that the solution I'm looking at still works for callers that expect a larger read (because a buffered input will normally try to fill the full requested length).
No, now that I've looked at it again, the test passes without my changes. If I change the consumeTo() line from this:
Buffers should only be created in the hopefully rare case of reads straddling buffer boundaries. In the real world, when you buffer up kilobytes of markup at a time, it probably wouldn't cause that much garbage.
Fixing marking will deal with this specific exception, it won't solve all issues with large inputs. The various consumeXxx() methods don't "see beyond" what's currently buffered, so they could fail to find things and return all the remaining buffer (not all data until the end of the stream, just what's in the buffer) if what they're looking for ends beyond the currently buffered data. The second commit of my PR tried to fix that, but apparently it's still not working for all cases. In general, it's pretty hard to consume up to some character sequence without using a temporary StringBuilder if you need to consider that you may have to read multiple "buffer-fuls" of data until you find what you're looking for.
I can try coming up with a case that breaks the current code without introducing a custom Reader implementation tomorrow. It should just generate enough padding so that the "interesting" part straddles the buffer boundary. |
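To illustrate the approach discussed above, here is a minimal sketch of consuming up to a delimiter with a temporary StringBuilder that survives buffer refills. The names are invented and this is not jsoup's actual CharacterReader code, just the general technique under a deliberately tiny buffer:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Illustrative sketch only (invented names, not jsoup's CharacterReader):
// consume characters up to a delimiter, refilling a small buffer as
// needed. A StringBuilder carries the partial result across refills, so
// the match is found even when the delimiter lies beyond the data that is
// currently buffered.
class StreamingConsumer {
    private final Reader reader;
    private final char[] buf = new char[4]; // tiny, to force frequent refills
    private int len = 0, pos = 0;

    StreamingConsumer(Reader reader) { this.reader = reader; }

    private boolean bufferUp() {
        if (pos < len) return true; // unread data still buffered
        try {
            len = reader.read(buf, 0, buf.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pos = 0;
        return len > 0;
    }

    // Returns everything before 'delim' (or everything before EOF if the
    // delimiter is never found).
    String consumeTo(char delim) {
        StringBuilder accum = new StringBuilder();
        while (bufferUp()) {
            while (pos < len) {
                if (buf[pos] == delim) return accum.toString();
                accum.append(buf[pos++]);
            }
        }
        return accum.toString();
    }
}
```

The cost being debated in the thread is visible here: the StringBuilder allocation happens on every call, even when the delimiter is already in the buffer and a simple substring would have sufficed.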
Hmm, apparently it's not safe to assume that BufferedReader will always try to fill the buffer completely. The Javadoc says: "This iterated read continues until one of the following conditions becomes true: The specified number of characters have been read, …" So if you're feeding jsoup directly from an HTTP response stream and the server is taking its time sending the next chunk, reads will only report what's already available without blocking, and you may get only a few bytes' worth of data in bufferUp() even with a generous buffer.
|
@jhy: This has taken a bit longer than I originally expected, but I could spend some time today on understanding CharacterReader again.

I have to correct myself: I don't think it's possible for marking (and therefore entity parsing) to fail with an in-memory reader, unless you have an absurdly long entity, longer than the maximum buffer length of 32K. The mark() method does a very eager bufferUp() that always tries to fill the remaining buffer, so normally it has 32K characters of look-ahead. The only way this could fail is if the read() call invoked by bufferUp() doesn't fill the available buffer even though it's not at EOF yet; in other words, BufferedReader must have stopped its loop because it saw that the underlying reader was not ready. This could be worked around by calling read() in a loop until the buffer is completely filled or EOF is reported.

While investigating the callers of various CharacterReader methods, I've also found an issue where a very long bogus comment could be misparsed. I've created a new branch containing only new unit tests that fail against the latest master (the two tests included in my previous PR, plus one for bogus comments). I believe this issue could be resolved by switching from a single read() call to calling read() in a loop, and changing the BogusComment tokeniser state to handle underruns. I'll try to make those fixes later and add them to the new branch. If it works out, it would require just one unintrusive change instead of the bunch of changes in my previous PR.
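The read()-in-a-loop workaround mentioned above might look like this. The helper name is invented and this is not jsoup's actual code, just a sketch of the technique:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Sketch of the workaround described above (invented helper, not jsoup's
// actual code): keep calling read() until the requested range is filled
// or EOF is reached, instead of trusting a single read() to fill it.
final class ReaderUtil {
    static int readFully(Reader r, char[] buf, int off, int len) {
        int total = 0;
        try {
            while (total < len) {
                int n = r.read(buf, off + total, len - total);
                if (n == -1) break; // EOF: return whatever was read
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The trade-off is latency: this loop blocks until the buffer is full or the stream ends, so a slow server would stall parsing that could otherwise have started on the partial data.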
|
Opened #1242 for another stab at fixing this issue. It's built upon the branch I mentioned in my previous comment, the one that adds unit tests demonstrating some other edge case failures. |
|
I'm facing a similar issue. It can't parse the verbs on this page: https://cooljugator.com/en/list/10. And https://cooljugator.com/en/list/all is really big to load :)
|
Just to help someone: I'm now able to parse a big HTML page using the following, which prints 29402. I need to optimize this to load the div by its class attribute only, but for now it works.
|
Hi, I believe this is fixed now with a0b87bf and will be available in 1.12.2. Thanks to everyone who has helped on this with extra details and code, and my apologies for the time it took to get here. The fix I implemented works by making sure enough content has been read into the buffer to survive any mark resets. Marks are only used in HTML entity decoding, so the minimum size is larger than the maximum entity size. Other reads are safe because the Tokeniser calls bufferUp() in time before reading from the local charBuffer, and it can always make forward progress without needing to rewind more than one char. If you see other MarkInvalid errors, please open an issue.
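For readers following along, the idea behind the fix (preserving the marked region across bufferUp() by shifting it to the front of the buffer) can be sketched like this. The names and the deliberately tiny buffer are invented for illustration; see commit a0b87bf for jsoup's real implementation:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Sketch of the fix's idea only (invented names, tiny buffer; see commit
// a0b87bf for the real code): instead of invalidating the mark when
// refilling, shift everything from the mark onward to the front of the
// buffer, so a later reset to the mark stays valid.
class MarkedBuffer {
    private final char[] buf = new char[8];
    private int len = 0, pos = 0, mark = -1;

    void mark() { mark = pos; }

    void resetToMark() {
        if (mark == -1) throw new IllegalStateException("mark invalid");
        pos = mark;
    }

    void bufferUp(Reader source) {
        // Keep everything from the mark (or the current position) onward.
        int keepFrom = (mark >= 0) ? mark : pos;
        int kept = len - keepFrom;
        System.arraycopy(buf, keepFrom, buf, 0, kept);
        pos -= keepFrom;
        if (mark >= 0) mark = 0;
        try {
            int n = source.read(buf, kept, buf.length - kept);
            len = kept + Math.max(n, 0); // n == -1 at EOF
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    char consume() { return buf[pos++]; }
}
```

This only works because the kept region is bounded: marks are set solely while decoding entities, so the buffer just needs to be larger than the longest character reference.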
|
Hi, the 1.12.2 release is great news. I have encountered an issue with the InputStream parser. It happens with an okhttp3 Response InputStream on a specific website: an HTML comment is misinterpreted, causing HTML nodes to be skipped by node.select(). I'm letting you know about it here; I plan to open a dedicated issue with a minimal test case demonstrating it.
|
@btheu please go ahead and open that. I'd also suggest building jsoup from HEAD, now that this fix is in, to see if it's related.
|
@btheu and others watching, jsoup 1.12.2 is available now. https://jsoup.org/news/release-1.12.2 |

Currently using jsoup on some large websites, and it throws the Mark Invalid exception, which means the bufref is negative?
I tried using both Jsoup.connect(url).get()
and Jsoup.connect(url).execute().parse()
Both cause the same exception.