Bug in CDATA parsing #7
Comments
I submitted a patch to Guile Scheme with the failing tests inspected, verified and fixed. https://lists.gnu.org/archive/html/guile-devel/2020-01/msg00081.html |
Oleg applied my patch in the SSAX.scm on his site: http://okmij.org/ftp/Scheme/lib/SSAX.scm. |
Many thanks for bringing this to my attention! I added your test case, and it failed, as expected. I then made the change that you suggested, and while it now passes that test, it fails three others. I'm wondering if you have time to comment on them. Here's the raw text of the fail window:
The third of these is pretty straightforward; it appears that this is testing the behavior of CDATA within CDATA, but along the way, it indirectly checks that &gt; is translated to >. I'm guessing that this test case should just be altered to expect &gt; in the output (or altered to put a > in the input). Aaaand, actually, the second one looks just the same, with the solution being to alter the test. I'm not sure about the first one, though.... okay, actually, I think I understand; this test is to ensure that an ampersand that's not part of an entity description doesn't get parsed incorrectly. But now, we've scrubbed the special handling of ampersand altogether; no more entities, so no more need to check for whether an ampersand is standing on its own, and the whole thing just gets parsed as a string. Can you please confirm that my interpretations of these tests are correct? If so, it's easy for me to update the corresponding test cases and update the code. Thanks again for following up on this. |
Wow... I took a crack at making these test failures go away, and there are some really hairy tests in cdata-parsing; working out the correct results for these input strings is not something I currently have the expertise to do. Can you speak to the correct result of
(and, um, no fair just running the code with your patch and seeing what it produces...) |
I should also mention that I went looking for an updated upstream version of these tests, but I couldn't find it anywhere on Oleg's page. |
Oh! never mind. Actually, the tests are wedged in there along with all the code. Got it! |
yeah, the patch Oleg applied contains my updated tests. I should have mentioned that! the code was the easy part, especially since guile used an ugly hack to be able to re-use the internal tests by Oleg, which meant I got no line numbers or any other info for the failed tests :) |
#7 This change comes from Linus Björnstam, who argues that the treatment of entities within CDATA used by ssax is, if not incorrect, nonstandard. More specifically, he suggests that the XML standard is ambiguous, but that other common libraries do not parse e.g. &gt; into > within CDATA strings.
closed by 363946c |
Many thanks! |
thanks! The code comment for the changed procedure is no longer correct though. |
I just deleted the last two lines of that comment; does that accurately summarize the change? |
Yup. Thanks!
…--
Linus Björnstam
On Sat, 4 Apr 2020, at 21:10, John Clements wrote:
I just deleted the last two lines of that comment; does that accurately
summarize the change?
|
Hi there!
RhodiumToad over in #guile on freenode discovered a bug in the upstream SSAX library where &gt; was converted to > inside unparsed character data blocks (<![CDATA[...]]>). This is not according to spec.
The bug happens in ssax:read-cdata-body. It can be fixed for that specific case by removing #\& from the delimiters list and completely removing the handling of anything starting with #\&.
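To make the intended behaviour concrete, here is a minimal sketch in Python rather than Scheme. It is not the SSAX code: the function name and signature are hypothetical and only borrow the procedure's name. It assumes the only special cases inside a CDATA section are the closing ]]> and line-ending normalization, so an ampersand is just another character.

```python
from typing import Tuple


def read_cdata_body(text: str, start: int = 0) -> Tuple[str, int]:
    """Read CDATA content beginning at `start` (just after '<![CDATA[').

    Returns the body and the index just past the closing ']]>'.
    Hypothetical helper for illustration; not part of SSAX.
    """
    end = text.find("]]>", start)
    if end == -1:
        raise ValueError("unterminated CDATA section")
    body = text[start:end]
    # Line endings are normalized: CRLF and a lone CR both become LF.
    body = body.replace("\r\n", "\n").replace("\r", "\n")
    # No '&' handling at all: '&gt;' stays as the five characters '&gt;'.
    return body, end + 3


body, pos = read_cdata_body("a &gt; b\r\n]]>rest")
print(repr(body))
```

The point of the sketch is what is absent: with the ampersand no longer acting as a delimiter, there is no branch left in which an entity reference could be expanded.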
A test for the bug:
Inside a CDATA, there are no special character sequences, and the correct output should be:
I did a quick fix, and have not run any further tests except for REPLing around to make sure it doesn't destroy regular PCDATA parsing. The function ssax:read-cdata-body now looks like this:
I see how the spec could be misread, but trying out some other parsers I haven't been able to find anyone that does a similar thing. Example:
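As one independent illustration of this point (not the author's original example), the two XML parsers in Python's standard library can be checked the same way. The sample document and the helper names below are made up for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Sample input (an assumption for this sketch): entity-like text inside CDATA.
DOC = "<a><![CDATA[&gt; &amp; <b>]]></a>"


def cdata_text_etree(doc: str) -> str:
    """Text content of the root element as seen by ElementTree."""
    return ET.fromstring(doc).text


def cdata_text_minidom(doc: str) -> str:
    """Text content of the root element as seen by minidom."""
    root = minidom.parseString(doc).documentElement
    return "".join(node.data for node in root.childNodes)


print(cdata_text_etree(DOC))
print(cdata_text_minidom(DOC))
```

Both helpers return the literal text &gt; &amp; <b>, whereas the same reference in ordinary character data, as in <a>&gt;</a>, is expanded to >.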
Best regards
Linus Björnstam