Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in CDATA parsing #7

Closed
bjoli opened this issue Jan 16, 2020 · 12 comments
Closed

Bug in CDATA parsing #7

bjoli opened this issue Jan 16, 2020 · 12 comments

Comments

@bjoli
Copy link

bjoli commented Jan 16, 2020

Hi there!

RhodiumToad over in #guile on freenode discovered a bug in the upstream SSAX library where &gt; was converted to > inside unparsed character data blocks (<!CDATA[...]]>). This is not according to spec.

The bug happens in ssax:read-cdata-body. It can be fixed for that specific case by removing #& from the delimiters list and completely removing the handling of anything starting with #&.

A test for the bug:

(ssax:xml->sxml (open-input-string "<e><![CDATA[&gt;]]></e>") '())
;; => (*TOP* (e ">"))

Inside a CDATA, there are no special character sequences, and the correct output should be:

(*TOP* (e "&gt;"))

I did a quick fix, and have not run any further tests expect for REPLing around to make sure it doesn't destroy regular PCDATA parsing. The function ssax:read-cdata-body now looks like this:

(define ssax:read-cdata-body 
  (let ((cdata-delimiters (list char-return #\newline #\])))
    (lambda (port str-handler seed)
      (let loop ((seed seed))
	(let ((fragment (next-token '() cdata-delimiters
				    "reading CDATA" port)))
			; that is, we're reading the char after the 'fragment'
     (case (read-char port)	
       ((#\newline) (loop (str-handler fragment nl seed)))
       ((#\])
	(if (not (eqv? (peek-char port) #\]))
	    (loop (str-handler fragment "]" seed))
	    (let check-after-second-braket
		((seed (if (string-null? fragment) seed
			   (str-handler fragment "" seed))))
	      (case (peek-next-char port)	; after the second bracket
		((#\>) (read-char port)	seed)	; we have read "]]>"
		((#\]) (check-after-second-braket
			(str-handler "]" "" seed)))
		(else (loop (str-handler "]]" "" seed)))))))
       (else		; Must be CR: if the next char is #\newline, skip it
         (when (eqv? (peek-char port) #\newline) (read-char port))
         (loop (str-handler fragment nl seed)))))))))

I see how the spec could be misread, but trying out some other parsers I haven't been able to find anyone that does a similar thing. Example:

Python 2.7.17 (default, Dec 23 2019, 21:25:33)
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring("<e><![CDATA[&gt;]]></e>")
>>> root.text
'&gt;'

Best regards
Linus Björnstam

@bjoli
Copy link
Author

bjoli commented Jan 16, 2020

I submitted patch to guile scheme with the failing tests inspected, verified and fixed.

https://lists.gnu.org/archive/html/guile-devel/2020-01/msg00081.html

@bjoli
Copy link
Author

bjoli commented Mar 30, 2020

Oleg applied my patch in the SSAX.scm on his site: http://okmij.org/ftp/Scheme/lib/SSAX.scm.

@jbclements
Copy link
Owner

Many thanks for bringing this to my attention!

I added your test case, and it failed, as expected. I then made the change that you suggested, and while it now passes that test, it fails three others. I'm wondering if you have time to comment on them.

Here's the raw text of the fail window:

--------------------
SSAX > cdata body parsing
. FAILURE
name:       test
location:   ssax-tests.rkt:120:4
actual:     '("abc&!!!" "")
expected:   '("abc" "&" "" "" "!!!" "")
--------------------
--------------------
SSAX > simple parser
. FAILURE
name:       test
location:   ssax-tests.rkt:510:4
params:
  '(" <P><![CDATA[<BR>\n<![CDATA[<BR>]]&gt;]]></P>"
    #<procedure:dummy-doctype-fn>
    ((P "<BR>" "\n" "<![CDATA[<BR>" "]]" "" ">")))
--------------------
--------------------
SSAX > ssax:xml->sxml
. FAILURE
name:       test
location:   ssax-tests.rkt:709:4
params:
  '(" <P><![CDATA[<BR>\n<![CDATA[<BR>]]&gt;]]></P>"
    ()
    (*TOP* (P "<BR>\n<![CDATA[<BR>]]>")))
--------------------
7 success(es) 3 failure(s) 0 error(s) 10 test(s) run

The third of these is pretty straightforward; it appears that this is testing the behavior of CDATA within CDATA, but along the way, it indirectly checks that > is translated to >. I'm guessing that this test case should just be altered to expect > in the output (or altered to put a > in the input).

Aaaand, actually, the second one looks just the same, with the solution being to alter the test.

I'm not sure about the first one, though.... okay, actually, I think I understand; this test is to ensure that an ampersand that's not part of an entity description doesn't get parsed incorrectly. But now, we've scrubbed the special handling of ampersand altogether; no more entities, so no more need to check for whether an ampersand is standing on its own, and the whole thing just gets parsed as a string.

Can you please confirm that my interpretations of these tests are correct? If so, it's easy for me to update the corresponding test cases and update the code.

Thanks again for following up on this.

@jbclements
Copy link
Owner

Wow... I took a crack at making these test failures go away, and there are some really hairy tests in cdata-parsing; determining the correct results for these input strings is not something I currently have the expertise to determine. Can you speak to the correct result of

#lang racket

(require sxml/ssax/SSAX-code)

(define cons* list*)
(define (consumer fragment foll-fragment seed)
    (cons* (if (equal? foll-fragment (string #\newline))
               " NL" foll-fragment) fragment seed))
(define (local-test str)
    (reverse
     (let ([port (open-input-string str)])
       (ssax:read-cdata-body port consumer '()))))
;; what should the expected output be?
(local-test "abc]]&gt;&gt&amp;]]]&gt;and]]>")
;; what should the expected output be?
(local-test "abc]]>;>&]]]>and]]>")

(and, um, no fair just running the code with your patch and seeing what it produces...)

@jbclements
Copy link
Owner

I should also mention that I went looking for an updated upstream version of these tests, but I couldn't find it anywhere on Oleg's page.

@jbclements
Copy link
Owner

Oh! never mind. Actually, the tests are wedged in there along with all the code. Got it!

@bjoli
Copy link
Author

bjoli commented Mar 31, 2020

yeah, the patch Oleg applied contains my updated tests. I should have mentioned that! the code was the easy part, especially since guile used an ugly hack to be able to re-use the internal tests by Oleg, which meant I got no line numbers or any other info for the failed tests :)

jbclements added a commit that referenced this issue Apr 3, 2020
#7
This change comes from Linus Björnstam, who argues
that the treatment of entities within CDATA used by
ssax is, if not incorrect, nonstandard. More spec-
ifically, he suggests that the XML standard is
ambiguous, but that other common libraries do not
parse e.g. &gt; into > within CDATA strings.
@jbclements
Copy link
Owner

closed by 363946c

@jbclements
Copy link
Owner

Many thanks!

@bjoli
Copy link
Author

bjoli commented Apr 4, 2020

thanks!

The code comment for the changed procedure is no longer correct though.

@jbclements
Copy link
Owner

I just deleted the last two lines of that comment; does that accurately summarize the change?

@bjoli
Copy link
Author

bjoli commented Apr 5, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants