encoding/xml: Parser accepts invalid prolog, parser.text may accept disallowed characters #1259
versus revision 6725 and as early as the 2010-11-02 release.

What steps will reproduce the problem?
1. Create an xml.Parser on a document that has disallowed characters between the start of the document and the root element, or between the start of the document and the XML declaration, or between the start of the document and the DOCTYPE declaration.
2. Exhaust the parser with _, err := parser.Token() until err == os.EOF.

What is the expected output?
The first call to Token() should return an error of some sort about the disallowed characters, and the parser should not parse any further.

What do you see instead?
xml.Parser returns tokens all the way to EOF. The disallowed characters come back as the first token, of type xml.CharData, followed by the rest of the document.

Which compiler are you using (5g, 6g, 8g, gccgo)?
6g

Which operating system are you using?
darwin (uname -a: Darwin host-elided.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386)

Which revision are you using? (hg identify)
68aae563fd33+ tip

Please provide any additional information below.
I am attaching as small a Go program as I can think of, with examples inside it. This comes from a conversation with Russ Cox on golang-nuts ( http://groups.google.com/group/golang-nuts/browse_thread/thread/ddabf01fdbe57c9f# ) about xml.Parser and the results of running it against the XML Conformance Suite.

I believe that there are two specific difficulties:

First, the initial call to RawToken() does not know that we are at the beginning of the document, before the prolog. At line 430, we check whether the first byte read is '<'; if it is not, we call p.text and create a CharData with the results. This if block would need some state to handle the before-prolog case (and whatever detects that we are no longer before the prolog would need to update that state...). Some character sequences are permissible at this location, provided there isn't an XML declaration (the DOCTYPE and the root element can have whitespace, comments and processing instructions before them, it seems).

Second, parser.text() accepts at least some byte sequences that we don't think it should (my attached example has a single byte 0x12 at the beginning of the document, which isn't in the XML Character Range). I haven't analyzed this any further.

Respectfully submitted,
Nigel Kerr
This issue was closed.