repair attribute mis-interpretation in ElementTreeContentHandler #106

Closed
wants to merge 1 commit into
from

Projects

None yet

2 participants

Contributor
zzzeek commented Mar 29, 2013

regarding https://bugs.launchpad.net/lxml/+bug/1136509, this is a proposed fix for the issue.

The first part of the fix just rewrites the attributes in startElement to have keys of the form (namespace, key).

At first, i set namespace to None, but I had a problem with that. It appears that even namespaced attributes like the "xmlns:xsi" in my test document, is also passed to startElement, I suppose because the owning tag doesn't have a namespace. So in this case I'm splitting on the colon and passing in the two tokens to startElementNS, but I'm not sure if this approach is correct. In any case, I added two tests, if you can show what should happen in the tests at least that would make the correct behavior apparent here.

@zzzeek zzzeek - repair issue in sax.ElementTreeContentHandler
whereby attributes passed to startElement() would be mis-interpreted
as containing a namespace attribute, leading to a TypeError,
as well as where attributes with namespaces wouldn't be split
up correctly when passed to startElement().
3222e75
Owner
scoder commented Mar 29, 2013

My guess is that you get prefixed attributes because you are using a namespace unaware parser. You should parse documents that use prefixes with a properly configured parser that processes namespaces.

Contributor
zzzeek commented Mar 29, 2013

The ticket mentioned has the original test. It uses only lxml's parser, and as illustrated the XML is valid:

document = """<?xml version="1.0" encoding="utf-8"?>
<SomeDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Data FooBar="123">
    </Data>
</SomeDocument>

"""

from lxml import sax, etree
from xml.sax import parse
from StringIO import StringIO

# syntax is valid, etree parses it
lxml_parsed = etree.parse(StringIO(document))

parse(StringIO(document), sax.ElementTreeContentHandler())
Contributor
zzzeek commented Mar 29, 2013

as far as the namespace URI (which I suspected is what should be here), I'd need guidance on how to do that translation, if it's needed at all.

Contributor
zzzeek commented Mar 29, 2013

I should note, with the above test, lxml's parser sends that first node including the "xmlns:xsi" attribute name into startElement, not startElementNS. So what to do with the attribute named "xmlns:xsi" ?

Owner
scoder commented Mar 30, 2013

I'd just ignore it - and definitely not special case it. lxml will raise an error when the code tries to use it as a tag/attribute name, if I'm not mistaken, which I think is the right thing to do here. I can't see a case where this would come up, except for parsing a namespaced document with a namespace unaware parser. And that's a bug on the user side. Failing early helps them in fixing it by enabling namespace support on the sax parser.

http://docs.python.org/2/library/xml.sax.handler.html#xml.sax.handler.feature_namespaces

So, remove the special case, change the second test to test for an error, and I think that's all there is to it.

Does that fix your original problem, though?

Contributor
zzzeek commented Mar 30, 2013

so you want it to split "xmlns:xsi" on the colon, then discard "xmlns" and send the attribute as (None, xsi) for the key?

my problem doesn't apply in this case since I'm no longer using the sax parser for my particular problem.

Owner
scoder commented Mar 30, 2013

No - no splitting at all, just let it fail. "xmlns:xsi" is not a valid tag name or attribute name, so lxml will reject it with an appropriate error message.

Contributor
zzzeek commented Mar 30, 2013

oh...how come lxml's etree parser doesn't reject it? separate issue ?

Contributor
zzzeek commented Mar 30, 2013

can you clarify "namespace unaware parser" - is "xml.sax.parse" in the python standard lib namespace unaware? Because yes, it doesn't seem to handle documents that were spit out by Eclipse's XML generator which I'm assuming produces valid XML.

Also, in the time between I posted my original bug report and when I got a response, I stopped noticing that my test is using "xml.sax.parse", i.e. the Python stdlib parser, and not an lxml tool, so I have re-noted that.

Owner
scoder commented Mar 30, 2013

You have to explicitly enable namespace aware parsing in xml.sax (see the link I posted). Otherwise, it will not consider namespace declarations anything special and return them as regular attributes. In that case, an error is the appropriate response of lxml, because lxml is always namespace aware, so the user is doing something "stupid". (And I just tried it, creating an attribute "xmlns:test" gives a ValueError in lxml.)

Contributor
zzzeek commented Mar 30, 2013

Yeah, the python lib docs here are very lacking, because the docs for parse() don't talk about any of this. As far as I can tell, the API that actually works is entirely undocumented, as there are no python docs for ExpatParser, the "expatreader" module, or the "namespaceHandling" flag:

from xml.sax.expatreader import ExpatParser
parser = ExpatParser(namespaceHandling=1)
parser.setContentHandler(sax.ElementTreeContentHandler())
parser.parse(StringIO(document))

If you set that flag to 0, then you're getting the behavior of xml.sax.parse(). Unless I'm missing something, this is a pretty major omission in Python's standard docs. If I knew about this API in the first place I would not have encountered the lxml bug at all since all the XML I deal with has namespaces.

@zzzeek zzzeek closed this Mar 30, 2013
Owner
scoder commented Mar 31, 2013

I guess the docs just expect you to know how the SAX interface works. They were written at a time where SAX was a well accepted and widely used interface. The way to enable namespace aware parsing is to call setFeature().

http://docs.python.org/2/library/xml.sax.reader.html#xml.sax.xmlreader.XMLReader.setFeature

Here's an example:

http://www.xml.com/pub/a/2003/03/10/python.html

(BTW, that example was easy to find by asking duckduckgo for "python sax example namespaces" ...)

I'm sure the Python core developers will be very happy about any improvement to the documentation here. Especially a working example would seriously help.

Contributor
zzzeek commented Mar 31, 2013

well, here's another problem with that "we assume everyone knows" approach. I did nothing but sax parsing with Java from like 1999 to 2001, I haven't poked into my old zipfiles of that code but it's very likely there's some setFeature() calls in there.

So perhaps they should also consider that 12 years later, perhaps that memory has been replaced many times over by dozens of new APIs in the interim....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment