New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix attribute misquoting #219

Merged
merged 4 commits into from Jan 8, 2017

Conversation

Projects
None yet
2 participants
@plq
Copy link
Contributor

plq commented Dec 27, 2016

Behold:

>>> from lxml import html
>>> from lxml.etree import htmlfile
>>> from lxml.html.builder import E
>>> from StringIO import StringIO
>>> out = StringIO()
>>> with htmlfile(out) as f:
... with f.element("tagname", attrib={"attr": '"misquoted"'}):
... f.write("foo")
...
>>> out.getvalue()
'<tagname attr=""misquoted"">foo</tagname>'

Expected output:

'<tagname attr="&quot;misquoted&quot;">foo</tagname>'

https://bugs.launchpad.net/lxml/+bug/1594155

Not fixed yet but we're closer than we were yesterday :)

@plq

This comment has been minimized.

Copy link
Contributor Author

plq commented Dec 27, 2016

So I did some research on this. This bug exists because here:

tree.xmlOutputBufferWriteEscape(self._c_out, _xcstr(value), NULL)

the wrong string output function is used. xmlOutputBufferWriteEscape is for writing text into tag bodies, not attribute values.

I'm inclined to say that the function that we rather need to use is xmlBufAttrSerializeTxtContent, here:

https://git.gnome.org/browse/libxml2/tree/xmlsave.c?id=229d1f93ceaa5e29a1e8455e73a7f346993d20b3#n2039

As can be seen, this is a non-trivial function so I'd be reluctant to copy it to lxml source.

However, it takes an xmlAttrPtr, which we don't seem to have at that point in lxml source code.

I'm not sure how to proceed from here. Any ideas?

@plq

This comment has been minimized.

Copy link
Contributor Author

plq commented Jan 6, 2017

We can't use xmlEscapeEntities from xmlsave.c because 1) it's static 2) it doesn't escape anything between 0x20 and 0x80, where '"' is 0x22.

@plq

This comment has been minimized.

Copy link
Contributor Author

plq commented Jan 6, 2017

Okay, I bit the bullet and ported the whole thing. I hope I did it the right way!

@scoder

This comment has been minimized.

Copy link
Member

scoder commented Jan 8, 2017

Thanks! I know, this feels verbose and redundant, but given that this code is unlikely to ever change (well, functionality-wise, I mean), I also think that it's the best way to do this.

@scoder scoder merged commit c45c661 into lxml:master Jan 8, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment