ValueError UTF-8 in CSS #213

Open
TZanke opened this issue Dec 17, 2018 · 9 comments

Comments

@TZanke
Contributor

TZanke commented Dec 17, 2018

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

File "premailer/premailer/premailer.py", line 453, in transform
element['item'].attrib['style'] = final_style

The mail contains an escaped non-ASCII character in its CSS (\00DC, i.e. Ü). This looks like valid CSS.

Mail head CSS:

span.berschrift1Zchn
{mso-style-name:"\00DCberschrift 1 Zchn";
mso-style-priority:9;
mso-style-link:"\00DCberschrift 1";
font-family:"Calibri Light",sans-serif;
color:#2F5496;}

character: https://www.htmlsymbols.xyz/unicode/U+00DC
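One possible reading of this, offered as a sketch rather than a confirmed diagnosis: in CSS, a hex escape may run for up to six hex digits, and the `b` and `e` that start `berschrift` are themselves hex digits. A tokenizer following that rule would read `\00DCbe` as a single escape for U+DCBE, a lone surrogate, which is the same '\udcbe' that the Python 3 traceback later in this thread complains about. The helper name `decode_css_escapes` is made up for illustration and is not part of premailer or cssutils.

```python
import re

# A CSS hex escape is a backslash followed by 1 to 6 hex digits,
# optionally terminated by a single whitespace character.
CSS_HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\r\n\f]?")


def decode_css_escapes(text):
    """Hypothetical helper: replace each CSS hex escape with its code point.

    Simplified on purpose: it does not map surrogates or out-of-range
    values to U+FFFD the way a full CSS tokenizer would.
    """
    return CSS_HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The escape from the mail: the regex greedily takes "00DCbe" (six hex digits).
greedy = decode_css_escapes(r"\00DCberschrift 1 Zchn")
print(ascii(greedy))

# Padded to six digits, the escape stops before "berschrift".
padded = decode_css_escapes(r"\0000DCberschrift 1 Zchn")
print(ascii(padded))
```

If that reading is right, padding the escape to six digits (`\0000DC`) or ending it with a space (`\DC berschrift`) would yield Ü instead of a surrogate.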

@peterbe
Owner

peterbe commented Dec 17, 2018

What's the traceback? And can you supply a test file(s) so I can try this locally.
Also, what version of Python?

@TZanke
Contributor Author

TZanke commented Jan 11, 2019

File "/src/premailer/premailer/premailer.py", line 453, in transform
element['item'].attrib['style'] = final_style
File "src/lxml/etree.pyx", line 2408, in lxml.etree._Attrib.__setitem__
File "src/lxml/apihelpers.pxi", line 570, in lxml.etree._setAttributeValue
File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Example file:
Mail.eml.zip

Python: 2.7

@peterbe
Owner

peterbe commented Jan 14, 2019

Hmm... The problem appears to be that the string contains "invalid Unicode" and lxml is strict about that.

I can confirm that the test file causes an error with Python 3.6 too.

Traceback (most recent call last):
  File "test-issue-213.py", line 9, in <module>
    print(transform(html))
  File "/Users/peterbe/dev/PYTHON/premailer/premailer/premailer.py", line 670, in transform
    return Premailer(**kwargs).transform(html, pretty_print=pretty_print)
  File "/Users/peterbe/dev/PYTHON/premailer/premailer/premailer.py", line 464, in transform
    element["item"].attrib["style"] = final_style
  File "src/lxml/etree.pyx", line 2408, in lxml.etree._Attrib.__setitem__
  File "src/lxml/apihelpers.pxi", line 570, in lxml.etree._setAttributeValue
  File "src/lxml/apihelpers.pxi", line 1431, in lxml.etree._utf8
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbe' in position 180: surrogates not allowed

However, it might be possible to make it work. This appears to work after all:

# Exclusively for Python 2.7

from lxml import etree
from lxml.cssselect import CSSSelector

parser = etree.HTMLParser()
html = """<html>
<h1 style="color">Text</h1>
</html>"""
tree = etree.fromstring(html, parser).getroottree()
page = tree.getroot()
for element in CSSSelector("h1")(page):
    element.attrib["style"] = u"\ud83d\ude02"
    # element.attrib["style"] = u"something"

out = etree.tostring(page, encoding="utf-8").decode("utf-8")
print(repr(out))
print(out)

The output becomes:

u'<html>\n<body><h1 style="\U0001f602">Text</h1>\n</body></html>'
<html>
<body><h1 style="😂">Text</h1>
</body></html>

With Python 3.6...

from lxml import etree
from lxml.cssselect import CSSSelector

parser = etree.HTMLParser()
html = """<html>
<h1 style="color">Text</h1>
</html>"""
tree = etree.fromstring(html, parser).getroottree()
page = tree.getroot()
for element in CSSSelector("h1")(page):
    element.attrib["style"] = "\ud83d\ude02"
    # element.attrib["style"] = "😂"

out = etree.tostring(page, encoding="utf-8").decode("utf-8")
print(repr(out))
print(out)

...you get:

Traceback (most recent call last):
  File "invalid-unicode-py3.py", line 11, in <module>
    element.attrib["style"] = "\ud83d\ude02"
  File "src/lxml/etree.pyx", line 2408, in lxml.etree._Attrib.__setitem__
  File "src/lxml/apihelpers.pxi", line 570, in lxml.etree._setAttributeValue
  File "src/lxml/apihelpers.pxi", line 1431, in lxml.etree._utf8
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
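For what it's worth, the Python 3 failure can be reproduced without lxml, since it comes from encoding surrogate code points as UTF-8. The sketch below also shows one way to merge a well-formed surrogate pair into the real character by round-tripping through UTF-16. `merge_surrogate_pairs` is a hypothetical name, not an lxml or premailer function, and a lone surrogate (such as '\udcbe') cannot be merged, so it is replaced with U+FFFD.

```python
def merge_surrogate_pairs(text):
    """Hypothetical helper: combine UTF-16 surrogate pairs into single code points.

    Lone surrogates have no partner to combine with and become U+FFFD.
    """
    # "surrogatepass" lets the surrogates through as raw UTF-16 code units.
    raw = text.encode("utf-16-le", "surrogatepass")
    # Decoding pairs them up; "replace" handles any lone surrogate.
    return raw.decode("utf-16-le", "replace")


# Same failure as in the traceback, without lxml involved.
try:
    "\ud83d\ude02".encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)

print(ascii(merge_surrogate_pairs("\ud83d\ude02")))  # '\U0001f602'
```

Assigning the merged string to `element.attrib["style"]` should then behave like the commented-out "😂" line in the script above.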

@peterbe
Owner

peterbe commented Jan 15, 2019

@TZanke I'd like to help, but realistically I may not be of much use. First, I no longer use Python 2 in any of my projects. Second, I don't use premailer in any of my active projects.

@peterbe
Owner

peterbe commented Jan 15, 2019

In other words, some help would be greatly appreciated.

If you're stuck, try to clean up your incoming HTML string so it doesn't contain weird Microsoft Unicode that may or may not be UTF-8.
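A minimal sketch of that kind of cleanup, assuming the goal is simply to drop every character lxml would reject as not XML compatible. `strip_xml_incompatible` is a made-up helper, and the character ranges follow the XML 1.0 `Char` production. One caveat: it only sees characters that are literally present in the string, so an escape spelled out in ASCII inside a `<style>` block passes through unchanged.

```python
import re

# Everything outside the XML 1.0 "Char" production: NULL and most control
# characters, surrogates (U+D800 to U+DFFF), and U+FFFE / U+FFFF.
XML_INCOMPATIBLE = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_incompatible(text, replacement=""):
    """Hypothetical helper: remove characters that XML 1.0 does not allow."""
    return XML_INCOMPATIBLE.sub(replacement, text)


cleaned = strip_xml_incompatible("color:red;\x00 font-family:\udcbeX")
print(ascii(cleaned))  # 'color:red; font-family:X'
```

Passing `replacement="\ufffd"` instead keeps a visible marker where a character was removed.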

@TZanke
Contributor Author

TZanke commented Jan 16, 2019

At the moment I fix the HTML before running Premailer. This works.

An upgrade to Python 3 is planned for this year, so Python 2 should not be a problem for us in the future.

@TZanke
Contributor Author

TZanke commented Feb 25, 2019

This problem looks like a cssutils problem, so I opened a bug:
https://bitbucket.org/cthedot/cssutils/issues/81/css-encoding-not-working

@nikolaik

nikolaik commented Nov 18, 2019

The Bitbucket repository for cssutils looks unmaintained since around 2017. There seems to be a fork over at https://github.com/ebook-utils/css-parser, which is also included in Debian as python-css-parser.

Does the fork fix the UTF-8 encoding issue?

@TZanke
Contributor Author

TZanke commented Nov 19, 2019

I installed css_parser 1.0.4, but the error still exists. Thanks for your help!
