Skip to content


Subversion checkout URL

You can clone with
Download ZIP


Do not directly use isolated surrogates in unicode literals #150

wants to merge 6 commits into from

7 participants


Jython does not support isolated surrogates in unicode, including in unicode literals. This has been reported in #2 This bug is critical for Jython due to the fact that html5lib is a vendor lib for pip, and this is blocking pip from running on Jython.

For platforms besides Jython, this pull request allows for these surrogates to be constructed in literals, but through an additional step of indirection. For Jython itself, Jython's normal decode of literals will ensure that such invalid unicode strings cannot be constructed from any source.

To run this on Jython:

  1. Install, following these instructions:

  2. Use this branch of pip to install nosetests, etc.: Note that tox is not yet supported - because we need to get pip running first! :)

Note that in the dev build, you will find executables in dist/bin, such as dist/bin/jython or dist/bin/pip

The jython-socket-reboot branch is nearly complete for merging into Jython; it is a major component of Jython 2.7.0 beta 3. (I'm a core dev of Jython.)


Critic review:

This is an external review system which you may optionally use for the code review of your pull request.

In order to help critic track your changes, please do not make in-place history rewrites (e.g. via git rebase -i or git commit --amend) when updating this pull request.


I should mention that my branch of pip has the same change proposed here in its vendor lib inclusion of html5lib


Really I'd rather Jython just changed so that is matched the Python documentation and CPython and PyPy and had its unicode type be 16-/32-bit code-units. :)


Trying to run the tests with that branch of Jython on your branch of html5lib leads to a bunch of errors. Notably, html5lib/tests/ reports an error after two tests, when it's trying to run tests for lone surrogates in the input stream. I suppose this should just be a documented limitation of html5lib on Jython, and some workaround added so the rest of the tokenizer tests run to completion.

Otherwise there's a bunch of failures due to lack of an euc_jp codec. The docs say:

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used.

…which I take to mean as Python, as a language, comes with them. Hence that's seems to be another bug in Jython.


Guido van Rossum summarized the issue of UTF-16 support in Jython well, which we used as guidance when Jython 2.5 was under development:

This has all been rehashed endlessly. It's implementation (and
platform- and compilation options-) dependent because there are good
reasons for both choices. Even if CPython 3.0 supports a dynamic
choice (which some are proposing) then the language will still make
it implementation dependent because of Jython and IronPython, where
the only choice is UTF-16 (or UCS-2, depending the attitude towards

Re euc_jp codec, this is a known bug with respect to a lack of CJK codecs ( Hopefully there has been some more recent work on the corresponding patch.

However, I did just re-run using

$ ~/jythondev/jython-socket-reboot/dist/bin/nosetests --verbose
Ran 1038 tests in 16.498s


Maybe I need to specify additional options?


Did you init/update the git submodule? See the readme. You should be looking at tens of thousands of tests.

FWIW, while the Py2 language reference uses the ambiguous phrasing of "Unicode ordinal" in defining the unicode type (what on earth a "Unicode ordinal" is is a very good question — it's not defined anywhere!), the Py3 language reference states that the str type is a sequence of Unicode codepoints (which would therefore mean lone surrogates are allowed). But this is rehashing what I said in the Jython bug and probably not the place to debate Python semantics.


@gsnedders Got it, I was able to run 16729 tests, so we are now on the same page. Of these I'm seeing 7 errors. 6 are definitely euc_jp; 1 is failing in the middle of a loop, which I haven't investigated yet. Maybe this is euc_jp too?

Since we would like to fix and support CJK codecs, I'm going to put this PR on hold on our side until that CJK support is complete, which I would be expect in the beta 4. This timing seems to be sound given the work by Yuji Yamano and the fact that this is now getting some attention. Because we will also will be integrating ensurepip during beta 4, there may be an interval where ensurepip will refer to a Jython-specific fork of pip (so beyond the current practice of referring people to to try things out), but this should be short in duration.


The other error is another case of lone-surrogates, I can tell you that much.


@gsnedders re the other case: will take a look. Thanks again for your help!

I have updated to reflect this CJK codec dependency.


When you next take a look at this, mind adding $py.class to .gitignore?


So where are we with this right now?


The CJK work is now complete in Jython trunk: Jython now supports being able to wrap java.nio.charset codec support with the standard Python codec API. This includes the tested euc_jp codec in this test suite.

There's one remaining problem with isolated surrogates with the tests corresponding to unicodeCharsProblematic.test (a data file used to construct distinct unit tests, a rather clever approach); some sort of workaround will be required.

I will update the branch with this test workaround accordingly, along with .gitignore.

jimbaker added some commits
@jimbaker jimbaker Ignore compiled Python classes for Jython a6c4b41
@jimbaker jimbaker Pass on constructed tests in test_tokenizer that attempt to build
HTMLUnicodeInputStream objects from unicode strings that contain
isolated surrogates. Such tests are not meaningful on Jython which
does not allow for invalid unicode strings to be decoded in the first

All tests pass with python 2.7, python 3.4, and jython 2.7 trunk using html5lib-tests test data.

With this latest change, Jython will now only run 17353 of the current 17357 tests, effectively excluding the following test cases in unicodeCharsProblematic.test:

{"tests" : [
{"description": "Invalid Unicode character U+DFFF",
"input": "\\uDFFF",
"output":["ParseError", ["Character", "\\uFFFD"]]},

{"description": "Invalid Unicode character U+D800",
"input": "\\uD800",
"output":["ParseError", ["Character", "\\uFFFD"]]},

{"description": "Invalid Unicode character U+DFFF with valid preceding character",
"input": "a\\uDFFF",
"output":["ParseError", ["Character", "a\\uFFFD"]]},

{"description": "Invalid Unicode character U+D800 with valid following character",
"input": "\\uD800a",
"output":["ParseError", ["Character", "\\uFFFDa"]]},

However, it does run the test case in that data file using \u0000.

I did notice that the JSON parsing of namedentities.test is rather slow. Something to fixed separately, by using the Jackson package for the json module implementation.


I should also mention that has been deleted, now that it has been merged into (or alternatively,


For what it's worth, if this gets merged and released within the next month or so It'll make it into pip 1.6. Not sure what the release schedule looks like for html5lib though :)


@gsnedders Any updates/comments on this pull request? We are really hoping that pip 1.6 will be able to support Jython 2.7, as well as other users of html5lib-python


Apologies for being a bit slow (I've been all over the place travelling pretty continuously, and I'm now on holiday till EuroPython); there's a couple of comments on Critic now, see the first comment above.

@dstufft: the html5lib release schedule is typically someone going, "oh, hey, I need this bugfix that's on master, could you make a release?", and me responding with a release. :) (In theory, I plan on making releases every so often once anything worthwhile happens, but other people often find obscure bugfixes worthwhile, so it gets released when they ask.)


@gsnedders Looking forward to seeing you at EuroPython! I'll respond to the Critic review, thanks!


I added some more comments on critic.

@dstufft dstufft referenced this pull request in pypa/pip

import pwd fails under Jython on Windows #1975


Hmm, test failures are due to trailing whitespace.


#168 is essentially the same as this with the trailing whitespace fixed.


#168 fails all its tests, not just the flake8 ones.


That odd given nonameentername@c4755d3 is the only extra commit on the branch… Literally the whole diff is just that one commit! So, um… Somehow submodule not being updated properly on Travis? Because that looks like it's run with a more recent html5lib-tests revision…


Oh, Travis CI seems to run the implicit merge-commit in html5lib-python@refs/pull/168/merge. Hence given it's based on master (which currently fails tests because #174 isn't fixed yet) and run more recently it fails.


(That's annoying because it means that when Travis CI runs a given PR it tests different things; if I were to make it retest #150 it would have all the same failures, but with flake8 finding nothing, le sigh.)


I was hoping that the recent jython 2.7b4 release somehow worked around this problem, but it is still borken.

@tseaver tseaver referenced this pull request in pypa/virtualenv

Support for Jython 2.7rc2+ #746

@gsnedders gsnedders closed this

Hmm, I don't know of any resolution here. Are you in the process of merging #168?


Just closing this because #168 is it's replacement rather than keeping both open.


(We'll see if anything happens soon. Maybe.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 3, 2014
  1. @jimbaker
  2. @jimbaker
Commits on Jun 16, 2014
  1. @jimbaker
  2. @jimbaker

    Pass on constructed tests in test_tokenizer that attempt to build

    jimbaker authored
    HTMLUnicodeInputStream objects from unicode strings that contain
    isolated surrogates. Such tests are not meaningful on Jython which
    does not allow for invalid unicode strings to be decoded in the first
Commits on Aug 12, 2014
  1. @jimbaker
  2. @jimbaker
This page is out of date. Refresh to see the latest.
1  .gitignore
@@ -1,6 +1,7 @@
# Because we never want compiled Python
# Ignore stuff produced by distutils
42 html5lib/
@@ -1,5 +1,5 @@
from __future__ import absolute_import, division, unicode_literals
-from six import text_type
+from six import text_type, unichr
from six.moves import http_client
import codecs
@@ -28,7 +28,18 @@ class BufferedIOBase(object):
asciiUppercaseBytes = frozenset([item.encode("ascii") for item in asciiUppercase])
spacesAngleBrackets = spaceCharactersBytes | frozenset([b">", b"<"])
-invalid_unicode_re = re.compile("[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF\U0001FFFE\U0001FFFF\U0002FFFE\U0002FFFF\U0003FFFE\U0003FFFF\U0004FFFE\U0004FFFF\U0005FFFE\U0005FFFF\U0006FFFE\U0006FFFF\U0007FFFE\U0007FFFF\U0008FFFE\U0008FFFF\U0009FFFE\U0009FFFF\U000AFFFE\U000AFFFF\U000BFFFE\U000BFFFF\U000CFFFE\U000CFFFF\U000DFFFE\U000DFFFF\U000EFFFE\U000EFFFF\U000FFFFE\U000FFFFF\U0010FFFE\U0010FFFF]")
+invalid_unicode_template = "[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uFDD0-\uFDEF\uFFFE\uFFFF\U0001FFFE\U0001FFFF\U0002FFFE\U0002FFFF\U0003FFFE\U0003FFFF\U0004FFFE\U0004FFFF\U0005FFFE\U0005FFFF\U0006FFFE\U0006FFFF\U0007FFFE\U0007FFFF\U0008FFFE\U0008FFFF\U0009FFFE\U0009FFFF\U000AFFFE\U000AFFFF\U000BFFFE\U000BFFFF\U000CFFFE\U000CFFFF\U000DFFFE\U000DFFFF\U000EFFFE\U000EFFFF\U000FFFFE\U000FFFFF\U0010FFFE\U0010FFFF%s]"
+if utils.supports_lone_surrogates:
+ # Use one extra step of indirection and create surrogates with
+ # unichr. Not using this indirection would introduce an illegal
+ # unicode literal on platforms not supporting such lone
+ # surrogates.
+ invalid_unicode_re = re.compile(invalid_unicode_template % (
+ "%s-%s" % (unichr(0xD800), unichr(0xDFFF)),))
+ invalid_unicode_re = re.compile(invalid_unicode_template % "")
non_bmp_invalid_codepoints = set([0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF,
@@ -164,13 +175,23 @@ def __init__(self, source):
- # Craziness
- if len("\U0010FFFF") == 1:
+ if not utils.supports_lone_surrogates:
+ # Such platforms will have already checked for such
+ # surrogate errors, so no need to do this checking.
+ self.reportCharacterErrors = None
+ self.replaceCharactersRegexp = None
+ elif len("\U0010FFFF") == 1:
self.reportCharacterErrors = self.characterErrorsUCS4
- self.replaceCharactersRegexp = re.compile("[\uD800-\uDFFF]")
+ self.replaceCharactersRegexp = re.compile("[%s-%s]" % (
+ unichr(0xD800), unichr(0xDFFF)))
self.reportCharacterErrors = self.characterErrorsUCS2
- self.replaceCharactersRegexp = re.compile("([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])")
+ self.replaceCharactersRegexp = re.compile(
+ "([%s-%s](?![%s-%s])|(?<![%s-%s])[%s-%s])" % (
+ unichr(0xD800), unichr(0xDBFF),
+ unichr(0xDC00), unichr(0xDFFF),
+ unichr(0xD800), unichr(0xDBFF),
+ unichr(0xDC00), unichr(0xDFFF)))
# List of where new lines occur
self.newLines = [0]
@@ -265,11 +286,12 @@ def readChunk(self, chunkSize=None):
self._bufferedCharacter = data[-1]
data = data[:-1]
- self.reportCharacterErrors(data)
+ if utils.supports_lone_surrogates:
+ self.reportCharacterErrors(data)
- # Replace invalid characters
- # Note U+0000 is dealt with in the tokenizer
- data = self.replaceCharactersRegexp.sub("\ufffd", data)
+ # Replace invalid characters
+ # Note U+0000 is dealt with in the tokenizer
+ data = self.replaceCharactersRegexp.sub("\ufffd", data)
data = data.replace("\r\n", "\n")
data = data.replace("\r", "\n")
25 html5lib/tests/
@@ -7,7 +7,7 @@
from .support import get_data_files
from html5lib.tokenizer import HTMLTokenizer
-from html5lib import constants
+from html5lib import constants, utils
class TokenizerTestParser(object):
@@ -122,9 +122,28 @@ def tokensMatch(expectedTokens, receivedTokens, ignoreErrorOrder,
return tokens["expected"] == tokens["received"]
+_surrogateRe = re.compile(r"\\u(?P<codepoint>[0-9A-Fa-f]{4})")
def unescape(test):
def decode(inp):
- return inp.encode("utf-8").decode("unicode-escape")
+ try:
+ return inp.encode("utf-8").decode("unicode-escape")
+ except UnicodeDecodeError:
+ possible_surrogate_match =
+ if possible_surrogate_match and not utils.supports_lone_surrogates:
+ possible_surrogate = int("codepoint"), 16)
+ if possible_surrogate >= 0xD800 and possible_surrogate <= 0xDFFF:
+ # Not valid unicode input for platforms that do
+ # not have support for lone surrogates.
+ #
+ # NOTE it's not even possible to have such
+ # isolated surrogates in unicode input streams in
+ # such platforms (like Jython) - the decoding to
+ # unicode would have raised a similar
+ # UnicodeDecodeError.
+ return None
+ raise
test["input"] = decode(test["input"])
for token in test["output"]:
@@ -183,6 +202,8 @@ def testTokenizer():
test["initialStates"] = ["Data state"]
if 'doubleEscaped' in test:
test = unescape(test)
+ if test["input"] is None:
+ continue # Not valid input for this platform
for initialState in test["initialStates"]:
test["initialState"] = capitalize(initialState)
yield runTokenizerTest, test
14 html5lib/
@@ -1,5 +1,6 @@
from __future__ import absolute_import, division, unicode_literals
+import platform
from types import ModuleType
@@ -9,7 +10,18 @@
__all__ = ["default_etree", "MethodDispatcher", "isSurrogatePair",
- "surrogatePairToCodepoint", "moduleFactoryFactory"]
+ "surrogatePairToCodepoint", "moduleFactoryFactory",
+ "supports_lone_surrogates"]
+# Platforms not supporting lone surrogates (\uD800-\uDFFF) should be
+# added to the below test. In general this would be any platform using
+# UTF-16 as its encoding of unicode strings, such as Jython. This is
+# because UTF-16 itself is based on the use of such surrogates, and
+# there is no mechanism to further escape such escapes.
+# Otherwise we assume such support.
+supports_lone_surrogates = platform.python_implementation() != "Jython"
class MethodDispatcher(dict):
Something went wrong with that request. Please try again.