Do not directly use isolated surrogates in unicode literals #150

jimbaker · 2014-05-03T03:24:34Z

Jython does not support isolated surrogates in unicode, including in unicode literals. This has been reported in #2. This bug is critical for Jython because html5lib is a vendored library for pip, and this is blocking pip from running on Jython.

For platforms besides Jython, this pull request allows for these surrogates to be constructed in literals, but through an additional step of indirection. For Jython itself, Jython's normal decode of literals will ensure that such invalid unicode strings cannot be constructed from any source.
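
A minimal sketch of the indirection idea, assuming the surrogate is built at run time from its code point rather than written as a `\uD800` literal. The helper name `make_surrogate` is hypothetical and is not html5lib's actual code; how Jython responds to the call is not shown here.

```python
# Hypothetical helper, not html5lib's actual code: build a surrogate
# character from its code point instead of writing a "\uD800" literal,
# so the module source itself never contains an isolated surrogate.
try:
    _chr = unichr  # Python 2 and Jython 2.x
except NameError:
    _chr = chr  # Python 3


def make_surrogate(code_point):
    """Return the one-character string for a surrogate code point.

    Returns None if the platform refuses to build the character.
    """
    if not 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a surrogate code point: %#x" % code_point)
    try:
        return _chr(code_point)
    except ValueError:
        return None


high = make_surrogate(0xD800)
low = make_surrogate(0xDFFF)
```

Because the literal never appears in the source, the decision about whether such a string can exist is deferred to the running interpreter.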

To run this on Jython:

Install https://bitbucket.org/jimbaker/jython-socket-reboot, following these instructions: https://wiki.python.org/jython/JythonDeveloperGuide
Use this branch of pip to install nosetests, etc.: https://github.com/jimbaker/pip Note that tox is not yet supported - because we need to get pip running first! :)

Note that in the dev build, you will find executables in dist/bin, such as dist/bin/jython or dist/bin/pip

The jython-socket-reboot branch is nearly complete for merging into Jython; it is a major component of Jython 2.7.0 beta 3. (I'm a core dev of Jython.)

hoppipolla-critic-bot · 2014-05-03T03:24:38Z

Critic review: https://critic.hoppipolla.co.uk/r/1443

This is an external review system which you may optionally use for the code review of your pull request.

In order to help critic track your changes, please do not make in-place history rewrites (e.g. via git rebase -i or git commit --amend) when updating this pull request.

jimbaker · 2014-05-03T03:30:13Z

I should mention that my branch of pip has the same change proposed here in its vendor lib inclusion of html5lib

gsnedders · 2014-05-05T19:03:24Z

Really I'd rather Jython just changed so that it matched the Python documentation and CPython and PyPy and had its unicode type be 16-/32-bit code-units. :)

But…

Trying to run the tests with that branch of Jython on your branch of html5lib leads to a bunch of errors. Notably, html5lib/tests/test_tokenizer.py reports an error after two tests, when it's trying to run tests for lone surrogates in the input stream. I suppose this should just be a documented limitation of html5lib on Jython, and some workaround added so the rest of the tokenizer tests run to completion.

Otherwise there's a bunch of failures due to lack of an euc_jp codec. The docs say:

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used.

…which I take to mean that Python, as a language, comes with them. Hence that seems to be another bug in Jython.

jimbaker · 2014-05-05T20:39:36Z

Guido van Rossum summarized the issue of UTF-16 support in Jython well, which we used as guidance when Jython 2.5 was under development:
https://mail.python.org/pipermail/python-3000/2006-September/003821.html

This has all been rehashed endlessly. It's implementation (and
platform- and compilation options-) dependent because there are good
reasons for both choices. Even if CPython 3.0 supports a dynamic
choice (which some are proposing) then the language will still make
it implementation dependent because of Jython and IronPython, where
the only choice is UTF-16 (or UCS-2, depending the attitude towards
surrogates).

Re euc_jp codec, this is a known bug with respect to a lack of CJK codecs (http://bugs.jython.org/issue1066). Hopefully there has been some more recent work on the corresponding patch.

However, I did just re-run using

$ ~/jythondev/jython-socket-reboot/dist/bin/nosetests --verbose
...
----------------------------------------------------------------------
Ran 1038 tests in 16.498s

OK

Maybe I need to specify additional options?

gsnedders · 2014-05-05T23:29:44Z

Did you init/update the git submodule? See the readme. You should be looking at tens of thousands of tests.

FWIW, while the Py2 language reference uses the ambiguous phrasing of "Unicode ordinal" in defining the unicode type (what on earth a "Unicode ordinal" is is a very good question — it's not defined anywhere!), the Py3 language reference states that the str type is a sequence of Unicode codepoints (which would therefore mean lone surrogates are allowed). But this is rehashing what I said in the Jython bug and probably not the place to debate Python semantics.
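
A small illustration of the Py3 behaviour mentioned above, run on CPython 3 only (nothing here is claimed about Jython): a `str` can hold a lone surrogate code point, but strict UTF-8 encoding rejects it unless the `surrogatepass` error handler is used.

```python
# Python 3 only: a str may hold a lone surrogate code point.
lone = chr(0xD800)
assert len(lone) == 1

# Strict UTF-8 encoding rejects it, since surrogates are not scalar values.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler encodes it anyway.
raw = lone.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")
```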

jimbaker · 2014-05-06T22:49:21Z

@gsnedders Got it, I was able to run 16729 tests, so we are now on the same page. Of these I'm seeing 7 errors. 6 are definitely euc_jp; 1 is failing in the middle of a loop, which I haven't investigated yet. Maybe this is euc_jp too?

Since we would like to fix http://bugs.jython.org/issue1066 and support CJK codecs, I'm going to put this PR on hold on our side until that CJK support is complete, which I would expect in beta 4. This timing seems to be sound given the work by Yuji Yamano and the fact that this is now getting some attention. Because we will also be integrating ensurepip during beta 4, there may be an interval where ensurepip will refer to a Jython-specific fork of pip (so beyond the current practice of referring people to https://github.com/jimbaker/pip to try things out), but this should be short in duration.

gsnedders · 2014-05-06T23:02:08Z

The other error is another case of lone-surrogates, I can tell you that much.

jimbaker · 2014-05-06T23:55:49Z

@gsnedders re the other case: will take a look. Thanks again for your help!

I have updated http://bugs.jython.org/issue1066 to reflect this CJK codec dependency.

gsnedders · 2014-05-17T18:56:52Z

When you next take a look at this, mind adding $py.class to .gitignore?

agronholm · 2014-05-30T14:40:02Z

So where are we with this right now?

jimbaker · 2014-06-14T04:58:23Z

The CJK work is now complete in Jython trunk: Jython now supports being able to wrap java.nio.charset codec support with the standard Python codec API. This includes the tested euc_jp codec in this test suite.
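
One way to check for the codec from Python code is sketched below, using only the standard `codecs.lookup` call. The helper names `codec_available` and `round_trips` are hypothetical and are not taken from the html5lib test suite.

```python
import codecs


def codec_available(name):
    """Return True if the running interpreter can look up the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def round_trips(text, encoding):
    """Encode and decode text with the codec and compare to the original."""
    return text.encode(encoding).decode(encoding) == text


# U+65E5 U+672C U+8A9E, given as escapes so the example does not
# depend on the source file encoding.
sample = u"\u65e5\u672c\u8a9e"

if codec_available("euc_jp"):
    euc_jp_ok = round_trips(sample, "euc_jp")
else:
    euc_jp_ok = None  # codec missing on this interpreter
```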

There's one remaining problem with isolated surrogates with the tests corresponding to unicodeCharsProblematic.test (a data file used to construct distinct unit tests, a rather clever approach); some sort of workaround will be required.

I will update the branch with this test workaround accordingly, along with .gitignore.

jimbaker · 2014-06-16T20:46:11Z

All tests pass with python 2.7, python 3.4, and jython 2.7 trunk using html5lib-tests test data.

With this latest change, Jython will now only run 17353 of the current 17357 tests, effectively excluding the following test cases in unicodeCharsProblematic.test:

{"tests" : [
{"description": "Invalid Unicode character U+DFFF",
"doubleEscaped":true,
"input": "\\uDFFF",
"output":["ParseError", ["Character", "\\uFFFD"]]},

{"description": "Invalid Unicode character U+D800",
"doubleEscaped":true,
"input": "\\uD800",
"output":["ParseError", ["Character", "\\uFFFD"]]},

{"description": "Invalid Unicode character U+DFFF with valid preceding character",
"doubleEscaped":true,
"input": "a\\uDFFF",
"output":["ParseError", ["Character", "a\\uFFFD"]]},

{"description": "Invalid Unicode character U+D800 with valid following character",
"doubleEscaped":true,
"input": "\\uD800a",
"output":["ParseError", ["Character", "\\uFFFDa"]]},

However, it does run the test case in that data file using \u0000.
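
A rough sketch of how such an exclusion could work, assuming a boolean flag that says whether the platform can build lone surrogates. The functions `unescape`, `names_surrogate`, and `should_skip` are hypothetical names for illustration; the actual change in this branch may differ, and how the flag is computed is not shown.

```python
import re

try:
    _chr = unichr  # Python 2 and Jython 2.x
except NameError:
    _chr = chr  # Python 3

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Turn double-escaped \\uXXXX sequences into the characters they name."""
    return _ESCAPE.sub(lambda m: _chr(int(m.group(1), 16)), text)


def names_surrogate(text):
    """Return True if any \\uXXXX escape in text names a surrogate code point."""
    return any(0xD800 <= int(h, 16) <= 0xDFFF for h in _ESCAPE.findall(text))


def should_skip(test, lone_surrogates_supported):
    """Skip double-escaped tests that need surrogates the platform cannot build."""
    if lone_surrogates_supported:
        return False
    return bool(test.get("doubleEscaped")) and names_surrogate(test["input"])


# One of the excluded cases from above, as a Python dict.
test = {"description": "Invalid Unicode character U+DFFF with valid preceding character",
        "doubleEscaped": True,
        "input": "a\\uDFFF"}
```

Checking the escaped text before unescaping means the skip decision never has to construct the surrogate itself, and a case such as `\u0000` is still run.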

I did notice that the JSON parsing of namedentities.test is rather slow. Something to be fixed separately, by using the Jackson package for the json module implementation.

jimbaker · 2014-06-16T21:27:39Z

I should also mention that https://bitbucket.org/jimbaker/jython-socket-reboot has been deleted, now that it has been merged into https://bitbucket.org/jython/jython (or alternatively, http://hg.python.org/jython)

dstufft · 2014-06-16T21:29:07Z

For what it's worth, if this gets merged and released within the next month or so, it'll make it into pip 1.6. Not sure what the release schedule looks like for html5lib though :)

jimbaker · 2014-07-02T22:36:59Z

@gsnedders Any updates/comments on this pull request? We are really hoping that pip 1.6 will be able to support Jython 2.7, as well as other users of html5lib-python

gsnedders · 2014-07-10T17:00:42Z

Apologies for being a bit slow (I've been all over the place travelling pretty continuously, and I'm now on holiday till EuroPython); there's a couple of comments on Critic now, see the first comment above.

@dstufft: the html5lib release schedule is typically someone going, "oh, hey, I need this bugfix that's on master, could you make a release?", and me responding with a release. :) (In theory, I plan on making releases every so often once anything worthwhile happens, but other people often find obscure bugfixes worthwhile, so it gets released when they ask.)

jimbaker · 2014-07-11T04:19:08Z

@gsnedders Looking forward to seeing you at EuroPython! I'll respond to the Critic review, thanks!

jgraham · 2014-07-11T18:07:33Z

I added some more comments on critic.

tseaver · 2014-11-29T17:29:29Z

Hmm, test failures are due to trailing whitespace.

gsnedders · 2014-11-29T17:42:25Z

#168 is essentially the same as this with the trailing whitespace fixed.

tseaver · 2014-11-29T17:44:09Z

#168 fails all its tests, not just the flake8 ones.

gsnedders · 2014-11-29T17:51:05Z

That's odd given nonameentername@c4755d3 is the only extra commit on the branch… Literally the whole diff is just that one commit! So, um… Somehow the submodule is not being updated properly on Travis? Because that looks like it's run with a more recent html5lib-tests revision…

gsnedders · 2014-11-29T17:54:37Z

Oh, Travis CI seems to run the implicit merge-commit in html5lib-python@refs/pull/168/merge. Hence given it's based on master (which currently fails tests because #174 isn't fixed yet) and run more recently it fails.

gsnedders · 2014-11-29T17:55:26Z

(That's annoying because it means that when Travis CI runs a given PR it tests different things; if I were to make it retest #150 it would have all the same failures, but with flake8 finding nothing, le sigh.)

tseaver · 2015-02-19T17:30:31Z

I was hoping that the recent jython 2.7b4 release somehow worked around this problem, but it is still borken.

tseaver · 2015-04-25T23:46:22Z

Hmm, I don't know of any resolution here. Are you in the process of merging #168?

gsnedders · 2015-04-25T23:54:16Z

Just closing this because #168 is its replacement rather than keeping both open.

gsnedders · 2015-04-25T23:54:26Z

(We'll see if anything happens soon. Maybe.)

Do not directly use isolated surrogates in unicode literals for platforms besides Jython

8aab9d8

Use six.unichr for Python 3.x

0c5916e

jimbaker added 2 commits June 16, 2014 14:33

Ignore compiled Python classes for Jython

a6c4b41

Pass on constructed tests in test_tokenizer that attempt to build HTMLUnicodeInputStream objects from unicode strings that contain isolated surrogates. Such tests are not meaningful on Jython which does not allow for invalid unicode strings to be decoded in the first place.

7f189f8

jimbaker added 2 commits August 12, 2014 19:16

Merge remote-tracking branch 'upstream/master'

08e7eb5

Use utils.supports_lone_surrogates in place of Jython-specific logic

dc52b8e

dstufft mentioned this pull request Aug 28, 2014

import pwd fails under Jython on Windows pypa/pip#1975

Closed

tseaver mentioned this pull request Apr 24, 2015

Support for Jython 2.7rc2+ pypa/virtualenv#746

Closed

gsnedders closed this Apr 25, 2015

This was referenced Apr 26, 2015

Do not directly use isolated surrogates in unicode literals #185

Closed

Support Jython #2

Closed
