Handle unicode in classpath entries #4136

Merged
merged 6 commits into from Dec 13, 2016

@peiyuwang
Contributor

peiyuwang commented Dec 12, 2016

Problem

The classmap console task fails with the following error:

./pants classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber
...
  File "/Users/peiyu/github/pants/src/python/pants/backend/jvm/tasks/classmap.py", line 41, in console_output
    for file in self.classname_for_classfile(target, classpath_product):
  File "/Users/peiyu/github/pants/src/python/pants/backend/jvm/tasks/classmap.py", line 28, in classname_for_classfile
    classname = ClasspathUtil.classname_for_rel_classfile(f)
  File "/Users/peiyu/github/pants/src/python/pants/backend/jvm/tasks/classpath_util.py", line 174, in classname_for_rel_classfile
    if not class_file_name.endswith('.class'):

Exception message: 'ascii' codec can't decode byte 0xd8 in position 21: ordinal not in range(128)
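
As a rough sketch, the same class of failure can be reproduced in isolation. The entry name below is an assumption (the traceback does not show which jar entry tripped the decode), and the explicit `.decode("ascii")` stands in for the implicit ASCII decode that Python 2 performs when a byte string is mixed with a unicode string.

```python
# Hypothetical jar entry name: U+0623 is encoded in UTF-8 as the bytes 0xd8 0xa3.
raw_name = "cucumber/api/java/ar/\u0623.class".encode("utf-8")


def decode_as_ascii(name_bytes):
    """Decode bytes as ASCII and report the failure instead of raising."""
    try:
        return name_bytes.decode("ascii")
    except UnicodeDecodeError as e:
        return "failed: {}".format(e)


result = decode_as_ascii(raw_name)
print(result)
```

Any non-ASCII byte in an entry name is enough to trigger this, which is why the entry names need to be decoded as UTF-8 explicitly.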

Solution

There is already logic handling mixed encodings in DuplicateDetector; refactor that into ClasspathUtil so it can be shared by other classes that need to extract entries from jars.
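
A minimal sketch of what such a shared helper might look like, written for Python 3 with only the standard library. The names `classname_from_entry`, `iter_class_names`, and `build_demo_jar` are hypothetical and are not the actual ClasspathUtil API; the demo jar is built in memory purely for illustration.

```python
import io
import zipfile


def classname_from_entry(entry_name):
    """Turn 'a/b/C.class' into 'a.b.C'; return None for non-class entries."""
    if not entry_name.endswith(".class"):
        return None
    return entry_name[: -len(".class")].replace("/", ".")


def iter_class_names(jar_bytes):
    """Yield the class name of every .class entry in an in-memory jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            # Python 3's zipfile already returns text. The bytes branch only
            # matters for callers that hand in raw names, as Python 2 could.
            if isinstance(name, bytes):
                name = name.decode("utf-8")
            classname = classname_from_entry(name)
            if classname is not None:
                yield classname


def build_demo_jar():
    """Build a tiny jar (zip) holding one unicode-named class and a manifest."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("cucumber/api/java/hi/\u091c\u092c.class", b"")
        jar.writestr("META-INF/MANIFEST.MF", b"")
    return buf.getvalue()


class_names = list(iter_class_names(build_demo_jar()))
print(class_names)
```

Normalizing every entry name to text in one place means callers such as the `.endswith('.class')` check never mix byte strings with unicode strings.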

Result

./pants classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber
...
cucumber.api.java.hi.और 3rdparty:cucumber-java
cucumber.api.java.hi.कदा 3rdparty:cucumber-java
cucumber.api.java.hi.किन्तु 3rdparty:cucumber-java
cucumber.api.java.hi.चूंकि 3rdparty:cucumber-java
cucumber.api.java.hi.जब 3rdparty:cucumber-java
+ # utf-8 encoded entry names, some not. As a result we cannot simply decode in all cases
+ # and need to do this to_bytes(...).decode('utf-8') dance to stay safe across all entry
+ # name flavors and under all supported pythons.
+ yield to_bytes(name).decode('utf-8')

@stuhood

stuhood Dec 12, 2016

Member

IIRC, the right way to do the to_bytes conversion is to use six: https://pythonhosted.org/six/#six.binary_type , which would let you drop the pex dep.

@peiyuwang

peiyuwang Dec 13, 2016

Contributor

Good call. Changed to strutil.ensure_text, which conditionally calls decode depending on whether the value is a six.binary_type.
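
For comparison, here is a sketch of the two approaches discussed in this thread. `strutil.ensure_text` itself is not shown here, so the function below is a stand-in written for Python 3, where `six.binary_type` is `bytes` and `six.text_type` is `str`; `to_bytes_then_decode` is a hypothetical stand-in for the original `to_bytes(name).decode('utf-8')` line.

```python
def ensure_text(text_or_binary, encoding="utf-8"):
    """Return text unchanged; decode bytes; reject anything else."""
    if isinstance(text_or_binary, bytes):
        return text_or_binary.decode(encoding)
    if isinstance(text_or_binary, str):
        return text_or_binary
    raise TypeError("Expected bytes or str, got {}".format(type(text_or_binary)))


def to_bytes_then_decode(name):
    """Force the name to UTF-8 bytes first, then decode unconditionally."""
    raw = name if isinstance(name, bytes) else name.encode("utf-8")
    return raw.decode("utf-8")


text_name = "cucumber/api/java/hi/\u091c\u092c.class"
byte_name = text_name.encode("utf-8")

results = [
    ensure_text(text_name),
    ensure_text(byte_name),
    to_bytes_then_decode(text_name),
    to_bytes_then_decode(byte_name),
]
print(all(r == text_name for r in results))
```

Both approaches yield the same text for either input flavor; the conditional version simply skips the encode/decode round trip when the name is already text.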

@peiyuwang peiyuwang merged commit 75ba41c into pantsbuild:master Dec 13, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@peiyuwang peiyuwang deleted the peiyuwang:fix/console-output-unicode branch Jan 18, 2017

lenucksi added a commit to lenucksi/pants that referenced this pull request Apr 25, 2017

Handle unicode in classpath entries (#4136)