Simplifying calls when Java methods require java.lang.String #190

Open
EikoocS opened this issue Jun 21, 2022 · 1 comment


EikoocS commented Jun 21, 2022

Jython 2.7.2
Java 17

Hi, I declared the file encoding as UTF-8 according to PEP 263,
but when a String argument passed to a Java method contains non-ASCII characters, the result is garbled text.

java.lang.String(text, "utf-8") can be used to resolve the garbled text.
But strings are used a lot and this call is a bit cumbersome. Is there a way to simplify it, such as encoding the string according to the script encoding?
If not, is it possible to add some method to simplify it?

Member

jeff5 commented Jun 24, 2022

I set myself up like this:

# utf8string.py
# -*- coding: utf-8 -*-
#
import java.lang
import array

en = "Stick with ASCII"
fr = "Le dîner à Étretat"
gk = "λόγος"
ch = "画蛇添足"

def test(b):
    a = array.array('B', b)
    print "array:  ", repr(a)
    u = unicode(b, "utf-8")
    print "unicode:", repr(u)
    t = java.lang.String(b, "utf-8")
    print "String: ", repr(t)
    print

test(en)

Then running (with 2.7.2 and Java 8 on Windows) I get:

PS 190> jython -i utf8string.py
array:   array('B', [83, 116, 105, 99, 107, 32, 119, 105, 116, 104, 32, 65, 83, 67, 73, 73])
unicode: u'Stick with ASCII'
String:  Stick with ASCII

This also works for me (I have a cp936 encoded terminal):

>>> test(ch)
array:   array('B', [231, 148, 187, 232, 155, 135, 230, 183, 187, 232, 182, 179])
unicode: u'\u753b\u86c7\u6dfb\u8db3'
String:  画蛇添足

Now with the Greek and French examples I get this mess:

>>> test(gk)
array:   array('B', [206, 187, 207, 140, 206, 179, 206, 191, 207, 130])
unicode: u'\u03bb\u03cc\u03b3\u03bf\u03c2'
String:  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "utf8string.py", line 18, in test
    print "String: ", repr(t)
  ...
UnicodeEncodeError: 'ms936' codec can't encode character u'\u03cc' in position 1: illegal multibyte sequence
String:  >>>

The program evidently works: the correct bytes end up in gk and a, and the correct characters in the other representations of the text. What goes wrong is only the output to the terminal, and only with java.lang.String, because print tries to render it as text on screen, even when the repr is requested.
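
The terminal-only nature of the failure can be checked without any Java involved. The following is a minimal sketch, assuming the Python codec name cp936 corresponds to the terminal code page reported as ms936 in the traceback; the helper name encodable is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: the text itself is fine; only encoding it for a cp936 terminal can fail.

def encodable(text, encoding):
    """Hypothetical helper: True if the unicode text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

gk_text = u'\u03bb\u03cc\u03b3\u03bf\u03c2'   # the Greek example, as unicode
ch_text = u'\u753b\u86c7\u6dfb\u8db3'         # the Chinese example, as unicode

for name, text in (("gk", gk_text), ("ch", ch_text)):
    print("%s: cp936=%s utf-8=%s" % (
        name, encodable(text, "cp936"), encodable(text, "utf-8")))
```

Both strings encode as UTF-8, but only the Chinese one fits the cp936 code page, which matches the behaviour seen at the prompt.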

Somewhat inconsistently, I can succeed with:

>>> t = java.lang.String(gk, "utf-8")
>>> type(t)
<type 'java.lang.String'>
>>> repr(t)
u'\u03bb\u03cc\u03b3\u03bf\u03c2'

Actually, I was a bit surprised to find that String is not immediately converted to unicode on creation, but it is probably useful that it isn't. Various actions will treat it like bytes, however, ignoring the upper byte of each UTF-16 code unit, or die in the attempt.
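
To illustrate what dropping the upper byte would do to the Greek example, here is a rough simulation in plain Python. It only mimics the truncation arithmetic and makes no claim about how Jython implements it internally.

```python
# Simulation only: keep the low byte of each code unit of the Greek text.
gk_text = u'\u03bb\u03cc\u03b3\u03bf\u03c2'

# UTF-8 bytes of the same text, copied from the array output of test(gk).
utf8_bytes = [206, 187, 207, 140, 206, 179, 206, 191, 207, 130]

# Masking with 0xFF discards the upper byte (0x03 for these characters).
low_bytes = [ord(c) & 0xFF for c in gk_text]

print(low_bytes)
print(low_bytes == utf8_bytes)
```

The truncated values are neither the original characters nor their UTF-8 encoding, so anything that treats the String this way loses information.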

The work-around, if you can't simply use str and unicode, appears to be to handle the String carefully.
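
One way to stay with str and unicode is to decode once and pass the unicode object on. This is a minimal sketch, assuming the text arrives as UTF-8 bytes, as it does from a str literal in a file declared UTF-8; the helper name u8 is hypothetical and not part of Jython.

```python
# -*- coding: utf-8 -*-

def u8(b):
    """Hypothetical helper: decode UTF-8 bytes into a unicode object."""
    return b.decode("utf-8")

# The bytes of the Greek example, as shown in the array output of test(gk).
gk_bytes = bytes(bytearray([206, 187, 207, 140, 206, 179, 206, 191, 207, 130]))

gk_text = u8(gk_bytes)
print(repr(gk_text))   # repr is pure ASCII, so it is safe on any terminal
```

Writing the literal as u"..." in the source avoids even this step. A unicode object should then be usable wherever a Java method expects java.lang.String, without the explicit constructor call.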
