Simplifying calls when Java methods require java.lang.String #190

EikoocS · 2022-06-21T05:33:42Z

Jython 2.7.2
Java 17

Hi, I declared the file encoding as UTF-8 according to PEP 263, but when a String argument passed to a Java method contains non-ASCII characters, the text comes out garbled.

java.lang.String(text, "utf-8") can be used to fix the garbled text.
But Strings are used a lot and this call is a bit cumbersome. Is there a way to simplify it, such as having strings decoded automatically with the script's encoding?
If not, is it possible to add a method that simplifies this?

jeff5 · 2022-06-24T08:54:25Z

I set myself up like this:

# utf8string.py
# -*- coding: utf-8 -*-
#
import java.lang
import array

en = "Stick with ASCII"
fr = "Le dîner à Étretat"
gk = "λόγος"
ch = "画蛇添足"

def test(b):
    a = array.array('B', b)
    print "array:  ", repr(a)
    u = unicode(b, "utf-8")
    print "unicode:", repr(u)
    t = java.lang.String(b, "utf-8")
    print "String: ", repr(t)
    print

test(en)

Then running (with 2.7.2 and Java 8 on Windows) I get:

PS 190> jython -i utf8string.py
array:   array('B', [83, 116, 105, 99, 107, 32, 119, 105, 116, 104, 32, 65, 83, 67, 73, 73])
unicode: u'Stick with ASCII'
String:  Stick with ASCII

This also works for me (I have a cp936 encoded terminal):

>>> test(ch)
array:   array('B', [231, 148, 187, 232, 155, 135, 230, 183, 187, 232, 182, 179])
unicode: u'\u753b\u86c7\u6dfb\u8db3'
String:  画蛇添足

Now with the Greek and French examples I get this mess:

>>> test(gk)
array:   array('B', [206, 187, 207, 140, 206, 179, 206, 191, 207, 130])
unicode: u'\u03bb\u03cc\u03b3\u03bf\u03c2'
String:  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "utf8string.py", line 18, in test
    print "String: ", repr(t)
  ...
UnicodeEncodeError: 'ms936' codec can't encode character u'\u03cc' in position 1: illegal multibyte sequence
String:  >>>

It is evident that the program works, in that the correct bytes end up in gk and a, and the correct characters in the other representations of the text. What goes wrong is only the output to the terminal, and only with java.lang.String: print tries to render it as text on screen, and the terminal's ms936 codec cannot encode u'\u03cc', even though the repr was demanded.
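To make that distinction concrete, here is a minimal sketch in plain Python (written to run on CPython 2.7 or 3, not specific to Jython). It decodes the byte values shown in the output above and then checks whether each text fits the terminal encoding. The helper name encodable is made up for this illustration, and Python's "cp936" codec is assumed to be a close enough stand-in for the ms936 codec named in the traceback.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Byte values copied from the test(gk) and test(ch) output above.
gk_bytes = bytearray([206, 187, 207, 140, 206, 179, 206, 191, 207, 130])
ch_bytes = bytearray([231, 148, 187, 232, 155, 135, 230, 183, 187, 232, 182, 179])

# Decoding as UTF-8 recovers the text, so the data itself is intact.
gk_text = bytes(gk_bytes).decode("utf-8")
ch_text = bytes(ch_bytes).decode("utf-8")


def encodable(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: Python's "cp936" codec stands in for the terminal's ms936.
print("ch fits cp936:", encodable(ch_text, "cp936"))
print("gk fits cp936:", encodable(gk_text, "cp936"))
```

If the Greek text does not fit while the Chinese text does, that matches what the session shows: the failure is in rendering to this particular terminal, not in the string data.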

Somewhat inconsistently, I can succeed with:

>>> t = java.lang.String(gk, "utf-8")
>>> type(t)
<type 'java.lang.String'>
>>> repr(t)
u'\u03bb\u03cc\u03b3\u03bf\u03c2'

Actually, I was a bit surprised to find that the String is not immediately converted to unicode on creation, but it is probably useful that it isn't. Various actions will treat it like bytes, however, ignoring the upper byte of each UTF-16 code unit, or die in the attempt.
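As a rough illustration of what "ignoring the upper byte" would do to the Greek sample, the sketch below does the arithmetic in plain Python. It only shows the numbers involved and is not a description of Jython's actual code path.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u"\u03bb\u03cc\u03b3\u03bf\u03c2"  # the Greek sample, as code points

# UTF-8 encoding: two bytes per character for this text.
utf8_bytes = list(bytearray(text.encode("utf-8")))

# Keeping only the low byte of each code unit discards the 0x03 high byte
# shared by these Greek letters, so the result is no longer the same text.
low_bytes = [ord(ch) & 0xFF for ch in text]

print(utf8_bytes)
print(low_bytes)
```

The two lists differ in both length and content, which is why code that treats the String as a sequence of bytes cannot round-trip this text.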

The work-around appears to be, if you can't simply use str and unicode, to be careful how you handle the String.
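One way to stay with str and unicode might look like the sketch below: decode byte strings to unicode once, using the encoding declared in the script. The names SCRIPT_ENCODING and to_text are hypothetical, not part of Jython. It is also an assumption here, not something confirmed in this thread, that Jython passes a unicode object correctly to a Java method expecting java.lang.String.

```python
# -*- coding: utf-8 -*-

# Hypothetical constant: keep it in step with the coding line of the script.
SCRIPT_ENCODING = "utf-8"


def to_text(value, encoding=SCRIPT_ENCODING):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):  # bytes is str on Python 2.7
        return value.decode(encoding)
    return value  # already text, leave unchanged


gk = to_text(b"\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82")
```

Where the text is a literal in the script, writing it as a u"..." literal should avoid the decode step altogether.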

This was referenced Jun 25, 2022

PyString with non-byte value while installing pip #20

Closed

PyString with non-byte value in formatting of collections #192

Closed
