Converting Chinese text from GB18030 to UTF-8 doesn't work #3411

Closed
tpuddi opened this Issue Oct 22, 2015 · 4 comments

tpuddi commented Oct 22, 2015

Chinese characters encoded in GB18030 cannot be converted to UTF-8. If the source encoding is GB2312, everything works properly.

Here is an example:

require 'base64'

chinese = Base64.decode64('yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K')
puts chinese.force_encoding('GB2312').encode('UTF-8', invalid: :replace)

chinese = Base64.decode64('yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K')
puts chinese.force_encoding('GB18030').encode('UTF-8', invalid: :replace)

Linux system:

Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

When I run these lines on ruby 2.1.6p336 (2015-04-13 revision 50298) [x86_64-linux], I get the following results:

省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网
省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网

As you can see, both lines return the same content.

But if I run these lines on jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +jit [linux-amd64], I get a different result:

省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网
������� ��360��������-�����

As you can see, the second line fails to convert the string to UTF-8: the Chinese characters come out as replacement characters.
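The comparison can also be made self-checking. The following is only a minimal sketch, assuming the same Base64 payload as in the report; the constant `ENCODED` and the helper `to_utf8` are names made up for illustration. It drops the `invalid: :replace` option so that a failing transcoder raises an exception instead of silently emitting U+FFFD, and it then compares the two results. On MRI the two strings should be equal, as in the output reported here; what JRuby does in this case (raise or return a different string) is not confirmed by this report.

```ruby
require 'base64'

# Same Base64 payload as in the report (hypothetical constant name).
ENCODED = 'yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K'

# Decode the payload, tag it with the given source encoding, and transcode
# it to UTF-8. No invalid:/undef: options are passed, so String#encode
# raises on a bad byte sequence instead of inserting U+FFFD.
def to_utf8(source_encoding)
  raw = Base64.decode64(ENCODED)
  raw.force_encoding(source_encoding).encode('UTF-8')
end

from_gb2312  = to_utf8('GB2312')
from_gb18030 = to_utf8('GB18030')

puts "same result:           #{from_gb2312 == from_gb18030}"
puts "contains U+FFFD:       #{from_gb18030.include?("\uFFFD")}"
puts "result encoding:       #{from_gb18030.encoding}"
```

If `to_utf8('GB18030')` raises or the first line prints `false`, the GB18030 transcoder is the likely culprit rather than the input data, since the same bytes convert cleanly when tagged as GB2312.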

I would be very thankful for any help on this issue.

naag commented Nov 3, 2015

We're still having this issue with JRuby 9.0.3.0. Could this be related to our environment?

Member

enebo commented Nov 3, 2015

@naag Venturing a guess: we may be a little out of date in our oniguruma translation tables, and perhaps there is a bug somewhere. JRuby 9k uses this port for all transcoding, and it should work as well as MRI.
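One way to probe the guess about the translation tables is to ask the runtime which conversion path it would use for each source encoding. This is a minimal sketch using the standard `Encoding::Converter.search_convpath`; the helper name `conversion_path` is made up for illustration, and whether JRuby 9.0.x reports a path for GB18030 is not something this thread confirms.

```ruby
# Return the list of transcoding steps Ruby would take from the given
# source encoding to UTF-8, or nil if no converter is registered.
def conversion_path(source_encoding)
  Encoding::Converter.search_convpath(source_encoding, 'UTF-8')
rescue Encoding::ConverterNotFoundError
  nil
end

%w[GB2312 GB18030].each do |source|
  path = conversion_path(source)
  if path
    puts "#{source} -> UTF-8: #{path.inspect}"
  else
    puts "#{source} -> UTF-8: no converter found"
  end
end
```

A registered path only shows that a transcoder exists; it does not show that its tables are correct, so this check complements the string comparison rather than replacing it.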

enebo added this to the JRuby 9.0.5.0 milestone Nov 3, 2015

enebo added the encoding label Nov 3, 2015

Member

headius commented Jan 19, 2016

I'll have a look at the GB18030 transcoding stuff and see if we're missing something.

naag commented Feb 4, 2016

<3 Thanks a lot guys!
