Convert chinese encoding GB18030 to UTF-8 doesn't work #3411

Closed
tpuddi opened this issue Oct 22, 2015 · 4 comments

@tpuddi tpuddi commented Oct 22, 2015

Chinese characters encoded in GB18030 cannot be converted to UTF-8. If the source encoding is GB2312, everything works properly.

Here is an example:

require 'base64'

chinese = Base64.decode64('yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K')
puts chinese.force_encoding('GB2312').encode('UTF-8', invalid: :replace)

chinese = Base64.decode64('yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K')
puts chinese.force_encoding('GB18030').encode('UTF-8', invalid: :replace)

Linux system:

Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

When I run these lines on ruby 2.1.6p336 (2015-04-13 revision 50298) [x86_64-linux], I get the following results:

省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网
省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网

As you can see, both lines return the same content.

But if I run these lines on jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +jit [linux-amd64] I get a different result:

省钱省油更省心 风光360舒适版满足你所爱-奥杰汽车网
������� ��360��������-�����

As you can see, the second line fails to convert the string to UTF-8: every Chinese character comes out as a replacement character, and only the ASCII characters survive.
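
To compare the two runtimes more precisely than by eyeballing the output, the same payload can be summarized per source encoding. This is only a rough diagnostic sketch: `transcode_report` is a hypothetical helper name, not part of Ruby or JRuby, and it adds `undef: :replace` (not used in the snippet above) so that unmapped characters are counted as well. It reports whether the bytes are well-formed in the source encoding and how many U+FFFD replacement characters appear after conversion.

```ruby
require 'base64'

# Same Base64 payload as in the report.
PAYLOAD = 'yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K'
# Character inserted by invalid: :replace / undef: :replace for a UTF-8 target.
REPLACEMENT = "\uFFFD"

# Hypothetical helper (an assumption for illustration): tags the raw bytes
# with the given source encoding, converts them to UTF-8, and summarizes
# what happened.
def transcode_report(bytes, source_encoding)
  tagged = bytes.dup.force_encoding(source_encoding)
  utf8 = tagged.encode('UTF-8', invalid: :replace, undef: :replace)
  {
    source: source_encoding,
    valid_input: tagged.valid_encoding?,   # are the bytes well-formed?
    replaced: utf8.count(REPLACEMENT),     # how many characters were lost?
    output: utf8
  }
end

raw = Base64.decode64(PAYLOAD)
reports = %w[GB2312 GB18030].map { |name| transcode_report(raw, name) }

reports.each do |report|
  puts format('%-8s valid_input=%-5s replaced=%d',
              report[:source], report[:valid_input], report[:replaced])
end
```

If `valid_input` is true but `replaced` is non-zero for GB18030, the input bytes are fine and the loss happens inside the transcoder, which matches the JRuby output shown above.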

I would be very thankful for any help on this issue.


@naag naag commented Nov 3, 2015

We're still having this issue with JRuby 9.0.3.0. Could this be related to our environment?

@enebo enebo (Member) commented Nov 3, 2015

@naag Venturing a guess: we may be a little out of date in our Oniguruma translation tables, or perhaps there is a bug somewhere. JRuby 9k uses this port for all transcoding, and it should work as well as MRI.
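
One way to narrow this down, sketched under the assumption that the problem lies in the GB18030 to UTF-8 transcoder itself: ask the runtime which conversion path it selects, then run a strict conversion without `invalid: :replace` so that the first failure is raised instead of being masked. `first_conversion_error` is a hypothetical helper name; `Encoding::Converter.search_convpath` and `Encoding::Converter#convert` are standard Ruby APIs.

```ruby
require 'base64'

# Raw bytes from the report's Base64 payload.
SAMPLE = Base64.decode64('yqHHrsqh0824/Mqh0MQgt+e54jM2MMrmysqw5sL61+PE48v5sK4tsMK93Mb7s7XN+A0K')

# Hypothetical helper (an assumption for illustration): performs a strict
# conversion and returns nil on success, or the raised error so the
# offending bytes can be inspected.
def first_conversion_error(bytes, source_encoding)
  converter = Encoding::Converter.new(source_encoding, 'UTF-8')
  converter.convert(bytes.dup.force_encoding(source_encoding))
  nil
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
  e
end

# Conversion path the runtime selects; raises
# Encoding::ConverterNotFoundError if no path exists at all.
path = Encoding::Converter.search_convpath('GB18030', 'UTF-8')
puts "conversion path: #{path.inspect}"

error = first_conversion_error(SAMPLE, 'GB18030')
puts(error ? "#{error.class}: #{error.message}" : 'strict conversion succeeded')
```

Running this on both MRI and JRuby and comparing the two printed lines would show whether the runtimes disagree on the path or on the conversion result.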

@enebo enebo added this to the JRuby 9.0.5.0 milestone Nov 3, 2015
@enebo enebo added the encoding label Nov 3, 2015

@headius headius (Member) commented Jan 19, 2016

I'll have a look at the GB18030 transcoding stuff and see if we're missing something.


@naag naag commented Feb 4, 2016

<3 Thanks a lot guys!
