Psych YAML parser cannot parse uppercase ÄÖÜ, but can parse lowercase äöü #483

Closed
spider-network opened this issue Jan 7, 2013 · 9 comments
@spider-network spider-network commented Jan 7, 2013

My Env

java -version
java version "1.7.0_09"
OpenJDK Runtime Environment (IcedTea7 2.3.3) (7u9-2.3.3-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
ruby -v
jruby 1.7.1 (1.9.3p327) 2012-12-03 30a153b on OpenJDK 64-Bit Server VM 1.7.0_09-b30 [linux-amd64]

It also does not work with JRuby 1.7.2, but it works with normal MRI ;-(

How to reproduce the bug

test.yml

de:
  test: "Ä"

test.rb

require 'yaml'

YAML.parse(open("test.yml").read)

puts "Done"

Error

ruby test.rb
Psych::SyntaxError: (<unknown>): 'reader' unacceptable character '?' (0x84) special characters are not allowed
in "'reader'", position 14 at line 0 column 0
         parse at org/jruby/ext/psych/PsychParser.java:225
  parse_stream at /home/vagrant/.rvm/rubies/jruby-1.7.2/lib/ruby/1.9/psych.rb:205
         parse at /home/vagrant/.rvm/rubies/jruby-1.7.2/lib/ruby/1.9/psych.rb:153
        (root) at test.rb:3

Does anyone know a workaround?

Author

@spider-network spider-network commented Jan 7, 2013

On my Mac I have the Oracle Java version and it works there, but it does not work on my Ubuntu server with OpenJDK ;-(

java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06-434-11M3909)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)

Member

@BanzaiMan BanzaiMan commented Jan 7, 2013

Can you check your system encoding (e.g., locale)? Also, try enforcing UTF-8 in test.rb.
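A minimal sketch of both checks, for anyone following along. It prints the encodings Ruby assumes by default and then reads the YAML with an explicit external encoding instead of relying on the locale. The temporary file is only a stand-in for test.yml so the snippet is self-contained; it is not part of the original report.

```ruby
require 'yaml'
require 'tempfile'

# Encodings Ruby assumes when nothing is specified. default_external is
# derived from the locale (LANG / LC_ALL) unless overridden.
puts "default_external: #{Encoding.default_external}"
puts "default_internal: #{Encoding.default_internal.inspect}"

# Stand-in for test.yml (hypothetical sample, written as raw UTF-8 bytes).
sample = Tempfile.new(['test', '.yml'])
sample.binmode
sample.write("de:\n  test: \"\u00C4\"\n")
sample.close

# Read with an explicit encoding so the locale no longer matters.
text = File.read(sample.path, encoding: 'UTF-8')
value = YAML.parse(text).to_ruby['de']['test']
puts value

sample.unlink
```

Note that a `# encoding: UTF-8` magic comment sets the encoding of string literals in the source file; it does not change the encoding used for file reads.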

Member

@enebo enebo commented Jan 8, 2013

I am a little confused about how this should work. The open(...).read call will read in the file with a particular encoding, and YAML expects it to be UTF-8 or one of the two UTF-16 variants. So what happens if your default encoding is not UTF-* on the read? If it is ASCII, or the ASCII-8BIT bytes (accented chars) end up being valid UTF-* characters, then you should see this error. I guess that could explain the error if, as Hiro suggests, your encoding is not UTF-8 on Ubuntu (LANG is also worth checking).
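One possible explanation for why only the uppercase umlaut fails, sketched below under the assumption that the UTF-8 file is being decoded as ISO-8859-1 (nothing in the thread confirms this). The second UTF-8 byte of "Ä" is 0x84, which in ISO-8859-1 is a C1 control character, while the second byte of "ä" is 0xA4, which is a printable symbol. The `misread_as_latin1` and `yaml_printable?` helpers are illustrative names; the latter is a simplified copy of the printable-character ranges from the YAML specification.

```ruby
# "Ä" and "ä" as UTF-8 strings.
upper = "\u00C4"
lower = "\u00E4"

p upper.bytes.map { |b| format('0x%02X', b) } # ["0xC3", "0x84"]
p lower.bytes.map { |b| format('0x%02X', b) } # ["0xC3", "0xA4"]

# Simulate decoding the UTF-8 bytes as ISO-8859-1 (assumed misconfiguration).
def misread_as_latin1(str)
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Simplified printable-character check based on the YAML specification.
def yaml_printable?(codepoint)
  [0x09, 0x0A, 0x0D, 0x85].include?(codepoint) ||
    (0x20..0x7E).cover?(codepoint) ||
    (0xA0..0xD7FF).cover?(codepoint) ||
    (0xE000..0xFFFD).cover?(codepoint) ||
    (0x10000..0x10FFFF).cover?(codepoint)
end

upper_misread = misread_as_latin1(upper).codepoints
lower_misread = misread_as_latin1(lower).codepoints

# True only if every resulting code point is allowed in a YAML stream.
upper_ok = upper_misread.all? { |cp| yaml_printable?(cp) }
lower_ok = lower_misread.all? { |cp| yaml_printable?(cp) }

puts "misread upper acceptable to YAML: #{upper_ok}"
puts "misread lower acceptable to YAML: #{lower_ok}"
```

Under that assumption the parser would reject the misread "Ä" at code point 0x84, which matches the value in the reported error message, and would silently accept the misread "ä" as two wrong characters.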

Author

@spider-network spider-network commented Jan 9, 2013

locale

vagrant@precise64:~$ locale
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=en_US

It also does not work when enforcing UTF-8:

vagrant@precise64:~/metrigo/jruby-bug/yaml$ ruby -v
jruby 1.7.1 (1.9.3p327) 2012-12-03 30a153b on OpenJDK 64-Bit Server VM 1.7.0_09-b30 [linux-amd64]
vagrant@precise64:~/metrigo/jruby-bug/yaml$ cat test.rb
# encoding: UTF-8
require 'yaml'

YAML.parse(open("test.yml").read)

puts "Done"
vagrant@precise64:~/metrigo/jruby-bug/yaml$
vagrant@precise64:~/metrigo/jruby-bug/yaml$ ruby test.rb
Psych::SyntaxError: (<unknown>): 'reader' unacceptable character '?' (0x84) special characters are not allowed
in "'reader'", position 14 at line 0 column 0
         parse at org/jruby/ext/psych/PsychParser.java:225
  parse_stream at /home/vagrant/.rvm/rubies/jruby-1.7.1/lib/ruby/1.9/psych.rb:205
         parse at /home/vagrant/.rvm/rubies/jruby-1.7.1/lib/ruby/1.9/psych.rb:153
        (root) at test.rb:4

Member

@BanzaiMan BanzaiMan commented Jan 9, 2013

I still can't reproduce this. There is something else at play. Could you try without RVM?

$ ./bin/jruby -v
jruby 1.7.1 (1.9.3p327) 2012-12-03 30a153b on OpenJDK 64-Bit Server VM 1.7.0_09-b30 [linux-amd64]
$ cat test.yml 
de:
  test: "Ä"
$ cat test.rb
require 'yaml'

YAML.parse(open("test.yml").read)

puts "Done"
$ ./bin/jruby test.rb
Done
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ uname -a
Linux ip-10-196-35-92 3.2.0-31-virtual #50-Ubuntu SMP Fri Sep 7 16:36:36 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

@edzhelyov edzhelyov commented Jan 25, 2013

I have a similar problem, but with translations (YAML files) from a 3rd party service. The file is downloaded, and the characters appear as <80>, <88> when I open it with vim.

I tried to reproduce the test case from the author and everything works for me, so I suspect that maybe it depends on the source encoding of the YAML file...

@edzhelyov edzhelyov commented Jan 25, 2013

My problem is related to the encoding that RestClient (possibly Net::HTTP) sets on file attachments. CRuby (1.9.3-p374) behaves differently from JRuby in my case: CRuby sets the response.body encoding to UTF-8, while JRuby sets it to "ASCII-8BIT".

Now I'm unsure whether this is actually a bug and in which cases it happens. I don't understand how Net::HTTP should behave when downloading files, and I'm not aware of what the HTTP specification says about encoding in these cases.

I couldn't reproduce the exact behavior against a public domain, and it seems to happen only with specific server responses. So in some cases, unknown to me, the server can specify the file encoding and CRuby acknowledges it.

If someone knows more on this subject I'm happy to discuss it further, so I can isolate a specific case.
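A possible workaround for the ASCII-8BIT case, sketched under the assumption that the downloaded bytes really are UTF-8 and are merely labeled as binary. The `body` string below is a hypothetical stand-in for a response body, and `utf8_yaml_text` is an illustrative helper name, not part of RestClient or Net::HTTP.

```ruby
require 'yaml'

# Hypothetical stand-in for a response body that arrives tagged as ASCII-8BIT.
body = "de:\n  test: \"\u00DC\"\n".dup.force_encoding(Encoding::ASCII_8BIT)

# Relabel the bytes as UTF-8 (no transcoding) and verify them before parsing.
def utf8_yaml_text(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'body is not valid UTF-8' unless text.valid_encoding?
  text
end

text = utf8_yaml_text(body)
parsed = YAML.parse(text).to_ruby
puts parsed['de']['test']
```

The `valid_encoding?` check matters because `force_encoding` only changes the label on the string; if the bytes are in some other encoding, the sketch raises instead of handing mislabeled text to the parser.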

Member

@headius headius commented Jul 25, 2013

I think there's an encoding mismatch at play here. None of us can reproduce this, and one commenter theorized it could be an issue with a badly encoded YAML source. When the file is encoded as UTF-8 and read as UTF-8, it appears to parse just fine.

If you can find a way for us to reproduce this, feel free to reopen.

@pvmeerbe pvmeerbe commented Nov 30, 2016

I encountered the same issue. I'm sending a YAML dump containing an Ü from MRI Ruby v2.2.3 to TorqueBox v3.1.2 running JRuby v1.7.8 via an API call (Java version is 1.8.0_91-b14).

Upgrading to TorqueBox v3.2.0 with JRuby v9.1.5.0 solved the issue.

Locale info :
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_BE.UTF-8
LC_TIME=de_BE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_BE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_BE.UTF-8
LC_NAME=de_BE.UTF-8
LC_ADDRESS=de_BE.UTF-8
LC_TELEPHONE=de_BE.UTF-8
LC_MEASUREMENT=de_BE.UTF-8
LC_IDENTIFICATION=de_BE.UTF-8
LC_ALL=

6 participants