Support UTF-8 for JSON #7
Sorry, I wanted it to be a separate PR since DMR-6 is an actual bug fix. I would at least like to see DMR-6 make it into the next release. Let me know if you want individual PRs for each issue.
-    for (int i = 1; i < length - 1; i = yyText.offsetByCodePoints(i, 1)) {
-        int ch = yyText.codePointAt(i);
+    for (int i = 1; i < length - 1; i++) {
+        char ch = yyText.charAt(i);
Why change these to not use code points?
I don't think we need it. I believe that if you have characters outside the BMP (which is my understanding of code points, but I am a novice in this area), then they need to be represented by two Unicode characters in JSON.
I guess it depends on the encoding in which the parser reads the content. I'm ok leaving it in if you think it solves a problem. A test to prove this code wrong would be ideal :)
Since you changed it to expect UTF-8 input, I think we should assume that the text might contain any code point.
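To make the difference between the two loops concrete, here is a minimal sketch comparing per-`char` and per-code-point iteration over a string that contains one character outside the BMP. The class and method names are hypothetical and are not part of this PR; only `charAt`, `codePointAt`, and `offsetByCodePoints` come from the diff above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the parser under review.
class CodePointDemo {
    // Walks the string one UTF-16 code unit at a time, as the charAt loop does.
    static List<Integer> byChar(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add((int) s.charAt(i));
        }
        return out;
    }

    // Walks the string one code point at a time, as the offsetByCodePoints loop does.
    static List<Integer> byCodePoint(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            out.add(s.codePointAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(byChar(s));      // four values: the pair shows up as two code units
        System.out.println(byCodePoint(s)); // three values: the pair is combined into one code point
    }
}
```

Whether the distinction matters for the parser depends on what the loop body does with `ch`. If it only looks for ASCII characters such as quotes and backslashes and copies everything else through, both loops should produce the same output.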
Looks good.
Any update or status for this? It would be nice to include it in a release in the not-so-distant future :)
Went ahead and took a stab at supporting UTF-8 for both parsing and displaying JSON. I left the readFromString method alone (still using US-ASCII).
I looked at a few libraries (org.json, json-smart, Jackson, and Jettison), and they all behave slightly differently when it comes to escaping Unicode characters.
Jettison seemed to be the most accurate, and it meets the requirements of the GateIn team with respect to localization. It follows http://www.ietf.org/rfc/rfc4627.txt closely, as it only escapes Unicode characters \u0000 through \u001F. It does not, however, escape the forward slash.
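As a rough illustration of the escaping behavior described above, the sketch below escapes only the quotation mark, the backslash, and the control characters U+0000 through U+001F, and writes everything else (including the forward slash and non-ASCII text) unchanged. It is an assumption-laden sketch of that rule, not Jettison's code and not the implementation in this PR; the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the minimal escaping rule from RFC 4627.
class JsonEscapeSketch {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (ch < 0x20) {
                        // Remaining control characters get a four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) ch));
                    } else {
                        // Everything else, including '/' and non-ASCII text, is copied as is.
                        sb.append(ch);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00e9 a/b \u0001"));
    }
}
```

Because surrogate halves are copied through untouched, iterating by `char` is enough for this particular rule. The output still has to be written with a UTF-8 encoder for the non-ASCII characters to survive.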