Guidance on JSON escaping for non-ascii characters. #46

Open
tomchristie opened this issue Sep 4, 2014 · 11 comments
Comments

@tomchristie
Contributor

JSON allows both escaped and unescaped non-ASCII characters.

It'd be useful for this document to include guidance on which style is preferred, or if there is no preference.

For example, the following is valid JSON:

{
     "unicode black star": "\u2605"
}

As is the unescaped variant:

{
    "unicode black star": "★"
}
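
As a quick check that the two documents really are equivalent to a parser, here is a minimal sketch using Python's standard `json` module (the variable names are only for illustration):

```python
import json

escaped = '{"unicode black star": "\\u2605"}'   # contains the six ASCII characters \u2605
unescaped = '{"unicode black star": "★"}'       # contains the literal character U+2605

# Both documents parse, and both yield the same Python string.
decoded_escaped = json.loads(escaped)
decoded_unescaped = json.loads(unescaped)
print(decoded_escaped == decoded_unescaped)
```

So the choice only affects how the body looks on the wire, not what a conforming client decodes.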

I could see valid arguments for either case.

Happy, of course, if you consider this out of scope, but I know I'd value knowing another team's design preferences here.

@geemus
Member

geemus commented Sep 7, 2014

@tomchristie great question, and one I don't think we've run across in the wild, so I haven't given it much thought yet. At a glance, I suppose I might lean toward (assuming both are valid) the one that involves the least processing/change. That is, if we don't have a good reason to force an encode on our side (and the resulting decode on the client), it seems easier to save ourselves the trouble of doing it (and of remembering that it is required). That said, I haven't run into this before, so I fully expect there are nuances I'm missing. Could you elaborate a bit on the pro-encoding side of things and/or let me know what you think about the leave-it-be argument I've roughly set forward? Thanks!

@tomchristie
Contributor Author

> the one that involves the least processing/change

That'd be a valid option. What that actually means might depend on which frameworks and/or JSON encoding libs you're using for your various services. For example, see the differences between Rails 3.2.13 and Rails 4.0.

> Could you elaborate a bit on the pro-encoding side of things and/or let me know what you think about the leave-it-be argument I've roughly set forward?

The option that requires least thought is clearly to use escaped characters.

However it's nicer to users if the API presents un-escaped characters - that way command line tools such as curl which simply dump the response body will be outputting properly rendered text.
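
To make the difference concrete, here is a small sketch of what a tool that simply dumps the body would show in each case. It uses Python's `json.dumps` purely as an example encoder; other libraries will have their own switches.

```python
import json

payload = {"unicode black star": "★"}

escaped_body = json.dumps(payload)                        # ensure_ascii=True is the default
unescaped_body = json.dumps(payload, ensure_ascii=False)  # leave non-ASCII characters as-is

print(escaped_body)    # {"unicode black star": "\u2605"}
print(unescaped_body)  # {"unicode black star": "★"}
```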

It's not clear to me what further subtleties there might be around unescaped characters, though. For example, it's probably still a good idea to escape control characters in that case, as per this example. If leaving the output as UTF-8, what ranges would still need escaping? The current set used in Rails might be an okay choice, but it's not obvious.

(Note: I'm only using Rails as an example here, as it's the one place I've noticed where there's actually been some kind of conscious design decision.)

@geemus
Member

geemus commented Sep 9, 2014

Yeah, I think it is definitely helpful to reference places where some effort already went into making this decision. Unescaped does seem likely to work in the most places (without extra work), at least at a guess. Control characters are an odd case, but perhaps there would be cases where an API would want to include them for curl or something? It would be pretty weird, but maybe possible. You could argue that in most cases you probably shouldn't be including these in API responses in the first place, I suppose.

Anyway, it seems like we are still leaning more toward unescaped/raw, if I'm not mistaken. I'd maybe even say that we could leave it as that generic recommendation, and defer the question of whether some narrower character set should be escaped until somebody runs up against it more explicitly, so we have clearer examples/inputs to work from. What do you think?

@tomchristie
Contributor Author

Coming back to this shortly, but in the meantime referencing Python's behaviour with ensure_ascii=False...

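The standard JSON escape characters are escaped using their short forms (i.e. not the hex version):

  • " (escaped as \")
  • \ (escaped as \\)
  • \b
  • \f
  • \n
  • \r
  • \t

The solidus (/) also has a short form in JSON (\/), but Python's encoder leaves it unescaped.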

The following characters are escaped using the hex notation:

  • \x00 - \x1F (except where listed above)
  • \u2028 and \u2029 (in simplejson; the standard library's json module leaves these two unescaped)

Everything else is regular unicode.

(Linking to simplejson because I can't find a link to the Python source code at the moment: https://github.com/simplejson/simplejson/blob/master/simplejson/encoder.py#L21)
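
For reference, the behaviour described above can be observed with a short sketch against the standard library's `json` module (not simplejson, which may differ in details such as U+2028/U+2029). The `encode` helper is just a hypothetical wrapper for this illustration.

```python
import json

def encode(text):
    """Encode one string, leaving non-ASCII characters as-is."""
    return json.dumps(text, ensure_ascii=False)

print(encode('quote " and backslash \\'))  # short forms \" and \\
print(encode("tab\tnewline\n"))            # short forms \t and \n
print(encode("\x00\x1f"))                  # hex forms \u0000 and \u001f
print(encode("a/b"))                       # the solidus is left alone
print(encode("★"))                         # non-ASCII passes through unescaped
```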

@tomchristie
Contributor Author

Noting that JSON requires \x00 - \x1F to always be escaped.

http://www.ietf.org/rfc/rfc4627.txt
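
A quick way to see this requirement in practice, sketched with Python's `json.loads` in its default strict mode (other parsers may be more lenient): a raw control character inside a string is rejected, while its escaped form is accepted.

```python
import json

raw_tab = '"a\tb"'        # a literal TAB (U+0009) between the quotes
escaped_tab = '"a\\tb"'   # the two characters \ and t between the quotes

try:
    json.loads(raw_tab)
    raw_accepted = True
except json.JSONDecodeError:
    # The strict parser refuses unescaped control characters in strings.
    raw_accepted = False

print(raw_accepted)                       # False
print(json.loads(escaped_tab) == "a\tb")  # True
```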

@tomchristie
Contributor Author

Relevant link on U+2028 and U+2029: http://stackoverflow.com/questions/2965293/javascript-parse-error-on-u2028-unicode-character
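
If an encoder leaves these two characters raw (Python's standard `json` module does with `ensure_ascii=False`, as far as I can tell), one possible workaround is to escape them after encoding. This is only a sketch, and `dumps_js_safe` is a made-up name, not an existing API.

```python
import json

def dumps_js_safe(obj):
    """Hypothetical helper: raw Unicode output, but with U+2028/U+2029 escaped."""
    text = json.dumps(obj, ensure_ascii=False)
    # In JSON these two code points can only appear inside strings, so a
    # plain textual replacement cannot damage the document structure.
    return text.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")

body = dumps_js_safe({"note": "line\u2028separator ★"})
print(body)
```

The result still decodes to the original data, and other non-ASCII characters such as ★ stay unescaped.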

Do shout me down if I'm being too verbose :) It seems best to put these things down for future reference.

@tomchristie
Contributor Author

Not sure whether this is relevant, but here is JSLint on 'unsafe' characters that should be escaped (in the context of a browser): http://www.jslint.com/lint.html#unsafe

@tomchristie
Contributor Author

Okay, my thoughts after all that:

I guess I'd recommend treating either style as okay, with unescaped Unicode preferred because it is more user-friendly when displayed. It would probably be fine to underspecify the required escapes, perhaps simply noting that control characters do need escaping, as per the JSON spec.

Alternatively, consider this as out-of-scope and offer no guidance one way or the other. (Also not unreasonable)

@geemus
Member

geemus commented Sep 15, 2014

I'd be up for recommending unicode + escaped control characters. I think that seems reasonable and a good thing to note (thanks for the detailed references/notes). Would you be up for a pull request with the related verbiage? Thanks!

@tomchristie
Contributor Author

Sure thing, consider it on my todo list.
Feel free to nudge me again on here if not done in the next week or so.

@geemus
Member

geemus commented Sep 18, 2014

@tomchristie no worries, certainly no hurry here. In the meantime we can definitely refer people to this discussion if it comes up. It will just be nice to polish it up and get it in there when you have a moment. Thanks!


2 participants