Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset #150

Open
maxcom opened this issue Jun 23, 2017 · 18 comments

Comments

@maxcom
Copy link
Contributor

maxcom commented Jun 23, 2017

JSON in HTTP are always encoded in UTF-8. Responses are parsed correctly when server writes content type header like application/json; charset=utf-8.

However, many servers (like Play framework itself) uses application/json without charset. In Play 2.6, that responses are parsed in ISO-8859-1 charset.

I think that problem is in JsonBodyReadables implementation. Json.parse(response.body) should be replaced with Json.parse(response.bodyAsBytes.utf8string). Charset for response.body is always ISO-8859-1 in async http client when no charset info is provided in content type header.

@wsargent
Copy link
Member

Good catch.

@gmethvin
Copy link
Member

gmethvin commented Jun 23, 2017

Technically, JSON is not always UTF-8, but it is always Unicode, and application/json does not define a charset parameter. I would use Json.parse(response.bodyAsBytes.toArray) directly, since it will detect the encoding.

@jsw
Copy link
Contributor

jsw commented Jun 27, 2017

I've run into the same issue, where response.body does not correctly handle an emoji character. I'm not using play json to map the response body (manual jackson with scala module). Is the appropriate approach to use response.bodyAsBytes.toArray to get the raw bytes, which I can pass to Jackson, and use new String(response.bodyAsBytes.toArray) to get something for a log statement? I'm guessing that response.body is causing play to parse to JSON and then output to String, which is causing the encoding issue.

@Malax
Copy link

Malax commented Jul 4, 2017

Is there an ETA on a release that includes this fix?

@wsargent
Copy link
Member

wsargent commented Jul 4, 2017

@Malax within the next couple of weeks -- however, do note that due to the way that the standalone client works, it's fairly easy to create your own custom body readables and writables as a workaround.

@Malax
Copy link

Malax commented Jul 5, 2017

@wsargent Thanks for the reply! 👍 If it's in the next couple of weeks, we will just delay the migration to 2.6 until then.

@jsw
Copy link
Contributor

jsw commented Jul 5, 2017

@wsargent Can you comment on my workaround above for accessing the raw bytes when I don't want to use the play JSON parser?

@wsargent
Copy link
Member

wsargent commented Jul 5, 2017

@jsw You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

@avdv
Copy link

avdv commented Jul 6, 2017

You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

@jsw
Copy link
Contributor

jsw commented Jul 7, 2017

Unless I'm doing things wrong, I think the issue still exists in play-ws 1.0.1.

import org.scalatestplus.play.PlaySpec
import play.api.mvc.Action
import play.api.mvc.Results.Ok
import play.api.test.Helpers._
import play.api.test.WsTestClient
import play.core.server.Server
import scala.concurrent.ExecutionContext.Implicits.global

class JsonEncodingTest extends PlaySpec {
  "ws" should {
    "handle utf8" in {
      Server.withRouter() {
        case _  => Action { Ok("""{"foo": "👍"}""").as(JSON) }
      } { implicit port =>
        WsTestClient.withClient { client =>
          println(await(client.url("/").get().map(_.body.toString))) // unexpected output
          println(await(client.url("/").get().map(r => new String(r.bodyAsBytes.toArray)))) // expected output
        }
      }
    }
  }
}

Output

{"foo": "�"}
{"foo": "👍"}

@wsargent
Copy link
Member

wsargent commented Jul 8, 2017

You should be calling

client.url("/").get().map(r => Json.parse(r.bodyAsBytes.toArray))

@wsargent
Copy link
Member

wsargent commented Jul 8, 2017

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

In theory yes, in practice array copying usually doesn't yield much in the way of performance improvements because the JVM is pretty effective at array copying:

https://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html

If you're interested, you can poke at the JVM overhead with JMH:

https://psy-lob-saw.blogspot.com/p/jmh-related-posts.html

There's a sbt plugin at https://github.com/ktoso/sbt-jmh that will help.

@jsw
Copy link
Contributor

jsw commented Jul 8, 2017

@wsargent My use cases are:

  • obtain a String or Array[Byte] to pass to Jackson mapper (I'm not using play-json)
  • obtain a String suitable for logging the response body

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

@avdv
Copy link

avdv commented Jul 10, 2017

@wsargent Thanks for that valuable information. 👍

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Json.parse detects the encoding of the data and decodes them appropriately.

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

r.body decodes the data using the iso-8859-1 encoding which is the default behaviour of the AsyncHttpClient, AFAIK.

new String(...) uses the default platform encoding to decode the data, which is probably UTF-8 on Linux, CP1250 on Windows, et cetera.

@gmethvin
Copy link
Member

gmethvin commented Aug 14, 2017

The remaining problem being discussed here is basically that the body: String method is not smart enough to understand the rules of every content type, and I'm not sure if it should be. There are a few options here:

  • Restore the old behavior in which we read the charset parameter (regardless of whether the media type defines one) and default to UTF-8 for non text content types and ISO-8859-1 for text. That would cover a lot of cases, but it would still be broken for JSON sent as UTF-16, and virtually any other content type that allows charsets to be declared in another way (e.g. HTML can have charsets embedded in meta tags).
  • Instead of having a default implicit BodyReadable[String], we can have multiple String BodyReadables for specific cases, including one for parsing a JSON body to a string. That could use the logic here for charset detection: https://github.com/playframework/play-ws/pull/151/files/9155ef716b3c1d5699783e2aa2edf1c8b8260970#diff-0b357986f490a94794eef6c192426fd9R29

TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7)
See: playframework/play-ws#150
TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
TBonnin added a commit to guardian/frontend that referenced this issue Sep 8, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
TBonnin added a commit to guardian/frontend that referenced this issue Sep 12, 2017
'application/json' responses encoded using utf-8 are wrongly parsed
using ISO-8859-1 charset in play-ws (at least until v1.0.7) when using
WSResponse.body
See: playframework/play-ws#150
WSResponse.json doesn't seem to suffer from the same problem.
@tartakynov
Copy link

The issue still exists in play-ws 1.1.1. Another possible solution would be to change the body type to ByteString 🙊

@fommil
Copy link

fommil commented Feb 16, 2018

somewhat related... how can we make the server include the charset instead of trying to be too smart for its own good?

@richdougherty
Copy link
Member

See possible fix on the Play server side: playframework/playframework#8239

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants