Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to UTF-8 when reading generated source file #9052

Merged
merged 2 commits into from Mar 15, 2019

Conversation

Projects
None yet
4 participants
@mkurz
Copy link
Member

commented Feb 20, 2019

Idea how to fix #9022.

Just fallback to UTF-8 if a generated source file could not be read/decoded with the system default charset.
I think most of the time that should do the job.

The problem in #9022 is that the system charset is GBK, however a generated source file we want to read (that maybe is not generated by the routes compiler at all) probably contains characters that does not fit into the GBK encoding. Reading the file as UTF-8 should be a valid workaround in such cases.

Another idea:
Instead of this fallback fix I had the idea of just replacing implicitly[Codec].name with UTF-8 in RoutesCompiler.scala:

  • Here we would then generate all the files with UTF-8 encoding.
  • and when reading the file here with the UTF-8 charset again we could be sure the file was written as UTF-8 before.

However, I think the unapply method doesn't only get called for all files that were generated by the Files.write(...) statement in RoutesCompiler.scala, but also for every other generated source file which were not created by the routes compiler and therefore we do not have influence on the encoding with which such "other" generated files were created (The unapply is called as result of defining the sourcePositionMappers and kicks in here).
That's why I am not sure if we should just hardcode switch to UTF-8 in RoutesCompiler.scala...

What do you think?

@marcospereira

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

Thanks, @mkurz.

I would prefer to consistently use the same charset when generating/writing and reading the file. Do you have time to experiment with the second option described? I didn't had time to investigate, but I have the impression this is exactly what commons-io's FileUtils.writeStringToFile was doing before.

@mkurz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2019

@marcospereira Yes, I also think this is was FileUtils.writeStringToFile is doing.

However, I think for this issue the problem is not which charset is used to write the file, but which charset is used to read the file. Because, like I explained, the unapply method is not just called for files that are generated by the routes-compiler but also for each generated source file, no matter who generated it. Therefore it can happen that any component that can generated source files uses any charset (probably the system default one, or UTF-8 or whatever) to write a file and later the in unapply we are going to read it.

So the real problem is that we don't know what a source file's encoding is at the time we read it. Now I guess what is happening in the specific case is that there is a source file that got generated/written as UTF-8 (but not by by the route compiler), meaning it contained chars that GBK doesn't know about. Now the file was tried to read with GBK encoding which leads to the exception.

Switching to UTF-8 in Files.write(...) in the route compiler is ok, but it's not the essential thing that solves the problem.

Do you understand what I want to say?

@mkurz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2019

One more thing:

I would prefer to consistently use the same charset when generating/writing and reading the file...

With this pull request, this is still the case. We just fall back to UTF-8 to handle files that are not generated by the routes compiler but some other component and hope that UTF-8 can handle that file.

@marcospereira

This comment has been minimized.

Copy link
Member

commented Mar 1, 2019

@hey Matthias, yes, got your explanation.

I would prefer to consistently use the same charset when generating/writing and reading the file...

With this pull request, this is still the case. We just fall back to UTF-8 to handle files that are not generated by the routes compiler but some other component and hope that UTF-8 can handle that file.

You are right that in most of the cases we will use the same charset (system default one) to write and read the file, but then there is this fallback. My point, which reading now was not properly explained, is that why not use utf-8 instead of going to system default and falling back to utf-8 if it fails?

WDYT?

@mkurz mkurz added the in progress label Mar 2, 2019

@mkurz

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

@marcospereira I had another look at the issue and pushed a fix that - I am pretty sure - should fix the problem. Please just read the java comment I added to the fix, it should explain everything.

However, here are some more explanations.

Reading/writing the file with UTF-8 or the system charset is not the point here (more on that later). The problem was a different one:
What happend in the past (< Play 2.7) was that when trying to read a file via commons-io with a specific charset (in our case the system charset, but which one doesn't matter really) and the file didn't match that charset, was to not fail with an exception, but to just replace any unknown characters with placeholders. See this line that commons-io ends up in when creating the CharsetDecoder.

What's now different in Files.readLines is, that, by default, its internal CharsetDecoder is configured to REPORT problems instead of replacing. It's this line we end up in that case, which doesn't explicitly set anything, keeping the default reporting behaviour.

However, that means in the past the routes compiler did read files with "wrong" charsets without failing and ended up with garbled strings - in which it was looking for the "// @GENERATOR..." line (that will never be found in such "wrong" files)...

So both the commons-io and Java's Files.readLines use an InputStreamReader internally, with only one difference: The former is configured it to ignore (=replace) unknown characters and latter to report it (via exception).

Coming to your statement:

but I have the impression this is exactly what commons-io's FileUtils.writeStringToFile was doing before.

Actually that is wrong, FileUtils.writeStringToFile does use the encoding we did pass to it, not UTF-8, but, again: It's counterpart FileUtils.readFileToString is configured to ignore decoding errors.

Of course a solution would be to set up our own BufferedReader backed by an InputStreamReader that is explicitly configured to replace decoding errors (instead of reporting them) - which would mimic the same behaviour like in commons-io, however I think that isn't necessary because the fix I pushed is sufficient.

And now to the UTF-8 vs. system default charset discussion:
I would not switch to UTF-8. Why? Because the routes compiler generates a java/scala source file here and if the user doesn't set the encoding flag for javac/scalac, the system default encoding will be used for compiling. Meaning if the system and therefore javac uses GBK by default now reads the file we generated with UTF-8, I guess that could and will fail... That's why hardcoding to UTF-8 is not a good idea here I guess... WDYT?
Plus there were no complains in the past to use the system charset (Given that the problem in #9022 is a different one, like explained above)
Also, my initial fix (falling back to UTF-8) just doesn't make sense as well: If the file we try to read has a different encoding than the system default and also isn't UTF-8, we would still see the exception...

@dwijnand dwijnand removed the in progress label Mar 12, 2019

@marcospereira
Copy link
Member

left a comment

LGTM.

Thanks for the detailed explanations as always, Matthias.

@marcospereira marcospereira merged commit a57181b into playframework:master Mar 15, 2019

4 checks passed

Mergify — Rule: Merge PRs that are ready (merge) The pull request has been merged manually
Details
Mergify — Summary 2 rules match
Details
Travis CI - Pull Request Build Passed
Details
typesafe-cla-validator All users have signed the CLA
Details

marcospereira added a commit that referenced this pull request Mar 15, 2019

Fallback to UTF-8 when reading generated source file (#9052)
* Fallback to UTF-8 when reading generated source file

* Just ignore files with unknown encoding
@marcospereira

This comment has been minimized.

Copy link
Member

commented Mar 15, 2019

Backport to 2.7.x: b85fcae

@mkurz mkurz deleted the mkurz:charsetFix branch Mar 15, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.