Binary items with umlauts from custom data source are created and deleted right away #837

Closed
agross opened this issue Mar 6, 2016 · 12 comments

Contributor

@agross agross commented Mar 6, 2016

I'm currently migrating an old CMS to nanoc. We load most of the CMS content from an XML file. Some items (binary items) are loaded from the file system.

There is one binary file containing an "ß" character. The item rep (default) for that item gets created and deleted right away. All binary items are handled by a passthrough rule.

$ bundle exec nanoc
Loading site… done
Compiling site…
      create  [0.00s]  build/bin/files/media/image/team-alexander-groß.jpg
      delete  build/bin/files/media/image/team-alexander-groß.jpg

Site compiled in 2.55s.
Member

@ddfreyne ddfreyne commented Mar 6, 2016

Yikes! A wild guess, but this might be caused by Unicode normalisation being done differently in different places.

Do you have a test case that I can reproduce locally? If not, it’d be helpful if you could do some digging on your side and see whether you can isolate the issue. If my hunch is correct, pruner.rb:43 would show that present_files has a string normalised one way, and compiled_files has the same string normalised a different way.
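To make the normalisation hunch concrete, here is a minimal sketch of how two visually identical names can compare as different strings. The file name is hypothetical and only serves as an illustration; it is not taken from the site in question.

```ruby
# Hypothetical file name, written in two Unicode normalisation forms.
composed   = "gr\u00FC\u00DF.jpg"   # "ü" as one precomposed code point (NFC)
decomposed = "gru\u0308\u00DF.jpg"  # "u" followed by a combining diaeresis (NFD)

# A plain comparison sees two different code point sequences.
equal_raw = composed == decomposed

# Normalising both sides to the same form makes them comparable.
equal_nfc = composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)

puts "raw comparison:        #{equal_raw}"
puts "after NFC on both:     #{equal_nfc}"
```

If the two lists at pruner.rb:43 differed in this way, both strings would still report UTF-8 as their encoding, so checking the encoding alone would not rule this case in or out.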

@ddfreyne ddfreyne added this to the 4.1.5 milestone Mar 6, 2016
Contributor Author

@agross agross commented Mar 6, 2016

present_files contains

[ 18] "build/bin/files/media/image/team-alexander-gro\u00DF.jpg",

compiled_files contains

[115] "build/bin/files/media/image/team-alexander-gro\xE1.jpg"

Contributor Author

@agross agross commented Mar 6, 2016

I added p present_files.find { |e| e =~ /alex/ }.encoding for both collections, both yield #<Encoding:UTF-8> in the output.

Contributor Author

@agross agross commented Mar 6, 2016

Scratch that.

p present_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:UTF-8>
p compiled_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:IBM437>
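The mismatch above is enough to explain the create/delete pair. The following is a rough sketch, assuming the pruner deletes every present file that does not appear in the compiled list (the arrays are stand-ins, not Nanoc's actual data structures):

```ruby
# Stand-ins for the two lists compared by the pruner (assumed shape).
present_files = ["build/bin/files/media/image/team-alexander-gro\u00DF.jpg"]

# The same name, but carrying CP437 bytes and tagged as IBM437.
compiled_files = [
  "build/bin/files/media/image/team-alexander-gro\xE1.jpg"
    .b.force_encoding(Encoding::IBM437)
]

# Different bytes and different encodings: the strings are not equal.
same_name = present_files.first == compiled_files.first

# Array difference therefore keeps the file, marking it as stray.
to_delete = present_files - compiled_files

puts "same name?  #{same_name}"        # => false
puts "to delete:  #{to_delete.size}"   # => 1
```

Under that assumption, the freshly written file looks like an orphan to the pruner, which would match the `create` followed by `delete` in the original output.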

Contributor Author

@agross agross commented Mar 6, 2016

It seems this File.expand_path in my data source yields #<Encoding:IBM437>.

But even after changing the line to new_item(File.expand_path(file).encode(Encoding::UTF_8), ... the pruner still sees compiled_files.find { |e| e =~ /team-alex/ }.encoding as #<Encoding:IBM437>.

Contributor Author

@agross agross commented Mar 6, 2016

Might be related to this bug: https://bugs.ruby-lang.org/issues/9713

Contributor Author

@agross agross commented Mar 6, 2016

After copying the code from the issue above into a file in lib/ and into my spec_helper.rb, I see a slight difference between the two runs:

$ bundle exec nanoc
Loading site… Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:IBM437>

# and

$ bundle exec rspec # or rake
Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:UTF-8>
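For anyone who wants to produce a comparable report, a small script along these lines should work. It is a sketch written for this thread, not the exact code from the Ruby bug report, and the method name `encoding_report` is made up:

```ruby
# Collects the encoding-related settings of the current Ruby process.
# `encoding_report` is a hypothetical helper name.
def encoding_report
  {
    "Encoding.find 'filesystem'" => Encoding.find('filesystem'),
    "Encoding.find 'locale'"     => Encoding.find('locale'),
    'Encoding.default_internal'  => Encoding.default_internal,
    'Encoding.default_external'  => Encoding.default_external,
    'Encoding.locale_charmap'    => Encoding.locale_charmap,
    '__FILE__'                   => __FILE__.encoding,
    "'foobar'"                   => 'foobar'.encoding
  }
end

encoding_report.each do |label, value|
  puts "#{label}: #{value.inspect}"
end
```

Running it once from within the site's lib/ directory and once from spec_helper.rb makes it easy to diff the two environments.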

Member

@ddfreyne ddfreyne commented Mar 8, 2016

Does the problem disappear when you replace

Find.find(site.config[:output_dir] + '/') do |f|

in pruner.rb with

Find.find(site.config[:output_dir] + '/').map { |f| f.encode('UTF-8') }.each do |f|

? If so, it looks like re-encoding all filenames obtained from Dir.glob to be UTF-8 would be the way to go.

Contributor Author

@agross agross commented Mar 9, 2016

Thanks for the suggestion! Unfortunately it didn't work as it seems the files returned by Find.find are already UTF-8 encoded.

compiled_files contains the "team-alexander-gross" file name with Encoding:IBM437 encoding, so slapping the map on compiled_files did the trick.

Perhaps it's even better to enforce the encoding at a more central place, like ItemRep.raw_paths.values or wherever the raw_paths.values come from. This works for me:

all_raw_paths = site.compiler.reps.flat_map { |r| r.raw_paths.values.map { |f| f.encode('UTF-8') } }
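The same idea can be pulled into a small helper so that both sides of the comparison go through one code path. This is only a sketch: `utf8_path` and `stray_files` are hypothetical names, not part of Nanoc.

```ruby
# Returns the path transcoded to UTF-8 (hypothetical helper).
# Note: String#encode raises if the source bytes cannot be converted,
# for example for ASCII-8BIT strings that contain high bytes.
def utf8_path(path)
  return path if path.encoding == Encoding::UTF_8

  path.encode(Encoding::UTF_8)
end

# Files on disk that no item rep claims, compared in UTF-8 on both sides.
def stray_files(present_files, raw_paths)
  present_files.map { |p| utf8_path(p) } - raw_paths.map { |p| utf8_path(p) }
end

present  = ["build/team-alexander-gro\u00DF.jpg"]
compiled = ["build/team-alexander-gro\xE1.jpg".b.force_encoding(Encoding::IBM437)]

puts stray_files(present, compiled).inspect
```

Normalising both lists keeps the comparison symmetric, so it does not matter which side picked up the locale encoding.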

@ddfreyne ddfreyne removed this from the 4.1.5 milestone Mar 24, 2016
@ddfreyne ddfreyne added this to the 4.1.6 milestone Mar 24, 2016
Member

@ddfreyne ddfreyne commented Apr 15, 2016

Yup, I’d argue that all strings (including filenames) constructed by Nanoc should be in UTF-8. Will fix and ensure that encodings are correct everywhere.

(Hard to fix/test, because the default encoding is sadly part of the global state.)
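One way to work around the global state in a test is to swap the default external encoding for the duration of a block and restore it afterwards. The helper names below are hypothetical, and this is a sketch of the general technique rather than what the eventual fix does:

```ruby
# Sets Encoding.default_external without the warning Ruby prints
# in verbose mode (hypothetical helper).
def set_default_external(encoding)
  verbose = $VERBOSE
  $VERBOSE = nil
  Encoding.default_external = encoding
ensure
  $VERBOSE = verbose
end

# Runs the block with a different default external encoding and
# restores the previous value even if the block raises.
def with_default_external(encoding)
  previous = Encoding.default_external
  set_default_external(encoding)
  yield
ensure
  set_default_external(previous)
end

seen = with_default_external(Encoding::IBM437) { Encoding.default_external }
puts "inside block: #{seen.inspect}"
puts "afterwards:   #{Encoding.default_external.inspect}"
```

Because the setting is process-wide, a helper like this is only safe when tests do not run in parallel threads.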

Member

@ddfreyne ddfreyne commented Apr 17, 2016

Fix in #852.

Member

@ddfreyne ddfreyne commented Apr 17, 2016

Fixed in #852, and will be part of the 4.1.6 release.

@ddfreyne ddfreyne closed this Apr 17, 2016