Binary items with umlauts from custom data source are created and deleted right away #837

Closed
agross opened this Issue Mar 6, 2016 · 12 comments

Contributor

agross commented Mar 6, 2016

I'm currently migrating an old CMS to nanoc. We load most of the CMS content from an XML file. Some items (binary items) are loaded from the file system.

There is one binary file containing an "ß" character. The item rep (default) for that item gets created and deleted right away. All binary items are handled by a passthrough rule.

$ bundle exec nanoc
Loading site… done
Compiling site…
      create  [0.00s]  build/bin/files/media/image/team-alexander-groß.jpg
      delete  build/bin/files/media/image/team-alexander-groß.jpg

Site compiled in 2.55s.

ddfreyne Mar 6, 2016

Member

Yikes! A wild guess, but this might be caused by Unicode normalisation being done differently in different places.

Do you have a test case for me that I can reproduce locally? If not, it’d be helpful if you could do some digging on your side and see whether you can isolate the issue. If my hunch is correct, pruner.rb:43 would show that present_files has a string normalised one way, and compiled_files has the same string normalised a different way.
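
To illustrate the hunch: two strings that render identically can still compare as different when one is precomposed (NFC) and the other decomposed (NFD). A minimal sketch using Ruby's built-in String#unicode_normalize (Ruby 2.2 or later); the file names are made up for illustration and are not taken from the report.

```ruby
# Two spellings of the same visible file name (hypothetical example data).
nfc = "m\u00FCller.jpg"   # "ü" as a single precomposed code point
nfd = "mu\u0308ller.jpg"  # "u" followed by a combining diaeresis

# They look the same on screen but are different strings.
puts nfc == nfd                                                  # false

# Normalising both sides to the same form makes them comparable.
puts nfc.unicode_normalize(:nfc) == nfd.unicode_normalize(:nfc)  # true
```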

@ddfreyne ddfreyne added this to the 4.1.5 milestone Mar 6, 2016

agross Mar 6, 2016

Contributor

present_files contains

[ 18] "build/bin/files/media/image/team-alexander-gro\u00DF.jpg",

compiled_files contains

[115] "build/bin/files/media/image/team-alexander-gro\xE1.jpg"

agross Mar 6, 2016

Contributor

I added p present_files.find { |e| e =~ /alex/ }.encoding for both collections, both yield #<Encoding:UTF-8> in the output.

agross Mar 6, 2016

Contributor

Scratch that.

p present_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:UTF-8>
p compiled_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:IBM437>
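
These results are consistent with one file name stored in two encodings: byte 0xE1 is "ß" in IBM437 (code page 437), while the UTF-8 string holds "ß" as U+00DF. A minimal sketch of why the two entries do not match. The list-difference step is an assumption about how the two collections are compared, not a copy of pruner.rb.

```ruby
# The same file name in the two encodings seen in this thread.
present  = "build/bin/files/media/image/team-alexander-gro\u00DF.jpg"  # UTF-8
compiled = present.encode(Encoding::IBM437)                            # "ß" becomes byte 0xE1

# Strings in different encodings with non-ASCII bytes never compare equal.
puts present == compiled                 # false

# Assumed comparison: anything present but not compiled counts as stray.
stray = [present] - [compiled]
puts stray.size                          # 1

# Transcoding the compiled path to UTF-8 makes the entries match again.
normalised = [compiled].map { |path| path.encode(Encoding::UTF_8) }
puts ([present] - normalised).empty?     # true
```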

agross Mar 6, 2016

Contributor

It seems this File.expand_path in my data source yields #<Encoding:IBM437>.

But even after changing the line to new_item(File.expand_path(file).encode(Encoding::UTF_8), ... the pruner still sees compiled_files.find { |e| e =~ /team-alex/ }.encoding as #<Encoding:IBM437>.

agross Mar 6, 2016

Contributor

Might be related to this bug: https://bugs.ruby-lang.org/issues/9713

agross Mar 6, 2016

Contributor

Copying the code from the issue above to something in lib/ and my spec_helper.rb I see there's a slight difference between

$ bundle exec nanoc
Loading site… Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:IBM437>

# and

$ bundle exec rspec # or rake
Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:UTF-8>
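
For anyone who wants to run the same comparison, a diagnostic along these lines prints the values listed above. This is a sketch reconstructed from the output labels, not the exact code from the linked Ruby bug, and the method name encoding_report is made up here.

```ruby
# Collects the encoding-related global state reported in the output above.
def encoding_report
  {
    "Encoding.find 'filesystem'" => Encoding.find('filesystem'),
    "Encoding.find 'locale'"     => Encoding.find('locale'),
    'Encoding.default_internal'  => Encoding.default_internal,
    'Encoding.default_external'  => Encoding.default_external,
    'Encoding.locale_charmap'    => Encoding.locale_charmap,
    '__FILE__'                   => __FILE__.encoding,
    "'foobar'"                   => 'foobar'.encoding
  }
end

encoding_report.each { |label, value| puts "#{label}: #{value.inspect}" }
```

Calling it from two entry points (for example a file under lib/ and spec_helper.rb, as in this thread) makes it easy to diff the two environments.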

agross added a commit to dnugleipzig/web that referenced this issue Mar 6, 2016

ddfreyne Mar 8, 2016

Member

Does the problem disappear when you replace

Find.find(site.config[:output_dir] + '/') do |f|

in pruner.rb with

Find.find(site.config[:output_dir] + '/').map { |f| f.encode('UTF-8') }.each do |f|

? If so, it looks like re-encoding all filenames obtained from Dir.glob to be UTF-8 would be the way to go.

agross Mar 9, 2016

Contributor

Thanks for the suggestion! Unfortunately it didn't work as it seems the files returned by Find.find are already UTF-8 encoded.

compiled_files contains the "team-alexander-gross" file name with Encoding:IBM437 encoding, so slapping the map on compiled_files did the trick.

Perhaps it's even better to enforce the encoding in a more central place, such as ItemRep.raw_paths.values or wherever the raw_paths.values come from. This works for me:

all_raw_paths = site.compiler.reps.flat_map { |r| r.raw_paths.values.map { |f| f.encode('UTF-8') } }
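
One way to picture "enforce the encoding in a central place" is a small helper that every path passes through before it is compared. A sketch only: utf8_path is a hypothetical name and not part of Nanoc's API, and the sample paths are invented. It transcodes only when needed, so paths that are already UTF-8 are returned unchanged.

```ruby
# Hypothetical helper: returns the given path as a UTF-8 string.
def utf8_path(path)
  return path if path.encoding == Encoding::UTF_8

  path.encode(Encoding::UTF_8)
end

# Invented sample data: two paths tagged IBM437, one already UTF-8.
raw_paths = [
  'build/index.html'.encode(Encoding::IBM437),
  "build/team-alexander-gro\u00DF.jpg".encode(Encoding::IBM437),
  "build/caf\u00E9.html"
]

normalised = raw_paths.map { |path| utf8_path(path) }
puts normalised.map(&:encoding).uniq.inspect
```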

@ddfreyne ddfreyne modified the milestones: 4.1.5, 4.1.6 Mar 24, 2016

ddfreyne Apr 15, 2016

Member

Yup, I’d argue that all strings (including filenames) constructed by Nanoc should be in UTF-8. Will fix and ensure that encodings are correct everywhere.

(Hard to fix/test, because the default encoding is sadly part of the global state.)
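
The global-state problem can be contained in a test by switching the default external encoding for the duration of a block and restoring it afterwards. A sketch only: with_default_external is a hypothetical helper, not something Nanoc is known to provide, and Ruby may print a warning when the default is reassigned in verbose mode.

```ruby
require 'tempfile'

# Hypothetical test helper: temporarily changes Encoding.default_external.
def with_default_external(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  # Always restore the original value, even if the block raises.
  Encoding.default_external = previous
end

Tempfile.create('encoding-demo') do |file|
  file.write('plain ascii')
  file.flush

  with_default_external(Encoding::IBM437) do
    # Text read from disk is tagged with the current default external
    # encoding (when no default internal encoding is set).
    puts File.read(file.path).encoding
  end
end
```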

ddfreyne Apr 17, 2016

Member

Fix in #852.

ddfreyne Apr 17, 2016

Member

Fixed in #852, and will be part of the 4.1.6 release.

@ddfreyne ddfreyne closed this Apr 17, 2016
