New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the handling of binary items #1321

Merged
merged 1 commit into from Mar 10, 2018

Conversation

Projects
None yet
4 participants
@terceiro
Contributor

terceiro commented Feb 23, 2018

No description provided.

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Feb 25, 2018

Member

Hi, thanks for the PR!

I’ve thought about this optimisation in the past, too. In the end, I decided against using hard links for this purpose, for two reasons:

  • Nanoc’s behavior will change: modifying a file in content/ might result in a file in output/ being modified (because they are the same file). I believe this will be confusing, because

    • it deviates from the understanding that one needs to run nanoc in order to bring the output/ directory up to date — a change in content/ can cause a change in output/ right away.
    • modifying a file in content/ might or might not cause a change in output/, depending on how that file is handled in Rules

    In the end, I think the change in behavior, and the unpredictability of the behavior, does not make this change worth the speedup.

  • I couldn’t find enough evidence to support that hard links would yield a significant speedup. I have not investigated this recently, but I imagine that the new asynchronous writes (which landed in Nanoc 4.9.0) will have made the impact of using hardlinks less significant. However, if you have a case where using hardlinks provided a significant speed improvement, I’d love to know about it!

For the reasons listed above, I’ll close the PR now, but I’m definitely keen on keeping the conversation going!

Member

ddfreyne commented Feb 25, 2018

Hi, thanks for the PR!

I’ve thought about this optimisation in the past, too. In the end, I decided against using hard links for this purpose, for two reasons:

  • Nanoc’s behavior will change: modifying a file in content/ might result in a file in output/ being modified (because they are the same file). I believe this will be confusing, because

    • it deviates from the understanding that one needs to run nanoc in order to bring the output/ directory up to date — a change in content/ can cause a change in output/ right away.
    • modifying a file in content/ might or might not cause a change in output/, depending on how that file is handled in Rules

    In the end, I think the change in behavior, and the unpredictability of the behavior, does not make this change worth the speedup.

  • I couldn’t find enough evidence to support that hard links would yield a significant speedup. I have not investigated this recently, but I imagine that the new asynchronous writes (which landed in Nanoc 4.9.0) will have made the impact of using hardlinks less significant. However, if you have a case where using hardlinks provided a significant speed improvement, I’d love to know about it!

For the reasons listed above, I’ll close the PR now, but I’m definitely keen on keeping the conversation going!

@ddfreyne ddfreyne closed this Feb 25, 2018

@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Feb 25, 2018

Contributor

Hi, thanks for your feedback. Indeed the change in semantics breaks a few assumptions that we usually assume with nanoc.

My use case is this: I maintain a podcast with nanoc, in which the files are published together with the site itself. Each audio file is usually between 50 and 100MB. overall, I have so far ~2.4GB of audio files. I am concerned with both the speed of the builds and the space it uses. The builds for publication are done on the server side, so space efficiency is a little bit more relevant to me.

I admit I hadn't tried with Nanoc 4.9 before, and just did it. indeed the speedup is considerable, and that is very nice. So IMO speed is fixed.

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space. if everything goes right at some point this podcast will have tens of GBs of audio files. I am already considering removing the files from the Nanoc website, but maybe there is a low-hanging fruit in Nanoc that we could explore so that keeping the files in the nanoc website itself could be viable.

Contributor

terceiro commented Feb 25, 2018

Hi, thanks for your feedback. Indeed the change in semantics breaks a few assumptions that we usually assume with nanoc.

My use case is this: I maintain a podcast with nanoc, in which the files are published together with the site itself. Each audio file is usually between 50 and 100MB. overall, I have so far ~2.4GB of audio files. I am concerned with both the speed of the builds and the space it uses. The builds for publication are done on the server side, so space efficiency is a little bit more relevant to me.

I admit I hadn't tried with Nanoc 4.9 before, and just did it. indeed the speedup is considerable, and that is very nice. So IMO speed is fixed.

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space. if everything goes right at some point this podcast will have tens of GBs of audio files. I am already considering removing the files from the Nanoc website, but maybe there is a low-hanging fruit in Nanoc that we could explore so that keeping the files in the nanoc website itself could be viable.

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Feb 26, 2018

Member

Thanks for the reply! Your reasoning makes sense.

I imagine that most writes are atomic anyway (meaning that file modification equals create file + rename, rather than truncate and write), which means that updating a file in content/ won’t actually change the matching file in output/.

I’ll reopen the PR. If you want, can you add a test to ensure that the files are hardlinked? (If not, I can do that as well.)

Member

ddfreyne commented Feb 26, 2018

Thanks for the reply! Your reasoning makes sense.

I imagine that most writes are atomic anyway (meaning that file modification equals create file + rename, rather than truncate and write), which means that updating a file in content/ won’t actually change the matching file in output/.

I’ll reopen the PR. If you want, can you add a test to ensure that the files are hardlinked? (If not, I can do that as well.)

@ddfreyne ddfreyne reopened this Feb 26, 2018

@agross

This comment has been minimized.

Show comment
Hide comment
@agross

agross Feb 27, 2018

Contributor

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators. Just tested FileUtils.ln, it returns 0 and no files are created (Ruby 2.5.0, I can at least create symlinks as per Windows permissions with mklink). On Windows, a copy is created.

Contributor

agross commented Feb 27, 2018

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators. Just tested FileUtils.ln, it returns 0 and no files are created (Ruby 2.5.0, I can at least create symlinks as per Windows permissions with mklink). On Windows, a copy is created.

@gpakosz

This comment has been minimized.

Show comment
Hide comment
@gpakosz

gpakosz Feb 27, 2018

Member

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators.

As of Fall Creators Update 2017, when developer mode is enabled you don't need privileges to create links

Member

gpakosz commented Feb 27, 2018

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators.

As of Fall Creators Update 2017, when developer mode is enabled you don't need privileges to create links

@gpakosz

This comment has been minimized.

Show comment
Hide comment
@gpakosz

gpakosz Feb 27, 2018

Member

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space.

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

Member

gpakosz commented Feb 27, 2018

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space.

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Feb 27, 2018

Contributor

hey @ddfreyne, thanks for reconsidering. I have updated this PR with a test, however I am a bit suspiciouis of it, because 1) it will definitively fail if executed on a system where /tmp is a tmpfs and 2) it will probably fail on windows as well based on the input from @agross

@agross could you please test this on Windows? I don't have access to a Windows system.

Contributor

terceiro commented Feb 27, 2018

hey @ddfreyne, thanks for reconsidering. I have updated this PR with a test, however I am a bit suspiciouis of it, because 1) it will definitively fail if executed on a system where /tmp is a tmpfs and 2) it will probably fail on windows as well based on the input from @agross

@agross could you please test this on Windows? I don't have access to a Windows system.

@agross

This comment has been minimized.

Show comment
Hide comment
@agross

agross Feb 27, 2018

Contributor

@terceiro As I said, Ruby 2.5.0 on Windows just copies the file, I guess it won't pose a problem.

Contributor

agross commented Feb 27, 2018

@terceiro As I said, Ruby 2.5.0 on Windows just copies the file, I guess it won't pose a problem.

@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Feb 27, 2018

Contributor
Contributor

terceiro commented Feb 27, 2018

@agross

This comment has been minimized.

Show comment
Hide comment
@agross

agross Feb 27, 2018

Contributor

@terceiro Alright, I missed that part. The problem is that I can't even install the nanoc's dependencies on Windows. Perhaps you can just skip the test if Gem.win_platform? is true?

Here's what I tested manually in irb:

require 'fileutils'
=> true
IO.write('test', 'foo')
=> 3
FileUtils.ln('test', 'linked')
=> 0
File.stat('test').ino
=> 1125899906918010
File.stat('linked').ino
=> 1125899906918010
FileUtils.ln('test', 'linked-f', force: true)
=> 0
File.stat('linked').ino
=> 1125899906918010
Contributor

agross commented Feb 27, 2018

@terceiro Alright, I missed that part. The problem is that I can't even install the nanoc's dependencies on Windows. Perhaps you can just skip the test if Gem.win_platform? is true?

Here's what I tested manually in irb:

require 'fileutils'
=> true
IO.write('test', 'foo')
=> 3
FileUtils.ln('test', 'linked')
=> 0
File.stat('test').ino
=> 1125899906918010
File.stat('linked').ino
=> 1125899906918010
FileUtils.ln('test', 'linked-f', force: true)
=> 0
File.stat('linked').ino
=> 1125899906918010
@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Feb 27, 2018

Contributor

well, I don't know enough about Windows to say that that is a hardlink, but it looks a lot like a hard link ;-)

(so the test will pass)

Contributor

terceiro commented Feb 27, 2018

well, I don't know enough about Windows to say that that is a hardlink, but it looks a lot like a hard link ;-)

(so the test will pass)

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 2, 2018

Member

I think this PR needs a bit of (manual) testing. I’ll take a stab at it this weekend. (I’m paranoid about something breaking that we haven’t thought of.)

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows). Not sure whether that exists as a Saas!

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

A passthrough rule is a syntactic shorthand, so it wouldn’t be easy to make an exception for passthrough.

Member

ddfreyne commented Mar 2, 2018

I think this PR needs a bit of (manual) testing. I’ll take a stab at it this weekend. (I’m paranoid about something breaking that we haven’t thought of.)

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows). Not sure whether that exists as a Saas!

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

A passthrough rule is a syntactic shorthand, so it wouldn’t be easy to make an exception for passthrough.

@agross

This comment has been minimized.

Show comment
Hide comment
@agross

agross Mar 2, 2018

Contributor

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows).

https://github.com/agross/nanoc/blob/appveyor/appveyor.yml

Basically, this gets me the same weird bundler errors as on my Windows machine. Adding x64_mingw helped, but we get test errors, some of which can be fixed by comparing paths with File.absolute_path.

Contributor

agross commented Mar 2, 2018

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows).

https://github.com/agross/nanoc/blob/appveyor/appveyor.yml

Basically, this gets me the same weird bundler errors as on my Windows machine. Adding x64_mingw helped, but we get test errors, some of which can be fixed by comparing paths with File.absolute_path.

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 4, 2018

Member

@agross Ugh, I missed your comment before I started working on setting up AppVeyor! Anyway, not much work lost — and I’ll definitely take some stuff from your setup!

Member

ddfreyne commented Mar 4, 2018

@agross Ugh, I missed your comment before I started working on setting up AppVeyor! Anyway, not much work lost — and I’ll definitely take some stuff from your setup!

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 4, 2018

Member

See #1327 and #1325 for Windows support — still work in progress.

Once that’s finished, I think we can think about hardlinks again!

Member

ddfreyne commented Mar 4, 2018

See #1327 and #1325 for Windows support — still work in progress.

Once that’s finished, I think we can think about hardlinks again!

@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Mar 5, 2018

Contributor
Contributor

terceiro commented Mar 5, 2018

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 6, 2018

Member

@terceiro Can you rebase your branch on top of master? It’s now tested on Windows (for Rubies 2.3 to 2.5).

Member

ddfreyne commented Mar 6, 2018

@terceiro Can you rebase your branch on top of master? It’s now tested on Windows (for Rubies 2.3 to 2.5).

Optimize the handling of binary items
When writing large binary items (e.g. audio or video files), creating a
hard link pointing to the original avoids duplicating the data on disk.
This is specially useful for large binary items.

Hard linking will fail when the source file is not on the same device as
the output directory (e.g. when /tmp is a tmpfs). In that case, fallback
to the previous behavior of copying, what does duplicate the space, but
should always work.
@terceiro

This comment has been minimized.

Show comment
Hide comment
@terceiro

terceiro Mar 6, 2018

Contributor

@ddfreyne rebased

Contributor

terceiro commented Mar 6, 2018

@ddfreyne rebased

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 10, 2018

Member

Huh, strangely the AppVeyor build is not running. Will take a look at it later!

Member

ddfreyne commented Mar 10, 2018

Huh, strangely the AppVeyor build is not running. Will take a look at it later!

@ddfreyne

This comment has been minimized.

Show comment
Hide comment
@ddfreyne

ddfreyne Mar 10, 2018

Member

Everything checks out. Merging! Thanks for your contribution!

Member

ddfreyne commented Mar 10, 2018

Everything checks out. Merging! Thanks for your contribution!

@ddfreyne ddfreyne merged commit 01daad8 into nanoc:master Mar 10, 2018

3 checks passed

codecov/patch Coverage not affected when comparing ab29b20...017dd43
Details
codecov/project 98.44% (-0.03%) compared to ab29b20
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment