Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the handling of binary items #1321

Merged
merged 1 commit into from Mar 10, 2018

Conversation

@terceiro
Copy link
Contributor

@terceiro terceiro commented Feb 23, 2018

No description provided.

@terceiro terceiro force-pushed the optimize-binary-items branch from b444824 to 9d505ca Feb 23, 2018
@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Feb 25, 2018

Hi, thanks for the PR!

I’ve thought about this optimisation in the past, too. In the end, I decided against using hard links for this purpose, for two reasons:

  • Nanoc’s behavior will change: modifying a file in content/ might result in a file in output/ being modified (because they are the same file). I believe this will be confusing, because

    • it deviates from the understanding that one needs to run nanoc in order to bring the output/ directory up to date — a change in content/ can cause a change in output/ right away.
    • modifying a file in content/ might or might not cause a change in output/, depending on how that file is handled in Rules

    In the end, I think the change in behavior, and the unpredictability of the behavior, does not make this change worth the speedup.

  • I couldn’t find enough evidence to support that hard links would yield a significant speedup. I have not investigated this recently, but I imagine that the new asynchronous writes (which landed in Nanoc 4.9.0) will have made the impact of using hardlinks less significant. However, if you have a case where using hardlinks provided a significant speed improvement, I’d love to know about it!

For the reasons listed above, I’ll close the PR now, but I’m definitely keen on keeping the conversation going!

Loading

@ddfreyne ddfreyne closed this Feb 25, 2018
@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Feb 25, 2018

Hi, thanks for your feedback. Indeed the change in semantics breaks a few assumptions that we usually assume with nanoc.

My use case is this: I maintain a podcast with nanoc, in which the files are published together with the site itself. Each audio file is usually between 50 and 100MB. overall, I have so far ~2.4GB of audio files. I am concerned with both the speed of the builds and the space it uses. The builds for publication are done on the server side, so space efficiency is a little bit more relevant to me.

I admit I hadn't tried with Nanoc 4.9 before, and just did it. indeed the speedup is considerable, and that is very nice. So IMO speed is fixed.

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space. if everything goes right at some point this podcast will have tens of GBs of audio files. I am already considering removing the files from the Nanoc website, but maybe there is a low-hanging fruit in Nanoc that we could explore so that keeping the files in the nanoc website itself could be viable.

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Feb 26, 2018

Thanks for the reply! Your reasoning makes sense.

I imagine that most writes are atomic anyway (meaning that file modification equals create file + rename, rather than truncate and write), which means that updating a file in content/ won’t actually change the matching file in output/.

I’ll reopen the PR. If you want, can you add a test to ensure that the files are hardlinked? (If not, I can do that as well.)

Loading

@ddfreyne ddfreyne reopened this Feb 26, 2018
@agross
Copy link
Contributor

@agross agross commented Feb 27, 2018

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators. Just tested FileUtils.ln, it returns 0 and no files are created (Ruby 2.5.0, I can at least create symlinks as per Windows permissions with mklink). On Windows, a copy is created.

Loading

@gpakosz
Copy link
Member

@gpakosz gpakosz commented Feb 27, 2018

I also wonder what the implications will be on Windows where ... hold your breath ... {hard,sym}links can by default only be created by administrators.

As of Fall Creators Update 2017, when developer mode is enabled you don't need privileges to create links

Loading

@gpakosz
Copy link
Member

@gpakosz gpakosz commented Feb 27, 2018

Now, my only remaining concern is the duplicated space. Since the audio files are just copied bit-by-bit with no processing in Nanoc, it would be nice to be able to not duplicate that space.

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

Loading

@terceiro terceiro force-pushed the optimize-binary-items branch from 9d505ca to 17eb486 Feb 27, 2018
@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Feb 27, 2018

hey @ddfreyne, thanks for reconsidering. I have updated this PR with a test, however I am a bit suspiciouis of it, because 1) it will definitively fail if executed on a system where /tmp is a tmpfs and 2) it will probably fail on windows as well based on the input from @agross

@agross could you please test this on Windows? I don't have access to a Windows system.

Loading

@agross
Copy link
Contributor

@agross agross commented Feb 27, 2018

@terceiro As I said, Ruby 2.5.0 on Windows just copies the file, I guess it won't pose a problem.

Loading

@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Feb 27, 2018

Loading

@agross
Copy link
Contributor

@agross agross commented Feb 27, 2018

@terceiro Alright, I missed that part. The problem is that I can't even install the nanoc's dependencies on Windows. Perhaps you can just skip the test if Gem.win_platform? is true?

Here's what I tested manually in irb:

require 'fileutils'
=> true
IO.write('test', 'foo')
=> 3
FileUtils.ln('test', 'linked')
=> 0
File.stat('test').ino
=> 1125899906918010
File.stat('linked').ino
=> 1125899906918010
FileUtils.ln('test', 'linked-f', force: true)
=> 0
File.stat('linked').ino
=> 1125899906918010

Loading

@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Feb 27, 2018

well, I don't know enough about Windows to say that that is a hardlink, but it looks a lot like a hard link ;-)

(so the test will pass)

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 2, 2018

I think this PR needs a bit of (manual) testing. I’ll take a stab at it this weekend. (I’m paranoid about something breaking that we haven’t thought of.)

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows). Not sure whether that exists as a Saas!

Is there a way to make hardlinks only if an item is subject to a passthrough rule? Note that it's different semantics than "file is identical".

A passthrough rule is a syntactic shorthand, so it wouldn’t be easy to make an exception for passthrough.

Loading

@agross
Copy link
Contributor

@agross agross commented Mar 2, 2018

I’ve been meaning to find a way to automatically test Nanoc on Windows (Travis CI doesn’t have Ruby on Windows).

https://github.com/agross/nanoc/blob/appveyor/appveyor.yml

Basically, this gets me the same weird bundler errors as on my Windows machine. Adding x64_mingw helped, but we get test errors, some of which can be fixed by comparing paths with File.absolute_path.

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 4, 2018

@agross Ugh, I missed your comment before I started working on setting up AppVeyor! Anyway, not much work lost — and I’ll definitely take some stuff from your setup!

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 4, 2018

See #1327 and #1325 for Windows support — still work in progress.

Once that’s finished, I think we can think about hardlinks again!

Loading

@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Mar 5, 2018

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 6, 2018

@terceiro Can you rebase your branch on top of master? It’s now tested on Windows (for Rubies 2.3 to 2.5).

Loading

When writing large binary items (e.g. audio or video files), creating a
hard link pointing to the original avoids duplicating the data on disk.
This is specially useful for large binary items.

Hard linking will fail when the source file is not on the same device as
the output directory (e.g. when /tmp is a tmpfs). In that case, fallback
to the previous behavior of copying, what does duplicate the space, but
should always work.
@terceiro terceiro force-pushed the optimize-binary-items branch from 17eb486 to 017dd43 Mar 6, 2018
@terceiro
Copy link
Contributor Author

@terceiro terceiro commented Mar 6, 2018

@ddfreyne rebased

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 10, 2018

Huh, strangely the AppVeyor build is not running. Will take a look at it later!

Loading

@ddfreyne
Copy link
Member

@ddfreyne ddfreyne commented Mar 10, 2018

Everything checks out. Merging! Thanks for your contribution!

Loading

@ddfreyne ddfreyne merged commit 01daad8 into nanoc:master Mar 10, 2018
3 checks passed
Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants