
Add checksum parameter to Document #793

Merged
merged 4 commits into nanoc:master on Jan 3, 2016

@RubenVerborgh
Member

RubenVerborgh commented Jan 1, 2016

As suggested in #790, this pull request allows a Document to specify a pre-calculated checksum. This gives data sources the ability to determine a checksum for the items and layouts they create.
This pull request:

  • adds a checksum parameter to the initializer of Document (as well as new_item and new_layout)
  • modifies Checksummer to use the checksum on Item and Layout if present
  • modifies the filesystem datasource to set the size/modification time as checksum

This realizes a major compilation speed gain for my website (from 11s to 3s if nothing is changed), and paves the way for lazy loading of items, which could bring down compilation time even further.
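The overall shape of the change can be sketched as follows. This is an illustrative simplification, not Nanoc's actual code: the class and module names mirror the PR description, but the signatures are invented for the sketch.

```ruby
require 'digest'

# Illustrative sketch of the idea in this PR (names and signatures are
# simplified; the real Nanoc classes differ): a document may carry a
# pre-calculated checksum, and the checksummer prefers it when present.
class Document
  attr_reader :content, :attributes, :checksum

  def initialize(content, attributes, checksum: nil)
    @content    = content
    @attributes = attributes
    @checksum   = checksum # nil means “calculate a full checksum later”
  end
end

module Checksummer
  def self.call(document)
    # Use the data source’s pre-calculated checksum if available…
    return document.checksum if document.checksum
    # …otherwise fall back to hashing the loaded content and attributes.
    Digest::SHA1.hexdigest(document.content + document.attributes.inspect)
  end
end
```

With this shape, a data source that already knows a cheap fingerprint (such as file size and modification time) can pass it in and skip content hashing entirely.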

@RubenVerborgh
Member

RubenVerborgh commented Jan 1, 2016

Note that 289f6f3 is redundant after 8c72d51; however, I left it in for comparison (especially with regard to the tests).

@RubenVerborgh
Member

RubenVerborgh commented Jan 1, 2016

Also, the code to generate a checksum for a file is somewhat of a duplicate of the checksummer's code for pathnames. I did not want to create a dependency between a data source and the checksummer; while this might be acceptable for the default filesystem data source, external data sources would not want to depend on the internal checksummer. (That would also be an argument for having pathnames calculate their own checksum, but we've discussed that already 😉)
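The duplication being described is small. A hypothetical helper that an external data source could carry itself, so that it does not need to depend on Nanoc's private Checksummer API, might look like this (the function name is invented for illustration):

```ruby
# Hypothetical helper an external data source could define itself, so it
# does not depend on Nanoc’s private Checksummer API. It derives a cheap
# “checksum” from the file’s size and modification time, without reading
# or parsing the file’s content.
def cheap_file_checksum(path)
  stat = File.stat(path)
  "#{stat.size}-#{stat.mtime.to_i}"
end
```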

@ddfreyne
Member

ddfreyne commented Jan 2, 2016

This looks good! Some remarks:

  • I’m not in favour of using the file size and mtime to detect modifications. It’s not accurate enough—it’s trivial to generate two different files with the same size and mtime. The content and attributes (as strings) are already loaded, and generating a checksum from a string is quite fast (SHA-1 is intended to be fast). For binary items, this isn’t as straightforward, and file size + mtime will have to do.

    I’m not saying no to using mtime at all, but for a checksum, I don’t think it suffices. (Idea: only calculate a full checksum when the mtime appears to indicate that the item is outdated—but I believe such a change is out of scope for this PR and would require quite a bit more (architectural) rework.)

  • The checksummer is part of a private API, but I have no issues with the filesystem data source using that API. It is part of the private API because I don’t want to expose it outside of Nanoc (and have to maintain it as part of the public API).
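The accuracy concern in the first remark can be demonstrated directly. The sketch below (plain Ruby, independent of Nanoc) builds two files with different content but identical size and mtime: a stat-based checksum cannot tell them apart, while a SHA-1 content hash can.

```ruby
require 'digest'
require 'tempfile'

# Two files with identical size and mtime are indistinguishable to a
# size+mtime “checksum”, while a content hash (SHA-1) tells them apart.
a = Tempfile.new('one'); a.write('aaaa'); a.flush
b = Tempfile.new('two'); b.write('bbbb'); b.flush
same_time = Time.now - 60
File.utime(same_time, same_time, a.path)
File.utime(same_time, same_time, b.path)

stat_checksum = ->(path) { s = File.stat(path); "#{s.size}-#{s.mtime.to_i}" }
sha1_checksum = ->(path) { Digest::SHA1.file(path).hexdigest }

stat_checksum.call(a.path) == stat_checksum.call(b.path) # => true: collision
sha1_checksum.call(a.path) == sha1_checksum.call(b.path) # => false
```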

@RubenVerborgh
Member

RubenVerborgh commented Jan 2, 2016

  • I wasn't too sure either, hence I left 289f6f3 in. I'll adjust the range of this pull request so it no longer includes 8c72d51.
    • Regarding “content already loaded”: this pull request would actually avoid the need to load the content, which would be a nice next step. Content only needs to be loaded when the checksum is different.
  • Alright, but then it's no longer necessary if the mtime is not used.
@ddfreyne
Member

ddfreyne commented Jan 3, 2016

Content only needs to be loaded when the checksum is different.

Not entirely correct: a different checksum is a sufficient but not a necessary condition for content to be loaded. For instance, an item (whose checksum is identical) that includes content from another item (whose checksum is different) will need to be recompiled even though its own checksum hasn’t changed.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.
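The dependency argument can be sketched in a few lines. This is a simplified model, not Nanoc's actual outdatedness logic: an item must be recompiled (and thus have its content loaded) when its own checksum changed, or when any item it depends on is itself outdated, so an unchanged checksum alone is not enough to skip it.

```ruby
# Illustrative outdatedness check (not Nanoc’s real implementation):
# an item is outdated when its own checksum changed, or when any of its
# dependencies is outdated.
def outdated?(item, old_checksums, new_checksums, dependencies)
  return true if old_checksums[item] != new_checksums[item]
  dependencies.fetch(item, []).any? do |dep|
    outdated?(dep, old_checksums, new_checksums, dependencies)
  end
end
```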

@ddfreyne ddfreyne added the feature label Jan 3, 2016

@ddfreyne ddfreyne added this to the 4.2 milestone Jan 3, 2016

@@ -249,7 +251,7 @@ def parse(content_filename, meta_filename, _kind)
        content = pieces[4]
        # Done
-       [meta, content]
+       [meta, content, meta_raw]

@ddfreyne
Member

ddfreyne Jan 3, 2016

I don’t think meta_raw is used.

@ddfreyne
Member

ddfreyne Jan 3, 2016

Never mind—it is!

@ddfreyne
Member

ddfreyne Jan 3, 2016

Can you update the #parse documentation to reflect the changed return value?
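A sketch of what the updated documentation might look like, assuming the three-element return value shown in the diff; the doc comment and signature are illustrative, not Nanoc's actual documentation:

```ruby
# Hypothetical updated YARD documentation for the data source’s #parse
# helper, reflecting the extra raw-metadata return value:

# Parses the file(s) and returns the parsed metadata, the content, and the
# raw (unparsed) metadata section.
#
# @return [Array(Hash, String, String)] the parsed metadata, the content,
#   and the raw metadata string
def parse(content_filename, meta_filename, _kind)
  # … (implementation unchanged)
end
```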

@ddfreyne
Member

ddfreyne commented Jan 3, 2016

👍 apart from comments

@RubenVerborgh
Member

RubenVerborgh commented Jan 3, 2016

Content only needs to be loaded when the checksum is different.

Not entirely correct: a different checksum is a sufficient but not a necessary condition for content to be loaded.

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

@ddfreyne
Member

ddfreyne commented Jan 3, 2016

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

Yup.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

I have a few ideas regarding optimisations like these, and I’d rather write down these thoughts in a proper RFC first. (Can’t seem to find the time, though.)

@RubenVerborgh
Member

RubenVerborgh commented Jan 3, 2016

Okay, keep me updated when that RFC goes out 😄

Finishing this PR now.

@RubenVerborgh
Member

RubenVerborgh commented Jan 3, 2016

All done.

@ddfreyne
Member

ddfreyne commented Jan 3, 2016

Looks good, thanks!

ddfreyne added a commit that referenced this pull request Jan 3, 2016

@ddfreyne ddfreyne merged commit d47c197 into nanoc:master Jan 3, 2016

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed
@ddfreyne
Member

ddfreyne commented Jan 3, 2016

Some follow-up discussion:

  • Using mtime + file size as an initial checksum, and only calculating the full checksum when necessary (and only then loading content), is pointless: when the mtime/file-size checksum differs, the item will need to be recompiled, which means loading the content anyway. Additionally, the full checksum will need to be calculated for this item anyway, because it needs to be stored so that it can be used in future compilations. (The previous checksum could be reused rather than recalculated, but I’m unsure whether the speedup would be large enough to make this optimisation worthwhile.)
  • Not loading content, or lazy-loading it, made much more sense in previous versions of Ruby, but since Ruby recently gained a generational GC, the performance gains from keeping memory usage low are not nearly as big anymore.
@RubenVerborgh
Member

RubenVerborgh commented Jan 3, 2016

Lazy loading still makes sense, not for memory reasons, but to avoid the overhead of parsing the file and its metadata. The checksum can be computed without parsing; the file only actually needs to be parsed when the checksum differs.

@ddfreyne
Member

ddfreyne commented Jan 3, 2016

I’d argue that in most cases, parsing the metadata doesn’t impose enough overhead to be worth avoiding—but I haven’t profiled this extensively enough to be certain.

@RubenVerborgh
Member

RubenVerborgh commented Jan 3, 2016

I can give you a quick number: for a site with a BibTeX data source (± 150 items in my use case), compilation time was 12s before #790 and #793. Now it is down to 3s. Skipping field loading for the 150 items brings compilation time further down to 2s.

@ddfreyne
Member

ddfreyne commented Jan 3, 2016

This is the “nothing changed” case, correct?

@RubenVerborgh

Member

RubenVerborgh commented Jan 3, 2016

Yes.
