Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix false positive conflicts for static files in a collection #9141

Merged
merged 4 commits into from Oct 16, 2022

Conversation

yboulkaid
Copy link
Contributor

@yboulkaid yboulkaid commented Oct 1, 2022

This is a 🐛 bug fix.

Summary

This PR fixes some cases of false positive reports of file conflicts
when running jekyll serve.

Context

In #8961 we added the static files in collections to
Jekyll::Site#static_files.

Static files in collections were part of the Site#documents (they get
added here), so when we added them to static_files they were now
in two places.

As, Jekyll::Site#each_site_file traverses both static_files
and documents, those entries get duplicated. This in turn shows up in
the output of Jekyll::Commands::Doctor as a false positive:

Conflict: The following destination is shared by multiple files.
          The written file may end up with unexpected contents.
          [...]_site/books/images/sleep-cycles.png
           - [...]_books/images/sleep-cycles.png
           - [...]_books/images/sleep-cycles.png

This PR fixes this issue by making the each_site_file method go through
each file only once, thus fixing the issue.

@yboulkaid
Copy link
Contributor Author

A couple questions that arose when I was looking into this:

@ashmaroli
Copy link
Member

Thank you for submitting this proposal @yboulkaid, but I am not liking the double iteration in the new implementation.

Can you give an example of the stated false positives? For context, two Ruby strings "foo" and "foo" could be two different objects (different object_id) unless we are dealing with frozen string literals.
Similarly, two Jekyll resource objects with seemingly same attribute(s) could be two different instances of the same class if the file has been read into memory twice.

To maintain plausible backwards-compatibility, we cannot remove existing behavior, only add to it.
The commit that introduced the duplicate reference to static files has not been released yet. So that change can be freely altered.

In #9139, I am suspecting one of the used plugins because the path attribute is different and they are using a released version of Jekyll.

@yboulkaid
Copy link
Contributor Author

yboulkaid commented Oct 1, 2022

Thank you for submitting this proposal @yboulkaid, but I am not liking the double iteration in the new implementation.

I agree that the double iteration isn't very elegant, but it's the only way to remove duplicates after they have been added :/ Do we have any performance benchmarks we rely on to see how much of a hit this can be? I suspect not that much, but I only have managed small Jekyll sites

Can you give an example of the stated false positives? For context, two Ruby strings "foo" and "foo" could be two different objects (different object_id) unless we are dealing with frozen string literals. Similarly, two Jekyll resource objects with seemingly same attribute(s) could be two different instances of the same class if the file has been read into memory twice.

The test in this PR can reproduce the false positives. The duplicates also have the same object_id, so they are "true" duplicates. I checked by putting:

# in jekyll/commands/doctor.rb:130
# in destination_map(site)

binding.irb if thing.path.match?(/extensionless_static_file/)

To maintain plausible backwards-compatibility, we cannot remove existing behavior, only add to it. The commit that introduced the duplicate reference to static files has not been released yet. So that change can be freely altered.

If we decide that static files in collections should be included both in collection.documents and site.static_files, then I don't think we can escape some form of de-duplication step when looping through all objects in each_site_file, as by design this now includes both sets.

Looking through this again, I realized that we were also actually writing the same file twice even in Site#write, since we are also using the same each_site_file to loop through the files.

So as it stands today we can't assume uniqueness when looping in each_site_file, which could pose issues in other places. The message output by Doctor was just the most visible symptom.

(For context, adding static files the the collection document list has been added in 2014)

In #9139, I am suspecting one of the used plugins because the path attribute is different and they are using a released version of Jekyll.

True 👍

@ashmaroli
Copy link
Member

If we decide that static files in collections should be included both in collection.documents and site.static_files, then I don't think we can escape some form of de-duplication step...

It's not collection.documents but collection.files that holds references to static files from a collection.

The root cause behind the issue is that Site#documents contains both collection.docs and collection.files.

As of now, I am not able to come up with a better solution than what you have proposed. Yet, I am not willing to approve your proposal in current state either.

@yboulkaid
Copy link
Contributor Author

It's not collection.documents but collection.files that holds references to static files from a collection.
The root cause behind the issue is that Site#documents contains both collection.docs and collection.files.

You're right, thanks for the correction. Now we have an idea about what the issue is at least.

As of now, I am not able to come up with a better solution than what you have proposed. Yet, I am not willing to approve your proposal in current state either.

Could you elaborate about what in the proposed solution poses a problem? Is it the concept of the double loop, the performance implications, or something else? I can spend some more time looking into this, and knowing what is wrong with this solution helps giving me a direction

@ashmaroli
Copy link
Member

ashmaroli commented Oct 1, 2022

Could you elaborate about what in the proposed solution poses a problem?

Primarily the double iteration. Secondarily (to a very less extent because modern Ruby versions are better optimised for handling the scenario) stashing the block as a proc for subsequent use.
Regarding the double iteration, you may be able to better understand my disproval if you take a closer look at Site#docs_to_write and Site#static_files_to_write. Both of definitions are (intentionally) non-memoized yet involves iterations of their ow on every call.

The ideal solution according to me involves defining new variables and methods to ensure consistency and backwards-compatibility but that could involve making changes to classes other than Jekyll::Site.

@yboulkaid
Copy link
Contributor Author

I rewrote the each_site_file method in a way that avoids both the double loop and passing the &block as an argument to the method. Does that address your concerns?

@ashmaroli
Copy link
Member

ashmaroli commented Oct 2, 2022

Thank you for getting back at tackling this @yboulkaid. I must admit that the current diff looks a lot better than the prior ones. Jekyll uses a similar procedure to traverse through layouts of a rendered resource.

However, the attached test needs some cleaning up. You have currently created a new context under previously existing context. That is not correct since we're testing a separate / different Doctor command procedure. Additionally, the expectation is a bit vague: static files in a collection should not trigger false positives ?? I have no idea what the test is about until I read the block contents.
One of the reasons, is that we are not changing the Doctor command implementation at all but something in Jekyll::Site. Consider the following scenario:

# test/test_site.rb

context "static files in a collection" do
  should "not be revisited in `Site#each_site_file`" do
    ...
  end
end

@yboulkaid
Copy link
Contributor Author

yboulkaid commented Oct 2, 2022

Thanks for the detailed comments. I agree that the tests were not optimal, they were testing a symptom rather than the root cause (because that's what got me into this rabbit hole 😄)

I rewrote the test following your recommendation, I think it became much nicer :) What do you think?


visited_files = []
site.each_site_file { |file| visited_files << file }
assert_equal visited_files.count, visited_files.uniq.count
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also confirmed that this failed with the old implementation

Copy link
Member

@ashmaroli ashmaroli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@ashmaroli
Copy link
Member

The diff looks very nice now. 👍
But I am still request an additional review to rule out familiarity bias.

@ashmaroli ashmaroli requested review from mattr- and parkr October 2, 2022 13:11
@ashmaroli
Copy link
Member

Thanks @yboulkaid
@jekyllbot: merge +fix

@jekyllbot jekyllbot merged commit 6903f0b into jekyll:master Oct 16, 2022
jekyllbot added a commit that referenced this pull request Oct 16, 2022
github-actions bot pushed a commit that referenced this pull request Oct 16, 2022
Youssef Boulkaid: Fix false positive conflicts for static files in a collection (#9141)

Merge pull request 9141
Copy link
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to submit my review.

@@ -360,9 +360,13 @@ def documents
end

def each_site_file
seen_files = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One optimization is to use a Set here. Faster than array for the call to include?.

lylo pushed a commit to lylo/jekyll that referenced this pull request Oct 27, 2022
lylo pushed a commit to lylo/jekyll that referenced this pull request Oct 27, 2022
@jekyll jekyll locked and limited conversation to collaborators Oct 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants