Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary Cache: Add write-back support #1406

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sadroeck
Copy link

Write-back support for Binary Caching

This PR adds support for a cache write-back phase during package installation when the relevant cache entry was not newly built. This allows ports to be synced to all binary-cache providers marked as write or readwrite upon a successful installation.

Currently the vcpkg binary-caching layer will only populate a write-capable if the port was built from scratch, i.e. not restored from a read-capable cache provider. This means that a consumer has no way of guaranteeing other caches to be populated, unless they themselves build the port.

Use-cases

As a developer I have a local cache, e.g. file-system & a remote cache, e.g. S3. The remote cache is meant to be a fallback in case the local cache isn't populated. Once restored from the remote cache, I'd like the local cache to be populated with the restored entry, so if the port needs to be restored at a later time, it can be done from the cheaper local cache i.s.o. the remote cache.

Implementation details

Provider identity

The binary caching layer generally works by adding a list of read & write capable cache providers. The read providers are typically single-source, while the write providers are multi-source, e.g. multiple local file-system directories belong to 1 write provider. In order to identify which providers need be written back to, we need to uniquely identify each provider that qualifies for a write-back. This is done using a unique ProviderId, assigned during the parsing of the binary-cache segment list.

Write-back

The push_success() function is now also called post-install to populate the write-capable providers that were (during prefetch) marked as missing the port. These individually mark the cache entry as written back using mark_written_back().

Relevant issues/discussions

@sadroeck sadroeck force-pushed the binary-cache-write-back-support branch from 8adc114 to efdf99f Compare May 13, 2024 18:28
@sadroeck
Copy link
Author

@microsoft-github-policy-service agree company="SingleStore"

@sadroeck sadroeck marked this pull request as draft May 13, 2024 18:38
@sadroeck sadroeck force-pushed the binary-cache-write-back-support branch 8 times, most recently from 7929022 to acb9b24 Compare May 19, 2024 00:05
@sadroeck sadroeck force-pushed the binary-cache-write-back-support branch from acb9b24 to 690198c Compare May 19, 2024 00:23
@sadroeck
Copy link
Author

Added some fixes related to Nuget prefetches & write-only binary providers.

@sadroeck sadroeck marked this pull request as ready for review May 19, 2024 00:41
Copy link
Member

@BillyONeal BillyONeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this feature is something that we should have.

I think we might need a separate setting to control it, e.g. 'read', 'write', 'writeback'. For example, if someone has a stack of providers which are in 'furthest away' order, it would be bad to try to write back to a cache on another continent when a cache hit happens on a local directory, so users probably want some control over that and silently doing it to them without an opt in may be a breaking change.

To that end I believe there is a longstanding structural problem we probably need to fix before we can go here. Specifically, the documentation suggests and the original design intent was that cache providers would be visited in order. Unfortunately, there is actually an assumed order that sorts whatever the user says in an internally defined provider order, which seems like a bug.

The existing design is excessively complex because backends are tried to be smashed together, which was done to try to save the work in making zip files. The change here kinda makes that worse. It looks like the first half of getting rid of this problem was already done in #998

Instead of stacking ProviderId on top, now that there is no longer a reason to merge different users of the same provider type together, I would prefer to see a change that removes the multiple entries per provider entirely and just stores them in the user provided order. This will reduce complexity, actually match with what the documentation and design intent does, and make implementing the write-back behavior you're trying to achieve here much easier.

We probably should look at trying to land #908 first because autoantwort has been waiting forever ....

@@ -44,6 +44,19 @@ namespace vcpkg::Util

return false;
}
template<class Vec, class Filter>
std::vector<ElementT<Vec>> filtered_copy(const Vec& container, const Filter&& filter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<ElementT<Vec>> filtered_copy(const Vec& container, const Filter&& filter)
std::vector<ElementT<Vec>> copy_if(const Vec& container, Filter filter)
  1. The standard algorithm this duplicates is copy_if so I think that should be in the name.
  2. Filter functors are traditionally passed by value. (See for example key_equal above)

@@ -114,44 +134,62 @@ namespace vcpkg
std::string instantiate_variables(const BinaryPackageReadInfo& info) const;
};

struct GithubActionsInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this should be checked in :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants