Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: file cache: max size #945

Open
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

autoantwort
Copy link
Contributor

@autoantwort autoantwort commented Mar 8, 2023

Fixes microsoft/vcpkg#19452

If no settings are found a new settings file is created. You can set the following properties:

"properties": {
"delete-policy": {
"description": "After which policy objects should be deleted.",
"enum": [
"None",
"OldestAccessDate",
"OldestModificationDate"
]
},
"max-size-in-gb": {
"description": "The maximum size of the cache in gigabytes.",
"type": "number",
"minimum": 0
},
"max-age-in-days": {
"description": "The maximum age of the cache in days.",
"type": "number",
"minimum": 0
},
"keep-available-in-percentage": {
"description": "How much space should be kept available on the disk in percentage.",
"type": "number",
"minimum": 0
}
},

To ensure that the limits are respected when multiple instances are running the following is done:
0. Every instance has a unique "sync" file in "/sync/<random_number>".

  1. Append a line to sync file of the format "<name_of_object>;<file_size>\n"
  2. Read the new entries from other sync files
  3. Check if files must be deleted to respect the limits

# Conflicts:
#	include/vcpkg/base/messages.h
#	src/vcpkg/base/messages.cpp
# Conflicts:
#	src/vcpkg/binarycaching.cpp
# Conflicts:
#	src/vcpkg-test/metrics.cpp
#	src/vcpkg/binarycaching.cpp
# Conflicts:
#	include/vcpkg/base/files.h
#	src/vcpkg-test/files.cpp
#	src/vcpkg/base/files.cpp
#	src/vcpkg/binarycaching.cpp
@autoantwort autoantwort marked this pull request as ready for review August 19, 2023 16:19
@autoantwort
Copy link
Contributor Author

This is not 100% ready for merge, but ready for review. When I get a design approval I will handle all edge case errors and do localization.

Copy link
Contributor

@Thomas1664 Thomas1664 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits.

src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
# Conflicts:
#	include/vcpkg/base/message-data.inc.h
Copy link
Member

@BillyONeal BillyONeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0. Every instance has a unique "sync" file in "<root>/sync/<random_number>". 
1. Append a line to sync file of the format "<name_of_object>;<file_size>\n"
2. Read the new entries from other sync files
3. Check if files must be deleted to respect the limits

I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

I think you need something like a read/write oplock here, where instances trying to remove entries from the cache are writers, and instances trying to do anything else are readers. Only one instance should be trying to delete out of the cache at a time, and if any instance is doing so, we need to make sure we don't delete a cache entry that other instances are potentially touching.

src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
src/vcpkg/binarycaching.cpp Outdated Show resolved Hide resolved
cache.own_sync_file = get_own_sync_file(cache.sync_root_dir);
if (cache.folder_settings.delete_policy != FolderSettings::DeletePolicy::None)
{
std::unordered_map<std::string, uint64_t> file_sizes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that in general this works correctly, due to concurrent insertions into the cache. Cache entries are not meaningfully part of the cache until they have been renamed into place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a cache so that I don't have to call fs.file_size() for every cache entry. If a file is not in the cache anymore, this cache size is not used.

}
}
}

size_t push_success(const BinaryPackageWriteInfo& request, MessageSink& msg_sink) override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think 'during push_success' is the right time to be doing this. It probably should be a one time pass that looks at the cache(s) after all cache operations this particular vcpkg instance will do on this run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the cache could be larger then its max size in the meantime?

@autoantwort
Copy link
Contributor Author

autoantwort commented Oct 25, 2023

I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

The instances only write to the sync files what they want to add, not what they want to delete.
So if you want to add a file with hash xxx and size 123 to the binary cache you append the following to your own sync file:
xxx;123\n

In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

No this code does not assumes this. I explicitly handle this case ... I just realize that I wanted to implement this but haven't done this yet ... 🤦 🙈 Edit: Now implemented
It this is implemented, the implementation is save:

Start:
Current binary cache size: 9.5 GB
Two instances want to add 0.6 GB

Instance 1 Instance 2
1 Add half a line to own sync file
2 Add line to own sync file
3 Read other sync file
4 Delete files until there is space for the 0.6 GB file
5 Write rest of line
6 Read other sync file
7 Delete files until there is space for the 0.6 GB file

@julianxhokaxhiu
Copy link

Has there any progress been made on this direction? it's becoming quite an effort to cleanup manually the vcpkg cache directory from time to time. Thank you in advance 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants