Testing pre-requisites for Nessie GC: Two S3 testing projects #5142
Conversation
snazy commented on Sep 9, 2022
- s3minio - JUnit extension to provide a Minio instance (see the usage sketch below)
- s3mock - S3 endpoint serving data via functions, no persistence
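For context, here is a minimal sketch of how a Minio-providing JUnit 5 extension is typically consumed in a test. The names `MinioExtension`, `@Minio` and `MinioAccess` are assumptions for illustration only; the actual s3minio API may differ.

```java
// Sketch only: MinioExtension, @Minio and MinioAccess are hypothetical names,
// not the actual s3minio API.
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(MinioExtension.class)
class ITMinioSmoke {

  // The extension would inject the endpoint, credentials and a pre-created test bucket.
  @Minio MinioAccess minio;

  @Test
  void bucketIsReachable() {
    // AWS SDK v2 style call against the Minio endpoint provided by the extension.
    assertTrue(
        minio.s3Client().listBuckets().buckets().stream()
            .anyMatch(b -> b.name().equals(minio.bucketName())));
  }
}
```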
// Immutable value type generated by the Immutables annotation processor;
// Jackson (de)serializes it via the generated ImmutableBucket class.
@Value.Immutable
@JsonSerialize(as = ImmutableBucket.class)
@JsonDeserialize(as = ImmutableBucket.class)
public interface Bucket {
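For readers unfamiliar with the Immutables/Jackson combination above: the annotation processor generates an `ImmutableBucket` class with a builder, and Jackson routes (de)serialization through that generated class. A minimal sketch, assuming a hypothetical `name()` accessor on `Bucket` purely for illustration:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class BucketJsonExample {
  public static void main(String[] args) throws Exception {
    // "name" is an assumed property for illustration; it is not taken from the PR.
    ObjectMapper mapper = new ObjectMapper();
    Bucket bucket = ImmutableBucket.builder().name("my-bucket").build();
    String json = mapper.writeValueAsString(bucket);    // e.g. {"name":"my-bucket"}
    Bucket back = mapper.readValue(json, Bucket.class);  // materializes as ImmutableBucket
    System.out.println(json + " / " + back);
  }
}
```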
Can we just depend on this library (https://github.com/adobe/S3Mock) instead of implementing and maintaining similar/same code ourselves?
If you look closely at both implementations, you will see that they do different things.
I do find the data folder in this PR to be similar to https://github.com/adobe/S3Mock/tree/main/server/src/main/java/com/adobe/testing/s3mock/dto
But yeah, I do have to look closely as you mentioned.
IMO this is a legitimate question, so my suggestion would be to briefly summarize in a README file inside the module folder why we cannot use this library, WDYT?
That way we don't need to expect everyone to look at both implementations closely to tell what the exact difference is and why we had to roll our own.
Again - the Adobe S3Mock does something very different. This one intentionally does NOT serve "real" content.
Not sure why comparing with another library that serves a different purpose would make sense.
This one serves content and listings from functions, not from a real file system, including put operations.
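To make the "served from functions" point concrete, here is an illustrative sketch (not the code in this PR; all type and method names are assumptions): listings and object contents are computed on demand from functions, so the mock can pretend to hold millions of objects without persisting anything.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntFunction;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Illustrative only: the names below are assumptions, not the PR's API.
// The point is that keys and contents are computed by functions on demand,
// so the mock "contains" arbitrarily many objects without storing any data.
public class FunctionBackedBucket {

  private final int numObjects;
  private final IntFunction<String> intToKey;

  public FunctionBackedBucket(int numObjects, IntFunction<String> intToKey) {
    this.numObjects = numObjects;
    this.intToKey = intToKey;
  }

  /** Listings are generated, not read from a file system. */
  public Stream<String> listKeys() {
    return IntStream.range(0, numObjects).mapToObj(intToKey);
  }

  /** Object content is derived from the key, again without persistence. */
  public byte[] getObject(String key) {
    return ("content-for-" + key).getBytes(StandardCharsets.UTF_8);
  }
}
```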
Thanks for the explanation. I may need a while to understand all these object store interface implementations as I am not very familiar with them.
If no one else reviews, I will dig deep to understand these.
Codecov Report: Base: 83.60% // Head: 83.60% // No change to project coverage 👍

@@           Coverage Diff           @@
##             main    #5142   +/-  ##
=======================================
  Coverage   83.60%   83.60%
=======================================
  Files          49       49
  Lines        1812     1812
  Branches      348      348
=======================================
  Hits         1515     1515
  Misses        218      218
  Partials       79       79

☔ View full report at Codecov.
    .mapToObj(intToKey)
    .collect(Collectors.toList()));

// This one takes long - the number of round-trips makes it slow. TODO: too slow for a unit test?
This test took 8 seconds for me on a local run. Should be OK!
Compared the DTOs with other standard implementations; this looks like a subset (only the required functionality).
The util and resource files I have reviewed to the best of my knowledge (which may be limited).
As this is just test code, I don't want to block other dependent PRs. Hence, I am OK with merging if no one else reviews this week.
See also the readme in `./gc/README.md`.

GC is implemented as a two-phase approach:

1. Identify all references to live content versions by walking all named references. Content versions for Iceberg are tuples of (commit ID + snapshot ID + content ID + metadata reference). These `ContentReference`s are stored in an instance of `LiveContentSetsStore`. This is the "mark phase". The set of content references is called the "live contents set".
2. For each content ID, resolve all `ContentReference`s to file objects and populate a bloom filter with those. Then traverse the base storage locations for each content ID and delete all files that are not contained in the bloom filter, but do not delete files that are "too new" (created since the "mark phase" started). A bloom-filter sketch follows at the end of this description.

A command line tool `nessie-gc-shell` is provided as well. It supports running mark-and-sweep, mark only, sweep only, plus a few supplemental commands.

All pieces are pluggable; there are Java interfaces (or abstract classes) providing the API for all pieces. This makes the code much easier to test and helps with using alternative implementations.

The base functionality is contained in the `nessie-gc-base` project. It has default implementations to identify live content references, an in-memory live-content-sets storage and a default implementation to run expire per content. Support for content types, like Iceberg tables, implements the functionality to expand content references to actual file objects, list file objects from base locations and delete files. The "sweep" phase for Iceberg therefore also acts as the "Nessie aware delete orphan files".

A live-content-sets storage implementation using JDBC is present as well. Small Nessie repositories using a "one-off" mark-and-sweep run can get away with the in-memory contents storage, but big Nessie repositories would require too much heap. Another reason to eventually use a persistent live-content-sets storage is to allow the deletion of the files occupied by completely unreferenced tables - tables that are not referenced by any live Nessie commit. This functionality is not implemented, but would work by collecting the base locations for all content IDs that are no longer present in the most recent live-contents-set.

This PR still has a few non-`Test*` classes that were only added to prove that the concept works quickly, does not require much heap and does not use a lot of CPU. Those "tests" in `nessie-gc-base` and `nessie-gc-iceberg-inttest` will be removed, because they add no real value and were only needed to validate the concept. The "final" proof showed that a mark-and-sweep (deleting ~1.7M files, collected from "many" branches, commits, content objects) alone can complete in roughly one minute with a 2 GB Java heap.

Depends on: #5142 #5206
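To make the bloom-filter based "sweep" phase described above concrete, here is a minimal sketch using Guava's `BloomFilter`. All names (`FileRef`, `liveFileUris`, `deleteFile`, the cutoff handling) are illustrative assumptions, not the actual Nessie GC API.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Illustrative sketch of the sweep idea; all names below are assumptions,
// not the actual Nessie GC API.
public class SweepSketch {

  /** A file discovered while walking a base storage location. */
  record FileRef(String uri, Instant createdAt) {}

  static void sweep(
      Stream<String> liveFileUris,         // expanded from ContentReferences (mark-phase result)
      Stream<FileRef> filesInBaseLocation, // listing of a content's base location
      Instant markPhaseStart,              // files created after this instant are never deleted
      Consumer<String> deleteFile) {

    // 1. Populate a bloom filter with all live file URIs for this content ID.
    BloomFilter<String> live =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.0001);
    liveFileUris.forEach(live::put);

    // 2. Walk the base location; delete files that are neither (probably) live nor "too new".
    filesInBaseLocation
        // files created after the mark phase started are never deletion candidates
        .filter(f -> f.createdAt().isBefore(markPhaseStart))
        // bloom-filter false positives only keep extra files; live files are never deleted
        .filter(f -> !live.mightContain(f.uri()))
        .forEach(f -> deleteFile.accept(f.uri()));
  }
}
```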