Testing pre-requisites for Nessie GC: Two S3 testing projects #5142
Conversation
snazy commented on Sep 9, 2022
- s3minio - JUnit extension to provide a Minio instance (see the usage sketch below)
- s3mock - S3 endpoint serving data via functions, no persistence
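For context, here is a minimal sketch of how a Minio-providing JUnit 5 extension is typically consumed in a test. The names `MinioExtension`, `@Minio` and `MinioAccess` are assumptions for illustration only; the actual s3minio API may differ.

```java
// Sketch only: MinioExtension, @Minio and MinioAccess are hypothetical names,
// not the actual s3minio API.
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(MinioExtension.class)
class ITMinioSmoke {

  // The extension would inject the endpoint, credentials and a pre-created test bucket.
  @Minio MinioAccess minio;

  @Test
  void bucketIsReachable() {
    // AWS SDK v2 style call against the Minio endpoint provided by the extension.
    assertTrue(
        minio.s3Client().listBuckets().buckets().stream()
            .anyMatch(b -> b.name().equals(minio.bucketName())));
  }
}
```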
// Immutable value type generated by the Immutables annotation processor;
// Jackson (de)serializes it via the generated ImmutableBucket class.
@Value.Immutable
@JsonSerialize(as = ImmutableBucket.class)
@JsonDeserialize(as = ImmutableBucket.class)
public interface Bucket {
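For readers unfamiliar with the Immutables/Jackson combination above: the annotation processor generates an `ImmutableBucket` class with a builder, and Jackson routes (de)serialization through that generated class. A minimal sketch, assuming a hypothetical `name()` accessor on `Bucket` purely for illustration:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class BucketJsonExample {
  public static void main(String[] args) throws Exception {
    // "name" is an assumed property for illustration; it is not taken from the PR.
    ObjectMapper mapper = new ObjectMapper();
    Bucket bucket = ImmutableBucket.builder().name("my-bucket").build();
    String json = mapper.writeValueAsString(bucket);    // e.g. {"name":"my-bucket"}
    Bucket back = mapper.readValue(json, Bucket.class);  // materializes as ImmutableBucket
    System.out.println(json + " / " + back);
  }
}
```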
Can we just depend on this library (https://github.com/adobe/S3Mock) instead of implementing and maintaining similar/same code ourselves?
If you look closely at both implementations, you will see that they do different things.
I do find the data folder in this PR to be similar to https://github.com/adobe/S3Mock/tree/main/server/src/main/java/com/adobe/testing/s3mock/dto
But yeah, I do have to look closely as you mentioned.
IMO this is a legitimate question, so my suggestion would be to briefly summarize in a README file inside the module folder why we cannot use this library, WDYT?
That way we don't need to expect everyone to look at both implementations closely to tell what the exact difference is and why we had to roll our own.
Again - the Adobe S3Mock does something very different. This one intentionally does NOT serve "real" content.
Not sure why comparing with another library that serves a different purpose would make sense.
This one serves content and listings from functions, not from a real file system, including put operations.
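To make the "served from functions" point concrete, here is an illustrative sketch (not the code in this PR; all type and method names are assumptions): listings and object contents are computed on demand from functions, so the mock can pretend to hold millions of objects without persisting anything.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntFunction;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Illustrative only: the names below are assumptions, not the PR's API.
// The point is that keys and contents are computed by functions on demand,
// so the mock "contains" arbitrarily many objects without storing any data.
public class FunctionBackedBucket {

  private final int numObjects;
  private final IntFunction<String> intToKey;

  public FunctionBackedBucket(int numObjects, IntFunction<String> intToKey) {
    this.numObjects = numObjects;
    this.intToKey = intToKey;
  }

  /** Listings are generated, not read from a file system. */
  public Stream<String> listKeys() {
    return IntStream.range(0, numObjects).mapToObj(intToKey);
  }

  /** Object content is derived from the key, again without persistence. */
  public byte[] getObject(String key) {
    return ("content-for-" + key).getBytes(StandardCharsets.UTF_8);
  }
}
```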
Thanks for the explanation. I may need a while to understand all these object store interface implementations as I am not very familiar with them.
If no one else reviews, I will dig deep to understand these.
Codecov Report: Base: 83.60% // Head: 83.60% // No change to project coverage 👍

@@           Coverage Diff           @@
##             main    #5142   +/-  ##
=======================================
  Coverage   83.60%   83.60%
=======================================
  Files          49       49
  Lines        1812     1812
  Branches      348      348
=======================================
  Hits         1515     1515
  Misses        218      218
  Partials       79       79

☔ View full report at Codecov.
    .mapToObj(intToKey)
    .collect(Collectors.toList()));

// This one takes long - the number of round-trips makes it slow. TODO: too slow for a unit test?
This test took 8 seconds for me on a local run. Should be OK!
Compared the DTOs with other standard implementations; this looks like a subset (only the required functionality).
The util and resource files I have reviewed to the best of my knowledge (which may be limited).
As this is just test code, I don't want to block other dependent PRs. Hence, I am OK with merging if no one else reviews this week.
See also the readme in `./gc/README.md`.

GC is implemented as a two-phase approach:

1. Identify all references to live content versions by walking all named references. Content versions for Iceberg are tuples of (commit ID + snapshot ID + content ID + metadata reference). These `ContentReference`s are stored in an instance of `LiveContentSetsStore`. This is the "mark phase". The set of content references is called the "live contents set".
2. For each content ID, resolve all `ContentReference`s to file objects and populate a bloom filter with those. Then traverse the base storage locations for each content ID and delete all files that are not contained in the bloom filter, but do not delete files that are "too new" (created since the "mark phase" started). A bloom-filter sketch follows at the end of this description.

A command line tool `nessie-gc-shell` is provided as well. It supports running mark-and-sweep, mark only, sweep only, plus a few supplemental commands.

All pieces are pluggable; there are Java interfaces (or abstract classes) providing the API for all pieces. This makes the code much easier to test and helps with using alternative implementations.

The base functionality is contained in the `nessie-gc-base` project. It has default implementations to identify live content references, an in-memory live-content-sets storage and a default implementation to run expire per content. Support for content types, like Iceberg tables, implements the functionality to expand content references to actual file objects, list file objects from base locations and delete files. The "sweep" phase for Iceberg therefore also acts as the "Nessie aware delete orphan files".

A live-content-sets storage implementation using JDBC is present as well. Small Nessie repositories using a "one-off" mark-and-sweep run can get away with the in-memory contents storage, but big Nessie repositories would require too much heap. Another reason to eventually use a persistent live-content-sets storage is to allow the deletion of the files occupied by completely unreferenced tables - tables that are not referenced by any live Nessie commit. This functionality is not implemented, but would work by collecting the base locations for all content IDs that are no longer present in the most recent live-contents-set.

This PR still has a few non-`Test*` classes that were only added to prove that the concept works quickly, does not require much heap and does not use a lot of CPU. Those "tests" in `nessie-gc-base` and `nessie-gc-iceberg-inttest` will be removed, because they add no real value and were only needed to validate the concept. The "final" proof showed that a mark-and-sweep (deleting ~1.7M files, collected from "many" branches, commits, content objects) alone can complete in roughly one minute with a 2 GB Java heap.

Depends on: #5142 #5206
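To make the bloom-filter based "sweep" phase described above concrete, here is a minimal sketch using Guava's `BloomFilter`. All names (`FileRef`, `liveFileUris`, `deleteFile`, the cutoff handling) are illustrative assumptions, not the actual Nessie GC API.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Illustrative sketch of the sweep idea; all names below are assumptions,
// not the actual Nessie GC API.
public class SweepSketch {

  /** A file discovered while walking a base storage location. */
  record FileRef(String uri, Instant createdAt) {}

  static void sweep(
      Stream<String> liveFileUris,         // expanded from ContentReferences (mark-phase result)
      Stream<FileRef> filesInBaseLocation, // listing of a content's base location
      Instant markPhaseStart,              // files created after this instant are never deleted
      Consumer<String> deleteFile) {

    // 1. Populate a bloom filter with all live file URIs for this content ID.
    BloomFilter<String> live =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.0001);
    liveFileUris.forEach(live::put);

    // 2. Walk the base location; delete files that are neither (probably) live nor "too new".
    filesInBaseLocation
        // files created after the mark phase started are never deletion candidates
        .filter(f -> f.createdAt().isBefore(markPhaseStart))
        // bloom-filter false positives only keep extra files; live files are never deleted
        .filter(f -> !live.mightContain(f.uri()))
        .forEach(f -> deleteFile.accept(f.uri()));
  }
}
```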