Testing pre-requisites for Nessie GC: Two S3 testing projects #5142

Merged

snazy merged 1 commit into main from ngc-testing on Sep 23, 2022

Conversation

snazy
Member

@snazy snazy commented Sep 9, 2022

  • s3minio - JUnit extension to provide a Minio instance (a usage sketch follows after this list)
  • s3mock - S3 endpoint serving data via functions, no persistence
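
A minimal usage sketch for the s3minio idea; MinioExtension and MinioAccess are hypothetical names used only to illustrate how a JUnit 5 extension providing a Minio instance is typically consumed, not the actual module API:

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

// Hypothetical extension and parameter type - illustration only.
@ExtendWith(MinioExtension.class)
class ITAgainstMinio {

  @Test
  void listBucket(MinioAccess minio) {
    // The extension is assumed to start a Minio container once per test class and
    // inject endpoint, credentials and a pre-created bucket into test methods.
    var s3 = minio.s3Client();
    var response = s3.listObjectsV2(b -> b.bucket(minio.bucket()));
    // assertions against 'response' would go here
  }
}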

@Value.Immutable
@JsonSerialize(as = ImmutableBucket.class)
@JsonDeserialize(as = ImmutableBucket.class)
public interface Bucket {
  // ... bucket attributes elided in this review excerpt
}
Contributor

Can we just depend on this library (https://github.com/adobe/S3Mock) instead of implementing and maintaining similar/same code ourselves?

Member Author

If you look closely at both implementations, you will see that they do different things.

Contributor

I do find the data folder in this PR to be similar to https://github.com/adobe/S3Mock/tree/main/server/src/main/java/com/adobe/testing/s3mock/dto

But yeah, I do have to look closely as you mentioned.

Contributor

IMO this is a legitimate question, so my suggestion would be to briefly summarize in a README file inside the module folder why we cannot use this library, WDYT?
That way we don't need to expect everyone to look at both implementations closely to tell what the exact difference is and why we had to roll our own.

Member Author

Again - the Adobe S3Mock does something very different. This one intentionally does NOT serve "real" content.

Member Author

Not sure why comparing with another library that serves a different purpose would make sense.
This one serves content and listings from functions, not a real file system, including put operations.
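
To make the distinction concrete, here is a conceptual sketch of the "serving from functions" approach; MockS3, MockS3Server and their builder methods are hypothetical placeholders, not the actual s3mock API:

import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

// Conceptual sketch only - names are hypothetical, the real s3mock API may differ.
MockS3 mock =
    MockS3.builder()
        // object content is computed on the fly from the object key
        .objectRetriever(key -> ("content-for-" + key).getBytes(StandardCharsets.UTF_8))
        // listings are likewise produced by a function, not by scanning a directory
        .objectLister(prefix -> IntStream.range(0, 100_000).mapToObj(i -> prefix + "file-" + i))
        .build();

try (MockS3Server server = mock.start()) {
  // point an AWS S3 client at the server's endpoint and run tests;
  // nothing is persisted, so arbitrarily large "buckets" need no disk space
}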

Contributor

Thanks for the explanation. I may need a while to understand all these object store interface implementations as I am not very familiar with them.

If no one else reviews, I will dig deep to understand these.

@codecov

codecov bot commented Sep 9, 2022

Codecov Report

Base: 83.60% // Head: 83.60% // No change to project coverage 👍

Coverage data is based on head (d3e32c2) compared to base (f3dd999).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5142   +/-   ##
=======================================
  Coverage   83.60%   83.60%           
=======================================
  Files          49       49           
  Lines        1812     1812           
  Branches      348      348           
=======================================
  Hits         1515     1515           
  Misses        218      218           
  Partials       79       79           
Flag        Coverage Δ
java        ∅ <ø> (∅)
javascript  81.98% <ø> (ø)
python      83.95% <ø> (ø)

Flags with carried forward coverage won't be shown.


.mapToObj(intToKey)
.collect(Collectors.toList()));

// This one takes long - the number of round-trips makes it slow. TODO: too slow for a unit test?
Contributor

This test took 8 seconds for me on a local run. Should be ok!

Contributor

@ajantha-bhat ajantha-bhat left a comment

Compared the DTO with other standard implementations. This looks like a subset of it (only the required functionality).
I have reviewed the util and resource files to the best of my knowledge (which may be very limited)!

As this is just test code, I don't want to block other dependent PRs. Hence, I am ok with merging if no one else reviews this week.

@snazy snazy merged commit 6b1c222 into projectnessie:main Sep 23, 2022
@snazy snazy deleted the ngc-testing branch September 23, 2022 08:57
snazy added a commit that referenced this pull request Oct 17, 2022
See also the readme in `./gc/README.md`.

GC is implemented as a two-phase approach:
1. Identify all references to live content versions by walking all
   named references. Content versions for Iceberg are tuples of
   commit-ID + snapshot ID + content ID + metadata reference.
   These `ContentReference`s are collected and stored in an instance
   of `LiveContentSetsStore`. This is the "mark phase". The set of
   content-references is called the "live contents set".
2. For each content-ID, resolve all `ContentReferences` to file
   objects and populate a bloom-filter with those. Then traverse
   the base storage locations for each content-ID and delete all
   files that are not contained in the bloom filter, but do not
   delete files that are "too new" (created since the "mark phase"
   started).

A command line tool `nessie-gc-shell` is provided as well. It
supports running mark-and-sweep, mark only, sweep only plus a few
supplemental commands.

All pieces are pluggable: Java interfaces (or abstract classes)
provide the API for each piece. This makes the code much easier to
test and allows plugging in alternative implementations.

The base functionality is contained in the `nessie-gc-base` project.
This has default implementations to identify live content
references, an in-memory live-content-sets-storage and a default
implementation to run expire per content.

Support for content types, like Iceberg tables, implements
the functionality to expand content-references to actual
file-objects, list file-objects from base locations and delete
files. The "sweep" phase for Iceberg therefore also acts as the
"Nessie aware delete orphan files".

A live-content-sets-storage implementation using JDBC is present as
well.  Small Nessie repositories using a "one-off" mark-and-sweep
run can get away with the in-memory contents-storage. But big Nessie
repositories would require too much heap.

Another reason to eventually use a persistent
live-content-sets-storage is to allow the deletion of the files
occupied by completely unreferenced tables - tables that are not
referenced by any live Nessie commit.  This functionality is not
implemented, but would work by collecting the base-locations for
all content-IDs that are no longer present in the most recent
live-contents-set.

This PR still has a few non-`Test*` classes that were only added to
prove that the concept works quickly, does not require much heap and
does not use a lot of CPU. Those "tests" in `nessie-gc-base` and
`nessie-gc-iceberg-inttest` will be removed, because they add no
real value and were only needed to validate the concept. The "final"
proof showed that a mark-and-sweep run alone (deleting ~1.7M files,
collected from "many" branches, commits and content objects) can
complete in roughly one minute with a 2GB Java heap.

Depends on: #5142 #5206
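
As a rough illustration of the sweep phase described in the commit message above, here is a sketch using a Guava bloom filter; the FileObject record, the listing stream and the delete callback are hypothetical placeholders, not the actual nessie-gc API:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.function.Consumer;
import java.util.stream.Stream;

class SweepSketch {

  /** Hypothetical view of a file under a content's base location. */
  record FileObject(String path, Instant created) {}

  static void sweepContent(
      Stream<String> liveFilePaths,     // expanded from the live ContentReferences (mark phase)
      Stream<FileObject> baseLocation,  // all files found under the content's base location
      Instant markPhaseStarted,
      Consumer<String> deleteFile) {

    // 1. Populate the bloom filter with all live file paths.
    BloomFilter<String> live =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.0001);
    liveFilePaths.forEach(live::put);

    // 2. Walk the base location and delete files that are neither live nor "too new".
    baseLocation
        .filter(f -> f.created().isBefore(markPhaseStarted))  // skip files created after the mark phase started
        .filter(f -> !live.mightContain(f.path()))            // no false negatives: "not contained" is safe to delete
        .forEach(f -> deleteFile.accept(f.path()));
  }
}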