ETT-574: Deposit to new storage with versitygw S3 gateway #168
Conversation
This simulates depositing to both the Isilon and TrueNAS at one site; it doesn't model deposit to the remote site.
* Test depositing to multiple linked pairtrees (local)
* Add versitygw for filesystem -> s3 gateway test
* PairtreeObjectStore for deposit to filesystem via s3 gateway
* collate test with this

Still TODO:
* Do we need to handle symlinking with the PairtreeObjectStore? (Would rather not..)
* Unit tests for other aspects of PairtreeObjectStore
* Consider how/where to store audit info for PairtreeObjectStore
Primarily testing the different paths vs. ObjectStore; stubs for testing different recording expectations and for behavior with symlinks
@rrotter Not sure how much review you might want to do on this, but figured I'd point it out at least.

Tests are failing; it looks like versitygw isn't starting. I'll take a look at that in the docker compose.

Tests are still failing, I suspect due to permission schemes differing between GitHub Actions and my local environment. I will see what makes sense to do.
* no need for two different s3 gateways in docker compose setup
* remove duplicative setup for s3 in tests
* test that we can transparently write through an existing symlink to a directory with minio
* add dependencies & health checks in docker compose for versitygw
* set permissions for versitygw directory in github actions
Simulates deposit with:
* one LinkedPairtree (current storage)
* two different PairtreeObjectStores (new storage at both sites)
* one ObjectStore (glacier deep archive)
* one PrefixedVersions (data den)

Removes tests for:
* two LinkedPairtrees (not going to do this)
* single PairtreeObjectStore (duplicative of storage_pairtree_object_store.t)
* Avoid re-using feed_audit as it has extra info not related to the deposit (page count, etc.)
* Will want to think about retiring/migrating info from feed_audit when retiring old storage
* Implement zip_size/mets_size for ObjectStore and PairtreeObjectStore
Tests are working in GHA now. I've also added logic to record deposits to a new

One thing that comes to mind is that we may want to extract a separate subclass of
docker-compose.yml (outdated):

```yaml
      ROOT_SECRET_KEY: testingsecretkey
    volumes:
      - ./var/vgw:/usr/local/feed/var/vgw
    command: --health /health posix /usr/local/feed/var/vgw
```
I had issues running this locally: vgw was unhealthy, complaining "2025/10/02 17:33:13 xattr check failed: xattrs are not supported on this filesystem". I saw this with the Otis branch too; passing --nometa fixed it for Otis. I don't know if this implementation depends on xattrs for the tests to pass; Otis, by contrast, uses vgw read-only and there it doesn't seem to matter.
I did an A/B test on main vs this branch, since I always get failing tests -- on this branch I got

```
t/storage_object_store.t ........... 7/?
#   Failed test 'metadata for md5 checksum'
#   at t/storage_object_store.t line 75.
#          got: undef
#     expected: 'LUDWXBrs2Fez94DoW8m9kg=='
```

which suggests we need that xattr stuff.
I don't know if it's a priority to get this running on non-ext3 (or whatever) filesystems since the feed test suite has always been ornery IMHO.
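For what it's worth, the expected value in the failing test above has the shape of a base64-encoded raw MD5 digest, which is how S3-style stores commonly carry an object's checksum (the Content-MD5 convention). A quick illustrative sketch (not code from this PR; the input bytes are made up):

```python
import base64
import hashlib

# Compute an MD5 checksum the way S3-style stores typically carry it:
# base64 of the 16 raw digest bytes, not the 32-character hex form.
digest = hashlib.md5(b"example zip contents").digest()  # 16 raw bytes
md5_b64 = base64.b64encode(digest).decode("ascii")
print(md5_b64)  # 22 base64 characters plus '==' padding, like the value above
```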
I can try on Mac and with --no-meta.
I suspect the xattrs may be needed to store checksums though.
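If it helps triage, here's a quick way to probe whether the filesystem behind a given directory supports user xattrs (the thing versitygw's check complains about). This is an illustrative sketch using the Linux-only `os.setxattr`, not anything from this PR:

```python
import os
import tempfile

# Probe whether the filesystem under the temp dir supports user.* xattrs.
# os.setxattr only exists on Linux, so treat a missing API the same as
# missing filesystem support.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    os.setxattr(path, b"user.feed.test", b"1")
    xattrs_supported = True
except (OSError, AttributeError):
    xattrs_supported = False
finally:
    os.unlink(path)
print(xattrs_supported)
```

Running this inside the container vs. on the host should show whether the bind-mounted volume is the thing dropping xattr support.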
Yeah I edited my comment after comparing test outputs.
I tested with a sidecar approach (https://github.com/versity/versitygw/wiki/POSIX-metadata) and didn't get the above error but a test in t/storage_object_store.t still failed (HTFeed::Storage::ObjectStore with encryption enabled stores the mets and encrypted zip), so I dunno.
This was my change:
```yaml
  versitygw:
    user: "1000:1000"
    image: versity/versitygw
    restart: always
    environment:
      ROOT_ACCESS_KEY: TESTINGACCESSKEY
      ROOT_SECRET_KEY: testingsecretkey
    volumes:
      - ./var/vgw:/usr/local/feed/var/vgw
      - ./var/sidecar:/usr/local/feed/var/sidecar
    command: --health /health posix --sidecar /usr/local/feed/var/sidecar /usr/local/feed/var/vgw
    healthcheck:
      <<: *healthcheck-defaults
      test: [ "CMD", "wget", "--quiet", "--tries=1", "-O", "/dev/null", "http://127.0.0.1:7070/health" ]
```
(had to create ./var/sidecar)
> I don't know if it's a priority to get this running on non-ext3 (or whatever) filesystems since the feed test suite has always been ornery IMHO.
I'd love to have it all working consistently, especially since I'm now the only developer in ETT primarily running tests under Linux. I do have a mini I can try some of this on and at least try to triage.
fwiw, I successfully tested versitygw on my Mac running in /tmp (more or less using their quickstart docs verbatim). Did not try with Docker, just installed from brew.
Re: the failing tests, I am seeing the same thing. I see a lot of errors like:

```
gpg: error running '/usr/bin/gpg-agent': exit status 2
```

I suspect this is related to the test failures.
I also note that using the sidecar leaves a lot of junk lying around in the sidecar directory after tests. I will see if I can at least use a directory or volume that doesn't leave stuff in the repository filesystem, since we aren't separately checking any of that from other containers.
I'm running into permissions & other weird issues if I try to have the sidecar somewhere else or in a regular Docker volume because I'm not running versitygw as root, so I think we'll have to live with accumulating junk in the temp directory. I can at least make sure it's gitignored.
I think the gpg issue also has to do with the Mac filesystem and/or mapping of Unix sockets. In any case I can move the gpg directory to /tmp/gnupg (as we do in production) and that's fine.
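The fix amounts to pointing gpg at a homedir outside the bind-mounted tree. A minimal sketch (the /tmp/gnupg path matches what's described above; the rest is illustrative, not the actual feed code):

```python
import os

# Point gpg at /tmp/gnupg (as in production) rather than a bind-mounted
# directory, sidestepping gpg-agent socket problems on the mapped filesystem.
gnupg_home = "/tmp/gnupg"
os.makedirs(gnupg_home, mode=0o700, exist_ok=True)  # gpg insists on 0700
os.environ["GNUPGHOME"] = gnupg_home
```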
moseshll left a comment
See comment on docker-compose about my filesystem woes. I don't see anything wrong with the code changes but it would be good to address this issue.
The Mac filesystem is case-preserving but not case-sensitive, and t/lib/HTFeed/Namespace/Test.pm appears to conflict with lib/HTFeed/Namespace/TEST.pm. This appears to fix that issue.
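A quick probe for the underlying condition (a case-insensitive but case-preserving filesystem) -- illustrative only, not part of the fix:

```python
import os
import tempfile

# On a case-insensitive, case-preserving filesystem (e.g. default APFS),
# "Test.pm" and "TEST.pm" resolve to the same entry; on ext4 they do not.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Test.pm"), "w").close()
    case_insensitive = os.path.exists(os.path.join(d, "TEST.pm"))
print(case_insensitive)  # True on default macOS volumes, False on typical Linux
```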
I believe tests were failing because of issues regarding the socket files mounted in from the host filesystem. We're already using /tmp/gnupg in the feed-internal Dockerfile (which we can remove when we update to this version of feed), so this shouldn't cause any issues.
@moseshll All the tests on this branch pass for me via Docker on Mac now. Could you give it another try?

Well, now tests are failing in GitHub.. but I bet I know why & it has to do with permissions for the metadata sidecar...

OK, tests are passing in GitHub, locally for me in Docker running on Linux, and locally for me in Docker running on Mac OS. @moseshll if they work for you on Mac OS I think we can merge this!
moseshll left a comment
Tests pass and output generally matches what we're getting from the CI tests here. Hearty APPROVE
This implements a "PairtreeObjectStore" which we can use with versitygw - a filesystem-to-S3 gateway - to deposit to both the local & remote storage without needing to do NFS mounts. PageTurner, etc., will still access the filesystem with NFS, and it will be laid out in more or less the same way, so we do want to use the "pairtree path".
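As a rough illustration of the "pairtree path" layout (my sketch, not the actual HTFeed code, and ignoring the Pairtree spec's character-cleaning rules):

```python
def pairtree_object_path(namespace: str, objid: str) -> str:
    """Sketch of a pairtree-style object path: the identifier is split into
    two-character segments, ending in a terminal directory named for the
    full id. (Illustrative; real pairtree also escapes special characters.)"""
    segments = [objid[i:i + 2] for i in range(0, len(objid), 2)]
    return "/".join([namespace, "pairtree_root", *segments, objid])

print(pairtree_object_path("mdp", "39015012345678"))
# mdp/pairtree_root/39/01/50/12/34/56/78/39015012345678
```

The point of the S3 gateway is that objects deposited under keys like this land on the filesystem at the same paths NFS clients expect.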
Not yet implemented is anything recording this deposit; in the next PR, I'd like to add a column to feed_audit and separately record each deposit that ingest does.
In addition, this removes minio (another thing that implements S3) that we were using to mock depositing to glacier deep archive -- we don't need two separate test S3 implementations here.
This also tests that symlinks will work correctly with versitygw. For existing material, the terminal directory under sdr1 is a symlink to one of sdr2 .. sdr24; this tests that we can write through symlinks with versitygw -- e.g. that we can write using that symlink in the path and update the existing zip/mets. For new material, we'll just deposit right under sdr1 and not bother with the symlinks. Over time, we can (separately from this process) work toward removing the existing symlinks.
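The symlink scenario can be sketched outside of versitygw like so (hypothetical paths; the real layout uses sdr1 .. sdr24 as described above):

```python
import os
import tempfile

# Model the existing layout: a terminal directory under sdr1 is a symlink
# into sdr2, and a write through the sdr1 path must update the sdr2 target.
root = tempfile.mkdtemp()
target = os.path.join(root, "sdr2", "obj")
os.makedirs(target)
os.makedirs(os.path.join(root, "sdr1"))
link = os.path.join(root, "sdr1", "obj")
os.symlink(target, link)

# Write "through" the symlinked path, as a deposit via the gateway would.
with open(os.path.join(link, "item.zip"), "w") as fh:
    fh.write("updated zip contents")

# The file written through the symlink is the same file in the target dir.
wrote_through = os.path.exists(os.path.join(target, "item.zip"))
print(wrote_through)  # True
```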