
PUT_SYNC does not work with S3 #129

Closed
acnewton opened this issue Mar 17, 2020 · 4 comments
Labels: enhancement (New feature or request)

@acnewton
Contributor

What we want
Sync data from AWS S3 bucket to iRODS using the PUT_SYNC operation.

What we did
python -m irods_capability_automated_ingest.irods_sync start --ignore_cache --event_handler 'irods_capability_automated_ingest.examples.sync' --synchronous --progress --s3_keypair aws-s3-keypair --s3_region_name eu-west-1 --log_filename /home/irods/log/test.log --log_level DEBUG /bucket-name /tempZone/home/rods/target

Output:

[Elapsed Time: 0:00:01] |####################################################################################################################################################| (Time: 0:00:01) count: 1 tasks: ------ failures: 1 retries: ------

What we expected
Successful synchronization: file copied from S3 bucket to iRODS target collection.

What we got
An error in the log stating that the file cannot be found locally.

Possible solution
It appears that the functionality to first download/stream the data from S3 is missing from the upload_file() and sync_file() functions in irods_capability_automated_ingest/sync_irods.py.
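The missing piece is essentially a chunked transfer loop. As a minimal sketch (not the project's actual implementation), the helper below copies any readable stream into any writable handle while computing an md5 digest on the fly; in the real tool the source would be the response stream from the Minio client's get_object() and the destination a data object handle opened via the python-irodsclient. The function name and signature here are hypothetical:

```python
import hashlib

def stream_s3_to_irods(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy a readable stream into a writable handle in chunks.

    Returns the md5 hex digest computed during the transfer, which can
    later be compared against the object's checksum.
    """
    md5 = hashlib.md5()
    while True:
        data = src.read(chunk_size)
        if not data:
            break
        md5.update(data)
        dst.write(data)
    return md5.hexdigest()
```

Because both ends are plain file-like objects, the same loop works for S3-to-iRODS, iRODS-to-S3, or local testing with in-memory buffers.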

@zfed

zfed commented Oct 27, 2022

We experienced the same issue with operation.PUT.
Is there a solution yet?

@alanking
Collaborator

This work was done a while ago, but I dropped the ball on getting it reviewed: #134. The work still needs to be rebased (a lot has moved since then) and tested.

@alanking alanking added the enhancement New feature or request label Oct 27, 2022
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jun 30, 2023
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jun 30, 2023
@avkrishnamurthy
Collaborator

avkrishnamurthy commented Jul 6, 2023

We have brainstormed several potential solutions to this issue. The current plan is to implement option 2 and subsequently option 1, which will be the most efficient. Option 4 is already hypothetically working thanks to acnewton's pull request from 2020 (#134); it will be the fallback if options 1 and 2 don't work out.


1. Multi-read from S3 -> Multi-write to iRODS

  • Pros:
    • Best parallelization
    • Reduced bottlenecks with multiple buffers
    • Scalability
  • Cons:
    • More complex to implement
    • May use more resources for multiple buffers and multiple threads

2. Single stream/read from S3 -> Multi-write to iRODS

  • Pros:
    • Multi-write on the iRODS side is probably easier to implement than multi-read from S3
    • Stepping stone to multi-read, multi-write
  • Cons:
    • Speed limited by single stream from S3

3. Multi-read from S3 -> Single stream/write to iRODS

  • Pros:
    • Uses some parallelization
  • Cons:
    • If multi-read from S3 can be made to work, we should implement full multi-read/multi-write instead

4. Single stream/read from S3 -> Single stream/write to iRODS

  • Pros:
    • Does not have significant storage requirements
    • Simpler to implement/already implemented
  • Cons:
    • Very slow
    • Limited ways to optimize

5. Download/Upload

  • Pros:
    • Should work with existing PRC code
  • Cons:
    • Can require an arbitrarily large amount of disk space for the download

6. Register->Replicate->Trim

  • Pros:
    • Would not require any/many changes to existing code in PRC
  • Cons:
    • Requires extra resource for replication
    • Will require the S3 resource in iRODS to be configured to talk to the same bucket
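Option 2 (single stream from S3, parallel writes to iRODS) could be sketched as a sequential reader that hands each chunk, together with its byte offset, to a pool of writer threads. Everything here is hypothetical illustration, not the ingest tool's code: open_dst stands in for whatever opens an independent seekable handle to the same target (e.g. a fresh iRODS data object handle per thread); the test below exercises it against a plain local file instead:

```python
import concurrent.futures

def single_read_multi_write(src, open_dst, chunk_size=8 * 1024 * 1024, workers=4):
    """Read src sequentially; write each chunk at its offset from a worker thread.

    open_dst() must return a new, independent seekable writer for the same
    target, so that concurrent writers do not share a file position.
    """
    def write_chunk(offset, data):
        with open_dst() as dst:
            dst.seek(offset)
            dst.write(data)

    offset = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            futures.append(pool.submit(write_chunk, offset, data))
            offset += len(data)
        for f in futures:
            f.result()  # re-raise any writer-thread exception
```

Because each writer gets its own handle and a disjoint byte range, no locking is needed; the read side remains the single-stream bottleneck that option 1 would remove.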

avkrishnamurthy pushed a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jul 19, 2023
Added the functionality to transfer data from an object in S3 to an object in iRODS, rather than only registering the file in place. The transfer uses the Minio library. It is also possible to append to a file by setting an offset. The md5sum hash is calculated during streaming, and the final hash is compared to the ETag header from the S3 object. Note that this ETag is not always the md5sum when the file was uploaded via multipart upload; a more general way of comparing checksums will be necessary.
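The multipart caveat in the commit message above can be made concrete. For multipart uploads, S3's ETag is (per widely documented but not officially guaranteed behavior) the md5 of the concatenated per-part md5 digests, suffixed with "-" and the part count, so comparing it against a plain streamed md5 fails. A sketch of that formula, assuming the part size is known:

```python
import hashlib

def multipart_etag(data, part_size):
    """Compute the ETag S3 assigns to a multipart upload of `data`:
    md5 of the concatenated per-part md5 digests, plus "-<part count>".
    A single part falls back to the plain md5 used for simple PUTs."""
    digests = [hashlib.md5(data[i:i + part_size]).digest()
               for i in range(0, len(data), part_size)]
    if len(digests) == 1:
        return hashlib.md5(data).hexdigest()
    return hashlib.md5(b"".join(digests)).hexdigest() + "-" + str(len(digests))
```

This is why the commit notes that a more general checksum comparison is needed: verifying against a multipart ETag requires knowing the uploader's part size, which the downloader generally does not.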
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jul 19, 2023
- Added cli option for Amazon S3 multipart upload file chunk size for calculating checksum

- Changes based on pr review, cleaned up code and some error handling

- Reverted behavior for null operation and changed wording for multipart

- Rewording multipart option and TODO for no-op
alanking pushed a commit that referenced this issue Jul 19, 2023
Added the functionality to transfer data from an object in S3 to an object in iRODS, rather than only registering the file in place. The transfer uses the Minio library. It is also possible to append to a file by setting an offset. The md5sum hash is calculated during streaming, and the final hash is compared to the ETag header from the S3 object. Note that this ETag is not always the md5sum when the file was uploaded via multipart upload; a more general way of comparing checksums will be necessary.
alanking pushed a commit that referenced this issue Jul 19, 2023
- Added cli option for Amazon S3 multipart upload file chunk size for calculating checksum

- Changes based on pr review, cleaned up code and some error handling

- Reverted behavior for null operation and changed wording for multipart

- Rewording multipart option and TODO for no-op
@alanking
Collaborator

The multi-stream solution will be handled in the work for #207.

If the initial solution is complete, I think this can be closed, @avkrishnamurthy.

@alanking alanking added this to the 0.5.0 milestone Mar 8, 2024