
PUT_SYNC does not work with S3 #129

Closed
acnewton opened this issue Mar 17, 2020 · 4 comments
Labels: enhancement (New feature or request)

@acnewton
Contributor

What we want
Sync data from AWS S3 bucket to iRODS using the PUT_SYNC operation.

What we did
python -m irods_capability_automated_ingest.irods_sync start --ignore_cache --event_handler 'irods_capability_automated_ingest.examples.sync' --synchronous --progress --s3_keypair aws-s3-keypair --s3_region_name eu-west-1 --log_filename /home/irods/log/test.log --log_level DEBUG /bucket-name /tempZone/home/rods/target

Output:

[Elapsed Time: 0:00:01] |####################################################################################################################################################| (Time: 0:00:01) count: 1 tasks: ------ failures: 1 retries: ------

What we expected
Successful synchronization: file copied from S3 bucket to iRODS target collection.

What we got
An error in the log stating that the file cannot be found locally.

Possible solution
It appears that the functionality to first download/stream the data from S3 is missing from the upload_file() and sync_file() functions in irods_capability_automated_ingest/sync_irods.py.
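The missing piece is essentially a chunked transfer loop. As a minimal sketch (not the project's actual implementation), the helper below copies any readable stream into any writable handle while computing an md5 digest on the fly; in the real tool the source would be the response stream from the Minio client's get_object() and the destination a data object handle opened via the python-irodsclient. The function name and signature here are hypothetical:

```python
import hashlib

def stream_s3_to_irods(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy a readable stream into a writable handle in chunks.

    Returns the md5 hex digest computed during the transfer, which can
    later be compared against the object's checksum.
    """
    md5 = hashlib.md5()
    while True:
        data = src.read(chunk_size)
        if not data:
            break
        md5.update(data)
        dst.write(data)
    return md5.hexdigest()
```

Because both ends are plain file-like objects, the same loop works for S3-to-iRODS, iRODS-to-S3, or local testing with in-memory buffers.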

@zfed

zfed commented Oct 27, 2022

We experienced the same issue with operation.PUT.
Is there a solution yet?

@alanking
Collaborator

This work was done a while ago, but I dropped the ball on getting it reviewed: #134. The work still needs to be rebased (a lot has moved since then) and tested.

@alanking alanking added the enhancement New feature or request label Oct 27, 2022
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jun 30, 2023
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jun 30, 2023
@avkrishnamurthy
Collaborator

avkrishnamurthy commented Jul 6, 2023

We have brainstormed several potential solutions to this issue. The current plan is to implement option 2 and subsequently option 1, which will be the most efficient. Option 4 is already hypothetically working thanks to acnewton's pull request from 2020 (#134); it will be the fallback if options 1 and 2 don't work out.


1. Multi-read from S3 -> Multi-write to iRODS

  • Pros:
    • Best parallelization
    • Reduced bottlenecks with multiple buffers
    • Scalability
  • Cons:
    • More complex to implement
    • May use more resources for multiple buffers and multiple threads

2. Single stream/read from S3 -> Multi-write to iRODS

  • Pros:
    • Multi-write on the iRODS side is probably easier to implement than multi-read from S3
    • Stepping stone to multi-read, multi-write
  • Cons:
    • Speed limited by single stream from S3

3. Multi-read from S3 -> Single stream/write to iRODS

  • Pros:
    • Uses some parallelization
  • Cons:
    • If multi-read from S3 can be made to work, we should implement full multi-read/multi-write instead

4. Single stream/read from S3 -> Single stream/write to iRODS

  • Pros:
    • Does not have significant storage requirements
    • Simpler to implement/already implemented
  • Cons:
    • Very slow
    • Limited ways to optimize

5. Download/Upload

  • Pros:
    • Should work with existing PRC code
  • Cons:
    • Can require an arbitrarily large amount of disk space for the download

6. Register->Replicate->Trim

  • Pros:
    • Would not require any/many changes to existing code in PRC
  • Cons:
    • Requires extra resource for replication
    • Will require the S3 resource in iRODS to be configured to talk to the same bucket
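Option 2 (single stream from S3, parallel writes to iRODS) could be sketched as a sequential reader that hands each chunk, together with its byte offset, to a pool of writer threads. Everything here is hypothetical illustration, not the ingest tool's code: open_dst stands in for whatever opens an independent seekable handle to the same target (e.g. a fresh iRODS data object handle per thread); the test below exercises it against a plain local file instead:

```python
import concurrent.futures

def single_read_multi_write(src, open_dst, chunk_size=8 * 1024 * 1024, workers=4):
    """Read src sequentially; write each chunk at its offset from a worker thread.

    open_dst() must return a new, independent seekable writer for the same
    target, so that concurrent writers do not share a file position.
    """
    def write_chunk(offset, data):
        with open_dst() as dst:
            dst.seek(offset)
            dst.write(data)

    offset = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            futures.append(pool.submit(write_chunk, offset, data))
            offset += len(data)
        for f in futures:
            f.result()  # re-raise any writer-thread exception
```

Because each writer gets its own handle and a disjoint byte range, no locking is needed; the read side remains the single-stream bottleneck that option 1 would remove.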

avkrishnamurthy pushed a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jul 19, 2023
Added the functionality to transfer data from an object in S3 to an object in iRODS, rather than only registering the file in place. The transfer uses the Minio library. It is also possible to append to a file by setting an offset. The md5sum hash is calculated during streaming, and the final hash is compared to the ETag header from the S3 object. Note that this ETag is not always the md5sum when the file was uploaded via multipart upload; a more general way of comparing checksums will be necessary.
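The multipart caveat in the commit message above can be made concrete. For multipart uploads, S3's ETag is (per widely documented but not officially guaranteed behavior) the md5 of the concatenated per-part md5 digests, suffixed with "-" and the part count, so comparing it against a plain streamed md5 fails. A sketch of that formula, assuming the part size is known:

```python
import hashlib

def multipart_etag(data, part_size):
    """Compute the ETag S3 assigns to a multipart upload of `data`:
    md5 of the concatenated per-part md5 digests, plus "-<part count>".
    A single part falls back to the plain md5 used for simple PUTs."""
    digests = [hashlib.md5(data[i:i + part_size]).digest()
               for i in range(0, len(data), part_size)]
    if len(digests) == 1:
        return hashlib.md5(data).hexdigest()
    return hashlib.md5(b"".join(digests)).hexdigest() + "-" + str(len(digests))
```

This is why the commit notes that a more general checksum comparison is needed: verifying against a multipart ETag requires knowing the uploader's part size, which the downloader generally does not.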
avkrishnamurthy added a commit to avkrishnamurthy/irods_capability_automated_ingest that referenced this issue Jul 19, 2023
- Added cli option for Amazon S3 multipart upload file chunk size for calculating checksum

- Changes based on pr review, cleaned up code and some error handling

- Reverted behavior for null operation and changed wording for multipart

- Rewording multipart option and TODO for no-op
alanking pushed a commit that referenced this issue Jul 19, 2023
Added the functionality to transfer data from an object in S3 to an object in iRODS, rather than only registering the file in place. The transfer uses the Minio library. It is also possible to append to a file by setting an offset. The md5sum hash is calculated during streaming, and the final hash is compared to the ETag header from the S3 object. Note that this ETag is not always the md5sum when the file was uploaded via multipart upload; a more general way of comparing checksums will be necessary.
alanking pushed a commit that referenced this issue Jul 19, 2023
- Added cli option for Amazon S3 multipart upload file chunk size for calculating checksum

- Changes based on pr review, cleaned up code and some error handling

- Reverted behavior for null operation and changed wording for multipart

- Rewording multipart option and TODO for no-op
@alanking
Collaborator

The multi-stream solution will be handled in the work for #207.

If the initial solution is complete, I think this can be closed, @avkrishnamurthy.

@alanking alanking added this to the 0.5.0 milestone Mar 8, 2024