[sda-download] Implement random access in encrypted files #696

Closed
dbampalikis opened this issue Feb 28, 2024 · 3 comments · Fixed by #831

Comments

@dbampalikis
Contributor

As an sda-user
I want to be able to download specific parts of an encrypted file
In order to be able to get only the region I am interested in

The service currently allows downloading specific byte ranges of unencrypted files, but for encrypted files that is only possible for byte ranges that start at the beginning of the file. We need to support arbitrary byte ranges of encrypted files in order to support the htsget use case.

A/C

  • A PR that allows for downloading specific ranges of encrypted files

@pontus
Contributor

pontus commented Feb 28, 2024

Assuming we want to avoid reencrypting possibly large amounts of data, this should use the support the crypt4gh file format provides for exactly this purpose.

In short, each file/data stream is split into 64 KiB (65536-byte) blocks that are encrypted separately. A block is also the smallest unit for decryption, as MACs are created per block.

This means that to send logical bytes 65535-65536 (base 0), one would need to send the reencrypted header and the first two data blocks (65536 bytes each, plus the extra bytes crypt4gh adds). As the receiver only wants those two bytes, there would also need to be a data edit list in the header instructing it to throw away bytes 0-65534 and 65537-131071.
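A minimal sketch of that arithmetic, not taken from any existing code: `blocks_and_edit_list` is a hypothetical helper, and it assumes the edit list is a sequence of alternating discard/keep lengths starting with a discard, with anything after the last keep being dropped (my reading of the crypt4gh spec).

```python
SEGMENT_SIZE = 65536  # plaintext bytes per crypt4gh data block


def blocks_and_edit_list(start, end):
    """Map an inclusive, 0-based logical byte range to data blocks.

    Returns (first_block, last_block, edit_list), where edit_list is
    [bytes to discard, bytes to keep] relative to the start of the
    first block that is sent. Hypothetical helper for illustration.
    """
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    first_block = start // SEGMENT_SIZE
    last_block = end // SEGMENT_SIZE
    discard = start - first_block * SEGMENT_SIZE
    keep = end - start + 1
    return first_block, last_block, [discard, keep]


# The example from the comment above: bytes 65535-65536 span blocks 0 and 1.
print(blocks_and_edit_list(65535, 65536))
```

With an even number of lengths, the tail of the second block (bytes 65537-131071) does not need its own entry under the assumption stated above.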

So the header reencryption service needs to be able to accept a data edit list to put in the header.

Currently, I think there's only the chacha20_ietf_poly1305 cipher, so a fixed encrypted block size of 65564 bytes can be used, but it might make sense to have a function in the crypt4gh library that takes a header and returns the block size (or similar).
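For reference, 65564 is 65536 plaintext bytes plus a 12-byte nonce and a 16-byte Poly1305 MAC per block. A rough sketch of what such a lookup could look like; `encrypted_block_size` and the table are hypothetical and not an existing crypt4gh library API, and it takes the cipher name rather than a parsed header to keep the example self-contained.

```python
SEGMENT_SIZE = 65536  # plaintext bytes per data block

# Per-block overhead in bytes, keyed by cipher name (assumed table;
# only the one cipher mentioned above is listed).
CIPHER_OVERHEAD = {
    "chacha20_ietf_poly1305": 12 + 16,  # nonce + MAC
}


def encrypted_block_size(cipher):
    """Return the on-disk size of one full encrypted data block."""
    try:
        return SEGMENT_SIZE + CIPHER_OVERHEAD[cipher]
    except KeyError:
        raise ValueError(f"unsupported cipher: {cipher}") from None


print(encrypted_block_size("chacha20_ietf_poly1305"))
```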

@pontus
Contributor

pontus commented Feb 28, 2024

For both unencrypted and encrypted output, there is also a performance motive: rather than requesting the entire object from the archive and returning only the wanted part, we should request only the range actually needed.

For the encrypted case, this is fairly simple: the S3 download client could pass a Range with the bytes wanted.
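As a sketch of how the Range value could be derived from block indices, assuming the fixed 65564-byte encrypted block size discussed earlier. `s3_range_header` and its `data_offset` parameter (where the data blocks start within the stored object, 0 if the header is kept elsewhere) are assumptions made for illustration.

```python
ENCRYPTED_BLOCK_SIZE = 65564  # 65536 plaintext bytes + 28 bytes of overhead


def s3_range_header(first_block, last_block, data_offset=0):
    """Build an HTTP Range value covering whole encrypted blocks.

    Block indices are 0-based and inclusive. Hypothetical helper.
    """
    if first_block < 0 or last_block < first_block or data_offset < 0:
        raise ValueError("invalid block range")
    first_byte = data_offset + first_block * ENCRYPTED_BLOCK_SIZE
    last_byte = data_offset + (last_block + 1) * ENCRYPTED_BLOCK_SIZE - 1
    return f"bytes={first_byte}-{last_byte}"


# First two data blocks of an object that holds only data blocks.
print(s3_range_header(0, 1))
```

If the last block of the file is shorter than a full block, the end of the range will point past the object; HTTP range semantics treat that as "up to the end of the object", so no special handling should be needed for it.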

The question would be if we would prefer having a unified handling for unencrypted and encrypted.

For the unencrypted case, it might make sense to have a reader that maps each call to Read to an S3 call that is essentially handled synchronously, or something similar.
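One possible shape for such a reader, as a sketch only: `RangeReader` and the `fetch_range(start, end)` callable it wraps are hypothetical, with the callable standing in for whatever performs the ranged S3 request (inclusive offsets, returning bytes).

```python
class RangeReader:
    """File-like reader that turns each read() into one ranged fetch."""

    def __init__(self, fetch_range, size, start=0):
        self._fetch_range = fetch_range  # callable(start, end) -> bytes
        self._size = size                # total object size in bytes
        self._pos = start                # next byte to read

    def read(self, n=-1):
        if self._pos >= self._size:
            return b""
        if n is None or n < 0:
            n = self._size - self._pos
        if n == 0:
            return b""
        end = min(self._pos + n, self._size) - 1
        data = self._fetch_range(self._pos, end)  # synchronous call
        self._pos += len(data)
        return data


# Usage with an in-memory stand-in for the archive object.
blob = bytes(range(10))
reader = RangeReader(lambda s, e: blob[s:e + 1], len(blob), start=3)
print(reader.read(4))
```

One request per read is simple but can be chatty for small reads; some buffering or read-ahead would probably be wanted in practice.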

@MalinAhlberg
Contributor

When decrypting a partial file, the resulting file size should be exactly what was originally asked for, not more. I.e., the extra data sent along to reach the next data block boundary should be removed, using the data edit list. See #695 (comment)
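To illustrate the expected result on the receiving side, here is a sketch of applying an edit list to already-decrypted data. `apply_edit_list` is a hypothetical helper, and the semantics are my understanding of the crypt4gh spec: lengths alternate discard/keep starting with discard, and an odd number of lengths means everything after the final discard is kept.

```python
def apply_edit_list(data, lengths):
    """Return only the bytes of `data` that the edit list keeps."""
    if not lengths:
        return data  # no edit list: keep everything
    out = bytearray()
    pos = 0
    for i, n in enumerate(lengths):
        if i % 2 == 1:  # odd positions are "keep" lengths
            out += data[pos:pos + n]
        pos += n
    if len(lengths) % 2 == 1:  # trailing discard: keep the rest
        out += data[pos:]
    return bytes(out)


# Two decrypted 64 KiB blocks, of which only bytes 65535-65536 were asked for.
plaintext = bytes(i % 251 for i in range(2 * 65536))
result = apply_edit_list(plaintext, [65535, 2])
print(len(result))
```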
