Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage: Capture relevant headers to blob properties during download #24

Closed
tseaver opened this issue Nov 13, 2019 · 8 comments
Closed

Storage: Capture relevant headers to blob properties during download #24

tseaver opened this issue Nov 13, 2019 · 8 comments

Comments

@tseaver
Copy link
Contributor

@tseaver tseaver commented Nov 13, 2019

Residual from googleapis/google-cloud-python#9003.

@william-silversmith notes that even with raw_download enabled, he is unable to detect the content_type of a downloaded blob without performing an additional reload request, which is prohibitive for his usecase at scale. E.g.:

blob = bucket.blob( key )
binary = blob.download_as_string(raw_download=True)
if blob.content_encoding == 'gzip':
    return gunzip(binary)
elif blob.content_encoding == 'br':
    return brotli.decompress(binary)
else:
    return binary

Potentially even...

if blob.content_type == 'application/json':
    return json.loads(binary.decode('utf8'))
@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Nov 13, 2019

Thank you Tres!

@tseaver
Copy link
Contributor Author

@tseaver tseaver commented Nov 13, 2019

@william-silversmith The headers needed to do this task are buried several layers deep. I'm wondering if a better solution than trying to guess which ones to map onto the blob's properties, we might be better off designing for your usecase by supplying a response_callback argument to the Blob.download_* methods:

  • The callback would take the blob instance and a requests.Response object as its arguments.
  • For non-chunked downloads, the callback would be called once; chunked downloads would call it once per chunk.

An example callback implementation for your usecase might be something like:

def _update_blob_content_encoding(blob, response):
    if blob.content_encoding is None:  # latch
        blob.content_encoding = response.headers.get('Content-Encoding')

@tseaver
Copy link
Contributor Author

@tseaver tseaver commented Nov 13, 2019

If we don't use a callback, then the relevant headers to start with would seem to be:

  • Cache-Control -> Blob.cache_control
  • Content-Disposition -> Blob.content_disposition
  • Content-Encoding -> Blob.content_encoding
  • Content-Length -> Blob.size
  • Content-Type -> Blob.content_type
  • Etag -> Blob.etag
  • X-Goog-Generation: Blob.generation
  • X-Goog-Hash: Blob.crc32c
  • X-Goog-Metageneration: Blob.metageneration
  • X-Goog-Storage-Class: Blob.storage_class

@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Nov 14, 2019

I almost never need more exotic headers than the ones you listed. If it's easier to only expose those headers, that's fine with me. If you want to go for a more general solution, I am fine with a (synchronous) callback. Thanks again!

@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Jan 27, 2020

Any word on when this might drop? Thanks for your help!

@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Jul 14, 2020

Hi Tres, I've made some changes to my own branch of this repo and they seem to work. I tried making a PR but the button was greyed out. Here's a link to the changes in case y'all are interested. In the mean time, I'll direct my repos to use the forked version.

master...seung-lab:master

@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Jul 15, 2020

Unfortunately, it seems PyPI won't let me use a direct reference to github. It would be extremely convenient if some version of this could get merged.

@william-silversmith
Copy link
Contributor

@william-silversmith william-silversmith commented Jul 15, 2020

Figured out the PR submission issue.

gcf-merge-on-green bot pushed a commit that referenced this issue Jul 23, 2020
…ds (#204)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-storage/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)

Fixes #24  🦕

This PR autopopulates the following fields for non-chunked downloads based on the server header response:
```
blob.content_encoding
blob.content_type
blob.cache_control
blob.storage_class
blob.content_language
blob.md5_hash
blob.crc32c
```
cojenco added a commit to cojenco/python-storage that referenced this issue Oct 13, 2021
…ds (googleapis#204)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-storage/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)

Fixes googleapis#24  🦕

This PR autopopulates the following fields for non-chunked downloads based on the server header response:
```
blob.content_encoding
blob.content_type
blob.cache_control
blob.storage_class
blob.content_language
blob.md5_hash
blob.crc32c
```
cojenco added a commit to cojenco/python-storage that referenced this issue Oct 13, 2021
…ds (googleapis#204)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-storage/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)

Fixes googleapis#24  🦕

This PR autopopulates the following fields for non-chunked downloads based on the server header response:
```
blob.content_encoding
blob.content_type
blob.cache_control
blob.storage_class
blob.content_language
blob.md5_hash
blob.crc32c
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

4 participants