Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: upload_from_file does not correctly stream Popen pipes #3429

Closed
evanj opened this issue May 16, 2017 · 3 comments
Closed

storage: upload_from_file does not correctly stream Popen pipes #3429

evanj opened this issue May 16, 2017 · 3 comments
Assignees
Labels
api: storage Issues related to the Cloud Storage API.

Comments

@evanj
Copy link
Contributor

evanj commented May 16, 2017

upload_from_file attempts to detect the size of the file by executing os.fstat(file_obj.fileno())
(google/cloud/storage/blob.py:929 https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L929)

For a subprocess.Popen pipe, this returns a stat structure with st_size set to 0. For some miracle that I didn't track down, this appears to cause the library to upload the output in a single chunk, rather than uploading a zero length file, which is a good failure mode to have!

However, I would expect this to fail with the usual total bytes could not be determined error, and to use a chunked upload once I fix that. Additionally: The streaming code relies on a file implementing file.tell(), which Popen pipes do not: It raises IOError Illegal seek. The code seems to only use tell() for the current position, so this could be fixed by having the code maintain the position itself.

I worked around this bug by wrapping the Popen object in my own object that implements read and tell and it seems to work.

To reproduce:

import google.cloud.storage
import subprocess
client = google.cloud.storage.Client()
blob = client.bucket('bucket').blob('blob')
p = subprocess.Popen(['ls'], stdout=subprocess.PIPE)
blob.upload_from_file(p.stdout)

Expected result:

A ValueError instructing the user to specify a chunk size.

Actual result:

The entire file gets uploaded in a single go (I think).

System Details:

OS: Mac OS X 10.11.6
Python: Python 2.7.10
google-cloud-storage version 1.1.1

@dhermes
Copy link
Contributor

dhermes commented May 16, 2017

@evanj Thanks for reporting!

  • The line you linked to (929) is just for open directly from files. That st_size check is present in the latest released version but has been removed in HEAD (not released yet though).
  • RE: .tell(), upload_from_file explicitly requires an IO[Bytes] object, and a subprocess.Popen does not fit that interface (as you noted). Adapting the implementing not to use tell() is not likely to happen, instead your fix is the right approach for such types. However, if you want to upload the output of a command, you should write it to disk and then use upload_from_filename

@dhermes dhermes added the api: storage Issues related to the Cloud Storage API. label May 16, 2017
@evanj
Copy link
Contributor Author

evanj commented May 16, 2017

Ah oops, sorry for wasting your time: I was looking at the wrong thing. I'm going to close it, since it looks like this will work correctly with the next release.

One minor suggestion: The doc comment for upload_from_file maybe should be updated to indicate that if you want a streaming upload, you must set chunk_size in the Blob constructor. That isn't super obvious if you just read the docs. I would typically assume that an API that takes a file object will do something "in chunks" by default, but it looks like this will attempt to read the entire thing in one go into memory, which is what I'm trying to avoid (I'm generating a fairly large output).

@evanj evanj closed this as completed May 16, 2017
@dhermes
Copy link
Contributor

dhermes commented May 16, 2017

@evanj Thanks for the suggestion! I recently dug into the very old implementation and was less than happy about the amount of things going on under the covers.

If you'd like a more explicit interface, I recommend checking out what we use under the covers (source and docs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API.
Projects
None yet
Development

No branches or pull requests

2 participants