Add an API method to give us a streaming file object #29

Closed
dmsolow opened this issue Jan 29, 2019 · 21 comments

Comments


@dmsolow dmsolow commented Jan 29, 2019

It doesn't look like there's a way to get a streaming download from Google Cloud Storage in the Python API. We have download_to_file, download_as_string, and download_to_filename, but I don't see anything that returns a file-like object that can be streamed. This is a disadvantage for the many file types that can usefully be processed as they download.

Can a method like this be added?

Contributor

@tseaver tseaver commented Jan 29, 2019

@dmsolow Hmm, Blob.download_to_file takes a file object -- does that not suit your usecase?

Author

@dmsolow dmsolow commented Jan 30, 2019

I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Cloud Storage, it should be possible to parse it line by line as it's downloaded.

It's fairly common for network libraries to offer this kind of functionality. For example in the standard urllib.request HTTP library:

import urllib.request
import csv
from io import TextIOWrapper

with urllib.request.urlopen('http://test.com/big.csv') as f:
    wrapped = TextIOWrapper(f) # decode from bytes to str
    reader = csv.reader(wrapped)
    for row in reader:
        print(row[0])

This parses the CSV as it's downloaded. I'd like to get the same functionality from google storage. If there's already a good way to do this with the current library, please let me know.

Contributor

@tseaver tseaver commented Jan 30, 2019

Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's os.pipe: see this gist, which produces the following output:

$ bin/python pipe_test.py 
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
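
The gist itself is not reproduced in this thread, so the following is only a rough sketch of the os.pipe idea, not the gist's actual code. fake_download_to_file is a hypothetical stand-in for Blob.download_to_file (it just writes fixed-size chunks), and stream_through_pipe is an illustrative name. The producer runs in a background thread and writes into the pipe while the main thread reads from the other end.

```python
import os
import threading


def fake_download_to_file(fileobj, chunk_count=8, chunk_size=1024):
    """Hypothetical stand-in for Blob.download_to_file: writes chunks to fileobj."""
    for _ in range(chunk_count):
        fileobj.write(b"x" * chunk_size)


def stream_through_pipe(download, read_size=4096):
    """Run `download` in a thread and consume its output through an OS pipe."""
    read_fd, write_fd = os.pipe()

    def producer():
        # Closing the write end signals end-of-file to the reader.
        with os.fdopen(write_fd, "wb") as writer:
            download(writer)

    thread = threading.Thread(target=producer)
    thread.start()

    total = 0
    with os.fdopen(read_fd, "rb") as reader:
        while True:
            chunk = reader.read(read_size)
            if not chunk:  # empty bytes means the writer closed the pipe
                break
            total += len(chunk)  # process the chunk here instead of counting

    thread.join()
    return total
```

With the real library, the assumption is that `download` would be something like `blob.download_to_file`, since that method only needs an object it can write to.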

Author

@dmsolow dmsolow commented Jan 30, 2019

Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without using extra threads would be widely useful, but idk how hard it would be to implement.

Contributor

@tseaver tseaver commented Jan 31, 2019

OK, looking at the underlying implementation in google-resumable-media, all that we actually expect of the file object is that it has a write method, which is then passed each chunk as it is downloaded.

You could therefore pass in an instance of your own class which wrapped the underlying stream, e.g.:

from google.cloud.storage import Client

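class ChunkParser(object):
    """File-like wrapper: download_to_file only needs a write() method."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        # Persist the chunk, then process it while the download continues.
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

    def _do_something_with(self, chunk):
        # Placeholder: replace with real per-chunk processing.
        pass

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')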

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)
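
One practical wrinkle with this write-based approach is that chunk boundaries rarely line up with record boundaries. As a minimal sketch (LineSplitter and on_line are hypothetical names, not part of any library), a wrapper can buffer the trailing partial line between write() calls and hand only complete lines to a callback:

```python
class LineSplitter:
    """Write-only object that turns arbitrary byte chunks into whole lines."""

    def __init__(self, on_line):
        self._on_line = on_line  # callback invoked once per complete line
        self._pending = b""      # bytes after the last newline seen so far

    def write(self, chunk):
        self._pending += chunk
        # Everything before the last newline is complete; keep the remainder.
        *complete, self._pending = self._pending.split(b"\n")
        for line in complete:
            self._on_line(line)
        return len(chunk)

    def close(self):
        # Flush a final line that was not newline-terminated.
        if self._pending:
            self._on_line(self._pending)
            self._pending = b""
```

Assuming download_to_file only ever calls write(), as described above, an instance could be passed in place of the file object, with close() called once the download returns.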


@yiga2 yiga2 commented Feb 18, 2019

This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903)

As an alternative, one can use the gcsfs library which supports file-obj for read and write.

Author

@dmsolow dmsolow commented Feb 19, 2019

It's a shame that this was turned down. It's a feature that every Python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up.


@akuzminsky akuzminsky commented Apr 1, 2019

> Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good.

Unfortunately this doesn't work with uploading streams.
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L1160 returns a size of zero for a pipe. As a result, the pipe never empties, and the child process eventually blocks while writing to it.

Are there known workarounds?

Contributor

@tseaver tseaver commented Apr 16, 2019

@akuzminsky The line you've linked to is in the implementation of Blob.upload_from_filename. This issue is about being able to process downloaded chunks before the download completes.

@dmsolow Does my file-emulating wrapper class solution work for you?

Author

@dmsolow dmsolow commented Apr 16, 2019

@tseaver No. I would like something that is a "file-like object." This means something that supports standard Python io methods like readline, next, read etc. Maybe that object buffers chunks under the hood, but it should essentially be indistinguishable from the file object returned by the builtin open function.


@thnee thnee commented Jun 11, 2019

I was really surprised to see that not only is this feature not available, but it also has been brought up and closed in the past. It seems like an obvious and important feature to have.

Fortunately, gcsfs works really well as a substitute, but it's a little bit awkward to have to have a second library for such a core functionality.

But gcsfs does not support setting Content-Type, so I end up having to first upload the file using gcsfs, and then call gsutil setmeta via subprocess to set it after the file has been uploaded. This takes extra time and is brittle; it is more of a workaround than a solution.


@yiga2 yiga2 commented Jun 11, 2019

@thnee you should check back, gcsfs has the setxattrs() method to set metadata, including content-type.


@ElliotSilver ElliotSilver commented Jun 17, 2019

The lack of a simple streaming interface is a challenge to implementing a cloud function that reads/writes large files. I need the ability to read an object in from cloud storage, manipulate it, and write it out to another object. Since the only filestore available to GCF is /tmp which lives in the function memory space, you are limited to files less than 2 GB.
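
For illustration, this is roughly the shape of the read-manipulate-write loop being asked for, assuming file-like source and destination objects were available. transform_stream is a hypothetical helper, and in the test-style usage io.BytesIO would stand in for the two cloud objects. Only one chunk is held in memory at a time, so nothing needs to be staged in /tmp.

```python
def transform_stream(src, dst, transform, chunk_size=64 * 1024):
    """Copy src to dst in bounded-size chunks, applying transform to each chunk.

    src needs a read(size) method and dst a write(data) method; both are
    assumed to be binary file-like objects. Returns the number of bytes read.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        dst.write(transform(chunk))
        total += len(chunk)
    return total
```

This only works as-is for transforms that can be applied to each chunk independently, such as byte-wise case changes; anything record-oriented would also need to carry partial records across chunk boundaries.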

Member

@IlyaFaer IlyaFaer commented Jun 25, 2019

Well, if this new method is wanted so much, I'd propose a solution: a class that inherits from FileIO. It creates a ChunkedDownload and stores it as an attribute, and on every read() call it consumes the next chunk and returns it (some variants are possible, since seek() will work in that class, as will flush()). A new Blob method would initialize this object and return it to the user.

It looks like it'll work, because (as far as I know) most file methods go through read(), so overriding it should do the trick. I've already rough-coded this and tried some tests, and it worked. It's also compact.
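
The proposal's code is not shown in the thread, so here is only a sketch of the general idea under different assumptions: it subclasses io.RawIOBase rather than FileIO, and it pulls from a plain iterator of byte chunks (ChunkStream and its chunks argument are hypothetical) instead of a ChunkedDownload. Implementing readinto() is enough for the standard io wrappers to supply read, readline, and iteration on top.

```python
import csv
import io


class ChunkStream(io.RawIOBase):
    """Read-only raw stream backed by an iterator of byte chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""  # unread tail of the most recent chunk

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fetch the next non-empty chunk only when the previous one is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # 0 bytes read means end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def rows_from_chunks(chunks):
    """Parse CSV rows lazily from byte chunks, decoding as UTF-8."""
    raw = ChunkStream(chunks)
    text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8", newline="")
    return csv.reader(text)
```

In real use the chunk iterator would have to be fed by the download machinery, for example one ranged request per chunk; that wiring is left out here because it depends on library internals.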

@IlyaFaer IlyaFaer self-assigned this Aug 2, 2019

@olejorgenb olejorgenb commented Aug 21, 2019

TensorFlow has an implementation that gives a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile

Not sure if it actually streams or not though.


@petedannemann petedannemann commented Jan 27, 2020

smart_open now has support for streaming files to/from GCS.

from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')


@rocketbitz rocketbitz commented Jan 29, 2020

@petedannemann great work - any ETA for an official release?


@petedannemann petedannemann commented Jan 29, 2020

@rocketbitz no idea, but for now you could install from GitHub

pip install git+https://github.com/RaRe-Technologies/smart_open

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020

@xbrianh xbrianh commented Feb 25, 2020

I've implemented gs-chunked-io to satisfy my own needs for GS read/write streams. It's designed to complement the Google Python API.

import gs_chunked_io as gscio
from google.cloud.storage import Client

bucket = Client().bucket("my-bucket")
blob = bucket.get_blob("my-key")

# read
with gscio.Reader(blob) as fh:
    fh.read(size)

# read in background
with gscio.AsyncReader(blob) as fh:
    fh.read(size)

# write
with gscio.Writer("my_new_key", bucket) as fh:
    fh.write(data)

justindujardin added a commit to justindujardin/pathy that referenced this issue Mar 13, 2020
 - annoyingly GCS doesn't support file-like objects: googleapis/python-storage#29
 - use a small library for doing file-like object support for GCS: https://github.com/xbrianh/gs-chunked-io

@petedannemann petedannemann commented Mar 16, 2020

> @petedannemann great work - any ETA for an official release?

Release 1.10 last night included GCS functionality

@tseaver tseaver changed the title Storage: add an API method to give us a streaming file object Add an API method to give us a streaming file object Aug 17, 2020

@abhipn abhipn commented Nov 13, 2020

Any update on this?
