storage API: Allow wildcard in object path to allow retrieval of several objects #4154

halfdanrump · 2017-10-11T03:48:45Z

gcsfs has a very handy feature that allows you to fetch multiple files by allowing wildcards in the object path. I think this would be a nice little feature to add to this library.

This example illustrates the idea:

from google.cloud import storage
c = storage.Client()
bucket = c.bucket('mybucket')
blobs = bucket.blob('2017/*.csv')

As far as I know, the current way to accomplish the same would be to filter the list of all the files in the bucket and then fetch the files one by one (please correct me if I'm wrong :). The problem with this is that it's slow if you have a bucket with a very large number of files.

Cheers,
Halfdan

sagarrakshe · 2017-10-11T14:52:02Z

@halfdanrump bucket.list_blobs takes an optional prefix parameter, which restricts the results to blobs whose names start with the given string. This can partly solve your problem.
The following returns an iterator over the blobs whose names start with test_:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('mybucket')
blobs = bucket.list_blobs(prefix='test_')

halfdanrump · 2017-10-13T02:27:39Z

@sagarrakshe You're right, I hadn't noticed that parameter. In my case this is sufficient, so thanks for telling me about it! :)

As you also point out, it's a partial solution. Actually, it might be handy to even allow regex matches on the filenames, @lukesneeringer. I'm not familiar with the code, so I don't know how difficult this would be to implement. Do you think it would be worth the effort?

Cheers,
Halfdan

sagarrakshe · 2017-10-30T18:04:43Z

The list objects API doesn't allow any wildcard parameter, apart from prefix:
https://cloud.google.com/storage/docs/json_api/v1/objects/list

So we would need to add a parameter to the bucket.list_blobs method (say pattern=None), which would be used to filter objects by applying the pattern to each object name. Is this the best way to do it?
Any thoughts? @lukesneeringer @dhermes

tseaver · 2017-10-30T18:31:03Z

@sagarrakshe As you note, the back-end doesn't provide such access, so what we are discussing is really a convenience wrapper for application code which would otherwise be something like:

import fnmatch

for blob in bucket.list_blobs():
    if fnmatch.fnmatch(blob.name, "*something"):
        do_something_with(blob)
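
The loop above can be wrapped in a small reusable generator. This is a minimal sketch, not part of the library: iter_matching_blobs is a hypothetical name, and the only assumptions are that bucket.list_blobs accepts the prefix argument shown earlier in the thread and that each returned blob has a name attribute. It uses fnmatch.fnmatchcase rather than fnmatch.fnmatch so that matching is case-sensitive on every platform.

```python
import fnmatch


def iter_matching_blobs(bucket, pattern, prefix=None):
    """Yield blobs from ``bucket`` whose names match a shell-style pattern.

    Hypothetical application-level helper; it is not a google-cloud-storage API.
    """
    # The optional prefix is applied by the back-end and narrows the listing.
    for blob in bucket.list_blobs(prefix=prefix):
        # The wildcard pattern is applied client-side, one name at a time.
        if fnmatch.fnmatchcase(blob.name, pattern):
            yield blob
```

Note that in fnmatch patterns `*` also matches `/`, so a pattern such as '2017/*.csv' matches names in nested "directories" as well.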

sagarrakshe · 2017-10-31T15:58:51Z

Nice. So can we close this issue? @tseaver

tseaver · 2017-11-01T16:59:52Z

I'll close it, as there isn't much we can do to improve on application-level processing.

schunlee · 2018-11-15T01:47:58Z

The fnmatch library works, but the filtering is very slow. I don't know how gsutil handles this so efficiently.
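
One way to reduce the cost of client-side filtering is to pass the literal part of the pattern (everything before its first wildcard) as the prefix, so the back-end returns fewer names to match. The sketch below illustrates that idea only; it is not a description of how gsutil works. literal_prefix and list_blobs_glob are hypothetical helpers, and the code assumes only bucket.list_blobs(prefix=...) and blob.name as used elsewhere in this thread.

```python
import fnmatch
import re

# Characters that start a wildcard in fnmatch-style patterns.
_WILDCARD = re.compile(r"[*?\[]")


def literal_prefix(pattern):
    """Return the part of ``pattern`` before its first wildcard character."""
    match = _WILDCARD.search(pattern)
    return pattern if match is None else pattern[: match.start()]


def list_blobs_glob(bucket, pattern):
    """Return blobs matching ``pattern``, narrowing the listing by prefix.

    Hypothetical helper; it is not a google-cloud-storage API.
    """
    # An empty prefix gives no narrowing, so fall back to a full listing.
    prefix = literal_prefix(pattern) or None
    return [
        blob
        for blob in bucket.list_blobs(prefix=prefix)
        if fnmatch.fnmatchcase(blob.name, pattern)
    ]
```

With a pattern such as '2019*.csv' only names starting with 2019 are listed. A pattern that begins with a wildcard, such as '*something', has no literal prefix and still requires listing the whole bucket.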

ViRaL95 · 2019-08-14T13:35:11Z

At any point will this feature request be re-opened to allow wildcards within the list_blobs method?

For example:

bucket.list_blobs(prefix='2019*.csv')

tseaver · 2019-08-14T16:20:50Z

@ViRaL95

> At any point will this feature request be re-opened to allow wildcards within the list_blobs method?

Because the back-end doesn't provide support for that kind of matching, we decided that it was not worth the effort, given how simple it is to do the matching in the application (as my example above illustrates).

lukesneeringer assigned tseaver Oct 12, 2017

lukesneeringer added the labels "api: storage" and "type: feature request" Oct 12, 2017

tseaver closed this as completed Nov 1, 2017

tseaver mentioned this issue Jan 31, 2020

'Bucket.list_blobs' surface issues googleapis/python-storage#35

Closed
