storage API: Allow wildcard in object path to allow retrieval of several objects #4154

Closed
halfdanrump opened this issue Oct 11, 2017 · 9 comments
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@halfdanrump

gcsfs has a very handy feature: it lets you fetch multiple files by using wildcards in the object path. I think this would be a nice little feature to add to this library.

This example illustrates the idea:

from google.cloud import storage
c = storage.Client()
bucket = c.bucket('mybucket')
blobs = bucket.blob('2017/*.csv')

As far as I know, the current way to accomplish the same would be to filter the list of all the files in the bucket and then fetch the files one by one (please correct me if I'm wrong :). The problem with this is that it's slow if you have a bucket with a very large number of files.

Cheers,
Halfdan

@sagarrakshe

@halfdanrump bucket.list_blobs takes an optional prefix parameter, which restricts the listing to blobs whose names start with the given string. This can partly solve your problem.
The following returns an iterator over the blobs whose names start with test_:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('mybucket')
blobs = bucket.list_blobs(prefix='test_')

@lukesneeringer lukesneeringer added api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Oct 12, 2017
@halfdanrump
Author

@sagarrakshe You're right, I hadn't noticed that parameter. In my case this is sufficient, so thanks for telling me about it! :)

As you also point out, it's only a partial solution. Actually, it might be handy to even allow regex matches on the filenames, @lukesneeringer. I'm not familiar with the code, so I don't know how difficult this would be to implement. Do you think it would be worth the effort?

Cheers,
Halfdan
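
For illustration, regex matching of the kind suggested here can be sketched on the application side. This is only a sketch, not part of the library: `filter_blobs_by_regex` is a hypothetical helper name, and it assumes each item has a `name` attribute, as the blobs returned by `bucket.list_blobs()` do.

```python
import re


def filter_blobs_by_regex(blobs, pattern):
    """Yield the blobs whose full object name matches a regular expression.

    Hypothetical helper, not part of google-cloud-storage.
    """
    compiled = re.compile(pattern)
    for blob in blobs:
        # fullmatch requires the pattern to cover the whole object name.
        if compiled.fullmatch(blob.name):
            yield blob


# Usage, assuming `bucket` is a google.cloud.storage Bucket:
#
#     blobs = bucket.list_blobs(prefix="2017/")
#     for blob in filter_blobs_by_regex(blobs, r"2017/[^/]+\.csv"):
#         print(blob.name)
```

Passing a prefix to `list_blobs` first keeps the listing small, and the regex then does the finer filtering. Here `[^/]+` keeps the match inside a single "directory" level.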

@sagarrakshe

The list objects API doesn't allow any wildcard parameter, apart from prefix:
https://cloud.google.com/storage/docs/json_api/v1/objects/list

So we would need to add a parameter to the bucket.list_blobs method (say pattern=None), which would be used to filter objects by applying that pattern to each object name. Is this the optimal way to do it?
Any thoughts? @lukesneeringer @dhermes
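
A minimal sketch of what such a `pattern=None` argument could look like, written as a wrapper rather than as a change to the library. The name `list_blobs_matching` is hypothetical; the real `list_blobs` has no `pattern` parameter in this discussion. The wrapper only assumes that `bucket.list_blobs()` returns an iterable of objects with a `name` attribute.

```python
import fnmatch


def list_blobs_matching(bucket, pattern=None, **list_kwargs):
    """Yield blobs from bucket.list_blobs(), optionally filtered by a glob.

    Hypothetical wrapper: `pattern` is applied client-side to each object
    name; all other keyword arguments are passed through to list_blobs.
    """
    for blob in bucket.list_blobs(**list_kwargs):
        # fnmatchcase stays case-sensitive on every platform, which suits
        # object names better than the OS-dependent fnmatch.fnmatch.
        if pattern is None or fnmatch.fnmatchcase(blob.name, pattern):
            yield blob


# Usage, assuming `bucket` is a google.cloud.storage Bucket:
#
#     for blob in list_blobs_matching(bucket, "2017/*.csv", prefix="2017/"):
#         print(blob.name)
```

Note that in `fnmatch` patterns `*` also matches `/`, so `2017/*.csv` would match `2017/sub/file.csv` as well. The filtering still happens after listing, so this is a convenience and not a speed-up.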

@tseaver
Contributor

tseaver commented Oct 30, 2017

@sagarrakshe As you note, the back-end doesn't provide such access, so what we are discussing is really a convenience wrapper for application code which would otherwise be something like:

import fnmatch

for blob in bucket.list_blobs():
    if fnmatch.fnmatch(blob.name, "*something"):
        do_something_with(blob)

@sagarrakshe

Nice. So can we close this issue? @tseaver

@tseaver
Contributor

tseaver commented Nov 1, 2017

I'll close it, as there isn't much we can do to improve on application-level processing.

@tseaver tseaver closed this as completed Nov 1, 2017
@schunlee

The fnmatch library works, but the filtering process is very slow. I don't know how gsutil handles this problem so efficiently.
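
One likely reason the plain loop is slow is that it lists every object in the bucket before filtering. A possible mitigation, sketched here under assumptions, is to pass the literal part of the pattern (everything before the first wildcard) as `prefix`, so the back-end returns fewer objects, and then apply the glob client-side. `literal_prefix` and `list_blobs_glob` are hypothetical names, and this is not a description of how gsutil works.

```python
import fnmatch
import re

# Characters that start a wildcard in fnmatch-style patterns.
_WILDCARD = re.compile(r"[*?\[]")


def literal_prefix(pattern):
    """Return the part of a glob pattern before its first wildcard."""
    match = _WILDCARD.search(pattern)
    return pattern if match is None else pattern[:match.start()]


def list_blobs_glob(bucket, pattern):
    """List blobs under the pattern's literal prefix, then glob-filter them.

    Hypothetical helper; `bucket` only needs a list_blobs(prefix=...) method.
    """
    # An empty prefix would not narrow anything, so pass None instead.
    prefix = literal_prefix(pattern) or None
    for blob in bucket.list_blobs(prefix=prefix):
        if fnmatch.fnmatchcase(blob.name, pattern):
            yield blob


# Usage, assuming `bucket` is a google.cloud.storage Bucket:
#
#     for blob in list_blobs_glob(bucket, "2019/reports/*.csv"):
#         print(blob.name)
```

The gain depends on the pattern: `2019/reports/*.csv` narrows the listing to one prefix, while a pattern that starts with a wildcard, such as `*.csv`, still lists the whole bucket.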

@ViRaL95

ViRaL95 commented Aug 14, 2019

At any point will this feature request be re-opened to allow wildcards within the list_blobs method?

For example:

bucket.list_blobs(prefix='2019*.csv')

@tseaver
Contributor

tseaver commented Aug 14, 2019

@ViRaL95

At any point will this feature request be re-opened to allow wildcards within the list_blobs method?

Because the back-end doesn't provide support for that kind of matching, we decided that it was not worth the effort, given how simple it is to do the matching in the application (as my example above illustrates).
