Add s3util.ListObjects(url string, c *Config) (*ListObjectsResult, error) #7

Closed
wants to merge 10 commits into from

hnakamur
Contributor

@hnakamur hnakamur commented Jun 5, 2013

This is a function for the GET Bucket (List Objects) API.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

@kr
Owner

kr commented Jul 14, 2013

I thought I replied here before, apologies for missing that.

The name ListObjects seems redundant, why not just List?
(Or perhaps Readdir?)

The public interface here seems a bit complicated. Two new
types introduced. Is it possible to implement
http://godoc.org/os#FileInfo and avoid introducing a new
public type Content?

Also, is there any way to avoid exposing fields like Marker
and IsTruncated? Those are implementation details, which
ideally s3util would handle automatically. That is, is something
like this signature possible?

func Open(url string, c *Config) (*File, error)

func (f *File) List(n int) ([]os.FileInfo, error)

@hnakamur
Contributor Author

Hi, thanks for your reply.

When we add an API for listing buckets in the future,
I suppose it will be named ListBuckets.
With ListBuckets in the package, I thought ListObjects was a better name than List.
List may be OK, though.

I think Readdir is confusing.
With the name Readdir, I would expect the result to be the entries in one directory,
like the os.File.Readdir API.

I implemented the ListObjects API as a low level primitive API corresponding to:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
Two new types are needed to contain all response elements.

I think your signature for List() is somewhat misleading.
It gave me the impression that f is a directory and that I would get the entries in it.

I think this is better:
func List(marker *File, n int) ([]os.FileInfo, error)

Before we think about function signatures, we have to consider
a gotcha about directory names on S3.
The Amazon S3 web console marks directory names with
the suffix '/' (e.g. 'foo/'). On the other hand, tools like S3Fox
use the suffix '_$folder$' (e.g. 'foo_$folder$').

We also need to pass directory names with these suffixes intact
when we use them as markers for the next ListObjects call,
so we cannot just trim the directory suffixes and throw them away.

If we use os.File for file or directory entries, Name() must
return names with their directory suffixes. We would then have to
define another function
func TrimDirectorySuffix(f *File) string
to get a directory name without '/' or '_$folder$'.
I would not like having to call TrimDirectorySuffix(file.Name()) to get the actual names.
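
A minimal sketch of such a trimming helper, for illustration only. It assumes the helper works on a plain key string rather than on a *File, and the name trimDirectorySuffix and its second return value are hypothetical, not part of s3util.

```go
package main

import (
	"fmt"
	"strings"
)

// dirSuffixes lists the two directory-marker conventions discussed above.
var dirSuffixes = []string{"/", "_$folder$"}

// trimDirectorySuffix is a hypothetical helper: it returns key without a
// trailing directory marker and reports whether a marker was present.
func trimDirectorySuffix(key string) (name string, isDir bool) {
	for _, s := range dirSuffixes {
		if strings.HasSuffix(key, s) {
			return strings.TrimSuffix(key, s), true
		}
	}
	return key, false
}

func main() {
	for _, key := range []string{"foo/", "foo_$folder$", "foo.txt"} {
		name, isDir := trimDirectorySuffix(key)
		fmt.Printf("%q -> %q (dir: %v)\n", key, name, isDir)
	}
}
```

The untrimmed key still has to be kept somewhere, since it is what the next list request needs as its marker.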

Could you tell me what you think?

@kr
Owner

kr commented Jul 16, 2013

Package s3util isn't really meant for low level functions.
It's for convenient high level access. For example, the
user shouldn't have to keep track of the marker; this
package should do it for them.

S3 doesn't have directories, but it's possible to treat
objects as if they were in a hierarchy, and the amazon
api and docs encourage this. It seems reasonable to
present files that way. There's no way we could use
os.File, but we could make an s3.File that's analogous.
A File that corresponds to an actual object would
need to present the exact path of that object as its name.
A File that corresponds to an intermediate level of
hierarchy (aka a directory) would need to present as its
name the path up to that point, not including the trailing
path separator.

Since '/' is already the path separator, creating an empty
object ending in '/' causes a level of hierarchy to appear
with no extra logic. It seems unwise to use any other suffix
for these pseudo-directories.

Given the following objects:

sample.jpg
photos/2006/January/sample.jpg
photos/2006/February/sample2.jpg
photos/2006/February/sample3.jpg
photos/2006/February/sample4.jpg

This api could produce the following listings:

For "/":
    photos
    sample.jpg
For "/photos":
    2006
For "/photos/2006":
    February
    January

etc.
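
The listings above can be reproduced in memory with a small sketch. It only illustrates the intended grouping and makes no S3 calls; listLevel is a hypothetical helper name, and it assumes '/' is the only path separator.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// listLevel returns the names directly under dir, collapsing deeper keys
// into a single entry named after their first path element.
func listLevel(keys []string, dir string) []string {
	prefix := strings.TrimPrefix(dir, "/")
	if prefix != "" && !strings.HasSuffix(prefix, "/") {
		prefix += "/"
	}
	seen := make(map[string]bool)
	for _, k := range keys {
		if !strings.HasPrefix(k, prefix) {
			continue
		}
		rest := k[len(prefix):]
		// Keep only the first path element below the prefix.
		if i := strings.Index(rest, "/"); i >= 0 {
			rest = rest[:i]
		}
		seen[rest] = true
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	keys := []string{
		"sample.jpg",
		"photos/2006/January/sample.jpg",
		"photos/2006/February/sample2.jpg",
		"photos/2006/February/sample3.jpg",
		"photos/2006/February/sample4.jpg",
	}
	for _, dir := range []string{"/", "/photos", "/photos/2006"} {
		fmt.Printf("For %q: %v\n", dir, listLevel(keys, dir))
	}
}
```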

Why can't List work for listing both buckets and objects?

@hnakamur
Contributor Author

Now I understand that s3util is meant for high level access. Thanks for your explanation.

As for directory suffixes, I wish all the tools out there used only '/'. In reality, there are already
a lot of directories with either suffix, '/' or '_$folder$', so I think it would be better for
s3util to handle both of them.

Your listing is a breadth-first search, but the S3 List API is a depth-first search.
The S3 List API also limits the number of entries it returns: at most 1000 per call.
So we need to call the S3 List API multiple times when we have a lot of entries.
In fact, we need to traverse all entries to get the top-level listing.

I would like to control the number of S3 API calls because they cost money.
Also, I would like to process partial listings as they arrive, before I have the complete listing.

Yes, maybe List can work for listing both buckets and objects.

@kr
Owner

kr commented Jul 18, 2013

The page you linked above,
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html,
shows the first step in a breadth-first search under heading
"Sample Request Using Prefix and Delimiter".

The key seems to be to supply the path separator as the delimiter param.
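
As a rough sketch of what that looks like on the wire, the query string for one level of the hierarchy could be built like this. The prefix and delimiter parameters come from the GET Bucket documentation linked above; the helper name listQuery is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// listQuery builds the query string for listing one level of the hierarchy.
// It always sets delimiter to "/" and sets prefix only when it is non-empty.
func listQuery(prefix string) string {
	v := url.Values{}
	v.Set("delimiter", "/")
	if prefix != "" {
		v.Set("prefix", prefix)
	}
	// Encode sorts by key and percent-escapes the values.
	return v.Encode()
}

func main() {
	fmt.Println(listQuery(""))
	fmt.Println(listQuery("photos/"))
}
```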

The design I suggest would perform exactly one S3 call per call to List.
Hopefully this is sufficient to control costs.

Just like for os.File.Readdir, List can let the user decide how many results
to get at once (up to the amazon limit), and continue where it left off in a
subsequent call.
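
A minimal sketch of that behaviour, with the S3 request replaced by an in-memory stand-in so it runs on its own. The types Dir and fetchFunc and the helper fakeBucket are hypothetical names for this illustration, and returning io.EOF at the end is an assumption borrowed from os.File.Readdir.

```go
package main

import (
	"fmt"
	"io"
	"sort"
)

// fetchFunc stands in for one S3 list request: it returns up to max names
// that sort after marker, and whether more names remain.
type fetchFunc func(marker string, max int) (names []string, truncated bool)

// Dir remembers the marker so the caller does not have to.
type Dir struct {
	fetch  fetchFunc
	marker string
	done   bool
}

// List returns up to n names using exactly one fetch call and resumes after
// the last returned name on the next call. It returns io.EOF once exhausted.
func (d *Dir) List(n int) ([]string, error) {
	if d.done {
		return nil, io.EOF
	}
	if n <= 0 || n > 1000 {
		n = 1000 // the per-request limit mentioned in this thread
	}
	names, truncated := d.fetch(d.marker, n)
	if len(names) > 0 {
		d.marker = names[len(names)-1]
	}
	d.done = !truncated
	return names, nil
}

// fakeBucket returns a fetchFunc backed by a sorted in-memory slice.
func fakeBucket(all []string) fetchFunc {
	sort.Strings(all)
	return func(marker string, max int) ([]string, bool) {
		// First index whose name sorts strictly after marker.
		i := sort.Search(len(all), func(i int) bool { return all[i] > marker })
		j := i + max
		if j > len(all) {
			j = len(all)
		}
		return all[i:j], j < len(all)
	}
}

func main() {
	d := &Dir{fetch: fakeBucket([]string{"a", "b", "c", "d", "e"})}
	for {
		names, err := d.List(2)
		if err == io.EOF {
			break
		}
		fmt.Println(names)
	}
}
```

The caller controls cost through n: each call to List triggers one fetch, and nothing is fetched until it is asked for.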

@hnakamur
Contributor Author

I read samples in
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

By setting delimiter=/, you get only directory entries. So you have to make an extra API call to get the entries in each directory, and those results have files and subdirectories mixed.

By using just the marker parameter without the delimiter, the number of API calls needed is int((entries - 1) / 1000) + 1, where 1000 is the maximum number of entries per API call. This is the minimum you can get.
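
The formula written out as a small sketch; flatListCalls is a hypothetical name, and treating an empty bucket as still costing one call is an assumption of this example.

```go
package main

import "fmt"

// maxKeysPerCall is the per-request limit mentioned in this thread.
const maxKeysPerCall = 1000

// flatListCalls returns the number of list requests needed to page through
// entries keys without a delimiter: int((entries - 1) / 1000) + 1.
func flatListCalls(entries int) int {
	if entries <= 0 {
		return 1 // assumed: an empty listing still takes one request
	}
	return (entries-1)/maxKeysPerCall + 1
}

func main() {
	for _, n := range []int{1, 1000, 1001, 2500} {
		fmt.Printf("%d entries -> %d calls\n", n, flatListCalls(n))
	}
}
```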

@kr
Owner

kr commented Jul 21, 2013

Files and directories aren't mixed. Files are listed in Contents,
and directories are in CommonPrefixes. In this example (copied
from amazon), the file is sample.html and the directory is photos.

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>sample.html</Key>
    <LastModified>2011-02-26T01:56:20.000Z</LastModified>
    <ETag>"bf1d737a4d46a19f3bced6905cc8b902"</ETag>
    <Size>142863</Size>
    <Owner>
      <ID>canonical-user-id</ID>
      <DisplayName>display-name</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <CommonPrefixes>
    <Prefix>photos/</Prefix>
  </CommonPrefixes>
</ListBucketResult>

Doing a breadth-first traversal might still take a few more api calls than
the flat listing, but it seems much more convenient.
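
A sketch of reading such a response with encoding/xml, separating files (Contents) from directories (CommonPrefixes). The struct covers only a few of the response elements, and the names listBucketResult and parseListing are hypothetical, not the types from this pull request.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

// listBucketResult maps a subset of the response; untagged fields match
// XML elements with the same name.
type listBucketResult struct {
	XMLName     xml.Name `xml:"ListBucketResult"`
	IsTruncated bool
	Contents    []struct {
		Key          string
		LastModified time.Time
		Size         int64
	}
	CommonPrefixes []struct {
		Prefix string
	}
}

// parseListing returns the file keys and the directory prefixes.
func parseListing(data []byte) (files, dirs []string, err error) {
	var r listBucketResult
	if err := xml.Unmarshal(data, &r); err != nil {
		return nil, nil, err
	}
	for _, c := range r.Contents {
		files = append(files, c.Key)
	}
	for _, p := range r.CommonPrefixes {
		dirs = append(dirs, p.Prefix)
	}
	return files, dirs, nil
}

// sampleXML is a trimmed copy of the example response above.
const sampleXML = `
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>sample.html</Key>
    <LastModified>2011-02-26T01:56:20.000Z</LastModified>
    <Size>142863</Size>
  </Contents>
  <CommonPrefixes>
    <Prefix>photos/</Prefix>
  </CommonPrefixes>
</ListBucketResult>`

func main() {
	files, dirs, err := parseListing([]byte(sampleXML))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("files:", files)
	fmt.Println("dirs:", dirs)
}
```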

@hnakamur
Contributor Author

Thank you again for your explanation.
I confirmed that files are listed in Contents and directories are in CommonPrefixes with my sample program.

I tried to implement the proposed APIs, but I found that we cannot get LastModified for directories.
Is it OK if f.ModTime() returns the zero value of time.Time when f is a directory?

@kr
Owner

kr commented Jul 22, 2013

Is it OK that f.ModTime() returns the zero value for time.Time if f is a directory?

Yes, that seems reasonable. Also for Size() etc. Since directories
don't really exist, they can't have metadata.
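
A minimal sketch of an os.FileInfo implementation along those lines. The type fileInfo and the constructor dirInfo are hypothetical names, and the mode bits are a placeholder choice for this example.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// fileInfo is an illustrative os.FileInfo for an S3 entry.
type fileInfo struct {
	name    string
	size    int64
	modTime time.Time
	dir     bool
}

func (fi fileInfo) Name() string       { return fi.name }
func (fi fileInfo) Size() int64        { return fi.size }
func (fi fileInfo) ModTime() time.Time { return fi.modTime }
func (fi fileInfo) IsDir() bool        { return fi.dir }
func (fi fileInfo) Sys() interface{}   { return nil }

// Mode reports only whether the entry is a directory.
func (fi fileInfo) Mode() os.FileMode {
	if fi.dir {
		return os.ModeDir
	}
	return 0
}

// Compile-time check that fileInfo satisfies os.FileInfo.
var _ os.FileInfo = fileInfo{}

// dirInfo describes a directory: size and modification time stay at their
// zero values because a common prefix carries no metadata.
func dirInfo(name string) os.FileInfo {
	return fileInfo{name: name, dir: true}
}

func main() {
	d := dirInfo("photos")
	fmt.Println(d.Name(), d.IsDir(), d.Size(), d.ModTime().IsZero())
}
```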

@hnakamur
Contributor Author

Oh, I was wrong about directories. I knew the S3 console creates entries for directories, but I thought we could not get them with a delimiter specified. Actually, we can.

An empty directory created with the S3 console:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>go-s3</Name>
  <Prefix>s3util/foo/</Prefix>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>s3util/foo/</Key>
    <LastModified>2013-06-07T07:52:45.000Z</LastModified>
    <ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <Owner>
      <ID>a42a235b94cfe0f3fd630844e076307918c210d57a6e3499e813f564588716a4</ID>
      <DisplayName>hnakamur</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

A file uploaded to a directory like the one above:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>go-s3</Name>
  <Prefix>s3util/hoge/</Prefix>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>s3util/hoge/</Key>
    <LastModified>2013-07-22T23:31:55.000Z</LastModified>
    <ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <Owner>
      <ID>a42a235b94cfe0f3fd630844e076307918c210d57a6e3499e813f564588716a4</ID>
      <DisplayName>hnakamur</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>s3util/hoge/list_local.go.bak</Key>
    <LastModified>2013-07-22T23:36:04.000Z</LastModified>
    <ETag>"afda40162cce64840ffd7aae3b2d3094"</ETag>
    <Size>894</Size>
    <Owner>
      <ID>a42a235b94cfe0f3fd630844e076307918c210d57a6e3499e813f564588716a4</ID>
      <DisplayName>hnakamur</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

When I created my directory structures on S3 for my experiments while implementing ListObjects(), I initially uploaded files with 3Hub: Amazon S3 Client (for Mac OS X).
This tool creates directory names with '_$folder$' suffixes, like S3Fox Organizer (S3Fox).
Then I removed the directory entries with '_$folder$' suffixes in the S3 console.
So now there are no entries for those directories.

If you use only the S3 console to create directories,
you get directory entries like the examples above. Sorry for the confusion.
In this case, you can get metadata for directories.

Of course, if you use only the S3 APIs, you can create file entries without parent directory entries.
In this case, you cannot get metadata for directories.

@kr
Owner

kr commented Jul 23, 2013

Yes, in my interpretation, s3util/foo/ is technically an empty file,
and s3util/foo is the directory that holds it. The file's basename
(returned from method Name on FileInfo) would be the empty string.
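
That interpretation matches what the standard path.Split function does with a key ending in '/'. A small sketch, where splitKey is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitKey splits an object key into the directory that holds it (without
// the trailing separator) and its basename.
func splitKey(key string) (dir, base string) {
	dir, base = path.Split(key)
	return strings.TrimSuffix(dir, "/"), base
}

func main() {
	for _, key := range []string{"s3util/foo/", "s3util/hoge/list_local.go.bak", "sample.jpg"} {
		dir, base := splitKey(key)
		fmt.Printf("%q -> dir %q, base %q\n", key, dir, base)
	}
}
```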

@hnakamur
Contributor Author

Thanks for your comment. I am closing this pull request because I opened another pull request, #14, for the new APIs.

@hnakamur hnakamur closed this Jul 23, 2013