
WIP support ACL's x-amz-acl via resource_kwargs (apply on s3's BufferedOutputBase close) #402

Closed
wants to merge 1 commit

Conversation

@albertoa commented Jan 2, 2020

support ACL's x-amz-acl via resource_kwargs (apply on s3's BufferedOutputBase close)

Motivation

We often need to support uploading files to S3 buckets from different accounts.

In order for the S3 object to be visible from the account that owns the bucket, the ACL needs to be set. A quick way is to use the built-in x-amz-acl header.

In this PR I overload the resource_kwargs used within s3.py so that we can pass x-amz-acl, which is then applied when BufferedOutputBase objects are closed.
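The pop-and-apply-on-close pattern the PR describes can be sketched as a small standalone function. This is illustrative only, not smart_open's actual code: the function name is invented here, and only the 'x-amz-acl' key follows the PR description.

```python
def split_acl_kwargs(resource_kwargs):
    """Separate the ACL option from kwargs destined for boto3.

    Illustrative sketch of the .pop pattern from the PR description:
    the ACL entry is removed from resource_kwargs so it is not forwarded
    to boto3's resource factory, and can instead be applied separately
    when the writer is closed.
    """
    kwargs = dict(resource_kwargs)  # copy so the caller's dict is not mutated
    acl = kwargs.pop('x-amz-acl', None)
    return acl, kwargs


# The remaining kwargs go to boto3 as before; the ACL is held back for close().
acl, remaining = split_acl_kwargs(
    {'x-amz-acl': 'bucket-owner-full-control', 'region_name': 'us-east-1'}
)
```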

Tests

Let me know what tests you would like me to write for it. Keep in mind that the upload doesn't fail unless the owner of the bucket refuses the object creation (which may suggest that close is not the right place for this). A proper test would therefore require two accounts: upload from one, then download from the other.

Work in progress

Flagged as WIP pending comments on testing.

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

@mpenkov (Collaborator) left a comment

Thank you for your contribution and your interest in the smart_open library.

Personally, I don't think this is a good idea, because you're dealing with irrelevant boto3 details inside of smart_open. I think it's better to deal with ACLs directly using boto3.

So, I think your use case is better realized using:

with smart_open.open('s3://bucket/key', ...) as fout:
    # do stuff here
    s3_object = fout.get_s3_object()  # this method is currently not implemented

s3_object.Acl().put(your_acl_here)
s3_object.OtherBoto3VoodooHappensHere()

The reason I prefer this way is:

  1. There are many other options in boto3. It'd be a pain to add and maintain support for all of them, or even part of them.
  2. These options don't even influence the operation of smart_open. They're merely hand-me-downs to boto3. We can set them at any time, e.g. after the S3 object has been written.

We make an exception for things like resource_kwargs because we cannot set them at any time; we have to set them while smart_open is doing its thing.

@piskvorky @albertoa What do you think?

@albertoa (Author) commented Jan 3, 2020

Thank you for your comments. However, that kind of defeats the purpose of using smart_open.

Currently we use boto3 and detect S3 usage within user-given URIs. In the intended CLI applications, URIs refer to the local file system or to S3.

The goal of moving to smart_open was to be able to handle local file systems and S3 without having to do different code paths in the applications. This is why using smart_open.open(URI,...) was great for us.

However, the transport is what makes the behavior differ. Since the behavior is transport-specific, I figured s3.py was the best place for this. On top of that, this option doesn't affect everybody (that's why I used .pop, as it is also used in other places in the code).

I liked how smart_open can use transport specific options that are ignored if not needed for the specific transport affecting the current URI. If I have to detect the transport and handle the differences within the application then what would be the purpose of using smart_open?

@mpenkov Am I missing something? Maybe I'm using it wrong ;-)

@mpenkov (Collaborator) commented Jan 4, 2020

> The goal of moving to smart_open was to be able to handle local file systems and S3 without having to do different code paths in the applications. This is why using smart_open.open(URI,...) was great for us.

I think you're conflating the goals of smart_open with their consequences.

The mission of smart_open is simple (from README.md):

smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to storages such as S3, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.

The emphasis is on "efficient streaming". To achieve that, we use an abstraction familiar to most Python users: the file object, which is opened by the open function. That function takes a number of additional arguments that are strictly necessary for the opening to work. Everything else (e.g. setting permissions) happens elsewhere.

Consider the built-in io.open function. Does it allow you to specify a mask so it can automatically chmod the file after creating it? Does it allow you to specify the user and group who owns the file?
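The analogy above can be made concrete with the standard library alone: permissions are changed by a separate call after the file has been written, not by an argument to open.

```python
import os
import stat
import tempfile

# Create and write a file first, using only the built-in open().
# (POSIX permission bits; on Windows, os.chmod only toggles read-only.)
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w') as fout:
    fout.write('data')

# Permissions are applied after creation, as a distinct step --
# open() itself has no chmod/chown parameters.
os.chmod(path, 0o600)
mode = stat.S_IMODE(os.stat(path).st_mode)
os.remove(path)
```

The reviewer's argument is that ACLs relate to smart_open's open the same way chmod relates to the built-in open: a post-write concern, handled outside the call that streams the bytes.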

> However, the transport is what makes the behavior differ. Since the behavior is transport-specific, I figured s3.py was the best place for this. On top of that, this option doesn't affect everybody (that's why I used .pop, as it is also used in other places in the code).

This isn't what resource_kwargs is for. Hijacking it to satisfy your particular use case is a bad idea.

> I liked how smart_open can use transport specific options that are ignored if not needed for the specific transport affecting the current URI. If I have to detect the transport and handle the differences within the application then what would be the purpose of using smart_open?

Presumably, so you can open the remote location and write to it as a stream using the familiar construct:

with open(some_url, 'w') as fout:
    fout.write(data)

That's what smart_open is good at. Doing the same thing using boto3 would be more involved. Plus, once you've written the stream using boto3, you'd still have to do the work to apply the ACL (as I've suggested in the original comment).

If you want to do other things to the object after you've written it, e.g.

  • Change file permissions using chmod (Local filesystem under Windows and *nix)
  • Assign owners, groups, etc using chown (Local filesystem under *nix only)
  • Apply an ACL (S3 only)
  • Apply other boto3 magic (S3 only)
  • Anything else not necessary for streaming the remote file object

Then that is application logic, and is the responsibility of your application. Application logic does not belong in a library.

@albertoa (Author) commented Jan 7, 2020

If it is out of scope for smart_open let's go ahead and close the PR.

However, I will say that we should not generalize from this PR about what is the responsibility of the application versus the library. When a library abstracts away the transport and provides a mechanism for transport-specific options, yet we state that transport behavior is entirely the application's responsibility, that creates a conflict.

I understand this specific use case doesn't belong in this library, but it should not be taken as a generalization.

Thanks for looking into it.

@albertoa closed this Jan 7, 2020