
S3 Support #91

Closed
wants to merge 182 commits into from

Conversation

@sjperkins (Contributor) commented Mar 31, 2023

Closes #46

TODO

General KVstore

Testing

  • TestKeyValueStoreBasicFunctionality passes
  • Use localstack for S3 mocking
  • Install localstack in github actions
  • Benchmarking: see S3 Support #91 (comment)
  • ...

Authorisation

  • Credentials from Environment Variables
  • Credentials from ~/.aws/credentials
  • Anonymous Access (no Authorization header if credentials empty)
  • Add session token to AWS4 signature headers if available
  • Credentials from EC2 Metadata Server
  • Credential refresh/timeout functionality (similar to Google Authprovider)

Documentation

  • rst/yml docs

Polishing

  • Refactor S3RequestBuilder to insert required headers automatically.
  • clang-format -style=google
  • Remove YAGNI MD5 Digester.

Python AWS4 Calculation: https://gist.github.com/sjperkins/15d3d196387cff8b6dc0dcca7f756fe8

@sjperkins sjperkins marked this pull request as draft March 31, 2023 16:54
@jbms (Collaborator) commented Mar 31, 2023

Thanks for starting work on this!

@sjperkins (Contributor Author) commented Apr 3, 2023

S3 Key-Value Store

This describes the implementation of an S3 Key Value Store.

S3 Builder Class

As suggested in #91 (review), an S3 Builder Class should be provided.

class S3RequestBuilder {
 public:
  S3RequestBuilder(std::string_view method);
  // Replicate HttpRequestBuilder methods here
  // ...

  S3RequestBuilder & AddAwsRegion(std::string_view aws_region);
  S3RequestBuilder & AddAwsAccessKey(std::string_view aws_access_key_id);
  S3RequestBuilder & AddAwsSecretKey(std::string_view aws_secret_key);
  // Adds an X-Amz-Security-Token header
  S3RequestBuilder & AddAwsSessionToken(std::string_view aws_session_token);
  S3RequestBuilder & AddEndpointUrl(std::string_view endpoint_url);
  HttpRequest BuildRequest();

 private:
  // Based on https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html
  // derives
  // 1. Canonical Request String
  // 2. Signing String
  // 3. Signature
  // and returns the AuthorizationHeader
  std::string AuthorizationHeader();
};
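
For reference, a minimal Python sketch of the signing-key derivation behind AuthorizationHeader() (mirroring the gist linked above; the function names and example values are illustrative):

import hashlib
import hmac

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def aws4_signature(secret_key, date_stamp, region, string_to_sign, service="s3"):
    # Signing-key chain: kSecret -> kDate -> kRegion -> kService -> kSigning
    k_date = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)  # e.g. "20230405"
    k_region = _sign(k_date, region)                                   # e.g. "us-east-1"
    k_service = _sign(k_region, service)
    k_signing = _sign(k_service, "aws4_request")
    # The string-to-sign is built from the hash of the canonical request
    return hmac.new(k_signing, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()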

Obtaining configuration variables

AWS configuration data can be obtained from multiple locations, primarily:

  1. Configuration files ~/.aws/credentials and ~/.aws/config.
  2. Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_PROFILE, and AWS_SESSION_TOKEN.

Environment variables take precedence over configuration files.

Python pseudo-code for deriving AWS variables

import os
from configparser import ConfigParser

aws_profile = os.environ.get("AWS_PROFILE", "default")
aws_credentials_file = os.environ.get("AWS_SHARED_CREDENTIALS_FILE", "~/.aws/credentials")
aws_config_file = os.environ.get("AWS_CONFIG_FILE", "~/.aws/config")

# Read the credentials file
credentials = ConfigParser()
credentials.read(os.path.expanduser(aws_credentials_file))
aws_access_key = credentials.get(aws_profile, "aws_access_key_id", fallback=None)
aws_secret_key = credentials.get(aws_profile, "aws_secret_access_key", fallback=None)
aws_session_token = credentials.get(aws_profile, "aws_session_token", fallback=None)

# Read the region from the config file
# (note: non-default profiles appear as [profile <name>] sections in ~/.aws/config)
config = ConfigParser()
config.read(os.path.expanduser(aws_config_file))
aws_region = config.get(aws_profile, "region", fallback=None)

# Environment variables take precedence over the configuration files
aws_region = os.environ.get("AWS_REGION", aws_region)  # AWS_DEFAULT_REGION?
aws_access_key = os.environ.get("AWS_ACCESS_KEY_ID", aws_access_key)
aws_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", aws_secret_key)
aws_session_token = os.environ.get("AWS_SESSION_TOKEN", aws_session_token)

Constructing S3 Endpoint URLs

As per the AWS S3 REST API, there are two methods for constructing S3 URLs:

  1. Virtual hosted-style: <bucket_name>.s3.<aws_region>.amazonaws.com
  2. Path style: s3.<aws_region>.amazonaws.com/<bucket_name>

The above patterns are specific to AWS, but custom endpoints are used to access other S3 implementations such as MinIO and Ceph.

Currently, there is no agreed-upon environment variable for overriding the endpoint URL. Something like TENSORSTORE_S3_ENDPOINT_URL may be the best way forward here.
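
For illustration, a minimal Python sketch of the three URL forms (the helper name, and treating custom endpoints as path style, are assumptions):

def s3_url(bucket, key, region, endpoint=None, path_style=False):
    if endpoint is not None:
        # Custom endpoints (MinIO, Ceph, localstack, ...) assumed to be path style
        return f"{endpoint}/{bucket}/{key}"
    if path_style:
        return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"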

S3 KVStore

TBD.

TODO

  1. S3RequestBuilder class and tests
  2. AWS INI file config parsing and environment variable reading
  3. S3 KVStore

Questions

  1. It is currently the responsibility of the GCS KVStore to construct the GCS endpoint URLs for reads and writes.
    Should this be the responsibility of the S3 KVStore or the S3RequestBuilder class, which seems to be configured
    with more information?

  2. Will the S3 KVStore be more or less a cut-and-paste of the GCS KVStore? If so, should common elements be factored out?

  3. Is there a preferred C++ INI file parser that should be used? There's a list of projects here, along with some other likely candidates.

@jbms (Collaborator) commented Apr 3, 2023

S3 Builder Class

Looks good.

Obtaining configuration variables

Looks good for Amazon S3 endpoints. For other endpoints we may want to rely on a ContextResource to specify where to find the credentials, e.g. name of environment variables to read or path to configuration file.

Another important source for credentials is the EC2 metadata server.

Constructing S3 Endpoint URLs

As per the AWS S3 REST API, there are two methods for constructing S3 URLs:

  1. Virtual hosted-style: <bucket_name>.s3.<aws_region>.amazonaws.com
  2. Path style: s3.<aws_region>.amazonaws.com/<bucket_name>

I think we want to support the following:

  1. When talking to the real Amazon S3, it should be possible to just specify the bucket, not the region, since there is a global namespace of buckets, either using a json spec: {"driver": "s3", "bucket": "my-bucket", "path": "my-path"} or a URL: "s3://my-bucket/my-path". You can determine the region from the x-amz-bucket-region response header (sketched after this list). Whether we use virtual host style or path style URLs to make requests does not really matter, but note that:
    • If you use virtual host style, you can just use <bucket_name>.s3.amazonaws.com (no region). However, you still need to know the region in order to generate a correct signature. I think you can make an initial request without a signature in order to determine the region. If the bucket name contains dots, then I think some special curl options will need to be set to avoid certificate validation errors.
    • If you use path style, you need to know the region even to make an initial request. I think you can make an initial request to the default s3.amazonaws.com endpoint to determine the region, but you should check on that.
  2. When talking to a non-Amazon s3 server, the endpoint should be specified either in the json spec: {"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path"} or as a URL: "s3+https://my-server/my-bucket/my-path".
  3. We don't want to rely on specifying the endpoint via environment variable because that prevents the same program from using more than one endpoint (e.g. copying from amazon s3 to non-amazon s3 server). However, this could potentially be supported solely for testing, as with the gcs driver.
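
A minimal stdlib-only Python sketch of the region probe from point 1 (error handling simplified):

import urllib.error
import urllib.request

def bucket_region(bucket: str) -> str:
    # HEAD the global endpoint; S3 reports the bucket's region in a response header
    req = urllib.request.Request(f"https://{bucket}.s3.amazonaws.com", method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.headers["x-amz-bucket-region"]
    except urllib.error.HTTPError as e:
        # Even a 403 for a private bucket carries x-amz-bucket-region
        return e.headers["x-amz-bucket-region"]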

One challenge with supporting multiple endpoints is that each endpoint requires its own credentials.

  1. It is currently the responsibility of the GCS KVStore to construct the GCS endpoint URLs for reads and writes.
    Should this be the responsibility of the S3 KVStore or the S3RequestBuilder class, which seems to be configured
    with more information?

Probably it would make sense to keep at least most of the logic for generating urls in the kvstore, rather than S3RequestBuilder.

  • Will the S3 KVStore be more or less a cut-and-paste of the GCS KVStore? If so, should common elements be factored out?

If there are some common elements that can be factored out, that would probably be good. However note that the GCS kvstore driver uses the GCS "JSON" api rather than the s3-compatible API, so it won't be that similar. The largest difference will be in supporting List, which will require parsing the XML response. Not sure if that will require an XML library or if it can be done with just regular expressions.

As far as the INI library goes --- be aware that there are variations in format. E.g. the mINI library seems to use ; as the comment character while AWS uses #. It might be simpler to just use regular expressions to parse.

@sjperkins (Contributor Author) commented Apr 5, 2023

I think you can make an initial request to the default s3.amazonaws.com endpoint to determine the region, but you should check on that.

By inspection, the following seems to work to get the bucket region for both public and private buckets.

$ curl -sI https://<bucket_name>.s3.amazonaws.com | grep bucket-region

This then seems to be the region to use in signature calculations; quoting from the example here:

The bucket is assumed to be in the US East (N. Virginia) Region. The credential Scope and the Signing Key calculations use us-east-1 as the Region specifier. For information about other Regions, see Regions and Endpoints in the AWS General Reference.

I interpret this to mean that the signature depends on the actual bucket location and that the AWS_REGION and AWS_DEFAULT_REGION in config files/environment variables should probably be ignored.

Another important source for credentials is the EC2 metadata server.

I assume this refers to the following idea:

role_name=$( curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/ )
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${role_name}
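
For illustration, the equivalent lookup in Python (IMDSv1, stdlib only; the timeout value is illustrative):

import json
import urllib.request

IMDS = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def ec2_credentials(timeout=0.2):
    # Outside EC2 this connection is typically refused or times out quickly
    role = urllib.request.urlopen(IMDS, timeout=timeout).read().decode("utf-8")
    doc = json.loads(urllib.request.urlopen(IMDS + role, timeout=timeout).read())
    return doc["AccessKeyId"], doc["SecretAccessKey"], doc["Token"], doc["Expiration"]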

I assume the precedence from highest to lowest would be:

  1. EC2 Metadata Server
  2. Environment Variables
  3. Credentials files

Would, for example, a 200 ms connect timeout on the metadata server connection be sufficient for a decent user experience outside AWS?

One challenge with supporting multiple endpoints is that each endpoint requires its own credentials.

True, and I suspect most software just assumes a single endpoint. One way I've dealt with this in the past (in combination with fsspec) is to have the following sort of configuration in yaml:

"s3://bucket-one":
  aws_access_key_id: ...
  aws_secret_access_key: ...

"s3+https://bucket-two":
  aws_access_key_id: ...
  aws_secret_access_key: ...

and then apply the following pseudocode when handling access keys:

import yaml

s3url = "s3://bucket-two/path/to/data.bin"
aws_access_key_id = "ABCDEF..."  # default

with open("s3config.yaml") as f:
    for url_prefix, config in yaml.safe_load(f).items():
        if s3url.startswith(url_prefix):
            endpoint_url = config.get("endpoint_url")
            aws_access_key_id = config["aws_access_key_id"]
            break

It's a bit messy when combined with ~/.aws/credentials, which has separate profile sections. Perhaps we could just override variables in the json spec? For example:

{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "profile": "myprofile"}
{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "aws_access_key_id": "...", "aws_secret_access_key": "...", "aws_session_token": "..."}

As far as the INI library goes --- be aware that there are variations in format. E.g. the mINI library seems to use ; as the comment character while AWS uses #. It might be simpler to just use regular expressions to parse.

Yes, INI is simple enough for regex. Let's hold off on the XML library decision until the XML complexity is clearer.
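
For example, a regex-based sketch that skips full-line # or ; comments (simplified; assumes values do not themselves contain #):

import re

SECTION_RE = re.compile(r"^\s*\[\s*(?:profile\s+)?([^\]]+?)\s*\]\s*$")
KEY_VALUE_RE = re.compile(r"^\s*([A-Za-z0-9_-]+)\s*=\s*([^#]*?)\s*(?:#.*)?$")

def parse_aws_ini(text):
    sections, current = {}, None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith(("#", ";")):
            continue
        section = SECTION_RE.match(line)
        if section:
            current = sections.setdefault(section.group(1), {})
            continue
        kv = KEY_VALUE_RE.match(line)
        if kv and current is not None:
            current[kv.group(1)] = kv.group(2)
    return sections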

EDIT: Suggest placing keys for specific endpoints in the JSON spec

@jbms (Collaborator) commented Apr 6, 2023

I assume the precedence from highest to lowest would be:

  1. EC2 Metadata Server
  2. Environment Variables
  3. Credentials files

Would, for example, a 200 ms connect timeout on the metadata server connection be sufficient for a decent user experience outside AWS?

I think the order should be:

  1. Environment variables
  2. Credentials files
  3. Metadata server

You might check the AWS SDK to see what it does -- I expect it uses something like that order. That is also how we handle Google Cloud Storage credentials.

As far as the timeout, you might check what the AWS SDK does by default. In most cases I expect the connection will be refused immediately if not on EC2, and the timeout doesn't matter. Essentially the only time the timeout actually comes into play is if you have no credentials and wish to use anonymous access, and you have a network configuration such that requests to the EC2 metadata server are silently dropped rather than refused immediately.

@jbms (Collaborator) commented Apr 6, 2023

I interpret this to mean that the signature depends on the actual bucket location and that the AWS_REGION and AWS_DEFAULT_REGION in config files/environment variables should probably be ignored.

Yes I believe that is correct.

By inspection, the following seems to work to get the bucket region for both public and private buckets.

Yes, looks right. I think there is one extra complication:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html

Buckets created in us-east-1 prior to Mar 1, 2018 could have names that are not DNS compatible. For those, we cannot use the method you described to determine the region, since they cannot be used with virtual hosted-style requests. However, we can instead detect that the bucket name does not conform to the new rules, and infer that it must be a us-east-1 bucket.
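
A Python sketch of that inference (simplified; extra checks such as the --ol-s3 suffix and IP-address-like names are omitted):

import re

# Post-2018 (DNS-compatible) rule, simplified: lowercase labels joined by dots
STANDARD = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$")
# Legacy us-east-1 rule additionally allowed uppercase letters and underscores
LEGACY = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9_-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9_-]*[a-zA-Z0-9])?)*$")

def classify_bucket(name: str) -> str:
    if 3 <= len(name) <= 63 and STANDARD.match(name):
        return "standard"          # region must be discovered
    if len(name) <= 255 and LEGACY.match(name):
        return "legacy-us-east-1"  # infer us-east-1
    return "invalid"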

Perhaps we could just override variables in the json spec?

Yes, that would probably make sense. In any case it probably makes sense to first implement a basic version before worrying about all of the possible extra options.

@laramiel (Collaborator) commented Apr 7, 2023

It's a bit messy when combined with ~/.aws/credentials, which has separate profile sections. Perhaps we could just override variables in the json spec? For example:

{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "profile": "myprofile"}
{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "aws_access_key_id": "...", "aws_secret_access_key": "...", "aws_session_token": "..."}

For other drivers we've avoided setting any secrets in the spec as they become easier to leak that way. It does make sense to allow a region setting in the verbose spec, as well as profile-related directives, IMO:

{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "region": "us-east-1"}

@sjperkins (Contributor Author) commented

Some progress on:

  • Validating S3 Bucket Names
  • Validating S3 Object Names
  • S3 Signature URI Encoding

A review might not be worth it at this point, but one thing that may be controversial is the move of the following:

  • AsciiSet
  • PercentEncodeReserved
  • IntToHexDigit
  • HexDigitToInt

from path.cc into ascii_utils.{h,cc}. Let me know if this doesn't work or if you want it done differently.

Other than that, I broadly plan to implement in the following order:

  • S3RequestBuilder (including moving functionality from the signature.{h,cc} files), along with some basic Request testing here.
  • Configuration management (ini files, environment variable querying, EC2 metadata querying)
  • S3 KVStore

These might be logical points at which reviews may be worthwhile.

@jbms (Collaborator) commented Apr 13, 2023

Some progress on:

  • Validating S3 Bucket Names
  • Validating S3 Object Names
  • S3 Signature URI Encoding

A review might not be worth it at this point, but one thing that may be controversial is the move of the following:

  • AsciiSet
  • PercentEncodeReserved
  • IntToHexDigit
  • HexDigitToInt

from path.cc into ascii_utils.{h,cc}. Let me know if this doesn't work or if you want it done differently.

Moving these is fine.

Other than that, I broadly plan to implement in the following order:

  • S3RequestBuilder (including moving functionality from the signature.{h,cc} files), along with some basic Request testing here.
  • Configuration management (ini files, environment variable querying, EC2 metadata querying)

I'd suggest that you just implement the minimum here initially, e.g. getting the access key from the env var, and then work on the S3 kvstore, before worrying about more complicated methods of getting the credentials. That way you will have a workable driver sooner.

  • S3 KVStore

These might be logical points at which reviews may be worthwhile.

@tchaton commented Apr 13, 2023

Any ETA for this PR to land? This is a very exciting development. Really looking forward to trying it out when it lands.

@tchaton commented Apr 13, 2023

Hey @sjperkins,

Could you share a snippet of the Python API?

@sjperkins (Contributor Author) commented

Any ETA for this PR to land? This is a very exciting development. Really looking forward to trying it out when it lands.

Hi @tchaton. I'm not developing this full time, so I can't estimate an ETA here.

Could you share a snippet of the Python API?

I haven't touched any Python code yet, but I'd imagine the Python interface would be very similar to the GCS key-value driver: https://google.github.io/tensorstore/kvstore/gcs/index.html
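
Hedging heavily, a hypothetical usage sketch (spec shape taken from this thread; the eventual API may differ):

import tensorstore as ts

store = ts.KvStore.open({
    "driver": "s3",
    "bucket": "my-bucket",
    "path": "my-path/",
}).result()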

@sjperkins (Contributor Author) commented Jul 31, 2023

A localstack fixture has been added.

  1. Added a host variable to the S3 json spec, as localstack expects <bucket>.s3.<region>.localstack.localhost.com style hosts in the HTTP requests. This could probably do with some more thought in the context of e.g. MinIO/Ceph, so it might need to change.
  2. This test in the basic test suite fails on localstack, but not on S3. Haven't had time to dig into this. Might be easily fixable in localstack itself.
  3. No localstack install on github actions yet.
  4. Added the requester-pays header if set in the spec (see the note just below).
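
For reference, requester-pays requests carry the x-amz-request-payer header, e.g.:

headers["x-amz-request-payer"] = "requester"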



ABSL_FLAG(std::string, localstack_binary, "",
"Path to the localstack");
@sjperkins (Contributor Author):

@laramiel I adapted this from the gcs test bench, but didn't have enough time to understand and investigate how bazel linked (and installed?) the test bench utility. Might this be possible with localstack?

@laramiel (Collaborator):

Yes, see here:
https://github.com/google/tensorstore/tree/master/third_party/pypa

Add to test_requirements.txt then run generate_workspace.py

@laramiel (Collaborator):

I'd probably name this s3_localstack_test, then add to the BUILD file tags = ["manual"] so it's not triggered on every bazel test ... invocation.

@laramiel (Collaborator):

... also, split the tests which don't require a running localstack into a separate test.

};

void SetUp() override {
for(auto &pair: saved_vars) {
@laramiel (Collaborator):

Saving the env vars shouldn't be necessary, since they are process scoped and should not leak past test execution.

static constexpr char kAwsRegion[] = "af-south-1";
static constexpr char kUriScheme[] = "s3";
static constexpr char kDriver[] = "s3";
static constexpr char kLocalStackEndpoint[] = "http://localhost:4566";
@laramiel (Collaborator):

The endpoint should be a flag. Otherwise, if localstack were spawned from the test itself, then ideally there would be a --port argument that you could pass, and use tensorstore/internal/http/transport_test_utils.h PickUnusedPortOrDie().

namespace {

/// Exemplar ListObjects v2 Response
/// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_ResponseSyntax
@laramiel (Collaborator) commented Jul 31, 2023:

Is it possible to get the response in json format? If not, maybe we should use a small xml parser such as ... (looking around) ... tinyxml2? expat?

if(absl::EndsWith(bucket, "--ol-s3")) return BucketNameType::Invalid;

// Bucket name is a series of labels split by .
// https://web.archive.org/web/20170121163958/http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
@laramiel (Collaborator) commented Jul 31, 2023:

Once we're using regexes, just write the full regex, something like:

static LazyRE2 kValidBucket = {
    "^(([a-z0-9][a-z0-9-]*[a-z0-9])|([a-z0-9]))"
    "(\\.(([a-z0-9][a-z0-9-]*[a-z0-9])|([a-z0-9])))*$"};

static LazyRE2 kOldValidBucket = {
    "^(([a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9])|([a-zA-Z0-9]))"
    "(\\.(([a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9])|([a-zA-Z0-9])))*$"};

// The dot is escaped as \\. in C++ source so the regex sees \.
if (!RE2::FullMatch(bucket, *kOldValidBucket)) {
  return BucketNameType::kInvalid;
}
if (bucket.size() <= 63 && RE2::FullMatch(bucket, *kValidBucket)) {
  return BucketNameType::kStandard;
}
return BucketNameType::kOldUsEast;

@laramiel (Collaborator) commented Sep 1, 2023

I'm preparing to submit most of this in the next day or so, with some minor changes.

copybara-service bot pushed a commit that referenced this pull request Sep 1, 2023
Includes changes for includes, style, and some fixes from the following pull request:

#91

This is still a work in progress, so it is not yet available by default.
To use, add a bazel dependency on //tensorstore/kvstore/s3

#46

PiperOrigin-RevId: 562068999
Change-Id: Id660a72224ef865c858484c04985f9fc4f0e3bf5
@sjperkins (Contributor Author) commented Sep 2, 2023

I'm preparing to submit most of this in the next day or so, with some minor changes.

I think this makes sense -- the PR contains core S3 functionality, and further additions can be split into smaller PRs, which will be easier to review.
