Requester pays environment variable for assets on S3 #82

Closed
arthurelmes opened this issue Aug 1, 2022 · 9 comments · Fixed by #86

Comments

@arthurelmes

Although it is possible to use odc.stac.load to load items whose assets are in requester-pays S3 buckets (e.g. landsat-c2l2-sr), the documentation is not entirely clear on how to accomplish this. Setting the environment variable AWS_REQUEST_PAYER to requester does the trick, but I was unable to use odc.stac.configure_rio(aws={"requester_pays": True}) to load the data. I'm happy to make a docs PR here to address this topic, unless there is a more native way of going about this.
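A minimal sketch of the environment-variable workaround described above, using only the standard library. The helper name `request_payer_env` is hypothetical and not part of odc-stac. The only thing taken from the report is that setting `AWS_REQUEST_PAYER` to `requester` before loading made the read succeed.

```python
import os
from contextlib import contextmanager


@contextmanager
def request_payer_env(value: str = "requester"):
    """Temporarily set AWS_REQUEST_PAYER, restoring the previous state on exit."""
    previous = os.environ.get("AWS_REQUEST_PAYER")
    os.environ["AWS_REQUEST_PAYER"] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("AWS_REQUEST_PAYER", None)
        else:
            os.environ["AWS_REQUEST_PAYER"] = previous


with request_payer_env():
    # The odc.stac.load(...) call would go here (omitted: it needs AWS credentials).
    inside = os.environ["AWS_REQUEST_PAYER"]
```

Scoping the variable to a `with` block keeps the setting from leaking into unrelated reads in the same process. Whether that matters depends on the workflow.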

@Kirill888
Member

Thanks for the feedback, @arthurelmes. While using environment variables is certainly a valid approach, it is more proper (and probably more robust when using remote Dask workers) to use configure_s3_access with the requester_pays=True flag, so if that is not working, it's an error that needs fixing. I'm guessing this needs to be translated into request_payer="requester" somewhere along the way to GDAL, and that is not happening. It's not a well-tested feature, so this is not all that surprising.

@Kirill888
Member

And please go ahead with a PR

@Kirill888
Member

Actually, conversion from boolean happens inside rasterio, so that's probably not the culprit. But it looks like the request_payer environment variable is missing from this list:

SESSION_KEYS = (
    *SECRET_KEYS,
    "AWS_DEFAULT_REGION",
    "AWS_REGION",
    "AWS_S3_ENDPOINT",
    "AWS_NO_SIGN_REQUEST",
    "AZURE_STORAGE_ACCOUNT",
    "AZURE_NO_SIGN_REQUEST",
    "OSS_ENDPOINT",
    "SWIFT_STORAGE_URL",
)

This might be related to the issue you are observing, where the "requester_pays" option has no effect. Do you want to try adding it and testing whether that fixes it?
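To illustrate the hypothesis, here is a sketch of how an allow-list of keys could silently drop an option. This is not odc-stac's actual code: the function `capture_session_env` and the shortened key tuple are assumptions made for the example.

```python
# Shortened, illustrative allow-list (the real tuple also unpacks SECRET_KEYS).
SESSION_KEYS = (
    "AWS_DEFAULT_REGION",
    "AWS_REGION",
    "AWS_S3_ENDPOINT",
    "AWS_NO_SIGN_REQUEST",
)


def capture_session_env(options: dict, keys: tuple = SESSION_KEYS) -> dict:
    """Keep only the options whose names appear in `keys`."""
    return {k: v for k, v in options.items() if k in keys}


options = {"AWS_REGION": "ap-southeast-1", "AWS_REQUEST_PAYER": "requester"}

# With the original list, the requester-pays option is filtered out.
captured = capture_session_env(options)

# With the key added, it survives the filter.
captured_fixed = capture_session_env(options, SESSION_KEYS + ("AWS_REQUEST_PAYER",))
```

If the library filters options this way, any key missing from the tuple would never reach GDAL, which would match the symptom of the flag having no effect.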

@arthurelmes
Author

Sure thing -- I will try adding the key locally, and see if I can get it to work without specifying the env var externally. Thanks!

@arthurelmes
Author

@Kirill888 I tried adding AWS_REQUEST_PAYER to the SESSION_KEYS, but no joy. I also tried adding "AWS_REQUEST_PAYER": "requester", to the GDAL_CLOUD_DEFAULTS, out of interest, but no luck there either. I can see that the GlobalRioConfig object contains _aws: {'requester_pays': True}, and _gdal_opts: {'AWS_REQUEST_PAYER': 'requester ...}. However, still only by adding os.environ['AWS_REQUEST_PAYER'] = GDAL_CLOUD_DEFAULTS.get("AWS_REQUEST_PAYER", "") right before the actual loading am I able to load the Landsat data.

@mpaget

mpaget commented Aug 4, 2022

Hi @arthurelmes, I've been playing with Landsat too. The following worked for me; there may be something in this that works for you. The following shows two things:

  1. configure_s3_access(requester_pays=True) sets the AWS_REQUEST_PAYER envar with rasterio
  2. A patch_url func is (in my case) required to read from the S3 bucket.

from pystac_client import Client
from odc.stac import configure_s3_access, stac_load
from odc.stac._rio import _dump_rio_config, _CFG    # For testing

configure_s3_access(requester_pays=True);
with _CFG.env():
    _dump_rio_config()

GDAL_DATA = /usr/local/share/gdal
GDAL_DISABLE_READDIR_ON_OPEN = EMPTY_DIR
GDAL_HTTP_MAX_RETRY = 10
GDAL_HTTP_RETRY_DELAY = 0.5
AWS_ACCESS_KEY_ID = xx..xx
AWS_SECRET_ACCESS_KEY = xx..xx
AWS_SESSION_TOKEN = xx..xx
AWS_REGION = ap-southeast-1
AWS_REQUEST_PAYER = requester

catalog = Client.open('https://landsatlook.usgs.gov/stac-server/')
product = 'landsat-c2l2-sr'
query_cfg = ["platform=LANDSAT_8", "landsat:collection_category=T1"]

query = catalog.search(
    collections=[product], datetime=times, bbox=bbox, query=query_cfg
)
items = list(query.get_items())

def patch(uri: str) -> str:
    """Return the Landsat S3 version of the URI"""
    return uri.replace('https://landsatlook.usgs.gov/data/', 's3://usgs-landsat/')

xx = stac_load(
    items,
    bands=("B4",),
    patch_url = patch,
)
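The URL rewrite can be checked on its own, without touching S3. In this sketch the asset path is hypothetical and exists only to exercise the function. Real hrefs come from the STAC items.

```python
def patch(uri: str) -> str:
    """Return the Landsat S3 version of the URI."""
    return uri.replace("https://landsatlook.usgs.gov/data/", "s3://usgs-landsat/")


# Hypothetical asset path, used only to exercise the rewrite.
https_href = "https://landsatlook.usgs.gov/data/example/path/B4.TIF"
s3_href = patch(https_href)

# URLs that do not start with the landsatlook prefix pass through unchanged.
unrelated = patch("https://example.com/other.tif")
```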

@arthurelmes
Author

Strange. Even a direct copy/paste of your code (adding in the times and bbox vars) results in the same RasterioIOError AccessDenied error. I can avoid the error by adding the line os.environ["AWS_REQUEST_PAYER"] = "requester"

Library versions:

pystac                    1.6.1              pyhd8ed1ab_1    conda-forge
pystac-client             0.4.0                    pypi_0    pypi
odc-geo                   0.2.1              pyhd8ed1ab_0    conda-forge
odc-stac                  0.3.1              pyhd8ed1ab_0    conda-forge

Thanks for the help!

@mpaget

mpaget commented Aug 12, 2022

I returned to this and encountered essentially the same behaviour reported by @arthurelmes. Manually setting os.environ["AWS_REQUEST_PAYER"] doesn't work for me as a workaround, but that could be due to our different local environments.

This issue, I think, relates to the (configure_s3) environment not being activated for the (current) thread. That is, under our use patterns, stac_load() is not activating (via rasterio.env.Env()) the correct set of environment parameters set by an earlier call to configure_s3_access().

_load.py#L189 seems to be doing the correct thing, while _load.py#L195, for example, could potentially create the situation we're seeing in this issue. I'll keep digging to find the path(s) through stac_load() that may not trigger the expected _CFG.env() activation.
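The per-thread behaviour suspected here can be reproduced with the standard library alone. The sketch below uses `threading.local` as a stand-in for an environment that is active only in the thread that activated it. The names `activate` and `active_env` are hypothetical and do not come from rasterio or odc-stac.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # stand-in for per-thread environment state


def active_env():
    """Return the options active in the calling thread, or None."""
    return getattr(_local, "env", None)


def activate(options: dict) -> None:
    """Activate options for the calling thread only."""
    _local.env = dict(options)


# Activate in the main thread, as a configure call followed by a load might.
activate({"AWS_REQUEST_PAYER": "requester"})

# A worker thread starts with empty thread-local state.
with ThreadPoolExecutor(max_workers=1) as pool:
    seen_by_worker = pool.submit(active_env).result()

seen_by_main = active_env()
```

In this model the main thread sees the option while the worker sees nothing, which would explain a read succeeding or failing depending on which thread performs it.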

As an example or workaround, the following seems to repeatedly work for me.

# Borrowing from above...
from odc.stac import configure_s3_access, stac_load
from odc.stac._rio import _CFG
configure_s3_access(requester_pays=True);

# Manually wrap stac_load the first time
with _CFG.env():
    xx = stac_load(
        items,
        bands=("B4",),
        patch_url=patch,
    )

# With the rasterio.env.Env() activated for the current thread, further unwrapped loads work as expected
yy = stac_load(more_items, bands=("B4",), patch_url=patch)

Kirill888 added a commit that referenced this issue Aug 12, 2022
When loading data without Dask, configured RIO environment was ignored,
and in case of multi-threaded processing currently activated environment
didn't get passed along to processing threads either.

Also adds requester payer env var to a list of captured envs.

@Kirill888
Member

@mpaget thanks for looking into this, pretty sure that should be fixed by commit referenced above.
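For readers curious about the general shape of such a fix, here is a sketch of one common pattern: capture the options once, then re-activate them inside each worker. It is not the code from the referenced commit, and `env`, `read_option`, and `with_env` are hypothetical names built on the standard library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

_local = threading.local()  # stand-in for per-thread environment state


@contextmanager
def env(options: dict):
    """Activate options for the calling thread for the duration of the block."""
    previous = getattr(_local, "env", None)
    _local.env = dict(options)
    try:
        yield
    finally:
        _local.env = previous


def read_option(name: str):
    """Stand-in for a reader that consults the calling thread's active options."""
    active = getattr(_local, "env", None) or {}
    return active.get(name)


def with_env(options: dict, func):
    """Wrap func so it runs inside env(options) in whichever thread calls it."""
    def wrapper(*args, **kwargs):
        with env(options):
            return func(*args, **kwargs)
    return wrapper


captured = {"AWS_REQUEST_PAYER": "requester"}  # captured once in the main thread

with ThreadPoolExecutor(max_workers=2) as pool:
    plain = pool.submit(read_option, "AWS_REQUEST_PAYER").result()
    wrapped = pool.submit(with_env(captured, read_option), "AWS_REQUEST_PAYER").result()
```

The unwrapped task sees no option, while the wrapped task sees the captured value regardless of which worker thread runs it.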

Kirill888 added a commit that referenced this issue Aug 13, 2022
When loading data without Dask, configured RIO environment was ignored,
and in case of multi-threaded processing currently activated environment
didn't get passed along to processing threads either.

Also adds requester payer env var to a list of captured envs.