# Requester-pays buckets

Many large datasets are stored in "requester-pays" buckets. Cloud providers charge high fees for transferring data over the internet, and "requester pays" means that the person who *requests* the data, rather than the data provider, foots the bill. Cloud providers (AWS, Azure, Google) have slightly different configurations for this. In this notebook we'll just look at some data on AWS; see the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html) for details.

[NAIP imagery](https://registry.opendata.aws/naip/) is stored in a requester-pays bucket, and this notebook illustrates how to access it.
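As a rough illustration of what "requester pays" means at the API level, here is a minimal sketch that uses boto3 rather than the rasterio route taken in the rest of this notebook. The helper names `requester_pays_kwargs` and `head_object` are made up for this example. The sketch assumes boto3 is installed and that AWS credentials are configured. The key point is the `RequestPayer='requester'` argument, which tells S3 you agree to be billed for the request.

```python
from urllib.parse import urlparse


def requester_pays_kwargs(s3_uri):
    """Split an s3:// URI into Bucket/Key and add the requester-pays flag.

    Hypothetical helper for this sketch, not part of any library.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    return {
        "Bucket": parsed.netloc,
        "Key": parsed.path.lstrip("/"),
        "RequestPayer": "requester",  # acknowledge that we pay for the request
    }


def head_object(s3_uri, region_name="us-west-2"):
    """Fetch object metadata from a requester-pays bucket (needs credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region_name)
    return s3.head_object(**requester_pays_kwargs(s3_uri))
```

Calling `head_object('s3://naip-visualization/wa/2017/100cm/rgb/47122/m_4712264_ne_10_1_20170928.tif')` should return the object's metadata when credentials allow it. Without the `RequestPayer` argument, S3 is expected to reject the request.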

Keep in mind:

1. *If the bucket is in the same datacenter (AWS region) as your compute, you don't pay hefty egress fees!*
    * **Pangeo AWS Binder runs in us-west-2, and the NAIP data is located there as well**

1. *If you're doing large-scale analysis, you might also be charged for high numbers of GET requests (even if you're running in the same datacenter). It's hard to know in advance what the cost will be because the pricing schemes are quite complex: https://aws.amazon.com/s3/pricing/*
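To make the two cost components above concrete, here is a back-of-the-envelope sketch. The function `estimate_cost` and its parameters are hypothetical, and the prices must be looked up on the pricing page yourself; the sketch deliberately contains no real rates. It ignores pricing tiers and any other charges, so treat the result as a rough lower bound at best.

```python
def estimate_cost(n_get_requests, gb_transferred,
                  usd_per_1000_gets, usd_per_gb_egress):
    """Rough cost in USD: request charges plus data-transfer charges.

    All prices are inputs; look up current values on the S3 pricing page.
    Set usd_per_gb_egress to 0 when compute runs in the bucket's region.
    """
    request_cost = n_get_requests / 1000 * usd_per_1000_gets
    transfer_cost = gb_transferred * usd_per_gb_egress
    return request_cost + transfer_cost


# Placeholder prices for illustration only, NOT actual AWS rates
same_region = estimate_cost(1_000_000, 500, usd_per_1000_gets=0.01, usd_per_gb_egress=0.0)
cross_internet = estimate_cost(1_000_000, 500, usd_per_1000_gets=0.01, usd_per_gb_egress=0.1)
print(f"same region: ${same_region:.2f}, over the internet: ${cross_internet:.2f}")
```

The comparison shows why running compute next to the data matters: the request charge is the same in both cases, while the transfer term disappears in the same-region case.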

In [None]:
import rasterio
import rioxarray
import os
import hvplot.xarray

In [None]:
# Some GDAL Optimizations and authentication settings
env = dict(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
           AWS_REQUEST_PAYER='requester',
           AWS_DEFAULT_REGION='us-west-2')
os.environ.update(env)

## RGB Visual

In [None]:
# Let rasterio handle authentication
# If running on Pangeo AWS Binder our role has permissions to read NAIP buckets

from rasterio.session import AWSSession
env = rasterio.Env(AWSSession(region_name='us-west-2', 
                              #requester_pays=False, #RasterioIOError: Access Denied
                              requester_pays=True,
                             ))

In [None]:
# NOTE: analytic assets at s3://naip-analytic/ 
s3path = 's3://naip-visualization/wa/2017/100cm/rgb/47122/m_4712264_ne_10_1_20170928.tif'
with env:
    with rasterio.open(s3path) as src:
        print(src.profile)
        # Chunk sizes must be integers, so use floor division
        da = rioxarray.open_rasterio(src, chunks={'band': -1, 'x': src.width // 2, 'y': src.height // 2})
da

In [None]:
da.hvplot.rgb(x='x',y='y',rasterize=True, data_aspect=1, frame_width=500)

## Multiband Analytic

In [None]:
# NOTE: analytic COG assets at s3://naip-analytic/
s3path = 's3://naip-analytic/wa/2017/100cm/rgbir_cog/47122/m_4712264_ne_10_1_20170928.tif'
with env:
    with rasterio.open(s3path) as src:
        print(src.profile)
        # Chunk sizes must be integers, so use floor division
        da = rioxarray.open_rasterio(src, chunks={'band': -1, 'x': src.width // 2, 'y': src.height // 2})
# The rgbir asset is expected to have four bands: red, green, blue, near-infrared
da['band'] = ['red', 'green', 'blue', 'nir']
#da.name = 'm_4712264_ne_10_1_20170928'
ds = da.to_dataset('band')
ds['red']

In [None]:
ds['red'].hvplot.image(rasterize=True, data_aspect=1, frame_width=500, cmap='reds')

## Cluster considerations

If you're running computations on a distributed cluster, make sure to propagate the environment variables to the cluster workers. See more in [./3-dask-gateway-cluster.ipynb](./3-dask-gateway-cluster.ipynb).

In [None]:
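from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
# Pass the plain dict of settings; `env` was rebound to a rasterio.Env above
options.environment = {'GDAL_DISABLE_READDIR_ON_OPEN': 'EMPTY_DIR',
                       'AWS_REQUEST_PAYER': 'requester',
                       'AWS_DEFAULT_REGION': 'us-west-2'}
cluster = gateway.new_cluster(options)
cluster.scale(4) # let's get the same number of "workers" as our previous LocalCluster examples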
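
One way to check that the variables actually reached the workers is to ask each worker to report its own environment. The sketch below is a suggestion rather than part of the original workflow. `report_env` and `REQUIRED_VARS` are hypothetical names, and the commented-out lines assume a running cluster and use dask's `Client.run`, which executes a function on every worker.

```python
import os

# The settings this notebook relies on
REQUIRED_VARS = ("GDAL_DISABLE_READDIR_ON_OPEN",
                 "AWS_REQUEST_PAYER",
                 "AWS_DEFAULT_REGION")


def report_env(names=REQUIRED_VARS):
    """Return the value (or None if unset) of each variable in this process."""
    return {name: os.environ.get(name) for name in names}


# Locally this reports the notebook's own environment:
print(report_env())

# On a running cluster (assumes `cluster` from the cell above):
# client = cluster.get_client()
# client.run(report_env)  # returns a dict keyed by worker address
```

Any variable reported as `None` on a worker was not propagated, and reads from the requester-pays bucket on that worker are likely to fail.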