# Data / Cloud Storage

One of the main storage locations for HyTest data is '_The Cloud_', sometimes referred to as **'Object Storage'**. The data is kept in one or more data centers, which makes it easily available to network-connected devices. The main advantage is that if your compute engine is in the same data center (as is the case for many JupyterHub nodes), the data doesn't have to travel far to reach the compute power. This brings the computation to the data, rather than shipping large datasets across the internet to the compute engine.

[S3](https://aws.amazon.com/s3/) is Amazon's implementation of object storage, which pairs with the Amazon (AWS) nodes on which the JupyterHub runs. What follows is a brief demo of how S3 data is accessed (both read and write), and some pitfalls to watch out for.

The easiest way to access S3 data from within a Python program is via
[fsspec](https://filesystem-spec.readthedocs.io/en/latest/) -- a layer of
abstraction that lets us
interact with arbitrary storage mechanisms as if they were conventional file systems.
In other words, it makes S3 'look' like a local disk with folders and files.

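To make the 'looks like a file system' idea concrete, here is a minimal sketch that uses `fsspec`'s built-in in-memory backend (`'memory'`) as a stand-in for S3, so it runs without a network connection or credentials. The path `/demo-bucket/hello.txt` is made up for illustration; the `open` and `ls` calls are the same ones used against S3 in the sections that follow.

```python
import fsspec

# 'memory' is fsspec's in-memory backend; used here as a stand-in for S3.
mem = fsspec.filesystem("memory")

# Write a small text file (hypothetical path), then read it back.
with mem.open("/demo-bucket/hello.txt", "w") as f:
    f.write("hello from fsspec\n")

# List the "folder" and read the first line, just as with a local disk.
listing = mem.ls("/demo-bucket", detail=False)
with mem.open("/demo-bucket/hello.txt", "r") as f:
    first_line = f.readline()

print(listing)
print(first_line)
```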
## Access / Profile
The permissions scheme for S3 allows for anonymous/global read access, as well as secured access via specific credentials.  

We'll look at generic workflows using an anonymous-access bucket, then finish off with some private/credentialed operations.

## Anonymous Reads

A lot of data is available for global read, which does not require credentials or a profile. In this case, just set `anon=True` when creating the `fsspec` filesystem object.

```python
import fsspec

# Create a reference to a globally-readable space
fs = fsspec.filesystem(
    's3',
    anon=True   # Does not require credentials
)

fs.ls('s3://noaa-nwm-retrospective-2-1-zarr-pds/')
```

```python
# Other filesystem-like operations:

# glob = wildcard match:
fs.glob("s3://noaa-nwm-retrospective-2-1-zarr-pds/*.zarr")
```

```python
# Get metadata about a file
fs.info('noaa-nwm-retrospective-2-1-zarr-pds/index.html')
```

```python
# Use open() to get something that behaves like a file handle
# for low-level Python read/write operations:
with fs.open('noaa-nwm-retrospective-2-1-zarr-pds/index.html') as f:
    # Print the first 5 lines (the default mode is binary, so each line is bytes)
    for _ in range(5):
        print(f.readline())
```


The `fsspec` library lets you do other common file operations (provided you have adequate permissions). See the [API documentation](https://filesystem-spec.readthedocs.io/en/latest/api.html) for details.
Examples:
* `mkdir` -- makes a new directory / folder
* `mv` -- moves/renames a file or folder
* `rm` -- removes a file or folder

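As a rough illustration of those three operations, the sketch below runs them against `fsspec`'s in-memory backend rather than S3, so nothing real can be damaged. The `/fsops-demo/...` paths are hypothetical, and `pipe()` is used only to create a small object to move and remove.

```python
import fsspec

# In-memory stand-in for S3; all paths below are made up for illustration.
mem = fsspec.filesystem("memory")

mem.mkdir("/fsops-demo/run1")                       # make a folder
mem.pipe("/fsops-demo/run1/a.txt", b"data")         # write bytes to a path

mem.mv("/fsops-demo/run1/a.txt", "/fsops-demo/run1/b.txt")   # move/rename
moved = mem.exists("/fsops-demo/run1/b.txt")
old_gone = not mem.exists("/fsops-demo/run1/a.txt")

mem.rm("/fsops-demo/run1/b.txt")                    # remove
removed = not mem.exists("/fsops-demo/run1/b.txt")

print(moved, old_gone, removed)
```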
However, if what you need is something that looks like a file **name** or a **path** (as opposed to a file **handle**), you may need to instruct `fsspec` to map the S3 path with `get_mapper()`, like so:

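```python
import zarr

# Map the S3 path so that zarr can treat it like a key/value store
m = fs.get_mapper('s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr')
# Read only the consolidated metadata, not the data itself
g = zarr.open_consolidated(m)
print(g.tree())
```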

**WARNING !!!!**

We are deliberately using `zarr` commands that read **only** the metadata, and not the full
dataset.  This is a very, _very_, **very** large dataset, which you don't want to load
over the network to your desktop.  Execute full data read operations only if this notebook is being hosted
and run out of the same AWS data center where the data lives.

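To see what `get_mapper()` hands to a library like `zarr`, here is a small sketch using the in-memory backend in place of S3. The mapper behaves like a Python dictionary whose keys are paths relative to the mapped root and whose values are raw bytes. The store path and the placeholder contents are invented for the example and are not real Zarr metadata.

```python
import fsspec

mem = fsspec.filesystem("memory")
store = mem.get_mapper("/mapper-demo/example.zarr")   # hypothetical path

# Dictionary-style writes: key = relative path, value = bytes.
store[".zgroup"] = b'{"zarr_format": 2}'
store["temperature/.zarray"] = b"{}"   # placeholder bytes only

# Dictionary-style reads and iteration.
keys = sorted(store)
group_meta = store[".zgroup"]

print(keys)
print(group_meta)
```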
## The Good News
The good news about some of the larger science-oriented libraries (xarray, dask, pandas, zarr, etc.) is that they **automatically** handle the `fsspec` operations for you **IF YOUR ACCESS IS ANONYMOUS**.  So a workflow like this:
```python
import fsspec
import xarray as xr

fs = fsspec.filesystem('s3', anon=True)
m = fs.get_mapper('s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr')
dataset = xr.open_zarr(m, consolidated=True)
```
can actually be simplified to:
```python
dataset = xr.open_zarr('s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr', consolidated=True)
```

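Libraries that accept a URL string rely on `fsspec` to turn the protocol prefix into a filesystem object. The sketch below shows that resolution step with `fsspec`'s own helpers and a `memory://` URL, so it runs offline. It illustrates the idea only and is not the actual internals of xarray or zarr. The URL is hypothetical; for S3, extra options such as `anon=True` can be passed to `url_to_fs()` as keyword arguments.

```python
import fsspec
from fsspec.core import url_to_fs

url = "memory://url-demo/table.csv"   # hypothetical URL; 's3://...' works the same way

# Split the URL into a filesystem object and a path on that filesystem.
fs_obj, path = url_to_fs(url)

# fsspec.open() performs the same resolution and returns a file-like object.
with fsspec.open(url, "w") as f:
    f.write("site,flow\nA,1.5\n")

with fsspec.open(url, "r") as f:
    header = f.readline().strip()

print(fs_obj.protocol, path, header)
```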
Note that this is a feature of 
[very specific libraries](https://filesystem-spec.readthedocs.io/en/latest/#who-uses-fsspec).  If you are
reading or writing S3 locations outside of those libraries, you'll need to handle the `fsspec` maps yourself using `get_mapper()`.  

If you will be accessing an S3 storage location with `anon=False` (i.e. with credentials), then you will need to set things up
'longhand' with a mapper from `get_mapper()`.

## Credentialed Access
For most data, permissions are set by the owners of that data, and credentials are assigned to a 'profile'.

Credentials are stored outside of the Python program (typically in a master file in your `HOME` folder on the compute/Jupyter server).  This needs to be set up beforehand, usually by copying specific credentials into the right spot.

From the shell / command-line, it might look something like this:
```text
cp -R /shared/users/lib/.aws $HOME/.aws
```
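The `.aws` folder and files will be provided by the bucket owner.  Within that `.aws` folder is a credentials file (named `credentials` in the standard AWS layout; some setups also keep settings in a `config` file) which includes lines something like this: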

```text
[nhgf-development]
aws_access_key_id = XXXXXXXXXXXXXXXX
aws_secret_access_key = <magic key>

[default]
aws_access_key_id = XXXXXXXXXXXXXXXX
aws_secret_access_key = <magic key>
```
The names in brackets are '_profiles_', each of which names a set of credentials for accessing the S3 buckets.

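Because the file uses INI syntax, the available profile names can be listed with Python's standard `configparser` module. This is only a sketch: it parses an inline sample with placeholder values rather than a real credentials file, and it never prints the keys themselves.

```python
import configparser

# Inline sample mirroring the layout above; all values are placeholders.
sample = """\
[nhgf-development]
aws_access_key_id = XXXXXXXXXXXXXXXX
aws_secret_access_key = PLACEHOLDER

[default]
aws_access_key_id = XXXXXXXXXXXXXXXX
aws_secret_access_key = PLACEHOLDER
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

# Each bracketed section name is a profile.
profiles = parser.sections()
print(profiles)
```

To inspect a real file instead, `parser.read()` accepts a file path in place of the sample string.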
```python
import os
# Set profile via environment variable -- this ensures that all AWS-capable 
# functions can get the right profile without it being explicitly specified.
os.environ['AWS_PROFILE'] = 'nhgf-development'

import fsspec

fs = fsspec.filesystem(
    's3',                    # Use S3 protocol
    anon=False,              # Force fsspec to find credentials
    skip_instance_cache=True # Build a fresh filesystem object instead of reusing a cached instance
)
# the 'fs' object now gives us filesystem-like methods to use, like 'ls'
fs.ls('s3://nhgf-development/workspace/')
```

With greater permissions, you may be able to perform destructive operations (overwriting, removing, etc).  The essential form
is the same as it is for anonymous access, but you should take care with S3 locations where you have the ability to affect
existing files and data.
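
One way to reduce the risk of accidental overwrites is to check for an existing object before writing. The helper below, `write_if_absent`, is a hypothetical name and not part of `fsspec`; it is shown against the in-memory backend with made-up paths. The check and the write are two separate steps, so this guards against mistakes rather than against concurrent writers.

```python
import fsspec

def write_if_absent(fs, path, data, overwrite=False):
    """Write bytes to path, refusing to replace an existing object unless asked."""
    if fs.exists(path) and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    fs.pipe(path, data)

# Demo on the in-memory backend; with S3 the 'fs' object would come from
# fsspec.filesystem('s3', ...) as shown earlier.
demo = fsspec.filesystem("memory")
write_if_absent(demo, "/guard-demo/result.txt", b"first")

try:
    write_if_absent(demo, "/guard-demo/result.txt", b"second")
    blocked = False
except FileExistsError:
    blocked = True

contents = demo.cat("/guard-demo/result.txt")
print(blocked, contents)
```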