## Getting Started
In this notebook we will show direct access of archive products in the AWS Simple Storage Service (S3). In this demo, we will showcase the usage of **SWOT Simulated Level-2 KaRIn SSH from GLORYS for Science Version 1**. More information on the dataset can be found at https://podaac.jpl.nasa.gov/dataset/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1.

We will access the data from inside the AWS cloud (us-west-2 region, specifically) and load a time series made of multiple netCDF datasets into a single xarray dataset. This approach leverages S3 native protocols for efficient access to the data.

If you use this notebook as a reference in the future, please note that we are not doing collection discovery here; we assume the collection of interest has already been determined.

### Requirements
AWS

This notebook should run on an EC2 instance in AWS region us-west-2, as mentioned above. We recommend using an EC2 instance with at least 8GB of memory available.

The notebook was developed and tested using a t2.small instance (_ CPUs; 8GB memory).

Python 3

The `json` and `os` imports used below come from the Python standard library. The remaining imports are third-party packages, which you will need to install into your Python 3 environment if you have not already done so:

```
boto3
s3fs
xarray
matplotlib
cartopy
```

## Learning Objectives
* import needed libraries
* authenticate for NASA Earthdata archive (Earthdata Login)
* obtain AWS credentials for Earthdata DAAC archive in AWS S3
* access DAAC data by downloading directly from S3 within us-west-2 and operating on those files
* access DAAC data directly from the in-region S3 bucket without moving or downloading any files to your local (cloud) workspace
* plot the first time step in the data

In [None]:
import boto3
import json
import xarray as xr
import s3fs
import os
import cartopy.crs as ccrs
from matplotlib import pyplot as plt
from os import path
%matplotlib inline


## Get a temporary AWS Access Key based on your Earthdata Login user ID

By accessing https://archive.podaac.earthdata.nasa.gov/s3credentials, you will be given a temporary AWS access credential. This credential lasts 1 hour and gives you access to the PO.DAAC S3 collection buckets. We will store it in environment variables for use by the boto3 and s3fs libraries.

In [None]:
# Log in at the s3credentials endpoint below, then paste the JSON it returns into the 's3_credential' variable:
# https://archive.podaac.earthdata.nasa.gov/s3credentials
s3_credential ='''
<<PASTE RESULT OF https://archive.podaac.earthdata.nasa.gov/s3credentials HERE>>
'''
creds = json.loads(s3_credential)

In [None]:
os.environ["AWS_ACCESS_KEY_ID"] = creds["accessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["secretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = creds["sessionToken"]

s3 = s3fs.S3FileSystem(anon=False) 
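
A malformed paste is easy to miss until the first S3 call fails. The following is a minimal sketch, not part of the original workflow, that checks the pasted JSON for the three keys used above and maps them to the `key`, `secret`, and `token` keyword arguments of `s3fs.S3FileSystem`. The helper name `parse_s3_credentials` and the placeholder values are assumptions made for illustration.

```python
import json

# The three fields this notebook reads from the s3credentials response.
REQUIRED_KEYS = ("accessKeyId", "secretAccessKey", "sessionToken")


def parse_s3_credentials(text):
    """Parse the pasted credential JSON and return s3fs-style keyword arguments.

    Hypothetical helper: raises KeyError if any expected field is missing.
    """
    creds = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError(f"credential response is missing: {', '.join(missing)}")
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
    }


# Placeholder values for illustration only; these are not real credentials.
example = (
    '{"accessKeyId": "AKIA_EXAMPLE", '
    '"secretAccessKey": "secret_example", '
    '"sessionToken": "token_example"}'
)
s3_kwargs = parse_s3_credentials(example)
print(sorted(s3_kwargs))
```

With real credentials, `s3fs.S3FileSystem(**s3_kwargs)` would be an alternative to setting the environment variables, and it keeps the credentials scoped to one filesystem object.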

# Location of data in the PO.DAAC S3 Archive
We need to determine the path for our products of interest. We can do this through several mechanisms.

## Finding S3 Location information from the PO.DAAC Portal
The easiest way is through the PO.DAAC Cloud Dataset Listing page: https://podaac.jpl.nasa.gov/cloud-datasets

![S3 Data Locations from Portal](img/S3_data_locations_from_portal.png)

For each dataset, the 'Data Access' tab contains various information, but it will always list the S3 paths specifically. Data files will *always* be found under the 'protected' bucket.

## Finding S3 Location from Earthdata Search

From the Earthdata Search Client (search.earthdata.nasa.gov), collection level information can be found by clicking the 'i' on a collection search result. An example of this is seen below:

![S3 Data Locations from Search 1](img/S3_data_locations_from_search_1.png)


Once on the collection information screen, the S3 bucket locations can be found by scrolling to the bottom of the information panel. The SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1 example is shown below.

![S3 Data Locations from Search 2](img/S3_data_locations_from_search_2.png)



## Finding S3 Location from CMR

One can query the collection identifier to get information from CMR:

```
https://cmr.earthdata.nasa.gov/search/concepts/C2152045877-POCLOUD.umm_json
```

The identifier is listed as the 'Collection Concept ID' in each entry on the PO.DAAC [Cloud Data Set Listing](https://podaac.jpl.nasa.gov/cloud-datasets) page.

Results returned will look like the following:

```json
{
    ...
    "DirectDistributionInformation": {
        "Region": "us-west-2",
        "S3BucketAndObjectPrefixNames": [
            "podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/",
            "podaac-ops-cumulus-public/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/"
        ],
        "S3CredentialsAPIEndpoint": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
        "S3CredentialsAPIDocumentationURL": "https://archive.podaac.earthdata.nasa.gov/s3credentialsREADME"
    },
    ...
}
```
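
As a minimal sketch of how a response like this could be used programmatically, the snippet below pulls the S3 prefixes out of an already-parsed record and keeps the one in the 'protected' bucket. The helper name `s3_prefixes` and its `bucket_hint` parameter are assumptions for illustration, and the sample contains only the fields shown above, not the full CMR record.

```python
import json


def s3_prefixes(umm_json, bucket_hint="protected"):
    """Return S3 prefixes whose bucket name contains bucket_hint (hypothetical helper)."""
    info = umm_json.get("DirectDistributionInformation", {})
    prefixes = info.get("S3BucketAndObjectPrefixNames", [])
    # The bucket name is the part of the prefix before the first '/'.
    return [p for p in prefixes if bucket_hint in p.split("/")[0]]


# Trimmed copy of the fields shown in the response above.
sample = json.loads("""
{
    "DirectDistributionInformation": {
        "Region": "us-west-2",
        "S3BucketAndObjectPrefixNames": [
            "podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/",
            "podaac-ops-cumulus-public/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/"
        ]
    }
}
""")

protected = s3_prefixes(sample)
print(protected)
```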



# Now that we have the bucket location...

It's time to find our data! Below we use a 'glob' to find file names matching a pattern. Here, we want any files matching the pattern used below; in science terms, this equates to Cycle 001 and the first 10 passes. This information can be gleaned from product description documents. Another way to find specific data files would be to search on cycle/pass in CMR and use the S3 links provided in the resulting metadata directly instead of doing a glob (essentially an 'ls').

In [None]:
s3path = 's3://podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/SWOT_L2_LR_SSH_Expert_001_00*.nc'
remote_files = s3.glob(s3path)

In [None]:
remote_files
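
To see how the wildcard narrows the listing, here is a small sketch that applies the same pattern to a handful of made-up file names using `fnmatch` from the standard library. It is only an approximation of the S3 glob: the file names are hypothetical (real granule names carry additional fields), and `fnmatch` is a stand-in rather than the matcher s3fs uses.

```python
from fnmatch import fnmatch

# Same file-name pattern as in the glob above, without the bucket prefix.
pattern = "SWOT_L2_LR_SSH_Expert_001_00*.nc"

# Hypothetical names: cycle number first, then pass number.
candidates = [
    "SWOT_L2_LR_SSH_Expert_001_001_example.nc",  # cycle 001, pass 001
    "SWOT_L2_LR_SSH_Expert_001_009_example.nc",  # cycle 001, pass 009
    "SWOT_L2_LR_SSH_Expert_001_010_example.nc",  # cycle 001, pass 010
    "SWOT_L2_LR_SSH_Expert_002_001_example.nc",  # cycle 002, pass 001
]

matches = [name for name in candidates if fnmatch(name, pattern)]
for name in matches:
    print(name)
```

Only names whose cycle field is `001` and whose pass field begins with `00` survive the filter, which is why a tighter or looser pattern directly changes how many files are opened later.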

# Traditional Access - get these files from S3 and store them on your running instance

Here we will leverage the speed of transferring data within the cloud to our running instance (this notebook!). We will download 10 files into the 'DEMO_FILES' directory to demonstrate both cloud and traditional access.

In [None]:
%%time
for f in remote_files:
    s3.download(f, "DEMO_FILES/" + os.path.basename(f))
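
The loop above builds each local path by joining the target directory with the remote file's base name. A minimal sketch of that mapping, with the directory created up front, might look like the following. The helper name `plan_downloads` and the file names are hypothetical, and a temporary directory stands in for the notebook's working directory.

```python
import os
import tempfile


def plan_downloads(remote_files, dest_dir):
    """Create dest_dir if needed and pair each remote path with a local target path."""
    os.makedirs(dest_dir, exist_ok=True)
    return [(r, os.path.join(dest_dir, os.path.basename(r))) for r in remote_files]


# Hypothetical remote paths in the same layout that s3.glob returns.
remote_examples = [
    "podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/example_001.nc",
    "podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/example_002.nc",
]

dest = os.path.join(tempfile.mkdtemp(), "DEMO_FILES")
plan = plan_downloads(remote_examples, dest)
for remote, local in plan:
    print(remote, "->", local)
```

Each `(remote, local)` pair could then be passed to `s3.download(remote, local)`, and printing the plan first makes it easy to confirm where files will land.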

In [None]:
%%time
ds = xr.open_mfdataset("DEMO_FILES/*.nc", combine='nested', concat_dim="num_lines", decode_times=False)

In [None]:
ds.ssh_karin

Now let's plot these 10 files in a chosen projection.

In [None]:
plt.figure(figsize=(21, 12))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ds.ssh_karin.plot.pcolormesh(
 ax=ax, transform=ccrs.PlateCarree(), x="longitude", y="latitude", add_colorbar=False
)
ax.coastlines()

#ds.ssh_karin.plot()


# Access Files without any Downloads to your disk

We can also make the same plot without 'downloading' the data to our disk first. Let's try accessing the data from S3 directly through xarray.

In [None]:
s3path = 's3://podaac-ops-cumulus-protected/SWOT_SIMULATED_L2_KARIN_SSH_GLORYS_SCIENCE_V1/SWOT_L2_LR_SSH_Expert_001_00*.nc'
remote_files = s3.glob(s3path)

In [None]:
fileset = [s3.open(file) for file in remote_files]

In [None]:
%%time
data = xr.open_mfdataset(fileset, engine='h5netcdf', combine='nested', concat_dim="num_lines", decode_times=False)

In [None]:
data.ssha_karin

In [None]:
plt.figure(figsize=(21, 12))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
data.ssha_karin.plot.pcolormesh(
 ax=ax, transform=ccrs.PlateCarree(), x="longitude", y="latitude", add_colorbar=False
)
ax.coastlines()

## A final word...

The performance of accessing data entirely from S3 and in memory is affected by several factors.

1. The format of the data: archive formats like NetCDF, GeoTIFF, and HDF versus cloud-optimized data structures (Zarr, kerchunk, COG). Cloud-optimized formats are designed for accessing only the pieces of data needed at the time of the request (e.g. a subset or a single time step).
2. The internal structure of the data. Tools like xarray make a lot of assumptions about how to open and read a file. Sometimes the internals don't fit the xarray 'mold' and we need to continue to work with data providers and software providers to make these two sides work together.
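
To make the first point concrete, here is a rough back-of-the-envelope sketch of why chunked layouts help: a subset request only needs the chunks it overlaps. The array shape, chunk shape, and selection are made-up numbers chosen for illustration, not properties of the SWOT product, and `chunks_touched` is a hypothetical helper.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Count chunks overlapped by a rectangular selection.

    selection holds one half-open (start, stop) index pair per dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of the first chunk touched
        last = (stop - 1) // size      # index of the last chunk touched
        count *= last - first + 1
    return count


# Hypothetical array of 10000 x 70 values stored in 1000 x 70 chunks.
shape = (10000, 70)
chunk_shape = (1000, 70)

total_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))
subset_chunks = chunks_touched([(2500, 3200), (0, 70)], chunk_shape)

print(f"{subset_chunks} of {total_chunks} chunks needed")
```

Under these assumptions the request touches only a fraction of the stored chunks, whereas a format without a cloud-friendly layout may require many more reads to serve the same subset.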