# Retrieving ECCO datasets when working in the AWS Cloud

## Introduction
Previous tutorials have discussed how to download ECCO datasets from PO.DAAC to your local machine. However, during 2021-2022 PO.DAAC datasets (including ECCO) migrated to the NASA Earthdata Cloud hosted by Amazon Web Services (AWS). While data downloads from the cloud (using wget, curl, Python requests, etc.) function like downloads from any other website, there are definite advantages to working with datasets within the cloud environment. Data can be opened in an S3 bucket and viewed without downloading, or can be quickly downloaded to a user's cloud instance for computations. For more information on PO.DAAC datasets in the cloud, there are [a number of infographics here](https://podaac.jpl.nasa.gov/cloud-datasets/about).

This tutorial will walk through the steps of how a user can set up an AWS instance and access ECCO datasets for computations in that instance. No prior experience with the AWS cloud is assumed.

## Set up an AWS cloud instance
Computations in the AWS Cloud are typically done in an Amazon Elastic Compute Cloud (EC2) instance, which is a self-contained computing environment like the operating system on your local machine. You start an EC2 instance through the AWS Cloud management console, then connect to the instance (e.g., through `ssh`) like you would to any other machine, install the software that you need, and start working. AWS offers a [Free Tier](https://aws.amazon.com/ec2/?did=ft_card&trk=ft_card) that provides 750 compute hours per month, 1 GB memory, and up to 30 GB storage for a period of 12 months. This is great for experimenting, but these memory/storage limits can be quite restrictive, so if you have institutional or project support for larger instances it is highly recommended to use them.

The steps here mostly follow this [excellent tutorial on the PO.DAAC Cookbook](https://podaac.github.io/tutorials/external/July_2022_Earthdata_Webinar.html). 

### Step 1: Create an AWS account
If you don't already have one, [create an account on AWS](https://portal.aws.amazon.com/billing/signup#/start/email). Anyone with an e-mail address and a credit card can do this, though for the reasons above it is recommended that you seek out institutional support if possible, or include cloud computing costs in your grant proposals.

### Step 2: Start an EC2 instance
Once you log in to your new account, you should be at a screen with the title Console Home. First, make sure you are in the best AWS "region" for accessing PO.DAAC datasets, which are hosted in region *us-west-2 (Oregon)*. In the upper-right corner of the page, just to the left of your username, there is a drop-down menu with a place name on it. Select the **US West (Oregon) us-west-2** region.

Now let's start a new EC2 instance. Click on **Services** in the upper-left corner next to the AWS logo, then **Compute** --> **EC2**. On this new screen where there are a number of boxes, select the yellow **Launch instance** button. There are some settings on this screen to configure before launching the new instance:

*Name and tags*: Whatever you want (e.g., ECCO tutorials).

*Application and OS images (Amazon Machine Image)*: **Quick Start** --> **Red Hat** --> **Red Hat Enterprise Linux 9, SSD Volume Type**
This is not the only AMI you can use, and your institution may have preferred or required AMIs to use AWS cloud services. Make sure that the AMI you select runs Linux, and is "Free tier eligible" if you are not supported by your institution or project.

*Instance type*: **t2.micro** if using the Free tier. If you're not restricted to the free tier, **t2.medium** or larger is recommended.

*Key pair (login)*: Click on **Create new key pair**. In the pop-up window, make the name whatever you want (e.g., aws_ec2_jupyter), select *Key pair type*: **RSA** and *Private key file format*: **.pem**, then **Create key pair**. This downloads the private key file to your Downloads folder, and you should move it to your `.ssh` folder: `mv ~/Downloads/aws_ec2_jupyter.pem ~/.ssh/`. Then change the permissions to read-only for the file owner: `chmod 400 ~/.ssh/aws_ec2_jupyter.pem`.
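For readers who prefer to script this step, the move and permission change can be sketched in Python with only the standard library. The function name `install_private_key` is hypothetical and not part of any ECCO or AWS tooling; the sketch assumes a POSIX system where file modes behave like `chmod`.

```python
import os
import shutil
import stat
from pathlib import Path


def install_private_key(downloaded_key: Path, ssh_dir: Path) -> Path:
    """Move a downloaded .pem key into ssh_dir and make it owner read-only.

    Hypothetical helper: mirrors the `mv` and `chmod 400` commands above.
    """
    # Create the .ssh directory if it is missing (owner-only access).
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    target = ssh_dir / downloaded_key.name
    shutil.move(str(downloaded_key), str(target))
    # S_IRUSR alone is mode 0o400: readable by the owner, nothing else.
    os.chmod(target, stat.S_IRUSR)
    return target


# Example call (paths match the ones used in this tutorial):
# install_private_key(Path.home() / "Downloads" / "aws_ec2_jupyter.pem",
#                     Path.home() / ".ssh")
```

`ssh` refuses private keys that are readable by other users, which is why the permission change matters.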

*Network settings*: Your institution may have existing security groups that you should use, so click **Select existing security group** and check with your IT or cloud support to see if there are recommended security groups/VPCs to use. If there are none, or if you are doing this on your own, click **Create security group**, which will create a new security group with a name like *launch-wizard-1*. Make sure that the boxes to allow HTTPS and HTTP traffic from the internet are checked.

*Configure storage*: Specify a storage volume with at least **15 GiB gp3** as your root volume. This is important, since the Python/conda installation with the packages we need will occupy ~7.5 GB, and we need some workspace as a buffer. If you are in the Free Tier, you can request up to 30 GB across all your instances, so you can use the full amount in a single instance or split it across two instances with 15 GB each.
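Once the instance is running, a quick way to confirm that the root volume has room for the installation is to query free space before installing anything. The following is a minimal sketch using the standard library; the helper names and the 7.5 threshold (taken from the approximate figure above) are illustrative assumptions, not part of the tutorial's setup script.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_space_gib(path: str = "/") -> float:
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / GIB


def has_room_for_install(path: str = "/", required_gib: float = 7.5) -> bool:
    """Check whether at least `required_gib` is free (threshold is approximate)."""
    return free_space_gib(path) >= required_gib


print(f"Free space on root volume: {free_space_gib('/'):.1f} GiB")
```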

*Advanced details*: Depending on your security/institutional requirements, you may need to include a specific IAM profile. Check the *IAM instance profile* dropdown menu to see if there is one associated with your security group.

Finally, at the bottom-right of the page click the yellow **Launch instance** button. Wait a minute or two for the instance to initialize; you can check the **Instances** screen accessed from the menu on the left side to see that your Instance state is **Running**.

### Step 3: Install software and set up conda environment

Since your instance starts with a very bare-bones Linux OS, you will need to install software (conda/miniconda/miniforge) to run Python, and then install Python packages and the Jupyter interface to run these tutorial notebooks. A shell script to expedite this process is provided on the tutorial GitHub page, and here we will walk through setting this up.

First, ssh into your new instance. For most users this will be at the public IPv4 address on the AWS instance summary page, e.g., if the IP address is 35.24.135.171, then: 

```
ssh -i "~/.ssh/aws_ec2_jupyter.pem" ec2-user@35.24.135.171 -L 9889:localhost:9889
```

Some users with an institutional network or VPN might use the private IP address instead. The `-L` option sets up a tunnel from the local machine's port 9889 to the instance's port 9889; this will be used later to open JupyterLab through your local machine's web browser.
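To make the pieces of the `ssh` command explicit, here is a small sketch that assembles it from its parts. The function `build_ssh_command` is hypothetical; the default user `ec2-user` and port 9889 are simply the values used in this tutorial, and your AMI or institution may require different ones.

```python
import os
import shlex


def build_ssh_command(key_path: str, host: str,
                      user: str = "ec2-user", port: int = 9889) -> list:
    """Assemble an ssh command that also forwards `port` from local to remote."""
    # -L local_port:destination_host:destination_port, where the destination
    # is resolved on the instance, so "localhost" means the instance itself.
    forward = f"{port}:localhost:{port}"
    return ["ssh", "-i", os.path.expanduser(key_path),
            "-L", forward, f"{user}@{host}"]


cmd = build_ssh_command("~/.ssh/aws_ec2_jupyter.pem", "35.24.135.171")
print(shlex.join(cmd))  # a copy-pasteable command line
```

The resulting list could also be passed to `subprocess.run(cmd)` to open the connection from a script.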

> Tip: If you are having difficulty connecting to your new instance, you might need to change your network/security group settings to allow SSH traffic from your local machine and/or HTTPS/HTTP traffic. Alternatively, you may need to [attach an IAM role to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html).

Now we will update the OS on the instance and install `git` so that we can clone the [ECCOv4 Python tutorial repository](https://github.com/ECCO-GROUP/ECCO-v4-Python-Tutorial). This repository includes a shell script that can be used to finish setting up our workspace. First run the following commands on the instance:

```
sudo dnf update -y
sudo dnf install git -y
cd ~
git clone https://github.com/ECCO-GROUP/ECCO-v4-Python-Tutorial.git
```

Now we will execute a shell script that sets up a conda environment called `jupyter`, prompts for your NASA Earthdata username and password (which are written to the `~/.netrc` file on the instance), and opens JupyterLab on the instance.

```
sudo chmod 755 ~/ECCO-v4-Python-Tutorial/ECCO-ACCESS/Cloud_access_to_ECCO_datasets/jupyter_env_setup.sh
~/ECCO-v4-Python-Tutorial/ECCO-ACCESS/Cloud_access_to_ECCO_datasets/jupyter_env_setup.sh
```
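If you want to check that the Earthdata credentials were stored in a usable form, Python's standard `netrc` module can parse the file. The sketch below is not what `jupyter_env_setup.sh` does internally; it assumes the usual one-line entry for the Earthdata Login host `urs.earthdata.nasa.gov`, and the helper names are hypothetical. The writer overwrites its target, so point it at a scratch path rather than a `~/.netrc` that holds other entries.

```python
import netrc
import os
from pathlib import Path

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # NASA Earthdata Login host


def write_netrc_entry(netrc_path: Path, username: str, password: str) -> None:
    """Write a single Earthdata entry to `netrc_path`, replacing its contents.

    Assumes the username and password contain no whitespace.
    """
    entry = f"machine {EARTHDATA_HOST} login {username} password {password}\n"
    # Create the file readable and writable by the owner only.
    fd = os.open(netrc_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(entry)
    os.chmod(netrc_path, 0o600)  # also tighten a pre-existing file


def read_netrc_login(netrc_path: Path):
    """Return the stored Earthdata username, or None if there is no entry."""
    auth = netrc.netrc(str(netrc_path)).authenticators(EARTHDATA_HOST)
    return None if auth is None else auth[0]


# Example: read_netrc_login(Path.home() / ".netrc")
```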

The script takes a few minutes to run, but it should set up our environment with the packages we need, even within the memory constraints of a free-tier t2.micro instance. After this is done (and while still connected to your instance through port 9889), open a window in your local machine's web browser and enter ``http://127.0.0.1:9889/`` in the URL field. JupyterLab should open in the ECCOv4 tutorial notebook directory, with notebooks ready to run!
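If the browser page does not load, it can help to check from your local machine whether anything is listening on the forwarded port. This is a minimal sketch with the standard `socket` module; `port_is_open` is a hypothetical helper, and a `True` result only shows that the port accepts connections, not that JupyterLab itself is healthy.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 9889,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("Tunnel reachable:", port_is_open())
```

A `False` result usually means the `ssh` session with the `-L` option has closed or JupyterLab is not running on the instance.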


## The *ecco_s3_retrieve* module

In [1]:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

from ecco_s3_retrieve import *
import time

## Method 2: Open using 2 processes and threads

In [2]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:39333")
client

Client

| Field | Value |
|---|---|
| Connection method | Direct |
| Dashboard | http://127.0.0.1:8787/status |

Scheduler info

| Field | Value |
|---|---|
| Comm | tcp://127.0.0.1:39333 |
| Dashboard | http://127.0.0.1:8787/status |
| Workers | 4 |
| Total threads | 4 |
| Total memory | 8.00 GiB |
| Started | 1 hour ago |

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Memory usage | CPU usage | Local directory |
|---|---|---|---|---|---|---|---|
| tcp://127.0.0.1:37015 | http://127.0.0.1:42753/status | tcp://127.0.0.1:45677 | 1 | 2.00 GiB | 432.72 MiB | 4.0% | /tmp/dask-worker-space/worker-rr1w54uz |
| tcp://127.0.0.1:38459 | http://127.0.0.1:40347/status | tcp://127.0.0.1:35667 | 1 | 2.00 GiB | 410.22 MiB | 4.0% | /tmp/dask-worker-space/worker-maqkhk6x |
| tcp://127.0.0.1:46421 | http://127.0.0.1:36767/status | tcp://127.0.0.1:44159 | 1 | 2.00 GiB | 411.42 MiB | 2.0% | /tmp/dask-worker-space/worker-3p7evg9_ |
| tcp://127.0.0.1:38853 | http://127.0.0.1:42793/status | tcp://127.0.0.1:42473 | 1 | 2.00 GiB | 409.37 MiB | 4.0% | /tmp/dask-worker-space/worker-wi9rnpwk |


In [3]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import s3fs

In [4]:
%%time
file_list = ecco_podaac_s3_open(ShortName="ECCO_L4_TEMP_SALINITY_LLC0090GRID_MONTHLY_V4R4",
                                StartDate="2010-01", EndDate="2010-12")

{'ShortName': 'ECCO_L4_TEMP_SALINITY_LLC0090GRID_MONTHLY_V4R4', 'temporal': '2010-01-02,2010-12-31'}

Total number of matching granules: 12
CPU times: user 229 ms, sys: 21.7 ms, total: 251 ms
Wall time: 3.21 s
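The printed query above shows that the `"2010-01"` and `"2010-12"` inputs were turned into the temporal range `2010-01-02,2010-12-31`. The sketch below reproduces that mapping as an illustration only: `monthly_temporal_range` is a hypothetical helper, and the rule it applies (start one day after the first of the month, end on the last day of the end month) is inferred from this single printed output, not taken from the `ecco_s3_retrieve` source.

```python
import calendar
from datetime import date, timedelta


def monthly_temporal_range(start_month: str, end_month: str) -> str:
    """Map 'YYYY-MM' start/end strings to a 'start,end' date range string.

    Assumed rule, inferred from the printed query: the start date is the
    second day of the start month, the end date is the last day of the
    end month.
    """
    start_year, start_mon = (int(part) for part in start_month.split("-"))
    end_year, end_mon = (int(part) for part in end_month.split("-"))
    start = date(start_year, start_mon, 1) + timedelta(days=1)
    last_day = calendar.monthrange(end_year, end_mon)[1]
    end = date(end_year, end_mon, last_day)
    return f"{start.isoformat()},{end.isoformat()}"


print(monthly_temporal_range("2010-01", "2010-12"))  # 2010-01-02,2010-12-31
```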


In [None]:
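# Open all granules lazily as a single dataset.
ds = xr.open_mfdataset(file_list, engine='h5netcdf',
                       # only concatenate variables that have the concat dimension
                       data_vars='minimal', coords='minimal',
                       # skip equality checks; take other variables from the first file
                       compat='override',
                       parallel=True,    # open the files in parallel with dask
                       decode_cf=False)  # skip CF decoding to speed up opening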

In [None]:
%%time
grid_file = ecco_podaac_s3_open(ShortName="ECCO_L4_GEOMETRY_LLC0090GRID_V4R4",
                                StartDate="1992-01", EndDate="2017-12")
ds_grid = xr.open_dataset(grid_file)
ds_grid

In [None]:
%%time
cell_vol = ds_grid.hFacC*ds_grid.rA*ds_grid.drF
cell_vol = cell_vol.compute()

In [None]:
%%time
total_vol = cell_vol.sum().compute()
theta_global_mean = (cell_vol*ds.THETA).sum(dim=["k","tile","j","i"])/\
                        total_vol
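
For reference, the two cells above amount to a volume-weighted average of `THETA` at each time. Written out with the variable names from the code (the subscripts on each grid variable are an assumption about its dimensions), the calculation is roughly:

```latex
V_{k,\mathrm{tile},j,i} = \mathrm{hFacC}_{k,\mathrm{tile},j,i}\;
                          \mathrm{rA}_{\mathrm{tile},j,i}\;
                          \mathrm{drF}_{k},
\qquad
\overline{\theta}(t) =
  \frac{\displaystyle\sum_{k,\mathrm{tile},j,i} V_{k,\mathrm{tile},j,i}\,
        \theta_{k,\mathrm{tile},j,i}(t)}
       {\displaystyle\sum_{k,\mathrm{tile},j,i} V_{k,\mathrm{tile},j,i}}
```

Here $V$ is `cell_vol`, the denominator is `total_vol`, and $\overline{\theta}(t)$ is `theta_global_mean`. Cells with `hFacC` equal to zero have zero volume, so they add nothing to the denominator.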

In [None]:
%%time
theta_global_mean = theta_global_mean.compute()
theta_global_mean.plot()

In [None]:
client.close()