# Assessing performance of various methods to open NASA Earthdata data


## Summary

This notebook started with, and currently still uses, TEMPO Level-3 data for its test cases. For access patterns, we begin by focusing on the relatively new `earthaccess.open_virtual_mfdataset()` function. Further discussion of this effort can be found in [this `earthaccess` GitHub discussion](https://github.com/nsidc/earthaccess/discussions/987).

## Prerequisites

- **AWS US-West-2 Environment:** This tutorial is designed to run on an AWS cloud compute instance in the us-west-2 region. It should also work from a laptop or workstation, but without the speed benefits of in-cloud access.

- **Earthdata Account:** A (free!) Earthdata Login account is required to access data from the NASA Earthdata system. Before requesting TEMPO data, we first need to set up our Earthdata Login authentication, as described in the Earthdata Cookbook's [earthaccess tutorial (link)](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/tutorials/earthaccess-demo.html).

- **Packages:**

  - `cartopy`
  - `dask`
  - `earthaccess` **version 0.14.0 or greater**
  - `matplotlib`
  - `numpy`
  - `xarray`

# Test Report (so far)


## Approach for Initial Test Case(s)
---

#### Data

The tests in this notebook leverage data from the Nitrogen Dioxide ($NO_2$) Level-3 data collection of the [TEMPO air quality mission (link)](https://asdc.larc.nasa.gov/project/TEMPO). TEMPO Level-3 data are stored in granules that are each ~500 MB. Thus, a year's worth of data is about $500*4867/1024/1024 = 2.3 \text{ TB}$.
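
As a quick check of the arithmetic above, here is a minimal sketch that assumes roughly 500 MB per granule and 4,867 granules (both figures come from the text; the variable names are illustrative):

```python
# Rough estimate of one year's data volume for TEMPO NO2 Level-3.
# Both inputs are the approximate figures quoted above, not measured values.
granule_size_mb = 500  # approximate size of one granule, in MB
n_granules = 4867      # granules assumed in the year-long scenario

total_tb = granule_size_mb * n_granules / 1024 / 1024  # MB -> GB -> TB
print(f"Estimated volume: {total_tb:.1f} TB")
```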

#### Methods

We use multiple functions, including the relatively new `earthaccess.open_virtual_mfdataset()` function, to open, stream, or download the data and/or metadata from granules such that we can then calculate means for a subset of the data and visualize the results.

Test cases utilize the following functions/function chains:
- Single granule cases:
  - **Case 1:** `earthaccess.open_virtual_dataset()`
  - **Case 2:** `xr.open_dataset(earthaccess.open())`
- Multi-granule cases:
  - **Case 3:** `earthaccess.open_virtual_mfdataset()`
  - **Case 4:** `earthaccess.download()`

#### Benchmarking

To facilitate comparisons, times measured for opening one file are extrapolated to an estimate of the time needed to open all 4,867 TEMPO Level-3 granules in the 2024–2025 "year-long" analysis scenario, assuming the work is spread across the Openscapes Hub's default of 4 CPUs. Where appropriate, times are also converted to more readable units. For example, for Cases **1** and **2**, the following equation is used to convert from time (in seconds) for opening one file ($x$) to estimated time (in hours) for all granules in a year-long scenario ($y$):
$$
  x \text{ time (s) for one granule} * 
  \frac{4867 \text{ granules}}{1 \text{ granule}} * 
  \frac{1 \text{ min}}{60 \text{ s}} * 
  \frac{1}{4 \text{ CPUs}} * 
  \frac{1 \text{ hr}}{60 \text{ min}} =
  y \text{ time (hr) for all granules per CPU}
$$
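
The conversion above can be sketched in a few lines of Python. The function name below is illustrative; the notebook's own helper in the Setup section does the same arithmetic with NumPy. The two inputs are the run 1 wall times reported for Cases 1 and 2 in the Results tables.

```python
N_GRANULES = 4867  # granules in the year-long scenario
N_CPUS = 4         # assumed number of CPUs sharing the work


def one_granule_seconds_to_all_granules_hours(seconds: float) -> float:
    """Extrapolate a single-granule time (s) to all granules (hr), split across CPUs."""
    return seconds * N_GRANULES / N_CPUS / 3600


case1_run1_hr = one_granule_seconds_to_all_granules_hours(0.235)  # Case 1, run 1: 235 ms
case2_run1_hr = one_granule_seconds_to_all_granules_hours(17.5)   # Case 2, run 1: 17.5 s
print(f"{case1_run1_hr:.2f} hr, {case2_run1_hr:.1f} hr")
```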

For Case **3**, the following equation is used to estimate a time (in milliseconds) for a single granule ($y$) from the time (in seconds) it takes to open all granules ($x$):

$$
  x \text{ time (s) for all granules} * 
  \frac{1 \text{ granule}}{4867 \text{ granules}} * 
  \frac{1000 \text{ ms}}{1 \text{ s}} =
  y \text{ time (ms) for one granule}
$$
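
As a worked instance of this conversion, the sketch below uses the Case 3, run 1 wall time of 245 s from the Results table. The function name is illustrative and not part of the notebook.

```python
N_GRANULES = 4867  # granules in the year-long scenario


def all_granules_seconds_to_one_granule_ms(seconds: float) -> float:
    """Convert a time for all granules (s) to an average time per granule (ms)."""
    return seconds / N_GRANULES * 1000


per_granule_ms = all_granules_seconds_to_one_granule_ms(245)  # Case 3, run 1: 245 s
print(f"{per_granule_ms:.0f} ms per granule")
```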

For Case **4**, the following equation is used to estimate time (in hours) for all granules in a year-long scenario ($y$) from the time (in seconds) it takes to download 10 granules ($x$):

$$
  x \text{ time (s) for 10 granules} * 
  \frac{4867 \text{ granules}}{10 \text{ granules}} * 
  \frac{1 \text{ min}}{60 \text{ s}} *
  \frac{1 \text{ hr}}{60 \text{ min}} =
  y \text{ time (hr) for all granules}
$$
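
The same extrapolation in Python, using the Case 4, run 1 wall time of 178 s for 10 granules from the Results table. The function name and the `sample_size` parameter are illustrative. Unlike Cases 1 and 2, no division by CPU count is applied here.

```python
N_GRANULES = 4867  # granules in the year-long scenario


def sample_seconds_to_all_granules_hours(seconds: float, sample_size: int = 10) -> float:
    """Scale a time for `sample_size` granules (s) up to all granules (hr)."""
    return seconds * N_GRANULES / sample_size / 3600


download_all_hr = sample_seconds_to_all_granules_hours(178)  # Case 4, run 1: 178 s
print(f"{download_all_hr:.0f} hr")
```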

## Results
---
 
### Case 1 – Opening as virtual dataset – using `earthaccess.open_virtual_dataset()`

with 
```python
open_options = {
    "access": "direct",
    "load": True
}
```

| Run          | Wall time  | Wall time extrapolated to all granules | CPU time    | CPU time extrapolated to all granules |
| :-           | :--------- | :-                                     | :-          | :-                                    |
| 1            | 235 ms     | 0.08 hr = 4.8 min                      | 46.5 ms     | 0.016 hr = 56 s                       |
| 2            | 881 ms     | 0.30 hr = 17.9 min                     | 54.9 ms     | 0.019 hr = 67 s                       |
| 3            | 281 ms     | 0.09 hr = 5.7 min                      | 56.0 ms     | 0.019 hr = 68 s                       |
| 4            | 249 ms     | 0.08 hr = 5.0 min                      | 53.7 ms     | 0.018 hr = 65 s                       |
| **Average:** | **411 ms** | **0.14 hr = 8.4 min**                  | **52.8 ms** | **0.018 hr = 64 s**                   |

### Case 2 – Streaming the data – using `xr.open_dataset(earthaccess.open([results[0]])[0])`

| Run          | Wall time  | Wall time extrapolated to all granules | CPU time   | CPU time extrapolated to all granules |
| :-           | :--------- | :-                                     | :-         | :-                                    |
| 1            | 17.5 s     | 5.9 hr                                 | 6.11 s     | 2.1 hr                                |
| 2            | 12.3 s     | 4.2 hr                                 | 5.16 s     | 1.7 hr                                |
| 3            | 12.4 s     | 4.2 hr                                 | 5.22 s     | 1.8 hr                                |
| 4            | 12.5 s     | 4.2 hr                                 | 5.08 s     | 1.7 hr                                |
| **Average:** | **13.7 s** | **4.6 hr**                             | **5.39 s** | **1.8 hr**                            |


### Case 3 – Opening as virtual dataset – using `earthaccess.open_virtual_mfdataset()`

with
```python
open_options = {
    "access": "direct",
    "load": True,
    "concat_dim": "time",
    "coords": "minimal",
    "compat": "override",
    "join": "override",
    "combine_attrs": "override",
    "parallel": True,
}
```
Note that these times are for opening only the "root" group of each netCDF file.

| Run          | Wall time all granules | Wall time estimate for one granule | CPU time all granules  |
| :-           | :---------             | :-                                 | :-                     |
| 1            | 245 s                  | 50 ms                              | 85 s                   |
| 2            | 238 s                  | 49 ms                              | 84 s                   |
| 3            | 235 s                  | 48 ms                              | 85 s                   |
| 4            | 231 s                  | 47 ms                              | 84 s                   |
| **Average:** | **237 s**              | **48.5 ms**                        | **84.5 s**             |


### Case 4 – Downloading – using `earthaccess.download(results[0:10], local_path="/tmp/")`

| Run          | Wall time 10 granules | Wall time estimate for all granules | CPU time 10 granules  |
| :-           | :---------             | :-                                 | :-                     |
| 1            | 178 s                  | 24 hr                              | 19 s                   |
| 2            | 186 s                  | 25 hr                              | 26 s                   |
| 3            | 160 s                  | 22 hr                              | 21 s                   |
| 4            | 172 s                  | 23 hr                              | 32 s                   |
| **Average:** | **174 s**              | **23.5 hr**                        | **24.5 s**             |
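
To compare the four cases on one scale, the averages from the tables above can be expressed in hours for all 4,867 granules. The sketch below only restates those averages (the dictionary and its labels are illustrative). Keep in mind that the Case 1 and Case 2 figures assume the work is split across 4 CPUs, while the Case 4 figure is extrapolated from a 10-granule download.

```python
# Average wall-time estimates for all granules, taken from the tables above.
all_granules_hours = {
    "Case 1: open_virtual_dataset": 0.14,
    "Case 2: xr.open_dataset(earthaccess.open())": 4.6,
    "Case 3: open_virtual_mfdataset": 237 / 3600,  # measured 237 s for all granules
    "Case 4: download": 23.5,
}

# List the cases from the smallest to the largest estimate.
for label, hours in sorted(all_granules_hours.items(), key=lambda item: item[1]):
    print(f"{label:<45} {hours:6.2f} hr")
```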

# Setup

In [34]:
import cartopy.crs as ccrs
import earthaccess
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
from matplotlib import rcParams

%config InlineBackend.figure_format = 'jpeg'
rcParams["figure.dpi"] = (
    80  # Reduce figure resolution to keep the saved size of this notebook low.
)

pbar = ProgressBar()
pbar.register()  # Set the ProgressBar to indicate progress for the minutes-long open steps.

#### Methods used for calculating and converting results' timings

In [2]:
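# Helper used to fill in the extrapolated-time columns of the tables at the top of the notebook.
def granule_time_in_seconds_to_year_of_granules_time_in_hours(
    input_time: float | list[float],
) -> np.ndarray:
    """Extrapolate single-granule time(s) in seconds to hours for 4,867 granules on 4 CPUs."""
    # seconds -> all granules -> minutes -> split across 4 CPUs -> hours
    return np.asarray(input_time) * 4867 / 60 / 4 / 60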

In [None]:
# np.mean(granule_time_in_seconds_to_year_of_granules_time_in_hours([6.11, 5.16, 5.22, 5.08]))

In [None]:
# np.mean([6.11, 5.16, 5.22, 5.08])

## Login using the Earthdata Login

In [3]:
auth = earthaccess.login()  # earthaccess.system.UAT)

if not auth.authenticated:
    auth.login(
        strategy="interactive", persist=True
    )  # ask for credentials and persist them in a .netrc file

print(earthaccess.__version__)

0.14.0


# TEMPO $NO_2$ Level-3 Data Tests

## Search for data granules

We search for TEMPO Nitrogen Dioxide ($NO_2$) data for a year-long period between January 2024 and January 2025 (note: times are in UTC).

In [4]:
%%time
results = earthaccess.search_data(
    short_name="TEMPO_NO2_L3",
    version="V03",
    temporal=("2024-01-11 12:00", "2025-01-11 12:00"),
)
print(f"Number of results: {len(results)}")

Number of results: 4990
CPU times: user 821 ms, sys: 118 ms, total: 939 ms
Wall time: 14.8 s


## Opening a Single Granule

### Case 1

In [5]:
open_options = {"access": "direct", "load": True}

In [7]:
%%time
first_result_root = earthaccess.open_virtual_dataset(results[0], **open_options)

CPU times: user 33.2 ms, sys: 221 μs, total: 33.4 ms
Wall time: 219 ms


CPU times: user 46.5 ms, sys: 0 ns, total: 46.5 ms
Wall time: 235 ms

CPU times: user 53.3 ms, sys: 1.58 ms, total: 54.9 ms
Wall time: 881 ms

CPU times: user 56 ms, sys: 0 ns, total: 56 ms
Wall time: 281 ms

CPU times: user 53.7 ms, sys: 0 ns, total: 53.7 ms
Wall time: 249 ms

### Case 2

In [9]:
%%time
first_dataset = xr.open_dataset(earthaccess.open([results[0]])[0])

QUEUEING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 2.8 s, sys: 2.44 s, total: 5.25 s
Wall time: 12.5 s


CPU times: user 2.8 s, sys: 3.32 s, total: 6.11 s
Wall time: 17.5 s

CPU times: user 2.93 s, sys: 2.24 s, total: 5.16 s
Wall time: 12.3 s

CPU times: user 2.81 s, sys: 2.41 s, total: 5.22 s
Wall time: 12.4 s

CPU times: user 2.82 s, sys: 2.26 s, total: 5.08 s
Wall time: 12.5 s

## Opening Multiple Granules

### Case 3

Because TEMPO data are processed and archived in a netCDF4 format using a group hierarchy, we open each group and then merge them together.

Options to set before opening:
- `load=True` works
- `load=False` results in `KeyError: "no index found for coordinate 'longitude'"` because it creates `ManifestArray`s without indexes (see the [earthaccess documentation here (link)](https://github.com/nsidc/earthaccess/blob/7f5fe5d2e42343b6d7948338255cf9bb8cdb2775/earthaccess/dmrpp_zarr.py#L36C456-L36C502))

In [None]:
# from dask.distributed import LocalCluster, Client

# if "dask_client" not in locals():
#     # cluster = LocalCluster(threads_per_worker=1)
#     cluster = LocalCluster()
#     dask_client = Client(cluster)

# dask_client

In [10]:
open_options = {
    "access": "direct",
    "load": True,
    "concat_dim": "time",
    "coords": "minimal",
    "compat": "override",
    "join": "override",
    "combine_attrs": "override",
    "parallel": True,
}

Open root, product, and geolocation groups of the granules.

In [15]:
%%time
result_root = earthaccess.open_virtual_mfdataset(granules=results, **open_options)

[########################################] | 100% Completed | 138.63 s
CPU times: user 1min 21s, sys: 3.47 s, total: 1min 24s
Wall time: 3min 51s



##### with `load=True`, and running `open_virtual_mfdataset()` for the root group only


CPU times: user 1min 21s, sys: 3.55 s, total: 1min 25s
Wall time: 4min 5s

CPU times: user 1min 21s, sys: 3.32 s, total: 1min 24s
Wall time: 3min 58s

CPU times: user 1min 22s, sys: 3.43 s, total: 1min 25s
Wall time: 3min 55s

CPU times: user 1min 21s, sys: 3.47 s, total: 1min 24s
Wall time: 3min 51s
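
The `%%time` outputs above report wall time as minutes and seconds (for example `3min 51s`), while the Results tables list seconds. A small helper like the one below could do that conversion. The function name and regular expression are illustrative and are not part of the notebook or of IPython.

```python
import re


def wall_time_to_seconds(text: str) -> float:
    """Convert a %%time-style duration such as '3min 51s' or '881 ms' to seconds."""
    unit_seconds = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3}
    total = 0.0
    # Each match is a number followed by a known unit; 'min' and 'ms' are tried before 's'.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(min|ms|h|s)\b", text):
        total += float(value) * unit_seconds[unit]
    return total


case3_run4_s = wall_time_to_seconds("3min 51s")
print(case3_run4_s)
```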

### Case 4

In [36]:
%%time
earthaccess.download(results[0:10], local_path="/tmp/")

CPU times: user 19.9 s, sys: 12.2 s, total: 32.2 s
Wall time: 2min 52s


[PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T125625Z_S002.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T133630Z_S003.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T141635Z_S004.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T151635Z_S005.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T161635Z_S006.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T171635Z_S007.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T181635Z_S008.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T191635Z_S009.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T201635Z_S010.nc'),
 PosixPath('/tmp/TEMPO_NO2_L3_V03_20240111T211635Z_S011.nc')]

CPU times: user 11.1 s, sys: 8.26 s, total: 19.4 s
Wall time: 2min 58s

CPU times: user 15.5 s, sys: 10.7 s, total: 26.1 s
Wall time: 3min 6s

CPU times: user 11.6 s, sys: 9.36 s, total: 21 s
Wall time: 2min 40s

CPU times: user 19.9 s, sys: 12.2 s, total: 32.2 s
Wall time: 2min 52s

In [None]:
open_options = {
    "concat_dim": "time",
    "coords": "minimal",
    "compat": "override",
    "join": "override",
    "combine_attrs": "override",
    "combine": "nested",
    "parallel": True,
}

In [None]:
# %%time
# opened_datasets = xr.open_mfdataset(earthaccess.open(results), **open_options)

In [None]:
result_product = earthaccess.open_virtual_mfdataset(
    granules=results, group="product", **open_options
)
result_geolocation = earthaccess.open_virtual_mfdataset(
    granules=results, group="geolocation", **open_options
)

Merge root groups with subgroups.

In [None]:
%%time
result_merged = xr.merge([result_root, result_product, result_geolocation])

In [None]:
result_merged

# Subsetting, computing statistics, and plotting

## Temporal Mean - a map showing an annual average



In [None]:
%%time
temporal_mean_ds = (
    result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
    .where(result_merged["main_data_quality_flag"] == 0)
    .mean(dim="time")
)
temporal_mean_ds

In [None]:
%%time
mean_vertical_column_trop = temporal_mean_ds["vertical_column_troposphere"].compute()

In [None]:
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})

mean_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
    draw_labels=True,
    dms=True,
    x_inline=False,
    y_inline=False,
)

## Spatial mean - a time series of area averages

In [None]:
%%time
spatial_mean_ds = (
    result_merged.sel({"longitude": slice(-78, -74), "latitude": slice(35, 39)})
    .where(result_merged["main_data_quality_flag"] == 0)
    .mean(dim=("longitude", "latitude"))
)
spatial_mean_ds

In [None]:
%%time
spatial_mean_vertical_column_trop = spatial_mean_ds[
    "vertical_column_troposphere"
].compute()

In [None]:
spatial_mean_vertical_column_trop.plot()
plt.show()

## Single scan subset

In [None]:
%%time
subset_ds = result_merged.sel(
    {
        "longitude": slice(-78, -74),
        "latitude": slice(35, 39),
        "time": slice(
            np.datetime64("2024-01-11T13:00:00"), np.datetime64("2024-01-11T14:00:00")
        ),
    }
).where(result_merged["main_data_quality_flag"] == 0)
subset_ds

In [None]:
%%time
subset_vertical_column_trop = subset_ds["vertical_column_troposphere"].compute()

In [None]:
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})

subset_vertical_column_trop.squeeze().plot.contourf(ax=ax)
ax.coastlines()
ax.gridlines(
    draw_labels=True,
    dms=True,
    x_inline=False,
    y_inline=False,
)
plt.show()