# Cloud optimized access to NASA data with earthaccess and virtualizarr
This notebook shows how to use `earthaccess.open_virtual_dataset` and `earthaccess.open_virtual_mfdataset` to create cloud-optimized reference files for data stored in the cloud.

All of the examples in this tutorial load data over HTTPS (`access="indirect"`). However, these functions are **significantly** faster when run in-cloud with `access="direct"`, for example on managed cloud JupyterHubs such as NASA VEDA or 2i2c Openscapes. This is because the data is streamed directly from cloud storage to cloud compute.

> WARNING: This feature is currently experimental and may change in the future. It relies on NASA DMR++ metadata files, which may not be present for every dataset. When they are missing, you may get a `FileNotFoundError`.
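
Because a missing DMR++ file surfaces as a `FileNotFoundError`, it can be useful to guard the call when looping over granules. The following is a minimal sketch, not part of earthaccess: `open_or_skip` and `fake_opener` are hypothetical names introduced here, and the stand-in opener only simulates the error so the example runs without network access.

```python
from typing import Any, Callable, Optional


def open_or_skip(opener: Callable[..., Any], granule: Any, **kwargs: Any) -> Optional[Any]:
    """Call `opener(granule, **kwargs)`; return None if the metadata file is missing."""
    try:
        return opener(granule, **kwargs)
    except FileNotFoundError as err:
        # No DMR++ metadata was found for this granule, so report it and move on.
        print(f"Skipping granule, no DMR++ metadata found: {err}")
        return None


# Hypothetical stand-in for a real opener, used only to exercise the helper.
def fake_opener(granule: str, **kwargs: Any) -> str:
    if granule.endswith("missing.nc"):
        raise FileNotFoundError(f"{granule}.dmrpp")
    return f"virtual dataset for {granule}"


opened = open_or_skip(fake_opener, "granule_ok.nc")
skipped = open_or_skip(fake_opener, "granule_missing.nc")
```

In this notebook, `opener` would be `earthaccess.open_virtual_dataset` and `granule` would be one element of `results`.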


In [None]:
import earthaccess
import xarray as xr

In [None]:
# NASA JPL Multiscale Ultrahigh Resolution (MUR) Sea Surface Temperature (SST) dataset - 0.01 degree resolution
results = earthaccess.search_data(
    temporal=("2010-01-01", "2010-01-31"), short_name="MUR-JPL-L4-GLOB-v4.1"
)
len(results)

In [None]:
%%time
mur = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    load=True,
    concat_dim="time",
    coords="all",
    compat="override",
    combine_attrs="drop_conflicts",
)
mur

In [None]:
print(f"{mur.nbytes / 1e9:.2f} GB")

In [None]:
mur.isel(time=0).sel(
    lat=slice(20, 45), lon=slice(-95, -50)
).analysed_sst.plot.pcolormesh(x="lon", y="lat", cmap="plasma", figsize=(8, 4))

## Save virtual reference file and load with xarray
If you access a dataset frequently, or you want to share a blueprint of it with others, it is recommended to create a virtual reference file that points to the data in the cloud. This allows xarray to rapidly load the dataset as if it were a [Zarr store](https://zarr.dev/).

Notice below that `load=False`. This means that the output of `open_virtual_mfdataset` is a virtual xarray Dataset that contains only chunk information and metadata. You can modify this dataset, then save it to a virtual reference file (as JSON), and then simply load that file with xarray. For more information on virtual reference files, see the [virtualizarr documentation](https://virtualizarr.readthedocs.io/en/latest/).

Sample workflow:
1. Open a dataset with `open_virtual_mfdataset` with `load=False`
2. Modify the dataset as needed
3. Save the dataset to a virtual reference file with `vds.virtualize.to_kerchunk(...)`
4. Load the virtual reference file with `xr.open_dataset(..., engine='kerchunk')`

In [None]:
%%time
mur_vds = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    load=False,
    concat_dim="time",
    coords="all",
    compat="override",
    combine_attrs="drop_conflicts",
)
mur_vds

In [None]:
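# Example of what's inside this virtual dataset:
# the variable's data is a manifest of chunk references, not the array values
print(mur_vds.analysed_sst.data)
# Each manifest entry records where one chunk lives (file path, byte offset, length)
print(mur_vds.analysed_sst.data.manifest.dict()["0.0.1"])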
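
The manifest key `"0.0.1"` is a chunk-grid index: one integer per dimension, separated by dots. As a rough sketch of how such a key maps to a region of the full array, the helper below converts a key into index slices. The function name and the toy array and chunk shapes are assumptions chosen for illustration; they are not the real MUR shapes.

```python
from typing import Sequence, Tuple


def chunk_key_to_slices(
    key: str, chunk_shape: Sequence[int], array_shape: Sequence[int]
) -> Tuple[slice, ...]:
    """Convert a dot-separated chunk key into the index slices that chunk covers."""
    indices = [int(part) for part in key.split(".")]
    if not (len(indices) == len(chunk_shape) == len(array_shape)):
        raise ValueError("key, chunk_shape and array_shape must have the same rank")
    slices = []
    for idx, csize, asize in zip(indices, chunk_shape, array_shape):
        start = idx * csize
        if start >= asize:
            raise IndexError(f"chunk index {idx} is outside the array")
        # Chunks on the trailing edge may be smaller than the nominal chunk size.
        stop = min(start + csize, asize)
        slices.append(slice(start, stop))
    return tuple(slices)


# Toy shapes for illustration only (time, lat, lon).
array_shape = (4, 100, 200)
chunk_shape = (1, 50, 80)

region = chunk_key_to_slices("0.0.1", chunk_shape, array_shape)
edge_region = chunk_key_to_slices("3.1.2", chunk_shape, array_shape)
print(region)
print(edge_region)
```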

In [None]:
mur_vds.virtualize.to_kerchunk(filepath="mur_kerchunk.json", format="json")
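
For orientation, a kerchunk-style JSON reference file maps Zarr chunk keys to byte ranges inside the original data files. The sketch below builds a toy reference dictionary by hand and resolves one key to an HTTP `Range` header. It is not the file written by `to_kerchunk`: the URL, offsets, and lengths are invented, and `byte_range_header` is a hypothetical helper.

```python
import json
from typing import Dict, Tuple

# Toy references in the kerchunk version 1 layout: each chunk key maps to
# [url, byte offset, byte length]. All values here are made up.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "analysed_sst/0.0.0": ["https://example.com/data/granule_001.nc", 4096, 1024],
        "analysed_sst/0.0.1": ["https://example.com/data/granule_001.nc", 5120, 2048],
    },
}


def byte_range_header(refs: dict, key: str) -> Tuple[str, Dict[str, str]]:
    """Return the URL and HTTP Range header needed to fetch one referenced chunk."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        # Small pieces such as Zarr metadata are stored inline as strings.
        raise ValueError(f"{key} is stored inline, not as a byte range")
    url, offset, length = entry
    # HTTP byte ranges are inclusive at both ends.
    return url, {"Range": f"bytes={offset}-{offset + length - 1}"}


# Round-trip through JSON to mimic writing the file and reading it back.
roundtrip = json.loads(json.dumps(references))
url, headers = byte_range_header(roundtrip, "analysed_sst/0.0.1")
print(url, headers)
```

This is why the reference file stays small: it stores only pointers into the original files, never the array values themselves.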

In [None]:
%%time
fs = earthaccess.get_fsspec_https_session()

ds = xr.open_dataset(
    "mur_kerchunk.json",
    engine="kerchunk",
    chunks={},
    storage_options={
        "remote_protocol": fs.protocol,
        "remote_options": fs.storage_options,
    },
)
print(ds)

## Read datasets with groups

In [None]:
# NASA TEMPO NO2 tropospheric and stratospheric columns V03
results = earthaccess.search_data(count=2, doi="10.5067/IS-40e/TEMPO/NO2_L2.003")
len(results)

In [None]:
earthaccess.open_virtual_dataset(results[0], group="product")

## Advanced: Preprocess the datasets
You can also preprocess the datasets before saving the virtual reference file. This is useful if you want to apply a function to each dataset before concatenation. For example, the `SWOT_L2_LR_SSH_Expert_2.0` dataset (from the [NASA JPL SWOT satellite](https://www.jpl.nasa.gov/missions/surface-water-and-ocean-topography-swot/)) is an [L2 product](https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels) where each file represents a single pass of the satellite. If you want to combine all the passes into a single dataset, you can concatenate the datasets using `cycle_number` and `pass_number`, which are only found in the attributes of each NetCDF file.

The `preprocess` argument takes a function that is applied to each dataset. Here it lets us turn those attributes into dimensions first, and then concatenate along one of the new dimensions.

In [None]:
results = earthaccess.search_data(
    count=10, temporal=("2023"), short_name="SWOT_L2_LR_SSH_Expert_2.0"
)

In [None]:
%%time


def preprocess(ds: xr.Dataset) -> xr.Dataset:
    # Promote the cycle_number and pass_number attributes to dimensions with coordinates
    return ds.expand_dims(["cycle_num", "pass_num"]).assign_coords(
        cycle_num=[ds.attrs["cycle_number"]], pass_num=[ds.attrs["pass_number"]]
    )


swot = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    load=False,
    preprocess=preprocess,
    concat_dim="pass_num",
    coords="all",
    compat="override",
    combine_attrs="drop_conflicts",
)
swot