# AWS Zarr Eosdis Store Data Tests

**Goal**
<br/>
Open the MUR 1-km dataset stored in the PO.DAAC archive using the zarr-eosdis-store package and combine it with the MUR Climatology dataset (created by Mike Chin and cleaned in the notebook 'CleaningMURClimatologyData.ipynb') to create a Sea Surface Temperature (SST) anomaly dataset. The anomaly dataset is then used to test runtimes for dataset loading and plotting.

**Run Location**
<br/>
This notebook was run on an AWS EC2 t3.medium instance, where it did not run to completion (see the ValueError in the cells below). It also did not work on a t3.small instance. Neither memory nor disk space is the cause of the failure.

**Dataset**
<br/>
MUR 1-km L4 SST netCDF4 On-Premise https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/

### Import Modules

In [1]:
%matplotlib inline
import sys
from eosdis_store import EosdisStore

import s3fs
import numpy as np
import xarray as xr
import fsspec
import zarr
import timeit
import matplotlib.pyplot as plt
from dask.distributed import Client, performance_report

### Save Access URL

In [2]:
BASEURL = 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/'

## Setup for Regional Tests

### Period and Region of Interest

In [3]:
start_date = "2019-08-01"
end_date = "2020-01-21"

minlat = 18
maxlat = 23
minlon = -160
maxlon = -154

### Create Labels for Necessary Days

In [4]:
dates = np.arange(start_date, end_date, dtype='datetime64[D]')
dates

array(['2019-08-01', '2019-08-02', '2019-08-03', '2019-08-04',
       '2019-08-05', '2019-08-06', '2019-08-07', '2019-08-08',
       '2019-08-09', '2019-08-10', '2019-08-11', '2019-08-12',
       '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
       '2019-08-17', '2019-08-18', '2019-08-19', '2019-08-20',
       '2019-08-21', '2019-08-22', '2019-08-23', '2019-08-24',
       '2019-08-25', '2019-08-26', '2019-08-27', '2019-08-28',
       '2019-08-29', '2019-08-30', '2019-08-31', '2019-09-01',
       '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05',
       '2019-09-06', '2019-09-07', '2019-09-08', '2019-09-09',
       '2019-09-10', '2019-09-11', '2019-09-12', '2019-09-13',
       '2019-09-14', '2019-09-15', '2019-09-16', '2019-09-17',
       '2019-09-18', '2019-09-19', '2019-09-20', '2019-09-21',
       '2019-09-22', '2019-09-23', '2019-09-24', '2019-09-25',
       '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29',
       '2019-09-30', '2019-10-01', '2019-10-02', '2019-
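
Note that `np.arange` treats the end date as exclusive, so the range above stops at 2020-01-20 rather than 2020-01-21. A minimal sketch to check the endpoints and the number of days, reusing the same variable names as the cell above:

```python
import numpy as np

start_date = "2019-08-01"
end_date = "2020-01-21"

# np.arange uses a half-open interval: end_date itself is not included.
dates = np.arange(start_date, end_date, dtype="datetime64[D]")

# First day, last day, and total number of daily files expected.
print(dates[0], dates[-1], len(dates))
```

The day count should match the length of the `time` dimension in the array summaries further down.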

### Create URLs for Accessing Data

In [5]:
urls = []

for day in dates:
    urls.append(BASEURL + str(day).replace('-', '') + '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc')
    
urls

['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190801090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190802090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190803090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190804090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190805090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190806090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cum

### Open MUR Dataset

In [28]:
start_time = timeit.default_timer()

variables=[
    'analysed_sst', 
    'mask'
]

def subset(ds):
    subset_ds = ds[variables].sel(
        lat=slice(minlat, maxlat),
        lon=slice(minlon, maxlon)
    )
    return subset_ds

mur_hawaii = xr.open_mfdataset(
    paths=[EosdisStore(f) for f in urls],
    preprocess=subset,
    combine='by_coords',
    consolidated=False,
    mask_and_scale=True,
#     decode_cf=True,
#     cache=False,
#     parallel=True,
    engine='zarr'
).chunk({"time": 30, "lat": 100, "lon": 100})

mur_hawaii.load()   # Load the dataset into memory now; comment out to keep it lazy

elapsed = timeit.default_timer() - start_time
print(elapsed)

ValueError: Shuffle buffer is not an integer multiple of elementsize
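
Because the cell above raises before `elapsed` is printed, no runtime is recorded for the failed load. One possible way to capture the elapsed time even when a step fails is a small context manager built on `timeit.default_timer`. This is only a sketch: the names `timed` and `timings` are hypothetical and not part of the original notebook, and the raised error is a stand-in for the real call.

```python
import timeit
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock time of the enclosed block in results[label]."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        # Runs whether or not the block raised, so failed steps are timed too.
        results[label] = timeit.default_timer() - start


timings = {}
try:
    with timed("failing_step", timings):
        # Stand-in for a call such as mur_hawaii.load() that raises.
        raise ValueError("simulated failure")
except ValueError as err:
    print(f"failed after {timings['failing_step']:.6f} s: {err}")
```

The same wrapper could be placed around the plotting calls in the regional tests so that every step reports a time, successful or not.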

In [8]:
mur_hawaii

| | Array | Chunk |
|---|---|---|
| Bytes | 198.71 MiB | 1.14 MiB |
| Shape | (173, 501, 601) | (30, 100, 100) |
| Count | 58221 Tasks | 252 Chunks |
| Type | float32 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 198.71 MiB | 1.14 MiB |
| Shape | (173, 501, 601) | (30, 100, 100) |
| Count | 30266 Tasks | 252 Chunks |
| Type | float32 | numpy.ndarray |


### Add NaN Values for Land to MUR Data
<br/>
We use the mask variable to replace temperature values from land with NaN so that they are not factored into our calculations. The mask variable holds a value for each coordinate pair indicating which surface the temperature was collected from (land, open sea, ice, etc.). The cell below keeps only the points where the mask equals 1.
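
To illustrate the masking step on something small, here is a NumPy analogue of the xarray `.where` call. It is only a sketch: the grid values and the non-1 mask codes are made up, and `np.where` stands in for the xarray method used in the notebook.

```python
import numpy as np

# Toy 2x3 grid of temperatures in kelvin (made-up values).
sst = np.array([[300.0, 301.0, 299.5],
                [298.0, 302.0, 300.5]], dtype=np.float32)

# Made-up mask codes: 1 is the value the notebook keeps; the other
# codes stand for surfaces that should be excluded.
mask = np.array([[1, 2, 1],
                 [1, 1, 5]])

# Keep values where mask == 1 and replace everything else with NaN.
sst_masked = np.where(mask == 1, sst, np.nan)

# NaN-aware reductions then skip the masked cells.
print(np.isnan(sst_masked).sum(), np.nanmean(sst_masked))
```

Using NaN rather than dropping cells keeps the grid shape intact, so the masked array still lines up with the climatology when the two are subtracted.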

In [9]:
mur_hawaii_sst = mur_hawaii['analysed_sst'].where(mur_hawaii.mask == 1)

### Convert Temperatures to Celsius
<br/>
The dataset is stored with temperatures measured in Kelvin. This converts it to Celsius for ease of understanding and analysis.

In [10]:
mur_hawaii_sst = mur_hawaii_sst - 273.15

In [11]:
mur_hawaii_sst

| | Array | Chunk |
|---|---|---|
| Bytes | 198.71 MiB | 1.14 MiB |
| Shape | (173, 501, 601) | (30, 100, 100) |
| Count | 89243 Tasks | 252 Chunks |
| Type | float32 | numpy.ndarray |


### Open MUR Climatology for Hawaii

In [12]:
mur_clim = xr.open_dataarray(
    "../data/MURClimatology.nc",
    chunks={"time": 30, "lat": 100, "lon": 100}
)

In [13]:
mur_clim

| | Array | Chunk |
|---|---|---|
| Bytes | 420.39 MiB | 1.14 MiB |
| Shape | (366, 501, 601) | (30, 100, 100) |
| Count | 547 Tasks | 546 Chunks |
| Type | float32 | numpy.ndarray |


### Drop the Leap Day

In [14]:
mur_clim = mur_clim.where(mur_clim["time"] != np.datetime64('2004-02-29T09:00:00', 'ns'), drop=True)

### Create Subset Dataset

In [15]:
mur_clim_jan = mur_clim[0:20]

In [16]:
mur_clim_subset = mur_clim[212:]

In [17]:
mur_clim_subset = xr.concat([mur_clim_subset, mur_clim_jan], dim="time")

In [18]:
mur_clim_subset = mur_clim_subset.assign_coords({"time": mur_hawaii_sst["time"]})
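
The slice bounds `212` and `20` used above can be checked with a little date arithmetic. This sketch assumes that, once the leap day is dropped, the climatology holds one entry per day of a 365-day year starting at January 1 (index 0); the helper `day_index` is hypothetical and only for illustration.

```python
from datetime import date


def day_index(month, day):
    """Zero-based position of a calendar day in a non-leap year."""
    return (date(2019, month, day) - date(2019, 1, 1)).days


aug1 = day_index(8, 1)      # first climatology entry used (start_date)
jan20 = day_index(1, 20)    # last entry used (the day before end_date)

n_aug_to_dec = 365 - aug1   # length of mur_clim[212:]
n_jan = jan20 + 1           # length of mur_clim[0:20]

print(aug1, n_aug_to_dec, n_jan, n_aug_to_dec + n_jan)
```

The total should equal the number of days in `dates`, which is what allows the `time` coordinate of `mur_hawaii_sst` to be assigned to the concatenated climatology.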

In [19]:
mur_clim_subset

| | Array | Chunk |
|---|---|---|
| Bytes | 198.71 MiB | 1.14 MiB |
| Shape | (173, 501, 601) | (30, 100, 100) |
| Count | 2254 Tasks | 294 Chunks |
| Type | float32 | numpy.ndarray |


### Create SST Anomaly Dataset

In [20]:
sst_anomaly = mur_hawaii_sst - mur_clim_subset

In [21]:
sst_anomaly

| | Array | Chunk |
|---|---|---|
| Bytes | 198.71 MiB | 1.03 MiB |
| Shape | (173, 501, 601) | (27, 100, 100) |
| Count | 93933 Tasks | 504 Chunks |
| Type | float32 | numpy.ndarray |


### Find Daily Average SST Anomaly for Time Series

In [22]:
sst_anomaly_mean_ts = sst_anomaly.mean(['lat', 'lon'])

In [23]:
sst_anomaly_mean_ts

| | Array | Chunk |
|---|---|---|
| Bytes | 692 B | 108 B |
| Shape | (173,) | (27,) |
| Count | 94641 Tasks | 12 Chunks |
| Type | float32 | numpy.ndarray |


### Find Average SST Anomaly for Each Coordinate Pair for Spatial Plot

In [24]:
sst_anomaly_mean_sp = sst_anomaly.mean(['time'])

In [25]:
sst_anomaly_mean_sp

| | Array | Chunk |
|---|---|---|
| Bytes | 1.15 MiB | 39.06 kiB |
| Shape | (501, 601) | (100, 100) |
| Count | 94605 Tasks | 42 Chunks |
| Type | float32 | numpy.ndarray |
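
The two reductions above differ only in which dimensions they collapse. The following NumPy sketch uses a made-up `(time, lat, lon)` cube to show the resulting shapes; `np.nanmean` is used here as an analogue of the xarray `.mean` calls, which skip NaN values by default for floating-point data.

```python
import numpy as np

# Toy anomaly cube with dimensions (time, lat, lon) = (3, 2, 2); values are made up.
anomaly = np.arange(12, dtype=np.float32).reshape(3, 2, 2)
anomaly[0, 0, 0] = np.nan  # pretend one cell is masked at the first time step

# Time series: one value per time step, averaged over lat and lon.
mean_ts = np.nanmean(anomaly, axis=(1, 2))

# Spatial map: one value per (lat, lon) cell, averaged over time.
mean_sp = np.nanmean(anomaly, axis=0)

print(mean_ts.shape, mean_sp.shape)
```

Both results are plain unweighted averages, which matches what the notebook cells compute.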


## Regional Tests

### Regional SST Anomaly Averaged Time Series, August 1st, 2019 - January 20th, 2020

In [26]:
start_time = timeit.default_timer()

sst_anomaly_mean_ts.plot()

elapsed = timeit.default_timer() - start_time
print(elapsed)

ValueError: Shuffle buffer is not an integer multiple of elementsize

### Regional SST Anomaly Averaged Spatial Plot, August 1st, 2019 - January 20th, 2020

In [27]:
start_time = timeit.default_timer()

sst_anomaly_mean_sp.plot()

elapsed = timeit.default_timer() - start_time
print(elapsed)

ValueError: Shuffle buffer is not an integer multiple of elementsize