# Converting NetCDF4 Data to Zarr

This notebook was developed following the workflow defined in Jack McNelis' "Use Case: Study Amazon Estuaries with Data from the EOSDIS Cloud" notebook, located here: https://github.com/podaac/tutorials/blob/master/notebooks/SWOT-EA-2021/Estuary_explore_inCloud_zarr.ipynb

**Goal**
<br/>
The goal of this notebook is to access the MUR 1-km SST dataset, which is stored in netCDF4 format, convert that data to Zarr using the Earthdata Harmony tool, and then run an analysis on a subset of the data.

**Dataset**
<br/>
MUR 1-km L4 SST (requires AWS early access in order to view on Earthdata Search) https://podaac.jpl.nasa.gov/MEaSUREs-MUR?tab=background&sections=about%2Bdata

### Import Modules

In [2]:
import s3fs
import numpy as np
import xarray as xr
import fsspec
import zarr
import timeit
import time
import requests
import matplotlib.pyplot as plt
import pandas as pd
from dask.distributed import Client
from IPython.display import HTML
from json import dumps, loads
from platform import system
from netrc import netrc
from getpass import getpass
from urllib import request
from http.cookiejar import CookieJar
from os.path import join, expanduser

### Period and Region of Interest

In [3]:
start_date = "2019-08-01"
end_date = "2020-01-20"

minlat = 18
maxlat = 23
minlon = -160
maxlon = -154

### Set the CMR, URS, and Harmony Endpoints
<br/>
CMR, the Earthdata Common Metadata Repository, is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. URS is the Earthdata Login system, which allows (free) download access to Earthdata data. The Harmony API allows you to seamlessly analyze Earth observation data from different NASA data centers.

In [4]:
cmr = "cmr.earthdata.nasa.gov"
urs = "urs.earthdata.nasa.gov"
harmony = "harmony.earthdata.nasa.gov"

cmr, urs, harmony

('cmr.earthdata.nasa.gov',
 'urs.earthdata.nasa.gov',
 'harmony.earthdata.nasa.gov')

### Earthdata Login
<br/>
An Earthdata Login account is required to access data, as well as discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.

The setup_earthdata_login_auth function allows Python scripts to log into any Earthdata Login application programmatically. To avoid being prompted for credentials every time you run the notebook, and to allow clients such as curl to log in as well, add the following to a .netrc file (_netrc on Windows) in your home directory:

    machine urs.earthdata.nasa.gov
    login <your username>
    password <your password>

Make sure that this file is only readable by the current user or you will receive an error stating "netrc access too permissive."

    $ chmod 0600 ~/.netrc
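
The permission requirement and the entry lookup described above can also be checked from Python before running the login cell. This is a minimal sketch, assuming a POSIX system; the helper names `is_private` and `read_login` are hypothetical, and the demonstration writes placeholder credentials to a throwaway temporary file rather than touching your real .netrc.

```python
import os
import stat
import tempfile
from netrc import netrc


def is_private(path):
    """Return True when group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def read_login(path, host="urs.earthdata.nasa.gov"):
    """Return (login, password) for host, or None when the file has no entry."""
    entry = netrc(path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Demonstrate on a throwaway file holding placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".netrc")
    with open(demo_path, "w") as f:
        f.write("machine urs.earthdata.nasa.gov\n"
                "    login demo_user\n"
                "    password demo_pass\n")
    os.chmod(demo_path, 0o600)  # same effect as `chmod 0600`
    demo_private = is_private(demo_path)
    demo_login = read_login(demo_path)
    demo_missing = read_login(demo_path, host="example.invalid")

print(demo_private, demo_login, demo_missing)
```

Returning None for a missing host mirrors what `authenticators` does, which is why the login function in this notebook catches TypeError when unpacking the result.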

In [5]:
TOKEN_DATA = ("<token>"
              "<username>%s</username>"
              "<password>%s</password>"
              "<client_id>PODAAC CMR Client</client_id>"
              "<user_ip_address>%s</user_ip_address>"
              "</token>")


def setup_earthdata_login_auth(urs: str='urs.earthdata.nasa.gov', cmr: str='cmr.earthdata.nasa.gov'):

    # GET URS LOGIN INFO FROM NETRC OR USER PROMPTS:
    netrc_name = "_netrc" if system()=="Windows" else ".netrc"
    try:
        username, _, password = netrc(file=join(expanduser('~'), netrc_name)).authenticators(urs)
        print("# Your URS credentials were securely retrieved from your .netrc file.")
    except (FileNotFoundError, TypeError):
        print('# Please provide your Earthdata Login credentials for access.')
        print('# Your info will only be passed to %s and will not be exposed in Jupyter.' % (urs))
        username = input('Username: ')
        password = getpass('Password: ')

    # SET UP URS AUTHENTICATION FOR HTTP DOWNLOADS:
    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, urs, username, password)
    auth = request.HTTPBasicAuthHandler(manager)
    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

    # GET TOKEN TO ACCESS RESTRICTED CMR METADATA:
    ip = requests.get("https://ipinfo.io/ip").text.strip()
    r = requests.post(
        url="https://%s/legacy-services/rest/tokens" % cmr,
        data=TOKEN_DATA % (str(username), str(password), ip),
        headers={'Content-Type': 'application/xml', 'Accept': 'application/json'}
    )
    return r.json()['token']['id']


# Provide URS credentials for HTTP download auth & CMR token retrieval:
_token = setup_earthdata_login_auth(urs=urs, cmr=cmr)

# Your URS credentials were securely retrieved from your .netrc file.


### Obtain Dataset Metadata

In [6]:
mur_ShortName = "MUR-JPL-L4-GLOB-v4.1"

In [7]:
r = requests.get(url=f"https://{cmr}/search/collections.umm_json", 
                 params={'provider': "POCLOUD", 
                         'ShortName': mur_ShortName, 
                         'token': _token})

mur_coll = r.json()
mur_coll_meta = mur_coll['items'][0]['meta']
mur_coll_meta

{'revision-id': 29,
 'deleted': False,
 'format': 'application/vnd.nasa.cmr.umm+json',
 'provider-id': 'POCLOUD',
 'user-id': 'mgangl',
 'has-formats': False,
 'associations': {'variables': ['V2028632042-POCLOUD',
   'V2028632044-POCLOUD',
   'V2028632047-POCLOUD',
   'V2028668049-POCLOUD']},
 's3-links': ['podaac-ops-cumulus-public/MUR-JPL-L4-GLOB-v4.1/',
  'podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/'],
 'has-spatial-subsetting': False,
 'native-id': 'GHRSST+Level+4+MUR+Global+Foundation+Sea+Surface+Temperature+Analysis+(v4.1)',
 'has-transforms': False,
 'has-variables': True,
 'concept-id': 'C1996881146-POCLOUD',
 'revision-date': '2021-07-26T17:41:06.284Z',
 'granule-count': 0,
 'has-temporal-subsetting': False,
 'concept-type': 'collection'}

### Locate Granules
<br/>
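The query below returns more than 7,000 hits. Each hit represents a single day of data in the MUR SST dataset. Because page_size is set to 2000, the response itself contains only the first 2,000 granule records.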

In [8]:
r = requests.get(url=f"https://{cmr}/search/granules.umm_json", 
                 params={'provider': "POCLOUD", 
                         'ShortName': mur_ShortName, 
                         'token': _token,
                         'page_size': 2000})

mur_gran = r.json()
mur_gran['hits']

7018
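
The search above returns the whole record, so it may help to narrow it to the period and region of interest. The sketch below only builds the query parameters and makes no network call. It assumes CMR's `short_name`, `temporal`, and `bounding_box` search parameters; the helper name `granule_search_params` is hypothetical, and the token is left out for brevity.

```python
import math
from datetime import datetime


def granule_search_params(short_name, start_date, end_date, bbox, page_size=2000):
    """Build CMR granule-search parameters for a period and a bounding box.

    bbox is (minlon, minlat, maxlon, maxlat), i.e. west, south, east, north.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if end < start:
        raise ValueError("end_date must not precede start_date")
    stamp = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "provider": "POCLOUD",
        "short_name": short_name,
        "temporal": f"{start.strftime(stamp)},{end.strftime(stamp)}",
        "bounding_box": ",".join(str(v) for v in bbox),
        "page_size": page_size,
    }


params = granule_search_params(
    "MUR-JPL-L4-GLOB-v4.1", "2019-08-01", "2020-01-20", (-160, 18, -154, 23)
)

# Number of pages needed to walk through every hit of the unfiltered search.
pages_needed = math.ceil(7018 / 2000)

print(params["temporal"], params["bounding_box"], pages_needed)
```

The resulting dictionary could be passed as `params=` to `requests.get` in the same way as the cell above, with the token added.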

In [9]:
mur_gran['items'][9]

{'meta': {'concept-type': 'granule',
  'concept-id': 'G2028105753-POCLOUD',
  'revision-id': 1,
  'native-id': '20020610090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
  'provider-id': 'POCLOUD',
  'format': 'application/vnd.nasa.cmr.umm+json',
  'revision-date': '2021-03-31T05:21:01.951Z'},
 'umm': {'TemporalExtent': {'RangeDateTime': {'EndingDateTime': '2002-06-10T21:00:00.000Z',
    'BeginningDateTime': '2002-06-09T21:00:00.000Z'}},
  'MetadataSpecification': {'URL': 'https://cdn.earthdata.nasa.gov/umm/granule/v1.6.3',
   'Name': 'UMM-G',
   'Version': '1.6.3'},
  'GranuleUR': '20020610090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
  'ProviderDates': [{'Type': 'Insert', 'Date': '2021-03-31T05:20:32.509Z'},
   {'Type': 'Update', 'Date': '2021-03-31T05:20:32.510Z'}],
  'SpatialExtent': {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': -180,
       'SouthBoundingCoordinate': -90,
       'EastBoundingCoordinate': 180,
       'NorthBound

In [10]:
mur_gran['items'][1999]

{'meta': {'concept-type': 'granule',
  'concept-id': 'G2028143785-POCLOUD',
  'revision-id': 1,
  'native-id': '20071121090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
  'provider-id': 'POCLOUD',
  'format': 'application/vnd.nasa.cmr.umm+json',
  'revision-date': '2021-03-31T07:39:30.031Z'},
 'umm': {'TemporalExtent': {'RangeDateTime': {'EndingDateTime': '2007-11-21T21:00:00.000Z',
    'BeginningDateTime': '2007-11-20T21:00:00.000Z'}},
  'MetadataSpecification': {'URL': 'https://cdn.earthdata.nasa.gov/umm/granule/v1.6.3',
   'Name': 'UMM-G',
   'Version': '1.6.3'},
  'GranuleUR': '20071121090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
  'ProviderDates': [{'Type': 'Insert', 'Date': '2021-03-31T07:39:04.510Z'},
   {'Type': 'Update', 'Date': '2021-03-31T07:39:04.510Z'}],
  'SpatialExtent': {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': -180,
       'SouthBoundingCoordinate': -90,
       'EastBoundingCoordinate': 180,
       'NorthBound

The CMR Search metadata for a single day.

In [11]:
mur_gran['items'][0]['meta']

{'concept-type': 'granule',
 'concept-id': 'G2030963432-POCLOUD',
 'revision-id': 1,
 'native-id': '20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
 'provider-id': 'POCLOUD',
 'format': 'application/vnd.nasa.cmr.umm+json',
 'revision-date': '2021-04-07T16:59:09.662Z'}

You can also obtain the UMM metadata, accessible from the 'umm' key.

In [12]:
mur_gran['items'][0]['umm']['RelatedUrls']

[{'URL': 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
  'Description': 'Download 20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
  'Type': 'GET DATA'},
 {'URL': 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
  'Description': 'Download 20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
  'Type': 'EXTENDED METADATA'},
 {'URL': 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.cmr.json',
  'Description': 'Download 20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.cmr.json',
  'Type': 'EXTENDED METADATA'},
 {'URL': 'https://archive.podaac.earthdata.nasa.gov/s3credentials',
  'Description': 'api endpoint to retrieve temporary credentials v

We want the URL corresponding to 'Type': 'GET DATA'. Select the URL from the appropriate item in the list, then print it:

In [13]:
mur_gran['items']

[{'meta': {'concept-type': 'granule',
   'concept-id': 'G2030963432-POCLOUD',
   'revision-id': 1,
   'native-id': '20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
   'provider-id': 'POCLOUD',
   'format': 'application/vnd.nasa.cmr.umm+json',
   'revision-date': '2021-04-07T16:59:09.662Z'},
  'umm': {'TemporalExtent': {'RangeDateTime': {'EndingDateTime': '2002-06-01T21:00:00.000Z',
     'BeginningDateTime': '2002-05-31T21:00:00.000Z'}},
   'MetadataSpecification': {'URL': 'https://cdn.earthdata.nasa.gov/umm/granule/v1.6.3',
    'Name': 'UMM-G',
    'Version': '1.6.3'},
   'GranuleUR': '20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1',
   'ProviderDates': [{'Type': 'Insert', 'Date': '2021-04-07T16:58:43.887Z'},
    {'Type': 'Update', 'Date': '2021-04-07T16:58:43.888Z'}],
   'SpatialExtent': {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': -180,
        'SouthBoundingCoordinate': -90,
        'EastBoundingCoordinate': 180,


In [14]:
mur_url = mur_gran['items'][0]['umm']['RelatedUrls'][0]['URL']
mur_url

'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
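
Indexing RelatedUrls with [0] relies on the data link being listed first. A slightly more defensive sketch selects the entry by its 'Type' instead. The helper name `data_url` is hypothetical, and the sample list reuses two entries from the output above in swapped order to show that position no longer matters.

```python
def data_url(related_urls, url_type="GET DATA"):
    """Return the URL of the first entry whose 'Type' matches url_type."""
    for item in related_urls:
        if item.get("Type") == url_type:
            return item["URL"]
    raise LookupError(f"no RelatedUrls entry with Type {url_type!r}")


# Two entries copied from the granule metadata shown earlier, deliberately reordered.
sample_related_urls = [
    {"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5",
     "Type": "EXTENDED METADATA"},
    {"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc",
     "Type": "GET DATA"},
]

selected_url = data_url(sample_related_urls)
print(selected_url)
```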

### Request to Harmony API: Zarr Reformatter
<br/>
Instead of downloading the entire granule, we'll use the Harmony API's Zarr Reformatter service. The Zarr format will allow us to open and read just the data that we require for our region of interest.
<br/>
<br/>
If you have a jobID you'd like to revisit instead of running this command again, modify the cell below to set async_jobId, then skip the one immediately after. You can continue from 'Query for the job status and links'.
<br/>
<br/>
If you are running for the first time, proceed to the next cells to submit the Harmony request.

In [15]:
async_jobId = "17086226-3ed5-4c5f-bc82-a56837623a01"  # jobId belongs to dev. You won't have access.
# async_jobId = None

See important usage note below if this is your first time submitting a request to the Zarr Reformatter service.

The Zarr Reformatter service operates on an input Collection concept-id (a CMR construct). The service will accept more user-friendly inputs (like a Collection ShortName) in future releases. Here's how you identify the CMR concept-id for the JPL MUR 1-km SST dataset:

In [16]:
collection_concept_id = mur_coll_meta['concept-id']
collection_concept_id

'C1996881146-POCLOUD'

In [17]:
# Harmony subset expressions, e.g. '(18:23)'
lat = f"({minlat}:{maxlat})"
lon = f"({minlon}:{maxlon})"
# NOTE: this rebinds the name `time`, shadowing the `time` module imported above.
time = '("2019-08-01T09:00:00Z":"2019-08-03T09:00:00Z")'

Most of the next cell runs only if async_jobId is None, that is, when no existing job identifier was set above. In that case it submits the Harmony request and prints the JSON response.

**There is currently no Harmony L4 Subsetter service, so the following request cannot return a spatial subset of the MUR dataset (it can be subset along the time dimension, but not along lat or lon). The same request should work with an L2 dataset because there is a Harmony L2 Subsetter, and it should also work for MUR once a Harmony L4 Subsetter exists.**

In [18]:
async_url = f'https://{harmony}/{collection_concept_id}/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?subset=lat{lat}&subset=lon{lon}&subset=time{time}&format=application/x-zarr'
if async_jobId is None:
    print('Request URL: ', async_url)
    async_response = request.urlopen(async_url)
    async_results = async_response.read()
    async_json = loads(async_results)
    print(dumps(async_json, indent=2))
    async_jobId = async_json['jobID']
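
The request URL above is assembled by hand, including literal double quotes around the timestamps. One alternative, sketched here, is to let `urllib.parse.urlencode` percent-encode the query string. The function name `harmony_zarr_url` is hypothetical, and this assumes the service decodes percent-encoded query parameters in the usual way; nothing is sent over the network.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def harmony_zarr_url(host, concept_id, minlat, maxlat, minlon, maxlon, start, end):
    """Build a coverages rangeset URL that asks for Zarr output."""
    query = [
        ("subset", f"lat({minlat}:{maxlat})"),
        ("subset", f"lon({minlon}:{maxlon})"),
        ("subset", f'time("{start}":"{end}")'),
        ("format", "application/x-zarr"),
    ]
    path = f"/{concept_id}/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset"
    # A list of pairs keeps the repeated `subset` key; urlencode escapes each value.
    return f"https://{host}{path}?{urlencode(query)}"


demo_url = harmony_zarr_url(
    "harmony.earthdata.nasa.gov", "C1996881146-POCLOUD",
    18, 23, -160, -154,
    "2019-08-01T09:00:00Z", "2019-08-03T09:00:00Z",
)

# Decoding the query string recovers the original subset expressions.
decoded_query = parse_qsl(urlsplit(demo_url).query)
print(demo_url)
print(decoded_query)
```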

Format and display the complete url to the Harmony API job:

In [19]:
job_url = f'https://{harmony}/jobs/{async_jobId}'
job_url

'https://harmony.earthdata.nasa.gov/jobs/17086226-3ed5-4c5f-bc82-a56837623a01'

Query for the job status and links in case the request is still processing:

In [20]:
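# The name `time` was rebound to the subset string in an earlier cell,
# so `time.sleep` would fail here; import sleep directly instead.
from time import sleep

while True:
    # Fetch the current job document and decode the JSON body.
    with request.urlopen(job_url) as loop_response:
        job_json = loads(loop_response.read())
    if job_json['status'] != 'running':
        break
    print(f"# Job status is running. Progress is {job_json['progress']}. Trying again.")
    sleep(5)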

links = []
if job_json['status'] == 'successful' and job_json['progress'] == 100:
    print("# Job progress is 100%. Links to job outputs are displayed below:")
    links = [link['href'] for link in job_json['links']]
    display(links)

# Job progress is 100%. Links to job outputs are displayed below:


['https://harmony.earthdata.nasa.gov/stac/17086226-3ed5-4c5f-bc82-a56837623a01/',
 'https://harmony.earthdata.nasa.gov/cloud-access.sh',
 'https://harmony.earthdata.nasa.gov/cloud-access',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/20190801090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/20190802090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/20190803090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/20190804090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4
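
The polling loop above runs until the job stops reporting 'running'. A bounded variant is sketched below; the function name `wait_for_job` and its parameters are hypothetical. The status fetcher and the sleep function are passed in, so the example runs offline against canned responses rather than the Harmony API.

```python
import time as time_module  # aliased so a variable named `time` cannot shadow it


def wait_for_job(fetch_status, poll_seconds=5.0, max_polls=120, sleep=time_module.sleep):
    """Poll fetch_status() until the job leaves the 'running' state.

    fetch_status is any zero-argument callable that returns the job JSON as a dict.
    Raises TimeoutError if the job is still running after max_polls attempts.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] != "running":
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")


# Offline demonstration: canned responses stand in for real HTTP calls.
canned = iter([
    {"status": "running", "progress": 40},
    {"status": "running", "progress": 80},
    {"status": "successful", "progress": 100, "links": []},
])
naps = []  # records each requested sleep instead of actually waiting
final_job = wait_for_job(lambda: next(canned), sleep=naps.append)
print(final_job["status"], len(naps))
```

In this notebook, `fetch_status` could be `lambda: loads(request.urlopen(job_url).read())`, which is the same call the loop above makes.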

### Access URL for the Output Zarr File
<br/>
The new Zarr dataset(s) are staged for us in an S3 bucket. Their URLs end in '.zarr' and, for MUR, begin at index 4 of the list above. The index may differ for other datasets.

Select the url and display below:

In [21]:
zarr_url = links[4]
zarr_url

's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/20190801090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr'
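
Because the position of the first Zarr link may vary between datasets, filtering the links by scheme and suffix is a little more robust than a fixed index. This is a small sketch; the helper name `zarr_links` is hypothetical, and the sample list copies the complete entries from the job output above.

```python
def zarr_links(links):
    """Keep only S3 links that point at a Zarr store."""
    return [href for href in links
            if href.startswith("s3://") and href.endswith(".zarr")]


staging = "s3://harmony-prod-staging/public/harmony/netcdf-to-zarr/e78ccaad-04bd-4fe8-81c7-4d356e6ea1eb/"
sample_links = [
    "https://harmony.earthdata.nasa.gov/stac/17086226-3ed5-4c5f-bc82-a56837623a01/",
    "https://harmony.earthdata.nasa.gov/cloud-access.sh",
    "https://harmony.earthdata.nasa.gov/cloud-access",
    staging,
    staging + "20190801090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr",
    staging + "20190802090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.zarr",
]

stores = zarr_links(sample_links)
print(len(stores), stores[0])
```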

### Access Credentials for the Output Zarr File
<br/>
The second and third URLs in the list (cloud-access.sh and cloud-access) provide credentials that grant authenticated access to your staged S3 resources.

Grab the credentials as a JSON string, load to a Python dictionary, and display their expiration date:

In [22]:
with request.urlopen(f"https://{harmony}/cloud-access") as f:
    creds = loads(f.read())

creds['Expiration']

'2021-08-21T06:29:32.000Z'
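
Since the credentials are temporary, it can be useful to check how long they remain valid before opening the store. The sketch below assumes the timestamp always has the form shown above (UTC with fractional seconds and a trailing 'Z'); the helper name `seconds_until_expiry` is hypothetical.

```python
from datetime import datetime, timezone


def seconds_until_expiry(expiration, now=None):
    """Return the seconds left before the credentials expire (negative if expired)."""
    expires = datetime.strptime(expiration, "%Y-%m-%dT%H:%M:%S.%fZ")
    expires = expires.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).total_seconds()


# Fixed reference time, one hour before the expiration shown above.
reference = datetime(2021, 8, 21, 5, 29, 32, tzinfo=timezone.utc)
remaining = seconds_until_expiry("2021-08-21T06:29:32.000Z", now=reference)
print(remaining)
```

In the notebook, `seconds_until_expiry(creds['Expiration'])` would compare against the current time instead.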

### Open Staged Zarr File with s3fs
<br/>
We use the s3fs package to get metadata about the Zarr data store and list its contents:

(The cell below shows a permission error because it is nearing the end of my internship and I believe my access has been revoked. However, this code should work.)

In [26]:
zarr_fs = s3fs.S3FileSystem(
    key=creds['AccessKeyId'],
    secret=creds['SecretAccessKey'],
    token=creds['SessionToken'],
    client_kwargs={'region_name':'us-west-2'},
)
zarr_store = zarr_fs.get_mapper(root=zarr_url, check=False)
zarr_dataset = zarr.open(zarr_store)

print(zarr_dataset.tree())

PermissionError: Forbidden

In [24]:
print(zarr_dataset.analysed_sst.info)

Name               : /analysed_sst
Type               : zarr.core.Array
Data type          : float64
Shape              : (1, 17999, 36000)
Chunk shape        : (1, 1023, 2047)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : fsspec.mapping.FSMap
No. bytes          : 5183712000 (4.8G)
Chunks initialized : 324/324
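
The byte count and chunk count reported above follow directly from the array shape, chunk shape, and data type. The arithmetic can be reproduced as a quick check; the numbers are copied from the info output, and 8 bytes per element corresponds to float64.

```python
import math

shape = (1, 17999, 36000)   # (time, lat, lon), from the info output
chunks = (1, 1023, 2047)    # chunk shape, from the info output
itemsize = 8                # bytes per float64 element

# Uncompressed size: number of elements times bytes per element.
n_bytes = math.prod(shape) * itemsize

# Chunks per dimension are rounded up, since edge chunks may be partial.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one full chunk.
chunk_bytes = math.prod(chunks) * itemsize

print(n_bytes, round(n_bytes / 2**30, 1), chunks_per_dim, n_chunks, chunk_bytes)
```

Each full chunk is about 16 MiB uncompressed, which is the smallest unit read from S3 when a region of the array is requested.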



### Open Staged Zarr File with xarray

In [25]:
ds_MUR = xr.open_zarr(zarr_store)
print(ds_MUR)

<xarray.Dataset>
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2019-08-01T09:00:00
Dimensions without coordinates: lat
Data variables:
    analysed_sst      (time, lat, lon) float64 dask.array<chunksize=(1, 1023, 2047), meta=np.ndarray>
    analysis_error    (time, lat, lon) float64 dask.array<chunksize=(1, 1023, 2047), meta=np.ndarray>
    dt_1km_data       (time, lat, lon) timedelta64[ns] dask.array<chunksize=(1, 1447, 2895), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(1, 1447, 2895), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float64 dask.array<chunksize=(1, 1447, 2895), meta=np.ndarray>
    sst_anomaly       (time, lat, lon) float64 dask.array<chunksize=(1, 1023, 2047), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.7
    Metadata_Conventions:       Unidata 