# Sea ice and ocean data access and analysis 

Accessing coincident sea ice and ocean data to study melt pond characteristics.

Intro to use case, motivation, connections to cloud migration, learning objectives...

## Explore NASA Earthdata sea ice and ocean products...

... highlight key search terms and data availability across NASA DAACs...

## Import packages

In [45]:
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
from pprint import pprint
import getpass
import requests
import json
import random
from statistics import mean
import pandas as pd
import geopandas

## Data Discovery

Start by identifying your study area and exploring coincident data over the same time and area.

NASA Earthdata Search can be used to visualize file coverage over multiple data sets and to access the same data you will be working with below: [insert URL with same search here]


### Identify time and area of interest


Study area: using an OpenAltimetry (OA) melt pond annotation as an example: https://openaltimetry.org/data/icesat2/?start_date=2019-06-22&annoId=180

- Temporal coverage: 22 June 2019
- Bounding area: -62.8,81.7,-56.4,83



<img align="left"
     src="OpenAltimetry-study-area.png">

In [2]:
# Bounding box spatial parameter in decimal degrees, 'W,S,E,N' format.
bounding_box = '-62.8,81.7,-56.4,83'
# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-06-22T00:00:00Z,2019-06-22T23:59:59Z'

### Explore data availability using the Common Metadata Repository
The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.

General CMR API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

[below is background for development - will not include in final notebook]

#### Configure a CMR GraphQL client
Using gql we can communicate with the CMR GraphQL endpoint in a standards-based way, allowing for schema introspection. gql isn't the only Python GraphQL client library out there. Other libraries might provide features you value, like gql-next's static type generation functionality.

GraphQL endpoint documentation and interactive playground: https://graphql.earthdata.nasa.gov/api

In [3]:
CMR_GRAPHQL_URL = 'https://graphql.earthdata.nasa.gov/api'
sample_transport=RequestsHTTPTransport(
    url=CMR_GRAPHQL_URL,
    retries=3,            # Automatically retry, don't put it on the user!
)

client = Client(
    transport=sample_transport,
    fetch_schema_from_transport=True,  # Get the schema as part of the client object
)

collection_schema = client.schema.get_type('Collection')

# Show info about 5 random fields
sample_fields = random.sample(list(collection_schema.fields.items()), 5)
for fieldname, field in sample_fields:
    print(f'* {fieldname}: {field.description}')

* points: Spatial coverage of the collection
* coordinateSystem: Coordinate system of the metadata.
* abstract: A brief description of the collection the metadata represents.
* tags: Tags associated with the collection. It includes sub-elements of tagKey and optional data which is in embedded JSON.
* polygons: Spatial coverage of the collection


### Select data sets and determine version numbers

Data products:
- Sea Surface Temperature: 
    * MODIS (Terra) SST: MODIS_T-JPL-L2P-v2014.0
    * SMAP SSS L3 expected in cloud Ops starting with 20.4.2.
    * GRACE-FO in cloud Ops now.
- Sea Ice Height:
    * ATL07

Data sets are selected by data set ID (e.g. ATL07). In the CMR API documentation, a data set ID is referred to as a "short name". These short names are located at the top of each NSIDC data set landing page in gray above the full title. We use the gql client configured above to query CMR. The response is returned as JSON, a language-independent, human-readable, open-standard data format, which gql exposes as a Python dictionary. More than one version can exist for a given data set:

In [24]:
height_query = gql('''
query { 
  collections(shortName: "ATL07") {
    items { 
      shortName
      datasetId
      conceptId
      versionId
    }
  }
}
''')

height_response = client.execute(height_query)

sst_query = gql('''
query { 
  collections(shortName: "MODIS_T-JPL-L2P-v2014.0") {
    items { 
      shortName
      datasetId
      conceptId
      versionId
    }
  }
}
''')

sst_response = client.execute(sst_query)
pprint(height_response)
pprint(sst_response)

{'collections': {'items': [{'conceptId': 'C1631076780-NSIDC_ECS',
                            'datasetId': 'ATLAS/ICESat-2 L3A Sea Ice Height '
                                         'V002',
                            'shortName': 'ATL07',
                            'versionId': '002'},
                           {'conceptId': 'C1706334166-NSIDC_ECS',
                            'datasetId': 'ATLAS/ICESat-2 L3A Sea Ice Height '
                                         'V003',
                            'shortName': 'ATL07',
                            'versionId': '003'}]}}
{'collections': {'items': [{'conceptId': 'C1648596996-PODAAC',
                            'datasetId': 'GHRSST Level 2P Global Skin Sea '
                                         'Surface Temperature from the '
                                         'Moderate Resolution Imaging '
                                         'Spectroradiometer (MODIS) on the '
                                         'NASA Terra 

We will specify the most recent version for our remaining data set queries.
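Picking the most recent version can also be done programmatically rather than by reading the output. The sketch below assumes that `versionId` values can be read as numbers (true for the `'002'` and `'003'` strings in the ATL07 response above, but not guaranteed for every collection). The function name `latest_version` is hypothetical.

```python
def latest_version(items):
    """Return the collection record with the highest numeric versionId.

    Assumes every versionId parses as a number; non-numeric ids
    would need a different comparison.
    """
    return max(items, key=lambda item: float(item['versionId']))


# Sample records copied from the ATL07 response shown above
sample_items = [
    {'conceptId': 'C1631076780-NSIDC_ECS', 'shortName': 'ATL07', 'versionId': '002'},
    {'conceptId': 'C1706334166-NSIDC_ECS', 'shortName': 'ATL07', 'versionId': '003'},
]

latest = latest_version(sample_items)
print(latest['versionId'], latest['conceptId'])
```

With a live response, the same call would be `latest_version(height_response['collections']['items'])`.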

### Determine how many files exist over this time and area of interest, as well as the average size and total volume of those granules
We will use the gql library once more, this time to query the CMR granule API. We will look at the results and print the number of granules, average size, and total volume of those granules.

In [30]:
# NOTE: GraphQL endpoint currently supports selecting granules by conceptId, not short_name, versionId.
query = gql('''
query {
  granules(conceptId: "C1706334166-NSIDC_ECS"
           boundingBox: "-62.8,81.7,-56.4,83"
           temporal: "2019-06-22T00:00:00Z,2019-06-22T23:59:59Z"
           limit: 100) {
    count
    items { granuleSize }
           
  }
}
''')
response = client.execute(query)

granule_sizes = [float(i['granuleSize']) for i in response['granules']['items']]

print(f"Found {response['granules']['count']} granules")
print(f"Average size: {mean(granule_sizes):.2f}")
print(f"Total size: {sum(granule_sizes):.2f}")

Found 2 granules
Average size: 230.85
Total size: 461.70


### ***On your own: Discover data availability for MODIS Skin Sea Surface Temperature***
Replace the ATL07 `conceptId` value with the MODIS Skin Sea Surface Temperature `conceptId` value returned above. How do the number of files and the data volume compare to ATL07?

Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.
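For the exercise above, one way to avoid hand-editing the query text is to build it from parameters. This is a minimal sketch using only the standard library. `GRANULE_QUERY` and `build_granule_query` are hypothetical names, and the query body simply mirrors the granule query used earlier in this section.

```python
from string import Template

# Same query shape as the ATL07 granule query, with placeholders
GRANULE_QUERY = Template('''
query {
  granules(conceptId: "$concept_id"
           boundingBox: "$bounding_box"
           temporal: "$temporal"
           limit: $limit) {
    count
    items { granuleSize }
  }
}
''')


def build_granule_query(concept_id, bounding_box, temporal, limit=100):
    """Fill the placeholders and return the GraphQL query text."""
    return GRANULE_QUERY.substitute(
        concept_id=concept_id,
        bounding_box=bounding_box,
        temporal=temporal,
        limit=limit,
    )


query_text = build_granule_query(
    'C1706334166-NSIDC_ECS',
    '-62.8,81.7,-56.4,83',
    '2019-06-22T00:00:00Z,2019-06-22T23:59:59Z',
)
print(query_text)
```

The resulting text can be passed to `gql(query_text)` and then to `client.execute(...)` as before. Only the first argument needs to change to query a different collection.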



## Data Access

Depending on the NSIDC data set, we could utilize CMR UMM-S associations to demonstrate ESI subsetting service availability. Or for "native" data access, we could also access using Harmony (all CMR data can be accessed using Harmony no-proc service), or just through CMR data access URLs. 

The PO.DAAC data can be subset via the Harmony subsetting service.

### Determine subsetting capabilities for ATL07...

(Is Icepyx not working currently? Getting an error below..)

In [33]:
import sys  # needed for sys.executable in the pip command below

# Install icepyx into the environment of the current Jupyter kernel
!{sys.executable} -m pip install icepyx
import icepyx as ipx

region_a = ipx.Query('ATL07', [-62.8, 81.7, -56.4, 83], ['2019-06-22', '2019-06-22'],
                     start_time='00:00:00', end_time='23:59:59')

region_a.earthdata_login('amy.steiker','amy.steiker@nsidc.org')

Collecting icepyx
  Downloading icepyx-0.3.0-py3-none-any.whl (38 kB)
Collecting pre-commit
  Downloading pre_commit-2.8.2-py2.py3-none-any.whl (184 kB)
Collecting datetime
  Downloading DateTime-4.3-py2.py3-none-any.whl (60 kB)
Collecting toml
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting cfgv>=2.0.0
  Downloading cfgv-3.2.0-py2.py3-none-any.whl (7.3 kB)
Collecting virtualenv>=20.0.8
  Downloading virtualenv-20.1.0-py2.py3-none-any.whl (4.9 MB)
Collecting identify>=1.0.0
  Downloading identify-1.5.9-py2.py3-none-any.whl (97 kB)
Collecting nodeenv>=0.11.1
  Downloading nodeenv-1.5.0-py2.py3-none-any.whl (21 kB)
Collecting zope.interface
  Downloading zope.interface-5.1.2-cp37-

AttributeError: module 'icepyx' has no attribute 'Query'

Using UMM-S association...

In [44]:
height_query = gql('''
query { 
  collections(conceptId: "C1706334166-NSIDC_ECS") {
    items { 
      shortName
      datasetId
      conceptId
      versionId
      services {
          items {
          serviceOptions
          }
      }
    }
  }
}
''')

height_response = client.execute(height_query)
pprint(height_response['collections'])

{'items': [{'conceptId': 'C1706334166-NSIDC_ECS',
            'datasetId': 'ATLAS/ICESat-2 L3A Sea Ice Height V003',
            'services': {'items': [{'serviceOptions': {'subset': {}}},
                                   {'serviceOptions': {'subset': {'spatialSubset': {'boundingBox': {'allowMultipleValues': False},
                                                                                    'shapefile': [{'format': 'ESRI'},
                                                                                                  {'format': 'KML'},
                                                                                                  {'format': 'GeoJSON'}]},
                                                                  'temporalSubset': {'allowMultipleValues': False},
                                                                  'variableSubset': {'allowMultipleValues': True}},
                                                       'supportedReformattings': [{'support
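The nested `serviceOptions` output is hard to scan by eye. As a minimal sketch, the helper below (hypothetical name `summarize_subsetting`) lists which subsetting types are advertised across a collection's services. The sample records reproduce only the part of the structure visible in the output above.

```python
def summarize_subsetting(services):
    """Return the sorted subsetting types found across service records."""
    found = set()
    for service in services:
        # Either level may be missing or empty, so fall back to {}
        subset = (service.get('serviceOptions') or {}).get('subset') or {}
        found.update(subset.keys())
    return sorted(found)


# Sample records copied from the visible part of the response above
sample_services = [
    {'serviceOptions': {'subset': {}}},
    {'serviceOptions': {'subset': {
        'spatialSubset': {'boundingBox': {'allowMultipleValues': False},
                          'shapefile': [{'format': 'ESRI'},
                                        {'format': 'KML'},
                                        {'format': 'GeoJSON'}]},
        'temporalSubset': {'allowMultipleValues': False},
        'variableSubset': {'allowMultipleValues': True}}}},
]

print(summarize_subsetting(sample_services))
```

With a live response, the input would be `height_response['collections']['items'][0]['services']['items']`.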

### Grab OPeNDAP URLs for MODIS...

In [None]:
# Create new dictionary with fields needed for CMR url search
# NOTE: `search_df` and the helper module `fn` (which provides `cmr_filter_urls`)
# are not defined in this notebook; create or import them before running this cell.
import numpy as np

url_df = search_df.drop(columns=['start_date', 'end_date', 'version', 'dataset_id'])
url_dict = url_df.to_dict('records')

# CMR search variables
granule_search_url = 'https://cmr.earthdata.nasa.gov/search/granules'
headers = {'Accept': 'application/json'}

# Create URL list from each df row
urls = []
for params in url_dict:
    response = requests.get(granule_search_url, params=params, headers=headers)
    results = response.json()
    urls.append(fn.cmr_filter_urls(results))
# flatten url list
urls = list(np.concatenate(urls))
urls
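The cell above relies on a URL filter that is not shown in this notebook. The sketch below is a hypothetical stand-in, not the author's `fn.cmr_filter_urls`. It assumes the CMR JSON layout (`feed` → `entry` → `links`) and keeps only links whose `rel` value ends in `/data#`, which is how data download links are marked in CMR granule search results.

```python
def filter_data_urls(search_results):
    """Return hrefs of links marked as data downloads (hypothetical helper)."""
    urls = []
    for entry in search_results.get('feed', {}).get('entry', []):
        for link in entry.get('links', []):
            # Data links use a rel ending in '/data#'; browse images use '/browse#'
            if link.get('rel', '').endswith('/data#'):
                urls.append(link['href'])
    return urls


# Small hand-made response with one data link and one browse link
sample_results = {'feed': {'entry': [{'links': [
    {'href': 'https://example.com/granule.h5',
     'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#'},
    {'href': 'https://example.com/granule_BRW.jpg',
     'rel': 'http://esipfed.org/ns/fedsearch/1.1/browse#'},
]}]}}

print(filter_data_urls(sample_results))
```

The `example.com` URLs are placeholders. OPeNDAP links may carry a different `rel` value, so the condition would need adjusting to select those.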

## Data Comparison

- Any reprojection or resampling needed? 
- Simple plotting of data 

# Move to different notebook: Point search testing...

Argo float data from [this search](https://nrlgodae1.nrlmry.navy.mil/cgi-bin/argo_select.pl?startyear=2020&startmonth=11&startday=01&endyear=2020&endmonth=11&endday=03&Nlat=75&Wlon=-80&Elon=-45&Slat=50&dac=ALL&floatid=ALL&gentype=plt&.submit=++Go++&.cgifields=endyear&.cgifields=dac&.cgifields=delayed&.cgifields=startyear&.cgifields=endmonth&.cgifields=endday&.cgifields=startday&.cgifields=startmonth&.cgifields=gentype)

Pull in file with lat lon point locations:

In [54]:
df = pd.read_csv("argo-float-data.csv")
gdf = geopandas.GeoDataFrame(df, geometry = geopandas.points_from_xy(df.Longitude, df.Latitude))
gdf

Unnamed: 0,Date,Latitude,Longitude,FloatID,DAC,geometry
0,20201101,58.033,-47.105,6901170,bodc,POINT (-47.10500 58.03300)
1,20201101,56.267,-54.291,4902510,meds,POINT (-54.29100 56.26700)
2,20201101,57.178,-53.264,4902509,meds,POINT (-53.26400 57.17800)
3,20201101,57.389,-51.571,4902505,meds,POINT (-51.57100 57.38900)
4,20201101,59.921,-50.34,4902471,meds,POINT (-50.34000 59.92100)
5,20201101,54.456,-50.419,3901669,coriolis,POINT (-50.41900 54.45600)
6,20201101,54.339,-47.566,6901191,bodc,POINT (-47.56600 54.33900)
7,20201101,66.973,-57.687,6902952,coriolis,POINT (-57.68700 66.97300)
8,20201101,58.859,-58.043,6901194,bodc,POINT (-58.04300 58.85900)
9,20201101,57.803,-54.672,3901668,coriolis,POINT (-54.67200 57.80300)


In [55]:
gdf.to_file('argo-data.geojson', driver='GeoJSON')

curl -XPOST "https://cmr.earthdata.nasa.gov/search/granules" -F "shapefile=@argo-data.geojson;type=application/geo+json" -F "collection_concept_id=C1706334166-NSIDC_ECS" -F "page_size=100"

In [69]:
search_url = "https://cmr.earthdata.nasa.gov/search/granules"
parameters = {
    "scroll": "true",
    "page_size": 100,
    # set any search criteria here
    "collection_concept_id": "C1706334166-NSIDC_ECS",
}
output_format = "json"
# Open the GeoJSON in binary mode so it is uploaded unchanged and closed afterwards
with open('argo-data.geojson', 'rb') as geojson_file:
    files = {"shapefile": ("argo-data.geojson", geojson_file, "application/geo+json")}
    response = requests.post(f"{search_url}.{output_format}", data=parameters, files=files)

print("status:", response.status_code)
print("hits:", response.headers["CMR-Hits"])
pprint(response.json()["feed"]["entry"][0])


status: 200
hits: 500
{'browse_flag': True,
 'collection_concept_id': 'C1706334166-NSIDC_ECS',
 'coordinate_system': 'ORBIT',
 'data_center': 'NSIDC_ECS',
 'dataset_id': 'ATLAS/ICESat-2 L3A Sea Ice Height V003',
 'granule_size': '27.4623594284',
 'id': 'G1814912564-NSIDC_ECS',
 'links': [{'href': 'https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL07.003/2018.10.14/ATL07-01_20181014062057_02390101_003_02.h5',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
            'type': 'application/x-hdfeos'},
           {'href': 'https://n5eil01u.ecs.nsidc.org/DP0/BRWS/Browse.001/2020.06.05/ATL07-01_20181014062057_02390101_003_02_BRW.default.default1.jpg',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/browse#',
            'type': 'image/jpeg'},
           {'href': 'https://n5eil01u.ecs.nsidc.org/DP0/BRWS/Browse.001/2020.06.05/ATL07-01_20181014062057_02390101_003_02_BRW.default.default2.jpg',
            'hrefl
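The search above reports 500 hits but requests only 100 results per page, so a single response holds only part of the result set. With `scroll` enabled, CMR returns a `CMR-Scroll-Id` response header, and repeating the request with that header is expected to return the next page. The loop below is a minimal sketch of that pattern. `collect_all_entries` and `fake_fetch` are hypothetical names, and the fetcher is stubbed out so the example runs without network access.

```python
def collect_all_entries(fetch_page):
    """Call fetch_page() until it returns an empty page; return all entries."""
    entries = []
    while True:
        page = fetch_page()
        if not page:  # an empty page means there is nothing left to fetch
            break
        entries.extend(page)
    return entries


# Stand-in for repeated CMR requests: 500 fake granule ids in pages of 100
_pages = iter([[f'G{n}' for n in range(start, start + 100)]
               for start in range(0, 500, 100)])


def fake_fetch():
    """Return the next fake page, or an empty list when pages run out."""
    return next(_pages, [])


all_entries = collect_all_entries(fake_fetch)
print(len(all_entries))
```

In the notebook, `fetch_page` would wrap the `requests.post(...)` call with the same parameters plus the `CMR-Scroll-Id` header from the first response, and return `response.json()["feed"]["entry"]`.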