# Discover, Customize and Access NSIDC DAAC Data

This notebook is based on the [NSIDC-Data-Access-Notebook](https://github.com/nsidc/NSIDC-Data-Access-Notebook) provided through NSIDC's GitHub organization.

Now that we've visualized our study areas, we will first explore data coverage, size, and customization (subsetting, reformatting, reprojection) service availability, and then access the associated files. The __Data access for all datasets__ notebook provides the steps needed to subset and download all the data we'll be using in the final __Visualize and Analyze Data__ notebook.

___A note on data access options:___
We will be pursuing data discovery and access "programmatically" using Application Programming Interfaces, or APIs. 

*What is an API?* You can think of an API as an intermediary between an application or end user (in this case, us) and a data provider. In this case the data provider is both the Common Metadata Repository (CMR) housing data information, and NSIDC as the data distributor. These APIs are generally structured as a URL with a base plus individual key-value-pairs separated by ‘&’.
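
As a minimal sketch of that URL structure, the snippet below assembles a CMR granule search URL from a base and a dictionary of key-value pairs. The `granules.json` search path and the parameter values are chosen only for illustration and are not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Illustrative base URL: the CMR granule search endpoint (JSON response)
base = 'https://cmr.earthdata.nasa.gov/search/granules.json'

# Key-value pairs describing the search (example values only)
params = {'short_name': 'MOD29', 'version': '6', 'page_size': 10}

# urlencode writes each pair as key=value and joins the pairs with '&'
example_url = f'{base}?{urlencode(params)}'
print(example_url)
```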

There are other discovery and access methods available from NSIDC including access from the data set landing page 'Download Data' tab (e.g. [ATL07 Data Access](https://nsidc.org/data/atl07?qt-data_set_tabs=1#qt-data_set_tabs)) and [NASA Earthdata Search](https://search.earthdata.nasa.gov/). Programmatic API access is beneficial for those of you who want to incorporate data access into your visualization and analysis workflow. This method is also reproducible and documented to ensure data provenance. 

Here are the steps you will learn in this customize and access notebook:
   
1. Search for data programmatically using the Common Metadata Repository API by time and area of interest.
2. Determine subsetting, reformatting, and reprojection capabilities for our data of interest.
3. Access and customize data using NSIDC's data access and service API.

## Import packages


In [125]:
from statistics import mean
import json
import random

from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
from pprint import pprint
import getpass
import requests

# This is our functions module. We created several functions used in this notebook and the Visualize and Analyze notebook.
import tutorial_helper_functions as fn 

## Configure a CMR GraphQL client

Using `gql` we can communicate with the CMR GraphQL endpoint in a standards-based way, allowing for schema introspection. `gql` isn't the only Python GraphQL client library out there. Other libraries might provide features you value, like `gql-next`'s static type generation functionality.

In [83]:
CMR_GRAPHQL_URL = 'https://graphql.earthdata.nasa.gov/api'
sample_transport=RequestsHTTPTransport(
    url=CMR_GRAPHQL_URL,
    retries=3,            # Automatically retry, don't put it on the user!
)

client = Client(
    transport=sample_transport,
    fetch_schema_from_transport=True,  # Get the schema as part of the client object
)

collection_schema = client.schema.get_type('Collection')

# Show info about 5 random fields
sample_fields = random.sample(list(collection_schema.fields.items()), 5)
for fieldname, field in sample_fields:
    print(f'* {fieldname}: {field.description}')

* variables: 
* datasetId: True if any of its associated services support spatial subsetting.
* abstract: A brief description of the collection or service the metadata represents.
* hasTemporalSubsetting: True if any of its associated services support temporal subsetting.
* hasFormats: True if there are multiple supported formats for any services associated with the collection.


## Explore data availability using the Common Metadata Repository 

The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs. Note that not all NSIDC data can be searched at the file level using CMR, particularly those outside of the NASA DAAC program. 

General CMR API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

GraphQL endpoint documentation and interactive playground: https://graphql.earthdata.nasa.gov/api

### Select data set and determine version number

Data sets are selected by data set IDs (e.g. ATL07). In the CMR API documentation, a data set ID is referred to as a "short name". These short names are located at the top of each NSIDC data set landing page in gray above the full title. We are using the `gql` client, which sends its requests through the Python Requests package, to access the CMR. Responses are returned in [JSON](https://en.wikipedia.org/wiki/JSON) format, a language-independent, human-readable, open-standard file format. More than one version can exist for a given data set:

In [158]:
query = gql('''
query { 
  collections(shortName: "MOD29") {
    items { 
      shortName
      datasetId
      conceptId
      versionId
    }
  }
}
''')

# As a user, I don't need to know about "feed" or "entry", just the fields I'm interested in!
response = client.execute(query)
pprint(response)

{'collections': {'items': [{'conceptId': 'C1219248592-LANCEMODIS',
                            'datasetId': 'MODIS/Terra Near Real Time (NRT) Sea '
                                         'Ice Extent 5-Min L2 Swath 1km',
                            'shortName': 'MOD29',
                            'versionId': '6NRT'},
                           {'conceptId': 'C61468238-NSIDC_ECS',
                            'datasetId': 'MODIS/Terra Sea Ice Extent 5-Min L2 '
                                         'Swath 1km V005',
                            'shortName': 'MOD29',
                            'versionId': '5'},
                           {'conceptId': 'C1000001160-NSIDC_ECS',
                            'datasetId': 'MODIS/Terra Sea Ice Extent 5-Min L2 '
                                         'Swath 1km V006',
                            'shortName': 'MOD29',
                            'versionId': '6'}]}}


We will specify the most recent version for our remaining data set queries.
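
One possible way to pick that version programmatically is sketched below. The `latest_version` helper is hypothetical (it is not part of `tutorial_helper_functions`), and it assumes that near-real-time entries such as `6NRT` should be skipped and that the remaining version IDs are plain integers. The sample items mirror the response printed above.

```python
def latest_version(items):
    """Return the highest purely numeric versionId among CMR collection items."""
    # Skip IDs such as '6NRT' that are not made up of digits only (assumption)
    numeric = [item['versionId'] for item in items if item['versionId'].isdigit()]
    if not numeric:
        raise ValueError('No numeric version IDs found')
    # Compare as integers but return the original string (keeps e.g. '002' intact)
    return max(numeric, key=int)


# Items shaped like response['collections']['items'] from the query above
sample_items = [
    {'shortName': 'MOD29', 'versionId': '6NRT'},
    {'shortName': 'MOD29', 'versionId': '5'},
    {'shortName': 'MOD29', 'versionId': '6'},
]
print(latest_version(sample_items))
```

Returning the original string rather than the integer matters because the version must be entered exactly as CMR reports it.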

### Select time and area of interest

We will create a simple dictionary with our short name, version, time, and area of interest. We'll continue to add to this dictionary as we discover more information about our data set. The bounding box coordinates cover our region of interest over the East Siberian Sea and the temporal range covers March 23, 2019. 

In [115]:
# Bounding Box spatial parameter in decimal degree 'W,S,E,N' format.
bounding_box = '140,72,153,80'
# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'
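
If you prefer to start from numbers and `datetime` objects rather than typing these strings by hand, a small sketch like the following can build them. Both helper names are hypothetical, and the range checks are only a basic sanity test, not a rule enforced by CMR.

```python
from datetime import datetime


def make_bounding_box(west, south, east, north):
    """Return a 'W,S,E,N' string in decimal degrees after basic range checks."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError('Longitudes must be between -180 and 180')
    if not (-90 <= south <= north <= 90):
        raise ValueError('Latitudes must satisfy -90 <= south <= north <= 90')
    return f'{west},{south},{east},{north}'


def make_temporal(start, end):
    """Return 'start,end' with each date in yyyy-MM-ddTHH:mm:ssZ format."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return f'{start.strftime(fmt)},{end.strftime(fmt)}'


example_bbox = make_bounding_box(140, 72, 153, 80)
example_temporal = make_temporal(datetime(2019, 3, 23, 0, 0, 0),
                                 datetime(2019, 3, 23, 23, 59, 59))
print(example_bbox)
print(example_temporal)
```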

Start our data dictionary with our data set, version, and area and time of interest. 

**Note that some version IDs include 3 digits and others include only 1 digit. Make sure to enter this value exactly as reported above.**

In [116]:
# TODO: consider rename to "dataset_info"? egi_query? To me, "data_dict" indicates there's data 
# inside. I want the name of the variable to tell me its purpose.
data_dict = {'short_name': 'MOD29', 
             'version': '6',
             'bounding_box': bounding_box, 
             'temporal': temporal }

### Determine how many files exist over this time and area of interest, as well as the average size and total volume of those granules

We will use the `gql` library once more, this time to query the CMR granule API. We will look at the results and print the number of granules, average size, and total volume of those granules.

Finally, we update the data_dict with the granule count. TODO: Is this necessary?

In [157]:
# TODO: Get rid of data_dict entirely? or re-use it here? Currently hardcoding bbox, temporal.
# NOTE: GraphQL endpoint currently supports selecting granules by conceptId, not short_name, versionId.
query = gql('''
query {
  granules(conceptId: "C1000001160-NSIDC_ECS"
           boundingBox: "140,72,153,80"
           temporal: "2019-03-23T00:00:00Z,2019-03-23T23:59:59Z"
           limit: 100) {
    count
    items { granuleSize }
           
  }
}
''')
response = client.execute(query)

# Now that it's so easy to get the info we need, I think granule_info can go away.
granule_sizes = [float(i['granuleSize']) for i in response['granules']['items']]

print(f"Found {response['granules']['count']} granules")
print(f"Average size: {mean(granule_sizes):.2f}")
print(f"Total size: {sum(granule_sizes):.2f}")

data_dict['gran_num'] = int(response['granules']['count'])

Found 13 granules
Average size: 2.75
Total size: 35.70
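
The query above hardcodes the collection, bounding box, and temporal range. A minimal sketch of building the same query text from Python values is shown below; `build_granule_query` is a hypothetical helper, and it assumes the argument names used in the query above (`conceptId`, `boundingBox`, `temporal`, `limit`) stay the same.

```python
import json


def build_granule_query(concept_id, bounding_box, temporal, limit=100):
    """Return GraphQL query text for a granule search like the one above."""
    # json.dumps adds the surrounding double quotes and escapes special characters
    return f'''
query {{
  granules(conceptId: {json.dumps(concept_id)}
           boundingBox: {json.dumps(bounding_box)}
           temporal: {json.dumps(temporal)}
           limit: {int(limit)}) {{
    count
    items {{ granuleSize }}
  }}
}}
'''


query_text = build_granule_query(
    'C1000001160-NSIDC_ECS',
    '140,72,153,80',
    '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z',
)
print(query_text)
```

The resulting text can be wrapped in `gql(...)` and passed to `client.execute(...)` in the same way as the hardcoded query.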


Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.

## ***On your own***: Discover data availability for ATL07

Go back to the "Select data set and determine version number" heading. Replace all `MOD29` instances with `ATL07` along with its most recent version number, keeping your time and area of interest the same. ***Note that ATL07 has a 3-digit version number.*** How does the data volume compare to MOD29? 

____

## Determine the subsetting, reformatting, and reprojection services enabled for your data set of interest.

The NSIDC DAAC supports customization (subsetting, reformatting, reprojection) services on many of our NASA Earthdata mission collections. Let's discover whether or not our data set has these services available using the `print_service_options` function. If services are available, we will also determine the specific service options supported for this data set, which we will then add to our data dictionary. 

### Input Earthdata Login credentials

An Earthdata Login account is required to query data services and to access data from the NSIDC DAAC. If you do not already have an Earthdata Login account, visit http://urs.earthdata.nasa.gov to register. We will input our credentials below, and we'll add our email address to our dictionary for use in our final access request.

In [None]:
uid = '' # Enter Earthdata Login user name

pswd = getpass.getpass('Earthdata Login password: ') # Input and store Earthdata Login password

email = '' # Enter email associated with Earthdata Login account

data_dict['email'] = email # Add to data dictionary

We now need to create an HTTP session in order to store cookies and pass our credentials to the data service URLs. The capability URL below is what we will query to determine service information. 

In [None]:
# Query service capability URL 
capability_url = f'https://n5eil02u.ecs.nsidc.org/egi/capabilities/{data_dict["short_name"]}.{data_dict["version"]}.xml' 

# Create session to store cookie and pass credentials to capabilities url
session = requests.session() 
s = session.get(capability_url)
response = session.get(s.url,auth=(uid,pswd))
response.raise_for_status() # Raise bad request to check that Earthdata Login credentials were accepted 

This function provides a list of all available services:

In [None]:
fn.print_service_options(data_dict, response)

### Populate data dictionary with services of interest

We already added our CMR search keywords to our data dictionary, so now we need to add the service options we want to request. A list of all available service keywords for use with NSIDC's access and service API is available in our [Key-Value-Pair table](https://nsidc.org/support/tool/table-key-value-pair-kvp-operands-subsetting-reformatting-and-reprojection-services), as a part of our [Programmatic access guide](https://nsidc.org/support/how/how-do-i-programmatically-request-data-services). For our ATL07 request, we are interested in bounding box, temporal, and variable subsetting. These options crop the data values to the specified ranges and variables of interest. We will enter those values into our data dictionary below.


__Bounding box subsetting:__ Output files are cropped to the specified bounding box extent.

__Temporal subsetting:__ Output files are cropped to the specified temporal range extent.

In [None]:
data_dict['bbox'] = '140,72,153,80' # Just like with the CMR bounding box search parameter, this value is provided in decimal degree 'W,S,E,N' format. 
data_dict['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59' # Each date in yyyy-MM-ddTHH:mm:ss format; Date range in start,end format
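
Note that the `time` value has no trailing `Z`, unlike the CMR `temporal` value defined earlier. Rather than retyping the range, one option is to derive it from the CMR string, as in this sketch (the helper name is hypothetical):

```python
def cmr_temporal_to_service_time(temporal):
    """Convert 'startZ,endZ' (CMR format) to 'start,end' without the trailing Z."""
    start, end = temporal.split(',')
    return ','.join(stamp.rstrip('Z') for stamp in (start, end))


cmr_temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'
service_time = cmr_temporal_to_service_time(cmr_temporal)
print(service_time)
```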

__Variable subsetting:__ Subsets the data set variable or group of variables. For hierarchical data, all lower level variables are returned if a variable group or subgroup is specified. 

For ATL07, we will use only strong beams since these groups contain higher coverage and resolution due to higher surface returns. According to the user guide, the spacecraft was in the backwards orientation during our day of interest, setting the `gt*l` beams as the strong beams. 

We'll use these primary geolocation, height and quality variables of interest for each of the three strong beams. The following descriptions are provided in the [ATL07 Data Dictionary](https://nsidc.org/sites/nsidc.org/files/technical-references/ATL07-data-dictionary-v001.pdf), with additional information on the algorithm and variable descriptions in the [ATBD (Algorithm Theoretical Basis Document)](https://icesat-2.gsfc.nasa.gov/sites/default/files/page_files/ICESat2_ATL07_ATL10_ATBD_r002.pdf).

`delta_time`: Number of GPS seconds since the ATLAS SDP epoch. 

`latitude`: Latitude, WGS84, North=+, Lat of segment center

`longitude`: Longitude, WGS84, East=+, Lon of segment center

`height_segment_height`: Mean height from along-track segment fit determined by the sea ice algorithm

`height_segment_confidence`: Confidence level in the surface height estimate based on the number of photons; the background noise rate; and the error analysis

`height_segment_quality`: Height segment quality flag, 1 is good quality, 0 is bad

`height_segment_surface_error_est`: Error estimate of the surface height (reported in meters)

`height_segment_length_seg`: Along-track length of segment containing n_photons_actual

In [None]:
data_dict['coverage'] = '/gt1l/sea_ice_segments/delta_time,\
/gt1l/sea_ice_segments/latitude,\
/gt1l/sea_ice_segments/longitude,\
/gt1l/sea_ice_segments/heights/height_segment_confidence,\
/gt1l/sea_ice_segments/heights/height_segment_height,\
/gt1l/sea_ice_segments/heights/height_segment_quality,\
/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt1l/sea_ice_segments/heights/height_segment_length_seg,\
/gt2l/sea_ice_segments/delta_time,\
/gt2l/sea_ice_segments/latitude,\
/gt2l/sea_ice_segments/longitude,\
/gt2l/sea_ice_segments/heights/height_segment_confidence,\
/gt2l/sea_ice_segments/heights/height_segment_height,\
/gt2l/sea_ice_segments/heights/height_segment_quality,\
/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt2l/sea_ice_segments/heights/height_segment_length_seg,\
/gt3l/sea_ice_segments/delta_time,\
/gt3l/sea_ice_segments/latitude,\
/gt3l/sea_ice_segments/longitude,\
/gt3l/sea_ice_segments/heights/height_segment_confidence,\
/gt3l/sea_ice_segments/heights/height_segment_height,\
/gt3l/sea_ice_segments/heights/height_segment_quality,\
/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt3l/sea_ice_segments/heights/height_segment_length_seg'
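
Because the same eight variables repeat for each strong beam, the coverage string can also be generated instead of typed out. The sketch below is one way to do that, assuming the beam and variable paths listed above; it produces the same comma-separated value.

```python
# Strong beams for our day of interest (backwards spacecraft orientation)
beams = ['gt1l', 'gt2l', 'gt3l']

# Variable paths relative to each beam's sea_ice_segments group
variables = [
    'delta_time',
    'latitude',
    'longitude',
    'heights/height_segment_confidence',
    'heights/height_segment_height',
    'heights/height_segment_quality',
    'heights/height_segment_surface_error_est',
    'heights/height_segment_length_seg',
]

# One '/beam/sea_ice_segments/variable' path per beam and variable, comma-separated
coverage = ','.join(
    f'/{beam}/sea_ice_segments/{var}' for beam in beams for var in variables
)
print(coverage)
```

Switching to the `gt*r` beams for a forward-orientation day then only requires changing the `beams` list.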


### Select data access configurations

The data request can be accessed asynchronously or synchronously. The asynchronous option will allow concurrent requests to be queued and processed as orders. Those requested orders will be delivered to the specified email address, or they can be accessed programmatically as shown below. Synchronous requests will automatically download the data as soon as processing is complete. For this tutorial, we will be selecting the asynchronous method. 

In [None]:
base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request' # Set NSIDC data access base URL
data_dict['request_mode'] = 'async' # Set the request mode to asynchronous
data_dict['page_size'] = 2000 # Set the page size to the maximum of 2000, which equals the number of output files that can be returned

## Create the data request API endpoint 
Programmatic API requests are formatted as HTTPS URLs that contain key-value-pairs specifying the service operations that we specified above. We will first create a string of key-value-pairs from our data dictionary and we'll feed those into our API endpoint. This API endpoint can be executed via command line, a web browser, or in Python below. 

In [None]:
# Create a new param_dict with CMR configuration parameters removed from our data_dict
param_dict = {k: v for k, v in data_dict.items() if k not in ('gran_num', 'page_num')}

# Convert param_dict to a string of key=value pairs joined by '&'
param_string = '&'.join(f'{k}={v}' for k, v in param_dict.items())

API_request = f'{base_url}?{param_string}' 
print(API_request) # Print API base URL + request parameters
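
As an alternative sketch, the standard library's `urllib.parse.urlencode` can build the same kind of parameter string while percent-encoding any characters that are not URL-safe. The `example_params` dictionary below is a shortened stand-in for `param_dict`, and the choice to leave commas, colons, and slashes unescaped (via `safe`) is an assumption made to keep the output readable.

```python
from urllib.parse import urlencode

# Shortened stand-in for param_dict (example values from this notebook)
example_params = {
    'short_name': 'MOD29',
    'version': '6',
    'bbox': '140,72,153,80',
    'time': '2019-03-23T00:00:00,2019-03-23T23:59:59',
    'request_mode': 'async',
    'page_size': 2000,
}

# Leave ',', ':' and '/' unescaped; other reserved characters are percent-encoded
example_string = urlencode(example_params, safe=',:/')
print(example_string)
```

With this approach a value such as an email address has its `@` encoded as `%40`, which a web server decodes back to the original character.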

## Request data and clean up Output folder

We will now download data using the `request_data` function, which utilizes the Python requests library. Our param_dict and HTTP session will be passed to the function to allow Earthdata Login access. The data will be downloaded directly to this notebook directory in a new Outputs folder. The progress of the order will be reported. The data are returned in separate files, so we'll use the `clean_folder` function to remove those individual folders.

In [None]:
fn.request_data(param_dict,session)
fn.clean_folder()

To review, we have explored data availability and volume over a region and time of interest, discovered and selected data customization options, constructed API endpoints for our requests, and downloaded data. Let's move on to the analysis portion of the tutorial.