# Discover, Customize and Access NSIDC DAAC Data

This notebook is based on the [NSIDC-Data-Access-Notebook](https://github.com/nsidc/NSIDC-Data-Access-Notebook) provided through NSIDC's GitHub organization.

Now that we've visualized our study areas, we will explore data coverage, size, and the availability of customization services (subsetting, reformatting, reprojection), and then access the associated files. The `Data access for all datasets` notebook provides the steps needed to subset and download all the data we'll be using in the final `Visualize and Analyze Data` notebook.

___A note on data access options:___
We will be pursuing data discovery and access "programmatically" using Application Programming Interfaces, or APIs. 

*What is an API?* You can think of an API as an intermediary between an application or end user (in this case, us) and a data provider. Here, the data provider is both the metadata repository housing data information (the Common Metadata Repository, or CMR) and NSIDC as the data distributor. Requests to these APIs are generally structured as a URL with a base plus individual key-value pairs separated by ‘&’.
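As a rough illustration of the "base plus key-value pairs" structure described above, the sketch below assembles a request URL with Python's standard library. The parameter values are only examples, and the variable names (`example_base`, `example_params`) are hypothetical.

```python
from urllib.parse import urlencode

# Base of the API endpoint (the CMR collections search URL used later in this notebook)
example_base = 'https://cmr.earthdata.nasa.gov/search/collections.json'

# Key-value pairs describing what we are asking for (illustrative values)
example_params = {'short_name': 'ATL07', 'version': '002'}

# urlencode joins each key=value pair with '&'
example_url = f'{example_base}?{urlencode(example_params)}'
print(example_url)
```

The `requests` package performs the same encoding for us whenever a dictionary is passed through its `params` argument.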

There are other discovery and access methods available from NSIDC as we walked through in the `Introduction` notebook, including access from the data set landing page 'Download Data' tab (e.g. [ATL07 Data Access](https://nsidc.org/data/atl07?qt-data_set_tabs=1#qt-data_set_tabs)) and [NASA Earthdata Search](https://search.earthdata.nasa.gov/).

Here are the steps you will learn in this customize and access notebook:
   
1. Search for data programmatically using the Common Metadata Repository API by time and area of interest.
2. Determine subsetting, reformatting, and reprojection capabilities for our data of interest.
3. Access and customize data using NSIDC's data access and service API.

[DISCOVER MOD29 ON YOUR OWN]


## Import packages


In [None]:
#%run functions.py

In [None]:
import requests
import getpass
import json

# This is our functions module. We created several functions used in this notebook and the Visualize and Analyze notebook.
import functions

## Explore data availability using the Common Metadata Repository 

The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.

https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

### Select data set and determine version number

Data sets are selected by data set IDs (e.g. ATL07). In the CMR API documentation, a data set ID is referred to as a "short name". [HOW DO YOU FIND A DATASET SHORTNAME?] We use the Python requests package to access the CMR. This search does not require a password.

In [None]:
CMR_COLLECTIONS_URL = 'https://cmr.earthdata.nasa.gov/search/collections.json'
response = requests.get(CMR_COLLECTIONS_URL, params={'short_name': 'ATL07'})

The response is then parsed as [JSON](https://en.wikipedia.org/wiki/JSON), a language-independent, human-readable, open-standard data format.

In [None]:
results = json.loads(response.content)
results

More than one version can exist for a given data set:

In [None]:
for entry in results['feed']['entry']:
    functions.print_cmr_metadata(entry)

We will specify the most recent version, `002`, for our remaining ATL07 queries.
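The most recent version could also be picked programmatically rather than by eye. The sketch below is one possible approach, run against a small mock of the response structure. It assumes each entry carries a numeric `version_id` string; the helper name `latest_version` and the mock values are hypothetical.

```python
# Mock of the relevant part of a CMR collections response (illustrative values only)
mock_results = {
    'feed': {
        'entry': [
            {'short_name': 'ATL07', 'version_id': '001'},
            {'short_name': 'ATL07', 'version_id': '002'},
        ]
    }
}

def latest_version(results):
    """Return the highest version_id, assuming every version_id is a numeric string."""
    versions = [entry['version_id'] for entry in results['feed']['entry']]
    # Compare numerically but return the original zero-padded string
    return max(versions, key=int)

print(latest_version(mock_results))
```

If a collection reports a non-numeric version, the `int` conversion raises a `ValueError`, so this sketch would need adjusting for such cases.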

### Select time and area of interest

We will create a simple dictionary with our short name, version, time, and area of interest. We'll continue to add to this dictionary as we discover more information about our data set.

In [None]:
# Bounding Box spatial parameter in decimal degree 'W,S,E,N' format. This is our region of interest over the East Siberian Sea.

bounding_box = '140,72,153,80' 

In [None]:
# Each date in yyyy-MM-ddTHH:mm:ssZ format
# Date range in start,end format

temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'

In [None]:
data_dict = {'short_name': 'ATL07', 
             'version': '002',
             'bounding_box': bounding_box, 
             'temporal': temporal }
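Because both search parameters are plain strings, a typo can go unnoticed until the query returns nothing. The following is a minimal sketch of how the formats described above could be checked before querying. The helper names `parse_bounding_box` and `parse_temporal` are hypothetical and are not part of the `functions` module.

```python
from datetime import datetime

def parse_bounding_box(bbox):
    """Split a 'W,S,E,N' string into floats and check the coordinate ranges."""
    west, south, east, north = (float(value) for value in bbox.split(','))
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError('Longitudes must fall between -180 and 180 degrees')
    if not (-90 <= south <= north <= 90):
        raise ValueError('Latitudes must fall between -90 and 90, with S <= N')
    return west, south, east, north

def parse_temporal(temporal_range):
    """Split a 'start,end' string of yyyy-MM-ddTHH:mm:ssZ dates into datetimes."""
    start, end = (datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
                  for value in temporal_range.split(','))
    if start > end:
        raise ValueError('Start time must not be later than end time')
    return start, end

print(parse_bounding_box('140,72,153,80'))
print(parse_temporal('2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'))
```

West is deliberately allowed to exceed east, since a box that crosses the antimeridian is described that way.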

### Determine how many files exist over this time and area of interest, as well as the average size and total volume of those granules

We will use the `granule_info` function to query the CMR granule API. The function prints the number of granules, their average size, and their total volume. It returns the number of granules, which we will add to our data dictionary.

In [None]:
gran_num = functions.granule_info(data_dict)
data_dict['gran_num'] = gran_num

Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.
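To make the reported statistics concrete, here is a minimal sketch of how a count, average size, and total volume could be computed from a list of granule records. It is not the implementation of `granule_info`. It assumes each record has a `granule_size` field holding a size in MB as a string, and it uses made-up values in place of a real CMR response.

```python
def summarize_granules(granules):
    """Return (count, average size, total size) for a list of granule records."""
    sizes = [float(granule['granule_size']) for granule in granules]
    count = len(sizes)
    total = sum(sizes)
    average = total / count if count else 0.0
    return count, average, total

# Made-up granule records standing in for a CMR granule search result
mock_granules = [
    {'granule_size': '100.0'},
    {'granule_size': '150.0'},
    {'granule_size': '200.0'},
]

count, average, total = summarize_granules(mock_granules)
print(f'{count} granules, {average:.1f} MB on average, {total:.1f} MB in total')
```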

## Determine the subsetting, reformatting, and reprojection services enabled for your data set of interest

The NSIDC DAAC supports customization (subsetting, reformatting, reprojection) services on many of our NASA Earthdata mission collections. Let's discover whether or not our data set has these services available using the `print_service_options` function. If services are available, we will also determine the specific service options supported for this data set, which we will then add to our data dictionary. 

### Input Earthdata Login credentials

An Earthdata Login account is required to query data services and to access data from the NSIDC DAAC. If you do not already have an Earthdata Login account, visit http://urs.earthdata.nasa.gov to register. We will input our credentials below, and we'll add our email address to our dictionary for use in our final access request.

In [None]:
#Store Earthdata Login user name
uid = 'amy.steiker'

#Input and store Earthdata Login password
pswd = getpass.getpass('Earthdata Login password: ')

#Store email associated with Earthdata Login account
email = 'amy.steiker@nsidc.org'
data_dict['email'] = email

We now need to create an HTTP session in order to store cookies and pass our credentials to the data service URLs. The capability URL below is what we will query to determine service information. 

In [None]:
# Query service capability URL 
capability_url = f'https://n5eil02u.ecs.nsidc.org/egi/capabilities/{data_dict["short_name"]}.{data_dict["version"]}.xml'

# Create session to store cookie and pass credentials to capabilities url
session = requests.session()
s = session.get(capability_url)
response = session.get(s.url,auth=(uid,pswd))
# Raise an error if the request failed (for example, a 401 status after an incorrect password)
response.raise_for_status()

This function provides a list of all available services:

In [None]:
functions.print_service_options(data_dict, response)

### Populate data dictionary with services of interest

We already added our CMR search keywords to our data dictionary, so now we need to add the service options we want to request. A list of all available service keywords for use with NSIDC's access and service API is available in our [Key-Value-Pair table](https://nsidc.org/support/tool/table-key-value-pair-kvp-operands-subsetting-reformatting-and-reprojection-services), as part of our [Programmatic access guide](https://nsidc.org/support/how/how-do-i-programmatically-request-data-services). For our ATL07 request, we are interested in bounding box, temporal, and variable subsetting. These options crop the data values to the specified ranges and variables of interest. We will enter those values into our data dictionary below:

[MAKE SURE THE RANGE ONLY RETURNS 3 GRANULES]


In [None]:
# Bounding box subsetting
# Output files are cropped to the specified bounding box extent.
# Just like with the CMR bounding box search parameter, this value is provided in decimal degree 'W,S,E,N' format. 

data_dict['bbox'] = '140,72,153,80'

# Temporal subsetting 
# Output files are cropped to the specified temporal range extent.
# Each date in yyyy-MM-ddTHH:mm:ss format
# Date range in start,end format

data_dict['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59'

# Variable subsetting 
# Subsets the data set variable or group of variables. For hierarchical data, all lower level variables are returned if a variable group or subgroup is specified. 
# For ATL07, we will use only strong beams since these groups contain higher coverage and resolution due to higher surface returns. 
# According to the user guide, the spacecraft was in the backwards orientation during our day of interest, setting the gt*l beams as the strong beams. 
# We'll use these primary geolocation, height and quality variables of interest for each of the three strong beams:

data_dict['coverage'] = '/gt1l/sea_ice_segments/delta_time,\
/gt1l/sea_ice_segments/latitude,\
/gt1l/sea_ice_segments/longitude,\
/gt1l/sea_ice_segments/heights/height_segment_confidence,\
/gt1l/sea_ice_segments/heights/height_segment_height,\
/gt1l/sea_ice_segments/heights/height_segment_quality,\
/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt1l/sea_ice_segments/heights/height_segment_length_seg,\
/gt2l/sea_ice_segments/delta_time,\
/gt2l/sea_ice_segments/latitude,\
/gt2l/sea_ice_segments/longitude,\
/gt2l/sea_ice_segments/heights/height_segment_confidence,\
/gt2l/sea_ice_segments/heights/height_segment_height,\
/gt2l/sea_ice_segments/heights/height_segment_quality,\
/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt2l/sea_ice_segments/heights/height_segment_length_seg,\
/gt3l/sea_ice_segments/delta_time,\
/gt3l/sea_ice_segments/latitude,\
/gt3l/sea_ice_segments/longitude,\
/gt3l/sea_ice_segments/heights/height_segment_confidence,\
/gt3l/sea_ice_segments/heights/height_segment_height,\
/gt3l/sea_ice_segments/heights/height_segment_quality,\
/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt3l/sea_ice_segments/heights/height_segment_length_seg'
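Since the same eight variables are repeated for each strong beam, the coverage string can also be generated rather than typed out. The sketch below is one way to do this and should produce the same comma-separated list as the cell above. The names `strong_beams`, `segment_variables`, and `coverage` are hypothetical.

```python
# The three strong beams for our day of interest
strong_beams = ['gt1l', 'gt2l', 'gt3l']

# Variable paths relative to each beam's sea_ice_segments group
segment_variables = [
    'delta_time',
    'latitude',
    'longitude',
    'heights/height_segment_confidence',
    'heights/height_segment_height',
    'heights/height_segment_quality',
    'heights/height_segment_surface_error_est',
    'heights/height_segment_length_seg',
]

# Build one '/beam/sea_ice_segments/variable' path per combination, beam by beam
coverage = ','.join(
    f'/{beam}/sea_ice_segments/{variable}'
    for beam in strong_beams
    for variable in segment_variables
)
print(coverage)
```

Changing the beam list (for example to the `gt*r` beams when the spacecraft orientation differs) then only requires editing `strong_beams`.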

### Select data access configurations

The data request can be accessed asynchronously or synchronously. The asynchronous option will allow concurrent requests to be queued and processed without the need for a continuous connection. Those requested orders will be delivered to the specified email address, or they can be accessed programmatically as shown below. Synchronous requests will automatically download the data as soon as processing is complete. For this tutorial, we will be selecting the asynchronous method. 

In [None]:
# Set NSIDC data access base URL
base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

# Set the request mode to asynchronous
data_dict['request_mode'] = 'async'

# Set the page size, which equals the number of output files returned. The maximum is 2000, so we will set to this value to ensure the maximum is returned.
# data_dict['page_size'] = 2000

## Create the data request API endpoint 
Programmatic API requests are formatted as HTTPS URLs that contain key-value pairs specifying the service operations that we selected above. We will first create a string of key-value pairs from our data dictionary, and then feed it into our API endpoint below.

In [None]:
# Create a new param_dict with CMR configuration parameters removed from our data_dict
cmr_only_keys = ('gran_num', 'page_num', 'page_size')
param_dict = {k: v for k, v in data_dict.items() if k not in cmr_only_keys}

# Convert param_dict to a string of key=value pairs joined by '&'
param_string = '&'.join(f'{k}={v}' for k, v in param_dict.items())

The following API endpoint can be executed via command line, a web browser, or in Python below. 

In [None]:
#Print API base URL + request parameters

API_request = f'{base_url}?{param_string}'
print(API_request)
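Before sending a long request, it can help to split the URL back into its key-value pairs and confirm that each parameter survived intact. The sketch below does this with the standard library on a shortened, illustrative request string. The variable names `example_request` and `parsed_params` are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

# Shortened example of a request URL (illustrative subset of parameters)
example_request = (
    'https://n5eil02u.ecs.nsidc.org/egi/request'
    '?short_name=ATL07&version=002&bbox=140,72,153,80&request_mode=async'
)

# Separate the query string from the base URL, then split it into key-value pairs
parsed_params = dict(parse_qsl(urlsplit(example_request).query))

for key, value in parsed_params.items():
    print(f'{key}: {value}')
```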

## Request data

We will now download data using the `request_data` function, which utilizes the Python requests library. Our param_dict and HTTP session will be passed to the function to allow Earthdata Login access. The data will be downloaded directly to this notebook directory in a new Outputs folder. The progress of the order will be reported.

In [None]:
functions.request_data(param_dict,session)

### Finally, we will clean up the Output folder by removing individual order folders using the `clean_folder` function.

In [None]:
functions.clean_folder()

To review, we have explored data availability and volume over a region and time of interest, discovered and selected data customization options, constructed API endpoints for our requests, and downloaded data. Let's move on to the analysis portion of the tutorial.