# Download Data

### Purpose
This Jupyter notebook highlights two different methods for accessing and downloading data from Ocean Observatories Initiative (OOI) Carbon System instruments. The first method uses OOI's API to perform M2M (Machine-to-Machine) queries for data from the OOI THREDDS data server. The second method requests data from OOI's Data Explorer ERDDAP server.

#### THREDDS Data
The data served via OPeNDAP on OOI THREDDS servers are the same datasets that can be accessed via OOI's Data Portal at https://ooinet.oceanobservatories.org/. This is the source for accessing real-time or near-real-time data from OOI.


#### Data Explorer
Data Explorer is the new tool for exploring, discovering, and downloading data from OOI. It can be accessed via the web at https://dataexplorer.oceanobservatories.org/. Data Explorer hosts "gold copy" versions of OOI datasets, with all the relevant data streams merged into a single unified file. These datasets are hosted on the Data Explorer ERDDAP server. However, some datasets can currently only be downloaded from the Data Explorer website and not from the ERDDAP server.

---
## OOINet/THREDDS
First, we are going to access and download data from OOI's Data Portal. Then we will do some dataset reprocessing to make the resulting data easier and more intuitive to work with. This portion of the notebook relies on some community tools which have been developed by OOI's Data Team members which simplify interacting with OOI's API. The two tools are the OOINet tool (https://github.com/reedan88/OOINet) and the Data Explorations Modules (https://github.com/oceanobservatories/ooi-data-explorations).

This notebook provides an example on how to use the OOINet download tool to perform the following functions:
* Search for datasets
* Identify desired reference designator
* Get the associated metadata for a given reference designator
* Request netCDF datasets for a reference designator
* Download the netCDF dataset to your local machine

The key parameter which the OOI API requires is the "reference designator." A reference designator may be thought of as a type of instrument located at a fixed location and depth.
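
As an illustration, a reference designator can be split into the array, node, and instrument keys that the search step uses. The helper `split_refdes` below is a hypothetical convenience function, not part of the OOINet tool, and it assumes the array-node-instrument layout seen in designators such as `GP02HYPM-WFP02-04-CTDPFL000`.

```python
def split_refdes(refdes):
    """Split a reference designator into array, node, and instrument parts.

    Assumes the layout <array>-<node>-<instrument>, where the instrument
    key is everything after the second dash.
    """
    array, node, instrument = refdes.split("-", 2)
    return {"array": array, "node": node, "instrument": instrument}


parts = split_refdes("GP02HYPM-WFP02-04-CTDPFL000")
print(parts)
```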

In [1]:
# Import libraries
import os, shutil, sys, time, re, requests, csv, datetime, pytz
import yaml
import pandas as pd
import numpy as np
import netCDF4 as nc
import xarray as xr
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import the OOINet M2M tool
sys.path.append("../../M2M_tutorial/src/")
from pyOOI import M2M

---
## Search Datasets
First, we can search the available OOI reference designators ("refdes" for short) on the following keys: **array**, **node**, **instrument**. Additionally, we can request "**English_names**", which will return the descriptive names for the associated array, node, and instrument. Below, we search for the available instruments on the Global Station Papa Apex Profiler Mooring (array `GP02HYPM`).

The major caveat with the search is that, similar to searching ERDDAP datasets, the search terms must be a partial or full match based on OOI nomenclature. For example, we have to search for "PCO2", "PCO2AA", or the full instrument name "04-PCO2AA" if we are searching for the sea-surface pCO2 sensor. We can't search "pco2", "carbon dioxide" or other instrument terms.
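
The matching behaviour can be pictured with a small sketch. It assumes the search is a case-sensitive, literal substring match, which is consistent with the caveat above but is not taken from the OOINet source. The table and the function `search_refdes` are made up for illustration.

```python
import pandas as pd

# Toy catalog; the real table comes from the OOI API
catalog = pd.DataFrame({
    "refdes": [
        "GP02HYPM-WFP02-04-CTDPFL000",
        "GP02HYPM-WFP02-03-DOSTAL000",
        "GP02HYPM-WFP03-04-CTDPFL000",
    ]
})


def search_refdes(catalog, term):
    """Return rows whose refdes contains `term` (case-sensitive, literal)."""
    mask = catalog["refdes"].str.contains(term, regex=False)
    return catalog[mask]


hits_upper = search_refdes(catalog, "CTDPF")  # partial match on OOI nomenclature
hits_lower = search_refdes(catalog, "ctd")    # lowercase term finds nothing
print(len(hits_upper), len(hits_lower))
```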

gold_copy = 'http://thredds.dataexplorer.oceanobservatories.org/thredds/catalog/ooigoldcopy/public/'

In [4]:
instruments = M2M.search_datasets(array="GP02HYPM", English_names=True)
instruments


| | array | array_name | node | node_name | instrument | instrument_name | refdes | url | deployments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP03 | Wire-Following Profiler Lower | 05-VEL3DL000 | 3-D Single Point Velocity Meter | GP02HYPM-WFP03-05-VEL3DL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 1 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP03 | Wire-Following Profiler Lower | 04-CTDPFL000 | CTD | GP02HYPM-WFP03-04-CTDPFL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 2 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP03 | Wire-Following Profiler Lower | 03-DOSTAL000 | Dissolved Oxygen | GP02HYPM-WFP03-03-DOSTAL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 3 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP03 | Wire-Following Profiler Lower | 01-FLORDL000 | 2-Wavelength Fluorometer | GP02HYPM-WFP03-01-FLORDL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 4 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP03 | Wire-Following Profiler Lower | 00-WFPENG000 | Profiler Controller | GP02HYPM-WFP03-00-WFPENG000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 5 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP02 | Wire-Following Profiler Upper | 05-VEL3DL000 | 3-D Single Point Velocity Meter | GP02HYPM-WFP02-05-VEL3DL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 6 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP02 | Wire-Following Profiler Upper | 04-CTDPFL000 | CTD | GP02HYPM-WFP02-04-CTDPFL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 7 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP02 | Wire-Following Profiler Upper | 03-DOSTAL000 | Dissolved Oxygen | GP02HYPM-WFP02-03-DOSTAL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 8 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP02 | Wire-Following Profiler Upper | 01-FLORDL000 | 2-Wavelength Fluorometer | GP02HYPM-WFP02-01-FLORDL000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 9 | GP02HYPM | Global Station Papa Apex Profiler Mooring | WFP02 | Wire-Following Profiler Upper | 00-WFPENG000 | Profiler Controller | GP02HYPM-WFP02-00-WFPENG000 | https://ooinet.oceanobservatories.org/api/m2m/... | [1, 2, 3, 4, 5, 6, 7, 8, 9] |


From the returned list of available instruments above, we can select a particular instrument using its **reference designator** (refdes for short):

In [None]:
refdes = "GP02HYPM-WFP02-05-VEL3DL000"

---
## Metadata
Next, we can query OOINet for the metadata associated with the selected reference designator. The metadata contains valuable information such as the available methods and streams (which are required to download the data), the particleKeys (the data variable names), and the associated units.

In [None]:
metadata = M2M.get_metadata(refdes)
metadata

#### Sensor Parameters
Each instrument returns multiple parameters containing a variety of low-level instrument output and metadata. However, we are interested in science-relevant parameters. We can identify the science parameters based on the preload database, which designates the science parameters with a "data level" of L1 or L2. 

Consequently, we will want to filter and group the metadata for a given reference designator to identify the relevant parameters. First, we query the preload database with the relevant metadata for a reference designator. Then, we filter the metadata for the science-relevant data streams based on the preload information. Then, we reduce the results by grouping by the stream parameter to get the stream-by-stream data, which will be useful when requesting data from OOINet for download. 

In [None]:
data_levels = M2M.get_parameter_data_levels(metadata)
data_levels

Filter the metadata based on the data levels for **L1** & **L2** data

In [None]:
def filter_parameter_ids(pdId, pid_dict):
    """Return True if the parameter has a data level above zero (L1 or L2)."""
    data_level = pid_dict.get(pdId)
    return data_level is not None and data_level > 0

In [None]:
mask = metadata["pdId"].apply(lambda x: filter_parameter_ids(x, data_levels))
metadata = metadata[mask]

Group by reference designator, method, and stream to get the unique values for each data stream.

In [None]:
metadata = metadata.groupby(by=["refdes","method","stream"]).agg(lambda x: pd.unique(x.values.ravel()).tolist())
metadata = metadata.reset_index()
metadata = metadata.applymap(lambda x: x[0] if len(x) == 1 else x)
metadata.head()
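
To see the filtering and grouping steps in isolation, the sketch below repeats them on a small made-up table. The parameter ids, data levels, and stream names are invented for illustration and do not come from the preload database.

```python
import pandas as pd

# Toy metadata: two streams with made-up parameter ids (pdId)
toy = pd.DataFrame({
    "refdes": ["R1"] * 4,
    "method": ["telemetered"] * 2 + ["recovered_inst"] * 2,
    "stream": ["stream_a", "stream_a", "stream_b", "stream_b"],
    "pdId": ["PD1", "PD2", "PD1", "PD3"],
    "particleKey": ["temperature", "raw_counts", "temperature", "salinity"],
})
# Made-up data levels: 0 = raw, 1 and 2 = science (L1/L2)
levels = {"PD1": 1, "PD2": 0, "PD3": 2}

# Keep only parameters with a data level above zero
keep = toy["pdId"].map(lambda pid: (levels.get(pid) or 0) > 0)
science = toy[keep]

# One row per refdes-method-stream, listing the unique values in each column
grouped = (
    science.groupby(["refdes", "method", "stream"])
    .agg(lambda x: x.unique().tolist())
    .reset_index()
)
print(grouped)
```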

This returns all of the methods and streams which have scientific data. For PCO2W datasets, we want to drop the entries which have "blank" in them.

In [None]:
mask = metadata["stream"].apply(lambda x: "blank" not in x)
metadata = metadata[mask]
metadata

---
## Deployment Information
When we searched for datasets, it returned a table which listed the available deployment numbers for each of the datasets. We can get much more detailed information on the deployments for a particular reference designator by requesting the deployment information from OOINet.

In [None]:
deployments = M2M.get_deployments(refdes=refdes)
deployments

We'll go ahead and save the deployment data as a csv since it might be useful when working with the data.

In [None]:
deployments.to_csv(f"../data/{refdes}_deployments.csv", index=False)

---
## Vocab Information
Additionally, if we are interested in more detailed information on the location that the reference designator is assigned to, we can request the vocab information for the given reference designator. The vocab information includes some of the "**English_names**" info we requested when searching for datasets, as well as instrument model, manufacturer, and the descriptive names for the reference designator location.

In [None]:
vocab = M2M.get_vocab(refdes=refdes)
vocab

---
## Calibration Information
We can also request the calibration information for a given reference designator. Since individual instruments are swapped during each mooring deployment & recovery, the calibration coefficients for a reference designator are different for each deployment. The way OOI operates is that it loads all the available calibration coefficients for a given reference designator. Then, for each deployment, it finds the calibration coefficients with the most recent calibration date which most closely _precedes_ the start of the deployment. The result is a table, sorted by deployment number for a reference designator, with the uid of the specific instrument, its calibration coefficients, when the instrument was calibrated, and the source of the calibration coefficients.
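
The selection rule described above can be sketched with `pandas.merge_asof`. This is a minimal illustration on made-up dates and values, not the code OOI runs. The column name `deployStart` is hypothetical, and `merge_asof` also accepts a calibration dated exactly on the deployment start.

```python
import pandas as pd

# Made-up calibration dates for one coefficient and two deployments
cals = pd.DataFrame({
    "calDate": pd.to_datetime(["2019-01-10", "2019-11-05", "2020-09-20"]),
    "value": [1.01, 1.02, 1.03],
})
deploys = pd.DataFrame({
    "deploymentNumber": [1, 2],
    "deployStart": pd.to_datetime(["2019-06-01", "2020-07-15"]),
})

# For each deployment, take the latest calibration dated on or before its start
matched = pd.merge_asof(
    deploys.sort_values("deployStart"),
    cals.sort_values("calDate"),
    left_on="deployStart",
    right_on="calDate",
    direction="backward",
)
print(matched[["deploymentNumber", "calDate", "value"]])
```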

In [None]:
calibrations = M2M.get_calibrations(refdes, deployments)
calibrations

It is also possible to request the calibration history for a specific instrument by using the **uid** of the instrument together with the lower-level ```_get_api``` method and the ```URLS``` attribute to construct your own request.

In [None]:
# NOTE: `OOINet` is never imported in this notebook. The calls below assume the
# imported M2M module exposes the same `URLS`, `_get_api`, and `_convert_time` helpers.

# Set up the calibration url and arguments to pass to the request
cal_url = M2M.URLS["cal"]
uid = "CGINS-PHSENF-P0183" # This is unique to each instrument
params = {
    "uid": uid
}

# Make the request
calInfo = M2M._get_api(cal_url, params=params)

# Put the data into a pandas dataframe, sorted by calibration date and coefficient name
columns = ["uid", "calCoef", "calDate", "value", "calFile"]
records = []
for c in calInfo["calibration"]:
    for cc in c["calData"]:
        records.append({
            "uid": cc["assetUid"],
            "calCoef": cc["eventName"],
            "calDate": M2M._convert_time(cc["eventStartTime"]),
            "value": cc["value"],
            "calFile": cc["dataSource"]
        })
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of records
instrumentCals = pd.DataFrame(records, columns=columns)
instrumentCals.sort_values(by=["calDate", "calCoef"], inplace=True)
instrumentCals

---
## Download Datasets
The ultimate goal of the queries above was to identify which data stream(s) we are interested in, along with supporting metadata/calibration information, in order to request the data to download. Now we want to be able to request those data streams and get the associated netCDF files. This process involves the following steps:
1. Identify the methods and data streams for the selected reference designator
2. Request the THREDDS server url for the data sets
3. Get the catalog of datasets on the THREDDS server
4. Parse the catalog for the desired netCDF files
5. Download the identified netCDF files to a local directory

Below, we script the above steps in order to download all of the available datasets. In the following section we will combine the data delivered via different methods (e.g. telemetered, recovered_host, recovered_inst) to generate a single combined dataset with the most complete data record available.
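
Parsing the catalog (step 4) relies on information encoded in each file name. The sketch below pulls the deployment number and the start and end timestamps out of a name, using the same conventions that the `clean_catalog` function in this notebook relies on. The file name and the helper `parse_dataset_name` are illustrative assumptions, not values returned by THREDDS.

```python
import re

# Illustrative file name only; real names come from the THREDDS catalog
fname = ("deployment0003_GP02HYPM-WFP02-04-CTDPFL000-recovered_wfp-"
         "example_stream_20150601T000000-20160601T000000.nc")


def parse_dataset_name(name):
    """Extract the deployment number and start/end timestamps from a file name."""
    deployment = int(re.search(r"deployment(\d{4})", name).group(1))
    start, end = name.split("_")[-1].replace(".nc", "").split("-")
    return deployment, start, end


deployment, start, end = parse_dataset_name(fname)
print(deployment, start, end)
```

A file whose start and end timestamps are identical would be treated as empty and skipped.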

In [None]:
def clean_catalog(catalog, stream, deployments):
    """Clean up the netCDF catalog of unwanted datasets"""
    # Keep only the netCDF datasets with the data stream in the file name
    datasets = [dset for dset in catalog if stream in dset.split("/")[-1]]

    # Next, check that the netCDF datasets are not empty by getting the start and
    # end timestamps from the file name and checking if they are equal
    catalog = datasets
    datasets = []
    for dset in catalog:
        # Get the timestamps
        t1, t2 = dset.split("_")[-1].replace(".nc", "").split("-")
        # Keep the dataset only if the timestamps differ
        if t1 != t2:
            datasets.append(dset)

    # Next, determine if the dataset is either for the given instrument
    # or an ancillary instrument which supplies an input variable
    catalog = datasets
    datasets = []
    ancillary = []
    for dset in catalog:
        if re.search(stream, dset.split("/")[-1]) is None:
            ancillary.append(dset)
        else:
            datasets.append(dset)

    # Finally, check that deployment numbers match what is in deployments metadata
    valid_deployments = list(deployments["deploymentNumber"])
    catalog = datasets
    datasets = []
    for dset in catalog:
        # Raw string avoids an invalid-escape warning for "\d"
        dep = re.findall(r"deployment\d{4}", dset)[0]
        depNum = int(dep[-4:])
        if depNum in valid_deployments:
            datasets.append(dset)

    return datasets

In [None]:
for row in metadata.index:
    # Get the method and stream
    method, stream = metadata.loc[row,"method"], metadata.loc[row, "stream"]
    
    if "air" in stream:
        continue
    
   
    # Get the THREDDS url
    thredds_url = M2M.get_thredds_url(refdes, method, stream, goldCopy=True)
    
    # Get the catalog
    catalog = M2M.get_thredds_catalog(thredds_url)
    
    # Clean the catalog
    catalog = clean_catalog(catalog, stream, deployments)
    
    # Remove unwanted datasets from the catalog (build a new list rather than
    # removing items from the list while iterating over it)
    catalog = [dataset for dataset in catalog if "blank" not in dataset]
    
    # Create a directory to save the data
    save_dir = f"../data/{refdes}/{method}/"
    os.makedirs(save_dir, exist_ok=True)
    
    # Download the files to the save directory
    M2M.download_netCDF_files(catalog, goldCopy=True, saveDir=save_dir)

In [None]:
refdes, method, stream, save_dir

In [None]:
# List the contents of the parent directory
os.listdir("..")

In [None]:
save_dir

### Merge Datasets

With the datasets downloaded to a local directory, we now want to combine the datasets delivered via the different methods into a single dataset. This dataset should have the most complete data record available for the given reference designator.


#### Load the data
First, load the downloaded data into xarray datasets.

In [None]:
# List the telemetered data sets
telemetered_files = os.listdir(f"../data/{refdes}/telemetered")
telemetered_files = sorted([f"../data/{refdes}/telemetered/" + f for f in telemetered_files if "metbk" not in f])
telemetered_files

In [None]:
# For the PCO2A, need just the "water" data stream
telemetered_files = [f for f in telemetered_files if "air" not in f]
telemetered_files

In [None]:
# List the recovered_host data sets
recovered_host_files = os.listdir(f"../data/{refdes}/recovered_host")
recovered_host_files = sorted([f"../data/{refdes}/recovered_host/" + f for f in recovered_host_files if "metbk" not in f])
recovered_host_files

In [None]:
# For the PCO2A, need just the "water" data stream
recovered_host_files = [f for f in recovered_host_files if "air" not in f]
recovered_host_files

In [None]:
# List the recovered_inst data sets
recovered_inst_files = os.listdir(f"../data/{refdes}/recovered_inst")
recovered_inst_files = sorted([f"../data/{refdes}/recovered_inst/" + f for f in recovered_inst_files if "metbk" not in f])

recovered_inst_files

Load the datasets:

In [None]:
from dask.diagnostics import ProgressBar

In [None]:
refdes

In [None]:
def open_datasets(datasets, refdes):
    """Opens datasets saved locally into an xarray dataset."""
    
    M2M.REFDES = refdes
    
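    # Load the datasets into a concatenated xarray Dataset
    # NOTE: `OOINet` is never imported in this notebook; `M2M._preprocess` assumes
    # the imported M2M module provides the same preprocessing helper
    with ProgressBar():
        print("\n"+f"Loading netCDF_files for {M2M.REFDES}:")
        ds = xr.open_mfdataset(datasets, preprocess=M2M._preprocess, parallel=True)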
        
    # Add in the English name of the dataset
    refdes = "-".join(ds.attrs["id"].split("-")[:4])
    vocab = M2M.get_vocab(refdes)
    ds.attrs["Location_name"] = " ".join((vocab["tocL1"].iloc[0],
                                          vocab["tocL2"].iloc[0],
                                          vocab["tocL3"].iloc[0]))    

    return ds

In [None]:
refdes

In [None]:
tele_data = open_datasets(telemetered_files, refdes)
host_data = open_datasets(recovered_host_files, refdes)
inst_data = open_datasets(recovered_inst_files, refdes)

#### Optional Step: Process the dataset
An additional, optional step is to process the datasets with the ooi-data-explorations processing functions to clean them up.

In [None]:
sys.path.append("/home/areed/Documents/OOI/oceanobservatories/ooi-data-explorations/python/")

In [None]:
from ooi_data_explorations.uncabled import process_ctdbp

In [None]:
inst_data = process_ctdbp.ctdbp_instrument(inst_data)
host_data = process_ctdbp.ctdbp_datalogger(host_data)
tele_data = process_ctdbp.ctdbp_datalogger(tele_data)

#### Merge data
Now, we need to merge the data. First, we iterate through the data variables for each dataset, identify any which are unique to a given dataset, and broadcast them to the other datasets. This step is necessary to allow the datasets to combine. Once each dataset has the same data variables, we use xarray's ```combine_first``` method to combine the datasets. We assume that the instrument record, if available, is the best and most complete dataset, and use the telemetered and recovered_host datasets to fill in the gaps.
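
The gap-filling behaviour of `combine_first` is easiest to see on toy data. The sketch below uses two made-up records: the dataset the method is called on takes priority, and the other dataset only supplies values where the first is missing or has no time stamp.

```python
import numpy as np
import xarray as xr


def make_ds(days, values):
    """Build a toy dataset with one variable on a daily time axis."""
    time = np.array(days, dtype="datetime64[ns]")
    return xr.Dataset({"temp": ("time", values)}, coords={"time": time})


# Made-up records: the instrument record has a gap, the telemetered one extends later
inst = make_ds(["2020-01-01", "2020-01-02", "2020-01-03"], [10.0, np.nan, 12.0])
tele = make_ds(["2020-01-02", "2020-01-03", "2020-01-04"], [11.0, 99.0, 13.0])

# Values from `inst` take priority; `tele` only fills gaps and missing times
merged = inst.combine_first(tele)
print(merged["temp"].values)
```

The conflicting telemetered value (99.0) is discarded because the instrument record already has a valid value at that time.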

In [None]:
# Need to make sure each dataset has the same variables
for var in tele_data.variables:
    if var not in host_data.variables:
        host_data[var] = tele_data[var].broadcast_like(host_data["time"])
        
for var in host_data.variables:
    if var not in tele_data.variables:
        tele_data[var] = host_data[var].broadcast_like(tele_data["time"])

In [None]:
# Merge the telemetered dataset and host_dataset
tele_host = tele_data.combine_first(host_data)
data = tele_host

In [None]:
for var in tele_host.variables:
    if var not in inst_data.variables:
        inst_data[var] = tele_host[var].broadcast_like(inst_data["time"])

for var in inst_data.variables:
    if var not in tele_host.variables:
        tele_host[var] = inst_data[var].broadcast_like(tele_host["time"])

In [None]:
# Merge the instrument dataset with the combined telemetered-recovered_host dataset
data = inst_data.combine_first(tele_host)

#### Save the results
With the merged datasets, we can save the results locally as a netCDF file. However, some data variables contain improperly formatted datetimes and timestamps which will cause an error when saving. Generally, these data variables do not contain particularly useful information for a science-user and can be dropped before saving.
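
Dropping such variables can be done with `Dataset.drop_vars`. The sketch below runs on a toy dataset, and the variable names in `unwanted` are illustrative assumptions; the variables that actually cause problems depend on the instrument and stream.

```python
import numpy as np
import xarray as xr

toy_ds = xr.Dataset(
    {
        "temperature": ("time", [10.0, 11.0]),
        "driver_timestamp": ("time", [0.0, 1.0]),  # illustrative variable name
    },
    coords={"time": np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")},
)

# Hypothetical list of variables to discard; adjust to the dataset at hand
unwanted = ["driver_timestamp", "internal_timestamp"]

# errors="ignore" skips names that are not present in the dataset
toy_ds = toy_ds.drop_vars(unwanted, errors="ignore")
print(list(toy_ds.data_vars))
```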

In [None]:
data

In [None]:
refdes

Save the data as a netCDF file using the h5netcdf engine

In [None]:
data.to_netcdf(f"../data/{refdes}_combined.nc", engine="h5netcdf")

Close the dataset so it can be operated on

In [None]:
data.close()

In [None]:
os.listdir("../data/")

---
## Annotations
Annotations contain important qualitative assessments of data quality from the instrument operators. They may range from explanations for why data is missing for a given time period to information about biofouling or other data quality issues. Annotations can be downloaded from OOINet for a particular reference designator.
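
One common use of annotations is to flag the observations that fall inside an annotated time window. The sketch below shows the idea on made-up data. The column names `beginDT` and `endDT` are assumptions about the annotation table, and the real table may store times in a different format.

```python
import pandas as pd

# Made-up observation times and one made-up annotation window
times = pd.date_range("2020-01-01", periods=5, freq="D")
toy_annotations = pd.DataFrame({
    "beginDT": pd.to_datetime(["2020-01-02"]),
    "endDT": pd.to_datetime(["2020-01-03"]),
    "annotation": ["Suspected biofouling"],
})

# Flag every observation that falls inside any annotation window
flagged = pd.Series(False, index=times)
for _, row in toy_annotations.iterrows():
    in_window = (times >= row["beginDT"]) & (times <= row["endDT"])
    flagged = flagged | in_window
print(flagged)
```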

In [None]:
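# Download the annotations for the reference designator
# NOTE: `OOINet` is never imported in this notebook; this assumes the imported
# M2M module provides the same `get_annotations` function
annotations = M2M.get_annotations(refdes)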
annotations

Save the annotations to local directory

In [None]:
annotations.to_csv(f"../data/{refdes}_annotations.csv")

---
## Data Explorer
---
The data from Data Explorer are hosted via ERDDAP. To interact with Data Explorer's ERDDAP, we'll utilize the python package ```erddapy```.

In [None]:
from erddapy import ERDDAP

**Data Explorer ERDDAP url**

In [None]:
dataExplorer = "http://erddap.dataexplorer.oceanobservatories.org/erddap"

Connect to the Data Explorer ERDDAP

In [None]:
erd = ERDDAP(server=dataExplorer)

Search for ```PHSEN``` on the Irminger Array

In [None]:
search_url = erd.get_search_url(search_for="gi03flma phsen", 
                                protocol="tabledap",
                                response="csv")

Get the dataset ids for the available PHSEN datasets on the Irminger Sea Flanking Mooring A

In [None]:
dataset_ids = pd.read_csv(search_url)["Dataset ID"]
dataset_ids

Download the dataset from ERDDAP

In [None]:
# Select the dataset id of the instrument you want to download
dataset_id = "ooi-gi03flma-ris01-04-phsenf000"

# Get the download url
download_url = erd.get_download_url(dataset_id=dataset_id, 
                                    protocol="tabledap",
                                    response="opendap")

# Set up the parameters for the dataset request from the ERDDAP server
erd.dataset_id = dataset_id
erd.response = "nc"
erd.protocol = "tabledap"

Open the requested dataset using ```xarray```

In [None]:
ds = erd.to_xarray()
ds = ds.swap_dims({"obs":"time"})
ds = ds.sortby("time")
ds
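
For reference, a tabledap request is ultimately a URL of the form `<server>/tabledap/<dataset_id>.<file_type>?<variables>&<constraints>`. The sketch below builds such a URL with the standard library only. It is not how `erddapy` is implemented, the helper `tabledap_url` is hypothetical, and the time window is made up.

```python
from urllib.parse import quote


def tabledap_url(server, dataset_id, variables, constraints, file_type="csv"):
    """Build a tabledap URL: <server>/tabledap/<id>.<type>?<vars>&<constraints>."""
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode characters such as ">" and "<" in each constraint
        query += "&" + quote(constraint, safe="=:-.")
    return f"{server}/tabledap/{dataset_id}.{file_type}?{query}"


url = tabledap_url(
    "http://erddap.dataexplorer.oceanobservatories.org/erddap",
    "ooi-gi03flma-ris01-04-phsenf000",
    variables=["time"],
    constraints=["time>=2020-01-01T00:00:00Z", "time<=2020-02-01T00:00:00Z"],
)
print(url)
```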