In [2]:
import requests
import pandas as pd
import intake
import search
# import search.ErddapReader as ErddapReader
# import search.axdsReader as axdsReader
# import search.localReader as localReader
# from search.Data import Data
import xarray as xr
import numpy as np
from joblib import Parallel, delayed
import multiprocessing

# Search for data

The databases checked are:
1. ERDDAP servers for IOOS and Coastwatch
2. Axiom databases, for types `platform2` and `layer_group`
3. Local user-input files.

Each reader has the attributes `dataset_ids`, `meta`, and `data`, and the `search.Data` class wraps them. I think the readers are in reasonable shape, but the design of the `search.Data` class still needs work.

Important questions:

1. Data is returned as either dataframes or datasets. Is it ok to have two types?
2. Currently dataset_ids, metadata, and data are returned in dictionary form, with one entry for each reader/source and separate entries for each dataset. Should these be combined instead?
3. How should this be sped up? Reading the data is pretty slow. Save locally? Even setting up the lazy Datasets has been sometimes slow, sometimes fast.
4. The names are pretty bad, especially "search" and "data", and the reader and variable names aren't always great either.

## Quick Demos

Find dataset_ids for 1 known ERDDAP server (just one to save time).

In [13]:
# # SFBOFS
# kw = {
#     "min_lon": -124.0,
#     "max_lon": -122.0,
#     "min_lat": 36.0,
#     "max_lat": 40.0,
#     "min_time": '2021-4-1', # "2016-07-10T00:00:00",#Z",
#     "max_time": '2021-4-2', # "2017-02-10T00:00:00",#Z"
# }
# # Gulf of Mexico 
# kw = {
#     "min_lon": -99.0,
#     "max_lon": -88.0,
#     "min_lat": 20.0,
#     "max_lat": 30.0,
#     "min_time": '2020-1-1', # "2016-07-10T00:00:00",#Z",
#     "max_time": '2021-1-2', # "2017-02-10T00:00:00",#Z"
# }
# # full U.S.
# kw = {
#     "min_lon": -195,# -99.0,
#     "max_lon": -60, #-88.0,
#     "min_lat": 17, #20.0,
#     "max_lat": 80, #30.0,
#     "min_time": '2021-2-1', # "2016-07-10T00:00:00",#Z",
#     "max_time": '2021-4-1', # "2017-02-10T00:00:00",#Z"
# }

In [14]:
kw = {
    "min_lon": -124.0,
    "max_lon": -123.0,
    "min_lat": 39.0,
    "max_lat": 40.0,
    "min_time": '2021-4-1',
    "max_time": '2021-4-2',
}

# setup Data search object
data = search.Data(kw=kw, approach='region', 
                   readers=search.ErddapReader, 
                   ErddapReader={'known_server': 'ioos'})

# find dataset_ids to make sure it works
data.dataset_ids[0][:5]



['gov_noaa_nws_hads_ddde4328',
 'noaa_nos_co_ops_9417426',
 'gov_usgs_waterdata_11461000',
 'gov_usgs_waterdata_11468500',
 'gov_noaa_nws_kuki']

You can also get the metadata and data, though it might take some time:

In [15]:
# to find metadata:
# data.meta

# to find data:
# data.data

Set up a search for all datasets in the region given by `kw`:

In [16]:
data = search.Data(kw=kw, approach='region')

Find the data for a station you care about when you only know its name at the source. Only one dataset_id is returned, but it comes in a list of lists because more than one reader setup was checked for that station name.

In [17]:
data = search.Data(approach='stations', stations='8770475')
data.dataset_ids

[['noaa_nos_co_ops_8770475'], [], [], [], []]

Get the associated metadata by indexing into the zeroth entry, since that is where the dataset_id was found.

In [18]:
data.meta[0]

| | database | download_url | geospatial_lat_min | geospatial_lat_max | geospatial_lon_min | geospatial_lon_max | time_coverage_start | time_coverage_end | defaultDataQuery | subsetVariables | keywords | id | infoUrl | institution | featureType | source | sourceUrl | variable names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| noaa_nos_co_ops_8770475 | http://erddap.sensors.ioos.us/erddap | http://erddap.sensors.ioos.us/erddap/tabledap/... | 29.8667 | 29.8667 | -93.93 | -93.93 | 2015-05-05T13:00:00Z | 2021-05-10T12:29:00Z | sea_surface_height_above_sea_level_geoid_mllw,... | | | 45587 | https://sensors.ioos.us/#metadata/45587/station | NOAA Center for Operational Oceanographic Prod... | TimeSeries | | https://sensors.axds.co/api/ | |


Get the data by first indexing into the list, then by station name:

In [19]:
data.data[0]['noaa_nos_co_ops_8770475']

| time (UTC) | latitude (degrees_north) | longitude (degrees_east) | z (m) | air_pressure (millibars) | air_temperature (degree_Celsius) | sea_water_temperature (degree_Celsius) | sea_surface_height_amplitude_due_to_geocentric_ocean_tide_geoid_mllw (cm) | sea_surface_height_above_sea_level_geoid_mllw (m) | wind_speed_of_gust (mile.hour-1) | wind_speed (m.s-1) | wind_from_direction (degrees) | station |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-04-02 00:00:00+00:00 | 29.8667 | -93.93 | 0.0 | 1027.6 | 18.5 | 19.6 | | 0.380 | 16.553 | 5.8 | 6.0 | Port Arthur, TX |
| 2021-04-01 23:54:00+00:00 | 29.8667 | -93.93 | 0.0 | 1027.6 | 18.6 | 19.6 | | 0.389 | 17.672 | 5.7 | 4.0 | Port Arthur, TX |
| 2021-04-01 23:48:00+00:00 | 29.8667 | -93.93 | 0.0 | 1027.6 | 18.7 | 19.6 | 30.0 | 0.397 | 18.790 | 5.9 | 358.0 | Port Arthur, TX |
| 2021-04-01 23:42:00+00:00 | 29.8667 | -93.93 | 0.0 | 1027.5 | 18.7 | 19.6 | | 0.428 | 17.001 | 5.2 | 10.0 | Port Arthur, TX |
| 2021-04-01 23:36:00+00:00 | 29.8667 | -93.93 | 0.0 | 1027.5 | 18.8 | 19.6 | | 0.409 | 18.119 | 6.6 | 9.0 | Port Arthur, TX |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-01 00:24:00+00:00 | 29.8667 | -93.93 | 0.0 | 1022.9 | 16.4 | 20.7 | | 0.329 | 21.698 | 6.4 | 5.0 | Port Arthur, TX |
| 2021-04-01 00:18:00+00:00 | 29.8667 | -93.93 | 0.0 | 1022.8 | 16.4 | 20.7 | | 0.340 | 22.593 | 7.5 | 3.0 | Port Arthur, TX |
| 2021-04-01 00:12:00+00:00 | 29.8667 | -93.93 | 0.0 | 1022.8 | 16.3 | 20.7 | | 0.343 | 20.132 | 7.1 | 4.0 | Port Arthur, TX |
| 2021-04-01 00:06:00+00:00 | 29.8667 | -93.93 | 0.0 | 1022.6 | 16.3 | 20.8 | | 0.308 | 24.606 | 6.7 | 9.0 | Port Arthur, TX |


## In Detail

### General Options

#### Parallel

You can control readers individually as needed. For example, every reader accepts the keyword `parallel`. You can input it per reader (in case you want different values for different readers), or input it once for all readers by including it at the top level of `kwargs`. Parallel runs use `Parallel` and `delayed` from `joblib`, together with `multiprocessing`, to run loop iterations on different cores.

In [25]:
kwargs = {
          'kw': kw, 
          'approach': 'region',
          'parallel': True,    
          'ErddapReader': {
                           'known_server': 'ioos',
#                            'parallel': False,
                           'variables': 'salinity',
          },
          'axdsReader': {'catalog_name': None,
#                          'parallel': True,
                         'axds_type': 'platform2',
                         'variables': 'Salinity'},
          }
data = search.Data(**kwargs)
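
The text above does not say which value wins when `parallel` is given both at the top level and for one reader. A minimal sketch, assuming the reader-specific value takes precedence over the general one; `resolve_option` is a hypothetical helper for illustration and not part of the `search` package:

```python
def resolve_option(kwargs, reader_name, option, default=None):
    """Return the value of `option` that applies to one reader.

    Assumed precedence: reader-specific entry, then the top-level entry,
    then `default`.
    """
    reader_kwargs = kwargs.get(reader_name, {})
    if option in reader_kwargs:
        return reader_kwargs[option]
    return kwargs.get(option, default)


kwargs = {
    "approach": "region",
    "parallel": True,  # general value, seen by every reader
    "ErddapReader": {"known_server": "ioos", "parallel": False},  # override
    "axdsReader": {"axds_type": "platform2"},  # no override, inherits True
}

erddap_parallel = resolve_option(kwargs, "ErddapReader", "parallel")
axds_parallel = resolve_option(kwargs, "axdsReader", "parallel")
print(erddap_parallel, axds_parallel)
```

With this rule, uncommenting one of the per-reader `'parallel'` lines in the cell above would change only that reader.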

#### Reader Choice

Your reader choices can be selected as follows, where the `ErddapReader` connects to ERDDAP servers, the `axdsReader` connects to Axiom databases, and the `localReader` enables easy local file read-in. If you don't input any reader, it will use all of them. Alternatively you can input some subset.

In [26]:
readers = [search.ErddapReader,
           search.axdsReader,
           search.localReader]

Use only ERDDAP reader and Axiom reader:

In [27]:
data = search.Data(kw=kw, approach='region', 
                   readers=[search.ErddapReader,
                            search.axdsReader])
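
The demos pass `readers` as a single reader in one place and as a list in another, and omit it entirely elsewhere. One way such an argument could be normalized is sketched below; `normalize_readers` and the string stand-ins for the reader modules are assumptions for illustration, not the package's actual code:

```python
# String stand-ins for search.ErddapReader, search.axdsReader, search.localReader.
ALL_READERS = ["ErddapReader", "axdsReader", "localReader"]


def normalize_readers(readers=None, all_readers=ALL_READERS):
    """Always return a list of readers.

    None        -> every available reader
    list/tuple  -> that subset, in the given order
    single item -> one-element list
    """
    if readers is None:
        return list(all_readers)
    if isinstance(readers, (list, tuple)):
        return list(readers)
    return [readers]


print(normalize_readers())
print(normalize_readers("ErddapReader"))
print(normalize_readers(["ErddapReader", "axdsReader"]))
```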

### By Region

To return all variables, don't input `variables` at all, or set it to `None`:

In [28]:
kwargs = {
          'kw': kw, 
          'approach': 'region',
          'readers': [search.ErddapReader,
                      search.axdsReader],
          'variables': None
}
data = search.Data(**kwargs)

#### By Variable(s)

If no `variables` are specified for a given reader, datasets with any variables will be returned from a search. This is most relevant for a `region` search.

If you do want to specify one or more variables, keep in mind that different readers use different variable names, so a single variable name cannot be input for all readers at once.

Variable selection currently applies only to the ERDDAP and Axiom readers (the local reader retains all variables in its files), and within the Axiom reader only `axds_type='platform2'` searches by variable.

Let's say you want to search for salinity. You can start from the base of the word as `variables`: "sal" or "salinity" will find candidates, but "salt" will not, since the search looks for variable names containing the whole input string and "salt" is not part of "salinity". The code then checks that each input exactly matches a known variable name, and if it does not, it throws an error with suggestions. The match is not chosen automatically because, for example, "soil_salinity" also matches "salinity". You need to do this separately for each `known_server` of the `ErddapReader`.

TODO: DEFAULT VARIABLES FOR BOEM SET UP

In [29]:
kwargs = {
          'kw': kw, 
          'approach': 'region',
          'stations': '8771972',
          'readers': [search.ErddapReader,
                      search.axdsReader],
                    
          'ErddapReader': {
                          'known_server': ['coastwatch','ioos'],
                           'variables': [['sal'],
                                         ['sal']]
          },
          'axdsReader': {
                          'axds_type': ['platform2','layer_group'],
                         'variables': ['sal',None]},
}


data = search.Data(**kwargs)

AssertionError: The input variables are not exact matches to ok variables for known_server ioos.                      
Check all parameter group values with `ErddapReader().all_variables()`                      
or search parameter group values with `ErddapReader().search_variables(['sal'])`.                     

 Try some of the following variables:
                                                count
variable                                             
salinity                                          952
salinity_qc                                       952
sea_water_practical_salinity                      778
soil_salinity_qc_agg                              622
soil_salinity                                     622
...                                               ...
sea_water_practical_salinity_4161sc_a_qc_tests      1
sea_water_practical_salinity_6754mc_a_qc_tests      1
sea_water_practical_salinity_6754mc_a_qc_agg        1
sea_water_practical_salinity_4161sc_a_qc_agg        1
sea_water_practical_salinity_10091sc_a              1

[1148 rows x 1 columns]

You can do this process iteratively, trying out variables for each of the ERDDAP and Axiom readers until you get what you want. Once you have selected variables that match, the code won't complain anymore.

In [30]:
kwargs = {
          'kw': kw, 
          'approach': 'region',
          'readers': [search.ErddapReader,
                      search.axdsReader],
                    
          'ErddapReader': {
                          'known_server': ['coastwatch','ioos'],
                           'variables': [['salinity', 'sea_water_salinity'],
                                         ['salinity', 'sea_water_practical_salinity']]
          },
          'axdsReader': {
                          'axds_type': ['platform2','layer_group'],
                         'variables': ['Salinity',None]},
}

data = search.Data(**kwargs)
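
The check-then-suggest behaviour seen in the error above can be sketched in a few lines. This is an illustration under assumptions, not the package's implementation: the function names mirror the `search_variables` and `check_variables` methods named in the error message, but the signatures here are made up, and the counts are a small sample of Coastwatch values standing in for a server's full variable table.

```python
# Sample variable -> usage count table (stand-in for one known_server).
COASTWATCH_COUNTS = {
    "salinity": 73,
    "salt": 4,
    "sea_water_salinity": 4,
    "surface_salinity_trend": 2,
    "bucket_salinity": 1,
    "adg_412": 8,
}


def search_variables(counts, pattern):
    """Return (name, count) pairs whose name contains `pattern`, most used first."""
    hits = [(name, n) for name, n in counts.items() if pattern in name]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


def check_variables(counts, variables):
    """Stay silent if every name matches exactly, otherwise raise with suggestions."""
    if isinstance(variables, str):
        variables = [variables]
    for name in variables:
        if name not in counts:
            suggestions = [hit for hit, _ in search_variables(counts, name)]
            raise AssertionError(
                f"{name!r} is not an exact match; try one of {suggestions}"
            )


# Exact names pass without any output.
check_variables(COASTWATCH_COUNTS, ["salinity", "sea_water_salinity"])

# A partial name raises, and the message carries the candidates.
try:
    check_variables(COASTWATCH_COUNTS, "sal")
except AssertionError as err:
    message = str(err)
    print(message)
```

Because several names contain "salinity", picking one automatically would be a guess, which is why the sketch raises instead of substituting a match.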

#### Actions with Variables

Alternatively you can proactively search for variables for each reader. Currently the ways to call the individual libraries aren't pretty, but they work. Note that the number of times a variable is used in the system is included under "count" so you can see which names are popular (many are not widely used).

Return all variables for the two ERDDAP `known_server`s, then for the Axiom reader with `axds_type='platform2'`.

In [31]:
search.ErddapReader.ErddapReader(known_server='coastwatch').all_variables().head()

| variable | count |
|---|---|
| abund_m3 | 2 |
| ac_line | 1 |
| ac_sta | 1 |
| adg_412 | 8 |
| adg_412_bias | 8 |


In [32]:
search.ErddapReader.ErddapReader(known_server='ioos').all_variables().head()

| variable | count |
|---|---|
| air_pressure | 4028 |
| air_pressure_10011met_a | 2 |
| air_pressure_10311ahlm_a | 2 |
| air_pressure_10311ahlm_a_qc_agg | 1 |
| air_pressure_10311ahlm_a_qc_tests | 1 |


The Axiom reader variables are for `axds_type='platform2'`, not `axds_type='layer_group'`, since the latter are more specialized grid products that don't conform well to a common set of variable names.

In [33]:
search.axdsReader.axdsReader().all_variables().head()

| variable | count |
|---|---|
| Ammonium | 23 |
| Atmospheric Pressure: Air Pressure at Sea Level | 362 |
| Atmospheric Pressure: Barometric Pressure | 4152 |
| Backscatter Intensity | 286 |
| Battery | 2705 |


Search for variables, sorted by how commonly used:

In [34]:
search.ErddapReader.ErddapReader(known_server='coastwatch').search_variables('sal').head()

| variable | count |
|---|---|
| salinity | 73 |
| salt | 4 |
| sea_water_salinity | 4 |
| surface_salinity_trend | 2 |
| bucket_salinity | 1 |


In [35]:
search.ErddapReader.ErddapReader(known_server='ioos').search_variables('sal').head()

| variable | count |
|---|---|
| salinity | 952 |
| salinity_qc | 952 |
| sea_water_practical_salinity | 778 |
| soil_salinity_qc_agg | 622 |
| soil_salinity | 622 |


In [36]:
search.axdsReader.axdsReader().search_variables('sal').head()

| variable | count |
|---|---|
| Salinity | 3204 |
| Soil Salinity | 622 |


Finally, you can check that your variables are valid. No news is good news here: if nothing is returned, the names matched.

In [37]:
search.ErddapReader.ErddapReader(known_server='coastwatch').check_variables(['salinity',
                                                                            'sea_water_salinity'])

In [38]:
search.ErddapReader.ErddapReader(known_server='ioos').check_variables(['salinity',
                                                                        'sea_water_practical_salinity'])

In [39]:
search.axdsReader.axdsReader(axds_type='platform2').check_variables('Salinity')

Or check everything at once by running the full search:

In [55]:
kwargs = {
          'kw': kw, 
          'approach': 'region',
          'readers': [search.ErddapReader,
                      search.axdsReader],

          'ErddapReader': {
                          'known_server': ['coastwatch','ioos'],
                           'variables': [['salinity', 'sea_water_salinity'],
                                         ['salinity', 'sea_water_practical_salinity']]
          },
          'axdsReader': {
                          'axds_type': ['platform2','layer_group'],
                         'variables': ['Salinity',None]},
}

data = search.Data(**kwargs)

In [56]:
data.dataset_ids

[[],
 [],
 [],
 ['05113e8c-ea25-11e0-a998-0019b9dae22b',
  '391183ee-827e-11e1-a4f3-00219bfe5678',
  '071705de-8400-11e1-99fe-00219bfe5678',
  '1f22a216-2565-11e3-8377-00219bfe5678',
  'd069dd3f-09da-4295-964e-bbbc0ad554b8',
  '8899968a-2567-11e3-89f6-00219bfe5678',
  'fd646e28-2566-11e3-9bc9-00219bfe5678',
  '8a33cf06-2567-11e3-9bc7-00219bfe5678',
  'de91c282-01e2-11e2-ad19-00219bfe5671',
  'de91c282-01e2-11e2-ad19-00219bfe5679',
  'de91c282-01e2-11e2-ad19-00219bfe5670',
  'dd2af856-c36f-4b4c-9452-2c39a07d5178',
  '06eff4a2-60b2-4b5a-8f4b-931597f6e156',
  '49b40d84-7ae1-4217-9ec5-4c6524bb06b0',
  '1ec2a7e4-08db-40d3-ab41-4251cd3633bb',
  '3b314775-4efa-4717-98ec-04dd9e436127',
  '4bbceb3c-00d7-4af9-a8c7-81d38dd767ce',
  '4dd92df2-d9a8-460b-98cd-3395858c19f5',
  '45d182be-1a8f-4b36-9fc8-c829248d8512',
  '3bbc741a-966c-4547-b08b-ad57f9493e57',
  '3651d05b-e120-49e2-ace1-9cf27e68d820',
  'd359748a-fe78-11e7-8128-0023aeec7b98',
  '5d7e406a-d38d-482d-ab4f-71c8bb3d15b5',
  '3261285c-e3c9-45

### By Name

You can search either by a general station name or by the specific database dataset_id if you know it (from performing a search previously, for example).

#### By Station

If you know the names of stations, but they might not be the names used in the particular databases, you can use this approach.

In the following example, I use some station IDs I know off the top of my head. Note that the dataset_ids are returned in the order of the readers being used (ERDDAP IOOS, ERDDAP Coastwatch, Axiom platform2, Axiom layer_group, localReader). The module checks all of the readers for the station names.

In [42]:
kwargs = {
          'approach': 'stations',
          'stations': ['8771972','SFBOFS','42020','TABS_B']
}
data = search.Data(**kwargs)

In [43]:
data.dataset_ids

[['wmo_42020', 'noaa_nos_co_ops_8771972', 'tabs_b'],
 [],
 [],
 ['03158b5d-f712-45f2-b05d-e4954372c1ce',
  '794f7bba-b3d2-4da8-8465-408c27ab433b'],
 []]
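
Since the result is a positional list of lists, it can help to label each entry with its reader and drop the empty ones. A small sketch using the output above; the first four names follow the source names shown elsewhere in this notebook, while `"local"` is a hypothetical label because the local reader's source name is not shown:

```python
# Reader order as described in the text above.
READER_ORDER = [
    "erddap_ioos",
    "erddap_coastwatch",
    "axds_platform2",
    "axds_layer_group",
    "local",  # hypothetical label for the local reader
]

# Output of data.dataset_ids from the station search above.
dataset_ids = [
    ["wmo_42020", "noaa_nos_co_ops_8771972", "tabs_b"],
    [],
    [],
    ["03158b5d-f712-45f2-b05d-e4954372c1ce",
     "794f7bba-b3d2-4da8-8465-408c27ab433b"],
    [],
]


def non_empty_by_reader(names, ids_per_reader):
    """Map reader name -> dataset_ids, keeping only readers that found something."""
    return {name: ids for name, ids in zip(names, ids_per_reader) if ids}


found = non_empty_by_reader(READER_ORDER, dataset_ids)
for reader, ids in found.items():
    print(reader, len(ids))
```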

#### By Dataset ID

Once we know the database dataset_ids, we can use them directly for future searches:

In [44]:
kwargs = {
          'approach': 'stations',
          'ErddapReader': {
                          'known_server': ['ioos'],
                           'dataset_ids': [['tabs_b', 'wmo_42020', 'noaa_nos_co_ops_8771972']]
          },
          'axdsReader': {
                          'axds_type': ['layer_group'],
                         'dataset_ids': [['03158b5d-f712-45f2-b05d-e4954372c1ce']]},

}
data = search.Data(**kwargs)

In [45]:
data.dataset_ids

[['tabs_b', 'wmo_42020', 'noaa_nos_co_ops_8771972'],
 ['03158b5d-f712-45f2-b05d-e4954372c1ce'],
 []]

#### Include Time Range

By default, the full available time range will be returned for each dataset unless the user specifies one to narrow the returned datasets in time.

In [46]:
kwargs = {
          'kw': {'min_time': '2017-1-1', 
                 'max_time': '2017-1-2'},
          'approach': 'stations',
          'stations': ['8771972']
}
data = search.Data(**kwargs)
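
One way to read "narrowing in time" is as an overlap test between a dataset's coverage and the requested window. The sketch below assumes that interpretation and is not taken from the package; the helper names are hypothetical, and the coverage strings reuse the `time_coverage_start` and `time_coverage_end` format from the metadata table shown earlier for station 8770475.

```python
from datetime import datetime, timezone


def parse_day(text):
    """Parse a date such as '2017-1-1' (zero padding optional) as UTC midnight."""
    return datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def parse_coverage(text):
    """Parse a timestamp with a trailing 'Z', e.g. '2015-05-05T13:00:00Z'."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def overlaps(coverage_start, coverage_end, min_time=None, max_time=None):
    """True if the dataset coverage intersects the requested window.

    A missing bound means that side of the window is open, so with no
    bounds at all every dataset passes (the full time range case).
    """
    start = parse_coverage(coverage_start)
    end = parse_coverage(coverage_end)
    if min_time is not None and end < parse_day(min_time):
        return False
    if max_time is not None and start > parse_day(max_time):
        return False
    return True


coverage = ("2015-05-05T13:00:00Z", "2021-05-10T12:29:00Z")
in_2017 = overlaps(*coverage, min_time="2017-1-1", max_time="2017-1-2")
in_2014 = overlaps(*coverage, min_time="2014-1-1", max_time="2014-1-2")
print(in_2017, in_2014)
```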

## Reader Options

### ERDDAP Reader

By default, the Data module will use `ErddapReader` with two known servers: IOOS and Coastwatch. 

In [47]:
kwargs = {
          'kw': kw,
          'approach': 'region',
          'readers': [search.ErddapReader]
}
data = search.Data(**kwargs)
data.sources[0].name, data.sources[1].name

('erddap_ioos', 'erddap_coastwatch')

#### Specify Known Server

The user can specify to use just one of these:

In [48]:
kwargs = {
          'kw': kw,
          'approach': 'region',
          'readers': [search.ErddapReader],
          'ErddapReader': {
                          'known_server': ['ioos'],  # or 'coastwatch'
          }
}
data = search.Data(**kwargs)
data.sources[0].name

'erddap_ioos'

#### New ERDDAP Server

You can give the necessary information to use a different ERDDAP server.

In [49]:
kwargs = {
          'kw': kw,
          'approach': 'region',
          'readers': [search.ErddapReader],
            'ErddapReader': {
                'known_server': 'ifremer',
                'protocol': 'tabledap',
                'server': 'http://www.ifremer.fr/erddap'
            }
}
data = search.Data(**kwargs)

In [50]:
data.dataset_ids

[['OceanGlidersGDACTrajectories', 'ArgoFloats-synthetic-BGC', 'ArgoFloats']]
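
For orientation, the `server` and `protocol` values map onto ERDDAP's request URL layout, `<server>/<protocol>/<dataset_id>.<file_type>?<query>`. The sketch below builds such a URL by hand; `tabledap_url` is a hypothetical helper rather than part of the `search` package, and the variable names `time`, `latitude`, and `longitude` are assumed to exist in the dataset.

```python
from urllib.parse import quote


def tabledap_url(server, dataset_id, variables=(), constraints=(), file_type="csv"):
    """Build a tabledap request URL for one dataset.

    variables   -- names to request, joined with commas
    constraints -- strings such as 'time>=2021-04-01T00:00:00Z'
    """
    base = f"{server.rstrip('/')}/tabledap/{dataset_id}.{file_type}"
    query = ",".join(variables) + "".join(f"&{c}" for c in constraints)
    if not query:
        return base
    # Percent-encode characters such as '<', '>' and ':' in the query string.
    return f"{base}?{quote(query, safe=',&=')}"


url = tabledap_url(
    "http://www.ifremer.fr/erddap",
    "ArgoFloats",
    variables=["time", "latitude", "longitude"],
    constraints=["time>=2021-04-01T00:00:00Z", "time<=2021-04-02T00:00:00Z"],
)
print(url)
```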

### AXDS Reader

By default, the Data module will use `axdsReader` with two types of data: `platform2` (like gliders) and `layer_group` (model output).

In [51]:
kwargs = {
          'kw': kw,
          'approach': 'region',
          'readers': [search.axdsReader]
}
data = search.Data(**kwargs)
data.sources[0].name, data.sources[1].name

('axds_platform2', 'axds_layer_group')

#### Specify AXDS Type

The user can specify to use just one of these:

In [52]:
kwargs = {
          'kw': kw,
          'approach': 'region',
          'readers': [search.axdsReader],
          'axdsReader': {
                          'axds_type': 'platform2',  # or 'layer_group'
          }
}
data = search.Data(**kwargs)
data.sources[0].name

'axds_platform2'

### Local Files

I can't remember how I got these files from a portal, but they are just meant to be sample files anyway. Hopefully this will work reasonably well with other files too.

The `region` and `stations` approaches are less meaningful for local files, since a user who inputs filenames already knows which files they want. The approaches could be useful if the user has a collection of files or an existing catalog and wants the code to filter it down. That code is not in place, but could be added if it is a good use case.

So it currently doesn't matter which approach is used for local files. There are defaults for `kw` and `region` if nothing is input, which is fine here since neither is used.

In [53]:
filenames = ['/Users/kthyng/Downloads/ANIMIDA_III_BeaufortSea_2014-2015/kasper-netcdf/ANIMctd14.nc',
             '/Users/kthyng/Downloads/Harrison_Bay_CTD_MooringData_2014-2015/Harrison_Bay_data/SBE16plus_01604787_2015_08_09_final.csv']

data = search.Data(readers=search.localReader, localReader={'filenames': filenames})
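
The two sample files have different formats, which ties back to the question of returning two data types. A minimal sketch of dispatching on file extension, assuming netCDF files become `xarray` Datasets and CSV files become `pandas` DataFrames; `container_for` and the mapping are hypothetical and only return labels, so nothing is actually opened:

```python
from pathlib import Path

# Assumed mapping from file suffix to the container the data would end up in.
CONTAINER_BY_SUFFIX = {
    ".nc": "xarray.Dataset",
    ".csv": "pandas.DataFrame",
}


def container_for(filename):
    """Return the container label for a file, based only on its suffix."""
    suffix = Path(filename).suffix.lower()
    try:
        return CONTAINER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r} files") from None


sample_files = ["ANIMctd14.nc", "SBE16plus_01604787_2015_08_09_final.csv"]
containers = {name: container_for(name) for name in sample_files}
print(containers)
```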

You can look at the metadata or the data:

In [54]:
data.meta 
# data.data

[                                                                              download_url  \
 ANIMctd14.nc                             /Users/kthyng/Downloads/ANIMIDA_III_BeaufortSe...   
 SBE16plus_01604787_2015_08_09_final.csv  /Users/kthyng/Downloads/Harrison_Bay_CTD_Moori...   
 
                                                  geospatial_lat_max  \
 ANIMctd14.nc                             [time, lat, lon, pressure]   
 SBE16plus_01604787_2015_08_09_final.csv                     70.6349   
 
                                                                         geospatial_lat_min  \
 ANIMctd14.nc                             [station_name, sal, tem, fluoro, turbidity, PA...   
 SBE16plus_01604787_2015_08_09_final.csv                                            70.6349   
 
                                                                         geospatial_lon_max  \
 ANIMctd14.nc                             /Users/kthyng/projects/boem_datasets/notebooks...   
 SBE16plus_0160478