Regardless of dataset semantics, its actual data will be physically stored a sequence of 1's and 0's on disk. On a local computer, we would refer to them as files and identify them using a file path, e.g. /path/to/file
. More generally, the files could be located on a remote server and require different protocols (i.e. http://
or file://
) to access. Therefore, we chose the more proper term resource to refer to the physical (as opposed to semantic) container that holds the data.
A resource represents an atomic data unit of a dataset. A dataset usually has only a single resource associated with it, but can have many resources (e.g., FLDAS dataset is organized as a collection of resources going back to 1981, each containing data for one month worth of observations).
Data catalog currently supports the following resource types: NetCDF, GeoTIFF, and CSV. Each type has its own set of metadata, but the following list applies to all of them:
record_id
is resource's unique identifier (in the form of UUID)
dataset_id
is the uuid of the dataset to which this resources belongs
name
is what resource is called. Typically, it's the name at the end of the URI path. For example, for a resource located at http://www.example.com/path/to/my/data_file.txt
, its name would be data_file
data_url
is the direct download link for the resource
variable_ids
is an array of variable ids (uuid
s)
resource_type
describes the file type of the resource
{
"resource_type": "NetCDF|GeoTIFF|CSV"
}
In addition to the mandatory attributes above, data catalog also allows to provide arbitrary key-value metadata. You can provide any valid JSON meaning that the value can be a string, number, array, or another object.
{
"metadata": {
"<other_fields>": "...",
"myCustomKey1": "myCustomValue1",
"myCustomKey2": ["list", "of", "custom", "values"],
"<other_fields>": "...",
}
}
The metadata
object is where you should put other recommended metadata in order to take full advantage of data catalog's capabilities. Description of these recommended fields is provided below.
temporal_coverage
provides the length and granularity information on temporal coverage of the resource. Dates follow ISO 8601 format YYYY-MM-DDThh:mm:ss
. Providing this information makes it possible to find the data using temporal queries (e.g., is there rainfall data for 2015-2017 years?)
{
"metadata": {
"temporal_coverage": {
"start_time": "1980-01-01T00:00:00",
"end_time": "1999-12-31T23:59:59",
"resolution": {
"value": 3,
"units": "year|month|week|day|hour|minute|second"
}
}
}
}
spatial_coverage
provides information about the spatial coverage of the resource (in WGS84 coordinate system). Here, x
refers to longitude and y
refers to latitude (in degrees). Providing this information makes it possible to find the data using geospatial queries (e.g., what kind of rainfall-related datasets are available for South Sudan?)
{
"metadata": {
"spatial_coverage": {
"type": "BoundingBox",
"value": {
"xmin": -118.4253354,
"ymin": 33.9605286,
"xmax": -118.4093589,
"ymax": 33.9895077
}
}
}
}
size
- number of bytes
{
"metadata": {
"size": 12345
}
}
date_created
is a timestamp of when this particular resource was created (as opposed to registered in the data catalog)
{
"metadata": {
"date_created": "2000-01-01T01:23:45"
}
}
dimensions
is JSON object describing the sizes of each dimension. NetCDF should already contain that information and you can view it using e.g. ncdump
: ncdump -h my_netcdf_file.nc
and the output will look something like
dimensions:
time = UNLIMITED ; // (1 currently)
Y = 348 ;
X = 294 ;
bnds = 2 ;
In this case, dimensions
object for data catalog would look like
{
"metadata": {
"dimensions": {
"time": "UNLIMITED ; (1 currently)",
"Y": 348,
"X": 294,
"bnds": 2
}
}
}
We treat any non-integer value as "unlimited"
spatial_coverage
asks for the bounding box that describes the spatial extent (in WGS84 coordinates) that the data within the resource covers. But the actual data can be stored in a different spatial reference system (SRS). In order to visualize the contents of the resource on a map, we need to know the SRS, which dimensions map to latitude and longitude, and the spatial resolution of cell/pixel. So, for the example above with dimensions "time"
, "Y"
, "X"
, and "bnds"
, geospatial metadata might look something like this:
{
"metadata": {
"geospatial_metadata": {
"srs": {
"srid": "EPSG:4326"
},
"resolution": {
"longitude": {
"dimension": "X",
"value": 10,
"units": "m"
},
"latitude": {
"dimension": "Y",
"value": 10,
"units": "m"
}
}
}
}
}
dimensions
is JSON object describing the sizes of each dimension. Typically, a GeoTIFF is an image with a certain width, height, and the number of bands.
{
"metadata": {
"dimensions": {
"height": 123,
"width": 456,
"bands": 3
}
}
}
spatial_coverage
asks for the bounding box that describes the spatial extent (in WGS84 coordinates) that the data within the resource covers. But the actual data can be stored in a different spatial reference system (SRS). In order to visualize the contents of the resource on a map, we need to know the SRS, which dimensions map to latitude and longitude, and the spatial resolution of cell/pixel. So, for the example above with dimensions "height"
, "width"
, and "bands"
, geospatial metadata might look something like this:
{
"metadata": {
"geospatial_metadata": {
"srs": {
"srid": "EPSG:4326"
},
"resolution": {
"latitude": {
"dimension": "width",
"value": 10,
"units": "m"
},
"longitude": {
"dimension": "height",
"value": 10,
"units": "m"
}
}
}
}
}
dimensions
is a JSON object describing the sizes of each dimension. Typically, a CSV consists of two dimensions: rows and columns
{
"metadata": {
"dimensions": {
"rows": 100,
"cols": 10
}
}
}
delimiter
describes how fields are separated
{
"metadata": {
"delimiter": ","
}
}
has_header
is a boolean flag that should be true
if the first row of the CSV file is a header and false
otherwise
{
"metadata": {
"has_header": true
}
}