Resources

Regardless of a dataset's semantics, its actual data is physically stored as a sequence of 1's and 0's on disk. On a local computer, we would refer to them as files and identify them using a file path, e.g. /path/to/file. More generally, the files could be located on a remote server and require different protocols (e.g. http:// or file://) to access. Therefore, we chose the more general term resource to refer to the physical (as opposed to semantic) container that holds the data.

A resource represents an atomic data unit of a dataset. A dataset usually has only a single resource associated with it, but it can have many (e.g., the FLDAS dataset is organized as a collection of resources going back to 1981, each containing one month's worth of observations).

The data catalog currently supports the following resource types: NetCDF, GeoTIFF, and CSV. Each type has its own set of metadata, but the following fields apply to all of them:

Mandatory metadata

record_id [mandatory]

record_id is the resource's unique identifier (in the form of a UUID)

dataset_id [mandatory]

dataset_id is the UUID of the dataset to which this resource belongs

name [mandatory]

name is what the resource is called. Typically, it's the name at the end of the URI path. For example, for a resource located at http://www.example.com/path/to/my/data_file.txt, its name would be data_file

data_url [mandatory]

data_url is the direct download link for the resource

variable_ids [mandatory]

variable_ids is an array of variable ids (uuids)

resource_type [mandatory]

resource_type describes the file type of the resource

{
	"resource_type": "NetCDF|GeoTIFF|CSV"
}
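As a rough illustration, the mandatory fields above could be assembled and checked in Python along these lines. The helpers `make_resource` and `missing_mandatory_fields` are hypothetical names introduced for this sketch and are not part of the data catalog's API; only the field names and the three resource types come from this section.

```python
import uuid

# Field names and resource types as listed in the mandatory metadata section.
MANDATORY_FIELDS = (
    "record_id", "dataset_id", "name", "data_url", "variable_ids", "resource_type",
)
SUPPORTED_TYPES = ("NetCDF", "GeoTIFF", "CSV")


def make_resource(dataset_id, name, data_url, variable_ids, resource_type):
    """Build a resource record (hypothetical helper) with a freshly generated UUID."""
    if resource_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported resource_type: {resource_type!r}")
    return {
        "record_id": str(uuid.uuid4()),
        "dataset_id": dataset_id,
        "name": name,
        "data_url": data_url,
        "variable_ids": list(variable_ids),
        "resource_type": resource_type,
    }


def missing_mandatory_fields(resource):
    """Return the mandatory field names that are absent from the record."""
    return [field for field in MANDATORY_FIELDS if field not in resource]


resource = make_resource(
    dataset_id=str(uuid.uuid4()),       # placeholder dataset UUID
    name="data_file",
    data_url="http://www.example.com/path/to/my/data_file.txt",
    variable_ids=[str(uuid.uuid4())],   # placeholder variable UUID
    resource_type="CSV",
)
print(missing_mandatory_fields(resource))
```

Checking for missing fields before submitting a record is only a convenience; the catalog itself remains the authority on what it accepts.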

[Highly] recommended and optional metadata

Arbitrary metadata

In addition to the mandatory attributes above, the data catalog also lets you provide arbitrary key-value metadata. You can provide any valid JSON, meaning that the value can be a string, number, array, or another object.

{
	"metadata": {
		"<other_fields>": "...",
		"myCustomKey1": "myCustomValue1",
		"myCustomKey2": ["list", "of", "custom", "values"],
		"<other_fields>": "..."
	}
}

The metadata object is where you should put the other recommended metadata in order to take full advantage of the data catalog's capabilities. These recommended fields are described below.
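One way to confirm that custom metadata is valid JSON before submitting it is to try serializing it. The sketch below uses only Python's standard `json` module; the function name `is_valid_json_value` and the key `myCustomKey3` are made up for illustration.

```python
import json


def is_valid_json_value(value):
    """Return True if the value can be serialized as strict JSON."""
    try:
        # allow_nan=False rejects NaN and Infinity, which are not valid JSON.
        json.dumps(value, allow_nan=False)
    except (TypeError, ValueError):
        return False
    return True


payload = {
    "metadata": {
        "myCustomKey1": "myCustomValue1",
        "myCustomKey2": ["list", "of", "custom", "values"],
        "myCustomKey3": {"nested": 1.5},  # hypothetical nested object
    }
}

print(is_valid_json_value(payload))                               # True
print(is_valid_json_value({"metadata": {"bad": {1, 2, 3}}}))      # False: sets are not JSON
```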

Recommended metadata

temporal coverage [recommended]

temporal_coverage describes the time span that the resource covers and its temporal granularity. Dates follow the ISO 8601 format YYYY-MM-DDThh:mm:ss. Providing this information makes it possible to find the data using temporal queries (e.g., is there rainfall data for the years 2015-2017?)

{
	"metadata": {
		"temporal_coverage": {
			"start_time": "1980-01-01T00:00:00",
			"end_time": "1999-12-31T23:59:59",
			"resolution": {
				"value": 3,
				"units": "year|month|week|day|hour|minute|second"
			}
		}
	}
}
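To make the temporal query idea concrete, here is a minimal Python sketch that builds a temporal_coverage object and tests whether it overlaps a queried time range. The functions `temporal_coverage` and `overlaps` are hypothetical helpers written for this example; how the data catalog actually evaluates temporal queries is not described here.

```python
from datetime import datetime

# ISO 8601 layout used in this section: YYYY-MM-DDThh:mm:ss
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def temporal_coverage(start, end, value, units):
    """Build the temporal_coverage metadata fragment from datetime objects."""
    if end < start:
        raise ValueError("end_time must not precede start_time")
    return {
        "temporal_coverage": {
            "start_time": start.strftime(ISO_FORMAT),
            "end_time": end.strftime(ISO_FORMAT),
            "resolution": {"value": value, "units": units},
        }
    }


def overlaps(coverage, query_start, query_end):
    """Return True if the coverage interval intersects [query_start, query_end]."""
    start = datetime.strptime(coverage["start_time"], ISO_FORMAT)
    end = datetime.strptime(coverage["end_time"], ISO_FORMAT)
    return start <= query_end and query_start <= end


metadata = temporal_coverage(
    datetime(1980, 1, 1), datetime(1999, 12, 31, 23, 59, 59), 3, "month"
)
coverage = metadata["temporal_coverage"]
print(overlaps(coverage, datetime(2015, 1, 1), datetime(2017, 12, 31)))  # False
print(overlaps(coverage, datetime(1990, 1, 1), datetime(1991, 1, 1)))    # True
```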

spatial coverage [recommended]

spatial_coverage describes the spatial extent of the resource (in the WGS84 coordinate system). Here, x refers to longitude and y refers to latitude (in degrees). Providing this information makes it possible to find the data using geospatial queries (e.g., what kind of rainfall-related datasets are available for South Sudan?)

{
	"metadata": {
		"spatial_coverage": {
			"type": "BoundingBox",
			"value": {
				"xmin": -118.4253354,
				"ymin": 33.9605286,
				"xmax": -118.4093589,
				"ymax": 33.9895077
			}
		}
	}
}
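The following Python sketch shows how a bounding box like the one above could be validated and compared against a query region. The helpers `bounding_box` and `intersects` are hypothetical, and the sketch assumes boxes that do not cross the antimeridian; it is not a description of how the data catalog runs geospatial queries.

```python
def bounding_box(xmin, ymin, xmax, ymax):
    """Build the spatial_coverage fragment; x is longitude, y is latitude (degrees)."""
    if not (-180 <= xmin <= xmax <= 180):
        raise ValueError("longitudes must satisfy -180 <= xmin <= xmax <= 180")
    if not (-90 <= ymin <= ymax <= 90):
        raise ValueError("latitudes must satisfy -90 <= ymin <= ymax <= 90")
    return {
        "spatial_coverage": {
            "type": "BoundingBox",
            "value": {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax},
        }
    }


def intersects(a, b):
    """Return True if two bounding-box value dicts share any area or edge."""
    return (
        a["xmin"] <= b["xmax"] and b["xmin"] <= a["xmax"]
        and a["ymin"] <= b["ymax"] and b["ymin"] <= a["ymax"]
    )


resource_box = bounding_box(-118.4253354, 33.9605286, -118.4093589, 33.9895077)
query_box = bounding_box(-119.0, 33.0, -118.0, 34.5)   # made-up query region
far_box = bounding_box(20.0, 0.0, 40.0, 15.0)          # made-up distant region

value = resource_box["spatial_coverage"]["value"]
print(intersects(value, query_box["spatial_coverage"]["value"]))  # True
print(intersects(value, far_box["spatial_coverage"]["value"]))    # False
```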

resource size [recommended]

size is the size of the resource in bytes

{
	"metadata": {
		"size": 12345
	}
}

resource creation date [recommended]

date_created is a timestamp of when this particular resource was created (as opposed to registered in the data catalog)

{
	"metadata": {
		"date_created": "2000-01-01T01:23:45"
	}
}

Resource type-specific metadata

NetCDF

dimensions [recommended]

dimensions is a JSON object describing the size of each dimension. A NetCDF file already contains this information, and you can view it with a tool such as ncdump. For example, ncdump -h my_netcdf_file.nc produces output that looks something like:

dimensions:
	time = UNLIMITED ; // (1 currently)
	Y = 348 ;
	X = 294 ;
	bnds = 2 ;

In this case, the dimensions object for the data catalog would look like

{
	"metadata": {
		"dimensions": {
			"time": "UNLIMITED ; (1 currently)",
			"Y": 348,
			"X": 294,
			"bnds": 2
		}
	}
}

The data catalog treats any non-integer dimension size as "unlimited".
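The translation from ncdump output to the dimensions object can be sketched in Python as follows. The function `parse_dimensions` is a hypothetical helper that only handles the dimensions section in the layout shown above; it is an illustration, not a full parser for ncdump's CDL output.

```python
import re

# Matches lines such as "Y = 348 ;" or "time = UNLIMITED ; // (1 currently)".
DIMENSION_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*;\s*(?://\s*(.*))?$")


def parse_dimensions(ncdump_section):
    """Convert the text of an ncdump dimensions section into a dict."""
    dimensions = {}
    for line in ncdump_section.splitlines():
        match = DIMENSION_LINE.match(line)
        if not match:
            continue  # skips the "dimensions:" header and blank lines
        name, size, comment = match.groups()
        if size.isdigit():
            dimensions[name] = int(size)
        else:
            # Non-integer sizes (e.g. UNLIMITED) are kept as strings.
            dimensions[name] = f"{size} ; {comment}" if comment else size
    return dimensions


section = """dimensions:
\ttime = UNLIMITED ; // (1 currently)
\tY = 348 ;
\tX = 294 ;
\tbnds = 2 ;
"""

print({"metadata": {"dimensions": parse_dimensions(section)}})
```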

geospatial metadata [recommended]

spatial_coverage asks for the bounding box that describes the spatial extent (in WGS84 coordinates) that the data within the resource covers. However, the actual data can be stored in a different spatial reference system (SRS). In order to visualize the contents of the resource on a map, we need to know the SRS, which dimensions map to latitude and longitude, and the spatial resolution of each cell/pixel. So, for the example above with dimensions "time", "Y", "X", and "bnds", geospatial metadata might look something like this:

{
	"metadata": {
		"geospatial_metadata": {
			"srs": {
				"srid": "EPSG:4326"
			},
			"resolution": {
				"longitude": {
					"dimension": "X",
					"value": 10,
					"units": "m"
				},
				"latitude": {
					"dimension": "Y",
					"value": 10,
					"units": "m"
				}
			}
		}
	}
}

GeoTIFF

dimensions [recommended]

dimensions is a JSON object describing the size of each dimension. Typically, a GeoTIFF is an image with a certain width, height, and number of bands.

{
	"metadata": {
		"dimensions": {
			"height": 123,
			"width": 456,
			"bands": 3
		}
	}
}

geospatial metadata [recommended]

spatial_coverage asks for the bounding box that describes the spatial extent (in WGS84 coordinates) that the data within the resource covers. However, the actual data can be stored in a different spatial reference system (SRS). In order to visualize the contents of the resource on a map, we need to know the SRS, which dimensions map to latitude and longitude, and the spatial resolution of each cell/pixel. So, for the example above with dimensions "height", "width", and "bands", geospatial metadata might look something like this:

{
	"metadata": {
		"geospatial_metadata": {
			"srs": {
				"srid": "EPSG:4326"
			},
			"resolution": {
				"latitude": {
					"dimension": "height",
					"value": 10,
					"units": "m"
				},
				"longitude": {
					"dimension": "width",
					"value": 10,
					"units": "m"
				}
			}
		}
	}
}

CSV

dimensions [recommended]

dimensions is a JSON object describing the size of each dimension. Typically, a CSV has two dimensions: rows and columns.

{
	"metadata": {
		"dimensions": {
			"rows": 100,
			"cols": 10
		}
	}
}

delimiter [recommended]

delimiter is the character that separates fields within a row

{
	"metadata": {
		"delimiter": ","
	}
}

has header [recommended]

has_header is a boolean flag that should be true if the first row of the CSV file is a header and false otherwise

{
	"metadata": {
		"has_header": true
	}
}
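The three CSV fields can be gathered into a single metadata object. The Python sketch below uses the standard `csv` module; `csv_metadata` is a hypothetical helper, the delimiter and header flag are supplied by the caller rather than detected, and counting the header row in `rows` is an assumption of this example, since this section does not say whether it should be included.

```python
import csv
import io


def csv_metadata(text, delimiter=",", has_header=True):
    """Build CSV-specific metadata from the file contents given as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return {
        "metadata": {
            "dimensions": {
                # Assumption: the row count includes the header row, if any.
                "rows": len(rows),
                "cols": len(rows[0]) if rows else 0,
            },
            "delimiter": delimiter,
            "has_header": has_header,
        }
    }


sample = "station,date,rainfall_mm\nA1,2015-01-01,3.2\nA1,2015-01-02,0.0\n"  # made-up data
print(csv_metadata(sample))
```

For a file on disk, the same function could be fed the result of reading the file with `open(path, newline="")`, which is the mode the `csv` module documentation recommends.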