# Australian Geoscience Datacube API
This notebook describes how to connect to the datacube and run a basic query.

In [1]:
import datacube.api
from pprint import pprint

By default, the API will use the configured database connection found in the config file.

Details on setting up the config file and database can be found here:
http://agdc-v2.readthedocs.org/en/develop/db_setup.html

In [2]:
dc = datacube.api.API()

## Summary functions
* __`list_fields()`__ - lists all fields that can be used for searching
* __`list_field_values(field)`__ - lists all the values of the field found in the database

Find out what fields we can search:

In [3]:
dc.list_fields()

[u'product',
 u'lat',
 u'sat_path',
 u'platform',
 u'lon',
 u'orbit',
 'collection',
 u'instrument',
 u'sat_row',
 u'time',
 u'gsi',
 'id']

The `product` and `platform` fields look interesting. Find out more about them:

In [4]:
dc.list_field_values('product')

[u'LEDAPS', u'nbar', u'pqa']

In [5]:
dc.list_field_values('platform')

[u'LANDSAT_5']
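The two summary calls can be combined to build an overview of several fields at once. The following is a minimal sketch: `summarise_fields` is a hypothetical helper, and `StandInAPI` is a stand-in object that only mimics `list_fields()` and `list_field_values(field)` with the values shown above, so that the example runs without a database.

```python
def summarise_fields(api, fields=None):
    """Return {field: sorted list of values} for the requested fields.

    `api` is any object offering list_fields() and list_field_values(field),
    such as the `dc` object created earlier in this notebook.
    """
    if fields is None:
        fields = api.list_fields()
    return {field: sorted(api.list_field_values(field)) for field in fields}


class StandInAPI(object):
    """Hypothetical stand-in that mimics the two summary calls shown above."""

    _values = {
        'product': ['LEDAPS', 'nbar', 'pqa'],
        'platform': ['LANDSAT_5'],
    }

    def list_fields(self):
        return list(self._values)

    def list_field_values(self, field):
        return self._values[field]


summary = summarise_fields(StandInAPI(), fields=['product', 'platform'])
for field, values in sorted(summary.items()):
    print('{}: {}'.format(field, values))
```

With a real connection, passing `dc` instead of `StandInAPI()` should work the same way. Limiting `fields` to a few names is probably sensible, since fields such as `time` may hold many distinct values.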

## Query and Access functions
There are several API calls that describe and provide data in different ways:

* __`get_descriptor()`__ - provides a description of the data for a given query.
* __`get_data()`__ - provides the data as `xarray.DataArray`s for each variable.  This is usually called based on information returned by the `get_descriptor` call.
* __`get_data_array()`__ - returns an `xarray.DataArray` n-dimensional object, with the variables stacked along the dimension labelled `variable`.
* __`get_dataset()`__ - returns an `xarray.Dataset` object, containing an `xarray.DataArray` for each variable.

### get_descriptor
We can make a query and find out about the data:

The query is a nested dict of search terms.

In [7]:
query = {
    'product': 'nbar',
    'platform': 'LANDSAT_5',
}
descriptor = dc.get_descriptor(query, include_storage_units=False)
pprint(descriptor)

{u'ls5_nbar': {'dimensions': [u'time', u'latitude', u'longitude'],
               'irregular_indices': {u'time': array(['1990-03-02T23:11:16.000000000', '1990-05-05T23:10:28.000000000',
       '1990-06-06T23:10:29.000000000', '1990-07-24T23:10:22.000000000',
       '1990-08-09T23:10:17.000000000', '1990-08-25T23:10:12.000000000',
       '1990-09-10T23:10:09.000000000', '1990-09-26T23:10:01.000000000',
       '1990-10-12T23:09:54.000000000', '1990-10-28T23:09:46.000000000',
       '1990-11-13T23:09:47.500000000', '1990-12-15T23:09:43.000000000'], dtype='datetime64[ns]')},
               'result_max': (numpy.datetime64('1990-12-15T23:09:43.000000000'),
                              -33.000125,
                              151.999875),
               'result_min': (numpy.datetime64('1990-03-02T23:11:16.000000000'),
                              -35.999874999999996,
                              148.000125),
               'result_shape': (12, 12000, 16000),
               'variables': {u
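The `result_shape` reported by the descriptor gives a quick way to estimate how much memory a full read would need, which is one reason to restrict a query before fetching data. The sketch below is a rough estimate only. It assumes 8 bytes per element (`float64`, as shown in the `get_data_array` output later in this notebook); the stored dtype may differ. `estimate_bytes` is a hypothetical helper, not part of the datacube API.

```python
from functools import reduce
from operator import mul


def estimate_bytes(result_shape, bytes_per_element, n_variables=1):
    """Estimate the uncompressed size in bytes of a full read."""
    n_elements = reduce(mul, result_shape, 1)  # product of all dimension lengths
    return n_elements * bytes_per_element * n_variables


shape = (12, 12000, 16000)  # 'result_shape' from the descriptor output above
size = estimate_bytes(shape, bytes_per_element=8)  # assumption: float64 values
print('{:.1f} GB per variable'.format(size / 1e9))  # prints: 18.4 GB per variable
```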

The query can be restricted to provide information on a particular range along a dimension.

For spatial queries, the dimension names should be used.  The default projection for the range query values is WGS84.

In [11]:
query = {
    'product': 'NBAR',
    'platform': 'LANDSAT_5',
    'dimensions': {
        'x' : {
            'range': (148.5, 149.5),
        },
        'y' : {
            'range': (-34.8, -35.8),
        },
        'time': {
            'range': ((1990, 6, 1), (1990, 7 ,1)),
        }
    }
}
pprint(dc.get_descriptor(query, include_storage_units=False))

{u'ls5_nbar_albers': {'dimensions': [u'time', u'y', u'x'],
                      'irregular_indices': {u'time': array(['1990-06-05T09:22:26.000000000+1000',
       '1990-06-05T09:22:50.000000000+1000',
       '1990-06-05T09:23:14.000000000+1000',
       '1990-06-07T09:10:29.000000000+1000',
       '1990-06-07T09:11:16.000000000+1000',
       '1990-06-14T09:16:14.000000000+1000',
       '1990-06-14T09:17:02.000000000+1000',
       '1990-06-14T09:17:26.000000000+1000',
       '1990-06-21T09:22:25.000000000+1000',
       '1990-06-21T09:22:49.000000000+1000',
       '1990-06-30T09:17:02.000000000+1000',
       '1990-06-30T09:17:26.000000000+1000'], dtype='datetime64[ns]')},
                      'result_max': (numpy.datetime64('1990-06-30T09:17:26.000000000+1000'),
                                     -3896912.5,
                                     1591137.5),
                      'result_min': (numpy.datetime64('1990-06-05T09:22:26.000000000+1000'),
                                     

A coordinate reference system can be provided for the spatial dimensions, either as an EPSG code or a WKT description:

In [None]:
query = {
    'product': 'NBAR',
    'platform': 'LANDSAT_5',
    'dimensions': {
        'x' : {
            'range': (1542112, 1563962),
            'crs': 'EPSG:3577',
        },
        'y' : {
            'range': (-3920000.5,-3926000.5),
            'crs': 'EPSG:3577',
        },
        'time': {
            'range': ((1990, 6, 1), (1990, 7 ,1)),
        }
    }
}
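Since the query dicts above all share the same nested layout, they can be assembled by a small function. This is a sketch under the assumption that the layout is exactly the one shown in the cells above; `build_query` is a hypothetical helper and not part of the datacube API.

```python
def build_query(product, platform, x=None, y=None, time=None,
                crs=None, variables=None):
    """Assemble a nested query dict in the layout used in this notebook.

    x, y and time are (start, end) tuples. crs, if given, is attached to
    both spatial dimensions (for example 'EPSG:3577').
    """
    query = {'product': product, 'platform': platform}
    if variables is not None:
        query['variables'] = list(variables)

    dimensions = {}
    for name, value_range in (('x', x), ('y', y)):
        if value_range is not None:
            dimensions[name] = {'range': tuple(value_range)}
            if crs is not None:
                dimensions[name]['crs'] = crs
    if time is not None:
        dimensions['time'] = {'range': tuple(time)}

    if dimensions:
        query['dimensions'] = dimensions
    return query


query = build_query(
    'NBAR', 'LANDSAT_5',
    x=(1542112, 1563962),
    y=(-3920000.5, -3926000.5),
    time=((1990, 6, 1), (1990, 7, 1)),
    crs='EPSG:3577',
)
```

The resulting dict matches the one written out by hand above and can be passed to `dc.get_descriptor(query, include_storage_units=False)` in the same way as the earlier queries.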

### get_data
This retrieves the data, usually as a subset, based on the information provided by the `get_descriptor` call.

The query is in a similar form to the `get_descriptor` call, with the addition of a `variables` parameter.  If not specified, all variables are returned.
The query also accepts an `array_range` parameter on a dimension that provides a subset based on array indices, rather than labelled coordinates.

In [12]:
query = {
    'product': 'NBAR',
    'platform': 'LANDSAT_5',
    'variables': ['band_30', 'band_40'],
    'dimensions': {
        'x' : {
            'range': (148.5, 149.5),
            'array_range': (0, 1),
        },
        'y' : {
            'range': (-34.8, -35.8),
            'array_range': (0, 1),
        },
        'time': {
            'range': ((1990, 4, 1), (1990, 5, 1))
        }
    }
}
data = dc.get_data(query)
data.keys()

['dimensions',
 'arrays',
 'element_sizes',
 'indices',
 'coordinate_reference_systems',
 'size']
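The difference between `range` (labelled coordinates) and `array_range` (array indices) can be pictured with plain Python lists. This is a conceptual sketch only and does not show how the datacube implements either parameter. The coordinate values are made up, and the half-open slice semantics used for the index case are an assumption.

```python
# Made-up coordinate labels along one dimension (e.g. longitude).
coords = [148.0, 148.5, 149.0, 149.5, 150.0]


def select_by_label(coords, low, high):
    """Label-based selection: keep coordinates whose value lies in [low, high]."""
    return [c for c in coords if low <= c <= high]


def select_by_index(coords, start, stop):
    """Index-based selection: keep positions start..stop-1 (assumed slice semantics)."""
    return coords[start:stop]


print(select_by_label(coords, 148.5, 149.5))  # [148.5, 149.0, 149.5]
print(select_by_index(coords, 0, 1))          # [148.0]
```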

### get_data_array
This is a convenience function that wraps the `get_data` function, returning only the data, stacked in a single `xarray.DataArray`.

The variables are stacked along the `variable` dimension.

In [9]:
nbar = dc.get_data_array(product='NBAR', platform='LANDSAT_5', y=(-34.95,-35.05), x=(148.95,149.05))
nbar

<xarray.DataArray u'ls5_nbar_albers' (variable: 6, time: 182, y: 489, x: 420)>
dask.array<concate..., shape=(6, 182, 489, 420), dtype=float64, chunksize=(1, 1, 489, 420)>
Coordinates:
  * time      (time) datetime64[ns] 1990-03-02T23:11:16 1990-03-02T23:11:39 ...
  * y         (y) float64 -3.919e+06 -3.919e+06 -3.919e+06 -3.919e+06 ...
  * x         (x) float64 1.538e+06 1.538e+06 1.538e+06 1.538e+06 1.538e+06 ...
  * variable  (variable) <U7 u'band_10' u'band_20' u'band_30' u'band_40' ...

### get_dataset
This is a convenience function similar to `get_data_array`, returning the data of the query as an `xarray.Dataset` object.

In [13]:
dc.get_dataset(product='NBAR', platform='LANDSAT_5', y=(-34.95,-35.05), x=(148.95,149.05))

<xarray.Dataset>
Dimensions:  (time: 182, x: 420, y: 489)
Coordinates:
  * time     (time) datetime64[ns] 1990-03-02T23:11:16 1990-03-02T23:11:39 ...
  * y        (y) float64 -3.919e+06 -3.919e+06 -3.919e+06 -3.919e+06 ...
  * x        (x) float64 1.538e+06 1.538e+06 1.538e+06 1.538e+06 1.538e+06 ...
Data variables:
    band_20  (time, y, x) float64 729.0 612.0 579.0 629.0 862.0 1.027e+03 ...
    band_10  (time, y, x) float64 476.0 375.0 357.0 416.0 586.0 708.0 730.0 ...
    band_50  (time, y, x) float64 2.175e+03 634.0 245.0 546.0 1.788e+03 ...
    band_40  (time, y, x) float64 1.262e+03 536.0 484.0 1.081e+03 2.132e+03 ...
    band_30  (time, y, x) float64 856.0 592.0 528.0 642.0 932.0 1.137e+03 ...
    band_70  (time, y, x) float64 1.428e+03 493.0 206.0 395.0 1.031e+03 ...