# Introduction to loading data <img align="right" src="../img/LivingWales_logo.png" width="190" height="200">

* **Compatibility:** Notebook currently compatible with the `WDC` environment
* **Products used:** `sen2_l2a_gcp`, `nfi_woodland_fr`, `nrw_saltmarshes_lle`
* **Prerequisites:** Users of this notebook should have a basic understanding of:
    * How to run a [Jupyter notebook](01_Introduction_jupyter_notebooks.ipynb)
    * The basic structure of the WDC [satellite datasets]()
    * Inspecting available [WDC products and measurements](03_Products_and_measurements.ipynb)
    

## Background
Loading data from the Wales Open Data Cube (WDC) instance of the [Open Data Cube](https://www.opendatacube.org/) requires the construction of a data query that specifies the what, where, and when of the data request.
Each query returns a [multi-dimensional xarray object](http://xarray.pydata.org/en/stable/) containing the contents of your query.
It is essential to understand the `xarray` data structures as they are fundamental to the structure of data loaded from the datacube.
Manipulations, transformations and visualisation of `xarray` objects provide datacube users with the ability to explore and analyse WDC datasets.

## Description
This notebook will introduce how to load data from the WDC datacube through the construction of a query and use of the `dc.load()` function.
Topics covered include:

1. Loading data using `dc.load()`
    * Visualising the resulting `xarray.Dataset` object
    * Interpreting the resulting `xarray.Dataset` object
    * Inspecting an individual `xarray.DataArray`


2. Customising parameters passed to the `dc.load()` function
    * Loading specific measurements
    * Loading data for coordinates in a custom coordinate reference system (CRS)


3. Loading data using a reusable dictionary query
4. Loading matching data from multiple products using `like`

***

## Getting started
To run this notebook, run all the cells in the notebook starting with the "Load packages" cell. For help with running notebook cells, refer back to the [Jupyter Notebooks notebook](01_Introduction_jupyter_notebooks.ipynb).

### Load packages
The `datacube` package is required to query the datacube database and load data. The `time()` function from Python's `time` module is used to measure how long each load takes.

In [None]:
import datacube
from time import time

### Connect to the datacube
The next step is to connect to the datacube database.
The resulting `dc` datacube object can then be used to load data.
The `app` parameter is a unique name that identifies the notebook; it has no effect on the analysis.

In [None]:
dc = datacube.Datacube(app="04_Loading_data")

## Loading data using `dc.load()`

Loading data from the datacube uses the `dc.load()` function.

The function requires the following minimum arguments:

* `product`: The data product to load (to revise WDC products, see the [Products and measurements](03_Products_and_measurements.ipynb) notebook).
* `x`: The spatial region in the *x* dimension. By default, the *x* and *y* arguments accept queries in the WGS84 geographic coordinate system, identified by the EPSG code *4326*. The dimensions ``longitude`` and ``x`` can be used interchangeably.
* `y`: The spatial region in the *y* dimension. The dimensions ``latitude`` and ``y`` can be used interchangeably.
* `time`: The temporal extent. The time dimension can be specified using a tuple of datetime objects or strings in the "YYYY", "YYYY-MM" or "YYYY-MM-DD" format. 
* `output_crs`: The coordinate reference system of the output, given as an EPSG code (e.g. `epsg:27700`).
* `resolution`: The output spatial resolution, in the units of `output_crs` (e.g. metres for EPSG:27700).

For example, to load 2018 data from the Sentinel-2 L2A product for Aberystwyth in the British National Grid projection at 10 m spatial resolution, use the following parameters:

* `product`: `sen2_l2a_gcp`
* `x`: `(-4.095, -4.076)`
* `y`: `(52.407, 52.422)`
* `time`: `("2018-01-01", "2018-12-31")`
* `output_crs`: `epsg:27700`
* `resolution`: `(-10,10)`
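
As an illustration of the three string formats accepted by `time`, the sketch below expands a partial date into the first and last day it covers. It assumes that a partial date such as `"2018"` or `"2018-02"` stands for the whole year or month. The helper `expand_time_string` is hypothetical and is not a datacube function; it only mimics the idea in plain Python.

```python
import calendar
from datetime import date


def expand_time_string(value):
    """Return (first_day, last_day) covered by 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'.

    Hypothetical helper for illustration only; not part of datacube.
    """
    parts = [int(part) for part in value.split("-")]
    if len(parts) == 1:
        year = parts[0]
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:
        day = date(*parts)
        return day, day
    raise ValueError(f"Unsupported time format: {value!r}")


for text in ("2018", "2018-02", "2018-06-15"):
    start, end = expand_time_string(text)
    print(f"{text!r:14} covers {start} to {end}")
```

Under that assumption, `time="2018"` would request the same period as `time=("2018-01-01", "2018-12-31")`.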

Run the following cell to load all datasets from the `sen2_l2a_gcp` product that match this spatial and temporal extent:

In [None]:
start_load = time()
dataset = dc.load(product="sen2_l2a_gcp",
                  x=(-4.095, -4.076),
                  y=(52.407, 52.422),
                  time=("2018-01-01", "2018-12-31"),
                  output_crs="epsg:27700",
                  resolution=(-10, 10))
end_load = time()

# One "image" is counted per measurement per timestep
n_images = len(dataset.time) * len(dataset.data_vars)

print("Datacube ready")
print(f"Took only {round(end_load - start_load, 2)} seconds to load a whole year of data "
      f"from datacube for the specified extent (i.e., {n_images} images).")

### Visualising the resulting `xarray.Dataset`

In [None]:
dataset

### Interpreting the resulting `xarray.Dataset`
The variable `dataset` now holds an `xarray.Dataset` containing all data that matched the spatial and temporal query parameters passed to `dc.load()`.

*Dimensions* 

* This header identifies the number of timesteps returned in the search (`time: 142`) as well as the number of pixels in the `x` and `y` directions of the data query.

*Coordinates* 

* `time` identifies the date attributed to each returned timestep.
* `x` and `y` are the coordinates for each pixel within the spatial bounds of the query.

*Data variables*

* These are all the measurements available for the nominated product. 
For every date (`time`) returned by the query, the measured value at each pixel (`y`, `x`) is returned as an array for each measurement.
Each data variable is itself an `xarray.DataArray` object ([see below](#Inspecting-an-individual-xarray.DataArray)). 

*Attributes*

* `crs` identifies the coordinate reference system (CRS) of the loaded data. 
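
The four components described above can also be explored without a datacube connection. The sketch below builds a tiny synthetic `xarray.Dataset` with the same layout; all values, coordinates and band contents are invented for illustration, and the `crs` attribute is stored here as a plain string, which may differ from the object a real load attaches.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a loaded product: 2 timesteps on a 3 x 4 pixel grid
times = np.array(["2018-01-05", "2018-01-15"], dtype="datetime64[ns]")
y = np.array([282470.0, 282460.0, 282450.0])  # northings decrease down the rows
x = np.array([257270.0, 257280.0, 257290.0, 257300.0])
shape = (len(times), len(y), len(x))

example = xr.Dataset(
    data_vars={
        "red": (("time", "y", "x"), np.full(shape, 500, dtype="int16")),
        "nir": (("time", "y", "x"), np.full(shape, 2500, dtype="int16")),
    },
    coords={"time": times, "y": y, "x": x},
    attrs={"crs": "EPSG:27700"},  # assumption: plain string used for simplicity
)

print(dict(example.sizes))      # Dimensions: name -> length
print(list(example.coords))     # Coordinates
print(list(example.data_vars))  # Data variables
print(example.attrs["crs"])     # Attributes
```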

### Inspecting an individual `xarray.DataArray`
The `xarray.Dataset` loaded above is itself a collection of individual `xarray.DataArray` objects that hold the actual data for each data variable/measurement. 
For example, all measurements listed under _Data variables_ above (e.g. `blue`, `green`, `red`, `nir`, `swir1`, `swir2`) are `xarray.DataArray` objects.

These `xarray.DataArray` objects can be inspected or interacted with by using either of the following syntaxes:
```
dataset["measurement_name"]
```
or
```
dataset.measurement_name
```
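
Both syntaxes can be tried on a small synthetic dataset, as in the sketch below (the values are invented for illustration). It also shows one practical difference: attribute-style access is read-only, so a new variable has to be added with the dictionary syntax.

```python
import numpy as np
import xarray as xr

# Synthetic dataset with a single "nir" variable (2 timesteps, 2 x 2 pixels)
ds = xr.Dataset(
    {"nir": (("time", "y", "x"), np.arange(8, dtype="int16").reshape(2, 2, 2))},
    coords={"time": np.array(["2018-01-05", "2018-01-15"], dtype="datetime64[ns]")},
)

by_key = ds["nir"]  # dictionary-style access
by_attr = ds.nir    # attribute-style access
print(by_key.identical(by_attr))

# New variables are added with the dictionary syntax
ds["nir_mean"] = ds["nir"].mean(dim="time")
print(list(ds.data_vars))
```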

The ability to access individual variables means that these can be directly viewed, or further manipulated to create new variables. 
For example, run the following cell to access data from the near infra-red satellite band (i.e. `nir`):

In [None]:
dataset.nir

**Note** that the object header informs us that it is an `xarray.DataArray` containing data for the `nir` satellite band. 

Like an `xarray.Dataset`, the array also includes information about the data's **dimensions** (i.e. `(time: 142, y: 171, x: 135)`), **coordinates** and **attributes**.
This particular data variable/measurement contains some additional information that is specific to the `nir` band, including the array's nodata value (i.e. `nodata: -9999`).
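
One common use of the nodata attribute is to mask those pixels before computing statistics. The sketch below is a minimal illustration on a synthetic array; it assumes the nodata value is stored under the `nodata` key of `attrs`, as in the header described above, and the pixel values are invented.

```python
import numpy as np
import xarray as xr

# Synthetic 2 x 2 "nir" array with one nodata pixel
nir = xr.DataArray(
    np.array([[1200, -9999], [1350, 1280]], dtype="int16"),
    dims=("y", "x"),
    name="nir",
    attrs={"nodata": -9999},
)

nodata = nir.attrs["nodata"]

# Keep valid pixels and replace nodata with NaN (the result is floating point)
nir_masked = nir.where(nir != nodata)

print(int(nir_masked.isnull().sum()))  # number of masked pixels
print(float(nir_masked.mean()))        # mean of valid pixels only
```

Without the mask, the `-9999` values would be treated as real reflectance values and would distort the mean.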

> For a more in-depth introduction to `xarray` data structures, refer to the [official xarray documentation](http://xarray.pydata.org/en/stable/data-structures.html)

## Customising the `dc.load()` function

The `dc.load()` function can be tailored to refine a query.

Customisation options include:

* `measurements:` This argument is used to provide a list of measurement names to load, as listed in `dc.list_measurements()`. 
For satellite datasets, measurements contain data for each individual satellite band (e.g. `swir2`). 
By default, all measurements for the product will be returned. In most cases, only a few bands are needed; specify these using this argument.
* `crs:` The coordinate reference system (CRS) of the query's `x` and `y` coordinates is assumed to be `WGS84`/`EPSG:4326` unless the `crs` field is supplied, even if the stored data is in another projection or the `output_crs` is specified. 
The `crs` parameter is required if the query's coordinates are in any other CRS.

Example syntax on the use of these options follows in the cells below.

> For help or more customisation options, run `help(dc.load)` in an empty cell or visit the function's [documentation page](https://datacube-core.readthedocs.io/en/latest/dev/api/generate/datacube.Datacube.load.html)


### Specifying measurements
By default, `dc.load()` will load *all* measurements in a product.

To load data from the `red`, `green` and `blue` satellite bands only, add `measurements=["red", "green", "blue"]` to the query:

In [None]:
start_load = time()

dataset_rgb = dc.load(product="sen2_l2a_gcp",
                      x=(-4.095, -4.076),
                      y=(52.407, 52.422),
                      time=("2018-01-01", "2018-12-31"),
                      output_crs="epsg:27700",
                      resolution=(-10, 10),
                      measurements=["red", "green", "blue"])

end_load = time()

# One "image" is counted per measurement per timestep
n_images = len(dataset_rgb.time) * len(dataset_rgb.data_vars)

print("Datacube ready")
print(f"Took only {round(end_load - start_load, 2)} seconds to load a whole year of data "
      f"from datacube for the specified extent (i.e., {n_images} images).\n"
      "Compare this with the time reported for the full load above: loading only the "
      "needed measurements usually reduces the loading time.")

In [None]:
dataset_rgb

**Note** that the **Data variables** component of the `xarray.Dataset` now includes only the measurements specified in the query (i.e. the `red`, `green` and `blue` satellite bands).
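
The saving from loading fewer measurements can be estimated with simple arithmetic. The sketch below uses the dimensions quoted earlier in this notebook (`time: 142, y: 171, x: 135`) and assumes 2 bytes per value (16-bit integers) and six measurements for the comparison; both figures are assumptions for illustration, and `estimate_bytes` is a hypothetical helper.

```python
def estimate_bytes(n_times, n_y, n_x, n_measurements, bytes_per_value=2):
    """Approximate in-memory size of a loaded dataset, ignoring coordinates.

    Hypothetical helper; bytes_per_value=2 assumes 16-bit integer data.
    """
    return n_times * n_y * n_x * n_measurements * bytes_per_value


rgb_bytes = estimate_bytes(142, 171, 135, n_measurements=3)
six_band_bytes = estimate_bytes(142, 171, 135, n_measurements=6)

print(f"Three measurements: {rgb_bytes / 1e6:.1f} MB")
print(f"Six measurements:   {six_band_bytes / 1e6:.1f} MB")
```

For a dataset that has already been loaded, the `nbytes` property (e.g. `dataset_rgb.nbytes`) reports the actual size.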

### Loading data for coordinates in any CRS
By default, `dc.load()` assumes that the queried `x` and `y` coordinates are in the `WGS84`/`EPSG:4326` CRS.
If these coordinates are in a different coordinate system, specify this using the `crs` parameter.

The example cell below loads data for a set of `x` and `y` coordinates defined in British National Grid (`EPSG:27700`), ensuring that the `dc.load()` function accounts for this by including `crs="EPSG:27700"`:

In [None]:
dataset_custom_crs = dc.load(product="sen2_l2a_gcp",
                             x=(257268, 259493),
                             y=(280841, 282479),
                             crs="EPSG:27700",
                             time=("2018-01-01", "2018-12-31"),
                             output_crs= "epsg:27700",
                             resolution= (-10,10),
                             measurements= ['red','green','blue'])

dataset_custom_crs
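
As a rough check on the `x` and `y` dimensions of a load like the one above, the grid size can be estimated from the extent and the resolution when both are in the same CRS. The sketch below assumes that `resolution` is ordered `(y, x)` and that the extent is snapped outwards to whole multiples of the pixel size. `estimate_grid_shape` is a hypothetical helper, not part of datacube, and the shape actually returned may differ by a pixel.

```python
import math


def estimate_grid_shape(x, y, resolution):
    """Estimate (rows, columns) for an extent given in the output CRS.

    Hypothetical helper: assumes resolution is (y, x) and that the extent
    is snapped outwards to multiples of the pixel size.
    """
    res_y, res_x = resolution
    pixel_x, pixel_y = abs(res_x), abs(res_y)

    x_min = math.floor(min(x) / pixel_x) * pixel_x
    x_max = math.ceil(max(x) / pixel_x) * pixel_x
    y_min = math.floor(min(y) / pixel_y) * pixel_y
    y_max = math.ceil(max(y) / pixel_y) * pixel_y

    n_cols = round((x_max - x_min) / pixel_x)
    n_rows = round((y_max - y_min) / pixel_y)
    return n_rows, n_cols


rows, cols = estimate_grid_shape(x=(257268, 259493),
                                 y=(280841, 282479),
                                 resolution=(-10, 10))
print(f"Approximately {rows} rows (y) by {cols} columns (x)")
```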

## Loading data using the query dictionary syntax
It is often useful to re-use a set of query parameters to load data from multiple products.
To achieve this, load data using the "query dictionary" syntax.
This involves placing the query parameters inside a Python dictionary object which can be re-used for multiple data loads.

Query dictionaries can contain any set of parameters that would usually be provided to `dc.load()`:

In [None]:
query = {"x":(-4.095, -4.076),
         "y":(52.407, 52.422),
         "time":("2018-01-01", "2018-12-31"),
         "output_crs": "epsg:27700",
         "resolution": (-10,10)}

The query dictionary object can be added as an input to `dc.load()`.

> The `**` syntax below is Python's "keyword argument unpacking" operator.
This operator takes the named query parameters listed in the query dictionary (e.g. `"x":(-4.095, -4.076)`), and "unpacks" them into the `dc.load()` function as new arguments. 
For more information about unpacking operators, refer to the [Python documentation](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists)

In [None]:
dataset_sentinel2 = dc.load(product="sen2_l2a_gcp",
                            **query)

dataset_sentinel2

In [None]:
dataset_nfi = dc.load(product="nfi_woodland_fr",
                        **query)

dataset_nfi
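
The keyword unpacking used with the query dictionary can be tried without a datacube connection. In the sketch below, `fake_load` is a hypothetical stand-in for `dc.load()` that simply echoes the arguments it receives, and the 2019 dates are an invented example of overriding one parameter.

```python
def fake_load(product, **params):
    """Hypothetical stand-in for dc.load(): returns the arguments it was given."""
    return {"product": product, **params}


query = {"x": (-4.095, -4.076),
         "y": (52.407, 52.422),
         "time": ("2018-01-01", "2018-12-31"),
         "output_crs": "epsg:27700",
         "resolution": (-10, 10)}

# Unpacking the dictionary is equivalent to writing every argument out by hand
unpacked = fake_load(product="sen2_l2a_gcp", **query)
explicit = fake_load(product="sen2_l2a_gcp",
                     x=(-4.095, -4.076),
                     y=(52.407, 52.422),
                     time=("2018-01-01", "2018-12-31"),
                     output_crs="epsg:27700",
                     resolution=(-10, 10))
print(unpacked == explicit)

# Override one entry while leaving the original dictionary unchanged
query_2019 = {**query, "time": ("2019-01-01", "2019-12-31")}
print(query["time"], query_2019["time"])
```

Building a modified copy in this way keeps the original `query` available for later loads.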

## Loading data "like" another dataset

Another option for loading matching data from multiple products is to use `dc.load()`'s `like` parameter.
This will copy the spatial and temporal extent and the CRS/resolution from an existing dataset, and use these parameters to load new data from a new product.

The example cell below loads data from the NRW saltmarshes product that exactly matches the extent, CRS and resolution of the `dataset_sentinel2` dataset loaded earlier:


In [None]:
dataset_saltmarsh = dc.load(product="nrw_saltmarshes_lle",
                            like=dataset_sentinel2)

dataset_saltmarsh

## Recommended next steps

For more advanced information about working with Jupyter Notebooks or JupyterLab, see the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/user/notebook.html).

To continue with this beginner's guide, work through the notebooks in the following order:

1. **[Introduction to Jupyter notebooks](01_Introduction_jupyter_notebooks.ipynb)**
2. **[Wales Open Data Cube](02_Wales_Open_Data_Cube.ipynb)**
3. **[Products and measurements](03_Products_and_measurements.ipynb)**
4. **Loading data in WDC (this notebook)**
5. **[Plotting](05_Plotting.ipynb)**
6. **[Using Sentinel-2 data](06_Using_Sentinel2_data.ipynb)**
7. **[Calculating band indices](07_Calculating_band_indices.ipynb)**
8. **[Generating composites](08_Generating_composites.ipynb)**
9. **[Zonal statistics](09_Zonal_statistics.ipynb)**


Once you have worked through the beginner's guide, you can explore the "Case Studies" directory, which provides examples of applications within the Wales Open Data Cube.