# Multidimensional Data and Xarray Fundamentals

This tutorial is a modified and simplified version of the [Xarray tutorials](https://xarray-contrib.github.io/xarray-tutorial/index.html). The tutorials are freely available for download on [GitHub](https://github.com/xarray-contrib/xarray-tutorial). Some material has also been taken from the [Geohack week](https://geohackweek.github.io/nDarrays/01-introduction/).

## Learning Objectives

- Understand multidimensional data in geosciences
- Provide an overview of xarray
- Describe the xarray data structures, the DataArray and the Dataset, and
  the components that make them up
- Load an xarray dataset from a netCDF file
- View and set attributes


## Overview of multidimensional data

Unlabelled, N-dimensional arrays of numbers are the most widely used data structure in scientific computing. Geoscientists have a particular need for structuring their data as arrays. For example, we commonly work with sets of climate variables (e.g. temperature and precipitation) that vary in space and time and are represented on a regularly-spaced grid. Often we need to subset a large global grid to look at data for a particular region, or select a specific time slice. Then we might want to apply statistical functions to these subsetted groups to generate summary information.
These data can be handled with NumPy’s ndarray, because we essentially deal with indexed sets of data. However, these arrays lack a meaningful representation of the metadata associated with their data, and implementing such functionality is left to individual users and domain-specific packages.

Real-world datasets are usually more than just raw numbers; they have
labels which encode information about how the array values map to locations in
space, time, etc.

Here is an example of how we might structure a dataset for a weather forecast:

<img src="http://xarray.pydata.org/en/stable/_images/dataset-diagram.png" align="center" width="80%">

You'll notice multiple data variables (temperature, precipitation), coordinate
variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these
fit into Xarray's data structures below.

### Conventional Approach: Working with Unlabelled Arrays
Multidimensional array data are often stored in user-defined binary formats, and distributed with custom Fortran or C++ libraries used to read and process the data. Users are responsible for setting up their own file structures and custom codes to handle these files. Subsetting the data involves reading everything into an in-memory array, and then using a series of nested loops with conditional statements to look for a specific range of index values associated with the temporal or spatial slice needed. Also, clever use of matrix algebra is often used to summarize data across spatial and temporal dimensions.

### Challenges:
The biggest challenge in working with N-dimensional arrays in this fashion is the fact that the data are almost disassociated from their metadata. Users are left with the task of tracking the meaning behind array indices using domain-specific software, often leading to inefficiencies and errors. Common pitfalls often occur in the form of questions like “is the time axis of my array in the first or third index position?”, or “does my array of timestamps still align with my data after resampling?”.
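
The bookkeeping described above can be sketched with plain NumPy. This is only an illustration on a made-up array: the grid size, the `lats` and `lons` lookup tables and the axis order are assumptions, not properties of any real dataset.

```python
import numpy as np

# Hypothetical monthly field: 24 time steps on a 4 x 8 (lat x lon) grid.
rng = np.random.default_rng(0)
temperature = rng.normal(15.0, 5.0, size=(24, 4, 8))

# Lookup tables that the user has to maintain alongside the array.
lats = np.array([-30.0, -10.0, 10.0, 30.0])
lons = np.arange(0.0, 360.0, 45.0)

# Time mean: correct only if axis 0 really is the time axis.
time_mean = temperature.mean(axis=0)

# Time series at 10N, 90E: the integer indices are found by hand.
i_lat = int(np.abs(lats - 10.0).argmin())
i_lon = int(np.abs(lons - 90.0).argmin())
point_series = temperature[:, i_lat, i_lon]

print(time_mean.shape)     # (4, 8)
print(i_lat, i_lon)        # 2 2
print(point_series.shape)  # (24,)
```

Nothing in `temperature` itself records which axis is time or which row is 10N, so a transposed array would run through the same code and silently give a wrong answer.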

### The network Common Data Format
The network Common Data Format, or [NetCDF](https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#whatisit), was created in the early 1990s, and set out to solve some of the challenges in working with N-dimensional arrays. NetCDF is a collection of self-describing, machine-independent binary data formats and software tools that facilitate the creation, access and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. NetCDF was built by the climate science community at a time when regional climate models were beginning to produce larger and larger output files. Another format, [HDF5](https://www.hdfgroup.org/), has been used for many applications including distribution of remote sensing datasets. These two formats have since converged: the latest version, netCDF-4, uses HDF5 as its storage format, with some restrictions.

One benefit of Common Data Formats is that they are structured in ways that enable rapid subsetting and analysis using simple command line tools. For example, the climate community has developed their own [netCDF toolkits](http://www.unidata.ucar.edu/software/netcdf/software.html) that accomplish tasks like subsetting and grouping. Similar tools exist for [HDF5](https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf). Therefore many researchers utilize these tools exclusively in their analysis.

### NetCDF in practice
NetCDF has been widely adopted as a standard format for distributing N-dimensional arrays. Although many geoscience communities rely entirely on existing NetCDF software tools for processing and visualizing their data, others simply use NetCDF as a convenient format for serializing their arrays. In many applications, existing NetCDF tools do not provide the flexibility needed for a specific research question, and users end up reading arrays into memory. They then perform statistical and subsetting operations using conventional coding methods (e.g. looping over array indices) described above.

### Handling large arrays
The NetCDF format has no limit on file sizes. However, any analysis tools that read data from a NetCDF array into memory for some computational operation will be limited by that particular machine’s available memory. As many multidimensional datasets grow in size, for example due to increases in model resolution and remote sensing capabilities, we are becoming increasingly limited in our ability to handle these large datasets.
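
A back-of-the-envelope calculation shows how quickly memory becomes the limit. The grid resolution, record length and data type below are hypothetical choices for illustration only.

```python
# Hypothetical daily global field on a 0.25-degree grid, stored as float32.
n_lat = 720           # 180 degrees / 0.25
n_lon = 1440          # 360 degrees / 0.25
n_time = 365 * 40     # 40 years of daily steps, ignoring leap days
bytes_per_value = 4   # float32

total_bytes = n_lat * n_lon * n_time * bytes_per_value
total_gib = total_bytes / 1024**3

print(f"{total_gib:.1f} GiB for one variable")  # 56.4 GiB for one variable
```

Under these assumptions a single variable already exceeds the memory of most laptops, before any intermediate copies are made during the analysis.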

## What Is Xarray?

- Xarray expands on the capabilities of NumPy arrays, providing
  streamlined data manipulation.

- Xarray's interface is based largely on the netCDF data model (variables,
  attributes, and dimensions), but it goes beyond the traditional netCDF
  interfaces.

- Xarray is motivated by weather and climate use cases but is **domain agnostic**


## Xarray Data Structures

- xarray has 2 fundamental data structures:

  - `DataArray`, which holds a single multi-dimensional variable together with its coordinates and attributes
  - `Dataset`, which holds multiple variables (each one a DataArray) that potentially share the same coordinates and common global attributes

Both classes are most easily understood by reading data from an existing NetCDF file. The file used in this example contains monthly means of sea surface temperature. It is loaded as a dataset using the `open_dataset` function.

If you get an error (**read the error**, it's at the bottom), it may be that the file you want to open is not in this folder, or that netcdf4 is not installed. To install netcdf4, open a terminal and type

`conda install netcdf4`

restart the kernel and then retry.

In [None]:
import xarray as xr

In [None]:
# Load the mean sea surface temperature dataset (the engine keyword is not necessary)
ds = xr.open_dataset("./sst.mnmean.nc", engine="netcdf4")

# xarray's HTML representation
ds

When coupled with the Jupyter notebook, `xarray` can show very rich representations of the dataset information, which help you browse the attributes and a condensed view of the data.

If you prefer a text-based representation, you can set `display_style='text'` by running the line below
`xr.set_options(display_style="text")`
Or you can simply display the netCDF information stored in the file, as you would obtain it with the command `ncdump` run in the terminal

In [None]:
# netCDF representation
ds.info()

### `Dataset`

- Xarray's `Dataset` is a dict-like container of labeled arrays (`DataArrays`)
  with aligned dimensions.
- It is designed as an in-memory representation of a
  netCDF dataset.
- The dict-like interface of the dataset itself can be
  used to access any `DataArray` in a `Dataset`.
  
Datasets have the following key properties:

| Attribute   | Description                                                                                                                              |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `data_vars` | dict-like container of `DataArray` objects corresponding to data variables.                                                              |
| `dims`      | dictionary mapping from dimension names to the fixed length of each dimension (e.g., {`lat`: 6, `lon`: 6, `time`: 8}).                   |
| `coords`    | a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings) |
| `attrs`     | dictionary holding the metadata pertaining to the dataset (global attributes)                                                            |
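
The four properties in the table can be explored on a small synthetic `Dataset` built by hand. This is a minimal sketch: the variable names, coordinate values and the `title` attribute are made up for illustration and are not taken from the SST file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a small gridded dataset (values are random).
rng = np.random.default_rng(42)
time = pd.date_range("2020-01-01", periods=3, freq="D")
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0, 130.0]

toy = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"), rng.normal(20.0, 2.0, (3, 2, 4))),
        "precipitation": (("time", "lat", "lon"), rng.random((3, 2, 4))),
    },
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"title": "toy example"},
)

print(list(toy.data_vars))                # names of the data variables
print(dict(toy.sizes))                    # dimension name -> length
print(list(toy.coords))                   # names of the coordinates
print(toy.attrs["title"])                 # toy example
print(type(toy["temperature"]).__name__)  # DataArray
```

Each entry of `data_vars` is a `(dims, values)` pair, which is how the shared dimensions are declared when a dataset is assembled from raw arrays.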


In [None]:
# list the variables in our dataset, including their dimensions (this dataset contains only one variable)
ds.data_vars

### Coordinates vs dimensions

Dimensions and coordinates may seem synonymous, but they are conceptually different in the NetCDF model. 

- **Dimensions** count the number of elements along each axis of the multidimensional DataArray. 
    - Dimensions have names to identify them, and they set the size of the variables defined on them. _The sst variable contained in this `Dataset` has 3 dimensions (time, lat, lon)_
    - You may have several dimensions in a `Dataset`, and not all variables need to have the same dimensions
    - Each dimension has a fixed length, which is stored together with its name

In [None]:
# dataset dimensions
print('Dimensions are stored in a dict-like object:',ds.dims)
print('The length of the time dimension is:',ds.sizes['time'])  # sizes maps dimension names to lengths

- **Coordinates** are *variables* in all senses, but the values of a dimension (index) coordinate cannot be modified in place, while those of a data variable can. In the simplest NetCDF data model, a variable with the same name as a dimension is assumed to be a coordinate. 
    - Coordinates are the system of reference of the data variables
    - They allow you to visualize the data in the space in which they have been defined, and to connect the abstract data structure objects to real world objects (locations in space and time);
    - Check the output of the `ds.info()` command executed above. There are specific attributes (metadata) that indicate that *lat, lon and time* are coordinates (the *axis* attribute). The attributes may also indicate whether a coordinate is centred in the spatial grid or is an average over a temporal period;
    - xarray looks for variables with coordinate attributes and, if none are found, applies the simple model in which variables with the same name as a dimension are coordinates; 
    - The coordinate system of this `Dataset` is *regular*, because the coordinates can be represented with one dimensional variables. xarray creates the 2D grids for carrying out any operation on the data
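
The difference between the two concepts can be seen on a synthetic dataset in which one dimension has a coordinate variable and the other does not. The names `obs` and `station` are hypothetical and only serve this sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic example: "time" has a coordinate variable, "station" does not.
obs = xr.Dataset(
    data_vars={"sst": (("time", "station"), np.zeros((4, 3)))},
    coords={"time": pd.date_range("2021-01-01", periods=4, freq="D")},
)

print(dict(obs.sizes))          # both dimensions have a length
print("time" in obs.coords)     # True
print("station" in obs.coords)  # False

# Integer selection only needs the dimension name;
# label-based selection needs a coordinate to look the label up in.
first_station = obs.sst.isel(station=0)
one_day = obs.sst.sel(time="2021-01-02")
```

A dimension without a coordinate still has a name and a length, but there are no labels to connect its positions to locations in space or time.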

In [None]:
# visualize the dataset coordinates
ds.coords

This gives you a quick glimpse at the content of your coordinates, so you can understand
- the spatial resolution (distance between coordinate points along the axes)
- the temporal frequency

The coordinate attributes often give you all the necessary information.

In [None]:
# extract a coordinate variable from the coordinates
ds.coords['time']

### Global attributes
The `Dataset` object holds all the global attributes contained in the NetCDF file. They are meant to describe the history of the data and usually give you information about the source, who to contact and how to cite the data.

**Note**: this information exists only if the data originator included it. You can tell how poor a data management plan is from the absence of metadata in the global attributes.

In [None]:
# dataset global attributes are stored in a dictionary object
print(type(ds.attrs))
ds.attrs

In [None]:
# the dictionary allows you to access each single attribute, e.g.
ds.attrs['project']

### `DataArray`

Each variable is a `DataArray`. The `DataArray` is xarray's implementation of a labeled, multi-dimensional array.
It has several key properties:

| Attribute | Description                                                                                                                              |
| --------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `data`    | `numpy.ndarray` or `dask.array` holding the array's values.                                                                              |
| `dims`    | dimension names for each axis. For example: (`x`, `y`, `z`) or (`lat`, `lon`, `time`).                                                   |
| `coords`  | a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings) |
| `attrs`   | a dictionary to hold arbitrary attributes/metadata (such as units)                                                                       |
| `name`    | an arbitrary name of the array                                                                                                           |
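
As a minimal sketch of these properties, a `DataArray` can also be created directly from a NumPy array. The values, coordinate labels, units and the name `toy_sst` are invented for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic 2 x 3 field; names, values and units are made up.
arr = xr.DataArray(
    data=np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=("lat", "lon"),
    coords={"lat": [-10.0, 10.0], "lon": [0.0, 90.0, 180.0]},
    attrs={"units": "degC"},
    name="toy_sst",
)

print(type(arr.data).__name__)  # ndarray
print(arr.dims)                 # ('lat', 'lon')
print(list(arr.coords))         # names of the coordinates
print(arr.attrs)                # {'units': 'degC'}
print(arr.name)                 # toy_sst
```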

`xarray` borrows much of its interface from `pandas`. Hence, a DataArray can be accessed using either of the two usual syntaxes:

In [None]:
# show information of the DataArray containing the sst variable
ds['sst']  

In [None]:
# equivalent command
ds.sst

In [None]:
# The actual (numpy) array data
sst = ds.sst.data
print(type(sst))
sst

Because every variable may have *different dimensions, coordinates and attributes*, this information is stored within each `DataArray`

In [None]:
# dataarray/variable dimensions
ds.sst.dims

In [None]:
# dataarray/variable coordinates
ds.sst.coords

In [None]:
# extracting a coordinate variable to find out the spatial resolution
ds.sst.lon

In [None]:
# dataarray/variable attributes (specific to this variable only)
ds.sst.attrs

It is very quick to set an arbitrary attribute on a data variable/DataArray: you just create a new dictionary entry. **Note**: this does not change the file on disk! To save the change you need to export the dataset to netCDF using the method `ds.to_netcdf()`, either creating a new file or overwriting the previous one.

In [None]:
ds.sst.attrs['extended_units'] = 'Degrees Centigrade'
ds.sst.attrs

## Extracting and Visualizing data
xarray builds on pandas and matplotlib. Hence you can extract data by indexing and visualize them, optionally adding Cartopy mapping features. The matplotlib keywords specifying the type of plot are passed through the `DataArray.plot()` method. xarray makes a few educated guesses based on the shape of the data you have extracted. If the object is 1D, it shows a time series or a line `plot`; if it's 2D, it shows a `pcolormesh` (changing the colormap depending on whether the data are all positive or both positive and negative); in all other cases it displays a histogram. 
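
The rule of thumb above can be written down as a small function. `guess_plot_kind` is a hypothetical helper for this tutorial, not part of xarray, and it assumes that dimensions of length 1 are ignored when the plot type is chosen.

```python
import numpy as np
import xarray as xr

def guess_plot_kind(obj: xr.DataArray) -> str:
    """Hypothetical helper mirroring the rule of thumb described above."""
    ndim = obj.squeeze().ndim  # assumption: length-1 dimensions do not count
    if ndim == 1:
        return "line"
    if ndim == 2:
        return "pcolormesh"
    return "hist"

# Synthetic cube with dimensions (time, lat, lon).
cube = xr.DataArray(np.zeros((5, 3, 4)), dims=("time", "lat", "lon"))

print(guess_plot_kind(cube.isel(lat=0, lon=0)))  # line
print(guess_plot_kind(cube.isel(time=0)))        # pcolormesh
print(guess_plot_kind(cube.isel(time=[0])))      # pcolormesh
print(guess_plot_kind(cube))                     # hist
```

Checking the dimensions of the extracted object before calling `.plot()` is a quick way to anticipate which kind of figure will appear.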

### Indexing
Indexing is used to select specific elements from xarray objects. Let’s select some data from the SST `DataArray`. We need to know that this DataArray has dimensions of time and two-dimensional space (latitude and longitude): the first array index is time, the second is latitude, and the third is longitude.

You are probably already used to conventional ways of indexing an array. You would then use positional indexing:

In [None]:
# select one variable and pick the first entry along all the axes
ds.sst[0,0,0]

In [None]:
# Plot one time step (indexing only the first axis keeps every lat and lon)
ds.sst[0].plot()

In [None]:
ds.sst[:,10,0].plot()

This method of handling arrays should be familiar to anyone who has worked with arrays in MATLAB or NumPy. Challenges with this approach: 
- *you need to know the order of the dimensions (time, lat, lon in this case, but it may change in different datasets)* 
- *it is not simple to associate an integer index position with something meaningful in our data (how do I know that index 10 of the second dimension is latitude 68S?)*

For example, we would have to write some function to map a specific date in the time dimension to its associated integer. **Note that even if you use positional indexing, xarray still preserves the metadata, so when you plot the extracted data you obtain an annotated figure!**

xarray lets us index by dimension name instead of axis order by using the methods 
- `isel` extracts data based on integer positions along named dimensions (you need to know the names, but not the order)
- `sel` extracts data using the coordinate values

They are equivalent to `iloc` and `loc` methods in `pandas`. 
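
The three ways of indexing can be compared on a small synthetic array where the expected value is easy to check. The array `cube` and its coordinate values are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube with dimension order (time, lat, lon); values are 0..23.
cube = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.to_datetime(["2020-01-01", "2020-02-01"]),
        "lat": [10.0, 0.0, -10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

by_position = cube[1, 2, 3]                 # must know the axis order
by_index = cube.isel(lon=3, time=1, lat=2)  # integer positions, any keyword order
by_label = cube.sel(time="2020-02-01", lat=-10.0, lon=270.0)  # coordinate values

print(by_position.item(), by_index.item(), by_label.item())  # 23.0 23.0 23.0
```

All three expressions return the same element; only `sel` lets you write the request in terms of dates and locations.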

In [None]:
da = ds.sst
da.isel(lon=0,time=10,lat=0)

In [None]:
da.isel(lat=60, lon=40).plot()

With the method `da.isel()` you still need to know the correspondence between indexes and values. `da.sel()` allows you to do label-based indexing, with all the power of the pandas time series capabilities. The following example also shows how to pass matplotlib keyword arguments (kwargs) through xarray's plotting function:

In [None]:
da.sel(lat=-32, lon=80).plot(figsize=(12,8),marker='o')

In [None]:
da.sel(lat=50.0, lon=200.0, time="2020")

This method works if you match the exact coordinates of the data. If the coordinate *label* does not exist, a `KeyError` is raised. 

xarray implements the keyword `method` to enable inexact lookups, for example `method='nearest'` (nearest neighbour) or `method='backfill'` (next valid label)
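
A minimal sketch of exact versus inexact lookups on a synthetic 2-degree latitude grid follows. The array `profile` is made up; `tolerance` is an optional argument of `.sel()` that caps the accepted distance to the nearest label.

```python
import numpy as np
import xarray as xr

# Synthetic 1D profile on a 2-degree latitude grid: -10, -8, ..., 10.
lat = np.arange(-10.0, 11.0, 2.0)
profile = xr.DataArray(lat * 10.0, dims="lat", coords={"lat": lat})

try:
    profile.sel(lat=5.0)  # 5.0 is not a grid point
except KeyError as err:
    print("KeyError:", err)

# The nearest grid point to 5.2 is 6.0.
nearest = profile.sel(lat=5.2, method="nearest")
print(nearest.lat.item(), nearest.item())  # 6.0 60.0

# With a tolerance of 0.5 degrees no grid point is close enough to 5.0.
try:
    profile.sel(lat=5.0, method="nearest", tolerance=0.5)
except KeyError:
    print("no grid point within 0.5 degrees of 5.0")
```

Note that the returned object carries the coordinate of the grid point that was actually found, so you can always check how far it is from the value you asked for.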

In [None]:
# lat=51.0 is not expected to be a grid point, in which case the exact lookup raises a KeyError
da.sel(lat=51.0, lon=200.0, time="2020")

In [None]:
da.sel(lat=51., lon=200., method='nearest').plot()

The `slice` function can also be used, to select a range of coordinate values. Note that the method parameter `nearest` is not yet supported if any of the arguments to `.sel()` is a slice object
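
Two details of label-based slicing are worth checking on a synthetic example: both end points are included, and the slice has to follow the order in which the coordinate is stored. The array `cube` and its coordinates are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly cube; latitude is stored from north to south.
cube = xr.DataArray(
    np.zeros((12, 5, 4)),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2019-01-01", periods=12, freq="MS"),
        "lat": [20.0, 10.0, 0.0, -10.0, -20.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Label-based slices include both end points: March, April and May.
spring = cube.sel(time=slice("2019-03", "2019-05"))
print(spring.sizes["time"])  # 3

# The slice must follow the order of the coordinate values.
tropics = cube.sel(lat=slice(10.0, -10.0))
print(tropics.sizes["lat"])  # 3
print(cube.sel(lat=slice(-10.0, 10.0)).sizes["lat"])  # 0
```

An empty result after slicing is often a sign that the coordinate is stored in descending order, so inspecting the coordinate first saves some confusion.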

In [None]:
# select a given period of time
da.sel(time=slice('2019-05', '2020-07')).plot()

<div class="alert alert-block alert-warning">
But wait, why do we see a histogram? What were you expecting?
    
<em> Think about the dimension of the extracted object... </em>
</div>

In [None]:
# slicing can also be done along other axes
da.sel(time='2019-01',lat=-20,lon=slice(-50,80)).plot(marker='s')

<div class="alert alert-block alert-warning">
Where are the values with negative longitudes? 
    
<em> A quick look at the lon coordinate will give the answer... </em>
    
Why are there missing values?
</div>

In [None]:
da.sel(time='2019-07',lat=slice(-20,-70),lon=slice(250,360)).plot()

### Mapping
This is very simple. If the axes on which you are plotting the object is a `GeoAxes` instance, the plot becomes a map!
Since xarray's default plotting functionality builds on matplotlib, we can
seamlessly use cartopy to make nice maps:

1. Specify a `projection` for the plot when creating a new axis `axis`.
2. Explicitly ask xarray to plot to axis `axis` by passing the keyword argument `ax=axis`.
3. Specify the projection of the data using `transform` (`PlateCarree` here) in
   `.plot()`.

In [None]:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

axis = plt.axes(projection=ccrs.PlateCarree())
da.sel(time='2019-07').plot(ax=axis,transform=ccrs.PlateCarree(),
                           cbar_kwargs={'orientation': 'vertical', 'shrink': 0.6})
axis.set_extent([-110,10,-20,-70]) # now you can use all cartopy methods on the axis
axis.coastlines()  
gl = axis.gridlines(draw_labels=True)
gl.right_labels = False
gl.top_labels = False

In [None]:
fig, axis = plt.subplots(1, 1,figsize=(10,10), subplot_kw=dict(projection=ccrs.Orthographic(0, -30)))

ds.sst.isel(time=1).plot(
    ax=axis,
    transform=ccrs.PlateCarree(),  # this is important: it declares that the data are on a regular lat/lon grid
    vmin=0., vmax=30., # these are matplotlib kwargs
    # some arguments passed to control the colorbar
    cbar_kwargs={"orientation": "horizontal", "shrink": 0.7},
    robust=True,
)
axis.coastlines()  # now you can use all cartopy methods on the axis
axis.gridlines()
# The parameter robust=True sets the color limits from the 2nd and 98th percentiles of the data,
# so that outliers do not stretch the colorbar. It only affects limits that are not given explicitly: here vmin and vmax are set, so they take precedence.

## Masking
Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. An example is selecting a given region, or all the gridpoints that have temperature larger than a given value.

To do this type of selection in xarray, we use the method `where()`:

In [None]:
# tropical cyclones develop in regions where the surface temperature is larger than 26 degC
da.sel(time='2019-07').where(da>26.).plot()

In [None]:
# which is better visualized with the mapping
fig,axis = plt.subplots(figsize=(15,7),subplot_kw=dict(projection=ccrs.PlateCarree()))
da.sel(time='2019-07').where(da>26.).plot(ax=axis,
                                          transform=ccrs.PlateCarree(),
                                         cbar_kwargs={'orientation': 'horizontal', 'shrink': 0.8})
axis.set_extent([-179,179,40,-40])
axis.coastlines()
gl=axis.gridlines(draw_labels=True)
gl.right_labels=False
gl.top_labels=False

Masking can also be used to extract a given region and to drop all the other points from the dataset. In this case, you use the keyword `drop=True`. This will return an object that covers only a portion of the original one. 

_Note: this may be an expensive operation and sometimes it's not efficient. Do it only if you need to reduce the memory footprint._
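
The effect of `drop=True` is easiest to see on a tiny synthetic field where the shapes can be checked by eye. The array `field` and the box limits are made up for this sketch.

```python
import numpy as np
import xarray as xr

# Synthetic 3 x 4 field with values 0..11.
field = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 0.0, -10.0], "lon": [0.0, 10.0, 20.0, 30.0]},
)

# Same shape as the input; values failing the condition become NaN.
masked = field.where(field > 5.0)
print(masked.shape, int(masked.isnull().sum()))  # (3, 4) 6

# drop=True trims the labels where the condition is False everywhere.
box = (field.lon >= 10.0) & (field.lon <= 20.0) & (field.lat <= 0.0)
region = field.where(box, drop=True)
print(region.shape)  # (2, 2)
```

Without `drop=True` the result keeps the full grid and can be combined directly with the original data; with it, the result is smaller but no longer aligned with the full grid.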

In [None]:
import numpy as np
mask = np.logical_and((da.lon>0) & (da.lon<=30),(da.lat<-20) & (da.lat>=-36))
region = da.sel(time='2019-07').where(mask,drop=True)
print(region)
region.plot()

## Going Further

- Xarray Documentation on Data Structures:
  http://xarray.pydata.org/en/latest/data-structures.html
- Xarray Documentation on Reading files and writing files:
  https://xarray.pydata.org/en/stable/io.html
- Xarray Documentation on Indexing:
  http://xarray.pydata.org/en/stable/indexing.html
