# Xarray for multidimensional gridded data

*This material is based on the excellent [Research Computing in Earth Science](https://rabernat.github.io/research_computing_2018/) course by Ryan Abernathey, CC-BY-NC.*

[xarray](http://xarray.pydata.org/en/stable/) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

---

Pandas provides a way to keep track of additional "metadata" surrounding tabular datasets, including "indexes" for each row and labels for each column. These features, together with Pandas' many useful routines for all kinds of data munging and analysis, have made Pandas one of the most popular Python packages in the world.

However, not all Earth science datasets fit easily into the "tabular" model (i.e. rows and columns) imposed by Pandas. In particular, we often deal with _multidimensional data_. By _multidimensional data_ (also often called _N-dimensional_), I mean data with many independent dimensions or axes. For example, we might represent Earth's surface temperature $T$ as a three-dimensional variable

$$ T(x, y, t) $$

where $x$ is longitude, $y$ is latitude, and $t$ is time.

The point of xarray is to provide pandas-level convenience for working with this type of data. 
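As a minimal sketch of what $T(x, y, t)$ looks like as a labeled array, the block below builds a small `DataArray` from random numbers. The grid, the dates and the name `T` are made up for illustration; they are not taken from the example dataset used later.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coarse grid: 4 latitudes, 6 longitudes, 3 monthly time steps.
lat = np.array([-45.0, -15.0, 15.0, 45.0])
lon = np.array([0.0, 60.0, 120.0, 180.0, 240.0, 300.0])
time = pd.date_range("2000-01-01", periods=3, freq="MS")

# Random placeholder values standing in for a temperature field.
rng = np.random.default_rng(0)
values = 15 + 10 * rng.standard_normal((time.size, lat.size, lon.size))

# Each axis gets a name (dims) and a set of labels (coords).
T = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="T",
)
print(T.dims)
print(T.shape)
```

With the axes named, later operations can refer to `"time"`, `"lat"` or `"lon"` instead of axis positions 0, 1 and 2.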



![xarray data model](https://github.com/pydata/xarray/raw/master/doc/_static/dataset-diagram.png)

## Reading the example data

In [None]:
! wget http://ldeo.columbia.edu/~rpa/NOAA_NCDC_ERSST_v3b_SST.nc -P data

In [None]:
import xarray as xr
import matplotlib.pyplot as plt  # needed later for plt.legend() and plt.figure()

In [None]:
ds = xr.open_dataset('data/NOAA_NCDC_ERSST_v3b_SST.nc')
ds

## Xarray data structures

Xarray has two fundamental data structures:

* a `DataArray`, which holds a single multi-dimensional variable and its coordinates
* a `Dataset`, which holds multiple variables that potentially share the same coordinates

### DataArray

A `DataArray` has four essential attributes:
* `values`: a `numpy.ndarray` holding the array’s values
* `dims`: dimension names for each axis (e.g., `('x', 'y', 'z')`)
* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
* `attrs`: a `dict` to hold arbitrary metadata (attributes); older xarray versions used an `OrderedDict`
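The four attributes listed above can be inspected directly. This is a small sketch on a hand-made array; the name `da_demo` and its values are hypothetical and independent of the SST data.

```python
import numpy as np
import xarray as xr

# A tiny 2x3 array with labeled axes and some metadata (all values made up).
da_demo = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [0.0, 120.0, 240.0]},
    attrs={"units": "degC", "long_name": "demo temperature"},
)

print(type(da_demo.values))   # the underlying numpy array
print(da_demo.dims)           # names of the axes
print(list(da_demo.coords))   # names of the coordinate arrays
print(da_demo.attrs["units"]) # free-form metadata
```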


In [None]:
da = ds.sst
da

### Datasets

A `Dataset` holds many `DataArray`s, which can share coordinates. In analogy to pandas:

    pandas.Series : pandas.DataFrame :: xarray.DataArray : xarray.Dataset

In [None]:
ds
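To make the pandas analogy concrete, here is a sketch of a `Dataset` built by hand from two variables on a shared grid. The variable names `sst` and `ice` and their constant values are placeholders chosen for this example only.

```python
import numpy as np
import xarray as xr

lat = [10.0, 20.0]
lon = [0.0, 120.0, 240.0]

# Two variables defined on the same (lat, lon) grid.
ds_demo = xr.Dataset(
    {
        "sst": (("lat", "lon"), np.full((2, 3), 20.0)),
        "ice": (("lat", "lon"), np.zeros((2, 3))),
    },
    coords={"lat": lat, "lon": lon},
)

# Selecting one variable returns a DataArray, much like df["col"] returns a Series.
sst_demo = ds_demo["sst"]
print(type(sst_demo).__name__)
print(list(ds_demo.data_vars))
```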

## Working with Labeled Data

Xarray's labels make working with multidimensional data much easier.

### Selecting Data (Indexing)

We can always use regular numpy indexing and slicing on DataArrays:

In [None]:
ds.sst[0].plot()

However, it is often much more powerful to use xarray's `.sel()` method for label-based indexing.

In [None]:
ds.sst.sel(time='1960-01-15').plot()

Selecting all values in time for a specific longitude and latitude now becomes easy:

In [None]:
ds.sst.sel(lon=80, lat=10).plot()

Compare this to numpy-style indexing, where you need to know the integer positions of those lon/lat values:

In [None]:
ds.sst[:, 49, 40].plot()
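The relationship between label-based and position-based indexing can be checked on synthetic data. The sketch below assumes a 2-degree grid (latitudes from -88 to 88, longitudes from 0 to 358); this grid is an assumption made for the example, not something read from the file. It also shows `method="nearest"` for labels that are not exactly on the grid.

```python
import numpy as np
import xarray as xr

# Assumed 2-degree grid, used only for this illustration.
lat = np.arange(-88.0, 89.0, 2.0)   # 89 points
lon = np.arange(0.0, 360.0, 2.0)    # 180 points
data = np.random.default_rng(0).random((3, lat.size, lon.size))
da_demo = xr.DataArray(
    data, dims=("time", "lat", "lon"), coords={"lat": lat, "lon": lon}
)

# Label-based selection: no need to know where lat=10 sits in the array.
by_label = da_demo.sel(lon=80, lat=10)

# Position-based selection: first look up the integer positions.
i_lat = int(np.flatnonzero(lat == 10)[0])
i_lon = int(np.flatnonzero(lon == 80)[0])
by_position = da_demo.isel(lat=i_lat, lon=i_lon)

# Labels off the grid can be matched to the closest grid point.
nearest = da_demo.sel(lon=80.6, lat=9.7, method="nearest")

print(i_lat, i_lon)
```

On this assumed grid the positions come out as 49 and 40, which is why the numpy-style version has to hard-code those numbers.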

### Computation

Xarray `DataArray`s and `Dataset`s work seamlessly with arithmetic operators and numpy array functions.

In [None]:
temp_kelvin = ds.sst + 273.15

In [None]:
temp_kelvin.sel(time='1960-01-15').plot()
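As a small check of what "seamlessly" means here, the sketch below applies an arithmetic operator and a numpy function to a hand-made array and confirms that the result is still a labeled `DataArray`. The names `sst_c`, `sst_k` and `log_k` are hypothetical.

```python
import numpy as np
import xarray as xr

# Made-up temperatures in degrees Celsius on a 2x2 grid.
sst_c = xr.DataArray(
    [[0.0, 10.0], [20.0, 30.0]],
    dims=("lat", "lon"),
    coords={"lat": [-10.0, 10.0], "lon": [0.0, 180.0]},
)

sst_k = sst_c + 273.15   # arithmetic keeps dims and coords
log_k = np.log(sst_k)    # numpy ufuncs also return a DataArray

print(sst_k.dims)
print(type(log_k).__name__)
```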

### Broadcasting

Broadcasting arrays in numpy can be a nightmare: you have to keep track of axis order and insert dummy axes by hand. It is much easier when the data axes are labeled!

This is a useless calculation, but it illustrates how performing an operation on arrays with different coordinates results in automatic broadcasting:

In [None]:
lon_times_lat = ds.lon * ds.lat
lon_times_lat

In [None]:
lon_times_lat.plot()
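For comparison, here is a sketch of the same kind of product written both ways on small made-up coordinate arrays. In numpy the extra axes have to be inserted by hand; in xarray the dimension names decide how the arrays line up.

```python
import numpy as np
import xarray as xr

lat_vals = [-10.0, 0.0, 10.0]
lon_vals = [0.0, 90.0, 180.0, 270.0]
lat = xr.DataArray(lat_vals, dims="lat", coords={"lat": lat_vals})
lon = xr.DataArray(lon_vals, dims="lon", coords={"lon": lon_vals})

# xarray: the two 1-D arrays have different dimension names,
# so the result is 2-D over both.
product = lon * lat

# numpy: the same result needs explicit dummy axes.
numpy_product = lon.values[:, np.newaxis] * lat.values[np.newaxis, :]

print(product.dims, product.shape)
```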

### Reductions

Just like in numpy, we can reduce xarray DataArrays along any number of axes:

In [None]:
ds.sst.mean(axis=0).dims

In [None]:
ds.sst.mean(axis=1).dims

However, rather than performing reductions on axes (as in numpy), we can perform them on named dimensions. This turns out to be a huge convenience, because the code no longer depends on the order in which the axes are stored.

In [None]:
sst_mean = ds.mean(dim='time')
sst_mean

In [None]:
sst_mean.sst.plot()

Or the average temperature over all locations (unweighted by grid-cell area), as a function of time:

In [None]:
ds.sst.mean(dim=('lon', 'lat')).plot()
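The equivalence between axis-based and dimension-based reductions can be verified on a small synthetic array. The array `da_demo` below is hypothetical, and a list of dimension names is used for the multi-dimension case.

```python
import numpy as np
import xarray as xr

# 2 time steps, 3 latitudes, 4 longitudes, filled with 0..23.
da_demo = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4), dims=("time", "lat", "lon")
)

by_axis = da_demo.mean(axis=0)                # numpy style: axis 0 is time
by_dim = da_demo.mean(dim="time")             # xarray style: name the dimension
spatial = da_demo.mean(dim=["lat", "lon"])    # reduce over several dimensions

print(by_dim.dims)
print(spatial.values)
```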

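More advanced calculations are also possible, such as a monthly climatology, anomalies relative to it, resampling and rolling means: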

In [None]:
ds_mm = ds.groupby('time.month').mean(dim='time')
ds_mm

In [None]:
ds_mm.sst.sel(lon=300, lat=50).plot()

In [None]:
ds_mm.sst.mean(dim='lon').transpose().plot.contourf(levels=12, vmin=-2, vmax=30)

In [None]:
(ds_mm.sst.sel(month=1) - ds_mm.sst.sel(month=7)).plot(vmax=10)

In [None]:
def remove_time_mean(x):
    return x - x.mean(dim='time')

# .map() is the current name for what older xarray versions called .apply()
ds_anom = ds.groupby('time.month').map(remove_time_mean)
ds_anom

In [None]:
ds_anom.sst.sel(lon=300, lat=50).plot()
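The climatology and anomaly steps can be sketched on a synthetic monthly series, which makes the result easy to check: after removing the monthly climatology, the mean anomaly for each calendar month should be zero. The series below (a seasonal cycle plus a small trend) is invented for illustration. Subtracting the climatology from the `groupby` object is shown as an alternative to the function-based approach and is assumed to give the same values.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three years of made-up monthly data: seasonal cycle plus a small trend.
time = pd.date_range("2000-01-01", periods=36, freq="MS")
months = time.month.values
seasonal = 10 * np.cos(2 * np.pi * (months - 1) / 12)
trend = 0.01 * np.arange(time.size)
series = xr.DataArray(
    20 + seasonal + trend, dims="time", coords={"time": time}, name="sst"
)

# Monthly climatology: one mean value per calendar month.
climatology = series.groupby("time.month").mean(dim="time")

# Anomaly, variant 1: subtract the climatology group by group.
anomaly = series.groupby("time.month") - climatology

# Anomaly, variant 2: apply a function to each month's group.
anomaly_map = series.groupby("time.month").map(lambda x: x - x.mean(dim="time"))

# The per-month mean of the anomaly should vanish.
residual = anomaly.groupby("time.month").mean(dim="time")
print(climatology.sizes["month"])
```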

In [None]:
ds_anom_resample = ds_anom.resample(time='5Y').mean(dim='time')
ds_anom_resample

In [None]:
ds_anom.sst.sel(lon=300, lat=50).plot()
ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o')

In [None]:
ds_anom_rolling = ds_anom.rolling(time=12, center=True).mean()
ds_anom_rolling

In [None]:
ds_anom.sst.sel(lon=300, lat=50).plot(label='monthly anom')
ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o', label='5 year resample')
ds_anom_rolling.sst.sel(lon=300, lat=50).plot(label='12 month rolling mean')
plt.legend()
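The difference between resampling and a rolling mean is easier to see on a tiny series. In this sketch the data are just the numbers 0 to 23 at monthly steps, and the yearly frequency string `"YS"` (year start) is a choice made for the example, not the `'5Y'` used above. Resampling produces one value per bin, whereas a rolling mean keeps the original time axis and leaves missing values where the window is incomplete.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of made-up monthly values: 0, 1, ..., 23.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
series = xr.DataArray(np.arange(24.0), dims="time", coords={"time": time})

# Resample: one mean per calendar year, so the time axis shrinks to 2 points.
yearly = series.resample(time="YS").mean()

# Rolling: a centered 12-month window slides along the original time axis.
# Windows that run off either end are incomplete and give NaN.
rolling = series.rolling(time=12, center=True).mean()
n_valid = int(rolling.notnull().sum())

print(yearly.values)
print(rolling.sizes["time"], n_valid)
```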

## Plotting with cartopy

https://scitools.org.uk/cartopy/docs/latest/

Cartopy makes use of the powerful [PROJ.4](https://proj4.org/), numpy and shapely libraries and includes a programmatic interface built on top of Matplotlib for the creation of publication-quality maps.

Key features of cartopy are its object oriented projection definitions, and its ability to transform points, lines, vectors, polygons and images between those projections.


In [None]:
import cartopy.crs as ccrs
import cartopy

In [None]:
sst = ds.sst.sel(time='2000-01-01', method='nearest')
fig = plt.figure(figsize=(9,6))
ax = plt.axes(projection=ccrs.Robinson())
ax.coastlines()
ax.gridlines()
sst.plot(ax=ax, transform=ccrs.PlateCarree(),
         vmin=2, vmax=30, cbar_kwargs={'shrink': 0.4})