<img src="http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png" align="right" width="30%">

# Xarray: Data structures for high-level analysis of multi-dimensional data

In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:

- Understand the basic data structures in Xarray.
- Inspect `DataArray` and `Dataset` objects.
- Read and write netCDF files using Xarray.
- Understand that there are many packages that build on top of Xarray.

## A practical example

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

%matplotlib inline

In [None]:
# load tutorial dataset
ds = xr.tutorial.load_dataset("air_temperature")

## What's in a dataset? many DataArrays

In [None]:
# dataset repr
ds

Datasets are dict-like containers of DataArrays, i.e. they map variable names to DataArrays.

In [None]:
# pull out "air" dataarray
ds["air"]

In [None]:
# pull out dataarray using dot notation
ds.air  ## same as ds["air"]
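
The dict-like behaviour can be sketched without downloading anything. The toy dataset below is purely illustrative: the variable names `toy`, `air`, and `pressure` and all values are assumptions, not part of the tutorial data.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: two variables sharing one "x" dimension.
toy = xr.Dataset(
    {
        "air": ("x", np.array([280.0, 285.0, 290.0])),
        "pressure": ("x", np.array([1000.0, 990.0, 980.0])),
    },
    coords={"x": [10, 20, 30]},
)

# Mapping-style access: membership test, key listing, item lookup.
has_air = "air" in toy
names = list(toy.data_vars)
air = toy["air"]

print(has_air, names, type(air).__name__)
```

Attribute access (`toy.air`) returns the same object as `toy["air"]`, but only the bracket form can be used to assign a new variable.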

## What's in a DataArray? data + (a lot of) metadata

### Named dimensions `.dims`

In [None]:
ds.air.dims

### Coordinate variables or "tick labels" (`.coords`)

In [None]:
ds.air.coords

In [None]:
# extracting coordinate variables
ds.air.lon

In [None]:
# extracting coordinate variables from .coords
ds.coords["lon"]

### Arbitrary attributes (`.attrs`)

`.attrs` is a dictionary that can contain arbitrary Python objects. Your only limitation is that some attributes may not be writeable to a netCDF file.

In [None]:
ds.air.attrs

In [None]:
# assign your own attribute
ds.air.attrs["who_is_awesome"] = "xarray"
ds.air.attrs

### Underlying data (`.data`)

By default this is a NumPy array, which you may already be familiar with. Xarray structures wrap simpler underlying data structures.

This part of xarray is quite extensible, allowing for GPU arrays, sparse arrays, arrays with units, etc. See the demo at the end.

In [None]:
ds.air.data

In [None]:
# what is the type of the underlying data
type(ds.air.data)

A numpy array!

<img src="https://numpy.org/images/logos/numpy.svg" style="width:20%">

### Review


Xarray provides two main data structures
* DataArrays that wrap underlying data containers (e.g. numpy arrays) and contain associated metadata
* Datasets that are dict-like containers of DataArrays
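
As a rough sketch of how these pieces fit together, a `DataArray` can be built by hand from a NumPy array plus metadata. The values, coordinate labels, and the name `toy_air` are made up for illustration.

```python
import numpy as np
import xarray as xr

# Plain NumPy values: 2 latitudes x 3 longitudes (made-up numbers).
values = np.arange(6.0).reshape(2, 3)

da = xr.DataArray(
    values,
    dims=("lat", "lon"),  # named dimensions
    coords={"lat": [10.0, 20.0], "lon": [200.0, 210.0, 220.0]},  # tick labels
    attrs={"units": "K"},  # arbitrary metadata
    name="toy_air",
)

print(da.dims)         # dimension names
print(list(da.coords)) # coordinate variable names
print(da.attrs)        # attribute dictionary
print(type(da.data))   # the wrapped array type
```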

For more see
* https://xarray.pydata.org/en/stable/data-structures.html#dataset
* https://xarray.pydata.org/en/stable/data-structures.html#dataarray

---

## Why xarray? Use metadata for fun and ~profit~ papers!

### Analysis without xarray: `X(`

In [None]:
# plot the first timestep
lat = ds.air.lat.data  # numpy array
lon = ds.air.lon.data  # numpy array
temp = ds.air.data  # numpy array
plt.figure()
plt.pcolormesh(lon, lat, temp[0, :, :])

In [None]:
temp.mean(axis=1)  ## what did I just do? I can't tell by looking at this line. 

### Analysis with xarray `=)`

How readable is this code?

In [None]:
plt.figure()
ds.air.isel(time=1).plot(x="lon")

In [None]:
plt.figure()
ds.air.mean("time").plot()
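
To make the readability argument concrete, here is a small sketch on random data (an assumption, not the tutorial dataset) showing that reducing by name gives the same numbers as reducing by axis number, and keeps working when the dimension order changes.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 3, 5))  # assumed layout: (time, lat, lon)
da = xr.DataArray(raw, dims=("time", "lat", "lon"))

# NumPy: the reader must remember that axis 0 is time.
by_axis = raw.mean(axis=0)

# xarray: the reduction names the dimension it removes.
by_name = da.mean("time")

# The named version is unaffected by reordering the dimensions.
shuffled = da.transpose("lon", "time", "lat")
by_name_shuffled = shuffled.mean("time")

print(by_name.dims, by_name_shuffled.dims)
```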

---

## Extracting data or "indexing" : `.sel`, `.isel`

Xarray supports 
* label-based indexing using `.sel`
* position-based indexing using `.isel`

For more see https://xarray.pydata.org/en/stable/indexing.html

### Label-based indexing

Xarray inherits its label-based indexing rules from pandas; this means great support for dates and times!

In [None]:
# pull out data for all of 2013-May
ds.sel(time="2013-05")

In [None]:
# demonstrate slicing
ds.sel(time=slice("2013-05", "2013-07"))

In [None]:
# demonstrate "nearest" indexing
ds.sel(lon=240.2, method="nearest")

In [None]:
# "nearest" indexing at multiple points
ds.sel(lon=[240.125, 234], lat=[40.3, 50.3], method="nearest")

### Position-based indexing


This is similar to your usual numpy `array[0, 2, 3]` but with the power of named dimensions!

In [None]:
# pull out time index 0 and lat index 0
ds.air.isel(time=0, lat=0)  #  much better than ds.air[0, 0, :]

In [None]:
# demonstrate slicing
ds.air.isel(lat=slice(10))
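
The two indexing styles can be compared side by side on a small synthetic array (coordinate values are made up). Note one difference worth remembering: label-based slices include both endpoints, while position-based slices exclude the end, as in NumPy.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [15.0, 17.5, 20.0], "lon": [200.0, 202.5, 205.0, 207.5]},
)

# Same element, selected by label and by integer position.
by_label = da.sel(lat=17.5, lon=205.0)
by_position = da.isel(lat=1, lon=2)

# "nearest" picks the closest existing label (205.0 here).
nearest = da.sel(lon=204.0, method="nearest")

# Label slices are inclusive of both ends; position slices exclude the end.
label_slice = da.sel(lon=slice(202.5, 205.0))
pos_slice = da.isel(lon=slice(1, 3))

print(int(by_label), int(by_position), float(nearest.lon))
```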

---
## Concepts for computation

### Broadcasting: expanding data

Let's try to calculate the grid cell area associated with the air temperature data. We will use this to make a proper domain average.

A very approximate formula is

\begin{equation}
\Delta\text{lat} \times \Delta\text{lon} \times \cos(\text{latitude})
\end{equation}

assuming that $\Delta\text{lon}$ = 111 km and $\Delta\text{lat}$ = 111 km.

In [None]:
dlon = np.cos(ds.air.lat * np.pi / 180) * 111e3
dlon

In [None]:
dlat = 111e3 * xr.ones_like(ds.air.lon)
dlat

In [None]:
cell_area = dlon * dlat
cell_area

The result has two dimensions because xarray realizes that the dimensions `lon` and `lat` are different, so it automatically "broadcasts" to get a 2D result. See the last row in this image from Jake VanderPlas's *Python Data Science Handbook*:

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png">


Because xarray knows about dimension names, we avoid having to create unnecessary size-1 dimensions using `np.newaxis` or `.reshape`. For more, see https://xarray.pydata.org/en/stable/computation.html#broadcasting-by-dimension-name
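
For comparison, here is a sketch of the same cell-area calculation done both ways on a tiny made-up grid. The `np_` and `xr_` prefixed names are hypothetical and only used here.

```python
import numpy as np
import xarray as xr

lat = np.array([0.0, 30.0, 60.0])
lon = np.array([200.0, 202.5, 205.0, 207.5])

# NumPy: insert size-1 axes by hand so shapes (3, 1) and (1, 4) broadcast.
np_dlon = np.cos(np.deg2rad(lat)) * 111e3  # shape (3,)
np_dlat = 111e3 * np.ones_like(lon)        # shape (4,)
np_area = np_dlon[:, np.newaxis] * np_dlat[np.newaxis, :]

# xarray: dimension names tell it how to combine the two 1D arrays.
xr_dlon = xr.DataArray(np_dlon, dims="lat", coords={"lat": lat})
xr_dlat = xr.DataArray(np_dlat, dims="lon", coords={"lon": lon})
xr_area = xr_dlon * xr_dlat

print(xr_area.dims, xr_area.shape)
```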


---

### Alignment: putting data on the same grid

When doing arithmetic operations, xarray automatically "aligns" the data, i.e. puts it on the same grid. In this case `cell_area` and `ds.air` are at the same lat, lon points, so things are multiplied as you would expect.

In [None]:
(cell_area * ds.air.isel(time=1))

Now let's make `cell_area` unaligned, i.e. change the coordinate labels.

In [None]:
# make a copy of cell_area
# then add 1e-5 to lat
cell_area_bad = cell_area.copy(deep=True)
cell_area_bad["lat"] = cell_area.lat + 1e-5
cell_area_bad

In [None]:
cell_area_bad * ds.air.isel(time=1)

**Tip:** If you notice extra NaNs or missing points after xarray computation, it means that your xarray coordinates were not aligned *exactly*.
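
A minimal sketch of what alignment does, using two made-up 1D arrays whose labels only partly overlap. By default, arithmetic keeps only the shared labels; switching the join to `"outer"` keeps every label and fills the gaps with NaN. Asking for `join="exact"` in `xr.align` is one way to fail loudly instead.

```python
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# Default behaviour: only labels present in both arrays survive.
inner = a + b

# Outer join: all labels are kept, missing values become NaN.
with xr.set_options(arithmetic_join="outer"):
    outer = a + b

# Exact join: raise an error if the labels differ at all.
try:
    xr.align(a, b, join="exact")
    labels_match = True
except ValueError:
    labels_match = False

print(inner.x.values, outer.x.values, labels_match)
```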

For more, see https://xarray.pydata.org/en/stable/computation.html#automatic-alignment

---

## High level computation: `groupby`, `resample`, `rolling`, `coarsen`, `weighted`

Xarray has some very useful high level objects that let you do common computations:

1. `groupby` : [Bin data into groups and reduce](https://xarray.pydata.org/en/stable/groupby.html)
1. `resample` : [Groupby specialized for time axes. Either downsample or upsample your data.](https://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations)
1. `rolling` : [Operate on rolling windows of your data e.g. running mean](https://xarray.pydata.org/en/stable/computation.html#rolling-window-operations)
1. `coarsen` : [Downsample your data](https://xarray.pydata.org/en/stable/computation.html#coarsen-large-arrays)
1. `weighted` : [Weight your data before reducing](https://xarray.pydata.org/en/stable/computation.html#weighted-array-reductions)
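
`rolling` and `coarsen` are not demonstrated in the cells that follow, so here is a small sketch on a made-up 1D series. The window sizes are arbitrary choices for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(8.0), dims="time")

# Centered 3-point running mean; the edges lack a full window and become NaN.
running = da.rolling(time=3, center=True).mean()

# Average every 2 consecutive points, halving the length of the series.
coarse = da.coarsen(time=2).mean()

print(running.values)
print(coarse.values)
```

The rolling result keeps the original length, while the coarsened result is shorter, which is the practical difference between smoothing and downsampling.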

### groupby

In [None]:
# seasonal groups
ds.groupby("time.season")

In [None]:
# make a seasonal mean
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean
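
The same seasonal grouping can be tried on a synthetic daily series, which makes it easy to check one group by hand. The series (a simple day counter for 2013) and the name `toy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2013-01-01", "2013-12-31", freq="D")
da = xr.DataArray(
    np.arange(time.size, dtype=float),
    dims="time",
    coords={"time": time},
    name="toy",
)

# "time.season" is a virtual coordinate derived from the datetime index.
seasonal = da.groupby("time.season").mean()

# Cross-check one group by selecting June to August directly.
jja_by_hand = da.sel(time=slice("2013-06-01", "2013-08-31")).mean()

print(seasonal.season.values)
print(float(seasonal.sel(season="JJA")), float(jja_by_hand))
```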

### resample

In [None]:
# resample to monthly frequency
ds.resample(time="M").mean()
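
A small sketch of resampling on made-up daily data. It uses the `"MS"` (month start) frequency alias, which is an assumption chosen here because the spelling of the month-end alias differs between pandas versions (recent releases prefer `"ME"` over `"M"`).

```python
import numpy as np
import pandas as pd
import xarray as xr

# 59 days of ones: all of January and February 2013.
time = pd.date_range("2013-01-01", periods=59, freq="D")
da = xr.DataArray(np.ones(59), dims="time", coords={"time": time})

# Summing ones per month simply counts the days in each month.
monthly = da.resample(time="MS").sum()

print(monthly.values)
```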

### weighted

In [None]:
# weight by cell_area and take mean over (time, lon)
ds.weighted(cell_area).mean(["lon", "time"]).air.plot()
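
To see what `weighted` computes, the sketch below compares it with a hand-written weighted mean on a tiny made-up grid, using cosine-of-latitude weights as a stand-in for cell area.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    [[1.0, 3.0], [5.0, 7.0]],
    dims=("lat", "lon"),
    coords={"lat": [0.0, 60.0], "lon": [0.0, 1.0]},
)

# Weights depend on latitude only; xarray broadcasts them across lon.
weights = np.cos(np.deg2rad(da.lat))

weighted_mean = da.weighted(weights).mean(("lat", "lon"))

# The same quantity written out: sum(w * x) / sum(w) over all grid points.
manual = (da * weights).sum() / (weights * xr.ones_like(da)).sum()

print(float(weighted_mean), float(manual), float(da.mean()))
```

The weighted result is pulled toward the equatorial row, which carries twice the weight of the row at 60 degrees.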

---

## Visualization: `.plot`

For more see https://xarray.pydata.org/en/stable/plotting.html and https://xarray.pydata.org/en/stable/examples/visualization_gallery.html

In [None]:
# facet the seasonal_mean
seasonal_mean.air.plot(col="season")

In [None]:
# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True)

In [None]:
# line plots too? wut
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat")

---

## Reading and writing to disk

Xarray supports many disk formats. Below is a small example using netCDF. For more see https://xarray.pydata.org/en/stable/io.html



In [None]:
# write ds to netCDF
ds.to_netcdf("my-example-dataset.nc")

In [None]:
# read from disk
fromdisk = xr.open_dataset("my-example-dataset.nc")
fromdisk

In [None]:
# check that the two are identical
ds.identical(fromdisk)

**Tip:** A common use case is to read datasets that are a collection of many netCDF files. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for how to handle that.

---

# More information

1. A description of common terms used in the xarray documentation: https://xarray.pydata.org/en/stable/terminology.html
1. For information on how to create a DataArray from an existing numpy array: https://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray
1. Answers to common questions on "how to do X" are here: https://xarray.pydata.org/en/stable/howdoi.html
1. Our more extensive SciPy 2020 tutorial material: https://xarray-contrib.github.io/xarray-tutorial/

---

# The scientific python / pangeo ecosystem: demo

Xarray ties in to the larger scientific python ecosystem and in turn many packages build on top of xarray. A long list of such packages is here: https://xarray.pydata.org/en/stable/related-projects.html.


Now we will demonstrate some cool features.

## Pandas: tabular data structures 

You can easily convert between xarray and pandas structures: https://pandas.pydata.org/

This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.

See https://xarray.pydata.org/en/stable/pandas.html

In [None]:
# convert to pandas dataframe
df = ds.isel(time=slice(10)).to_dataframe()
df

In [None]:
# convert dataframe to xarray
df.to_xarray()
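
The round trip can be sketched on a small made-up dataset, which makes the resulting table structure easy to inspect: each dimension becomes a level of the DataFrame index, and each variable becomes a column.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds_small = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": pd.date_range("2013-01-01", periods=2),
        "lat": [10.0, 20.0, 30.0],
    },
)

# Dimensions become a MultiIndex; variables become columns.
df = ds_small.to_dataframe()

# Going back rebuilds the dimensions from the index levels.
roundtrip = df.to_xarray()

print(df.index.names, list(df.columns), len(df))
```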

## xarray can wrap other array types, not just numpy


<img src="https://docs.dask.org/en/latest/_static/images/dask-horizontal-white.svg" style="width:25%"> 

**dask** : parallel arrays https://xarray.pydata.org/en/stable/dask.html & https://docs.dask.org/en/latest/array.html 


<img src="https://sparse.pydata.org/en/stable/_images/logo.png" style="width:12%"> 

**pydata/sparse** : sparse arrays http://sparse.pydata.org

<img src="https://raw.githubusercontent.com/cupy/cupy.dev/master/images/cupy_logo.png" style="width:22%">

**cupy** : GPU arrays http://cupy.chainer.org


<img src="https://pint.readthedocs.io/en/stable/_images/logo-full.jpg" style="width:10%">

**pint** : unit-aware computations https://pint.readthedocs.org & https://github.com/xarray-contrib/pint-xarray

### Xarray + dask

Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks.

<img src="https://dask.org/_images/dask-array-black-text.svg" style="width:55%">

In [None]:
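# make dask cluster; this is for demo purposes
import dask
import distributed

cluster = distributed.LocalCluster()
# connect a client so that later computations actually run on this cluster
client = distributed.Client(cluster)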

In [None]:
# demonstrate dask dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature", 
    chunks={"time": 100}, # 100 time steps in each block
)

dasky.air

In [None]:
# demonstrate lazy mean
dasky.air.mean()

In [None]:
# "compute" the mean
dasky.air.mean().compute()

## holoviews: javascript interactive plots

The `hvplot` package is an easy way to access [holoviews](http://holoviews.org/) functionality. It attaches itself to all xarray objects under the `.hvplot` namespace, so instead of using `.plot`, use `.hvplot`.

In [None]:
import hvplot.xarray

ds.air.hvplot(groupby="time", clim=(270, 300))

### cf_xarray : use even more metadata for even more fun and ~profit~ papers

[cf_xarray](https://cf-xarray.readthedocs.io/) is a new project that tries to let you make use of other CF attributes that xarray ignores. It attaches itself to all xarray objects under the `.cf` namespace.

In [None]:
import cf_xarray

In [None]:
# describe cf attributes in dataset
ds.air.cf.describe()

In [None]:
# demonstrate equivalent of .mean("lat")
ds.air.cf.mean("latitude")

In [None]:
# demonstrate indexing
ds.air.cf.sel(longitude=242.5, method="nearest")

### Other cool packages


* xgcm : grid-aware operations with xarray objects
* xrft : fourier transforms with xarray
* xclim : calculating climate indices with xarray objects
* intake : forget about file paths
* rioxarray : raster files and xarray
* xesmf : regrid using ESMF
* MetPy : tools for working with weather data

More here: https://xarray.pydata.org/en/stable/related-projects.html