In [None]:
import xarray as xr
import numpy as np
import pandas as pd

# Introduction

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. For example:

- Apply operations over dimensions by name: `x.sum('time')`.

- Select values by label (or logical location) instead of integer location: `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.

- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.

- Easily use the split-apply-combine paradigm with groupby: `x.groupby('time.dayofyear').mean()`.

- Database-like alignment based on coordinate labels that smoothly handles missing values: `x, y = xr.align(x, y, join='outer')`.

- Keep track of arbitrary metadata in the form of a Python dictionary: `x.attrs`.
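
As a rough illustration of the points above, here is a minimal sketch on a small made-up array. The variable names, the `space` dimension and the `units` attribute are assumptions chosen for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps at 3 locations.
times = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "space"),
    coords={"time": times, "space": ["a", "b", "c"]},
    attrs={"units": "K"},
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label; .sel and .loc are two spellings of the same lookup.
first_day = x.sel(time="2014-01-01")
same_day = x.loc["2014-01-01"]

# Split-apply-combine: group by a datetime component and average.
daily_mean = x.groupby("time.dayofyear").mean()

# Arbitrary metadata lives in a plain dictionary.
units = x.attrs["units"]
```

Each operation refers to `time` by name, so the code would keep working if the dimensions of `x` were stored in a different order.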

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis numbers (`dim='time'` instead of `axis=0`) makes such arrays much more manageable than the raw NumPy `ndarray`: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions of size 1 to align arrays (e.g., using `np.newaxis`).
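
To make the comparison concrete, the following sketch adds a "row" and a "column" vector, first in plain NumPy and then with xarray. The array contents and the dimension names `x` and `y` are arbitrary choices for illustration.

```python
import numpy as np
import xarray as xr

row = np.arange(3)
col = np.arange(4)

# NumPy: shapes must be lined up by hand with dummy axes.
np_result = row[:, np.newaxis] + col[np.newaxis, :]

# xarray: dimension names decide how the operands broadcast.
xr_row = xr.DataArray(row, dims="x")
xr_col = xr.DataArray(col, dims="y")
xr_result = xr_row + xr_col

# Reductions name the dimension rather than its position.
sum_over_y = xr_result.sum(dim="y")
```

Both approaches produce the same 3x4 values, but the xarray version never mentions axis positions.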

The immediate payoff of using xarray is that you’ll write less code. The long-term payoff is that you’ll understand what you were thinking when you come back to look at it weeks or months later.

# Data structures

xarray mainly provides two types: `DataArray` and `Dataset`. The `DataArray` class attaches dimension names, coordinates and attributes to multi-dimensional arrays while `Dataset` combines multiple arrays.

Both classes are normally created by reading data, but to understand them let's first look at creating them programmatically.

## DataArray

The `DataArray` class is used to attach a name, dimension names, labels, and attributes to an array.

As an example, let's create a `DataArray` named `a` with three dimensions (named `x`, `y` and `z`) from a `numpy` array:

In [None]:
da = xr.DataArray(
    np.ones((3, 4, 2)),
    dims=("x", "y", "z"),
    name="a",
    coords={"z": [-1, 1], "u": ("x", [0.1, 1.2, 2.3])},
    attrs={"attr": "value"},
)

In this case, we used a 3x4x2 `numpy` array with all values being equal to `1`, but it can be anything that either behaves like a `numpy` array or can be coerced to a `numpy` array using `numpy.array`.

We also passed a sequence (a tuple here, but it could also be a list) containing the dimension names `x`, `y` and `z` to `dims`. If we have only a single dimension, we can also pass just the dimension name:
```python
xr.DataArray([1, 1], dims="x")
```

The dimension names (and the array's `name`) can be anything that fits into a Python `set` (i.e. calling `hash()` on it doesn't raise an error), but to be useful they should be strings.

`coords` is a [dict-like](https://docs.python.org/3/glossary.html#term-mapping) container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings). We will look at its format later.

We can also attach arbitrary metadata (attributes) to the `DataArray` by passing a dict-like to the `attrs` parameter.

### string representations

Now that we have the `DataArray` we can look at its string representation.

xarray has two representation types: `"html"` (which is only available in notebooks) and `"text"`. To choose between them, use the `display_style` option.

Let's first look at the text representation:

In [None]:
xr.set_options(display_style="text")
da

It consists of:
- the name of the `DataArray` (`'a'`). If we didn't provide a name, this will be omitted.
- the dimensions of the array `(x: 3, y: 4, z: 2)`: this tells us that the first dimension is named `x` and has a size of `3`, the second dimension is named `y` and has a size of `4`, and the third dimension is named `z` and has a size of `2`
- a preview of the data
- an (unordered) list of coordinates or dimensions with coordinates, with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. Also, if it is a dimension coordinate, it will be marked with a `*`.
- an alphabetically sorted list of dimensions without coordinates
- an (unordered) list of attributes

The `"html"` representation looks similar:

In [None]:
xr.set_options(display_style="html")
da

except the data preview was collapsed to a single line (we can expand it by clicking on the symbol on the left) and the dimensions are marked by a bold font instead of a `*` prefix.

Except when explaining the text representation we will use the HTML representation.

Once we have created the `DataArray`, we can look at its data, dimensions, coordinates and attributes:

In [None]:
da.data

In [None]:
da.dims

In [None]:
da.coords

In [None]:
da.attrs

### coordinates

As mentioned above, `coords` is a dict-like mapping names to values. These values can be either

- another `DataArray` object
- a tuple of the form `(dims, data, attrs)` where `attrs` is optional. This is roughly equivalent to creating a new `DataArray` object with `DataArray(dims=dims, data=data, attrs=attrs)`
- a `numpy` array (or anything that can be coerced to one using `numpy.array`).

Let's look at an example:

In [None]:
da = xr.DataArray(
    np.ones((3, 4)),
    dims=("x", "y"),
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "u": ("x", np.arange(3), {"attr1": 0}),
    },
)
da

We can see that we assigned labels to the `x` and `y` dimensions and also created a coordinate named `u` along `x` with its own metadata (click on the sheet icon to look at them).

The difference between the dimension labels (dimension coordinates) and normal coordinates is that for now it is only possible to use indexing operations (`sel`, `reindex`, etc.) with dimension coordinates. Also, while coordinates can have arbitrary dimensions, dimension coordinates have to be one-dimensional.
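
The sketch below illustrates this distinction on a small array. The values are made up, and the two workarounds shown for selecting on the non-dimension coordinate `u` (`swap_dims` and a boolean mask with `where`) are just two possible approaches, not the only ones.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("x", "y"),
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "u": ("x", [10, 20, 30]),
    },
)

# Label-based selection works directly on the dimension coordinate "x".
row_b = da.sel(x="b")

# "u" is a non-dimension coordinate; one option is to make it the
# dimension coordinate first and then select on it.
by_u = da.swap_dims({"x": "u"}).sel(u=20)

# Alternatively, build a boolean mask from the coordinate values.
masked = da.where(da.u == 20, drop=True)
```

Note that `where` keeps the `x` dimension (with length 1 here) and may promote the data to a floating-point dtype, whereas `sel` drops the selected dimension.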

## Dataset

`Dataset` objects collect multiple data variables, each with possibly different dimensions.

The constructor of `Dataset` takes three parameters:
- `data_vars`: dict-like mapping names to values. It has the format described in [coordinates](#coordinates) except we need to use either `DataArray` objects or the tuple syntax since we have to provide dimensions
- `coords`: same as for `DataArray`
- `attrs`: same as for `DataArray`

For example, let's create a `Dataset` with two variables:

In [None]:
ds = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": ("t", np.full((8,), 3), {"attr": "value"}),
    },
    coords={
        "x": [-1, 0, 1],
    },
    attrs={"attr": "value"},
)
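
Once such a `Dataset` exists, its variables can be pulled out again as `DataArray` objects. The following sketch rebuilds the same dataset so that it runs on its own and shows a few common ways to inspect it.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": ("t", np.full((8,), 3), {"attr": "value"}),
    },
    coords={"x": [-1, 0, 1]},
    attrs={"attr": "value"},
)

# Dictionary-style and attribute-style access return the same DataArray.
a_item = ds["a"]
a_attr = ds.a

# Each variable keeps its own dimensions and attributes; the dataset
# tracks the sizes of all dimensions used by any of its variables.
var_names = list(ds.data_vars)
sizes = dict(ds.sizes)
b_attrs = ds["b"].attrs
```

Dictionary-style access is usually the safer choice, since attribute-style access can clash with existing method names.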

### string representations

Let's again first look at the text representation:

In [None]:
xr.set_options(display_style="text")
ds

It consists of
- a summary of all dimensions in the dataset and their lengths
- an unordered list of coordinates (same format as the `DataArray`)
- an unordered list of dimensions without coordinates
- an unordered list of data variables: each item has the same format as the coordinates with the exception of the dimension mark (`*`)

Again, the HTML representation is similar:

In [None]:
xr.set_options(display_style="html")
ds

### coordinates

As with `DataArray`, a `Dataset` really becomes useful once we assign coordinates:

In [None]:
xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.ones((3, 4))),
        "b": (("t", "x"), np.full((8, 3), 3)),
    },
    coords={
        "x": ["a", "b", "c"],
        "y": np.arange(4),
        "t": pd.date_range("2020-07-05", periods=8, freq="D"),
    },
)

If we have variables with different values along the same dimension, we can't use the shortcut syntax anymore. Instead, we need to use `DataArray` objects:

In [None]:
x_a = np.arange(1, 4)
x_b = np.arange(-1, 3)

a = xr.DataArray(np.linspace(0, 1, 3), dims="x", coords={"x": x_a})
b = xr.DataArray(np.zeros(4), dims="x", coords={"x": x_b})

xr.Dataset(data_vars={"a": a, "b": b})

which combines the coordinates and fills in floating-point `nan` values for missing data (converting the data type to `float` in the process). For example, `b` doesn't have a value for `x == 3` so `nan` was used.
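
The same kind of label-based combination can be requested explicitly with `xr.align`. The sketch below recreates the two arrays and compares an outer join with an inner join; it is meant as an illustration of the alignment behaviour, not as a description of how the `Dataset` constructor is implemented.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.linspace(0, 1, 3), dims="x", coords={"x": np.arange(1, 4)})
b = xr.DataArray(np.zeros(4), dims="x", coords={"x": np.arange(-1, 3)})

# An outer join keeps the union of the labels and pads with NaN.
a_outer, b_outer = xr.align(a, b, join="outer")

# An inner join keeps only the labels both arrays share.
a_inner, b_inner = xr.align(a, b, join="inner")
```

With the outer join, `b_outer` has a `nan` at `x == 3`; with the inner join, both results are restricted to `x == 1` and `x == 2`.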

# Roundtripping and I/O

Typically, `DataArray` and `Dataset` objects are not created programmatically but instead by converting from / to other libraries such as `pandas` or by reading from data storage formats such as `netcdf` or `zarr`.

To convert from / to `pandas`, we can use the `to_xarray` methods on `pandas` objects or the `to_pandas` methods on `xarray` objects:

In [None]:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series

In [None]:
arr = series.to_xarray()
arr

In [None]:
arr.to_pandas()
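
The next sketch repeats the round trip with a named index and a named series, to show where the dimension name and array name come from. The names `letter` and `score` are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

series = pd.Series(
    np.arange(3.0),
    index=pd.Index(["a", "b", "c"], name="letter"),  # hypothetical index name
    name="score",  # hypothetical series name
)

# The index name becomes the dimension name, the series name the array name.
arr = series.to_xarray()

# Converting back yields a pandas object with the same labels and values.
roundtrip = arr.to_pandas()
```

If the index has no name, as in the cells above, the dimension is called `index` instead.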

We can also control what `pandas` object is used by calling `to_series` / `to_dataframe`:

In [None]:
ds = xr.Dataset(data_vars={"a": ("x", np.arange(5)), "b": (("x", "y"), np.ones((5, 4)))})

**`to_series`**:
This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions

In [None]:
ds.a.to_series()

In [None]:
ds.b.to_series()

**`to_dataframe`**:
This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this.
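
For an unnamed `DataArray`, a name can be supplied at conversion time or set beforehand. A minimal sketch, where the column name `measurement` is an arbitrary choice:

```python
import numpy as np
import xarray as xr

unnamed = xr.DataArray(np.arange(3), dims="x")

# Without a name there is no column label, so supply one explicitly...
df_from_arg = unnamed.to_dataframe(name="measurement")

# ...or give the array a name first.
df_from_rename = unnamed.rename("measurement").to_dataframe()
```

Both variants should give a single-column `DataFrame` indexed by `x`.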

In [None]:
ds.a.to_dataframe()

Since all columns in a `DataFrame` need to share the same index, variables with fewer dimensions are broadcast against the others:

In [None]:
ds.to_dataframe()
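
To see the effect of this broadcasting, the sketch below rebuilds the dataset and inspects the resulting frame. The checks are only one possible way to confirm that the one-dimensional variable `a` was repeated along `y`.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"a": ("x", np.arange(5)), "b": (("x", "y"), np.ones((5, 4)))}
)
df = ds.to_dataframe()

# One row per (x, y) combination: 5 * 4 = 20 rows, one column per variable.
n_rows, n_cols = df.shape
index_names = list(df.index.names)

# "a" only depends on x, so within each x its value is repeated for every y.
a_is_repeated = bool(df["a"].groupby(level="x").nunique().eq(1).all())
```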

## I/O

- netcdf / pseudonetcdf (`open_dataset` / `open_mfdataset`, `to_netcdf` / `save_mfdataset`)
- zarr (`open_zarr`, `to_zarr`)
- rasterio (`open_rasterio`)

Scientific data usually is 

### netcdf

To read / write to `netcdf` files, use the `open_dataset` / `open_dataarray` functions and the `to_netcdf` method.

Let's first create some datasets and write them to disk using `to_netcdf`, which takes the path we want to write to:

In [None]:
ds1 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(4, 2)),
        "b": (("z", "x"), np.random.randn(6, 4)),
    },
    coords={
        "x": np.arange(4),
        "y": np.arange(-2, 0),
        "z": np.arange(-3, 3),
    },
)
ds2 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(7, 3)),
        "b": (("z", "x"), np.random.randn(2, 7)),
    },
    coords={
        "x": np.arange(6, 13),
        "y": np.arange(3),
        "z": np.arange(3, 5),
    },
)

ds1.to_netcdf("ds1.nc")
ds2.to_netcdf("ds2.nc")

ds1.a.to_netcdf("da1.nc")

Reading those files is just as simple:

In [None]:
xr.open_dataset("ds1.nc")

In [None]:
xr.open_dataarray("da1.nc")
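
A quick way to convince ourselves that nothing is lost is to write a dataset, read it back and compare. The sketch below does this in a temporary directory; it assumes that at least one netCDF backend (for example `netCDF4`, `h5netcdf` or `scipy`) is installed, and the file name `roundtrip.nc` is made up.

```python
import os
import tempfile

import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"a": (("x", "y"), np.random.randn(4, 2))},
    coords={"x": np.arange(4), "y": np.arange(-2, 0)},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "roundtrip.nc")  # hypothetical file name
    ds.to_netcdf(path)
    # The context manager closes the file handle; .load() pulls the
    # values into memory so they stay usable after the file is closed.
    with xr.open_dataset(path) as opened:
        loaded = opened.load()

# equals() compares dimensions, coordinates and values.
roundtrip_ok = ds.equals(loaded)
```

Opening files in a `with` block is optional, but it makes sure the file is closed before it is deleted or overwritten.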

### zarr

`zarr` files can be written with:

In [None]:
ds1.to_zarr("ds1.zarr")

We can then read the created file with:

In [None]:
xr.open_zarr("ds1.zarr", chunks=None)

Setting the `chunks` parameter to `None` avoids `dask` (more on that in a later session).