<a href="https://colab.research.google.com/github/geonextgis/Data-Wrangling-with-Xarray/blob/main/00_Fundamentals/00_Data_Structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Xarray's Data Structures**

## **Introduction**
N-dimensional arrays, also known as tensors, are integral to computational science and are used across various domains like physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/) serves as the essential tool for handling these arrays. However, practical datasets go beyond simple numerical values; they often include labels that provide information about how the array values correspond to locations in space, time, and other dimensions.

To illustrate, consider how we might organize a dataset for a weather forecast:

<center><img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" width="60%"></center>

Xarray distinguishes itself by not only keeping track of labels on arrays but using them to offer a robust and concise interface. For instance:

- Conduct operations across dimensions by name: `x.sum('time')`.
- Select values by label rather than integer location:
    `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
- Mathematical operations (e.g., `x - y`) efficiently work across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Utilize versatile split-apply-combine operations with groupby:
    `x.groupby('time.dayofyear').mean()`.
- Achieve database-like alignment based on coordinate labels that adeptly handles missing values: `x, y = xr.align(x, y, join='outer')`.
- Retain arbitrary metadata using a Python dictionary: `x.attrs`.
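
As a rough illustration of the operations listed above, the following sketch builds a small labelled array and applies a few of them. The array `x`, its `station` dimension, and all values are made up for this example and are not part of the tutorial dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps at 3 stations.
time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "station"),
    coords={"time": time, "station": ["a", "b", "c"]},
    attrs={"units": "degC"},
)

# Reduce over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label rather than by integer position.
first_day = x.sel(time="2014-01-01")

# Outer alignment on coordinate labels: labels missing on one side become NaN.
y = x.isel(station=[1, 2])
x_aligned, y_aligned = xr.align(x, y, join="outer")

# Arbitrary metadata lives in a plain dictionary.
units = x.attrs["units"]
```

After the alignment, `y_aligned` has the same three stations as `x`, with station `"a"` filled with NaN because `y` had no data for it.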

Xarray's N-dimensional data structures are well-suited for handling multi-dimensional scientific data. Because xarray refers to dimensions by name rather than by axis number (e.g., `dim='time'` instead of `axis=0`), arrays are easier to manage than raw NumPy ndarrays. With xarray, there's no need to keep track of the order of an array's dimensions or insert dummy dimensions of size 1 for alignment (e.g., using `np.newaxis`).
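
To make the comparison with NumPy concrete, here is a minimal sketch that performs the same broadcast and the same reduction both ways. The arrays `temps` and `offsets` and the dimension names are invented for illustration.

```python
import numpy as np
import xarray as xr

temps = np.arange(6.0).reshape(2, 3)   # axis 0 = time, axis 1 = station
offsets = np.array([10.0, 20.0])       # one offset per time step

# NumPy: we must remember that time is axis 0 and add a dummy axis by hand.
np_result = temps + offsets[:, np.newaxis]

# Xarray: the dimension names carry that bookkeeping.
t = xr.DataArray(temps, dims=("time", "station"))
o = xr.DataArray(offsets, dims=("time",))
xr_result = t + o

# The same reduction, written with an axis number and with a dimension name.
np_mean = temps.mean(axis=0)
xr_mean = t.mean(dim="time")
```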

The immediate benefit of using xarray is reduced code, and the long-term advantage is enhanced understanding when revisiting the code in the future.

## **Data Structures**
Xarray offers two primary data structures: the `DataArray` and `Dataset`. The `DataArray` class adds dimension names, coordinates, and attributes to multi-dimensional arrays, while the `Dataset` class combines multiple arrays.
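
Before loading a real dataset, it may help to see both structures built by hand. The sketch below uses made-up variable names (`temperature`, `pressure`, `weather`) and values. It is only meant to show how the pieces fit together.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates shared by both variables.
coords = {"lat": [10.0, 20.0], "lon": [100.0, 110.0]}

# A DataArray: values plus dimension names, coordinates, and attributes.
temperature = xr.DataArray(
    np.array([[280.0, 281.5], [279.0, 282.0]]),
    dims=("lat", "lon"),
    coords=coords,
    attrs={"units": "K"},
    name="temperature",
)

pressure = xr.DataArray(
    np.array([[1010.0, 1008.0], [1012.0, 1009.0]]),
    dims=("lat", "lon"),
    coords=coords,
    attrs={"units": "hPa"},
)

# A Dataset: several arrays that share dimensions and coordinates.
weather = xr.Dataset({"temperature": temperature, "pressure": pressure})
```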

For practical examples, Xarray provides small real-world tutorial datasets in its GitHub repository [here](https://github.com/pydata/xarray-data). We will use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) function to download and open, by name, the `air_temperature` Dataset from the National Centers for Environmental Prediction.

In [1]:
import numpy as np
import xarray as xr

### **Dataset**
`Dataset` objects function as container-like structures resembling dictionaries. They organize DataArrays, where each variable name is mapped to an associated DataArray within the dataset. This arrangement allows for a comprehensive and structured representation of multi-variable datasets.

In [2]:
# Reading built-in dataset with Xarray
ds = xr.tutorial.load_dataset("air_temperature")
ds

We can access "layers" of the Dataset (individual DataArrays) with dictionary syntax.

In [None]:
ds["air"]

We can save some typing by using the "attribute" or "dot" notation. This won't work for variable names that clash with built-in method names (for example, `mean`).

In [4]:
ds.air
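
The name clash mentioned above can be reproduced with a small, hypothetical Dataset (`ds_demo`) that contains a variable deliberately called `mean`. This is a sketch for illustration and does not use the tutorial data.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with one variable whose name clashes with a method.
ds_demo = xr.Dataset(
    {
        "air": (("time",), np.array([1.0, 2.0, 3.0])),
        "mean": (("time",), np.array([10.0, 20.0, 30.0])),
    }
)

# Dictionary-style access always returns the stored variable.
air_by_key = ds_demo["air"]
mean_by_key = ds_demo["mean"]

# Attribute access works when the name is free...
air_by_attr = ds_demo.air

# ...but `ds_demo.mean` resolves to the Dataset.mean method, not the variable.
mean_attr_is_method = callable(ds_demo.mean)

# Like a dict, a Dataset supports membership tests and iteration over names.
names = list(ds_demo)
```

Dictionary syntax is therefore the safer choice in reusable code, while dot notation is convenient for interactive work.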

#### **Understanding String Representations**

Xarray offers two types of representations: `"html"` (exclusive to notebooks) and `"text"`. You can specify your preference using the `display_style` option.

Up to this point, our notebook has been set to automatically display the `"html"` representation (which we will stick with). The `"html"` representation is interactive, enabling you to collapse sections (using left arrows) and explore attributes and values for each entry (accessible through the right-hand sheet icon and data symbol).

In [5]:
with xr.set_options(display_style="html"):
    display(ds)

The output includes:

- A summary detailing all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`. This information specifies that the first dimension, named `lat`, has a size of `25`, the second dimension, named `time`, has a size of `2920`, and the third dimension, named `lon`, has a size of `53`. Since we access dimensions by name, their order is not significant.
- An unordered list presenting *coordinates* or dimensions with coordinates. Each item is listed on a separate line, providing the name, one or more dimensions in parentheses, the data type (dtype), and a preview of the values. Additionally, if a dimension coordinate is present, it is marked with a `*`.
- An alphabetically sorted list of *dimensions without coordinates* (if any).
- An unordered list detailing *attributes*, or metadata.

🤔 **Note:** Python's `with` statement is used for context management. `xr.set_options(display_style="html")` can be used as a context manager: inside the `with` block the setting is changed temporarily, and once the block is exited the previous setting is automatically restored.
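
The restore-on-exit behaviour can be checked with a short sketch. It assumes an xarray version that provides `xr.get_options()` for reading the current options. The variable names are illustrative.

```python
import xarray as xr

# Remember whatever style is active right now.
before = xr.get_options()["display_style"]

with xr.set_options(display_style="text"):
    # Inside the block the temporary setting is active.
    inside = xr.get_options()["display_style"]

# After the block the previous value is back in place.
after = xr.get_options()["display_style"]
```

Calling `xr.set_options(...)` without `with` changes the option for the rest of the session instead.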

In [6]:
with xr.set_options(display_style="text"):
    display(ds)

To understand each of the components better, we'll explore the "air" variable of our Dataset.

### **DataArray**
The `DataArray` class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).

In [7]:
# Selecting 'air' variable from the dataset
da = ds["air"]
da

#### **Understanding String Representations**
We can use the same two representations (`"html"`, which is only available in
notebooks, and `"text"`) to display our `DataArray`.

In [8]:
with xr.set_options(display_style="html"):
    display(da)

In [9]:
with xr.set_options(display_style="text"):
    display(da)

We can also access the underlying array directly through the `.data` attribute:

In [None]:
ds.air.data # (or equivalently, `da.data`)
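
As a small sketch of what `.data` returns, the example below wraps a NumPy array in a hypothetical `DataArray` (`da_demo`) and compares `.data` with `.values`. For data loaded lazily (for example with dask), `.data` would return the lazy array instead, while `.values` converts to NumPy.

```python
import numpy as np
import xarray as xr

# Hypothetical array used only for this illustration.
da_demo = xr.DataArray(np.arange(6).reshape(2, 3), dims=("y", "x"))

raw = da_demo.data          # the wrapped array object, here a numpy.ndarray
as_numpy = da_demo.values   # the values as a numpy.ndarray
```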

#### **Named Dimensions**
`.dims` holds the names of the axes of your data. Each dimension can either have associated values (a dimension coordinate) or not (a dimension without coordinates). A name can be any object that could be stored in a Python `set` (i.e., calling `hash()` on it does not raise an error), but in practice names are typically strings.

In this instance, there are two spatial dimensions, with shorthand names `lat` and `lon` representing `latitude` and `longitude`, respectively. Additionally, there is one temporal dimension, denoted as `time`.

In [11]:
ds.air.dims

('time', 'lat', 'lon')
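
The difference between a dimension with coordinates and one without can be sketched on a made-up array (`da_dims`), where only `time` is given labels:

```python
import numpy as np
import xarray as xr

# Hypothetical array: `time` has labels, `sample` does not.
da_dims = xr.DataArray(
    np.zeros((2, 3)),
    dims=("time", "sample"),
    coords={"time": [2000, 2001]},
)

dim_names = da_dims.dims                          # ('time', 'sample')
dim_sizes = dict(da_dims.sizes)                   # {'time': 2, 'sample': 3}
has_time_labels = "time" in da_dims.coords        # True
has_sample_labels = "sample" in da_dims.coords    # False

# Reordering the axes does not change how dimensions are referred to.
transposed = da_dims.transpose("sample", "time")
```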

#### **Coordinates**

`.coords` serves as a straightforward [dict-like](https://docs.python.org/3/glossary.html#term-mapping) [data container](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#coordinates) that maps coordinate names to corresponding values. These values can take different forms:

- Another `DataArray` object.
- A tuple `(dims, data, attrs)`, where `attrs` is optional. This is akin to creating a new `DataArray` object with `DataArray(dims=dims, data=data, attrs=attrs)`.
- A 1-dimensional `numpy` array or any convertible type (using [`numpy.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html)), such as a `list`. This array contains numbers, datetime objects, strings, etc., serving as labels for each point.
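
The three accepted forms can be mixed in a single `coords` dictionary. The following sketch uses invented names and values (`lat_da`, `da_coords`) to show one coordinate of each kind.

```python
import numpy as np
import xarray as xr

# Form 1: another DataArray (hypothetical latitude values with metadata).
lat_da = xr.DataArray([10.0, 20.0], dims=("lat",), attrs={"units": "degrees_north"})

da_coords = xr.DataArray(
    np.zeros((2, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={
        # Form 3: a plain 1-dimensional list of labels.
        "time": ["2013-01-01", "2013-01-02"],
        # Form 1: a DataArray.
        "lat": lat_da,
        # Form 2: a (dims, data, attrs) tuple.
        "lon": (("lon",), [100.0, 110.0, 120.0], {"units": "degrees_east"}),
    },
)
```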

In the following example, we observe the actual timestamps and spatial positions associated with our air temperature data:

In [12]:
ds.air.coords

Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

The difference between dimension coordinates and other (non-dimension) coordinates is that, by default, label-based indexing operations (`sel`, `reindex`, etc.) work only with dimension coordinates. In addition, coordinates in general can have any number of dimensions, whereas a dimension coordinate must be one-dimensional.
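
A minimal sketch of this distinction, using a hypothetical array (`da_nd`) with two dimension coordinates and one 2-dimensional non-dimension coordinate (`lon2d`):

```python
import numpy as np
import xarray as xr

da_nd = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={
        "y": [0, 1],          # dimension coordinate (1-D, named after its dimension)
        "x": [10, 20, 30],    # dimension coordinate
        # 2-D non-dimension coordinate (made-up longitudes).
        "lon2d": (("y", "x"), np.array([[100.0, 101.0, 102.0],
                                         [100.5, 101.5, 102.5]])),
    },
)

# Label-based selection through a dimension coordinate.
row = da_nd.sel(y=1)

# In this array only the dimension coordinates are backed by an index.
indexed = sorted(da_nd.indexes)

# A non-dimension coordinate can still drive a boolean mask.
east = da_nd.where(da_nd["lon2d"] > 101.0, drop=True)
```

Here `east` keeps only the columns where at least one `lon2d` value exceeds 101, and fills the remaining non-matching cells with NaN.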

#### **Attributes**
`.attrs` is a dictionary capable of holding diverse Python objects, including strings, lists, integers, dictionaries, etc., to store information about your data. The only constraint is that certain attributes might not be writable to specific file formats.

In [13]:
ds.air.attrs

{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32)}
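
Since `.attrs` is an ordinary dictionary, it can be read and modified like one. The sketch below uses a made-up array (`da_attrs`) and also shows `assign_attrs`, which returns a new object rather than modifying the original.

```python
import numpy as np
import xarray as xr

# Hypothetical array used only for this illustration.
da_attrs = xr.DataArray(np.array([271.3, 280.1]), dims=("time",), name="air")

# attrs behaves like an ordinary dictionary and accepts any Python object.
da_attrs.attrs["units"] = "degK"
da_attrs.attrs["history"] = ["created for illustration"]

# assign_attrs returns a new object and leaves the original untouched.
annotated = da_attrs.assign_attrs(long_name="Air temperature")
```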

## **Bridging Pandas and Xarray**
Frequently, the creation of `DataArray` and `Dataset` objects involves conversions from other libraries like [pandas](https://pandas.pydata.org/) or by reading data from storage formats such as [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) or [zarr](https://zarr.readthedocs.io/en/stable/).

To convert between `xarray` and `pandas`, use the [to_xarray](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html) methods on pandas objects or the [to_pandas](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_pandas.html) methods on `xarray` objects:

In [14]:
import pandas as pd

In [15]:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series

a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64

In [16]:
arr = series.to_xarray()
arr

In [17]:
arr.to_pandas()

index
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64
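
The dimension is called `index` in the output above because the pandas index had no name. As a sketch with a hypothetical index name (`letter`), naming the index controls the resulting dimension name:

```python
import numpy as np
import pandas as pd
import xarray as xr

# The index name becomes the dimension name.
named = pd.Series(
    np.ones(3),
    index=pd.Index(["a", "b", "c"], name="letter"),
    name="ones",
)
arr_named = named.to_xarray()

# An unnamed index falls back to the dimension name "index".
arr_unnamed = pd.Series(np.ones(3), index=["a", "b", "c"]).to_xarray()

# Converting back gives a Series with the same labels and values.
round_trip = arr_named.to_pandas()
```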

We can also control what `pandas` object is used by calling `to_series` /
`to_dataframe`:


**`to_series`**: This will always convert `DataArray` objects to
`pandas.Series`, using a `MultiIndex` for arrays with more than one dimension.


In [18]:
ds.air.to_series()

time                 lat   lon  
2013-01-01 00:00:00  75.0  200.0    241.199997
                           202.5    242.500000
                           205.0    243.500000
                           207.5    244.000000
                           210.0    244.099991
                                       ...    
2014-12-31 18:00:00  15.0  320.0    297.389984
                           322.5    297.190002
                           325.0    296.489990
                           327.5    296.190002
                           330.0    295.690002
Name: air, Length: 3869000, dtype: float32
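
The same conversion is easier to inspect on a tiny array. The sketch below uses made-up values (`da_small`) to show how each element ends up as one row keyed by its dimension labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 array with labelled dimensions.
da_small = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "lat"),
    coords={"time": [0, 1], "lat": [70.0, 72.5, 75.0]},
    name="air",
)

s = da_small.to_series()

# One row per (time, lat) pair, indexed by a MultiIndex.
level_names = list(s.index.names)

# Look up a single element by its labels.
value = s.loc[(1, 72.5)]
```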

**`to_dataframe`**: This will always convert `DataArray` or `Dataset`
objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named
for this.

In [19]:
ds.air.to_dataframe()

| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.199997 |
| 2013-01-01 00:00:00 | 75.0 | 202.5 | 242.500000 |
| 2013-01-01 00:00:00 | 75.0 | 205.0 | 243.500000 |
| 2013-01-01 00:00:00 | 75.0 | 207.5 | 244.000000 |
| 2013-01-01 00:00:00 | 75.0 | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.389984 |
| 2014-12-31 18:00:00 | 15.0 | 322.5 | 297.190002 |
| 2014-12-31 18:00:00 | 15.0 | 325.0 | 296.489990 |
| 2014-12-31 18:00:00 | 15.0 | 327.5 | 296.190002 |
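
The requirement that a `DataArray` be named can be seen in a short sketch. The array `unnamed_da` and the column name `values` are invented for this example.

```python
import numpy as np
import xarray as xr

# Hypothetical array created without a name.
unnamed_da = xr.DataArray(np.arange(3.0), dims=("x",), coords={"x": [10, 20, 30]})

try:
    unnamed_da.to_dataframe()
    raised = False
except ValueError:
    # Without a name there is no label for the DataFrame column.
    raised = True

# Either pass a name for the column or name the array first.
df_a = unnamed_da.to_dataframe(name="values")
df_b = unnamed_da.rename("values").to_dataframe()
```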


Since all columns in a `DataFrame` must share the same index, the variables of a
`Dataset` are broadcast against each other before conversion.

In [20]:
ds.to_dataframe()

| lat | time | lon | air |
|---|---|---|---|
| 75.0 | 2013-01-01 00:00:00 | 200.0 | 241.199997 |
| 75.0 | 2013-01-01 00:00:00 | 202.5 | 242.500000 |
| 75.0 | 2013-01-01 00:00:00 | 205.0 | 243.500000 |
| 75.0 | 2013-01-01 00:00:00 | 207.5 | 244.000000 |
| 75.0 | 2013-01-01 00:00:00 | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 15.0 | 2014-12-31 18:00:00 | 320.0 | 297.389984 |
| 15.0 | 2014-12-31 18:00:00 | 322.5 | 297.190002 |
| 15.0 | 2014-12-31 18:00:00 | 325.0 | 296.489990 |
| 15.0 | 2014-12-31 18:00:00 | 327.5 | 296.190002 |
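
Because the `air_temperature` Dataset has a single variable, the broadcasting is not visible in the output above. A sketch with a hypothetical Dataset (`ds_mixed`) whose variables have different dimensions makes it easier to see:

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset: `a` varies along x and y, `b` along x only.
ds_mixed = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(6.0).reshape(2, 3)),
        "b": (("x",), np.array([100.0, 200.0])),
    },
    coords={"x": [0, 1], "y": [10, 20, 30]},
)

df_mixed = ds_mixed.to_dataframe()

# `b` is repeated for every y so that both columns share the (x, y) index.
b_for_x0 = df_mixed.xs(0, level="x")["b"].tolist()
b_for_x1 = df_mixed.xs(1, level="x")["b"].tolist()
```

The resulting frame has one row per `(x, y)` pair, so the one-dimensional variable `b` takes three times as many rows as it has values.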


