# Xarray Part 1

![xarray logo](../images/xarray_logo.png)

https://xarray.pydata.org/en/stable/index.html

⚡️ **xarray** is a Python package for handling multi-dimensional datasets in a simple way. It provides a large set of functions for advanced analytics and visualization, and it is part of higher-level package ecosystems like [Pangeo](https://pangeo.io/).

⭐️ **xarray**'s underlying data model is borrowed from the data format [NetCDF](http://www.unidata.ucar.edu/software/netcdf). This data format in combination with the [Climate and Forecast conventions](https://cfconventions.org/) is the standard for the climate science community. A large part of DKRZ's data is available in netCDF. Therefore, `xarray` allows fast and intuitive data analysis on this kind of data.

💥 **xarray**'s data structures handle scientific data by using labels, attributes, dimensions and coordinates, and they extend the capabilities of **NumPy** and **pandas**.

### Content:
* [DataArrays](#DataArray)
* [Dimensions](#Dimensions)
* [Coordinates](#Coordinates)
* [Variable attributes](#Variable-attributes)
* [Datasets](#Datasets)
* [Read and open files](#read-and-open)
* [Indexing and selecting data](#index-and-select)

<br />

### Requirements:
* [numpy](numpy_intro.ipynb)

## Overview: **Xarray's** data model

A **data model**  🗃️ describes how the package organizes elements of data and standardizes how they relate to one another. At the code level, a graph of a data model shows the interconnections of classes, types and methods. **Xarray's** data model consists of the building blocks *Dataset*, *DataArray*, *Dimension*, *Coordinate* and *attributes*.

📎 Dataset ( ≈ file ): 

    Dict-like collection of DataArray objects with aligned dimensions. It uses variables, dimensions, coordinates, and attributes in the same way as a DataArray. You can think of an xarray Dataset as a netCDF-file-like object. It holds no data itself, only references to DataArrays.

💾  DataArray ( = variable in the file ): 

    N-dimensional array with named dimensions. It adds dimension names, coordinates, and attributes to the underlying data structure (numpy or dask arrays).
 
↔️  Dimensions: 

    Named dimension axes. If no names are given, the dimensions are called dim_0, dim_1, ...


🌎 Coordinates: 

    An array which labels a dimension. Two types are defined: a) dimension coordinates: 1-dimensional coordinate arrays whose name is the same as the name of the dimension they label; b) non-dimension coordinates: coordinate arrays assigned to the DataArray under a name of their own that differs from the dimension names.

<br />

![xarray_core_data_structure.png](../images/xarray_core_data_structure.png)
From https://xarray-contrib.github.io/xarray-tutorial/online-tutorial-series/01_xarray_fundamentals.html
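
The building blocks described above can be sketched in a few lines. This is a minimal illustration, not part of the course data: the variable name `temperature`, the station values, and the attribute are all made up for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical values for three stations (not taken from the course data)
temperature = xr.DataArray(
    np.array([281.5, 279.0, 284.2]),
    dims=["station"],
    coords={
        "station": [0, 1, 2],                        # dimension coordinate
        "lon": ("station", [-105.0, -98.5, -90.2]),  # non-dimension coordinate
        "lat": ("station", [40.0, 35.5, 30.1]),      # non-dimension coordinate
    },
    name="temperature",
    attrs={"units": "K"},
)

# A Dataset is a dict-like container of DataArrays that share dimensions
ds = xr.Dataset({"temperature": temperature})

print(ds)
print(list(ds.data_vars))      # names of the data variables
print(ds["temperature"].dims)  # dimension names of one variable
```

The coordinate `station` has the same name as the dimension, while `lon` and `lat` only refer to it.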

<br/>

### Importing modules

In addition to xarray, the examples in this notebook need the Python libraries NumPy, pandas, and cfgrib.

```python
import xarray as xr

import numpy as np
import pandas as pd
import cfgrib

```

<br />


If you work with JupyterLite, run the following *before* importing the packages:

```python
import micropip
await micropip.install(['xarray','cfgrib'])
```

<br />


In [None]:
import xarray as xr
import numpy as np
import pandas as pd
import cfgrib

<br>

## DataArray

🔄 As a start, we compare the `numpy` array with an `xarray`'s **DataArray** type. You can directly convert a `numpy` array into an `xarray` **DataArray** type by using it as input for `xarray`'s function `DataArray`. We use the data from the file `pr.dat` by loading it with `numpy`.

```python
pr_data = np.loadtxt('pr.dat', usecols=(1,2,3), skiprows=1)
pr_data_xr = xr.DataArray(pr_data)
pr_data_xr
```

In JupyterLite, you can fetch the file content first:
```python
from js import fetch
res = await fetch('https://swift.dkrz.de/v1/dkrz_0b2a0dcc-1430-4a8a-9f25-a6cb8924d92b/python_workshop/pr.dat')
text = await res.text()

from io import StringIO
f = StringIO(text)
```
where `f` can then be passed to `np.loadtxt` in place of the file name.
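
The following sketch shows that `np.loadtxt` accepts such a file-like object and what an unlabeled `DataArray` looks like. The text below is a hypothetical stand-in for `pr.dat`; its header and values are assumptions chosen only to match the `usecols` and `skiprows` settings used above.

```python
from io import StringIO

import numpy as np
import xarray as xr

# Hypothetical stand-in for the text of pr.dat: a header line, then one row
# per station with an id followed by three numeric columns.
text = """id pr lon lat
s1 1.2 -105.0 40.0
s2 0.0 -98.5 35.5
s3 3.4 -90.2 30.1
"""

f = StringIO(text)
sample = np.loadtxt(f, usecols=(1, 2, 3), skiprows=1)

# Without further arguments, xarray assigns default dimension names
sample_xr = xr.DataArray(sample)
print(sample.shape)
print(sample_xr.dims)
```

The default names `dim_0` and `dim_1` carry no meaning yet, which is what the following sections address.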

`pr_data_xr` has more structure and descriptive information than `pr_data`. In contrast to the `numpy` array, an `xarray` DataArray can separate the variable of interest, `pr`, as a *data variable* from the *coordinate* variables. In summary, it contains:


- ↔️ **dimensions** with names              (`pr_data_xr.dims`)
- 🌎 **coordinates** pointing to variables  (`pr_data_xr.coords`)
- 🎨  and **attributes**                     (`pr_data_xr.attrs`)

Not only `xarray` but also other software tools require and use the **labeled geospatial** information from coordinates, for example for

- 🖼️ **plotting**: mapping of data on a real world grid point
- 🖩 **analysis**: implemented routines for e.g. area *weighted* means can be run

This information is not parsed from the input numpy array by default when calling `xr.DataArray()`. Since we know it, we pass it to `xr.DataArray()` via the function parameters (arguments and keyword arguments):

```python
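xr.DataArray(data,         # array values
             coords=None,  # coordinate arrays
             dims=None,    # dimension names
             name=None,    # name of the data variable
             attrs=None    # dictionary of attributes
            )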
```

<div class="alert alert-info">
    <b>Note:</b> When working with <b>xarray</b>, the arguments and keyword arguments of a function are <i>in general</i> very useful and important!
</div>

### Parsing numpy data with labels to xarray

Let's define a clear structure for the `xarray.DataArray()` for the numpy data first:

1. The actual **data** for the data variable is in the first column of the `numpy` array.
2. The **coords** are the second and third columns of the `numpy` array. They have the same length as the data column.
3. We have one dimension (**dims**) which refers to the *station*. It is an index which runs from 0 to the length of a column minus 1.
4. The **name** of the data variable is *Precipitation*.
5. In the **attrs**, we can store variable attributes like *units*.

Let's bring that into context with `xr.DataArray()`:
```python
pr_data_xr = xr.DataArray(pr_data[:,0],
                          coords={"lon":("Station",pr_data[:,1]),
                                  "lat":("Station",pr_data[:,2])},
                          dims=["Station"],
                          name="Precipitation",
                          attrs={"units":"mm",
                                "coords":"lon lat"})
```

In [None]:
print("Variable Name: ",pr_data_xr.name)
print("Dimensions: ",pr_data_xr.dims)
print("Coordinates: ",pr_data_xr.coords)
print("Sizes: ",pr_data_xr.sizes)
print("Attribute: ",pr_data_xr.attrs)

### Dimensions

↔️ Each dimension is a named axis of the array. Its positions are **indices** that run from 0 to the length of the dimension minus 1.

In our example, we only have one dimension, where each index refers to one **station**. Consequently, if we create a quick plot of the data with the DataArray's `.plot()` method, we only get a one-dimensional view:

In [None]:
pr_data_xr.plot()

#### Create a two dimensional georeferenced plot 🖼️

Our goal for this session now is to reorganize the data so that `.plot()` returns a meshed grid plot.
For that, we create a less condensed **two-dimensional** DataArray (with a lot of `NaN` values). 

<br />

<h2 style="color:red"> Exercise </h2>

1. Create a two-dimensional numpy array named `pr_data_2d` with the size `len(pr_data)` x `len(pr_data)`

1. Assign `NaN` values to the entire array

1. On the diagonal of the square array, insert the precipitation values (first column of `pr_data`)

1. Show the new array

<br />

You will need:

- `np.empty()`
- `np.nan`
- `for` loop


Let's pass this array to **Xarray**.

<br />

<h2 style="color:red"> Exercise </h2>

1. Reset the variable `pr_data_xr` with a `xr.DataArray()` but use `pr_data_2d` as input.

    1.1. Set a correct configuration for the parameters of the function.

2. Plot again

<br />


We plot the two-dimensional DataArray:

### Coordinates

🌎 The plot only uses the indices of the dimensions for the x and y axes of the plot. This is because the **coordinates** `lat` and `lon` are not interpreted as **index coordinates**. `Xarray` will interpret coordinates as **index coordinates** only if the name of the coordinate is the same as the name of the dimension. 
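
The naming rule can be checked on a small example. This is a sketch with made-up values; the names `plain` and `indexed` are only used here for illustration.

```python
import numpy as np
import xarray as xr

values = np.array([1.2, 0.0, 3.4])      # hypothetical data
lon = np.array([-105.0, -98.5, -90.2])  # hypothetical longitudes

# Coordinate name differs from the dimension name: no index is built for it
plain = xr.DataArray(values, dims=["Station"], coords={"lon": ("Station", lon)})

# Coordinate name equals the dimension name: it becomes an index coordinate
indexed = xr.DataArray(values, dims=["lon"], coords={"lon": lon})

print("lon" in plain.indexes)
print("lon" in indexed.indexes)

# Label-based selection works on the index coordinate
print(indexed.sel(lon=-98.5).item())
```

Only the second array can be addressed by longitude values, and only there will `.plot()` use the longitudes as axis values.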

<br />

<h2 style="color:red"> Exercise </h2>

1. Reset the variable `pr_data_xr` with `xr.DataArray()` again, but rename `coords` or `dims` so that their names are equal.
2. Plot again

<br />


You will receive a

```Python
ValueError: The input coordinate is not sorted in increasing order along axis 0. Consider calling the `sortby` method on the input DataArray.
```

<div class="alert alert-info">
    <b>Note:</b> When running into errors with <b>xarray</b>, the output will be very helpful and guiding. Do not be afraid of making mistakes!
</div>

So let's use `sortby`:

```python
# sortby returns a new, sorted DataArray, so assign the result
pr_data_xr = pr_data_xr.sortby(["lon","lat"])
```

Are the dimensions in the wrong order? There are several ways to repair this. One is to use `xarray`'s transpose function:

```python
# transpose also returns a new DataArray
pr_data_xr = pr_data_xr.transpose("lat","lon")
```
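
Both steps can be tried on a small synthetic array. This is a sketch under assumptions: the coordinate values are invented and deliberately unsorted, and `np.fill_diagonal` is used as a shortcut to place the values on the diagonal.

```python
import numpy as np
import xarray as xr

lon = np.array([-90.2, -105.0, -98.5])  # hypothetical, deliberately unsorted
lat = np.array([30.1, 40.0, 35.5])

# Square array filled with NaN, station values on the diagonal
data_2d = np.full((3, 3), np.nan)
np.fill_diagonal(data_2d, [3.4, 1.2, 0.0])

da = xr.DataArray(data_2d, dims=["lon", "lat"], coords={"lon": lon, "lat": lat})

# sortby and transpose return new objects, so the results are assigned
da_sorted = da.sortby(["lon", "lat"])
da_plot = da_sorted.transpose("lat", "lon")

print(da_sorted.lon.values)
print(da_plot.dims)
```

Sorting reorders the data together with the coordinates, so each value stays attached to its longitude and latitude.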

With only a few `xarray` commands, we created a plot which gives us an idea of the places for which the data is valid.
- The *boundaries* of the grid points are artificial. They are not specified but only rendered by the plot function.
- In the next sessions we will learn more sophisticated plotting, including e.g. *coastlines*.

In [None]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

proj=ccrs.PlateCarree()
ax = plt.axes(projection=proj)
ax.set_extent([-120, -80, 20, 60], proj)
ax.stock_img()
ax.coastlines()

pr_data_xr.plot()

<a class="anchor" id="Variable-attributes"></a>
### Variable attributes

<br /> 

🎨 You can easily set an attribute, for instance the attribute _name_ :

```python
pr_data_xr.name = 'precip'
```

<br />

Variables in Earth Science commonly have attributes like **standard_name**, **long_name** or **units** which can be added via the _attrs_ attribute to the DataArray. 

Add the units attribute to the DataArray `pr_data_xr`:

```python
pr_data_xr.attrs['units'] = 'mm'
```

<br />


<h2 style="color:red"> Exercise </h2>

1. Add the variable attribute units as shown above
1. Add the variable long_name (as you like ;))
1. Change the long_name
1. Print all attributes

<br />


<br />

## Datasets

📎 Xarray's function `open_dataset` can be used to open and read the content of a file. It supports various formats, such as **netcdf, grib, zarr**, etc. (default: netcdf4). The file content will be  stored in the Xarray Dataset structure.

Example:

In the data directory of the course material, we use the file _tsurf.nc_ to demonstrate Xarray's file handling.

```python
ds = xr.open_dataset('../data/tsurf.nc')

ds.info()
```

Result:

```python
xarray.Dataset {
dimensions:
	lat = 96 ;
	lon = 192 ;
	time = 40 ;

variables:
	datetime64[ns] time(time) ;
		time:standard_name = time ;
		time:axis = T ;
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
	float32 tsurf(time, lat, lon) ;
		tsurf:long_name = surface temperature ;
		tsurf:units = K ;
		tsurf:code = 169 ;
		tsurf:table = 128 ;

// global attributes:
	:CDI = Climate Data Interface version 1.9.6 (http://mpimet.mpg.de/cdi) ;
	:Conventions = CF-1.6 ;
	:history = Thu Oct 10 16:08:50 2019: cdo selname,tsurf rectilinear_grid_2D.nc tsurf.nc ;
	:CDO = Climate Data Operators version 1.9.6 (http://mpimet.mpg.de/cdo) ;
}
```

<br />


<br />

### Show variable names and coordinates

🌎 It is always good to have a closer look at the data, and this can be done very easily using the attributes explained above.

Show the coordinates stored in file:

```python
coords = ds.coords
```

Result:

```python
 Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-01-10T18:00:00
  * lon      (lon) float64 -180.0 -178.1 -176.2 -174.4 ... 174.4 176.2 178.1
  * lat      (lat) float64 88.57 86.72 84.86 83.0 ... -83.0 -84.86 -86.72 -88.57
```

List the variables stored in the file:

```python
variables = ds.variables

```

Here we can see the time displayed in a readable way, because Xarray uses NumPy's `datetime64` data type under the hood. The variable and coordinate attributes are displayed as well.
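
What the `datetime64` time coordinate offers can be sketched without the file. The array below is a hypothetical stand-in: it only reproduces the time range shown in the coordinate listing above (40 steps, 6 hours apart), and the name `dummy` is made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 40 time steps, 6 hours apart, starting 2001-01-01
time = pd.date_range("2001-01-01", periods=40, freq=pd.Timedelta(hours=6))

da = xr.DataArray(np.zeros(40), dims=["time"], coords={"time": time}, name="dummy")

print(da.time.dtype)
print(da.time.values[-1])
print(da.time.dt.hour.values[:4])  # datetime components via the .dt accessor

# Datetime coordinates allow selection by date strings
first_day = da.sel(time="2001-01-01")
print(first_day.sizes["time"])
```

Selecting with a date string returns all time steps of that day, which is often more convenient than counting indices.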

<h2 style="color:red"> Exercise </h2>

Read the file and try the above commands.

<br />

### Dimensions, shape and size

To get more information about the dimensions of a **Dataset** and the shape and size of one of its variables, we can use the appropriate attributes.

```python
tsurf = ds.tsurf   # select the data variable as a DataArray
dims  = ds.dims
shape = tsurf.shape
size  = tsurf.size
rank  = len(shape)

print('dimensions: ', dims)
print('shape:      ', shape)
print('size:       ', size)
print('rank:       ', rank)
```
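
The same attributes can be explored without the file. The sketch below builds a hypothetical stand-in with the dimension lengths listed for `tsurf.nc` above. It uses `ds.sizes` for the mapping from dimension names to lengths, because recent xarray versions warn that the return type of `Dataset.dims` is going to change.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in with the same dimension lengths as tsurf.nc
tsurf = xr.DataArray(
    np.zeros((40, 96, 192), dtype="float32"),
    dims=["time", "lat", "lon"],
    name="tsurf",
)
ds = tsurf.to_dataset()

print(dict(ds.sizes))  # mapping from dimension name to length
print(tsurf.dims)      # tuple of dimension names
print(tsurf.shape)     # lengths in the order of tsurf.dims
print(tsurf.size)      # total number of elements
print(tsurf.ndim)      # same as len(tsurf.shape)
```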

<br />

<a class="anchor" id="read-and-open"></a>
### Read another file format

 💾 💽 📀 
Xarray needs an _engine_ to read another file format. Here, we demonstrate how to read a GRIB file using the **cfgrib**  _engine_ from the additional library __cfgrib__ (don't forget to import it).

```python
import cfgrib

ds2 = xr.open_dataset('../data/MET9_IR108_cosmode_0909210000.grb2',
                      engine='cfgrib')

variables2 = ds2.variables
```

<h2 style="color:red"> Exercise </h2>

Read the GRIB file yourself.

<br />

<br>

### Open multiple files

📎📎📎 In the course directory **data** there are 3 files _precip_day01.nc, precip_day02.nc, and precip_day03.nc_, each containing the data of one day in 6 hour intervals. 

**Xarray** provides the function `open_mfdataset` to read multiple files in one step as a single dataset. Before you can use `open_mfdataset` make sure that the Python module **dask** is installed in your environment.

<br>


One reason why `xarray` is very fast with multiple files is that it does not **load** the data when the files are opened. This is possible by using an underlying library named `dask`. You can recognize that by checking for the `precip` variable in `dsm`.

```python
dsm = xr.open_mfdataset('../data/precip_day0*.nc')  # path assumed relative to this notebook
dsm.precip[1,4,5]
```
will not show you an exact value but only a description of what this output will be. You would have to load the data into memory first for accessing one specific point of the array. This is most often not necessary for your workflow.

The entire array can be loaded into memory by `dsm.precip.load()`. You can also do: 
```python
dsm.precip.values[1,4,5]
```

➡️ As long as the data is not loaded, you can work on files that are *larger than memory*.

In [None]:
dsm.precip[1,4,5]

In [None]:
dsm.precip.load()
dsm.precip[1,4,5]
# is the same as
dsm.precip.values[1,4,5]

The [open_mfdataset](http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html?highlight=open_mfdataset) function is very powerful. It has more than **10 arguments** which allow users to configure how the files are combined:

- Along which dimension should the data be concatenated
- How strictly should tests ensure that the data can be concatenated
- What are coordinates, what are data variables
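
The idea of concatenating along one dimension can be illustrated in memory with `xr.concat`. This is only a sketch, not what `open_mfdataset` does internally: the helper `make_day`, the dates, and the grid are hypothetical stand-ins for the three daily files.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_day(day):
    """Build a hypothetical one-day dataset with four 6-hourly time steps."""
    time = pd.date_range(day, periods=4, freq=pd.Timedelta(hours=6))
    precip = xr.DataArray(
        np.zeros((4, 2, 3)),
        dims=["time", "lat", "lon"],
        coords={"time": time, "lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]},
        name="precip",
    )
    return precip.to_dataset()

days = [make_day(d) for d in ["2001-01-01", "2001-01-02", "2001-01-03"]]

# Concatenate along the time dimension, as one would when merging daily files
combined = xr.concat(days, dim="time")
print(combined.sizes["time"])
```

With real files, a corresponding call could look like `xr.open_mfdataset(paths, combine="nested", concat_dim="time")`, where `paths` is a list of file names. The keyword arguments `compat`, `data_vars` and `coords` relate to the other two questions in the list.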

### Dataset attributes

🎨 Dataset attributes and variable attributes are important for understanding what the data represents, not only for humans but also for machines. Therefore, it is important that they are available and have a standard format. In addition to the attributes of a `DataArray`, there are also **global** or dataset attributes.

```python
tas_hr=xr.open_dataset("/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_201501-201912.nc")
global_atts = list(tas_hr.attrs)  # names of the global (dataset) attributes
print(global_atts)
```

Assuming that we know the variable and attribute names, we can get their content immediately.

```python
units = tas_hr.tas.units

print('units:     ', units)
```
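
The difference between global and variable attributes can be tried on a small in-memory dataset. All names and values below are placeholders chosen for the sketch; they are not read from the CMIP6 file.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset; attribute values are placeholders
tas = xr.DataArray(
    np.zeros(3),
    dims=["time"],
    name="tas",
    attrs={"units": "K", "long_name": "Near-Surface Air Temperature"},
)
ds = xr.Dataset({"tas": tas}, attrs={"Conventions": "CF-1.7"})

print(list(ds.attrs))         # global attribute names
print(ds.tas.attrs["units"])  # dictionary-style access
print(ds.tas.units)           # attribute-style shortcut
```

The dictionary-style access also works for attribute names that are not valid Python identifiers or that clash with existing methods.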

<br />


<h2 style="color:red"> Exercise </h2>

List the attributes of the variable _tas_ and print their content.