# Data science with xarray

Hello and welcome to your intro to `xarray` for data science as part of NCI's Parallel Python data science course.

This notebook covers the fundamentals of `xarray`, highlighting some similarities with tools you have already been introduced to, such as `numpy` and `CuPy`, and some of the advantages of using `xarray`.


First of all, what is `xarray`? `xarray` is a project that came out of climate and geophysics research, recognising the need for rapid, scalable and easily manipulated N-dimensional array data **with labels and metadata**.

But wait, doesn't `numpy` provide an N-dimensional array? Yes it does, but `xarray` adds labels and rich metadata to those arrays, providing, to quote the manual, a "more intuitive, more concise, and less error-prone developer experience".

This is done by providing a label-based API that removes much of the manual bookkeeping involved in working with `numpy` arrays directly.
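As a small illustration of that bookkeeping, the sketch below takes a mean along one dimension in both libraries. The array shape and the dimension names `time` and `station` are made up for this example and do not come from the course data.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 4 time steps measured at 3 stations.
raw = np.arange(12.0).reshape(4, 3)

# With numpy we have to remember that axis 0 means time.
numpy_mean = raw.mean(axis=0)

# With xarray we name the dimension instead of its position.
labelled = xr.DataArray(raw, dims=("time", "station"))
xarray_mean = labelled.mean(dim="time")

print(numpy_mean)
print(xarray_mean.dims)
```

Both calls compute the same numbers; the difference is that the `xarray` version still reads correctly if the dimensions are later reordered.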

No longer will you forget what experiment this data came from, what that tensor dimension was, or which array position corresponded to which time point! Even better, these augmented N-dimensional arrays can be combined to make massive datasets, enabling analysis of huge volumes of data. It's all there in one data structure that can be easily stored and shared, as is done on a massive scale using `xarray` in the climate modelling, geophysics and astrophysics communities, to name just a few.


Alright, let's jump in!

In [2]:
# let's import xarray and numpy under their conventional aliases
import xarray as xr
import numpy as np

The `DataArray` is the `xarray` equivalent of the `numpy` `ndarray` and will be the first focus of our intro. Let's make a 2 x 3 `DataArray` from a `numpy` array, so we can get the hang of how to work with it.

In [9]:
inp = np.arange(6).reshape(2,3)
data = xr.DataArray(inp, dims=("x", "y"))
data

We can see that our data array looks a lot like a numpy array, but with two labelled dimensions `x` and `y`. We can access our values directly using the `values` attribute as shown below. 

In [14]:
data.values

array([[0, 1, 2],
       [3, 4, 5]])

We can also access our dimensions using the `dims` attribute, as shown below.

In [15]:
data.dims

('x', 'y')

The eagle-eyed amongst you may have spotted the `coords` attribute. This is used to associate each position along a particular axis with another value, which could, for example, correspond to the time or location at which it was measured. The possibilities are limited only by your imagination.

We set our `coords` using a dictionary that maps each dimension name to the coordinate values along that axis.

In [23]:
coords = {"x":[10,20], "y":[0.1, 0.2, 0.3]}
data = data.assign_coords(coords)

**Notice the detail above**: we had to assign the result of `assign_coords` to a name (in this case we overwrote `data`) for our change to persist, because the method returns a new object rather than modifying `data` in place. This pattern will be familiar to those who use `pandas`.
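A minimal sketch of that behaviour is shown below. It uses fresh names (`unlabelled`, `labelled`) that are introduced only for this illustration, so it does not disturb the `data` array used in the rest of the notebook.

```python
import numpy as np
import xarray as xr

unlabelled = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"))
coords = {"x": [10, 20], "y": [0.1, 0.2, 0.3]}

# Calling assign_coords and discarding the result leaves the original untouched.
unlabelled.assign_coords(coords)
print("x" in unlabelled.coords)

# Keeping the returned object is what makes the coordinates stick.
labelled = unlabelled.assign_coords(coords)
print(labelled.coords["x"].values)
```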

Okay cool, how can we access the data in our array? We can use four kinds of indexing.

* `numpy`-like, using integer positions
* using `loc`, like in `pandas`
* using an integer selection (`isel`), combining a dimension name with an integer position
* using a selection (`sel`), combining a dimension name with a coordinate value

The following selections all give the same set of values, those at `x=10`:

Numpy-like, using array indexing:

In [27]:
data[0, :]

Pandas-like, using `loc`:

In [28]:
data.loc[10]

An xarray integer selection:

In [31]:
data.isel(x=0)

An xarray selection:

In [32]:
data.sel(x=10)

When dealing with a complicated multidimensional dataset, I and the creators of `xarray` would argue that the last two are the simplest and most powerful, since they refer to dimensions by name rather than by position.
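To make the comparison concrete, the sketch below rebuilds the same 2 x 3 array under the illustrative name `demo`, checks that the four selections agree, and then transposes the array to show what changes. The transpose step is an addition for this example.

```python
import numpy as np
import xarray as xr

demo = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [0.1, 0.2, 0.3]},
)

by_position = demo[0, :]   # numpy-like
by_loc = demo.loc[10]      # pandas-like
by_isel = demo.isel(x=0)   # dimension name + integer position
by_sel = demo.sel(x=10)    # dimension name + coordinate value

# All four return the values at x=10.
for result in (by_position, by_loc, by_isel, by_sel):
    print(result.values)

# After a transpose, positional indexing picks a different slice,
# while the name-based selection is unaffected.
flipped = demo.T
print(flipped[0, :].values)      # the first row is now y=0.1
print(flipped.sel(x=10).values)  # still the values at x=10
```

Purely positional indexing depends on the order of the dimensions, whereas `isel` and `sel` keep working when that order changes.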

In [33]:
# Attach free-form metadata to the array through its attrs dictionary
data.attrs["long_name"] = "random velocity"
data.attrs["units"] = "metres/sec"
data.attrs["description"] = "A random variable created as an example."


In [34]:
# Each coordinate carries its own attrs dictionary too
data.x.attrs["units"] = "x units"
data.x

In [12]:
# Arithmetic is element-wise, as in numpy; the coordinates are kept
data += 100
data

In [13]:
# Transpose: the dimension order becomes (y, x)
data.T


In [14]:
# A 1-D array that reuses the y coordinate of data
a = xr.DataArray(np.random.randn(3), [data.coords["y"]])
a

In [15]:
# A 1-D array along a new dimension z, with no coordinate values
b = xr.DataArray(np.random.randn(4), dims="z")
b

In [16]:
# a and b share no dimension, so the sum broadcasts to dims (y, z), shape (3, 4)
c = a + b
c

In [18]:
c

In [42]:
# Two 1-D arrays on different dimensions, each with its own coordinate
x = xr.DataArray(np.arange(10), dims="x", coords={"x": np.arange(10) / 10})

y = xr.DataArray(np.arange(3), dims="y", coords={"y": np.arange(3) / 10})

In [43]:
x

In [44]:
y

In [45]:
# No shared dimension, so the product broadcasts to dims (x, y), shape (10, 3)
z = x * y
z
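The cell above multiplies two arrays that share no dimension. For comparison, here is a sketch of the same outer-product style multiplication in plain `numpy`, where the extra axes have to be inserted by hand. The names `x_vals`, `y_vals`, `x_da`, `y_da`, `z_numpy` and `z_xarray` are used only in this illustration.

```python
import numpy as np
import xarray as xr

x_vals = np.arange(10)
y_vals = np.arange(3)

# numpy broadcasts by position, so we add the missing axes ourselves.
z_numpy = x_vals[:, np.newaxis] * y_vals[np.newaxis, :]

# xarray broadcasts by dimension name, so no reshaping is needed.
x_da = xr.DataArray(x_vals, dims="x", coords={"x": x_vals / 10})
y_da = xr.DataArray(y_vals, dims="y", coords={"y": y_vals / 10})
z_xarray = x_da * y_da

print(z_xarray.dims, z_xarray.shape)
print(np.array_equal(z_numpy, z_xarray.values))
```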

In [47]:
# Positional indexing: the element at x=0.1, y=0.1
z[1, 1]

In [48]:
# w has the same dimension and coordinate values as x
w = xr.DataArray(np.arange(10), dims="x", coords={"x": np.arange(10) / 10})

In [50]:
# Same dimension and matching coordinates, so this is an element-wise product
q = w * x
q

In [53]:
# z has dims (x, y) and w has dims (x), so w is broadcast along y
l = z * w
l

In [None]:
x.save_n