# Working with multidimensional datasets in xarray
#### Following along with a tutorial from <i>WestGrid</i> to better understand the **xarray** Python package. I need to understand it well for the final project in my Scientific Visualization class.

[See tutorial here](https://www.youtube.com/watch?v=xdrcMi_FB8Q&ab_channel=WestGrid)

## Xarray Library
- Built on top of numpy and pandas
- Brings the power of pandas (easy data manipulation, I/O, plotting) to multidimensional arrays
- Not limited to 3D --> data of any dimensionality
- Makes it easy to work with time-dependent arrays (time = one of the dimensions)

There are two main data structures in xarray:
1. *xarray.DataArray* is a fancy, labelled version of a numpy.ndarray
2. *xarray.Dataset* is a collection of multiple *xarray.DataArray*'s that (usually) share dimensions

## DataArray

In [2]:
import xarray as xr
import numpy as np
data = xr.DataArray(
    np.random.random(size=(4,3)), #this has 4 rows & 3 columns
    dims=("y", "x"),              #we want 'y' to represent the rows & 'x' to represent columns
    coords={"x": [10,11,12],      #dictionary mapping each dimension name to its coordinate values
            "y": [10,20,30,40]}
)
print(data)

<xarray.DataArray (y: 4, x: 3)>
array([[0.20831876, 0.36910154, 0.38853407],
       [0.11262523, 0.08547024, 0.11565599],
       [0.92867073, 0.06485579, 0.02862245],
       [0.78135015, 0.40681556, 0.98554136]])
Coordinates:
  * x        (x) int32 10 11 12
  * y        (y) int32 10 20 30 40


## Coordinates and Attributes

In [3]:
data.dims
#returns the names of the dimensions, in order

('y', 'x')

In [4]:
data.size, data.dtype
#total number of elements (size) & the data type (dtype)

(12, dtype('float64'))

### access specific coordinates

In [5]:
#access specific coordinates
#data.coords & square brackets to specify the key for the coordinate

data.coords['x']

In [6]:
#access specific coordinates 
#add a second pair of square brackets to pick one value by position (here, the second value of 'x')
data.coords['x'][1]

### pandas-like notation

In [7]:
#attribute-style access, like name_pandas_dataframe.column in pandas
data.x[1]

In [8]:
#get the underlying numpy array by using '.values' at the very end
data.x.values

#the output is a 1-dimensional numpy array, with 3 elements

array([10, 11, 12])

### subsetting arrays
Use the usual Python square brackets to select rows and columns by position, or use one of these methods to select along named dimensions:
- **.isel()** = select by coordinate index (single index, list, range)
- **.sel()** = select by coordinate value (single value, list, range)
- **.interp()** = interpolate by coordinate value
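
As a minimal sketch of the first two methods (not taken from the tutorial), the cell below rebuilds an array with the same dimensions and coordinates as `data`, but filled with `np.arange` so the values are predictable. The variable names `grid`, `by_index`, `by_value`, `rows` and `window` are made up for illustration. `.interp()` is left out here because, as far as I know, it depends on SciPy being installed.

```python
import numpy as np
import xarray as xr

# Same layout as `data` above, but with predictable values 0..11
grid = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("y", "x"),
    coords={"x": [10, 11, 12], "y": [10, 20, 30, 40]},
)

# .isel() selects by position along a named dimension
by_index = grid.isel(y=0)           # first row
rows = grid.isel(y=[0, 2])          # first and third rows

# .sel() selects by coordinate value (label) instead of position
by_value = grid.sel(y=10)           # the row whose 'y' label is 10
window = grid.sel(y=slice(20, 30))  # label slices include both end points

print(by_index.values)
print(window.y.values)
```

Because `y=10` is the label of the first row, `by_index` and `by_value` refer to the same row here.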

In [9]:
# first row & all elements of the second dimension; the result is also a DataArray

data[0,:] 

In [10]:
# all rows *:*   &  last two columns *-2:*

data[:,-2:] 

In [12]:
# can also modify in-place

data[-1,-1] = 0.99999 

In [13]:
# first row, numpy array

data.values[0,0:]

array([0.20831876, 0.36910154, 0.38853407])

### Aggregate Functions
You can apply aggregate functions (mean, standard deviation, etc.) along one or more named dimensions

In [15]:
#apply the mean over 'y' 

meanOfEachColumn = data.mean(dim="y")

In [16]:
# apply mean over both x & y
spatialMean = data.mean(dim=['x', 'y'])


In [17]:
#apply mean over both x & y --- with no dim argument, the mean is taken over all dimensions

spatialMean = data.mean()

In [19]:
spatialMean

### DataArray.groupby()

This lets you take your multi-dimensional array and divide it into groups based on the values of a coordinate (for example, the month of each time step).
This is useful in cases where you want to apply a function separately to each group.
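
A small sketch of the idea, assuming a made-up daily series (the names `series` and `monthly_mean` and the date range are mine, not from the tutorial). It groups a time-dependent array by calendar month and takes the mean of each group.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 60 daily values, 1 Jan 2000 through 29 Feb 2000
time = pd.date_range("2000-01-01", periods=60, freq="D")
series = xr.DataArray(np.arange(60.0), dims="time", coords={"time": time})

# "time.month" groups the time steps by their calendar month,
# then .mean() is applied to each group separately
monthly_mean = series.groupby("time.month").mean()

print(monthly_mean)
```

The result has a new `month` dimension with one entry per group, in place of the original `time` dimension.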

### Convert to netCDF

You can save a DataArray or Dataset as a netCDF file. The method **.to_netcdf()** writes it to disk. Review [`37:00 - 38:30`](https://youtu.be/xdrcMi_FB8Q?t=2215) to see this & how to import the file into ParaView

`name.to_netcdf("name.nc")`

## Dataset
An *xarray.Dataset* is a collection of multiple *DataArray*'s that (usually) share dimensions
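
As a rough illustration (my own sketch, not a cell from the tutorial), a Dataset can be built from a dictionary of DataArrays that share the same dimensions. The variable names `temperature` and `pressure` are invented placeholders, and the values are random.

```python
import numpy as np
import xarray as xr

# Shared coordinates, matching the DataArray example earlier
coords = {"y": [10, 20, 30, 40], "x": [10, 11, 12]}

# Two hypothetical variables defined on the same (y, x) grid
temperature = xr.DataArray(np.random.random(size=(4, 3)), dims=("y", "x"), coords=coords)
pressure = xr.DataArray(np.random.random(size=(4, 3)), dims=("y", "x"), coords=coords)

# The dictionary keys become the variable names inside the Dataset
ds = xr.Dataset({"temperature": temperature, "pressure": pressure})

print(ds)

# Each variable comes back out as a DataArray
print(ds["temperature"].mean(dim="y"))
```

Indexing the Dataset with a variable name returns the corresponding DataArray, so everything shown in the DataArray section applies to each variable.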

In [21]:
from bokeh.io import output_notebook, show
output_notebook()

In [22]:
from bokeh.plotting import figure

f = figure()
f.circle([1,2,3], [4,5,6], size=10)

show(f)