
# Xarray Workbook 1
Modified: May 28, 2019


This is my sketchpad to understand xr.DataArray and xr.Dataset constructors.

Keys:
    - `xarray`'s main motivation is to model its data structures after the `netCDF` format
    - it lifts `pandas`'s limitation to 2D (or 3D at most) labelled data
    - it keeps `pandas`'s idea of named, labelled axes
    - we can think of it as a multidimensional array (`numpy.ndarray`) with named dimensions
    - each dimension has a `name` and `tick-marks`. These tick-marks are called `coordinates`. `numpy` doesn't have this feature, so all of its indexing is integer/order based. In `xarray`, since we have a name for each dimension (ie. axis) as well as a list of coordinates (ie. tick-marks) for each dimension, we can refer to a value in the xarray DataArray container with more semantic-aware indexing
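
To make the last point concrete, here is a minimal sketch contrasting positional indexing in `numpy` with label-based indexing in `xarray`. The values, dimension names, and labels are made up for illustration.

```python
import numpy as np
import xarray as xr

values = np.array([[0, 10], [1, 11], [2, 12]])

# numpy: positions only (second row, first column)
by_position = values[1, 0]

# xarray: the same array, with a name and tick-marks for each dimension
da = xr.DataArray(
    values,
    dims=['date', 'station_id'],
    coords={'date': ['2019-01-01', '2019-01-02', '2019-01-03'],
            'station_id': ['LA', 'SF']},
)

# select by dimension name and coordinate label instead of by position
by_label = da.sel(date='2019-01-02', station_id='LA').item()
```

Both lookups address the same cell; the `xarray` version just says which one in terms of the data's own labels.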
   

|xarray overview|
|-|
|<img src="../images/xarray_overview.png" alt="xarray_overview" width="1000"/>|

## Overview of data structures in `xarray`



|`xr.Variable`| `xr.DataArray`| `xr.Dataset`|
|-|-|-|
<img src="../images/xr_variable_overview.png" alt="xr_variable" width="700"/> | <img src="../images/xr_dataarray_overview.png" alt="xr_dataarray" width="700"/> | <img src="../images/xr_dataset_overview.png" alt="xr_dataset" width="700"/> | 

### More details
- xr.DataArray, xr.Dataset: [doc](http://xarray.pydata.org/en/stable/data-structures.html)
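
The code cells in this workbook use `np`, `pd`, `xr`, `pprint`, and an `nprint` helper without defining them. A minimal setup sketch follows. The `nprint` definition here is an assumption (the original helper is not shown): it is taken to be a print helper that puts each argument on its own line.

```python
from pprint import pprint

import numpy as np
import pandas as pd
import xarray as xr


def nprint(*args):
    """Hypothetical stand-in for the workbook's helper: print each argument on its own line."""
    print(*args, sep='\n')
```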

In [None]:
arr = np.array([[0, 10], [1, 11], [2, 12]])  # 3 rows x 2 columns; avoids the discouraged np.matrix
nprint(arr.shape, arr)
# np.c_[df, ['a','b','c']] #have you tried this?

## Simplest DataArray Constructor

In [None]:
xarr = xr.DataArray(arr)
print(xarr)

# 1. more meaningful dimension names
renamed = xarr.rename(dim_0='date', dim_1='station_id')
nprint('renamed', renamed)

Note that `xarray` and `numpy` follow the same dimension assignment order: `dim_0` runs along the rows and `dim_1` along the columns.
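
A quick way to check this ordering, as a small sketch using the same 3x2 array as above:

```python
import numpy as np
import xarray as xr

arr = np.array([[0, 10], [1, 11], [2, 12]])
xarr = xr.DataArray(arr)

# dimension name -> length, in the same order as arr.shape
sizes = dict(xarr.sizes)

# picking position 0 along dim_0 gives the first row,
# picking position 0 along dim_1 gives the first column
first_row = xarr.isel(dim_0=0).values
first_col = xarr.isel(dim_1=0).values
```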

In [None]:
renamed.coords

In [None]:
# Let's assign some coordinates
coords = {
    'date': ['2019-01-01', '2019-01-02', '2019-01-03'],
    'station_id': ['LA', 'SF']
    
}

In [None]:
new_xarr = renamed.assign_coords(**coords)
print(new_xarr)

In [None]:
# I can store any relevant metadata as 'attrs'
new_xarr.attrs

In [None]:
new_xarr.attrs.update(unit='km')
new_xarr.attrs.update(description='US State Daily Temperature Flux')
new_xarr.attrs.update(collector='NASA')
new_xarr.attrs.update(last_updated='2019-05-28')
new_xarr.attrs.update(license='MIT')


pprint(new_xarr.attrs)

## Let's specify these parameters at the construction time

In [None]:
meta = new_xarr.attrs.copy()
nprint('meta data', meta)

In [None]:
xarr2 = xr.DataArray(arr, 
                     dims = ['time', 'state'],
                     # coords as a list of tuples: each tuple = (dim_name, coord_values)
                     # this gives each coordinate the same name as its dimension
                     # To set a coordinate's name explicitly, use a dictionary format
                     #   eg: coords = {coord_name1: coord_vals1, coord_name2: coord_vals2}
                     #   In this case, dimensions must be provided explicitly
                     #   See example below
                     coords=[('time', pd.date_range('2019-01-01', periods=3)),
                             ('state', ['LA', 'SF'])],
                     attrs=meta,
                     name='US state example data array'
                    )
print(xarr2)
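
The two `coords` formats mentioned in the comments above can be compared directly. This is a minimal sketch, assuming the coordinate names match the dimension names in both cases; `identical` checks values, dimensions, coordinates, name, and attributes.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = np.array([[0, 10], [1, 11], [2, 12]])
times = pd.date_range('2019-01-01', periods=3)
states = ['LA', 'SF']

# list of (dim_name, coord_values) tuples: dims are inferred from the tuples
from_tuples = xr.DataArray(arr, coords=[('time', times), ('state', states)])

# dictionary format: dims must be given explicitly
from_dict = xr.DataArray(arr, dims=['time', 'state'],
                         coords={'time': times, 'state': states})

same = from_tuples.identical(from_dict)
```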

## `Dimensions` and their `coordinates` 
DataArray constructor:
```
darr = xr.DataArray(
        data,
        dims=['dimname0', 'dimname1'],
        coords=...,
        attrs=...,
        name=...,
)
```

### Coordinates
1. A dictionary of the form `{'coordname1': coord_values1, 'coordname2': coord_values2, ...}`  
    - This requires `dims` to be provided explicitly

In [None]:
coords = {'coord1': pd.date_range('2019-05-05', periods=3),
          'coord2': ['LA','SF']
         }
darr = xr.DataArray(arr, 
             dims=['time', 'state'], # this will error: a bare list-like value in a `coords` dictionary becomes
                    # a 1-D coordinate along a dimension named after its key, and neither 'coord1' nor 'coord2'
                    # is in `dims`
             coords=coords,
            )
print(darr)

In [None]:
# Let's fix the dimension names so that it works.
darr = xr.DataArray(arr,
                    coords=coords,
                    dims=['coord1', 'coord2'])
print(darr)

The advantage of using the dictionary format for `coords` is that we can specify extra coordinates in addition to the one-per-dimension (ie. per-axis) coordinates.


In [None]:
darr = xr.DataArray(arr, 
             dims=['time', 'state'],
             coords={
                 'time': pd.date_range('2019-05-05', periods=3),
                 'state': ['LA', 'SF'],
                 'const': 17 # more on this extra (dimension-independent) coordinate later
             }
            )
print(darr)

But those extra coordinates have constraints: 
- A coordinate with a scalar value (eg. 15, 0.01, 'hihi', but not [1,2,3]) is dimension-independent. 
    - It can have a name not in `dims`
- If a coordinate's value is a bare list or array, it is treated as a 1-D coordinate along a dimension with the same name, so its name must be in `dims`.
- A coordinate with array values can still have a name that is not in `dims`, but then its value must be a tuple `(dim_names, values)` in which every dimension name is in `dims`. 
    - Eg: `'extra_coordname': ('dimname0', [1,2,3])`

Their use cases will be explained in more detail later.
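
The three cases can be sketched in one go. The `build` helper and the name `coord3` are only for illustration, and the sketch assumes the failing case raises `ValueError`.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = np.arange(6).reshape(3, 2)
base = {'time': pd.date_range('2019-05-05', periods=3), 'state': ['LA', 'SF']}


def build(extra):
    """Hypothetical helper: build the array with one extra coordinate named 'coord3'."""
    return xr.DataArray(arr, dims=['time', 'state'],
                        coords={**base, 'coord3': extra})


scalar_ok = build(17)                  # scalar value: no dimension needed
tuple_ok = build(('time', [1, 2, 3]))  # (dim_name, values): attached to 'time'

try:
    build([0.01, 1])                   # bare list: implies a dimension named 'coord3'
    bare_list_ok = True
except ValueError:
    bare_list_ok = False
```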

In [None]:
# This is okay
darr = xr.DataArray(arr, 
             dims=['time', 'state'],
             coords={
                 'time': pd.date_range('2019-05-05', periods=3),
                 'state': ['LA', 'SF'],
                 'coord3': 'hihi' # try any other scalar value: 0.01, 'a', 17
             }
            )
print(darr)

In [None]:
# This is not okay
darr = xr.DataArray(arr, 
             dims=['time', 'state'],
             coords={
                 'time': pd.date_range('2019-05-05', periods=3),
                 'state': ['LA', 'SF'],
                 'coord3': [0.01,1] # doesn't work: a bare list becomes a 1-D coordinate along a dimension named 'coord3', which is not in `dims`
             }
            )
print(darr)

In [None]:
# This is okay
darr = xr.DataArray(arr, 
             dims=['time', 'state'],
             coords={
                 'time': pd.date_range('2019-05-05', periods=3),
                 'state': ['LA', 'SF'],
                 'coord3': ('time', [1,2,3]) # this works because 'time' is one of the dimensions
                 #but fails if the length of the iterable doesn't match 'time's length, eg. [1,2,3,4]. Try it.
             }
            )
print(darr)

In [None]:
# This is okay
darr = xr.DataArray(arr, 
             dims=['time', 'state'],
             coords={
                 'time': pd.date_range('2019-05-05', periods=3),
                 'state': ['LA', 'SF'],
                 'coord3': ( ('time', 'state'), np.random.randn(6).reshape(3,2)) 
                 # okay because 'time' and 'state' are dimension names
             }
            )
print(darr)

### `xr.DataArray` constructor from `pd.DataFrame`
Precedence when setting DataArray properties at construction time:
    - arguments passed to the `xr.DataArray` constructor come first
    - unspecified arguments are filled in from the `pandas` object

In [None]:
df = pd.DataFrame(arr, columns=['LA', 'SF'], index=pd.date_range('2020-01-01', periods=3))
df

In [None]:
darr = xr.DataArray(df)

In [None]:
darr

Notice that `df`'s index becomes the coordinate of the first dimension (named `dim_0` by default), and `df`'s column names become the coordinate of the second dimension (`dim_1`).

Let's try providing dimension names.

In [None]:
darr = xr.DataArray(df, dims=['time', 'state'])
print(darr)

What if the input `pd.DataFrame` instance has default index and column names?


In [None]:
df = pd.DataFrame(arr)
print(df)

In [None]:
darr = xr.DataArray(df)
print(darr)

The same rule applies. That is, the input `df`'s index and column names fill in the unspecified fields of the new xr.DataArray object.

Let's see if specifying the coordinates explicitly takes precedence over the input `df`'s index and columns.

In [None]:
darr = xr.DataArray(df,
                    coords=[('time', pd.date_range('2021-01-01',periods=3)),
                            ('state', ['LA', 'SF'])],
                    #dims=['time', 'state'] # optional, as it's redundant
                   )
print(darr)

                            

Notice that the coordinates are set from the direct input arguments to `xr.DataArray` constructor.

### `xr.DataArray.rename` method
- returns a **new** xr.DataArray that shares the same (**NOT** copied) data, with only the properties modified

In [None]:
print(darr)

In [None]:
new_darr = darr.rename(state='us_state')
print(new_darr)

In [None]:
print(darr is new_darr)

In [None]:
# check whether the underlying data is shared (rather than copied)
print(darr.values is new_darr.values) # same as print(id(darr.values) == id(new_darr.values))

Surprise?! Is this really true?

In [None]:
print(id(darr.values) == id(new_darr.values))

In [None]:
# Let's see if changes in one array are reflected in the `renamed` DataArray's underlying data

In [None]:
print(darr.values)

In [None]:
darr.values[0,0] = -100
nprint(darr.values)
nprint(new_darr.values)

Okay. This is worth remembering. 

> xr.DataArray.rename() will return a **new** instance with properties changed (eg. dimension names, coordinate names),
but the new instance will hold the **same** handle to the original DataArray's `values` (ie. the underlying data)!!

This means that changing the underlying data in one instance is directly reflected in the other instance. This is nice in that the data is not copied, but if we truly want an independent instance, we need to deep-copy the underlying data, eg. with `DataArray.copy(deep=True)`.
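
A minimal sketch of the difference, using a toy array of zeros and `np.shares_memory` to check whether two arrays are backed by the same buffer:

```python
import numpy as np
import xarray as xr

darr = xr.DataArray(np.zeros((3, 2)), dims=['time', 'state'])

# rename alone: new object, same underlying buffer
renamed = darr.rename(state='us_state')

# deep copy first, then rename: new object with its own buffer
independent = darr.copy(deep=True).rename(state='us_state')

shares_after_rename = np.shares_memory(darr.values, renamed.values)
shares_after_copy = np.shares_memory(darr.values, independent.values)

# mutate the original and see which of the two follows along
darr.values[0, 0] = -100
```

After the last line, `renamed` sees the `-100` while `independent` still holds `0`.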

## xarray's Dataset class
Keys:
- a variable is either a data variable (in `data_vars`) or a coordinate variable (in `coords`)

    For example, in the diagram below, `temperature` and `precipitation` are data variables and all other arrays are coordinate variables
    <img src="../assets/xr_dataset_structure.png" alt="xr_dataset" width="500"/>


- multi-dimensional equivalent of a `pd.DataFrame` with labelled axes
- a dict-like container of labelled arrays (ie. `xr.DataArray` objects) with *aligned* dimensions
- designed as an in-memory representation of the data model from the netCDF file format

4 main properties of a Dataset object
- dims: a dictionary mapping from dimension names to the fixed length of each dimension, eg: `{'dim0': 4, 'dim1': 3}`
- data_vars: a dict-like container of DataArrays corresponding to variables
- coords: a dict-like container of DataArrays intended to label points used in `data_vars`
- attrs: OrderedDict to hold arbitrary metadata

- (xarray) `data_vars` : `coords` = `vdims` : `kdims` (holoviews)
- How to decide whether a variable belongs to `data_vars` or `coords`:
    - coordinates indicate constant/fixed/independent quantities 
    - varying/measured/dependent quantities belong to data
- recall that every coordinate's dimensions must be a subset of `dims`; a dimension itself does not need a coordinate
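
These properties can be inspected on a toy Dataset. This is only a sketch: the variable name, sizes, and attribute are made up, and `sizes` is used here to read the dimension lengths.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    data_vars={'temperature': (['x', 'time'], np.zeros((2, 3)))},
    coords={'time': pd.date_range('2019-05-28', periods=3)},
    attrs={'source': 'toy example'},
)

dim_sizes = dict(ds.sizes)       # dimension name -> length
data_names = list(ds.data_vars)  # names of the data variables
coord_names = list(ds.coords)    # names of the coordinate variables
metadata = dict(ds.attrs)
```

Note that `x` is a dimension without any coordinate attached, which is allowed.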

<img src="../assets/create_dataset.png" alt="create_dataset" width="650"/>

In [None]:
temp = 70+10*np.random.randn(2,3,4)
precip = 5+2*np.random.randn(2,3,4)
lon = [[-99.81, -99.44, -99.23], 
       [-99.79, -99.34, -99.12]]
lat = [[42.24, 42.21, 42.19],
       [42.63, 42.59, 42.44]]

```python
# x_tick*, y_tick*, t_tick*, temp_metadata and temp_name are placeholders
data_vars = {
    'temperature': xr.DataArray(data=temp,
                                coords = [('x', [x_tick1, x_tick2]),
                                          ('y', [y_tick1, y_tick2, y_tick3]),
                                          ('time', [t_tick1, t_tick2, t_tick3, t_tick4])],
                                dims = ['x','y','time'], # optional as it's redundant
                                attrs = temp_metadata,
                                name = temp_name
                               ),
    'precipitation': xr.DataArray(data = precip,
                                  coords = [('x', [x_tick1, x_tick2]),
                                            ('y', [y_tick1, y_tick2, y_tick3]),
                                            ('time', [t_tick1, t_tick2, t_tick3, t_tick4])],
                                  dims = ['x','y','time'], # optional as it's redundant
                                 ),
}
```

Note that we don't really know what tick-mark values to use for the `x` and `y` coordinates. They are very general coordinates that index a generic two-dimensional space, with no semantics attached. So we use the second syntax, which only requires a list of dimension names ('x', 'y', 'time') and the underlying data for each data variable:

```python
{"varname": (dims, underlying_data),
 "varname2": (dims, underlying_data)}
```


In [None]:
data_vars = {
    'temperature': (['x','y','time'], temp),
    'precipitation': (['x','y','time'], precip),
}                             
coord_vars = {
    'lon': (['x','y'], lon),
    'lat': (['x','y'], lat),
    
    # this is the last case (for the general coordinate variables, ie. x, y, time in our case)
#     'x': [val1, val2] <-- what to put in..? whatever is, not very meaningful
#     'y': [ycoord1, ycoord2] <-- what to put in...? so we don't explicitly express these two
    'time': pd.date_range('2019-05-28', periods=4),
    'reference_time': pd.Timestamp('2019-05-27')
}
    

In [None]:
ds = xr.Dataset(
    data_vars=data_vars,
                
    # coords should've been named coord_vars, in my opinion
    coords = coord_vars
)

In [None]:
ds

Pretty interesting. This is well-connected to the ideas behind holoviews `kdims` and `vdims`.
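
One way to see the connection is to pull a single data variable back out of the Dataset. The sketch below rebuilds a Dataset like the one above, with constant placeholder values instead of random ones: the extracted `xr.DataArray` carries along every coordinate whose dimensions it shares.

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    data_vars={
        'temperature': (['x', 'y', 'time'], np.zeros((2, 3, 4))),
        'precipitation': (['x', 'y', 'time'], np.ones((2, 3, 4))),
    },
    coords={
        'lon': (['x', 'y'], [[-99.81, -99.44, -99.23], [-99.79, -99.34, -99.12]]),
        'lat': (['x', 'y'], [[42.24, 42.21, 42.19], [42.63, 42.59, 42.44]]),
        'time': pd.date_range('2019-05-28', periods=4),
        'reference_time': pd.Timestamp('2019-05-27'),
    },
)

# a data variable comes out as a DataArray with its coordinates attached
temperature = ds['temperature']
carried_coords = set(temperature.coords)

# label-based selection along 'time' leaves an (x, y) slice
one_day = temperature.sel(time=pd.Timestamp('2019-05-29'))
```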

# Resources
- Pangeo architecture: [slides](https://is.gd/t9Rtqn)
    - Bring computation to the data (big data)
    - Uses `xarray`  which is supported by `Dask` in the backend
- Great tutorial on how to use OPeNDAP server with GES DISC (NASA's open data portal)
    - [link](https://is.gd/V4RJMS)
- xarray: read opendap data
    - [doc](http://xarray.pydata.org/en/stable/io.html#opendap)
    - use xarray.open_dataset() for password-protected Opendap files: [link](https://github.com/pydata/xarray/issues/1068)
    <img src="xarray-opendap2.png" alt="xarray-opendap" width="500"/>
- xarray general tutorials
    - [IIASA](http://pure.iiasa.ac.at/id/eprint/14952/1/xarray-tutorial-egu2017-answers.pdf)
    
