# Xarray Introduction

- Unlabeled, N-dimensional arrays of numbers (e.g., NumPy’s ndarray) are the most widely used data structure in scientific computing. However, they lack a meaningful representation of the metadata associated with their data, so implementing such functionality is left to individual users and domain-specific packages. xarray extends NumPy arrays with labeled dimensions, coordinates, and attributes, which streamlines data manipulation. 

- xarray's interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's Common Data Model (CDM). 

- xarray is a useful tool for parallelizing and working with large datasets in the geosciences.

## Data Structures

- xarray has 2 fundamental data structures:
    - `DataArray`, which holds a single multi-dimensional variable and its coordinates
    - `Dataset`, which holds multiple variables that potentially share the same coordinates
   
![](../assets/images/xarray-data-structures.png)


    
### `DataArray`

The DataArray is xarray's implementation of a labeled, multi-dimensional array. It has several key properties:

| Attribute 	| Description                                                                                                                              	|
|-----------	|------------------------------------------------------------------------------------------------------------------------------------------	|
| `data`    	| `numpy.ndarray` or `dask.array` holding the array's values.                                                                              	|
| `dims`    	| dimension names for each axis, for example (`x`, `y`, `z`) or (`lat`, `lon`, `time`).                                                    	|
| `coords`  	| a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings) 	|
| `attrs`   	| a dictionary holding arbitrary attributes/metadata (such as units); an `OrderedDict` in older xarray releases                            	|
| `name`    	| an arbitrary name of the array                                                                                                           	|

In [None]:
# Import packages
import numpy as np
import xarray as xr

In [None]:
# Create some sample data
data = 2 + 6 * np.random.exponential(size=(5, 3, 4))
data

To create a basic `DataArray`, you can pass this NumPy array of random data to `xr.DataArray`:

In [None]:
prec = xr.DataArray(data)
prec

<div class="alert alert-info">

**Note:** 
    
Xarray automatically generates some basic dimension names for us (`dim_0`, `dim_1`, `dim_2`).

</div>

You can also pass in your own dimension names and coordinate values:

In [None]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2019-04-01', periods=5)
times

In [None]:
# Use numpy to create array of longitude and latitude values
lons = np.linspace(-150, -60, 4)
lats = np.linspace(10, 80, 3)
lons, lats

In [None]:
coords = {'time': times, 'lat': lats, 'lon': lons}
dims = ['time', 'lat', 'lon']

In [None]:
# Add name, coords, dims to our data
prec = xr.DataArray(data, dims=dims, coords=coords, name='prec')
prec

This is already an improvement over the original NumPy array, because each dimension (or axis, in NumPy terms) now has a name and coordinate labels. 




We can also add attributes to an existing `DataArray`:

In [None]:
prec.attrs['units'] = 'mm'
prec.attrs['standard_name'] = 'precipitation'
prec

In [None]:
prec.data

### `Dataset`

- Xarray's `Dataset` is a dict-like container of labeled arrays (`DataArray` objects) with aligned dimensions.
- It is designed as an in-memory representation of a netCDF dataset.
- In addition to the dict-like interface of the dataset itself, which can be used to access any `DataArray` in a `Dataset`, datasets have the following key properties:


| Attribute   	| Description                                                                                                                              	|
|-------------	|------------------------------------------------------------------------------------------------------------------------------------------	|
| `data_vars` 	| dict-like container of `DataArray` objects corresponding to data variables.                                                              	|
| `dims`      	| dictionary mapping from dimension names to the fixed length of each dimension  (e.g., {`lat`: 6, `lon`: 6, `time`: 8}).                  	|
| `coords`    	| a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings) 	|
| `attrs`     	| dictionary holding arbitrary metadata pertaining to the dataset.                                                                         	|

- DataArray objects inside a Dataset may have any number of dimensions but are presumed to share a common coordinate system. 
- Coordinates can also have any number of dimensions but denote constant/independent quantities, unlike the varying/dependent quantities that belong in data.

To create a `Dataset` from scratch, we need to supply dictionaries for any variables (`data_vars`), coordinates (`coords`) and attributes (`attrs`):

In [None]:
dset = xr.Dataset({'precipitation' : prec})
dset
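
The cell above passes only a single dictionary of variables. As a minimal sketch of supplying all three dictionaries at once, the following builds a small dataset from scratch. The variable names, shapes, and the `title` attribute are made up for illustration and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical sample values; shapes follow (time, lat, lon) = (2, 3, 4)
times = pd.date_range("2019-04-01", periods=2)
lats = np.linspace(10, 80, 3)
lons = np.linspace(-150, -60, 4)
rng = np.random.default_rng(0)

example = xr.Dataset(
    data_vars={
        # each variable is given as (dimension names, values)
        "precipitation": (("time", "lat", "lon"), rng.exponential(size=(2, 3, 4))),
        "temperature": (("time", "lat", "lon"), 283 + 5 * rng.standard_normal((2, 3, 4))),
    },
    coords={"time": times, "lat": lats, "lon": lons},
    attrs={"title": "toy dataset"},
)

print(list(example.data_vars))  # names of the data variables
print(dict(example.sizes))      # dimension name -> length
print(example.attrs)            # dataset-level metadata
```

Giving each variable as a `(dims, values)` tuple avoids building intermediate `DataArray` objects when the coordinates are shared.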

Let's add a fake `temperature` data array to this existing dataset:

In [None]:
temp_data = 283 + 5 * np.random.randn(5, 3, 4)
temp = xr.DataArray(data=temp_data, dims=['time', 'lat', 'lon'], 
                    coords={'time': times, 'lat': lats, 'lon': lons},
                    name='temp',
                    attrs={'standard_name': 'air_temperature', 'units': 'kelvin'})
temp

In [None]:
# Now add this data array to our existing dataset
dset['temperature'] = temp
dset.attrs['history'] = 'Created for the xarray tutorial'
dset.attrs['author'] = 'foo and bar'
dset

<div class="alert alert-info">

**Going Further:** 
    
Xarray Documentation on Data Structures: http://xarray.pydata.org/en/latest/data-structures.html

</div>

## Core xarray Features

### Reading and Writing Files

Xarray supports direct serialization and I/O to several file formats including pickle, netCDF, OPeNDAP (read-only), GRIB1/2 (read-only), and HDF by integrating with third-party libraries. Additional serialization formats for 1-dimensional data are available through pandas.

File types
- Pickle
- NetCDF 3/4
- RasterIO
- Zarr
- PyNio

Interoperability
- Pandas
- Iris
- CDMS
- dask DataFrame


#### Opening xarray datasets

Xarray's `open_dataset` and `open_mfdataset` are the primary functions for opening local or remote datasets such as netCDF, GRIB, OPeNDAP, and HDF. These operations are all supported by third-party libraries (engines) for which xarray provides a common interface. 

In [None]:
!ncdump -h ./data/rasm.nc

In [None]:
ds = xr.open_dataset('./data/rasm.nc', engine='netcdf4')
ds

#### Saving xarray datasets as netcdf files

Xarray provides a high-level method for writing netCDF files directly from Xarray Datasets/DataArrays.

In [None]:
ds.to_netcdf('./data/rasm_test.nc')

#### Multifile datasets

Xarray can read/write multifile datasets using the `open_mfdataset` and `save_mfdataset` functions. 

In [None]:
years, datasets = zip(*ds.groupby('time.year'))
paths = ['./data/%s.nc' % y for y in years]
print(paths)

In [None]:
len(datasets)

In [None]:
# write the 4 netcdf files
xr.save_mfdataset(datasets, paths)

- Open a group of files and concatenate them into a single xarray.Dataset

In [None]:
ds2 = xr.open_mfdataset('./data/19*nc')
ds2
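
Conceptually, the round trip above splits a dataset by year and then joins the pieces back together along `time`. The sketch below reproduces that split and combine step purely in memory, so it needs no files or netCDF backend. The dataset `toy` and its variable `Tair` are hypothetical stand-ins, and `xr.concat` is used here only to illustrate the combining step, not as a description of how `open_mfdataset` is implemented.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the dataset: 3 years of monthly values
time = pd.date_range("1980-01-01", periods=36, freq="MS")
toy = xr.Dataset({"Tair": ("time", np.arange(36.0))}, coords={"time": time})

# Split into one dataset per year, as done before save_mfdataset
years, pieces = zip(*toy.groupby("time.year"))

# Concatenate the pieces along time, which is the combining step
recombined = xr.concat(list(pieces), dim="time")

print(years)
print(recombined.sizes["time"])
print(np.array_equal(recombined["Tair"].values, toy["Tair"].values))
```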

### Zarr

Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities.

In [None]:
# Zarr
ds.to_zarr('./data/rasm.zarr', mode='w')

In [None]:
!ls data/*zarr
!du -h data/*zarr

In [None]:
import zarr

In [None]:
compressor = zarr.Blosc(clevel=2, shuffle=-1)
ds.to_zarr('./data/rasm_compressed.zarr', mode='w', encoding={var: {'compressor': compressor} 
                                                              for var in ds.variables})

In [None]:
!ls data/*zarr
!du -h data/*zarr

<div class="alert alert-info">

**Going Further:** 
    
Xarray I/O Documentation: http://xarray.pydata.org/en/latest/io.html

</div>

### Label-based indexing

Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. 

In [None]:
ds = xr.open_dataset('./data/air_temperature.nc')
ds

#### NumPy Positional Indexing

When working with numpy, indexing is done by position (slices/ranges/scalars).

In [None]:
t = ds['air'].data # numpy array 
t

In [None]:
t.shape

In [None]:
# extract a time-series for one spatial location
t[:, 20, 40]

**But wait, what labels go with 20 and 40? Was that lat/lon or lon/lat? Where are the timestamps that go along with this time-series?**

#### Indexing with xarray

xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection.

In [None]:
da = ds['air'] # Extract data array
da

- **NumPy style indexing still works (but preserves the labels/metadata)**

In [None]:
da[:, 20, 40]

- **Positional indexing using dimension names**

In [None]:
da.isel(lat=20, lon=40)

- **Label-based indexing**

In [None]:
da.sel(lat=50., lon=200.)

- **Nearest Neighbor Lookups**

In [None]:
da.sel(lat=52.25, lon=251.8998, method='nearest')

- **All of these indexing methods work on the dataset too:**

In [None]:
ds.sel(lat=52.25, lon=251.8998, method='nearest')
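
To make the difference between these selection styles easy to check by hand, here is a minimal sketch on a tiny synthetic grid. The array `grid` and its coordinate values are invented for illustration; each value encodes its own position as `10 * row + column`.

```python
import numpy as np
import xarray as xr

# Hypothetical 3 x 4 grid; values encode their own position (10*row + column)
grid = xr.DataArray(
    np.array([[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]),
    dims=("lat", "lon"),
    coords={"lat": [40.0, 50.0, 60.0], "lon": [200.0, 210.0, 220.0, 230.0]},
)

by_position = grid.isel(lat=1, lon=2)                      # integer positions
by_label = grid.sel(lat=50.0, lon=220.0)                   # coordinate labels
nearest = grid.sel(lat=52.3, lon=218.0, method="nearest")  # closest labels
band = grid.sel(lat=slice(40.0, 50.0))                     # label slices include both ends

print(int(by_position), int(by_label), int(nearest))
print(band.lat.values)
```

All three scalar selections pick the same cell (row 1, column 2), which shows that `isel` and `sel` differ only in how the location is specified.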

#### Vectorized Indexing

Like numpy and pandas, xarray supports indexing many array elements at once in a vectorized manner:


In [None]:
# generate coordinates for a transect of points
lat_points = xr.DataArray([52, 52.5, 53], dims='points')
lon_points = xr.DataArray([250, 250, 250], dims='points')
lat_points

In [None]:
# nearest neighbor selection along the transect
da.sel(lat=lat_points, lon=lon_points, method='nearest')
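
The key detail is that the indexers share a new dimension (`points`), so they are paired element by element rather than combined as an outer product. A minimal sketch on a hypothetical 3 x 4 grid contrasts the two behaviors:

```python
import numpy as np
import xarray as xr

# Hypothetical grid with values 0..11 laid out row by row
grid = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [40.0, 50.0, 60.0], "lon": [200.0, 210.0, 220.0, 230.0]},
)

# Indexers that share the new dimension "points" are paired element-wise
lat_idx = xr.DataArray([40.0, 50.0, 60.0], dims="points")
lon_idx = xr.DataArray([200.0, 210.0, 220.0], dims="points")
diagonal = grid.sel(lat=lat_idx, lon=lon_idx)

# Plain lists select the outer product instead (orthogonal indexing)
block = grid.sel(lat=[40.0, 50.0, 60.0], lon=[200.0, 210.0, 220.0])

print(diagonal.dims, diagonal.values)
print(block.dims, block.shape)
```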

### Aggregation

Xarray supports many of the aggregation methods that numpy has. A partial list includes: all, any, argmax, argmin, max, mean, median, min, prod, sum, std, var.

Whereas the numpy syntax requires integer axis numbers, xarray can use dimension names:

In [None]:
ds = xr.open_dataset("./data/air_temperature.nc")

In [None]:
da = ds['air']
da

In [None]:
da.mean(dim=['lat'])
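
As a side-by-side sketch of the two syntaxes, the following reduces the same hypothetical `(time, lat, lon)` array once with a NumPy axis number and once with an xarray dimension name. The array contents are arbitrary.

```python
import numpy as np
import xarray as xr

# Hypothetical array with dims (time, lat, lon) = (2, 3, 4)
values = np.arange(24.0).reshape(2, 3, 4)
arr = xr.DataArray(values, dims=("time", "lat", "lon"))

np_mean = values.mean(axis=1)                    # must remember that axis 1 is lat
xr_mean = arr.mean(dim="lat")                    # same reduction, by name
per_lat_mean = arr.mean(dim=["time", "lon"])     # several dimensions at once

print(xr_mean.dims)
print(np.allclose(np_mean, xr_mean.values))
print(per_lat_mean.values)
```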

### Arithmetic

Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:


In [None]:
da - 273.15

In [None]:
da_mean = da.mean(dim='time')
da_mean

In [None]:
da - da_mean

<div class="alert alert-info">

**Note:** 
    
Notice that this required broadcasting along the time dimension. NumPy broadcasting is covered in great detail in [NumPy Guide](../numpy/01-numpy-guide.ipynb).

</div>
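
Unlike NumPy, xarray matches dimensions by name rather than by position when broadcasting. A minimal sketch with hypothetical arrays shows a per-`lat` mean being broadcast along `time`, and a per-`time` weight vector being applied without any reshaping:

```python
import numpy as np
import xarray as xr

# Hypothetical (time, lat) array
field = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("time", "lat"))

time_mean = field.mean(dim="time")   # dims: ("lat",)
anomaly = field - time_mean          # time_mean is broadcast along time

# Dimensions are matched by name, so no reshape is needed here;
# the equivalent NumPy expression (2, 3) * (2,) would raise an error
weights = xr.DataArray([1.0, 2.0], dims="time")
weighted = field * weights

print(anomaly.values)
print(weighted.dims, weighted.values)
```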


### Alignment

xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by `*`) on objects used in binary operations.

In [None]:
da

In [None]:
arr = da.isel(time=0, lat=slice(5, 10), lon=slice(7, 11))
arr

In [None]:
part = arr[:-1]
part

- **Default behavior is an `inner join`**

In [None]:
(arr + part) / 2

- **We can also use an `outer join`**

In [None]:
with xr.set_options(arithmetic_join="outer"):
    print((arr + part) / 2)

<div class="alert alert-info">

**Note:** 
    
Notice that missing values (`nan`) were inserted where appropriate. 

</div>
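
The same two join modes can be checked by hand on a pair of small 1-D arrays whose `x` labels only partly overlap. The arrays `a` and `b` below are hypothetical.

```python
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# Default inner join: only labels present in both operands (x = 1, 2)
inner = a + b

# Outer join: union of labels, NaN where either operand is missing
with xr.set_options(arithmetic_join="outer"):
    outer = a + b

print(inner.x.values, inner.values)
print(outer.x.values, outer.values)
```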

### GroupBy Operations

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.

Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although grouping over a multi-dimensional variable is also supported:

- **Using groupby to calculate a monthly climatology:**

In [None]:
da_climatology = da.groupby('time.month').mean('time')

da_climatology

In this case, we group by what we refer to as a virtual variable (`time.month`). Available virtual variables include `year`, `month`, `day`, `hour`, `minute`, `second`, `dayofyear`, `week`, `dayofweek`, `weekday` and `quarter`. It is also possible to use another DataArray or pandas object as the grouper.
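
A minimal sketch of the split-apply-combine steps on a hypothetical two-year monthly series, small enough to verify by hand. It also shows that subtracting a grouped result from a `groupby` object broadcasts the climatology back onto every time step, which gives anomalies.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical two years of monthly values: 0..11 for year 1, 12..23 for year 2
time = pd.date_range("2000-01-01", periods=24, freq="MS")
series = xr.DataArray(np.arange(24.0), dims="time", coords={"time": time})

# Split by calendar month, apply the mean, combine into a 12-entry climatology
clim = series.groupby("time.month").mean("time")

# Group-wise arithmetic: each time step minus the climatology of its month
anom = series.groupby("time.month") - clim

print(clim.dims, clim.values)
print(anom.values)
```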

### Resampling Operations

For time-series data, xarray provides a `resample` convenience method that handles frequency conversion (both downsampling and upsampling). 

In [None]:
da

- **Downsample our 6-hourly time-series data to seasonal data:**

In [None]:
da.resample(time="QS-DEC").mean(dim='time')

- **Upsample our 6-hourly time-series data to 1-hourly data:**

In [None]:
da.resample(time='1H').interpolate('linear')
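
The frequency strings are pandas offset aliases; `QS-DEC` means quarters starting in December, March, June, and September, which line up with the DJF, MAM, JJA, and SON seasons. The following downsampling sketch uses a hypothetical daily series covering January and February 2019 so the results are easy to verify. Upsampling is left out here because the interpolation step relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series covering January and February 2019 (59 days)
time = pd.date_range("2019-01-01", periods=59, freq="D")
daily = xr.DataArray(np.arange(59.0), dims="time", coords={"time": time})

# Downsample: one mean value per calendar month, labeled by the month start
monthly = daily.resample(time="MS").mean()

# Downsample: one mean value per season; both months fall into the
# December-January-February quarter, labeled by its start date
seasonal = daily.resample(time="QS-DEC").mean()

print(monthly.time.values, monthly.values)
print(seasonal.time.values, seasonal.values)
```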

### Rolling Window Operations

Xarray objects include a rolling method to support rolling window aggregations:

In [None]:
roller = da.rolling(time=3)
roller

In [None]:
roller.mean()

- **We can also provide a custom function**

In [None]:
def sum_minus_2(da, axis):
    return da.sum(axis=axis) - 2

roller.reduce(sum_minus_2)
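
To see how window labeling and edge handling behave, here is a minimal sketch on a hypothetical five-element series. The helper `window_range` is made up for this example; like `sum_minus_2` above, it receives the windowed values and the axis to reduce over.

```python
import numpy as np
import xarray as xr

series = xr.DataArray(np.arange(5.0), dims="time")

# Window of 3, labeled at the window's last element; incomplete windows give NaN
trailing = series.rolling(time=3).mean()

# center=True labels the window at its middle element;
# min_periods=1 allows partial windows at the edges
centered = series.rolling(time=3, center=True, min_periods=1).mean()

# Hypothetical custom reduction: max minus min within each window
def window_range(values, axis):
    return values.max(axis=axis) - values.min(axis=axis)

spread = series.rolling(time=3).reduce(window_range)

print(trailing.values)
print(centered.values)
print(spread.values)
```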

### Masking

Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use `where()`:

In [None]:
da.where(da < 273)

In [None]:
xr.where(da < 273, 0, 1)
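
The two cells above differ in what they return: the `where()` method keeps the original values where the condition holds and inserts `nan` elsewhere, while the function `xr.where` chooses between two given values. A minimal sketch on a hypothetical four-element array, including the `other` and `drop` options of the method:

```python
import numpy as np
import xarray as xr

# Hypothetical temperatures in kelvin
temps = xr.DataArray([270.0, 272.5, 275.0, 280.0], dims="x")

masked = temps.where(temps < 273)               # keep where True, NaN elsewhere
filled = temps.where(temps < 273, other=-1.0)   # choose the replacement value
flags = xr.where(temps < 273, 0, 1)             # pick between two values
subset = temps.where(temps < 273, drop=True)    # drop the masked-out positions

print(masked.values)
print(filled.values)
print(flags.values)
print(subset.values)
```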

### Plotting

Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.

xarray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two.

#### Matplotlib Integration

Xarray has built-in plotting via `matplotlib` for `DataArray` objects:

In [None]:
da

##### Plotting >2d Data

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
da.plot()

For data with more than two dimensions, `plot()` draws a histogram of the array's values by default. 

##### Plotting 1D Data

In [None]:
da_point_resample = da.isel(lat=20, lon=40).resample(time='1D')
t_max = da_point_resample.max('time')
t_min = da_point_resample.min('time')
t_max

In [None]:
t_min

In [None]:
t_max.plot(label='t_max')
t_min.plot(label='t_min')
plt.legend()

##### Plotting 2D Data

For 2-dimensional data, the xarray `plot()` method automatically draws a `pcolormesh` (QuadMesh) plot, with axes labeled from the metadata:

In [None]:
t_mean = da.mean('time')
t_mean

In [None]:
t_mean.plot()

##### FacetGrid Plots

- **Calculate some seasonal anomalies and plot them:**

In [None]:
# Seasonal means: quarters starting in December (DJF, MAM, JJA, SON)
da_month = da.resample(time='QS-DEC').mean('time')

climatology = da_month.groupby('time.season').mean('time')
anomalies = da_month.groupby('time.season') - climatology
anomalies

In [None]:
anomalies.plot(col='time', col_wrap=4)

<div class="alert alert-info">

**Going Further:** 
    
- [Advanced Plotting Notebook](02-xarray-advanced-plotting.ipynb)
- Xarray's Documentation on Plotting: http://xarray.pydata.org/en/latest/plotting.htm
</div>

## Reference

- [Pangeo Tutorial for 2018 UCAR SEA Conference](https://github.com/pangeo-data/pangeo-tutorial-sea-2018)

In [None]:
%load_ext watermark
%watermark --iversion -g -m -v -u -d