# What is `xarray`?

From [xarray.pydata.org](http://xarray.pydata.org/):

![xarray](http://xarray.pydata.org/en/stable/_images/dataset-diagram-logo.png)

N-D labeled arrays and datasets in Python
=========================================

**xarray** (formerly **xray**) is an open source project and Python package
that aims to bring the labeled data power of `pandas` to the physical sciences,
by providing N-dimensional variants of the core pandas data structures.

Our goal is to provide a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays, rather than the tabular data for which
pandas excels. Our approach adopts the `Common Data Model` for self-describing
scientific data in widespread use in the Earth sciences:
`xarray.Dataset` is an in-memory representation of a netCDF file.

# Why is `xarray` cool?

- `pandas`, but for multiple dimensions!
    - Formerly `pd.Panel`, `pd.Panel4D`, etc
- Automatic broadcasting and alignment
- Simple vectorization
    - Takes advantage of balanced data
- Built-in support for `dask`, `netCDF`, etc

# How do I use `xarray`?

## Installation
1. Creating a new `conda env`
```
source /ihme/code/central_comp/miniconda/bin/activate gbd_env
conda create --name xarray-demo --clone gbd_env
source activate xarray-demo
```
2. Installing `xarray`
```
conda install xarray
```
3. Optional add-ons
```
conda install dask
conda install netcdf4
```

## The Basics
### Setup

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import db_queries as db # http://dev-tomflem.ihme.washington.edu/docs/db_queries/current/setup.html

### Some example data

In [2]:
deaths = db.get_outputs('cause', location_id='lvl1',
                        measure_id=1, metric_id=1,
                        age_group_id='most_detailed', sex_id=[1,2], 
                        cause_id='lvl2', year_id='full')
deaths.head()

| | age_group_id | cause_id | location_id | measure_id | metric_id | sex_id | year_id | age_group_name | cause_name | expected | location_name | measure_name | metric_name | sex | val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 296 | 4 | 1 | 1 | 1 | 1980 | Early Neonatal | HIV/AIDS and tuberculosis | NaN | Southeast Asia, East Asia, and Oceania | Deaths | Number | Male | NaN | NaN | NaN |
| 1 | 3 | 296 | 4 | 1 | 1 | 1 | 1980 | Late Neonatal | HIV/AIDS and tuberculosis | NaN | Southeast Asia, East Asia, and Oceania | Deaths | Number | Male | NaN | NaN | NaN |
| 2 | 4 | 296 | 4 | 1 | 1 | 1 | 1980 | Post Neonatal | HIV/AIDS and tuberculosis | False | Southeast Asia, East Asia, and Oceania | Deaths | Number | Male | 3885.037863 | 4792.002302 | 2318.900177 |
| 3 | 5 | 296 | 4 | 1 | 1 | 1 | 1980 | 1 to 4 | HIV/AIDS and tuberculosis | False | Southeast Asia, East Asia, and Oceania | Deaths | Number | Male | 6274.109029 | 8060.912791 | 3133.89027 |
| 4 | 6 | 296 | 4 | 1 | 1 | 1 | 1980 | 5 to 9 | HIV/AIDS and tuberculosis | False | Southeast Asia, East Asia, and Oceania | Deaths | Number | Male | 4083.168767 | 4929.136659 | 3301.503199 |


### Set up a `MultiIndex`

In [3]:
df = deaths.set_index(['location_id','sex_id','age_group_id','year_id','cause_id'])[['val','lower','upper']].fillna(0)
df.head()

| location_id | sex_id | age_group_id | year_id | cause_id | val | lower | upper |
|---|---|---|---|---|---|---|---|
| 4 | 1 | 2 | 1980 | 296 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 3 | 1980 | 296 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 4 | 1980 | 296 | 3885.037863 | 2318.900177 | 4792.002302 |
| 4 | 1 | 5 | 1980 | 296 | 6274.109029 | 3133.89027 | 8060.912791 |
| 4 | 1 | 6 | 1980 | 296 | 4083.168767 | 3301.503199 | 4929.136659 |


### Convert a `pd.Series` to an `xr.DataArray`

In [4]:
death_da = df['val'].to_xarray()
death_da

<xarray.DataArray u'val' (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37, cause_id: 22)>
array([[[[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 399.697783, ...,    0.      ],
          ..., 
          [ 514.333989, ...,    0.      ]]],


        [[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 593.490692, ...,    0.      ],
          ..., 
          [ 718.414844, ...,    0.      ]]]],



       ..., 
       [[[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 141.276598, ...,    0.      ],
          ..., 
          [ 458.554881, ...,    0.      ]]],


        [[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 117.426784, ...,    0.      ],
          ..., 
          [ 377.657629, ...,    0
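
The real output above is too large to read at a glance, so here is a minimal sketch of the same conversion on a tiny, made-up table (the `toy` names and numbers are hypothetical and not part of the GBD data). It is meant to show that each level of the `MultiIndex` becomes one dimension of the resulting `xr.DataArray`. pandas hands the conversion to `xarray`, so `xarray` has to be installed for `to_xarray()` to work.

```python
import pandas as pd
import xarray as xr  # pandas' to_xarray() relies on xarray being installed

# Hypothetical toy data: 2 locations x 3 years, one row per combination.
toy = pd.DataFrame({
    'location_id': [1, 1, 1, 2, 2, 2],
    'year_id': [2000, 2001, 2002, 2000, 2001, 2002],
    'val': [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})

# Each level of the MultiIndex becomes one dimension of the DataArray.
toy_da = toy.set_index(['location_id', 'year_id'])['val'].to_xarray()

print(toy_da.dims)   # ('location_id', 'year_id')
print(toy_da.shape)  # (2, 3)
print(float(toy_da.sel(location_id=2, year_id=2001)))  # 21.0
```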

### Standard attributes

In [5]:
death_da.dims

('location_id', 'sex_id', 'age_group_id', 'year_id', 'cause_id')

In [6]:
death_da.coords

Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...

In [7]:
death_da.shape

(7, 2, 23, 37, 22)

In [8]:
death_da.sizes

Frozen(OrderedDict([('location_id', 7), ('sex_id', 2), ('age_group_id', 23), ('year_id', 37), ('cause_id', 22)]))

In [9]:
death_da.values

array([[[[[  0.00000000e+00,   5.88397754e+04,   7.18907759e+01, ...,
             4.16882725e+02,   7.78823584e+01,   0.00000000e+00],
          [  0.00000000e+00,   5.66258596e+04,   7.20791794e+01, ...,
             4.18550621e+02,   7.85243164e+01,   0.00000000e+00],
          [  0.00000000e+00,   5.44358789e+04,   7.26904858e+01, ...,
             4.18169900e+02,   7.50877662e+01,   0.00000000e+00],
          ..., 
          [  0.00000000e+00,   5.85309326e+03,   6.26532007e+00, ...,
             9.45389365e+01,   2.46490468e+00,   0.00000000e+00],
          [  0.00000000e+00,   5.27944984e+03,   5.77724178e+00, ...,
             8.85148656e+01,   2.65453061e+00,   0.00000000e+00],
          [  0.00000000e+00,   4.76425933e+03,   5.43032205e+00, ...,
             8.35210341e+01,   4.68209999e+00,   0.00000000e+00]],

         [[  0.00000000e+00,   7.05525493e+04,   8.50817318e+02, ...,
             1.10020409e+02,   4.71582573e+01,   0.00000000e+00],
          [  0.00000000e+00,  

### Turn a `pd.DataFrame` into an `xr.Dataset`

In [10]:
death_ds = df.to_xarray()
death_ds

<xarray.Dataset>
Dimensions:       (age_group_id: 23, cause_id: 22, location_id: 7, sex_id: 2, year_id: 37)
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...
Data variables:
    val           (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    lower         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    upper         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...

In [11]:
death_ds['val']

<xarray.DataArray 'val' (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37, cause_id: 22)>
array([[[[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 399.697783, ...,    0.      ],
          ..., 
          [ 514.333989, ...,    0.      ]]],


        [[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 593.490692, ...,    0.      ],
          ..., 
          [ 718.414844, ...,    0.      ]]]],



       ..., 
       [[[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 141.276598, ...,    0.      ],
          ..., 
          [ 458.554881, ...,    0.      ]]],


        [[[   0.      , ...,    0.      ],
          ..., 
          [   0.      , ...,    0.      ]],

         ..., 
         [[ 117.426784, ...,    0.      ],
          ..., 
          [ 377.657629, ...,    0.

In [12]:
death_ds.data_vars

Data variables:
    val      (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    lower    (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    upper    (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...

## Accessing data

### `.loc[]`

In [13]:
death_da.loc[{'location_id': 4}]

<xarray.DataArray u'val' (sex_id: 2, age_group_id: 23, year_id: 37, cause_id: 22)>
array([[[[   0.      , ...,    0.      ],
         ..., 
         [   0.      , ...,    0.      ]],

        ..., 
        [[ 399.697783, ...,    0.      ],
         ..., 
         [ 514.333989, ...,    0.      ]]],


       [[[   0.      , ...,    0.      ],
         ..., 
         [   0.      , ...,    0.      ]],

        ..., 
        [[ 593.490692, ...,    0.      ],
         ..., 
         [ 718.414844, ...,    0.      ]]]])
Coordinates:
    location_id   int64 4
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...

In [14]:
death_da.loc[{'location_id': 4, 'sex_id': 2, 'age_group_id': 2}]

<xarray.DataArray u'val' (year_id: 37, cause_id: 22)>
array([[  0.000000e+00,   5.224694e+04,   1.050770e+02, ...,   6.381963e+02,
          1.505562e+02,   0.000000e+00],
       [  0.000000e+00,   5.029792e+04,   1.043912e+02, ...,   6.449041e+02,
          1.533365e+02,   0.000000e+00],
       [  0.000000e+00,   4.843732e+04,   1.043710e+02, ...,   6.464364e+02,
          1.487777e+02,   0.000000e+00],
       ..., 
       [  0.000000e+00,   3.492871e+03,   5.522565e+00, ...,   1.066848e+02,
          3.937401e+00,   0.000000e+00],
       [  0.000000e+00,   3.128856e+03,   5.034388e+00, ...,   9.855571e+01,
          4.969122e+00,   0.000000e+00],
       [  0.000000e+00,   2.802962e+03,   4.591126e+00, ...,   9.191121e+01,
          8.431090e+00,   0.000000e+00]])
Coordinates:
    location_id   int64 4
    sex_id        int64 2
    age_group_id  int64 2
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 

In [15]:
death_da.loc[{'location_id': 4, 'sex_id': 2, 'age_group_id': 2, 
              'year_id': np.arange(2000,2016,5)}]

<xarray.DataArray u'val' (year_id: 4, cause_id: 22)>
array([[  0.000000e+00,   1.315227e+04,   2.658951e+01,   0.000000e+00,
          1.346362e+05,   0.000000e+00,   1.639806e+03,   1.455403e+02,
          8.828362e+02,   0.000000e+00,   0.000000e+00,   2.942089e+02,
          5.495683e-01,   5.616159e-01,   3.585845e+02,   0.000000e+00,
          2.048293e+04,   4.622391e+02,   9.116175e+02,   3.351440e+02,
          2.402751e+01,   0.000000e+00],
       [  0.000000e+00,   8.028680e+03,   1.535855e+01,   0.000000e+00,
          1.046934e+05,   0.000000e+00,   1.223842e+03,   1.119763e+02,
          6.261138e+02,   0.000000e+00,   0.000000e+00,   2.234909e+02,
          4.613178e-01,   5.157797e-01,   2.432625e+02,   0.000000e+00,
          1.675747e+04,   3.373929e+02,   7.147487e+02,   1.925577e+02,
          6.767000e+00,   0.000000e+00],
       [  0.000000e+00,   5.254818e+03,   8.488506e+00,   0.000000e+00,
          8.185464e+04,   0.000000e+00,   9.068455e+02,   9.357015e+01,
 

### `.sel()`

In [16]:
death_da.sel(location_id=4, sex_id=2, age_group_id=2, 
             year_id=np.arange(2000,2016,5))

<xarray.DataArray u'val' (year_id: 4, cause_id: 22)>
array([[  0.000000e+00,   1.315227e+04,   2.658951e+01,   0.000000e+00,
          1.346362e+05,   0.000000e+00,   1.639806e+03,   1.455403e+02,
          8.828362e+02,   0.000000e+00,   0.000000e+00,   2.942089e+02,
          5.495683e-01,   5.616159e-01,   3.585845e+02,   0.000000e+00,
          2.048293e+04,   4.622391e+02,   9.116175e+02,   3.351440e+02,
          2.402751e+01,   0.000000e+00],
       [  0.000000e+00,   8.028680e+03,   1.535855e+01,   0.000000e+00,
          1.046934e+05,   0.000000e+00,   1.223842e+03,   1.119763e+02,
          6.261138e+02,   0.000000e+00,   0.000000e+00,   2.234909e+02,
          4.613178e-01,   5.157797e-01,   2.432625e+02,   0.000000e+00,
          1.675747e+04,   3.373929e+02,   7.147487e+02,   1.925577e+02,
          6.767000e+00,   0.000000e+00],
       [  0.000000e+00,   5.254818e+03,   8.488506e+00,   0.000000e+00,
          8.185464e+04,   0.000000e+00,   9.068455e+02,   9.357015e+01,
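
To make the relationship between the two accessors easier to see, the following sketch uses a small, hypothetical array (not the GBD data). It suggests that `.loc[]` with a dict and `.sel()` with keywords are two spellings of the same label-based lookup, while `.isel()` does the same thing by integer position.

```python
import numpy as np
import xarray as xr

# Hypothetical toy array: 3 years x 2 sexes, values 0..5.
toy_da = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=['year_id', 'sex_id'],
    coords={'year_id': [2000, 2005, 2010], 'sex_id': [1, 2]},
)

by_loc = toy_da.loc[{'year_id': 2005, 'sex_id': 2}]  # label-based, dict syntax
by_sel = toy_da.sel(year_id=2005, sex_id=2)          # label-based, keyword syntax
by_isel = toy_da.isel(year_id=1, sex_id=1)           # position-based

print(float(by_loc), float(by_sel), float(by_isel))  # 3.0 3.0 3.0

# Passing a list keeps the dimension; passing a scalar drops it.
print(toy_da.sel(year_id=[2000, 2010]).dims)  # ('year_id', 'sex_id')
print(toy_da.sel(year_id=2000).dims)          # ('sex_id',)
```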
 

## Doing stuff to data

In [17]:
death_da.mean(dim='year_id')

<xarray.DataArray u'val' (location_id: 7, sex_id: 2, age_group_id: 23, cause_id: 22)>
array([[[[   0.      , ...,    0.      ],
         ..., 
         [ 504.79995 , ...,    0.      ]],

        [[   0.      , ...,    0.      ],
         ..., 
         [ 750.84736 , ...,    0.      ]]],


       ..., 
       [[[   0.      , ...,    0.      ],
         ..., 
         [ 295.743178, ...,    0.      ]],

        [[   0.      , ...,    0.      ],
         ..., 
         [ 248.113755, ...,    0.      ]]]])
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...

In [18]:
all_cause_da = death_da.sum(dim='cause_id')
all_cause_da

<xarray.DataArray u'val' (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37)>
array([[[[ 431749.137854, ...,  100549.82766 ],
         ..., 
         [  11103.276125, ...,   60849.311784]],

        [[ 327005.784673, ...,   69199.978177],
         ..., 
         [  21964.195124, ...,  140114.257909]]],


       ..., 
       [[[ 311617.87784 , ...,  423473.607417],
         ..., 
         [   2033.876694, ...,    7512.63842 ]],

        [[ 223889.06841 , ...,  298630.294538],
         ..., 
         [   3052.018287, ...,   11731.877769]]]])
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...

## Combining data
### Get some populations

In [19]:
pop = db.get_population(age_group_id=death_da.coords['age_group_id'].values.tolist(),
                        location_id=death_da.coords['location_id'].values.tolist(),
                        sex_id=death_da.coords['sex_id'].values.tolist(),
                        year_id=death_da.coords['year_id'].values.tolist())
pop.head()

| | age_group_id | location_id | year_id | sex_id | population | process_version_map_id |
|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 1980 | 1 | 328647.715155 | 635 |
| 1 | 2 | 4 | 1980 | 2 | 310065.767121 | 635 |
| 2 | 2 | 4 | 1981 | 1 | 333325.177317 | 635 |
| 3 | 2 | 4 | 1981 | 2 | 314430.356573 | 635 |
| 4 | 2 | 4 | 1982 | 1 | 336986.831599 | 635 |


### Turn into an `xr.DataArray`

In [20]:
pop = pop.set_index(['location_id','sex_id','age_group_id','year_id'])
pop_da = pop['population'].to_xarray()
pop_da

<xarray.DataArray u'population' (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37)>
array([[[[ 328647.715155, ...,  235996.913732],
         ..., 
         [  25004.786623, ...,  163579.217919]],

        [[ 310065.767121, ...,  214647.028209],
         ..., 
         [  50779.683437, ...,  438084.755912]]],


       ..., 
       [[[ 154933.060786, ...,  341035.124231],
         ..., 
         [   4999.08355 , ...,   18727.555809]],

        [[ 149937.54739 , ...,  329458.145417],
         ..., 
         [   7150.264461, ...,   29031.836275]]]])
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...

### Automatic alignment

In [21]:
all_cause_da / pop_da

<xarray.DataArray (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37)>
array([[[[ 1.313714, ...,  0.426064],
         ..., 
         [ 0.444046, ...,  0.371987]],

        [[ 1.054634, ...,  0.32239 ],
         ..., 
         [ 0.432539, ...,  0.319834]]],


       ..., 
       [[[ 2.011307, ...,  1.24173 ],
         ..., 
         [ 0.40685 , ...,  0.401154]],

        [[ 1.493215, ...,  0.906429],
         ..., 
         [ 0.42684 , ...,  0.404104]]]])
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...

### Automatic broadcasting!

In [22]:
death_da / pop_da

<xarray.DataArray (location_id: 7, sex_id: 2, age_group_id: 23, year_id: 37, cause_id: 22)>
array([[[[[ 0.      , ...,  0.      ],
          ..., 
          [ 0.      , ...,  0.      ]],

         ..., 
         [[ 0.015985, ...,  0.      ],
          ..., 
          [ 0.003144, ...,  0.      ]]],


        [[[ 0.      , ...,  0.      ],
          ..., 
          [ 0.      , ...,  0.      ]],

         ..., 
         [[ 0.011688, ...,  0.      ],
          ..., 
          [ 0.00164 , ...,  0.      ]]]],



       ..., 
       [[[[ 0.      , ...,  0.      ],
          ..., 
          [ 0.      , ...,  0.      ]],

         ..., 
         [[ 0.02826 , ...,  0.      ],
          ..., 
          [ 0.024486, ...,  0.      ]]],


        [[[ 0.      , ...,  0.      ],
          ..., 
          [ 0.      , ...,  0.      ]],

         ..., 
         [[ 0.016423, ...,  0.      ],
          ..., 
          [ 0.013008, ...,  0.      ]]]]])
Coordinates:
  * location_id   (location_id) int64 4 31 6
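
What alignment and broadcasting do is easier to check by hand on a small case. The sketch below uses hypothetical numbers (the `deaths` and `pop` arrays are made up for illustration): the two arrays list their years in a different order, and only one of them has a `cause_id` dimension.

```python
import xarray as xr

# Hypothetical toy inputs. `deaths` has an extra `cause_id` dimension,
# and `pop` lists its years in a different order.
deaths = xr.DataArray(
    [[10.0, 20.0], [30.0, 40.0]],
    dims=['year_id', 'cause_id'],
    coords={'year_id': [2000, 2001], 'cause_id': [296, 301]},
)
pop = xr.DataArray(
    [2000.0, 1000.0],
    dims=['year_id'],
    coords={'year_id': [2001, 2000]},
)

rate = deaths / pop

# Alignment: values are matched on the year_id labels, not on position.
# Broadcasting: the same population is reused for every cause_id.
print(rate.dims)                               # ('year_id', 'cause_id')
print(rate.sel(year_id=2000).values.tolist())  # [0.01, 0.02]
print(rate.sel(year_id=2001).values.tolist())  # [0.015, 0.02]
```

With plain `numpy` arrays the same division would silently pair the 2000 deaths with the 2001 population, which is the kind of mistake the labels are there to prevent.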

## Creating new variables

In [23]:
death_ds['rate'] = death_ds['val'] / pop_da
death_ds

<xarray.Dataset>
Dimensions:       (age_group_id: 23, cause_id: 22, location_id: 7, sex_id: 2, year_id: 37)
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...
Data variables:
    val           (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    lower         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    upper         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    rate          (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...

## Saving and loading data

In [24]:
death_ds.to_netcdf('~/tmp/deaths.nc')

In [25]:
loaded_death_ds = xr.open_dataset('~/tmp/deaths.nc')
loaded_death_ds

<xarray.Dataset>
Dimensions:       (age_group_id: 23, cause_id: 22, location_id: 7, sex_id: 2, year_id: 37)
Coordinates:
  * location_id   (location_id) int64 4 31 64 103 137 158 166
  * sex_id        (sex_id) int64 1 2
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id      (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...
Data variables:
    val           (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    lower         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    upper         (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...
    rate          (location_id, sex_id, age_group_id, year_id, cause_id) float64 0.0 ...

## Example: age-adjusted rates

In [26]:
age_weights = db.get_age_metadata(12)
age_weights = age_weights.set_index('age_group_id')[['age_group_weight_value']]
age_weights.head()

| age_group_id | age_group_weight_value |
|---|---|
| 2 | 0.000357 |
| 3 | 0.001062 |
| 4 | 0.016878 |
| 5 | 0.071823 |
| 6 | 0.086928 |


In [27]:
weights_da = age_weights['age_group_weight_value'].to_xarray()
weights_da

<xarray.DataArray u'age_group_weight_value' (age_group_id: 23)>
array([ 0.000357,  0.001062,  0.016878,  0.071823,  0.086928,  0.083952,
        0.080984,  0.078145,  0.075598,  0.072485,  0.068559,  0.063842,
        0.058491,  0.05271 ,  0.04678 ,  0.040575,  0.033591,  0.026293,
        0.018967,  0.012135,  0.006445,  0.002576,  0.000825])
Coordinates:
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...

In [28]:
age_adj_da = (death_ds['rate'] * weights_da).sum('age_group_id') * 1e5
age_adj_da

<xarray.DataArray (location_id: 7, sex_id: 2, year_id: 37, cause_id: 22)>
array([[[[  83.330172, ...,    0.      ],
         ..., 
         [  18.531529, ...,    0.      ]],

        [[  53.599355, ...,    0.      ],
         ..., 
         [   8.316955, ...,    0.      ]]],


       ..., 
       [[[ 197.673496, ...,    0.      ],
         ..., 
         [ 218.915467, ...,    0.      ]],

        [[ 133.296035, ...,    0.      ],
         ..., 
         [ 158.797021, ...,    0.      ]]]])
Coordinates:
  * location_id  (location_id) int64 4 31 64 103 137 158 166
  * sex_id       (sex_id) int64 1 2
  * year_id      (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * cause_id     (cause_id) int64 296 301 344 366 380 386 392 410 491 508 ...
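
The calculation above is a weighted sum over age: for each location, sex, year and cause, the age-adjusted rate is the sum over age groups of (age-specific rate × standard-population weight), scaled to deaths per 100,000. The sketch below repeats it on hypothetical rates and weights that are small enough to verify by hand; none of these numbers come from GBD.

```python
import xarray as xr

# Hypothetical age-specific rates (deaths per person) for two locations.
rate = xr.DataArray(
    [[0.001, 0.002, 0.010], [0.002, 0.002, 0.020]],
    dims=['location_id', 'age_group_id'],
    coords={'location_id': [1, 2], 'age_group_id': [10, 20, 30]},
)
# Hypothetical standard-population weights; they should sum to 1.
weights = xr.DataArray(
    [0.5, 0.3, 0.2],
    dims=['age_group_id'],
    coords={'age_group_id': [10, 20, 30]},
)
assert abs(float(weights.sum()) - 1.0) < 1e-12

# Weighted sum over age, expressed per 100,000 people.
age_std = (rate * weights).sum('age_group_id') * 1e5
print(age_std.round(1).values.tolist())  # [310.0, 560.0]
```

For location 1 this is (0.001 × 0.5 + 0.002 × 0.3 + 0.010 × 0.2) × 100,000 = 310.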

## Example: annualized rate of change

In [29]:
# NumPy ufuncs can be applied to a DataArray directly.
arc_da = np.log(age_adj_da).diff('year_id', 1) * 100.
arc_da.sel(cause_id=296)


<xarray.DataArray (location_id: 7, sex_id: 2, year_id: 36)>
array([[[ -1.793099,  -3.783488, ...,  -4.386699,  -3.847824],
        [ -4.10563 ,  -4.017464, ...,  -4.51216 ,  -3.625253]],

       [[  1.915439,  -2.732526, ...,  -0.657557,  -3.067505],
        [  2.613252,  -3.10217 , ...,  -3.064849,  -3.765895]],

       ..., 
       [[ -2.081365,  -2.148303, ...,  -5.39287 ,  -6.184376],
        [ -2.272081,  -2.417414, ...,  -5.527711,  -7.07386 ]],

       [[ -0.443818,  -0.033131, ...,  -7.569799,  -8.013397],
        [ -1.13586 ,  -1.140267, ..., -11.947363, -10.15332 ]]])
Coordinates:
  * location_id  (location_id) int64 4 31 64 103 137 158 166
  * sex_id       (sex_id) int64 1 2
  * year_id      (year_id) int64 1981 1982 1983 1984 1985 1986 1987 1988 ...
    cause_id     int64 296
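
The year-over-year difference of log rates is approximately the percent change per year. Rates of exactly zero are a problem for this formula, because `log(0)` is `-inf` and NumPy emits a runtime warning. The following sketch, on a hypothetical series, shows one possible way to handle that: mask non-positive values with `.where()` before taking the log, so the affected years come out as `NaN`. Whether masking is the right choice for a given analysis is an assumption to check, not a recommendation from the original talk.

```python
import numpy as np
import xarray as xr

# Hypothetical rate series that falls by 5% each year, then hits zero.
rate = xr.DataArray(
    [100.0, 95.0, 90.25, 0.0],
    dims=['year_id'],
    coords={'year_id': [2000, 2001, 2002, 2003]},
)

# Mask non-positive values so log() never sees a zero.
positive = rate.where(rate > 0)

# Year-over-year change in log space, in percent.
arc = np.log(positive).diff('year_id', 1) * 100.0

print(arc.year_id.values.tolist())   # [2001, 2002, 2003]
print(arc.round(3).values.tolist())  # [-5.129, -5.129, nan]
```

Each difference is labelled with the later of the two years, which is why the result starts one year after the input does.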

## Advanced example: `dask`!

In [30]:
import os
nc_dir = '/ihme/forecasting/data/4/past/death/20171003_dalynator_outputs/'
nc_files = [os.path.join(nc_dir, f) for f in os.listdir(nc_dir) if f.endswith('.nc')]
nc_files[:3]

['/ihme/forecasting/data/4/past/death/20171003_dalynator_outputs/digest_pancreatitis.nc',
 '/ihme/forecasting/data/4/past/death/20171003_dalynator_outputs/neo_bladder.nc',
 '/ihme/forecasting/data/4/past/death/20171003_dalynator_outputs/_otherncd.nc']

In [31]:
def get_global(x):
    return x.sel(location_id=1)[['mean']]

In [32]:
dask_ds = xr.open_mfdataset(nc_files, concat_dim='cause_id', preprocess=get_global)
dask_ds

<xarray.Dataset>
Dimensions:       (age_group_id: 23, cause_id: 232, sex_id: 2, year_id: 37)
Coordinates:
  * sex_id        (sex_id) int64 1 2
  * year_id       (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 ...
  * age_group_id  (age_group_id) int64 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
    location_id   int64 1
  * cause_id      (cause_id) int64 535 474 640 302 653 297 502 525 342 741 ...
Data variables:
    mean          (cause_id, age_group_id, sex_id, year_id) float64 0.0 0.0 ...

In [33]:
dask_ds.sum(['age_group_id', 'sex_id'])

<xarray.Dataset>
Dimensions:   (cause_id: 232, year_id: 37)
Coordinates:
  * year_id   (year_id) int64 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
  * cause_id  (cause_id) int64 535 474 640 302 653 297 502 525 342 741 543 ...
Data variables:
    mean      (cause_id, year_id) float64 0.002146 0.002141 0.002133 ...

# Who should use `xarray`, and when?

`xarray` excels when:
- You've got lots of dimensions
- ...which are balanced
- You need to do lots of combining of data
- You're slicing across different dimensions

`xarray` is not the answer to your problems if:
- Your data is not balanced
    - E.g. raw data, where you have different numbers of data points for different demographic combos
- You need to use things which expect `pandas`, like `statsmodels`
    - _However_, there's always `xr.Dataset.to_dataframe()`, which converts back to a `pd.DataFrame`!
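
Both caveats can be seen on a tiny, hypothetical table (the `raw` data below is made up). Converting unbalanced data to `xarray` pads every missing combination with `NaN`, and `Dataset.to_dataframe()` is one way to get back to a long `pandas` table for tools that expect one.

```python
import pandas as pd
import xarray as xr  # pandas' to_xarray() relies on xarray being installed

# Hypothetical unbalanced data: location 2 has no row for 2001.
raw = pd.DataFrame({
    'location_id': [1, 1, 2],
    'year_id': [2000, 2001, 2000],
    'val': [5.0, 6.0, 7.0],
})

ds = raw.set_index(['location_id', 'year_id']).to_xarray()

# xarray stores the full 2 x 2 grid, so the missing combination becomes NaN.
print(ds['val'].shape)                # (2, 2)
print(int(ds['val'].isnull().sum()))  # 1

# Back to a long pandas DataFrame, dropping the padded rows again.
long_df = ds.to_dataframe().dropna().reset_index()
print(len(long_df))                   # 3
```

With only one gap the padding is harmless, but for sparse raw data most of the grid can end up as `NaN`, which wastes memory and makes reductions harder to interpret.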

# Where can I learn more about `xarray`?

The [`xarray` documentation](http://xarray.pydata.org/en/stable/index.html) is really great!

# Notes

__Author__: Kyle Foreman (kfor@uw.edu)

__Originally presented:__ 13 October 2017