# Computing spatial derivatives with dask vs xarray

### Author 
 - Julien Le Sommer, CNRS

### Context
 - April 2016, preparatory work for [oocgcm](https://github.com/lesommer/oocgcm)
 - following the analysis published [in this first notebook](https://github.com/lesommer/notebooks/blob/master/Profiling_simple_derivative_computation_with_numpy_vs_dask_vs_xarray.ipynb) and [this second notebook](https://github.com/lesommer/notebooks/blob/master/More_profiling_of_dask_vs_xarray.ipynb), as discussed in this [thread](https://groups.google.com/forum/#!topic/xarray/TOX5BIc08WA)
 
### Purpose
 - compare the execution speed of computing an x-derivative with xarray.DataArray objects and with dask arrays, both read from the same netCDF files.



### Modules

In [1]:
#- import
import numpy as np
from netCDF4 import Dataset
import dask.array as da
import xarray as xr
from contextlib import contextmanager
import time

In [2]:
#- print versions
import dask 
print('dask version : ' + dask.__version__)
print('xarray version : ' + xr.__version__)

dask version : 0.8.1
xarray version : 0.7.2


### Define the arrays

#### Location of netcdf files

In [3]:
#- Medium size 2D dataset from NATL60 
#  array shape is (1,3454,5422)
file_2d_gridt = "/Users/lesommer/data/NATL60/NATL60-MJM155-S/1d/2008/NATL60-MJM155_y2008m01.1d_BUOYANCYFLX.nc"
varname_2d = "vosigma0"
file_2d_coord = "/Users/lesommer/data/NATL60/NATL60-I/NATL60_coordinates_v4.nc"

#### Size of the chunks 

In [4]:
chunks2d = (1727,2711)
xr_chunks2d = {'x': chunks2d[-1], 'y': chunks2d[-2]}
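
As a quick check of what this chunk size implies, the sketch below counts how many chunks cover one horizontal slice. It assumes the horizontal shape (3454, 5422) quoted in the comment of the previous cell; the name `shape2d` is introduced here for illustration only.

```python
import math

# Horizontal shape quoted in the notebook comment (time axis dropped).
shape2d = (3454, 5422)   # (y, x); illustrative name, not from the notebook
chunks2d = (1727, 2711)  # same chunk size as in the cell above

# Number of blocks along each axis, rounding up for partial chunks.
blocks = tuple(math.ceil(n / c) for n, c in zip(shape2d, chunks2d))
n_chunks = blocks[0] * blocks[1]
elems_per_chunk = chunks2d[0] * chunks2d[1]

print('blocks per axis :', blocks)
print('number of chunks:', n_chunks)
print('elements / chunk:', elems_per_chunk)
```

With these values each axis is split exactly in two, so the 2D field is processed as four equally sized chunks.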

#### Functions for defining dask arrays

In [5]:
#- dask from netcdf :  dachk
def load_dachk(filename,varname,chunks,it=None):
    d = Dataset(filename).variables[varname]
    if it is None:
        array = da.from_array(d, chunks=chunks)
    else :
        array = da.from_array(d, chunks=(1,)+ chunks)[it]
    return array 

get_array2d_tgrid_dachk = lambda: load_dachk(file_2d_gridt,varname_2d,chunks2d,it=0)
get_array2d_e1u_dachk   = lambda: load_dachk(file_2d_coord,"e1u",chunks2d)

#### Functions for defining xarray.DataArray

In [6]:
#- xarray with chunks :  xachk
def load_xachk(filename,varname,xr_chunks,it=None,decode_cf=True,engine='netcdf4',lock=False):
    ds = xr.open_dataset(filename,chunks=xr_chunks,decode_cf=decode_cf,engine=engine,lock=lock)
    if it is None:
        array = ds.variables[varname]
    else:
        array = ds.variables[varname][it]
    return array.chunk()

get_array2d_tgrid_xachk           = lambda: load_xachk(file_2d_gridt,varname_2d,xr_chunks2d,it=0)
get_array2d_e1u_xachk             = lambda: load_xachk(file_2d_coord,"e1u",xr_chunks2d)

### Define the functions that compute the x-derivative

In [7]:
#- Simple x-derivative function 
def derivative_da(array_tgrid,array_e1u,compute=True):
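    # forward difference along the last (x) axis, applied chunk by chunk
    di = lambda t: (np.roll(t,-1,axis=-1) - t) 
    # neighbouring chunks share one ghost cell along x, none along the other axes
    depth = {-1: 1, -2: 0, -3:0}
    d = array_tgrid.map_overlap(di,depth=depth,boundary=0)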
    if compute:
        return np.asarray(  (d / array_e1u ).compute() )
    else:
        return d / array_e1u

def derivative_xa(array_tgrid,array_e1u, compute=True):
    if compute:
        return np.asarray((array_tgrid.shift(x=-1) - array_tgrid) / array_e1u)
    else:
        return (array_tgrid.shift(x=-1) - array_tgrid) / array_e1u

def derivative_xa_as_da(array_tgrid,array_e1u,compute=True):
    return derivative_da(array_tgrid.data,array_e1u.data,compute=compute)
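
The two approaches above rely on different primitives: `np.roll` wraps around the axis, whereas xarray's `shift` fills the vacated position with NaN by default. The NumPy-only sketch below illustrates that difference on a tiny array. The helper names and the uniform spacing `e1u` are made up for illustration, and the sketch does not reproduce the chunk-boundary handling of `map_overlap`, whose edge values also depend on its `boundary` argument.

```python
import numpy as np

def forward_diff_roll(t):
    """Forward difference along the last axis using np.roll (periodic wrap)."""
    return np.roll(t, -1, axis=-1) - t

def forward_diff_shift(t):
    """Forward difference mimicking a shift by -1 that pads with NaN."""
    shifted = np.full(t.shape, np.nan)
    shifted[..., :-1] = t[..., 1:]
    return shifted - t

t = np.array([[0.0, 1.0, 4.0, 9.0]])
e1u = np.full_like(t, 2.0)  # hypothetical uniform grid spacing

gx_roll = forward_diff_roll(t) / e1u    # last column wraps: (t[0] - t[-1]) / e1u
gx_shift = forward_diff_shift(t) / e1u  # last column is NaN

print(gx_roll)
print(gx_shift)
```

Interior columns agree between the two variants; only the last column along x differs, which is worth keeping in mind when comparing outputs of the implementations rather than just their timings.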


### Profiling tools

In [8]:
@contextmanager
def timeit_context(name):
    startTime = time.time()
    yield
    elapsedTime = time.time() - startTime
    print('{} takes {} ms'.format(name, int(elapsedTime * 1000))+'\n')
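
A possible variant of this timer, sketched under the assumption of Python 3 (the notebook itself was run with Python 2): it uses `time.perf_counter`, which is a monotonic high-resolution clock, reports the elapsed time even if the timed block raises, and optionally stores the result in a dictionary. The name `timed` and the `results` argument are illustrative and not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name, results=None):
    """Print the wall-clock time of the enclosed block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs whether or not the timed block raised an exception.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if results is not None:
            results[name] = elapsed_ms
        print('{} takes {:.1f} ms'.format(name, elapsed_ms))

timings = {}
with timed('sleeping', timings):
    time.sleep(0.01)
```

Collecting the timings in a dictionary makes it easier to repeat a measurement several times and compare runs, instead of reading single values off the printed output.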

In [9]:
def derivative_profile_context(func_get_array_gridt,func_get_array_e1u,compute_func): 
    "Simple profiling function"
    with timeit_context('creating tgrid array'):
        array_tgrid = func_get_array_gridt()
    with timeit_context('creating e1u array'):
        array_e1u  = func_get_array_e1u()
    with timeit_context('computing x-derivative'):
        gx = compute_func(array_tgrid, array_e1u)
    print('output array is a ' + str(type(gx)))

### Pure dask array implementation


In [10]:
# dask (netcdf with chunk) / dask (netcdf with chunk) 
derivative_profile_context(get_array2d_tgrid_dachk,get_array2d_e1u_dachk,derivative_da)

creating tgrid array takes 3 ms

creating e1u array takes 1 ms

computing x-derivative takes 924 ms

output array is a <type 'numpy.ndarray'>


### Pure xarray implementation

In [11]:
# xarray (with chunk) / xarray (with chunk) 
derivative_profile_context(get_array2d_tgrid_xachk,get_array2d_e1u_xachk,derivative_xa)

creating tgrid array takes 12 ms

creating e1u array takes 17 ms

computing x-derivative takes 782 ms

output array is a <type 'numpy.ndarray'>


### Mixed xarray / dask implementation

In [12]:
# xarray (with chunk) / xarray (with chunk) (derivative as for dask array)
derivative_profile_context(get_array2d_tgrid_xachk,get_array2d_e1u_xachk,derivative_xa_as_da)

creating tgrid array takes 9 ms

creating e1u array takes 14 ms

computing x-derivative takes 1032 ms

output array is a <type 'numpy.ndarray'>
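
The three derivative timings reported above can be put side by side with a few lines of Python. This is only a sketch: the numbers are copied from the single runs printed in this notebook, so the ratios may well be within run-to-run variability.

```python
# Wall-clock times (ms) of 'computing x-derivative', copied from the outputs above.
timings_ms = {
    'pure dask': 924,
    'pure xarray': 782,
    'mixed xarray / dask': 1032,
}

fastest = min(timings_ms, key=timings_ms.get)
relative = {name: round(ms / timings_ms[fastest], 2)
            for name, ms in timings_ms.items()}

for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print('{:<20s} {:>5d} ms  x{:.2f}'.format(name, timings_ms[name], ratio))
```

On these single runs the pure xarray version is the fastest, with the pure dask and mixed versions roughly 1.2 and 1.3 times slower.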
