In [1]:
%matplotlib notebook

# Iris/dask dataset loading investigation

## Introduction

This notebook demonstrates using dask functionality beyond the `array` module to help with Iris processing. Specifically, it compares alternative approaches for loading numerous and/or large datasets into Iris.

Three approaches will be compared:

* The standard Iris load
* Wrapping Iris load calls in a **dask bag** generated from a sequence
* Wrapping Iris load calls in a **dask bag** generated from a **delayed** call

These options will be compared with two simple metrics:

- Ease of use
- Runtime

## Setup

Below are the functions used to load the datasets. The standard Iris load and the bag generated from a sequence each need one function. The bag generated from a delayed call was first written as two functions, one decorated with `delayed` and one that calls it (kept below, commented out); `delay_wrapper_v2` does the same job in a single function.

### Imports

In [2]:
import os
import time

import dask.bag as db
from dask import delayed
import iris

In [3]:
print(iris)

<module 'iris' from '/home/h04/dkillick/git/iris/lib/iris/__init__.pyc'>


### Timer function

A simple function that records the runtime of a supplied function, plus a helper that repeats the measurement. These are useful for capturing results; for a quick check within this notebook we can use the `%timeit` magic instead.

In [4]:
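def timer(func, *funcargs):
    """Return the wall-clock time in seconds taken to call `func(*funcargs)`."""
    t0 = time.time()
    func(*funcargs)
    t1 = time.time()
    return t1 - t0

def repeater(repeat, *timerargs):
    """
    Time a call `repeat` times using `timer`.

    Returns a single float if `repeat <= 1`, otherwise a list of floats.

    """
    if repeat <= 1:
        result = timer(*timerargs)
    else:
        result = [timer(*timerargs) for _ in range(repeat)]
    return result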
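
As a quick check of the two helpers, the sketch below times a stand-in function instead of a real Iris load. `fake_load` is hypothetical and simply sleeps; the helper definitions are repeated so that the block runs on its own. Note that `repeater` returns a single float for one run and a list for several, which matters when the values are passed on to a plot.

```python
import time


def timer(func, *funcargs):
    t0 = time.time()
    func(*funcargs)
    t1 = time.time()
    return t1 - t0


def repeater(repeat, *timerargs):
    if repeat <= 1:
        result = timer(*timerargs)
    else:
        result = [timer(*timerargs) for _ in range(repeat)]
    return result


def fake_load(delay):
    """Hypothetical stand-in for a load call: sleep for `delay` seconds."""
    time.sleep(delay)


single = repeater(1, fake_load, 0.01)   # one run: a single float
several = repeater(3, fake_load, 0.01)  # three runs: a list of floats
print(single)
print(several)
```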

### Runner functions

In [5]:
def direct_load(fp, pattern):
    """Load all datasets matching `pattern` in the directory `fp` using Iris."""
    return iris.load(os.path.join(fp, pattern))

def withbag(fp, seq):
    """
    Load a number of individual datasets in a sequence using Iris.
    
    This is a little more complex as we need to generate a sequence and map
    that sequence onto a load call. The dask bag is generated from that
    sequence.

    """
    loader = lambda fn: iris.load_cube(os.path.join(fp, fn))
    cs = db.from_sequence(seq).map(loader)
    iris.cube.CubeList(cs.compute())

# @delayed
# def delay(fp):
#     """
#     A simple Iris load function decorated with dask's `delayed` functionality.
    
#     TODO: could this be done as `delayed(direct_load)` instead?
#     """
#     return iris.load(os.path.join(fp, '*.nc'))

# def delay_wrapper(fp):
#     """Converts the delay-wrapped function into a dask bag."""
#     dlyd = delay(fp)
#     cs = db.from_delayed(dlyd)
#     iris.cube.CubeList(cs.compute())

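def delay_wrapper_v2(fp, pattern):
    """Load datasets via a dask bag built from a single delayed Iris load."""
    # `delayed(f)(args)` defers the call until `compute()`; `delayed(f(args))`
    # would run the load immediately and wrap only its return value.
    dlyd = delayed(iris.load)(os.path.join(fp, pattern))
    cs = db.from_delayed(dlyd)
    iris.cube.CubeList(cs.compute())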
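
When using `delayed`, it matters whether the function or its result is wrapped: `delayed(f)(x)` defers the call until `compute()`, whereas `delayed(f(x))` runs `f(x)` straight away and wraps only the return value. The sketch below illustrates this with a hypothetical `fake_load` that returns a list of strings in place of cubes. `scheduler='synchronous'` is used here only so that the call counter stays in the current process; it is not a recommendation for real loads.

```python
import dask.bag as db
from dask import delayed

calls = []  # records each time the stand-in loader actually runs


def fake_load(name):
    """Hypothetical stand-in for iris.load: returns a list of 'cubes'."""
    calls.append(name)
    return [name + ':cube0', name + ':cube1']


# Wrapping the function defers the call: nothing has run yet.
lazy = delayed(fake_load)('a.nc')
calls_before_compute = list(calls)

# The bag treats the delayed result as one partition of items.
bag = db.from_delayed(lazy)
items = bag.compute(scheduler='synchronous')
calls_after_compute = list(calls)

# Wrapping the *result* runs the loader immediately, before any compute().
eager = delayed(fake_load('b.nc'))
calls_after_eager = list(calls)

print(items)
```

In a timing comparison this distinction decides whether the load is measured inside `compute()` or happens before dask is involved at all.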

## Test!

Run each loader on some sample data and time it.

Using **sample NetCDF data** at `/project/applied/OECD/data/original_data/tas/rcp26`:

In [6]:
fp = '/project/applied/OECD/data/original_data/tas/rcp26'
seq = os.listdir(fp)
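
`os.listdir` returns every entry in the directory, in arbitrary order, so if the directory holds anything other than loadable files the bag loader would try to load those too. Whether that applies to the directory above is not known; the sketch below uses a throwaway directory with made-up filenames to show one way of restricting the sequence to NetCDF files.

```python
import os
import tempfile

# Build a throwaway directory with made-up filenames for illustration.
tmpdir = tempfile.mkdtemp()
for name in ['tas_b.nc', 'tas_a.nc', 'README.txt']:
    open(os.path.join(tmpdir, name), 'w').close()

# Keep only the NetCDF files and sort them for a reproducible order.
nc_seq = sorted(fn for fn in os.listdir(tmpdir) if fn.endswith('.nc'))
print(nc_seq)
```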

In [7]:
# %timeit direct_load(fp, '*.nc')

In [8]:
# %timeit withbag(fp, seq)

In [9]:
# %timeit delay_wrapper_v2(fp, '*.nc')

### Digging deeper

The `%timeit` magic gives a quick overview of how long each loader takes. Now let's look more closely, using box-and-whisker plots of the runtimes.

In [10]:
reps = 10

In [11]:
direct_load_vals = repeater(reps, direct_load, fp, '*.nc')
with_bag_vals = repeater(reps, withbag, fp, seq)
# delay_vals = repeater(reps, delay_wrapper, fp)
delay_vals_v2 = repeater(reps, delay_wrapper_v2, fp, '*.nc')

In [12]:
print(direct_load_vals)
print(delay_vals_v2)

[8.09387993812561, 1.024986982345581, 1.0065579414367676, 1.0034730434417725, 1.0011858940124512, 1.002074956893921, 1.002161979675293, 1.0049009323120117, 1.0022990703582764, 1.0022609233856201]
[1.007253885269165, 1.006119966506958, 1.0063679218292236, 1.0081560611724854, 1.0075829029083252, 1.0083439350128174, 1.00730299949646, 1.0083320140838623, 1.008795976638794, 1.0073637962341309]
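
The first direct load took about 8.1 s, while every later run took about 1.0 s. One plausible reading is that the first run paid a one-off cost (for example, a cold filesystem cache), although nothing recorded here confirms that. As a minimal sketch, the first run can be separated from the rest before comparing medians; the values below are copied from the output above and rounded to three decimal places, and the `median` helper is written by hand so that it does not depend on the Python version.

```python
# Timings copied from the output above, rounded to three decimal places.
direct = [8.094, 1.025, 1.007, 1.003, 1.001, 1.002, 1.002, 1.005, 1.002, 1.002]
delay = [1.007, 1.006, 1.006, 1.008, 1.008, 1.008, 1.007, 1.008, 1.009, 1.007]


def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# Treat the first direct load separately from the later repeats.
first_direct, rest_direct = direct[0], direct[1:]
median_rest_direct = median(rest_direct)
median_delay = median(delay)

print(first_direct)
print(median_rest_direct)
print(median_delay)
```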


#### Plot the results

In [13]:
import matplotlib.pyplot as plt

In [19]:
fig = plt.figure(figsize=(9, 6))
plt.boxplot([direct_load_vals, with_bag_vals, delay_vals_v2],
            vert=True, labels=['direct', 'bag', 'delay'])
plt.show()


### Different dataset

In [15]:
fp = '/project/euro4_hindcast/WIND-ATLAS_EURO4-RERUN/2015/06/18Z'
fn = 'EURO4_2015060[1-3].pp'
seq = os.listdir(fp)
reps = 3
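
The pattern in `fn` uses a shell-style character class: `[1-3]` matches exactly one character in the range 1 to 3. Assuming Iris expands the pattern with the usual shell-style glob rules, the standard library's `fnmatch` can be used to preview which names would match. The filenames below are hypothetical; the real directory contents are not shown in this notebook.

```python
import fnmatch

pattern = 'EURO4_2015060[1-3].pp'

# Hypothetical filenames, for illustration only.
names = ['EURO4_20150601.pp', 'EURO4_20150602.pp', 'EURO4_20150603.pp',
         'EURO4_20150604.pp', 'EURO4_20150610.pp']

# Keep only the names that match the shell-style pattern.
matched = fnmatch.filter(names, pattern)
print(matched)
```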

In [17]:
direct_load_vals_pp = repeater(reps, direct_load, fp, fn)
# with_bag_vals_pp = repeater(reps, withbag, fp, seq)
delay_vals_v2_pp = repeater(reps, delay_wrapper_v2, fp, fn)

In [18]:
fig = plt.figure(figsize=(9, 6))
plt.boxplot([direct_load_vals_pp, delay_vals_v2_pp],
            vert=True, labels=['direct', 'delay'])
plt.show()
