# Parallelize code with Dask Delayed

In this notebook we demonstrate:

* A few words about Pandas
* Read CSV files using Delayed
* Read data example
* Sequential code: Mean CO3 Per Core
* Parallelize the code above using dask delayed
---

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask, Delayed, Pandas, DataFrame
- Create Date: 2020-April; Update Date: 2020-April

### Prerequisite

You can run this notebook on Gadi/VDI (recommended), or on your local computer by downloading [all the CSV example files](git repo). The following modules are needed:

* Pandas
* Dask

<div class="alert alert-warning">
<b>NOTE:</b> If you run this notebook on your local computer, make sure that it has multiple cores. Otherwise, your parallel code won't perform any better than sequential code!
</div>

### A few words about Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Pandas provides tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
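
As a small illustration of moving data between a text format and an in-memory structure, the sketch below parses CSV text into a DataFrame and writes it back out. The column names and values are made up for this example, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; StringIO stands in for a real file.
csv_text = "depth,value\n10,1.5\n20,2.5\n30,3.5\n"

# Parse the CSV text into an in-memory DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV text (without the index column).
round_trip = df.to_csv(index=False)

print(df)
print(round_trip)
```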

Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

In [56]:
# Connect to an already running Dask scheduler through its scheduler file
from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')
client

Client: Scheduler: tcp://10.6.75.59:8710, Dashboard: http://10.6.75.59:8752/status
Cluster: Workers: 24, Cores: 24, Memory: 103.08 GB


Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard becomes visible when you create the client, as in the cell above. We recommend having it open on one side of your screen while using your notebook on the other side. It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.
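
If no scheduler file is available (for example on a laptop), one possible alternative is to start a small local cluster. The sketch below is an assumption about a reasonable local setup, not the configuration used on Gadi: it starts one in-process worker with two threads, asks for a dashboard on a free port, and prints the dashboard link.

```python
from dask.distributed import Client, LocalCluster

# Assumed settings: one in-process worker with two threads, and a
# dashboard on a randomly chosen free port (":0").
with LocalCluster(n_workers=1, threads_per_worker=2, processes=False,
                  dashboard_address=":0") as cluster:
    with Client(cluster) as client:
        # URL of the dashboard for this cluster
        dashboard_url = client.dashboard_link
        print("Dashboard:", dashboard_url)

        # Submit a trivial task to confirm the cluster is working
        result = client.submit(sum, [1, 2, 3]).result()
        print(result)
```

The `with` blocks close the client and the cluster automatically when the work is done.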

## Scale up csv files reading using `delayed` 

We will apply `delayed` to a real data processing task, albeit a simple one.

Consider reading three CSV files with `pd.read_csv` and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask, for a handy speed-up factor of more than two (there are only three inputs to parallelize over).
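
Before applying this to CSV files, here is a minimal sketch of the idea using two toy functions. The helpers `inc` and `add` are hypothetical and exist only for this illustration: wrapping a call in `delayed` records it in a task graph, and nothing runs until `.compute()` is called.

```python
from dask import delayed


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Wrapping a function with delayed() records the call instead of running it.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)  # still lazy: a Delayed object, not a number

# compute() executes the graph; inc(1) and inc(2) are independent
# of each other, so the scheduler may run them in parallel.
result = total.compute()
print(result)  # 5
```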

In [61]:
import pandas as pd
import os
from glob import glob
from dask import delayed
import numpy

filenames = sorted(glob('CSV/*.csv'))
filenames

['CSV/csvfile1.csv', 'CSV/csvfile2.csv', 'CSV/csvfile3.csv']

In [62]:
%%time

# normal, sequential code
a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
print(total)

27
CPU times: user 4.87 ms, sys: 9.86 ms, total: 14.7 ms
Wall time: 13.5 ms


Next, build the same computation as a task graph by applying the `delayed` function to the original Python code. The three functions you want to delay are `pd.read_csv`, `len` and `sum`.

In [63]:
%%time

# delayed version: this only builds the task graph, nothing is read yet
delayed_read_csv = delayed(pd.read_csv)
a = delayed_read_csv(filenames[0])
b = delayed_read_csv(filenames[1])
c = delayed_read_csv(filenames[2])

delayed_len = delayed(len)
na = delayed_len(a)
nb = delayed_len(b)
nc = delayed_len(c)

delayed_sum = delayed(sum)

total = delayed_sum([na, nb, nc])
total

CPU times: user 0 ns, sys: 1.51 ms, total: 1.51 ms
Wall time: 1.3 ms


Delayed('sum-55c1f334-7733-46dd-9eda-e1fb29a04b13')

In [69]:
%time print(total.compute())

FileNotFoundError: [Errno 2] File CSV/csvfile3.csv does not exist: 'CSV/csvfile3.csv'

Next, repeat this using loops, rather than writing out all the variables.

In [30]:
# concise version
csvs = [delayed(pd.read_csv)(fn) for fn in filenames]
lens = [delayed(len)(csv) for csv in csvs]
total = delayed(sum)(lens)
%time print(total.compute())

FileNotFoundError: [Errno 2] File CSV/csvfile1.csv does not exist: 'CSV/csvfile1.csv'
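
The `FileNotFoundError` above indicates that the process executing the task could not find the file under the relative path `CSV/...`. One possible cause (an assumption, not verified here) is that the workers run in a different working directory from the notebook. The sketch below sidesteps this by passing absolute paths; it only helps if the workers can see the same filesystem. The temporary directory, file names and contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd
from dask import delayed

# Create a few small CSV files in a temporary directory (illustrative data only).
tmpdir = tempfile.mkdtemp()
for i, n_rows in enumerate([2, 3, 4], start=1):
    pd.DataFrame({"x": range(n_rows)}).to_csv(
        os.path.join(tmpdir, f"example{i}.csv"), index=False
    )

# Convert every path to an absolute path before building the graph, so the
# path does not depend on the working directory of whoever runs the task.
filenames = sorted(
    os.path.abspath(os.path.join(tmpdir, name))
    for name in os.listdir(tmpdir)
    if name.endswith(".csv")
)

csvs = [delayed(pd.read_csv)(fn) for fn in filenames]
lens = [delayed(len)(csv) for csv in csvs]
total = delayed(sum)(lens)

result = total.compute()
print(result)  # 9
```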

## Real example

### Inspect Data

We will use the supplementary data of the paper **Sequestration of carbon in the deep Atlantic during the last glaciation** by Yu *et al.*, published in Nature Geoscience, 2016, doi:10.1038/ngeo2657.

I downloaded the data and reorganized it into several CSV files saved under a local directory called `Nature_geo_csv`. This dataset includes lab measurements of PH (i.e., CO3 in umol/kg), oxygen isotopes, carbon isotopes, and CaCO3 in sediments at different depths of the Ocean Drilling Program (ODP) cores in the Atlantic Ocean. The naming convention for these files is *coreID-measurements.csv*.

In [31]:
import os
sorted(os.listdir('Nature_geo_csv'))

['.DS_Store',
 'EW9209-2JPC-PH.csv',
 'MD01-2446-O-C.csv',
 'MD01-2446-PH.csv',
 'MD95-2039-CaCO3.csv',
 'MD95-2039-O-C.csv',
 'MD95-2039-PH.csv',
 'RC13-228-O-C.csv',
 'RC13-228-PH.csv',
 'RC13-229-O-C.csv',
 'RC13-229-PH.csv',
 'RC16-59-PH.csv',
 'TNO57-21-PH.csv']

#### Read one file with pandas.read_csv and compute mean PH value of a core.

We can use `pandas.read_csv()` to read CSV files.

In [32]:
import pandas as pd
# skip the first two lines
# line1: core name
# line2: units of the measurement in each column
df = pd.read_csv("Nature_geo_csv/TNO57-21-PH.csv",skiprows=2)
df.head()

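|   | top | btm | mid | age | Cw B/Ca | CO3 |
|---|-----|-----|-------|------|---------|------|
| 0 | 815 | 816 | 815.5 | 51.9 | 123.4 | 83.3 |
| 1 | 853 | 854 | 853.5 | 54.6 | 128.8 | 88.0 |
| 2 | 916 | 917 | 916.5 | 60.7 | 114.0 | 75.0 |
| 3 | 925 | 926 | 925.5 | 61.4 | 113.3 | 74.5 |
| 4 | 936 | 937 | 936.5 | 62.3 | 111.3 | 72.7 |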


In [33]:
# What is the schema?
df.dtypes

top          int64
btm          int64
mid        float64
age        float64
Cw B/Ca    float64
CO3        float64
dtype: object

In [34]:
# get the mean value of each column
df.mean()

top        1092.583333
btm        1093.604167
mid        1093.093750
age          73.637500
Cw B/Ca     125.895833
CO3          85.506250
dtype: float64

In [35]:
# We can get a single column as a Series using python's getitem syntax on the DataFrame object.
df['CO3']

# or specify one column to get the mean of that data series only
df.CO3.mean()

# get number of data points
import numpy as np
np.size(df['CO3'])

48
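
The steps above can be tried without the data files. The sketch below rebuilds a tiny CSV in memory: the five data rows are copied from the `df.head()` output above, while the first two lines are placeholders for the core-name and units lines (their real content is not shown in this notebook). It then repeats the `skiprows=2` read, the column selection and the count.

```python
import io

import numpy as np
import pandas as pd

# Lines 1 and 2 are placeholders for the core-name and units lines;
# line 3 holds the column names, followed by five rows from df.head().
csv_text = """placeholder core name
placeholder units row
top,btm,mid,age,Cw B/Ca,CO3
815,816,815.5,51.9,123.4,83.3
853,854,853.5,54.6,128.8,88.0
916,917,916.5,60.7,114.0,75.0
925,926,925.5,61.4,113.3,74.5
936,937,936.5,62.3,111.3,72.7
"""

# Skip the two leading lines so that the third line becomes the header.
df = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# Select one column as a Series and reduce it.
mean_co3 = df["CO3"].mean()   # mean of the five CO3 values, about 78.7
n_points = np.size(df["CO3"])

print(mean_co3)
print(n_points)  # 5
```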

### Sequential code: Mean CO3 Per Core

The cells above compute the mean CO3 for one core. Here we extend that to all cores using a sequential for loop.

In [36]:
from glob import glob
filenames = sorted(glob('Nature_geo_csv/*-PH.csv'))
filenames

['Nature_geo_csv/EW9209-2JPC-PH.csv',
 'Nature_geo_csv/MD01-2446-PH.csv',
 'Nature_geo_csv/MD95-2039-PH.csv',
 'Nature_geo_csv/RC13-228-PH.csv',
 'Nature_geo_csv/RC13-229-PH.csv',
 'Nature_geo_csv/RC16-59-PH.csv',
 'Nature_geo_csv/TNO57-21-PH.csv']

In [37]:
%%time
means = []
counts = []
for fn in filenames:
    # Read in file
    df = pd.read_csv(fn,skiprows=2)
    
    # Get the mean CO3 for each core
    mean_CO3_each = df.CO3.mean()

    # Count how many data points in each core
    count = np.size(df['CO3'])

    # Save the intermediates
    means.append(mean_CO3_each)
    counts.append(count)

# Combine intermediates: mean of the per-core means and total number of data points
mean_CO3 = np.mean(means)
n_dpoints = sum(counts)

CPU times: user 19.4 ms, sys: 3.62 ms, total: 23.1 ms
Wall time: 23.6 ms


In [38]:
means

[92.66666666666667,
 97.8157894736842,
 106.03571428571429,
 90.16,
 80.31818181818181,
 94.51515151515152,
 85.50625000000002]

In [39]:
mean_CO3
n_dpoints

263
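
Note that `np.mean(means)` averages the per-core means with equal weight, regardless of how many data points each core contributes. If a mean over all individual data points is wanted instead, one option is to weight each core's mean by its count. A small sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np

# Hypothetical per-core means and data-point counts.
means = [90.0, 100.0]
counts = [1, 3]

# Unweighted: every core counts equally.
mean_of_means = np.mean(means)

# Weighted by count: equivalent to the mean over all individual data points.
pooled_mean = np.average(means, weights=counts)

print(mean_of_means)  # 95.0
print(pooled_mean)    # 97.5
```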

### Parallelize the code above

Use `dask.delayed` to parallelize the code above. There are a few extra things you will need to know.

Methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.

```
x = delayed(np.arange)(10)
y = (x + 1)[::2].sum()  # everything here was delayed
```
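
A self-contained version of the snippet above, with the imports it needs and a final `.compute()` call to obtain the value:

```python
import numpy as np
from dask import delayed

x = delayed(np.arange)(10)  # lazy array 0..9
y = (x + 1)[::2].sum()      # lazy: add 1, take every second element, sum

result = y.compute()        # 1 + 3 + 5 + 7 + 9
print(result)  # 25
```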

Calling the `.compute()` method works well when you have a single output. When you have multiple outputs you might want to use the `dask.compute` function:

```
x = delayed(np.arange)(10)
y = x ** 2
min_, max_ = compute(y.min(), y.max())
min_, max_
(0, 81)
```
This way Dask can share the intermediate values (like `y = x**2`).

Your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize part of the computation to check that you are doing it correctly. This is just one way of using `delayed`; there are several ways to do this.
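
To see the sharing of intermediate values in action, the sketch below counts how often an intermediate step actually runs. The helper `square` and the `calls` list are hypothetical and only serve this illustration. The threaded scheduler is requested explicitly so that the counter lives in the same process as the tasks.

```python
import numpy as np
from dask import compute, delayed

calls = []  # records each time square() actually runs


def square(x):
    calls.append(1)
    return x ** 2


y = delayed(square)(np.arange(10))

# One compute() call for both outputs: square() runs once and is shared.
min_, max_ = compute(y.min(), y.max(), scheduler="threads")
calls_together = len(calls)

# Separate .compute() calls: square() runs again for each output.
calls.clear()
y.min().compute(scheduler="threads")
y.max().compute(scheduler="threads")
calls_separate = len(calls)

print(min_, max_)                      # 0 81
print(calls_together, calls_separate)  # 1 2
```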

In [40]:
from dask import compute
from dask import delayed

In [41]:
%%time

means = []
counts = []
for fn in filenames:
    # Read in file
    df = delayed(pd.read_csv)(fn,skiprows=2)
    
    # Get the mean CO3 for each core
    mean_CO3_each = df.CO3.mean()

    # Count how many data points in each core
    # (attribute access on a delayed object stays lazy)
    count = df['CO3'].size

    # Save the intermediates
    means.append(mean_CO3_each)
    counts.append(count)

# Compute the intermediates
means, counts = compute(means, counts)

# Combine intermediates: mean of the per-core means and total number of data points
mean_CO3 = np.mean(means)
n_dpoints = sum(counts)

FileNotFoundError: [Errno 2] File Nature_geo_csv/MD95-2039-PH.csv does not exist: 'Nature_geo_csv/MD95-2039-PH.csv'

In [42]:
mean_CO3

92.43110767991409

### Close the client

Before moving on to the next exercise, make sure to close your client or stop this kernel.

In [43]:
client.close()

### Summary

This example shows how to process multiple tabular datasets with Pandas in parallel using Dask's `delayed` feature.

## Reference

https://tutorial.dask.org