# Distributed IO

We'll see that applications limited by IO bandwidth can benefit from being spread widely across compute nodes, provided a [distributed filesystem](https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems) is used. (This is the case on virtually all HPC systems.)



## Technical preamble

Spin up a Jobqueue cluster with 6 workers on 6 different nodes.
(We ensure that each job lands on a different node by requesting more than 50% of a node's CPUs per job, so that no two jobs fit on the same node.)
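The node-separation argument is plain integer arithmetic. Here is a minimal sketch, assuming a hypothetical node size of 16 CPUs (the real node size of this cluster is not stated here) and the `cores=10` requested in the cluster definition:

```python
def max_jobs_per_node(node_cpus: int, cores_per_job: int) -> int:
    """Number of jobs of the given size that fit on one node (CPU count only)."""
    return node_cpus // cores_per_job


NODE_CPUS = 16      # assumption for illustration, not taken from this cluster
CORES_PER_JOB = 10  # matches cores=10 in the SLURMCluster call

# More than half of the node's CPUs per job means at most one job per node.
print(max_jobs_per_node(NODE_CPUS, CORES_PER_JOB))
```

This only considers CPU counts; memory limits or other scheduler constraints may restrict placement further.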

In [1]:
# absolute path of the SLURM log dir
log_dir = !echo ${PWD}/slurm-logs
log_dir = log_dir[0]
print(log_dir)

/gpfs/soma_interim/home/valerio/neuron/dask-jobqueue/notebooks/slurm-logs


In [2]:
import dask, dask.distributed
import dask_jobqueue

cluster = dask_jobqueue.SLURMCluster(

    # Dask worker size
    cores=10, memory='10GB',
    processes=1, # Dask workers per job
    
    # SLURM job script things
    queue='CPU', walltime='00:15:00',
    
    # Dask worker network and temporary storage
    interface='ib0', local_directory='/tmp', #'$TMPDIR',
    
    # make sure logs are stored away
    log_directory=log_dir,
)

client = dask.distributed.Client(cluster)
cluster.scale(jobs=6)

In [4]:
client

Client
Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.102.0.62:8787/status

Cluster
Dashboard: http://10.102.0.62:8787/status
Workers: 2
Total threads: 20
Total memory: 18.62 GiB

Scheduler
Comm: tcp://10.102.0.62:33713
Dashboard: http://10.102.0.62:8787/status
Started: Just now
Workers: 2
Total threads: 20
Total memory: 18.62 GiB

Worker
Comm: tcp://10.102.2.60:40466
Dashboard: http://10.102.2.60:37063/status
Nanny: tcp://10.102.2.60:39683
Total threads: 10
Memory: 9.31 GiB
Local directory: /tmp/dask-worker-space/worker-8_ya65rt

Worker
Comm: tcp://10.102.2.59:40996
Dashboard: http://10.102.2.59:41028/status
Nanny: tcp://10.102.2.59:35709
Total threads: 10
Memory: 9.31 GiB
Local directory: /tmp/dask-worker-space/worker-w3gx31yo


In [12]:
!squeue -u $USER

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             96187       CPU dask-wor  valerio  R       0:03      1 somacpu059
             96182       CPU dask-wor  valerio  R       1:45      1 somacpu060


In [11]:
cluster.scale(jobs=2)

## Create random data and write them to disk

In [13]:
from dask import array as darr

In [14]:
# 100 GB in chunks of 200 MB
random_data = darr.random.normal(
    size=(int(100_000_000_000 / 8), ),
    chunks=(int(200_000_000 / 8), )
)
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 93.13 GiB | 190.73 MiB |
| Shape | (12500000000,) | (25000000,) |
| Count | 500 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |
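The sizes in the array summary above follow from the byte counts passed to `darr.random.normal`. A small sketch of the arithmetic, which also shows why 100 GB (decimal) appears as 93.13 GiB (binary):

```python
ITEMSIZE = 8  # bytes per float64 element

total_bytes = 100_000_000_000  # 100 GB (decimal)
chunk_bytes = 200_000_000      # 200 MB (decimal)

n_elements = total_bytes // ITEMSIZE
chunk_elements = chunk_bytes // ITEMSIZE
n_chunks = -(-n_elements // chunk_elements)  # ceiling division

total_gib = total_bytes / 2**30  # bytes -> GiB
chunk_mib = chunk_bytes / 2**20  # bytes -> MiB

print(f"elements: {n_elements}, per chunk: {chunk_elements}, chunks: {n_chunks}")
print(f"total: {total_gib:.2f} GiB, chunk: {chunk_mib:.2f} MiB")
```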


In [21]:
!rm -rf random_data.zarr/

In [17]:
%time random_data.to_zarr("random_data.zarr")

CPU times: user 13.5 s, sys: 1.43 s, total: 14.9 s
Wall time: 2min 50s


In [18]:
!du -sh random_data.zarr/

87G	random_data.zarr/


## Find largest number with disk IO

We'll re-read the data and find the maximum on the fly.

Note in the Dask dashboard that we don't saturate CPU load.
This means we're limited by IO rather than compute.

In [19]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 93.13 GiB | 190.73 MiB |
| Shape | (12500000000,) | (25000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |


In [20]:
%time random_data.max().compute()

CPU times: user 28.2 s, sys: 3.09 s, total: 31.3 s
Wall time: 6min 25s


6.627577697895813

We've just read and digested about 87 GiB from disk, decompressed it to 100 GB (93 GiB), and found the maximum in 6 min 25 s.

That's approx. 0.26 GB/s.

## Decrease cluster size and see effect on IO bandwidth

In [13]:
cluster.scale(jobs=1)

In [14]:
client

Client
Scheduler: tcp://172.18.4.100:42645
Dashboard: http://172.18.4.100:8787/status

Cluster
Workers: 4
Cores: 68
Memory: 400.00 GB


In [15]:
random_data = darr.from_zarr("random_data.zarr/")
random_data

| | Array | Chunk |
|---|---|---|
| Bytes | 100.00 GB | 200.00 MB |
| Shape | (12500000000,) | (25000000,) |
| Count | 501 Tasks | 500 Chunks |
| Type | float64 | numpy.ndarray |


In [16]:
%time random_data.max().compute()

CPU times: user 4.19 s, sys: 187 ms, total: 4.37 s
Wall time: 11.5 s


6.452157854438557

We've just read and digested 90GB from disk, decompressed it to 100GB, and found the maximum in about 11.5 seconds.

That's approx. 9 GB/s.

## Increase cluster size again and see effect on IO bandwidth

In [17]:
cluster.scale(jobs=8)

In [18]:
client

Client
Scheduler: tcp://172.18.4.100:42645
Dashboard: http://172.18.4.100:8787/status

Cluster
Workers: 1
Cores: 17
Memory: 100.00 GB


In [19]:
%time random_data.max().compute()

CPU times: user 2.98 s, sys: 156 ms, total: 3.14 s
Wall time: 6.58 s


6.452157854438557
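The wall times reported by `%time` in the cells above can be converted into an effective throughput. The sketch below assumes the uncompressed size of 100 GB as the data volume, so it measures how fast the data was processed, not the raw disk bandwidth; the helper name `throughput_gb_per_s` is made up for this illustration.

```python
def throughput_gb_per_s(n_bytes: float, wall_seconds: float) -> float:
    """Effective throughput in decimal GB/s."""
    return n_bytes / wall_seconds / 1e9


UNCOMPRESSED_BYTES = 100_000_000_000  # size of the array in memory

# Wall times copied from the %time outputs shown in this notebook.
wall_times = {
    "6min 25s": 6 * 60 + 25,
    "11.5 s": 11.5,
    "6.58 s": 6.58,
}

for label, seconds in wall_times.items():
    rate = throughput_gb_per_s(UNCOMPRESSED_BYTES, seconds)
    print(f"{label:>8}: {rate:.2f} GB/s")
```

Note that these runs used different cluster sizes, so the rates are only comparable as a rough indication of how throughput changes with the number of nodes.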

## Bottom line

For IO-bound problems, we'd like to be able to scale horizontally (across more nodes) rather than vertically (more cores per node).

This could be addressed in the scheduler configuration: fill all nodes equally rather than keeping as many nodes as possible empty.
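The difference between the two placement policies can be illustrated with a toy model. This is not how SLURM places jobs; the function `place_jobs`, the strategy names `"pack"` and `"spread"`, and the node and job sizes are all hypothetical and chosen only for illustration.

```python
def place_jobs(n_jobs, cores_per_job, node_cpus, n_nodes, strategy):
    """Return the number of jobs placed on each node (toy model).

    strategy="pack":   use the first node that still has room
                       (keeps as many nodes as possible empty).
    strategy="spread": use the node with the most free CPUs
                       (fills all nodes equally).
    """
    free = [node_cpus] * n_nodes
    jobs = [0] * n_nodes
    for _ in range(n_jobs):
        candidates = [i for i in range(n_nodes) if free[i] >= cores_per_job]
        if not candidates:
            raise RuntimeError("not enough free CPUs for all jobs")
        if strategy == "pack":
            node = candidates[0]
        elif strategy == "spread":
            node = max(candidates, key=lambda i: free[i])
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        free[node] -= cores_per_job
        jobs[node] += 1
    return jobs


def nodes_used(jobs_per_node):
    """Count the nodes that received at least one job."""
    return sum(1 for n in jobs_per_node if n > 0)


# Hypothetical setup: 6 jobs with 4 cores each on 8 nodes with 16 CPUs each.
for strategy in ("pack", "spread"):
    placement = place_jobs(6, 4, 16, 8, strategy)
    print(strategy, placement, "nodes used:", nodes_used(placement))
```

In this toy setup, small jobs end up on few nodes under packing and on many nodes under spreading; for an IO-bound workload on a distributed filesystem, the latter gives access to more network links to storage.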

In [22]:
client.close()
cluster.close()