# Parallelization Testing

In this notebook, I will learn how to use dask within xarray to parallelize code and speed up parts of the Argo analysis. I'll start by running a simple test case that I hope to find in xarray's documentation. If this works, I will then move on to running the depth-to-density interpolation function to see whether that gets faster too.

In [1]:
import xarray as xr
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.path import Path
import seaborn as sns
import pandas as pd
import numpy as np
from importlib import reload
import cartopy.crs as ccrs
import cmocean.cm as cmo
import gsw
import dask.array as da

In [2]:
import os
os.chdir('/home/jovyan/argo-intern/funcs')
import density_funcs as df
import EV_funcs as ef
import filt_funcs as ff
import plot_funcs as pf
import processing_funcs as prf

In [3]:
reload(df)
reload(ef)
reload(ff)
reload(prf)

<module 'processing_funcs' from '/home/jovyan/argo-intern/funcs/processing_funcs.py'>

# Reproducible Test

The goal here is to make a really big array and then compare loading it with dask against loading it without dask. I'm following the rough steps Stephan Hoyer outlines in this blog post (https://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/), including creating a dataset with the same dimensions:

Dimensions:(latitude: 256, longitude: 512, time: 52596)

In [4]:
from dask.distributed import Client

In [5]:
client = Client()
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: /user/andrewfagerheim/proxy/8787/status
Workers: 4, Total threads: 8, Total memory: 64.00 GiB
Status: running, Using processes: True

Scheduler:
Comm: tcp://127.0.0.1:39675, Started: Just now

Workers (each with Total threads: 2, Memory: 16.00 GiB):
Comm: tcp://127.0.0.1:36321, Dashboard: /user/andrewfagerheim/proxy/42409/status, Nanny: tcp://127.0.0.1:44807, Local directory: /tmp/dask-scratch-space/worker-0_kpcncs
Comm: tcp://127.0.0.1:45745, Dashboard: /user/andrewfagerheim/proxy/44443/status, Nanny: tcp://127.0.0.1:32999, Local directory: /tmp/dask-scratch-space/worker-2djcse58
Comm: tcp://127.0.0.1:40813, Dashboard: /user/andrewfagerheim/proxy/41275/status, Nanny: tcp://127.0.0.1:45813, Local directory: /tmp/dask-scratch-space/worker-zr1jtjtu
Comm: tcp://127.0.0.1:43315, Dashboard: /user/andrewfagerheim/proxy/36171/status, Nanny: tcp://127.0.0.1:39695, Local directory: /tmp/dask-scratch-space/worker-ugxnb5h9


It's also possible to use dask gateway and then call adapt(min, max) to set the minimum and maximum number of cores used in the calculation.

In [12]:
factor = 10

lat, lon, time = 256, 512, 52596*factor

In [13]:
#data = np.random.rand(lat,lon,time)
#data

In [14]:
data = da.random.random((time,lat,lon),chunks=(100,256,512))

In [15]:
ds = xr.Dataset(
    {
        "data": (["time", "latitude", "longitude"], data)
    },
    coords={
        "time": np.arange(time),
        "latitude": np.linspace(-90, 90, lat),
        "longitude": np.linspace(-180, 180, lon)
    }
)

In [16]:
ds

             Array                 Chunk
Bytes        513.63 GiB            100.00 MiB
Shape        (525960, 256, 512)    (100, 256, 512)
Dask graph   5260 chunks in 1 graph layer
Data type    float64 numpy.ndarray


In [17]:
%time result = ds.mean('time').compute()

This may cause some slowdown.
Consider scattering data ahead of time and using futures.


CPU times: user 1min 3s, sys: 5.62 s, total: 1min 8s
Wall time: 3min 27s


This isn't promising. The CPU time seems very close to the wall time. Maybe creating a much bigger array will make a difference? I'll try again with a larger array. If that doesn't work, I will need to find a different way to create the array, one that doesn't involve dask, because building it with dask obviously muddies the comparison against not using dask.

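Okay, this definitely isn't working. I'm reading wall time greater than CPU time as a sign that this is not running in parallel, though %time only counts CPU time in the notebook process and not in the dask worker processes, so it may not be a reliable test here. What else can we try?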

Another thought: it says "Host CPU: 47.7% used on 16 CPUs", which seems to suggest it was actually using multiple cores. Next, try creating an array in a way that doesn't involve dask, so I can compare run times between the two.
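A minimal sketch of why the CPU-versus-wall comparison above can mislead when the work happens in other processes. As far as I know, `%time` reports CPU time for the notebook process only, and the client above was started with "Using processes: True". The child process here is just a stand-in for a dask worker; the busy loop and its 0.5 s duration are assumptions for illustration, not anything from the Argo code.

```python
import subprocess
import sys
import time

# Code run in a separate Python process (a stand-in for a dask worker):
# it keeps one core busy for about half a second.
child_code = (
    "import time\n"
    "t = time.perf_counter()\n"
    "while time.perf_counter() - t < 0.5:\n"
    "    pass\n"
)

wall_start = time.perf_counter()
cpu_start = time.process_time()  # CPU time of *this* process only

# The parent mostly sleeps while the child does the work.
subprocess.run([sys.executable, "-c", child_code], check=True)

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall time: {wall:.2f} s, CPU time seen by the parent: {cpu:.2f} s")
```

The parent's CPU time stays small even though a core was fully busy, so with worker processes the dask dashboard or a system monitor is probably a better place to check whether several cores are in use.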

# Another Try

In [32]:
import numpy as np

# Define the shape and data type of the array
shape = (100000000, 1000)
dtype = 'float32'

# Create a memmap array
filename = 'large_array.nc'
large_array = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)

# Initialize the array with random values in chunks
chunk_size = 10000
for i in range(0, shape[0], chunk_size):
    large_array[i:i+chunk_size] = np.random.rand(chunk_size, shape[1])

# Flush changes to disk
large_array.flush()

# Access the array as needed
loaded_array = np.memmap(filename, dtype=dtype, mode='r', shape=shape)

In [33]:
np.shape(loaded_array)

(100000000, 1000)
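For reference, the full-size memmap above is 100000000 x 1000 float32 values, which is 400 GB on disk, and the file is raw binary even though it is named `.nc`. Below is a scaled-down sketch of the same write-in-chunks pattern, followed by a hand-rolled chunked mean, which is roughly the kind of reduction dask would do chunk by chunk. The small shape, the temporary file name, and the seeded generator are assumptions chosen so it runs quickly.

```python
import os
import tempfile

import numpy as np

# Hypothetical small shape so the sketch runs in well under a second.
shape = (20_000, 100)
dtype = "float32"
chunk_size = 5_000

# Raw binary file in a temporary directory (a memmap file is not netCDF).
filename = os.path.join(tempfile.mkdtemp(), "small_array.dat")

# Write random values one block of rows at a time.
arr = np.memmap(filename, dtype=dtype, mode="w+", shape=shape)
rng = np.random.default_rng(0)
for i in range(0, shape[0], chunk_size):
    stop = min(i + chunk_size, shape[0])  # also safe if sizes don't divide evenly
    arr[i:stop] = rng.random((stop - i, shape[1]), dtype=np.float32)
arr.flush()

# Reopen read-only and compute the mean without loading everything at once.
loaded = np.memmap(filename, dtype=dtype, mode="r", shape=shape)
total = 0.0
for i in range(0, shape[0], chunk_size):
    # Accumulate in float64 to limit rounding error.
    total += float(loaded[i:i + chunk_size].sum(dtype=np.float64))
chunked_mean = total / loaded.size

print(f"chunked mean: {chunked_mean:.6f}")
```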

In [18]:
np.arange(0,10000)

array([   0,    1,    2, ..., 9997, 9998, 9999])

In [35]:
ds = xr.Dataset(
    {
        "data": (["prof", "depth"], loaded_array)
    },
    coords={
        "prof": np.arange(0,100000000),
        "depth": np.arange(0, 2000, 2)
    }
)

In [None]:
ds.to_netcdf("test1.nc")

In [4]:
daskless = xr.open_dataset('test1.nc')
daskfull = xr.open_dataset('test1.nc',chunks={'prof':1000})

In [None]:
%time daskless.mean().compute()

In [None]:
%time daskfull.mean().compute()

In [68]:
natl = xr.open_dataset('/swot/SUM05/amf2288/sync-boxes/lon:(-25,-20)_lat:(-70,70)_ds_z.nc')
datl = xr.open_dataset('/swot/SUM05/amf2288/sync-boxes/lon:(-25,-20)_lat:(-70,70)_ds_z.nc').chunk({'N_PROF':1000})

In [73]:
%time float(natl.CT.mean())

CPU times: user 43.3 ms, sys: 18.1 ms, total: 61.4 ms
Wall time: 57.9 ms


6.800860854562184

In [74]:
%time float(datl.CT.mean())

CPU times: user 85.5 ms, sys: 66.1 ms, total: 152 ms
Wall time: 63.5 ms


6.800860854562181

In [69]:
%time natl.CT.groupby('LATITUDE').mean();

CPU times: user 6.36 s, sys: 228 ms, total: 6.59 s
Wall time: 6.59 s


In [76]:
%time datl.CT.groupby('LATITUDE').mean();

CPU times: user 17.1 s, sys: 96.9 ms, total: 17.2 s
Wall time: 17.2 s


Okay, something is not working as expected, because the xarray dataset loaded with dask takes longer than the one loaded without. A few thoughts:
- It's possible the chunks are too small, so the overhead added for each calculation overwhelms any advantage of running in parallel.
- Maybe it's not using multiple cores at all: the CPU time is about the same as wall time, which isn't a good sign.
- Maybe this isn't a time-consuming enough calculation for using dask to make a difference at all?

The first thing to look into is definitely the second bullet point. If the processes aren't running on multiple cores, then nothing else is going to work either.

Okay, I went to http://gyre.ldeo.columbia.edu:19999/#menu_users_submenu_cpu;theme=slate;help=true and the natl and datl runs both took right at (or slightly over) 100%. So I don't think anything is being parallelized. What to try next?
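One thing to check next is the chunk-size worry from the list above. Here is a rough sketch that counts chunks and estimates the size of each one for the arrays built earlier in this notebook. `chunk_stats` is a hypothetical helper, the float32 item size for test1.nc is an assumption carried over from the memmap, and the (25000, 1000) alternative is just an example value to compare against.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, size of one full chunk in MiB)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    return n_chunks, chunk_mib


# The random dask array: float64, chunks=(100, 256, 512).
n_random, mib_random = chunk_stats((525960, 256, 512), (100, 256, 512), 8)

# test1.nc opened with chunks={'prof': 1000}; depth stays as one chunk of 1000.
n_small, mib_small = chunk_stats((100_000_000, 1000), (1000, 1000), 4)

# A larger, purely illustrative chunk along prof.
n_big, mib_big = chunk_stats((100_000_000, 1000), (25_000, 1000), 4)

for label, n, mib in [
    ("random array", n_random, mib_random),
    ("test1, prof=1000", n_small, mib_small),
    ("test1, prof=25000", n_big, mib_big),
]:
    print(f"{label}: {n} chunks of about {mib:.2f} MiB")
```

With `chunks={'prof': 1000}` this comes to 100,000 chunks of about 3.8 MiB each, which is far smaller than the 100 MiB chunks of the random array, so per-task overhead is at least a plausible suspect.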