# Parallelization Testing

In this notebook, I will learn how to use dask within xarray to parallelize code and speed up parts of the Argo analysis. I'll start by running a simple test case (which I hope to find) in xarray's documentation. If this works successfully, I will then move on to running the depth-to-density interpolation function to see if that comes with speed improvements too.

In [21]:
import xarray as xr
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.path import Path
import seaborn as sns
import seaborn
import pandas as pd
import numpy as np
from importlib import reload
import cartopy.crs as ccrs
import cmocean.cm as cmo
import gsw
import dask.array as da

In [22]:
import os
os.chdir('/home.ufs/amf2288/argo-intern/funcs')
import density_funcs as df
import EV_funcs as ef
import filt_funcs as ff
import plot_funcs as pf
import processing_funcs as prf

In [23]:
reload(df)
reload(ef)
reload(ff)
reload(prf)

<module 'processing_funcs' from '/home/amf2288/argo-intern/funcs/processing_funcs.py'>

# Reproducible Test

The goal here is to make a really big array and then test loading it with dask vs. loading it without dask. I'm following the rough steps Stephan Hoyer outlines in this blog post (https://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/), including creating a dataset with the same dimensions as the one in that post:

Dimensions:(latitude: 256, longitude: 512, time: 52596)

In [24]:
from dask.distributed import Client

In [25]:
client = Client()
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 9, Total threads: 72, Total memory: 0.98 TiB
Status: running, Using processes: True
Comm: tcp://127.0.0.1:42894, Started: Just now
Per worker (9 listed): Total threads: 8, Memory: 111.95 GiB, Local directory: /tmp/dask-scratch-space-2594/worker-*

I could also use Dask Gateway here, then call adapt(min, max) to bound the number of workers (and hence cores) used in the calculation.

In [26]:
factor = 10

lat, lon, time = 256, 512, 52596*factor

In [27]:
#data = np.random.rand(lat,lon,time)
#data

In [28]:
data = da.random.random((time,lat,lon),chunks=(100,256,512))

In [29]:
ds = xr.Dataset(
    {
        "data": (["time", "latitude", "longitude"], data)
    },
    coords={
        "time": np.arange(time),
        "latitude": np.linspace(-90, 90, lat),
        "longitude": np.linspace(-180, 180, lon)
    }
)

In [30]:
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 513.63 GiB | 100.00 MiB |
| Shape | (525960, 256, 512) | (100, 256, 512) |
| Dask graph | 5260 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


In [31]:
%time result = ds.mean('time').compute()

CPU times: user 44.8 s, sys: 17.1 s, total: 1min 1s
Wall time: 1min 14s


Okay, this is a sizable dataset, with 525960*256*512 (about 6.9e10) data points. The calculation was parallelized, running on ~60 cores for its duration. Great! So we know dask + xarray is working here.
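
As a quick sanity check on the sizes reported in the dataset repr above, the arithmetic can be reproduced by hand. This is a minimal sketch that assumes float64 values (8 bytes each), as the repr indicates; the variable names are illustrative only.

```python
import math

# Dimensions and chunking used for the test dataset above.
shape = (52596 * 10, 256, 512)   # (time, latitude, longitude)
chunks = (100, 256, 512)
bytes_per_value = 8              # assumption: float64, per the dataset repr

n_points = math.prod(shape)
total_gib = n_points * bytes_per_value / 2**30
chunk_mib = math.prod(chunks) * bytes_per_value / 2**20
# Number of chunks along each dimension, multiplied together.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(f"{n_points:.2e} points, {total_gib:.2f} GiB total")
print(f"{chunk_mib:.2f} MiB per chunk, {n_chunks} chunks")
```

The results agree with the 513.63 GiB array, 100.00 MiB chunks, and 5260 chunks that xarray reports.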

## Shape of Argo Data

In [32]:
depth, prof = 1000, 1000000000

In [33]:
data = da.random.random((depth,prof), chunks=(1000,1000))

In [34]:
ds = xr.Dataset({'data':(['depth','prof'],data)},
               coords={'depth':np.arange(depth),
                      'prof':np.arange(prof)})

In [48]:
prf.get_MLD()

TypeError: get_MLD() missing 1 required positional argument: 'ds'

In [36]:
%time result=ds.mean('prof').compute()

CPU times: user 2.23 s, sys: 392 ms, total: 2.62 s
Wall time: 2.63 s


In [37]:
%time ds_mld = prf.get_MLD(ds, variable='data',dim1='prof',dim2='depth').compute()

Exception ignored in: <bound method GCDiagnosis._gc_callback of <distributed.utils_perf.GCDiagnosis object at 0x7f426c61fee0>>
Traceback (most recent call last):
  File "/home/amf2288/mambaforge-pypy3/envs/argo_Aug_23/lib/python3.10/site-packages/distributed/utils_perf.py", line 176, in _gc_callback
Process Dask Worker process (from Nanny):
2024-10-11 13:14:03,277 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-10-11 13:14:03,278 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-10-11 13:14:03,278 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
2024-10-11 13:14:03,278 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
2024-10-11 13:1

KeyboardInterrupt: 

In [38]:
ds_mld

NameError: name 'ds_mld' is not defined

#### For prof = 1,000,000:

- .mean(): Interestingly, there was a short spike to 20 cores during initial set-up with this roughly Argo-sized dataset, but the rest of the calculation seemingly ran on 1-2 cores. I wonder if this dataset is small enough that dask determines parallelization isn't necessary. I will now try a calculation that takes more time to see if that triggers using more cores.

- get_MLD(): I ran a dataset of this size through the function that adds MLD, and it also only ran on 1-2 cores. This is strange because it took a long time to run (TIME HERE). I'm wondering if this is related to how I've written the function. For example, I use .values within the function, and I wonder if that is very slow because it loads the data into memory. Does this play into anything?

- MAJOR UPDATE: it appears dask won't automatically parallelize the work inside a function like this one. To run the function in parallel, I need to tell dask to do so explicitly using dask.delayed. Here's the documentation on this: https://examples.dask.org/delayed.html
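
The dask.delayed approach mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_mld` is a hypothetical stand-in (depth of the largest vertical gradient), not the real `prf.get_MLD`, and splitting the profiles into four batches is an arbitrary choice.

```python
import dask
import numpy as np

def toy_mld(profiles, depth):
    """Hypothetical stand-in for get_MLD: depth of the largest
    vertical gradient in each profile (profiles has shape depth x prof)."""
    grad = np.abs(np.diff(profiles, axis=0))
    return depth[np.argmax(grad, axis=0)]

depth = np.arange(50.0)
rng = np.random.default_rng(0)
profiles = rng.random((depth.size, 1000))

# Split along the profile axis and wrap each call in dask.delayed,
# so dask can schedule the batches as independent tasks.
batches = np.array_split(profiles, 4, axis=1)
tasks = [dask.delayed(toy_mld)(batch, depth) for batch in batches]

# Nothing runs until compute() is called.
parallel = np.concatenate(dask.compute(*tasks))

# The batched result should match a single serial call.
serial = toy_mld(profiles, depth)
print(np.array_equal(parallel, serial))
```

Because each profile is processed independently here, batching along `prof` does not change the answer; a function that needs information across profiles would need a different split.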

#### For prof = 1,000,000,000

- .mean(): Okay, increasing prof to 1,000,000,000 triggers multiple cores to be used, in this case ~12.


# Argo Interpolation Test

In [10]:
atl = xr.open_dataset('/swot/SUM05/amf2288/sync-boxes/lon:(-25,-23)_lat:(-70,70)_ds_z.nc',chunks={'N_PROF':2000})

In [8]:
atl = prf.get_MLD(atl)

Calling this function only used one core. At this point, it's hard to tell whether that's because the dataset isn't big enough for dask to deem multiple cores necessary, or because there's something deeper going on. I'm wondering if I should define an Argo-style dataset that's big enough to trigger multiple cores, then pass it to a function to see whether the function also uses multiple cores.
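
One way to narrow this down is to check whether an operation keeps the data as a lazy dask array or pulls it into NumPy. The sketch below uses a small synthetic (depth, prof) array as a stand-in for the Argo data; the names `toy`, `lazy`, and `eager` are illustrative and not part of the analysis code.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Small dask-backed stand-in for an Argo-style (depth, prof) array.
toy = xr.DataArray(
    da.random.random((50, 4000), chunks=(50, 1000)),
    dims=("depth", "prof"),
)

# xarray reductions on dask-backed data only build a task graph.
lazy = toy.mean("depth")

# .values computes the whole array into NumPy before the mean runs,
# so the mean itself is no longer a dask computation.
eager = toy.values.mean(axis=0)

print(type(lazy.data))   # dask array: still chunked, not yet computed
print(type(eager))       # numpy.ndarray: already in memory
```

If a function returns NumPy-backed output from dask-backed input, some step inside it has already forced the data into memory.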

In [None]:
print('max: {}, min: {}'.format(atl.SIG0.max().values, atl.SIG0.min().values))

In [None]:
atl_grid = np.linspace(21,28,1000)

In [None]:
number=np.arange(0,len(atl.N_PROF))
atl = atl.sortby('LATITUDE')  # sortby returns a new object, so reassign it
atl.coords['N_PROF_NEW']=xr.DataArray(number,dims=atl.N_PROF.dims)

In [None]:
%time rho_atl= df.interpolate2density_prof(atl, atl_grid)