# Using Dask distributed

In [14]:
from dask_jobqueue import PBSCluster

In [15]:
cluster = PBSCluster(project='p06010014')

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


Scale the cluster to request 10 workers.

In [16]:
cluster.scale(10)

In [17]:
from dask.distributed import Client

In [18]:
client = Client(cluster)
client

Client   Scheduler: tcp://128.117.181.197:37487   Dashboard: /proxy/38563/status
Cluster  Workers: 0   Cores: 0   Memory: 0 B


The default dashboard URL needs to be updated.

In [19]:
import dask

In [20]:
dask.config.get('distributed.dashboard.link')

'/proxy/{port}/status'

In [21]:
dask.config.set({'distributed.dashboard.link': "/proxy/{port}/status"});

In [22]:
dask.config.get('distributed.dashboard.link')

'/proxy/{port}/status'

In [23]:
client

Client   Scheduler: tcp://128.117.181.197:37487   Dashboard: /proxy/38563/status
Cluster  Workers: 0   Cores: 0   Memory: 0 B
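The dashboard link above is a template, and `{port}` is filled in with the dashboard port when the client is displayed. The sketch below shows that expansion with plain string formatting. `format_dashboard_link` here is a simplified stand-in written for illustration, not the function used inside distributed. The assumption that placeholders can also be filled from environment variables, and the `JUPYTERHUB_SERVICE_PREFIX` value shown, should be checked against your installed version.

```python
import os


def format_dashboard_link(template, host, port, env=None):
    """Expand a dashboard link template such as '/proxy/{port}/status'.

    Simplified stand-in for illustration only: placeholders are filled from
    the given environment mapping (assumption) plus the host and port.
    """
    values = dict(os.environ if env is None else env)
    values.update(host=host, port=port)
    return template.format(**values)


# Same template and port as in the notebook output above.
relative = format_dashboard_link("/proxy/{port}/status", "128.117.181.197", 38563)
print(relative)

# Hypothetical JupyterHub-style template; the prefix value is made up.
hub = format_dashboard_link(
    "{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status",
    "128.117.181.197",
    38563,
    env={"JUPYTERHUB_SERVICE_PREFIX": "/user/example/"},
)
print(hub)
```

A relative template such as `/proxy/{port}/status` resolves against whatever server hosts the notebook, which is presumably why it is more portable than an absolute JupyterHub URL.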


Create a test dataframe

In [24]:
import dask.dataframe as dd
df = dd.demo.make_timeseries()
df

                    id    name        x        y
npartitions=11
2000-01-31       int64  object  float64  float64
2000-02-29         ...     ...      ...      ...
...                ...     ...      ...      ...
2000-11-30         ...     ...      ...      ...
2000-12-31         ...     ...      ...      ...


Call persist to load the dataframe into worker memory; watch the dashboard to follow the computation live.

In [12]:
df = df.persist()

### Question: how do you release workers (i.e. cancel jobs)?

In [25]:
client.close()  # Runs without error, but doesn't seem to do anything?

In [26]:
cluster.close()  # Runs without error only if client.close() is called FIRST. This is the step that cancels the jobs.
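The order found above (client first, then cluster) is the order a single `with` statement produces, because context managers exit in reverse order of entry. The sketch below shows this with two stand-in classes. `FakeCluster` and `FakeClient` are hypothetical placeholders, not Dask classes; they only record when they are closed.

```python
closed = []  # records the order in which objects are shut down


class FakeCluster:
    """Hypothetical stand-in for a jobqueue cluster (not a Dask class)."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # A real cluster would cancel its batch jobs at this point.
        closed.append("cluster")


class FakeClient:
    """Hypothetical stand-in for a client attached to a cluster."""

    def __init__(self, cluster):
        self.cluster = cluster

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        closed.append("client")


# Entered left to right, exited right to left.
with FakeCluster() as cluster, FakeClient(cluster) as client:
    pass  # computations would go here

print(closed)  # ['client', 'cluster']
```

The real `Client` and cluster objects can also be used in `with` statements, so the same pattern should apply to them. It also runs the teardown if a cell raises an exception partway through.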

## Try the NCARCluster functionality

https://github.com/NCAR/ncar-jobqueue

In [27]:
from ncar_jobqueue import NCARCluster

In [28]:
cluster = NCARCluster(project='p06010014')

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


### Question: is there a way to get a random port each time I start a cluster? Starting a second cluster throws a warning because it tries to use port 8787 by default.
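One possible approach is to ask the operating system for an unused port before creating the cluster, then hand that port to the scheduler. The sketch below covers only the first half, using the standard library: binding to port 0 lets the OS choose a free port. The helper name `find_free_port` is made up for this example. How the port is passed to the cluster depends on the installed dask_jobqueue version, so the commented-out line is an assumption to verify, not a confirmed signature.

```python
import socket


def find_free_port(host=""):
    """Return a TCP port that was free at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 asks the OS to choose an unused port
        return sock.getsockname()[1]


port = find_free_port()
dashboard_address = f":{port}"
print(dashboard_address)

# Assumption, check against your dask_jobqueue version:
# cluster = PBSCluster(project='p06010014',
#                      scheduler_options={'dashboard_address': dashboard_address})
```

Another process could claim the port between this call and the scheduler start, so this reduces the chance of a clash but does not remove it.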

In [30]:
cluster.scale(5)

In [32]:
client = Client(cluster)

In [33]:
client

Client   Scheduler: tcp://128.117.181.197:42263   Dashboard: https://jupyterhub.ucar.edu/ch/user/kdagon/proxy/41041/status
Cluster  Workers: 5   Cores: 180   Memory: 545.00 GB


This dashboard URL only works on JupyterHub (possibly an NCARCluster default setting?).

In [34]:
dask.config.set({'distributed.dashboard.link': "/proxy/{port}/status"});

In [35]:
client

Client   Scheduler: tcp://128.117.181.197:42263   Dashboard: /proxy/41041/status
Cluster  Workers: 5   Cores: 180   Memory: 545.00 GB


In [37]:
# check versions (optional - for debugging)
#client.get_versions(check=True)['scheduler']

Create a test Dask array; watch the dashboard to see the various computations.

In [38]:
import dask.array as da

In [39]:
x = da.random.random((5000,5000), chunks=(500,500))

In [40]:
x = x.persist()

In [43]:
x.nbytes / 1e9

0.2
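The 0.2 GB figure can be checked by hand from the array shape, assuming the elements are 8-byte float64 values (the usual dtype for `da.random.random`). The same arithmetic gives the number of chunks and the size of each one, which is roughly what each task on the dashboard handles.

```python
shape = (5000, 5000)
chunks = (500, 500)
itemsize = 8  # bytes per float64 element (assumed dtype)

total_bytes = shape[0] * shape[1] * itemsize
total_gb = total_bytes / 1e9

# Number of chunks along each axis, multiplied together.
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])
chunk_mb = chunks[0] * chunks[1] * itemsize / 1e6

print(total_gb, n_chunks, chunk_mb)  # 0.2 100 2.0
```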

In [44]:
y = (x + x.T) - x.mean(axis=0)

In [45]:
y = y.persist()

In [46]:
y.sum().compute()

12499424.928402135
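The result is close to 12.5 million, which is what the expression predicts. Summing `x + x.T` gives twice the sum of `x`, and subtracting the column means from every row removes the sum of `x` once, so `y.sum()` equals `x.sum()`. For 25 million uniform values with mean 0.5, that is about 12.5e6. The sketch below checks the identity on a small matrix with plain Python lists; the size `n = 50` and the seed are arbitrary choices for illustration.

```python
import random

random.seed(0)
n = 50  # small stand-in for the 5000 x 5000 array

x = [[random.random() for _ in range(n)] for _ in range(n)]

# Equivalent of x.mean(axis=0): one mean per column.
col_mean = [sum(x[i][j] for i in range(n)) / n for j in range(n)]

# Equivalent of (x + x.T) - x.mean(axis=0), with the means broadcast across rows.
y = [[x[i][j] + x[j][i] - col_mean[j] for j in range(n)] for i in range(n)]

total_x = sum(map(sum, x))
total_y = sum(map(sum, y))

print(abs(total_y - total_x) < 1e-6)  # True
```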

In [47]:
client.close()
cluster.close()