# Running Dask on the cluster

The Dask framework enables users to parallelize internal systems.
Not all computations fit into a big dataframe. Dask exposes lower-level APIs that let you build custom systems for in-house applications. This helps parallelize Python processes and can significantly improve their performance.
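
As a rough illustration of the lower-level API, the sketch below uses `dask.delayed` to turn ordinary Python functions into a task graph. The function names (`load`, `process`, `combine`) and the data are made up for this example; the threaded scheduler is chosen only so that it runs without a cluster.

```python
import dask


@dask.delayed
def load(n):
    # Hypothetical stand-in for reading one partition of data
    return list(range(n))


@dask.delayed
def process(values):
    # Hypothetical per-partition work: sum of squares
    return sum(v * v for v in values)


@dask.delayed
def combine(partials):
    # Reduce the per-partition results to a single number
    return sum(partials)


# Nothing runs yet: these calls only build a task graph
partials = [process(load(n)) for n in range(1, 5)]
total = combine(partials)

# Execute the graph; independent partitions can run in parallel
result = total.compute(scheduler="threads")
print(result)
```

When a `distributed.Client` is connected to a cluster, calling `compute()` without the `scheduler` argument should send the same graph to the cluster's workers instead.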

Dask Kubernetes deploys Dask workers on Kubernetes clusters using native Kubernetes APIs. It is designed to dynamically launch short-lived deployments of workers during the lifetime of a Python process.

For details, see the Dask Kubernetes documentation: https://kubernetes.dask.org/en/latest/

When a user runs Dask, the framework starts one or more pods that run in parallel on the cluster. Users can request a fixed number of workers, or set the minimum and maximum number of pods that Dask may open.
To scale to zero, set the minimum to 0. Once the job is done, the pods are deleted and their resources are freed.
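
The effect of the minimum and maximum bounds can be pictured as a clamp on the number of workers requested. The function below is a toy model written for this explanation, not Dask's actual adaptive algorithm; `target_workers` and `tasks_per_worker` are hypothetical names.

```python
import math


def target_workers(pending_tasks, tasks_per_worker, minimum, maximum):
    """Toy model: workers wanted for the backlog, clamped to [minimum, maximum]."""
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    return max(minimum, min(maximum, wanted))


# With minimum=0 an idle cluster shrinks to zero pods
idle = target_workers(0, 10, minimum=0, maximum=5)

# A large backlog is capped at the maximum
busy = target_workers(120, 10, minimum=0, maximum=5)

# With minimum=2 two pods stay up even when there is no work
floor = target_workers(0, 10, minimum=2, maximum=5)

print(idle, busy, floor)
```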

In [1]:
!pip install dask distributed
!pip install dask-kubernetes==0.10.0
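
The next cell loads `worker-spec.yml`, which is not shown in this notebook. The sketch below writes a minimal worker pod spec to that file. Every value in it (image, label, thread count, CPU and memory sizes) is a placeholder assumed for illustration, so adjust them to your environment. The file is written as JSON, which YAML parsers generally accept, so only the standard library is needed.

```python
import json
from pathlib import Path

# Hypothetical worker pod spec; all values are placeholders
worker_spec = {
    "kind": "Pod",
    "metadata": {"labels": {"app": "dask-worker"}},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "dask",
                "image": "daskdev/dask:latest",
                "imagePullPolicy": "IfNotPresent",
                # Command line that starts a Dask worker inside the pod
                "args": [
                    "dask-worker",
                    "--nthreads", "2",
                    "--memory-limit", "4GB",
                    "--death-timeout", "60",
                ],
                "resources": {
                    "limits": {"cpu": "2", "memory": "4G"},
                    "requests": {"cpu": "2", "memory": "4G"},
                },
            }
        ],
    },
}

# Write the spec where the next cell expects to find it
spec_path = Path("worker-spec.yml")
spec_path.write_text(json.dumps(worker_spec, indent=2))
```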

In [2]:
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(4)  # specify the number of workers explicitly

cluster.adapt(minimum=2, maximum=5)  # or dynamically scale based on current workload

To view the running pods:

In [3]:
!kubectl -n default-tenant get pods | grep dask

In [4]:
# Example usage
import distributed
import dask.array as da

# Connect dask to the cluster
client = distributed.Client(cluster)

# Create an array and calculate the mean
array = da.ones((1000, 1000, 1000), chunks=(100, 100, 10))
print(array.mean().compute())  # Should print 1.0