# Introduction to scalable computing with Dask

---

## What is Dask?

A library for **parallel and distributed computing in Python**.

Traditional libraries were designed for sequential workflows. Dask provides a similar API that runs the same computations in parallel.

## Parallel computing

Computing parts of a workflow simultaneously. Usually, this term describes single-machine parallelism, where your computation runs simultaneously on several cores that share the same memory (RAM).

In [None]:
from time import sleep

def add(x, y):
    sleep(1)
    return x + y

def mul(x, y):
    sleep(1)
    return x * y

In [None]:
%%time

a = add(5, 5)
b = add(10, 10)
c = mul(a, b)
c

In [None]:
import dask

parallel_add = dask.delayed(add)
parallel_mul = dask.delayed(mul)

In [None]:
%%time

a = parallel_add(5, 5)
b = parallel_add(10, 10)
c = parallel_mul(a, b)
c

### Lazy evaluation

Dask evaluates your computations lazily.

This means Dask initially only builds the logic of your computation (a task graph that describes how to run it in parallel), without computing any results.

In [None]:
c.visualize()

Dask then executes that logic only when you ask for the result:

In [None]:
%%time

c.compute()
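To check that nothing runs before `.compute()` is called, here is a minimal sketch that records each function call in a list. The names `calls`, `traced_add`, and `traced_mul` are hypothetical helpers introduced for illustration, and the threaded scheduler is requested explicitly so that the shared list is visible to the tasks.

```python
import dask

calls = []  # hypothetical log of which functions actually ran


def traced_add(x, y):
    calls.append("add")
    return x + y


def traced_mul(x, y):
    calls.append("mul")
    return x * y


# Building the graph: no function body is executed here.
a = dask.delayed(traced_add)(5, 5)
b = dask.delayed(traced_add)(10, 10)
c = dask.delayed(traced_mul)(a, b)

calls_before = list(calls)  # snapshot taken before any computation

# Only now are the three tasks executed.
result = c.compute(scheduler="threads")
calls_after = sorted(calls)
```

Comparing `calls_before` with `calls_after` shows that the functions ran only during `c.compute()`.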

## Dask DataFrame API

Dask has a few different APIs, but we will primarily cover Dask's DataFrame API in this tutorial.

The Dask DataFrame API parallelizes pandas.

In [None]:
import dask.dataframe as dd

In [None]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*2022.csv")

In [None]:
ddf

### Partitions

A Dask DataFrame is a collection of smaller DataFrames (internally, these are actual pandas DataFrames!):

<img src="./images/dask-dataframe.svg" width="30%"/>

where each pandas DataFrame is called a partition.

Your Dask computations will be run on all the partitions in parallel, and then combined as necessary.

In [None]:
ddf.npartitions

In [None]:
ddf.partitions[1]

Since we read CSV files with one month of data per file, our Dask DataFrame is partitioned such that each partition corresponds to one file.

## Distributed computing

Leveraging parallel computation on several different machines (workers), each with its own processors and memory. These machines can interact to share data, and a central machine called the scheduler manages everything.

<img src="images/distributed-overview.png" width="50%"/>
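The scheduler and worker roles can be tried on a single machine before moving to a real cluster. This is a minimal sketch that assumes the `distributed` package is installed. It uses `LocalCluster` as a stand-in for separate machines, with `processes=False` so that the scheduler and both workers live inside the current process. The names `remote_add`, `local_cluster`, and `local_client` are hypothetical.

```python
from dask.distributed import Client, LocalCluster


def remote_add(x, y):
    return x + y


# Assumption: an in-process cluster with two single-threaded workers;
# the dashboard is disabled to keep the example lightweight.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as local_cluster:
    with Client(local_cluster) as local_client:
        # The client sends the task to the scheduler, which assigns it to a worker.
        future = local_client.submit(remote_add, 5, 5)
        remote_result = future.result()
        n_workers = len(local_cluster.workers)
```

On a real deployment the workers run on different machines, but the client code for submitting work looks the same.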

## Dask Gateway

A library to manage Dask clusters on the cloud.

<img src="images/gateway-architecture.svg" width="50%" />

In [None]:
import dask_gateway

gateway = dask_gateway.Gateway()

In [None]:
options = gateway.cluster_options(use_local_defaults=False)
options

In [None]:
cluster = gateway.new_cluster(options)
cluster

## Adaptive scaling

TODO

In [None]:
client = cluster.get_client()
client

## Dask Dashboard

 - First time sign in
 - Open plots in JupyterLab
 - Important plots

## Read the full dataset

In [None]:
import dask.dataframe as dd

In [None]:
%%time

ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*")

In [None]:
ddf

In [None]:
ddf.head()

## Ensure cluster shutdown

In [None]:
client.close()  # disconnect the client from the scheduler
cluster.close(shutdown=True)  # stop the Gateway cluster itself

## Other Dask APIs

* Dask Array
* Dask Bag
* Dask Delayed and Futures

Check out the official Dask tutorial to learn more!