# Introduction to scalable computing with Dask

---

## What is Dask?

A library for **parallel and distributed computing in Python**.

Traditional PyData libraries (for example NumPy, pandas, and scikit-learn) were designed for sequential workflows. Dask provides a similar API to run the same computations in parallel.

## Parallel computing

Computing parts of a workflow simultaneously. Typically, we use this term to describe single-machine parallelism, where your computation runs simultaneously on multiple cores that share the same memory (RAM).
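
As a rough illustration of the idea (using Python's standard library rather than Dask), the sketch below splits some made-up data into chunks and hands each chunk to a worker in a shared-memory pool. It only shows the structure of a parallel computation; in standard CPython, threads running pure-Python code are limited by the global interpreter lock, so don't expect a speedup from this particular example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(chunk):
    """Work done independently on one piece of the data."""
    return sum(chunk)


data = list(range(1_000))  # made-up data for illustration

# Split the data into four independent chunks of 250 items each.
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

# Each chunk is handed to a worker thread. All threads live in the same
# process, so they share the same memory (RAM).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(total)
```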

## Dask DataFrame API

Dask has a few different APIs to parallelize different tools/activities. We will primarily cover Dask's DataFrame API, which parallelizes pandas, in this tutorial.

The idea is to provide an interface that is familiar to pandas users while leveraging parallelism under the hood.

In [None]:
import dask.dataframe as dd

In [None]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*ber*2022.csv") # September, October, November, December - 2022

In [None]:
ddf

### Lazy evaluation

Dask evaluates your computations lazily, and this is what allows Dask to "scale" them. Dask eagerly builds only the "logic" of your computation: which independent tasks can be executed in parallel, and what the dependency tree (called a "task graph" in Dask) looks like.

In the previous cell, Dask has loaded only the metadata (such as column names and inferred dtypes), but none of the actual values.

When we do computations, Dask keeps track of the logic and presents what it expects the final output to look like.

In [None]:
add = ddf.sum()
add

In the following task graph, everything in the same horizontal layer will be executed in parallel.

In [None]:
add.visualize()

We can execute this workflow with `compute()`:

In [None]:
%%time

add.compute()
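
To make the build-first, compute-later idea concrete, here is a toy sketch in plain Python. The `Lazy` class and the `load` function are hypothetical names invented for this illustration; this is not how Dask is implemented, only a minimal model of recording work as a graph and running it on demand.

```python
class Lazy:
    """Hypothetical toy class: records a function call instead of running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve dependencies first, then run this task.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records when real work actually happens


def load(part):
    """Pretend to load one chunk of data (three made-up numbers)."""
    calls.append(part)
    return list(range(part * 3, part * 3 + 3))


# Building the graph runs nothing: three independent "load then sum" branches,
# followed by one task that combines the partial sums.
partials = [Lazy(sum, Lazy(load, p)) for p in range(3)]
total = Lazy(lambda *xs: sum(xs), *partials)
calls_before_compute = list(calls)

# Only now does every task in the graph run.
result = total.compute()
print(calls_before_compute, calls, result)
```

The three branches do not depend on each other, which is what would let a scheduler run them in parallel.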

Besides `.compute()`, some commands like `.head()` also trigger an internal compute.

In [None]:
ddf.head()  # Raises ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

### Specify `dtypes`

The lazy behavior of Dask means it infers the datatypes using minimal information. For CSV files, Dask infers dtypes from a sample taken from the start of the data rather than from the whole dataset.

This behavior differs from pandas, which loads the entire dataset and then infers dtypes.

It's good practice to provide explicit dtypes. You can do this by exporting the dtypes with pandas from the subset of data we looked at in notebook-01. We've already prepared the dtypes:
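
The following toy sketch (plain Python, not Dask's actual inference code) shows why inferring dtypes from only the beginning of the data can go wrong. The `infer_dtype` helper and the `DELAY` values are made up for illustration: the first rows look like integers, but a later row does not.

```python
import csv
import io


def infer_dtype(values):
    """Hypothetical helper: return the narrowest dtype name that fits all values."""
    for caster, name in ((int, "int64"), (float, "float64")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue  # this type does not fit; try the next, wider one
    return "object"


raw = "DELAY\n5\n12\n7\n3.5\n"  # made-up CSV content
rows = list(csv.DictReader(io.StringIO(raw)))
column = [r["DELAY"] for r in rows]

sample_dtype = infer_dtype(column[:3])  # inference from the first rows only
full_dtype = infer_dtype(column)        # inference from the whole column
print(sample_dtype, full_dtype)
```

When the dtype guessed from the start of the data disagrees with what is found later, you get the kind of mismatched-dtypes error shown above.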

In [None]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

(Optional: You can take a look at `prep/dtypes.json` to see how it was created)

In [None]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*", dtype=dtypes)

In [None]:
ddf.head() # No warnings or errors :)

### Partitions

Internally, a Dask DataFrame is a collection of pandas DataFrames (these are actual pandas DataFrames!):

<img src="./images/dask-dataframe.svg" width="30%"/>

where each pandas DataFrame is called a "partition".

Your Dask computations will be run on all the individual pandas DataFrames in parallel, and then combined as necessary.

In [None]:
ddf.npartitions

Since we read CSV files with one month of data per file, our Dask DataFrame is partitioned such that each partition corresponds to one file.

In [None]:
ddf.partitions[1]
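
The "run on each partition, then combine" pattern described above can be sketched with plain Python lists standing in for pandas DataFrames. The `partition` helper and the delay values are made up for illustration. Computing a mean is a useful case because the combine step needs per-partition sums and counts, not per-partition means.

```python
def partition(rows, n_partitions):
    """Hypothetical helper: split rows into contiguous, roughly equal chunks."""
    size = -(-len(rows) // n_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


delays = [5, 12, -3, 40, 0, 7, 18, 1]  # made-up values
parts = partition(delays, 3)

# Per-partition step: each partition is processed independently of the others.
stats = [(sum(p), len(p)) for p in parts]

# Combine step: merge the partial results into the final answer.
total = sum(s for s, _ in stats)
count = sum(n for _, n in stats)
mean_delay = total / count
print(parts, stats, mean_delay)
```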

## Distributed computing

We can also leverage parallel computation on several different machines (workers) with their own processors and memory. The different machines can interact to share data, and a central machine (scheduler) manages all the interactions. We call this process distributed computing.

These different machines can be located anywhere: on your local in-house network or in data centers around the world.

<img src="images/distributed-overview.png" width="50%"/>

## Dask Gateway

Dask Gateway is a library to manage Dask clusters on the cloud.

<img src="images/gateway-architecture.svg" width="50%" />

In [None]:
import dask_gateway

Create a new Gateway instance:

In [None]:
gateway = dask_gateway.Gateway()

Configure your workers, and make sure they have the same environment as your current notebook:

In [None]:
options = gateway.cluster_options(use_local_defaults=False)
options

### Manual vs adaptive scaling

You can specify the exact number of machines required, and Dask will spin all of them up at the beginning. Alternatively, Dask Gateway has a very useful "adaptive scaling" feature, which allows Dask to spin up new machines as your workflow needs them and tear them down after the computation.

Adaptive scaling can help manage costs when you have large compute requirements.

Select manual (~5) or adaptive (5-10) below:

In [None]:
cluster = gateway.new_cluster(options)
cluster
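
If you prefer code over the widget, scaling can also be set programmatically. The sketch below wraps the two choices in a hypothetical helper, `apply_scaling`, which is not part of Dask Gateway. It assumes `cluster` is a Dask Gateway cluster exposing `scale(n)` and `adapt(minimum=..., maximum=...)`; the defaults mirror the numbers suggested above.

```python
def apply_scaling(cluster, mode, n_workers=5, minimum=5, maximum=10):
    """Hypothetical helper: scale a cluster manually or adaptively.

    Assumes `cluster` provides `scale(n)` and `adapt(minimum=, maximum=)`,
    as a Dask Gateway cluster does.
    """
    if mode == "manual":
        # Request a fixed number of workers up front.
        cluster.scale(n_workers)
    elif mode == "adaptive":
        # Let the cluster grow and shrink between the two bounds.
        cluster.adapt(minimum=minimum, maximum=maximum)
    else:
        raise ValueError(f"Unknown scaling mode: {mode!r}")


# Example usage, once `cluster` exists:
# apply_scaling(cluster, "adaptive")
```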

Finally, you can connect this cluster of machines to this IPython notebook using a client:

In [None]:
client = cluster.get_client()
client

## Dask Dashboard

The Client widget displays a link to a dashboard:
* Click on it, and a new Keycloak login page should open
* Log in with the email and password you used to register
* The dashboard opens in the browser window

You will need to log in only once; the next time you click on the link, you should be able to access the dashboard directly. :)

You can also access these plots within JupyterLab:
* Click on the Dask logo in the left sidebar
* Click on the magnifying glass icon; the dashboard should connect automatically and display the available plots
* Open: Cluster map, task stream, and progress bar

## A quick computation

### 💻 Your turn: Compute the longest flight (distance) across the dataset

Make sure to look at the dashboard plots :)

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
ddf["DISTANCE"].max()            # lazy: only builds the task graph
ddf["DISTANCE"].max().compute()  # triggers the actual computation

## Ensure cluster shutdown

Idle clusters can quickly rack up costs, so make sure to shut down your clusters after completing your work.

In [None]:
cluster.shutdown()
client.close()
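
One way to make cleanup harder to forget is to wrap the cluster's lifetime in a context manager, so the shutdown runs even if a computation raises an error. The `managed_cluster` helper below is a hypothetical sketch, not part of Dask Gateway. It assumes `gateway` and `options` are the objects created earlier, and it closes the client before shutting down the cluster, which is an ordering choice of this sketch.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(gateway, options):
    """Hypothetical helper: yield (cluster, client) and always clean up."""
    cluster = gateway.new_cluster(options)
    try:
        client = cluster.get_client()
        try:
            yield cluster, client
        finally:
            # Runs whether or not the body raised an exception.
            client.close()
    finally:
        # Always release the cluster's machines, even on errors.
        cluster.shutdown()


# Example usage, assuming `gateway`, `options`, and `ddf` exist:
# with managed_cluster(gateway, options) as (cluster, client):
#     ddf["DISTANCE"].max().compute()
```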

---

## Next →

[Storage formats](./04-storage-formats.ipynb)!