<img src="images/dask-logo.svg" width="20%" align="right"/>

# Introduction to scalable computing with Dask

In this notebook, we'll introduce the dataset and some basic principles of scaling with Dask.

---

**☝🏽 Important note:**

Big data analysis always starts with a manageable subset of the data. This allows you to:

* Explore it with familiar tools like NumPy and pandas, and
* Experiment with various computations you wish to do faster.

Once your computations and pipelines are ready, you can focus on scaling up.

To make the most of our time here, we will skip this part and jump right to scaling. If you are curious, you can take a look at the full version of the tutorial at [nebari-dev/big-data-tutorial](https://github.com/nebari-dev/big-data-tutorial).


## Introduce dataset: Airline on-time performance data

In this tutorial, we will analyze the **["airline on-time performance" dataset](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ) -- a collection of flight records maintained by the U.S. Department of Transportation's Bureau of Transportation Statistics (BTS)**.

This dataset provides information about the on-time performance of domestic flights operated by large air carriers in the United States, including flight delays, cancellations, and diversions. It covers flights operated by 23 major airlines, with records from 1987 to the present day.

We will work with data from 2003-2022, which is ~70 GB in size on disk.

The data is stored as one CSV file per month for each year:

<img src="./images/csv-files.png">

## Motivation: Need for scale

Libraries like NumPy and pandas are extremely powerful; however, they need your data "in memory" (i.e., it must fit in your local RAM).
This means you won't be able to load the full dataset in pandas without the kernel crashing:

In [None]:
# Note: Your kernel will restart if you execute this cell.
# Uncomment the code below to try for yourself. :)

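# import json
# import gcsfs  # assumption: `fs` below is a gcsfs filesystem object
# import pandas as pd

# fs = gcsfs.GCSFileSystem()
# files = [f"gcs://{f}" for f in fs.glob("quansight-datasets/airline-ontime-performance/csv/*.csv")]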

# with open('prep/dtypes.json', 'r') as f:
#     dtypes = json.load(f)

# df_list = []
# for file in files:
#     df_temp = pd.read_csv(file, dtype=dtypes)
#     df_list.append(df_temp)

## What is Dask?

A library for **parallel and distributed computing in Python**.

Dask provides a similar API to familiar PyData libraries (for example NumPy, pandas, scikit-learn) but runs the same computations in a parallel and/or distributed manner.

## Parallel computing

Computing parts of a workflow simultaneously. Typically, we use this term to describe single-machine parallelism, where your computation can be run simultaneously on various cores of a machine while sharing the same memory (RAM).
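
As a rough illustration of the idea, using only Python's standard library rather than Dask, the sketch below splits a list into chunks, sums each chunk in a separate worker thread, and then combines the partial results. The chunk size and the number of workers are arbitrary choices made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

numbers = list(range(1, 101))

# Split the data into four independent chunks of 25 numbers each.
chunk_size = 25
chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

# Each chunk is summed by a worker thread; all threads share the same memory.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(partial_sums, total)
```

In CPython, threads running pure-Python code are limited by the global interpreter lock, so this shows the split/compute/combine pattern rather than a guaranteed speed-up. Dask takes care of this kind of splitting and scheduling for you.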

## Dask DataFrame API

Dask has a few different APIs to parallelize different tools/activities. We will primarily cover Dask's DataFrame API, which parallelizes pandas, in this tutorial.

The idea is to provide an interface that is familiar from pandas, but leverage parallelism under the hood.

In [None]:
import dask.dataframe as dd

In [None]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*ber_2020.csv") # September, October, November, December - 2020

In [None]:
ddf

### Lazy evaluation

Dask evaluates your computations lazily, which is what allows it to "scale" them. Dask eagerly builds only the "logic" of your computation: which independent tasks can be executed in parallel, and what the dependency tree (called a "task graph" in Dask) looks like.

In the previous cell, Dask has loaded only the metadata information for the DataFrame, but none of the actual values.

You can use `.compute()` to run/execute the computations:

In [None]:
ddf.MONTH.unique().compute()
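
To see laziness in isolation, here is a minimal sketch that uses `dask.delayed` instead of the DataFrame API. The functions `inc` and `add` and the `calls` list are made up for this illustration; the list only records which functions have actually run.

```python
import dask

calls = []  # records which functions have actually run


@dask.delayed
def inc(x):
    calls.append("inc")
    return x + 1


@dask.delayed
def add(a, b):
    calls.append("add")
    return a + b


# Building the graph runs nothing: `total` is only a recipe for the result.
total = add(inc(1), inc(2))
ran_before_compute = list(calls)  # still empty at this point

# .compute() executes the graph; the two `inc` tasks are independent of each other.
result = total.compute()
print(ran_before_compute, result, sorted(calls))
```

The two `inc` calls do not depend on each other, so a scheduler is free to run them in parallel, while `add` has to wait for both.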

Some functions, like `.head()`, also trigger an internal compute:

In [None]:
ddf.head() # ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

### Specify `dtypes`

Because Dask is lazy, it infers the datatypes using minimal information. For CSV files, Dask reads only a small sample from the start of the data, so a column whose values change type further down can be inferred incorrectly.

This behavior is different from pandas, which loads the entire dataset and then infers datatypes.

Hence, in Dask and in distributed computing in general, it's good practice to provide explicit dtypes, especially if you use CSV files. You can do this by exporting the dtypes of a subset of the data. We've already prepared the dtypes for this tutorial:

In [None]:
import json

with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

(Optional: You can take a look at `prep/dtypes.json`)
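
For reference, a dtype mapping of this kind could be produced from a small sample roughly as follows. This is a sketch under assumptions, not necessarily how `prep/dtypes.json` was generated: the inline CSV text, its values, and the `TAIL_NUM` column are made up for illustration.

```python
import io
import json

import pandas as pd

# Tiny stand-in for one monthly CSV file (values are made up).
csv_text = (
    "MONTH,DISTANCE,TAIL_NUM\n"
    "9,2475.0,N123AA\n"
    "9,337.0,N456BB\n"
)

# Read a manageable sample with pandas and let it infer the dtypes.
sample = pd.read_csv(io.StringIO(csv_text))

# Convert each dtype to its string name so the mapping is JSON-serializable.
dtypes = {column: str(dtype) for column, dtype in sample.dtypes.items()}
dtypes_json = json.dumps(dtypes, indent=2)
print(dtypes_json)

# The saved mapping can be loaded back and passed as `dtype=` when reading.
reloaded = pd.read_csv(io.StringIO(csv_text), dtype=json.loads(dtypes_json))
```

In practice the sample should be large enough to contain the awkward cases, such as missing values in a numeric column, so that the exported dtypes hold for the full dataset.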

In [None]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*ber_2020.csv", dtype=dtypes)

In [None]:
ddf.head() # No warnings or errors :)

In [None]:
list(ddf.columns)

When we do computations, Dask keeps track of the parallel logic based on what it expects the output structure for each operation to look like.

In [None]:
count = ddf.count()
count

In the following task graph, tasks in the same horizontal layer are independent of each other and can be executed in parallel.

In [None]:
count.visualize() # The task graph is displayed and also saved as a file "mydask.png"

We can execute this workflow with `compute()`:

In [None]:
%%time

count.compute()

### Partitions

Internally, a Dask DataFrame is a collection of pandas DataFrames (these are real pandas DataFrame objects, not look-alikes):

<img src="./images/dask-dataframe.svg" width="30%"/>

where each pandas DataFrame is called a "partition".

Your Dask computations will be run on all the individual pandas DataFrames in parallel, and then combined as necessary.

In [None]:
ddf.npartitions

Dask selects an adequate number of partitions based on your dataset and resource limits. If you use partitioned data formats, like Parquet (we'll learn more later!), Dask will preserve the partitions while reading data.

## Distributed computing

We can also leverage parallel computation on several different machines (workers) with their own processors and memory.
The different machines can interact to share data, and a central machine (scheduler) manages all the interactions.
Distributed computing refers to this system/model of computations.

These different machines can be located anywhere, on your local in-house network or in data centers around the world.

<img src="images/distributed-overview.png" width="50%"/>

## Dask Gateway

Dask Gateway is a library to manage Dask clusters on the cloud. The platform you're on (Nebari) has a Dask Gateway, and we'll create clusters using Google Cloud Platform machines in this tutorial.

<img src="images/gateway-architecture.svg" width="50%" />

First, import the library and create a new Gateway instance:

In [None]:
import dask_gateway

In [None]:
gateway = dask_gateway.Gateway()

Then set how your workers need to be configured, and make sure you select:
* the same `analyst/analyst-pydata-nyc-2023` environment as your current notebook (aka, client), and
* "Medium Worker" cluster profile.

In [None]:
options = gateway.cluster_options(use_local_defaults=False)
options

### Manual vs adaptive scaling

You can specify the exact number of machines required in your cluster, and Dask will spin all of them up at the beginning. We call this approach "Manual scaling".

Dask Gateway also has a very useful "adaptive scaling" feature. This allows Dask to spin up new machines as your workflow/computation needs it, and then shut idle workers down safely after the computation is complete, until the next computation is triggered. Adaptive scaling can help manage costs when you have large compute requirements.

In the widget below, select adaptive scaling, set 5 (min) and 10 (max), and click "Adapt":

In [None]:
cluster = gateway.new_cluster(options)
cluster
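
The scaling mode can also be set from code instead of the widget, using the cluster's `scale` and `adapt` methods. The helper below is a hypothetical sketch: `configure_scaling` is not part of Dask Gateway, and its `cluster` argument is assumed to be the object returned by `gateway.new_cluster(options)`.

```python
def configure_scaling(cluster, adaptive=True, minimum=5, maximum=10):
    """Apply adaptive or manual scaling to a Dask Gateway cluster.

    Hypothetical helper for illustration; `cluster` is assumed to expose
    `adapt(minimum=..., maximum=...)` and `scale(n)`.
    """
    if adaptive:
        # Let the cluster grow and shrink between the two bounds as needed.
        cluster.adapt(minimum=minimum, maximum=maximum)
    else:
        # Request a fixed number of workers up front.
        cluster.scale(maximum)


# Usage with the cluster created above:
# configure_scaling(cluster)                  # adaptive, 5 to 10 workers
# configure_scaling(cluster, adaptive=False)  # manual, exactly 10 workers
```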

Finally, you can connect this cluster of machines to this IPython notebook using a client:

In [None]:
client = cluster.get_client()
client

New machines come up gradually, and you can watch the Workers, Threads, and Memory values increase in the "GatewayCluster" widget.

## Dask Dashboard

The Client widget displays a link to a dashboard:
* Click on it, and a new Keycloak sign-up page should open
* Log in with the email and password you used to register
* The dashboard opens in the browser window

You will need to log in only once. The next time you click on the link, you should be able to access the dashboard directly. :)

You can also access these plots within JupyterLab:
* Click on the Dask logo in the left sidebar
* Click on the magnifying glass icon; the dashboard should connect automatically and display the available plots
* Open the "Cluster Map", "Task Stream", and "Progress" plots
* Re-arrange the plots in your JupyterLab interface to see them all together

Your instructor will talk about each plot as you work on the following computations!

## A quick computation

### 💻 Your turn: Compute the longest flight ("DISTANCE") across the dataset

Make sure to look at the dashboard plots :)

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
ddf["DISTANCE"].max().compute()

## Ensure cluster shutdown

Idle clusters can quickly run up costs, so make sure to always shut down your clusters after completing your work.

In [None]:
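client.close()      # disconnect this notebook from the scheduler first
cluster.shutdown()  # then stop the scheduler and release all workers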
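
One way to make sure the cleanup runs even when a computation fails is to wrap the cluster's lifetime in a context manager. The sketch below is a suggestion rather than part of the tutorial's setup: `temporary_cluster` is a hypothetical helper, and `gateway` and `options` are assumed to be the objects created earlier in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cluster(gateway, options):
    """Yield a client for a new cluster and always clean up afterwards."""
    cluster = gateway.new_cluster(options)
    try:
        client = cluster.get_client()
        try:
            yield client
        finally:
            # Runs even if the computation inside the `with` block raised.
            client.close()
    finally:
        # Always release the cluster's machines.
        cluster.shutdown()


# Usage (assumes `gateway`, `options`, and `ddf` from the cells above):
# with temporary_cluster(gateway, options) as client:
#     ddf["DISTANCE"].max().compute()
```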

---

## Next →

[Big data analysis with Dask](./02-big-data-analysis-with-dask.ipynb)