<img src="images/dask-logo.svg" width="20%" align="right"/>

# Big data analysis with Dask

In this notebook, we'll work through some specific computations and learn some best practices for distributed computing.

---

_Note: Some cells carry "Wall time ..." comments. These timings will vary with your cluster, profiles, etc., so use them only as a rough reference._


## Connect to your Dask Cluster

Instead of starting a new cluster each time, you can view and connect to existing clusters.
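
As a rough sketch of this reuse-first pattern, the helper below connects to the first running cluster and only starts a new one when none exist. `connect_or_create` is a hypothetical name introduced here for illustration. It relies only on the `list_clusters`, `connect`, and `new_cluster` calls used in this notebook, and it assumes the first listed cluster is the one you want.

```python
def connect_or_create(gateway, options=None):
    """Reuse the first running cluster if there is one, otherwise start a new one.

    Hypothetical helper for illustration; it is not part of dask_gateway.
    """
    running = gateway.list_clusters()
    if running:
        # Previously selected options are preserved on an existing cluster
        return gateway.connect(running[0].name)
    # No active clusters: start a new one, optionally with explicit options
    if options is None:
        return gateway.new_cluster()
    return gateway.new_cluster(options)


# Usage (assumes a reachable Dask Gateway deployment):
# import dask_gateway
# cluster = connect_or_create(dask_gateway.Gateway())
```

This avoids the `IndexError` that indexing an empty cluster list would raise.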

In [None]:
import dask_gateway

In [None]:
gateway = dask_gateway.Gateway()

List your previously created clusters:

In [None]:
running_clusters = gateway.list_clusters()
running_clusters

Connect to the cluster using its name. The previously selected options will be preserved:

In [None]:
cluster = gateway.connect(running_clusters[0].name)

In [None]:
# # If there aren't any active clusters

# options = gateway.cluster_options(use_local_defaults=False)
# options.profile = "Medium Worker"
# options.conda_environment = "pycon2023/pycon2023-tutorial"

# cluster = gateway.new_cluster(options)
# cluster.adapt(minimum=5, maximum=10)

In [None]:
cluster

Finally, connect a client to the cluster so that you can access it from this specific IPython notebook.

In [None]:
client = cluster.get_client()
client

Open the Dashboard plots: Cluster Map, Progress, Task Stream, and Workers Memory.

## Read the full dataset

We'll work with the Parquet datasets moving forward!

In [None]:
import dask.dataframe as dd

In [None]:
ddf = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet")

In [None]:
ddf.head()

## Shuffling computations

### Calculate and plot the number of canceled flights each day

Note that the following section will take 5+ minutes to compute.

In [None]:
import hvplot
import hvplot.dask

In [None]:
hvplot.extension('bokeh') # ensure this cell executes before moving on

In [None]:
%%time

ddf.groupby("FL_DATE")["CANCELLED"].count().hvplot()

This computation:

1. takes a long time,
2. produces plot lines that look out of place, and
3. shows a lot of "red", i.e., interaction between workers, in the task stream

because there's an internal "sort" operation being done.

The partition-based workflow in Dask makes certain types of operations very expensive. Operations like `sort`, `set_index`, etc., require dataset-wide interactions between partitions, called "shuffling", which makes them very difficult to parallelize.

It's a good practice to always **minimize shuffling in distributed computations**.
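
To make the cost concrete, here is a toy model in plain Python. It is not Dask's implementation: the partitions, the rows, and both helper functions are made up for illustration. It contrasts a sort, where most rows must change partitions, with a per-partition aggregation, where only small summaries need to travel.

```python
from collections import Counter

# Toy model: each "partition" is a list of (date, cancelled) rows held by one worker.
partitions = [
    [("2020-01-03", 0), ("2020-01-01", 1), ("2020-01-02", 0)],
    [("2020-01-01", 0), ("2020-01-03", 1), ("2020-01-02", 1)],
    [("2020-01-02", 0), ("2020-01-01", 0), ("2020-01-03", 0)],
]


def rows_moved_by_sort(parts):
    """Count rows that must leave their partition so each partition holds one date.

    Assumes, for this toy only, that there are exactly as many dates as partitions.
    """
    dates = sorted({date for part in parts for date, _ in part})
    target = {date: i for i, date in enumerate(dates)}
    return sum(
        1
        for i, part in enumerate(parts)
        for date, _ in part
        if target[date] != i
    )


def partial_counts(parts):
    """Count rows per date inside each partition; only these summaries are exchanged."""
    return [Counter(date for date, _ in part) for part in parts]


moved = rows_moved_by_sort(partitions)
total = sum(partial_counts(partitions), Counter())

print(f"rows moved by a full sort: {moved} of 9")
print(f"rows per date from partial counts: {dict(total)}")
```

In this toy, sorting relocates most of the rows, while the aggregation only combines three small counters.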

The Dask dashboard is very helpful in catching these unexpected performance penalties.

Red has a consistent meaning across the Dask dashboard: it always indicates networking costs incurred when data is transferred between workers. If you hover over the red bars, you'll see their names start with `transfer-` or equivalent.

## Read the sorted dataset

If your computation needs some form of sorting or a change of index, consider doing it once at the beginning, then storing and reusing the sorted dataset.

We have sorted and stored the dataset in the same GCP bucket :)

In [None]:
ddf = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year")

### 💻 Your turn: Run the previous computation on the sorted dataset and compare them

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
%%time

ddf.groupby("FL_DATE")["CANCELLED"].count().hvplot() 

## Persisting data and intermediates

A "compute" in Dask creates a pandas/numeric output and brings the final result to your client machine. This is not always desirable, for example when:

- Your client machine (where the output will be displayed) doesn't have enough resources to store/display the output, or
- You have more computations to do with the data or intermediate results, and keeping them on the workers will optimize your overall workflow

### Data locality

Dask tries to assign computations following "data locality", where the computation goes to the worker that holds the required data.

Recall that data transfer between workers is one of the slowest parts of a workflow, so keeping computations close to their data avoids that cost.

### 💻 Your turn: Compute the number of departure and arrival delays per day (without persisting)

This is partially similar to the previous workflow where you need to groupby `FL_DATE`. Make sure to record how long this takes!

In [None]:
# Your code here. When ready, click on the three dots below for the solutions.

In [None]:
%%time

ddf.groupby("FL_DATE")["DEP_DELAY"].count().compute() # Wall time: 10.5 s

In [None]:
%%time

ddf.groupby("FL_DATE")["ARR_DELAY"].count().compute() # Wall time: 5.96 s

### Persist

Dask allows you to "persist" your data or intermediate outputs (as Dask objects) on the workers.

With a distributed cluster, this runs in the background and control is returned to you immediately.

### Run the same computation with persisting:

In [None]:
ddf_p = ddf[["FL_DATE", "DEP_DELAY", "ARR_DELAY"]].persist()

Open the "Graph" dashboard plot for the following computations!

In [None]:
ddf_p

Notice how you have the control back immediately, and you can continue working with a Dask DataFrame!

In [None]:
%%time

ddf_p.groupby("FL_DATE")["DEP_DELAY"].count().compute() # Wall time: 2.68 s

In [None]:
%%time

ddf_p.groupby("FL_DATE")["ARR_DELAY"].count().compute() # Wall time: 2.47 s

## Partitioning effectively

Our dataset currently has 1251 partitions:

In [None]:
ddf.npartitions

You can change the number of partitions with: `ddf.repartition(npartitions=xx)`

### 💻 Your turn: Compute the unique flights taken each day, comparing the performance with current partitions and ~600 partitions

In [None]:
# Your code here. When ready, click on the three dots below for the solutions.

In [None]:
ddf_full = ddf[["FL_DATE", "OP_UNIQUE_CARRIER"]].persist()

In [None]:
%%time

ddf_full.groupby("FL_DATE").OP_UNIQUE_CARRIER.count().compute()

In [None]:
ddf_600 = ddf_full.repartition(npartitions=600).persist()

In [None]:
%%time

ddf_600.groupby("FL_DATE").OP_UNIQUE_CARRIER.count().compute() 

Re-partitioning is an expensive operation. Notice all the red bars in the Task Stream.

Therefore, when you store your data, ensure you have the optimal number of partitions depending on your dataset, your computation, number of workers, and worker resources.

## `meta` keyword

Since Dask evaluates computations lazily, it uses a special `meta` property to keep track of the output structure of any computation.

### 💻 Your turn: Convert all negative values in DEP_DELAY to zeros

In [None]:
# Your code here. When ready, click on the three dots below for the solutions.

In [None]:
f = lambda x: 0 if x < 0 else x

ddf["DEP_DELAY"].apply(f).compute() # UserWarning: You did not provide metadata ...

This is a common warning. `f` is a custom function, so Dask cannot reliably predict the output structure. In such cases, it's best to specify `meta` explicitly.

### Specify `meta`:

You can specify `meta` with an empty/sample pandas data structure that has the appropriate column names and data types.

Optional, further reading: [Understanding Dask’s meta keyword argument](https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument)

In [None]:
import pandas as pd

In [None]:
f = lambda x: 0 if x < 0 else x

ddf["DEP_DELAY"].apply(f, meta=pd.Series(dtype="float64")).compute()

---

## Next →

After a 10-minute break, we'll look at [big data visualizations](./06-big-data-visualization.ipynb)!