# Big data analysis with Dask

---

## Connect to your Dask Cluster

Instead of starting a new cluster each time, you can view and connect to existing clusters!

In [None]:
import dask_gateway

In [None]:
gateway = dask_gateway.Gateway()

List your previously created clusters:

In [None]:
gateway.list_clusters()

Connect to the cluster you want, with the previously selected options preserved:

In [None]:
cluster = gateway.connect(name="dev.695708ff76d74a959afc916f1d613960")

In [None]:
cluster

Finally, connect a client to the cluster so that you can access it from this notebook.

In [None]:
client = cluster.get_client()
client

Open the Dashboard plots: Cluster Map, Progress, Task Stream, and Workers Memory.

## Read the full dataset

We'll work with the Parquet datasets moving forward!

In [None]:
import dask.dataframe as dd

In [None]:
ddf = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/full_dataset.parquet")

In [None]:
ddf.head()

## Shuffling computations

### Your turn: Calculate the mean and median across the entire dataset.

Make sure to visualize and time your computations!

In [None]:
# Your code here

In [None]:
avg = ddf.mean()

In [None]:
avg.visualize(engine="cytoscape")

In [None]:
%%time

avg.compute()

In [None]:
med = ddf.median_approximate()

In [None]:
med.visualize(engine="cytoscape")

In [None]:
%%time

med.compute()

**Notice the difference in performance and the corresponding task streams?**

Computing the median takes much longer than computing the mean, because Dask's partition-based workflow makes certain types of operations very expensive.

Operations like median, sort, and set_index require dataset-wide interactions between partitions, called "shuffling", which makes them very difficult to parallelize. So, we try to minimize such operations on large datasets.

Compare the task streams, and see how many red bars the median computation has compared with the mean. Red means the same thing everywhere in the Dask dashboard: it always indicates networking costs from data transfer between workers. If you hover over the red bars, you'll see that their names start with `transfer-` or equivalent.

It's good practice to always **minimize shuffling in distributed computations**.
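To see why the mean parallelizes well and the median does not, here is a conceptual sketch in plain Python. The nested lists are made-up stand-ins for partitions; this illustrates the idea only and is not how Dask implements either operation.

```python
from statistics import mean, median

# Toy "partitions": each inner list stands in for one partition of a column.
partitions = [[1, 2, 3], [4, 5, 100], [6, 7, 8, 9]]
all_values = [v for part in partitions for v in part]

# Mean: each partition only needs to report (sum, count).
# Combining these small partial results gives the exact answer.
partials = [(sum(part), len(part)) for part in partitions]
combined_mean = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Median: combining per-partition medians does not, in general,
# give the true median, so the values themselves must be compared
# across partitions.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print("mean from partials:", combined_mean, "| true mean:", mean(all_values))
print("median of medians:", median_of_medians, "| true median:", true_median)
```

The mean needs only two numbers per partition, while an exact median needs values from different partitions to be compared with each other, which is where the data transfer comes from.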

### Calculate and plot the number of canceled flights each day

Note that this will take 5+ minutes to compute.

In [None]:
import hvplot
import hvplot.dask

In [None]:
hvplot.extension('bokeh')

In [None]:
%%time

ddf.groupby("FL_DATE")["CANCELLED"].count().hvplot() 

Three things stand out here:

1. it takes a long time to compute,
2. the plot lines look out of place,
3. there is a lot of interaction in the task stream.

All three happen because there's an internal "sort" operation being done.

These implicit sorts are quite common, and this is why exploratory plots and the Dask dashboard are so useful!

If you see these, it's good practice to sort your dataset once, store it, and then use the sorted dataset in your workflows.

## Read the sorted dataset

We have sorted and stored the dataset in the same GCP bucket :)

In [None]:
ddf = dd.read_parquet("gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year")

### Your turn: Run the previous computation on the sorted dataset and compare performance

In [None]:
# Your code here

In [None]:
%%time

ddf.groupby("FL_DATE")["CANCELLED"].count().hvplot() 

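## Persisting data and intermediate results

Calling `compute` runs a Dask computation and returns the result to your client as a pandas object. That isn't always what you want. `persist` also runs the computation, but keeps the results on the workers, which is useful when: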

- Your client machine (where the output will be displayed) doesn't have enough resources
- You have more computations to do with the data or intermediate results, and keeping them on the workers will optimize your overall workflow

**Data locality**: TODO

Open the "Task Graph" plots!

### Your turn: Compute and plot the number of departure and arrival delays per day (without persisting)

This is similar to the previous workflow in that you need to group by `FL_DATE`. Make sure to record how long this takes!

In [None]:
# Your code here

### With persisting:

In [None]:
# Groupby objects have no .persist(), so persist the DataFrame first and group it lazily
ddf_fld = ddf.persist().groupby("FL_DATE")

In [None]:
%%time

ddf_fld["DEP_DELAY"].count().hvplot()

In [None]:
%%time

ddf_fld["ARR_DELAY"].count().hvplot()

## Partitioning effectively

Our dataset currently has 641 partitions:

In [None]:
ddf.npartitions

You can change the number of partitions with: `ddf.repartition(npartitions=xx)`

### Your turn: Compute the total flights taken each year, comparing the performance with current partitions, ~300 partitions, and ~100 partitions.

In [None]:
# Your code here

In [None]:
ddf_300 = ddf.repartition(npartitions=300)

In [None]:
%%time

ddf_300

In [None]:
ddf_100 = ddf.repartition(npartitions=100)

When you store your data, ensure you have the optimal number of partitions based on your dataset, your computation, number of workers, and worker resources.

## `meta` keyword

### Your turn: Get all "DISTANCE" values in kilometers instead of miles

Note that this is similar to our earlier pandas operation.

In [None]:
# Your code here

In [None]:
ddf.DISTANCE.apply(lambda x: x*1.609344).compute()

`meta` is how Dask understands what the output looks like!

### Specify `meta`:

It's a good practice to always specify `meta` explicitly!

Further reading: [Understanding Dask’s meta keyword argument](https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument)

In [None]:
import pandas as pd

In [None]:
ddf.DISTANCE.apply(lambda x: x*1.609344, meta=pd.Series(dtype="float64")).compute()

In [None]:
# Close the client before shutting down the cluster it is connected to
# client.close()
# cluster.shutdown()

---

## Next

Big data visualizations!