<img src="https://hvplot.holoviz.org/_static/logo_horizontal.svg" width="20%" align="right"/>

# Big data visualization with Dask and hvPlot

In this notebook, we'll learn to use the hvPlot APIs with Dask DataFrames.

---

## Reconnect to our Dask Cluster

In [None]:
import dask_gateway
import dask.dataframe as dd

In [None]:
gateway = dask_gateway.Gateway()

In [None]:
if len(running_clusters := gateway.list_clusters())>0:
    cluster = gateway.connect(running_clusters[0].name)
else:
    cluster = gateway.new_cluster(conda_environment="global/global-data-of-unusual-size", profile="Medium Worker")
    cluster.adapt(1,10)

In [None]:
cluster

In [None]:
client = cluster.get_client()
client

## Load a subset of columns

In [None]:
columns = [
    'YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER', 
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED', 
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
    'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 
    'LATE_AIRCRAFT_DELAY', 'DIV_ARR_DELAY'
]

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/full_dataset.parquet",
    columns=columns
)

In [None]:
flights.head()

## hvPlot + Dask

To use hvPlot's built-in Dask integration, we need to swap:

`import hvplot.pandas` for `import hvplot.dask` 

In [None]:
import hvplot.dask
hvplot.extension('bokeh')

### Plot the number of flights with a recorded departure delay per day for the entire dataset

In [None]:
flights.groupby('FL_DATE')['DEP_DELAY'].count().hvplot()

### 💻 Your turn: Visualize the mean of any variable in the dataset by day of the week

You can use any plot type from the [hvPlot Gallery](https://hvplot.holoviz.org/reference/index.html).

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
flights.groupby('DAY_OF_WEEK')['ARR_DELAY'].mean().hvplot.scatter(x="DAY_OF_WEEK", y='ARR_DELAY')

## Plotting large datasets

In the visualization of daily counts above, a lot of Dask computation happened before the plot appeared. Once the plot was generated, however, panning and zooming did not trigger any new Dask computations.

This is because the result of the groupby has only about `20 years * 365 days` rows, so it fits completely in memory.

Now let's look at the entire dataset:

In [None]:
print(f"The full dataset has {len(flights)/1e6:.2f} million rows")

If we tried to send this many data points to the browser for visualization in a plot, the browser would run out of memory and crash.
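A rough back-of-the-envelope calculation shows the difference in scale. This is only a sketch: it assumes each plotted point is sent as two 64-bit floats (16 bytes), ignores any serialization overhead, and uses the approximate figure of 125 million rows that this notebook quotes for the full dataset.

```python
# Back-of-the-envelope payload estimate. BYTES_PER_POINT is an assumption,
# not a measurement of what hvPlot or Bokeh actually send to the browser.
BYTES_PER_POINT = 2 * 8  # one x and one y value, each a 64-bit float


def raw_size_mb(n_points: int) -> float:
    """Return the raw payload size in megabytes (1 MB = 1e6 bytes)."""
    return n_points * BYTES_PER_POINT / 1e6


aggregated_rows = 20 * 365   # size of the daily groupby result estimated above
full_rows = 125_000_000      # approximate row count quoted in this notebook

print(f"aggregated: {aggregated_rows:,} rows, {raw_size_mb(aggregated_rows):.2f} MB")
print(f"full:       {full_rows:,} rows, {raw_size_mb(full_rows):,.0f} MB")
```

Under these assumptions the aggregated result is a small fraction of a megabyte, while the raw points alone come to about two gigabytes before the browser has built a single glyph.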

<img src="images/datashader.svg" width="30%" align="right">

The solution is to take advantage of the fact that the output plot has a fixed resolution in pixels. A 600x400 image has 240,000 pixels. This means that if we plotted 125 million points on these pixels, most would overlap each other and not be visible. Instead, we pre-render (rasterize) the data and shade it in a manner that preserves an accurate picture of the distribution of the data.

We do this via the hvPlot integration with **Datashader**.
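To make the idea of rasterizing concrete, here is a minimal pure-Python sketch that counts points per pixel on a fixed-size grid. It illustrates the concept only and is not how Datashader is implemented; the function name `rasterize`, the grid size, and the random test data are all hypothetical.

```python
import random


def rasterize(points, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point lies outside the visible range
        # Map each coordinate to a pixel index; clamp the upper edge
        # so that x == x_max (or y == y_max) lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


random.seed(0)
points = [(random.random(), random.random()) for _ in range(100_000)]
grid = rasterize(points, (0.0, 1.0), (0.0, 1.0), width=60, height=40)

n_pixels = sum(len(row) for row in grid)
n_counted = sum(sum(row) for row in grid)
print(f"{len(points):,} points reduced to {n_pixels:,} pixel counts")
print(f"points accounted for: {n_counted:,}")
```

The size of the grid depends only on the plot resolution, not on the number of input points, so only the per-pixel counts need to be shaded and sent to the browser.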

We will use a smaller dataset for the next few examples so that the outputs appear quickly. These examples also work with the full dataset, but take a bit longer to run on the 10 compute nodes we are currently using for this tutorial.

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year",
    filters=[('YEAR', '>', 2017)],
    columns=columns,
)

In [None]:
print(f"The smaller dataset has {len(flights)/1e6} million rows")

In the next two visualizations, the plots display data rendered by Datashader.
As we pan and zoom, Datashader recomputes the appropriate pixel shades using Dask.

This allows us to easily look at the entire 30 million row dataset, but still
zoom into a single point, without requiring downsampling or decimation of the dataset.
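The reason zooming reveals more detail can be sketched in a few lines: the same values are re-binned against the new, narrower axis range. This is a simplified one-dimensional illustration, not Datashader's code; `pixel_column`, the plot width, and the sample values are hypothetical.

```python
def pixel_column(x, x_min, x_max, width):
    """Map an x value inside [x_min, x_max] to a pixel column index."""
    col = int((x - x_min) * width / (x_max - x_min))
    return min(col, width - 1)  # keep x == x_max inside the last column


WIDTH = 600            # plot width in pixels (hypothetical)
a, b = 1000.0, 1000.5  # two nearby data values (hypothetical)

# Full view: the visible range is wide, so both values share one pixel column.
full_view = (pixel_column(a, 0.0, 3000.0, WIDTH), pixel_column(b, 0.0, 3000.0, WIDTH))

# Zoomed view: the same values re-binned for a much narrower visible range.
zoomed_view = (pixel_column(a, 1000.0, 1001.0, WIDTH), pixel_column(b, 1000.0, 1001.0, WIDTH))

print("full view columns:  ", full_view)
print("zoomed view columns:", zoomed_view)
```

In the full view the two values are indistinguishable, while after zooming they fall into separate columns. Because the binning is redone from the original data on every range change, no detail is lost to up-front downsampling.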

In [None]:
flights.hvplot.line(x='FL_DATE', y='DEP_DELAY', datashade=True)

In [None]:
flights[['ARR_DELAY', 'DISTANCE']].hvplot.scatter(x='ARR_DELAY', y='DISTANCE', datashade=True)

In [None]:
# To shut down the cluster, uncomment and run the next line
# cluster.shutdown()

---

## Next →

[Big data dashboards](./07-big-data-dashboards.ipynb)