<img src="https://hvplot.holoviz.org/_static/logo_horizontal.svg" width="25%" align="right"/>

# Big data visualization with Dask and hvPlot

In this notebook, we'll continue to explore the dataset, but with visuals! We will learn to use `hvplot` with Dask to create some quick interactive visualizations.

---

## What is hvPlot?

hvPlot is a familiar, high-level API for data exploration and visualization.

<img src="https://hvplot.holoviz.org/assets/diagram.svg" width="70%"/>

 
It is a powerful, interactive version of the pandas `.plot()` API.
**By replacing `.plot()` with `.hvplot()`, you get an interactive figure.**

In [None]:
# Suppress all warnings (including DeprecationWarnings) to keep the output tidy

import warnings
warnings.filterwarnings('ignore')

## Reconnect to our Dask Cluster

In [None]:
import dask_gateway
import dask.dataframe as dd

In [None]:
gateway = dask_gateway.Gateway()

You can connect to a running cluster (the one we created in the previous notebook). Note that you may need to refresh your dashboard page:

In [None]:
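# Reuse the first running cluster if there is one; otherwise start a new one
running_clusters = gateway.list_clusters()
if running_clusters:
    cluster = gateway.connect(running_clusters[0].name)
else:
    cluster = gateway.new_cluster(conda_environment="analyst/analyst-pydata-nyc-2023", profile="Medium Worker")
    cluster.adapt(5, 10)  # scale adaptively between 5 and 10 workers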

In [None]:
cluster

In [None]:
client = cluster.get_client()

client

## Load a subset of flights data

We could run all of the following computations and visualizations on the full dataset with the power of Dask and hvPlot.
However, that would require a larger compute pool, and there are quite a few of you, so we'll grab a subset for
demonstration purposes.

In [None]:
columns = [
    'YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER', 
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED', 
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
    'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 
    'LATE_AIRCRAFT_DELAY', 'DIV_ARR_DELAY',
]

Even with 10 compute nodes, the visualizations in this notebook will take some time to execute if we use the full dataset.

We'll look at the data after 2020 (that is, from 2021 onward) and limit the number of carriers for this tutorial.

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/full_dataset.parquet",
    columns=columns,
    filters=[('YEAR', '>', 2020)],  # keep only rows with YEAR after 2020
)
flights_subset = flights[flights.OP_CARRIER.isin(['AA', 'UA', 'WN', 'DL'])]

flights_subset.head()

In [None]:
print(f"Our subset dataset has {len(flights_subset)/1e6:.2f} million rows!")

Persist the data on the cluster so we don't need to reread it with every computation:

In [None]:
# persist() returns a new collection backed by cluster memory, so reassign it
flights_subset = flights_subset.persist()

## hvPlot + Dask

To use hvPlot's built-in Dask integration, we need to switch out:

`import hvplot.pandas` for `import hvplot.dask` 

In [None]:
import hvplot.dask
hvplot.extension('bokeh')

### Plot the daily count of recorded departure delays in the subset

In [None]:
flights_subset.groupby('FL_DATE')['DEP_DELAY'].count().hvplot()
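
Note that `count()` tallies the non-null `DEP_DELAY` values for each date, so the plot above shows how many flights were recorded per day rather than how long they were delayed. The sketch below uses plain pandas with a few made-up rows (the values are illustrative and not taken from the flights dataset) to contrast `count()` with `mean()`; the same methods are available on Dask groupbys.

```python
import pandas as pd

# Made-up sample rows for illustration only; not from the flights dataset.
sample = pd.DataFrame({
    "FL_DATE": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"],
    "DEP_DELAY": [10.0, None, -4.0, 30.0],
})

by_date = sample.groupby("FL_DATE")["DEP_DELAY"]

# count() returns the number of non-null delay values per date
flights_per_day = by_date.count()

# mean() returns the average delay per date, skipping missing values
mean_delay_per_day = by_date.mean()

print(flights_per_day)
print(mean_delay_per_day)
```

Swapping `count()` for `mean()` in the cell above would therefore plot the average delay per day instead of the number of flights.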

### 💻 Your turn: Visualize the mean of any variable in the dataset by day of the week

You can use any plot type from the [hvPlot Gallery](https://hvplot.holoviz.org/reference/index.html)

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [None]:
flights_subset.groupby('DAY_OF_WEEK')['ARR_DELAY'].mean().hvplot.scatter(x="DAY_OF_WEEK", y='ARR_DELAY')

## More interactivity with quick widgets

Zoom, pan, and hover are just the tip of the iceberg for interactivity; widgets open up a whole new world of interaction. Some examples of widgets are dropdown selectors, range/date/color selectors, radio buttons, and text fields.

hvPlot automatically adds suitable widgets for your visualization. Below, the `groupby` argument produces a selector widget for choosing the carrier.

In [None]:
flights_subset.hvplot.hist('DEP_DELAY', groupby='OP_CARRIER', bins=20, bin_range=(-20, 100), width=300)

## Compose and overlay plots 

With hvPlot, you can compose and overlay your plots easily with the `+` and `*` operators, respectively.

Let's plot the minimum, maximum, and mean departure delays for each carrier.

In [None]:
import numpy as np

In [None]:
# Caution: reset_index() is compute intensive

delays = flights_subset.groupby(['DAY_OF_WEEK', 'OP_CARRIER'])['DEP_DELAY'].agg([np.min, np.mean, np.max]).reset_index()

In [None]:
delays = delays.persist()
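
To see the shape of the `delays` table without a cluster, here is a small pandas sketch on made-up rows (illustrative values only). It names the aggregations with strings, so the resulting columns are `min`, `mean`, and `max`. When NumPy functions are passed instead, the column names can differ (for example `amin` and `amax`) depending on the library versions installed, so it is worth checking `delays.columns` before plotting.

```python
import pandas as pd

# Made-up rows for illustration only; not from the flights dataset.
sample = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 1, 2, 2],
    "OP_CARRIER": ["AA", "AA", "UA", "AA", "AA"],
    "DEP_DELAY": [5.0, 15.0, -3.0, 0.0, 20.0],
})

# One row per (day of week, carrier) with the three summary statistics
summary = (
    sample.groupby(["DAY_OF_WEEK", "OP_CARRIER"])["DEP_DELAY"]
    .agg(["min", "mean", "max"])
    .reset_index()  # turn the group keys back into ordinary columns
)

print(summary)
```

Because `reset_index()` turns the group keys into ordinary columns, they can be passed to hvPlot as `x` and `groupby`.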

In [None]:
min_max_plot = delays.hvplot.area(x='DAY_OF_WEEK', y='amin', y2='amax', alpha=0.2, groupby="OP_CARRIER")

In [None]:
mean_plot = delays.hvplot.line(x='DAY_OF_WEEK', y="mean", groupby="OP_CARRIER")

The `+` operator creates a layout, displaying the plots side-by-side:

In [None]:
min_max_plot + mean_plot

The `*` operator overlays one plot on top of the other:

In [None]:
min_max_plot * mean_plot

## Explorer

For creating all of our previous plots, we needed some preliminary knowledge of the dataset.

What if you want to explore a dataset visually from scratch? hvPlot's data explorer can help you explore and create interactive visualizations using a graphical UI.

Note: We're using a pandas DataFrame here to demonstrate the Explorer, because the Explorer is most useful and performant with a small, in-memory subset.

In [None]:
flights_subset_pandas = flights_subset.compute()

In [None]:
explorer = hvplot.explorer(flights_subset_pandas)
explorer

You can use the GUI above to create any plot you want!

### Save your plot

You can then save the selected visualization using `save()`, or generate the code to create that specific visualization using `plot_code()`:

In [None]:
explorer.plot_code()

### 💻 Your turn: Use the explorer to plot the flight cancellations per day

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
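# Assuming CANCELLED is a 0/1 flag: sum() counts cancelled flights per day,
# whereas count() would count every flight with a non-null flag
flights_subset.groupby('FL_DATE')['CANCELLED'].sum().hvplot()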

## Geographic plots

To plot data on geographic maps, we need the latitude and longitude values. `ip2location` has created a list of lat/lon values for US airports here: https://github.com/ip2location/ip2location-iata-icao

We'll use this information to plot the departure delays on a world map!

In [None]:
airports = dd.read_csv('https://raw.githubusercontent.com/ip2location/ip2location-iata-icao/master/iata-icao.csv')

In [None]:
airports = airports.set_index('iata')

In [None]:
airports.head()

In [None]:
airport_delays = flights.groupby('ORIGIN')['DEP_DELAY'].mean()

In [None]:
airport_delays = dd.merge(airports, airport_delays, left_index=True, right_index=True).persist()

In [None]:
airport_delays.head()
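
The merge above joins the two tables on their indexes, which both hold airport codes. `dd.merge` follows the semantics of pandas `merge`, where the default `how='inner'` keeps only the codes present on both sides. Here is a minimal pandas sketch of that behavior; the airport codes, coordinates, and delays are hypothetical and not taken from either dataset.

```python
import pandas as pd

# Hypothetical airport codes and coordinates, for illustration only.
airports_sample = pd.DataFrame(
    {"latitude": [40.0, 35.0, 50.0], "longitude": [-75.0, -100.0, 10.0]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="iata"),
)

# Hypothetical mean delays keyed by origin airport code.
delays_sample = pd.Series(
    [12.5, 7.0],
    index=pd.Index(["AAA", "BBB"], name="ORIGIN"),
    name="DEP_DELAY",
)

# Index-on-index inner join: "CCC" has no delay value, so it is dropped.
merged = pd.merge(airports_sample, delays_sample, left_index=True, right_index=True)

print(merged)
```

Airports that never appear as an `ORIGIN` in the flights data are dropped in the same way, so only airports with a delay value end up on the map.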

In [None]:
airport_delays.hvplot.points('longitude', 'latitude', geo=True, c='DEP_DELAY', alpha=1, xlim=(-180, -30), ylim=(0, 72), tiles='ESRI')

## Plotting large datasets

In the visualization of daily counts above, we saw a bunch of computation happening before the plot appeared. But once it was generated, panning and zooming did not trigger any new Dask computations.

This is because the final dataset after the groupby is only about `20 years * 365 days` long, so it fits completely in memory.

Now let's look at the entire dataset:

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/full_dataset.parquet",
    columns=columns,
)

In [None]:
print(f"Reminder, the full dataset has {len(flights)/1e6:.2f} million rows")

If we tried to send this many data points to the browser for visualization in a plot, the *browser* would run out of memory and crash.

<img src="images/datashader.svg" width="30%" align="right">

The solution is to take advantage of the fact that the output plot has a fixed resolution in pixels. A 600x400 image has 240,000 pixels, so if we plotted 125 million points onto those pixels, most would overlap and not be visible. Instead, we pre-render (rasterize) the data and shade it in a manner that maintains an accurate picture of the distribution of your data.

We do this via the hvPlot integration with **Datashader**.
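
The idea can be sketched in plain Python: bin every point into a fixed grid of pixels and keep only the per-pixel counts. This is a simplified illustration that assumes a plain count aggregation, not Datashader's actual implementation; the `rasterize` helper and the random sample data are hypothetical.

```python
import random


def rasterize(xs, ys, x_range, y_range, width=600, height=400):
    """Count how many points fall into each cell of a width x height pixel grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Map data coordinates to pixel indices, clamping the upper edge
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


random.seed(0)
n_points = 100_000  # stand-in for the millions of rows in the real dataset
xs = [random.gauss(0, 1) for _ in range(n_points)]
ys = [random.gauss(0, 1) for _ in range(n_points)]

grid = rasterize(xs, ys, (min(xs), max(xs)), (min(ys), max(ys)))

# However many points go in, only width * height counts need to reach the browser
n_pixels = len(grid) * len(grid[0])
print(n_pixels, sum(map(sum, grid)))
```

The per-pixel counts can then be mapped to colors. The amount of data sent to the browser depends on the image size, not on the number of rows.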

We will use a smaller dataset for the next few examples for quick outputs. These examples will work with the full dataset, but will take a bit longer to run with the 10 compute nodes we are currently using for this tutorial.

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year",
    filters=[('YEAR', '>', 2017)],
    columns=columns,
)

In [None]:
print(f"The smaller dataset has {len(flights)/1e6} million rows")

In the next two visualizations, the plots display Datashader output.
As we pan and zoom, Datashader recomputes the appropriate pixel shades using Dask.

This allows us to easily look at the entire 30 million row dataset, but still
zoom into a single point, without requiring downsampling or decimation of the dataset.

In [None]:
flights.hvplot.line(x='FL_DATE', y='DEP_DELAY', datashade=True)

In [None]:
flights[['ARR_DELAY', 'DISTANCE']].hvplot.scatter(x='ARR_DELAY', y='DISTANCE', datashade=True)

Shut down the cluster:

In [None]:
cluster.shutdown()

---

## Next →

[Conclusion](./04-conclusion.ipynb)