<img src="https://hvplot.holoviz.org/_static/logo_horizontal.svg" width="25%" align="right"/>

# Big data visualization with Dask and hvPlot

In this notebook, we'll continue to explore the dataset, but with visuals! We will learn to use `hvplot` with Dask to create some quick interactive visualizations.

---

## What is hvPlot?

hvPlot is a familiar, high-level API for data exploration and visualization.

<img src="https://hvplot.holoviz.org/assets/diagram.svg" width="70%"/>

 
It is a powerful, interactive version of pandas' `.plot()` API.
**By replacing `.plot()` with `.hvplot()`, you get an interactive figure.**

## Reconnect to our Dask Cluster

In [1]:
import dask_gateway
import dask.dataframe as dd

In [2]:
gateway = dask_gateway.Gateway()

You can connect to a running cluster (such as the one we created in the previous notebook), but be aware that you may need to refresh your dashboard page:

In [None]:
if len(running_clusters := gateway.list_clusters())>0:
    cluster = gateway.connect(running_clusters[0].name)
else:
    cluster = gateway.new_cluster(conda_environment="analyst/analyst-pydata-nyc-2023", profile="Medium Worker")
    cluster.adapt(5,10)

In [3]:
cluster = gateway.new_cluster(conda_environment="analyst/analyst-pydata-nyc-2023", profile="Medium Worker")
cluster.adapt(5,10)

In [4]:
cluster

VBox(children=(HTML(value='<h2>GatewayCluster</h2>'), HBox(children=(HTML(value='\n<div>\n<style scoped>\n    …

In [5]:
client = cluster.get_client()
client

Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: https://training.quansight.dev/gateway/clusters/dev.d1baf5f8bd6a4b40aea799a50979781e/status


## Load a subset of flights data

We can do all of the following computations and visualizations on the full dataset with the power of Dask and hvPlot.
However, doing so would require a larger compute pool, and there are quite a few of you, so we'll grab a subset for
demonstration purposes.

In [14]:
columns = [
    'YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER', 
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED', 
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
    'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 
    'LATE_AIRCRAFT_DELAY', 'DIV_ARR_DELAY'
]

In [15]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/full_dataset.parquet",
    columns=columns,
    filters=[('YEAR', '=', 2022)],  # read only the 2022 rows
)
# Reduce to only 4 carriers
flights_subset = flights[flights.OP_CARRIER.isin(['AA', 'UA', 'WN', 'DL'])]

Use the `index` argument to set a sorted column as your index to create a DataFrame collection with known `divisions`.


In [16]:
flights_subset.head()

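| | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | FL_DATE | OP_CARRIER | TAIL_NUM | OP_CARRIER_FL_NUM | ORIGIN | DEST | ... | DIVERTED | AIR_TIME | FLIGHTS | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | DIV_ARR_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 476 | 2022 | 1 | 1 | 6 | 2022-01-01 | AA | N101NN | 10 | LAX | JFK | ... | 0.0 | 270.0 | 1.0 | 2475.0 | | | | | | |
| 477 | 2022 | 1 | 1 | 6 | 2022-01-01 | AA | N101NN | 117 | JFK | LAX | ... | 0.0 | 343.0 | 1.0 | 2475.0 | | | | | | |
| 478 | 2022 | 1 | 1 | 6 | 2022-01-01 | AA | N101NN | 2453 | LAX | BOS | ... | 0.0 | 296.0 | 1.0 | 2611.0 | | | | | | |
| 479 | 2022 | 1 | 1 | 6 | 2022-01-01 | AA | N102UW | 1072 | CLT | MKE | ... | 0.0 | 99.0 | 1.0 | 651.0 | | | | | | |
| 480 | 2022 | 1 | 1 | 6 | 2022-01-01 | AA | N102UW | 752 | PWM | CLT | ... | 0.0 | 134.0 | 1.0 | 813.0 | | | | | | |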


In [17]:
print(f"Our subset dataset has {len(flights_subset)/1e6:.2f} million rows!")

Our subset dataset has 3.37 million rows!


Persist the data on the cluster so we don't need to reread it with every computation. `persist()` returns a new collection, so assign the result back:

In [19]:
flights_subset = flights_subset.persist()
flights_subset

npartitions=31

| Column | dtype |
|---|---|
| YEAR | int16 |
| MONTH | int8 |
| DAY_OF_MONTH | int8 |
| DAY_OF_WEEK | int8 |
| FL_DATE | datetime64[us] |
| OP_CARRIER | string |
| TAIL_NUM | string |
| OP_CARRIER_FL_NUM | int16 |
| ORIGIN | string |
| DEST | string |
| CRS_DEP_TIME | string |
| DEP_TIME | string |
| DEP_DELAY | float64 |
| ARR_TIME | string |
| ARR_DELAY | float64 |
| CANCELLED | float64 |
| CANCELLATION_CODE | string |
| DIVERTED | float64 |
| AIR_TIME | float64 |
| FLIGHTS | float64 |
| DISTANCE | float64 |
| CARRIER_DELAY | float64 |
| WEATHER_DELAY | float64 |
| NAS_DELAY | float64 |
| SECURITY_DELAY | float64 |
| LATE_AIRCRAFT_DELAY | float64 |
| DIV_ARR_DELAY | float64 |


## hvPlot + Dask

To use hvPlot's built-in Dask integration, we need to swap:

`import hvplot.pandas` for `import hvplot.dask`

In [20]:
import hvplot.dask
hvplot.extension('bokeh')

### Plot the daily count of recorded departure delays

In [21]:
flights_subset.groupby('FL_DATE')['DEP_DELAY'].count().hvplot()

### 💻 Your turn: Visualize the mean of any variable in the dataset by day of the week

You can use any plot type from the [hvPlot Gallery](https://hvplot.holoviz.org/reference/index.html).

In [None]:
# Your code here. When ready, click on the three dots below for the solution.

In [22]:
flights_subset.groupby('DAY_OF_WEEK')['ARR_DELAY'].mean().hvplot.scatter(x="DAY_OF_WEEK", y='ARR_DELAY')

## More interactivity with quick widgets

Zoom, pan, and hover are just the tip of the iceberg for interactivity; widgets open up a whole new world of interaction. Some examples of widgets are dropdown selectors, range/date/color selectors, radio buttons, and text fields.

hvPlot automatically adds a suitable widget for your visualization, for example a selector for the column passed to `groupby`.

In [23]:
flights_subset.hvplot.hist('DEP_DELAY', groupby='OP_CARRIER', bins=20, bin_range=(-20, 100), width=300)
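
To get a feel for what `bins=20` and `bin_range=(-20, 100)` mean, here is a minimal sketch using NumPy's `np.histogram`, whose `bins` and `range` arguments play the same conceptual role. The delay values are made up for illustration, and this is not a description of how hvPlot computes its histogram internally.

```python
import numpy as np

# Made-up departure delays in minutes (illustrative values only).
delays = np.array([-30, -20, 0, 5, 99, 100, 250])

# 20 equal-width bins spanning -20..100 minutes.
counts, edges = np.histogram(delays, bins=20, range=(-20, 100))

bin_width = edges[1] - edges[0]   # (100 - (-20)) / 20 = 6 minutes per bin
in_range = int(counts.sum())      # values outside the range are not counted

print(f"bin width: {bin_width} minutes, values counted: {in_range} of {len(delays)}")
```

Delays below -20 or above 100 minutes fall outside the range and are left out, which keeps a few extreme delays from stretching the x-axis.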

### 💻 Your turn: Create violin plots for the different types of delays for each carrier

Hint: You can look for columns associated with delays (i.e., names containing "DEL").

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [24]:
# Use a new name so the `columns` list defined earlier is not overwritten
delay_columns = [col for col in flights.columns if "DEL" in col]
flights_subset.hvplot.violin(y=delay_columns, group_label='Type of Delay', value_label='Delay in Minutes', invert=True, groupby="OP_CARRIER")

## Compose and overlay plots 

With hvPlot, you can compose and overlay your plots easily with the `+` or `*` operations, respectively.

Let's plot the minimum, maximum, and mean departure delays for each day of the week and each carrier.

In [25]:
import numpy as np

In [36]:
delays = flights_subset.groupby(['DAY_OF_WEEK', 'OP_CARRIER'])['DEP_DELAY'].agg([np.min, np.mean, np.max])

In [37]:
delays.head()

| DAY_OF_WEEK | OP_CARRIER | amin | mean | amax |
|---|---|---|---|---|
| 6 | AA | -34.0 | 15.341886 | 2966.0 |
| 6 | DL | -43.0 | 10.64809 | 1206.0 |
| 6 | UA | -50.0 | 13.128012 | 1385.0 |
| 6 | WN | -20.0 | 16.820639 | 648.0 |
| 7 | AA | -28.0 | 15.896023 | 3433.0 |
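
A grouped aggregation like this one keeps the group keys (`DAY_OF_WEEK`, `OP_CARRIER`) in the index rather than in the columns. The sketch below shows this with plain pandas on a tiny made-up sample; Dask mirrors the pandas API here, but treat it as an illustration rather than a statement about Dask internals. It uses string aggregation names, so the columns are called `min`, `mean`, `max` instead of `amin`, `mean`, `amax`.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the flights data.
sample = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 1, 2, 2, 2],
    "OP_CARRIER": ["AA", "AA", "DL", "AA", "DL", "DL"],
    "DEP_DELAY": [5.0, 15.0, -3.0, 0.0, 7.0, 9.0],
})

summary = sample.groupby(["DAY_OF_WEEK", "OP_CARRIER"])["DEP_DELAY"].agg(["min", "mean", "max"])

# The group keys live in the index, not in the columns.
print(list(summary.index.names))   # ['DAY_OF_WEEK', 'OP_CARRIER']
print(list(summary.columns))       # ['min', 'mean', 'max']

# reset_index() turns the keys back into ordinary columns that can be
# referenced by name, for example as x= or groupby= in a plotting call.
flat = summary.reset_index()
print(list(flat.columns))          # ['DAY_OF_WEEK', 'OP_CARRIER', 'min', 'mean', 'max']
```

Keeping this in mind helps when a plotting call reports that a grouping dimension cannot be found: the name may exist only as an index level.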


In [33]:
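# amin/amax are columns of `delays`; reset_index() exposes the group keys as columns
min_max_plot = delays.reset_index().sort_values('DAY_OF_WEEK').hvplot.area(x='DAY_OF_WEEK', y='amin', y2='amax', alpha=0.2, groupby="OP_CARRIER")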

In [38]:
# Plot the 'mean' column, with the group keys exposed as columns via reset_index()
mean_plot = delays.reset_index().sort_values('DAY_OF_WEEK').hvplot.line(x='DAY_OF_WEEK', y='mean', groupby="OP_CARRIER")

The `+` operator creates a layout, displaying the plots side by side:

In [None]:
min_max_plot + mean_plot

The `*` operator overlays one plot on top of the other:

In [None]:
min_max_plot * mean_plot

### 💻 Your turn: Plot the mean and max departure delay by time (hour) of day

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
flights['DEP_HOUR'] = flights.CRS_DEP_TIME.astype(int) // 100

flights.groupby('DEP_HOUR')['DEP_DELAY'].mean().hvplot.bar() + flights.groupby('DEP_HOUR')['DEP_DELAY'].max().hvplot.bar()
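
The hour extraction relies on the scheduled departure time being stored in `HHMM` form. A minimal pandas sketch on made-up values, assuming that format, shows what the integer division does:

```python
import pandas as pd

# Made-up scheduled departure times in HHMM form, stored as strings.
crs_dep_time = pd.Series(["0005", "0730", "1259", "2359"], name="CRS_DEP_TIME")

# Integer division by 100 drops the two minute digits and keeps the hour.
dep_hour = crs_dep_time.astype(int) // 100

print(dep_hour.tolist())  # [0, 7, 12, 23]
```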

## Explorer

For creating all of our previous plots, we needed some preliminary knowledge of the dataset.

What if you want to explore a dataset visually from scratch? hvPlot's data explorer can help you explore and create interactive visualizations using a graphical UI:

In [None]:
explorer = hvplot.explorer(flights_subset)
explorer

You can use the GUI above to create the plot you want!

### Save your plot

You can then save the selected visualization using `save()`, or generate the code that recreates it using `plot_code()`:

In [None]:
explorer.plot_code()

### 💻 Your turn: Use the explorer to plot the flight cancellations per day

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
flights_subset.groupby('FL_DATE')['CANCELLED'].sum().hvplot()  # sum the 0/1 flag to count cancellations
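
When aggregating an indicator column, it matters whether you count rows or sum the flag. The sketch below uses a made-up sample and assumes `CANCELLED` is `1.0` for a cancelled flight and `0.0` otherwise:

```python
import pandas as pd

# Made-up sample: CANCELLED is 1.0 for a cancelled flight and 0.0 otherwise (assumption).
sample = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2022-01-01"] * 3 + ["2022-01-02"] * 2),
    "CANCELLED": [0.0, 1.0, 1.0, 0.0, 0.0],
})

per_day = sample.groupby("FL_DATE")["CANCELLED"]

flights_per_day = per_day.count().tolist()       # non-null rows per day
cancellations_per_day = per_day.sum().tolist()   # cancelled flights per day

print(flights_per_day)        # [3, 2]
print(cancellations_per_day)  # [2.0, 0.0]
```

`count()` reports how many flights have a value in the column, while `sum()` reports how many of them were cancelled.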

## Geographic plots

To plot data on geographic maps, we need the latitude and longitude values. `ip2location` has created a list of lat/lon values for US airports here: https://github.com/ip2location/ip2location-iata-icao

We'll use this information to plot the departure delays on a world map!

In [None]:
import warnings

import pandas as pd
import hvplot.pandas  # enables .hvplot on pandas objects

warnings.filterwarnings('ignore') # Ignore some HoloviewsDeprecationWarning

In [None]:
airports = pd.read_csv('https://raw.githubusercontent.com/ip2location/ip2location-iata-icao/master/iata-icao.csv')

In [None]:
airports = airports.set_index('iata')

In [None]:
airports.head()

In [None]:
airport_delays = flights.groupby('ORIGIN')['DEP_DELAY'].mean().compute()  # bring the small result into pandas

In [None]:
airport_delays = pd.merge(airport_delays, airports, left_on='ORIGIN', right_on='iata')

In [None]:
airport_delays.hvplot.points('longitude', 'latitude', geo=True, c='DEP_DELAY', alpha=1, xlim=(-180, -30), ylim=(0, 72), tiles='ESRI')
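
The join between per-airport delays and airport coordinates can be sketched in plain pandas. The sample rows and the rounded coordinates below are illustrative only, and the sketch resets the index first so that both join keys are ordinary columns, which is a slightly different route from the cells above.

```python
import pandas as pd

# Made-up flights and a made-up airport lookup table (illustrative values only).
flights_sample = pd.DataFrame({
    "ORIGIN": ["JFK", "JFK", "LAX", "XXX"],
    "DEP_DELAY": [10.0, 20.0, 5.0, 7.0],
})
airports_sample = pd.DataFrame({
    "iata": ["JFK", "LAX"],
    "latitude": [40.64, 33.94],
    "longitude": [-73.78, -118.41],
})

# Mean delay per origin, with ORIGIN turned back into a regular column.
mean_delays = flights_sample.groupby("ORIGIN")["DEP_DELAY"].mean().reset_index()

# Inner join: origins without a match in the lookup table ("XXX") are dropped.
merged = mean_delays.merge(airports_sample, left_on="ORIGIN", right_on="iata", how="inner")

print(merged[["ORIGIN", "DEP_DELAY", "latitude", "longitude"]])
```

Because the default join is an inner join, any airport code missing from the lookup table silently disappears from the map, which is worth checking when a plot shows fewer points than expected.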

## Plotting large datasets

In the earlier visualization of daily counts, we saw a lot of computation happen before the plot appeared. But once it was generated, panning and zooming did not trigger any new Dask computation.

This is because the result of the groupby has only one row per day (about `20 years * 365 days` rows even for the full dataset), so it fits completely in memory.

Now let's look at the entire dataset:

In [None]:
print(f"Reminder, the full dataset has {len(flights)/1e6:.2f} million rows")

If we tried to send this many data points to the browser for visualization in a plot, the *browser* would run out of memory and crash.

<img src="images/datashader.svg" width="30%" align="right">

The solution for this is to take advantage of the fact that the output plot has a fixed resolution in terms of number of pixels. A 600x400 image has 240,000 pixels. This means that if we plotted 125 million points on these pixels, most would overlap each other and not be visible. Instead, we pre-render, or rasterize, the data and shade it in a manner that preserves an accurate picture of its distribution.

We do this via the hvPlot integration with **Datashader**.
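
The idea of aggregating points per pixel can be sketched with NumPy alone. This is only an illustration of the principle on randomly generated points; it is not how Datashader is implemented, and the 600x400 size is taken from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# One million made-up (x, y) points; far more than a 600x400 image has pixels.
n_points = 1_000_000
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

width, height = 600, 400

# Count how many points fall into each pixel-sized bin.
# histogram2d returns an array of shape (x bins, y bins), so transpose to (rows, columns).
counts, _, _ = np.histogram2d(x, y, bins=(width, height))
image = counts.T

print(image.shape)        # (400, 600): fixed by the plot size, not by n_points
print(int(image.sum()))   # 1000000: every point contributes to exactly one pixel
```

However many points go in, only a 400x600 array of counts needs to reach the browser, and those counts can then be mapped to colors.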

We will use a smaller dataset for the next few examples for quick outputs. These examples will work with the full dataset, but will take a bit longer to run with the 10 compute nodes we are currently using for this tutorial.

In [None]:
flights = dd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year",
    filters=[('YEAR', '>', 2017)],
    columns=columns,
)

In [None]:
print(f"The smaller dataset has {len(flights)/1e6} million rows")

In the next two visualizations, Datashader renders the data as an image on the plot.
As we pan and zoom, Datashader recomputes the appropriate pixel shades using Dask.

This allows us to look at the entire 30 million row dataset and still
zoom into a single point, without downsampling or decimating the dataset.

In [None]:
flights.hvplot.line(x='FL_DATE', y='DEP_DELAY', datashade=True)

In [None]:
flights[['ARR_DELAY', 'DISTANCE']].hvplot.scatter(x='ARR_DELAY', y='DISTANCE', datashade=True)

In [None]:
# shutdown the cluster
cluster.shutdown()

---

## Next →

[Conclusion](./04-conclusion.ipynb)