<img src="https://hvplot.holoviz.org/_static/logo_horizontal.svg" width="25%" align="right"/>

# Introduction to interactive data visualization with `hvplot`

In this notebook, we'll continue to explore the dataset, but with visuals! We will learn to use `hvplot` with pandas to create some quick interactive visualizations.

---

## What is hvPlot?

hvPlot is a familiar, high-level API for data exploration and visualization.

<img src="https://hvplot.holoviz.org/assets/diagram.svg" width="70%"/>

 
It is a powerful, interactive version of pandas' `.plot()` API.
**By replacing `.plot()` with `.hvplot()`, you get an interactive figure.**

In [None]:
import pandas as pd

In [None]:
# Note: Extension setup needs to be in a separate cell to avoid a JupyterHub race condition. If you see an error or warning, please re-run the cell.

import hvplot
import hvplot.pandas # noqa

hvplot.extension('bokeh')

## Read a subset into pandas

Let's read 1 year of data into a pandas DataFrame.

We'll read the Parquet dataset (details in a future notebook!) and load only the columns listed below.

In [None]:
columns = [
    'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER', 
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED', 
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
    'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 
    'LATE_AIRCRAFT_DELAY', 'DIV_ARR_DELAY'
]

In [None]:
flights = pd.read_parquet(
    "gcs://quansight-datasets/airline-ontime-performance/sorted/parquet_by_year",
    filters=[('YEAR', '=', 2022)],  # keep only rows from 2022
    columns=columns,  # load only the columns listed above
)

### Create a smaller DataFrame

Let's reduce the dataset to only have information about 4 airlines:

In [None]:
print(f"Before: {len(flights)} rows")

In [None]:
flights_subset = flights[flights.OP_CARRIER.isin(['AA', 'UA', 'WN', 'DL'])]

In [None]:
print(f"After: {len(flights_subset)} rows")
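To double-check that a filter like this behaved as intended, one option is to inspect which carriers remain. The sketch below uses a small made-up frame (the rows and the `toy_*` names are hypothetical, not the real dataset) to show how `isin` builds a boolean mask and how `value_counts()` summarizes the result.

```python
import pandas as pd

# Hypothetical toy data standing in for the real flights table.
toy_flights = pd.DataFrame({
    "OP_CARRIER": ["AA", "B6", "UA", "WN", "AS", "DL", "AA"],
    "DEP_DELAY": [5.0, 12.0, -3.0, 0.0, 7.0, 20.0, -1.0],
})

# isin() returns a boolean mask: True where the carrier is in the list.
mask = toy_flights["OP_CARRIER"].isin(["AA", "UA", "WN", "DL"])
toy_subset = toy_flights[mask]

# Count the remaining rows per carrier to confirm the filter.
print(toy_subset["OP_CARRIER"].value_counts().sort_index())
```

The same `value_counts()` call on `flights_subset["OP_CARRIER"]` should list only the four selected carriers.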

## hvPlot as a `pandas.plot` replacement

hvPlot gives you an interactive plot very quickly and out-of-the-box.

Let's see this in action by plotting the average departure delay each day:

**pandas.plot:**

In [None]:
flights_subset.groupby('FL_DATE')["DEP_DELAY"].mean().plot()

**hvPlot:**

In [None]:
flights_subset.groupby('FL_DATE')["DEP_DELAY"].mean().hvplot()

### Bokeh plot tools

The above plot is rendered using Bokeh. Hover over or click the pan, box zoom, wheel zoom, save, reset, and help tools on the right to interact with your plot!

Learn more about the tools in the [Bokeh documentation](https://docs.bokeh.org/en/latest/docs/first_steps/first_steps_1.html).

### 💻 Your turn: Add arrival delays to the same plot

Your DataFrame will now need both the departure and arrival delay columns. When done, make sure to click on the legend labels to show or hide each line!

Optionally, you can plot the maximum or cumulative delays as well.

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
flights_subset.groupby('FL_DATE')[["DEP_DELAY", "ARR_DELAY"]].mean().hvplot()

### Histograms

hvPlot makes all the `pandas.plot` options more powerful. Let's look at histograms, for instance:

In [None]:
flights_subset.hvplot.hist('DEP_DELAY', by='OP_CARRIER', bins=20, bin_range=(-20, 100), width=300, subplots=True)

You can hover over the bars in the plots to view more details.
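To make the `bins` and `bin_range` arguments concrete, here is a minimal sketch using `numpy.histogram` on made-up delay values. It assumes hvPlot bins the data the same way NumPy does (equal-width bins over the given range), which is an assumption for illustration rather than a statement about hvPlot's internals.

```python
import numpy as np

# Hypothetical delays in minutes, including values outside the range.
toy_delays = np.array([-30, -20, -5, 0, 3, 45, 99, 100, 250])

# 20 equal-width bins between -20 and 100 minutes.
counts, edges = np.histogram(toy_delays, bins=20, range=(-20, 100))

bin_width = edges[1] - edges[0]
print(f"bin width: {bin_width} minutes")
print(f"values counted: {counts.sum()} of {toy_delays.size}")
```

With these settings each bar covers 6 minutes, and values outside the range (here -30 and 250) are not counted in any bin.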

## More interactivity with quick widgets

Zoom, pan, and hover are just the tip of the iceberg for interactivity. Widgets open up a whole new world of interaction. Some examples of widgets are dropdown selectors, range/date/color selectors, radio buttons, and text fields.

When you pass the `groupby` argument, hvPlot automatically adds a suitable widget to your visualization.

In [None]:
flights.hvplot.hist('DEP_DELAY', groupby='OP_CARRIER', bins=20, bin_range=(-20, 100), width=300)

Here we see a dropdown for selecting the carrier. Try it out!

### 💻 Your turn: Create violin plots for the different types of delays for each carrier

Hint: Look for the columns associated with delays, i.e. those whose names contain "DEL".

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
columns = [col for col in flights.columns if "DEL" in col]
flights.hvplot.violin(y=columns, group_label='Type of Delay', value_label='Delay in Minutes', invert=True, groupby="OP_CARRIER")
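It can help to see which columns the `"DEL"` substring test actually selects. The sketch below applies the same list comprehension to the column names requested earlier in this notebook (copied into a hypothetical `all_columns` list so that it runs without loading the data).

```python
# Column names copied from the `columns` list defined earlier.
all_columns = [
    'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER',
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME',
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED',
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
    'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY',
    'LATE_AIRCRAFT_DELAY', 'DIV_ARR_DELAY',
]

# Keep every column whose name contains the substring "DEL".
delay_columns = [col for col in all_columns if "DEL" in col]
print(delay_columns)
```

The test matches the overall `DEP_DELAY`, `ARR_DELAY`, and `DIV_ARR_DELAY` columns as well as the five cause-specific ones, so the violin plot shows eight delay types.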

## Compose and overlay plots 

With hvPlot, you can compose and overlay your plots easily with the `+` or `*` operators, respectively.

Let's plot the minimum, maximum, and mean departure delays for each day of the week and each carrier.

In [None]:
import numpy as np

In [None]:
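# Named aggregation pins the output column names to 'amin', 'mean', and 'amax',
# which otherwise depend on the installed NumPy/pandas versions.
delays = flights.groupby(['DAY_OF_WEEK', 'OP_CARRIER'])['DEP_DELAY'].agg(amin='min', mean='mean', amax='max')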

In [None]:
delays.head()

In [None]:
min_max_plot = delays.hvplot.area(x='DAY_OF_WEEK', y='amin', y2='amax', alpha=0.2, groupby="OP_CARRIER")

In [None]:
mean_plot = delays['mean'].hvplot.line(x='DAY_OF_WEEK', groupby="OP_CARRIER")

The `+` operator creates a layout, displaying the plots side by side:

In [None]:
min_max_plot + mean_plot

The `*` operator overlays one plot on top of the other:

In [None]:
min_max_plot * mean_plot

### 💻 Your turn: Plot the mean and max departure delay by time (hour) of day

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
flights['DEP_HOUR'] = flights.CRS_DEP_TIME.astype(int) // 100

flights.groupby('DEP_HOUR')['DEP_DELAY'].mean().hvplot.bar() + flights.groupby('DEP_HOUR')['DEP_DELAY'].max().hvplot.bar()
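The hour extraction relies on `CRS_DEP_TIME` being stored as an `HHMM` number, which is an assumption about the dataset's encoding. A minimal sketch with made-up times shows what the floor division does:

```python
import pandas as pd

# Hypothetical scheduled departure times in HHMM form.
toy_times = pd.Series([5, 930, 1435, 2359], name="CRS_DEP_TIME")

# Floor division by 100 drops the minutes and keeps the hour.
toy_hours = toy_times.astype(int) // 100
print(toy_hours.tolist())
```

So a value such as 1435 (14:35) lands in hour 14, and 5 (00:05) lands in hour 0.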

## Explorer

For creating all of our previous plots, we needed some preliminary knowledge of the dataset.

What if you want to explore a dataset visually from scratch? hvPlot's data explorer can help you explore and create interactive visualizations using a graphical UI:

In [None]:
explorer = hvplot.explorer(flights_subset)
explorer

You can use the above GUI to create the plot you want!

### Save your plot

You can then save the selected visualization using `save()`, or generate the code that recreates it using `plot_code()`:

In [None]:
explorer.plot_code()

### 💻 Your turn: Use the explorer to plot the flight cancellations per day

In [None]:
# Your code here. When ready, click on the three dots for the solution.

In [None]:
# CANCELLED is a 0/1 flag, so summing it gives the number of cancelled flights per day.
flights_subset.groupby('FL_DATE')['CANCELLED'].sum().hvplot()
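On a flag column, `count()` and `sum()` answer different questions. The sketch below assumes `CANCELLED` is a 0/1 indicator (an assumption about the dataset) and uses made-up rows: `count` tallies every flight with a non-null flag, while `sum` tallies only the cancelled ones.

```python
import pandas as pd

# Hypothetical toy data: three flights on one day, two on the next.
toy = pd.DataFrame({
    "FL_DATE": ["2022-01-01"] * 3 + ["2022-01-02"] * 2,
    "CANCELLED": [0, 1, 1, 0, 0],
})

# Compare both aggregations side by side for each day.
per_day = toy.groupby("FL_DATE")["CANCELLED"].agg(["count", "sum"])
print(per_day)
```

Here the first day has three flights of which two were cancelled, so the two columns differ.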

## Geographic plots

To plot data on geographic maps, we need the latitude and longitude values. `ip2location` has created a list of lat/lon values for US airports here: https://github.com/ip2location/ip2location-iata-icao

We'll use this information to plot the departure delays on a world map!

In [None]:
airports = pd.read_csv('https://raw.githubusercontent.com/ip2location/ip2location-iata-icao/master/iata-icao.csv')

In [None]:
airports = airports.set_index('iata')

In [None]:
airports.head()

In [None]:
airport_delays = flights.groupby('ORIGIN')['DEP_DELAY'].mean()

In [None]:
airport_delays = pd.merge(airport_delays, airports, left_on='ORIGIN', right_on='iata')
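The merge pairs each airport's mean delay with its coordinates by matching airport codes. As a rough illustration, the sketch below performs an equivalent inner join on a toy Series and DataFrame. It uses `DataFrame.join` on the indexes rather than the `pd.merge` call above, and the codes, delays, and coordinates are made up.

```python
import pandas as pd

# Hypothetical mean delays indexed by origin airport code.
toy_delays = pd.Series(
    [12.5, 3.0, 8.0],
    index=pd.Index(["ATL", "DEN", "XXX"], name="ORIGIN"),
    name="DEP_DELAY",
)

# Hypothetical airport coordinates indexed by IATA code.
toy_airports = pd.DataFrame(
    {"latitude": [33.64, 39.86, 47.45], "longitude": [-84.43, -104.67, -122.31]},
    index=pd.Index(["ATL", "DEN", "SEA"], name="iata"),
)

# Inner join on the two indexes: only codes present in both survive.
toy_merged = toy_delays.to_frame().join(toy_airports, how="inner")
print(toy_merged)
```

Codes that appear on only one side (`XXX` and `SEA` here) are dropped, so airports without known coordinates would not appear on the map.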

In [None]:
airport_delays.hvplot.points('longitude', 'latitude', geo=True, c='DEP_DELAY', alpha=1, xlim=(-180, -30), ylim=(0, 72), tiles='ESRI')

---

## Next →

After a 10-minute break, we'll dive into [Dask](./03-intro-to-dask.ipynb)!