# Exploring Data in Notebooks

The two most common types of data encountered in typical data science workflows are tabular (columnar) datasets and raster (array) data. In the Python ecosystem, the former type of data has gradually standardized around the pandas [DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) while raster data has standardized around the [NumPy API](https://numpy.org/).

Examples of libraries offering `DataFrame` style objects include [Dask](https://dask.org/), [Rapids](https://rapids.ai/), [GeoPandas](https://geopandas.org/), [Streamz](https://streamz.readthedocs.io/) and of course [Pandas](https://pandas.pydata.org) itself. For array data, you can use [XArray](http://xarray.pydata.org/) to specify labelled multidimensional arrays or [NumPy](https://numpy.org/) ndarrays.

This notebook is available as part of an `anaconda project` archive that you can download [here](https://anaconda.org/jlstevens/project/exploring-data). If you extract the archive, you will also find a notebook called `Reproducibly_Capturing_Code.ipynb` which details the steps needed to set up the environment necessary to run this notebook. More information can be found in the *Capturing your Python code as a reproducible, deployable project* companion talk.

In this notebook, we will see how [hvplot](https://hvplot.holoviz.org/) allows you to visualize data in all these various formats using a `.plot` style API inspired by [pandas](https://pandas.pydata.org/). Our first step is therefore to take a look at some simple examples of what you can do with `.plot` on a pandas `DataFrame` without `hvplot`, starting with a pandas import:

In [None]:
import pandas as pd

The data we will examine lists the number of [cases of measles and pertussis](http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15) (per 100,000 people) over time in each US state from 1928 to 2011:


In [None]:
df = pd.read_csv('diseases.csv.gz')
df.head()

The `DataFrame` named `df` has a `.plot` method we can simply call after running the `%matplotlib inline` notebook magic:

In [None]:
%matplotlib inline
df.plot();

At this point we can note two things about the plot above:

* This plot is rendered as a static image with [matplotlib](https://matplotlib.org/) which means it is not interactive.
* Without any specification from the user, the `.plot` call renders a plot that displays all the available data but this plot is hard to interpret.

Using the `numpy.sum` function, we can build a new `DataFrame` indexed by `'Year'` that has a `'measles'` column that is the aggregate over that year:

In [None]:
import numpy as np

by_year = df[["Year","measles"]].groupby("Year").aggregate(np.sum)
by_year.head()

Calling `.plot()` on this `DataFrame` results in a more easily interpretable plot:

In [None]:
by_year.plot();
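
To make the aggregation step concrete, here is a minimal sketch on a tiny, made-up `DataFrame` (the values and the `toy` names are illustrative assumptions, not taken from `diseases.csv.gz`). Calling `.sum()` on the grouped object gives the same result as passing `np.sum` to `.aggregate`:

```python
import pandas as pd

# Made-up values for illustration only; not taken from diseases.csv.gz
toy = pd.DataFrame({
    "Year": [1928, 1928, 1929, 1929],
    "State": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "measles": [10.0, 5.0, 20.0, 1.0],
})

# Group the rows by year and add up the measles values within each group
toy_by_year = toy[["Year", "measles"]].groupby("Year").sum()
print(toy_by_year)
```

The result has one row per year, with `'Year'` as the index, which is why the plot of `by_year` shows a single curve over time.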

So why use `hvplot`? 

Let us now import `hvplot.pandas` which gives our `DataFrame` objects a new `.hvplot` method which we can call:

In [None]:
import hvplot.pandas # adds hvplot method to pandas objects

by_year.hvplot()

Immediately, we can note the following differences from `.plot`:

* The plot is not rendered with Matplotlib but with [Bokeh](https://bokeh.org/) instead.
* The plot is now interactive: using the tools available on the toolbar (on the right), you can pan, box zoom, mouse zoom, save, reset and hover over the data.
* The last tool (hover) in particular gives you a new view on your data, allowing you to see the exact values on the curve without having to read the values off the axes.

# Interpreting and composing plots

Looking at the plot above, we note that the incidence of measles used to be higher in the past and has dropped to nearly zero since the year 1980. What caused this change?

With a little research, we may learn that in [1963 a measles vaccine became widely available](https://www.cdc.gov/measles/about/history.html) which brought cases down to negligible levels. This knowledge is relevant to the plot and would be useful to annotate explicitly on top of it.

In this section, we will see how this is easy to achieve using the [HoloViews](http://holoviews.org/) objects returned by `.hvplot`:

In [None]:
hvplot_obj = by_year.hvplot()
hvplot_obj

We can look at the textual representation of `hvplot_obj` by printing it:

In [None]:
print(hvplot_obj)

Now we see that this is a HoloViews [`Curve`](http://holoviews.org/reference/elements/bokeh/Curve.html) object, which is described in the HoloViews reference guide. This object (like all HoloViews objects) is *not a plot* but an object that contains your data, with a rich visual representation. We can see this by looking at the `.data` attribute:

In [None]:
hvplot_obj.data

This object can also compose with other HoloViews objects to build rich visualizations. We can see this by importing HoloViews and creating a [`VLine`](http://holoviews.org/reference/elements/bokeh/VLine.html) and a [`Text`](http://holoviews.org/reference/elements/bokeh/Text.html) object:

In [None]:
import holoviews as hv

vline = hv.VLine(1963).opts(color='black')
text = hv.Text(1963, 27000, " Vaccine introduced", halign='left')

You can now overlay these on top of the original `hvplot_obj` using the `*` operator, creating a HoloViews [`Overlay`](http://holoviews.org/reference/containers/bokeh/Overlay.html):

In [None]:
composite = hvplot_obj * vline * text
composite

We now have an interactive (pannable, zoomable, hoverable) plot with annotations!

To inspect the textual representation of this composite object, we can print it:

In [None]:
print(composite)

We can now say that we are viewing an overlay consisting of a `Curve`, a `VLine` and some `Text`.

# Interactivity with widgets

The interactive tools offered by [Bokeh](https://bokeh.org/) are one compelling reason to use `hvplot`, but the `hvplot` method offers additional levels of interactivity by generating widgets.

To illustrate, let's make a new pandas object that aggregates the measles incidence by `'Year'` while preserving the breakdown by `'State'`:

In [None]:
measles_agg = df.groupby(['Year', 'State'])['measles'].sum()
measles_agg 

Now we can call `hvplot` to generate a plot over `'Year'` while grouping by `'State'`. The specification of a column to group by results in a dropdown widget by state:

In [None]:
by_state = measles_agg.hvplot('Year', groupby='State', width=600)
by_state * vline

Note that you can now view the data for each state while retaining the ability to pan, zoom and hover over the plot. In addition, we have the `VLine` marking 1963, the year in which the measles vaccine was introduced.
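
One way to think about the dropdown is that each choice corresponds to one state's slice of the aggregated data. The sketch below illustrates that idea in plain pandas with made-up values; it is an assumption-laden illustration of the grouping concept, not a description of how `hvplot` implements the widget:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Year": [1928, 1928, 1929, 1929],
    "State": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "measles": [10.0, 5.0, 20.0, 1.0],
})

# Grouping by two keys yields a Series with a (Year, State) MultiIndex
toy_agg = toy.groupby(["Year", "State"])["measles"].sum()

# Cross-section: keep one state's rows, leaving a Series indexed by Year
alabama = toy_agg.xs("Alabama", level="State")
print(alabama)
```

Each such per-state Series has the same shape as the single-curve data plotted earlier, which is why the same kind of curve can be drawn for every state.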

# Plotting large data with `hvplot`

So far, the examples have shown how `.hvplot` differs from `.plot` for regular Pandas `DataFrames`. In this example, we will see how `.hvplot` can be used to visualize large volumes of data in [Dask](https://dask.org/) dataframes.

First let's import the `airline_flights` sample data from `hvplot`, convert it to a Dask `DataFrame` and view the `.tail` of it:

In [None]:
from hvplot.sample_data import airline_flights
flights = airline_flights.to_dask().persist()
flights.tail()

Note that there are 918204 rows in this dataframe! Plotting all these entries (e.g. as a scatter plot) is likely to be a slow and memory-intensive operation that may well crash the browser tab. We will now see how `hvplot` is able to quickly and efficiently plot all this data regardless.

Now we need to import `hvplot.dask` to give our Dask `DataFrame` a `.hvplot` method:

In [None]:
import hvplot.dask

We can call the `hvplot.scatter` method to generate a HoloViews [`Scatter`](http://holoviews.org/reference/elements/bokeh/Scatter.html) object, which we will print rather than display the normal way because of the size of the data:

In [None]:
scatter = flights.hvplot.scatter(x='distance', y='airtime')
print(scatter)

Displaying this object normally is risky as it would involve plotting 918204 points with Bokeh. However, we can quickly and safely plot it by adding the `datashade=True` keyword:

In [None]:
flights.hvplot.scatter(x='distance', y='airtime', datashade=True)

Note that this example is still interactively pannable and zoomable!

This is possible due to the use of [datashader](https://datashader.org/), which is a fast rasterizer: our `Scatter` object is rapidly rendered by datashader to an image that is sent to the browser by the Python process. This minimizes the load on the browser by only pushing the image data to the client instead of all 918204 points.
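
The core idea of rasterization can be sketched with NumPy alone: bin the points into a fixed-size grid so that the result has the same size no matter how many points go in. This is only a rough illustration using synthetic data and `np.histogram2d`; it is not datashader's actual implementation, and the grid size and variable names are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 918_204  # same number of rows as the flights data

# Synthetic stand-ins for the 'distance' and 'airtime' columns
x = rng.uniform(0, 3000, n_points)
y = x / 8 + rng.normal(0, 20, n_points)

# Count how many points fall into each cell of a fixed 400 x 300 grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(400, 300))

print(counts.shape)       # grid size, independent of n_points
print(int(counts.sum()))  # every point is counted exactly once
```

Only the grid of counts (120000 cells here) would need to be turned into an image and sent to the browser, rather than one glyph per point.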

# Plotting raster data with `hvplot`

So far, all the plots in this notebook have been generated from `DataFrame` style objects. This section shows that `hvplot` can be used to visualize raster data, specifically large [Xarray](http://xarray.pydata.org) datasets.

First we import `xarray` and enable `hvplot` support by importing `hvplot.xarray`:

In [None]:
import xarray as xr
import hvplot.xarray

Next we load one of the large sample datasets that ship with `xarray`:

In [None]:
air_ds = xr.tutorial.open_dataset('air_temperature').load()
air = air_ds.air
'Air temperature data has {dims} as dimensions and a shape of {shape}'.format(dims=air.dims, shape=air.shape)

We can now call `hvplot.scatter` to plot this entire dataset over time, remembering to set `datashade=True`:

In [None]:
temp_scatter = air.hvplot.scatter('time', groupby=[], datashade=True)
temp_scatter

Next we can use the `.mean` method on our xarray `DataArray` to average the data over latitude and longitude before plotting it over time with `hvplot.line`:

In [None]:
temp_mean = air.mean(['lat', 'lon']).hvplot.line('time', color='indianred')
temp_mean

Finally we can use `*` to easily overlay these two plots:

In [None]:
temp_scatter * temp_mean
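
The averaging step used for `temp_mean` reduces the spatial dimensions and leaves one value per time step. A minimal NumPy sketch of the same reduction is shown below; it assumes an array laid out as `(time, lat, lon)` and uses arbitrary synthetic values rather than the real air temperature data:

```python
import numpy as np

# Synthetic array laid out as (time, lat, lon); values are arbitrary
toy_air = np.arange(24, dtype=float).reshape(2, 3, 4)

# Average over the lat and lon axes, leaving one value per time step
toy_mean = toy_air.mean(axis=(1, 2))
print(toy_mean)
```

With labelled dimensions, as in the `air.mean(['lat', 'lon'])` call above, the same reduction is expressed by dimension name instead of axis position.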

# Next steps


This notebook only scratches the surface of what you can do with `hvplot`: you can visualize streaming data using the [`streamz`](https://streamz.readthedocs.io/) library, build dashboards using [Panel](https://panel.pyviz.org/), generate linked selection plots using [HoloViews](http://holoviews.org/) and much more. You can find many of these topics covered in the `hvplot` [User Guide](https://hvplot.holoviz.org/user_guide/index.html).

Lastly, if you have any problems [running this project](https://anaconda.org/jlstevens/project/exploring-data), you can consult the `Reproducibly_Capturing_Code.ipynb` notebook which also has a talk (*Capturing your Python code as a reproducible, deployable project*) describing how to reproducibly capture a project such as this one, together with its files, an associated environment and the corresponding commands for execution.