
In [29]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Visualizing Data with Holoviews

Thus far we have done very little data visualization! There is a lot for us to dive into here, both from a theoretical and a technical point of view. In order for us to get into the theoretical side, it is helpful to first start with learning some of the technical bits of visualizing data. We are going to 

## Anscombe's Quartet

Let's start with a dataset containing four tables and perform some basic summarization statistics on them. These 4 datasets together are called `Anscombe's Quartet`:


In [30]:
import pandas as pd

q1 = pd.DataFrame({
    'x': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
})
q1

|   | x    | y     |
|---|------|-------|
| 0 | 10.0 | 8.04  |
| 1 | 8.0  | 6.95  |
| 2 | 13.0 | 7.58  |
| 3 | 9.0  | 8.81  |
| 4 | 11.0 | 8.33  |
| 5 | 14.0 | 9.96  |
| 6 | 6.0  | 7.24  |
| 7 | 4.0  | 4.26  |
| 8 | 12.0 | 10.84 |
| 9 | 7.0  | 4.82  |


In [31]:
q2 = pd.DataFrame({
    'x': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
})
q2

|   | x    | y    |
|---|------|------|
| 0 | 10.0 | 9.14 |
| 1 | 8.0  | 8.14 |
| 2 | 13.0 | 8.74 |
| 3 | 9.0  | 8.77 |
| 4 | 11.0 | 9.26 |
| 5 | 14.0 | 8.1  |
| 6 | 6.0  | 6.13 |
| 7 | 4.0  | 3.1  |
| 8 | 12.0 | 9.13 |
| 9 | 7.0  | 7.26 |


In [32]:
q3 = pd.DataFrame({
    'x': [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
    'y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
})
q3

|   | x    | y     |
|---|------|-------|
| 0 | 10.0 | 7.46  |
| 1 | 8.0  | 6.77  |
| 2 | 13.0 | 12.74 |
| 3 | 9.0  | 7.11  |
| 4 | 11.0 | 7.81  |
| 5 | 14.0 | 8.84  |
| 6 | 6.0  | 6.08  |
| 7 | 4.0  | 5.39  |
| 8 | 12.0 | 8.15  |
| 9 | 7.0  | 6.42  |


In [33]:
q4 = pd.DataFrame({
    'x': [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
    'y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
})
q4

|   | x    | y    |
|---|------|------|
| 0 | 8.0  | 6.58 |
| 1 | 8.0  | 5.76 |
| 2 | 8.0  | 7.71 |
| 3 | 8.0  | 8.84 |
| 4 | 8.0  | 8.47 |
| 5 | 8.0  | 7.04 |
| 6 | 8.0  | 5.25 |
| 7 | 19.0 | 12.5 |
| 8 | 8.0  | 5.56 |
| 9 | 8.0  | 7.91 |


These datasets are all the same size, and they clearly look a little different from one another. Let's get some basic statistics. Below we use `describe` to get a statistical summary dataframe for each subset, and then stack the summaries horizontally. Then, to show which quartet each column belongs to, we use a *multiindex* to add an extra level of labels to the columns.

In [34]:

summary = pd.concat([q1.describe(), q2.describe(), q3.describe(), q4.describe()], axis=1)
summary.columns = pd.MultiIndex.from_product([['q1', 'q2', 'q3', 'q4'], ['x', 'y']])
summary

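|       | q1 x     | q1 y     | q2 x     | q2 y     | q3 x     | q3 y     | q4 x     | q4 y     |
|-------|----------|----------|----------|----------|----------|----------|----------|----------|
| count | 11.0     | 11.0     | 11.0     | 11.0     | 11.0     | 11.0     | 11.0     | 11.0     |
| mean  | 9.0      | 7.500909 | 9.0      | 7.500909 | 9.0      | 7.5      | 9.0      | 7.500909 |
| std   | 3.316625 | 2.031568 | 3.316625 | 2.031657 | 3.316625 | 2.030424 | 3.316625 | 2.030579 |
| min   | 4.0      | 4.26     | 4.0      | 3.1      | 4.0      | 5.39     | 8.0      | 5.25     |
| 25%   | 6.5      | 6.315    | 6.5      | 6.695    | 6.5      | 6.25     | 8.0      | 6.17     |
| 50%   | 9.0      | 7.58     | 9.0      | 8.14     | 9.0      | 7.11     | 8.0      | 7.04     |
| 75%   | 11.5     | 8.57     | 11.5     | 8.95     | 11.5     | 7.98     | 8.0      | 8.19     |
| max   | 14.0     | 10.84    | 14.0     | 9.26     | 14.0     | 12.74    | 19.0     | 12.5     |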
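
With a two-level column index in place, the table can be sliced by either level. The following is a minimal sketch on two toy frames; the names `a`, `b`, and `stats` are placeholders and not part of the notebook.

```python
import pandas as pd

# Two tiny placeholder frames standing in for the quartet subsets
a = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [2.0, 4.0, 6.0]})
b = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [1.0, 1.0, 4.0]})

# Same recipe as the summary table: describe, stack horizontally, relabel
stats = pd.concat([a.describe(), b.describe()], axis=1)
stats.columns = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])

# Outer level: every statistic for one subset
stats_a = stats['a']

# Inner level: the 'y' statistics of all subsets side by side
y_stats = stats.xs('y', axis=1, level=1)
print(y_stats.loc[['mean', 'std']])
```

Selecting on the inner level is a convenient way to compare one variable across all subsets at once.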


Each subset does not look too statistically different from the others! But what if we plot them? What do they look like? Below we are going to use the three main plotting libraries for visualizing the 4 subsets. To reduce repeating code, we will write functions to handle plotting individual subsets.

First we will start with `matplotlib`.

In [36]:
import matplotlib.pyplot as plt
import numpy as np

mpl_fig, mpl_axes = plt.subplots(2, 2)

xs = [0.0, 20.0]

def mpl_plot_quartet(title, df, ax):
    coef = np.polyfit(df.x, df.y, 1)
    poly1d_fn = np.poly1d(coef)  # Build a degree-1 polynomial: the linear regression line
    ax.plot(xs, poly1d_fn(xs), '--k')  # Plot the fitted line over xs; '--k' means a dashed (--) black (k) line
    df.plot(kind='scatter', x='x', y='y', title=title, ax=ax)

mpl_plot_quartet('I', q1, mpl_axes[0][0])
mpl_plot_quartet('II', q2, mpl_axes[0][1])
mpl_plot_quartet('III', q3, mpl_axes[1][0])
mpl_plot_quartet('IV', q4, mpl_axes[1][1])

plt.show()


Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.



## The Same Plot in Bokeh

In [37]:
import bokeh.io
import bokeh.layouts
import bokeh.plotting

bokeh.io.output_notebook()

def bk_plot_quartet(title, df):
    coef = np.polyfit(df.x, df.y, 1)
    poly1d_fn = np.poly1d(coef)
    fig = bokeh.plotting.figure(title=title)
    fig.circle(df.x, df.y)
    fig.line(xs, poly1d_fn(xs), color='black', line_dash='dashed')
    return fig

bk_grid = bokeh.layouts.gridplot([
    [bk_plot_quartet('I', q1), bk_plot_quartet('II', q2)],
    [bk_plot_quartet('III', q3), bk_plot_quartet('IV', q4)]
])

bokeh.io.show(bk_grid)

In [39]:
import plotly.subplots

# Plotly renders its figures as HTML and JavaScript

pl_fig = plotly.subplots.make_subplots(rows=2, cols=2, subplot_titles=("I", "II", "III", "IV"))

def pl_plot_quartet(df, ax):
    coef = np.polyfit(df.x, df.y, 1)
    poly1d_fn = np.poly1d(coef)
    pl_fig.add_scatter(x=df.x, y=df.y, row=ax[0], col=ax[1], mode="markers")
    pl_fig.add_scatter(x=xs, y=poly1d_fn(xs), row=ax[0], col=ax[1], mode="lines", line_color='black', line_dash='dash') 

pl_plot_quartet(q1, (1,1))
pl_plot_quartet(q2, (1,2))
pl_plot_quartet(q3, (2,1))
pl_plot_quartet(q4, (2,2))

pl_fig.show()

To start, we can very clearly see how different these subsets of data are from each other. Statistically they are nearly indistinguishable (the linear trend lines are practically identical!). However, it is through visualization that we can see just how different they are!
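
The similarity of the trend lines can also be checked numerically. The sketch below refits each subset with `np.polyfit` and computes the Pearson correlation; the dictionary `quartet` and the name `fits` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Anscombe's data as listed in the notebook; sets I-III share the same x values
x_shared = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    'I': (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8.0] * 7 + [19.0] + [8.0] * 3,
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient comes first
    r = np.corrcoef(x, y)[0, 1]             # Pearson correlation of x and y
    fits[name] = (slope, intercept, r)
    print(f"{name:>3}: y = {intercept:.2f} + {slope:.3f} x, r = {r:.3f}")
```

All four fits land on roughly the same line, with a slope near 0.5 and an intercept near 3, which is why the dashed lines in the plots look the same even though the point clouds do not.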

Now let's talk about the plotting libraries and how they visualize the data differently. Which one do you like best? Why? They all show the data, but the default settings vary greatly between them. In fact, not only is everything rendered differently, but the keywords and functions needed to get these plots are different enough from each other to make it hard to remember which goes with which!

Let's summarize the plots:

* matplotlib
    * `+ ` Simple to look at
    * `+ ` Visually compact
    * `+ ` Sensible default size
    * `- ` Aspect ratio is not equal
    * `- ` Static image, cannot zoom, pan, etc.
    * `- ` Labels are overlapping
    

* bokeh
    * `+ ` Large plotting area
    * `+ ` Interactive zooming, panning
    * `+ ` Small features
    * `+ ` No overlapping labels
    * `- ` Difficult to visually find subplot labels
    * `- ` Large plotting area
    * `- ` Small features

* plotly
    * `+ ` Hovering over points provides information
    * `+ ` Background color is pleasing
    * `+ ` Default grid is nice
    * `+ ` Labels are all easy to locate
    * `+ ` No overlapping labels
    * `+ ` Different scatter point colors per plot are nice (though unnecessary)
    * `- ` Extreme skewing of aspect ratios
    * `- ` While legends are nice, this one isn't
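
As one illustration, the overlapping labels and mismatched axes noted for `matplotlib` can usually be handled when the figure is created. This is a rough sketch rather than a recommended configuration: it assumes a headless `Agg` backend and uses placeholder points instead of the quartet data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no GUI available, render off-screen
import matplotlib.pyplot as plt

# constrained_layout spaces the subplots so titles and tick labels do not collide;
# sharex/sharey give all four panels identical axis ranges
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, constrained_layout=True)

for ax, title in zip(axes.flat, ["I", "II", "III", "IV"]):
    ax.scatter([4.0, 9.0, 14.0], [5.0, 7.5, 10.0])  # placeholder points
    ax.set_title(title)

fig.canvas.draw()  # force the layout to be computed
```

Each library has its own equivalent of these switches, which is exactly the kind of per-backend knowledge the next section tries to avoid.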

We can certainly configure all of these plots to look effectively the same, but that is hardly a useful task. Instead, we want to move in a direction where the plotting API we are using becomes transparent. We want to define our plots abstractly and tell the visualization how to render. This is where `HoloViews` comes into play. `HoloViews` is an abstraction library that puts the focus on defining the visualization with your data. Let's use it to create the above three plots.

In [41]:
import holoviews as hv

def hv_plot_quartet(title: str, df: pd.DataFrame) -> hv.Overlay:
    # `title` is not used yet; subset titles are added later through options
    coef = np.polyfit(df.x, df.y, 1)
    poly1d_fn = np.poly1d(coef)
    # Pass the whole DataFrame; HoloViews picks up the x and y columns
    scatter = hv.Scatter(df)
    # The trend line also needs tabular data: a dict of columns (or a pd.DataFrame) works
    curve = hv.Curve({'x': xs, 'y':poly1d_fn(xs)})
    return scatter * curve

hv_q1 = hv_plot_quartet('I', q1)
hv_q2 = hv_plot_quartet('II', q2)
hv_q3 = hv_plot_quartet('III', q3)
hv_q4 = hv_plot_quartet('IV', q4)

hv_anscombes_quartet = hv_q1 + hv_q2 + hv_q3 + hv_q4
hv_anscombes_quartet.cols(2)

So there is a lot to unpack here. First off, there is no plot displayed! That is because we have not told `HoloViews` what to do with our *description* of our plot. So before we go any further, let's break down the description.

We defined a function just like before to simplify the repeated code for each subset (and similar to the `bokeh` function we are returning something!). We start the function by computing the linear regression, but after that things look very different!

Let's look at the scatter plot:

```python
scatter = hv.Scatter(df)
```

This is all that is needed! We are passing the dataframe *directly* into `HoloViews`. With no further arguments, it treats the first column (`x`) as the independent variable and the second (`y`) as the dependent one, so a frame with sensibly ordered columns just works.


Let's look at the line plot (in `HoloViews` it is called a `Curve`):

```python
curve = hv.Curve({'x': xs, 'y':poly1d_fn(xs)})
```

Here we are passing a dictionary that maps the column names `x` and `y` to their data. `HoloViews` takes care of the rest! But there is more here to look at. Does anything look odd?

```python
return scatter * curve
```

We are *multiplying* two plots together. This operation *overlays* them into a single plot! Once we are done defining the function, we call it for each subset, and then we *add* them together. Adding two plots in `HoloViews` results in composing them side by side, with their axes linked together. We can then display it in `Jupyter` like anything else, as well as tell it to arrange the subplots into 2 columns!

Now that we have all of that out of the way, let's actually tell `HoloViews` to plot the data.

In [42]:
hv.extension('matplotlib', 'bokeh', 'plotly')
hv.output(hv_anscombes_quartet, backend='matplotlib')

In [43]:
hv.extension('matplotlib', 'bokeh', 'plotly')
hv.output(hv_anscombes_quartet, backend='bokeh')

In [44]:
hv.extension('matplotlib', 'bokeh', 'plotly')
hv.output(hv_anscombes_quartet, backend='plotly')

The default visualization from each plotting API has been drastically improved by `HoloViews`, but there is still some work to do! Notably, subset titles are missing, the trend line is not a dashed black line, and in the case of `plotly` the axes are all different!

Let's keep the `bokeh` render output, but let's update `hv_anscombes_quartet` to include titles for the subsets and dashed black lines for the linear trends. To do this we can access the internal elements and update their *options*. Usually, though, we would apply these options when creating the visual elements, since we know which backend we wish to use. We will almost exclusively use `bokeh`.

While we are at it, we will also add a grid and enlarge the scatter points.

In [45]:
print(hv_anscombes_quartet)

:Layout
   .Overlay.I   :Overlay
      .Scatter.I :Scatter   [x]   (y)
      .Curve.I   :Curve   [x]   (y)
   .Overlay.II  :Overlay
      .Scatter.I :Scatter   [x]   (y)
      .Curve.I   :Curve   [x]   (y)
   .Overlay.III :Overlay
      .Scatter.I :Scatter   [x]   (y)
      .Curve.I   :Curve   [x]   (y)
   .Overlay.IV  :Overlay
      .Scatter.I :Scatter   [x]   (y)
      .Curve.I   :Curve   [x]   (y)


In [49]:
import holoviews as hv
hv.extension('bokeh')
for idx, subplot in enumerate(hv_anscombes_quartet):
    subplot.opts(title='I' * (idx+1) if idx < 3 else 'IV', show_grid=True)
    subplot.opts(hv.opts.Scatter(size=8))
    subplot.opts(hv.opts.Curve(color='black', line_dash='dashed'))
hv_anscombes_quartet
hv.save(hv_anscombes_quartet, 'my_plot.html')

In [17]:
def hv_plot_quartet(title, df):
    coef = np.polyfit(df.x, df.y, 1)
    poly1d_fn = np.poly1d(coef)
    scatter = hv.Scatter(df).opts(size=8)
    curve = hv.Curve({'x': xs, 'y':poly1d_fn(xs)}).opts(color='black', line_dash='dashed')
    return (scatter * curve).opts(show_grid=True, title=title)

hv_anscombes_quartet = (
    hv_plot_quartet('I', q1) + hv_plot_quartet('II', q2)
    + hv_plot_quartet('III', q3) + hv_plot_quartet('IV', q4)
).cols(2)
hv_anscombes_quartet
hv.save(hv_anscombes_quartet, 'my_plot2.html')

## Exercise

Instead of plotting the four subsets in different subplots, plot them all on the same plot. When creating the `Scatter` elements you can specify a `label` to indicate which subset is being plotted (e.g. `label='I'`). Note that we do not need a custom function to plot anything as we did before.

## Plot Types

`HoloViews` provides a plethora of plot and graph types, all of which are highly customizable. Note, though, that not every plot type is available through each of the plotting backends. For example, 3D plotting is only available through `matplotlib` and `plotly`, as `bokeh` has no support for 3D visualizations.

In [18]:
import holoviews as hv
import numpy as np
import pandas as pd

t = np.linspace(0.0, 100.0, 1000)
df = pd.DataFrame({
    'x': np.cos(t) * t,
    'y': np.sin(t) * t,
    'z': t,
})

In [19]:
hv.extension('matplotlib')
hv.Scatter3D(df)

In [20]:
hv.extension('plotly')
hv.Scatter3D(df)

In [21]:
hv.extension('bokeh')
hv.Scatter3D(df)  # no plotting is actually available; setting options is an error!

:Scatter3D   [x,y,z]

Here are some of the most common plot elements that we will make use of:

* `Scatter`
* `Curve`
* `Bars`
* `HeatMap`
* `Image`
* `Histogram`

A complete list of everything that `holoviews` offers can be found at the [`HoloViews` Reference Gallery](https://holoviews.org/reference/index.html).

## Dimensions

One of the strongest features of `HoloViews` is the ability to map dimensions of our data directly to aspects of our visualization. We can control colors, sizes, and more using dimensions from our data, and thanks to the direct integration with `pandas` we can map columns to these customizations. Furthermore, we can manipulate the dimensions with various operations.


In [22]:
# use the z column (which gets automatically normalized) to set colors for each point
hv.extension('plotly')
hv.Scatter3D(df).opts(color='z')

In [23]:
# sets the size equal to the 'z' column divided by 10
hv.Scatter3D(df).opts(
    color='z',
    size=hv.dim('z')/10
)