# Tabular Datasets

HoloViews can work with a wide variety of data types, but many of them can be categorized as either:

   * **Tabular:** Tables of flat columns, or
   * **Gridded:** Array-like data on 2-dimensional or N-dimensional grids

This module provides an overview and introduction to one of the most flexible and powerful formats: Tabular **Pandas** DataFrames.

## Tabular

Tabular data is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet.

In [3]:
import numpy as np
import pandas as pd

import holoviews as hv
from holoviews import opts

hv.extension('bokeh', 'matplotlib')

In [2]:
# Load the diseases dataset
diseases = pd.read_csv('../assets/diseases.csv.gz') # .gz file is a compressed CSV
diseases.head()

Unnamed: 0,Year,Week,State,measles,pertussis
0,1928,1,Alabama,3.67,
1,1928,2,Alabama,6.25,
2,1928,3,Alabama,7.95,
3,1928,4,Alabama,12.58,
4,1928,5,Alabama,8.03,


In [6]:
len(diseases)

222768

We can see we have 5 data columns, which each correspond either to independent variables that specify a particular measurement ('Year', 'Week', 'State'), or observed/dependent variables reporting what was then actually measured (the 'measles' or 'pertussis' incidence). 

Knowing the distinction between those two types of variables is crucial for doing visualizations, but it is often not declared. For example, plotting 'Week' against 'State' would not be meaningful, whereas 'measles' for each 'State' (averaging or summing across the other dimensions) would be fine, and there's no way to deduce those constraints from the tabular format.
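One informal way to test a guess about which columns are independent is to check whether they uniquely identify each row. The sketch below does this in plain pandas on a small made-up table. The values are illustrative and not taken from the diseases dataset, and `toy` and `key_columns` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the diseases table; the values are made up for illustration.
toy = pd.DataFrame({
    'Year':    [1928, 1928, 1928, 1928],
    'Week':    [1, 2, 1, 2],
    'State':   ['Alabama', 'Alabama', 'Texas', 'Texas'],
    'measles': [3.0, 6.0, 1.0, 2.0],
})

key_columns = ['Year', 'Week', 'State']

# If these columns are the independent variables, no combination should repeat.
keys_are_unique = not toy.duplicated(subset=key_columns).any()

# Dropping 'Week' leaves repeated (Year, State) pairs, so the rows would need
# to be aggregated before they could be indexed by Year and State alone.
partial_keys_unique = not toy.duplicated(subset=['Year', 'State']).any()

print(keys_are_unique, partial_keys_unique)
```

A check like this only shows that a set of columns *can* serve as keys; deciding which columns are meaningful as independent variables still requires knowledge of the data.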

We will first make a HoloViews object called a ``Dataset`` that declares the independent variables (called key dimensions or **kdims** in HoloViews) and dependent variables (called value dimensions or **vdims**) that you want to work with:

In [3]:
# Create a HoloViews Dataset

# Define the dependent (value) dimensions; each short column name is paired with a more descriptive label
vdims = [('measles', 'Measles Incidence'), ('pertussis', 'Pertussis Incidence')]
ds = hv.Dataset(diseases, ['Year', 'State'], vdims) # 'Year' and 'State' are the key dimensions
ds

:Dataset   [Year,State]   (Measles Incidence,Pertussis Incidence)

Here we've used an optional tuple-based syntax **``(name,label)``** to specify a more meaningful description for the ``vdims``, while using the original short descriptions for the two ``kdims``.  We haven't yet specified what to do with the ``Week`` dimension, but we are only interested in yearly averages, so let's just tell HoloViews to average over all remaining dimensions:

In [4]:
# Aggregate the dataset by taking the mean over the remaining 'Week' dimension for each Year and State
ds = ds.aggregate(function=np.mean)
ds

:Dataset   [Year,State]   (Measles Incidence,Pertussis Incidence)

The ``Week`` dimension can now be ignored.
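For intuition, the aggregation above corresponds roughly to a group-by in plain pandas: group on the declared key dimensions and average the value columns. The following is a minimal sketch on made-up data (the `toy` table and its values are assumptions for illustration), not a description of how HoloViews implements `aggregate` internally.

```python
import pandas as pd

# Toy stand-in for the weekly data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year':    [1928, 1928, 1928, 1928, 1929, 1929],
    'Week':    [1, 2, 1, 2, 1, 2],
    'State':   ['Alabama', 'Alabama', 'Texas', 'Texas', 'Alabama', 'Alabama'],
    'measles': [2.0, 4.0, 1.0, 3.0, 10.0, 20.0],
})

# Group by the key dimensions and average the value column.
# 'Week' is not a grouping key, so it is averaged away.
yearly = toy.groupby(['Year', 'State'], as_index=False)['measles'].mean()

print(yearly)
```

Each (Year, State) pair now appears exactly once, which is why the aggregated ``Dataset`` can be indexed by those two key dimensions alone.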

The ``repr`` shows us both the ``kdims`` (in square brackets) and the ``vdims`` (in parentheses) of the ``Dataset``.  Because it can hold arbitrary combinations of dimensions, a ``Dataset`` is *not* immediately visualizable. There's no single clear mapping from these four dimensions onto a two-dimensional page, hence the textual representation shown above.

To make this data visualizable, we'll need to provide a bit more metadata, by selecting one of the library of Elements that can help answer the questions we want to ask about the data. Perhaps the most obvious representation of this dataset is as a ``Curve`` displaying the incidence for each year, for each state. We could pull out individual columns one by one from the original dataset, but now that we have declared information about the dimensions, the cleanest approach is to map the dimensions of our ``Dataset`` onto the dimensions of an Element using ``.to``:

In [5]:
# Create curves of measles and pertussis incidence over time from the Dataset
# Stack the two curves vertically (using the + operator and .cols(1))
layout = (ds.to(hv.Curve, 'Year', 'measles') + ds.to(hv.Curve, 'Year', 'pertussis')).cols(1)
# Customize the layout appearance using hv.Curve options:
# set width and height for all curves; framewise=True recomputes the axis ranges for each frame (here, each selected State)
layout.opts(
    opts.Curve(width=600, height=250, framewise=True))

Here we specified two ``Curve`` elements showing measles and pertussis incidence respectively (the vdims), per year (the kdim), and laid them out in a vertical column.  Note that even though we specified only the short name for the value dimensions, the plot shows the longer names ("Measles Incidence", "Pertussis Incidence") that we declared on the ``Dataset``.

We automatically received a dropdown menu to select which ``State`` to view. Each ``Curve`` ignores unused value dimensions, because additional measurements don't affect each other, but HoloViews has to do *something* with every key dimension for every such plot.  If the ``State`` (or any other key dimension) isn't somehow plotted or aggregated over, then HoloViews has to leave choosing a value for it to the user, hence the selection widget. 

### Selecting

One of the most common things we might want to do is to select only a subset of the data. The ``select`` method supports this, letting the user select a single value, several values supplied as a list, or a range of values supplied as a tuple. We will use ``select`` to display the measles incidence in four states over one decade. After applying the selection, we use the ``.to`` method as shown earlier, now displaying the data as ``Bars`` indexed by the 'Year' and 'State' key dimensions and displaying the 'Measles Incidence' value dimension:

In [30]:
# Create a bar chart of measles incidence for selected states between 1980 and 1990
# Specify the states and year range to select
states = ['New York', 'New Jersey', 'California', 'Texas']

# Create the bar chart
# Select the data for the specified states and years, convert to Bars, and sort. State and Year are kdims
bars = ds.select(State=states, # kdim selection by a list of values
                 Year=(1980, 1990) # kdim range selection
                 ).to(hv.Bars, # convert ds to a Bars plot
                      ['Year', 'State'], # use both kdims on the x-axis (State nested within Year)
                      'measles').sort() # use measles incidence as the vdim and sort the bars

# Customize the bar chart appearance using hv.Bars options
bars.opts(opts.Bars(width=800, height=400, 
            tools=['hover'], # add hover tool
            xrotation=90, # rotate x-axis labels for readability
            title='Measles Incidence (1980-1990) for Selected States', # add title 
            show_legend=False) # hide legend
    )
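The list and range forms of selection can be pictured as boolean masks over the underlying table. The sketch below mimics them in plain pandas on made-up data. Treating the range as half-open, `[start, stop)`, is an assumption made for this illustration, so consult the HoloViews documentation for the exact boundary behavior of ``select``. The names `toy`, `start`, and `stop` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the yearly data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year':    [1979, 1980, 1985, 1990, 1985],
    'State':   ['Texas', 'Texas', 'Ohio', 'Texas', 'California'],
    'measles': [5.0, 4.0, 3.0, 2.0, 1.0],
})

states = ['Texas', 'California']  # list selection
start, stop = 1980, 1990          # range selection

in_states = toy['State'].isin(states)
# Assumption: the range is half-open, keeping start <= Year < stop.
in_years = (toy['Year'] >= start) & (toy['Year'] < stop)

# Both conditions must hold, just as keyword arguments to select combine.
subset = toy[in_states & in_years]

print(subset)
```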

### Faceting
A facet refers to a specific type of plot layout where you display multiple small plots based on the unique values of one or more categorical dimensions in your data.

Above we already saw what happens to key dimensions that we didn't explicitly assign to the Element using the ``.to`` method: they are grouped over, popping up a set of widgets so the user can select the values to show at any one time. Using widgets is not always the most effective way to view the data, and a ``Dataset`` lets you specify other alternatives using the ``.overlay``, ``.grid`` and ``.layout`` methods. For instance, we can lay out each state separately using ``.grid``:

In [None]:

# Create small multiples (grid space) of measles incidence curves for selected states between 1930 and 2005
grouped = ds.select(State=states, # kdim selection
                    Year=(1930, 2005) # kdim range selection
                    ).to(hv.Curve, 'Year', 'measles') # convert to Curves with Year as x-axis and measles incidence as y-axis
# Create the grid space by grouping the curves by State
gridspace = grouped.grid('State')
# Customize the grid space appearance using hv.Curve options
gridspace.opts(
    opts.Curve(color='blue')) # set the color for all curves

Or we can take the same grouped object and ``.overlay`` the individual curves instead of laying them out in a grid:

In [72]:
# Create an overlay of measles incidence curves for selected states
ndoverlay = grouped.overlay('State')

# 1. Apply Element-Level Options (to customize the curves)
ndoverlay = ndoverlay.opts(
    # Only include options that style the *individual* curves
    opts.Curve(color=hv.Cycle(values=['indianred', 'slateblue', 'lightseagreen', 'coral']), tools=['hover'] )
)

# 2. Apply Plot/Layout-Level Options (to customize the FIGURE/LAYOUT)
# These options must be applied to the overall ndoverlay container.
ndoverlay = ndoverlay.opts(
    # Layout/Figure Options
    width=900,
    height=400, 
    legend_position='top_right', # position the legend
    title='Measles Incidence (1930-2005) for Selected States',
)

ndoverlay

These faceting methods even compose together, meaning that if we had more key dimensions we could ``.overlay`` one dimension, ``.grid`` another and have a widget for any other remaining key dimensions.
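Conceptually, faceting splits the data into one sub-table per value of the faceted dimension; each sub-table then becomes a panel (for ``.grid`` or ``.layout``) or a line (for ``.overlay``). The following pandas sketch shows only that grouping step on made-up data. It is an analogy under those assumptions, not the HoloViews implementation, and `toy` and `facets` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the yearly data; the values are made up for illustration.
toy = pd.DataFrame({
    'Year':    [1930, 1931, 1930, 1931],
    'State':   ['Texas', 'Texas', 'Ohio', 'Ohio'],
    'measles': [4.0, 5.0, 1.0, 2.0],
})

# One sub-table per State: each would become one panel in a grid
# or one curve in an overlay.
facets = {state: group[['Year', 'measles']].reset_index(drop=True)
          for state, group in toy.groupby('State')}

for state, table in facets.items():
    print(state, table['measles'].tolist())
```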

### Aggregating

Instead of selecting a subset of the data, another common operation supported by HoloViews is computing aggregates. When we first loaded this dataset, we aggregated over the 'Week' column to compute the mean incidence for every year, thereby reducing our data significantly. The ``aggregate`` method is therefore very useful to compute statistics from our data.

We can use our dataset to compute the mean and standard deviation of the Measles Incidence by ``'Year'``. We can express this by passing the key dimension(s) to group by (in this case just 'Year', so the remaining key dimension, 'State', is aggregated away) along with a function and an optional ``spreadfn`` to compute the statistics we want. The ``spreadfn`` will append the name of the function to the dimension name so we can reference the computed value separately. Once we have computed the aggregate, we can cast it to a ``Curve`` and ``ErrorBars``:

In [None]:
# Compute the mean measles incidence across all states, plus the standard deviation for error bars
agg = ds.aggregate('Year', # kdim to group by
                   function=np.mean, # mean incidence across all states
                   spreadfn=np.std # standard deviation for error bars
                   )

agg # note the additional vdims with _std suffixes

:Dataset   [Year]   (Measles Incidence,measles_std,Pertussis Incidence,pertussis_std)
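To make the `_std` columns concrete, the same pair of statistics can be sketched in plain pandas on made-up data. This sketch explicitly uses the population standard deviation (`ddof=0`, the default of `np.std`). Whether a given HoloViews data backend ends up using `ddof=0` or `ddof=1` is not asserted here, and `toy` and `stats` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the yearly per-state data; the values are made up.
toy = pd.DataFrame({
    'Year':    [1928, 1928, 1929, 1929],
    'State':   ['Alabama', 'Texas', 'Alabama', 'Texas'],
    'measles': [2.0, 4.0, 10.0, 20.0],
})

grouped = toy.groupby('Year')['measles']

stats = pd.DataFrame({
    'measles': grouped.mean(),
    # Assumption: population standard deviation (ddof=0), as np.std defaults to.
    'measles_std': grouped.std(ddof=0),
}).reset_index()

print(stats)
```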

In [89]:
# Create the error bars, subsampling to reduce visual clutter
errorbars = hv.ErrorBars(agg, # pass the aggregated dataset
                         vdims=['measles', 'measles_std']).iloc[::2] # use measles (mean) and its standard deviation; keep every other point for clarity
# Create a curve of the mean incidence and overlay it with the error bars
overlay = (hv.Curve(agg) * errorbars).redim.range(measles=(0, None)) # start the y-axis at 0, since incidence cannot be negative
# Customize the overlay appearance using hv.Curve and hv.ErrorBars options
overlay = overlay.opts(
    opts.Curve(color='navy', 
               line_width=2, 
               tools=['hover']), # Customize the curve appearance
    opts.ErrorBars(color='gray', 
                   line_width=1), # Customize the error bars appearance
    )

# Customize the overall overlay appearance using layout/figure options
overlay = overlay.opts(width=900, height=500, title='Mean Measles Incidence (All States) with Standard Deviation Error Bars') # set size and add title
overlay

In this way we can summarize a multi-dimensional dataset as something that can be visualized directly, while allowing us to compute arbitrary statistics along a dimension.

Now let's explore [Gridded Datasets](./4-Gridded_Datasets.ipynb).