# Tabular Datasets

In [None]:
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh', 'matplotlib')

As we have already discovered, Elements are simple wrappers around your data that provide a semantically meaningful representation. To work directly with a wide range of data formats, HoloViews provides flexible and extensible interfaces covering two main categories of supported data types:

   * **Tabular:** Tables of flat columns usually in a tidy format
   * **Gridded:** N-dimensional array-like data
   
Here we will take a quick tour of how to work with such datasets; for more detail, check out the [Columnar Data](...) and [Gridded Data](Gridded_Data.ipynb) user guides. Look especially for details on all the supported formats, which include simple dictionaries of column arrays, pandas' ``DataFrame``, dask's ``DataFrame``, xarray's ``DataArray`` and ``Dataset``, and more. Here we will use two of the most flexible and powerful formats, **pandas** DataFrames and **xarray** Datasets, to provide a quick overview and introduction.

## Tabular

Tabular data is one of the most common, general and versatile data formats. At the same time, there are many different ways tabular data can be laid out. For interactive analysis the so-called **tidy format** is the most flexible and simple: the **columns** of the table represent **variables** or **dimensions**, and the **rows** represent **observations**. The best way to understand this format is to look at such a dataset:

In [None]:
diseases = pd.read_csv('../assets/diseases.csv.gz')
diseases.head()

This particular dataset was the subject of an excellent piece of visual journalism in the [WSJ](http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15), detailing the incidence of various diseases over time, and was downloaded from the [University of Pittsburgh's Project Tycho](http://www.tycho.pitt.edu/). We can see we have 5 columns corresponding to different variables. We can also make a distinction between the variables. 'Year', 'Week', and 'State' are independent variables and the 'measles' and 'pertussis' columns are the observed or dependent variables. In HoloViews these map onto key dimensions (**kdims**) and value dimensions (**vdims**) respectively.
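
To make the tidy layout concrete, here is a minimal sketch using plain pandas. The `wide` table and its numbers are made up for illustration and are not taken from the diseases dataset; it only shows how a table with one column per year can be reshaped so that every row is a single observation and every column a single variable.

```python
import pandas as pd

# Made-up values for illustration only: one column per year ("wide" layout).
wide = pd.DataFrame({
    'State': ['Texas', 'Ohio'],
    '1980':  [1.5, 0.5],
    '1981':  [2.0, 0.25],
})

# Reshape so that 'Year' becomes a variable (column) of its own and each
# row holds exactly one observation of 'measles'.
tidy = wide.melt(id_vars='State', var_name='Year', value_name='measles')
tidy['Year'] = tidy['Year'].astype(int)

print(tidy)
```

In the tidy result, 'State' and 'Year' play the role of independent variables and 'measles' is the observed value, mirroring the key and value dimensions described above.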

This is a fairly complex dataset and we can't visualize it all at once. Therefore we will declare a ``Dataset`` Element, which acts as a powerful wrapper for our data and allows us to add additional metadata. One of the most common pieces of metadata is a more descriptive label for a dimension of our data. We will give the measles and pertussis columns in our dataset more readable labels by supplying a tuple of the form **``(name, label)``** as the dimension. In this notebook we don't need the ``Week`` column, so we will quickly aggregate the data, computing the mean incidence for each ``Year`` and ``State`` (don't worry, we will cover that later on).

In [None]:
vdims = [('measles', 'Measles Incidence'), ('pertussis', 'Pertussis Incidence')]
ds = hv.Dataset(diseases, kdims=['Year', 'State'], vdims=vdims)
ds = ds.aggregate(function=np.mean)
ds

The ``repr`` shows us both the ``kdims`` (in square brackets) and the ``vdims`` (in parentheses) of the ``Dataset``. Now we just have to find the right visualizations to answer the questions we want to ask about the data. For that we can pick from the large library of [Elements] to visualize our data.

Perhaps the most natural representation of this dataset is as a Curve displaying the incidence for each year, for each state. So let's just display it that way. Using the ``.to`` interface we can map the dimensions of our ``Dataset`` onto the dimensions of an Element. To display a timeseries we will pick the ``Curve`` element and specify ``'Year'`` as the key dimension and ``'Measles Incidence'`` as the value dimension, which we will refer to by its name (``'measles'``) rather than the more readable but also more verbose label. We will do the same for ``'Pertussis Incidence'`` and lay out the two plots.

In [None]:
%%opts Curve [width=600 height=250] {+framewise}
(ds.to(hv.Curve, 'Year', 'measles') + ds.to(hv.Curve, 'Year', 'pertussis')).cols(1)

You will immediately notice that we automatically received a dropdown menu to select which State to view. The ``.to`` interface automatically groups your data by all the key dimensions you didn't assign to the Element, which in this case just leaves the 'State'. To explicitly specify which key dimensions to group over simply supply a list or single dimension as the third positional or ``groupby`` keyword argument.

#### Selecting

One of the most common things we might want to do is ``select`` only a subset of the data. The ``select`` method makes this extremely easy, letting you select a single value, several values supplied as a list, or a range of values supplied as a tuple. Here we will use ``select`` to display the measles incidence in four states over the 1980s. After applying the selection we again use the ``.to`` method to display the data as ``Bars`` indexed by the 'Year' and 'State' key dimensions and displaying the 'Measles Incidence':

In [None]:
%%opts Bars [width=800 height=400 tools=['hover'] group_index=1 legend_position='top_left']
states = ['New York', 'New Jersey', 'California', 'Texas']
ds.select(State=states, Year=(1980, 1990)).to(hv.Bars, ['Year', 'State'], 'measles').sort()
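
To illustrate what the list and tuple selections mean, here is a rough plain-pandas analogue on a small made-up table. It is a sketch of the semantics, not HoloViews' own implementation, and it assumes the tuple is treated as a half-open range (lower bound included, upper bound excluded), which is consistent with ``(1980, 1990)`` describing the 1980s.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    'Year':    [1979, 1980, 1989, 1990, 1985],
    'State':   ['Texas', 'Texas', 'Ohio', 'Texas', 'Texas'],
    'measles': [5.0, 4.0, 3.0, 2.0, 1.0],
})

states = ['Texas']          # a list selects any of the listed values
start, stop = 1980, 1990    # a tuple selects a range; ASSUMED half-open [start, stop)

mask = (
    df['State'].isin(states)
    & (df['Year'] >= start)
    & (df['Year'] < stop)
)
subset = df[mask]

print(subset)
```

Under that assumption only the Texas rows for 1980 and 1985 remain: 1979 and 1990 fall outside the range, and the 1989 row belongs to a state that was not selected.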

#### Faceting

Above we already saw what happens to key dimensions that we didn't explicitly assign to the Element using the ``.to`` method: they are grouped over and pop up a set of widgets to select the values we want. Often, however, we want to facet the data in other ways, and HoloViews lets you do this very easily using the ``.overlay``, ``.grid`` and ``.layout`` methods. Using the ``.grid`` method we can lay out the selected states instead:

In [None]:
%%opts Curve [width=200] (color='indianred')
grouped = ds.select(State=states, Year=(1930, 2005)).to(hv.Curve, 'Year', 'measles')
grouped.grid('State')

We can take the same grouped object and ``overlay`` the individual curves instead of laying them out in a grid. These faceting methods even compose together, meaning that if we had more key dimensions we could ``.overlay`` one dimension, ``.grid`` another and have a widget for any other remaining key dimensions:

In [None]:
%%opts Curve [width=600] (color=Cycle(values=['indianred', 'slateblue', 'lightseagreen', 'coral']))
grouped.overlay('State')

#### Aggregating

Besides selecting a subset of the data, another common operation supported by HoloViews is computing aggregates. When we first loaded this dataset we aggregated over the 'Week' column to compute the mean incidence for every year, reducing our data significantly. The ``aggregate`` method is therefore very useful for computing statistics from our data.

A simple example using our dataset is to compute the mean and standard deviation of the Measles Incidence by ``'Year'``. We can express this simply by passing the key ``dimensions`` to aggregate by (in this case just the 'Year', so values are pooled across states) along with a ``function`` and an optional ``spreadfn`` to compute the statistics we want. The ``spreadfn`` will append the name of the function to the dimension name (here giving ``measles_std``) so we can reference it separately. Once we have computed the aggregate we can simply cast it to a ``Curve`` and ``ErrorBars``:

In [None]:
%%opts Curve [width=600]
agg = ds.aggregate('Year', function=np.mean, spreadfn=np.std)
(hv.Curve(agg) * hv.ErrorBars(agg, vdims=['measles', 'measles_std'])).redim.range(measles=(0, None))

In this way we can summarize a multi-dimensional dataset into something that can be visualized more easily, and compute arbitrary statistics along a dimension.
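
For readers more familiar with pandas, the aggregation above corresponds roughly to a ``groupby`` followed by a mean and a standard deviation. The sketch below uses made-up numbers and is only an analogy, not how HoloViews implements ``aggregate`` internally; the ``measles_std`` column name mirrors the one used in the ``ErrorBars`` call above.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    'Year':    [1980, 1980, 1981, 1981],
    'State':   ['Texas', 'Ohio', 'Texas', 'Ohio'],
    'measles': [1.0, 3.0, 2.0, 6.0],
})

# Group by the dimension we keep ('Year'); the states are pooled together.
grouped = df.groupby('Year')['measles']

agg = pd.DataFrame({
    'measles':     grouped.mean(),
    # ddof=0 gives the population standard deviation, matching np.std's default.
    'measles_std': grouped.std(ddof=0),
}).reset_index()

print(agg)
```

Note that pandas' own ``std`` defaults to ``ddof=1``, so the explicit ``ddof=0`` is what keeps this sketch in line with ``np.std``.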

### Other data

If you want to know more about working with tabular data, particularly when using datatypes other than pandas, have a look at the [Tabular Data] user guide. The different interfaces allow you to work with everything from simple NumPy arrays to out-of-core dataframes using dask, which will scale to visualizations of billions of rows, particularly when using the [datashader](https://anaconda.org/jbednar/holoviews_datashader/notebook) integration in HoloViews.