<a href='http://www.holoviews.org'><img src="assets/hv+bk.png" alt="HV+BK logos" width="40%;" align="left"/></a>
<div style="float:right;"><h2>04. Working with multi-dimensional datasets</h2></div>

As we have already discovered, elements are simple wrappers around your data that provide a semantically meaningful representation. The real power of HoloViews becomes most evident when working with larger, multi-dimensional datasets, whether they are tabular, as in a database or CSV file, or gridded, as in large collections of images.

Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having [**tidy data**](http://www.jeannicholashould.com/tidy-data-in-python.html) provides flexibility and simplicity.

In this tutorial everything you have learned in the previous sections will pay off. We will discover how to facet data and use different element types to explore and visualize the data contained in a real dataset.

In [None]:
import numpy as np
import scipy.stats as ss
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
%opts Curve Scatter Bars [tools=['hover']]

## What is tabular, tidy data?

In [None]:
macro_df = pd.read_csv('../data/macro.csv')
macro_df.head()

For tidy data, the **columns** of the table represent **variables** or **dimensions** and the **rows** represent **observations**. 

The opposite of tidy data is so-called **wide** data. To see what that looks like, we can use the pandas ``pivot_table`` method:

In [None]:
wide = macro_df.pivot_table('unem', 'year', 'country')
wide.head(5)

In this wide format each column holds the unemployment figures for one country, and each row corresponds to a particular year. To go from wide to tidy data you can use the ``pd.melt`` function:

In [None]:
melted = pd.melt(wide.reset_index(), id_vars='year',  value_name='unemployment')
melted.head()
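The round trip between the two layouts is easier to follow on a table small enough to read in full. The sketch below uses a hypothetical four-row dataset (the countries `A` and `B` and all values are made up for illustration, not taken from `macro.csv`):

```python
import pandas as pd

# Hypothetical toy data in tidy form: one row per (country, year) observation
tidy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1966, 1967, 1966, 1967],
    'unem': [1.0, 2.0, 4.0, 5.0],
})

# Tidy -> wide: one column per country, one row per year
wide_toy = tidy.pivot_table('unem', 'year', 'country')

# Wide -> tidy: back to one row per (year, country) pair
tidy_again = pd.melt(wide_toy.reset_index(), id_vars='year', value_name='unem')

print(wide_toy)
print(tidy_again)
```

Because the wide table's column index is named `country`, `pd.melt` reuses that name for the melted variable column.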

## Declaring dimensions

Mathematical variables can usually be described as **dependent** or **independent**. In HoloViews these correspond to value dimensions and key dimensions (respectively).

In this dataset ``'country'`` and ``'year'`` are independent variables or key dimensions, while the remainder are automatically inferred as value dimensions:

In [None]:
macro = hv.Dataset(macro_df, ['country', 'year'])
macro
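The inference rule described above can be mimicked in plain pandas. This is only a sketch of the idea on hypothetical toy data, not a description of how HoloViews implements it internally:

```python
import pandas as pd

# Hypothetical toy table with the same column names as the macro dataset
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1966, 1967, 1966, 1967],
    'unem': [1.0, 2.0, 4.0, 5.0],
    'growth': [2.0, 4.0, 1.0, 3.0],
})

# The key dimensions are declared explicitly ...
kdims = ['country', 'year']

# ... and every remaining column is treated as a value dimension
vdims = [col for col in toy.columns if col not in kdims]

print(vdims)
```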

We will also give the dimensions more sensible labels using ``redim.label``:

In [None]:
macro = macro.redim.label(growth='GDP Growth', unem='Unemployment', year='Year', country='Country')

## Groupby

The great thing about tabular data is that we can easily group the data by a particular variable allowing us to plot or analyze each subset separately. Let's say for instance that we want to break the macro-economic data down by 'year'. Using the groupby method we can easily split the Dataset into subsets:

In [None]:
print(macro.groupby('Year'))
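For comparison, the same split can be sketched with pandas alone. The toy data below is hypothetical, and the plain dictionary only stands in for the container that HoloViews returns:

```python
import pandas as pd

# Hypothetical toy data: two countries observed over three years
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1966, 1967, 1968, 1966, 1967, 1968],
    'unem': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One sub-table per year, keyed by the value of the grouping variable
subsets = {year: group for year, group in toy.groupby('year')}

for year, group in subsets.items():
    print(year, group['country'].tolist(), group['unem'].tolist())
```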

The ``Dataset`` itself does not have a visual representation, however. To make this powerful capability easy to use, HoloViews provides the convenient ``.to`` method, which allows us to both group the dataset and map dimensions to elements at the same time.

## Mapping dimensions to elements

Once we have a ``Dataset`` with multiple dimensions we can map these dimensions onto elements using the ``.to`` method. The method takes four main arguments:

1. The element you want to convert to
2. The key dimensions (the independent variables) to display
3. The dependent variables to display
4. The dimensions to group by

As a first simple example let's go through such a declaration:

1. We will use a ``Curve``
2. Our independent variable will be the 'year'
3. Our dependent variable will be 'unem'
4. We will ``groupby`` the 'country'.

In [None]:
curves = macro.to(hv.Curve, 'year', 'unem', groupby='country')
print(curves)
curves

If you look at the printed output you will see that instead of a simple ``Curve`` we got a ``HoloMap`` of ``Curve`` Elements for each country.

Alternatively we could also group by the year and view the unemployment rate by country as Bars instead. If we simply want to group by all remaining key dimensions (in this case just the year) we can leave out the groupby argument:

In [None]:
%%opts Bars [width=600 xrotation=45]
bars = macro.sort('country').to(hv.Bars, 'country', 'unem')
bars
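Conceptually, both calls above split the table by the grouping dimension and keep only the columns that the element displays. The sketch below imitates that in pandas on hypothetical toy data. The helper `split_for_element` is invented for illustration, is not part of HoloViews, and only handles a single grouping dimension:

```python
import pandas as pd

# Hypothetical toy data: two countries observed over three years
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1966, 1967, 1968, 1966, 1967, 1968],
    'unem': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
KDIMS = ['country', 'year']  # key dimensions declared for the toy dataset


def split_for_element(df, kdim, vdim, groupby=None):
    """Hypothetical helper: map each group key to a (kdim, vdim) table."""
    if groupby is None:
        # Default: group by the key dimension that is not displayed
        remaining = [dim for dim in KDIMS if dim != kdim]
        groupby = remaining[0]  # the toy data has exactly one left over
    return {key: group[[kdim, vdim]].reset_index(drop=True)
            for key, group in df.groupby(groupby)}


# Analogue of macro.to(hv.Curve, 'year', 'unem', groupby='country')
curves_toy = split_for_element(toy, 'year', 'unem', groupby='country')

# Analogue of macro.to(hv.Bars, 'country', 'unem') with groupby left out
bars_toy = split_for_element(toy, 'country', 'unem')

print(list(curves_toy))
print(list(bars_toy))
```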

In [None]:
# Exercise: Create a HeatMap using ``macro.to``, declaring kdims 'year' and 'country', and vdims 'growth'
# You'll need to declare ``width`` and ``xrotation`` plot options for HeatMap to make the plot readable
# You can also add ``tools=['hover']`` to get more info

## Displaying distributions

Often we want to summarize the distribution of values, e.g. to reveal the distribution of unemployment rates for each OECD country across time. This means we want to ignore the 'year' dimension in our dataset, letting it be summarized instead. To stop HoloViews from grouping by the extra variable, we pass an empty list to the groupby argument. In this case we could just as easily declare the ``BoxWhisker`` directly, but omitting a key dimension from the ``groupby`` can be useful when there are more dimensions:

In [None]:
%%opts BoxWhisker [width=800 xrotation=30] (box_fill_color=Palette('Category20'))
macro.to(hv.BoxWhisker, 'country', 'growth', groupby=[])
# Is equivalent to:
hv.BoxWhisker(macro, kdims=['country'], vdims=['growth'])
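A rough numerical counterpart to such a plot is a per-country summary that ignores the year entirely. The sketch below uses pandas on hypothetical toy data, and the choice of statistics (minimum, median, maximum) is an illustrative assumption rather than what the plot itself draws:

```python
import pandas as pd

# Hypothetical toy data: two countries observed over three years
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1966, 1967, 1968, 1966, 1967, 1968],
    'growth': [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
})

# Group by country only, so all years are pooled into one distribution
summary = toy.groupby('country')['growth'].agg(['min', 'median', 'max'])

print(summary)
```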

In [None]:
# Exercise: Display the distribution of GDP growth by year using the BoxWhisker element

## Faceting dimensions

Once the data has been grouped into a ``HoloMap``, as we did above, we can take the grouping further with the ``.grid``, ``.layout`` and ``.overlay`` methods, which lay the groups out on the page rather than flipping through them with a set of widgets.

#### NdOverlay

In [None]:
%%opts Scatter [width=800 height=400 size_index='growth'] (color=Palette('Category20') size=5)
%%opts NdOverlay [legend_position='left']
ndoverlay = macro.to(hv.Scatter, 'year', ['unem', 'growth']).overlay()
print(ndoverlay)
ndoverlay.relabel('OECD Unemployment 1960 - 1990')

#### GridSpace

In [None]:
%%opts GridSpace [shared_yaxis=True]
subset = macro.select(country=['Austria', 'Belgium', 'Netherlands', 'West Germany'])
grid = subset.to(hv.Bars, 'year', 'unem').grid()
print(grid)
grid

To understand what is actually going on here let's rewrite this example in a slightly different way. Instead of using the ``.to`` or ``.groupby`` method we can express the same thing by iterating over the countries we want to look at, selecting the subset of the data for each country using the ``.select`` method, and then passing the resulting plots to the container we want.

In the example above that means we ``select`` by 'country' on the macro ``Dataset``, pass the selection to ``Bars`` elements, and declare the key and value dimensions to display. We then pass the dictionary of ``Bars`` elements to the ``GridSpace`` container and declare the kdim of the container as 'Country':

In [None]:
countries = ['Austria', 'Belgium', 'Netherlands', 'West Germany']
hv.GridSpace({country: hv.Bars(macro.select(country=country), 'year', 'unem') for country in countries},
             kdims=['Country'])

#### NdLayout

In [None]:
%%opts Curve [width=200 height=200]
ndlayout = subset.to(hv.Curve, 'year', 'unem').layout()
print(ndlayout)
ndlayout

In [None]:
## Exercise: Recreate the plot above using hv.NdLayout and using macro.select just as we did for the GridSpace above

## Aggregating

Another common operation is computing aggregates, which we can compute and visualize easily using the ``aggregate`` method. The aggregate method lets you declare the dimension(s) to aggregate by and a function to aggregate with (optionally a secondary function can be supplied to compute the spread). Once we have computed the aggregate we can simply pass it to the [``Curve``](http://holoviews.org/reference/elements/bokeh/Curve.html) and [``ErrorBars``](http://holoviews.org/reference/elements/bokeh/ErrorBars.html) elements:

In [None]:
%%opts Curve [width=600]
agg = macro.reindex(vdims=['growth']).aggregate('year', function=np.mean, spreadfn=np.std)
hv.Curve(agg) * hv.ErrorBars(agg)
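To make the arithmetic behind the aggregate concrete, here is a sketch that computes the per-year mean and spread by hand on hypothetical toy data. The output column name `growth_std` is an assumption chosen for readability. Note that `np.std` computes the population standard deviation (`ddof=0`) by default:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries observed over three years
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1966, 1967, 1968, 1966, 1967, 1968],
    'growth': [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
})

rows = []
for year, group in toy.groupby('year'):
    values = group['growth'].to_numpy()
    rows.append({
        'year': year,
        'growth': np.mean(values),      # the aggregate function
        'growth_std': np.std(values),   # the spread function (ddof=0)
    })
agg_toy = pd.DataFrame(rows)

print(agg_toy)
```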

In [None]:
# Exercise: Display aggregate GDP growth by country, building it up in a series of steps
# Step 1. First, aggregate the data by country rather than by year, using
# np.mean and ss.sem as the function and spreadfn, respectively, then 
# make a `Bars` element from the resulting ``agg``

In [None]:
%%opts Bars [width=600 xrotation=45] (fill_alpha=0.5)
agg = macro.reindex(vdims=['growth']).aggregate('country', function=np.mean, spreadfn=ss.sem)
hv.Bars(agg)
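The standard error used as the spread here is the sample standard deviation divided by the square root of the number of observations. The sketch below spells that out with NumPy on hypothetical toy data. The helper `standard_error` is made up for illustration and uses `ddof=1`, which is also the default of `scipy.stats.sem`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries observed over three years
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1966, 1967, 1968, 1966, 1967, 1968],
    'growth': [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
})


def standard_error(values):
    """Hypothetical helper: sample standard deviation over sqrt(n)."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=1) / np.sqrt(len(values))


# One row per country: mean growth and its standard error across years
sem_toy = pd.DataFrame([
    {'country': country,
     'growth': group['growth'].mean(),
     'growth_sem': standard_error(group['growth'])}
    for country, group in toy.groupby('country')
])

print(sem_toy)
```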

In [None]:
# Step 2: You should now have a bars plot, but with no error bars. Now add ErrorBars as above. 
# Hint: You'll want to make the plot wider and use an xrotation to see the labels clearly

## Onward

* Go through the Tabular Data [getting started](http://build.holoviews.org/getting_started/Tabular_Datasets.html) and [user guide](http://build.holoviews.org/user_guide/Tabular_Datasets.html).
* Learn about slicing, indexing and sampling in the [Indexing and Selecting Data](http://holoviews.org/user_guide/Indexing_and_Selecting_Data.html) user guide.

The next section shows a similar approach, but for working with gridded data, in multidimensional array formats.