# Timeseries

In many domains it is common to plot scalar values as a function of time (or of some other single dimension).  As long as the total number of datapoints is relatively low (in the tens of thousands, perhaps) and there are only a few separate curves involved, most plotting packages will do well.  However, for longer series or more frequent sampling, you'll be required to subsample your data before plotting, potentially missing important peaks or troughs in the data.  And even just a few timeseries visualized together quickly run into [overplotting](https://anaconda.org/jbednar/plotting_problems/notebook) issues, where only the most recently plotted curve is visible, which can be highly misleading.
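
To illustrate the subsampling risk, here is a minimal sketch on made-up data (the series, the spike position, and the decimation factor `step` are all assumptions for illustration, not part of this notebook).  It compares naive decimation with a per-bin maximum that looks at every sample:

```python
import numpy as np

# Synthetic series: a flat signal with a single one-sample spike.
n = 100_000
y = np.zeros(n)
y[12_345] = 10.0                 # the peak we do not want to lose

# Naive decimation: keep every 100th sample and discard the rest.
step = 100
decimated = y[::step]

# Per-bin maximum: every sample contributes to the bin it falls in.
binned_max = y.reshape(-1, step).max(axis=1)

print(decimated.max())    # the spike is skipped because its index is not a multiple of step
print(binned_max.max())   # the spike survives because no sample was dropped
```

Both outputs have the same length, but only the second preserves the spike, which is the property that aggregating all the data is meant to provide.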

For applications with many datapoints or when visualizing multiple curves, datashader provides a principled way to view *all* of your data.  In this example, we will synthesize several time series curves so that we know their properties, and then show how datashader can reveal them.

In [None]:
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf
from collections import OrderedDict

## Create some fake timeseries data

Here we create a fake time series, then generate many noisy samples of that time series.  We will also add a couple of "rogue" lines, with different statistical properties, and see how well those are visible compared to the rest.

In [None]:
# Constants
np.random.seed(42)
n = 100000                           # Number of points
cols = list('abcdefgh')            # Column names of samples
start = 1456297053                   # Start time
end = start + 60 * 60 * 24           # End time   

# Generate a fake signal
time = np.linspace(start, end, n)
signal = np.random.normal(0, 0.3, size=n).cumsum() + 50

# Generate many noisy samples from the signal
# (np.random.normal takes a standard deviation, hence the name `std`)
noise = lambda std, bias, n: np.random.normal(bias, std, n)
data = {c: noise(1, 5*(np.random.random() - 0.5), n) + signal for c in cols}

# Add one "rogue line" that diverges from the rest
cols += ['y']
data['y'] = signal + np.random.normal(0, 0.015, size=n).cumsum()

# Add another "rogue line" that has no noise, unlike the rest
cols += ['z']
data['z'] = signal

# Create a dataframe
data['Time'] = time                  # Reuse the time axis generated above
df = pd.DataFrame(data)

# Default plot ranges:
x_range = (start, end)
y_range = (signal.min(), signal.max())

df.tail()

## Static Plots

To simulate what would happen in a standard plotting program, let's first use datashader to draw each curve into an aggregate grid, by connecting each datapoint in the series, and let's look at the first such plot.

In [None]:
%%time
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=300)
aggs = OrderedDict((c, cvs.line(df, 'Time', c)) for c in cols)
img = tf.interpolate(aggs['a'], high="red")

In [None]:
img

Here we're using all 100,000 datapoints for this curve, and could easily handle 1 million or 10 million (try changing `n = ` above and re-running the notebook).  Because no downsampling was required, we know we aren't missing any stray sharp peaks (e.g. noise values) that might have been skipped in other approaches.  Other than that, the results should be similar to those in other plotting programs.
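
To make the idea of an aggregate grid more concrete, the following is a rough sketch of drawing a curve into a grid of per-pixel counts.  The helper `rasterize_line` is hypothetical and written in plain NumPy for readability; it is not datashader's implementation, which is optimized and differs in detail:

```python
import numpy as np

def rasterize_line(x, y, x_range, y_range, width, height):
    """Hypothetical helper: count, per pixel, how many line segments touch it."""
    agg = np.zeros((height, width), dtype=np.int32)
    # Map data coordinates to fractional pixel coordinates.
    px = (np.asarray(x) - x_range[0]) / (x_range[1] - x_range[0]) * (width - 1)
    py = (np.asarray(y) - y_range[0]) / (y_range[1] - y_range[0]) * (height - 1)
    for x0, y0, x1, y1 in zip(px[:-1], py[:-1], px[1:], py[1:]):
        # Sample along the segment at roughly half-pixel spacing.
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) * 2 + 2
        t = np.linspace(0.0, 1.0, steps)
        cols = np.rint(x0 + t * (x1 - x0)).astype(int)
        rows = np.rint(y0 + t * (y1 - y0)).astype(int)
        inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
        # Each segment increments a given pixel at most once.
        pixels = np.unique(np.stack([rows[inside], cols[inside]], axis=1), axis=0)
        agg[pixels[:, 0], pixels[:, 1]] += 1
    return agg

# A three-point "curve" drawn into a 5x3 grid.
agg = rasterize_line([0, 1, 2], [0, 1, 0], (0, 2), (0, 1), width=5, height=3)
print(agg)
```

The result is an array of counts rather than colors, which is what allows aggregates from several curves to be combined arithmetically before any colormapping takes place.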

What happens if we then overlay multiple such curves?  In a traditional plotting program, there would be serious issues with overplotting, because these curves are highly overlapping.  To show what would typically happen, let's merge the images corresponding to each of the curves:

In [None]:
colors = ["yellow","green","orange","purple","red","pink","brown","grey","black","blue"]
imgs = [tf.interpolate(aggs[c],high=colors[i]) for i,c in enumerate(cols)]
tf.stack(*imgs)

Here the last-plotted curve, which happens to be the low-noise rogue curve, is clearly visible, but due to overplotting it will be entirely invisible if we plot precisely the same data in the opposite order:

In [None]:
tf.stack(*reversed(imgs))
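
The order dependence can be reproduced on a toy example.  This sketch assumes fully opaque pixels and uses a hypothetical `stack_opaque` function, a simplification of image compositing rather than the actual behavior of `tf.stack`:

```python
import numpy as np

# Two tiny "images": 0 means transparent, any other value is a color id.
red = np.array([[1, 1, 0],
                [0, 1, 0]])
blue = np.array([[2, 0, 0],
                 [2, 2, 2]])

def stack_opaque(*layers):
    """Hypothetical opaque compositing: later layers hide earlier ones."""
    out = np.zeros_like(layers[0])
    for layer in layers:
        out = np.where(layer != 0, layer, out)
    return out

red_then_blue = stack_opaque(red, blue)
blue_then_red = stack_opaque(blue, red)

print(red_then_blue)
print(blue_then_red)
```

Wherever the two layers overlap, the result shows only whichever was drawn last, so the same data yields two different pictures.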

If what we are interested in are (a) the overall trends, and (b) any curves that differ from those trends, we can avoid overplotting by combining the plots at the aggregate level, not just the image level as above.  At the aggregate level, curves have no associated color (though we could treat them as categories to add that information; see the census_race.ipynb example), and instead add together arithmetically.  First, let's try looking at data for all the timeseries merged into one aggregate:

In [None]:
merged = xr.concat(aggs.values(), dim=pd.Index(cols, name='cols'))
tf.interpolate(merged.any(dim='cols'))

The `any` operator merges all the data such that any pixel that is lit up for any curve is lit up in the final result.  Clearly, it is difficult to see any structure in this fully overplotted data.  Instead, let's merge the curves by summing them and displaying low-count pixels as light blue and high-count pixels as dark blue:

In [None]:
total = tf.interpolate(merged.sum(dim='cols'), low="lightblue", high="darkblue", how='linear')
total

Now the structure of this set of data should be clear: there are numerous curves that are highly correlated apart from noise, and there is one curve that starts out being similar but gradually diverges (which shows up as a light blue curve towards the bottom right).  This structure, which we know to be true from how we generated the data, only becomes clear when the data is combined in a principled way that avoids overplotting.
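
The difference between the two merge operations can be seen on a toy example.  This sketch uses a small hand-written NumPy array of counts (an assumption for illustration) in place of the xarray aggregates used in the notebook:

```python
import numpy as np

# Three one-row aggregates, stacked along a leading "curve" axis.
aggs = np.array([
    [[1, 1, 0, 0]],
    [[1, 1, 0, 0]],
    [[1, 0, 0, 1]],
])

# "any": a pixel is lit if at least one curve touches it.
lit = aggs.any(axis=0)

# "sum": a pixel records how many curves touch it.
counts = aggs.sum(axis=0)

# Linear mapping of counts onto a 0..1 intensity scale.
intensity = counts / counts.max()

print(lit)
print(counts)
print(intensity)
```

The `any` result cannot distinguish a pixel crossed by three curves from one crossed by a single curve, whereas the summed counts retain that information, so an outlying curve shows up as a low-count trail.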


## Highlighting specific curves

As you may recall, we added two "rogue" curves, one of which we were able to detect above because it had low overlap with the other ones (due to gradually diverging values).  The other curve, however, is buried deep within the pack, because it differs only by having lower noise.  One way to detect it is to highlight each of the curves in turn, and display it in relation to the datashaded total of all the curves.  For instance, each of the noisy curves stands out very little from the pack:

In [None]:
img = tf.interpolate(aggs['a'],high="red")
tf.stack(total,img)

The no-noise curve, by contrast, can be seen clearly against the pack:

In [None]:
img = tf.interpolate(aggs['z'],high="red")
tf.stack(total,img)
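
Highlighting curves one at a time does not scale to many curves, so it may help to rank them numerically first.  The following is only a sketch of one possible pair of scores, not a method used by this notebook or by datashader.  The data is regenerated on a smaller scale, and the drifting curve uses a deterministic ramp instead of a random walk (both are assumptions made to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
signal = rng.normal(0, 0.3, n).cumsum() + 50

# Illustrative data only: four noisy copies, one drifting copy, one clean copy.
curves = {c: signal + rng.normal(0, 1, n) for c in "abcd"}
curves["y"] = signal + np.linspace(0, 3, n)   # deterministic drift (an assumption)
curves["z"] = signal.copy()                   # no added noise

names = list(curves)
stacked = np.vstack([curves[c] for c in names])
center = np.median(stacked, axis=0)           # robust estimate of the pack

# Score 1: roughness, the spread of sample-to-sample changes.
roughness = dict(zip(names, np.diff(stacked, axis=1).std(axis=1)))

# Score 2: mean offset from the pack over the final 10% of samples.
tail = slice(-n // 10, None)
offset = dict(zip(names, np.abs((stacked - center)[:, tail].mean(axis=1))))

print(sorted(roughness, key=roughness.get)[:2])   # the two smoothest curves
print(max(offset, key=offset.get))                # the curve furthest from the pack
```

A curve flagged by such scores could then be passed to the highlighting code above for visual confirmation.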

## Dynamic Plots

In practice, it might be difficult to cycle through each of the curves to find the one that was different, as done above.  Perhaps a criterion based on similarity could be devised, choosing the curve most dissimilar from the rest to plot in this way, which would be an interesting topic for future research.  In any case, one thing that can always be done is to make the plot fully interactive, so that the viewer can zoom in and discover such patterns dynamically.
If you are looking at a live, running version of this notebook, just enable the wheel zoom or box zoom tools, and then zoom and pan as you wish:

In [None]:
from datashader.callbacks import InteractiveImage
import bokeh.plotting as bp
bp.output_notebook()

def base_plot(tools='pan,wheel_zoom,box_zoom,reset'):
    p = bp.figure(tools=tools, plot_width=600, plot_height=300,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0)   
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

In [None]:
def create_image(x_range, y_range, w, h):
    cvs = ds.Canvas(x_range=x_range, y_range=y_range,
                    plot_height=int(h), plot_width=int(w))
    aggs = OrderedDict((c,cvs.line(df, 'Time', c)) for c in cols)
    merged = xr.concat(aggs.values(), dim=pd.Index(cols, name='cols'))
    img = tf.interpolate(merged.sum(dim='cols'), how='log')
    return img
    
p = base_plot()
InteractiveImage(p, create_image)

Here the diverging "rogue line" is immediately apparent, and if you zoom in you can see precisely how it differs from the rest at a given time.  The low-noise "rogue line" is much harder to see, but if you zoom in enough (particularly if you stretch out the x axis by zooming on the axis itself), you can see that one line goes through the middle of the pack, with different properties from the rest.  The datashader team is working on support for hover-tool information to reveal which line that was, and in general on better support for exploring large timeseries (and other curve) data.
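
The interactivity above relies on re-aggregating the data whenever the viewport changes, which is what `create_image` does each time it is called with new ranges.  As a rough sketch of that idea, the example below uses a hypothetical point-counting `aggregate` function built on `np.histogram2d`, rather than datashader's line rasterizer, and made-up sine data:

```python
import numpy as np

def aggregate(x, y, x_range, y_range, w, h):
    """Hypothetical stand-in for a canvas: count the points landing in each pixel."""
    counts, _, _ = np.histogram2d(y, x, bins=(h, w), range=(y_range, x_range))
    return counts

x = np.arange(10_001) / 100.0      # 0.00, 0.01, ..., 100.00
y = np.sin(x)

full = aggregate(x, y, (0, 100), (-1, 1), w=50, h=20)   # whole series
zoom = aggregate(x, y, (10, 20), (-1, 1), w=50, h=20)   # zoomed viewport

# Same pixel grid, but each zoomed pixel now covers a tenth of the x extent.
print(full.shape, zoom.shape)
print(int(full.sum()), int(zoom.sum()))
```

Because the zoomed aggregate is computed from the original data rather than from the previous image, detail that was merged into a single pixel at the full scale is spread over several pixels after zooming.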