In [1]:
from curves import plot_lines, functional_boxplot, functional_boxplot_from_df, split_datasets
from bokeh.io import output_notebook
from bokeh.plotting import show

In [2]:
output_notebook()

# Data depth

Given an ensemble of data drawn from a distribution F, _data depth_ quantifies how central (or deep) a particular sample is within the cloud of sampled data. Deeper samples are considered more representative of the ensemble and are assigned high depth values, whereas samples farther from the rest of the ensemble are considered outliers and are assigned correspondingly lower depth values. The notion of data depth therefore provides a center-outward ordering (a form of order statistics) for an ensemble of sampled data. [Mirzargar, Whitaker (2014)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6875964)

# Band Depth

The work of [Lopez-Pintado, Romo (2009)](https://www.tandfonline.com/doi/pdf/10.1198/jasa.2009.0108) introduces the notion of _functional band depth_, a generalization of data depth to higher dimensions that is designed for ensembles of functions. Unlike other higher-dimensional generalizations, functional band depth goes beyond point-wise analysis of functional data: it measures the centrality of a function within an ensemble in a way that is sensitive to both the shape and the position of the function relative to the rest of the ensemble members.

The _band_ in band depth is the region between two (or more) curves in a line chart. It is shown as the grey region in the image below, from [Sun, Genton, Nychka (2012)](https://onlinelibrary.wiley.com/doi/pdf/10.1002/sta4.8):

![error loading image](images/band_SUN2012.jpeg "Example of a band delimited by two curves")

Thus, the _band depth_ of a curve can be defined as the proportion of bands delimited by _j_ different curves that contain the whole graph of the curve in question. One problem with this definition is that, in datasets with a large number of curves or with curves that cross one another, few bands completely contain any curve. The result is a poorly defined ranking in which many curves share the same band depth value.
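As a minimal sketch of this definition, the snippet below computes band depth with bands delimited by pairs of curves (_j_ = 2) using NumPy. The `band_depth` function and the synthetic data are illustrative assumptions; they are not necessarily how the `curves` module used in this notebook computes its values.

```python
from itertools import combinations

import numpy as np


def band_depth(curves):
    """Band depth of each curve, using bands delimited by two curves (j = 2).

    `curves` has shape (n_curves, n_points). Hypothetical helper for illustration.
    """
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))
    counts = np.zeros(n)
    for a, b in pairs:
        # The band is the region between the two curves at every point.
        lower = np.minimum(curves[a], curves[b])
        upper = np.maximum(curves[a], curves[b])
        # A curve counts only if it stays inside the band at every point.
        counts += np.all((curves >= lower) & (curves <= upper), axis=1)
    return counts / len(pairs)


# Synthetic data: five parallel curves with 10 points each.
curves = np.array([np.full(10, level) for level in range(5)], dtype=float)
depths = band_depth(curves)
print(depths)  # [0.4 0.7 0.8 0.7 0.4]
```

With this convention a curve always lies inside the bands it delimits itself, so even the outermost curves get a non-zero depth, while the central curve gets the highest value.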

## Example 1

Starting with a simple example, we have a small set of 5 curves, each with 10 points. The line chart below shows the distribution of the curves. It's easy to see that all the curves have the same behaviour.

In [17]:
data, plot_lines_test = plot_lines('data/ex1.csv','outputs/ex1_out.txt','od','1/1/2018','1/5/2018', 10)
show(plot_lines_test)

Now we colour the lines according to their band depth values: lighter colours represent curves with lower band depth, and stronger colours represent curves with higher band depth. We can see that the more central a curve is, the darker its shade of red, denoting a higher band depth value, which is exactly what we expected.

In [18]:
data, plot_lines_depth = plot_lines('data/ex1.csv','outputs/ex1_out.txt','od','1/1/2018','1/5/2018',10,
                                    depth_color=True)
show(plot_lines_depth)

## Example 2

Now we'll take the same example as before with one small change. The curves are the same, except that the central curve (the one that had the highest band depth value) now has a spike at one of its points. The spike could be an outlier, a failure in the data acquisition process, or something similar. Below is the line chart that represents the data.

In [20]:
data, plot_lines_test = plot_lines('data/ex2.csv','outputs/ex2_out.txt','od','1/1/2018','1/5/2018', 10)
show(plot_lines_test)

Using the same strategy of colouring the lines according to their band depth values, we can see that the change in the behaviour of the central curve has altered the whole picture. The central curve, which previously had the highest band depth value, still lies inside the bands defined by the other curves 90% of the time. Nevertheless, it now has the same colour as the outer curves, which are contained only in the bands they themselves define, denoting a low band depth value.

In [15]:
data, plot_lines_depth = plot_lines('data/ex2.csv','outputs/ex2_out.txt','od','1/1/2018','1/5/2018',10,
                                    depth_color=True)
show(plot_lines_depth)

## Example 3

In the next example we have a set of 10 curves that repeat the same pattern: they are sinusoids, offset from one another. One curve, though, behaves differently from the others, as we can see in the chart below.

In [63]:
data, plot_lines_test = plot_lines('data/ex3.csv','outputs/ex3_out.txt','od','1/1/2018','1/11/2018', 10)
show(plot_lines_test)

Colouring the lines according to their depth values, we can see that the curves with the same pattern behave as expected: values are higher for the more central curves and lower near the "boundaries". Looking closely, it's possible to see that the curve with the different pattern appears in the middle in a very light colour, denoting its low band depth value. This shows that band depth can also find outliers that are not simply very high or very low values: curves that are "central" but behave very differently from the others do not get a high band depth value either, that is, they are not representative of the whole set.

In [64]:
data, plot_lines_test = plot_lines('data/ex3.csv','outputs/ex3_out.txt','od','1/1/2018','1/11/2018', 10,
                                   depth_color=True)
show(plot_lines_test)

# Modified Band Depth

In the same work, Lopez-Pintado and Romo proposed another notion of band depth called _modified band depth_. Instead of the proportion of bands that completely contain a curve, it measures the proportion of time that the curve is contained in the bands defined by _j_ curves of the set. This definition is more flexible for larger datasets and more robust to outliers or errors in the data acquisition process. We'll use the same examples as before to observe how this notion of band depth behaves on them.
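To make the difference concrete, here is a small sketch that computes both notions with bands delimited by pairs of curves (_j_ = 2). The `depth` function, its `modified` flag, and the synthetic curves (five parallel curves whose central member has a single spike) are assumptions made for illustration and do not come from the `curves` module.

```python
from itertools import combinations

import numpy as np


def depth(curves, modified=False):
    """Band depth, or modified band depth, using bands of two curves (j = 2)."""
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))
    total = np.zeros(n)
    for a, b in pairs:
        lower = np.minimum(curves[a], curves[b])
        upper = np.maximum(curves[a], curves[b])
        inside = (curves >= lower) & (curves <= upper)  # shape (n, p)
        if modified:
            # Proportion of points at which each curve is inside the band.
            total += inside.mean(axis=1)
        else:
            # 1 only if the curve is inside the band at every point.
            total += inside.all(axis=1)
    return total / len(pairs)


# Synthetic data: five parallel curves, 10 points each; the central one spikes once.
curves = np.array([np.full(10, level) for level in range(5)], dtype=float)
curves[2, 5] = 10.0

bd = depth(curves)                   # [0.4  0.7  0.4  0.6  0.4 ]
mbd = depth(curves, modified=True)   # [0.4  0.7  0.76 0.71 0.43]
print(bd)
print(mbd)
```

On this toy data the single spike drops the central curve to the same band depth as the outermost curves, while its modified band depth remains the highest of the set.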

## Example 1

In [31]:
data, plot_lines_depth = plot_lines('data/ex1.csv','outputs/ex1_out.txt','omd','1/1/2018','1/5/2018',10,
                                    depth_color=True)
show(plot_lines_depth)

Using our simplest example, in which the curves are regular and constant, we can observe that the band depth values have not changed compared with the original definition. This happens because, in this case, every curve is either completely contained in a band (inside it 100% of the time) or not contained at all. Modified band depth therefore does not change the values of curves that are completely contained in the bands.

## Example 2

In [16]:
data, plot_lines_depth = plot_lines('data/ex2.csv','outputs/ex2_out.txt','omd','1/1/2018','1/5/2018',10,
                                    depth_color=True)
show(plot_lines_depth)

The situation above is exactly the kind of case that modified band depth was designed for. With the original definition of band depth, the central curve in this line chart had a very low band depth value despite leaving the bands defined by the other curves at only one point. Modified band depth measures the proportion of time that the curve is inside the bands, rather than whether the curve is completely contained in them, so the central curve now has a very high value, represented by the dark red colour.

## Example 3

In [65]:
data, plot_lines_test = plot_lines('data/ex3.csv','outputs/ex3_out.txt','omd','1/1/2018','1/11/2018', 10,
                                   depth_color=True)
show(plot_lines_test)

In this case, modified band depth assigns a high value to the curve with the different pattern. Because its pattern differs from the others, few bands contain it completely, but the curve is still inside the bands for a high proportion of the time, which results in a high value of modified band depth.

# An Application of Band Depth

Now we'll show a possible application of the notion of band depth in a visualization technique called _Functional Boxplot_, proposed by [Sun, Genton (2011)](https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.09224?) and shown in the image below.

![error loading image](images/fbplot_SUN2012.png "Functional Boxplot example")

This technique is a generalization of the classic box plot to a set of curves. In the image above:

- the pink region represents the envelope of the 50% central region (the inter-quartile range), similar to the box in the classic box plot;
- the black curve denotes the median curve, that is, the curve with the highest band depth value (the most representative curve of the set), which corresponds to the dash inside the box in the box plot;
- the two blue curves bound the maximum non-outlying envelope (similar to the whiskers of the box plot), obtained with the empirical rule of 1.5 times the 50% central region.

The maximum non-outlying envelope represents the boundaries of the "common" behaviour for a curve in the set: every curve that goes outside this region, even for a moment, can be considered an outlier, as shown by the dashed red lines in the image above. The technique uses the ranking provided by band depth to produce this representation.
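The construction described above can be sketched in a few lines of NumPy. The helper names (`modified_band_depth`, `fbplot_summary`), the `factor` parameter, and the synthetic data are hypothetical and only illustrate the idea; they are not the `functional_boxplot` function from the `curves` module used in this notebook.

```python
from itertools import combinations

import numpy as np


def modified_band_depth(curves):
    """Naive modified band depth with bands of two curves (j = 2)."""
    n = len(curves)
    total = np.zeros(n)
    for a, b in combinations(range(n), 2):
        lower = np.minimum(curves[a], curves[b])
        upper = np.maximum(curves[a], curves[b])
        total += ((curves >= lower) & (curves <= upper)).mean(axis=1)
    return total / (n * (n - 1) / 2)


def fbplot_summary(curves, depth, factor=1.5):
    """Median curve, 50% central envelope, outlier flags and non-outlying envelope."""
    curves = np.asarray(curves, dtype=float)
    order = np.argsort(depth)[::-1]                 # deepest curve first
    median = curves[order[0]]
    central = curves[order[: int(np.ceil(len(curves) / 2))]]
    box_lo, box_hi = central.min(axis=0), central.max(axis=0)
    # Inflate the central region by `factor` times its point-wise range.
    spread = box_hi - box_lo
    fence_lo, fence_hi = box_lo - factor * spread, box_hi + factor * spread
    # A curve is flagged if it leaves the fences at any point.
    outlier = np.any((curves < fence_lo) | (curves > fence_hi), axis=1)
    kept = curves[~outlier]
    return {
        "median": median,
        "box": (box_lo, box_hi),
        "outlier": outlier,
        "envelope": (kept.min(axis=0), kept.max(axis=0)),
    }


# Synthetic data: eight parallel curves plus one far from the rest.
levels = [0, 1, 2, 3, 4, 5, 6, 7, 30]
curves = np.array([np.full(10, level) for level in levels], dtype=float)
summary = fbplot_summary(curves, modified_band_depth(curves))
print(summary["outlier"])
```

In this toy case the curve at level 30 is the only one flagged, the median is the curve at level 4, and the central region spans levels 2 to 6.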

Now we'll show an example of this technique used in a dataset of [NYC Taxi Trips](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), in which we counted the number of yellow taxi trips that happened in each hour of the day for each day of 2018, resulting in the following chart.

In [11]:
data, plot_lines_depth = plot_lines('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'tmd','1/1/2018','12/31/2018',24)
show(plot_lines_depth)

Looking at the graph above, it's not easy to draw any conclusions about the data beyond a glimpse of its distribution through time. The _overplotting_ in this chart makes it difficult to interpret. We can use _modified band depth_ (which, as we have seen, is robust enough for an irregular set of curves) to see how this notion of data depth behaves on this dataset.

In [12]:
data, plot_lines_depth = plot_lines('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'tmd','1/1/2018','12/31/2018',
                                    24, depth_color=True)
show(plot_lines_depth)

As expected, we can see that the inner curves have higher values of modified band depth, shown by the dark red colour, while the outer curves have lower values, shown by the light orange colour. We can use the modified band depth values to build a _Functional Boxplot_ of the 2018 taxi data.

In [13]:
fbplot_taxisv1 = functional_boxplot('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'01/01/2018','12/31/2018',
                                    24,'omd')
show(fbplot_taxisv1)

In the chart above, the grey region represents the inter-quartile range, the dashed grey lines represent the maximum non-outlying envelope, and the black curve represents the median curve, August 9th (we can hover over a line to see which day it represents). The light orange curves are the curves that leave the maximum non-outlying envelope at some point, denoting possible outliers. The five curves that appear in the legend of the chart are the five with the lowest band depth values, that is, the five most likely to be outliers. This makes sense, since most of them represent holidays or days leading up to holidays, on which the number of taxi trips is expected to differ from a "normal" day. This can be clearly seen in the red curve, representing January 1st, which has a consistently high number of taxi rides between midnight and 5am compared with other days, possibly reflecting people returning home from New Year's parties.

# Limitations

The naïve implementation of both band depth and modified band depth requires checking, for each curve and for each possible pair of curves (when a band is the region between two curves), whether the curve is completely inside the band or what proportion of time it spends inside it. This results in a computational cost on the order of _O(n³p)_, where _n_ is the number of curves in the set and _p_ is the number of points per curve. This performance is not good enough for large datasets, which are becoming more common as the amount of available data grows. Because of this limitation, researchers have developed faster ways to compute band depth and modified band depth, as in the work of [Sun, Genton, Nychka (2012)](https://onlinelibrary.wiley.com/doi/pdf/10.1002/sta4.8).
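One way to avoid enumerating every pair, in the spirit of the rank-based approach in that work, is to note that for bands of two curves the number of bands containing a curve at a given point depends only on its rank at that point: with rank _r_ among _n_ values, there are _(r - 1)(n - r)_ bands formed by one curve below and one above, plus _n - 1_ bands that include the curve itself. The sketch below assumes there are no ties at any point; the function name is hypothetical and the code is an illustration, not the authors' implementation.

```python
import numpy as np


def modified_band_depth_by_rank(curves):
    """Modified band depth (j = 2) from point-wise ranks, assuming no ties.

    `curves` has shape (n_curves, n_points). Sorting dominates the cost,
    so this runs in O(n log n) per point instead of O(n^3).
    """
    curves = np.asarray(curves, dtype=float)
    n, p = curves.shape
    # Rank of each curve at each point (1 = smallest value at that point).
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1
    below = ranks - 1          # curves strictly below at that point
    above = n - ranks          # curves strictly above at that point
    # Bands containing the curve: one curve below and one above,
    # plus the n - 1 bands delimited by the curve itself.
    containing = below * above + (n - 1)
    return containing.sum(axis=1) / (p * n * (n - 1) / 2)


# Synthetic data: five parallel curves with 10 points each.
curves = np.array([np.full(10, level) for level in range(5)], dtype=float)
fast_mbd = modified_band_depth_by_rank(curves)
print(fast_mbd)  # [0.4 0.7 0.8 0.7 0.4]
```

When several curves share exactly the same value at a point, the rank-based count can differ from the pairwise definition, so tied data would need extra handling.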