In [1]:
from curves import plot_lines, functional_boxplot, functional_boxplot_from_df, split_datasets
from bokeh.io import output_notebook
from bokeh.plotting import show

In [2]:
output_notebook()

# Data depth

Given an ensemble of data drawn from a distribution F, _data depth_ quantifies how central (or deep) a particular sample is within the cloud of sampled data. Deeper samples are considered more representative of the ensemble and are assigned high depth values, whereas samples farther away from the rest of the ensemble are considered outliers and are assigned correspondingly lower depth values. The notion of data depth therefore provides a center-outward ordering (also known as order statistics) for an ensemble of sampled data. [Mirzargar, Whitaker (2014)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6875964)
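To make the center-outward ordering concrete, here is a minimal sketch for univariate data. It uses the one-dimensional halfspace (Tukey) depth, which is only one of several possible depth definitions and is not necessarily the one used in the cited work; the function name and the sample values are hypothetical.

```python
def halfspace_depth_1d(sample, x):
    """One-dimensional halfspace (Tukey) depth of x within a sample."""
    n = len(sample)
    at_or_below = sum(1 for v in sample if v <= x)
    at_or_above = sum(1 for v in sample if v >= x)
    # A point is deep when many samples lie on both sides of it.
    return min(at_or_below, at_or_above) / n


# Hypothetical sample: five clustered values and one far-away value.
sample = [2.0, 3.5, 4.0, 4.2, 5.1, 9.0]
depths = {v: halfspace_depth_1d(sample, v) for v in sample}

# Center-outward ordering: deepest (most central) values first.
ordering = sorted(sample, key=lambda v: depths[v], reverse=True)
print(ordering)
```

The values in the middle of the sample come first in the ordering, while the extremes (including the far-away value 9.0) receive the lowest depth.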

# Band Depth

The work of [Lopez-Pintado, Romo (2009)](https://www.tandfonline.com/doi/pdf/10.1198/jasa.2009.0108) introduces the notion of _functional band depth_, a generalization of data depth to higher dimensions that is designed for ensembles of functions. Unlike other higher-dimensional generalizations, functional band depth goes beyond the point-wise analysis of functional data: it provides a measure of the centrality of a function within an ensemble that is sensitive to both the shape and the position of the function relative to the other ensemble members.

The _band_ in band depth is the region between two (or more) curves in a line chart. It is shown as the grey region in the image below, from [Sun, Genton, Nychka (2012)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.8?):

![error loading image](images/band_SUN2012.jpeg "Example of a band delimited by two curves")

Thus, the _band depth_ of a curve can be defined as the proportion of bands delimited by _j_ different curves that contain the whole graph of the curve in question. One problem with this definition is that, in datasets with a large number of curves or with curves that cross each other, few bands completely contain any given curve, resulting in a poorly defined ranking in which many curves share the same band depth value. The example below shows 5 curves with 5 points each, in which the central curve has a spike at the middle point of its graph.

In [3]:
data, plot_lines_test = plot_lines('data/infile.csv',"outputs/outfile_out.txt",'od','1/1/2018','1/5/2018', 5)
show(plot_lines_test)

In the example below we colour the same curves according to their band depth values: lighter colours represent curves with lower band depth, while stronger colours represent curves with higher band depth. We can see that under the original definition of band depth (the parameter 'od' selects the original method for calculating band depth), the central curve has the same colour as the outer curves, denoting a low band depth value. This happens even though the central curve lies inside the bands defined by the other curves 80% of the time, whereas the outer curves are only contained in the bands that they themselves delimit.

In [4]:
data, plot_lines_depth = plot_lines('data/infile.csv',"outputs/outfile_out.txt",'od','1/1/2018','1/5/2018',
                                    5, depth_color=True)
show(plot_lines_depth)
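The behaviour described above can be reproduced with a minimal sketch of the original band depth for _j_ = 2. This is not the implementation inside `curves.plot_lines`, whose internals are not shown here; the function `band_depth` and the toy data (five flat curves, the central one with a spike) are hypothetical stand-ins for the example dataset.

```python
from itertools import combinations


def band_depth(curves, j=2):
    """Original band depth: share of j-curve bands that contain the whole curve."""
    n_points = len(curves[0])
    subsets = list(combinations(range(len(curves)), j))
    depths = []
    for curve in curves:
        contained = 0
        for subset in subsets:
            # The band is the region between the point-wise min and max
            # of the curves in the subset.
            lower = [min(curves[i][t] for i in subset) for t in range(n_points)]
            upper = [max(curves[i][t] for i in subset) for t in range(n_points)]
            # The curve only counts if it stays inside the band at every point.
            if all(lower[t] <= curve[t] <= upper[t] for t in range(n_points)):
                contained += 1
        depths.append(contained / len(subsets))
    return depths


# Hypothetical toy ensemble: 5 curves with 5 points each.
toy_curves = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 10, 3, 3],  # central curve with a spike at the middle point
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
bd = band_depth(toy_curves)
print(bd)
```

In this toy ensemble the spiked central curve is only contained in the bands that it delimits itself, so it receives the same depth as the two outermost curves.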

# Modified Band Depth

In the same work, Lopez-Pintado and Romo proposed another notion of band depth called _modified band depth_. Instead of the proportion of bands that completely contain a curve, it measures the proportion of time that the curve spends inside the bands defined by _j_ curves of the set. This definition is more flexible for larger datasets and more robust to outliers or errors in the data acquisition process.

Using the same example as above, we now colour the curves according to their _modified band depth_ values. The central curve, which previously had the same colour as the outer curves (indicating a low band depth value), now has a stronger orange colour, representing a higher modified band depth value.

In [5]:
data, plot_lines_depth = plot_lines('data/infile.csv',"outputs/outfile_out.txt",'omd','1/1/2018','1/5/2018',
                                    5, depth_color=True)
show(plot_lines_depth)
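The change in the central curve's depth can be checked with a minimal sketch of modified band depth for _j_ = 2. As before, this is an illustration under assumptions rather than the code behind `plot_lines`: the function `modified_band_depth` and the toy curves are hypothetical.

```python
from itertools import combinations


def modified_band_depth(curves, j=2):
    """Modified band depth: average share of time each curve spends inside the bands."""
    n_points = len(curves[0])
    subsets = list(combinations(range(len(curves)), j))
    depths = []
    for curve in curves:
        inside = 0
        for subset in subsets:
            for t in range(n_points):
                values = [curves[i][t] for i in subset]
                # Count each time point at which the curve is inside the band.
                if min(values) <= curve[t] <= max(values):
                    inside += 1
        depths.append(inside / (n_points * len(subsets)))
    return depths


# Hypothetical toy ensemble: 5 curves with 5 points each.
toy_curves = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 10, 3, 3],  # central curve with a spike at the middle point
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
mbd = modified_band_depth(toy_curves)
print(mbd)
```

Because the spike only takes the central curve out of the surrounding bands at one of its five points, its modified band depth is now among the highest in the ensemble, while the outermost curves remain shallow.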

# An Application of Band Depth

Now we'll show a possible application of the notion of band depth in a visualization technique called _Functional Boxplot_, proposed by [Sun, Genton (2012)](https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.09224?) and shown in the image below.

![error loading image](images/fbplot_SUN2012.png "Functional Boxplot example")

This technique is a generalization of the classic box plot for a set of curves. In the image above, the pink region represents the envelope of the 50% central region (the inter-quartile range), analogous to the box in the classic box plot. The black curve denotes the median curve, that is, the curve with the highest band depth value (the most representative curve of the set), analogous to the dash inside the box. The two blue curves bound the maximum non-outlying envelope (analogous to the whiskers), which is obtained with the empirical rule of inflating the 50% central region by 1.5 times its range and represents the boundaries of the "common" behaviour for a curve in the set. Any curve that leaves this region at any point can be considered an outlier, as shown by the dashed red lines in the image above. The technique uses the ranking provided by band depth to produce this representation.
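The construction described above can be sketched in a few lines. This is a simplified illustration, not the `functional_boxplot` function imported from `curves`. It assumes modified band depth with _j_ = 2 as the ranking, takes the deepest half of the curves (rounded up) as the 50% central region, and uses hypothetical names and toy data throughout.

```python
from itertools import combinations
from math import ceil


def modified_band_depth(curves, j=2):
    """Average share of time each curve spends inside the bands of j curves."""
    n_points = len(curves[0])
    subsets = list(combinations(range(len(curves)), j))
    depths = []
    for curve in curves:
        inside = 0
        for subset in subsets:
            for t in range(n_points):
                values = [curves[i][t] for i in subset]
                if min(values) <= curve[t] <= max(values):
                    inside += 1
        depths.append(inside / (n_points * len(subsets)))
    return depths


def functional_boxplot_stats(curves, factor=1.5):
    """Median curve, 50% central region, fences and outliers for a set of curves."""
    n_points = len(curves[0])
    depths = modified_band_depth(curves)
    # Center-outward ordering: deepest curve first (ties keep input order).
    order = sorted(range(len(curves)), key=lambda i: depths[i], reverse=True)
    central = order[:ceil(len(curves) / 2)]
    # Envelope of the 50% central region (the "box").
    lower = [min(curves[i][t] for i in central) for t in range(n_points)]
    upper = [max(curves[i][t] for i in central) for t in range(n_points)]
    # Fences: inflate the envelope by `factor` times its point-wise range.
    fence_low = [lo - factor * (hi - lo) for lo, hi in zip(lower, upper)]
    fence_high = [hi + factor * (hi - lo) for lo, hi in zip(lower, upper)]
    # A curve is flagged as soon as it leaves the fences at any point.
    outliers = [
        i for i, curve in enumerate(curves)
        if any(curve[t] < fence_low[t] or curve[t] > fence_high[t]
               for t in range(n_points))
    ]
    return {
        "median": order[0],
        "central_region": (lower, upper),
        "fences": (fence_low, fence_high),
        "outliers": outliers,
    }


# Hypothetical toy ensemble: the last curve spikes only at the final point.
toy_curves = [
    [1, 2, 1],
    [2, 3, 2],
    [3, 4, 3],
    [4, 5, 4],
    [5, 6, 5],
    [5.5, 6.5, 15],
]
stats = functional_boxplot_stats(toy_curves)
print(stats)
```

In this toy ensemble the last curve stays within the fences at its first two points but exceeds them at the third, which is enough for it to be flagged as an outlier.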

Now we'll show an example of this technique used in a dataset of [NYC Taxi Trips](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), in which we counted the number of yellow taxi trips that happened in each hour of the day for each day of 2018, resulting in the following chart.
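The counting step mentioned above can be sketched as follows. This is only an illustration of the idea and not the preprocessing actually used to build `taxis_v1.csv`: the function name, the timestamp format and the sample pickup times are all assumptions.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(pickup_times, fmt="%Y-%m-%d %H:%M:%S"):
    """Turn pickup timestamps into one 24-point curve of trip counts per day."""
    counts = Counter()
    for stamp in pickup_times:
        moment = datetime.strptime(stamp, fmt)
        counts[(moment.date(), moment.hour)] += 1
    days = sorted({day for day, _ in counts})
    # Hours without any trip get a count of 0, so every curve has 24 points.
    return {day: [counts[(day, hour)] for hour in range(24)] for day in days}


# Hypothetical pickup timestamps (the format is an assumption).
pickups = [
    "2018-01-01 00:15:00",
    "2018-01-01 00:45:10",
    "2018-01-01 13:05:00",
    "2018-01-02 23:59:59",
]
daily_curves = hourly_counts(pickups)
for day, curve in daily_curves.items():
    print(day, curve)
```

Each day becomes one curve with 24 points, which matches the value 24 passed to `plot_lines` in the cells that follow.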

In [6]:
data, plot_lines_depth = plot_lines('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'tmd','1/1/2018','12/31/2018',24)
show(plot_lines_depth)

Looking at the graph above, it is hard to draw any conclusions about the data beyond a glimpse of its distribution through time, because the _overplotting_ makes the chart difficult to interpret. Since _modified band depth_ is a robust enough definition of data depth for an irregular set of curves, we can use it to see how data depth behaves in this dataset.

In [7]:
data, plot_lines_depth = plot_lines('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'tmd','1/1/2018','12/31/2018',
                                    24, depth_color=True)
show(plot_lines_depth)

As expected, the inner curves have higher modified band depth values, shown by the dark red colour, while the outer curves have lower values, shown by the light orange colour. We can use the modified band depth values to build a _Functional Boxplot_ of the 2018 taxi data.

In [11]:
fbplot_taxisv1 = functional_boxplot('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'01/01/2018','12/31/2018',
                                    24,'omd')
show(fbplot_taxisv1)

In the chart above, the grey region represents the inter-quartile range, the dashed grey lines represent the maximum non-outlying envelope, and the black curve represents the median curve, August 9th (hover over a line to see which day it represents). The light orange curves are those that leave the maximum non-outlying envelope at some point, denoting possible outliers. The five curves that appear in the legend are the ones with the lowest band depth values, that is, the five most likely to be outliers. This makes sense, since most of them represent holidays or days leading up to holidays, when the number of taxi trips is expected to differ from a "normal" day. This is clearly visible in the red curve, representing January 1st, which has a consistently high number of taxi rides between 12am and 5am compared to other days, possibly reflecting people coming home from New Year's parties.